Generating XML Instances from Flat Files @ JDJ
Author: unknown | Date: 2005-08-10 19:01 | Source: Java channel | Editor: chinaitpower

Enterprise applications in domains such as banking and healthcare still use flat files to import and export data between applications. Flat files contain machine-readable data that is typically encoded in printable characters. There is a growing need for these applications to interact with XML-aware applications and Web services, and to satisfy this need they must convert flat file data to an XML format.

XML is well suited to data interchange because XML documents are tagged, easily parsed, and able to represent complex data structures. Converting a flat file to an XML format requires that the data embedded in the flat file first be described in some template form. Custom solutions exist that use XML templates and XML DTDs to capture the data structure of the flat files to be converted. This article instead discusses a new schema-based approach to parsing flat files into an XML instance.

Why XML Schema?
The XML Schema provides a means to define the structure, content, and semantics of the data contained in XML documents. Flat files contain data, and in order to convert the data to XML, the underlying data structure and data validation rules should be captured in an XML Schema representation. XML Schema is now a well-known standard and there are various XML parsers available to parse and validate documents against an XML Schema. Moreover, it provides the flexibility of describing the data more effectively using the rich W3C schema language, which can be used to validate the generated XML document.

Several kinds of commercial software are available that convert flat files to XML instances based on proprietary templates and conversion routines. These solutions are tailored to meet specific needs and do not scale to fit the requirements of generic flat-file-to-XML-instance generation.

This approach is based on open standards such as W3C schema, API, and the Xerces XML parser's schema implementation. It is suitable for any Java project or custom XML instance-generation project that uses open source technologies.

Process
XML Schema is the best way to represent data structure and validation rules in XML-aware applications. In order to parse a flat file to create an XML instance, information about the data and its hierarchy needs to be understood properly and then captured in the schema definition. Once the data structure is defined correctly in the schema, the parsing instructions for the flat file need to be added to the schema so that, after an XML instance is produced from the schema definition, the instance can be populated with live data from the flat file.

The following steps outline the conversion of a flat file to an XML instance by way of an XML Schema.

  • Data representation: Present the flat file in a schema form with required information on data structure and validation
  • Parsing logic implementation: Develop a concept of container (record) and contained objects (fields) in the flat file and place relevant control attributes for them in the schema definition
  • Default instance generation: Convert the schema into a default XML instance governed by schema rules
  • Populating the instance with data: Parse the flat file with the control attributes information provided in the schema and populate the generated instance with live data from flat file
To elaborate on the process in detail and explain the underlying Java implementation to convert flat files to an XML instance, a "," delimited flat file and a fixed-length field flat file are shown below.

A "," delimited flat file:
Social Security Number, Name, Salary
123456789,Ram Singh,100000.00
444556666,Barr Clark,87000.00
777227878,Simi? D Roy,123000.00
998877665,Charr Lee,92000.00

A fixed-length field flat file:
Social Security Number, Name, Salary
123456789Ram Singh 100000.00
444556666Barr Clark 87000.00
777227878Simi D Roy 123000.00
998877665Charr Lee 92000.00

In these flat files, the data and the data structure remain the same but the representation changes depending on the flat file type. Now let's go through the steps to convert these flat files to an XML instance. Figure 1 shows the approach in detail.

Data Representation
Converting data from a flat file to an XML instance depends on a proper understanding of the underlying flat file data structure, which can easily be captured as a schema definition. In all of the preceding examples the data structure remains the same; only the representation changes depending on the flat file type. The basic data structure in these cases may therefore be captured in the same schema definition, shown in Listing 1.

For all the examples, either of the two schema representations may be adopted. Both representations exhibit the flexibility to describe the data structure using the W3C schema language. In our example the first schema definition is considered. The examples here are simple, but the same approach applies to complex cases as well.

Parsing Logic Implementation
The raw schema definition in Listing 1 captures the data structure of the flat file. To parse the flat file and populate an XML instance, more parsing logic needs to be built into the schema definition. To implement this parsing logic, annotations and control attributes in the namespace xmlns:t2xml="http://xmlns.oracle.com/t2xml" are introduced. These control attributes and annotations are added to the element declarations in the schema definition to mark the XML root element, the physical container definitions (records), and the contained objects (fields) according to the flat file structure. Table 1 lists the main control attributes required to build the parsing logic in the schema.

These control attributes are pivotal to parsing the flat file and are defined in the t2xml.IParseProperties interface. This interface is implemented by the class xerces.xs.instance.T2xmlInstance, which contains all the parsing logic that populates the XML instance with flat file data; it is discussed in detail in the "Populating the Instance With Data" section. See Listing 2 for the control attributes.

Apart from these control attributes, the attributes minOccurs and maxOccurs play a crucial role in determining the repetition of containers (records): their values decide which containers are optional and which are required. For example, if minOccurs is "0", the container is optional; if it is greater than 0, the container is mandatory. If maxOccurs is "unbounded", the number of containers is determined by the number of records in the actual flat file. However, if a number is specified in the schema, that number of records is expected in the flat file.
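The occurrence rules above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not code from the article's source jar: the class OccurrenceRule, its method names, and the UNBOUNDED sentinel are hypothetical. The default of 1 for an absent minOccurs or maxOccurs follows the XML Schema specification.

```java
/** Hypothetical helper showing how minOccurs/maxOccurs could drive record handling. */
final class OccurrenceRule {
    /** Sentinel used in this sketch for maxOccurs="unbounded" (an assumption). */
    static final int UNBOUNDED = -1;

    final int minOccurs;
    final int maxOccurs;

    OccurrenceRule(String minOccurs, String maxOccurs) {
        // XML Schema defaults both attributes to 1 when they are absent.
        this.minOccurs = (minOccurs == null) ? 1 : Integer.parseInt(minOccurs);
        if (maxOccurs == null) {
            this.maxOccurs = 1;
        } else if ("unbounded".equals(maxOccurs)) {
            this.maxOccurs = UNBOUNDED;
        } else {
            this.maxOccurs = Integer.parseInt(maxOccurs);
        }
    }

    /** A container is optional only when minOccurs is 0. */
    boolean isOptional() {
        return minOccurs == 0;
    }

    /** Number of records to read, given how many records the flat file holds. */
    int recordsToRead(int recordsInFile) {
        // Unbounded: the flat file decides. Otherwise the schema's number is expected.
        return (maxOccurs == UNBOUNDED) ? recordsInFile : maxOccurs;
    }
}
```

For instance, new OccurrenceRule("0", "unbounded") describes an optional container whose record count is taken from the flat file.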

Now let's see how these control attributes are used in the schema definition to mark parsing logic instruction for the flat file.

Delimited Flat File Case
For the "," delimited case let's consider a record from the flat file.

777227878,Simi? D Roy,123000.00

This shows an employee record with three fields: Social Security Number (SSN), name, and salary. The name field can be subdivided into first name and last name, separated by a " " delimiter. The "?" is treated as an escape character in the name field. Figure 2 shows the basic mapping of a delimited record to a schema definition.

In Figure 2 the full record is mapped to the Employee element. Since the record is a delimited one, the following control attributes are added to the Employee element:

  • t2xml:container="true": Marks the element as a container (record)
  • t2xml:object_sep=",": Specifies the field delimiter
  • t2xml:container_type="delimited": Specifies that the record type is delimited
  • t2xml:container_endtoken="os:linesep": Specifies that the OS-specific line separator terminates the record
  • t2xml:escape_char="?": Specifies the escape character used in the record

Apart from these control attributes, maxOccurs="unbounded" appears in the Employee element declaration. It means that as many Employee elements are produced as there are records in the flat file.
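To make the role of these control attributes concrete, the following Java sketch reads them from an element declaration using the standard JAXP DOM API. It is an illustration only, not the T2xmlInstance implementation: the class ControlAttributeReader, its methods, the Employees wrapper element, and the inline schema fragment are all assumptions made for this example, and the fragment merely mirrors the attributes listed above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical reader for t2xml control attributes on xsd:element declarations. */
final class ControlAttributeReader {
    static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";
    static final String T2XML_NS = "http://xmlns.oracle.com/t2xml";

    // Assumed schema fragment; it is not the article's delimited-sample.xsd.
    static final String SCHEMA =
        "<xsd:schema xmlns:xsd='" + XSD_NS + "' xmlns:t2xml='" + T2XML_NS + "'>"
      + "<xsd:element name='Employees'><xsd:complexType><xsd:sequence>"
      + "<xsd:element name='Employee' maxOccurs='unbounded'"
      + " t2xml:container='true' t2xml:container_type='delimited'"
      + " t2xml:object_sep=',' t2xml:container_endtoken='os:linesep'"
      + " t2xml:escape_char='?'/>"
      + "</xsd:sequence></xsd:complexType></xsd:element>"
      + "</xsd:schema>";

    /** Parses the schema text into a namespace-aware DOM document. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the *NS lookups below
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse schema fragment", e);
        }
    }

    /** Returns the xsd:element declaration with the given name, or null if none exists. */
    static Element findElementDecl(Document schema, String name) {
        NodeList decls = schema.getElementsByTagNameNS(XSD_NS, "element");
        for (int i = 0; i < decls.getLength(); i++) {
            Element decl = (Element) decls.item(i);
            if (name.equals(decl.getAttribute("name"))) {
                return decl;
            }
        }
        return null;
    }

    /** Reads one control attribute; DOM returns an empty string when it is absent. */
    static String control(Element decl, String localName) {
        return decl.getAttributeNS(T2XML_NS, localName);
    }
}
```

With this sketch, control(findElementDecl(parse(SCHEMA), "Employee"), "object_sep") yields the field delimiter declared for the Employee record.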

For the ssn and salary fields the mapping is simple, as these are contained within the Employee container and do not have any additional contained objects inside. The following control attribute is added for them:

  • t2xml:object="true": Added to tell that it's a contained object

For the name field the mapping is a little more complex because it contains the subfields first name and last name. It is therefore both a container and a contained object. The following control attributes should be added for it:

  • t2xml:container="true": Marks it as a container for first name and last name
  • t2xml:object="true": Marks it as a contained object inside Employee
  • t2xml:object_sep="os:spacechar": Specifies the OS-specific space character as the delimiter
  • t2xml:container_type="delimited": Specifies the container type as delimited
  • t2xml:escape_char="?": Specifies the escape character

The complete schema definition is located in the file delimited-sample.xsd in the source jar (the source code for this article is at www.sys-con.com/xml/sourcec.cfm).
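The two-level splitting described in this section (record into fields, then name into first and last name) might look like the following Java sketch. It is not the article's parser: the class DelimitedSplitter and the unescape flag are hypothetical. The sketch assumes that the escape character makes the character after it literal, so that in "Simi? D Roy" the escaped space stays inside the first name; the article does not spell out these semantics.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical splitter for delimited containers with an escape character. */
final class DelimitedSplitter {
    /**
     * Splits record on sep. The character that follows escape is never treated
     * as a separator. When unescape is false the escape character is kept in
     * the output so that a nested container can still interpret it.
     */
    static List<String> split(String record, char sep, char escape, boolean unescape) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (c == escape && i + 1 < record.length()) {
                if (!unescape) {
                    current.append(c); // preserve the escape for the inner split
                }
                current.append(record.charAt(++i)); // escaped character is literal
            } else if (c == sep) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field has no trailing separator
        return fields;
    }
}
```

Under these assumptions, the record is first split on "," with escapes preserved, and the name field is then split on " " with escapes resolved, which gives "Simi D" as the first name and "Roy" as the last name.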

Fixed-length Flat File Case
For the fixed-length case, too, let's consider a record from the flat file.

777227878Simi? D Roy 123000.00

The fixed-length case is almost the same as the delimited case; the only difference is that the fields have fixed lengths and are not separated by delimiters. Therefore, the control attributes for the Employee element differ slightly from those of the delimited case:

  • t2xml:container="true": Marks the element as a container (record)
  • t2xml:container_type="fixed": Specifies that the record type is fixed
  • t2xml:container_endtoken="os:linesep": Specifies that the OS-specific line separator terminates the record

The attributes t2xml:object_sep and t2xml:escape_char are omitted here because they apply only to the delimited case.

For the contained objects there is an additional attribute that specifies the object length. The respective lengths of the ssn, name, and salary fields are set in the t2xml:object_len attribute. The required attributes for contained objects are therefore:

  • t2xml:object="true": Marks the element as a contained object
  • t2xml:object_len: Set to "9", "30", and "9" for the ssn, name, and salary fields, respectively

For the name field, the other control attributes remain the same as in the delimited case.

The complete schema definition for the fixed-length case can be found in the file fixedLength-sample.xsd in the source jar.
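Slicing a fixed-length record by the declared object lengths can be sketched in Java as below. The class FixedLengthSplitter is hypothetical and is not taken from the source jar. The sketch also assumes that fields are padded with spaces and that the padding should be trimmed, which the article does not state.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical splitter that cuts a fixed-length record into its fields. */
final class FixedLengthSplitter {
    /** Cuts record into consecutive fields of the given lengths, in order. */
    static List<String> split(String record, int... lengths) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        for (int len : lengths) {
            // Clamp to the record length so a short record yields empty fields
            // instead of throwing StringIndexOutOfBoundsException.
            int start = Math.min(pos, record.length());
            int end = Math.min(pos + len, record.length());
            fields.add(record.substring(start, end).trim()); // trimming is an assumption
            pos += len;
        }
        return fields;
    }
}
```

For the lengths used in this article, the call would be FixedLengthSplitter.split(record, 9, 30, 9); the name field returned by it can then be split into first and last name as in the delimited case.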

Default Instance Generation
Based on the schema defined for the flat file data structure, a default XML instance that follows the rules of the schema definition is generated. Generating this instance requires that the root element be identified correctly, because instance generation starts there. XML instance generation from the schema is implemented in the xerces.xs.instance.SchemaInstance class. This class can generate an XML instance from any schema definition, provided the schema contains at least one element declaration. If there is more than one element declaration, an element that is not referenced by any other element is treated as a candidate for the root element. A generic root element finder is implemented in the xerces.xs.instance.RootElementFinder class. Instance generation in the SchemaInstance class starts with the root element and produces XML elements as it traverses the complex types and elements defined in the root element definition. It converts each complex type or simple type element to a default XML instance node during schema traversal. If an element is optional, it produces one default element. If an element has maxOccurs="unbounded", it uses the value defined in Instance.MAX_UNBOUND_OCCURANCE. The most important methods of SchemaInstance are shown in Listing 2.
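The rule for picking a root element candidate (an element that no other element references) can be illustrated with a small Java sketch. This is not the RootElementFinder class from the source jar: RootCandidateFinder and its map-based input are hypothetical simplifications that stand in for a real schema model.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of the "not referenced by any other element" rule. */
final class RootCandidateFinder {
    /**
     * references maps each declared element name to the names of the
     * elements it references. Returns the declared elements that nothing
     * else references, in the iteration order of the map.
     */
    static List<String> rootCandidates(Map<String, List<String>> references) {
        Set<String> referenced = new HashSet<>();
        for (List<String> targets : references.values()) {
            referenced.addAll(targets);
        }
        List<String> candidates = new ArrayList<>();
        for (String name : references.keySet()) {
            if (!referenced.contains(name)) {
                candidates.add(name);
            }
        }
        return candidates;
    }
}
```

For a schema in which root_element references Employee, and Employee references ssn, name, and salary, only root_element remains as a candidate.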

The source jar contains the full class declaration for the SchemaInstance class. All the handler methods in the class are called recursively to generate the XML instance. The sample code below demonstrates how to use this class to generate an XML instance from a schema definition:

SchemaInstance lSchemaInstance = new SchemaInstance(aSchemaFilePathOrURL);
lSchemaInstance.generate(aOutputFilePath);        // write the instance to a file, or
lSchemaInstance.generateInstance(aOutputStream);  // write the instance to a stream

The XML instance generated from the schema example above looks like this:

<root_element>
  <Employee>
    <ssn/>
    <name>
      <fname/>
      <lname/>
    </name>
    <salary/>
  </Employee>
</root_element>

Populating the Instance With Data
The XML instance is populated with data after it is generated, but not for the whole schema definition at once. Instead, each schema element under the root element is converted to an XML element and then filled with data. As instance generation proceeds from the root element, the control attributes are examined for each schema element that is generated as XML.

These control attributes indicate whether an element is a container or a contained object, and they specify the container end token, the object separator, and so on. Depending on the control attributes, once the XML instance has been generated for a particular schema element, the physical record is read from the flat file and the instance is populated with its data. This process is repeated for each record in the flat file. Only after the full schema definition has been traversed (starting from the root element) is a populated instance representing the full schema definition complete. If the maxOccurs attribute is "unbounded" for a schema element, as many XML instances of that element are created as there are records available in the flat file; otherwise the number specified in the schema definition is used.
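The record-by-record population described above can be approximated with the standard DOM API, as in the Java sketch below. It is a simplified stand-in, not the T2xmlInstance logic: the class InstancePopulator is hypothetical, the element names are taken from the default instance shown earlier, and each record is assumed to have been split already into ssn, first name, last name, and salary.

```java
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical populator that emits one Employee element per flat file record. */
final class InstancePopulator {
    /** Each record is {ssn, fname, lname, salary} (an assumption of this sketch). */
    static Document populate(List<String[]> records) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
        Element root = doc.createElement("root_element");
        doc.appendChild(root);
        // maxOccurs="unbounded": the number of records decides the number of elements.
        for (String[] record : records) {
            Element employee = doc.createElement("Employee");
            root.appendChild(employee);
            append(doc, employee, "ssn", record[0]);
            Element name = doc.createElement("name"); // container and contained object
            employee.appendChild(name);
            append(doc, name, "fname", record[1]);
            append(doc, name, "lname", record[2]);
            append(doc, employee, "salary", record[3]);
        }
        return doc;
    }

    /** Creates a child element holding the given text and attaches it to parent. */
    private static void append(Document doc, Element parent, String tag, String text) {
        Element child = doc.createElement(tag);
        child.setTextContent(text);
        parent.appendChild(child);
    }
}
```

The returned Document can be written out with a javax.xml.transform.Transformer if a serialized XML file is needed.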

Looking up the control attributes and handling them correctly is very important when filling the XML instance with data. The implementation class xerces.xs.instance.T2xmlInstance implements the t2xml.IParseProperties interface and extends the xerces.xs.instance.SchemaInstance class. The SchemaInstance class generates the bare-bones XML elements; the derived T2xmlInstance class, with the help of IParseProperties, fills these elements with data from the flat file. Listing 3 shows a skeleton of the T2xmlInstance class.

The last three methods in the class skeleton are overridden from the SchemaInstance class in order to fill the XML elements that SchemaInstance generates.

The getRootSchemaElement(.) method is overridden to find the root element according to the control attribute t2xml:rootelem.

The handleParticle(.) method is overridden to look up the control attributes for the container and the container type, so that the filler object that writes data into the generated XML instance is set up properly.

The fillupData(.) method is overridden to fill data into the XML instance based on the control attribute that marks an object and on the type of filler object passed in.

The other methods in the class are helper methods that support the instance fill-up mechanism. The full source code for T2xmlInstance may be found in the source jar. The T2xmlInstance class uses the control attributes explained in Table 1 to populate the default XML instance with data. To run T2xmlInstance as an application, download source.jar, unzip the contents, and try the following commands for the delimited sample case and the fixed-length case, respectively.

{your jdk home}\bin\java -cp .;classes;lib\xerces\resolver.jar;lib\xerces\xercesImpl.jar;lib\xerces\xml-apis.jar;lib\xerces\xmlParserAPIs.jar xerces.xs.instance.T2xmlInstance test\delimited-sample.xsd test\delimited-input.txt test\delimited-output.xml

{your jdk home}\bin\java -cp .;classes;lib\xerces\resolver.jar;lib\xerces\xercesImpl.jar;lib\xerces\xml-apis.jar;lib\xerces\xmlParserAPIs.jar xerces.xs.instance.T2xmlInstance test\fixedLength-sample.xsd test\fixedLength-input.txt test\fixedLength-output.xml

The XML instance for the delimited and fixed-length cases is shown in Listing 4. Because the data structure is exactly the same for these two cases, the generated XML is also identical.

Scope for Future Enhancement
This article demonstrated the concept of converting flat files to an XML instance and provided a solution that proves the concept. Fairly complex flat files may be converted to XML instances using this approach, but the solution needs to be improved before it can serve industry-standard XML instance generation, because support for custom calculations and for mapping data directly to attributes is not yet implemented. Support for XPath expressions might be a further useful addition.

Conclusion
This approach parses flat files and creates an XML instance from a schema representation of the underlying data in the flat file. Many flat file-based systems, such as EFT/EDI, custom database migration, and backup utilities, will find it useful. Because the approach is based on open standards such as W3C schema and on the Xerces implementation for schema, it may be widely used.
