Converting XML files into flat files on Unix

We have multiple XML files on Unix that need to be converted into flat files. We already do this for single-level XML files using C. (We chose C because it can feed Teradata FastLoad, our target, through an INMOD routine, so conversion and loading complete in a single pass; in other languages we would need two passes, one to convert to a flat file and one to load it into Teradata.) For example, the file below

<book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
   </book>

is converted into

bk101~Gambardella, Matthew~XML Developer's Guide~Computer~44.95~

We achieved this by parsing the file in C. However, the original XML files look more like the one below. (Please do not treat it as the exact format; I am just giving an idea.)

<book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
             <modified>2010-01-02</modified>
             <modified>2010-01-03</modified>
      <price>44.95</price>
   </book>

It seems this should be converted into two records:

bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-02~44.95~
bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-03~44.95~

We now feel that our C code is going to become complex for this requirement, so we are looking at other options that are easy to use on Unix. Can anyone please give us working example code in other languages or tools for Unix?


You can use XSLT. I use Saxon (Java) which can be run on Unix.

This stylesheet handles both of your XML samples:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output  method="text"/>
  <xsl:template match="/book">
    <xsl:choose>
      <xsl:when test="modified">
        <xsl:for-each select="modified">
          <xsl:call-template name="dump-line">
            <xsl:with-param name="pos" select="position()"/>
          </xsl:call-template>          
        </xsl:for-each>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="@id"/><xsl:text>~</xsl:text>
        <xsl:value-of select="author"/><xsl:text>~</xsl:text>
        <xsl:value-of select="title"/><xsl:text>~</xsl:text>
        <xsl:value-of select="genre"/><xsl:text>~</xsl:text>
        <xsl:value-of select="price"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template name="dump-line">
    <xsl:param name="pos"/>
    <xsl:value-of select="/book/@id"/><xsl:text>~</xsl:text>
    <xsl:value-of select="/book/author"/><xsl:text>~</xsl:text>
    <xsl:value-of select="/book/title"/><xsl:text>~</xsl:text>
    <xsl:value-of select="/book/genre"/><xsl:text>~</xsl:text>
    <xsl:value-of select="/book/modified[$pos]"/><xsl:text>~</xsl:text>
    <xsl:value-of select="/book/price"/>
    <xsl:text>&#x0A;</xsl:text>
  </xsl:template>
</xsl:stylesheet>

If there are no modified elements, one record is output. If there are modified elements, it outputs as many records as there are modified elements.

Sample output with modified elements:

bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-02~44.95
bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-03~44.95
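For comparison, the same rule (one record per modified element, or a single record when there are none) can be sketched in Python with the standard library's xml.etree.ElementTree. This is not part of the stylesheet answer, only an alternative under the same assumptions: a single book root element with the child names from the question. The function name flatten_book and the FIELDS tuple are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Single-valued child elements, in output order (taken from the question's sample).
FIELDS = ("author", "title", "genre")


def flatten_book(xml_text, sep="~"):
    """Return one flat record per <modified> element, or one record if there are none."""
    book = ET.fromstring(xml_text)
    head = [book.get("id", "")] + [book.findtext(f, default="") for f in FIELDS]
    price = book.findtext("price", default="")
    dates = [m.text or "" for m in book.findall("modified")]
    if not dates:
        # No repeating element: emit a single record without a date column.
        return [sep.join(head + [price])]
    # Repeat the single-valued fields once per modification date.
    return [sep.join(head + [d, price]) for d in dates]


SAMPLE = """<book id="bk101">
  <author>Gambardella, Matthew</author>
  <title>XML Developer's Guide</title>
  <genre>Computer</genre>
  <modified>2010-01-02</modified>
  <modified>2010-01-03</modified>
  <price>44.95</price>
</book>"""

if __name__ == "__main__":
    for record in flatten_book(SAMPLE):
        print(record)
```

Like the stylesheet, this sketch writes no trailing ~ after the price; append sep to each record if the loader expects one.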


If you're loading the data into a database, and some fields have a many-to-one relationship with other fields, then you need to make sure your database structure reflects that: one table for the book and one table for the modification dates. Otherwise it will look like there are two books when in fact there is one book with two modification dates.
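A minimal Python sketch of that split, assuming the element names from the question. The function name split_book and the column order are hypothetical and do not describe any actual Teradata schema: the single-valued fields become one book row, and each modified element becomes a row keyed by the book id.

```python
import xml.etree.ElementTree as ET


def split_book(xml_text):
    """Return (book_row, modification_rows) for one <book> element."""
    book = ET.fromstring(xml_text)
    book_id = book.get("id", "")
    # One row for the book itself: only the single-valued fields.
    book_row = (
        book_id,
        book.findtext("author", default=""),
        book.findtext("title", default=""),
        book.findtext("genre", default=""),
        book.findtext("price", default=""),
    )
    # One row per modification date, pointing back at the book by id.
    mod_rows = [(book_id, m.text or "") for m in book.findall("modified")]
    return book_row, mod_rows


SAMPLE = """<book id="bk101">
  <author>Gambardella, Matthew</author>
  <title>XML Developer's Guide</title>
  <genre>Computer</genre>
  <modified>2010-01-02</modified>
  <modified>2010-01-03</modified>
  <price>44.95</price>
</book>"""

if __name__ == "__main__":
    book_row, mod_rows = split_book(SAMPLE)
    print("~".join(book_row))
    for row in mod_rows:
        print("~".join(row))
```

Each list could be written to its own flat file and loaded into its own table, so the book appears once no matter how many modification dates it has.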

However, if you are loading the data into a database, why are you first converting it to a flat file? You said you wanted to avoid two parsing passes. It looks like you'll still have one pass to parse the XML and write the flat file, and another to parse the flat file and enter it into the database. Why not simply parse the XML and put the data directly into the database?

There are reasons why formats like XML were invented, and one is to encapsulate complicated data relationships in text-based documents. By converting to a "flat file" you will lose that complexity. If you are then going to import the data into an environment that can handle that complexity and store those relationships, why not keep it?

Does your database have an API, or can it only import flat files?

---EDIT---

It's easier to reply as part of an answer than as a series of comments.

First, thanks for the clarification. Second, no, I cannot provide example code, mostly because what you want sounds very specific. Third, I think you have two options:

1) You have a load of C code already written to parse the XML. You have to consider the cost of throwing it all away and writing it again in Perl and supporting that, against the cost of improving it to import data directly into your Teradata database and the cost of maintaining it thereafter.

2) For Perl, there are many XML parsers, and in my experience they make traversing an XML tree or data structure much easier than in C. I'm not a fan of Perl, but I have written code to deal with already-parsed XML trees in C and I have never failed to hate it. By contrast, doing it in Perl is simpler and probably even quicker.

There are a huge number of Perl modules out there to parse XML. I suggest you search the internet for some reviews on them to decide which is easiest or most appropriate for you to use.

There is a Perl module called Teradata::SQL that should allow you to import the data into your Teradata database. There may be other modules that are easier, simpler, or better to use. I have no experience with any of them, so I cannot make a recommendation. Search http://www.cpan.org for any modules that may be useful.


Lastly, I STRONGLY recommend taking some time to ensure that the design of your Teradata database matches the data going into it. As I stated above, you clearly have a many-to-one relationship between modification dates and books, so you need a table for modification dates, a table for books, and the correct many-to-one relationship between them in your table design. Putting one entry per line, so that the same book appears on multiple lines with only the modification date varying, is very wrong. There may be other many-to-one relationships, such as author. Imagine book B written by authors A1 and A2 with modification dates M1 and M2. If you use the approach you discussed above of having one line for each combination, you end up with 4 entries for the same book, and it looks like you have 2 books with the same title but written by different authors.
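The blow-up is easy to see in a small Python sketch. The values A1, A2, M1, M2 are the placeholders from the paragraph above; the function name flat_rows and the title string are made up for illustration.

```python
from itertools import product


def flat_rows(book_id, title, authors, dates):
    """One flat row per (author, date) combination, as in the one-line-per-combination approach."""
    return [(book_id, author, title, date) for author, date in product(authors, dates)]


authors = ["A1", "A2"]
dates = ["M1", "M2"]
rows = flat_rows("B", "Some Title", authors, dates)

if __name__ == "__main__":
    for row in rows:
        print("~".join(row))
    # Flat layout: len(authors) * len(dates) rows for a single book.
    print(len(rows), "flat rows for 1 book")
    # Separate tables: 1 book row, plus one row per author and one per date.
    print(1, "book row,", len(authors), "author rows,", len(dates), "date rows")
```

The flat layout grows with the product of the repeating fields, while separate tables grow with their sum and never duplicate the book itself.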

Spend some time to ensure you understand the structure of the data in the XML files. This should be clearly defined by the DTD.


XSLT is an option; check out the xsltproc command-line tool (part of libxslt).

Or you can use XQuery, which is much easier, though you might need to coerce it into producing text. The following XQuery script does almost what you want (only a few fields listed):

for $book in doc("book.xml")/book
for $mod in $book/modified
return concat($book/@id, "~", $book/title, "~", $mod, "
")

You can run this through Saxon with

java net.sf.saxon.Query '!method=text' script.xq

Another popular XQuery processor for Unix is XQilla, though I'm not sure it can produce non-XML output.

(There may be a smart alternative to my awkward way of generating a newline.)


How about formatting the line as bk101~Gambardella, Matthew~XML Developer's Guide~Computer~2010-01-02,2010-01-03~44.95~? Of course, whatever reads the file must then handle the fact that the modified field can contain a list of values. That's about as flat as you can make it.
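A minimal Python sketch of this layout, assuming the element names from the question. The function name flatten_joined and the choice to reject values that contain the list separator are illustrative assumptions, not part of the suggestion above.

```python
import xml.etree.ElementTree as ET


def flatten_joined(xml_text, sep="~", list_sep=","):
    """Return one record per book, with all <modified> values joined into one field."""
    book = ET.fromstring(xml_text)
    dates = [m.text or "" for m in book.findall("modified")]
    # The joined field is only unambiguous if no value contains the list separator.
    if any(list_sep in d for d in dates):
        raise ValueError("modified value contains the list separator")
    fields = [
        book.get("id", ""),
        book.findtext("author", default=""),
        book.findtext("title", default=""),
        book.findtext("genre", default=""),
        list_sep.join(dates),
        book.findtext("price", default=""),
    ]
    # Trailing separator, matching the record format used in the question.
    return sep.join(fields) + sep


SAMPLE = """<book id="bk101">
  <author>Gambardella, Matthew</author>
  <title>XML Developer's Guide</title>
  <genre>Computer</genre>
  <modified>2010-01-02</modified>
  <modified>2010-01-03</modified>
  <price>44.95</price>
</book>"""

if __name__ == "__main__":
    print(flatten_joined(SAMPLE))
```

The author field in the sample also contains a comma, so a reader of this format has to split on ~ first and only then split the modified column on the list separator.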
