
pipe one long line as multiple lines

Say I have a bunch of XML files which contain no newlines, but basically contain a long list of records, delimited by </record><record>

If the delimiter were </record>\n<record> I would be able to do something like cat *.xml | grep xyz | wc -l to count instances of records of interest, because cat would emit the records one per line.

Is there a way to write SOMETHING *.xml | grep xyz | wc -l where SOMETHING can stream out the records one per line? I tried using awk for this but couldn't find a way to avoid streaming the whole file into memory.

Hopefully the question is clear enough :)


This is a little ugly, but it works:

sed 's|</record>|</record>\
|g' *.xml | grep xyz | wc -l

(Yes, I know I could make it a little bit shorter, but only at the cost of clarity.)
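On GNU sed, the embedded newline can also be written as \n in the replacement, which keeps the command on one line (a sketch assuming GNU sed; BSD/macOS sed does not interpret \n in the replacement, so there you need the literal newline shown above). grep -c is just shorthand for grep | wc -l here:

sed 's|</record>|</record>\n|g' *.xml | grep -c xyz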


If the text between your search string and the closing tag contains no < character, then you may try this:

grep -E -o 'SEARCH_STRING[^<]*</record>' *.xml | wc -l

or, if it contains no / character:

grep -E -o 'SEARCH_STRING[^/]*/record>' *.xml | wc -l

or, if it contains no > character:

grep -E -o 'SEARCH_STRING[^>]*>' *.xml | wc -l
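This works because -o makes grep print each match on its own line instead of printing the whole (single) input line, so wc -l ends up counting matches rather than lines. A quick sketch with made-up data and a hypothetical search string xyz:

printf '<record>abc</record><record>xyz 1</record><record>xyz 2</record>' |
    grep -E -o 'xyz[^<]*</record>'
# prints:
# xyz 1</record>
# xyz 2</record>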


Here is a different approach using xsltproc, grep, and wc. Warning: I am new to XSL so I can be dangerous :-). Here is my count_records.xsl file:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text" />      <!-- Output text, not XML -->
  <xsl:template match="record">     <!-- Search for "record" node -->
    <xsl:value-of select="text()"/> <!-- Output: contents of node record -->
    <xsl:text>&#10;</xsl:text>    <!-- Output: a newline -->
  </xsl:template>

</xsl:stylesheet>

On my Mac, I found a command-line tool called xsltproc, which reads instructions from an XSL file and applies them to XML files. So the command would be:

xsltproc count_records.xsl *.xml | grep SEARCH_STRING | wc -l
  • The xsltproc command prints the text of each record node, one per line
  • The grep command keeps only the lines containing the text you are interested in
  • Finally, the wc command produces the count (a small worked example follows)
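For example, with a made-up sample.xml whose contents sit on a single line and the hypothetical search string xyz:

printf '<records><record>abc</record><record>xyz 1</record></records>' > sample.xml
xsltproc count_records.xsl sample.xml | grep xyz | wc -l
# prints 1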


You may also try xmlstarlet for gig-sized files:

# cf. http://niftybits.wordpress.com/2008/03/27/working-with-huge-xml-files-tools-of-the-trade/

xmlstarlet sel -T -t -v "count(//record[contains(normalize-space(text()),'xyz')])" -n *.xml | 
    awk '{n+=$1} END {print n}'

xmlstarlet sel -T -t -v "count(//record[contains(normalize-space(text()),'xyz')])" -n *.xml | 
    paste -s -d '+' /dev/stdin | bc
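Either way, xmlstarlet should print one count per input file, so the final step just sums those numbers. For instance, if the per-file counts were 3, 0 and 5 (made-up values), the paste/bc step reduces to:

printf '3\n0\n5\n' | paste -s -d '+' /dev/stdin | bc
# prints 8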
