开发者

unix script to count the number of characters between particular xml tags

Hi I am trying to create a script that will count the number of characters between xml tags and idealy group by these values before returning the variations:

eg

<CONTEXT_1>aaaa<CONTEXT_1>
<CONTEXT_2>bb<CONTEXT_2>
<CONTEXT_2>dfgh<CONTEXT_2>
<CONTEXT_6>bb<CONTEXT_6>
<CONTEXT_1>bbbb<CONTEXT_1>

the result of this would be

<CONTEXT_1> 4
<CONTEXT_2> 2,4
<CONTEXT_6> 4

Any help would be much appreciated! I'm totally 开发者_Python百科stuck

Thanks M


1. Use XML-specific utilities

I think that any command line tool designed to work with XML is better than custom awk/sed hacks. Scripts using such tools are more robust and do not break when the XML input is slightly reformatted (e.g. it doesn't matter where line breaks are and how the document is indented). My tool of choice for XML querying from the command line is xmlstarlet.

2. Fix your XML

Then, you need to fix your XML: close tags properly and add a root element. Something like this:

<root>
<CONTEXT_1>aaaa</CONTEXT_1>
<CONTEXT_2>bb</CONTEXT_2>
<CONTEXT_2>dfgh</CONTEXT_2>
<CONTEXT_6>bb</CONTEXT_6>
<CONTEXT_1>bbbb</CONTEXT_1>
</root>

3. Use XPath and XSLT

Select the elements you need with XPath and process them with XSLT expressions. In your example, you can count elements' length with

$ xmlstarlet sel -t -m '//root/*' -v "name(.)" -o ": " -v "string-length(.)" -n test.xml 

//root/* selects all child nodes of the root. name(.) prints element name of the currently selected element, and string-length(.) prints the length of its contents.

And get the output:

CONTEXT_1: 4
CONTEXT_2: 2
CONTEXT_2: 4
CONTEXT_6: 2
CONTEXT_1: 4

Group results as you like with awk or similar tools.


You can do something like this using sed:

sed  's/^<\([^>]*\)>\(.*\)<.*$/\1 \2/g' file.xml | sort | while read line
do
    context=`echo $line | cut -d' ' -f1`
    count=`echo $line | cut -d' ' -f2 | tr -d '\n' | wc -c`
    echo $context: $count
done | uniq

which prints:

CONTEXT_1: 4
CONTEXT_2: 2
CONTEXT_2: 4
CONTEXT_6: 2


This is a job for Awk: a full-featured text processing language.

Something like (not tested):

awk \
"BEGIN { $INIT_TAB_AWK } \
{ split(\$0, tab, \"\"); \
for (chara in tab) \
{ for (chara2 in tab_search) \
{ if (tab_search[chara2] == tab[chara]) { final_tab[chara2]++ } } } } \
END { for (chara in final_tab) \
{ print tab_search[chara] \" => \" final_tab[chara] } }"


Using Perl:

#! /bin/perl    
open FILE, $ARGV[0] or die $!;
while (my $line = <FILE>) {
        if ($line =~ /^<([^>]*)>(.*)<.*$/) {
            $table{$1}="$table{$1},".length($2);
         }
}    
foreach my $key (sort keys %table) {
  print "$key ".substr($table{$key},1)."\n";
}

Output is:

CONTEXT_1 4,4
CONTEXT_2 2,4
CONTEXT_6 2


$ awk -F">" '{sub("<.*","",$2);a[$1]=a[$1]","length($2)}END{for (i in a) print i,a[i]}' file
<CONTEXT_6 ,2
<CONTEXT_1 ,4,4
<CONTEXT_2 ,2,4
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜