Unix script to count the number of characters between particular XML tags

2023-01-17 10:53

Hi, I am trying to create a script that counts the number of characters between XML tags and, ideally, groups by these values before returning the variations:

<CONTEXT_1>aaaa<CONTEXT_1>
<CONTEXT_2>bb<CONTEXT_2>
<CONTEXT_2>dfgh<CONTEXT_2>
<CONTEXT_6>bb<CONTEXT_6>
<CONTEXT_1>bbbb<CONTEXT_1>

The result of this would be:

<CONTEXT_1> 4
<CONTEXT_2> 2,4
<CONTEXT_6> 2

Any help would be much appreciated! I'm totally stuck.

Thanks M

1. Use XML-specific utilities

I think that any command line tool designed to work with XML is better than custom awk/sed hacks. Scripts using such tools are more robust and do not break when the XML input is slightly reformatted (e.g. it doesn't matter where line breaks are and how the document is indented). My tool of choice for XML querying from the command line is xmlstarlet.

2. Fix your XML

Then, you need to fix your XML: close tags properly and add a root element. Something like this:

<root>
<CONTEXT_1>aaaa</CONTEXT_1>
<CONTEXT_2>bb</CONTEXT_2>
<CONTEXT_2>dfgh</CONTEXT_2>
<CONTEXT_6>bb</CONTEXT_6>
<CONTEXT_1>bbbb</CONTEXT_1>
</root>
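
If the input really looks like the sample in the question (one element per line, with the closing tag missing its slash), the fix can be scripted. A minimal sketch under exactly those assumptions; fix_xml is a hypothetical helper name, not part of any tool:

```shell
#!/bin/sh
# fix_xml (hypothetical helper): reads lines like <TAG>text<TAG> on stdin
# and writes well-formed XML on stdout.
# Assumes one element per line and that only the trailing tag lacks its slash.
fix_xml() {
    echo '<root>'
    # Turn a tag at the end of the line into a closing tag;
    # tags that already contain a slash are left untouched.
    sed 's|<\([^/<>]*\)>$|</\1>|'
    echo '</root>'
}
```

It could be used as fix_xml < input.txt > test.xml. Anything more irregular than the sample (nested elements, several elements per line) needs a real repair step instead.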

3. Use XPath and XSLT

Select the elements you need with XPath and process them with XPath functions (xmlstarlet sel turns its options into an XSLT stylesheet). In your example, you can print the length of each element's text with:

$ xmlstarlet sel -t -m '//root/*' -v "name(.)" -o ": " -v "string-length(.)" -n test.xml

//root/* selects all child elements of root. name(.) prints the name of the currently selected element, and string-length(.) prints the length of its text content.

And get the output:

CONTEXT_1: 4
CONTEXT_2: 2
CONTEXT_2: 4
CONTEXT_6: 2
CONTEXT_1: 4

Group results as you like with awk or similar tools.
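
For example, the "name: length" lines printed above could be grouped like this. It is only a sketch: group_lengths is a hypothetical helper name, and it relies on sort -u to drop repeated pairs, so lengths are ordered as strings rather than numerically:

```shell
#!/bin/sh
# group_lengths (hypothetical helper): reads "NAME: LENGTH" lines on stdin and
# prints one line per name with its distinct lengths joined by commas.
group_lengths() {
    # sort -u removes duplicate pairs and puts equal names next to each other
    LC_ALL=C sort -u | awk -F': ' '
        $1 != tag {                      # a new name starts here
            if (NR > 1) print tag, lens  # flush the previous group
            tag = $1; lens = $2; next
        }
        { lens = lens "," $2 }           # same name: append the length
        END { if (NR > 0) print tag, lens }
    '
}
```

Piping the xmlstarlet command into group_lengths should give "CONTEXT_1 4", "CONTEXT_2 2,4" and "CONTEXT_6 2" for the sample document.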

You can do something like this using sed:

sed 's/^<\([^>]*\)>\(.*\)<.*$/\1 \2/' file.xml | while read -r context text
do
    # ${#text} is the number of characters captured between the tags
    echo "$context: ${#text}"
done | sort -u

which prints:

CONTEXT_1: 4
CONTEXT_2: 2
CONTEXT_2: 4
CONTEXT_6: 2

This is a job for Awk: a full-featured text processing language.

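Something like this (not tested). It splits each input line into characters and counts how often each entry of tab_search occurs among them; it assumes the shell variable INIT_TAB_AWK holds the awk statements that fill tab_search: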

awk \
"BEGIN { $INIT_TAB_AWK } \
{ split(\$0, tab, \"\"); \
for (chara in tab) \
{ for (chara2 in tab_search) \
{ if (tab_search[chara2] == tab[chara]) { final_tab[chara2]++ } } } } \
END { for (chara in final_tab) \
{ print tab_search[chara] \" => \" final_tab[chara] } }"

Using Perl:

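#!/usr/bin/perl
use strict;
use warnings;

my %table;
open(my $fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
while (my $line = <$fh>) {
    # Capture the tag name and the text between the two tags
    if ($line =~ /^<([^>]*)>(.*)<.*$/) {
        push @{ $table{$1} }, length($2);
    }
}
close($fh);
# Print each tag followed by its comma-separated lengths
foreach my $key (sort keys %table) {
    print "$key ", join(',', @{ $table{$key} }), "\n";
}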

Output is:

CONTEXT_1 4,4
CONTEXT_2 2,4
CONTEXT_6 2

$ awk -F">" '{sub("<.*","",$2);a[$1]=a[$1]","length($2)}END{for (i in a) print i,a[i]}' file
<CONTEXT_6 ,2
<CONTEXT_1 ,4,4
<CONTEXT_2 ,2,4
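
The one-liner above leaves a stray "<" and a leading comma in its output and repeats equal lengths. A possible variant that prints the format asked for in the question is sketched below; count_between_tags is a hypothetical helper name, and the sketch assumes one element per line as in the sample input:

```shell
#!/bin/sh
# count_between_tags (hypothetical helper): reads lines like <TAG>text<TAG>
# or <TAG>text</TAG> on stdin and prints "<TAG> len1,len2,..." per tag,
# listing each distinct length once, in order of first appearance.
count_between_tags() {
    awk -F'[<>]' '
        NF >= 4 {                          # skip lines without two tags
            tag = $2; len = length($3)     # $2 is the tag name, $3 the text
            if ((tag, len) in seen) next   # this length was already listed
            seen[tag, len] = 1
            if (!(tag in lens)) { order[++n] = tag; lens[tag] = len }
            else lens[tag] = lens[tag] "," len
        }
        END { for (i = 1; i <= n; i++) printf "<%s> %s\n", order[i], lens[order[i]] }
    '
}
```

It could be run as count_between_tags < file. Splitting on "<" and ">" works for the unclosed tags in the question as well as for properly closed ones.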
