Unix script to count the number of characters between particular XML tags
Hi, I am trying to create a script that will count the number of characters between XML tags and ideally group by tag, returning the distinct lengths found for each:
e.g.
<CONTEXT_1>aaaa<CONTEXT_1>
<CONTEXT_2>bb<CONTEXT_2>
<CONTEXT_2>dfgh<CONTEXT_2>
<CONTEXT_6>bb<CONTEXT_6>
<CONTEXT_1>bbbb<CONTEXT_1>
The result of this would be:
<CONTEXT_1> 4
<CONTEXT_2> 2,4
<CONTEXT_6> 2
Any help would be much appreciated! I'm totally stuck
Thanks M
1. Use XML-specific utilities
I think that any command line tool designed to work with XML is better than custom awk/sed hacks. Scripts using such tools are more robust and do not break when the XML input is slightly reformatted (e.g. it doesn't matter where line breaks are and how the document is indented). My tool of choice for XML querying from the command line is xmlstarlet.
2. Fix your XML
Then, you need to fix your XML: close tags properly and add a root element. Something like this:
<root>
<CONTEXT_1>aaaa</CONTEXT_1>
<CONTEXT_2>bb</CONTEXT_2>
<CONTEXT_2>dfgh</CONTEXT_2>
<CONTEXT_6>bb</CONTEXT_6>
<CONTEXT_1>bbbb</CONTEXT_1>
</root>
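If the file really does look like the sample in the question (one element per line, with the opening tag repeated where the closing tag should be), the repair could be sketched roughly as below. The function name fix_xml is made up for illustration, and the one-element-per-line layout is an assumption; anything more irregular would need fixing by hand.

```shell
# Hypothetical helper: wraps the input in <root> and turns a trailing
# "<NAME>" that repeats the opening tag into a proper "</NAME>".
# Assumes one element per line; lines that are already well-formed
# pass through unchanged.
fix_xml() {
    echo '<root>'
    sed 's|^\(<\([^>]*\)>.*\)<\2>$|\1</\2>|' "$@"
    echo '</root>'
}

# Example: repair two lines taken from the question
printf '<CONTEXT_1>aaaa<CONTEXT_1>\n<CONTEXT_2>bb<CONTEXT_2>\n' | fix_xml
```

Usage would be something like fix_xml broken.xml > test.xml, where broken.xml stands in for your original file.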
3. Use XPath and XSLT
Select the elements you need with XPath and process them with XSLT expressions. In your example, you can print each element's name and the length of its contents with
$ xmlstarlet sel -t -m '//root/*' -v "name(.)" -o ": " -v "string-length(.)" -n test.xml
Here //root/* selects all child elements of the root, name(.) prints the name of the currently selected element, and string-length(.) prints the length of its contents.
And get the output:
CONTEXT_1: 4
CONTEXT_2: 2
CONTEXT_2: 4
CONTEXT_6: 2
CONTEXT_1: 4
Group results as you like with awk or similar tools.
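As a rough sketch of that grouping step, assuming input lines of the form "NAME: LENGTH" as printed by the xmlstarlet command, something like the following would collect the distinct lengths per element name. The function name group_lengths is only illustrative.

```shell
# Hypothetical helper: reads "NAME: LENGTH" lines and prints each name
# once, followed by its distinct lengths joined with commas.
group_lengths() {
    awk -F': ' '
    !(($1, $2) in seen) {
        seen[$1, $2] = 1                    # remember this name/length pair
        if ($1 in lens) lens[$1] = lens[$1] "," $2
        else            lens[$1] = $2
    }
    END { for (name in lens) print name, lens[name] }
    ' "$@" | sort                           # awk array order is unspecified
}

# Example with the lines shown in the output above
printf 'CONTEXT_1: 4\nCONTEXT_2: 2\nCONTEXT_2: 4\nCONTEXT_6: 2\nCONTEXT_1: 4\n' | group_lengths
```

The xmlstarlet output can be piped straight into it. Lengths are listed in order of first appearance, and names are sorted only because of the trailing sort.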
You can do something like this using sed:
sed 's/^<\([^>]*\)>\(.*\)<.*$/\1 \2/' file.xml | while read -r context text
do
    # ${#text} is the number of characters between the tags
    echo "$context: ${#text}"
done | sort -u
which prints:
CONTEXT_1: 4
CONTEXT_2: 2
CONTEXT_2: 4
CONTEXT_6: 2
This is a job for Awk: a full-featured text processing language.
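Something like this (not tested). Note that $INIT_TAB_AWK is a shell variable, expanded before awk runs, that is expected to hold the awk statements filling the tab_search array: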
awk \
"BEGIN { $INIT_TAB_AWK } \
{ split(\$0, tab, \"\"); \
for (chara in tab) \
{ for (chara2 in tab_search) \
{ if (tab_search[chara2] == tab[chara]) { final_tab[chara2]++ } } } } \
END { for (chara in final_tab) \
{ print tab_search[chara] \" => \" final_tab[chara] } }"
Using Perl:
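#!/usr/bin/perl
use strict;
use warnings;

# Usage: pass the XML file as the first argument
open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
my %table;
while (my $line = <$fh>) {
    # $1 is the tag name, $2 is the text between the tags
    if ($line =~ /^<([^>]*)>(.*)<.*$/) {
        push @{ $table{$1} }, length($2);
    }
}
close $fh;
foreach my $key (sort keys %table) {
    print "$key ", join(',', @{ $table{$key} }), "\n";
}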
Output is:
CONTEXT_1 4,4
CONTEXT_2 2,4
CONTEXT_6 2
$ awk -F">" '{sub("<.*","",$2);a[$1]=a[$1]","length($2)}END{for (i in a) print i,a[i]}' file
<CONTEXT_6 ,2
<CONTEXT_1 ,4,4
<CONTEXT_2 ,2,4
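The one-liner above leaves a leading comma, drops the closing > of the tag, and repeats duplicate lengths. A possible variant that prints the format asked for in the question might look like the sketch below. It assumes one element per line and works on the raw, unfixed input as well; the function name count_by_tag is made up for illustration.

```shell
# Hypothetical helper: for each line starting with <TAG>, measure the text
# up to the next "<" and print every tag with its distinct lengths.
count_by_tag() {
    awk '
    match($0, /^<[^>]+>/) {
        tag  = substr($0, 1, RLENGTH)       # opening tag, e.g. <CONTEXT_1>
        text = substr($0, RLENGTH + 1)
        sub(/<.*$/, "", text)               # drop the trailing tag
        len = length(text)
        if (!((tag, len) in seen)) {
            seen[tag, len] = 1
            if (tag in lens) lens[tag] = lens[tag] "," len
            else             lens[tag] = len
        }
    }
    END { for (tag in lens) print tag, lens[tag] }
    ' "$@" | sort
}

# Example with the sample from the question
printf '<CONTEXT_1>aaaa<CONTEXT_1>\n<CONTEXT_2>bb<CONTEXT_2>\n<CONTEXT_2>dfgh<CONTEXT_2>\n<CONTEXT_6>bb<CONTEXT_6>\n<CONTEXT_1>bbbb<CONTEXT_1>\n' | count_by_tag
```

Called as count_by_tag file, it reads the file instead of standard input.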