Help with sed regex: extract text from specific tag
First time sed'er, so be gentle.
I have the following text file, 'test_file':
<Tag1>not </Tag1><Tag2>working</Tag2>
I want to extract the text between <Tag2> and </Tag2>
using a sed regex. There may be other occurrences of <Tag2>,
and I would like to extract those as well.
So far I have this sed based regex:
cat test_file | grep -i "Tag2"| sed 's/<[^>]*[>]//g'
which gives the output:
not working
Does anyone have an idea how to get this working?
As another poster said, sed may not be the best tool for this job. You may want to use something built for XML parsing, or even a simple scripting language, such as perl.
The problem with your attempt is that you aren't analyzing the string properly.
cat test_file
is good - it prints out the contents of the file to stdout.
grep -i "Tag2"
is ok - it prints out only lines with "Tag2" in them. This may not be exactly what you want. Bear in mind that it will print the whole line, not just the <Tag2> part, so you will still have to search out that part later.
sed 's/<[^>]*[>]//g'
isn't what you want - it simply removes the tags, including <Tag1> and <Tag2>.
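To see that last step in isolation, you can feed the sample line straight into the same sed expression (the variable name line is just for illustration):

```shell
# The sample line from the question.
line='<Tag1>not </Tag1><Tag2>working</Tag2>'

# Every <...> sequence is deleted, whichever tag it is, so the text
# from both Tag1 and Tag2 is left behind.
printf '%s\n' "$line" | sed 's/<[^>]*[>]//g'
# prints: not working
```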
You can try something like:
cat test_file | grep -i tag2 | sed 's/.*<Tag2>\(.*\)<\/Tag2>.*/\1/'
This will produce
working
but it will only handle one tag pair per line: the leading .* is greedy, so if a line holds several pairs, only the last one is printed.
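If you need several <Tag2> pairs from the same line, one possible approach is to let grep -o split the matches onto separate lines first and strip the tags afterwards. This is only a sketch: -o is a GNU/BSD grep extension rather than POSIX, the pattern assumes there are no nested tags inside <Tag2>, and extract_tag2 and the sample file are names invented for the example.

```shell
# Hypothetical sample file with two <Tag2> pairs on one line.
sample=$(mktemp)
printf '%s\n' '<Tag1>not </Tag1><Tag2>working</Tag2><Tag2>again</Tag2>' > "$sample"

# extract_tag2 is an illustrative helper name.
extract_tag2() {
    # grep -o prints each <Tag2>...</Tag2> match on its own line;
    # sed then removes the surrounding tags.
    grep -o '<Tag2>[^<]*</Tag2>' "$1" | sed 's/<[^>]*>//g'
}

extract_tag2 "$sample"
rm -f "$sample"
```

For the sample above this should print working and again on separate lines.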
For your nice, friendly example, you could use
sed -e 's/^.*<Tag2>//' -e 's!</Tag2>.*!!' test_file
but the XML out there is cruel and uncaring. You're asking for serious trouble using regular expressions to scrape XML.
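As a small illustration of how quickly that breaks, here is the same pair of expressions run on two slightly less friendly lines. The wrapper name two_step_sed and both input lines are made up for the demonstration.

```shell
# Illustrative wrapper around the two sed expressions above; reads stdin.
two_step_sed() {
    sed -e 's/^.*<Tag2>//' -e 's!</Tag2>.*!!'
}

# Two pairs on one line: the greedy ^.* eats everything up to the
# LAST <Tag2>, so the first pair is silently lost.
printf '%s\n' '<Tag2>first</Tag2><Tag2>second</Tag2>' | two_step_sed
# prints: second

# A line with no <Tag2> at all is passed through untouched,
# because sed prints every line unless told otherwise.
printf '%s\n' '<Tag1>no match here</Tag1>' | two_step_sed
# prints: <Tag1>no match here</Tag1>
```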
You can use gawk (a multi-character RS is not guaranteed by POSIX awk), e.g.:
$ cat file
<Tag1>not </Tag1><Tag2>working here</Tag2>
<Tag1>not </Tag1><Tag2>
working
</Tag2>
$ awk -vRS="</Tag2>" '/<Tag2>/{gsub(/.*<Tag2>/,"");print}' file
working here
working
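If the line breaks inside a multi-line <Tag2> block get in the way, a variation along these lines joins each record before stripping the prefix. It is a sketch under the same assumption as above (an awk that accepts a multi-character RS, such as gawk); extract_tag2_awk and the temporary file are names invented for the example.

```shell
# Recreate the sample file shown above in a temporary location.
sample=$(mktemp)
printf '%s\n' '<Tag1>not </Tag1><Tag2>working here</Tag2>' \
              '<Tag1>not </Tag1><Tag2>' 'working' '</Tag2>' > "$sample"

# extract_tag2_awk is an illustrative helper name.
extract_tag2_awk() {
    # Records end at </Tag2>. For each record containing <Tag2>, remove
    # embedded newlines first, then drop everything up to <Tag2>.
    awk -v RS='</Tag2>' '/<Tag2>/ { gsub(/\n/, ""); sub(/.*<Tag2>/, ""); print }' "$1"
}

extract_tag2_awk "$sample"
rm -f "$sample"
```

Note that removing the newlines also glues together text that was deliberately spread over several lines, which may or may not be what you want.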
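Another option is to split each line on the string Tag2 and then strip every non-letter character (note that spaces and digits inside the text are lost too):
awk -F"Tag2" '{print $2}' test_file | sed 's/[^a-zA-Z]//g'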