UNIX: Parse an HTML Page and Display the Contents of a Tag - One-Liner?

I have an HTML file and I am interested in the data enclosed by <pre> </pre> tags. Is there a one-liner that can achieve this?

Sample file :

<html>
<title>
Hello There!
</title>
<body>
<pre>
John   Working
Kathy  Working
Mary   Working
Kim    N/A
</pre>
</body>
</html>

Output should be :

John 
Kathy 
Mary 
Kim 

Much appreciated guys, thank you!


Get your hands on the tools that ship with XML::Twig. One of them is xml_grep, which reduces your problem to:

cat foo.txt | xml_grep --nowrap pre 

Here pre is an XPath expression. Follow it with some simple text processing to keep only the names, and this will work even if your XML is formatted differently.
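The "simple text processing" step could be sketched as below. The helper name first_column is made up for illustration and is not part of xml_grep. It assumes the extracted block arrives on stdin, one record per line, and it skips any line that starts with a tag in case the <pre> tags are still present in the output.

```shell
#!/bin/sh
# first_column: hypothetical helper, not part of xml_grep or XML::Twig.
# Reads lines on stdin and prints the first whitespace-separated field
# of every non-blank line, skipping lines whose first field starts
# with "<" (for example a leftover <pre> or </pre> tag line).
first_column() {
    awk 'NF && $1 !~ /^</ { print $1 }'
}
```

It would be appended to the pipeline above, for example as `... | first_column`.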

Word of advice: don't use sed or other stream-based text processing tools to manipulate structured data like XML. Use a proper parser.


If you have XHTML, use xmlstarlet. If you don't, pass it through HTML Tidy first to turn it into XHTML.
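A rough sketch of that two-step approach follows. It assumes both tidy and xmlstarlet are installed and that tidy places the converted elements in the XHTML namespace, which is why the XPath uses a prefix. The function name pre_text is hypothetical.

```shell
#!/bin/sh
# pre_text: hypothetical helper name. Prints the text of every <pre>
# element in the HTML file given as $1.
# Assumes `tidy` and `xmlstarlet` are installed and on PATH.
pre_text() {
    # -q: quiet, -asxml: emit XHTML, -numeric: write numeric entities so
    # the XML parser does not need a DTD to resolve names like &nbsp;.
    # tidy's diagnostics go to stderr and are discarded here.
    tidy -q -asxml -numeric "$1" 2>/dev/null |
        # Bind the XHTML namespace to the prefix x, then print the text
        # value of each pre element followed by a newline.
        xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -t -v '//x:pre' -n
}
```

Calling `pre_text file.html` should print the block's raw text, which still needs a final step to keep only the first column.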


Since you specifically asked about a solution using sed: assuming that the interesting lines always sit between lines containing <pre> and </pre> (appearing exactly like that), that the interesting content is never on the same line as the opening or closing tag, that the first such block is the only one you want to extract, and that you still want to do it this way even though you understand it is really the wrong way to solve this problem, then you could use sed, for example like this:

sed '1,/<pre>/d;/<\/pre>/,$d'

It deletes every line from the first through the one containing <pre>, and every line from the one containing </pre> through the last.
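The sed command above leaves the whole line, status column included. A variant that prints only the names, as the question asks, might look like the sketch below. It uses a range address instead of two deletes, so it also copes with more than one <pre> block, but it still assumes each tag sits on its own line and that every name starts in the first column. The function name pre_names is made up.

```shell
#!/bin/sh
# pre_names: hypothetical helper name. Reads HTML on stdin and prints
# the first word of each line found between a <pre> line and a </pre>
# line. Assumes the tags are on lines of their own and that no data
# line begins with whitespace.
pre_names() {
    sed -n \
        -e '/<pre>/,/<\/pre>/{' \
        -e '/<pre>/d' \
        -e '/<\/pre>/d' \
        -e 's/[[:space:]].*//' \
        -e 'p' \
        -e '}'
}
```

Usage would be `pre_names < file.html`. The same caveat applies as before: this is text matching, not parsing, so it breaks as soon as the markup is laid out differently.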

(FWIW, I would rather use an XPath expression for selecting the interesting content. For example using xmlstarlet as suggested by Ignacio Vazquez-Abrams it could go like this: xmlstarlet sel -t -v /html/body/pre.)


My Perl-fu is weak, but this works for your example:

$ cat file.html | perl -e'while(<>){if(/<\/pre>/){$a=0;}if($a){print}if(/<pre>/){$a=1;}}' | cut -f1 -d' '
John
Kathy
Mary
Kim
