How to rewrite a stream of HTML tokens into a new document?

Asked 2022-12-20 16:12

Suppose I have tokenized an HTML document. How can I transform the token stream into a new document, or apply other transformations to it?

For example, suppose I have this HTML:

<html>
 <body>
  <p><a href="/foo">text</a></p>
  <p>Hello <span class="green">world</span></p>
 </body>
</html>

So far I have written a tokenizer that outputs a stream of tokens. For this document the tokens would be (in pseudocode):

TAG_OPEN[html] TAG_OPEN[body] TAG_OPEN[p] TAG_OPEN[a] TAG_ATTRIBUTE[href]
TAG_ATTRIBUTE_VALUE[/foo] TEXT[text] TAG_CLOSE[a] TAG_CLOSE[p]
TAG_OPEN[p] TEXT[Hello] TAG_OPEN[span] TAG_ATTRIBUTE[class]
TAG_ATTRIBUTE_VALUE[green] TEXT[world] TAG_CLOSE[span] TAG_CLOSE[p]
TAG_CLOSE[body] TAG_CLOSE[html]

But now I have no idea how to use this stream to implement transformations.

For example, I would like to rewrite the TAG_ATTRIBUTE_VALUE[/foo] that follows TAG_OPEN[a] TAG_ATTRIBUTE[href] to something else.
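One possible approach to this first transformation, sketched in Python, is a filter that reads tokens one at a time and remembers which tag and attribute it is currently inside. The tuple representation `(kind, value)` and the names `rewrite_href` and `fn` are assumptions made for illustration; they are not part of the tokenizer described above.

```python
from typing import Callable, Iterable, Iterator, Tuple

Token = Tuple[str, str]  # assumed shape: (kind, value), e.g. ("TAG_OPEN", "a")


def rewrite_href(tokens: Iterable[Token], fn: Callable[[str], str]) -> Iterator[Token]:
    """Yield every token unchanged, except href values that belong to an <a> tag."""
    current_tag = None   # name of the start tag whose attributes are being read
    current_attr = None  # name of the attribute whose value is expected next
    for kind, value in tokens:
        if kind == "TAG_OPEN":
            current_tag, current_attr = value, None
        elif kind == "TAG_ATTRIBUTE":
            current_attr = value
        elif kind == "TAG_ATTRIBUTE_VALUE":
            if current_tag == "a" and current_attr == "href":
                value = fn(value)  # the actual rewrite
            current_attr = None
        else:
            # TEXT or TAG_CLOSE: the start tag's attribute list has ended
            current_tag = current_attr = None
        yield kind, value


tokens = [
    ("TAG_OPEN", "p"), ("TAG_OPEN", "a"), ("TAG_ATTRIBUTE", "href"),
    ("TAG_ATTRIBUTE_VALUE", "/foo"), ("TEXT", "text"),
    ("TAG_CLOSE", "a"), ("TAG_CLOSE", "p"),
]
rewritten = list(rewrite_href(tokens, lambda url: "/bar" + url))
```

Because the filter only keeps two small pieces of state, it never needs the whole document in memory, and no parse tree is involved.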

Another transformation I would like is to output the value of the href attribute in parentheses after the a element. For example,

<a href="/foo">text</a>

gets rewritten into

<a href="/foo">text</a>(/foo)
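This transformation can also be done directly on the stream. The sketch below remembers the href value seen inside an `a` start tag and emits an extra TEXT token after the matching TAG_CLOSE. The `(kind, value)` tuples and the helper names `append_href` and `serialize` are hypothetical, and `serialize` does no escaping, on the assumption that the tokenizer left entities undecoded.

```python
def append_href(tokens):
    """After each </a>, emit a TEXT token holding that link's href in parentheses."""
    in_a_start_tag = False  # True while reading the attributes of an <a> tag
    current_attr = None
    pending_href = None     # href of the <a> element that is currently open
    for kind, value in tokens:
        if kind == "TAG_OPEN":
            in_a_start_tag = value == "a"
            current_attr = None
            if in_a_start_tag:
                pending_href = None
        elif kind == "TAG_ATTRIBUTE":
            current_attr = value
        elif kind == "TAG_ATTRIBUTE_VALUE":
            if in_a_start_tag and current_attr == "href":
                pending_href = value
            current_attr = None
        else:
            in_a_start_tag = False  # TEXT or TAG_CLOSE ends the attribute list
        yield kind, value
        if kind == "TAG_CLOSE" and value == "a" and pending_href is not None:
            yield "TEXT", f"({pending_href})"
            pending_href = None


def serialize(tokens):
    """Turn a token stream back into markup (no escaping, by assumption)."""
    out = []
    start_tag_unclosed = False  # True while a start tag still lacks its ">"
    for kind, value in tokens:
        if kind == "TAG_ATTRIBUTE":
            out.append(f" {value}")
            continue
        if kind == "TAG_ATTRIBUTE_VALUE":
            out.append(f'="{value}"')
            continue
        if start_tag_unclosed:
            out.append(">")
            start_tag_unclosed = False
        if kind == "TAG_OPEN":
            out.append(f"<{value}")
            start_tag_unclosed = True
        elif kind == "TAG_CLOSE":
            out.append(f"</{value}>")
        else:
            out.append(value)
    if start_tag_unclosed:
        out.append(">")
    return "".join(out)


link = [
    ("TAG_OPEN", "a"), ("TAG_ATTRIBUTE", "href"),
    ("TAG_ATTRIBUTE_VALUE", "/foo"), ("TEXT", "text"), ("TAG_CLOSE", "a"),
]
print(serialize(append_href(link)))  # <a href="/foo">text</a>(/foo)
```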

What is the general strategy for doing such transformations? There are many other transformations I would like to do, like stripping all tags and just leaving TEXT content, adding tags after some specific tags, etc.
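One general strategy that fits these cases is to write each transformation as a small filter that consumes a token stream and produces a new one, and then chain the filters. The Python sketch below assumes tokens are `(kind, value)` tuples; `text_only` and `insert_after_close` are hypothetical names chosen for this example.

```python
def text_only(tokens):
    """Strip all tags and attributes, keeping only TEXT tokens."""
    for kind, value in tokens:
        if kind == "TEXT":
            yield kind, value


def insert_after_close(tokens, tag, extra):
    """Emit the tokens in `extra` right after every closing tag named `tag`."""
    for kind, value in tokens:
        yield kind, value
        if kind == "TAG_CLOSE" and value == tag:
            yield from extra


paragraph = [
    ("TAG_OPEN", "p"), ("TEXT", "Hello"), ("TAG_OPEN", "span"),
    ("TAG_ATTRIBUTE", "class"), ("TAG_ATTRIBUTE_VALUE", "green"),
    ("TEXT", "world"), ("TAG_CLOSE", "span"), ("TAG_CLOSE", "p"),
]

# Filters compose: add <em>!</em> after every </span>, then drop all markup.
marker = [("TAG_OPEN", "em"), ("TEXT", "!"), ("TAG_CLOSE", "em")]
with_marker = list(insert_after_close(paragraph, "span", marker))
words = [value for _, value in text_only(with_marker)]
```

Filters of this kind work as long as the decision depends only on tokens already seen. Transformations that need to look at an element's descendants before deciding what to emit are usually easier on a tree.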

Do I need to build a parse tree? I have never built one and don't know how to create one from a stream of tokens. Or is there another way?
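If a tree does turn out to be needed, a common technique is to keep a stack of open elements: TAG_OPEN pushes a new node, TAG_CLOSE pops back to the matching one, and TEXT is appended to whatever is on top. The following is a minimal Python sketch under the assumption that tokens are `(kind, value)` tuples. The `Element` class and `build_tree` are hypothetical names, and void elements such as `br`, which never get a closing tag, are not handled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:
    name: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


def build_tree(tokens):
    """Build a tree of Element nodes from a flat token stream."""
    root = Element("#document")  # artificial root that holds top-level nodes
    stack = [root]               # currently open elements, innermost last
    start_tag = None             # element whose attributes are being read
    pending_attr = None
    for kind, value in tokens:
        if kind == "TAG_OPEN":
            node = Element(value)
            stack[-1].children.append(node)
            stack.append(node)
            start_tag, pending_attr = node, None
        elif kind == "TAG_ATTRIBUTE":
            if start_tag is not None:
                start_tag.attrs[value] = ""  # value-less attribute by default
                pending_attr = value
        elif kind == "TAG_ATTRIBUTE_VALUE":
            if start_tag is not None and pending_attr is not None:
                start_tag.attrs[pending_attr] = value
            pending_attr = None
        elif kind == "TEXT":
            stack[-1].children.append(value)
            start_tag = None
        elif kind == "TAG_CLOSE":
            # Pop back to the nearest open element with this name;
            # a stray closing tag with no match is ignored.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i].name == value:
                    del stack[i:]
                    break
            start_tag = None
    return root


tokens = [
    ("TAG_OPEN", "html"), ("TAG_OPEN", "body"), ("TAG_OPEN", "p"),
    ("TAG_OPEN", "a"), ("TAG_ATTRIBUTE", "href"),
    ("TAG_ATTRIBUTE_VALUE", "/foo"), ("TEXT", "text"),
    ("TAG_CLOSE", "a"), ("TAG_CLOSE", "p"),
    ("TAG_OPEN", "p"), ("TEXT", "Hello"), ("TAG_OPEN", "span"),
    ("TAG_ATTRIBUTE", "class"), ("TAG_ATTRIBUTE_VALUE", "green"),
    ("TEXT", "world"), ("TAG_CLOSE", "span"), ("TAG_CLOSE", "p"),
    ("TAG_CLOSE", "body"), ("TAG_CLOSE", "html"),
]
document = build_tree(tokens)
```

Real-world HTML has many more recovery rules for misnested and omitted tags, so this should be read as a starting point rather than a conforming parser.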

Any suggestions are welcome.

And one more thing: I would like to learn all this parsing myself, so I am not looking for a library!

Thanks in advance, Boda Cydo

If we can assume that the HTML is well-formed XML, then XSLT would be one way to go. But I assume that is out, since you seem to want to write your own parser (not sure why). If you really want to write a parser (I'd write the parse rules rather than a whole parser engine), take a look at ANTLR and Microsoft Oslo.

There are various ways of parsing and traversing an XML/HTML tree. Perhaps I can point you to:

http://razorsharpcode.blogspot.com/2009/10/combined-pre-order-and-post-order-non.html

If you want to do pre-order or post-order manipulation of DOM elements, you can use the algorithm described there.
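To illustrate what combined pre-order and post-order traversal means, here is a minimal non-recursive Python sketch. It is not taken from the linked post. The `Node` class and `walk` function are hypothetical, and the tree shape is an assumption. Each node produces an "enter" event before its children (pre-order) and a "leave" event after them (post-order).

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def walk(root: Node) -> Iterator[Tuple[str, Node]]:
    """Yield ("enter", node) before a node's children and ("leave", node) after."""
    yield "enter", root
    # Each stack entry pairs a node with an iterator over its remaining children.
    stack = [(root, iter(root.children))]
    while stack:
        node, remaining = stack[-1]
        child = next(remaining, None)
        if child is None:
            stack.pop()
            yield "leave", node  # post-order position
        else:
            yield "enter", child  # pre-order position
            stack.append((child, iter(child.children)))


tree = Node("html", [Node("body", [
    Node("p", [Node("a")]),
    Node("p", [Node("span")]),
])])

pre_order = [n.name for event, n in walk(tree) if event == "enter"]
post_order = [n.name for event, n in walk(tree) if event == "leave"]
# pre_order  == ["html", "body", "p", "a", "p", "span"]
# post_order == ["a", "p", "span", "p", "body", "html"]
```

The "enter" and "leave" events line up with TAG_OPEN and TAG_CLOSE tokens, so a walk like this is also one way to turn a modified tree back into a token stream.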
