hierarchical regex expression

Is it possible/practical to build a single regular expression that matches hierarchical data?

For example:

<h1>Action</h1>
  <h2>Title1</h2><div>data1</div>
  <h2>Title2</h2><div>data2</div>
<h1>Adventure</h1>
  <h2>Title3</h2><div>data3</div>

I would like to end up with these matches:

"Action", "Title1", "data1"
"Action", "Title2", "data2"
"Adventure", "Title3", "data3"

As I see it, this would require knowing that there is a hierarchical structure at play here. If I code the pattern to capture the H1, it only matches the first entry of that hierarchy. If I don't code for H1, then I can't capture it. I was wondering if there are any special tricks I can employ to solve this.
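To make the problem concrete, here is a minimal sketch of the behavior described above. It is written in Python purely for illustration (the project itself is .NET), and the pattern is just one plausible attempt, not the only way to write it. Because the pattern insists on an `<h1>` in front of every `<h2>`/`<div>` pair, the second title under "Action" is never matched.

```python
import re

SAMPLE = """<h1>Action</h1>
  <h2>Title1</h2><div>data1</div>
  <h2>Title2</h2><div>data2</div>
<h1>Adventure</h1>
  <h2>Title3</h2><div>data3</div>
"""

# Hypothetical single-pattern attempt: every match must start at an <h1>.
PATTERN = re.compile(r"<h1>(.*?)</h1>\s*<h2>(.*?)</h2><div>(.*?)</div>")

# Each <h1> is consumed together with its first <h2>/<div> pair,
# so any further pairs under the same <h1> have no <h1> left to match.
matches = PATTERN.findall(SAMPLE)
print(matches)
# [('Action', 'Title1', 'data1'), ('Adventure', 'Title3', 'data3')]
```

The ("Action", "Title2", "data2") row is missing, which is exactly the gap I am trying to close.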

This is a .NET project.


The solution is to not use regular expressions. They're not powerful enough for this sort of thing.

What you want is a parser - since it looks like you're trying to match HTML, there are plenty to choose from.
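As a rough sketch of what the parser route could look like, here is an event-based version using Python's standard-library `html.parser`. Python is used only for illustration; a .NET project would use an HTML or XML parser available on that platform instead. The class name `HeadingCollector` and the helper `extract_rows` are made up for this example, and the code assumes the flat `h1`/`h2`/`div` layout shown in the question.

```python
from html.parser import HTMLParser

SAMPLE = """<h1>Action</h1>
  <h2>Title1</h2><div>data1</div>
  <h2>Title2</h2><div>data2</div>
<h1>Adventure</h1>
  <h2>Title3</h2><div>data3</div>
"""


class HeadingCollector(HTMLParser):
    """Remembers the most recent <h1> and <h2> and emits a row per <div>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._tag = None   # tag whose text we are currently inside, if any
        self._h1 = None    # most recent <h1> text
        self._h2 = None    # most recent <h2> text

    def handle_starttag(self, tag, attrs):
        self._tag = tag

    def handle_endtag(self, tag):
        self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # whitespace between tags
        if self._tag == "h1":
            self._h1 = text
        elif self._tag == "h2":
            self._h2 = text
        elif self._tag == "div":
            self.rows.append((self._h1, self._h2, text))


def extract_rows(html):
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


for row in extract_rows(SAMPLE):
    print(row)
```

The hierarchy is handled by state rather than by the pattern: the parser reports tags in document order, and the collector simply carries the current `h1` forward until the next one appears.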


It's generally considered bad practice to attempt to parse HTML/XML with RegEx, precisely because it's hierarchical. You COULD use a recursive function to do so, but a better solution in this case is to use a real XML parser. I couldn't give you better advice than that without knowing the platform you're using.

EDIT: Regex is also very slow, which is another reason it's bad for processing HTML; however, I don't know that an XML/DOM processor is likely to be faster since it's likely to use a lot more memory.

If you JUST want data from a simple document like you've demonstrated, and/or if you want to build a solution yourself, it's not that tough to do. Just build a simple, recursive state-based stream processor that looks for tags and passes the contents to the next recursive level.

For example:

- In a recursive function, seek out a "<" character.
- Now find a ">" character.
- Preserve everything you find until the next "<" character.
- Find a ">" character.
- Pass whatever you found between those tags into the recursive function.

You'd have to work out error checking yourself, but the base case (when you return back up to the previous level) is just when there's nothing else to find.
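The steps above could be sketched roughly as follows. This is one possible reading of them, written in Python for illustration rather than in a .NET language, and the names `parse_nodes` and `to_rows` are invented for the example. It assumes bare tags with no attributes, comments, or self-closing elements, and it ignores any text that follows a child element.

```python
SAMPLE = """<h1>Action</h1>
  <h2>Title1</h2><div>data1</div>
  <h2>Title2</h2><div>data2</div>
<h1>Adventure</h1>
  <h2>Title3</h2><div>data3</div>
"""


def parse_nodes(text, pos=0, closing=None):
    """Return (nodes, next_pos); each node is (tag, leading_text, children)."""
    nodes = []
    while True:
        lt = text.find("<", pos)
        if lt == -1:
            # Base case: nothing else to find at this level.
            if closing is not None:
                raise ValueError(f"missing </{closing}>")
            return nodes, len(text)
        gt = text.find(">", lt)
        if gt == -1:
            raise ValueError("unterminated tag")
        name = text[lt + 1:gt].strip()
        if name.startswith("/"):
            # Closing tag: hand control back to the previous level.
            if name[1:].strip() != closing:
                raise ValueError(f"unexpected <{name}>")
            return nodes, gt + 1
        # Opening tag: keep the text up to the next "<", then recurse.
        next_lt = text.find("<", gt + 1)
        if next_lt == -1:
            raise ValueError(f"missing </{name}>")
        content = text[gt + 1:next_lt].strip()
        children, pos = parse_nodes(text, next_lt, closing=name)
        nodes.append((name, content, children))


def to_rows(nodes):
    """Flatten sibling h1/h2/div nodes into (h1, h2, div) rows."""
    rows, h1, h2 = [], None, None
    for name, content, _children in nodes:
        if name == "h1":
            h1 = content
        elif name == "h2":
            h2 = content
        elif name == "div":
            rows.append((h1, h2, content))
    return rows


tree, _ = parse_nodes(SAMPLE)
for row in to_rows(tree):
    print(row)
```

In the sample document the tags are siblings rather than nested, so the recursion only goes one level deep and the grouping under each `h1` happens in `to_rows`. The recursive part earns its keep once tags actually nest, for example `<a><b>x</b></a>`.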

Maybe this helps, maybe not. Good luck to you.


Regex does not work for this type of data. Nested markup is not a regular language.

You should use an XML parser for this.
