Why is it that regex cannot match an XML element?

2023-03-11 17:49

This article argues that regular expressions cannot match nested structures because regexes are finite automata.

The author then lists problems that, according to the article, cannot be solved using regexes:

1. matching an XML element
2. matching a C/VB/C# math expression
3. matching a valid regex

Items 2 and 3 can contain nested brackets, and that nesting is what regexes cannot handle. But why is it impossible to match an XML element? (The author didn't provide examples.)

You can match a limited subset of HTML tags, if you know in advance the tags to be matched.

But you can't (reliably or nicely) parse arbitrary HTML. It is not a regular language.
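A minimal Python sketch of that limit, assuming a hypothetical pattern for one known tag (`div`). It is an illustration, not a general solution: on flat input it finds each element, but once the same element is nested it pairs the outer opening tag with the first closing tag it meets.

```python
import re

# Hypothetical pattern for one known tag; chosen only for illustration.
DIV = re.compile(r"<div>(.*?)</div>", re.DOTALL)

def div_contents(markup):
    """Return the text captured between each <div> and the nearest </div>."""
    return DIV.findall(markup)

flat = "<p><div>one</div> and <div>two</div></p>"
nested = "<div>outer <div>inner</div> tail</div>"

flat_result = div_contents(flat)      # both flat elements are found
nested_result = div_contents(nested)  # the match stops at the first </div>
print(flat_result)
print(nested_result)
```

On the nested input the captured text is `outer <div>inner`, which is not the content of either element.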

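How would you match this markup with a regex? (Strictly, it is HTML-style rather than well-formed XML: XML forbids -- inside a comment and requires every attribute to have a value. The difficulty it illustrates is the same.)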

<!--<d>>--<<--><div class='foo' id="bar" inline></div>

It's like making a wooden car. Sure you can try to do it, but why?

But then comes the part of actually parsing the XML. How would you extract an unbounded number of attributes from an unbounded number of elements using a fixed, finite set of capture groups? The nature and structure of regex make that impossible.

There are theoretical answers, based on what kind of grammar XML is and what kind of grammar regular expressions can match. These answers are sometimes flawed by the fact that most regular expression libraries we use today can do things that the formal regular expressions defined in computer science can't do (like back-references).
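As a small illustration of the back-reference point, here is a sketch using Python's `re` module. The pattern and function name are assumptions made for this example. Requiring the closing tag name to repeat the opening one is something a formal regular expression cannot express for arbitrary names, yet `\1` handles it. It still does not handle nesting.

```python
import re

# \1 must repeat whatever group 1 captured: the opening tag's name.
# Content is restricted to text without '<', so nesting is out of scope.
SIMPLE_ELEMENT = re.compile(r"<(\w+)>[^<]*</\1>")

def is_simple_element(text):
    """True if text is exactly <name>text</name> with matching names."""
    return SIMPLE_ELEMENT.fullmatch(text) is not None

print(is_simple_element("<a>text</a>"))
print(is_simple_element("<a>text</b>"))
```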

And there are practical answers. The practical answer is: don't do it because it's the wrong tool for the job, your code will be hard to write and unmaintainable, it will be inefficient, it will have bugs, and no-one will know how to change it when the structure of the document changes. And because there are better tools for the job, called XML parsers.
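For comparison, a minimal sketch of the parser route using Python's standard `xml.etree.ElementTree`. The sample document and the function name are made up for this example. The parser deals with comments, mixed quoting, and entities, and returns every attribute of every element without any capture groups.

```python
import xml.etree.ElementTree as ET

def attributes_by_element(xml_text):
    """Return (tag, attributes) pairs for every element, in document order."""
    root = ET.fromstring(xml_text)
    return [(el.tag, dict(el.attrib)) for el in root.iter()]

# Made-up sample: a comment containing markup, mixed quote styles, an entity.
sample = """<root><!-- <ignored a="1"/> --><div class='foo' id="bar"><span title="a &lt; b"/></div></root>"""

pairs = attributes_by_element(sample)
for tag, attrs in pairs:
    print(tag, attrs)
```

The element inside the comment is not reported, and the `&lt;` entity in `title` comes back decoded as `<`.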

Regular expressions carry no state. To parse an XML file, you need state. A < might signal the opening of an XML element. If it's inside a comment or an attribute value, though, it means something else. With regexes you can only express things in terms of what comes before or after other things. To correctly parse < as opening an XML element, you'd need to express something along the lines of:

< but not after an unclosed <!--, and not after " if that " was not closed, but only if the " opened an attribute value, because " inside text content has no influence on the next <, and if not...

And that's only for a simple <, not even covering all the possibilities. There are a handful of XML special characters that all have the same kind of circular conditions. Constructing a regex that expresses all these conditions correctly for all cases is virtually impossible. It's trivial with a state machine, though.
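A rough sketch of such a state machine in Python, to make the contrast concrete. It is deliberately simplified and all names are invented for this example. It tracks only comments and quoted attribute values (no CDATA, processing instructions, or DOCTYPE), and the sample input follows the scenario above, with a literal < inside an attribute value (strict XML would require `&lt;` there).

```python
import re

def element_open_positions(markup):
    """Return the index of every '<' that starts an opening tag.

    Simplified: only comments and quoted attribute values are tracked.
    """
    positions = []
    state = "text"   # one of: text, comment, tag, value
    quote = ""       # quote character that opened the current value
    i = 0
    while i < len(markup):
        ch = markup[i]
        if state == "text":
            if markup.startswith("<!--", i):
                state = "comment"
                i += 4
                continue
            if ch == "<":
                # Closing tags enter the tag state but are not recorded.
                if not markup.startswith("</", i):
                    positions.append(i)
                state = "tag"
        elif state == "comment":
            if markup.startswith("-->", i):
                state = "text"
                i += 3
                continue
        elif state == "tag":
            if ch in "\"'":
                state, quote = "value", ch
            elif ch == ">":
                state = "text"
        elif state == "value":
            if ch == quote:
                state = "tag"
        i += 1
    return positions

sample = '<!-- <a> --><div title="<b>">text</div>'
real = element_open_positions(sample)
naive = [m.start() for m in re.finditer(r"<\w", sample)]
print(real, naive)
```

The context-free pattern `<\w` also reports the < inside the comment and the one inside the attribute value. The state machine reports only the < of `<div`.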
