Is there any way to parse invalid HTML?

I need to parse invalid HTML files that contain stray elements (such as BODY) on random lines throughout the file. I tried parsing the file as XML, but with no luck, since it is not well-formed XML either (many incorrect attributes on random elements throughout the file). HtmlAgilityPack failed to read the file as well: it only reads the file up to the first incorrect element and nothing after it.

Here is a small example of such a file:

<HTML>
<HEAD>
    <TITLE>My title</TITLE>
</HEAD>
<BODY leftmargin=9 topmargin=7 >
    <TABLE>
        <TR>
            <TD>Test</TD>
        </TR>
        <TR>
            <TD>Test</TD>
            <TD>Test<TD>
        </TR>
            <BODY> <-- This is the point where HtmlAgilityPack is stuck --!>
                <TR>
                    <TD>Test</TD>
                    <TD>Test</TD>
                </TR>
                <TR>
            </BODY>
        <TR>
        <TD><FONT>Test</FONT></TD>
        </TR>
    </TABLE>
</BODY>

I'm trying to parse info from that table.


Let Internet Explorer do the hard work for you: it will do its best to "repair" the broken tag structure into something it understands (a consistent element tree with correctly paired tags, rather than the literal source).

Open the HTML in a WebBrowser control (System.Windows.Forms.WebBrowser, or System.Windows.Controls.WebBrowser if you prefer the WPF libraries), then walk through the DOM via its Document property. The DOM will always be a consistent tree, no matter how broken the original source was.

No third party libraries needed.
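
The same "let a forgiving parser absorb the mess, then read only the cells" idea can be sketched without a browser. Note that this is not the WebBrowser approach itself: it is a minimal illustration in Python using the standard-library `html.parser`, which emits tag events without building or validating a tree, so a stray BODY cannot derail it. The names `TableCellCollector` and `extract_rows` are hypothetical, and the rule of skipping empty cells and empty rows is an assumption about what counts as useful data in the sample above.

```python
from html.parser import HTMLParser


class TableCellCollector(HTMLParser):
    """Collect the text of <td> cells, grouped by <tr>; ignore every other tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, None outside a row
        self._cell = None  # text pieces of the cell being read, None outside a cell

    def _close_cell(self):
        if self._cell is not None:
            text = "".join(self._cell).strip()
            if text:  # assumption: empty cells (e.g. an unclosed <TD>) are noise
                self._row.append(text)
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:  # assumption: rows without any cell are noise
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase
        if tag == "tr":
            self._close_row()   # a new <tr> implicitly ends an unclosed one
            self._row = []
        elif tag == "td" and self._row is not None:
            self._close_cell()  # a new <td> implicitly ends an unclosed one
            self._cell = []
        # any other tag (stray <body>, <font>, ...) is simply ignored

    def handle_endtag(self, tag):
        if tag == "td":
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def extract_rows(html):
    """Return the table rows found in `html` as lists of cell text."""
    parser = TableCellCollector()
    parser.feed(html)
    parser.close()       # flush any buffered text
    parser._close_row()  # keep a final row that was never closed
    return parser.rows


SAMPLE = """<HTML>
<HEAD><TITLE>My title</TITLE></HEAD>
<BODY leftmargin=9 topmargin=7 >
<TABLE>
<TR><TD>Test</TD></TR>
<TR><TD>Test</TD><TD>Test<TD></TR>
<BODY> <-- This is the point where HtmlAgilityPack is stuck --!>
<TR><TD>Test</TD><TD>Test</TD></TR>
<TR>
</BODY>
<TR>
<TD><FONT>Test</FONT></TD>
</TR>
</TABLE>
</BODY>
"""

print(extract_rows(SAMPLE))
```

Because nothing here depends on tags being nested correctly, the second BODY and the dangling TR are skipped rather than treated as errors. The trade-off is that you get no tree to query, only whatever the event handlers decide to keep.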


We parsed web pages with invalid HTML using Html Agility Pack. As far as I remember, it did a pretty good job.


You can use SgmlReader. Of course, if your HTML files are badly malformed, it won't help you.
