Removing unclosed opening <p>tags from xhtml document

2023-01-15 17:27 问答作者：

I have a big xhtml document with lots of tags. I have observed that a few unclosed opening paragraph tags are repeating unnecessarily and I want to remove them or replace them with blank space. i just want to code to identify unclosed paragraph tags and delete them.

Here's a small sample to show what I mean:

<p><strong>Company Registration No.1</strong> </p>
<p><strong>Company Registration No.2</strong></p>

<p>      <!-- extra tag -->
<p>      <!-- extra tag -->

<hr/>     

<p><strong> HALL WOOD (LEEDS) LIMITED</strong><br/></p>
<p><strong>REPOR开发者_开发问答T AND FINANCIAL STATEMENTS </strong></p>

Can some one please give me code for console application, just to remove these unclosed paragraph tags.

this should work:

public static class XHTMLCleanerUpperThingy
{
    private const string p = "<p>";
    private const string closingp = "</p>";

    public static string CleanUpXHTML(string xhtml)
    {
        StringBuilder builder = new StringBuilder(xhtml);
        for (int idx = 0; idx < xhtml.Length; idx++)
        {
            int current;
            if ((current = xhtml.IndexOf(p, idx)) != -1)
            {
                int idxofnext = xhtml.IndexOf(p, current + p.Length);
                int idxofclose = xhtml.IndexOf(closingp, current);

                // if there is a next <p> tag
                if (idxofnext > 0)
                {
                    // if the next closing tag is farther than the next <p> tag
                    if (idxofnext < idxofclose)
                    {
                        for (int j = 0; j < p.Length; j++)
                        {
                            builder[current + j] = ' ';
                        }
                    }
                }
                // if there is not a final closing tag
                else if (idxofclose < 0)
                {
                    for (int j = 0; j < p.Length; j++)
                    {
                        builder[current + j] = ' ';
                    }
                }
            }
        }

        return builder.ToString();
    }
}

I have tested it with your sample example and it works...although it is a bad formula for an algorithm, it should give you a starting basis!

You have to find out, what kind of DOM-tree is created. It may be intepreted as

<p><strong>Company Registration No.1</strong> </p>
<p><strong>Company Registration No.2</strong></p>

<p>      <!-- extra tag -->
  <p>      <!-- extra tag -->
    <hr/>     
    <p><strong> HALL WOOD (LEEDS) LIMITED</strong><br/></p>
    <p><strong>REPORT AND FINANCIAL STATEMENTS </strong></p>
  </p>
</p>

<p><strong>Company Registration No.1</strong> </p>
<p><strong>Company Registration No.2</strong></p>

<p></p>      <!-- extra tag -->
<p></p>      <!-- extra tag -->
<hr/>     
<p><strong> HALL WOOD (LEEDS) LIMITED</strong><br/></p>
<p><strong>REPORT AND FINANCIAL STATEMENTS </strong></p>

You could try to find nested p-tags and move the inner content to the outer p-tag and remove the inner p-tag that is left empty. Anyway, I believe you need to analyze the DOM-tree first.

继续阅读：regex xhtml xml

Removing unclosed opening <p>tags from xhtml document

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？