Removing unclosed opening <p> tags from an XHTML document
I have a big XHTML document with lots of tags. I have noticed that a few unclosed opening paragraph tags are repeated unnecessarily, and I want to remove them or replace them with blank space. I just want code that identifies the unclosed paragraph tags and deletes them.
Here's a small sample to show what I mean:
<p><strong>Company Registration No.1</strong> </p>
<p><strong>Company Registration No.2</strong></p>
<p> <!-- extra tag -->
<p> <!-- extra tag -->
<hr/>
<p><strong> HALL WOOD (LEEDS) LIMITED</strong><br/></p>
<p><strong>REPORT AND FINANCIAL STATEMENTS </strong></p>
Can someone please give me code for a console application that just removes these unclosed paragraph tags?
This should work:
using System;
using System.Text;

public static class XHTMLCleanerUpperThingy
{
    private const string p = "<p>";
    private const string closingp = "</p>";

    // Blanks out every <p> that is not closed before the next <p>
    // (or before the end of the input).
    public static string CleanUpXHTML(string xhtml)
    {
        StringBuilder builder = new StringBuilder(xhtml);
        int current = xhtml.IndexOf(p, StringComparison.Ordinal);
        // visit each <p> exactly once instead of re-searching from every character
        while (current != -1)
        {
            int idxofnext = xhtml.IndexOf(p, current + p.Length, StringComparison.Ordinal);
            int idxofclose = xhtml.IndexOf(closingp, current, StringComparison.Ordinal);
            // unclosed if no </p> follows at all,
            // or if the next <p> comes before the next </p>
            bool unclosed = idxofclose < 0 || (idxofnext >= 0 && idxofnext < idxofclose);
            if (unclosed)
            {
                // overwrite the tag with spaces so all other offsets stay valid
                for (int j = 0; j < p.Length; j++)
                {
                    builder[current + j] = ' ';
                }
            }
            current = idxofnext;
        }
        return builder.ToString();
    }
}
I have tested it with your sample and it works. It is not a great algorithm, but it should give you a starting point. Note that it only looks for the literal string <p>, so paragraph tags that carry attributes are not detected.
You have to find out what kind of DOM tree is created. It may be interpreted as
<p><strong>Company Registration No.1</strong> </p>
<p><strong>Company Registration No.2</strong></p>
<p> <!-- extra tag -->
<p> <!-- extra tag -->
<hr/>
<p><strong> HALL WOOD (LEEDS) LIMITED</strong><br/></p>
<p><strong>REPORT AND FINANCIAL STATEMENTS </strong></p>
</p>
</p>
or
<p><strong>Company Registration No.1</strong> </p>
<p><strong>Company Registration No.2</strong></p>
<p></p> <!-- extra tag -->
<p></p> <!-- extra tag -->
<hr/>
<p><strong> HALL WOOD (LEEDS) LIMITED</strong><br/></p>
<p><strong>REPORT AND FINANCIAL STATEMENTS </strong></p>
You could try to find nested p-tags and move the inner content to the outer p-tag and remove the inner p-tag that is left empty. Anyway, I believe you need to analyze the DOM-tree first.
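If you would rather not depend on which of the two trees a particular parser builds, one alternative is to scan the paragraph tags directly and drop every opening tag that is followed by another opening tag (or by the end of the document) before any </p>. The following is only a rough sketch of that idea, not something either answer above uses. The class name UnclosedParagraphScanner and the method RemoveUnclosed are made up for illustration, and it assumes that <p> never nests legitimately and that no paragraph tags appear inside comments or CDATA sections.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, named here for illustration only.
public static class UnclosedParagraphScanner
{
    // Matches an opening <p> (with or without attributes) or a closing </p>.
    // A self-closing <p/> is not matched, because it is already balanced.
    private static readonly Regex ParagraphTag =
        new Regex(@"<p(?:\s[^<>]*)?(?<!/)>|</p\s*>");

    public static string RemoveUnclosed(string xhtml)
    {
        var unclosed = new List<Match>();
        Match pendingOpen = null;

        foreach (Match tag in ParagraphTag.Matches(xhtml))
        {
            if (tag.Value.StartsWith("</", StringComparison.Ordinal))
            {
                // a closing tag balances whatever opening tag was pending
                pendingOpen = null;
            }
            else
            {
                // a second opening tag means the pending one was never closed
                if (pendingOpen != null) unclosed.Add(pendingOpen);
                pendingOpen = tag;
            }
        }
        // an opening tag still pending at the end of the input is unclosed too
        if (pendingOpen != null) unclosed.Add(pendingOpen);

        // copy the input, skipping the spans of the unclosed tags
        var result = new StringBuilder(xhtml.Length);
        int copiedUpTo = 0;
        foreach (Match tag in unclosed)
        {
            result.Append(xhtml, copiedUpTo, tag.Index - copiedUpTo);
            copiedUpTo = tag.Index + tag.Length;
        }
        result.Append(xhtml, copiedUpTo, xhtml.Length - copiedUpTo);
        return result.ToString();
    }
}
```

In a console application you could call it as File.WriteAllText(outputPath, UnclosedParagraphScanner.RemoveUnclosed(File.ReadAllText(inputPath))), where inputPath and outputPath are whatever paths you pass in. This deletes the tags instead of overwriting them with spaces; append a space for each removed tag if you need the blank-space behaviour.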