How to parse HTML to modify all words

2023-02-09 19:54 问答作者：

This seems to be a recurring question, but here goes.

I have HTML which is well-formatted (it comes from a controlled source, so this can be taken to be a given). I need to iterate through the contents of the body of the HTML, look for all the words in the document, perform some editing on those words, and save the results.

For example, I have file sample.html and I want to run it through my application and product output.html, which is exactly the same as the original, plus my edits.

I found the following using HTMLAgilityPack, but all the examples I've found look at the attributes of the specified tags - is there an easy modification that will look at the contents and perform my edits?

HtmlDocument HD = new HtmlDocument();
HD.Load (@"e:\test.htm");
var NoAltElements = HD.DocumentNode.SelectNodes("//img[not(@alt)]");
if (NoAltElements != null)
{
    foreach (HtmlNode HN in NoAltElements)
    {
       HN.Attributes.Append("alt", "no alt image");
    }
}

HD.Save(@"e:\test.htm");

The above looks for image tags with no ALT tags. I want to look for all tags in the <body> of the file and do something with the contents (whi开发者_StackOverflowch may involve creating new tags in the process).

A very simple sample of what I might do is take the following input:

<html>
    <head><title>Some Title</title></head>
    <body>
        <h1>This is my page</h1>
        <p>This is a paragraph of text.</p>
    </body>
</html>

and produce the output, which takes every word and alternates between making it uppercase and making it italics:

<html>
    <head><title>Some Title</title></head>
    <body>
        <h1>THIS <em>is</em> MY <em>page</em></h1>
        <p>THIS <em>is</em> A <em>paragraph</em> OF <em>text</em>.</p>
    </body>
</html>

Ideas, suggestions?

Personally, given this setup, I'd work with the InnerText property of HtmlNode to find the words (probably with Regex so I can exclude for punctuation and not simply rely on spaces) and then use the InnerHtml property to make the changes using iterative calls to Regex.Replace (because the Regex.Replace has a method that allows you to specify both start position and number of times to replace).

Processing code:

IEnumerable<HtmlNode> nodes = doc.DocumentNode.DescendantNodes().Where(n => n.InnerText == "something");
foreach (HtmlNode node in nodes)
{
    string[] words = getWords(node.InnerText);

    node.InnerHtml = processHtml(node.InnerHtml, words);
}

identify words (there's probably some slicker way to do this but here's an initial stab):

private string[] getWords(string text)
{
    Regex reg = new Regex("/w+");
    MatchCollection matches = reg.Matches(text);
    List<string> words = new List<string>();
    foreach (Match match in matches)
    {
        words.Add(match.Value);
    }
    return words.ToArray();
}

process the html:

private string processHtml(string html, string[] words)
{
    int startPosition = 0;
    foreach (string word in words)
    {
        startPosition = html.IndexOf(word, startPosition);
        Regex reg = new Regex(word);
        html = reg.Replace(html, alterWord(word), 1, startPosition);
    }

    return html;
}

I'll leave the details of alterWord() to you. :)

Try .SelectNodes("//body//*"). That'll get you all elements within any body element, at any depth.

How to parse HTML to modify all words

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？