Read first 3 paragraphs of a long string. [C#, HTML AgilityPack]

Posted 2023-01-08 07:45

I would like to read from a long HTML string and output only its first 3 paragraphs. How do I achieve this? I was using the code below to show the first (n) words, but I have since switched to paragraphs.

public string MySummary(string html, int max)
{
    string summaryHtml = string.Empty;

    // load our html document
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

    int wordCount = 0;

    foreach (var element in htmlDoc.DocumentNode.ChildNodes)
    {
        // inner text will strip out all html, and give us plain text
        string elementText = element.InnerText;

        // we split by space to get all the words in this element
        string[] elementWords = elementText.Split(new char[] { ' ' });

        // and if we haven't used too many words ...
        if (wordCount <= max)
        {
            // add the *outer* HTML (which will have proper
            // html formatting for this fragment) to the summary
            summaryHtml += element.OuterHtml;
            wordCount += elementWords.Length + 1;
        }
        else
        {
            break;
        }
    }

    return summaryHtml;
}

If by paragraphs you mean <p> tags, select all the <p> nodes in the document and take the inner text of the first 3?

Edit re comment:

RTFM?

http://htmlagilitypack.codeplex.com/wikipage?title=Examples&referringTitle=Home

Something like:

string.Join(" ", doc.DocumentNode.SelectNodes("//p").Take(3).Select(n => n.InnerText));

Why don't you just use a string tokenizer and read up to just before where the fourth <p> is located?
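
A minimal sketch of that idea, using plain string searches and no HTML parser. `ParagraphCutter` and `FirstParagraphs` are hypothetical names, and the code assumes paragraphs open with a literal `<p>` or `<p ...>` tag and are not nested; messier markup is probably better served by HtmlAgilityPack.

```csharp
using System;

public static class ParagraphCutter
{
    // Returns everything before the (count + 1)-th opening <p> tag,
    // or the whole string when it holds no more than `count` paragraphs.
    public static string FirstParagraphs(string html, int count)
    {
        int searchFrom = 0;
        for (int seen = 0; seen <= count; seen++)
        {
            int index = FindParagraphStart(html, searchFrom);
            if (index < 0) return html;                      // ran out of paragraphs
            if (seen == count) return html.Substring(0, index);
            searchFrom = index + 2;                          // skip past "<p"
        }
        return html;
    }

    // Finds the next "<p>" or "<p attr...>" at or after `from`; returns -1 if none.
    private static int FindParagraphStart(string html, int from)
    {
        while (from < html.Length)
        {
            int index = html.IndexOf("<p", from, StringComparison.OrdinalIgnoreCase);
            if (index < 0) return -1;
            int next = index + 2;
            // Accept "<p>" and "<p ...>", reject tags such as "<pre>" or "<param>".
            if (next < html.Length && (html[next] == '>' || char.IsWhiteSpace(html[next])))
                return index;
            from = next;
        }
        return -1;
    }
}
```

Calling `ParagraphCutter.FirstParagraphs(html, 3)` keeps any markup that sits between the first three paragraphs and cuts right before the fourth one starts.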

I've just had to do this myself and have come up with a very simplistic but forgiving way of doing this that works fine for our particular scenario:

    public string GetParagraphs(string html, int numberOfParagraphs)
    {
        const string paragraphSeparator = "</p>";
        var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join("", paragraphs.Take(numberOfParagraphs).Select(paragraph => paragraph + paragraphSeparator));
    }

I realise how naive this is regarding the structure of the document; it will also pick up any non-<p> tags that sit between the <p> elements. In my use case, however, that is actually what I want. Maybe that will work for you too?

This is a better answer. But if we want to take paragraphs 2 to 5, what would the code be?
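
One possible way to handle a range such as paragraphs 2 to 5 is to add a `Skip` before the `Take`. This is only a sketch built on the same naive `</p>` split as `GetParagraphs` above; `ParagraphRange`, `GetParagraphRange`, `first` and `last` are hypothetical names, and the range is assumed to be 1-based and inclusive.

```csharp
using System;
using System.Linq;

public static class ParagraphRange
{
    // Returns paragraphs first..last (1-based, inclusive) by splitting on "</p>".
    public static string GetParagraphRange(string html, int first, int last)
    {
        const string paragraphSeparator = "</p>";
        var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join("", paragraphs
            .Skip(first - 1)                 // drop the paragraphs before the range
            .Take(last - first + 1)          // keep as many as the range covers
            .Select(paragraph => paragraph + paragraphSeparator));
    }
}
```

With this, `ParagraphRange.GetParagraphRange(html, 2, 5)` would return the second through fifth paragraphs, and simply fewer if the document is shorter.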


You have to use HtmlAgilityPack.

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(HtmlContent);

string Html = string.Join(" ", doc.DocumentNode.SelectNodes("//p").Take(2).Select(n => n.OuterHtml).ToArray());

