Read first 3 paragraphs of a long string. [C#, HTML AgilityPack]

I would like to read a long string and output only its first 3 paragraphs. How do I achieve this? I originally wrote the code below to show the first (n) words, but I have since switched to paragraphs.

public string MySummary(string html, int max)
{
    string summaryHtml = string.Empty;

    // load our html document
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

    int wordCount = 0;

    foreach (var element in htmlDoc.DocumentNode.ChildNodes)
    {
        // inner text will strip out all html, and give us plain text
        string elementText = element.InnerText;

        // we split by space to get all the words in this element
        string[] elementWords = elementText.Split(new char[] { ' ' });

        // and if we haven't used too many words ...

        if (wordCount <= max)
        {
            // add the *outer* HTML (which will have proper 
            // html formatting for this fragment) to the summary
            summaryHtml += element.OuterHtml;
            wordCount += elementWords.Count() + 1;

        }
        else
        {
            break;
        }
    }

    return summaryHtml;
}


If by paragraphs you mean <p> tags, get all the child nodes of the document that are <p> elements and take the inner text of the first 3.

Edit re comment:

RTFM?

http://htmlagilitypack.codeplex.com/wikipage?title=Examples&referringTitle=Home

something like:

string.Join(" ", doc.DocumentNode.SelectNodes("//p").Take(3).Select(n => n.InnerText).ToArray());


Why don't you just use a string tokenizer and read up to just before where the fourth <p> is located?
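
A minimal sketch of that idea, assuming the cut point is the opening tag of the paragraph after the ones you want to keep (the fourth <p> when keeping 3). The class and method names are hypothetical, and it scans the raw string with IndexOf rather than parsing HTML, so it does not account for comments, scripts or nested markup:

```csharp
using System;

public static class ParagraphCutter
{
    // Hypothetical helper: returns everything before the (count + 1)-th opening <p> tag.
    public static string TakeBeforeNthParagraph(string html, int count)
    {
        int seen = 0;
        int pos = 0;
        while (pos < html.Length)
        {
            int idx = html.IndexOf("<p", pos, StringComparison.OrdinalIgnoreCase);
            if (idx < 0) break;

            int next = idx + 2;
            // Only count "<p>" or "<p ...>", not "<pre>", "<param>" and so on.
            bool isParagraph = next < html.Length
                && (html[next] == '>' || char.IsWhiteSpace(html[next]));
            if (isParagraph)
            {
                // We already have `count` paragraphs, so cut right before this one.
                if (seen == count) return html.Substring(0, idx);
                seen++;
            }
            pos = next;
        }
        // Fewer than count + 1 paragraphs: nothing to cut.
        return html;
    }
}
```

For example, TakeBeforeNthParagraph("<p>a</p><p>b</p><p>c</p><p>d</p>", 3) returns "<p>a</p><p>b</p><p>c</p>".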


I've just had to do this myself and have come up with a very simplistic but forgiving way of doing this that works fine for our particular scenario:

    public string GetParagraphs(string html, int numberOfParagraphs)
    {
        const string paragraphSeparator = "</p>";
        var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join("", paragraphs.Take(numberOfParagraphs).Select(paragraph => paragraph + paragraphSeparator));
    }

I realise how naive this is regarding the structure of the document: it will also pick up any non-<p> content that sits between paragraphs. In my use case that is actually what I want, so maybe it will work for you too?


This is a better answer. But if we want to take paragraphs 2 to 5, what would the code be?

public string GetParagraphs(string html, int numberOfParagraphs) {
    const string paragraphSeparator = "</p>";
    var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);
    return string.Join("", paragraphs.Take(numberOfParagraphs).Select(paragraph => paragraph + paragraphSeparator));
}
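
One possible way to take a range such as paragraphs 2 to 5 is to add Skip before Take. The sketch below keeps the same naive "</p>" split as the method above; the class name, method name and 1-based inclusive parameters are assumptions made for illustration:

```csharp
using System;
using System.Linq;

public static class ParagraphSlicer
{
    // Hypothetical helper: returns paragraphs first..last (1-based, inclusive).
    public static string GetParagraphRange(string html, int first, int last)
    {
        if (first < 1 || last < first)
            throw new ArgumentOutOfRangeException(nameof(first));

        const string paragraphSeparator = "</p>";
        var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);

        // Skip the paragraphs before `first`, then take the length of the range.
        return string.Join("", paragraphs
            .Skip(first - 1)
            .Take(last - first + 1)
            .Select(paragraph => paragraph + paragraphSeparator));
    }
}
```

Calling GetParagraphRange(html, 2, 5) then skips the first paragraph and returns the next four. If the string has fewer paragraphs than requested, Skip and Take simply return what is available. As with the original method, any text after the last "</p>" is treated as one more paragraph.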


You have to use HtmlAgilityPack.

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(HtmlContent);

string Html = string.Join(" ", doc.DocumentNode.SelectNodes("//p").Take(2).Select(n => n.OuterHtml).ToArray());