Read first 3 paragraphs of a long string. [C#, HTML AgilityPack]
I would like to read from a long string and output only the first 3 paragraphs of the string. How do I achieve this? I originally wrote the code below to show the first (n) words, but I have since changed to paragraphs.
public string MySummary(string html, int max)
{
    string summaryHtml = string.Empty;
    // load our html document
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);
    int wordCount = 0;
    foreach (var element in htmlDoc.DocumentNode.ChildNodes)
    {
        // inner text will strip out all html, and give us plain text
        string elementText = element.InnerText;
        // we split by space to get all the words in this element
        string[] elementWords = elementText.Split(new char[] { ' ' });
        // and if we haven't used too many words ...
        if (wordCount <= max)
        {
            // add the *outer* HTML (which will have proper
            // html formatting for this fragment) to the summary
            summaryHtml += element.OuterHtml;
            wordCount += elementWords.Count() + 1;
        }
        else
        {
            break;
        }
    }
    return summaryHtml;
}
If by paragraphs you mean <p> tags, get all the child nodes of the document which are <p> elements and pull the inner text of the first 3?
Edit re comment:
RTFM?
http://htmlagilitypack.codeplex.com/wikipage?title=Examples&referringTitle=Home
something like:
string.Join(" ", doc.DocumentNode.SelectNodes("//p").Take(3).Select(n => n.InnerText).ToArray());
Why don't you just use a string tokenizer and read up to just before where the fourth paragraph begins?
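One way to read that suggestion is a plain string scan that stops at the opening tag of paragraph n + 1. The sketch below is only an illustration of that idea: the class and method names are hypothetical, and it assumes paragraphs are written as <p> or <p ...> and are not nested inside other markup that matters to you.

```csharp
using System;

public static class ParagraphScanner
{
    // Hypothetical helper: returns everything before the (count + 1)-th "<p" opening tag.
    public static string TakeBeforeParagraph(string html, int count)
    {
        int index = -1;
        int seen = 0;
        while (seen <= count)
        {
            index = html.IndexOf("<p", index + 1, StringComparison.OrdinalIgnoreCase);
            if (index < 0)
            {
                // Fewer than count + 1 paragraphs: keep the whole string.
                return html;
            }
            // Only count "<p>" or "<p ...>", not tags such as "<pre>" or "<param>".
            char next = index + 2 < html.Length ? html[index + 2] : '\0';
            if (next == '>' || char.IsWhiteSpace(next))
            {
                seen++;
            }
        }
        return html.Substring(0, index);
    }
}
```

Calling ParagraphScanner.TakeBeforeParagraph(html, 3) would keep everything up to, but not including, the fourth opening <p> tag. Like any string-based approach, it knows nothing about the document structure.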
I've just had to do this myself and have come up with a very simplistic but forgiving way of doing this that works fine for our particular scenario:
public string GetParagraphs(string html, int numberOfParagraphs)
{
    const string paragraphSeparator = "</p>";
    var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);
    return string.Join("", paragraphs.Take(numberOfParagraphs).Select(paragraph => paragraph + paragraphSeparator));
}
I realise how naive this is regarding the structure of the document; it will also pick up any non-<p> tags between the <p> elements. However, in my use case that is actually what I want. Maybe that will work for you too?
This is a better answer. But if we want to take paragraphs 2 to 5, what would the code be?
public string GetParagraphs(string html, int numberOfParagraphs) {
    const string paragraphSeparator = "</p>";
    var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);
    return string.Join("", paragraphs.Take(numberOfParagraphs).Select(paragraph => paragraph + paragraphSeparator));
}
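For a range such as paragraphs 2 to 5, one possible variation is to add a Skip before the Take. This is a sketch only: GetParagraphRange and its first/last parameters are hypothetical names, the positions are assumed to be 1-based and inclusive, and it inherits the same naive "</p>" splitting as the method above.

```csharp
using System;
using System.Linq;

public static class ParagraphSlicer
{
    // Hypothetical helper: returns paragraphs first..last (1-based, inclusive).
    public static string GetParagraphRange(string html, int first, int last)
    {
        const string paragraphSeparator = "</p>";
        var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join("", paragraphs
            .Skip(first - 1)            // drop the paragraphs before the range
            .Take(last - first + 1)     // keep only the paragraphs inside the range
            .Select(paragraph => paragraph + paragraphSeparator));
    }
}
```

With this sketch, ParagraphSlicer.GetParagraphRange(html, 2, 5) would return the second through fifth paragraphs. If the string has fewer paragraphs than requested, Skip and Take simply return what is available.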
You have to use HtmlAgilityPack.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(HtmlContent);
string Html = string.Join(" ", doc.DocumentNode.SelectNodes("//p").Take(2).Select(n => n.OuterHtml).ToArray());
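One thing to watch with this approach: SelectNodes returns null rather than an empty collection when no node matches, so calling Take on the result throws for input without any <p> elements. A defensive variant might look like the sketch below; the class and method names are hypothetical, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class ParagraphExtractor
{
    // Hypothetical helper: outer HTML of the first `count` <p> elements, or "" if there are none.
    public static string FirstParagraphs(string html, int count)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//p");
        if (nodes == null)
        {
            return string.Empty;
        }

        return string.Join(" ", nodes.Take(count).Select(n => n.OuterHtml).ToArray());
    }
}
```

Note that "//p" matches <p> elements at any depth, including ones nested inside other elements, which may or may not be what you want for a summary.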