Get first 100 characters of HTML content without stripping tags
There are lots of questions on how to strip html tags, but not many on functions/methods to close them.
Here's the situation: I have a 500-character message summary (which includes HTML tags), but I only want the first 100 characters. The problem is that if I truncate the message, the cut could land in the middle of an HTML tag, which breaks the markup.
Assuming the html is something like this:
<div class="bd">"Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. <br/>
<br/>Some Dates: April 30 - May 2, 2010 <br/>
<p>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. <em>Duis aute irure dolor in reprehenderit</em> in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. <br/>
</p>
For more information about Lorem Ipsum doemdloe, visit: <br/>
<a href="http://www.somesite.com" title="Some Conference">Some text link</a><br/>
</div>
How would I take the first ~100 characters or so? (Ideally, that would be approximately the first 100 characters of the "CONTENT", i.e. the text in between the HTML tags.)
I'm assuming the best way to do this would be a recursive algorithm that keeps track of the html tags and appends any tags that would be truncated, but that may not be the best approach.
My first thoughts are using recursion to count nested tags, and when we reach 100 characters, look for the next "<" and then use recursion to write the closing html tags needed from there.
The reason for doing this is to make a short summary of existing articles without requiring the user to go back and provide summaries for all the articles. I want to keep the html formatting, if possible.
NOTE: Please ignore that the html isn't totally semantic. This is what I have to deal with from my WYSIWYG.
EDIT:
I added a potential solution ( that seems to work ) I figure others will run into this problem as well. I'm not sure it's the best... and it's probably not totally robust ( in fact, I know it isn't ), but I'd appreciate any feedback
Here's a solution for most cases. It doesn't handle malformed HTML tags or cases like "a<b>c", but it works for my purposes and maybe it will be helpful for someone else.
using System.Text;
using System.Text.RegularExpressions;

/// <summary>
/// Gets the first number of characters from the html string without stripping tags
/// </summary>
/// <param name="htmlString">The html string, not encoded, pure html</param>
/// <param name="length">The number of first characters to get</param>
/// <returns>The html string</returns>
public static string GetFirstCharacters(string htmlString, int length)
{
    if (htmlString == null)
        return string.Empty;
    if (htmlString.Length < length)
        return htmlString;
    // regex to split the string into parts: tags and texts
    var separateRegex = new Regex("([^>][^<>]*[^<])|[\\S]{1}");
    // regex to identify tags
    var tagsRegex = new Regex("^<[^>]+>$");
    // split the string into tags and texts
    var matches = separateRegex.Matches(htmlString);
    // loop over the matches:
    // if it's a tag, just append it to the result;
    // if it's a text, append a substring of it (respecting the number of characters left)
    var counter = 0;
    var sb = new StringBuilder();
    for (var i = 0; i < matches.Count; i++)
    {
        var m = matches[i].Value;
        // check if it's a tag
        if (tagsRegex.IsMatch(m))
        {
            sb.Append(m);
        }
        else
        {
            var lengthToCut = length - counter;
            var sub = lengthToCut >= m.Length
                ? m
                : m.Substring(0, lengthToCut);
            counter += sub.Length;
            sb.Append(sub);
        }
    }
    return sb.ToString();
}
What if you parse the HTML into a DOM structure, then traverse it breadth-first or depth-first, whichever you like, collecting the text of the nodes until you reach 100 characters?
My suggestion would be to find an HTML-friendly traverser (one that lets you traverse HTML like XML). Starting from the first tag, ignore the tags themselves and count only the text inside them toward your limit. Once the limit is reached, close out each open tag (I can't think of any tags whose closing form isn't just /whatever).
This should work reasonably well and be fairly close to what you are looking for.
It's totally off the top of the ol' noggin, so I'm assuming there will be some tricky parts, like attribute values that display (such as link tag values).
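The traversal idea can be sketched with LINQ to XML. This is only a minimal sketch, assuming the markup is well-formed XML (real-world HTML with unclosed tags or entities like &nbsp; would first need a tolerant parser). The names HtmlSummary and TruncateText are hypothetical, chosen for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class HtmlSummary
{
    // Hypothetical helper: keeps the first maxChars characters of text content
    // and drops every node that starts after the budget is spent.
    // Assumption: xhtml is well-formed XML with a single root element.
    public static string TruncateText(string xhtml, int maxChars)
    {
        // PreserveWhitespace keeps the spacing between inline elements intact.
        XElement root = XElement.Parse(xhtml, LoadOptions.PreserveWhitespace);
        int remaining = maxChars;

        // DescendantNodes() yields nodes in document order (depth-first).
        // ToList() takes a snapshot so nodes can be removed while looping.
        foreach (XNode node in root.DescendantNodes().ToList())
        {
            if (node is XText text)
            {
                if (remaining == 0)
                {
                    text.Remove();
                }
                else if (text.Value.Length > remaining)
                {
                    // This text node crosses the limit: cut it here.
                    text.Value = text.Value.Substring(0, remaining);
                    remaining = 0;
                }
                else
                {
                    remaining -= text.Value.Length;
                }
            }
            else if (remaining == 0)
            {
                // Elements, comments etc. that begin after the limit are dropped.
                node.Remove();
            }
        }

        // Elements that were open at the cut point are closed by the serializer.
        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

For example, HtmlSummary.TruncateText("<div>Hello <em>world</em> and more</div>", 8) returns "<div>Hello <em>wo</em></div>". Because the tree is serialized again at the end, no tag can be left half-written or unclosed.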
In the past I've done this with regex. Grab the content, strip out the tags via regex, then trim it down to your desired length.
Granted, that removes all HTML, which is what I had wanted. If you're looking to keep the HTML, I'd consider not closing open tags but rather removing the open tags.
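A minimal sketch of the strip-then-trim approach, assuming a plain-text summary is acceptable. The names PlainSummary and StripAndTrim are hypothetical, and the tag pattern is deliberately naive: it will misbehave on a literal ">" inside an attribute value, and it does not decode entities.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlainSummary
{
    // Hypothetical helper: removes anything that looks like a tag,
    // collapses whitespace, then keeps at most maxChars characters.
    public static string StripAndTrim(string html, int maxChars)
    {
        // Naive tag pattern: "<", then one or more non-">" characters, then ">".
        string text = Regex.Replace(html, "<[^>]+>", string.Empty);

        // Collapse runs of whitespace left behind by removed markup.
        text = Regex.Replace(text, @"\s+", " ").Trim();

        return text.Length <= maxChars ? text : text.Substring(0, maxChars);
    }
}
```

For example, PlainSummary.StripAndTrim("<p>Hello <em>big</em> world</p>", 9) returns "Hello big".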
I decided to roll my own solution... just for the challenge of doing it.
If anyone can see any logic errors or inefficiencies let me know.
I don't know if it's the best approach... but it seems to work. There are probably cases where it doesn't work... and it likely will fail if the html isn't correct.
using System.Collections.Generic;
using System.Text;

/// <summary>
/// Get the first n characters of some html text
/// </summary>
private string truncateTo(string s, int howMany, string ellipsis) {
    // return the entire string if it's shorter than n characters
    if (s.Length < howMany)
        return s;
    Stack<string> elements = new Stack<string>(); // names of the currently open elements
    StringBuilder sb = new StringBuilder();
    int trueCount = 0; // characters counted outside of tags
    for (int i = 0; i < s.Length; i++) {
        if (s[i] == '<') {
            StringBuilder elem = new StringBuilder();
            bool selfclosing = false;
            if (s[i + 1] == '/') {
                elements.Pop(); // Take the previous element off the stack
                while (s[i] != '>') {
                    i++;
                }
            }
            else { // not a closing tag so get the element name
                while (i < s.Length && s[i] != '>') {
                    if ((s[i] >= 'a' && s[i] <= 'z') || (s[i] >= 'A' && s[i] <= 'Z')) {
                        elem.Append(s[i]);
                    }
                    else if (s[i] == '/' || s[i] == ' ') {
                        // self closing tag or end of tag name. Find the end of tag
                        do {
                            if (s[i] == '/' && s[i + 1] == '>') {
                                // at the end of self-closing tag. Don't store
                                selfclosing = true;
                            }
                            i++;
                        } while (i < s.Length && s[i] != '>');
                    }
                    i++;
                } // end while( != '>' )
                if (!selfclosing)
                    elements.Push(elem.ToString());
            }
        }
        else {
            trueCount++;
            if (trueCount > howMany) {
                // s[i] is the first character over the limit; keep everything before it
                sb.Append(s.Substring(0, i));
                sb.Append(ellipsis);
                // close any elements that are still open
                while (elements.Count > 0) {
                    sb.AppendFormat("</{0}>", elements.Pop());
                }
                // stop here so the truncated text is not appended again
                return sb.ToString();
            }
        }
    }
    // the text content never exceeded the limit, so nothing was cut
    return s;
}
I've used an XmlReader and XmlWriter to do this: https://gist.github.com/2413598
As mentioned by others here, you should probably use SgmlReader or HtmlAgilityPack to sanitize incoming strings.
I see your problem. In the do while loop there is an error:
} while (i < s.Length && s[i] != '>');
should be replaced with
} while (i < s.Length && s[i + 1] != '>');