Most efficient way to add missing alt tags for images in a large html document
In order to comply with accessibility standards, I need to ensure that all images in some dynamically-generated html (which I don't control) have an empty alt attribute if none is specified.
Example input:
<html>
<body>
<img src="foo.gif" />
<p>Some other content</p>
<img src="bar.gif" alt="" />
<img src="blah.gif" alt="Blah!" />
</body>
</html>
Desired output:
<html>
<body>
<img src="foo.gif" alt="" />
<p>Some other content</p>
<img src="bar.gif" alt="" />
<img src="blah.gif" alt="Blah!" />
</body>
</html>
The html could be quite large and the DOM heavily-nested, so using something like the Html Agility Pack is out.
Can anyone suggest an efficient way to accomplish this?
Update:
It is a safe assumption that the html I'm dealing with is well-formed, so a potential solution need not account for that at all.
Your problem seems very specific: you need to alter some output, but for performance reasons you don't want to parse the whole thing with something general-purpose like HtmlAgilityPack. The best solution would seem to be to do it the hard way.
I would just brute force it. It would be hard to do it more efficiently than something like this (completely untested and almost guaranteed not to work exactly as-is, but logic should be fine, if missing a "+1" or "-1" somewhere):
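// requires: using System; using System.Text;
static string AddAltTag(string html) {
    var sb = new StringBuilder(html.Length + 64);
    int pos = 0;
    int lastPos = 0;
    while ((pos = html.IndexOf("<img", pos, StringComparison.Ordinal)) >= 0) {
        // images can't have children, and there should not be any angle brackets
        // anywhere in the attributes, so this should work fine
        int end = html.IndexOf('>', pos);
        if (end < 0) {
            // unclosed image -- just quit
            break;
        }
        // back up if the tag is self-closing (XML style)
        int insertAt = html[end - 1] == '/' ? end - 1 : end;
        // can't just look for "alt": it could be in the image url or class name
        if (html.IndexOf(" alt=\"", pos, insertAt - pos, StringComparison.Ordinal) < 0) {
            // skip back over whitespace so the attribute lands right after the last one
            int trimmed = insertAt;
            while (char.IsWhiteSpace(html[trimmed - 1])) {
                trimmed--;
            }
            // output everything from the last position up to the insertion point
            sb.Append(html, lastPos, trimmed - lastPos);
            sb.Append(" alt=\"\"");
            if (insertAt != end) {
                sb.Append(' '); // keep the space before "/>"
            }
            lastPos = insertAt;
        }
        // continue searching after this tag, otherwise the loop never advances
        pos = end + 1;
    }
    // output whatever follows the last modified tag
    sb.Append(html, lastPos, html.Length - lastPos);
    return sb.ToString();
}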
You may need to do things like convert to lowercase before testing, or parse or test for variants such as alt = " (that is, with spaces), depending on the consistency you can expect from your HTML.
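One way to cover those variants (mixed case, spaces around the equals sign) without a full parser is a pair of regular expressions. This is only a sketch under the same assumptions as above: the HTML is well-formed and no attribute value contains a ">" character. The names AltFixer and AddMissingAlt are made up for illustration, and the alt test can be fooled by another attribute whose value happens to contain something like " alt =".

```csharp
using System;
using System.Text.RegularExpressions;

static class AltFixer
{
    // Matches a single <img ...> tag; assumes no '>' inside attribute values.
    private static readonly Regex ImgTag =
        new Regex(@"<img\b[^>]*>", RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // An alt attribute: preceded by whitespace, optional spaces before '='.
    private static readonly Regex AltAttr =
        new Regex(@"\salt\s*=", RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string AddMissingAlt(string html)
    {
        return ImgTag.Replace(html, m =>
        {
            string tag = m.Value;
            if (AltAttr.IsMatch(tag))
            {
                return tag; // already has an alt attribute, leave untouched
            }
            bool selfClosing = tag.EndsWith("/>", StringComparison.Ordinal);
            // Strip the closing "/>" or ">" plus any trailing whitespace.
            string head = tag.Substring(0, tag.Length - (selfClosing ? 2 : 1)).TrimEnd();
            return head + " alt=\"\"" + (selfClosing ? " />" : ">");
        });
    }
}
```

This will almost certainly be slower than the hand-rolled IndexOf loop, so it only makes sense if the markup is too inconsistent for exact string matching.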
By the way, there is no way this would be faster, but if you want to use something a little more general for some reason, you can also give a shot to CsQuery. This is my own C# implementation of jQuery which would do something like this very easily, e.g.
obj.Select("img").Not("[alt]").Attr("alt",String.Empty);
Since you say that HTML agility pack performs badly on deeply-nested HTML, this may work better for you, because the HTML parser I use is not recursive and should perform linearly regardless of nesting. But it would be far slower than just coding to your exact need since it does, of course, parse the entire document into an object model. Whether that is fast enough for your situation, who knows.
I just tested this on an 8 MB HTML file with about 250,000 lines. It did take a few seconds for the document to load, but the select method was very fast. I'm not sure how big your file is or what you are expecting. I even edited the HTML file to include some missing tags, such as </body> and some random </div>. It was still able to parse correctly.
HtmlDocument doc = new HtmlDocument();
doc.Load(@"c:\test.html");
// note: SelectNodes returns null (not an empty collection) when nothing matches
HtmlNodeCollection col = doc.DocumentNode.SelectNodes("//img[not(@alt)]");
I had a total of 54,322 nodes. The select took milliseconds.
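The snippet above only selects the images. To actually add the attribute, the selected nodes can be looped over and the document serialized again. The following is a rough sketch that assumes the HtmlAgilityPack package is referenced; HapAltFixer and AddMissingAlt are illustrative names, not part of the library.

```csharp
using HtmlAgilityPack;

static class HapAltFixer
{
    public static string AddMissingAlt(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when no node matches the XPath expression.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//img[not(@alt)]");
        if (nodes != null)
        {
            foreach (HtmlNode img in nodes)
            {
                // Adds the attribute since it is not present on these nodes.
                img.SetAttributeValue("alt", string.Empty);
            }
        }

        // Serialize the (possibly modified) document back to a string.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Keep in mind that the parser may normalize parts of the markup when it writes the document back out, so the result is not guaranteed to be byte-for-byte identical to the input outside the img tags.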
If the above will not work, and you can reliably predict the output, it is possible for you to stream the file in and break it into manageable chunks.
Pseudo-code:
- stream file in
- parse in HtmlAgilityPack
- loop until end of stream
I imagine you could incorporate Parallel.ForEach() in there as well, although I can't find documentation on whether this is safe with HtmlAgilityPack.
Well, if I review your content for Section 508 compliance, I will fail your web site or content unless the blank alt text is used only for decorative images (those not needed to comprehend the content).
Blank alt text is only for decoration. Inserting it might fool some automated reporting tools, but you are certainly not meeting Section 508 compliance.
From a project management standpoint, you are better off leaving it failing so the end-users creating the content become responsible and the automated tool accurately reports it as non-compliant.
Assuming you can already get hold of the HTML markup wherever you need it, here is a quick way to find out whether any images are missing the alt attribute (for an SEO check, for example) without too much struggle.
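// requires: using System.Linq; using HtmlAgilityPack;
private static bool HasImagesWithoutAltTags(string htmlContent)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(htmlContent);
    // Descendants() never returns null, whereas SelectNodes() returns null when
    // nothing matches, which would throw if every image already had an alt attribute
    return doc.DocumentNode.Descendants("img").Any(img => img.Attributes["alt"] == null);
}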