Most efficient way to add missing alt tags for images in a large html document

2023-04-07 17:23

In order to comply with accessibility standards, I need to ensure that all images in some dynamically-generated html (which I don't control) have an empty alt tag if none is specified.

Example input:

<html>
    <body>
          <img src="foo.gif" />
          <p>Some other content</p>
          <img src="bar.gif" alt="" />
          <img src="blah.gif" alt="Blah!" />
    </body>
</html>

Desired output:

<html>
    <body>
          <img src="foo.gif" alt="" />
          <p>Some other content</p>
          <img src="bar.gif" alt="" />
          <img src="blah.gif" alt="Blah!" />
    </body>
</html>

The html could be quite large and the DOM heavily-nested, so using something like the Html Agility Pack is out.

Can anyone suggest an efficient way to accomplish this?

Update:

It is a safe assumption that the html I'm dealing with is well-formed, so a potential solution need not account for that at all.

Your problem seems very specific: you need to alter some output, but for performance reasons you don't want to parse the whole thing with something general-purpose like HtmlAgilityPack. The best solution would seem to be to do it the hard way.

I would just brute force it. It would be hard to do it more efficiently than something like this (a sketch of the logic rather than tested production code, so check it against your own markup):

// requires: using System; using System.Text;
string addAltTag(string html) {
    StringBuilder sb = new StringBuilder(html.Length + 64);
    int pos = 0;
    int lastPos = 0;
    while (true) {
        pos = html.IndexOf("<img", pos, StringComparison.Ordinal);
        if (pos < 0) {
            break; // no more images
        }
        // images can't have children, and there should not be any angle brackets
        // anywhere in the attributes, so this should work fine
        int nextPos = html.IndexOf('>', pos);
        if (nextPos < 0) {
            break; // unclosed image -- just quit
        }
        // back up if XML-style self-closing ("/>")
        if (html[nextPos - 1] == '/') {
            nextPos--;
        }
        // output everything from the last position up to, but not including,
        // the closing "/" or ">"
        sb.Append(html, lastPos, nextPos - lastPos);
        // can't just look for "alt": it could be in the image url or class name
        if (html.IndexOf(" alt=\"", pos, nextPos - pos, StringComparison.Ordinal) < 0) {
            // avoid a doubled space when the tag already ends in whitespace
            sb.Append(char.IsWhiteSpace(html[nextPos - 1]) ? "alt=\"\" " : " alt=\"\"");
        }
        lastPos = nextPos;
        pos = nextPos;
    }
    // copy whatever follows the last image tag
    sb.Append(html, lastPos, html.Length - lastPos);
    return sb.ToString();
}

You may need to do things like convert to lowercase before testing, or parse or test for variants such as alt = " (that is, with spaces around the equals sign), depending on the consistency you can expect from your HTML.
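One way to cover those variants is to test each tag with a regular expression instead of a fixed substring. The following is only a sketch: the class name AltDetector and the method HasAltAttribute are made up for illustration, and it assumes you pass in the text of a single img tag without its closing bracket. It blanks out quoted attribute values first, so that an "alt" inside a URL or class name is not mistaken for the attribute.

```csharp
using System.Text.RegularExpressions;

public static class AltDetector
{
    // Quoted attribute values; blanked out so their contents cannot match.
    private static readonly Regex QuotedValue =
        new Regex("\"[^\"]*\"|'[^']*'", RegexOptions.Compiled);

    // "alt" as a whole attribute name: preceded by whitespace and followed
    // by whitespace, '=', '/', or the end of the tag text. Case-insensitive.
    private static readonly Regex AltName =
        new Regex(@"(?<=\s)alt(?=\s|=|/|$)",
                  RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // tag: the text of one <img ...> tag, without the closing '>'.
    public static bool HasAltAttribute(string tag)
    {
        string withoutValues = QuotedValue.Replace(tag, "\"\"");
        return AltName.IsMatch(withoutValues);
    }
}
```

This is slower than a plain IndexOf, so it probably only makes sense if your HTML really is inconsistent about case, spacing, or quoting.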

By the way, there is no way this would be faster, but if you want to use something a little more general for some reason, you can also give a shot to CsQuery. This is my own C# implementation of jQuery which would do something like this very easily, e.g.

obj.Select("img").Not("[alt]").Attr("alt",String.Empty);

Since you say that HTML agility pack performs badly on deeply-nested HTML, this may work better for you, because the HTML parser I use is not recursive and should perform linearly regardless of nesting. But it would be far slower than just coding to your exact need since it does, of course, parse the entire document into an object model. Whether that is fast enough for your situation, who knows.

I just tested this on an 8 MB HTML file with about 250,000 lines. It did take a few seconds for the document to load, but the select method was very fast. Not sure how big your file is or what you are expecting. I even edited the HTML file to remove some closing tags, such as </body> and some random </div>. It was still able to parse correctly.

HtmlDocument doc = new HtmlDocument();
doc.Load(@"c:\test.html");
// SelectNodes returns null, not an empty collection, when nothing matches
HtmlNodeCollection col = doc.DocumentNode.SelectNodes("//img[not(@alt)]");

I had a total of 54,322 nodes. The select took milliseconds.

If the above will not work, and you can reliably predict the output, it is possible for you to stream the file in and break it into manageable chunks.

Pseudo-code:

stream file in
parse in HtmlAgilityPack
loop until end of stream

I imagine you could incorporate Parallel.ForEach() in there as well, although I can't find documentation on whether this is safe with HtmlAgilityPack.
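If the goal of streaming is just to avoid holding the whole document in memory, the chunking can also be done without HtmlAgilityPack at all, by scanning one tag at a time. The sketch below swaps in a hand-written character scanner; the names StreamingAltFixer, Process and FixTag are hypothetical. It assumes well-formed HTML with no '>' inside attribute values, and it uses a simple " alt=" test that will miss variants such as spaces around the equals sign.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamingAltFixer
{
    // Copies reader to writer, adding alt="" to <img> tags that lack one.
    // Only the tag currently being read is buffered in memory.
    public static void Process(TextReader reader, TextWriter writer)
    {
        var tag = new StringBuilder();
        bool inTag = false;
        int c;
        while ((c = reader.Read()) >= 0)
        {
            char ch = (char)c;
            if (ch == '<')
            {
                // A stray '<' inside a buffered run: flush it unchanged and restart.
                if (inTag) writer.Write(tag.ToString());
                inTag = true;
                tag.Clear();
                tag.Append(ch);
            }
            else if (!inTag)
            {
                writer.Write(ch);
            }
            else if (ch == '>')
            {
                writer.Write(FixTag(tag.ToString()));
                writer.Write('>');
                inTag = false;
            }
            else
            {
                tag.Append(ch);
            }
        }
        // Unclosed tag at end of input: emit it unchanged.
        if (inTag) writer.Write(tag.ToString());
    }

    // tag: tag text starting with '<' and without the closing '>'.
    private static string FixTag(string tag)
    {
        bool isImg = tag.StartsWith("<img", StringComparison.OrdinalIgnoreCase)
            && (tag.Length == 4 || char.IsWhiteSpace(tag[4]) || tag[4] == '/');
        if (!isImg || tag.IndexOf(" alt=", StringComparison.OrdinalIgnoreCase) >= 0)
            return tag;

        bool selfClosing = tag[tag.Length - 1] == '/';
        string body = selfClosing ? tag.Substring(0, tag.Length - 1) : tag;
        // Avoid a doubled space when the tag already ends in whitespace.
        string spacer = char.IsWhiteSpace(body[body.Length - 1]) ? "" : " ";
        return body + spacer + "alt=\"\"" + (selfClosing ? " /" : "");
    }
}
```

For files, Process can be called with a StreamReader and a StreamWriter; for a quick check, a StringReader and StringWriter work just as well.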

Well, if I review your content for Section 508 compliance, I will fail your web site or content, unless the blank alt text is used only for decorative images (those not needed for comprehension of the content).

Blank alt text is only for decoration. Inserting it might fool some automated reporting tools, but you certainly are not meeting Section 508 compliance.

From a project management standpoint, you are better off leaving it failing so the end-users creating the content become responsible and the automated tool accurately reports it as non-compliant.

Assuming you can already get hold of the HTML markup wherever you need it, here is a quick trick to find out, for example for an SEO report, whether any images are missing the alt attribute without too much struggle.

  private static bool HasImagesWithoutAltTags(string htmlContent)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(htmlContent);
            // SelectNodes returns null, not an empty collection, when nothing matches
            return doc.DocumentNode.SelectNodes("//img[not(@alt)]") != null;
        }
