Detect percentage of difference between HTML pages

Suppose I have two HTML sources. I want to compare them and, if they differ by more than a given percentage, do something with the new HTML.

For example, if the 2 HTML pages differ 5% or more, I want to e-mail somebody. How can I do this in Java? Is there a library for this?


Our Smart Differencer tool might be helpful here.

This tool compares the structure of "code" (various languages, HTML being one) and produces diff-like output, but it focuses on code differences rather than raw text differences, using language-specific (but somewhat limited) knowledge about what is really different. So, if you swapped the placement of two attributes in a tag, it would say there was no difference.

The diff output tells you which code blocks have been deleted, inserted, moved or copied, complete with substitutions detectable according to language structure. (For HTML, any change in normally displayed text is considered a replacement; it doesn't diff within such text strings.) You'd have to scan the tool output to collect your "overall change" statistics, so this wouldn't be conceptually different from doing the same thing with cygwin diff, but the numbers would likely be more precise. YMMV.


The cheap and nasty way to do this is to run everything through an HTML tidier, remove insignificant whitespace, then insert line-breaks before every '<' character. You can run the resulting text through a standard line-based diff utility to give you a rough difference metric which is "good enough", in my experience.
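
A minimal Java sketch of that approach, under some assumptions: the input is taken to be already tidied (no HTML tidier is invoked here), collapsing runs of whitespace is only a crude stand-in for "removing insignificant whitespace", and a plain longest-common-subsequence count replaces an external diff utility. The class and method names (`HtmlDiffPercent`, `toLines`, `lcsLength`, `percentDifferent`) are made up for illustration, and the percentage is just one possible definition: changed lines divided by the total lines in both pages.

```java
import java.util.ArrayList;
import java.util.List;

public class HtmlDiffPercent {

    // Collapse whitespace and put a line break before every '<',
    // so each tag (plus the text that follows it) becomes one line.
    public static List<String> toLines(String html) {
        String collapsed = html.replaceAll("\\s+", " ").trim();
        String[] parts = collapsed.replace("<", "\n<").split("\n");
        List<String> lines = new ArrayList<>();
        for (String part : parts) {
            String line = part.trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Length of the longest common subsequence of two line lists,
    // i.e. the number of lines a line-based diff would report as unchanged.
    public static int lcsLength(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.size()][b.size()];
    }

    // Assumed metric: (deleted lines + inserted lines) / (lines in old + lines in new).
    public static double percentDifferent(String oldHtml, String newHtml) {
        List<String> a = toLines(oldHtml);
        List<String> b = toLines(newHtml);
        int total = a.size() + b.size();
        if (total == 0) {
            return 0.0;
        }
        int common = lcsLength(a, b);
        return 100.0 * (total - 2 * common) / total;
    }

    public static void main(String[] args) {
        String oldHtml = "<html><body><p>Hello</p><p>World</p></body></html>";
        String newHtml = "<html><body><p>Hello</p><p>There</p></body></html>";
        double pct = percentDifferent(oldHtml, newHtml);
        System.out.printf("%.1f%% different%n", pct);
        if (pct >= 5.0) {
            // Placeholder: this is where the e-mail notification would go.
            System.out.println("Threshold exceeded, notify somebody");
        }
    }
}
```

The LCS table takes memory proportional to the product of the two line counts, so for very large pages a real diff tool or library is probably the better choice; the sketch is only meant to show where the rough percentage comes from.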
