Worst-case behaviour of Python's HtmlDiff.make_table()

2023-03-27 06:09

I'm using Python 2.7's difflib.HtmlDiff.make_table() method to generate diffs between expected and actual files for an internal test case runner. The diffs end up in an HTML test report.

This has worked fine so far -- until I added a test case with bigger files (~400 KiB) that have lots of differences and often contain no line breaks. Almost all of my test cases are executed in less than 2s, with a few more complex ones going as high as 4s. This new one is just as fast when passing, but takes 13 minutes (!) to fail. All of that time is spent generating the report. I hope you can see how that's a problem.

An attempt to demonstrate this (probably not the best way, I know):

s = """import os, difflib
a = [os.urandom(length)]
b = [os.urandom(length)]
difflib.HtmlDiff().make_table(a, b)"""

import timeit
print 'length    100:', timeit.timeit(s, setup='length = 100', number=1)
print 'length   1000:', timeit.timeit(s, setup='length = 1000', number=1)
print 'length  10000:', timeit.timeit(s, setup='length = 10000', number=1)
print 'length 100000:', timeit.timeit(s, setup='length = 100000', number=1)
print 'length 400000:', timeit.timeit(s, setup='length = 400000', number=1)

And the results:

length    100: 0.022672659081
length   1000: 0.0125987213238
length  10000: 0.479898318086
length 100000: 54.9947423284
length 400000: 1451.59828412

difflib.ndiff() (which is used by make_table() internally, as far as I understand it) does not seem to have this problem:

s = """import os, difflib
a = [os.urandom(length)]
b = [os.urandom(length)]
difflib.ndiff(a, b)"""

import timeit
print 'length    100:', timeit.timeit(s, setup='length = 100', number=100)
print 'length   1000:', timeit.timeit(s, setup='length = 1000', number=100)
print 'length  10000:', timeit.timeit(s, setup='length = 10000', number=100)
print 'length 100000:', timeit.timeit(s, setup='length = 100000', number=100)
print 'length 400000:', timeit.timeit(s, setup='length = 400000', number=100)

Gives me this:

length    100: 0.0233492320197
length   1000: 0.00770079984919
length  10000: 0.0672924110913
length 100000: 0.480133018906
length 400000: 1.866792587

That looks very reasonable: the time is roughly proportional to the input size, so four times the size takes about four times as long.
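One caveat about the ndiff() timing: difflib.ndiff() is a generator function, so calling it without iterating over the result performs essentially no comparison, and the numbers may mostly reflect the cost of os.urandom(). Below is a minimal sketch that times both the lazy call and the fully consumed result. The helper names random_line and time_ndiff are made up for illustration, and the random bytes are decoded as latin-1 (an assumption, so that the same code runs on Python 2 and 3).

```python
import difflib
import os
import timeit


def random_line(length):
    # Hypothetical helper: decode random bytes as latin-1 so the input is
    # text on both Python 2 and Python 3.
    return os.urandom(length).decode('latin-1')


def time_ndiff(length, consume):
    # Hypothetical helper: time one ndiff() call on two random one-line inputs.
    a = [random_line(length)]
    b = [random_line(length)]
    start = timeit.default_timer()
    result = difflib.ndiff(a, b)   # generator: nothing has been compared yet
    if consume:
        result = list(result)      # iterating is what runs the comparison
    return timeit.default_timer() - start


if __name__ == '__main__':
    for length in (100, 1000, 10000):
        print('length %6d: lazy %.6f s, consumed %.6f s' % (
            length, time_ndiff(length, False), time_ndiff(length, True)))
```

If the consumed variant scales like make_table() does, the slowdown is probably in the comparison itself rather than in the HTML generation.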

Not sure where to go from here. I would guess that the HTML generator does a lot of backtracking when there are differences (although you would think that ndiff() had already handled that). Can I tell it to abort earlier, give up and mark the whole section as 'different'?
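I'm not aware of a make_table() parameter that caps the amount of work, so one option is to guard the call yourself. Below is a minimal sketch of that idea: if any line exceeds a threshold, skip the detailed diff and emit a short placeholder instead. The name make_table_guarded, the 2000-character threshold and the placeholder markup are all assumptions for illustration, not part of difflib.

```python
import difflib

# Assumed threshold; tune it to whatever your report can afford.
MAX_LINE_LENGTH = 2000


def make_table_guarded(expected_lines, actual_lines,
                       max_line_length=MAX_LINE_LENGTH):
    # Hypothetical wrapper: fall back to a placeholder when any single line
    # is long enough to make the intraline comparison expensive.
    lengths = [len(line)
               for line in list(expected_lines) + list(actual_lines)]
    longest = max(lengths) if lengths else 0
    if longest > max_line_length:
        return ('<p>Files differ; the longest line has %d characters, '
                'detailed diff skipped.</p>' % longest)
    return difflib.HtmlDiff().make_table(expected_lines, actual_lines)


if __name__ == '__main__':
    print(make_table_guarded(['x' * 5000], ['y' * 5000]))
```

The guard only looks at line length. A file with a very large number of short, mutually different lines may still be slow, so a cap on the total line count could be added in the same way.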

I understand that there are a lot of different algorithms for generating diffs. In this case, I don't need it to do a very deep analysis and try to resynchronize everywhere. I just need it to tell me roughly from which position onward the files differ and then terminate in a reasonable timeframe.
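If only the approximate position of the first difference is needed, a single linear scan is enough and avoids difflib entirely. A minimal sketch follows; first_difference is a hypothetical helper, not part of any library.

```python
def first_difference(expected, actual):
    """Return the index of the first differing character, or None if equal."""
    limit = min(len(expected), len(actual))
    for i in range(limit):
        if expected[i] != actual[i]:
            return i
    if len(expected) != len(actual):
        # One input is a strict prefix of the other.
        return limit
    return None


if __name__ == '__main__':
    offset = first_difference('x' * 1000 + 'a', 'x' * 1000 + 'b')
    print('first difference at offset %r' % offset)
```

This runs in time proportional to the length of the shorter input and never tries to resynchronize after the first mismatch, which matches the "roughly from which position" requirement but says nothing about later differences.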

Alternatively, are there other HTML-generating Python diff libraries which do not have this worst-case problem?

CPython issues related to this:

http://bugs.python.org/issue6931: dreadful performance in difflib: ndiff and HtmlDiff
http://bugs.python.org/issue11740: difflib html diff takes extremely long

Tags: diff difflib python time-complexity
