Generating content diffs using SequenceMatcher (Python)
I want to generate a diff between two revisions of text (more specifically, Markdown-formatted articles) in Python.
I want to format this diff in a manner similar to what GitHub does.
I've looked at difflib and have found that it does what I want. However, the Differ class is too high-level; I would have to parse the diff lines to generate HTML with inline diffs. The Differ class uses the SequenceMatcher class to generate its diffs, but SequenceMatcher looks very low-level in comparison. I haven't even figured out how to do a line-by-line diff (I'll admit I haven't spent a lot of time experimenting).
Does anyone know of any resources for using the SequenceMatcher class (besides the difflib documentation)?
SequenceMatcher is actually not that low-level. The most interesting method for you is get_grouped_opcodes. It returns a generator that yields lists of change descriptions.
I'll explain it with an example from a random commit on GitHub. Let's say you run SequenceMatcher(None, a, b).get_grouped_opcodes(), where a and b are lists of the lines of the old and new versions of the file "tabs_events.js". The generator will yield two groups, which correspond to the chunks separated by those "..." lines in GitHub. Each group is basically a cluster of nearby changes, stored as a list of tuples. For the first group, it returns two changes that look like this, leaving out the 'equal' entries for the unchanged context lines (the first item is the change type, the next two numbers are the line range in the old file, and the last two are the line range in the new file):
('replace', 24, 29, 24, 29)
('insert', 33, 33, 33, 35)
The first one tells you to replace lines 24-28 (counting from 0; the end index is exclusive, as in a Python slice) of the old file with lines 24-28 of the new file. The second one tells you to insert lines 33-34 of the new file at line 33 of the old file. I think it's clear what 'delete' would do, and 'equal' entries are the lines that are not highlighted in GitHub.
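To make the tuples more concrete, here is a minimal sketch of a line-by-line diff built on get_grouped_opcodes. The function name describe_changes and the sample texts are made up for illustration; only SequenceMatcher and get_grouped_opcodes come from difflib.

```python
from difflib import SequenceMatcher

def describe_changes(old_text, new_text, context=3):
    """Return one list of prefixed lines per group of nearby changes."""
    a = old_text.splitlines()  # comparing lists of lines gives a line-by-line diff
    b = new_text.splitlines()
    matcher = SequenceMatcher(None, a, b)
    groups = []
    for group in matcher.get_grouped_opcodes(context):
        lines = []
        for tag, i1, i2, j1, j2 in group:
            if tag == 'equal':
                # unchanged context lines
                lines.extend('  ' + line for line in a[i1:i2])
            if tag in ('replace', 'delete'):
                # lines a[i1:i2] disappear from the old version
                lines.extend('- ' + line for line in a[i1:i2])
            if tag in ('replace', 'insert'):
                # lines b[j1:j2] appear in the new version
                lines.extend('+ ' + line for line in b[j1:j2])
        groups.append(lines)
    return groups

# Hypothetical sample input, not taken from the GitHub commit above.
old = "one\ntwo\nthree\nfour\n"
new = "one\n2\nthree\nfour\nfive\n"
for group in describe_changes(old, new):
    print("\n".join(group))
```

For HTML output you would presumably emit a tag with a suitable CSS class instead of the '- ' and '+ ' prefixes; for inline highlighting within a 'replace' block, one option is to run a second SequenceMatcher on the individual strings.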
If you don't mind reading source code, take a look at the implementation of difflib.unified_diff(). It's quite simple, and it generates a plain-text equivalent of what you want.
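For comparison, this is roughly how unified_diff is called and what it produces. The file labels old.md and new.md and the sample lines are placeholders chosen for this sketch.

```python
from difflib import unified_diff

# unified_diff expects sequences of lines; keepends=True preserves the newlines
old = "one\ntwo\nthree\n".splitlines(keepends=True)
new = "one\n2\nthree\n".splitlines(keepends=True)

# fromfile/tofile are just labels for the header lines (hypothetical names here)
diff_lines = list(unified_diff(old, new, fromfile='old.md', tofile='new.md'))
print(''.join(diff_lines))
```

Each "@@" hunk in the output corresponds to one group from get_grouped_opcodes, so the same loop structure can drive an HTML renderer.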