How to track additions, deletions, and changes to paragraphs across different versions of a document?
We have a module in our web application where we enter a document. It's a normal document in which you can enter different paragraphs one after another.
e.g.
Document Name
paragraph 1.
paragraph 2.
paragraph 3.
A document can have multiple versions, such as 1.0, 1.2, 2.0, and so on.
The way it works is that you take version 1.0 of a document, add, delete, or change some paragraphs, and save the result as a new version.
For this I have:
1) a Document table with (document_Id (PK), document_name, version)
2) a Paragraph table with (paragraph_Id (PK), paragraph_data)
3) a Document Paragraph reference table with (document_Id, paragraph_Id), which together form a composite PK
Each version of a document gets a new row in the Document table, so a new document_Id (PK) is created.
So the tables look as follows:
document_Id   document_name   version
1             Document 1      1.0
2             Document 1      1.2
3             Document 1      1.5
paragraph_Id   paragraph_data
10             Para 1
20             Para 2
30             Para 3
40             Para 4
50             Para 5
60             Para 6
Document Paragraph Reference table
document_Id   paragraph_Id
1             10
1             20
1             30
So document 1, with the name "Document 1" and version 1.0, has three paragraphs.
Now we create a new version of this document, with the same name, Document 1, and the version incremented to 1.2.
In this new version we remove the first two paragraphs of the old version and add two new paragraphs.
So, effectively, the new document has three paragraphs (one from the older version and two newly added).
Please note that when a new version of the document is created, the ids of the carried-over paragraphs change as well: paragraphs 10 and 20 are removed from the old document, and paragraph 30 becomes paragraph 40 in the new version.
The new id is created so that the old document can still be accessed with its reference to paragraph 30, and so that the content of an old paragraph can be changed while creating the new version of the document.
So now I need to compare the two versions of the document.
How do I compare them, i.e. how do I know which paragraphs were changed in the newer version, which ones were newly added, and which ones were removed from the older version? New ids are created every time, so there is no way to map paragraph ids from version to version.
Also note that there can be many versions of the same document, and I will need to compare any two of them, say 1.0 with 10.5.
Any help will be appreciated.
Thanks
If you leave the paragraph id untouched for paragraphs that did not change, you can easily show the differences between versions at paragraph level.
Say Document1 v1 has paragraphs 10, 20, 30, and v1.2 has paragraphs 30, 40, 50. Then you can say "between v1 and v1.2, paragraphs 10 and 20 were deleted, and 40 and 50 were added". This is the easy part.
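The easy part boils down to two set differences. Here is a minimal sketch in Python; the function name `diff_paragraph_ids` is made up for illustration, and the id lists stand in for whatever you would read from the Document Paragraph reference table for each version.

```python
def diff_paragraph_ids(old_ids, new_ids):
    """Compare the paragraph ids of two document versions.

    Returns (removed, added, kept), each as a sorted list of ids.
    Only meaningful if unchanged paragraphs keep their id across versions.
    """
    old_set, new_set = set(old_ids), set(new_ids)
    removed = sorted(old_set - new_set)  # only in the old version
    added = sorted(new_set - old_set)    # only in the new version
    kept = sorted(old_set & new_set)     # present in both versions
    return removed, added, kept


# The v1 / v1.2 example from above.
removed, added, kept = diff_paragraph_ids([10, 20, 30], [30, 40, 50])
print("removed:", removed)
print("added:", added)
print("kept:", kept)
```

The same comparison can be done in SQL against the reference table, but the logic is the same either way.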
Now the tricky part: if the content of a paragraph changes between versions of the document, you must create a new paragraph (with a new id) for the new content and link it to the old one (i.e. "paragraph 60 is a change of paragraph 30"), so you can say "in v1.2, paragraph 30 changed to paragraph 60". To get the differences between the two texts, you need a text-diff algorithm.
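A rough sketch of the linking idea, under some assumptions: I am supposing an extra `parent_id` column on the Paragraph table (it is not in your current schema), the `paragraphs` dict is just an in-memory stand-in for that table with made-up text, and Python's standard `difflib` is used as one possible text-diff algorithm.

```python
import difflib

# Stand-in for the Paragraph table. parent_id is an ASSUMED extra column that
# points at the paragraph this one was edited from (None for brand-new text).
paragraphs = {
    30: {"text": "The quick brown fox", "parent_id": None},
    40: {"text": "A newly added paragraph", "parent_id": None},
    50: {"text": "Another new paragraph", "parent_id": None},
    60: {"text": "The quick red fox jumps", "parent_id": 30},
}


def changed_pairs(old_ids, new_ids, paragraphs):
    """Return (old_id, new_id) pairs where new_id is an edit of old_id."""
    old_set = set(old_ids)
    pairs = []
    for pid in new_ids:
        parent = paragraphs[pid]["parent_id"]
        if pid not in old_set and parent in old_set:
            pairs.append((parent, pid))
    return pairs


def word_changes(old_text, new_text):
    """Word-level diff: list of (operation, old_words, new_words)."""
    a, b = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=a, b=b)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


for old_id, new_id in changed_pairs([10, 20, 30], [60, 40, 50], paragraphs):
    changes = word_changes(paragraphs[old_id]["text"], paragraphs[new_id]["text"])
    print(f"paragraph {old_id} changed to {new_id}: {changes}")
```

This only follows a direct parent link, so it covers two adjacent versions. Comparing versions that are further apart means walking the chain of links.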
This looks very much like a version control system. Your 'paragraphs' are 'files', and 'documents' are 'commits'.
The good news is that you don't have to fully reinvent the wheel. The bad news is that the thing is effectively a tree, and RDBMSes are not very good at handling trees.
Every initial version of a paragraph is the root of a version tree (the same goes for documents). You need a way to check whether this paragraph is an ancestor of that paragraph, or vice versa, or whether they are unrelated. You can either directly traverse a chain of child-parent links (Oracle is good at it), or use prefixes and LIKE queries, or use ranges and BETWEEN queries, depending on how you choose to represent the tree. Assuming that you don't track millions of changes, either technique should be efficient. (See: the book, a refresher)
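To make the first two representations concrete, here is a small Python sketch of the ancestor check. Everything in it is illustrative: `parent_of` stands in for a hypothetical parent-link column and `path_of` for a hypothetical stored path column, and the ids and links are invented for the example.

```python
# Representation 1: child -> parent links (None marks a root, i.e. an
# initial version of a paragraph). Hypothetical data.
parent_of = {10: None, 30: None, 40: 30, 60: 40}


def is_ancestor(ancestor_id, descendant_id, parent_of):
    """Walk the parent links upward from descendant_id looking for ancestor_id."""
    current = parent_of.get(descendant_id)
    while current is not None:
        if current == ancestor_id:
            return True
        current = parent_of.get(current)
    return False


# Representation 2: each row stores the full path from its root. The trailing
# "/" keeps "3/" from being mistaken for a prefix of "30/". In SQL the same
# check would be a LIKE 'prefix%' query.
path_of = {10: "10/", 30: "30/", 40: "30/40/", 60: "30/40/60/"}


def is_ancestor_by_path(ancestor_id, descendant_id, path_of):
    """An ancestor's path is a proper prefix of its descendant's path."""
    return (
        ancestor_id != descendant_id
        and path_of[descendant_id].startswith(path_of[ancestor_id])
    )


print(is_ancestor(30, 60, parent_of))        # 30 -> 40 -> 60
print(is_ancestor_by_path(10, 60, path_of))  # different trees
```

The parent-link walk costs one lookup per intermediate version, while the path check is a single string comparison at the price of storing the path.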
I don't quite understand how you track the documents' versions. If you need to work out which document version precedes which from the paragraph versions alone, this gets a bit tricky in corner cases (e.g. a new version of the document reverts one paragraph to a previous version and simultaneously updates another paragraph).
If you are allowed to simply record the fact that 'this document is based on that document', it is far easier: you need just one tree for document versions, not many trees for paragraph versions.
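A sketch of what 'this document is based on that document' could look like, assuming a hypothetical `based_on` column on the Document table. The chain below (2 based on 1, 3 based on 2) is my assumption about your example data, not something stated in the question.

```python
# Hypothetical based_on column: document_Id -> the document_Id it was created
# from (None for the first version).
based_on = {1: None, 2: 1, 3: 2}


def lineage(document_id, based_on):
    """Return [document_id, its base, that base's base, ...] up to the root."""
    chain = []
    current = document_id
    while current is not None:
        chain.append(current)
        current = based_on[current]
    return chain


def common_base(doc_a, doc_b, based_on):
    """Nearest version that both documents derive from (or are), else None."""
    ancestors_of_a = set(lineage(doc_a, based_on))
    for candidate in lineage(doc_b, based_on):
        if candidate in ancestors_of_a:
            return candidate
    return None


print(lineage(3, based_on))
print(common_base(2, 3, based_on))
```

With this in place, comparing any two versions starts from their common base, which also copes with versions that branch off the same document.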