What are the underlying git merge processes within the staging area?
Git does the merge magic, and then lets the user resolve real conflicts, which is as it should be. I'm looking for a low-level description of the how and why of the basic git merge and how it uses the staging area.
I've just read the Git Parable, and the comment on it here which says:
Even taking into account the fact that it is a "parable" and not a recount of the history of Git (which you can find in some detail on the Git Wiki, by the way), one point stays: it is IMVHO bad practice to explain the staging area in terms of splitting changes into more than one commit and/or committing with a dirty tree, i.e. with some changes uncommitted. The staging area's main strength (besides being an explicit version of other SCMs' implicit to-be-added area) is dealing with a CONFLICTED MERGE, and that is how it should be explained, I think.
The git merge man page identifies the stage 1/2/3 elements of the merge, but obviously doesn't go into details of whys and wherefores.
Can folk advise on any articles on how and why git manages to achieve the results others don't (over and above the Linus V Bram detailed in Wincent's blog), i.e. the alleged Trivial part?
Most web articles assume that merges 'just happen', and I haven't found anything that explains the issues (e.g. the need for small commits, the value of a common commit, etc).
This should help with at least some of your questions as it's the most common merge that git does:
git merge-file
git merge-file is designed to be a minimal clone of RCS merge; that is, it implements all of RCS merge's functionality which is needed by git(1).
Almost every VCS employs the basic concept of a three-way merge. This compares the two branches against their common ancestor, so if a line of code differs between the two branches, you know which branch changed it. If both changed it in different ways, you have a merge conflict that must be resolved by a human.
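To make the three-way idea concrete, here is a minimal Python sketch. It assumes all three versions have the same number of lines so they can be compared by position; real tools (including git merge-file) first align the texts with a diff, which this sketch skips. The function name and the conflict-marker text are made up for illustration.

```python
def three_way_merge(base, ours, theirs):
    """Merge two edited copies of `base`, comparing lines by position.

    Simplifying assumption: all three versions have the same number of
    lines. A real merge tool aligns the texts with a diff first.
    Returns (merged_lines, indexes_of_conflicting_lines).
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles equal-length inputs")

    merged, conflicts = [], []
    for i, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:
            # Both sides agree: either untouched or the same change.
            merged.append(o)
        elif o == b:
            # Only "theirs" differs from the ancestor, so take their change.
            merged.append(t)
        elif t == b:
            # Only "ours" differs from the ancestor, so take our change.
            merged.append(o)
        else:
            # Both sides changed the line in different ways: a human decides.
            conflicts.append(i)
            merged.append(f"<<<<<<< ours\n{o}\n=======\n{t}\n>>>>>>> theirs")
    return merged, conflicts


base = ["a", "b", "c"]
ours = ["a", "B", "c"]      # we changed line 2
theirs = ["a", "b", "C"]    # they changed line 3
merged, conflicts = three_way_merge(base, ours, theirs)
```

Without the ancestor you could only see that lines 2 and 3 differ; with it, each difference can be attributed to one side and both changes are taken without asking anyone.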
There are a few cases where it is difficult to determine a suitable common ancestor. A lot of research went into different algorithms for this, many involving the tracking of additional metadata with the commits.
Linus' essential innovation was the tracking of trees rather than files. That's sort of a subtle distinction. To illustrate with the example from Wincent's blog, consider a file foo in branch A. You branch off to make branch B. In branch A, foo is renamed to bar. In branch B, it is deleted. You then attempt to merge.
If you are tracking files, it goes like this:
Before branching, version 1 of file foo is created.
After the next commit, branch A points to version 2 of foo, which is a deleted file, and version 1 of new file bar.
After the next commit, branch B points to version 2.1 of foo, which is a deleted file.
When you merge, version 2 and 2.1 of foo are compared and found to be identical. No merge conflict there. Branch B doesn't even have a file called bar, so no conflict there either. You end up with the merge algorithm silently accepting branch A's rename, even though there was a real conflict between foo being deleted and it being renamed.
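The file-tracking walkthrough can be modelled roughly like this in Python. This is a toy model of a per-file VCS, not any particular tool: each snapshot is a dict from path to content, a missing path means the file is absent, and every path is merged in isolation. All names are invented for the example.

```python
def merge_by_path(base, a, b):
    """Three-way merge that looks at each path on its own.

    Snapshots are dicts mapping path -> content; a missing key means the
    file does not exist in that snapshot. Returns (merged, conflicts).
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(a) | set(b)):
        o, x, y = base.get(path), a.get(path), b.get(path)
        if x == y:
            result = x          # both sides agree (possibly both deleted)
        elif x == o:
            result = y          # only side b touched this path
        elif y == o:
            result = x          # only side a touched this path
        else:
            conflicts.append(path)
            continue
        if result is not None:
            merged[path] = result
    return merged, conflicts


base = {"foo": "v1"}
branch_a = {"bar": "v1"}   # foo renamed to bar: foo gone, bar appears
branch_b = {}              # foo deleted
merged, conflicts = merge_by_path(base, branch_a, branch_b)
```

Here foo is gone on both sides, so it is dropped without complaint, and bar was only touched by branch A, so it is kept. The result contains bar and reports no conflicts, which is exactly the silent acceptance of the rename described above.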
If you are tracking trees, it goes like this:
Before branching, a blob with hash dcb8bd7a97ab39f4c156a1a96d4b10720a39fb81 is created. A tree is created with an entry containing a label foo pointing to the hash.
After the next commit, branch A points to a tree with an entry containing a label bar pointing to the same hash.
After the next commit, branch B points to an empty tree.
When you merge, the trees are compared, with B showing a deletion and A showing a rename of the blob dcb8bd7a97ab39f4c156a1a96d4b10720a39fb81. The human is asked which one they prefer.
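A rough Python sketch of the tree-tracking case follows. Trees are modelled as plain dicts from label to blob hash, and the comparison is keyed by blob hash to mirror the description above; it is not Git's actual merge code and only recognises renames where the content is byte-for-byte unchanged. The blob_hash helper uses the header-plus-content SHA-1 scheme Git uses for blobs; the other function names are made up.

```python
import hashlib


def blob_hash(content: bytes) -> str:
    """Hash content the way Git hashes a blob: SHA-1 of 'blob <size>\\0' + bytes."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def describe_changes(base: dict, side: dict) -> dict:
    """For each blob in `base`, say what `side` did with it.

    Only exact-content matches count here; an edited file shows up as
    'deleted' in this sketch.
    """
    path_by_hash = {h: p for p, h in side.items()}
    changes = {}
    for path, h in base.items():
        if side.get(path) == h:
            changes[h] = ("kept", path)
        elif h in path_by_hash:
            changes[h] = ("renamed", path_by_hash[h])
        else:
            changes[h] = ("deleted", None)
    return changes


def tree_conflicts(base: dict, a: dict, b: dict) -> list:
    """List blobs that both sides changed, but in different ways."""
    ca, cb = describe_changes(base, a), describe_changes(base, b)
    return [
        (h, ca[h], cb[h])
        for h in ca
        if ca[h] != cb[h] and "kept" not in (ca[h][0], cb[h][0])
    ]


h = blob_hash(b"hello\n")
base = {"foo": h}
branch_a = {"bar": h}   # same blob, new label: a rename
branch_b = {}           # empty tree: a deletion
conflicts = tree_conflicts(base, branch_a, branch_b)
```

Because both snapshots are described in terms of the same blob, the rename on A and the deletion on B end up side by side in conflicts, and the question can be put to the user instead of being resolved silently.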
You can mitigate the effect somewhat with a file-tracking VCS by adding metadata for renames, but git's way uses its standard data structure. Also, the metadata way has difficulties with complex merges where there are many possible choices for the common ancestor. You could put a billion possible paths between the common ancestor and the two branch heads, and git will still see a blob with the same hash and be able to detect a rename and a delete. It's also difficult to preserve metadata when accepting changes in a patch via email, for instance.
It gets a little trickier with a renamed file that changes at the same time, but by tracking the trees, git has all the information it needs. It sees blob dcb8bd7a97ab39f4c156a1a96d4b10720a39fb81 gone from both branches, but it also sees a new tree entry pointing to a new blob, and can compare the two. If a significant portion of the file matches, it's considered a rename. Obviously this breaks down if you make a ton of changes in a renamed file, but at some point no merge algorithm is going to be able to help you.
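The "significant portion of the file matches" idea can be sketched in Python as below. This is only an illustration: the similarity score comes from Python's difflib, not from Git's own scoring, and the 0.5 threshold is chosen to echo Git's default 50% rename similarity. The function name and sample file contents are invented.

```python
from difflib import SequenceMatcher


def find_renames(deleted: dict, added: dict, threshold: float = 0.5) -> dict:
    """Pair each deleted path with its most similar added path.

    `deleted` and `added` map path -> file text. A pair counts as a
    rename only if the similarity ratio reaches `threshold`.
    """
    renames = {}
    for old_path, old_text in deleted.items():
        best_path, best_score = None, threshold
        for new_path, new_text in added.items():
            score = SequenceMatcher(None, old_text, new_text).ratio()
            if score >= best_score:
                best_path, best_score = new_path, score
        if best_path is not None:
            renames[old_path] = best_path
    return renames


deleted = {"foo": "line one\nline two\nline three\n"}
added = {
    "bar": "line one\nline two\nline 3\n",   # foo with one line edited
    "baz": "completely different",
}
renames = find_renames(deleted, added)
```

In this example foo is paired with bar because most of the text survived the edit. Rewrite enough of the file and the score drops below the threshold, at which point the tool sees an unrelated delete and add, which is the breakdown mentioned above.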
See this email from Linus for more insight about his philosophy on this topic.