
git history data for large projects

I am a newbie to Git and am trying to understand its principles. As I understand it, Git stores every file in full, as described in the Git Book and also in this post. However, the Git Book also mentions git gc, which compresses files and computes diffs, and that seems to contradict the first point that Git stores complete files.

1) Can someone explain which one is correct? If git gc does compute partial diffs, and it happens to run after a long time, will it make sure diffs are created from the base versions across all branches? Does this mean a lot of compute time if git gc is not run on a regular basis?

2) Considering projects like Android, which have a huge number of source and resource files, it seems that the repository will blow up in size with every commit. When developers pull the Android source, isn't it going to take a lot of space if it pulls the entire history of every source and binary file? Am I missing something here? How is this sustainable in the long term?


Git is a fancy content-addressable filesystem. Notionally, it stores the complete content of every file in every revision, and every blob (a file's contents) has a SHA by which it can be retrieved. Under the hood, Git does have the ability to store a file as a diff against a similar file (usually a previous version, but not necessarily), but that happens entirely at the lowest level of how Git stores files. Not only do Git users not have to think about it, most of the Git tools don't have to think about it either.
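A minimal way to see the content-addressable model in action, in any scratch repository (the hash shown is the well-known SHA-1 of a blob containing "hello" plus a newline):

    $ echo 'hello' | git hash-object -w --stdin    # write a blob, print its SHA
    ce013625030ba8dba906f756967f9e9ca394464a
    $ git cat-file -t ce01362                      # ask for the object's type
    blob
    $ git cat-file -p ce01362                      # retrieve the content by SHA
    hello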

As for the issue of size: because there are various forms of compression in the Git filesystem, and because they are quite efficient, the .git directory containing a project's entire history is usually smaller than a single checkout of the code until the project reaches hundreds of thousands of commits.
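You can check how this plays out in your own repository by comparing the size of the history against the size of a checkout (a rough sketch; du --exclude is the GNU du spelling):

    $ git count-objects -vH        # loose vs. packed object counts and sizes
    $ du -sh .git                  # the entire compressed history
    $ du -sh --exclude=.git .      # roughly the size of one checkout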

If a repository does become unmanageably large, it is possible (with the help of mechanisms such as grafts) to split a project's history into separate repositories, along the lines of new/ancient history, or active/archived branches, or other things like that.
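A hedged sketch of one such split, using a graft to declare a placeholder commit <cutoff> as the new root and then rewriting history so the graft becomes permanent (git filter-branch honors replacement refs):

    $ git replace --graft <cutoff>    # pretend <cutoff> has no parents
    $ git filter-branch -- --all      # rewrite all refs with the graft baked in

The commits before <cutoff> would then live on only in a separate archival repository.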


The way git gc computes diffs for storage is not necessarily related to a file's history. In fact, I recall reading somewhere (though I can't find the reference at the moment) that it is likely to choose more recent revisions as the base, because those are the ones you are most likely to check out. If you have 10,000 revisions and are checking out the latest, you don't want to apply 10,000 diffs to revision 1 to get the version you want.
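You can observe the delta chains git gc actually chose with verify-pack (the verbose output columns are SHA-1, type, size, size-in-pack, offset, and, for deltified objects, depth and base SHA-1):

    $ git gc                                              # ensure objects are packed
    $ git verify-pack -v .git/objects/pack/pack-*.idx | head -20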

There are certain operations that run git gc automatically; pull is one of them, so you're unlikely to go very long between runs. There's also nothing preventing you from running it after every commit if you're so inclined. Doing that work up front is essentially what other version control systems do behind the scenes, and deferring it is one of the main reasons Git is so fast.
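A small sketch of poking at this behavior; gc.auto is the threshold of loose objects that triggers an automatic gc (it defaults to 6700):

    $ git config gc.auto    # print the configured threshold, if any
    $ git gc --auto         # pack only if the thresholds are exceeded
    $ git gc                # or force a full repack right now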


git gc is used to clean up and compress loose objects in the repository. It does this by packing the objects (each of which is a full snapshot of a file, tree, or commit). A delta can be computed against a previous version of a file, or against a completely unrelated file that happens to have similar content.
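You can watch the packing happen by counting the files under .git/objects before and after (after gc, most of what remains is a single pack file plus its index):

    $ find .git/objects -type f | wc -l    # many loose object files
    $ git gc
    $ find .git/objects -type f | wc -l    # collapsed into a pack and an index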

To address question 2: as stated above, Git does pack up objects. Although conceptually there is a complete copy of every file, under the hood they are packed when gc is run. As for storing binary files, a version control system is generally not the best first choice, since binaries rarely delta or compress well.
