
git history data for large projects

I am a newbie to Git and am trying to understand its principles. As I understand it, Git stores every file in full, as described in the Git Book and also in this post. However, the Git Book also mentions git gc, which compresses files and computes diffs, and that seems to contradict the first point that Git stores complete files.

1) Can someone explain which one is correct? If git gc does compute partial diffs, and it happens to run after a long time, will it make sure diffs are created from the base versions across all branches? Does this mean a lot of compute time if git gc is not run on a regular basis?

2) Considering projects like Android, which have a huge number of source and resource files, it seems that the repository will blow up in size with every commit. When developers pull the Android source, isn't it going to take a lot of space if it pulls the entire history of every source and binary file? Am I missing something here? How is this sustainable in the long term?


Git is a fancy content-addressable filesystem. Notionally, it stores the complete content of every file in every revision, and every blob (a file's contents) has a SHA by which it can be retrieved. Under the hood, Git does have the ability to store a file as a diff against a similar file (usually a previous version, but not necessarily), but that happens entirely at the lowest level of how Git stores files. Not only do Git users not have to think about it, most of the Git tools don't have to think about it either.
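A minimal way to see the content-addressable model in action, in any scratch repository (the hash shown is the well-known SHA-1 of a blob containing "hello" plus a newline):

    $ echo 'hello' | git hash-object -w --stdin    # write a blob, print its SHA
    ce013625030ba8dba906f756967f9e9ca394464a
    $ git cat-file -t ce01362                      # ask for the object's type
    blob
    $ git cat-file -p ce01362                      # retrieve the content by SHA
    hello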

As for the issue of size: because there are various forms of compression in the Git filesystem, and because they are quite efficient, the .git directory containing a project's entire history is usually smaller than a single checkout of the code until the project reaches hundreds of thousands of commits.
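You can check how this plays out in your own repository by comparing the size of the history against the size of a checkout (a rough sketch; du --exclude is the GNU du spelling):

    $ git count-objects -vH        # loose vs. packed object counts and sizes
    $ du -sh .git                  # the entire compressed history
    $ du -sh --exclude=.git .      # roughly the size of one checkout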

If a repository does become unmanageably large, it is possible (with the help of mechanisms such as grafts) to split a project's history into separate repositories, along the lines of new/ancient history, or active/archived branches, or other things like that.
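A hedged sketch of one such split, using a graft to declare a placeholder commit <cutoff> as the new root and then rewriting history so the graft becomes permanent (git filter-branch honors replacement refs):

    $ git replace --graft <cutoff>    # pretend <cutoff> has no parents
    $ git filter-branch -- --all      # rewrite all refs with the graft baked in

The commits before <cutoff> would then live on only in a separate archival repository.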


The way git gc computes diffs for storage is not necessarily related to a file's history. In fact, I recall reading somewhere (though I can't find the reference at the moment) that it is likely to choose more recent revisions as the base, because those are the ones you are most likely to check out. If you have 10,000 revisions and are checking out the latest, you don't want to apply 10,000 diffs to revision 1 to get the version you want.
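You can observe the delta chains git gc actually chose with verify-pack (the verbose output columns are SHA-1, type, size, size-in-pack, offset, and, for deltified objects, depth and base SHA-1):

    $ git gc                                              # ensure objects are packed
    $ git verify-pack -v .git/objects/pack/pack-*.idx | head -20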

There are certain operations that run git gc automatically; pull is one of them, so you're unlikely to go very long between runs. There's also nothing preventing you from running it after every commit if you're so inclined. Doing that work up front is essentially what other version control systems do behind the scenes, and deferring it is one of the main reasons Git is so fast.
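A small sketch of poking at this behavior; gc.auto is the threshold of loose objects that triggers an automatic gc (it defaults to 6700):

    $ git config gc.auto    # print the configured threshold, if any
    $ git gc --auto         # pack only if the thresholds are exceeded
    $ git gc                # or force a full repack right now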


git gc is used to clean up and compress loose objects in the repository. It does this by packing the objects (each of which is a full snapshot of a file, tree, or commit). A delta can be computed against a previous version of a file, or against a completely unrelated file that happens to have similar content.
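You can watch the packing happen by counting the files under .git/objects before and after (after gc, most of what remains is a single pack file plus its index):

    $ find .git/objects -type f | wc -l    # many loose object files
    $ git gc
    $ find .git/objects -type f | wc -l    # collapsed into a pack and an index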

To address question 2: as stated above, Git does pack up objects. Although conceptually there is a complete copy of every file, under the hood they are packed when gc is run. As for storing binary files, a version control system is generally not the best first choice, since binaries rarely delta or compress well.
