Managing Huge Repositories with Git

Linus Torvalds created Git in the mid-2000s to solve a problem that other open source version control systems of the time could not: being distributed, reliable, and fast.


As he mentions in this Google tech talk on Git, Git was created out of necessity for the Linux project. At the time the talk was given, Git was very young and people were getting used to it. It seemed to solve all the problems faced by version control software, and this contributed to its meteoric rise.

Git’s Shortcomings

Fast forward a few years, and people started noticing the first real flaw in Git: it was difficult to handle very large repositories. How large are we talking here? It’s Facebook large. Facebook’s team projected that in a few years, a simple git status would take up to half a minute to show the result, as Facebook adds a large number of commits from thousands of developers every day. Facebook shifted its whole codebase to Mercurial, and its team actively started contributing to Mercurial to meet Facebook’s needs.

Where did Git fall short? Prasoon Shukla, a Mercurial contributor, attributes the difference in scaling to the way the two systems store commits. Mercurial maintains a couple of objects (or files) for each file in your repository, whereas Git creates new objects for every commit. As commits accumulate, the number of objects in Mercurial stays roughly constant, while in Git it grows linearly. By this account, even a simple git status command has more and more objects to sift through, which takes a considerable amount of time in spite of Git’s efficiency.

Another area where Git might fall short is managing large binary files in the repository. Git stores a full snapshot of every version of a file and relies on delta compression to keep the repository small. That works well for text, but changes to binary files (especially compressed formats) rarely produce small deltas, so in practice each new version adds close to the full size of the binary to the repository.

Over the years, Git’s developers have tried to solve these problems. Third-party services have also come up with solutions to help Git manage larger repositories, such as GitHub’s Large File Storage extension.

This post looks at techniques for handling large repositories in Git, whether the problem is a long history, large binary files, or both.

Projects with a Large Number of Commits

I’ll first look at a few ways to manage repositories with large histories more efficiently.

Shallow clone of repositories

As mentioned earlier, the primary reason why projects with large histories slow down is the huge number of commits. In a distributed system like Git, when you clone a repository, its full project history gets downloaded. However, Git provides a way to specify the number of commits you want to have in your clone of a project. This is known as a shallow clone. When you get the number of commits down, your Git operations run faster.

To perform a shallow clone, add the --depth option, with the number of commits you want, to the clone command:

git clone --depth [number_of_commits] [url_of_remote]

In earlier versions of Git, there was limited support for shallow clones. If your truncated history didn’t stretch back far enough, you weren’t allowed to push or pull. However, with the release of Git 1.9.0, support for shallow clones improved significantly.
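To see the effect of --depth without touching a real project, here is a minimal sketch that builds a throwaway repository and clones it shallowly. The directory names, file name, and commit messages are made up for the demo.

```shell
set -e
work=$(mktemp -d)   # scratch directory; nothing outside it is touched

# Build a throwaway "remote" with five commits.
git init -q "$work/origin"
cd "$work/origin"
git config user.email "demo@example.com"
git config user.name "Demo User"
for i in 1 2 3 4 5; do
  echo "change $i" > notes.txt
  git add notes.txt
  git commit -q -m "commit $i"
done

# Clone only the two most recent commits. The file:// prefix matters here:
# for a plain local path Git ignores --depth and copies everything.
git clone -q --depth 2 "file://$work/origin" "$work/shallow"

full_count=$(git -C "$work/origin" rev-list --count HEAD)
shallow_count=$(git -C "$work/shallow" rev-list --count HEAD)
echo "origin has $full_count commits, shallow clone has $shallow_count"
```

If you later need the rest of the history, git fetch --unshallow downloads it into the existing clone.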

Clone a single branch

When you clone a repository, all the branches in the remote get downloaded. (If you run git branch in a newly cloned repository, it shows only the master branch. You should run git branch -a to list all the branches that were a part of the remote.) It’s probable that many of the commits present in other branches are irrelevant to one developer’s work. Therefore, you can clone just the master or the branch relevant to your development. Doing so significantly reduces the number of commits that make up the history of the cloned version, especially if branches in the repository have divergent histories.

To clone only a single branch of a remote, you can run the following command:

git clone [url_of_remote] --branch [branch_name] --single-branch

This command instructs Git to clone only the branch_name branch from the remote.
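The difference between a full clone and a single-branch clone can be checked on a scratch repository. This is only a sketch: the branch names main and feature, and the file app.txt, are assumptions for the demo.

```shell
set -e
work=$(mktemp -d)

# Throwaway "remote" with two branches: main and feature.
git init -q "$work/origin"
cd "$work/origin"
git config user.email "demo@example.com"
git config user.name "Demo User"
git symbolic-ref HEAD refs/heads/main   # pin the branch name for the demo
echo "base" > app.txt
git add app.txt
git commit -q -m "initial commit"
git checkout -q -b feature
echo "feature work" >> app.txt
git commit -q -am "feature commit"
git checkout -q main

# One ordinary clone, one restricted to the feature branch.
git clone -q "file://$work/origin" "$work/full"
git clone -q "file://$work/origin" --branch feature --single-branch "$work/single"

# Count remote-tracking branches, skipping the "origin/HEAD -> ..." alias line.
full_branches=$(git -C "$work/full" branch -r | grep -vc -- '->')
single_branches=$(git -C "$work/single" branch -r | grep -vc -- '->')
current=$(git -C "$work/single" rev-parse --abbrev-ref HEAD)
echo "full clone tracks $full_branches branches, single-branch clone tracks $single_branches"
echo "single-branch clone is on: $current"
```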

Projects With Large Files

The next problem is the presence of large binary files (as opposed to traditional text files). Git can’t compute a meaningful line-by-line diff for binary files, so each change is effectively stored as a new copy of the whole file. If the binary files are large (like 3D models or graphic designs), the size of the repository increases considerably with every commit that changes them.

Using submodules

One way out of the problem of large files is to use submodules, which enable you to manage one Git repository within another. You can create a submodule, which contains all your binary files, keeping the rest of the code separately in the parent repository, and update the submodule only when necessary. This logically separates the core part of your project from the large files, and helps in managing them separately.
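As a rough illustration of that layout, the sketch below creates one repository for the assets and one for the code, then attaches the first to the second as a submodule. All repository, directory, and file names here are hypothetical.

```shell
set -e
work=$(mktemp -d)

# A separate repository that holds only the large binary assets.
git init -q "$work/assets"
cd "$work/assets"
git config user.email "demo@example.com"
git config user.name "Demo User"
printf 'pretend this is a large 3D model' > model.bin
git add model.bin
git commit -q -m "add model"

# The parent repository that holds the code.
git init -q "$work/project"
cd "$work/project"
git config user.email "demo@example.com"
git config user.name "Demo User"
echo 'print("hello")' > main.py
git add main.py
git commit -q -m "add code"

# Attach the assets repository as a submodule under assets/.
# protocol.file.allow=always is only needed because this demo uses a local
# path; recent Git versions block file-based submodule URLs by default.
git -c protocol.file.allow=always submodule --quiet add "$work/assets" assets
git commit -q -m "track assets as a submodule"

# The parent stores a pointer to a commit (mode 160000), not the files themselves.
entry_mode=$(git ls-tree HEAD assets | cut -d' ' -f1)
echo "assets entry mode in parent: $entry_mode"
```

Because the parent records only a commit ID for assets/, its own history stays small however large the binaries become.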

Using a third-party extension

There are many extensions for Git, built by other developers, to handle large files. One option is to use git-annex, which allows you to manage files in Git without checking the contents of the file into Git. Another extension (the development of which has now been stopped) is git-bigfiles.

GitHub recently launched Git Large File Storage (LFS), an open source extension for Git, to manage large binary files. LFS stores these large files on a remote server such as GitHub, while only text pointers are stored in your Git repository. SitePoint recently published a tutorial on how to get started with Git LFS.

In a short period of time, Git LFS has gained popularity, signaling that it provides a good way of handling such large binaries in Git.
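Git LFS decides which files to hand off by means of a filter attribute in .gitattributes. The sketch below does not invoke Git LFS itself: it writes by hand the line that git lfs track "*.psd" records, so it runs even where the extension isn’t installed, and then asks Git which filter would apply to two hypothetical file names.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"

# This is the line that `git lfs track "*.psd"` adds to .gitattributes.
echo '*.psd filter=lfs diff=lfs merge=lfs -text' > .gitattributes

# Ask Git which filter applies to each path (the files need not exist).
psd_filter=$(git check-attr filter -- design.psd)
txt_filter=$(git check-attr filter -- README.txt)
echo "$psd_filter"
echo "$txt_filter"
```

With the extension installed, files matching the pattern are replaced by small pointer files on commit, and .gitattributes itself should be committed so collaborators get the same rules.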

Final Thoughts

It has been a long time since Facebook announced its move to Mercurial (although it still continues to use Git for side projects like ReactJS). It’s good to see Git developers and third-party developers have both reacted positively to it and come up with innovative solutions to the problems at hand. If you’re thinking about learning version control, I’d recommend you go for Git—as the future is definitely bright!

If you’d like to learn more about Git and its amazing powers, check out Shaumik’s new book Jump Start Git, published right here at SitePoint!

Understand Git’s core philosophy.
Get started with Git: install it, learn the basic commands, and set up your first project.
Work with Git as part of a collaborative team.
Use Git’s debugging tools for maximum debug efficiency.
Take control with Git’s advanced features: reflog, rebase, stash, and more.
Use Git with cloud-based Git repository hosting services like GitHub and Bitbucket.
See how Git’s used effectively on large open-source projects.

Frequently Asked Questions (FAQs) about Managing Huge Repositories with Git

How can I manage large files in Git?

Git is not designed to handle large files effectively. However, there are tools like Git Large File Storage (LFS) that can help. Git LFS replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server. This way, you can work with large files just like any other file in Git.

What is the purpose of the git cat-file command?

The git cat-file command is a versatile tool that provides information about Git objects. It can be used to view the type of an object, the size of an object, or the actual content of an object. This command is particularly useful when you need to explore the Git database and understand how Git stores data.
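A small sketch of the three uses mentioned above, run against a scratch repository. The file greeting.txt is made up for the demo.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.email "demo@example.com"
git config user.name "Demo User"
printf 'hello\n' > greeting.txt
git add greeting.txt
git commit -q -m "add greeting"

blob=$(git rev-parse HEAD:greeting.txt)   # object ID of the file's contents

obj_type=$(git cat-file -t "$blob")       # type of the object
obj_size=$(git cat-file -s "$blob")       # size in bytes
obj_body=$(git cat-file -p "$blob")       # pretty-printed content
head_type=$(git cat-file -t HEAD)         # HEAD resolves to a commit object

echo "$blob is a $obj_type of $obj_size bytes containing: $obj_body"
echo "HEAD is a $head_type"
```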

How can I recover a deleted file in Git?

If you accidentally delete a file, Git provides a way to recover it. You can use the git log command with the --diff-filter=D option to find the commit that deleted the file. Once you have the commit hash, you can use the git checkout command on that commit’s parent (the last commit in which the file still existed) to restore the file.
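Here is one way the recovery might look in practice, using a scratch repository and a hypothetical file named report.txt.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.email "demo@example.com"
git config user.name "Demo User"

echo "important" > report.txt
git add report.txt
git commit -q -m "add report"
git rm -q report.txt
git commit -q -m "remove report"

# Find the most recent commit that deleted the file. The "--" is required
# because report.txt no longer exists in the working tree.
deleting_commit=$(git log --diff-filter=D --format=%H -n 1 -- report.txt)

# Restore the file as it was in the parent of that commit.
git checkout -q "$deleting_commit^" -- report.txt
restored=$(cat report.txt)
echo "restored content: $restored"
```

The restored file is staged but not committed, so you can review it before recording the recovery.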

How can I find a specific commit in Git?

Git provides several ways to find a specific commit. You can use the git log command with various options to filter the commit history. For example, you can use the --grep option to search for a specific keyword in the commit messages, or the --author option to filter commits by a specific author.
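The following sketch shows both filters on a scratch repository. The author names and commit messages are invented, and the commits are empty to keep the example short.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.email "demo@example.com"
git config user.name "Demo User"

git commit -q --allow-empty --author="Alice Example <alice@example.com>" -m "fix login timeout"
git commit -q --allow-empty --author="Bob Example <bob@example.com>" -m "add search page"
git commit -q --allow-empty --author="Alice Example <alice@example.com>" -m "add profile page"

# Search commit messages for a keyword.
by_message=$(git log --grep="login" --format=%s)

# Filter by author name (a substring or regular expression is enough).
by_author=$(git log --author="Bob" --format=%s)
alice_count=$(git rev-list --count --author="Alice" HEAD)

echo "mentions login: $by_message"
echo "written by Bob: $by_author"
echo "commits by Alice: $alice_count"
```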

How can I view the changes introduced by a commit in Git?

You can use the git show command to view the changes introduced by a commit. This command displays the commit message and the diff, showing what was added and removed in that commit. You can also get the same diff from git diff by comparing the commit with its parent (for example, git diff HEAD~1 HEAD for the latest commit).
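As a quick sketch on a scratch repository (the file list.txt and the commit messages are made up), both commands surface the same added line:

```shell
set -e
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.email "demo@example.com"
git config user.name "Demo User"

printf 'one\n' > list.txt
git add list.txt
git commit -q -m "start list"
printf 'one\ntwo\n' > list.txt
git commit -q -am "add second item"

# Commit message plus diff for the latest commit.
git show HEAD

# Just the subject line, with the diff suppressed.
subject=$(git show -s --format=%s HEAD)

# The added line appears in both git show and an explicit parent-to-commit diff.
added_in_show=$(git show HEAD | grep -c '^+two$')
added_in_diff=$(git diff HEAD~1 HEAD | grep -c '^+two$')
```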

How can I undo a commit in Git?

Git provides several ways to undo a commit. You can use the git reset command to move the HEAD pointer to a previous commit, effectively “undoing” the last commit. Alternatively, you can use the git revert command to create a new commit that undoes the changes introduced by a specific commit.
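The two approaches behave quite differently, which a scratch repository makes visible. In this sketch the file name and the contents v1, v2, and v3 are placeholders.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.email "demo@example.com"
git config user.name "Demo User"

for v in v1 v2 v3; do
  echo "$v" > version.txt
  git add version.txt
  git commit -q -m "set $v"
done

# reset: move HEAD back one commit and drop "set v3" from the branch.
# --hard also discards uncommitted changes, so use it with care.
git reset -q --hard HEAD~1
count_after_reset=$(git rev-list --count HEAD)
content_after_reset=$(cat version.txt)

# revert: add a new commit that reverses "set v2"; earlier history is kept.
git revert --no-edit HEAD > /dev/null
count_after_revert=$(git rev-list --count HEAD)
content_after_revert=$(cat version.txt)

echo "after reset:  $count_after_reset commits, file contains $content_after_reset"
echo "after revert: $count_after_revert commits, file contains $content_after_revert"
```

Since reset rewrites the branch, revert is usually the safer choice for commits that have already been pushed and shared.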

How can I clean up my Git repository?

Git provides the git gc command to clean up unnecessary files and optimize your repository. This command performs several housekeeping tasks, such as compressing file revisions, removing unreachable objects, and packing referenced objects.
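To get a feel for what git gc does, the sketch below counts loose objects before and after running it on a scratch repository. The file name and commit messages are made up, and exact numbers may vary between Git versions.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.email "demo@example.com"
git config user.name "Demo User"

# Each commit creates several loose objects (a blob, a tree, and a commit).
for i in 1 2 3 4 5; do
  echo "revision $i" > data.txt
  git add data.txt
  git commit -q -m "revision $i"
done

loose_before=$(git count-objects | cut -d' ' -f1)

# Pack the loose objects and tidy up.
git gc -q

loose_after=$(git count-objects | cut -d' ' -f1)
packs=$(git count-objects -v | grep '^packs:' | cut -d' ' -f2)
echo "loose objects: $loose_before before, $loose_after after; packs: $packs"
```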

How can I handle merge conflicts in Git?

Merge conflicts occur when Git is unable to automatically merge changes from different branches. You can resolve merge conflicts manually by editing the conflicting files, choosing the changes you want to keep, and then committing the resolved files.
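The sketch below provokes a conflict on purpose and then resolves it by hand. The branch names, the file settings.txt, and its contents are all hypothetical.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.email "demo@example.com"
git config user.name "Demo User"
git symbolic-ref HEAD refs/heads/main   # pin the branch name for the demo

echo "color = blue" > settings.txt
git add settings.txt
git commit -q -m "base settings"

# Two branches change the same line in different ways.
git checkout -q -b feature
echo "color = green" > settings.txt
git commit -q -am "feature: use green"
git checkout -q main
echo "color = red" > settings.txt
git commit -q -am "main: use red"

# The merge stops with a conflict, so its non-zero exit status is expected.
git merge feature > /dev/null 2>&1 || true
conflicted=$(git diff --name-only --diff-filter=U)
echo "conflicted files: $conflicted"

# Resolve by hand: write the content you want, stage it, and commit.
echo "color = green" > settings.txt
git add settings.txt
git commit -q -m "merge feature, keep green"

parent_count=$(git cat-file -p HEAD | grep -c '^parent ')
resolved=$(cat settings.txt)
```

In a real conflict you would open the file and choose between the sections marked with <<<<<<<, =======, and >>>>>>> rather than overwrite it outright.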

How can I view the history of a file in Git?

You can use the git log command with the --follow option and the file name to view the history of a file. This command displays all the commits that modified the file, including those made before it was renamed, in reverse chronological order.
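The --follow option matters most when a file has been renamed, as this sketch on a scratch repository suggests. The file names old_name.txt and new_name.txt are made up.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.email "demo@example.com"
git config user.name "Demo User"

printf 'line one\nline two\nline three\n' > old_name.txt
git add old_name.txt
git commit -q -m "create file"
git mv old_name.txt new_name.txt
git commit -q -m "rename file"
echo "line four" >> new_name.txt
git commit -q -am "edit file"

# Without --follow, the history stops at the rename.
without_follow=$(git log --format=%s -- new_name.txt | grep -c .)

# With --follow, Git continues past the rename to the original file.
with_follow=$(git log --follow --format=%s -- new_name.txt | grep -c .)

echo "commits listed without --follow: $without_follow"
echo "commits listed with --follow:    $with_follow"
```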

How can I rename a branch in Git?

You can rename a branch in Git using the git branch -m command followed by the old branch name and the new branch name. This works from any branch. If you’re already on the branch you want to rename, you can omit the old name and pass only the new one.
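Both forms of the command can be tried on a scratch repository. The branch names in this sketch are invented for the demo.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.email "demo@example.com"
git config user.name "Demo User"
git symbolic-ref HEAD refs/heads/main   # pin the branch name for the demo
git commit -q --allow-empty -m "initial commit"
git branch old-feature

# Rename another branch while staying on main.
git branch -m old-feature new-feature

# Rename the branch you are currently on: only the new name is needed.
git branch -m trunk

current=$(git rev-parse --abbrev-ref HEAD)
branches=$(git for-each-ref --format='%(refname:short)' refs/heads | tr '\n' ' ')
echo "current branch: $current"
echo "all branches: $branches"
```

If the branch was already pushed, the rename is local only; the remote still has the old name until you push the new branch and delete the old one there.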