When a GitLab user forks a project,
GitLab creates a new Project with an associated Git repository that is a
copy of the original project at the time of the fork. If a large project
gets forked often, this can lead to a quick increase in Git repository
storage disk use. To counteract this problem, we are adding Git object
deduplication for forks to GitLab. In this document, we describe how
GitLab implements Git object deduplication.
Establish the alternates link in the special file A.git/objects/info/alternates by writing a path that resolves to B.git/objects.
In repository A, run git repack to remove all objects in repository A that also exist in repository B.
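The two steps above can be sketched in Python as follows. This is a minimal illustration, not how Gitaly performs the operation. The function names `link_alternates` and `repack_without_borrowed` are hypothetical, and the `-a -d -l` flags passed to git repack are an assumption, because the text above does not say which flags are used.

```python
import os
import subprocess


def link_alternates(repo_a: str, repo_b: str) -> str:
    """Make repo_a borrow objects from repo_b and return the entry written."""
    a_objects = os.path.join(repo_a, "objects")
    b_objects = os.path.join(repo_b, "objects")
    # A relative entry is resolved against the borrowing repository's
    # own objects directory, so compute the path from there.
    entry = os.path.relpath(b_objects, a_objects)
    info_dir = os.path.join(a_objects, "info")
    os.makedirs(info_dir, exist_ok=True)
    with open(os.path.join(info_dir, "alternates"), "w") as f:
        f.write(entry + "\n")
    return entry


def repack_without_borrowed(repo_a: str) -> None:
    """Repack repo_a so that it no longer stores objects it can borrow."""
    # Assumed flags: -a and -d rewrite all objects into a new pack and
    # delete the old packs; -l leaves out objects found via alternates.
    subprocess.run(
        ["git", "-C", repo_a, "repack", "-a", "-d", "-l"],
        check=True,
    )
```

When A.git and B.git share a parent directory, `link_alternates("A.git", "B.git")` writes the entry `../../B.git/objects`.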
After the repack, repository A is no longer self-contained, but still contains its
own refs and configuration. Objects in A that are not in B remain in A. For this
configuration to work, objects must not be deleted from repository B because
repository A might need them.
Do not run git prune or git gc in object pool repositories, which are
stored in the @pools directory. This can cause data loss in the regular
repositories that depend on the object pool.
The danger lies in git prune, and git gc calls git prune. The
problem is that git prune, when running in a pool repository, cannot
reliably decide if an object is no longer needed.
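A toy model can make the failure mode concrete. The sketch below is not Git's implementation. It represents each object store as a set of made-up object names and assumes that pruning keeps only what the pool's own refs can reach. The pool knows nothing about the refs of the repositories that borrow from it.

```python
def prune(store: set, reachable_from_own_refs: set) -> set:
    """Toy prune: keep only objects reachable from the repository's own refs."""
    return store & reachable_from_own_refs


def missing_objects(needed: set, own_store: set, pool_store: set) -> set:
    """Objects a member repository needs but cannot find locally or in the pool."""
    return needed - (own_store | pool_store)


# Hypothetical object names, for illustration only.
pool_store = {"c1", "c2", "c3"}
pool_reachable = {"c1", "c2"}      # the pool's refs no longer reach c3
member_store = {"c4"}              # the member stores only its unique object
member_needs = {"c2", "c3", "c4"}  # the member's refs still reach c3

before = missing_objects(member_needs, member_store, pool_store)
pool_after_prune = prune(pool_store, pool_reachable)
after = missing_objects(member_needs, member_store, pool_after_prune)

print(before)  # set()
print(after)   # {'c3'}
```

From the pool's point of view, c3 is garbage. From the member's point of view, it is required data, and after the prune the member repository is corrupt.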
Git alternates in GitLab: pool repositories
GitLab organizes this object borrowing by creating special pool repositories
which are hidden from the user. We then use Git
alternates to let a collection of project repositories borrow from a
single pool repository. We call such a collection of project
repositories a pool. Pools form star-shaped networks of repositories
that borrow from a single pool, which resemble (but may not be
identical to) the fork networks that get formed when users fork
projects.
At the Git level, pool repositories are created and managed using Gitaly
RPC calls. Just like with normal repositories, the authority on which
pool repositories exist, and which repositories borrow from them, lies
at the Rails application level in SQL.
In conclusion, we need three things for effective object deduplication
across a collection of GitLab project repositories at the Git level:
A pool repository must exist.
The participating project repositories must be linked to the pool
repository via their respective objects/info/alternates files.
The pool repository must contain Git object data common to the
participating project repositories.
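The first two requirements can be checked on disk with a short script. The sketch below is an illustration only. The helper name `borrows_from_pool` is hypothetical, and it does not check the third requirement, because deciding which object data is common means inspecting the object databases themselves.

```python
import os


def borrows_from_pool(project_repo: str, pool_repo: str) -> bool:
    """Return True if project_repo's alternates file points at pool_repo."""
    pool_objects = os.path.join(pool_repo, "objects")
    # Requirement 1: the pool repository must exist.
    if not os.path.isdir(pool_objects):
        return False

    project_objects = os.path.join(project_repo, "objects")
    alternates = os.path.join(project_objects, "info", "alternates")
    # Requirement 2: the project must be linked through its alternates file.
    if not os.path.isfile(alternates):
        return False

    with open(alternates) as f:
        for line in f:
            entry = line.strip()
            if not entry:
                continue
            # Relative entries resolve against the project's objects
            # directory; os.path.join leaves absolute entries unchanged.
            candidate = os.path.normpath(os.path.join(project_objects, entry))
            if os.path.isdir(candidate) and os.path.samefile(candidate, pool_objects):
                return True
    return False
```

A project repository with no alternates file, or one whose entries resolve somewhere other than the pool's objects directory, returns False.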
All repositories in a pool must be on the same Gitaly storage shard.
The Git alternates mechanism relies on direct disk access across
multiple repositories, and we can only assume direct disk access to
be possible within a Gitaly storage shard.
The only two ways to remove a member project from a pool are (1) to
delete the project or (2) to move the project to another Gitaly
storage shard.