Git Internals: How Git Actually Works — Objects, Refs & the DAG

Quick Reference — the mental model in one table

Concept	What it really is
Blob	Raw bytes of one file's content — no filename, no mode, no history. Identical content = one blob, shared by every file/commit that has it.
Tree	A directory listing: (mode, name) → blob/tree SHA. Nested trees make a nested directory structure.
Commit	A pointer to one tree + zero-or-more parent commits + author/committer/message. A snapshot, never a diff.
SHA (object ID)	The object's address and its integrity check — hash of the object's own bytes. Change one byte anywhere, the hash (and every descendant's hash) changes.
Branch	A 41-byte text file (`.git/refs/heads/<name>`) holding a commit SHA. Moving a branch is a single file write — no data is copied.
HEAD	Usually a symbolic pointer to the current branch ref. Points directly at a commit SHA instead = "detached HEAD."
Index (staging area)	A separate binary file (`.git/index`) listing what the next commit's tree will contain — not a commit-in-progress, not the working directory.
The DAG	Commits form a directed acyclic graph via parent pointers. Merges have 2+ parents; the graph can only grow forward because a commit's hash depends on its parent's hash, which must already exist.
Merge base	The best common ancestor of two branch tips in the DAG — the reference point for a three-way merge.
Rebase	Re-apply commits onto a new parent, one at a time. New parent ⇒ new hash for every commit from that point forward, even if the content is identical.

§1
The big idea: a content-addressable filesystem

Git is a key-value store, not a "diff tracker"

Linus Torvalds described Git's core as "a stupid content tracker" — and that's the accurate mental model. Underneath the porcelain (add, commit, merge...) sits a tiny key-value database: the key is a SHA hash, the value is a compressed blob of bytes, and there are exactly four kinds of values (objects). Everything Git does — history, branching, diffing, merging — is built on top of that one primitive.

This is why Git is fast and safe to reason about: nothing is ever edited in place. A "change" is always a brand-new object plus a moved pointer. The old object still exists until something explicitly garbage-collects it (§8).

Commits are snapshots, not diffs — a common misconception

Unlike older centralized VCS tools (CVS, Subversion) that store a file's history as a chain of diffs/deltas, every Git commit points at a complete tree — the full state of every file, as it existed at that commit. git diff, git log -p, and friends compute differences on the fly by comparing two full snapshots; nothing is stored as a diff at the object level.

Storage efficiency doesn't suffer: unchanged files are unchanged blobs — the tree for commit #50 simply reuses the exact same blob SHA as commit #49 for any file that didn't change. Space savings on top of that come later, from packfile delta compression (§8) — a storage-layer optimization, decoupled from the conceptual model.

Three separate areas, three separate jobs

Every Git command can be understood as copying data between three places:

Working directory — the plain files on disk you edit with a normal editor. Not part of the object database at all.
Index / staging area — a snapshot-in-progress recorded in .git/index. What git commit will turn into a tree.
Repository (object database + refs) — the immutable object graph plus the mutable pointers (branches, HEAD, tags) into it.

§2
The object model: blobs, trees, commits & tags

The four object types — what each one stores

Type	Stores	Inspect with
blob	Raw file bytes only. No name, no path, no permission bits — those live in the tree that references it.	`git cat-file -p <sha>`
tree	A sorted list of entries: `mode name\0<20-byte binary sha>` per entry — one directory level. Subdirectories are just entries whose SHA points at another tree.	`git ls-tree <sha>`
commit	One `tree` SHA, zero-or-more `parent` SHAs, author, committer, timestamps, and the message. Never a diff.	`git cat-file -p <sha>` / `git show <sha>`
tag (annotated)	A target object's type + SHA, tagger identity, message, and optionally a GPG/SSH signature. A durable, first-class object — unlike a lightweight tag.	`git cat-file -p <sha>`

Lightweight tags aren't objects at all. git tag v1.0 (no -a) just writes a ref file at .git/refs/tags/v1.0 containing a commit SHA — structurally identical to a branch, except convention says you don't move it. Use annotated tags (git tag -a) for anything you'll ship; they carry metadata and can be signed.
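The tree entry layout from the table above can be reproduced in a few lines of Python. This is a minimal sketch, assuming a SHA-1 repository; `serialize_tree` and `tree_id` are hypothetical helper names, not Git APIs. One detail goes beyond the table: in the raw object a subdirectory's mode is written as 40000 (the plumbing commands pad it to 040000 for display), and directory names sort as if they ended in "/".

```python
import hashlib
from typing import List, Tuple

Entry = Tuple[str, str, str]  # (mode, name, hex SHA of the blob or subtree)


def _sort_key(entry: Entry) -> bytes:
    mode, name, _ = entry
    # Git compares names bytewise, treating a directory as if it ended in "/".
    return name.encode() + (b"/" if mode == "40000" else b"")


def serialize_tree(entries: List[Entry]) -> bytes:
    """Build a tree object's content: 'mode name\\0' + 20 raw SHA bytes per entry."""
    out = b""
    for mode, name, hex_sha in sorted(entries, key=_sort_key):
        out += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(hex_sha)
    return out


def tree_id(entries: List[Entry]) -> str:
    """Hash the serialized tree with the usual '<type> <length>\\0' header."""
    content = serialize_tree(entries)
    header = f"tree {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


# A tree with no entries hashes to Git's well-known "empty tree" ID.
print(tree_id([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Because the entry stores only a mode, a name, and a SHA, renaming a file changes the tree's bytes (and so its ID) while the blob it points at stays untouched.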

Gitlinks — the quasi-fifth reference (submodules)

A tree entry can carry mode 160000 — a gitlink — whose "SHA" is a commit hash in an entirely different repository, not a blob/tree in this one. That's the whole mechanism behind submodules: the parent repo records exactly one pinned commit of the child repo, with no copy of its objects. git submodule update is what actually fetches and checks out that pinned commit into the working directory.

§3
Content addressing & the object database

How a SHA is actually computed

An object's ID is the hash of a small header plus its content: sha1("<type> <byte-length>\0" + content). Because the type is part of the hashed input, a blob whose bytes happen to match a commit object's bytes still gets a different ID. And because nothing else goes into the hash, identical file content always produces the identical blob SHA, regardless of filename, path, or which commit it's in: the ID depends only on the object's type and bytes.

deterministic content addressing, proven

$ echo -n "hello git internals" | git hash-object --stdin
19a3d3d4a52002ac7f7ef476ffc2ba1de1471ec9

$ echo -n "hello git internals" | git hash-object --stdin
19a3d3d4a52002ac7f7ef476ffc2ba1de1471ec9   # identical input → identical hash, every time
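The same computation can be done without Git at all, which makes the header-plus-content rule concrete. A minimal sketch, assuming a SHA-1 repository; `git_object_id` and `git_blob_id` are hypothetical helper names:

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """SHA-1 over '<type> <byte-length>\\0' followed by the raw content."""
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


def git_blob_id(content: bytes) -> str:
    return git_object_id("blob", content)


# The empty blob has a well-known ID in every SHA-1 repository.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Same bytes, different type in the header: a different ID.
print(git_object_id("blob", b"x") == git_object_id("commit", b"x"))  # False
```

Feeding the same bytes to `git hash-object --stdin` is a quick way to cross-check the sketch against a real Git installation.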

On-disk storage: loose objects vs. packfiles

A freshly-created object is written as a loose object: zlib-deflate-compressed and stored at .git/objects/xx/yyyy…, where xx is the first two hex characters of its SHA (a directory, so no single directory ever needs millions of entries) and the rest is the filename.
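The loose-object layout described above can be sketched as a pair of read/write helpers. This is an illustration under stated assumptions, not Git's implementation: it assumes SHA-1, the function names are hypothetical, and it skips details such as writing to a temporary file first.

```python
import hashlib
import zlib
from pathlib import Path
from typing import Tuple


def write_loose_object(git_dir: Path, obj_type: str, content: bytes) -> str:
    """Store header + content, zlib-compressed, under objects/xx/yyyy..."""
    store = f"{obj_type} {len(content)}\0".encode("ascii") + content
    sha = hashlib.sha1(store).hexdigest()
    path = git_dir / "objects" / sha[:2] / sha[2:]
    if not path.exists():  # identical content maps to the same path: nothing to do
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(zlib.compress(store))
    return sha


def read_loose_object(git_dir: Path, sha: str) -> Tuple[str, bytes]:
    """Inflate a loose object and split it back into (type, content)."""
    raw = zlib.decompress((git_dir / "objects" / sha[:2] / sha[2:]).read_bytes())
    header, _, content = raw.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("length in header does not match content")
    return obj_type, content
```

Writing the same content twice is a no-op, which is the on-disk side of "identical content = one blob".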

Loose objects are simple but wasteful — one per object, no cross-object compression. Periodically (or explicitly via git gc), Git collapses many loose objects into a single .pack file plus a .idx index for O(log n) lookup by SHA prefix, using delta compression: similar objects (e.g. successive versions of one file) are stored as a small diff against a chosen base object rather than in full. Full detail in §8.

Inspecting the object graph with plumbing

Porcelain commands (log, show, diff) are convenience wrappers. The plumbing commands underneath let you walk the exact same graph a commit points at:

walk from a commit down to its files

$ git cat-file -p HEAD
tree 8f3d44b1c9e2a0f5b6d7c8a9b0e1f2a3b4c5d6e7
parent 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b
author David Veksler <[email protected]> 1782900000 -0700
committer David Veksler <[email protected]> 1782900000 -0700

Add README

$ git cat-file -p 8f3d44b1c9e2a0f5b6d7c8a9b0e1f2a3b4c5d6e7
100644 blob 3f2504e04f8964a875e69bfff8c5a4b8b5f2c1f3    README.md
100644 blob a1b2c3d4e5f60718293a4b5c6d7e8f9a0b1c2d3e    main.py
040000 tree 9d0e2f1a2b3c4d5e6f708192a3b4c5d6e7f80912    src

$ git cat-file -p 3f2504e04f8964a875e69bfff8c5a4b8b5f2c1f3
# Hello, Git

$ git cat-file -t 8f3d44b1c9e2a0f5b6d7c8a9b0e1f2a3b4c5d6e7
tree

Sample hashes above are illustrative; run this against your own repo to see the real graph.

§4
Refs, HEAD & the index — the mutable layer

A ref is just a file containing a SHA

Everything in .git/objects is content-addressed and immutable. Branches, HEAD, and tags are the opposite: small, deliberately mutable pointers that make the immutable graph usable as version control.

Ref	On disk	Contents
`refs/heads/main`	`.git/refs/heads/main`	A commit SHA — the branch tip. Moves on every commit to that branch.
`HEAD`	`.git/HEAD`	Normally `ref: refs/heads/main` (symbolic — "whatever main points at"). A raw SHA instead = detached HEAD.
`refs/tags/v1.0`	`.git/refs/tags/v1.0`	A commit (or tag-object) SHA. Immutable by convention only — nothing stops you from moving it.
`refs/remotes/origin/main`	`.git/refs/remotes/origin/main`	Local record of where `origin`'s `main` was as of your last fetch/push. Read-only in normal use.
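Resolving HEAD by hand shows how little machinery the ref layer needs. A minimal sketch with hypothetical helper names; it assumes every ref is stored as a loose file and deliberately ignores the packed-refs file and other ref backends that a real repository may use.

```python
from pathlib import Path
from typing import Optional


def resolve_ref(git_dir: Path, name: str = "HEAD", max_depth: int = 5) -> Optional[str]:
    """Follow 'ref: ...' indirections until a raw SHA is found."""
    for _ in range(max_depth):
        ref_file = git_dir / name
        if not ref_file.is_file():
            return None  # unborn branch, or a ref this sketch cannot see
        value = ref_file.read_text().strip()
        if value.startswith("ref: "):
            name = value[len("ref: "):]  # symbolic ref: keep following
        else:
            return value  # raw SHA: done
    return None


def is_detached(git_dir: Path) -> bool:
    """HEAD is detached when it holds a SHA instead of 'ref: <branch ref>'."""
    return not (git_dir / "HEAD").read_text().startswith("ref: ")
```

For example, `resolve_ref(Path(".git"))` returns the commit SHA of the current branch tip when run from the top of a repository that stores that branch as a loose ref.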

The index (staging area) is a file, not a commit-in-progress

.git/index is a binary file: a header (signature, format version, entry count), then one sorted entry per staged path — mode, file size, mtime cache, and the blob SHA it would commit — followed by a checksum of the file itself. git add doesn't touch the object graph beyond writing a blob; it edits this one file.

This is why staging is fast even on huge repos: git status mostly compares cached stat info in the index against the filesystem, only re-hashing files whose mtime/size actually changed.
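The fixed-size header and trailing checksum of the index are easy to read directly. A minimal sketch, assuming a SHA-1 repository in which checksum writing has not been disabled; the function names are hypothetical, and the entries themselves are not parsed.

```python
import hashlib
import struct
from typing import Tuple


def read_index_header(data: bytes) -> Tuple[int, int]:
    """Return (format version, entry count) from the 12-byte index header."""
    signature, version, entry_count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":  # the index signature
        raise ValueError("not a Git index file")
    return version, entry_count


def index_checksum_ok(data: bytes) -> bool:
    """The last 20 bytes are a SHA-1 over everything that precedes them."""
    return hashlib.sha1(data[:-20]).digest() == data[-20:]
```

Typical use would be `read_index_header(Path(".git/index").read_bytes())`, with the entry count matching the number of tracked paths.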

Build a commit entirely by hand (no porcelain)

The clearest way to internalize "porcelain is just choreography" is to do a commit's job yourself with plumbing:

what `git add` + `git commit` do, spelled out

$ echo "hello" > file.txt
$ BLOB=$(git hash-object -w file.txt)              # writes a blob object, prints its SHA
$ git update-index --add --cacheinfo 100644 $BLOB file.txt   # stage it (edit .git/index)
$ TREE=$(git write-tree)                             # index → tree object
$ COMMIT=$(git commit-tree $TREE -m "manual commit") # tree (+ parent, if any) → commit object
$ git update-ref refs/heads/main $COMMIT              # move the branch pointer — that's the "commit"

Four plumbing calls build the objects, and a fifth (update-ref) moves the pointer. git commit is essentially this sequence, plus passing the previous HEAD to commit-tree as -p <parent> and appending a reflog entry.

Detached HEAD isn't dangerous, it's just unreachable

Checking out a commit or tag directly (git checkout <sha>) puts a raw SHA into .git/HEAD instead of a branch reference. You can commit, build, and test normally — those new commits are entirely valid objects. The only real risk: since no branch ref points at them, switching away leaves them unreachable and eligible for garbage collection (§8) unless you first anchor them with git branch <name> or git switch -c <name>.

§5
The commit DAG & merge base

Why the graph can only be acyclic

Each commit stores its parent(s) by hash, and a hash can only be computed after the thing it names already exists. A commit literally cannot reference a descendant, because that descendant's hash — which would have to include this commit as its parent — doesn't exist yet. The acyclic property isn't enforced by a rule Git checks; it falls directly out of hashing being one-directional in time.

Merge base: the reference point for a three-way merge

When you run git merge feature from main, Git first walks the DAG backward from both tips to find their best common ancestor. Take a history where feature branched off main at commit B, main's tip is C, and feature's tip is F: the merge base is B. Git then diffs base→ours (B→C) and base→theirs (B→F), applies non-overlapping changes automatically, and flags overlapping changes to the same region as a conflict. Inspect it yourself with git merge-base main feature.

Criss-cross merges & the "virtual" merge base

If two branches have merged each other more than once, there can be multiple lowest common ancestors with no single "best" one. Git's default recursive/ort strategy handles this by first merging the candidate ancestors with each other to synthesize one virtual base, then doing the normal three-way merge against that. Most people never notice this machinery — until a merge conflicts in a spot that seems to have "nothing to do" with either branch's real changes, which is usually this case.
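Both cases, a single best ancestor and a criss-cross with two, can be reproduced on a toy graph. A minimal sketch: the commit names, the parent map, and the helper functions are all hypothetical, and the quadratic search is chosen for clarity rather than to mirror how Git walks its commit graph.

```python
from typing import Dict, List, Set

Parents = Dict[str, List[str]]  # commit -> list of parent commits


def ancestors(parents: Parents, tip: str) -> Set[str]:
    """Every commit reachable from tip via parent pointers, tip included."""
    seen: Set[str] = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen


def merge_bases(parents: Parents, ours: str, theirs: str) -> Set[str]:
    """Common ancestors that are not themselves ancestors of another common ancestor."""
    common = ancestors(parents, ours) & ancestors(parents, theirs)
    return {
        c for c in common
        if not any(c != other and c in ancestors(parents, other) for other in common)
    }


# main: A - B - C, feature: B - E - F
simple = {"A": [], "B": ["A"], "C": ["B"], "E": ["B"], "F": ["E"]}
print(merge_bases(simple, "C", "F"))  # {'B'}

# Criss-cross: each side has merged the other's earlier tip.
criss_cross = {"O": [], "X": ["O"], "Y": ["O"], "M1": ["X", "Y"], "M2": ["Y", "X"]}
print(sorted(merge_bases(criss_cross, "M1", "M2")))  # ['X', 'Y']
```

The second result is the situation where Git has to synthesize a virtual base: two equally good candidates and no single winner.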

Ancestry shorthand: ~ vs. ^

HEAD~N: walk N steps back along first parents only. Meaningful for any commit, since it ignores side branches.
HEAD^N: the Nth parent of a commit. N ≥ 2 is only meaningful at a merge (the only kind of commit with more than one parent); HEAD^1 = HEAD^ = HEAD~1 works on any commit that has a parent.
HEAD^2 on a merge commit M whose parents are C and F, in that order = F (the second parent, i.e. the branch that got merged in).
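The two operators reduce to two tiny functions over a parent map. A minimal sketch on a hypothetical graph (commit E between B and F is invented for illustration); the helper names are not Git APIs.

```python
from typing import Dict, List

Parents = Dict[str, List[str]]  # commit -> ordered list of parents


def nth_parent(parents: Parents, commit: str, n: int = 1) -> str:
    """commit^n: pick the n-th parent (1-based); commit^0 is the commit itself."""
    if n == 0:
        return commit
    return parents[commit][n - 1]  # IndexError if there is no such parent


def nth_ancestor(parents: Parents, commit: str, n: int = 1) -> str:
    """commit~n: follow the first parent n times."""
    for _ in range(n):
        commit = nth_parent(parents, commit, 1)
    return commit


# main: A - B - C, feature: B - E - F, and M merges F into C.
graph = {"A": [], "B": ["A"], "C": ["B"], "E": ["B"], "F": ["E"], "M": ["C", "F"]}

print(nth_parent(graph, "M", 2))                          # F  (M^2)
print(nth_ancestor(graph, "M", 2))                        # B  (M~2)
print(nth_ancestor(graph, nth_parent(graph, "M", 2), 1))  # E  (M^2~1)
```

Chaining works the same way as in Git's revision syntax: each operator is applied to the result of the previous one.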

§6
Porcelain → plumbing: what commands really do

Every command, decoded at the object/ref level

Command	What actually happens
`git add <file>`	`hash-object` writes a blob for the file's current bytes; the index gets a new/updated entry (path, mode, blob SHA). No commit, no tree, yet.
`git commit`	`write-tree` turns the index into a tree object (nested trees for subdirectories) → `commit-tree` wraps it with the old HEAD as parent → `update-ref` moves the branch to the new commit → a reflog line is appended.
`git branch <name>`	`update-ref refs/heads/<name> $(rev-parse HEAD)` — one new 41-byte file. Index and working directory are untouched.
`git switch/checkout <branch>`	HEAD becomes a symbolic ref to `refs/heads/<branch>`; index and working directory are updated to match that branch tip's tree. Uncommitted local changes are carried over, and the switch is refused if they would be overwritten.
`git checkout <commit>`	Same tree checkout, but HEAD becomes a raw SHA — detached HEAD (§4).
`git merge` (fast-forward)	Target is a descendant of current HEAD — the branch ref is simply reassigned forward. No new commit object is created.
`git merge` (true merge)	Merge-base found (§5), three-way diff computed, a new commit object is created with two parents, branch ref moves to it.
`git rebase <base>`	Each source commit is replayed (cherry-picked) onto the new parent in turn — every one becomes a brand-new commit object with a new hash (§7), because the parent changed. Branch ref is moved to the final new commit only at the end.
`git tag -a`	Writes a tag object pointing at the target + a ref file at `refs/tags/<name>`.
`git stash`	Creates commit objects (working-tree state, and index state) that are not reachable from any branch — referenced only via `refs/stash` and the stash reflog.

git reset's three modes, precisely

Mode	Moves branch ref	Resets index	Resets working dir
`--soft`	✅	❌	❌
`--mixed` (default)	✅	✅ (unstages)	❌
`--hard`	✅	✅	✅ (discards edits)

In every mode the commits you "reset away from" aren't deleted — they simply become unreachable from the branch ref, recoverable via the reflog (§8) until it expires or gc prunes them.
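The table can be read as three nested levels of "how far does the reset reach". The toy model below is a sketch only: the `Repo` class, its fields, and the use of plain strings in place of blob SHAs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict

Tree = Dict[str, str]  # path -> content (a stand-in for path -> blob SHA)


@dataclass
class Repo:
    commits: Dict[str, Tree]  # commit id -> snapshot; never modified by reset
    branch_tip: str
    index: Tree = field(default_factory=dict)
    worktree: Tree = field(default_factory=dict)

    def reset(self, target: str, mode: str = "mixed") -> None:
        if mode not in ("soft", "mixed", "hard"):
            raise ValueError(f"unknown mode: {mode}")
        self.branch_tip = target  # every mode moves the branch ref
        if mode in ("mixed", "hard"):
            self.index = dict(self.commits[target])  # unstage
        if mode == "hard":
            self.worktree = dict(self.commits[target])  # discard edits
```

Note that `reset` never removes anything from `commits`: the commit that was reset away from is still stored, just no longer the branch tip.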

§7
Why rebase changes every hash downstream

A commit's hash is a function of its parent's hash

A commit object's bytes include the literal text of its parent SHA. Change the parent — even with the tree and message held perfectly identical — and the commit's own hash changes, because the input to the hash function changed. Every commit downstream references this commit's hash as its parent, so the change cascades through the entire remainder of the branch.

Before git rebase main (feature branched from B):

main:     A → B → C (main tip)
feature:  B → D (abc123) → E (def456)    (parent of D = B)

After git rebase main while on feature:

A → B → C (new parent) → D' (9f1a2c, new SHA) → E' (77bb01, new SHA)

D and E still exist in the object database, unreachable from any ref now that feature points at E', until reflog expiry / git gc prunes them (§8). Meanwhile, everyone who already pulled D/E still has branches built on the old hashes, which is exactly why rebasing shared/published history breaks collaborators: their branch's parent chain no longer matches the rewritten one.
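The cascade can be demonstrated by serializing commit objects by hand. A minimal sketch, assuming SHA-1; `commit_id` is a hypothetical helper, the identity string is made up, and the author/committer lines are held identical on purpose so that the parent is the only thing that varies (a real rebase would also refresh the committer timestamp).

```python
import hashlib
from typing import List

IDENT = "A U Thor <author@example.com> 1700000000 +0000"  # made-up identity
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"   # Git's empty tree


def commit_id(tree: str, parents: List[str], message: str, ident: str = IDENT) -> str:
    """Serialize a commit object and hash it the way any Git object is hashed."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {ident}", f"committer {ident}", "", message]
    body = ("\n".join(lines) + "\n").encode()
    return hashlib.sha1(f"commit {len(body)}\0".encode() + body).hexdigest()


b = commit_id(EMPTY_TREE, [], "B")
c = commit_id(EMPTY_TREE, [b], "C")

# Original feature branch: D on top of B, E on top of D.
d = commit_id(EMPTY_TREE, [b], "D")
e = commit_id(EMPTY_TREE, [d], "E")

# "Rebased" branch: same tree, same message, same identity, new parent.
d_new = commit_id(EMPTY_TREE, [c], "D")
e_new = commit_id(EMPTY_TREE, [d_new], "E")

print(d != d_new, e != e_new)  # True True
```

E was not touched directly, yet its ID changes too, because its only parent line now names D' instead of D.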

Fast-forward vs. true merge — a graph-shape decision, not a preference

Git doesn't "choose" to fast-forward stylistically — it's forced whenever the target is a straight-line descendant of the current tip (no divergence to reconcile), and impossible otherwise. git merge --no-ff exists specifically to force a merge commit even when a fast-forward is possible, purely to keep a visible marker of "a branch merged here" in the graph.
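Since the decision depends only on graph shape, it can be written as two ancestry checks. A minimal sketch over a hypothetical parent map; `is_ancestor` and `merge_kind` are illustrative names, and the --no-ff override is left out.

```python
from typing import Dict, List

Parents = Dict[str, List[str]]  # commit -> list of parent commits


def is_ancestor(parents: Parents, maybe_ancestor: str, tip: str) -> bool:
    """True if maybe_ancestor is reachable from tip (a commit counts as its own ancestor)."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit == maybe_ancestor:
            return True
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return False


def merge_kind(parents: Parents, current: str, target: str) -> str:
    if is_ancestor(parents, target, current):
        return "already up to date"  # target is already in our history
    if is_ancestor(parents, current, target):
        return "fast-forward"        # straight line: just move the ref
    return "true merge"              # histories diverged: new two-parent commit


graph = {"A": [], "B": ["A"], "C": ["B"], "F": ["B"]}
print(merge_kind(graph, "B", "C"))  # fast-forward
print(merge_kind(graph, "C", "F"))  # true merge
```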

§8
Packfiles, garbage collection & the reflog

Reachability is the whole GC model

Git never "deletes on undo." An object is kept as long as it's reachable — findable by walking parent/tree/blob pointers starting from some ref (a branch, tag, stash, or a reflog entry). Reset, rebase, branch deletion, and amend all just make certain commits unreachable from normal refs; the bytes remain on disk until an explicit collection pass decides otherwise.
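The reachability rule is a plain graph traversal from the refs. A minimal sketch of that mark phase on a toy object store; the object names and the `reachable` helper are hypothetical, and each object simply lists the IDs it points at (a commit points at its tree and parents, a tree at its blobs and subtrees).

```python
from typing import Dict, Iterable, List, Set


def reachable(objects: Dict[str, List[str]], roots: Iterable[str]) -> Set[str]:
    """Collect every object that can be reached from the given root IDs."""
    seen: Set[str] = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid not in seen:
            seen.add(oid)
            stack.extend(objects[oid])
    return seen


objects = {
    "blob1": [], "blob2": [],
    "tree1": ["blob1"],
    "tree2": ["blob1", "blob2"],  # blob1 is shared between both snapshots
    "c1": ["tree1"],
    "c2": ["tree2", "c1"],
}

# After resetting main from c2 back to c1, only c1's objects hang off the branch.
from_branch = reachable(objects, ["c1"])
print(sorted(set(objects) - from_branch))  # ['blob2', 'c2', 'tree2']

# A reflog entry still naming c2 acts as an extra root and keeps everything alive.
print(set(objects) - reachable(objects, ["c1", "c2"]))  # set()
```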

git gc & pruning — the actual defaults

gc.auto = 6700 — once loose objects exceed roughly this count, ordinary commands opportunistically trigger git gc --auto to pack them.
gc.autoPackLimit = 50 — once there are more than this many packs, they get consolidated into one.
gc.pruneExpire = "2 weeks ago" — unreachable loose objects younger than this are kept (as a safety margin) even during an explicit gc.

The reflog: your real local undo history

Every time a ref (HEAD, a branch) moves on your machine, Git appends a line to that ref's reflog — a purely local, never-pushed journal of "where this pointer has been." It's the practical safety net under reset/rebase/amend: the old commit is still there, you just need its former reflog position.

gc.reflogExpire = 90 days for entries still reachable some other way.
gc.reflogExpireUnreachable = 30 days for entries that are not reachable from any current ref (the common case after --amend/rebase).
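Reflogs are plain text files under .git/logs, one line per ref movement, so the journal can be parsed directly. A minimal sketch: the names are hypothetical, the sample identity is made up, and `is_expired` is only a rough reading of the two defaults above, not Git's actual expiry logic.

```python
from typing import NamedTuple


class ReflogEntry(NamedTuple):
    old: str        # SHA the ref pointed at before the move
    new: str        # SHA it points at afterwards
    who: str        # "Name <email>"
    timestamp: int  # seconds since the epoch
    tz: str
    message: str


def parse_reflog_line(line: str) -> ReflogEntry:
    """Split '<old> <new> <who> <timestamp> <tz>\\t<message>'."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old, new, rest = meta.split(" ", 2)
    who, timestamp, tz = rest.rsplit(" ", 2)
    return ReflogEntry(old, new, who, int(timestamp), tz, message)


def is_expired(entry: ReflogEntry, now: int, still_reachable: bool) -> bool:
    """Apply the 90-day / 30-day defaults to a single entry."""
    limit_days = 90 if still_reachable else 30
    return now - entry.timestamp > limit_days * 86400
```

Each parsed entry gives the `old` SHA that a recovery command such as git branch recovered <sha> needs.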

recover a commit that "disappeared" after reset --hard / a bad rebase

$ git reflog                      # find the entry just before the mistake, e.g. HEAD@{2}
$ git branch recovered HEAD@{2}  # anchor it to a real ref so it survives gc
# or, if you don't even remember a ref name — search every dangling object directly:
$ git fsck --no-reflogs --unreachable --full | grep commit
$ git show <dangling-sha>

§9
Edge cases & advanced internals

SHA-1 → SHA-256: where the transition actually stands

Git has used a collision-detecting SHA-1 variant (hardened against the 2017 "SHAttered" attack) since 2017, so practical collision forgery against a Git repo isn't a live threat. A parallel SHA-256 object format exists (git init --object-format=sha256), and Git 2.51 marked it as the default hash for the planned Git 3.0. As of mid-2026, though, SHA-256 support on major hosting platforms (GitHub, GitLab) is still missing or experimental, so the transition is gated on ecosystem support, not the Git client. Interoperation between SHA-1 and SHA-256 repos (a SHA-256 client pushing/fetching against a SHA-1 server) is part of the transition design but depends on tooling catching up.

Shallow & partial clones deliberately break the graph

git clone --depth N fetches only the most recent N commits and synthesizes a "grafted" boundary — older parents genuinely don't exist locally. Anything that needs to walk past that boundary (full blame, some rebases, bisect across old history) fails or behaves oddly until git fetch --unshallow. --filter=blob:none (partial clone) takes the opposite approach: the full commit/tree graph is present, but blob content is fetched lazily on first access — different trade-off, same principle of deferring part of the object graph.

What a signature actually covers

Signing a commit or tag (-S, GPG or SSH) signs the object's own canonical serialized bytes (tree/parent/author/committer/message for a commit), everything except the signature itself. The signature is then stored inside the object: in a gpgsig header for a commit, appended to the message for a tag. It authenticates that exact snapshot and metadata, not "the diff the author intended"; git verify-commit / git log --show-signature strip the signature back out and check it against the remaining bytes.

Worktrees: multiple working directories, one object database

git worktree add ../hotfix hotfix-branch creates a second working directory + index, both pointed at the same .git/objects and refs as the original. It solves "I need two branches checked out simultaneously" without the disk/network cost of a second clone — a direct consequence of the working directory being just one of the three separate areas in §1.

Sparse checkout: the working directory doesn't have to mirror the whole tree

git sparse-checkout set <directories> (cone mode) restricts which paths get materialized into the working directory, while the object database and full commit history remain complete. By default the index still lists every path and marks the excluded ones skip-worktree. Useful on monorepos where checking out every path would be prohibitively slow: the object graph is unaffected, only what gets written to disk for you to see.

Common misconceptions & anti-patterns

"Commits store diffs." They store full tree snapshots; diffs are always computed on the fly for display.

Rebasing published/shared history. Every rebased commit gets a new hash, so collaborators who already have the originals end up with a diverged history: merging it back duplicates commits and often re-raises conflicts that were already resolved.

Force-pushing without --force-with-lease. Plain --force blindly overwrites the remote ref; --force-with-lease refuses if the remote moved since your last fetch.

".gitignore untracks files." It only affects untracked files. Already-tracked files need git rm --cached first.

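Assuming reset --hard destroys data instantly. The commits only become unreachable from the branch; they stay in the object database, recoverable via reflog for weeks (§8). Working-directory edits that were never added or committed are the exception: those were never stored as objects and are gone.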

Trying to purge a leaked secret with a normal commit. It still lives in every prior tree/blob object. Real removal needs history rewriting (git filter-repo/BFG) + force-push + everyone re-clones + rotate the credential.

Confusing fetch and pull at the ref level. fetch only updates refs/remotes/*; it never touches your branches or working dir. pull = fetch + merge/rebase into your current branch.

Treating a shallow clone as a full repository. Ancestor commits genuinely aren't present locally — blame/bisect/rebase across that boundary fail until git fetch --unshallow.

Fearing detached HEAD. It's a perfectly valid state; just anchor new work with git branch <name> before switching away, or it becomes GC-eligible.

Assuming duplicate files cost double storage. Content addressing means identical bytes hash to the same blob — stored once, referenced by as many tree entries as need it.
