Git Internals: How Objects, the DAG, Packfiles, and the Reflog Actually Work

The Four Objects That Are All of Git

Mental Model

Git is not a "file versioning system." It is a content-addressable filesystem with a version-control interface on top. Once you understand the object model, most Git commands become easy to reason about.

Most engineers use Git for years without understanding what it actually stores. That lack of understanding is why git rebase feels scary and why git reflog feels magical. This lesson demystifies the internals so you can reason about Git rather than just executing commands from muscle memory.

Object Type 1: Blob — The File Content

A blob stores the raw content of a file — nothing else. No filename. No path. No permissions.

# Create a blob manually (echo appends a newline, which is part of the hashed content)
echo 'Hello, World!' | git hash-object -w --stdin
# Returns: 8ab686eafeb1f44702738c8b0f24f2567c36da6d

# Read it back
git cat-file -p 8ab686eafeb1f44702738c8b0f24f2567c36da6d
# Hello, World!

The key insight: two files with identical content have the same blob hash, whatever their names or directories, so Git stores that content once. Renaming a 1MB file therefore adds no new blob; the commit only writes new tree and commit objects, which are small.

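The hash is the SHA-1 of "blob <content-length>\0<content>", where the length is the content size in bytes, written in decimal. This means:

Same content = same hash, always
Different content = different hash (barring a SHA-1 collision, which does not happen by accident)
The hash is the identity of the content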
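
The hashing rule above can be reproduced outside Git. The following is a minimal Python sketch, assuming a SHA-1 repository; the function names are illustrative and not part of any Git API. It also shows how a loose object is laid out on disk: the same header plus content, zlib-compressed, under a directory named after the first two hex characters of the ID.

```python
import hashlib
import zlib


def git_object_bytes(obj_type: str, content: bytes) -> bytes:
    """Return the bytes Git hashes: '<type> <size>' + NUL byte + content."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return header + content


def git_hash(obj_type: str, content: bytes) -> str:
    """SHA-1 object ID for the given type and content, as 40 hex characters."""
    return hashlib.sha1(git_object_bytes(obj_type, content)).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Loose objects live under a directory named after the first two hex digits."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


def loose_object_payload(obj_type: str, content: bytes) -> bytes:
    """Loose objects are stored zlib-compressed, header included."""
    return zlib.compress(git_object_bytes(obj_type, content))


if __name__ == "__main__":
    blob_id = git_hash("blob", b"hello\n")
    print(blob_id)                      # the ID `echo hello | git hash-object --stdin` reports
    print(loose_object_path(blob_id))
```

Because the file name never enters the hash, two paths with the same bytes resolve to the same object ID.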

Object Type 2: Tree — The Directory Listing

A tree is Git's representation of a directory. It maps file names and permissions to blob hashes (for files) or other tree hashes (for subdirectories).

# Inspect the tree at HEAD
git cat-file -p HEAD^{tree}
# 100644 blob a8c6b38...  .gitignore
# 100644 blob 2d3f5e1...  README.md
# 040000 tree 9b7c3a0...  src
# 040000 tree 1d8f4c2...  tests

Reading the output:

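100644 = regular, non-executable file (100755 marks an executable; Git tracks only the executable bit, not full permissions)
040000 = directory, stored as a subtree (directories carry no permission bits in Git)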
blob or tree = the object type
Then the hash and the name

The recursive structure: Trees point to blobs and other trees. The root tree of a commit represents the entire state of the repository at that moment — a complete, self-contained snapshot.
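
To make the recursive structure concrete, here is a hedged Python sketch of how a tree object is serialized and hashed. It assumes SHA-1 object IDs; the file names and contents are made up for illustration. Each entry is the mode, a space, the name, a NUL byte, and the 20 raw bytes of the referenced object's ID. One detail worth knowing: directory modes are stored as 40000, and git cat-file -p pads them to 040000 for display.

```python
import hashlib


def git_hash(obj_type: str, content: bytes) -> str:
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def tree_content(entries: list[tuple[str, str, str]]) -> bytes:
    """Serialize (mode, name, object_id_hex) entries as a tree object body."""
    def sort_key(entry):
        mode, name, _ = entry
        # Git orders entries by name, comparing directories as if they ended in '/'.
        return (name + "/" if mode == "40000" else name).encode("utf-8")

    out = b""
    for mode, name, object_id in sorted(entries, key=sort_key):
        out += f"{mode} {name}".encode("utf-8") + b"\x00" + bytes.fromhex(object_id)
    return out


def tree_hash(entries: list[tuple[str, str, str]]) -> str:
    return git_hash("tree", tree_content(entries))


# Hypothetical layout: README.md at the root, src/app.py one level down.
readme = git_hash("blob", b"# Demo\n")
app_v1 = git_hash("blob", b"print('v1')\n")
app_v2 = git_hash("blob", b"print('v2')\n")

src_v1 = tree_hash([("100644", "app.py", app_v1)])
src_v2 = tree_hash([("100644", "app.py", app_v2)])
root_v1 = tree_hash([("100644", "README.md", readme), ("40000", "src", src_v1)])
root_v2 = tree_hash([("100644", "README.md", readme), ("40000", "src", src_v2)])

print(root_v1 != root_v2)  # True: editing src/app.py changes src and the root tree
```

The README.md blob is shared by both root trees, so only the changed blob and the trees on the path to it are new objects.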

Object Type 3: Commit — The Snapshot with Context

A commit is a pointer to a tree (the repo state) plus metadata:

git cat-file -p HEAD
# tree 9b7c3a0f8e1d4c2b5a7f0e3d6b9c2a1e8f0d3c6b
# parent a4b2c8e1d3f5a7b9c2e4f6a8b0d2e4f6a8b0d2e
# author Sachin Sarawgi <sachin@example.com> 1716190000 +0530
# committer Sachin Sarawgi <sachin@example.com> 1716190000 +0530
#
# fix: resolve payment timeout in CheckoutService

The parent field is everything: it points to the previous commit, creating the chain of history. A merge commit has two or more parent hashes; that is all that makes a merge commit structurally different from a regular commit.

Initial commit: no parent
Regular commit: one parent
Merge commit:   two or more parents (more than two is an "octopus" merge)

Why history is immutable: If you change any file in a commit, the blob hash changes. That changes the tree hash. That changes the commit hash. That changes the next commit's parent hash. Every downstream hash changes. This is the cryptographic chain that makes Git history trustworthy.
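
The hash chain can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Git's implementation: the identity string is a placeholder, the trees are synthetic, and real commits may carry extra headers (such as gpgsig) that are omitted here.

```python
import hashlib


def git_hash(obj_type: str, content: bytes) -> str:
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def commit_content(tree: str, parents: list[str], message: str,
                   ident: str = "A U Thor <author@example.com> 1716190000 +0530") -> bytes:
    """Build the text of a commit object: tree, parents, author, committer, message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]   # zero, one, or several parent lines
    lines += [f"author {ident}", f"committer {ident}", "", message]
    return ("\n".join(lines) + "\n").encode("utf-8")


def commit_hash(tree: str, parents: list[str], message: str) -> str:
    return git_hash("commit", commit_content(tree, parents, message))


EMPTY_TREE = git_hash("tree", b"")   # ID of a tree with no entries
blob_id = git_hash("blob", b"x\n")
OTHER_TREE = git_hash("tree", b"100644 f.txt\x00" + bytes.fromhex(blob_id))

root_a = commit_hash(EMPTY_TREE, [], "initial")
child_a = commit_hash(EMPTY_TREE, [root_a], "second")

# "Edit" the first commit by giving it a different tree: its ID changes,
# and so does the ID of the child that names it as parent.
root_b = commit_hash(OTHER_TREE, [], "initial")
child_b = commit_hash(EMPTY_TREE, [root_b], "second")

print(root_a != root_b, child_a != child_b)
```

The child's tree and message are identical in both runs; only the parent line differs, and that alone is enough to give it a new ID.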

Object Type 4: Tag — The Named Commit Pointer

An annotated tag is a fourth type of Git object. It wraps a commit with a name, date, tagger identity, and message:

git cat-file -p v2.1.0
# object 9d3b7c2a1e5f8d0b3c6e9a2f5b8d1c4e7a0f3b6
# type commit
# tag v2.1.0
# tagger Sachin Sarawgi <sachin@example.com> 1716190000 +0530
#
# Release 2.1.0: Virtual thread support + security hardening

A lightweight tag (created with git tag v2.1.0 without -a) is just a ref file pointing to a commit — no object is created. Use annotated tags for releases; lightweight tags for local bookmarks.

How Branches Work (They're Just Files)

Here is what "create a branch" actually does:

git checkout -b feature/new-auth
# Creates: .git/refs/heads/feature/new-auth
# Contents: 9d3b7c2a1e5f8d0b3c6e9a2f5b8d1c4e7a0f3b6

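That's it. A branch is a 41-byte text file: the 40 hex characters of a commit hash plus a newline. Creating a branch is instant because it only writes those 41 bytes, no matter how large the repository is. This is why Git branches are cheap compared with version control systems that implement a branch as a copy of the directory tree. (After git gc or git pack-refs, the same ref may be stored as a line in .git/packed-refs instead of in its own file.)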

HEAD is also just a file:

cat .git/HEAD
# ref: refs/heads/feature/new-auth

# In "detached HEAD" state:
# 9d3b7c2a1e5f8d0b3c6e9a2f5b8d1c4e7a0f3b6

When you git checkout main, Git:

Updates .git/HEAD to point to refs/heads/main
Reads the commit hash from refs/heads/main
Reads the tree from that commit
Updates the index and the working directory to match that tree
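
The first two steps, reading HEAD and following it to a commit hash, can be sketched in Python. This is a simplified model, assuming the classic files-based ref storage: it checks a loose ref file first and then .git/packed-refs, and ignores worktrees and other ref backends. The function names and the throwaway directory are illustrative only.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_ref(git_dir: Path, ref_name: str) -> Optional[str]:
    """Look up a ref as a loose file first, then in the packed-refs file."""
    loose = git_dir / ref_name
    if loose.is_file():
        return loose.read_text().strip()
    packed = git_dir / "packed-refs"
    if packed.is_file():
        for line in packed.read_text().splitlines():
            # Skip the header comment, '^' lines (peeled tag targets), and blanks.
            if line.startswith(("#", "^")) or not line.strip():
                continue
            object_id, name = line.split(" ", 1)
            if name == ref_name:
                return object_id
    return None


def resolve_head(git_dir: Path) -> tuple[Optional[str], Optional[str]]:
    """Return (branch_ref, commit_id); branch_ref is None when HEAD is detached."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        ref_name = head[len("ref: "):]
        return ref_name, read_ref(git_dir, ref_name)
    return None, head


# Demo on a throwaway directory laid out like a .git folder (IDs are placeholders).
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "refs" / "heads").mkdir(parents=True)
(demo_dir / "refs" / "heads" / "main").write_text("a" * 40 + "\n")
(demo_dir / "HEAD").write_text("ref: refs/heads/main\n")
attached = resolve_head(demo_dir)

# Same branch, but stored in packed-refs instead of a loose file.
(demo_dir / "refs" / "heads" / "main").unlink()
(demo_dir / "packed-refs").write_text(
    "# pack-refs with: peeled fully-peeled sorted\n" + "c" * 40 + " refs/heads/main\n")
packed = resolve_head(demo_dir)

# Detached HEAD: the file holds a commit ID directly.
(demo_dir / "HEAD").write_text("b" * 40 + "\n")
detached = resolve_head(demo_dir)

print(attached, packed, detached)
```

In a real repository, git rev-parse HEAD performs this resolution and handles the cases this sketch leaves out.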

The Directed Acyclic Graph (DAG)

Git's history is a directed acyclic graph: nodes are commits, edges point from each commit to its parent(s), and it is acyclic because a commit's hash is computed from its parents' hashes, so a commit can only reference commits that already exist.

gitGraph
   commit id: "Initial commit"
   commit id: "Add auth"
   branch feature/payments
   checkout feature/payments
   commit id: "Add PaymentService"
   commit id: "Add PaymentController"
   checkout main
   commit id: "Fix auth bug"
   merge feature/payments id: "Merge payments"
   commit id: "Release v2.0"

Key properties of the Git DAG:

Every commit knows its parent(s) — history is always traversable backward
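No commit knows its children, so finding descendants means walking backward from the refs until the commit is reached (this is why commands such as git branch --contains can be slow on large repos)
Merges create more than one path from a commit back to the root, and a repository can have several root commits; the guarantee is only that following parent pointers never loops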
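
A small Python model of the example history above makes these properties easy to check. It is a sketch, with commit messages standing in for hashes: walking to ancestors follows stored parent pointers, finding children requires scanning every commit, and the last helper counts the distinct parent chains from a commit back to the root.

```python
from collections import defaultdict

# Parent pointers for the example history (names stand in for commit hashes).
PARENTS = {
    "Initial commit": [],
    "Add auth": ["Initial commit"],
    "Add PaymentService": ["Add auth"],
    "Add PaymentController": ["Add PaymentService"],
    "Fix auth bug": ["Add auth"],
    "Merge payments": ["Fix auth bug", "Add PaymentController"],
    "Release v2.0": ["Merge payments"],
}


def ancestors(commit: str) -> set[str]:
    """Walk parent pointers backward; this is the cheap direction."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        for parent in PARENTS[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def children_index() -> dict[str, list[str]]:
    """Children are not stored, so they are derived by scanning every commit."""
    index = defaultdict(list)
    for commit, parents in PARENTS.items():
        for parent in parents:
            index[parent].append(commit)
    return dict(index)


def count_paths_to_root(commit: str) -> int:
    """Number of distinct parent chains from a commit back to a root commit."""
    if not PARENTS[commit]:
        return 1
    return sum(count_paths_to_root(p) for p in PARENTS[commit])


print(sorted(children_index()["Add auth"]))   # ['Add PaymentService', 'Fix auth bug']
print(count_paths_to_root("Release v2.0"))    # 2: one chain through each merge parent
```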

Packfiles: How Git Stays Fast

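Individual object files (called "loose objects") are efficient to write but inefficient to read in bulk. When you run git gc (which Git also triggers automatically), loose objects are packed into packfiles. Pushing and fetching use the same format: objects travel over the network as a pack.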

# See what's in your packfile
ls .git/objects/pack/
# pack-7a3b9c2e1f4d8b0e3f6a9c2e5b8d1c4e7a0f3b6.idx
# pack-7a3b9c2e1f4d8b0e3f6a9c2e5b8d1c4e7a0f3b6.pack

# Verify it and list its objects (verify-pack is documented to take the .idx file)
git verify-pack -v .git/objects/pack/pack-*.idx | head -20

How packfiles work:

Objects are sorted by type, then by file name and size, so likely delta candidates end up next to each other
Similar objects are delta-compressed against each other (the diff is stored, not the full object)
The .idx file is an index that maps SHA-1 hashes to byte offsets in the .pack file
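
The lookup that the .idx file enables can be modeled in memory. The sketch below is a simplified Python model rather than a parser for the real binary format: it keeps the object IDs sorted, builds a 256-entry fanout table of cumulative counts keyed by the first byte of the ID, and binary-searches only within the slice the fanout selects. The object IDs and offsets are made up.

```python
import bisect
import hashlib


class PackIndexModel:
    """In-memory model of a pack index: sorted IDs, a 256-entry fanout, offsets."""

    def __init__(self, offsets_by_id: dict[bytes, int]):
        self.ids = sorted(offsets_by_id)                  # 20-byte binary object IDs
        self.offsets = [offsets_by_id[i] for i in self.ids]
        # fanout[b] = number of IDs whose first byte is <= b (cumulative counts).
        self.fanout = [0] * 256
        for object_id in self.ids:
            self.fanout[object_id[0]] += 1
        for b in range(1, 256):
            self.fanout[b] += self.fanout[b - 1]

    def find_offset(self, object_id: bytes):
        """Return the byte offset in the .pack file, or None if the ID is absent."""
        first = object_id[0]
        lo = self.fanout[first - 1] if first > 0 else 0   # narrow the range by first byte
        hi = self.fanout[first]
        pos = bisect.bisect_left(self.ids, object_id, lo, hi)
        if pos < hi and self.ids[pos] == object_id:
            return self.offsets[pos]
        return None


# Made-up objects and offsets, purely for illustration.
demo = {hashlib.sha1(str(n).encode()).digest(): 12 + 100 * n for n in range(1000)}
index = PackIndexModel(demo)
some_id = hashlib.sha1(b"42").digest()
print(index.find_offset(some_id))   # 4212
```

The fanout step shrinks the search range to roughly 1/256 of the index before the binary search starts, which keeps lookups fast even in packs with millions of objects.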

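The result: sequential versions of the same file are stored as deltas, so a packed repository is usually far smaller than the sum of its file versions. As a purely illustrative figure, a repository with 100,000 commits and 1 million file versions might use 2GB on disk instead of 200GB.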
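
The idea behind delta storage can be shown with a toy encoder. Git's real deltas are a compact binary stream of copy-from-base and insert-literal instructions; the Python sketch below keeps those two instruction kinds but uses plain tuples and difflib to find matches, so the encoding and the matching strategy are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list[tuple]:
    """Describe target as copy-from-base and insert-literal instructions."""
    delta = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2 - i1))       # reuse bytes already in the base
        elif j2 > j1:
            delta.append(("insert", target[j1:j2]))   # new bytes carried in the delta
    return delta


def apply_delta(base: bytes, delta: list[tuple]) -> bytes:
    """Rebuild the target from the base object plus the delta instructions."""
    out = bytearray()
    for instruction in delta:
        if instruction[0] == "copy":
            _, offset, length = instruction
            out += base[offset:offset + length]
        else:
            out += instruction[1]
    return bytes(out)


def literal_bytes(delta: list[tuple]) -> int:
    """Bytes of new data the delta must carry (copy instructions carry none)."""
    return sum(len(ins[1]) for ins in delta if ins[0] == "insert")


# Two versions of a made-up file that differ in a single line.
v1 = b"".join(b"line %d of the file\n" % n for n in range(60))
v2 = v1.replace(b"line 30 of", b"LINE 30 OF")
delta = make_delta(v1, v2)

print(apply_delta(v1, delta) == v2)        # True: the delta reproduces version 2
print(len(v2), literal_bytes(delta))       # full size vs. new bytes stored in the delta
```

Storing the second version costs a handful of literal bytes plus a few copy instructions instead of the whole file again.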

The Reflog: Your Time Machine

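The reflog records every time HEAD or a branch reference moves in your local repository. It is your safety net for "I accidentally deleted a commit" and "I rebased wrong" scenarios. It only covers committed work: changes that were never committed or stashed are not in it.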

# See the reflog for HEAD
git reflog
# 9d3b7c2 (HEAD -> main) HEAD@{0}: commit: fix payment timeout
# a4b2c8e HEAD@{1}: merge feature/payments: Merge made by 'ort' strategy
# 7f1d3a5 HEAD@{2}: checkout: moving from feature/payments to main
# 2e8c4b0 HEAD@{3}: commit: Add PaymentController
# 1b6e9a4 HEAD@{4}: commit: Add PaymentService
# ...

# Recover a "deleted" commit
git checkout -b recovery 2e8c4b0
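
Under the hood, the reflog for HEAD is a plain append-only text file, .git/logs/HEAD. Each line holds the old ID, the new ID, the identity, a timestamp with timezone, then a tab and the message. The following Python sketch parses that layout; the sample lines use placeholder IDs and a made-up identity, and the parser assumes well-formed input.

```python
from typing import NamedTuple


class ReflogEntry(NamedTuple):
    old_id: str
    new_id: str
    who: str
    timestamp: int
    message: str


def parse_reflog(text: str) -> list[ReflogEntry]:
    """Parse the contents of a reflog file such as .git/logs/HEAD.

    The returned list is newest first, so entries[n] corresponds to HEAD@{n}.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        meta, _, message = line.partition("\t")
        old_id, new_id, rest = meta.split(" ", 2)
        who, timestamp, _tz = rest.rsplit(" ", 2)
        entries.append(ReflogEntry(old_id, new_id, who, int(timestamp), message))
    entries.reverse()   # the file is append-only: the last line is the newest
    return entries


A, B, C = "a" * 40, "b" * 40, "c" * 40   # placeholder commit IDs
SAMPLE = (
    f"{A} {B} Dev <dev@example.com> 1716190000 +0530\tcommit: Add PaymentService\n"
    f"{B} {C} Dev <dev@example.com> 1716190600 +0530\tcommit: Add PaymentController\n"
    f"{C} {A} Dev <dev@example.com> 1716191200 +0530\treset: moving to HEAD~2\n"
)
log = parse_reflog(SAMPLE)
print(log[0].message)        # reset: moving to HEAD~2
print(log[1].new_id == C)    # True: HEAD@{1} is where HEAD pointed before the reset
```

This is why a commit that no branch points to anymore is still recoverable: its ID remains written in this file until the entry expires.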

Critical scenarios where reflog saves you:

# Scenario 1: Accidental "git reset --hard"
git reset --hard HEAD~5   # Oh no, lost 5 commits!
git reflog                # Find the hash before the reset
git reset --hard HEAD@{1} # Restore to before the reset

# Scenario 2: Deleted a branch you needed
git branch -D feature/auth  # Deleted by accident; the output includes "(was <hash>)"
git reflog                   # The branch's own reflog is deleted too, so search HEAD's reflog for its last commit
git checkout -b feature/auth <hash>  # Recreate it

# Scenario 3: Bad rebase
git rebase -i HEAD~10  # Went badly
git reflog             # Find the entry just before the first "rebase" line
git reset --hard ORIG_HEAD  # Or: git reset --hard HEAD@{n} for that entry (a rebase adds several entries, so it is rarely HEAD@{1})

The 90-day limit: By default, reflog entries expire after 90 days (gc.reflogExpire). Entries that point to commits no longer reachable from the current tip of their ref expire sooner, after 30 days (gc.reflogExpireUnreachable). Once no ref or reflog entry references a commit, git gc can delete it.
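
The two expiry windows amount to a simple rule, sketched here in Python with the default values hard-coded. The function name and the reference date are illustrative; in a real repository the limits come from the gc.reflogExpire and gc.reflogExpireUnreachable settings.

```python
from datetime import datetime, timedelta, timezone

REFLOG_EXPIRE = timedelta(days=90)               # default for gc.reflogExpire
REFLOG_EXPIRE_UNREACHABLE = timedelta(days=30)   # default for gc.reflogExpireUnreachable


def reflog_entry_expired(entry_time: datetime, reachable_from_tip: bool,
                         now: datetime) -> bool:
    """True if a reflog entry is old enough for `git gc` to drop it."""
    limit = REFLOG_EXPIRE if reachable_from_tip else REFLOG_EXPIRE_UNREACHABLE
    return now - entry_time > limit


now = datetime(2024, 9, 1, tzinfo=timezone.utc)   # arbitrary reference date
forty_five_days_ago = now - timedelta(days=45)

print(reflog_entry_expired(forty_five_days_ago, True, now))    # False: within 90 days
print(reflog_entry_expired(forty_five_days_ago, False, now))   # True: past 30 days
```

The practical consequence is that commits dropped by a reset or rebase have the shorter window, so recover them within about a month.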

Garbage Collection and Loose Objects

# Manual GC (also happens automatically)
git gc

# Aggressive GC (repack everything, slower but more thorough)
git gc --aggressive

# See what GC would delete (dry run)
git prune --dry-run

# Count loose objects
git count-objects -v

What GC does:

Packs loose objects into packfiles
Deletes loose objects already in packfiles
Prunes unreachable objects (those not referenced by any branch, tag, or reflog) once they are older than gc.pruneExpire, two weeks by default
Repacks existing packfiles if too many exist
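
The pruning step is a reachability walk: start from every ref and reflog entry, follow references, and treat anything never visited as garbage. The Python sketch below models this on a toy object store with hypothetical names; it leaves out the grace period and everything related to packing.

```python
def reachable(objects: dict[str, list[str]], roots: list[str]) -> set[str]:
    """Mark phase: everything reachable from the roots by following references."""
    seen, stack = set(), list(roots)
    while stack:
        object_id = stack.pop()
        if object_id in seen or object_id not in objects:
            continue
        seen.add(object_id)
        stack.extend(objects[object_id])
    return seen


def prunable(objects: dict[str, list[str]], refs: dict[str, str],
             reflog_ids: list[str]) -> set[str]:
    """Objects that no ref and no reflog entry can reach."""
    roots = list(refs.values()) + list(reflog_ids)
    return set(objects) - reachable(objects, roots)


# Toy object store: each ID maps to the IDs it references (hypothetical names).
OBJECTS = {
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree2": ["blob_a", "blob_b"],
    "tree1": ["blob_a"],
    "blob_a": [],
    "blob_b": [],
    "orphan_commit": ["orphan_tree"],   # e.g. left behind by `git reset --hard`
    "orphan_tree": ["orphan_blob"],
    "orphan_blob": [],
}
REFS = {"refs/heads/main": "commit2"}

# While the reflog still mentions the orphaned commit, nothing is prunable.
print(sorted(prunable(OBJECTS, REFS, reflog_ids=["orphan_commit"])))   # []

# Once that reflog entry has expired, the whole orphaned chain can go.
print(sorted(prunable(OBJECTS, REFS, reflog_ids=[])))
# ['orphan_blob', 'orphan_commit', 'orphan_tree']
```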

Command Reference: The Plumbing Behind the Porcelain

Plumbing Command	What it Does	Porcelain Equivalent
`git hash-object -w`	Create a blob object	(none — used by `git add`)
`git cat-file -p`	Pretty-print any object	`git show`
`git cat-file -t`	Show object type	(none)
`git ls-tree`	List tree contents	`git show` for trees
`git rev-parse HEAD`	Print the raw commit hash	(none)
`git update-ref`	Move a branch pointer	`git reset`
`git symbolic-ref HEAD`	Read/write HEAD ref	`git checkout`

Understanding these plumbing commands reveals what the "porcelain" commands actually do under the hood, which is how you debug unusual situations that git status doesn't explain clearly.

Verbal Interview: Explain Git Internals

Interviewer: "What actually happens when you run git commit?"

Strong Answer:

"When I run git commit, Git does several things internally. The blob objects already exist: git add hashed each staged file's content, wrote it to .git/objects/, and recorded the blob hash in the index. At commit time, Git first builds tree objects from the index, one per directory, each mapping names and modes to blob hashes or to nested tree hashes. Second, it creates a commit object that points to the root tree, contains author/committer metadata and the commit message, and references the previous commit's hash as its parent. Finally, it updates the branch reference, for example .git/refs/heads/main, to contain the new commit's hash, and appends an entry to the reflog. Only that last ref update makes the commit visible: if an earlier step fails, any objects already written are simply unreachable and are cleaned up later by git gc, while the branch stays where it was."

Key Takeaways

Every Git object is content-addressed by its SHA-1 hash — this is why Git history is immutable and trustworthy.
A branch is just a 41-byte text file containing a commit hash — creating one is instant regardless of repository size.
The reflog is your time machine: it records every HEAD movement for 90 days by default (30 days for commits that have become unreachable), enabling recovery from most local mistakes that involved committed work.