Chapter 9. Git Internals

leonzhx

Blog category: Pro Git reading notes, Git VCS

1. Git is fundamentally a content-addressable file system with a VCS user interface written on top of it.

2. Git has a bunch of verbs that do low-level work and were designed to be chained together UNIX style or called from scripts. These commands are generally referred to as plumbing commands, and the more user-friendly commands are called porcelain commands.

3. When you run git init in a new or existing directory, Git creates the .git directory, which is where almost everything that Git stores and manipulates is located. If you want to back up or clone your repository, copying this single directory elsewhere gives you nearly everything you need. Listing the contents of the .git directory shows:

$ ls

HEAD

branches/

config

description

hooks/

index

info/

objects/

refs/

The branches directory isn't used by newer Git versions, and the description file is only used by the GitWeb program, so don't worry about those. The config file contains your project-specific configuration options, and the info directory keeps a global exclude file for ignored patterns that you don't want to track in a .gitignore file. The hooks directory contains your client- or server-side hook scripts. The objects directory stores all the content for your database, the refs directory stores pointers into commit objects in that data (branches), the HEAD file points to the branch you currently have checked out, and the index file is where Git stores your staging area information.

4. Git is a content-addressable filesystem, which means that at the core of Git is a simple key-value data store. You can insert any kind of content into it, and it will give you back a key that you can use to retrieve the content again at any time.

5. hash-object takes some data, stores it in your .git/objects directory, and gives you back the key the data is stored as:

$ echo 'test content' | git hash-object -w --stdin

d670460b4b4aece5915caf5c68d12f560a9fe3e4

The -w tells hash-object to store the object; otherwise, the command simply tells you what the key would be. --stdin tells the command to read the content from stdin; if you don't specify this, hash-object expects the path to a file. The output from the command is a 40-character checksum hash. You can see how Git has stored your data:

$ find .git/objects -type f

.git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4

Git initially stores your content this way: as a single file per piece of content, named with the SHA-1 checksum of the content and its header. The subdirectory is named with the first 2 characters of the SHA-1, and the filename is the remaining 38 characters.
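The naming scheme above can be sketched in a few lines of Python. This is a minimal sketch, not Git's implementation: blob_key, object_path, put, and the in-memory object_store dict are hypothetical names, and the header format (blob, a space, the size, a null byte) is the one described in note 11.

```python
import hashlib

def blob_key(content: bytes) -> str:
    """Return the SHA-1 key Git would assign to this content as a blob."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def object_path(key: str) -> str:
    """Map a 40-character key to its loose-object path under .git/objects."""
    return ".git/objects/" + key[:2] + "/" + key[2:]

# A toy in-memory content-addressable store: the key is derived from the content.
object_store = {}

def put(content: bytes) -> str:
    key = blob_key(content)
    object_store[key] = content
    return key

key = put(b"test content\n")   # echo appends the trailing newline
print(key)
print(object_path(key))
```

Because the key depends only on the content, storing the same bytes twice lands on the same key, so duplicates cost nothing.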

6. You can pull the content back out of Git with the cat-file command. Passing -p to it instructs the cat-file command to figure out the type of content and display it nicely for you:

$ git cat-file -p d670460b4b4aece5915caf5c68d12f560a9fe3e4

test content
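For a loose blob, what cat-file -p does can be approximated as: decompress the file with zlib, split off the header at the first null byte, and print the rest. The sketch below assumes that layout; read_loose_object is a hypothetical helper, and the object is built in memory rather than read from .git/objects so the example runs anywhere.

```python
import zlib

def read_loose_object(compressed: bytes):
    """Decompress a loose object and split it into (type, content)."""
    raw = zlib.decompress(compressed)
    header, _, content = raw.partition(b"\0")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type.decode("ascii"), content

# Stand-in for the bytes of .git/objects/d6/70460b... (assumed layout).
on_disk = zlib.compress(b"blob 13\0test content\n")
obj_type, content = read_loose_object(on_disk)
print(obj_type)
print(content.decode(), end="")
```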

7. You aren't storing the filename in your system, just the content. This object type is called a blob. You can have Git tell you the type of any object, given its SHA-1 key, with cat-file -t:

$ git cat-file -t 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a

blob

8. All the content in Git is stored as tree and blob objects, with trees corresponding to UNIX directory entries and blobs corresponding more or less to inodes or file contents. A single tree object contains one or more tree entries, each of which contains an SHA-1 pointer to a blob or subtree with its associated mode, type, and filename. For example, the most recent tree may look something like:

$ git cat-file -p master^{tree}

100644 blob a906cb2a4a904a152e80877d4088654daad0c859 README

100644 blob 8f94139338f9404f26296befa88755fc2598c289 Rakefile

040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0 lib

The master^{tree} syntax specifies the tree object that is pointed to by the last commit on your master branch. The lib subdirectory isn't a blob but a pointer to another tree:

$ git cat-file -p 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0

100644 blob 47c6340d6459e05787f644c2447d2595f5d3a54b simplegit.rb

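Conceptually, the data that Git is storing looks like this: a top-level tree whose entries point to the README and Rakefile blobs and to the lib subtree, which in turn points to the simplegit.rb blob.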

9. Git normally creates a tree by taking the state of your staging area or index and writing a tree object from it. So, to create a tree object, you first have to set up an index by staging some files. You can use update-index to artificially add a blob to a new staging area. You must pass it the --add option because the file doesn't yet exist in your staging area (you don't even have a staging area set up yet) and --cacheinfo because the file you're adding isn't in your directory but is in your database. Then, you specify the mode, SHA-1, and filename:

$ echo 'version 1' | git hash-object -w --stdin

83baae61804e65cc73a7201a7252750c76066a30

$ git update-index --add --cacheinfo 100644 \

83baae61804e65cc73a7201a7252750c76066a30 test.txt

You're specifying a mode of 100644, which means it's a normal file. Other options are 100755, which means it's an executable file, and 120000, which specifies a symbolic link. These three modes are the only ones that are valid for files in Git (although other modes are used for directories and submodules).

write-tree automatically creates a tree object from the state of the index if that tree doesn't yet exist:

$ git write-tree

d8329fc1cc938780ffdd9f94e0d364e0ea74f579

$ git cat-file -p d8329fc1cc938780ffdd9f94e0d364e0ea74f579

100644 blob 83baae61804e65cc73a7201a7252750c76066a30 test.txt
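On disk, a tree object is a binary list of entries. The sketch below assumes the commonly documented layout, which these notes do not spell out: each entry is the mode, a space, the name, a null byte, and then the 20 raw bytes of the SHA-1, and the whole body is prefixed with a tree header of the same shape as the blob header. tree_id is a hypothetical helper, and sorting by plain name is a simplification that only holds for trees without subdirectories.

```python
import hashlib

def tree_id(entries):
    """Compute the SHA-1 of a tree from (mode, name, hex_sha) tuples.

    Assumes file entries only; Git's real ordering treats directory
    names as if they ended in '/', which this sketch ignores.
    """
    body = b""
    for mode, name, hex_sha in sorted(entries, key=lambda e: e[1].encode()):
        body += mode.encode() + b" " + name.encode() + b"\0"
        body += bytes.fromhex(hex_sha)      # 20 raw bytes, not 40 hex characters
    header = b"tree " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

first_tree = tree_id([
    ("100644", "test.txt", "83baae61804e65cc73a7201a7252750c76066a30"),
])
print(first_tree)
```

If the assumed layout is right, the printed ID should be the same one that write-tree reported for this one-file tree.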

You can also call update-index with a file path. Here a second version of test.txt and a new file are staged, and then a new tree is written:

$ echo 'new file' > new.txt

$ echo 'version 2' > test.txt

$ git update-index test.txt

$ git update-index --add new.txt

$ git write-tree

0155eb4229851634a0f03eb265b69f5a2d56f341

$ git cat-file -p 0155eb4229851634a0f03eb265b69f5a2d56f341

100644 blob fa49b077972391ad58037050f2a75f74e3671e92 new.txt

100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a test.txt

Your staging area now has the new version of test.txt as well as the new file new.txt.

You can read an existing tree into your staging area as a subtree by using the --prefix option to read-tree :

$ git read-tree --prefix=bak d8329fc1cc938780ffdd9f94e0d364e0ea74f579

$ git write-tree

3c4e9cd789d88d8d89c1073707c3585e41b0e614

$ git cat-file -p 3c4e9cd789d88d8d89c1073707c3585e41b0e614

040000 tree d8329fc1cc938780ffdd9f94e0d364e0ea74f579 bak

100644 blob fa49b077972391ad58037050f2a75f74e3671e92 new.txt

100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a test.txt

10. To create a commit object, you call commit-tree and specify a single tree SHA-1 and which commit objects, if any, directly preceded it:

$ echo 'first commit' | git commit-tree d8329f

fdf4fc3344e67ab068f836878b6c4951e3b15f3d

You can look at your new commit object with cat-file :

$ git cat-file -p fdf4fc3

tree d8329fc1cc938780ffdd9f94e0d364e0ea74f579

author Scott Chacon <schacon@gmail.com> 1243040974 -0700

committer Scott Chacon <schacon@gmail.com> 1243040974 -0700

first commit

The format for a commit object is simple: it specifies the top-level tree for the snapshot of the project at that point; the author/committer information pulled from your user.name and user.email configuration settings, with the current timestamp; a blank line, and then the commit message.

Then you can write the other two commit objects, each referencing the commit that came directly before it:

$ echo 'second commit' | git commit-tree 0155eb -p fdf4fc3

cac0cab538b970a37ea1e769cbbde608743bc96d

$ echo 'third commit' | git commit-tree 3c4e9c -p cac0cab

1a410efbd13591db07496601ebc7a059dd55cfe9

This is essentially what Git does when you run the git add and git commit commands—it stores blobs for the files that have changed, updates the index, writes out trees, and writes commit objects that reference the top-level trees and the commits that came immediately before them.
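The commit format described above is plain text, so it is easy to assemble by hand. The following is a rough sketch under stated assumptions: commit_object is a hypothetical helper, it uses the same identity and timestamp for author and committer, and it assumes the same header convention as blobs (type, space, size, null byte). It is not how commit-tree is implemented.

```python
import hashlib

def commit_object(tree, message, author, timestamp, tz, parents=()):
    """Build the raw bytes of a commit object (header + body)."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]      # zero or more parent commits
    ident = f"{author} {timestamp} {tz}"
    lines += ["author " + ident, "committer " + ident, "", message]
    body = ("\n".join(lines) + "\n").encode()
    return b"commit " + str(len(body)).encode() + b"\0" + body

raw = commit_object(
    tree="d8329fc1cc938780ffdd9f94e0d364e0ea74f579",
    message="first commit",
    author="Scott Chacon <schacon@gmail.com>",
    timestamp=1243040974,
    tz="-0700",
)
commit_id = hashlib.sha1(raw).hexdigest()
print(raw.split(b"\0", 1)[1].decode())
```

Because the timestamp and identity are part of the hashed bytes, the same tree committed at a different time or by a different person gets a different commit ID.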

11. Git stores a header with the content. The header starts with the type of the object, in this case a blob, followed by a space, the size of the content in bytes, and finally a null byte: blob 16\000. Git concatenates the header and the original content, calculates the SHA-1 checksum of the result, and compresses it with zlib before writing it to disk. A Ruby session that generates a blob with the content "what is up, doc?" looks like this:

$ irb

>> content = "what is up, doc?"

=> "what is up, doc?"

>> header = "blob #{content.bytesize}\0"   # the size is counted in bytes

=> "blob 16\000"

>> store = header + content

=> "blob 16\000what is up, doc?"

>> require 'digest/sha1'

=> true

>> sha1 = Digest::SHA1.hexdigest(store)

=> "bd9dbf5aae1a3862dd1526723246b20206e5fc37"

>> require 'zlib'

=> true

>> zlib_content = Zlib::Deflate.deflate(store)

=> "x\234K\312\3110R04c(\317H,Q\310,V(-\320QH\3110\266\a\000_\034\a\235"

>> path = '.git/objects/' + sha1[0,2] + '/' + sha1[2,38]

=> ".git/objects/bd/9dbf5aae1a3862dd1526723246b20206e5fc37"

>> require 'fileutils'

=> true

>> FileUtils.mkdir_p(File.dirname(path))

=> ".git/objects/bd"

>> File.open(path, 'w') { |f| f.write zlib_content }

=> 32

12. You need a file in which you can store the SHA-1 value under a simple name so you can use that pointer rather than the raw SHA-1 value. In Git, these are called references or refs; you can find the files that contain the SHA-1 values in the .git/refs directory.

13. To create a new reference that will help you remember where your latest commit is, you can technically do something as simple as this:

$ echo "1a410efbd13591db07496601ebc7a059dd55cfe9" > .git/refs/heads/master

You aren't encouraged to directly edit the reference files. Git provides a safer command for updating a reference, update-ref:

$ git update-ref refs/heads/master 1a410efbd13591db07496601ebc7a059dd55cfe9

That's basically what a branch in Git is: a simple pointer or reference to the head of a line of work. To create a branch back at the second commit, you can do this:

$ git update-ref refs/heads/test cac0ca

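Now, your Git database conceptually looks something like this: the master reference points to the third commit (1a410e), the test reference points to the second commit (cac0ca), and each commit points to its tree and to its parent commit.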

When you run commands like git branch (branchname) , Git basically runs that update-ref command to add the SHA-1 of the last commit of the branch you're on into whatever new reference you want to create.

14. The HEAD file is a symbolic reference to the branch you're currently on. Symbolic means that, unlike a normal reference, it generally doesn't contain a SHA-1 value but rather a pointer to another reference:

$ cat .git/HEAD

ref: refs/heads/master

You can also set the value of HEAD :

$ git symbolic-ref HEAD refs/heads/test

$ cat .git/HEAD

ref: refs/heads/test

You can't set a symbolic reference outside of the refs style:

$ git symbolic-ref HEAD test

fatal: Refusing to point HEAD outside of refs/
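Resolving HEAD therefore means following "ref: " lines until a file with a plain SHA-1 is reached. A minimal sketch of that lookup, assuming every reference is still a loose file (it ignores the packed-refs fallback covered in note 30); resolve_ref and the throwaway directory are hypothetical, for illustration only:

```python
import os
import tempfile

def resolve_ref(git_dir: str, name: str = "HEAD", max_depth: int = 5) -> str:
    """Follow 'ref: ...' indirections until a SHA-1 is found."""
    for _ in range(max_depth):
        with open(os.path.join(git_dir, name)) as f:
            value = f.read().strip()
        if not value.startswith("ref: "):
            return value                  # a plain reference: the SHA-1 itself
        name = value[len("ref: "):]       # a symbolic reference: follow it
    raise RuntimeError("too many levels of symbolic references")

# Demo on a throwaway directory laid out like a tiny .git folder.
git_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(git_dir, "refs", "heads"))
with open(os.path.join(git_dir, "refs", "heads", "test"), "w") as f:
    f.write("cac0cab538b970a37ea1e769cbbde608743bc96d\n")
with open(os.path.join(git_dir, "HEAD"), "w") as f:
    f.write("ref: refs/heads/test\n")

head_sha = resolve_ref(git_dir)
print(head_sha)
```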

15. The tag object is very much like a commit object—it contains a tagger, a date, a message, and a pointer. The main difference is that a tag object points to a commit rather than a tree. It's like a branch reference, but it never moves—it always points to the same commit but gives it a friendlier name. You can make a lightweight tag by running something like this:

$ git update-ref refs/tags/v1.0 cac0cab538b970a37ea1e769cbbde608743bc96d

That is all a lightweight tag is—a branch that never moves. If you create an annotated tag, Git creates a tag object and then writes a reference to point to it rather than directly to the commit:

$ git tag -a v1.1 1a410efbd13591db07496601ebc7a059dd55cfe9 -m 'test tag'

$ cat .git/refs/tags/v1.1

9585191f37f7b0fb9444f35a9bf50de191beadc2

$ git cat-file -p 9585191f37f7b0fb9444f35a9bf50de191beadc2

object 1a410efbd13591db07496601ebc7a059dd55cfe9

type commit

tag v1.1

tagger Scott Chacon <schacon@gmail.com> Sat May 23 16:48:58 2009 -0700

test tag

A tag doesn't need to point to a commit; you can tag any Git object. In the Git source code, for example, the maintainer has added their GPG public key as a blob object and then tagged it. You can view the public key by running:

$ git cat-file blob junio-gpg-pub

16. If you add a remote and push to it, Git stores the value you last pushed to that remote for each branch in the refs/remotes directory. You can see what the master branch on the origin remote was the last time you communicated with the server, by checking the refs/remotes/origin/master file:

$ cat .git/refs/remotes/origin/master

ca82a6dff817ec66f44342007202690a93763949

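Remote references differ from branches (refs/heads references) mainly in that they're treated as read-only. You can git checkout one, but HEAD ends up detached instead of pointing at the remote reference, so a commit never updates it. Git moves them around as bookmarks to the last known state of where those branches were on those servers.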

17. Git compresses the contents of the files under the objects directory with zlib. You can use git cat-file -s to see the size in bytes of an object's uncompressed content:

$ git cat-file -s 9bc1dc421dcd51b4ac296e3e5b6e2a99cf44391e

12898

18. The initial format in which Git saves objects on disk is called a loose object format. However, occasionally Git packs up several of these objects into a single binary file called a packfile in order to save space and be more efficient. Git does this if you have too many loose objects around, if you run the git gc command manually, or if you push to a remote server:

$ git gc

$ find .git/objects -type f

.git/objects/71/08f7ecb345ee9d0084193f147cdad4d2998293

.git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4

.git/objects/info/packs

.git/objects/pack/pack-7a16e4488ae40c7d2bc56ea2bd43e25212a66c45.idx

.git/objects/pack/pack-7a16e4488ae40c7d2bc56ea2bd43e25212a66c45.pack

The objects that remain are the blobs that aren't pointed to by any commit. Because you never added them to any commits, they're considered dangling and aren't packed up in your new packfile. The other files are the new packfile and its index. The packfile is a single file containing the contents of all the objects that were removed from your file system. The index is a file that contains offsets into that packfile so you can quickly seek to a specific object.

19. When Git packs objects, it looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next. The git verify-pack plumbing command allows you to see what was packed up:

$ git verify-pack -v .git/objects/pack/pack-7a16e4488ae40c7d2bc56ea2bd43e25212a66c45.idx

It shows the SHA-1 of each object in the packfile, the object type, the object size, the object's offset in the packfile, and so on. If two objects are very similar, the most recent version is stored intact and the older version is stored as a delta, because you're most likely to need faster access to the most recent version of the file. Git will occasionally repack your database automatically, always trying to save more space. You can also repack at any time by running git gc by hand.

20. Suppose you add a remote:

$ git remote add origin git@github.com:schacon/simplegit-progit.git

It adds a section to your .git/config file, specifying the name of the remote (origin), the URL of the remote repository, and the refspec for fetching:

[remote "origin"]

url = git@github.com:schacon/simplegit-progit.git

fetch = +refs/heads/*:refs/remotes/origin/*

The format of the refspec is an optional + , followed by <src>:<dst> , where <src> is the pattern for references on the remote side and <dst> is where those references will be written locally. The + tells Git to update the reference even if it isn't a fast-forward.

In the default case that is automatically written by a git remote add command, Git fetches all the references under refs/heads/ on the server and writes them to refs/remotes/origin/ locally. If you want Git to pull down only the master branch each time, and not every other branch on the remote server, you can change the fetch line to

fetch = +refs/heads/master:refs/remotes/origin/master
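To make the src/dst mapping concrete, here is a small sketch of how a fetch refspec could be applied to a remote reference name. parse_refspec and map_ref are hypothetical helpers, and the matching is simplified: it handles at most one * per side and does not expand short names such as master the way Git does.

```python
def parse_refspec(refspec: str):
    """Split '+<src>:<dst>' into (force, src, dst)."""
    force = refspec.startswith("+")
    body = refspec[1:] if force else refspec
    src, _, dst = body.partition(":")
    return force, src, dst

def map_ref(refspec: str, remote_ref: str):
    """Return the local name for remote_ref, or None if the refspec doesn't match."""
    _, src, dst = parse_refspec(refspec)
    if "*" not in src:
        return dst if remote_ref == src else None
    prefix, suffix = src.split("*", 1)
    if len(remote_ref) < len(prefix) + len(suffix):
        return None
    if not (remote_ref.startswith(prefix) and remote_ref.endswith(suffix)):
        return None
    matched = remote_ref[len(prefix):len(remote_ref) - len(suffix)]
    return dst.replace("*", matched, 1)   # substitute what the * matched

default = "+refs/heads/*:refs/remotes/origin/*"
print(map_ref(default, "refs/heads/master"))
```

With the narrowed refspec for master only, every other branch name maps to None, which is why those branches are no longer fetched.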

21. The following commands are all equivalent, because Git expands each of them to refs/remotes/origin/master:

$ git log origin/master

$ git log remotes/origin/master

$ git log refs/remotes/origin/master
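The expansion can be pictured as trying a fixed list of prefixes until one names an existing reference. The sketch below models the search order on the rules listed in the gitrevisions documentation, not on anything stated in these notes; expand_ref and the set of existing refs are hypothetical.

```python
# Search order modeled on the rules listed in the gitrevisions documentation.
CANDIDATE_PATTERNS = [
    "{}",
    "refs/{}",
    "refs/tags/{}",
    "refs/heads/{}",
    "refs/remotes/{}",
    "refs/remotes/{}/HEAD",
]

def expand_ref(short_name, existing_refs):
    """Return the first full ref name that exists, or None."""
    for pattern in CANDIDATE_PATTERNS:
        candidate = pattern.format(short_name)
        if candidate in existing_refs:
            return candidate
    return None

refs = {"refs/heads/master", "refs/remotes/origin/master"}
for name in ("origin/master", "remotes/origin/master", "refs/remotes/origin/master"):
    print(name, "->", expand_ref(name, refs))
```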

22. The fetch configuration in .git/config is just the default refspec for git fetch for that remote. If you want to do something one time, you can specify the refspec on the command line:

$ git fetch origin master:refs/remotes/origin/mymaster

You can also specify multiple refspecs:

$ git fetch origin master:refs/remotes/origin/mymaster topic:refs/remotes/origin/topic

You can also specify multiple refspecs for fetching in your configuration file:

[remote "origin"]

url = git@github.com:schacon/simplegit-progit.git

fetch = +refs/heads/master:refs/remotes/origin/master

fetch = +refs/heads/experiment:refs/remotes/origin/experiment

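In the Git versions these notes were written against, you can't use partial globs in the pattern, so this would be invalid (newer releases, Git 2.6.0 and later according to the second edition of Pro Git, accept it):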

fetch = +refs/heads/qa*:refs/remotes/origin/qa*

23. If the QA team wants to push their master branch to qa/master on the remote server, they can run:

$ git push origin master:refs/heads/qa/master

If they want Git to do that automatically each time they run git push origin , they can add a push value to their config file:

[remote "origin"]

url = git@github.com:schacon/simplegit-progit.git

fetch = +refs/heads/*:refs/remotes/origin/*

push = refs/heads/master:refs/heads/qa/master

24. You can delete references on the remote by pushing a refspec with an empty source side:

$ git push origin :topic

Because the refspec is <src>:<dst> , by leaving off the <src> part, this basically says to make the topic branch on the remote nothing, which deletes it.

25. Git transport over HTTP is often referred to as the dumb protocol because it requires no Git-specific code on the server side during the transport process. The fetch process is a series of GET requests, where the client can assume the layout of the Git repository on the server.

26. Let's follow the http-fetch process for the simplegit library:

$ git clone http://github.com/schacon/simplegit-progit.git

The first thing this command does is pull down the info/refs file. This file is written by the update-server-info command, which is why you need to enable that as a post-receive hook in order for the HTTP transport to work properly:

=> GET info/refs

ca82a6dff817ec66f44342007202690a93763949 refs/heads/master

Now you have a list of the remote references and SHAs. Next, you look for what the HEAD reference is so you know what to check out when you're finished:

=> GET HEAD

ref: refs/heads/master

Now you know that you need to check out the master branch, so you start by fetching the ca82a6 commit object that you saw in the info/refs file:

=> GET objects/ca/82a6dff817ec66f44342007202690a93763949

(179 bytes of binary data)

That object is in loose format on the server. You can zlib-uncompress it, strip off the header, and look at the commit content:

$ git cat-file -p ca82a6dff817ec66f44342007202690a93763949

tree cfda3bf379e4f8dba8717dee55aab78aef7f4daf

parent 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7

author Scott Chacon <schacon@gmail.com> 1205815931 -0700

committer Scott Chacon <schacon@gmail.com> 1240030591 -0700

changed the version number

Next, you have two more objects to retrieve: cfda3b, which is the tree of content that the commit you just retrieved points to, and 085bb3, which is the parent commit:

=> GET objects/08/5bb3bcb608e1e8451d4b2432f8ecbe6306e7e7

(179 bytes of data)

=> GET objects/cf/da3bf379e4f8dba8717dee55aab78aef7f4daf

(404 - Not Found)

It looks like that tree object isn't in loose format on the server, so you get a 404 response back. There are a couple of reasons for this—the object could be in an alternate repository, or it could be in a packfile in this repository. Git checks for any listed alternates first:

=> GET objects/info/http-alternates

(empty file)

If this comes back with a list of alternate URLs, Git checks for loose files and packfiles there—this is a nice mechanism for projects that are forks of one another to share objects on disk. To see what packfiles are available on this server, you need to get the objects/info/packs file, which contains a listing of them (also generated by update-server-info ):

=> GET objects/info/packs

P pack-816a9b2334da9953e530f27bcac22082a9f5b835.pack

You'll check the index file to see which packfile contains the object you need:

=> GET objects/pack/pack-816a9b2334da9953e530f27bcac22082a9f5b835.idx

(4k of binary data)

Now you can see whether your object is in it, because the index lists the SHAs of the objects contained in the packfile and the offsets to those objects. Your object is there, so go ahead and get the whole packfile:

=> GET objects/pack/pack-816a9b2334da9953e530f27bcac22082a9f5b835.pack

(13k of binary data)

…
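The requests in this walk-through follow directly from the repository layout, which is why no server-side logic is needed. A sketch of how a client could derive them, with no network access involved: parse_info_refs, loose_object_url, and pack_index_urls are hypothetical helpers, and the info/refs line is assumed to be "SHA, whitespace, ref name".

```python
def parse_info_refs(text: str) -> dict:
    """Parse the info/refs listing into {ref_name: sha}."""
    refs = {}
    for line in text.splitlines():
        if line.strip():
            sha, name = line.split(None, 1)
            refs[name.strip()] = sha
    return refs

def loose_object_url(repo_url: str, sha: str) -> str:
    """URL a dumb-HTTP client would GET for a loose object."""
    return f"{repo_url.rstrip('/')}/objects/{sha[:2]}/{sha[2:]}"

def pack_index_urls(repo_url: str, packs_listing: str) -> list:
    """Turn objects/info/packs lines ('P pack-<sha>.pack') into index URLs."""
    base = repo_url.rstrip("/") + "/objects/pack/"
    names = [l.split()[1] for l in packs_listing.splitlines() if l.startswith("P ")]
    return [base + n[:-len(".pack")] + ".idx" for n in names]

repo = "http://github.com/schacon/simplegit-progit.git"
refs = parse_info_refs("ca82a6dff817ec66f44342007202690a93763949\trefs/heads/master\n")
print(loose_object_url(repo, refs["refs/heads/master"]))
```

A 404 on the loose-object URL is the signal to fall back to the alternates and packfile listings, as described above.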

27. Git can transfer data between two repositories in two major ways: over HTTP and via the so-called smart protocols used in the file://, ssh://, and git:// transports.

28. With the smart protocol, Git uploads data to a remote using the send-pack and receive-pack processes. The send-pack process runs on the client and connects to a receive-pack process on the remote side. When you download data, the fetch-pack and upload-pack processes are involved. The client initiates a fetch-pack process that connects to an upload-pack process on the remote side to negotiate what data will be transferred down.

29. Occasionally, Git automatically runs a command called auto gc. If there are too many loose objects or too many packfiles, Git launches a full-fledged git gc command. The command does a number of things: it gathers up all the loose objects and places them in packfiles, it consolidates packfiles into one big packfile, and it removes objects that aren't reachable from any commit and are a few months old. You can run auto gc manually:

$ git gc --auto

You must have around 7,000 loose objects or more than 50 packfiles for Git to fire up a real gc command. You can modify these limits with the gc.auto and gc.autopacklimit config settings, respectively.
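The decision can be sketched as two threshold checks. This is a simplification, not Git's actual logic: needs_auto_gc is a hypothetical function, and the default values of 6700 and 50 are taken from the git-config documentation for gc.auto and gc.autoPackLimit rather than from these notes, which only say "around 7,000".

```python
def needs_auto_gc(loose_objects: int, packfiles: int,
                  gc_auto: int = 6700, gc_autopacklimit: int = 50) -> bool:
    """Decide whether 'git gc --auto' would do real work (simplified)."""
    if gc_auto <= 0:
        return False                      # gc.auto = 0 disables auto gc
    too_many_loose = loose_objects > gc_auto
    too_many_packs = gc_autopacklimit > 0 and packfiles > gc_autopacklimit
    return too_many_loose or too_many_packs

print(needs_auto_gc(loose_objects=7000, packfiles=1))
print(needs_auto_gc(loose_objects=100, packfiles=1))
```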

30. If you run git gc , you'll no longer have reference files in the refs directory. Git will move them for the sake of efficiency into a file named .git/packed-refs that looks like this:

$ cat .git/packed-refs

# pack-refs with: peeled

cac0cab538b970a37ea1e769cbbde608743bc96d refs/heads/experiment

ab1afef80fac8e34258ff41fc1b867c702daa24b refs/heads/master

cac0cab538b970a37ea1e769cbbde608743bc96d refs/tags/v1.0

9585191f37f7b0fb9444f35a9bf50de191beadc2 refs/tags/v1.1

^1a410efbd13591db07496601ebc7a059dd55cfe9

The last line of the file, which begins with a ^, means that the tag directly above is an annotated tag and that this line is the commit the annotated tag points to. If you update a reference, Git doesn't edit this file but instead writes a new file under refs/heads. To get the appropriate SHA for a given reference, Git checks for that reference in the refs directory and then checks the packed-refs file as a fallback.
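The file format is simple enough to parse by hand. A minimal sketch, assuming the layout shown in the listing above (comment lines start with #, peeled lines start with ^ and belong to the reference just before them); parse_packed_refs is a hypothetical helper:

```python
def parse_packed_refs(text: str):
    """Return ({ref: sha}, {ref: peeled_sha}) from packed-refs content."""
    refs, peeled = {}, {}
    last_ref = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                          # blank line or header comment
        if line.startswith("^"):
            if last_ref is not None:
                peeled[last_ref] = line[1:]   # commit behind the annotated tag above
            continue
        sha, name = line.split(" ", 1)
        refs[name] = sha
        last_ref = name
    return refs, peeled

sample = """\
# pack-refs with: peeled
cac0cab538b970a37ea1e769cbbde608743bc96d refs/heads/experiment
ab1afef80fac8e34258ff41fc1b867c702daa24b refs/heads/master
cac0cab538b970a37ea1e769cbbde608743bc96d refs/tags/v1.0
9585191f37f7b0fb9444f35a9bf50de191beadc2 refs/tags/v1.1
^1a410efbd13591db07496601ebc7a059dd55cfe9
"""
refs, peeled = parse_packed_refs(sample)
print(refs["refs/heads/master"], peeled)
```

A full lookup would check for a loose file under the refs directory first and consult this table only when that file is missing.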

31. At some point in your Git journey, you may accidentally lose a commit. Generally, this happens because you force-delete a branch that had work on it and it turns out you wanted the branch after all, or because you hard-reset a branch and abandon commits that you wanted something from. As you work, Git silently records what your HEAD is every time you change it: each time you commit or change branches, the reflog is updated. The reflog is also updated by the git update-ref command (one more reason to use it rather than writing to the files under .git/refs by hand), and its data is kept under the .git/logs/ directory. You can run git reflog or git log -g to view it. If the lost commit isn't in the reflog for some reason, you can use the git fsck utility, which checks your database for integrity. If you run it with the --full option, it shows you all objects that aren't pointed to by another object.

32. You can run the count-objects command to quickly see how much space you're using:

$ git count-objects -v

count: 4

size: 16

in-pack: 21

packs: 1

size-pack: 2016

prune-packable: 0

garbage: 0

33. You can identify which file or files are taking up so much space by running git verify-pack and sorting on the third field of its output, which is the object size. You can also pipe the result through the tail command, because you're only interested in the last few (largest) files:

$ git verify-pack -v .git/objects/pack/pack-3f8c0...bb.idx | sort -k 3 -n | tail -3

e3f094f522629ae358806b17daf78246c27c007b blob 1486 734 4667

05408d195263d853f09dca71d55116663690c27c blob 12908 3478 1189

7a9eb2fba2b1811321254ac360970fc169ba2330 blob 2056716 2056872 5401
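The same sort-and-tail step can be done in a script once the output has been captured. A small sketch, assuming each object line has the fields shown above (SHA, type, size, size in pack, offset); largest_objects is a hypothetical helper, and the summary line in the sample is included only to show that non-object lines are skipped.

```python
def largest_objects(verify_pack_output: str, count: int = 3):
    """Return the `count` largest objects as (size, sha, type), biggest last."""
    rows = []
    for line in verify_pack_output.splitlines():
        fields = line.split()
        # Object lines look like: <sha> <type> <size> <size-in-pack> <offset> [...]
        if len(fields) >= 5 and len(fields[0]) == 40 and fields[2].isdigit():
            rows.append((int(fields[2]), fields[0], fields[1]))
    return sorted(rows)[-count:]

sample = """\
e3f094f522629ae358806b17daf78246c27c007b blob   1486 734 4667
05408d195263d853f09dca71d55116663690c27c blob   12908 3478 1189
7a9eb2fba2b1811321254ac360970fc169ba2330 blob   2056716 2056872 5401
non delta: 3 objects
"""
top = largest_objects(sample)
print(top[-1])
```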

To find out what file it is, use the rev-list command. If you pass --objects to rev-list, it lists all the commit SHAs and also the blob SHAs with the file paths associated with them. You can use this to find your blob's name:

$ git rev-list --objects --all | grep 7a9eb2fb

7a9eb2fba2b1811321254ac360970fc169ba2330 git.tbz2

Now, you need to remove this file from all trees in your past. You can easily see what commits modified this file:

$ git log --pretty=oneline -- git.tbz2

da3f30d019005479c99eb4c3406225613985a1db oops - removed large tarball

6df764092f3e7c8f5f94cbe08ee5cf42e92a0289 added git tarball

You must rewrite all the commits downstream from 6df76 to fully remove this file from your Git history:

$ git filter-branch --index-filter \

'git rm --cached --ignore-unmatch git.tbz2' -- 6df7640^..

The --index-filter option is similar to the --tree-filter option except that instead of passing a command that modifies files checked out on disk, you're modifying your staging area or index each time. The reason to do it this way is speed—because Git doesn't have to check out each revision to disk before running your filter, the process can be much, much faster. The --ignore-unmatch option to git rm tells it not to error out if the pattern you're trying to remove isn't there. Finally, you ask filter-branch to rewrite your history only from the 6df7640 commit up.

Now, your history no longer contains a reference to that file. However, your reflog and a new set of refs that Git added when you did the filter-branch under .git/refs/original still do, so you have to remove them and then repack the database. You need to get rid of anything that has a pointer to those old commits before you repack:

$ rm -Rf .git/refs/original

$ rm -Rf .git/logs/

$ git gc

Counting objects: 19, done.

Delta compression using 2 threads.

Compressing objects: 100% (14/14), done.

Writing objects: 100% (19/19), done.

Total 19 (delta 3), reused 16 (delta 1)

The big object is still in your loose objects, so it's not gone; but it won't be transferred on a push or subsequent clone, which is what's important. If you really wanted to, you could remove the object completely by running git prune with an expiry time, for example git prune --expire now.