Git from the Ground Up: Storage Internals

01-28

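I had planned to cover Git branches and branch operations in this post, but while preparing it I found that much of the material depends on how Git stores data internally. So this post covers Git's storage internals first. Understanding them makes the later posts easier to follow and helps a great deal in everyday use of Git.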

Git storage directory layout

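When a repository is initialized (with git clone or git init), Git creates a .git directory in the project root. It holds everything Git needs for its operations and storage. The directory is laid out roughly as follows: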

As the figure shows:

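the HEAD file points to the current branch;
the index file stores the contents of the staging area;
the refs directory stores the pointers from every branch to its commit object;
the objects directory stores all the content of the Git database;
the config file contains the project's configuration;
the exclude file under the info directory contains repository-wide ignore patterns that complement the .gitignore file;
the hooks directory holds the project's client-side or server-side hook scripts.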
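
To make the relationship between HEAD and refs concrete, here is a minimal Python sketch that follows HEAD to the commit it points at by reading the files directly. The function name resolve_head is made up for this illustration. The sketch assumes the usual on-disk format (HEAD contains either "ref: <path under .git>" or, for a detached HEAD, a bare commit hash) and it only looks at loose ref files, so refs that Git has moved into the packed-refs file are reported as missing.

```python
from pathlib import Path


def resolve_head(git_dir):
    """Return (ref_name, commit_hash) for the HEAD of a .git directory.

    Simplified sketch: only loose ref files are read; packed-refs is ignored.
    """
    git_dir = Path(git_dir)
    head = (git_dir / "HEAD").read_text().strip()

    if not head.startswith("ref: "):
        # Detached HEAD: the file holds a commit hash directly.
        return None, head

    ref_name = head[len("ref: "):]      # e.g. "refs/heads/master"
    ref_file = git_dir / ref_name
    if not ref_file.exists():
        # A branch with no commits yet, or a ref stored in packed-refs.
        return ref_name, None

    # A loose ref file contains the commit hash followed by a newline.
    return ref_name, ref_file.read_text().strip()
```

Calling resolve_head(".git") from the root of a repository should return the current branch's ref name together with the hash that git rev-parse HEAD prints, as long as the ref has not been packed.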

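Note: ORIG_HEAD records where HEAD pointed before a drastic operation such as merge or reset, so that we can roll back after a destructive mistake. For example, git reset --hard ORIG_HEAD returns the repository to its state before the dangerous operation. Ordinary commits do not change this pointer. Git also keeps a log of every movement of HEAD, the reflog (a feature that has been available since well before version 1.8.5). It can be inspected with git reflog, and git reset HEAD@{num} moves back to a chosen entry. This command comes up again in the later post on recovering data in Git, and it is the recommended replacement for the ORIG_HEAD approach.
For more information, see here.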

Git storage

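Git is a content-addressed filesystem: everything it stores is managed by content address. It can be thought of as a key-value store. Given a file to store, Git applies the SHA-1 algorithm to the file's metadata and content and returns a string of 40 hexadecimal characters. From then on, that string alone is enough to retrieve the file. This string is what Git usually calls the checksum.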
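
As a rough illustration of how such a checksum is derived, the Python sketch below computes the identifier Git assigns to a file's content (a "blob"). It relies on a detail not spelled out above: the "metadata" Git hashes for a blob is a short header of the form "blob <size in bytes>" followed by a NUL byte, placed in front of the content. The function name git_blob_hash is made up for this example, and the sketch covers only the default SHA-1 object format.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Compute the SHA-1 identifier Git would give a blob with this content."""
    # Header: object type, a space, the content length in bytes, then NUL.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    # The hash covers header + content, so the file name plays no part in it.
    return hashlib.sha1(header + content).hexdigest()


checksum = git_blob_hash(b"test content\n")
print(checksum, len(checksum))
```

The result should agree with what echo 'test content' | git hash-object --stdin prints. Because the file name is not part of the input, two files with identical content share one identifier and are stored only once.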

Content addressing

Before looking at how Git stores data internally, let us first look at content addressing:

When being contrasted with content-addressed storage, a typical local or networked
storage device is referred to as location-addressed. In a location-addressed storage device,
each element of data is stored onto the physical medium, and its location recorded for later use.
The storage device often keeps a list, or directory, of these locations.
When a future request is made for a particular item, the request includes only the
location (for example, path and file names) of the data. The storage device can then use this
information to locate the data on the physical medium, and retrieve it. When new information is
written into a location-addressed device, it is simply stored in some available free space,
without regard to its content.
In contrast, when information is stored into a CAS system, the system will record a content address,
which is an identifier uniquely and permanently linked to the information content itself.
A request to retrieve information from a CAS system must provide the content identifier,
from which the system can determine the physical location of the data and retrieve it.
Because the identifiers are based on content, any change to a data element will necessarily
change its content address.
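
Content addressing is easiest to understand in contrast with location addressing, also called physical addressing. In a location-addressed system, data is written to whatever free space the physical medium has, regardless of its content. The system records the physical location for later use, usually in a list or directory, and a later request for that data must supply the location, for example a path and file name.
In a content-addressed system, by contrast, the system records a content address: a unique and permanent identifier for the data, computed with a cryptographic hash algorithm such as SHA-1 or MD5. To retrieve the data we supply the content address, and the system uses it to find the physical location and return the data. Any change to the data also changes its content address.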
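
The idea can be sketched in a few lines of Python. This is a toy model, not how Git lays out its object database: the class name ContentStore is invented for the example, an in-memory dictionary stands in for the physical medium, and the address is a plain SHA-1 of the data with no header.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is derived from the data itself."""

    def __init__(self):
        # content address -> bytes; the dict stands in for physical storage
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Store data and return its content address."""
        address = hashlib.sha1(data).hexdigest()
        self._blobs[address] = data
        return address

    def get(self, address: str) -> bytes:
        """Look the data up by content address alone."""
        return self._blobs[address]


store = ContentStore()
a1 = store.put(b"hello")
a2 = store.put(b"hello")    # same content: same address, stored once
a3 = store.put(b"hello!")   # any change to the data changes the address
print(a1 == a2, a1 == a3, store.get(a1))  # True False b'hello'
```

The caller never chooses where the data goes or what it is called. The address falls out of the content, which is why identical data is deduplicated automatically and why a stored item cannot change without its address changing too.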