Linux 內核文件描述符表的演變

03-17

我在《Linux 中每個 TCP 連接最少佔用多少內存？》一文中繪製了 Linux 內核現在（4.x）使用的文件描述符表（file descriptor table）的數據結構。本篇文章做一些代碼考古工作，還原這個數據結構的五個版本的來龍去脈。英文提綱：File Descriptor Table - Shuo Chens Notes

Linux 內核版本歷史

從 1991 年9 月發布的 Linux 0.01 起，Linux 已經走過了 1/4 世紀。下圖是陳碩根據《Linux kernel "historical" git repository with full history》整理的內核版本簡史。Linux 2.6 之前，內核採用的版本命名規則是「奇數開發、偶數穩定」。據此 1.0、1.2、2.0、2.2、2.4、2.6 是穩定版，而 1.1.x、1.3.x、2.1.x、2.3.x、2.5.x 則是開發版。2.6 之後不再單獨設置「開發版」，而是直接在當前穩定版上滾動開發。更詳細的介紹可參考 2. How the development process works 和 https://bootlin.com/doc/training/linux-kernel/linux-kernel-slides.pdf 第 401 頁起的「Linux versioning scheme and development process」一節。

Linux 內核版本簡史

Linux 從誕生以來，一直用 struct task_struct 來表示進程/線程，用 struct file 表示打開的文件，用 struct inode 表示文件本身。struct file 和 struct inode 的區別在於，如果兩次 open 同一個文件（比方說 web server 寫 access log，你用 less 看這個 assess log 文件），會有兩個 struct file 對象，指向同一個 struct inode 對象。容易想到，打開文件的偏移量（offset）應該放在 struct file 里（web server 打開的 access log 的偏移量在文件末尾，你用 less 打開的 access log 的偏移量在文件開頭），而文件本身的長度應該放在 struct inode 里（web server 往 access log 里繼續寫入內容，你不需要重新打開文件就能看到）。換句話說，你用 ls -l 命令看到的信息（除了文件名本身）都存在 struct inode 里；(f)stat(2) 返回的內容也都存在 struct inode 里。在過去的很長時間裡，更新 struct file 中偏移量的代碼有多線程安全方面的 bug，直到 2014 年 3 月底發布的 Linux 3.14 才修復，也就是說 Ubuntu 14.04 里的 write(2) 系統調用不是線程安全的。Bugs - Shuo Chens Notes

本文著眼於 struct task_struct 與 struct file 關係的演變。

版本 A：從 0.01 到 1.1.10

最早的 Linux 內核直接把元素為 struct file* 的定長數組放在 struct task_struct 里。

// include/linux/sched.h of linux-1.1.10struct task_struct { // ... struct file * filp[NR_OPEN]; fd_set close_on_exec; // ...}

從 int fd 取到 struct file* fp 的寫法是：

struct file* fp = current->filp[fd];

而 struct file 和 struct inode 也是位於各自的定長數組中。

// fs/file_table.c of linux-0.99struct file file_table[NR_FILE];// fs/inode.c of linux-0.99static struct inode inode_table[NR_INODE];

NR_OPEN、NR_FILE、NR_INODE 這幾個宏的值決定了上述數組的大小，它們的值逐漸增大。修改 NR_OPEN 會影響 sizeof (struct task_struct)，也會直接影響每個進程佔用的物理內存的大小，因為 task_struct 對象是不會 swap to disk 的。

| Version | NR_OPEN | NR_FILE | NR_INODE || ------- | ------- | ------- | -------- || 0.01 | 20 | 64 | 32 || 0.12 | 20 | 64 | 64 || 0.95 | 20 | 64 | 128 || 0.96a-3 | 32 | 64 | 128 || 0.96c-1 | 32 | 128 | 128 || 0.96pre | 32 | 64 | 128 || 0.97 | 32 | 128 | 128 || 0.98.4 | 256 | 128 | 128 || 0.99.10 | 256 | 1024 | 2048 |

在 0.99.10 中，struct file 和 struct inode 改成了動態分配，這樣整個系統能同時打開的文件數大大增加，但每個進程能打開的文件數還是 NR_OPEN。

// fs/file_table.c of linux-0.99.10-struct file file_table[NR_FILE];+struct file * first_file;

版本 B：1.1.11 到 1.3.21

1.1.11 從 task_struct 中分離出了 fs_struct、files_struct、mm_struct。

// include/linux/sched.h of linux-1.3.21/* Open file table structure */struct files_struct { int count; fd_set close_on_exec; struct file * fd[NR_OPEN];};struct task_struct { // .../* filesystem information */ struct fs_struct fs[1];/* open file information */ struct files_struct files[1];/* memory management info */ struct mm_struct mm[1]; // ...};

這樣做沒有改變程序的功能，只是更好地組織了數據結構，讓緊密相關的數據成員位於同一個結構體中，體現了封裝的思想。修改 NR_OPEN 也會直接影響 sizeof (struct task_struct)。

從 int fd 取到 struct file* fp 的寫法變成：

struct file* fp = current->files->fd[fd];

這裡為什麼要用長度為 1 的 struct 數組，而不直接放 struct，我猜是為了將來改成指針時不必修改客戶代碼。

file descriptor flag 與 file status flag

在 man 2 fcntl 中提到，文件的標誌分為 file descriptor flag 與 file status flag 兩類，分別用 F_GETFD/F_SETFD 和 F_GETFL/F_SETFL 來存取（例見 muduo/net/SocketsOps.cc ）。file descriptor flag 只有一個：close-on-exec；file status flags 包含 O_NONBLOCK、 O_APPEND、O_DIRECT 等等。因此 files_struct 要有 fd_set close_on_exec 成員，用於存儲 file descriptor flag，而 file status flag 則是放在 struct file 的 f_flags 成員中。這兩類標誌（flags）的區別體現在 dup(2) 系統調用上，後面還會講到。

版本 C：1.3.22 到 2.1.89

1.3.22 把 task_struct 的 files、fs、mm 等成員變成了指針，讓 sizeof(struct task_struct) 瘦身了很多。這麼做是為了支持多線程。

// include/linux/sched.h of linux-2.0.2struct task_struct { // ... /* filesystem information */- struct fs_struct fs[1];+ struct fs_struct *fs; /* open file information */- struct files_struct files[1];+ struct files_struct *files; /* memory management info */- struct mm_struct mm[1];+ struct mm_struct *mm; // ...};

從 int fd 取到 struct file* fp 的寫法不變，還是 current->files->fd[fd]。

Linux 2.0 開始支持多線程。（最早是 LinuxThreads 實現，2.6 改成了更符合 POSIX 語義的 NPTL 實現。）把 files_struct 成員從 task_struct 里移出來，讓同一進程內的多個線程可以共享一個 files_struct 對象，這樣線程 1 打開的文件自然就能被線程 2 看到了。

同一進程內的兩個線程共享 files_struct 對象

fs_struct 和 mm_struct 也是同理。

版本 D：2.1.90 到 2.6.13

2.1.90 把 files_struct 的 fd 成員從定長數組改成了動態數組，這樣每個進程就能同時打開很多文件了，為編寫高並發的網路服務掃清了一大障礙。

// include/linux/sched.h of linux-2.2.0/* * Open file table structure */struct files_struct { atomic_t count;+ int max_fds;+ struct file ** fd; /* current fd array */ fd_set close_on_exec; // changed to fd_set* in 2.2.12 fd_set open_fds;- struct file * fd[NR_OPEN];};

數據結構示意圖：

動態大小的文件描述符表

從 int fd 取到 struct file* fp 的寫法不變，還是 current->files->fd[fd]。

至此，文件描述符表的功能已經完善，下一個版本是性能的改進。

版本 E：2.6.14 至今

2.6.14 引入了 struct fdtable 作為 files_struct 的間接成員，把 fd、max_fds、close_on_exec 等成員移入 fdtable。這麼做是為了方便採用 RCU，讓 fdtable 可以整體替換。Read-Copy Update (RCU) 是 Paul E. McKenney 的傑作，是內核廣泛採用的一種伸縮性更好的讀寫同步機制，他還寫了著名的《Is Parallel Programming Hard, And If So, What Can You Do About It?》一書。

// include/linux/fdtable.h of linux-2.6.37struct fdtable { unsigned int max_fds; struct file __rcu **fd; /* current fd array */ fd_set *close_on_exec; fd_set *open_fds; struct rcu_head rcu; struct fdtable *next;};/* * Open file table structure */struct files_struct { /* * read mostly part */ atomic_t count; struct fdtable __rcu *fdt; struct fdtable fdtab; /* * written part on a separate cache line in SMP */ spinlock_t file_lock ____cacheline_aligned_in_smp; int next_fd; struct embedded_fd_set close_on_exec_init; struct embedded_fd_set open_fds_init; struct file __rcu * fd_array[NR_OPEN_DEFAULT];};

數據結構示意圖如下：

採用 RCU 之後的文件描述符表

從 int fd 取到 struct file* fp 的途徑變成

current->files->fdt->fd[fd];

實際的代碼比這個要複雜，因為 files->fdt 這一步要用 rcu_dereference 來做（上圖的紅線）。

SO_REUSEPORT 和 dup(2)

將同一個listening socket加入多個epoll能否降低響應時間？有人提到可以對 listening socket 使用 dup(2) 來達到相同的效果，這是行不通的。原因在於 dup(2) 不會複製 struct file 本身，而只是複製 struct file 指針，並把 file 里的引用計數加一（f_count 成員）。對兩個 fd 的讀寫操作還是通過同一個 file 對象進行，性能不會提高（見下圖示意）。你把 3 和 4 兩個 fd 分別加到兩個 epoll 中，實際上是把同一個 file 加到了兩個 epoll 中（file 的 f_ep_links 成員會把這兩個 epoll 串起來），這跟 SO_REUSEPORT 有本質的區別。進一步可參考《Linux 4.5/4.6 中對 SO_REUSEPORT 的改進》。

BSD 與 FreeBSD

最後，我們看看 BSD 與 FreeBSD 內核里是怎麼做的。以下是 BSD 的版本簡史。

BSD 與 FreeBSD 版本演化圖

BSD 內核中文件描述符錶的歷史要簡單得多。

4.3BSD-Reno 及以前的版本（上至 Unix V4）採用的是 Linux 版本 A 的做法，直接在進程的描述符里放 struct file* 的定長數組。

struct user { // ... struct file *u_ofile[NOFILE]; /* file structures for open files */ // ...};

Unix V7 的寫法與此一模一樣 http://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/sys/h/user.h 。

BSD Net/2 到 FreeBSD 9.3 採用的是 Linux 版本 D 的做法，其中 proc == task_struct, filedesc == files_struct, file == file。

// sys/proc.h/* * Process structure. */struct proc { // ... struct filedesc *p_fd; /* (b) Open files. */ // ...};// sys/filedesc.hstruct filedesc { struct file **fd_ofiles; /* file structures for open files */ char *fd_ofileflags; /* per-process open file flags */ struct vnode *fd_cdir; /* current directory */ struct vnode *fd_rdir; /* root directory */ struct vnode *fd_jdir; /* jail root directory */ int fd_nfiles; /* number of open files allocated */ NDSLOTTYPE *fd_map; /* bitmap of free fds */ int fd_lastfile; /* high-water mark of fd_ofiles */ int fd_freefile; /* approx. next free file */ u_short fd_cmask; /* mask for file creation */ u_short fd_refcnt; /* thread reference count */ u_short fd_holdcnt; /* hold count on structure + mutex */ struct sx fd_sx; /* protects members of this struct */ struct kqlist fd_kqlist; /* list of kqueues on this filedesc */ int fd_holdleaderscount; /* block fdfree() for shared close() */ int fd_holdleaderswakeup; /* fdfree() needs wakeup */};// sys/file.hstruct file { void *f_data; /* file descriptor specific data */ struct fileops *f_ops; /* File operations */ struct ucred *f_cred; /* associated credentials. */ struct vnode *f_vnode; /* NULL or applicable vnode */ short f_type; /* descriptor type */ short f_vnread_flags; /* (f) Sleep lock for f_offset */ volatile u_int f_flag; /* see fcntl.h */ volatile u_int f_count; /* reference count */ // ... off_t f_offset; // ...};

可見 BSD 內核源碼比 Linux 要工整得多，注釋也很詳盡。