What is the real nature of "unable to create new native thread"?

12-29

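Over the past few days our production servers have been responding slowly, and the logs are full of "java.lang.OutOfMemoryError: unable to create new native thread". Most descriptions of this JVM error mention two causes:
1. The OS limit on max user processes has been exceeded.
2. The OS no longer has enough memory to spawn a new native thread.
After investigating, I found that the current user's process quota is 100,000, while the total number of processes on the whole system is no more than 7,000. That rules out the first cause.
【To keep everyone from getting hung up on ulimit, I'll paste the ulimit output anyway: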

core file size (blocks, -c) 100000000
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 600000
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1000000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200

real-time priority (-r) 0
stack size (kbytes, -s) 102400
cpu time (seconds, -t) unlimited
max user processes (-u) 100000
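The total number of processes on the system is roughly 6,000 to 7,000; I counted them with pstree. I also tried the other usual indicators such as top and free. While reproducing this kind of problem locally I also tried pmap, vmstat, smem, and so on.
Some comments say file handles should be considered. I don't think they need to be (though considering them does no harm: open files is set in the millions, and neither /proc nor lsof shows anywhere near that many fds in use). This is a "cannot create native thread" problem, which means that even if too many files were open, a thread would first have to be created and start running on its stack before it could begin consuming file handles. I haven't tried it, but my guess (still to be verified) is that lowering open files and writing a single-threaded for loop that keeps consuming file handles would never produce "cannot create native thread". I'm still studying the mechanism behind this error, but according to the JVM spec it should be related to a failed memory allocation for the stack, and unrelated to a shortage of I/O resources.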
】
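Some of the projects on our systems spawned several thousand threads, while others hit "unable to create new native thread" after using only a few dozen. The intuitive conclusion is that the earlier JVM processes trampled on some resource that the later JVM processes should have been able to obtain.
As I understand the JVM, to create a Thread Java must first create a native thread to map it to, and on Linux that native thread should be a lightweight process (LWP). This lightweight process can share the managed heap with the main JVM, so the memory that cannot be allocated must be its stack.
Going a step further, I'd guess that when the JVM spawns a new thread, it has to find a stack area for the new native thread somewhere in the OS's unused memory. In other words, the OS should have a region of memory that serves as a candidate process area, and memory newly requested from it by the JVM is used mainly for stack sections.

I think of this as mining: once the earlier miners have dug everything out, there is nothing left for the later ones. I wanted to read the JVM source today, but the download from Oracle's link just wouldn't work, so I'm not sure how the JVM actually creates native threads and can only guess.
Could someone who knows the JVM or Linux memory management well explain in detail what this kind of OOM really is? Is there really a candidate memory pool used to create the stack area of native threads?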

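A note up front:

The asker seems a bit impatient. Solving a problem is usually a back-and-forth of asking questions and supplying information. If the question had included a little more detail along the lines of "this is how I ruled out suspects 1/2/3, and these are the exact values a given tool printed", it would have cut down the communication overhead between asker and answerers considerably.

I deliberately didn't mention the Java version or the OS version, because I want to know how the JVM solves this thread-mapping problem in the abstract. If it has changed across versions, I'm even more curious about what changed.

The upside of generalizing a question is that you may get to see the bigger picture; the downside is that you may well get no answer. Writing answers takes time, and answering a big question is far more work than answering a small, precise one.

To answer the question:

What the asker cares about is what exactly is going on when the HotSpot VM of Oracle JDK / OpenJDK prints "java.lang.OutOfMemoryError: unable to create new native thread".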

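Assuming the asker is on JDK8u, the relevant OpenJDK source is at: jdk8u/jdk8u/hotspot: aa4ffb1f30c9 /

(For other JDK versions, the specific source links are a separate discussion. The vast majority of Oracle JDK's source is identical to OpenJDK's, so OpenJDK's source is a good reference for the behavior of both. HotSpot VM's design in this area has not changed much since roughly JDK 1.4. The one change worth mentioning is that long ago it could be configured to use an alternate signal stack, which is no longer used: each thread now has a single stack, on which both native stack frames and Java stack frames are allocated during normal execution, and on which signal handlers also allocate their space.)

Searching this HotSpot VM code base for the error message turns up: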

$ grep -nr "unable to create new native thread" .
./src/share/vm/compiler/compileBroker.cpp:1056: "unable to create new native thread");
./src/share/vm/gc_implementation/shared/concurrentGCThread.cpp:126: "unable to create new native thread");
./src/share/vm/prims/jvm.cpp:3030: "unable to create new native thread");
./src/share/vm/prims/jvm.cpp:3033: "unable to create new native thread");
./src/share/vm/runtime/os.cpp:378: "unable to create new native thread");
./src/share/vm/runtime/serviceThread.cpp:70: "unable to create new native thread");
./src/share/vm/services/attachListener.cpp:525: "unable to create new native thread");

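Of these, the asker most likely only cares about the pair in jvm.cpp. That is HotSpot VM's implementation of JVM_StartThread(), which at the Java level is the internal implementation of java.lang.Thread.start().

The other hits are places where HotSpot VM initializes some of its internal threads, such as the JIT compiler threads, the GC threads, the signal-handling thread, and so on.

As JVM_StartThread() shows, before HotSpot VM actually starts running a Java thread it needs to create a few small, fixed-size C++ objects: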

native_thread = new JavaThread(thread_entry, sz);

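JavaThread is the C++ object HotSpot VM uses internally to manage a Java thread's execution state. It in turn references an OSThread object, which is HotSpot VM's abstraction of the state of the underlying OS thread. These bookkeeping objects are small enough to be negligible.

On each platform HotSpot VM has a platform-specific implementation of OSThread, which contains the logic for creating the real platform thread (for example, creating an LWP on Linux through pthread) and passes down a stack size parameter used to allocate the stack. On Linux, for example, HotSpot VM's current default stack size is 1MB. This space is the bulk of the cost, far larger than the small C++ objects HotSpot VM uses to manage threads.

On Linux, if pthread is left to allocate the stack itself from the parameters passed in, rather than being handed a pointer to a stack the user has already allocated, it calls mmap() directly to allocate the stack. The Linux kernel does not serve this mmap() from any special place; it treats it like any other mmap(). pthread itself may temporarily cache the stacks of recently reclaimed threads and reuse them for newly created threads. That is done by pthread, not by the kernel, and the "cache" is nothing special: it does not constitute a dedicated "stack area".

On Linux, for Java threads that are fully under its control, HotSpot VM calls pthread_create() with a stack size and lets libpthread allocate the space, rather than allocating the stack itself and telling pthread to use that memory. When an already-created thread attaches to the JVM, how that thread was created and how its stack was allocated is beyond HotSpot VM's control.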

HotSpot on Linux:

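bool os::create_thread(Thread* thread, ThreadType thr_type, size_t stack_size) {
  // ...
  // ask pthread to allocate a stack of the requested size
  pthread_attr_setstacksize(&attr, stack_size);
  // ...
  // ret != 0 here is what eventually surfaces as the OutOfMemoryError
  int ret = pthread_create(&tid, &attr, (void* (*)(void*)) java_start, thread);
  // ...
}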
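The same call sequence can be tried outside the JVM. What follows is a minimal C sketch, not HotSpot code: `spawn_with_stack_size` and `thread_body` are hypothetical names introduced for illustration. It asks pthread for a stack of a given size and returns whatever error code pthread reports. POSIX documents EAGAIN as the code for insufficient resources, which is presumably the kind of failure that ends up as the Java-level error.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Trivial thread body (hypothetical): returns its argument unchanged. */
static void *thread_body(void *arg) {
    return arg;
}

/*
 * Starts one thread with an explicit stack size and joins it.
 * Returns 0 on success, otherwise the pthread error code:
 * EINVAL for a stack size pthread rejects, EAGAIN when the
 * system lacks the resources to create another thread.
 */
int spawn_with_stack_size(size_t stack_size) {
    pthread_attr_t attr;
    pthread_t tid;

    int ret = pthread_attr_init(&attr);
    if (ret != 0) {
        return ret;
    }

    /* pthread allocates the stack itself; we only pass the size. */
    ret = pthread_attr_setstacksize(&attr, stack_size);
    if (ret == 0) {
        ret = pthread_create(&tid, &attr, thread_body, NULL);
        if (ret == 0) {
            ret = pthread_join(tid, NULL);
        }
    }

    pthread_attr_destroy(&attr);
    return ret;
}
```

Calling `spawn_with_stack_size(1024 * 1024)` mirrors the 1MB default mentioned above. Calling it in a loop without joining would be one way to watch the return value flip to EAGAIN on a constrained machine, though the exact point at which that happens depends on the system's limits.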

pthread:

mem = mmap (NULL, size, prot, MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

Readers unfamiliar with pthread might assume MAP_STACK does something special. It doesn't. It is merely a patch introduced to address problems caused by MAP_32BIT on older Linux. See: Tangled up in threads

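(Note: when Java is the main program (rather than embedded in another one), HotSpot VM leaves the process's initial thread (the primordial thread) unused. Its stack is set up by Linux when the process is created, and HotSpot VM has no extra control over it (such as setting the stack size), so the Java launcher has the primordial thread sleep quietly until the Java program finishes and exits.)

So when you hit "unable to create new native thread" on HotSpot VM, the main cause is still a failure to allocate the stack for a Java thread.

Finally, here is a link to my notes from an earlier experiment, which recorded how things were laid out in the virtual address space of a JDK6u25 Java process on 64-bit Linux. You can see that Java thread stacks are scattered "randomly" across the address space, with no centralized "stack area": https://gist.github.com/rednaxelafx/2139774

If it is the OS that reports insufficient memory, then it depends on how the OS implements this. When Linux performs a memory mapping it checks whether enough memory is available. The following is from Professional Linux Kernel Architecture, for reference only.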

vm_enough_memory is invoked [9] if either the MAP_NORESERVE flag is not set or the value of the kernel parameter sysctl_overcommit_memory [10] is set to OVERCOMMIT_NEVER, that is, when overcommiting is not allowed. The function chooses whether to allocate the memory needed for the operation. If it selects against, the system call terminates with -ENOMEM.
[9] Using security_vm_enough_memory, which calls __vm_enough_memory over varying paths depending on the security framework in use.
[10] sysctl_overcommit_memory can be set with the help of the /proc/sys/vm/overcommit_memory. Currently there are three overcommit options. 1 allows an application to allocate as much memory as it wants, even more than is permitted by the address space of the system. 0 means that heuristic overcommitting is applied with the result that the number of usable pages is determined by adding together the pages in the page cache, the pages in the swap area, and the unused page frames; requests for allocation of a smaller number of pages are permitted. 2 stands for the strictest mode, known as strict overcommitting, in which the permitted number of pages that can be allocated is calculated as follows:
allowed = (totalram_pages - hugetlb) * sysctl_overcommit_ratio / 100;
allowed += total_swap_pages;
Here sysctl_overcommit_ratio is a configurable kernel parameter that is usually set to 50. If the total number of pages used exceeds this value, the kernel refuses to perform further allocations.
Why does it make sense to allow an application to allocate more pages than can ever be handled in principle? This is sometimes required for scientific applications. Some tend to allocate huge amounts of memory without actually requiring it — but, in the opinion of the application authors, it seems good to have it just in case. If the memory will, indeed, never be used, no physical page frames will ever be allocated, and no problem arises.
Such a programming style is clearly bad practice, but unfortunately this is often no criterion for the value of software. Writing clean code is usually not rewarding in the scientific community outside computer science. There is only immediate interest that a program works for a given configuration, while efforts to make programs future-proof or portable do not seem to provide immediate benefits and are therefore often not valued at all.
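The strict-overcommit formula quoted above is easy to work through by hand. The C sketch below just evaluates it; `strict_overcommit_limit_pages` is a hypothetical name, and the figures in the usage note are illustrative, not measurements from the machines discussed in this thread.

```c
#include <stdint.h>

/*
 * Evaluates the strict-overcommit limit from the quoted text:
 *   allowed  = (totalram_pages - hugetlb) * overcommit_ratio / 100;
 *   allowed += total_swap_pages;
 * All quantities are in pages; overcommit_ratio is a percentage.
 */
uint64_t strict_overcommit_limit_pages(uint64_t totalram_pages,
                                       uint64_t hugetlb_pages,
                                       uint64_t overcommit_ratio,
                                       uint64_t total_swap_pages) {
    uint64_t allowed =
        (totalram_pages - hugetlb_pages) * overcommit_ratio / 100;
    allowed += total_swap_pages;
    return allowed;
}
```

For example, with 4 KiB pages, 16 GiB of RAM is 4194304 pages and 2 GiB of swap is 524288 pages. With no huge pages and a ratio of 50, the limit is 2097152 + 524288 = 2621440 pages, or 10 GiB of committed memory for the whole system. In that mode, every 1MB thread stack mapping counts against this total whether or not the stack is ever touched.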

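grep (user running java) /etc/security/limits.conf

Look at the line whose third column is as and note the value in its fourth column. Then run top and check whether the VIRT of every Java process running as that user has reached that value. If so, raise the limit in limits.conf.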
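The `as` item in limits.conf sets the address space limit, which a process can also read for itself through getrlimit(RLIMIT_AS). Below is a small C sketch of that check; `address_space_limit` is a hypothetical helper name. Note the unit difference: limits.conf takes the value in KB, while getrlimit() reports bytes.

```c
#include <sys/resource.h>

/*
 * Reads the soft address-space limit of the calling process.
 * Returns 1 if a finite limit is set, 0 if it is unlimited,
 * and -1 if getrlimit() fails. The soft limit (in bytes, or
 * RLIM_INFINITY) is stored in *soft_limit on success.
 */
int address_space_limit(rlim_t *soft_limit) {
    struct rlimit rl;

    if (getrlimit(RLIMIT_AS, &rl) != 0) {
        return -1;
    }
    *soft_limit = rl.rlim_cur;
    return rl.rlim_cur == RLIM_INFINITY ? 0 : 1;
}
```

If this returns 1 and the value is close to the VIRT column that top shows for the Java process, further stack mappings are likely to be refused regardless of how much physical memory is free.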

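My gut feeling is that you may be using a 32-bit JDK and have run out of virtual memory.

I'm no expert, but the usual cause of this kind of problem really is ulimit. What does ulimit -a print?

And free?

What do you mean by "no more than 7,000"? How many threads does jstack show?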