Where is the real efficiency bottleneck of system calls?

01-14

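Why are system calls less efficient?
Or, put differently, where does a system call spend most of its time?
What I am asking about is the overhead of the syscall itself, relative to a function call.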

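When people say system calls are expensive, they usually mean expensive relative to a function call.

At the assembly level, a function call is just a CALL or JMP. In hardware, such an instruction can disrupt the pipeline the first time it executes, but when the pattern is very regular, most CPUs handle it well through branch prediction.

For a CALL instruction, this is what the CPU does (from the Intel manual, near call):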

When executing a near call, the processor does the following (see Figure 6-2):
1. Pushes the current value of the EIP register on the stack.
2. Loads the offset of the called procedure in the EIP register.
3. Begins execution of the called procedure.
When executing a near return, the processor performs these actions:
1. Pops the top-of-stack value (the return instruction pointer) into the EIP register.
2. If the RET instruction has an optional n argument, increments the stack pointer by the number of bytes specified with the n operand to release parameters from the stack.
3. Resumes execution of the calling procedure.

In short: save EIP, load the new EIP, and execute.
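The steps quoted above can be sketched as a toy model. This is only an illustration, not how a CPU is built: `ToyCPU`, its fields, and the addresses are hypothetical names and values, and the stack is a plain Python list.

```python
class ToyCPU:
    """Toy model of near CALL/RET: one return address pushed, one popped."""

    def __init__(self, eip):
        self.eip = eip      # address of the next instruction to execute
        self.stack = []     # stands in for the memory stack

    def near_call(self, target):
        self.stack.append(self.eip)  # 1. push the return address (current EIP)
        self.eip = target            # 2. load the callee's offset into EIP
        # 3. execution simply continues at the new EIP

    def near_ret(self, n=0):
        self.eip = self.stack.pop()  # 1. pop the return address into EIP
        # 2. optional "RET n": release n bytes of parameters, modeled here
        #    as 4-byte stack slots (an assumption of this sketch)
        for _ in range(n // 4):
            self.stack.pop()
        # 3. execution resumes in the caller


cpu = ToyCPU(eip=0x1005)   # EIP already points past the CALL instruction
cpu.near_call(0x2000)
after_call = (cpu.eip, list(cpu.stack))
cpu.near_ret()
print(hex(cpu.eip), cpu.stack)  # prints: 0x1005 []
```

The whole round trip touches one stack slot and one register, which is why the cost is so low.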

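For a system call, things get much messier. Linux used to handle system calls through the INT 80H interrupt. The flow for an interrupt with a stack switch is as follows: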

If a stack switch does occur, the processor does the following:
1. Temporarily saves (internally) the current contents of the SS, ESP, EFLAGS, CS, and EIP registers.
2. Loads the segment selector and stack pointer for the new stack (that is, the stack for the privilege level being called) from the TSS into the SS and ESP registers and switches to the new stack.
3. Pushes the temporarily saved SS, ESP, EFLAGS, CS, and EIP values for the interrupted procedure's stack onto the new stack.
4. Pushes an error code on the new stack (if appropriate).
5. Loads the segment selector for the new code segment and the new instruction pointer (from the interrupt gate or trap gate) into the CS and EIP registers, respectively.
6. If the call is through an interrupt gate, clears the IF flag in the EFLAGS register.
7. Begins execution of the handler procedure at the new privilege level.
A return from an interrupt or exception handler is initiated with the IRET instruction. The IRET instruction is similar to the far RET instruction, except that it also restores the contents of the EFLAGS register for the interrupted procedure.
When executing a return from an interrupt or exception handler from the same privilege level as the interrupted procedure, the processor performs these actions:
1. Restores the CS and EIP registers to their values prior to the interrupt or exception.
2. Restores the EFLAGS register.
3. Increments the stack pointer appropriately.
4. Resumes execution of the interrupted procedure.

Simply put, more state is saved and the CPU has more work to do.
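To make the difference from a near CALL (one pushed value) concrete, here is a toy model of the quoted stack-switch steps. It is a sketch under loud assumptions: the dictionaries `regs`, `tss`, and `gate`, the field names `SS0`/`ESP0`, and all selector and address values are hypothetical, and privilege checks are left out entirely.

```python
def interrupt_with_stack_switch(regs, tss, gate, error_code=None):
    """Toy model of an interrupt with a stack switch; returns (new_regs, new_stack)."""
    # 1. Temporarily save the interrupted context.
    saved = [regs[name] for name in ("SS", "ESP", "EFLAGS", "CS", "EIP")]
    # 2. Load SS:ESP for the target privilege level from the TSS.
    new_regs = dict(regs, SS=tss["SS0"], ESP=tss["ESP0"])
    new_stack = []
    # 3. Push the five saved values onto the new stack (4 bytes each).
    for value in saved:
        new_stack.append(value)
        new_regs["ESP"] -= 4
    # 4. Push an error code if there is one.
    if error_code is not None:
        new_stack.append(error_code)
        new_regs["ESP"] -= 4
    # 5. Load the handler's CS:EIP from the gate.
    new_regs["CS"], new_regs["EIP"] = gate["CS"], gate["EIP"]
    # 6. An interrupt gate clears IF (bit 9 of EFLAGS).
    if gate["type"] == "interrupt":
        new_regs["EFLAGS"] &= ~(1 << 9)
    return new_regs, new_stack


# Hypothetical register, TSS, and gate contents, for illustration only.
regs = {"SS": 0x2B, "ESP": 0xBFFF0000, "EFLAGS": 0x202, "CS": 0x23, "EIP": 0x08048000}
tss = {"SS0": 0x18, "ESP0": 0xC1000000}
gate = {"CS": 0x10, "EIP": 0xC0100000, "type": "interrupt"}

new_regs, kernel_stack = interrupt_with_stack_switch(regs, tss, gate)
print(len(kernel_stack), hex(new_regs["ESP"]))  # prints: 5 0xc0ffffec
```

Even in this simplified form, entering the handler reads the TSS and the gate and writes five stack slots, and IRET has to undo all of it on the way back.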

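The system call instruction itself costs more than an ordinary CALL or JMP because it also has to perform extra checks (privilege, validity, and so on). In addition, because a task stack switch may be involved, the kernel ends up touching a different stack and different data, so part of the cache contents stops being useful. That costs a lot of CPU performance, and it can even lead to TLB flushes (older versions of Linux had this problem).

On top of that, because a system call is a "big deal", operating systems generally perform several checks of their own inside the system call, which adds software-level overhead.

So the high cost of system calls can basically be summarized as:

1. The CPU has too much to do;

2. The software also has too much to do.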

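Because the INT instruction is so expensive, dedicated fast system call instructions were introduced later: SYSENTER/SYSEXIT from Intel and SYSCALL/SYSRET from AMD. These instructions no longer look up an IDT entry; they take their targets directly from registers (MSRs). They also do not push the old stack pointer or the return address onto the stack, and they skip most of the privilege checks (because they always switch between ring 3 and ring 0), so the CPU overhead is much smaller.

And because there is no TSS-based stack switch and nothing is pushed to memory, the pipeline is barely disturbed, so CPU performance improves a lot. Overall, though, the CPU still has more to do for a system call than for an ordinary CALL/JMP, so a system call is certainly more expensive than an ordinary function call, and by a wide margin.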
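A rough way to see the gap is to time a plain function call against a call that enters the kernel. The sketch below is only indicative: `plain_function` and `ns_per_call` are hypothetical helper names, `os.getppid` is assumed to enter the kernel on every call (true on Linux as far as I know), and the Python interpreter adds overhead to both sides, so the absolute numbers mean little and vary from machine to machine.

```python
import os
import time


def plain_function():
    """Ordinary user-space function: no kernel entry."""
    return 42


def ns_per_call(func, iterations=200_000):
    """Average wall-clock nanoseconds per call of func."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        func()
    return (time.perf_counter_ns() - start) / iterations


call_ns = ns_per_call(plain_function)
syscall_ns = ns_per_call(os.getppid)   # assumed to enter the kernel each time
print(f"function call: {call_ns:.0f} ns, getppid(): {syscall_ns:.0f} ns")
```

For a cleaner measurement the same loop would be written in C, where the function call itself costs only a few cycles and the difference is far more visible.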

I can only speak in such general terms here. I recommend reading the Intel manual; there are far too many pitfalls in this area.

---------------------------------------------------------

The asker does not seem very familiar with what the CPU actually does, so here is the pseudocode for CALL/INT/SYSENTER.

First, near CALL:

    IF near call
        THEN IF near relative call
            THEN
                IF OperandSize = 64
                    THEN
                        tempDEST ← SignExtend(DEST); (* DEST is rel32 *)
                        tempRIP ← RIP + tempDEST;
                        IF stack not large enough for a 8-byte return address
                            THEN #SS(0); FI;
                        Push(RIP);
                        RIP ← tempRIP;
                FI;
                IF OperandSize = 32
                    THEN
                        tempEIP ← EIP + DEST; (* DEST is rel32 *)
                        IF tempEIP is not within code segment limit THEN #GP(0); FI;
                        IF stack not large enough for a 4-byte return address
                            THEN #SS(0); FI;
                        Push(EIP);
                        EIP ← tempEIP;
                FI;
                IF OperandSize = 16
                    THEN
                        tempEIP ← (EIP + DEST) AND 0000FFFFH; (* DEST is rel16 *)
                        IF tempEIP is not within code segment limit THEN #GP(0); FI;
                        IF stack not large enough for a 2-byte return address
                            THEN #SS(0); FI;
                        Push(IP);
                        EIP ← tempEIP;
                FI;
            ELSE (* Near absolute call *)
                IF OperandSize = 64
                    THEN
                        tempRIP ← DEST; (* DEST is r/m64 *)
                        IF stack not large enough for a 8-byte return address
                            THEN #SS(0); FI;
                        Push(RIP);
                        RIP ← tempRIP;
                FI;
                IF OperandSize = 32
                    THEN
                        tempEIP ← DEST; (* DEST is r/m32 *)
                        IF tempEIP is not within code segment limit THEN #GP(0); FI;
                        IF stack not large enough for a 4-byte return address
                            THEN #SS(0); FI;
                        Push(EIP);
                        EIP ← tempEIP;
                FI;
                IF OperandSize = 16
                    THEN
                        tempEIP ← DEST AND 0000FFFFH; (* DEST is r/m16 *)
                        IF tempEIP is not within code segment limit THEN #GP(0); FI;
                        IF stack not large enough for a 2-byte return address
                            THEN #SS(0); FI;
                        Push(IP);
                        EIP ← tempEIP;
                FI;
        FI; (* rel/abs *)
    FI; (* near *)

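It looks long, but in practice only one branch is taken, so the actual cost is small.

Below is part of the pseudocode for the INT instruction (the full version is too long, so I only paste the interrupt-gate portion for cross-privilege calls; the full version is roughly 3 to 5 times as long as what is pasted here):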

    IF PE = 0
        THEN
            GOTO REAL-ADDRESS-MODE;
        ELSE (* PE = 1 *)
            IF (VM = 1 and IOPL < 3 AND INT n)
                THEN
                    #GP(0);
                ELSE (* Protected mode, IA-32e mode, or virtual-8086 mode interrupt *)
                    IF (IA32_EFER.LMA = 0)
                        THEN (* Protected mode, or virtual-8086 mode interrupt *)
                            GOTO PROTECTED-MODE;
                        ELSE (* IA-32e mode interrupt *)
                            GOTO IA-32e-MODE;
                    FI;
            FI;
    FI;

    IA-32e-MODE:
    IF ((vector_number * 16) + 15) is not in IDT limits
    or selected IDT descriptor is not an interrupt-, or trap-gate type
        THEN #GP((vector_number << 3) + 2 + EXT);
        (* EXT is bit 0 in error code *)
    FI;
    IF software interrupt (* Generated by INT n, INT 3, but not INTO *)
        THEN
            IF gate descriptor DPL < CPL
                THEN #GP((vector_number << 3) + 2 );
                (* PE = 1, DPL < CPL, software interrupt *)
            FI;
        ELSE (* Generated by INTO *)
            #UD;
    FI;
    IF gate not present
        THEN #NP((vector_number << 3) + 2 + EXT);
    FI;
    IF ((vector_number * 16)[IST] ≠ 0)
        NewRSP ← TSS[ISTx];
    FI;
    GOTO TRAP-OR-INTERRUPT-GATE; (* Trap/interrupt gate *)
    END;

    TRAP-OR-INTERRUPT-GATE:
    Read segment selector for trap or interrupt gate (IDT descriptor);
    IF segment selector for code segment is NULL
        THEN #GP(0H + EXT); FI; (* NULL selector with EXT flag set *)
    IF segment selector is not within its descriptor table limits
        THEN #GP(selector + EXT); FI;
    Read trap or interrupt handler descriptor;
    IF descriptor does not indicate a code segment or code segment descriptor DPL > CPL
        THEN #GP(selector + EXT); FI;
    IF trap or interrupt gate segment is not present,
        THEN #NP(selector + EXT); FI;


    IF code segment is non-conforming and DPL < CPL
        THEN
            IF VM = 0
                THEN
                    GOTO INTER-PRIVILEGE-LEVEL-INTERRUPT;
                    (* PE = 1, interrupt or trap gate, nonconforming code segment, DPL < CPL, VM = 0 *)
                ELSE (* VM = 1 *)
                    IF code segment DPL ≠ 0
                        THEN #GP(new code segment selector); FI;
                    GOTO INTERRUPT-FROM-VIRTUAL-8086-MODE;
                    (* PE = 1, interrupt or trap gate, DPL < CPL, VM = 1 *)
            FI;
        ELSE (* PE = 1, interrupt or trap gate, DPL ≥ CPL *)
            IF VM = 1
                THEN #GP(new code segment selector); FI;
            IF code segment is conforming or code segment DPL = CPL
                THEN
                    GOTO INTRA-PRIVILEGE-LEVEL-INTERRUPT;
                ELSE
                    #GP(CodeSegmentSelector + EXT);
                    (* PE = 1, interrupt or trap gate, nonconforming code segment, DPL > CPL *)
            FI;
    FI;

    END;

    INTRA-PRIVILEGE-LEVEL-INTERRUPT:
    (* PE = 1, DPL = CPL or conforming segment *)
    IF 32-bit gate and IA32_EFER.LMA = 0
        THEN
            IF current stack does not have room for 16 bytes (error code pushed)
            or 12 bytes (no error code pushed)
                THEN #SS(0); FI;
        ELSE IF 16-bit gate
            IF current stack does not have room for 8 bytes (error code pushed)
            or 6 bytes (no error code pushed)
                THEN #SS(0); FI;
        ELSE (* 64-bit gate *)
            IF StackAddress is non-canonical
                THEN #SS(0); FI;
    FI;
    IF instruction pointer not within code segment limit
        THEN #GP(0); FI;
    IF 32-bit gate
        THEN
            Push (EFLAGS);
            Push (far pointer to return instruction); (* 3 words padded to 4 *)
            CS:EIP ← Gate(CS:EIP); (* Segment descriptor information also loaded *)
            Push (ErrorCode); (* If any *)
        ELSE
            IF 16-bit gate
                THEN
                    Push (FLAGS);
                    Push (far pointer to return location); (* 2 words *)
                    CS:IP ← Gate(CS:IP); (* Segment descriptor information also loaded *)
                    Push (ErrorCode); (* If any *)
                ELSE (* 64-bit gate *)
                    Push(far pointer to old stack); (* Old SS and SP, each an 8-byte push *)
                    Push(RFLAGS); (* 8-byte push *)
                    Push(far pointer to return instruction); (* Old CS and RIP, each an 8-byte push *)
                    Push(ErrorCode); (* If needed, 8 bytes *)
                    CS:RIP ← GATE(CS:RIP); (* Segment descriptor information also loaded *)
            FI;
    FI;
    CS(RPL) ← CPL;
    IF interrupt gate
        THEN IF ← 0; FI; (* Interrupt flag set to 0: disabled *)
    TF ← 0;
    NT ← 0;
    VM ← 0;
    RF ← 0;
    END;

Essentially all of the code above has to run.

Finally, SYSENTER:

    IF CR0.PE = 0 THEN #GP(0); FI;
    IF SYSENTER_CS_MSR[15:2] = 0 THEN #GP(0); FI;
    EFLAGS.VM ← 0; (* ensures protected mode execution *)
    EFLAGS.IF ← 0; (* Mask interrupts *)
    EFLAGS.RF ← 0;
    CS.SEL ← SYSENTER_CS_MSR (* Operating system provides CS *)
    (* Set rest of CS to a fixed value *)
    CS.BASE ← 0; (* Flat segment *)
    CS.LIMIT ← FFFFFH; (* 4-GByte limit *)
    CS.ARbyte.G ← 1; (* 4-KByte granularity *)
    CS.ARbyte.S ← 1;
    CS.ARbyte.TYPE ← 1011B; (* Execute + Read, Accessed *)
    CS.ARbyte.D ← 1; (* 32-bit code segment *)
    CS.ARbyte.DPL ← 0;
    CS.SEL.RPL ← 0;
    CS.ARbyte.P ← 1;
    CPL ← 0;
    SS.SEL ← CS.SEL + 8;
    (* Set rest of SS to a fixed value *)
    SS.BASE ← 0; (* Flat segment *)
    SS.LIMIT ← FFFFFH; (* 4-GByte limit *)
    SS.ARbyte.G ← 1; (* 4-KByte granularity *)
    SS.ARbyte.S ← 1;
    SS.ARbyte.TYPE ← 0011B; (* Read/Write, Accessed *)
    SS.ARbyte.D ← 1; (* 32-bit stack segment *)
    SS.ARbyte.DPL ← 0;
    SS.SEL.RPL ← 0;
    SS.ARbyte.P ← 1;
    ESP ← SYSENTER_ESP_MSR;
    EIP ← SYSENTER_EIP_MSR;

It does more than CALL, but it is still much simpler than INT.
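The SYSENTER pseudocode boils down to "load a fixed state from three MSRs". A toy model shows how little input it needs. This is a sketch under assumptions: the function name `sysenter`, the dictionary keys, and the MSR values are hypothetical, and the fixed segment attributes (base, limit, type bits) are left out.

```python
def sysenter(msr):
    """Toy model of SYSENTER in 32-bit protected mode."""
    # IF SYSENTER_CS_MSR[15:2] = 0 THEN #GP(0)
    if (msr["SYSENTER_CS"] & 0xFFFC) == 0:
        raise RuntimeError("#GP(0)")
    cs = msr["SYSENTER_CS"] & 0xFFFC   # CS.SEL.RPL is forced to 0
    return {
        "CS": cs,
        "SS": cs + 8,                  # SS.SEL is derived, not looked up
        "ESP": msr["SYSENTER_ESP"],
        "EIP": msr["SYSENTER_EIP"],
        "CPL": 0,
        "stack_pushes": 0,             # nothing is saved to memory
    }


# Hypothetical MSR contents, for illustration only.
msr = {"SYSENTER_CS": 0x60, "SYSENTER_ESP": 0xC1000000, "SYSENTER_EIP": 0xC0100000}
state = sysenter(msr)
print(hex(state["CS"]), hex(state["SS"]), state["stack_pushes"])  # prints: 0x60 0x68 0
```

Compared with the INT path there is no descriptor table or TSS lookup and no push. The flip side is that the user-mode return address and stack pointer are not saved by the instruction, so software has to preserve them by convention.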

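With the syscall tracers in Ftrace/Perf you can see what is actually going on inside. Besides the overhead of the syscall itself, on a preemptible system, interrupts, scheduling of higher-priority tasks, synchronization between threads, I/O waits, and so on can all add overhead.

A system call is the uniform interface the operating system provides to programmers, and to give programmers a good experience it does a lot of work behind that interface. An ordinary mov or load instruction does exactly one thing: once it has executed, the next instruction runs right away, with none of this extra work to worry about, so it is fast.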

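A system call is a software interrupt (the int 0x80 instruction on legacy 32-bit x86 Linux). When user mode issues it, the CPU traps into the kernel, and the operating system has to save a lot of context: register values, stack pushes, and so on. After the kernel work finishes it has to switch back to user mode, and at that point a scheduling preemption may happen as well: if you are unlucky, the CPU goes off to run another process, and even if you are lucky the original context still has to be restored. A context switch (context_switch) also involves locking, unlocking, and contention for CPU resources (on multicore). Where data addresses are involved, the kernel has to copy the data to user space, because user mode and kernel mode work in different address spaces (unless memory mapping is used).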
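The copy-versus-mapping point can be illustrated with a small sketch. It assumes a regular temporary file on a local file system; the variable names are hypothetical, and the final slice `view[:]` itself copies into a Python `bytes` object, so this shows the two access paths rather than measuring them.

```python
import mmap
import os
import tempfile

payload = b"x" * 8192

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

# Copy path: read() asks the kernel to copy file data into a user buffer.
fd = os.open(path, os.O_RDONLY)
copied = os.read(fd, len(payload))

# Mapping path: after mmap(), the file pages are accessed like ordinary memory.
with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as view:
    mapped = view[:]

os.close(fd)
os.remove(path)
print(copied == mapped)  # prints: True
```

With `read()` every chunk costs a system call plus a copy; with a mapping, the system call is paid once when the mapping is set up and later accesses are plain loads (plus page faults on first touch).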

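A system call triggers a software interrupt. The overhead comes from switching between the user-mode stack and the kernel-mode stack, from the privilege level change, and from the operating system's safety checks on the parameters passed in by the user-mode program.

Context switches.

CPU cache misses.

Personally, I think the real reason system calls perform poorly is poor cache locality.

My feeling is that it is because system calls are mostly I/O operations, such as reading and writing files or devices.

The CPU accessing memory and the CPU accessing a device operate at two very different speed levels.

I cannot say exactly what the hardware and the software each have to do.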

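However, as far as optimizing C programs goes, a few rules apply: avoid calling file I/O frequently, avoid allocating or freeing memory frequently, do not put potentially blocking operations alongside performance-critical code, and so on.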
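The "avoid frequent file I/O calls" rule amounts to batching: pay the system call cost once per buffer instead of once per record. The answer talks about C, where buffered stdio does this for you; the sketch below shows the same idea in Python with the thin `os.write` wrapper. The helper names `write_each` and `write_batched` are hypothetical, and it assumes small writes to a regular file complete in full (no partial writes).

```python
import os
import tempfile


def write_each(fd, records):
    """One write() system call per record; returns the number of calls."""
    calls = 0
    for record in records:
        os.write(fd, record)
        calls += 1
    return calls


def write_batched(fd, records):
    """Join the records in user space, then issue a single write()."""
    os.write(fd, b"".join(records))
    return 1


records = [b"line %d\n" % i for i in range(1000)]

fd_a, path_a = tempfile.mkstemp()
fd_b, path_b = tempfile.mkstemp()
calls_each = write_each(fd_a, records)
calls_batched = write_batched(fd_b, records)
os.close(fd_a)
os.close(fd_b)

with open(path_a, "rb") as a, open(path_b, "rb") as b:
    same_content = a.read() == b.read()
os.remove(path_a)
os.remove(path_b)
print(calls_each, calls_batched, same_content)  # prints: 1000 1 True
```

Both files end up identical, but the batched version enters the kernel once instead of a thousand times. The same reasoning applies to memory: reusing one buffer avoids repeated trips to the allocator and, indirectly, to the kernel.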