LVS performance: what is the theoretical limit on forwarded traffic?

01-21

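I'd like to ask the experts here about this.
When LVS is used as a load balancer together with keepalived, how much traffic can it typically forward?
Take my own machine as an example: a Dell 1950 with eight CPU cores and a BCM5708 NIC, running Linux kernel 3.0.78.
In a real production environment it forwarded roughly 300k packets per second. Above 300k the NIC started dropping packets and CPU si approached 40%; as the load climbed slowly to 900k, essentially everything was dropped.
So I'd like to ask people in the industry: what is the theoretical limit of LVS + keepalived?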

Optimization suggestions are also welcome.

LVS performance is mainly improved in three areas:

1. ipvs connection table size

The official explanation is already very clear:

IPVS connection table size (the Nth power of 2)

The IPVS connection hash table uses the chaining scheme to handle hash collisions. Using a big IPVS connection hash table will greatly reduce conflicts when there are hundreds of thousands of connections in the hash table.
Note the table size must be power of 2. The table size will be the value of 2 to the your input number power. The number to choose is from 8 to 20, the default number is 12, which means the table size is 4096. Don't input the number too small, otherwise you will lose performance on it. You can adapt the table size yourself, according to your virtual server application. It is good to set the table size not far less than the number of connections per second multiplying average lasting time of connection in the table. For example, your virtual server gets 200 connections per second, the connection lasts for 200 seconds in average in the connection table, the table size should be not far less than 200x200, it is good to set the table size 32768 (2**15). Another note that each connection occupies 128 bytes effectively and each hash entry uses 8 bytes, so you can estimate how much memory is needed for your box.
You can overwrite this number setting conn_tab_bits module parameter or by appending ip_vs.conn_tab_bits=? to the kernel command line if IP VS was compiled built-in.

On our production machines we usually set this to 20.
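The sizing rule in the quoted help text can be turned into a quick estimate. The following is only a sketch: the function names are hypothetical, the 128-byte and 8-byte figures are taken from the quote above, and the helper rounds up to the next power of two, so for the 200 x 200 example it suggests 16 rather than the 15 that the help text settles for.

```shell
#!/usr/bin/env bash
# Hypothetical helper: smallest bit count in [8, 20] whose table size (2^bits)
# is at least conns_per_sec * avg_lifetime_sec.
suggest_conn_tab_bits() {
    local target=$(( $1 * $2 ))
    local bits=8
    while (( bits < 20 && (1 << bits) < target )); do
        bits=$(( bits + 1 ))
    done
    echo "$bits"
}

# Memory used by the hash table itself: 8 bytes per hash entry.
conn_table_bytes() {
    echo $(( (1 << $1) * 8 ))
}

# Memory used by tracked connections: 128 bytes per connection.
conn_entries_bytes() {
    echo $(( $1 * 128 ))
}
```

With the value 20 used above, `conn_table_bytes 20` gives 8388608 bytes for the table, so the hash table is cheap compared with the connections it indexes. When ip_vs is built as a module, the parameter can be passed at load time, for example `modprobe ip_vs conn_tab_bits=20`.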

2. Reducing CPU soft interrupt pressure

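Here is some material I put together internally at my company a while ago. In short, it uses the NIC's multiple IRQs and spreads them across several CPUs to squeeze out as much performance as possible:

There are basically two approaches to SoftIRQ load (CPU si% usage):

1. Find a NIC that supports multiple IRQs
2. Use RPS

We only came across RPS late last year, so I am still unfamiliar with many of its performance details; this is a good opportunity for me to compare the two.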
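For reference, RPS is configured per receive queue through sysfs, by writing a hexadecimal CPU bitmask to that queue's `rps_cpus` file. The sketch below only prints the commands instead of running them; the function name, the device name `eth0` and the mask `ff` (CPUs 0 to 7) are assumptions for illustration, and the kernel must be built with RPS support (2.6.35 or later).

```shell
#!/usr/bin/env bash
# Print (not run) the commands that would enable RPS on every receive queue
# of a device. mask is a hex CPU bitmask, e.g. ff for CPUs 0-7.
print_rps_commands() {
    local dev=$1 queues=$2 mask=$3 q
    for (( q = 0; q < queues; q++ )); do
        echo "echo $mask > /sys/class/net/$dev/queues/rx-$q/rps_cpus"
    done
}
```

Calling `print_rps_commands eth0 1 ff` prints a single command for `rx-0`; the output can be reviewed first and then run as root.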
Starting with the first approach: the Dell R410 uses a Broadcom Corporation NetXtreme II BCM5716 NIC, driven by bnx2. Here is the dmesg log:
bnx2: Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v2.2.3f (Oct 25, 2012)
bnx2 0000:01:00.0: PCI INT A -> GSI 36 (level, low) -> IRQ 36
bnx2 0000:01:00.0: setting latency timer to 64
bnx2 0000:01:00.0: eth0: Broadcom NetXtreme II BCM5716 1000Base-T (C0) PCI Express found at mem da000000, IRQ 36, node addr d4:ae:52:a3:fe:15
bnx2 0000:01:00.1: PCI INT B -> GSI 48 (level, low) -> IRQ 48
bnx2 0000:01:00.1: setting latency timer to 64
bnx2 0000:01:00.1: eth1: Broadcom NetXtreme II BCM5716 1000Base-T (C0) PCI Express found at mem dc000000, IRQ 48, node addr d4:ae:52:a3:fe:16
bnx2 0000:01:00.0: irq 60 for MSI/MSI-X
bnx2 0000:01:00.0: irq 61 for MSI/MSI-X
bnx2 0000:01:00.0: irq 62 for MSI/MSI-X
bnx2 0000:01:00.0: irq 63 for MSI/MSI-X
bnx2 0000:01:00.0: irq 64 for MSI/MSI-X
bnx2 0000:01:00.0: irq 65 for MSI/MSI-X
bnx2 0000:01:00.0: irq 66 for MSI/MSI-X
bnx2 0000:01:00.0: irq 67 for MSI/MSI-X
bnx2 0000:01:00.0: irq 68 for MSI/MSI-X
bnx2 0000:01:00.0: em1: using MSIX
bnx2 0000:01:00.0: em1: NIC Copper Link is Up, 1000 Mbps full duplex
bnx2 0000:01:00.1: irq 69 for MSI/MSI-X
bnx2 0000:01:00.1: irq 70 for MSI/MSI-X
bnx2 0000:01:00.1: irq 71 for MSI/MSI-X
bnx2 0000:01:00.1: irq 72 for MSI/MSI-X
bnx2 0000:01:00.1: irq 73 for MSI/MSI-X
bnx2 0000:01:00.1: irq 74 for MSI/MSI-X
bnx2 0000:01:00.1: irq 75 for MSI/MSI-X
bnx2 0000:01:00.1: irq 76 for MSI/MSI-X
bnx2 0000:01:00.1: irq 77 for MSI/MSI-X
bnx2 0000:01:00.1: em2: using MSIX
bnx2 0000:01:00.1: em2: NIC Copper Link is Up, 1000 Mbps full duplex
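We're in luck: this NIC already has 8 IRQs, so it seems even cheap NICs support this nowadays. Now a look at /proc/interrupts:
This machine has two Xeon E5620s, each with 4 cores and 8 hardware threads, so with HT enabled we see 16 logical CPUs. The screenshot was taken with HT already disabled, so only 8 logical CPUs show up. Be careful: with HT enabled, one NIC has 8 IRQs, so with 16 CPUs you have to think carefully about how those IRQs are assigned.
/proc/cpuinfo tells you which physical core and which physical CPU each logical CPU belongs to. Try to spread the IRQs as evenly as possible across the physical cores. You can Google how to work this out; I'll just give the result: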
L1 = logical CPU 1
P1 = physical core 1

L0,L8 = P1
L1,L9 = P2
L2,L10 = P3
L3,L11 = P4
L4,L12 = P5
L5,L13 = P6
L6,L14 = P7
L7,L15 = P8
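I have not distinguished between the two physical CPUs here; in some cases that distinction is necessary.
Given this layout, I spread the IRQs evenly across the first HT of each core: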
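A logical-to-physical mapping like the one above can be read out of /proc/cpuinfo on x86 by pairing each `processor` entry with its `physical id` (socket) and `core id`. This is a minimal sketch with a hypothetical function name; it assumes the usual field order in /proc/cpuinfo, where `physical id` appears before `core id` within each processor block.

```shell
#!/usr/bin/env bash
# Print one line per logical CPU: "L<logical> socket=<physical id> core=<core id>".
# Reads /proc/cpuinfo by default; pass another file to inspect saved output.
cpu_topology() {
    awk -F': *' '
        /^processor/   { cpu = $2 }
        /^physical id/ { sock = $2 }
        /^core id/     { print "L" cpu " socket=" sock " core=" $2 }
    ' "${1:-/proc/cpuinfo}"
}
```

Logical CPUs that report the same socket and core values are HT siblings of one physical core, which is the information needed to avoid putting two busy IRQs on the same core.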
echo 2 > /proc/irq/60/smp_affinity
echo 8 > /proc/irq/61/smp_affinity
echo 20 > /proc/irq/62/smp_affinity
echo 80 > /proc/irq/63/smp_affinity
echo 1 > /proc/irq/64/smp_affinity
echo 4 > /proc/irq/65/smp_affinity
echo 10 > /proc/irq/66/smp_affinity
echo 40 > /proc/irq/67/smp_affinity
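Each value written to `smp_affinity` is a hexadecimal bitmask in which bit N selects logical CPU N, so CPU 0 is 1, CPU 5 is 20 and CPU 7 is 80. A small helper (hypothetical name) makes the values less error-prone to produce; it is a sketch for CPUs 0 to 31 only, since on larger machines the kernel expects comma-separated 32-bit groups, which this does not generate.

```shell
#!/usr/bin/env bash
# Hex smp_affinity mask selecting a single logical CPU: bit N set for CPU N.
cpu_mask() {
    printf '%x\n' $(( 1 << $1 ))
}
```

For example, `echo "$(cpu_mask 5)" > /proc/irq/62/smp_affinity` pins IRQ 62 to logical CPU 5, matching the value 20 used above.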
After this adjustment you will see something like the following:

The soft IRQ (si) load is spread evenly across all CPUs.

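In production we usually don't use Broadcom NICs but Intel's 350 series. Here is the number of IRQs on a production machine:

It doesn't quite fit on one screen, so use your imagination.

Unfortunately, I have 64 IRQs but only 12 CPUs.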
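When there are more IRQs than CPUs, one simple option is to assign them round-robin so that every CPU ends up with roughly the same number. The answer does not say which policy is used in production, so the sketch below is only an assumption. The function name is hypothetical, and it prints the commands rather than running them.

```shell
#!/usr/bin/env bash
# Print (not run) commands that spread the given IRQ numbers round-robin
# over CPUs 0..ncpus-1. First argument: CPU count; remaining: IRQ numbers.
print_irq_spread() {
    local ncpus=$1 i=0 irq
    shift
    for irq in "$@"; do
        printf 'echo %x > /proc/irq/%s/smp_affinity\n' $(( 1 << (i % ncpus) )) "$irq"
        i=$(( i + 1 ))
    done
}
```

For instance, `print_irq_spread 12 $(seq 60 123)` would lay out 64 IRQs over 12 CPUs; the IRQ numbers here are made up, so take the real ones from /proc/interrupts. If irqbalance is running it may rewrite these assignments.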

3. The iptables raw table

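Anyone running LVS is surely running iptables too. If some company's front end has no firewall at all, please tell the WooYun white hats right away so they can rack up points~~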

By default iptables also tracks the state of every connection. In many cases I don't need iptables to do that. What to do? No problem, we have the raw table:

That's right, we can tell it NOTRACK: stop recording!
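A rule of this kind might look like the sketch below, which prints the command instead of applying it. The function name, the address 192.0.2.10 and port 80 in the usage note are placeholders, not values from the original setup. Only inbound traffic to the virtual service is covered, so return traffic that passes through the director (for example in NAT mode) would need its own rule.

```shell
#!/usr/bin/env bash
# Print (not run) a raw-table rule that exempts traffic to one virtual service
# from connection tracking. vip and port are placeholders chosen by the caller.
print_notrack_rule() {
    local vip=$1 port=$2
    echo "iptables -t raw -A PREROUTING -d $vip -p tcp --dport $port -j NOTRACK"
}
```

`print_notrack_rule 192.0.2.10 80` prints the rule for review. Untracked packets no longer match state-based rules such as ESTABLISHED, so the filter rules need checking before this is applied; on newer kernels the equivalent target is `-j CT --notrack`.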

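Finally, back to the original question: does LVS have a limit? I don't know; all I can say is that I haven't seen it yet. In 2008 I also ran about two months of tests on a Dell 1950 (configured differently from today's 1950), and back then pushing it to 100k connections was no problem at all.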

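Today we use a Dell R420 with a Xeon E5-2430. How far can it go? I honestly don't know... To load-test a single LVS box you need far too many machines generating load in front and acting as RealServers behind. Our office environment is limited, and I simply can't get hold of that many machines.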

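Currently we serve about 300 million PV per day; I estimate only about 4K connections per second and about 50K packets per second. What does the load look like? I'm almost embarrassed to say: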

So the asker's 900K packets per second is still far from the limit.

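I too hope someone will take a 64-core machine, find a few dozen front-end and back-end machines, load-test it properly and share the result, so that I finally have an answer as well.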

--------

One more thing just occurred to me: do the Real Servers behind your director have SYN cookies enabled? Could that be the cause of your packet loss?

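I was wondering just now whether ipvs itself could be affected by system-level SYN cookies, but on reflection that seems unlikely, because the SYN cookie logic is only triggered when an application layer takes over the connection, and ipvs has no application layer at all.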

Then I found the mailing-list thread [lvs-users] IPVS SYN-cookies, whose basic reasoning matches mine. So the question becomes: do your Real Servers have SYN cookies enabled?

There is also the optimization of locks and hook points.