Which is faster: `a = false` or `if (a) { a = false; }`?

As the title says. Consider the situation where caches and compiler optimizations are present.


The question doesn't say what language this is, nor what `a` actually is. Asking "which is faster, `a = false` or `if (a) { a = false; }`" in isolation cannot be given a useful answer.

What the asker probably wants to know is whether a "blind write" (e.g. a = false) or a "conditional write" (e.g. if (a) a = false;) is faster. But without enough context there is nothing to analyze.

==========================================

Suppose the source language is C and a is a local variable whose address is never taken. This is the simplest case: after compiler optimization the former and the latter are equally fast. The "a = false" looks like an assignment implemented by a memory write, but any reasonably optimizing compiler tracks the definitions (assignments) of variables at compile time and does not necessarily use an actual memory write to implement the assignment: the variable may be allocated to a register, or the assignment may simply be eliminated.

For example, consider this:

#include <stdbool.h>

const char* foo(bool a) {
    if (a) { a = false; }
    return a ? "alpha" : "go";
}

Any mainstream optimizing C compiler can optimize this at least into a form equivalent to the following:

#include <stdbool.h>

const char* foo(bool ignored) {
    return "go";
}

This optimization can be achieved with a very simple analysis. One way is to first convert the code into SSA form:

const char* foo(bool a0) {
    if (a0 == true) {
        // if.then
        a1 = false;
    } /* else {
        // if.else
        // a0 == false
    } */
    a2 = phi({if.then, a1}, {if.else, a0}); // SSA Phi
    return a2 == true ? "alpha" : "go";
}

At this point we know that a0 must be false in the if.else branch. Conditional constant propagation then discovers that both inputs to a2's phi function are false, so a2 can be simplified directly to a2 = false; after that the earlier branch has no remaining uses and can be eliminated, and the later a2 == true condition can be constant-folded, yielding the final form given above.
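
Spelled out, the result of those steps is roughly equivalent to the following (just a sketch of what the transformed function behaves like, not any particular compiler's actual IR; the function name is made up):

const char* foo_after_ccp(bool a0) {
    // branch on a0 eliminated: it no longer has any uses
    // a2 = false;                  // both phi inputs were false
    return false ? "alpha" : "go";  // folds to "go", giving the final form above
}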

This case has nothing whatsoever to do with whether the code runs under multithreading.

==========================================

Suppose the source language is C++ and a is a local variable, but its type is a C++ class that has an implicit bool conversion operator (operator bool) with a side effect, for example:

#include <iostream>

class Foo {
    int v_;
public:
    Foo(int v) : v_(v) { }
    operator bool() {
        std::cout << "Foo(" << v_ << ")" << std::endl;
        return v_ != 0;
    }
};

int main(int argc, char* argv[]) {
    Foo a(argc);
    if (a) { a = false; }
    return 0;
}

Then the "if (a)" here carries a side effect, which makes it impossible for the compiler to optimize it away completely.
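
To see the side effect concretely, compare what the two forms do with the Foo class above (an illustration that relies on that snippet's definitions):

Foo a(argc);            // as in the main() above

// Form 1: no conversion to bool is needed, so nothing is printed:
a = false;              // false -> Foo(0), then copy assignment

// Form 2: the test has to call operator bool(), which prints "Foo(...)":
if (a) { a = false; }

The output to std::cout is observable behaviour, so the compiler has to keep the conversion call and the branch in form 2.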

In this case, a plain `a = false` is obviously faster than `if (a) a = false;`.

This case also has no direct relation to whether the program runs in a multithreaded environment.

==========================================

But after all this rambling, what the asker probably really wants to ask is this case:

If the source language is C/C++ or a similar language that uses shared-memory semantics for inter-thread communication (Java and C#, for example, run into the same issue), a is a global variable or otherwise shared between threads, and we want to assign false to a in a multithreaded environment, then things get interesting.

Here let me briefly describe a real example we ran into while working on our JVM implementation.

First, a link: False sharing induced by card table marking - Dave Dice

In short, the generational GCs in the HotSpot JVM use a data structure called the "card table" to track cross-generational references from the old generation to the young generation. The card table is simply a byte array in which each byte corresponds to a 512-byte region of the GC heap (called a "card page"); if a byte in the card table is marked with the "dirty" value, the corresponding card page contains a cross-generational reference.
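
In code, the address-to-card mapping described above is roughly this (a minimal sketch with made-up names; the real HotSpot field and function names differ):

#include <cstdint>

// card_table has (heap_size >> 9) entries; one byte per 512-byte card page
inline uintptr_t card_index_for(uintptr_t addr_in_heap, uintptr_t heap_base) {
    return (addr_in_heap - heap_base) >> 9;   // 2^9 = 512 bytes per card page
}
// card_table[card_index_for(addr, heap_base)] == dirty  means that card page may
// contain an old->young reference and must be rescanned during a young GC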

HotSpot VM inserts a small piece of code, called a "write barrier", at every write to a reference-typed field of a Java object (i.e. the putfield bytecode) to keep the card table up to date. Roughly speaking, for Java code like this:

x.referenceField = y;

HotSpot VM by default actually does something like this:

x.referenceField = y;
// post-write barrier:
card_table[uintptr_t(&x.referenceField) >> card_shift] = dirty_card_val;

As you can see, the write barrier used here is a so-called "blind card mark": "blind" means that regardless of the current value of the corresponding card table entry, we unconditionally write to it. In single-threaded code this is very fast: few instructions, no branches, and the CPU can blast right through it.

In a multithreaded environment, however, if many threads write to the same card table entry (or, worse, to different entries on the same cache line, so-called false sharing), the cache designs common in modern CPUs can produce unnecessary contention and hurt performance.
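
Here is a small standalone sketch of that false-sharing effect, independent of HotSpot (the 64-byte cache-line size, iteration count and two-thread setup are assumptions): two threads each hammer their own byte, first with the two bytes on the same cache line, then padded onto different lines.

#include <chrono>
#include <cstdio>
#include <thread>

struct Shared {
    volatile char a;    // written by thread 1
    volatile char b;    // written by thread 2; same cache line as a
};

struct Padded {
    volatile char a;
    char pad[63];       // assume 64-byte cache lines
    volatile char b;    // now (very likely) on a different cache line
};

template <typename T>
double time_writes(T& s) {
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([&s] { for (long i = 0; i < 100000000; ++i) s.a = 1; });
    std::thread t2([&s] { for (long i = 0; i < 100000000; ++i) s.b = 1; });
    t1.join();
    t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    Shared shared = {};
    Padded padded = {};
    std::printf("same cache line:       %.2fs\n", time_writes(shared));
    std::printf("different cache lines: %.2fs\n", time_writes(padded));
    return 0;
}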

So HotSpot VM later added a -XX:+UseCondCardMark flag that lets users choose whether to configure the write barrier in a conditional form:

x.referenceField = y;
// post-write barrier:
byte* addr = &card_table[uintptr_t(&x.referenceField) >> card_shift];
if (*addr != dirty_card_val) *addr = dirty_card_val;

See, this is exactly the form in the question, isn't it? ^_^

As for why a conditional memory write can be faster than an unconditional one under multithreading, I'll leave that for others to explain~

==========================================

Even now in the multi-core era, many compilers still optimize mainly for single-threaded performance. The optimizations mentioned at the beginning of this answer are all of that kind.

But there are exceptions. For example, AMD Open64 has an -mso option:

http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/open64.pdf

-mso Instructs the compiler to perform aggressive optimizations that are likely to improve the scalability of an application running on a system with multi-core processors. In particular, these optimizations may target machine resources that are shared among the multiple cores of a processor, e.g. memory bandwidth, shared L3 cache, etc.

This family of optimizations tries to improve a program's multithreaded performance, even if that may mean giving up a bit of single-threaded performance.

Have a look at what AMD's Roy Ju says in an interview (quoted from Open64VideoTranscripts; the recording is AMD x86 Open64 Compiler Suite Team Insights; more AMD Open64 videos: Inside Track Video Series Archives - AMD):

Roy Ju, Architect in the Open 64 compiler team

Sharon: Roy, can you tell us what do you do here at AMD and how long have you been with AMD?

Roy: My name is Roy Ju, and I am an AMD Fellow. I have been working on compiler optimizations for about 20 years. I have been with AMD for over 3 years and currently I am an architect in the Open64 compiler team.

Sharon: So, the Open64 compiler, can you describe any of the unique features that are in that compiler?

Roy: Open64 has many state-of-the-art compiler technologies. But we have put a big focus on the compiler optimizations to reduce memory bandwidth consumption. We group such optimizations under a new -mso option, which stands for multi-core scalability optimizations.

Sharon: How does the memory bandwidth issue arise?

Roy: Memory bus is often a performance bottleneck, in particular in many data center server applications. This problem is particularly pronounced in a multi-core processor, where some resources, such as memory bus, L3 cache, and DRAM controller, are shared among different cores within a processor.

Sharon: Ok, so how does an application incur a memory bandwidth issue more than just needing a lot of data?

Roy: When an app needs to access data, it will go through the memory subsystem to get it. The quantity of data that gets fetched each time is fixed. It could be a page or a cache line. If you use only a small fraction of the data, you are wasting some memory bandwidth to bring in the data that you don't really need. When the memory traffic is light, it's probably no big deal. However, in a multi-core system, many applications are running at the same time among different cores. They all fire off requests of data to the same memory sub-system. This is like rushed hours on a highway, and the traffic is jammed.

Sharon: What can a compiler do to address this problem?

Roy: Dealing with memory bandwidth has been traditionally seen as a platform issue and it's beyond compiler's control. However the Open64 compilers take up the challenge. It performs a number of aggressive loop optimizations and data layout optimizations to improve the data locality. We use the data fetched from memory as many times as possible before they are replaced or put frequently used data next to each other so that more data are used within a fetched quantity. This is like carpooling on highway. We don't build more lanes in the highway but we can squeeze more people into a car to reduce the number of cars on the highway.

Sharon: Interesting, ok so how come these optimizations aren't widely available in other compilers?

Roy: In some cases, while these optimizations can greatly reduce memory bandwidth consumption, they could potentially lead to the execution of a larger number of instructions as well. When you run a single application on a multi-core system, where bandwidth is not a bottleneck, such optimizations might actually slow down a program. This is like taking a step beyond carpooling to take a bus. But bus makes stops and passengers getting in and out of a bus take extra time. In our optimizations, this means we shuffle data around or rearrange them, which could lead to overhead during execution. Most compilers focus on the traditional single-thread performance, and not addressing platform performance issues. Therefore, they don't pay enough attention to this type of optimizations.

Sharon: How does Open64 avoid these potential slowdown problems, the bus analogy?

Roy: To avoid possibly slowing down programs during light memory traffic, we group these optimizations under the -mso option so that a user can decide whether to apply this option based on the expected execution environment, e.g. how heavy the memory traffic may be. This allows you to get the fast speed of a car when the traffic is light and the benefit of carpooling or sharing a bus to reduce traffic jam when the traffic is heavy. This is a unique feature in the Open64 compilers. We encourage you to try our new -mso option in our upcoming release.

Although AMD Open64's -mso option will not "optimize" the first form in the question (a = false;) into the second (if (a) a = false;), it does perform other optimizations, aimed at the memory-access efficiency of multithreaded programs, that other compilers usually don't.

Pretty interesting too, isn't it? ^_^


Leaving aside operator overloading, type casts and other such messiness, in most cases the first one is faster; that much is obvious. But are there exceptions?

The first form: one write.

The second form: one read, possibly one write, and possibly a branch misprediction.

The second form already takes a hit whenever a happens to be true. For the second to beat the first, two conditions must hold:

  1. A read must be cheaper than a write. For simple variable types there aren't many scenarios where a read is faster than a write, but they do exist. For example, two processor cores operating on memory within the same cache block (not necessarily the same address; the same address would rarely be hammered like this without a lock). If both cores only read, the block is cached in both cores and main memory is never touched; if there are writes, then after one core writes, the other core has to synchronize through the memory system, which is costly. (Cache coherence - Wikipedia)
  2. The probability that a is true must be small. If a is true with probability p, the second form performs a write with probability p, and its branch misprediction rate can be estimated as 2p(1-p) (assuming the predictor predicts the previous outcome: a misprediction happens exactly when two consecutive values differ, i.e. with probability p(1-p) + (1-p)p). For the 0.1% true density used below, that is a 0.1% chance of a write and roughly a 0.2% misprediction rate.

I tried it: 4 cores and 4 threads traversing 400000000 bools, 0.1% of which are true, OpenMP threads, O3 optimization. The second form was indeed two to three times faster (in a statistical sense, since whether a given write causes a main-memory access depends on timing and is nondeterministic):

for (size_t i = thread_id; i < array_size; i += thread_num) {
    bool& a = array[i]; // The question says `a`, make the asker happy
    partial_count += (a) ? 1 : 0; // analog of more memory access, not necessary
    a = false; // or `if (a) { a = false; }`
}
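
For reference, a more complete sketch of that kind of harness (my reconstruction, not the original benchmark: the char array, exact true density, timing and reduction are assumptions; note that `a` above needs to be a reference, or the store must go straight to array[i], for the write to reach memory at all). Compile with something like g++ -O3 -fopenmp:

#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const size_t array_size = 400000000;
    // char rather than std::vector<bool>, so each element is its own byte
    std::vector<char> array(array_size, 0);
    for (size_t i = 0; i < array_size; i += 1000)   // roughly 0.1% "true"
        array[i] = 1;

    size_t count = 0;
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(4) reduction(+:count)
    {
        const size_t thread_id  = (size_t)omp_get_thread_num();
        const size_t thread_num = (size_t)omp_get_num_threads();
        size_t partial_count = 0;
        for (size_t i = thread_id; i < array_size; i += thread_num) {
            partial_count += array[i] ? 1 : 0;
            // blind write variant:       array[i] = 0;
            // conditional write variant:
            if (array[i]) array[i] = 0;
        }
        count += partial_count;
    }
    std::printf("true entries seen: %zu, time: %.3fs\n", count, omp_get_wtime() - t0);
    return 0;
}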


The former is a single write; the latter is an if plus a read, or an if plus a read plus a write. Assume a non-SMP system (i.e. only one CPU) and that the variable has not been optimized into a register (in that case the former necessarily wins). With a cache hit and a correct branch prediction, the two take almost the same time (roughly one cycle). With a cache hit and a misprediction, the latter is about 4x slower. With a cache miss and a correct prediction, the latter is n times faster (depending on what the following code is). With a cache miss and a misprediction, the latter is somewhat slower (one extra read compared with the former). If the memory address is shared by several cores, this is the double-checked-locking (DCLP) style of pattern, and under heavy multithreaded contention the latter tends to be faster.
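
The contended-shared-flag case in that last sentence, written out as a small sketch (the use of std::atomic and relaxed ordering is my addition, just to make the cross-thread write explicit; this is only the check-before-write half of the DCLP idea, not the full pattern):

#include <atomic>

std::atomic<bool> a{true};   // flag shared by many threads

void clear_flag() {
    // blind write: every caller's store invalidates the cache line in the other cores
    //   a.store(false, std::memory_order_relaxed);

    // conditional write: once the flag is already false, later callers only read,
    // so the cache line can stay in a shared (read-only) state in every core
    if (a.load(std::memory_order_relaxed))
        a.store(false, std::memory_order_relaxed);
}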

In practice, if the variable lives on the heap (very likely a cache miss), for example when scanning a heap-allocated array in a loop, the Linux kernel source tends to prefer the latter style, i.e. check with an if first, then assign; if the variable is on the stack (almost certainly a cache hit), the former style is usually used.


Let me go off on a tangent. Some MCU registers won't let you write a bit unless you have read it, especially interrupt flags and error flags, so a conditional is often used to force a read of the bit. The bit is usually declared volatile in its definition, so the read won't be optimized away.

After compilation it looks something like this:

test_bit #0:3, #h"ABCD:16 ; read bit 0 at 16-bit address ABCD
branch_if_zero #h"02:8 ; skip the 3 bytes of the clear_bit instruction
clear_bit #0:3, #h"ABCD:16 ; clear bit 0 at 16-bit address ABCD to 0

The instruction set is one I made up on the spot.
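
In C/C++ source the same read-then-clear pattern might look roughly like this (the register address, bit position and names are all made up, mirroring the invented assembly above):

#include <cstdint>

// Hypothetical memory-mapped status register; address and bit are invented.
#define STATUS_REG (*(volatile uint16_t*)0xABCDu)
#define ERR_FLAG   (1u << 0)

void clear_error_flag() {
    if (STATUS_REG & ERR_FLAG) {    // the condition forces a read of the bit
        STATUS_REG &= ~ERR_FLAG;    // clear it; the hardware wants the read first
    }
}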


Please don't answer this.

If it gets answered, a big crowd of engineers in the industry will start writing it that way, and then tell you it's the faster way.


This depends on compiler optimization, and such a small snippet is hard to judge. It could be that the if is false and the latter merely skips one write to the variable, but the difference is tiny and doesn't mean much.


This really can't be settled in general. In C#, if a is read and written through a getter and a setter, and the setter costs more than the getter (say it can trigger GC), then reading first to confirm that a write is needed, and only then writing, is cheaper than writing every time.
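
The same idea sketched in C++ terms (the class and its "expensive" setter are invented for illustration; in C# it would be a property whose setter does the costly work):

class Flag {
    bool value_ = true;
public:
    bool get() const { return value_; }   // cheap read
    void set(bool v) {
        // imagine notification, logging, allocation, GC pressure ... here
        value_ = v;
    }
};

void clear(Flag& f) {
    if (f.get())        // cheap check first
        f.set(false);   // only pay for the expensive write when it matters
}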


I just want to say: both are O(1) in time complexity.


For a compiled language, with optimization turned on the two should generate the same instructions, i.e. be equally fast. For a scripting language the first should be faster.


Look at whether the machine instructions after compilation are the same; in general, fewer instructions means faster.

A snippet this small doesn't involve the cache; compiler optimization matters at compile time, and most compiled languages are compiled once or ahead of time, so what runs is just the compiled machine instructions, so compiler optimization doesn't change the conclusion either.


An if statement generally involves branch prediction, so not having one is definitely faster.


This is a bit of nitpicking.


The latter has an extra check, but the two are essentially equivalent. Without optimization the former usually runs faster; with optimization both will end up as the same machine operations, so at the end of the day the first form is the safer choice.

