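This article gives an in-depth explanation of MSAA (multisample anti-aliasing), covering its basic principles and comparing how it is implemented on different platforms (mainly PC and mobile). The article inevitably contains mistakes; if you spot one, please point it out so that others are not misled. Without further ado, let's get started.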






1. Geometric aliasing (jagged edges on geometry), caused by undersampling geometric edges.

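2. Shading aliasing, caused by undersampling the shading function (the rendering equation) evaluated in the shader. The most visible symptom is flickering specular highlights.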


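3. Temporal aliasing, caused mainly by undersampling fast-moving objects, for example an animation that visibly jumps between frames in a game.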








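  • Coverage: coverage is determined by testing whether a primitive overlaps a given pixel. On a GPU, coverage is decided by testing whether a sample point located at the center of the pixel lies inside the primitive. The following figure illustrates the process.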


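  • Occlusion: occlusion tells us whether a pixel covered by one primitive is also covered by another primitive. This should be familiar: it is the depth test against the z-buffer.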


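As far as rasterization is concerned, MSAA works much like SSAA: coverage and occlusion are both evaluated at a higher resolution. For coverage, the hardware generates n sub-sample points for each pixel according to a sampling pattern. The next figure shows the sub-sample positions for a rotated grid pattern.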

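The triangle is tested for coverage against each sub-sample point of the pixel, producing a binary coverage mask that records which sub-samples of the current pixel the triangle covers. For the occlusion test, the triangle's depth is interpolated at each covered sub-sample position and compared with the depth stored in the z-buffer. Because the depth test runs per sub-sample rather than per pixel, the depth buffer must grow accordingly to store the extra depth values. In practice this means the depth buffer is n times the size of the non-MSAA one.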
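The coverage mask and the per-sample depth test described above can be sketched in a few lines of Python. This is only an illustrative model, not how any particular GPU implements it: the four sample offsets are an assumed rotated-grid pattern, and the names `coverage_mask`, `interpolate_depth`, and `depth_test` are hypothetical helpers invented for this example.

```python
# Assumed 4x rotated-grid sample offsets inside a unit pixel (illustrative values).
SAMPLE_OFFSETS = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]


def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p); the sign tells which side of a->b p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def inside(tri, p):
    """True when p lies inside the triangle, for either winding order."""
    a, b, c = tri
    w0, w1, w2 = edge(a, b, p), edge(b, c, p), edge(c, a, p)
    return (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0)


def coverage_mask(tri, px, py):
    """Bit i is set when sub-sample i of pixel (px, py) is covered by the triangle."""
    mask = 0
    for i, (ox, oy) in enumerate(SAMPLE_OFFSETS):
        if inside(tri, (px + ox, py + oy)):
            mask |= 1 << i
    return mask


def interpolate_depth(tri, z, p):
    """Barycentric interpolation of the per-vertex depths z at point p."""
    a, b, c = tri
    area = edge(a, b, c)
    return (edge(b, c, p) * z[0] + edge(c, a, p) * z[1] + edge(a, b, p) * z[2]) / area


def depth_test(mask, tri, z, px, py, depth_samples):
    """Less-than depth test per covered sub-sample.

    depth_samples holds one depth per sub-sample (n values per pixel) and is
    updated in place; the return value is the mask of samples that passed.
    """
    passed = 0
    for i, (ox, oy) in enumerate(SAMPLE_OFFSETS):
        if not mask & (1 << i):
            continue
        d = interpolate_depth(tri, z, (px + ox, py + oy))
        if d < depth_samples[i]:
            depth_samples[i] = d
            passed |= 1 << i
    return passed
```

For instance, a triangle that covers only the left half of pixel (0, 0) sets the bits of the two left-hand samples, and a later, farther triangle covering the whole pixel only passes the depth test on the two remaining samples.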




MSAA Resolve



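Of course, different hardware vendors may use different algorithms, for example NVIDIA's "Quincunx" AA. As graphics hardware has evolved, we can now also perform the MSAA resolve ourselves in a custom shader. [6]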
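As a rough illustration of what a resolve does, the sketch below averages the sub-sample colours of one pixel (a plain box filter, assumed here as the simplest common case) and also shows a weighted variant of the kind a custom resolve shader might use. The function names and the weights are hypothetical; vendor-specific filters such as Quincunx are not modelled.

```python
def resolve_box(samples):
    """Average n per-sample RGB colours into one pixel colour (box filter)."""
    n = len(samples)
    return tuple(sum(c[ch] for c in samples) / n for ch in range(3))


def resolve_weighted(samples, weights):
    """Weighted average of per-sample RGB colours, normalised by the weight sum."""
    total = sum(weights)
    return tuple(
        sum(w * c[ch] for c, w in zip(samples, weights)) / total for ch in range(3)
    )
```

A pixel whose four samples are half red and half blue resolves to an even mix of the two with `resolve_box`; with equal weights `resolve_weighted` gives the same result.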





IMR vs TBR vs TBDR[21] [22]

IMR (Immediate Mode Rendering)

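At present, PC platforms almost all use immediate mode rendering: the CPU submits render data and render commands, and the GPU starts executing them. What has already been drawn and what will be drawn later have very little bearing on the current draw (Early Z aside). The flow is shown in the figure below: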


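TBR (Tile-Based Rendering)

TBR divides the screen into a series of small tiles and processes each one separately, so tiles can be processed in parallel. Because at any moment the GPU needs only a part of the scene's data to do its work, that data (color, depth, and so on) is small enough to be kept in on-chip memory, which effectively reduces the number of accesses to system memory. The benefits are lower power consumption and lower bandwidth usage, which in turn yields higher performance.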
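The idea of splitting the screen into tiles can be made concrete with a small binning sketch: each triangle is assigned to every tile its bounding box touches, so that a tile can later be rendered using only its own list. This is a simplified, conservative model under assumed parameters (a 16-pixel tile, bounding-box binning); `bin_triangles` is a hypothetical helper and real hardware binning is more refined.

```python
import math


def bin_triangles(triangles, width, height, tile=16):
    """Map each tile (tx, ty) to the indices of triangles whose bounding box touches it."""
    tiles_x = (width + tile - 1) // tile
    tiles_y = (height + tile - 1) // tile
    bins = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}
    for idx, tri in enumerate(triangles):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # Clamp the bounding box to the screen; off-screen triangles yield empty ranges.
        x0 = max(0, math.floor(min(xs)) // tile)
        x1 = min(tiles_x - 1, math.floor(max(xs)) // tile)
        y0 = max(0, math.floor(min(ys)) // tile)
        y1 = min(tiles_y - 1, math.floor(max(ys)) // tile)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[(tx, ty)].append(idx)
    return bins
```

On a 32x32 target with 16x16 tiles there are four bins; a small triangle near the origin lands only in tile (0, 0), while one straddling the screen center is listed in all four.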


TBDR (Tile-Based Deferred Rendering)

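TBDR is similar to TBR: it also splits the screen into tiles and uses on-chip memory to hold the data (color, depth, and so on). In addition it uses a deferral technique called Hidden Surface Removal (HSR), which postpones texturing and shading until the visibility of every pixel in the tile has been determined, so only pixels that end up visible consume processing resources. Unnecessary work on hidden pixels is eliminated, which keeps per-frame bandwidth and processing cycles as low as possible, giving higher performance and lower power consumption.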
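To see why deferring shading helps, the following sketch counts how many fragments get shaded for the same opaque fragment stream under two simplified models: an immediate-style pipeline that shades every fragment passing the depth test on arrival, and a deferred pipeline that first finds the nearest fragment per pixel and then shades once. Both functions are hypothetical illustrations that assume opaque geometry and ignore early-Z subtleties.

```python
def shaded_immediate(fragments):
    """Count shaded fragments when each fragment is depth-tested and shaded on arrival.

    fragments: iterable of (x, y, depth, primitive_id) in submission order.
    """
    depth = {}
    shaded = 0
    for x, y, z, _prim in fragments:
        if z < depth.get((x, y), float("inf")):
            depth[(x, y)] = z
            shaded += 1  # shaded now, even if a nearer fragment arrives later
    return shaded


def shaded_deferred(fragments):
    """Count shaded fragments when visibility is resolved before any shading."""
    nearest = {}
    for x, y, z, prim in fragments:
        if (x, y) not in nearest or z < nearest[(x, y)][0]:
            nearest[(x, y)] = (z, prim)
    return len(nearest)  # exactly one shading invocation per covered pixel
```

With three opaque fragments submitted back to front on one pixel plus one fragment on a neighbouring pixel, the immediate model shades four fragments while the deferred model shades two.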





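1. 4x MSAA needs four times the tile buffer memory. On-chip tile buffer memory is very expensive, so the GPU works around this by reducing the tile size. A smaller tile size does affect performance, but halving the tile size does not halve performance; applications whose bottleneck is the fragment shader will see only a small impact.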

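2. The second effect is that more fragments are generated along object edges; this also happens in IMR. Each polygon covers more pixels, as shown in the figure below. Moreover, where background and foreground primitives both contribute to the same pixel, both fragments must be shaded, so hardware hidden surface removal culls fewer pixels. The cost of these extra fragments depends on how many edges the scene contains, but 10% is a reasonable guess.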
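The first point is easy to check with some arithmetic. The sketch below computes the on-chip storage a tile needs and shrinks the tile until it fits a fixed budget. The byte sizes (4 bytes of color and 4 bytes of depth/stencil per sample), the halving order, and the function names are assumptions made for illustration; actual tile sizes and formats are hardware-specific.

```python
def tile_buffer_bytes(tile_w, tile_h, samples, color_bytes=4, depth_stencil_bytes=4):
    """On-chip bytes needed to hold colour and depth/stencil for every sample of a tile."""
    return tile_w * tile_h * samples * (color_bytes + depth_stencil_bytes)


def tile_dims_for_budget(tile_w, tile_h, samples, budget_bytes):
    """Halve the tile (height first, then width) until it fits the on-chip budget."""
    while tile_buffer_bytes(tile_w, tile_h, samples) > budget_bytes and (tile_w > 1 or tile_h > 1):
        if tile_h >= tile_w:
            tile_h //= 2
        else:
            tile_w //= 2
    return tile_w, tile_h
```

If the budget is exactly what a 32x32 tile needs without MSAA, then under these assumptions 2x MSAA fits in a 32x16 tile and 4x MSAA in a 16x16 tile, so the same screen is covered by four times as many tiles at 4x.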



JUST22 - Multisampled resolve on-tile is supported in hardware with no bandwidth hit. Mali GPUs support resolving multisampled framebuffers on-tile. Combined with tile-buffer support for full throughput in 4x MSAA makes 4x MSAA a very compelling way of improving quality with minimal speed hit.[24]

In GLES on Mali GPUs, the simplest case for 4xMSAA would be to render directly to the window surface (FB0), having set EGL_SAMPLES to 4. This will do all multisampling and resolving in the GPU registers, and will only flush the resolved buffer to memory. This is the most efficient way to implement MSAA on a Mali GPU, and comes at almost no performance cost compared to rendering to a normal window surface. Note that this does not expose the sample buffers themselves to you, and does not require an explicit resolve.[25]
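To put the "only flush the resolved buffer to memory" point in numbers, here is a back-of-the-envelope sketch comparing the color data written to system memory per frame when the resolve happens on-tile versus when every sample has to be written out. It assumes 4 bytes per color value and ignores depth, compression, and the later resolve pass; `framebuffer_write_bytes` is a hypothetical helper, not a driver API.

```python
def framebuffer_write_bytes(width, height, samples, on_tile_resolve, color_bytes=4):
    """Colour bytes written to system memory for one frame.

    With an on-tile resolve only one colour per pixel leaves the chip;
    otherwise all `samples` colours per pixel are written out.
    """
    written_per_pixel = 1 if on_tile_resolve else samples
    return width * height * written_per_pixel * color_bytes
```

At 1920x1080 with 4x MSAA, this model writes about 8.3 MB per frame with an on-tile resolve (the same as without MSAA) and about 33.2 MB when all samples are written out.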

Qualcomm Adreno:

Anti-aliasing is an important technique for improving the quality of generated images. It reduces the visual artifacts of rendering into discrete pixels. Among the various techniques for reducing aliasing effects, multisampling is efficiently supported by Adreno 4xx. Multisampling divides every pixel into a set of samples, each of which is treated like a "mini-pixel" during rasterization. Each sample has its own color, depth, and stencil value. And those values are preserved until the image is ready for display. When it is time to compose the final image, the samples are resolved into the final pixel color. Adreno 4xx supports the use of two or four samples per pixel.


Another benefit of the SGX and SGX-MP architecture is the ability to perform efficient 4x Multi-Sample Anti-Aliasing (MSAA). MSAA is performed entirely on-chip, which keeps performance high without introducing a system memory bandwidth overhead (as would be seen when performing anti-aliasing in some other architectures). To achieve this, the tile size is effectively quartered and 4 sample positions are taken for each fragment (e.g., if the tile size is 16x16, an 8x8 tile will be processed when MSAA is enabled). The reduction in tile size ensures the hardware has sufficient memory to process and store colour, depth and stencil data for all of the sample positions. When the ISP operates on each tile, HSR and depth tests are performed for all sample positions. Additionally, the ISP uses a 1 bit flag to indicate if a fragment contains an edge. This flag is used to optimize blending operations later in the render. When the subsamples are submitted to the TSP, texturing and shading operations are executed on a per-fragment basis, and the resultant colour is set for all visible subsamples. This means that the fragment workload will only slightly increase when MSAA is enabled, as the subsamples within a given fragment may be coloured by different primitives when the fragment contains an edge. When performing blending, the edge flag set by the ISP indicates if the standard blend path needs to be taken, or if the optimized path can be used. If the destination fragment contains an edge, then the blend needs to be performed individually for each visible subsample to give the correct resultant colour (standard blend). If the destination fragment does not contain an edge, then the blend operation is performed once and the colour is set for all visible subsamples (optimized blend). Once a tile has been rendered, the Pixel Back End (PBE) combines the subsample colours for each fragment into a single colour value that can be written to the frame buffer in system memory. As this combination is done on the hardware before the colour data is sent, the system memory bandwidth required for the tile flush is identical to the amount that would be required when MSAA is not enabled.[26]
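The two blend paths in the passage above can be modelled roughly as follows: when the destination fragment contains an edge, each visible sub-sample is blended separately; otherwise one blend is computed and copied to all visible sub-samples. This is a simplified software model with hypothetical names (`blend_over`, `blend_fragment`) and a plain alpha blend, not a description of the actual ISP/TSP hardware logic.

```python
def blend_over(src, dst, alpha):
    """Standard alpha blend of one RGB colour over another."""
    return tuple(alpha * s + (1.0 - alpha) * d for s, d in zip(src, dst))


def blend_fragment(samples, has_edge, src, alpha, visible_mask):
    """Blend src into the sub-samples of one fragment.

    samples: per-sub-sample destination colours.
    has_edge: the 1-bit edge flag of the destination fragment.
    Returns (new_samples, number_of_blend_operations).
    """
    out = list(samples)
    visible = [i for i in range(len(out)) if visible_mask & (1 << i)]
    if not visible:
        return out, 0
    if has_edge:
        # Standard path: sub-samples may hold different colours, blend each one.
        for i in visible:
            out[i] = blend_over(src, out[i], alpha)
        return out, len(visible)
    # Optimized path: all sub-samples share one colour, so blend once and copy.
    blended = blend_over(src, out[0], alpha)
    for i in visible:
        out[i] = blended
    return out, 1
```

For a 4x fragment with all sub-samples visible, the edge case costs four blend operations in this model and the non-edge case costs one.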

On PowerVR hardware Multi-Sampled Anti-Aliasing (MSAA) can be performed directly in on-chip memory before being written out to system memory, which saves valuable memory bandwidth. In general, MSAA is considered to cost relatively little performance. This is true for typical games and UIs, which have low geometry counts but very complex shaders. The complex shaders typically hide the cost of MSAA and have a reduced blend workload. 2x MSAA is virtually free on most PowerVR graphics cores (Rogue onwards), while 4x MSAA+ will noticeably impact performance. This is partly due to the increased on-chip memory footprint, which results in a reduction in tile dimensions (for instance 32 x 32 -> 32 x 16 -> 16 x 16 pixels) as the number of samples taken increases. This in turn results in an increased number of tiles that need to be processed by the tile accelerator hardware, which then increases the vertex stages overall processing cost. The concept of "good enough" should be followed in determining how much anti-aliasing is enough. An application may only require 2x MSAA to look "good enough", while performing comfortably at a consistent 60 FPS. In some cases there may be no need for anti-aliasing to be used at all e.g. when the target device's display has high PPI (pixels per-inch). Performing MSAA becomes more costly when there is an alpha blended edge, resulting in the graphics core marking the pixels on the edge to "on edge blend". On edge blend is a costly operation, as the blending is performed for each sample by a shader (i.e. in software). In contrast, on opaque edge is performed by dedicated hardware, and is a much cheaper operation as a result. On edge blend is also "sticky", which means that once an on-screen pixel is marked, all subsequent blended pixels are blended by a shader, rather than by dedicated hardware. In order to mitigate these costs, submit all opaque geometry first, which keeps the pixels "off edge" for as long as possible. Also, developers should be extremely reserved with the use of blending, as blending has lots of performance implications, not just for MSAA. [27]


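From the discussion above we have seen how MSAA works and how architectural differences between PC and mobile platforms lead to different implementation details. MSAA affects the rasterization, fragment shading, and raster operations stages of the GPU pipeline. Each sub-sample has its own color and depth storage, and each sub-sample is depth-tested. On mobile platforms, whether extra memory is needed to store color and depth depends on the OpenGL ES version and the specific hardware implementation. In the common case (no extra memory for color and depth, sub-samples computed on-chip and then resolved straight to the framebuffer), MSAA is more efficient than on PC because it avoids the heavy bandwidth cost. Since hardware implementations vary widely, however, it is still best to rely on actual measurements. My knowledge is limited, so mistakes are inevitable; if you find any, please point them out so that others are not misled.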

