What is the relationship between the matrix libraries BLAS, CBLAS, OpenBLAS, ATLAS, LAPACK, and MKL, and do they differ much in performance?

12-28

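I mainly want to work at the level of the low-level implementation, so I am not considering matrix libraries with high-level interfaces (such as Eigen).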

Thanks to Teacher Tian for the invitation. @田飛

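@基爾 has already explained well how BLAS relates to these libraries. Here I will add some performance comparisons between several of the matrix libraries.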

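Benchmark - Eigen: Eigen's official comparison. It covers the common matrix libraries, including Eigen3, Eigen2, Intel MKL, ACML, GOTO BLAS, and ATLAS. Note: every library in this comparison runs single-threaded.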

Comparison for matrix multiplication, the heaviest operation

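Although the asker said they are not considering Eigen and other libraries with high-level interfaces, Eigen's official comparison shows that on most operations Eigen's optimization comes close to MKL and in some cases even surpasses it (my personal guess is that this comes from Eigen's optimizations for single-threaded runs and for matrix sizes that are not round numbers).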

Eigen provides the benchmark source code, so the asker can verify this independently: How to run the benchmark suite
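For a feel of what such benchmarks report, here is a minimal sketch of the usual metric: time an n x n matrix multiplication and divide the conventional operation count of 2n^3 by the elapsed time. This is not Eigen's benchmark suite. The function names `matmul_naive`, `gemm_flops`, and `measure_gflops` are made up for illustration, and the pure-Python multiply stands in for a real BLAS call, so the absolute numbers will be orders of magnitude below any of the libraries discussed here.

```python
import time


def matmul_naive(a, b):
    """Multiply two square matrices stored as lists of rows (no BLAS involved)."""
    bt = list(zip(*b))  # columns of b, so the inner loop pairs a row with a column
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def gemm_flops(n):
    """Conventional operation count for an n x n multiply: n^3 multiplies plus n^3 adds."""
    return 2 * n ** 3


def measure_gflops(n, repeats=3):
    """Return the best observed GFLOPS over `repeats` timed multiplications."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        matmul_naive(a, b)
        best = min(best, time.perf_counter() - start)
    return gemm_flops(n) / best / 1e9


if __name__ == "__main__":
    for n in (32, 64, 128):
        print(f"n={n}: {measure_gflops(n):.4f} GFLOPS")
```

To compare real libraries, one would swap `matmul_naive` for a call into the library under test and keep the timing and the 2n^3 count unchanged, which is what makes results from different libraries comparable.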

Incidentally, the recently very popular tensorflow/tensorflow · GitHub is built on Eigen.

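Benchmarking (python vs. c++ using BLAS) and (numpy): a discussion on StackOverflow that grew out of one user's question. One of the respondents ran the tests on their own HPC cluster, comparing how different matrix libraries perform on different operations across matrix sizes and thread counts.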

A few selected figures:

Matrix multiplication at different matrix sizes, 8 threads

Different matrix operations, 8 threads

That user reached the following conclusions:

MKL performs best closely followed by GotoBlas2.
In the eigenvalue test GotoBlas2 performs surprisingly worse than expected. Not sure why this is the case.
Apple's Accelerate Framework performs really good especially in single threaded mode (compared to the other BLAS implementations).
Both GotoBlas2 and MKL scale very well with number of threads. So if you have to deal with big matrices running it on multiple threads will help a lot.
In any case don't use the default netlib blas implementation because it is way too slow for any serious computational work.
On our cluster I also installed AMD's ACML and performance was similar to MKL and GotoBlas2. I don't have any numbers tough.

I personally would recommend to use GotoBlas2 because it's easier to install and it's free.
If you want to code in C++/C also check out Eigen3 which is supposed to outperform MKL/GotoBlas2 in some cases and is also pretty easy to use.

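In short: the optimized libraries perform in the same range. MKL is the best, with GotoBlas2 close behind, and both scale well across threads. ACML was reported as similar, though without numbers. Apple's Accelerate Framework does well single-threaded. The default netlib BLAS is too slow for serious work and should be avoided. Finally, if you write C/C++, Eigen3 is worth a look.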

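One addition: Intel MKL can make use of the Intel Xeon Phi coprocessor and run in parallel across its many cores (dozens of x86 cores with hundreds of hardware threads, rather than GPU cores). It is fast, and because MKL performs computation offload automatically, it is somewhat more convenient than programming a typical GPU.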

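Running clBLAS on a GPU (an AMD 280X card), matrix multiplication at size 1024 scores 2.2 TFlops, against a theoretical peak of 4.3 TFlops. A fairly strong MKL result on the CPU is 20000 MFlops = 20 GFlops = 0.02 TFlops, so everything on the CPU is left far behind. Of course, this does not mean clBLAS is the better library. It only means the GPU is better suited to this kind of computation.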
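The arithmetic behind those figures can be checked with a short sketch. The three rates are the ones quoted above. The helper names `speedup`, `efficiency`, and `gemm_seconds` are hypothetical, and the time estimate assumes the conventional 2n^3 operation count for an n x n multiply.

```python
MFLOPS = 1e6
GFLOPS = 1e9
TFLOPS = 1e12

cpu_rate = 20000 * MFLOPS  # MKL on the CPU, as quoted above
gpu_rate = 2.2 * TFLOPS    # clBLAS on the 280X, as quoted above
gpu_peak = 4.3 * TFLOPS    # theoretical peak, as quoted above


def speedup(fast, slow):
    """How many times faster `fast` is than `slow`."""
    return fast / slow


def efficiency(achieved, peak):
    """Fraction of the theoretical peak that was actually reached."""
    return achieved / peak


def gemm_seconds(n, rate):
    """Estimated time for one n x n multiply, assuming 2 * n^3 operations."""
    return 2 * n ** 3 / rate


if __name__ == "__main__":
    print(f"CPU rate: {cpu_rate / GFLOPS} GFlops = {cpu_rate / TFLOPS} TFlops")
    print(f"GPU vs CPU: {speedup(gpu_rate, cpu_rate):.0f}x")
    print(f"GPU fraction of peak: {efficiency(gpu_rate, gpu_peak):.1%}")
    print(f"n=1024 on CPU: {gemm_seconds(1024, cpu_rate) * 1e3:.2f} ms")
    print(f"n=1024 on GPU: {gemm_seconds(1024, gpu_rate) * 1e3:.3f} ms")
```

Under these numbers the GPU result is about 110 times the CPU result while still reaching only about half of the card's theoretical peak.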