ClickHouse Data Compression [Translation]

Original: altinity.com/blog/2017/

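Altinity is a company that provides ClickHouse consulting and services. Its leadership team consists of ClickHouse developers and experts from Percona. A beta version of Altinity's ClickHouse cloud service is now online.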

Overview

It might not be obvious from the start, but ClickHouse supports different kinds of compression, namely two: LZ4 and ZSTD.

There are evaluations for both of these methods: percona.com/blog/2016/0

But in short, LZ4 is faster but provides a lower compression ratio than ZSTD. While ZSTD is slower than LZ4, it is often faster and compresses better than traditional Zlib, so it can be considered a replacement for Zlib compression.

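  • ClickHouse supports two data compression methods, LZ4 and ZSTD, although this may not be obvious at first.
  • For benchmarks of the two algorithms, see the Percona evaluation linked above. In short, LZ4 is faster but compresses less, and ZSTD is the opposite. Although ZSTD is slower than LZ4, it often beats traditional Zlib on both compression ratio and speed, so it can serve as a replacement for Zlib.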
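The speed versus ratio trade-off described above can be felt with a small experiment. The sketch below does not use LZ4 or ZSTD: as a stand-in, it compares two levels of Python's standard-library zlib on synthetic log-like data (the sample data is an assumption made for illustration), so the absolute numbers say nothing about ClickHouse itself.

```python
import time
import zlib

# Synthetic, repetitive log-like sample data (an assumption for illustration).
sample = b"".join(
    b"2017-11-%02d host-%03d GET /api/items/%d 200\n" % (i % 28 + 1, i % 50, i)
    for i in range(50_000)
)

def measure(level):
    """Compress `sample` at the given zlib level; return (ratio, seconds)."""
    start = time.perf_counter()
    packed = zlib.compress(sample, level)
    elapsed = time.perf_counter() - start
    assert zlib.decompress(packed) == sample  # round-trip check
    return len(sample) / len(packed), elapsed

# Level 1 favours speed, level 9 favours compression ratio.
results = {level: measure(level) for level in (1, 9)}
for level, (ratio, seconds) in results.items():
    print(f"zlib level {level}: ratio {ratio:.2f}, {seconds * 1000:.1f} ms")
```

On most inputs the higher level yields a better ratio and takes longer, which is the same kind of choice the article describes between LZ4 and ZSTD.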

Benchmark

To get some real numbers using ClickHouse, let's review a table compressed with both methods.

For this, we will take the table lineorder, from the benchmark described in altinity.com/blog/2017/

The uncompressed data size of the lineorder table at scale factor 1000 is 680G.

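  • To let the facts speak, we compare the two compression methods on a real table.
  • The schema and data of the test table (lineorder) come from the benchmark linked above.
  • The uncompressed dataset is 680 GB.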

Size comparison

And now let's load this table into ClickHouse. With the default compression (LZ4), we have:

184G lineorderlz4

And with ZSTD:

135G lineorderzstd

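Here we should mention how to make ClickHouse use ZSTD. For this, we add the following lines to the config:

  • After loading the data into ClickHouse, the table takes 184G with the default LZ4 compression (27% of the original size) and 135G with ZSTD (20% of the original size).
  • A brief note on how to enable ZSTD: the following configuration is all that is needed.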

<compression incl="clickhouse_compression">
    <case>
        <method>zstd</method>
    </case>
</compression>

So the compression ratios for this table are:

Compression ratio comparison

Compression   Ratio
LZ4           3.7
ZSTD          5.0
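The ratios follow directly from the sizes reported earlier (680G uncompressed, 184G with LZ4, 135G with ZSTD). A minimal sketch of the arithmetic, with variable and function names chosen here for illustration:

```python
# Sizes in GB, as reported in the article.
UNCOMPRESSED_GB = 680
compressed_gb = {"LZ4": 184, "ZSTD": 135}

def compression_stats(uncompressed, compressed):
    """Return (compression ratio, percent of original size) for one method."""
    return uncompressed / compressed, 100 * compressed / uncompressed

stats = {
    name: compression_stats(UNCOMPRESSED_GB, size)
    for name, size in compressed_gb.items()
}
for name, (ratio, percent) in stats.items():
    print(f"{name}: ratio {ratio:.1f}, {percent:.0f}% of original size")
```

This reproduces the 3.7 and 5.0 ratios, and the 27% and 20% figures quoted in the translator's summary.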

What about performance? For this, let's run the following query:

  • To see how compression affects performance, we run the following query.

SELECT toYear(LO_ORDERDATE) AS yod, sum(LO_REVENUE) FROM lineorder GROUP BY yod;

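We will execute this query in a "cold" run (no data is cached), followed by a "hot" run, when some data is already cached in OS memory after the first run.

  • To keep the comparison fair, we run the query twice. The first run is a cold request, with no data cached by the operating system. The second is a hot request, with data already held in the operating system's memory cache.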

So the query results for LZ4 compression are:

Cold run:
7 rows in set. Elapsed: 19.131 sec. Processed 6.00 billion rows, 36.00 GB (313.63 million rows/s., 1.88 GB/s.)

Hot run:
7 rows in set. Elapsed: 4.531 sec. Processed 6.00 billion rows, 36.00 GB (1.32 billion rows/s., 7.95 GB/s.)

For ZSTD compression:

Cold run:
7 rows in set. Elapsed: 20.990 sec. Processed 6.00 billion rows, 36.00 GB (285.85 million rows/s., 1.72 GB/s.)

Hot run:
7 rows in set. Elapsed: 7.965 sec. Processed 6.00 billion rows, 36.00 GB (753.26 million rows/s., 4.52 GB/s.)

While there is practically no difference in cold run times (as IO time dominates decompression time), in hot runs LZ4 is much faster (as there are far fewer IO operations, and decompression performance becomes the major factor).

  • For cold queries the two methods differ little, because the time spent on IO far exceeds the time spent on decompression.
  • For hot queries LZ4 is faster: IO cost is low, so decompression becomes the performance bottleneck.
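To put numbers on "practically no difference" and "much faster", the elapsed times from the query output can be compared directly. A small sketch using only the reported figures (the helper name `slowdown` is introduced here for illustration):

```python
# Elapsed seconds copied from the query output in the article.
elapsed = {
    "LZ4": {"cold": 19.131, "hot": 4.531},
    "ZSTD": {"cold": 20.990, "hot": 7.965},
}
ROWS = 6.00e9    # rows processed per run, as reported
DATA_GB = 36.00  # GB processed per run, as reported

def slowdown(run):
    """How many times longer ZSTD takes than LZ4 for the given run type."""
    return elapsed["ZSTD"][run] / elapsed["LZ4"][run]

for method, runs in elapsed.items():
    for run, seconds in runs.items():
        rate = ROWS / seconds / 1e6
        print(f"{method} {run}: {rate:.0f} million rows/s, {DATA_GB / seconds:.2f} GB/s")

print(f"cold: ZSTD takes {slowdown('cold'):.2f}x as long as LZ4")
print(f"hot:  ZSTD takes {slowdown('hot'):.2f}x as long as LZ4")
```

By these figures ZSTD takes about 1.10x as long in the cold run and about 1.76x as long in the hot run.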

Conclusion

ClickHouse offers two methods of compression, LZ4 and ZSTD, so you can choose whichever suits your case.

With LZ4 you may get better execution time at the cost of worse compression, so the data takes more space on storage.

  • ClickHouse gives us two data compression methods to choose from: LZ4 and ZSTD.
  • The default LZ4 compression gives faster execution, but in exchange we pay with more disk space.

Translator's notes

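  • ClickHouse has been in use inside our company (Sina) for some time. Beyond its efficient SQL execution, its storage footprint is another very pleasant surprise.
  • We run it on high-capacity servers (yes, the low-spec machines used as Hadoop nodes) that easily hold tens of TB each. Combined with ClickHouse's excellent compression, keeping 1-2 years of log data is no problem at all.
  • We have never changed the compression algorithm; we simply use the default LZ4.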

Compression in practice

ClickHouse

76G dagger

SELECT count(*) / 100000000
FROM dagger

┌─divide(count(), 100000000)─┐
│                53.75973187 │
└────────────────────────────┘

  • 100 million rows take about 1.4 GB
  • 1 GB holds about 70 million rows

ES

  • 100 million rows take about 33 GB
  • 1 GB holds about 3 million rows

Comparison
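The per-row figures above can be derived from the reported numbers: 76G on disk for the dagger table, a row count of 53.75973187 hundred million, and 33 GB per 100 million rows for ES. The sketch below only redoes that arithmetic. It assumes both systems store comparable log data, which the notes do not state explicitly.

```python
# Figures from the translator's notes.
UNIT = 100_000_000                    # rows per "100 million" unit
CLICKHOUSE_GB = 76                    # on-disk size of the dagger table
CLICKHOUSE_ROWS = 53.75973187 * UNIT  # count(*) result scaled back to rows
ES_GB_PER_100M = 33                   # reported for ES

ch_gb_per_100m = CLICKHOUSE_GB / (CLICKHOUSE_ROWS / UNIT)
ch_rows_per_gb = CLICKHOUSE_ROWS / CLICKHOUSE_GB
es_rows_per_gb = UNIT / ES_GB_PER_100M
# How many times more space ES needs per row (assumes comparable data).
space_factor = ES_GB_PER_100M / ch_gb_per_100m

print(f"ClickHouse: {ch_gb_per_100m:.1f} GB per 100M rows, {ch_rows_per_gb / 1e6:.0f}M rows per GB")
print(f"ES:         {ES_GB_PER_100M} GB per 100M rows, {es_rows_per_gb / 1e6:.0f}M rows per GB")
print(f"ES uses roughly {space_factor:.0f}x the space per row")
```

By these figures ES needs roughly 23 times as much disk space per row as ClickHouse.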

