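Looking at data.table, I let out a heartfelt WOW!
It's getting late, so I'll keep this short.
Mostly it just feels great: a 1.89 GB text file (roughly 6 million rows × 24 columns) loaded in under 30 seconds, and the cast to wide form took under 30 seconds more. It felt no different from playing with mtcars!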
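The post does not say which function did the loading. Assuming it was data.table's fread, the load-then-cast workflow can be sketched as below. The table, its column names (id, var, val) and its size are made up for illustration, and the timings will differ from the numbers quoted above.

```r
library(data.table)

# Hypothetical toy data: 100,000 long-format rows (far smaller than the 1.89 GB file)
set.seed(1)
n <- 1e5
toy <- data.table(
  id  = rep(seq_len(n / 4), each = 4),
  var = rep(c("a", "b", "c", "d"), times = n / 4),
  val = runif(n)
)

# Write the toy data to a temporary CSV so there is something to load
path <- tempfile(fileext = ".csv")
fwrite(toy, path)

# Time the load and the long-to-wide cast separately
t_read <- system.time(DT <- fread(path))
t_cast <- system.time(wide <- dcast(DT, id ~ var, value.var = "val"))

print(t_read)
print(t_cast)
unlink(path)
```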
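And the official documentation says it supports data up to 100 GB! 100 GB, my friends, just think about it: that is a tenth of a terabyte! Let's never again say that R can't hold more than a million rows, lol.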
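The secret is in "Efficient reshaping using data.tables": always reshape with the melt and dcast that ship with data.table, not the ones from the reshape2 package. I hadn't noticed that I had reshape2 loaded, and the job ran for nearly an hour before the machine froze.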
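One way to be sure data.table's versions are the ones being called is to qualify them with the package namespace. This is a minimal sketch on a made-up three-row table (the columns id, a and b are hypothetical). If reshape2 is attached after data.table, a bare melt() call may resolve to reshape2's function instead, and the explicit prefix sidesteps that.

```r
library(data.table)

# Hypothetical toy table in wide form
DT <- data.table(id = 1:3, a = c(1.5, 2.5, 3.5), b = c(10, 20, 30))

# Wide -> long, calling data.table's melt explicitly
long <- data.table::melt(
  DT,
  id.vars       = "id",
  measure.vars  = c("a", "b"),
  variable.name = "var",
  value.name    = "val"
)

# Long -> wide, calling data.table's dcast explicitly
wide <- data.table::dcast(long, id ~ var, value.var = "val")

print(long)
print(wide)
```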
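Look at this passage from the data.table wiki: you can melt (or cast) several columns at once, which in effect gives you a batch version of both reshapes (loosely analogous to multithreading)! I tested it on real data and it really is very fast, with memory and CPU usage staying at reasonable levels (on an ordinary office laptop, i5 + 8 GB).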
3. Enhanced (new) functionality
a) Enhanced melt
Since we'd like for data.tables to perform this operation straightforward and efficient using the same interface, we went ahead and implemented an additional functionality, where we can melt to multiple columns simultaneously.
- melt multiple columns simultaneously
The idea is quite simple. We pass a list of columns to measure.vars, where each element of the list contains the columns that should be combined together.
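As a rough illustration of the quoted idea, the sketch below melts two groups of columns in one call. The table is a made-up toy (family_id, dob_child*, gender_child* and all values are hypothetical), and the output column names "child", "dob" and "gender" are choices made here for readability.

```r
library(data.table)

# Hypothetical wide table: one row per family, two columns per child
DT <- data.table(
  family_id     = 1:2,
  dob_child1    = c("1998-11-26", "1996-06-22"),
  dob_child2    = c("2000-01-29", NA),
  gender_child1 = c(1L, 2L),
  gender_child2 = c(2L, NA)
)

# Each list element names the columns that are stacked into one output column
DT.m2 <- melt(
  DT,
  id.vars       = "family_id",
  measure.vars  = list(c("dob_child1", "dob_child2"),
                       c("gender_child1", "gender_child2")),
  variable.name = "child",
  value.name    = c("dob", "gender")
)

print(DT.m2)
```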
b) Enhanced dcast
Okay great! We can now melt into multiple columns simultaneously. Now given the data set DT.m2 as shown above, how can we get back to the same format as the original data we started with?
If we use the current functionality of dcast, then we'd have to cast twice and bind the results together. But that's once again verbose, not straightforward and is also inefficient.
- Casting multiple value.vars simultaneously
We can now provide multiple value.var columns to dcast for data.tables directly so that the operations are taken care of internally and efficiently.
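A minimal sketch of casting two value columns in one call follows. The long table is built by hand so the block runs on its own; its column names and values are hypothetical stand-ins for a table that was melted into "dob" and "gender" columns.

```r
library(data.table)

# Hypothetical long table: one row per (family, child)
DT.m2 <- data.table(
  family_id = c(1L, 2L, 1L, 2L),
  child     = factor(c(1, 1, 2, 2)),
  dob       = c("1998-11-26", "1996-06-22", "2000-01-29", NA),
  gender    = c(1L, 2L, 2L, NA)
)

# Passing a vector to value.var casts both columns at once;
# output columns are named <value.var>_<child level>
DT.c2 <- dcast(DT.m2, family_id ~ child, value.var = c("dob", "gender"))

print(DT.c2)
```

Doing this with single-column casts would take two dcast calls plus a join or cbind, which is the verbose route the quoted passage describes.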
PS:
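Annoyingly, on Linux and Windows all of the above works either out of the box or once Rtools is installed, but on macOS it does NOT! There you have to compile and install the package with OpenMP support (sigh).
In practice, on macOS 10.12.3 with R 3.3.3, I still ran into trouble even though I followed the official instructions on the wiki: Library not loaded: @rpath/libomp.dylib. Here is a temporary workaround.
First, the official instructions go roughly like this: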
XXX$ xcode-select --install
xcode-select: error: command line tools are already installed, use "Software Update" to install updates
brew update && brew upgrade && brew install llvm - done
Already up-to-date.
~/.R/Makevars - done
XXX$ cat Makevars
CC=/usr/local/opt/llvm/bin/clang -fopenmp
CXX=/usr/local/opt/llvm/bin/clang++
LDFLAGS=-L/usr/local/opt/gettext/lib -L/usr/local/opt/llvm/lib
CPPFLAGS=-I/usr/local/opt/gettext/include -I/usr/local/opt/llvm/include
But you will very likely hit the error shown above, in which case the following steps fix it:
# Reinstall gcc
brew reinstall gcc --without-multilib
# Edit the configuration file (~/.R/Makevars)
# GCC version:
VER=6
CC=gcc-${VER}
CXX=g++-${VER}
# -j8 here suits a 4-core CPU; change it to -j4 or -j2 to match your machine
MAKE=make -j8
# These may be unnecessary
SHLIB_OPENMP_CFLAGS=-fopenmp
SHLIB_OPENMP_CXXFLAGS=-fopenmp
SHLIB_OPENMP_FCFLAGS=-fopenmp
SHLIB_OPENMP_FFLAGS=-fopenmp
# If you already followed the official instructions, one more step is needed
brew unlink llvm
# Install the data.table package
# Finally, change the Makevars file back (otherwise installing other packages will fail, Orz)
CC=clang
CXX=clang++
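After the rebuild, a quick sanity check from R might look like the sketch below. It assumes a data.table version that provides getDTthreads() (added around 1.9.8). A result of 1 on a multi-core machine would suggest the build did not pick up OpenMP, though this is only a heuristic. The install line is commented out so the block can be run without touching the installed package.

```r
# Build from source so the compiler settings in ~/.R/Makevars are used
# install.packages("data.table", type = "source")

library(data.table)

# Number of threads data.table will use; typically 1 when built without OpenMP
n_threads <- getDTthreads()
cat("data.table is using", n_threads, "thread(s)\n")
```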