How can I speed up SQL Server reads and writes with Python 3 and pandas?

Platform and versions: Windows 10, SQL Server 2008, Python 3.4, pandas 0.16.1, SQLAlchemy 0.9.9

I often need to read data from a remote database, compute results, and write them back to the remote database, but the whole process is very slow.

I have tried patching the SQLAlchemy code for multi-row inserts (https://github.com/pydata/pandas/issues/8953), but it seems SQL Server does not support that multirow insert.

I also tried splitting the dataset into chunks and writing them with multiple threads (http://techyoubaji.blogspot.jp/2015/10/speed-up-pandas-tosql-with.html). This helps a lot for a single write, but when writing many datasets in a loop, it is no faster than a single thread; a rough sketch of what I tried is below.
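For reference, the chunked multi-threaded write looked roughly like this (a minimal sketch only; the thread count and the np.array_split splitting are illustrative choices, not the exact code from the linked post):

# Minimal sketch of the chunk-and-thread to_sql write described above.
# The thread count and the np.array_split splitting are illustrative.
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
from sqlalchemy import create_engine

engine = create_engine("mssql+pymssql://user:pws@server/db")

def write_chunk(chunk):
    # Each thread appends its own slice of the DataFrame.
    chunk.to_sql("TABLE_NAME", engine, if_exists="append", index=False)

def threaded_to_sql(df, n_threads=4):
    chunks = np.array_split(df, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(write_chunk, chunks))

# Usage: threaded_to_sql(pd.read_csv("file_name.csv"))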

Is there any other way to speed up pandas reads and writes against SQL Server?


----------------------------->>> I ran into this problem recently too; consider this a modest attempt to get the discussion going <<<---------------------------------

1. Use case:

I need to write the one-minute data for each stock into MS SQL Server; the target format is as follows:

There are 2,800+ stocks (one CSV file per stock) spanning 2 years, roughly 300 million+ rows in total.

2. Initial attempt:

Write to the database with pandas.DataFrame.to_sql, for example:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pymssql://user:pws@server/db")
data = pd.read_csv("file_name.csv")
data.to_sql("TABLE_NAME", engine, if_exists="append", chunksize=10000)

The write turns out to be extremely slow: a DataFrame of roughly 110,000 rows takes 400+ sec to write over a LAN connection. After some digging, most of the time goes into type conversion; the odo documentation describes this kind of write performance in detail:

Loading CSVs into SQL Databases

There is an unnecessary and very expensive amount of data conversion going on here. First we convert our CSV into an iterator of DataFrames, then those DataFrames are converted into Python data structures compatible with SQLAlchemy. Those Python objects then need to be serialized in a way that's compatible with the database they are being sent to. Before you know it, more time is spent converting data and serializing Python data structures than on reading data from disk.

In fact, large CSV files can be loaded into the database with BULK INSERT, but I could not test it for lack of permissions; see BULK INSERT (Transact-SQL) for details.
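For reference only (untested here, since it needs the permissions mentioned above), a BULK INSERT can also be issued straight from Python; the file path, table name and options below are just placeholders:

# Untested sketch: issue BULK INSERT from pymssql.
# Requires bulk-load permission on the server, and the CSV must be
# readable by the SQL Server service itself; path/table/options are placeholders.
import pymssql

conn = pymssql.connect(server="xxxx", user="xxxx",
                       password="xxxx", database="xxxx")
cursor = conn.cursor()
cursor.execute("""
    BULK INSERT TABLE_NAME
    FROM 'D:\\data\\file_name.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')
""")
conn.commit()
conn.close()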

3. A workaround:

Since I have no permission to use BULK INSERT, the next best thing will have to do. We know that SQL Server 2008+ supports inserting multiple rows in a single statement:

INSERT INTO TABLE_NAME VALUES (a0, b0, c0, ...), (a1, b1, c1, ...), ..., (an, bn, cn, ...)

Of course, the performance of this also depends on the number of columns in the target table; someone has benchmarked it, see:

Comparing multiple rows insert vs single row insert with three data load methods

So I read the local CSV file directly, concatenate a number of its rows (chunk_size) at a time into a string, and pass that into a SQL statement for execution. A general implementation looks like this:

# coding: utf-8

import os
import time

import pymssql
import pandas as pd


def multi_insert_table(src_file, conn, cursor, table="STOCK_SNAPSHOT_1m", chunk_size=80):
    """
    Hand-rolled optimization: write multiple rows per INSERT into the given TABLE.
    """
    with open(src_file, "r") as f:
        items = f.readlines()

    # Turn a CSV line "a,b,c" into a SQL values tuple "('a', 'b', 'c')".
    line_decorator = lambda x: "('" + x.strip("\n").replace(",", "', '") + "')"
    all_rows = [line_decorator(row) for row in items]
    sql = "INSERT INTO " + table + " VALUES "

    n, k = len(all_rows) // chunk_size, len(all_rows) % chunk_size
    for i in range(n):
        multi_rows = ", ".join(all_rows[i*chunk_size:(i+1)*chunk_size])
        cursor.execute(sql + multi_rows)
    if k:
        multi_rows = ", ".join(all_rows[-k:])
        cursor.execute(sql + multi_rows)
    conn.commit()
    print("// %s has been written to the database successfully." % src_file)
    return None


if __name__ == "__main__":
    connect_account = {"server": "xxxx",
                       "user": "xxxx",
                       "password": "xxxx",
                       "database": "xxxx"}
    conn = pymssql.connect(**connect_account)
    cursor = conn.cursor()
    os.chdir("your_work_directory")
    all_files = os.listdir()

    for file in all_files:
        if os.path.isfile(file):
            start_time = time.clock()
            multi_insert_table(file, conn, cursor, chunk_size=100)
            # The chunk_size has already been optimized.
            duration = (time.clock() - start_time) / 60
            print("--------------- Time elapsed: %.2f min ---------------\n" % duration)

    cursor.close()
    conn.close()
    print("// All files have been successfully written into the database.")
    print("// And the connection has been shut down :)\n")

Writing 100 rows at a time was chosen by manual testing; write performance peaks around this chunk_size. A CSV file of 110,000+ rows now takes about 6 sec on average to write, which basically meets my needs.
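The tuning itself was just timing one representative file at a few candidate sizes, roughly like this (a sketch that reuses multi_insert_table, conn and cursor from the script above; the candidate sizes, the file name and the test table are illustrative, and every run inserts real rows, so point it at a throwaway table):

# Sketch of the manual chunk_size tuning: time one sample file at a few
# candidate sizes. Reuses multi_insert_table / conn / cursor from above;
# "sample_stock.csv", "TEST_TABLE" and the size list are illustrative.
import time

for size in (20, 50, 80, 100, 200, 500):
    start = time.clock()
    multi_insert_table("sample_stock.csv", conn, cursor,
                       table="TEST_TABLE", chunk_size=size)
    print("chunk_size = %d: %.1f sec" % (size, time.clock() - start))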

Finally, this is just one approach to loading a large amount of data into SQL Server; there must be better implementations, and I look forward to seeing them.


python: to_csv

sql server: bulk insert (csv files)

Inserting 5 million rows with 3 columns takes about 15 seconds.
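A sketch of that workflow, assuming the staged CSV sits somewhere the SQL Server service can read and that bulk-load permission is available (paths, table name and options are illustrative):

# Sketch of the to_csv + BULK INSERT workflow described above.
# Paths, table name and options are illustrative; the staged CSV must be
# readable by the SQL Server service and bulk-load permission is required.
import pandas as pd
import pymssql

df = pd.read_csv("file_name.csv")            # or any computed DataFrame
df.to_csv("D:\\data\\staged.csv", index=False, header=False)

conn = pymssql.connect(server="xxxx", user="xxxx",
                       password="xxxx", database="xxxx")
cursor = conn.cursor()
cursor.execute("""
    BULK INSERT TABLE_NAME
    FROM 'D:\\data\\staged.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')
""")
conn.commit()
conn.close()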

