Python內存資料庫/引擎

02-04

1 初探

　　在平時的開發工作中，我們可能會有這樣的需求：我們希望有一個內存資料庫或者數據引擎，用比較Pythonic的方式進行資料庫的操作（比如說插入和查詢）。

　　舉個具體的例子，分別向資料庫db中插入兩條數據，"a=1, b=1" 和 "a=1, b=2", n然後想查詢a=1的數據可能會使用這樣的語句db.query(a=1)，結果就是返回前面插入的兩條數據；如果想查詢a=1, nb=2的數據，就使用這樣的語句db.query(a=1, b=2)，結果就返回前面的第二條數據。

　　那麼是否擁有實現上述需求的現成的第三方庫呢？幾經查找，發現PyDbLite能夠滿足這樣的需求。其實，PyDbLite和Python自帶的SQLite均支持內存資料庫模式，只是前者是Pythonic的用法，而後者則是典型的SQL用法。

他們具體的用法是這樣的：

PyDbLite

import pydbliten# 使用內存資料庫npydb = pydblite.Base(:memory:)n# 創建a,b,c三個欄位npydb.create(a, b, c)n# 為欄位a,b創建索引npydb.create_index(a, b)n# 插入一條數據npydb.insert(a=-1, b=0, c=1)n# 查詢符合特定要求的數據nresults = pydb(a=-1, b=0)n

SQLite

import sqlite3n# 使用內存資料庫ncon = sqlite3.connect(:memory:)n# 創建a,b,c三個欄位ncur = con.cursor()ncur.execute(create table test (a char(256), b char(256), c char(256));)n# 為欄位a,b創建索引ncur.execute(create index a_index on test(a))ncur.execute(create index b_index on test(b))n# 插入一條數據ncur.execute(insert into test values(?, ?, ?), (-1,0,1))n# 查詢符合特定要求的數據ncur.execute(select * from test where a=? and b=?,(-1, 0))n

2 pydblite和sqlite的性能

　　毫無疑問，pydblite的使用方式非常地Pythonic，但是它的效率如何呢？由於我們主要關心的是數據插入和查詢速度，所以不妨僅對這兩項做一個對比。寫一個簡單的測試腳本：

import timencount = 100000nndef timeit(func):n def wrapper(*args, **kws):n t = time.time()n func(*args)n print time.time() - t, kws[des]n return wrappernn@timeitndef test_insert(mdb, des=):n for i in xrange(count):n mdb.insert(a=i-1, b=i, c=i+1)nn@timeitndef test_query_object(mdb, des=):n for i in xrange(count):n c = mdb(a=i-1, b=i)nn@timeitndef test_sqlite_insert(cur, des=):n for i in xrange(count):n cur.execute(insert into test values(?, ?, ?), (i-1, i, i+1))nn@timeitndef test_sqlite_query(cur, des=):n for i in xrange(count):n cur.execute(select * from test where a=? and b=?, (i-1, i))nnprint -------pydblite--------nimport pydblitenpydb = pydblite.Base(:memory:)npydb.create(a, b, c)npydb.create_index(a, b)ntest_insert(pydb, des=insert)ntest_query_object(pydb, des=query, object call)nnnprint -------sqlite3--------nimport sqlite3ncon = sqlite3.connect(:memory:)ncur = con.cursor()ncur.execute(create table test (a char(256), b char(256), c char(256));)ncur.execute(create index a_index on test(a))ncur.execute(create index b_index on test(b))ntest_sqlite_insert(cur, des=insert)ntest_sqlite_query(cur, des=query)n

　　在創建索引的情況下，10w次的插入和查詢的時間如下：

-------pydblite--------n1.14199995995 insertn0.308000087738 query, object calln-------sqlite3--------n0.411999940872 insertn0.30999994278 queryn

　　在未創建索引的情況（把創建索引的測試語句注釋掉）下，1w次的插入和查詢時間如下：

-------pydblite--------n0.0989999771118 insertn5.15300011635 query, object calln-------sqlite3--------n0.0169999599457 insertn7.43400001526 queryn

　　我們不難得出如下結論：

　　sqlite的插入速度是pydblite的3-5倍；而在建立索引的情況下，sqlite的查詢速度和pydblite相當；在未建立索引的情況下，sqlite的查詢速度比pydblite慢1.5倍左右。

3 優化

　　我們的目標非常明確，使用Pythonic的內存資料庫，提高插入和查詢效率，而不考慮持久化。那麼能否既擁有pydblite的pythonic的使用方式，又同時具備pydblite和sqlite中插入和查詢速度快的那一方的速度？針對我們的目標，看看能否對pydblite做一些優化。

　　閱讀pydblite的源碼，首先映入眼帘的是對python2和3做了一個簡單的區分。給外部調用的Base基於_BasePy2或者_BasePy3，它們僅僅是在__iter__上有細微差異，最終調用的是_Base這個類。

class _BasePy2(_Base):nn def __iter__(self):n """Iteration on the records"""n return iter(self.records.itervalues())nnnclass _BasePy3(_Base):nn def __iter__(self):n """Iteration on the records"""n return iter(self.records.values())nnif sys.version_info[0] == 2:n Base = _BasePy2nelse:n Base = _BasePy3n

　　然後看下_Base的構造函數，做了簡單的初始化文件的操作，由於我們就是使用內存資料庫，所以文件相關的內容完全可以拋棄。

class _Base(object):nn def __init__(self, path, protocol=pickle.HIGHEST_PROTOCOL, save_to_file=True,n sqlite_compat=False):n """protocol as defined in pickle / pickle.n Defaults to the highest protocol available.n For maximum compatibility use protocol = 0nn """n self.path = pathn """The path of the database in the file system"""n self.name = os.path.splitext(os.path.basename(path))[0]n """The basename of the path, stripped of its extension"""n self.protocol = protocoln self.mode = Nonen if path == ":memory:":n save_to_file = Falsen self.save_to_file = save_to_filen self.sqlite_compat = sqlite_compatn self.fields = []n """The list of the fields (does not include the internaln fields __id__ and __version__)"""n # if base exists, get field namesn if save_to_file and self.exists():n if protocol == 0:n _in = open(self.path) # dont specify binary mode !n else:n _in = open(self.path, rb)n self.fields = pickle.load(_in)n

　　緊接著比較重要的是create（創建欄位）、create_index（創建索引）兩個函數：

def create(self, *fields, **kw):n """n Create a new base with specified field names.nn Args:n - *fields (str): The field names to create.n - mode (str): the mode used when creating the database.nn - if mode = create : create a new base (the default value)n - if mode = open : open the existing base, ignore the fieldsn - if mode = override : erase the existing base and create an new one with the specified fieldsnn Returns:n - the database (self).n """n self.mode = kw.get("mode", create)n if self.save_to_file and os.path.exists(self.path):n if not os.path.isfile(self.path):n raise IOError("%s exists and is not a file" % self.path)n elif self.mode is create:n raise IOError("Base %s already exists" % self.path)n elif self.mode == "open":n return self.open()n elif self.mode == "override":n os.remove(self.path)n else:n raise ValueError("Invalid value given for open: %s" % open)nn self.fields = []n self.default_values = {}n for field in fields:n if type(field) is dict:n self.fields.append(field["name"])n self.default_values[field["name"]] = field.get("default", None)n elif type(field) is tuple:n self.fields.append(field[0])n self.default_values[field[0]] = field[1]n else:n self.fields.append(field)n self.default_values[field] = Nonenn self.records = {}n self.next_id = 0n self.indices = {}n self.commit()n return selfnn def create_index(self, *fields):n """n Create an index on the specified field namesnn An index on a field is a mapping between the values taken by the fieldn and the sorted list of the ids of the records whose field is equal ton this valuenn For each indexed field, an attribute of self is created, an instancen of the class Index (see above). Its name it the field name, with then prefix _ to avoid name conflictsnn Args:n - fields (list): the fields to indexn """n reset = Falsen for f in fields:n if f not in self.fields:n raise NameError("%s is not a field name %s" % (f, self.fields))n # initialize the indicesn if self.mode == "open" and f in self.indices:n continuen reset = Truen self.indices[f] = {}n for _id, record in self.records.items():n # use bisect to quickly insert the id in the listn bisect.insort(self.indices[f].setdefault(record[f], []), _id)n # create a new attribute of self, used to find the recordsn # by this indexn setattr(self, _ + f, Index(self, f))n if reset:n self.commit()n

　　可以看出，pydblite在內存中維護了一個名為records的字典變數，用來存放一條條的數據。它的key是內部維護的id，從0開始自增；而它的value則是用戶插入的數據，為了後續查詢和記錄的方便，這裡在每條數據中額外又加入了__id__和__version__。其次，內部維護的indices字典變數則是是個索引表，它的key是欄位名，而value則是這樣一個字典：其key是這個欄位所有已知的值，value是這個值所在的那條數據的id。

　　舉個例子，假設我們插入了「a=-1,b=0,c=1」和「a=0,b=1,c=2」兩條數據，那麼records和indices的內容會是這樣的：

# recordsn{0: {__id__: 0, __version__: 0, a: -1, b: 0, c: 1},n 1: {__id__: 1, __version__: 0, a: 0, b: 1, c: 2}}nn# indicesn{a: {-1: [0], 0: [1]}, b: {0: [0], 1: [1]}}n

　　比方說現在我們想查找a=0的數據，那麼就會在indices中找key為a的value，即{-1: set([0]), 0: set([1])}，然後在這裡面找key為0的value，即[1]，由此我們直到了我們想要的這條數據它的id是1（也可能會有多個）；假設我們對數據還有其他要求比如a=0,b=1，那麼它會繼續上述的查找過程，找到a=0和b=1分別對應的ids，做交集，就得到了滿足這兩個條件的ids，然後再到records里根據ids找到所有對應的數據。

　　明白了原理，我們再看看有什麼可優化的地方：

　　數據結構，整體的records和indeices數據結構已經挺精簡了，暫時不需要優化。其中的__version__可以不要，因為我們並不關注這個數據被修改了幾次。其次是由於indices中最終的ids是個list，在查詢和插入的時候會比較慢，我們知道內部維護的id一定是唯一的，所以這裡改成set會好一些。

　　python語句，不難看出，整個_Base為了同時兼容python2和python3，不得不使用了2和3都支持的語句，這就導致在部分語句上針對特定版本的python就會造成浪費或者說是性能開銷。比如說，d是個字典，那麼為了同事兼容python2和3，作者使用了類似與for key in d.keys()這樣的語句，在python2中，d.keys()會首先產生一個list，用d.iterkeys是個更明智的方案。再如，作者會使用類似set(d.keys()) - set([1])這樣的語句，但是python2中，使用d.viewkeys() - set([1])效率將會更高，因為它不需要將list轉化成set。

　　對特定版本python的優化語句就不一一舉例，概括地說，從數據結構，python語句以及是否需要某些功能等方面可以對pydblite做進一步的優化。前面只是說了create和create_index兩個函數，包括insert和__call__的優化也十分類似。此外，用普通方法來代替魔法方法，也能稍微提升下效率，所以在後續的優化中將__call__改寫為了query。

　　優化後的代碼，請見MemLite。

4 memlite、pydblite和sqlite的性能

　　讓我們在上文的測試代碼中加入對memlite的測試：

@timeitndef test_query_method(mdb, des=):n for i in xrange(count):n c = mdb.query(a=i-1, b=i)nnprint -------memlite-------nimport memlitendb = memlite.Base()ndb.create(a, b, c)ndb.create_index(a, b)ntest_insert(db, des=insert)ntest_query_method(db, des=query, method call)n

在創建索引的情況下，10w次的插入和查詢的時間如下：

-------memlite-------n0.378000020981 insertn0.285000085831 query, method calln-------pydblite--------n1.3140001297 insertn0.309000015259 query, object calln-------sqlite3--------n0.414000034332 insertn0.3109998703 queryn

　　在未創建索引的情況（把創建索引的測試語句注釋掉）下，1w次的插入和查詢時間如下：

-------memlite-------n0.0179998874664 insertn5.90199995041 query, method calln-------pydblite--------n0.0980000495911 insertn4.87400007248 query, object calln-------sqlite3--------n0.0170001983643 insertn7.42399978638 queryn

　　可以看出，在創建索引的情況下，memlite的插入和查詢性能在sqlite和pydblite之上；而在未創建索引的情況下，memlite的插入性能和sqlite一樣，好於pydblite，memlite的查詢性能比pydblite稍差，但好於sqlite。綜合來看，memlite即擁有pydblite的pythonic的使用方式，又擁有pydblite和sqlite中性能較高者的效率，符合預期的優化目標。

注

　　今天剛剛開通專欄，本文在前段時間寫於自己的博客。開設專欄是打算寫一些在學習編程和Python過程中的收穫和思考，和大家分享交流，認識更多的朋友。接下來會陸續發布Python生成器、計算的本質等相關內容，敬請期待~