Some experience and tips on using string fuzzy matching to merge/join databases in Python

In economics and management research, we often need to merge data from different sources into a single dataset for further analysis. Ideally, the databases share common identifiers (ISIN, GVKEY, etc.) that make merging straightforward. However, because of data-entry problems or delayed updates, some observations lack these identifiers, and we must fall back on non-numeric identifiers such as company names or people's names. This article therefore describes how to merge databases using fuzzy matching algorithms.

  1. Preprocessing the strings

When doing fuzzy matching, meaningless parts shared across strings reduce matching accuracy. For example, in this article I match on company names, and every company name carries a suffix (e.g. Inc, Ltd, GmbH, etc.). If these suffixes are not stripped first, matching becomes unreliable: ABC International Ltd and CCD International Ltd are completely different companies, but without removing "International Ltd" the algorithm computes an inflated similarity score for the pair and produces a false match. We therefore preprocess the strings first to improve matching accuracy. Below I focus on methods for preprocessing company names in particular.

For company-name suffixes, Python has a very effective package: cleanco (on the Python Package Index). It handles the suffix problem almost completely, and the suffixes it recognizes cover virtually every country (e.g. Ltd and Inc in Commonwealth countries, GmbH in German-speaking regions, and so on). Three tips for using this package (based on my own data):

Tip 1: If your data is a DataFrame, it is better to pull the column you need into a separate Series and process that, rather than applying cleanco through DataFrame indexing. In my own tests the former is noticeably faster. In theory the two should be equivalent, so this may be specific to my data.

Tip 2: Applying the package twice works better than once (oddly enough; a third pass makes no difference beyond the second).

Tip 3: After the initial cleaning with cleanco, inspect the results manually and do any remaining case-specific cleanup with str.replace or regular expressions.
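To make the tips concrete, here is a minimal standard-library sketch of this kind of suffix trimming. The suffix list and function names are my own illustration, not cleanco's API (cleanco's built-in, multi-country suffix database is far more complete); running the stripper twice, echoing Tip 2, catches stacked suffixes such as "Co., Ltd.":

```python
import re

# Illustrative suffix list only; cleanco ships a much more complete one.
SUFFIXES = ["inc", "ltd", "llc", "gmbh", "plc", "co", "corp", "bv", "ag", "sa"]
SUFFIX_RE = re.compile(r"\b(" + "|".join(SUFFIXES) + r")\.?$")

def strip_suffix(name):
    """Lower-case a company name and strip one trailing legal suffix."""
    name = name.lower().strip().rstrip(".,")
    return SUFFIX_RE.sub("", name).strip()

def clean_name(name):
    # Two passes remove stacked suffixes, echoing Tip 2.
    return strip_suffix(strip_suffix(name))

print(clean_name("ABC International Co., Ltd."))  # -> abc international
print(clean_name("Siemens AG"))                   # -> siemens
```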

2. Different packages for fuzzy matching

(1) difflib

The algorithm difflib uses is not Levenshtein distance. As its documentation puts it: "The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name 'gestalt pattern matching'. The basic idea is to find the longest contiguous matching subsequence that contains no 'junk' elements (R-O doesn't address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that 'look right' to people."
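Since difflib ships with the standard library, this is easy to try directly; a small illustration:

```python
import difflib

# SequenceMatcher implements the Ratcliff/Obershelp-style algorithm described above.
# ratio() returns 2*M/T, where M is the number of matched characters and T is the
# total length of both strings: here M=3 ("bcd"), T=8, so 2*3/8 = 0.75.
sm = difflib.SequenceMatcher(None, "abcd", "bcde")
print(sm.ratio())  # -> 0.75

# get_close_matches is a convenient wrapper for picking candidates above a cutoff.
print(difflib.get_close_matches("appel", ["ape", "apple", "peach", "puppy"]))
# -> ['apple', 'ape']
```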

(2) fuzzywuzzy

fuzzywuzzy's algorithm computes the Levenshtein distance between strings: the smaller the distance, the closer (more similar) the two strings are.
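fuzzywuzzy wraps this in functions such as fuzz.ratio(a, b), which returns a 0-100 similarity score. As a sketch of the underlying idea, here is a plain dynamic-programming Levenshtein distance (my own minimal version, not fuzzywuzzy's code):

```python
def levenshtein(s1, s2):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn s1 into s2."""
    # prev[j] holds the distance between the current prefix of s1 and s2[:j]
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```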

(3) jellyfish

Compared with the previous two, jellyfish is more of a library covering all the main fuzzy string matching methods; see jellyfish 0.5.6 : Python Package Index for details. It includes both of the mainstream approaches to string matching: matching by string distance and matching by pronunciation.
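jellyfish exposes these as ready-made functions, e.g. jellyfish.jaro_winkler(a, b) (renamed jaro_winkler_similarity in jellyfish >= 0.8). To show what it computes, here is a self-contained sketch of Jaro similarity with the Winkler prefix bonus; it is a simplified illustration of the algorithm, not jellyfish's implementation:

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]: 0 = completely dissimilar, 1 = identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1  # characters match only within this window
    s1_used = [False] * len1
    s2_used = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not s2_used[j] and s2[j] == c:
                s1_used[i] = s2_used[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions between the two matched character sequences.
    t, k = 0, 0
    for i in range(len1):
        if s1_used[i]:
            while not s2_used[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro with a bonus for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro("martha", "marhta"), 6))          # -> 0.944444
print(round(jaro_winkler("martha", "marhta"), 6))  # -> 0.961111
```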

Below is the code I wrote for my own case. I have not used Python for long, so the code is not elegant; corrections are welcome:


```python
# -*- coding: utf-8 -*-
"""
Created on Thu Jan 4 11:49:40 2018

@author: Ye Song
"""

# This script applies fuzzy matching to do the string match.

# Description of the two datasets: df1 is cross-sectional data from Dealscan and includes
# all syndicated loan deals (borrower name, borrower parent name, year, loan amount, etc.).
# df2 is panel data from Compustat and includes financial data for companies worldwide.

# Our task is to merge these two databases on two variables: borrower parent name and year.
# Since borrower parent name is a string variable, we need fuzzy matching rather than a
# normal match.

############################# Import packages ####################
import os

import pandas as pd
import jellyfish
from cleanco import cleanco  # a library specialized in dealing with company-name suffixes

# Set a new working directory
###################################################################
path = "/Users/sou/Documents/PhD@FS/Programming/Python"
os.chdir(path)
print("Directory changed successfully:", os.getcwd())

# Read the data files
###################################################################
df1 = pd.read_csv("firm_nogvkey_UK.csv")
df2 = pd.read_csv("compustata_us.csv")
df2.rename(columns={"year": "year1"}, inplace=True)
df2.rename(columns={"fyear": "year"}, inplace=True)

# Define a function to find the closest match. It applies Jaro-Winkler, one of the most
# performant and accurate approximate string matching algorithms currently available
# [Cohen, et al.]. (In jellyfish >= 0.8 the function is named jaro_winkler_similarity.)
###################################################################
def get_closest_match(x, list_strings):
    best_match = None
    highest_jw = 0
    for current_string in list_strings:
        current_score = jellyfish.jaro_winkler(x, current_string)
        if current_score > highest_jw:
            highest_jw = current_score
            best_match = current_string
    return best_match

# Trim the suffixes of company names in the two datasets (df1 and df2)
###################################################################
df1["conm_check"] = df1["conm"]  # keep a copy to check the accuracy of the final match

# Iteratively trim each company name (not 100% efficient; extra trimming is done below)
def cleanname(series):
    for count in range(len(series)):
        x = cleanco(series.iloc[count])
        series.iloc[count] = x.clean_name()

# Trim company names from Dealscan (df1)
###################################################################
test1 = df1["conm"]
test1_clean = test1.str.lower()  # turn all company names into lower case

# Remove text inside () or []; this problem is specific to Dealscan company names
def a(test_str):
    ret = ""
    skip1c = 0
    skip2c = 0
    for i in test_str:
        if i == "[":
            skip1c += 1
        elif i == "(":
            skip2c += 1
        elif i == "]" and skip1c > 0:
            skip1c -= 1
        elif i == ")" and skip2c > 0:
            skip2c -= 1
        elif skip1c == 0 and skip2c == 0:
            ret += i
    return ret.strip()

for count in range(len(test1_clean)):
    test1_clean.iloc[count] = a(test1_clean.iloc[count])

cleanname(test1_clean)
cleanname(test1_clean)  # a second pass works better (see Tip 2)

# Extra trimming after cleanco, since the package is not 100% efficient
for count in range(len(test1_clean)):
    x = test1_clean.iloc[count].replace("international", "")
    x = x.replace("bv", "")
    x = x.replace("&", "")
    x = x.replace("gmbh", "")
    test1_clean.iloc[count] = x.strip()

# Trim company names from Compustat (df2)
###################################################################
test2 = df2["conm"]
test2_clean = test2.str.lower()
cleanname(test2_clean)
cleanname(test2_clean)

# Give the cleaned company names back to the master and using files
df1["conm"] = test1_clean
df2["conm"] = test2_clean

# Conduct fuzzy matching: pick the best match from the using file for each name
###################################################################
df1["conm"] = df1["conm"].map(lambda x: get_closest_match(x, df2["conm"]))

# Merge the two databases
result = pd.merge(df1, df2, on=["conm", "year"])  # matched observations only
result_all = pd.merge(df1, df2, how="left", on=["conm", "year"],
                      indicator="indicator")  # keep all rows of the left (master) file

# Export the dataframe to a csv file
###################################################################
result_all.to_csv("test1.csv", index=False, sep=",")
```

This code selects the best match using the Jaro distance. In practice, I found this method not especially precise. If you want higher precision (at the cost of also dropping some genuinely similar candidates), Soundex is a better choice: roughly speaking, Soundex encodes each string by its pronunciation, and two strings count as a match only when their codes are identical. The implementation is much like the code above, so I will not repeat it here. Which method to choose depends on the case at hand.
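The encoding step itself can be sketched briefly. Below is a simplified American Soundex, written only to illustrate how pronunciation-based codes behave; in practice a tested implementation such as jellyfish.soundex is preferable:

```python
def soundex(name):
    """Simplified American Soundex: first letter plus three digits."""
    name = name.upper()
    groups = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}

    def code(c):
        for letters, digit in groups.items():
            if c in letters:
                return digit
        return ""  # vowels, H, W, Y carry no code

    first = name[0]
    encoded = []
    prev = code(first)
    for c in name[1:]:
        if c in "HW":
            continue  # H/W do not separate letters with the same code
        d = code(c)
        if d and d != prev:
            encoded.append(d)
        prev = d  # vowels reset prev, so repeats across vowels are kept
    return (first + "".join(encoded) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # -> R163 R163 (same code: a match)
```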

Appendix (references):

  1. Definitions of different string distances

Levenshtein distance: is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other (Levenshtein distance)

Damerau-Levenshtein distance: A modification of Levenshtein distance, Damerau-Levenshtein distance counts transpositions (such as ifhs for fish) as a single edit. (Damerau-Levenshtein distance)

Hamming distance: Hamming distance is the measure of the number of characters that differ between two strings.(Hamming distance)

Jaro distance: Jaro distance is a string-edit distance that gives a floating point response in [0,1] where 0 represents two completely dissimilar strings and 1 represents identical strings.
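Of the distances listed above, Hamming distance is the simplest to express in code (note that it is defined only for strings of equal length):

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance is defined for equal-length strings")
    return sum(a != b for a, b in zip(s1, s2))

print(hamming("karolin", "kathrin"))  # -> 3
```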

2. Soundex and methods for comparing strings by pronunciation

Soundex: Using Fuzzy Matching to Search by Sound with Python

The topic of the next article is "China China China China......? Too many"; it will cover the WHU (Otto Beisheim School of Management) Campus for Finance New Year's Conference 18, which I attended a few days ago, and a trip to Koblenz.

Ye Song

23.Jan.2018

Frankfurt

