Kaggle Data Cleaning Challenge Day 5 - Handling Inconsistent Data Entry

04-24

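Today is day five of the Kaggle Data Cleaning Challenge, and already the last day! This time the task is to handle inconsistently spelled data. For example, the state of Connecticut might be recorded as "Connecticut", "Conn." or "Conecticutt". These all stand for the same value, but a machine treats them as different objects. Today we use a simple method to tidy up such inconsistent entries, in three parts: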

Get our environment set up
Do some preliminary text pre-processing
Use fuzzy matching to correct inconsistent data entry

1. Get our environment set up

First, import the libraries we need:

# modules we'll use
import pandas as pd
import numpy as np

# helpful modules
import fuzzywuzzy
from fuzzywuzzy import process
import chardet

# set seed for reproducibility
np.random.seed(0)

The first attempt to load PakistanSuicideAttacks Ver 11 (30-November-2017).csv raised an encoding error, so let's use the method from yesterday's post, "Kaggle Data Cleaning Challenge Day 4 - Character Encoding", to take a quick look at how the file is encoded:

# look at the first hundred thousand bytes to guess the character encoding
with open("../input/PakistanSuicideAttacks Ver 11 (30-November-2017).csv", "rb") as rawdata:
    result = chardet.detect(rawdata.read(100000))

# check what the character encoding might be
print(result)

Then read the file using the Windows-1252 encoding:

# read in our data
suicide_attacks = pd.read_csv("../input/PakistanSuicideAttacks Ver 11 (30-November-2017).csv",
                              encoding="Windows-1252")

No error this time.
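As a small illustration of why the encoding argument matters, here is a sketch using made-up sample bytes (not taken from the dataset). The helper name try_decode is hypothetical. It shows bytes written as Windows-1252 failing to decode as UTF-8 but decoding cleanly once the right encoding is named.

```python
# Hypothetical sample: "café" encoded as Windows-1252 (é becomes the single byte 0xE9).
raw = "café".encode("cp1252")


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


utf8_result = try_decode(raw, "utf-8")      # fails: a lone 0xE9 is not valid UTF-8
cp1252_result = try_decode(raw, "cp1252")   # succeeds with the matching encoding

print(utf8_result, cp1252_result)
```

The same idea applies to pd.read_csv: passing the detected encoding tells pandas how to interpret the bytes in the file.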

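2. Do some preliminary text pre-processing

Let's look at the data in the "City" column. There are more efficient approaches, but we'll start by inspecting and tidying it by hand to get a feel for the process:

# get all the unique values in the City column
cities = suicide_attacks["City"].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities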

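The output shows many inconsistently entered values, for example "ATTOCK" and "Attock", or "D.G Khan" and "D.G Khan " (with a trailing space). So the first step is to convert every letter to lower case and then remove any whitespace at the start and end of each string. Case and whitespace problems are the most common ones, so fixing these two gets you roughly 80% of the way there.

# convert to lower case
suicide_attacks["City"] = suicide_attacks["City"].str.lower()

# remove leading and trailing white space
suicide_attacks["City"] = suicide_attacks["City"].str.strip()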
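To see the effect of these two steps in isolation, here is a minimal sketch on a toy DataFrame. The four values mirror the examples mentioned above; the frame itself is an assumption for illustration and not the real dataset.

```python
import pandas as pd

# Toy stand-in for the City column (assumed values, not the real data).
df = pd.DataFrame({"City": ["ATTOCK", "Attock", "D.G Khan", "D.G Khan "]})

before = df["City"].nunique()

# Lower-case everything, then strip whitespace from both ends.
df["City"] = df["City"].str.lower().str.strip()

after = df["City"].nunique()
print(before, "->", after)
```

Four apparently different values collapse into two once case and surrounding whitespace no longer count as differences.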

3. Use fuzzy matching to correct inconsistent data entry

Let's keep looking at the City column to see whether anything else needs fixing:

# get all the unique values in the City column
cities = suicide_attacks["City"].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities

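The output still shows some problems, such as "d. i khan" and "d.i khan". These two should be the same, but "d.g khan" is a different city and must not be mixed up with them.

We'll try fuzzywuzzy to identify strings that resemble each other. This dataset is small enough to fix by hand, but a large dataset may contain thousands of mismatched entries, so we need an automated approach. So what is fuzzy matching?

Fuzzy matching is the automated process of finding strings in text that are similar to a target string. In general, the fewer characters you have to change to turn one string into another, the closer the two are considered to be. For example, "apple" and "snapple" are two changes apart (add "s" and "n"). We can't rely on fuzzy matching 100%, but it can at least save a lot of time.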
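One classic way to count "characters that need to change" is the Levenshtein edit distance. The sketch below is a plain-Python illustration of that idea; the function name edit_distance is hypothetical, and this is not necessarily how fuzzywuzzy computes its score internally.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b (Levenshtein distance)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("apple", "snapple"))
print(edit_distance("d.i khan", "d.g khan"))
```

A smaller distance means the strings are closer; libraries typically rescale a measure like this into a similarity score.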

Given two strings, fuzzywuzzy returns a ratio. The closer the strings are, the closer the ratio is to 100. Now let's get the 10 strings in the city list that are closest to "d.i khan":

# get the top 10 closest matches to "d.i khan"
matches = fuzzywuzzy.process.extract("d.i khan", cities, limit=10,
                                     scorer=fuzzywuzzy.fuzz.token_sort_ratio)

# take a look at them
matches

Among the strings closest to the target, the top two are "d. i khan" and "d.i khan", both with a ratio of 100. The other city, "d.g khan", has a ratio of 88 and must not be replaced, so we will replace every entry with a ratio of 90 or higher with "d.i khan".
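For some intuition about why "d. i khan" scores 100 while "d.g khan" scores lower, here is a rough stand-in for a token-sort comparison built on the standard library's difflib. The function rough_token_sort_score is hypothetical and only approximates the idea (normalize, sort tokens, compare); it should not be read as fuzzywuzzy's actual implementation, and its numbers may differ from fuzzywuzzy's.

```python
import re
from difflib import SequenceMatcher


def rough_token_sort_score(a: str, b: str) -> int:
    """Approximate 0-100 similarity after lower-casing, dropping
    punctuation, and sorting the tokens of each string."""
    def normalize(text: str) -> str:
        # Replace runs of non-alphanumeric characters with a space, then sort tokens.
        tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
        return " ".join(sorted(tokens))

    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return round(100 * ratio)


for candidate in ["d. i khan", "d.i khan", "d.g khan"]:
    print(candidate, rough_token_sort_score("d.i khan", candidate))
```

Because punctuation and spacing are normalized away, the two "d.i khan" variants become identical, while "d.g khan" still differs by one letter.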

To do this, let's write a function so that we can reuse it:

# function to replace rows in the provided column of the provided dataframe
# that match the provided string at or above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio=90):
    # get a list of unique strings
    strings = df[column].unique()

    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings, limit=10,
                                         scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # only keep matches with a ratio >= min_ratio.
    # each match is a tuple: match[0] is the city name and match[1] is
    # how closely that city matches string_to_match.
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)

    # replace all rows with close matches with the input string
    df.loc[rows_with_matches, column] = string_to_match

    # let us know the function's done
    print("All done!")
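The filtering and replacement steps inside the function can be tried in isolation. This sketch uses a toy DataFrame (an assumption for illustration) and hard-codes the scores reported above for "d.i khan" in place of a real fuzzywuzzy call, so it runs without that library.

```python
import pandas as pd

# Toy data (assumed); the scores are the ones reported above for "d.i khan".
df = pd.DataFrame({"City": ["d. i khan", "d.i khan", "d.g khan", "d.i khan"]})
matches = [("d. i khan", 100), ("d.i khan", 100), ("d.g khan", 88)]
min_ratio = 90

# Keep only the candidate strings that score at or above the threshold.
close_matches = [name for name, score in matches if score >= min_ratio]

# Boolean mask of rows holding one of those strings, then overwrite them in place.
rows_with_matches = df["City"].isin(close_matches)
df.loc[rows_with_matches, "City"] = "d.i khan"

cities_after = sorted(df["City"].unique())
print(cities_after)
```

With a threshold of 90, "d.g khan" (score 88) is left untouched, which is exactly why the cutoff has to be chosen by looking at the actual scores first.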

Now use this function to replace the entries that are close to "d.i khan":

# use the function we just wrote to replace close matches to "d.i khan" with "d.i khan"
replace_matches_in_column(df=suicide_attacks, column="City", string_to_match="d.i khan")

Let's look at all the "City" values again:

# get all the unique values in the City column
cities = suicide_attacks["City"].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities

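Looks like there are no problems left.

That's it for the last day! Working through all five days was quite rewarding, and I hope Kaggle runs more events like this in the future.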