An English Search Engine in Python

02-09

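Suppose C:\Record contains a number of .txt files, all plain English documents. Using these documents as the content, implement a local search engine that lists the relevant results when the user enters a query. You may decide how powerful the search engine should be, and you should provide documentation for it. (NLTK may be worth considering.)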

Documentation:

Main steps

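1:

How do you design a search engine? The simplest approach is to search the document list directly with a basic pattern-matching algorithm such as KMP. In Python, of course, this takes a single line.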
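The one-line version mentioned above could look like the following minimal sketch. It relies on Python's built-in `in` operator for substring matching instead of a hand-written KMP; the function name `substring_search` and the sample file names are assumptions made for illustration.

```python
def substring_search(key, names):
    """Return the names that contain key as a case-insensitive substring."""
    key = key.lower()
    # The list comprehension is the "one line" that does the actual search.
    return [name for name in names if key in name.lower()]


# Hypothetical file names, used only for this demonstration.
files = ["learning python.txt", "python learning.txt", "notes.txt"]
print(substring_search("python", files))
print(substring_search("learning python", files))
```

The second query only finds the file whose name contains the exact phrase, which is the limitation the later steps work around.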

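2:

Next, it occurred to me that regular expressions could be used for the pattern matching, which makes the matching more accurate. So I wrote a pattern-matching function of about 10 lines, fuzzy_finder_by_interval(key, name_list), that performs a fuzzy search and ranks the results by how well they match the user's input, falling back to lexicographic order.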
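To illustrate the ranking, here is a small self-contained sketch of the same idea. The name `fuzzy_find` and the sample file names are hypothetical; `re.escape` is added on the assumption that keys may contain regex metacharacters. Shorter matches rank first, then earlier matches, then lexicographic order.

```python
import re


def fuzzy_find(key, names):
    """Rank names that contain the characters of key in order."""
    # "pyth" becomes "p.*?y.*?t.*?h": each character may be separated
    # from the next by any (lazily matched) run of other characters.
    pattern = ".*?".join(re.escape(ch) for ch in key.strip())
    regex = re.compile(pattern)
    hits = []
    for name in names:
        match = regex.search(name)
        if match:
            # Sort key: length of the matched span, start position, name.
            hits.append((len(match.group()), match.start(), name))
    return [name for _, _, name in sorted(hits)]


# Hypothetical file names, used only for this demonstration.
names = ["p_y_t_h_o_n.txt", "my_python_notes.txt", "python.txt", "java.txt"]
print(fuzzy_find("pyth", names))
```

Note that a lazy regex search returns the leftmost match, which is not always the shortest possible one, so the ranking is a heuristic.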

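3:

I thought that would be the end of it, but during testing I found that a regular-expression search for learning python and one for python learning return different results. So I used the nltk.tokenize module to split the input into words (the advantage only shows when the file names being searched are fairly long). This way the results are the same no matter which word comes first.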
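The order problem, and the effect of searching token by token, can be reproduced with the sketch below. It uses `str.split` as a stand-in for the `nltk.tokenize` tokenizer so that it runs without NLTK installed; the helper names and file names are assumptions for illustration.

```python
import re


def fuzzy_match(key, name):
    """True if the characters of key appear in name in the same order."""
    pattern = ".*?".join(re.escape(ch) for ch in key)
    return re.search(pattern, name) is not None


def search_by_tokens(query, names):
    """Search each word of the query separately and merge the results."""
    results = []
    for token in query.split():  # stand-in for an NLTK tokenizer
        for name in names:
            if fuzzy_match(token, name) and name not in results:
                results.append(name)
    return results


# Hypothetical file names, used only for this demonstration.
names = ["learning python.txt", "python learning.txt", "cooking.txt"]

# Matching the whole query at once depends on word order.
whole_a = [n for n in names if fuzzy_match("learning python", n)]
whole_b = [n for n in names if fuzzy_match("python learning", n)]
print(whole_a, whole_b)

# Matching word by word does not.
print(search_by_tokens("learning python", names))
print(search_by_tokens("python learning", names))
```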

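4:

When searching, we are sometimes too lazy to type the space and simply enter learningpython. In that case neither the basic NLTK functions nor the regular expression gives the correct result, so I decided to segment the input into words. This requires an external dictionary file: I imported common words and computer-science vocabulary into a txt file, as shown in the figure (I am not sure whether NLTK offers something similar; I will look into it over the winter break, and for now I just implement what I have in mind). With it, any continuous string can be split into natural-language words. The algorithm used here is forward maximum matching.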
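Forward maximum matching can be sketched as follows: at each position, try the longest candidate substring first and shrink it until a dictionary word is found. The function name `forward_max_match`, the tiny vocabulary, and the default `max_len=10` are assumptions chosen for illustration; a real run would load the vocabulary from the dictionary file.

```python
def forward_max_match(text, vocabulary, max_len=10):
    """Split text into dictionary words, always preferring the longest one."""
    words = []
    head = 0
    while head < len(text):
        # Longest candidate first, never running past the end of the text.
        for size in range(min(max_len, len(text) - head), 0, -1):
            piece = text[head:head + size]
            if piece in vocabulary:
                words.append(piece)
                head += size
                break
        else:
            # No dictionary word starts here, so skip one character.
            head += 1
    return words


# A tiny hypothetical vocabulary, used only for this demonstration.
vocab = {"i", "love", "you", "learn", "learning", "python"}
print(forward_max_match("learningpython", vocab))  # ['learning', 'python']
print(forward_max_match("iloveyou", vocab))        # ['i', 'love', 'you']
```

Storing the vocabulary in a set makes each lookup a constant-time operation, so the cost per position is bounded by `max_len` rather than by the size of the dictionary.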

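5:

With that, a working search engine is complete. Finally, to make the results more precise, I studied the basic principle of a spelling checker from the blog post http://blog.csdn.net/sky_money/article/details/7957996 and used the naive Bayes algorithm to train on the common words in the dictionary txt file, so that spelling mistakes are corrected (the user can, of course, turn automatic correction off). For example, a mistyped query such as learning pkthon is automatically changed to learning python, which makes the search more precise.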
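The correction step can be illustrated with a reduced sketch that only considers candidates one edit away from the input and picks the most frequent known word. The training text, the name `counts`, and the single-edit limit are assumptions made to keep the example short.

```python
import collections
import re

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return word if known, else the most frequent known word one edit away."""
    if word in counts:
        return word
    candidates = sorted(w for w in edits1(word) if w in counts)
    if not candidates:
        return word  # nothing better is known, so leave the input alone
    return max(candidates, key=counts.get)


# A tiny hypothetical training text standing in for the dictionary file.
corpus = "python python python learning learning search engine"
counts = collections.Counter(re.findall(r"[a-z]+", corpus.lower()))

print(correct("pkthon", counts))    # python
print(correct("learning", counts))  # learning
```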

Other:

os.walk() is used to traverse the files. I also wrote a function of my own to generate a large number of unrelated txt files.
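The generator function is not shown in the post, so the following is only a guess at what such a helper might look like, together with an os.walk traversal that collects paths relative to the search root. The names `make_dummy_files` and `list_relative_files` and the `noise_N.txt` naming scheme are hypothetical; a temporary directory is used so that nothing is written to a real folder.

```python
import os
import tempfile


def make_dummy_files(folder, count):
    """Create count empty .txt files named noise_0.txt, noise_1.txt, ..."""
    for i in range(count):
        with open(os.path.join(folder, f"noise_{i}.txt"), "w") as handle:
            handle.write("")


def list_relative_files(root):
    """Walk root recursively and return file paths relative to it, sorted."""
    found = []
    for current, _dirs, files in os.walk(root):
        for name in files:
            full_path = os.path.join(current, name)
            found.append(os.path.relpath(full_path, root))
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    sub = os.path.join(tmp, "sub")
    os.mkdir(sub)
    make_dummy_files(tmp, 2)
    make_dummy_files(sub, 1)
    listing = list_relative_files(tmp)
    print(listing)
```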

import os
import re
import collections

from nltk.tokenize import TreebankWordTokenizer

all_file = []    # file paths relative to file_path
word_list = []   # dictionary words used for segmentation
get_list = []    # words produced by divide_str
outcome = []     # final search results, without duplicates
alphabet = 'abcdefghijklmnopqrstuvwxyz'
file_path = '**********'    # folder to search
dictionary = '**********'   # dictionary .txt file, one word per line
NWORDS = {}


def visit_dir(path):
    """Read the files: collect every file under path into all_file."""
    if not os.path.isdir(path):
        print("ERROR")
        return
    list_dirs = os.walk(path)
    for root, dirs, files in list_dirs:
        for f in files:
            # Store the path relative to the search root.
            all_file.append(os.path.join(root, f)[len(path)+1:])


def load_dic():
    """Read the dictionary: word counts into NWORDS, words into word_list."""
    global NWORDS
    with open(dictionary, 'r') as f:
        text = f.read()
    NWORDS = train(words(text))
    for line in text.splitlines():
        if line:
            word_list.append(line)


def divide_str(note, wordlist):
    """Split a continuous string, e.g. iloveyou into i love you."""
    i = 10      # longest candidate word length
    head = 0
    flag = 0
    while head <= len(note) - 1:
        if head >= (len(note)-i):
            i = len(note)-head
        # Try the longest candidate first, then shrink it.
        for p in range(i):
            rear = head + i - p
            flag = 0
            for each in wordlist:
                if note[head:rear] == each:
                    get_list.append(each)
                    head = head + len(each)
                    flag = 1
                    break
            if flag == 1:
                break
        if flag == 0:
            # No dictionary word starts here; skip one character.
            head = head + 1


def fuzzy_finder_by_interval(key, name_list):
    """Fuzzy search ranked by match quality, then lexicographic order."""
    key = key.strip()
    findings = []
    pattern = '.*?'.join(map(re.escape, key))
    regex = re.compile(pattern)
    for item in name_list:
        match = regex.search(item)
        if match:
            findings.append((len(match.group()), match.start(), item))
    return [x for _, _, x in sorted(findings)]


def words(text):
    return re.findall(r'[a-z]+', text.lower())


def train(features):
    # Every word starts at 1 so that unseen words are not impossible.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model


def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)


def known(words):
    return set(w for w in words if w in NWORDS)


def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)


load_dic()
visit_dir(file_path)
tokenizer = TreebankWordTokenizer()
my_input = input("Enter a file name: ")
if ' ' not in my_input:
    # No space in the input: segment it with the dictionary.
    divide_str(my_input, word_list)
    my_input = get_list
else:
    my_input = tokenizer.tokenize(my_input)
    print(my_input)
    # Correct the spelling of each word.
    for i in range(len(my_input)):
        my_input[i] = correct(my_input[i])
print(my_input)
# Search for each word separately and merge the results without duplicates.
for i in range(len(my_input)):
    temp = fuzzy_finder_by_interval(my_input[i], all_file)
    for j in range(len(temp)):
        if temp[j] not in outcome:
            outcome.append(temp[j])
print(outcome)

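Test runs and screenshots:

Some generated test txt files whose names contain learning python

① Searching for pythonlearning (no space)

② Searching for learning python (normal search)

③ Searching for learning ptdthons (misspelled input)

In all three cases, the five files above are found!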