A Python Answering Assistant for 頭腦王者: From OCR Text Recognition to Fiddler Packet Capture

02-07

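Computer science is not my major, so I can only teach myself Python in my spare time. Since last summer I have written Python programs of all sizes. Most of them are fairly crude, but every time I write code I do feel the fun of programming.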

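Recently WeChat mini-games such as 跳一跳 and 頭腦王者 have taken over many people's Moments feeds. After trying the 跳一跳 helper program written by an expert on Zhihu, I got the idea of building an answering assistant for 頭腦王者. At first I hoped to make it answer fully automatically, following the adb + Python approach of the 跳一跳 helper. The tutorials I found online were mostly simple: no complete code, just an outline of the idea. So I decided to write it myself.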

1. OCR text recognition

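The first thing I ran into was OCR. After searching Baidu and Google, I learned that recognizing text in images with Python requires two libraries, pytesseract and PIL, plus the recognition engine tesseract-ocr. The two libraries can be installed from the command line, and tesseract can be downloaded from GitHub; during installation, remember to select the Simplified Chinese language pack. After installation, one setting has to be changed before it works properly. Find your Python installation path, open Python\Lib\site-packages\pytesseract\pytesseract.py, and make the following change: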

    # tesseract_cmd = 'tesseract'
    tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract.exe'

Once that works, let's try it on a question:

    from PIL import Image
    import pytesseract

    # Recognize Simplified Chinese text in the screenshot
    question = pytesseract.image_to_string(Image.open('test.jpeg'), lang='chi_sim')
    question = question.replace(' ', '')  # strip spaces
    print(question)

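OK. There are a few wrong characters, but at least the text was recognized. Excited that the question had been recognized, I went on to write the rest of the code. But when I later tested with different questions, I found that the recognition rate was really low: apart from the first question, about the Terracotta Army, every other question came out as gibberish. I had to search Google for ways to improve the recognition rate; the advice online is that black-and-white and high-resolution images are recognized somewhat better. So I added some image-processing code, all of it using the PIL library.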

Changing the image mode:

    from PIL import Image

    # img is an already opened Image; mode '1' is 1-bit black and white
    img = img.convert('1')

Cropping the image (cutting the question area out of the phone screenshot):

    from PIL import Image

    p = Image.open(picname)
    p_size = p.size  # image size as (width, height)
    # Crop the question area; the last two values must be larger than the first two
    t = p.crop((0, int(p_size[1] * 0.25), p_size[0], int(p_size[1] * 0.45)))
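The crop box is given as (left, upper, right, lower) in pixels. A minimal sketch of computing it from the screenshot size, assuming the same 25% and 45% fractions as above; the helper name `question_box` and the 1080x1920 example size are made up for illustration:

```python
def question_box(width, height, top=0.25, bottom=0.45):
    """Return a (left, upper, right, lower) crop box covering the question area."""
    if not 0 <= top < bottom <= 1:
        raise ValueError("expected 0 <= top < bottom <= 1")
    # Round to whole pixels so the box contains only integers
    return (0, round(height * top), width, round(height * bottom))


print(question_box(1080, 1920))
```

The resulting tuple can be passed straight to `Image.crop`.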

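But after these changes the recognition rate did not improve noticeably; most images still came out as gibberish. After a pause (mainly because there was a lot to do at the end of term), it occurred to me to change the background color and the text color. After many trials I found that black text on a yellow background gave the highest recognition rate; once the colors were changed, most questions could be recognized.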

Changing the image colors:

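    from PIL import Image

    t2 = t1.convert('RGB')  # t1 is the cropped question image; convert to RGB mode
    for i in range(0, t2.size[0]):
        for j in range(0, t2.size[1]):
            r, g, b = t2.getpixel((i, j))
            # Each channel is compared separately; (r, g < 100) would be a tuple and always true
            if b > r and b > g and r < 100 and g < 100 and b < 210:
                r, g, b = 255, 255, 154  # blue background -> yellow
            elif r >= 180 and g >= 180 and b >= 180:
                r, g, b = 0, 0, 0  # white text -> black
            t2.putpixel((i, j), (r, g, b))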
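A condition written as `(r, g < 100)` is a two-element tuple in Python, and a non-empty tuple is always truthy, so it does not test both channels. A minimal sketch of the per-pixel rule as a pure function with each channel compared explicitly; the name `recolor` and the choice to leave all other pixels unchanged are assumptions:

```python
def recolor(r, g, b):
    """Map one RGB pixel: bluish background -> pale yellow, near-white text -> black."""
    if b > r and b > g and r < 100 and g < 100 and b < 210:
        return (255, 255, 154)
    if r >= 180 and g >= 180 and b >= 180:
        return (0, 0, 0)
    return (r, g, b)  # anything else is left as it is


# The pitfall: this tuple is truthy even though 250 < 100 is False
print(bool((250, 250 < 100)))

print(recolor(30, 60, 200))    # a bluish background pixel
print(recolor(255, 255, 255))  # a white text pixel
print(recolor(120, 120, 120))  # a mid-gray pixel
```

Keeping the rule in a small function like this makes it easy to try different thresholds without touching the image loop.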

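The rough idea of the code: use ADB commands to capture a screenshot of 頭腦王者 in real time, process the image, recognize the question and the four options, search for the question on Baidu Zhidao, scrape the answers with a crawler, and pick the best option according to how often each of the four options appears in the answers.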
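The search and scoring steps can be sketched without any network access. The function names below are hypothetical and the sample answers are invented; the URL parameters are the ones used in this post. Unlike the full program, this sketch divides by the number of answers actually collected instead of a fixed 30:

```python
from urllib import parse


def build_search_urls(question, pages=(0, 10, 20)):
    """Build Baidu Zhidao search URLs for the first result pages (GBK-encoded query)."""
    word = parse.quote(question.encode("gbk"))
    base = "https://zhidao.baidu.com/search?word="
    return [base + word + "&ie=gbk&site=-1&sites=0&date=0&pn=" + str(p) for p in pages]


def score_options(options, answers):
    """Return ({label: share of answers that mention the option}, best label)."""
    total = len(answers) or 1  # avoid dividing by zero when nothing was scraped
    scores = {}
    for label, option in zip("ABCD", options):
        hits = sum(1 for text in answers if option and option in text)
        scores[label] = hits / total
    best = max(scores, key=scores.get)
    return scores, best


# Invented sample data, standing in for scraped answer texts
answers = ["The capital is Paris", "Paris of course", "Maybe London"]
scores, best = score_options(["Paris", "London", "Rome", "Berlin"], answers)
print(scores, best)
print(build_search_urls("中")[0])
```

Counting substring matches is a crude signal: an option that never appears verbatim in any answer scores zero even if it is correct.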

Full code:

    from PIL import Image
    import pytesseract
    import requests
    from bs4 import BeautifulSoup as BS
    from urllib import parse
    import datetime
    import os


    def open_pic(picname):
        p = Image.open(picname)
        p_size = p.size  # image size as (width, height)
        # Crop the question area; the last two values must be larger than the first two
        t = p.crop((0, int(p_size[1] * 0.25), p_size[0], int(p_size[1] * 0.45)))
        t.save('./first_change.png')
        t_size = t.size  # size of the cropped image
        return t_size, p, t


    def get_question(picsize, firstpic):
        new_x = 0
        new_y = 0
        t = firstpic
        # Find the first pixel where the background turns into text
        for i in range(0, picsize[0]):
            last_pixel = t.getpixel((i, 0))[2]
            for j in range(0, picsize[1]):
                now_pixel = t.getpixel((i, j))[2]
                if last_pixel < 190 and now_pixel > 200:
                    new_x = i - 50
                    new_y = j - 150
                    break
            if new_x:
                break
        # Turn the background yellow and the text black
        t1 = t.crop((new_x, new_y, new_x + 894, new_y + 280))
        t2 = t1.convert('RGB')  # convert to RGB mode
        for i in range(0, t2.size[0]):
            for j in range(0, t2.size[1]):
                r, g, b = t2.getpixel((i, j))
                if b > r and b > g and r < 100 and g < 100 and b < 210:
                    r, g, b = 255, 255, 154  # blue background -> yellow
                elif r >= 180 and g >= 180 and b >= 180:
                    r, g, b = 0, 0, 0  # white text -> black
                t2.putpixel((i, j), (r, g, b))
        t2.save('./second_change.png')
        question = pytesseract.image_to_string(Image.open('second_change.png'), lang='chi_sim')  # OCR the question
        question = question.replace(' ', '')   # strip spaces
        question = question.replace('\n', '')  # strip newlines
        print(question)
        return question


    def get_choice(oldpic):
        p = oldpic
        p_size = p.size
        # Crop the options area; the last two values must be larger than the first two
        c = p.crop((250, int(p_size[1] * 11 / 20), 850, int(p_size[1] * 8 / 9)))
        c1 = c.crop((0, 0, 600, int(691 / 6)))
        c2 = c.crop((0, 160, 600, 300))
        c3 = c.crop((0, 360, 600, 500))
        c4 = c.crop((0, 550, 600, 691))
        choices = []
        for h in [c1, c2, c3, c4]:
            for i in range(0, h.size[0]):
                for j in range(0, h.size[1]):
                    r, g, b = h.getpixel((i, j))
                    if b > r and b > g and r < 100 and g < 100 and b < 220:
                        r, g, b = 0, 0, 0  # blue text -> black
                    elif r >= 160 and g >= 160 and b >= 160:
                        r, g, b = 255, 255, 154  # white background -> yellow
                    h.putpixel((i, j), (r, g, b))
            h.save('./ana_choice.png')
            choice = pytesseract.image_to_string(Image.open('ana_choice.png'), lang='chi_sim')  # OCR the option
            choice = choice.replace(' ', '').replace('\n', '')  # strip spaces and newlines
            # Work around the capital letter O being recognized as the digit 0
            if '0' in choice:
                choice = choice.replace('0', 'O')
            print(choice)
            choices.append(choice)
        return choices


    def search_answer(question, choices):
        answer = []
        word = parse.quote(question.encode('gbk'))  # GBK-encode the query
        for p in [0, 10, 20]:  # first three result pages
            url = 'https://zhidao.baidu.com/search?word=' + word + '&ie=gbk&site=-1&sites=0&date=0&pn=' + str(p)
            r = requests.get(url)
            r.encoding = 'gbk'  # decode the page as GBK
            soup = BS(r.text, 'html.parser')
            want = soup.find('div', id='wgt-list')
            wants = want.find_all('dl', class_='dl')
            for i in wants:
                ans = i.find('dd', class_='dd answer').text
                answer.append(ans)
        choiceset = {'A': choices[0], 'B': choices[1], 'C': choices[2], 'D': choices[3]}
        for i in choiceset:
            # Number of scraped answers that mention this option (30 = 3 pages x 10 results)
            a = sum(1 for j in answer if choiceset[i] in j)
            print('Probability of ' + i + ': ' + '%.2f' % (a * 100 / 30) + '%')


    def main(filename):
        picsize, oldpic, firstpic = open_pic(filename)  # open the screenshot once
        question = get_question(picsize, firstpic)
        choices = get_choice(oldpic)
        search_answer(question, choices)


    if __name__ == '__main__':
        start = datetime.datetime.now()
        your = input('Press y when ready: ')
        if your == 'y':
            os.system('adb shell screencap -p /sdcard/auto.png')
            os.system('adb pull /sdcard/auto.png')
            img = Image.open('auto.png')
            img = img.convert('RGB')  # convert() returns a new image, so keep the result
            img.save('auto.png')
            main('auto.png')
        end = datetime.datetime.now()
        print('Total time: ' + str((end - start).seconds) + ' s')

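I tried running it and found that it took far too long, probably because image recognition is so time-consuming: by the time I had finished all 5 questions, the first one had only just been analyzed. I had put a lot of effort into it, but with this result it is of no practical use, which is disheartening.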
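The guess that recognition dominates the running time can be checked by timing each stage separately. A minimal sketch using only the standard library; the `timed` helper is hypothetical, and the two stage bodies are placeholders rather than the real recoloring and OCR functions:

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Placeholder workloads; in the real program these would wrap the actual stages
with timed("recolor"):
    sum(i * i for i in range(100_000))
with timed("ocr"):
    time.sleep(0.01)

# Print the slowest stage first
for label, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {seconds:.3f}s")
```

Wrapping the screenshot, recoloring, OCR, and search steps this way would show whether OCR or the pixel-by-pixel loops take most of the time.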

2. Fiddler packet capture

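Just as I was about to give up on the program, I came across Fiddler, a packet capture tool. I had heard of it while learning web scraping but never looked into it seriously. It fits this case well: use Fiddler to capture the data 頭腦王者 transmits in real time, save it, and hand it to Python for analysis. The rest becomes much simpler.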

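There are many tutorials online on capturing phone traffic with Fiddler; the key point is saving the transmitted data automatically. When using Fiddler, it is best to set the filter to show only URLs containing 'quiz'; otherwise a lot of unrelated traffic shows up.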

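With that set up, play one game: five new entries appear in Fiddler, containing the information for every question. After all the hard work on image recognition, the questions and options turned out to be this easy to get.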
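Judging by the keys that the Python script later in this post reads (`data`, `num`, `quiz`, `options`), a captured response body has roughly the shape below. This is a sketch under that assumption: the sample values are invented, the real payload likely carries more fields, and `parse_question` is a hypothetical helper name:

```python
import json

# Invented sample; only the key names are taken from the script in this post
sample_body = json.dumps({
    "data": {
        "num": 1,
        "quiz": "Which planet is known as the Red Planet?",
        "options": ["Venus", "Mars", "Jupiter", "Saturn"],
    }
})


def parse_question(body):
    """Return (num, quiz, options), or None if the body carries no question."""
    try:
        data = json.loads(body).get("data")
    except (ValueError, TypeError, AttributeError):
        return None  # not JSON, or not a JSON object
    if not isinstance(data, dict) or "quiz" not in data or "options" not in data:
        return None
    return data.get("num"), data["quiz"], data["options"]


print(parse_question(sample_body))
print(parse_question("not json"))
```

Returning `None` for anything that is not a question makes it easy for the caller to skip other responses from the same host.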

Next comes the most important part, saving the JSON data automatically. In Fiddler, go to 'FiddlerScript' -> 'OnBeforeResponse' and modify the code:

Add this code to what is already there:

    if (oSession.host == "question.hortor.net") {
        // Decode the HTTP response in case it is gzip-compressed
        oSession.utilDecodeResponse();
        // Save the response to a file (the second argument is Fiddler's headers-only flag)
        oSession.SaveResponse("C:\\Users\\XXXX\\Desktop\\data\\response.txt", true);
        // Save just the body
        oSession.SaveResponseBody("C:\\Users\\XXXX\\Desktop\\data\\responsebody.txt");
    }

With the data file in place, the rest is up to Python. Here is the code:

    import json
    import time
    from urllib import parse
    import requests
    from bs4 import BeautifulSoup as BS


    def get_appinf(filename):
        """Return (quiz, options) from the saved response body, or None."""
        try:
            with open(filename, 'r', encoding='utf-8') as f:
                j = json.loads(f.read())
            # Check that the data file contains a question and its options
            if 'quiz' in j['data'] and 'options' in j['data']:
                num = j['data']['num']
                quiz = j['data']['quiz']
                print(('Question ' + str(num) + ': ' + quiz).center(50, '*') + '\n')
                cho = j['data']['options']
                return quiz, cho
        except Exception:
            pass  # missing file, invalid JSON, or unexpected structure
        return None


    def search(question, choice):
        pagenum = [0, 10, 20]
        answer = []
        q = parse.quote(question.encode('gbk'))  # GBK-encode the query
        requests.packages.urllib3.disable_warnings()  # silence certificate warnings
        for page in pagenum:
            url = 'https://zhidao.baidu.com/search?word=' + q + '&ie=gbk&site=-1&sites=0&date=0&pn=' + str(page)
            r = requests.get(url, verify=False)  # skip certificate verification
            r.encoding = 'gbk'  # decode the page as GBK
            soup = BS(r.text, 'html.parser')
            want = soup.find('div', id='wgt-list')
            wants = want.find_all('dl', class_='dl')
            for item in wants:
                ans = item.find('dd', class_='dd answer').text
                answer.append(ans)
        choiceset = {'A': choice[0], 'B': choice[1], 'C': choice[2], 'D': choice[3]}
        # Count how often each of the four options appears in the scraped Baidu answers
        results = {}
        for i in choiceset:
            account = [j for j in answer if choiceset[i] in j]
            result = len(account) / 30
            results[i] = result
            line = ('Probability of ' + i + ': %.2f%%' % (result * 100)).center(50)
            print(line + '\n' if i == 'D' else line)
        # Pick the key with the largest value
        bestchoice = max(results.items(), key=lambda x: x[1])[0]
        print(('Best choice: ' + bestchoice).center(50, '-') + '\n\n\n')


    def main():
        try:
            # Change this to your own save location
            result = get_appinf('C:/Users/XXXX/Desktop/data/responsebody.txt')
            if result:
                que, cho = result
                search(que, cho)
        except Exception:
            pass


    if __name__ == '__main__':
        while True:
            main()
            time.sleep(2)
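The loop above re-reads the file and repeats the search every two seconds, even when no new question has arrived. One option, sketched here as an assumption rather than part of the original program, is to act only when the file's modification time changes. The `watch` function and its `max_polls` parameter (there so the demo terminates) are hypothetical:

```python
import os
import tempfile
import time


def watch(path, handle, interval=2.0, max_polls=None):
    """Call handle(path) each time the file's modification time changes."""
    last_mtime = None
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        try:
            mtime = os.path.getmtime(path)
        except OSError:
            mtime = None  # the file has not been written yet
        if mtime is not None and mtime != last_mtime:
            last_mtime = mtime
            handle(path)
        time.sleep(interval)


# Demo with a temporary file: three polls, but the file changes only once
seen = []
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("{}")
watch(tmp.name, seen.append, interval=0, max_polls=3)
os.remove(tmp.name)
print(len(seen))
```

In the real program, `handle` would be a function that parses the saved body and runs the search.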

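This version works much better than the previous one in practice: before the question even appears on the phone, Fiddler has captured the data and Python has found an answer. The problems are also obvious, though. Baidu cannot find answers to slightly more complex questions, and the accuracy on negated questions ('does not belong to', 'does not include', 'is not') is not high either. Now and then I still get crushed by a quiz master, but for entertainment it is good enough; after all, unlike quiz apps such as 登頂大會, 頭腦王者 pays no prize money for correct answers. Once the program was done, it took me a bit over half an hour to reach the 王者 rank.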
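One possible heuristic for the negated questions, which is not part of the original program and is untested against the real game, is to pick the option mentioned least often when the question contains a negation marker. The marker list (in both traditional and simplified forms) and the function name are assumptions:

```python
# Assumed negation markers, in traditional and simplified forms
NEGATION_MARKERS = ("不屬於", "不属于", "不包括", "不是")


def pick_option(question, scores):
    """Pick the highest-scoring label, or the lowest-scoring one for a negated question."""
    negated = any(marker in question for marker in NEGATION_MARKERS)
    chooser = min if negated else max
    return chooser(scores, key=scores.get)


# Invented scores: share of scraped answers that mention each option
scores = {"A": 0.5, "B": 0.1, "C": 0.3, "D": 0.2}
print(pick_option("下列哪個不是行星", scores))
print(pick_option("下列哪個是行星", scores))
```

This only helps when the three options that do fit all show up in the search results; otherwise the lowest count is just noise.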