Parsing simple numeric CAPTCHA images with tesseract

01-29

tesseract is an OCR (Optical Character Recognition) engine that recognizes characters in images, which makes it usable for parsing simple image CAPTCHAs.

GitHub: tesseract-ocr/tesseract

Download for Windows, v3.05.01: http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.01.exe

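I started on this because my school's network requires a web-page verification every time I go online, and I wondered whether a program could do that automatically and spare me the manual step. That requires reading the CAPTCHA, so I searched online and found people using tesseract for it.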

Someone has implemented a tool for this in Python: madmaze/pytesseract

I tried it, but it kept running into problems on Windows.

So I turned to tesseract itself. Here is its usage text:

C:\Program Files\Tesseract-OCR>tesseract
Usage:
  tesseract --help | --help-psm | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  -psm NUM              Specify page segmentation mode.
NOTE: These options must occur before any configfile.

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.

Single options:
  -h, --help            Show this help message.
  --help-psm            Show page segmentation modes.
  -v, --version         Show version information.
  --list-langs          List available languages for tesseract engine.
  --print-parameters    Print tesseract parameters to stdout
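The usage text above maps fairly directly onto a subprocess call. The following is a minimal sketch, not the code from the project described here: the names `build_command` and `run_ocr` are hypothetical, and it assumes tesseract 3.05, where the page segmentation flag is spelled `-psm` (newer releases spell it `--psm`).

```python
import subprocess
from pathlib import Path


def build_command(tesseract_dir, image_path, output_base, psm=3):
    """Build the argument list for one tesseract run (does not execute it)."""
    # Hypothetical helper: the executable is assumed to live in tesseract_dir.
    exe = str(Path(tesseract_dir) / "tesseract")
    # Order follows the usage text: imagename, outputbase, then options.
    return [exe, str(image_path), str(output_base), "-psm", str(psm)]


def run_ocr(tesseract_dir, image_path, output_base, psm=3):
    """Run tesseract and return the text it wrote to <output_base>.txt."""
    cmd = build_command(tesseract_dir, image_path, output_base, psm)
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(cmd, check=True)
    # tesseract appends ".txt" to the output base name on its own.
    result_file = Path(str(output_base) + ".txt")
    return result_file.read_text(encoding="utf-8").strip()
```

Passing the arguments as a list rather than one shell string avoids quoting problems with paths that contain spaces, such as `C:\Program Files\Tesseract-OCR`.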

In the end I decided to write a simple interface of my own.

Usage

ocr = Ocr(r"C:\Program Files\Tesseract-OCR")
# result = ocr.exec(img_path=r"e:\python\pyocr\images\1.png")
result = ocr.exec(img_url="http://oog4yfyu0.bkt.clouddn.com/2.jpg")
print(result)

A quick explanation of the parameters:

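def __init__(self, ocr_path, out_path=None, mode=3, delete=True):
    r"""
    ocr_path:
        Installation path of the tesseract engine, e.g. r"C:\Program Files\Tesseract-OCR" on my machine
    out_path:
        Path of the output file. If you only want the recognized digits you can ignore it; the default is r"D:\result.txt"
    mode:
        Page segmentation mode for the image, see the tesseract usage text; the default is 3
    delete:
        Whether to delete the generated text file; by default it is deleted (not kept)
    """

def exec(self, *, img_path="", img_url=None):
    r""" Run the command
    img_path:
        Path of a local image, e.g. r"e:\python\pyocr\images\1.png"
    img_url:
        URL of a remote image, e.g. "http://oog4yfyu0.bkt.clouddn.com/2.jpg"
    """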
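Since tesseract reads its input from a file, a remote image has to be stored locally before it can be recognized. The sketch below shows one way the `img_path` / `img_url` pair could be turned into a local path. It is an illustration only: `resolve_image` is a hypothetical name, and giving `img_url` priority when both are passed is an assumption, not something the project documents.

```python
import os
import tempfile
import urllib.parse
import urllib.request


def resolve_image(*, img_path="", img_url=None):
    """Return a local file path for the image, downloading it if needed."""
    if img_url:
        # Remote image: fetch the bytes first.
        with urllib.request.urlopen(img_url) as response:
            data = response.read()
        # Reuse the extension from the URL (e.g. ".jpg") for the temp file.
        suffix = os.path.splitext(urllib.parse.urlparse(img_url).path)[1]
        fd, tmp_path = tempfile.mkstemp(suffix=suffix)
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
        return tmp_path
    if img_path:
        # Local image: nothing to do.
        return img_path
    raise ValueError("either img_path or img_url is required")
```

The caller is responsible for removing the temporary file once recognition is done.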

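As for why only digits: English text never came out fully correct. Changing the -l parameter did not help, nor did using the bundled tessdata, and for Chinese the output was completely unintelligible... (maybe I'm using it wrong?)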
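For digit-only CAPTCHAs, one commonly used approach is to restrict the character set through the `-c VAR=VALUE` option and to strip any remaining non-digit characters from the output. The sketch below is an assumption about how that could look, not what the project does: `digit_options` and `extract_digits` are hypothetical helpers, `tessedit_char_whitelist` is a tesseract config variable whose effect may vary between versions, and page segmentation mode 7 (single text line) is only a guess at what suits a typical CAPTCHA.

```python
import re


def digit_options(psm=7):
    """Options asking tesseract to treat the image as one line of digits."""
    # -psm is the tesseract 3.05 spelling; the whitelist limits output to 0-9.
    return ["-psm", str(psm), "-c", "tessedit_char_whitelist=0123456789"]


def extract_digits(raw_text):
    """Drop every character except ASCII 0-9 from tesseract's raw output."""
    return re.sub(r"[^0-9]", "", raw_text)
```

These options would be appended after the image name and output base on the command line, as the usage text requires.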

Results

C:\Users\54186\Anaconda3\python.exe E:/Python/pyocr/ocr.py
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
5108

C:\Users\54186\Anaconda3\python.exe E:/Python/pyocr/ocr.py
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
4893

C:\Users\54186\Anaconda3\python.exe E:/Python/pyocr/ocr.py
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
130768

GitHub: https://github.com/chenjiandongx/pyocr

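A friendly reminder: there is no guarantee that the result is 100% correct, or that the image can be parsed at all, so treat this project as a reference only!!! If you need reliability, use a CAPTCHA-solving service instead.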