Notes on parsing doc and docx files

04-27

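This post covers how to parse doc and docx files, and how to extract the table data inside them.

Before looking into it, I had always assumed the two formats used the same encoding and simply needed different versions of Word to open them. After reading the documentation carefully, I realized how naive that was.

So how do doc and docx differ?

1. Different storage formats [1]: doc is a binary format, while docx is a package (a zip archive). See also the Zhihu question "Apart from paid software or libraries, how can you parse doc-format Word files in C++ or C#?". The structure of the package [2] looks roughly like this: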

├── [Content_Types].xml
├── _rels
│   └── .rels
├── docProps
│   ├── app.xml
│   └── core.xml
└── word
    ├── _rels
    │   ├── document.xml.rels
    │   └── footnotes.xml.rels
    ├── document.xml ----------------- main file holding the text
    ├── fontTable.xml
    ├── footnotes.xml
    ├── media ------------------------ images, audio, video, etc. embedded in the docx
    ├── numbering.xml
    ├── settings.xml
    ├── styles.xml
    ├── theme
    │   └── theme1.xml
    └── webSettings.xml

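Some of the parts above are required and others are optional; see the relevant documentation for details.

Since docx is a package, we can open it with Python. For example, to get word/document.xml out of it: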

    import zipfile

    def parseZip(filepath):
        # A .docx file is a zip archive; return the raw bytes of its main document part
        with zipfile.ZipFile(filepath) as zipf:
            return zipf.read("word/document.xml")
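The bytes returned from word/document.xml are WordprocessingML, so the text still has to be pulled out of the XML. The following is a minimal sketch using only the standard library. The helper names (extract_paragraphs, read_docx_paragraphs, make_demo_docx) are made up for illustration, and the demo archive is a stripped-down stand-in that contains only word/document.xml, not a file Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(document_xml):
    """Return the text of every <w:p> paragraph, in document order."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for p in root.iter("{%s}p" % W_NS):
        # A paragraph's text is split across <w:t> elements; join them back.
        runs = [t.text or "" for t in p.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return paragraphs


def read_docx_paragraphs(source):
    """source: a path or file-like object pointing at a docx package."""
    with zipfile.ZipFile(source) as zipf:
        return extract_paragraphs(zipf.read("word/document.xml"))


def make_demo_docx():
    """Hypothetical helper: build a tiny in-memory archive for the demo."""
    xml = (
        '<w:document xmlns:w="%s"><w:body>'
        "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>docx</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>" % W_NS
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zipf:
        zipf.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for line in read_docx_paragraphs(make_demo_docx()):
        print(line)
```

Paragraphs inside table cells are also <w:p> elements, so this sketch returns them mixed in with the body text; it makes no attempt to keep table structure.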

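2. docx is easier to use across platforms, since it is mostly a package of XML files.

3. docx documents take up less space.

4. docx handles complex objects such as formulas, tables, and images more comfortably, because they can be described in XML.

After all that, only the first point seems useful here, so let's parse along those lines.

I found quite a few packages that can do this, such as python-docx, docx2txt, and pythondocx, but they all work on docx only, which leaves us somewhat stuck for doc. No matter: we can convert the doc file to docx.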

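    import os
    from win32com.client import Dispatch

    # Requires Windows with Microsoft Word installed
    w = Dispatch("Word.Application")
    w.Visible = 0
    w.DisplayAlerts = 0
    # Word resolves relative paths against its own directory, so pass absolute paths
    doc = w.Documents.Open(os.path.abspath("input.doc"))
    doc.SaveAs(os.path.abspath("output.docx"), FileFormat=12)  # 12 = wdFormatXMLDocument
    doc.Close()
    w.Quit()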

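Now all of our data is in the same format.

Next we need to parse the docx and get at the tables. After comparing the options, python-docx turned out to be the most convenient.

First, determine the format of the input document, with a function judgeType().

Second, parse the document.

Finally, write the extracted tables to Excel.

The code is as follows: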

    # coding: utf-8
    import os
    import sys

    import docx
    import xlwt
    from win32com.client import Dispatch


    # Determine the format of the input document from its extension
    def judgeType(file_path):
        return os.path.splitext(file_path)[1].lower()


    # Convert a .doc file to .docx through Word (Windows with Word installed)
    def convertFormat(file_path, out_path):
        w = Dispatch("Word.Application")
        w.Visible = 0
        w.DisplayAlerts = 0
        try:
            # Word resolves relative paths against its own directory, so pass absolute paths
            doc = w.Documents.Open(os.path.abspath(file_path))
            doc.SaveAs(os.path.abspath(out_path), FileFormat=12)  # 12 = wdFormatXMLDocument
            doc.Close()
        finally:
            w.Quit()


    # Write every table in the docx to its own sheet of an .xls workbook
    def tablesToExcel(docx_path, xls_path):
        doc = docx.Document(docx_path)
        book = xlwt.Workbook()
        for index, table in enumerate(doc.tables):
            sheet = book.add_sheet("%dsheet" % (index + 1))
            for i_r, row in enumerate(table.rows):
                for i_c, cell in enumerate(row.cells):
                    cell_data = [p.text for p in cell.paragraphs]
                    sheet.write(i_r, i_c, " ".join(cell_data))
        book.save(xls_path)


    def main():
        file_path = sys.argv[1]
        file_type = judgeType(file_path)
        if file_type == ".doc":
            convertFormat(file_path, "tmp.docx")
            tablesToExcel("tmp.docx", sys.argv[2])
            os.remove("tmp.docx")
        if file_type == ".docx":
            tablesToExcel(file_path, sys.argv[2])


    if __name__ == "__main__":
        main()
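judgeType() trusts the file extension, which breaks as soon as a file has been renamed. Because doc is a binary compound file and docx is a zip package, the first bytes of the file give a second opinion. The sketch below is one possible check, not part of the original script; sniff_word_format is a hypothetical name. It can only tell the container type apart: other Office formats share the same signatures (.xls is also a compound file, .xlsx is also a zip), so treat the result as a hint.

```python
import io

# Signature of an OLE2 compound file, the container used by legacy .doc
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"
# Signature of a zip local file header, the container used by .docx
ZIP_MAGIC = b"PK\x03\x04"


def sniff_word_format(stream):
    """Guess the container from the leading bytes of a binary stream.

    Returns ".doc" for a compound file, ".docx" for a zip archive,
    and None when neither signature matches.
    """
    head = stream.read(8)
    if head.startswith(OLE2_MAGIC):
        return ".doc"
    if head.startswith(ZIP_MAGIC):
        return ".docx"
    return None


if __name__ == "__main__":
    print(sniff_word_format(io.BytesIO(ZIP_MAGIC + b"rest of the archive")))
    print(sniff_word_format(io.BytesIO(OLE2_MAGIC + b"rest of the file")))
    print(sniff_word_format(io.BytesIO(b"plain text")))
```

With a real file, open it in binary mode first, for example `with open(path, "rb") as f: kind = sniff_word_format(f)`.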

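With that, our method for extracting tables from Word documents is complete.

One last note: for tables with merged or split cells, the method above needs adjusting.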
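To give an idea of what that adjustment involves: in the underlying XML, a horizontally merged cell is a single <w:tc> carrying a <w:gridSpan> value in its properties. The sketch below reads a table straight from the XML with the standard library and pads each merged cell with empty strings so the columns stay aligned. It is an illustration under simplifying assumptions rather than a replacement for the script above: table_rows is a hypothetical helper, vertical merges (<w:vMerge>) and nested tables are ignored, and only rows and cells that are direct children of the table are read.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS


def table_rows(tbl):
    """Turn a <w:tbl> element into a list of rows of cell text.

    A cell spanning n grid columns is followed by n - 1 empty strings.
    """
    rows = []
    for tr in tbl.findall(W + "tr"):
        row = []
        for tc in tr.findall(W + "tc"):
            # Join the runs of each paragraph, then the paragraphs of the cell
            text = " ".join(
                "".join(t.text or "" for t in p.iter(W + "t"))
                for p in tc.findall(W + "p")
            )
            span_el = tc.find(W + "tcPr/" + W + "gridSpan")
            span = 1 if span_el is None else int(span_el.get(W + "val", "1"))
            row.append(text)
            row.extend([""] * (span - 1))
        rows.append(row)
    return rows


# Hand-written demo table: a header cell spanning two columns, then two cells
DEMO_TABLE = (
    '<w:tbl xmlns:w="%s">'
    '<w:tr><w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr>'
    "<w:p><w:r><w:t>Header</w:t></w:r></w:p></w:tc></w:tr>"
    "<w:tr><w:tc><w:p><w:r><w:t>a</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>b</w:t></w:r></w:p></w:tc></w:tr>"
    "</w:tbl>" % W_NS
)

if __name__ == "__main__":
    print(table_rows(ET.fromstring(DEMO_TABLE)))
```

Whether a merged cell should be left blank, repeated, or written as a real merged range in the spreadsheet depends on how the output will be used.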

References:

[1]Difference Between DOC and DOCX

[2]https://geddy.cn/blog/item/jie-xi-docx