
Extracting tables from a DOCX Word document in Python

I'm trying to extract the contents of the tables in a DOCX Word document, and I'm new to XML/XPath.

from docx import *
document = opendocx('someFile.docx')
tableList = document.xpath('/w:tbl')

This triggers an "XPathEvalError: Undefined namespace prefix" error. I'm sure it's just the first of many to expect while developing the script. Unfortunately, I couldn't find a tutorial for python-docx.

Could you kindly provide an example of table extraction?


After some back and forth, we found out that a namespace mapping was needed for this to work correctly. The xpath method is the appropriate solution; it just needs the document's namespaces passed in.

The lxml documentation for the xpath method has the details on namespace handling. Look further down that page for how to pass a namespaces dictionary, and for other details.

As explained by mgierdal in his comment above:

tblList = document.xpath('//w:tbl', namespaces=document.nsmap) works like a dream. So, as I understand it, w: is a shorthand prefix that has to be expanded to the full namespace name, and the dictionary for that expansion is provided by document.nsmap.
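
To illustrate the prefix-to-namespace mapping described above, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml, and it runs against a hypothetical, heavily trimmed stand-in for word/document.xml instead of a real file. The names W_NS, NSMAP and SAMPLE_XML are assumptions made for this illustration. With lxml, the same kind of dictionary would be passed as namespaces= to xpath.

```python
import xml.etree.ElementTree as ET

# The namespace URI that the "w:" prefix stands for in WordprocessingML.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Prefix-to-URI mapping; without it the parser cannot resolve "w:".
NSMAP = {"w": W_NS}

# Hypothetical, trimmed-down sample standing in for word/document.xml.
SAMPLE_XML = f"""
<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:tbl>
      <w:tr>
        <w:tc><w:p><w:r><w:t>A1</w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:t>B1</w:t></w:r></w:p></w:tc>
      </w:tr>
      <w:tr>
        <w:tc><w:p><w:r><w:t>A2</w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:t>B2</w:t></w:r></w:p></w:tc>
      </w:tr>
    </w:tbl>
  </w:body>
</w:document>
"""

root = ET.fromstring(SAMPLE_XML.strip())

# ".//w:tbl" searches the whole tree; the mapping expands "w:" to W_NS.
tables = root.findall(".//w:tbl", NSMAP)

# Collect the text of every cell in the first table, row by row.
rows = []
for tr in tables[0].findall("w:tr", NSMAP):
    row = []
    for tc in tr.findall("w:tc", NSMAP):
        # A cell may hold several text runs; join them into one string.
        texts = [t.text or "" for t in tc.iterfind(".//w:t", NSMAP)]
        row.append("".join(texts))
    rows.append(row)

print(len(tables), rows)
```

The search path starts with // (written .// in ElementTree) so that tables are found anywhere in the tree, not only at the root.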


You can extract the tables from a DOCX file using python-docx. Check the following code:

from docx import Document

# file_path is the path to your .docx file
document = Document(file_path)

# list of the top-level tables in the document
tables = document.tables


First, install python-docx as mentioned by @abdulsaboor:

pip install python-docx

Then this code should do it:

from docx import Document

document = Document('myfile.docx')

for table in document.tables:
    print()  # blank line between tables
    for row in table.rows:
        # print each row on its own line, cells separated by spaces
        print(' '.join(cell.text for cell in row.cells))
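
If you would rather keep the data than print it, the loop above can be wrapped in a small helper. This is only a sketch: table_to_rows is a hypothetical name, not part of python-docx, and it relies only on the .rows, .cells and .text attributes already used above.

```python
def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell strings.

    `table` is expected to look like a python-docx Table: it has `.rows`,
    each row has `.cells`, and each cell has `.text`.
    """
    return [
        [cell.text.strip() for cell in row.cells]  # trim padding whitespace
        for row in table.rows
    ]
```

With python-docx installed, something like [table_to_rows(t) for t in Document('myfile.docx').tables] should then give one list of rows per table.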