Doc, RTF and TXT reader in Python

Like csv.reader(), are there any other functions which can read .rtf, .txt and .doc files in Python?


You can read a text file with

txt = open("file.txt").read()
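The one-liner above leaves closing the file to the garbage collector and uses the platform's default encoding. A slightly more defensive sketch, assuming the file is UTF-8 (the helper name read_text_file is only illustrative):

```python
def read_text_file(path, encoding="utf-8"):
    """Return the whole file as one string."""
    # The with-block closes the file even if read() raises.
    with open(path, encoding=encoding) as handle:
        return handle.read()

# Example call (assumes file.txt exists in the working directory):
# txt = read_text_file("file.txt")
```

If the encoding is not known in advance, passing errors="replace" to open() avoids a UnicodeDecodeError at the cost of some garbled characters.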

Try PyRTF for RTF files (though, as far as I know, it is aimed mainly at generating RTF rather than parsing it). I would think that reading MS Word .doc files is pretty unlikely unless you are on Windows and can use some of the native MS interfaces for reading those files. This article claims to show how to write scripts that interface with Word.


I've had a real headache trying to do this simple thing for Word and Writer documents.

There is a simple solution: call OpenOffice on the command line to convert your target document to text, then load the text into Python.

Other conversion tools I tried produced unreliable output, and the Python OOo libraries I looked at were too complex.

If you just want to get at the text so you can process it, use this on the Linux command line:

soffice --headless --convert-to txt:Text /path_to/document_to_convert.doc

(Call it from Python using subprocess if you want to automate it.)

It will create a text file that you can simply load into Python.

(Credit)
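Automating the soffice call with subprocess might look roughly like this. It is a sketch, not a tested recipe: it assumes soffice is on the PATH, that the converted file is named after the input with a .txt extension, and that the output can be decoded as UTF-8 (the actual encoding may depend on your locale). The helper names are made up for illustration.

```python
import subprocess
from pathlib import Path

def build_convert_command(doc_path, out_dir):
    """Build the same soffice command as above, plus an explicit output directory."""
    return [
        "soffice", "--headless",
        "--convert-to", "txt:Text",
        "--outdir", str(out_dir),
        str(doc_path),
    ]

def convert_to_text(doc_path, out_dir):
    """Run the conversion and return the extracted text."""
    # check=True raises CalledProcessError if soffice exits with an error.
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    # Assumption: soffice writes <input stem>.txt into out_dir.
    txt_path = Path(out_dir) / (Path(doc_path).stem + ".txt")
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Passing the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.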


CSV is a specific format, so you need a parser to read it. This is what the csv module provides, as you've mentioned. Text files (usually suffixed with .txt) don't have any fixed format, so you can just read them after opening them (Jesse's answer gives the details). CSV files are themselves usually plain text files, so your distinction is not very accurate.
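To make the point concrete, here is a small sketch with made-up sample data. The same text is split naively on commas and then run through csv.reader; only the parser understands the quoting rules.

```python
import csv
import io

# Made-up sample: the second field of the data row contains a comma inside quotes.
raw = 'name,comment\nAda,"likes commas, apparently"\n'

# Treating the data as plain text and splitting on commas breaks the quoted field.
naive_rows = [line.split(",") for line in raw.splitlines()]

# csv.reader applies the CSV quoting rules and keeps the field intact.
parsed_rows = list(csv.reader(io.StringIO(raw)))
```

Here naive_rows[1] ends up with three pieces, while parsed_rows[1] has the two fields that were intended.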

As for RTF, there are a bunch of libraries. See this answer for details. The PyRTF thing which Jesse mentioned seems to be the most popular, though.

Microsoft Word document files (usually suffixed with .doc) are another beast, since the format is proprietary. I don't have much experience with Python converters, but there are a few command-line ones (like wvHTML) which do a somewhat decent job. This question discusses quite a few. There's also the option of having MS Word itself do that for you via a COM interface, as Jesse has mentioned.


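# Requires Windows with Microsoft Word installed and the pywin32 package.
import win32com.client

# tmpFile is the path to the document to read (defined elsewhere).
if tmpFile.endswith(('.xml', '.doc', '.docx')):
    app = win32com.client.Dispatch("Word.Application")
    app.Visible = False
    doc = app.Documents.Open(tmpFile)
    try:
        docText = doc.Content.Text
        print(docText)
    finally:
        # Close the document and quit Word even if reading fails.
        doc.Close()
        app.Quit()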


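There is a Python module called 'docx' (published on PyPI as python-docx) which you can use to read .docx files. You won't be able to read .doc files with it, though: .doc is the older binary Word format, which is nearly obsolete nowadays and which this module does not handle.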

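from docx import Document

doc = Document(filepath)  # filepath: path to a .docx file
# Reading data
data = doc.paragraphs  # Paragraph objects; use .text on each for its text
tables = doc.tables    # Table objects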

You can find it on PyPI.
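If installing a third-party package is not an option, the paragraph text can also be pulled out with the standard library alone, since a .docx file is a zip archive whose main content lives in word/document.xml. This is a different technique from the docx module, and only a rough sketch: it ignores tabs, line breaks, headers and footnotes, and it returns paragraphs inside tables mixed in with the rest. The function name docx_paragraphs is made up.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file as a list of strings."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

For anything beyond plain text (styles, table structure, images), the docx module above is likely the better choice.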
