Scraping structured information from hundreds of Word documents?

I've been tasked with extracting structured information from hundreds of human-readable documents (mostly MS Word) and putting it into a database. The data is mostly embedded in tables spread throughout each document, but there's a lot of text between the tables, and although the documents are very similar in structure, there are a few differences. The documents also change fairly often (we get an updated version every few months).

So far the only viable option I can think of is to manually go through all the documents and insert/update the information, but I thought I'd ask here whether anyone thinks it's possible to scrape the documents in some way.

Oh, and the data has to be fairly correct...


I did similar work (without tables though) using a converter from RTF to FO.

You have to convert the docs to RTF, and then to FO, which gives you a nice XML structure of the document. You can then easily parse it and scrape the data.
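To give an idea of what the parsing step might look like, here is a minimal sketch in Python. It assumes the converter emits XSL-FO, so tables appear as `fo:table` / `fo:table-row` / `fo:table-cell` elements in the standard XSL-FO namespace. The `extract_tables` helper and the `SAMPLE` document are made up for illustration, and real converter output will contain a lot more layout markup. Nested tables are not handled.

```python
import xml.etree.ElementTree as ET

# Standard XSL-FO namespace (assumes the converter output uses it).
FO = "http://www.w3.org/1999/XSL/Format"


def extract_tables(fo_xml: str) -> list[list[list[str]]]:
    """Return every table as a list of rows, each row a list of cell texts.

    Hypothetical helper: rows of nested tables would be folded into the
    outer table, which is good enough for a first pass.
    """
    root = ET.fromstring(fo_xml)
    tables = []
    for table in root.iter(f"{{{FO}}}table"):
        rows = []
        for row in table.iter(f"{{{FO}}}table-row"):
            cells = []
            for cell in row.findall(f"{{{FO}}}table-cell"):
                # Join all text inside the cell and collapse whitespace.
                text = " ".join("".join(cell.itertext()).split())
                cells.append(text)
            rows.append(cells)
        tables.append(rows)
    return tables


# Tiny hand-written stand-in for converter output (illustrative only).
SAMPLE = """<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:block>Some text between the tables.</fo:block>
  <fo:table>
    <fo:table-body>
      <fo:table-row>
        <fo:table-cell><fo:block>Name</fo:block></fo:table-cell>
        <fo:table-cell><fo:block>Value</fo:block></fo:table-cell>
      </fo:table-row>
      <fo:table-row>
        <fo:table-cell><fo:block>Voltage</fo:block></fo:table-cell>
        <fo:table-cell><fo:block>5 V</fo:block></fo:table-cell>
      </fo:table-row>
    </fo:table-body>
  </fo:table>
</fo:root>"""

if __name__ == "__main__":
    for table in extract_tables(SAMPLE):
        for row in table:
            print(row)
```

Since the data has to be fairly correct and the documents differ slightly, it is probably worth checking each extracted table against the headers you expect and flagging anything that does not match for manual review, rather than inserting it blindly.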
