Scraping structured information from hundreds of Word documents?
I've been tasked with extracting some structured information from hundreds of human readable documents (mostly MS Word) and to put it into a database. The data is pretty much embedded in tables throughout the entire document but there's a lot of text between the tables and although the documents are very similar in structure, there are a few differences. The documents are changed fairly often (we get an updated version every few 开发者_如何学Pythonmonths)
So far the only viable option i can think of is to manually go trough all the documents and insert/update the information but I thought I'd ask here if anyone think it's possible to scrape the documents in some way?
Oh, and the data has to be fairly correct...
I did similar work (without tables though) using a converter from RTF to FO.
You have convert docs to RTF, and then to FO, which gives you a nice XML structure of the document. You can then easily parse it and scrape the data.
精彩评论