Extract metadata from old Word files (Word 2.0 onwards)

I have to extract metadata from a large number of Microsoft Office files, mostly Word documents. My small working sample has hundreds; the total will probably be in the thousands.

The files range from Word 2.0 to Word 2007.

I have to do it in .NET 3.5 (using C#), and it's a local WinForms application.

I think I can extract metadata from the most recent ones with OLE Automation (DsoFile.dll). I did it successfully with some of them.

The problem is that the older formats aren't supported by DsoFile. They probably don't use OLE.

I did a lot of googling and found that the best (and probably the only) way to get the data I want is antiword (http://www.winfield.demon.nl/). With antiword I can start its process and collect its output. It can extract some of the data, but not everything I need. For example, antiword gives me only one of the stored dates, and I need two of them.
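For reference, the "start the process and collect its output" step can be sketched roughly as below. This is only a minimal sketch: the class and method names are made up for illustration, the install path in the usage line is hypothetical, and no antiword-specific switches are assumed (only the document path is passed).

```csharp
using System;
using System.Diagnostics;

static class AntiwordRunner
{
    // Runs an external command-line tool and returns everything it wrote to stdout.
    public static string RunAndCapture(string exePath, string arguments)
    {
        ProcessStartInfo info = new ProcessStartInfo(exePath, arguments);
        info.UseShellExecute = false;       // required for stream redirection
        info.RedirectStandardOutput = true;
        info.CreateNoWindow = true;         // avoid flashing a console window from WinForms

        using (Process process = Process.Start(info))
        {
            // Read before WaitForExit so a full pipe buffer cannot block the child.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return output;
        }
    }
}
```

Usage would look something like `string text = AntiwordRunner.RunAndCapture(@"C:\antiword\antiword.exe", "\"" + docPath + "\"");`, where the executable location is an assumption. Starting one process per file adds overhead, so with thousands of files it may be worth running this on a background thread rather than the UI thread.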

There's also wvWare, but I guess it's Linux-only.

Another option would be GNU libextractor, but I can't find a way to use it from .NET.

Office Interop would be a desperate last resort. I haven't tested that option, but I'm guessing it's not an option when one wants to process a huge number of files with decent performance.

Can anyone help? If you need more information, just ask.

Sorry for my English, I'm not a native speaker.


I used to work on a commercial Office metadata extraction and reporting tool. It isn't an easy task, especially if you also want to remove any of that metadata. From the sound of it, you only need to report on it, which is the easier case.

As of Word 2000, Word files were (and are) stored in OLE Compound documents. There's plenty of documentation online about reading those files, but keep in mind that doing so will only get you a small subset of the metadata. Most of the "meat" of a Word doc is stored as big binary blobs within the compound document file, and the format of those blobs is proprietary.
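Since the batch mixes formats, it may help to sort files by container before choosing an extraction route, rather than trusting the extension or the claimed Word version. The sketch below checks the leading bytes: OLE compound files start with `D0 CF 11 E0 A1 B1 1A E1`, and ZIP archives (which is what DOCX is) start with `50 4B 03 04`. The type and method names are illustrative assumptions. A ZIP signature does not prove a file is a DOCX, and anything else, which is presumably where the pre-OLE Word 2.0 files would land, falls into `Other`.

```csharp
using System;
using System.IO;

enum ContainerKind { OleCompound, Zip, Other }

static class ContainerSniffer
{
    static readonly byte[] OleSignature = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };
    static readonly byte[] ZipSignature = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies a file from its first bytes.
    public static ContainerKind Classify(byte[] header)
    {
        if (StartsWith(header, OleSignature)) return ContainerKind.OleCompound;
        if (StartsWith(header, ZipSignature)) return ContainerKind.Zip;
        return ContainerKind.Other;
    }

    // Reads up to 8 bytes from the start of the file and classifies them.
    public static ContainerKind ClassifyFile(string path)
    {
        byte[] header = new byte[8];
        int total = 0;
        using (FileStream stream = File.OpenRead(path))
        {
            // Stream.Read may return fewer bytes than requested, so loop until done.
            while (total < header.Length)
            {
                int n = stream.Read(header, total, header.Length - total);
                if (n == 0) break;   // end of file
                total += n;
            }
        }
        Array.Resize(ref header, total);
        return Classify(header);
    }

    static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Files classified as `OleCompound` are candidates for DsoFile, `Zip` files for a package reader, and `Other` for a fallback such as antiword.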

There's documentation on the web for the DOC file format.

http://msdn.microsoft.com/en-us/library/cc313118.aspx

But it's a MASSIVE spec and insanely complicated. Still, you might be able to ferret out just those pieces you need to deal with.

The newer DOCX files are much easier to deal with (and have a lot less metadata lurking about too).
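As an illustration of how much simpler the DOCX side is: a DOCX file is an Open Packaging Conventions package, and .NET 3.x can read its core properties through `System.IO.Packaging` (this needs a reference to WindowsBase.dll). The sketch below is only that, a sketch; the class and method names are made up, and it reports just four of the available properties, including both the created and the last-modified dates.

```csharp
using System;
using System.IO;
using System.IO.Packaging;   // requires a reference to WindowsBase.dll

static class DocxProperties
{
    // Opens the package read-only and returns a one-line summary of its core properties.
    public static string Describe(string path)
    {
        using (Package package = Package.Open(path, FileMode.Open, FileAccess.Read))
        {
            PackageProperties props = package.PackageProperties;
            // Created and Modified are nullable; a missing value formats as an empty string.
            return string.Format(
                "Creator={0}; Created={1}; Modified={2}; LastModifiedBy={3}",
                props.Creator, props.Created, props.Modified, props.LastModifiedBy);
        }
    }
}
```

Calling `DocxProperties.Describe(path)` on a Word 2007 file should yield the author and both timestamps if the document stores them. `Package.Open` throws on files that are not valid packages, so a batch run would want a try/catch around the call.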
