Extract MS Word document chapters to SQL database records?

I have a 300+ page Word document containing hundreds of "chapters" (as defined by heading formats) and currently indexed by Word. Each chapter contains a medium amount of text (typically less than a page) and perhaps an associated graphic or two. I would like to split the document up into database records for use in an iPhone program: each chapter would be a record consisting of title, id #, and content fields. I haven't decided yet whether I want the pictures to be a separate field (probably just containing a file name), or HTML or similar style links in the content text. In any case, the end result would be that I could display a searchable table of titles that the user could click on to pull up any given entry.

The difficulty I am having at the moment is getting from the Word document to the database. How can I most easily split the document up into records by chapter, while keeping the image associations? I thought of inserting some unique character between each chapter, saving to text format, and then writing a script to parse the document into a database based on that character, but I'm not sure that I can handle the graphics in this scenario. Are there other options?


To answer my own question:

Given a fairly simply formatted Word document:

  1. Convert it to an Open Office XML document.

  2. Write a Python script to parse the document into a database using the xml.sax Python module.

Images are inserted into the record as HTML, to be displayed using a web interface.
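
For anyone trying the same thing, here is a minimal sketch of what step 2 could look like. It is not the script I actually used. It assumes the converted file uses OpenDocument-style element names (`text:h` for headings, `text:p` for paragraphs, `draw:image` with an `xlink:href` attribute for pictures); check your own file and adjust the names if they differ. `ChapterHandler`, `load_chapters`, the `chapters` table, and the choice of SQLite are all hypothetical names and choices made for illustration.

```python
import sqlite3
import xml.sax
from html import escape


class ChapterHandler(xml.sax.ContentHandler):
    """Collect one [title, html_content] pair per heading element."""

    def __init__(self):
        super().__init__()
        self.chapters = []        # finished [title, content] pairs
        self._in_heading = False
        self._in_para = False
        self._buf = []            # text chunks of the current heading/paragraph

    def startElement(self, name, attrs):
        # Element names below are assumptions (OpenDocument vocabulary).
        if name == "text:h":
            self._in_heading = True
            self._buf = []
        elif name == "text:p" and self.chapters:
            self._in_para = True
            self._buf = []
        elif name == "draw:image" and self.chapters:
            # Keep the image association as an HTML tag in the content field.
            href = attrs.get("xlink:href", "")
            self.chapters[-1][1] += '<img src="%s">' % escape(href, quote=True)

    def characters(self, content):
        if self._in_heading or self._in_para:
            self._buf.append(content)

    def endElement(self, name):
        text = "".join(self._buf).strip()
        if name == "text:h" and self._in_heading:
            self.chapters.append([text, ""])   # a heading starts a new record
            self._in_heading = False
        elif name == "text:p" and self._in_para:
            if text:
                self.chapters[-1][1] += "<p>%s</p>" % escape(text)
            self._in_para = False


def load_chapters(xml_bytes, conn):
    """Parse the XML and insert one row per chapter; return the row count."""
    handler = ChapterHandler()
    xml.sax.parseString(xml_bytes, handler)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chapters "
        "(id INTEGER PRIMARY KEY, title TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO chapters (title, content) VALUES (?, ?)", handler.chapters
    )
    conn.commit()
    return len(handler.chapters)


# Tiny made-up document, only to show the shape of the input.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<office:document xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <text:h>First chapter</text:h>
  <text:p>Body one.</text:p>
  <text:p><draw:frame><draw:image xlink:href="Pictures/a.png"/></draw:frame></text:p>
  <text:h>Second chapter</text:h>
  <text:p>Body &amp; two.</text:p>
</office:document>"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_chapters(SAMPLE, conn)
    for row in conn.execute("SELECT id, title, content FROM chapters"):
        print(row)
```

Text that appears before the first heading is skipped, and an image is written ahead of any text from the paragraph that contains it, so a real document may need more careful ordering. The resulting SQLite file could in principle be bundled with an iPhone app, but that part is untested here.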
