Trying to convert MSWord 2007 document to an XML format
I'm hoping I can forgo the history, but trust me on the following:
- I have several people who have immediate access to MSWord 2007
- We are trying to prep a generic Word document that can be passed from person to person over the course of several months and they can "add" new content to it.
Regardless of the answers below - the above will stay the same no matter how horrible an idea it is, or what better idea you may have... I've already been down this road :P.
- My 'thoughts' were to set up (within Word) an XML Schema so we could 'flag' the content for the specific content areas (e.g. item number, item description, item stem, item options, item answer, etc.)
- I taught myself XML schema in a little under 6 hours, and apparently I'm a horrible teacher: I have the XML Schema file, I have imported it into Word, I am able to flag the areas as per all the online tutorials...
- I was HOPING to save out to an "XML" file (from Word) and have it look like:
    <note>
        <to>Tove</to>
        <from>Jani</from>
        <heading>Reminder</heading>
        <body>Don't forget me this weekend!</body>
    </note>
(just pulled that off a random site to demonstrate I wanted to save out from the word document the XML structure with the data filled in)
The hope was that I could then parse it with Python, or send the XML file to a vendor who could then upload the information into a database (and no - we can't just upload to the database - it has to go from the Word document to XML to the vendor).
The problem: Whenever I save the file to XML from MSWord 2007 it gives me all this horrible horrible XML crap all over the place - I've checked to see if I could parse that, hoping to find my XML tags embedded, and I find them, but they're so garbled by all of Office's own tags/crap that parsing them out would be a huge waste of time.
Finally: How can I have Word automatically fill in the XML tags (and by automatically I understand that someone has to "select the text", "assign the XML"... I'm talking more about the 'saving' out to XML) from a schema I develop (or can I just create a sample XML tree without the schema?) and export the contents ready for uploading/parsing?
Thanks for reading my short novel :P (hope I was clear enough!)
-J
If the data will be as uniform as the example you provided (i.e. just note elements, with a fixed number of fields), you might be able to get away with having one big table in the Word document, with columns for to, from, heading, body, etc. Then, you could parse it out in Python using one of the methods described in this question and output your custom XML. Since .docx files are already XML (zipped up in an archive), that may or may not make your job simpler.
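Here's a rough sketch of the table idea using only the standard library. It assumes the document contains one table whose first row holds the column names (to, from, heading, body) and whose remaining rows are one note each. The function names and the notes/note element names are made up for illustration, and the header cells have to be valid XML tag names for this to work.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def cell_text(cell):
    """Join the text runs of a table cell, one line per paragraph."""
    paragraphs = []
    for p in cell.iter(W + "p"):
        paragraphs.append("".join(t.text or "" for t in p.iter(W + "t")))
    return "\n".join(paragraphs).strip()


def table_rows(docx_file):
    """Return the first table in a .docx as a list of rows of cell strings."""
    with zipfile.ZipFile(docx_file) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    table = root.find(".//" + W + "tbl")
    if table is None:
        return []
    return [[cell_text(tc) for tc in tr.findall(W + "tc")]
            for tr in table.findall(W + "tr")]


def rows_to_xml(rows, record_tag="note", root_tag="notes"):
    """Treat the first row as column headers and emit one element per later row."""
    root = ET.Element(root_tag)
    if rows:
        header = rows[0]
        for row in rows[1:]:
            record = ET.SubElement(root, record_tag)
            for name, value in zip(header, row):
                ET.SubElement(record, name).text = value
    return ET.tostring(root, encoding="unicode")
```

Something like rows_to_xml(table_rows("notes.docx")) would then give you the flat XML string (the filename is just a placeholder).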
If the data are going to be more complex, one idea might be using Word styles to map text to the correct tags. You could make a custom style for each tag, which would be quick and easy for the user to click (and perhaps have a different color and/or font). Then when parsing the document you could filter everything based on the paragraph style applied. I'm thinking this route would be painful, though.
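To give an idea of what the style-based filtering could look like, here is a minimal sketch. It assumes every field is its own paragraph, that the style IDs stored in the document match the tag names you want (the ID in the file can differ from the name shown in Word's UI, so check document.xml first), and that the first field in the list marks the start of a new record. All function and element names here are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def styled_paragraphs(docx_file):
    """Yield (style_id, text) for every paragraph in word/document.xml."""
    with zipfile.ZipFile(docx_file) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    for p in root.iter(W + "p"):
        style = p.find(W + "pPr/" + W + "pStyle")
        style_id = style.get(W + "val") if style is not None else None
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        yield style_id, text


def paragraphs_to_xml(pairs, fields, record_tag="note", root_tag="notes"):
    """Group styled paragraphs into records; fields[0] starts a new record."""
    root = ET.Element(root_tag)
    record = None
    for style_id, text in pairs:
        if style_id not in fields:
            continue  # paragraph without one of our custom styles
        if record is None or style_id == fields[0]:
            record = ET.SubElement(root, record_tag)
        ET.SubElement(record, style_id).text = text
    return ET.tostring(root, encoding="unicode")
```

The painful part is everything this sketch ignores: fields spanning several paragraphs, users forgetting to apply a style, or styles applied to only part of a paragraph.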
Another option might be writing the document in a structured syntax like YAML, which is easy enough to read/write by hand and you could parse just from saving the file as plaintext, e.g.
    # plaintext_export.txt
    # ------------------
    Notes:
        - From: Somebody
          To: Somebody-else
          Heading: This is a heading
          Message: >
              Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
              tempor incididunt ut labore et dolore magna aliqua.
        - From: Another guy
          To: Me
          Heading: Huh?
          Message: >
              Some other message content.
Parsing would be as simple as:
    >>> import yaml
    >>> from pprint import pprint
    >>> with open("plaintext_export.txt", 'r') as f:
    ...     data = yaml.safe_load(f)
    ...
    >>> pprint(data)
    {'Notes': [{'From': 'Somebody',
                'Heading': 'This is a heading',
                'Message': 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. \n',
                'To': 'Somebody-else'},
               {'From': 'Another guy',
                'Heading': 'Huh?',
                'Message': 'Some other message content.\n',
                'To': 'Me'}]}
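Since the end goal is an XML file for the vendor, the parsed structure can be written back out with ElementTree. This is only a sketch: it assumes the data has the shape shown above (a Notes list of flat dicts), that every key is a valid XML tag name, and that lower-cased tags inside a notes root are what the vendor wants. The notes_to_xml name is made up.

```python
import xml.etree.ElementTree as ET


def notes_to_xml(data):
    """Turn {'Notes': [{'From': ..., 'To': ...}, ...]} into an XML string."""
    root = ET.Element("notes")
    for note in data["Notes"]:
        elem = ET.SubElement(root, "note")
        for key, value in note.items():
            # Keys become tag names, so they must be valid XML names.
            ET.SubElement(elem, key.lower()).text = str(value).strip()
    return ET.tostring(root, encoding="unicode")


# Hand-written sample in the same shape as the parsed YAML above.
data = {"Notes": [{"From": "Another guy",
                   "To": "Me",
                   "Heading": "Huh?",
                   "Message": "Some other message content.\n"}]}
print(notes_to_xml(data))
```

ElementTree escapes characters like & and < in the text for you, which is the main reason to prefer it over building the XML with string formatting.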