Conversion between docx / doc / rtf and lightweight markup

I am looking for a tool or set of tools to convert between file formats D and M where

  • D is a format handled by MSWord, in order of preference, docx, doc, rtf
  • M is a lightweight markup, such as markdown, textile, txt2tags, it can be an esoteric one
  • there is a way to generate html from M
  • conversion is two-way, it's done both from D to M, and from M to D
  • utf-8 encoding is handled properly
  • the content is simple, paragraphs, some simple formatting like bold and italics, maybe lists
  • the tools are platform-independent

What I've found so far

  • TeX, LaTeX -- too heavyweight
  • docx2txt -- too lightweight, it supports no formatting at all
  • html -- MSWord produces bloated html
  • a few one-way conversions, like doc to mediawiki

UPDATE:

The use case is a document workflow between technical and non-technical people

  • I, the technical guy, edit a document in plain text, put it into version control, etc.
  • I send it to my manager or other non-technical people
  • They add comments, make changes to it using their Word, then they send it back to me
  • I want to simply grok their changes, make my changes, put it into version control, without having to use Word


I think Pandoc more than meets all of these requirements.

http://pandoc.org
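
As a rough sketch of how the round trip could be scripted, assuming pandoc is installed and on your PATH: the helper names `pandoc_command` and `convert` and the `FORMATS` table are made up for illustration, while `-f`, `-t` and `-o` are pandoc's regular options for input format, output format and output file.

```python
import subprocess
from pathlib import Path

# Assumption: pandoc is installed and on PATH.
# Maps file extensions to pandoc format names (extend as needed).
FORMATS = {".docx": "docx", ".md": "markdown", ".html": "html"}


def pandoc_command(src, dst):
    """Build the pandoc argument list for converting src to dst."""
    src, dst = Path(src), Path(dst)
    return [
        "pandoc",
        "-f", FORMATS[src.suffix.lower()],  # input format
        "-t", FORMATS[dst.suffix.lower()],  # output format
        str(src),
        "-o", str(dst),
    ]


def convert(src, dst):
    """Run pandoc; raises CalledProcessError if pandoc exits with an error."""
    subprocess.run(pandoc_command(src, dst), check=True)
```

With that, `convert("report.docx", "report.md")` would give you the text to put under version control, and `convert("report.md", "report.docx")` the file to send back. The file names are placeholders, and comments or tracked changes added in Word may need extra handling that this sketch doesn't attempt.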


Adam, I've used docx4j to convert docx to html, edited the html in CKEditor, and then used docx4j to convert the html back to docx. My process made some assumptions about the css (i.e. it was designed to handle docx4j's clean html, and editing in CKEditor).

You don't say whether there is a way to generate M from HTML. Is there one?


This is probably hard to do two-way, since you will have impedance mismatches between the various formats.

The best world I can think of would be a sort of Wiki / Word hybrid: Maybe you can get Google Wave to do that for you?

Another solution that might work is a CMS like Plone (did they ever add WYSIWYG capability? I stopped caring after version 1). Keep your documents there. Let the system handle changes, annotations etc. You can automate retrieval of the source (should be reStructuredText) and commit that to your source control if you have to.


This script I wrote might help you in your workflow:

https://github.com/matb33/docx2md

It is a command-line PHP script that will only work with .docx files. It will extract the XML, run some XSL transformations, and provide you the result in Markdown format.
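
To give an idea of what the "extract the XML" step involves, here is a minimal sketch in Python. It is not how docx2md itself works (that is PHP plus XSLT); it just walks the XML with ElementTree instead. A .docx file is a zip archive whose main body lives in `word/document.xml`. The function name `docx_to_markdown` is made up for this example, and it only handles paragraphs, bold and italics.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _is_on(props, tag):
    """True if a toggle property such as <w:b/> is present and not switched off."""
    if props is None:
        return False
    el = props.find(W + tag)
    return el is not None and el.get(W + "val") not in ("0", "false")


def docx_to_markdown(path_or_file):
    """Return the paragraphs of a .docx as Markdown, keeping bold and italics only."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W + "p"):
        pieces = []
        for run in para.iter(W + "r"):
            # A run's visible text is stored in its <w:t> children.
            text = "".join(t.text or "" for t in run.iter(W + "t"))
            if not text:
                continue
            props = run.find(W + "rPr")
            if _is_on(props, "b"):
                text = "**" + text + "**"
            if _is_on(props, "i"):
                text = "*" + text + "*"
            pieces.append(text)
        if pieces:
            paragraphs.append("".join(pieces))
    return "\n\n".join(paragraphs)
```

Lists, headings, tables, and formatting that comes from styles rather than from the run itself are all ignored here, which is a large part of why a real converter needs considerably more than this.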

I encourage you to send me .docx files that don't convert accurately. I'd love to make this script as robust and reliable as possible.
