Convert a 20k-word Word doc into small HTML pages with auto-generated meta tags?

I have a huge Word doc, 20,000 words long, and I would like to upload it to my blog.

However, I would like to break it up into small(ish) webpages and, if possible, auto-generate relevant keywords, title and description tags. I couldn't find a tool to do this, so I'm thinking of coding something, but I really have no idea where to begin. I write PHP/SQL. I'm thinking of breaking it up every X characters, then building the meta tags out of the most frequently occurring words. That would be pretty easy, but the doc also has quite a few images. Is there a PHP library I could use to manipulate Word docs?


OpenOffice can convert Word docs into XHTML, HTML, XML and other formats.
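If you want to script that conversion step rather than click through the GUI, a minimal sketch follows. It assumes LibreOffice (the OpenOffice fork), whose `soffice` binary accepts `--headless --convert-to`; older OpenOffice builds may need a macro instead. The function names and the `big.doc` / `out` paths are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, binary="soffice"):
    """Return the argument list for a headless Word -> HTML conversion.

    Assumes a LibreOffice-style `soffice` command line.
    """
    return [
        binary,
        "--headless",
        "--convert-to", "html",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_to_html(doc_path, out_dir, binary="soffice"):
    """Run the conversion and return the path where the HTML should land."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the converter exits non-zero.
    subprocess.run(build_convert_command(doc_path, out_dir, binary), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".html")
```

Calling `convert_to_html("big.doc", "out")` should leave the HTML, plus any extracted images, in `out/`. Since you write PHP, the same command line can be run from PHP with `exec()`.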

A while ago I wrote a PHP script that took the resulting XHTML output from large Word docs, performed XSL transformations on it - including HTMLTidy - and pumped the result into custom-built XHTML templates.

The result, surprisingly, was very good - with one caveat. Depending on how heavily your Word docs have been edited - especially with Track Changes - you may find that the occasional character drops out entirely, and you often get extra spacing.

In my case the output was legal in nature, so I had our edit team scour it and give me an honest opinion. They didn't feel good about the missing characters, but IMO a browser-based spellchecker would have picked up most of them.

So - my solution for you is to use OpenOffice to convert to XHTML, and then have your way with the output however you please. (From memory, I had to alter the conversion macro - there was a very simple typo in there that made it choke. It may have been fixed since.)
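As a rough illustration of what you could then do with the output, here is a sketch of the approach described in the question: cut the text into pages of roughly X characters and build meta tags from the most frequent words. It is not the script I wrote. It assumes you have already pulled the paragraphs out of the converted HTML as plain strings, and the stop-word list, length limits and function names are all arbitrary choices for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOPWORDS = {"that", "this", "with", "from", "have", "would", "there", "their"}


def split_into_pages(paragraphs, max_chars=3000):
    """Group paragraphs into pages, breaking only on paragraph boundaries."""
    pages, current, size = [], [], 0
    for para in paragraphs:
        # Start a new page when adding this paragraph would exceed the limit.
        if current and size + len(para) > max_chars:
            pages.append(current)
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        pages.append(current)
    return pages


def top_keywords(text, count=5):
    """Return the most frequent words longer than three letters."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if len(w) > 3 and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(count)]


def meta_tags(page):
    """Build title, description and keywords for one page of paragraphs."""
    text = " ".join(page)
    return {
        "title": page[0][:60],        # first paragraph, truncated
        "description": text[:155],    # opening text of the page
        "keywords": ", ".join(top_keywords(text)),
    }
```

Breaking on paragraph boundaries rather than at an exact character count avoids cutting a sentence (or an image tag, if you split the HTML itself) in half. Escape the values before writing them into `<meta>` attributes.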

Check my profile and email me if you want the script I wrote, and I'll mail you the source tomorrow (it's hacky, but it works!).

EDIT: I tried many other solutions. I forget the details, except that they all sucked a lot more than OpenOffice.
