
PHP Help converting diacritic characters to HTML quotes

I have a bunch of MS Word files that a client wants displayed on his web site. I've converted them to HTML using "Save as Web Page". Yes, I know that this produces lousy HTML, but the other methods I've tried lose the links to the embedded images.

For the most part, I can use PHP to clean up the display, but one item has me completely baffled: all single and double quotes are coming through as various letters with diacritics (accents), and I can't figure out how to detect them and convert them to the correct HTML entities. For example: Õ (O tilde) should be a single quote, Ò (O grave) should be an opening double quote, and Ó (O acute) should be a closing double quote. I've tried htmlentities, iconv and a bunch of other methods with no luck.


Word is a mess! For individual files, I run them through something like this: http://word2cleanhtml.com/

If this is going to be an ongoing thing, there are entire libraries dedicated to de-word-ifying Word documents for the web. Try HTML Tidy or HTML Purifier.

If you're going to be dealing with a WYSIWYG-type tool and this is ongoing, CKEditor will automatically drop Word HTML garbage. What differentiates CKEditor from TinyMCE and others is that even if the user forgets to use "Paste from Word", it still will not let the bad stuff through.

Since switching to CKEditor and Tidy, I've not had a single problem on my company's site, even though it is used by hundreds of users with varying levels of web knowledge. Before the change, it was a near-daily issue.


I suggest opening those lousy HTML files in an editor like Notepad++ and doing a search and replace across all open documents.


What's the encoding of the Word Document? You can either try to match the original encoding through PHP or change the encoding of the Word Document to something like UTF-8 and make sure your page is displayed as UTF-8 as well.
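The symptoms in the question are consistent with a file saved in the Mac Roman encoding but read as Latin-1 / Windows-1252: in Mac Roman, bytes 0xD2, 0xD3, 0xD4 and 0xD5 are the curly quotes “ ” ‘ ’, while in Latin-1 the same bytes are Ò, Ó, Ô and Õ. Below is a minimal sketch of a fix under that assumption. The function name `macRomanQuotesToEntities` is hypothetical, and the code assumes `$html` holds the raw, unconverted bytes of the file; if the text has already been re-encoded as UTF-8, the byte values will differ and this will not match.

```php
<?php
// Hypothetical helper, not part of any library.
// Assumption: $html contains the raw bytes of a Mac Roman encoded file,
// so the curly quotes are the single bytes 0xD2-0xD5.
function macRomanQuotesToEntities(string $html): string
{
    // strtr() with an array replaces each byte sequence in a single pass.
    return strtr($html, [
        "\xD2" => '&ldquo;', // opening double quote
        "\xD3" => '&rdquo;', // closing double quote
        "\xD4" => '&lsquo;', // opening single quote
        "\xD5" => '&rsquo;', // closing single quote / apostrophe
    ]);
}

// Example: raw Mac Roman bytes for “Hi” it’s
$raw = "\xD2Hi\xD3 it\xD5s";
echo macRomanQuotesToEntities($raw), "\n";
```

This only handles the four quote characters. If the whole file really is Mac Roman, converting everything in one step may be cleaner; some iconv builds accept `MACINTOSH` as a source encoding name, but that depends on the system, so check it before relying on it.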
