
How to fetch an MS Word file in its original form? Extra symbols are displayed, apparently because of bold lines etc.

I am fetching an MS Word file. The fetch itself works, but a lot of unrecognised characters now appear in the output. I think they come from formatting such as bold or coloured lines. I want the file to be fetched in its original form, with all lines of text displayed.

PERSONAL DETAILS: 
    Name                :   Deepak Narwal
    Sex                 :   Male
    Date of Birth       :   December 19, 1986
    Nationality         :   Indian 
    Languages Known :   English and Hindi



DATE:

PLACE:                              Deepak Narwal
[a long run of unrecognised characters follows here]


This is not a trivial task. The Word document format (before DOCX) is a proprietary format owned by Microsoft and very, very hard to parse.
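
The unrecognised characters appear because a pre-DOCX .doc file is a binary OLE2 compound file rather than text, so printing its raw bytes also prints the formatting and control data. As a rough sketch, you can check which kind of file you actually have by looking at its first bytes. The function name `detectWordFormat` and the file name `resume.doc` are made up for illustration:

```php
<?php
// Hypothetical helper: guess a document's container format from its first bytes.
function detectWordFormat(string $bytes): string
{
    // Legacy .doc files are OLE2 compound files with this 8-byte signature.
    if (strncmp($bytes, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8) === 0) {
        return 'doc';
    }
    // DOCX is a ZIP archive, so it starts with the ZIP local file header.
    // Other ZIP-based files match too, hence the cautious label.
    if (strncmp($bytes, "PK\x03\x04", 4) === 0) {
        return 'docx-or-zip';
    }
    // RTF is plain text that starts with "{\rtf".
    if (strncmp($bytes, '{\\rtf', 5) === 0) {
        return 'rtf';
    }
    return 'unknown';
}

// Usage (hypothetical file name): read only the first 8 bytes of the file.
// $head = file_get_contents('resume.doc', false, null, 0, 8);
// echo detectWordFormat($head);
```

If this reports `doc`, echoing the file contents directly will always produce garbage, and you need one of the conversion routes below.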

If you can influence the way the documents are created, use a different, open format that is easier to parse in PHP: Plain text (which will lose all the formatting), RTF or PDF (you won't be able to work with that in PHP but you can display it in a web browser).

If you need to extract text from old Word documents and parse the text in PHP (instead of just displaying it) the following options come to mind:

  • Antiword is a free cross-platform (Windows and Linux) Word reader that extracts plain text from Word documents. (This will destroy any formatting.) I have worked with it; it's fiddly to set up for non-English character sets but works o.k. I don't know whether it handles the Word 2003 DOC format, though.

  • If you're on a Windows server with Word installed, the easiest way is probably connecting to Word via COM as explained in this article. It should be possible to convert a Word document to a plain text file that way. I've never tried this, and the COM interface is said not to be the most stable, so you need to test it thoroughly if it's for heavy-duty use.
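
For the Antiword route, a minimal sketch of calling it from PHP could look like the following. It assumes `antiword` is installed and on the PATH, and that its `UTF-8.txt` mapping file is available (selected with `-m`). The helper names `antiwordCommand` and `docToText` are made up for illustration:

```php
<?php
// Hypothetical helper: build the antiword command line for a .doc file.
// Assumption: antiword is on PATH; "-m UTF-8.txt" selects its UTF-8 mapping file.
function antiwordCommand(string $docPath): string
{
    // escapeshellarg() quotes the path so spaces and shell metacharacters are safe.
    return 'antiword -m UTF-8.txt ' . escapeshellarg($docPath);
}

// Run antiword and return the extracted plain text, or null if it failed.
function docToText(string $docPath): ?string
{
    $output = [];
    $status = 0;
    // exec() collects stdout line by line and reports the exit status.
    exec(antiwordCommand($docPath), $output, $status);
    return $status === 0 ? implode("\n", $output) : null;
}
```

When showing the result in a web page, something like `echo '<pre>' . htmlspecialchars($text) . '</pre>';` keeps the column alignment of the extracted text, but bold and colour are lost either way.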
