
How to deal with Unicode character encoding issues while converting documents from PDF to text

I am trying to extract text from a PDF that contains Hindi (Unicode) text. The utility I am using for extraction is Apache PDFBox (http://pdfbox.apache.org/). The extractor extracts the text, but the result is not recognizable. I tried switching between many encodings and fonts, but the expected text is still not recognized. Here is an example. Say the text in the PDF is: पवार

What it looks like after extraction is: ̄Ö3⁄4ÖÖ ̧ü

Are there any suggestions?
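For reference, the extraction uses the standard PDFBox text stripper, roughly along these lines (a minimal sketch assuming PDFBox 2.x, since the exact code isn't shown above; the file name is a placeholder):

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractText {
    public static void main(String[] args) throws IOException {
        // Load the PDF and run the standard PDFBox text extractor.
        try (PDDocument document = PDDocument.load(new File("hindi.pdf"))) {
            String text = new PDFTextStripper().getText(document);
            System.out.println(text);
        }
    }
}
```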


PDF is, at its heart, a print format, and thus records text as a series of positioned glyphs, not as actual text. It was never originally intended as a digital archive format, and that still shows in many documents. With complex scripts, such as Arabic or the Indic scripts, which require glyph substitution, ligation, and reordering, you basically often get a mess. What you usually get are the glyph IDs used in the embedded fonts, which need not bear any resemblance to Unicode or to an actual text encoding (fonts represent glyphs; some of these may be mapped to Unicode code points, but others exist only for font-internal use, such as contextual glyph variants or ligatures). You can see the same with PDFs produced by LaTeX, especially with non-ASCII characters and math.
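You can check whether this is what's happening in your file by probing the embedded fonts for Unicode mappings. Here is a rough diagnostic sketch with PDFBox 2.x (the probed code range 0–255 is arbitrary, just for illustration; the file name is a placeholder):

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

public class InspectFonts {
    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File("hindi.pdf"))) {
            for (PDPage page : document.getPages()) {
                PDResources resources = page.getResources();
                if (resources == null) {
                    continue;
                }
                for (COSName name : resources.getFontNames()) {
                    PDFont font = resources.getFont(name);
                    // Crude probe: count how many of the first 256 glyph
                    // codes map to Unicode. Composite fonts use multi-byte
                    // codes, so this is only indicative, not exhaustive.
                    int mapped = 0;
                    for (int code = 0; code < 256; code++) {
                        if (font.toUnicode(code) != null) {
                            mapped++;
                        }
                    }
                    System.out.printf("%s (%s): %d of 256 codes map to Unicode%n",
                            name.getName(), font.getName(), mapped);
                }
            }
        }
    }
}
```

Fonts where almost no codes map to Unicode are exactly the ones that produce gibberish like the output you posted.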

PDF also has facilities to embed the text as actual text alongside the visual representation, but whether that happens is solely at the discretion of the generating application. I have heard that Word tries very hard to retain that information when producing PDFs, but many PDF generators do not (extraction usually works somewhat for Latin script, which is probably why almost no one bothers).

I think your best bet, if the PDF doesn't have the plain text available, is to run OCR on the PDF rendered as an image.
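For example, you can render each page to an image with PDFBox and feed it to Tesseract, which has a Hindi language pack. A sketch assuming the tess4j wrapper and the "hin" traineddata are installed (the file name is a placeholder):

```java
import java.awt.image.BufferedImage;
import java.io.File;

import net.sourceforge.tess4j.Tesseract;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class OcrPdf {
    public static void main(String[] args) throws Exception {
        Tesseract tesseract = new Tesseract();
        // Assumes the Hindi traineddata ("hin") is installed; point
        // setDatapath() at your tessdata directory if it isn't found.
        tesseract.setLanguage("hin");

        try (PDDocument document = PDDocument.load(new File("hindi.pdf"))) {
            PDFRenderer renderer = new PDFRenderer(document);
            for (int i = 0; i < document.getNumberOfPages(); i++) {
                // Render at 300 DPI, a common resolution for OCR input.
                BufferedImage image = renderer.renderImageWithDPI(i, 300);
                System.out.println(tesseract.doOCR(image));
            }
        }
    }
}
```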
