
How to deal with Unicode character encoding issues while converting documents from PDF to text

I am trying to extract text from a PDF that contains Hindi (Unicode) text. The utility I am using for extraction is Apache PDFBox (http://pdfbox.apache.org/). The extractor extracts the text, but the result is not recognizable. I tried switching between many encodings and fonts, but the expected text is still not recognized. Here is an example. Say the text in the PDF is: पवार

What it looks like after extraction is: ̄Ö3⁄4ÖÖ ̧ü

Are there any suggestions?
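For reference, the extraction uses the standard PDFBox text stripper, roughly along these lines (a minimal sketch assuming PDFBox 2.x, since the exact code isn't shown above; the file name is a placeholder):

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractText {
    public static void main(String[] args) throws IOException {
        // Load the PDF and run the standard PDFBox text extractor.
        try (PDDocument document = PDDocument.load(new File("hindi.pdf"))) {
            String text = new PDFTextStripper().getText(document);
            System.out.println(text);
        }
    }
}
```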


PDF is, at its heart, a print format, and thus records text as a series of positioned glyphs, not as actual text. It was never originally intended as a digital archive format, and that still shows in many documents. With complex scripts, such as Arabic or the Indic scripts, which require glyph substitution, ligation, and reordering, you basically often get a mess. What you usually get are the glyph IDs used in the embedded fonts, which need not bear any resemblance to Unicode or to an actual text encoding (fonts represent glyphs; some of these may be mapped to Unicode code points, but others exist only for font-internal use, such as contextual glyph variants or ligatures). You can see the same with PDFs produced by LaTeX, especially with non-ASCII characters and math.
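You can check whether this is what's happening in your file by probing the embedded fonts for Unicode mappings. Here is a rough diagnostic sketch with PDFBox 2.x (the probed code range 0–255 is arbitrary, just for illustration; the file name is a placeholder):

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

public class InspectFonts {
    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File("hindi.pdf"))) {
            for (PDPage page : document.getPages()) {
                PDResources resources = page.getResources();
                if (resources == null) {
                    continue;
                }
                for (COSName name : resources.getFontNames()) {
                    PDFont font = resources.getFont(name);
                    // Crude probe: count how many of the first 256 glyph
                    // codes map to Unicode. Composite fonts use multi-byte
                    // codes, so this is only indicative, not exhaustive.
                    int mapped = 0;
                    for (int code = 0; code < 256; code++) {
                        if (font.toUnicode(code) != null) {
                            mapped++;
                        }
                    }
                    System.out.printf("%s (%s): %d of 256 codes map to Unicode%n",
                            name.getName(), font.getName(), mapped);
                }
            }
        }
    }
}
```

Fonts where almost no codes map to Unicode are exactly the ones that produce gibberish like the output you posted.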

PDF also has facilities to embed the text as actual text alongside the visual representation, but whether that happens is solely at the discretion of the generating application. I have heard that Word tries very hard to retain that information when producing PDFs, but many PDF generators do not (extraction usually works somewhat for Latin script, which is probably why almost no one bothers).

I think your best bet, if the PDF doesn't have the plain text available, is to run OCR on the PDF rendered as an image.
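For example, you can render each page to an image with PDFBox and feed it to Tesseract, which has a Hindi language pack. A sketch assuming the tess4j wrapper and the "hin" traineddata are installed (the file name is a placeholder):

```java
import java.awt.image.BufferedImage;
import java.io.File;

import net.sourceforge.tess4j.Tesseract;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class OcrPdf {
    public static void main(String[] args) throws Exception {
        Tesseract tesseract = new Tesseract();
        // Assumes the Hindi traineddata ("hin") is installed; point
        // setDatapath() at your tessdata directory if it isn't found.
        tesseract.setLanguage("hin");

        try (PDDocument document = PDDocument.load(new File("hindi.pdf"))) {
            PDFRenderer renderer = new PDFRenderer(document);
            for (int i = 0; i < document.getNumberOfPages(); i++) {
                // Render at 300 DPI, a common resolution for OCR input.
                BufferedImage image = renderer.renderImageWithDPI(i, 300);
                System.out.println(tesseract.doOCR(image));
            }
        }
    }
}
```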
