PDFtotext - whitespace showing as aacute on commandline

2023-02-26 08:09

I am extracting text using Python from a text file created from a PDF using pdftotext. It is one of 2000 files, and in this particular one a line of keywords ends in EU. The remainder of the line is blank to the naked eye and so is the following line.

The program normally strips off any trailing blanks at the end of a line and ignores the subsequent blank line.

In this instance, it is keeping the whitespace, which is visible after "EU. " when the output is written to a text file, and similarly in HTML (Simile Exhibit).

I also printed it to the command line, and there I see a string of a-acute characters.

I thought the obvious way to deal with this was to search for the a-acute and replace it. I've tried to do that with a compile statement, and I've played with permutations of decoding the incoming text.

Oddly though, when I print "\255" I don't get an aacute, I get an o grave.

Given this odd combination of errors, it seems likely that I have misunderstood something fundamental. Any tips on how to begin unravelling this?

Many thanks.

The first tip is not to print wildly to all possible output mechanisms using various unstated encodings. Find out exactly what you have got. Do this:

    print repr(the_line_with_the_problem)    # Python 2.x
    print(ascii(the_line_with_the_problem))  # Python 3.x

and edit your question and copy/paste the result.
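If the escaped output is hard to read, it can help to list each non-ASCII character with its code point and Unicode name. The following is only a minimal Python 3 sketch: `describe_chars` is a hypothetical helper, and `sample` is a made-up stand-in for the real problem line, which has not been shown yet.

```python
import unicodedata


def describe_chars(text):
    """Return (code point, Unicode name) pairs for every non-ASCII character."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 127
    ]


# Hypothetical sample: a keyword line ending in two no-break spaces.
sample = "keywords EU.\u00a0\u00a0"

print(ascii(sample))
for code_point, name in describe_chars(sample):
    print(code_point, name)
```

Run it against the actual line instead of `sample`; the names it reports say exactly which characters are in the data, independent of how any terminal chooses to draw them.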

Second tip: When asking for help, give information about your environment:

What version of Python? What version of what operating system?

Also show locale-related info; the following example is from my computer running Python 2.7 in a Windows 7 Command Prompt window:

    >>> import sys, locale
    >>> sys.getdefaultencoding()
    'ascii'
    >>> sys.stdout.encoding
    'cp850'
    >>> locale.getdefaultlocale()
    ('en_AU', 'cp1252')
    >>>
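
On Python 3, much the same information as in the session above can be collected in one go. This is a sketch rather than a canonical recipe: `environment_report` is a hypothetical helper name, and it uses `locale.getpreferredencoding` in place of `locale.getdefaultlocale`, which newer Python 3 releases deprecate.

```python
import locale
import platform
import sys


def environment_report():
    """Collect the version and encoding details worth quoting in a question."""
    return {
        "python": platform.python_version(),
        "os": platform.platform(),
        "default_encoding": sys.getdefaultencoding(),
        # stdout may be redirected or replaced, so look the attribute up defensively.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


for key, value in environment_report().items():
    print("%s: %s" % (key, value))
```

The values will differ from machine to machine; a mismatch between the stdout encoding and the encoding of the data is the thing to look for.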

Third tip: Don't use your own jargon ... the concepts "Simile Exhibit", "printed to the command line", and "compile statement" need explanation.

What is the relevance of "\255"? Where did you get that from?

Wild guesses while waiting for some facts to emerge:

(1) The offending character is U+00A0 NO-BREAK SPACE (NBSP), which appears in your text as "\xA0". When sent to stdout in a Western European locale on Windows, a Command Prompt window would treat that byte as cp850 and so display it as a-acute. How this could be transmogrified into o-grave is a mystery.

(2) "\255" == "\xAD" implies that the offending character is U+00AD SOFT HYPHEN, but why this would be seen as o-grave is a mystery. It's also not "whitespace": it shouldn't be shown at all, and if it is shown, it should be as a hyphen/minus-sign, not a space.
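
Both guesses can be checked directly. The Python 3 sketch below confirms that the octal escape "\255" is 0xAD rather than 0xA0, shows how cp850 renders each byte, and includes one possible cleanup. `clean_line` is a hypothetical helper, and it assumes the text has already been decoded to a Unicode string; on a Python 2 byte string, `strip()` typically leaves a 0xA0 byte in place, which may be why the trailing "blanks" survived.

```python
import unicodedata

NBSP = "\u00a0"         # U+00A0 NO-BREAK SPACE
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN

# "\255" is an octal escape: 0o255 == 0xAD == 173, so it is not the NBSP.
assert "\255" == "\xad" == SOFT_HYPHEN

# Byte 0xA0 interpreted as cp850 is a-acute, matching guess (1).
assert b"\xa0".decode("cp850") == "\u00e1"

# Byte 0xAD interpreted as cp850 is an inverted exclamation mark,
# so cp850 alone does not explain an o-grave either.
assert b"\xad".decode("cp850") == "\u00a1"

# NBSP counts as whitespace for str methods; the soft hyphen does not.
assert NBSP.isspace()
assert not SOFT_HYPHEN.isspace()
assert unicodedata.category(SOFT_HYPHEN) == "Cf"  # a format character


def clean_line(line):
    """Remove every soft hyphen, then strip trailing whitespace (NBSP included)."""
    return line.replace(SOFT_HYPHEN, "").rstrip()


print(ascii(clean_line("keywords EU.\u00a0\u00a0\n")))
```

Whether removing soft hyphens everywhere in the line is appropriate depends on the data; if they only matter at the end of a line, a narrower rule would be safer.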
