PDFtotext - whitespace showing as aacute on commandline
I am extracting text using Python from a text file created from a PDF using pdftotext. It is one of 2000 files and in this particular one, a line of keywords ends in EU. The remainder of the line is blank to the naked eye and so is the following line.
The program normally strips off any trailing blanks at the end of a line and ignores the subsequent blank line.
In this instance, it is saving the whitespace, which can be seen after "EU. " when it is printed out to a text file, and similarly in HTML (Simile Exhibit).
I also printed to the command line and here I see a string of aacute. [?]
I thought the obvious way to deal with this was to search for and replace the aacute. I've tried to do that with a compile statement and I've played with permutations of decoding the incoming text.
Oddly though, when I print "\255" I don't get an aacute, I get an o grave.
With this odd combination of errors, it seems likely that I have misunderstood something fundamental. Any tips on how to begin unravelling this?
Many thanks.
The first tip is not to print wildly to all possible output mechanisms using various unstated encodings. Find out exactly what you have got. Do this:
print repr(the_line_with_the_problem) # Python 2.x
print(ascii(the_line_with_the_problem)) # Python 3.x
and edit your question and copy/paste the result.
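To illustrate why this helps, here is a minimal Python 3 sketch. The sample line is a hypothetical stand-in (it assumes the invisible characters are no-break spaces, which is only a guess at this point); the point is that ascii() shows the code points where a plain print would show apparent blanks.

```python
# Hypothetical stand-in for the problem line: "EU." followed by two
# U+00A0 NO-BREAK SPACE characters and a newline. This is an assumption,
# not the asker's actual data.
the_line_with_the_problem = "EU.\u00a0\u00a0\n"

# A plain print() would show what looks like ordinary blanks (or whatever
# the terminal's encoding makes of them). ascii() escapes every non-ASCII
# character, so nothing depends on the output encoding.
revealed = ascii(the_line_with_the_problem)
print(revealed)  # prints 'EU.\xa0\xa0\n'
```

The escaped form can be pasted into a question without being mangled by any terminal, editor, or browser along the way.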
Second tip: When asking for help, give information about your environment:
What version of Python? What version of what operating system?
Also show locale-related info; the following example is from my computer running Python 2.7 in a Windows 7 Command Prompt window:
>>> import sys, locale
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'cp850'
>>> locale.getdefaultlocale()
('en_AU', 'cp1252')
>>>
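On Python 3, roughly the same information can be collected with a short script. This is only a sketch: the function name encoding_report is made up for illustration, and it uses locale.getpreferredencoding() rather than locale.getdefaultlocale(), which is deprecated in recent Python 3 releases.

```python
import locale
import sys


def encoding_report():
    """Collect the encoding-related settings worth quoting in a question."""
    return {
        "python": sys.version.split()[0],
        "platform": sys.platform,
        "default_encoding": sys.getdefaultencoding(),
        # stdout may have been replaced by an object without an encoding.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # False: report the setting without calling setlocale() first.
        "preferred_encoding": locale.getpreferredencoding(False),
    }


for name, value in encoding_report().items():
    print(f"{name}: {value}")
```

The values will differ from machine to machine, which is exactly why they are worth including in a question.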
Third tip: Don't use your own jargon ... the concepts "Simile Exhibit", "printed to the command line", and "compile statement" need explanation.
What is the relevance of "\255"? Where did you get that from?
Wild guesses while waiting for some facts to emerge:
(1) The offending character is U+00A0 NO-BREAK SPACE, aka NBSP, which appears in your text as "\xA0" and, when sent to stdout in a Western European locale on Windows using a Command Prompt window, would be treated as being encoded in cp850 and thus appear as a-acute. How this could be transmogrified into o-grave is a mystery.
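Guess (1) can be checked with a few lines of Python 3. This is a sketch under the assumption that the file really contains the byte 0xA0 after "EU."; the sample bytes are invented for illustration.

```python
# Hypothetical raw bytes: "EU." followed by two 0xA0 bytes.
raw = b"EU.\xa0\xa0"

# Read as cp1252 (a typical Windows "ANSI" code page), 0xA0 is
# U+00A0 NO-BREAK SPACE, which looks blank.
as_cp1252 = raw.decode("cp1252")

# Read as cp850 (a typical Western European console code page), the very
# same byte is U+00E1 LATIN SMALL LETTER A WITH ACUTE.
as_cp850 = raw.decode("cp850")

print(ascii(as_cp1252))  # prints 'EU.\xa0\xa0'
print(ascii(as_cp850))   # prints 'EU.\xe1\xe1'
```

If this is what is happening, the data is the same in every output; only the interpretation of the byte changes between the text file, the HTML, and the console.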
(2) "\255" == "\xAD" implies the offending character is U+00AD SOFT HYPHEN, but why this would be seen as o-grave is a mystery, and it's not "whitespace"; it shouldn't be shown at all, and if it is shown it should be as a hyphen/minus-sign, not a space.
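The difference between the two candidate characters can be sketched in Python 3 as follows. The sample strings are hypothetical; the behaviour shown is that of Python 3 text strings, and Python 2 byte strings may behave differently.

```python
import unicodedata

# "\255" is an octal escape: 0o255 == 173 == 0xAD.
octal_char = "\255"
same_as_hex = octal_char == "\xad"

# Names and general categories of the two candidates.
nbsp_name = unicodedata.name("\xa0")          # NO-BREAK SPACE
shy_name = unicodedata.name("\xad")           # SOFT HYPHEN
nbsp_category = unicodedata.category("\xa0")  # Zs: space separator
shy_category = unicodedata.category("\xad")   # Cf: format character

# str.rstrip() with no argument removes Unicode whitespace, so it removes
# trailing NBSPs but leaves a trailing soft hyphen in place.
stripped_nbsp = "EU.\xa0\xa0".rstrip()
stripped_shy = "EU.\xad".rstrip()

print(same_as_hex, nbsp_name, shy_name)
print(ascii(stripped_nbsp), ascii(stripped_shy))
```

So once the line has been decoded to text, a plain rstrip() would deal with guess (1), whereas guess (2) would need the soft hyphen removed explicitly, for example with rstrip("\xad").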