Extract text from a PDF and save it to a database - preserving spacing
I have a PDF document containing only text that needs to be saved into a varchar column in MSSQL. The first catch is that the spacing of the text in the PDF needs to be preserved as well, which can't be done simply by copy-pasting from the PDF into SSMS.
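To illustrate what "preserving spacing" involves, here is a minimal Python sketch. It assumes some PDF extraction library has already produced, for each line, a list of `(x, text)` fragments with `x` in points; that input shape, the `layout_line` name and the `char_width` value are all hypothetical, not taken from any particular tool. It places each fragment on a fixed character grid and pads the gaps with spaces.

```python
def layout_line(fragments, char_width):
    """Rebuild one text line from (x, text) fragments, padding with spaces.

    fragments  -- hypothetical input: (x position in points, text) pairs
    char_width -- assumed average character width in points
    """
    line = ""
    for x, text in sorted(fragments):
        # Map the horizontal position onto a character column.
        col = round(x / char_width)
        # Keep at least one space between fragments that would otherwise touch.
        if line and col <= len(line):
            col = len(line) + 1
        line = line.ljust(col) + text
    return line


if __name__ == "__main__":
    print(repr(layout_line([(0, "Name"), (60, "Qty")], 6)))
```

A character grid like this only lines up exactly in a monospaced font; in a proportional font it is a starting point that still needs adjusting.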
Okay, so I need an application to read the PDF as text, while preserving spacing. But now the second catch comes in: the PDF is rendered in Helvetica font, but the text saved into the DB will be displayed in Arial on a Crystal Report (Crystal 8... bleh), and when displayed, it needs to look like the PDF (i.e. same alignment) as far as possible.
The solution that I've proposed is to convert the PDF to a vector image, save the resulting byte stream into the DB, and pull the bytes in via Crystal. Unfortunately, due to time constraints this can't be implemented now, so I need a quick-and-dirty solution.
Essentially, once I've got the Helvetica version from the PDF, I have to muck around with the spacing to convert it to look correct in Arial. I need a tool that can do this for me, as I don't have the time to write one - any suggestions?
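For what it's worth, the kind of spacing adjustment I mean could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the width table is made-up placeholder data (not real Arial or Helvetica metrics), widths are in arbitrary font units, and `text_width` and `pad_to` are hypothetical helper names. Given the position where the next column should start, it works out how many spaces to append in the target font.

```python
def text_width(text, widths, default):
    """Sum the advance widths of the characters in text."""
    return sum(widths.get(ch, default) for ch in text)


def pad_to(text, target_x, widths, default):
    """Append spaces so the text ends as close as possible to target_x."""
    space = widths.get(" ", default)
    gap = target_x - text_width(text, widths, default)
    # Whole spaces only, so the result can be off by up to half a space width.
    count = max(0, round(gap / space))
    return text + " " * count


if __name__ == "__main__":
    # Placeholder widths for illustration only, NOT real font metrics.
    target_font = {" ": 250, "i": 200, "W": 900}
    print(repr(pad_to("Wi", 2100, target_font, 500)))
```

Since only whole spaces can be inserted, the alignment is approximate at best, which is part of why I consider this quick-and-dirty.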
Does your version of Crystal handle dynamic image locations? If so, you could save an image of the PDF (I'm sure there's a utility for that somewhere), and in your Crystal Report, create an image object with the image location set to whatever PDF you want.
I'm afraid that this is a user-education problem: output in Arial font is spaced differently to output in Helvetica font. This needs to be explained to the users.
A reference to Rathergate - http://en.wikipedia.org/wiki/Rathergate - may help convince them; essentially, Dan Rather's career was ended because he didn't understand the significance of character spacing in different fonts. (/over-simplification)
An alternative might be to use a font editor, to save a version of Arial font that has Helvetica spacing properties, then use this new font in your report - this really is a kludge, it will look terrible and may well violate the font's copyright (presumably Microsoft-owned). I really wouldn't recommend it.