Extracting text from PDF document - C# [duplicate]
Is there a reliable way to extract text from PDF? The first thought that comes to mind is that PDF may have multiple columns and the extraction mechanism needs to know the logical structure somehow. I understand that some PDF docs are "tagged" but I'd need to support pretty much any PDF document.
Any third party components to the rescue here?
Please see: Extracting text from PDFs in C#
Some PDFs are scans, so OCR would be required (not easy, to say the least).
Some PDFs have compressed content streams; others (more rarely) are stored uncompressed.
The PDF file format itself is well documented, but extracting the right "structure" from anything but a simple one-column document is a tall order. Internally, a PDF page looks roughly the way HTML would if every line of text were placed in its own DIV with absolute positioning.
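To make the absolute-positioning point concrete, here is a minimal sketch of what an extractor has to do once it has positioned text fragments in hand. The `Fragment` record, the `ReadingOrder.ToLines` helper, and the `yTolerance` parameter are all hypothetical names made up for illustration; they are not part of any PDF library, and a real library would supply its own fragment type. The sketch assumes PDF-style coordinates where Y grows upward.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical fragment type (an assumption, not a library API):
// a run of text plus the position of its lower-left corner.
public record Fragment(string Text, double X, double Y);

public static class ReadingOrder
{
    // Groups fragments into lines by Y coordinate (within a tolerance),
    // orders lines top to bottom (Y grows upward in PDF space),
    // and orders the fragments within each line left to right.
    public static List<string> ToLines(IEnumerable<Fragment> fragments, double yTolerance = 2.0)
    {
        var lines = new List<List<Fragment>>();
        foreach (var f in fragments.OrderByDescending(f => f.Y))
        {
            var current = lines.Count > 0 ? lines[^1] : null;
            if (current != null && Math.Abs(current[0].Y - f.Y) <= yTolerance)
                current.Add(f);                        // same baseline: same line
            else
                lines.Add(new List<Fragment> { f });   // new baseline: new line
        }
        return lines
            .Select(line => string.Join(" ", line.OrderBy(f => f.X).Select(f => f.Text)))
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // Fragments arrive in arbitrary order, as they may in a content stream.
        var fragments = new[]
        {
            new Fragment("world", 60, 700),
            new Fragment("Second", 10, 686),
            new Fragment("Hello", 10, 700.5),
            new Fragment("line", 60, 686),
        };
        foreach (var line in ReadingOrder.ToLines(fragments))
            Console.WriteLine(line);
        // Prints:
        // Hello world
        // Second line
    }
}
```

This works for a single column only. On a two-column page, fragments from both columns share the same Y coordinates, so this naive grouping would splice the left and right columns into one line, which is exactly why the logical structure is hard to recover.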