Extracting text from a PDF file
I'm using PDFBox in a C# .NET project, and I'm getting a "TypeInitializationException" (The type initializer for 'java.lang.Throwable' threw an exception.) when executing the following block of code:
    FileStream stream = new FileStream(@"C:\1.pdf", FileMode.Open);
    //retrieve the pdf bytes from the stream.
    byte[] pdfbytes = new byte[65000];
    stream.Read(pdfbytes, 0, 65000);
    //get the pdf file bytes.
    allbytes = pdfbytes;
    //create a stream from the file bytes.
    java.io.InputStream ins = new java.io.ByteArrayInputStream(allbytes);
    string txt;
    //load the doc
    PDDocument doc = PDDocument.load(ins);
    PDFTextStripper stripper = new PDFTextStripper();
    //retrieve the pdf doc's text
    txt = stripper.getText(doc);
    doc.close();
The exception occurs at this statement:
    PDDocument doc = PDDocument.load(ins);
What can I do to solve this?
This is the stack trace:
    at java.lang.Throwable.__<map>(Exception , Boolean )
    at org.pdfbox.pdfparser.PDFParser.parse()
    at org.pdfbox.pdmodel.PDDocument.load(InputStream input, RandomAccess scratchFile)
    at org.pdfbox.pdmodel.PDDocument.load(InputStream input)
    at At.At.ExtractTextFromPDF(InputStream fileStream) in C:\Users\Administrator\Documents\Visual Studio 2008\Projects\AtProject\Att\At.cs:line 61
Inner exception of the InnerException:
- InnerException {"Could not load file or assembly 'IKVM.Runtime, Version=0.30.0.0, Culture=neutral, PublicKeyToken=13235d27fcbfff58' or one of its dependencies. The system cannot find the file specified.":"IKVM.Runtime, Version=0.30.0.0, Culture=neutral, PublicKeyToken=13235d27fcbfff58"} System.Exception {System.IO.FileNotFoundException}
OK, I solved the previous problem by copying some of PDFBox's .dll files to my bin folder, but now I'm getting this error: expected='/' actual='.'--1 org.pdfbox.io.PushBackInputStream@283d742
Are there any alternatives to PDFBox? Is there another reliable library I can use to extract text from PDF files?
It looks like you are missing some of the libraries PDFBox depends on. You need:
- IKVM.GNU.Classpath.dll
- PDFBox-X.X.X.dll
- FontBox-X.X.X-dev.dll
- IKVM.Runtime.dll
See the article "Read from a PDF file using C#". A similar problem is discussed in the comments on that article.
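Separately from the missing assemblies, the code in the question reads at most 65000 bytes into a fixed buffer, so any larger PDF reaches the parser truncated (and Stream.Read may return fewer bytes than requested in any case). A PDF's cross-reference table and trailer sit at the end of the file, so a truncated copy generally cannot be parsed. Whether this explains the second error in the question is an assumption; the thread does not confirm it. A minimal sketch of reading the whole file, where the class name PdfBytes and the helper ReadAll are made up for illustration:

```csharp
using System;
using System.IO;

static class PdfBytes
{
    // Hypothetical helper: reads the entire file, whatever its size,
    // instead of filling a fixed 65000-byte buffer.
    public static byte[] ReadAll(string path)
    {
        return File.ReadAllBytes(path);
    }

    static void Main()
    {
        // Path taken from the question; replace it with your own file.
        byte[] allbytes = ReadAll(@"C:\1.pdf");
        Console.WriteLine("Read {0} bytes", allbytes.Length);

        // The array can then be wrapped exactly as in the question:
        // java.io.InputStream ins = new java.io.ByteArrayInputStream(allbytes);
    }
}
```

File.ReadAllBytes also opens and closes the file itself, so no FileStream is left open afterwards.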
I found the versions of the DLL files were the culprits. Go to http://www.netlikon.de/docs/PDFBox-0.7.2/bin/?C=M;O=A and download the following files:
- IKVM.Runtime.dll
- IKVM.GNU.Classpath.dll
- PDFBox-0.7.2.dll
Then copy them into the root of your Visual Studio project. Right-click the project, choose Add Reference, then browse to the three files and add them.
Finally, here's the code I used to extract the text from the PDF:
C#
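    using System;
    using org.pdfbox.pdmodel;
    using org.pdfbox.util;

    private static string TransformPdfToText(string SourceFile)
    {
        string content = "";
        PDFTextStripper stripper = new PDFTextStripper();
        // Load the document directly; there is no need to create
        // and close an empty PDDocument first.
        PDDocument doc = PDDocument.load(SourceFile);
        try
        {
            content = stripper.getText(doc);
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
        finally
        {
            // Close the document exactly once, whether or not extraction succeeded.
            doc.close();
        }
        return content;
    }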
Visual Basic
    Imports org.pdfbox.pdmodel
    Imports org.pdfbox.util

    ' LogFile is the author's own logging helper; it is not shown here.
    Private Function parseUsingPDFBox(ByVal filename As String) As String
        LogFile(" Attempting to parse file: " & filename)
        Dim content As String = "empty"
        Dim stripper As PDFTextStripper = New PDFTextStripper()
        ' Load the document directly; no empty PDDocument is needed first.
        Dim doc As PDDocument = PDDocument.load(filename)
        Try
            content = stripper.getText(doc)
        Catch ex As Exception
            LogFile(" Error parsing file: " & filename & vbCrLf & ex.Message)
        Finally
            ' Close the document exactly once.
            doc.close()
        End Try
        Return content
    End Function
I had a similar problem, but with Visual Basic in Visual Studio. The missing DLL was "commons-logging.dll"; after adding it to the bin directory, everything worked fine.
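Since every fix in this thread comes down to an assembly missing from the bin folder, a quick check like the one below can show which file is absent before PDFBox is called at all. It is only a sketch: the file names are the ones mentioned in the answers above and will differ for other PDFBox or IKVM versions, and the class name AssemblyCheck is made up for illustration.

```csharp
using System;
using System.IO;

static class AssemblyCheck
{
    static void Main()
    {
        // File names as given in this thread; adjust them to the
        // versions you actually downloaded.
        string[] required =
        {
            "IKVM.Runtime.dll",
            "IKVM.GNU.Classpath.dll",
            "PDFBox-0.7.2.dll",
            "commons-logging.dll"
        };

        // Base directory of the running application
        // (the bin folder for a typical desktop project).
        string binDir = AppDomain.CurrentDomain.BaseDirectory;

        foreach (string name in required)
        {
            string path = Path.Combine(binDir, name);
            Console.WriteLine("{0}: {1}", name, File.Exists(path) ? "found" : "MISSING");
        }
    }
}
```

A file reported as missing here is a likely candidate for the "Could not load file or assembly" inner exception shown in the question.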