Error while parsing Binary Files... (mostly PDF)

2023-04-06 09:06 Q&A author:

I am trying to parse PDF files with Apache Tika, passing the binary data in through a ByteArrayInputStream. Some PDF files parse fine, but others now fail with an error. Earlier I was able to parse the same PDF files with Tika; the errors started only after I switched to ByteArrayInputStream, so I suspect a problem with the byte array. This is the error I am getting:

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@652489c0

And this is my code...

if (page.isBinary()) {
   handleBinary(page, curURL);
}

public int handleBinary(Page page, WebURL curURL) {
    try {
          binaryParser.parse(page.getBinaryData());
          page.setText(binaryParser.getText());
          handleMetaData(page, binaryParser.getMetaData());


          //System.out.println(" pdf url " +page.getWebURL().getURL());
          //System.out.println("Text" +page.getText());
    } catch (Exception e) {
          // TODO: handle exception
    }
          return PROCESS_OK;
}

        public class BinaryParser {

            private String text;
            private Map<String, String> metaData;

            private Tika tika;

            public BinaryParser() {
                tika = new Tika();
            }

            public void parse(byte[] data) {
                InputStream is = null;
                try {
                    is = new ByteArrayInputStream(data);
                    text = null;
                    Metadata md = new Metadata();
                    metaData = new HashMap<String, String>();
                    text = tika.parseToString(is, md).trim();
                    processMetaData(md);
                } catch (Exception e) {
                    e.printStackTrace();
                } finally {
                    IOUtils.closeQuietly(is);
                }
            }

            public String getText() {
                return text;
            }

            public void setText(String text) {
                this.text = text;
            }


            private void processMetaData(Metadata md){
                if ((getMetaData() == null) || (!getMetaData().isEmpty())) {
                    setMetaData(new HashMap<String, String>());
                }
                for (String name : md.names()){
                    getMetaData().put(name.toLowerCase(), md.get(name));
                }
            }

            public Map<String, String> getMetaData() {
                return metaData;
            }

            public void setMetaData(Map<String, String> metaData) {
                this.metaData = metaData;
            }

        }

    public class Page {

        private WebURL url;

        private String html;

        // Data for textual content
        private String text;

        private String title;

        private String keywords;
        private String authors;
        private String description;
        private String contentType;
        private String contentEncoding;

        private byte[] binaryData;

        private List<WebURL> urls;

        private ByteBuffer bBuf;

        private final static String defaultEncoding = Configurations
                .getStringProperty("crawler.default_encoding", "UTF-8");

        public boolean load(final InputStream in, final int totalsize,
                final boolean isBinary) {
            if (totalsize > 0) {
                this.bBuf = ByteBuffer.allocate(totalsize + 1024);
            } else {
                this.bBuf = ByteBuffer.allocate(PageFetcher.MAX_DOWNLOAD_SIZE);
            }
            final byte[] b = new byte[1024];
            int len;
            double finished = 0;
            try {
                while ((len = in.read(b)) != -1) {
                    if (finished + b.length > this.bBuf.capacity()) {
                        break;
                    }
                    this.bBuf.put(b, 0, len);
                    finished += len;
                }
            } catch (final BufferOverflowException boe) {
                System.out.println("Page size exceeds maximum allowed.");
                return false;
            } catch (final Exception e) {
                System.err.println(e.getMessage());
                return false;
            }

            this.bBuf.flip();
            if (isBinary) {
                binaryData = new byte[bBuf.limit()];
                bBuf.get(binaryData);
            } else {
                this.html = "";
                this.html += Charset.forName(defaultEncoding).decode(this.bBuf);
                this.bBuf.clear();
                if (this.html.length() == 0) {
                    return false;
                }
            }
            return true;
        }
    public boolean isBinary() {
        return binaryData != null;
    }

    public byte[] getBinaryData() {
        return binaryData;
    }

Any suggestions as to what I am doing wrong?

UPDATE: After upgrading to PDFBox 1.6.0, I started getting this error for some PDFs:

Parsing Error, Skipping Object
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@70dbdc4b
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:439)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1088)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1053)

And for some PDFs, this error:

 Did not found XRef object at specified startxref position 0
Invalid dictionary, found: '' but expected: '/'
 WARN [Crawler 2] Did not found XRef object at specified startxref position 0
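
Both messages are consistent with a PDF that was cut short or is otherwise malformed, and the Page.load method above stops reading without any error once its buffer is full, so a truncated byte array is one possibility worth ruling out. The following is a minimal sketch of such a check, not a validator: PdfSanityCheck and its methods are hypothetical names, and the 1024-byte search windows for the "%PDF-" header and the "%%EOF" trailer are a heuristic assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Tika or PDFBox): a cheap heuristic for
// spotting byte arrays that are probably not a complete PDF file.
final class PdfSanityCheck {

    private PdfSanityCheck() {}

    // Assumption: the "%PDF-" header appears within the first 1024 bytes.
    static boolean hasPdfHeader(byte[] data) {
        return indexOf(data, "%PDF-", 0, Math.min(data.length, 1024)) >= 0;
    }

    // Assumption: the "%%EOF" marker appears within the last 1024 bytes.
    static boolean hasEofMarker(byte[] data) {
        int from = Math.max(0, data.length - 1024);
        return indexOf(data, "%%EOF", from, data.length) >= 0;
    }

    // True only if both markers are present; null counts as incomplete.
    static boolean looksComplete(byte[] data) {
        return data != null && hasPdfHeader(data) && hasEofMarker(data);
    }

    // Finds the ASCII marker inside data[from, to); returns -1 if absent.
    private static int indexOf(byte[] data, String marker, int from, int to) {
        byte[] m = marker.getBytes(StandardCharsets.US_ASCII);
        for (int i = Math.max(from, 0); i + m.length <= to; i++) {
            int j = 0;
            while (j < m.length && data[i + j] == m[j]) {
                j++;
            }
            if (j == m.length) {
                return i;
            }
        }
        return -1;
    }
}
```

Calling PdfSanityCheck.looksComplete(page.getBinaryData()) before binaryParser.parse(...) and logging the URL when it returns false would show whether the failing files are the ones that arrived incomplete.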

The original TikaException is caused by a known bug in PDFBox 1.4.0. Update to PDFBox 1.5.0 or later.

See this entry in the release notes:

[PDFBOX-578] NPE NullPointerException in PDPageNode.getCount

And the related JIRA ticket.
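
The TikaException message only says that the PDF parser threw a RuntimeException. To confirm that a given failure really is the NullPointerException from the release note, it helps to look at the innermost cause. Here is a minimal sketch, assuming the TikaException carries the original exception as its cause; RootCause is a hypothetical helper that uses only java.lang.Throwable.

```java
// Hypothetical helper (not part of Tika): walks an exception's cause chain
// so the innermost failure can be logged next to the wrapper's message.
final class RootCause {

    private RootCause() {}

    // Returns the innermost cause, or t itself if it has no cause.
    static Throwable of(Throwable t) {
        Throwable current = t;
        // The second condition guards against a self-referential chain.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    // One-line summary: root exception class, message, and top stack frame.
    static String describe(Throwable t) {
        Throwable root = of(t);
        StackTraceElement[] frames = root.getStackTrace();
        String where = frames.length > 0 ? " at " + frames[0] : "";
        return root.getClass().getName() + ": " + root.getMessage() + where;
    }
}
```

In the code from the question, the empty catch block in handleBinary could print RootCause.describe(e) together with the URL, which would show whether the top frame is in PDPageNode.getCount or somewhere else.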
