Parsing structured documents in Java

I would like to parse some legal documents with a Java library into pieces of text that represent headers, paragraphs, etc. Legal documents are usually well structured, so I would like to use something a bit easier than JavaCC (or other parser generators). Are there any libraries that can (nearly) automatically detect such a structure?

Thanks.


I don't think any tool can "nearly automatically" extract such structures. If the structure is really easy to extract, you don't need a tool; you can easily code it yourself. If it is not so easy, you need a tool that is powerful enough (JavaCC, ANTLR, ...).

I think parsing the text yourself with custom code is the best way. It may help to read a bit about parsing beforehand (recursive descent, lexer/parser separation, ...). For simple structures it is not hard to get a working solution quickly.
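To give an idea of how little code a simple case needs, here is a minimal sketch of a hand-written, line-based splitter. Everything in it is an assumption for illustration: the class name `LegalDocSplitter`, the `Block` type, and above all the header convention (lines starting with "Article", "Section" or "Chapter" plus a number, or with a dotted number such as "1.2" followed by a capitalized title). Real documents will need their own rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: splits plain text into header and paragraph blocks.
class LegalDocSplitter {

    public enum Kind { HEADER, PARAGRAPH }

    /** One piece of the document: its kind and its text. */
    public static final class Block {
        public final Kind kind;
        public final String text;

        Block(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // Assumed header convention (adjust to your documents):
    //   "Article 1 ...", "Section 2.3 ...", "Chapter 4 ...", or "1.2 Title".
    private static final Pattern HEADER = Pattern.compile(
            "(?:(?:Article|Section|Chapter)\\s+\\d+(?:\\.\\d+)*\\b.*)"
            + "|(?:\\d+(?:\\.\\d+)*\\.?\\s+\\p{Lu}.*)");

    /** Walks the text line by line and groups it into blocks. */
    public static List<Block> split(String document) {
        List<Block> blocks = new ArrayList<>();
        StringBuilder paragraph = new StringBuilder();
        for (String rawLine : document.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                // A blank line ends the current paragraph.
                flush(paragraph, blocks);
            } else if (HEADER.matcher(line).matches()) {
                // A header also ends the current paragraph.
                flush(paragraph, blocks);
                blocks.add(new Block(Kind.HEADER, line));
            } else {
                // Consecutive text lines are joined into one paragraph.
                if (paragraph.length() > 0) {
                    paragraph.append(' ');
                }
                paragraph.append(line);
            }
        }
        flush(paragraph, blocks);
        return blocks;
    }

    // Emits the collected paragraph, if any, and resets the buffer.
    private static void flush(StringBuilder paragraph, List<Block> blocks) {
        if (paragraph.length() > 0) {
            blocks.add(new Block(Kind.PARAGRAPH, paragraph.toString()));
            paragraph.setLength(0);
        }
    }

    public static void main(String[] args) {
        String sample = "Article 1 Definitions\n\nThis agreement\nis made.\n\n"
                + "1.1 Scope\nApplies to all.";
        for (Block b : split(sample)) {
            System.out.println(b.kind + ": " + b.text);
        }
    }
}
```

If the structure turns out to be nested (sections inside articles, numbered clauses inside sections), the same loop can be extended into a small recursive descent parser before reaching for a generator.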


Apache POI - the Java API for Microsoft Documents

Apache PDFBox - Java PDF Library

An easier option is Apache Tika, a content analysis toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

It uses PDFBox and POI internally.

Usage: java -jar tika-app-0.9.jar [option] [file]

With the -t option, Tika parses the file(s) specified on the command line and outputs the extracted text content.
