Parsing structured documents in Java

2023-02-24 22:17

I would like to parse some legal documents with a Java library into pieces of text that represent headers, paragraphs, etc. Legal documents are usually well structured, so I would like to use something a bit easier than JavaCC (or other parser generators). Are there any libraries that can (nearly) automatically detect such a structure?

Thanks.

I don't think there is a tool that can "nearly automatically" extract such structures. If the structure is really easy to extract, you don't need a tool; you can easily code it yourself. If it is not so easy, you need a tool that is powerful enough (JavaCC, ANTLR, ...).

I think parsing the text yourself with custom code is the best way. It may help to read a bit about parsing beforehand (recursive descent, lexer/parser separation, ...). For simple structures it is not hard to get a working solution quickly.
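To give an idea of how little code the "do it yourself" route can take, here is a minimal sketch. It assumes that headers start with "Article", "Section", "Chapter", "§" or a dotted number followed by a capitalized word, and that paragraphs are separated by blank lines. The class name, the Block type and the heading pattern are all made up for illustration, so adjust them to whatever conventions your documents actually follow. One step classifies each line (the "lexer" part), the other groups the lines into blocks (the "parser" part).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: the names and the heading pattern are assumptions,
// not part of any library.
class LegalDocSplitter {

    enum Kind { HEADER, PARAGRAPH }

    // One piece of the document: what it is and its text.
    static final class Block {
        final Kind kind;
        final String text;

        Block(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // Assumed heading conventions: "Article 3 ...", "Section 2.1 ...",
    // "§ 4 ...", or "1.2 Title".
    private static final Pattern HEADING = Pattern.compile(
            "(?:(?:Article|Section|Chapter)\\s+\\S+.*"
            + "|§\\s*\\d+.*"
            + "|\\d+(?:\\.\\d+)*\\.?\\s+\\p{Lu}.*)");

    // "Lexer" step: decide whether a single line looks like a heading.
    static boolean isHeading(String line) {
        return HEADING.matcher(line.trim()).matches();
    }

    // "Parser" step: group the classified lines into header and paragraph blocks.
    static List<Block> split(String text) {
        List<Block> blocks = new ArrayList<>();
        StringBuilder paragraph = new StringBuilder();
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                flush(paragraph, blocks);          // blank line ends a paragraph
            } else if (isHeading(line)) {
                flush(paragraph, blocks);          // heading ends a paragraph too
                blocks.add(new Block(Kind.HEADER, line));
            } else {
                if (paragraph.length() > 0) {
                    paragraph.append(' ');         // join wrapped lines
                }
                paragraph.append(line);
            }
        }
        flush(paragraph, blocks);
        return blocks;
    }

    // Emit the pending paragraph, if any, and reset the buffer.
    private static void flush(StringBuilder paragraph, List<Block> blocks) {
        if (paragraph.length() > 0) {
            blocks.add(new Block(Kind.PARAGRAPH, paragraph.toString()));
            paragraph.setLength(0);
        }
    }
}
```

Calling LegalDocSplitter.split(text) returns the blocks in document order. The pattern will misfire on body lines that happen to start like a heading (for example a sentence beginning with "Section 3 applies ..."), which is about the point where a real grammar with JavaCC or ANTLR starts to pay off.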

Apache POI - the Java API for Microsoft Documents
Apache PDFBox - Java PDF Library

An easier option is Apache Tika, a content analysis toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

It uses PDFBox and POI internally.

Usage: java -jar tika-app-0.9.jar [option] [file]

With the -t option, Tika parses the file(s) specified on the command line and outputs the extracted text content.
