Parsing Documents with a DSL
I'm trying to come up with a way to process about a million formal documents (for argument's sake, say thesis documents). They are not all standardized, but close enough: they have titles, sections, paragraphs, etc. There are subtle differences that crop up; for example, in English we call a title "Title", but in French it is "Titre".
Thus, in my mind, the best way to do this would be to create an EBNF grammar covering all the possible variants, e.g. Title := "Title" | "Titre".
I'm not too concerned with coming up with the EBNF. My main concern is how to achieve the parsing. I've looked at ANTLR, Oslo, Irony, and a slew of others, but I don't have the expertise in them to judge whether they would suit my task.
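To make that concrete, here is roughly how I imagine the Title/Titre alternation looking in Irony, since its grammars are plain C# code. This is an untested sketch of mine; ThesisGrammar and its rules are made up, and a real grammar would need rules for sections, paragraphs, and so on:

```csharp
using Irony.Parsing;

// Untested sketch: the "Title | Titre" alternation expressed in Irony.
public class ThesisGrammar : Grammar
{
    public ThesisGrammar() : base(caseSensitive: false)  // "TITLE" and "title" both match
    {
        // Everything up to the end of the line is the title's content.
        var titleText = new FreeTextLiteral("titleText", FreeTextOptions.AllowEof, "\n");

        var titleKeyword = new NonTerminal("titleKeyword");
        var title = new NonTerminal("title");

        // One rule, several surface forms -- the EBNF idea from above.
        titleKeyword.Rule = ToTerm("Title") | "Titre";
        title.Rule = titleKeyword + ":" + titleText;

        Root = title;  // a real grammar would root at a document rule
    }
}
```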
So, my questions to the learned among you are:
- Which DSL tool would you recommend for parsing documents on this scale?
- Which DSL tool is the most accurate in parsing yet forgiving in matching? (I.e., do we have to define rules for uppercase and lowercase? What about numbers vs. Roman numerals, and foreign languages such as French? See the normalization sketch after this list.)
- Is there a process/algorithm that I have not considered that you would recommend as an alternative to a DSL? (Rewriting from scratch is an option, but I would like to get something working quickly.)
- Has anyone attempted to add learning and intelligence to the algorithms for parsing through DSLs (think genetic algorithms and neural networks)?
- Would you use these DSL tools in a production environment?
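Regarding the forgiving-matching question above, my current thinking is that a normalization pass before parsing could keep the grammar small: lowercase each line and rewrite Roman numerals to Arabic, so the grammar only ever sees one canonical form. An untested sketch:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class Normalizer
{
    static readonly Dictionary<char, int> RomanValues = new()
    {
        ['i'] = 1, ['v'] = 5, ['x'] = 10, ['l'] = 50,
        ['c'] = 100, ['d'] = 500, ['m'] = 1000
    };

    // Lowercase a line and rewrite standalone Roman numerals ("IV." -> "4.").
    // The \b...\b pattern is naive: it would also hit words like "mix",
    // so real use needs context (e.g. only after "chapter"/"chapitre").
    public static string Normalize(string line)
    {
        line = line.ToLowerInvariant();
        return Regex.Replace(line, @"\b[ivxlcdm]+\b",
            m => RomanToArabic(m.Value).ToString());
    }

    static int RomanToArabic(string roman)
    {
        int total = 0;
        for (int i = 0; i < roman.Length; i++)
        {
            int value = RomanValues[roman[i]];
            // Subtractive notation: a smaller digit before a larger one ("iv" = 4).
            if (i + 1 < roman.Length && value < RomanValues[roman[i + 1]])
                total -= value;
            else
                total += value;
        }
        return total;
    }
}
```

With this, Normalizer.Normalize("Chapitre IV. Méthodologie") would come back as "chapitre 4. méthodologie", and the grammar never needs uppercase or Roman-numeral rules.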
My development platform of choice is C#. I mention this because ideally I would like to embed the DSL tool in code so that we can drive it from our existing apps.
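For example, this is the kind of in-process use I am after (again untested, reusing the hypothetical ThesisGrammar sketched above):

```csharp
using System;
using Irony.Parsing;

var parser = new Parser(new ThesisGrammar());
ParseTree tree = parser.Parse("Titre: Une étude des documents formels");

if (tree.Status == ParseTreeStatus.Parsed)
{
    // Walk the parse tree directly from existing application code.
    foreach (var node in tree.Root.ChildNodes)
        Console.WriteLine($"{node.Term.Name}: {node.FindTokenAndGetText()}");
}
```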
I came across a tool called TinyPG. It's not exactly what I need, but having the source code to study should let me build what I need.