Latin-based language segmentation grammatical rules

2022-12-30 04:07 Q&A author:

I am working on a feature that applies (grammatical) language segmentation rules to Latin-based languages (currently English).

I am currently at the stage of breaking user input into sentences.

e.g.:

"I am working in language translation". "I have used Google MT API for this"

In the example above, I break the text at the full stop. This is the normal case, where I break a sentence on a dot, but there are a number of other characters that can end a sentence (. ! ? etc.).
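To make the problem concrete, here is a minimal sketch of the naive approach described above: split after `.`, `!` or `?` when whitespace follows. The function name `naive_split` and the choice of terminator characters are assumptions for illustration, not part of any library.

```python
import re

# Assumption: only '.', '!' and '?' are treated as terminators here.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def naive_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]

print(naive_split("I am working in language translation. "
                  "I have used Google MT API for this."))
# An abbreviation such as "Dr." is wrongly treated as a sentence end:
print(naive_split("Dr. Smith arrived. He was late!"))
```

The second call shows why punctuation alone is not enough: abbreviations, initials and similar cases need extra rules or a trained model.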

I have the following SRX rules for segmentation.

Is there any reference I can use for working out my language segmentation rules?
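For readers unfamiliar with SRX: its rules are ordered break / no-break entries, each with a "before break" and an "after break" regular expression, and the first rule that matches at a position decides. The sketch below imitates that idea in Python. It is not an SRX parser, and the two rules in `RULES` are hypothetical examples, not the rules referred to in this question.

```python
import re
from typing import List, NamedTuple

class Rule(NamedTuple):
    is_break: bool  # True = break here, False = exception (no break)
    before: str     # regex that must match the text ending at the position
    after: str      # regex that must match the text starting at the position

# Hypothetical rules for illustration; exceptions come before break rules.
RULES = [
    Rule(False, r"\b(?:Dr|Mr|Mrs|Ms|etc|e\.g|i\.e)\.", r"\s"),
    Rule(True, r"[.!?]+", r"\s+"),
]

def segment(text: str, rules: List[Rule] = RULES) -> List[str]:
    """Apply ordered break/no-break rules; the first matching rule wins."""
    compiled = [
        (r.is_break, re.compile(r"(?:" + r.before + r")\Z"), re.compile(r.after))
        for r in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule decides this position
    segments.append(text[start:])
    return [s.strip() for s in segments if s.strip()]

print(segment("Dr. Smith arrived. He was late!"))
```

This version rescans the prefix at every position, so it is only suitable for short inputs; it is meant to show the rule ordering, not to be efficient.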

You probably want to take a look at Reynar and Ratnaparkhi's paper "A Maximum Entropy Approach to Identifying Sentence Boundaries" (1997).

Abstract

We present a trainable model for identifying sentence boundaries in raw text. Given a corpus annotated with sentence boundaries, our model learns to classify each occurrence of ., ?, and ! as either a valid or invalid sentence boundary. The training procedure requires no hand-crafted rules, lexica, part-of-speech tags, or domain-specific information. The model can therefore be trained easily on any genre of English, and should be trainable on any other Roman-alphabet language. Performance is comparable to or better than the performance of similar systems, but we emphasize the simplicity of retraining for new domains.

Their resulting sentence segmenter is known as MxTerminator and is available here.
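To give a feel for the classification framing in the abstract, here is a rough sketch that finds every candidate boundary (`.`, `?`, `!`) and collects some context around it. The feature names are my own illustration; this is not the paper's exact feature set and not MxTerminator's code. A trained classifier would take such features and label each candidate as a boundary or not.

```python
import re

CANDIDATE = re.compile(r"[.?!]")

def candidate_features(text):
    """Return (offset, features) for every '.', '?' or '!' in text."""
    results = []
    for m in CANDIDATE.finditer(text):
        i = m.start()
        # Non-space characters attached directly before/after the candidate.
        prefix = re.search(r"\S*\Z", text[:i]).group()
        suffix = re.match(r"\S*", text[i + 1:]).group()
        # First whitespace-separated token after the candidate's own token.
        following = text[i + 1 + len(suffix):].split()
        next_word = following[0] if following else ""
        results.append((i, {
            "char": text[i],
            "prefix": prefix,
            "suffix": suffix,
            "next_word": next_word,
            "next_is_capitalized": next_word[:1].isupper(),
            "followed_by_space": text[i + 1:i + 2].isspace(),
        }))
    return results

for offset, feats in candidate_features("Dr. Smith paid 3.50 today. He left."):
    print(offset, feats)
```

Features like these let a model learn that "Dr" before a period usually does not end a sentence, while a period followed by a space and a capitalized word usually does, without anyone writing that rule by hand.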

There seems to be a good amount of literature about this in linguistics journals...

This is a nice report about the problem; I hope it helps: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1068&context=ircs_reports

nico
