Segment text using full stops

I need to segment text at full stops using PHP or JavaScript. The problem is that if I split on ".", then abbreviations, dates (12.03.2010) and URLs get split as well, which I need to prevent. There are probably many more such cases that I cannot think of. How can I recognize that a "." is being used as a full stop and nothing else?

When I googled, I found SRX (http://www.lisa.org/fileadmin/standards/srx20.html). Is there any open-source PHP project that segments text using these rules?

Any Linux command-line utility would also do, as long as it is free.

This issue deals with cases where a segment breaks at a dot (.) because it is treated as a full stop. We need to distinguish between a dot (.) and a full stop.

Cases where "." is not a full stop:

  1. http://www.yahoo.com'>it is a good link. i liked it - only one valid fullstop
  2. This is a test case. Lets try it no valid fullstop
     http://www.yahoo.com'>Testing is done by amold12@…. - no valid fullstop
  3. Mr. Abc is in town today - no valid fullstop
  4. S. Khan had done it - no valid fullstop
  5. The U.S. is emerging from a recession. - no valid fullstop

As far as code is concerned, I am using the JavaScript text.split(".") method.
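To make the problem concrete, here is a minimal sketch of what that naive split does. The sample sentence and the helper name `naiveSplit` are made up for this illustration and are not part of the actual code.

```javascript
// Naive approach: every "." is treated as the end of a sentence.
// naiveSplit is a hypothetical helper name used only for this example.
function naiveSplit(text) {
  return text
    .split(".")
    .map((part) => part.trim())           // drop surrounding whitespace
    .filter((part) => part.length > 0);   // drop empty trailing pieces
}

// Made-up sample containing an abbreviation and a date.
const sample = "Mr. Abc was in town on 12.03.2010. He left today.";

console.log(naiveSplit(sample));
// → [ 'Mr', 'Abc was in town on 12', '03', '2010', 'He left today' ]
```

The text has two sentences, but the split produces five fragments because the abbreviation and the date are cut apart as well.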

Thanks


Human language is quirky. Whatever rules you come up with, some corner case is likely to defeat you. How important is it that you are 100% accurate? Would missing the occasional full stop really matter? Or would being a tad too aggressive really matter? If your objective is (for example) to come up with some statistical analysis of sentence length in published material, then I doubt that some over- or under-counting would be crucial.

My suggestion would be to look for patterns such as

full-stop space(s) Capital letter
full-stop quote
full-stop new line

Run that across your sample text and see what anomalies remain.
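A rough sketch of those three patterns in JavaScript might look like the following. The function name `splitSentences` and the sample strings are invented for illustration, and treating a "." at the very end of the text as a full stop is an extra assumption on top of the patterns listed above. It is a starting point for spotting anomalies, not a complete segmenter.

```javascript
// A "." is treated as a full stop when it is followed by:
//   - a quote character (the quote is kept with the sentence),
//   - optional spaces/tabs and then a new line,
//   - whitespace and then a capital letter,
//   - nothing but whitespace up to the end of the text (added assumption).
const BOUNDARY = /\.(?:["'\u201D\u2019]|(?=[ \t]*\r?\n)|(?=\s+[A-Z])|(?=\s*$))/g;

// splitSentences is a hypothetical helper name used only for this sketch.
function splitSentences(text) {
  const sentences = [];
  let start = 0;
  for (const match of text.matchAll(BOUNDARY)) {
    const end = match.index + match[0].length; // cut right after the "." (and quote)
    const sentence = text.slice(start, end).trim();
    if (sentence) sentences.push(sentence);
    start = end;
  }
  const rest = text.slice(start).trim(); // trailing text without a full stop
  if (rest) sentences.push(rest);
  return sentences;
}

// Dates and host names survive, because their dots are not followed by
// whitespace plus a capital letter.
console.log(splitSentences("It was sent on 12.03.2010 from www.yahoo.com today. It worked."));
// → [ 'It was sent on 12.03.2010 from www.yahoo.com today.', 'It worked.' ]

// Abbreviations before a capitalised name are still split: a remaining anomaly.
console.log(splitSentences("Mr. Abc is in town today."));
// → [ 'Mr.', 'Abc is in town today.' ]
```

The second call shows the kind of anomaly that remains: titles and initials followed by a capitalised name look exactly like a sentence boundary to these rules, so they would need something extra, such as a list of known abbreviations.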

Yours sincerely, David J. N. Artus. (Not a complete sentence yet, because I didn't use a . in that way, and that previous . isn't one either. But that last . was.)
