Segment text using fullstops
I need to segment text at full stops using PHP or JavaScript. The problem is that if I split the text on ".", then abbreviations, dates (12.03.2010) and URLs get split as well, which I need to prevent. There are many such cases, more than I can anticipate. How can I recognize that a "." is being used as a full stop and nothing else?
When I googled, I found SRX (http://www.lisa.org/fileadmin/standards/srx20.html). Is there any open-source PHP project that segments text using these rules?
I can also make do with any Linux command-line utility, as long as it is free.
This issue deals with cases where a segment breaks at a dot (.) because the dot is treated as a full stop. We need to distinguish between a dot (.) and a full stop.
Cases where "." is not a full stop:
- http://www.yahoo.com'>it is a good link. i liked it - only one valid fullstop
- This is a test case. Lets try it - no valid fullstop
- http://www.yahoo.com'>Testing is done by amold12@…. - no valid fullstop
- Mr. Abc is in town today - no valid fullstop
- S. Khan had done it - no valid fullstop
- The U.S. is emerging from a recession. - no valid fullstop
As far as code is concerned, I am using JavaScript's text.split(".") method.
Thanks
Human language is quirky. Whatever rules you come up with, some corner case is likely to defeat you. How important is it that you are 100% accurate? Would missing the occasional full stop really matter? Or would being a tad too aggressive really matter? If your objective is (for example) to come up with some statistical analysis of sentence length in published material, then I doubt that some over- or under-counting would be crucial.
My suggestion would be to look for patterns such as
full-stop space(s) Capital letter
full-stop quote
full-stop new line
Run that across your sample text and see what anomalies remain.
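As a rough illustration of those patterns, here is a minimal JavaScript sketch. The function name `splitSentences` and the exact regular expression are my own assumptions, not an implementation of SRX or of any library. It treats a "." as a full stop only when it is followed by a quote, by whitespace and a capital letter, by a line break, or by the end of the text.

```javascript
// Heuristic sentence splitter (hypothetical helper, not a library API).
// A "." counts as a full stop only if it is followed by:
//   - a closing quote, or
//   - whitespace and then a capital letter, or
//   - a line break, or
//   - the end of the text.
function splitSentences(text) {
  const boundary = /\.(?:["'\u201D\u2019]|(?=\s+[A-Z]|[ \t]*\r?\n|\s*$))/g;
  const sentences = [];
  let start = 0;
  let match;
  while ((match = boundary.exec(text)) !== null) {
    // End of the sentence: just after the "." (and the quote, if one matched).
    const end = match.index + match[0].length;
    const sentence = text.slice(start, end).trim();
    if (sentence) sentences.push(sentence);
    start = end;
  }
  // Whatever follows the last boundary is kept as a final segment.
  const rest = text.slice(start).trim();
  if (rest) sentences.push(rest);
  return sentences;
}

console.log(splitSentences("Visit 12.03.2010 at http://www.yahoo.com today. It was fine."));
// → ["Visit 12.03.2010 at http://www.yahoo.com today.", "It was fine."]

console.log(splitSentences("The U.S. is emerging from a recession."));
// → ["The U.S. is emerging from a recession."]

console.log(splitSentences("Mr. Abc is in town today"));
// → ["Mr.", "Abc is in town today"]  (an anomaly: the rule is too aggressive here)
```

Dates and URLs survive because the dot is followed directly by a digit or letter, and "U.S." survives here only because the next word is lowercase. Titles and initials such as "Mr." or "S." followed by a capitalized name are still split, so they are exactly the kind of anomaly you would need to handle separately, for instance with a list of known abbreviations.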
Yours sincerely, David J. N. Artus. (Not a complete sentence yet, because I didn't use a . in that way, and that previous . isn't one either. But that last . was.)