Splitting text with JavaScript match
For the code below:
var str = "I left the United States with my eyes full of tears! I knew I would miss my American friends very much.All the best to you";
var re = new RegExp("[^.?!]*(?:[.?!]+|\\s$)", "g");
var myArray = str.match(re);
This is what I am getting as a result:
myArray[0] = "I left the United States with my eyes full of tears!"
myArray[1] = " I knew I would miss my American friends very much."
I want to add one more condition to the regex so that the text breaks only if there is a space after the punctuation mark (?, . or !).
How do I do that so the result for the above case is:
myArray[0] = "I left the United States with my eyes full of tears!"
myArray[1] = " I knew I would miss my American friends very much.All the best to you "
myArray[2] = ""
var str = "I left the United States with my eyes full of tears! I knew I would miss my American friends very much.All the best to you";
var re =/[^\.\?!]+[\.?!]( +|[^\.\?!]+)/g;
var myArray = str.match(re);
myArray.join('\n')
/* returned value: (String)
I left the United States with my eyes full of tears!
I knew I would miss my American friends very much.All the best to you
*/
.+?([!?.](?= |$)|$)
should work.
It will match any sequence of characters that is either
- followed by a punctuation sign that is itself followed by a space or end-of-string, or
- followed by the end of the string.
By using the reluctant quantifier +?, it finds the shortest possible sequences (= single sentences).
In JavaScript:
result = subject.match(/.+?([!?.](?= |$)|$)/g);
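As a quick check, here is a minimal sketch that runs this regex against the string from the question. The variable name sentences and the trim() step are my own additions for illustration; trim() only strips the leading space that each match after the first one carries.

```javascript
// Sample string from the question (note the missing space after "much.").
const subject = "I left the United States with my eyes full of tears! I knew I would miss my American friends very much.All the best to you";

// Lazily match up to a punctuation mark that is followed by a space or
// end-of-string, or up to the end of the string if no such mark remains.
const sentences = subject
  .match(/.+?([!?.](?= |$)|$)/g)
  .map(function (s) { return s.trim(); }); // assumption: leading spaces are unwanted

console.log(sentences);
// [ 'I left the United States with my eyes full of tears!',
//   'I knew I would miss my American friends very much.All the best to you' ]
```

Note that . does not match line breaks in JavaScript, so this assumes the text is on a single line.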
EDIT:
In order to avoid the regex splitting on "space/single letter or multidigit number/dot", you can use:
result = subject.match(/( \d+\.| [^\W\d_]\.|.)+?([!?.](?= |$)|$)/g);
This will split
I left the United States with my eyes full of tears! 23. I knew I would miss my American friends very much. I. All the best to you.
into
I left the United States with my eyes full of tears!
23. I knew I would miss my American friends very much.
I. All the best to you.
Instead of simply matching any character until it finds a dot, the regex does the following at each position:
- First try to match a space, a number, and a dot.
- If that fails, try to match a space, a letter, and a dot.
- If that fails, match any character.
That way, the dot after a number/letter has already been matched and will not be matched as the delimiting punctuation character that follows next in the regex.
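To see the three alternatives at work, here is a small sketch that applies the extended regex to the example string above. The names re and sentences and the trim() call are mine, added only for illustration.

```javascript
// Example string from above, with "23." and "I." as list-style prefixes.
const subject = "I left the United States with my eyes full of tears! 23. I knew I would miss my American friends very much. I. All the best to you.";

// Alternatives are tried in order at each position:
//   " \d+\."       space, digits, dot    (consumes e.g. " 23.")
//   " [^\W\d_]\."  space, one letter, dot (consumes e.g. " I.")
//   "."            any single character
const re = /( \d+\.| [^\W\d_]\.|.)+?([!?.](?= |$)|$)/g;

const sentences = subject
  .match(re)
  .map(function (s) { return s.trim(); }); // assumption: leading spaces are unwanted

console.log(sentences.join("\n"));
// I left the United States with my eyes full of tears!
// 23. I knew I would miss my American friends very much.
// I. All the best to you.
```

Because [^\W\d_] relies on \w, it only covers ASCII letters; a single accented letter followed by a dot would still be treated as the end of a sentence.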