Regexp for parsing words from sentence
I need a regular expression to parse words from a sentence or a paragraph. The separators that should be used include spaces and dots. So in:
My name is Bob.I'm 104 yrs old.
Bob and I'm are separated even though there is no space between them, only a dot.
Any other common word separators should also be included.
What about str_word_count()? From the manual:
For the purpose of this function, 'word' is defined as a locale dependent string containing alphabetic characters, which also may contain, but not start with "'" and "-" characters.
Example:
$str = "My name is Bob.I'm 104 yrs old.";
print_r(str_word_count($str, 1, '0123456789'));
gives:
Array
(
[0] => My
[1] => name
[2] => is
[3] => Bob
[4] => I'm
[5] => 104
[6] => yrs
[7] => old
)
The third parameter takes a string which defines which additional characters should be considered as "word characters".
$words = preg_split('#[\\s.]#', $string, -1, PREG_SPLIT_NO_EMPTY);
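The \\s (a plain \s once PHP has parsed the string literal) will match all whitespace characters (space, tab, newline, etc.). The . will match a literal dot, since inside a character class it has no special meaning. If you want to split on more characters, just add them after the . inside the brackets. A few characters need care: a [, a ], a \ and the delimiter # should be escaped with a backslash, and a - must either be escaped or be the first or last character in the list, otherwise it defines a range.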
It will return for your above sentence:
array(8) {
[0]=>
string(2) "My"
[1]=>
string(4) "name"
[2]=>
string(2) "is"
[3]=>
string(3) "Bob"
[4]=>
string(3) "I'm"
[5]=>
string(3) "104"
[6]=>
string(3) "yrs"
[7]=>
string(3) "old"
}
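As a minimal sketch of adding more separators to the character class, the snippet below extends the pattern with a few common punctuation marks. The sample sentence and the exact set of punctuation are assumptions made for illustration, not part of the original question; the hyphen and apostrophe are left out of the class on purpose so that "well-known" and "it's" stay whole.

```php
<?php
// Hypothetical sample text with more kinds of punctuation than the original.
$string = "Hi, Bob! Is it well-known (or not)? Yes; it's 104.";

// \s matches whitespace; the rest are literal separator characters.
// The + treats a run of separators (e.g. ")? ") as one split point.
$pattern = '#[\s.,;:!?()"]+#';

// PREG_SPLIT_NO_EMPTY drops the empty string left after the final dot.
$words = preg_split($pattern, $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
```

With this input the result should contain ten entries, from Hi through 104, with well-known and it's each kept as a single word.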
There are two ways to do this, inclusive or exclusive, depending on which of the following you split on:
Take "word characters" plus common "connectors" (apostrophe, hyphen, etc.) and negate the whole group:
[^\w'-]+
Or specify what you consider non-word characters (spaces, dots, colons, parens, etc):
[\s.;:()]+
(In both cases, the + makes a run of consecutive separators count as a single split point, which avoids empty strings in the result.)
Certain characters need to be escaped in character classes - for details see http://www.regular-expressions.info/charclass.html
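For comparison, here is a rough sketch of both patterns applied to the sentence from the question with preg_split. The variable names are illustrative, and PREG_SPLIT_NO_EMPTY is added as an assumption to discard the empty string that the trailing dot would otherwise leave at the end.

```php
<?php
$sentence = "My name is Bob.I'm 104 yrs old.";

// Inclusive: split on anything that is NOT a word character,
// an apostrophe or a hyphen (the hyphen is last, so it is literal).
$inclusive = preg_split("/[^\\w'-]+/", $sentence, -1, PREG_SPLIT_NO_EMPTY);

// Exclusive: split on an explicit list of separator characters.
$exclusive = preg_split('/[\s.;:()]+/', $sentence, -1, PREG_SPLIT_NO_EMPTY);

print_r($inclusive);
print_r($exclusive);
```

For this sentence both calls should give the same eight words, with I'm kept intact. They start to differ once the text contains characters that are in neither list, such as a comma: the inclusive pattern splits on it, the exclusive one does not.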
Check out the word boundary anchor \b (together with \w, which matches a single word character) to isolate individual words from whitespace and punctuation.
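One way to use the word boundary is to match the words directly with preg_match_all instead of splitting on the separators. This is only a sketch: the second pattern, which allows an apostrophe inside a word, is an assumption about what should count as a word here.

```php
<?php
$sentence = "My name is Bob.I'm 104 yrs old.";

// \b is a zero-width position between a word character (\w) and a
// non-word character; \w+ is the run of word characters in between.
preg_match_all('/\b\w+\b/', $sentence, $matches);
$plain = $matches[0];

// Variant (assumption): also accept apostrophe-joined parts such as I'm.
preg_match_all("/\\b\\w+(?:'\\w+)*\\b/", $sentence, $matches);
$withApostrophes = $matches[0];

print_r($plain);
print_r($withApostrophes);
```

The first pattern treats the apostrophe as a boundary, so I'm comes back as I and m; the second keeps it as one word.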
Have a look at preg_split:
$words = preg_split('/\W+/', $sentence); // split on non-word-characters
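This will obviously split »I'm« into ›I‹ and ›m‹, because the apostrophe is a non-word character. It also leaves an empty string at the end when the sentence ends with a dot; pass PREG_SPLIT_NO_EMPTY as the fourth argument to drop it.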