Regexp for parsing words from sentence

I need a regular expression to parse words from a sentence or a paragraph. The separators should include spaces and dots. So in:

My name is Bob.I'm 104 yrs old.

Bob and I'm are separated even though there is no space between them, only a dot.

Any other common word separators should also be included.


What about str_word_count()? From the PHP manual:

For the purpose of this function, 'word' is defined as a locale dependent string containing alphabetic characters, which also may contain, but not start with "'" and "-" characters.

Example:

$str = "My name is Bob.I'm 104 yrs old."; 
print_r(str_word_count($str, 1, '0123456789'));

gives:

Array
(
    [0] => My
    [1] => name
    [2] => is
    [3] => Bob
    [4] => I'm
    [5] => 104
    [6] => yrs
    [7] => old
)

The third parameter takes a string which defines which additional characters should be considered as "word characters".
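
To see what the third parameter changes, here is a small sketch that runs the same sentence with and without the extra characters. The variable names are only illustrative; the format values 0, 1 and 2 are the ones str_word_count() accepts for its second parameter.

```php
<?php
$str = "My name is Bob.I'm 104 yrs old.";

// Format 0 (the default) returns only the number of words.
// Without a charlist, digits are not word characters, so "104" is not counted.
$count = str_word_count($str); // 7

// Format 1 returns the words as a list; "104" is skipped here too.
$withoutDigits = str_word_count($str, 1);

// Format 2 keys each word by its offset in the string.
// The digits are passed as extra word characters, so "104" is kept.
$withOffsets = str_word_count($str, 2, '0123456789');

print_r($withoutDigits);
print_r($withOffsets);
```

The offsets from format 2 can be handy if you need to map words back to their position in the original text.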


$words = preg_split('#[\\s.]#', $string, -1, PREG_SPLIT_NO_EMPTY);

The \\s (written inside a single-quoted PHP string, so the regex engine sees \s) matches any whitespace character (space, tab, newline, etc.). The . matches a literal dot, because inside a character class it loses its special meaning. To add more separators, append them after the . (with the caveats that a [, a ] and a # must be escaped with \\, and a - should be the first or last character in the list, or be escaped).

For your sentence above it returns:

array(8) {
  [0]=>
  string(2) "My"
  [1]=>
  string(4) "name"
  [2]=>
  string(2) "is"
  [3]=>
  string(3) "Bob"
  [4]=>
  string(3) "I'm"
  [5]=>
  string(3) "104"
  [6]=>
  string(3) "yrs"
  [7]=>
  string(3) "old"
}
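
As a rough illustration of adding more separators to the class, the sketch below also splits on commas, exclamation marks, question marks and hyphens. The sample text and the choice of extra characters are assumptions made for the example, not part of the question.

```php
<?php
// Hypothetical sample text, chosen to exercise the extra separators.
$string = "Hi, Bob!Are you well-rested?Yes.";

// The hyphen is placed last in the class so it is read literally,
// not as a range operator.
$words = preg_split('#[\s.,!?-]#', $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
// Array ( [0] => Hi [1] => Bob [2] => Are [3] => you [4] => well [5] => rested [6] => Yes )
```

Note that treating the hyphen as a separator breaks "well-rested" into two words; leave it out of the class if hyphenated words should stay whole.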


There are two ways to do this, inclusive or exclusive, by splitting on one of the following:

Use "word characters" plus common "connectors" (apostrophe, hyphen, etc.) and negate the whole group:

[^\w'-]+

Or specify what you consider non-word characters (spaces, dots, colons, parens, etc):

[\s.;:()]+

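(In both cases, the + treats a run of consecutive separators as a single split point, so no empty strings appear between words. A separator at the very start or end of the text still yields an empty element unless PREG_SPLIT_NO_EMPTY is passed.)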

Certain characters need to be escaped in character classes - for details see http://www.regular-expressions.info/charclass.html
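
A minimal sketch of both patterns used with preg_split(), assuming PHP as in the other answers. The second input string is made up to show where the two approaches differ.

```php
<?php
$str = "My name is Bob.I'm 104 yrs old.";

// Inclusive: split on anything that is NOT a word character, apostrophe or hyphen.
$inclusive = preg_split("/[^\\w'-]+/", $str, -1, PREG_SPLIT_NO_EMPTY);

// Exclusive: split only on an explicit list of separators.
$exclusive = preg_split('/[\s.;:()]+/', $str, -1, PREG_SPLIT_NO_EMPTY);

// Both give: My, name, is, Bob, I'm, 104, yrs, old
print_r($inclusive);
print_r($exclusive);

// Hypothetical input containing a comma, which only the inclusive pattern treats as a separator.
$other = "one,two (three)";
print_r(preg_split("/[^\\w'-]+/", $other, -1, PREG_SPLIT_NO_EMPTY));   // one, two, three
print_r(preg_split('/[\s.;:()]+/', $other, -1, PREG_SPLIT_NO_EMPTY)); // "one,two", three
```

The inclusive form copes with punctuation you did not anticipate, while the exclusive form never splits on a character you did not list.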


Check out the word boundary anchor (\b) and the word character class (\w) to isolate individual words from whitespace and punctuation.
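
One possible way to apply this is to match the words directly instead of splitting on separators. The pattern below is only a sketch: the optional apostrophe group is an assumption added so that contractions such as I'm stay in one piece.

```php
<?php
$str = "My name is Bob.I'm 104 yrs old.";

// \b anchors at a word boundary, \w+ matches a run of word characters
// (letters, digits, underscore), and (?:'\w+)* optionally keeps
// apostrophe parts such as the 'm in I'm.
preg_match_all("/\\b\\w+(?:'\\w+)*\\b/", $str, $matches);

$words = $matches[0];
print_r($words);
// My, name, is, Bob, I'm, 104, yrs, old
```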


Have a look at preg_split:

$words = preg_split('/\W+/', $sentence); // split on non-word-characters

This will obviously split »I'm« into ›I‹ and ›m‹.
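
Here is a short sketch of what that call produces for the sentence from the question. Because the sentence ends with a dot, the plain call also returns a trailing empty string; passing PREG_SPLIT_NO_EMPTY drops it.

```php
<?php
$sentence = "My name is Bob.I'm 104 yrs old.";

// Plain split: the final "." leaves an empty string at the end of the result.
$raw = preg_split('/\W+/', $sentence);

// PREG_SPLIT_NO_EMPTY filters empty pieces out.
$words = preg_split('/\W+/', $sentence, -1, PREG_SPLIT_NO_EMPTY);

print_r($raw);   // My, name, is, Bob, I, m, 104, yrs, old, ""
print_r($words); // My, name, is, Bob, I, m, 104, yrs, old
```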
