
How to preg_match_all a set of words in any possible language?

I have a website that people enter lists of words into.

These lists of words could be written in any language in the world.

How can I extract these lists of words from their input data if I do not know what language they are entering?

Is there some kind of match-all international alphabet symbol I am missing, or do I have to manually write up a set of brackets that will match every possible international letter?

Is this what I am looking for and just don't know it yet?


You can use Unicode character properties, for example:

preg_match_all('#[\p{L}\p{Pc}]+#u', $str, $matches);

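[\p{L}\p{Pc}]+ matches runs of letters and connector punctuation (such as the underscore). If you only care about letters, you can shorten that to \pL+. The u modifier is needed so that the pattern and the subject are treated as UTF-8.
Either way, you'd want to define "word" better. It is probably more than a sequence of letters...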
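
To make the Unicode-property approach concrete, here is a minimal sketch. The helper name extract_words is hypothetical, and adding \p{M} (combining marks) is an assumption beyond the pattern above, included so that accents and vowel signs stay attached to their base letters. Treat it as a starting point rather than a complete definition of "word".

```php
<?php
// Hypothetical helper, not part of the original answer.
function extract_words(string $str): array
{
    // \p{L}  = any letter in any script
    // \p{M}  = combining marks (assumption: added so decomposed accents are kept)
    // \p{Pc} = connector punctuation such as "_"
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    $count = preg_match_all('#[\p{L}\p{M}\p{Pc}]+#u', $str, $matches);
    if ($count === false) {
        // preg_match_all() returns false on error, e.g. invalid UTF-8 input.
        return [];
    }
    return $matches[0];
}

print_r(extract_words("naïve café, привет мир; γειά_σου"));
```

For that sample string the matches are naïve, café, привет, мир and γειά_σου; the commas, semicolon and spaces are skipped because they belong to none of the three classes.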


My recommendation is to define your own input convention: force users to input one word at a time, or one word per line in a textbox. Otherwise, you will need a segmentation algorithm for each script. Granted, for the vast majority of scripts it will be something trivial like "split on characters that Unicode classifies as separators", but the remaining special cases (scripts written without spaces between words, such as Chinese, Japanese or Thai) are basically still open AI research topics.
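
A one-word-per-line convention could be handled roughly as follows. This is only a sketch: words_from_lines is a hypothetical name, and the choice to trim surrounding whitespace and skip blank lines is an assumption about how the form should behave.

```php
<?php
// Hypothetical helper: treats each non-empty line of the submitted text as one entry.
function words_from_lines(string $input): array
{
    // \R matches any newline sequence (\n, \r\n, \r, ...), so the platform
    // the text was typed on does not matter.
    $lines = preg_split('/\R/u', $input);
    if ($lines === false) {
        // preg_split() returns false on error, e.g. invalid UTF-8 input.
        return [];
    }

    $words = [];
    foreach ($lines as $line) {
        // Strip leading and trailing whitespace, including Unicode separators (\p{Z}).
        $line = preg_replace('/^[\s\p{Z}]+|[\s\p{Z}]+$/u', '', $line);
        if ($line !== null && $line !== '') {
            $words[] = $line;
        }
    }
    return $words;
}

print_r(words_from_lines("東京\r\n  Zürich \n\nمرحبا"));
```

Because each line is taken as a single entry, no per-script segmentation is needed; the script of the entry is irrelevant to the splitting step.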
