
Question regarding regex and tokenizing

I need to make a tokenizer that is able to tokenize English words.

Currently, I'm stuck on characters that can be part of a URL.

For instance, if the characters ':', '?', '=' are part of a URL, I shouldn't segment them.

My question is: can this be expressed in regex? I have the regex

\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
  (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*
  (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])

from here

but I don't know how to piece everything together so that, when these characters appear inside a match of the above expression, no spaces are inserted between them.

Help!


I would approach this problem by doing a sweep with a different regexp, putting hits into an array, removing those hits from the string, and then doing your tokenizer as normal.
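A minimal sketch of that idea in Python, under a few assumptions: the question doesn't say which language is in use, the URL pattern is the one from the question compiled case-insensitively (it only lists `A-Z`), and `TOKEN_RE` is a hypothetical stand-in for whatever your normal tokenizer does. Instead of literally deleting the hits from the string, it walks the URL matches in order and tokenizes only the text between them, so the tokens come out in their original order.

```python
import re

# URL pattern from the question; IGNORECASE is an assumption,
# since the character classes only list A-Z.
URL_RE = re.compile(
    r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)"
    r"(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*"
    r"(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])",
    re.IGNORECASE,
)

# Hypothetical "normal" tokenizer: runs of word characters,
# or a single punctuation character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Tokenize text, keeping every URL match as a single token."""
    tokens = []
    pos = 0
    for match in URL_RE.finditer(text):
        # Tokenize the plain text before this URL as usual.
        tokens.extend(TOKEN_RE.findall(text[pos:match.start()]))
        # Keep the URL whole, so ':', '?', '=' inside it are not split.
        tokens.append(match.group())
        pos = match.end()
    # Tokenize whatever follows the last URL.
    tokens.extend(TOKEN_RE.findall(text[pos:]))
    return tokens


print(tokenize("See http://example.com/a?b=c: it works!"))
# ['See', 'http://example.com/a?b=c', ':', 'it', 'works', '!']
```

Note that the last group of the URL pattern excludes ':', '.', ',' and similar characters, so trailing punctuation right after a URL ends up as its own token rather than being swallowed by the URL.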
