java.util.regex matching anything before expression
I am trying to tokenize the following snippets by type of number:
"(0-22) 222-33-44, 222-555-666, tel./.fax (111-222-333) 22-33-44 UK, TEL/faks: 000-333-444, fax: 333-444-555, tel: 555-666-888"
and
"tel: 555-666-888, tel./fax (111-222-333) 22-33-44 UK"
and
"fax (111-222-333) 22-33-44 UK, TEL/faks: 000-333-444, fax: 333-444-555"
and so on.
The idea is that the string can be any combination of designations such as "tel/faks" followed by tel/fax numbers, or just a tel/fax number at the beginning of the string.
I wrote this:
"(?:.(?!((tel|fax|faks)[ /:.]+)+))++"
and ran it on example 1, but after find() it returns (the '_' characters were added by me):
-
_(0-22) 222-33-44, 222-555-666,_
_TEL./_
_FAX (111-222-333) 22-33-44 UK,_
_TEL_
_FAKS: 000-333-444,_
_FAX: 333-444-555_
It seems that I am losing one character in every group, and combined types like "TEL/faks" are split. I also need to grab the type (if it exists; if not, the default type is tel) for future processing.
How can I fix this?
PS: I use the case-insensitive flag.
Your regular expression means (roughly):
(?:                           Match a group consisting of:
  .                           any character
  (?!                         that is not followed by
    ((tel|fax|faks)[ /:.]+)+  "tel" or "fax" or "faks", followed by at least one
                              punctuation character from [ /:.] (one or more such runs)
  )
)++                           (one or more times, possessively)
That's why you get a missing character before "Tel", "Fax" etc - because your regular expression says never to match the character before "Tel", "Fax" etc.
That's also why "tel./.fax" gets split: the last "." comes immediately before "fax", so it doesn't get matched.
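To see the effect in isolation, here is a minimal sketch that runs the pattern from the question against the second sample string. The class name LostCharDemo and the method matches are made up for this illustration; only the pattern and the input come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LostCharDemo {
    // The pattern from the question, compiled case-insensitively.
    static final Pattern ORIGINAL = Pattern.compile(
            "(?:.(?!((tel|fax|faks)[ /:.]+)+))++", Pattern.CASE_INSENSITIVE);

    // Collects every match that find() reports (hypothetical helper).
    static List<String> matches(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = ORIGINAL.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "tel: 555-666-888, tel./fax (111-222-333) 22-33-44 UK";
        for (String s : matches(sample)) {
            System.out.println("_" + s + "_");
        }
        // prints:
        // _tel: 555-666-888,_
        // _tel._
        // _fax (111-222-333) 22-33-44 UK_
    }
}
```

The space before the second "tel" and the "/" before "fax" never appear in any match, because each of them is immediately followed by a designation.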
I would suggest constructing two regular expressions that match:
A - a telephone number (parens, digits, hyphens, commas, spaces), with at least one digit
B - a telephone/fax designation ("fax", "faks", "tel", punctuation)
Then search for strings matching
B*A+
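One way the B*A+ idea might look in Java is sketched below. It is only an illustration under some assumptions of mine: numbers may contain hyphens (as in the samples), a number has at least two characters, commas separate consecutive numbers rather than being part of a number, and the names PhoneTokenizer, tokenize, A, B and ENTRY as well as the "tel/fax" label are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PhoneTokenizer {
    // B: one designation ("tel", "fax" or "faks") plus its trailing punctuation.
    static final String B = "(?:(?:tel|fax|faks)[ /:.]+)";
    // A: one number; starts with "(" or a digit and ends with a digit (assumption).
    static final String A = "[(\\d][\\d() -]*\\d";
    // B* followed by A+, where consecutive numbers are separated by commas.
    static final Pattern ENTRY = Pattern.compile(
            "(" + B + "*)(" + A + "(?:,\\s*" + A + ")*)",
            Pattern.CASE_INSENSITIVE);

    // Returns one {type, numbers} pair per match; type is "tel", "fax" or "tel/fax".
    static List<String[]> tokenize(String input) {
        List<String[]> result = new ArrayList<>();
        Matcher m = ENTRY.matcher(input);
        while (m.find()) {
            String designation = m.group(1).toLowerCase(Locale.ROOT);
            boolean fax = designation.contains("fax") || designation.contains("faks");
            // No designation at all: default to tel, as the question asks.
            boolean tel = designation.contains("tel") || !fax;
            String type = (tel && fax) ? "tel/fax" : (fax ? "fax" : "tel");
            result.add(new String[] {type, m.group(2)});
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "(0-22) 222-33-44, 222-555-666, tel./.fax (111-222-333) 22-33-44 UK, "
                + "TEL/faks: 000-333-444, fax: 333-444-555, tel: 555-666-888";
        for (String[] entry : tokenize(sample)) {
            System.out.println(entry[0] + " -> " + entry[1]);
        }
        // prints:
        // tel -> (0-22) 222-33-44, 222-555-666
        // tel/fax -> (111-222-333) 22-33-44
        // tel/fax -> 000-333-444
        // fax -> 333-444-555
        // tel -> 555-666-888
    }
}
```

Because the designation is matched explicitly instead of being excluded by a lookahead, no character in front of "tel" or "fax" is lost, and combined designations such as "TEL/faks:" stay in one group.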