
preg_match exclude strings

From 10,000 lines of data, I need to get all the lines that don't contain words that START with "en", "it", "de", etc. These words are 2 to 5 characters long and consist of a-z and A-Z, possibly with "-" (minus sign) and ";".

I tried this, but it doesn't work:

 !preg_match("/\b(it|en|de|es|fr|ru)[a-zA-Z-;]{2,5}/", $value)

To me, this reads as: don't match any line that has a word starting with it, en, etc., that is 2 to 5 characters long, where those characters can also include "-" or ";".

This returns lines with "it;", which I need to exclude.

EDIT: I need to match every word that starts with those 2 characters (it or en or de), and it can be anywhere in the line.

Example to match (it doesn't contain words that start with "en", "de", etc.)

GET; SITE; 15:03:03; ; Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.1; .NET4.0C); 

Example not to match (it does contain a word that starts with "en")

GET; SITE; 13:06:49; ; Mozilla/4.0 (compatible; **en;** MSIE 8.0; Windows NT 6.1; Trident/4.0; SIMBAR={E76F6580-EB92-49A3-A089-F6B8B9DEA9AA}; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; eSobiSubscriber 2.0.4.16; Media Center PC 5.0; SLCC1; .NET4.0C); ; 


As far as I can tell, your regex matches strings that start with one of the language codes and have a total length of 4 to 7, not 2 to 5. So en; does not match because it only contains three characters. The {2,5} applies only to the expression to its immediate left, so your regex reads "A word that starts with it/en/de etc. and continues with between two and five letters/dashes/semicolons." Try \b(it|en|de|es|fr|ru)[a-zA-Z-;]{0,3}.

You might also want to be explicit about the semicolon being the last character, and perhaps also be more specific about the structure of the ISO language codes (which I assume these strings are): \b(it|en|de|es|fr|ru)(-[a-zA-Z]{2})?;?\b. Here, we say "A word that starts with it/en/de etc. and might continue with a dash and two letters, and (irrespective of whether it had the dash and two letters) might continue with a semicolon. Nothing else is allowed before the word ends."
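To compare the patterns, here is a quick check against shortened versions of the two example lines. The variable names and the shortened sample strings are mine, just for illustration, and I wrote the character class as [a-zA-Z;-] so the dash sits unambiguously at the end:

```php
<?php
// Shortened versions of the two sample lines from the question (illustrative only).
$clean  = 'GET; SITE; 15:03:03; ; Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; .NET4.0C);';
$tagged = 'GET; SITE; 13:06:49; ; Mozilla/4.0 (compatible; en; MSIE 8.0; Windows NT 6.1; .NET4.0C);';

$patterns = [
    // The question's pattern: needs 2 to 5 more characters after the code.
    'original' => '/\b(it|en|de|es|fr|ru)[a-zA-Z;-]{2,5}/',
    // First suggestion: 0 to 3 more characters after the code.
    'relaxed'  => '/\b(it|en|de|es|fr|ru)[a-zA-Z;-]{0,3}/',
    // Second suggestion: code, optional "-XX", optional ";", then a word boundary.
    'strict'   => '/\b(it|en|de|es|fr|ru)(-[a-zA-Z]{2})?;?\b/',
];

foreach ($patterns as $name => $pattern) {
    printf(
        "%-8s clean=%d tagged=%d\n",
        $name,
        preg_match($pattern, $clean),
        preg_match($pattern, $tagged)
    );
}
```

Only the two suggested patterns report a match on the tagged line. Note that the {0,3} variant has no trailing boundary, so it would also report a match for a longer word such as "entry", while the second pattern would not.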


The easiest way to do this would be to first split your data into individual lines and then check them one at a time:

$lines = explode("\n", $data); // I'm making an assumption here, discussed below.
foreach ($lines as $line)
{
  if (!preg_match('/\b(?=it|en|de|es|fr|ru)[a-z;-]{2,5}/i', $line))
  {
    // line doesn't contain a word beginning with en, de, etc.
  }
}

Your use of the \b word boundary metacharacter should work correctly; \b matches at the start of the string if the first character is a word character.

I am using a positive lookahead assertion ((?=)) to check if the first two characters of the word are the language codes you are looking for. This avoids the problem that @Aasmund Eldhuset pointed out in his answer. In other words, the regular expression engine looks for words that begin with the language codes you want to exclude, but then the result of the match is logically inverted by PHP, so any lines containing those words are ignored.
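Wrapped into a function, the loop might look something like the sketch below. filterLines is a name I made up, and the three sample lines are invented stand-ins for your data:

```php
<?php
// Hypothetical helper: keep only the lines that contain no word
// starting with one of the listed codes.
function filterLines(array $lines): array
{
    $pattern = '/\b(?=it|en|de|es|fr|ru)[a-z;-]{2,5}/i';

    // preg_match returns 0 when the pattern is not found, so those lines are kept.
    $kept = array_filter($lines, function (string $line) use ($pattern): bool {
        return preg_match($pattern, $line) === 0;
    });

    // Re-index so the result is a plain list.
    return array_values($kept);
}

// Invented sample data, one record per line.
$data = "GET; SITE; Mozilla/4.0 (compatible; MSIE 8.0)\n"
      . "GET; SITE; Mozilla/4.0 (compatible; en; MSIE 8.0)\n"
      . "GET; SITE; Mozilla/4.0 (compatible; it-IT; MSIE 8.0)";

$kept = filterLines(explode("\n", $data));
print_r($kept);
```

Only the first sample line survives. Because of the i flag and the missing end boundary, the pattern also rejects a line containing a longer word such as "Description", so tighten it if that is too broad for your data.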


I'm making the assumption that your data is split into lines by a single \n (newline) character. It might be split by \r or \r\n instead. If you don't know which newline characters are being used, you can use preg_split instead of explode, i.e.:

$lines = preg_split('/\r\n|\r|\n/', $data); // try \r\n first so it isn't split twice
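As an alternative sketch, PCRE also has the \R escape, which matches any newline sequence and treats \r\n as a single unit. The sample string here is invented to mix all three conventions:

```php
<?php
// Invented sample mixing Unix (\n), Windows (\r\n) and old Mac (\r) line endings.
$data = "first\nsecond\r\nthird\rfourth";

// \R matches \r\n as one unit, so no empty entries appear between lines.
$lines = preg_split('/\R/', $data);

print_r($lines);
```

This should give four entries with no empty strings in between.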


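The magic character you're looking for is the caret (^), which anchors the match to the start of the string:

!preg_match("/^(it|en|de|es|fr|ru)[a-zA-Z-;]{2,5}/", $value)

Other than that, it looks good. Keep in mind that this only checks the very beginning of $value, so it won't catch a code that appears later in the line.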


You can use a look-ahead assertion:

/\b(?!it|en|de|es|fr|ru)[a-zA-Z-;]{2,5}/

Here (?!…) asserts that the pattern inside it does not match at the current position, without consuming any characters.
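One caveat, as a sketch: applied per word, this pattern succeeds as soon as any word does not start with one of the codes, so it can't tell the two kinds of lines apart by itself. If you want a single pattern that succeeds only for clean lines, one option (the $wholeLine pattern is my own variation, and the sample lines are shortened) is to anchor the negative look-ahead at the start of the line:

```php
<?php
// Shortened sample lines (illustrative only).
$clean  = 'GET; SITE; Mozilla/4.0 (compatible; MSIE 8.0);';
$tagged = 'GET; SITE; Mozilla/4.0 (compatible; en; MSIE 8.0);';

// Per-word negative look-ahead: succeeds on both lines, because "GET" already qualifies.
$perWord = '/\b(?!it|en|de|es|fr|ru)[a-zA-Z;-]{2,5}/';

// Whole-line variant (my assumption): succeeds only if no listed code,
// optionally followed by "-XX", appears as a word anywhere in the line.
$wholeLine = '/^(?!.*\b(?:it|en|de|es|fr|ru)(?:-[a-zA-Z]{2})?\b)/';

var_dump(preg_match($perWord, $clean));    // int(1)
var_dump(preg_match($perWord, $tagged));   // int(1)
var_dump(preg_match($wholeLine, $clean));  // int(1)
var_dump(preg_match($wholeLine, $tagged)); // int(0)
```

With the whole-line variant you keep a line when preg_match returns 1, so no negation is needed in PHP.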
