preg_match exclude strings

2023-02-13 10:26

From 10,000 lines of data, I need to get all the lines that don't contain words that START with "en", "it", "de", etc., are 2 to 5 characters long, and consist of a-z, A-Z, "-" (minus sign) and ";".

I tried this, but it doesn't work:

 !preg_match("/\b(it|en|de|es|fr|ru)[a-zA-Z-;]{2,5}/", $value)

As I read it, this should reject every line that has a word starting with it, en, etc. that is 2 to 5 characters long, where those characters may also include "-" or ";".

This returns lines containing "it;", which I need to exclude.

EDIT: I need to match every word that starts with those 2 characters (it, en, de, etc.), and the word can be anywhere in the line.

Example to match (it doesn't contain words that start with "en", "de", etc.):

GET; SITE; 15:03:03; ; Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.1; .NET4.0C);

Example not to match (it does contain a word that starts with "en"):

GET; SITE; 13:06:49; ; Mozilla/4.0 (compatible; **en;** MSIE 8.0; Windows NT 6.1; Trident/4.0; SIMBAR={E76F6580-EB92-49A3-A089-F6B8B9DEA9AA}; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; eSobiSubscriber 2.0.4.16; Media Center PC 5.0; SLCC1; .NET4.0C); ;

As far as I can tell, your regex matches strings that start with one of the language codes and have a total length of 4 to 7, not 2 to 5. So en; does not match because it contains only three symbols. The {2,5} applies only to the expression to its immediate left, so your regex reads "A word that starts with it/en/de etc. and continues with between two and five letters/dashes/semicolons." Try \b(it|en|de|es|fr|ru)[a-zA-Z-;]{0,3}.
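The length arithmetic above can be checked with a small sketch. The sample tokens are hypothetical, and the hyphen is moved to the end of the character class, which should be equivalent and avoids any ambiguity about ranges.

```php
<?php
// Original quantifier: prefix (2 chars) + 2..5 more chars = 4..7 total.
$original = '/\b(it|en|de|es|fr|ru)[a-zA-Z;-]{2,5}/';
// Adjusted quantifier: prefix (2 chars) + 0..3 more chars = 2..5 total.
$adjusted = '/\b(it|en|de|es|fr|ru)[a-zA-Z;-]{0,3}/';

// Hypothetical sample tokens, not taken from the real data set.
$samples = ['en;', 'en-US;', 'MSIE 8.0;'];

foreach ($samples as $sample) {
    printf(
        "%-10s original=%d adjusted=%d\n",
        $sample,
        preg_match($original, $sample),
        preg_match($adjusted, $sample)
    );
}
```

With these inputs, only the adjusted pattern reports a match for "en;", because a single trailing character cannot satisfy {2,5}.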

You might also want to be explicit about the semicolon being the last character, and perhaps also be more specific about the structure of the ISO language codes (which I assume these strings are): \b(it|en|de|es|fr|ru)(-[a-zA-Z]{2})?;?\b. Here, we say "A word that starts with it/en/de etc. and might continue with a dash and two letters, and (irrespective of whether it had the dash and two letters) might continue with a semicolon. Nothing else is allowed before the word ends."
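As a rough illustration of the stricter pattern, the sketch below runs it against a few hypothetical tokens. The point is that ordinary words which merely begin with the same two letters (such as "entry" or "desktop") should no longer count as hits.

```php
<?php
// Pattern from the paragraph above: prefix, optional "-XX" region part,
// optional semicolon, then a word boundary.
$pattern = '/\b(it|en|de|es|fr|ru)(-[a-zA-Z]{2})?;?\b/';

// Hypothetical tokens chosen for illustration only.
$tokens = ['en;', 'en-US;', 'entry', 'desktop', 'Trident/4.0;'];

$hits = [];
foreach ($tokens as $token) {
    // preg_match returns 1 on a match and 0 otherwise.
    $hits[$token] = preg_match($pattern, $token) === 1;
}

var_export($hits);
```

Here "en;" and "en-US;" are reported as hits, while "entry", "desktop" and "Trident/4.0;" are not, because the word continues past the point where the pattern requires a boundary.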

The easiest way to do this would be to first split your data into individual lines and then check them one at a time:

$lines = explode("\n", $data); // I'm making an assumption here, discussed below.
foreach ($lines as $line)
{
  if (!preg_match('/\b(?=it|en|de|es|fr|ru)[a-z;-]{2,5}/i', $line))
  {
    // line doesn't contain a word beginning with en, de, etc.
  }
}

Your use of the \b word boundary metacharacter should work correctly; \b matches at the start of the string if the first character is a word character.

I am using a positive lookahead assertion, (?=...), to check whether the first two characters of the word are one of the language codes you are looking for. This avoids the problem that @Aasmund Eldhuset pointed out in his answer. In other words, the regular expression engine looks for words that begin with the language codes you want to exclude, and the result of the match is then logically inverted by PHP, so any lines containing those words are ignored.
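The same filtering can also be sketched with preg_grep and its PREG_GREP_INVERT flag, which returns the array entries that do not match. The two lines below are shortened versions of the examples in the question, and the variable names are only illustrative.

```php
<?php
// Lookahead pattern from the answer above.
$pattern = '/\b(?=it|en|de|es|fr|ru)[a-z;-]{2,5}/i';

// Shortened versions of the two example lines from the question.
$lines = [
    'GET; SITE; 15:03:03; ; Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; .NET4.0C);',
    'GET; SITE; 13:06:49; ; Mozilla/4.0 (compatible; en; MSIE 8.0; Windows NT 6.1; .NET4.0C); ;',
];

// PREG_GREP_INVERT keeps only the entries that do NOT match the pattern.
// array_values reindexes the result, since preg_grep preserves keys.
$kept = array_values(preg_grep($pattern, $lines, PREG_GREP_INVERT));

print_r($kept);
```

Only the first line is kept here, since the second contains "en;". Note that with the /i flag and no trailing boundary, a token such as "eSobiSubscriber" would also count as starting with "es", so real data may need a stricter pattern.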

I'm assuming that your data is split into lines by a single \n (newline) character. It might be split by \r or \r\n instead. If you don't know which newline characters are being used, you can use preg_split instead of explode, i.e.:

$lines = preg_split('/\r\n|\r|\n/', $data);
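A minimal sketch of splitting data with mixed line endings follows. It assumes the alternation tries \r\n first, so that a Windows line ending is consumed as one separator rather than producing an empty line in between. The sample string is made up.

```php
<?php
// Made-up input mixing Windows (\r\n), Unix (\n) and old Mac (\r) endings.
$data = "first\r\nsecond\nthird\rfourth";

// \r\n is listed first so it is treated as a single separator.
$lines = preg_split('/\r\n|\r|\n/', $data);

print_r($lines);
```

With this input the result should contain exactly four entries and no empty strings.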

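If the word must appear at the very start of the line, the character you're looking for is the caret (^), which anchors the match to the start of the string: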

!preg_match("/^(it|en|de|es|fr|ru)[a-zA-Z-;]{2,5}/", $value)

Other than that, looks good.
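To see what the caret changes, the sketch below applies the anchored pattern to two hypothetical strings. It only matches when the language code is the first thing in the string, so a code in the middle of a line (as in the question's example) is not detected.

```php
<?php
// Anchored pattern from the answer above, with the hyphen moved to the
// end of the class (equivalent, but unambiguous).
$anchored = '/^(it|en|de|es|fr|ru)[a-zA-Z;-]{2,5}/';

// Hypothetical inputs: the code at the start vs. in the middle.
$atStart  = 'en-US; Mozilla/4.0';
$inMiddle = 'Mozilla/4.0 (compatible; en-US; MSIE 8.0)';

var_dump(preg_match($anchored, $atStart));   // matches at position 0
var_dump(preg_match($anchored, $inMiddle));  // no match: string starts with "Mo"
```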

You can use a look-ahead assertion:

/\b(?!it|en|de|es|fr|ru)[a-zA-Z-;]{2,5}/

Here (?!…) asserts that the enclosed pattern does not match at the current position, without consuming any characters.
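A short sketch of how the negative lookahead behaves on a hypothetical fragment: it skips words that start with one of the codes, but still matches the other words. This means preg_match on the whole line returns 1 even when the line contains "en;", so the pattern by itself does not tell you whether such a word is present.

```php
<?php
// Negative-lookahead pattern from the answer above, with the hyphen moved
// to the end of the class (equivalent, but unambiguous).
$pattern = '/\b(?!it|en|de|es|fr|ru)[a-zA-Z;-]{2,5}/';

// Hypothetical fragment containing one language code.
$fragment = 'compatible; en; MSIE';

// Collect every match to see which words the pattern accepts.
preg_match_all($pattern, $fragment, $found);

print_r($found[0]);
var_dump(preg_match($pattern, $fragment));
```

The matches are the first five letters of "compatible" and "MSIE"; "en" is skipped by the lookahead, yet the fragment as a whole still matches.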
