Problem with regex syntax
I have a text file with many lines, and I want to extract the version and date from them.
What regex would be appropriate to capture something like v 1.31.6.7 2008/03/07
into an array,
from many text files like this:
This file may contain proprietary rules that were created, tested and certified by Sourcefire, Inc. (the "VRT Certified Rules") as well as rules that were created by Sourcefire and other third parties and $Id: ddos.rules,v 1.31.6.7 2008/03/07 20:53:40 vrtbuild Exp DDOS RULES
Versions can differ, for example: v 1.48.6.12
(something in this format).
Dates differ too.
Suppose I have many lines that repeat this pattern:
$Id: ddos.rules,v 1.31.6.7 2008/03/07 20:53:40 vrtbuild Exp
$Id: exploit.rules,v 1.116.6.53 2008/11/18 16:36:27 vrtbuild Exp $
$Id: misc.rules,v 1.77.6.20 2008/10/17 19:36:59 vrtbuild Exp $
$Id: smtp.rules,v 1.77.6.19 2008/10/17 19:37:00 vrtbuild Exp $
$Id: tftp.rules,v 1.28.6.6 2008/07/22 17:59:06 vrtbuild Exp $
$Id: web-iis.rules,v 1.110.6.11 2008/07/22 17:59:06 vrtbuild Exp $
$Id: web-attacks.rules,v 1.23 2005/05/16 22:18:17 mwatchinski Exp $
with different dates and versions (v).
I found this date pattern:
^(((0[1-9]|[12]\d|3[01])\/(0[13578]|1[02])\/((19|[2-9]\d)\d{2}))|((0[1-9]|[12]\d|30)\/(0[13456789]|1[012])\/((19|[2-9]\d)\d{2}))|((0[1-9]|1\d|2[0-8])\/02\/((19|[2-9]\d)\d{2}))|(29\/02\/((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00))))$
Can someone explain it?
Your date regex:
^(((0[1-9]|[12]\d|3[01])\/(0[13578]|1[02])\/((19|[2-9]\d)\d{2}))|((0[1-9]|[12]\d|30)\/(0[13456789]|1[012])\/((19|[2-9]\d)\d{2}))|((0[1-9]|1\d|2[0-8])\/02\/((19|[2-9]\d)\d{2}))|(29\/02\/((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00))))$
... is quite interesting. I decided to analyze it to see precisely what it matches. It turns out that this regex matches all valid dates in the format DD/MM/YYYY from the year 1900 to 9999. Interestingly, it also correctly matches all valid leap days (29/02) from the year 1600 to 9999. This regex understands the valid number of days in each month. It knows that May has 31 days and June has only 30. It also knows that February has 28 days, except in leap years, when it has 29. Here it is broken down so that it can be read by mere mortals:
$re_date = '%
    # Match all valid DD/MM/YYYY dates from 1900 to 9999 and
    # all leap days from year 1600 to 9999.
    ^                            # Anchor to start of string.
    (                            # $1:
      (                          # $2: Date format alternative 1: (months having 31 days)
        (0[1-9]|[12]\d|3[01])    # $3: Day: 01-09,10-19,20-29,30,31
        \/
        (0[13578]|1[02])         # $4: Month: 01,03,05,07,08,10,12
        \/
        ((19|[2-9]\d)\d{2})      # $5,$6: Year: 1900-9999
      )                          # End $2:
    | (                          # $7: Date format alternative 2: (any month except 02, days 01-30)
        (0[1-9]|[12]\d|30)       # $8: Day: 01-09,10-19,20-29,30
        \/
        (0[13456789]|1[012])     # $9: Month: 01,03-09,10-12
        \/
        ((19|[2-9]\d)\d{2})      # $10,$11: Year: 1900-9999
      )                          # End $7:
    | (                          # $12: Date format alternative 3: (month having 28 days)
        (0[1-9]|1\d|2[0-8])      # $13: Day: 01-09,10-19,20-28
        \/
        02                       # Month: 02
        \/
        ((19|[2-9]\d)\d{2})      # $14,$15: Year: 1900-9999
      )                          # End $12:
    | (                          # $16: Date format alternative 4: (leap days)
        29                       # Day: 29
        \/
        02                       # Month: 02
        \/                       # Match all valid leap day dates from year 1600 to 9999.
        (                        # $17: Year alt 1 (divisible by 4 but not 100)
          (1[6-9]|[2-9]\d)       # $18: Century part: 16-19,20-99
          ( 0[48]                # $19: Year part: Either 04 or 08
          | [2468][048]          # or 20,24,28,40,44,48,60,64,68,80,84,88
          | [13579][26]          # or 12,16,32,36,52,56,72,76,92,96
          )                      # End $19:
        | (                      # or $20: Year alternative 2 (divisible by 400)
            ( 16                 # $21: Century part: Either 16
            | [2468][048]        # or 20,24,28,40,44,48,60,64,68,80,84,88
            | [3579][26]         # or 32,36,52,56,72,76,92,96
            )                    # End $21:
            00                   # Year part: 00
          )                      # End $20:
        )                        # End $17:
      )                          # End $16:
    )                            # End $1:
    $                            # Anchor to end of string.
    %x';
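A quick way to check this analysis is to run the compact regex from the question against a few sample dates. This is only a minimal sketch: the helper name is_valid_dmy and the sample list are made up for illustration, and the pattern is the question's regex wrapped in % delimiters.

```php
<?php
// Compact form of the date regex from the question, using % as the delimiter.
$re_date = '%^(((0[1-9]|[12]\d|3[01])\/(0[13578]|1[02])\/((19|[2-9]\d)\d{2}))|((0[1-9]|[12]\d|30)\/(0[13456789]|1[012])\/((19|[2-9]\d)\d{2}))|((0[1-9]|1\d|2[0-8])\/02\/((19|[2-9]\d)\d{2}))|(29\/02\/((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00))))$%';

// Hypothetical helper: true when $s is accepted by the regex $re.
function is_valid_dmy(string $s, string $re): bool
{
    return preg_match($re, $s) === 1;
}

// Sample inputs (chosen for illustration): month lengths, leap years,
// century rules, and one date in the YYYY/MM/DD layout used by the rule files.
$samples = [
    '31/05/2008', '31/06/2008',
    '29/02/2008', '29/02/2007',
    '29/02/1900', '29/02/2000',
    '2008/03/07',
];
foreach ($samples as $s) {
    printf("%-12s %s\n", $s, is_valid_dmy($s, $re_date) ? 'match' : 'no match');
}
```

Of these samples, only 31/05/2008, 29/02/2008 and 29/02/2000 match. The last sample shows that this regex does not accept the YYYY/MM/DD dates that appear in your $Id lines.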
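Note that the dates in your files are written YYYY/MM/DD, so the DD/MM/YYYY regex above will not match them. To solve your immediate problem, here is a more precise regex that does the trick: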
$count = preg_match_all('%
    # Match version/date sub-string
    \b                # Anchor to word boundary.
    (                 # $1: Version number.
      [Vv]            # Version identifier (allow V or v).
      [ ]+            # One or more spaces.
      [0-9]+          # Major version number is one or more digits.
      (?:             # Group minor version numbers.
        \.            # Minor versions separated by dot.
        [0-9]+        # Minor version is one or more digits.
      )*              # Zero or more minor versions.
    )                 # End $1: Version number.
    [ ]+              # One or more spaces.
    (                 # $2: Date.
      [0-9]{4}        # Year is four digits.
      /               # / Separator.
      [0-9]{2}        # Month is two digits.
      /               # / Separator.
      [0-9]{2}        # Day is two digits.
    )                 # End $2: Date.
    %x', $text, $matches);
if ($count > 0) {
    $versions = $matches[1];
    $dates = $matches[2];
    printf("Found %d matches:\n", $count);
    for ($i = 0; $i < $count; ++$i) {
        printf(" Match%3d: Version: %-15s Date: %s\n",
            $i + 1, $versions[$i], $dates[$i]);
    }
} else {
    echo("No matches found.\n");
}
Note: When dealing with non-trivial regexes such as these, it is best to write them in 'x' free-spacing mode. This allows a generous amount of comments and indentation, which makes them much easier to read.
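Since the question asks for the results in an array collected from many text files, here is one possible way to wrap the same pattern. It is a minimal sketch under assumptions: the function name collect_versions and the rules/*.rules location are hypothetical, and the pattern is the commented regex above written compactly.

```php
<?php
// Hypothetical helper: scan every file matching $glob and return a list of
// ['file' => ..., 'version' => ..., 'date' => ...] rows.
function collect_versions(string $glob): array
{
    // Same pattern as the commented regex above, without free-spacing mode.
    $re = '%\b([Vv][ ]+[0-9]+(?:\.[0-9]+)*)[ ]+([0-9]{4}/[0-9]{2}/[0-9]{2})%';
    $rows = [];
    // glob() may return false on error, so fall back to an empty list.
    foreach (glob($glob) ?: [] as $path) {
        $text = file_get_contents($path);
        if ($text === false) {
            continue; // Unreadable file: skip it.
        }
        // PREG_SET_ORDER groups the captures per match instead of per group.
        if (preg_match_all($re, $text, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $m) {
                $rows[] = [
                    'file'    => basename($path),
                    'version' => $m[1],
                    'date'    => $m[2],
                ];
            }
        }
    }
    return $rows;
}

// Example call; the directory and extension are assumptions.
foreach (collect_versions(__DIR__ . '/rules/*.rules') as $row) {
    printf("%s: %s %s\n", $row['file'], $row['version'], $row['date']);
}
```

Each row keeps the file name next to the version and date, which helps when the same version string turns up in more than one file.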
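// $lines is assumed to hold the lines of one file, e.g. from file($path).
foreach ($lines as $line) {
    // Lazily capture the two space-delimited tokens that follow "v ".
    if (preg_match("|v (.*?) (.*?) |", $line, $match)) {
        echo "found version " . $match[1] . " date " . $match[2] . "\n";
    }
}
Is that exactly what you want?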
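The short pattern above matches the first "v " followed by two space-delimited tokens anywhere on the line, so it can pick up unrelated text. The sample lines look like RCS/CVS $Id$ keyword expansions, so a stricter variant could anchor on that keyword. This is a rough sketch under that assumption; the helper name parse_id_line and the group names are made up for illustration.

```php
<?php
// Hypothetical helper: parse one line containing a "$Id: name,v version date ..." keyword.
// Returns ['file' => ..., 'version' => ..., 'date' => ...], or null if nothing matches.
function parse_id_line(string $line): ?array
{
    // Named groups: file name before ",v", dotted version number, YYYY/MM/DD date.
    $re = '~\$Id:\s+(?<file>\S+),v\s+(?<version>\d+(?:\.\d+)*)\s+(?<date>\d{4}/\d{2}/\d{2})\s~';
    if (preg_match($re, $line, $m) !== 1) {
        return null;
    }
    return ['file' => $m['file'], 'version' => $m['version'], 'date' => $m['date']];
}

$line = 'created by Sourcefire and other third parties and '
      . '$Id: ddos.rules,v 1.31.6.7 2008/03/07 20:53:40 vrtbuild Exp';
var_dump(parse_id_line($line));
```

Here the version is captured without the leading "v", and a line without the keyword yields null instead of a false positive.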