Constructing Regex To Extract Multiple Data
I need a regular expression to get the Event, Name, School, Final Swim Time, and Swim Threshold (the DIIA) from a results page like the one at http://www.gliac.org/sports/mswimdive/2010-11/stats/Results_Wed_Finals.htm. Note that the results are separated from the rest of the page by the "pre" HTML tag.
Each "line" looks like this:
1 Donahue, Maura 19 INDY 10:39.77 10:03.60 DIIA
Unfortunately, I'm not sure exactly how to do so. One of the problems (in my mind!) is that sometimes it displays the swimmer's age (19) and other times it doesn't. In addition, sometimes the results show the seed time (10:39.77) and other times only the final time (10:03.60).
I started the regex by trying to split up to the "," in the name, but failed miserably.
I'm using simple_html_dom (the PHP Simple HTML DOM Parser) to extract the contents of the HTML page.
My code looks like this (I'm using PHP):
$results_url = "http://www.gliac.org/sports/mswimdive/2010-11/stats/Results_Wed_Finals.htm";
// Create a DOM object from a URL
$html = file_get_html($results_url);
if (!$html->find('pre')) {
    $parse_error = "Yes";
}
if (!isset($parse_error)) {
    $regex = "/[0-9]+(?=[ \s]+)(?=[A-Za-z]+)/";
    // The third argument of preg_split() is the limit; flags go fourth.
    $splits = preg_split($regex, $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    print_r($splits);
}
If you can help out or point me in the right direction, that would be awesome! Is it even possible to run a regex against the results to extract this data?
Thank you!
I won't pretend to know what all those numbers mean, but here's something to help start you off with the first line of each person.
preg_match_all('/(?P<position>[0-9-]+)\s+(?P<last>[a-z]+)\s*,\s*(?P<first>[a-z]+)\s+((?P<age>[0-9]{2})\s)?(?P<school>[a-z -]+[a-z])\s+(?P<seed>(NT|[0-9:.]+))\s+(?P<final>[0-9:\.]+)\s+(?P<division>[a-z]+)/is', $html, $matches);
print_r($matches);
The regex is very basic and seems to work for now, but when dealing with content you don't control, you may want to account for a lot more. For instance, the name matching won't work with names that contain accented characters or punctuation, such as O'Reilly.
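As a rough sketch of how the pattern could be loosened, the version below allows Unicode letters and apostrophes in names and makes both the age and the seed time optional. It is not tested against the real page: the sample rows are hypothetical, the group name threshold is my own, and it assumes the text is UTF-8 and that every row ends with a threshold code such as DIIA.

```php
<?php
// Hypothetical rows in the layout described in the question.
$text = <<<TXT
  1 Donahue, Maura        19 INDY              10:39.77   10:03.60  DIIA
  2 O'Reilly, Zoë            GVSU                         10:15.42  DIIA
TXT;

// x = allow whitespace and comments in the pattern, m = per-line anchors,
// u = treat pattern and subject as UTF-8 so \p{L} matches accented letters.
$pattern = <<<'REGEX'
/^\s*
 (?P<position>\d+|-+)\s+           # place, or dashes when no place is given
 (?P<last>[\p{L}'. -]+?)\s*,\s*    # last name, may contain ' . - and spaces
 (?P<first>[\p{L}'. -]+?)\s+       # first name
 (?:(?P<age>\d{1,2})\s+)?          # age is optional
 (?P<school>\p{L}[\p{L}. &-]*?)\s+ # school code or name, no digits
 (?:(?P<seed>NT|[\d:.]+)\s+)?      # seed time is optional
 (?P<final>[\d:.]+)\s+             # final time
 (?P<threshold>[A-Z]+)\s*$         # assumed to be present on every row
/mux
REGEX;

// PREG_SET_ORDER yields one array per swimmer instead of one per group.
preg_match_all($pattern, $text, $rows, PREG_SET_ORDER);

foreach ($rows as $row) {
    // Optional groups that did not take part in the match are empty strings.
    $seed = $row['seed'] !== '' ? $row['seed'] : 'n/a';
    printf("%s %s (%s): seed %s, final %s, %s\n",
        $row['first'], $row['last'], $row['school'],
        $seed, $row['final'], $row['threshold']);
}
```

Passing only the text of the pre element rather than the whole page should cut down on accidental matches in the surrounding markup.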
Sounds like you could use either preg_match() or preg_match_all() (see links below)
http://php.net/manual/en/function.preg-match-all.php
http://php.net/manual/en/function.preg-match.php
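One reason to prefer preg_match() on one line at a time is that you can remember which event the current rows belong to. The sketch below is only an illustration: the sample text is made up, and the "Event N ..." heading format is an assumption about how the page labels each event, so check it against the real pre block.

```php
<?php
// Hypothetical excerpt; the "Event N ..." heading layout is an assumption.
$text = <<<TXT
Event 1  Women 1000 Yard Freestyle
  1 Donahue, Maura        19 INDY              10:39.77   10:03.60  DIIA
Event 2  Men 1000 Yard Freestyle
  1 Smith, John              GVSU                          9:20.11  DIIA
TXT;

$event = null;
$results = [];

// \R matches any line-break sequence, so this works for \n and \r\n alike.
foreach (preg_split('/\R/', $text) as $line) {
    // preg_match() returns 1 if the line matches and 0 if it does not.
    if (preg_match('/^\s*Event\s+\d+\s+(?P<title>.+?)\s*$/', $line, $m)) {
        $event = $m['title'];   // remember the heading for the rows below it
        continue;
    }
    // A result row starts with a place number followed by "Last, First".
    if (preg_match('/^\s*\d+\s+(?P<last>[^,]+),\s*(?P<first>\S+)/', $line, $m)) {
        $results[] = [
            'event' => $event,
            'last'  => $m['last'],
            'first' => $m['first'],
        ];
    }
}

print_r($results);
```

preg_match_all() over the whole block is shorter when you only need the rows, but it gives you no easy way to tie a row back to its event heading.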