Need PHP Regex help
I've been working on this simple script all day trying to figure it out. I'm new to regex so please keep that in mind. On top of that, I've tried just about anything and everything I could to get this to work.
I'm trying to download a TSV file from Yahoo Site Explorer via either cURL or file_get_contents (both work, I'm just messing with different things) and then use regex to extract only the URL column. This is a learning exercise, so please don't point me to the API. I realize I might have more luck with other functions, but I can't find anything dealing with TSV and now it's become a challenge. I've literally spent the entire day trying to get this right.
So a URL would be:
https://siteexplorer.search.yahoo.com/search?p=www.google.com&bwm=i&bwmo=&bwmf=s
And my regex currently looks like this (I know it's horrible...it's probably the millionth attempt):
preg_match_all('((http(s?)://?(([^/]+(\/.+))))^[\t]$)', $dl, $matches);
My issue right now is that there are four columns: TITLE, URL, SIZE, and FORMAT. I'm able to strip out everything from the first column (TITLE) and the last column (FORMAT), but I cannot seem to strip out the SIZE column, and my pattern also requires a trailing slash, which fails when the linking sites' URLs don't end with one.
Another thing: I have actually managed to get JUST the URL to appear, but only for URLs with a trailing slash, which leaves out links from, say, Twitter.
Any help would be greatly appreciated!
I don't know much about PHP, but this regex works in Python (it should be the same in PHP):
".+?\t(.+?)\t.*"
Just match it and get the content of group 1. FWIW, code in Python:
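import re
import fileinput

# skip the first field, capture the second (the URL), ignore the rest
urlre = re.compile(r".+?\t(.+?)\t.*")

# reads the files named on the command line, or stdin if none are given
for line in fileinput.input():
    m = urlre.match(line)
    if m:
        print(m.group(1))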
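To see what that kind of pattern captures, here is a minimal sketch run against two made-up rows. The sample titles and URLs and the `extract_urls` helper are illustrative assumptions, not real Site Explorer output. It uses `[^\t\n]*` rather than `.+?` so each field stops explicitly at the next tab, and `re.MULTILINE` so the whole download can be scanned in one call.

```python
import re

# Hypothetical sample in the TITLE<TAB>URL<TAB>SIZE<TAB>FORMAT layout
# described in the question; these rows are invented for illustration.
SAMPLE = (
    "Example Home\thttp://www.example.com/\t1024\ttext/html\n"
    "Some Profile\thttp://twitter.com/example\t2048\ttext/html\n"
)

# Skip the first field, capture the second, and require a tab after it.
URL_COLUMN = re.compile(r"^[^\t\n]*\t([^\t\n]*)\t", re.MULTILINE)


def extract_urls(text):
    """Return the second tab-separated field of every line that has one."""
    return URL_COLUMN.findall(text)


if __name__ == "__main__":
    for url in extract_urls(SAMPLE):
        print(url)
```

Because the capture group only excludes tabs and newlines, a URL with or without a trailing slash is returned unchanged.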
Personally, I'd split the lines by tab. For example:
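$stuff = file_get_contents($url);
// split the whole file by newlines, to get an array of lines
$lines = explode("\n", $stuff);
$urls = array();
$urlsByTitle = array();
// loop through the lines
foreach ($lines as $line) {
    // split by tab (dropping a stray carriage return from Windows line endings)
    $parts = explode("\t", rtrim($line, "\r"));
    // skip blank or malformed lines that have no URL column
    if (count($parts) < 2) {
        continue;
    }
    // put the URLs in a list
    $urls[] = $parts[1];
    // or keep track of them by title
    $urlsByTitle[$parts[0]] = $parts[1];
    // or whatever...
}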
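For comparison with the Python snippet earlier in the thread, the same split-by-tab idea can be sketched with Python's standard `csv` module, which handles tab-delimited text when given `delimiter="\t"`. The `url_column` helper and the sample rows are assumptions made up for this example.

```python
import csv
import io


def url_column(tsv_text):
    """Yield the second field of each tab-separated row, skipping short rows."""
    # QUOTE_NONE treats quote characters in titles as ordinary text.
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        # Blank lines come back as empty lists, so they are skipped here.
        if len(row) > 1:
            yield row[1]


if __name__ == "__main__":
    # Hypothetical rows in TITLE<TAB>URL<TAB>SIZE<TAB>FORMAT order.
    sample = (
        "First\thttp://a.example/x\t1\thtml\n"
        "\n"
        "Second\thttp://b.example/\t2\thtml\n"
    )
    for url in url_column(sample):
        print(url)
```

No regular expression is involved, so there is nothing to tune when a URL does or does not end in a slash.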
Just use parse_url or parse_str instead. Always look for an alternative to regular expressions, which tend to be slower than plain string functions.
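Note that a URL parser works on one URL at a time, so it applies after the URL column has been isolated from the TSV. As a rough sketch in Python, where `urllib.parse.urlsplit` plays a role similar to PHP's `parse_url`, the hypothetical `host_and_path` helper below splits a URL and drops a single trailing slash, so URLs with and without one can be treated alike.

```python
from urllib.parse import urlsplit


def host_and_path(url):
    """Split one URL into (host, path), dropping one trailing slash from the path."""
    parts = urlsplit(url)
    path = parts.path[:-1] if parts.path.endswith("/") else parts.path
    return parts.netloc, path


if __name__ == "__main__":
    # Illustrative URLs, one with and one without a trailing slash.
    for candidate in ("http://www.example.com/", "http://twitter.com/example"):
        print(host_and_path(candidate))
```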