Regex to retrieve download link
I've been trying to get my regex to match a wide variety of download links and have narrowed it down to the following.
About 90% of download links start with either " or ' or http and end at " or ' or .exe. Three examples of this
Now the annoying part: I whipped up two regexes that cover this 90%, but there has to be a way to do it with only one line of code. The only thing the user should need to change is the file extension they are looking for.
I tried $ anchoring, but I'm not a regex expert and couldn't get it to work. The idea was to start the match at the first .exe occurrence and then work backwards to the very first " or ' or http that appears before it. Yes, links do start with href= followed by " or ', however you can get href= and I don't know how to account for that. PLUS for some download links you don't want the match to start from href=, and not all of them start with http.
Example
href="/bouncer?t=http%3A%2F%2Fdownload.portableapps.com%2Fportableapps%2Ffoxitreaderportable%2FFoxitReaderPortable_4.2.paf.exe">
The two regexes I have that cover 90% of situations are
["']([^"']+(\.zip|\.rar|\.7z))
and (http[^"']+(\.zip|\.rar|\.7z))
EDIT: This is used in a program called Ketarin, which parses the HTML for me and returns the page source, which I can then run the regex on. I have found that Ketarin processes regexes with the Singleline and IgnoreCase options.
With Singleline, the entire block of text is treated as a single line, so the . character also matches \r\n.
This aside, does anyone know how to start the regex match from the end of the string and work its way back to the first " ' or http found? The closest I got was
$?[^"']*.exe
But I'm not sure how to include http as an alternative (an OR match) in that.
/href[\=][\"]((.*)([.]exe))[\"]/
Try this using a group match (or the scan method if you are using Ruby).
EDIT: Sorry, I based this off something that did work, hoping it would work here too... anyway:
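A minimal sketch of the group-match idea, written in Python purely for illustration (Ketarin itself takes only the pattern). The helper name `first_link` and the sample HTML are made up. The pattern also swaps the greedy `.*` for `[^"]*`, which is an assumption on my part: with Singleline on, `.*` can run past the closing quote and swallow everything up to the last .exe on the page.

```python
import re

# Same shape as the pattern above, but [^"]* replaces the greedy .*
# so a match cannot extend beyond one quoted attribute value.
LINK_RE = re.compile(r'href="([^"]*\.exe)"', re.IGNORECASE | re.DOTALL)

def first_link(html):
    """Return the first double-quoted href ending in .exe, or None (hypothetical helper)."""
    m = LINK_RE.search(html)
    # Group 1 holds only the link, without the surrounding href="...".
    return m.group(1) if m else None

sample = '<a href="/about.html">About</a> <a href="/files/setup.exe">Get it</a>'
print(first_link(sample))
```

This only handles double-quoted href values, so single-quoted or unquoted links would still be missed.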
(?<=href=").+?\.(your|extensions|here)
Hope this one helps. Put your desired extensions separated by | [like (exe|rar|zip....)]
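As a rough sketch of how the lookbehind idea could be stretched to cover the cases in the question, here is a variation in Python. It is not the pattern above: instead of requiring href=, it starts right after either quote character or at a bare http, and stops at the first listed extension. The helper name `build_link_regex` and the example.com URL are made up for illustration, and this has not been tried in Ketarin.

```python
import re

def build_link_regex(extensions):
    """Build one pattern from a list of extensions (hypothetical helper)."""
    exts = "|".join(re.escape(e) for e in extensions)
    # Start just after a quote (lookbehind) or at a bare "http" (lookahead),
    # take as few non-quote characters as possible, then require an extension.
    pattern = r"""(?:(?<=["'])|(?=http))[^"']+?\.(?:""" + exts + ")"
    return re.compile(pattern, re.IGNORECASE)

exe_links = build_link_regex(["exe"])
archive_links = build_link_regex(["zip", "rar", "7z"])

quoted = 'href="/bouncer?t=http%3A%2F%2Fdownload.portableapps.com%2Fportableapps%2Ffoxitreaderportable%2FFoxitReaderPortable_4.2.paf.exe">'
bare = "<a href=http://example.com/files/tool.zip>"

print(exe_links.search(quoted).group(0))
print(archive_links.search(bare).group(0))
```

Because the match may begin after any quote, including a closing one, text such as `"foo" bar.exe` can produce a false hit, so treat this as a starting point rather than a finished pattern.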
Good Luck