开发者

python url fetch help - regex

I have a web site where there are links like 开发者_JAVA技巧<a href="http://www.example.com?read.php=123"> Can anybody show me how to get all the numbers (123, in this case) in such links using python? I don't know how to construct a regex. Thanks in advance.


import re
re.findall("\?read\.php=(\d+)",data)


"If you have a problem, and decide to use regex, now you have two problems..."

If you are reading one particular web page and you know how it is formatted, then regex is fine - you can use S. Mark's answer. To parse a particular link, you can use Kimvai's answer. However, to get all the links from a page, you're better off using something more serious. Any regex solution you come up with will have flaws,

I recommend mechanize. If you notice, the Browser class there has a links method which gets you all the links in a page. It has the added benefit of being able to download the page for you =) .


This will work irrespective of how your links are formatted (e.g. if some look like <a href="foo=123"/> and some look like <A TARGET="_blank" HREF='foo=123'/>).

import re
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
p = re.compile('^.*=([\d]*)$')
for a in soup.findAll('a'):
   m = p.match(a["href"])
   if m:
      print m.groups()[0]


While the other answers are sort of correct, you should probably use the urllib2 library instead;

from urllib2 import urlparse
import re
urlre = re.compile('<a[^>]+href="([^"]+)"[^>]*>',re.IGNORECASE)
links = urlre.findall('<a href="http://www.example.com?read.php=123">')
for link in links:
    url = urlparse.urlparse(link)
    s = [x.split("=") for x in url[4].split(';')]
    d = {}
    for k,v in s:
        d[k]=v
    print d["read.php"]

It's not as simple as some of the above, but guaranteed to work even with more complex urls.


/[0-9]/

thats the regex sytax you want

for reference see

http://gnosis.cx/publish/programming/regular_expressions.html


One without the need for regex

>>> s='<a href="http://www.example.com?read.php=123">'
>>> for item in s.split(">"):
...     if "href" in item:
...         print item[item.index("a href")+len("a href="): ]
...
"http://www.example.com?read.php=123"

if you want to extract the numbers

item[item.index("a href")+len("a href="): ].split("=")[-1]
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜