Using Scrapy to get links in Python?

Sorry if this is a dumb question, but I have absolutely no idea how to use Scrapy. I don't want to create a Scrapy crawler (or whatever); I want to incorporate it into my existing code. I've looked at the docs, but I found them a bit confusing.

What I need to do is get links from a list on the site. I just need an example to better understand it. Also, is it possible to use a for loop to do something with each list item? They are structured like this:

<ul>
  <li>example</li>
</ul>

Thanks!


You might want to consider BeautifulSoup, which is great for parsing HTML/XML; its documentation is quite helpful as well. Getting the links would look something like this:

import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

# Parse only the <a> tags, then keep those that carry an href attribute.
soup = BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a'))
for link in soup.find_all('a', href=True):
    print(link['href'])

SoupStrainer removes the need to parse the entire document when all you're after are the links.

EDIT: I just saw that you need to use Scrapy. I'm afraid I haven't used it, but try the official documentation; it looks like it has what you're after.


Maybe you don't need Scrapy if it's that simple.

cat local.html

<html><body>
<ul>  
<li>example</li>  
<li>example2</li>
</ul>
<div><a href="test">test</a><div><a href="hi">hi</a></div></div>
</body></html>

then...

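from urllib.request import urlopen
from lxml import html

# Requires the lxml and cssselect packages.
page = urlopen("file:///root/local.html")
root = html.parse(page).getroot()

# Print the text of every list item.
for item in root.cssselect("li"):
    print(item.text_content())

# Print the href of every link.
for href in root.xpath('//a/@href'):
    print(href)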
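
If you would rather avoid third-party packages altogether, the same two loops can be sketched with the standard library's `html.parser`. This is only an illustration under the assumption that the page looks like the local.html sample above; the `ListAndLinkParser` class and the `extract` helper are hypothetical names, not part of any library.

```python
from html.parser import HTMLParser


class ListAndLinkParser(HTMLParser):
    """Collect the text of each <li> and the href of each <a>."""

    def __init__(self):
        super().__init__()
        self.items = []      # text of every <li>
        self.links = []      # href of every <a> that has one
        self._in_li = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self._buffer = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "li" and self._in_li:
            self.items.append("".join(self._buffer).strip())
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self._buffer.append(data)


# Same content as the local.html sample.
LOCAL_HTML = """<html><body>
<ul>
<li>example</li>
<li>example2</li>
</ul>
<div><a href="test">test</a><div><a href="hi">hi</a></div></div>
</body></html>"""


def extract(html_text):
    """Return (list item texts, link hrefs) found in html_text."""
    parser = ListAndLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.items, parser.links


if __name__ == "__main__":
    items, links = extract(LOCAL_HTML)
    for item in items:
        print(item)
    for link in links:
        print(link)
```

This is more verbose than the lxml version and does not handle nested lists, so lxml or BeautifulSoup is probably still the better fit for anything beyond a simple page.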