
Scrapy parsing issue with malformed br tags

I have an HTML file with URLs separated by <br> tags, e.g.

<a href="example.com/page1.开发者_Python百科html">Site1</a><br/>
<a href="example.com/page2.html">Site2</a><br/>
<a href="example.com/page3.html">Site3</a><br/>

Note the line break tag is <br/> instead of <br />. Scrapy is able to parse and extract the first URL but fails to extract anything after that. If I put a space before the slash, it works fine. The HTML is malformed, but I've seen this error on multiple sites, and since browsers display the page correctly, I'm hoping Scrapy (or the underlying lxml / libxml2 / BeautifulSoup) can parse it correctly too.
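A minimal way to check what the selector actually extracts, outside a full spider (a sketch using the modern scrapy Selector API; older Scrapy versions exposed HtmlXPathSelector and .extract() instead):

from scrapy.selector import Selector

html = """<a href="example.com/page1.html">Site1</a><br/>
<a href="example.com/page2.html">Site2</a><br/>
<a href="example.com/page3.html">Site3</a><br/>"""

# if the parser mishandles <br/>, only the first href comes back
print(Selector(text=html).xpath('//a/@href').getall())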


lxml.html parses it fine. Just use that instead of the bundled HtmlXPathSelector.

import lxml.html

bad_html = """<a href="example.com/page1.html">Site1</a><br/>
<a href="example.com/page2.html">Site2</a><br/>
<a href="example.com/page3.html">Site3</a><br/>"""

# fromstring() wraps a multi-element fragment in an implicit <div>,
# so the <a> elements become its direct children
tree = lxml.html.fromstring(bad_html)

for link in tree.iterfind('a'):
    print(link.attrib['href'])

Results in:

example.com/page1.html
example.com/page2.html
example.com/page3.html

So if you want to use this method in a CrawlSpider, you just need to write your own (simple or complex) link extractor.

E.g.

import lxml.html
from scrapy.link import Link

class SimpleLinkExtractor:
    def extract_links(self, response):
        # parse the raw body with lxml's forgiving HTML parser
        tree = lxml.html.fromstring(response.body)
        # CrawlSpider rules expect Link objects, not bare strings
        return [Link(url=href) for href in tree.xpath('//a/@href')]

And then use that in your spider:

from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(SimpleLinkExtractor(), callback='parse_item'),
    )

    # etc ...
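The rule above references a parse_item callback that isn't shown; a minimal placeholder (just a sketch, replace the body with your real item logic) would be:

    def parse_item(self, response):
        # hypothetical callback for the Rule above; build and
        # yield your items here
        self.logger.info('visited %s', response.url)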


Just use <br> tags instead of <br/> tags, as current HTML conventions suggest.
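If you can't change the source HTML, the question's own observation (a space before the slash parses fine) suggests a crude workaround: normalize the body before it reaches the parser. A rough sketch (fix_br_tags is a made-up name):

def fix_br_tags(body):
    # per the question, '<br />' parses fine while '<br/>' does not,
    # so rewrite the raw bytes before building the tree
    return body.replace(b'<br/>', b'<br />')

which you could apply as tree = lxml.html.fromstring(fix_br_tags(response.body)).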

