Scrapy parsing issue with malformed br tags
I have an HTML file with URLs separated by br tags, e.g.
<a href="example.com/page1.开发者_Python百科html">Site1</a><br/>
<a href="example.com/page2.html">Site2</a><br/>
<a href="example.com/page3.html">Site3</a><br/>
Note the line break tag is <br/> instead of <br />. Scrapy is able to parse and extract the first URL, but fails to extract anything after that. If I put a space before the slash, it works fine. The HTML is malformed, but I've seen this on multiple sites, and since browsers display it correctly, I'm hoping Scrapy (or the underlying lxml / libxml2 / BeautifulSoup) can parse it correctly too.
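For reference, a minimal sketch that reproduces what I'm seeing (assuming the HtmlXPathSelector API bundled with the Scrapy version in question):

from scrapy.selector import HtmlXPathSelector

bad_html = """<a href="example.com/page1.html">Site1</a><br/>
<a href="example.com/page2.html">Site2</a><br/>
<a href="example.com/page3.html">Site3</a><br/>"""

hxs = HtmlXPathSelector(text=bad_html)
# Only the first href comes back; everything after the first <br/> is lost.
print(hxs.select('//a/@href').extract())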
lxml.html parses it fine. Just use that instead of the bundled HtmlXPathSelector.
import lxml.html

bad_html = """<a href="example.com/page1.html">Site1</a><br/>
<a href="example.com/page2.html">Site2</a><br/>
<a href="example.com/page3.html">Site3</a><br/>"""

# fromstring() wraps a fragment like this in a container element,
# so the <a> tags become its direct children.
tree = lxml.html.fromstring(bad_html)
for link in tree.iterfind('a'):
    print(link.attrib['href'])
Results in:
example.com/page1.html
example.com/page2.html
example.com/page3.html
So if you want to use this method in a CrawlSpider, you just need to write a simple (or complex) link extractor, e.g.:
import lxml.html
from scrapy.link import Link

class SimpleLinkExtractor:
    def extract_links(self, response):
        tree = lxml.html.fromstring(response.body)
        # '//a' finds links anywhere in the page, and CrawlSpider
        # expects Link objects rather than bare href strings.
        return [Link(url) for url in tree.xpath('//a/@href')]
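A quick standalone check of the extractor (a sketch; the response here is built by hand purely for illustration):

from scrapy.http import HtmlResponse

fake = HtmlResponse(
    url='http://www.example.com',
    body=b'<a href="http://example.com/page1.html">Site1</a><br/>'
         b'<a href="http://example.com/page2.html">Site2</a><br/>',
    encoding='utf-8',
)
for link in SimpleLinkExtractor().extract_links(fake):
    print(link.url)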
And then use that in your spider:
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(SimpleLinkExtractor(), callback='parse_item'),
    )

    # etc ...
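The parse_item callback referenced by the rule isn't shown above; a minimal sketch of what it could look like (the yielded fields are invented for illustration, and .get() assumes a reasonably recent Scrapy):

    def parse_item(self, response):
        # Hypothetical callback: record each crawled page's URL and title.
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').get(),
        }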
Just use <br> tags instead of <br/> tags, as suggested by the latest conventions (HTML5 treats <br> as a void element, so the closing slash is unnecessary).
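If you can't change the markup at its source, the same idea can be applied on the scraping side by normalizing the body before parsing (a sketch; normalize_br is a made-up helper name):

def normalize_br(body):
    # Rewrite self-closing <br/> into plain <br> so stricter parsers cope.
    return body.replace(b'<br/>', b'<br>')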