How can I translate this XPath expression to BeautifulSoup?
In answer to a previous question, several people suggested that I use BeautifulSoup for my project. I've been struggling with its documentation and I just cannot parse it. Can somebody point me to the section where I should be able to translate this expression to a BeautifulSoup expression?
hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+')
The above expression is from Scrapy. I'm trying to apply the regex re('/.a\w+') to the hrefs in the td cells with class altRow to get the links from there.
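For reference, a minimal sketch of that XPath-plus-regex chain in BeautifulSoup terms (shown here with the modern bs4 package and a tiny made-up HTML fragment standing in for the real page; the same findAll calls exist in BeautifulSoup 3):

```python
import re
from bs4 import BeautifulSoup  # BeautifulSoup 3 era: from BeautifulSoup import BeautifulSoup

# Hypothetical stand-in for the real page markup.
html = """
<table><tr>
  <td class="altRow"><a href="/offices/london">London</a></td>
  <td class="altRow"><a href="/cabel">Abel, Christian</a></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# //td[@class="altRow"][2] -> second matching <td>; XPath is 1-based, Python 0-based
second_td = soup.findAll("td", "altRow")[1]
# .../a/@href -> the href attribute of its <a>
href = second_td.a["href"]
# Scrapy's .re('/.a\w+') -> an ordinary re.search on the extracted string
match = re.search(r"/.a\w+", href)
print(match.group())  # /cabel
```

The mapping is: the XPath location step becomes findAll plus indexing, attribute access becomes dictionary-style lookup on the tag, and Scrapy's `.re()` convenience becomes an explicit `re.search`.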
I would also appreciate pointers to any other tutorials or documentation. I couldn't find any.
Thanks for your help.
Edit: I am looking at this page:
>>> soup.head.title
<title>White & Case LLP - Lawyers</title>
>>> soup.find(href=re.compile("/cabel"))
>>> soup.find(href=re.compile("/diversity"))
<a href="/diversity/committee">Committee</a>
Yet, if you look at the page source, "/cabel" is there:
<td class="altRow" valign="middle" width="34%">
<a href='/cabel'>Abel, Christian</a>
For some reason, these results are not visible to BeautifulSoup, but they are visible to XPath, because hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+') catches "/cabel".
Edit: cobbal: It is still not working. But when I search this:
>>> soup.findAll(href=re.compile(r'/.a\w+'))
[<link href="/FCWSite/Include/styles/main.css" rel="stylesheet" type="text/css" />, <link rel="shortcut icon" type="image/ico" href="/FCWSite/Include/main_favicon.ico" />, <a href="/careers/northamerica">North America</a>, <a href="/careers/middleeastafrica">Middle East Africa</a>, <a href="/careers/europe">Europe</a>, <a href="/careers/latinamerica">Latin America</a>, <a href="/careers/asia">Asia</a>, <a href="/diversity/manager">Diversity Director</a>]
>>>
it returns all the links whose second character after the slash is "a", but not the lawyer names. So for some reason those links (such as "/cabel") are not visible to BeautifulSoup. I don't understand why.
One option is to use lxml (I'm not familiar with BeautifulSoup, so I can't say how to do it there); lxml supports XPath out of the box.
Edit:
try (now tested):
soup.findAll('td', 'altRow')[1].findAll('a', href=re.compile(r'/.a\w+'), recursive=False)
I used docs at http://www.crummy.com/software/BeautifulSoup/documentation.html
soup should be a BeautifulSoup object
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html_string)
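Putting the pieces together, the href= keyword filter also takes a compiled regex directly, which is the closest one-step analogue to Scrapy's .re(). A small illustration (written against bs4 with a stand-in snippet; the BS3 call is the same):

```python
import re
from bs4 import BeautifulSoup  # BS3 equivalent: BeautifulSoup.BeautifulSoup(html_string)

# Hypothetical stand-in for the fetched page.
html_string = '<td class="altRow"><a href="/cabel">Abel, Christian</a></td>'
soup = BeautifulSoup(html_string, "html.parser")

# A compiled regex passed as href= is matched against each tag's href attribute.
links = soup.findAll(href=re.compile(r'/.a\w+'))
print([a["href"] for a in links])  # ['/cabel']
```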
I know BeautifulSoup is the canonical HTML parsing module, but sometimes you just want to scrape out some substrings from some HTML, and pyparsing has some useful methods to do this. Using this code:
from pyparsing import makeHTMLTags, withAttribute, SkipTo
import urllib
# get the HTML from your URL
url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
page = urllib.urlopen(url)
html = page.read()
page.close()
# define opening and closing tag expressions for <td> and <a> tags
# (makeHTMLTags also comprehends tag variations, including attributes,
# upper/lower case, etc.)
tdStart,tdEnd = makeHTMLTags("td")
aStart,aEnd = makeHTMLTags("a")
# only interested in tdStarts if they have "class=altRow" attribute
tdStart.setParseAction(withAttribute(("class","altRow")))
# compose total matching pattern (add trailing tdStart to filter out
# extraneous <td> matches)
patt = tdStart + aStart("a") + SkipTo(aEnd)("text") + aEnd + tdEnd + tdStart
# scan input HTML source for matching refs, and print out the text and
# href values
for ref, s, e in patt.scanString(html):
    print ref.text, ref.a.href
I extracted 914 references from your page, from Abel to Zupikova.
Abel, Christian /cabel
Acevedo, Linda Jeannine /jacevedo
Acuña, Jennifer /jacuna
Adeyemi, Ike /igbadegesin
Adler, Avraham /aadler
...
Zhu, Jie /jzhu
ZÃdek, AleÅ¡ /azidek
Ziółek, Agnieszka /aziolek
Zitter, Adam /azitter
Zupikova, Jana /jzupikova
I just answered this on the Beautiful Soup mailing list as a response to Zeynel's email to the list. Basically, there is an error in the web page that totally kills Beautiful Soup 3.1 during parsing, but is merely mangled by Beautiful Soup 3.0.
The thread is located at the Google Groups archive.
It seems that you are using BeautifulSoup 3.1. I suggest reverting to BeautifulSoup 3.0.7 (because of this problem).
I just tested with 3.0.7 and got the results you expect:
>>> soup.findAll(href=re.compile(r'/cabel'))
[<a href="/cabel">Abel, Christian</a>]
Testing with BeautifulSoup 3.1 gets the results you are seeing. There is probably a malformed tag in the html but I didn't see what it was in a quick look.