How can I translate this XPath expression to BeautifulSoup?
In answer to a previous question, several people suggested that I use BeautifulSoup for my project. I've been struggling with its documentation and I just cannot parse it. Can somebody point me to the section where I should be able to translate this expression to a BeautifulSoup expression?
hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+')
The above expression is from Scrapy. I'm trying to apply the regex re('/.a\w+') to the hrefs in the td cells with class altRow to get the links from there.
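For reference, a minimal sketch of that XPath-plus-regex chain in BeautifulSoup terms (shown here with the modern bs4 package and a tiny made-up HTML fragment standing in for the real page; the same findAll calls exist in BeautifulSoup 3):

```python
import re
from bs4 import BeautifulSoup  # BeautifulSoup 3 era: from BeautifulSoup import BeautifulSoup

# Hypothetical stand-in for the real page markup.
html = """
<table><tr>
  <td class="altRow"><a href="/offices/london">London</a></td>
  <td class="altRow"><a href="/cabel">Abel, Christian</a></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# //td[@class="altRow"][2] -> second matching <td>; XPath is 1-based, Python 0-based
second_td = soup.findAll("td", "altRow")[1]
# .../a/@href -> the href attribute of its <a>
href = second_td.a["href"]
# Scrapy's .re('/.a\w+') -> an ordinary re.search on the extracted string
match = re.search(r"/.a\w+", href)
print(match.group())  # /cabel
```

The mapping is: the XPath location step becomes findAll plus indexing, attribute access becomes dictionary-style lookup on the tag, and Scrapy's `.re()` convenience becomes an explicit `re.search`.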
I would also appreciate pointers to any other tutorials or documentation. I couldn't find any.
Thanks for your help.
Edit: I am looking at this page:
>>> soup.head.title
<title>White & Case LLP - Lawyers</title>
>>> soup.find(href=re.compile("/cabel"))
>>> soup.find(href=re.compile("/diversity"))
<a href="/diversity/committee">Committee</a>
Yet, if you look at the page source, "/cabel" is there:
<td class="altRow" valign="middle" width="34%">
<a href='/cabel'>Abel, Christian</a>
For some reason, these results are not visible to BeautifulSoup, but they are visible to XPath, because hxs.select('//td[@class="altRow"][2]/a/@href').re('/.a\w+') catches "/cabel".
Edit: cobbal: It is still not working. But when I search this:
>>> soup.findAll(href=re.compile(r'/.a\w+'))
[<link href="/FCWSite/Include/styles/main.css" rel="stylesheet" type="text/css" />, <link rel="shortcut icon" type="image/ico" href="/FCWSite/Include/main_favicon.ico" />, <a href="/careers/northamerica">North America</a>, <a href="/careers/middleeastafrica">Middle East Africa</a>, <a href="/careers/europe">Europe</a>, <a href="/careers/latinamerica">Latin America</a>, <a href="/careers/asia">Asia</a>, <a href="/diversity/manager">Diversity Director</a>]
>>>
it returns all the links whose second character after the slash is "a", but not the lawyer names. So for some reason those links (such as "/cabel") are not visible to BeautifulSoup. I don't understand why.
One option is to use lxml (I'm not familiar with BeautifulSoup, so I can't say how to do it there); lxml supports XPath out of the box.
Edit:
try (now tested):
soup.findAll('td', 'altRow')[1].findAll('a', href=re.compile(r'/.a\w+'), recursive=False)
I used docs at http://www.crummy.com/software/BeautifulSoup/documentation.html
soup should be a BeautifulSoup object
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html_string)
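Putting the pieces together, the href= keyword filter also takes a compiled regex directly, which is the closest one-step analogue to Scrapy's .re(). A small illustration (written against bs4 with a stand-in snippet; the BS3 call is the same):

```python
import re
from bs4 import BeautifulSoup  # BS3 equivalent: BeautifulSoup.BeautifulSoup(html_string)

# Hypothetical stand-in for the fetched page.
html_string = '<td class="altRow"><a href="/cabel">Abel, Christian</a></td>'
soup = BeautifulSoup(html_string, "html.parser")

# A compiled regex passed as href= is matched against each tag's href attribute.
links = soup.findAll(href=re.compile(r'/.a\w+'))
print([a["href"] for a in links])  # ['/cabel']
```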
I know BeautifulSoup is the canonical HTML parsing module, but sometimes you just want to scrape out some substrings from some HTML, and pyparsing has some useful methods to do this. Using this code:
from pyparsing import makeHTMLTags, withAttribute, SkipTo
import urllib
# get the HTML from your URL
url = "http://www.whitecase.com/Attorneys/List.aspx?LastName=&FirstName="
page = urllib.urlopen(url)
html = page.read()
page.close()
# define opening and closing tag expressions for <td> and <a> tags
# (makeHTMLTags also comprehends tag variations, including attributes,
# upper/lower case, etc.)
tdStart,tdEnd = makeHTMLTags("td")
aStart,aEnd = makeHTMLTags("a")
# only interested in tdStarts if they have "class=altRow" attribute
tdStart.setParseAction(withAttribute(("class","altRow")))
# compose total matching pattern (add trailing tdStart to filter out
# extraneous <td> matches)
patt = tdStart + aStart("a") + SkipTo(aEnd)("text") + aEnd + tdEnd + tdStart
# scan input HTML source for matching refs, and print out the text and
# href values
for ref, s, e in patt.scanString(html):
    print ref.text, ref.a.href
I extracted 914 references from your page, from Abel to Zupikova.
Abel, Christian /cabel
Acevedo, Linda Jeannine /jacevedo
Acuña, Jennifer /jacuna
Adeyemi, Ike /igbadegesin
Adler, Avraham /aadler
...
Zhu, Jie /jzhu
ZÃdek, AleÅ¡ /azidek
Ziółek, Agnieszka /aziolek
Zitter, Adam /azitter
Zupikova, Jana /jzupikova
I just answered this on the Beautiful Soup mailing list as a response to Zeynel's email to the list. Basically, there is an error in the web page that totally kills Beautiful Soup 3.1 during parsing, but is merely mangled by Beautiful Soup 3.0.
The thread is located at the Google Groups archive.
It seems that you are using BeautifulSoup 3.1. I suggest reverting to BeautifulSoup 3.0.7 (because of this problem).
I just tested with 3.0.7 and got the results you expect:
>>> soup.findAll(href=re.compile(r'/cabel'))
[<a href="/cabel">Abel, Christian</a>]
Testing with BeautifulSoup 3.1 gets the results you are seeing. There is probably a malformed tag in the html but I didn't see what it was in a quick look.