Get <img>'s title-attribute with lxml in Python

2023-03-18 19:12

I want to extract the oneliner texts from this website using Python. The messages in HTML look like this:

<div class="olh_message"> 
    <p>foobarbaz <img src="/static/emoticons/support-our-fruits.gif" title=":necta:" /></p> 
</div>

My code looks like this so far:

import lxml.html
url = "http://www.scenemusic.net/demovibes/oneliner/"
xpath = "//div[@class='olh_message']/p"
tree = lxml.html.parse(url)
texts = tree.xpath(xpath)
texts = [text.text_content() for text in texts]
print(texts)

However, this only gives me foobarbaz. I would also like to get the title attribute of the img elements inside it, so in this example foobarbaz :necta:. It seems I need lxml's DOM parser to do it, but I have no idea how. Can anyone give me a hint?

Thanks in advance!

Try this:

  from lxml import etree

  url = "http://www.scenemusic.net/demovibes/oneliner/"
  parser = etree.HTMLParser()
  tree = etree.parse(url, parser)
  # Select the title attribute of every <img> inside a message paragraph.
  titles = tree.xpath("//div[@class='olh_message']/p/img/@title")
  print(titles)
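
Note that the XPath above returns only the title values, not the surrounding text. One way to interleave the two is to walk each p element and use lxml's .text and .tail properties. The following is a minimal sketch, assuming the sample markup from the question (parsed from a string here instead of the live URL); the helper name message_text is hypothetical.

```python
import lxml.html

# Sample markup from the question, parsed from a string instead of the live URL.
SAMPLE = """
<div class="olh_message">
    <p>foobarbaz <img src="/static/emoticons/support-our-fruits.gif" title=":necta:" /></p>
</div>
"""


def message_text(p):
    """Hypothetical helper: join a <p>'s text with the titles of its <img> children."""
    parts = [p.text or ""]
    for child in p:
        if child.tag == "img":
            # Substitute the image with its title attribute.
            parts.append(child.get("title", ""))
        elif isinstance(child.tag, str):
            # Any other element contributes its text; comments and PIs are skipped.
            parts.append(child.text_content())
        # Text that follows the child inside the same <p>.
        parts.append(child.tail or "")
    return "".join(parts).strip()


tree = lxml.html.fromstring(SAMPLE)
messages = [message_text(p) for p in tree.xpath("//div[@class='olh_message']/p")]
print(messages)  # ['foobarbaz :necta:']
```

For the live page, the same helper should work on a tree obtained with lxml.html.parse(url), as in the question.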

Use:

//div[@class='olh_message']/p/node()

This selects all child nodes (elements, text nodes, PIs and comment nodes) of any p element that is a child of any div element whose class attribute is 'olh_message'.
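
In lxml, this expression returns a mixed list: text nodes come back as strings and elements as element objects, so the two can be told apart with isinstance. A rough sketch of how that could be used to build the wanted string, assuming the sample markup from the question; the helper name render is hypothetical.

```python
import lxml.html

# Sample markup from the question, parsed from a string instead of the live URL.
SAMPLE = """
<div class="olh_message">
    <p>foobarbaz <img src="/static/emoticons/support-our-fruits.gif" title=":necta:" /></p>
</div>
"""


def render(node):
    """Hypothetical helper: turn one node() result into plain text."""
    if isinstance(node, str):
        # Text nodes are returned by lxml as str subclasses.
        return node
    if not isinstance(node.tag, str):
        # Comments and processing instructions carry no message text.
        return ""
    if node.tag == "img":
        return node.get("title", "")
    return node.text_content()


tree = lxml.html.fromstring(SAMPLE)
lines = []
for p in tree.xpath("//div[@class='olh_message']/p"):
    # node() yields the children of <p> in document order.
    lines.append("".join(render(n) for n in p.xpath("node()")).strip())
print(lines)  # ['foobarbaz :necta:']
```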

Verification using XSLT as the host of XPath:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
     <xsl:copy-of select="//div[@class='olh_message']/p/node()"/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XML document:

<div class="olh_message">
    <p>foobarbaz 
        <img src="/static/emoticons/support-our-fruits.gif" title=":necta:" />
    </p>
</div>

the wanted, correct result is produced (showing that exactly the wanted nodes have been selected by the XPath expression):

foobarbaz 
        <img src="/static/emoticons/support-our-fruits.gif" title=":necta:"/>
