开发者

How to replace links using lxml and iterlinks

I'm new to lxm开发者_如何学Pythonl and I'm trying to figure how to rewrite links using iterlinks().

import lxml.html
html = lxml.html.document_fromstring(doc)
for element, attribute, link, pos in html.iterlinks():
    if attibute == "src":
         link = link.replace('foo', 'bar')
print lxml.html.tostring(html)

However, this doesn't actually replace the links. I know I can use .rewrite_links, but iterlinks provides more information about each link, so I would prefer to use this.

Thanks in advance.


Instead of just assigning a new (string) value to the variable name link, you have to alter the element itself, in this case by setting its src attribute:

new_src = link.replace('foo', 'bar') # or element.get('src').replace('foo', 'bar')
element.set('src', new_src)

Note that - if you know which "links" you are interested in, for example, only img elements - you can also get the elements by using .findall() (or xpath or css selectors) instead of using .iterlinks().


lxml provides a rewrite_links method (or function that you pass the text to be parsed into a document) to provide a method of changing all links in a document:

.rewrite_links(link_repl_func, resolve_base_href=True, base_href=None): This rewrites all the links in the document using your given link replacement function. If you give a base_href value, all links will be passed in after they are joined with this URL. For each link link_repl_func(link) is called. That function then returns the new link, or None to remove the attribute or tag that contains the link. Note that all links will be passed in, including links like "#anchor" (which is purely internal), and things like "mailto:bob@example.com" (or javascript:...).


Probably link is just a copy of the actual object. Try replacing the attribute of the element in your loop. Even element can be just a copy, but it deserves a try...


Here is working code with rewrite_links:

from lxml.html import fromstring, tostring

e = fromstring("<html><body><a href='http://localhost'>hello</body></html>")

def my_rewriter(link):
  return "http://newlink.com"

e.rewrite_links(my_rewriter)
print(tostring(e))

Output:

    b'<html><body><a href="http://newlink.com">hello</a></body></html>'
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜