How to replace links using lxml and iterlinks
I'm new to lxm开发者_如何学Pythonl and I'm trying to figure how to rewrite links using iterlinks().
import lxml.html
html = lxml.html.document_fromstring(doc)
for element, attribute, link, pos in html.iterlinks():
if attibute == "src":
link = link.replace('foo', 'bar')
print lxml.html.tostring(html)
However, this doesn't actually replace the links. I know I can use .rewrite_links, but iterlinks provides more information about each link, so I would prefer to use this.
Thanks in advance.
Instead of just assigning a new (string) value to the variable name link
, you have to alter the element itself, in this case by setting its src
attribute:
new_src = link.replace('foo', 'bar') # or element.get('src').replace('foo', 'bar')
element.set('src', new_src)
Note that - if you know which "links" you are interested in, for example, only img
elements - you can also get the elements by using .findall()
(or xpath or css selectors) instead of using .iterlinks()
.
lxml provides a rewrite_links
method (or function that you pass the text to be parsed into a document) to provide a method of changing all links in a document:
.rewrite_links(link_repl_func, resolve_base_href=True, base_href=None): This rewrites all the links in the document using your given link replacement function. If you give a base_href value, all links will be passed in after they are joined with this URL. For each link link_repl_func(link) is called. That function then returns the new link, or None to remove the attribute or tag that contains the link. Note that all links will be passed in, including links like "#anchor" (which is purely internal), and things like "mailto:bob@example.com" (or javascript:...).
Probably link is just a copy of the actual object. Try replacing the attribute of the element in your loop. Even element can be just a copy, but it deserves a try...
Here is working code with rewrite_links:
from lxml.html import fromstring, tostring
e = fromstring("<html><body><a href='http://localhost'>hello</body></html>")
def my_rewriter(link):
return "http://newlink.com"
e.rewrite_links(my_rewriter)
print(tostring(e))
Output:
b'<html><body><a href="http://newlink.com">hello</a></body></html>'
精彩评论