Python: replace urls with title names from a string
I would like to remove URLs from a string and replace them with the titles of the pages they point to.
For example:
mystring = "Ah I like this site: http://www.stackoverflow.com. Also I must say I like http://www.digg.com"
sanitize(mystring) # it becomes "Ah I like this site: Stack Overflow. Also I must say I like Digg - The Latest News Headlines, Videos and Images"
To replace a URL with its title, I have written this snippet:
import urllib
import BeautifulSoup

# get_title: string -> string
def get_title(url):
    """Returns the title of the input URL"""
    output = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
    return output.title.string
I now need to apply this function to a string so that each URL in it is found and replaced with its title via get_title.
Here is a question with information on validating a URL in Python: How do you validate a URL with a regular expression in Python?
The urlparse module is probably your best bet. You will still have to decide what constitutes a valid URL in the context of your application.
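One possible valid_url, as a rough sketch rather than a definitive check. In Python 3 the urlparse module lives at urllib.parse. The rule used here (an http or https scheme plus a non-empty host) is only an assumption for illustration; your application may need something stricter or looser.

```python
from urllib.parse import urlparse

def valid_url(word):
    """Return True if word looks like an absolute http(s) URL.

    Assumption: "valid" means an http/https scheme plus a host name.
    """
    try:
        parts = urlparse(word)
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. unbalanced IPv6 brackets
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

Note that this only inspects the shape of the string. It does not check that the host exists or that the page can be fetched.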
To check the string for URLs, iterate over each word in the string, test whether it is a valid URL, and if so replace it with the title.
Example code (you will need to write valid_url):
def sanitize(mystring):
    for word in mystring.split(" "):
        if valid_url(word):
            mystring = mystring.replace(word, get_title(word))
    return mystring
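One thing to watch with splitting on spaces: in the example from the question the first URL is followed by a period, so the word being checked is "http://www.stackoverflow.com." with the dot attached. A hedged variant that peels trailing punctuation off before the check might look like this. Here valid_url and get_title are passed in as arguments, which is an assumption made so the sketch can run without network access, and the set of punctuation characters is a guess.

```python
def sanitize(mystring, valid_url, get_title):
    """Replace each URL-like word in mystring with its title.

    valid_url and get_title are passed in as arguments (an assumption
    made here so the sketch runs without network access).
    """
    words = []
    for word in mystring.split(" "):
        # Peel off trailing punctuation so "http://a.com." is checked as "http://a.com"
        url = word.rstrip(".,;:!?)")
        if valid_url(url):
            # Swap in the title and put the stripped punctuation back
            word = get_title(url) + word[len(url):]
        words.append(word)
    return " ".join(words)
```

Rebuilding the string word by word also avoids str.replace touching every occurrence of the same text elsewhere in the string.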
You can probably solve this using regular expressions and substitution (re.sub accepts a function, which is passed the Match object for each occurrence and returns the string to replace it with):
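import re
url = re.compile(r"https?://\S+")  # rough pattern; also grabs trailing punctuation
# re.sub passes a Match object, so hand get_title the matched text
text = url.sub(lambda m: get_title(m.group(0)), text)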
The difficult part is writing a regexp that matches a URL: no more, no less.
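To illustrate the "no more, no less" problem, here is a sketch of a pattern that stops before common trailing punctuation, so a sentence-ending period is not swallowed into the URL. The pattern is an approximation and not a full URL grammar, and replace_urls with its get_title parameter are hypothetical names used so the example can be tried with a stub instead of a real page fetch.

```python
import re

# Rough pattern (an assumption, not a full URL grammar): a scheme, then
# non-space characters, ending on a character that is not common punctuation.
URL_RE = re.compile(r"https?://\S*[^\s.,;:!?)\]]")

def replace_urls(text, get_title):
    """Replace every match of URL_RE in text with get_title(url)."""
    # re.sub calls the function with a Match object; group(0) is the matched URL
    return URL_RE.sub(lambda match: get_title(match.group(0)), text)
```

With a dictionary lookup standing in for get_title, the string from the question comes out with "Stack Overflow." keeping its period. A URL that legitimately ends in one of the excluded characters would be cut short, which is the trade-off of this approach.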