How to parse a string containing URLs and turn them into proper links
Let's say I have the following string from Twitter:
"This is my sample test blah blah http://t.co/pE6JSwG, hello all"
How can I parse this string and change the link to <a href="link">link</a>? Here's the code that parses user tags:
import re

tweet = s.text  # s is assumed to be the tweet/status object
user_regex = re.compile(r'@[0-9a-zA-Z+_]*', re.IGNORECASE)
for tt in user_regex.finditer(tweet):
    # Strip the leading '@' to get the user name for the profile URL
    url_tweet = tt.group(0).replace('@', '')
    tweet = tweet.replace(tt.group(0),
                          '<a href="http://twitter.com/' +
                          url_tweet + '" title="' +
                          tt.group(0) + '">' +
                          tt.group(0) + '</a>')
And my current regex for URLs:
http_regex = re.compile(r'[A-Za-z]+:\/\/[A-Za-z0-9-_]+\.[A-Za-z0-9-_:%&\?\/.=]*', re.IGNORECASE)
>>> test = "This is my sample test blah blah http://t.co/pE6JSwG, hello all"
>>> re.sub('http://[^ ,]*', lambda t: "<a href='%s'>%s</a>" % (t.group(0), t.group(0)), test)
"This is my sample test blah blah <a href='http://t.co/pE6JSwG'>http://t.co/pE6JSwG</a>, hello all"
This only works if you treat characters such as the comma and the space as valid stopping points for the URL. Note that the pattern also matches only http://, not https://.
In general, a regex is probably not a reliable way to match URLs, since there may be no good way to tell where a URL ends. If the string is guaranteed to have the same format every time, this solution will work. Alternatively, if the URLs are always the same length, you can find the "http" and take the substring of that length starting there.
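If you do go the regex route, one way to soften the "where does the URL end" problem is to match greedily up to whitespace and then trim trailing sentence punctuation. The following is only a sketch under stated assumptions: `linkify`, `URL_RE` and `TRAILING` are names made up for illustration, the punctuation list is a guess, and URLs that legitimately end in one of those characters (for example a closing parenthesis) will be cut short. It also HTML-escapes the surrounding text, which the one-liner above does not.

```python
import html
import re

# Assumption: a URL starts with http:// or https:// and runs until
# whitespace, an angle bracket, or a quote character.
URL_RE = re.compile(r'https?://[^\s<>"\']+')

# Assumption: these characters are sentence punctuation, not part of
# the URL, when they appear at the very end of a match.
TRAILING = '.,;:!?)'


def linkify(text):
    """Return HTML-escaped text with each URL wrapped in an <a> tag."""
    parts = []
    last = 0
    for m in URL_RE.finditer(text):
        # Drop trailing punctuation such as the comma after the link.
        url = m.group(0).rstrip(TRAILING)
        end = m.start() + len(url)
        # Escape the plain text between the previous URL and this one.
        parts.append(html.escape(text[last:m.start()]))
        safe = html.escape(url)
        parts.append('<a href="%s">%s</a>' % (safe, safe))
        last = end
    # Escape whatever follows the last URL (including trimmed punctuation).
    parts.append(html.escape(text[last:]))
    return ''.join(parts)
```

Calling `linkify()` on the sample tweet leaves the comma outside the link. Building the result piece by piece, instead of calling `str.replace` on the whole string, avoids rewriting a URL a second time when it occurs more than once.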
Perhaps you could get inspiration from the source code of the django-oembed project.