How to parse a string containing URLs and turn them into proper links

Let's say I have the following string from Twitter:

"This is my sample test blah blah http://t.co/pE6JSwG, hello all"

How can I parse this string and change the link to <a href="link">link</a>? Here is the code that parses user tags:

    import re

    tweet = s.text
    # Require at least one character after '@' so a bare '@' is not matched.
    user_regex = re.compile(r'@([0-9a-zA-Z+_]+)', re.IGNORECASE)

    # re.sub rewrites each mention exactly once; calling tweet.replace() in a
    # loop would also rewrite text inside links that were already inserted.
    tweet = user_regex.sub(
        r'<a href="http://twitter.com/\1" title="@\1">@\1</a>',
        tweet)

And my current regex for URLs:

    http_regex = re.compile(r'[A-Za-z]+:\/\/[A-Za-z0-9-_]+\.[A-Za-z0-9-_:%&\?\/.=]*', re.IGNORECASE)


    >>> import re
    >>> test = "This is my sample test blah blah http://t.co/pE6JSwG, hello all"
    >>> re.sub('http://[^ ,]*', lambda t: "<a href='%s'>%s</a>" % (t.group(0), t.group(0)), test)
    "This is my sample test blah blah <a href='http://t.co/pE6JSwG'>http://t.co/pE6JSwG</a>, hello all"

This only works if you treat characters such as the comma and the space as valid stopping points for your URL.
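If you would rather not hard-code the stopping characters in the pattern, one possible variation is to match everything up to the next whitespace and then trim trailing punctuation. This is only a sketch: the helper name linkify and the TRAILING set are made up for illustration, it assumes Python 3 (for html.escape), and it assumes punctuation at the very end of a match belongs to the sentence rather than to the URL.

```python
import re
from html import escape

# Assumption: a URL runs until the next whitespace character.
URL_RE = re.compile(r'https?://\S+')
# Assumption: these characters, when trailing, are sentence punctuation.
TRAILING = '.,;:!?)\'"'

def linkify(text):
    """Wrap each http(s) URL in text in an <a> tag (hypothetical helper)."""
    def repl(match):
        raw = match.group(0)
        url = raw.rstrip(TRAILING)        # drop trailing punctuation
        tail = raw[len(url):]             # keep it outside the link
        safe = escape(url, quote=True)    # escape &, <, > and quotes
        return '<a href="%s">%s</a>%s' % (safe, safe, tail)
    return URL_RE.sub(repl, text)

test = "This is my sample test blah blah http://t.co/pE6JSwG, hello all"
print(linkify(test))
```

Only the URL itself is escaped here; the surrounding tweet text is passed through unchanged, so it would still need escaping before being placed in a page.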

In general, you should probably avoid regexes for URL matching, because there is often no reliable way to tell where a URL ends. If the string is guaranteed to have the same format every time, this solution will work. If the URLs are always the same length, you can instead search for "http" and take the substring of that length starting there.
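The fixed-length idea could look roughly like this. It is a sketch under a strong assumption, namely that every link in the text has exactly the same length; the function name linkify_fixed and its parameters are hypothetical.

```python
def linkify_fixed(text, url_len, prefix="http://"):
    """Link every occurrence of prefix, taking exactly url_len characters.

    Hypothetical helper: only correct if all URLs share the same length.
    """
    if url_len < len(prefix):
        raise ValueError("url_len must cover at least the prefix")
    out = []
    pos = 0
    while True:
        start = text.find(prefix, pos)
        if start == -1:
            break
        url = text[start:start + url_len]
        out.append(text[pos:start])                     # text before the URL
        out.append('<a href="%s">%s</a>' % (url, url))  # the link itself
        pos = start + len(url)
    out.append(text[pos:])                              # remaining text
    return "".join(out)

test = "This is my sample test blah blah http://t.co/pE6JSwG, hello all"
# Assumption: every URL is as long as the one in the sample string.
print(linkify_fixed(test, len("http://t.co/pE6JSwG")))
```

This avoids guessing where the URL ends, but it breaks as soon as a link of a different length appears, so it is only worth considering when the input format is tightly controlled.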


Perhaps you could get inspiration from the source code of the django-oembed project.
