Python TypeError on regex [duplicate]

2023-02-14 11:38 问答作者：

This question already has answers here: TypeError: c开发者_Go百科an't use a string pattern on a bytes-like object in re.findall() (3 answers) Closed 4 years ago.

So, I have this code:

url = 'http://google.com'
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
m = urllib.request.urlopen(url)
msg = m.read()
links = linkregex.findall(msg)

But then python returns this error:

links = linkregex.findall(msg)
TypeError: can't use a string pattern on a bytes-like object

What did I do wrong?

TypeError: can't use a string pattern on a bytes-like object

what did i do wrong??

You used a string pattern on a bytes object. Use a bytes pattern instead:

linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
                       ^
            Add the b there, it makes it into a bytes object

(ps:

 >>> from disclaimer include dont_use_regexp_on_html
 "Use BeautifulSoup or lxml instead."

)

If you are running Python 2.6 then there isn't any "request" in "urllib". So the third line becomes:

m = urllib.urlopen(url)

And in version 3 you should use this:

links = linkregex.findall(str(msg))

Because 'msg' is a bytes object and not a string as findall() expects. Or you could decode using the correct encoding. For instance, if "latin1" is the encoding then:

links = linkregex.findall(msg.decode("latin1"))

Well, my version of Python doesn't have a urllib with a request attribute but if I use "urllib.urlopen(url)" I don't get back a string, I get an object. This is the type error.

The url you have for Google didn't work for me, so I substituted http://www.google.com/ig?hl=en for it which works for me.

Try this:

import re
import urllib.request

url="http://www.google.com/ig?hl=en"
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
m = urllib.request.urlopen(url)
msg = m.read():
links = linkregex.findall(str(msg))
print(links)

Hope this helps.

The regular expression pattern and string have to be of the same type. If you're matching a regular string, you need a string pattern. If you're matching a byte string, you need a bytes pattern.

In this case m.read() returns a byte string, so you need a bytes pattern. In Python 3, regular strings are unicode strings, and you need the b modifier to specify a byte string literal:

linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')

That worked for me in python3. Hope this helps

import urllib.request
import re
urls = ["https://google.com","https://nytimes.com","http://CNN.com"]
i = 0
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)

while i < len(urls) :
    htmlfile = urllib.request.urlopen(urls[i])
    htmltext = htmlfile.read()
    titles = re.search(pattern, str(htmltext))
    print(titles)
    i+=1

And also this in which i added b before regex to convert it into byte array.

import urllib.request
import re
urls = ["https://google.com","https://nytimes.com","http://CNN.com"]
i = 0
regex = b'<title>(.+?)</title>'
pattern = re.compile(regex)

while i < len(urls) :
    htmlfile = urllib.request.urlopen(urls[i])
    htmltext = htmlfile.read()
    titles = re.search(pattern, htmltext)
    print(titles)
    i+=1

继续阅读：python python-3.x regex typeerror

Python TypeError on regex [duplicate]

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？