Python raw strings and html parsing

Posted: 2023-03-31 10:01

How do python raw strings and string literals work? I'm trying to make a webscraper to download pdfs from a site. When I search the string it works, but when I try to implement it in python I always get None as my answer.

import urllib
import re    
url="" # insert url here
sock=urllib.urlopen(url)
htmlSource=sock.read();
sock.close();

m=re.match(r"<a href.*?pdf[^>]*?", raw(htmlSource))
print m



$ python temp.py
None

The raw function is from here: http://code.activestate.com/recipes/65211-convert-a-string-into-a-raw-string/

That said, how can I complete this program so that I can print out all of the matches and then download the pdfs?

Thanks!

You seem to be very confused.

A 'string literal' is a string that you type into the program. Because there needs to be a clear beginning and end to your string, certain characters become inconvenient to have within the middle of the string, and escape sequences must be used to represent them.

Python offers 'raw' string literals, which have different rules for how escape sequences are interpreted: the same rules are used to figure out where the string ends (so a single backslash followed by the opening quote character doesn't terminate the string), but the backslash sequences are then left in the string exactly as written instead of being transformed. So, while '\'' is a string that consists of a single quote character (the \' in the middle is an escape sequence that produces the quote), r'\'' is a string that consists of a backslash and a quote character.
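
A quick way to check the difference, as a minimal sketch (written for Python 3, whereas the question's code is Python 2; the variable names are only for illustration):

```python
# Ordinary literal: \' is an escape sequence that yields one quote character.
plain = '\''

# Raw literal: the backslash still keeps the quote from ending the literal,
# but it also stays in the resulting string.
raw_lit = r'\''

print(len(plain), len(raw_lit))   # 1 2
print(raw_lit == '\\\'')          # True: backslash followed by a quote
```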

The raw string literal produces an object of type str. It is the same type as produced by an ordinary string literal. These are often used for the pattern for a regex operation, because the strings used for regexes often need to contain a lot of backslashes. If you wanted to write a regex that matched a backslash in the source text, and you didn't have raw string literals, then you would need to put, perhaps surprisingly, four backslashes between the quotes in your source code: the Python compiler would interpret this as a string containing two real backslashes, which in turn represents "match a backslash" in the regex syntax.
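
The two spellings can be compared directly. This is a small sketch in Python 3; the sample text 'C:\\temp' and the variable names are made up for the example:

```python
import re

text = 'C:\\temp'        # 7 characters, one of which is a real backslash

with_raw = r'\\'         # two characters: backslash, backslash
without_raw = '\\\\'     # the same two characters, typed as four backslashes

print(type(with_raw) is str)                # True: there is no separate "raw" type
print(with_raw == without_raw)              # True
print(re.search(with_raw, text).start())    # 2, the index of the backslash in text
```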

The function you found is an imperfect attempt to re-introduce escape sequences into input text. This is not what you want to do, doesn't even really make sense, and doesn't meet the author's own spec anyway. It seems to be based on a misconception similar to your own. The concept of a "raw equivalent of" a string is nonsensical. There is, really, no such thing as "a raw string"; raw string literals are a convenience for creating ordinary strings.

You want to search for the pattern within htmlSource. It is already in the form you need it to be in. Your problem has nothing to do with string escapes. When a string comes from user input, file input, or basically anything other than the program source, it is not processed the way string literals are, unless you explicitly arrange for that to happen. If the web page contains a backslash followed by an n, the string that gets read by urllib contains, in the corresponding spot, exactly that - a backslash followed by an n, not a newline.
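
To see this without touching the network, here is a rough sketch that uses io.StringIO as a stand-in for any outside source of text (the sample content is invented, and the syntax is Python 3):

```python
import io

# Simulate text arriving from outside the program source (a file, a socket,
# a web page). chr(92) is a backslash, so no escape sequence is written here.
incoming = io.StringIO('line one' + chr(92) + 'n' + 'still line one')
data = incoming.read()

print((chr(92) + 'n') in data)                # True: backslash and n are both there
print('\n' in data)                           # False: no newline was created
print(data == r'line one\nstill line one')    # True
```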

The problem is as follows: you want to search the string, as you said: "when I search the string it works". You are currently matching the string. See the documentation:

Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found.

Your pattern does not appear at the beginning of the string, since the HTML for the webpage does not start with the <a> tag you are looking for.

You want m=re.search(r"<a href.*?pdf[^>]*?", htmlSource).
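
The difference is easy to reproduce on a small sample. The sketch below uses a made-up html_source in place of the downloaded page and Python 3 syntax. The last pattern, which captures only the href value, is an assumption about what you want to collect for downloading rather than part of the original code:

```python
import re

# Hypothetical stand-in for the downloaded page.
html_source = (
    '<html><body>\n'
    '<a href="docs/a.pdf">A</a>\n'
    '<a href="index.html">Home</a>\n'
    '<a href="docs/b.pdf" class="dl">B</a>\n'
    '</body></html>'
)

pattern = r'<a href.*?pdf[^>]*?'

# match() only tries position 0, where the text is '<html>', so it fails.
print(re.match(pattern, html_source))    # None

# search() scans forward until the pattern fits somewhere.
found = re.search(pattern, html_source)
print(found.group(0))                    # <a href="docs/a.pdf

# findall() returns every match; the group captures just the link target.
links = re.findall(r'<a href="([^"]*\.pdf)"', html_source)
print(links)                             # ['docs/a.pdf', 'docs/b.pdf']
```

Note that the trailing [^>]*? in the original pattern is lazy and has nothing after it, so it always matches the empty string and the match stops right after "pdf".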

Check out this answer. It seems that Python’s urllib is a lot less user‐friendly — and Unicode‐friendly — than it should be. It seems to force you to deal with ugly raw bytes content instead of decoding it for you into a normal string.
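
For what it's worth, in Python 3 the read() call on the object returned by urllib.request.urlopen gives back bytes, so the content has to be decoded before a str pattern can be applied to it. A minimal sketch, using an invented byte string instead of a real download and assuming the page is UTF-8 encoded:

```python
import re

# Stand-in for what read() hands back in Python 3: bytes, not str.
raw_bytes = '<a href="caf\u00e9.pdf">menu</a>'.encode('utf-8')

# Assumption: the page is UTF-8; a real page may declare another charset.
html_source = raw_bytes.decode('utf-8')

print(type(raw_bytes).__name__, type(html_source).__name__)   # bytes str

target = re.search(r'href="([^"]*)"', html_source).group(1)
print(target == 'caf\u00e9.pdf')                              # True
```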
