Python raw strings and HTML parsing
How do Python raw strings and string literals work? I'm trying to make a web scraper to download PDFs from a site. When I search the string it works, but when I try to implement it in Python I always get None as my answer:
import urllib
import re
url = ""  # insert URL here
sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()
m=re.match(r"<a href.*?pdf[^>]*?", raw(htmlSource))
print m
$ python temp.py
None
The raw function is from here: http://code.activestate.com/recipes/65211-convert-a-string-into-a-raw-string/
That said, how can I complete this program so that I can print out all of the matches and then download the pdfs?
Thanks!
You seem to be very confused.
A 'string literal' is a string that you type into the program. Because there needs to be a clear beginning and end to your string, certain characters become inconvenient to have within the middle of the string, and escape sequences must be used to represent them.
Python offers 'raw' string literals, which have different rules for how escape sequences are interpreted. The same rules are used to figure out where the string ends (so a single backslash, followed by the opening quote character, doesn't terminate the string), but the backslash sequences themselves are left untransformed. So, while '\'' is a string that consists of a single quote character (the \' in the middle is an escape sequence that produces the quote), r'\'' is a string that consists of a backslash and a quote character.
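To make the difference concrete, here is a minimal sketch comparing the two literals from the paragraph above. The variable names are only illustrative.

```python
# Ordinary literal: the \' escape sequence is transformed into one quote character.
ordinary = '\''

# Raw literal: the backslash still keeps the quote from ending the literal,
# but it is not transformed, so it stays in the resulting string.
raw_literal = r'\''

print(len(ordinary))     # 1
print(len(raw_literal))  # 2

# Both literals produce objects of the same type.
print(type(ordinary) is type(raw_literal))  # True
```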
The raw string literal produces an object of type str. It is the same type as produced by an ordinary string literal. These are often used for the pattern for a regex operation, because the strings used for regexes often need to contain a lot of backslashes. If you wanted to write a regex that matched a backslash in the source text, and you didn't have raw string literals, then you would need to put, perhaps surprisingly, four backslashes between the quotes in your source code: the Python compiler would interpret this as a string containing two real backslashes, which in turn represents "match a backslash" in the regex syntax.
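The four-backslash point can be checked directly. This is a small sketch; sample_text is a made-up string containing one real backslash.

```python
import re

# Ordinary literal: four backslashes typed, two backslashes in the string.
escaped_pattern = '\\\\'

# Raw literal: two backslashes typed, two backslashes in the string.
raw_pattern = r'\\'

print(escaped_pattern == raw_pattern)  # True
print(len(raw_pattern))                # 2

# Hypothetical text with a single real backslash at index 2.
sample_text = 'C:\\temp'
found = re.search(raw_pattern, sample_text)
print(found.start())  # 2
```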
The function you found is an imperfect attempt to re-introduce escape sequences into input text. This is not what you want to do, doesn't even really make sense, and doesn't meet the author's own spec anyway. It seems to be based on a misconception similar to your own. The concept of a "raw equivalent of" a string is nonsensical. There is, really, no such thing as "a raw string"; raw string literals are a convenience for creating ordinary strings.
You want to search for the pattern within htmlSource. It is already in the form you need it to be in. Your problem has nothing to do with string escapes. When a string comes from user input, file input, or basically anything other than the program source, it is not processed the way string literals are, unless you explicitly arrange for that to happen. If the web page contains a backslash followed by an n, the string that gets read by urllib contains, in the corresponding spot, exactly that: a backslash followed by an n, not a newline.
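Here is a rough illustration of that point. It uses io.StringIO as a stand-in for the object returned by urlopen (an assumption made so the example runs without a network), and builds the backslash with chr(92) so that no literal escaping is involved.

```python
import io

# Text as it might arrive from outside the program: a real backslash
# character followed by the letter n.
page_text = 'line one' + chr(92) + 'n' + 'line two'

# Hypothetical stand-in for the response object; read() returns the text as-is.
fake_response = io.StringIO(page_text)
html_source = fake_response.read()

print(chr(92) + 'n' in html_source)   # True: backslash and n are both still there
print('\n' in html_source)            # False: no newline was created
print(len(html_source.splitlines()))  # 1
```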
The problem is as follows: you want to search the string, as you said: "when I search the string it works". You are currently matching the string. See the documentation:
Help on function match in module re:
match(pattern, string, flags=0)
Try to apply the pattern at the start of the string, returning
a match object, or None if no match was found.
Your pattern does not appear at the beginning of the string, since the HTML for the webpage does not start with the <a> tag you are looking for.
You want m=re.search(r"<a href.*?pdf[^>]*?", htmlSource).
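As for printing all of the matches and downloading the PDFs, here is one possible sketch, not a definitive implementation. The sample HTML, the tighter href pattern, and the download_pdfs helper are all assumptions made for illustration, and the download part assumes Python 3's urllib.request and urllib.parse rather than the Python 2 urllib used in the question.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlretrieve

# Hypothetical page source; a real one would come from sock.read().
html_source = (
    '<html><body>'
    '<a href="docs/first.pdf">First</a>'
    '<a href="about.html">About</a>'
    '<a href="docs/second.pdf" class="dl">Second</a>'
    '</body></html>'
)

pattern = r'<a href.*?pdf[^>]*?'

# match only tries the start of the string, which is '<html>', so it fails.
print(re.match(pattern, html_source))  # None

# search scans forward and finds the first occurrence anywhere.
found = re.search(pattern, html_source)
print(found.group(0))  # <a href="docs/first.pdf

# findall with a capture group returns every captured href value.
# [^"]* keeps each match inside a single attribute value.
pdf_links = re.findall(r'<a\s+href="([^"]*\.pdf)"', html_source)
print(pdf_links)  # ['docs/first.pdf', 'docs/second.pdf']


def download_pdfs(base_url, links):
    """Hypothetical helper: save each linked PDF into the current directory."""
    for link in links:
        absolute = urljoin(base_url, link)       # resolve relative links
        filename = absolute.rsplit('/', 1)[-1]   # last path component
        urlretrieve(absolute, filename)
```

The tighter pattern is used for findall because .*? in the original pattern can run from one <a> tag into the next when the first tag has no "pdf" in it. For anything beyond simple pages, an HTML parser is usually more robust than a regex.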
Check out this answer. It seems that Python's urllib is a lot less user-friendly, and less Unicode-friendly, than it should be: it forces you to deal with ugly raw bytes content instead of decoding it for you into a normal string.
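If the bytes issue applies (it does under Python 3, where read() returns bytes), a minimal sketch of the decode step could look like this. The bytes literal stands in for the result of sock.read(), and UTF-8 is an assumed encoding; a real page may declare a different one.

```python
import re

# Hypothetical stand-in for what sock.read() returns under Python 3.
raw_bytes = b'<a href="report.pdf">Report</a>'

# Under Python 3, a str pattern cannot be applied to a bytes object.
try:
    re.search(r'<a href', raw_bytes)
    mixed_types_worked = True
except TypeError:
    mixed_types_worked = False

# Decode first (assuming UTF-8), then search the resulting str.
html_source = raw_bytes.decode('utf-8')
found = re.search(r'<a href', html_source)

print(mixed_types_worked)
print(found is not None)
```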