Regex in Python
So, I am trying to create a simple regex that matches the following string:
<PRE>><A HREF="../cgi-bin/hgTracks?hgsid=160564920&db=hg18&position=chrX:33267175-33267784&hgPcrResult=pack">chrX:33267175-33267784</A> 610bp TGATGTTTGGCGAGGAACTC GCAGAGTTTGAAGAGCTCGG
TGATGTTTGGCGAGGAACTCtactattgttacacttaggaaaataatcta
atccaaaggctttgcatctgtacagaagagcgagtagatactgaaagaga
tttgcagatccactgttttttaggcaggaagaatgctcgttaaatgcaaa
cgctgctctggctcatgtgtttgctccgaggtataggttttgttcgactg
acgtatcagatagtcagagtggttaccacaccgacgttgtagcagctgca
taataaatgactgaaagaatcatgttaggcatgcccacctaacctaactt
gaatcatgcgaaaggggagctgttggaattcaaatagactttctggttcc
cagcagtcggcagtaatagaatgctttcaggaagatgacagaatcaggag
aaagatgctgttttgcactatcttgatttgttacagcagccaacttattg
gcatgatggagtgacaggaaaaacagctggcatggaaggtaggattatta
aagctattacatcattacaaatacaattagaagctggccatgacaaagca
tatgtttgaacaagcagctgttggtagctggggtttgttgCCGAGCTCTT
CAAACTCTGC
</PRE>
I have created the following regex:
<PRE>[.|[\n]]*</PRE>
yet it won't match the string above. Does anyone have a solution to this conundrum, and perhaps an explanation of why this doesn't work?
Sorry about the formatting of this question.
Stop trying to parse HTML using regexes. You can't do it (robustly). There's a reason there's this famous SO answer. Use lxml instead.
If you're going to parse HTML, please use lxml, as Hank proposed.
But for this regex to work, you need to change the [] to (). A | inside square brackets is interpreted as the literal symbol '|' and not as an OR operator.
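To see the difference concretely, here is a small sketch. The shortened `sample` string is a made-up stand-in for the page in the question, not the real data, and the variable names are only for illustration.

```python
import re

# Hypothetical, shortened stand-in for the text in the question.
sample = "<PRE>line one\nline two\n</PRE>"

# Original attempt: the class is [.|[\n] (a literal '.', '|', '[' or newline),
# and the trailing ]* means "zero or more literal ']' characters".
broken = re.compile(r"<PRE>[.|[\n]]*</PRE>")

# With parentheses, '.' and '|' get their special meaning back:
# "any character except newline, OR a newline", repeated.
fixed = re.compile(r"<PRE>(.|\n)*</PRE>")

print(broken.match(sample))                     # no match on real text
print(broken.match("<PRE>|]]</PRE>"))           # matches this odd string instead
print(fixed.match(sample) is not None)          # the grouped version matches
```

The second `print` shows what the bracketed version really describes: one character from the class, followed by any number of closing brackets.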
Another option is to use the flag that's called DOTALL, which makes the dot operator match anything, including a newline. This way the regex becomes very simple:
m = re.match(r'<PRE>(.*)</PRE>', input_string, re.DOTALL)
m.group(1)
outputs the string inside the PRE, without the <PRE> and </PRE> tags themselves.
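As a quick check of what the flag changes, here is a minimal sketch using a short made-up `input_string` (an assumption standing in for the real page):

```python
import re

# Hypothetical two-line input; the real page is much longer.
input_string = "<PRE>first line\nsecond line\n</PRE>"

# Without DOTALL, '.' stops at the first newline, so the closing tag is never reached.
no_flag = re.match(r"<PRE>(.*)</PRE>", input_string)

# With DOTALL, '.' also matches newlines and the whole block is captured.
with_flag = re.match(r"<PRE>(.*)</PRE>", input_string, re.DOTALL)

print(no_flag)
print(repr(with_flag.group(1)))
```

Note that `re.match` returns `None` when nothing matches, so real code should check the result before calling `.group()`.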
The issue is that inside [] the . is a literal period, not a match-anything dot; the | is a pipe, not an or; and a second [ is just a literal bracket, not the start of a new character class, while the first ] closes the class. In other words, most of the non-backslash special symbols lose their specialness inside a character class.
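A tiny illustration of that literal behaviour, using a throwaway string chosen only for this example:

```python
import re

text = "a.b|c"  # made-up test string

# Inside a character class, '.' and '|' are ordinary characters.
in_class = re.findall(r"[.|]", text)

# Outside a class, '|' means alternation.
alternation = re.findall(r"a|c", text)

# Outside a class, '.' matches any character except a newline (by default).
any_char = re.findall(r".", "x.\n")

print(in_class, alternation, any_char)
```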
What you will want to do is this:
m = re.search(r'(<PRE>.*</PRE>)', input_string, re.DOTALL)
m.group(1)
.search() will look everywhere in the string for the match (.match() only checks the beginning of the string), and re.DOTALL (or re.S) will have the . match newlines as well.
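The difference only shows up when something precedes the tag, so this sketch assumes a made-up `page` string with a header line in front of the block:

```python
import re

# Hypothetical page: the <PRE> block is not at the very start.
page = "Some header text\n<PRE>line one\nline two\n</PRE>\n"
pattern = r"<PRE>.*</PRE>"

at_start = re.match(pattern, page, re.DOTALL)   # anchored at position 0
anywhere = re.search(pattern, page, re.DOTALL)  # scans the whole string

print(at_start)
print(repr(anywhere.group(0)))
print(re.S == re.DOTALL)  # re.S is just the short name for the same flag
```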
If you don't want the <PRE> and </PRE> tags included, move the parentheses to surround the .*.
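Side by side, the two placements look roughly like this (the short `input_string` is an invented stand-in for the real data):

```python
import re

# Hypothetical, shortened input.
input_string = "<PRE>chrX:1-2 610bp\nACGT\n</PRE>"

# Parentheses around everything: group(1) includes the tags.
with_tags = re.search(r"(<PRE>.*</PRE>)", input_string, re.DOTALL).group(1)

# Parentheses around .* only: group(1) is just the content between the tags.
inner_only = re.search(r"<PRE>(.*)</PRE>", input_string, re.DOTALL).group(1)

print(repr(with_tags))
print(repr(inner_only))
```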