How to use regular expressions on an .html file in Python?
I'm very new to programming and would really appreciate any help! I am trying to write a small Python script.
I have an .html file of a legal codification organized in §§ (for example: http://www.gesetze-im-internet.de/stgb/BJNR001270871.html). I want to write a Python regex script that automatically tags specific §§. The relevant HTML of the document is:
"<div class="jnnorm" id="BJNR398310001BJNE000100305" title="Einzelnorm"><div
class="jnheader"> <a name="BJNR398310001BJNE000100305"/><a
href="index.html#BJNR398310001BJNE000100305">Nichtamtliches Inhaltsverzeichnis</a><h3><span
class="jnenbez">§ 1</span> <span class="jnentitel"></span></h3> </div>"
Here 'div class="jnnorm"' should become 'div class="jnnorm MYTAGHERE"'. The last element, 'class="jnenbez">§ 1', contains the number of the §, here § 1.
I am trying (and failing) to write a script that does the following:
1) Let's say I have a list my_dict = [112, 204]
2) Find '<span class="jnenbez">§ 112' and '<span class="jnenbez">§ 204' in the .htm file
3) Go left from 'jnenbez">§ 112' to the nearest preceding "jnnorm" string and replace it with "jnnorm MYTAGHERE".
Here is what I got so far, but I hit a roadblock quite soon.
import re
f = open("filename.htm", "r")
text = f.read()
my_dict = [1, 123, 200]
# don't know how to find the §
re.sub("jnnorm", "jnnorm MYTAGHERE", text)
# re.sub does not seem to work?
re.sub doesn't change the string (Python strings are immutable); it returns a new, modified string instead. If you want the text variable to change, you have to assign the new value to it:
text = re.sub("jnnorm", "jnnorm MYTAGHERE", text)
Or, simpler (regular expressions are overkill for a plain string replacement):
text = text.replace("jnnorm", "jnnorm MYTAGHERE")
But for anything more complicated, yes, you should consider using a proper HTML parser.
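If the markup is as regular as the snippet in the question, tagging only selected §§ can still be sketched with re.sub by passing a function as the replacement. This is a minimal sketch, not a robust solution: it assumes class="jnnorm" is written exactly as in the question, and that the span contains a literal § followed by whitespace and the number. The name tag_sections and its parameters are made up for illustration.

```python
import re

# Matches '<div class="jnnorm' only when a jnenbez span follows before the
# next jnnorm div starts. The § number is captured inside the lookahead, so
# only the opening '<div class="jnnorm' is actually consumed and replaced.
PATTERN = re.compile(
    r'(<div\s+class="jnnorm)'
    r'(?="(?:(?!class="jnnorm).)*?class="jnenbez">§\s*(\d+[a-z]?)\s*<)',
    re.DOTALL,
)

def tag_sections(html, numbers, tag="MYTAGHERE"):
    """Return a copy of html with tag added to the jnnorm divs of the given §§."""
    wanted = {str(n) for n in numbers}

    def repl(match):
        # Append the tag only if this div's § number is one we want.
        if match.group(2) in wanted:
            return match.group(1) + " " + tag
        return match.group(1)

    return PATTERN.sub(repl, html)
```

As with any re.sub call, the result has to be assigned, for example text = tag_sections(text, [112, 204]). If the real file encodes § as an entity or orders the attributes differently, the pattern would need adjusting.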
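Using BeautifulSoup, you can find elements by the value of their class attribute.
from bs4 import BeautifulSoup  # BeautifulSoup 4; version 3 used "from BeautifulSoup import BeautifulSoup"
soup.findAll('span', {'class': 'jnenbez'})
will return a list of the span elements whose class attribute is 'jnenbez'. (findAll('class') would instead look for tags named 'class'.)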
For example, with this document:
doc = ['<html><head><title>Page title</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
'</html>']
soup = BeautifulSoup(''.join(doc))
soup.findAll('b')
gives
# [<b>one</b>, <b>two</b>]
Then use a regex, or simply test whether each number in your list appears in the text of one of the returned elements.
This answers steps 1 and 2 of your question.
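For the remaining step, finding which enclosing jnnorm div belongs to each matching §, here is a rough sketch of the same idea. It deliberately swaps BeautifulSoup for the standard library's html.parser so it runs without installing anything. SectionFinder and find_section_ids are hypothetical names, and the sketch assumes the span text is exactly "§" plus whitespace plus the number, as in the question's snippet.

```python
from html.parser import HTMLParser

class SectionFinder(HTMLParser):
    """Collect the id of every jnnorm div whose jnenbez span names a wanted §."""

    def __init__(self, numbers):
        super().__init__()
        self.wanted = {"§ %d" % n for n in numbers}
        self.current_id = None    # id of the most recently opened jnnorm div
        self.in_jnenbez = False   # True while inside <span class="jnenbez">
        self.found_ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "jnnorm" in classes:
            self.current_id = attrs.get("id")
        elif tag == "span" and "jnenbez" in classes:
            self.in_jnenbez = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_jnenbez = False

    def handle_data(self, data):
        # Normalise whitespace so "§ 112" also matches with a non-breaking space.
        if self.in_jnenbez and " ".join(data.split()) in self.wanted:
            self.found_ids.append(self.current_id)

def find_section_ids(html, numbers):
    """Return the ids of the jnnorm divs that contain one of the given §§."""
    finder = SectionFinder(numbers)
    finder.feed(html)
    finder.close()
    return finder.found_ids
```

The returned ids identify the divs that should receive the extra class. With BeautifulSoup, the equivalent would be to walk up from each matching span to its parent div.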