
How to use regular expressions on an .html file in Python?

I'm very new to programming and I would really appreciate any help! I am trying to write this little Python script:

I have an .html file of a legal code that is organized into sections (§§). (For example: http://www.gesetze-im-internet.de/stgb/BJNR001270871.html) Now I want to write a Python regex script that automatically tags specific §§. The relevant HTML code of the document is:

"<div class="jnnorm" id="BJNR398310001BJNE000100305" title="Einzelnorm"><div class="jnheader">
<a name="BJNR398310001BJNE000100305"/><a href="index.html#BJNR398310001BJNE000100305">Nichtamtliches Inhaltsverzeichnis</a>
<h3><span class="jnenbez">&#167; 1</span>&#160;<span class="jnentitel"></span></h3> </div>"

Here "div class="jnnorm" should become "div class="jnnorm MYTAGHERE". The last element in "class="jnenbez">&#167; 1" contains the number of the §, here § 1.

I am trying (and failing) to write a script that does the following:

1) Let's say I have a list my_dict = [112, 204]

2) Find "<span class="jnenbez">&#167; 112" and "<span class="jnenbez">&#167; 204" in the .htm file

3) Go left from "jnenbez">&#167; 112" to the next "jnnorm" string and replace it with "jnnorm MYTAGHERE".

Here is what I got so far, but I hit a roadblock quite soon.

f = file("filename.htm","r")
text = f.read()
import re
my_dict=[1,123,200]
# dont know how to find the §   
re.sub("jnnorm", "jnnorm MYTAGHERE", text)
#re.sub does not seem to work?


re.sub doesn't change the string, it returns a new (modified) string instead. If you want the text variable to change you should assign the new value to it:

text = re.sub("jnnorm", "jnnorm MYTAGHERE", text)

Or, more simply (a regular expression is overkill for a plain string replacement):

text = text.replace("jnnorm", "jnnorm MYTAGHERE")

But for anything more complicated - yes, you should consider using a proper HTML parser.
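
Note that both replacements above tag every "jnnorm" in the file, not only the chosen §§. Short of a full parser, here is a minimal regex sketch for tagging only the listed sections. It assumes the markup looks exactly like the snippet in the question (class="jnnorm" followed, within the same norm, by class="jnenbez">&#167; N<). The names tag_sections and PATTERN are made up for this illustration, and section numbers with a letter suffix such as "5a" are simply skipped.

```python
import re

# Match class="jnnorm" only when a jnenbez span follows before the next norm.
PATTERN = re.compile(
    r'class="jnnorm"'                    # the attribute value to rewrite
    r'(?=(?:(?!class="jnnorm").)*?'      # look ahead, but never into the next norm
    r'class="jnenbez">&#167; (\d+)<)',   # capture the section number
    re.DOTALL,
)


def tag_sections(html, wanted, tag="MYTAGHERE"):
    """Return html with `tag` added to the jnnorm class of the wanted sections."""
    wanted = {str(n) for n in wanted}

    def repl(match):
        # group(1) is the section number captured inside the lookahead
        if match.group(1) in wanted:
            return 'class="jnnorm %s"' % tag
        return match.group(0)  # leave other norms untouched

    return PATTERN.sub(repl, html)


# Hypothetical two-norm sample, shaped like the snippet in the question.
sample = (
    '<div class="jnnorm" id="a"><div class="jnheader">'
    '<h3><span class="jnenbez">&#167; 1</span></h3></div></div>'
    '<div class="jnnorm" id="b"><div class="jnheader">'
    '<h3><span class="jnenbez">&#167; 112</span></h3></div></div>'
)
result = tag_sections(sample, [112, 204])
```

As with re.sub itself, the function returns a new string, so the result has to be assigned and written back to a file. If the real page deviates from the assumed markup (different attribute order, extra whitespace), the pattern will silently match nothing, which is one more argument for a parser.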


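Using BeautifulSoup, you can retrieve elements by the value of their class attribute.

from BeautifulSoup import BeautifulSoup
soup.findAll('span', {'class': 'jnenbez'})

will return the list of span elements whose class is 'jnenbez' (findAll('class') on its own would look for tags named "class", not for the attribute).

For example, with this document: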

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
soup.findAll('b')

gives

# [<b>one</b>, <b>two</b>]

Then use a regex, or simply test whether each number in your list appears in one of the returned elements.

This answers steps 1 and 2 of your question.
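
For steps 1 and 2, here is a rough sketch of the same idea using Python 3's standard-library html.parser instead of BeautifulSoup, so nothing needs to be installed. The class name SectionFinder and its attributes are invented for this example, and the markup is assumed to match the snippet in the question. It records the id of every "jnnorm" div whose "jnenbez" span names one of the wanted sections.

```python
from html.parser import HTMLParser


class SectionFinder(HTMLParser):
    """Collect ids of jnnorm divs whose jnenbez span holds a wanted number."""

    def __init__(self, wanted):
        # convert_charrefs=True turns "&#167;" into the "§" character for us
        super().__init__(convert_charrefs=True)
        self.wanted = {str(n) for n in wanted}
        self.current_id = None    # id of the jnnorm div we are inside
        self.in_jnenbez = False   # True while inside <span class="jnenbez">
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "jnnorm" in classes:
            self.current_id = attrs.get("id")
        elif tag == "span" and "jnenbez" in classes:
            self.in_jnenbez = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_jnenbez = False

    def handle_data(self, data):
        if self.in_jnenbez:
            # "§ 112" -> "112"
            number = data.replace("\u00a7", "").strip()
            if number in self.wanted:
                self.matches.append(self.current_id)


# Hypothetical two-norm sample, shaped like the snippet in the question.
sample = (
    '<div class="jnnorm" id="a"><div class="jnheader">'
    '<h3><span class="jnenbez">&#167; 1</span></h3></div></div>'
    '<div class="jnnorm" id="b"><div class="jnheader">'
    '<h3><span class="jnenbez">&#167; 112</span></h3></div></div>'
)
finder = SectionFinder([112, 204])
finder.feed(sample)
finder.close()
found_ids = finder.matches
```

The collected ids identify exactly which divs to tag, which covers step 3 without guessing how far "left" the matching "jnnorm" is.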
