How to use regular expressions on a .html file in python?

2023-03-29 10:11

I'm very new to programming and I would really appreciate any help! I am trying to write this little python script:

I have an .html file of a legal codification in §§. (For example: http://www.gesetze-im-internet.de/stgb/BJNR001270871.html) Now I want to write a python regex script to automatically tag specific §§. The relevant html code of the document is:

<div class="jnnorm" id="BJNR398310001BJNE000100305" title="Einzelnorm">
<div class="jnheader"> <a name="BJNR398310001BJNE000100305"/>
<a href="index.html#BJNR398310001BJNE000100305">Nichtamtliches Inhaltsverzeichnis</a>
<h3><span class="jnenbez">&#167; 1</span>&#160;<span class="jnentitel"></span></h3> </div>

Here <div class="jnnorm" should become <div class="jnnorm MYTAGHERE". The <span class="jnenbez"> element contains the number of the §, here § 1.

I am trying (and failing) to write a script that does the following:

1) Let's say I have a list my_dict = [112, 204]

2) Find "<span class="jnenbez">§ 112" and "<span class="jnenbez">§ 204" in the .htm file

3) Go left from "jnenbez">§ 112" to the next "jnnorm" string and replace it with "jnnorm MYTAGHERE".

Here is what I got so far, but I hit a roadblock quite soon.

import re

my_dict = [1, 123, 200]

# open() replaces the Python 2-only file() builtin
with open("filename.htm", "r") as f:
    text = f.read()

# don't know how to find the §
re.sub("jnnorm", "jnnorm MYTAGHERE", text)
# re.sub does not seem to work?

re.sub doesn't change the string, it returns a new (modified) string instead. If you want the text variable to change you should assign the new value to it:

text = re.sub("jnnorm", "jnnorm MYTAGHERE", text)

Or, more simply (regular expressions are overkill for a plain string replacement):

text = text.replace("jnnorm", "jnnorm MYTAGHERE")
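To make the point about return values concrete, here is a minimal sketch. The sample string is made up for illustration; only the re.sub call comes from the question. Note that both re.sub and str.replace change every occurrence of "jnnorm", not just the ones belonging to selected §§.

```python
import re

# Hypothetical one-line sample, not taken from the real file.
text = '<div class="jnnorm" id="x">'

# Calling re.sub without using its return value leaves text untouched,
# because Python strings are immutable.
re.sub("jnnorm", "jnnorm MYTAGHERE", text)
unchanged = text

# Assigning the result rebinds the name to the new string.
text = re.sub("jnnorm", "jnnorm MYTAGHERE", text)
```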

But for anything more complicated - yes, you should consider using a proper HTML parser.
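If the markup is as regular as the snippet in the question, a regex with a replacement function can still tag only selected §§. The following is a rough sketch under that assumption, not a robust solution: the sample HTML, the name wanted, and the helper tag_if_wanted are made up for illustration, and the pattern only handles purely numeric § numbers (something like "§ 112a" would be skipped).

```python
import re

# Hypothetical sample mirroring the structure shown in the question.
html = (
    '<div class="jnnorm" id="a" title="Einzelnorm"><div class="jnheader">'
    '<h3><span class="jnenbez">&#167; 1</span></h3></div></div>'
    '<div class="jnnorm" id="b" title="Einzelnorm"><div class="jnheader">'
    '<h3><span class="jnenbez">&#167; 112</span></h3></div></div>'
)

wanted = {112, 204}  # § numbers to tag (assumed input)

pattern = re.compile(
    r'class="jnnorm"'
    # Anything up to the span, but never crossing into the next jnnorm div.
    r'((?:(?!class="jnnorm").)*?)'
    # The § label, written either as the entity or as the literal character.
    r'(<span class="jnenbez">(?:&#167;|§)\s*(\d+)</span>)',
    re.DOTALL,
)

def tag_if_wanted(match):
    """Add MYTAGHERE only when the § number is in the wanted set."""
    if int(match.group(3)) in wanted:
        return 'class="jnnorm MYTAGHERE"' + match.group(1) + match.group(2)
    return match.group(0)  # leave other norms unchanged

tagged = pattern.sub(tag_if_wanted, html)
```

Any change in attribute order, quoting, or nesting breaks a pattern like this, which is the usual argument for a parser.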

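Using BeautifulSoup, you can select elements by the value of their class attribute.

from bs4 import BeautifulSoup
soup = BeautifulSoup(text, "html.parser")
soup.find_all(class_="jnenbez")

This returns a list of the elements whose class attribute contains "jnenbez".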

For example, with this document:

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
soup.findAll('b')

gives

# [<b>one</b>, <b>two</b>]

Then use a regex, or simply check whether any number from your list appears in one of the elements of the returned list.

This covers steps 1 and 2 of your question.
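Step 3 (walking from the matched span back to its enclosing jnnorm div) could be sketched as follows. This assumes the bs4 package (BeautifulSoup 4) is installed; the sample HTML and the name wanted are made up for illustration, and only purely numeric § labels are handled.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical sample mirroring the structure shown in the question.
html = (
    '<div class="jnnorm" id="a" title="Einzelnorm"><div class="jnheader">'
    '<h3><span class="jnenbez">&#167; 1</span></h3></div></div>'
    '<div class="jnnorm" id="b" title="Einzelnorm"><div class="jnheader">'
    '<h3><span class="jnenbez">&#167; 112</span></h3></div></div>'
)

wanted = {112, 204}  # § numbers to tag (assumed input)

soup = BeautifulSoup(html, "html.parser")

for span in soup.find_all("span", class_="jnenbez"):
    # get_text() returns the decoded label, e.g. "§ 112".
    label = span.get_text().replace("§", "").strip()
    if label.isdigit() and int(label) in wanted:
        # Walk up to the nearest enclosing <div class="jnnorm">.
        norm = span.find_parent("div", class_="jnnorm")
        if norm is not None:
            norm["class"] = norm.get("class", []) + ["MYTAGHERE"]

result = str(soup)
```

Writing result back to a file then gives the tagged document; the parser may normalize some whitespace or entities along the way.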
