Python RegEx skipping the first few characters?

2022-12-09 04:27

Hey, I have a fairly basic question about regular expressions. I want to return just the text inside (and including) the body tags. I know the following isn't right, because it also matches all the characters before the opening body tag. How would you go about skipping those?

x = re.match('(.*<body).*?(</body>)', fileString)

Thanks!

I don't know Python, but here's a quick example thrown together using Beautiful Soup, which I often see recommended for Python HTML parsing.

from bs4 import BeautifulSoup  # Beautiful Soup 4

soup = BeautifulSoup(fileString, "html.parser")

bodyTag = soup.body  # str(bodyTag) gives the markup, <body> tags included

That will (in theory) deal with all the complexities of HTML, which is very difficult with pure regex-based answers, because it's not what regex was designed for.

Here is some example code that uses regex to find all the text between <body>...</body> tags. Although it demonstrates some features of Python's re module, note that Beautiful Soup is very easy to use and is a better tool if you plan on parsing HTML or XML. (See below for an example of how you could parse this using BeautifulSoup.)

#!/usr/bin/env python
import re

# Here we have a string with a multiline <body>...</body>
fileString='''baz<body>foo
baby foo
baby foo
baby foo
</body><body>bar</body>'''

# re.DOTALL tells re that '.' should match any character, including newlines.
x = re.search('(<body>.*?</body>)', fileString, re.DOTALL)
for match in x.groups():
    print(match)
# <body>foo
# baby foo
# baby foo
# baby foo
# </body>

If you wish to collect all matches, you could use re.findall:

print(re.findall('(<body>.*?</body>)', fileString, re.DOTALL))
# ['<body>foo\nbaby foo\nbaby foo\nbaby foo\n</body>', '<body>bar</body>']

and if you plan to use this pattern more than once, you can pre-compile it:

pat = re.compile('(<body>.*?</body>)', re.DOTALL)
print(pat.findall(fileString))
# ['<body>foo\nbaby foo\nbaby foo\nbaby foo\n</body>', '<body>bar</body>']
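One caveat with the literal <body> pattern: it will not match an opening tag that carries attributes or is written in upper case. A slightly more tolerant variant might look like the sketch below. The name BODY_RE and the sample string are made up for illustration, and this is still a heuristic rather than a parser (for example, it assumes no attribute value contains a '>').

```python
import re

# Hypothetical, more tolerant pattern (an assumption, not part of the answer above):
#   <body\b[^>]*>  opening tag with optional attributes
#   .*?            shortest possible content (re.DOTALL lets '.' span newlines)
#   </body\s*>     closing tag, allowing whitespace before '>'
BODY_RE = re.compile(r'<body\b[^>]*>.*?</body\s*>', re.DOTALL | re.IGNORECASE)

sample = '<html><BODY class="main">\nhello\n</BODY></html>'

# The literal pattern from the answer finds nothing in this sample.
strict = re.search('(<body>.*?</body>)', sample, re.DOTALL)
print(strict)

m = BODY_RE.search(sample)
body_text = m.group(0) if m else None
print(body_text)
```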

And here is how you could do it with BeautifulSoup:

#!/usr/bin/env python
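from bs4 import BeautifulSoup  # Beautiful Soup 4: pip install beautifulsoup4

fileString = '''baz<body>foo
baby foo
baby foo
baby foo
</body><body>bar</body>'''
# html.parser keeps both <body> elements as written; other parsers may restructure them.
soup = BeautifulSoup(fileString, 'html.parser')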
print(soup.body)
# <body>foo
# baby foo
# baby foo
# baby foo
# </body>

print(soup.findAll('body'))
# [<body>foo
# baby foo
# baby foo
# baby foo
# </body>, <body>bar</body>]

You cannot parse HTML with regex. HTML is not a regular language. Use an HTML parser like lxml instead.
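To make the point concrete, here is a small illustration of one way the regex approach can go wrong: a closing tag that appears inside an HTML comment ends the non-greedy match early. The sample string is invented for the demonstration.

```python
import re

# Invented sample: the comment contains the text '</body>'.
html = '<body><p>one</p><!-- </body> --><p>two</p></body>'

m = re.search(r'<body>.*?</body>', html, re.DOTALL)
truncated = m.group(0)
print(truncated)

# The match stops at the '</body>' inside the comment, so '<p>two</p>' is lost.
print('<p>two</p>' in truncated)
```

A real HTML parser knows that the comment is not markup, which is why it does not get confused here.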

x = re.match('.*(<body>.*?</body>)', fileString)

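Consider minidom (xml.dom.minidom) for parsing, provided the markup is well-formed XHTML. It is an XML parser and will raise an error on malformed HTML.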

x = re.search('(<body>.*</body>)', fileString)
x.group(1)

Less typing than the match answers, since re.search does not need a leading .* to skip the text before the tag.
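Two things are worth checking with this version: .* is greedy, and without re.DOTALL the dot does not cross newlines. The sketch below shows both effects on invented sample strings.

```python
import re

s = 'baz<body>foo</body><body>bar</body>qux'

# Greedy: runs to the LAST </body> on the line.
greedy = re.search(r'(<body>.*</body>)', s).group(1)
# Non-greedy: stops at the FIRST </body>.
lazy = re.search(r'(<body>.*?</body>)', s).group(1)
print(greedy)
print(lazy)

# Without re.DOTALL, '.' does not match a newline, so a multi-line body is missed.
multiline = 'baz<body>foo\nbar</body>'
no_flag = re.search(r'(<body>.*</body>)', multiline)
with_flag = re.search(r'(<body>.*</body>)', multiline, re.DOTALL)
print(no_flag)
print(with_flag.group(1))
```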

Does your fileString contain multiple lines? By default '.' does not match a newline, so in that case you need to either match newlines explicitly or pass a flag that changes this:

x = re.match(r"(?:.|\n)*(<body>(?:.|\n)*</body>)", fileString)

or, more simply, with the re.DOTALL flag:

x = re.match(r".*(<body>.*</body>)", fileString, re.DOTALL)

x.groups()[0] should contain your string if x is not None.
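The underlying issue in the question is that re.match only matches at the start of the string, which is why the leading .* is needed there. re.search looks for the first position where the pattern matches, so nothing has to be skipped by hand. A minimal sketch, using a hypothetical helper name extract_body and an invented sample page:

```python
import re

def extract_body(text):
    """Return the first <body>...</body> span (tags included), or None."""
    # re.search scans the whole string, so no leading '.*' is needed.
    m = re.search(r'<body>.*?</body>', text, re.DOTALL)
    return m.group(0) if m else None

page = '<html><head></head>\n<body>\nhi\n</body></html>'

# re.match is anchored at position 0, where the text is '<html>', so it fails.
anchored = re.match(r'<body>.*?</body>', page, re.DOTALL)
print(anchored)

found = extract_body(page)
print(found)
print(extract_body('no body here'))
```

Checking for None before calling group() avoids an AttributeError when the input has no body tags.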
