Python regex to match text in single quotes, ignoring escaped quotes (and tabs/newlines)

2023-02-20 08:06 问答作者：

Given a file of text, where the character I want to match are delimited by single-quotes, but might have zero or one escaped single-quote, as well as zero or more tabs and newline characters (not escaped) - I want to match the text only. Example:

menu_item = 'casserole';
menu_item = 'meat 
            loaf';
menu_item = 'Tony\'s magic pizza';
menu_item = 'hamburger';
menu_item = 'Dave\'s famous pizza';
menu_item = 'Dave\'s lesser-known
    gyro';

I want to grab only the text (and spaces), ignoring the tabs/newlines 开发者_StackOverflow- and I don't actually care if the escaped quote appears in the results, as long as it doesn't affect the match:

casserole
meat loaf
Tonys magic pizza
hamburger
Daves famous pizza
Dave\'s lesser-known gyro # quote is okay if necessary.

I have manage to create a regex that almost does it - it handles the escaped quotes, but not the newlines:

menuPat = r"menu_item = \'(.*)(\\\')?(\t|\n)*(.*)\'"
for line in inFP.readlines():
    m = re.search(menuPat, line)
    if m is not None:
        print m.group()

There are definitely a ton of regular expression questions out there - but most are using Perl, and if there's one that does what I want, I couldn't figure it out :) And since I'm using Python, I don't care if it is spread across multiple groups, it's easy to recombine them.

Some Answers have said to just go with code for parsing the text. While I'm sure I could do that - I'm so close to having a working regex :) And it seems like it should be doable.

Update: I just realized that I am doing a Python readlines() to get each line, which obviously is breaking up the lines getting passed to the regex. I'm looking at re-writing it, but any suggestions on that part would also be very helpful.

This tested script should do the trick:

import re
re_sq_long = r"""
    # Match single quoted string with escaped stuff.
    '            # Opening literal quote
    (            # $1: Capture string contents
      [^'\\]*    # Zero or more non-', non-backslash
      (?:        # "unroll-the-loop"!
        \\.      # Allow escaped anything.
        [^'\\]*  # Zero or more non-', non-backslash
      )*         # Finish {(special normal*)*} construct.
    )            # End $1: String contents.
    '            # Closing literal quote
    """
re_sq_short = r"'([^'\\]*(?:\\.[^'\\]*)*)'"

data = r'''
        menu_item = 'casserole';
        menu_item = 'meat 
                    loaf';
        menu_item = 'Tony\'s magic pizza';
        menu_item = 'hamburger';
        menu_item = 'Dave\'s famous pizza';
        menu_item = 'Dave\'s lesser-known
            gyro';'''
matches = re.findall(re_sq_long, data, re.DOTALL | re.VERBOSE)
menu_items = []
for match in matches:
    match = re.sub('\s+', ' ', match) # Clean whitespace
    match = re.sub(r'\\', '', match)  # remove escapes
    menu_items.append(match)          # Add to menu list

print (menu_items)

Here is the short version of the regex:

'([^'\\]*(?:\\.[^'\\]*)*)'

This regex is optimized using Jeffrey Friedl's "unrolling-the-loop" efficiency technique. (See: Mastering Regular Expressions (3rd Edition)) for details.

Note that the above regex is equivalent to the following one (which is more commonly seen but is much slower on most NFA regex implementations):

'((?:[^'\\]|\\.)*)'

This should do it:

menu_item = '((?:[^'\\]|\\')*)'

Here the (?:[^'\\]|\\')* part matches any sequence of any character except ' and \ or a literal \'. The former expression [^'\\] does also allow line breaks and tabulators that you then need to replace by a single space.

You cold try it like this:

pattern = re.compile(r"menu_item = '(.*?)(?<!\\)'", re.DOTALL)

It will start matching at the first single quote it finds and it ends at the first single quote not preceded by a backslash. It also captures any newlines and tabs found between the two single quotes.

继续阅读：python regex

Python regex to match text in single quotes, ignoring escaped quotes (and tabs/newlines)

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？