Regular expression for matching a variety of types of numbered lists

I'd like to create a (PCRE) regular expression to match all commonly used numbered lists, and I'd like to share my thoughts and gather input on ways to do this.

I've defined 'lists' as the set of canonical Anglo-Saxon conventions, i.e.

Numbers

1 2 3
1. 2. 3.
1) 2) 3)
(1) (2) (3)
1.1 1.2 1.2.1
1.1. 1.2. 1.3.
1.1) 1.2) 1.3)
(1.1) (1.2) (1.3)

Letters

a b c
a. b. c.
a) b) c)
(a) (b) (c) 
A B C
A. B. C. 
A) B) C)
(A) (B) (C)

Roman numerals

i ii iii
i. ii. iii.
i) ii) iii)
(i) (ii) (iii)
I II III
I. II. III.
I) II) III)
(I) (II) (III)

I'd like to know how complete this set of lists is, whether other numbering conventions belong in it, and whether any of these ought to be removed.

Here's a regular expression I've created to solve this problem (in Python):

import re

numex = (r'(?:\d{1,3}'          # 1, 2, 3
         r'(?:\.\d{1,3}){0,4}'  # 1.1, 1.1.1.1
         r'|[A-Z]{1,2}'         # A. B. C.
         r'|[ivxcl]{1,6})')     # i, iii, ...

rex = re.compile(r'(\(?%s\)|%s\.?)' % (numex, numex), re.I)  # re.U?

rex.match("123. Some paragraph")

I'd like to know how adequate this regex is for this problem, and whether there are alternative (regex or otherwise) solutions.

Incidentally, for my particular use-case, I wouldn't expect list numbers of more than 25-50.
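One way to probe adequacy is to run the pattern over a sample of the markers listed above and see which ones do not come back whole. This is only a rough sketch: it assumes the numex above with its closing parenthesis in place, and `samples`, `marker` and `unmatched` are names made up for illustration.

```python
import re

# The question's alternatives, written on one line.
numex = r'(?:\d{1,3}(?:\.\d{1,3}){0,4}|[A-Z]{1,2}|[ivxcl]{1,6})'
rex = re.compile(r'(\(?%s\)|%s\.?)' % (numex, numex), re.I)

# A sample of the conventions from the lists above (not exhaustive).
samples = ["1", "1.", "1)", "(1)", "1.2.1", "(1.1)",
           "a)", "(A)", "iii.", "iii)", "(iii)"]

def marker(text):
    """Return what the regex captures at the start of a paragraph."""
    m = rex.match(text + " Some paragraph")
    return m.group(1) if m else None

# Markers that the regex does not return in full.
unmatched = [s for s in samples if marker(s) != s]
print(unmatched)
```

As far as I can tell, the one sample that does not survive is "iii.": with re.I, the [A-Z]{1,2} alternative is tried before the Roman one and accepts "ii", and nothing after the marker forces the match to extend. Plain text such as "Some paragraph" also matches (as "So"), since nothing has to follow the marker.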

Thank you for reading.

Brian


Here's a Wikified solution:

import re

numex = r"""(?:
      \d{1,3}                 # 1, 2, 3
          (?:\.\d{1,3}){0,4}  # 1.1, 1.1.1.1
    | [B-H] | [J-Z]           # B - Z without I; caps at 26
    | [AI](?!\s)              # Note: "A" and "I" can properly start non-lists
    | [a-z]                   # a - z
    | [ivxcl]{1,6}            # Roman ii, etc.
    | [IVXCL]{1,6}            # Roman IV, etc.
    )
    """

# numex carries no "^" of its own: an anchor placed after \s* and \(? would
# stop indented or parenthesised markers from matching.
rex = re.compile(r'^\s*(\(?%s\)|%s\.?)\s+(.*)'
                 % (numex, numex), re.X)

Additions, changes and suggestions most welcome.
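For anyone trying this out, here is a small usage sketch. It assumes the verbose pattern above with no "^" inside numex itself (only the one in rex), and `split_item` is a helper name invented for the example.

```python
import re

numex = r"""(?:
      \d{1,3}                 # 1, 2, 3
          (?:\.\d{1,3}){0,4}  # 1.1, 1.1.1.1
    | [B-H] | [J-Z]           # single capitals other than A and I
    | [AI](?!\s)              # A or I, unless followed by whitespace
    | [a-z]                   # a - z
    | [ivxcl]{1,6}            # Roman ii, etc.
    | [IVXCL]{1,6}            # Roman IV, etc.
    )
    """
rex = re.compile(r'^\s*(\(?%s\)|%s\.?)\s+(.*)' % (numex, numex), re.X)

def split_item(line):
    """Return (marker, text) for a list-like line, or None."""
    m = rex.match(line)
    return (m.group(1), m.group(2)) if m else None

print(split_item("  (1.2) Scope"))
print(split_item("iv. Fourth"))
print(split_item("A dark night"))
print(split_item("I thought so"))
```

One thing this seems to show: the (?!\s) guard keeps "A dark night" out, but "I thought so" still gets through, because a lone "I" is also accepted by the upper-case Roman alternative.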


I'd change at least one thing, and that is to add word boundary anchors around your regex, otherwise it will match every single letter in any text:

rex = re.compile(r'(\(?\b%s\)|\b%s\b\.?)' % (numex, numex), re.I|re.M)

This helps a little, but of course any one- or two-letter word will still be matched.
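To make the difference concrete, here is a quick comparison of the pattern with and without the word boundaries. It is a sketch only: numex is the question's pattern on one line, and `loose`, `bounded` and the sample text are made up for the example.

```python
import re

numex = r'(?:\d{1,3}(?:\.\d{1,3}){0,4}|[A-Z]{1,2}|[ivxcl]{1,6})'

# Without word boundaries: any letter can start a match.
loose = re.compile(r'(\(?%s\)|%s\.?)' % (numex, numex), re.I)
# With word boundaries: only whole one- or two-letter words match.
bounded = re.compile(r'(\(?\b%s\)|\b%s\b\.?)' % (numex, numex), re.I | re.M)

text = "Hello to a list"
loose_hits = loose.findall(text)
bounded_hits = bounded.findall(text)

print(loose_hits)    # fragments of every word
print(bounded_hits)  # only the short words
```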

You might want to anchor the search at the start of the line; after all, these characters should be the first thing on the line (except maybe whitespace). A negative lookbehind won't work in Python because Python doesn't support variable-length lookbehind, so you could add this outside the matching parentheses:

rex = re.compile(r'^\s*(\(?%s\)|%s\b\.?)' % (numex, numex), re.I|re.M)

Of course, now you must look at the match object's group(1) to only get the actual match and not the leading whitespace.
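A short sketch of that, assuming the question's numex on one line; the sample strings and the names `whole`, `marker` and `markers` are only for illustration.

```python
import re

numex = r'(?:\d{1,3}(?:\.\d{1,3}){0,4}|[A-Z]{1,2}|[ivxcl]{1,6})'
rex = re.compile(r'^\s*(\(?%s\)|%s\b\.?)' % (numex, numex), re.I | re.M)

m = rex.match("   2.1) Indented item")
whole = m.group(0)   # whole match, leading whitespace included
marker = m.group(1)  # just the list marker

# With re.M, "^" also matches at the start of every line, so finditer
# can collect the markers from a multi-line string.
text = "1. First\n  (b) Second\nplain text here"
markers = [m.group(1) for m in rex.finditer(text)]

print(repr(whole), repr(marker), markers)
```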

You will still match too much (e.g. sentences starting with "I thought so" or "It was a dark and stormy night"), but your rules allow this, and I think you're aware of it.
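If those false positives matter, one possible tightening is to accept letter and Roman markers only when they carry a delimiter, and to require whitespace after the marker. To be clear, this is my own assumption and not part of the rules in the question: it drops the bare "a b c" and "i ii iii" conventions, and abbreviations at the start of a line (such as "e. g.") would still match. The names `numeric`, `alpha`, `strict` and `strict_marker` are invented for the sketch.

```python
import re

numeric = r'\d{1,3}(?:\.\d{1,3}){0,4}'    # 1, 1.2, 1.2.1
alpha = r'(?:[ivxcl]{1,6}|[A-Z]{1,2})'    # Roman numerals or letters

# Letters need "(x)", "x." or "x)"; numbers may also stand alone.
# The lookahead requires whitespace right after the marker.
strict = re.compile(
    r'^\s*(\(' + alpha + r'\)|' + alpha + r'[.)]'
    r'|\(' + numeric + r'\)|' + numeric + r'[.)]?)(?=\s)',
    re.I | re.M)

def strict_marker(line):
    """Return the list marker at the start of line, or None."""
    m = strict.match(line)
    return m.group(1) if m else None

for line in ["I thought so", "It was a dark and stormy night",
             "(iv) Fourth", "b) Second", "1.2 Scope", "a b c"]:
    print(repr(line), strict_marker(line))
```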
