Extract lines below category and stop when another category is reached

Let's suppose I have a text file of movie genres with my favorite movies under each genre.

[category] Horror:

  1. Movie
  2. Movie
  3. Movie

[category] Comedy:

  1. Movie

[category] Action:

  1. Movie
  2. Movie

How would I write a function that collects all the movie titles below a given [category] line into an array, without spilling over into the next category?


Others have already given advice on parsing your text file format, so I'll just add another suggestion. If rewriting your file is possible, an easy solution is to change it to a file that ConfigParser can read (and write):

[Horror]
1: Movie
2: Movie
3: Movie

[Comedy]
1: Movie

[Action]
1: Movie
2: Movie
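
Reading a file in that layout could look roughly like the sketch below. It assumes Python 3's configparser module and keeps the sample text in a string; the names SAMPLE and movies_in are made up for illustration, and a real script would call parser.read() on a file path instead.

```python
import configparser

# Hypothetical sample text in the ConfigParser-style layout shown above.
SAMPLE = """\
[Horror]
1: Movie
2: Movie
3: Movie

[Comedy]
1: Movie

[Action]
1: Movie
2: Movie
"""


def movies_in(config_text, genre):
    """Return the movie titles listed under one [genre] section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Each section behaves like a mapping of "1", "2", ... to titles.
    return list(parser[genre].values())


print(movies_in(SAMPLE, "Horror"))  # ['Movie', 'Movie', 'Movie']
```

Because each genre is its own section, nothing can spill over from one genre into the next.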


You could parse the file line-by-line this way:

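import collections

# Map each genre to the list of movie lines found under it.
result = collections.defaultdict(list)
with open('data') as f:
    genre = 'unknown'  # used for lines that appear before any [category] header
    for line in f:
        line = line.strip()
        if line.startswith('[category]'):
            # Drop the marker; what remains (e.g. ' Horror:') becomes the key.
            genre = line.replace('[category]', '', 1)
        elif line:
            result[genre].append(line)

for key in result:
    print('{k} {m}'.format(k=key, m=result[key]))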

yields

 Action: ['1. Movie', '2. Movie']
 Comedy: ['1. Movie']
 Horror: ['1. Movie', '2. Movie', '3. Movie']
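
If only one category is needed, the same line-by-line idea can be wrapped in a function that stops as soon as the next header appears. This is a minimal sketch, not a definitive implementation: the names extract_category and SAMPLE are hypothetical, and it assumes every header looks like "[category] Name:".

```python
# Hypothetical sample text in the question's format.
SAMPLE = """\
[category] Horror:

  1. Movie
  2. Movie
  3. Movie

[category] Comedy:

  1. Movie

[category] Action:

  1. Movie
  2. Movie
"""


def extract_category(lines, wanted):
    """Collect the non-blank lines under one category header."""
    titles = []
    inside = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("[category]"):
            if inside:
                break  # reached the next category, so stop collecting
            name = line[len("[category]"):].strip().rstrip(":")
            inside = (name == wanted)
        elif line and inside:
            titles.append(line)
    return titles


print(extract_category(SAMPLE.splitlines(), "Comedy"))  # ['1. Movie']
```

The function accepts any iterable of lines, so an open file object can be passed in place of SAMPLE.splitlines().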


Use a negative lookahead:

\[category\](?:(?!\[category\]).)*

will match one entire category (if the regex is compiled using the re.DOTALL option).
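
To see what that pattern captures, here is a small sketch using re.findall. The variable names (SAMPLE, block_re, blocks) are made up for illustration, and the sample text is a shortened version of the file from the question.

```python
import re

# Hypothetical, shortened sample in the question's format.
SAMPLE = (
    "[category] Horror:\n\n  1. Movie\n  2. Movie\n\n"
    "[category] Comedy:\n\n  1. Movie\n\n"
    "[category] Action:\n\n  1. Movie\n"
)

# The lookahead is checked before each character is consumed, so a match
# can never run into the next "[category]" marker. re.DOTALL lets "."
# match newlines as well.
block_re = re.compile(r"\[category\](?:(?!\[category\]).)*", re.DOTALL)

blocks = block_re.findall(SAMPLE)
print(len(blocks))  # 3
print(blocks[1].splitlines()[0])  # [category] Comedy:
```

Each element of blocks is the full text of one category, header included.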

You can grab the category and the contents separately by using

\[category\]\s*([^\r\n]*)\r?\n((?:(?!\[category\]).)*)

After a match, mymatch.group(1) will contain the category, and mymatch.group(2) will contain the movie titles.

Example in Python 3.1 (using your string as mymovies):

>>> import re
>>> myregex = re.compile(r"\[category\]\s*([^\r\n]*)\r?\n((?:(?!\[category\]).)*)", re.DOTALL)
>>> for mymatch in myregex.finditer(mymovies):
...     print("Category: {}".format(mymatch.group(1)))
...     for movie in mymatch.group(2).split("\n"):
...         if movie.strip():
...             print("contains: {}".format(movie.strip()))
...
Category: Horror:
contains: 1. Movie
contains: 2. Movie
contains: 3. Movie
Category: Comedy:
contains: 1. Movie
Category: Action:
contains: 1. Movie
contains: 2. Movie
>>>


import re

# Raw string so the backslashes reach the regex engine unchanged.
re_cat = re.compile(r"\[category\] (.*):")

categories = {}
category = None  # lines seen before any category header are filed under None

with open("movies.txt") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        match = re_cat.match(line)
        if match:
            # New header: switch category and make sure it has a list.
            category = match.group(1)
            categories.setdefault(category, [])
            continue
        categories.setdefault(category, []).append(line)

print(categories)

This produces the following dictionary:

{
'Horror': ['1. Movie', '2. Movie', '3. Movie'],
'Comedy': ['1. Movie'],
'Action': ['1. Movie', '2. Movie']
}

We use the same regular expression for matching and stripping out the category name, so it's efficient to compile it with re.compile.

We have a running category variable which changes whenever a new category is parsed. Any line that doesn't define a new category gets added to the categories dictionary under the appropriate key. Categories defined for the first time create a list under the right dictionary key, but categories can also be listed multiple times and everything will end up under the right key.

Any movies listed before a category is defined will be in the dictionary under the None key.
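
The two edge cases described above (a category listed more than once, and a movie that appears before any category) can be checked with a small in-memory sketch. The function name group_movies and the sample lines are hypothetical; the grouping logic follows the same running-category approach.

```python
import re

RE_CAT = re.compile(r"\[category\] (.*):")


def group_movies(lines):
    """Group non-blank lines under the most recent category header."""
    categories = {}
    category = None  # lines before the first header land under None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = RE_CAT.match(line)
        if match:
            category = match.group(1)
            categories.setdefault(category, [])
            continue
        categories.setdefault(category, []).append(line)
    return categories


# Made-up input: one orphan line, and "Horror" listed twice.
sample = [
    "0. Orphan",
    "[category] Horror:",
    "1. Movie",
    "[category] Comedy:",
    "1. Movie",
    "[category] Horror:",
    "2. Movie",
]
grouped = group_movies(sample)
print(grouped)
# {None: ['0. Orphan'], 'Horror': ['1. Movie', '2. Movie'], 'Comedy': ['1. Movie']}
```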
