Extract lines below category and stop when another category is reached
Let's suppose I have a text file of movie genres with my favorite movies under each genre.
[category] Horror:
- Movie
- Movie
- Movie
[category] Comedy:
- Movie
[category] Action:
- Movie
- Movie
How would I create a function that extracts and packages all the movie titles below a certain [category] * into an array without spilling over into another category?
Others have already given advice for your current file format, so here is a different suggestion. If rewriting your file is possible, an easy solution could be to change it into a ConfigParser-readable (and writable) file:
[Horror]
1: Movie
2: Movie
3: Movie

[Comedy]
1: Movie

[Action]
1: Movie
2: Movie
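As a rough sketch of how such a file could be read back: in Python 3 the module is named configparser (ConfigParser is the Python 2 name). The SAMPLE text and the movies_in helper below are made up for illustration, and the file content is parsed from a string rather than from disk.

```python
import configparser

# Hypothetical file content in the ConfigParser layout suggested above.
SAMPLE = """\
[Horror]
1: Movie
2: Movie
3: Movie

[Comedy]
1: Movie

[Action]
1: Movie
2: Movie
"""


def movies_in(genre, text=SAMPLE):
    """Return the titles stored in one section, in file order."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # Each section behaves like a mapping of key -> value.
    return list(parser[genre].values())


print(movies_in("Horror"))
```

Each section is self-contained, so there is no need to detect where one genre ends and the next begins.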
You could parse the file line-by-line this way:
import collections

result = collections.defaultdict(list)
with open('data') as f:
    genre = 'unknown'  # used for movies that appear before any header
    for line in f:
        line = line.strip()
        if line.startswith('[category]'):
            # Drop the marker and the space that follows it
            genre = line.replace('[category]', '', 1).strip()
        elif line:
            result[genre].append(line)
for key in result:
    print('{k} {m}'.format(k=key, m=list(result[key])))
yields
Action: ['1. Movie', '2. Movie']
Comedy: ['1. Movie']
Horror: ['1. Movie', '2. Movie', '3. Movie']
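If only one category is needed, the same line-by-line idea can be wrapped in a function that stops as soon as the next header is reached. This is a minimal sketch; the name extract_category and the sample lines are assumptions made for illustration, and it only returns the first block if a category is listed more than once.

```python
def extract_category(lines, wanted):
    """Collect the lines under '[category] <wanted>:' up to the next header."""
    titles = []
    inside = False
    for raw in lines:
        line = raw.strip()
        if line.startswith('[category]'):
            if inside:
                break  # next category reached: stop collecting
            name = line.replace('[category]', '', 1).strip().rstrip(':')
            inside = (name == wanted)
        elif inside and line:
            titles.append(line)
    return titles


# Hypothetical sample data in the question's format
sample = [
    "[category] Horror:",
    "- Movie A",
    "- Movie B",
    "[category] Comedy:",
    "- Movie C",
]
print(extract_category(sample, "Horror"))
```

An open file object can be passed in place of the list, since the function only iterates over lines.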
Use a negative lookahead:
\[category\](?:(?!\[category\]).)*
will match one entire category (if the regex is compiled using the re.DOTALL option).
You can grab the category and the contents separately by using
\[category\]\s*([^\r\n]*)\r?\n((?:(?!\[category\]).)*)
After a match, mymatch.group(1) will contain the category, and mymatch.group(2) will contain the movie titles.
Example in Python 3.1 (using your string as mymovies):
>>> import re
>>> myregex = re.compile(r"\[category\]\s*([^\r\n]*)\r?\n((?:(?!\[category\]).)*)", re.DOTALL)
>>> for mymatch in myregex.finditer(mymovies):
...     print("Category: {}".format(mymatch.group(1)))
...     for movie in mymatch.group(2).split("\n"):
...         if movie.strip():
...             print("contains: {}".format(movie.strip()))
...
Category: Horror:
contains: 1. Movie
contains: 2. Movie
contains: 3. Movie
Category: Comedy:
contains: 1. Movie
Category: Action:
contains: 1. Movie
contains: 2. Movie
>>>
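To package the matches into lists rather than printing them, the same regex can feed a dictionary. This is only a sketch: the function name movies_by_category and the sample text are made up here, and the trailing colon is stripped from the category name as a convenience.

```python
import re

# Same pattern as above: group 1 is the header text, group 2 the body
CATEGORY_RE = re.compile(
    r"\[category\]\s*([^\r\n]*)\r?\n((?:(?!\[category\]).)*)", re.DOTALL
)


def movies_by_category(text):
    """Map each category name to the list of non-empty lines below it."""
    result = {}
    for m in CATEGORY_RE.finditer(text):
        name = m.group(1).strip().rstrip(":")
        titles = [t.strip() for t in m.group(2).splitlines() if t.strip()]
        result.setdefault(name, []).extend(titles)
    return result


# Hypothetical sample data in the question's format
text = "[category] Horror:\n- A\n- B\n[category] Comedy:\n- C\n"
print(movies_by_category(text)["Horror"])
```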
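import re

# Matches a header such as "[category] Horror:" and captures the name
re_cat = re.compile(r"\[category\] (.*):")
categories = {}
category = None  # movies seen before any header are filed under None
with open("movies.txt", "r") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        match = re_cat.match(line)
        if match:
            category = match.group(1)
            categories.setdefault(category, [])
            continue
        # setdefault also covers movies listed before the first header
        categories.setdefault(category, []).append(line)
print(categories)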
Makes the following dictionary:
{
'Action': ['Movie', 'Movie'],
'Horror': ['Movie', 'Movie', 'Movie'],
'Comedy': ['Movie']
}
We use the same regular expression for matching and stripping out the category name, so it's efficient to compile it with re.compile.
We have a running category variable which changes whenever a new category is parsed. Any line that doesn't define a new category gets added to the categories dictionary under the appropriate key. Categories defined for the first time create a list under the right dictionary key, but categories can also be listed multiple times and everything will end up under the right key.
Any movies listed before a category is defined will be in the dictionary under the None key.
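The behaviour described above (repeated categories merging, and stray movies landing under None) can be checked with a small in-memory variant. This is a sketch, not the exact code from the answer: parse_categories and the sample lines are assumptions for illustration, and dict.setdefault is used so that the None key is created on demand.

```python
import re

RE_CAT = re.compile(r"\[category\] (.*):")


def parse_categories(lines):
    """Group lines under the most recently seen '[category] Name:' header."""
    categories = {}
    category = None  # running category; None until the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = RE_CAT.match(line)
        if match:
            category = match.group(1)
            categories.setdefault(category, [])
            continue
        categories.setdefault(category, []).append(line)
    return categories


# Hypothetical input: one stray movie, and Horror listed twice
sample = [
    "- Stray",
    "[category] Horror:",
    "- A",
    "[category] Comedy:",
    "- B",
    "[category] Horror:",
    "- C",
]
parsed = parse_categories(sample)
print(parsed)
```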