Extract lines below category and stop when another category is reached

2023-01-24 22:27

Let's suppose I have a text file of movie genres with my favorite movies under each genre.

[category] Horror:

Movie

Movie

Movie

[category] Comedy:

Movie

[category] Action:

Movie

Movie

How would I create a function that extracts and packages all the movie titles below a certain [category] * into an array without spilling over into another category?

Others have already given advice that works with your current file format, so here is a different suggestion. If rewriting the file is an option, an easy solution is to change it into a file that ConfigParser can read (and write):

[Horror]
1: Movie
2: Movie
3: Movie

[Comedy]
1: Movie

[Action]
1: Movie
2: Movie
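
As a rough sketch of how such a file could be read back, assuming Python 3's configparser module (named ConfigParser in Python 2). The helper name movies_in and the titles "Movie A" to "Movie F" are made up for illustration and are not part of the original file:

```python
import configparser

# Hypothetical file contents in the ConfigParser-friendly layout.
SAMPLE = """\
[Horror]
1: Movie A
2: Movie B
3: Movie C

[Comedy]
1: Movie D

[Action]
1: Movie E
2: Movie F
"""


def movies_in(text, genre):
    """Return the titles stored under the section named `genre`."""
    # interpolation=None keeps a literal '%' in a title from being
    # treated as interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return list(parser[genre].values())


print(movies_in(SAMPLE, "Horror"))
```

For a file on disk, parser.read(path) would take the place of parser.read_string(text). A genre that is not in the file raises KeyError here.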

You could parse the file line-by-line this way:

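import collections

# Map each genre to the list of titles found under it.
result = collections.defaultdict(list)
with open('data') as f:
    # Titles that appear before the first header are filed under 'unknown'.
    genre = 'unknown'
    for line in f:
        line = line.strip()
        if line.startswith('[category]'):
            # Drop only the marker; the genre keeps its leading space and trailing colon.
            genre = line.replace('[category]', '', 1)
        elif line:
            result[genre].append(line)

for key in result:
    print('{k} {m}'.format(k=key, m=list(result[key])))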

yields

 Action: ['1. Movie', '2. Movie']
 Comedy: ['1. Movie']
 Horror: ['1. Movie', '2. Movie', '3. Movie']
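
If only one genre is needed, the same line-by-line idea can be wrapped in a function that stops as soon as the next header appears. This is only a sketch: the name extract_category and the sample titles are hypothetical, and it assumes headers look exactly like "[category] Name:".

```python
def extract_category(lines, wanted):
    """Return the titles listed under the header '[category] <wanted>:'."""
    titles = []
    inside = False  # True while we are within the wanted category
    for raw in lines:
        line = raw.strip()
        if line.startswith("[category]"):
            if inside:
                break  # next header reached: stop before spilling over
            # Strip the marker, surrounding whitespace and the trailing colon.
            name = line[len("[category]"):].strip().rstrip(":")
            inside = (name == wanted)
        elif inside and line:
            titles.append(line)
    return titles


sample = """\
[category] Horror:
Movie A
Movie B

[category] Comedy:
Movie C

[category] Action:
Movie D
"""

print(extract_category(sample.splitlines(), "Comedy"))
```

Because iterating over an open file yields lines as well, extract_category(open_file, "Comedy") should work the same way on a real file.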

Use a negative lookahead:

\[category\](?:(?!\[category\]).)*

will match one entire category (if the regex is compiled using the re.DOTALL option).
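
A small illustration of that pattern, using a made-up string with placeholder titles. The lookahead is checked before every character, so each match ends just before the next [category] marker:

```python
import re

# Hypothetical input; the titles are placeholders.
text = (
    "[category] Horror:\nMovie A\nMovie B\n"
    "[category] Comedy:\nMovie C\n"
)

# re.DOTALL lets '.' match newlines, so one match can span several lines.
block_re = re.compile(r"\[category\](?:(?!\[category\]).)*", re.DOTALL)

# The pattern has no capturing groups, so findall returns the whole matches.
blocks = block_re.findall(text)
for block in blocks:
    print(repr(block))
```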

You can grab the category and the contents separately by using

\[category\]\s*([^\r\n]*)\r?\n((?:(?!\[category\]).)*)

After a match, mymatch.group(1) will contain the category, and mymatch.group(2) will contain the movie titles.

Example in Python 3.1 (using your string as mymovies):

>>> import re
>>> myregex = re.compile(r"\[category\]\s*([^\r\n]*)\r?\n((?:(?!\[category\]).)*)", re.DOTALL)
>>> for mymatch in myregex.finditer(mymovies):
...     print("Category: {}".format(mymatch.group(1)))
...     for movie in mymatch.group(2).split("\n"):
...         if movie.strip():
...             print("contains: {}".format(movie.strip()))
...
Category: Horror:
contains: 1. Movie
contains: 2. Movie
contains: 3. Movie
Category: Comedy:
contains: 1. Movie
Category: Action:
contains: 1. Movie
contains: 2. Movie
>>>

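import re

# Raw string so the backslashes reach the regex engine unchanged.
re_cat = re.compile(r"\[category\] (.*):")

categories = {}

# Category currently being filled; None until the first header is seen.
category = None

with open("movies.txt", "r") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        if re_cat.match(line):
            # Reuse the same regex to strip everything but the category name.
            category = re_cat.sub(r"\1", line)
            if category not in categories:
                categories[category] = []
            continue
        # setdefault also covers titles listed before any header (key None).
        categories.setdefault(category, []).append(line)

print(categories)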

This produces the following dictionary:

{
'Action': ['Movie', 'Movie'],
'Horror': ['Movie', 'Movie', 'Movie'],
'Comedy': ['Movie']
}

We use the same regular expression for matching and stripping out the category name, so it's efficient to compile it with re.compile.

We have a running category variable which changes whenever a new category is parsed. Any line that doesn't define a new category gets added to the categories dictionary under the appropriate key. Categories defined for the first time create a list under the right dictionary key, but categories can also be listed multiple times and everything will end up under the right key.

Any movies listed before a category is defined will be in the dictionary under the None key.
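
To make those two edge cases concrete, here is a sketch of the same loop as a function run on a made-up list of lines. The name group_by_category and the titles are hypothetical; the header format is assumed to be the one from the question.

```python
import re

HEADER = re.compile(r"\[category\] (.*):")


def group_by_category(lines):
    """Group non-empty lines under the most recent category header."""
    categories = {}
    category = None  # stays None until the first header is seen
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            category = match.group(1)
            # A repeated header reuses the existing list.
            categories.setdefault(category, [])
        else:
            categories.setdefault(category, []).append(line)
    return categories


sample = [
    "Stray title",          # appears before any header
    "[category] Horror:",
    "Movie A",
    "[category] Comedy:",
    "Movie B",
    "[category] Horror:",   # same category listed a second time
    "Movie C",
]

grouped = group_by_category(sample)
print(grouped)
```

Here "Stray title" ends up under the None key, and both Horror blocks are merged into a single list.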
