How can I move forward to a subsequent portion of text once I've already printed a searched portion of text in Python?

I would like to search through a text file and print out a line and its subsequent 3 lines if a keyword is found in the line AND a different keyword is found within the subsequent 3 lines.

My code right now prints too much information. Is there a way to move forward to the next section of text once a portion is already printed?

text = """

here is some text 1
I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I don't want to print this line but I want to start looking for more text starting at this line 6
Don't print this line 7
Not this line either 8
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
Don't print this line 13
Start again searching here 14
etc.
"""

text2 = open("tmp.txt","w")
text2.write(text)
text2.close()

searchlines = open("tmp.txt").readlines()

data = []

for m, line in enumerate(searchlines):
    line = line.lower()
    if "keyword" in line and any("keyword2" in l.lower() for l in searchlines[m:m+4]):
        for line2 in searchlines[m:m+4]:
            data.append(line2)
print ''.join(data)

The output right now is:

I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I don't want to print this line but I want to start looking for more text starting at this line 6
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
Don't print this line 13
please print this line it has the keyword2 11
please print this line 12
Don't print this line 13
Start again searching here 14

I would like it to print out only:

I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12


As someone else has pointed out, your first keyword, "keyword", is a substring of your second keyword, "keyword2". I've therefore implemented this with compiled regexp objects, so that you can use the word boundary anchor \b to match each keyword only as a whole word.

import re
from StringIO import StringIO

text = """

here is some text 1
I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I don't want to print this line but I want to start looking for more text starting at this line 6
Don't print this line 7
Not this line either 8
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
Don't print this line 13
Start again searching here 14
etc.
"""


def my_scan(data,search1,search2):
  buffer = []
  for line in data:
    buffer.append(line)
    if len(buffer) > 4:
      buffer.pop(0)
    if len(buffer) == 4: # Valid search block
      if search1.search(buffer[0]) and search2.search("\n".join(buffer[1:4])):  # check all 3 following lines
        for item in buffer:
          yield item
        buffer = []

# Search terms; \b stops "keyword" from matching inside "keyword2"
s1 = re.compile(r'\bkeyword\b')
s2 = re.compile(r'\bkeyword2\b')

for row in my_scan(StringIO(text),s1,s2):
  print row.rstrip()

Produces:

I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
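
The word-boundary point can be checked in isolation. This is only a small sketch, written for Python 3 (so print is a function), using two lines from the sample text:

```python
import re

# \b matches only at a boundary between a word character and a non-word
# character, so "keyword" directly followed by "2" does not count.
whole_word = re.compile(r"\bkeyword\b")

line_a = "print this line keyword 4"
line_b = "print this line since it has a keyword2 3"

print(bool(whole_word.search(line_a)))   # True
print(bool(whole_word.search(line_b)))   # False
print("keyword" in line_b)               # True: the plain substring test still matches
```

The last line shows why the original `in` test fires on lines that contain only keyword2.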


So you want to print out all blocks of 4 lines that contain at least 2 different keywords?

Anyway, that's what I just came up with. Maybe you can use it:

text = """

here is some text 1
I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I don't want to print this line but I want to start looking for more text starting at this line 6
Don't print this line 7
Not this line either 8
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
Don't print this line 13
Start again searching here 14
etc.
""".splitlines()

keywords = ['keyword', 'keyword2']

buffer, kw = [], set()
for line in text:
    if len(buffer) == 0:                 # first line of a block
        found = [k for k in keywords if k in line]
        if found:                        # append the line once, even if several keywords match
            kw.update(found)
            buffer.append(line)
    else:                                # continuous lines
        buffer.append(line)
        for k in keywords:
            if k in line:
                kw.add(k)
        if len(buffer) > 3:
            if len(kw) >= 2:             # just print blocks with enough keywords
                print '\n'.join(buffer)
            buffer, kw = [], set()
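
One caveat with the substring test above: a line that contains only "keyword2" also counts as containing "keyword", so a block can reach two keywords with just one of them present. Below is a rough Python 3 sketch of a whole-word check, assuming that a "word" is a run of letters, digits or underscores. The helper name keywords_in is made up for this example.

```python
import re

KEYWORDS = {"keyword", "keyword2"}

def keywords_in(line, keywords=KEYWORDS):
    """Return the set of keywords that occur in `line` as whole words."""
    # Split the lowercased line into word tokens, then intersect with the keywords.
    words = set(re.findall(r"\w+", line.lower()))
    return words & set(keywords)

print(keywords_in("print this line since it has a keyword2 3"))
print(keywords_in("print this line keyword 4"))
```

In the loop above, kw.update(keywords_in(line)) could then stand in for the inner "for k in keywords" loop.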


Your keywords overlap: "keyword" is a substring of "keyword2".

Also, your sample data implies you don't want to see line 13, but according to the problem statement it should be printed.

I changed your first keyword from "keyword" to "firstkey", as in the diff below, and your code works (except for line 13).

$ diff /tmp/q /tmp/q2
4c4
< I want to print out this line and the following 3 lines only once keyword 2
---
> I want to print out this line and the following 3 lines only once firstkey 2
6c6
< print this line keyword 4
---
> print this line firstkey 4
11,12c11,12
< I want to print out this line again and the following 3 lines only once keyword 9
< please print this line keyword 10
---
> I want to print out this line again and the following 3 lines only once firstkey 9
> please print this line firstkey 10
30c30
<     if "keyword" in line and any("keyword2" in l.lower() for l in searchlines[m:m+4]):
---
>     if "firstkey" in line and any("keyword2" in l.lower() for l in searchlines[m:m+4]):


First, you could correct your code like this:

text = """
0//
1// here is some text 1
A2// I want to print out this line and the following 3 lines only once keyword 2
b3// print this line since it has a keyword2 3
b4// print this line keyword 4
b5// print this line 5
6// I don't want to print this line but I want to start looking for more text starting at this line 6
7// Don't print this line 7
8// Not this line either 8
A9// I want to print out this line again and the following 3 lines only once keyword 9
b10// please print this line keyword 10
b11// please print this line it has the keyword2 11
b12// please print this line 12
13// Don't print this line 13
14// Start again searching here 14
15// etc.
"""
searchlines = map(str.lower,text.splitlines(1))
# splitlines(1) with argument 1 keeps the newlines

data,again = [],-1

for m, line in enumerate(searchlines):
    if "keyword" in line and m>again and "keyword2" in ''.join(searchlines[m:m+4]):
        data.extend(searchlines[m:m+4])
        again = m+3  # index of the last printed line; the search resumes at m+4

print ''.join(data)
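
The same skip-ahead idea can also be written with an explicit index instead of the "again" marker. This is a minimal Python 3 sketch rather than a drop-in replacement: the function name find_blocks and the shortened sample list are made up for illustration, and it keeps the plain substring test from the question, so "keyword" still matches inside "keyword2".

```python
def find_blocks(lines, first, second, size=4):
    """Collect non-overlapping blocks of `size` lines.

    A block starts at a line containing `first` and is kept only if one of
    the following size-1 lines contains `second`.
    """
    blocks = []
    i = 0
    while i < len(lines):
        head = lines[i].lower()
        tail = lines[i + 1:i + size]
        if first in head and any(second in t.lower() for t in tail):
            blocks.append(lines[i:i + size])
            i += size      # jump past the block that was just collected
        else:
            i += 1         # no match here, try the next line
    return blocks

sample = [
    "here is some text 1",
    "start keyword 2",
    "has keyword2 3",
    "keyword 4",
    "plain 5",
    "plain 6",
]

for block in find_blocks(sample, "keyword", "keyword2"):
    print("\n".join(block))
```

Because the index jumps by the block size after a match, lines inside a printed block are never tested as the start of another block.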


Second, a short regex solution is:

text = """
0//
1// here is some text 1
A2// I want to print out this line and the following 3 lines only once keyword 2
b3// print this line since it has a keyword2 3
b4// print this line keyword 4
b5// print this line 5
6// I don't want to print this line but I want to start looking for more text starting at this line 6
7// Don't print this line 7
8// Not this line either 8
A9// I want to print out this line again and the following 3 lines only once keyword 9
b10// please print this line keyword 10
b11// please print this line it has the keyword2 11
b12// please print this line 12
13// Don't print this line 13
14// Start again searching here 14
15// etc.
"""

import re

regx = re.compile('(^.*?(?<=[ \t]){0}(?=[ \t]).*\r?\n'
                  '.*?((?<=[ \t]){1}(?=[ \t]))?.*\r?\n'
                  '.*?((?<=[ \t]){1}(?=[ \t]))?.*\r?\n'
                  '.*?(?(1)|(?(2)|{1})).*)'.\
                  format('keyword','keyword2'),re.MULTILINE|re.IGNORECASE)

print '\n'.join(m.group(1) for m in regx.finditer(text))

Result:

A2// I want to print out this line and the following 3 lines only once keyword 2
b3// print this line since it has a keyword2 3
b4// print this line keyword 4
b5// print this line 5
b10// please print this line keyword 10
b11// please print this line it has the keyword2 11
b12// please print this line 12
13// Don't print this line 13