
Finding multiple words and printing the next line using Python

I have a huge text file. It looks as follows:

> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000

.....

Now I want to create a Python script to find words like (> <Enzymologic: Ki nM 1>, > <Enzymologic: EC50/IC50 nM 1>) and print the line following each of them in tab-delimited format, as follows:

> <Enzymologic: Ki nM 1>     > <Enzymologic: EC50/IC50 nM 1>
257000                       n/a
5000                         1000
.... 

I tried the following code:

infile = path of the file
lines = infile.readlines()
infile.close()
searchtxt = "> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>"
for i, line in enumerate(lines): 
     if searchtxt in line and i+1 < len(lines):
         print lines[i+1]

But it doesn't work. Can anybody suggest some code to achieve this?

Thanks in advance


s = '''Enzymologic: Ki nM 1

257000

Enzymologic: IC50 nM 1

n/a

ITC: Delta_G0 kJ/mole 1

n/a

Enzymologic: Ki nM 1

5000

Enzymologic: IC50 nM 1

1000'''
from collections import defaultdict

lines = [x for x in s.splitlines() if x]
keys = lines[::2]
values = lines[1::2]
result = defaultdict(list)
for key, value in zip(keys, values):
    result[key].append(value)
print dict(result)

>>> {'ITC: Delta_G0 kJ/mole 1': ['n/a'], 'Enzymologic: Ki nM 1': ['257000', '5000'], 'Enzymologic: IC50 nM 1': ['n/a', '1000']}

Then format the output as you like.
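For example, a minimal formatting sketch (assuming the two column headers you want are the dict keys above, and using izip_longest to pad the shorter column):

from itertools import izip_longest

columns = ['Enzymologic: Ki nM 1', 'Enzymologic: IC50 nM 1']
print '\t'.join(columns)
for row in izip_longest(*(result[c] for c in columns), fillvalue=''):
    print '\t'.join(row)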


I think your problem comes from the fact that you test if searchtxt in line, where searchtxt is a whole tuple, instead of testing if pattern in line for each pattern in searchtxt. Here is what I'd do:

>>> path = 'D:\\temp\\Test.txt'
>>> lines = open(path).readlines()
>>> searchtxt = "Enzymologic: IC50 nM 1", "Enzymologic: Ki nM 1"
>>> from collections import defaultdict
>>> dict_patterns = defaultdict(list)
>>> for i, line in enumerate(lines):
    for pattern in searchtxt:
        if pattern in line and i+1 < len(lines):
             dict_patterns[pattern].append(lines[i+1])

>>> dict_patterns
defaultdict(<type 'list'>, {'Enzymologic: Ki nM 1': ['257000\n', '5000\n'],
                            'Enzymologic: IC50 nM 1': ['n/a\n', '1000']})

Using a dict allows you to group results by pattern (defaultdict is a convenient way to avoid having to initialize each entry yourself).
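For comparison, without defaultdict the same grouping needs an explicit initialization step, for example with setdefault:

dict_patterns = {}
for i, line in enumerate(lines):
    for pattern in searchtxt:
        if pattern in line and i+1 < len(lines):
            dict_patterns.setdefault(pattern, []).append(lines[i+1])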


You really have two separate problems:

Parse the file and extract the data from it

import itertools

# let's imitate a file
pseudo_file = """
> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000
""".split('\n')

def iterate_on_couple(iterable):
  """
    Iterate on two elements, by two elements
  """
  iterable = iter(iterable)
  for x in iterable:
    yield x, next(iterable)

plain_lines = (l for l in pseudo_file  if l.strip()) # ignore empty lines

results = {}

# store all results in a dictionary
for name, value in iterate_on_couple(plain_lines):
  results.setdefault(name, []).append(value)

# now you got a dictionary with all values linked to a name
print results

Now this code makes the assumption that your file is not corrupted and that it always follows the structure:

  • blank
  • name
  • value

If not, you may need something more robust.
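For instance, a slightly more defensive sketch that only assumes name lines start with '> <' (as in your sample) and skips anything unexpected:

# defensive variant: recognise name lines by their "> <" prefix instead of
# relying on a strict blank/name/value rhythm
results = {}
current_name = None
for raw_line in pseudo_file:
    line = raw_line.strip()
    if not line:
        continue
    if line.startswith('> <'):
        current_name = line
    elif current_name is not None:
        results.setdefault(current_name, []).append(line)
        current_name = None

print results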

Secondly, this stores all the values in memory, which could be a problem if you have a lot of values. In that case, you'll need to look at some storage solution such as the shelve module or sqlite.
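For example, a rough sketch with the shelve module, replacing the in-memory results dict above (the database filename is just an assumption):

import shelve

# sketch: persist the grouped values on disk instead of keeping them in a dict
db = shelve.open('results.db')  # hypothetical filename
for name, value in iterate_on_couple(plain_lines):
    stored = db.get(name, [])
    stored.append(value)
    db[name] = stored  # re-assign so shelve writes the updated list back
db.close()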

Save the results into a file

import csv

def get(iterable, index, default):
  """
    Return an item from array or default if IndexError
  """
  try:
      return iterable[index]
  except IndexError:
      return default

names = results.keys() # get a list of all names

# now we write our tab separated file using the csv module
out = csv.writer(open('/tmp/test.csv', 'w'), delimiter='\t')

# first the header
out.writerow(names)

# get the size of the longest column
max_size = max(len(results[name]) for name in names)

# then write the lines one by one
for i in xrange(max_size):
    line = [get(results[name], i, "-") for name in names]
    out.writerow(line)

Since I'm writing the whole code for you, I deliberately used some advanced Python idioms so you'll have some food for thought while using it.


import itertools

def search(lines, terms):
    results = [[t] for t in terms]
    lines = iter(lines)
    for l in lines:
        for i,t in enumerate(terms):
            if t in l:
                results[i].append(lines.next().strip())
                break
    return results

def format(results):
    s = []
    rows = list(itertools.izip_longest(*results, fillvalue=""))
    for row in rows:
        s.append("\t".join(row))
        s.append('\n')
    return ''.join(s)

And here's how to call the functions:

example = """> <Enzymologic: Ki nM 1>
257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000"""

def test():
    terms = ["> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>"]
    lines = example.split('\n')
    result = search(lines, terms)
    print format(result)
>>> test()
> <Enzymologic: IC50 nM 1>   > <Enzymologic: Ki nM 1>
n/a 257000

The above example separates each column by a single tab. If you need fancier formatting (as per your example), the format function gets a bit more complicated:

import math

def format(results):
    maxcolwidth = [0] * len(results)
    rows = list(itertools.izip_longest(*results, fillvalue=""))
    for row in rows:
        for i,col in enumerate(row):
            w = int(math.ceil(len(col)/8.0))*8
            maxcolwidth[i] = max(maxcolwidth[i], w)

    s = []
    for row in rows:
        for i,col in enumerate(row):
            s.append(col)
            padding = maxcolwidth[i]-len(col)
            tabs = int(math.ceil(padding/8.0))
            s.append('\t' * tabs)
        s.append('\n')

    return ''.join(s)


import re

pseudo_file = """
> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000"""

searchtxt = "nzymologic: Ki nM 1>", "<Enzymologic: IC50 nM 1>"

regx_AAA = re.compile('([^:]+: )([^ \t]+)(.*)')

tu = tuple(regx_AAA.sub('\\1.*?\\2.*?\\3',x) for x in searchtxt)

model = '%%-%ss  %%s\n' % len(searchtxt[0])

regx_BBB = re.compile(('%s[ \t\r\n]+(.+)[ \t\r\n]+'
                       '.+?%s[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)') % tu)


print 'tu   ==',tu
print 'model==',model
print 'regx_BBB.findall(pseudo_file)==\n',regx_BBB.findall(pseudo_file)



with open('woof.txt','w') as f:
    f.write(model % searchtxt)
    f.writelines(model % x for x in regx_BBB.findall(pseudo_file))

Result:

tu   == ('nzymologic: .*?Ki.*? nM 1>', '<Enzymologic: .*?IC50.*? nM 1>')
model== %-20s  %s

regx_BBB.findall(pseudo_file)==
[('257000', 'n/a'), ('5000', '1000')]

and the content of the file 'woof.txt' is:

> <Enzymologic: Ki nM 1>  > <Enzymologic: IC50 nM 1>
257000                    n/a
5000                      1000

To obtain regx_BBB, I first compute a tuple tu, because you want to catch a line like '> <Enzymologic: EC50/IC50 nM 1>' while searchtxt only contains 'IC50', not 'EC50/IC50'.

So the tuple tu introduces .*? into the strings of searchtxt so that the regex regx_BBB can catch lines CONTAINING IC50, and not only lines strictly EQUAL to the elements of searchtxt.

Note that I deliberately put the strings "nzymologic: Ki nM 1>" and "<Enzymologic: IC50 nM 1>" in searchtxt, different from the ones you use, to show that the regexes are built in such a way that the result is still obtained.

The only condition is that there must be at least ONE character before the ':' in each of the strings of searchtxt.

EDIT 1

I thought that in the file, a line '> <Enzymologic: IC50 nM 1>' or '> <Enzymologic: EC50/IC50 nM 1>' should always follow a line '> <Enzymologic: Ki nM 1>'

But after reading the other answers, I think that is not a given (that's the common problem with questions: they don't give enough information and precision).

If every line must be caught independently, the following simpler regex regx_BBB can be used:

regx_AAA = re.compile('([^:]+: )([^ \t]+)(.*)')

li = [ regx_AAA.sub('\\1.*?\\2.*?\\3',x) for x in searchtxt]

regx_BBB = re.compile('|'.join(li).join('()') + '[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)')

But formatting the output file will then be harder, and I am reluctant to write a complete new version without knowing precisely what is wanted.


Probably the simplest way to find a string in a line and then print the next line is to use itertools.islice:

from itertools import islice

searchtxt = "<Enzymologic: IC50 nM 1>"
with open('file.txt', 'r') as itfile:
    for line in itfile:
        if searchtxt in line:
            print line
            print ''.join(islice(itfile, 1))
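If you need both search terms side by side, a hedged extension of the same idea (file name assumed) could collect the following lines into a dict first and format them afterwards:

from itertools import islice

searchterms = ("> <Enzymologic: Ki nM 1>", "> <Enzymologic: IC50 nM 1>")
found = dict((t, []) for t in searchterms)

with open('file.txt') as itfile:
    for line in itfile:
        for term in searchterms:
            if term in line:
                # islice advances the same file iterator, so this grabs the next line
                found[term].append(''.join(islice(itfile, 1)).strip())
                break

print found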