
How to search for a string in Python by removing line breaks, but still return the exact line where the string was found?

I have a bunch of PDF files that I have to search for a set of keywords against. I have to extract the exact line where the keyword was found. I first used xpdf's pdftotext to convert the files to text. (I tried Solr but had a tough time tailoring the output/schema to suit my requirements.)

import sys

file_name = sys.argv[1]
searched_string = sys.argv[2]

# Collect (line_number, line) pairs for every line that contains the
# search string, case-insensitively.
with open(file_name) as fh:
    result = [(number + 1, line) for number, line in enumerate(fh)
              if searched_string.lower() in line.lower()]

for line_number, line in result:
    print(line_number, line)

ThinkCode:~$ python find_string.py sample.txt "String Extraction"

The problem I have with this is cases where the search string is broken across the end of a line:

If you are going to index large binary files, remember to change the size limits. String
Extraction is a common problem

If I search for 'String Extraction', the code above will miss this keyword. What is the most efficient way of achieving this without making two copies of the text file (one for searching for the keyword to extract the line number, and the other with line breaks removed to catch keywords that span two lines)?

Much appreciated guys!


Note: some considerations without any code, but I think they belong in an answer rather than in a comment.

My idea would be to search only for the first keyword; if a match is found, then search for the second. If the first keyword matches at the end of a line, take the next line into consideration too, so that line concatenation happens only when a match was found in the first place*.
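
A rough sketch of that idea, assuming a line break stands for a single space (the function name search_first_keyword and its structure are mine for illustration, not code from this answer):

import sys

def search_first_keyword(file_name, search_string):
    # Test each line for the first keyword only; concatenate with the
    # next line just when that cheap test succeeds.
    keywords = search_string.lower().split()
    needle = ' '.join(keywords)
    with open(file_name) as fh:
        lines = fh.read().splitlines()
    for number, line in enumerate(lines, start=1):
        if keywords[0] in line.lower():
            candidate = line
            if number < len(lines):
                # Treat the line break as a single space
                candidate = line + ' ' + lines[number]
            if needle in candidate.lower():
                print(number, candidate)

if __name__ == '__main__':
    search_first_keyword(sys.argv[1], sys.argv[2])

The cheap single-word test runs on every line; the concatenation and full comparison run only on the lines that pass it.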

Edit:

Coded a simple example and ended up using a different algorithm; the basic idea behind it is this code snippet:

import re

def iterwords(fh):
    # Yield a (line_number, word) pair for every whitespace-separated word
    for number, line in enumerate(fh):
        for word in re.split(r'\s+', line.strip()):
            yield number, word

It iterates over the file handle and produces a (line_number, word) tuple for each word in the file.

The matching afterwards becomes pretty easy; you can find my implementation as a gist on github. It can be run as follows:

python search.py 'multi word search string' file.txt
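
For reference, here is a minimal re-implementation of that matching step (this is not the linked gist; find_matches is an illustrative name, and punctuation handling is left out). Note that it shares the limitation discussed below:

import re
import sys

def iterwords(fh):
    for number, line in enumerate(fh):
        for word in re.split(r'\s+', line.strip()):
            yield number, word

def find_matches(fh, search_string):
    needle = search_string.lower().split()
    position = 0  # index of the next needle word we expect
    start = None  # line number where the current partial match began
    for number, word in iterwords(fh):
        if word.lower() == needle[position]:
            if position == 0:
                start = number
            position += 1
            if position == len(needle):
                yield start + 1  # enumerate() is 0-based
                position = 0
        else:
            position = 0  # naive reset -- see the spoiler below

if __name__ == '__main__':
    search_string, file_name = sys.argv[1], sys.argv[2]
    with open(file_name) as fh:
        for line_number in find_matches(fh, search_string):
            print('match starting on line', line_number)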

There is one main concern with the linked code that I didn't code a workaround for, both for performance and complexity reasons. Can you figure it out? (Spoiler: try searching for a sentence whose first word appears twice in a row in the file.)

* I didn't perform any testing on my own, but this article and the Python wiki suggest that string concatenation is not that efficient in Python (I don't know how current that information is).


There may be a better way of doing it, but my suggestion would be to start by reading two lines at a time (let's call them line1 and line2), concatenating them into a third line (line3), and then searching that resultant line.

Then you'd assign line2 to line1, read a new line2, and repeat the process.
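
A minimal sketch of that approach, assuming the line break is replaced by a single space when the pair is joined (the name search_with_window is illustrative):

import sys

def search_with_window(file_name, searched_string):
    needle = searched_string.lower()
    previous = None  # (line_number, text) of the previous line, i.e. line1
    with open(file_name) as fh:
        for number, line in enumerate(fh, start=1):
            line = line.rstrip('\n')
            if needle in line.lower():
                # Match contained within a single line
                print(number, line)
            elif previous is not None:
                # line3: the concatenated pair, with the line break
                # treated as a single space
                joined = previous[1] + ' ' + line
                if needle in joined.lower():
                    print(previous[0], joined)
            previous = (number, line)

if __name__ == '__main__':
    search_with_window(sys.argv[1], sys.argv[2])

Checking the single line first avoids reporting the same hit once per window that contains it.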


Use the flag re.MULTILINE when compiling your expressions: http://docs.python.org/library/re.html#re.MULTILINE

Then use \s to match any whitespace (including newlines).
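
For illustration, a sketch along those lines (the file name sample.txt and the search string are placeholders). One caveat: \s already matches newlines without any flag, and re.MULTILINE only changes the meaning of ^ and $, so re.IGNORECASE is the only flag this sketch actually needs:

import re

with open('sample.txt') as fh:
    text = fh.read()

# Build the pattern from the search string, turning each space into \s+
# so the match may cross a line break.
search_string = 'String Extraction'
pattern = re.compile(r'\s+'.join(map(re.escape, search_string.split())),
                     re.IGNORECASE)

for match in pattern.finditer(text):
    # Recover the 1-based line number from the match offset
    line_number = text.count('\n', 0, match.start()) + 1
    print(line_number, match.group())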

