Finding the line numbers of all occurrences of a string in a text file
I'm trying to write a function that does the following:
Given a text file, I want to find all occurrences of a certain string in this file; then, for each occurrence, the number of the line on which it was found should be added to a list. We assume that each line contains at most one occurrence. The text file can get very large, which means a simple for-loop that iterates over each line of the file will be too slow.
For example, say we have a file with the content:
- A B C D E F G
- H J K L M N O
- G F E D C B A
- P Q R S T U V
If I were to search for "A", the function would find it on lines 1 and 3 and thus add 1 and 3 to a list (and then return the list).
I was considering binary search, but it seems to require that the list be sorted and the elements be distinct, whereas I'm looking for identical values.
So, is there any other search algorithm I can base my function on, with roughly the same performance as binary search?
Thanks!
You can index your lines, if they change infrequently and you will be performing many searches on them. One way to index them would be to create a histogram of which characters are present in which lines (and how many times, perhaps). Then you can invert this and say that the letter A, for example, appears on lines 5, 10 and 20. If you are searching for "ABF", you can use the inverted histogram to determine which lines are candidates (i.e., contain the letters 'A', 'B' and 'F') and then only look at those lines.
Whether or not this is an effective strategy will depend on the selectivity of your searches and the length of the search strings, among other things. Only testing will reveal whether or not the algorithm has merit for your particular usage patterns.
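As a rough illustration of the inverted-histogram idea, here is a minimal Python sketch. The function names `build_char_index` and `find_lines` are made up for this example, and it assumes the lines fit in memory as a list; for a very large file you would more likely store byte offsets per line and re-read only the candidate lines. It records only which characters appear on which lines (not how many times), then verifies each candidate line with a real substring check.

```python
from collections import defaultdict


def build_char_index(lines):
    """Map each character to the set of 1-based line numbers that contain it."""
    index = defaultdict(set)
    for lineno, line in enumerate(lines, start=1):
        for ch in set(line):  # presence only; counts are not tracked here
            index[ch].add(lineno)
    return index


def find_lines(lines, index, needle):
    """Return sorted line numbers whose line contains `needle` as a substring."""
    if not needle:
        return []
    # A line can only match if it contains every character of the needle.
    postings = [index.get(ch, set()) for ch in set(needle)]
    candidates = set.intersection(*postings)
    # The index only narrows the search; confirm with an actual substring test.
    return sorted(n for n in candidates if needle in lines[n - 1])


# Sample data taken from the question (assumed to be already read into a list).
lines = [
    "A B C D E F G",
    "H J K L M N O",
    "G F E D C B A",
    "P Q R S T U V",
]
index = build_char_index(lines)
print(find_lines(lines, index, "A"))    # [1, 3]
print(find_lines(lines, index, "B A"))  # [3]
```

Building the index still costs one full pass over the file, so it only pays off when the same file is searched many times. For a single-character search the index answers directly; for longer strings its benefit depends on how many lines survive the intersection step.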