Find consecutive substring indexes
Given a search string and a result string (which is guaranteed to contain all letters of the search string, case-insensitive, in order), how can I most efficiently get an array of ranges repr开发者_StackOverflow中文版esenting the indices in the result string corresponding to the letters in the search string?
Desired output:
substrings( "word", "Microsoft Office Word 2007" )
#=> [ 17..20 ]
substrings( "word", "Network Setup Wizard" )
#=> [ 3..5, 19..19 ]
#=> [ 3..4, 18..19 ] # Alternative, acceptable, less-desirable output
substrings( "word", "Watch Network Daemon" )
#=> [ 0..0, 10..11, 14..14 ]
This is for an autocomplete search box. Here's a screenshot from a tool similar to Quicksilver that underlines letters as I'm looking to do. Note that--unlike my ideal output above--this screenshot does not prefer longer single matches.
Benchmark Results
Benchmarking the current working results shows that @tokland's regex-based answer is basically as fast as the StringScanner-based solutions I put forth, with less code:
user system total real
phrogz1 0.889000 0.062000 0.951000 ( 0.944000)
phrogz2 0.920000 0.047000 0.967000 ( 0.977000)
tokland 1.030000 0.000000 1.030000 ( 1.035000)
Here is the benchmark test:
a=["Microsoft Office Word 2007","Network Setup Wizard","Watch Network Daemon"]
b=["FooBar","Foo Bar","For the Love of Big Cars"]
test = { a=>%w[ w wo wor word ], b=>%w[ f fo foo foobar fb fbr ] }
require 'benchmark'
Benchmark.bmbm do |x|
%w[ phrogz1 phrogz2 tokland ].each{ |method|
x.report(method){ test.each{ |words,terms|
words.each{ |master| terms.each{ |term|
2000.times{ send(method,term,master) }
} }
} }
}
end
To have something to start with, how about that?
>> s = "word"
>> re = /#{s.chars.map{|c| "(#{c})" }.join(".*?")}/i # /(w).*?(o).*?(r).*?(d)/i/
>> match = "Watch Network Daemon".match(re)
=> #<MatchData "Watch Network D" 1:"W" 2:"o" 3:"r" 4:"D">
>> 1.upto(s.length).map { |idx| match.begin(idx) }
=> [0, 10, 11, 14]
And now you only have to build the ranges (if you really need them, I guess the individual indexes are also ok).
Ruby's Abbrev
module is a good starting point. It breaks down a string into a hash consisting of the unique keys that can identify the full word:
require 'abbrev'
require 'pp'
abbr = Abbrev::abbrev(['ruby'])
>> {"rub"=>"ruby", "ru"=>"ruby", "r"=>"ruby", "ruby"=>"ruby"}
For every keypress you can do a lookup and see if there's a match. I'd filter out all keys shorter than a certain length, to reduce the size of the hash.
The keys will also give you a quick set of words to look up the subword matches in your original string.
For fast lookups to see if there's a substring match:
regexps = Regexp.union(
abbr.keys.sort.reverse.map{ |k|
Regexp.new(
Regexp.escape(k),
Regexp::IGNORECASE
)
}
)
Note that it's escaping the patterns, which would allow characters to be entered, such as ?
, *
or .
, and be treated as literals, instead of special characters for regex, like they would normally be treated.
The result looks like:
/(?i-mx:ruby)|(?i-mx:rub)|(?i-mx:ru)|(?i-mx:r)/
Regexp's match
will return information about what was found.
Because the union
"ORs" the patterns, it will only find the first match, which will be the shortest occurrence in the string. To fix that reverse the sort.
That should give you a good start on what you want to do.
EDIT: Here's some code to directly answer the question. We've been busy at work so it's taken a couple days to get back this:
require 'abbrev'
require 'pp'
abbr = Abbrev::abbrev(['ruby'])
regexps = Regexp.union( abbr.keys.sort.reverse.map{ |k| Regexp.new( Regexp.escape(k), Regexp::IGNORECASE ) } )
target_str ='Ruby rocks, rub-a-dub-dub, RU there?'
str_offset = 0
offsets = []
loop do
match_results = regexps.match(target_str, str_offset)
break if (match_results.nil?)
s, e = match_results.offset(0)
offsets << [s, e - s]
str_offset = 1 + s
end
pp offsets
>> [[0, 4], [5, 1], [12, 3], [27, 2], [33, 1]]
If you want ranges replace offsets << [s, e - s]
with offsets << [s .. e]
which will return:
>> [[0..4], [5..6], [12..15], [27..29], [33..34]]
Here's a late entrant that's making a move as it nears the finish line.
code
def substrings( search_str, result_str )
search_chars = search_str.downcase.chars
next_char = search_chars.shift
result_str.downcase.each_char.with_index.take_while.with_object([]) do |(c,i),a|
if next_char == c
(a.empty? || i != a.last.last+1) ? a << (i..i) : a[-1]=(a.last.first..i)
next_char = search_chars.shift
end
next_char
end
end
demo
substrings( "word", "Microsoft Office Word 2007" ) #=> [17..20]
substrings( "word", "Network Setup Wizard" ) #=> [3..5, 19..19]
substrings( "word", "Watch Network Daemon" ) #=> [0..0, 10..11, 14..14]
benchmark
user system total real
phrogz1 1.120000 0.000000 1.120000 ( 1.123083)
cary 0.550000 0.000000 0.550000 ( 0.550728)
I don't think there are any built in methods that will really help with this, probably the best way is to go through each letter in the word you're searching for and build up the ranges manually. Your next best option would probably be to build a regex like in @tokland's answer.
Here's my implementation:
require 'strscan'
def substrings( search, master )
[].tap do |ranges|
scan = StringScanner.new(master)
init = nil
last = nil
prev = nil
search.chars.map do |c|
return nil unless scan.scan_until /#{c}/i
last = scan.pos-1
if !init || (last-prev) > 1
ranges << (init..prev) if init
init = last
end
prev = last
end
ranges << (init..last)
end
end
And here's a shorter version using another utility method (also needed by @tokland's answer):
require 'strscan'
def substrings( search, master )
s = StringScanner.new(master)
search.chars.map do |c|
return nil unless s.scan_until(/#{c}/i)
s.pos - 1
end.to_ranges
end
class Array
def to_ranges
return [] if empty?
[].tap do |ranges|
init,last = first
each do |o|
if last && o != last.succ
ranges << (init..last)
init = o
end
last = o
end
ranges << (init..last)
end
end
end
精彩评论