开发者

Find consecutive substring indexes

Given a search string and a result string (which is guaranteed to contain all letters of the search string, case-insensitive, in order), how can I most efficiently get an array of ranges repr开发者_StackOverflow中文版esenting the indices in the result string corresponding to the letters in the search string?

Desired output:

substrings( "word", "Microsoft Office Word 2007" )
#=> [ 17..20 ]

substrings( "word", "Network Setup Wizard" )
#=> [ 3..5, 19..19 ]
#=> [ 3..4, 18..19 ]   # Alternative, acceptable, less-desirable output

substrings( "word", "Watch Network Daemon" )
#=> [ 0..0, 10..11, 14..14 ]

This is for an autocomplete search box. Here's a screenshot from a tool similar to Quicksilver that underlines letters as I'm looking to do. Note that--unlike my ideal output above--this screenshot does not prefer longer single matches.

Find consecutive substring indexes

Benchmark Results

Benchmarking the current working results shows that @tokland's regex-based answer is basically as fast as the StringScanner-based solutions I put forth, with less code:

               user     system      total        real
phrogz1    0.889000   0.062000   0.951000 (  0.944000)
phrogz2    0.920000   0.047000   0.967000 (  0.977000)
tokland    1.030000   0.000000   1.030000 (  1.035000)

Here is the benchmark test:

a=["Microsoft Office Word 2007","Network Setup Wizard","Watch Network Daemon"]
b=["FooBar","Foo Bar","For the Love of Big Cars"]
test = { a=>%w[ w wo wor word ], b=>%w[ f fo foo foobar fb fbr ] }
require 'benchmark'
Benchmark.bmbm do |x|
  %w[ phrogz1 phrogz2 tokland ].each{ |method|
    x.report(method){ test.each{ |words,terms|
      words.each{ |master| terms.each{ |term|
        2000.times{ send(method,term,master) }
      } }
    } }
  }
end


To have something to start with, how about that?

>> s = "word"
>> re = /#{s.chars.map{|c| "(#{c})" }.join(".*?")}/i # /(w).*?(o).*?(r).*?(d)/i/
>> match = "Watch Network Daemon".match(re)
=> #<MatchData "Watch Network D" 1:"W" 2:"o" 3:"r" 4:"D">
>> 1.upto(s.length).map { |idx| match.begin(idx) }
=> [0, 10, 11, 14]

And now you only have to build the ranges (if you really need them, I guess the individual indexes are also ok).


Ruby's Abbrev module is a good starting point. It breaks down a string into a hash consisting of the unique keys that can identify the full word:

require 'abbrev'
require 'pp'

abbr = Abbrev::abbrev(['ruby'])
>> {"rub"=>"ruby", "ru"=>"ruby", "r"=>"ruby", "ruby"=>"ruby"}

For every keypress you can do a lookup and see if there's a match. I'd filter out all keys shorter than a certain length, to reduce the size of the hash.

The keys will also give you a quick set of words to look up the subword matches in your original string.

For fast lookups to see if there's a substring match:

regexps = Regexp.union(
  abbr.keys.sort.reverse.map{ |k|
    Regexp.new(
      Regexp.escape(k),
      Regexp::IGNORECASE
    )
  }
)

Note that it's escaping the patterns, which would allow characters to be entered, such as ?, * or ., and be treated as literals, instead of special characters for regex, like they would normally be treated.

The result looks like:

/(?i-mx:ruby)|(?i-mx:rub)|(?i-mx:ru)|(?i-mx:r)/

Regexp's match will return information about what was found.

Because the union "ORs" the patterns, it will only find the first match, which will be the shortest occurrence in the string. To fix that reverse the sort.

That should give you a good start on what you want to do.


EDIT: Here's some code to directly answer the question. We've been busy at work so it's taken a couple days to get back this:

require 'abbrev'
require 'pp'

abbr = Abbrev::abbrev(['ruby'])
regexps = Regexp.union( abbr.keys.sort.reverse.map{ |k| Regexp.new( Regexp.escape(k), Regexp::IGNORECASE ) } )

target_str ='Ruby rocks, rub-a-dub-dub, RU there?'
str_offset = 0
offsets = []
loop do
  match_results = regexps.match(target_str, str_offset)
  break if (match_results.nil?)
  s, e = match_results.offset(0)
  offsets << [s, e - s]
  str_offset = 1 + s
end

pp offsets

>> [[0, 4], [5, 1], [12, 3], [27, 2], [33, 1]]

If you want ranges replace offsets << [s, e - s] with offsets << [s .. e] which will return:

>> [[0..4], [5..6], [12..15], [27..29], [33..34]]


Here's a late entrant that's making a move as it nears the finish line.

code

def substrings( search_str, result_str )
  search_chars = search_str.downcase.chars
  next_char = search_chars.shift
  result_str.downcase.each_char.with_index.take_while.with_object([]) do |(c,i),a|
    if next_char == c
      (a.empty? || i != a.last.last+1) ? a << (i..i) : a[-1]=(a.last.first..i)
      next_char = search_chars.shift
    end   
    next_char
  end
end

demo

substrings( "word", "Microsoft Office Word 2007" ) #=> [17..20]
substrings( "word", "Network Setup Wizard" )       #=> [3..5, 19..19]
substrings( "word", "Watch Network Daemon" )       #=> [0..0, 10..11, 14..14]

benchmark

              user     system      total        real
phrogz1   1.120000   0.000000   1.120000 (  1.123083)
cary      0.550000   0.000000   0.550000 (  0.550728)


I don't think there are any built in methods that will really help with this, probably the best way is to go through each letter in the word you're searching for and build up the ranges manually. Your next best option would probably be to build a regex like in @tokland's answer.


Here's my implementation:

require 'strscan'
def substrings( search, master )
  [].tap do |ranges|
    scan = StringScanner.new(master)
    init = nil
    last = nil
    prev = nil
    search.chars.map do |c|
      return nil unless scan.scan_until /#{c}/i
      last = scan.pos-1
      if !init || (last-prev) > 1
        ranges << (init..prev) if init
        init = last
      end
      prev = last
    end
    ranges << (init..last)
  end
end

And here's a shorter version using another utility method (also needed by @tokland's answer):

require 'strscan'
def substrings( search, master )
  s = StringScanner.new(master)
  search.chars.map do |c|
    return nil unless s.scan_until(/#{c}/i)
    s.pos - 1
  end.to_ranges
end

class Array
  def to_ranges
    return [] if empty?
    [].tap do |ranges|
      init,last = first
      each do |o|
        if last && o != last.succ
          ranges << (init..last)
          init = o
        end
        last = o
      end
      ranges << (init..last)
    end
  end
end
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜