Find consecutive substring indexes

2023-02-26 18:05 Q&A author:

Given a search string and a result string (which is guaranteed to contain all letters of the search string, case-insensitive, in order), how can I most efficiently get an array of ranges representing the indices in the result string corresponding to the letters in the search string?

Desired output:

substrings( "word", "Microsoft Office Word 2007" )
#=> [ 17..20 ]

substrings( "word", "Network Setup Wizard" )
#=> [ 3..5, 19..19 ]
#=> [ 3..4, 18..19 ]   # Alternative, acceptable, less-desirable output

substrings( "word", "Watch Network Daemon" )
#=> [ 0..0, 10..11, 14..14 ]

This is for an autocomplete search box. Here's a screenshot from a tool similar to Quicksilver that underlines letters as I'm looking to do. Note that--unlike my ideal output above--this screenshot does not prefer longer single matches.


Benchmark Results

Benchmarking the current working results shows that @tokland's regex-based answer is basically as fast as the StringScanner-based solutions I put forth, with less code:

               user     system      total        real
phrogz1    0.889000   0.062000   0.951000 (  0.944000)
phrogz2    0.920000   0.047000   0.967000 (  0.977000)
tokland    1.030000   0.000000   1.030000 (  1.035000)

Here is the benchmark test:

a=["Microsoft Office Word 2007","Network Setup Wizard","Watch Network Daemon"]
b=["FooBar","Foo Bar","For the Love of Big Cars"]
test = { a=>%w[ w wo wor word ], b=>%w[ f fo foo foobar fb fbr ] }
require 'benchmark'
Benchmark.bmbm do |x|
  %w[ phrogz1 phrogz2 tokland ].each{ |method|
    x.report(method){ test.each{ |words,terms|
      words.each{ |master| terms.each{ |term|
        2000.times{ send(method,term,master) }
      } }
    } }
  }
end

To have something to start with, how about this?

>> s = "word"
>> re = /#{s.chars.map{|c| "(#{Regexp.escape(c)})" }.join(".*?")}/i # /(w).*?(o).*?(r).*?(d)/i
>> match = "Watch Network Daemon".match(re)
=> #<MatchData "Watch Network D" 1:"W" 2:"o" 3:"r" 4:"D">
>> 1.upto(s.length).map { |idx| match.begin(idx) }
=> [0, 10, 11, 14]

And now you only have to build the ranges (if you really need them, I guess the individual indexes are also ok).
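Building the ranges from those indexes can be sketched with Enumerable#slice_when, which splits the list wherever two neighbouring indexes are not adjacent. The names indexes_to_ranges and fuzzy_ranges are hypothetical helpers made up for this illustration, and the wrapper assumes the same lazy regex as above (with each character escaped); treat it as one possible approach rather than the definitive one.

```ruby
# Hypothetical helper: collapse ascending indexes into ranges of consecutive runs.
def indexes_to_ranges(indexes)
  indexes
    .slice_when { |a, b| b != a + 1 }   # start a new run wherever there is a gap
    .map { |run| run.first..run.last }
end

# Hypothetical wrapper around the regex approach shown above.
def fuzzy_ranges(search, target)
  source  = search.chars.map { |c| "(#{Regexp.escape(c)})" }.join(".*?")
  pattern = Regexp.new(source, Regexp::IGNORECASE)
  match   = target.match(pattern)
  return nil unless match               # the letters do not all occur in order

  # One capture group per letter; begin(i) is the character index of group i.
  indexes = (1..search.length).map { |i| match.begin(i) }
  indexes_to_ranges(indexes)
end

p fuzzy_ranges("word", "Watch Network Daemon")   # => [0..0, 10..11, 14..14]
```

Because the regex takes the leftmost match lazily, this does not by itself prefer longer single runs.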

Ruby's Abbrev module is a good starting point. It breaks down a string into a hash consisting of the unique keys that can identify the full word:

require 'abbrev'
require 'pp'

abbr = Abbrev::abbrev(['ruby'])
>> {"rub"=>"ruby", "ru"=>"ruby", "r"=>"ruby", "ruby"=>"ruby"}

For every keypress you can do a lookup and see if there's a match. I'd filter out all keys shorter than a certain length, to reduce the size of the hash.
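As a rough sketch of that filtering and lookup: the hash literal below copies the Abbrev output shown above so the snippet runs on its own, and MIN_KEY_LENGTH and lookup are hypothetical names chosen for illustration, not part of Abbrev.

```ruby
# Hypothetical threshold: ignore abbreviations shorter than this.
MIN_KEY_LENGTH = 2

# Same shape as the Abbrev output shown above, written out by hand.
abbr = { "rub" => "ruby", "ru" => "ruby", "r" => "ruby", "ruby" => "ruby" }

# Drop the short keys to keep the lookup table small.
filtered = abbr.select { |key, _word| key.length >= MIN_KEY_LENGTH }

# Hypothetical lookup run on every keypress: returns the full word or nil.
def lookup(table, typed)
  table[typed.downcase]
end

p filtered.keys.sort          # => ["ru", "rub", "ruby"]
p lookup(filtered, "RU")      # => "ruby"
p lookup(filtered, "r")       # => nil
```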

The keys will also give you a quick set of words to look up the subword matches in your original string.

For fast lookups to see if there's a substring match:

regexps = Regexp.union(
  abbr.keys.sort.reverse.map{ |k|
    Regexp.new(
      Regexp.escape(k),
      Regexp::IGNORECASE
    )
  }
)

Note that the patterns are escaped, so characters such as ?, * or . can be entered and are treated as literals rather than as regex metacharacters.

The result looks like:

/(?i-mx:ruby)|(?i-mx:rub)|(?i-mx:ru)|(?i-mx:r)/

Regexp's match will return information about what was found.

Because the union "ORs" the patterns, the first alternative that matches at a given position wins. With the keys sorted in ascending order that would be the shortest key, so the sort is reversed to put the longest keys first.
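The effect of the ordering can be checked with a small experiment. The key list is written out by hand here and the variable names are made up for this sketch.

```ruby
keys = %w[r ru rub ruby]

# Alternatives are tried left to right, so the order of the keys matters.
shortest_first = Regexp.union(keys.sort.map { |k| /#{Regexp.escape(k)}/i })
longest_first  = Regexp.union(keys.sort.reverse.map { |k| /#{Regexp.escape(k)}/i })

shortest_match = "Ruby rocks"[shortest_first]
longest_match  = "Ruby rocks"[longest_first]

p shortest_match   # => "R"
p longest_match    # => "Ruby"
```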

That should give you a good start on what you want to do.

EDIT: Here's some code to directly answer the question. We've been busy at work, so it's taken a couple of days to get back to this:

require 'abbrev'
require 'pp'

abbr = Abbrev::abbrev(['ruby'])
regexps = Regexp.union( abbr.keys.sort.reverse.map{ |k| Regexp.new( Regexp.escape(k), Regexp::IGNORECASE ) } )

target_str ='Ruby rocks, rub-a-dub-dub, RU there?'
str_offset = 0
offsets = []
loop do
  match_results = regexps.match(target_str, str_offset)
  break if (match_results.nil?)
  s, e = match_results.offset(0)
  offsets << [s, e - s]
  str_offset = 1 + s
end

pp offsets

>> [[0, 4], [5, 1], [12, 3], [27, 2], [33, 1]]

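If you want ranges, replace offsets << [s, e - s] with offsets << [s .. e]. Note that e is the offset just past the match, so each inclusive range runs one index beyond the matched text; use s...e if the range should cover exactly the matched characters. With [s .. e] the result is: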

>> [[0..4], [5..6], [12..15], [27..29], [33..34]]
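For reference, here is a sketch of the same loop wrapped in a method that returns exclusive ranges (s...e), so each range covers exactly the matched characters. The method name match_ranges is hypothetical, and the pattern is written out by hand rather than built from Abbrev so that the snippet is self-contained.

```ruby
# Hypothetical helper: collect the range of every match, restarting the
# search one character after the start of the previous match.
def match_ranges(pattern, target)
  ranges = []
  pos = 0
  while (m = pattern.match(target, pos))
    s, e = m.offset(0)      # e is the offset just past the match
    ranges << (s...e)       # exclusive range: covers the matched text only
    pos = s + 1
  end
  ranges
end

pattern = /ruby|rub|ru|r/i  # longest alternatives first
p match_ranges(pattern, 'Ruby rocks, rub-a-dub-dub, RU there?')
# => [0...4, 5...6, 12...15, 27...29, 33...34]
```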

Here's a late entrant that's making a move as it nears the finish line.

code

def substrings( search_str, result_str )
  search_chars = search_str.downcase.chars
  next_char = search_chars.shift
  result_str.downcase.each_char.with_index.take_while.with_object([]) do |(c,i),a|
    if next_char == c
      (a.empty? || i != a.last.last+1) ? a << (i..i) : a[-1]=(a.last.first..i)
      next_char = search_chars.shift
    end   
    next_char
  end
end

demo

substrings( "word", "Microsoft Office Word 2007" ) #=> [17..20]
substrings( "word", "Network Setup Wizard" )       #=> [3..5, 19..19]
substrings( "word", "Watch Network Daemon" )       #=> [0..0, 10..11, 14..14]

benchmark

              user     system      total        real
phrogz1   1.120000   0.000000   1.120000 (  1.123083)
cary      0.550000   0.000000   0.550000 (  0.550728)

I don't think there are any built-in methods that will really help with this. Probably the best way is to go through each letter in the word you're searching for and build up the ranges manually. Your next best option would probably be to build a regex like in @tokland's answer.

Here's my implementation:

require 'strscan'
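def substrings( search, master )
  [].tap do |ranges|
    scan = StringScanner.new(master)
    init = nil   # start index of the run currently being built
    prev = nil   # index of the previously matched letter
    search.each_char do |c|
      # Advance to the next occurrence of c (case-insensitive); give up if none.
      return nil unless scan.scan_until(/#{Regexp.escape(c)}/i)
      last = scan.pos - 1
      if !init || (last - prev) > 1
        # Gap since the previous letter: close the current run, start a new one.
        ranges << (init..prev) if init
        init = last
      end
      prev = last
    end
    # Close the final run (nothing to add if search was empty).
    ranges << (init..prev) if init
  end
end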

And here's a shorter version using another utility method (also needed by @tokland's answer):

require 'strscan'
def substrings( search, master )
  s = StringScanner.new(master)
  search.chars.map do |c|
    return nil unless s.scan_until(/#{Regexp.escape(c)}/i)
    s.pos - 1
  end.to_ranges
end

class Array
  def to_ranges
    return [] if empty?
    [].tap do |ranges|
      init,last = first
      each do |o|
        if last && o != last.succ
          ranges << (init..last)
          init = o
        end
        last = o
      end
      ranges << (init..last)
    end
  end
end
