Extract individual existing words in domain names
I'm looking for a Ruby gem (preferably) that will cut domain names up into their words.
whatwomenwant.com => 3 words, "what", "women", "want".
If it can ignore things like numbers and gibberish then great.
You'll need a word list such as those produced by Project Gutenberg or available in the source for ispell &c. Then you can use the following code to decompose a domain into words:
WORD_LIST = [
  'experts',
  'expert',
  'exchange',
  'sex',
  'change',
]
# Every word in the list that the phrase starts with
def words_that_phrase_begins_with(phrase)
  WORD_LIST.find_all do |word|
    phrase.start_with?(word)
  end
end
# Recursively collect every way to split the phrase into known words
def phrase_to_words(phrase, words = [], word_list = [])
  if phrase.empty?
    word_list << words
  else
    words_that_phrase_begins_with(phrase).each do |word|
      remainder = phrase[word.size..-1]
      phrase_to_words(remainder, words + [word], word_list)
    end
  end
  word_list
end
p phrase_to_words('expertsexchange')
# => [["experts", "exchange"], ["expert", "sex", "change"]]
If given a phrase that has any unrecognized words, it returns an empty array:
p phrase_to_words('expertsfoo')
# => []
If the word list is long, this will be slow, because every lookup scans the entire list. You can make the algorithm faster by preprocessing the word list into a tree (a trie). The preprocessing itself takes time, so whether it's worth it depends on how many domains you want to test.
Here's some code to turn the word list into a tree:
# Insert one word into the tree, one letter per level
def add_word_to_tree(tree, word)
  first_letter = word[0..0].to_sym
  remainder = word[1..-1]
  tree[first_letter] ||= {}
  if remainder.empty?
    tree[first_letter][:word] = true
  else
    add_word_to_tree(tree[first_letter], remainder)
  end
end
def make_word_tree
  root = {}
  WORD_LIST.each do |word|
    add_word_to_tree(root, word)
  end
  root
end
# Build the tree once and reuse it
def word_tree
  @word_tree ||= make_word_tree
end
This produces a tree that looks like this (the exact key order depends on the Ruby version):
{:c=>{:h=>{:a=>{:n=>{:g=>{:e=>{:word=>true}}}}}}, :s=>{:e=>{:x=>{:word=>true}}}, :e=>{:x=>{:c=>{:h=>{:a=>{:n=>{:g=>{:e=>{:word=>true}}}}}}, :p=>{:e=>{:r=>{:t=>{:word=>true, :s=>{:word=>true}}}}}}}}
It looks like Lisp, doesn't it? Each node in the tree is a hash. Each hash key is either a letter, with the value being another node, or the symbol :word, with the value being true. A node that contains :word marks the end of a complete word.
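To make the node structure concrete, here is a minimal sketch of looking a single word up in such a tree. Both build_tree and word_in_tree? are hypothetical helpers, not part of the code above; the tree is built inline only so that the snippet runs on its own.

```ruby
# Build the same kind of tree as above: each letter maps to a child node,
# and :word marks the end of a complete word.
def build_tree(words)
  root = {}
  words.each do |word|
    node = root
    word.each_char { |c| node = (node[c.to_sym] ||= {}) }
    node[:word] = true
  end
  root
end

# Hypothetical helper: follow one letter at a time. The word is present
# only if every letter has a node and the last node carries :word.
def word_in_tree?(tree, word)
  node = tree
  word.each_char do |c|
    node = node[c.to_sym]
    return false if node.nil?
  end
  node[:word] == true
end

tree = build_tree(%w[experts expert exchange sex change])
p word_in_tree?(tree, 'expert')  # => true
p word_in_tree?(tree, 'exp')     # => false (a prefix, but not a word)
p word_in_tree?(tree, 'foo')     # => false
```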
Modifying words_that_phrase_begins_with to use the new tree structure will make it faster:
def words_that_phrase_begins_with(phrase)
  node = word_tree
  words = []
  # Walk the tree one letter at a time, stopping at the first dead end
  phrase.each_char.with_index do |c, i|
    node = node[c.to_sym]
    break if node.nil?
    words << phrase[0..i] if node[:word]
  end
  words
end
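Putting the pieces together, the sketch below repeats the definitions from this answer in a single runnable file, so the tree-based version can be checked end to end. Building the tree iteratively and storing it in a constant named WORD_TREE are assumptions made for brevity, not something the code above does.

```ruby
WORD_LIST = %w[experts expert exchange sex change]

# Build the tree once: letters are symbol keys, :word marks a complete word.
# (WORD_TREE is an assumed name; the answer memoizes in word_tree instead.)
WORD_TREE = WORD_LIST.each_with_object({}) do |word, root|
  node = root
  word.each_char { |c| node = (node[c.to_sym] ||= {}) }
  node[:word] = true
end

# Walk the tree along the phrase, collecting every word that is a prefix.
def words_that_phrase_begins_with(phrase)
  node = WORD_TREE
  words = []
  phrase.each_char.with_index do |c, i|
    node = node[c.to_sym]
    break if node.nil?
    words << phrase[0..i] if node[:word]
  end
  words
end

# Same recursion as before: try each prefix word, then split the remainder.
def phrase_to_words(phrase, words = [], word_list = [])
  if phrase.empty?
    word_list << words
  else
    words_that_phrase_begins_with(phrase).each do |word|
      phrase_to_words(phrase[word.size..-1], words + [word], word_list)
    end
  end
  word_list
end

p phrase_to_words('expertsexchange')
# => [["expert", "sex", "change"], ["experts", "exchange"]]
p phrase_to_words('expertsfoo')
# => []
```

The splits come out in a different order than with the array scan, because the tree walk finds shorter prefixes ("expert") before longer ones ("experts").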
I don't know of any gems for this, but if I had to solve this problem, I would download an English word dictionary and read up on text searching algorithms.
When there is more than one way to divide the letters (like in sepp2k's expertsexchange), then you can use two hints:
- Your dictionary is sorted by, for example, the popularity of each word. Divisions that use the most popular words will then be more valuable.
- You can go to the main page of the site whose domain you are analyzing and just read the content, searching for your words. I don't think that you'll find sex on a page for some experts. But, hm, experts can be so different ;)
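The first hint can be sketched as a scoring step over the candidate splits. The sketch assumes the word list is ordered from most to least popular; the order used here is made up purely for illustration, and split_score and best_split are hypothetical helpers.

```ruby
# ASSUMPTION: words are listed from most to least popular. This particular
# order is invented for the example, not taken from real frequency data.
WORDS_BY_POPULARITY = %w[change exchange expert experts sex]

# Lower rank = more popular.
RANK = WORDS_BY_POPULARITY.each_with_index.to_h

# Sum of the ranks of all words in a split. Words missing from the list
# get the worst possible rank.
def split_score(words)
  words.sum { |w| RANK.fetch(w, WORDS_BY_POPULARITY.size) }
end

# Hypothetical helper: choose the split with the lowest (best) total rank.
def best_split(candidates)
  candidates.min_by { |words| split_score(words) }
end

candidates = [%w[experts exchange], %w[expert sex change]]
p best_split(candidates)
# => ["experts", "exchange"]
```

Summing the ranks also favors splits with fewer words. That is a design choice of this sketch; averaging the ranks would be one alternative.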
Update
I've been working with this challenge and came up with the following code. Please refactor if I'm doing something wrong :-)
Benchmark:
Runtime: 11 sec.
f file (domainlist.txt): 13,000 lines of domain names
w file (commonwords.txt): 2,000 words (to check against)
Code:
f = File.open('resource/domainlist.txt', 'r')
# Drop the trailing newline from every line, otherwise nothing below matches
lines = f.readlines.map(&:chomp)
f.close
w = File.open('resource/commonwords.txt', 'r')
words = w.readlines.map(&:chomp)
w.close
results = {}
lines.each do |line|
  # Start with words from 2 letters on, so ignoring 1 letter words like 'a'
  word_size = 2
  # Only get the .com domains
  if line =~ /^.*,[a-z]+\.com.*$/i then
    # Strip the .com off the domain
    line.gsub!(/^.*,([a-z]+)\.com.*$/i, '\\1')
    # If the domain name is longer than 3 and shorter than 15 characters
    if line.size > 3 and line.size < 15 then
      # For the length of the string run ...
      line.size.times do
        # Set the counter
        i = 0
        # As long as we're within the length of the string
        while i <= line.size - word_size do
          # Get the substring of word_size letters starting at position i
          word = line[i, word_size]
          # Check the word against our list
          if words.include?(word)
            results[line] ||= []
            # Add all the found words to the hash
            results[line] << word
          end
          i += 1
        end
        word_size += 1
      end
    end
  end
end
p results
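Much of the runtime in a loop like this likely goes into words.include?, which scans the whole array on every call. Below is a minimal sketch of the same substring search using a Set for the lookups. words_in_domain and the inline word list are hypothetical stand-ins for the files above, and no timing is claimed for it.

```ruby
require 'set'

# Hypothetical stand-in for the contents of commonwords.txt.
COMMON_WORDS = Set.new(%w[what women want men an at])

# Return every substring of `domain` (at least `min_size` letters long)
# that appears in `word_set`, shortest substrings first.
def words_in_domain(domain, word_set, min_size = 2)
  found = []
  (min_size..domain.size).each do |size|
    (0..domain.size - size).each do |start|
      candidate = domain[start, size]
      found << candidate if word_set.include?(candidate)
    end
  end
  found
end

p words_in_domain('whatwomenwant', COMMON_WORDS)
# => ["at", "an", "men", "what", "want", "women"]
```

Like the loop above, this reports every known word that occurs anywhere in the name, including overlapping ones, so it finds candidate words rather than a clean split.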