Tool or API needed to find text containing any word from a large dictionary of words
I'm looking for a tool (ideally) or, failing that, an API to search a large number of text files for instances of any word from a large dictionary of words. The "words" in my case are actually file names, but they won't contain spaces.
A fast algorithm might build a DFA (deterministic finite automaton) from the dictionary and then find instances of the dictionary words in a single pass over each file, for any number of files.
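One well-known way to get this kind of single-pass automaton is the Aho-Corasick algorithm. The following is only a minimal sketch of it, not a tuned implementation; the function names `build_automaton` and `find_matches` are made up for illustration. It assumes the dictionary fits in memory, and it reports every occurrence, including ones that sit inside a longer token.

```python
from collections import deque


def build_automaton(words):
    """Build an Aho-Corasick automaton (trie + failure links) from the words."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> dictionary words that end at this state

    # Phase 1: insert every word into the trie.
    for word in words:
        if not word:
            continue  # skip empty strings, they would match everywhere
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(word)

    # Phase 2: breadth-first traversal to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure state.
            out[nxt] = out[nxt] + out[fail[nxt]]
            queue.append(nxt)

    return goto, fail, out


def find_matches(text, automaton):
    """Scan text once and return (start_index, word) for every occurrence."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i - len(word) + 1, word))
    return matches
```

The automaton is built once and can then be reused for every file, for example `find_matches(open(path).read(), automaton)`. The scan never backs up in the text, so its cost grows with the text length plus the number of matches reported, not with the size of the dictionary.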
Note: I want exact text matching, not fuzzy matching as in this SO question: Algorithm wanted: Find all words of a dictionary that are similar to words in a free text
Have you looked at Lucene? There are Java and .NET versions.
http://lucene.apache.org/java/docs/index.html
I'd load the dictionary of words into a HashMap or "Dictionary", then read each file line by line or word by word, checking whether the map contains each word.
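As a rough sketch of that approach in Python, using a `set` in place of a HashMap: the function name `find_dictionary_words` is hypothetical, and it assumes that splitting on whitespace is good enough, which fits the question's statement that the words contain no spaces. A word with punctuation attached (for example `report.pdf,`) would not match without extra cleanup.

```python
def find_dictionary_words(lines, dictionary):
    """Return (line_number, word) for every whitespace-separated token
    that appears in the dictionary."""
    words = set(dictionary)  # hash-based lookup, O(1) on average
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for token in line.split():
            if token in words:
                hits.append((line_no, token))
    return hits
```

Because an open file object yields its lines one at a time, it can be passed directly as `lines`, so a large file never has to be loaded whole: `with open(path) as f: hits = find_dictionary_words(f, dictionary)`.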