Testing if a string contains one of several thousand substrings

2023-04-06 09:11 问答作者：

I'm going to be running through live twitter data and attempting to pull out tweets that mention, for example, movie titles. Assuming I have a list of ~7000 hard-coded movie titles I'd like to look against, what's the best way to selec开发者_开发知识库t the relevant tweets? This project is in it's infancy so I'm open to any looking into any solution (i.e. language agnostic.) Any help would be greatly appreciated.

Update: I'd be curious if anyone had any insight to how the Yahoo! Placemaker API, solves this problem. It can take a text string and return a geocoded JSON result of all the locations mentioned in it.

You could try Wu and Manber's A Fast Algorithm For Multi-Pattern Searching.

The multi-pattern matching problem lies at the heart of virus scanning, so you might look to scanner implementations for inspiration. ClamAV, for example, is open source and some papers have been published describing its algorithms:

Lin, Lin and Lai: A Hybrid Algorithm of Backward Hashing and Automaton Tracking for Virus Scanning (a variant of Wu-Manber; the paper is behind the IEEE paywall).

Cha, Moraru, et al: SplitScreen: Enabling Efﬁcient, Distributed Malware Detection

If you use compiled regular expressions, it should be pretty fast. Maybe especially if you put lots of titles in one expression.

Efficiently searching for many terms in a long character sequence would require a specialized algorithm to avoid testing for every term at every position.

But since it sounds like you have short strings with a known pattern, you should be able to use something fairly simple. Store the set of titles you care about in a hash table or tree. Parse out "string1" and "string2" from each tweet using a regex, and test whether they are contained in the set.

Working off what erickson suggested, the most feasible search is for the ("is better than" in your example), then checking for one of the 7,000 terms. You could instead narrow the set by creating 7,000 searches for "[movie] is better than" and then filtering manually on the second movie, but you'll probably hit the search rate limit pretty quickly.

You could speed up the searching by using a dedicated search service like Solr instead of using text parsing. You might be able to pull out titles quickly using some natural language processing service (OpenCalais?), but that would be better suited to batch processing.

For simultaneously searching for a large number of possible targets, the Rabin-Karp algorithm can often be useful.

继续阅读：language-agnostic regex string substring

Testing if a string contains one of several thousand substrings

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？