Is there an algorithm to find matches *without* using regex, given only the type of match?

I mean, is there an algorithm to automatically find matches given only the type of match you want? For instance, given "disease", is there a modern algorithm, probably using ML techniques (I am just guessing) or any other technique, to find all the disease names in a given piece of text? How do you think this can be done without regexes?

Thanks


Topic-based searching is non-trivial at best, though it's rarely done using regexes (or at least not primarily with regexes).

For topic based searching, you typically use something that looks/acts (oddly enough) rather similar to a spam filter. In fact, assuming it used a pure Bayesian model, you could probably get a typical spam filter to do a decent job of classifying documents into those (probably) related to a particular topic, and those that (probably) aren't, just by using the right training data (i.e., instead of training it based on spam/non-spam, you train it on, in this case, medical/non-medical).
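To make the spam-filter analogy concrete, here is a minimal sketch of a two-class naive Bayes filter trained on medical/non-medical text instead of spam/non-spam. The class name `NaiveBayesTopicFilter`, the tokenizer, and the four training sentences are all made up for illustration; a real filter would need far more training data and a better tokenizer.

```python
import math
import string
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w]


class NaiveBayesTopicFilter:
    """Hypothetical two-class filter: on-topic (True) vs. off-topic (False)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, on_topic):
        self.word_counts[on_topic].update(tokenize(text))
        self.doc_counts[on_topic] += 1

    def log_odds(self, text):
        """log P(on-topic | text) - log P(off-topic | text).

        Assumes at least one training document exists for each class.
        """
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        tokens = [w for w in tokenize(text) if w in vocab]  # skip unseen words
        score = 0.0
        for label, sign in ((True, 1), (False, -1)):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            log_prob = math.log(self.doc_counts[label] / total_docs)  # prior
            for w in tokens:
                # Laplace (add-one) smoothing avoids log(0) for words
                # seen only in the other class.
                log_prob += math.log((counts[w] + 1) / (total_words + len(vocab)))
            score += sign * log_prob
        return score

    def is_on_topic(self, text):
        return self.log_odds(text) > 0


# Toy training data, invented for this example.
topic_filter = NaiveBayesTopicFilter()
topic_filter.train("The patient was diagnosed with diabetes and hypertension", True)
topic_filter.train("Symptoms of influenza include fever and cough", True)
topic_filter.train("The stock market closed higher today", False)
topic_filter.train("The team won the football match last night", False)
```

With that toy data, `topic_filter.is_on_topic("The patient has a fever and cough")` comes out true and `topic_filter.is_on_topic("The football team won")` comes out false. Note that this classifies whole pieces of text by topic; it does not by itself pick out the individual disease names.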

That really only works for one topic at a time though. You have to train it separately for each topic. If you want to manage multiple topics more or less simultaneously, you probably want to look at something like Latent Semantic Indexing (which is more commonly used for machine learning types of things). This will support (for example) taking a few thousand documents, and separating them into a number of groups, rather than just those related to a specific topic, and everything else.

Depending on the kinds of searches you want to support, there are also automated keyword extraction algorithms, but I won't try to get into this, since it's not clear that you care about it.

Since somebody mentioned using regexes for dealing with different forms of words, and for misspellings, I'll add that regexes are not typically used for either of those purposes. There are algorithms (e.g., Porter's stemmer) specifically for removing suffixes to get a (probable) base word. There are others (e.g., Levenshtein distance) that are more often used to deal with spelling errors.
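As a rough illustration of the spelling-error side, here is a sketch of Levenshtein distance using the standard dynamic-programming formulation, plus a helper that maps a possibly misspelled word to the nearest entry in a known term list. The helper `closest_term`, its `max_distance` threshold, and the three-item disease list are assumptions made for this example, not part of any library.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        prev = curr
    return prev[-1]


def closest_term(word, vocabulary, max_distance=2):
    """Return the vocabulary entry nearest to word, or None if even
    the nearest one is more than max_distance edits away."""
    word = word.lower()
    best = min(vocabulary, key=lambda term: levenshtein(word, term))
    return best if levenshtein(word, best) <= max_distance else None


# Hypothetical term list; a real one would come from a medical dictionary.
diseases = ["diabetes", "asthma", "influenza"]
```

Here `closest_term("diabetis", diseases)` finds `"diabetes"` (one substitution away), while an unrelated word such as `"keyboard"` gives `None`. Scanning the whole list for every word is fine for a sketch but would be slow against a large dictionary.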
