Patterns of key-value pairs in text
I have some labels and attributes from text. I am looking for patterns (combinations of key-value pairs that occur across many documents) of labels and attributes amongst these documents.
What kind of algorithm and tool should I be looking into? I want to score these patterns based on relevance and importance, not just string matching.
Any input would be great. Thanks.
If I understand your question correctly, you are talking about association rule mining. Example: attr1==value1 ==> label=label1 (95% precision, usually called confidence in this context).
There are several algorithms, one of them is Apriori.
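To make the idea of a rule with a precision figure concrete, here is a minimal sketch. It is not Apriori itself: it only brute-forces rules with a single attribute==value on the left-hand side and scores them by support and confidence. The `docs` layout (a list of (attributes dict, label) pairs), the function name and the thresholds are all assumptions made for illustration.

```python
from collections import Counter

def mine_rules(docs, min_support=2, min_confidence=0.8):
    """Return (attr, value, label, support, confidence) tuples.

    docs: hypothetical list of (attributes_dict, label) pairs.
    support    = number of docs containing attr==value AND the label
    confidence = support / number of docs containing attr==value
    """
    pair_counts = Counter()   # occurrences of (attr, value)
    joint_counts = Counter()  # occurrences of (attr, value, label)
    for attrs, label in docs:
        for attr, value in attrs.items():
            pair_counts[(attr, value)] += 1
            joint_counts[(attr, value, label)] += 1

    rules = []
    for (attr, value, label), support in joint_counts.items():
        confidence = support / pair_counts[(attr, value)]
        if support >= min_support and confidence >= min_confidence:
            rules.append((attr, value, label, support, confidence))
    # Most confident rules first, ties broken by higher support.
    rules.sort(key=lambda r: (r[4], r[3]), reverse=True)
    return rules

# Toy data, invented for the example.
docs = [
    ({"color": "red", "size": "L"}, "sale"),
    ({"color": "red", "size": "M"}, "sale"),
    ({"color": "red", "size": "L"}, "sale"),
    ({"color": "blue", "size": "L"}, "regular"),
    ({"color": "blue", "size": "M"}, "sale"),
]
rules = mine_rules(docs)
```

Apriori extends this to left-hand sides with several attributes, pruning any combination whose sub-combinations already fall below the minimum support.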
The second interpretation of your question is feature selection, i.e. selecting the attributes that have the most impact on label prediction. For that, look at information gain or chi^2 selection; you can find all of this in Weka (www.cs.waikato.ac.nz/ml/weka).
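For a feel of what chi^2 selection measures, here is a rough pure-Python sketch of the Pearson chi^2 statistic for one attribute against the label. The input format (a list of (attribute value, label) pairs) and the function name are assumptions for illustration; Weka and similar tools do the same kind of computation for you and add ranking on top.

```python
from collections import Counter

def chi_square(pairs):
    """Pearson chi^2 statistic for a list of (attribute_value, label) pairs.

    Higher values mean the attribute and the label are further from
    independent, i.e. the attribute is more useful for predicting the label.
    """
    n = len(pairs)
    joint = Counter(pairs)                          # observed cell counts
    value_totals = Counter(v for v, _ in pairs)     # row totals
    label_totals = Counter(l for _, l in pairs)     # column totals
    stat = 0.0
    for v, v_count in value_totals.items():
        for l, l_count in label_totals.items():
            expected = v_count * l_count / n        # count under independence
            observed = joint[(v, l)]                # 0 if the cell is empty
            stat += (observed - expected) ** 2 / expected
    return stat

# Toy data, invented for the example.
informative = [("red", "sale")] * 3 + [("blue", "regular")] * 3
uninformative = [("L", "sale"), ("L", "regular"), ("M", "sale"), ("M", "regular")]
score_informative = chi_square(informative)
score_uninformative = chi_square(uninformative)
```

Computing this score per attribute and sorting gives a simple ranking; with small counts the statistic is unreliable, so a minimum-support filter is still worth applying.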
If you don't want to use such algorithms and would rather implement something yourself, the simplest version looks like this:
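    # count(...) and possible_values(...) are placeholders for your own helpers
    scored = []  # (prob, (attribute, value, label)) tuples
    for a in attributes:
        for label in labels:
            for value in possible_values(a):
                # probability criterion; chi^2 works better
                prob = count(a, value, label) / count(label)
                if count(a, value) > MIN_SUPPORT:  # not too rare
                    scored.append((prob, (a, value, label)))
    scored.sort(reverse=True)  # highest probability first
    print(scored)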
I think using regular expressions and string matching (a set of rules, ordered by precedence) is your best option. Otherwise you would have to use complicated natural language processing tools that require lots of training and huge datasets to determine the concept behind the data you're trying to mine.
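A set of rules ordered by precedence could be sketched roughly as follows. The rule names and patterns are made up for illustration; the only point is that the list is tried top to bottom and the first match wins, so more specific rules go first.

```python
import re

# Hypothetical rules, most specific first; the first matching rule wins.
RULES = [
    ("price", re.compile(r"\$\s?(\d+(?:\.\d{2})?)")),
    ("year", re.compile(r"\b((?:19|20)\d{2})\b")),
    ("number", re.compile(r"\b(\d+)\b")),
]

def extract(text):
    """Return (key, value) for the highest-precedence rule that matches, else None."""
    for key, pattern in RULES:
        match = pattern.search(text)
        if match:
            return key, match.group(1)
    return None

price = extract("$19.99")
year = extract("in 2011")
number = extract("42 items")
nothing = extract("none")
```

Without the ordering, "2011" would also be caught by the generic number rule, which is why precedence matters.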
Depends. If the keys are natural classes, use classification on the keys using the labels as data (or vice versa). If not, use clustering, either hierarchical (dendrograms) or flat (k-means).
In the clustering case, string matching is your friend, as you can cluster together those strings that have a low distance (Levenshtein, LCS, n-gram overlap). You can use it in addition to any other features you can think up.
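As a rough illustration of clustering strings by distance, here is a sketch using Levenshtein distance with a greedy single-pass grouping. Note that the grouping is neither k-means nor a proper hierarchical method, just the simplest thing that shows the idea; the `max_distance` threshold and the sample strings are assumptions.

```python
def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cluster(strings, max_distance=2):
    """Greedy clustering: each string joins the first cluster whose
    representative (its first member) is within max_distance."""
    clusters = []
    for s in strings:
        for c in clusters:
            if levenshtein(s, c[0]) <= max_distance:
                c.append(s)
                break
        else:
            # No existing cluster was close enough; start a new one.
            clusters.append([s])
    return clusters

groups = cluster(["color", "colour", "colr", "size", "sizes"])
```

The result depends on input order and on the threshold, so for real data a hierarchical method over the full distance matrix is usually the safer choice.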