开发者

Pattern Matching Algorithm

I am trying to find an algorithm that does the following but have been unsuccessful: I have a bunch of data that looks like the following:

Type    geneA    geneB    geneC    ...    geneN
A       1        0        1               1
A       0        0        1               1
B       1        1        1               0
C       0        1        1               0
B       1        1        0               0
etc        

So not all A's are exactly the same, not all B's are exactly the same, etc., but hopefully they have some kind of pattern. The data is preferably not just booleans (so would contain numbers other than just 0 and 1), but booleans would be fine for a start.

What I want to do is given a gene series for a person, find out what type they are. For example I would like to input "011...1" and be told that this person is most likely type A.

This seems like something that should have been done before but I can't seem to find any existing algorithms to do this (maybe pattern matching is 开发者_C百科the wrong key term?).

Any help on where to start looking for this kind of thing or algorithms that do this kind of thing would be much appreciated.


You could merge your gene* binary values into vectors: e.g. 1001...1, 001...1, etc. and perform k-means clustering on them.

For example, if you know a priori that you have three types A, B and C, you would likely perform clustering with k = 3.

Once you have clusters, you could use silhouettes to determine how well an input vector (e.g., 011...1) would fit to one of the three established clusters.


You should have a look at weka . It's a machine learning tool that implement a lot of algorithm. It will help you classify your new datas.

Weka will give you the opportunity to solve this with a decision tree, a bayes network, rules,neural network .etc...

If you want to implement it yourself just find the one suitable for your situation and implement it.

You want to predict class appartenance :

create a file like this one :

@relation gene

@attribute gene1 {0,1}
@attribute gene2 {0,1}
@attribute gene3 real
...
@attribute class {A, B, C }

@data
1,1,1,A
1,0,0,B
1,0,1,D 
0,0,1,?
..etc

and give it to weka. You will get your result in a second.

Hope it helps


If you don't know how to solve something, just use neural networks :-) I think it would fit to this case. Or use some N-dimensional clustering or classification algorithms.


This seems to be solvable without too much effort: All your inputs are vectors of integers. Your patterns are vectors too, with an attached type. To find the best pattern match for a given input vector, you can calculate the distance between the one input vector and all the pattern vectors. So, for a pattern [A: (0, 0, 1, 0)], the distance for input (0, 1, 1, 0) would be |0-0| + |0-1| + |1-1| + |0-0| = 1. The best matches are the patterns with the lowest distance. This would work for arbitrary integer components too.


There's a simple approach using clustering methods.

The centroid of a cluster is the average vector value for all the vectors in the cluster.

Collect all instances of type A (cluster A) and calculated it's centroid.
Collect all instances of type B (cluster B) and calculated it's centroid.
And similar for other types, collect them and calculate their centroid.

Then take unknown type X and calculate the distance (Euclidean distance) to every clusters centroid. The minimum distance is the most likely type.

Here's an example for 2 dimension.

Cluster A has two types with vector [1,1] and [3,3]. The centroid for cluster A is [2,2]

Cluster B has two types with vector [10,10] and [12,8]. The centroid for cluster B is [11,9]

Here's a random type X with vector [2,4]. The distance from X would be closer to cluster A in this example than cluster B.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜