Can I use K-means algorithm on a string?
I am working on a python project where I开发者_开发技巧 study RNA structure evolution (represented as a string for example: "(((...)))" where the parenthesis represent basepairs). The point being is that I have an ideal structure and a population that evolves towards the ideal structure. I have implemented everything however I would like to add a feature where I can get the "number of buckets" ie the k most representative structures in the population at each generation.
I was thinking of using the k-means algorithm but I am not sure how to use it with strings. I found scipy.cluster.vq but I don't know how to use it in my case.
thanks!
One problem you would face if using scipy.cluster.vq.kmeans
is that that function uses Euclidean distance to measure closeness. To shoe-horn your problem into one solveable by k-means
clustering, you'd have to find a way to convert your strings into numerical vectors and be able to justify using Euclidean distance as a reasonable measure of closeness.
That seems... difficult. Perhaps you are looking for Levenshtein distance instead?
Note there are variants of the K-means algorithm that can work with non-Euclideance distance metrics (such as Levenshtein distance). K-medoids
(aka PAM), for instance, can be applied to data with an arbitrary distance metric.
For example, using Pycluster
's implementation of k-medoids
, and nltk
's implementation of Levenshtein distance,
import nltk.metrics.distance as distance
import Pycluster as PC
words = ['apple', 'Doppler', 'applaud', 'append', 'barker',
'baker', 'bismark', 'park', 'stake', 'steak', 'teak', 'sleek']
dist = [distance.edit_distance(words[i], words[j])
for i in range(1, len(words))
for j in range(0, i)]
labels, error, nfound = PC.kmedoids(dist, nclusters=3)
cluster = dict()
for word, label in zip(words, labels):
cluster.setdefault(label, []).append(word)
for label, grp in cluster.items():
print(grp)
yields a result like
['apple', 'Doppler', 'applaud', 'append']
['stake', 'steak', 'teak', 'sleek']
['barker', 'baker', 'bismark', 'park']
K-means only works with euclidean distance. Edit distances such as Levenshtein don't even obey the triangle inequality may obey the triangle inequality, but are not euclidian. For the sorts of metrics you're interested in, you're better off using a different sort of algorithm, such as Hierarchical clustering: http://en.wikipedia.org/wiki/Hierarchical_clustering
Alternately, just convert your list of RNA into a weighted graph, with Levenshtein weights at the edges, and then decompose it into a minimum spanning tree. The most connected nodes of that tree will be, in a sense, the "most representative".
K-means doesn't really care about the type of the data involved. All you need to do a K-means is some way to measure a "distance" from one item to another. It'll do its thing based on the distances, regardless of how that happens to be computed from the underlying data.
That said, I haven't used scipy.cluster.vq
, so I'm not sure exactly how you tell it the relationship between items, or how to compute a distance from item A to item B.
What you need for Kmeans is a 'distance' measure (numbers representing a vector so it can find the distances between the vectors and cluster them around centroids based on the distances). Following are some examples I wrote for you:
Let's say you've got strings that represent dates like
2019-06-27 15:52:41.623Z
. What you want to do in this case is pick a date say when UTC timestamps start. Now with that starting date and time as the reference, you can calculate the 'distance' to each date string.Suppose instead, you have code strings,
if(a == b)
vs.if(a == c)
then you might want to use a different 'distance' like the number of characters that differ between the strings.Or if you have Html DOM structure,
<html></html>
vs<html><head></head></html>
you might not want to count characters but how many tags are different as your 'distance'.Or for a known enum in database, you could define each key as a different number with your own idea of 'distance' between the enums. For example, 'male', 'female', 'neutral' if you define as the vectors [0], [1], [2] would imply neutral is closer to female than male. So you might instead want to do [0],[2],[1] or [-1],[1],[0].
For RNA/DNA structure asked in the question, the 'distance' could be how many base pairs are different between the strands.
I hope you get the idea. So, you need to consider what is the content of your string and think of the best way to define the 'distance' between your content. Simple character diff distance could work as a generic distance measure between strings, but if you get better distance ideas, your algorithm will work better.
精彩评论