DBSCAN algorithm and clustering algorithm for data mining
How do you implement the DBSCAN algorithm on categorical data (mushroom data set)?
And what is a one pass clustering algorithm?
Could you provide pseudo code for a one pass clustering algorithm?
You can run DBSCAN with an arbitrary distance function without any changes to it. The indexing part will be more difficult, so you will likely only get O(n^2) complexity.
But if you look closely at DBSCAN, all it does is compute distances, compare them to a threshold, and count objects. This is a key strength: it can easily be applied to various kinds of data, and all you need to define is a distance function and the thresholds.
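To make the "distances, threshold, counting" point concrete, here is a minimal sketch of an index-free DBSCAN that takes the distance function as a parameter. The names `dbscan` and `hamming` and the use of a mismatch count as the categorical distance are my assumptions for illustration; any other distance on categorical records would plug in the same way.

```python
from collections import deque


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def dbscan(points, eps, min_pts, dist=hamming):
    """Naive O(n^2) DBSCAN sketch; returns one label per point, -1 = noise."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # Linear scan instead of an index query; the result includes i itself.
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels
```

For the mushroom data you would pass each row as a tuple of attribute values, for example `dbscan(rows, eps=2, min_pts=5)`; those two thresholds are placeholders and would need tuning.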
I doubt there is a one-pass version of DBSCAN, as it relies on pairwise distances. You can prune some of these computations (this is where the index comes into play), but without an index you essentially need to compare every object to every other object, so it is in O(n^2), or roughly O(n log n) with a good index, and not one-pass.
One-pass: I believe the original k-means was a one-pass algorithm. The first k objects are your initial means. For every new object, you choose the closest mean and update it (incrementally) with the new object. As long as you don't do another iteration over your data set, this is "one-pass". (The result will be even worse than Lloyd-style k-means, though.)
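A minimal sketch of that one-pass scheme, assuming numeric vectors and squared Euclidean distance (for categorical data you would need a different notion of "mean", which this sketch does not cover). The function name `one_pass_kmeans` is mine, not from any library.

```python
def one_pass_kmeans(stream, k):
    """Single pass over `stream`; returns (means, counts)."""
    means, counts = [], []
    for x in stream:
        x = [float(v) for v in x]
        if len(means) < k:
            # The first k objects become the initial means.
            means.append(x)
            counts.append(1)
            continue
        # Pick the closest mean by squared Euclidean distance.
        best = min(
            range(k),
            key=lambda c: sum((m - v) ** 2 for m, v in zip(means[c], x)),
        )
        # Incremental update: new_mean = old_mean + (x - old_mean) / count.
        counts[best] += 1
        means[best] = [m + (v - m) / counts[best] for m, v in zip(means[best], x)]
    return means, counts
```

Each object is read exactly once and only the k means and their counts are kept in memory, which is what makes it one-pass; the result depends on the order of the stream.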
Read the first k items and hold them. Compute the distances between them.
For each remaining item:
Find out which of the k held items it is closest to, and the distance between these two items.
If this distance is longer than the closest distance between any two of the k held items, you can swap the new item with one of the two items in that closest pair without decreasing the closest distance between any two of the new k items. Choose whichever swap increases this closest distance the most.
Suppose that the set of all items can be divided up into l <= k clusters so that the distance between any two points in the same cluster is smaller than the distance between any two points in different clusters. Then after running this algorithm, you will retain at least one point from each cluster.
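The steps of this second procedure can be sketched as follows. This is only my reading of the description: the names `min_pair` and `one_pass_representatives` are made up, `dist` is whatever distance function you supply, k is assumed to be at least 2, and the closest pair is recomputed from scratch for every item (O(k^2) per item) to keep the code short.

```python
from itertools import combinations, islice


def min_pair(items, dist):
    """Return (distance, i, j) for the closest pair among `items`."""
    return min(
        (dist(items[i], items[j]), i, j)
        for i, j in combinations(range(len(items)), 2)
    )


def one_pass_representatives(stream, k, dist):
    """Keep k items in one pass, trying to keep them far apart (k >= 2)."""
    it = iter(stream)
    kept = list(islice(it, k))  # read the first k items and hold them
    for x in it:
        closest, i, j = min_pair(kept, dist)
        nearest = min(dist(x, y) for y in kept)
        if nearest <= closest:
            continue  # x is too close to a held item, so drop it
        # Try replacing each member of the closest pair and keep the swap
        # that yields the larger minimum pairwise distance.
        best = None
        for idx in (i, j):
            trial = kept[:idx] + [x] + kept[idx + 1:]
            score = min_pair(trial, dist)[0]
            if best is None or score > best[0]:
                best = (score, trial)
        kept = best[1]
    return kept
```

The k items returned are representatives rather than cluster assignments; assigning every object to its nearest representative would take a second pass over the data.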