开发者

Outlier detection in data mining [closed]

Closed. This question is off-topic. It is not currently accepting answers.

Want to improve this question? Update the question so it's on-topic for Stack Overflow.

Closed 10 years ago.

Improve this question

I have a few sets of questions regarding outlier detection:

  1. Can we find outliers using k-means and is this a good approach?

  2. Is there any clustering algorithm which does not accept any input from the user?

  3. Can we use support vector machine or any other supervised learning algorithm for outlier detection?

  4. What are the pro开发者_如何学运维s and cons of each approach?


I will limit myself to what I think is essential to give some clues about all of your questions, because this is the topic of a lot of textbooks and they might probably be better addressed in separate questions.

  1. I wouldn't use k-means for spotting outliers in a multivariate dataset, for the simple reason that the k-means algorithm is not built for that purpose: You will always end up with a solution that minimizes the total within-cluster sum of squares (and hence maximizes the between-cluster SS because the total variance is fixed), and the outlier(s) will not necessarily define their own cluster. Consider the following example in R:

    set.seed(123)
    sim.xy <- function(n, mean, sd) cbind(rnorm(n, mean[1], sd[1]),
                                         rnorm(n, mean[2],sd[2]))
    # generate three clouds of points, well separated in the 2D plane
    xy <- rbind(sim.xy(100, c(0,0), c(.2,.2)),
                sim.xy(100, c(2.5,0), c(.4,.2)),
                sim.xy(100, c(1.25,.5), c(.3,.2)))
    xy[1,] <- c(0,2)     # convert 1st obs. to an outlying value
    km3 <- kmeans(xy, 3) # ask for three clusters
    km4 <- kmeans(xy, 4) # ask for four clusters
    

As can be seen in the next figure, the outlying value is never recovered as such: It will always belong to one of the other clusters.

Outlier detection in data mining [closed]

One possibility, however, would be to use a two-stage approach where one's removing extremal points (here defined as vector far away from their cluster centroids) in an iterative manner, as described in the following paper: Improving K-Means by Outlier Removal (Hautamäki, et al.).

This bears some resemblance with what is done in genetic studies to detect and remove individuals which exhibit genotyping error, or individuals that are siblings/twins (or when we want to identify population substructure), while we only want to keep unrelated individuals; in this case, we use multidimensional scaling (which is equivalent to PCA, up to a constant for the first two axes) and remove observations above or below 6 SD on any one of say the top 10 or 20 axes (see for example, Population Structure and Eigenanalysis, Patterson et al., PLoS Genetics 2006 2(12)).

A common alternative is to use ordered robust mahalanobis distances that can be plotted (in a QQ plot) against the expected quantiles of a Chi-squared distribution, as discussed in the following paper:

R.G. Garrett (1989). The chi-square plot: a tools for multivariate outlier recognition. Journal of Geochemical Exploration 32(1/3): 319-341.

(It is available in the mvoutlier R package.)

  1. It depends on what you call user input. I interpret your question as whether some algorithm can process automatically a distance matrix or raw data and stop on an optimal number of clusters. If this is the case, and for any distance-based partitioning algorithm, then you can use any of the available validity indices for cluster analysis; a good overview is given in

Handl, J., Knowles, J., and Kell, D.B. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics 21(15): 3201-3212.

that I discussed on Cross Validated. You can for instance run several instances of the algorithm on different random samples (using bootstrap) of the data, for a range of cluster numbers (say, k=1 to 20) and select k according to the optimized criteria taht was considered (average silhouette width, cophenetic correlation, etc.); it can be fully automated, no need for user input.

There exist other forms of clustering, based on density (clusters are seen as regions where objects are unusually common) or distribution (clusters are sets of objects that follow a given probability distribution). Model-based clustering, as it is implemented in Mclust, for example, allows to identify clusters in a multivariate dataset by spanning a range of shape for the variance-covariance matrix for a varying number of clusters and to choose the best model according to the BIC criterion.

  1. This is a hot topic in classification, and some studies focused on SVM to detect outliers especially when they are misclassified. A simple Google query will return a lot of hits, e.g. Support Vector Machine for Outlier Detection in Breast Cancer Survivability Prediction by Thongkam et al. (Lecture Notes in Computer Science 2008 4977/2008 99-109; this article includes comparison to ensemble methods). The very basic idea is to use a one-class SVM to capture the main structure of the data by fitting a multivariate (e.g., gaussian) distribution to it; objects that on or just outside the boundary might be regarded as potential outliers. (In a certain sense, density-based clustering would perform equally well as defining what an outlier really is is more straightforward given an expected distribution.)

Other approaches for unsupervised, semi-supervised, or supervised learning are readily found on Google, e.g.

  • Hodge, V.J. and Austin, J. A Survey of Outlier Detection Methodologies.
  • Vinueza, A. and Grudic, G.Z. Unsupervised Outlier Detection and Semi-Supervised Learning.
  • Escalante, H.J. A Comparison of Outlier Detection Algorithms for Machine Learning.

A related topic is anomaly detection, about which you will find a lot of papers.

  1. That really deserves a new (and probably more focused) question :-)


1) Can we find outliers using k-means, is it a good approach?

Cluster-based approaches are optimal to find clusters, and can be used to detect outliers as by-products. In the clustering processes, outliers can affect the locations of the cluster centers, even aggregating as a micro-cluster. These characteristics make the cluster-based approaches infeasible to complicated databases.

2) Is there any clustering algorithm which does not accept any input from the user?

Maybe you can achieve some valuable knowledge on this topic: Dirichlet Process Clustering

Dirichlet-based clustering algorithm can adaptively determine the number of clusters according to the distribution of observation data.

3) Can we use support vector machine or any other supervised learning algorithm for outlier detection?

Any Supervised learning algorithm needs enough labeled training data to construct classifiers. However, a balanced training dataset is not always available for real world problem, such as intrusion detection, medical diagnostics. According to the definition of Hawkins Outlier("Identification of Outliers". Chapman and Hall, London, 1980), the number of normal data is much larger than that of outliers. Most supervised learning algorithms can't achieve an efficient classifier on the above unbalanced dataset.

4) What is the pros and cons of each approach?

Over the past several decades, the research on outlier detection varies from the global computation to the local analysis, and the descriptions of outliers vary from the binary interpretations to probabilistic representations. According to hypotheses of outlier detection models, outlier detection algorithms can be divided into four kinds: Statistic-based algorithms, Cluster-based algorithms, Nearest Neighborhood based algorithms, and Classifier-based algorithms. There are several valuable surveys on outlier detection:

  • Hodge, V. and Austin, J. "A survey of outlier detection methodologies", Journal of Artificial Intelligence Review, 2004.

  • Chandola, V. and Banerjee, A. and Kumar, V. "Outlier detection: A survey", ACM Computing Surveys, 2007.


  1. k-means is rather sensitive to noise in the data set. It works best when you remove the outliers beforehand.

  2. No. Any cluster analysis algorithm that claims to be parameter-free usually is heavily restricted, and often has hidden parameters - a common parameter is the distance function, for example. Any flexible cluster analysis algorithm will at least accept a custom distance function.

  3. one-class classifiers are a popular machine-learning approach to outlier detection. However, supervised approaches aren't always appropriate for detecting _previously_unseen_ objects. Plus, they can overfit when the data already contains outliers.

  4. Every approach has its pros and cons, that is why they exist. In a real setting, you will have to try most of them to see what works for your data and setting. It's why outlier detection is called knowledge discovery - you have to explore if you want to discover something new ...


You may want to have a look at the ELKI data mining framework. It is supposedly the largest collection of outlier detection data mining algorithms. It's open source software, implemented in Java, and includes some 20+ outlier detection algorithms. See the list of available algorithms.

Note that most of these algorithms are not based on clustering. Many clustering algorithms (in particular k-means) will try to cluster instances "no matter what". Only few clustering algorithms (e.g. DBSCAN) actually consider the case that maybe not all instance belong into clusters! So for some algorithms, outliers will actually prevent a good clustering!

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜