Translation of clustering problem to graph theory language

2022-12-27 10:33 问答作者：

I have a rectangular p开发者_JAVA技巧lanar grid, with each cell assigned some integer weight. I am looking for an algorithm to identify clusters of 3 to 6 adjacent cells with higher-than-average weight. These blobs should have approximately circular shape.

For my case the average weight of the cells not containing a cluster is around 6, and that for cells containing a cluster is around 6+4, i.e. there is a "background weight" somewhere around 6. The weights fluctuate with a Poisson statistic.

For small background greedy or seeded algorithms perform pretty well, but this breaks down if my cluster cells have weights close to fluctuations in the background i.e. they will tend to find a cluster even though there is nothing. Also, I cannot do a brute-force search looping through all possible setups because my grid is large (something like 1000x1000) and I plan to do this very often (10^9 times). I have the impression there might exist ways to tackle this in graph theory. I heard of vertex-covers and cliques, but am not sure how to best translate my problem into their language. I know that graph theory might have issues with the statistical nature of the input, but I would be interest to see what algorithms from there could find, even if they cannot identify every cluster.

Here an example clipping: the framed region has on average 10 entries per cell, all other cells have on average 6. Of course the grid extends further.

| 8|  8|  2|  8|  2|  3| 
| 6|  4|  3|  6|  4|  4| 
        ===========
| 8|  3||13|  7| 11|| 7|
|10|  4||10| 12|  3|| 2|
| 5|  6||11|  6|  8||12|
        ===========
| 9|  4|  0|  2|  8|  7|

For graph theory solutions there are a couple of sentences in wikipedia, but you are probably best posting on MathOverflow. This question might also be useful.

The traditional method (and probably best considering its ubiquity) in computing to solve these is raster analysis - well known in the world of GIS and Remote Sensing, so there are a number of tools that provide implementations. Keywords to use to find the one most suited to your problem would be raster, nearest neighbor, resampling, and clustering. The GDAL library is often the basis for other tools.

E.g. http://clusterville.org/spatialtools/index.html

You could try checking out the GDAL library and source code to see if you can use it in your situation, or to see how it is implemented.

For checking for circular shapes you could convert remaing values to polygons and check the resultant feature

http://www.gdal.org/gdal_polygonize.html

I'm not sure I see a graph theory analogy, but you can speed things up by pre-computing an area integral. This feels like a multi-scale thing.

A[i,j] = Sum[Sum[p[u,v], {u, 0, i}, {v, 0, j}]];

Then the average brightness of a rectangular (a,b), (c,d) region is

(A[c,d] - (A[c,b] + A[a,d]) + A[a,b])/((c-a)(d-b))

Overflow is probably not your friend if you have big numbers in your cells.

Use the union-find algorithm for clustering? It's very fast.

I guess the graph would result from considering each pair of neighboring high-valued cells connected. Use the union-find algorithm to find all clusters, and accept all those above a certain size, perhaps with shape constraints too (eg based on average squared distance from cluster center vs cluster size). It's a trivial variation on the union-find algorithm to collect statistics that you would need for this as you go (count, sum of x, sum of x^2, sum of y, sum of y^2).

If you are just looking for a way to translate your problem into a graph problem heres what you could do.

From each point look at all your neighbors (this could be the 8 adjacent squares or the 4 adjacent depending on what you want). Look for the neighbor with the max value, if it is larger than you then connect yourself to this square. if it is smaller then do nothing.

after this you will have a forest or possibly a tree (though i imagine that that is unlikely). This is one possible translation of your matrix to a graph. Im not sure if it is the most useful translation.

继续阅读：cluster-analysis graph

Translation of clustering problem to graph theory language

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？