Efficient algorithm for calculating areas on geographical map with the highest density of spots
Let's say I have a geographical map, where points are represented by latitude/longitude. I have a number of points on this map, and points can be added/deleted/moved at any time.
What I need is to get the "hottest spots" - the areas containing the highest number of points divided by their area - in other words, the areas with the highest density of points.
I need an efficient data structure and also an algorithm that re-calculates the hottest spots on any change. The computational complexity and memory complexity must be minimal, because the number of points could get very high.
I wish to know and maintain a list of the hottest spots in descending order - the most crowded area first, then the less crowded ones. It's OK for the list to have a limited size - for example, the 100 hottest spots.
Of course, to prevent 100% density on one isolated point, there is a minimal area (defined as a constant).
The definition of "area" here is any perceivable area on the map that contains points. It could be the whole map, but the algorithm shouldn't see that as a hot spot of course =)
Thanks ahead! If it needs any clarification please say so...
What you are doing is statistically known as '2-dimensional density estimation' (so you know where to look).
One common method is 'kernel smoothing'. Imagine a disc sitting on each of your data points. The kernel smoothing over the area is the number of discs overlapping at each point. That's a kernel smoothing using a uniform kernel of fixed radius.
You don't have to use uniform kernels, nor do they have to be the same size at all points - this is now "adaptive bandwidth kernel smoothing".
That gives you a grid which you can update pretty easily, especially if you have a finite kernel (you can use a Gaussian (aka Normal) kernel which is infinite, but would be clipped to your study area). Every time you add or take away a point, you add or take away its kernel's contribution to the density.
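As a minimal sketch of that incremental update, here is a grid with a uniform disc kernel in pure Python. The grid size and kernel radius are illustrative assumptions, not values from the answer; adding, removing, or moving a point only touches the cells under its disc:

```python
GRID_W, GRID_H = 100, 100   # grid resolution (illustrative assumption)
RADIUS = 3                  # uniform kernel radius in cells (illustrative)

# density[y][x] = number of kernel discs covering that grid cell
density = [[0] * GRID_W for _ in range(GRID_H)]

def update(cx, cy, sign=1):
    """Add (sign=+1) or remove (sign=-1) one point's disc-shaped
    kernel contribution, centred on grid cell (cx, cy)."""
    for dy in range(-RADIUS, RADIUS + 1):
        for dx in range(-RADIUS, RADIUS + 1):
            x, y = cx + dx, cy + dy
            inside_disc = dx * dx + dy * dy <= RADIUS * RADIUS
            if inside_disc and 0 <= x < GRID_W and 0 <= y < GRID_H:
                density[y][x] += sign

update(50, 50)       # point added
update(52, 50)       # nearby point: the discs overlap between them
update(50, 50, -1)   # point removed (or moved): subtract its contribution
```

Moving a point is just a remove at the old cell followed by an add at the new one, so each change costs O(kernel area), independent of the total number of points.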
So that's half the problem - you now have a grid of densities. Then you can use clustering algorithms to partition it and find separate peaks according to whatever criteria a) defines a 'peak' and b) defines it as separate from an adjacent peak. Someone else has already suggested clustering algorithms. I have done this (ten years ago) using clustering functions in "R", the statistical package. Speed isn't its strong point though...
You might want to take this over to http://stats.stackexchange.com
This is too long for a comment, but here's an idea of how I'd "play" with this and see if I can come up with something interesting. One thing is sure, though: the following can be made very fast.
Can this be easily translated into a discrete problem? You'd first "align" all your coordinates to a big grid (you define how big each square is, and every point maps to one such square). You'd then end up with something like this:
0000000000000000000000000000
00XX000000000000000000X00000
00X00000000000000X0000000000
0000000000000000000000000000
0000000000000000000000000000
000000X00000000X000000000000
0000000000000000000000000000
000000000000X000000000X00000
00000000000000000000000X0000
0000000000000000000000000000
Then you'd compute, for each entry, its number of adjacent neighbours:
0000000000000000000000000000
0033000000000000000001000000
0030000000000000010000000000
0000000000000000000000000000
0000000000000000000000000000
0000001000001001000000000000
0000000000000000000000000000
0000000000001010000000200000
0000000000000000000000020000
0000000000000100000000000000
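The two steps above - snapping points to grid cells and scoring each occupied cell by its occupied 8-neighbours - can be sketched like this. The cell size and the sample points are made-up assumptions for illustration:

```python
CELL_SIZE = 1.0   # side length of one grid square (illustrative assumption)

def to_cell(lon, lat):
    """Snap a coordinate to the grid cell that contains it."""
    return int(lon // CELL_SIZE), int(lat // CELL_SIZE)

# Made-up sample points: three clustered together, one isolated.
points = [(2.5, 1.2), (3.1, 1.7), (2.2, 2.4), (22.3, 1.5)]
occupied = {to_cell(lon, lat) for lon, lat in points}

def neighbour_count(cell):
    """Number of the 8 adjacent cells that also contain a point."""
    cx, cy = cell
    return sum((cx + dx, cy + dy) in occupied
               for dx in (-1, 0, 1) for dy in (-1, 0, 1)
               if (dx, dy) != (0, 0))

scores = {cell: neighbour_count(cell) for cell in occupied}
```

Cells in the cluster score higher than the isolated one, which is exactly the "number of adjacent neighbours" grid drawn above.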
Then you can augment the size of your squares, say by a factor of two, and hence divide your map:
(the map isn't exact; it's just to give an idea of what I'm thinking about)
09001001000000
00000000000000
00100001100000
00000110002000
00000002000000
00000100000000
Then you re-count the adjacent neighbours, etc.
To me this would allow finding hotspots depending on your "resolution": you'd simply look for the biggest numbers, and those would be your "hotspots".
Because in this case:
0000X00000
0000XX0000
0000000000
0000000000
0Y0Y000000
0000000000
0Y0Y000000
Either 'X' could be the hottest spot (three interesting points close to each other) or 'Y' (four points close to each other, but not as close as the 'X' points).
Because you said you needed speed, I'd turn this into a discrete problem and represent my grids as arrays. I'd then allow for a variable "area" size.
Looks like a fun problem to work on :)
The type of algorithm used will depend on the distribution of points. You could very well run into the much harder problem of partitioning the points into "areas." Since you seem to need something to get you started, I recommend reading up on convex hulls, calculating the area of an arbitrary 2D polygon, and determining whether a point is in a polygon.
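For reference, the two polygon primitives mentioned here are short to write down: the shoelace formula for the area of an arbitrary simple 2D polygon, and a ray-casting point-in-polygon test. This is a standalone sketch, not specific to any library; vertices must be listed in order around the boundary:

```python
def polygon_area(poly):
    """Area of a simple polygon via the shoelace formula."""
    s = 0.0
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def point_in_polygon(px, py, poly):
    """Ray casting: count edge crossings of a horizontal ray from (px, py)."""
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > py) != (y2 > py):
            # x-coordinate where this edge crosses the ray's height
            if px < x1 + (py - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
polygon_area(square)            # 16.0
point_in_polygon(2, 2, square)  # True
```

For real latitude/longitude data you'd want to project the coordinates first, since these formulas assume a flat plane.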