
R points in polygons

I have a million points and a large shapefile (8GB) which is too big to load into memory in R on my system. The shapefile is single-layer, so a given x, y will hit at most one polygon - as long as it's not exactly on a boundary! Each polygon is labelled with a severity, e.g. 1, 2, 3. I'm using R on a 64-bit Ubuntu machine with 12GB RAM.

What's the simplest way to "tag" the data frame with the polygon severity, so that I end up with a data.frame with an extra column, i.e. x, y, severity?


Just because all you have is a hammer doesn't mean every problem is a nail.

Load your data into PostGIS, build a spatial index for your polygons, and do a single SQL spatial overlay. Export results back to R.
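
If you want to drive the whole round trip from R, a rough sketch using RPostgreSQL might look like the following - the database, table and column names (gisdb, pts, polys, severity, geom) and the SRID are purely illustrative, and it assumes the shapefile and points have already been loaded (e.g. with shp2pgsql and COPY):

library(RPostgreSQL)

con <- dbConnect(PostgreSQL(), dbname="gisdb")   # hypothetical database name

# Spatial index on the polygons, then one spatial join to tag every point.
dbSendQuery(con, "CREATE INDEX polys_geom_idx ON polys USING GIST (geom);")

tagged <- dbGetQuery(con, "
  SELECT p.x, p.y, s.severity
  FROM   pts AS p
  LEFT JOIN polys AS s
         ON ST_Contains(s.geom, ST_SetSRID(ST_MakePoint(p.x, p.y), 4326));
")

dbDisconnect(con)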

By the way, saying the shapefile is 8GB is not a very useful piece of information. A shapefile is made of at least three files: the .shp, which holds the geometry, the .dbf, which holds the attribute database, and the .shx, which connects the two. If your .dbf is 8GB then you can easily read the shapes themselves by swapping in a much smaller .dbf. Even if the .shp is 8GB it might only contain three polygons, in which case they might be easy to simplify. How many polygons have you got, and how big is the .shp part of the shapefile?


I think you should preprocess the data and create a structure that lists the possible polygons for each rectangular cell of a grid. That way you can cut down the number of polygons you have to check each point against, and the extra structure will fit into memory since it only holds pointers to the polygons.

Here's an image to illustrate:

[Figure: a grid cell (orange) overlaid on the polygons; the filled polygons are the ones that intersect that cell.]

You want to check which polygon the yellow point is in. Normally you would check against all the polygons, but with the grid optimisation (the orange lines - I didn't draw the whole grid, just one of its cells) you only have to check the filled polygons, since they are the only ones fully or partially inside that grid cell.

A similar approach would be to keep not the full polygon data in memory but just each polygon's bounding box, which only needs 2 X/Y pairs per polygon instead of one pair per vertex (plus a pointer to the actual polygon data), but this doesn't save as much space as the first suggestion.
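
A rough sketch of that bounding-box pre-filter in R, assuming each polygon is held as a plain two-column matrix of vertices in a list called poly_list, with a parallel vector severity of labels (both names are made up for illustration):

library(sp)  # for point.in.polygon()

# Keep only 2 X/Y pairs per polygon (the min and max corners).
bboxes <- t(sapply(poly_list, function(m)
  c(xmin = min(m[, 1]), xmax = max(m[, 1]),
    ymin = min(m[, 2]), ymax = max(m[, 2]))))

severity_of_point <- function(x, y) {
  # Cheap rectangle test first: candidate polygons whose bbox contains the point.
  cand <- which(bboxes[, "xmin"] <= x & x <= bboxes[, "xmax"] &
                bboxes[, "ymin"] <= y & y <= bboxes[, "ymax"])
  # Exact point-in-polygon test only on the (hopefully few) candidates.
  for (i in cand) {
    m <- poly_list[[i]]
    if (point.in.polygon(x, y, m[, 1], m[, 2]) > 0) return(severity[i])
  }
  NA  # the point fell in no polygon
}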


I was interested to see this and wondered if you had made any progress on this front. Since you asked the question I imagine your computer hardware and the software you can use to do this relatively simple operation have improved somewhat to the point where the solution (if still needed!) might be quite simple, although it may take a long time to process a million points. You might like to try something like:

#   Load relevant libraries
library(sp)
library(maptools)
library(spatstat)

#   Read shapefile. Hopefully you have a .prj file with your .shp file
#   otherwise you need to set the proj4string argument. Don't include 
#   the .shp extension in the filename. I also assume that this will
#   create a SpatialPolygonsDataFrame with the "Severity" attribute
#   attached (from your .dbf file).
myshapefile <- readShapePoly("myshapefile_without_extension", proj4string=CRS("+proj=longlat +datum=WGS84"))


#   Read your location data in. Here I assume your data has two columns X and Y giving locations.
#   Remember that your points need to be in the same projection as your shapefile. If they aren't,
#   you should look into using spTransform() on your shapefile first.
mylocs.df <- read.table("mypoints.csv", sep=",", header=TRUE)

#   Coerce X and Y coordinates to a spatial object and set projection to be the same as
#   your shapefile (WARNING: only do this if you know your points and shapefile are in
#   the same format).
mylocs.sp <- SpatialPoints(cbind(mylocs.df$X, mylocs.df$Y), proj4string=CRS(proj4string(myshapefile)))

#   Use over() to return a dataframe equal to nrows of your mylocs.df
#   with each row corresponding to a point with the attributes from the
#   polygon in which it fell.
severity.df <- over(mylocs.sp, myshapefile)
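
#   If you want the x, y, severity data.frame from the question you can then
#   just bind the result back on (this assumes the attribute column in your
#   .dbf really is called "Severity" - rename to match whatever yours is).
mylocs.df$severity <- severity.df$Severity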

Hopefully that framework can give you what you want. Whether or not you can do it with the computer/RAM you have available now is another matter!


I don't have a really good answer, but let me throw out an idea. Can you flip the problem around and, instead of asking which polygon each point falls in, ask "which points are in each polygon?" Maybe you could bust your shapefile into, for example, 2000 counties, then incrementally take each county and check each point to see if it is in that county. If a point is in a given county, you tag it and take it out of your search next time.

Along the same lines, you might break the shapefile into 4 regions. Then you can fit a single region plus all your points into memory. Then just iterate 4 times through the data.
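
A rough sketch of that tag-and-remove loop, staying with the sp/maptools approach from the answer above - it assumes the shapefile has already been split into files region_1 ... region_4, that the points file is "mypoints.csv" with columns X and Y, and that the attribute is called Severity (all made-up names), with points and polygons already in the same projection:

library(sp)
library(maptools)

pts <- read.csv("mypoints.csv")          # columns X and Y
pts$severity <- NA
todo <- rep(TRUE, nrow(pts))             # points not yet matched to a polygon

for (i in 1:4) {
  region <- readShapePoly(sprintf("region_%d", i))   # one chunk of the shapefile
  hit <- over(SpatialPoints(as.matrix(pts[todo, c("X", "Y")])), region)
  pts$severity[todo] <- hit$Severity
  todo[todo] <- is.na(hit$Severity)      # drop matched points from the next pass
  rm(region); gc()                       # free the chunk before loading the next one
}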

Another idea would be to use a GIS tool to lower the resolution (number of nodes and vectors) of the shapefile. That depends, of course, on how important accuracy is to your use case.
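
If you'd rather do that thinning from R than from a desktop GIS (and the polygons will load at all), rgeos::gSimplify() is one option - polys below stands for whatever SpatialPolygonsDataFrame you have read in, and the tolerance (in map units) is purely illustrative:

library(rgeos)

#   Douglas-Peucker style vertex thinning; tune tol to your accuracy needs.
simple_geom <- gSimplify(polys, tol=0.01, topologyPreserve=TRUE)

#   gSimplify() returns geometry only, so re-attach the attributes afterwards.
simple_polys <- SpatialPolygonsDataFrame(simple_geom, polys@data)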


I'd give the fastshp package a try. In my cursory tests it significantly beats other methods for reading shapefiles, and it has a specialized inside() function that may well suit your needs.

The code should be something like:

shp <- read.shp(YOUR_SHP_FILE, format="polygon")
inside(shp, x, y)

where x and y are coordinates.
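If inside() gives you, for each point, the index of the polygon it falls in (check ?inside for the exact return value), mapping that back onto the severity stored in the .dbf could look roughly like this - the column and object names are assumptions:

library(foreign)

dbf <- read.dbf("YOUR_SHP_FILE.dbf")   # the attributes live in the .dbf
idx <- inside(shp, x, y)               # assumed: index of containing polygon, NA if none
result <- data.frame(x=x, y=y, severity=dbf$Severity[idx])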

If that doesn't work I'd go for PostGIS solution mentioned by @Spacedman.


To answer my own question - and with thanks to everybody for their help - the final solution adopted was to use GDAL from Python, which was relatively easily adapted to both rasters and shapefiles. Some of the rasters ran to about 40GB and some of the shapefiles exceeded 8GB, so there was no way they'd fit in memory on any of the machines we had at the time (now I have access to a machine with 128GB of RAM - but I've moved on to new pastures!).

The Python/GDAL code was able to tag 27 million points in anywhere from 1 minute to 40 minutes depending on the sizes of the polygons in the shapefiles - if there were a lot of small polygons it was stunningly quick; if there were some massive (250k points) polygons it was stunningly slow! To compare, we used to run this on an Oracle Spatial database and it would take 24+ hours to tag the 27 million points, while rasterizing and tagging would take about an hour.

As Spacedman suggested, I did try PostGIS on my machine with an SSD, but the turnaround time was quite a bit slower than Python/GDAL, since the final solution didn't require the shapefiles to be loaded into PostGIS. So, to summarise, the quickest way to do this was using Python/GDAL:

  • copy the shape files and the x,y csv to ssd
  • modify the config file for the python script to say where the files were and which layer to tag against
  • run a couple of layers in parallel - as it was CPU-limited rather than I/O-limited