
R points in polygons

I have a million points and a large shapefile (8GB) which is too big to load into memory in R on my system. The shapefile is single-layer, so a given x, y will hit at most one polygon - as long as it's not exactly on a boundary! Each polygon is labelled with a severity, e.g. 1, 2, 3. I'm using R on a 64-bit Ubuntu machine with 12GB RAM.

What's the simplest way to "tag" the data frame with the polygon severity, so that I end up with a data.frame with an extra column, i.e. x, y, severity?


Just because all you have is a hammer doesn't mean every problem is a nail.

Load your data into PostGIS, build a spatial index for your polygons, and do a single SQL spatial overlay. Export results back to R.
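
If you want to drive the whole round trip from R, a rough sketch using RPostgreSQL might look like the following - the database, table and column names (gisdb, pts, polys, severity, geom) and the SRID are purely illustrative, and it assumes the shapefile and points have already been loaded (e.g. with shp2pgsql and COPY):

library(RPostgreSQL)

con <- dbConnect(PostgreSQL(), dbname="gisdb")   # hypothetical database name

# Spatial index on the polygons, then one spatial join to tag every point.
dbSendQuery(con, "CREATE INDEX polys_geom_idx ON polys USING GIST (geom);")

tagged <- dbGetQuery(con, "
  SELECT p.x, p.y, s.severity
  FROM   pts AS p
  LEFT JOIN polys AS s
         ON ST_Contains(s.geom, ST_SetSRID(ST_MakePoint(p.x, p.y), 4326));
")

dbDisconnect(con)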

By the way, saying the shapefile is 8GB is not a very useful piece of information. A shapefile is made of at least three files: the .shp, which holds the geometry, the .dbf, which holds the attribute database, and the .shx, which connects the two. If your .dbf is 8GB then you can easily read the shapes themselves by swapping in a much smaller .dbf. Even if the .shp is 8GB it might only contain three polygons, in which case they might be easy to simplify. How many polygons have you got, and how big is the .shp part of the shapefile?


I think you should preprocess the data and create a structure that lists the possible polygons for each rectangular cell of a grid. That way you can cut down the number of polygons you have to check each point against, and the extra structure will fit into memory since it only holds pointers to the polygons.

Here's an image to illustrate:

[Figure: a grid cell (orange) overlaid on the polygons; the filled polygons are the ones that intersect that cell.]

You want to check which polygon the yellow point is in. Normally you would check against all the polygons, but with the grid optimisation (the orange lines - I didn't draw the whole grid, just one of its cells) you only have to check the filled polygons, since they are the only ones fully or partially inside that grid cell.

A similar approach would be to keep not the full polygon data in memory but just each polygon's bounding box, which only needs 2 X/Y pairs per polygon instead of one pair per vertex (plus a pointer to the actual polygon data), but this doesn't save as much space as the first suggestion.
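
A rough sketch of that bounding-box pre-filter in R, assuming each polygon is held as a plain two-column matrix of vertices in a list called poly_list, with a parallel vector severity of labels (both names are made up for illustration):

library(sp)  # for point.in.polygon()

# Keep only 2 X/Y pairs per polygon (the min and max corners).
bboxes <- t(sapply(poly_list, function(m)
  c(xmin = min(m[, 1]), xmax = max(m[, 1]),
    ymin = min(m[, 2]), ymax = max(m[, 2]))))

severity_of_point <- function(x, y) {
  # Cheap rectangle test first: candidate polygons whose bbox contains the point.
  cand <- which(bboxes[, "xmin"] <= x & x <= bboxes[, "xmax"] &
                bboxes[, "ymin"] <= y & y <= bboxes[, "ymax"])
  # Exact point-in-polygon test only on the (hopefully few) candidates.
  for (i in cand) {
    m <- poly_list[[i]]
    if (point.in.polygon(x, y, m[, 1], m[, 2]) > 0) return(severity[i])
  }
  NA  # the point fell in no polygon
}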


I was interested to see this and wondered if you had made any progress on this front. Since you asked the question I imagine your computer hardware and the software you can use to do this relatively simple operation have improved somewhat to the point where the solution (if still needed!) might be quite simple, although it may take a long time to process a million points. You might like to try something like:

#   Load relevant libraries
library(sp)
library(maptools)
library(spatstat)

#   Read shapefile. Hopefully you have a .prj file with your .shp file
#   otherwise you need to set the proj4string argument. Don't include 
#   the .shp extension in the filename. I also assume that this will
#   create a SpatialPolygonsDataFrame with the "Severity" attribute
#   attached (from your .dbf file).
myshapefile <- readShapePoly("myshapefile_without_extension", proj4string=CRS("+proj=longlat +datum=WGS84"))


#   Read your location data in. Here I assume your data has two columns X and Y giving locations.
#   Remember that your points need to be in the same projection as your shapefile. If they aren't,
#   you should look into using spTransform() on your shapefile first.
mylocs.df <- read.table("mypoints.csv", sep=",", header=TRUE)

#   Coerce X and Y coordinates to a spatial object and set projection to be the same as
#   your shapefile (WARNING: only do this if you know your points and shapefile are in
#   the same format).
mylocs.sp <- SpatialPoints(cbind(mylocs.df$X, mylocs.df$Y), proj4string=CRS(proj4string(myshapefile)))

#   Use over() to return a dataframe equal to nrows of your mylocs.df
#   with each row corresponding to a point with the attributes from the
#   polygon in which it fell.
severity.df <- over(mylocs.sp, myshapefile)
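
#   If you want the x, y, severity data.frame from the question you can then
#   just bind the result back on (this assumes the attribute column in your
#   .dbf really is called "Severity" - rename to match whatever yours is).
mylocs.df$severity <- severity.df$Severity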

Hopefully that framework can give you what you want. Whether or not you can do it with the computer/RAM you have available now is another matter!


I don't have a really good answer, but let me throw out an idea. Can you flip the problem around and, instead of asking which polygon each point falls in, ask "which points are in each polygon?" Maybe you could bust your shapefile into, for example, 2000 counties, then incrementally take each county and check each point to see if it is in that county. If a point is in a given county, you tag it and take it out of your search next time.

Along the same lines, you might break the shapefile into 4 regions. Then you can fit a single region plus all your points into memory. Then just iterate 4 times through the data.
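
A rough sketch of that tag-and-remove loop, staying with the sp/maptools approach from the answer above - it assumes the shapefile has already been split into files region_1 ... region_4, that the points file is "mypoints.csv" with columns X and Y, and that the attribute is called Severity (all made-up names), with points and polygons already in the same projection:

library(sp)
library(maptools)

pts <- read.csv("mypoints.csv")          # columns X and Y
pts$severity <- NA
todo <- rep(TRUE, nrow(pts))             # points not yet matched to a polygon

for (i in 1:4) {
  region <- readShapePoly(sprintf("region_%d", i))   # one chunk of the shapefile
  hit <- over(SpatialPoints(as.matrix(pts[todo, c("X", "Y")])), region)
  pts$severity[todo] <- hit$Severity
  todo[todo] <- is.na(hit$Severity)      # drop matched points from the next pass
  rm(region); gc()                       # free the chunk before loading the next one
}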

Another idea would be to use a GIS tool to lower the resolution (number of nodes and vectors) of the shapefile. That depends, of course, on how important accuracy is to your use case.
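
If you'd rather do that thinning from R than from a desktop GIS (and the polygons will load at all), rgeos::gSimplify() is one option - polys below stands for whatever SpatialPolygonsDataFrame you have read in, and the tolerance (in map units) is purely illustrative:

library(rgeos)

#   Douglas-Peucker style vertex thinning; tune tol to your accuracy needs.
simple_geom <- gSimplify(polys, tol=0.01, topologyPreserve=TRUE)

#   gSimplify() returns geometry only, so re-attach the attributes afterwards.
simple_polys <- SpatialPolygonsDataFrame(simple_geom, polys@data)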


I'd give the fastshp package a try. In my cursory tests it significantly beats other methods for reading shapefiles, and it has a specialized inside() function that may well suit your needs.

The code should be something like:

shp <- read.shp(YOUR_SHP_FILE, format="polygon")
inside(shp, x, y)

where x and y are coordinates.
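If inside() gives you, for each point, the index of the polygon it falls in (check ?inside for the exact return value), mapping that back onto the severity stored in the .dbf could look roughly like this - the column and object names are assumptions:

library(foreign)

dbf <- read.dbf("YOUR_SHP_FILE.dbf")   # the attributes live in the .dbf
idx <- inside(shp, x, y)               # assumed: index of containing polygon, NA if none
result <- data.frame(x=x, y=y, severity=dbf$Severity[idx])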

If that doesn't work I'd go for PostGIS solution mentioned by @Spacedman.


To answer my own question - and with thanks to everybody for their help - the final solution adopted was to use GDAL from Python, which was relatively easily adapted to both rasters and shapefiles. Some of the rasters ran to about 40GB and some of the shapefiles exceeded 8GB, so there was no way they'd fit in memory on any of the machines we had at the time (now I have access to a machine with 128GB of RAM - but I've moved on to new pastures!).

The Python/GDAL code was able to tag 27 million points in anywhere from 1 minute to 40 minutes depending on the sizes of the polygons in the shapefiles - if there were a lot of small polygons it was stunningly quick; if there were some massive (250k points) polygons it was stunningly slow! To compare, we used to run this on an Oracle Spatial database and it would take 24+ hours to tag the 27 million points, while rasterizing and tagging would take about an hour.

As Spacedman suggested, I did try PostGIS on my machine with an SSD, but the turnaround time was quite a bit slower than Python/GDAL, since the final solution didn't require the shapefiles to be loaded into PostGIS. So, to summarise, the quickest way to do this was using Python/GDAL:

  • copy the shape files and the x,y csv to ssd
  • modify the config file for the python script to say where the files were and which layer to tag against
  • run a couple of layers in parallel - as it was CPU-limited rather than I/O-limited