Abstract: Here is some practical expeirience in cleansing the shapefile using GRASS, which is an open-source platform to do scientific analysis on spatial data.
ERIS shapefile is a popular format to store spatial vector data. Various geometrics (e.g. points, lines, polygons etc.) can be processed and exchanged in a compatibable way by different softwares. When I am doing a research to calculate the shortest paths between pairs of a set of points, one convenient method is converting the network topology stored in a shapefile into a graph data structure and performing spatial analysis with libraries like
graph in R.
However, the shapefiles per se may have internal pitfalls after we obtained them from public domains. One of the common pitfalls in a shapefile is about the ‘connectedness’ of line-shaped objects, including absent intersection of cross roads, dangling polylines and duplicated points etc. These pitfalls moslty lead to a large number of disconnected parts in a single road network and we can not perform the shortest-path algorithm directly on the topology. Thus we must conduct some connectivity correction on the raw shapefiles before our analyses in next stage.
Obtain shapefile of target area
The geometric vector of target area is our ingrediant to go on this tutorial. Bbbike.org is a great extracter for OpenStreetMap layers. You can select, extract and download any areas you want in any shapes by dragging the selective bounding box. For this post, I have prepared a copy of our university to continue following steps. There are eight layers including roads, waterways, railways, points, places, natural, landuse and buildings, which are the minimum data obtained from previous extractor.
On Mac OS X, you can install grass quickly with
brew. This is recommended because the official version 7.0 dmg installer complains about the CPU arch under python 2.7 on my machine.
1 2 $ brew search grass $ brew install grass-70
Create new mapset (CLI mode)
<grassdb>/<location>/<mapset> are to be understood to start from scratch:
- Default database: a default directory to store all GRASS data;
- GRASS location: this corresponds to your individual projects and you’d better keep separated projects in different locations;
- GRASS mapset: the actual working directory for yourself. Others’ will be kept as distinctive mapsets in case that a team is sharing the same
PERMANENTbase but introducing different modification.
Create a new mapset in which
sjtu is the location and
chenxm indicates the mapset that is used by me. The GRASS promt is a modified CLI with built-in functionalities. Other commonds can be found on the manual page.
1 2 3 $ grass70 -c roads.shp ~/workspace/GRASS/sjtu -text $ > g.mapset -c chenxm $ > g.mapset chenxm
Import all shapefile layers into created mapset:
1 $ > v.in.ogr input=../shape layer=roads --overwrite
Correct topology connectivity
We use the
v.clean toolkit to repair the topology pitfalls, in the sequence of
breakthe cross and add intersection if it’s absent
snapundershot and overshot pologons
rmdangleremove the dangling lines whose lenght is larger than a threshold
rmdupremove the duplidated points
The threshold unit is degree when we are using the default coordination projection.
1 2 $ > v.clean input=roads@chenxm output=roads_clean@chenxm type=point,line,area tool=break,snap,rmdangle,rmdupl thres=0.00,0.00003,0.00001,0.00 --overwrite $ > v.out.ogr input=roads_clean@chenxm output=roads_clean format=ESRI_Shapefile --overwrite
Check out the performance
The topology correction removes the number of self-connected parts mostly by breaking the intersection and snapping the pitfalls. This accidently introduces over-correction as for the absence of information of real road networks, e.g. great flyovers. Nevertheless, this does not impact the result when we are involving the road network as suplementary analysis.
As show in figure below, the left is the original shapefile with 256 self- connected parts, and the middle figure gives the result after our correction with previous configurations. Actually, the resolution of topology is largely impacted by the selection of toolkit parameters. For example, we exhibit the snapping tool with threshold
0.0005 (right figure) which gives a rough sketch of underlying network. As for the personal requirements, you should experiment with multiple sets of parameters to get the satisfied result.
1 2 3 4 5 6 #!/usr/bin/env R # file: check_conn.R # R script to check the connectivity of road network. library(shp2graph) rn<-readShapeLines("roads_clean/roads_clean.shp", proj4string=CRS(as.character(NA))) res<-nt.connect(rn) # the largest connected part
Run the script in terminal
1 $ R --no-save]( check_conn.R