Paragon Corpoation PostGIS Spatial Database Engine OSGeo.org The Open Source Geospatial Foundation UMN Mapserver Boston Geographic Information Systems       PostGreSQL Object Relational Database Management System
Home   About Boston GIS   Consulting Services  Boston GIS Blog  Postgres OnLine Journal  Planet PostGIS  PostGIS Funding

Purpose of BostonGIS

BostonGIS is a testbed for GIS and Web Mapping solutions utilizing open source, freely available and/or open gis technologies. We will be using mostly Boston, Massachusetts data to provide mapping and spatial database examples.

If you have some thoughts or comments on what you would like to see covered on this site, drop us a line on our Feed Back page.


GIS Tutorials on Opensource and OpenGIS technologies Tutorials
GIS Article comments Article and Tutorial Comments
Boston GIS BLog Rss FeedBoston GIS blog

PDF HTML All BostonGIS tutorials packaged together in an E-Book.


Tutorial and Tip Sites
Desktop GIS
External Data
GIS Events and Groups
GIS SDKs and Frameworks
External Resources
Glossary
GIS Blogs Around Boston
External GIS Blogs
External Papers Articles
GIS Quick Guides and References
OpenStreetMap and OpenLayers Tutorials
PostGIS, pgRouting, and PostgreSQL Tutorials
Part 1: Getting Started With PostGIS: An almost Idiot's Guide more ...
pgRouting: Loading OpenStreetMap with Osm2Po and route querying more ...
Part 1: Getting Started With PostGIS: An almost Idiot's Guide (PostGIS 2.0) more ...
OSCON 2009: Tips and Tricks for Writing PostGIS Spatial Queries more ...
PGCon2009: PostGIS 1.4, PostgreSQL 8.4 Spatial Analysis Queries, Building Geometries, Open Jump more ...
PLR Part 3: PL/R and Geospatial Data Abstraction Library (GDAL) RGDAL more ...
PostGIS Nearest Neighbor: A Generic Solution - Much Faster than Previous Solution

Printer Friendly

A generic solution to PostGIS nearest neighbor

After some heavy brainstorming, I have come up with a faster and more generic solution to calculating nearest neighbors than my previous solutions. For the gory details on how I arrived at this solution - please check out the following blog entries Boston GIS Blog entries on Nearest Neighbor Search.

In the sections that follow I will briefly describe the key parts of the solution, the strengths of the solution, the caveats of the solution, and some general use cases.

The parts of the solution

The solution is composed of a new PostgreSQL type, and 2 PostgreSQL functions. It involves a sequence of expand boxes that fan out from the bounding box of a geometry to as far out as you want. Below are the descriptions of each part.

  • pgis_nn - a new type that defines a neighbor solution - for a given near neighbor solution it gives you, the primary key of the neighbor and the distance it is away from the reference object
  • expandoverlap_metric - this is a helper function that returns the first fanned out box that an object is found in relative to the reference geometry. It returns an integer where the integer varies from 0 (in bounding box) to maxslices where maxslices is how many slices you want to dice up your larger expand box.
  • _pgis_fn_nn - this is the workhorse function that should never be directly called - it is a pgsql function that returns for each geometry k near neighbors
  • pgis_fn_nn(geom1 geometry, distguess double precision, numnn integer, maxslices integer, lookupset varchar(150), swhere varchar(5000), sgid2field varchar(100), sgeom2field varchar(100))) - this is our interface function written in language sql to fool the planner into thinking we are using an sql function. This is to get around certain anomalies of PostgreSQL that prevent pgsql set returning functions from being used if they reference fields in the from part of an SQL statement. Description of the inputs is as follows.
    1. geom1 - this is the reference geometry.
    2. distguess - this is the furthest distance you want to branch out before you want the query to give up. It is measured in units of the spatial ref system of your geometries.
    3. numnn - number of nearest neighbors to return.
    4. maxslices - this is the max number of slices you want to slice up your whole expand box. The more slices the thinner the slices, but the more iterations. I'm sure there is some statistically function one can run against ones dataset to figure out the optimal number of slices and distance guess, but I haven't quite figured out what that would be.
    5. lookupset - this is usually the name of the table you want to search for nearest neighbors. In theory you could throw in a fairly complicated subselect statement as long as you wrap it in parethesis. Something like '(SELECT g.*, b.name FROM sometable g INNER JOIN someothertable b ON g.id = b.id)' . Althought I haven't had a need to do that yet so haven't tested that feature.
    6. swhere - this is the where condition to apply to your dataset. If you want the whole dataset to be evaluated, then put 'true'. In theory you can have this correlated with your reference table by gluing parameters together, so can get pretty sophisticated.
    7. sgid2field - this is the name of the unique id field in the dataset you want to check for nearest neighbors.
    8. sgeom2field - this is the name of the geometry field in the dataset you want to check for nearest neighbors

The code can be downloaded here http://www.bostongis.com/downloads/pgis_nn.txt. I will also be posting this to the PostGIS Wiki once I clean it up a bit more.

Key features

My objective in creating this solution is to give PostGIS the same expressiveness that Oracle's SDO_NN and SDO_NN_DISTANCE functions provide Oracle. Unlike the Oracle solution, this solution is not implemented as an Operator, but instead as a set returning function.

Features are outlined below

  • Can handle more than one reference point at a time and return a single dataset. For example you can say give me the 5 nearest neighbors for each of my records in this query and get the result back as a single table.
  • You can limit your search to a particular distance range, slice it up as granularly as you want in order to achieve the best resulting speed. Note I implemented this function as linearly expanding boxes to utilize GIST indexes. In hindsight it might have been more efficient processor wise to have the boxes expand like a normal distribution or something like that.
  • You can apply where conditions to the sets you are searching instead of searching the whole table - e.g. only give me the 5 nearest hospitals that have more than 100 beds. You can also have the where conditions tied to fields of each record of interest. E.g. include all near neighbors, but don't include the reference point.
  • This solution works for all geometries - e.g. you can find point near neighbors to a polygon, polygon near neighbors to a polygon etc. Nearest distance to a line etc.

Caveats

  • This function will only work if you give it a primary key that is an integer. I figured an integer is what most people use for their primary keys and is the most efficient join wise.
  • This function currently doesn't handle ties. E.g. if you specify 5 neighbors and you have 6 where the last 2 are the same distance away, one will arbitrarily get left out. I think it will be fairly trivial to make it handle this case though.
  • If there are no neighbors for a particular reference that meet the stated criteria, that record would get left out. There are workarounds for this which I will try to provide in more advanced use cases.

Use Cases

All the data we will be using is in SRID: 26986. For a refresher course on loading data and setting up indexes, check Part 1: Getting Started With PostGIS: An almost Idiot's Guide

This returned 1000 records in 6 seconds on my dual windows 2003 server
Datasets used: buildings http://www.mass.gov/mgis/lidarbuildingfp2d.htm,
firestations: http://www.mass.gov/mgis/firestations.htm

Find nearest fire station for first 1000 buildings in Boston and sort in decreasing order of distance from the fire station such that the building furthest away from its nearest firestation is at the top.
SELECT g1.gid as gid_ref, f.name, g1.nn_dist/1609.344 as dist_miles, g1.nn_gid
FROM (SELECT b.gid, 
    (pgis_fn_nn(b.the_geom, 1000000, 1,1000,'fire_stations', 'true', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM buildings limit 1000) b) As g1, fire_stations f
    WHERE f.gid = g1.nn_gid 
    ORDER BY g1.nn_dist DESC



Data set used: Fire stations from above and Massachusetts census tiger roads -http://www.mass.gov/mgis/cen2000_tiger.htm
For each street segment in zip 02109 list the 2 nearest fire stations and the distance in meters from each fire station. Order results by street name, street id and distance from nearest fire station
Total query runtime: 687 ms. 296 rows retrieved.

SELECT g1.gid as gid_ref, g1.street, g1.fraddl, g1.toaddl, f.name, g1.nn_dist as dist_meters, g1.nn_gid
FROM (SELECT b.*,
    (pgis_fn_nn(b.the_geom, 1000000, 2,1000,'fire_stations', 'true', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM census2000tiger_arc where zipl = '02109') b) As g1, fire_stations f
    WHERE f.gid = g1.nn_gid 
    ORDER BY g1.street, g1.fraddl, g1.gid, g1.nn_dist 



For each street segment in zip 02109 list the 2 nearest buildings with building footprint > 5000 sq meters and specify how far away they are and the areas of each building.

--Total query runtime: 2656 ms. 296 rows retrieved.

SELECT g1.gid as gid_ref, g1.street, g1.fraddl, g1.toaddl, area(b.the_geom) as bldg_area, g1.nn_dist as dist_meters, g1.nn_gid
FROM (SELECT s.*,
    (pgis_fn_nn(s.the_geom, 1000000, 2,1000,'buildings', 'area(the_geom) > 5000', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM census2000tiger_arc where zipl = '02109') s) As g1, buildings b
    WHERE b.gid = g1.nn_gid 
    ORDER BY g1.street, g1.fraddl, g1.gid, g1.nn_dist 


More Advanced Queries

Example of two simultaneous neighbor queries in one query

Layers used
http://www.mass.gov/mgis/schools.htm - schools
http://www.mass.gov/mgis/hospitals.htm - hospitals

Now for this, I think we can safely assume that no street we care about should be further than 30,000 meters from its nearest school and hospital so we will limit our search to 30,000 meters and dice up our box into 10 expand boxes - so boxes expand at a rate of 3000 meters out per expansion.

Give me the nearest elemenary school and the nearest hospital emergency center for street segments in select zipcodes and how far away they are from these street segments.
--Total query runtime: 5516 ms.(~5.5 secs). 291 rows retrieved.
SELECT emer.gid as gid_ref, emer.street, emer.fraddl, emer.toaddl, hospitals.name As emer_name, emer.nn_dist as emer_dist_meters, 
        schools.name as sch_name, sch.nn_dist as sch_dist_meters
FROM (SELECT b.gid, b.street,b.fraddl, b.toaddl,
    (pgis_fn_nn(b.the_geom, 30000, 1,10,'hospitals', 'er_status=''Y''', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM census2000tiger_arc where zipl IN('02109', '02110')) As b) As emer INNER JOIN
    (SELECT b.gid,
    (pgis_fn_nn(b.the_geom, 30000, 1,10,'schools', 'true', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM census2000tiger_arc where zipl IN('02109', '02110')) As b) As sch ON emer.gid = sch.gid
    INNER JOIN hospitals ON hospitals.gid = emer.nn_gid 
    INNER JOIN schools ON schools.gid = sch.nn_gid
    ORDER BY emer.street, emer.fraddl, emer.gid

Example limit distance exclusion queries

Provide a list of all schools in Massachusetts Boston extent whose closest hospital emergency room is further than 3000 meters away

Using our new neighbor function exclusion query
--Total query runtime: 2969 ms. 75 rows retrieved.
SELECT sch.gid as gid_ref, sch.name as sch_name
FROM schools sch LEFT JOIN (SELECT b.gid, 
    (pgis_fn_nn(b.the_geom, 3000, 1,2,'hospitals', 'er_status=''Y''', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM schools WHERE schools.the_geom && GeomFromText('POLYGON((225446.34375 886435.5625,225446.34375 905270.8125,242278.640625 905270.8125,242278.640625 886435.5625,225446.34375 886435.5625))',26986)) b) As g1 ON sch.gid = g1.gid
    WHERE g1.gid IS NULL AND sch.the_geom && GeomFromText('POLYGON((225446.34375 886435.5625,225446.34375 905270.8125,242278.640625 905270.8125,242278.640625 886435.5625,225446.34375 886435.5625))',26986)
ORDER BY sch.name
The old way doing a simple expand distance exclusion query. SOMETIMES THE OLD WAYS ARE JUST BETTER
-- Total query runtime: 63 ms. 75 rows retrieved.
SELECT sch.gid as gid_ref, sch.name as sch_name
FROM schools sch LEFT JOIN 
(SELECT b.gid FROM schools b , hospitals h 
		WHERE b.the_geom && GeomFromText('POLYGON((225446.34375 886435.5625,225446.34375 905270.8125,242278.640625 905270.8125,242278.640625 886435.5625,225446.34375 886435.5625))',26986) AND h.er_status = 'Y'
				AND expand(b.the_geom, 3000) && h.the_geom AND distance(b.the_geom, h.the_geom) <= 3000) g1
ON sch.gid = g1.gid
WHERE g1.gid IS NULL AND sch.the_geom && GeomFromText('POLYGON((225446.34375 886435.5625,225446.34375 905270.8125,242278.640625 905270.8125,242278.640625 886435.5625,225446.34375 886435.5625))',26986)
ORDER BY sch.name

Caching data by storing closest neighbor in a table

For this exercise, we will create 2 new fields in our schools table called nearest_hospital and nearest_hospital_gid and update the fields with the closest hospital to that school. We will set our maximum distance search to 100,000 meters carved out into 10 10,000 meter expansions.

-- Query returned successfully: 2613 rows affected, 20625 ms (~1/3 minute) execution time.
ALTER TABLE schools ADD nearest_hospital character varying(255);
ALTER TABLE schools ADD nearest_hospital_gid integer;

UPDATE schools SET nearest_hospital = h.name , nearest_hospital_gid = h.gid
FROM (SELECT s.gid As sch_gid,
    (pgis_fn_nn(s.the_geom, 100000, 1,10,'hospitals', 'er_status=''Y''', 'gid', 'the_geom')).*  
    FROM schools As s) As nn_emer INNER JOIN hospitals h ON nn_emer.nn_gid = h.gid
WHERE schools.gid = nn_emer.sch_gid


Later on we realize we need an alternate hospital in case the first hospital is filled or because of traffic situation we can't get there, so we create a new set of fields to hold the second closest hospital. This particular exercise demonstrates where clause dependent on the additional attribute data of the reference record.


-- Query returned successfully: 2613 rows affected, 23032 ms (~1/3 minute) execution time.
ALTER TABLE schools ADD nearest_hospital_2 character varying(255);
ALTER TABLE schools ADD nearest_hospital_gid_2 integer;

UPDATE schools SET nearest_hospital_2 = h.name , nearest_hospital_gid_2 = h.gid FROM (SELECT s.gid As sch_gid, (pgis_fn_nn(s.the_geom, 100000, 1,10,'hospitals', 'er_status=''Y'' AND gid <> ' || CAST(s.nearest_hospital_gid As varchar(20)) , 'gid', 'the_geom')).* FROM schools As s) As nn_emer INNER JOIN hospitals h ON nn_emer.nn_gid = h.gid WHERE schools.gid = nn_emer.sch_gid AND nearest_hospital_gid IS NOT NULL




Post Comments About PostGIS Nearest Neighbor: A Generic Solution - Much Faster than Previous Solution
Solving the Nearest Neighbor Problem in PostGIS more ...
PLR Part 2: PL/R and PostGIS more ...
PLR Part 1: Up and Running with PL/R (PLR) in PostgreSQL: An almost Idiot's Guide more ...
Part 3: PostGIS Loading Data from Non-Spatial Sources more ...
Part 2: Introduction to Spatial Queries and SFSQL with PostGIS more ...
Miscellaneous Tutorials/Cheatsheets/Examples
SpatiaLite Tutorials
Boston External Map Examples
SQL Server 2008 Tutorials
UMN Mapserver Tutorials
General Commentary
Boston GIS      Copyright 2024      Paragon Corporation