A generic solution to PostGIS nearest neighbor

After some heavy brainstorming, I have come up with a faster and more generic solution to calculating nearest neighbors than my previous solutions. For the gory details on how I arrived at this solution - please check out the following blog entries Boston GIS Blog entries on Nearest Neighbor Search.

In the sections that follow I will briefly describe the key parts of the solution, the strengths of the solution, the caveats of the solution, and some general use cases.

The parts of the solution

The solution is composed of a new PostgreSQL type, and 2 PostgreSQL functions. It involves a sequence of expand boxes that fan out from the bounding box of a geometry to as far out as you want. Below are the descriptions of each part.

pgis_nn - a new type that defines a neighbor solution - for a given near neighbor solution it gives you, the primary key of the neighbor and the distance it is away from the reference object
expandoverlap_metric - this is a helper function that returns the first fanned out box that an object is found in relative to the reference geometry. It returns an integer where the integer varies from 0 (in bounding box) to maxslices where maxslices is how many slices you want to dice up your larger expand box.
_pgis_fn_nn - this is the workhorse function that should never be directly called - it is a pgsql function that returns for each geometry k near neighbors
pgis_fn_nn(geom1 geometry, distguess double precision, numnn integer, maxslices integer, lookupset varchar(150), swhere varchar(5000), sgid2field varchar(100), sgeom2field varchar(100))) - this is our interface function written in language sql to fool the planner into thinking we are using an sql function. This is to get around certain anomalies of PostgreSQL that prevent pgsql set returning functions from being used if they reference fields in the from part of an SQL statement. Description of the inputs is as follows.
1. geom1 - this is the reference geometry.
2. distguess - this is the furthest distance you want to branch out before you want the query to give up. It is measured in units of the spatial ref system of your geometries.
3. numnn - number of nearest neighbors to return.
4. maxslices - this is the max number of slices you want to slice up your whole expand box. The more slices the thinner the slices, but the more iterations. I'm sure there is some statistically function one can run against ones dataset to figure out the optimal number of slices and distance guess, but I haven't quite figured out what that would be.
5. lookupset - this is usually the name of the table you want to search for nearest neighbors. In theory you could throw in a fairly complicated subselect statement as long as you wrap it in parethesis. Something like '(SELECT g.*, b.name FROM sometable g INNER JOIN someothertable b ON g.id = b.id)' . Althought I haven't had a need to do that yet so haven't tested that feature.
6. swhere - this is the where condition to apply to your dataset. If you want the whole dataset to be evaluated, then put 'true'. In theory you can have this correlated with your reference table by gluing parameters together, so can get pretty sophisticated.
7. sgid2field - this is the name of the unique id field in the dataset you want to check for nearest neighbors.
8. sgeom2field - this is the name of the geometry field in the dataset you want to check for nearest neighbors

The code can be downloaded here http://www.bostongis.com/downloads/pgis_nn.txt. I will also be posting this to the PostGIS Wiki once I clean it up a bit more.

Key features

My objective in creating this solution is to give PostGIS the same expressiveness that Oracle's SDO_NN and SDO_NN_DISTANCE functions provide Oracle. Unlike the Oracle solution, this solution is not implemented as an Operator, but instead as a set returning function.

Features are outlined below

Can handle more than one reference point at a time and return a single dataset. For example you can say give me the 5 nearest neighbors for each of my records in this query and get the result back as a single table.
You can limit your search to a particular distance range, slice it up as granularly as you want in order to achieve the best resulting speed. Note I implemented this function as linearly expanding boxes to utilize GIST indexes. In hindsight it might have been more efficient processor wise to have the boxes expand like a normal distribution or something like that.
You can apply where conditions to the sets you are searching instead of searching the whole table - e.g. only give me the 5 nearest hospitals that have more than 100 beds. You can also have the where conditions tied to fields of each record of interest. E.g. include all near neighbors, but don't include the reference point.
This solution works for all geometries - e.g. you can find point near neighbors to a polygon, polygon near neighbors to a polygon etc. Nearest distance to a line etc.

Caveats

This function will only work if you give it a primary key that is an integer. I figured an integer is what most people use for their primary keys and is the most efficient join wise.
This function currently doesn't handle ties. E.g. if you specify 5 neighbors and you have 6 where the last 2 are the same distance away, one will arbitrarily get left out. I think it will be fairly trivial to make it handle this case though.
If there are no neighbors for a particular reference that meet the stated criteria, that record would get left out. There are workarounds for this which I will try to provide in more advanced use cases.

Use Cases

All the data we will be using is in SRID: 26986. For a refresher course on loading data and setting up indexes, check Part 1: Getting Started With PostGIS: An almost Idiot's Guide

This returned 1000 records in 6 seconds on my dual windows 2003 server
Datasets used: buildings http://www.mass.gov/mgis/lidarbuildingfp2d.htm,
firestations: http://www.mass.gov/mgis/firestations.htm

Find nearest fire station for first 1000 buildings in Boston and sort in decreasing order of distance from the fire station such that the building furthest away from its nearest firestation is at the top.


SELECT g1.gid as gid_ref, f.name, g1.nn_dist/1609.344 as dist_miles, g1.nn_gid
FROM (SELECT b.gid, 
    (pgis_fn_nn(b.the_geom, 1000000, 1,1000,'fire_stations', 'true', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM buildings limit 1000) b) As g1, fire_stations f
    WHERE f.gid = g1.nn_gid 
    ORDER BY g1.nn_dist DESC

Data set used: Fire stations from above and Massachusetts census tiger roads -http://www.mass.gov/mgis/cen2000_tiger.htm
For each street segment in zip 02109 list the 2 nearest fire stations and the distance in meters from each fire station. Order results by street name, street id and distance from nearest fire station


Total query runtime: 687 ms. 296 rows retrieved.

SELECT g1.gid as gid_ref, g1.street, g1.fraddl, g1.toaddl, f.name, g1.nn_dist as dist_meters, g1.nn_gid
FROM (SELECT b.*,
    (pgis_fn_nn(b.the_geom, 1000000, 2,1000,'fire_stations', 'true', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM census2000tiger_arc where zipl = '02109') b) As g1, fire_stations f
    WHERE f.gid = g1.nn_gid 
    ORDER BY g1.street, g1.fraddl, g1.gid, g1.nn_dist

For each street segment in zip 02109 list the 2 nearest buildings with building footprint > 5000 sq meters and specify how far away they are and the areas of each building.


--Total query runtime: 2656 ms. 296 rows retrieved.

SELECT g1.gid as gid_ref, g1.street, g1.fraddl, g1.toaddl, area(b.the_geom) as bldg_area, g1.nn_dist as dist_meters, g1.nn_gid
FROM (SELECT s.*,
    (pgis_fn_nn(s.the_geom, 1000000, 2,1000,'buildings', 'area(the_geom) > 5000', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM census2000tiger_arc where zipl = '02109') s) As g1, buildings b
    WHERE b.gid = g1.nn_gid 
    ORDER BY g1.street, g1.fraddl, g1.gid, g1.nn_dist

More Advanced Queries

Example of two simultaneous neighbor queries in one query

Layers used
http://www.mass.gov/mgis/schools.htm - schools
http://www.mass.gov/mgis/hospitals.htm - hospitals

Now for this, I think we can safely assume that no street we care about should be further than 30,000 meters from its nearest school and hospital so we will limit our search to 30,000 meters and dice up our box into 10 expand boxes - so boxes expand at a rate of 3000 meters out per expansion.

Give me the nearest elemenary school and the nearest hospital emergency center for street segments in select zipcodes and how far away they are from these street segments.


--Total query runtime: 5516 ms.(~5.5 secs). 291 rows retrieved.
SELECT emer.gid as gid_ref, emer.street, emer.fraddl, emer.toaddl, hospitals.name As emer_name, emer.nn_dist as emer_dist_meters, 
        schools.name as sch_name, sch.nn_dist as sch_dist_meters
FROM (SELECT b.gid, b.street,b.fraddl, b.toaddl,
    (pgis_fn_nn(b.the_geom, 30000, 1,10,'hospitals', 'er_status=''Y''', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM census2000tiger_arc where zipl IN('02109', '02110')) As b) As emer INNER JOIN
    (SELECT b.gid,
    (pgis_fn_nn(b.the_geom, 30000, 1,10,'schools', 'true', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM census2000tiger_arc where zipl IN('02109', '02110')) As b) As sch ON emer.gid = sch.gid
    INNER JOIN hospitals ON hospitals.gid = emer.nn_gid 
    INNER JOIN schools ON schools.gid = sch.nn_gid
    ORDER BY emer.street, emer.fraddl, emer.gid

Example limit distance exclusion queries

Provide a list of all schools in Massachusetts Boston extent whose closest hospital emergency room is further than 3000 meters away

Using our new neighbor function exclusion query


--Total query runtime: 2969 ms. 75 rows retrieved.
SELECT sch.gid as gid_ref, sch.name as sch_name
FROM schools sch LEFT JOIN (SELECT b.gid, 
    (pgis_fn_nn(b.the_geom, 3000, 1,2,'hospitals', 'er_status=''Y''', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM schools WHERE schools.the_geom && GeomFromText('POLYGON((225446.34375 886435.5625,225446.34375 905270.8125,242278.640625 905270.8125,242278.640625 886435.5625,225446.34375 886435.5625))',26986)) b) As g1 ON sch.gid = g1.gid
    WHERE g1.gid IS NULL AND sch.the_geom && GeomFromText('POLYGON((225446.34375 886435.5625,225446.34375 905270.8125,242278.640625 905270.8125,242278.640625 886435.5625,225446.34375 886435.5625))',26986)
ORDER BY sch.name

The old way doing a simple expand distance exclusion query. SOMETIMES THE OLD WAYS ARE JUST BETTER


	-- Total query runtime: 63 ms. 75 rows retrieved.
SELECT sch.gid as gid_ref, sch.name as sch_name
FROM schools sch LEFT JOIN 
(SELECT b.gid FROM schools b , hospitals h 
		WHERE b.the_geom && GeomFromText('POLYGON((225446.34375 886435.5625,225446.34375 905270.8125,242278.640625 905270.8125,242278.640625 886435.5625,225446.34375 886435.5625))',26986) AND h.er_status = 'Y'
				AND expand(b.the_geom, 3000) && h.the_geom AND distance(b.the_geom, h.the_geom) <= 3000) g1
ON sch.gid = g1.gid
WHERE g1.gid IS NULL AND sch.the_geom && GeomFromText('POLYGON((225446.34375 886435.5625,225446.34375 905270.8125,242278.640625 905270.8125,242278.640625 886435.5625,225446.34375 886435.5625))',26986)
ORDER BY sch.name

Caching data by storing closest neighbor in a table

For this exercise, we will create 2 new fields in our schools table called nearest_hospital and nearest_hospital_gid and update the fields with the closest hospital to that school. We will set our maximum distance search to 100,000 meters carved out into 10 10,000 meter expansions.

Later on we realize we need an alternate hospital in case the first hospital is filled or because of traffic situation we can't get there, so we create a new set of fields to hold the second closest hospital. This particular exercise demonstrates where clause dependent on the additional attribute data of the reference record.

Purpose of BostonGIS