PostGIS Spatial Database Engine UMN Mapserver Boston Geographic Information Systems    Checkout our PostGIS in Action book.  First chapter is a free download   PostGreSQL Object Relational Database Management System
GIS Books  Home   Consulting Services  About Boston GIS   Boston GIS Blog  Postgres OnLine Journal
PostGIS in Action is out in hard-copy,
download the first chapter
and SQL Primer for free. Tips and Tricks for PostGIS
  GIS Article comments Comments Rss
PostGIS Nearest Neighbor: A Generic Solution - Much Faster than Previous Solution

A generic solution to PostGIS nearest neighbor

After some heavy brainstorming, I have come up with a faster and more generic solution to calculating nearest neighbors than my previous solutions. For the gory details on how I arrived at this solution - please check out the following blog entries Boston GIS Blog entries on Nearest Neighbor Search.

In the sections that follow I will briefly describe the key parts of the solution, the strengths of the solution, the caveats of the solution, and some general use cases.

The parts of the solution

The solution is composed of a new PostgreSQL type, and 2 PostgreSQL functions. It involves a sequence of expand boxes that fan out from the bounding box of a geometry to as far out as you want. Below are the descriptions of each part.

  • pgis_nn - a new type that defines a neighbor solution - for a given near neighbor solution it gives you, the primary key of the neighbor and the distance it is away from the reference object
  • expandoverlap_metric - this is a helper function that returns the first fanned out box that an object is found in relative to the reference geometry. It returns an integer where the integer varies from 0 (in bounding box) to maxslices where maxslices is how many slices you want to dice up your larger expand box.
  • _pgis_fn_nn - this is the workhorse function that should never be directly called - it is a pgsql function that returns for each geometry k near neighbors
  • pgis_fn_nn(geom1 geometry, distguess double precision, numnn integer, maxslices integer, lookupset varchar(150), swhere varchar(5000), sgid2field varchar(100), sgeom2field varchar(100))) - this is our interface function written in language sql to fool the planner into thinking we are using an sql function. This is to get around certain anomalies of PostgreSQL that prevent pgsql set returning functions from being used if they reference fields in the from part of an SQL statement. Description of the inputs is as follows.
    1. geom1 - this is the reference geometry.
    2. distguess - this is the furthest distance you want to branch out before you want the query to give up. It is measured in units of the spatial ref system of your geometries.
    3. numnn - number of nearest neighbors to return.
    4. maxslices - this is the max number of slices you want to slice up your whole expand box. The more slices the thinner the slices, but the more iterations. I'm sure there is some statistically function one can run against ones dataset to figure out the optimal number of slices and distance guess, but I haven't quite figured out what that would be.
    5. lookupset - this is usually the name of the table you want to search for nearest neighbors. In theory you could throw in a fairly complicated subselect statement as long as you wrap it in parethesis. Something like '(SELECT g.*, b.name FROM sometable g INNER JOIN someothertable b ON g.id = b.id)' . Althought I haven't had a need to do that yet so haven't tested that feature.
    6. swhere - this is the where condition to apply to your dataset. If you want the whole dataset to be evaluated, then put 'true'. In theory you can have this correlated with your reference table by gluing parameters together, so can get pretty sophisticated.
    7. sgid2field - this is the name of the unique id field in the dataset you want to check for nearest neighbors.
    8. sgeom2field - this is the name of the geometry field in the dataset you want to check for nearest neighbors

The code can be downloaded here http://www.bostongis.com/downloads/pgis_nn.txt. I will also be posting this to the PostGIS Wiki once I clean it up a bit more.

Key features

My objective in creating this solution is to give PostGIS the same expressiveness that Oracle's SDO_NN and SDO_NN_DISTANCE functions provide Oracle. Unlike the Oracle solution, this solution is not implemented as an Operator, but instead as a set returning function.

Features are outlined below

  • Can handle more than one reference point at a time and return a single dataset. For example you can say give me the 5 nearest neighbors for each of my records in this query and get the result back as a single table.
  • You can limit your search to a particular distance range, slice it up as granularly as you want in order to achieve the best resulting speed. Note I implemented this function as linearly expanding boxes to utilize GIST indexes. In hindsight it might have been more efficient processor wise to have the boxes expand like a normal distribution or something like that.
  • You can apply where conditions to the sets you are searching instead of searching the whole table - e.g. only give me the 5 nearest hospitals that have more than 100 beds. You can also have the where conditions tied to fields of each record of interest. E.g. include all near neighbors, but don't include the reference point.
  • This solution works for all geometries - e.g. you can find point near neighbors to a polygon, polygon near neighbors to a polygon etc. Nearest distance to a line etc.

Caveats

  • This function will only work if you give it a primary key that is an integer. I figured an integer is what most people use for their primary keys and is the most efficient join wise.
  • This function currently doesn't handle ties. E.g. if you specify 5 neighbors and you have 6 where the last 2 are the same distance away, one will arbitrarily get left out. I think it will be fairly trivial to make it handle this case though.
  • If there are no neighbors for a particular reference that meet the stated criteria, that record would get left out. There are workarounds for this which I will try to provide in more advanced use cases.

Use Cases

All the data we will be using is in SRID: 26986. For a refresher course on loading data and setting up indexes, check Part 1: Getting Started With PostGIS: An almost Idiot's Guide

This returned 1000 records in 6 seconds on my dual windows 2003 server
Datasets used: buildings http://www.mass.gov/mgis/lidarbuildingfp2d.htm,
firestations: http://www.mass.gov/mgis/firestations.htm

Find nearest fire station for first 1000 buildings in Boston and sort in decreasing order of distance from the fire station such that the building furthest away from its nearest firestation is at the top.
SELECT g1.gid as gid_ref, f.name, g1.nn_dist/1609.344 as dist_miles, g1.nn_gid
FROM (SELECT b.gid, 
    (pgis_fn_nn(b.the_geom, 1000000, 1,1000,'fire_stations', 'true', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM buildings limit 1000) b) As g1, fire_stations f
    WHERE f.gid = g1.nn_gid 
    ORDER BY g1.nn_dist DESC



Data set used: Fire stations from above and Massachusetts census tiger roads -http://www.mass.gov/mgis/cen2000_tiger.htm
For each street segment in zip 02109 list the 2 nearest fire stations and the distance in meters from each fire station. Order results by street name, street id and distance from nearest fire station
Total query runtime: 687 ms. 296 rows retrieved.

SELECT g1.gid as gid_ref, g1.street, g1.fraddl, g1.toaddl, f.name, g1.nn_dist as dist_meters, g1.nn_gid
FROM (SELECT b.*,
    (pgis_fn_nn(b.the_geom, 1000000, 2,1000,'fire_stations', 'true', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM census2000tiger_arc where zipl = '02109') b) As g1, fire_stations f
    WHERE f.gid = g1.nn_gid 
    ORDER BY g1.street, g1.fraddl, g1.gid, g1.nn_dist 



For each street segment in zip 02109 list the 2 nearest buildings with building footprint > 5000 sq meters and specify how far away they are and the areas of each building.

--Total query runtime: 2656 ms. 296 rows retrieved.

SELECT g1.gid as gid_ref, g1.street, g1.fraddl, g1.toaddl, area(b.the_geom) as bldg_area, g1.nn_dist as dist_meters, g1.nn_gid
FROM (SELECT s.*,
    (pgis_fn_nn(s.the_geom, 1000000, 2,1000,'buildings', 'area(the_geom) > 5000', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM census2000tiger_arc where zipl = '02109') s) As g1, buildings b
    WHERE b.gid = g1.nn_gid 
    ORDER BY g1.street, g1.fraddl, g1.gid, g1.nn_dist 


More Advanced Queries

Example of two simultaneous neighbor queries in one query

Layers used
http://www.mass.gov/mgis/schools.htm - schools
http://www.mass.gov/mgis/hospitals.htm - hospitals

Now for this, I think we can safely assume that no street we care about should be further than 30,000 meters from its nearest school and hospital so we will limit our search to 30,000 meters and dice up our box into 10 expand boxes - so boxes expand at a rate of 3000 meters out per expansion.

Give me the nearest elemenary school and the nearest hospital emergency center for street segments in select zipcodes and how far away they are from these street segments.
--Total query runtime: 5516 ms.(~5.5 secs). 291 rows retrieved.
SELECT emer.gid as gid_ref, emer.street, emer.fraddl, emer.toaddl, hospitals.name As emer_name, emer.nn_dist as emer_dist_meters, 
        schools.name as sch_name, sch.nn_dist as sch_dist_meters
FROM (SELECT b.gid, b.street,b.fraddl, b.toaddl,
    (pgis_fn_nn(b.the_geom, 30000, 1,10,'hospitals', 'er_status=''Y''', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM census2000tiger_arc where zipl IN('02109', '02110')) As b) As emer INNER JOIN
    (SELECT b.gid,
    (pgis_fn_nn(b.the_geom, 30000, 1,10,'schools', 'true', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM census2000tiger_arc where zipl IN('02109', '02110')) As b) As sch ON emer.gid = sch.gid
    INNER JOIN hospitals ON hospitals.gid = emer.nn_gid 
    INNER JOIN schools ON schools.gid = sch.nn_gid
    ORDER BY emer.street, emer.fraddl, emer.gid

Example limit distance exclusion queries

Provide a list of all schools in Massachusetts Boston extent whose closest hospital emergency room is further than 3000 meters away

Using our new neighbor function exclusion query
--Total query runtime: 2969 ms. 75 rows retrieved.
SELECT sch.gid as gid_ref, sch.name as sch_name
FROM schools sch LEFT JOIN (SELECT b.gid, 
    (pgis_fn_nn(b.the_geom, 3000, 1,2,'hospitals', 'er_status=''Y''', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM schools WHERE schools.the_geom && GeomFromText('POLYGON((225446.34375 886435.5625,225446.34375 905270.8125,242278.640625 905270.8125,242278.640625 886435.5625,225446.34375 886435.5625))',26986)) b) As g1 ON sch.gid = g1.gid
    WHERE g1.gid IS NULL AND sch.the_geom && GeomFromText('POLYGON((225446.34375 886435.5625,225446.34375 905270.8125,242278.640625 905270.8125,242278.640625 886435.5625,225446.34375 886435.5625))',26986)
ORDER BY sch.name
The old way doing a simple expand distance exclusion query. SOMETIMES THE OLD WAYS ARE JUST BETTER
-- Total query runtime: 63 ms. 75 rows retrieved.
SELECT sch.gid as gid_ref, sch.name as sch_name
FROM schools sch LEFT JOIN 
(SELECT b.gid FROM schools b , hospitals h 
		WHERE b.the_geom && GeomFromText('POLYGON((225446.34375 886435.5625,225446.34375 905270.8125,242278.640625 905270.8125,242278.640625 886435.5625,225446.34375 886435.5625))',26986) AND h.er_status = 'Y'
				AND expand(b.the_geom, 3000) && h.the_geom AND distance(b.the_geom, h.the_geom) <= 3000) g1
ON sch.gid = g1.gid
WHERE g1.gid IS NULL AND sch.the_geom && GeomFromText('POLYGON((225446.34375 886435.5625,225446.34375 905270.8125,242278.640625 905270.8125,242278.640625 886435.5625,225446.34375 886435.5625))',26986)
ORDER BY sch.name

Caching data by storing closest neighbor in a table

For this exercise, we will create 2 new fields in our schools table called nearest_hospital and nearest_hospital_gid and update the fields with the closest hospital to that school. We will set our maximum distance search to 100,000 meters carved out into 10 10,000 meter expansions.

-- Query returned successfully: 2613 rows affected, 20625 ms (~1/3 minute) execution time.
ALTER TABLE schools ADD nearest_hospital character varying(255);
ALTER TABLE schools ADD nearest_hospital_gid integer;

UPDATE schools SET nearest_hospital = h.name , nearest_hospital_gid = h.gid
FROM (SELECT s.gid As sch_gid,
    (pgis_fn_nn(s.the_geom, 100000, 1,10,'hospitals', 'er_status=''Y''', 'gid', 'the_geom')).*  
    FROM schools As s) As nn_emer INNER JOIN hospitals h ON nn_emer.nn_gid = h.gid
WHERE schools.gid = nn_emer.sch_gid


Later on we realize we need an alternate hospital in case the first hospital is filled or because of traffic situation we can't get there, so we create a new set of fields to hold the second closest hospital. This particular exercise demonstrates where clause dependent on the additional attribute data of the reference record.


-- Query returned successfully: 2613 rows affected, 23032 ms (~1/3 minute) execution time.
ALTER TABLE schools ADD nearest_hospital_2 character varying(255);
ALTER TABLE schools ADD nearest_hospital_gid_2 integer;

UPDATE schools SET nearest_hospital_2 = h.name , nearest_hospital_gid_2 = h.gid FROM (SELECT s.gid As sch_gid, (pgis_fn_nn(s.the_geom, 100000, 1,10,'hospitals', 'er_status=''Y'' AND gid <> ' || CAST(s.nearest_hospital_gid As varchar(20)) , 'gid', 'the_geom')).* FROM schools As s) As nn_emer INNER JOIN hospitals h ON nn_emer.nn_gid = h.gid WHERE schools.gid = nn_emer.sch_gid AND nearest_hospital_gid IS NOT NULL


Post Comments About PostGIS Nearest Neighbor: A Generic Solution - Much Faster than Previous Solution



 CommenterComment
2/13/2010 12:34:42 PMStephen MatherNice examples, and I've got the query returning the set of nearest neighbors. That said, what's the simplest example of performing the query and returning the nearest neighbor as a join? I'm struggling to accomplish this with the custom pgis_nn type. Thanks. Steve
10/21/2009 1:32:22 AMReginaJorge,
doing a transform on the fly will be slow without a spatial index on transformed data -- so you will want to do this first:

CREATE INDEX idx_puntos_the_geom_21897
ON puntos
USING gist
(ST_Transform(the_geom,21897));

vacuum analyze puntos;

Then try the below
Try this:
SELECT * FROM
pgis_fn_nn(ST_Transform (ST_GeomFromText('POINT(-74.071419171461 4.60402545420351)',4326),21897),
100000, 1,10,'(SELECT gid, ST_Transform(the_geom,21897) As the_geom FROM puntos WHERE tipo=''IGLESIA'')', 'true ', 'gid', 'the_geom');



10/20/2009 6:18:07 PMjorgeRegina, thanks for your answer, but yet don't work... I treated with "SELECT * from pgis_fn_nn(Transform (ST_GeomFromText('POINT(-74.071419171461 4.60402545420351)',4326),21897), 100000, 1,10,'puntos', 'tipo=''IGLESIA'' ', 'gid',ST_AsText(ST_transform(the_geom,21897)));"

but the system error are: ERROR: no existe la columna «the_geom» also treat whith 'the_geom', but return error in the geometry...

the query "select ST_AsText(ST_transform(the_geom,21897)) from puntos" works well ... return.. a list that "POINT(1002729.21396986 1005638.30386198)"

thank you for your time and interest
regards
10/18/2009 10:47:49 PMReginaJorge,

You can do it two ways.

1) Write a subselect kind of like we did with building.


2) Like you are doing it. You are close, but you are passing in the_geom like its a string. You should write it like this

SELECT * from pgis_fn_nn(ST_Transform (ST_GeomFromText('POINT(-74.071419171461 4.60402545420351)',4326),21897), 100000, 1,10,'puntos', 'tipo=''IGLESIA'' ', 'gid', 'ST_GeomFromText(ST_Transform (the_geom,21897))');


Note: I couldn't test so I may have made a typo. Also if you are using PostGIS 1.3+ please use the ST_.. functions. The non-ST versions are deprecated and will probably be removed completely in PostGIS 2.0.

PostGIS 1.5 will also support long lat and return meters, so we'll be writting an updated version of the script to deal with the new geograpy data type, and also correcting some mistakes in this one (it doesn't deal with non-point geometries right so may miss some close neighbors)

10/16/2009 12:22:29 AMjorgedon't work using transform() function ... what can I do???
the table puntos is in wgs84, but I need measure distance in meters

SELECT * from pgis_fn_nn(transform (GeomFromText('POINT(-74.071419171461 4.60402545420351)',4326),21897), 100000, 1,10,'puntos', 'tipo=''IGLESIA''', 'gid', GeomFromText(transform ('the_geom',21897)));

Código de error 0, estado SQLXX000: ERROR: parse error - invalid geometry
Línea 19, columna 1

Ejecución finalizada después de 0 s, ocurrió 1 error(s).
6/2/2009 3:06:21 PMMikeIt gives you not the nearest neighboor in all cases.

5/3/2009 5:18:41 PMM0ntyWorks perfect! Thank you.
4/15/2008 2:27:11 PMDavid FuhryRegina, a better way to find kNN of a point, with only one R-Tree traversal, is to maintain a priority queue of R-Tree nodes ordered by their minimum distance to the point. Expand the node at the head of the queue, add its children, and repeat until you've found k leaf nodes.

The papers most cited for this "best-first nearest neighbor" technique are Nearest Neighbor Queries, Roussopoulos et al. '95 and Distance Browsing in Spatial Databases, Hjaltason and Samet '99.

I don't think it could be implemented with a PL since it needs direct access to the tuples of the index relation. Maybe a C set-returning-function could.

This Document is available under the GNU Free Documentation License 1.2 http://www.gnu.org/copyleft/fdl.html & for download at the BostonGIS site http://www.bostongis.com

Boston GIS      Copyright 2014      Paragon Corporation