Boston Geographic Information Systems
Solving the Nearest Neighbor Problem in PostGIS

A common problem encountered in GIS is the Nearest Neighbor problem. In a nutshell the problem is to find the x number of nearest neighbors given a geometry and n geometries of data.

The nearest neighbor crops up in other disciplines as well, except in other disciplines the units of measurement are not spatial distance but instead some sort of matrix distance. For this particular talk, I will focus on the GIS nature of the problem and specifically solving it in PostGIS.

The easiest and most naive way to do this is to do an exhaustive search of the whole space comparing the distances against the geometry of interest and sorting them. This solution while intuitive and easy to write is the slowest way of doing this.

If you were to write such a simple solution in PostgreSQL, then it would take something of the form shown below. The below gives you the nearest 5 neighbors from a reference row with gid = 1.

``` SELECT g1.gid As gref_gid, g1.description As gref_description, g2.gid As gnn_gid, g2.description As gnn_description FROM sometable As g1, sometable As g2 WHERE g1.gid = 1 and g1.gid <> g2.gid ORDER BY ST_Distance(g1.the_geom,g2.the_geom) LIMIT 5 ```

Now for moderate to large datasets, this exercise becomes painfully slow because distance is not an indexable function because it involves relations between two entities.

If you knew something about your data like there are no neighbors past 300 units away from your geometry of interest or that at that point you could care less what lies past 300 units, then you could significantly improve the speed of this query by writing it like this:

``` SELECT g1.gid As gref_gid, g1.description As gref_description, g2.gid As gnn_gid, g2.description As gnn_description FROM sometable As g1, sometable As g2 WHERE g1.gid = 1 and g1.gid <> g2.gid AND ST_DWithin(g1.the_geom, g2.the_geom, 300) ORDER BY ST_Distance(g1.the_geom,g2.the_geom) LIMIT 5 ```

If you needed to get the nearest neighbor for all records in a table, but you only need the first nearest neighbor for each, then you can use PostgreSQL's distinctive DISTINCT ON syntax. Which would look something like this :

``` SELECT DISTINCT ON(g1.gid) g1.gid As gref_gid, g1.description As gref_description, g2.gid As gnn_gid, g2.description As gnn_description FROM sometable As g1, sometable As g2 WHERE g1.gid <> g2.gid AND ST_DWithin(g1.the_geom, g2.the_geom, 300) ORDER BY g1.gid, ST_Distance(g1.the_geom,g2.the_geom) ```

If you needed to run this for more than one record in sometable, but wanted a cross tab like query that lists the n nearest neighbors in one row with neighbors listed as n1, n2, n3 etc, then you can take advantage of PostgreSQL unique array offerings which we detail in SQL Idiom Design Patterns. The below code is excerpted from that article.

``` SELECT m.building_name, m.nn_precincts[1] As nn_precinct1, m.nn_precincts[2] As nn_precinct2, m.nn_precincts[3] As nn_precinct3 FROM (SELECT b.building_name, b.the_geom, ARRAY(SELECT p.precint_name FROM police_precints p WHERE ST_DWithin(b.the_geom, p.the_geom, 5000) ORDER BY ST_Distance(b.the_geom, p.the_geom) LIMIT 3) As nn_precincts FROM buildings b ) m ORDER BY m.builidng_name; ```

If you needed to run this for more than one record in sometable, then you can either put this in a loop and run it for each record of interest in sometable. Alternatively you could create an SQL function that returns a table for each record and use it like shown below.

```CREATE OR REPLACE FUNCTION fnsometable_nn(gid integer, geom1 geometry, distguess double precision, n umnn integer) RETURNS SETOF sometable AS \$BODY\$ SELECT g2.* FROM sometable As g2 WHERE \$1 <> g2.gid AND expand(\$2, \$3) && g2.the_geom ORDER BY distance(g2.the_geom, \$2) LIMIT \$4 \$BODY\$ LANGUAGE 'sql' VOLATILE; ```

To use the function to run for the first 100 records in sometable and return the 5 closest neighbors for each of those records, you would do this

``` SELECT g1.gid, g1.description, (fnsometable_nn(g1.gid, g1.the_geom, 300, 5)).gid As nn_gid, (fnsometable_nn(g1.gid, g1.the_geom, 300, 5)).description as nn_description, distance((fnsometable_nn(g1.gid, g1.the_geom, 300, 5)).the_geom,g1.the_geom) as dist FROM (SELECT * FROM sometable ORDER BY gid LIMIT 100) g1 ```

A lot of people are probably thinking "HUH" at this point. I will now wave my hand and leave it as an exercise for people to figure out why this works. I will say this much. I am using a feature in PostgreSQL which from my readings is an accident of design and only works with functions written in SQL and C language. This accident will probably be later corrected in future versions so don't expect this to work moving forward.

Returning tables in the select part of an sql statement is super non-standard and should I say bizarre and unelegant, but is the only way to do it currently in PostgreSQL where the you have a set returning function whose inputs depend on values from a relating table. This is because PostgreSQL currently does not support subselects and set returning functions in the FROM clause that dynamically change based on inputs from other table fields. In fact most databases do not support this.

Unfortunately PostgreSQL doesn't support a predicate similar to the CROSS APPLY predicate that SQL Server 2005 has. I think Oracle does by pretty much just leaving out the CROSS APPLY predicate. CROSS APPLY is more elegant looking I think in general and also more self-documenting in the sense that when you see the clause - you know what is being done.

SQL Server which I use very often - introduced this feature in the SQL Server 2005 offering and in a few cases its a life-saver.

I think the next next release of PostgreSQL will probably support a feature similar to this. Just to demonstrate how powerful this feature can be, if we had the cross apply feature, we could do away with our intermediate set returning function and simply write our query like this.

``` SELECT g1.gid, g1.description, nn.gid As nn_gid, nn.description As nn_description, distance(nn.the_g eom,g1.the_geom) as dist FROM (SELECT * FROM sometable ORDER BY gid LIMIT 100) g1 CROSS APPLY (SELECT g2.* FROM sometable As g2 WHERE g1.gid <> g2.gid AND expand(g1.the_geom, 300) && g2.the_geom) ORDER BY distance(g2.the_geom, g1.the_geom) LIMIT 5) As nn ```

This would alleviate the need to create a new nn function for each table or kind of query we are formulating.

Its always interesting to see how different databases tackle the same issues. Oracle Locator and Oracle Spatial for example does have an SDO_NN and SDO_NN_DISTANCE functions which magically does all this for you, but I haven't used it so not sure of its speed or if it has any caveats. One I did notice from reading is that it doesn't deal with where clauses too well.