26. Nearest-Neighbour Searching

Note

This section refers to a feature that is only available with PostGIS 2.0 and higher.

26.2. Index-based KNN

“KNN” stands for “K nearest neighbours”, where “K” is the number of neighbours you are looking for.

KNN is a pure index based nearest neighbour search. By walking up and down the index, the search can find the nearest candidate geometries without using any magical search radius numbers, so the technique is suitable and high performance even for very large tables with highly variable data densities.

Note

The KNN feature is only available on PostGIS 2.0 with PostgreSQL 9.1 or greater.

The KNN system works by evaluating distances between bounding boxes inside the PostGIS R-Tree index.

Because the index is built using the bounding boxes of geometries, the distances between any geometries that are not points will be inexact: they will be the distances between the bounding boxes of geometries.

The syntax of the index-based KNN query places a special “index-based distance operator” in the ORDER BY clause of the query, in this case “<->”. There are two index-based distance operators,

  • <-> means “distance between box centers”

  • <#> means “distance between box edges”

One side of the index-based distance operator must be a literal geometry value. We can get away with a subquery that returns as single geometry, or we could include a WKT geometry instead.

-- Closest 10 streets to Broad Street station are ?
SELECT
  streets.gid,
  streets.name
FROM
  nyc_streets streets
ORDER BY
  streets.geom <->
  (SELECT geom FROM nyc_subway_stations WHERE name = 'Broad St')
LIMIT 10;

-- Same query using a geometry EWKT literal

SELECT ST_AsEWKT(geom)
FROM nyc_subway_stations
WHERE name = 'Broad St';
-- SRID=26918;POINT(583571 4506714)

SELECT
  streets.gid,
  streets.name,
  ST_Distance(
    streets.geom,
    'SRID=26918;POINT(583571.905921312 4506714.34119218)'::geometry
    ) AS distance
FROM
  nyc_streets streets
ORDER BY
  streets.geom <->
  'SRID=26918;POINT(583571.905921312 4506714.34119218)'::geometry
LIMIT 10;

The results of the second query show how odd the index-based query on non-point geometries can appear at first glance. Wall Street is coming up third in our list, even though the absolute distance from the station to the street is 0.714 meters!

  gid  |     name     |     distance
-------+--------------+-------------------
 17360 | Exchange Pl  |    101.6241843136
 17350 | New St       |  63.9499165490674
 17385 | Wall St      | 0.714202224374917
 17332 | Exchange Aly |  159.618545539243
 17402 | Pine St      |  75.8461038368021
 17347 | Cedar St     |  133.009278387597
 17335 | Beaver St    |  221.988864601724
 17314 | Trinity Pl   |  205.942231743204
 17515 | Hanover St   |  198.414568622805
 17345 | Thames St    |  167.802276238319

Remember that all the calculations are being done on bounding boxes. The bounding box of the station point is just the point itself, so there is no approximation there. But the bounding boxes of the streets aren’t the same as the street lines. Here’s what the boxes of the top ten closest streets look like:

_images/knn1.jpg

We can see that the station falls right on the Wall Street line, and within the Wall Street box, but this index ordering is controlled by the <-> operator, which calculates distance between box centers. The centers of the boxes look like this:

_images/knn2.jpg

Now it is clear why Wall Street isn’t showing up as the first item in our search. The center of the Wall Street box is indeed further from the station than the centers of the Exchange Place and New Street boxes.

What about the <#> operator? If we calculate the distance between box edges, the station would fall inside the Wall Street box, giving it a distance of zero and the first entry in the list, right?

-- Closest 10 streets to Broad Street station are ?
SELECT
  streets.gid,
  streets.name
FROM
  nyc_streets streets
ORDER BY
  streets.geom <#>
  'SRID=26918;POINT(583571.905921312 4506714.34119218)'::geometry
LIMIT 10;

Unfortunately, no.

  gid  |                               name
-------+------------------------------------------------------------------
 19088 | FDR Dr
 17315 | Broadway
 19087 | FDR Dr
 17402 | Pine St
 17385 | Wall St
 17357 | South St
 17308 | Pearl St
 17235 | West Side Highway; West St; West Side Highway; West Side Highway
 17230 |
 17390 | Broad St

There are a number of large street features with big boxes that also overlap the station and yield a box distance of zero.

_images/knn3.jpg

The right way to get a high-performance–yet accurate–nearest neighbour calculation is to recognize that you’ll have to pull the top 100 (or a smaller number if you feel certain your data is more homogeneous in distribution) possible results in a sub-query, calculate the true distance for all of them, and return the closest record from that set.

-- "Closest" 100 streets to Broad Street station are?
WITH closest_candidates AS (
  SELECT
    streets.gid,
    streets.name,
    streets.geom
  FROM
    nyc_streets streets
  ORDER BY
    streets.geom <->
    'SRID=26918;POINT(583571.905921312 4506714.34119218)'::geometry
  LIMIT 100
)
SELECT gid, name
FROM closest_candidates
ORDER BY
  ST_Distance(
    geom,
    'SRID=26918;POINT(583571.905921312 4506714.34119218)'::geometry
    )
LIMIT 1;

Note that when querying a point table, because the boxes are identical to the points you can use the index-sorted result directly and dispense with the sub-query.

-- The 10 nearest stations to Broad St station
SELECT gid, name
FROM nyc_subway_stations
ORDER BY geom <-> 'SRID=26918;POINT(583571.905921312 4506714.34119218)'::geometry
LIMIT 10;