The Nominatim Python library, that was added in the last release, makes it quite a bit easier to work with geocoding on your local machine. Still, the library needs access to a PostgreSQL database. And setting this up can be tedious. This is the first of two posts discussing the question: can a geocoding database be squeezed into a SQLite file?

SQLite has made it really easy to work with smaller SQL databases. A single file contains all the data and can be easily copied and distributed to run anywhere from a server to your cell phone. The SpatiaLite extension adds the spatial capabilities a geocoding database needs.

SQLite can be quite efficient when used read-only, so this is what we target here: converting an existing Nominatim database into SQLite to be used with the existing Python frontend. The heavy lifting of preprocessing the data during import is left to Postgres.

Converting the database

In an earlier blog post we have already seen that it is fairly straightforward to convert the Nominatim database using SQLAlchemy. A line-by-line copy may not be the fastest method but it means that we may do some processing or filtering on the way to optimize the database output.

Creating spatial indexes is a bit more tricky. Spatialite doesn’t implement indexes as first class citizens. Instead the lookup trees are saved in a special table SpatialIndex. The GeoAlchemy2 extension would hide all these details from the SQLAlchemy user. Unfortunately, we had to get rid of it for performance reasons. Following GeoAlchemy’s example, the convert handles geometry columns now manually as follows:

  • create a placeholder column with type GEOMETRY
  • convert the column to a geometry column using the RecoverGeometrycolumn() function
  • copy in all data using the GeometryFromText() function to import the geometry in the correct format
  • add a spatial index where necessary using the CreateSpatialIndex() function

Nominatim has one index for reverse lookup that is rather special. It is a partial index over a transformed geometry:

CREATE INDEX ON placex
  USING gist (ST_Buffer(geometry, reverse_place_diameter(rank_search)))
  WHERE osm_type = 'N' and rank_address between 4 and 25

The index is used to check, if the coordinate that was looked up is still in the vicinity of a place node, the centre point of a village, city or similar. What counts as vicinity depends on the type of place. It is smaller for a hamlet, larger for a town or city.

Spatialite provides support neither for partial spatial indexes nor for indexes over transformed geometries. The solution here is to resort to a helper table that holds the transformed geometries. Or to be more precise, it is sufficient if it holds the bounding box of the transformed geometries because the bounding box is all that is used by a spatial index.

Converting the frontend

Once you start trying to run SQL code designed for PostGIS against another dialect like Spatialite, it becomes clear very quickly that while they may provide almost the same functionality, there are a lot of tiny but essential syntactic differences. For example, PostGIS has taken great care to switch to a consistent naming for all functions using an ST_ prefix and CamelCase for the function names. Spatialite still uses the older inconsistent naming in many places.

While SQLAlchemy is designed to be used with different SQL backends, I doubt that there is much code out there designed to be used with different backends at the same time. The library has a concept that allows to provide different compilations for different dialects. Documentation around this feature is rather sparse and it takes a few rounds of trial-and-error to get the implementation right. However, once done right, it does a tremendous job. There is not a single switch in the implementation of the Reverse API that needs to distinguish between PostGIS or Spatialite. All dialect-specific SQL is neatly hidden behind custom SQLAlchemy functions. Here is an example, how to implement Greatest which is called max in SQLite:

import sqlalchemy as sa

class Greatest(sa.sql.functions.GenericFunction):
    """ Function to compute maximum of all its input parameters.
    """
    name = 'greatest'
    inherit_cache = True


@compiles(Greatest, 'sqlite')
def sqlite_greatest(element, compiler, **kw):
    return "max(%s)" % compiler.process(element.clauses, **kw)

The new function can then be used like any other function:

sql = sa.select(sa.func.greatest(1, 3, 5))

When running this code against a PostgreSQL connection, SQLAlchemy does the standard handling for function, which is to render a function with the given name: SELECT greatest(1, 3, 5). When connected to SQLite, it will execute the function defined with @compile instead. The SQL comes out as SELECT max(1, 3, 5). This is pretty straightforward. The tricky part comes when spatial indexes are involved.

We have already briefly touched on the fact that spatial indexes work differently in Spatialite. They take the form of a table that needs to be queried explicitly. As a result, we need to know which queries in the code are supposed to use a spatial index. This may look like a step backwards compared to Postgres, which magically uses the right index. However, the queries in Nominatim are complex enough that we need a tight control over index use in any case. Being explicit is better for maintainability.

Thus, the relevant spatial functions can now come in two variants: one with index use and one without. Most functions are used by Nominatim only in one way or the other, so changes to the code were minimal.

Testing on Real-World Data

Nominatim in SQLite is not meant for a planet-sized database but rather for use-cases where a smaller geographic or thematic extract is needed. To show-case this, we shall test SQLite against two kinds of extract:

  • Germany represents one of the best-mapped geographic areas. With 4.2 million road objects, 19 million address points and 5 million additional points of interest, the complete extract makes for a fairly large geocoding database.
  • Admin is a geocoding database created with Nominatim’s admin style. It is useful for applications where a search for place names is sufficient, for example when tagging photos. It contains 0.5 million boundaries and 6.5 million place names.

Database size

Both database have been created with the import options --reverse-only and --drop because we are neither interested in search nor in updates at this point. The resulting PostgreSQL database was converted with the new convert command now available with the Nominatim master branch. The following table shows the size of the PostgreSQL and SQLite tables:

Extract PostgreSQL SQLite
Germany 29 GB 18 GB
Admin 12 GB 12 GB

The larger size of the Postgres DB for Germany is mainly the result of table bloat. After cleaning the table, the database shrinks to 19GB. Thus, there is no significant difference for the two database types.

Overall the SQLite files are still a bit larger than is practical. Remember however, that these files still contain the full wealth of OSM data. Depending on the use-case, objects like rivers, park benches or dust bins may be dropped. The SQLite file also contains unmodified geometries. Simplification may reduce the database size but also lead to incorrect results around the boundaries.

Performance

To get a first idea of performance, we tested the two database against a set of 10.000 random coordinates. Using the new Nominatim library this can be done with a few lines of code:

from pathlib import Path
import asyncio
import random
import time
import nominatim.api

async def run_reverse(rounds):
    random.seed(34)
    api = nominatim.api.NominatimAPIAsync(Path('.'))
    for i in range(rounds):
        await api.reverse((random.uniform(5.8, 15.1), random.uniform(47.2, 55.2)),
                          max_rank=random.randint(5, 30))

start = time.time()
asyncio.run(run_reverse(10000))
print(f'Result: {10000/(time.time() - start)} request/s')

Here are the results, given in requests per second:

Extract PostgreSQL SQLite
Germany 454 23
Admin 149 21

Not surprisingly, Postgres will do a lot better. For one thing, it was designed for high performance tasks like this. It also has the advantage that it can make use of additional partial spatial indexes, which cannot be implemented directly in SQLite as explained above. Still, even with this simple implementation, the SQLite-based frontend can achieve 20 requests/s in both cases, which should be sufficient for many use cases.

While there is clearly room for improvement, these experiments have shown that SQLite is clearly up to the task to serve as a database for a geocoder. Next up is the search API (or forward geocoding), which comes with another set of challenges. More on that in the second part of this blog post.

Many thanks to NGI Zero who are funding this work on improving Nominatim.