We are happy to announce a new release of Nominatim. It concludes a year-long reorganization of the code around how Nominatim parses and indexes names of places. The result is a new, more flexible approach to handling how places can be searched, which is the base for the version 4 of Nominatim.
This release introduces the new ICU tokenizer. It replaces the custom PostgreSQL transliteration module with the standardized ICU libraries from the Unicode consortium. This makes it much easier in the future to adapt and fix transliteration. For those of you who run their own installation it means that you no longer have to struggle with compiling and running a PostgreSQL module. The new tokenizer also gives you the means to easily extend and customize the way OSM names are parsed and how they are found.
If you already have an custom installation of Nominatim running, the old ways of parsing and indexing are still available with the legacy tokenizer. So you don’t have to switch to the new ICU-based parsing right away but you can still update to the new release and take advantage of all the other improvements.
In fact, the legacy tokenizer is still the default choice for a custom import.
A database using the ICU tokenizer is already deployed at the reference site
at https://nominatim.openstreetmap.org, where you can try it out. We will
collect more user feedback in the coming months to weed out any regressions
before fully switching to the ICU tokenizer in the next release. If you want
to switch already, add
NOMINATIM_TOKENIZER=icu to your
.env file before
starting a new database import. It is not possible to convert an existing
The tokenizer was not the only change in this release. Here are some other highlights:
- The former PHP utils have been removed completely in favour of the nominatim CLI tool.
- The handling of special phrases has seen a complete overhaul. You can now supply your own set of special phrases in a CSV file and it is possible to update the phrases in an existing database.
- External postcode files are now supported for any country. These external postcode files are now also expected to be in a simple CSV format.
- The manual has a new Customization Guide, which collects all the documentation around configuration files and ways to customize your installation.
This was one of the largest overhauls of the code ever done in Nominatim. According to github’s statistics there are more than 480 commits touching almost 35000 lines of code. Many thanks to this year’s Google Summer of Code students Antonin Jolivat and Yash Srivastava who contributed some of the features of this release next to their regular work on their GSOC projects.
For further information on the release have a look at the Release notes.
The ICU tokenizer brings lots of improvements for the data import. The algorithms on the search frontend have remained largely the same. They still use the same principles as 10 years ago, when an OSM planet was a tenth of the size it is today. We now have much more detailed information available and consequently a much larger search index. It is time to rethink the algorithms how queries are parsed and results are found in the database. The reorganised tokenizer code also allows us to hand more detailed information to the search frontend which can be used to further improve precision of answers. The changes will likely include a similar switch from PHP to Python that we have already done in the backend.
The ICU tokenizer is also far from finished. The new pre-processing step opens up a lot of possibilities for better cleaning of OSM data. For example, it is now possible to implement special parsing functions for house numbers, so that Nominatim can be more forgiving to different styles of using spaces. We can check the format of postcodes to detect data errors in OSM. And we could finally do something about the directional prefixes, so common an issue for street names in North America.
This years work has been funded to a large part by NLNet’s NGI Zero Discovery fund.
I’d also like to thank OpenCage, Graphhopper and Komoot whose long term commitment to supporting Nominatim has made it possible to start working full-time on geocoding. If you are interested in supporting work on Nominatim, please don’t hesitate to get in touch.