Sanitizers
Sanitizing is the process of cleaning up and otherwise preprocessing names before they are added to the search index during import. It makes it possible to clean up tagging, normalise different spellings and mark names with extra attributes for further processing.
Hint
Sanitizers only have an effect on how the search index is built. They do not change the information about each place that is saved in the database. In particular, they have no influence on how the results are displayed. The returned results always show the original information as stored in the OpenStreetMap database.
Configuration
The sanitizing process is defined in the 'sanitizers.yaml' configuration
file. The file must contain a list of steps. Each step has a mandatory
parameter step which defines the type of sanitizer. Additional step
configuration may then be set with additional parameters.
The steps are executed in the order that they are defined in the configuration file. Order matters here: each sanitizer works with the output of the previous step.
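As an illustration, a step list could look like the sketch below. The choice of steps and the delimiter value are only examples, and the fragment shows just the list itself, not where it sits inside the configuration file.

```yaml
# Steps run from top to bottom; each step works on the output of the one before it.
- step: split-name-list
  delimiters: ",;"
- step: strip-brace-terms
```

With this ordering, a list of names would first be split into its parts, and each part would then be checked separately for a bracketed addendum.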
Pre-defined sanitizers
The following is a list of sanitizers that are shipped with Nominatim. To learn about how to add your own custom sanitizer, see the section on custom sanitizer modules.
affix-expansion
Sanitizer which contracts or expands names based on the presence of prefix and suffix tags.
The sanitizer can handle three kinds of prefix/suffix tags. The simplest one is of the form <kind>:<prefix-tag>. It is presumed to refer to the name tag <kind>. For example, the name:prefix tag will be recognised as a prefix tag and paired with name, while alt_name:suffix is paired with alt_name. For name tags of the form <kind>:<suffix>, meaning that they carry an additional suffix such as a language suffix, two notations for the prefix/suffix tag are accepted: <kind>:<prefix-tag>:<suffix> and <kind>:<suffix>:<prefix-tag>. That means that for a German name tag name:de both name:prefix:de and name:de:prefix will work.
| PARAMETER | DESCRIPTION |
|---|---|
| prefix-tags | Specifies how to identify tags containing name prefixes. This is a single string or a list of suffixes which identify prefix names. (default: prefix) |
| suffix-tags | Specifies how to identify tags containing name suffixes. This is a single string or a list of suffixes which identify suffix names. (default: suffix) |
| mode | Defines how names are handled. full-name keeps only the expanded version of the name with prefix/suffix attached. short-name keeps only the contracted version without prefix/suffix; prefixes and suffixes are still added as partial terms to the index and are thus still searchable. all-variants adds both the expanded and the contracted version of the name. add_expanded adds the expanded version if it doesn't exist yet; if name contains the contracted name, then it will not be removed. add_contracted adds the contracted version if it doesn't exist yet; any expanded version of the name that already exists will be kept. |
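A minimal sketch of a step entry for this sanitizer, using parameter names from the table above. The values are only illustrative.

```yaml
- step: affix-expansion
  # Tags such as name:prefix are treated as holding the prefix for name.
  prefix-tags: prefix
  # Keep both the expanded and the contracted form of each name.
  mode: all-variants
```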
clean-housenumbers
Sanitizer that preprocesses address tags for house numbers. The sanitizer can:
- define which tags are to be considered house numbers (see 'filter-kind')
- split house number lists into individual numbers (see 'delimiters')
- expand interpolated house numbers
| PARAMETER | DESCRIPTION |
|---|---|
| delimiters | Define the set of characters to be used for splitting a list of house numbers into parts. (default: ',;') |
| filter-kind | Define the address tags that are considered to be a house number. Either takes a single string or a list of strings, where each string is a regular expression. An address item is considered a house number if the 'kind' fully matches any of the given regular expressions. (default: 'housenumber') |
| convert-to-name | Define house numbers that should be treated as a name instead of a house number. Either takes a single string or a list of strings, where each string is a regular expression that must match the full house number value. |
| expand-interpolations | When true, expand house number ranges to separate numbers when an 'interpolation' is present. (default: true) |
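The parameters could be combined as in the following sketch. The extra kind conscriptionnumber and the convert-to-name expression are assumptions chosen for illustration, not recommended settings.

```yaml
- step: clean-housenumbers
  # Split values such as "3;5" or "3,5" into separate house numbers.
  delimiters: ",;"
  # Regular expressions that must match the whole 'kind'.
  filter-kind:
    - housenumber
    - conscriptionnumber
  # Hypothetical rule: values without any digit are treated as names.
  convert-to-name:
    - '\D+'
  expand-interpolations: true
```

Single quotes keep the backslash in the regular expression literal, so no extra escaping is needed.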
clean-postcodes
Sanitizer that filters postcodes by their officially allowed pattern.
| PARAMETER | DESCRIPTION |
|---|---|
| convert-to-address | If set to 'yes' (the default), then postcodes that do not conform to their country-specific pattern are converted to an address component. That means that the postcode does not take part when computing the postcode centroids of a country but is still searchable. When set to 'no', non-conforming postcodes are not searchable either. |
| default-pattern | Pattern to use when there is none available for the country in question. Warning: will not be used for objects that have no country assigned. These are always assumed to have no postcode. |
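A possible step entry is sketched below. The fallback pattern is an illustrative assumption and should be adapted to the data being imported.

```yaml
- step: clean-postcodes
  # Non-conforming postcodes stay searchable as address components.
  convert-to-address: yes
  # Fallback for countries that have no pattern of their own.
  default-pattern: "[A-Z0-9- ]{3,12}"
```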
clean-tiger-tags
Sanitizer that preprocesses tags from the TIGER import.
It makes the following changes:
- remove state reference from tiger:county
delete-names
Sanitizer which prevents certain names from getting into the search index. It removes names which match all selected properties.
| PARAMETER | DESCRIPTION |
|---|---|
| type | Define which type of names should be considered for removal: proper names of the object ('name') or names defining the address ('address'). (default: 'name') |
| filter-kind | Define which 'kind' of names are affected. Takes a string or list of strings where each string is a regular expression. A name is considered to be a candidate for removal if its 'kind' property fully matches any of the given regular expressions. (default: no filter) |
| filter-suffix | Define the 'suffix' property of the names which should be removed. Takes a string or list of strings where each string is a regular expression. A tag is considered to be a candidate for removal if its 'suffix' property fully matches any of the given regular expressions. (default: no filter) |
| filter-name | Select a subset of name values to be deleted. Takes a string or list of strings where each string is a regular expression. A tag is considered to be a candidate for removal if its name fully matches any of the given regular expressions. (default: no filter) |
| filter-country | Define the country code of places whose names should be considered for removal. Takes a string or list of strings where each string is a two-letter lower-case country code. (default: no filter) |
| filter-rank | Define the address rank of places whose names should be considered for removal. Takes a string or list of strings where each string is a number or range of number or the form |
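Because a name is only removed when it matches all selected properties, filters narrow each other down. The following sketch uses a hypothetical scenario: it would drop proper names of kind old_name, but only for places in the two listed countries.

```yaml
- step: delete-names
  type: name
  # Must match the whole 'kind' of the name.
  filter-kind: old_name
  # Two-letter lower-case country codes.
  filter-country:
    - de
    - at
```

Country codes such as "no" are best written in quotes, because some YAML parsers read the bare word as a boolean.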
derive-names
This sanitizer can create additional name variants based on existing names.
| PARAMETER | DESCRIPTION |
|---|---|
| type | Define which type of names should be considered: proper names of the object ('name') or names defining the address ('address'). (default: 'name') |
| filter-kind | Define which 'kind' of names are affected. Takes a string or list of strings where each string is a regular expression. A name is considered if its 'kind' property fully matches any of the given regular expressions. (default: no filter) |
| filter-suffix | Restrict sanitizer to names with certain 'suffix' properties. Takes a string or list of strings where each string is a regular expression. A name is considered if its 'suffix' property fully matches any of the given regular expressions. (default: no filter) |
| filter-country | Restrict sanitizer to given countries. Takes a string or list of strings where each string is a two-letter lower-case country code. (default: no filter) |
| filter-rank | Define the address rank of places whose names should be considered. Takes a string or list of strings where each string is a number or range of number or the form |
| name-pattern | Regular expression to match the name proper against. Replacements will only be made when the full name matches this expression. The expression may contain capture groups which can be used in the variants parameter below. |
| variants | Single string or list of strings of new variants to be created. The string may contain numbered backreferences, e.g. |
| keep-original | When set to true, the original name will be kept. Otherwise the original is discarded when it matched the pattern. (default: true) |
split-name-list
Sanitizer that splits lists of names into their components.
| PARAMETER | DESCRIPTION |
|---|---|
| delimiters | Define the set of characters to be used for splitting the list. (default: ',;') |
strip-brace-terms
This sanitizer creates additional name variants for names that have addendums in brackets (e.g. "Halle (Saale)"). The additional variant contains only the main name part with the bracket part removed.
tag-analyzer-by-language
This sanitizer sets the analyzer property depending on the
language of the tag. The language is taken from the suffix of the name.
If a name already has an analyzer tagged, then this is kept.
| PARAMETER | DESCRIPTION |
|---|---|
| filter-kind | Restrict the sanitizer to names with the given tags. The parameter expects a list of regular expressions which are matched against 'kind'. Note that a match against the full string is expected. |
| whitelist | Restrict the set of languages that should be tagged. Expects a list of acceptable suffixes. When unset, all 2- and 3-letter lower-case codes are accepted. |
| suffix-ignore | List of suffixes that are not language-related. Names with a suffix from that list will be handled like a name without suffix. (default: empty) |
| use-defaults | Configure what happens when the name has no suffix. When set to 'all', a variant is created for each of the default languages in the country the feature is in. When set to 'mono', a variant is only created when exactly one language is spoken in the country. The default is to do nothing with the default languages of a country. |
| mode | Define how the variants are created. May be 'replace' or 'append'. When set to 'append', the original name (without any analyzer tagged) is retained. (default: replace) |
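As a rough sketch, the parameters above might be combined as follows. The language codes and ignored suffixes are arbitrary examples, not a recommended selection.

```yaml
- step: tag-analyzer-by-language
  # Apply to every kind that ends in "name" (name, alt_name, ...).
  filter-kind: [".*name"]
  # Only these language suffixes get an analyzer assigned.
  whitelist: [de, fr, it]
  # Suffixes that do not denote a language.
  suffix-ignore: [oxford, pinyin]
  # Names without suffix: use the country's language only if there is exactly one.
  use-defaults: mono
  # Keep the original, untagged name next to the tagged variants.
  mode: append
```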