Increasing code quality: Static typing for Python Code
When porting Nominatim’s data importer to Python last year, code quality played an important role. There are unit tests for every function, and a linting tool runs regularly as part of the continuous integration to find the little code smells that are the first sign of bigger issues with the code. Now we have added an additional level of quality checks: static type checking.
Python is a dynamically typed programming language. Variables in Python do
have types but they are implicit and can be changed with a simple assignment.
This gives Python a great degree of flexibility and makes it so useful as a
quick prototyping language. The problem is that, as the code base grows, this flexibility can also get
in the way of efficient programming. Whenever you define a function, you usually
make implicit assumptions about the type of the parameters and return values.
Say your function takes the ID of an OSM node as a parameter. As you know it
is a number, you might expect that the ID has the type int, because it would be
inefficient if the function itself tried to convert the input parameter into
a number every time. These kinds of assumptions are made all the time. As the
code base grows, it gets harder to remember them. The result is
that you need to keep re-reading your own code to remember what you did. This
is tedious and error-prone.
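A minimal sketch of such an implicit assumption. The function name `next_node_id` is hypothetical and not taken from Nominatim; it only illustrates how an unstated expectation about a parameter type surfaces at runtime.

```python
def next_node_id(osm_id):
    # Implicit assumption: osm_id is an int. Nothing in the signature says so.
    return osm_id + 1


x = 42            # x holds an int
print(next_node_id(x))

x = "42"          # a simple assignment changes the type of x to str
try:
    next_node_id(x)
except TypeError as err:
    # The broken assumption is only noticed when this line is executed.
    print("failed at runtime:", err)
```

Nothing stops the second call from being written. The error appears only once that code path is actually run.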
To solve the problem, Python has come up with the concept of
type annotations. They allow
the programmer to add hints about the expected type of a parameter, variable
or return value of a function. These annotations don’t make a statically
typed language out of Python. In fact, the annotations are not enforced
at runtime. Instead they can be used with static type checkers like
mypy. These tools analyse the code without executing it
and highlight places where functions are used without the expected types.
This helps a lot with code quality. However, the far bigger benefit of
annotations for a code maintainer is their documentation value. No more
guessing whether a function may return a None value. The annotation spells it out.
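As a rough illustration of what such annotations look like, here is a small sketch. The function `node_name` and the lookup table `_NODE_NAMES` are invented for this example and do not come from Nominatim.

```python
from typing import Dict, Optional

# Hypothetical in-memory table standing in for a database lookup.
_NODE_NAMES: Dict[int, str] = {1: "Berlin", 2: "Paris"}


def node_name(osm_id: int) -> Optional[str]:
    """Return the name of the node, or None when the node is unknown."""
    return _NODE_NAMES.get(osm_id)


name = node_name(1)
if name is not None:
    # Inside this branch a type checker knows that name is a str.
    print(name.upper())
```

The signature alone tells the reader that the ID must be an int and that the result may be None. A type checker such as mypy would be expected to complain about a call like `node_name("1")` or about `node_name(3).upper()` without a None check, while plain Python would run the first call without any error and simply return None.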
In the past few weeks, I’ve gone through the Python code in Nominatim and annotated the parameters and return values of every single function. This was no small undertaking. The code has grown to more than 500 functions by now. Some of them were quickly annotated but others caused some real headaches. Here are the most important lessons I learned.
The temptations of Python data types
Python comes with a set of basic data types: tuples, lists, dictionaries, sets.
They are simple to understand and, when put together, provide a convenient way
to quickly build up complex data structures. There are some places in Nominatim
where I was lazy and used such nested data types instead of defining a proper data class.
These turned out to be a big hurdle for type annotations. Reverse engineering
the structures from the code is time consuming and the resulting annotations are
hard to read. In addition, the annotations have limited documentation value.
It is nice to know that a function returns a
Tuple[List[Tuple[int, str]], str, Optional[str]]. The type checker may
even verify whether the function was used correctly. But the programmer is none
the wiser about what all these lists and strings in the return value may contain
and has to go back to reading the code.
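The difference can be sketched as follows. All names here (`TokenInfo`, `AnalysisResult`, `analyse`, and the field names) are made up for illustration; the original text does not say what the nested structure in Nominatim actually holds.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


def analyse_nested(text: str) -> Tuple[List[Tuple[int, str]], str, Optional[str]]:
    # Correctly typed, but the reader has to guess what each position means.
    words = text.lower().split()
    return list(enumerate(words)), " ".join(words), None


@dataclass
class TokenInfo:
    token_id: int
    word: str


@dataclass
class AnalysisResult:
    tokens: List[TokenInfo]
    normalized: str
    postcode: Optional[str] = None


def analyse(text: str) -> AnalysisResult:
    # Same information, but every part of the result now has a name.
    words = text.lower().split()
    tokens = [TokenInfo(token_id=i, word=w) for i, w in enumerate(words)]
    return AnalysisResult(tokens=tokens, normalized=" ".join(words))


result = analyse("Main Street")
print(result.tokens[1].word, result.postcode)
```

With the data class, a caller writes `result.normalized` instead of `result[1]`, and the annotation `-> AnalysisResult` documents itself.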
The type annotations have pinpointed the parts where there is some future work to do. Getting rid of nested data structures will make the code more understandable for future maintainers and contributors alike.
The battle of the code checkers
One of the dangers of having many different code checkers is that they
disagree on how things should be done. This usually is less of a problem with Python
because it comes with a lot of official coding rules. If you stick to the
recommendations, then most code checkers will work right out of the box.
Type annotations seem to be a bit of an exception. They do come with
official recommendations on how to use them in code, but the recommended style is
different enough from that for normal code that our linting tool
pylint would often disagree
with the demands of mypy. In the end, it is a question of containing the
problematic cases. A new module, nominatim.typing,
mostly takes care of that.
Working with external libraries
Probably the trickiest part of type annotations was where external libraries are involved. While Python’s standard library is well annotated by now, there is very little typing information available for the other libraries Nominatim uses. In theory, this shouldn’t be a problem because mypy can work with a mixture of dynamically and statically typed code. In practice, the transition from unannotated library functions to our statically typed code occasionally requires rather ugly casting. That does not help with the goal of producing readable code. In the case of the psycopg2 library, it turned out to be easier to submit type annotations to typeshed than to litter Nominatim with casts. The PR was reviewed, merged and published in a matter of days. That’s the spirit of open source!
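To give an idea of what such a cast looks like, here is a small sketch. The function `fetch_rows` is a hypothetical stand-in for an unannotated library call and is not part of psycopg2; the query string and table name are placeholders too.

```python
from typing import cast


def fetch_rows(query):
    # Stand-in for an unannotated third-party function.
    # Without annotations, a type checker treats the result as Any.
    return [(3,)]


def count_rows() -> int:
    rows = fetch_rows("SELECT count(*) FROM some_table")
    # With strict settings such as mypy's --warn-return-any, returning
    # rows[0][0] directly is reported, because its type is Any.
    # cast() tells the checker to trust us. It does nothing at runtime.
    return cast(int, rows[0][0])


print(count_rows())
```

One cast is harmless, but repeated at every call into the library it adds noise without adding safety, since `cast` never checks the value. Annotations for the library itself remove the need for it.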