
On the initiatory journey from notebook Data scientist to actual Data science / Machine learning engineer, one discovers the joys of managing Python environments for developing and deploying a service.
If one has had the chance to interact with a seasoned JavaScript backend software engineer further along the road, one quickly becomes convinced that there is something clumsy about how Python virtual environments and packaging tools integrate.
To this day, there is still no one-and-only way to manage Python development environments and package dependencies.
This indecision goes unnoticed by most data scientists: as mere library end users, a data scientist's knowledge of Python dependency management need not extend beyond the usual routine of creating a new conda environment, installing the packages you need, and working on your model.
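For reference, that routine typically looks like this (environment name and packages are illustrative):

```bash
# The usual data-science routine: one throwaway environment per project
conda create --name my-model python=3.11
conda activate my-model
conda install pandas scikit-learn
```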
But when it comes to building an actual Data or ML application in Python, setting up a proper development workflow is crucial to ensure frictionless collaboration on source code, uncompromising code quality checks (formatting, linting, testing), and reliable dependency resolution for deployment.
Virtual environments and package managers
Every Python practitioner is already familiar with virtual environments and package managers.
Virtual environments create isolated Python installations, keeping dependencies required by different projects in separate environments.
Virtual environments allow us to:
- bundle only the requirements our project needs, at a specific location on the file system.
- avoid conflicts between incompatible dependencies required by different projects.
A package manager is used to install, update, and remove Python packages in our virtual environments.
The Python ecosystem has all these tools readily available and documented: they are called venv, pip, and the related community-contributed constellation of packages such as pyenv, pipenv, or pip-tools.
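As a minimal sketch of the standard-library route (directory and package names are illustrative):

```bash
# Create and use an isolated environment with only the standard tooling
python -m venv .venv        # create the environment in ./.venv
source .venv/bin/activate   # activate it (POSIX shells)
pip install requests        # installs into .venv only, not system-wide
```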
The dominance of the Anaconda distribution targeting data scientists has obscured the complex interplay of components in this ecosystem, by providing a unique framework that lets library users set up virtual environments and manage Python versions and dependencies under a unified conda command line.
There is no doubt that the conda ecosystem has simplified the packaging and delivery of scientific computing packages written in Python for end-users1.
The remaining challenge is to define common ground for setting up reproducible and flexible Python environments targeted at the developer experience.
Minimal properties of a good Python application development environment manager
As most Python application deployments in the industry nowadays involve containerization, distributing a project has become less of an issue: to deploy an application, it is safe and easy to simply copy the entire source tree into an isolated container and install from source.
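In a container build, this often boils down to a couple of lines (paths are illustrative):

```bash
# Inside the image build, after the source tree has been copied to /app
cd /app
pip install .   # build and install the project directly from its sources
```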
From a software engineering point of view, what is thus left to define is a framework providing methods to accomplish three goals:
- Clearly define the production and development requirements of a project.
- Provide convenience methods to pin and update package dependencies.
- Integrate seamlessly with standard tools for code quality check and CI/CD best practices.
We will explain the relevance of each of these points in a bit more detail.
1) Distinguishing between production and development environments
Not all packages used during the development of a project are required at runtime: that includes the various utilities for quality checks and testing, or for generating documentation. The smaller our container image, the better, so any unnecessary executable should be left out of our production image.
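With plain pip, this split is commonly expressed as two requirements files (the file names are a convention, not a rule):

```bash
# Developer machine: runtime dependencies plus development tooling
pip install -r requirements.txt -r requirements-dev.txt

# Production image: runtime dependencies only
pip install -r requirements.txt
```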
2) Managing project dependencies in a deterministic yet flexible manner
When developing a project, every developer should work in the exact same environment, to ensure new features do not break code in an intractable way. Breaking things is OK, but understanding why a test fails requires keeping everything else equal relative to a stable reference environment.
The way to achieve such a reproducible environment is to freeze the dependencies at a given point in time, by pinning them in a .lock file. This file gets committed to the version control system so that developers pulling the project can replicate the exact same environment on their machine. To keep up with the latest developments in the field, the list of frozen dependencies should be refreshed regularly.
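As a sketch of this pinning workflow with pip-tools, one of the utilities mentioned above (the file names follow the tool's conventions):

```bash
# requirements.in holds loose, human-edited constraints, e.g. "requests>=2.28"
pip-compile requirements.in -o requirements.txt   # resolve and pin exact versions
pip-sync requirements.txt                         # make the local env match the pins

# Refresh the pins regularly to pick up upstream releases
pip-compile --upgrade requirements.in -o requirements.txt
```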
3) Enforcing code quality
There are two ways to enforce code quality on a project: peer review, and automated tools.
The first aspect has a lot to do with the culture and standard set by the team working on the project.
The second aspect is much more straightforward to enforce, as it boils down to setting up an array of tools for automated code formatting, linting, and static typing. These will force every commit to follow the same style, avoid suspicious constructions, and detect bugs and error-prone function definitions early.
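Using the tools this article settles on below (Black, Ruff, mypy), a local quality gate might look like this (the src/ layout is an assumption):

```bash
black --check .   # formatting: fail if any file would be reformatted
ruff check .      # linting: flag suspicious constructions
mypy src/         # static typing: catch type errors before runtime
```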
Setting up a reproducible Python development environment
The way to set up this workflow using generic Python tools has been to:
- Create a Python virtual environment with conda or a combination of venv and pyenv.
- Define direct production dependencies in a requirements.txt file and development dependencies in a requirements-dev.txt file. This logic can be extended to whatever optional group of dependencies is needed, with a requirements-group.txt file, and so forth.
- Install all the required dependencies in the local development environment.
- Set up all the configuration files for standard code quality checks (Black, Ruff, mypy).
- Before the first commit, pin these dependencies in a requirements-freeze.txt, requirements-dev-freeze.txt, etc., using the pip freeze command.
- Leverage these various configuration files and tools in the downstream CI/CD pipeline.
- Push everything to a remote code repository.
The exact recipe can vary, and may take advantage of various convenience utilities like pip-tools or pipenv.
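Condensed into a shell session, that recipe might look like this (versions and file names are illustrative):

```bash
pyenv local 3.11                                   # pin the interpreter version
python -m venv .venv && source .venv/bin/activate  # create and enter the environment
pip install -r requirements.txt -r requirements-dev.txt
pip freeze > requirements-freeze.txt               # pin the resolved versions
git add requirements*.txt .python-version && git commit -m "Pin dependencies"
```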
Wrapping it up with Poetry
Poetry is a framework designed to accomplish all these tasks via a neat, concise API.
Compared to the alternatives, Poetry's standout features are:
- An exclusive reliance on the pyproject.toml specification file. The pyproject.toml file is the new norm2 for specifying Python project builds. It also acts as a unique metadata and configuration hub supported by most standard development tools3.
- Deterministic dependency resolution: before installing or updating any library, Poetry fetches the entire dependency requirements tree, so any dependency conflict that is discovered causes the installation process to stop. This ensures that there are no dependency conflicts within the project that could lead to bugs at runtime.
- A smart dependency specification scheme that lets you easily "loosely" constrain package versions4.
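For flavor, a minimal Poetry session covering the same ground as the generic recipe above might look like this (assuming Poetry ≥ 1.2; package names are illustrative):

```bash
poetry init                        # interactively generate pyproject.toml
poetry add "pandas@^2.0"           # caret constraint (>=2.0,<3.0), pinned in poetry.lock
poetry add --group dev ruff mypy   # development-only dependency group
poetry install --without dev       # reproduce the runtime environment only
```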
A Python Poetry bootstrap template
I have assembled a project-agnostic GitHub template for setting up a Poetry-managed Python development environment. The template ships with the current standard tools for code quality and CI/CD best practices.
Will Poetry end the long churn of Python dependency management tools5 and eventually impose itself as the no-brainer solution? Poetry has been around for a while now, and despite one serious caveat6, its straight-to-the-point features and neat API bring sunshine to an otherwise quite boring7 task when developing in Python.