Packaging

Marc Garcia edited this page Feb 25, 2019 · 3 revisions

In the open source world, most packages can be installed freely, with no payment or sign-up required, so it is common to have tools that make it easy for users to install any package on their system.

The best-known examples are Linux distributions, where a package can be installed by typing something like dnf install <some-package> or apt-get install <some-package> in the terminal. Graphical package managers are also common.

The Python community has developed several packaging systems over the years; the most popular one today is pip. Python is a multiplatform programming language, and its package ecosystem is huge. For these reasons it was helpful for Python to have its own packaging system, instead of packaging separately for every Linux distribution and for proprietary operating systems like macOS (brew) or Windows.

This does not mean that Python packages can't be provided as .rpm, .deb, .msi... for those platforms. Popular projects usually exist as packages for those systems. But for smaller projects, it's usually enough to have a package in pip.

While pip has served the Python community well in recent years, it has a major limitation: it is a packaging system for Python packages only. This means that if our Python project depends on a non-Python library, pip can only package the Python code, not the non-Python dependency. Take Pillow, the most common Python package for image manipulation. It can depend on libraries like libjpeg to perform operations on .jpeg files. In these cases, pip will work as expected if libjpeg is available on the system, but it will simply fail if it isn't.
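The situation above can be illustrated with a small check. The sketch below is not how pip or Pillow detect libjpeg; it only uses the standard library function ctypes.util.find_library to ask whether a shared library is visible on the system. The helper names find_system_library and describe are hypothetical.

```python
from ctypes.util import find_library


def find_system_library(name):
    """Return the name or path of a shared library, or None if it is not found."""
    return find_library(name)


def describe(name):
    """Summarize whether a system library such as libjpeg appears to be installed."""
    location = find_system_library(name)
    if location is None:
        # pip has no way to install this kind of dependency for us.
        return f"lib{name}: not found on this system"
    return f"lib{name}: found as {location}"


if __name__ == "__main__":
    # "jpeg" corresponds to libjpeg, the example used in the text.
    print(describe("jpeg"))
```

The result depends on the machine: the same Python package can install cleanly on one system and fail on another, purely because of what is installed outside of Python.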

While most of the Python ecosystem is not strongly affected by this, in the data world most projects have non-Python dependencies (think of numpy, scipy, pandas...). For this reason, the PyData community has been moving from pip to Anaconda, whose package manager, conda, was designed with this problem in mind.

In practice, this means that if we are developing a Python project related to data engineering, data science or similar, and we want users to be able to install it easily, we should provide packages for both pip and Anaconda.

pip

There are two main things to do to make a package available for pip:

  • Create a setup.py file for our project
  • Upload the package to PyPI.

The file setup.py is just a regular Python file that should call setuptools.setup() to "register" the information of the Python package. While this is very powerful (as you can run any Python code), it also adds complexity to the packaging system.
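As an illustration, a minimal setup.py could look like the sketch below. The project name, version, description and dependency are hypothetical placeholders, and a real project would normally provide more metadata (author, license, long description...).

```python
# setup.py (all names and values below are hypothetical placeholders)
from setuptools import find_packages, setup

setup(
    name="example-package",        # the name users would pass to pip install
    version="0.1.0",
    description="A short, one-line summary of the project",
    packages=find_packages(),      # discover the Python packages next to this file
    install_requires=["numpy"],    # Python dependencies that pip will resolve
    python_requires=">=3.6",
)
```

Since the file is regular Python, values such as the version could also be computed or read from another file before calling setup(), which is where both the power and the extra complexity come from.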

Once this file is written, it can be run (with certain arguments) to generate a package in one of the formats supported by pip (think of a zip file with all the required files of our project).

This package can then be uploaded to the Python Package Index (PyPI, also known as the Cheese Shop), a central repository of Python packages, where pip looks for packages to install by default. An account needs to be created to upload a package to PyPI.
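The build and upload steps can be sketched in Python as follows. This assumes the commands are run for a project that contains a setup.py, that the wheel package is installed (needed for bdist_wheel), and that twine is used for the upload; twine is not mentioned above and is an assumption of this sketch, as are the function names.

```python
import glob
import os
import subprocess
import sys


def build_command(formats=("sdist", "bdist_wheel")):
    """Command line that asks setup.py to build the given distribution formats."""
    return [sys.executable, "setup.py", *formats]


def upload_command(dist_dir="dist"):
    """Command line that uploads every file in dist_dir with twine (assumed installed)."""
    files = sorted(glob.glob(os.path.join(dist_dir, "*")))
    return [sys.executable, "-m", "twine", "upload", *files]


def build_and_upload(project_dir="."):
    """Build the distributions, then upload them; raises CalledProcessError on failure."""
    # setup.py writes the generated files to a dist directory inside the project.
    subprocess.run(build_command(), cwd=project_dir, check=True)
    subprocess.run(upload_command(os.path.join(project_dir, "dist")), check=True)
```

Calling build_and_upload() from the project root would build a source distribution and a wheel, and then ask for the PyPI credentials during the upload.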

There are several tutorials online on how the whole process works, for example:

Anaconda

Anaconda makes it easy to install dependencies by using conda install <some-package>. This downloads a package from the Anaconda Inc repository and installs it locally. Anaconda Inc, the company that developed and maintains conda, provides professional support for some of the PyData packages, which are included in its distribution.

To let users install packages with conda beyond those the company supports, conda has the concept of a channel, which is a repository of packages. People can create their own channels and publish their packages there, but conda-forge was created as a centralized place for conda packages not officially supported by Anaconda. To install packages from conda-forge, one can simply run conda install -c conda-forge <some-package>. Alternatively, conda-forge can be added to the conda settings, so it is always checked when looking for packages.
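The two ways of using a channel can be written down as the command lines they correspond to. The sketch below only builds the argument lists; the function names are hypothetical, and it assumes conda is installed and on the PATH if the commands are actually executed.

```python
def conda_install_command(package, channel=None):
    """Arguments for installing a package, optionally from a specific channel."""
    cmd = ["conda", "install"]
    if channel is not None:
        # -c selects the channel for this single command only.
        cmd += ["-c", channel]
    return cmd + [package]


def add_channel_command(channel):
    """Arguments for adding a channel to the conda settings permanently."""
    return ["conda", "config", "--add", "channels", channel]
```

For example, conda_install_command("pandas", channel="conda-forge") corresponds to conda install -c conda-forge pandas, and either list can be passed to subprocess.run to execute it.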

In this case, conda-forge seems like a reasonable place to upload our package, but a custom channel can also be used.

Documentation on how to write a conda recipe and upload a package to conda-forge can be found here:
