9 Jun 2021 · Software Engineering

    Building Python Projects at Scale with Pants

    Warm thanks 🤗 to Benjy Weinberger for reviewing this post. Listen to him talk about his contributions to Pants and his take on monorepos on the Semaphore Uncut episode.

    Web developers, data scientists, artificial intelligence and machine learning specialists all count Python as a valuable tool. Python has no rival as a general language. Yet, building and testing Python applications is a whole different can of worms. It’s a complex jigsaw composed of virtual environments, pip, pipenv, setuptools or poetry commands.

    This tutorial will show you a way out of the whole mess: we’ll learn how to use Pants v2, a comprehensive, user-friendly build system open-sourced by Twitter and Foursquare.

    What is Pants

    Pants is a fast, scalable build system for Python applications. It was designed with a specific goal: work at scale with large repositories and monorepos. A monorepo is a code repository with many separate but possibly interconnected applications and shared components.

    Pants takes a complex build process and breaks it up into small logical work units, which can be parallelized and cached. Its main features are:

    • Complete toolset: Pants takes care of all the little technical details such as creating virtual environments or downloading dependencies. Under the hood, it employs well-known Python tools like pip, setuptools, and pytest.
    • Test automation: Pants finds test files and runs them for you.
    • Speed and extensibility: the core is written in Rust for performance, while language-specific backends run on Python for extensibility.
    • Dependency inference: Pants understands the project’s structure and can determine dependencies by itself, without much boilerplate.
    • Built-in cache: work is split into small units, which are fingerprinted and cached. Pants resolves work from the cache whenever possible.
    • Isolation: each work unit runs in an isolated sandbox to prevent side effects.
    • Concurrency: work units run in parallel for greater speed.
    • Remote execution: when work exceeds the capacity of a single machine, Pants can distribute tasks in a cluster.

    Thanks to its parallel processing capabilities and cache, Pants is fast. It doesn’t waste time duplicating already-done work.
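
To make the idea of fingerprinted work units concrete, here is a minimal sketch in plain Python. It is not how Pants is implemented; the names `fingerprint`, `run_cached`, and `count_lines` are invented for this illustration, and the cache is a simple in-memory dictionary.

```python
import hashlib

# Hypothetical in-memory cache: fingerprint -> result of a work unit.
_cache = {}


def fingerprint(command, input_files):
    """Hash the command plus the name and content of every input file."""
    digest = hashlib.sha256(command.encode())
    for name in sorted(input_files):
        digest.update(name.encode())
        digest.update(input_files[name])
    return digest.hexdigest()


def run_cached(command, input_files, work):
    """Run `work` only if this exact command and inputs were never seen."""
    key = fingerprint(command, input_files)
    if key not in _cache:
        _cache[key] = work(input_files)
    return _cache[key]


calls = []  # records how many times the work really ran


def count_lines(files):
    calls.append(1)
    return sum(content.count(b"\n") for content in files.values())


sources = {"main.py": b"print('hi')\n"}
first = run_cached("count", sources, count_lines)
second = run_cached("count", sources, count_lines)  # served from the cache
edited = {"main.py": b"print('hi')\nprint('bye')\n"}
third = run_cached("count", edited, count_lines)    # inputs changed, so it reruns
print(first, second, third, len(calls))
```

The work function runs twice for three requests: the second request has the same fingerprint as the first, so its result comes straight from the cache.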

    Installing and configuring Pants

    Let’s take a brief look at how Pants is installed and configured. We’ll work with a demo project called semaphore-demo-python-pants. Feel free to fork it and clone it to follow this tutorial.

    The project is composed of two folders, one containing a “hello world” application and the other with some shared functions.

    [demo structure]
        │  
        ├── commons        [shared libraries]
        │   ├── __init__.py
        │   ├── math_utils.py
        │   ├── math_utils_test.py
        │   ├── string_utils.py
        │   └── string_utils_test.py
        │  
        ├── hello_world    [sample application]
        │   ├── __init__.py
        │   ├── main.py
        │   ├── requirements.txt
        │   └── test_main.py
        ├── LICENSE.md
        └── README.md

    After cloning the repository to your machine, create a new file called pants.toml with these lines:

    [GLOBAL]
    pants_version = "2.4.1"

    Download the Pants shell script. You can use curl for this:

    $ curl -L -o ./pants https://pantsbuild.github.io/setup/pants
    $ chmod +x ./pants

    The first time Pants runs, it will bootstrap itself and start a background process, pantsd.

    $ ./pants --version

    You might want to add the Pants script and the initial config file to version control. Anyone cloning the repository will be able to run ./pants --version and bootstrap the build system.

    $ git add pants pants.toml && git commit -m "add pants"

    Pants functionalities are implemented through backends. The very first backend you’ll want to enable is pants.backend.python, responsible for Python support. Append the following line into pants.toml:

    backend_packages = ["pants.backend.python"]

    Pants BUILD files

    The next step is to create BUILD files. These contain information about the project’s structure and outputs, which in Pants-lingo are called targets, and occasionally dependencies that can’t be inferred automatically. Targets may be internal libraries, executable binaries, releasable packages, or simply test outputs.

    To create the initial BUILD files, use pants tailor:

    $ ./pants tailor
    Created commons/BUILD:
      - Added python_library target commons
      - Added python_tests target commons:tests
    Created hello_world/BUILD:
      - Added python_library target hello_world
      - Added python_tests target hello_world:tests

    Pants scans folders recursively and creates BUILDs for every detected folder with Python code. Opening either of the build files reveals its basic structure:

    python_library()
    
    python_tests(
        name="tests",
    )

    Initial builds usually contain two directives:

    • python_library: marks the current folder as containing Python non-test code.
    • python_tests: marks the test files that Pants detected in the folder.
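
The folder scan can be approximated with a short script. This is only a sketch of the idea, not the tailor implementation: the function names are made up, and the test-file patterns (`test_*.py` and `*_test.py`) are an assumption based on the file names in the demo project.

```python
import fnmatch
import os
import tempfile

# Assumed test-file patterns; the demo's test files use both styles.
TEST_PATTERNS = ("test_*.py", "*_test.py")


def is_test_file(filename):
    return any(fnmatch.fnmatch(filename, p) for p in TEST_PATTERNS)


def suggest_targets(root):
    """Map each folder containing Python code to the targets it would need."""
    suggestions = {}
    for folder, _dirs, files in os.walk(root):
        python_files = [f for f in files if f.endswith(".py")]
        if not python_files:
            continue
        targets = []
        if any(not is_test_file(f) for f in python_files):
            targets.append("python_library")
        if any(is_test_file(f) for f in python_files):
            targets.append("python_tests")
        suggestions[os.path.relpath(folder, root)] = targets
    return suggestions


# Recreate part of the demo layout in a temporary folder.
with tempfile.TemporaryDirectory() as root:
    layout = {
        "commons": ["__init__.py", "math_utils.py", "math_utils_test.py"],
        "hello_world": ["__init__.py", "main.py", "test_main.py"],
    }
    for folder, names in layout.items():
        os.makedirs(os.path.join(root, folder))
        for name in names:
            open(os.path.join(root, folder, name), "w").close()
    result = suggest_targets(root)
    for folder in sorted(result):
        print(folder, result[folder])
```

Both folders contain regular modules and test modules, so each one gets a library target and a tests target, matching the tailor output shown earlier.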

    Testing with Pants

    Since Pants detected the tests, let’s try to run them:

    $ ./pants test commons/:
    ✓ commons/math_utils_test.py:tests succeeded.
    ✓ commons/string_utils_test.py:tests succeeded.

    The colon (:) acts as a wildcard, so essentially, the command says: “run all the tests in the commons/ folder.”

    Let’s try the tests in the hello_world folder:

    $ ./pants test hello_world/:
    == ERRORS ==
    __ ERROR collecting hello_world/test_main.py __
    ImportError while importing test module '/tmp/process-execution23AVPp/hello_world/test_main.py'.
    Hint: make sure your test modules/packages have valid Python names.
    E   ModuleNotFoundError: No module named 'ansicolors'
    == short test summary info ==

    Oops, something went wrong. The error is clear: a module is missing. A seasoned Python developer would install the missing package with pip install ansicolors or pip install -r hello_world/requirements.txt. But Pants does this for you. It can figure out missing dependencies and download them on the fly. To enable this feature, we must add the following line into hello_world/BUILD:

    python_requirements()

    This makes Pants parse requirements.txt and convert each module into a target. Now run the failed test command again. This time you should see that the missing module is downloaded as needed.
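
As a rough illustration of what "convert each module into a target" means, the sketch below turns requirement lines into addresses such as hello_world:ansicolors. The parsing is deliberately simplified, the function name is invented, and the version pin in the sample is hypothetical; the real requirements.txt may look different.

```python
import re


def requirement_targets(requirements_text, folder):
    """Return one target address per requirement line (simplified parsing)."""
    targets = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # The project name is everything before a version specifier or extra.
        name = re.split(r"[<>=!~\[; ]", line, maxsplit=1)[0]
        targets.append(f"{folder}:{name}")
    return targets


sample = """
# hypothetical contents of hello_world/requirements.txt
ansicolors>=1.1.8
"""
print(requirement_targets(sample, "hello_world"))
```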

    $ ./pants test hello_world/:
    ✓ hello_world/test_main.py:tests succeeded.

    You can run all tests in one go using a double colon (::), which makes Pants traverse all folders and subfolders recursively:

    $ ./pants test ::
    ✓ commons/math_utils_test.py:tests succeeded.
    ✓ commons/string_utils_test.py:tests succeeded.
    ✓ hello_world/test_main.py:tests succeeded.

    Keeping dependencies in check

    Pants maintains a graph of all dependencies in the repository. You can use pants dependencies to see what modules a particular file imports:

    $ ./pants dependencies hello_world/main.py
    commons/string_utils.py
    hello_world:ansicolors

    Here Pants has detected that main.py needs ansicolors plus one file in a sibling folder. As you can see, it can cross-reference the whole repository. This dramatically simplifies dependency management in monorepo setups.
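
One way to picture dependency inference is to parse a file's import statements and look each module up in a table of known owners. The sketch below does this with the standard ast module. It is an approximation, not the Pants algorithm: the sample source is a hypothetical stand-in for hello_world/main.py, the imported names `green` and `capitalize` are assumptions, and both lookup tables are written by hand.

```python
import ast


def imported_modules(source):
    """Collect the dotted module names that a Python source file imports."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return modules


def infer_dependencies(source, first_party, third_party):
    """Map imported modules to repository files or requirement targets."""
    deps = set()
    for module in imported_modules(source):
        top_level = module.split(".")[0]
        if module in first_party:
            deps.add(first_party[module])
        elif top_level in third_party:
            deps.add(third_party[top_level])
    return sorted(deps)


# Hypothetical stand-in for hello_world/main.py; the real file may differ.
main_py = """
from colors import green
from commons.string_utils import capitalize
"""
first_party = {"commons.string_utils": "commons/string_utils.py"}
third_party = {"colors": "hello_world:ansicolors"}  # ansicolors provides `colors`
print(infer_dependencies(main_py, first_party, third_party))
```

The result lists the same two dependencies that the pants dependencies command reported above.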

    To see the reverse dependencies, that is, the parts of the repository that depend on a specific file, use pants dependees:

    $ ./pants dependees commons/string_utils.py
    commons
    commons:tests
    commons/string_utils_test.py:tests
    hello_world
    hello_world/main.py
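
Conceptually, a reverse lookup is just a scan of the dependency graph for every node that lists the file in question. Here is a minimal sketch with a hand-written graph; the edges are hypothetical and only loosely based on the demo project.

```python
def direct_dependees(dependencies, target):
    """Return every node that lists `target` among its dependencies."""
    return sorted(node for node, deps in dependencies.items() if target in deps)


# Hypothetical forward graph: file -> the things it depends on.
dependencies = {
    "hello_world/main.py": {"commons/string_utils.py", "hello_world:ansicolors"},
    "hello_world/test_main.py": {"hello_world/main.py"},
    "commons/string_utils_test.py": {"commons/string_utils.py"},
    "commons/string_utils.py": set(),
}
print(direct_dependees(dependencies, "commons/string_utils.py"))
```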

    Building binaries

    Pants builds standalone executables using the PEX library. To try it, add this to hello_world/BUILD:

    pex_binary(
        name="pex_binary",
        entry_point="main.py",
    )

    PEX files are specially-crafted compressed files containing the application and its dependencies. The pants run command builds and runs the program:

    $ ./pants run hello_world/:
    Hi world

    You can even make a distributable file with pants package. Pants also lets you create packages in traditional formats like wheels or sdists.

    $ ./pants package hello_world/:
    Wrote dist/hello_world/pex_binary.pex

    You should have a new folder and file inside dist. Try running the hello world program from there:

    $ ./dist/hello_world/pex_binary.pex
    Hi world
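
If the idea of an executable archive is new to you, Python's standard library has a simpler relative in the zipapp module. The sketch below is only an analogy and does not use PEX or Pants: it bundles a one-line program into a single file and runs it. The folder and file names are made up for the example.

```python
import os
import subprocess
import sys
import tempfile
import zipapp

with tempfile.TemporaryDirectory() as workdir:
    # A tiny application folder with the entry point that zipapp expects.
    app_dir = os.path.join(workdir, "app")
    os.makedirs(app_dir)
    with open(os.path.join(app_dir, "__main__.py"), "w") as handle:
        handle.write("print('Hi world')\n")

    # Bundle the folder into a single archive file.
    archive = os.path.join(workdir, "hello.pyz")
    zipapp.create_archive(app_dir, archive)

    # The interpreter can run the archive directly.
    output = subprocess.run(
        [sys.executable, archive], capture_output=True, text=True, check=True
    ).stdout
    print(output.strip())
```

Unlike this toy archive, the file that Pants produced above also carries the application's third-party dependencies.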

    Analyzing the code

    Because Pants has a complete understanding of all the code in the repository, it can analyze things at a much deeper level. Pants comes with out-of-the-box support for several static analysis tools:

    • black: auto-formats Python files in place.
    • docformatter: auto-formats docstrings.
    • flake8: a linter that enforces Python style guides.
    • pylint: another linter for Python.
    • mypy: adds static type checks to Python code. The code must include type hints for it to work.
    • isort: sorts imports alphabetically.
    • bandit: finds common security issues in Python code.

    Enabling them is just a matter of adding backends into pants.toml:

    backend_packages = [
      "pants.backend.python",
      "pants.backend.python.lint.black",
      "pants.backend.python.lint.docformatter",
      "pants.backend.python.lint.flake8",
      "pants.backend.python.lint.pylint",
      "pants.backend.python.typecheck.mypy",
    ]

    With pants lint we can run all linters simultaneously:

    $ ./pants lint ::
    --------------------------------------------------------------------
    Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)
    
    ✓ Black succeeded.
    ✓ Docformatter succeeded.
    ✓ Flake8 succeeded.
    ✓ Pylint succeeded.

    We also have a typechecker:

    $ ./pants typecheck ::
    Success: no issues found in 8 source files
    
    ✓ MyPy succeeded.

    And a code formatter to fix style deviations:

    $ ./pants fmt ::

    You can even run multiple actions at once. Pants will figure out the correct order:

    $ ./pants fmt lint typecheck ::

    Check the updated files into the repository as we’ll need them in the next section when we configure continuous integration with Semaphore.

    $ git add pants.toml commons/BUILD hello_world/BUILD
    $ git commit -m "add pants build config"
    $ git push origin master

    Building with Pants on Semaphore

    Of course, the tutorial wouldn’t be complete if we didn’t show you how to build Pants projects with Semaphore CI/CD.

    To begin, create a new project and add your repository to Semaphore.

    The next page lets you grant permissions to other users in your organization. Continuing, select single job and then customize.

    Build stage

    The first job will kick off the build stage, where we’ll build binary packages to check that the projects are buildable.

    The build job will need to complete the following steps:

    1. Clone the Git repository. Semaphore provides the built-in checkout command for this.
    2. Ensure we’re using the correct Python version with sem-version.
    3. Restore the Pants cache from previous builds, recognizing that it will be empty the first time the job runs.
    4. Run pants package to build the PEX binaries.
    5. Save the cache, so successive jobs run faster.

    We’ll achieve the steps by adding the following commands to the job:

    checkout
    sem-version python 3.8
    cache restore pants-$SEMAPHORE_GIT_BRANCH,pants-master
    ./pants package ::
    cache store pants-$SEMAPHORE_GIT_BRANCH,pants-master $HOME/.cache/pants

    Pants’ intensive use of cache requires us to make some strategic decisions. In the job we call cache twice. The first time, it looks up cached files by branch, defaulting to master. If your repository default branch is different, adjust as needed.

    The cache tool takes a comma-separated list of keys to look up cached files. It will always return the first match found.

    The pattern is repeated once the build is done and we save the cache. This maximizes the chances that a job will find the freshest cache for the active branch.
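
The first-match lookup can be modeled in a few lines. This sketch is not the Semaphore cache tool; the `restore` function and the dictionary standing in for the cache storage are illustrative only.

```python
def restore(stored, key_spec):
    """Return the contents of the first listed key that exists, or None."""
    for key in key_spec.split(","):
        if key in stored:
            return stored[key]
    return None


# Hypothetical cache state: only the default branch has been built so far.
stored = {"pants-master": "artifacts from master"}
print(restore(stored, "pants-feature-x,pants-master"))

# Once the feature branch has stored its own cache, that entry wins.
stored["pants-feature-x"] = "artifacts from feature-x"
print(restore(stored, "pants-feature-x,pants-master"))
```

A brand-new branch starts from the default branch's cache and switches to its own as soon as one exists.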

    Static tests

    The next block will run some static tests to ensure the code is Pythonic enough for a release.

    Create a new block with + add block.

    Scroll down the right pane until you find the prologue. Add the following commands, which may sound familiar from the build job:

    checkout
    sem-version python 3.8
    cache restore pants-$SEMAPHORE_GIT_BRANCH,pants-master

    Commands in the prologue always run before any job in the block, ensuring the cache is retrieved.

    Open the epilogue and type this line to save the cache after the job runs:

    cache store pants-$SEMAPHORE_GIT_BRANCH,pants-master $HOME/.cache/pants

    Type the following commands in the job. These run the linter and the typecheck analyzer.

    ./pants lint ::
    ./pants typecheck ::

    Unit tests

    We’ll add one more block to run the unit tests. It will be similar to the last one as the prologue and epilogue are the same. The only difference is that the job contains a test command:

    ./pants test ::

    This concludes the basic CI pipeline setup. We’ll see strategies for optimizing large repositories in a bit.

    Running the pipeline

    Ready to try out the pipeline? Click on run the workflow > looks good, start.

    The first run should take the longest because the caches are still empty. From now on, every time code is committed into the repository, the pipeline will run automatically.

    Monorepo optimizations

    Benjy Weinberger, one of the lead developers together with John Sirois, talked about Pants origins and design goals in the Semaphore Uncut podcast.

    “In Pants V2, everything is modeled as a workflow. Your build is broken down into a very large number of small steps. And that gives you very fine-grained invalidation. You know all the inputs and you can fingerprint them. You can run concurrently because every piece of work is fully encapsulated and you know all the dependencies.”

    The Pants team has put a lot of thought into making sure it scales well with monorepos. When asked about the challenges and opportunities a monorepo offers, Benjy said: “the problem about scaling your code base comes with analyzing changes and how they propagate through your dependencies. You will have this problem in a monorepo and with multiple repos. Why I come down on the side of monorepos is that at least they make the problem explicit. Your first party dependencies all live in the repo with you. You have visibility into everything that depends on you and when you make a change, you have the opportunity to propagate the ripple effects of that change correctly in your repo.”

    A CI-optimized config

    By default, Pants has some developer-friendly features such as scrolling text and verbose messages that we don’t need in the CI environment. You can disable these features as well as limit the number of concurrent processes by adding the following lines in pants.toml:

    process_execution_local_parallelism = 2
    dynamic_ui = false
    colors = true

    You should ensure that the parallelism number matches the number of CPU cores in the CI machine. The setting above works well on Semaphore’s default two-core machine. If you’re using a more powerful alternative, you should increase the number to take full advantage of its resources.

    You can also check a second, CI-only Pants config file (for example, pants.ci.toml) into the repository and activate it in the pipeline by setting the environment variable PANTS_CONFIG_FILES=pants.ci.toml.
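
If you prefer not to hard-code the parallelism value, a small script can render the CI config for whatever machine it runs on. This is a sketch under the assumption that the three options live in the [GLOBAL] section; the `ci_config` helper is hypothetical.

```python
import os


def ci_config(cores):
    """Render a CI-only Pants config sized to the given number of cores."""
    return (
        "[GLOBAL]\n"
        f"process_execution_local_parallelism = {cores}\n"
        "dynamic_ui = false\n"
        "colors = true\n"
    )


# os.cpu_count() can return None, so fall back to a single process.
cores = os.cpu_count() or 1
print(ci_config(cores))
```

The printed text could be saved as pants.ci.toml at the start of a job, before any Pants command runs.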

    Parallel jobs

    Running parallel jobs in Semaphore is easy. Simply add more jobs to a block. The pipeline will take care of the rest.

    With Pants, however, parallel jobs can interfere with cache serialization, as we can’t be sure which job will end last, overwriting the cache. If you wish to try this feature, you should consider using different keys for each job, for example:

    • Lint job: pants-lint-$SEMAPHORE_GIT_BRANCH,pants-lint-master
    • Typecheck job: pants-typecheck-$SEMAPHORE_GIT_BRANCH,pants-typecheck-master
    • Format job: pants-fmt-$SEMAPHORE_GIT_BRANCH,pants-fmt-master

    The caveat here is that you may more quickly use up all the available cache. Semaphore will automatically delete older files when this happens.
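
The naming scheme above is easy to generate. In this sketch, `job_cache_keys` is a hypothetical helper and `feature-x` is a placeholder for the value of $SEMAPHORE_GIT_BRANCH.

```python
def job_cache_keys(job, branch, default_branch="master"):
    """Build a per-job key list: current branch first, then the default branch."""
    return f"pants-{job}-{branch},pants-{job}-{default_branch}"


for job in ("lint", "typecheck", "fmt"):
    print(job_cache_keys(job, "feature-x"))
```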

    Pants change detection

    When caching is not enough, we can enable change detection. Since Pants understands how Git works, it can detect past changes and skip running tests on unchanged parts. Change detection is a boon for developers testing big repositories on their local machines.

    The feature is enabled with the --changed-since option. For instance, you can use this to run tests on all code that changed in a branch since veering off master/main:

    ./pants test --changed-since=origin/master

    Change detection scans the Git repository and automatically selects the files, so we don’t need to add any paths to the command. You can use it to filter and select new or modified code.

    If you want to run tests when dependencies change, you might want to add --changed-dependees.

    ./pants test --changed-since=origin/master --changed-dependees=transitive
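
To see what the transitive setting adds, consider this sketch. It is a simplified model rather than the Pants implementation: the `affected` function and the small dependency graph are hypothetical. It walks the reverse edges of the graph starting from the changed files.

```python
from collections import deque


def affected(dependencies, changed, transitive=True):
    """Return the changed files plus everything that depends on them."""
    # Invert the graph: file -> files that import it.
    dependees = {}
    for node, deps in dependencies.items():
        for dep in deps:
            dependees.setdefault(dep, set()).add(node)

    result = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for parent in dependees.get(current, ()):
            if parent not in result:
                result.add(parent)
                if transitive:
                    queue.append(parent)  # keep walking up the graph
    return sorted(result)


# Hypothetical graph: file -> the files it imports.
dependencies = {
    "hello_world/test_main.py": {"hello_world/main.py"},
    "hello_world/main.py": {"commons/string_utils.py"},
    "commons/string_utils_test.py": {"commons/string_utils.py"},
}
changed = ["commons/string_utils.py"]
print(affected(dependencies, changed, transitive=False))
print(affected(dependencies, changed, transitive=True))
```

With direct dependees only, hello_world/test_main.py is left out because it imports main.py rather than the changed file itself. The transitive walk picks it up.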

    If you want to test only the changes made since the last commit, that is, your uncommitted work:

    ./pants test --changed-since=HEAD --changed-dependees=transitive

    Bear in mind that change detection won’t work in the CI environment unless you change the checkout command. Semaphore makes a shallow clone by default, downloading only the latest commits, but Pants needs the entire history. To get the complete repository, do a full checkout with: checkout --use-cache.

    Semaphore change detection

    Semaphore includes support for monorepo workflows via the change_in function. You can use it to compute changes without cloning the full repository or using extra parameters in Pants. Often, this will be a better alternative than using the Pants built-in change detection.

    To enable change detection in the pipeline, you’ll have to rework the pipeline around subprojects. Create a block for each application folder or shared library. For the case of our demo project, you can create two independent CI paths in the pipeline, one for commons and another for hello_world. The main difference is that instead of using ::, we use the path to the folder. For example:

    • ./pants lint commons/:
    • ./pants lint hello_world/:

    You can configure multiple parallel paths in a pipeline by tweaking each block’s dependencies.

    You should end with something that looks more or less like this:

    Next, we need to tell Semaphore what conditions will trigger the execution of each block. Open the run/skip conditions section and select “run this block when conditions are met.”

    Type the following conditions:

    change_in('/hello_world/')

    This says: “run the jobs in the block when any file inside the hello_world folder changes.” Apply the same changes for all the blocks related to the hello_world folder.

    To complete the setup, repeat the same procedure for the code in commons (the lower row of blocks in the picture). The condition in this case is:

    change_in('/commons/')

    Note that if your repository default branch is NOT master, you’ll need to add an extra parameter:

    change_in('/commons/', { default_branch: 'BRANCH_NAME'})

    Once done, every block should have a matching condition.
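
As a mental model for these conditions, think of a prefix check against the list of changed files. The sketch below is a heavy simplification and not Semaphore's implementation: the real function works out the changed files from Git and supports more options, and `folder_changed` is a made-up name.

```python
def folder_changed(pattern, changed_files):
    """Simplified check: did any changed file fall under the given folder?"""
    prefix = pattern.lstrip("/")  # '/commons/' -> 'commons/'
    return any(path.startswith(prefix) for path in changed_files)


# Hypothetical list of files touched by a push.
changed_files = ["commons/math_utils.py", "README.md"]
print(folder_changed("/commons/", changed_files))
print(folder_changed("/hello_world/", changed_files))
```

In this example only the blocks guarded by the commons condition would run.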

    What happens next? When we change one of the applications, only the relevant blocks will run. The rest will be skipped, saving time and money.

    There is one caveat, though: when jobs run in parallel, we might lose some cache information. To mitigate this, experiment with different cache keys for every application or library.

    For more details on Semaphore change detection, check out the change_in reference page.

    Next Steps

    When it’s time to deploy, you can add a continuous deployment pipeline. To do this, edit the pipeline using the edit workflow button and click on + add a promotion:

    If you’re working with AWS Lambda, you’re in luck as Pants comes with AWS integrations out of the box.

    For other destinations, check these resources:

    Final words

    As applications grow in complexity, the need for a user-friendly and no-nonsense build system increases. Pants offers an easy-to-use experience that works on repositories of all sizes.

    Interested in monorepos? Semaphore is the only CI/CD platform with native support for monorepos. Check these tutorials to learn more:

    Written by:
    I picked up most of my skills during the years I worked at IBM. Was a DBA, developer, and cloud engineer for a time. After that, I went into freelancing, where I found the passion for writing. Now, I'm a full-time writer at Semaphore.