Parsl Accelerated QMCPy Notebook Tests

Introduction

Notebook regression testing ensures that interactive examples and analyses remain correct and reproducible, catching regressions introduced by changes in code, dependencies, or execution environments. For QMCPy [1], this process is resource-intensive, owing to the number and complexity of its notebooks, but it is also embarrassingly parallel.

This blog post summarizes our work on accelerating notebook regression testing (using Testbook-based tests, which can be viewed as notebook-level unit tests) as presented in our ParslFest 2025 talk [2], and outlines directions for further development.

The slides accompanying the presentation are available at Parsl Testbook Speedup.

Methodology

Our choice to adopt Testbook [3] is motivated by its ability to execute Jupyter notebooks directly within a test environment, enabling fine-grained validation of both code cells and notebook state. Testbook also integrates cleanly with our existing testing directory structure, where other unit tests are organized without requiring full notebook execution. This preserves modularity, simplifies debugging, and avoids unnecessary duplication of logic.

To support scalable notebook testing, we developed a lightweight yet flexible test harness that enables Parsl [4] to orchestrate Testbook-based unit tests. By treating each notebook test as an independent Parsl app, the harness realizes an embarrassingly parallel workflow suitable for local multiprocessing, HPC schedulers, or cloud environments.
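The fan-out pattern at the heart of the harness can be sketched with the standard library standing in for a Parsl executor (a sketch only: the file names and the pytest invocation are illustrative, and in the harness itself each call is wrapped as a Parsl app rather than submitted to a thread pool):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_booktest(test_file, dry_run=False):
    """Run one generated Testbook test file via pytest; return (file, exit code)."""
    cmd = ["pytest", "-q", test_file]
    if dry_run:  # exercise the fan-out logic without requiring pytest or the files
        return (test_file, 0)
    return (test_file, subprocess.run(cmd).returncode)

def run_all(test_files, max_workers=4, dry_run=False):
    """Each test file is independent, so the workload is embarrassingly parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_booktest, f, dry_run) for f in test_files]
        return [fut.result() for fut in futures]
```

Because each notebook test shares no state with the others, swapping the thread pool for a Parsl executor changes only where the tasks run, not the structure of the workflow.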

The harness coordinates three primary components to achieve reproducible, high-throughput notebook testing:

  • Continuous Integration (CI): A GitHub Actions workflow prepares the execution environment (Conda environment creation, minimal LaTeX installation, optional swap configuration), installs project dependencies, and triggers the appropriate test targets (e.g., make booktests_parallel_no_docker). This ensures consistent, version-controlled execution across platforms.
  • Parsl controller and workers: Parsl provisions local or remote executors—processes, threads, or cluster jobs—and schedules notebook tests as independent tasks. This enables parallel execution with configurable concurrency limits, resource profiles, and executor backends.
  • Testbook runner and artifact collection: Each worker executes its assigned notebook tests through Testbook. Outputs, execution logs, error traces, generated figures, and notebook artifacts with executed cells are returned to the Parsl controller and uploaded by CI for inspection, provenance tracking, and debugging.

Key features of the harness include pinned Conda environment specifications for reproducibility, customizable Parsl executors (local, HPC, or cloud), timeout and retry policies for handling flaky or long-running tests, and centralized logging to streamline diagnosis of failures. Together, these components provide a robust framework for scalable, automated validation of computational notebooks.

Results

To establish a performance baseline, we first measured the wall-clock time required to execute a representative subset of demo notebooks sequentially. After extending test coverage to include syntax-validation checks and additional notebooks, we repeated the experiment under the parallel Testbook–Parsl workflow. Across these configurations, we observed a consistent 3.0-fold speedup, demonstrating that notebook-based tests parallelize cleanly and benefit substantially from concurrent execution. The overall trend is illustrated in Figure 1.
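For concreteness, the speedup and the corresponding parallel efficiency are computed as follows (the timings here are hypothetical stand-ins; the measured values are summarized in Figure 1):

```python
# Hypothetical wall-clock times in seconds; substitute measured values.
t_sequential = 1800.0  # sequential baseline
t_parallel = 600.0     # parallel Testbook-Parsl workflow

speedup = t_sequential / t_parallel  # 3.0-fold
efficiency = speedup / 4             # e.g., with 4 workers -> 0.75
```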

Figure 1: Speedup achieved by running Testbook-based notebook tests under Parsl with various numbers of workers, compared with sequential execution.

All tests were executed on a Linux system (AMD64 architecture) with 16 CPU cores. When run in continuous integration, the workflow executes the same test suite on the GitHub Actions ubuntu-latest runner. Users may reproduce the parallel Testbook workflow locally by running:

    make testbook
    

Table 1 summarizes the scope of the tested notebooks and their corresponding generated test files.

Table 1: Coverage summary for the QMCPy Testbook suite.

    Item                                            Count  Notes
    Demo notebooks in demos/                        33     All notebooks in demos/ (including subfolders)
    Generated test files in test/booktests/tb_*.py  32     Each tb_*.py tests a single demo notebook
    Notebooks executed by booktests workflow        32     Workflow runs all generated test files

Runner configuration notes:

  • The GitHub ubuntu-latest runner typically provides 2 virtual CPUs per job.
  • The workflow allocates a 12 GB swap file to mitigate transient memory spikes during notebook execution and reduce the likelihood of out-of-memory failures.

CI Tests (from .github/workflows/booktests.yml)

The continuous integration workflow automates the execution of notebook-based tests and prepares a controlled environment for reproducible runs. The workflow checks out the repository, sets up Miniconda, installs the project’s test extras (pip install -e .[test]) along with optional components (test_torch, test_gpytorch, etc.), and creates a 12 GB swap file early in the job to reduce out-of-memory failures for notebooks with heavy memory demands.

To ensure notebook tests are present and up to date, the workflow invokes make check_booktests and make generate_booktests, which confirm or regenerate test/booktests/tb_*.py files (one per demo notebook). The final step triggers the test target—typically make booktests_parallel_no_docker for parallel execution or make booktests_no_docker for sequential execution. These targets run the generated tests via pytest/testbook.

For diagnostics, setting ACTIONS_STEP_DEBUG: true increases log verbosity. Timing and memory usage can be captured by wrapping the make call with /usr/bin/time -v and uploading the resulting logs using actions/upload-artifact@v4.

Parallel notebook execution must respect the resource limits of GitHub runners, which typically provide only two CPUs. When a notebook requests more workers (e.g., max_workers = 8), the code is adapted to max_workers = min(8, os.cpu_count() or 1) to ensure compatibility. Under memory pressure, the workflow may fall back to the sequential booktests_no_docker target or reduce -j parallelism inside the Makefile.
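A minimal version of this guard looks like the following (the requested worker count of 8 is taken from the example above):

```python
import os

requested = 8  # worker count a notebook might hard-code
# Cap to the CPUs actually available, e.g. 2 on a standard GitHub runner.
max_workers = min(requested, os.cpu_count() or 1)
```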

The workflow is readily extensible: caching package installations with actions/cache accelerates subsequent runs; notebook outputs and HTML artifacts can be uploaded for failure analysis; and a simple CSV timing log in test/booktests/ can be collected to track notebook performance over time.

Further Work

These results suggest extending our Parsl-based approach beyond Jupyter notebooks to Python doctests and to unit tests written with pytest or unittest.

Because multicore processors are now ubiquitous, distributing tests in this way shortens the feedback loop for individual developers, who can quickly confirm that no regressions have been introduced.

Feedback from ParslFest participants also highlighted that the system is quite general: a distributed test harness of this kind could benefit other Parsl users by letting them distribute their own test workloads.
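As a first step toward doctest support, the doctests attached to a single object can already be collected and run programmatically, which makes them natural candidates for independent tasks (a sketch using a toy function):

```python
import doctest

def add(a, b):
    """Add two numbers.

    >>> add(2, 3)
    5
    """
    return a + b

# Collect and run the doctests attached to one object; each such unit
# could be scheduled as an independent task under Parsl.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(add):
    runner.run(test)
results = runner.summarize(verbose=False)
```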

References

1. Choi, S.-C. T., Hickernell, F., McCourt, M., & Sorokin, A. (2020). QMCPy: A quasi-Monte Carlo Python Library. Retrieved from https://qmcsoftware.github.io/QMCPSoftware/.
2. Parsl Project. (2025). Globus Compute + ParslFest 2025: Annual Community Gathering (Hybrid), August 28–29, University of Chicago. Retrieved from https://parsl-project.org/parslfest/parslfest2025.html.
3. nteract team. (2021). testbook: Unit testing framework for Jupyter notebooks. Version 0.4.2, accessed September 25, 2025. Retrieved from https://github.com/nteract/testbook.
4. Babuji, Y., et al. (2019). Parsl: Pervasive Parallel Programming in Python. In Proceedings of the 28th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC ’19) (pp. 25–36). Association for Computing Machinery, New York, NY, USA. DOI: 10.1145/3307681.3325400.

Dr. Sou-Cheng T. Choi is Research Associate Professor of Applied Mathematics at the Illinois Institute of Technology and Founder of SouLab LLC. She formerly served as Chief Data Scientist at the Kamakura Corporation, acquired by SAS Institute Inc. in 2022.
