Running the ETL Pipeline#

So you want to run the PUDL data processing pipeline? This is the most involved way to get access to PUDL data. It’s only recommended if you want to edit the ETL process or contribute to the codebase. Check out the Data Access documentation if you just want to use the data we process and distribute.

These instructions assume you have already gone through the Development Setup.

Alembic#

PUDL uses Alembic to manage the creation our database and migrations of the schema as it changes over time. However, we only use file-based databases (SQLite, DuckDB) and these migrations are mostly a way to allow us to change the schema without needing to repopulate the entire database from scratch. They are not used in production.

Database initialization#

Before we run anything, we’ll need to make sure that the schema in the database actually matches the schema defined by the code. Run pixi run alembic upgrade head to create the database with the right schema. If you already have a pudl.sqlite you’ll probably need to delete it first.

Database schema migration#

If you’ve changed the database schema locally (by renaming a column, adding a table, defining a new primary key, changing a datatype, etc.), you’ll need to make a migration reflecting that change and apply the migration to the database to keep the database schema synchronized with the code:

$ pixi run alembic revision --autogenerate -m "Add my cool table"
$ pixi run alembic upgrade head
$ git add migrations
$ git commit -m "Migration: added my cool table"

When switching branches, Alembic may refer to a migration version that is not on your current branch. This will manifest as an error like this when running an Alembic command:

FAILED: Can't locate revision identified by '29d443aadf25'

If you encounter that, you will want to check out the git branch that does include that migration in the migrations directory. Then you should run alembic downgrade head-1 to revert the database to the prior version. Then you can go back to the branch that doesn’t have your migration, and use Alembic in peace.

If the migrations have diverged for more than one revision, you can specify the specific version you would like to downgrade to with its hash. You may also want to keep a copy of the old SQLite database around, so you can easily switch between branches without having to regenerate data.

More information can be found in the Alembic docs.

Dagster#

Catalyst uses Dagster to manage our data pipelines. Dagster is an open source data orchestration framework written in Python. It makes it easy to manage data dependences, parallelize processes, cache results and handle IO.

If you are interested in contributing to PUDL, you may want to familiarize yourself with Dagster’s excellent documentation:

If you use coding agents, you may also want to check out the Dagster agent skills:

dagster-expert
dignified-python
AI Driven Data Engineering (Dagster Course)

These skills are also configured in the PUDL repo and can be installed with this pixi task (which uses npx skill).

$ pixi run install-skills

Because Dagster’s documentation is extensive and constantly being updated, the rest of this section will focus only on the specifics of the PUDL project, with links to the Dagster docs for more info.

Core Dagster concepts used in PUDL#

Definitions [Dagster ref] are the top-level collection of Dagster objects that get loaded into a code location. They bundle together the assets, asset checks, resources, jobs, schedules, and sensors that Dagster can see and execute. In PUDL, the canonical Dagster assembly lives in pudl.dagster, while pudl.definitions remains the stable top-level code location used by dg. The package is split by Dagster abstraction so contributors can edit the relevant layer directly:
- pudl.dagster.assets loads and groups assets.
- pudl.dagster.asset_checks defines Dagster asset checks.
- pudl.dagster.resources defines the default resource set.
- pudl.dagster.jobs defines the standard PUDL jobs.
- pudl.dagster.sensors defines Dagster sensors.
- pudl.dagster.config contains reusable run-configuration helpers.
- pudl.dagster.build assembles dagster.Definitions via pudl.dagster.build_defs().
Assets [Dagster ref] are the primary building blocks in Dagster. They represent the underlying entities in our pipelines, such as database tables or machine learning models. In PUDL, most assets represent a pandas.DataFrame that is written to Parquet and SQLite files on disk. Depending on which part of the PUDL DAG you are looking at, assets might represent messy raw dataframes extracted from spreadsheets, partially cleaned intermediary dataframes, or fully normalized tables ready for distribution.
Resources [Dagster ref] are objects used by Dagster assets to provide access to external systems, databases, or services. In PUDL, we’ve defined a pudl.workspace.datastore.Datastore Resource that pulls our raw input data from archives on Zenodo identified by DOI. The pudl.workspace.datastore.ZenodoDoiSettings Resource defines the current Zenodo DOI for each dataset. We also store our dataset-specific data config (like what years of EIA-861 data to process) in a Resource pudl.dagster.resources.GlobalDataConfigResource.
IO Managers [Dagster ref] in Dagster let us keep the code for data processing separate from the code for reading and writing data. PUDL defines I/O Managers for reading data out of the FERC SQLite databases we curate, for reading and writing Parquet files, and for writing out to SQLite. For example pudl.dagster.io_managers.PudlMixedFormatIOManager allows assets to read and write dataframes to SQLite and Parquet-backed outputs using a single logical interface.
Jobs [Dagster ref] are preconfigured collections of assets, resources and IO Managers. Jobs are the main unit of execution in Dagster. The main jobs assembled in pudl.dagster are:
- ferc_to_sqlite to rebuild the raw FERC prerequisite databases only.
- pudl to run the main PUDL ETL assuming those raw FERC databases already exist.
- pudl_with_ferc_to_sqlite to run the full end-to-end build in one Dagster job.
- ferceqr a DuckDB based pipeline to process the very large FERC EQR dataset.
Configs [Dagster ref] are the runtime settings passed to Dagster jobs, assets, and resources to control what gets executed and how. In PUDL, we usually store these settings in YAML files like dg_fast.yml, dg_full.yml, dg_pytest.yml, and dg_nightly.yml, which configure execution options and shared resources like global_data_config. The reusable helpers that assemble these run configs live in pudl.dagster.config.

The Dagster Web UI#

The Dagster UI is the main interactive interface for inspecting the PUDL Dagster code location, launching jobs and asset materializations, reviewing logs, and debugging runs without dropping down to the CLI for every operation.

The Dagster CLI: `dg`#

dg is Dagster’s official CLI. It can perform many of the same actions managed through the Dagster UI, but is better suited to programmatic usage. PUDL is configured as a dg project. Some PUDL specific usage examples:

# Start up the Dagster UI webserver and daemons
$ pixi run dg dev
# Launch a full job with its default config
$ pixi run dg launch --job pudl
# Select a subset of assets to materialize
$ pixi run dg launch --assets "group:raw_eia861"
# List all of the Dagster definitions
$ pixi run dg list defs

Interactive asset loading#

If you want to inspect asset values interactively from a notebook, REPL, or local script, use pudl.dagster.build.build_interactive_defs() rather than the default build_defs() assembly. This helper constructs concrete FERC SQLite IO managers for interactive use, which allows dagster.Definitions.load_asset_value() to work outside a dg-spawned environment.

For example, to load a raw FERC asset in a notebook:

from dagster import AssetKey
from pudl.dagster.build import build_interactive_defs

defs = build_interactive_defs()
test_df = defs.load_asset_value(AssetKey("raw_ferc1_dbf__f1_edcfu_epda"))
test_df.sample(10)

If you need to point at non-default packaged settings, build_interactive_defs() also accepts global_data_config_path=... and zenodo_dois_path=... overrides. If you need to inspect assets against non-default workspace locations, you can also pass pudl_input=... and pudl_output=... directly. These values are used to configure the underlying pudl_paths Dagster resource for the interactive Definitions object.

Running the ETL via the Dagster UI#

Dagster needs a directory to store run logs system state, and interim assets that are not written to Parquet or SQLite for distribution. Create a new directory outside of your cloned PUDL repository and then define an environment variable named DAGSTER_HOME to the path of the new directory. E.g.

$ echo "export DAGSTER_HOME=/path/to/dagster_home" >> ~/.zshrc # zsh
$ echo "export DAGSTER_HOME=/path/to/dagster_home" >> ~/.bashrc # bash
$ set -Ux DAGSTER_HOME /path/to/dagster_home # fish

Add DAGSTER_HOME to the current session with

$ export DAGSTER_HOME=/path/to/dagster_home

Once DAGSTER_HOME is set, launch the dagster UI by running:

$ pixi run dg dev

Note

If DAGSTER_HOME is not set, you will still be able to execute jobs but dagster logs and outputs of assets that use the default fs_io_manager will be saved to a temporary directory that is deleted when the dagster process exits.

This will launch the Dagster UI on localhost:3000. See the Dagster UI docs <https://docs.dagster.io/guides/operate/webserver>`__ for all the details on how to use the UI.

Cloning the FERC databases#

The raw FERC SQLite databases are part of the raw_ferc_to_sqlite asset group. If you only need those outputs, select the ferc_to_sqlite job and hit “Materialize All”, or you can select the specific FERC Form you actually need. If you want to run the whole ETL from scratch, use the pudl_with_ferc_to_sqlite job. The pudl job is intended for day-to-day development once compatible raw FERC outputs have been materialized locally. See Converting raw FERC data to SQLite for more background on this process.

PUDL checks that your existing FERC SQLite databases are compatible with the current run configuration before downstream assets read them. Incompatible databases usually mean the FERC SQLite assets need to be rematerialized. This will happen when:

The configured Zenodo DOI for a FERC dataset changed.
The set of FERC years requested by your current ETL config includes years that are missing from the existing FERC SQLite database.
The version of catalystcoop.ferc_xbrl_extractor that was used to perform the XBRL conversion is out of date.

When this happens, refresh the FERC databases by materializing ferc_to_sqlite again, then rerun the pudl job or selected downstream assets. You can also choose to materialize only the FERC Form 1 and Form 714 databases in the Dagster UI, since those are the only ones that feed directly into the PUDL ETL. To ease development flow, this job will check to see if the SQLite DBs from nightly builds are compatible with your current run, and skip running the full conversion if that is true.

Running the PUDL ETL#

For most day-to-day development, you will want to select the pudl job. This will bring you to a window that displays all of the assets and their dependencies. Subsets of the asset graph are organized by asset groups, which are helpful for visualizing and executing subsets of the asset graph.

To execute the whole pudl job end-to-end click “Materialize all”. Depending on how many CPUs and how much memory your computer has, this may take hours. On an M1 Macbook Pro with 32GB of RAM and 10 CPUs it takes about 90 minutes. To run the full ETL you’ll need at least 16GB of RAM.

Read the Configuring resources section to learn more. To view the status of the run, click the date next to “Latest run:”.

You can also re-execute specific assets by selecting one or multiple assets in the “Overview” tab and clicking “Materialize selected”. This is helpful if you are updating the logic of a specific asset and don’t want to rerun the entire ETL.

Note

To process a subset of years for a specific asset group, select the asset group, shift+click “Materialize all” and configure the global_data_config resource with the desired years.

See Troubleshooting Dagster for tips on how to fix common issues we run into.

Running the FERC EQR ETL#

All processing for FERC EQR data is contained in a separate ETL from the rest of PUDL. This is because the dataset is too large to archive the raw data on Zenodo. This means the ETL can only be run by developers with credentials to access private cloud storage containing the raw data. Any external contributors interested in working on this ETL should contact the Catalyst team to set up access to the raw data.

The FERC EQR ETL is contained in a Dagster job called ferceqr. Executing this job from the Dagster UI is slightly different from the main PUDL ETL jobs because the EQR job uses Dagster partitions. After selecting “Materialize All” (or “Materialize selected” for a selection of assets), a screen will popup allowing you to select the partitions to execute. From here you can select a set of year-quarter combinations. This will trigger a backfill, which will execute each partition in its own run. To properly handle a backfill, you will need to configure dagster to use a QueuedRunCoordinator. This can be done using a dagster.yaml file in your DAGSTER_HOME directory with the following content:

run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    tag_concurrency_limits:
      - key: "dagster/backfill"
        limit: 2

The config section shown above is not strictly necessary, but will limit the number of concurrent runs Dagster will start, which can be helpful to avoid out-of-memory issues while running many quarters in one backfill.

Running the ETL via CLI#

The dg command line interface is Dagster’s official tool and has a ton of built-in functionality. For full documentation see the Dagster docs: dg CLI reference.

These commands are a quick way to confirm your local Dagster setup is healthy before launching runs:

$ pixi run dg check toml
$ pixi run dg check defs --verbose
$ pixi run dg list defs

You can also kick off jobs directly with dg launch. The Dagster UI does not need to be running for this to work, but if it is running, you’ll see the run appear in it.

$ pixi run dg launch --job ferc_to_sqlite
$ pixi run dg launch --job pudl
$ pixi run dg launch --job pudl_with_ferc_to_sqlite --config src/pudl/package_data/settings/dg_full.yml

You can also target specific assets rather than an entire job, and use Dagster’s rich asset selection syntax to pick and choose:

# Materialize all assets in the raw_eia861 group
$ pixi run dg launch --assets "group:raw_eia861"
# Materialize all assets upstream and downstream of a table
$ pixi run dg launch --assets "+key:core_eia923__fuel_receipts_costs+"

We also have a pixi task defined in pyproject.toml to process all data with the full default configuration (this can take hours):

$ pixi run pudl

Dagster Config and PUDL Data Config Files —————————–_————

The dg launch command can read run configuration from YAML files. This avoids undue command line complexity and preserves a record of how the pipeline was run. The standard Dagster config files we use are:

src/pudl/package_data/settings/dg_fast.yml for smaller, faster local runs.
src/pudl/package_data/settings/dg_full.yml for full local builds.
src/pudl/package_data/settings/dg_pytest.yml for integration-test prebuilds.
src/pudl/package_data/settings/dg_nightly.yml for the nightly cloud build.

Warning

The Dagster config file selects resources and execution settings. The referenced data config YAML still determines partitions, years, and other dataset-specific parameters, but job and asset selection determine which parts of the graph run.

Each Dagster config file includes execution options and resource configuration, including the global_data_config_path used by the shared global_data_config resource. The referenced data config YAML files specify which partitions of each dataset should be processed, and are generally structured like this:

# FERC-to-SQLite data config
ferc_to_sqlite:
  ├── ferc1_dbf
  |   └── years
  ├── ferc1_xbrl
  |   └── years
  └── ferc2_xbrl
      └── years

# PUDL ETL data config
name : unique name identifying the etl outputs
title : short human readable title for the etl outputs
description : a longer description of the etl outputs
pudl:
  ├── dataset name
  │    └── dataset parameter (e.g. years) : editable list of years
  └── dataset name
       └── dataset parameter (e.g. years) : editable list of years

The Fast ETL#

Running the Fast ETL processes a limited subset of data for each dataset. This is similar to what we do in our software integration tests. Depending on your computer, it may take up to an hour to run.

$ pixi run dg launch --job pudl --config src/pudl/package_data/settings/dg_fast.yml

The Full ETL#

The Full data config includes all available data that PUDL can process. All the years, all the states, and all the tables, including the ~1 billion record EPA CEMS dataset. Assuming you already have the data downloaded, on a computer with at least 16 GB of RAM, and a solid-state disk, the Full ETL including EPA CEMS should take a couple of hours.

$ pixi run dg launch --job pudl --config src/pudl/package_data/settings/dg_full.yml

Custom ETL#

If you need a custom run profile, copy one of the existing Dagster config files, change its global_data_config_path or other resource settings, and point dg launch at the new file.

$ pixi run dg launch --job pudl --config the/path/to/your/custom_dg_config.yml

Additional Notes#

Logging#

The commands above should result in a bunch of Python logging output describing what Dagster is doing, and file outputs in the directory you specified via the $PUDL_OUTPUT environment variable. When the ETL is complete, you should see new files at e.g. $PUDL_OUTPUT/ferc1_dbf.sqlite, $PUDL_OUTPUT/pudl.sqlite and $PUDL_OUTPUT/core_epacems__hourly_emissions.parquet.

The Dagster CLI also has built-in help if you want additional information:

$ pixi run dg launch --help

Foreign Key Constraints#

The order assets are loaded into pudl.sqlite is non-deterministic because the assets are executed in parallel so foreign key constraint violations can’t be identified in real time. However, foreign key constraints can be checked after all of the data has been loaded into the database successfully. To check the constraints, run:

$ pixi run pudl_check_fks

The foreign key check is also run as part of the PUDL integration tests.