Running the ETL Pipeline#
So you want to run the PUDL data processing pipeline? This is the most involved way to get access to PUDL data. It’s only recommended if you want to edit the ETL process or contribute to the codebase. Check out the Data Access documentation if you just want to use the data we process and distribute.
These instructions assume you have already gone through the Development Setup.
Alembic#
PUDL uses Alembic to manage the creation our database and migrations of the schema as it changes over time. However, we only use file-based databases (SQLite, DuckDB) and these migrations are mostly a way to allow us to change the schema without needing to repopulate the entire database from scratch. They are not used in production.
Database initialization#
Before we run anything, we’ll need to make sure that the schema in the database
actually matches the schema defined by the code. Run pixi run alembic upgrade head
to create the database with the right schema. If you already have a pudl.sqlite
you’ll probably need to delete it first.
Database schema migration#
If you’ve changed the database schema locally (by renaming a column, adding a table, defining a new primary key, changing a datatype, etc.), you’ll need to make a migration reflecting that change and apply the migration to the database to keep the database schema synchronized with the code:
$ pixi run alembic revision --autogenerate -m "Add my cool table"
$ pixi run alembic upgrade head
$ git add migrations
$ git commit -m "Migration: added my cool table"
When switching branches, Alembic may refer to a migration version that is not on your current branch. This will manifest as an error like this when running an Alembic command:
FAILED: Can't locate revision identified by '29d443aadf25'
If you encounter that, you will want to check out the git branch that does
include that migration in the migrations directory. Then you should run
alembic downgrade head-1 to revert the database to the prior version. Then
you can go back to the branch that doesn’t have your migration, and use Alembic
in peace.
If the migrations have diverged for more than one revision, you can specify the specific version you would like to downgrade to with its hash. You may also want to keep a copy of the old SQLite database around, so you can easily switch between branches without having to regenerate data.
More information can be found in the Alembic docs.
Dagster#
Catalyst uses Dagster to manage our data pipelines. Dagster is an open source data orchestration framework written in Python. It makes it easy to manage data dependences, parallelize processes, cache results and handle IO.
If you are interested in contributing to PUDL, you may want to familiarize yourself with Dagster’s excellent documentation:
Dagster Essentials (Dagster Course)
If you use coding agents, you may also want to check out the Dagster agent skills:
AI Driven Data Engineering (Dagster Course)
These skills are also configured in the PUDL repo and can be installed with this pixi task (which uses npx skill).
$ pixi run install-skills
Because Dagster’s documentation is extensive and constantly being updated, the rest of this section will focus only on the specifics of the PUDL project, with links to the Dagster docs for more info.
Core Dagster concepts used in PUDL#
`Definitions <https://docs.dagster.io/getting-started/concepts#definitions>`__ are the top-level collection of Dagster objects that get loaded into a code location. They bundle together the assets, asset checks, resources, jobs, schedules, and sensors that Dagster can see and execute. In PUDL, the canonical Dagster assembly lives in
pudl.dagsterand is exposed viapudl.dagster.defs, whilepudl.definitionsremains the stable top-level code location used bydg. The package is split by Dagster abstraction so contributors can edit the relevant layer directly:pudl.dagster.assetsloads and groups assets.pudl.dagster.asset_checksdefines Dagster asset checks.pudl.dagster.resourcesdefines the default resource set.pudl.dagster.jobsdefines the standard PUDL jobs.pudl.dagster.sensorsdefines Dagster sensors.pudl.dagster.configcontains reusable run-configuration helpers.pudl.dagster.buildassemblesdagster.Definitionsviapudl.dagster.build_defs().
`Assets <https://docs.dagster.io/guides/build/assets>`__ are the primary building blocks in Dagster. They represent the underlying entities in our pipelines, such as database tables or machine learning models. In PUDL, most assets represent a
pandas.DataFramethat is written to Parquet and SQLite files on disk. Depending on which part of the PUDL DAG you are looking at, assets might represent messy raw dataframes extracted from spreadsheets, partially cleaned intermediary dataframes, or fully normalized tables ready for distribution.`Resources <https://docs.dagster.io/guides/build/external-resources>`__ are objects used by Dagster assets to provide access to external systems, databases, or services. In PUDL, we’ve defined a
pudl.workspace.datastore.DatastoreResource that pulls our raw input data from archives on Zenodo identified by DOI. Thepudl.workspace.datastore.ZenodoDoiSettingsResource defines the current Zenodo DOI for each dataset. We also store our dataset-specific data config (like what years of EIA-861 data to process) in a Resourcepudl.dagster.resources.GlobalDataConfigResource.`IO Managers <https://docs.dagster.io/guides/build/io-managers>`__ in Dagster let us keep the code for data processing separate from the code for reading and writing data. PUDL defines I/O Managers for reading data out of the FERC SQLite databases we curate, for reading and writing Parquet files, and for writing out to SQLite. For example
pudl.dagster.io_managers.PudlMixedFormatIOManagerallows assets to read and write dataframes to SQLite and Parquet-backed outputs using a single logical interface.`Jobs <https://docs.dagster.io/guides/build/jobs>`__ are preconfigured collections of assets, resources and IO Managers. Jobs are the main unit of execution in Dagster. The main jobs assembled in
pudl.dagsterare:ferc_to_sqliteto rebuild the raw FERC prerequisite databases only.pudlto run the main PUDL ETL assuming those raw FERC databases already exist.pudl_with_ferc_to_sqliteto run the full end-to-end build in one Dagster job.ferceqra DuckDB based pipeline to process the very large FERC EQR dataset.
`Configs <https://docs.dagster.io/guides/operate/configuration/run-configuration>`__ are the runtime settings passed to Dagster jobs, assets, and resources to control what gets executed and how. In PUDL, we usually store these settings in YAML files like
dg_fast.yml,dg_full.yml,dg_pytest.yml, anddg_nightly.yml, which configure execution options and shared resources likeglobal_data_config. The reusable helpers that assemble these run configs live inpudl.dagster.config.
The Dagster Web UI#
The Dagster UI is the main interactive interface for inspecting the PUDL Dagster code location, launching jobs and asset materializations, reviewing logs, and debugging runs without dropping down to the CLI for every operation.
The Dagster CLI: dg#
dg is Dagster’s official CLI.
It can perform many of the same actions managed through the Dagster UI, but is better
suited to programmatic usage. PUDL is configured as a dg project. Some PUDL
specific usage examples:
# Start up the Dagster UI webserver and daemons
$ pixi run dg dev
# Launch a full job with its default config
$ pixi run dg launch --job pudl
# Select a subset of assets to materialize
$ pixi run dg launch --assets "group:raw_eia861"
# List all of the Dagster definitions
$ pixi run dg list defs
Running the ETL via the Dagster UI#
Dagster needs a directory to store run logs system state, and interim assets that are
not written to Parquet or SQLite for distribution. Create a new directory
outside of your cloned PUDL repository and then define an environment variable
named DAGSTER_HOME to the path of the new directory. E.g.
$ echo "export DAGSTER_HOME=/path/to/dagster_home" >> ~/.zshrc # zsh
$ echo "export DAGSTER_HOME=/path/to/dagster_home" >> ~/.bashrc # bash
$ set -Ux DAGSTER_HOME /path/to/dagster_home # fish
Add DAGSTER_HOME to the current session with
$ export DAGSTER_HOME=/path/to/dagster_home
Once DAGSTER_HOME is set, launch the dagster UI by running:
$ pixi run dg dev
Note
If DAGSTER_HOME is not set, you will still be able to execute jobs but dagster
logs and outputs of assets that use the default fs_io_manager will be
saved to a temporary directory that is deleted when the dagster process exits.
This will launch the Dagster UI on localhost:3000. See the Dagster UI docs <https://docs.dagster.io/guides/operate/webserver>`__ for all the details on how to use the UI.
Cloning the FERC databases#
The raw FERC SQLite databases are part of the raw_ferc_to_sqlite asset group. If
you only need those outputs, select the ferc_to_sqlite job and hit “Materialize
All”, or you can select the specific FERC Form you actually need. If you want to run
the whole ETL from scratch, use the pudl_with_ferc_to_sqlite job. The pudl job
is intended for day-to-day development once compatible raw FERC outputs have been
materialized locally. See Converting raw FERC data to SQLite for more background on this process.
PUDL checks that your existing FERC SQLite databases are compatible with the current run configuration before downstream assets read them. Incompatible databases usually mean the FERC SQLite assets need to be rematerialized. This will happen when:
The configured Zenodo DOI for a FERC dataset changed.
The set of FERC years requested by your current ETL config includes years that are missing from the existing FERC SQLite database.
When this happens, refresh the FERC databases by materializing ferc_to_sqlite again,
then rerun the pudl job or selected downstream assets. You can also choose to
materialize only the FERC Form 1 and Form 714 databases in the Dagster UI, since those
are the only ones that feed directly into the PUDL ETL.
If you intentionally want to skip this compatibility check (for example, when using
prebuilt FERC SQLite files downloaded from nightly outputs), set
PUDL_SKIP_FERC_SQLITE_PROVENANCE=1 in your shell before running the ETL. This bypass
is intended for development workflows and can lead to downstream failures if the
FERC databases are stale.
Running the PUDL ETL#
For most day-to-day development, you will want to select the pudl job. This will
bring you to a window that displays all of the assets and their dependencies. Subsets
of the asset graph are organized by asset groups, which are helpful for visualizing and
executing subsets of the asset graph.
To execute the whole pudl job end-to-end click “Materialize all”. Depending on how
many CPUs and how much memory your computer has, this may take hours. On an M1 Macbook
Pro with 32GB of RAM and 10 CPUs it takes about 90 minutes. To run the full ETL you’ll
need at least 16GB of RAM.
Read the Configuring resources section to learn more. To view the status of the run, click the date next to “Latest run:”.
You can also re-execute specific assets by selecting one or multiple assets in the “Overview” tab and clicking “Materialize selected”. This is helpful if you are updating the logic of a specific asset and don’t want to rerun the entire ETL.
Note
To process a subset of years for a specific asset group, select the asset group,
shift+click “Materialize all” and configure the global_data_config resource with the
desired years.
See Troubleshooting Dagster for tips on how to fix common issues we run into.
Running the FERC EQR ETL#
All processing for FERC EQR data is contained in a separate ETL from the rest of PUDL. This is because the dataset is too large to archive the raw data on Zenodo. This means the ETL can only be run by developers with credentials to access private cloud storage containing the raw data. Any external contributors interested in working on this ETL should contact the Catalyst team to set up access to the raw data.
The FERC EQR ETL is contained in a Dagster job called ferceqr.
Executing this job from the Dagster UI is slightly different from the main
PUDL ETL jobs because the EQR job uses Dagster partitions. After selecting
“Materialize All” (or “Materialize selected” for a selection of assets),
a screen will popup allowing you to select the partitions to execute.
From here you can select a set of year-quarter combinations. This will
trigger a backfill, which will execute each partition in its own run.
To properly handle a backfill, you will need to configure dagster to use a
QueuedRunCoordinator. This can be done using a dagster.yaml file in your
DAGSTER_HOME directory with the following content:
run_coordinator:
module: dagster.core.run_coordinator
class: QueuedRunCoordinator
config:
tag_concurrency_limits:
- key: "dagster/backfill"
limit: 2
The config section shown above is not strictly necessary, but will limit the
number of concurrent runs Dagster will start, which can be helpful to avoid
out-of-memory issues while running many quarters in one backfill.
Running the ETL via CLI#
The dg command line interface is Dagster’s official tool and has a ton of built-in
functionality. For full documentation see the Dagster docs:
dg CLI reference.
These commands are a quick way to confirm your local Dagster setup is healthy before launching runs:
$ pixi run dg check toml
$ pixi run dg check defs --verbose
$ pixi run dg list defs
You can also kick off jobs directly with dg launch. The Dagster UI does not need to
be running for this to work, but if it is running, you’ll see the run appear in it.
$ pixi run dg launch --job ferc_to_sqlite
$ pixi run dg launch --job pudl
$ pixi run dg launch --job pudl_with_ferc_to_sqlite --config src/pudl/package_data/settings/dg_full.yml
You can also target specific assets rather than an entire job, and use Dagster’s rich asset selection syntax to pick and choose:
# Materialize all assets in the raw_eia861 group
$ pixi run dg launch --assets "group:raw_eia861"
# Materialize all assets upstream and downstream of a table
$ pixi run dg launch --assets "+key:core_eia923__fuel_receipts_costs+"
We also have a pixi task defined in pyproject.toml to process all data with
the full default configuration (this can take hours):
$ pixi run pudl
Dagster Config and PUDL Data Config Files —————————–_————
The dg launch command can read run configuration from YAML files. This avoids
undue command line complexity and preserves a record of how the pipeline was run.
The standard Dagster config files we use are:
src/pudl/package_data/settings/dg_fast.ymlfor smaller, faster local runs.src/pudl/package_data/settings/dg_full.ymlfor full local builds.src/pudl/package_data/settings/dg_pytest.ymlfor integration-test prebuilds.src/pudl/package_data/settings/dg_nightly.ymlfor the nightly cloud build.
Warning
The Dagster config file selects resources and execution settings. The referenced data config YAML still determines partitions, years, and other dataset-specific parameters, but job and asset selection determine which parts of the graph run.
Each Dagster config file includes execution options and resource configuration,
including the global_data_config_path used by the shared global_data_config
resource. The referenced data config YAML files specify which partitions of each
dataset should be processed, and are generally structured like this:
# FERC-to-SQLite data config
ferc_to_sqlite:
├── ferc1_dbf
| └── years
├── ferc1_xbrl
| └── years
└── ferc2_xbrl
└── years
# PUDL ETL data config
name : unique name identifying the etl outputs
title : short human readable title for the etl outputs
description : a longer description of the etl outputs
pudl:
├── dataset name
│ └── dataset parameter (e.g. years) : editable list of years
└── dataset name
└── dataset parameter (e.g. years) : editable list of years
See also
For an exhaustive listing of the available parameters, see the data config models in
pudl.settings and the packaged settings files under
src/pudl/package_data/settings/.
In general, you should not fiddle with these settings unless you are actually adding a new year of data. We only test the combinations of inputs found in the full and fast data config files that are checked into the PUDL repo. Many other combinations are obviously possible, but most of them probably don’t work!
The Fast ETL#
Running the Fast ETL processes a limited subset of data for each dataset. This is similar to what we do in our software integration tests. Depending on your computer, it may take up to an hour to run.
$ pixi run dg launch --job pudl --config src/pudl/package_data/settings/dg_fast.yml
The Full ETL#
The Full data config includes all available data that PUDL can process. All the years, all the states, and all the tables, including the ~1 billion record EPA CEMS dataset. Assuming you already have the data downloaded, on a computer with at least 16 GB of RAM, and a solid-state disk, the Full ETL including EPA CEMS should take a couple of hours.
$ pixi run dg launch --job pudl --config src/pudl/package_data/settings/dg_full.yml
Custom ETL#
If you need a custom run profile, copy one of the existing Dagster config files, change
its global_data_config_path or other resource settings, and point dg launch at
the new file.
$ pixi run dg launch --job pudl --config the/path/to/your/custom_dg_config.yml
Additional Notes#
Logging#
The commands above should result in a bunch of Python logging output describing
what Dagster is doing, and file outputs in the directory you specified via the
$PUDL_OUTPUT environment variable. When the ETL is complete, you should see new
files at e.g. $PUDL_OUTPUT/ferc1_dbf.sqlite, $PUDL_OUTPUT/pudl.sqlite and
$PUDL_OUTPUT/core_epacems__hourly_emissions.parquet.
The Dagster CLI also has built-in help if you want additional information:
$ pixi run dg launch --help
Foreign Key Constraints#
The order assets are loaded into pudl.sqlite is non-deterministic because the
assets are executed in parallel so foreign key constraint violations can’t be identified
in real time. However, foreign key constraints can be checked after all of the data
has been loaded into the database successfully. To check the constraints, run:
$ pixi run pudl_check_fks
The foreign key check is also run as part of the PUDL integration tests.