.. _run_the_etl: =============================================================================== Running the ETL Pipeline =============================================================================== So you want to run the PUDL data processing pipeline? This is the most involved way to get access to PUDL data. It's only recommended if you want to edit the ETL process or contribute to the codebase. Check out the :doc:`/data_access` documentation if you just want to use the data we process and distribute. These instructions assume you have already gone through the :ref:`dev_setup`. Alembic ------- PUDL uses `Alembic `__ to manage the creation our database and migrations of the schema as it changes over time. However, we only use file-based databases (SQLite, DuckDB) and these migrations are mostly a way to allow us to change the schema without needing to repopulate the entire database from scratch. They are not used in production. Database initialization ^^^^^^^^^^^^^^^^^^^^^^^ Before we run anything, we'll need to make sure that the schema in the database actually matches the schema defined by the code. Run ``pixi run alembic upgrade head`` to create the database with the right schema. If you already have a ``pudl.sqlite`` you'll probably need to delete it first. Database schema migration ^^^^^^^^^^^^^^^^^^^^^^^^^ If you've changed the database schema locally (by renaming a column, adding a table, defining a new primary key, changing a datatype, etc.), you'll need to make a migration reflecting that change and apply the migration to the database to keep the database schema synchronized with the code: .. code-block:: bash $ pixi run alembic revision --autogenerate -m "Add my cool table" $ pixi run alembic upgrade head $ git add migrations $ git commit -m "Migration: added my cool table" When switching branches, Alembic may refer to a migration version that is not on your current branch. This will manifest as an error like this when running an Alembic command:: FAILED: Can't locate revision identified by '29d443aadf25' If you encounter that, you will want to check out the git branch that *does* include that migration in the ``migrations`` directory. Then you should run ``alembic downgrade head-1`` to revert the database to the prior version. Then you can go back to the branch that doesn't have your migration, and use Alembic in peace. If the migrations have diverged for more than one revision, you can specify the specific version you would like to downgrade to with its hash. You may also want to keep a copy of the old SQLite database around, so you can easily switch between branches without having to regenerate data. More information can be found in the `Alembic docs `__. Dagster ------- Catalyst uses `Dagster `__ to manage our data pipelines. Dagster is an open source data orchestration framework written in Python. It makes it easy to manage data dependences, parallelize processes, cache results and handle IO. If you are interested in contributing to PUDL, you may want to familiarize yourself with Dagster's excellent documentation: * `Getting Started (open source) `__ * `Dagster Core Concepts `__ * `Dagster Basics Tutorial `__ * `Dagster Essentials `__ (Dagster Course) If you use coding agents, you may also want to check out `the Dagster agent skills `__: * `dagster-expert `__ * `dignified-python `__ * `AI Driven Data Engineering `__ (Dagster Course) These skills are also configured in the PUDL repo and can be installed with this pixi task (which uses `npx skill `__). .. code-block:: console $ pixi run install-skills Because Dagster's documentation is extensive and constantly being updated, the rest of this section will focus only on the specifics of the PUDL project, with links to the Dagster docs for more info. Core Dagster concepts used in PUDL ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * **`Definitions `__** are the top-level collection of Dagster objects that get loaded into a code location. They bundle together the assets, asset checks, resources, jobs, schedules, and sensors that Dagster can see and execute. In PUDL, the canonical Dagster assembly lives in :mod:`pudl.dagster` and is exposed via :data:`pudl.dagster.defs`, while :mod:`pudl.definitions` remains the stable top-level code location used by ``dg``. The package is split by Dagster abstraction so contributors can edit the relevant layer directly: - :mod:`pudl.dagster.assets` loads and groups assets. - :mod:`pudl.dagster.asset_checks` defines Dagster asset checks. - :mod:`pudl.dagster.resources` defines the default resource set. - :mod:`pudl.dagster.jobs` defines the standard PUDL jobs. - :mod:`pudl.dagster.sensors` defines Dagster sensors. - :mod:`pudl.dagster.config` contains reusable run-configuration helpers. - :mod:`pudl.dagster.build` assembles :class:`dagster.Definitions` via :func:`pudl.dagster.build_defs`. * **`Assets `__** are the primary building blocks in Dagster. They represent the underlying entities in our pipelines, such as database tables or machine learning models. In PUDL, most assets represent a :py:class:`pandas.DataFrame` that is written to Parquet and SQLite files on disk. Depending on which part of the PUDL DAG you are looking at, assets might represent messy raw dataframes extracted from spreadsheets, partially cleaned intermediary dataframes, or fully normalized tables ready for distribution. * **`Resources `__** are objects used by Dagster assets to provide access to external systems, databases, or services. In PUDL, we've defined a :py:class:`pudl.workspace.datastore.Datastore` Resource that pulls our raw input data from `archives on Zenodo `__ identified by DOI. The :py:class:`pudl.workspace.datastore.ZenodoDoiSettings` Resource defines the current Zenodo DOI for each dataset. We also store our dataset-specific data config (like what years of EIA-861 data to process) in a Resource :py:class:`pudl.dagster.resources.GlobalDataConfigResource`. * **`IO Managers `__** in Dagster let us keep the code for data processing separate from the code for reading and writing data. PUDL defines I/O Managers for reading data out of the FERC SQLite databases we curate, for reading and writing Parquet files, and for writing out to SQLite. For example :class:`pudl.dagster.io_managers.PudlMixedFormatIOManager` allows assets to read and write dataframes to SQLite and Parquet-backed outputs using a single logical interface. * **`Jobs `__** are preconfigured collections of assets, resources and IO Managers. Jobs are the main unit of execution in Dagster. The main jobs assembled in :mod:`pudl.dagster` are: - ``ferc_to_sqlite`` to rebuild the raw FERC prerequisite databases only. - ``pudl`` to run the main PUDL ETL assuming those raw FERC databases already exist. - ``pudl_with_ferc_to_sqlite`` to run the full end-to-end build in one Dagster job. - ``ferceqr`` a DuckDB based pipeline to process the very large FERC EQR dataset. * **`Configs `__** are the runtime settings passed to Dagster jobs, assets, and resources to control what gets executed and how. In PUDL, we usually store these settings in YAML files like ``dg_fast.yml``, ``dg_full.yml``, ``dg_pytest.yml``, and ``dg_nightly.yml``, which configure execution options and shared resources like ``global_data_config``. The reusable helpers that assemble these run configs live in :mod:`pudl.dagster.config`. The Dagster Web UI ^^^^^^^^^^^^^^^^^^ `The Dagster UI `__ is the main interactive interface for inspecting the PUDL Dagster code location, launching jobs and asset materializations, reviewing logs, and debugging runs without dropping down to the CLI for every operation. The Dagster CLI: ``dg`` ^^^^^^^^^^^^^^^^^^^^^^^ ``dg`` is `Dagster's official CLI `__. It can perform many of the same actions managed through the Dagster UI, but is better suited to programmatic usage. PUDL is configured as a ``dg`` project. Some PUDL specific usage examples: .. code-block:: console # Start up the Dagster UI webserver and daemons $ pixi run dg dev # Launch a full job with its default config $ pixi run dg launch --job pudl # Select a subset of assets to materialize $ pixi run dg launch --assets "group:raw_eia861" # List all of the Dagster definitions $ pixi run dg list defs .. _run-dagster-ui: Running the ETL via the Dagster UI ---------------------------------- Dagster needs a directory to store run logs system state, and interim assets that are not written to Parquet or SQLite for distribution. Create a new directory **outside of your cloned PUDL repository** and then define an environment variable named ``DAGSTER_HOME`` to the path of the new directory. E.g. .. code-block:: console $ echo "export DAGSTER_HOME=/path/to/dagster_home" >> ~/.zshrc # zsh $ echo "export DAGSTER_HOME=/path/to/dagster_home" >> ~/.bashrc # bash $ set -Ux DAGSTER_HOME /path/to/dagster_home # fish Add ``DAGSTER_HOME`` to the current session with .. code-block:: console $ export DAGSTER_HOME=/path/to/dagster_home Once ``DAGSTER_HOME`` is set, launch the dagster UI by running: .. code-block:: console $ pixi run dg dev .. note:: If ``DAGSTER_HOME`` is not set, you will still be able to execute jobs but dagster logs and outputs of assets that use the default `fs_io_manager `__ will be saved to a temporary directory that is deleted when the ``dagster`` process exits. This will launch the Dagster UI on `localhost:3000 `__. See the `Dagster UI docs `__` for all the details on how to use the UI. Cloning the FERC databases ^^^^^^^^^^^^^^^^^^^^^^^^^^ The raw FERC SQLite databases are part of the ``raw_ferc_to_sqlite`` asset group. If you only need those outputs, select the ``ferc_to_sqlite`` job and hit "Materialize All", or you can select the specific FERC Form you actually need. If you want to run the whole ETL from scratch, use the ``pudl_with_ferc_to_sqlite`` job. The ``pudl`` job is intended for day-to-day development once compatible raw FERC outputs have been materialized locally. See :doc:`/dev/clone_ferc1` for more background on this process. PUDL checks that your existing FERC SQLite databases are compatible with the current run configuration before downstream assets read them. Incompatible databases usually mean the FERC SQLite assets need to be rematerialized. This will happen when: * The configured Zenodo DOI for a FERC dataset changed. * The set of FERC years requested by your current ETL config includes years that are missing from the existing FERC SQLite database. When this happens, refresh the FERC databases by materializing ``ferc_to_sqlite`` again, then rerun the ``pudl`` job or selected downstream assets. You can also choose to materialize only the FERC Form 1 and Form 714 databases in the Dagster UI, since those are the only ones that feed directly into the PUDL ETL. If you intentionally want to skip this compatibility check (for example, when using prebuilt FERC SQLite files downloaded from nightly outputs), set ``PUDL_SKIP_FERC_SQLITE_PROVENANCE=1`` in your shell before running the ETL. This bypass is intended for development workflows and can lead to downstream failures if the FERC databases are stale. Running the PUDL ETL ^^^^^^^^^^^^^^^^^^^^ For most day-to-day development, you will want to select the ``pudl`` job. This will bring you to a window that displays all of the assets and their dependencies. Subsets of the asset graph are organized by asset groups, which are helpful for visualizing and executing subsets of the asset graph. To execute the whole ``pudl`` job end-to-end click "Materialize all". Depending on how many CPUs and how much memory your computer has, this may take hours. On an M1 Macbook Pro with 32GB of RAM and 10 CPUs it takes about 90 minutes. To run the full ETL you'll need at least 16GB of RAM. Read the :ref:`resource_config` section to learn more. To view the status of the run, click the date next to "Latest run:". You can also re-execute specific assets by selecting one or multiple assets in the "Overview" tab and clicking "Materialize selected". This is helpful if you are updating the logic of a specific asset and don't want to rerun the entire ETL. .. note:: To process a subset of years for a specific asset group, select the asset group, shift+click "Materialize all" and configure the ``global_data_config`` resource with the desired years. See :ref:`troubleshooting_dagster` for tips on how to fix common issues we run into. Running the FERC EQR ETL ^^^^^^^^^^^^^^^^^^^^^^^^^^ All processing for FERC EQR data is contained in a separate ETL from the rest of PUDL. This is because the dataset is too large to archive the raw data on Zenodo. This means the ETL can only be run by developers with credentials to access private cloud storage containing the raw data. Any external contributors interested in working on this ETL should contact the Catalyst team to set up access to the raw data. The FERC EQR ETL is contained in a Dagster job called ``ferceqr``. Executing this job from the Dagster UI is slightly different from the main PUDL ETL jobs because the EQR job uses Dagster partitions. After selecting "Materialize All" (or "Materialize selected" for a selection of assets), a screen will popup allowing you to select the partitions to execute. From here you can select a set of year-quarter combinations. This will trigger a ``backfill``, which will execute each partition in its own ``run``. To properly handle a ``backfill``, you will need to configure dagster to use a ``QueuedRunCoordinator``. This can be done using a ``dagster.yaml`` file in your ``DAGSTER_HOME`` directory with the following content: .. code-block:: yaml run_coordinator: module: dagster.core.run_coordinator class: QueuedRunCoordinator config: tag_concurrency_limits: - key: "dagster/backfill" limit: 2 The ``config`` section shown above is not strictly necessary, but will limit the number of concurrent ``runs`` Dagster will start, which can be helpful to avoid out-of-memory issues while running many quarters in one ``backfill``. .. _run-cli: Running the ETL via CLI ----------------------- The ``dg`` command line interface is Dagster's official tool and has a ton of built-in functionality. For full documentation see the Dagster docs: `dg CLI reference `__. These commands are a quick way to confirm your local Dagster setup is healthy before launching runs: .. code-block:: console $ pixi run dg check toml $ pixi run dg check defs --verbose $ pixi run dg list defs You can also kick off jobs directly with ``dg launch``. The Dagster UI does not need to be running for this to work, but if it is running, you'll see the run appear in it. .. code-block:: console $ pixi run dg launch --job ferc_to_sqlite $ pixi run dg launch --job pudl $ pixi run dg launch --job pudl_with_ferc_to_sqlite --config src/pudl/package_data/settings/dg_full.yml You can also target specific assets rather than an entire job, and use Dagster's rich `asset selection syntax `__ to pick and choose: .. code-block:: console # Materialize all assets in the raw_eia861 group $ pixi run dg launch --assets "group:raw_eia861" # Materialize all assets upstream and downstream of a table $ pixi run dg launch --assets "+key:core_eia923__fuel_receipts_costs+" We also have a ``pixi`` task defined in ``pyproject.toml`` to process all data with the full default configuration (this can take hours): .. code-block:: console $ pixi run pudl Dagster Config and PUDL Data Config Files -----------------------------_------------ The ``dg launch`` command can read run configuration from YAML files. This avoids undue command line complexity and preserves a record of how the pipeline was run. The standard Dagster config files we use are: - ``src/pudl/package_data/settings/dg_fast.yml`` for smaller, faster local runs. - ``src/pudl/package_data/settings/dg_full.yml`` for full local builds. - ``src/pudl/package_data/settings/dg_pytest.yml`` for integration-test prebuilds. - ``src/pudl/package_data/settings/dg_nightly.yml`` for the nightly cloud build. .. warning:: The Dagster config file selects resources and execution settings. The referenced data config YAML still determines partitions, years, and other dataset-specific parameters, but job and asset selection determine which parts of the graph run. Each Dagster config file includes execution options and resource configuration, including the ``global_data_config_path`` used by the shared ``global_data_config`` resource. The referenced data config YAML files specify which partitions of each dataset should be processed, and are generally structured like this: .. code-block:: # FERC-to-SQLite data config ferc_to_sqlite: ├── ferc1_dbf | └── years ├── ferc1_xbrl | └── years └── ferc2_xbrl └── years # PUDL ETL data config name : unique name identifying the etl outputs title : short human readable title for the etl outputs description : a longer description of the etl outputs pudl: ├── dataset name │ └── dataset parameter (e.g. years) : editable list of years └── dataset name └── dataset parameter (e.g. years) : editable list of years .. seealso:: For an exhaustive listing of the available parameters, see the data config models in :mod:`pudl.settings` and the packaged settings files under ``src/pudl/package_data/settings/``. In general, you should not fiddle with these settings unless you are actually adding a new year of data. We only test the combinations of inputs found in the full and fast data config files that are checked into the PUDL repo. Many other combinations are obviously possible, but most of them probably don't work! The Fast ETL ^^^^^^^^^^^^ Running the Fast ETL processes a limited subset of data for each dataset. This is similar to what we do in our :doc:`software integration tests `. Depending on your computer, it may take up to an hour to run. .. code-block:: console $ pixi run dg launch --job pudl --config src/pudl/package_data/settings/dg_fast.yml The Full ETL ^^^^^^^^^^^^ The Full data config includes all available data that PUDL can process. All the years, all the states, and all the tables, including the ~1 billion record EPA CEMS dataset. Assuming you already have the data downloaded, on a computer with at least 16 GB of RAM, and a solid-state disk, the Full ETL including EPA CEMS should take a couple of hours. .. code-block:: console $ pixi run dg launch --job pudl --config src/pudl/package_data/settings/dg_full.yml Custom ETL ^^^^^^^^^^ If you need a custom run profile, copy one of the existing Dagster config files, change its ``global_data_config_path`` or other resource settings, and point ``dg launch`` at the new file. .. code-block:: console $ pixi run dg launch --job pudl --config the/path/to/your/custom_dg_config.yml Additional Notes ---------------- Logging ^^^^^^^ The commands above should result in a bunch of Python :mod:`logging` output describing what Dagster is doing, and file outputs in the directory you specified via the ``$PUDL_OUTPUT`` environment variable. When the ETL is complete, you should see new files at e.g. ``$PUDL_OUTPUT/ferc1_dbf.sqlite``, ``$PUDL_OUTPUT/pudl.sqlite`` and ``$PUDL_OUTPUT/core_epacems__hourly_emissions.parquet``. The Dagster CLI also has built-in help if you want additional information: .. code-block:: console $ pixi run dg launch --help Foreign Key Constraints ^^^^^^^^^^^^^^^^^^^^^^^ The order assets are loaded into ``pudl.sqlite`` is non-deterministic because the assets are executed in parallel so foreign key constraint violations can't be identified in real time. However, foreign key constraints can be checked after all of the data has been loaded into the database successfully. To check the constraints, run: .. code-block:: console $ pixi run pudl_check_fks The foreign key check is also run as part of the PUDL integration tests.