pudl.dagster.asset_checks

Programmatically defined Dagster asset checks for PUDL.

This module should contain Dagster asset-check definitions and helper functions that evaluate the quality or structural correctness of already-materialized assets. Put checks here when they belong in the Dagster asset graph and should run as blocking or reporting validations attached to specific assets, especially when they can be derived from metadata or shared validation patterns. Keep business transformations and dbt-only data tests out of this module so it remains focused on Dagster-native asset validation.

For the underlying Dagster concept see https://docs.dagster.io/guides/test/asset-checks

For data validation we rely almost entirely on dbt data tests, which are defined in SQL and run against our Parquet outputs using DuckDB.

We primarily use Dagster asset checks to validate the schemas of PUDL tables throughout the pipeline. We use Pandera to programmatically define dataframe schemas based on the PUDL metadata, via the asset check factory asset_check_from_schema() defined below. A handful of asset checks that were particularly difficult to translate into SQL/dbt data tests are also defined here, but in general all data validation tests should go in dbt.

Attributes

Functions

group_mean_continuity_check(→ dagster.AssetCheckResult)

Check that certain variables don't vary too much on average between groups.

asset_check_from_schema(...)

Create a Dagster asset check based on the resource schema, if defined.

Module Contents

pudl.dagster.asset_checks.group_mean_continuity_check(df: pandas.DataFrame, thresholds: dict[str, float], groupby_col: str, n_outliers_allowed: int = 0) → dagster.AssetCheckResult

Check that certain variables don’t vary too much on average between groups.

Groups and sorts the data by groupby_col, then takes the mean of each group and compares consecutive group means. Useful for saying something like “the average water usage of cooling systems didn’t jump by 10x from 2012 to 2013.”

Parameters:
  • df – the DataFrame containing the data to check.

  • thresholds – a mapping from column names to the ratio by which those columns are allowed to fluctuate from one group to the next.

  • groupby_col – the column by which we will group the data.

  • n_outliers_allowed – how many data points are allowed to be above the threshold.
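The behavior described above can be sketched roughly as follows. This is an illustrative stand-in, not the PUDL implementation: the function name mean_continuity_sketch is hypothetical, it returns a plain (passed, outlier counts) tuple rather than a dagster.AssetCheckResult, and it assumes that "fluctuation" means the absolute fractional change between consecutive group means, with outliers counted per column.

```python
import pandas as pd


def mean_continuity_sketch(
    df: pd.DataFrame,
    thresholds: dict[str, float],
    groupby_col: str,
    n_outliers_allowed: int = 0,
) -> tuple[bool, dict[str, int]]:
    """Hypothetical stand-in: return (passed, outlier count per column)."""
    cols = list(thresholds)
    # Group and sort by groupby_col, then average each checked column per group.
    group_means = df[[groupby_col, *cols]].groupby(groupby_col, sort=True).mean()
    # Assumption: fluctuation is the absolute fractional change between
    # consecutive group means. The first group has no predecessor, so drop it.
    change = (group_means / group_means.shift(1) - 1).abs().iloc[1:]
    # Count, per column, the group-to-group steps that exceed the allowed ratio.
    n_outliers = {col: int((change[col] > thresholds[col]).sum()) for col in cols}
    passed = all(n <= n_outliers_allowed for n in n_outliers.values())
    return passed, n_outliers


# Made-up data: mean water use jumps from 12 to 120 between 2012 and 2013.
water = pd.DataFrame(
    {
        "report_year": [2011, 2011, 2012, 2012, 2013, 2013],
        "water_use": [10.0, 12.0, 11.0, 13.0, 110.0, 130.0],
    }
)
passed, n_outliers = mean_continuity_sketch(
    water, thresholds={"water_use": 0.5}, groupby_col="report_year"
)
```

With this made-up data the 2012 to 2013 step is the only one above the 0.5 threshold, so the sketch fails with the default n_outliers_allowed=0 and passes with n_outliers_allowed=1.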

pudl.dagster.asset_checks.asset_check_from_schema(asset_key: dagster.AssetKey, package: pudl.metadata.classes.Package, duckdb_asset: bool, high_memory_asset: bool) → dagster.AssetChecksDefinition | None

Create a Dagster asset check based on the resource schema, if defined.

The vast majority of assets are loaded as Polars LazyFrames directly using the PudlParquetIOManager and validated with Pandera’s Polars backend, with two exceptions. The first is assets that contain a geometry data type: these are loaded as geopandas GeoDataFrames and validated with Pandera’s pandas backend, because Polars does not support geometry data types. The second is assets produced entirely using DuckDB: these return ParquetData objects, which are handled by the default IO manager. In this case, the resulting Parquet file(s) are scanned with Polars to produce a LazyFrame, which is then validated exactly like a typical asset.
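The three loading paths can be summarized as a small decision function. This is only a sketch of the routing described in the paragraph above. Loader, ValidationPlan, and plan_validation are hypothetical names that do not exist in PUDL, and checking for geometry before checking for DuckDB is an assumption made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Loader(Enum):
    """Hypothetical labels for how an asset is loaded before validation."""

    POLARS_LAZYFRAME = "Polars LazyFrame via PudlParquetIOManager"
    GEOPANDAS = "geopandas GeoDataFrame"
    POLARS_SCAN_PARQUET = "ParquetData output scanned into a Polars LazyFrame"


@dataclass(frozen=True)
class ValidationPlan:
    loader: Loader
    pandera_backend: str  # "polars" or "pandas"


def plan_validation(
    has_schema: bool, has_geometry: bool, duckdb_asset: bool
) -> ValidationPlan | None:
    """Pick a loading path and Pandera backend for one asset."""
    if not has_schema:
        # No resource schema is defined, so no check is created.
        return None
    if has_geometry:
        # Polars has no geometry dtype, so fall back to geopandas and pandas.
        return ValidationPlan(Loader.GEOPANDAS, "pandas")
    if duckdb_asset:
        # DuckDB assets return ParquetData; the files are scanned with Polars.
        return ValidationPlan(Loader.POLARS_SCAN_PARQUET, "polars")
    # Typical case: loaded directly as a LazyFrame.
    return ValidationPlan(Loader.POLARS_LAZYFRAME, "polars")
```

Both non-geometry paths end with a Polars LazyFrame, which is why the DuckDB case can be validated in the same way as a typical asset once the Parquet files have been scanned.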

pudl.dagster.asset_checks.default_asset_checks
pudl.dagster.asset_checks.duckdb_assets = ['core_ferceqr__quarterly_identity', 'core_ferceqr__contracts',...
pudl.dagster.asset_checks.high_memory_assets = ['out_vcerare__hourly_available_capacity_factor', 'core_epacems__hourly_emissions']