pudl.dagster.asset_checks
Programmatically defined Dagster asset checks for PUDL.
This module should contain Dagster asset-check definitions and helper functions that evaluate the quality or structural correctness of already-materialized assets. Put checks here when they belong in the Dagster asset graph and should run as blocking or reporting validations attached to specific assets, especially when they can be derived from metadata or shared validation patterns. Keep business transformations and dbt-only data tests out of this module so it remains focused on Dagster-native asset validation.
For the underlying Dagster concept see https://docs.dagster.io/guides/test/asset-checks
For data validation we almost entirely rely on dbt data tests defined using SQL
and executed across our Parquet outputs using duckdb.
We primarily use Dagster asset checks to validate the schemas of PUDL tables throughout
the pipeline. We use pandera to programmatically define dataframe schemas based
on the PUDL metadata with the asset check factory asset_check_from_schema()
defined below. A handful of asset checks that were particularly difficult to translate
to SQL/dbt data tests are also defined here, but in general all data validation tests
should go in dbt.
Attributes
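Functions
group_mean_continuity_check(df, thresholds, groupby_col, n_outliers_allowed=0) | Check that certain variables don't vary too much on average between groups.
asset_check_from_schema(asset_key, package, duckdb_asset, high_memory_asset) | Create a Dagster asset check based on the resource schema, if defined.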
Module Contents
- pudl.dagster.asset_checks.group_mean_continuity_check(df: pandas.DataFrame, thresholds: dict[str, float], groupby_col: str, n_outliers_allowed: int = 0) -> dagster.AssetCheckResult
Check that certain variables don’t vary too much on average between groups.
Groups and sorts the data by groupby_col, then takes the mean across each group. Useful for saying something like “the average water usage of cooling systems didn’t jump by 10x from 2012 to 2013.”
Parameters:
df – the dataframe containing the data to check.
thresholds – a mapping from column names to the ratio by which those columns are allowed to fluctuate from one group to the next.
groupby_col – the column by which we will group the data.
n_outliers_allowed – how many data points are allowed to be above the threshold.
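The grouping and thresholding described above can be sketched in plain pandas. This is an illustration rather than the PUDL implementation: the function name `mean_continuity` is hypothetical, it returns a plain tuple instead of a `dagster.AssetCheckResult`, and it assumes the threshold is compared against the absolute fractional change between consecutive group means.

```python
import pandas as pd


def mean_continuity(
    df: pd.DataFrame,
    thresholds: dict[str, float],
    groupby_col: str,
    n_outliers_allowed: int = 0,
) -> tuple[bool, dict[str, int]]:
    """Hypothetical sketch: flag large jumps in group means.

    Assumption: a "jump" is the absolute fractional change between the mean
    of one group and the mean of the previous group, in sorted group order.
    """
    cols = list(thresholds)
    # One row per group, sorted by the grouping column.
    means = df.groupby(groupby_col, sort=True)[cols].mean()
    # Fractional change from the previous group; the first group has no
    # predecessor, so its row is NaN and is dropped.
    change = (means / means.shift(1) - 1).abs().dropna()
    # Count, per column, how many group-to-group changes reach the threshold.
    outliers = {
        col: int((change[col] >= limit).sum()) for col, limit in thresholds.items()
    }
    passed = all(n <= n_outliers_allowed for n in outliers.values())
    return passed, outliers


# Yearly means are 20, 22 and 220, so the changes are 0.1 and 9.0.
example = pd.DataFrame(
    {
        "year": [2012, 2012, 2013, 2013, 2014],
        "water_use": [10.0, 30.0, 20.0, 24.0, 220.0],
    }
)
passed, outliers = mean_continuity(example, {"water_use": 0.5}, "year")
```

With a threshold of 0.5, only the 2013 to 2014 jump counts as an outlier, so the check fails under the default `n_outliers_allowed=0` and would pass with `n_outliers_allowed=1`.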
- pudl.dagster.asset_checks.asset_check_from_schema(asset_key: dagster.AssetKey, package: pudl.metadata.classes.Package, duckdb_asset: bool, high_memory_asset: bool) -> dagster.AssetChecksDefinition | None
Create a Dagster asset check based on the resource schema, if defined.
The vast majority of assets are loaded as Polars LazyFrames directly using the PudlParquetIOManager and validated with Pandera’s Polars backend, but there are two exceptions. The first is assets that contain a geometry data type. These are all loaded as geopandas GeoDataFrames and validated with Pandera’s Pandas backend, because Polars does not support geometry data types. The second is assets produced entirely using DuckDB. These assets return ParquetData objects, which are handled by the default IO manager. In this case, the resulting Parquet file(s) are scanned with Polars to produce a LazyFrame, which is then handled exactly like a typical asset.
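The three loading paths can be summarized as a small decision function. This sketch only encodes the rules stated above and does not call Dagster, Polars, or Pandera. `ValidationPlan`, `plan_validation`, and the `has_geometry` flag are hypothetical names, and giving geometry precedence when both flags are set is an assumption.

```python
from typing import NamedTuple


class ValidationPlan(NamedTuple):
    """Hypothetical summary of how an asset is loaded and validated."""

    loaded_as: str
    pandera_backend: str


def plan_validation(has_geometry: bool, duckdb_asset: bool) -> ValidationPlan:
    """Pick a loading path following the rules described in the docstring."""
    if has_geometry:
        # Polars has no geometry dtype, so these assets go through geopandas.
        # Assumption: geometry wins if both flags are set.
        return ValidationPlan("geopandas.GeoDataFrame", "pandas")
    if duckdb_asset:
        # DuckDB assets return ParquetData; the Parquet files are scanned lazily.
        return ValidationPlan("polars.LazyFrame (scanned from ParquetData)", "polars")
    # Typical case: loaded directly through PudlParquetIOManager.
    return ValidationPlan("polars.LazyFrame (PudlParquetIOManager)", "polars")


typical = plan_validation(has_geometry=False, duckdb_asset=False)
geometry = plan_validation(has_geometry=True, duckdb_asset=False)
duckdb = plan_validation(has_geometry=False, duckdb_asset=True)
```

Only the geometry case changes the Pandera backend. The DuckDB case differs in how the LazyFrame is obtained, not in how it is validated.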