pudl.extract.ferceqr

Extract FERC EQR data.

Attributes

Functions

_get_csv(→ zipfile.ZipFile)

Download CSV to a temporary directory to avoid reading it into memory.

_clean_csv_name(→ pathlib.Path)

Standardize zip file names to avoid errors when opening.

_get_table_name(→ str)

_extract_ident(→ str)

Extract data from ident csv, write to parquet, and return CID from table.

_extract_other_table(table_type, csv_path, ...)

Extract data from a table other than ident and add year_quarter and CID columns.

_csvs_to_parquet(csv_path, year_quarter, filing_name, ...)

Mirror CSVs in filing to a parquet file.

_save_extract_errors(year_quarter, duckdb_connection)

Create parquet file with metadata on any CSV parsing errors.

extract_ferceqr() → tuple[pudl.helpers.ParquetData, ...)

Extract year quarter from CSVs and load to parquet files.

Module Contents

pudl.extract.ferceqr.logger

pudl.extract.ferceqr._get_csv(base_path: upath.UPath, year_quarter: str) → zipfile.ZipFile

Download CSV to a temporary directory to avoid reading it into memory.
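As a rough illustration of the idea described above (not the module's actual code), the sketch below copies an archive to a local directory in chunks and opens it with `zipfile`, so the archive is never loaded into memory whole. The function name `fetch_zip_to_tempdir` is hypothetical, and the sketch assumes a plain `pathlib.Path` source where the real function takes a `upath.UPath`.

```python
import shutil
import zipfile
from pathlib import Path


def fetch_zip_to_tempdir(source: Path, tmp_dir: Path) -> zipfile.ZipFile:
    """Copy an archive into ``tmp_dir`` in chunks and open it from disk.

    Hypothetical helper for illustration only.
    """
    local_path = tmp_dir / source.name
    # copyfileobj streams fixed-size chunks, so the full archive is never
    # held in memory at once.
    with source.open("rb") as src, local_path.open("wb") as dst:
        shutil.copyfileobj(src, dst)
    # ZipFile reads the archive index up front and decompresses members
    # only when they are requested.
    return zipfile.ZipFile(local_path)
```

The caller is responsible for closing the returned `ZipFile` and for cleaning up the temporary directory, for example by creating it with `tempfile.TemporaryDirectory()`.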

pudl.extract.ferceqr._clean_csv_name(csv_path: pathlib.Path) → pathlib.Path

Standardize zip file names to avoid errors when opening.

pudl.extract.ferceqr._get_table_name(table_type: str, year_quarter: str) → str

pudl.extract.ferceqr._extract_ident(ident_csv: str, year_quarter: str, filing_name: str, duckdb_connection: duckdb.DuckDBPyConnection) → str

Extract data from ident csv, write to parquet, and return CID from table.

This table is always extracted first so we can pull the CID from it and include a CID column in all other tables.

pudl.extract.ferceqr._extract_other_table(table_type: str, csv_path: str, year_quarter: str, cid: str, filing_name: str, duckdb_connection: duckdb.DuckDBPyConnection)

Extract data from a table other than ident and add year_quarter and CID columns.
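The ident-first ordering can be sketched as follows. This is only an illustration: it uses the standard-library `csv` module on in-memory strings rather than DuckDB and parquet. The function names, the `company_identifier` column name, and the lowercase `cid` output column are all assumptions made for the example.

```python
import csv
import io


def read_cid(ident_csv_text: str) -> str:
    """Return the company identifier from the first row of an ident CSV."""
    first_row = next(csv.DictReader(io.StringIO(ident_csv_text)))
    # "company_identifier" is a hypothetical column name for this sketch.
    return first_row["company_identifier"]


def tag_rows(table_csv_text: str, year_quarter: str, cid: str) -> list[dict[str, str]]:
    """Read a non-ident table and append year_quarter and cid to every row."""
    return [
        {**row, "year_quarter": year_quarter, "cid": cid}
        for row in csv.DictReader(io.StringIO(table_csv_text))
    ]


# The ident table is read first so its CID can be attached to the other tables.
ident_text = "company_identifier,company_name\nC000123,Example Power\n"
contracts_text = "contract_id,seller\n1,Example Power\n2,Example Power\n"
cid = read_cid(ident_text)
tagged = tag_rows(contracts_text, "2020q1", cid)
```

Carrying the CID and year-quarter on every row lets rows from the non-ident tables be traced back to the filing they came from.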

pudl.extract.ferceqr._csvs_to_parquet(csv_path: pathlib.Path, year_quarter: str, filing_name: str, duckdb_connection: duckdb.DuckDBPyConnection)

Mirror CSVs in filing to a parquet file.

Each filing contains one CSV for each of the 4 EQR tables. Each CSV is extracted to a separate parquet file.

pudl.extract.ferceqr._save_extract_errors(year_quarter: str, duckdb_connection: duckdb.DuckDBPyConnection)

Create parquet file with metadata on any CSV parsing errors.

pudl.extract.ferceqr.extract_ferceqr(context: dagster.AssetExecutionContext, ferceqr_data_config: pudl.dagster.resources.FercEqrDataConfig = FercEqrDataConfig()) → tuple[pudl.helpers.ParquetData, pudl.helpers.ParquetData, pudl.helpers.ParquetData, pudl.helpers.ParquetData, pudl.helpers.ParquetData]

Extract year quarter from CSVs and load to parquet files.

This function loops through the nested EQR archive zipfiles, extracts all tables from them, and writes them to parquet. It opens a duckdb connection at the top level to keep track of extraction errors so that they can be written to the raw_ferceqr__extract_errors table.
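The loop over nested archives with top-level error tracking might look roughly like the sketch below. It is a simplified stand-in, not the module's implementation. It assumes a quarterly zip whose members are per-filing zips containing CSVs, records errors in a plain list instead of a duckdb connection, and treats a UTF-8 decoding failure as the only kind of extraction error. All function names here are hypothetical.

```python
import io
import zipfile


def iter_filing_csvs(quarter_zip: zipfile.ZipFile):
    """Yield (filing_name, csv_name, raw_bytes) for each CSV in each nested zip."""
    for filing_name in quarter_zip.namelist():
        if not filing_name.lower().endswith(".zip"):
            continue
        # The sketch buffers each (small) inner archive in memory so that
        # ZipFile gets a seekable file object.
        inner = io.BytesIO(quarter_zip.read(filing_name))
        with zipfile.ZipFile(inner) as filing_zip:
            for csv_name in filing_zip.namelist():
                if csv_name.lower().endswith(".csv"):
                    yield filing_name, csv_name, filing_zip.read(csv_name)


def extract_quarter(quarter_zip: zipfile.ZipFile):
    """Process every CSV, collecting failures instead of aborting the run."""
    extracted = []
    errors = []  # stand-in for the error metadata tracked at the top level
    for filing_name, csv_name, raw in iter_filing_csvs(quarter_zip):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            errors.append(
                {"filing_name": filing_name, "csv_name": csv_name, "error": str(err)}
            )
            continue
        extracted.append((filing_name, csv_name))
    return extracted, errors
```

Collecting errors in one place at the top level means a single malformed CSV does not stop the rest of the quarter from being extracted, and the failures can be written out together afterwards.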