pudl.extract.parquet#

Extractor for Parquet data.

Attributes#

Classes#

ParquetExtractor

Class for extracting dataframes from parquet files.

Module Contents#

pudl.extract.parquet.logger[source]#

class pudl.extract.parquet.ParquetExtractor(ds: pudl.workspace.datastore.Datastore)[source]#

Bases: pudl.extract.extractor.GenericExtractor

Class for extracting dataframes from parquet files.

The extraction logic is invoked by calling the extract() method of this class.

source_filename(page: str, **partition: pudl.extract.extractor.PartitionSelection) → str[source]#

Produce the source Parquet file name as it will appear in the archive.

Parameters:
  • page – PUDL name for the dataset contents, e.g. “boiler_generator_assn” or “data”

  • partition – partition to load. Example: {‘year’: 2009}

Returns:

string name of the parquet file
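As a rough illustration of how a page name and a partition might be combined into a file name, the sketch below uses an invented naming scheme (page name, then partition values, then ".parquet"). The helper build_source_filename is hypothetical and not part of pudl; the pattern actually used in a given archive may differ.

```python
def build_source_filename(page: str, **partition) -> str:
    """Join the page name and partition values into a Parquet file name.

    Hypothetical naming scheme, for illustration only.
    """
    # Sort by partition key so the result does not depend on keyword order.
    parts = [str(value) for _, value in sorted(partition.items())]
    return "-".join([page, *parts]) + ".parquet"


print(build_source_filename("boiler_generator_assn", year=2009))
# prints "boiler_generator_assn-2009.parquet"
print(build_source_filename("data", year_month="2020-08"))
# prints "data-2020-08.parquet"
```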

load_source(page: str, **partition: pudl.extract.extractor.PartitionSelection) → pandas.DataFrame[source]#

Produce the dataframe object for the given partition.

This method assumes that the archive includes one unzipped file per partition.

Parameters:
  • page – PUDL name for the dataset contents, e.g. “boiler_generator_assn” or “data”

  • partition – partition to load. Examples: {‘year’: 2009}, {‘year_month’: ‘2020-08’}

Returns:

pd.DataFrame instance containing the Parquet data
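The one-file-per-partition assumption can be pictured with a minimal sketch. This is not PUDL's implementation: it assumes the archive is a zip file, load_partition_file and read_parquet_bytes are hypothetical helpers, and the default reader relies on pandas.read_parquet, which needs a Parquet engine such as pyarrow to be installed.

```python
import io
import zipfile
from typing import Any, Callable


def read_parquet_bytes(raw: bytes) -> Any:
    """Parse raw Parquet bytes into a pandas DataFrame."""
    # Imported lazily; pandas needs a Parquet engine such as pyarrow.
    import pandas as pd

    return pd.read_parquet(io.BytesIO(raw))


def load_partition_file(
    archive: zipfile.ZipFile,
    filename: str,
    reader: Callable[[bytes], Any] = read_parquet_bytes,
) -> Any:
    """Read the single archive member holding one partition (hypothetical helper)."""
    if filename not in archive.namelist():
        raise FileNotFoundError(f"{filename} not found in archive")
    # The member is stored as a plain Parquet file, so its bytes
    # can be handed directly to the reader.
    with archive.open(filename) as member:
        return reader(member.read())
```

Passing the reader as a parameter keeps the archive lookup separate from the parsing step, so the lookup can be exercised without a Parquet engine.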