pudl.extract.sec10k

Load pre-processed SEC 10-K assets from Google Cloud Storage.

These “raw” tables are generated by the SEC 10-K data extraction pipeline, which lives in the catalyst-cooperative/mozilla-sec-eia repository.

Upstream data is not partitioned by year, but we want to be able to extract a subset of the data for testing. The Sec10kDataConfig therefore allows specifying which years to extract, and those years are used to filter the extracted data before it is returned.

Attributes

Functions

extract(→ pandas.DataFrame)

Extract SEC 10-K data from the datastore.

raw_sec10k_asset_factory(→ dagster.AssetsDefinition)

An asset factory for extracting SEC 10-K data by table.

Module Contents

pudl.extract.sec10k.extract(ds: pudl.workspace.datastore.Datastore, table: str, years: list[int]) → pandas.DataFrame

Extract SEC 10-K data from the datastore.

Allows filtering by year to enable testing of the pipeline with a smaller amount of data, like a pseudo-partition. This is necessary because the SEC 10-K data is not partitioned upstream.

Parameters:
  • ds – Initialized PUDL datastore.

  • table – Name of the table to extract; must be one of the valid SEC 10-K tables.

  • years – Which years of data to include in the output.

Returns:

A dataframe containing the SEC 10-K data.
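
The year-based pseudo-partition described above can be sketched with pandas as follows. This is a minimal illustration, not the module's actual implementation: the helper name `filter_by_years`, the `report_year` column, and the sample rows are all hypothetical, and the real tables may encode the filing period differently.

```python
import pandas as pd


def filter_by_years(
    df: pd.DataFrame, years: list[int], year_col: str = "report_year"
) -> pd.DataFrame:
    """Keep only rows whose year is in ``years``.

    Hypothetical helper: ``year_col`` is an assumed column name, not one
    confirmed by the SEC 10-K tables.
    """
    # Boolean mask selects the requested years; reset the index so the
    # subset looks like a freshly loaded table.
    return df[df[year_col].isin(years)].reset_index(drop=True)


# Made-up sample data standing in for an unpartitioned upstream table.
raw = pd.DataFrame(
    {
        "filename": ["a.txt", "b.txt", "c.txt", "d.txt"],
        "report_year": [2019, 2020, 2021, 2021],
    }
)

subset = filter_by_years(raw, years=[2021])
```

Filtering after loading means the full table is still read into memory; the benefit is only that downstream steps see a smaller dataframe during testing.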

pudl.extract.sec10k.raw_sec10k_asset_factory(table) → dagster.AssetsDefinition

An asset factory for extracting SEC 10-K data by table.
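
The factory pattern itself, one generated extraction callable per table name, can be sketched without Dagster. The real function returns a dagster.AssetsDefinition; the version below uses plain closures purely to show the shape of the idea, and `make_extractor`, the table names, and the returned dictionary are all hypothetical.

```python
from typing import Callable


def make_extractor(table: str) -> Callable[[list[int]], dict]:
    """Build an extraction function bound to one table (hypothetical)."""

    def _extract(years: list[int]) -> dict:
        # Stand-in for the real extraction: just report what would be read.
        return {"table": table, "years": sorted(years)}

    # Give each generated function a distinct, table-specific name.
    _extract.__name__ = f"raw_{table}"
    return _extract


# Placeholder table names; the valid tables are not listed in this page.
tables = ["table_a", "table_b"]
extractors = [make_extractor(t) for t in tables]
```

Binding the table name inside a factory keeps the per-table logic in one place, so adding a table only means adding a name to the list.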

pudl.extract.sec10k.raw_sec10k_assets