pudl.extract.sec10k#
Load pre-processed SEC 10-K assets from Google Cloud Storage.
These “raw” tables are generated by the SEC 10-K data extraction pipeline, which lives in the catalyst-cooperative/mozilla-sec-eia repository.
Upstream data is not partitioned by year, but we want to be able to extract a subset of
the data for testing, so the Sec10kDataConfig allows specifying which years
to extract, and those years are used to filter the extracted data before it is returned.
Attributes#
Functions#
| extract(ds, table, years) | Extract SEC 10-K data from the datastore. |
| raw_sec10k_asset_factory(table) | An asset factory for extracting SEC 10-K data by table. |
Module Contents#
- pudl.extract.sec10k.extract(ds: pudl.workspace.datastore.Datastore, table: str, years: list[int]) → pandas.DataFrame#
Extract SEC 10-K data from the datastore.
Allows filtering by year to enable testing of the pipeline with a smaller amount of data, like a pseudo-partition. This is necessary because the SEC 10-K data is not partitioned upstream.
- Parameters:
ds – Initialized PUDL datastore.
table – Which of the valid tables should be extracted?
years – Which years of data to include in the output.
- Returns:
A dataframe containing the SEC 10-K data.
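The year-based "pseudo-partition" described above might look roughly like the following sketch. It is not the actual implementation: the helper name `filter_by_years`, the `report_date` column, and the toy rows are all hypothetical, since the real tables and the column that carries the year are not specified here.

```python
import pandas as pd


def filter_by_years(
    df: pd.DataFrame, years: list[int], date_col: str = "report_date"
) -> pd.DataFrame:
    """Keep only rows whose date column falls in one of the requested years.

    Hypothetical helper: the real column used for filtering is an assumption.
    """
    dates = pd.to_datetime(df[date_col])
    return df.loc[dates.dt.year.isin(years)].reset_index(drop=True)


# Toy stand-in for an unpartitioned upstream table (made-up values).
raw = pd.DataFrame(
    {
        "report_date": ["2019-12-31", "2020-12-31", "2021-12-31"],
        "value": [1, 2, 3],
    }
)

# Request only two years, as a test configuration might.
subset = filter_by_years(raw, years=[2020, 2021])
```

Filtering after the read keeps the upstream data layout untouched. The trade-off is that the whole table is still loaded before being trimmed.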
- pudl.extract.sec10k.raw_sec10k_asset_factory(table) → dagster.AssetsDefinition#
An asset factory for extracting SEC 10-K data by table.
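To illustrate the factory idea in isolation, here is a minimal sketch using plain Python closures instead of Dagster. It shows only the pattern of producing one extraction callable per table name. The function `make_extractor`, the `raw_sec10k__` name prefix, and the table names are hypothetical, and the real factory returns a `dagster.AssetsDefinition` rather than a bare function.

```python
from collections.abc import Callable


def make_extractor(table: str) -> Callable[[list[int]], dict]:
    """Return an extraction function bound to a single table name."""

    def _extract(years: list[int]) -> dict:
        # Stand-in for the real work: record which table and years were requested.
        return {"table": table, "years": sorted(years)}

    # Give each generated function a distinct, table-specific name
    # (the prefix is an assumption, not taken from the documentation).
    _extract.__name__ = f"raw_sec10k__{table}"
    return _extract


# Build one extractor per (hypothetical) table name.
extractors = {t: make_extractor(t) for t in ["table_a", "table_b"]}
result = extractors["table_a"]([2021, 2020])
```

The factory keeps the per-table definitions uniform: adding a table means adding a name to the list, not writing another function.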