Learner Profiles
Sparky Watts
Profile
Sparky Watts is a second-year PhD student in an interdisciplinary energy studies department. As she’s preparing for her qualifying exams, Sparky is looking at the available energy data sources that she could use for her doctoral research.
A few years ago, a postdoctoral scholar in her research lab worked on an analysis of hourly demand data published by the EIA. With a clear research question in mind, Sparky is interested in building on the postdoc’s data cleaning scripts and updating them to include newer years of data, but the code is scattered over several Jupyter Notebooks without much documentation. None of the other lab members remember much about the project.
The postdoc also left a folder with a handful of CSVs that Sparky thinks contain the raw data he was using, and she’ll need to come up with a strategy to get new input data. She’s considering using the EIA’s new API to access the new years of data, but other lab members have told her the documentation is a bit confusing.
Though Sparky has taken a Python Software Carpentry workshop before and done some coding using Pandas in Jupyter Notebooks, she’s not quite sure where to start. The data is larger than other datasets she’s worked with before, and she can’t even open it in a spreadsheet to look at a few rows without Excel crashing. Inspired by her experiences of looking at this postdoc’s code, she’s hoping to write software that others will be able to use and adapt for their own work long after she’s left the lab.
Background knowledge and skills:
- Has energy domain knowledge and a research area in mind
- Recently attended the Data Analysis and Visualization in Python for Ecologists workshop
- Can install software using GUI’s
- Has recently used Jupyter Notebooks and Pandas to analyze tabular data
- Is comfortable reading in CSVs to Python using
pandas.read_csv()
Gene Rator
Profile
Gene Rator is a postdoctoral researcher studying the carbon intensity of electricity production.
He writes his own analysis pipelines using Python scripts that he’s developed over the years and is comfortable transforming data using Pandas.
Recently, Gene sent his scripts to a collaborator at another research institute, who struggled to run the pipeline because it’s heavily tied to Gene’s own research environment. They also complained about having to manually download 50 CSVs to get the raw data needed in order to run Gene’s pipeline locally.
Gene needs to publish his own pipeline to both share his work and fulfill publication requirements. More worryingly, he recently tried re-running the pipeline after making some small tweaks in response to feedback from a collaborator, and got some unexpected results - something might be going wrong somewhere in his pipeline, but it’s hard to tell where.
Gene needs strategies to debug his existing workflow, and a way to effectively share his work with collaborators and the public.
Saul R. Panel
Profile
Saul R. Panel is a 1st-year PhD student who wants to work on modeling the US energy system. Until recently they were working at a nonprofit which paid for a commercial energy data subscription. Now that they’re in school, they have lost access to that data. They want to build a web portal that shows people the effects of different local policy decisions on their energy bills.
They want to use public data to do this, but the constellation of different open datasets in different formats is overwhelming and confusing compared to the commercial dataset which provided a simple interface for getting the data into an analysis-ready form. They’ve played around with some data for a single county, but figuring out how to scale this up to a U.S.-wide analysis is overwhelming.
Some folks in their lab have done a lot of policy modeling in Python. Saul is just starting out, but has learned the basics of Python from their coworkers in the past few months. They’re wondering how to connect the publicly available data into the policy models from the lab so the web portal can access model outputs.