Key Points

Introduction

Open data principles such as reproducibility, transparency, and collaboration make it easier to share, interpret, and build upon research projects.
Putting these principles into practice doesn’t ‘just happen’; it requires specific skills and strategies.

pandas can read many data formats (e.g., XML, JSON, Parquet) into DataFrames. We can take advantage of this to convert many kinds of structured and semi-structured data into a common tabular form.
Python's built-in help() function displays a function's documentation, which is a good first step when an import of a particular data format goes wrong.
When semi-structured data contains tabular data, we can extract the tabular part into a pandas DataFrame.
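
As a minimal sketch of the last point, the example below pulls the tabular part out of a small semi-structured JSON record. The record, its field names (station, observations, temp), and the values are invented for illustration; pd.json_normalize is one of several pandas tools that could be used here.

```python
import json

import pandas as pd

# Hypothetical semi-structured record: metadata wrapped around a tabular list.
raw = """
{
  "station": "KSEA",
  "units": {"temp": "C"},
  "observations": [
    {"date": "2024-01-01", "temp": 4.5},
    {"date": "2024-01-02", "temp": 6.0}
  ]
}
"""

record = json.loads(raw)

# Extract only the tabular "observations" list into a DataFrame and carry
# the station name along as an extra column.
df = pd.json_normalize(record, record_path="observations", meta=["station"])
print(df)

# If a reader function misbehaves, its documentation is one call away:
# help(pd.json_normalize)
```

Each observation becomes one row, and the station metadata is repeated on every row so the table stands on its own.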

requests is useful when you need to access remote data.
response.status_code tells you whether the request succeeded or, if not, what kind of failure occurred.
response.text gives you the raw response body, which is useful for checking that the data is formatted the way you expect.
response.json() parses the response body as JSON and returns ordinary Python objects.
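
The three attributes above can be combined into one small helper. This is only a sketch: fetch_json and its get parameter are hypothetical names introduced here (the get hook just makes the function easy to try without a network connection), and a real project might handle errors differently.

```python
import requests


def fetch_json(url, get=requests.get):
    """Fetch a URL and return its parsed JSON body.

    Raises RuntimeError with the status code and a preview of the raw text
    if the server does not answer with status 200.
    """
    response = get(url, timeout=10)
    if response.status_code != 200:
        # response.text shows what the server actually sent back.
        preview = response.text[:200]
        raise RuntimeError(
            f"Request failed with status {response.status_code}: {preview}"
        )
    return response.json()
```

Calling fetch_json with the URL of a JSON endpoint would return a dict or list on success and raise a readable error otherwise.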

Web APIs can be thought of as bundles of fancy URLs.
Each web API is different, but if you can read the documentation and make requests to URLs, you can figure out how to use it.
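
To see what a "fancy URL" looks like, the sketch below asks requests to build a request without sending it. The endpoint and the parameter names (q, page) are made up for illustration; a real API's documentation would list its own.

```python
import requests

# Hypothetical endpoint and query parameters, for illustration only.
base_url = "https://api.example.org/v1/search"
params = {"q": "salmon", "page": 2}

# Prepare the request without sending it, to inspect the final URL.
prepared = requests.Request("GET", base_url, params=params).prepare()
print(prepared.url)
```

The printed URL is the base address with the parameters encoded after a question mark, which is all that most API calls amount to.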

requests is a Swiss Army knife for accessing remote data.
Web APIs are just collections of fancy URLs, which you can interact with via requests.
To learn an API, you need to read its documentation and experiment with it to see how it responds.

Beautiful Soup lets you extract links from a webpage so that you can then download the files they point to.
If you need more than one request's worth of results from an API, it will usually provide some “pagination” capability so that you can make all the requests programmatically.
Web scraping is a wide world; if you get stuck, try searching for some of the keywords above.
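
The two techniques above might look roughly like this. Both helpers (extract_links and collect_pages) are hypothetical names, the HTML snippet and page contents are invented, and the "stop at the first empty page" rule is only one of several pagination schemes; many APIs use cursors or "next" links instead.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url, suffix=".csv"):
    """Return absolute URLs for every link in html whose href ends with suffix."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        urljoin(base_url, anchor["href"])
        for anchor in soup.find_all("a", href=True)
        if anchor["href"].endswith(suffix)
    ]


def collect_pages(fetch_page, max_pages=100):
    """Call fetch_page(page_number) until it returns an empty list."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break
        items.extend(batch)
    return items


# Invented HTML standing in for a downloaded page.
sample_html = """
<ul>
  <li><a href="data/2023.csv">2023</a></li>
  <li><a href="/archive/2022.csv">2022</a></li>
  <li><a href="about.html">About</a></li>
</ul>
"""
links = extract_links(sample_html, "https://example.org/reports/")
print(links)

# Invented paged results standing in for an API.
fake_pages = {1: ["a", "b"], 2: ["c"]}
all_items = collect_pages(lambda page: fake_pages.get(page, []))
print(all_items)
```

In real use, fetch_page would wrap a requests call that passes the page number to the API, and each extracted link would be downloaded in turn.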

Different kinds of data – indexing, categorical, numeric, time series – are suited to different kinds of summarization and visualization.
Successful strategies for assessing data problems alternate between noticing your expectations about the data and checking to see if the data match your expectations – and sometimes, updating your expectations based on what you find!
Visualization is not just for reports, papers, and talks! If you incorporate plotting into your exploration and troubleshooting toolbox, you’ll be able to identify and diagnose data problems much more quickly than if you wait for your model to exhibit strange behavior.
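
As a small illustration of matching the summary to the kind of data, the sketch below summarizes a categorical, a numeric, and a time-indexed column in three different ways. The table and its column names are invented; plotting any of the three results would be a natural next step.

```python
import pandas as pd

# Invented example data with one column of each kind.
df = pd.DataFrame(
    {
        "species": pd.Categorical(["cod", "cod", "hake", "cod"]),
        "length_cm": [31.0, 28.5, 40.2, 30.1],
        "caught": pd.to_datetime(
            ["2024-03-01", "2024-03-01", "2024-03-02", "2024-03-04"]
        ),
    }
)

# Categorical data: count how often each category appears.
category_counts = df["species"].value_counts()

# Numeric data: count, mean, spread, and quartiles.
numeric_summary = df["length_cm"].describe()

# Time series: resample to one value per day, which exposes missing days.
daily_mean = df.set_index("caught")["length_cm"].resample("D").mean()

print(category_counts, numeric_summary, daily_mean, sep="\n\n")
```

The daily resample includes a day with no observations, which is the kind of mismatch between expectation and data that this habit is meant to surface.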

You’re always making assumptions about your data, and many of them are likely to be wrong, so you need to check them.
You can prioritize which assumptions to check by thinking about their impact, likelihood, and testability.
You can use assert statements to flag a wrong assumption every time the code runs.
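
A minimal sketch of assumption checks written as assert statements is shown below. The table, its column names, and the three assumptions are invented for illustration; the point is that each check carries a message describing the expectation it guards.

```python
import pandas as pd

# Invented example data.
df = pd.DataFrame(
    {
        "site": ["A", "B", "C"],
        "depth_m": [1.2, 3.4, 0.8],
    }
)

# Each assert states one assumption; a failure stops the run with the message.
assert df["site"].is_unique, "site IDs should not repeat"
assert df["depth_m"].notna().all(), "depth_m should have no missing values"
assert (df["depth_m"] >= 0).all(), "depth_m should be non-negative"

print("All assumptions hold for this data.")
```

If a later version of the data violates one of these expectations, the script fails immediately at the relevant line instead of producing a quietly wrong result.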

Plain language descriptions can help us choose which code to reorganize by identifying goals and intent.
We can attach our descriptions directly to our functions using docstrings.
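
For instance, a plain-language description can sit directly under the def line as a docstring. The function below is a made-up example, not one taken from the lesson.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


# The description travels with the function and is what help() displays.
print(fahrenheit_to_celsius.__doc__)
```

Because the docstring is stored on the function itself, help(fahrenheit_to_celsius) shows the same description wherever the function is imported.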

Jupyter is great for data exploration and visualization, but working with scripts and modules is preferable for reusability, legibility, and collaboration.
uv manages a project's packages in a virtual environment, which helps us move our code out of notebooks and into a codebase.
Reorganizing code into multiple modules can help us reuse code in multiple places and keep our project organized.
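
As a rough sketch of what such a module could look like, the file below (hypothetically named cleaning.py) keeps a reusable function separate from the code that only runs when the file is executed as a script. The file name and function are assumptions made for this example.

```python
"""cleaning.py: a hypothetical module holding reusable cleaning helpers."""


def strip_whitespace(values):
    """Return a new list with surrounding whitespace removed from each string."""
    return [value.strip() for value in values]


def main():
    # Small demonstration; runs only when the file is executed directly.
    print(strip_whitespace(["  alpha", "beta  "]))


if __name__ == "__main__":
    main()
```

Another script or notebook in the same project could then reuse the helper with from cleaning import strip_whitespace, without triggering the demonstration in main().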