Key Points

Introduction


  • Open data principles such as reproducibility, transparency, and collaboration make it easier to share, interpret, and build upon research projects.
  • Enacting these values doesn’t ‘just happen’ - it requires specific skills and strategies.

Handling diverse filetypes in Pandas


  • pandas can read many data formats (e.g., XML, JSON, Parquet) into DataFrames in Python. We can take advantage of this to transform many kinds of structured and semi-structured data into similarly formatted data.
  • Python's built-in help() function can be used to access function documentation, which offers a way to resolve problems when importing different data types.
  • When semi-structured data contains tabular data, we can extract the tabular part into a pandas DataFrame.
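
As a minimal sketch of the last point, the example below pulls the tabular part out of a semi-structured JSON document. The payload and its field names (`station`, `readings`, `site`) are invented for illustration; only `pd.json_normalize` and `pd.read_json` are real pandas functions.

```python
import io
import json

import pandas as pd

# Hypothetical semi-structured payload: metadata wrapped around tabular records.
raw = """
{
  "station": "demo-01",
  "units": {"temp": "C"},
  "readings": [
    {"time": "2024-01-01", "temp": 3.5, "site": {"name": "A"}},
    {"time": "2024-01-02", "temp": 4.1, "site": {"name": "B"}}
  ]
}
"""

payload = json.loads(raw)

# Keep only the tabular part; nested keys become dotted column names.
readings = pd.json_normalize(payload["readings"])

# A flat list of records can be read directly from a file-like object.
flat = pd.read_json(io.StringIO('[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]'))
```

If an import misbehaves, `help(pd.read_json)` or `help(pd.json_normalize)` lists the available parameters.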

Accessing remote data


  • requests is a Swiss-army knife for accessing remote data.
  • response.status_code tells you whether the request succeeded and, if it did not, the broad category of failure.
  • response.text gives you the raw response body, which is useful for checking that the data is formatted as you expect.
  • response.json() parses the response body as JSON.
  • Web APIs can be thought of as collections of fancy URLs, which you can interact with via requests.
  • Each web API is different; to learn one, read its documentation and experiment with requests to see how it responds.
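
The pieces above might fit together roughly as follows. This is a sketch rather than a recommended client: `fetch_json` is a hypothetical helper, and its `get` parameter exists only so the function can be tried without a network connection.

```python
import requests


def fetch_json(url, params=None, get=requests.get):
    """Request `url` and return the parsed JSON body.

    `get` defaults to requests.get; it is a parameter only so that the
    function can be exercised with a stand-in when offline.
    """
    response = get(url, params=params, timeout=10)
    if response.status_code != 200:
        # Include the start of the raw body to help diagnose the failure.
        raise RuntimeError(
            f"Request failed with status {response.status_code}: "
            f"{response.text[:200]}"
        )
    return response.json()


# Example call (the URL and parameter are placeholders, not a real API):
# records = fetch_json("https://api.example.com/v1/records", params={"limit": 5})
```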

Scraping Data


  • Beautiful Soup lets you grab links out of a webpage so that you can then download the files they point to.
  • If you need more than one request's worth of results from an API, it usually provides some “pagination” capability so you can make all the requests programmatically.
  • Web scraping is a wide world. If you get stuck, try searching for some of the keywords above.
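
A rough sketch of both ideas is shown below. `extract_links` uses Beautiful Soup's `find_all`; `fetch_all_pages` assumes a hypothetical `fetch_page(page)` callable that returns an empty list once the pages run out. Real APIs paginate in different ways (page numbers, offsets, cursors), so the loop would need adapting to the API's documentation.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def fetch_all_pages(fetch_page, max_pages=100):
    """Collect results across pages.

    `fetch_page(page)` is a hypothetical callable returning a list of results
    for a 1-based page number, and an empty list past the last page.
    `max_pages` is a safety cap so the loop always terminates.
    """
    results = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break
        results.extend(batch)
    return results
```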

Visual Data Exploration


  • Different kinds of data – indexing, categorical, numeric, time series – are suited to different kinds of summarization and visualization.
  • Successful strategies for assessing data problems alternate between noticing your expectations about the data and checking to see if the data match your expectations – and sometimes, updating your expectations based on what you find!
  • Visualization is not just for reports, papers, and talks! If you incorporate plotting into your exploration & troubleshooting toolbox, you’ll be able to identify and diagnose data problems much more quickly than if you wait for your model to exhibit strange behavior.
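
To illustrate matching the plot to the kind of data, here is a small sketch using pandas plotting on a made-up table. The column names and values are invented, and the plot choices (histogram, bar chart of counts, line plot) are common defaults rather than the only reasonable ones.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data with one column of each kind.
df = pd.DataFrame(
    {
        "date": pd.date_range("2024-01-01", periods=6, freq="D"),
        "site": ["A", "B", "A", "C", "B", "A"],
        "temp": [3.5, 4.1, 2.9, 40.0, 3.8, 4.4],  # 40.0 is deliberately suspicious
    }
)

fig, (ax_num, ax_cat, ax_time) = plt.subplots(1, 3, figsize=(12, 3))

# Numeric: a histogram shows the distribution and makes outliers stand out.
df["temp"].plot.hist(ax=ax_num, title="numeric: histogram")

# Categorical: counts per category reveal unexpected or rare labels.
df["site"].value_counts().plot.bar(ax=ax_cat, title="categorical: counts")

# Time series: a line plot against time reveals gaps and spikes.
df.set_index("date")["temp"].plot(ax=ax_time, title="time series: line")

fig.tight_layout()
```

In this made-up table, the 40.0 reading sits far from the other values in both the histogram and the line plot, which is the kind of mismatch with expectations worth investigating.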

Making assumptions about your data


  • You’re always making assumptions about your data, and many of them are likely to be wrong, so you need to check them.
  • You can prioritize assumptions by thinking about their impact, likelihood, and testability.
  • You can use assert statements to check an assumption every time the code runs, so you find out as soon as it is wrong.
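
A minimal sketch of assumption checks with assert, assuming a hypothetical readings table with `id` and `temp` columns. The plausible temperature range is an illustrative choice, not a rule from the lesson.

```python
import pandas as pd


def check_assumptions(df):
    """Fail loudly if the (hypothetical) readings table breaks an assumption."""
    assert df["id"].is_unique, "ids should be unique"
    assert df["temp"].notna().all(), "temp should have no missing values"
    assert df["temp"].between(-90, 60).all(), "temp should be a plausible value in C"
    return df
```

Calling `check_assumptions(df)` right after loading the data means a broken assumption stops the run with a message instead of silently flowing into later analysis.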

Modularization


  • Plain language descriptions can help us choose which code to reorganize by identifying goals and intent.
  • We can attach our descriptions directly to our functions using docstrings.

Escape from Jupyter!


  • Jupyter is great for data exploration and visualization, but working with scripts and modules is preferable for reusability, legibility, and collaboration.
  • uv manages a project's packages in a virtual environment, and helps us move our code out of notebooks and into a codebase.
  • Reorganizing code into multiple modules can help us reuse code in multiple places and keep our project organized.
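
As a sketch of what a reusable module can look like, the file below (hypothetically named `clean.py`) defines a function that other modules can import, plus a `main` function that runs only when the file is executed as a script.

```python
"""clean.py: a hypothetical module in a small project."""
import sys


def drop_blank_lines(lines):
    """Return the lines that contain something other than whitespace."""
    return [line for line in lines if line.strip()]


def main(argv=None):
    """Print the non-blank command-line arguments, one per line."""
    args = sys.argv[1:] if argv is None else argv
    for line in drop_blank_lines(args):
        print(line)
    return 0


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    main()
```

Another module in the same project could then reuse the function with `from clean import drop_blank_lines`, while the file remains runnable as a script on its own.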

Making sure your system is behaving