pandas can read many data formats (e.g., XML, JSON, Parquet) into
pandas DataFrames in Python. We can take advantage of this to transform
many kinds of structured and semi-structured data into a common tabular
format.
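
As a minimal sketch of this idea, the snippet below reads the same two records from a CSV string and from a JSON string. The sample data and column names are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data: the same two records in two formats.
csv_text = "site,count\nA,3\nB,5\n"
json_text = '[{"site": "A", "count": 3}, {"site": "B", "count": 5}]'

# Each reader returns a DataFrame, whatever the source format was.
df_csv = pd.read_csv(StringIO(csv_text))
df_json = pd.read_json(StringIO(json_text))

print(df_csv)
print(df_json)
```

Once both sources are DataFrames, the same downstream code can work with either one.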
The built-in help function gives access to function documentation,
which can help resolve problems when importing various data types.
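
For instance, a quick way to check which arguments a reader accepts might look like the sketch below. `pd.read_json` is used only as an example; any function works the same way.

```python
import pandas as pd

# Print the full documentation for a reader function.
help(pd.read_json)

# The same text is available as a string, which is handy for searching.
doc = pd.read_json.__doc__
print("orient" in doc)
```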
When semi-structured data contains tabular data, we can extract the
tabular data into a pandas DataFrame.
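
One possible way to do this for nested JSON is `pd.json_normalize`. The sketch below assumes a made-up response in which a list of records sits under an `observations` key; the keys and values are hypothetical.

```python
import pandas as pd

# Hypothetical API response: metadata wrapped around a list of records.
response = {
    "station": "KSEA",
    "units": "metric",
    "observations": [
        {"time": "2024-01-01T00:00", "temp": 4.5},
        {"time": "2024-01-01T01:00", "temp": 4.1},
    ],
}

# record_path points at the tabular part;
# meta copies selected top-level fields onto each row.
table = pd.json_normalize(response, record_path="observations", meta=["station"])
print(table)
```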
Beautiful Soup lets you extract links from a webpage so that you can
then download the files they point to.
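
A minimal sketch of link extraction, assuming the `bs4` package is installed. The HTML and the base URL are invented so the example runs without a network request.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page content and base URL, used instead of a live request.
base_url = "https://example.org/data/"
html = """
<html><body>
  <a href="2023.csv">2023</a>
  <a href="/archive/2022.csv">2022</a>
  <a>no link here</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True skips anchors without an href attribute;
# urljoin resolves relative links against the page's URL.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
print(links)
```

Each resulting URL could then be passed to whatever download function you already use.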
If you need more results than a single API request returns, most
APIs provide “pagination” capabilities so you can make all the requests
programmatically.
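
Pagination schemes differ between APIs, so the following is only a sketch of the general loop. `fetch_page` is a hypothetical stand-in for a real HTTP request, and the `items` and `next_page` fields are assumptions about the response shape.

```python
# Stand-in for a real API: 7 records, served at most 3 per request.
RECORDS = [{"id": i} for i in range(7)]
PAGE_SIZE = 3


def fetch_page(page):
    """Hypothetical request function; a real one would call the API over HTTP."""
    start = (page - 1) * PAGE_SIZE
    items = RECORDS[start:start + PAGE_SIZE]
    has_more = start + PAGE_SIZE < len(RECORDS)
    return {"items": items, "next_page": page + 1 if has_more else None}


def fetch_all():
    """Follow next_page until the API reports there is nothing left."""
    results = []
    page = 1
    while page is not None:
        payload = fetch_page(page)
        results.extend(payload["items"])
        page = payload["next_page"]
    return results


all_records = fetch_all()
print(len(all_records))
```

Real APIs may use a cursor token, an offset, or a link header instead of a page number, but the loop usually keeps this shape.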
Web scraping is a wide world. If you get stuck, try searching for
some of the keywords above.
Different kinds of data – indexing, categorical, numeric, time
series – are suited to different kinds of summarization and
visualization.
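
To illustrate, the sketch below applies a different summary to each kind of column. The DataFrame and its column names are made up.

```python
import pandas as pd

# Hypothetical data with one column of each kind.
df = pd.DataFrame(
    {
        "when": pd.to_datetime(
            ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-03"]
        ),
        "species": pd.Categorical(["cat", "dog", "cat", "cat"]),
        "weight": [4.0, 20.0, 5.0, 3.0],
    }
)

# Numeric: summary statistics.
print(df["weight"].describe())

# Categorical: counts per category.
counts = df["species"].value_counts()
print(counts)

# Time series: aggregate by calendar day.
daily_mean = df.set_index("when")["weight"].resample("D").mean()
print(daily_mean)
```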
Successful strategies for assessing data problems alternate between
noticing your expectations about the data and checking to see if the
data match your expectations – and sometimes, updating your expectations
based on what you find!
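
One way to make expectations explicit is to write each one down as a check and count the rows that violate it. This is a sketch with invented data and thresholds, not a complete validation approach.

```python
import pandas as pd

# Hypothetical measurements; one value is missing and one is implausible.
df = pd.DataFrame({"age": [34, 51, None, 240], "id": [1, 2, 3, 4]})

# State each expectation, then count the rows that violate it.
out_of_range = ~df["age"].between(0, 120) & df["age"].notna()
checks = {
    "age is present": df["age"].isna().sum(),
    "age is between 0 and 120": out_of_range.sum(),
    "id is unique": df["id"].duplicated().sum(),
}

for expectation, violations in checks.items():
    print(f"{expectation}: {violations} violation(s)")
```

A nonzero count is a prompt to look closer: either the data has a problem or the expectation needs updating.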
Visualization is not just for reports, papers, and talks! If you
incorporate plotting into your exploration and troubleshooting
toolbox, you’ll be able to identify and diagnose data problems much more
quickly than if you wait for your model to exhibit strange
behavior.
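
As a small illustration, the sketch below plots a histogram of made-up sensor readings in which a placeholder value stands in for missing data. It assumes matplotlib is installed; the `-999` sentinel and the output file name are invented.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sensor readings; -999 is a placeholder for "missing".
readings = pd.Series([20.1, 19.8, 21.3, -999.0, 20.7, 20.2, -999.0, 19.9])

fig, ax = plt.subplots()
ax.hist(readings, bins=20)
ax.set_xlabel("reading")
ax.set_ylabel("count")
fig.savefig("readings_hist.png")

# The mean alone gives little hint of what went wrong;
# the histogram shows two isolated bars far from the rest.
print(readings.mean())
```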
Jupyter is great for data exploration and visualization, but working
with scripts and modules is preferable for reusability, legibility and
collaboration.
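
A minimal sketch of what notebook code might look like once moved into a script: the logic lives in a function that other files can import, and the entry point is guarded. The function name and input values are hypothetical.

```python
import statistics


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }


def main():
    # Hypothetical input; a real script might read a file named on the command line.
    print(summarize([2, 4, 9]))


# Runs only when the file is executed directly, not when it is imported.
if __name__ == "__main__":
    main()
```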
uv bundles packages into a virtual environment and
helps us move our code into a codebase.
Reorganizing code into multiple modules can help us reuse code in
multiple places and keep our project organized.