VI_Lab_01_EDA/TASK0_Datasaurus_Starter.ipynb
sssilva1980 52e38435fa Add README, starter notebooks, and .gitignore
Add a comprehensive README that explains how to prepare a Python virtual environment, install packages, and register an ipykernel for running the lab notebooks in VS Code. Include two starter notebooks: TASK0_Datasaurus_Starter.ipynb (datasaurus warm-up with data loading, summary stats, optional SweetViz report, and faceted scatter plotting) and VisInt_Lab_01_Task_0.ipynb (lab header/metadata). Add .gitignore to exclude a local /.venv directory.
2026-02-21 16:33:46 +00:00

185 lines
4.8 KiB
Plaintext
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Task 0: Datasaurus Warm-Up (Starter)\n",
"\n",
"**Goal:** Show why we must *always* visualize data: compare groups whose summary statistics are nearly identical but whose shapes differ sharply when plotted.\n",
"\n",
"This starter uses `datasaurus_task0.csv` (long format) with four groups: `dino`, `star`, `circle`, `bullseye`.\n",
"\n",
"**What to do:**\n",
"1. Load the CSV (as strings first).\n",
"2. Compute basic stats per group (mean, std, correlation).\n",
"3. Generate a SweetViz report (optional but recommended).\n",
"4. Plot x vs y for each group (facet grid).\n",
"5. Write *2-4 sentences* reflecting on why summary stats were misleading.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 0) (Optional) Install packages in this environment\n",
"Uncomment if needed."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"tags": [
"setup"
]
},
"outputs": [],
"source": [
"# %pip install -q numpy pandas seaborn matplotlib sweetviz dtale\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1) Load the dataset\n",
"Load as strings first (safer), then coerce numeric columns."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [
{
"ename": "",
"evalue": "",
"output_type": "error",
"traceback": [
"\u001b[1;31mRunning cells with '.venv (Python 3.11.9)' requires the ipykernel package.\n",
"\u001b[1;31mInstall 'ipykernel' into the Python environment. \n",
"\u001b[1;31mCommand: 'd:/Projects/43679_InteractiveVis/VI_Lab_01_EDA/.venv/Scripts/python.exe -m pip install ipykernel -U --force-reinstall'"
]
}
],
"source": [
"import pandas as pd\n",
"csv_path = 'datasaurus_task0.csv'\n",
"df_raw = pd.read_csv(csv_path, dtype=str)\n",
"df_raw.head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Coerce numeric columns and quick sanity checks"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"df = df_raw.copy()\n",
"for c in ['x', 'y']:\n",
"    df[c] = pd.to_numeric(df[c], errors='coerce')\n",
"df.info()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2) Basic summary stats by group (fill in)\n",
"Compute mean, std for x & y by `dataset`, and the correlation within each group."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# TODO: groupby summaries\n",
"# g = df.groupby('dataset')\n",
"# means = g[['x','y']].mean()\n",
"# stds = g[['x','y']].std()\n",
"# corr = g[['x','y']].apply(lambda d: d['x'].corr(d['y']))\n",
"# means, stds, corr\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3) (Optional) SweetViz profile\n",
"Generate a quick report to observe that top-level stats look very similar."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# import sweetviz as sv\n",
"# report = sv.analyze(df)\n",
"# report.show_html('task0_sweetviz_report.html')\n",
"# print('Wrote task0_sweetviz_report.html')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4) Visualize — scatter by group (facet)\n",
"Create a facet grid with one subplot per dataset and compare shapes."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"sns.set_theme(style='white', context='notebook')\n",
"g = sns.FacetGrid(df, col='dataset', col_wrap=2, height=4, sharex=True, sharey=True)\n",
"g.map_dataframe(sns.scatterplot, x='x', y='y', s=20, edgecolor=None)\n",
"g.set_titles('{col_name}')\n",
"for ax in g.axes.flatten():\n",
"    ax.set_xlabel('x')\n",
"    ax.set_ylabel('y')\n",
"plt.tight_layout()\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5) Reflection (write here)\n",
"**Prompt:** If the per-group mean/variance/correlation were similar, why do the plots look different?\n",
"- Which shapes do you observe?\n",
"- What does this imply for relying solely on `.describe()` or correlation before plotting?\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
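
Step 2 of the notebook leaves the per-group statistics as a TODO. The following is a minimal sketch of one way to fill it in, not the lab's reference solution. It assumes the long-format columns `dataset`, `x`, and `y` named in the notebook; the inline CSV text and the names `csv_text`, `df_demo`, `summary`, and `corr` are made up for illustration and stand in for `datasaurus_task0.csv`, which is not reproduced here.

```python
import io

import pandas as pd

# Hypothetical stand-in for datasaurus_task0.csv: same long-format columns
# (dataset, x, y) but invented values, including one non-numeric entry.
csv_text = """dataset,x,y
a,1,2
a,2,4
a,3,6
b,1,6
b,2,4
b,3,oops
b,4,0
"""

# Read everything as strings first, then coerce; unparseable values become NaN.
df_demo = pd.read_csv(io.StringIO(csv_text), dtype=str)
for col in ["x", "y"]:
    df_demo[col] = pd.to_numeric(df_demo[col], errors="coerce")

grouped = df_demo.groupby("dataset")

# Per-group mean and standard deviation of x and y in a single table.
summary = grouped[["x", "y"]].agg(["mean", "std"])

# Per-group Pearson correlation between x and y.
# Series.corr ignores rows where either value is NaN.
corr = grouped[["x", "y"]].apply(lambda d: d["x"].corr(d["y"]))

print(summary)
print(corr)
```

Selecting `[["x", "y"]]` before calling `apply` keeps the grouping column out of each sub-frame, so the lambda only ever sees the two numeric columns. On the real data, the point of the exercise is that these tables come out nearly identical across groups even though the scatter plots do not.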