Add a comprehensive README that explains how to prepare a Python virtual environment, install packages, and register an ipykernel for running the lab notebooks in VS Code. Include two starter notebooks: TASK0_Datasaurus_Starter.ipynb (datasaurus warm-up with data loading, summary stats, optional SweetViz report, and faceted scatter plotting) and VisInt_Lab_01_Task_0.ipynb (lab header/metadata). Add .gitignore to exclude a local /.venv directory.
185 lines
4.8 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Task 0: Datasaurus Warm-Up (Starter)\n",
    "\n",
    "**Goal:** Show why we must *always* visualize the data: compare groups that have nearly identical summary statistics but look very different when plotted.\n",
    "\n",
    "This starter uses `datasaurus_task0.csv` (long format) with four groups: `dino`, `star`, `circle`, `bullseye`.\n",
    "\n",
    "**What to do:**\n",
    "1. Load the CSV (as strings first).\n",
    "2. Compute basic stats per group (mean, std, correlation).\n",
    "3. Generate a SweetViz report (optional but recommended).\n",
    "4. Plot x vs y for each group (facet grid).\n",
    "5. Write *2–4 sentences* reflecting on why summary stats were misleading.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0) (Optional) Install packages in this environment\n",
    "Uncomment the install line in the next cell if any package is missing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "setup"
    ]
   },
   "outputs": [],
   "source": [
    "# %pip install -q numpy pandas seaborn matplotlib sweetviz dtale\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1) Load the dataset\n",
    "Load every column as a string first (this avoids surprises from automatic type inference), then coerce the numeric columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Read all columns as strings; numeric columns are coerced in the next step.\n",
    "csv_path = 'datasaurus_task0.csv'\n",
    "df_raw = pd.read_csv(csv_path, dtype=str)\n",
    "df_raw.head()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Coerce numeric columns and quick sanity checks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df_raw.copy()\n",
    "# Values that cannot be parsed as numbers become NaN (errors='coerce').\n",
    "for c in ['x', 'y']:\n",
    "    df[c] = pd.to_numeric(df[c], errors='coerce')\n",
    "df.info()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2) Basic summary stats by group (fill in)\n",
    "Compute the mean and standard deviation of `x` and `y` for each `dataset`, and the correlation between `x` and `y` within each group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: groupby summaries\n",
    "# g = df.groupby('dataset')\n",
    "# means = g[['x', 'y']].mean()\n",
    "# stds = g[['x', 'y']].std()\n",
    "# corr = g[['x', 'y']].apply(lambda d: d['x'].corr(d['y']))\n",
    "# means, stds, corr\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3) (Optional) SweetViz profile\n",
    "Generate a quick report to observe that top-level stats look very similar."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import sweetviz as sv\n",
    "# report = sv.analyze(df)\n",
    "# report.show_html('task0_sweetviz_report.html')\n",
    "# print('Wrote task0_sweetviz_report.html')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4) Visualize: scatter by group (facet)\n",
    "Create a facet grid with one subplot per dataset and compare shapes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "sns.set_theme(style='white', context='notebook')\n",
    "# One panel per dataset, with shared axes so the shapes are directly comparable.\n",
    "g = sns.FacetGrid(df, col='dataset', col_wrap=2, height=4, sharex=True, sharey=True)\n",
    "g.map_dataframe(sns.scatterplot, x='x', y='y', s=20, edgecolor=None)\n",
    "g.set_titles('{col_name}')\n",
    "for ax in g.axes.flatten():\n",
    "    ax.set_xlabel('x')\n",
    "    ax.set_ylabel('y')\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5) Reflection (write here)\n",
    "**Prompt:** If the per-group mean, variance, and correlation are similar, why do the plots look different?\n",
    "- Which shapes do you observe?\n",
    "- What does this imply for relying solely on `.describe()` or correlation before plotting?\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
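
Step 2 of the starter notebook leaves the per-group summary as a TODO. The following is only a sketch of how steps 1 and 2 might fit together. It assumes a tiny inline CSV as a hypothetical stand-in for `datasaurus_task0.csv`, with the same `dataset`, `x`, `y` columns; the group names `a` and `b`, the values, and the `summary` table layout are illustrative and not part of the lab data.

```python
import io

import pandas as pd

# Hypothetical stand-in for datasaurus_task0.csv (long format: dataset, x, y).
# One y value is deliberately malformed to show what coercion does.
csv_text = """dataset,x,y
a,1,2
a,2,4
a,3,6
b,1,6
b,2,4
b,3,oops
b,3,2
"""

# Step 1: read every column as a string, then coerce x and y to numbers.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)
for col in ["x", "y"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Values that failed to parse are now NaN; drop those rows before summarizing.
df = df.dropna(subset=["x", "y"])

# Step 2: per-group mean, standard deviation, and Pearson correlation.
grouped = df.groupby("dataset")
means = grouped[["x", "y"]].mean()
stds = grouped[["x", "y"]].std()
corr = grouped[["x", "y"]].apply(lambda d: d["x"].corr(d["y"]))

# Combine everything into one table with one row per group.
summary = means.add_suffix("_mean").join(stds.add_suffix("_std"))
summary["xy_corr"] = corr
print(summary.round(3))
```

In this toy data the two groups share the same means and standard deviations while their correlations have opposite signs, so it is not a substitute for the real Datasaurus groups, where the correlations are also nearly identical. The same calls should apply unchanged once `df` is loaded from the actual CSV.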