{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Task 0: Datasaurus Warm‑Up (Starter)\n", "\n", "**Goal:** Show why we must *always* visualize data: compare groups whose summary statistics are nearly identical but whose shapes differ sharply when plotted.\n", "\n", "This starter uses `datasaurus_task0.csv` (long format) with four groups: `dino`, `star`, `circle`, `bullseye`.\n", "\n", "**What to do:**\n", "1. Load the CSV (as strings first).\n", "2. Compute basic stats per group (mean, std, correlation).\n", "3. Generate a SweetViz report (optional but recommended).\n", "4. Plot x vs y for each group (facet grid).\n", "5. Write *2–4 sentences* reflecting on why summary stats were misleading.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0) (Optional) Install packages in this environment\n", "Uncomment if needed." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "tags": [ "setup" ] }, "outputs": [], "source": [ "# !pip install -q numpy pandas seaborn matplotlib sweetviz dtale\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1) Load the dataset\n", "Load every column as a string first so pandas does not guess types, then coerce the numeric columns explicitly." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Read every column as text; x and y are converted in the next cell.\n", "csv_path = 'datasaurus_task0.csv'\n", "df_raw = pd.read_csv(csv_path, dtype=str)\n", "df_raw.head()\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Coerce numeric columns and run quick sanity checks" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "df = df_raw.copy()\n", "# Convert x and y to numbers; values that cannot be parsed become NaN.\n", "for c in ['x', 'y']:\n", "    df[c] = pd.to_numeric(df[c], errors='coerce')\n", "df.info()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2) Basic summary stats by group (fill in)\n", "Compute the mean and standard deviation of `x` and `y` for each `dataset`, and the correlation between `x` and `y` within each group." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# TODO: groupby summaries\n", "# g = df.groupby('dataset')\n", "# means = g[['x', 'y']].mean()\n", "# stds = g[['x', 'y']].std()\n", "# corr = g[['x', 'y']].apply(lambda d: d['x'].corr(d['y']))\n", "# means, stds, corr\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3) (Optional) SweetViz profile\n", "Generate a quick report to observe that top-level stats look very similar." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# import sweetviz as sv\n", "# report = sv.analyze(df)\n", "# report.show_html('task0_sweetviz_report.html')\n", "# print('Wrote task0_sweetviz_report.html')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4) Visualize: scatter by group (facet)\n", "Create a facet grid with one subplot per dataset and compare shapes."
] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "sns.set_theme(style='white', context='notebook')\n", "# One panel per dataset; shared axes keep the shapes directly comparable.\n", "g = sns.FacetGrid(df, col='dataset', col_wrap=2, height=4, sharex=True, sharey=True)\n", "g.map_dataframe(sns.scatterplot, x='x', y='y', s=20, edgecolor=None)\n", "g.set_titles('{col_name}')\n", "for ax in g.axes.flat:\n", "    ax.set_xlabel('x')\n", "    ax.set_ylabel('y')\n", "plt.tight_layout()\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5) Reflection (write here)\n", "**Prompt:** If the per-group mean, standard deviation, and correlation are nearly identical, why do the plots look so different?\n", "- Which shapes do you observe?\n", "- What does this imply about relying solely on `.describe()` or correlation before plotting?\n" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }