VI_Lab_01_EDA/deploy/lab01_task1_datasets.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d321d996",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 43679 -- Interactive Visualization\n",
    "# 2025 - 2026\n",
    "# 2nd semester\n",
    "# Lab 1 - EDA (guided)\n",
    "# ver 1.0 - 2026-02-20 Initial version\n",
    "# ver 1.1 - 2026-02-23  Added more comments and explanations\n",
    "# ver 1.2 - 2026-02-24  Added code for additional visualizations"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d44c354e",
   "metadata": {},
   "source": [
    "# Lab 01<br>Task 1: Exploratory Data Analysis with Pandas & Seaborn\n",
    "\n",
    "This task serves two purposes. It introduces you to some of the basic tools to start understanding datasets and shows you why descriptive statistics may not be enough to understand the nature of a dataset.\n",
    "\n",
    "Also, this task also walks you through some basic visualizations of the datasets to show how the type of visualization matters when trying to understand the data.\n",
    "\n",
    "Additionally, this simple first task also serves the purpose of getting you acquainted with Jupyter notebooks.\n",
    "\n",
    "**Dataset:** `datasaurus.csv`\n",
    "\n",
    "---\n",
    "\n",
    "### Objectives\n",
    "\n",
    "By the end of this task you will be able to:\n",
    "- Use `pandas` to inspect a dataset's structure, types, and summary statistics\n",
    "- Apply grouped aggregations to compare subsets of data\n",
    "- Use `seaborn` to produce scatter plots that reveal structure invisible to statistics\n",
    "- Articulate *why* visualisation is an essential — not optional — step in data analysis\n",
    "\n",
    "---\n",
    "\n",
    "### Context\n",
    "\n",
    "The **Datasaurus Dozen** is a collection of 13 small datasets created by Matejka & Fitzmaurice (2017) to demonstrate a modern version of Anscombe's Quartet.\n",
    "\n",
    "This task will take you through the same journey a data analyst faces: you will start with raw numbers, run the usual summaries, and then discover, through visualisation, that numbers alone were hiding the story.\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "350a4fd8",
   "metadata": {},
   "source": [
    "## Part 1: Load and Inspect the Data\n",
    "\n",
    "Start by importing the libraries you need and loading the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed1a7a01",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Configure plot style\n",
    "sns.set_theme(style='whitegrid', palette='tab10')\n",
    "plt.rcParams['figure.dpi'] = 100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9cf77ef2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the dataset\n",
    "df = pd.read_csv('datasaurus.csv')\n",
    "\n",
    "# Preview the first rows\n",
    "df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2e51209",
   "metadata": {},
   "source": [
    "### 1.1. Structure and data types\n",
    "\n",
    "Before computing anything, always understand what you are working with."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a45f4e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Shape of the dataset (rows, columns)\n",
    "print('Shape:', df.shape)\n",
    "\n",
    "# Column names and data types\n",
    "print('\\nDtypes:')\n",
    "print(df.dtypes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d01329b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# How many unique sub-datasets are there, and how many rows does each contain?\n",
    "print('Unique datasets:', df['dataset'].nunique())\n",
    "print('\\nRows per dataset:')\n",
    "print(df['dataset'].value_counts())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1545a53f",
   "metadata": {},
   "source": [
    "### 1.2. Overall summary statistics\n",
    "\n",
    "Use `describe()` to get a global numerical summary of `x` and `y`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a92b670e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summary statistics for the entire dataset\n",
    "df[['x', 'y']].describe().round(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16b1a9e3",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Part 2: Grouped Statistics: The Reveal\n",
    "\n",
    "The dataset column holds 13 different named groups. Let's compute summary statistics **per group** and see if the groups differ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7693c95",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute mean and standard deviation of x and y for each sub-dataset\n",
    "grouped_stats = (\n",
    "    df.groupby('dataset')[['x', 'y']]\n",
    "    .agg(['mean', 'std'])\n",
    "    .round(2)\n",
    ")\n",
    "\n",
    "grouped_stats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "837a2552",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Also compute the Pearson correlation between x and y per group\n",
    "correlation = df.groupby('dataset').apply(lambda g: g['x'].corr(g['y'])).round(2)\n",
    "correlation.name = 'corr(x,y)'\n",
    "print(correlation)"
   ]
  },
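  {
   "cell_type": "markdown",
   "id": "stat_spread_md",
   "metadata": {},
   "source": [
    "One way to put a number on how similar or different the groups are is to look at the spread (max minus min) of each statistic across the 13 groups. The cell below is a minimal sketch that reuses `grouped_stats` and `correlation` from the cells above; the name `stat_spread` is introduced here purely for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "stat_spread_code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Spread of each per-group statistic across the 13 sub-datasets\n",
    "# (stat_spread is an illustrative name, not part of the original lab)\n",
    "stat_spread = grouped_stats.max() - grouped_stats.min()\n",
    "print(stat_spread)\n",
    "\n",
    "# Same idea for the per-group Pearson correlation\n",
    "print('corr spread:', round(correlation.max() - correlation.min(), 2))"
   ]
  },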
  {
   "cell_type": "markdown",
   "id": "c40be027",
   "metadata": {},
   "source": [
 **Question:** Look at">
    "> **Question:** Look at the tables above. Are the 13 datasets statistically different from each other?  \n",
    "> Write your answer below before moving on.\n",
    "\n",
    "*(Double-click to write your answer here)*\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc4c40dd",
   "metadata": {},
   "source": [
    "<!-- ## Part 3: Now Let us Actually Look at the Data\n",
    "\n",
    "We will focus on three sub-datasets: **`dino`**, **`star`**, and **`bullseye`**. These three were chosen because they produce a dramatic visual contrast despite their identical statistics.\n",
    "\n",
    "Later, feel free to explore the remaining 10 groups. -->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4fde0b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter to the three focus datasets\n",
    "focus = ['dino', 'star', 'bullseye']\n",
    "df_focus = df[df['dataset'].isin(focus)].copy()\n",
    "\n",
    "print(f'Rows in subset: {len(df_focus)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86d8b1b6",
   "metadata": {},
   "source": [
    "### 3.1 — Individual scatter plots"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2f4c527",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)\n",
    "\n",
    "colors = sns.color_palette('tab10', 3)\n",
    "\n",
    "for ax, name, color in zip(axes, focus, colors):\n",
    "    subset = df_focus[df_focus['dataset'] == name]\n",
    "    ax.scatter(subset['x'], subset['y'], color=color, alpha=0.7, s=40, edgecolors='white', linewidths=0.4)\n",
    "    ax.set_title(name, fontsize=14, fontweight='bold')\n",
    "    ax.set_xlabel('x')\n",
    "    ax.set_ylabel('y')\n",
    "\n",
    "fig.suptitle('Same statistics, completely different data', fontsize=16, fontweight='bold', y=1.02)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "538ecb6f",
   "metadata": {},
   "source": [
    "### 3.2 — Side-by-side with statistics overlay\n",
    "\n",
    "Let's add the mean and standard deviation annotations to make the point explicit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d677b3ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 3, figsize=(15, 5.5), sharey=True)\n",
    "\n",
    "for ax, name, color in zip(axes, focus, colors):\n",
    "    subset = df_focus[df_focus['dataset'] == name]\n",
    "    \n",
    "    ax.scatter(subset['x'], subset['y'], color=color, alpha=0.65, s=40,\n",
    "               edgecolors='white', linewidths=0.4, label='observations')\n",
    "    \n",
    "    # Mean crosshair\n",
    "    mx, my = subset['x'].mean(), subset['y'].mean()\n",
    "    ax.axvline(mx, color='black', linestyle='--', linewidth=1.0, alpha=0.6)\n",
    "    ax.axhline(my, color='black', linestyle='--', linewidth=1.0, alpha=0.6)\n",
    "    ax.scatter([mx], [my], color='black', s=80, zorder=5, label=f'mean ({mx:.1f}, {my:.1f})')\n",
    "    \n",
    "    # Stats box\n",
    "    stats_text = (\n",
    "        f\"mean x = {subset['x'].mean():.2f}\\n\"\n",
    "        f\"mean y = {subset['y'].mean():.2f}\\n\"\n",
    "        f\"sd x   = {subset['x'].std():.2f}\\n\"\n",
    "        f\"sd y   = {subset['y'].std():.2f}\\n\"\n",
    "        f\"corr   = {subset['x'].corr(subset['y']):.2f}\"\n",
    "    )\n",
    "    ax.text(0.03, 0.97, stats_text, transform=ax.transAxes,\n",
    "            fontsize=8.5, verticalalignment='top', fontfamily='monospace',\n",
    "            bbox=dict(boxstyle='round,pad=0.4', facecolor='white', alpha=0.85, edgecolor='grey'))\n",
    "    \n",
    "    ax.set_title(name, fontsize=14, fontweight='bold')\n",
    "    ax.set_xlabel('x')\n",
    "    ax.set_ylabel('y')\n",
    "\n",
    "fig.suptitle('Datasaurus Dozen — statistics are identical, shapes are not',\n",
    "             fontsize=14, fontweight='bold', y=1.01)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e295910e",
   "metadata": {},
   "source": [
    "> **❓ Question:** What would a data analyst have concluded if they had only looked at the summary statistics table?  \n",
    "> What does this tell you about when and why visualisation is necessary?\n",
    "\n",
    "*(Double-click to write your answer here)*\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86dea1fb",
   "metadata": {},
   "source": [
    "## Part 4 — Small Multiples: All 13 Datasets at Once\n",
    "\n",
    "Seaborn's `FacetGrid` makes it easy to produce a *small multiples* plot — the same chart type repeated for each group. This is a powerful pattern for comparing distributions across many categories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7eb9f5a",
   "metadata": {},
   "outputs": [],
   "source": [
    "g = sns.FacetGrid(df, col='dataset', col_wrap=5, height=3, aspect=1.0,\n",
    "                  sharex=False, sharey=False)\n",
    "g.map(sns.scatterplot, 'x', 'y', alpha=0.6, s=18, color='steelblue', edgecolor='white', linewidth=0.2)\n",
    "g.set_titles(col_template='{col_name}', size=10)\n",
    "g.figure.suptitle('All 13 Datasaurus Dozen datasets — identical statistics',\n",
    "                   fontsize=13, fontweight='bold', y=1.01)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
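  {
   "cell_type": "markdown",
   "id": "relplot_md",
   "metadata": {},
   "source": [
    "The same small-multiples figure can also be produced in a single call with seaborn's figure-level `relplot`, which builds the `FacetGrid` internally. The cell below is a sketch, assuming the same `df` as above. It keeps the default shared axes so that all panels use one scale, which is a design choice made here for easier comparison and not something the lab prescribes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "relplot_code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Figure-level alternative: one scatter panel per sub-dataset, axes shared by default\n",
    "g = sns.relplot(data=df, x='x', y='y', col='dataset', col_wrap=5,\n",
    "                kind='scatter', height=3, s=18, alpha=0.6, color='steelblue')\n",
    "g.set_titles(col_template='{col_name}', size=10)\n",
    "plt.show()"
   ]
  },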
  {
   "cell_type": "markdown",
   "id": "becc716d",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Some Exploration\n",
    "\n",
    "For each chart type below, run the cell and then answer the key question:\n",
    "\n",
    "> **Does this chart type reveal the structural differences between datasets, or does it hide them?**\n",
    "\n",
    "---\n",
    "\n",
    "### Histograms\n",
    "\n",
    "Plot the marginal distribution of `x` and `y` separately for each focus dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "83a2bc01",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(2, 3, figsize=(15, 8), sharey=True)\n",
    "\n",
    "for col_idx, var in enumerate(['x', 'y']):\n",
    "    for ax, name, color in zip(axes[col_idx], focus, colors):\n",
    "        subset = df_focus[df_focus['dataset'] == name]\n",
    "        sns.histplot(subset[var], ax=ax, color=color, bins=15, kde=False)\n",
    "        ax.set_title(f'{name} — {var}', fontsize=12, fontweight='bold')\n",
    "        ax.set_xlabel(var)\n",
    "\n",
    "fig.suptitle('Histograms — marginal distributions of x and y per dataset',\n",
    "             fontsize=13, fontweight='bold', y=1.01)\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "exploration_md",
   "metadata": {},
   "source": [
    "> **Answer:** Partially. Histograms show the marginal distribution of one variable at a time, so they reveal that the datasets differ along each axis individually. But they lose all information about the *relationship* between x and y — you cannot see the dinosaur or the star from a histogram alone. They reveal more than summary statistics, but less than a scatterplot.\n",
    "\n",
    "---\n",
    "\n",
    "### KDE plots\n",
    "\n",
    "Overlay density curves for the three focus datasets on the same axis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3cc44f9f",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "for ax, var in zip(axes, ['x', 'y']):\n",
    "    sns.kdeplot(data=df_focus, x=var, hue='dataset', ax=ax, fill=True, alpha=0.3, linewidth=1.5)\n",
    "    ax.set_title(f'KDE of {var} — three focus datasets', fontsize=12, fontweight='bold')\n",
    "    ax.set_xlabel(var)\n",
    "\n",
    "fig.suptitle('KDE plots — overlaid density curves per dataset',\n",
    "             fontsize=13, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "exploration_md",
   "metadata": {},
   "source": [
    "> **Answer:** Same limitation as histograms — KDE plots show the marginal density of one variable at a time. The three curves look somewhat different from each other (especially for y), but you cannot reconstruct the actual shapes from them. The structural difference between dino, star, and bullseye is heavily underrepresented.\n",
    "\n",
    "---\n",
    "\n",
    "### Pair plots\n",
    "\n",
    "Plot all pairwise combinations of variables, coloured by dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "exploration_cell",
   "metadata": {},
   "outputs": [],
   "source": [
    "g = sns.pairplot(df_focus, hue='dataset', plot_kws={'alpha': 0.5, 's': 20},\n",
    "                 diag_kind='kde', height=3.5)\n",
    "g.figure.suptitle('Pair plot — dino, star, bullseye', fontsize=13,\n",
    "                   fontweight='bold', y=1.01)\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "exploration_md",
   "metadata": {},
   "source": [
    "> **Answer:** Yes — the off-diagonal scatter plot (x vs y) fully reveals the structural differences, showing the dinosaur, star, and bullseye shapes clearly. The diagonal KDE plots add the marginal distributions. For a dataset with only two variables the pair plot is essentially a scatter plot with extras, but the pattern scales well to datasets with many variables.\n",
    "\n",
    "---\n",
    "\n",
    "### Box plots\n",
    "\n",
    "Summarise the distribution of `x` and `y` per dataset using box plots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "exploration_cell",
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "for ax, var in zip(axes, ['x', 'y']):\n",
    "    sns.boxplot(data=df_focus, x='dataset', y=var, ax=ax,\n",
    "                palette='tab10', width=0.5, linewidth=1.2)\n",
    "    ax.set_title(f'Box plot of {var} per dataset', fontsize=12, fontweight='bold')\n",
    "    ax.set_xlabel('dataset')\n",
    "    ax.set_ylabel(var)\n",
    "\n",
    "fig.suptitle('Box plots — do they reveal the structural differences?',\n",
    "             fontsize=13, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "exploration_md",
   "metadata": {},
   "source": [
 **Answer:** No">
    "> **Answer:** Mostly no, and this is the most important result. For both x and y the three box plots look broadly alike: the boxes and whiskers cover similar ranges, and any differences in medians or quartiles give no hint of the underlying shapes. A box plot reduces each group to a five-number summary (by default seaborn draws whiskers out to 1.5 × IQR and marks points beyond them as outliers), so it suffers from the same blind spot as the summary statistics table. The dinosaur, star, and bullseye are completely invisible. Some chart types hide structure rather than reveal it, and box plots are a prime example."
   ]
  },
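  {
   "cell_type": "markdown",
   "id": "five_num_md",
   "metadata": {},
   "source": [
    "To check what the box plots actually encode, the quartiles can be computed directly. The cell below is a minimal sketch, assuming `df_focus` as defined earlier in this notebook; `five_num` is an illustrative name. Comparing its rows shows how much, or how little, the five-number summaries separate the three datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "five_num_code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Five-number summary (min, Q1, median, Q3, max) of x and y per focus dataset\n",
    "# (five_num is an illustrative name, not part of the original lab)\n",
    "five_num = (\n",
    "    df_focus.groupby('dataset')[['x', 'y']]\n",
    "    .quantile([0, 0.25, 0.5, 0.75, 1])\n",
    "    .unstack()\n",
    "    .round(2)\n",
    ")\n",
    "\n",
    "five_num"
   ]
  },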
  {
   "cell_type": "markdown",
   "id": "3c09cd29",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Key Takeaways\n",
    "\n",
    "- Summary statistics (mean, SD, correlation) can be completely identical across datasets with totally different structure\n",
    "- Visualisation is not a finishing step — it is a **diagnostic step** that must happen early\n",
    "- Different chart types reveal different aspects: scatterplots show point-level structure, histograms show marginal distributions, box plots summarise spread but can hide shape\n",
    "- The small multiples pattern (FacetGrid) is a powerful way to compare many groups at a glance\n",
    "\n",
    "--> In **Task 2**, you will move to a real-world dataset with real problems — and discover that the \"hard work\" you just did manually can be partially automated."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}