{ "cells": [
{ "cell_type": "markdown", "id": "d44c354e", "metadata": {}, "source": [ "# Lab 01, Task 1: Exploratory Data Analysis with Pandas & Seaborn\n", "\n", "This task serves two purposes: it introduces some basic tools for understanding a dataset, and it shows why descriptive statistics alone may not be enough to understand a dataset's nature.\n", "\n", "It also gets you acquainted with Jupyter notebooks.\n", "\n", "**Dataset:** `datasaurus.csv`\n", "\n", "---\n", "\n", "### Objectives\n", "\n", "By the end of this task you will be able to:\n", "- Use `pandas` to inspect a dataset's structure, types, and summary statistics\n", "- Apply grouped aggregations to compare subsets of data\n", "- Use `seaborn` to produce scatter plots that reveal structure invisible to summary statistics\n", "- Articulate *why* visualisation is an essential, not optional, step in data analysis\n", "\n", "---\n", "\n", "### Context\n", "\n", "The **Datasaurus Dozen** is a collection of 13 small datasets created by Matejka & Fitzmaurice (2017) to demonstrate a modern version of Anscombe's Quartet.\n", "\n", "This task takes you through the same journey a data analyst faces: you will start with raw numbers, run the usual summaries, and then discover, through visualisation, that the numbers alone were hiding the story.\n", "\n", "---" ] },
{ "cell_type": "markdown", "id": "350a4fd8", "metadata": {}, "source": [ "## Part 1: Load and Inspect the Data\n", "\n", "Start by importing the libraries you need and loading the dataset."
] },
{ "cell_type": "code", "execution_count": 1, "id": "ed1a7a01", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "# Configure plot style\n", "sns.set_theme(style='whitegrid', palette='tab10')\n", "plt.rcParams['figure.dpi'] = 100" ] },
{ "cell_type": "code", "execution_count": 2, "id": "9cf77ef2", "metadata": {}, "outputs": [], "source": [ "# Load the dataset\n", "df = pd.read_csv('datasaurus.csv')\n", "\n", "# Preview the first rows\n", "df.head(10)" ] },
{ "cell_type": "markdown", "id": "a2e51209", "metadata": {}, "source": [ "### 1.1. Structure and data types\n", "\n", "Before computing anything, always understand what you are working with." ] },
{ "cell_type": "code", "execution_count": 3, "id": "6a45f4e3", "metadata": {}, "outputs": [], "source": [ "# Shape of the dataset (rows, columns)\n", "print('Shape:', df.shape)\n", "\n", "# Column names and data types\n", "print('\\nDtypes:')\n", "print(df.dtypes)" ] },
{ "cell_type": "code", "execution_count": 4, "id": "d01329b3", "metadata": {}, "outputs": [], "source": [ "# How many unique sub-datasets are there, and how many rows does each contain?\n", "print('Unique datasets:', df['dataset'].nunique())\n", "print('\\nRows per dataset:')\n", "print(df['dataset'].value_counts())" ] },
{ "cell_type": "markdown", "id": "1545a53f", "metadata": {}, "source": [ "### 1.2. Overall summary statistics\n", "\n", "Use `describe()` to get a global numerical summary of `x` and `y`." ] },
{ "cell_type": "code", "execution_count": 5, "id": "a92b670e", "metadata": {}, "outputs": [], "source": [ "# Summary statistics for the entire dataset\n", "df[['x', 'y']].describe().round(2)" ] },
{ "cell_type": "markdown", "id": "16b1a9e3", "metadata": {}, "source": [ "---\n", "\n", "## Part 2: Grouped Statistics: The Reveal\n", "\n", "The `dataset` column holds 13 different named groups. Let's compute summary statistics **per group** and see if the groups differ." ] },
{ "cell_type": "code", "execution_count": 6, "id": "e7693c95", "metadata": {}, "outputs": [], "source": [ "# Compute mean and standard deviation of x and y for each sub-dataset\n", "grouped_stats = (\n", "    df.groupby('dataset')[['x', 'y']]\n", "    .agg(['mean', 'std'])\n", "    .round(2)\n", ")\n", "\n", "grouped_stats" ] },
{ "cell_type": "code", "execution_count": 7, "id": "837a2552", "metadata": {}, "outputs": [], "source": [ "# Also compute the Pearson correlation between x and y per group.\n", "# Selecting ['x', 'y'] first keeps the grouping column out of apply().\n", "correlation = df.groupby('dataset')[['x', 'y']].apply(lambda g: g['x'].corr(g['y'])).round(2)\n", "correlation.name = 'corr(x,y)'\n", "print(correlation)" ] },
{ "cell_type": "markdown", "id": "c40be027", "metadata": {}, "source": [ "> **Question:** Look at the tables above. Are the 13 datasets statistically different from each other?  \n", "> Write your answer below before moving on.\n", "\n", "*(Double-click to write your answer here)*\n", "\n", "---" ] },
{ "cell_type": "markdown", "id": "cc4c40dd", "metadata": {}, "source": [ "## Part 3: Visualising Three Datasets\n", "\n", "To see what the statistics hide, plot three of the sub-datasets: `dino`, `star`, and `bullseye`." ] },
{ "cell_type": "code", "execution_count": 8, "id": "d4fde0b1", "metadata": {}, "outputs": [], "source": [ "# Filter to the three focus datasets\n", "focus = ['dino', 'star', 'bullseye']\n", "df_focus = df[df['dataset'].isin(focus)].copy()\n", "\n", "print(f'Rows in subset: {len(df_focus)}')" ] },
{ "cell_type": "markdown", "id": "86d8b1b6", "metadata": {}, "source": [ "### 3.1. Individual scatter plots" ] },
{ "cell_type": "code", "execution_count": 9, "id": "c2f4c527", "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)\n", "\n", "colors = sns.color_palette('tab10', 3)\n", "\n", "# One panel per focus dataset, each in its own colour\n", "for ax, name, color in zip(axes, focus, colors):\n", "    subset = df_focus[df_focus['dataset'] == name]\n", "    ax.scatter(subset['x'], subset['y'], color=color, alpha=0.7, s=40, edgecolors='white', linewidths=0.4)\n", "    ax.set_title(name, fontsize=14, fontweight='bold')\n", "    ax.set_xlabel('x')\n", "    ax.set_ylabel('y')\n", "\n", "fig.suptitle('Same statistics, completely different data', fontsize=16, fontweight='bold', y=1.02)\n", "plt.tight_layout()\n", "plt.show()" ] },
{ "cell_type": "markdown", "id": "538ecb6f", "metadata": {}, "source": [ "### 3.2. Side-by-side with statistics overlay\n", "\n", "Let's annotate each plot with its mean, standard deviation, and correlation to make the point explicit." ] },
{ "cell_type": "code", "execution_count": 10, "id": "d677b3ec", "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 3, figsize=(15, 5.5), sharey=True)\n", "\n", "# Reuses `colors` from the previous cell\n", "for ax, name, color in zip(axes, focus, colors):\n", "    subset = df_focus[df_focus['dataset'] == name]\n", "\n", "    ax.scatter(subset['x'], subset['y'], color=color, alpha=0.65, s=40,\n", "               edgecolors='white', linewidths=0.4, label='observations')\n", "\n", "    # Mean crosshair\n", "    mx, my = subset['x'].mean(), subset['y'].mean()\n", "    ax.axvline(mx, color='black', linestyle='--', linewidth=1.0, alpha=0.6)\n", "    ax.axhline(my, color='black', linestyle='--', linewidth=1.0, alpha=0.6)\n", "    ax.scatter([mx], [my], color='black', s=80, zorder=5, label=f'mean ({mx:.1f}, {my:.1f})')\n", "\n", "    # Stats box\n", "    stats_text = (\n", "        f\"mean x = {subset['x'].mean():.2f}\\n\"\n", "        f\"mean y = {subset['y'].mean():.2f}\\n\"\n", "        f\"sd x = {subset['x'].std():.2f}\\n\"\n", "        f\"sd y = {subset['y'].std():.2f}\\n\"\n", "        f\"corr = {subset['x'].corr(subset['y']):.2f}\"\n", "    )\n", "    ax.text(0.03, 0.97, stats_text, transform=ax.transAxes,\n", "            fontsize=8.5, verticalalignment='top', fontfamily='monospace',\n", "            bbox=dict(boxstyle='round,pad=0.4', facecolor='white', alpha=0.85, edgecolor='grey'))\n", "\n", "    ax.set_title(name, fontsize=14, fontweight='bold')\n", "    ax.set_xlabel('x')\n", "    ax.set_ylabel('y')\n", "\n", "fig.suptitle('Datasaurus Dozen: statistics are nearly identical, shapes are not',\n", "             fontsize=14, fontweight='bold', y=1.01)\n", "plt.tight_layout()\n", "plt.show()" ] },
{
"cell_type": "markdown", "id": "e295910e", "metadata": {}, "source": [ "> **❓ Question:** What would a data analyst have concluded if they had only looked at the summary statistics table?  \n", "> What does this tell you about when and why visualisation is necessary?\n", "\n", "*(Double-click to write your answer here)*\n", "\n", "---" ] },
{ "cell_type": "markdown", "id": "86dea1fb", "metadata": {}, "source": [ "## Part 4: Small Multiples: All 13 Datasets at Once\n", "\n", "Seaborn's `FacetGrid` makes it easy to produce a *small multiples* plot, in which the same chart type is repeated for each group. This is a powerful pattern for comparing distributions across many categories." ] },
{ "cell_type": "code", "execution_count": 11, "id": "d7eb9f5a", "metadata": {}, "outputs": [], "source": [ "# One scatter plot per sub-dataset, wrapped into rows of five\n", "g = sns.FacetGrid(df, col='dataset', col_wrap=5, height=3, aspect=1.0,\n", "                  sharex=False, sharey=False)\n", "g.map(sns.scatterplot, 'x', 'y', alpha=0.6, s=18, color='steelblue', edgecolor='white', linewidth=0.2)\n", "g.set_titles(col_template='{col_name}', size=10)\n", "g.figure.suptitle('All 13 Datasaurus Dozen datasets: nearly identical statistics',\n", "                  fontsize=13, fontweight='bold', y=1.01)\n", "plt.tight_layout()\n", "plt.show()" ] },
{ "cell_type": "markdown", "id": "becc716d", "metadata": {}, "source": [ "---\n", "\n", "## ✏️ Your Turn: Free Exploration\n", "\n", "The cells below are yours. Here are some things to try:\n", "\n", "- **Histograms:** Use `sns.histplot()` to plot the distribution of `x` or `y` for two contrasting datasets. Do the distributions look different?\n", "- **KDE plots:** Try `sns.kdeplot(data=df_focus, x='x', hue='dataset')` to overlay density curves for the three focus groups.\n", "- **Pair plots:** Use `sns.pairplot(df_focus, hue='dataset')`. What does it add?\n", "- **Box plots:** Use `sns.boxplot(data=df, x='dataset', y='x')`. Can box plots reveal the structural differences?\n", "\n", "> **Key question to keep in mind:** For each plot type you try, does it reveal the structural difference between the datasets, or does it hide it?" ] },
{ "cell_type": "code", "execution_count": null, "id": "83a2bc01", "metadata": {}, "outputs": [], "source": [ "# Your exploration here\n" ] },
{ "cell_type": "code", "execution_count": null, "id": "d7aac288", "metadata": {}, "outputs": [], "source": [] },
{ "cell_type": "code", "execution_count": null, "id": "3cc44f9f", "metadata": {}, "outputs": [], "source": [] },
{ "cell_type": "markdown", "id": "3c09cd29", "metadata": {}, "source": [ "---\n", "\n", "## 🔑 Key Takeaways\n", "\n", "- Summary statistics (mean, SD, correlation) can be nearly identical across datasets with completely different structure\n", "- Visualisation is not a finishing step; it is a **diagnostic step** that must happen early\n", "- Different chart types reveal different aspects: scatter plots show point-level structure, histograms show marginal distributions, and box plots summarise spread but can hide shape\n", "- The small multiples pattern (`FacetGrid`) is a powerful way to compare many groups at a glance\n", "\n", "→ In **Task 2**, you will move to a real-world dataset with real problems and discover that the \"hard work\" you just did manually can be partially automated."
] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }