Add deploy assets and update telemetry datasets

Prepare deployment package and clean telemetry/lab data: add deploy/ (README, datasaurus.csv, datasets and lab01 notebooks), add new lab02 dataset notebook variants (lab02_task1_datasets_v2/ v2b) and solutions for task3, and update multiple lab02 telemetry and git-activity notebooks. Clean and normalize claude/dataset_A_indie_game_telemetry_clean.csv (fill/standardize timestamps, session lengths and other fields) to improve consistency for downstream analysis.
This commit is contained in:
2026-02-24 10:07:31 +00:00
parent fa9898b321
commit d689ada45e
17 changed files with 46042 additions and 9782 deletions

File diff suppressed because it is too large

View File

@@ -4,10 +4,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab 02 · Task 1 Exploratory Data Analysis with Pandas & Seaborn\n",
"# Lab 01<br>Task 1: Exploratory Data Analysis with Pandas & Seaborn\n",
"\n",
"**Estimated time:** ~30 minutes \n",
"**Dataset:** `datasaurus_dozen.csv`\n",
"This task serves two purposes. It introduces you to some of the basic tools to start understanding datasets and shows you why descriptive statistics may not be enough to understand the nature of a dataset.\n",
"\n",
"Additionally, this simple first task also serves the purpose of getting you acquainted with Jupyter notebooks.\n",
"\n",
"**Dataset:** `datasaurus.csv`\n",
"\n",
"---\n",
"\n",
@@ -23,9 +26,9 @@
"\n",
"### Context\n",
"\n",
"The **Datasaurus Dozen** is a collection of 13 small datasets deliberately constructed to share *identical* summary statistics while looking completely different when plotted. It was created by Matejka & Fitzmaurice (2017) to demonstrate a modern version of Anscombe's Quartet.\n",
"The **Datasaurus Dozen** is a collection of 13 small datasets created by Matejka & Fitzmaurice (2017) to demonstrate a modern version of Anscombe's Quartet.\n",
"\n",
"This task will take you through the same journey a data analyst faces: you will start with raw numbers, run the usual summaries, and then discover through visualisation that numbers alone were hiding the story.\n",
"This task will take you through the same journey a data analyst faces: you will start with raw numbers, run the usual summaries, and then discover, through visualisation, that numbers alone were hiding the story.\n",
"\n",
"---"
]
@@ -34,7 +37,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1 Load and Inspect the Data\n",
"## Part 1: Load and Inspect the Data\n",
"\n",
"Start by importing the libraries you need and loading the dataset."
]
@@ -181,7 +184,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1 Structure and data types\n",
"### 1.1. Structure and data types\n",
"\n",
"Before computing anything, always understand what you are working with."
]
@@ -255,7 +258,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2 Overall summary statistics\n",
"### 1.2. Overall summary statistics\n",
"\n",
"Use `describe()` to get a global numerical summary of `x` and `y`."
]
@@ -363,7 +366,7 @@
"source": [
"---\n",
"\n",
"## Part 2 Grouped Statistics: The Reveal\n",
"## Part 2: Grouped Statistics\n",
"\n",
"The dataset column holds 13 different named groups. Let's compute summary statistics **per group** and see if the groups differ."
]
@@ -577,7 +580,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\sss\\AppData\\Local\\Temp\\ipykernel_95640\\2163207487.py:2: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.\n",
"C:\\Users\\sss\\AppData\\Local\\Temp\\ipykernel_64804\\2163207487.py:2: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.\n",
" correlation = df.groupby('dataset').apply(lambda g: g['x'].corr(g['y'])).round(2)\n"
]
}
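The FutureWarning captured in this hunk can also be silenced by selecting the value columns before `apply`, which keeps the grouping column out of the operation entirely. A minimal sketch, using a made-up two-group frame in place of the real `datasaurus.csv`:

```python
import pandas as pd

# Hypothetical stand-in for datasaurus.csv (columns: dataset, x, y)
df = pd.DataFrame({
    'dataset': ['dino', 'dino', 'dino', 'star', 'star', 'star'],
    'x': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'y': [2.0, 4.0, 6.0, 9.0, 11.0, 13.0],
})

# Selecting ['x', 'y'] first keeps the grouping column out of apply,
# so no FutureWarning about grouping columns is emitted
correlation = (
    df.groupby('dataset')[['x', 'y']]
      .apply(lambda g: g['x'].corr(g['y']))
      .round(2)
)
correlation.name = 'corr(x,y)'
print(correlation)
```

Passing `include_groups=False` (pandas 2.2+) is the other route the warning itself suggests; the column selection above works on older versions too.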
@@ -593,10 +596,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Question:** Look at the table above. Are the 13 datasets statistically different from each other? \n",
"> **Question:** Look at the table above. Are the 13 datasets statistically different from each other? \n",
"> Write your answer in the cell below before moving on.\n",
"\n",
"*(Double-click this cell to write your answer here)*\n",
"\n",
"---"
]
@@ -605,11 +607,19 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 3 Now Let's Actually Look at the Data\n",
"<!-- ## Part 3: Now Let us Actually Look at the Data\n",
"\n",
"We will focus on three sub-datasets: **`dino`**, **`star`**, and **`bullseye`**. These three were chosen because they produce a dramatic visual contrast despite their identical statistics.\n",
"\n",
"Later, feel free to explore the remaining 10 groups."
"Later, feel free to explore the remaining 10 groups. -->"
]
},
{
"cell_type": "markdown",
"id": "d6f82ff1",
"metadata": {},
"source": [
"## Part 3: Visualizing the Data"
]
},
{
@@ -739,10 +749,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Question:** What would a data analyst have concluded if they had only looked at the summary statistics table? \n",
"> **Question:** What would a data analyst have concluded if they had only looked at the summary statistics table? \n",
"> What does this tell you about when and why visualisation is necessary?\n",
"\n",
"*(Double-click to write your answer here)*\n",
"\n",
"---"
]
@@ -789,7 +798,7 @@
"source": [
"---\n",
"\n",
"## ✏️ Your Turn — Free Exploration\n",
"## Your Turn — Free Exploration\n",
"\n",
"The cells below are yours. Here are some things to try:\n",
"\n",
@@ -801,15 +810,6 @@
"> **Key question to keep in mind:** For each plot type you try — does it reveal the structural difference between the datasets, or does it hide it?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your exploration here\n"
]
},
{
"cell_type": "code",
"execution_count": null,

View File

@@ -0,0 +1,424 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d44c354e",
"metadata": {},
"source": [
"# Lab 01<br>Task 1: Exploratory Data Analysis with Pandas & Seaborn\n",
"\n",
"This task serves two purposes. It introduces you to some of the basic tools to start understanding datasets and shows you why descriptive statistics may not be enough to understand the nature of a dataset.\n",
"\n",
"Additionally, this simple first task also serves the purpose of getting you acquainted with Jupyter notebooks.\n",
"\n",
"**Dataset:** `datasaurus.csv`\n",
"\n",
"---\n",
"\n",
"### Objectives\n",
"\n",
"By the end of this task you will be able to:\n",
"- Use `pandas` to inspect a dataset's structure, types, and summary statistics\n",
"- Apply grouped aggregations to compare subsets of data\n",
"- Use `seaborn` to produce scatter plots that reveal structure invisible to statistics\n",
"- Articulate *why* visualisation is an essential — not optional — step in data analysis\n",
"\n",
"---\n",
"\n",
"### Context\n",
"\n",
"The **Datasaurus Dozen** is a collection of 13 small datasets created by Matejka & Fitzmaurice (2017) to demonstrate a modern version of Anscombe's Quartet.\n",
"\n",
"This task will take you through the same journey a data analyst faces: you will start with raw numbers, run the usual summaries, and then discover, through visualisation, that numbers alone were hiding the story.\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "350a4fd8",
"metadata": {},
"source": [
"## Part 1: Load and Inspect the Data\n",
"\n",
"Start by importing the libraries you need and loading the dataset."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ed1a7a01",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Configure plot style\n",
"sns.set_theme(style='whitegrid', palette='tab10')\n",
"plt.rcParams['figure.dpi'] = 100"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "9cf77ef2",
"metadata": {},
"outputs": [],
"source": [
"# Load the dataset\n",
"df = pd.read_csv('datasaurus.csv')\n",
"\n",
"# Preview the first rows\n",
"df.head(10)"
]
},
{
"cell_type": "markdown",
"id": "a2e51209",
"metadata": {},
"source": [
"### 1.1. Structure and data types\n",
"\n",
"Before computing anything, always understand what you are working with."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6a45f4e3",
"metadata": {},
"outputs": [],
"source": [
"# Shape of the dataset (rows, columns)\n",
"print('Shape:', df.shape)\n",
"\n",
"# Column names and data types\n",
"print('\\nDtypes:')\n",
"print(df.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "d01329b3",
"metadata": {},
"outputs": [],
"source": [
"# How many unique sub-datasets are there, and how many rows does each contain?\n",
"print('Unique datasets:', df['dataset'].nunique())\n",
"print('\\nRows per dataset:')\n",
"print(df['dataset'].value_counts())"
]
},
{
"cell_type": "markdown",
"id": "1545a53f",
"metadata": {},
"source": [
"### 1.2. Overall summary statistics\n",
"\n",
"Use `describe()` to get a global numerical summary of `x` and `y`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a92b670e",
"metadata": {},
"outputs": [],
"source": [
"# Summary statistics for the entire dataset\n",
"df[['x', 'y']].describe().round(2)"
]
},
{
"cell_type": "markdown",
"id": "16b1a9e3",
"metadata": {},
"source": [
"---\n",
"\n",
"## Part 2: Grouped Statistics: The Reveal\n",
"\n",
"The dataset column holds 13 different named groups. Let's compute summary statistics **per group** and see if the groups differ."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e7693c95",
"metadata": {},
"outputs": [],
"source": [
"# Compute mean and standard deviation of x and y for each sub-dataset\n",
"grouped_stats = (\n",
" df.groupby('dataset')[['x', 'y']]\n",
" .agg(['mean', 'std'])\n",
" .round(2)\n",
")\n",
"\n",
"grouped_stats"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "837a2552",
"metadata": {},
"outputs": [],
"source": [
"# Also compute the Pearson correlation between x and y per group\n",
"correlation = df.groupby('dataset').apply(lambda g: g['x'].corr(g['y'])).round(2)\n",
"correlation.name = 'corr(x,y)'\n",
"print(correlation)"
]
},
{
"cell_type": "markdown",
"id": "c40be027",
"metadata": {},
"source": [
"> **Question:** Look at the table above. Are the 13 datasets statistically different from each other? \n",
"> Write your answer in the cell below before moving on.\n",
"\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "cc4c40dd",
"metadata": {},
"source": [
"<!-- ## Part 3: Now Let us Actually Look at the Data\n",
"\n",
"We will focus on three sub-datasets: **`dino`**, **`star`**, and **`bullseye`**. These three were chosen because they produce a dramatic visual contrast despite their identical statistics.\n",
"\n",
"Later, feel free to explore the remaining 10 groups. -->"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "d4fde0b1",
"metadata": {},
"outputs": [],
"source": [
"# Filter to the three focus datasets\n",
"focus = ['dino', 'star', 'bullseye']\n",
"df_focus = df[df['dataset'].isin(focus)].copy()\n",
"\n",
"print(f'Rows in subset: {len(df_focus)}')"
]
},
{
"cell_type": "markdown",
"id": "86d8b1b6",
"metadata": {},
"source": [
"### 3.1 — Individual scatter plots"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "c2f4c527",
"metadata": {},
"outputs": [],
"source": [
"fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)\n",
"\n",
"colors = sns.color_palette('tab10', 3)\n",
"\n",
"for ax, name, color in zip(axes, focus, colors):\n",
" subset = df_focus[df_focus['dataset'] == name]\n",
" ax.scatter(subset['x'], subset['y'], color=color, alpha=0.7, s=40, edgecolors='white', linewidths=0.4)\n",
" ax.set_title(name, fontsize=14, fontweight='bold')\n",
" ax.set_xlabel('x')\n",
" ax.set_ylabel('y')\n",
"\n",
"fig.suptitle('Same statistics, completely different data', fontsize=16, fontweight='bold', y=1.02)\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "538ecb6f",
"metadata": {},
"source": [
"### 3.2 — Side-by-side with statistics overlay\n",
"\n",
"Let's add the mean and standard deviation annotations to make the point explicit."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "d677b3ec",
"metadata": {},
"outputs": [],
"source": [
"fig, axes = plt.subplots(1, 3, figsize=(15, 5.5), sharey=True)\n",
"\n",
"for ax, name, color in zip(axes, focus, colors):\n",
" subset = df_focus[df_focus['dataset'] == name]\n",
" \n",
" ax.scatter(subset['x'], subset['y'], color=color, alpha=0.65, s=40,\n",
" edgecolors='white', linewidths=0.4, label='observations')\n",
" \n",
" # Mean crosshair\n",
" mx, my = subset['x'].mean(), subset['y'].mean()\n",
" ax.axvline(mx, color='black', linestyle='--', linewidth=1.0, alpha=0.6)\n",
" ax.axhline(my, color='black', linestyle='--', linewidth=1.0, alpha=0.6)\n",
" ax.scatter([mx], [my], color='black', s=80, zorder=5, label=f'mean ({mx:.1f}, {my:.1f})')\n",
" \n",
" # Stats box\n",
" stats_text = (\n",
" f\"mean x = {subset['x'].mean():.2f}\\n\"\n",
" f\"mean y = {subset['y'].mean():.2f}\\n\"\n",
" f\"sd x = {subset['x'].std():.2f}\\n\"\n",
" f\"sd y = {subset['y'].std():.2f}\\n\"\n",
" f\"corr = {subset['x'].corr(subset['y']):.2f}\"\n",
" )\n",
" ax.text(0.03, 0.97, stats_text, transform=ax.transAxes,\n",
" fontsize=8.5, verticalalignment='top', fontfamily='monospace',\n",
" bbox=dict(boxstyle='round,pad=0.4', facecolor='white', alpha=0.85, edgecolor='grey'))\n",
" \n",
" ax.set_title(name, fontsize=14, fontweight='bold')\n",
" ax.set_xlabel('x')\n",
" ax.set_ylabel('y')\n",
"\n",
"fig.suptitle('Datasaurus Dozen — statistics are identical, shapes are not',\n",
" fontsize=14, fontweight='bold', y=1.01)\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "e295910e",
"metadata": {},
"source": [
"> **❓ Question:** What would a data analyst have concluded if they had only looked at the summary statistics table? \n",
"> What does this tell you about when and why visualisation is necessary?\n",
"\n",
"*(Double-click to write your answer here)*\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "86dea1fb",
"metadata": {},
"source": [
"## Part 4 — Small Multiples: All 13 Datasets at Once\n",
"\n",
"Seaborn's `FacetGrid` makes it easy to produce a *small multiples* plot — the same chart type repeated for each group. This is a powerful pattern for comparing distributions across many categories."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "d7eb9f5a",
"metadata": {},
"outputs": [],
"source": [
"g = sns.FacetGrid(df, col='dataset', col_wrap=5, height=3, aspect=1.0,\n",
" sharex=False, sharey=False)\n",
"g.map(sns.scatterplot, 'x', 'y', alpha=0.6, s=18, color='steelblue', edgecolor='white', linewidth=0.2)\n",
"g.set_titles(col_template='{col_name}', size=10)\n",
"g.figure.suptitle('All 13 Datasaurus Dozen datasets — identical statistics',\n",
" fontsize=13, fontweight='bold', y=1.01)\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "becc716d",
"metadata": {},
"source": [
"---\n",
"\n",
"## ✏️ Your Turn — Free Exploration\n",
"\n",
"The cells below are yours. Here are some things to try:\n",
"\n",
"- **Histograms:** Use `sns.histplot()` to plot the distribution of `x` or `y` for two contrasting datasets. Do the distributions look different?\n",
"- **KDE plots:** Try `sns.kdeplot(data=df_focus, x='x', hue='dataset')` to overlay density curves for the three focus groups.\n",
"- **Pair plots:** Use `sns.pairplot(df_focus, hue='dataset')` — what does it add?\n",
"- **Box plots:** Use `sns.boxplot(data=df, x='dataset', y='x')` — can boxplots reveal the structural differences?\n",
"\n",
"> **Key question to keep in mind:** For each plot type you try — does it reveal the structural difference between the datasets, or does it hide it?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "83a2bc01",
"metadata": {},
"outputs": [],
"source": [
"# Your exploration here\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d7aac288",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "3cc44f9f",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "3c09cd29",
"metadata": {},
"source": [
"---\n",
"\n",
"## 🔑 Key Takeaways\n",
"\n",
"- Summary statistics (mean, SD, correlation) can be completely identical across datasets with totally different structure\n",
"- Visualisation is not a finishing step — it is a **diagnostic step** that must happen early\n",
"- Different chart types reveal different aspects: scatterplots show point-level structure, histograms show marginal distributions, box plots summarise spread but can hide shape\n",
"- The small multiples pattern (FacetGrid) is a powerful way to compare many groups at a glance\n",
"\n",
"→ In **Task 2**, you will move to a real-world dataset with real problems — and discover that the \"hard work\" you just did manually can be partially automated."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

File diff suppressed because one or more lines are too long

View File

@@ -563,7 +563,17 @@
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2026-02-23 18:02:42,900 - INFO - Executing shutdown due to inactivity...\n",
"2026-02-23 18:02:42,946 - INFO - Executing shutdown...\n",
"2026-02-23 18:02:42,962 - INFO - Not running with the Werkzeug Server, exiting by searching gc for BaseWSGIServer\n"
]
}
],
"source": [
"# Shut down the previous D-Tale instance and reload with the clean data\n",
"d.kill()\n",

View File

@@ -584,7 +584,17 @@
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2026-02-23 18:31:02,737 - INFO - Executing shutdown due to inactivity...\n",
"2026-02-23 18:31:02,790 - INFO - Executing shutdown...\n",
"2026-02-23 18:31:02,795 - INFO - Not running with the Werkzeug Server, exiting by searching gc for BaseWSGIServer\n"
]
}
],
"source": [
"# OPTIONAL: Two-pass strategy — try a second format for the rows that failed\n",
"# If you determine the ambiguous rows use DD/MM/YYYY, try dayfirst=True on them only\n",

View File

@@ -1,12 +1,32 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "e28cb3de",
"metadata": {},
"outputs": [],
"source": [
"# 43679 -- Interactive Visualization\n",
"# 2025 - 2026\n",
"# 2nd semester\n",
"# Lab 1 - EDA (guided)\n",
"# ver 1.2\n",
"# 24022026 - Cosmetics; added rationale for task in scope of course"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab 02 · Task 2 Guided EDA and Data Cleaning\n",
"# Lab 02<br>Task 2: Guided EDA and Data Cleaning\n",
"\n",
"The purpose of this task you to introduce you to the basic steps of performing data preparation for a dataset with several illustrative quality issues. In most situations you already have the basic code to be run; in others, you need to infer from existing code to complete the step. What is important here is for you to be able to identify the issues, understand the tools and approaches that may help tackling them, and acquire a systematic way of thinking about data preparation.\n",
"\n",
"**Don't just run the code. Understand why it is needed and what it is doing**\n",
"\n",
"**NOTE**: For those cells asking questions or with tables that can be filled, you can just double-click the cell and edit it with your answers and rationale\n",
"\n",
"**Estimated time:** ~50 minutes \n",
"**Dataset:** `dataset_A_indie_game_telemetry.csv`\n",
"\n",
"---\n",
@@ -23,9 +43,9 @@
"\n",
"| Tool | Role |\n",
"|---|---|\n",
"| **SweetViz** | Automated profiling generate a report, triage what needs fixing |\n",
"| **D-Tale** | Interactive navigation browse rows, inspect value counts, confirm fixes visually |\n",
"| **pandas** | All actual cleaning every transformation is explicit, reproducible code |\n",
"| **SweetViz** | Automated profiling: generate a report, triage what needs fixing |\n",
"| **D-Tale** | Interactive navigation: browse rows, inspect value counts, confirm fixes visually |\n",
"| **pandas** | All actual cleaning: every transformation is explicit, reproducible code |\n",
"\n",
"---"
]
@@ -82,7 +102,7 @@
"\n",
"---\n",
"\n",
"## Part 2 Automated Profiling with SweetViz\n",
"## Part 2: Automated Profiling with SweetViz\n",
"\n",
"SweetViz generates a visual report for the entire dataset in one call. Think of it as a **triage tool** — it shows you *where* to look; the actual investigation and fixing happens afterwards."
]
@@ -113,11 +133,11 @@
"| How many distinct values does `region` have? Does that seem right? | *...* |\n",
"| What is unusual about `purchase_amount`? | *...* |\n",
"\n",
"*(Double-click to fill in your answers)*\n",
"\n",
"\n",
"---\n",
"\n",
"## Part 3 Navigate and Inspect with D-Tale\n",
"## Part 3: Navigate and Inspect with D-Tale\n",
"\n",
"Before writing any cleaning code, use D-Tale to browse the raw data and *see* the problems with your own eyes. You will not clean anything here — D-Tale is your inspection tool.\n",
"\n",
@@ -161,7 +181,7 @@
"\n",
"---\n",
"\n",
"## Part 4 Clean with Pandas\n",
"## Part 4: Clean with Pandas\n",
"\n",
"We will work through seven issue categories. Each section follows the same pattern:\n",
"1. **Inspect** — confirm the problem in code\n",
@@ -187,7 +207,7 @@
"source": [
"---\n",
"\n",
"### 4.1 Boolean columns: inconsistent encoding\n",
"### 4.1. Boolean columns: inconsistent encoding\n",
"\n",
"Three columns (`crash_flag`, `is_featured_event`, `is_long_session`) each have **8 different representations** of the same two values: `True`, `False`, `true`, `false`, `1`, `0`, `Yes`, `No`.\n",
"\n",
@@ -242,7 +262,7 @@
"source": [
"---\n",
"\n",
"### 4.2 Categorical columns: case and whitespace inconsistency\n",
"### 4.2. Categorical columns: case and whitespace inconsistency\n",
"\n",
"Four columns have values that are logically identical but differ in case or surrounding whitespace:\n",
"- `region` — 32 variants of 5 values (e.g. `us-west`, `US-WEST`, `Us-west`, `' us-west '`)\n",
@@ -319,7 +339,7 @@
"source": [
"---\n",
"\n",
"### 4.3 `purchase_amount`: comma as decimal separator\n",
"### 4.3. `purchase_amount`: comma as decimal separator\n",
"\n",
"About 12% of rows use a comma instead of a decimal point (`1,80` instead of `1.80`). This prevented pandas from reading the column as numeric, so it was loaded as `object`.\n",
"\n",
@@ -364,7 +384,7 @@
"source": [
"---\n",
"\n",
"### 4.4 Missing values: decisions and strategy\n",
"### 4.4. Missing values: decisions and strategy\n",
"\n",
"Not all missing values are the same. Before deciding what to do, you need to understand *why* the value is missing — the reason determines the correct action.\n",
"\n",
@@ -378,7 +398,7 @@
"\n",
"<br>\n",
"\n",
"> **⚠️ Context always matters.** There is no universal rule for missing values. The decisions above are reasonable for this dataset and analytical goal but a different context might lead to different choices.\n"
"> **⚠️ Context always matters.** There is no universal rule for missing values. The decisions above are reasonable for this dataset and analytical goal, but a different context might lead to different choices.\n"
]
},
{
@@ -417,7 +437,7 @@
"source": [
"---\n",
"\n",
"### 4.5 Outliers: `avg_fps`\n",
"### 4.5. Outliers: `avg_fps`\n",
"\n",
"The `avg_fps` column has a maximum of 10,000 fps — physically impossible for a game running in real time. The 75th percentile is ~82 fps, confirming that 10,000 is a logging error, not an extreme but plausible value.\n",
"\n",
@@ -458,7 +478,7 @@
"source": [
"---\n",
"\n",
"### 4.6 Datetime columns: mixed formats\n",
"### 4.6. Datetime columns: mixed formats\n",
"\n",
"The `start_time` and `end_time` columns contain timestamps in at least four different formats:\n",
"\n",
@@ -687,7 +707,7 @@
"\n",
"---\n",
"\n",
"## Part 5 Verify with D-Tale\n",
"## Part 5: Verify with D-Tale\n",
"\n",
"Reload the cleaned dataframe into D-Tale and visually confirm the fixes. This is a quick sanity check — you are looking for anything that looks wrong before committing to the cleaned dataset."
]
@@ -718,7 +738,9 @@
"| `purchase_amount` | Describe → dtype and range | float64, no commas |\n",
"| `avg_fps` | Describe → max | Below 300 |\n",
"| `session_length_s` | Describe → min and max | No negatives, no values > 28800 |\n",
"| `start_time` | Describe → dtype | datetime64 |\n"
"| `start_time` | Describe → dtype | datetime64 |\n",
"\n",
"## Part 6: Compare initial and clean datasets with SweetViz"
]
},
{
@@ -728,7 +750,9 @@
"metadata": {},
"outputs": [],
"source": [
"# Debug\n",
"# Debug code; sometimes, sweetviz is not able to compare columns due to data type changes that are incompatible\n",
"# This code just goes around column by column to identify any column that gives an error. Otherwise, SweetViz\n",
"# just crashes without any major explanation\n",
"\n",
"# Test comparison column by column\n",
"# for col in df_clean.columns:\n",
@@ -773,7 +797,7 @@
"\n",
"---\n",
"\n",
"## Part 7 Save the Cleaned Dataset"
"## Part 7: Save the Cleaned Dataset"
]
},
{
@@ -814,10 +838,14 @@
"| Wrong decimal separator | `.str.replace(',', '.')` + `.astype(float)` |\n",
"| Structural missing values | `dropna(subset=[...])` with explicit rationale |\n",
"| Outliers | Boolean mask + `.loc[mask, col] = NaN` |\n",
"| Mixed datetime formats | `pd.to_datetime(utc=True, errors='coerce')` |\n",
"\n",
"→ In **Task 3**, you will apply these skills independently to a new dataset — with a checklist but without step-by-step guidance."
"| Mixed datetime formats | `pd.to_datetime(utc=True, errors='coerce')` |\n"
]
},
{
"cell_type": "markdown",
"id": "572f9d85",
"metadata": {},
"source": []
}
],
"metadata": {

View File

@@ -1,12 +1,28 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "92169b19",
"metadata": {},
"outputs": [],
"source": [
"# 43679 -- Interactive Visualization\n",
"# 2025 - 2026\n",
"# 2nd semester\n",
"# Lab 1 - EDA (independent)\n",
"# ver 1.1\n",
"# 24022026 - Added questions at end; cleaning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab 02 · Task 3 Independent EDA and Cleaning\n",
"## Lab 01<br>Task 3: Independent EDA and Cleaning\n",
"\n",
"The purpose of this task is for you to practice EDA for a new dataset in a more independent manner. Feel free to go back to Task 2's code and reuse it, whenever it makes sense. Nevertheless, **don't limit yourself to just copy-pasting** and undersstand why you are applying each step. Understanding what are the issues and how to address them will be important for your final project.\n",
"\n",
"**Estimated time:** ~20 minutes \n",
"**Dataset:** `dataset_D_git_classroom_activity.csv`\n",
"\n",
"---\n",
@@ -34,12 +50,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1 Load and Inspect"
"## Part 1: Load and Inspect"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
@@ -70,7 +86,7 @@
"\n",
"---\n",
"\n",
"## Part 2 Automated Profiling with SweetViz\n",
"## Part 2: Automated Profiling with SweetViz\n",
"\n",
"Generate a SweetViz report on the raw dataset. Use it to fill in the triage checklist below before moving on."
]
@@ -105,7 +121,7 @@
"\n",
"---\n",
"\n",
"## Part 3 Navigate and Inspect with D-Tale\n",
"## Part 3: Navigate and Inspect with D-Tale\n",
"\n",
"Launch D-Tale and use it to confirm each issue visually. Do not clean anything here."
]
@@ -149,9 +165,9 @@
"\n",
"---\n",
"\n",
"## Part 4 Clean with Pandas\n",
"## Part 4: Clean with Pandas\n",
"\n",
"Work through each issue below. For each one: inspect fix verify. \n",
"Work through each issue below. For each one: **inspect --> fix --> verify**. \n",
"The first example in each category is more detailed; subsequent columns follow the same pattern.\n",
"\n",
"Start by creating a working copy:"
@@ -172,7 +188,7 @@
"source": [
"---\n",
"\n",
"### 4.1 Boolean columns\n",
"### 4.1. Boolean columns\n",
"\n",
"**Columns:** `is_weekend`, `label_is_high_quality`, `exam_period` \n",
"**Issue:** 8 different representations of True/False \n",
@@ -218,7 +234,7 @@
"source": [
"---\n",
"\n",
"### 4.2 `is_bot_user`: case and whitespace\n",
"### 4.2. `is_bot_user`: case and whitespace\n",
"\n",
"**Issue:** 6 variants of 2 values (`Human`, `Bot`) with mixed case and whitespace \n",
"**Approach:** `.str.strip().str.lower()` — no typos, no synonym merging needed"
@@ -260,7 +276,7 @@
"source": [
"---\n",
"\n",
"### 4.3 Categorical columns: case and whitespace\n",
"### 4.3. Categorical columns: case and whitespace\n",
"\n",
"**Columns:** `dominant_language`, `editor`, `os`, `event_type` \n",
"**Issue:** Many case/whitespace variants — strip and lowercase resolves most \n",
@@ -313,7 +329,7 @@
"source": [
"---\n",
"\n",
"### 4.4 `ci_status`: case, whitespace, and synonym merging\n",
"### 4.4. `ci_status`: case, whitespace, and synonym merging\n",
"\n",
"**Issue:** Case and whitespace variants — but also `FAILED` and `FAILURE` represent the same outcome and need to be merged into one canonical value. \n",
"**Approach:** Strip and lowercase first, then use `.replace()` to merge synonyms.\n",
@@ -338,6 +354,7 @@
"outputs": [],
"source": [
"# Fix ci_status — strip, lowercase, then merge synonyms\n",
"# You can use .replace({'current':'replaced'})\n",
"# Your code here\n"
]
},
@@ -355,13 +372,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Your decision:** Which canonical form did you choose for `failed`/`failure`, and why?\n",
"> **Your decision:** Which canonical form did you choose for `failed`/`failure`, and why? This is where you need to go for the domain context. What is the common term?\n",
"\n",
"*(Double-click to write your answer)*\n",
"\n",
"---\n",
"\n",
"### 4.5 `coverage_percent`: comma decimal separator and type conversion\n",
"### 4.5. `coverage_percent`: comma decimal separator and type conversion\n",
"\n",
"**Issue:** Loaded as `object` — some values use a comma instead of a decimal point \n",
"**Approach:** Same as `purchase_amount` in Task 2 — `.str.replace()` then `.astype(float)`"
@@ -376,7 +393,10 @@
"# Inspect — how many rows have a comma?\n",
"print(df_clean['coverage_percent'].dtype)\n",
"comma_rows = df_clean['coverage_percent'].astype(str).str.contains(',', na=False)\n",
"print(f'Rows with comma: {comma_rows.sum()}')"
"print(f'Rows with comma: {comma_rows.sum()}')\n",
"\n",
"# tip: any values outside the valid range? \n",
"# What is the valid range for this variable?"
]
},
{
@@ -396,8 +416,11 @@
"outputs": [],
"source": [
"# Verify\n",
"\n",
"print(f'dtype: {df_clean[\"coverage_percent\"].dtype}')\n",
"print(df_clean['coverage_percent'].describe().round(2))"
"print(df_clean['coverage_percent'].describe().round(2))\n",
"print(f'\\nValues < 0: {(df_clean[\"coverage_percent\"] < 0).sum()} rows')\n",
"print(f'Values > 100: {(df_clean[\"coverage_percent\"] > 100).sum()} rows')"
]
},
{
@@ -406,7 +429,7 @@
"source": [
"---\n",
"\n",
"### 4.6 Missing values: decisions and strategy\n",
"### 4.6. Missing values: decisions and strategy\n",
"\n",
"This dataset has four columns with missing values. Inspect each one and decide what to do.\n",
"\n",
@@ -460,7 +483,7 @@
"source": [
"---\n",
"\n",
"### 4.7 Outliers and impossible values\n",
"### 4.7. Outliers and impossible values\n",
"\n",
"Three issues to address:\n",
"\n",
@@ -555,7 +578,7 @@
"\n",
"---\n",
"\n",
"### 4.8 `timestamp`: mixed datetime formats *(optional)*\n",
"### 4.8. **OPTIONAL** `timestamp`: mixed datetime formats \n",
"\n",
"Like Task 2, the `timestamp` column contains mixed datetime formats. However, unlike Task 2, there is no derived column that depends on it — so the impact of unresolved timestamps is lower here.\n",
"\n",
@@ -578,7 +601,7 @@
"source": [
"---\n",
"\n",
"## Part 5 Verify with D-Tale"
"## Part 5: Verify with D-Tale"
]
},
{
@@ -610,7 +633,7 @@
"\n",
"---\n",
"\n",
"## Part 6 Before vs After with SweetViz"
"## Part 6: Before vs After with SweetViz"
]
},
{
@@ -631,7 +654,7 @@
"source": [
"---\n",
"\n",
"## Part 7 Save"
"## Part 7: Save"
]
},
{
@@ -650,7 +673,7 @@
"source": [
"---\n",
"\n",
"## Reflection\n",
"## Final Questions\n",
"\n",
"Answer the following before finishing:\n",
"\n",
@@ -658,23 +681,29 @@
"\n",
"**2.** You found rows where `tests_failed > tests_run`. What does this kind of cross-column check tell you that a single-column inspection would have missed?\n",
"\n",
"**3.** For `ci_status`, you had to decide whether `failed` and `failure` are the same thing. What kind of knowledge beyond the data itself did you need to make that decision?\n",
"**3.** For `ci_status`, you had to decide whether `failed` and `failure` are the same thing. What kind of knowledge -- beyond the data itself -- did you need to make that decision?\n",
"\n",
"**4.** Compare this dataset to the telemetry dataset from Task 2. Which issues were the same? Which were new? What does that tell you about the generality of the cleaning skills you are building?\n",
"\n",
"*(Double-click to write your answers)*"
"**4.** Compare this dataset to the telemetry dataset from Task 2. Which issues were the same? Which were new? What does that tell you about the generality of the cleaning skills you are building?\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"version": "3.10.0"
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,


@@ -0,0 +1,673 @@
{
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
"language_info": {"name": "python", "version": "3.10.0"}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab 02 · Task 3 — Independent EDA and Cleaning · SOLUTIONS\n",
"\n",
"**Dataset:** `dataset_D_git_classroom_activity.csv`\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1 — Load and Inspect"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import sweetviz as sv\n",
"import dtale\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('dataset_D_git_classroom_activity.csv')\n",
"\n",
"print(f'Shape: {df.shape}')\n",
"print('\\nColumn types:')\n",
"print(df.dtypes)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **What to note:**\n",
"> - `coverage_percent` should be numeric but is `object` — formatting problem in raw values\n",
"> - `is_weekend`, `label_is_high_quality`, `exam_period` should be boolean but are `object`\n",
"> - `commit_message_length` is `float64` rather than `int` — a sign that missing values forced a float type\n",
"\n",
"---\n",
"\n",
"## Part 2 — Automated Profiling with SweetViz"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"report = sv.analyze(df)\n",
"report.show_html('sweetviz_git_raw.html', open_browser=False)\n",
"print('Report saved. Open sweetviz_git_raw.html in your browser.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Triage checklist — answers\n",
"\n",
"| Question | Finding |\n",
"|---|---|\n",
"| Which columns have missing values? Which has the most? | `pr_merge_time_hours` (71.7%), `commit_message_length` (7%), `build_duration_s` (2.1%), `time_to_ci_minutes` (2%) |\n",
"| Which columns should be boolean? | `is_weekend`, `label_is_high_quality`, `exam_period` |\n",
"| Which columns should be numeric? | `coverage_percent` — shown as TEXT due to comma decimal separators |\n",
"| `event_type` distinct count | ~42 — should be 7; case/whitespace variants |\n",
"| What is unusual about `ci_status`? | Besides case/whitespace variants, `FAILED` and `FAILURE` are synonyms that need merging |\n",
"| Suspicious numeric ranges | `lines_added` max 5000, `time_to_ci_minutes` max 1578, `pr_merge_time_hours` has negative values |\n",
"\n",
"---\n",
"\n",
"## Part 3 — Navigate and Inspect with D-Tale"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"d = dtale.show(df, host='127.0.0.1', subprocess=False, open_browser=False)\n",
"print('D-Tale running at:', d._url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Part 4 — Clean with Pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean = df.copy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### 4.1 — Boolean columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect\n",
"print(sorted(df_clean['is_weekend'].dropna().unique().tolist()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Boolean keys included so the mapping is safe to re-run\n",
"bool_map = {\n",
" 'True': True, 'true': True, '1': True, 'Yes': True, True: True,\n",
" 'False': False, 'false': False, '0': False, 'No': False, False: False\n",
"}\n",
"\n",
"for col in ['is_weekend', 'label_is_high_quality', 'exam_period']:\n",
" df_clean[col] = df_clean[col].map(bool_map)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify\n",
"for col in ['is_weekend', 'label_is_high_quality', 'exam_period']:\n",
" print(f\"{col}: {df_clean[col].value_counts().to_dict()} | nulls: {df_clean[col].isna().sum()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### 4.2 — `is_bot_user`: case and whitespace"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect\n",
"print(df_clean['is_bot_user'].value_counts().to_string())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean['is_bot_user'] = df_clean['is_bot_user'].str.strip().str.lower()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify\n",
"print(df_clean['is_bot_user'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### 4.3 — Categorical columns: case and whitespace"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect\n",
"print(f'dominant_language unique before: {df_clean[\"dominant_language\"].nunique()}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Strip and lowercase for columns with pure case/whitespace variance\n",
"for col in ['dominant_language', 'editor', 'event_type']:\n",
" df_clean[col] = df_clean[col].str.strip().str.lower()\n",
"\n",
"# os: strip and lowercase, then merge win → windows\n",
"df_clean['os'] = (\n",
" df_clean['os']\n",
" .str.strip()\n",
" .str.lower()\n",
" .replace({'win': 'windows'})\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify\n",
"for col in ['dominant_language', 'editor', 'os', 'event_type']:\n",
" print(f\"{col} ({df_clean[col].nunique()} unique): {sorted(df_clean[col].dropna().unique().tolist())}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### 4.4 — `ci_status`: case, whitespace, and synonym merging"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect\n",
"print(df_clean['ci_status'].value_counts().to_string())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Step 1: strip and lowercase\n",
"# Step 2: merge 'failure' into 'failed'\n",
"# Rationale: both indicate the CI pipeline did not complete successfully.\n",
"# 'failed' is the more common and explicit term in CI tooling (GitHub Actions, Jenkins).\n",
"df_clean['ci_status'] = (\n",
" df_clean['ci_status']\n",
" .str.strip()\n",
" .str.lower()\n",
" .replace({'failure': 'failed'})\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify\n",
"print(df_clean['ci_status'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Decision:** `failure` → `failed`. Both mean the CI pipeline did not complete successfully. `failed` is the canonical term used by major CI tools (GitHub Actions, Jenkins, GitLab CI) and is more explicit.\n",
"\n",
"---\n",
"\n",
"### 4.5 — `coverage_percent`: comma decimal separator, type conversion, and outliers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect\n",
"print(f'dtype: {df_clean[\"coverage_percent\"].dtype}')\n",
"comma_rows = df_clean['coverage_percent'].astype(str).str.contains(',', na=False)\n",
"print(f'Rows with comma: {comma_rows.sum()}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fix: replace comma, convert to float\n",
"df_clean['coverage_percent'] = (\n",
" df_clean['coverage_percent']\n",
" .astype(str)\n",
" .str.replace(',', '.', regex=False)\n",
" .replace('nan', float('nan'))\n",
" .astype(float)\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify — also check for values outside the valid 0-100 range\n",
"print(f'dtype: {df_clean[\"coverage_percent\"].dtype}')\n",
"print(df_clean['coverage_percent'].describe().round(2))\n",
"print(f'\\nValues < 0: {(df_clean[\"coverage_percent\"] < 0).sum()} rows')\n",
"print(f'Values > 100: {(df_clean[\"coverage_percent\"] > 100).sum()} rows')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# coverage_percent must be in [0, 100] — values outside this range are logging errors\n",
"invalid_cov = (df_clean['coverage_percent'] < 0) | (df_clean['coverage_percent'] > 100)\n",
"df_clean.loc[invalid_cov, 'coverage_percent'] = float('nan')\n",
"print(f'Invalid coverage values set to NaN: {invalid_cov.sum()}')\n",
"print(f'Range after: {df_clean[\"coverage_percent\"].min():.1f} to {df_clean[\"coverage_percent\"].max():.1f}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### 4.6 — Missing values: decisions and strategy"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect missing counts\n",
"missing = df_clean.isnull().sum()\n",
"pct = (missing / len(df_clean) * 100).round(1)\n",
"pd.DataFrame({'missing': missing, '%': pct})[missing > 0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Investigate pr_merge_time_hours — which event types have non-null values?\n",
"print(df_clean.loc[df_clean['pr_merge_time_hours'].notna(), 'event_type'].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Finding:** `pr_merge_time_hours` is only non-null for `pr_merged` and `pr_opened` events — exactly the rows where a merge time is meaningful. This is **structural missingness (MNAR — Missing Not At Random)**, not a data quality problem. Imputing or dropping these rows would destroy valid analytical signal. Keep as NaN.\n",
"\n",
"| Column | Decision | Rationale |\n",
"|---|---|---|\n",
"| `pr_merge_time_hours` | Keep NaN | Structural: only meaningful for PR events |\n",
"| `commit_message_length` | Keep NaN | Unclear cause — may be bot commits or merge commits without messages |\n",
"| `build_duration_s` | Keep NaN | Sporadic; likely CI jobs that did not reach the build phase |\n",
"| `time_to_ci_minutes` | Keep NaN | Sporadic; likely events that did not trigger CI |"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# All four columns: leave as NaN — no action needed\n",
"# (Documented above)\n",
"print('Missing value strategy: all four columns kept as NaN.')\n",
"print('No rows dropped.')"
]
},
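{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional check (not part of the original task): structural missingness shows an\n",
"# all-or-nothing pattern per group. Computing the missing rate of\n",
"# pr_merge_time_hours per event type makes the MNAR structure visible directly:\n",
"# PR events should sit near 0.0, all other event types at 1.0.\n",
"na_rate = df_clean.groupby('event_type')['pr_merge_time_hours'].apply(lambda s: s.isna().mean())\n",
"print(na_rate.round(2).sort_values().to_string())"
]
},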
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### 4.7 — Outliers and impossible values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A — Negative pr_merge_time_hours\n",
"neg_mask = df_clean['pr_merge_time_hours'] < 0\n",
"print(f'Negative pr_merge_time_hours: {neg_mask.sum()}')\n",
"print(df_clean.loc[neg_mask, ['event_type', 'pr_merge_time_hours']].head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fix A\n",
"df_clean.loc[neg_mask, 'pr_merge_time_hours'] = float('nan')\n",
"print(f'Negative values set to NaN: {neg_mask.sum()}')\n",
"print(f'New min: {df_clean[\"pr_merge_time_hours\"].min():.2f}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# B — tests_failed > tests_run (cross-column logical check)\n",
"impossible_mask = df_clean['tests_failed'] > df_clean['tests_run']\n",
"print(f'Rows where tests_failed > tests_run: {impossible_mask.sum()}')\n",
"print(df_clean.loc[impossible_mask, ['tests_run', 'tests_failed']].describe().round(1))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fix B — set tests_failed to NaN for impossible rows\n",
"# We do not touch tests_run — it may be correct; tests_failed is the unreliable value\n",
"df_clean.loc[impossible_mask, 'tests_failed'] = float('nan')\n",
"print(f'tests_failed set to NaN: {impossible_mask.sum()}')\n",
"# Verify: no remaining impossible rows\n",
"remaining = df_clean['tests_failed'] > df_clean['tests_run']\n",
"print(f'Remaining impossible rows: {remaining.sum()}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# C — lines_added and lines_deleted outliers\n",
"print('lines_added distribution:')\n",
"print(df_clean['lines_added'].describe().round(1))\n",
"print(f'\\nRows > 1000 lines added: {(df_clean[\"lines_added\"] > 1000).sum()}')\n",
"print(df_clean.loc[df_clean['lines_added'] > 1000,\n",
" ['event_type', 'lines_added', 'lines_deleted', 'dominant_language']].head(8).to_string())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Decision: commits adding or deleting > 1000 lines are flagged as outliers.\n",
"# While large commits can be legitimate (adding a framework, vendoring dependencies),\n",
"# values of 5000 lines are extreme for a classroom context and likely logging errors.\n",
"# We set them to NaN rather than dropping — other columns in these rows remain valid.\n",
"threshold = 1000\n",
"large_add = df_clean['lines_added'] > threshold\n",
"large_del = df_clean['lines_deleted'] > threshold\n",
"\n",
"df_clean.loc[large_add, 'lines_added'] = float('nan')\n",
"df_clean.loc[large_del, 'lines_deleted'] = float('nan')\n",
"\n",
"print(f'lines_added outliers set to NaN: {large_add.sum()}')\n",
"print(f'lines_deleted outliers set to NaN: {large_del.sum()}')\n",
"print(f'\\nlines_added max after: {df_clean[\"lines_added\"].max()}')\n",
"print(f'lines_deleted max after: {df_clean[\"lines_deleted\"].max()}')"
]
},
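{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional alternative (not required by the task): instead of the fixed 1000-line\n",
"# cutoff, derive a threshold from the distribution with the 1.5*IQR rule.\n",
"# Computed on the raw data, for comparison with the fixed threshold only.\n",
"q1, q3 = df['lines_added'].quantile([0.25, 0.75])\n",
"upper_fence = q3 + 1.5 * (q3 - q1)\n",
"print(f'IQR-based upper fence for lines_added: {upper_fence:.0f}')"
]
},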
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### 4.8 — `timestamp`: mixed datetime formats *(optional)*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# First pass — handles ISO 8601 formats\n",
"df_clean['timestamp'] = pd.to_datetime(df_clean['timestamp'], utc=True, errors='coerce')\n",
"print(f'timestamp dtype: {df_clean[\"timestamp\"].dtype}')\n",
"print(f'Unparsed (NaT) after first pass: {df_clean[\"timestamp\"].isna().sum()}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The remaining NaTs are DD/MM/YYYY format rows — apply a second pass\n",
"# using the systematic try_formats approach from Task 2\n",
"\n",
"def try_formats(series, formats):\n",
" result = pd.Series(pd.NaT, index=series.index, dtype='datetime64[ns, UTC]')\n",
" remaining = series.copy()\n",
" for fmt in formats:\n",
" parsed = pd.to_datetime(remaining, format=fmt, errors='coerce', utc=True)\n",
" resolved_idx = parsed.index[parsed.notna()]\n",
" result.loc[resolved_idx] = parsed.loc[resolved_idx]\n",
" remaining = remaining.drop(index=resolved_idx)\n",
" return result\n",
"\n",
"candidate_formats = [\n",
" '%d/%m/%Y %H:%M',\n",
" '%m/%d/%Y %H:%M',\n",
" '%d/%m/%Y',\n",
" '%m/%d/%Y',\n",
"]\n",
"\n",
"unparsed_idx = df_clean.index[df_clean['timestamp'].isna()]\n",
"raw_unparsed = df.loc[unparsed_idx, 'timestamp']\n",
"resolved = try_formats(raw_unparsed, candidate_formats)\n",
"df_clean.loc[unparsed_idx, 'timestamp'] = resolved\n",
"\n",
"print(f'Resolved in second pass: {resolved.notna().sum()}')\n",
"print(f'Still NaT (truly ambiguous): {df_clean[\"timestamp\"].isna().sum()}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Part 5 — Verify with D-Tale"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"d.kill()\n",
"d_clean = dtale.show(df_clean, host='127.0.0.1', subprocess=False, open_browser=False)\n",
"print('D-Tale (cleaned) running at:', d_clean._url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Part 6 — Before vs After with SweetViz"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exclude timestamp — SweetViz cannot compare string vs datetime64\n",
"exclude = ['timestamp']\n",
"compare = sv.compare(\n",
" [df.drop(columns=exclude), 'Raw'],\n",
" [df_clean.drop(columns=exclude).reset_index(drop=True), 'Cleaned']\n",
")\n",
"compare.show_html('sweetviz_git_comparison.html', open_browser=False)\n",
"print('Comparison report saved.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Part 7 — Save"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean.to_csv('dataset_D_git_classroom_activity_clean.csv', index=False)\n",
"print(f'Saved: {len(df_clean)} rows, {len(df_clean.columns)} columns')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Reflection — Suggested Answers\n",
"\n",
"**1. `pr_merge_time_hours` missing 71.7% — is this a data quality problem?** \n",
"No. The missingness is structural: only `pr_merged` and `pr_opened` events have a merge time by definition. Every other event type (commit, push, CI run, etc.) has no merge time to record. This is MNAR — Missing Not At Random — and the pattern itself carries meaning. Imputing or dropping these rows would be wrong.\n",
"\n",
"**2. What does the cross-column check reveal that single-column inspection misses?** \n",
"Single-column inspection of `tests_failed` shows values ranging from 0 to 245 — nothing obviously wrong. Single-column inspection of `tests_run` also looks normal. Only by comparing the two together does the logical impossibility appear: you cannot fail more tests than you ran. This is a category of data quality issue that automated profiling tools like SweetViz do not detect.\n",
"\n",
"**3. What knowledge beyond the data was needed for `ci_status`?** \n",
"Domain knowledge about CI systems: that `failed` and `failure` refer to the same pipeline outcome, and that `failed` is the conventional term in tools like GitHub Actions and Jenkins. Without this knowledge, a purely statistical analysis would treat them as two separate categories and silently undercount CI failures.\n",
"\n",
"**4. What was the same as Task 2? What was new?** \n",
"Same: boolean encoding chaos, case/whitespace inconsistency, comma decimal separator, structural missingness reasoning, negative value treatment, D-Tale navigation, SweetViz before/after. \n",
"New: synonym merging (`ci_status`), cross-column logical consistency check (`tests_failed > tests_run`), out-of-range numeric check (`coverage_percent` outside 0–100). \n",
"Takeaway: the core cleaning patterns transfer across domains. What changes is the domain knowledge needed to make the decisions — which canonical form to use, what physical constraints apply to each variable, what constitutes a structurally justified missing value."
]
}
]
}