Add deploy assets and update telemetry datasets

Prepare deployment package and clean telemetry/lab data: add deploy/ (README, datasaurus.csv, datasets and lab01 notebooks), add new lab02 dataset notebook variants (lab02_task1_datasets_v2/ v2b) and solutions for task3, and update multiple lab02 telemetry and git-activity notebooks. Clean and normalize claude/dataset_A_indie_game_telemetry_clean.csv (fill/standardize timestamps, session lengths and other fields) to improve consistency for downstream analysis.
2026-02-24 10:07:31 +00:00
parent fa9898b321
commit d689ada45e
17 changed files with 46042 additions and 9782 deletions
--- a/claude/lab02_task3_git_activity.ipynb
+++ b/claude/lab02_task3_git_activity.ipynb
@@ -1,12 +1,28 @@
 {
 "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "92169b19",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 43679 -- Interactive Visualization\n",
+    "# 2025 - 2026\n",
+    "# 2nd semester\n",
+    "# Lab 1 - EDA (independent)\n",
+    "# ver 1.1\n",
+    "# 24022026 - Added questions at end; cleaning"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "# Lab 02 · Task 3 — Independent EDA and Cleaning\n",
+    "## Lab 01<br>Task 3: Independent EDA and Cleaning\n",
+    "\n",
+    "The purpose of this task is for you to practice EDA on a new dataset more independently. Feel free to go back to Task 2's code and reuse it whenever it makes sense. Nevertheless, **don't limit yourself to copy-pasting**: make sure you understand why you are applying each step. Understanding what the issues are and how to address them will be important for your final project.\n",
    "\n",
-    "**Estimated time:** ~20 minutes  \n",
    "**Dataset:** `dataset_D_git_classroom_activity.csv`\n",
    "\n",
    "---\n",
@@ -34,12 +50,12 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## Part 1 — Load and Inspect"
+    "## Part 1: Load and Inspect"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -70,7 +86,7 @@
    "\n",
    "---\n",
    "\n",
-    "## Part 2 — Automated Profiling with SweetViz\n",
+    "## Part 2: Automated Profiling with SweetViz\n",
    "\n",
    "Generate a SweetViz report on the raw dataset. Use it to fill in the triage checklist below before moving on."
   ]
@@ -105,7 +121,7 @@
    "\n",
    "---\n",
    "\n",
-    "## Part 3 — Navigate and Inspect with D-Tale\n",
+    "## Part 3: Navigate and Inspect with D-Tale\n",
    "\n",
    "Launch D-Tale and use it to confirm each issue visually. Do not clean anything here."
   ]
@@ -149,9 +165,9 @@
    "\n",
    "---\n",
    "\n",
-    "## Part 4 — Clean with Pandas\n",
+    "## Part 4: Clean with Pandas\n",
    "\n",
-    "Work through each issue below. For each one: inspect → fix → verify.  \n",
+    "Work through each issue below. For each one: **inspect --> fix --> verify**.  \n",
    "The first example in each category is more detailed; subsequent columns follow the same pattern.\n",
    "\n",
    "Start by creating a working copy:"
@@ -172,7 +188,7 @@
   "source": [
    "---\n",
    "\n",
-    "### 4.1 — Boolean columns\n",
+    "### 4.1. Boolean columns\n",
    "\n",
    "**Columns:** `is_weekend`, `label_is_high_quality`, `exam_period`  \n",
    "**Issue:** 8 different representations of True/False  \n",
@@ -218,7 +234,7 @@
   "source": [
    "---\n",
    "\n",
-    "### 4.2 — `is_bot_user`: case and whitespace\n",
+    "### 4.2. `is_bot_user`: case and whitespace\n",
    "\n",
    "**Issue:** 6 variants of 2 values (`Human`, `Bot`) with mixed case and whitespace  \n",
    "**Approach:** `.str.strip().str.lower()` — no typos, no synonym merging needed"
@@ -260,7 +276,7 @@
   "source": [
    "---\n",
    "\n",
-    "### 4.3 — Categorical columns: case and whitespace\n",
+    "### 4.3. Categorical columns: case and whitespace\n",
    "\n",
    "**Columns:** `dominant_language`, `editor`, `os`, `event_type`  \n",
    "**Issue:** Many case/whitespace variants — strip and lowercase resolves most  \n",
@@ -313,7 +329,7 @@
   "source": [
    "---\n",
    "\n",
-    "### 4.4 — `ci_status`: case, whitespace, and synonym merging\n",
+    "### 4.4. `ci_status`: case, whitespace, and synonym merging\n",
    "\n",
    "**Issue:** Case and whitespace variants — but also `FAILED` and `FAILURE` represent the same outcome and need to be merged into one canonical value.  \n",
    "**Approach:** Strip and lowercase first, then use `.replace()` to merge synonyms.\n",
@@ -338,6 +354,7 @@
   "outputs": [],
   "source": [
    "# Fix ci_status — strip, lowercase, then merge synonyms\n",
+    "# You can use .replace({'current':'replaced'})\n",
    "# Your code here\n"
   ]
  },
@@ -355,13 +372,13 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "> **Your decision:** Which canonical form did you choose for `failed`/`failure`, and why?\n",
+    "> **Your decision:** Which canonical form did you choose for `failed`/`failure`, and why? This is where you need domain context: what is the common term?\n",
    "\n",
    "*(Double-click to write your answer)*\n",
    "\n",
    "---\n",
    "\n",
-    "### 4.5 — `coverage_percent`: comma decimal separator and type conversion\n",
+    "### 4.5. `coverage_percent`: comma decimal separator and type conversion\n",
    "\n",
    "**Issue:** Loaded as `object` — some values use a comma instead of a decimal point  \n",
    "**Approach:** Same as `purchase_amount` in Task 2 — `.str.replace()` then `.astype(float)`"
@@ -376,7 +393,10 @@
    "# Inspect — how many rows have a comma?\n",
    "print(df_clean['coverage_percent'].dtype)\n",
    "comma_rows = df_clean['coverage_percent'].astype(str).str.contains(',', na=False)\n",
-    "print(f'Rows with comma: {comma_rows.sum()}')"
+    "print(f'Rows with comma: {comma_rows.sum()}')\n",
+    "\n",
+    "# Tip: are any values outside the valid range?\n",
+    "# What is the valid range for this variable?"
   ]
  },
  {
@@ -396,8 +416,11 @@
   "outputs": [],
   "source": [
    "# Verify\n",
+    "\n",
    "print(f'dtype: {df_clean[\"coverage_percent\"].dtype}')\n",
-    "print(df_clean['coverage_percent'].describe().round(2))"
+    "print(df_clean['coverage_percent'].describe().round(2))\n",
+    "print(f'\\nValues < 0:   {(df_clean[\"coverage_percent\"] < 0).sum()} rows')\n",
+    "print(f'Values > 100: {(df_clean[\"coverage_percent\"] > 100).sum()} rows')"
   ]
  },
  {
@@ -406,7 +429,7 @@
   "source": [
    "---\n",
    "\n",
-    "### 4.6 — Missing values: decisions and strategy\n",
+    "### 4.6. Missing values: decisions and strategy\n",
    "\n",
    "This dataset has four columns with missing values. Inspect each one and decide what to do.\n",
    "\n",
@@ -460,7 +483,7 @@
   "source": [
    "---\n",
    "\n",
-    "### 4.7 — Outliers and impossible values\n",
+    "### 4.7. Outliers and impossible values\n",
    "\n",
    "Three issues to address:\n",
    "\n",
@@ -555,7 +578,7 @@
    "\n",
    "---\n",
    "\n",
-    "### 4.8 — `timestamp`: mixed datetime formats *(optional)*\n",
+    "### 4.8. **OPTIONAL** `timestamp`: mixed datetime formats\n",
    "\n",
    "Like Task 2, the `timestamp` column contains mixed datetime formats. However, unlike Task 2, there is no derived column that depends on it — so the impact of unresolved timestamps is lower here.\n",
    "\n",
@@ -578,7 +601,7 @@
   "source": [
    "---\n",
    "\n",
-    "## Part 5 — Verify with D-Tale"
+    "## Part 5: Verify with D-Tale"
   ]
  },
  {
@@ -610,7 +633,7 @@
    "\n",
    "---\n",
    "\n",
-    "## Part 6 — Before vs After with SweetViz"
+    "## Part 6: Before vs After with SweetViz"
   ]
  },
  {
@@ -631,7 +654,7 @@
   "source": [
    "---\n",
    "\n",
-    "## Part 7 — Save"
+    "## Part 7: Save"
   ]
  },
  {
@@ -650,7 +673,7 @@
   "source": [
    "---\n",
    "\n",
-    "## Reflection\n",
+    "## Final Questions\n",
    "\n",
    "Answer the following before finishing:\n",
    "\n",
@@ -658,23 +681,29 @@
    "\n",
    "**2.** You found rows where `tests_failed > tests_run`. What does this kind of cross-column check tell you that a single-column inspection would have missed?\n",
    "\n",
-    "**3.** For `ci_status`, you had to decide whether `failed` and `failure` are the same thing. What kind of knowledge — beyond the data itself — did you need to make that decision?\n",
+    "**3.** For `ci_status`, you had to decide whether `failed` and `failure` are the same thing. What kind of knowledge -- beyond the data itself -- did you need to make that decision?\n",
    "\n",
-    "**4.** Compare this dataset to the telemetry dataset from Task 2. Which issues were the same? Which were new? What does that tell you about the generality of the cleaning skills you are building?\n",
-    "\n",
-    "*(Double-click to write your answers)*"
+    "**4.** Compare this dataset to the telemetry dataset from Task 2. Which issues were the same? Which were new? What does that tell you about the generality of the cleaning skills you are building?\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
   "name": "python",
-   "version": "3.10.0"
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.9"
  }
 },
 "nbformat": 4,