Add deploy assets and update telemetry datasets
Prepare deployment package and clean telemetry/lab data: add deploy/ (README, datasaurus.csv, datasets and lab01 notebooks), add new lab02 dataset notebook variants (lab02_task1_datasets_v2/ v2b) and solutions for task3, and update multiple lab02 telemetry and git-activity notebooks. Clean and normalize claude/dataset_A_indie_game_telemetry_clean.csv (fill/standardize timestamps, session lengths and other fields) to improve consistency for downstream analysis.
@@ -1,12 +1,32 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "e28cb3de",
"metadata": {},
"outputs": [],
"source": [
"# 43679 -- Interactive Visualization\n",
"# 2025 - 2026\n",
"# 2nd semester\n",
"# Lab 1 - EDA (guided)\n",
"# ver 1.2\n",
"# 24022026 - Cosmetics; added rationale for task in scope of course"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
-"# Lab 02 · Task 2 — Guided EDA and Data Cleaning\n",
+"# Lab 02<br>Task 2: Guided EDA and Data Cleaning\n",
"\n",
"The purpose of this task is to introduce you to the basic steps of performing data preparation on a dataset with several illustrative quality issues. In most situations you already have the basic code to be run; in others, you need to infer from existing code how to complete the step. What is important here is for you to be able to identify the issues, understand the tools and approaches that may help tackle them, and acquire a systematic way of thinking about data preparation.\n",
"\n",
"**Don't just run the code. Understand why it is needed and what it is doing.**\n",
"\n",
"**NOTE**: For those cells asking questions or with tables that can be filled in, you can just double-click the cell and edit it with your answers and rationale.\n",
"\n",
"**Estimated time:** ~50 minutes \n",
"**Dataset:** `dataset_A_indie_game_telemetry.csv`\n",
"\n",
"---\n",
@@ -23,9 +43,9 @@
"\n",
"| Tool | Role |\n",
"|---|---|\n",
-"| **SweetViz** | Automated profiling — generate a report, triage what needs fixing |\n",
-"| **D-Tale** | Interactive navigation — browse rows, inspect value counts, confirm fixes visually |\n",
-"| **pandas** | All actual cleaning — every transformation is explicit, reproducible code |\n",
+"| **SweetViz** | Automated profiling: generate a report, triage what needs fixing |\n",
+"| **D-Tale** | Interactive navigation: browse rows, inspect value counts, confirm fixes visually |\n",
+"| **pandas** | All actual cleaning: every transformation is explicit, reproducible code |\n",
"\n",
"---"
]
@@ -82,7 +102,7 @@
"\n",
"---\n",
"\n",
-"## Part 2 — Automated Profiling with SweetViz\n",
+"## Part 2: Automated Profiling with SweetViz\n",
"\n",
"SweetViz generates a visual report for the entire dataset in one call. Think of it as a **triage tool** — it shows you *where* to look; the actual investigation and fixing happens afterwards."
]
@@ -113,11 +133,11 @@
"| How many distinct values does `region` have? Does that seem right? | *...* |\n",
"| What is unusual about `purchase_amount`? | *...* |\n",
"\n",
"*(Double-click to fill in your answers)*\n",
"\n",
"\n",
"---\n",
"\n",
-"## Part 3 — Navigate and Inspect with D-Tale\n",
+"## Part 3: Navigate and Inspect with D-Tale\n",
"\n",
"Before writing any cleaning code, use D-Tale to browse the raw data and *see* the problems with your own eyes. You will not clean anything here — D-Tale is your inspection tool.\n",
"\n",
@@ -161,7 +181,7 @@
"\n",
"---\n",
"\n",
-"## Part 4 — Clean with Pandas\n",
+"## Part 4: Clean with Pandas\n",
"\n",
"We will work through seven issue categories. Each section follows the same pattern:\n",
"1. **Inspect** — confirm the problem in code\n",
@@ -187,7 +207,7 @@
"source": [
"---\n",
"\n",
-"### 4.1 — Boolean columns: inconsistent encoding\n",
+"### 4.1. Boolean columns: inconsistent encoding\n",
"\n",
"Three columns (`crash_flag`, `is_featured_event`, `is_long_session`) each have **8 different representations** of the same two values: `True`, `False`, `true`, `false`, `1`, `0`, `Yes`, `No`.\n",
"\n",
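The fix for Section 4.1 can be sketched in pandas roughly as follows. This is a minimal sketch, not the notebook's actual cell: the eight-row sample below is hypothetical and stands in for the real `crash_flag` column of `dataset_A_indie_game_telemetry.csv`.

```python
import pandas as pd

# Hypothetical mini-sample covering the 8 encodings described in the task;
# the real notebook loads dataset_A_indie_game_telemetry.csv instead.
df = pd.DataFrame({"crash_flag": ["True", "false", "1", "0", "Yes", "No", "true", "False"]})

# One explicit mapping covers all 8 variants; anything unexpected becomes NaN,
# which is easy to spot with .isna().sum() afterwards.
BOOL_MAP = {"true": True, "false": False, "1": True, "0": False, "yes": True, "no": False}
df["crash_flag"] = df["crash_flag"].str.strip().str.lower().map(BOOL_MAP)

print(df["crash_flag"].tolist())
# → [True, False, True, False, True, False, True, False]
```

The same mapping can then be reused for `is_featured_event` and `is_long_session`, since all three columns share the encoding problem.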
@@ -242,7 +262,7 @@
"source": [
"---\n",
"\n",
-"### 4.2 — Categorical columns: case and whitespace inconsistency\n",
+"### 4.2. Categorical columns: case and whitespace inconsistency\n",
"\n",
"Four columns have values that are logically identical but differ in case or surrounding whitespace:\n",
"- `region` — 32 variants of 5 values (e.g. `us-west`, `US-WEST`, `Us-west`, `' us-west '`)\n",
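The normalization described for these columns can be sketched as below. A hedged sketch with a hypothetical sample built from the `region` variants the text mentions; stripping whitespace and lower-casing collapses the 32 variants down to the 5 real values.

```python
import pandas as pd

# Hypothetical sample of the region variants named in the task description.
df = pd.DataFrame({"region": ["us-west", "US-WEST", "Us-west", " us-west ", "EU-CENTRAL"]})

# Strip surrounding whitespace, then lower-case; logically identical
# values become byte-identical.
df["region"] = df["region"].str.strip().str.lower()

print(sorted(df["region"].unique()))
# → ['eu-central', 'us-west']
```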
@@ -319,7 +339,7 @@
"source": [
"---\n",
"\n",
-"### 4.3 — `purchase_amount`: comma as decimal separator\n",
+"### 4.3. `purchase_amount`: comma as decimal separator\n",
"\n",
"About 12% of rows use a comma instead of a decimal point (`1,80` instead of `1.80`). This prevented pandas from reading the column as numeric, so it was loaded as `object`.\n",
"\n",
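A sketch of the fix, using the `.str.replace(',', '.')` + `.astype(float)` pattern that the notebook's own cheatsheet lists for this issue; the four-row sample is hypothetical.

```python
import pandas as pd

# Hypothetical sample: some rows use ',' as the decimal separator,
# so the column arrives as dtype object.
df = pd.DataFrame({"purchase_amount": ["1,80", "2.50", "0,99", "10.00"]})

# Replace the comma, then cast to float (cheatsheet pattern).
df["purchase_amount"] = df["purchase_amount"].str.replace(",", ".", regex=False).astype(float)

print(df["purchase_amount"].tolist())
# → [1.8, 2.5, 0.99, 10.0]
```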
@@ -364,7 +384,7 @@
"source": [
"---\n",
"\n",
-"### 4.4 — Missing values: decisions and strategy\n",
+"### 4.4. Missing values: decisions and strategy\n",
"\n",
"Not all missing values are the same. Before deciding what to do, you need to understand *why* the value is missing — the reason determines the correct action.\n",
"\n",
@@ -378,7 +398,7 @@
"\n",
"<br>\n",
"\n",
-"> **⚠️ Context always matters.** There is no universal rule for missing values. The decisions above are reasonable for this dataset and analytical goal — but a different context might lead to different choices.\n"
+"> **⚠️ Context always matters.** There is no universal rule for missing values. The decisions above are reasonable for this dataset and analytical goal, but a different context might lead to different choices.\n"
]
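The "reason determines the action" idea can be sketched like this. Note the hedging: `player_id` is a hypothetical column name invented for illustration (it does not appear in the excerpt above); the `dropna(subset=[...])` pattern is the one the notebook's cheatsheet lists for structural missing values.

```python
import pandas as pd
import numpy as np

# Hypothetical sample: player_id (assumed name) is structurally required,
# while a missing purchase_amount legitimately means "no purchase".
df = pd.DataFrame({
    "player_id": ["p1", None, "p3", "p4"],
    "purchase_amount": [1.8, 2.5, np.nan, 0.99],
})

# Rows missing a structurally required key cannot be analyzed:
# drop them with an explicit rationale (cheatsheet: dropna(subset=[...])).
df = df.dropna(subset=["player_id"])

# The optional field's NaN is information in itself; keep it unless the
# analytical goal treats "no purchase" as zero revenue.
print(len(df), df["purchase_amount"].isna().sum())
# → 3 1
```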
},
{
@@ -417,7 +437,7 @@
"source": [
"---\n",
"\n",
-"### 4.5 — Outliers: `avg_fps`\n",
+"### 4.5. Outliers: `avg_fps`\n",
"\n",
"The `avg_fps` column has a maximum of 10,000 fps — physically impossible for a game running in real time. The 75th percentile is ~82 fps, confirming that 10,000 is a logging error, not an extreme but plausible value.\n",
"\n",
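The fix the cheatsheet names (boolean mask + `.loc[mask, col] = NaN`) can be sketched on a hypothetical sample; the 300 fps ceiling comes from the Part 5 verification table.

```python
import pandas as pd
import numpy as np

# Hypothetical sample including the impossible 10,000 fps logging error.
df = pd.DataFrame({"avg_fps": [60.0, 82.0, 10000.0, 30.0]})

# Boolean mask + .loc assignment: impossible values become NaN rather
# than being silently dropped, so the row's other fields survive.
mask = df["avg_fps"] > 300
df.loc[mask, "avg_fps"] = np.nan

print(df["avg_fps"].max())
# → 82.0
```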
@@ -458,7 +478,7 @@
"source": [
"---\n",
"\n",
-"### 4.6 — Datetime columns: mixed formats\n",
+"### 4.6. Datetime columns: mixed formats\n",
"\n",
"The `start_time` and `end_time` columns contain timestamps in at least four different formats:\n",
"\n",
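A sketch of the parsing step. The notebook's cheatsheet shows `pd.to_datetime(utc=True, errors='coerce')`; the `format="mixed"` argument added below is my assumption (it requires pandas ≥ 2.0 and parses each element individually, which is what genuinely mixed formats need). The sample strings are hypothetical stand-ins for the formats in `start_time`.

```python
import pandas as pd

# Hypothetical sample of mixed timestamp formats plus one garbage value.
s = pd.Series([
    "2026-02-24 10:30:00",   # ISO-like
    "02/24/2026 10:30",      # US-style
    "not a timestamp",       # unparseable -> NaT via errors="coerce"
])

# format="mixed" (pandas >= 2.0, an assumption here) parses per element;
# utc=True normalizes everything to one timezone-aware dtype.
parsed = pd.to_datetime(s, format="mixed", utc=True, errors="coerce")

print(parsed.isna().sum())
# → 1
```

Counting the resulting `NaT` values tells you how many timestamps resisted parsing and still need manual attention.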
@@ -687,7 +707,7 @@
"\n",
"---\n",
"\n",
-"## Part 5 — Verify with D-Tale\n",
+"## Part 5: Verify with D-Tale\n",
"\n",
"Reload the cleaned dataframe into D-Tale and visually confirm the fixes. This is a quick sanity check — you are looking for anything that looks wrong before committing to the cleaned dataset."
]
@@ -718,7 +738,9 @@
"| `purchase_amount` | Describe → dtype and range | float64, no commas |\n",
"| `avg_fps` | Describe → max | Below 300 |\n",
"| `session_length_s` | Describe → min and max | No negatives, no values > 28800 |\n",
-"| `start_time` | Describe → dtype | datetime64 |\n"
+"| `start_time` | Describe → dtype | datetime64 |\n",
+"\n",
+"## Part 6: Compare initial and clean datasets with SweetViz"
]
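The visual check in the verification table can also be expressed programmatically, as a complement to D-Tale. This is a hedged sketch: `df_clean` below is a tiny hypothetical stand-in for the cleaned frame, and the checks mirror the table's expected outcomes.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe.
df_clean = pd.DataFrame({
    "purchase_amount": [1.8, 2.5, 0.99],
    "avg_fps": [60.0, 82.0, 30.0],
    "session_length_s": [120, 300, 1800],
    "start_time": pd.to_datetime(["2026-02-24", "2026-02-24", "2026-02-25"], utc=True),
})

# One boolean per row of the verification table.
checks = {
    "purchase_amount is float": pd.api.types.is_float_dtype(df_clean["purchase_amount"]),
    "avg_fps below 300": df_clean["avg_fps"].max() < 300,
    "session lengths in [0, 28800]": df_clean["session_length_s"].between(0, 28800).all(),
    "start_time is datetime64": pd.api.types.is_datetime64_any_dtype(df_clean["start_time"]),
}

print(all(checks.values()))
# → True
```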
},
{
@@ -728,7 +750,9 @@
"metadata": {},
"outputs": [],
"source": [
-"# Debug\n",
+"# Debug code: sometimes SweetViz is not able to compare columns due to incompatible data type changes.\n",
+"# This code goes column by column to identify any column that raises an error; otherwise SweetViz\n",
+"# just crashes without any real explanation.\n",
"\n",
"# Test comparison column by column\n",
"# for col in df_clean.columns:\n",
@@ -773,7 +797,7 @@
"\n",
"---\n",
"\n",
-"## Part 7 — Save the Cleaned Dataset"
+"## Part 7: Save the Cleaned Dataset"
]
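Saving can be sketched as a one-liner; the only subtlety worth a comment is `index=False`. In this sketch `df_clean` is hypothetical and an in-memory buffer stands in for the on-disk CSV (the commit message names the real target as `claude/dataset_A_indie_game_telemetry_clean.csv`).

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe.
df_clean = pd.DataFrame({"region": ["us-west", "eu-central"], "avg_fps": [60.0, 82.0]})

# index=False avoids an "Unnamed: 0" column on reload; in the notebook the
# first argument would be the CSV path instead of this buffer.
buf = io.StringIO()
df_clean.to_csv(buf, index=False)

# Round-trip to confirm nothing was lost or retyped.
roundtrip = pd.read_csv(io.StringIO(buf.getvalue()))
print(roundtrip.equals(df_clean))
# → True
```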
},
{
@@ -814,10 +838,14 @@
"| Wrong decimal separator | `.str.replace(',', '.')` + `.astype(float)` |\n",
"| Structural missing values | `dropna(subset=[...])` with explicit rationale |\n",
"| Outliers | Boolean mask + `.loc[mask, col] = NaN` |\n",
-"| Mixed datetime formats | `pd.to_datetime(utc=True, errors='coerce')` |\n",
-"\n",
-"→ In **Task 3**, you will apply these skills independently to a new dataset — with a checklist but without step-by-step guidance."
+"| Mixed datetime formats | `pd.to_datetime(utc=True, errors='coerce')` |\n"
]
},
{
"cell_type": "markdown",
"id": "572f9d85",
"metadata": {},
"source": []
}
],
"metadata": {