{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lab 02 · Task 2 — Guided EDA and Data Cleaning\n",
    "\n",
    "**Estimated time:** ~50 minutes  \n",
    "**Dataset:** `dataset_A_indie_game_telemetry.csv`\n",
    "\n",
    "---\n",
    "\n",
    "### Objectives\n",
    "\n",
    "By the end of this task you will be able to:\n",
    "- Use **SweetViz** to rapidly profile a dataset and identify issues\n",
    "- Use **D-Tale** to navigate and inspect a dataframe interactively\n",
    "- Use **pandas** to fix the most common categories of data quality problems\n",
    "- Make and justify cleaning decisions rather than applying fixes mechanically\n",
    "\n",
    "### Tools and their roles in this task\n",
    "\n",
    "| Tool | Role |\n",
    "|---|---|\n",
    "| **SweetViz** | Automated profiling — generate a report, triage what needs fixing |\n",
    "| **D-Tale** | Interactive navigation — browse rows, inspect value counts, confirm fixes visually |\n",
    "| **pandas** | All actual cleaning — every transformation is explicit, reproducible code |\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1 — Setup and First Look"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import sweetviz as sv\n",
    "import dtale\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the raw dataset — do NOT clean anything yet\n",
    "df = pd.read_csv('dataset_A_indie_game_telemetry_v2.csv')\n",
    "\n",
    "print(f'Shape: {df.shape}')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Column names and types as pandas inferred them\n",
    "print(df.dtypes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **⚠️ Notice:** Several columns that should be boolean (`crash_flag`, `is_featured_event`, `is_long_session`) or\n",
    "> numeric (`purchase_amount`) have been inferred as `object`. This is your first signal that something is wrong.\n",
    "\n",
    "---\n",
    "\n",
    "## Part 2 — Automated Profiling with SweetViz\n",
    "\n",
    "SweetViz generates a visual report for the entire dataset in one call. Think of it as a **triage tool** — it shows you *where* to look; the actual investigation and fixing happens afterwards."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate the profiling report (~30–60 seconds)\n",
    "report = sv.analyze(df)\n",
    "report.show_html('sweetviz_raw_report.html', open_browser=True)\n",
    "print('Report saved. Open sweetviz_raw_report.html in your browser.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Open the report and answer the following before moving on.\n",
    "\n",
    "| Question | Your finding |\n",
    "|---|---|\n",
    "| Which columns have missing values? Which has the most? | *...* |\n",
    "| Which columns are shown as TEXT but should be boolean or numeric? | *...* |\n",
    "| Are there numeric columns with suspicious ranges? | *...* |\n",
    "| How many distinct values does `region` have? Does that seem right? | *...* |\n",
    "| What is unusual about `purchase_amount`? | *...* |\n",
    "\n",
    "*(Double-click to fill in your answers)*\n",
    "\n",
    "---\n",
    "\n",
    "## Part 3 — Navigate and Inspect with D-Tale\n",
    "\n",
    "Before writing any cleaning code, use D-Tale to browse the raw data and *see* the problems with your own eyes. You will not clean anything here — D-Tale is your inspection tool.\n",
    "\n",
    "**Launch D-Tale:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d = dtale.show(df, host='127.0.0.1', subprocess=True, open_browser=True)\n",
    "print('=' * 50)\n",
    "print('D-Tale is running.')\n",
    "print('Open this URL in your browser:', d._url)\n",
    "print('In VS Code: Ctrl+click the URL above.')\n",
    "print('=' * 50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inspection checklist\n",
    "\n",
    "Use D-Tale to confirm each issue SweetViz flagged. For each column, click the column header → **Describe** to see value counts and distribution.\n",
    "\n",
    "| What to inspect | How to do it in D-Tale | What you should see |\n",
    "|---|---|---|\n",
    "| `crash_flag` unique values | Column header → Describe | 8 variants of True/False |\n",
    "| `region` unique values | Column header → Describe | ~32 variants of 5 region names |\n",
    "| `input_method` unique values | Column header → Describe | A typo: `controllr` |\n",
    "| `purchase_amount` raw values | Sort column ascending | Some values use comma: `1,80` |\n",
    "| `avg_fps` distribution | Column header → Describe | Max of 10,000 — clearly wrong |\n",
    "| Missing values overview | Top menu → Describe (all columns) | `gpu_model` dominates |\n",
    "\n",
    "> Once you have seen the problems in the raw data, come back to the notebook for cleaning.\n",
    "\n",
    "---\n",
    "\n",
    "## Part 4 — Clean with Pandas\n",
    "\n",
    "We will work through seven issue categories. Each section follows the same pattern:\n",
    "1. **Inspect** — confirm the problem in code\n",
    "2. **Fix** — apply the pandas transformation\n",
    "3. **Verify** — check the result\n",
    "\n",
    "We work on a copy of the original dataframe so the raw data is always available for comparison."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Always work on a copy — keep df as the unchanged original\n",
    "df_clean = df.copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.1 — Boolean columns: inconsistent encoding\n",
    "\n",
    "Three columns (`crash_flag`, `is_featured_event`, `is_long_session`) each have **8 different representations** of the same two values: `True`, `False`, `true`, `false`, `1`, `0`, `Yes`, `No`.\n",
    "\n",
    "The fix is to define an explicit mapping and apply it with `.map()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — confirm the problem\n",
    "print('crash_flag unique values:', sorted(df_clean['crash_flag'].dropna().unique()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the mapping for replacements\n",
    "# Question: why does the mapping include True: True and False: False?\n",
    "\n",
    "bool_map = {\n",
    "    'True': True, 'true': True, '1': True, 'Yes': True, True: True,\n",
    "    'False': False, 'false': False, '0': False, 'No': False, False: False\n",
    "}\n",
    "\n",
    "df_clean['crash_flag'] = df_clean['crash_flag'].map(bool_map)\n",
    "\n",
    "print('crash_flag after mapping:')\n",
    "print(df_clean['crash_flag'].value_counts())\n",
    "print('Nulls:', df_clean['crash_flag'].isna().sum())"
   ]
  },
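  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Caveat:** `.map()` returns `NaN` for every value that is not a key in the mapping, so an unexpected spelling would silently turn into a missing value. The cell below is a minimal sketch of a guard against that. It runs on a small made-up Series; `toy` and `toy_map` are hypothetical names for illustration and are not part of the lab dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data for illustration only (not the lab dataset)\n",
    "toy = pd.Series(['True', 'no', '1', None, 'Yes'])\n",
    "toy_map = {'True': True, '1': True, 'Yes': True,\n",
    "           'False': False, '0': False, 'No': False}\n",
    "\n",
    "# Values that are present (not null) but have no key in the mapping\n",
    "unmapped = toy[toy.notna() & ~toy.isin(list(toy_map))]\n",
    "print('Unmapped values:', unmapped.unique().tolist())\n",
    "\n",
    "# After mapping, the unmapped spelling becomes NaN next to the genuinely missing value\n",
    "print(toy.map(toy_map))"
   ]
  },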
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply the same mapping to the other two boolean columns\n",
    "# Follow the same pattern as above for is_featured_event and is_long_session\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.2 — Categorical columns: case and whitespace inconsistency\n",
    "\n",
    "Four columns have values that are logically identical but differ in case or surrounding whitespace:\n",
    "- `region` — 32 variants of 5 values (e.g. `us-west`, `US-WEST`, `Us-west`, `' us-west '`)\n",
    "- `map_name` — 36 variants of 6 values\n",
    "- `platform` — 32 variants of 6 values\n",
    "- `input_method` — 30 variants, including a **typo**: `controllr`\n",
    "\n",
    "The fix uses pandas string methods: `.str.strip()` removes surrounding whitespace, `.str.lower()` normalises case. They can be chained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — how many unique values before cleaning?\n",
    "print('region unique before:', df_clean['region'].unique())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fix region: strip whitespace and convert to lowercase\n",
    "df_clean['region'] = df_clean['region'].str.strip().str.lower()\n",
    "\n",
    "# Verify\n",
    "print('region unique after:', df_clean['region'].unique())\n",
    "print(df_clean['region'].value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply the same strip + lower to map_name and platform\n",
    "# Follow the same pattern as above\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# input_method needs an extra step: fix the typo and standardise kb/m → kbm\n",
    "\n",
    "# Step 0: Inspect\n",
    "print('input_method unique before:', df_clean['input_method'].unique())\n",
    "\n",
    "# Step 1: strip and lowercase first\n",
    "df_clean['input_method'] = df_clean['input_method'].str.strip().str.lower()\n",
    "\n",
    "# Step 2: fix the two inconsistencies with replace()\n",
    "df_clean['input_method'] = df_clean['input_method'].replace({\n",
    "    'controllr': 'controller',\n",
    "    'kb/m': 'kbm'\n",
    "})\n",
    "\n",
    "# Verify — should now show exactly 3 unique values\n",
    "print('input_method unique after:', df_clean['input_method'].unique())\n",
    "print(df_clean['input_method'].value_counts())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.3 — `purchase_amount`: comma as decimal separator\n",
    "\n",
    "About 12% of rows use a comma instead of a decimal point (`1,80` instead of `1.80`). This prevented pandas from reading the column as numeric, so it was loaded as `object`.\n",
    "\n",
    "The fix: replace the comma in the string, then convert the column type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — how many rows have a comma?\n",
    "comma_rows = df_clean['purchase_amount'].astype(str).str.contains(',', na=False)\n",
    "print(f'Rows with comma separator: {comma_rows.sum()}')\n",
    "print('Examples:', df_clean.loc[comma_rows, 'purchase_amount'].unique()[:6])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fix: replace comma with decimal point, then convert to float\n",
    "df_clean['purchase_amount'] = (\n",
    "    df_clean['purchase_amount']\n",
    "    .astype(str)                          # ensure we are working with strings\n",
    "    .str.replace(',', '.', regex=False)   # swap the separator\n",
    "    .replace('nan', float('nan'))         # restore actual NaN rows\n",
    "    .astype(float)                        # convert to numeric\n",
    ")\n",
    "\n",
    "# Verify\n",
    "print('dtype:', df_clean['purchase_amount'].dtype)\n",
    "print(df_clean['purchase_amount'].describe().round(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.4 — Missing values: decisions and strategy\n",
    "\n",
    "Not all missing values are the same. Before deciding what to do, you need to understand *why* the value is missing — the reason determines the correct action.\n",
    "\n",
    "| Column | Missing | Why | Decision |\n",
    "|---|---|---|---|\n",
    "| `gpu_model` | 66.7% | No GPU model is logged for console/mobile players | Keep column — missingness is meaningful |\n",
    "| `build_version` | 16.5% | Not logged in older sessions | Keep as NaN — valid historical absence |\n",
    "| `device_temp_c` | 4.9% | Sensor not available on some devices | Keep as NaN |\n",
    "| `session_length_s` | 1.0% | Session ended abnormally | Drop these rows — see below |\n",
    "| `ping_ms`, `purchase_amount`, `end_time` | < 2% | Sporadic gaps | Keep as NaN |\n",
    "\n",
    "<br>\n",
    "\n",
    "> **Context always matters.** There is no universal rule for missing values. The decisions above are reasonable for this dataset and analytical goal, but a different context (e.g. building a machine learning model) might lead to different choices."
   ]
  },
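  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When missingness is itself informative, as with `gpu_model`, one option is to record it as an explicit flag so the information survives any later processing of the column. The cell below is a minimal sketch on made-up rows. The frame `toy`, its platform labels, and the column name `has_gpu_model` are all hypothetical and not part of the lab dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy rows for illustration only (not the lab dataset)\n",
    "toy = pd.DataFrame({\n",
    "    'platform': ['pc', 'console', 'mobile', 'pc'],\n",
    "    'gpu_model': ['RTX 3060', None, None, 'RX 6600'],\n",
    "})\n",
    "\n",
    "# True where a GPU model was logged, False where it is missing\n",
    "toy['has_gpu_model'] = toy['gpu_model'].notna()\n",
    "\n",
    "# Share of rows with a logged GPU model, per platform\n",
    "print(toy.groupby('platform')['has_gpu_model'].mean())"
   ]
  },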
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — missing value counts across all columns\n",
    "missing = df_clean.isnull().sum()\n",
    "missing_pct = (missing / len(df_clean) * 100).round(1)\n",
    "pd.DataFrame({'missing': missing, '%': missing_pct})[missing > 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# session_length_s: drop rows where it is missing\n",
    "# Rationale: session duration is a core metric — a session with no recorded\n",
    "# duration is structurally incomplete and cannot be used for most analyses.\n",
    "# These 98 rows represent <1% of the dataset, so dropping is safe.\n",
    "\n",
    "rows_before = len(df_clean)\n",
    "df_clean = df_clean.dropna(subset=['session_length_s'])\n",
    "\n",
    "print(f'Rows dropped: {rows_before - len(df_clean)}')\n",
    "print(f'Rows remaining: {len(df_clean)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.5 — Outliers: `avg_fps`\n",
    "\n",
    "The `avg_fps` column has a maximum of 10,000 fps, which is not plausible for a game running in real time. The 75th percentile is ~82 fps, so 10,000 is almost certainly a logging error rather than an extreme but genuine value.\n",
    "\n",
    "**Decision:** set values above 300 fps to `NaN` rather than dropping the entire row. The rest of the data in those rows (crash flag, purchase amount, session type) is likely still valid — it would be wasteful to discard it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — how many rows are affected?\n",
    "threshold = 300\n",
    "outlier_mask = df_clean['avg_fps'] > threshold\n",
    "print(f'Rows with avg_fps > {threshold}: {outlier_mask.sum()}')\n",
    "print('\\navg_fps distribution (before fix):')\n",
    "print(df_clean['avg_fps'].describe().round(1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fix: set outlier values to NaN using .loc with a boolean mask\n",
    "df_clean.loc[outlier_mask, 'avg_fps'] = float('nan')\n",
    "\n",
    "# Verify — max should now be at most 300 (the mask only removes values above the threshold)\n",
    "print('avg_fps distribution (after fix):')\n",
    "print(df_clean['avg_fps'].describe().round(1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.6 — Datetime columns: mixed formats *(guided + optional deep dive)*\n",
    "\n",
    "The `start_time` and `end_time` columns contain timestamps in at least four different formats:\n",
    "\n",
    "```\n",
    "2025-07-18T18:32:00Z        ← ISO 8601 with UTC marker\n",
    "2025-07-18 20:03:21-05:00   ← ISO 8601 with UTC offset\n",
    "20/10/2025 02:49            ← European DD/MM/YYYY\n",
    "08/01/2025 06:35            ← Ambiguous: US MM/DD or European?\n",
    "```\n",
    "\n",
    "Mixed datetime formats are one of the most complex cleaning problems because some ambiguities cannot be resolved automatically — `08/01/2025` could be August 1st or January 8th, and no algorithm can determine which without external context.\n",
    "\n",
    "The pragmatic approach is to parse what can be parsed reliably and flag what cannot."
   ]
  },
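  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see why the ambiguity cannot be resolved from the string alone, the cell below parses the same text under both conventions. It is a small illustrative sketch: it passes explicit `format=` strings so the result does not depend on pandas' own format inference, and the variable names are made up for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "raw = '08/01/2025 06:35'\n",
    "\n",
    "# Same string, two equally valid readings\n",
    "as_us = pd.to_datetime(raw, format='%m/%d/%Y %H:%M')  # month first\n",
    "as_eu = pd.to_datetime(raw, format='%d/%m/%Y %H:%M')  # day first\n",
    "\n",
    "print('US reading:      ', as_us.date())\n",
    "print('European reading:', as_eu.date())\n",
    "\n",
    "# A day value above 12 rules out the month-first reading, so this yields NaT\n",
    "print(pd.to_datetime('20/10/2025 02:49', format='%m/%d/%Y %H:%M', errors='coerce'))"
   ]
  },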
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — what does start_time actually look like?\n",
    "print('Sample values from start_time:')\n",
    "print(df_clean['start_time'].dropna().sample(8, random_state=42).tolist())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fix: pd.to_datetime with utc=True converts timezone-aware values to UTC\n",
    "# and treats values without a timezone as already being in UTC.\n",
    "# errors='coerce' converts anything it cannot parse to NaT (Not a Time) instead of raising an error.\n",
    "df_clean['start_time'] = pd.to_datetime(df_clean['start_time'], utc=True, errors='coerce')\n",
    "df_clean['end_time'] = pd.to_datetime(df_clean['end_time'], utc=True, errors='coerce')\n",
    "\n",
    "# Verify — check how many rows could not be parsed\n",
    "print('start_time dtype:', df_clean['start_time'].dtype)\n",
    "print('Unparsed start_time (NaT):', df_clean['start_time'].isna().sum())\n",
    "print('Unparsed end_time (NaT):  ', df_clean['end_time'].isna().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Note:** The NaT count above is the number of rows pandas could not parse. In pandas 2.x the format is inferred from the first non-null value and applied to the whole column, so rows written in a different format become NaT. This is not an error in the code. Be aware that pandas does not flag ambiguity: when it does parse a string such as `08/01/2025`, it reads it month-first unless `dayfirst=True` is set. Resolving these records therefore requires a domain decision (e.g., knowing that the data source always uses DD/MM/YYYY).\n",
    "\n",
    "---\n",
    "\n",
    "**📌 Optional — explore the unparsed rows**\n",
    "\n",
    "If you want to go further, the cells below help you examine which formats failed and attempt a two-pass parsing strategy. This is optional and not required to complete the lab.\n",
    "\n",
    "<details>\n",
    "<summary>Click to expand optional exploration</summary>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# OPTIONAL: Identify the raw values that failed to parse\n",
    "# We use the index of df_clean to look up the original values in df,\n",
    "# rather than a boolean mask — the two dataframes have different lengths\n",
    "# after the dropna() in step 4.4, so their indices no longer align.\n",
    "unparsed_idx = df_clean.index[df_clean['start_time'].isna()]\n",
    "print(f'Rows with unparsed start_time: {len(unparsed_idx)}')\n",
    "print('\\nRaw values that could not be parsed:')\n",
    "print(df.loc[unparsed_idx, 'start_time'].dropna().unique()[:15])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# OPTIONAL: Two-pass strategy — try a second format for the rows that failed\n",
    "# If you determine the ambiguous rows use DD/MM/YYYY, try dayfirst=True on them only\n",
    "unparsed_idx = df_clean.index[df_clean['start_time'].isna()]\n",
    "df_clean.loc[unparsed_idx, 'start_time'] = pd.to_datetime(\n",
    "    df.loc[unparsed_idx, 'start_time'],\n",
    "    dayfirst=True, utc=True, errors='coerce'\n",
    ")\n",
    "print('NaT after second pass:', df_clean['start_time'].isna().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "</details>\n",
    "\n",
    "---\n",
    "\n",
    "## Part 5 — Verify with D-Tale\n",
    "\n",
    "Reload the cleaned dataframe into D-Tale and visually confirm the fixes. This is a quick sanity check — you are looking for anything that looks wrong before committing to the cleaned dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Shut down the previous D-Tale instance and reload with the clean data\n",
    "d.kill()\n",
    "d_clean = dtale.show(df_clean, host='127.0.0.1', subprocess=True, open_browser=True)\n",
    "print('Open cleaned data in D-Tale:', d_clean._url)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In D-Tale, verify the following:\n",
    "\n",
    "| Column | What to check | Expected result |\n",
    "|---|---|---|\n",
    "| `crash_flag` | Describe → value counts | Only `True` and `False` |\n",
    "| `region` | Describe → value counts | Exactly 5 values, all lowercase |\n",
    "| `input_method` | Describe → value counts | Exactly 3 values, no `controllr` |\n",
    "| `purchase_amount` | Describe → dtype and range | float64, no commas |\n",
    "| `avg_fps` | Describe → max | At most 300 |\n",
    "| `session_length_s` | Describe → missing count | 0 |\n",
    "| `start_time` | Describe → dtype | datetime64 (UTC) |\n",
    "\n",
    "---\n",
    "\n",
    "## Part 6 — Before vs After with SweetViz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c8f0e03a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Debug\n",
    "\n",
    "# Test comparison column by column\n",
    "# for col in df_clean.columns:\n",
    "#     try:\n",
    "#         sv.compare([df[[col]], 'Raw'], [df_clean[[col]].reset_index(drop=True), 'Cleaned'])\n",
    "#     except Exception as e:\n",
    "#         print(f\"FAIL: {col} — {e}\")\n",
    "#     else:\n",
    "#         print(f\"ok: {col}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exclude the two datetime columns: they were converted to datetime in df_clean,\n",
    "# and SweetViz cannot compare them against the original string columns.\n",
    "exclude = ['start_time', 'end_time']\n",
    "\n",
    "compare = sv.compare(\n",
    "    [df.drop(columns=exclude), 'Raw'],\n",
    "    [df_clean.drop(columns=exclude).reset_index(drop=True), 'Cleaned']\n",
    ")\n",
    "compare.show_html('sweetviz_comparison_report.html', open_browser=True)\n",
    "print('Comparison report saved.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the comparison report, check that:\n",
    "- Boolean columns changed from TEXT → BOOL with only 2 distinct values\n",
    "- Categorical columns show dramatically reduced DISTINCT counts\n",
    "- `purchase_amount` changed from TEXT → NUMERIC\n",
    "- `avg_fps` maximum is no longer 10,000\n",
    "- `session_length_s` shows 0 missing\n",
    "\n",
    "---\n",
    "\n",
    "## Part 7 — Save the Cleaned Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_clean.to_csv('dataset_A_indie_game_telemetry_clean.csv', index=False)\n",
    "print(f'Saved: {len(df_clean)} rows, {len(df_clean.columns)} columns')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Key Takeaways\n",
    "\n",
    "**Three tools, three roles — they complement each other:**\n",
    "- **SweetViz** surfaces issues fast but cannot fix them: use it for triage and validation\n",
    "- **D-Tale** lets you see the data as a human would: use it to understand problems before and after fixing them\n",
    "- **pandas** is where all actual cleaning happens: explicit, reproducible, and version-controllable\n",
    "\n",
    "**Cleaning decisions are not mechanical:**\n",
    "- Dropping `session_length_s` nulls was justified here: it would not be in every context\n",
    "- Setting `avg_fps` outliers to NaN (not dropping rows) preserved valid data in other columns\n",
    "- `gpu_model` missingness is structurally meaningful: imputing it would destroy information\n",
    "\n",
    "**Common issue categories you have now fixed with pandas:**\n",
    "\n",
    "| Issue | pandas approach |\n",
    "|---|---|\n",
    "| Boolean encoding chaos | `.map(bool_map)` |\n",
    "| Case / whitespace inconsistency | `.str.strip().str.lower()` |\n",
    "| Typos in categories | `.replace({'controllr': 'controller'})` |\n",
    "| Wrong decimal separator | `.str.replace(',', '.')` + `.astype(float)` |\n",
    "| Structural missing values | `dropna(subset=[...])` with explicit rationale |\n",
    "| Outliers | Boolean mask + `.loc[mask, col] = NaN` |\n",
    "| Mixed datetime formats | `pd.to_datetime(utc=True, errors='coerce')` |\n",
    "\n",
    "→ In **Task 3**, you will apply these skills independently to a new dataset — with a checklist but without step-by-step guidance."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}