VI_Lab_01_EDA/claude/lab02_task2_telemetry_v3.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lab 02 · Task 2 — Guided EDA and Data Cleaning\n",
    "\n",
    "**Estimated time:** ~50 minutes  \n",
    "**Dataset:** `dataset_A_indie_game_telemetry.csv`\n",
    "\n",
    "---\n",
    "\n",
    "### Objectives\n",
    "\n",
    "By the end of this task you will be able to:\n",
    "- Use **SweetViz** to rapidly profile a dataset and identify issues\n",
    "- Use **D-Tale** to navigate and inspect a dataframe interactively\n",
    "- Use **pandas** to fix the most common categories of data quality problems\n",
    "- Make and justify cleaning decisions rather than applying fixes mechanically\n",
    "\n",
    "### Tools and their roles in this task\n",
    "\n",
    "| Tool | Role |\n",
    "|---|---|\n",
    "| **SweetViz** | Automated profiling — generate a report, triage what needs fixing |\n",
    "| **D-Tale** | Interactive navigation — browse rows, inspect value counts, confirm fixes visually |\n",
    "| **pandas** | All actual cleaning — every transformation is explicit, reproducible code |\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1 — Setup and First Look"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import sweetviz as sv\n",
    "import dtale\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the raw dataset — do NOT clean anything yet\n",
    "df = pd.read_csv('dataset_A_indie_game_telemetry_v2.csv')\n",
    "\n",
    "print(f'Shape: {df.shape}')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Column names and types as pandas inferred them\n",
    "print(df.dtypes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **⚠️ Notice:** Several columns that should be boolean (`crash_flag`, `is_featured_event`, `is_long_session`) or\n",
    "> numeric (`purchase_amount`) have been inferred as `object`. This is your first signal that something is wrong.\n",
    "\n",
    "---\n",
    "\n",
    "## Part 2 — Automated Profiling with SweetViz\n",
    "\n",
    "SweetViz generates a visual report for the entire dataset in one call. Think of it as a **triage tool** — it shows you *where* to look; the actual investigation and fixing happens afterwards."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate the profiling report (~30–60 seconds)\n",
    "report = sv.analyze(df)\n",
    "report.show_html('sweetviz_raw_report.html', open_browser=True)\n",
    "print('Report saved. Open sweetviz_raw_report.html in your browser.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Open the report and answer the following before moving on.\n",
    "\n",
    "| Question | Your finding |\n",
    "|---|---|\n",
    "| Which columns have missing values? Which has the most? | *...* |\n",
    "| Which columns are shown as TEXT but should be boolean or numeric? | *...* |\n",
    "| Are there numeric columns with suspicious ranges? | *...* |\n",
    "| How many distinct values does `region` have? Does that seem right? | *...* |\n",
    "| What is unusual about `purchase_amount`? | *...* |\n",
    "\n",
    "*(Double-click to fill in your answers)*\n",
    "\n",
    "---\n",
    "\n",
    "## Part 3 — Navigate and Inspect with D-Tale\n",
    "\n",
    "Before writing any cleaning code, use D-Tale to browse the raw data and *see* the problems with your own eyes. You will not clean anything here — D-Tale is your inspection tool.\n",
    "\n",
    "**Launch D-Tale:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d = dtale.show(df, host='127.0.0.1', subprocess=True, open_browser=True)\n",
    "print('=' * 50)\n",
    "print('D-Tale is running.')\n",
    "print('Open this URL in your browser:', d._url)\n",
    "print('In VS Code: Ctrl+click the URL above.')\n",
    "print('=' * 50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inspection checklist\n",
    "\n",
    "Use D-Tale to confirm each issue SweetViz flagged. For each column, click the column header → **Describe** to see value counts and distribution.\n",
    "\n",
    "| What to inspect | How to do it in D-Tale | What you should see |\n",
    "|---|---|---|\n",
    "| `crash_flag` unique values | Column header → Describe | 8 variants of True/False |\n",
    "| `region` unique values | Column header → Describe | ~32 variants of 5 region names |\n",
    "| `input_method` unique values | Column header → Describe | A typo: `controllr` |\n",
    "| `purchase_amount` raw values | Sort column ascending | Some values use comma: `1,80` |\n",
    "| `avg_fps` distribution | Column header → Describe | Max of 10,000 — clearly wrong |\n",
    "| Missing values overview | Top menu → Describe (all columns) | `gpu_model` dominates |\n",
    "\n",
    "> Once you have seen the problems in the raw data, come back to the notebook for cleaning.\n",
    "\n",
    "---\n",
    "\n",
    "## Part 4 — Clean with Pandas\n",
    "\n",
    "We will work through seven issue categories. Each section follows the same pattern:\n",
    "1. **Inspect** — confirm the problem in code\n",
    "2. **Fix** — apply the pandas transformation\n",
    "3. **Verify** — check the result\n",
    "\n",
    "We work on a copy of the original dataframe so the raw data is always available for comparison."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Always work on a copy — keep df as the unchanged original\n",
    "df_clean = df.copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.1 — Boolean columns: inconsistent encoding\n",
    "\n",
    "Three columns (`crash_flag`, `is_featured_event`, `is_long_session`) each have **8 different representations** of the same two values: `True`, `False`, `true`, `false`, `1`, `0`, `Yes`, `No`.\n",
    "\n",
    "The fix is to define an explicit mapping and apply it with `.map()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — confirm the problem\n",
    "print('crash_flag unique values:', sorted(df_clean['crash_flag'].dropna().unique()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the mapping for replacements\n",
    "# Why did I place True:True and False: False? Ideas?\n",
    "\n",
    "bool_map = {\n",
    "    'True': True,  'true': True,  '1': True,  'Yes': True,  True: True,\n",
    "    'False': False, 'false': False, '0': False, 'No': False, False: False\n",
    "}\n",
    "\n",
    "df_clean['crash_flag'] = df_clean['crash_flag'].map(bool_map)\n",
    "\n",
    "print('crash_flag after mapping:')\n",
    "print(df_clean['crash_flag'].value_counts())\n",
    "print('Nulls:', df_clean['crash_flag'].isna().sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply the same mapping to the other two boolean columns\n",
    "# Follow the same pattern as above for is_featured_event and is_long_session\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.2 — Categorical columns: case and whitespace inconsistency\n",
    "\n",
    "Four columns have values that are logically identical but differ in case or surrounding whitespace:\n",
    "- `region` — 32 variants of 5 values (e.g. `us-west`, `US-WEST`, `Us-west`, `' us-west '`)\n",
    "- `map_name` — 36 variants of 6 values\n",
    "- `platform` — 32 variants of 6 values\n",
    "- `input_method` — 30 variants, including a **typo**: `controllr`\n",
    "\n",
    "The fix uses pandas string methods: `.str.strip()` removes surrounding whitespace, `.str.lower()` normalises case. They can be chained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — how many unique values before cleaning?\n",
    "print('region unique before:', df_clean['region'].unique())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fix region: strip whitespace and convert to lowercase\n",
    "df_clean['region'] = df_clean['region'].str.strip().str.lower()\n",
    "\n",
    "# Verify\n",
    "print('region unique after:', df_clean['region'].unique())\n",
    "print(df_clean['region'].value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TO DO: \n",
    "# Apply the same strip + lower to map_name and platform\n",
    "# Follow the same pattern as above\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# input_method needs an extra step: fix the typo and standardise kb/m → kbm\n",
    "\n",
    "# Step 0: Inspect\n",
    "print('input_method unique before:', df_clean['input_method'].unique())\n",
    "\n",
    "# Step 1: strip and lowercase first\n",
    "df_clean['input_method'] = df_clean['input_method'].str.strip().str.lower()\n",
    "\n",
    "# Step 2: fix the two inconsistencies with replace()\n",
    "df_clean['input_method'] = df_clean['input_method'].replace({\n",
    "    'controllr': 'controller',   \n",
    "    'kb/m': 'kbm'                \n",
    "})\n",
    "\n",
    "# Verify — should now show exactly 3 unique values\n",
    "print('input_method unique after:', df_clean['input_method'].unique())\n",
    "print(df_clean['input_method'].value_counts())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.3 — `purchase_amount`: comma as decimal separator\n",
    "\n",
    "About 12% of rows use a comma instead of a decimal point (`1,80` instead of `1.80`). This prevented pandas from reading the column as numeric, so it was loaded as `object`.\n",
    "\n",
    "The fix: replace the comma in the string, then convert the column type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — how many rows have a comma?\n",
    "comma_rows = df_clean['purchase_amount'].astype(str).str.contains(',', na=False)\n",
    "print(f'Rows with comma separator: {comma_rows.sum()}')\n",
    "print('Examples:', df_clean.loc[comma_rows, 'purchase_amount'].unique()[:6])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fix: replace comma with decimal point, then convert to float\n",
    "df_clean['purchase_amount'] = (\n",
    "    df_clean['purchase_amount']\n",
    "    .astype(str)                            # ensure we are working with strings\n",
    "    .str.replace(',', '.', regex=False)     # swap the separator\n",
    "    .replace('nan', float('nan'))           # restore actual NaN rows\n",
    "    .astype(float)                          # convert to numeric\n",
    ")\n",
    "\n",
    "# Verify\n",
    "print('dtype:', df_clean['purchase_amount'].dtype)\n",
    "print(df_clean['purchase_amount'].describe().round(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.4 — Missing values: decisions and strategy\n",
    "\n",
    "Not all missing values are the same. Before deciding what to do, you need to understand *why* the value is missing — the reason determines the correct action.\n",
    "\n",
    "| Column | Missing | Why | Decision |\n",
    "|---|---|---|---|\n",
    "| `gpu_model` | 66.7% | Console/mobile players have no GPU | Keep column — missingness is meaningful |\n",
    "| `build_version` | 16.5% | Not logged in older sessions | Keep as NaN — valid historical absence |\n",
    "| `device_temp_c` | 4.9% | Sensor not available on some devices | Keep as NaN |\n",
    "| `session_length_s` | 1.0% | Session ended abnormally | Drop missing rows now; fix negatives/outliers after datetime correction (section 4.6) |\n",
    "| `ping_ms`, `purchase_amount`, `end_time` | < 2% | Sporadic gaps | Keep as NaN |\n",
    "\n",
    "<br>\n",
    "\n",
    "> **⚠️ Context always matters.** There is no universal rule for missing values. The decisions above are reasonable for this dataset and analytical goal — but a different context might lead to different choices.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — missing value counts across all columns\n",
    "missing = df_clean.isnull().sum()\n",
    "missing_pct = (missing / len(df_clean) * 100).round(1)\n",
    "pd.DataFrame({'missing': missing, '%': missing_pct})[missing > 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# session_length_s: drop rows where it is missing\n",
    "# Rationale: session duration is a core metric — a session with no recorded\n",
    "# duration is structurally incomplete and cannot be used for most analyses.\n",
    "# These 98 rows represent <1% of the dataset, so dropping is safe.\n",
    "\n",
    "rows_before = len(df_clean)\n",
    "df_clean = df_clean.dropna(subset=['session_length_s'])\n",
    "\n",
    "print(f'Rows dropped: {rows_before - len(df_clean)}')\n",
    "print(f'Rows remaining: {len(df_clean)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.5 — Outliers: `avg_fps`\n",
    "\n",
    "The `avg_fps` column has a maximum of 10,000 fps — physically impossible for a game running in real time. The 75th percentile is ~82 fps, confirming that 10,000 is a logging error, not an extreme but plausible value.\n",
    "\n",
    "**Decision:** set values above 300 fps to `NaN` rather than dropping the entire row. The rest of the data in those rows (crash flag, purchase amount, session type) is likely still valid — it would be wasteful to discard it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — how many rows are affected?\n",
    "threshold = 300\n",
    "outlier_mask = df_clean['avg_fps'] > threshold\n",
    "print(f'Rows with avg_fps > {threshold}: {outlier_mask.sum()}')\n",
    "print('\\navg_fps distribution (before fix):')\n",
    "print(df_clean['avg_fps'].describe().round(1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fix: set outlier values to NaN using .loc with a boolean mask\n",
    "df_clean.loc[outlier_mask, 'avg_fps'] = float('nan')\n",
    "\n",
    "# Verify — max should now be well below 300\n",
    "print('avg_fps distribution (after fix):')\n",
    "print(df_clean['avg_fps'].describe().round(1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.6 — Datetime columns: mixed formats\n",
    "\n",
    "The `start_time` and `end_time` columns contain timestamps in at least four different formats:\n",
    "\n",
    "```\n",
    "2025-07-18T18:32:00Z          : ISO 8601 with UTC marker\n",
    "2025-07-18 20:03:21-05:00     : ISO 8601 with UTC offset\n",
    "20/10/2025 02:49              : European DD/MM/YYYY\n",
    "08/01/2025 06:35              : Ambiguous: US MM/DD or European DD/MM?\n",
    "```\n",
    "\n",
    "Mixed datetime formats are one of the most complex cleaning problems because some ambiguities cannot be resolved automatically -- `08/01/2025` could be August 1st or January 8th, and no algorithm can determine which without external context.\n",
    "\n",
    "> **Connection to `session_length_s`:** The negative values and extreme outliers we saw earlier in `session_length_s` are not independent errors -- they are a *consequence* of this datetime problem. When `start_time` and `end_time` were recorded in different formats and misinterpreted, the pre-computed duration came out wrong. After fixing the timestamps, we will recompute `session_length_s` from scratch and validate the result.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — what does start_time actually look like?\n",
    "print('Sample values from start_time:')\n",
    "print(df_clean['start_time'].dropna().sample(8, random_state=42).tolist())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fix: pd.to_datetime with utc=True normalises all timezone-aware formats to UTC.\n",
    "# errors='coerce' converts anything it cannot parse to NaT (Not a Time) instead of crashing.\n",
    "df_clean['start_time'] = pd.to_datetime(df_clean['start_time'], utc=True, errors='coerce')\n",
    "df_clean['end_time']   = pd.to_datetime(df_clean['end_time'],   utc=True, errors='coerce')\n",
    "\n",
    "# Verify — check how many rows could not be parsed\n",
    "print('start_time dtype:', df_clean['start_time'].dtype)\n",
    "print('Unparsed start_time (NaT):', df_clean['start_time'].isna().sum())\n",
    "print('Unparsed end_time (NaT):  ', df_clean['end_time'].isna().sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Recompute session_length_s from the corrected timestamps\n",
    "# Now that start_time and end_time are both timezone-aware UTC datetimes,\n",
    "# the subtraction is unambiguous. We convert the result to seconds.\n",
    "df_clean['session_length_s'] = (\n",
    "    df_clean['end_time'] - df_clean['start_time']\n",
    ").dt.total_seconds()\n",
    "\n",
    "print('session_length_s after recomputation:')\n",
    "print(df_clean['session_length_s'].describe().round(1))\n",
    "print(f'\\nNegative values: {(df_clean[\"session_length_s\"] < 0).sum()}')\n",
    "print(f'> 8h (28800s):   {(df_clean[\"session_length_s\"] > 28800).sum()}')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Any remaining negative values are rows where timestamps were genuinely\n",
    "# ambiguous and could not be resolved -- the computed duration is meaningless.\n",
    "# Set them to NaN rather than dropping the row.\n",
    "\n",
    "neg_mask = df_clean['session_length_s'] < 0\n",
    "df_clean.loc[neg_mask, 'session_length_s'] = float('nan')\n",
    "print(f'Negative durations set to NaN: {neg_mask.sum()}')\n",
    "\n",
    "# Values above 8 hours (28800s) are suspicious for a game session.\n",
    "# Inspect them before deciding.\n",
    "\n",
    "long_mask = df_clean['session_length_s'] > 28800\n",
    "print(f'\\nSessions > 8h: {long_mask.sum()}')\n",
    "print(df_clean.loc[long_mask, ['session_length_s', 'start_time', 'end_time']].head(5).to_string())\n",
    "\n",
    "# Decision: sessions > 8h are almost certainly logging errors (game left running,\n",
    "# server not recording session end). Set to NaN.\n",
    "# As always — this threshold is a judgement call that depends on the game and context.\n",
    "df_clean.loc[long_mask, 'session_length_s'] = float('nan')\n",
    "print(f'\\nSessions > 8h set to NaN: {long_mask.sum()}')\n",
    "print('\\nFinal session_length_s distribution:')\n",
    "print(df_clean['session_length_s'].describe().round(1))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Note:** The number of NaT values above reflects rows where pandas could not parse the format unambiguously. These are not errors in the code — they are genuinely ambiguous records that require a domain decision to resolve (e.g., knowing that the data source always uses DD/MM/YYYY).\n",
    "\n",
    "---\n",
    "\n",
    "**📌 Optional — explore the unparsed rows**\n",
    "\n",
    "If you want to go further, the cells below help you examine which formats failed and attempt a two-pass parsing strategy. This is optional and not required to complete the lab.\n",
    "\n",
    "<details>\n",
    "<summary>Click to expand optional exploration</summary>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# OPTIONAL: Identify the raw values that failed to parse\n",
    "# We use the index of df_clean to look up the original values in df,\n",
    "# rather than a boolean mask — the two dataframes have different lengths\n",
    "# after the dropna() in step 4.4, so their indices no longer align.\n",
    "unparsed_idx = df_clean.index[df_clean['start_time'].isna()]\n",
    "print(f'Rows with unparsed start_time: {len(unparsed_idx)}')\n",
    "print('\\nRaw values that could not be parsed:')\n",
    "print(df.loc[unparsed_idx, 'start_time'].dropna().unique()[:15])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2026-02-23 18:31:02,737 - INFO     - Executing shutdown due to inactivity...\n",
      "2026-02-23 18:31:02,790 - INFO     - Executing shutdown...\n",
      "2026-02-23 18:31:02,795 - INFO     - Not running with the Werkzeug Server, exiting by searching gc for BaseWSGIServer\n"
     ]
    }
   ],
   "source": [
    "# OPTIONAL: Two-pass strategy — try a second format for the rows that failed\n",
    "# If you determine the ambiguous rows use DD/MM/YYYY, try dayfirst=True on them only\n",
    "unparsed_idx = df_clean.index[df_clean['start_time'].isna()]\n",
    "df_clean.loc[unparsed_idx, 'start_time'] = pd.to_datetime(\n",
    "    df.loc[unparsed_idx, 'start_time'],\n",
    "    dayfirst=True, utc=True, errors='coerce'\n",
    ")\n",
    "print('NaT after second pass:', df_clean['start_time'].isna().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "</details>\n",
    "\n",
    "---\n",
    "\n",
    "## Part 5 — Verify with D-Tale\n",
    "\n",
    "Reload the cleaned dataframe into D-Tale and visually confirm the fixes. This is a quick sanity check — you are looking for anything that looks wrong before committing to the cleaned dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Shut down the previous D-Tale instance and reload with the clean data\n",
    "d.kill()\n",
    "d_clean = dtale.show(df_clean, host='127.0.0.1', subprocess=True, open_browser=True)\n",
    "print('Open cleaned data in D-Tale:', d_clean._url)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In D-Tale, verify the following:\n",
    "\n",
    "| Column | What to check | Expected result |\n",
    "|---|---|---|\n",
    "| `crash_flag` | Describe → value counts | Only `True` and `False` |\n",
    "| `region` | Describe → value counts | Exactly 5 values, all lowercase |\n",
    "| `input_method` | Describe → value counts | Exactly 3 values, no `controllr` |\n",
    "| `purchase_amount` | Describe → dtype and range | float64, no commas |\n",
    "| `avg_fps` | Describe → max | Below 300 |\n",
    "| `session_length_s` | Describe → min and max | No negatives, no values > 28800 |\n",
    "| `start_time` | Describe → dtype | datetime64 |\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c8f0e03a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Debug\n",
    "\n",
    "# Test comparison column by column\n",
    "# for col in df_clean.columns:\n",
    "#     try:\n",
    "#         sv.compare([df[[col]], 'Raw'], [df_clean[[col]].reset_index(drop=True), 'Cleaned'])\n",
    "#     except Exception as e:\n",
    "#         print(f\"FAIL: {col} — {e}\")\n",
    "#     else:\n",
    "#         print(f\"ok:   {col}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "exclude = ['start_time', 'end_time'] # needed to exclude these two because we converted them to datetime and sweetviz is not able to compare it with the original data types\n",
    "\n",
    "compare = sv.compare(\n",
    "    [df.drop(columns=exclude), 'Raw'],\n",
    "    [df_clean.drop(columns=exclude).reset_index(drop=True), 'Cleaned']\n",
    ")\n",
    "compare.show_html('sweetviz_comparison_report.html', open_browser=True)\n",
    "print('Comparison report saved.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the comparison report, check that:\n",
    "- Boolean columns changed from TEXT → BOOL with only 2 distinct values\n",
    "- Categorical columns show dramatically reduced DISTINCT counts\n",
    "- `purchase_amount` changed from TEXT → NUMERIC\n",
    "- `avg_fps` maximum is no longer 10,000\n",
    "- `session_length_s` shows 0 missing\n",
    "\n",
    "---\n",
    "\n",
    "## Part 7 — Save the Cleaned Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_clean.to_csv('dataset_A_indie_game_telemetry_clean.csv', index=False)\n",
    "print(f'Saved: {len(df_clean)} rows, {len(df_clean.columns)} columns')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Key Takeaways\n",
    "\n",
    "**Three tools, three roles — they complement each other:**\n",
    "- **SweetViz** surfaces issues fast but cannot fix them:  use it for triage and validation\n",
    "- **D-Tale** lets you see the data as a human would:  use it to understand problems before and after fixing them\n",
    "- **pandas** is where all actual cleaning happens: explicit, reproducible, and version-controllable\n",
    "\n",
    "**Cleaning decisions are not mechanical:**\n",
    "- Dropping `session_length_s` nulls was justified here: it would not be in every context\n",
    "- Setting `avg_fps` outliers to NaN (not dropping rows) preserved valid data in other columns\n",
    "- `gpu_model` missingness is structurally meaningful: imputing it would destroy information\n",
    "\n",
    "**Common issue categories you have now fixed with pandas:**\n",
    "\n",
    "| Issue | pandas approach |\n",
    "|---|---|\n",
    "| Boolean encoding chaos | `.map(bool_map)` |\n",
    "| Case / whitespace inconsistency | `.str.strip().str.lower()` |\n",
    "| Typos in categories | `.replace({'controllr': 'controller'})` |\n",
    "| Wrong decimal separator | `.str.replace(',', '.')` + `.astype(float)` |\n",
    "| Structural missing values | `dropna(subset=[...])` with explicit rationale |\n",
    "| Outliers | Boolean mask + `.loc[mask, col] = NaN` |\n",
    "| Mixed datetime formats | `pd.to_datetime(utc=True, errors='coerce')` |\n",
    "\n",
    "→ In **Task 3**, you will apply these skills independently to a new dataset — with a checklist but without step-by-step guidance."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}