VI_Lab_01_EDA/claude/lab02_task2_telemetry_v4.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e28cb3de",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 43679 -- Interactive Visualization\n",
    "# 2025 - 2026\n",
    "# 2nd semester\n",
    "# Lab 1 - EDA (guided)\n",
    "# ver 1.2\n",
    "# 24022026 - Cosmetics; added rationale for task in scope of course"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lab 02<br>Task 2: Guided EDA and Data Cleaning\n",
    "\n",
    "The purpose of this task is to introduce you to the basic steps of data preparation for a dataset with several illustrative quality issues. In most situations you already have the basic code to run; in others, you need to infer from existing code how to complete the step. What is important here is for you to be able to identify the issues, understand the tools and approaches that may help tackle them, and acquire a systematic way of thinking about data preparation.\n",
    "\n",
    "**Don't just run the code. Understand why it is needed and what it is doing.**\n",
    "\n",
    "**NOTE**: For cells that ask questions or contain tables to fill in, double-click the cell and edit it with your answers and rationale.\n",
    "\n",
    "**Dataset:** `dataset_A_indie_game_telemetry_v2.csv`\n",
    "\n",
    "---\n",
    "\n",
    "### Objectives\n",
    "\n",
    "By the end of this task you will be able to:\n",
    "- Use **SweetViz** to rapidly profile a dataset and identify issues\n",
    "- Use **D-Tale** to navigate and inspect a dataframe interactively\n",
    "- Use **pandas** to fix the most common categories of data quality problems\n",
    "- Make and justify cleaning decisions rather than applying fixes mechanically\n",
    "\n",
    "### Tools and their roles in this task\n",
    "\n",
    "| Tool | Role |\n",
    "|---|---|\n",
    "| **SweetViz** | Automated profiling: generate a report, triage what needs fixing |\n",
    "| **D-Tale** | Interactive navigation: browse rows, inspect value counts, confirm fixes visually |\n",
    "| **pandas** | All actual cleaning: every transformation is explicit, reproducible code |\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1 — Setup and First Look"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import sweetviz as sv\n",
    "import dtale\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "import pygwalker as pyg"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shape: (10000, 20)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>session_id</th>\n",
       "      <th>user_id</th>\n",
       "      <th>start_time</th>\n",
       "      <th>end_time</th>\n",
       "      <th>session_length_s</th>\n",
       "      <th>region</th>\n",
       "      <th>platform</th>\n",
       "      <th>gpu_model</th>\n",
       "      <th>avg_fps</th>\n",
       "      <th>ping_ms</th>\n",
       "      <th>map_name</th>\n",
       "      <th>crash_flag</th>\n",
       "      <th>purchase_amount</th>\n",
       "      <th>party_size</th>\n",
       "      <th>input_method</th>\n",
       "      <th>build_version</th>\n",
       "      <th>is_featured_event</th>\n",
       "      <th>device_temp_c</th>\n",
       "      <th>session_type</th>\n",
       "      <th>is_long_session</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>sess_c2fba8e7f37a</td>\n",
       "      <td>user_488</td>\n",
       "      <td>2025-07-18T18:32:00Z</td>\n",
       "      <td>2025-07-18 20:03:21-05:00</td>\n",
       "      <td>5481.0</td>\n",
       "      <td>us-west</td>\n",
       "      <td>pc</td>\n",
       "      <td>GTX1080</td>\n",
       "      <td>83.52</td>\n",
       "      <td>431.16</td>\n",
       "      <td>ocean</td>\n",
       "      <td>Yes</td>\n",
       "      <td>0,00</td>\n",
       "      <td>2</td>\n",
       "      <td>Touch</td>\n",
       "      <td>NaN</td>\n",
       "      <td>No</td>\n",
       "      <td>85.6</td>\n",
       "      <td>ranked</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>sess_33d286298cf9</td>\n",
       "      <td>user_1511</td>\n",
       "      <td>2025-06-13 23:21:08+00:00</td>\n",
       "      <td>2025-06-13 23:36:30+01:00</td>\n",
       "      <td>922.0</td>\n",
       "      <td>Us-east</td>\n",
       "      <td>PlayStation</td>\n",
       "      <td>NaN</td>\n",
       "      <td>72.75</td>\n",
       "      <td>29.12</td>\n",
       "      <td>desert</td>\n",
       "      <td>No</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3</td>\n",
       "      <td>Touch</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>62.0</td>\n",
       "      <td>casual</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>sess_be2bb4d8986a</td>\n",
       "      <td>user_830</td>\n",
       "      <td>2025-10-20 02:42:07-05:00</td>\n",
       "      <td>20/10/2025 02:49</td>\n",
       "      <td>451.0</td>\n",
       "      <td>sa-east-1</td>\n",
       "      <td>PlayStation</td>\n",
       "      <td>NaN</td>\n",
       "      <td>69.20</td>\n",
       "      <td>40.47</td>\n",
       "      <td>Forest</td>\n",
       "      <td>False</td>\n",
       "      <td>0.0</td>\n",
       "      <td>5</td>\n",
       "      <td>TOUCH</td>\n",
       "      <td>1.4</td>\n",
       "      <td>False</td>\n",
       "      <td>69.0</td>\n",
       "      <td>ranked</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>sess_7f425ca9a0e2</td>\n",
       "      <td>user_1</td>\n",
       "      <td>08/01/2025 06:35</td>\n",
       "      <td>2025-08-01T08:32:45Z</td>\n",
       "      <td>7031.0</td>\n",
       "      <td>sa-east-1</td>\n",
       "      <td>PlayStation</td>\n",
       "      <td>NaN</td>\n",
       "      <td>33.29</td>\n",
       "      <td>92.40</td>\n",
       "      <td>Desert</td>\n",
       "      <td>No</td>\n",
       "      <td>17.55</td>\n",
       "      <td>1</td>\n",
       "      <td>Controller</td>\n",
       "      <td>1.3.2</td>\n",
       "      <td>0</td>\n",
       "      <td>48.1</td>\n",
       "      <td>casual</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>sess_5657e28b22ec</td>\n",
       "      <td>user_211</td>\n",
       "      <td>2025-09-08T23:41:44Z</td>\n",
       "      <td>2025-09-09 00:32:59+01:00</td>\n",
       "      <td>3075.0</td>\n",
       "      <td>US-EAST</td>\n",
       "      <td>switch</td>\n",
       "      <td>NaN</td>\n",
       "      <td>69.96</td>\n",
       "      <td>12.63</td>\n",
       "      <td>Desert</td>\n",
       "      <td>False</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2</td>\n",
       "      <td>controllr</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>54.7</td>\n",
       "      <td>casual</td>\n",
       "      <td>Yes</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          session_id    user_id                 start_time  \\\n",
       "0  sess_c2fba8e7f37a   user_488       2025-07-18T18:32:00Z   \n",
       "1  sess_33d286298cf9  user_1511  2025-06-13 23:21:08+00:00   \n",
       "2  sess_be2bb4d8986a   user_830  2025-10-20 02:42:07-05:00   \n",
       "3  sess_7f425ca9a0e2     user_1           08/01/2025 06:35   \n",
       "4  sess_5657e28b22ec   user_211       2025-09-08T23:41:44Z   \n",
       "\n",
       "                    end_time  session_length_s     region     platform  \\\n",
       "0  2025-07-18 20:03:21-05:00            5481.0    us-west           pc   \n",
       "1  2025-06-13 23:36:30+01:00             922.0    Us-east  PlayStation   \n",
       "2           20/10/2025 02:49             451.0  sa-east-1  PlayStation   \n",
       "3       2025-08-01T08:32:45Z            7031.0  sa-east-1  PlayStation   \n",
       "4  2025-09-09 00:32:59+01:00            3075.0    US-EAST       switch   \n",
       "\n",
       "  gpu_model  avg_fps  ping_ms map_name crash_flag purchase_amount  party_size  \\\n",
       "0   GTX1080    83.52   431.16    ocean        Yes            0,00           2   \n",
       "1       NaN    72.75    29.12   desert         No             0.0           3   \n",
       "2       NaN    69.20    40.47   Forest      False             0.0           5   \n",
       "3       NaN    33.29    92.40   Desert         No           17.55           1   \n",
       "4       NaN    69.96    12.63   Desert      False             0.0           2   \n",
       "\n",
       "  input_method build_version is_featured_event  device_temp_c session_type  \\\n",
       "0        Touch           NaN                No           85.6       ranked   \n",
       "1        Touch           NaN                 0           62.0       casual   \n",
       "2        TOUCH           1.4             False           69.0       ranked   \n",
       "3   Controller         1.3.2                 0           48.1       casual   \n",
       "4    controllr           NaN                 0           54.7      casual    \n",
       "\n",
       "  is_long_session  \n",
       "0            True  \n",
       "1               0  \n",
       "2           False  \n",
       "3            True  \n",
       "4             Yes  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load the raw dataset — do NOT clean anything yet\n",
    "df = pd.read_csv('dataset_A_indie_game_telemetry_v2.csv')\n",
    "\n",
    "print(f'Shape: {df.shape}')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "8ca0358e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "0e7e473ff13d4dab8162abe663d1cf88",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Box(children=(HTML(value='\\n<div id=\"ifr-pyg-00064b925f58a3a86prM98tTQNaGHo5k\" style=\"height: auto\">\\n    <hea…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<script>\n",
       "    window.addEventListener(\"message\", function(event) {\n",
       "        const backgroundMap = {\n",
       "            \"dark\": \"hsl(240 10% 3.9%)\",\n",
       "            \"light\": \"hsl(0 0 100%)\",\n",
       "        };\n",
       "        const colorMap = {\n",
       "            \"dark\": \"hsl(0 0% 98%)\",\n",
       "            \"light\": \"hsl(240 10% 3.9%)\",\n",
       "        };\n",
       "        if (event.data.action === \"changeAppearance\" && event.data.gid === \"00064b925f58a3a86prM98tTQNaGHo5k\") {\n",
       "            var iframe = document.getElementById(\"gwalker-00064b925f58a3a86prM98tTQNaGHo5k\");\n",
       "            iframe.style.background  = backgroundMap[event.data.appearance];\n",
       "            iframe.style.color = colorMap[event.data.appearance];\n",
       "        }\n",
       "    });\n",
       "</script>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "<pygwalker.api.pygwalker.PygWalker at 0x29f79a1b690>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pyg.walk(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Column names and types as pandas inferred them\n",
    "print(df.dtypes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **⚠️ Notice:** Several columns that should be boolean (`crash_flag`, `is_featured_event`, `is_long_session`) or\n",
    "> numeric (`purchase_amount`) have been inferred as `object`. This is your first signal that something is wrong.\n",
    "\n",
    "---\n",
    "\n",
    "## Part 2: Automated Profiling with SweetViz\n",
    "\n",
    "SweetViz generates a visual report for the entire dataset in one call. Think of it as a **triage tool** — it shows you *where* to look; the actual investigation and fixing happens afterwards."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate the profiling report (~30–60 seconds)\n",
    "report = sv.analyze(df)\n",
    "report.show_html('sweetviz_raw_report.html', open_browser=True)\n",
    "print('Report saved. Open sweetviz_raw_report.html in your browser.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Open the report and answer the following before moving on.\n",
    "\n",
    "| Question | Your finding |\n",
    "|---|---|\n",
    "| Which columns have missing values? Which has the most? | *...* |\n",
    "| Which columns are shown as TEXT but should be boolean or numeric? | *...* |\n",
    "| Are there numeric columns with suspicious ranges? | *...* |\n",
    "| How many distinct values does `region` have? Does that seem right? | *...* |\n",
    "| What is unusual about `purchase_amount`? | *...* |\n",
    "\n",
    "\n",
    "\n",
    "---\n",
    "\n",
    "## Part 3: Navigate and Inspect with D-Tale\n",
    "\n",
    "Before writing any cleaning code, use D-Tale to browse the raw data and *see* the problems with your own eyes. You will not clean anything here — D-Tale is your inspection tool.\n",
    "\n",
    "**Launch D-Tale:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d = dtale.show(df, host='127.0.0.1', subprocess=True, open_browser=True)\n",
    "print('=' * 50)\n",
    "print('D-Tale is running.')\n",
    "print('Open this URL in your browser:', d._url)\n",
    "print('In VS Code: Ctrl+click the URL above.')\n",
    "print('=' * 50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inspection checklist\n",
    "\n",
    "Use D-Tale to confirm each issue SweetViz flagged. For each column, click the column header → **Describe** to see value counts and distribution.\n",
    "\n",
    "| What to inspect | How to do it in D-Tale | What you should see |\n",
    "|---|---|---|\n",
    "| `crash_flag` unique values | Column header → Describe | 8 variants of True/False |\n",
    "| `region` unique values | Column header → Describe | ~32 variants of 5 region names |\n",
    "| `input_method` unique values | Column header → Describe | A typo: `controllr` |\n",
    "| `purchase_amount` raw values | Sort column ascending | Some values use a comma as the decimal separator: `1,80` |\n",
    "| `avg_fps` distribution | Column header → Describe | Max of 10,000 — clearly wrong |\n",
    "| Missing values overview | Top menu → Describe (all columns) | `gpu_model` dominates |\n",
    "\n",
    "<br>\n",
    "\n",
    "> Once you have seen the problems in the raw data, come back to the notebook for cleaning.\n",
    "\n",
    "---\n",
    "\n",
    "## Part 4: Clean with Pandas\n",
    "\n",
    "We will work through seven issue categories. Each section follows the same pattern:\n",
    "1. **Inspect** — confirm the problem in code\n",
    "2. **Fix** — apply the pandas transformation\n",
    "3. **Verify** — check the result\n",
    "\n",
    "We work on a copy of the original dataframe so the raw data is always available for comparison."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Always work on a copy — keep df as the unchanged original\n",
    "df_clean = df.copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.1. Boolean columns: inconsistent encoding\n",
    "\n",
    "Three columns (`crash_flag`, `is_featured_event`, `is_long_session`) each have **8 different representations** of the same two values: `True`, `False`, `true`, `false`, `1`, `0`, `Yes`, `No`.\n",
    "\n",
    "The fix is to define an explicit mapping and apply it with `.map()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — confirm the problem\n",
    "print('crash_flag unique values:', sorted(df_clean['crash_flag'].dropna().unique()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the mapping for replacements\n",
    "# Why did I place True:True and False: False? Ideas?\n",
    "\n",
    "bool_map = {\n",
    "    'True': True,  'true': True,  '1': True,  'Yes': True,  True: True,\n",
    "    'False': False, 'false': False, '0': False, 'No': False, False: False\n",
    "}\n",
    "\n",
    "df_clean['crash_flag'] = df_clean['crash_flag'].map(bool_map)\n",
    "\n",
    "print('crash_flag after mapping:')\n",
    "print(df_clean['crash_flag'].value_counts())\n",
    "print('Nulls:', df_clean['crash_flag'].isna().sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TO DO:\n",
    "# Apply the same mapping to the other two boolean columns\n",
    "# Follow the same pattern as above for is_featured_event and is_long_session\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.2. Categorical columns: case and whitespace inconsistency\n",
    "\n",
    "Four columns have values that are logically identical but differ in case or surrounding whitespace:\n",
    "- `region` — 32 variants of 5 values (e.g. `us-west`, `US-WEST`, `Us-west`, `' us-west '`)\n",
    "- `map_name` — 36 variants of 6 values\n",
    "- `platform` — 32 variants of 6 values\n",
    "- `input_method` — 30 variants, including a **typo**: `controllr`\n",
    "\n",
    "The fix uses pandas string methods: `.str.strip()` removes surrounding whitespace, `.str.lower()` normalises case. They can be chained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — how many unique values before cleaning?\n",
    "print('region unique before:', df_clean['region'].unique())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fix region: strip whitespace and convert to lowercase\n",
    "df_clean['region'] = df_clean['region'].str.strip().str.lower()\n",
    "\n",
    "# Verify\n",
    "print('region unique after:', df_clean['region'].unique())\n",
    "print(df_clean['region'].value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TO DO: \n",
    "# Apply the same strip + lower to map_name and platform\n",
    "# Follow the same pattern as above\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# input_method needs an extra step: fix the typo and standardise kb/m → kbm\n",
    "\n",
    "# Step 0: Inspect\n",
    "print('input_method unique before:', df_clean['input_method'].unique())\n",
    "\n",
    "# Step 1: strip and lowercase first\n",
    "df_clean['input_method'] = df_clean['input_method'].str.strip().str.lower()\n",
    "\n",
    "# Step 2: fix the two inconsistencies with replace()\n",
    "df_clean['input_method'] = df_clean['input_method'].replace({\n",
    "    'controllr': 'controller',  # typo\n",
    "    'kb/m': 'kbm'               # alternate spelling of the same input method\n",
    "})\n",
    "\n",
    "# Verify — should now show exactly 3 unique values\n",
    "print('input_method unique after:', df_clean['input_method'].unique())\n",
    "print(df_clean['input_method'].value_counts())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.3. `purchase_amount`: comma as decimal separator\n",
    "\n",
    "About 12% of rows use a comma instead of a decimal point (`1,80` instead of `1.80`). This prevented pandas from reading the column as numeric, so it was loaded as `object`.\n",
    "\n",
    "The fix: replace the comma in the string, then convert the column type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — how many rows have a comma?\n",
    "comma_rows = df_clean['purchase_amount'].astype(str).str.contains(',', na=False)\n",
    "print(f'Rows with comma separator: {comma_rows.sum()}')\n",
    "print('Examples:', df_clean.loc[comma_rows, 'purchase_amount'].unique()[:6])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fix: replace comma with decimal point, then convert to float\n",
    "df_clean['purchase_amount'] = (\n",
    "    df_clean['purchase_amount']\n",
    "    .astype(str)                            # ensure we are working with strings\n",
    "    .str.replace(',', '.', regex=False)     # swap the separator\n",
    "    .replace('nan', float('nan'))           # restore actual NaN rows\n",
    "    .astype(float)                          # convert to numeric\n",
    ")\n",
    "\n",
    "# Verify\n",
    "print('dtype:', df_clean['purchase_amount'].dtype)\n",
    "print(df_clean['purchase_amount'].describe().round(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.4. Missing values: decisions and strategy\n",
    "\n",
    "Not all missing values are the same. Before deciding what to do, you need to understand *why* the value is missing: the reason determines the correct action.\n",
    "\n",
    "| Column | Missing | Why | Decision |\n",
    "|---|---|---|---|\n",
    "| `gpu_model` | 66.7% | Console/mobile sessions report no GPU model | Keep column; the missingness itself is informative |\n",
    "| `build_version` | 16.5% | Not logged in older sessions | Keep as NaN — valid historical absence |\n",
    "| `device_temp_c` | 4.9% | Sensor not available on some devices | Keep as NaN |\n",
    "| `session_length_s` | 1.0% | Session ended abnormally | Drop missing rows now; fix negatives/outliers after datetime correction (section 4.6) |\n",
    "| `ping_ms`, `purchase_amount`, `end_time` | < 2% | Sporadic gaps | Keep as NaN |\n",
    "\n",
    "<br>\n",
    "\n",
    "> **⚠️ Context always matters.** There is no universal rule for missing values. The decisions above are reasonable for this dataset and analytical goal, but a different context might lead to different choices.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — missing value counts across all columns\n",
    "missing = df_clean.isnull().sum()\n",
    "missing_pct = (missing / len(df_clean) * 100).round(1)\n",
    "pd.DataFrame({'missing': missing, '%': missing_pct})[missing > 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# session_length_s: drop rows where it is missing\n",
    "# Rationale: session duration is a core metric — a session with no recorded\n",
    "# duration is structurally incomplete and cannot be used for most analyses.\n",
    "# These 98 rows represent <1% of the dataset, so dropping is safe.\n",
    "\n",
    "rows_before = len(df_clean)\n",
    "df_clean = df_clean.dropna(subset=['session_length_s'])\n",
    "\n",
    "print(f'Rows dropped: {rows_before - len(df_clean)}')\n",
    "print(f'Rows remaining: {len(df_clean)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.5. Outliers: `avg_fps`\n",
    "\n",
    "The `avg_fps` column has a maximum of 10,000 fps — physically impossible for a game running in real time. The 75th percentile is ~82 fps, confirming that 10,000 is a logging error, not an extreme but plausible value.\n",
    "\n",
    "**Decision:** set values above 300 fps to `NaN` rather than dropping the entire row. The rest of the data in those rows (crash flag, purchase amount, session type) is likely still valid — it would be wasteful to discard it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — how many rows are affected?\n",
    "threshold = 300\n",
    "outlier_mask = df_clean['avg_fps'] > threshold\n",
    "print(f'Rows with avg_fps > {threshold}: {outlier_mask.sum()}')\n",
    "print('\\navg_fps distribution (before fix):')\n",
    "print(df_clean['avg_fps'].describe().round(1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fix: set outlier values to NaN using .loc with a boolean mask\n",
    "df_clean.loc[outlier_mask, 'avg_fps'] = float('nan')\n",
    "\n",
    "# Verify — max should now be well below 300\n",
    "print('avg_fps distribution (after fix):')\n",
    "print(df_clean['avg_fps'].describe().round(1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### 4.6. Datetime columns: mixed formats\n",
    "\n",
    "The `start_time` and `end_time` columns contain timestamps in at least four different formats:\n",
    "\n",
    "```\n",
    "2025-07-18T18:32:00Z          : ISO 8601 with UTC marker\n",
    "2025-07-18 20:03:21-05:00     : ISO 8601 with UTC offset\n",
    "20/10/2025 02:49              : European DD/MM/YYYY\n",
    "08/01/2025 06:35              : Ambiguous: US MM/DD or European DD/MM?\n",
    "```\n",
    "\n",
    "Mixed datetime formats are one of the most complex cleaning problems because some ambiguities cannot be resolved automatically -- `08/01/2025` could be August 1st or January 8th, and no algorithm can determine which without external context.\n",
    "\n",
    "> **Connection to `session_length_s`:** The negative values and extreme outliers we saw earlier in `session_length_s` are not independent errors -- they are a *consequence* of this datetime problem. When `start_time` and `end_time` were recorded in different formats and misinterpreted, the pre-computed duration came out wrong. After fixing the timestamps, we will recompute `session_length_s` from scratch and validate the result.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect — what does start_time actually look like?\n",
    "print('Sample values from start_time:')\n",
    "print(df_clean['start_time'].dropna().sample(8, random_state=42).tolist())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fix: pd.to_datetime with utc=True normalises all timezone-aware formats to UTC.\n",
    "# errors='coerce' converts anything it cannot parse to NaT (Not a Time) instead of crashing.\n",
    "df_clean['start_time'] = pd.to_datetime(df_clean['start_time'], utc=True, errors='coerce')\n",
    "df_clean['end_time']   = pd.to_datetime(df_clean['end_time'],   utc=True, errors='coerce')\n",
    "\n",
    "# Verify — check how many rows could not be parsed\n",
    "print('start_time dtype:', df_clean['start_time'].dtype)\n",
    "print('Unparsed start_time (NaT):', df_clean['start_time'].isna().sum())\n",
    "print('Unparsed end_time (NaT):  ', df_clean['end_time'].isna().sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Recompute session_length_s from the corrected timestamps\n",
    "# Now that start_time and end_time are both timezone-aware UTC datetimes,\n",
    "# the subtraction is unambiguous. We convert the result to seconds.\n",
    "df_clean['session_length_s'] = (\n",
    "    df_clean['end_time'] - df_clean['start_time']\n",
    ").dt.total_seconds()\n",
    "\n",
    "print('session_length_s after recomputation:')\n",
    "print(df_clean['session_length_s'].describe().round(1))\n",
    "print(f'\\nNegative values: {(df_clean[\"session_length_s\"] < 0).sum()}')\n",
    "print(f'> 8h (28800s):   {(df_clean[\"session_length_s\"] > 28800).sum()}')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Any remaining negative values are rows where timestamps were genuinely\n",
    "# ambiguous and could not be resolved -- the computed duration is meaningless.\n",
    "# Set them to NaN rather than dropping the row.\n",
    "\n",
    "neg_mask = df_clean['session_length_s'] < 0\n",
    "df_clean.loc[neg_mask, 'session_length_s'] = float('nan')\n",
    "print(f'Negative durations set to NaN: {neg_mask.sum()}')\n",
    "\n",
    "# Values above 8 hours (28800s) are suspicious for a game session.\n",
    "# Inspect them before deciding.\n",
    "\n",
    "long_mask = df_clean['session_length_s'] > 28800\n",
    "print(f'\\nSessions > 8h: {long_mask.sum()}')\n",
    "print(df_clean.loc[long_mask, ['session_length_s', 'start_time', 'end_time']].head(5).to_string())\n",
    "\n",
    "# Decision: sessions > 8h are almost certainly logging errors (game left running,\n",
    "# server not recording session end). Set to NaN.\n",
    "# As always — this threshold is a judgement call that depends on the game and context.\n",
    "df_clean.loc[long_mask, 'session_length_s'] = float('nan')\n",
    "print(f'\\nSessions > 8h set to NaN: {long_mask.sum()}')\n",
    "print('\\nFinal session_length_s distribution:')\n",
    "print(df_clean['session_length_s'].describe().round(1))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Note:** The NaT counts printed after parsing reflect rows where pandas could not parse the format unambiguously. These are not errors in the code; they are genuinely ambiguous records that require a domain decision to resolve (e.g., knowing that the data source always uses DD/MM/YYYY). NaT also covers rows where the raw timestamp was missing to begin with, such as the sporadic `end_time` gaps noted in 4.4.\n",
    "\n",
    "---\n",
    "\n",
    "**OPTIONAL: explore the unparsed rows**\n",
    "\n",
    "If you want to go further, the cells below help you examine which formats failed and attempt a two-pass parsing strategy. This is optional and not required to complete the lab.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# OPTIONAL: Step 1: inspect the unparsed rows\n",
    "# We use the index labels of df_clean (not a boolean mask) to look up raw values in df,\n",
    "# since the two dataframes have different lengths after the dropna() in step 4.4.\n",
    "unparsed_idx = df_clean.index[df_clean['start_time'].isna()]\n",
    "raw_start = df.loc[unparsed_idx, 'start_time'].dropna()\n",
    "\n",
    "print(f'Rows still unparsed: {len(unparsed_idx)}')\n",
    "print('\\nSample raw values:')\n",
    "print(raw_start.unique()[:12])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# OPTIONAL: Step 2: define a systematic multi-format parser\n",
    "#\n",
    "# Rather than relying on dayfirst=True inference, we try explicit format strings\n",
    "# in sequence and stop as soon as one succeeds for each row.\n",
    "# This makes every parsing rule visible, but the order of the formats still\n",
    "# decides how ambiguous values are read (see the note above candidate_formats).\n",
    "\n",
    "def try_formats(series, formats):\n",
    "    \"\"\"Try explicit datetime format strings in order.\n",
    "    Returns a UTC-aware Series; rows that match no format remain NaT.\"\"\"\n",
    "    result = pd.Series(pd.NaT, index=series.index, dtype='datetime64[ns, UTC]')\n",
    "    remaining = series.copy()\n",
    "    for fmt in formats:\n",
    "        parsed = pd.to_datetime(remaining, format=fmt, errors='coerce', utc=True)\n",
    "        resolved_idx = parsed.index[parsed.notna()]   # use index labels, not boolean mask\n",
    "        result.loc[resolved_idx] = parsed.loc[resolved_idx]\n",
    "        remaining = remaining.drop(index=resolved_idx) # drop resolved rows by label\n",
    "    return result\n",
    "\n",
    "# Format strings to try, in order.\n",
    "# A value with a field > 12 (e.g. 20/10/2025) matches only one of DD/MM and MM/DD,\n",
    "# so it is resolved correctly whatever the order.\n",
    "# A value with both fields <= 12 (e.g. 03/04/2025) matches both formats; the first one wins.\n",
    "# Listing DD/MM first therefore ASSUMES a day-first source for those rows.\n",
    "# That is a domain decision, not something the parser can verify.\n",
    "candidate_formats = [\n",
    "    '%d/%m/%Y %H:%M',   # European with time: 20/10/2025 14:30\n",
    "    '%m/%d/%Y %H:%M',   # US with time:       10/20/2025 14:30\n",
    "    '%d/%m/%Y',         # European date only: 20/10/2025\n",
    "    '%m/%d/%Y',         # US date only:       10/20/2025\n",
    "]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# OPTIONAL: Step 3: apply the systematic parser to unparsed rows\n",
    "raw_start = df.loc[unparsed_idx, 'start_time']\n",
    "raw_end   = df.loc[unparsed_idx, 'end_time']\n",
    "\n",
    "resolved_start = try_formats(raw_start, candidate_formats)\n",
    "resolved_end   = try_formats(raw_end,   candidate_formats)\n",
    "\n",
    "df_clean.loc[unparsed_idx, 'start_time'] = resolved_start\n",
    "df_clean.loc[unparsed_idx, 'end_time']   = resolved_end\n",
    "\n",
    "print(f'Resolved in second pass: {resolved_start.notna().sum()}')\n",
    "print(f'Still NaT (no candidate format matched): {df_clean[\"start_time\"].isna().sum()}')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# OPTIONAL: Step 4: inspect rows that are still unparsed\n",
    "# These rows matched none of the candidate formats, or had no raw value at all.\n",
    "# Note that day/month-ambiguous values (both fields <= 12) are NOT in this set:\n",
    "# try_formats already read them as DD/MM because that format is listed first.\n",
    "# Rows shown here stay NaT until you add a format that matches them.\n",
    "still_nat_idx = df_clean.index[df_clean['start_time'].isna()]\n",
    "if len(still_nat_idx) > 0:\n",
    "    print('Timestamps that matched no candidate format:')\n",
    "    print(df.loc[still_nat_idx, ['start_time', 'end_time']].head(10).to_string())\n",
    "else:\n",
    "    print('All timestamps resolved.')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# OPTIONAL: Step 5: recompute session_length_s with the newly resolved timestamps\n",
    "# More rows now have valid start_time and end_time, so more durations can be recovered.\n",
    "df_clean['session_length_s'] = (\n",
    "    df_clean['end_time'] - df_clean['start_time']\n",
    ").dt.total_seconds()\n",
    "\n",
    "# Re-apply the same validation as before\n",
    "neg_mask  = df_clean['session_length_s'] < 0\n",
    "long_mask = df_clean['session_length_s'] > 28800\n",
    "df_clean.loc[neg_mask | long_mask, 'session_length_s'] = float('nan')\n",
    "\n",
    "print('session_length_s after second-pass recomputation:')\n",
    "print(df_clean['session_length_s'].describe().round(1))\n",
    "print(f'\\nNaN values: {df_clean[\"session_length_s\"].isna().sum()}')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "</details>\n",
    "\n",
    "---\n",
    "\n",
    "## Part 5: Verify with D-Tale\n",
    "\n",
    "Reload the cleaned dataframe into D-Tale and visually confirm the fixes. This is a quick sanity check — you are looking for anything that looks wrong before committing to the cleaned dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Shut down the previous D-Tale instance and reload with the clean data\n",
    "d.kill()\n",
    "d_clean = dtale.show(df_clean, host='127.0.0.1', subprocess=True, open_browser=True)\n",
    "print('Open cleaned data in D-Tale:', d_clean._url)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In D-Tale, verify the following:\n",
    "\n",
    "| Column | What to check | Expected result |\n",
    "|---|---|---|\n",
    "| `crash_flag` | Describe → value counts | Only `True` and `False` |\n",
    "| `region` | Describe → value counts | Exactly 5 values, all lowercase |\n",
    "| `input_method` | Describe → value counts | Exactly 3 values, no `controllr` |\n",
    "| `purchase_amount` | Describe → dtype and range | float64, no commas |\n",
    "| `avg_fps` | Describe → max | Below 300 |\n",
    "| `session_length_s` | Describe → min and max | No negatives, no values > 28800 |\n",
    "| `start_time` | Describe → dtype | datetime64 |\n",
    "\n",
    "## Part 6: Compare initial and clean datasets with SweetViz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c8f0e03a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Debug helper: SweetViz can fail to compare a column whose data type changed in an incompatible way,\n",
    "# and it usually crashes without a useful explanation.\n",
    "# Uncomment the loop below to run the comparison one column at a time and find the column that fails.\n",
    "\n",
    "# Test comparison column by column\n",
    "# for col in df_clean.columns:\n",
    "#     try:\n",
    "#         sv.compare([df[[col]], 'Raw'], [df_clean[[col]].reset_index(drop=True), 'Cleaned'])\n",
    "#     except Exception as e:\n",
    "#         print(f\"FAIL: {col} — {e}\")\n",
    "#     else:\n",
    "#         print(f\"ok:   {col}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare the raw and cleaned versions of the dataset using SweetViz.\n",
    "# The comparison is not perfect: it works poorly when a column is categorical in one dataset\n",
    "# and boolean in the other, as with crash_flag. It still gives a useful overview.\n",
    "# start_time and end_time are excluded: we converted them to datetime, and SweetViz cannot compare them with the original data types.\n",
    "\n",
    "exclude = ['start_time', 'end_time']\n",
    "\n",
    "compare = sv.compare(\n",
    "    [df.drop(columns=exclude), 'Raw'],\n",
    "    [df_clean.drop(columns=exclude).reset_index(drop=True), 'Cleaned']\n",
    ")\n",
    "compare.show_html('sweetviz_comparison_report.html', open_browser=True)\n",
    "print('Comparison report saved.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the comparison report, check that:\n",
    "- Boolean columns changed from TEXT → BOOL with only 2 distinct values\n",
    "- Categorical columns show dramatically reduced DISTINCT counts\n",
    "- `purchase_amount` changed from TEXT → NUMERIC\n",
    "- `avg_fps` maximum is no longer 10,000\n",
    "- `session_length_s` has no negative values and no values above 28800 (the invalid durations set to NaN during cleaning are now counted as missing)\n",
    "\n",
    "---\n",
    "\n",
    "## Part 7: Save the Cleaned Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_clean.to_csv('dataset_A_indie_game_telemetry_clean.csv', index=False)\n",
    "print(f'Saved: {len(df_clean)} rows, {len(df_clean.columns)} columns')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Key Takeaways\n",
    "\n",
    "**Three tools, three roles — they complement each other:**\n",
    "- **SweetViz** surfaces issues fast but cannot fix them:  use it for triage and validation\n",
    "- **D-Tale** lets you see the data as a human would:  use it to understand problems before and after fixing them\n",
    "- **pandas** is where all actual cleaning happens: explicit, reproducible, and version-controllable\n",
    "\n",
    "**Cleaning decisions are not mechanical:**\n",
    "- Dropping `session_length_s` nulls was justified here: it would not be in every context\n",
    "- Setting `avg_fps` outliers to NaN (not dropping rows) preserved valid data in other columns\n",
    "- `gpu_model` missingness is structurally meaningful: imputing it would destroy information\n",
    "\n",
    "**Common issue categories you have now fixed with pandas:**\n",
    "\n",
    "| Issue | pandas approach |\n",
    "|---|---|\n",
    "| Boolean encoding chaos | `.map(bool_map)` |\n",
    "| Case / whitespace inconsistency | `.str.strip().str.lower()` |\n",
    "| Typos in categories | `.replace({'controllr': 'controller'})` |\n",
    "| Wrong decimal separator | `.str.replace(',', '.')` + `.astype(float)` |\n",
    "| Structural missing values | `dropna(subset=[...])` with explicit rationale |\n",
    "| Outliers | Boolean mask + `.loc[mask, col] = NaN` |\n",
    "| Mixed datetime formats | `pd.to_datetime(utc=True, errors='coerce')` |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "572f9d85",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}