{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab 02 · Task 2 — Guided EDA and Data Cleaning with SweetViz & D-Tale\n",
"\n",
"**Estimated time:** ~50 minutes \n",
"**Dataset:** `dataset_A_indie_game_telemetry_v2.csv`\n",
"\n",
"---\n",
"\n",
"### Objectives\n",
"\n",
"By the end of this task you will be able to:\n",
"- Generate an automated EDA report with **SweetViz** to get a rapid overview of a dataset\n",
"- Use **D-Tale** interactively to identify and fix data quality problems\n",
"- Recognise the most common categories of data issues: inconsistent encoding, mixed types, excessive missingness, and outliers\n",
"- Understand how interactive tools translate cleaning actions into pandas code\n",
"\n",
"---\n",
"\n",
"### Context\n",
"\n",
"You have been handed a telemetry dataset from a small indie game studio. It contains **10,000 session records** with information about players, platforms, performance metrics, and purchases. Before any visualisation or analysis can be built on top of this data, it must be understood and cleaned.\n",
"\n",
"This is real-world data quality: messy, inconsistent, and requiring decisions — not just mechanical fixes.\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1 — Setup and First Load"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape: (10000, 20)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>session_id</th>\n",
" <th>user_id</th>\n",
" <th>start_time</th>\n",
" <th>end_time</th>\n",
" <th>session_length_s</th>\n",
" <th>region</th>\n",
" <th>platform</th>\n",
" <th>gpu_model</th>\n",
" <th>avg_fps</th>\n",
" <th>ping_ms</th>\n",
" <th>map_name</th>\n",
" <th>crash_flag</th>\n",
" <th>purchase_amount</th>\n",
" <th>party_size</th>\n",
" <th>input_method</th>\n",
" <th>build_version</th>\n",
" <th>is_featured_event</th>\n",
" <th>device_temp_c</th>\n",
" <th>session_type</th>\n",
" <th>is_long_session</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>sess_c2fba8e7f37a</td>\n",
" <td>user_488</td>\n",
" <td>2025-07-18T18:32:00Z</td>\n",
" <td>2025-07-18 20:03:21-05:00</td>\n",
" <td>5481.0</td>\n",
" <td>us-west</td>\n",
" <td>pc</td>\n",
" <td>GTX1080</td>\n",
" <td>83.52</td>\n",
" <td>431.16</td>\n",
" <td>ocean</td>\n",
" <td>Yes</td>\n",
" <td>0,00</td>\n",
" <td>2</td>\n",
" <td>Touch</td>\n",
" <td>NaN</td>\n",
" <td>No</td>\n",
" <td>85.6</td>\n",
" <td>ranked</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>sess_33d286298cf9</td>\n",
" <td>user_1511</td>\n",
" <td>2025-06-13 23:21:08+00:00</td>\n",
" <td>2025-06-13 23:36:30+01:00</td>\n",
" <td>922.0</td>\n",
" <td>Us-east</td>\n",
" <td>PlayStation</td>\n",
" <td>NaN</td>\n",
" <td>72.75</td>\n",
" <td>29.12</td>\n",
" <td>desert</td>\n",
" <td>No</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>Touch</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>62.0</td>\n",
" <td>casual</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>sess_be2bb4d8986a</td>\n",
" <td>user_830</td>\n",
" <td>2025-10-20 02:42:07-05:00</td>\n",
" <td>20/10/2025 02:49</td>\n",
" <td>451.0</td>\n",
" <td>sa-east-1</td>\n",
" <td>PlayStation</td>\n",
" <td>NaN</td>\n",
" <td>69.20</td>\n",
" <td>40.47</td>\n",
" <td>Forest</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>TOUCH</td>\n",
" <td>1.4</td>\n",
" <td>False</td>\n",
" <td>69.0</td>\n",
" <td>ranked</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>sess_7f425ca9a0e2</td>\n",
" <td>user_1</td>\n",
" <td>08/01/2025 06:35</td>\n",
" <td>2025-08-01T08:32:45Z</td>\n",
" <td>7031.0</td>\n",
" <td>sa-east-1</td>\n",
" <td>PlayStation</td>\n",
" <td>NaN</td>\n",
" <td>33.29</td>\n",
" <td>92.40</td>\n",
" <td>Desert</td>\n",
" <td>No</td>\n",
" <td>17.55</td>\n",
" <td>1</td>\n",
" <td>Controller</td>\n",
" <td>1.3.2</td>\n",
" <td>0</td>\n",
" <td>48.1</td>\n",
" <td>casual</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>sess_5657e28b22ec</td>\n",
" <td>user_211</td>\n",
" <td>2025-09-08T23:41:44Z</td>\n",
" <td>2025-09-09 00:32:59+01:00</td>\n",
" <td>3075.0</td>\n",
" <td>US-EAST</td>\n",
" <td>switch</td>\n",
" <td>NaN</td>\n",
" <td>69.96</td>\n",
" <td>12.63</td>\n",
" <td>Desert</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>controllr</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>54.7</td>\n",
" <td>casual</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" session_id user_id start_time \\\n",
"0 sess_c2fba8e7f37a user_488 2025-07-18T18:32:00Z \n",
"1 sess_33d286298cf9 user_1511 2025-06-13 23:21:08+00:00 \n",
"2 sess_be2bb4d8986a user_830 2025-10-20 02:42:07-05:00 \n",
"3 sess_7f425ca9a0e2 user_1 08/01/2025 06:35 \n",
"4 sess_5657e28b22ec user_211 2025-09-08T23:41:44Z \n",
"\n",
" end_time session_length_s region platform \\\n",
"0 2025-07-18 20:03:21-05:00 5481.0 us-west pc \n",
"1 2025-06-13 23:36:30+01:00 922.0 Us-east PlayStation \n",
"2 20/10/2025 02:49 451.0 sa-east-1 PlayStation \n",
"3 2025-08-01T08:32:45Z 7031.0 sa-east-1 PlayStation \n",
"4 2025-09-09 00:32:59+01:00 3075.0 US-EAST switch \n",
"\n",
" gpu_model avg_fps ping_ms map_name crash_flag purchase_amount party_size \\\n",
"0 GTX1080 83.52 431.16 ocean Yes 0,00 2 \n",
"1 NaN 72.75 29.12 desert No 0.0 3 \n",
"2 NaN 69.20 40.47 Forest False 0.0 5 \n",
"3 NaN 33.29 92.40 Desert No 17.55 1 \n",
"4 NaN 69.96 12.63 Desert False 0.0 2 \n",
"\n",
" input_method build_version is_featured_event device_temp_c session_type \\\n",
"0 Touch NaN No 85.6 ranked \n",
"1 Touch NaN 0 62.0 casual \n",
"2 TOUCH 1.4 False 69.0 ranked \n",
"3 Controller 1.3.2 0 48.1 casual \n",
"4 controllr NaN 0 54.7 casual \n",
"\n",
" is_long_session \n",
"0 True \n",
"1 0 \n",
"2 False \n",
"3 True \n",
"4 Yes "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import sweetviz as sv\n",
"import dtale\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Load the raw dataset — do NOT clean anything yet\n",
"df_raw = pd.read_csv('dataset_A_indie_game_telemetry_v2.csv')\n",
"df = df_raw.copy()  # working copy; df_raw stays untouched for the before/after comparison\n",
"\n",
"print(f'Shape: {df.shape}')\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Column types (as loaded):\n",
"session_id object\n",
"user_id object\n",
"start_time object\n",
"end_time object\n",
"session_length_s float64\n",
"region object\n",
"platform object\n",
"gpu_model object\n",
"avg_fps float64\n",
"ping_ms float64\n",
"map_name object\n",
"crash_flag object\n",
"purchase_amount object\n",
"party_size int64\n",
"input_method object\n",
"build_version object\n",
"is_featured_event object\n",
"device_temp_c float64\n",
"session_type object\n",
"is_long_session object\n",
"dtype: object\n"
]
}
],
"source": [
"# Quick look at column types as pandas inferred them\n",
"print('Column types (as loaded):')\n",
"print(df.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **⚠️ Notice:** Several columns that should be boolean (`crash_flag`, `is_featured_event`, `is_long_session`) or numeric (`purchase_amount`) have been inferred as `object`. This is your first signal that something is wrong.\n",
"\n",
"---\n",
"\n",
"## Part 2 — Automated Overview with SweetViz\n",
"\n",
"Before diving into manual inspection, generate a SweetViz report. This gives you a visual overview of every column in one step — distributions, types, missing values, and anomalies.\n",
"\n",
"**Think of SweetViz as your \"triage\" tool.** It shows you *where* to look; D-Tale is where you look *closely*."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "bd10cd653e7a47f891552a79e946376c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" | | [ 0%] 00:00 -> (? left)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Report sweetviz_raw_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.\n",
"Report saved as sweetviz_raw_report.html — open it in your browser.\n"
]
}
],
"source": [
"# Generate the SweetViz report\n",
"# This may take 30-60 seconds\n",
"report = sv.analyze(df_raw)\n",
"report.show_html('sweetviz_raw_report.html')\n",
"\n",
"print('Report saved as sweetviz_raw_report.html — open it in your browser.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 📋 SweetViz Checklist — What to look for\n",
"\n",
"Open `sweetviz_raw_report.html` and answer the following questions. Write your findings below before moving on.\n",
"\n",
"| Question | Your finding |\n",
"|---|---|\n",
"| Which columns have missing values? Which has the most? | *...* |\n",
"| Which columns are listed as TEXT but should be numeric or boolean? | *...* |\n",
"| Are there any numeric columns with suspicious ranges (very high max or very low min)? | *...* |\n",
"| How many unique values does `region` have? Does that seem right? | *...* |\n",
"| What is unusual about `purchase_amount`? | *...* |\n",
"\n",
"*(Double-click to fill in your answers)*\n",
"\n",
"---\n",
"\n",
"## Part 3 — Deep Inspection and Cleaning with D-Tale\n",
"\n",
"D-Tale opens the dataset in an interactive grid. You can sort, filter, inspect, and clean without writing a single line of pandas — but D-Tale records every action as code you can export later.\n",
"\n",
"**Launch D-Tale now:**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2026-02-22 20:12:55,619 - INFO - D-Tale started at: http://127.0.0.1:40000\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Open D-Tale at: http://127.0.0.1:40000\n"
]
}
],
"source": [
"# Launch D-Tale with the raw dataset\n",
"# A link will appear — click it to open D-Tale in a new browser tab\n",
"d = dtale.show(df_raw, host='127.0.0.1', subprocess=False, open_browser=True)\n",
"print(\"Open D-Tale at:\", d._url)  # URL of this D-Tale instance\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c2e5293",
"metadata": {},
"outputs": [],
"source": [
"# DISCLAIMER: 'df' refers to the data you passed in when calling 'dtale.show'\n",
"\n",
"import pandas as pd\n",
"\n",
"# Normalise purchase_amount: replace the decimal comma, then coerce to\n",
"# numeric (unparseable values become NaN) before taking the absolute value\n",
"df['purchase_amount'] = (\n",
"    df['purchase_amount']\n",
"    .astype(str)\n",
"    .str.replace(',', '.', regex=False)\n",
")\n",
"df['purchase_amount'] = pd.to_numeric(df['purchase_amount'], errors='coerce').abs()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "8180fa05",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2026-02-22 20:18:35,563 - INFO - D-Tale started at: http://127.0.0.1:40000\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Open D-Tale at: http://127.0.0.1:40000\n"
]
}
],
"source": [
"# Relaunch D-Tale, now with the partially cleaned dataframe\n",
"# A link will appear — click it to open D-Tale in a new browser tab\n",
"d = dtale.show(df, host='127.0.0.1', subprocess=False, open_browser=True)\n",
"print(\"Open D-Tale at:\", d._url)  # URL of this D-Tale instance\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### 🔍 Issue 1 — Missing Values\n",
"\n",
"In D-Tale, go to **\"Describe\"** (top menu → Describe) to see the missing value counts per column.\n",
"\n",
"You will find:\n",
"\n",
"| Column | Missing | Note |\n",
"|---|---|---|\n",
"| `gpu_model` | ~67% | Most players are on console — GPU does not apply |\n",
"| `build_version` | ~17% | Not recorded in older sessions |\n",
"| `device_temp_c` | ~5% | Sensor not available on some devices |\n",
"| `session_length_s` | ~1% | Session ended abnormally (crash?) |\n",
"| `ping_ms`, `purchase_amount`, `end_time` | <2% | Sporadic gaps |\n",
"\n",
"**Cleaning decisions to make in D-Tale:**\n",
"\n",
"1. **`gpu_model`** — This column is missing for 67% of rows. Rather than imputing, consider: is this column useful for a console/mobile player? Go to **Column Actions → Delete Column** and remove it. Alternatively, you can keep it and decide during analysis.\n",
"\n",
"2. **`build_version`** — Missings are structurally valid (older sessions). Keep the column; do not impute.\n",
"\n",
"3. **Remaining columns** — Leave missing values in place for now. We will handle them during analysis when context is clearer.\n",
"\n",
"> 📝 **Record your decisions:** Which columns did you keep? Which did you drop? Why?\n",
"\n",
"*(Double-click to write your decisions here)*\n",
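"\n",
"The drop decision above is one line of pandas, and a per-column missingness check helps justify it. A minimal sketch on a toy frame (in the lab, run the same calls on `df`):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Toy frame standing in for the telemetry data (illustration only)\n",
"demo = pd.DataFrame({'gpu_model': [np.nan, np.nan, 'GTX1080'],\n",
"                     'build_version': ['1.4', np.nan, '1.3.2']})\n",
"\n",
"print(demo.isna().mean())   # fraction of missing values per column\n",
"demo = demo.drop(columns=['gpu_model'])  # drop the structurally missing column\n",
"```\n",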
"\n",
"---\n",
"\n",
"### 🔍 Issue 2 — Boolean Columns with Mixed Encodings\n",
"\n",
"Three columns represent true/false flags but were stored with at least **8 different representations**:\n",
"\n",
"- `crash_flag` → `Yes`, `No`, `True`, `False`, `true`, `false`, `1`, `0`\n",
"- `is_featured_event` → same 8 representations \n",
"- `is_long_session` → same 8 representations\n",
"\n",
"**In D-Tale, clean each column:**\n",
"\n",
"1. Click the column header → **Column Actions → Type Conversion**\n",
"2. Select **String to Bool** (D-Tale will map Yes/True/1 → True and No/False/0 → False)\n",
"3. Preview the result before applying\n",
"4. Repeat for all three columns\n",
"\n",
"> 💡 **Alternative via Find & Replace:** If Type Conversion does not cover all variants, use **Column Actions → Replace** to manually map unusual values (e.g., `Yes` → `True`) before converting.\n",
"\n",
"After cleaning, verify with Describe: each column should show only `True` and `False`.\n",
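"\n",
"The same conversion in pandas, roughly what D-Tale exports for \"String to Bool\", sketched on a toy column (in the lab, loop over the three flag columns of `df`):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy column with the mixed encodings listed above (illustration only)\n",
"demo = pd.DataFrame({'crash_flag': ['Yes', 'No', 'True', 'false', '1', '0']})\n",
"\n",
"# Lowercase first so one mapping covers all eight variants\n",
"bool_map = {'yes': True, 'true': True, '1': True,\n",
"            'no': False, 'false': False, '0': False}\n",
"demo['crash_flag'] = demo['crash_flag'].str.lower().map(bool_map)\n",
"print(demo['crash_flag'].tolist())  # [True, False, True, False, True, False]\n",
"```\n",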
"\n",
"---\n",
"\n",
"### 🔍 Issue 3 — Categorical Columns: Case and Whitespace Chaos\n",
"\n",
"Four categorical columns have serious inconsistency:\n",
"\n",
"- `region` — 32 variants of 5 values (e.g., `us-west`, `US-WEST`, `Us-west`, `' us-west '`)\n",
"- `map_name` — 36 variants of 6 values\n",
"- `platform` — 32 variants of 6 values\n",
"- `input_method` — 30 variants, including a typo: `controllr` instead of `controller`\n",
"\n",
"**Clean each column in D-Tale:**\n",
"\n",
"1. Click column header → **Column Actions → Type Conversion → String Cleaning**\n",
"2. Apply **Strip whitespace** and **Lowercase** (or **Uppercase** — be consistent)\n",
"3. For `input_method`, also apply a **Replace** to fix `controllr` → `controller` and `kb/m` → `kbm` (pick one variant and standardise)\n",
"\n",
"After cleaning, each column should have the expected number of unique values:\n",
"\n",
"| Column | Before | After |\n",
"|---|---|---|\n",
"| `region` | 32 | 5 |\n",
"| `map_name` | 36 | 6 |\n",
"| `platform` | 32 | 6 |\n",
"| `input_method` | 30 | 3 |\n",
"\n",
"> Use **Describe → value_counts** to verify before and after each fix.\n",
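"\n",
"In pandas, the String Cleaning and Replace actions come down to a couple of chained string methods. A sketch on a toy column (in the lab, apply the same pattern to each of the four columns of `df`):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy column with the case/whitespace variants described above\n",
"demo = pd.DataFrame({'input_method': [' Touch ', 'TOUCH', 'controllr', 'Controller']})\n",
"\n",
"# Strip whitespace and lowercase, then fix the known typo\n",
"demo['input_method'] = (demo['input_method'].str.strip().str.lower()\n",
"                        .replace({'controllr': 'controller'}))\n",
"print(demo['input_method'].nunique())  # 2\n",
"```\n",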
"\n",
"---\n",
"\n",
"### 🔍 Issue 4 — `purchase_amount`: Comma as Decimal Separator\n",
"\n",
"Some rows contain values like `\"0,00\"` and `\"1,80\"` where a comma was used instead of a decimal point. This prevents pandas from reading the column as numeric.\n",
"\n",
"**In D-Tale:**\n",
"\n",
"1. Filter the column to show only rows where the value contains a comma: **Column Actions → Filter → contains `,`**\n",
"2. Apply a **Replace**: replace `,` with `.` in the column\n",
"3. Then convert the column type: **Column Actions → Type Conversion → Float**\n",
"\n",
"> After conversion, verify the column dtype and check the range (min/max) with Describe.\n",
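"\n",
"The Replace plus Float conversion exports to pandas as roughly the following, sketched on a toy column (in the lab this runs on `df['purchase_amount']`):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy column mixing '.' and ',' decimal separators (illustration only)\n",
"demo = pd.DataFrame({'purchase_amount': ['0,00', '17.55', '1,80']})\n",
"\n",
"# Replace the decimal comma, then coerce to float; values that still\n",
"# fail to parse become NaN instead of raising an error\n",
"demo['purchase_amount'] = pd.to_numeric(\n",
"    demo['purchase_amount'].str.replace(',', '.', regex=False),\n",
"    errors='coerce')\n",
"print(demo['purchase_amount'].tolist())  # [0.0, 17.55, 1.8]\n",
"```\n",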
"\n",
"---\n",
"\n",
"### 🔍 Issue 5 — Outliers in Numeric Columns\n",
"\n",
"The SweetViz report and D-Tale Describe should have flagged suspicious ranges. Check these now:\n",
"\n",
"| Column | Suspicious value | Likely explanation |\n",
"|---|---|---|\n",
"| `avg_fps` | max = 10,000 | Sensor error or logging bug — physically impossible |\n",
"| `ping_ms` | max = 627 ms | High but plausible for satellite connections |\n",
"| `device_temp_c` | max = 100°C | Right at thermal throttling limit — possible but worth flagging |\n",
"\n",
"**In D-Tale, investigate `avg_fps`:**\n",
"\n",
"1. Use **Charts** (top menu) to plot a histogram of `avg_fps` — does it show an extreme outlier spike?\n",
"2. Use **Filter** to see how many rows have `avg_fps > 300` (a hard upper bound for realistic gameplay)\n",
"3. **Decide:** Should these rows be dropped, or should the value be set to `NaN` to mark it as invalid?\n",
"4. Apply your decision via **Column Actions → Replace** or a row-level **Filter + Delete**\n",
"\n",
"> 📝 **Record your decision and reasoning:** What threshold did you use? How many rows were affected?\n",
"\n",
"*(Double-click to write your answer here)*\n",
"\n",
"---\n",
"\n",
"### 🔍 Issue 6 — Mixed Datetime Formats\n",
"\n",
"The `start_time` and `end_time` columns contain timestamps in multiple formats:\n",
"\n",
"- ISO 8601 with timezone: `2025-07-18T18:32:00Z`\n",
"- ISO with offset: `2025-07-18 20:03:21-05:00` \n",
"- European: `20/10/2025 02:49`\n",
"- US: `08/01/2025 06:35`\n",
"\n",
"This is one of the harder issues to fix entirely within D-Tale's UI. For now:\n",
"\n",
"1. In D-Tale, go to **Column Actions → Type Conversion** on `start_time` and try **String to Date** with `infer_datetime_format=True`\n",
"2. Check how many values fail to parse (shown as NaT after conversion)\n",
"3. Make note of any unresolved formats — these will need to be handled in pandas with `pd.to_datetime(..., errors='coerce')` and may require a more careful cleaning pass\n",
"\n",
"> ⚠️ **Key insight:** Not all cleaning can be done point-and-click. Some issues require programmatic resolution. This is where the code D-Tale generates becomes valuable.\n",
"\n",
"---\n",
"\n",
"## Part 4 — Export the Cleaning Code from D-Tale\n",
"\n",
"Every cleaning action you performed in D-Tale was recorded as pandas code. Let's export and inspect it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retrieve the cleaned dataframe from D-Tale\n",
"# (This reflects all changes made in the D-Tale UI)\n",
"df_clean = d.data.copy()\n",
"\n",
"print(f'Cleaned shape: {df_clean.shape}')\n",
"print('\\nColumn types after cleaning:')\n",
"print(df_clean.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# D-Tale also lets you export the complete cleaning pipeline as Python code.\n",
"# In the D-Tale UI: click the code icon (</>) in the top-right corner → \"Export Code\"\n",
"# Paste the exported code below:\n",
"\n",
"# --- Paste D-Tale exported code here ---\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.1 — Manual refinement in pandas\n",
"\n",
"D-Tale generates the skeleton; pandas lets you add precision. Here is an example of cleaning the `start_time` column more robustly — something D-Tale's UI cannot fully handle."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Example: robust datetime parsing for mixed-format timestamps\n",
"# pd.to_datetime with utc=True normalises all timezone representations\n",
"df_clean['start_time'] = pd.to_datetime(df_clean['start_time'], utc=True, errors='coerce')\n",
"df_clean['end_time'] = pd.to_datetime(df_clean['end_time'], utc=True, errors='coerce')\n",
"\n",
"# Check how many rows could not be parsed\n",
"print('Unparsed start_time rows:', df_clean['start_time'].isna().sum())\n",
"print('Unparsed end_time rows: ', df_clean['end_time'].isna().sum())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Example: cap avg_fps outliers (adjust the threshold based on your decision above)\n",
"# Replace values > 300 with NaN to mark them as invalid rather than deleting rows\n",
"fps_threshold = 300\n",
"n_outliers = (df_clean['avg_fps'] > fps_threshold).sum()\n",
"df_clean.loc[df_clean['avg_fps'] > fps_threshold, 'avg_fps'] = float('nan')\n",
"\n",
"print(f'Rows with avg_fps > {fps_threshold} set to NaN: {n_outliers}')\n",
"print(f'avg_fps range after: {df_clean[\"avg_fps\"].min():.1f} to {df_clean[\"avg_fps\"].max():.1f}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Part 5 — Validation: Before vs After\n",
"\n",
"The real test of cleaning work is a comparison report. SweetViz can compare two dataframes side by side."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Generate a comparison report: raw vs cleaned\n",
"# This may take 60-90 seconds\n",
"compare_report = sv.compare([df_raw, 'Raw'], [df_clean, 'Cleaned'])\n",
"compare_report.show_html('sweetviz_comparison_report.html', open_browser=False)\n",
"\n",
"print('Comparison report saved — open sweetviz_comparison_report.html in your browser.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Open the comparison report and verify:\n",
"\n",
"- ✅ Boolean columns now show only `True` / `False`\n",
"- ✅ Categorical columns have the expected number of unique values\n",
"- ✅ `purchase_amount` is now numeric\n",
"- ✅ `avg_fps` no longer has a 10,000 outlier\n",
"- ✅ Missing value counts have changed as expected\n",
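"\n",
"The same checks can also be asserted programmatically. A sketch on a toy cleaned frame (in the lab, run the assertions against `df_clean` instead):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy cleaned frame (illustration only; substitute df_clean in the lab)\n",
"demo = pd.DataFrame({'crash_flag': [True, False],\n",
"                     'purchase_amount': [0.0, 17.55],\n",
"                     'avg_fps': [83.5, 144.0]})\n",
"\n",
"assert demo['crash_flag'].dtype == bool\n",
"assert pd.api.types.is_float_dtype(demo['purchase_amount'])\n",
"assert demo['avg_fps'].max() <= 300  # no sensor-error outliers left\n",
"print('All validation checks passed.')\n",
"```\n",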
"\n",
"---\n",
"\n",
"## Part 6 — Save the Cleaned Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean.to_csv('dataset_A_indie_game_telemetry_clean.csv', index=False)\n",
"print('Cleaned dataset saved.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## 🔑 Key Takeaways\n",
"\n",
"- **SweetViz** gives you a rapid automated overview — use it at the start and for before/after comparison. It does not clean; it informs.\n",
"- **D-Tale** lets you explore interactively, spot patterns, and clean through a UI. Every action is tracked as pandas code, so you are never locked into the GUI.\n",
"- **Pandas** remains essential for edge cases: complex datetime parsing, conditional logic, and anything requiring programmatic iteration.\n",
"- The three tools form a pipeline: **SweetViz → triage → D-Tale → interactive cleaning → pandas → refinement**.\n",
"\n",
"**Common issue categories you have now seen:**\n",
"\n",
"| Category | Example in this dataset |\n",
"|---|---|\n",
"| Boolean encoding inconsistency | 8 representations of True/False |\n",
"| Categorical case/whitespace chaos | 32 variants of 5 region names |\n",
"| Typos in categories | `controllr` vs `controller` |\n",
"| Wrong decimal separator | `1,80` instead of `1.80` |\n",
"| Structural missingness | `gpu_model` absent for console players |\n",
"| Sensor/logging outliers | `avg_fps = 10,000` |\n",
"| Mixed datetime formats | ISO 8601 mixed with European dates |\n",
"\n",
"→ In **Task 3**, you will apply these same skills independently to a new dataset — with less guidance."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}