{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab 02 · Task 2 — Guided EDA and Data Cleaning with SweetViz & D-Tale\n",
"\n",
"**Estimated time:** ~50 minutes \n",
"**Dataset:** `dataset_A_indie_game_telemetry_v2.csv`\n",
"\n",
"---\n",
"\n",
"### Objectives\n",
"\n",
"By the end of this task you will be able to:\n",
"- Generate an automated EDA report with **SweetViz** to get a rapid overview of a dataset\n",
"- Use **D-Tale** interactively to identify and fix data quality problems\n",
"- Recognise the most common categories of data issues: inconsistent encoding, mixed types, excessive missingness, and outliers\n",
"- Understand how interactive tools translate cleaning actions into pandas code\n",
"\n",
"---\n",
"\n",
"### Context\n",
"\n",
"You have been handed a telemetry dataset from a small indie game studio. It contains **10,000 session records** with information about players, platforms, performance metrics, and purchases. Before any visualisation or analysis can be built on top of this data, it must be understood and cleaned.\n",
"\n",
"This is real-world data quality: messy, inconsistent, and requiring decisions — not just mechanical fixes.\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1 — Setup and First Load"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape: (10000, 20)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>session_id</th>\n",
" <th>user_id</th>\n",
" <th>start_time</th>\n",
" <th>end_time</th>\n",
" <th>session_length_s</th>\n",
" <th>region</th>\n",
" <th>platform</th>\n",
" <th>gpu_model</th>\n",
" <th>avg_fps</th>\n",
" <th>ping_ms</th>\n",
" <th>map_name</th>\n",
" <th>crash_flag</th>\n",
" <th>purchase_amount</th>\n",
" <th>party_size</th>\n",
" <th>input_method</th>\n",
" <th>build_version</th>\n",
" <th>is_featured_event</th>\n",
" <th>device_temp_c</th>\n",
" <th>session_type</th>\n",
" <th>is_long_session</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>sess_c2fba8e7f37a</td>\n",
" <td>user_488</td>\n",
" <td>2025-07-18T18:32:00Z</td>\n",
" <td>2025-07-18 20:03:21-05:00</td>\n",
" <td>5481.0</td>\n",
" <td>us-west</td>\n",
" <td>pc</td>\n",
" <td>GTX1080</td>\n",
" <td>83.52</td>\n",
" <td>431.16</td>\n",
" <td>ocean</td>\n",
" <td>Yes</td>\n",
" <td>0,00</td>\n",
" <td>2</td>\n",
" <td>Touch</td>\n",
" <td>NaN</td>\n",
" <td>No</td>\n",
" <td>85.6</td>\n",
" <td>ranked</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>sess_33d286298cf9</td>\n",
" <td>user_1511</td>\n",
" <td>2025-06-13 23:21:08+00:00</td>\n",
" <td>2025-06-13 23:36:30+01:00</td>\n",
" <td>922.0</td>\n",
" <td>Us-east</td>\n",
" <td>PlayStation</td>\n",
" <td>NaN</td>\n",
" <td>72.75</td>\n",
" <td>29.12</td>\n",
" <td>desert</td>\n",
" <td>No</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>Touch</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>62.0</td>\n",
" <td>casual</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>sess_be2bb4d8986a</td>\n",
" <td>user_830</td>\n",
" <td>2025-10-20 02:42:07-05:00</td>\n",
" <td>20/10/2025 02:49</td>\n",
" <td>451.0</td>\n",
" <td>sa-east-1</td>\n",
" <td>PlayStation</td>\n",
" <td>NaN</td>\n",
" <td>69.20</td>\n",
" <td>40.47</td>\n",
" <td>Forest</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>TOUCH</td>\n",
" <td>1.4</td>\n",
" <td>False</td>\n",
" <td>69.0</td>\n",
" <td>ranked</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>sess_7f425ca9a0e2</td>\n",
" <td>user_1</td>\n",
" <td>08/01/2025 06:35</td>\n",
" <td>2025-08-01T08:32:45Z</td>\n",
" <td>7031.0</td>\n",
" <td>sa-east-1</td>\n",
" <td>PlayStation</td>\n",
" <td>NaN</td>\n",
" <td>33.29</td>\n",
" <td>92.40</td>\n",
" <td>Desert</td>\n",
" <td>No</td>\n",
" <td>17.55</td>\n",
" <td>1</td>\n",
" <td>Controller</td>\n",
" <td>1.3.2</td>\n",
" <td>0</td>\n",
" <td>48.1</td>\n",
" <td>casual</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>sess_5657e28b22ec</td>\n",
" <td>user_211</td>\n",
" <td>2025-09-08T23:41:44Z</td>\n",
" <td>2025-09-09 00:32:59+01:00</td>\n",
" <td>3075.0</td>\n",
" <td>US-EAST</td>\n",
" <td>switch</td>\n",
" <td>NaN</td>\n",
" <td>69.96</td>\n",
" <td>12.63</td>\n",
" <td>Desert</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>controllr</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>54.7</td>\n",
" <td>casual</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" session_id user_id start_time \\\n",
"0 sess_c2fba8e7f37a user_488 2025-07-18T18:32:00Z \n",
"1 sess_33d286298cf9 user_1511 2025-06-13 23:21:08+00:00 \n",
"2 sess_be2bb4d8986a user_830 2025-10-20 02:42:07-05:00 \n",
"3 sess_7f425ca9a0e2 user_1 08/01/2025 06:35 \n",
"4 sess_5657e28b22ec user_211 2025-09-08T23:41:44Z \n",
"\n",
" end_time session_length_s region platform \\\n",
"0 2025-07-18 20:03:21-05:00 5481.0 us-west pc \n",
"1 2025-06-13 23:36:30+01:00 922.0 Us-east PlayStation \n",
"2 20/10/2025 02:49 451.0 sa-east-1 PlayStation \n",
"3 2025-08-01T08:32:45Z 7031.0 sa-east-1 PlayStation \n",
"4 2025-09-09 00:32:59+01:00 3075.0 US-EAST switch \n",
"\n",
" gpu_model avg_fps ping_ms map_name crash_flag purchase_amount party_size \\\n",
"0 GTX1080 83.52 431.16 ocean Yes 0,00 2 \n",
"1 NaN 72.75 29.12 desert No 0.0 3 \n",
"2 NaN 69.20 40.47 Forest False 0.0 5 \n",
"3 NaN 33.29 92.40 Desert No 17.55 1 \n",
"4 NaN 69.96 12.63 Desert False 0.0 2 \n",
"\n",
" input_method build_version is_featured_event device_temp_c session_type \\\n",
"0 Touch NaN No 85.6 ranked \n",
"1 Touch NaN 0 62.0 casual \n",
"2 TOUCH 1.4 False 69.0 ranked \n",
"3 Controller 1.3.2 0 48.1 casual \n",
"4 controllr NaN 0 54.7 casual \n",
"\n",
" is_long_session \n",
"0 True \n",
"1 0 \n",
"2 False \n",
"3 True \n",
"4 Yes "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import sweetviz as sv\n",
"import dtale\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Load the raw dataset — do NOT clean anything yet\n",
"df_raw = pd.read_csv('dataset_A_indie_game_telemetry_v2.csv')\n",
"df = df_raw.copy()  # working copy; df_raw stays untouched for the before/after comparison\n",
"\n",
"print(f'Shape: {df.shape}')\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Column types (as loaded):\n",
"session_id object\n",
"user_id object\n",
"start_time object\n",
"end_time object\n",
"session_length_s float64\n",
"region object\n",
"platform object\n",
"gpu_model object\n",
"avg_fps float64\n",
"ping_ms float64\n",
"map_name object\n",
"crash_flag object\n",
"purchase_amount object\n",
"party_size int64\n",
"input_method object\n",
"build_version object\n",
"is_featured_event object\n",
"device_temp_c float64\n",
"session_type object\n",
"is_long_session object\n",
"dtype: object\n"
]
}
],
"source": [
"# Quick look at column types as pandas inferred them\n",
"print('Column types (as loaded):')\n",
"print(df.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **⚠️ Notice:** Several columns that should be boolean (`crash_flag`, `is_featured_event`, `is_long_session`) or numeric (`purchase_amount`) have been inferred as `object`. This is your first signal that something is wrong.\n",
"\n",
"---\n",
"\n",
"## Part 2 — Automated Overview with SweetViz\n",
"\n",
"Before diving into manual inspection, generate a SweetViz report. This gives you a visual overview of every column in one step — distributions, types, missing values, and anomalies.\n",
"\n",
"**Think of SweetViz as your \"triage\" tool.** It shows you *where* to look; D-Tale is where you look *closely*."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "bd10cd653e7a47f891552a79e946376c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" | | [ 0%] 00:00 -> (? left)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Report sweetviz_raw_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.\n",
"Report saved as sweetviz_raw_report.html — open it in your browser.\n"
]
}
],
"source": [
"# Generate the SweetViz report\n",
"# This may take 30-60 seconds\n",
"report = sv.analyze(df_raw)\n",
"report.show_html('sweetviz_raw_report.html')\n",
"\n",
"print('Report saved as sweetviz_raw_report.html — open it in your browser.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 📋 SweetViz Checklist — What to look for\n",
"\n",
"Open `sweetviz_raw_report.html` and answer the following questions. Write your findings below before moving on.\n",
"\n",
"| Question | Your finding |\n",
"|---|---|\n",
"| Which columns have missing values? Which has the most? | *...* |\n",
"| Which columns are listed as TEXT but should be numeric or boolean? | *...* |\n",
"| Are there any numeric columns with suspicious ranges (very high max or very low min)? | *...* |\n",
"| How many unique values does `region` have? Does that seem right? | *...* |\n",
"| What is unusual about `purchase_amount`? | *...* |\n",
"\n",
"*(Double-click to fill in your answers)*\n",
"\n",
"---\n",
"\n",
"## Part 3 — Deep Inspection and Cleaning with D-Tale\n",
"\n",
"D-Tale opens the dataset in an interactive grid. You can sort, filter, inspect, and clean without writing a single line of pandas — but D-Tale records every action as code you can export later.\n",
"\n",
"**Launch D-Tale now:**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2026-02-22 20:12:55,619 - INFO - D-Tale started at: http://127.0.0.1:40000\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Open D-Tale at: http://127.0.0.1:40000\n"
]
}
],
"source": [
"# Launch D-Tale with the raw dataset\n",
"# A link will appear — click it to open D-Tale in a new browser tab\n",
"d = dtale.show(df_raw, host='127.0.0.1', subprocess=False, open_browser=True)\n",
"print(\"Open D-Tale at:\", d._url)  # URL of this D-Tale instance\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c2e5293",
"metadata": {},
"outputs": [],
"source": [
"# DISCLAIMER: 'df' refers to the data you passed in when calling 'dtale.show'\n",
"\n",
"import pandas as pd\n",
"\n",
"# Normalise purchase_amount: replace the decimal comma, then coerce to\n",
"# numeric (unparseable values become NaN) before taking the absolute value\n",
"df['purchase_amount'] = (\n",
"    df['purchase_amount']\n",
"    .astype(str)\n",
"    .str.replace(',', '.', regex=False)\n",
")\n",
"df['purchase_amount'] = pd.to_numeric(df['purchase_amount'], errors='coerce').abs()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "8180fa05",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2026-02-22 20:18:35,563 - INFO - D-Tale started at: http://127.0.0.1:40000\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Open D-Tale at: http://127.0.0.1:40000\n"
]
}
],
"source": [
"# Relaunch D-Tale, now with the partially cleaned dataframe\n",
"# A link will appear — click it to open D-Tale in a new browser tab\n",
"d = dtale.show(df, host='127.0.0.1', subprocess=False, open_browser=True)\n",
"print(\"Open D-Tale at:\", d._url)  # URL of this D-Tale instance\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### 🔍 Issue 1 — Missing Values\n",
"\n",
"In D-Tale, go to **\"Describe\"** (top menu → Describe) to see the missing value counts per column.\n",
"\n",
"You will find:\n",
"\n",
"| Column | Missing | Note |\n",
"|---|---|---|\n",
"| `gpu_model` | ~67% | Most players are on console — GPU does not apply |\n",
"| `build_version` | ~17% | Not recorded in older sessions |\n",
"| `device_temp_c` | ~5% | Sensor not available on some devices |\n",
"| `session_length_s` | ~1% | Session ended abnormally (crash?) |\n",
"| `ping_ms`, `purchase_amount`, `end_time` | <2% | Sporadic gaps |\n",
"\n",
"**Cleaning decisions to make in D-Tale:**\n",
"\n",
"1. **`gpu_model`** — This column is missing for 67% of rows. Rather than imputing, consider: is this column useful for a console/mobile player? Go to **Column Actions → Delete Column** and remove it. Alternatively, you can keep it and decide during analysis.\n",
"\n",
"2. **`build_version`** — Missings are structurally valid (older sessions). Keep the column; do not impute.\n",
"\n",
"3. **Remaining columns** — Leave missing values in place for now. We will handle them during analysis when context is clearer.\n",
"\n",
"> 📝 **Record your decisions:** Which columns did you keep? Which did you drop? Why?\n",
"\n",
"*(Double-click to write your decisions here)*\n",
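"\n",
"The drop decision above is one line of pandas, and a per-column missingness check helps justify it. A minimal sketch on a toy frame (in the lab, run the same calls on `df`):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Toy frame standing in for the telemetry data (illustration only)\n",
"demo = pd.DataFrame({'gpu_model': [np.nan, np.nan, 'GTX1080'],\n",
"                     'build_version': ['1.4', np.nan, '1.3.2']})\n",
"\n",
"print(demo.isna().mean())   # fraction of missing values per column\n",
"demo = demo.drop(columns=['gpu_model'])  # drop the structurally missing column\n",
"```\n",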
"\n",
"---\n",
"\n",
"### 🔍 Issue 2 — Boolean Columns with Mixed Encodings\n",
"\n",
"Three columns represent true/false flags but were stored with at least **8 different representations**:\n",
"\n",
"- `crash_flag` → `Yes`, `No`, `True`, `False`, `true`, `false`, `1`, `0`\n",
"- `is_featured_event` → same 8 representations \n",
"- `is_long_session` → same 8 representations\n",
"\n",
"**In D-Tale, clean each column:**\n",
"\n",
"1. Click the column header → **Column Actions → Type Conversion**\n",
"2. Select **String to Bool** (D-Tale will map Yes/True/1 → True and No/False/0 → False)\n",
"3. Preview the result before applying\n",
"4. Repeat for all three columns\n",
"\n",
"> 💡 **Alternative via Find & Replace:** If Type Conversion does not cover all variants, use **Column Actions → Replace** to manually map unusual values (e.g., `Yes` → `True`) before converting.\n",
"\n",
"After cleaning, verify with Describe: each column should show only `True` and `False`.\n",
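"\n",
"The same conversion in pandas, roughly what D-Tale exports for \"String to Bool\", sketched on a toy column (in the lab, loop over the three flag columns of `df`):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy column with the mixed encodings listed above (illustration only)\n",
"demo = pd.DataFrame({'crash_flag': ['Yes', 'No', 'True', 'false', '1', '0']})\n",
"\n",
"# Lowercase first so one mapping covers all eight variants\n",
"bool_map = {'yes': True, 'true': True, '1': True,\n",
"            'no': False, 'false': False, '0': False}\n",
"demo['crash_flag'] = demo['crash_flag'].str.lower().map(bool_map)\n",
"print(demo['crash_flag'].tolist())  # [True, False, True, False, True, False]\n",
"```\n",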
"\n",
"---\n",
"\n",
"### 🔍 Issue 3 — Categorical Columns: Case and Whitespace Chaos\n",
"\n",
"Four categorical columns have serious inconsistency:\n",
"\n",
"- `region` — 32 variants of 5 values (e.g., `us-west`, `US-WEST`, `Us-west`, `' us-west '`)\n",
"- `map_name` — 36 variants of 6 values\n",
"- `platform` — 32 variants of 6 values\n",
"- `input_method` — 30 variants, including a typo: `controllr` instead of `controller`\n",
"\n",
"**Clean each column in D-Tale:**\n",
"\n",
"1. Click column header → **Column Actions → Type Conversion → String Cleaning**\n",
"2. Apply **Strip whitespace** and **Lowercase** (or **Uppercase** — be consistent)\n",
"3. For `input_method`, also apply a **Replace** to fix `controllr` → `controller` and `kb/m` → `kbm` (pick one variant and standardise)\n",
"\n",
"After cleaning, each column should have the expected number of unique values:\n",
"\n",
"| Column | Before | After |\n",
"|---|---|---|\n",
"| `region` | 32 | 5 |\n",
"| `map_name` | 36 | 6 |\n",
"| `platform` | 32 | 6 |\n",
"| `input_method` | 30 | 3 |\n",
"\n",
"> Use **Describe → value_counts** to verify before and after each fix.\n",
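"\n",
"In pandas, the String Cleaning and Replace actions come down to a couple of chained string methods. A sketch on a toy column (in the lab, apply the same pattern to each of the four columns of `df`):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy column with the case/whitespace variants described above\n",
"demo = pd.DataFrame({'input_method': [' Touch ', 'TOUCH', 'controllr', 'Controller']})\n",
"\n",
"# Strip whitespace and lowercase, then fix the known typo\n",
"demo['input_method'] = (demo['input_method'].str.strip().str.lower()\n",
"                        .replace({'controllr': 'controller'}))\n",
"print(demo['input_method'].nunique())  # 2\n",
"```\n",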
"\n",
"---\n",
"\n",
"### 🔍 Issue 4 — `purchase_amount`: Comma as Decimal Separator\n",
"\n",
"Some rows contain values like `\"0,00\"` and `\"1,80\"` where a comma was used instead of a decimal point. This prevents pandas from reading the column as numeric.\n",
"\n",
"**In D-Tale:**\n",
"\n",
"1. Filter the column to show only rows where the value contains a comma: **Column Actions → Filter → contains `,`**\n",
"2. Apply a **Replace**: replace `,` with `.` in the column\n",
"3. Then convert the column type: **Column Actions → Type Conversion → Float**\n",
"\n",
"> After conversion, verify the column dtype and check the range (min/max) with Describe.\n",
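"\n",
"The Replace plus Float conversion exports to pandas as roughly the following, sketched on a toy column (in the lab this runs on `df['purchase_amount']`):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy column mixing '.' and ',' decimal separators (illustration only)\n",
"demo = pd.DataFrame({'purchase_amount': ['0,00', '17.55', '1,80']})\n",
"\n",
"# Replace the decimal comma, then coerce to float; values that still\n",
"# fail to parse become NaN instead of raising an error\n",
"demo['purchase_amount'] = pd.to_numeric(\n",
"    demo['purchase_amount'].str.replace(',', '.', regex=False),\n",
"    errors='coerce')\n",
"print(demo['purchase_amount'].tolist())  # [0.0, 17.55, 1.8]\n",
"```\n",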
"\n",
"---\n",
"\n",
"### 🔍 Issue 5 — Outliers in Numeric Columns\n",
"\n",
"The SweetViz report and D-Tale Describe should have flagged suspicious ranges. Check these now:\n",
"\n",
"| Column | Suspicious value | Likely explanation |\n",
"|---|---|---|\n",
"| `avg_fps` | max = 10,000 | Sensor error or logging bug — physically impossible |\n",
"| `ping_ms` | max = 627 ms | High but plausible for satellite connections |\n",
"| `device_temp_c` | max = 100°C | Right at thermal throttling limit — possible but worth flagging |\n",
"\n",
"**In D-Tale, investigate `avg_fps`:**\n",
"\n",
"1. Use **Charts** (top menu) to plot a histogram of `avg_fps` — does it show an extreme outlier spike?\n",
"2. Use **Filter** to see how many rows have `avg_fps > 300` (a hard upper bound for realistic gameplay)\n",
"3. **Decide:** Should these rows be dropped, or should the value be set to `NaN` to mark it as invalid?\n",
"4. Apply your decision via **Column Actions → Replace** or a row-level **Filter + Delete**\n",
"\n",
"> 📝 **Record your decision and reasoning:** What threshold did you use? How many rows were affected?\n",
"\n",
"*(Double-click to write your answer here)*\n",
"\n",
"---\n",
"\n",
"### 🔍 Issue 6 — Mixed Datetime Formats\n",
"\n",
"The `start_time` and `end_time` columns contain timestamps in multiple formats:\n",
"\n",
"- ISO 8601 with timezone: `2025-07-18T18:32:00Z`\n",
"- ISO with offset: `2025-07-18 20:03:21-05:00` \n",
"- European: `20/10/2025 02:49`\n",
"- US: `08/01/2025 06:35`\n",
"\n",
"This is one of the harder issues to fix entirely within D-Tale's UI. For now:\n",
"\n",
"1. In D-Tale, go to **Column Actions → Type Conversion** on `start_time` and try **String to Date** with `infer_datetime_format=True`\n",
"2. Check how many values fail to parse (shown as NaT after conversion)\n",
"3. Make note of any unresolved formats — these will need to be handled in pandas with `pd.to_datetime(..., errors='coerce')` and may require a more careful cleaning pass\n",
"\n",
"> ⚠️ **Key insight:** Not all cleaning can be done point-and-click. Some issues require programmatic resolution. This is where the code D-Tale generates becomes valuable.\n",
"\n",
"---\n",
"\n",
"## Part 4 — Export the Cleaning Code from D-Tale\n",
"\n",
"Every cleaning action you performed in D-Tale was recorded as pandas code. Let's export and inspect it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retrieve the cleaned dataframe from D-Tale\n",
"# (This reflects all changes made in the D-Tale UI)\n",
"df_clean = d.data.copy()\n",
"\n",
"print(f'Cleaned shape: {df_clean.shape}')\n",
"print('\\nColumn types after cleaning:')\n",
"print(df_clean.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# D-Tale also lets you export the complete cleaning pipeline as Python code.\n",
"# In the D-Tale UI: click the code icon (</>) in the top-right corner → \"Export Code\"\n",
"# Paste the exported code below:\n",
"\n",
"# --- Paste D-Tale exported code here ---\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.1 — Manual refinement in pandas\n",
"\n",
"D-Tale generates the skeleton; pandas lets you add precision. Here is an example of cleaning the `start_time` column more robustly — something D-Tale's UI cannot fully handle."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Example: robust datetime parsing for mixed-format timestamps\n",
"# pd.to_datetime with utc=True normalises all timezone representations\n",
"df_clean['start_time'] = pd.to_datetime(df_clean['start_time'], utc=True, errors='coerce')\n",
"df_clean['end_time'] = pd.to_datetime(df_clean['end_time'], utc=True, errors='coerce')\n",
"\n",
"# Check how many rows could not be parsed\n",
"print('Unparsed start_time rows:', df_clean['start_time'].isna().sum())\n",
"print('Unparsed end_time rows: ', df_clean['end_time'].isna().sum())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Example: cap avg_fps outliers (adjust the threshold based on your decision above)\n",
"# Replace values > 300 with NaN to mark them as invalid rather than deleting rows\n",
"fps_threshold = 300\n",
"n_outliers = (df_clean['avg_fps'] > fps_threshold).sum()\n",
"df_clean.loc[df_clean['avg_fps'] > fps_threshold, 'avg_fps'] = float('nan')\n",
"\n",
"print(f'Rows with avg_fps > {fps_threshold} set to NaN: {n_outliers}')\n",
"print(f'avg_fps range after: {df_clean[\"avg_fps\"].min():.1f} to {df_clean[\"avg_fps\"].max():.1f}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Part 5 — Validation: Before vs After\n",
"\n",
"The real test of cleaning work is a comparison report. SweetViz can compare two dataframes side by side."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Generate a comparison report: raw vs cleaned\n",
"# This may take 60-90 seconds\n",
"compare_report = sv.compare([df_raw, 'Raw'], [df_clean, 'Cleaned'])\n",
"compare_report.show_html('sweetviz_comparison_report.html', open_browser=False)\n",
"\n",
"print('Comparison report saved — open sweetviz_comparison_report.html in your browser.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Open the comparison report and verify:\n",
"\n",
"- ✅ Boolean columns now show only `True` / `False`\n",
"- ✅ Categorical columns have the expected number of unique values\n",
"- ✅ `purchase_amount` is now numeric\n",
"- ✅ `avg_fps` no longer has a 10,000 outlier\n",
"- ✅ Missing value counts have changed as expected\n",
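"\n",
"The same checks can also be asserted programmatically. A sketch on a toy cleaned frame (in the lab, run the assertions against `df_clean` instead):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy cleaned frame (illustration only; substitute df_clean in the lab)\n",
"demo = pd.DataFrame({'crash_flag': [True, False],\n",
"                     'purchase_amount': [0.0, 17.55],\n",
"                     'avg_fps': [83.5, 144.0]})\n",
"\n",
"assert demo['crash_flag'].dtype == bool\n",
"assert pd.api.types.is_float_dtype(demo['purchase_amount'])\n",
"assert demo['avg_fps'].max() <= 300  # no sensor-error outliers left\n",
"print('All validation checks passed.')\n",
"```\n",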
"\n",
"---\n",
"\n",
"## Part 6 — Save the Cleaned Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean.to_csv('dataset_A_indie_game_telemetry_clean.csv', index=False)\n",
"print('Cleaned dataset saved.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## 🔑 Key Takeaways\n",
"\n",
"- **SweetViz** gives you a rapid automated overview — use it at the start and for before/after comparison. It does not clean; it informs.\n",
"- **D-Tale** lets you explore interactively, spot patterns, and clean through a UI. Every action is tracked as pandas code, so you are never locked into the GUI.\n",
"- **Pandas** remains essential for edge cases: complex datetime parsing, conditional logic, and anything requiring programmatic iteration.\n",
"- The three tools form a pipeline: **SweetViz → triage → D-Tale → interactive cleaning → pandas → refinement**.\n",
"\n",
"**Common issue categories you have now seen:**\n",
"\n",
"| Category | Example in this dataset |\n",
"|---|---|\n",
"| Boolean encoding inconsistency | 8 representations of True/False |\n",
"| Categorical case/whitespace chaos | 32 variants of 5 region names |\n",
"| Typos in categories | `controllr` vs `controller` |\n",
"| Wrong decimal separator | `1,80` instead of `1.80` |\n",
"| Structural missingness | `gpu_model` absent for console players |\n",
"| Sensor/logging outliers | `avg_fps = 10,000` |\n",
"| Mixed datetime formats | ISO 8601 mixed with European dates |\n",
"\n",
"→ In **Task 3**, you will apply these same skills independently to a new dataset — with less guidance."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}