{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lab 02 · Task 2 — Guided EDA and Data Cleaning with SweetViz & D-Tale\n",
"\n",
"**Estimated time:** ~50 minutes  \n",
"**Dataset:** `dataset_A_indie_game_telemetry.csv`\n",
"\n",
"---\n",
"\n",
"### Objectives\n",
"\n",
"By the end of this task you will be able to:\n",
"- Generate an automated EDA report with **SweetViz** to get a rapid overview of a dataset\n",
"- Use **D-Tale** interactively to identify and fix data quality problems\n",
"- Recognise the most common categories of data issues: inconsistent encoding, mixed types, excessive missingness, and outliers\n",
"- Understand how interactive tools translate cleaning actions into pandas code\n",
"\n",
"---\n",
"\n",
"### Context\n",
"\n",
"You have been handed a telemetry dataset from a small indie game studio. It contains **10,000 session records** with information about players, platforms, performance metrics, and purchases. Before any visualisation or analysis can be built on top of this data, it must be understood and cleaned.\n",
"\n",
"This is real-world data quality: messy, inconsistent, and requiring decisions — not just mechanical fixes.\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1 — Setup and First Load"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape: (10000, 20)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>session_id</th>\n",
" <th>user_id</th>\n",
" <th>start_time</th>\n",
" <th>end_time</th>\n",
" <th>session_length_s</th>\n",
" <th>region</th>\n",
" <th>platform</th>\n",
" <th>gpu_model</th>\n",
" <th>avg_fps</th>\n",
" <th>ping_ms</th>\n",
" <th>map_name</th>\n",
" <th>crash_flag</th>\n",
" <th>purchase_amount</th>\n",
" <th>party_size</th>\n",
" <th>input_method</th>\n",
" <th>build_version</th>\n",
" <th>is_featured_event</th>\n",
" <th>device_temp_c</th>\n",
" <th>session_type</th>\n",
" <th>is_long_session</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>sess_c2fba8e7f37a</td>\n",
" <td>user_488</td>\n",
" <td>2025-07-18T18:32:00Z</td>\n",
" <td>2025-07-18 20:03:21-05:00</td>\n",
" <td>5481.0</td>\n",
" <td>us-west</td>\n",
" <td>pc</td>\n",
" <td>GTX1080</td>\n",
" <td>83.52</td>\n",
" <td>431.16</td>\n",
" <td>ocean</td>\n",
" <td>Yes</td>\n",
" <td>0,00</td>\n",
" <td>2</td>\n",
" <td>Touch</td>\n",
" <td>NaN</td>\n",
" <td>No</td>\n",
" <td>85.6</td>\n",
" <td>ranked</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>sess_33d286298cf9</td>\n",
" <td>user_1511</td>\n",
" <td>2025-06-13 23:21:08+00:00</td>\n",
" <td>2025-06-13 23:36:30+01:00</td>\n",
" <td>922.0</td>\n",
" <td>Us-east</td>\n",
" <td>PlayStation</td>\n",
" <td>NaN</td>\n",
" <td>72.75</td>\n",
" <td>29.12</td>\n",
" <td>desert</td>\n",
" <td>No</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" <td>Touch</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>62.0</td>\n",
" <td>casual</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>sess_be2bb4d8986a</td>\n",
" <td>user_830</td>\n",
" <td>2025-10-20 02:42:07-05:00</td>\n",
" <td>20/10/2025 02:49</td>\n",
" <td>451.0</td>\n",
" <td>sa-east-1</td>\n",
" <td>PlayStation</td>\n",
" <td>NaN</td>\n",
" <td>69.20</td>\n",
" <td>40.47</td>\n",
" <td>Forest</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" <td>5</td>\n",
" <td>TOUCH</td>\n",
" <td>1.4</td>\n",
" <td>False</td>\n",
" <td>69.0</td>\n",
" <td>ranked</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>sess_7f425ca9a0e2</td>\n",
" <td>user_1</td>\n",
" <td>08/01/2025 06:35</td>\n",
" <td>2025-08-01T08:32:45Z</td>\n",
" <td>7031.0</td>\n",
" <td>sa-east-1</td>\n",
" <td>PlayStation</td>\n",
" <td>NaN</td>\n",
" <td>33.29</td>\n",
" <td>92.40</td>\n",
" <td>Desert</td>\n",
" <td>No</td>\n",
" <td>17.55</td>\n",
" <td>1</td>\n",
" <td>Controller</td>\n",
" <td>1.3.2</td>\n",
" <td>0</td>\n",
" <td>48.1</td>\n",
" <td>casual</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>sess_5657e28b22ec</td>\n",
" <td>user_211</td>\n",
" <td>2025-09-08T23:41:44Z</td>\n",
" <td>2025-09-09 00:32:59+01:00</td>\n",
" <td>3075.0</td>\n",
" <td>US-EAST</td>\n",
" <td>switch</td>\n",
" <td>NaN</td>\n",
" <td>69.96</td>\n",
" <td>12.63</td>\n",
" <td>Desert</td>\n",
" <td>False</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>controllr</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>54.7</td>\n",
" <td>casual</td>\n",
" <td>Yes</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" session_id user_id start_time \\\n",
"0 sess_c2fba8e7f37a user_488 2025-07-18T18:32:00Z \n",
"1 sess_33d286298cf9 user_1511 2025-06-13 23:21:08+00:00 \n",
"2 sess_be2bb4d8986a user_830 2025-10-20 02:42:07-05:00 \n",
"3 sess_7f425ca9a0e2 user_1 08/01/2025 06:35 \n",
"4 sess_5657e28b22ec user_211 2025-09-08T23:41:44Z \n",
"\n",
" end_time session_length_s region platform \\\n",
"0 2025-07-18 20:03:21-05:00 5481.0 us-west pc \n",
"1 2025-06-13 23:36:30+01:00 922.0 Us-east PlayStation \n",
"2 20/10/2025 02:49 451.0 sa-east-1 PlayStation \n",
"3 2025-08-01T08:32:45Z 7031.0 sa-east-1 PlayStation \n",
"4 2025-09-09 00:32:59+01:00 3075.0 US-EAST switch \n",
"\n",
" gpu_model avg_fps ping_ms map_name crash_flag purchase_amount party_size \\\n",
"0 GTX1080 83.52 431.16 ocean Yes 0,00 2 \n",
"1 NaN 72.75 29.12 desert No 0.0 3 \n",
"2 NaN 69.20 40.47 Forest False 0.0 5 \n",
"3 NaN 33.29 92.40 Desert No 17.55 1 \n",
"4 NaN 69.96 12.63 Desert False 0.0 2 \n",
"\n",
" input_method build_version is_featured_event device_temp_c session_type \\\n",
"0 Touch NaN No 85.6 ranked \n",
"1 Touch NaN 0 62.0 casual \n",
"2 TOUCH 1.4 False 69.0 ranked \n",
"3 Controller 1.3.2 0 48.1 casual \n",
"4 controllr NaN 0 54.7 casual \n",
"\n",
" is_long_session \n",
"0 True \n",
"1 0 \n",
"2 False \n",
"3 True \n",
"4 Yes "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import sweetviz as sv\n",
"import dtale\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Load the raw dataset; do NOT clean anything yet\n",
"df_raw = pd.read_csv('dataset_A_indie_game_telemetry_v2.csv')\n",
"\n",
"# Keep df_raw untouched for the before/after comparison; later cells clean the working copy df\n",
"df = df_raw.copy()\n",
"\n",
"print(f'Shape: {df_raw.shape}')\n",
"df_raw.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Column types (as loaded):\n",
"session_id object\n",
"user_id object\n",
"start_time object\n",
"end_time object\n",
"session_length_s float64\n",
"region object\n",
"platform object\n",
"gpu_model object\n",
"avg_fps float64\n",
"ping_ms float64\n",
"map_name object\n",
"crash_flag object\n",
"purchase_amount object\n",
"party_size int64\n",
"input_method object\n",
"build_version object\n",
"is_featured_event object\n",
"device_temp_c float64\n",
"session_type object\n",
"is_long_session object\n",
"dtype: object\n"
]
}
],
"source": [
"# Quick look at column types as pandas inferred them\n",
"print('Column types (as loaded):')\n",
"print(df.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **⚠️ Notice:** Several columns that should be boolean (`crash_flag`, `is_featured_event`, `is_long_session`) or numeric (`purchase_amount`) have been inferred as `object`. This is your first signal that something is wrong.\n",
"\n",
"---\n",
"\n",
"## Part 2 — Automated Overview with SweetViz\n",
"\n",
"Before diving into manual inspection, generate a SweetViz report. This gives you a visual overview of every column in one step — distributions, types, missing values, and anomalies.\n",
"\n",
"**Think of SweetViz as your \"triage\" tool.** It shows you *where* to look; D-Tale is where you look *closely*."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "bd10cd653e7a47f891552a79e946376c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" | | [ 0%] 00:00 -> (? left)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Report sweetviz_raw_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.\n",
"Report saved as sweetviz_raw_report.html — open it in your browser.\n"
]
}
],
"source": [
"# Generate the SweetViz report\n",
"# This may take 30–60 seconds\n",
"report = sv.analyze(df_raw)\n",
"report.show_html('sweetviz_raw_report.html')\n",
"\n",
"print('Report saved as sweetviz_raw_report.html — open it in your browser.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 📋 SweetViz Checklist — What to look for\n",
"\n",
"Open `sweetviz_raw_report.html` and answer the following questions. Write your findings below before moving on.\n",
"\n",
"| Question | Your finding |\n",
"|---|---|\n",
"| Which columns have missing values? Which has the most? | *...* |\n",
"| Which columns are listed as TEXT but should be numeric or boolean? | *...* |\n",
"| Are there any numeric columns with suspicious ranges (very high max or very low min)? | *...* |\n",
"| How many unique values does `region` have? Does that seem right? | *...* |\n",
"| What is unusual about `purchase_amount`? | *...* |\n",
"\n",
"*(Double-click to fill in your answers)*\n",
"\n",
"---\n",
"\n",
"## Part 3 — Deep Inspection and Cleaning with D-Tale\n",
"\n",
"D-Tale opens the dataset in an interactive grid. You can sort, filter, inspect, and clean without writing a single line of pandas — but D-Tale records every action as code you can export later.\n",
"\n",
"**Launch D-Tale now:**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2026-02-22 20:12:55,619 - INFO - D-Tale started at: http://127.0.0.1:40000\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Open D-Tale at: http://127.0.0.1:40000\n"
]
}
],
"source": [
"# Launch D-Tale with the raw dataset\n",
"# A link will appear; click it to open D-Tale in a new browser tab\n",
"d = dtale.show(df_raw, host='127.0.0.1', subprocess=False, open_browser=True)\n",
"print(\"Open D-Tale at:\", d._url)  # URL of this D-Tale instance\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c2e5293",
"metadata": {},
"outputs": [],
"source": [
"# DISCLAIMER: 'df' refers to the data you passed in when calling 'dtale.show'\n",
"\n",
"import pandas as pd\n",
"\n",
"if isinstance(df, (pd.DatetimeIndex, pd.MultiIndex)):\n",
"\tdf = df.to_frame(index=False)\n",
"\n",
"# remove any pre-existing indices for ease of use in the D-Tale code, but this is not required\n",
"df = df.reset_index().drop('index', axis=1, errors='ignore')\n",
"df.columns = [str(c) for c in df.columns] # update columns to strings in case they are numbers\n",
"\n",
"# Swap the decimal comma for a point. regex=False treats ',' as a literal;\n",
"# the exported regex='False' was a non-empty string, which is truthy.\n",
"s = df['purchase_amount'].astype(str).str.replace(',', '.', regex=False)\n",
"\n",
"if s.str.startswith('0x').any():\n",
"\tstr_data = s.apply(float.fromhex)\n",
"else:\n",
"\tstr_data = pd.to_numeric(s, errors='coerce')\n",
"\n",
"# Assign the converted values back. The exported code built this Series but discarded it,\n",
"# so the column stayed as strings and .abs() raised TypeError: bad operand type for abs(): 'str'\n",
"df['purchase_amount'] = pd.Series(str_data, name='purchase_amount', index=s.index)\n",
"\n",
"df['purchase_amount'] = df['purchase_amount'].abs()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "8180fa05",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2026-02-22 20:18:35,563 - INFO - D-Tale started at: http://127.0.0.1:40000\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Open D-Tale at: http://127.0.0.1:40000\n"
]
}
],
"source": [
"# Relaunch D-Tale with the working dataframe df\n",
"# A link will appear; click it to open D-Tale in a new browser tab\n",
"d = dtale.show(df, host='127.0.0.1', subprocess=False, open_browser=True)\n",
"print(\"Open D-Tale at:\", d._url)  # URL of this D-Tale instance\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "745a5655",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" TCP 169.254.62.24:40000 0.0.0.0:0 LISTENING 11972\n",
"\n"
]
}
],
"source": [
"# Check whether something else is already listening on port 40000\n",
"# (netstat -ano | findstr is Windows-specific; on macOS/Linux try: lsof -i :40000)\n",
"import subprocess\n",
"result = subprocess.run('netstat -ano | findstr :40000', shell=True, capture_output=True, text=True)\n",
"print(result.stdout or \"Nothing on port 40000\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"### 🔍 Issue 1 — Missing Values\n",
"\n",
"In D-Tale, go to **\"Describe\"** (top menu → Describe) to see the missing value counts per column.\n",
"\n",
"You will find:\n",
"\n",
"| Column | Missing | Note |\n",
"|---|---|---|\n",
"| `gpu_model` | ~67% | Most players are on console — GPU does not apply |\n",
"| `build_version` | ~17% | Not recorded in older sessions |\n",
"| `device_temp_c` | ~5% | Sensor not available on some devices |\n",
"| `session_length_s` | ~1% | Session ended abnormally (crash?) |\n",
"| `ping_ms`, `purchase_amount`, `end_time` | <2% | Sporadic gaps |\n",
"\n",
"**Cleaning decisions to make in D-Tale:**\n",
"\n",
"1. **`gpu_model`** — This column is missing for 67% of rows. Rather than imputing, consider: is this column useful for a console/mobile player? Go to **Column Actions → Delete Column** and remove it. Alternatively, you can keep it and decide during analysis.\n",
"\n",
"2. **`build_version`** — Missings are structurally valid (older sessions). Keep the column; do not impute.\n",
"\n",
"3. **Remaining columns** — Leave missing values in place for now. We will handle them during analysis when context is clearer.\n",
"\n",
"> 📝 **Record your decisions:** Which columns did you keep? Which did you drop? Why?\n",
"\n",
"*(Double-click to write your decisions here)*\n",
"\n",
"---\n",
"\n",
"### 🔍 Issue 2 — Boolean Columns with Mixed Encodings\n",
"\n",
"Three columns represent true/false flags but were stored with at least **8 different representations**:\n",
"\n",
"- `crash_flag` → `Yes`, `No`, `True`, `False`, `true`, `false`, `1`, `0`\n",
"- `is_featured_event` → same 8 representations\n",
"- `is_long_session` → same 8 representations\n",
"\n",
"**In D-Tale, clean each column:**\n",
"\n",
"1. Click the column header → **Column Actions → Type Conversion**\n",
"2. Select **String to Bool** (D-Tale will map Yes/True/1 → True and No/False/0 → False)\n",
"3. Preview the result before applying\n",
"4. Repeat for all three columns\n",
"\n",
"> 💡 **Alternative via Find & Replace:** If Type Conversion does not cover all variants, use **Column Actions → Replace** to manually map unusual values (e.g., `Yes` → `True`) before converting.\n",
"\n",
"After cleaning, verify with Describe: each column should show only `True` and `False`.\n",
"\n",
"---\n",
"\n",
"### 🔍 Issue 3 — Categorical Columns: Case and Whitespace Chaos\n",
"\n",
"Four categorical columns have serious inconsistency:\n",
"\n",
"- `region` — 32 variants of 5 values (e.g., `us-west`, `US-WEST`, `Us-west`, `' us-west '`)\n",
"- `map_name` — 36 variants of 6 values\n",
"- `platform` — 32 variants of 6 values\n",
"- `input_method` — 30 variants, including a typo: `controllr` instead of `controller`\n",
"\n",
"**Clean each column in D-Tale:**\n",
"\n",
"1. Click column header → **Column Actions → Type Conversion → String Cleaning**\n",
"2. Apply **Strip whitespace** and **Lowercase** (or **Uppercase** — be consistent)\n",
"3. For `input_method`, also apply a **Replace** to fix `controllr` → `controller` and `kb/m` → `kbm` (pick one variant and standardise)\n",
"\n",
"After cleaning, each column should have the expected number of unique values:\n",
"\n",
"| Column | Before | After |\n",
"|---|---|---|\n",
"| `region` | 32 | 5 |\n",
"| `map_name` | 36 | 6 |\n",
"| `platform` | 32 | 6 |\n",
"| `input_method` | 30 | 3 |\n",
"\n",
"> Use **Describe → value_counts** to verify before and after each fix.\n",
"\n",
"---\n",
"\n",
"### 🔍 Issue 4 — `purchase_amount`: Comma as Decimal Separator\n",
"\n",
"Some rows contain values like `\"0,00\"` and `\"1,80\"` where a comma was used instead of a decimal point. This prevents pandas from reading the column as numeric.\n",
"\n",
"**In D-Tale:**\n",
"\n",
"1. Filter the column to show only rows where the value contains a comma: **Column Actions → Filter → contains `,`**\n",
"2. Apply a **Replace**: replace `,` with `.` in the column\n",
"3. Then convert the column type: **Column Actions → Type Conversion → Float**\n",
"\n",
"> After conversion, verify the column dtype and check the range (min/max) with Describe.\n",
"\n",
"---\n",
"\n",
"### 🔍 Issue 5 — Outliers in Numeric Columns\n",
"\n",
"The SweetViz report and D-Tale Describe should have flagged suspicious ranges. Check these now:\n",
"\n",
"| Column | Suspicious value | Likely explanation |\n",
"|---|---|---|\n",
"| `avg_fps` | max = 10,000 | Sensor error or logging bug — physically impossible |\n",
"| `ping_ms` | max = 627 ms | High but plausible for satellite connections |\n",
"| `device_temp_c` | max = 100°C | Right at thermal throttling limit — possible but worth flagging |\n",
"\n",
"**In D-Tale, investigate `avg_fps`:**\n",
"\n",
"1. Use **Charts** (top menu) to plot a histogram of `avg_fps` — does it show an extreme outlier spike?\n",
"2. Use **Filter** to see how many rows have `avg_fps > 300` (a hard upper bound for realistic gameplay)\n",
"3. **Decide:** Should these rows be dropped, or should the value be set to `NaN` to mark it as invalid?\n",
"4. Apply your decision via **Column Actions → Replace** or a row-level **Filter + Delete**\n",
"\n",
"> 📝 **Record your decision and reasoning:** What threshold did you use? How many rows were affected?\n",
"\n",
"*(Double-click to write your answer here)*\n",
"\n",
"---\n",
"\n",
"### 🔍 Issue 6 — Mixed Datetime Formats\n",
"\n",
"The `start_time` and `end_time` columns contain timestamps in multiple formats:\n",
"\n",
"- ISO 8601 with timezone: `2025-07-18T18:32:00Z`\n",
"- ISO with offset: `2025-07-18 20:03:21-05:00`\n",
"- European (day first): `20/10/2025 02:49`\n",
"- US (month first): `08/01/2025 06:35`\n",
"\n",
"This is one of the harder issues to fix entirely within D-Tale's UI. For now:\n",
"\n",
"1. In D-Tale, go to **Column Actions → Type Conversion** on `start_time` and try **String to Date** with `infer_datetime_format=True`\n",
"2. Check how many values fail to parse (shown as NaT after conversion)\n",
"3. Make note of any unresolved formats — these will need to be handled in pandas with `pd.to_datetime(..., errors='coerce')` and may require a more careful cleaning pass\n",
"\n",
"> ⚠️ **Key insight:** Not all cleaning can be done point-and-click. Some issues require programmatic resolution. This is where the code D-Tale generates becomes valuable.\n",
"\n",
"---\n",
"\n",
"## Part 4 — Export the Cleaning Code from D-Tale\n",
"\n",
"Every cleaning action you performed in D-Tale was recorded as pandas code. Let's export and inspect it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retrieve the cleaned dataframe from D-Tale\n",
"# (This reflects all changes made in the D-Tale UI)\n",
"df_clean = d.data.copy()\n",
"\n",
"print(f'Cleaned shape: {df_clean.shape}')\n",
"print('\\nColumn types after cleaning:')\n",
"print(df_clean.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# D-Tale also lets you export the complete cleaning pipeline as Python code.\n",
"# In the D-Tale UI: click the code icon (</>) in the top-right corner → \"Export Code\"\n",
"# Paste the exported code below:\n",
"\n",
"# --- Paste D-Tale exported code here ---\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.1 — Manual refinement in pandas\n",
"\n",
"D-Tale generates the skeleton; pandas lets you add precision. Here is an example of cleaning the `start_time` column more robustly — something D-Tale's UI cannot fully handle."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Example: robust datetime parsing for mixed-format timestamps\n",
"# utc=True converts every timezone representation to UTC (values without an offset are treated as UTC)\n",
"# format='mixed' (pandas >= 2.0) parses each value on its own; without it, pandas 2.x infers one\n",
"# format from the first value and errors='coerce' turns every non-matching row into NaT\n",
"# Ambiguous dates such as 08/01/2025 are read month-first unless dayfirst=True is passed\n",
"df_clean['start_time'] = pd.to_datetime(df_clean['start_time'], utc=True, format='mixed', errors='coerce')\n",
"df_clean['end_time'] = pd.to_datetime(df_clean['end_time'], utc=True, format='mixed', errors='coerce')\n",
"\n",
"# Check how many rows could not be parsed\n",
"print('Unparsed start_time rows:', df_clean['start_time'].isna().sum())\n",
"print('Unparsed end_time rows: ', df_clean['end_time'].isna().sum())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Example: mask avg_fps outliers (adjust the threshold based on your decision above)\n",
"# Replace values > 300 with NaN to mark them as invalid rather than deleting rows\n",
"fps_threshold = 300\n",
"n_outliers = (df_clean['avg_fps'] > fps_threshold).sum()\n",
"df_clean.loc[df_clean['avg_fps'] > fps_threshold, 'avg_fps'] = float('nan')\n",
"\n",
"print(f'Rows with avg_fps > {fps_threshold} set to NaN: {n_outliers}')\n",
"print(f'avg_fps range after: {df_clean[\"avg_fps\"].min():.1f} – {df_clean[\"avg_fps\"].max():.1f}')"
]
},
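{
"cell_type": "markdown",
"metadata": {},
"source": [
"The boolean and categorical fixes from Issues 2 and 3 can also be written directly in pandas. The next cell is a minimal sketch on a small made-up sample, not the code D-Tale exports: the names `BOOL_MAP`, `to_bool` and `normalise_category`, the mappings, and the sample values are all assumptions chosen for illustration. Adapt the mappings to the variants that Describe shows for your columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical mapping (an assumption): extend it with any other variants you find\n",
"BOOL_MAP = {'yes': True, 'true': True, '1': True,\n",
"            'no': False, 'false': False, '0': False}\n",
"\n",
"def to_bool(series):\n",
"    # Normalise to lowercase text, then map; unmapped values (and NaN) become <NA>\n",
"    text = series.astype(str).str.strip().str.lower()\n",
"    return text.map(BOOL_MAP).astype('boolean')\n",
"\n",
"def normalise_category(series, fixes=None):\n",
"    # Strip whitespace and lowercase, then apply explicit typo fixes (old value -> new value)\n",
"    cleaned = series.str.strip().str.lower()\n",
"    return cleaned.replace(fixes) if fixes else cleaned\n",
"\n",
"# Made-up sample values in the style of the telemetry columns\n",
"sample = pd.DataFrame({\n",
"    'crash_flag': ['Yes', 'No', 'true', 'False', '1', '0'],\n",
"    'input_method': [' Touch', 'TOUCH', 'controllr', 'Controller ', 'kb/m', 'KBM'],\n",
"})\n",
"\n",
"sample['crash_flag'] = to_bool(sample['crash_flag'])\n",
"sample['input_method'] = normalise_category(\n",
"    sample['input_method'], fixes={'controllr': 'controller', 'kb/m': 'kbm'}\n",
")\n",
"\n",
"print(sample.dtypes)\n",
"print(sample['input_method'].value_counts())"
]
},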
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Part 5 — Validation: Before vs After\n",
"\n",
"The real test of cleaning work is a comparison report. SweetViz can compare two dataframes side by side."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Generate a comparison report: raw vs cleaned\n",
"# This may take 60–90 seconds\n",
"compare_report = sv.compare([df_raw, 'Raw'], [df_clean, 'Cleaned'])\n",
"compare_report.show_html('sweetviz_comparison_report.html', open_browser=False)\n",
"\n",
"print('Comparison report saved — open sweetviz_comparison_report.html in your browser.')"
]
},
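{
"cell_type": "markdown",
"metadata": {},
"source": [
"The visual comparison can be complemented with a quick programmatic summary. The next cell is a sketch of one possible approach, not a SweetViz or D-Tale feature: `summarise_columns` is a hypothetical helper, and it runs on a tiny made-up frame so that the cell works on its own. To check your own result, call it on `df_clean` with the columns you cleaned."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"def summarise_columns(df, columns):\n",
"    # One row per column: dtype, number of distinct non-missing values, % missing\n",
"    rows = []\n",
"    for col in columns:\n",
"        if col not in df.columns:\n",
"            continue  # skipped on purpose, e.g. gpu_model if you chose to drop it\n",
"        s = df[col]\n",
"        rows.append({\n",
"            'column': col,\n",
"            'dtype': str(s.dtype),\n",
"            'n_unique': s.nunique(),\n",
"            'missing_pct': round(100 * s.isna().mean(), 1),\n",
"        })\n",
"    summary = pd.DataFrame(rows, columns=['column', 'dtype', 'n_unique', 'missing_pct'])\n",
"    return summary.set_index('column')\n",
"\n",
"# Made-up demo frame (an assumption) standing in for df_clean\n",
"demo = pd.DataFrame({\n",
"    'crash_flag': pd.array([True, False, None, True], dtype='boolean'),\n",
"    'region': ['us-west', 'us-east', 'us-west', 'sa-east-1'],\n",
"    'avg_fps': [83.52, 72.75, float('nan'), 69.2],\n",
"})\n",
"\n",
"print(summarise_columns(demo, ['crash_flag', 'region', 'avg_fps', 'gpu_model']))"
]
},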
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Open the comparison report and verify:\n",
"\n",
"- ✅ Boolean columns now show only `True` / `False`\n",
"- ✅ Categorical columns have the expected number of unique values\n",
"- ✅ `purchase_amount` is now numeric\n",
"- ✅ `avg_fps` no longer has a 10,000 outlier\n",
"- ✅ Missing value counts have changed as expected\n",
"\n",
"---\n",
"\n",
"## Part 6 — Save the Cleaned Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_clean.to_csv('dataset_A_indie_game_telemetry_clean.csv', index=False)\n",
"print('Cleaned dataset saved.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## 🔑 Key Takeaways\n",
"\n",
"- **SweetViz** gives you a rapid automated overview — use it at the start and for before/after comparison. It does not clean; it informs.\n",
"- **D-Tale** lets you explore interactively, spot patterns, and clean through a UI. Every action is tracked as pandas code, so you are never locked into the GUI.\n",
"- **Pandas** remains essential for edge cases: complex datetime parsing, conditional logic, and anything requiring programmatic iteration.\n",
"- The three tools form a pipeline: **SweetViz → triage → D-Tale → interactive cleaning → pandas → refinement**.\n",
"\n",
"**Common issue categories you have now seen:**\n",
"\n",
"| Category | Example in this dataset |\n",
"|---|---|\n",
"| Boolean encoding inconsistency | 8 representations of True/False |\n",
"| Categorical case/whitespace chaos | 32 variants of 5 region names |\n",
"| Typos in categories | `controllr` vs `controller` |\n",
"| Wrong decimal separator | `1,80` instead of `1.80` |\n",
"| Structural missingness | `gpu_model` absent for console players |\n",
"| Sensor/logging outliers | `avg_fps = 10,000` |\n",
"| Mixed datetime formats | ISO 8601 mixed with European dates |\n",
"\n",
"→ In **Task 3**, you will apply these same skills independently to a new dataset — with less guidance."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}