{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 02 · Task 2 — Guided EDA and Data Cleaning with SweetViz & D-Tale\n", "\n", "**Estimated time:** ~50 minutes \n", "**Dataset:** `dataset_A_indie_game_telemetry.csv`\n", "\n", "---\n", "\n", "### Objectives\n", "\n", "By the end of this task you will be able to:\n", "- Generate an automated EDA report with **SweetViz** to get a rapid overview of a dataset\n", "- Use **D-Tale** interactively to identify and fix data quality problems\n", "- Recognise the most common categories of data issues: inconsistent encoding, mixed types, excessive missingness, and outliers\n", "- Understand how interactive tools translate cleaning actions into pandas code\n", "\n", "---\n", "\n", "### Context\n", "\n", "You have been handed a telemetry dataset from a small indie game studio. It contains **10,000 session records** with information about players, platforms, performance metrics, and purchases. Before any visualisation or analysis can be built on top of this data, it must be understood and cleaned.\n", "\n", "This is real-world data quality: messy, inconsistent, and requiring decisions — not just mechanical fixes.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1 — Setup and First Load" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape: (10000, 20)\n" ] }, { "data": { "text/html": [ "
|  | session_id | user_id | start_time | end_time | session_length_s | region | platform | gpu_model | avg_fps | ping_ms | map_name | crash_flag | purchase_amount | party_size | input_method | build_version | is_featured_event | device_temp_c | session_type | is_long_session |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sess_c2fba8e7f37a | user_488 | 2025-07-18T18:32:00Z | 2025-07-18 20:03:21-05:00 | 5481.0 | us-west | pc | GTX1080 | 83.52 | 431.16 | ocean | Yes | 0,00 | 2 | Touch | NaN | No | 85.6 | ranked | True |
| 1 | sess_33d286298cf9 | user_1511 | 2025-06-13 23:21:08+00:00 | 2025-06-13 23:36:30+01:00 | 922.0 | Us-east | PlayStation | NaN | 72.75 | 29.12 | desert | No | 0.0 | 3 | Touch | NaN | 0 | 62.0 | casual | 0 |
| 2 | sess_be2bb4d8986a | user_830 | 2025-10-20 02:42:07-05:00 | 20/10/2025 02:49 | 451.0 | sa-east-1 | PlayStation | NaN | 69.20 | 40.47 | Forest | False | 0.0 | 5 | TOUCH | 1.4 | False | 69.0 | ranked | False |
| 3 | sess_7f425ca9a0e2 | user_1 | 08/01/2025 06:35 | 2025-08-01T08:32:45Z | 7031.0 | sa-east-1 | PlayStation | NaN | 33.29 | 92.40 | Desert | No | 17.55 | 1 | Controller | 1.3.2 | 0 | 48.1 | casual | True |
| 4 | sess_5657e28b22ec | user_211 | 2025-09-08T23:41:44Z | 2025-09-09 00:32:59+01:00 | 3075.0 | US-EAST | switch | NaN | 69.96 | 12.63 | Desert | False | 0.0 | 2 | controllr | NaN | 0 | 54.7 | casual | Yes |