{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 02 · Task 2 — Guided EDA and Data Cleaning\n", "\n", "**Estimated time:** ~50 minutes \n", "**Dataset:** `dataset_A_indie_game_telemetry.csv`\n", "\n", "---\n", "\n", "### Objectives\n", "\n", "By the end of this task you will be able to:\n", "- Use **SweetViz** to rapidly profile a dataset and identify issues\n", "- Use **D-Tale** to navigate and inspect a dataframe interactively\n", "- Use **pandas** to fix the most common categories of data quality problems\n", "- Make and justify cleaning decisions rather than applying fixes mechanically\n", "\n", "### Tools and their roles in this task\n", "\n", "| Tool | Role |\n", "|---|---|\n", "| **SweetViz** | Automated profiling — generate a report, triage what needs fixing |\n", "| **D-Tale** | Interactive navigation — browse rows, inspect value counts, confirm fixes visually |\n", "| **pandas** | All actual cleaning — every transformation is explicit, reproducible code |\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1 — Setup and First Look" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import sweetviz as sv\n", "import dtale\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the raw dataset — do NOT clean anything yet\n", "df = pd.read_csv('dataset_A_indie_game_telemetry_v2.csv')\n", "\n", "print(f'Shape: {df.shape}')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Column names and types as pandas inferred them\n", "print(df.dtypes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **⚠️ Notice:** Several columns that should be boolean (`crash_flag`, `is_featured_event`, `is_long_session`) or\n", "> numeric (`purchase_amount`) have been inferred as `object`. This is your first signal that something is wrong.\n", "\n", "---\n", "\n", "## Part 2 — Automated Profiling with SweetViz\n", "\n", "SweetViz generates a visual report for the entire dataset in one call. Think of it as a **triage tool** — it shows you *where* to look; the actual investigation and fixing happens afterwards." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate the profiling report (~30–60 seconds)\n", "report = sv.analyze(df)\n", "report.show_html('sweetviz_raw_report.html', open_browser=True)\n", "print('Report saved. Open sweetviz_raw_report.html in your browser.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Open the report and answer the following before moving on.\n", "\n", "| Question | Your finding |\n", "|---|---|\n", "| Which columns have missing values? Which has the most? | *...* |\n", "| Which columns are shown as TEXT but should be boolean or numeric? | *...* |\n", "| Are there numeric columns with suspicious ranges? | *...* |\n", "| How many distinct values does `region` have? Does that seem right? | *...* |\n", "| What is unusual about `purchase_amount`? | *...* |\n", "\n", "*(Double-click to fill in your answers)*\n", "\n", "---\n", "\n", "## Part 3 — Navigate and Inspect with D-Tale\n", "\n", "Before writing any cleaning code, use D-Tale to browse the raw data and *see* the problems with your own eyes. You will not clean anything here — D-Tale is your inspection tool.\n", "\n", "**Launch D-Tale:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d = dtale.show(df, host='127.0.0.1', subprocess=True, open_browser=True)\n", "print('=' * 50)\n", "print('D-Tale is running.')\n", "print('Open this URL in your browser:', d._url)\n", "print('In VS Code: Ctrl+click the URL above.')\n", "print('=' * 50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspection checklist\n", "\n", "Use D-Tale to confirm each issue SweetViz flagged. For each column, click the column header → **Describe** to see value counts and distribution.\n", "\n", "| What to inspect | How to do it in D-Tale | What you should see |\n", "|---|---|---|\n", "| `crash_flag` unique values | Column header → Describe | 8 variants of True/False |\n", "| `region` unique values | Column header → Describe | ~32 variants of 5 region names |\n", "| `input_method` unique values | Column header → Describe | A typo: `controllr` |\n", "| `purchase_amount` raw values | Sort column ascending | Some values use comma: `1,80` |\n", "| `avg_fps` distribution | Column header → Describe | Max of 10,000 — clearly wrong |\n", "| Missing values overview | Top menu → Describe (all columns) | `gpu_model` dominates |\n", "\n", "> Once you have seen the problems in the raw data, come back to the notebook for cleaning.\n", "\n", "---\n", "\n", "## Part 4 — Clean with Pandas\n", "\n", "We will work through seven issue categories. Each section follows the same pattern:\n", "1. **Inspect** — confirm the problem in code\n", "2. **Fix** — apply the pandas transformation\n", "3. **Verify** — check the result\n", "\n", "We work on a copy of the original dataframe so the raw data is always available for comparison." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Always work on a copy — keep df as the unchanged original\n", "df_clean = df.copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "### 4.1 — Boolean columns: inconsistent encoding\n", "\n", "Three columns (`crash_flag`, `is_featured_event`, `is_long_session`) each have **8 different representations** of the same two values: `True`, `False`, `true`, `false`, `1`, `0`, `Yes`, `No`.\n", "\n", "The fix is to define an explicit mapping and apply it with `.map()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Inspect — confirm the problem\n", "print('crash_flag unique values:', sorted(df_clean['crash_flag'].dropna().unique()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define the mapping for replacements\n", "# Why did I place True:True and False: False? Ideas?\n", "\n", "bool_map = {\n", " 'True': True, 'true': True, '1': True, 'Yes': True, True: True,\n", " 'False': False, 'false': False, '0': False, 'No': False, False: False\n", "}\n", "\n", "df_clean['crash_flag'] = df_clean['crash_flag'].map(bool_map)\n", "\n", "print('crash_flag after mapping:')\n", "print(df_clean['crash_flag'].value_counts())\n", "print('Nulls:', df_clean['crash_flag'].isna().sum())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Apply the same mapping to the other two boolean columns\n", "# Follow the same pattern as above for is_featured_event and is_long_session\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "### 4.2 — Categorical columns: case and whitespace inconsistency\n", "\n", "Four columns have values that are logically identical but differ in case or surrounding whitespace:\n", "- `region` — 32 variants of 5 values (e.g. `us-west`, `US-WEST`, `Us-west`, `' us-west '`)\n", "- `map_name` — 36 variants of 6 values\n", "- `platform` — 32 variants of 6 values\n", "- `input_method` — 30 variants, including a **typo**: `controllr`\n", "\n", "The fix uses pandas string methods: `.str.strip()` removes surrounding whitespace, `.str.lower()` normalises case. They can be chained." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Inspect — how many unique values before cleaning?\n", "print('region unique before:', df_clean['region'].unique())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fix region: strip whitespace and convert to lowercase\n", "df_clean['region'] = df_clean['region'].str.strip().str.lower()\n", "\n", "# Verify\n", "print('region unique after:', df_clean['region'].unique())\n", "print(df_clean['region'].value_counts())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TO DO: \n", "# Apply the same strip + lower to map_name and platform\n", "# Follow the same pattern as above\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# input_method needs an extra step: fix the typo and standardise kb/m → kbm\n", "\n", "# Step 0: Inspect\n", "print('input_method unique before:', df_clean['input_method'].unique())\n", "\n", "# Step 1: strip and lowercase first\n", "df_clean['input_method'] = df_clean['input_method'].str.strip().str.lower()\n", "\n", "# Step 2: fix the two inconsistencies with replace()\n", "df_clean['input_method'] = df_clean['input_method'].replace({\n", " 'controllr': 'controller', \n", " 'kb/m': 'kbm' \n", "})\n", "\n", "# Verify — should now show exactly 3 unique values\n", "print('input_method unique after:', df_clean['input_method'].unique())\n", "print(df_clean['input_method'].value_counts())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "### 4.3 — `purchase_amount`: comma as decimal separator\n", "\n", "About 12% of rows use a comma instead of a decimal point (`1,80` instead of `1.80`). This prevented pandas from reading the column as numeric, so it was loaded as `object`.\n", "\n", "The fix: replace the comma in the string, then convert the column type." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Inspect — how many rows have a comma?\n", "comma_rows = df_clean['purchase_amount'].astype(str).str.contains(',', na=False)\n", "print(f'Rows with comma separator: {comma_rows.sum()}')\n", "print('Examples:', df_clean.loc[comma_rows, 'purchase_amount'].unique()[:6])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fix: replace comma with decimal point, then convert to float\n", "df_clean['purchase_amount'] = (\n", " df_clean['purchase_amount']\n", " .astype(str) # ensure we are working with strings\n", " .str.replace(',', '.', regex=False) # swap the separator\n", " .replace('nan', float('nan')) # restore actual NaN rows\n", " .astype(float) # convert to numeric\n", ")\n", "\n", "# Verify\n", "print('dtype:', df_clean['purchase_amount'].dtype)\n", "print(df_clean['purchase_amount'].describe().round(2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "### 4.4 — Missing values: decisions and strategy\n", "\n", "Not all missing values are the same. Before deciding what to do, you need to understand *why* the value is missing — the reason determines the correct action.\n", "\n", "| Column | Missing | Why | Decision |\n", "|---|---|---|---|\n", "| `gpu_model` | 66.7% | Console/mobile players have no GPU | Keep column — missingness is meaningful |\n", "| `build_version` | 16.5% | Not logged in older sessions | Keep as NaN — valid historical absence |\n", "| `device_temp_c` | 4.9% | Sensor not available on some devices | Keep as NaN |\n", "| `session_length_s` | 1.0% | Session ended abnormally | Drop missing rows now; fix negatives/outliers after datetime correction (section 4.6) |\n", "| `ping_ms`, `purchase_amount`, `end_time` | < 2% | Sporadic gaps | Keep as NaN |\n", "\n", "
\n", "\n", "> **⚠️ Context always matters.** There is no universal rule for missing values. The decisions above are reasonable for this dataset and analytical goal — but a different context might lead to different choices.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Inspect — missing value counts across all columns\n", "missing = df_clean.isnull().sum()\n", "missing_pct = (missing / len(df_clean) * 100).round(1)\n", "pd.DataFrame({'missing': missing, '%': missing_pct})[missing > 0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# session_length_s: drop rows where it is missing\n", "# Rationale: session duration is a core metric — a session with no recorded\n", "# duration is structurally incomplete and cannot be used for most analyses.\n", "# These 98 rows represent <1% of the dataset, so dropping is safe.\n", "\n", "rows_before = len(df_clean)\n", "df_clean = df_clean.dropna(subset=['session_length_s'])\n", "\n", "print(f'Rows dropped: {rows_before - len(df_clean)}')\n", "print(f'Rows remaining: {len(df_clean)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "### 4.5 — Outliers: `avg_fps`\n", "\n", "The `avg_fps` column has a maximum of 10,000 fps — physically impossible for a game running in real time. The 75th percentile is ~82 fps, confirming that 10,000 is a logging error, not an extreme but plausible value.\n", "\n", "**Decision:** set values above 300 fps to `NaN` rather than dropping the entire row. The rest of the data in those rows (crash flag, purchase amount, session type) is likely still valid — it would be wasteful to discard it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Inspect — how many rows are affected?\n", "threshold = 300\n", "outlier_mask = df_clean['avg_fps'] > threshold\n", "print(f'Rows with avg_fps > {threshold}: {outlier_mask.sum()}')\n", "print('\\navg_fps distribution (before fix):')\n", "print(df_clean['avg_fps'].describe().round(1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fix: set outlier values to NaN using .loc with a boolean mask\n", "df_clean.loc[outlier_mask, 'avg_fps'] = float('nan')\n", "\n", "# Verify — max should now be well below 300\n", "print('avg_fps distribution (after fix):')\n", "print(df_clean['avg_fps'].describe().round(1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "### 4.6 — Datetime columns: mixed formats\n", "\n", "The `start_time` and `end_time` columns contain timestamps in at least four different formats:\n", "\n", "```\n", "2025-07-18T18:32:00Z : ISO 8601 with UTC marker\n", "2025-07-18 20:03:21-05:00 : ISO 8601 with UTC offset\n", "20/10/2025 02:49 : European DD/MM/YYYY\n", "08/01/2025 06:35 : Ambiguous: US MM/DD or European DD/MM?\n", "```\n", "\n", "Mixed datetime formats are one of the most complex cleaning problems because some ambiguities cannot be resolved automatically -- `08/01/2025` could be August 1st or January 8th, and no algorithm can determine which without external context.\n", "\n", "> **Connection to `session_length_s`:** The negative values and extreme outliers we saw earlier in `session_length_s` are not independent errors -- they are a *consequence* of this datetime problem. When `start_time` and `end_time` were recorded in different formats and misinterpreted, the pre-computed duration came out wrong. After fixing the timestamps, we will recompute `session_length_s` from scratch and validate the result.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Inspect — what does start_time actually look like?\n", "print('Sample values from start_time:')\n", "print(df_clean['start_time'].dropna().sample(8, random_state=42).tolist())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fix: pd.to_datetime with utc=True normalises all timezone-aware formats to UTC.\n", "# errors='coerce' converts anything it cannot parse to NaT (Not a Time) instead of crashing.\n", "df_clean['start_time'] = pd.to_datetime(df_clean['start_time'], utc=True, errors='coerce')\n", "df_clean['end_time'] = pd.to_datetime(df_clean['end_time'], utc=True, errors='coerce')\n", "\n", "# Verify — check how many rows could not be parsed\n", "print('start_time dtype:', df_clean['start_time'].dtype)\n", "print('Unparsed start_time (NaT):', df_clean['start_time'].isna().sum())\n", "print('Unparsed end_time (NaT): ', df_clean['end_time'].isna().sum())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Recompute session_length_s from the corrected timestamps\n", "# Now that start_time and end_time are both timezone-aware UTC datetimes,\n", "# the subtraction is unambiguous. We convert the result to seconds.\n", "df_clean['session_length_s'] = (\n", " df_clean['end_time'] - df_clean['start_time']\n", ").dt.total_seconds()\n", "\n", "print('session_length_s after recomputation:')\n", "print(df_clean['session_length_s'].describe().round(1))\n", "print(f'\\nNegative values: {(df_clean[\"session_length_s\"] < 0).sum()}')\n", "print(f'> 8h (28800s): {(df_clean[\"session_length_s\"] > 28800).sum()}')\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Any remaining negative values are rows where timestamps were genuinely\n", "# ambiguous and could not be resolved -- the computed duration is meaningless.\n", "# Set them to NaN rather than dropping the row.\n", "\n", "neg_mask = df_clean['session_length_s'] < 0\n", "df_clean.loc[neg_mask, 'session_length_s'] = float('nan')\n", "print(f'Negative durations set to NaN: {neg_mask.sum()}')\n", "\n", "# Values above 8 hours (28800s) are suspicious for a game session.\n", "# Inspect them before deciding.\n", "\n", "long_mask = df_clean['session_length_s'] > 28800\n", "print(f'\\nSessions > 8h: {long_mask.sum()}')\n", "print(df_clean.loc[long_mask, ['session_length_s', 'start_time', 'end_time']].head(5).to_string())\n", "\n", "# Decision: sessions > 8h are almost certainly logging errors (game left running,\n", "# server not recording session end). Set to NaN.\n", "# As always — this threshold is a judgement call that depends on the game and context.\n", "df_clean.loc[long_mask, 'session_length_s'] = float('nan')\n", "print(f'\\nSessions > 8h set to NaN: {long_mask.sum()}')\n", "print('\\nFinal session_length_s distribution:')\n", "print(df_clean['session_length_s'].describe().round(1))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Note:** The number of NaT values above reflects rows where pandas could not parse the format unambiguously. These are not errors in the code — they are genuinely ambiguous records that require a domain decision to resolve (e.g., knowing that the data source always uses DD/MM/YYYY).\n", "\n", "---\n", "\n", "**📌 Optional — explore the unparsed rows**\n", "\n", "If you want to go further, the cells below help you examine which formats failed and attempt a two-pass parsing strategy. This is optional and not required to complete the lab.\n", "\n", "
\n", "Click to expand optional exploration" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# OPTIONAL: Identify the raw values that failed to parse\n", "# We use the index of df_clean to look up the original values in df,\n", "# rather than a boolean mask — the two dataframes have different lengths\n", "# after the dropna() in step 4.4, so their indices no longer align.\n", "unparsed_idx = df_clean.index[df_clean['start_time'].isna()]\n", "print(f'Rows with unparsed start_time: {len(unparsed_idx)}')\n", "print('\\nRaw values that could not be parsed:')\n", "print(df.loc[unparsed_idx, 'start_time'].dropna().unique()[:15])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2026-02-23 18:31:02,737 - INFO - Executing shutdown due to inactivity...\n", "2026-02-23 18:31:02,790 - INFO - Executing shutdown...\n", "2026-02-23 18:31:02,795 - INFO - Not running with the Werkzeug Server, exiting by searching gc for BaseWSGIServer\n" ] } ], "source": [ "# OPTIONAL: Two-pass strategy — try a second format for the rows that failed\n", "# If you determine the ambiguous rows use DD/MM/YYYY, try dayfirst=True on them only\n", "unparsed_idx = df_clean.index[df_clean['start_time'].isna()]\n", "df_clean.loc[unparsed_idx, 'start_time'] = pd.to_datetime(\n", " df.loc[unparsed_idx, 'start_time'],\n", " dayfirst=True, utc=True, errors='coerce'\n", ")\n", "print('NaT after second pass:', df_clean['start_time'].isna().sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "---\n", "\n", "## Part 5 — Verify with D-Tale\n", "\n", "Reload the cleaned dataframe into D-Tale and visually confirm the fixes. This is a quick sanity check — you are looking for anything that looks wrong before committing to the cleaned dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Shut down the previous D-Tale instance and reload with the clean data\n", "d.kill()\n", "d_clean = dtale.show(df_clean, host='127.0.0.1', subprocess=True, open_browser=True)\n", "print('Open cleaned data in D-Tale:', d_clean._url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In D-Tale, verify the following:\n", "\n", "| Column | What to check | Expected result |\n", "|---|---|---|\n", "| `crash_flag` | Describe → value counts | Only `True` and `False` |\n", "| `region` | Describe → value counts | Exactly 5 values, all lowercase |\n", "| `input_method` | Describe → value counts | Exactly 3 values, no `controllr` |\n", "| `purchase_amount` | Describe → dtype and range | float64, no commas |\n", "| `avg_fps` | Describe → max | Below 300 |\n", "| `session_length_s` | Describe → min and max | No negatives, no values > 28800 |\n", "| `start_time` | Describe → dtype | datetime64 |\n" ] }, { "cell_type": "code", "execution_count": null, "id": "c8f0e03a", "metadata": {}, "outputs": [], "source": [ "# Debug\n", "\n", "# Test comparison column by column\n", "# for col in df_clean.columns:\n", "# try:\n", "# sv.compare([df[[col]], 'Raw'], [df_clean[[col]].reset_index(drop=True), 'Cleaned'])\n", "# except Exception as e:\n", "# print(f\"FAIL: {col} — {e}\")\n", "# else:\n", "# print(f\"ok: {col}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "exclude = ['start_time', 'end_time'] # needed to exclude these two because we converted them to datetime and sweetviz is not able to compare it with the original data types\n", "\n", "compare = sv.compare(\n", " [df.drop(columns=exclude), 'Raw'],\n", " [df_clean.drop(columns=exclude).reset_index(drop=True), 'Cleaned']\n", ")\n", "compare.show_html('sweetviz_comparison_report.html', open_browser=True)\n", "print('Comparison report saved.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the comparison report, check that:\n", "- Boolean columns changed from TEXT → BOOL with only 2 distinct values\n", "- Categorical columns show dramatically reduced DISTINCT counts\n", "- `purchase_amount` changed from TEXT → NUMERIC\n", "- `avg_fps` maximum is no longer 10,000\n", "- `session_length_s` shows 0 missing\n", "\n", "---\n", "\n", "## Part 7 — Save the Cleaned Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_clean.to_csv('dataset_A_indie_game_telemetry_clean.csv', index=False)\n", "print(f'Saved: {len(df_clean)} rows, {len(df_clean.columns)} columns')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Key Takeaways\n", "\n", "**Three tools, three roles — they complement each other:**\n", "- **SweetViz** surfaces issues fast but cannot fix them: use it for triage and validation\n", "- **D-Tale** lets you see the data as a human would: use it to understand problems before and after fixing them\n", "- **pandas** is where all actual cleaning happens: explicit, reproducible, and version-controllable\n", "\n", "**Cleaning decisions are not mechanical:**\n", "- Dropping `session_length_s` nulls was justified here: it would not be in every context\n", "- Setting `avg_fps` outliers to NaN (not dropping rows) preserved valid data in other columns\n", "- `gpu_model` missingness is structurally meaningful: imputing it would destroy information\n", "\n", "**Common issue categories you have now fixed with pandas:**\n", "\n", "| Issue | pandas approach |\n", "|---|---|\n", "| Boolean encoding chaos | `.map(bool_map)` |\n", "| Case / whitespace inconsistency | `.str.strip().str.lower()` |\n", "| Typos in categories | `.replace({'controllr': 'controller'})` |\n", "| Wrong decimal separator | `.str.replace(',', '.')` + `.astype(float)` |\n", "| Structural missing values | `dropna(subset=[...])` with explicit rationale |\n", "| Outliers | Boolean mask + `.loc[mask, col] = NaN` |\n", "| Mixed datetime formats | `pd.to_datetime(utc=True, errors='coerce')` |\n", "\n", "→ In **Task 3**, you will apply these skills independently to a new dataset — with a checklist but without step-by-step guidance." ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }