Add deploy assets and update telemetry datasets
Prepare deployment package and clean telemetry/lab data: add deploy/ (README, datasaurus.csv, datasets and lab01 notebooks), add new lab02 dataset notebook variants (lab02_task1_datasets_v2/ v2b) and solutions for task3, and update multiple lab02 telemetry and git-activity notebooks. Clean and normalize claude/dataset_A_indie_game_telemetry_clean.csv (fill/standardize timestamps, session lengths and other fields) to improve consistency for downstream analysis.
@@ -4,10 +4,13 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Lab 02 · Task 1 — Exploratory Data Analysis with Pandas & Seaborn\n",
+"# Lab 01<br>Task 1: Exploratory Data Analysis with Pandas & Seaborn\n",
 "\n",
-"**Estimated time:** ~30 minutes \n",
-"**Dataset:** `datasaurus_dozen.csv`\n",
+"This task serves two purposes. It introduces you to some of the basic tools to start understanding datasets and shows you why descriptive statistics may not be enough to understand the nature of a dataset.\n",
+"\n",
+"Additionally, this simple first task also serves the purpose of getting you acquainted with Jupyter notebooks.\n",
+"\n",
+"**Dataset:** `datasaurus.csv`\n",
 "\n",
 "---\n",
 "\n",
@@ -23,9 +26,9 @@
 "\n",
 "### Context\n",
 "\n",
-"The **Datasaurus Dozen** is a collection of 13 small datasets deliberately constructed to share *identical* summary statistics while looking completely different when plotted. It was created by Matejka & Fitzmaurice (2017) to demonstrate a modern version of Anscombe's Quartet.\n",
+"The **Datasaurus Dozen** is a collection of 13 small datasets created by Matejka & Fitzmaurice (2017) to demonstrate a modern version of Anscombe's Quartet.\n",
 "\n",
-"This task will take you through the same journey a data analyst faces: you will start with raw numbers, run the usual summaries, and then discover — through visualisation — that numbers alone were hiding the story.\n",
+"This task will take you through the same journey a data analyst faces: you will start with raw numbers, run the usual summaries, and then discover, through visualisation, that numbers alone were hiding the story.\n",
 "\n",
 "---"
 ]
@@ -34,7 +37,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Part 1 — Load and Inspect the Data\n",
+"## Part 1: Load and Inspect the Data\n",
 "\n",
 "Start by importing the libraries you need and loading the dataset."
 ]
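The loading step this hunk introduces can be sketched as follows. The column names (`dataset`, `x`, `y`) match the groups used later in the notebook; the inline values here are hypothetical stand-ins so the sketch runs without the CSV file on disk.

```python
import io

import pandas as pd

# In the notebook this would simply be: df = pd.read_csv("datasaurus.csv").
# A tiny inline stand-in (made-up values) keeps this sketch self-contained.
csv_text = """dataset,x,y
dino,55.38,97.18
dino,51.54,96.03
star,58.21,91.88
star,58.03,92.21
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                 # rows x columns of the loaded frame
print(df["dataset"].unique())   # the named groups present in the file
```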
@@ -181,7 +184,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### 1.1 — Structure and data types\n",
+"### 1.1. Structure and data types\n",
 "\n",
 "Before computing anything, always understand what you are working with."
 ]
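The inspection this section calls for usually boils down to `info()` and `dtypes`; a minimal sketch on a hypothetical stand-in frame:

```python
import pandas as pd

# Hypothetical stand-in for the loaded datasaurus frame.
df = pd.DataFrame({
    "dataset": ["dino", "dino", "star"],
    "x": [55.4, 51.5, 58.2],
    "y": [97.2, 96.0, 91.9],
})

df.info()          # column names, non-null counts, dtypes, memory usage
print(df.dtypes)   # dataset is object (strings); x and y are float64
```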
@@ -255,7 +258,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### 1.2 — Overall summary statistics\n",
+"### 1.2. Overall summary statistics\n",
 "\n",
 "Use `describe()` to get a global numerical summary of `x` and `y`."
 ]
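A sketch of the `describe()` step on toy data (hypothetical values, not the real datasaurus numbers). Note that `describe()` pools all rows and ignores the `dataset` column, which is precisely why it can hide per-group structure:

```python
import pandas as pd

df = pd.DataFrame({
    "dataset": ["dino", "dino", "star", "star"],
    "x": [1.0, 3.0, 2.0, 2.0],
    "y": [2.0, 6.0, 4.0, 4.0],
})

# describe() summarises the numeric columns across ALL rows at once.
summary = df.describe()
print(summary)
print(summary.loc["mean", "x"])   # pooled mean of x
```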
@@ -363,7 +366,7 @@
 "source": [
 "---\n",
 "\n",
-"## Part 2 — Grouped Statistics: The Reveal\n",
+"## Part 2: Grouped Statistics\n",
 "\n",
 "The dataset column holds 13 different named groups. Let's compute summary statistics **per group** and see if the groups differ."
 ]
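The per-group summary this part describes can be sketched with `groupby` plus `agg` (toy data with two hypothetical groups standing in for the 13):

```python
import pandas as pd

df = pd.DataFrame({
    "dataset": ["a", "a", "b", "b"],
    "x": [1.0, 3.0, 10.0, 30.0],
    "y": [2.0, 6.0, 20.0, 60.0],
})

# One summary row per named group, instead of one pooled table.
grouped = df.groupby("dataset")[["x", "y"]].agg(["mean", "std"])
print(grouped)
```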
@@ -577,7 +580,7 @@
 "name": "stderr",
 "output_type": "stream",
 "text": [
-"C:\\Users\\sss\\AppData\\Local\\Temp\\ipykernel_95640\\2163207487.py:2: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.\n",
+"C:\\Users\\sss\\AppData\\Local\\Temp\\ipykernel_64804\\2163207487.py:2: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.\n",
 " correlation = df.groupby('dataset').apply(lambda g: g['x'].corr(g['y'])).round(2)\n"
 ]
 }
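The `FutureWarning` in this stderr output can be resolved by keeping the grouping column out of the applied sub-frames. A sketch on toy data with known correlations (`include_groups=False`, which the warning itself suggests, is the other route on pandas ≥ 2.2):

```python
import pandas as pd

df = pd.DataFrame({
    "dataset": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y": [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
})

# Selecting only x and y before apply() excludes the grouping column from
# each sub-frame, so recent pandas raises no FutureWarning here.
correlation = (
    df.groupby("dataset")[["x", "y"]]
      .apply(lambda g: g["x"].corr(g["y"]))
      .round(2)
)
print(correlation)   # a -> 1.0 (perfectly rising), b -> -1.0 (perfectly falling)
```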
@@ -593,10 +596,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"> **❓ Question:** Look at the table above. Are the 13 datasets statistically different from each other? \n",
+"> **Question:** Look at the table above. Are the 13 datasets statistically different from each other? \n",
 "> Write your answer in the cell below before moving on.\n",
 "\n",
 "*(Double-click this cell to write your answer here)*\n",
 "\n",
 "---"
 ]
@@ -605,11 +607,19 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Part 3 — Now Let's Actually Look at the Data\n",
+"<!-- ## Part 3: Now Let us Actually Look at the Data\n",
 "\n",
 "We will focus on three sub-datasets: **`dino`**, **`star`**, and **`bullseye`**. These three were chosen because they produce a dramatic visual contrast despite their identical statistics.\n",
 "\n",
-"Later, feel free to explore the remaining 10 groups."
+"Later, feel free to explore the remaining 10 groups. -->"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "d6f82ff1",
+"metadata": {},
+"source": [
+"## Part 3: Visualizing the Data"
+]
+},
 {
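The per-group visualization this new cell heads toward can be sketched with seaborn's `relplot`, which draws one panel per value of `col=`. Toy data stands in for the real dino/star/bullseye groups:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import pandas as pd
import seaborn as sns

# Hypothetical stand-in points for the three highlighted groups.
df = pd.DataFrame({
    "dataset": ["dino"] * 3 + ["star"] * 3 + ["bullseye"] * 3,
    "x": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5, 1.0, 3.0, 2.0],
    "y": [2.0, 1.0, 3.0, 3.0, 1.0, 2.0, 1.5, 1.5, 3.0],
})

# One scatter panel per group: similar summaries, very different shapes.
g = sns.relplot(data=df, x="x", y="y", col="dataset", kind="scatter")
print(len(g.axes.flat))   # one axes object per group
```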
@@ -739,10 +749,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"> **❓ Question:** What would a data analyst have concluded if they had only looked at the summary statistics table? \n",
+"> **Question:** What would a data analyst have concluded if they had only looked at the summary statistics table? \n",
 "> What does this tell you about when and why visualisation is necessary?\n",
 "\n",
 "*(Double-click to write your answer here)*\n",
 "\n",
 "---"
 ]
@@ -789,7 +798,7 @@
 "source": [
 "---\n",
 "\n",
-"## ✏️ Your Turn — Free Exploration\n",
+"## Your Turn — Free Exploration\n",
 "\n",
 "The cells below are yours. Here are some things to try:\n",
 "\n",
@@ -801,15 +810,6 @@
 "> **Key question to keep in mind:** For each plot type you try — does it reveal the structural difference between the datasets, or does it hide it?"
 ]
 },
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"# Your exploration here\n"
-]
-},
 {
 "cell_type": "code",
 "execution_count": null,