diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..87a15c2
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1 @@
+/.venv
diff --git a/README.md b/README.md
index 0c64999..cee9503 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,161 @@
 # VI_Lab_01_EDA
+
+
+
+
+
+
+
+pip install ipykernel
+
+***
+
+Below is a clean, ready‑to‑ship **README.md** you can drop directly into your ZIP bundle.
+It explains **exactly how students should prepare their environment in VS Code**, including:
+
+* Installing Python
+* Creating a **virtual environment**
+* Installing required packages
+* Setting up the **Jupyter kernel** to use that venv
+* Opening and running the notebooks in VS Code
+
+It uses **current and correct instructions** based on official VS Code documentation (Python + Jupyter extensions and venv usage) (installation workflow and environment activation practices align with Python & VS Code official practices, which are stable across versions).
+
+If you want, I can also generate a **requirements.txt**, **environment.yml**, or a **bootstrap script**.
+
+***
+
+# 📘 README — Preparing Your Environment for Jupyter in VS Code
+
+## (Virtual Environment + Kernel Setup)
+
+This guide explains exactly how to prepare your system to run the EDA lab notebooks in **VS Code** using a clean Python **virtual environment**.
+
+The steps work on **Windows, macOS, and Linux**.
+
+***
+
+# 1. Install the Required Tools
+
+### 1.1 Install Python (3.9+ recommended)
+
+Download from the official Python site (*python.org*) or using Microsoft Store.
+
+Make sure to check:
+
+* **Windows** → Add Python to PATH if installed from official site
+* **macOS/Linux** → Python is usually included, but upgrade if needed
+
+### 1.2 Install VS Code
+
+Install from the official VS Code site.
+
+### 1.3 Install VS Code Extensions
+
+Open VS Code → **Extensions Panel** → install:
+
+* **Python**
+* **Jupyter**
+
+These two extensions enable:
+
+* Notebook execution
+* Kernel selection
+* Virtual environment detection
+* Interactive cells
+
+***
+
+# 2. Create a Virtual Environment
+
+Choose a folder where you will store your lab materials.
+Open a terminal *inside that folder*:
+
+### **Windows (PowerShell)**
+
+```powershell
+python -m venv venv
+.\venv\Scripts\activate
+```
+
+### **macOS / Linux**
+
+```bash
+python3 -m venv venv
+source venv/bin/activate
+```
+
+You should now see `(venv)` at the start of your terminal prompt.
+
+***
+
+# 3. Install Required Python Packages
+
+Inside the active virtual environment, run:
+
+```bash
+pip install numpy pandas matplotlib sweetviz dtale jupyter
+```
+
+If you are using the Task 0 datasets, also install:
+
+```bash
+pip install seaborn
+```
+
+> 💡 **Tip:**
+> If you have a `requirements.txt` in the bundle, run:
+>
+> ```bash
+> pip install -r requirements.txt
+> ```
+
+***
+
+# 4. Register the Virtual Environment as a Jupyter Kernel
+
+VS Code can automatically detect your venv, but we ensure explicit registration:
+
+```bash
+python -m ipykernel install --user --name eda-env --display-name "EDA Lab Environment"
+```
+
+You will now see **EDA Lab Environment** as a selectable kernel inside VS Code notebooks.
+
+***
+
+# 5. ✅ Open the Lab in VS Code
+
+1. Launch **VS Code**
+2. Use **File → Open Folder** and choose the folder containing the lab files
+3. Open any `.ipynb` file (e.g., `EDA_Lab_Starter.ipynb`)
+4. At the top‑right corner of the notebook, click the **kernel selector**
+5. Choose:
+   **EDA Lab Environment (Python venv)**
+
+This ensures the notebook runs using the correct interpreter.
+
+***
+
+# 6. 🔍 (Optional) Verify Your Setup
+
+In a notebook cell, run:
+
+```python
+import sys
+sys.executable
+```
+
+It should show the Python path inside your `venv`, e.g.:
+
+* Windows: `…/venv/Scripts/python.exe`
+* macOS/Linux: `…/venv/bin/python`
+
+Then check that the packages are available:
+
+```python
+import pandas, sweetviz, dtale
+print("Environment OK")
+```
+
diff --git a/TASK0_Datasaurus_Starter.ipynb b/TASK0_Datasaurus_Starter.ipynb
new file mode 100644
index 0000000..fb495a9
--- /dev/null
+++ b/TASK0_Datasaurus_Starter.ipynb
@@ -0,0 +1,184 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Task 0 — Datasaurus Warm‑Up (Starter)\n",
+    "\n",
+    "**Goal:** Show why we must *always* visualize by comparing groups with nearly identical summary statistics but very different shapes when plotted.\n",
+    "\n",
+    "This starter uses `datasaurus_task0.csv` (long format) with four groups: `dino`, `star`, `circle`, `bullseye`.\n",
+    "\n",
+    "**What to do:**\n",
+    "1. Load the CSV (as strings first).\n",
+    "2. Compute basic stats per group (mean, std, correlation).\n",
+    "3. Generate a SweetViz report (optional but recommended).\n",
+    "4. Plot x vs y for each group (facet grid).\n",
+    "5. Write *2–4 sentences* reflecting on why summary stats were misleading.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 0) (Optional) Install packages in this environment\n",
+    "Uncomment if needed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 0,
+   "metadata": {
+    "tags": [
+     "setup"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "# !pip install -q numpy pandas seaborn matplotlib sweetviz dtale\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1) Load the dataset\n",
+    "Load as strings first (safer), then coerce numeric columns."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 0,
+   "metadata": {},
+   "outputs": [
+    {
+     "ename": "",
+     "evalue": "",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[1;31mRunning cells with '.venv (Python 3.11.9)' requires the ipykernel package.\n",
+      "\u001b[1;31mInstall 'ipykernel' into the Python environment. \n",
+      "\u001b[1;31mCommand: 'd:/Projects/43679_InteractiveVis/VI_Lab_01_EDA/.venv/Scripts/python.exe -m pip install ipykernel -U --force-reinstall'"
+     ]
+    }
+   ],
+   "source": [
+    "import pandas as pd\n",
+    "csv_path = 'datasaurus_task0.csv'\n",
+    "df_raw = pd.read_csv(csv_path, dtype=str)\n",
+    "df_raw.head()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Coerce numeric columns and quick sanity checks"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 0,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df = df_raw.copy()\n",
+    "for c in ['x','y']:\n",
+    "    df[c] = pd.to_numeric(df[c], errors='coerce')\n",
+    "df.info()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2) Basic summary stats by group (fill in)\n",
+    "Compute mean, std for x & y by `dataset`, and the correlation within each group."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 0,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# TODO: groupby summaries\n",
+    "# g = df.groupby('dataset')\n",
+    "# means = g[['x','y']].mean()\n",
+    "# stds = g[['x','y']].std()\n",
+    "# corr = g.apply(lambda d: d[['x','y']].corr().iloc[0,1])\n",
+    "# means, stds, corr\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3) (Optional) SweetViz profile\n",
+    "Generate a quick report to observe that top-level stats look very similar."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 0,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# import sweetviz as sv\n",
+    "# report = sv.analyze(df)\n",
+    "# report.show_html('task0_sweetviz_report.html')\n",
+    "# print('Wrote task0_sweetviz_report.html')\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4) Visualize — scatter by group (facet)\n",
+    "Create a facet grid with one subplot per dataset and compare shapes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 0,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import seaborn as sns\n",
+    "import matplotlib.pyplot as plt\n",
+    "sns.set_theme(style='white', context='notebook')\n",
+    "g = sns.FacetGrid(df, col='dataset', col_wrap=2, height=4, sharex=True, sharey=True)\n",
+    "g.map_dataframe(sns.scatterplot, x='x', y='y', s=20, edgecolor=None)\n",
+    "g.set_titles('{col_name}')\n",
+    "for ax in g.axes.flatten():\n",
+    "    ax.set_xlabel('x'); ax.set_ylabel('y')\n",
+    "plt.tight_layout()\n",
+    "plt.show()\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5) Reflection (write here)\n",
+    "**Prompt:** If the per-group mean/variance/correlation were similar, why do the plots look different?\n",
+    "- Which shapes do you observe?\n",
+    "- What does this imply for relying solely on `.describe()` or correlation before plotting?\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.11.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/VisInt_Lab_01_Task_0.ipynb b/VisInt_Lab_01_Task_0.ipynb
new file mode 100644
index 0000000..3584157
--- /dev/null
+++ b/VisInt_Lab_01_Task_0.ipynb
@@ -0,0 +1,36 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "64f0fe5d",
+   "metadata": {},
+   "source": [
+    "**43679 - Interactive Visualization**\n",
+    "**2025 - 2026**\n",
+    "*2nd semester*\n",
+    "\n",
+    "**Lab 01** - Task 0\n",
+    "Exploring the value of Visualization to go beyond descriptive statistics"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d9080704",
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}