Module 04 · The Landscape

Data Fundamentals

Understand why data is the true fuel of AI — and how bad data creates biased, broken, or dangerous systems before you ever deploy a single model.

⏱ 45 min · 📊 3 Diagrams · 🧩 4 Exercises · ✅ 4-Question Quiz
📖
Lesson Content
Read & Understand

Every AI model is only as good as the data it was trained on. Data is the curriculum. The model learns everything it knows from what it was shown during training.

Imagine teaching a student using only textbooks from one country, one decade, and one demographic. They might pass every exam in that system — and completely fail when exposed to anything outside it. That's exactly what happens to AI trained on limited or skewed data.

This is what engineers call bias. Bias doesn't just mean prejudice — it means any systematic distortion in how the training data represents the world. If your facial recognition system was trained mostly on light-skinned faces, it will perform worse on darker skin tones. Not because of malice — because of data.

The "garbage in, garbage out" principle is brutal and unforgiving. A model trained on incorrect labels or unrepresentative samples will confidently produce wrong outputs — and you might not know until real harm is done.

Key Takeaways

Data is the curriculum — AI learns only from what it's shown
Biased training data creates biased models — automatically
"Garbage in, garbage out" — bad data produces bad outputs
Bias means any systematic distortion, not just prejudice
Data diversity and quality matter as much as model design
🏗
The Data Pipeline
From raw data to model output
🌐

Raw Data Collection

Gather text, images, audio, or structured data from real-world sources

🧹

Cleaning & Filtering

Remove duplicates, fix errors, handle missing values, filter noise

⚠️

Labeling (Human-in-the-Loop)

Humans annotate data — this is where human bias can be introduced

✂️

Train / Validation Split

Divide data: ~80% for training, ~20% held out to validate performance

⚙️

Model Training

The model learns from the training data and is evaluated on the held-out validation data

📤

Deployed Output

Whatever biases entered the pipeline now appear in real-world decisions
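The Train / Validation Split step in the pipeline above is usually a shuffled random cut. Here is a minimal Python sketch; the `train_val_split` helper and the fixed seed are illustrative, not from any particular library:

```python
import random

def train_val_split(examples, train_frac=0.8, seed=42):
    """Shuffle a dataset and cut it into training and validation sets."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(examples)   # copy so the caller's order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(1000))        # stand-in for 1,000 labeled examples
train, val = train_val_split(data)
print(len(train), len(val))     # 800 200
```

Shuffling before cutting matters: if the data is sorted (say, by date or by group), a straight 80/20 slice would give the model a training set that systematically differs from the validation set.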

⚠️
Types of Data Bias
How bias enters AI systems
Representation Bias
Training data underrepresents certain groups. A medical AI trained mostly on male patients may misdiagnose women.
Measurement Bias
The way data is collected differs across groups. Online-only surveys miss populations without internet access.
Historical Bias
Data reflects past discrimination. A hiring AI trained on past hires may learn to prefer the historically hired demographic.
Labeling Bias
Human annotators bring their own assumptions. If annotators label assertive women as "aggressive," the model learns that bias.
Aggregation Bias
Treating diverse groups as one. A one-size-fits-all model may work well for the majority but poorly for subgroups.
Feedback Loop Bias
Model outputs become future training data. A biased recommender amplifies its own distortions over time.
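Feedback loop bias can be made concrete with a toy simulation. This sketch assumes a hypothetical winner-take-all recommender (every number in it is illustrative): a 55-to-45 click skew hardens into near-total dominance once the model's own outputs decide what gets shown.

```python
def simulate_feedback_loop(rounds=10):
    """Toy winner-take-all recommender: each round it shows only the
    currently more-clicked item, so only that item can earn new clicks."""
    clicks = {"A": 55, "B": 45}          # slight initial skew toward A
    shares = []
    for _ in range(rounds):
        leader = max(clicks, key=clicks.get)
        clicks[leader] += 100            # all new clicks go to the item shown
        shares.append(clicks["A"] / sum(clicks.values()))
    return shares

shares = simulate_feedback_loop()
print(f"A's click share: {shares[0]:.2f} -> {shares[-1]:.2f}")
```

The initial 55/45 difference could itself be an accident of sampling, yet after a few rounds the model "confirms" it with data the model generated. Breaking the loop usually means injecting exploration (sometimes showing the non-leader) or keeping some training data independent of the model's outputs.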
📊
Data Quality vs. Quantity
What makes better training data
[Chart] Impact on model performance, by data attribute: Volume · Diversity · Accuracy / Labels · Freshness · Documentation
10,000 diverse, accurate, well-labeled examples often outperform 1,000,000 noisy, skewed ones.
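The quality-beats-quantity claim can be illustrated with a toy estimation experiment (the subgroups, sizes, and numbers below are all hypothetical): a million-example sample collected with representation bias misses the true population average, while a much smaller representative sample lands close to it.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

rng = random.Random(1)
# hypothetical population with two subgroups whose true values differ
group_a = [rng.gauss(170, 7) for _ in range(50_000)]
group_b = [rng.gauss(160, 7) for _ in range(50_000)]
population = group_a + group_b
true_mean = mean(population)                       # close to 165

# huge but skewed sample: 95% drawn from group A (representation bias)
skewed = rng.choices(group_a, k=950_000) + rng.choices(group_b, k=50_000)
# small but representative sample: drawn uniformly from everyone
balanced = rng.sample(population, 10_000)

print(abs(mean(skewed) - true_mean))    # large, systematic error
print(abs(mean(balanced) - true_mean))  # small sampling error
```

No amount of extra skewed data fixes the first estimate: its error is systematic, not statistical, so it does not shrink as the sample grows.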
Self-Check Quiz
Click an answer to check your understanding
Q1 of 4
What does "garbage in, garbage out" mean in AI?
A
AI generates random outputs if not given enough RAM
B
Poor quality training data leads to poor quality model outputs
C
AI systems produce offensive content by default
D
Old hardware produces worse AI models
✓ The quality of a model's outputs is directly constrained by the quality of its training data.
✗ 'Garbage in, garbage out' means bad training data = bad model outputs — quality is the binding constraint.
Q2 of 4
A hiring AI trained on 10 years of past hires learns to prefer male candidates. What type of bias is this?
A
Measurement bias
B
Feedback loop bias
C
Historical bias
D
Aggregation bias
✓ Historical bias occurs when training data reflects past discrimination, encoding those patterns into the model.
✗ This is historical bias — the model learned from data that reflected a discriminatory past and now reproduces that discrimination.
Q3 of 4
Which matters more for training data quality?
A
Having the most data possible, regardless of quality
B
Having accurate, diverse, well-labeled data even if smaller
C
Collecting data from the most popular sources
D
Using the newest data available
✓ Quality, diversity, and accuracy in training data typically matter more than raw volume.
✗ Quality beats quantity. 10,000 accurate, diverse examples routinely outperform 1,000,000 noisy, skewed ones.
Q4 of 4
At what stage of the data pipeline can bias first be introduced?
A
Only during model training
B
Only during human labeling
C
At any stage — from collection through deployment
D
Only after deployment
✓ Bias can enter at collection, cleaning, labeling, splitting, training, or through feedback loops post-deployment.
✗ Bias can enter at any stage — and often compounds as it moves through each step of the pipeline.
🧩
Exercises & Worksheets
Apply what you learned
1

Spot the Bias

Search for one real-world case where an AI system produced biased outputs (facial recognition errors, biased loan approvals, etc.). Identify: What type of bias? At what pipeline stage did it likely enter? What was the real-world impact?

🔍 Research
2

Design a Better Dataset

You're building a voice recognition AI. Your current dataset is 90% American English speakers aged 20–35. Identify 3 problems this will create. Then propose a better dataset plan: Who should be included? How would you collect it?

🧪 Design
3

The Labeling Test

Look at 5 images of people from a stock photo site. Write down the first adjective that comes to mind for each. Reflect: Would your labels be consistent across demographics? Could a model trained on your labels introduce bias?

🪞 Reflection
4

Trace the Feedback Loop

A crime prediction AI is trained on past arrest data. It predicts higher risk in certain neighborhoods. Police patrol those areas more. More arrests happen. That data retrains the model. Draw this loop and explain why it's problematic and how you might break it.

🎨 Visual + Analysis