Module 04 · The Landscape

Data Fundamentals

Understand why data is the true fuel of AI — and how bad data creates biased, broken, or dangerous systems before you ever deploy a single model.

⏱ 45 min · 📊 3 Diagrams · 🧩 4 Exercises · ✅ 4-Question Quiz
📖
Lesson Content
Read & Understand

Every AI model is only as good as the data it was trained on. Data is the curriculum. The model learns everything it knows from what it was shown during training.

Imagine teaching a student using only textbooks from one country, one decade, and one demographic. They might pass every exam in that system — and completely fail when exposed to anything outside it. That's exactly what happens to AI trained on limited or skewed data.

This is what engineers call bias. Bias doesn't just mean prejudice — it means any systematic distortion in how the training data represents the world. If your facial recognition system was trained mostly on light-skinned faces, it will perform worse on darker skin tones. Not because of malice — because of data.

The "garbage in, garbage out" principle is brutal and unforgiving. A model trained on incorrect labels or unrepresentative samples will confidently produce wrong outputs — and you might not know until real harm is done.

Key Takeaways

Data is the curriculum — AI learns only from what it's shown
Biased training data creates biased models — automatically
"Garbage in, garbage out" — bad data produces bad outputs
Bias means any systematic distortion, not just prejudice
Data diversity and quality matter as much as model design
🏗
The Data Pipeline
From raw data to model output
🌐

Raw Data Collection

Gather text, images, audio, or structured data from real-world sources

🧹

Cleaning & Filtering

Remove duplicates, fix errors, handle missing values, filter noise

⚠️

Labeling (Human-in-the-Loop)

Humans annotate data — this is where human bias can be introduced

✂️

Train / Validation Split

Divide data: ~80% for training, ~20% held out to validate performance

⚙️

Model Training

The model learns from the training data and is evaluated on the held-out validation data

📤

Deployed Output

Whatever biases entered the pipeline now appear in real-world decisions
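The Train / Validation Split step in the pipeline above is usually a shuffled random cut. Here is a minimal Python sketch; the `train_val_split` helper and the fixed seed are illustrative, not from any particular library:

```python
import random

def train_val_split(examples, train_frac=0.8, seed=42):
    """Shuffle a dataset and cut it into training and validation sets."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(examples)   # copy so the caller's order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(1000))        # stand-in for 1,000 labeled examples
train, val = train_val_split(data)
print(len(train), len(val))     # 800 200
```

Shuffling before cutting matters: if the data is sorted (say, by date or by group), a straight 80/20 slice would give the model a training set that systematically differs from the validation set.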

⚠️
Types of Data Bias
How bias enters AI systems
Representation Bias
Training data underrepresents certain groups. A medical AI trained mostly on male patients may misdiagnose women.
Measurement Bias
The way data is collected differs across groups. Online-only surveys miss populations without internet access.
Historical Bias
Data reflects past discrimination. A hiring AI trained on past hires may learn to prefer the historically hired demographic.
Labeling Bias
Human annotators bring their own assumptions. If annotators label assertive women as "aggressive," the model learns that bias.
Aggregation Bias
Treating diverse groups as one. A one-size-fits-all model may work well for the majority but poorly for subgroups.
Feedback Loop Bias
Model outputs become future training data. A biased recommender amplifies its own distortions over time.
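Feedback loop bias can be made concrete with a toy simulation. This sketch assumes a hypothetical winner-take-all recommender (every number in it is illustrative): a 55-to-45 click skew hardens into near-total dominance once the model's own outputs decide what gets shown.

```python
def simulate_feedback_loop(rounds=10):
    """Toy winner-take-all recommender: each round it shows only the
    currently more-clicked item, so only that item can earn new clicks."""
    clicks = {"A": 55, "B": 45}          # slight initial skew toward A
    shares = []
    for _ in range(rounds):
        leader = max(clicks, key=clicks.get)
        clicks[leader] += 100            # all new clicks go to the item shown
        shares.append(clicks["A"] / sum(clicks.values()))
    return shares

shares = simulate_feedback_loop()
print(f"A's click share: {shares[0]:.2f} -> {shares[-1]:.2f}")
```

The initial 55/45 difference could itself be an accident of sampling, yet after a few rounds the model "confirms" it with data the model generated. Breaking the loop usually means injecting exploration (sometimes showing the non-leader) or keeping some training data independent of the model's outputs.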
📊
Data Quality vs. Quantity
What makes better training data
[Chart] Impact on model performance, by data attribute: Volume · Diversity · Accuracy / Labels · Freshness · Documentation
10,000 diverse, accurate, well-labeled examples often outperform 1,000,000 noisy, skewed ones.
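The quality-beats-quantity claim can be illustrated with a toy estimation experiment (the subgroups, sizes, and numbers below are all hypothetical): a million-example sample collected with representation bias misses the true population average, while a much smaller representative sample lands close to it.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

rng = random.Random(1)
# hypothetical population with two subgroups whose true values differ
group_a = [rng.gauss(170, 7) for _ in range(50_000)]
group_b = [rng.gauss(160, 7) for _ in range(50_000)]
population = group_a + group_b
true_mean = mean(population)                       # close to 165

# huge but skewed sample: 95% drawn from group A (representation bias)
skewed = rng.choices(group_a, k=950_000) + rng.choices(group_b, k=50_000)
# small but representative sample: drawn uniformly from everyone
balanced = rng.sample(population, 10_000)

print(abs(mean(skewed) - true_mean))    # large, systematic error
print(abs(mean(balanced) - true_mean))  # small sampling error
```

No amount of extra skewed data fixes the first estimate: its error is systematic, not statistical, so it does not shrink as the sample grows.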
Self-Check Quiz
Click an answer to check your understanding
Q1 of 4
What does "garbage in, garbage out" mean in AI?
A
AI generates random outputs if not given enough RAM
B
Poor quality training data leads to poor quality model outputs
C
AI systems produce offensive content by default
D
Old hardware produces worse AI models
✓ The quality of a model's outputs is directly constrained by the quality of its training data.
✗ 'Garbage in, garbage out' means bad training data = bad model outputs — quality is the binding constraint.
Q2 of 4
A hiring AI trained on 10 years of past hires learns to prefer male candidates. What type of bias is this?
A
Measurement bias
B
Feedback loop bias
C
Historical bias
D
Aggregation bias
✓ Historical bias occurs when training data reflects past discrimination, encoding those patterns into the model.
✗ This is historical bias — the model learned from data that reflected a discriminatory past and now reproduces that discrimination.
Q3 of 4
Which matters more for training data quality?
A
Having the most data possible, regardless of quality
B
Having accurate, diverse, well-labeled data even if smaller
C
Collecting data from the most popular sources
D
Using the newest data available
✓ Quality, diversity, and accuracy in training data typically matter more than raw volume.
✗ Quality beats quantity. 10,000 accurate, diverse examples routinely outperform 1,000,000 noisy, skewed ones.
Q4 of 4
At what stage of the data pipeline can bias first be introduced?
A
Only during model training
B
Only during human labeling
C
At any stage — from collection through deployment
D
Only after deployment
✓ Bias can enter at collection, cleaning, labeling, splitting, training, or through feedback loops post-deployment.
✗ Bias can enter at any stage — and often compounds as it moves through each step of the pipeline.
🧩
Exercises & Worksheets
Apply what you learned
1

Spot the Bias

Search for one real-world case where an AI system produced biased outputs (facial recognition errors, biased loan approvals, etc.). Identify: What type of bias? At what pipeline stage did it likely enter? What was the real-world impact?

🔍 Research
2

Design a Better Dataset

You're building a voice recognition AI. Your current dataset is 90% American English speakers aged 20–35. Identify 3 problems this will create. Then propose a better dataset plan: Who should be included? How would you collect it?

🧪 Design
3

The Labeling Test

Look at 5 images of people from a stock photo site. Write down the first adjective that comes to mind for each. Reflect: Would your labels be consistent across demographics? Could a model trained on your labels introduce bias?

🪞 Reflection
4

Trace the Feedback Loop

A crime prediction AI is trained on past arrest data. It predicts higher risk in certain neighborhoods. Police patrol those areas more. More arrests happen. That data retrains the model. Draw this loop and explain why it's problematic and how you might break it.

🎨 Visual + Analysis