Data Types — Know Your Ingredients - Python Machine Learning Lesson

Data types are like ingredients in cooking. You would not bake flour the same way you grill meat. Similarly, you cannot apply the same analysis technique to numbers, categories, and rankings.

The three types:

Type	Definition	Examples
Numerical	Values you can do math on	Age, speed, price, temperature
Categorical	Labels with no natural order (also called nominal)	Colour (red/blue), Yes/No, country
Ordinal	Labels WITH a natural order	Grades (A > B > C), size (S < M < L)

# Numerical: you CAN calculate an average
ages = [25, 30, 35, 40]
print(sum(ages) / len(ages))
# Output: 32.5

# Categorical: an average makes NO sense
colours = ["red", "blue", "red", "green"]
# What is the "average" of red and blue? Nothing.

# Ordinal: order matters, but gaps are not equal
grades = ["A", "B", "C", "D"]
# A > B > C > D, but the "distance" between A and B
# is not necessarily the same as between C and D

💡 Key Insight: The single most important question before any analysis is: "What type is this data?" Calculating the mean of categorical data (like averaging zip codes) gives you a number that means absolutely nothing.
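To make the zip code point concrete, here is a minimal sketch using only the standard library. The zip codes are made-up illustration values, not data from this lesson. The mean is computed without any error, which is exactly the danger: Python cannot tell that these numbers are labels.

```python
from collections import Counter
from statistics import mean

# Hypothetical zip codes: digits used as labels, not as quantities
zip_codes = [10001, 94105, 10001, 60601]

# The mean runs without error, but the result describes no real place
zip_mean = mean(zip_codes)
print(zip_mean)

# The mode answers a meaningful question: which zip code is most common?
zip_mode = Counter(zip_codes).most_common(1)[0][0]
print(zip_mode)
```

The second result is a zip code that actually occurs in the data. The first is just arithmetic on labels.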

Think of discrete data like stairs — you step from one whole number to the next. Continuous data is like a ramp — you can stop at any point, including 3.7 or 3.14159.

Discrete data: countable, whole numbers

# Number of cars passing by — you cannot have 3.5 cars
cars_per_hour = [12, 15, 8, 22, 17, 9]
print(sum(cars_per_hour))
# Output: 83
# These are always integers

Continuous data: measurable, any value

# Temperature can be ANY value on the scale
temperatures = [36.6, 37.2, 36.8, 38.1, 36.5]
print(round(sum(temperatures) / len(temperatures), 2))
# Output: 37.04
# Price can have decimals
prices = [9.99, 24.50, 3.75, 149.99]
print(min(prices))
# Output: 3.75

Quick reference:

Property	Discrete	Continuous
Values	Whole numbers only	Any value (decimals ok)
Measured by	Counting	Measuring
Examples	Students in class, dice rolls	Height, weight, time
Can be 3.5?	❌ No	✅ Yes
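The "Can be 3.5?" row can be turned into a rough check. The helper below, `looks_discrete`, is a hypothetical name introduced here for illustration. It is only a heuristic: continuous measurements that happen to be rounded to whole numbers would also pass, so the real answer still depends on how the data was collected.

```python
def looks_discrete(values):
    """Return True if every value is a whole number (a heuristic only)."""
    return all(float(v).is_integer() for v in values)


cars_per_hour = [12, 15, 8, 22, 17, 9]
temperatures = [36.6, 37.2, 36.8, 38.1, 36.5]

print(looks_discrete(cars_per_hour))   # counts: whole numbers only
print(looks_discrete(temperatures))    # measurements: decimals present
```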

💡 Key Insight: This distinction matters in ML because the type of value you want to predict determines the kind of task. Classification algorithms predict discrete labels (spam/not spam). Regression algorithms predict continuous values (price, temperature).
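One way to see the difference is a toy nearest-neighbour predictor, sketched below in plain Python. This is an illustration chosen for this note, not a method the lesson prescribes. The datasets and function names are hypothetical. The neighbour search is identical in both cases; only the final step changes with the target type: a majority vote for labels, a mean for continuous values.

```python
from collections import Counter


def k_nearest(train, x, k=3):
    """Return the targets of the k training points closest to x."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - x))
    return [target for _, target in ranked[:k]]


def predict_label(train, x, k=3):
    """Classification: majority vote among neighbours (discrete output)."""
    return Counter(k_nearest(train, x, k)).most_common(1)[0][0]


def predict_value(train, x, k=3):
    """Regression: mean of the neighbours' targets (continuous output)."""
    neighbours = k_nearest(train, x, k)
    return sum(neighbours) / len(neighbours)


# Hypothetical data: (number of links in an email, label)
emails = [(0, "not spam"), (1, "not spam"), (2, "not spam"),
          (8, "spam"), (9, "spam"), (12, "spam")]
# Hypothetical data: (floor area in square metres, price in thousands)
houses = [(50, 150.0), (60, 180.0), (80, 240.0), (100, 310.0)]

print(predict_label(emails, 10))   # a label from the training set
print(predict_value(houses, 65))   # a number that need not appear in the data
```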

Choosing the wrong technique for your data type is like using a thermometer to measure weight: the tool works fine, but it is measuring the wrong thing.

Matching data type → technique:

Data Type	Good Techniques	Techniques to Avoid
Numerical	Mean, std dev, regression	Mode (rarely useful for continuous values, sometimes ok for discrete)
Categorical	Mode, frequency count, chi-square	Mean, median (meaningless)
Ordinal	Median, mode, rank correlation	Mean (gaps may not be equal)

from collections import Counter

# CORRECT: Mean on numerical data
speeds = [99, 86, 87, 88, 111, 86, 103]
print(round(sum(speeds) / len(speeds), 2))
# Output: 94.29

# CORRECT: Mode on categorical data
colours = ["red", "blue", "red", "green", "red"]
most_common = Counter(colours).most_common(1)
print(most_common)
# Output: [('red', 3)]

# WRONG: Mean on ordinal data
# Grades A=4, B=3, C=2: average is 3.0 = "B"
# But the gaps between grades are NOT necessarily equal
# So "B" as an average is misleading
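For ordinal data, the median is the safer summary because it relies only on order, not on the size of the gaps. A minimal sketch follows, assuming a simple rank mapping for the grades. The mapping and the sample list are illustrative, not part of the lesson.

```python
from statistics import median_low

# Assumed ranking for illustration: a higher number means a better grade
grade_rank = {"D": 1, "C": 2, "B": 3, "A": 4}
rank_grade = {rank: grade for grade, rank in grade_rank.items()}

grades = ["A", "B", "B", "C", "D", "A", "B"]

# median_low always returns a value that is actually in the data,
# so the result maps back to a real grade
ranks = [grade_rank[g] for g in grades]
median_grade = rank_grade[median_low(ranks)]
print(median_grade)
```

The ranks are used only to sort the grades. No arithmetic is performed on them, so the unequal gaps do not affect the result.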

💡 Key Insight: In real-world ML projects, you will often need to convert between types. Turning categorical data into numbers (encoding) is one of the most critical preprocessing steps. You will see this when we reach decision trees later in this course.
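As a rough preview of what encoding looks like, the sketch below shows two common approaches in plain Python. The variable names and the size ordering are assumptions made for illustration. In practice a library would usually handle this step.

```python
colours = ["red", "blue", "red", "green"]
sizes = ["S", "L", "M", "S"]

# One-hot encoding for categorical data: one 0/1 column per category,
# so no artificial order is introduced
categories = sorted(set(colours))   # ['blue', 'green', 'red']
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colours]
print(one_hot)

# Ordinal encoding for ordered data: integers that preserve the order
size_order = {"S": 0, "M": 1, "L": 2}
encoded_sizes = [size_order[s] for s in sizes]
print(encoded_sizes)
```

One-hot encoding suits colours because giving red=0, blue=1, green=2 would suggest an order that does not exist. For sizes the order is real, so plain integers keep useful information.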
