Before you analyse anything, you need to know whether you are dealing with numbers, categories, or rankings.
Data comes in three flavours: numerical (numbers), categorical (labels), and ordinal (ranked labels). The type determines the technique.
1. Every piece of data you will ever work with falls into one of three categories, and applying an analysis technique to the wrong data type is a classic beginner mistake.
2. Not all numbers are created equal: some can only be whole numbers, and others can be infinitely precise.
3. The data type is your compass: it points you toward the right analysis technique and away from meaningless results.
Data types are like ingredients in cooking. You would not bake flour the same way you grill meat. Similarly, you cannot apply the same analysis technique to numbers, categories, and rankings.
The three types:
| Type | Definition | Examples |
|---|---|---|
| Numerical | Values you can do math on | Age, speed, price, temperature |
| Categorical | Labels with no natural order | Colour (red/blue), Yes/No, country |
| Ordinal | Labels WITH a natural order | Grades (A > B > C), size (S < M < L) |
# Numerical: you CAN calculate an average
ages = [25, 30, 35, 40]
print(sum(ages) / len(ages))
# Output: 32.5
# Categorical: an average makes NO sense
colours = ["red", "blue", "red", "green"]
# What is the "average" of red and blue? Nothing.
# Ordinal: order matters, but gaps are not equal
grades = ["A", "B", "C", "D"]
# A > B > C > D, but the "distance" between A and B
# is not necessarily the same as between C and D
💡 Key Insight: The single most important question before any analysis is: "What type is this data?" Calculating the mean of categorical data (like averaging zip codes) gives you a number that means absolutely nothing.
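
To make the zip-code example concrete, here is a minimal sketch. The zip codes are made-up illustrative values; `mean` and `Counter` come from Python's standard library.

```python
from collections import Counter
from statistics import mean

# Hypothetical zip codes: they look like numbers but are really labels
zip_codes = [10001, 90210, 60601, 10001]

# Python will happily compute this, but the result is not a real place
average_zip = mean(zip_codes)
print(average_zip)
# Output: 42703.25

# A frequency count is the meaningful summary for labels
most_common_zip = Counter(zip_codes).most_common(1)
print(most_common_zip)
# Output: [(10001, 2)]
```

The mean runs without error, which is exactly the danger: nothing in the code warns you that the result is meaningless.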
Think of discrete data like stairs: you step from one whole number to the next. Continuous data is like a ramp: you can stop at any point, including 3.7 or 3.14159.
Discrete data: countable, whole numbers
# Number of cars passing by: you cannot have 3.5 cars
cars_per_hour = [12, 15, 8, 22, 17, 9]
print(sum(cars_per_hour))
# Output: 83
# These are always integers
Continuous data: measurable, any value
# Temperature can be ANY value on the scale
temperatures = [36.6, 37.2, 36.8, 38.1, 36.5]
print(round(sum(temperatures) / len(temperatures), 2))
# Output: 37.04
# Price can have decimals
prices = [9.99, 24.50, 3.75, 149.99]
print(min(prices))
# Output: 3.75
Quick reference:
| Property | Discrete | Continuous |
|---|---|---|
| Values | Whole numbers only | Any value (decimals ok) |
| Measured by | Counting | Measuring |
| Examples | Students in class, dice rolls | Height, weight, time |
| Can be 3.5? | No | Yes |
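
As a rough sketch of the table above, the hypothetical helper `looks_discrete` checks whether every value in a list is a whole number. This is only a heuristic: a continuous measurement recorded as whole numbers (such as height rounded to the nearest centimetre) would be misclassified, so the real answer comes from knowing how the data was collected.

```python
def looks_discrete(values):
    """Return True if every value is a whole number (heuristic only)."""
    return all(float(v).is_integer() for v in values)

cars_per_hour = [12, 15, 8, 22, 17, 9]
temperatures = [36.6, 37.2, 36.8, 38.1, 36.5]

print(looks_discrete(cars_per_hour))
# Output: True
print(looks_discrete(temperatures))
# Output: False
```
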
💡 Key Insight: This distinction matters in ML because different algorithms handle discrete and continuous data differently. Classification algorithms predict discrete labels (spam/not spam). Regression algorithms predict continuous values (price, temperature).
Choosing the wrong technique for your data type is like using a thermometer to measure weight: the tool works fine, it is just measuring the wrong thing.
Matching data type to technique:
| Data Type | Good Techniques | Bad Techniques |
|---|---|---|
| Numerical | Mean, std dev, regression | Mode (rarely useful when most values are unique) |
| Categorical | Mode, frequency count, chi-square | Mean, median (meaningless) |
| Ordinal | Median, mode, rank correlation | Mean (gaps are not equal) |
# CORRECT: Mean on numerical data
speeds = [99, 86, 87, 88, 111, 86, 103]
print(round(sum(speeds) / len(speeds), 2))
# Output: 94.29
# CORRECT: Mode on categorical data
from collections import Counter
colours = ["red", "blue", "red", "green", "red"]
most_common = Counter(colours).most_common(1)
print(most_common)
# Output: [('red', 3)]
# WRONG: Mean on ordinal data
# Grades A=4, B=3, C=2: average is 3.0 = "B"
# But the gaps between grades are NOT necessarily equal
# So "B" as an average is misleading
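
By contrast, the median relies only on order, so it is a safe summary for ordinal data. A minimal sketch, assuming a hypothetical list of grades and a rank mapping (`rank`) that is used purely to establish order, not to measure distance:

```python
from statistics import median_low

# Hypothetical ranks: they encode order only (A is best), not equal gaps
rank = {"A": 4, "B": 3, "C": 2, "D": 1}
label = {r: g for g, r in rank.items()}  # reverse lookup: rank -> grade

grades = ["A", "C", "B", "B", "D", "A", "C"]

# median_low always returns an actual data point,
# so it never averages two neighbouring ranks
ranks = [rank[g] for g in grades]
median_grade = label[median_low(ranks)]
print(median_grade)
# Output: B
```

Here "B" is a grade that actually sits in the middle of the sorted list, not the product of arithmetic on unequal gaps.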
💡 Key Insight: In real-world ML projects, you will often need to convert between types. Turning categorical data into numbers (encoding) is one of the most critical preprocessing steps. You will see this when we reach decision trees later in this course.
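
As a preview of what encoding can look like, here is a minimal plain-Python sketch of two common approaches: one-hot encoding for categorical labels and integer encoding for ordinal labels. The helper name `one_hot` and the sample values are illustrative assumptions; in practice a library would usually handle this step.

```python
def one_hot(values):
    """Map each label to a list of 0/1 flags, one flag per distinct label."""
    categories = sorted(set(values))  # fixed, repeatable column order
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows

colours = ["red", "blue", "red", "green"]
categories, encoded = one_hot(colours)
print(categories)
# Output: ['blue', 'green', 'red']
print(encoded)
# Output: [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]

# Ordinal labels have a natural order, so a rank mapping preserves it
size_rank = {"S": 0, "M": 1, "L": 2}
sizes = ["M", "S", "L", "M"]
encoded_sizes = [size_rank[s] for s in sizes]
print(encoded_sizes)
# Output: [1, 0, 2, 1]
```

One-hot encoding avoids implying that one colour is "greater" than another, while the rank mapping keeps the S < M < L order that ordinal data carries.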