To a computer, a data set is just a structured collection of values — from a simple list to a full database table.
A data set can be an array of numbers or a table with rows and columns. Every ML model needs data to learn from.
EXERCISE
1. To a computer, a data set is just a structured collection of values: it could be a simple list or an entire database.
EXERCISE
2. Python gives you powerful tools to load, inspect, and manipulate data sets, from plain lists to CSV files with Pandas.
Think of a data set like a filing cabinet. Each drawer is a table, each folder inside is a row, and each label on the folder is a column. The filing cabinet itself is your data set.
Two forms of data sets:
1. Arrays (simple lists):
speeds = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
print(len(speeds))
# Output: 13
print(max(speeds))
# Output: 111
print(min(speeds))
# Output: 77
2. Tabular data (rows and columns):
| Car | Colour | Age | Speed | AutoPass |
|---|---|---|---|---|
| BMW | red | 5 | 99 | Y |
| Volvo | black | 7 | 86 | Y |
| VW | gray | 8 | 87 | N |
| Ford | white | 2 | 111 | Y |
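As a minimal sketch of how rows and columns look in plain Python, the table above can be held as a list of dictionaries, one per row. The variable names (`cars`, `first_row`, `speed_column`) are illustrative choices, not part of the lesson.

```python
# The four rows of the car table, one dictionary per row.
cars = [
    {"Car": "BMW", "Colour": "red", "Age": 5, "Speed": 99, "AutoPass": "Y"},
    {"Car": "Volvo", "Colour": "black", "Age": 7, "Speed": 86, "AutoPass": "Y"},
    {"Car": "VW", "Colour": "gray", "Age": 8, "Speed": 87, "AutoPass": "N"},
    {"Car": "Ford", "Colour": "white", "Age": 2, "Speed": 111, "AutoPass": "Y"},
]

# A row is one dictionary; a column is the same key read from every row.
first_row = cars[0]
speed_column = [row["Speed"] for row in cars]

print(first_row["Car"])
# Output: BMW
print(speed_column)
# Output: [99, 86, 87, 111]
```

This works for small tables; for anything larger, a Pandas DataFrame is usually the more convenient container.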
💡 Key Insight: In ML, the size and quality of your data set often matter more than the cleverness of your algorithm. A simple algorithm on great data will usually beat a fancy algorithm on bad data.
Think of Python as a multilingual translator for data. It can read arrays, CSV files, and many other formats. The two most common tools are lists (built-in) and Pandas DataFrames (for tabular data).
Working with lists:
speeds = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
# Slice the first 5 values
print(speeds[:5])
# Output: [99, 86, 87, 88, 111]
# Check if a value exists
print(111 in speeds)
# Output: True
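Two more everyday list operations, sketched on the same speeds data: counting how often a value appears and sorting. The names `count_86` and `slowest` are illustrative.

```python
speeds = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# Count how many times a value appears
count_86 = speeds.count(86)
print(count_86)
# Output: 3

# Sort a copy (the original list is left unchanged) and take the three slowest
slowest = sorted(speeds)[:3]
print(slowest)
# Output: [77, 78, 85]
```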
Working with Pandas DataFrames:
import pandas as pd
# Create a DataFrame from a dictionary
data = {
"Car": ["BMW", "Volvo", "VW"],
"Age": [5, 7, 8],
"Speed": [99, 86, 87]
}
df = pd.DataFrame(data)
print(df)
# Output:
# Car Age Speed
# 0 BMW 5 99
# 1 Volvo 7 86
# 2 VW 8 87
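Tabular data often arrives as a CSV file rather than a dictionary. The sketch below loads CSV text with `pd.read_csv`; to stay self-contained it reads from an in-memory string via `io.StringIO` instead of a real file, and the CSV content is simply the car table from earlier in this lesson. The name `cars_df` is an illustrative choice.

```python
import io
import pandas as pd

# Stand-in for a file on disk; with a real file, pass its path to pd.read_csv.
csv_text = """Car,Colour,Age,Speed,AutoPass
BMW,red,5,99,Y
Volvo,black,7,86,Y
VW,gray,8,87,N
Ford,white,2,111,Y
"""

cars_df = pd.read_csv(io.StringIO(csv_text))

print(cars_df.shape)
# Output: (4, 5)
print(list(cars_df.columns))
# Output: ['Car', 'Colour', 'Age', 'Speed', 'AutoPass']
```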
💡 Key Insight: Pandas DataFrames are the lingua franca of tabular data for ML in Python. Many ML libraries, scikit-learn among them, accept DataFrames directly, and others, such as TensorFlow, typically work with arrays or tensors that are easy to build from one. Learning Pandas is not optional: it is step one.
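When handing a DataFrame to an ML library, a common pattern is to split it into feature columns and a label column, sometimes converting both to NumPy arrays. The sketch below assumes the car table's values; the names `X` and `y` follow a widespread convention and are not something the lesson prescribes.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [5, 7, 8, 2],
    "Speed": [99, 86, 87, 111],
    "AutoPass": ["Y", "Y", "N", "Y"],
})

# Feature columns (inputs) and the label column (what we would want to predict)
X = df[["Age", "Speed"]].to_numpy()
y = (df["AutoPass"] == "Y").to_numpy()

print(X.shape)
# Output: (4, 2)
print(y.tolist())
# Output: [True, True, False, True]
```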
EXERCISE
3. The real power of data sets is not what you can see at a glance but the questions you can ask and the answers hiding inside.
A data set is like an unread diary: the stories are in there, but you have to ask the right questions. Can we predict a car's AutoPass status from its age and speed? Only the data knows.
Asking questions with Python:
speeds = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
# Q: What is the average speed?
avg = sum(speeds) / len(speeds)
print(round(avg, 2))
# Output: 89.77
# Q: How many cars are faster than 90?
fast = [s for s in speeds if s > 90]
print(len(fast))
# Output: 4
# Q: What percentage are above average?
above = [s for s in speeds if s > avg]
print(f"{len(above)}/{len(speeds)} = {len(above) / len(speeds):.1%}")
# Output: 4/13 = 30.8%
With Pandas — even more powerful:
import pandas as pd
# Illustrative values taken from the car table earlier in this lesson
df = pd.DataFrame({"Age": [5, 7, 8, 2], "Speed": [99, 86, 87, 111]})
print(df.describe())
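The AutoPass question raised earlier can be given a first, rough look with a group-by. This is a sketch on the four rows of the car table, which is far too little data to conclude anything; the names `cars_df` and `group_means` are illustrative.

```python
import pandas as pd

cars_df = pd.DataFrame({
    "Age": [5, 7, 8, 2],
    "Speed": [99, 86, 87, 111],
    "AutoPass": ["Y", "Y", "N", "Y"],
})

# Average age and speed for each AutoPass group
group_means = cars_df.groupby("AutoPass")[["Age", "Speed"]].mean()
print(group_means.round(2))

# Read a single answer: mean speed of the cars with AutoPass "Y"
print(round(group_means.loc["Y", "Speed"], 2))
# Output: 98.67
```

A gap between the groups is a hint worth checking on more data, not a prediction.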
💡 Key Insight: Before building any ML model, always explore your data first. Use .describe(), check for missing values with .isnull().sum(), and plot the data. Skipping this step is how you build models that look great on paper and fail in production.
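A minimal sketch of that first look, assuming a small hypothetical DataFrame in which one Speed value is missing (the gap is made up for illustration):

```python
import pandas as pd

# Hypothetical data with one missing Speed value (None becomes NaN)
df = pd.DataFrame({
    "Age": [5, 7, 8, 2],
    "Speed": [99, 86, None, 111],
})

# Summary statistics per numeric column (count, mean, std, min, quartiles, max)
print(df.describe())

# Number of missing values in each column
missing = df.isnull().sum()
print(missing["Age"], missing["Speed"])
# Output: 0 1
```

For the visual check, `df.plot.scatter(x="Age", y="Speed")` is one option, assuming matplotlib is installed.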