Where to Find Real Datasets
Duration: 5 min
Before you can train a model, you need data. The good news: there are millions of free, real-world datasets available online. This module covers the main sources and how to access them.
The main dataset sources
- Kaggle (kaggle.com/datasets) — the largest community platform for ML datasets, with competitions, notebooks, and discussion included.
- Hugging Face Datasets (huggingface.co/datasets) — 50,000+ datasets, many ML-ready and loadable through a Python API.
- UCI Machine Learning Repository (archive.ics.uci.edu) — classic academic datasets, great for learning.
- Google Dataset Search (datasetsearch.research.google.com) — searches across the web.
- Government open data — data.gov (US), data.gov.uk, data.gov.in — census, health, transport, weather.
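Of these sources, Hugging Face is the one described above as having a Python API. A minimal sketch of what that can look like, assuming the `datasets` package is installed (`pip install datasets`) and that the IMDB reviews are available under the dataset id `imdb`. The helper names `load_reviews` and `label_counts` are made up for this illustration.

```python
from collections import Counter


def load_reviews(split="train"):
    """Download (or load from the local cache) one split of the IMDB reviews."""
    # Imported inside the function so the rest of the file still works
    # when the package is missing. Assumption: `pip install datasets`.
    from datasets import load_dataset

    return load_dataset("imdb", split=split)


def label_counts(rows):
    """Count how many rows carry each value of the "label" field."""
    return Counter(row["label"] for row in rows)
```

Calling `label_counts(load_reviews())` needs a network connection on the first run; after that the library reads from its local cache.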
Downloading from Kaggle
# Install the Kaggle CLI
pip install kaggle
# Create an API token at kaggle.com > Account > API > Create New Token,
# then move the downloaded kaggle.json into ~/.kaggle/
# Download a dataset
kaggle datasets download -d camnugent/california-housing-prices
unzip california-housing-prices.zip
Classic datasets worth knowing
- California Housing — predict house prices from location, rooms, income. Great for regression.
- Titanic — predict survival. The classic classification starter.
- MNIST — 70,000 handwritten digits. The 'Hello World' of image classification.
- IMDB Reviews — 50,000 movie reviews for sentiment analysis.
- Iris — 150 flowers, each with four measurements. Tiny but perfect for learning clustering and classification.
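A Kaggle download like the California Housing one is a zip archive, usually holding one or more CSV files. A first look at such an archive could be sketched in Python as follows. The helper name `peek_csv_headers` is hypothetical, and the sketch assumes the CSV files are UTF-8 text with a header row.

```python
import csv
import io
import zipfile


def peek_csv_headers(zip_path):
    """Return {file name: list of column names} for each CSV in the archive."""
    headers = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip READMEs, images, and other non-CSV members
            # Read the member in place, without extracting it to disk.
            with archive.open(name) as raw, io.TextIOWrapper(
                raw, encoding="utf-8", newline=""
            ) as text:
                # The first row of the CSV is taken to be the header.
                headers[name] = next(csv.reader(text), [])
    return headers
```

For the archive downloaded earlier, `peek_csv_headers("california-housing-prices.zip")` would list the columns of each CSV, which is a quick way to spot a candidate target column.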
💡 Tip: Start with a dataset that has a clear target variable (what you're predicting) and under 100,000 rows. You can iterate fast and see results quickly.
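Both criteria in the tip are easy to check before committing to a dataset. A rough sketch, assuming the data is a CSV with a header row. The function name `quick_check` and its `max_rows` default simply mirror the tip and are not part of any library.

```python
import csv


def quick_check(csv_file, target, max_rows=100_000):
    """Report whether `target` is a column and whether the row count is small.

    `csv_file` is any open text file (or other iterable of lines).
    """
    reader = csv.DictReader(csv_file)
    has_target = target in (reader.fieldnames or [])
    n_rows = sum(1 for _ in reader)  # data rows only, header excluded
    return {
        "has_target": has_target,
        "rows": n_rows,
        "small_enough": n_rows < max_rows,
    }
```

With a real file this might be called as `quick_check(open("housing.csv", newline=""), "median_house_value")`, assuming the unzipped archive contains a file and column with those names.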
❓ Which platform hosts the largest community of ML datasets and competitions?