Loading the California Housing Dataset
Duration: 5 min
The California Housing dataset contains 1990 census data for California districts: median house value, median income, population, location, and more. It's a real dataset with real messiness: skewed distributions, geographic outliers, and, in some CSV copies, missing values (the scikit-learn copy has none).
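A quick way to check this kind of messiness is to count nulls and measure skewness per column. The sketch below is a minimal illustration rather than part of the lesson's own code: the helper name `summarize_messiness` and the tiny demo frame are assumptions made for this example, and the same helper can be pointed at whichever copy of the dataset you load.

```python
import numpy as np
import pandas as pd


def summarize_messiness(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column null counts and skewness for numeric columns.

    Hypothetical helper for illustration; not part of pandas or scikit-learn.
    """
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "n_missing": numeric.isnull().sum(),  # nulls per column
        "skew": numeric.skew(),               # > 0 means a long right tail
    })


# Tiny made-up frame standing in for the real data
demo = pd.DataFrame({
    "income": [1.0, 2.0, 3.0, 4.0, 50.0],   # one extreme value -> right skew
    "rooms": [5.0, np.nan, 6.0, 5.5, 6.5],  # one missing value
})
summary = summarize_messiness(demo)
print(summary)
```

To run it on the real data, pass in the `df` produced by any of the loading options below.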
Loading it three ways
# Option 1: from scikit-learn (easiest)
from sklearn.datasets import fetch_california_housing
import pandas as pd
housing = fetch_california_housing(as_frame=True)
df = housing.frame
print(df.shape) # (20640, 9)
print(df.head())
print(df.describe())
# Option 2: from a downloaded CSV
import pandas as pd
df = pd.read_csv('housing.csv')
df.info() # prints column types and non-null counts (it returns None, so no print() is needed)
print(df.isnull().sum()) # check for missing values per column
# Option 3: from HuggingFace Datasets
from datasets import load_dataset
ds = load_dataset('leostelon/california-housing')
df = ds['train'].to_pandas()
print(df.head())
Understanding the columns
- MedInc: median income in the block group (in tens of thousands of dollars)
- HouseAge: median age of houses in the block group
- AveRooms: average number of rooms per household
- AveBedrms: average number of bedrooms per household
- Population: block group population
- AveOccup: average number of household members
- Latitude / Longitude: geographic location of the block group
- MedHouseVal: median house value (the target, in hundreds of thousands of dollars)
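Because MedInc and MedHouseVal are stored in scaled units, it can help to convert them to plain dollars before reading off individual values. The following is a minimal sketch that assumes the scikit-learn column names and the scale factors listed above; the helper `add_dollar_columns` and the two new column names are hypothetical.

```python
import pandas as pd


def add_dollar_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with MedInc and MedHouseVal also expressed in dollars."""
    out = df.copy()  # leave the caller's frame untouched
    out["MedIncUSD"] = out["MedInc"] * 10_000             # tens of thousands -> dollars
    out["MedHouseValUSD"] = out["MedHouseVal"] * 100_000  # hundreds of thousands -> dollars
    return out


# Two made-up rows in the dataset's units
sample = pd.DataFrame({"MedInc": [8.0, 3.5], "MedHouseVal": [4.5, 1.25]})
converted = add_dollar_columns(sample)
print(converted[["MedIncUSD", "MedHouseValUSD"]])
```

Keeping the original columns alongside the converted ones means any model code that expects the scaled units continues to work.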
Quick exploration
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
df = fetch_california_housing(as_frame=True).frame
# Distribution of house values
df['MedHouseVal'].hist(bins=50)
plt.xlabel('Median House Value ($100k)')
plt.title('California Housing Price Distribution')
plt.show()
# Correlation with target
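print(df.corr()['MedHouseVal'].sort_values(ascending=False))
Output:
MedHouseVal 1.000000
MedInc 0.688075
AveRooms 0.151948
HouseAge 0.105623
AveOccup -0.023737
Population -0.024650
Longitude -0.045967
AveBedrms -0.046701
Latitude -0.142724
💡 Tip: MedInc (median income) has the strongest linear correlation with house value, at about 0.69. That makes it a likely candidate for the most informative single feature, which is a good sign before you even train a model. Correlation only captures linear relationships, though, so weakly correlated features such as Latitude and Longitude can still matter.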
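Pearson correlation, which `df.corr()` computes by default, only measures linear association. One way to sanity-check a ranking like the one above is to compare it with Spearman rank correlation, which also picks up monotonic but non-linear relationships. The sketch below uses synthetic data and a hypothetical helper, `compare_correlations`; it illustrates the idea and does not reproduce the housing numbers.

```python
import numpy as np
import pandas as pd


def compare_correlations(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Pearson and Spearman correlation of each feature with the target.

    Hypothetical helper for illustration only.
    """
    result = pd.DataFrame({
        "pearson": df.corr(method="pearson")[target].drop(target),
        "spearman": df.corr(method="spearman")[target].drop(target),
    })
    # Order rows by absolute Pearson correlation, strongest first
    order = result["pearson"].abs().sort_values(ascending=False).index
    return result.loc[order]


# Synthetic data: y depends on x_curved through a monotonic, non-linear function
rng = np.random.default_rng(0)
x_curved = rng.uniform(0.0, 5.0, size=500)
demo = pd.DataFrame({
    "x_curved": x_curved,
    "noise_only": rng.normal(size=500),  # unrelated to y
    "y": np.exp(x_curved),
})
table = compare_correlations(demo, target="y")
print(table)
```

On the housing data, the equivalent call would be `compare_correlations(df, target='MedHouseVal')`. A feature whose Spearman value is noticeably larger in magnitude than its Pearson value may be worth transforming before fitting a linear model.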
❓ What does df.isnull().sum() tell you?