Module 2 of 9 · Real Datasets & Pre-trained Models · Beginner

Loading the California Housing Dataset

Duration: 5 min

The California Housing dataset contains 1990 census data for California districts (census block groups): median house value, median income, population, location, and more. It's a real dataset with real messiness: skewed distributions, geographic outliers, and, in some CSV copies, missing values. The scikit-learn version ships with no missing values.

Loading it three ways

# Option 1: from scikit-learn (easiest)
from sklearn.datasets import fetch_california_housing
import pandas as pd

housing = fetch_california_housing(as_frame=True)
df = housing.frame
print(df.shape)        # (20640, 9)
print(df.head())
print(df.describe())

# Option 2: from a downloaded CSV
import pandas as pd

df = pd.read_csv('housing.csv')
df.info()                 # prints column types and non-null counts (returns None, so no print needed)
print(df.isnull().sum())  # check for missing values
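
To see what the missing-value check reports, here is a minimal sketch on a small hand-made frame. The column names and values are made up for illustration and are not taken from housing.csv.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with two gaps in one column
toy = pd.DataFrame({
    "rooms": [5.0, 6.0, 4.0, 7.0],
    "bedrooms": [1.0, np.nan, 1.0, np.nan],
})

# isnull() marks each missing cell as True; sum() counts the Trues per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Names of the columns that have at least one missing value
columns_with_gaps = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_gaps)
```

A count of zero for every column means there is nothing to impute or drop. Any non-zero count tells you which columns need attention before training.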

# Option 3: from HuggingFace Datasets
from datasets import load_dataset

ds = load_dataset('leostelon/california-housing')
df = ds['train'].to_pandas()
print(df.head())

Understanding the columns
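
The scikit-learn version of the dataset has eight feature columns and one target column. As a quick reference, the sketch below keeps them in a plain dictionary. The name COLUMN_NOTES is hypothetical, and the descriptions are paraphrased from the scikit-learn dataset description rather than quoted.

```python
# Column names as they appear in the scikit-learn version of the dataset.
# Descriptions are paraphrased from the scikit-learn dataset description.
COLUMN_NOTES = {
    "MedInc": "median income in the block group",
    "HouseAge": "median house age in the block group",
    "AveRooms": "average number of rooms per household",
    "AveBedrms": "average number of bedrooms per household",
    "Population": "block group population",
    "AveOccup": "average number of household members",
    "Latitude": "block group latitude",
    "Longitude": "block group longitude",
    "MedHouseVal": "target: median house value, in units of $100,000",
}

# Print a simple two-column reference table
for name, note in COLUMN_NOTES.items():
    print(f"{name:<12} {note}")
```

If you loaded the data with Option 1, print(housing.DESCR) shows the full description. A CSV or HuggingFace copy may use different column names, so check df.columns before relying on these.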

Quick exploration

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

df = fetch_california_housing(as_frame=True).frame

# Distribution of house values
df['MedHouseVal'].hist(bins=50)
plt.xlabel('Median House Value ($100k)')
plt.title('California Housing Price Distribution')
plt.show()

# Correlation with target
print(df.corr()['MedHouseVal'].sort_values(ascending=False))

MedHouseVal    1.000000
MedInc         0.688075
AveRooms       0.151948
HouseAge       0.105623
AveOccup      -0.023737
Population    -0.024650
Longitude     -0.045967
AveBedrms     -0.046701
Latitude      -0.142724

💡 Tip: MedInc (median income) has the strongest correlation with house value at 0.69. That makes it the strongest single linear predictor among these columns, which is a good sign before you even train a model.
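
By default, df.corr() computes the Pearson correlation coefficient. The sketch below shows what that number measures, using synthetic stand-in arrays rather than the real data. The pearson helper and the variable names are hypothetical and exist only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (not the real data): "income" drives "value" plus noise
income = rng.normal(loc=4.0, scale=2.0, size=1000)
value = 0.4 * income + rng.normal(loc=0.0, scale=1.0, size=1000)

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

r_manual = pearson(income, value)
r_numpy = np.corrcoef(income, value)[0, 1]

# The hand-written formula and NumPy's built-in agree
print(round(r_manual, 4), round(r_numpy, 4))
```

Pearson correlation only captures linear relationships. A value near zero, as with Latitude and Longitude on their own, does not mean a feature is useless, since location can still matter in a non-linear way.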

❓ What does df.isnull().sum() tell you?
