Loading the California Housing Dataset

Duration: 5 min

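The California Housing dataset contains 1990 census data for California districts (census block groups): median house value, median income, population, location, and more. It's a real dataset with real messiness: skewed distributions, capped values, and geographic outliers. The scikit-learn copy has no missing values, but some CSV versions in circulation do, which is why Option 2 checks for them.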

Loading it three ways

# Option 1: from scikit-learn (easiest)
from sklearn.datasets import fetch_california_housing
import pandas as pd

housing = fetch_california_housing(as_frame=True)
df = housing.frame
print(df.shape)        # (20640, 9)
print(df.head())
print(df.describe())
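
Once the frame is loaded, a common next step is to separate the feature columns from the target. The sketch below does this with a hypothetical helper, `split_features_target`, shown on a two-row stand-in frame with invented values so that it runs without a download. With the real data you would pass `housing.frame` instead.

```python
import pandas as pd


def split_features_target(frame: pd.DataFrame, target: str = "MedHouseVal"):
    """Return (X, y): every column except `target`, and the target column."""
    X = frame.drop(columns=[target])
    y = frame[target]
    return X, y


# Stand-in frame with invented values; real rows come from fetch_california_housing.
toy = pd.DataFrame({
    "MedInc": [8.3, 3.1],
    "HouseAge": [41.0, 15.0],
    "MedHouseVal": [4.5, 1.2],
})

X, y = split_features_target(toy)
print(X.columns.tolist())  # ['MedInc', 'HouseAge']
print(y.tolist())          # [4.5, 1.2]
```

With `as_frame=True`, scikit-learn also exposes the same split directly as `housing.data` and `housing.target`, so the helper is mainly useful when the frame comes from somewhere else.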

# Option 2: from a downloaded CSV
import pandas as pd

df = pd.read_csv('housing.csv')
df.info()                 # prints column types and non-null counts (returns None, so no print() needed)
print(df.isnull().sum())  # check for missing values
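
What the missing-value check reports, and one common way to act on it, can be sketched on a small invented frame. The column names here (`total_rooms`, `total_bedrooms`) are assumptions about how a downloaded `housing.csv` might be laid out, so check `df.columns` on your own file. Median imputation is only one reasonable option.

```python
import numpy as np
import pandas as pd

# Invented rows standing in for a CSV with gaps in one column.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# One count of missing cells per column.
missing = toy.isnull().sum()
print(missing)

# Fill the gaps with the median of the observed values (NaN is skipped by default).
median_bedrooms = toy["total_bedrooms"].median()
filled = toy.fillna({"total_bedrooms": median_bedrooms})
print(filled["total_bedrooms"].tolist())  # [129.0, 159.5, 190.0, 159.5]
```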

# Option 3: from HuggingFace Datasets
from datasets import load_dataset

ds = load_dataset('leostelon/california-housing')
df = ds['train'].to_pandas()
print(df.head())

Understanding the columns

MedInc: median income in the block group (in tens of thousands of dollars)
HouseAge: median age of houses in the block group
AveRooms: average number of rooms per household
AveBedrms: average number of bedrooms per household
Population: block group population
AveOccup: average number of household members
Latitude / Longitude: geographic location of the block group
MedHouseVal: median house value (the target, in hundreds of thousands of dollars)
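
Because MedInc and MedHouseVal are stored in scaled units, it can help to convert them to plain dollars before reading off individual values. A minimal sketch, using the scale factors listed above; the helper name `add_dollar_columns`, the new column names, and the sample rows are all invented for illustration.

```python
import pandas as pd


def add_dollar_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with income and house value also expressed in dollars."""
    out = frame.copy()
    out["MedIncUSD"] = out["MedInc"] * 10_000             # tens of thousands -> dollars
    out["MedHouseValUSD"] = out["MedHouseVal"] * 100_000  # hundreds of thousands -> dollars
    return out


# Invented rows; the real frame has the same two column names.
toy = pd.DataFrame({"MedInc": [8.0, 2.5], "MedHouseVal": [4.5, 1.25]})
converted = add_dollar_columns(toy)
print(converted[["MedIncUSD", "MedHouseValUSD"]])
```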

Quick exploration

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

df = fetch_california_housing(as_frame=True).frame

# Distribution of house values
df['MedHouseVal'].hist(bins=50)
plt.xlabel('Median House Value ($100k)')
plt.title('California Housing Price Distribution')
plt.show()

# Correlation with target
print(df.corr()['MedHouseVal'].sort_values(ascending=False))

MedHouseVal    1.000000
MedInc         0.688075
AveRooms       0.151948
HouseAge       0.105623
AveOccup      -0.023737
Population    -0.024650
Longitude     -0.045967
AveBedrms     -0.046701
Latitude      -0.142724

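💡 Tip: MedInc (median income) has the strongest correlation with house value at 0.69. That makes it the most promising single predictor, a good sign before you even train a model. Keep in mind that this correlation only measures linear association, so a weakly correlated feature such as Latitude or Longitude can still matter to a non-linear model.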
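
The default `df.corr()` computes Pearson correlation, which captures linear relationships. A rough sketch of how it can understate a relationship that is monotonic but curved, compared with Spearman rank correlation (`method="spearman"`). The data below is synthetic and the column names `income` and `value` are placeholders, not columns of the housing frame.

```python
import numpy as np
import pandas as pd

# Synthetic data: `value` rises with `income` every step, but exponentially.
income = np.arange(1.0, 11.0)
toy = pd.DataFrame({"income": income, "value": np.exp(income)})

# Pearson: strength of the linear relationship.
pearson = toy.corr(method="pearson").loc["income", "value"]
# Spearman: strength of the monotonic (rank-order) relationship.
spearman = toy.corr(method="spearman").loc["income", "value"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")  # 1.000, because the ranks agree perfectly
```

Running both methods on the housing frame is a cheap way to spot features whose relationship with MedHouseVal is real but not linear.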

❓ What does df.isnull().sum() tell you?

The total number of rows in the dataframe
The number of missing values in each column
The sum of all numeric values
Whether the dataframe is empty
