Module 17 of 25 · AI & Machine Learning Fundamentals · Beginner

Project: Building a Machine Learning Model

Duration: 8 min

This module guides you through the process of building a machine learning model from scratch. You will learn how to preprocess data, select features, choose appropriate algorithms, train your model, and evaluate its performance. Understanding these steps is crucial for developing effective machine learning solutions.

Data Preprocessing

Data preprocessing is a critical step in building a machine learning model. It involves cleaning the data, handling missing values, encoding categorical variables, and scaling features. Most algorithms cannot work with missing values or raw text categories, and many are sensitive to feature scale, so proper preprocessing helps your model learn effectively from the data.

import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline  # needed for the Pipeline objects below

# Sample data
data = {'age': [25, 30, 35, None], 'gender': ['M', 'F', 'M', 'F'], 'income': [50000, 60000, None, 80000]}
df = pd.DataFrame(data)

# Define preprocessing for numeric and categorical features
numeric_features = ['age', 'income']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['gender']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])

# Apply preprocessing
df_processed = preprocessor.fit_transform(df)
df_processed

Output (values rounded to three decimal places; columns are scaled age, scaled income, gender_F, gender_M):

array([[-1.414, -1.147,  0.   ,  1.   ],
       [ 0.   , -0.229,  1.   ,  0.   ],
       [ 1.414, -0.229,  0.   ,  1.   ],
       [ 0.   ,  1.606,  1.   ,  0.   ]])
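To make the preprocessing steps less of a black box, the following is a minimal sketch that repeats them by hand with pandas on the same toy columns. It is an illustration, not a replacement for the scikit-learn pipeline: the variable names (`imputed`, `scaled`, `encoded`, `manual`) are made up for this example, and it assumes standardization with the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import pandas as pd

# Same toy columns as the example above
df = pd.DataFrame({
    'age': [25, 30, 35, None],
    'income': [50000, 60000, None, 80000],
    'gender': ['M', 'F', 'M', 'F'],
})

# Median imputation: replace each missing value with its column's median
numeric = df[['age', 'income']]
imputed = numeric.fillna(numeric.median())

# Standardization: subtract the column mean and divide by the
# population standard deviation (ddof=0)
scaled = (imputed - imputed.mean()) / imputed.std(ddof=0)

# One-hot encoding: one 0/1 column per category
encoded = pd.get_dummies(df['gender'], prefix='gender', dtype=float)

# Put the numeric and categorical parts side by side
manual = pd.concat([scaled, encoded], axis=1)
print(manual.round(3))
```

Doing the steps manually like this is useful for understanding, but in a real project the pipeline version is usually preferable because it remembers the medians, means, and categories learned from the training data and reapplies them to new data.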

Model Selection and Training

Choosing the right model and training it effectively is essential for achieving good performance. You need to consider the type of problem (classification, regression, etc.), the size and nature of your data, and the complexity of the model. Training involves feeding the preprocessed data into the model and adjusting its parameters to minimize error.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

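# Sample data: ten points in two classes (small values are class 0, large values are class 1)
X = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Split data into training and test sets, keeping class proportions the same in both (stratify)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)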

# Initialize and train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
accuracy
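The section mentions regression as another problem type. As a rough sketch of how the same workflow changes when the target is a number rather than a class, the example below swaps in `RandomForestRegressor` and measures mean squared error instead of accuracy. The toy data (`X_reg`, `y_reg`) is hypothetical and chosen only so the example runs.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical toy data: the target is exactly twice the feature value
X_reg = [[i] for i in range(10)]
y_reg = [2.0 * i for i in range(10)]

# Hold out 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42)

# Train a regressor instead of a classifier
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_tr, y_tr)

# Mean squared error: lower is better, 0 would be a perfect fit
mse = mean_squared_error(y_te, reg.predict(X_te))
print(f"Test MSE: {mse:.3f}")
```

The split, fit, and predict steps are unchanged; only the estimator and the metric depend on the problem type.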

💡 Tip: Always evaluate your model on a separate test set that was not used for training, so you can detect overfitting. Use cross-validation for a more robust performance estimate.
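As a minimal sketch of the cross-validation mentioned in the tip, the example below uses scikit-learn's `cross_val_score` with 5 folds. The toy data (`X_cv`, `y_cv`) is hypothetical and far too small for a meaningful estimate; it only shows the shape of the call.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical toy data: ten points in two balanced classes
X_cv = [[i, i] for i in range(10)]
y_cv = [0] * 5 + [1] * 5

clf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation: the model is trained five times, each time
# holding out a different fifth of the data for scoring.
# For a classifier, an integer cv uses stratified folds.
scores = cross_val_score(clf, X_cv, y_cv, cv=5)

# Report the average score and how much it varies between folds
print(f"Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```

Because every sample is used for testing exactly once, the averaged score depends less on one lucky or unlucky split than a single train/test evaluation does.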

❓ What is the purpose of data preprocessing in machine learning?

❓ Which of the following is a common method for handling missing values in numeric data?
