Module 13 of 25 · AI & Machine Learning Fundamentals · Beginner

Reinforcement Learning Fundamentals

Duration: 4 min

This module introduces Reinforcement Learning (RL), a subfield of machine learning in which an agent learns to make sequences of decisions by interacting with an environment and receiving rewards for desirable actions. RL is the basis for systems that must learn and adapt in dynamic environments.

Markov Decision Processes (MDPs)

A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making situations where outcomes are partly random and partly under the control of a decision-maker. MDPs are characterized by states, actions, transition probabilities, and rewards, usually together with a discount factor that weights future rewards. The goal in an MDP is to find a policy, that is, a mapping from states to actions, that maximizes the expected cumulative (discounted) reward over time.
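
In symbols, the goal can be stated with the Bellman optimality equation. The notation is assumed here rather than taken from the module: P for transition probabilities, R for rewards, γ for a discount factor between 0 and 1, and V* for the optimal value function.

```latex
V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma V^*(s') \bigr]
```

An optimal policy picks, in each state, an action that attains this maximum. Value iteration applies the right-hand side repeatedly as an update until the values stop changing.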

# Define the states, actions, and transition probabilities.
# transition_probs[(state, action)] maps each possible next state to its probability.
states = ['A', 'B', 'C']
actions = ['left', 'right']
transition_probs = {
    ('A', 'left'): {'A': 1.0},
    ('A', 'right'): {'B': 0.8, 'C': 0.2},
    ('B', 'left'): {'A': 0.6, 'B': 0.4},
    ('B', 'right'): {'C': 1.0},
    ('C', 'left'): {'B': 1.0},
    ('C', 'right'): {'C': 1.0}
}

# Define the rewards, keyed by (state, action, next_state)
rewards = {
    ('A', 'left', 'A'): 0,
    ('A', 'right', 'B'): 1,
    ('A', 'right', 'C'): -1,
    ('B', 'left', 'A'): 0.5,
    ('B', 'left', 'B'): -0.5,
    ('B', 'right', 'C'): 2,
    ('C', 'left', 'B'): 0,
    ('C', 'right', 'C'): 0
}

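# Value iteration: repeatedly apply the Bellman optimality update
V = {s: 0 for s in states}
gamma = 0.9  # discount factor

for _ in range(1000):  # a fixed number of sweeps, ample for convergence here
    V_new = V.copy()
    for s in states:
        # Best expected return over actions:
        # immediate reward plus discounted value of the next state
        V_new[s] = max(
            sum(p * (rewards[s, a, s_] + gamma * V[s_])
                for s_, p in transition_probs[s, a].items())
            for a in actions
        )
    V = V_new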

# Round for readability when printing
print('Optimal value function:', {s: round(v, 3) for s, v in V.items()})

Output:

Optimal value function: {'A': 9.884, 'B': 10.526, 'C': 9.474}
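
The value function alone does not say which action to take. As a minimal sketch, the policy can be read off with a one-step lookahead over the same toy MDP. The helper names `q_value`, `value_iteration`, and `greedy_policy`, and the stopping tolerance `tol`, are illustrative choices, not part of the original example.

```python
# Same toy MDP as in the value-iteration example
states = ['A', 'B', 'C']
actions = ['left', 'right']
transition_probs = {
    ('A', 'left'): {'A': 1.0},
    ('A', 'right'): {'B': 0.8, 'C': 0.2},
    ('B', 'left'): {'A': 0.6, 'B': 0.4},
    ('B', 'right'): {'C': 1.0},
    ('C', 'left'): {'B': 1.0},
    ('C', 'right'): {'C': 1.0},
}
rewards = {
    ('A', 'left', 'A'): 0, ('A', 'right', 'B'): 1, ('A', 'right', 'C'): -1,
    ('B', 'left', 'A'): 0.5, ('B', 'left', 'B'): -0.5, ('B', 'right', 'C'): 2,
    ('C', 'left', 'B'): 0, ('C', 'right', 'C'): 0,
}
gamma = 0.9


def q_value(V, s, a):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (rewards[s, a, s_next] + gamma * V[s_next])
               for s_next, p in transition_probs[s, a].items())


def value_iteration(tol=1e-10):
    """Repeat Bellman updates until the largest change falls below tol."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: max(q_value(V, s, a) for a in actions) for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new


def greedy_policy(V):
    """For each state, pick the action with the highest lookahead value."""
    return {s: max(actions, key=lambda a: q_value(V, s, a)) for s in states}


V = value_iteration()
policy = greedy_policy(V)
print(policy)  # {'A': 'right', 'B': 'right', 'C': 'left'}
```

Stopping on a tolerance instead of a fixed 1000 sweeps is a design choice: it ends as soon as the values have settled, which for this small MDP happens after a few hundred sweeps.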

Q-Learning Algorithm

Q-Learning is a model-free reinforcement learning algorithm: it learns the value of taking each action in each state directly from experience, without needing the transition probabilities. It stores these estimates in a Q-table, with one entry per state-action pair. After each step, the entry for the action just taken is moved toward the observed reward plus the discounted maximum Q-value of the next state.
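
The Q-table update is commonly written as follows. The symbols are notation assumed here: α is the learning rate, γ the discount factor, r the observed reward, and s' the state reached after taking action a in state s.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \bigl[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \bigr]
```

The bracketed term is the temporal-difference error, the gap between the new estimate and the old one. For example, with Q(s, a) = 0, r = 1, α = 0.1, and all next-state values still 0, the entry moves from 0 to 0.1.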

import random

import numpy as np

# Define a toy environment. Rewards depend only on (state, action).
states = ['A', 'B', 'C']
actions = ['left', 'right']
rewards = {
    ('A', 'left'): 0,
    ('A', 'right'): 1,
    ('B', 'left'): 0.5,
    ('B', 'right'): 2,
    ('C', 'left'): 0,
    ('C', 'right'): 0
}
terminal_state = 'C'  # reaching C ends the episode

# Initialize Q-table: one row per state, one column per action
Q = np.zeros((len(states), len(actions)))

# Hyperparameters
alpha = 0.1    # learning rate
gamma = 0.9    # discount factor
epsilon = 0.1  # exploration rate for the epsilon-greedy policy

# Q-Learning algorithm
for episode in range(1000):
    # Start each episode in a non-terminal state
    state = random.choice([s for s in states if s != terminal_state])
    done = False
    while not done:
        s = states.index(state)
        # Epsilon-greedy action selection
        if random.uniform(0, 1) < epsilon:
            a = random.randrange(len(actions))
        else:
            a = int(np.argmax(Q[s, :]))
        action = actions[a]
        # Toy dynamics: jump to a random different state, whatever the action
        next_state = random.choice([x for x in states if x != state])
        reward = rewards[state, action]
        done = next_state == terminal_state
        # No future value is bootstrapped from the terminal state
        future = 0.0 if done else np.max(Q[states.index(next_state), :])
        Q[s, a] += alpha * (reward + gamma * future - Q[s, a])
        state = next_state

print('Learned Q-table:\n', Q)
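
To connect the two sections, the sketch below applies the same update to the three-state MDP from the value-iteration example. The agent only sees sampled transitions, never the probabilities themselves. The `step` helper, the uniformly random behavior policy, the step count, and the smaller learning rate are all illustrative assumptions, not part of the original module.

```python
import random

states = ['A', 'B', 'C']
actions = ['left', 'right']
transition_probs = {
    ('A', 'left'): {'A': 1.0},
    ('A', 'right'): {'B': 0.8, 'C': 0.2},
    ('B', 'left'): {'A': 0.6, 'B': 0.4},
    ('B', 'right'): {'C': 1.0},
    ('C', 'left'): {'B': 1.0},
    ('C', 'right'): {'C': 1.0},
}
rewards = {
    ('A', 'left', 'A'): 0, ('A', 'right', 'B'): 1, ('A', 'right', 'C'): -1,
    ('B', 'left', 'A'): 0.5, ('B', 'left', 'B'): -0.5, ('B', 'right', 'C'): 2,
    ('C', 'left', 'B'): 0, ('C', 'right', 'C'): 0,
}


def step(state, action, rng):
    """Hypothetical environment step: sample a next state, return it with its reward."""
    outcomes = transition_probs[state, action]
    next_state = rng.choices(list(outcomes), weights=list(outcomes.values()))[0]
    return next_state, rewards[state, action, next_state]


def q_learning(num_steps=200_000, alpha=0.02, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in states for a in actions}
    state = rng.choice(states)
    for _ in range(num_steps):
        # Behave uniformly at random. Q-learning is off-policy, so the
        # update still targets the greedy (max) action values.
        action = rng.choice(actions)
        next_state, reward = step(state, action, rng)
        target = reward + gamma * max(Q[next_state, a] for a in actions)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state  # this MDP has no terminal state, so keep going
    return Q


Q = q_learning()
for s in states:
    best = max(actions, key=lambda a: Q[s, a])
    print(s, best, round(Q[s, best], 2))
```

Because the learning rate stays constant, the estimates keep fluctuating slightly, but the best value in each state should land close to the value-iteration result for the same MDP (roughly 9.9, 10.5, and 9.5 for A, B, and C).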

💡 Tip: Tune the learning rate (alpha) and discount factor (gamma) so that learning is stable and values future rewards appropriately. The exploration rate (epsilon) is the parameter that balances exploration and exploitation.
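
In an epsilon-greedy agent, a common way to manage the exploration rate is to start high and reduce it over time. A minimal sketch follows. The `epsilon_greedy` and `decayed_epsilon` helpers and the decay constants are illustrative assumptions, not values prescribed by the module.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Shrink epsilon geometrically per episode, never going below `end`."""
    return max(end, start * decay ** episode)


print(epsilon_greedy([0.2, 1.5], epsilon=0.0))    # always greedy: prints 1
print(decayed_epsilon(0), decayed_epsilon(1000))  # prints 1.0 0.05
```

Early episodes then explore almost uniformly, while later episodes mostly exploit what the Q-table has learned.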

❓ What is the primary goal in a Markov Decision Process (MDP)?

❓ Which of the following best describes Q-Learning?
