Reinforcement Learning: Teaching AI to Make Decisions

Date: October 25, 2024
Tags: reinforcement-learning, ai, machine-learning, decision-making, autonomous-systems
Abstract: Discover how reinforcement learning enables AI systems to learn through trial and error, from basic Q-learning to advanced approaches powering autonomous systems, game AI, and robotics.

Introduction to Reinforcement Learning

Reinforcement Learning (RL) is a paradigm in which agents learn optimal behavior by interacting with their environment. Unlike supervised learning, which relies on labeled examples, RL learns through trial and error, receiving rewards or penalties for the actions it takes.

The RL Framework

Key Components:
- Agent: The learner or decision-maker (e.g., robot, game AI)
- Environment: The world the agent interacts with
- State (s): The current situation of the agent
- Action (a): The choices available to the agent
- Reward (r): Feedback from the environment
- Policy (π): The agent's strategy for selecting actions
- Value Function: The expected future reward from a state
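To make the reward and value components concrete, the snippet below computes a discounted return for a short, made-up reward sequence; the rewards and the γ value are illustrative assumptions, not taken from any real task.

```python
# Discounted return G = r_0 + γ·r_1 + γ²·r_2 + ... for an illustrative reward sequence
gamma = 0.9
rewards = [1.0, 0.0, 0.0, 5.0]  # Made-up rewards collected over four steps

G = sum((gamma ** t) * r for t, r in enumerate(rewards))
print(G)  # 1.0 + 0.9*0.0 + 0.81*0.0 + 0.729*5.0 = 4.645
```

The value function generalizes this idea: it is the expected discounted return when starting from a state and following the policy thereafter.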

Markov Decision Processes (MDPs)

RL problems are often formulated as MDPs, defined by:
- States: Complete description of the environment
- Actions: Available choices
- Rewards: Immediate feedback
- Transitions: Environment dynamics
- Discount Factor (γ): Weighting of future rewards

```python
# Example MDP representation
class MDP:
    def __init__(self, states, actions, transitions, rewards, gamma=0.99):
        self.states = states
        self.actions = actions
        self.transitions = transitions  # P(s'|s,a)
        self.rewards = rewards          # R(s,a,s')
        self.gamma = gamma

    def get_reward(self, state, action, next_state):
        return self.rewards.get((state, action, next_state), 0)

    def get_transition_prob(self, state, action, next_state):
        return self.transitions.get((state, action, next_state), 0)
```
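As a quick sanity check, here is a hypothetical two-state MDP built with this class; the states, transition probabilities, and rewards below are made up purely for illustration.

```python
# Hypothetical two-state MDP (all values are illustrative)
states = ['A', 'B']
actions = ['stay', 'move']
transitions = {
    ('A', 'move', 'B'): 1.0,  # P(B | A, move) = 1
    ('A', 'stay', 'A'): 1.0,
    ('B', 'move', 'A'): 1.0,
    ('B', 'stay', 'B'): 1.0,
}
rewards = {('A', 'move', 'B'): 5.0}  # Reward only for moving from A to B

mdp = MDP(states, actions, transitions, rewards, gamma=0.9)
print(mdp.get_reward('A', 'move', 'B'))           # 5.0
print(mdp.get_transition_prob('B', 'stay', 'B'))  # 1.0
```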

Core RL Algorithms

1. Q-Learning

Q-learning is a model-free, off-policy RL algorithm that learns the action-value function Q(s,a), the expected discounted return from taking action a in state s and acting greedily afterwards:

```python
import numpy as np

class QLearningAgent:
    def __init__(self, states, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.states = states
        self.actions = actions
        self.alpha = alpha    # Learning rate
        self.gamma = gamma    # Discount factor
        self.epsilon = epsilon # Exploration rate
        self.q_table = np.zeros((len(states), len(actions)))

    def choose_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.choice(self.actions)  # Explore
        else:
            state_idx = self.states.index(state)
            return self.actions[np.argmax(self.q_table[state_idx])]  # Exploit

    def update_q_value(self, state, action, reward, next_state):
        state_idx = self.states.index(state)
        action_idx = self.actions.index(action)
        next_state_idx = self.states.index(next_state)

        # Q-learning update rule
        max_next_q = np.max(self.q_table[next_state_idx])
        current_q = self.q_table[state_idx, action_idx]

        # Q(s,a) = Q(s,a) + α[R(s,a,s') + γ•max(Q(s',a')) - Q(s,a)]
        new_q = current_q + self.alpha * (reward + self.gamma * max_next_q - current_q)
        self.q_table[state_idx, action_idx] = new_q

# Example usage
agent = QLearningAgent(states=['S1', 'S2', 'S3'], actions=['left', 'right'])

# Training loop (assumes an `env` object like the one sketched below)
for episode in range(1000):
    state = env.reset()  # Reset to the initial state
    done = False

    while not done:
        action = agent.choose_action(state)
        # Take the action, observe the reward and next state
        next_state, reward, done = env.step(action)

        agent.update_q_value(state, action, reward, next_state)
        state = next_state
```
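The loop above assumes an environment object exposing reset() and step(); that object is not part of Q-learning itself. A minimal, hypothetical environment matching the three states and two actions used above might look like the following (the layout and rewards are illustrative assumptions):

```python
# Hypothetical 3-state chain environment for the agent above (illustrative only)
class SimpleChainEnv:
    def __init__(self):
        self.states = ['S1', 'S2', 'S3']
        self.state = 'S1'

    def reset(self):
        self.state = 'S1'
        return self.state

    def step(self, action):
        idx = self.states.index(self.state)
        # 'right' moves toward S3, 'left' moves back toward S1
        idx = min(idx + 1, 2) if action == 'right' else max(idx - 1, 0)
        self.state = self.states[idx]
        done = self.state == 'S3'      # Episode ends at the goal state
        reward = 1.0 if done else 0.0  # Reward only for reaching the goal
        return self.state, reward, done

env = SimpleChainEnv()
```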

2. SARSA (State-Action-Reward-State-Action)

SARSA is an on-policy RL algorithm: it updates Q(s,a) using the action the current (ε-greedy) policy actually takes in the next state, rather than the greedy maximum used by Q-learning:

```python
# Additional method for QLearningAgent (it reuses the same q_table, alpha, and gamma)
def sarsa_update(self, state, action, reward, next_state, next_action):
    state_idx = self.states.index(state)
    action_idx = self.actions.index(action)
    next_state_idx = self.states.index(next_state)
    next_action_idx = self.actions.index(next_action)

    current_q = self.q_table[state_idx, action_idx]
    next_q = self.q_table[next_state_idx, next_action_idx]

    # SARSA update: Q(s,a) += α[r + γ•Q(s',a') - Q(s,a)]
    new_q = current_q + self.alpha * (reward + self.gamma * next_q - current_q)
    self.q_table[state_idx, action_idx] = new_q

# SARSA learning loop (uses the same env interface as the Q-learning example)
max_steps = 100  # Cap on episode length
state = env.reset()
action = agent.choose_action(state)

for step in range(max_steps):
    next_state, reward, done = env.step(action)
    next_action = agent.choose_action(next_state)

    agent.sarsa_update(state, action, reward, next_state, next_action)

    state, action = next_state, next_action

    if done:
        break
```
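To make the difference between the two update rules concrete, the following compares a single Q-learning update with a single SARSA update on hand-picked numbers; all of the values are illustrative assumptions.

```python
# Illustrative numbers only: one Q-learning update vs. one SARSA update
alpha, gamma, reward = 0.1, 0.99, 1.0
current_q = 0.5
next_q_values = {'left': 0.2, 'right': 0.8}  # Q(s', ·) under the current table
next_action = 'left'                         # Action the epsilon-greedy policy actually chose

q_learning_target = reward + gamma * max(next_q_values.values())  # Uses the greedy max (0.8)
sarsa_target = reward + gamma * next_q_values[next_action]        # Uses the action taken (0.2)

print(current_q + alpha * (q_learning_target - current_q))  # ~0.629
print(current_q + alpha * (sarsa_target - current_q))       # ~0.570
```

Because SARSA's target reflects the exploratory action actually taken, it tends to learn more conservative values under an ε-greedy policy.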

3. Deep Q-Learning (DQN)

DQN combines a deep neural network with Q-learning, replacing the tabular Q(s,a) with a function approximator so the method scales to high-dimensional state spaces:

```python
import torch
import torch.nn as nn
import torch.optim as optim

class DQNAgent