Reinforcement Learning: Teaching AI to Make Decisions
Date: October 25, 2024
Tags: reinforcement-learning, ai, machine-learning, decision-making, autonomous-systems
Abstract: Discover how reinforcement learning enables AI systems to learn through trial and error, from basic Q-learning to advanced approaches powering autonomous systems, game AI, and robotics.
Introduction to Reinforcement Learning
Reinforcement Learning (RL) is a paradigm in which agents learn optimal behavior by interacting with their environment. Unlike supervised learning, which relies on labeled examples, RL learns through trial and error, receiving rewards or penalties for the actions it takes.
The RL Framework
Key Components:
- Agent: The learner or decision-maker (e.g., a robot or game AI)
- Environment: The world the agent interacts with
- State (s): The current situation of the agent
- Action (a): The choices available to the agent
- Reward (r): Feedback from the environment
- Policy (π): The agent's strategy for selecting actions
- Value Function: Expected future rewards
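These components interact in a simple loop: the agent observes a state, selects an action according to its policy, and receives a reward together with the next state. The sketch below illustrates that loop; the `env` object with a Gym-style `reset()`/`step()` interface and the `agent` with `choose_action`/`learn` methods are assumptions for illustration, not part of this article's code.

```python
# Minimal sketch of the agent-environment loop
# (hypothetical `env` and `agent`; interfaces assumed for illustration).
state = env.reset()
done = False
total_reward = 0
while not done:
    action = agent.choose_action(state)          # policy π selects an action
    next_state, reward, done = env.step(action)  # environment responds
    agent.learn(state, action, reward, next_state)
    total_reward += reward
    state = next_state
```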
Markov Decision Processes (MDPs)
RL problems are often formulated as MDPs:
- States: Complete description of the environment
- Actions: Available choices
- Rewards: Immediate feedback
- Transitions: Environment dynamics
- Discount Factor (γ): Weighting of future rewards
```python
# Example MDP representation
class MDP:
    def __init__(self, states, actions, transitions, rewards, gamma=0.99):
        self.states = states
        self.actions = actions
        self.transitions = transitions  # P(s'|s,a)
        self.rewards = rewards          # R(s,a,s')
        self.gamma = gamma              # Discount factor

    def get_reward(self, state, action, next_state):
        return self.rewards.get((state, action, next_state), 0)

    def get_transition_prob(self, state, action, next_state):
        return self.transitions.get((state, action, next_state), 0)
```
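One standard way to solve a small, finite MDP like this is value iteration, which repeatedly applies the Bellman optimality backup until the state values stop changing. The sketch below is illustrative only; it assumes an `mdp` instance of the class above with small, finite state and action sets.

```python
def value_iteration(mdp, tol=1e-6):
    # Illustrative sketch: V(s) <- max_a sum_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            best = max(
                sum(
                    mdp.get_transition_prob(s, a, s2)
                    * (mdp.get_reward(s, a, s2) + mdp.gamma * V[s2])
                    for s2 in mdp.states
                )
                for a in mdp.actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```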
Core RL Algorithms
1. Q-Learning
Q-Learning is a model-free, off-policy RL algorithm that learns the action-value function Q(s,a):
```python
import numpy as np

class QLearningAgent:
    def __init__(self, states, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.states = states
        self.actions = actions
        self.alpha = alpha      # Learning rate
        self.gamma = gamma      # Discount factor
        self.epsilon = epsilon  # Exploration rate
        self.q_table = np.zeros((len(states), len(actions)))

    def choose_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.choice(self.actions)  # Explore
        else:
            state_idx = self.states.index(state)
            return self.actions[np.argmax(self.q_table[state_idx])]  # Exploit

    def update_q_value(self, state, action, reward, next_state):
        state_idx = self.states.index(state)
        action_idx = self.actions.index(action)
        next_state_idx = self.states.index(next_state)

        # Q-learning update rule:
        # Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') - Q(s,a)]
        max_next_q = np.max(self.q_table[next_state_idx])
        current_q = self.q_table[state_idx, action_idx]
        new_q = current_q + self.alpha * (reward + self.gamma * max_next_q - current_q)
        self.q_table[state_idx, action_idx] = new_q
```
```python
# Example usage (assumes an `environment` object with a Gym-style step() method)
agent = QLearningAgent(states=['S1', 'S2', 'S3'], actions=['left', 'right'])

# Training loop
for episode in range(1000):
    state = 'S1'  # Reset to initial state
    done = False
    while not done:
        action = agent.choose_action(state)
        # Take action, observe reward and next state
        next_state, reward, done = environment.step(action)
        agent.update_q_value(state, action, reward, next_state)
        state = next_state
```
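To see the update rule in action, suppose α = 0.1, γ = 0.99, the current estimate Q(s,a) = 0.5, the reward received is 1.0, and the best next-state value is max_a' Q(s',a') = 0.8 (all values made up for illustration):

```python
alpha, gamma = 0.1, 0.99
current_q, reward, max_next_q = 0.5, 1.0, 0.8  # Hypothetical values

# TD target and TD error
td_target = reward + gamma * max_next_q   # 1.0 + 0.99 * 0.8 = 1.792
td_error = td_target - current_q          # 1.792 - 0.5 = 1.292

new_q = current_q + alpha * td_error      # 0.5 + 0.1 * 1.292 ≈ 0.6292
```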
2. SARSA (State-Action-Reward-State-Action)
SARSA is an on-policy RL algorithm: it uses the same policy both to select actions and to compute the learning target:
```python
# Additional method for the agent class (e.g., the QLearningAgent above)
def sarsa_update(self, state, action, reward, next_state, next_action):
    state_idx = self.states.index(state)
    action_idx = self.actions.index(action)
    next_state_idx = self.states.index(next_state)
    next_action_idx = self.actions.index(next_action)

    current_q = self.q_table[state_idx, action_idx]
    next_q = self.q_table[next_state_idx, next_action_idx]

    # SARSA update: Q(s,a) ← Q(s,a) + α[r + γ·Q(s',a') - Q(s,a)]
    new_q = current_q + self.alpha * (reward + self.gamma * next_q - current_q)
    self.q_table[state_idx, action_idx] = new_q
```
```python
# SARSA learning loop (assumes `env` with Gym-style reset()/step() methods)
max_steps = 1000  # Illustrative episode length
state = env.reset()
action = agent.choose_action(state)
for step in range(max_steps):
    next_state, reward, done = env.step(action)
    next_action = agent.choose_action(next_state)
    agent.sarsa_update(state, action, reward, next_state, next_action)
    state, action = next_state, next_action
    if done:
        break
```
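The only difference from Q-learning is the bootstrap target: Q-learning bootstraps from the greedy action in the next state, while SARSA bootstraps from the action the current (ε-greedy) policy actually takes. Side by side, reusing names from the snippets above in standalone form purely for illustration:

```python
# Q-learning (off-policy): target uses the best next action
q_learning_target = reward + gamma * np.max(q_table[next_state_idx])

# SARSA (on-policy): target uses the action actually selected next
sarsa_target = reward + gamma * q_table[next_state_idx, next_action_idx]
```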
3. Deep Q-Learning (DQN)
DQN combines deep neural networks with Q-learning to handle high-dimensional state spaces:
```python
import torch
import torch.nn as nn
import torch.optim as optim

class DQNAgent