This repo explores different approaches to reinforcement learning, covering both policy evaluation and control problems.
See the examples for more detail.
The following environments are implemented.
Implemented under `Maze2D`: the agent navigates a maze with blocked cells (stepping up, down, left, or right), starting at one corner (0, 0) and aiming to reach the terminal state at (height-1, width-1) in the shortest amount of time. Reaching a non-terminal state is rewarded with -1 (the terminal reward is 1).
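As a rough illustration of these dynamics (not the repo's actual API; the function and variable names below are assumptions), a single step could look like this:

```python
# Hypothetical sketch of the Maze2D step logic described above.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def maze_step(position, action, blocked, height, width):
    """Apply one move; blocked or out-of-bounds moves leave the agent in place."""
    dr, dc = MOVES[action]
    row, col = position[0] + dr, position[1] + dc
    if 0 <= row < height and 0 <= col < width and (row, col) not in blocked:
        position = (row, col)
    done = position == (height - 1, width - 1)
    reward = 1 if done else -1  # -1 for every non-terminal state, 1 at the goal
    return position, reward, done

# Example: one step down from the start of a 3x3 maze with a single blocked cell.
state, reward, done = maze_step((0, 0), "down", blocked={(1, 1)}, height=3, width=3)
```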
Implemented under `LineWorld`: a linear state space with left/right steps as the actions in each state. Stepping into a non-terminal state is negatively rewarded, so the aim is to reach the terminal state at one end-point in the shortest amount of time.
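A minimal sketch of this transition logic, assuming states indexed 0..n-1 with the terminal state at index 0 (the repo's implementation details may differ):

```python
# Illustrative LineWorld step; action is -1 (left) or +1 (right).
def line_step(state, action, n_states):
    next_state = min(max(state + action, 0), n_states - 1)  # stay inside the line
    done = next_state == 0               # assumed terminal end-point
    reward = 0 if done else -1           # non-terminal steps are penalized
    return next_state, reward, done

state, reward, done = line_step(state=3, action=-1, n_states=5)  # -> (2, -1, False)
```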
Implemented under `QueueAccessControl`: the environment models a set of servers and a FIFO queue of customers applying for server access. At each step, we decide whether the customer at the head of the queue is accepted (given an unlocked server) or rejected. Each customer has a priority, and for each accepted customer we receive a reward proportional to their priority. At each step, the in-use servers unlock with a fixed probability. The environment state holds the number of free servers and the current customer's priority. At environment reset, a fixed-size queue is built by picking customers randomly according to per-priority rates.
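The following sketch mirrors these dynamics under assumed names and a simplified interface; it is not the repo's exact `QueueAccessControl` class:

```python
import random

class QueueSketch:
    """Simplified queue access-control environment following the description above."""

    def __init__(self, n_servers, queue_len, priorities, rates, unlock_prob):
        self.n_servers = n_servers
        self.queue_len = queue_len
        self.priorities = priorities      # e.g. [1, 2, 4, 8]
        self.rates = rates                # sampling weight per priority
        self.unlock_prob = unlock_prob

    def reset(self):
        self.free = self.n_servers
        self.queue = random.choices(self.priorities, weights=self.rates, k=self.queue_len)
        return (self.free, self.queue[0])

    def step(self, accept):
        customer = self.queue.pop(0)
        reward = 0
        if accept and self.free > 0:
            self.free -= 1
            reward = customer             # reward proportional to the customer's priority
        # each busy server unlocks independently with a fixed probability
        busy = self.n_servers - self.free
        self.free += sum(random.random() < self.unlock_prob for _ in range(busy))
        done = len(self.queue) == 0
        state = (self.free, self.queue[0] if not done else None)
        return state, reward, done
```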
The following policies are implemented:
- `ConstantPolicy`: play the same action regardless of state.
- `RandomPolicy`: select an action at random.
- `TabularGreedy` and `TabularEpsilonGreedy`: based on state-action values, select the action with the highest value (or explore, in the $\varepsilon$-greedy case).
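For example, $\varepsilon$-greedy selection over tabular state-action values can be sketched as follows (the repo's `TabularEpsilonGreedy` may be organized differently):

```python
import random

def epsilon_greedy(q_values, state, actions, epsilon):
    """Explore with probability epsilon, otherwise pick the highest-valued action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values[(state, a)])
```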
The following value learning and control algorithms are tested:
- `MonteCarlo`: tabular, on-policy Monte Carlo learning. We run an episode and record the state-action returns, i.e., the sum of rewards after the first occurrence of each state-action pair. The state-action values are the mean of the returns observed so far (updated after each episode). The simulated episode uses epsilon-greedy action selection based on the state-action values computed so far (see the sketch after this list).
- `SARSA`: tabular, on-policy SARSA learning. The state-action values are updated after every step according to the rule

  $$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \, Q(s', a') - Q(s, a) \right]$$

  where
  - $s$ is the current state and $a$ is the action selected by the $\varepsilon$-greedy policy,
  - $r$ is the reward, and
  - $s'$ is the resulting state and $a'$ is the next selected action.
  The hyperparameters are the learning rate $\alpha$, the discount factor $\gamma$, and the exploration rate $\varepsilon$ of the greedy policy.
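For concreteness, here are minimal tabular sketches of both update rules; the function names, data structures, and `(state, action, reward)` episode format are assumptions, not the repo's exact implementation.

```python
from collections import defaultdict

def monte_carlo_update(episode, q_values, returns):
    """First-visit Monte Carlo update from one episode of (state, action, reward) tuples."""
    rewards = [r for _, _, r in episode]
    first_visit = {}
    for t, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), t)          # index of the first occurrence
    for (s, a), t in first_visit.items():
        g = sum(rewards[t:])                       # return after the first occurrence
        returns[(s, a)].append(g)
        q_values[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])

def sarsa_update(q_values, s, a, r, s_next, a_next, alpha, gamma):
    """One SARSA step: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    target = r + gamma * q_values[(s_next, a_next)]
    q_values[(s, a)] += alpha * (target - q_values[(s, a)])

# Both operate on tabular values, e.g.:
q_values, returns = defaultdict(float), defaultdict(list)
```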