12.2. Introduction to Reinforcement Learning
12.2.1. Background
As a branch of machine learning, reinforcement learning has attracted increasing attention in recent years. DeepMind proposed deep Q-learning in 2013, enabling an agent to learn to play video games directly from screen images. Since then, research institutions, with DeepMind at the forefront, have made remarkable achievements in reinforcement learning. A representative example is AlphaGo, which defeated the world's top Go player Lee Sedol in 2016. Other significant achievements include AlphaStar (a StarCraft II agent), OpenAI Five (a Dota 2 agent), Pluribus (an agent for Texas hold'em poker, a multi-player zero-sum game), and motion-control algorithms for robot dogs. These achievements have been made possible by the rapid iteration and progress of reinforcement learning algorithms over the past few years. Data-hungry deep neural networks can fit well to the large amounts of data generated by simulators, thereby fully leveraging the capabilities of reinforcement learning algorithms and allowing agents to perform comparably to, or even better than, human experts. Although it first drew wide attention in video gaming, reinforcement learning has since been gradually applied in a wider range of realistic and meaningful fields, including robot control, dexterous manipulation, energy system scheduling, network load distribution, and automated trading of stocks and futures. Such applications have had an impact on traditional control methods and heuristic decision-making theory.
12.2.2. Reinforcement Learning Components
The core of reinforcement learning is a process of continuous
interaction with the environment, in which the policy is optimized
so as to maximize the reward. This process manifests as the
selection of an action based on the current state. The object that
makes the decision is called an agent, and the impact of the decision
is reflected in the environment. More specifically, the state
transition and reward in the environment vary depending on the
decision. State transition, which can be either deterministic or
stochastic, is a function that specifies the environment’s transition
from the current state to the next state. A reward, which is generally a
scalar, is the feedback of the environment on the agent’s action.
Fig. 12.2.1 shows this abstract process, which is the most common
model description of reinforcement learning in the literature.
Fig. 12.2.1 Framework of reinforcement learning
Take video gaming as an example. A gamer needs to gradually become familiar with the game controls in order to achieve better results. The process from getting started with the game to gradually mastering its skills is similar to the reinforcement learning process. At any given moment after the game starts, the game is in a specific state. From this state the gamer obtains an observation (e.g., the images on the screen of the game console), based on which the gamer performs an action (e.g., firing bullets) that changes the game state and moves the game to the next state (e.g., a state in which a monster has been defeated). The gamer also receives feedback on the effect of the current action (e.g., defeating a monster yields a positive score, whereas being defeated by a monster yields a negative score). The gamer then selects a new action based on the observation of the next state, and repeats this process until the game ends. Through these repeated actions and observations, the gamer gradually masters the skills of the game. A reinforcement learning agent learns to play the game in a similar way.
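The interaction loop described above can be sketched in a few lines of Python. The following is a minimal sketch, assuming an environment object with a Gymnasium-style reset/step interface and using a placeholder random agent; the names RandomAgent and run_episode are illustrative and not part of the original text.

```python
import random


class RandomAgent:
    """Placeholder agent: picks an action uniformly at random.

    A real agent would instead select actions according to a learned policy.
    """

    def __init__(self, actions):
        self.actions = list(actions)

    def act(self, observation):
        # The observation is ignored here; a learned policy would use it.
        return random.choice(self.actions)


def run_episode(env, agent):
    """Roll out one episode: observe, act, receive a reward, repeat until the episode ends."""
    observation, _ = env.reset()          # initial state -> initial observation
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(observation)   # select an action based on the observation
        observation, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward            # accumulate the scalar feedback (reward)
        done = terminated or truncated
    return total_reward
```

For instance, with the Gymnasium package installed, run_episode(gymnasium.make("CartPole-v1"), RandomAgent(actions=[0, 1])) would return the total reward of one episode played by the random placeholder agent.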
However, several key points in this process should be noted. (1) The observation may not be equal to the state. It is generally a function of the state, and the mapping from the state to the observation may cause information loss. The environment is fully observable if the observation is equal to the state or if the state of the environment can be completely recovered from the observation; in all other cases, it is partially observable. (2) An action performed by the gamer may not produce immediate feedback; its effect may instead appear only after many steps. Reinforcement learning models allow for such delayed feedback. (3) In human learning, the feedback is not necessarily a scalar. In reinforcement learning, the feedback received by the agent is mathematically abstracted into a scalar called the reward value. The reward value can be a function of the state, or a function of the state and action. The existence of the reward value is a basic assumption of reinforcement learning, and is also a major difference between reinforcement learning and supervised learning.
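To make the distinction between state and observation in point (1) concrete, here is a minimal sketch in which the observation is a lossy function of the underlying game state; the GameState fields and the observe function are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class GameState:
    """Full internal state of the environment (not directly visible to the agent)."""
    player_hp: int
    monster_hp: int
    monster_position: tuple  # hidden information, e.g. a monster lurking off-screen


def observe(state: GameState) -> dict:
    """Lossy mapping from state to observation: off-screen details are dropped,
    so the environment is only partially observable from the agent's viewpoint."""
    return {"player_hp": state.player_hp, "monster_hp": state.monster_hp}


obs = observe(GameState(player_hp=100, monster_hp=40, monster_position=(7, 3)))
print(obs)  # {'player_hp': 100, 'monster_hp': 40} -- the monster's position is lost
```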
12.2.3. Markov Decision Process
In reinforcement learning, the decision-making process is generally described as a Markov decision process 1 and can be represented by a tuple \((\mathcal{S}, \mathcal{A}, R, \mathcal{T}, \gamma)\). \(\mathcal{S}\) and \(\mathcal{A}\) denote the state space and action space, respectively. \(R: \mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}\) is the reward function, where \(R(s,a)\) is the reward received for taking action \(a\in\mathcal{A}\) in state \(s\in\mathcal{S}\). \(\mathcal{T}: \mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow \mathbb{R}_+\) is the state transition function, where \(\mathcal{T}(s^\prime|s,a)\) is the probability of transitioning to the next state \(s^\prime\) given the current state \(s\) and action \(a\). \(\gamma\in(0,1)\) is the discount factor 2 for the reward. Reinforcement learning aims to maximize the expected cumulative discounted reward \(\mathbb{E}[\sum_t \gamma^t r_t]\) received by the agent.
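As a small numerical illustration of the objective \(\mathbb{E}[\sum_t \gamma^t r_t]\), the following sketch computes the discounted return of a single sampled reward sequence; the reward values and the discount factor below are made-up numbers for illustration only.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a single trajectory of rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


rewards = [0.0, 0.0, 1.0, -0.5, 2.0]           # r_0 ... r_4 from one episode
print(discounted_return(rewards, gamma=0.9))   # 0.9**2*1.0 + 0.9**3*(-0.5) + 0.9**4*2.0 = 1.7577
```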
The Markov property in a Markov decision process is defined as follows:
\[\mathcal{T}(s_{t+1}\mid s_t, s_{t-1}, \ldots, s_0) = \mathcal{T}(s_{t+1}\mid s_t).\]
That is, the transition to the next state depends only on the current state, not on earlier historical states. Action \(a\) is omitted from the state transition function \(\mathcal{T}\) here because the Markov property concerns the environment's transition process and is independent of the decision process.
Based on the Markov property, we can further deduce that the optimal policy at any given moment depends only on the current state rather than on the entire history of states and decisions. This conclusion is of great significance for the design of reinforcement learning algorithms because it simplifies the search for the optimal policy.