Deep Learning vs Atari: train your AI to dominate classic videogames (Part II)

Friday, June 22, 2018

Written by Enrique Blanco (CDO Researcher) and Fran Ramírez (Security Researcher at Eleven Paths)

In this article, the second about our experiment using Reinforcement Learning (RL) and Deep Learning in OpenAI environments, we continue on from the previous post that you can read here if you haven't done so already. This post presents the results obtained after training our agent in the Breakout-v0 and SpaceInvaders-v0 environments. Before continuing, you may want to also catch up on our recent webinar in which we went into more detail about the results you will read about in this blog.


Reinforcement Learning (RL) is the area of Machine Learning that we use here to train agents to play videogames in OpenAI Gym environments. It provides an agent with algorithms for examining and understanding the environment it works in, so that it can pursue an objective in exchange for a reward. Through trial and error, these algorithms teach the agent to maximize the reward it can obtain based on the variables it observes in the game, all without human intervention.

Below, we briefly define some of the common concepts of Reinforcement Learning (RL):

  • Environment: the game in which the agent must act and learn.
  • Reward: the incentive that the agent obtains after carrying out a given action. In the case of Breakout-v0, the agent receives a positive reward when it manages to return the ball and destroy one of the bricks.
  • State: usually a tensor obtained from the observation space of the environment. In this case, the states consist of a collection of images that are preprocessed to help train the model.
  • Action: a possible move in the action space that the agent can carry out, based on the current game state or on the previous states it has observed. In our case, the actions are to move left, move right, stay still, or launch the ball.
  • Control policy: this determines how the agent chooses the action it will take. The programmer chooses the control policy when setting up the training of the neural network. Typically, the agent starts by choosing actions at random and, once the model has trained sufficiently, it acts according to the highest value that the model has estimated up to that point.
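
A control policy of this kind is often written as an epsilon-greedy rule. The following is a minimal sketch rather than the code used in this project; the function name `choose_action` and the `epsilon` parameter are assumptions introduced for illustration.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice (hypothetical helper, not from the project).

    q_values: list with one estimated value per action.
    epsilon:  probability of taking a random action instead of the best one.
    """
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# With epsilon = 0 the agent always exploits its current estimates.
best = choose_action([0.1, 0.9, 0.3], epsilon=0.0)  # index 1
```

In practice `epsilon` usually starts near 1 and is reduced as training progresses, which matches the "random first, value-based later" behavior described above.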

Figure 1: Diagram showing the learning process of an agent during the training.

Beginning the Training

The algorithm used in this article, which we explain in the following sections, aims to maximize the reward at each step. The agent receives images of the game environment and feeds them to a neural network, which estimates the best action to take based on that input. We use the TensorFlow library to build the architecture of the deep network and to carry out the relevant calculations.

The values of the actions that the model estimates from a given input are normally referred to as Q-Values. When an agent knows these values beforehand, it only has to select the action with the highest Q-Value for each game state that it observes. However, these Q-Values have to be learned through an extensive training process, because of the large number of possible states that can occur.

Control Policy

At the start of training, the values of all actions are zero, so the agent takes random actions in the game. Each time an action returns a positive reward (destroying a brick), the weights and biases of the model's layers are updated, which means that the estimation of Q-Values becomes increasingly refined.

When a deep neural network is used to approximate the mapping between states and actions, Reinforcement Learning techniques are often quite unstable. This is due to the nonlinearity of neural networks and to the fact that, under an inappropriate control policy, small changes in Q-Values can drastically change the chosen action and therefore lead to very different game states.

For these reasons, and to reduce the instabilities that could arise during training, one usually stores a large number of states, actions and rewards and trains on random samples drawn from them. This exposes the model to as wide a range of situations as possible and helps to avoid divergence and stalls in the model's training.
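
This technique is commonly known as experience replay. The sketch below shows one simple way to store and sample transitions; the class name `ReplayMemory`, its methods and the layout of each transition are assumptions made for illustration, not the project's actual implementation.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sample without replacement, which breaks the
        # correlation between consecutive frames.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Toy usage with integers standing in for game states.
memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.add(step, 0, 1.0, step + 1, False)
batch = memory.sample(2)
```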


The objective of the agent is to interact with the emulator with the intention of learning which action to take in a given game state – or set of game states – in order to maximize the reward of said action.

The function that gives the value of taking a certain action in a certain game state, and from which the optimum action follows, is defined as:

Q(s,a) = reward(s,a)+γ · max(Q(s’,a’))

This function is known as the Bellman Equation. It shows that the value of the Q function for a given state s and action a equals the immediate reward for taking action a in state s, plus the maximum expected value over the actions a’ available in the next state s’, corrected by a discount factor γ∈[0,1].

This discount hyperparameter allows us to decide how important future rewards are in relation to the current reward. Values of γ close to 1 are better suited to Breakout, because rewards are not obtained immediately after an action: several subsequent actions may take place before it becomes clear whether the initial action was successful or not. In other words, it takes several frames after bouncing the ball for a brick to break.
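
To make the effect of the discount factor concrete, the following sketch computes the discounted sum of a sequence of rewards. The function name `discounted_return` and the toy reward sequence are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A brick breaks three frames after the paddle hits the ball.
delayed = [0.0, 0.0, 0.0, 1.0]
far_sighted = discounted_return(delayed, gamma=0.99)   # about 0.97
short_sighted = discounted_return(delayed, gamma=0.5)  # 0.125
```

With γ = 0.99 the delayed reward keeps almost all of its value, whereas with γ = 0.5 the agent would barely credit the paddle hit that caused it.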


The Loss and Optimization Functions

Given the large number of frames per second to process, and the elevated dimensionality of the game states, it is impractical to directly map the causality between action and state. This forces us to approximate the Q function through our random sample of states, rewards and actions.

Usually, the loss function chosen is the mean squared error between the Q-Values predicted by our model and the target Q-Values given by the Bellman Equation. For a single transition:

loss = (reward(s,a) + γ · max(Q(s’,a’)) − Q(s,a))²
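
As an illustration, the sketch below computes the squared difference between the Bellman target and the predicted Q-Value, and averages it over a batch. The function names, the `done` flag for finished episodes and the batch layout are assumptions introduced here, not part of the original project.

```python
def q_target(reward, next_q_values, gamma, done=False):
    """Bellman target: reward + gamma * max over a' of Q(s', a')."""
    if done:
        # No future reward once the episode has ended (assumed convention).
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_error(q_value, reward, next_q_values, gamma, done=False):
    """Squared gap between the target and the predicted Q(s, a)."""
    error = q_target(reward, next_q_values, gamma, done) - q_value
    return error ** 2


def mean_squared_loss(batch, gamma):
    """batch: list of (q_value, reward, next_q_values, done) tuples."""
    errors = [squared_td_error(q, r, nq, gamma, d) for q, r, nq, d in batch]
    return sum(errors) / len(errors)


# Toy batch with two transitions and two possible actions.
toy_batch = [(0.5, 1.0, [0.2, 0.4], False), (0.5, 1.0, [0.2, 0.4], True)]
loss = mean_squared_loss(toy_batch, gamma=0.9)
```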

To find the minimum of this function, one can use the iterative optimization algorithm "Gradient Descent". This algorithm calculates the gradient of the loss function with respect to each weight and moves the weights in the direction that reduces the loss. However, finding the minimum of a nonlinear function can be complicated, especially because the search can get stuck in a local minimum rather than the global minimum that we want, or spend many iterations on a flat part of the curve.
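
The update rule itself is short. The following sketch applies plain gradient descent to a one-dimensional toy function rather than to a neural network; the function name `gradient_descent`, the learning rate and the number of steps are assumptions chosen for illustration.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly move w against the gradient of the function being minimized."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w


# f(w) = (w - 3)^2 has gradient 2 * (w - 3) and a single minimum at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

This toy function is convex, so the iteration converges to its only minimum. The loss of a deep network has no such guarantee, which is where the local minima and flat regions mentioned above come into play.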

Optimizing a neural network is a complicated task that depends heavily on the quality and quantity of the data with which the model trains. The difficulty also stems from the architecture of the network: the more layers it has and the greater their dimensionality, the more weights and biases have to be optimized.


Pre-Processing Input Data

Given the long computing times required, one of the main deciding factors in training the model well is the pre-processing of the image and the nature of the input to the neural network. This also directly affects the routines that one needs to develop for interacting with the environment. In general, it is advisable to process the image generated by the Gym environment before it is fed to the model. The aim is to reduce its dimensionality by eliminating information that would not be useful when training the neural network. The main candidate is the color information that OpenAI Gym provides in its three color channels. These channels do not contain valuable information for the training of our model, and are therefore discarded before the states are passed to the model.

The images returned by the OpenAI Gym environment are arrays of 210x160 pixels with three RGB channels, that is, 210 x 160 x 3 = 100,800 values per frame. Storing many such frames quickly increases memory usage. It is therefore very important to preprocess the images in order to reduce the dimensions of the inputs, eliminate unnecessary information and reduce memory usage.

The tests carried out in this project are based on two approaches to the processing of images:

  • In the first approach, we take images of the game environment and process them: converting them to greyscale, resizing them, removing the background and applying a simple image filter to detect movement. The state that results from these steps contains the latest image of the environment as well as recent traces of the movement of the objects.
  • In the second approach, we use a stack of four images as the input, with the intention of allowing the model to learn to detect movement. This is necessary since an individual frame offers little information about the velocity and direction of the ball and paddle.
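
The second approach can be sketched with a fixed-length queue of frames. This is an illustrative outline using NumPy, not the project's code; the class name `FrameStack`, the frame size and the choice to fill the stack with copies of the first frame at the start of an episode are all assumptions.

```python
from collections import deque

import numpy as np


class FrameStack:
    """Keeps the most recent frames and exposes them as a single state."""

    def __init__(self, num_frames=4):
        # The oldest frame is dropped automatically when a new one arrives.
        self.frames = deque(maxlen=num_frames)

    def reset(self, frame):
        # Assumed convention: start an episode with copies of the first frame.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return self.state()

    def push(self, frame):
        self.frames.append(frame)
        return self.state()

    def state(self):
        # Resulting shape: (height, width, num_frames).
        return np.stack(list(self.frames), axis=-1)


# Toy usage with blank frames of a hypothetical preprocessed size.
stack = FrameStack(num_frames=4)
state = stack.reset(np.zeros((105, 80), dtype=np.uint8))
state = stack.push(np.ones((105, 80), dtype=np.uint8))
```

Because consecutive frames sit side by side in the last axis, the network can compare them and infer the direction and speed of the ball.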

We are only interested in the area of the game where the ball and paddle move and where the bricks are. The borders of the screenshots do not offer valuable information to the model, so we eliminate these areas. Furthermore, we reduce the resolution of the image by 50% and convert it to black and white (a binary scale), since the RGB channels also offer little information of interest.
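
These steps can be sketched with NumPy as follows. The crop bounds, the threshold and the use of simple subsampling to halve the resolution are assumptions made for illustration; the values used in the project may differ.

```python
import numpy as np


def preprocess(frame, top=32, bottom=195, left=8, right=152, threshold=0):
    """Crop, downsample and binarize one RGB frame of shape (210, 160, 3).

    All default bounds are hypothetical and would need tuning per game.
    """
    gray = frame.mean(axis=2)                # collapse the three color channels
    cropped = gray[top:bottom, left:right]   # drop the borders of the screen
    small = cropped[::2, ::2]                # keep every second pixel (50% size)
    return (small > threshold).astype(np.uint8)  # 1 = object, 0 = background


# Toy frame: black background with a single bright pixel.
frame = np.zeros((210, 160, 3), dtype=np.uint8)
frame[100, 50] = 200
binary = preprocess(frame)
```

The output holds one value of 0 or 1 per pixel, which is far smaller than the original three-channel frame.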

In the next post, we will describe the architecture of the model with which we have trained our agents in Breakout-v0 and SpaceInvaders-v0. We will also explain the logic of the training in greater detail, cover the testing phase, and offer some conclusions about the project.
