Machine Learning: Reinforcement Learning


Reinforcement Learning is one of the hottest topics in Machine Learning, and also one of the oldest: the first studies date back to the 1950s. 

In 2013 a British startup, DeepMind, showed that it was possible to create a system capable of learning to play Atari games from scratch. 


DeepMind's masterpiece, however, remains AlphaGo: a Reinforcement Learning system that in 2017 beat Ke Jie, then the world champion of Go, a complex Chinese board game with more possible positions than there are atoms in the observable Universe. This result was achieved by applying the power of neural networks to the field of Reinforcement Learning. 

In nature, learning is a process of exploration and environmental interaction necessary to obtain rewards. 

This simple paradigm is captured by Reinforcement Learning and coded into systems that can be executed by artificial machines. 

Reinforcement Learning is an area of study or, more technically, a class of machine learning systems with a typical structure and way of functioning. 

Let’s say, therefore, that in a Reinforcement Learning system a software agent makes observations of an environment, performs actions in it, and receives rewards in exchange. 

The agent’s goal is to maximize the reward in the long run. 

The reward can also be negative, i.e. a penalty, in which case it is called a negative reward. 
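This loop, in which an agent observes, acts and collects rewards (positive or negative), can be sketched in a few lines of Python. The `CoinFlipEnv` environment and the random agent below are hypothetical toy examples, not part of any real library:

```python
import random

class CoinFlipEnv:
    """A toy environment: the agent tries to guess the next coin flip.
    A correct guess gives a reward of +1, a wrong one a penalty of -1."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, action):
        outcome = self.rng.choice(["heads", "tails"])
        return 1 if action == outcome else -1  # negative reward = penalty

env = CoinFlipEnv()
total_reward = 0
for _ in range(100):
    action = random.choice(["heads", "tails"])  # a purely random agent
    total_reward += env.step(action)

print(total_reward)
```

A random agent, of course, cannot do better than chance here; the point of Reinforcement Learning is precisely to replace that random choice with one that maximizes the reward in the long run.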

Some examples of Reinforcement Learning: 


  • A walking robot.

In this case the agent is the software in charge of controlling the machine, which observes the real world through a multitude of sensors, obtaining a reward if it reaches its goal and a penalty (i.e. a negative reward) if it wastes time on useless actions (e.g. heading in the wrong direction or falling). 


  • A Smart Thermostat.

An agent does not necessarily have to control the movement of a physical (or virtual) device. 

For example, the Google Nest thermostat, during its first weeks of use, tunes a machine learning model (more precisely, a reinforcement learning model), adapting to the user’s needs. 

In this case the positive reward is triggered by setting a comfortable temperature while reducing energy consumption, and the negative one by a corrective human intervention (i.e. a wrong temperature). 

In this case, the agent must therefore anticipate human needs. 


  • FinTech

The financial sector, and the stock trading industry in particular, welcomes these systems, because they could assist brokers and day traders by observing share prices and deciding when and how much to buy or sell. 

The rewards here are triggered according to the profit or loss margins. 
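In this setting a reward tied to profit and loss can be expressed very directly. As a minimal sketch, the hypothetical `trading_reward` function below computes the reward for a long position from the price change:

```python
def trading_reward(position, price_before, price_after):
    """Reward proportional to the profit (or loss) realised on a position.
    position > 0 means shares held; a falling price yields a penalty."""
    return position * (price_after - price_before)

# A profitable move triggers a positive reward...
assert trading_reward(10, 100.0, 103.5) == 35.0
# ...while a loss triggers a negative reward, i.e. a penalty.
assert trading_reward(10, 100.0, 98.0) == -20.0
```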


In all these cases, however, you may have noticed that a fundamental piece of the logic is still missing. 

The agent performs actions in an environment and receives in return rewards that it must maximize. 


But how does the agent know which action to take at any given moment, based on the observations it has made? It uses an algorithm to determine which actions to perform; this algorithm is called the policy. 

A policy can take the form of a neural network: it takes the observations as input and outputs the action to be taken.  
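A policy does not have to be a neural network, though: any mapping from observations to actions qualifies. Going back to the thermostat example, a hypothetical hand-written policy might look like this:

```python
def thermostat_policy(observation):
    """A hand-written policy: maps an observation (current and target
    temperature, in degrees C) to one of three possible actions."""
    current, target = observation
    if current < target - 0.5:
        return "heat"
    if current > target + 0.5:
        return "cool"
    return "idle"

assert thermostat_policy((18.0, 21.0)) == "heat"
assert thermostat_policy((23.0, 21.0)) == "cool"
assert thermostat_policy((21.2, 21.0)) == "idle"
```

What Reinforcement Learning adds is a way to *learn* such a mapping from rewards, instead of writing the rules by hand.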



Q-learning is one of the most well-known reinforcement learning algorithms. 

It belongs to the family of temporal-difference (TD) algorithms, used when the model of the environment is incomplete or unknown (so-called model-free settings). 

Q-learning is a value-based learning algorithm: it focuses on optimizing a value function suited to the environment or problem at hand. 

Its goal is to allow a machine learning system to adapt to the environment around it by improving its choice among the possible actions. To achieve this, it works by trying to maximize the value Q, an estimate of the reward to be earned. 

The model stores all the values in a table, the Q-table, in which the rows represent all possible observations (states) and the columns all possible actions. During training, the cells are filled with values representing the expected rewards. 
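A Q-table is just a states-by-actions grid of numbers. As a minimal sketch, reusing the hypothetical thermostat scenario from above:

```python
states = ["cold", "ok", "hot"]         # rows: the possible observations
actions = ["heat", "idle", "cool"]     # columns: the possible actions

# The Q-table starts at zero and is filled during training with
# estimates of the expected reward for each (state, action) pair.
q_table = {s: {a: 0.0 for a in actions} for s in states}

q_table["cold"]["heat"] = 1.0  # e.g. a value learned during training

# Acting greedily means picking the action with the highest Q-value.
best_action = max(q_table["cold"], key=q_table["cold"].get)
print(best_action)  # → heat
```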

The Q-learning algorithm is described by an agent (technically, the AI model) interacting with an environment, a set of states S and a set of actions A available in each state. 

The agent performs actions that alter the environment, generating a new state and obtaining a reward, which can be negative or positive depending on the effect of the action and on the desired result. 

By performing actions and collecting rewards, the agent devises an optimal strategy to maximize the reward over time. This strategy is the policy mentioned above: a mathematical function whose parameters are optimized during training. 
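The whole procedure can be sketched end to end, assuming the standard Q-learning update Q(s,a) ← Q(s,a) + α·[r + γ·max Q(s',a') - Q(s,a)] and a hypothetical five-state corridor in which only reaching the rightmost state gives a reward:

```python
import random

# A tiny corridor MDP: states 0..4, actions 0 (left) / 1 (right).
# Reaching state 4 gives reward +1 and ends the episode.
N_STATES, ACTIONS = 5, (0, 1)
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # learning rate, discount, exploration

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

q = [[0.0, 0.0] for _ in range(N_STATES)]   # the Q-table, one row per state
rng = random.Random(0)

for _ in range(500):                        # training episodes
    state, done = 0, False
    while not done:
        # epsilon-greedy: mostly exploit the Q-table, sometimes explore
        if rng.random() < EPSILON:
            action = rng.choice(ACTIONS)
        else:
            action = 0 if q[state][0] > q[state][1] else 1
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s,a) toward r + gamma * max Q(s',a')
        target = reward + GAMMA * max(q[next_state])
        q[state][action] += ALPHA * (target - q[state][action])
        state = next_state

# After training, the greedy policy in every state is "go right" (1).
print([0 if row[0] > row[1] else 1 for row in q[:-1]])  # → [1, 1, 1, 1]
```

The learned policy is read directly off the table: in each state, the best action is the column with the highest Q-value.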

A very interesting environment for experimenting with Q-learning is OpenAI Gym, in particular its Box2D physics-based environments. 

Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to games like Pong or Pinball.