Machine Learning to analyze League of Legends

Thursday, 25 May 2017


Written by David Heras and Paula Montero, LUCA Interns, and Javier Carro, Data Scientist at LUCA

When League of Legends was released in 2009, few people could have predicted what was to follow. The undeniable rise of eSports has been led by the ever-popular title produced by Riot Games.
The last figures made public by its creators indicated over 100 million monthly players. These numbers have given League of Legends the top spot amongst MOBAs (Multiplayer Online Battle Arena games).

We will first examine the market before the huge success of LoL and its competition. For those who are unfamiliar with how the game works, the classic mode pits two teams of five players against each other, each aiming to destroy the opposing team's base while defending their own.

Then we will dig a little deeper. We have set out to characterize team play and to predict the results of some of the professional matches that will be played in the future.

We have based our insights on data published by Tim Sevenhuysen on Oracle's Elixir. This data set includes information on each match, both at the individual and team level, and covers all facets of the game: gold obtained, damage dealt, "farming", and more. It forms the foundation of our analysis.

Data and Planning 


Our analysis uses data from 7 different leagues around the world, together with the results of each game. What we are trying to predict is whether a future game will end in victory or defeat, using a simple classifier.

We have used statistics from the 2017 Spring Split, which gives us an extensive data set from the start. We have access to all the variables (player, winner, team, gold, damage, etc.) from each match, which are further broken down per team and finally per player for the ten players in each game.

After testing various ways of approaching the analysis of the data, this is what we have come up with:
  1. An unsupervised classification of each team based on variables grouped thematically: Gold, CS ("farming"), Wards, Objectives and KDA ratios.
  2. Adding each team's winning trend over its last five matches, taking into account whether it plays on the blue or red side.
We will use unsupervised learning to characterise the teams from each game and supervised learning to make the prediction for future matches.

Supervised learning becomes powerful once the model has been trained and knows the classifications we stated previously. When predicting new results, the classifier knows what happened on similar occasions in the past and will therefore predict on the basis of that information and the internal structure of the algorithm.

  • Team Data:
We had the option of working with individual information for each player. However, we decided to work with team data to avoid the problem of player substitutions and roster changes. By focusing on the joint result of each team we know that we may be losing some information; however, we avoid the problem just described.

  • Data Gaps:
Occasionally some team members have information that differs from, or is missing relative to, the rest of the team. Data with problems, or left blank, gives us nothing to work with. After several tests we verified that the solution giving the best results is to replace those cases with the average of the corresponding variable, as sketched in the code after this list. This way we do not lose any information from the registered players.

  • Redundancy:
A simple correlation analysis shows that there are quite a few redundant variables that give us repeated information. For example, a higher CS, more gold earned or more experience than the adversary almost always translates into victory. Since these three variables tell us essentially the same thing, it is enough to keep only one of them in our analysis; we will find out which one a little further on.

  • Normalizing for game duration:
This is an important point to highlight, as the duration of a game strongly influences the amount of gold, CS, experience, KDA and so on that is obtained. If we ignored this variable, we would be harming our models. For example, the best Korean team could win a match in 20 minutes having earned only 20,000 gold, while the worst team in Turkey could lose a 60-minute game in which it earned 25,000 gold per player. At first glance the Turkish team appears to have performed better in terms of gold, but the comparison is misleading: the Korean team earned 1,000 gold per minute against roughly 417 for the Turkish team, a significantly different figure once this more in-depth analysis is performed.
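To make these preparation steps concrete, here is a minimal sketch in Python with pandas. The file name and column names (such as "gameid", "gamelength" in minutes, "gold", "cs") are assumptions for illustration, not the exact Oracle's Elixir schema:

```python
import pandas as pd

# Minimal sketch of the preparation steps above; names are illustrative.
rows = pd.read_csv("2017_spring_split.csv")   # assumed: one row per player per game

# Team data: aggregate the player rows into one row per team per game.
team_stats = (rows
              .groupby(["gameid", "team", "side", "result", "gamelength"], as_index=False)
              [["gold", "cs", "kills", "deaths", "assists", "wards"]]
              .sum())

# Data gaps: replace missing values with the mean of the corresponding column.
numeric = team_stats.select_dtypes(include="number").columns
team_stats[numeric] = team_stats[numeric].fillna(team_stats[numeric].mean())

# Redundancy: flag pairs of variables that carry essentially the same information.
corr = team_stats[numeric].corr().abs()
redundant = [(a, b) for i, a in enumerate(corr.columns)
             for b in corr.columns[i + 1:] if corr.loc[a, b] > 0.9]

# Game duration: turn duration-dependent totals into per-minute rates
# ("gamelength" assumed to be in minutes).
for col in ["gold", "cs", "kills", "deaths", "assists", "wards"]:
    team_stats[col + "_per_min"] = team_stats[col] / team_stats["gamelength"]
```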

Characterising the teams


After optimizing and preparing the data, we can move on to classifying the teams. We have used the unsupervised KMeans method included in Python's scikit-learn (Sklearn) library. This algorithm analyses the teams according to the previously selected facets of the game and then groups them according to their behaviour. Unsupervised classification means that no labels have been assigned beforehand. The outcome we hope to see is the algorithm grouping the best teams of each league together, and likewise the worst.

The groups of variables mentioned above usually contain more than three variables. In order to produce a scatter plot of the results, we reduce the data to two dimensions, so that each new component becomes one of the axes. To do this we apply Principal Component Analysis (PCA), also from Python's Sklearn library.
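As a rough illustration of this step, continuing from the preparation sketch above (the feature names and the choice of averaging each team's games into a single point are assumptions, not necessarily what the study did):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# One thematic group of variables at a time (column names are illustrative).
group_cols = ["gold_per_min", "cs_per_min", "wards_per_min"]

# Average each team's games so every team becomes one point, then scale the
# features and project them onto two principal components (the scatter axes).
per_team = team_stats.groupby("team", as_index=False)[group_cols].mean()
scaled = StandardScaler().fit_transform(per_team[group_cols])
coords_2d = PCA(n_components=2).fit_transform(scaled)
```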

Once we have carried out the PCA, we classify the teams using the unsupervised method and visualize the results with Spotfire.
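A minimal sketch of the clustering step, continuing from the PCA sketch above (the number of clusters is an arbitrary choice for this illustration; the study does not state the value used):

```python
from sklearn.cluster import KMeans

# Group the teams by their behaviour in the two-dimensional PCA space.
kmeans = KMeans(n_clusters=4, random_state=0)
per_team["cluster"] = kmeans.fit_predict(coords_2d)
```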

The following insights are what we consider to be the most notable from the visualization:
  • The teams with the best results tend to group together in the same clusters, and the same pattern holds for the teams with the worst results. This means the algorithm classified the teams effectively despite the information lost when reducing the dimensions with PCA.

Figure 1: Highlighting the teams with a lower ranking in the competitions to see how they coincide with the data clusters.

Figure 2: Highlighting the teams with the highest rankings to see how they coincide with the data clusters.

  • There are some rare cases, such as EnVyUs (who finished last in the North American league): in terms of CS or gold earned they sit amongst the best of their league, alongside Team SoloMid. However, in all the other areas analysed (KDA ratios and Objectives), they remained one of the worst teams across the leagues.
Figure 3: The special case of the EnVyUs team, which appears amongst the best teams in terms of CS.


Figure 4: The special case of EnVyUs, this time amongst the worst cluster in terms of gold.

  • It should be noted that in some leagues the teams remain close to one another. In the North American league, the teams stay close together in all the areas analysed, a trend that does not seem to hold in the European or Turkish leagues.

Figure 5: Measuring the proximity of teams that play in the same league.

The information generated about each team through these clusters will form part of the dataset used to train our prediction model. In fact, according to our tests, predictions made using only this information were already far from useless: we obtained an accuracy between 58% and 60% with different models. We still have room to enrich the information and improve this result.

Trends


After reading several articles and studies on predictions in other, more traditional sports, it became apparent that it is also important to take into account the form a team brings into its next match. There are various ways to include this information; this is what we decided to do:

We have differentiated whether the team plays on the blue or the red side, since we know this is a factor that greatly influences the results of games, although its effect varies depending on the patch in force at the time.

We have also concluded that the form of the team can be a decisive factor, so we calculate the winning streak that each team has in their last five games. 

To calculate a team's trend, we keep track of its last five games in chronological order and the result of each one, which lets us capture its form before it reaches the next game.
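As an illustration of one way to compute this (column names such as "date", "team" and "side" are assumptions carried over from the earlier sketches, with "result" coded as 1 for a win and 0 for a loss):

```python
# For each team and side, take the share of wins over the previous five games;
# shift(1) keeps the current game's own result out of its trend.
team_stats = team_stats.sort_values("date")
team_stats["trend"] = (team_stats
                       .groupby(["team", "side"])["result"]
                       .transform(lambda s: s.shift(1).rolling(5, min_periods=1).mean()))
```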

Final Dataset


After fine-tuning the data, the model will include the following information:
  • The teams who will compete against one another
  • The cluster that each team belongs to for each group of variables, computed from the games it has played.
  • The winning trend in the last five games played by the blue team as a blue team.
  • The winning trend in the last five games played by the red team as a red team.
The following table is a small example of the final format for the data.


Figure 6: An example of the final data format.

For example, in the first game shown in the table, the team "Unicorns of Love" has a trend of 0.6, which tells us that, at the time of facing "G2 Esports", they had won three of their last five games played on the red side. The table also includes the cluster each team falls into for each group of variables. The "Blue Victory" column is used to train the final model and is removed for testing.
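A hypothetical sketch of one row in this format follows. Apart from the 0.6 red-side trend described above, every value (cluster ids, the blue-side trend, the label) is made up purely for illustration:

```python
import pandas as pd

# One row per game: the two teams, the cluster each fell into per variable
# group, both trends, and the "Blue Victory" label (dropped at test time).
final = pd.DataFrame([{
    "blue_team": "G2 Esports",
    "red_team": "Unicorns of Love",
    "blue_gold_cluster": 2,   # hypothetical cluster id
    "red_gold_cluster": 1,    # hypothetical cluster id
    "blue_trend": 0.8,        # hypothetical blue-side win rate over the last 5 games
    "red_trend": 0.6,         # 3 wins in Unicorns of Love's last 5 red-side games
    "blue_victory": 1,        # hypothetical training label
}])
```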

Results


At this point, we have the data ready for the next step, training and evaluating our prediction models.

As a preliminary step, we can discretize the names of the teams (using Sklearn's LabelEncoder) or use dummy variables.
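A minimal sketch of both options, continuing from the hypothetical `final` table above:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Option 1: integer codes, fitted on the union of blue and red team names so
# both columns share the same encoding.
encoder = LabelEncoder().fit(pd.concat([final["blue_team"], final["red_team"]]))
final["blue_team_id"] = encoder.transform(final["blue_team"])
final["red_team_id"] = encoder.transform(final["red_team"])

# Option 2: one-hot (dummy) variables.
dummies = pd.get_dummies(final[["blue_team", "red_team"]])
```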

In addition, when evaluating performance we have used cross-validation to ensure that the accuracy of each model does not depend on the particular split chosen between training and test data.
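A sketch of this evaluation, assuming a full table of past matches (here called `games`) in the format shown earlier and an SVM as one of the candidate models; feature names and the number of folds are illustrative:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 'games' is assumed: the full historical table, one row per game.
feature_cols = ["blue_team_id", "red_team_id",
                "blue_gold_cluster", "red_gold_cluster",
                "blue_trend", "red_trend"]
X = games[feature_cols]
y = games["blue_victory"]

# Mean accuracy over 5 folds, independent of any single train/test split.
scores = cross_val_score(SVC(), X, y, cv=5)
print(scores.mean())
```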

The results obtained were the following:
To put our model to a more realistic test, we took advantage of the fact that a tournament, the Mid-Season Invitational (MSI), was under way while we were carrying out this study. We wanted to run some real tests to see whether the results obtained in the simulation matched the real ones.

We made predictions for the games played in the first three rounds of the group stage, and the most accurate model in this real-world evaluation was the SVM, which got 11 of the 16 predictions right (68.75%). This result was higher than expected.

Wednesday, 10 May 2017 — 4 predictions: Hit, Hit, Hit, Miss

Thursday, 11 May 2017 — 6 predictions: Hit, Miss, Hit, Miss, Hit, Hit

Friday, 12 May 2017 — 6 predictions: Miss, Hit, Hit, Hit, Hit, Miss

Special Case


We noted a rather curious fact: whenever a prediction failed, either G2 Esports or Flash Wolves was involved, playing against another team or against each other.

As for G2 Esports, anyone who follows the European scene knows that they always arrive at international championships with high expectations and with predictions that, SKT T1 permitting, place them close to the title. They are the clear kings of Europe, and their style of play there is so clean and spotless that it seems almost unbelievable that their success evaporates as soon as they leave Europe. It was not until this very MSI that G2 were able to redeem themselves and, against many predictions, managed to reach the final and face the Korean gods. Ocelote's team finally showed their potential and, despite finishing second, gave their fans and followers in Europe a reason to feel proud.

The Flash Wolves case follows a similar pattern. As winners of the LMS, they managed to defeat G2 in the IEM Katowice final to take that championship, so their performance at MSI was hotly anticipated. They were not able to overcome G2 in the group stage, but they did manage to beat SKT. This produced a roller coaster of feelings amongst fans, who arrived at the semifinals with some hope; it was short lived, as they were defeated decisively 3-0 in a best-of-five series.

Predicting the games involving these two teams was difficult for any fan and, as we have seen, all kinds of results were produced.


Figure 7: Cluster representations for both G2 (circle) and Flash Wolves (triangle).

Figure 7 highlights the two teams. As we can see, they are neck and neck, which indicates similar performance; taking into consideration all the factors in the analysis, we could say they are both good teams. However, looking more closely, we notice that in the CS graph G2 marches ahead, not only of Flash Wolves but also of the other teams featured, while Flash Wolves lead in the gold and objectives graphs. With this deeper analysis we can see that the two teams are not as similar as they first appeared.

This is as far as we have gone with our analytical exploration of LoL using ML. As is often the case when concluding an analysis, many ideas keep emerging that would surely improve the results, precisely because of all the domain knowledge acquired through the research and all the ways in which the data was analysed. It is apparent that, amidst the data science boom, "Sports Science" is a branch of activity that is also growing and has a bright future ([1] [2] [3] [4] [5]).
