league-of-legends-analysis

Project maintained by vanessly Hosted on GitHub Pages — Theme by mattgraham

Know Your Role: Predicting League of Legends Roles from Player Stats

Vinh Tran | LinkedIn | vinht@umich.edu

CC Ly | LinkedIn | vanessly@umich.edu

Introduction

Introduction and Question Identification

Dataset Overview

In this project, we’re using the Oracle’s Elixir League of Legends Match Data from the 2022 season. This dataset contains information from over 10,000 League of Legends professional matches. There are about 120,000 rows in total (each match contributes up to 12 rows: one per player plus two team‑summary rows).

What is League of Legends?

League of Legends (LoL) is a globally popular multiplayer online battle arena (MOBA) game developed by Riot Games. In each match, two teams of five players compete to destroy the opposing team’s base, called the Nexus, while defending their own. Every player controls a unique character, called a champion, and takes on a specific role on the map: Top, Mid, Jungle, Bottom (Bot), or Support (Sup).
Each role comes with distinct responsibilities:
- Top laners typically play isolated, durable champions who can hold their own.
- Mid laners often deal heavy damage and control the center of the map.
- Junglers move between lanes and neutral zones, securing map objectives such as dragons and supporting teammates.
- Bottom players focus on dealing consistent damage from range.
- Support players protect and assist teammates, especially the Bot laner, and help control vision on the map.

Throughout the game, players earn gold, experience, and items by defeating enemy champions, minions, and monsters. Their performance is tracked through stats like kills, deaths, assists, damage dealt, and gold earned, many of which are used in our analysis to predict a player’s role.

Central Question

How accurately can we predict a player’s in‑game role (Top, Jungle, Mid, Bottom, or Support) using only their post‑game performance statistics?

Key Columns

Below are the columns relevant to our question:

Column	Description
`gameid`	Unique ID for each match (ties together all player and team rows)
`position`	The role a player filled in that game (Top, Jungle, Mid, Bottom, Support)
`kills`	Number of enemy champions the player eliminated
`assists`	Number of enemy champion kills the player helped secure
`deaths`	Number of times the player was eliminated by enemy champions
`dpm`	Damage per minute: average damage dealt to champions per minute
`earned gpm`	Gold per minute earned by the player throughout the match
`cspm`	Creep score per minute: average minions and monsters killed per minute
`monsterkills`	Total number of neutral monsters killed by the player
`kda`	Kills/Deaths/Assists ratio: (Kills + Assists) divided by Deaths, used to evaluate combat performance
`participation`	Also known as "kill participation". Proportion of team kills a player was involved in (kills or assists)
`xptogoldat10`	Ratio of experience points to gold earned at 10 minutes, used to estimate lane efficiency

Why It Matters

Automatically predicting a player’s position from raw match stats has practical value for coaches, pro players, analysts, and broadcasters. Coaches can see whether their players’ performances line up with what is expected of other players in the same position, and make statistically backed decisions to optimize their team roster. Similarly, pro players can use this tool to see where their skills are lacking and adjust their gameplay. Analysts and broadcasters can use the classifier as a fun and engaging statistic for audiences.

Data Cleaning and Exploratory Data Analysis

Data Cleaning

To ensure that our analysis focused only on meaningful statistics relevant to role prediction, we applied several cleaning steps to the original dataset, based on how its data is structured and generated.

1. Filtered only for complete player data

```
df = df[df['datacompleteness'] == 'complete']
```

The dataset includes some rows marked as incomplete, which may result from matches where data logging failed or games were not played to completion. We filtered the DataFrame to keep only rows where datacompleteness was marked as complete, ensuring all included rows contain full, reliable statistics.

2. Removed team-related summary rows

```
df = df.groupby('gameid', group_keys=False).apply(lambda x: x.iloc[:-2])
```

For each gameid, the dataset contains 12 rows: 10 for individual players and 2 for team-level summary statistics. Since our prediction task focuses on individual player performance, we removed the last two rows of each match group, which correspond to team summaries. We verified that this operation worked correctly by checking that only 10 players remained in a sample game:
```
print(df.loc[df['gameid'] == 'ESPORTSTMNT01_2690210', 'playername'])
```
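The same check can be run over every match rather than one sample game by counting rows per `gameid`. The following is a minimal sketch on toy data, not our original code: the game IDs `G1` and `G2`, the helper name `drop_team_rows`, and the use of `cumcount` in place of `apply` are all assumptions made for illustration.

```python
import pandas as pd

def drop_team_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the last two rows of every game (assumed to be the team-summary rows)."""
    # cumcount(ascending=False) numbers rows within each game from the end:
    # the last row gets 0, the second-to-last gets 1, and so on.
    rows_from_end = df.groupby("gameid").cumcount(ascending=False)
    return df[rows_from_end >= 2]

# Toy data: two hypothetical games, each with 10 player rows followed by 2 team rows.
rows = []
for gameid in ["G1", "G2"]:
    for participant in range(1, 11):
        rows.append({"gameid": gameid, "playername": f"player{participant}"})
    rows.append({"gameid": gameid, "playername": None})
    rows.append({"gameid": gameid, "playername": None})
toy = pd.DataFrame(rows)

players = drop_team_rows(toy)
rows_per_game = players.groupby("gameid").size()
print(rows_per_game)
```

If any game reports a count other than 10, the "last two rows" assumption does not hold for that match and it is worth inspecting by hand.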

3. Dropped irrelevant columns

```
cols_to_drop = ['url', 'split', 'pick1', ..., 'firstdragon']
df.drop(columns=cols_to_drop, inplace=True)
```

We removed columns that are either:
- Unrelated to performance metrics (e.g. url, split)
- Draft data (e.g. pick1 to pick5)
- Team-level objective data (e.g. firstdragon, elders, heralds, etc.)

4. Dropped columns with Null values

```
columns_with_null = df.isnull().sum()[df.isnull().sum() > 0].index.to_list()
df.drop(columns=columns_with_null, inplace=True)
```

['playerid', 'teamname', 'teamid', 'ban1', 'ban2', 'ban3', 'ban4', 'ban5', 'barons', 'opp_barons', 'inhibitors', 'opp_inhibitors', 'goldat20', 'xpat20', 'csat20', 'opp_goldat20', 'opp_xpat20', 'opp_csat20', 'golddiffat20', 'xpdiffat20', 'csdiffat20', 'killsat20', 'assistsat20', 'deathsat20', 'opp_killsat20', 'opp_assistsat20', 'opp_deathsat20', 'goldat25', 'xpat25', 'csat25', 'opp_goldat25', 'opp_xpat25', 'opp_csat25', 'golddiffat25', 'xpdiffat25', 'csdiffat25', 'killsat25', 'assistsat25', 'deathsat25', 'opp_killsat25', 'opp_assistsat25', 'opp_deathsat25']

We identified and removed all columns that had missing values. Upon inspection, these columns either did not contain statistics that are relevant to our modeling goal or contained redundant information. Keeping them would have required imputation strategies that could introduce bias to our algorithm.
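To make the column-dropping step concrete, here is a small sketch on a toy DataFrame. The values are hypothetical and only three columns are shown. It uses `isnull().any()` as an equivalent, slightly shorter way to find the affected columns.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column names come from the dataset.
toy = pd.DataFrame({
    "kills": [2, 1, 4],
    "goldat20": [8000.0, np.nan, 9100.0],  # one missing value
    "ban1": ["Zed", None, "Ahri"],         # one missing value
})

# Columns containing at least one null value.
columns_with_null = toy.columns[toy.isnull().any()].tolist()
cleaned = toy.drop(columns=columns_with_null)

print(columns_with_null)
print(cleaned.columns.tolist())
```

Note that a single missing value is enough to drop the whole column, so this approach only makes sense when the affected columns are not needed downstream.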

Final Cleaned Dataframe

	gameid	datacompleteness	league	year	date	game	patch	participantid	side	position	playername	champion	gamelength	kills	deaths	assists	teamkills	teamdeaths	firstblood	firstbloodkill	firstbloodassist	team kpm	ckpm	damagetochampions	dpm	damageshare	damagetakenperminute	damagemitigatedperminute	wardsplaced	wpm	wardskilled	wcpm	controlwardsbought	visionscore	vspm	totalgold	earnedgold	earned gpm	earnedgoldshare	goldspent	total cs	minionkills	monsterkills	cspm	goldat10	xpat10	csat10	opp_goldat10	opp_xpat10	opp_csat10	golddiffat10	xpdiffat10	csdiffat10	killsat10	assistsat10	opp_deathsat10	goldat15	xpat15	csat15	opp_goldat15	opp_xpat15	opp_csat15	golddiffat15	xpdiffat15	csdiffat15	killsat15	assistsat15	deathsat15	opp_killsat15	opp_assistsat15	opp_deathsat15
0	ESPORTSTMNT01_2690210	complete	LCKC	2022	2022-01-10 07:44:08	1	12.01	1	Blue	top	Soboro	Renekton	1713	2	3	2	9	19	0.0	0.0	0.0	0.32	0.98	15768.0	552.29	0.28	1072.40	777.79	8.0	0.28	6.0	0.21	5.0	26.0	0.91	10934	7164.0	250.93	0.25	10275.0	231.0	220.0	11.0	8.09	3228.0	4909.0	89.0	3176.0	4953.0	81.0	52.0	-44.0	8.0	0.0	0.0	0.0	5025.0	7560.0	135.0	4634.0	7215.0	121.0	391.0	345.0	14.0	0.0	1.0	0.0	0.0	1.0	0.0
1	ESPORTSTMNT01_2690210	complete	LCKC	2022	2022-01-10 07:44:08	1	12.01	2	Blue	jng	Raptor	Xin Zhao	1713	2	5	6	9	19	1.0	0.0	1.0	0.32	0.98	11765.0	412.08	0.21	944.27	650.16	6.0	0.21	18.0	0.63	6.0	48.0	1.68	9138	5368.0	188.02	0.19	8750.0	148.0	33.0	115.0	5.18	3429.0	3484.0	58.0	2944.0	3052.0	63.0	485.0	432.0	-5.0	1.0	2.0	1.0	5366.0	5320.0	89.0	4825.0	5595.0	100.0	541.0	-275.0	-11.0	2.0	3.0	2.0	0.0	5.0	1.0
2	ESPORTSTMNT01_2690210	complete	LCKC	2022	2022-01-10 07:44:08	1	12.01	3	Blue	mid	Feisty	LeBlanc	1713	2	2	3	9	19	0.0	0.0	0.0	0.32	0.98	14258.0	499.40	0.25	581.65	227.78	19.0	0.67	7.0	0.25	7.0	29.0	1.02	9715	5945.0	208.23	0.21	8725.0	193.0	177.0	16.0	6.76	3283.0	4556.0	81.0	3121.0	4485.0	81.0	162.0	71.0	0.0	0.0	1.0	1.0	5118.0	6942.0	120.0	5593.0	6789.0	119.0	-475.0	153.0	1.0	0.0	3.0	0.0	3.0	3.0	2.0
3	ESPORTSTMNT01_2690210	complete	LCKC	2022	2022-01-10 07:44:08	1	12.01	4	Blue	bot	Gamin	Samira	1713	2	4	2	9	19	1.0	0.0	1.0	0.32	0.98	11106.0	389.00	0.20	463.85	218.88	12.0	0.42	6.0	0.21	4.0	25.0	0.88	10605	6835.0	239.40	0.24	10425.0	226.0	208.0	18.0	7.92	3600.0	3103.0	78.0	3304.0	2838.0	90.0	296.0	265.0	-12.0	1.0	1.0	0.0	5461.0	4591.0	115.0	6254.0	5934.0	149.0	-793.0	-1343.0	-34.0	2.0	1.0	2.0	3.0	3.0	0.0
4	ESPORTSTMNT01_2690210	complete	LCKC	2022	2022-01-10 07:44:08	1	12.01	5	Blue	sup	Loopy	Leona	1713	1	5	6	9	19	1.0	1.0	0.0	0.32	0.98	3663.0	128.30	0.06	475.03	490.12	29.0	1.02	14.0	0.49	11.0	69.0	2.42	6678	2908.0	101.86	0.10	6395.0	42.0	42.0	0.0	1.47	2678.0	2161.0	16.0	2150.0	2748.0	15.0	528.0	-587.0	1.0	1.0	1.0	1.0	3836.0	3588.0	28.0	3393.0	4085.0	21.0	443.0	-497.0	7.0	1.0	2.0	2.0	0.0	6.0	2.0

Univariate Analysis

This histogram shows the distribution of total kills per game across the dataset. The shape is right-skewed: most games have between 20 and 40 total kills, with occasional high-kill matches. Factors such as game pace, player aggression, and competitiveness vary from game to game and can shift every player’s performance statistics, so understanding the overall distribution of kills per game helps contextualize which roles are likely to stand out based on their kill-related statistics.

Bivariate Analysis

This bar chart displays the average number of kills per game for each player position. We observe that Bottom and Mid positions have the highest kill averages, with Support having the lowest, supporting our idea that post-game statistics like kills can help differentiate between player roles, thus directly addressing our model’s goal of predicting position from performance metrics.

This bar chart compares the average number of kills and assists per game for each player position. Support players have the highest average assists and the lowest kills, and Jungle players have more assists on average than Top, Mid, and Bottom players. However, kills and assists alone do not cleanly separate the remaining positions, so additional features may be needed to distinguish between them in our role prediction model.
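The per-position averages behind a chart like this can be computed with a `groupby`. Below is a minimal sketch that uses only the five Blue-side rows shown in the cleaned DataFrame preview; the helper name `average_by_position` is an assumption, and with one row per role each average is simply that row's value.

```python
import pandas as pd

# Blue-side player rows of game ESPORTSTMNT01_2690210, copied from the preview.
sample = pd.DataFrame({
    "position": ["top", "jng", "mid", "bot", "sup"],
    "kills": [2, 2, 2, 2, 1],
    "assists": [2, 6, 3, 2, 6],
})

def average_by_position(df: pd.DataFrame) -> pd.DataFrame:
    """Mean kills and assists for each role."""
    return df.groupby("position")[["kills", "assists"]].mean()

averages = average_by_position(sample)
print(averages)
```

On the full dataset the same call averages over every player row for each role.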

Interesting Aggregates

position	bot	jng	mid	sup	top
gameid
ESPORTSTMNT01_2690210	5.0	3.0	4.0	0.5	1.5
ESPORTSTMNT01_2690219	1.5	3.5	3.5	0.0	1.0
ESPORTSTMNT01_2690227	1.5	1.0	4.0	1.0	2.0
ESPORTSTMNT01_2690255	4.5	2.5	3.0	2.0	2.5
ESPORTSTMNT01_2690264	3.0	1.0	2.0	2.0	2.5
ESPORTSTMNT01_2690302	5.0	3.5	7.0	0.5	5.5
ESPORTSTMNT01_2690328	6.0	3.5	7.5	0.5	3.5
ESPORTSTMNT01_2690351	1.5	0.5	4.0	0.5	2.5
ESPORTSTMNT01_2690370	4.5	2.0	0.5	0.0	0.5
ESPORTSTMNT01_2690390	2.5	4.5	4.5	2.0	0.5

This pivot table summarizes kills per game by player position. Each row corresponds to a unique gameid, and each column gives the average number of kills across the two players (one per team) in one of the five standard League of Legends roles: bot, jng (jungle), mid, sup (support), and top.
This lets us compare the kill contributions of each role from game to game and observe which roles contribute more or fewer kills on average.
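Fractional values such as 0.5 and 1.5 suggest that each cell averages the two players who share a role in a game. The sketch below reproduces the first row of the table under that assumption. The Blue-side kills come from the cleaned DataFrame preview, while the Red-side kills are not shown anywhere in this write-up and are back-calculated here, so treat them as assumed values.

```python
import pandas as pd

game = "ESPORTSTMNT01_2690210"
positions = ["top", "jng", "mid", "bot", "sup"]
blue_kills = [2, 2, 2, 2, 1]  # from the cleaned DataFrame preview
red_kills = [1, 4, 6, 8, 0]   # assumed: back-calculated from the pivot row

toy = pd.DataFrame({
    "gameid": [game] * 10,
    "position": positions * 2,
    "kills": blue_kills + red_kills,
})

# One row per game, one column per role, mean kills of the two players in that role.
kills_by_role = toy.pivot_table(
    index="gameid", columns="position", values="kills", aggfunc="mean"
)
print(kills_by_role)
```

The assumed Red-side kills sum to 19, which matches the `teamdeaths` value of 19 on the Blue-side rows.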

Imputation

```
columns_with_null = df.isnull().sum()[df.isnull().sum() > 0].index.to_list()
df.drop(columns=columns_with_null, inplace=True)
```

['playerid', 'teamname', 'teamid', 'ban1', 'ban2', 'ban3', 'ban4', 'ban5', 'barons', 'opp_barons', 'inhibitors', 'opp_inhibitors', 'goldat20', 'xpat20', 'csat20', 'opp_goldat20', 'opp_xpat20', 'opp_csat20', 'golddiffat20', 'xpdiffat20', 'csdiffat20', 'killsat20', 'assistsat20', 'deathsat20', 'opp_killsat20', 'opp_assistsat20', 'opp_deathsat20', 'goldat25', 'xpat25', 'csat25', 'opp_goldat25', 'opp_xpat25', 'opp_csat25', 'golddiffat25', 'xpdiffat25', 'csdiffat25', 'killsat25', 'assistsat25', 'deathsat25', 'opp_killsat25', 'opp_assistsat25', 'opp_deathsat25']

We did not impute any missing values. Instead, as described in Step 4 of Data Cleaning, we removed every column with missing values, because these columns were either not relevant to our goal or contained redundant information.
- We were able to make this judgment because of our prior knowledge of the game.

Framing a Prediction Problem

Problem Identification

Our prediction problem is: “How can we predict what role a player is playing (Top, Jungle, Mid, Bottom, or Support) based on their post-game statistics?” This is a multiclass classification problem, since each player is assigned exactly one of five categorical roles.

Response Variable

The response variable is the position column, which identifies the role each player fulfilled during a match: top, jng, mid, bot, or sup. We chose this variable because our goal is to infer a player’s role solely from their post-game performance statistics, such as kills, assists, gold earned per minute, and damage dealt per minute, rather than using manually labeled or externally sourced data.

Evaluation Metric

We chose accuracy as our primary evaluation metric. Since each team fields one player per role, the five roles are almost perfectly balanced in the dataset, and they carry equal importance. Accuracy is therefore the most intuitive way to measure how often our model correctly predicts a player’s role.
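The balance claim can be checked with a little arithmetic. This sketch uses the per-role test-set counts reported with the baseline model later in this write-up, and also computes the accuracy of always guessing the most common role, which is a useful floor when reading accuracy scores.

```python
# Test-set row counts per role, copied from the baseline classification report.
support = {"bot": 5307, "jng": 5407, "mid": 5312, "sup": 5262, "top": 5352}

total = sum(support.values())
shares = {role: count / total for role, count in support.items()}

# Accuracy of a trivial model that always predicts the most common role.
majority_baseline = max(shares.values())

for role, share in shares.items():
    print(f"{role}: {share:.3f}")
print(f"majority-class accuracy: {majority_baseline:.3f}")
```

Every share lands within half a percentage point of 0.20, so accuracy is not distorted by a dominant class here.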

Information Available at Time of Prediction

Our model is designed to use only post-game player statistics (e.g., kills, deaths, assists, gold earned, damage per minute) that are known at the time the game concludes. During the data cleaning stage we excluded draft picks, team-level objectives, and opponent statistics, as these would not be reliable or player-specific indicators of individual performance patterns.

Baseline Model

Model Description and Evaluation

Why We Chose Logistic Regression

We chose logistic regression for our baseline model because it is particularly useful when we want to:

- Predict probabilities associated with class membership
- Work with multiclass settings or multinomial strategies

Features Used

We included the following seven features, all of which are quantitative:

- kills: Number of enemy champions eliminated by the player
- assists: Number of enemy champion eliminations the player contributed to
- deaths: Number of times the player was eliminated
- dpm: Damage per minute dealt to enemy champions
- earned gpm: Gold earned per minute during the match
- cspm: Creep score per minute (number of minions/monsters killed per minute)
- monsterkills: Total number of neutral monsters slain

We did not use any ordinal or nominal features in our model, so no encoding (e.g. one-hot encoding or label encoding) was necessary for the features.

The target variable (position) is nominal (categorical with no inherent order), consisting of five distinct classes: top, mid, bot, jng, and sup.
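A baseline of this kind can be sketched as follows. This is an illustration under stated assumptions rather than our exact training code: the data is synthetic (the per-role means in `role_means` are invented), and the `StandardScaler` step, the 25% test split, and `max_iter=1000` are choices made for the sketch.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["kills", "assists", "deaths", "dpm", "earned gpm", "cspm", "monsterkills"]

def make_synthetic_players(n_per_role: int = 200, seed: int = 0) -> pd.DataFrame:
    """Generate fake player rows; the per-role means are hypothetical."""
    rng = np.random.default_rng(seed)
    role_means = {  # values listed in FEATURES order
        "top": [2, 4, 3, 450, 250, 8.0, 15],
        "jng": [2, 6, 3, 350, 200, 5.5, 130],
        "mid": [3, 5, 2, 500, 260, 8.5, 15],
        "bot": [4, 5, 2, 520, 280, 9.0, 15],
        "sup": [1, 8, 3, 150, 100, 1.5, 0],
    }
    frames = []
    for role, means in role_means.items():
        # Multiplicative noise of roughly 20% around each mean.
        noise = rng.normal(0.0, 0.2, size=(n_per_role, len(FEATURES)))
        frame = pd.DataFrame(np.array(means, dtype=float) * (1 + noise), columns=FEATURES)
        frame["position"] = role
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

players = make_synthetic_players()
X_train, X_test, y_train, y_test = train_test_split(
    players[FEATURES], players["position"],
    test_size=0.25, random_state=0, stratify=players["position"],
)

# Scale the features, then fit a multiclass logistic regression.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
accuracy = baseline.score(X_test, y_test)
print(f"accuracy on synthetic data: {accuracy:.3f}")
```

The accuracy printed here reflects the synthetic data only and says nothing about the real dataset. The point is the shape of the pipeline: numeric features in, one of five role labels out.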

Model Performance

Test Accuracy: 0.6760885885885886

Class	Precision	Recall	F1-Score	Support
bot	0.48	0.45	0.47	5307
jng	1.00	1.00	1.00	5407
mid	0.47	0.50	0.48	5312
sup	0.96	0.97	0.97	5262
top	0.47	0.46	0.47	5352
accuracy			0.68	26640
macro avg	0.67	0.68	0.68	26640
weighted avg	0.68	0.68	0.68	26640

Confusion Matrix

These performance statistics reveal that the model is very confident and correct when predicting Jungle and Support. However, it struggles to correctly classify Bottom, Top, and Mid. This makes intuitive sense, since these three roles have overlapping post-game stat profiles (e.g., similar kills, assists, and CS patterns), which makes them harder to distinguish using just basic numerical features.

Is the Model “Good”?

We believe this baseline model is a good starting point, but not fully sufficient for high-accuracy role classification. The model captures general trends, such as supports having high assists and low kills, but struggles to differentiate between Top, Mid, and Bottom, leading to lower accuracy than we wanted.

To improve on this baseline, we plan to:

- Explore nonlinear models (e.g., k-nearest neighbors, random forest)
- Include more features or derived features (e.g., KDA ratio, kill participation)

Nonetheless, this baseline confirms that post-game performance statistics can offer meaningful insights into role prediction.

Final Model

Feature Engineering

We created three new features based on the successes and pitfalls of our baseline model, and our prior knowledge of League of Legends and how different roles contribute to team success.

KDA (Kills/Deaths/Assists ratio): Reflects individual combat performance. Carry roles, such as Bottom and Mid, are expected to have relatively high KDAs due to their roles as primary damage dealers. In contrast, Support players often accumulate more assists and fewer kills, while Junglers may have moderate kills but also take more risks early-game, leading to lower KDAs on average.
- This makes KDA a useful feature for separating carry roles (Bot/Mid) from more supportive roles (Support/Jungle).
- In our baseline model, we used the individual components Kills, Deaths, and Assists as separate features. We decided to incorporate KDA instead of these individual components, since treating them as independent features may lead to redundancy and potential overfitting. We wanted to keep our model as streamlined as possible and reduce noise.
Participation: Reflects a player’s kill participation rate, calculated as the player’s kills plus assists divided by the total number of team kills. Since Junglers, Supports, and Mid tend to roam and participate in more fights, we expect higher participation values from them. Conversely, we expect Top and Bottom to have lower participation values, since Top laners tend to be more isolated and Bottom laners tend to focus on their own lane for most of early and mid game.
- This would help us differentiate Top and Bot from the other roles.
xptogoldat10: Calculated as xpat10 / goldat10, we created this feature to measure lane efficiency by comparing experience to gold earned at the 10-minute mark. We came up with this feature because Mid laners typically earn more XP per unit of gold due to faster leveling in solo lanes.
- This would be a potentially useful feature to help distinguish Mid from Bottom, one of the largest problems we saw in our baseline model.
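The three features can be sketched with a few lines of pandas. The input rows are the Blue-side players from the cleaned DataFrame preview. The helper name `add_engineered_features` and the handling of zero deaths and zero team kills are assumptions made for this illustration, since the write-up does not say how those edge cases were treated.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with kda, participation, and xptogoldat10 columns."""
    out = df.copy()
    takedowns = out["kills"] + out["assists"]
    # Assumption: a player with 0 deaths is treated as having 1, to avoid dividing by zero.
    out["kda"] = takedowns / out["deaths"].replace(0, 1)
    # Assumption: participation is left as NaN when the team has no kills.
    out["participation"] = takedowns / out["teamkills"].replace(0, np.nan)
    out["xptogoldat10"] = out["xpat10"] / out["goldat10"]
    return out

# Blue-side rows of game ESPORTSTMNT01_2690210, copied from the preview.
sample = pd.DataFrame({
    "position": ["top", "jng", "mid", "bot", "sup"],
    "kills": [2, 2, 2, 2, 1],
    "deaths": [3, 5, 2, 4, 5],
    "assists": [2, 6, 3, 2, 6],
    "teamkills": [9, 9, 9, 9, 9],
    "goldat10": [3228.0, 3429.0, 3283.0, 3600.0, 2678.0],
    "xpat10": [4909.0, 3484.0, 4556.0, 3103.0, 2161.0],
}).set_index("position")

features = add_engineered_features(sample)
print(features[["kda", "participation", "xptogoldat10"]].round(3))
```

In this single game the solo laners (top and mid) have an XP-to-gold ratio well above 1, while bot is below 1. One game proves nothing on its own, but it illustrates the kind of separation the feature is meant to capture.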

Model Selection and Hyperparameters

While choosing what model to utilize for our final model, we tested several methods to see what yielded the highest accuracy score.

Model: Decision Tree
Test Accuracy: 0.6995495495495495

Model: Random Forest
Test Accuracy: 0.7177552552552553

Model: Naive Bayes
Test Accuracy: 0.678978978978979

Model: Logistic Regression
Test Accuracy: 0.7015765765765766

Model: Neural Network
Test Accuracy: 0.7228228228228228

We ultimately decided upon a Random Forest Classifier for our final model because its test accuracy was among the highest of the methods we tried (0.718, second only to the neural network’s 0.723) and, based on our research, random forests perform strongly on classification tasks.
We performed hyperparameter tuning using grid search cross-validation with 5 folds. The grid search explored various combinations of:
- max_depth ([10, 15, 20, None])
- n_estimators ([100, 200])
- min_samples_split ([2, 5])
- min_samples_leaf ([1, 2])
- max_features (['sqrt', 'log2'])
The best-performing model used:
- n_estimators=200
- max_depth=None
- min_samples_split=5
- min_samples_leaf=1
- max_features='sqrt'
These hyperparameters were selected based on the highest cross-validation accuracy score during tuning.
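A grid search of this shape can be sketched with scikit-learn's `GridSearchCV`. `FULL_GRID` mirrors the grid listed above. Everything else is an assumption for illustration: the helper name `tune_random_forest`, the synthetic five-class data, and the much smaller `demo_grid` with 3 folds, which is used only so the example runs quickly.

```python
import math

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# The grid described above: 4 * 2 * 2 * 2 * 2 = 64 combinations.
FULL_GRID = {
    "max_depth": [10, 15, 20, None],
    "n_estimators": [100, 200],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt", "log2"],
}
n_combinations = math.prod(len(values) for values in FULL_GRID.values())

def tune_random_forest(X, y, param_grid, cv=5, seed=0):
    """Grid-search a random forest, selecting by cross-validated accuracy."""
    search = GridSearchCV(
        RandomForestClassifier(random_state=seed),
        param_grid=param_grid,
        cv=cv,
        scoring="accuracy",
    )
    search.fit(X, y)
    return search

# Synthetic stand-in for the player features (hypothetical data).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

# Reduced grid so the demo finishes in seconds; pass FULL_GRID with cv=5 for the full search.
demo_grid = {"max_depth": [5, None], "n_estimators": [20], "max_features": ["sqrt", "log2"]}
search = tune_random_forest(X, y, demo_grid, cv=3)
print(n_combinations, search.best_params_, round(search.best_score_, 3))
```

With 5 folds the full grid requires 320 model fits, which is why the search is usually the slowest step of the workflow.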

Performance Comparison

Baseline Model

Test Accuracy: 0.6760885885885886

Class	Precision	Recall	F1-Score	Support
bot	0.48	0.45	0.47	5307
jng	1.00	1.00	1.00	5407
mid	0.47	0.50	0.48	5312
sup	0.96	0.97	0.97	5262
top	0.47	0.46	0.47	5352
accuracy			0.68	26640
macro avg	0.67	0.68	0.68	26640
weighted avg	0.68	0.68	0.68	26640

Final Model

Test Accuracy: 0.9207957957957958

Class	Precision	Recall	F1-Score	Support
bot	0.96	0.97	0.97	4253
jng	1.00	1.00	1.00	4298
mid	0.83	0.82	0.82	4290
sup	0.98	0.99	0.99	4213
top	0.83	0.83	0.83	4258
accuracy			0.92	21312
macro avg	0.92	0.92	0.92	21312
weighted avg	0.92	0.92	0.92	21312

Our final model is a significant improvement over our baseline. Test accuracy improved by about 0.24, from ~0.68 in our baseline model to ~0.92 in our final model. Additionally, precision, recall, and F1-score improved substantially for Bottom, Mid, and Top, which were previously extremely hard to differentiate, while Jungle and Support stayed at or near 1.00.
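The size of the improvement can be read directly off the two reports. The short calculation below uses only numbers copied from the tables above.

```python
# Test accuracies copied from the two reports.
baseline_accuracy = 0.6760885885885886
final_accuracy = 0.9207957957957958
gain = final_accuracy - baseline_accuracy

# Per-role F1-scores copied from the two classification reports.
baseline_f1 = {"bot": 0.47, "jng": 1.00, "mid": 0.48, "sup": 0.97, "top": 0.47}
final_f1 = {"bot": 0.97, "jng": 1.00, "mid": 0.82, "sup": 0.99, "top": 0.83}
f1_gain = {role: round(final_f1[role] - baseline_f1[role], 2) for role in baseline_f1}

print(f"accuracy gain: {gain:.3f}")
print(f1_gain)
```

Bottom gains the most F1 (0.50), followed by Top (0.36) and Mid (0.34). The two reports were computed on test sets of different sizes (26,640 and 21,312 rows), so the comparison is indicative rather than exact.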

Confusion Matrix

The confusion matrix from the Final Model demonstrates far more accurate predictions across all roles, especially in the previously confused categories of Bot, Mid, and Top. The stronger diagonal pattern indicates that misclassifications are now rare and mostly occur between conceptually similar roles, specifically Top and Mid.
Overall, by being intentional with our feature choices and thorough in our hyperparameter tuning, we improved our model’s performance substantially compared to the baseline model. The Random Forest model generalizes well and captures the nuances of player behavior across different roles, validating our original hypothesis that post-game stats can predict a player’s in-game role.

Conclusion

League of Legends, as complicated as it is, is designed in such a way that each role has significantly different patterns of gameplay, responsibility, and map behavior, which can be measured by statistics. By examining post-game performance statistics of players from professional League of Legends matches, we were able to create, train, feature-engineer, and fine-tune a model that classifies players into their respective roles 92% of the time.
