Research article - (2025)24, 764 - 778 DOI: https://doi.org/10.52082/jssm.2025.764 |
Classifying Soccer Players Based on Physical Capacities and Match-Specific Running Performance Using Machine Learning |
Michel de Haan1, Stephan van der Zwaard1,2, Jurrit Sanders3, Peter J. Beek1, Richard T. Jaspers1 |
Key words: Clustering, football, sprint speed, V̇O2max, MAS, MSS |
Participants |
This study included 31 young male elite soccer players at a professional football club of international caliber (U18 & U21, age = 18.0 ± 0.9 years, height = 1.79 ± 0.06 m, weight = 70.5 ± 6.9 kg; mean ± standard deviation). U18 played in the highest league for their age group in the Netherlands (Eredivisie) and U21 played in the second highest professional league in the Netherlands (the so-called Keuken Kampioen Divisie or KKD). The sample included 9 forwards, 8 attacking midfielders, 5 defending midfielders, 5 backs and 4 central defenders. Playing position was determined as assigned by the coach during matches (see Appendix A for a schematic overview of the playing positions). Goalkeepers were excluded, because they have a distinctly different match-specific running performance compared to outfield players. Match-specific running data were collected over two full consecutive seasons from all 31 players, with participants undergoing exercise testing at the start of the second season. |
Ethical statement |
The study was conducted in full compliance with the Declaration of Helsinki (2013) and approved by The Scientific and Ethical Review Board (VCWE-2023-054) of the Faculty of Behavioural and Movement Sciences of the Vrije Universiteit Amsterdam. Participants provided written informed consent. Participants were instructed to avoid strenuous exercise for 30 hours leading up to the exercise testing. |
Sprint capacity |
Sprint capacity was measured using an all-out linear sprint test over 20 meters on an artificial grass surface. Before the sprint test, participants underwent a standard warm-up routine for soccer practice designed by the physical training staff of the team. This routine consisted of dynamic stretching, running exercises and footwork drills. Participants were instructed to cover these 20 meters as fast as possible from a static start. They performed this test twice, with the fastest time being used for further analysis. Positional data were obtained using LPM (Inmotio, Zeist, the Netherlands; Inmotio GPS; Insiders, Lausanne, Switzerland) and integrated over time to determine the average sprint speed over 20 meters. In addition, maximal sprint speed (MSS) over 20 meters was assessed; see Appendix B for details. Players were familiar with the testing procedure as it is part of a regular testing battery performed by the professional sports team. |
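The step from a 20-meter sprint to an average speed in km/h can be sketched as follows. This is a minimal illustration only: the helper names and the two trial times are hypothetical, and the study derived speed from positional tracking data rather than from a single stopwatch time.

```python
from typing import Sequence

SPRINT_DISTANCE_M = 20.0  # test distance stated in the text


def average_sprint_speed_kmh(time_s: float,
                             distance_m: float = SPRINT_DISTANCE_M) -> float:
    """Average speed in km/h for a sprint of distance_m covered in time_s."""
    if time_s <= 0:
        raise ValueError("sprint time must be positive")
    return distance_m / time_s * 3.6  # convert m/s to km/h


def best_trial_speed_kmh(trial_times_s: Sequence[float]) -> float:
    """Use the fastest (shortest) trial, as described for the two attempts."""
    return average_sprint_speed_kmh(min(trial_times_s))


# Hypothetical trial times in seconds (not data from the study).
trials = [3.00, 2.95]
speed = best_trial_speed_kmh(trials)
print(f"Average 20 m sprint speed: {speed:.2f} km/h")
```

A 20-meter sprint completed in exactly 3.00 s corresponds to 24 km/h, which is close to the group mean reported in the Results.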
Endurance capacity |
In this study, endurance capacity was quantified using the V̇O2max obtained during a maximal incremental treadmill test (Kemi et al., |
Match-specific running performance |
Over two consecutive seasons, match-specific running performance of the U18 and U21 teams was collected using multiple positional tracking systems (Inmotio Local Position Measurement (LPM); Inmotio, Zeist, the Netherlands; Inmotio GPS; Insiders, Lausanne, Switzerland, and SciSports Optical tracking; Panoris, Brno, Czech Republic). The LPM system measures with an overall sample frequency of 1,000 Hz, divided by the number of active transponders on the field. The average measurement frequency per active transponder varied from 40 to 80 Hz over the matches. The LPM system has been demonstrated to be an accurate and valid tool for tracking player movements in football, showing a mean difference from the actual distance of maximally -1.6% (Frencken et al., |
Machine learning analysis |
Unsupervised machine learning was employed using the As input variables, we included measures for sprint and endurance capacities (average 20-meter sprint speed and V̇O2max) as well as sprint and endurance match-specific running performance (sprint distance and combined distance traveled at moderate and high intensity (MIR + HIR)), resulting in a total of four input variables. All values were normalized to Subsequently, we applied the supervised machine learning method of subgroup discovery (de Leeuw et al., |
Statistical analysis |
Data were presented as mean ( Linear regression analysis was used to identify the relationship between sprint capacity and endurance capacity. This analysis was performed for 20-meter sprint speed vs V̇O2max normalized to LBM2/3. The relationships between these physical traits were quantified in terms of explained variance ( To examine the relationship between sprint and endurance capacities with their respective match-specific running performance, the average match-specific sprint performance was plotted against the average sprint speed over 20 meters, while the distance covered at moderate and high intensity (MIR+HIR) during an average match was plotted against V̇O2max normalized to LBM2/3. Linear regression was used to examine these relationships and The |
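The explained variance of a simple linear regression, as used for the relationships described above, can be illustrated with a short sketch. The function name `linear_regression` and the five data points are hypothetical and were chosen only to keep the arithmetic easy to follow; they are not values from the study, and the software used by the authors is not stated in this text.

```python
from statistics import mean


def linear_regression(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared), where r_squared is the
    proportion of variance in y explained by x.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical values: 20 m sprint speed (km/h) and match sprint distance (m).
sprint_speed = [23.5, 24.0, 24.5, 25.0, 25.5]
sprint_distance = [150.0, 190.0, 170.0, 230.0, 210.0]

slope, intercept, r2 = linear_regression(sprint_speed, sprint_distance)
print(f"slope = {slope:.1f} m per km/h, R^2 = {r2:.2f}")
```

For a single predictor, the explained variance equals the squared Pearson correlation, so an explained variance of 17% corresponds to a correlation of roughly 0.41.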
|
|
Sprint and endurance capacities |
The 20-meter sprint test was completed by 27 players and the maximal incremental treadmill test by 28 players, with 24 players successfully completing both assessments. The average sprint speed of these 27 players on the 20-meter sprint test was 24.36 ± 0.66 km/h, with values ranging from 23.08 to 26.17 km/h. V̇O2max relative to body weight of these 28 players was 57.93 ± 3.91 mL/kg/min. When normalized to LBM2/3, V̇O2max was 257.27 ± 14.51 mL/kgLBM2/3/min, with values ranging from 232.30 to 295.90 mL/kgLBM2/3/min. For average values of MAS and MSS, see Appendix B. For all 31 players, match-specific running data were available, totaling 619 matches (mean per player: 20.0 ± 11.3); these data are shown in aggregated form together with the physical capacities in |
Clustering based on combinations of physical capacities and match-specific running performance |
Clustering analysis was performed on the 24 players who completed both the sprint and endurance tests. They all had available match data, comprising a total of 458 individual match observations collected across two full competitive seasons, with an average of approximately 19 matches per player. Using The SPR group had significantly higher sprint capacity than AVG (25.22 ± 0.66 vs 24.46 ± 0.44 km/h; mean difference = 1.07, 95% CI [0.29, 1.85], Additional cluster characteristics, including playing position and competitive level, are displayed in |
Relationship between sprint and endurance capacities |
To establish the relationship between sprint and endurance capacity, the V̇O2max normalized to LBM2/3 was plotted against the 20-meter sprint speed ( |
Relationship between physical capacity and match-specific running performance |
Physical capacities were plotted against their corresponding match-specific running performance ( A moderate but significant positive relationship was also observed between normalized V̇O2max and average match distance at moderate and high intensity ( The supervised subgroup discovery analysis showed that belonging to the SPR cluster was the most important predictor for increased sprint distances during matches ( |
|
|
The present study set out to evaluate the potential of machine learning to identify players with unique combinations of sprint and endurance capacities, and their match-specific running performance. |
The potential of unsupervised machine learning in identifying player clusters |
In this study, unsupervised machine learning, in particular One possible explanation for these results is that players with high sprint or endurance capacities are recognized by coaches and deployed in matches in accordance with those capacities. For these players, focusing on their specific playing style and incorporating tailored sprint or endurance training might be of great importance. Conversely, as shown in Cluster analysis using unsupervised machine learning allows for identification of players who may benefit from alternative strategic roles during matches, are at risk of overuse, or could benefit from individualized training. This information can assist coaches in designing tailored training programs for individual athletes and optimizing overall match strategy. |
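To make the idea of clustering on four input variables concrete, the sketch below groups six hypothetical players by sprint speed, normalized V̇O2max, match sprint distance and MIR + HIR distance. The clustering algorithm and normalization used in the study are not named in this text, so k-means and z-scoring are used here purely as illustrative stand-ins. All data values, function names and starting centroids are assumptions made for the example.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column to mean 0 and SD 1 (assumed normalization)."""
    stats = [(mean(col), pstdev(col)) for col in zip(*rows)]
    return [[(v - m) / s if s > 0 else 0.0 for v, (m, s) in zip(row, stats)]
            for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, init_centroids, max_iter=100):
    """Plain k-means with fixed starting centroids; returns (labels, centroids)."""
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)),
                          key=lambda k: sq_dist(p, centroids[k]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [mean(col) for col in zip(*members)]
    return labels, centroids


# Hypothetical players (not study data). Columns: 20 m sprint speed (km/h),
# VO2max (mL/kgLBM^(2/3)/min), sprint distance (m), MIR + HIR distance (m).
players = [
    [25.8, 250.0, 320.0, 1900.0],  # sprint-oriented profile
    [25.5, 255.0, 300.0, 2000.0],  # sprint-oriented profile
    [24.0, 285.0, 150.0, 2900.0],  # endurance-oriented profile
    [24.2, 280.0, 160.0, 2800.0],  # endurance-oriented profile
    [24.3, 255.0, 170.0, 2100.0],  # average profile
    [24.1, 258.0, 160.0, 2000.0],  # average profile
]

scaled = zscore_columns(players)
# Deterministic start: one player from each hypothetical profile.
labels, centroids = kmeans(scaled, [scaled[0], scaled[2], scaled[4]])
print(labels)
```

Scaling the variables first keeps the distances in meters from dominating the speed and V̇O2max values, which is why some form of normalization is needed before distance-based clustering.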
Absence of an inverse relationship between sprint and endurance capacities in young elite soccer players |
We hypothesized an inverse relationship between sprint and endurance capacity in soccer, corresponding to the distinct SPR and END groups revealed by the clustering and previous observations of such a relationship in cyclic sports (van der Zwaard et al., Age and training status are likely important factors influencing the differences in the relationship between sprint and endurance capacities observed in different sports. The previously studied cyclists (25 ± 7 years) (van der Zwaard et al., Lastly, we measured V̇O2max and sprint speed at the whole-body level. In this context, it should be noted that both physical traits depend on multiple factors, and these whole-body measurements are not direct reflections of muscle fiber characteristics. For example, the ability to transport oxygen to the muscles has a large influence on whole-body V̇O2max (Bassett and Howley, |
The relationship between physical capacities and corresponding match-specific running performance |
Only moderate correlations were found between measured sprint and endurance capacities and match-specific running performance, with explained variances of 17% and 15% for sprint and endurance, respectively. This suggests that factors other than physical capacities might be more crucial for determining match-specific running performance. The subgroup discovery analysis confirmed the importance of the clusters for interpreting differences in match-specific running performance and showed that central defenders (who were all part of the AVG group) covered significantly less distance at moderate and high intensity during matches, demonstrating that playing position is a factor that can influence match-specific running performance. Central defenders have fewer opportunities for longer sprints because they operate in a region of the pitch that is typically densely packed with players. This spatial constraint and their primarily defensive tactical role are likely reasons for their lower moderate- and high-intensity running (Bradley, Another factor influencing this relationship is running economy, which refers to the efficiency with which players use oxygen during submaximal running. A better running economy may enable some players to sustain higher match running outputs without necessarily having a higher V̇O2max. A measure that does incorporate running economy is the maximal aerobic speed (MAS). In this study, we primarily focused on V̇O2max because it is more closely related to actual muscle-level aerobic capacity. However, MAS may be a practically relevant measure, as it reflects the endurance demands in a match context. 
The relationship between MAS and endurance match-specific running performance (Appendix B) indicated that MAS was indeed substantially related to moderate- and high-intensity running during matches ( The moderate to substantial relationships between physical capacity measures and corresponding match-specific running performance imply that players with lower aerobic or sprint capacities can still exhibit greater running output than peers with higher physical capacities. However, increasing match-specific running performance might not always translate to increased match performance. Literature indicates that in won matches, wide midfielders and forwards significantly increase their overall distance covered, particularly at speeds above 21 km/h, while full-backs, central defenders, and central midfielders tend to cover significantly less distance, likely within the 17-24 km/h range (Chmura et al., Overall, our results indicate that many soccer players do not specifically optimize towards either sprint or endurance performance. Instead, there may be minimal threshold levels for these capacities that players must meet to perform at an elite level within the U18 and U21 age groups. For sprint capacity, this threshold appears to be an average speed above 23 km/h over a 20-meter sprint, while for endurance capacity, it is more than 230 mL/kgLBM2/3/min or 50 mL/kg/min. These values are lower than those reported for cycling and rowing, raising the question of whether these players could be further developed physiologically and how such development might influence their match performance. |
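The normalization of V̇O2max to lean body mass and the tentative thresholds mentioned above can be sketched as follows. The unit mL/kgLBM2/3/min suggests that absolute V̇O2max (mL/min) is divided by lean body mass (kg) raised to the power 2/3; that reading, the function names and the example player are assumptions for illustration and are not taken from the study data.

```python
def vo2max_per_lbm_two_thirds(vo2max_ml_per_min: float, lbm_kg: float) -> float:
    """Absolute VO2max (mL/min) divided by lean body mass (kg) ** (2/3)."""
    if lbm_kg <= 0:
        raise ValueError("lean body mass must be positive")
    return vo2max_ml_per_min / lbm_kg ** (2.0 / 3.0)


def meets_minimum_thresholds(sprint_speed_kmh: float,
                             vo2max_norm: float,
                             sprint_min_kmh: float = 23.0,
                             vo2max_norm_min: float = 230.0) -> bool:
    """Check a player against the tentative thresholds suggested in the text."""
    return sprint_speed_kmh > sprint_min_kmh and vo2max_norm > vo2max_norm_min


# Hypothetical player: 70 kg body mass, 58 mL/kg/min, 64 kg lean body mass.
absolute_vo2max = 58.0 * 70.0  # mL/min
normalized = vo2max_per_lbm_two_thirds(absolute_vo2max, 64.0)
print(f"{normalized:.2f} mL/kgLBM^(2/3)/min")
print(meets_minimum_thresholds(24.4, normalized))
```

With a lean body mass of 64 kg, the scaling factor is 64^(2/3) = 16, so an absolute V̇O2max of 4060 mL/min corresponds to about 254 mL/kgLBM2/3/min, which is in the range reported in the Results.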
Limitations and perspectives |
A sample size of 24 young elite soccer players could be viewed as relatively small for machine learning applications, but unlike traditional statistical analyses, the statistical power in cluster analysis is largely driven by the degree of cluster separation rather than sample size (Dalmaijer et al., The results from the The inverse relationship between sprint and endurance capacity observed in cyclic sports was based on Wingate power output as a sprint measure. In this study, we did not conduct Wingate tests but opted for a 20-meter sprint test to measure sprint capacity. This test is more soccer-specific and representative of in-game sprint performance. It is, however, a less direct measure of leg power than a Wingate test. Instead, it reflects a combination of leg power and other factors, such as running technique. This difference in methodology makes it difficult to draw direct comparisons to the inverse relationship found in cyclists and rowers, especially since previous literature has shown that the explained variance between 20-meter sprint speed and Wingate power was only 19% (Nikolaidis et al., The exercise tests were performed in pre-season. This timing could present limitations, as players may not have fully reached their peak physical condition. However, we selected this point in time because it falls between the two seasons from which we collected match-specific running performance data. Additionally, conducting the tests outside of the competitive season helps to minimize variation caused by differences in in-season training loads and match-related fatigue. The question that remains is whether coaches should more strongly encourage their players to play according to their physical abilities. There are players who have certain well-developed physical capacities but are not identified within the SPR or END groups. This is because they are not utilizing these abilities in the match to the same extent as the players identified in these groups. 
Similarly, there are players who display high match-specific running performance despite having lower physical capacities. Future research should aim to determine if optimizing physical capacity to match running performance or employing players based on their physical capacity can lead to improved overall performance. Cluster analysis based on physical capacity and match-specific running performance could be further enriched by incorporating players' morphological characteristics to better understand the physiological basis behind the identified clusters. Muscle properties, such as quadriceps muscle volume, fiber length, pennation angle, and physiological cross-sectional area, could shed light on why certain players excel in either sprint or endurance performance (Weide et al., Moreover, data from mixed sprint and endurance exercises, like repeated sprints or shuttle run tests, could be collected and might provide extra information about the nature of the players in each cluster. Additionally, incorporating longer sprints could give a clearer view of the maximal sprint speed of the players. Furthermore, these clusters could be used to evaluate training effects and possibly to identify groups of players who respond well to specific training stimuli. In the future, cluster analysis could serve as a foundational tool for predictive modeling, with potential applications in talent identification and injury prevention. Clustering by physical capacity and match-specific running performance could help predict and better discriminate which young athletes might excel in certain roles or which players are at higher risk of injury, allowing for more tailored training and development strategies. |
Practical implications |
The training staff identified two players who, based on their playing style, would be expected to align with those in the endurance cluster. It was surprising to them, however, to discover that these players fell significantly short in terms of endurance capacity compared to the cluster. This discrepancy could be an indication that these specific players would benefit from individualized endurance training to enhance their physical capabilities and optimize their performance in line with other endurance-oriented players. Additionally, the training staff identified players in the sprint cluster as those who leveraged their speed as a primary strength during matches. In contrast, players with similar sprint speeds but lower sprint distances tended to rely more on technical skill or tactical awareness during matches. These players were often positioned more centrally, which resulted in covering less sprint distance. Conversely, players in the sprint cluster, typically playing as wingers or backs, were able to maximize their sprint distance due to the nature of their roles, which demanded more movement and subsequently higher sprint workloads. This suggests that, in some cases, players with higher technical ability can perform effectively without relying on a large volume of sprint meters, while other players with superior physical capacities may potentially compensate for lower technical skills by being positioned in roles that allow them to fully exploit their speed. This distinction highlights that physical capacity is just one element of overall performance, and that a holistic assessment that includes technical and tactical abilities is crucial. By combining these factors, coaches could establish different benchmarks for players, such as those with high technical skills or sprint-oriented players, as identified by clustering. These insights could guide training adjustments, positioning strategies, or match strategy to maximize player performance. |
|
|
Clustering revealed distinct subgroups of soccer players with unique combinations of sprint and endurance capacities, and sprint and endurance match-specific running performance. In young elite soccer players, sprint and endurance capacities showed positive, moderate relationships with corresponding match-specific running performance but did not seem to be mutually exclusive or opposing. Derived clusters allow for identification of players who may benefit from alternative strategic roles during matches, are at risk of overuse, or could gain from individualized training. This information can assist coaches in designing tailored training programs for individual athletes and optimizing overall match strategy. |
ACKNOWLEDGEMENTS |
We would like to thank all the players who participated in this study for their efforts and dedication. Moreover, we express our gratitude to the soccer club where the present study was conducted for allowing our testing procedure to be part of the yearly medical examination and for its willingness to share the match data of its players. The authors thank Bart Vromans and Nicole Koopman-Verbaarschot for their great assistance with the data collection. We are thankful to Anita Loomans and Anna TopSupport for allowing us to utilize their exercise testing room, treadmill ergometer, and heart rate monitor during the testing week. Finally, we thank all physicians who supervised these measurements. The datasets generated and/or analyzed during the current study are not publicly available due to the privacy rules that are in place at the professional soccer club in question. However, anonymous data are available upon request from the corresponding author, Richard Jaspers. This work was supported by an NWO-ZonMW National Sportinnovator Grant (grant number: 538001045). The authors report no conflict of interest. The results of the present study do not constitute endorsement by ACSM. The results of the study are presented clearly, honestly, and without fabrication, falsification, or inappropriate data manipulation. |
REFERENCES |
|