Breaking down the data science of the FAST models
I have received such an awesome response to my breakdown articles that I wanted to go further in depth into the underlying data science of my methods. I hope you guys find this as interesting as I did, and I am very open to criticism from the community!
In order to better break fights down, I came up with the idea of separating the overall strategic and the tactical approaches fighters take per round. This article is going to be on the strategic aspect of my FAST models with further articles to come on the four tactical models.
So, as any data scientist will tell you, I started my project off with a question: What defines a fighter’s strategy? In my opinion, a fighter’s strategy is broadly where they CHOOSE (keyword here) to spend their time in the cage. For example, Masvidal did not choose to be in the bottom position against Usman, Usman made that choice for him. Masvidal wanted to stand and strike with him to try and knock him out. So now we intuitively have a broad measure of strategic approach: how much time does a fighter spend in a position they chose? To convert this idea into a quantitative measure, I broke it down into 3 measurable categories:
- Standing time
- Control time
- Controlled time
Controlled time is equal to the opponent's control time for that round and standing time is equal to the total round time minus control time and controlled time. Thankfully, ufcstats.com has total round time and control time so I was able to create these stats with the data they provide.
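As a rough sketch, the derivation described above could look like this in Python. The function name, argument names, and sample numbers are made up for illustration and are not taken from ufcstats.com or from the actual FAST pipeline.

```python
def split_round_time(round_seconds, control_seconds, opponent_control_seconds):
    """Split one fighter's round into standing, control, and controlled seconds."""
    # Controlled time is simply the opponent's control time for the same round.
    controlled_seconds = opponent_control_seconds
    # Standing time is whatever is left of the round.
    standing_seconds = round_seconds - control_seconds - controlled_seconds
    return {
        "standing": standing_seconds,
        "control": control_seconds,
        "controlled": controlled_seconds,
    }


# Hypothetical full five-minute round: 90 s in control, 30 s being controlled.
example = split_round_time(300, 90, 30)
print(example)
```

The three values always add back up to the total round time, which is a handy sanity check on the source data.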
So now we have our three quantitative measures, but we run into the problem of differing round lengths. We need to be able to compare a round that ends at 2:30 with a full 5-minute round. The model I use is sensitive to differences in raw round totals, so we need to convert them to ratios. To do that, I merged back in a fourth measure: total round time. This measure is equal to the sum of the previous three, which are all in seconds.
To make rounds of different lengths comparable, I divided each of the first three measures by the fourth one to get:
- % of time standing
- % of time in control
- % of time controlled
Since we only want the time for positions that the fighter chose to be in, we will focus on the first two.
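To see why the ratios make rounds of different lengths comparable, here is a small illustrative sketch. The function name and the numbers are hypothetical; the point is only that a round stopped at 2:30 and a full 5-minute round with the same proportions produce identical features.

```python
def time_shares(standing, control, controlled):
    """Convert raw seconds into shares of total round time (values from 0 to 1)."""
    total = standing + control + controlled
    if total == 0:
        raise ValueError("round has no recorded time")
    return {
        "pct_standing": standing / total,
        "pct_control": control / total,
        "pct_controlled": controlled / total,
    }


# Hypothetical rounds: a full 300 s round and a 150 s round with the same mix.
full_round = time_shares(180, 90, 30)
short_round = time_shares(90, 45, 15)
print(full_round)
print(short_round)
```

Only `pct_standing` and `pct_control` would then be carried forward as model inputs.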
I used a K-means clustering algorithm to look for differentiation between the two measures (in machine learning we call them features, which is what I will call them going forward). I don’t want to go too math heavy to explain how the algorithm works, but at a high level the model looks for clusters of similar rounds and labels them as such. I then looked at the statistics for the rounds in each cluster and came to a conclusion about what strategy each cluster represents.
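For readers who want to try this themselves, a minimal sketch with scikit-learn's `KMeans` might look like the following. The data here is synthetic (random time shares, not real UFC rounds), and the choice of scikit-learn, four clusters, and the `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for real rounds: each row is a random split of a round
# into (standing, control, controlled) shares that sum to 1.
rng = np.random.default_rng(0)
shares = rng.dirichlet([1.0, 1.0, 1.0], size=500)

# Keep only the two features used for clustering:
# % of time standing and % of time in control.
X = shares[:, :2]

# Fit K-means; every round gets an integer cluster label.
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = model.labels_
centers = model.cluster_centers_

print(centers)  # one (pct_standing, pct_control) centroid per cluster
```

The cluster numbers themselves carry no meaning; the interpretation comes from inspecting each cluster's averages afterwards.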
I used a traditional heuristic, the elbow method, to choose the number of clusters and came to the conclusion that four was the best choice. For those data science enthusiasts out there, my cosine K-means score was 0.827, which shows a high level of differentiation.
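A hedged sketch of the elbow method: fit K-means for a range of cluster counts and look for the point where the within-cluster sum of squares (scikit-learn's `inertia_`) stops dropping sharply. The second half computes a silhouette score with a cosine metric, which is only one possible reading of a "cosine K-means score" and is an assumption here. The data is synthetic, so the numbers will not match the 0.827 reported above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic (pct_standing, pct_control) features, not real fight data.
rng = np.random.default_rng(0)
X = rng.dirichlet([1.0, 1.0, 1.0], size=500)[:, :2]

# Elbow method: record inertia for each candidate number of clusters.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 2))  # look for the "elbow" in this sequence

# Assumed scoring step: silhouette score using cosine distance for k = 4.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels, metric="cosine")
print(score)
```

Plotting `inertias` against `k` makes the elbow easier to spot than reading the raw numbers.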
Now for the results. To generate the table below, I did a simple pivot table grouping using the cluster label and then averaged the two input features, i.e. % of time standing and % of time in control, to see what the average round looked like for each one.
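In pandas, that kind of summary could be sketched as follows. The column names and the four toy rows are hypothetical and exist only to show the grouping step.

```python
import pandas as pd

# Toy rounds with a cluster label and the two input features (as fractions).
rounds = pd.DataFrame(
    {
        "cluster": [0, 0, 1, 1],
        "pct_standing": [0.8, 0.7, 0.2, 0.1],
        "pct_control": [0.0, 0.1, 0.6, 0.7],
    }
)

# Average each feature within each cluster label.
summary = rounds.pivot_table(
    index="cluster",
    values=["pct_control", "pct_standing"],
    aggfunc="mean",
)
print(summary)
```

Each row of `summary` describes what the average round in that cluster looks like, which is what the class names are based on.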
Class | Strategic Class | % of Time In Control | % of Time Standing |
---|---|---|---|
0 | Striking Heavy | 3% | 76.4% |
1 | Balanced Grappler | 42.5% | 36.8% |
2 | Balanced Striker | 23.2% | 54.8% |
3 | Grappling Heavy | 68.3% | 16.5% |
As we can see, Class 0 spent 76.4% of their time standing and 3% in control, so I labeled this class Striking Heavy. For Class 1, the values are 36.8% and 42.5% respectively, so I labeled it Balanced Grappler. Class 2 gets Balanced Striker for its higher standing share of 54.8% against 23.2% in control, and finally Class 3 gets Grappling Heavy for its split of 16.5% standing and 68.3% in control.
As I explained to my manager on a project I did for work using K-means clustering, unsupervised clustering algorithms give a math answer that I, the data scientist, interpret into a business (or in this case MMA) answer. As such, these results are open to tons of speculation and scrutiny, and I welcome any and all questions about the above methods! I am also very open to updating my models and data warehouse based on valid critiques.