The Creation of Predictive Stuff Metrics: Unveiling the pSTFERA Suite

Predictive pitching statistics are nothing new. Ever since Tom Tango and Clay Dreslough’s invention of FIP, several metrics have been conceived to gauge a pitcher's future performance more accurately. Most of these methods use regressions, deriving constants and weights for various components of a pitcher’s season and summing them to predict next season’s ERA. This approach has its limits, however - these numbers are built only from the pitcher’s results, not from anything upstream of those results. Given my recent research into Stuff models and my creation of Arsenal+, pitch-level evaluation has been prominent in my mind. Given the predictive power of these pitch-level models in estimating future RV/100 as well as ERA (as documented by Fangraphs here), I figured a conversion of such estimates could be helpful. But that alone is not enough to be predictive.


By nature, all of the Stuff models are considered descriptive. While they have predictive qualities (as repeatedly demonstrated), they are meant to describe a pitcher’s stuff more objectively. Hence, a 1:1 translation of Stuff to ERA does not achieve the goal of prediction. With the right fine-tuning, however, it absolutely could - and that fine-tuning is what produced this new suite of predictive stuff-based pitching metrics: pSERA, pLERA, pPERA, and pAERA. Collectively, these will be referred to as pSTFERA.


What is pSTFERA?


pSTFERA is the group of predictive pitch-level metrics that integrates pitch modeling to more accurately predict future performance from current performance. pSERA (Predictive Stuff ERA), pLERA (Predictive Location ERA), pPERA (Predictive Pitching ERA), and pAERA (Predictive Arsenal ERA) are all based on the primary pitch models that are available: Stuff+, Location+, Pitching+, and Arsenal+. As you likely know, Stuff+ evaluates physical characteristics such as movement, release point, and extension to determine performance. Location+ uses only the coordinates of a given pitch to evaluate performance. Pitching+ combines the variables of both Stuff and Location to offer a wider look at determining value. Arsenal+, the lesser-known pitch metric that I previously invented (linked here), considers the current pitch’s Pitching aspects as well as how those aspects differ from the prior pitch.


As the "predictive" label implies, all of the pSTFERA metrics are more predictive of future performance than the main models. They were built with the actual Stuff scores still at the forefront, applying only additional statistical techniques and comparisons to shift them from a descriptive frame to a predictive one.


Making the models required starting with a base: the respective metric (Stuff, Location, Pitching, Arsenal) for the 2022 and 2023 seasons. Overall Stuff scores were chosen over pitch-adjusted scores. By this, I mean that on the available apps, every score is adjusted against the average of a given pitch type rather than overall expected performance. Taking the overall version gave a picture better aligned with reality in terms of which pitches were actually providing value.


A few metrics and details, such as Games, Innings Pitched, and Handedness, were taken and binned into categories using a K-Means clustering algorithm for their respective seasons. If you’re not familiar with the technique, it is an unsupervised machine learning method that places a specified number of centroids at random, assigns each data point to its nearest centroid, and then iteratively recomputes the centroids and reassigns points until the clusters stabilize.
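For readers who prefer code, here is a minimal sketch of that binning step using scikit-learn; the DataFrame and column names are placeholders of my own rather than the exact features used in the actual model.

```python
# A minimal sketch of the binning step, assuming a pandas DataFrame of
# pitcher-seasons. The column names used below are illustrative placeholders,
# not the exact inputs of the real model.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def assign_clusters(seasons: pd.DataFrame, features: list, n_clusters: int) -> pd.DataFrame:
    # Standardize so a large-scale feature (e.g. innings pitched) does not
    # dominate the distance-to-centroid calculation.
    X = StandardScaler().fit_transform(seasons[features])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    out = seasons.copy()
    out["cluster"] = km.fit_predict(X)  # each pitcher-season receives a cluster label
    return out
```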


This is done because the predictive aspect of this metric rests entirely on the idea that stuff metrics will, over time, regress to the mean of pitchers similar to them. Sorting pitchers into these types allows that average to be more specific to the given player, which should in theory prove more accurate in the long run than simply assuming the overall average of all pitchers. The choice of clustering variables is the result of rigorous testing against other statistics for separating pitchers, in hopes of deriving distinct types that are more predictive; the aforementioned factors proved to be the most important. To refine the metric, a silhouette analysis was conducted (this scores how well each point fits its own cluster relative to neighboring clusters) across a range of cluster counts (1 to 50 were simulated) to ensure that the optimal number of groupings was used.
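A rough illustration of that cluster-count search is below. Silhouette scores are only defined for two or more clusters, so this hypothetical sweep starts at two; the exact range and scoring details of the real analysis may differ.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_cluster_count(X, k_max: int = 50) -> int:
    """Return the cluster count with the highest average silhouette score.

    X is a standardized feature matrix (e.g. from the previous sketch).
    """
    scores = {}
    for k in range(2, k_max + 1):  # silhouette is undefined for a single cluster
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)
```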


The mean and standard deviation of each cluster’s ERA allow the metrics to be converted to an ERA scale. The mean and standard deviation are also taken for the overall stuff metric within the cluster, which allows a cluster-based z-score to be derived for each pitcher-season. That z-score is then applied to the cluster’s ERA mean and standard deviation, generating a predictive ERA for each metric.
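Concretely, the conversion could be sketched as below. I am assuming the sign convention that a better-than-cluster-average Stuff score should translate to a lower-than-cluster-average predicted ERA, which is why the z-score is subtracted; the column names are placeholders.

```python
import pandas as pd

def stuff_to_predictive_era(df: pd.DataFrame, stuff_col: str = "stuff_overall") -> pd.Series:
    """ERA-scaled prediction from a stuff-style score, computed within each cluster.

    Assumes df already carries 'cluster' and 'era' columns; 'stuff_overall'
    is a placeholder for whichever overall model score is being converted.
    """
    g = df.groupby("cluster")
    # Cluster-based z-score of the stuff metric for each pitcher-season
    z = (df[stuff_col] - g[stuff_col].transform("mean")) / g[stuff_col].transform("std")
    # Apply the z-score to the cluster's ERA mean and standard deviation.
    # Minus sign: better stuff (higher z) is assumed to mean fewer runs allowed.
    return g["era"].transform("mean") - z * g["era"].transform("std")
```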


The above reasoning and calculations allow the pSTFERA suite to be read directly like ERA. In theory, a 3.00 pSTFERA is successful in the same way that a 3.00 ERA is, with the former simply being more meaningful for evaluating current performance and predicting future performance. The usage of these predictive numbers is very similar to that of any predictive metric. In player development, these predictions could prompt different decisions based on projected versus actual performance: a player with above-average pSTFERA scores and a mismatched ERA suggests that serious intervention is not necessary and that staff should not overreact to a struggling stretch, while a reaction could easily be justified if the opposite occurs. In player evaluation, the suite aims to expose undervalued and overvalued players. Since evaluation is largely an attempt to predict a player's future performance, these metrics give a better idea of what is likely to happen, making the justification for certain rostering and trading decisions more sound. That value, of course, rests on the assumption that the metric truly does add superior predictive power.


Evidence


Unlike some of the other metrics I’ve created or analyzed, evaluating the pSTFERA suite is a bit less straightforward. Each metric’s expected level of performance differs by a wide margin. Without even looking at the data, one can surmise that pSERA is likely more predictive than pLERA. It is relatively well established that Stuff+ is more predictive than Location+, as Stuff+ incorporates a larger number of important variables than pitch location alone. Hence, putting them on the same level would make no sense, and would potentially discount or overvalue a given metric based purely on that comparison.


To deal with this, each metric will be examined on its own. Unlike in past articles, only MLB pitchers will be tested in this section - specifically, pitchers who played during the 2021-2023 MLB seasons. Through this evidence, each metric can be properly evaluated for what it is actually worth in a sabermetric sense.


pSERA


In preliminary looks at and testing of the metric, Predictive Stuff ERA had the best performance of all the metrics in almost every sample. That was quite surprising, given pAERA’s inclusion of several additional variables that describe pitching. To measure this, I will once again be using the recurring regression charts. In case the reader has not read any of my stuff-based writing for Prospects Live before, these charts run a regression repeatedly as a minimum sample-size threshold (in this study, innings pitched) changes, reviewing performance as sample size changes. Put more clearly, a regression is run on this year’s pSTFERA stat versus next year’s ERA, yielding an RMSE or R² value; the threshold is then raised (if it was 1 IP, it is now 2 IP) to include only the players who meet the new minimum, and the process repeats. To show the sample size as the repeated regressions occur, the right y-axis corresponds to a bar chart behind the graph with the number of pitchers being analyzed.
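If the mechanics are easier to follow in code, here is a simplified sketch of those recurring regressions, computing an RMSE at each innings-pitched threshold. The column names and the use of a simple linear fit are my own assumptions, not the exact implementation behind the charts.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def recurring_rmse(df: pd.DataFrame, metric: str, max_ip: int = 150) -> pd.DataFrame:
    """RMSE of this year's metric vs. next year's ERA at rising IP thresholds."""
    rows = []
    for min_ip in range(1, max_ip + 1):
        sample = df[df["innings_pitched"] >= min_ip]
        if len(sample) < 2:  # stop once too few pitchers clear the threshold
            break
        X = sample[[metric]].to_numpy()
        y = sample["next_year_era"].to_numpy()
        preds = LinearRegression().fit(X, y).predict(X)
        rows.append({"min_ip": min_ip,
                     "rmse": np.sqrt(mean_squared_error(y, preds)),
                     "n_pitchers": len(sample)})  # the bar chart behind the line
    return pd.DataFrame(rows)
```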


To justify the need for this measure, the first test is a comparison against simple ERA. Challenging the assumption that past performance equals future performance, these were the results for 2022-2023:

As the chart makes fairly clear, pSERA beats ERA by a good amount. Knowing that it is more predictive than simple past performance, the next test pits it against the other predictive metrics out there. The predictive metrics mainly used for comparison are the ones freely available on Fangraphs.


These results are promising. While the metric does not start off as the most predictive, it proves superior by a good deal across the 60 to 120 IP minimum thresholds. It tails off at the end, although the sample is down to fewer than 50 pitchers by the time that happens. xFIP and SIERA stand as the metric's biggest competitors, although the separation in RMSE at the later thresholds showcases where the value of this new metric lies.

pLERA


To preface, the grid coordinates of a given pitch can only tell one so much, so expecting superior predictive value from a model that uses only two variables is unreasonable. Given the widespread understanding of the ERA scale, Predictive Location ERA mainly serves to provide a better read on future location value. Knowing the limits of the metric, the results are in line with what one would expect: compared against the past season's ERA, pLERA was not able to compete very well.

The predictive version of a metric that solely evaluates location is less successful at predicting future performance than simple past performance. These results shouldn’t be shocking - location tells such a small part of the story (as I’ve emphasized before), and past ERA accounts for many factors that a pitcher's location ability alone misses. It did compete somewhat at the early thresholds, but it is now blatantly apparent that such a metric shouldn’t be used to predict overall performance on its own. While the reader can probably already guess the answer, this is how the model performed against the other ERA estimators.

pLERA was blatantly worse than every ERA estimator out there. Given that ERA estimators usually outperform ERA in predictive value, this could have been surmised from basic logic, but these results hammer the point home. Predicting a pitcher’s future location ability can be useful when trying to separate signal from noise, but it is not meant to indicate overall performance.

pPERA


Predictive Pitching ERA combines both Stuff and Location, although the forthcoming results are a bit mixed. Pitching+ is generally a strong predictor of future performance, although there are a few instances where Stuff+ has proved more predictive. The increased number of variables may have something to do with this - especially with the Location+ side. As was apparent in the pLERA section, Location is generally not very predictive of performance. This may have somewhat muddied the results. With that in mind, this metric is still very strong in prediction, as seen against past ERA.

pPERA had a strong performance, having lower RMSE values than ERA for the majority of the tested samples. Given that it outperforms past ERA, it can now be put up against the other estimators.

The results are noticeably less strong than those of pSERA. pPERA seemingly outperformed pSERA and the other Fangraphs metrics early in the minimum-threshold range (around 10 IP), although such a marginal difference could be owed to random variation. It outperformed the other metrics over roughly the same window as pSERA (around the 60-120 IP range), although the gap is noticeably smaller. As mentioned beforehand, Location's inferior ability to predict future outcomes may have hurt the results. pPERA still serves as a superior metric to those currently available, although the Stuff version would still be given preference.

pAERA


While Arsenal+ is the newest metric of the bunch, its value against future RV/100 proved superior to the competition. That said, RV/100 and ERA are not directly aligned, although they are highly correlated; differences in what counts as earned, and in the calculations generally, lead to some discrepancies, though both remain powerful evaluation tools. Like the others, pAERA will first be pitted against past ERA.

pAERA outperformed ERA, serving as the third metric to do so out of the four total metrics examined. With that test out of the way, the only thing left is the competition.

While not the worst-performing metric, pAERA admittedly struggled. The only apparent predictive superiority came during the 10-30 IP minimum thresholds, which is surprising given Arsenal+’s struggle to stabilize during testing; the predictive treatment likely allowed it to stabilize somewhat faster. Beyond those thresholds, it was in line with both xFIP and SIERA, with only marginal variance causing the metrics to flip-flop throughout. Its weaker performance can likely be attributed to the location component, as discussed in detail in the section above.


Given that Arsenal+ is the metric I personally put together, these results suggest that the metric may need to be altered or adjusted (at least in regard to location) to provide further predictive value to baseball.

Comparing Stickiness


For any given metric, it is preferable that the amount of skill a player has actually comes through in the number. A common test for this is measuring stickiness: how much of a player's value in the metric can be explained by his value in that same metric the year before. If a metric is incredibly sticky, one can argue it captures a skill, since performance is repeatable. If it is not sticky, the metric likely does not measure skill well, since performance in it is generally non-repeatable. Below is the stickiness of the pSTFERA suite, along with a separate graph showing the stickiness of the Fangraphs predictive metrics.
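As a rough sketch of what that stickiness test might look like in code: pair each pitcher-season with the same metric the following year, regress one on the other, and record the R². The column names and the consecutive-season pairing are my own assumptions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def stickiness_r2(df: pd.DataFrame, metric: str, min_ip: float = 0) -> float:
    """R^2 of a metric regressed on its own value the following season.

    Assumes pitcher-season rows with 'pitcher_id', 'season',
    'innings_pitched', and the metric column (placeholder names).
    """
    nxt = df[["pitcher_id", "season", metric]].copy()
    nxt["season"] -= 1  # shift so next year's value lines up with this year's row
    paired = df.merge(nxt, on=["pitcher_id", "season"], suffixes=("", "_next")).dropna()
    paired = paired[paired["innings_pitched"] >= min_ip]
    X = paired[[metric]].to_numpy()
    y = paired[metric + "_next"].to_numpy()
    return LinearRegression().fit(X, y).score(X, y)
```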

Within pSTFERA, pSERA and pPERA proved to be the stickiest metrics, suggesting they are more indicative of actual skill than the others. Looking at the Fangraphs ERA estimators plus ERA, xFIP and SIERA were the stickiest, aligning with other baseball research identifying those two as the predictive metrics most closely tied to skill. To that point, it appears that pSERA surpasses every other metric examined in measuring skill through stickiness.

Comparing the two graphs, it seems clear that the pSTFERA suite is considerably stickier than most of the Fangraphs predictive metrics. The R² values for the Fangraphs metrics generally hovered between 0 and 0.4, while the new predictive metrics varied between 0 and 0.6. The latter also stabilized much more quickly, becoming significantly predictive of themselves around the 10 IP minimum threshold; the former never reached the R² value that pSTFERA achieved at 10 innings. Hence, it is safe to conclude that, based on stickiness, the pSTFERA suite is more indicative of skill than the other metrics.

Reasoning

Clustering Variables

The process of deciding which clustering variables to ultimately use was fairly rigorous. To choose the right group, the K-Means model has to successfully differentiate between types of pitchers without including so many variables that the differences are drowned out by noise. In my initial model, I attempted to use ten statistics that are broadly predictive of ERA along with Games and Innings Pitched - this was not very successful. The silhouette coefficient simulation suggested that the model was optimal at around three clusters with a relatively low score, meaning it was failing to meaningfully separate the groups.

The inverse was then attempted, with just Games, Innings Pitched, and ERA. This produced improved results and suggested a larger number of clusters, although there was room for improvement. Ultimately, through lots of testing, adding handedness proved optimal, producing eight distinct clusters of pitchers. These subgroups allowed the predictive calculations to be more specific to the player without getting so specific that the final numbers became hard to believe and shrouded in noise.
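Tying the earlier sketches together, the final configuration described above would correspond to something like the following, where `seasons` is the placeholder pitcher-season DataFrame and the column names remain assumptions of mine rather than the model's exact inputs.

```python
from sklearn.preprocessing import StandardScaler

# Hypothetical end-to-end use of the earlier sketches with the final variable set.
# Builds on assign_clusters, best_cluster_count, and stuff_to_predictive_era above.
features = ["games", "innings_pitched", "era", "is_lefty"]
X = StandardScaler().fit_transform(seasons[features])
k = best_cluster_count(X, k_max=50)            # the article's search landed on eight
clustered = assign_clusters(seasons, features, k)
clustered["pSERA"] = stuff_to_predictive_era(clustered, stuff_col="stuff_overall")
```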

Overall Stuff vs. Pitch-Adjusted Stuff

The decision to use Overall Stuff rather than Pitch-Adjusted Stuff for the pSTFERA stats was an active choice to lean into sabermetric reasoning rather than relying solely on mathematical proof. When comparing the two versions, the predictive RMSE and R² values across the different minimum-threshold samples were almost equivalent: one might be 0.01 to 0.02 better (in RMSE or R²) for a given sample, then fluctuate in the other direction and hand the advantage back.

From a sabermetric standpoint, Overall Stuff resonates with actual production more than Pitch-Adjusted Stuff. It is well established that sliders, on average, are more valuable than fastballs; yet Pitch-Adjusted Stuff grades the average slider and the average fastball both as 100. Overall Stuff does not, meaning it accounts for the slider being the more valuable pitch. In theory, this should track better with the rate at which a pitcher limits runs. Further testing on the sample revealed only marginal, non-significant gains from this effect. Such indifference, however, leaves the choice to preference, and I can personally justify Overall Stuff as the more fundamentally sound option on this basis.

Potential Shortfalls

A Less-Than-Optimal Basis

As is made obvious throughout the article, the entire suite is based on ERA. This type of metric is an ERA estimator, meaning it attempts to predict future ERA. That is the common approach among predictive statistics - plenty of people put value in having the best possible estimate of it, whether because fantasy baseball typically values earned runs or because people are simply more familiar with the stat and its scale. Still, ERA itself is flawed. As many sabermetricians have long pointed out, there are a lot of variables in a pitcher's ERA that he has little to no control over. Beyond being more predictive, this is why FIP was invented - to give pitchers credit based only on what we know they control (Fangraphs further explains the reasoning here). Further research has suggested that pitchers can exercise some control over barrels (as linked here), but the point remains the same: plenty of factors beyond these aspects are baked into ERA. The classification of an earned run is also extremely arbitrary, making ERA less indicative of skill than one might hope.
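For context, FIP's standard formula makes that "only what pitchers control" idea concrete: it uses only home runs, walks, hit-by-pitches, and strikeouts, plus a league constant recalculated each season so that league-average FIP matches league-average ERA (typically in the low 3s).

```python
def fip(hr: int, bb: int, hbp: int, so: int, ip: float, constant: float = 3.15) -> float:
    """Standard FIP formula; the league constant shifts slightly year to year."""
    return (13 * hr + 3 * (bb + hbp) - 2 * so) / ip + constant
```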

For future attempts at and alterations of such a metric, it may be worth not treating it as an ERA estimator at all and instead opting for a FIP estimator, or even a defense- and park-adjusted RA/9. This would at least provide a predictive measure of the aspects of the game the pitcher is known to influence more than ERA, giving a better idea of true skill that goes beyond the stickiness shown above.

Clustering Adjustments

While extensive testing was done to find the cluster pairings that performed best, I did not try every possible combination of metrics and measures. The current clusters did not include release-point data or complete arsenal data, which may have been more effective at grouping pitchers and predicting future performance. Usage rates for given pitches were attempted, but they appeared to confuse the model with too many variables and yielded lackluster results.

The struggle with clustering may also have had something to do with the samples themselves, as only pitchers who pitched in back-to-back years were considered for evaluation. No in-season evaluation occurred, which, with the additional sample size, could potentially have produced better results for more variables. Future iterations of this metric, and/or versions built on a different basis, will give these concepts further consideration when optimizing the number of sub-groups.

Conclusion

While the results were not what I expected when I took on this project, I believe they were ultimately fruitful additions to the sabermetric space. pSERA proved a better ERA estimator than any common, publicly available metric on Fangraphs, and pPERA also fared very well against the competition. Given the level of stickiness shown by almost all of the metrics in the pSTFERA group, they are likely more indicative of actual pitcher skill. This leads to the overall conclusion: some stuff-based predictive metrics likely offer superior predictive ability in evaluating a pitcher compared to the other ERA estimators available on Fangraphs. Statistics like the pSTFERA suite explained throughout this piece are still in their infancy in the public domain - additional time and analysis will only improve their predictive ability.

Author’s note:

In 2020, Alex Chamberlain released an update covering all of the current ERA estimators. Readers of that piece will notice that numerous metrics mentioned there are not included in this article. This is because some of them are housed only on specific sites and are not easily accessible for analysis. The main additional metrics that would warrant comparison are FRA (Forecasted Run Average), pCRA (Predictive Classified Run Average), and pFIP (Predictive Fielder Independent Pitching, a newer metric not mentioned in the 2020 piece), as these would likely be the most competitive against the pSTFERA suite.

I also did not want to attempt to recreate these metrics, as I believe it would not be fair to compare different time frames and possibly different calculations. A future piece or update of the model may include acquired data for these metrics, but for now it is best to take the suite for what it was measured against in this piece.