Introducing Sequencing+: A Strategy-Based Metric Quantifying Pitch Sequencing

The concept of pitch sequencing is not new. It has long been understood that the order in which a pitcher throws his pitches matters. The optimal order has been open to interpretation, but if an optimal strategy could be found, it could lead to better run prevention (or at least that is what’s believed). These questions came up during my creation of Arsenal+, a new metric considering how pitches play off one another. As Arsenal+ is inherently skill-related and sequencing is purely a matter of strategy, there was a question as to whether variance in strategic success would keep the metric from reflecting true skill as desired. Arsenal+ has since been adjusted for this, but the broader question of how to evaluate sequencing as a whole persists.


Many of the online pieces I reviewed on sequencing suggested ideal strategies or stressed the benefit of including location in analyzing how pitches correctly “sequence”. But location is a skill. Nothing publicly available judged the actual decisions that pitchers (or possibly their respective teams) made in choosing their strategy. Since I wanted to know, I created a metric to measure the expected success of a given pitcher based solely on their pitch choice strategy - Sequencing+.



What is Sequencing+?

Sequencing+ is a strategy-based metric that measures the expected run value of a given PA based on the sequence of pitch choices. Rather than being on a pitch level like some past metrics researched, this metric is calculated using aggregated PAs. This was done on both 2023 Triple-A data and 2022-2023 MLB data, providing thorough back-testing on its uses. It is worth noting that this metric is not attempting to be more predictive of future performance compared to Stuff models, but merely serves as a gauge of how well pitchers are generally following optimal sequencing strategies. 



In determining optimal strategies, a given PA’s sequence served as the primary independent variable, alongside a few considerations and adjustments, and the model was trained against the mean run value of a given event. As in prior work, these RVs were determined by taking each possible event's average delta run expectancy and assigning it to each row. The additional considerations were the number of pitches within a PA and lefty/righty matchups. These were included as dummy variables and later adjusted for, based on the respective averages of each group, in calculating the finalized metric. The ball-strike count was also included, although it was not adjusted for. The reasoning behind including these factors is explained later in the piece.
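To make the target variable concrete, here is a minimal sketch of how a mean run value per event and a platoon dummy could be built. The rows, event names, and the helper names `mean_run_values` and `platoon_dummy` are hypothetical illustrations, not the actual pipeline behind the metric.

```python
from collections import defaultdict

# Hypothetical rows: (event, change in run expectancy on that row).
rows = [
    ("strikeout", -0.30),
    ("strikeout", -0.20),
    ("single", 0.40),
    ("single", 0.50),
    ("walk", 0.30),
]

def mean_run_values(rows):
    """Average delta run expectancy for each event type."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for event, delta_re in rows:
        totals[event] += delta_re
        counts[event] += 1
    return {event: totals[event] / counts[event] for event in totals}

def platoon_dummy(pitcher_hand, batter_hand):
    """Dummy variable: 1 when pitcher and batter share a handedness, else 0."""
    return int(pitcher_hand == batter_hand)

event_rv = mean_run_values(rows)
# Every row with the same event receives the same mean run value as its label.
labelled = [(event, event_rv[event]) for event, _ in rows]
```

Assigning the event-level mean rather than the raw change in run expectancy keeps the label from depending on the specific base-out state in which the event happened to occur.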



With the mean RVs as the dependent variable, a machine-learning model was trained. The AutoML function from Microsoft’s FLAML library in Python was used to find the model type and hyperparameters that gave the best performance. The science behind the AutoML function was explained briefly in the piece regarding Arsenal+, although for a more thorough review of its function, feel free to check out this link. An LGBM model was ultimately selected and used for the expected run value predictions. In the Triple-A version, the model had a .1409 training RMSE over the entire sample. The MLB version had a .4002 training RMSE over the entire 2022-2023 dataset.



The overall Sequencing+ scores were calculated by grouping the expected run values into their respective sub-categories and generating means and standard deviations to yield z-scores. The z-scores were then used to calculate each pitcher’s average, with a lower z-score (fewer expected runs) being more impressive. The metric was scaled so that the average sequence from a pitcher in a given situation (from the sub-categories specified above) scored 100, with each point above or below 100 representing a one percent difference from the average (101 is 1% above average, 98 is 2% below average). This is similar to wRC+ and all of the Stuff metrics.
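The scoring step can be sketched roughly as follows. The subgroup definition (platoon flag plus pitches in the PA), the sample rows, and especially the `POINTS_PER_SD` scaling are assumptions made for illustration; the exact conversion from z-score to the 100-based scale is not spelled out here, so treat this as one plausible mapping rather than the actual formula.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical PA-level rows: (pitcher, subgroup, expected run value).
# A subgroup here is (platoon flag, number of pitches in the PA).
pas = [
    ("A", (1, 3), -0.10),
    ("B", (1, 3), 0.10),
    ("A", (0, 5), -0.05),
    ("B", (0, 5), 0.05),
]

POINTS_PER_SD = 10  # assumed scaling, not taken from the article

def sequencing_plus(pas, points_per_sd=POINTS_PER_SD):
    # Mean and standard deviation of expected RV within each subgroup.
    by_group = defaultdict(list)
    for _, group, xrv in pas:
        by_group[group].append(xrv)
    stats = {g: (mean(v), pstdev(v)) for g, v in by_group.items()}

    # z-score each PA against its own subgroup, then collect by pitcher.
    z_by_pitcher = defaultdict(list)
    for pitcher, group, xrv in pas:
        mu, sd = stats[group]
        z = (xrv - mu) / sd if sd > 0 else 0.0
        z_by_pitcher[pitcher].append(z)

    # Lower expected runs are better, so the sign of z is flipped.
    return {p: 100 - points_per_sd * mean(z) for p, z in z_by_pitcher.items()}

scores = sequencing_plus(pas)
```

Standardizing within each subgroup is what keeps a pitcher from earning credit simply for landing in favorable matchups or longer plate appearances.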



This model’s contribution to the baseball space starts with the obvious: explaining why a pitcher’s performance may deviate from the expected values of other models. In theory, pitchers can have the same stuff and perform very differently depending on the sequence of the pitches they use. In other words… FF -> FF -> CH does not equal CU -> CH -> FF. Being wholly based on strategy, the model and its scores can identify optimized strategies and isolate the less-than-optimal ones. The resulting adjustments are easy for a pitcher to make and serve as a base for improved performance without having to alter his arsenal completely.



Evidence


In proving the need for such a metric and its actual value, let’s start incredibly basic with a linear regression. Evaluating whether there’s a relationship between sequencing scores and corresponding overall run values, these were the results for Triple-A in 2023 with a minimum of 100 Batters Faced.

And on the MLB level for 2022 and 2023, with a minimum of a total of 100 Batters Faced:

Both results were deemed significant (P-Values <0.01), demonstrating that a relationship does indeed exist between sequencing and fewer runs, although those relationships were not very strong (as demonstrated by the low R-squared scores). However, Sequencing on its own was never really expected to explain much of the variance in runs allowed. It was approached as more of a small added variable, and these results somewhat confirm that its value is limited.

In the past article on Arsenal+, it was shown that a pitcher’s difference in stuff and location added to the predictive value of a pitcher compared to just stuff and location individually. A small but significant relationship between Sequencing+ and RV100 suggests that baseball dogma was correct - the order of pitches (regardless of stuff and past stuff) may have some predictive value and explain additional variance.

The same explanatory graphs as in the Arsenal+ article were used. To explain the charts, imagine a filter on the minimum number of batters each pitcher faced. A regression is run between the Sequencing+ scores and run values of the pitchers who pass the filter, and the result is plotted with the minimum number of Batters Faced on the x-axis and the R-squared value on the left y-axis. In the background, a bar chart shows the number of pitchers included in the regression, read against the right y-axis. Once the regression is charted, the filter is reset with a minimum one batter higher (from 1 to 2, let’s say). Now, only pitchers that meet the new minimum (2 BF) are used for the regression, and the new score is plotted. This repeats until a designated stopping point is reached, past which the sample is too small for the results to be meaningful.
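The filtering procedure described above can be sketched in a few lines. The pitcher tuples and the `min_pitchers` stopping rule are assumptions for illustration; the real charts may use a different stopping point.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy ** 2 / (sxx * syy)

def r2_by_min_bf(pitchers, max_min_bf, min_pitchers=3):
    """pitchers: list of (batters_faced, sequencing_plus, rv100).

    Returns one (min_bf, pitchers_kept, r_squared) row per threshold.
    """
    curve = []
    for min_bf in range(1, max_min_bf + 1):
        kept = [p for p in pitchers if p[0] >= min_bf]
        if len(kept) < min_pitchers:
            break  # assumed stopping rule: sample too small to be meaningful
        r2 = r_squared([p[1] for p in kept], [p[2] for p in kept])
        curve.append((min_bf, len(kept), r2))
    return curve

# Hypothetical pitchers: (batters faced, Sequencing+, RV100).
pitchers = [
    (1, 100, 2.0),    # tiny sample, noisy result
    (50, 110, -1.0),
    (120, 105, -0.5),
    (200, 95, 0.5),
    (300, 90, 1.0),
]
curve = r2_by_min_bf(pitchers, max_min_bf=150)
```

The second element of each row corresponds to the background bar chart (pitchers remaining), and the third to the plotted R-squared line.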

To set a baseline for how this metric performs, it was put up against basic RV100 scores, comparing first-half and second-half results given that they both met the minimum sample requirement. For the case of the AAA data, this was comparing the first and second half of the 2023 AAA season. For the MLB data, it was predicting 2023 data based on 2022 data. The Triple-A results are shown first.

As the reader can likely tell, the results flip-flopped halfway through the graph. A player’s prior RV100 was more predictive of future performance for most of the range - Sequencing+ was mainly behind RV100. The end of the results is a bit fuzzier as the sample size drastically decreases, but it does maintain significance. This provides a baseline conclusion that Sequencing alone is not more important than plain past performance in predicting future performance. To retest this, the MLB data was run.


Similar to the AAA chart, the MLB data showed that past RV100 was more predictive of future performance than just isolating for Sequencing and the adjustments. This is in line with what someone should expect with such a metric, as past performance can be more indicative of certain skills that will likely carry on to future performance. On the other hand, the sustainability of continuing to use Sequencing effectively is up in the air. In evaluating this, the stickiness of each metric, or the ability of a given metric to predict itself and showcase baseline talent, was tested.


For AAA, Sequencing was head and shoulders above every other metric in stickiness, with only a tail at the end, at a much smaller sample size, negating that. For MLB, Sequencing is relatively more sticky with the smaller samples included. It eventually dips below Stuff+, but then rises above everything else again as the minimum number of batters faced increases. In prior articles, it was hypothesized that the variance in successful sequencing strategy limited the stickiness of a given pitcher’s ability to have pitches play off one another, as demonstrated in Arsenal+. These results suggest the contrary, revealing that sequencing as an isolated factor contributed to the stickiness of the metric. Hence, the variance in the ability to effectively play pitches off of one another is likely due to some other factor. They also suggest that certain pitchers may have more of a skill for throwing optimized pitch patterns than others.

Sequencing+, or any other similar metric, should continue to be regarded as a strategy-based metric due to the amount of control a given pitcher has over the outcome. While Pitcher A and Pitcher B may have different Sequencing+ numbers, Pitcher B could easily pass Pitcher A if he were more familiar with optimized strategies, without having to physically improve himself. These factors all point to strategy, but the high stickiness does demonstrate that certain pitchers may have a skill for gravitating towards sequence patterns that limit more runs. This may be due to pitchers having more of a feel for patterns or just simple repetition that happens to be optimal. Further exploration of this needs to be conducted, although I feel an in-depth dive is best suited for a separate piece.

In switching to measuring the predictive value of these metrics, it is worth noting that the estimates so far have mainly relied on R-squared values. R-squared takes the correlation between two metrics and squares it to estimate the share of variance in one that is explained by the other. Rather than measuring the predictive accuracy of the metric, it describes how closely the trend of the scores tracks actual later performance. A high R-squared still provides a great picture of the relative accuracy in predicting future scores, but it doesn’t estimate the actual error.

To estimate that, the Root Mean Squared Error (RMSE) is taken, which measures the typical size of the gap between predicted and actual performance. RMSE was also the loss metric during training, with the model optimized toward a lower RMSE (a lower error). To establish the value of this metric in both a descriptive and predictive fashion, the results by minimum batters faced are shown below.
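A quick way to see why both measures are useful: a set of predictions can track the trend perfectly and still be off by a constant amount. The numbers below are made up purely to illustrate that difference.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Hypothetical RV100 values and two sets of predictions.
actual = [-1.0, 0.0, 1.0, 2.0]
shifted = [a + 0.5 for a in actual]   # perfect trend, constant miss
close = [-0.9, 0.1, 0.9, 2.1]         # imperfect trend, small misses

shifted_r2, shifted_rmse = r_squared(actual, shifted), rmse(actual, shifted)
close_r2, close_rmse = r_squared(actual, close), rmse(actual, close)
```

Here the shifted predictions have the higher R-squared but the larger RMSE, which is the gap between "accuracy of the trend" and "actual error".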

These graphs provide an especially great picture of the limitations of the metric. It is clear that the skill metrics drastically outperformed the strategy metric, which provides a basis that the actual level of talent a given pitcher has is much more important than their ability to strategize, at least regarding this narrow window. This should not be especially surprising to the reader, especially given the disparities in how MLB pitchers are compensated based on skill alone.

While it was mentioned above that there may be a skill aspect to the strategy metric of Sequencing+, any such skill is less predictive of future performance than the other clearly classified metrics. As a whole, these examinations provide a thorough review of the limitations of this metric, although they also establish its value in turning the previously subjective art of putting pitches together into more of a science. The reasoning behind certain choices made in this metric is explained below.


Reasoning Behind Metric Adjustments

Certain aspects of this metric and its earlier mentioned calculations deserve further explanation, which this section provides.


As mentioned above, the two main adjustments were lefty/righty and the number of pitches within a given PA. For lefty/righty, the reason is the same one that makes stuff play differently versus a lefty or righty. Some pitches are harder to hit in certain handedness matchups, so the optimal strategy changes depending on the circumstance. This was further supported in this article by Adam Daily, showing that certain platoons fare better in certain sequences. The adjustment was included because pitchers often face an uneven number of lefties and righties, and unearned sequencing credit should not be awarded simply because they happened to fall within a certain matchup.


The need for the number of pitches within a given PA became obvious when comparing the mean run value for each number of pitches in a PA at the AAA level. The results are shown in this tweet I published:

Only pitches 1-9 meet the minimum PA requirements to be given significance. As the number of pitches in a PA goes up, the pitcher generally does better. Hence, if a pitcher has higher pitch numbers per plate appearance, he may be given undue sequencing credit. Each PA was therefore adjusted for the number of pitches that it had.
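As a rough sketch of this adjustment, the mean run value can be computed for each PA length that clears a minimum sample, and each PA can then be measured against the baseline for its own length. The `min_pas` threshold and the sample rows are assumptions; the actual minimum PA requirement is not stated here.

```python
from collections import defaultdict

def mean_rv_by_pitch_count(pas, min_pas):
    """pas: list of (pitches_in_pa, run_value).

    Returns {pitches_in_pa: mean run value} for lengths with enough PAs.
    """
    groups = defaultdict(list)
    for n_pitches, rv in pas:
        groups[n_pitches].append(rv)
    return {
        n: sum(vals) / len(vals)
        for n, vals in sorted(groups.items())
        if len(vals) >= min_pas
    }

def adjust_for_length(pas, baseline):
    """Subtract the baseline for each PA's length; drop lengths without one."""
    return [(n, rv - baseline[n]) for n, rv in pas if n in baseline]

# Hypothetical PAs: (pitches in the PA, run value).
pas = [(1, 0.10), (1, 0.20), (3, -0.05), (3, -0.15), (12, -0.50)]
baseline = mean_rv_by_pitch_count(pas, min_pas=2)
adjusted = adjust_for_length(pas, baseline)
```

After the subtraction, a long PA is only credited for beating other PAs of the same length, not for being long.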

Potential Shortfalls


In considering these shortfalls, the limitation in what this metric is meant to achieve will be taken heavily into account.

Lack of individualized Sequencing consideration

In determining how the value of sequencing was predicted, the methodology that came into mind was taking all the samples and training the given sequencing patterns on what was most likely to minimize RV per BF. In doing this, it identified the most likely successful patterns and the effect of certain non-related variables on those patterns. To that end, each pitcher is being evaluated on how well his strategy aligns with the overall successful patterns, given the aforementioned adjustments. While adhering to a comprehensive strategy has apparent merits, I worry about the lack of individualized sequencing evaluation within certain matchups.

Curveballs and Sliders were preferred as first pitches in most optimal sequencing decisions as they are generally very effective. Suppose that, because of this, Pitcher A opts for one almost every time. He does so often enough that his first pitch is close to 100% CU/SL, which Batter A can assume when anticipating the first pitch. Batter A may be less successful against those pitches overall, but when he knows one is coming, he is bound to do better. In another scenario, say Pitcher A has no feel for those pitches that day, and the Changeup would likely be the better choice. Pitcher A will be penalized for not following the overall best strategy, despite his choice potentially being the optimal one.

The metric treats Sequencing as independent of Stuff, usage amounts, and matchups, when those factors likely contribute to optimized sequencing to some extent. That extent is unclear, as accounting for such measures leads to the introduction of assumed batter knowledge, which may lead to a further error by accounting for variables that may not affect the optimized sequencing strategy for all. It seems like it somewhat goes too deep into the rabbit hole, which is why the model assumed simplicity. The need for additional features considering these variables will be investigated, although I expect some challenges in preserving the integrity of a sole sequencing model and integrating the proposed variables.

Determining the role of count in Sequencing


In many of the pitch-level models created, count variables and adjusting for count have been integral in showcasing the true value of a given pitch. But as already noted, Sequencing+ is not a pitch-level metric - it’s a PA-level metric. And in evaluating the order of pitches, using count is somewhat tricky. The model currently accounts for the count but doesn’t adjust for it. This was done to reward pitchers for making optimized pitch choices within certain counts, although the same was not done for platoon matchups. The logic behind this was that a pitch choice being successful in a platoon matchup is more the result of the platoon itself than of actual skill in picking a pitch. In comparing platoon vs non-platoon predicted optimized sequences, the recommended choices were very similar, with different run values based on the matchup. For count, batters expect certain types of pitches in certain counts, making an optimized decision something that should be credited and not wiped out.

Under the usual circumstances, the count is adjusted out because hitters tend to do better or worse in certain counts, which would lead to a misjudgment of a pitcher’s actual performance if he often gets into those counts. But given that Sequencing+ considers the whole PA, the possible inflation of runs was estimated to be minimal. Still, this treatment isn’t consistent with that of matchup splits, meaning that this decision could be a potential shortfall in the technical soundness of Sequencing+. While v1 of Sequencing+ will reward count-based choices, this may be adjusted in future versions. It will continue to be investigated.

Analysis

The intricacies and now proven validity behind the actual quantification of the value of sequencing provide an entirely new way to approach pitching strategy. While this metric primarily highlights the players that are the best at sequencing, this given model also gives insight into situations in which the most optimized pitch sequence is being used, given an average-performing hitter and similar levels of stuff for given pitches. Of course, those assumptions are not usually the exact case, but the establishment of new null assumptions that can be rejected on an individual player basis provides a building block for sequencing strategy.


In looking into these strategies, particular start sequences within given scenarios and the proper adjustments will be examined for the AAA data. Preliminary research results for the MLB side revealed similar results deemed repetitive, negating the need for the reader to spend additional time reading those results. MLB Sequence analysis may be spread through other mediums.

Examining Pitch sequences

The first pitch is everything in a PA. The choice of pitch in this situation is extremely important and sets the tone for the rest of the time against the batter. Using the Sequencing model’s expected RV, these were the average numbers from each sequence with the designated start pitch (on PAs with fewer than 10 pitches).

The curveball was at the top, followed by variations of curveballs, changeups, and sliders. Four-seam fastballs were ranked 8th, which tracks with what modern baseball analytics says about the preference for breaking balls and offspeed. The difference in performance was also fascinating, with a first-pitch curveball estimated to limit runs almost 5x better than a Sinker. Each result was deemed significant by the shown PA count to the side, giving validity to some of the optimal strategies that pitchers could benefit from. Of course, if a given pitcher’s curveball/slider is subpar, the actual pitch to choose may vary. This is where Stuff would come in, which could lead to an accurate decision to deviate from the normal strategy.

The first pitch still matters, but what about the next move? The second part of the sequences and their expected result were viewed.

This is not surprising - all sequences listed begin with a slider or curveball variant. Given that these were the most successful first-pitches, it should be expected that the most successful two-pitch beginning combinations also start with these types. The secondary results, on the other hand, were not expected whatsoever.

The highest-ranked two-pitch sequence was a Curveball-Slider, but the three following had a second pitch that was not a curveball or slider type. Fastballs, when paired with Curveballs, were the most effective secondary part. This could be due to the contrast of the pitches, which was previously explored with Arsenal+. The first sequence to repeat the same pitch didn’t rank until 5th with KC-KC, and that result had the third smallest sample size within the filter, subjecting it to a higher degree of variance in its true value. Sweeper-Sweeper has a much larger sample, but the results were almost 2x as bad as simply throwing CU-FF or CU-CH. Starting with a breaking pitch and then opting for offspeed/speed yielded positive predicted results for most pitchers.

In exploring further with sequences, three-pitch sequences will now be looked into to ensure the most thorough evaluation of strategies in starting a PA. As may be apparent, the expected run values are going down as each sequence part is grouped. This is again due to the inherent fact that as more pitches are thrown, the pitcher does better. Given that these sequences have to have at least three pitches thrown versus the prior one and two, they contain more predicted results that are favorable for the pitcher.

These sequences (again) all begin with slider variants or Curveballs, but the second parts are even more variable than the first. The third part of the grouping was able to elevate certain results that didn’t fare as well in just the two-part start sequence approach. The other odd factor about the third group was the heavy amount of Fastballs and Sinkers. Eight of the top 10 best start three-pitch sequences had a fastball or sinker. These results also performed dramatically better than the bottom slider and curve, with the ST-ST-FF limiting runs at almost 3x the rate of CU-CU-CU/CU-FF-SL. While some of the samples are very limited, the prevalence of successful speed third pitches given the first pitch is breaking gives a pitcher something to think about when approaching a hitter.
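The start-sequence tables discussed in this section can be approximated by grouping PAs on their first k pitches. This is a sketch under assumptions: the sample PAs and expected run values are invented, and `prefix_averages` is a hypothetical helper, not the code used for the published tables.

```python
from collections import defaultdict

def prefix_averages(pas, k, max_pitches=9):
    """pas: list of (pitch sequence, expected run value).

    Groups PAs by their first k pitches and returns
    {prefix: (mean expected RV, number of PAs)}. Only PAs with at least
    k pitches and no more than max_pitches are counted.
    """
    groups = defaultdict(list)
    for seq, xrv in pas:
        if k <= len(seq) <= max_pitches:
            groups[tuple(seq[:k])].append(xrv)
    return {pre: (sum(v) / len(v), len(v)) for pre, v in groups.items()}

# Hypothetical PAs with model-predicted run values.
pas = [
    (("CU", "FF", "CH"), -0.06),
    (("CU", "FF"), -0.02),
    (("CU", "SL", "FF"), -0.04),
    (("SI",), 0.03),
    (("SI", "SI", "SI"), 0.01),
    (("FF",) * 10, 0.50),  # too long, excluded by the pitch cap
]

first_pitch = prefix_averages(pas, k=1)
two_pitch = prefix_averages(pas, k=2)
three_pitch = prefix_averages(pas, k=3)
```

Because a k-pitch table only admits PAs that lasted at least k pitches, deeper tables naturally skew toward outcomes that favor the pitcher, which matches the downward drift in expected run values noted above.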


In creating a given strategy for a pitcher, Stuff is important to remember in weighting. Knowing the batter’s ability to hit a given pitch is also very important, but that is getting into the weeds when this metric provides a simplistic baseline of sequencing performance averaged around the entire league. In focusing on optimizing an overall strategy, the best way to think of it is like a tree. From looking at only the first pitch of a sequence, what is the best option? Once that option is chosen, the results of the Two-Sequence start need to be examined, given the first pitch choice is set. From those available options, pick the best choice and continue to repeat. A visual of this thought process is included:


Assuming that at each level of Offspeed-Speed-Breaking the best choice was chosen given the prior sequence parts for the given pitcher, that is the choice that should be made.
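The tree-style selection can be written as a simple greedy walk: pick the best first pitch, then the best two-pitch start that extends it, and so on. The tables and run values below are hypothetical, and `greedy_sequence` is an illustrative helper rather than part of the model.

```python
def greedy_sequence(tables, depth):
    """tables[k] maps each k-pitch start sequence to its mean expected RV.

    At each level, keep only the sequences that extend the choice made so
    far and take the one with the lowest expected run value.
    """
    chosen = ()
    for k in range(1, depth + 1):
        options = {
            seq: rv
            for seq, rv in tables.get(k, {}).items()
            if seq[: k - 1] == chosen
        }
        if not options:
            break  # nothing in the table extends the current choice
        chosen = min(options, key=options.get)
    return chosen

# Hypothetical expected run values by start sequence (lower is better).
tables = {
    1: {("CU",): -0.05, ("SI",): -0.01},
    2: {("CU", "FF"): -0.08, ("CU", "SL"): -0.09, ("SI", "SI"): -0.20},
    3: {("CU", "SL", "FF"): -0.10, ("CU", "FF", "CH"): -0.30},
}

best = greedy_sequence(tables, depth=3)  # ("CU", "SL", "FF")
```

Note the trade-off in this toy data: the greedy walk commits to CU-SL at the second level and so never considers CU-FF-CH, which has the lowest three-pitch value. Choosing level by level mirrors how a pitcher decides in real time, but it is not guaranteed to find the best full sequence.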


With background in the legitimacy and optimized strategies/knowledge behind Sequencing+, the actual results of the 2023 AAA season will mean so much more. Pictured are the Top 10 Pitchers in Sequencing+, given that they saw at least 100 Batters.

On the other hand, these were the bottom scores for Triple-A.

In looking at these names, it’s also important to note that pitchers with good stuff or performance do not necessarily sequence their pitches well. This may seem obvious, but it’s worth pointing out their independence. Sequencing was proven to be less explanatory of performance than the other Stuff metrics, allowing pitchers like Zach McAllister to put up a decent 0.78 RV/100 and 3.93 ERA despite a horrendous Sequencing+ evaluation. The same type of thinking works in the inverse for great Sequencing+ scores. The metric is only attempting to isolate a tiny part of the pitching equation, so such results are to be expected.


For MLB, these were the highest scores during the 2023 season, given that they met the minimum 100 Batters Faced.

With the same requirement in place, these were the bottom scores.

Talented players were again on both charts, with Josh Hader specifically sticking out as a skilled player who does not sequence well. Looking into his data, he seemingly has a preference for starting off his plate appearances with Sinkers, which is not recommended based on the average performance of a first-pitch sinker. It also doesn’t help that one of his most common two-pitch sequence starts is SI-SI, which compounds the issue. Such a horrendous model score with such an above-average reliever like Hader makes more sense with this background. Stuff and strategy, again, don’t necessarily go hand-in-hand.



Conclusion

Pitch sequencing is a unique equation - many individualized factors go into deciding the best pitch option in a given scenario. The answer may not be clear, but the lack of an overall strategy and the dependence on gut-feel decisions are unacceptable in this modern age of baseball analytics. Sequencing+ attempts to turn this art into more of a science, establishing strategies based on prior success in certain situations that generally lead to fewer runs allowed. Pitchers are rewarded based on their ability to follow the optimized sequences, earning them the Sequencing+ scores that reveal how well they do compared to the competition. And given the relationship between Sequencing and RV100, these pitchers generally fare better. It is clear that this aspect of pitching does matter, and further quantifying such aspects should only help explain more about why pitchers perform as they do.

The results in this piece only feature v1 of the Model - future iterations along with explanations behind said iterations will likely be published in the future, whether through Twitter or another article post. Be on the lookout for these.