Sample Size, "Clutchiness" and Probabilities - Why your eyes are probably lying to you

Recently there was a discussion in a thread about whether the concept of “clutch” was logically supportable, and in particular whether our very own Kyle Orton was “anti-clutch”, and therefore a bad QB.

Over the past year I’ve been lumped in with the “Orton Fan Club” and/or the Tebow Hater group…  let me just say that the first is probably true, but the second (the TT Hater Group) is absolutely false.  I would be very happy with an outcome of trading Orton for a 2nd rounder.  I’m really excited to see what TT can do, and getting good value for Orton (either on the team or through trade) has always been my overriding desire.

As such, I’ve also probably been identified by many of you as a giant pain in the butt.  My wife would probably agree ;-)  Overall, what I am is a scientist, economist and analyst, and strongly reliant on logic and data-driven argument…  in many cases, this can run against “conventional wisdom”, which I often find to be conventional, but not particularly wise.


In order to fully explain my data-driven approach to analysis, it’s important to first step away from the emotions of the TT/Orton debate... so while this post will look at a similar concept, I’m going to reframe the argument to a different sport, and generic players.  I think it applies to QBs, but also applies to kickers, LBs, or really any professional athlete.  If you can understand me in the generic, you’ll much better understand why I come down where I do in the Orton/TT debate.

So let’s look at a generic sports player…  a pro-basketball point guard.  Please stick with me and try to think about the generic…  because my assertion overall is that we get distracted by emotional moments…  and I think I can prove it by separating out the emotion.

We will look at a point guard named Zed…  over the following analysis, we’ll look at a bunch of data, decide if he is good or not, and in particular, look at whether we can determine if he’s “clutch” or not.

So  here we go…

First, some assumptions:

Professional athletes come into the league and, in some but not all cases, get better over the first few years of their career.   After a few years, they generally “are what they are” and reach their maximum capacity as they figure out what they are doing.

The average rookie successfully makes a play 35% of the time, and an “average” starting veteran point guard makes a successful play ~50% of the time.  Good players make plays ~60% of the time, and great players make a play ~70% of the time.

Finally, we assume “clutch” is just something a player has or does not.  They either perform well in key situations, or they choke and fail.  They “are what they are”.  A player isn’t “clutch” one year, but then a “choker” the next (or vice versa), at least once they are at the pro level.


Back to Zed:

Part 1 -

Zed comes into the league…  He has some quirky mechanics which he’s working on but which make him a relatively poor prospect compared to the typical highly rated top guys, and some doubt he’ll ever be able to overcome those flaws.  As a rookie, he is given limited responsibility, but still gets 100 chances to “make a play”.  This could be shooting for a score, blocking an opponent’s shot, or passing to the open guy for an easy dunk, etc.  There are multiple ways to succeed, but in the end when given a “chance” he either makes a play or he doesn’t.  Let’s see how he does:

With his 100 chances, he “makes a play” 40 times out of the 100.  Even with those mechanical issues folks say he’s unlikely to improve, he still performs well for a rookie (the average rookie converts 35%), though not as well as an average veteran.  Ask yourself:  Is Zed a good player?




But we also know that not all chances are created equal.  Some are those high profile moments, and are disproportionately important to the outcomes of the game.  This isn’t necessarily the last-second shot, etc…  it could be the opposing team went on a run, and Zed made a great steal to calm the team down.  But basically, these are the times where the “clutch” guys come through and seem to directly result in their team winning or losing a game.  Looking over his rookie season, let’s say we identify 5 of these chances in his limited rookie time, and although he only made 40 out of 100 overall, he converted on 4 of those 5 chances….  And because of that his team won games they otherwise wouldn’t have.  The fans and media like him (bad mechanics/stats and all), and say he’s a gamer and “just wins”.  Most everyone agrees he is “clutch”, and the data (80% success rate) supports that assertion.

Does this change your perception of Zed and whether he is a good player or not? 




Part 2 -

So time goes by, and Zed improves his mechanics, etc.  Now a focal part of the team (200 chances per year), in year 2 he converts 50% of his 200 chances and in his 3rd and 4th seasons he now is converting 60% of his chances overall…  good but not great.  

Overall he has converted 380 of 700 chances, going from a relatively poor rookie to a full-time pro with a success rate of 54.3% overall, with the aggregate total weighed down by his poor rookie/1st year percentage.  Folks now are pretty sure he “is what he is”.  What is your evaluation of him?  Is he good (60% on most recent 400 chances)?  Average (54% on 700 total chances)?   I think most would agree that he is probably a “good” player, because his early struggles were explainable by his youth and, over a high number of recent attempts, he’s performed very well.

If you had to guess on how he did in the “clutch” situations, what would you predict?  Remember, as a rookie he converted 4 out of 5…. 

Let’s add more hypothetical data and look at his 2nd year performance: Assume he converts 5 of 10 clutch situations…  the exact rate as his overall performance that year.  Is he still “clutch” now that he isn’t converting 80% of the time?  Note he has still converted 9 of 15 overall the past 2 years (60%), solidly in the “good” range.  I think most would agree he is still “clutch”, and likely a good player overall.




Part 3 -

Now we get to his clutch performance in his most recent (3rd and 4th) years….

What if he only converts 4 out of 10 chances each of those 2 years (now less than 50% overall -  17 out of 35)?  Is he good/bad/clutch?  Were we totally wrong, and does he totally suck?

What if he converts only 5 of 10 (“average” those 2 years, but 19 of 35 overall), or 6 of 10 (21 of 35 overall)?  He definitely isn’t matching his rookie performance (80%) in any of these cases.  Was everyone wrong?

Sadly for Zed, the past 2 years were rough, and he only converted 40% of clutch situations.  The fans have totally turned on him, and some folks say he will never win anything because he’s a choker.

How does clutch performance change our perception of whether he is a good player or not?   If the “clutch” performance matters, what is the threshold between “bad” and “good” if his overall performance rate is 60%?  Can a player be good, but “anti-clutch”, or does lack of “clutch” immediately disqualify him as a “good” player?




Part 4 -

Now, let’s look at predictability:

We know a coin has a 50% chance of being heads/tails.  Flip it 10 times.  How many of you had 3 or 4 “heads”?  If you did get that, are you now doubting that the coin is a 50/50 likelihood?  OF COURSE NOT.  What if you got only 1 or 2?… that’s actually quite unlikely on an evenly weighted coin.  Are you starting to think the coin might be weighted?  Would you bet $20 that you get fewer than 5 heads on the second set of 10 attempts?  No, you probably wouldn’t, because while it is not a high chance that you’ll get 9 tails to 1 heads on a 50-50 guess, it does happen.  You have other, bigger sample size data that shows the vast majority of coins are 50-50.
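
For anyone who wants to put numbers on the coin example, here is a minimal sketch using the standard binomial formula.  It assumes a perfectly fair coin and independent flips; the helper names (`binom_pmf`, `prob_between`) are made up for this illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n independent flips with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_between(lo, hi, n, p):
    """Probability that the number of heads lands anywhere from lo to hi, inclusive."""
    return sum(binom_pmf(k, n, p) for k in range(lo, hi + 1))

FLIPS = 10
FAIR = 0.5  # assumption: an evenly weighted coin

p_3_or_4 = prob_between(3, 4, FLIPS, FAIR)    # 3 or 4 heads
p_1_or_2 = prob_between(1, 2, FLIPS, FAIR)    # only 1 or 2 heads
p_under_5 = prob_between(0, 4, FLIPS, FAIR)   # fewer than 5 heads

print(f"3 or 4 heads:       {p_3_or_4:.1%}")
print(f"1 or 2 heads:       {p_1_or_2:.1%}")
print(f"fewer than 5 heads: {p_under_5:.1%}")
```

Under these assumptions, 3 or 4 heads turns up in roughly a third of all 10-flip sets, while 1 or 2 heads is closer to a 1-in-20 event: uncommon, but nothing that should make you throw the coin away.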


So let’s say Zed is the coin.  We have the following data:

Overall (big sample) – 54% on 700 attempts

Recent performance (big sample) - 60% on 400 recent attempts…

Overall clutch performance - ~49% (17 of 35) on 35 total attempts…

Recent clutch performance – only 40% on 20 recent attempts…
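
As a quick arithmetic check, the four rates above can be rebuilt from the season-by-season figures given earlier for Zed (rookie year through year 4).  This is only a bookkeeping sketch; the tuple layout and the `rate` helper are assumptions made for the example.

```python
# Season figures from the Zed example:
# (overall makes, overall chances, clutch makes, clutch chances)
seasons = [
    (40, 100, 4, 5),     # rookie year: 40 of 100, 4 of 5 in the clutch
    (100, 200, 5, 10),   # year 2: 50% of 200, 5 of 10 in the clutch
    (120, 200, 4, 10),   # year 3: 60% of 200, 4 of 10 in the clutch
    (120, 200, 4, 10),   # year 4: 60% of 200, 4 of 10 in the clutch
]

def rate(rows, made_idx, tried_idx):
    """Return (made, tried, made / tried) summed over the given seasons."""
    made = sum(row[made_idx] for row in rows)
    tried = sum(row[tried_idx] for row in rows)
    return made, tried, made / tried

recent = seasons[2:]  # years 3 and 4 only

summary = {
    "overall": rate(seasons, 0, 1),
    "recent overall": rate(recent, 0, 1),
    "overall clutch": rate(seasons, 2, 3),
    "recent clutch": rate(recent, 2, 3),
}

for label, (made, tried, pct) in summary.items():
    print(f"{label:>15}: {made}/{tried} = {pct:.1%}")
```

The thing to notice is the denominators: the two "big sample" lines rest on 700 and 400 attempts, while the two clutch lines rest on 35 and 20.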

If his recent success rate over 400+ attempts is 60% (his likely current talent level), is it really that surprising that there is a 20-attempt sample that came out at 40%?  Again, no, not really…  over a 20-attempt sample, the difference between 40% and 60% is 4 flips (8 makes instead of 12)…  it’s not terribly unlikely to flip a coin heads 4 times in a row (a 1-in-16 shot), so similarly it’s not unlikely that a player with a 60% likelihood performs at 40% for 20 attempts.
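
To put a rough number on that, here is a sketch that treats each clutch attempt as an independent try at a fixed 60% success rate (a simplifying assumption; real attempts are not identical) and asks how often 20 attempts produce 8 or fewer makes.  The function name is hypothetical.

```python
from math import comb

def prob_at_most(k_max, n, p):
    """P(k_max or fewer successes in n independent tries at success rate p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_max + 1))

TRUE_RATE = 0.60   # assumption: Zed's recent overall rate is his real talent level
ATTEMPTS = 20      # recent clutch attempts
MAKES = 8          # 40% of 20

p_rough_stretch = prob_at_most(MAKES, ATTEMPTS, TRUE_RATE)
p_four_heads = 0.5 ** 4  # four heads in a row, for comparison

print(f"P(8 or fewer of 20 at a 60% true rate): {p_rough_stretch:.1%}")
print(f"P(4 heads in a row):                    {p_four_heads:.1%}")
```

Under these assumptions the two probabilities land in the same ballpark, a bit under and a bit over 6%.  That is a rare stretch for any one player, but with dozens of players being watched every season, somebody is going to have one.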

Therefore, if you have to predict the future and bet $20 on whether his next 10 clutch attempts are better or worse than 50%, which do you guess will happen?  Remember we already have a sample of 5 chances (rookie year) where he converted 80%...  and everyone agrees he has improved since then! 

Basically, Zed is VERY LIKELY to be better than his recent clutch performance indicates.  The nature of small sample size events just tricks us into thinking that Zed is below average, because we focus on the recent past and high profile events, rather than earlier events or the entire body of his work.

Smart money would bet on taking the “5 or over” for Zed’s likely clutch performance, even though recently he’s sucked “in the clutch”….  It could be wrong, but I’d say you’re betting against the data if you take the under.
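
The bet can be sketched the same way.  The code below computes the chance of “5 or over” on the next 10 clutch attempts under three candidate “true” rates; which rate is the real one is exactly what is in dispute, so the labels and values are assumptions taken from the numbers in this post.

```python
from math import comb

def prob_at_least(k_min, n, p):
    """P(at least k_min successes in n independent tries at success rate p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

NEXT_ATTEMPTS = 10
OVER_LINE = 5  # the "5 or over" side of the bet

# Candidate true rates (assumptions for illustration)
candidate_rates = {
    "recent clutch rate": 0.40,
    "coin flip": 0.50,
    "recent overall rate": 0.60,
}

over_odds = {
    label: prob_at_least(OVER_LINE, NEXT_ATTEMPTS, rate)
    for label, rate in candidate_rates.items()
}

for label, odds in over_odds.items():
    print(f"{label:>20}: P(5 or more of 10) = {odds:.1%}")
```

If the 400-attempt sample is the better guide to his talent, the over wins comfortably more often than not; the under only becomes the favorite if you believe the 20-attempt sample over the 400-attempt one.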




Part 5 -

Finally, what if you have a new young point guard named Buddy?  He only had a limited amount of playing time behind Zed last year, but on 50 attempts, he converted 25 of them (50%).  He also had a number of clutch opportunities, and converted 3 out of 4 (75%).  He has some mechanical issues that some say he’ll never fix (sounds like what they said about Zed as a rookie…  he’s a project), but he also has big charisma, a winning smile, some physical tools that are really special, and most folks agree that if he can overcome his mechanical issues, he could conceivably convert 70% of his chances…  he could be truly elite.

So the GM is trying to pick the best point guard for the team going forward. 

You have Zed, who performs pretty well overall, but has had a recent rough stretch in some key situations.

You also have Buddy, who looks like he’s at or above the same level as Zed was as a rookie, and has the potential to be much better going forward.

It is not surprising that a team might go with Buddy.  But that is a reflection on how great Buddy could be, not how good (or bad) Zed is.  By everything but the recent small sample size clutch performance, Zed looks like a good, not great, point guard, and that clutch performance is small sample size data that is likely misleading.



As you’ve probably guessed by now, Zed is Orton and Buddy is Tebow.  But by describing the situation in the generic, hopefully I’ve shown that those folks complaining about Orton’s “anti-clutchiness” are creating an unnecessary explanation for events that instead are simply VARIABLE. 

Good QBs may perform poorly over a small sample, while bad QBs can perform well over a short time…. A 60% probability of success doesn’t mean you get 6 successes every time you look at 10 plays.  It averages out.  What we CAN probably say is that over the long-term/big samples, players’ stats will eventually reflect “what they are”.
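
A small simulation makes the “6 out of 10 every time” point concrete.  This sketch assumes a fixed 60% success rate and independent plays, and uses a fixed random seed so the run is repeatable; all the names and constants are choices made for the illustration.

```python
import random
from collections import Counter

TRUE_RATE = 0.60   # assumed "true" success probability
PLAYS = 10         # plays per sample
TRIALS = 10_000    # number of simulated 10-play samples

rng = random.Random(0)  # fixed seed so the results are reproducible

def successes_in_sample(rng, plays, rate):
    """Simulate one sample of plays and count the successes."""
    return sum(rng.random() < rate for _ in range(plays))

counts = Counter(successes_in_sample(rng, PLAYS, TRUE_RATE) for _ in range(TRIALS))

share_exactly_6 = counts[6] / TRIALS
share_4_or_fewer = sum(counts[k] for k in range(5)) / TRIALS
average = sum(k * c for k, c in counts.items()) / TRIALS

for k in range(PLAYS + 1):
    print(f"{k:2d} of {PLAYS}: {counts[k] / TRIALS:6.1%}")
print(f"exactly 6: {share_exactly_6:.1%}, 4 or fewer: {share_4_or_fewer:.1%}, "
      f"average: {average:.2f}")
```

Under these assumptions the average settles near 6, but any single 10-play sample hits exactly 6 only about a quarter of the time, and a noticeable share of samples come in at 4 or fewer.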

We are unlikely to ever have enough samples to truly understand a player’s “clutchiness”, and the ones that far exceed expectations in large samples of “clutchiness” (e.g. Elway) also typically far exceed expectations in their overall stats too.  There simply is no evidence that “clutch” is different than any other attempt at success. 

I can already anticipate folks saying, “Sure, but what about Jack or Joe or James Choker?”….  Even if folks can identify a player or 2 with good overall stats, but who far underperformed in their “clutch” situations (i.e. the so-called “chokers”), it still seems more probable that this is the “1 heads in 10 flips” case rather than something requiring a whole new explanation of the phenomenon like “clutchiness”.

Folks saying Orton is a bad QB because of his recent “clutch” failures are betting against that probability.  Orton’s “clutch” failures in the recent past could easily be the downside 40% on the other side of a 60% success probability.  Occam’s Razor tells us the simple solution (things are just variable) is more likely the correct answer, particularly since these players were all probably “clutch” in college, etc.  Major college football and NFL players have been playing in high pressure situations in front of tens of thousands of fans for years, and every single down in both college and the NFL is high pressure.  It is very unlikely that a “choker” would excel in either situation if they couldn’t handle pressure, so to believe the “clutch” theory requires you to assume these folks somehow snuck through their entire career without getting “discovered” to be a choker.  Even if there are small differences, the noise in the small sample is likely to be far greater than the relative change in "clutch" ability.

Only the most extreme folks (not me) are saying Orton is a great QB… his stats don’t reflect greatness overall, so I find that assertion just as unlikely as saying he’s a bad QB.  But Orton has performed at a “good” level over a large sample size.  Therefore probabilities tell us he most likely is good, even if in a small number of recent situations (i.e. the "clutch") he’s had some poor performance.

Tebow (Buddy) looks like he could be great, and given his performance as a rookie and other intangibles, if he develops like Zed (Orton) did we’ll be sitting pretty.  But TT being great and Orton being good are not mutually exclusive, and TT still has only a small sample size… 

We may all find we’re totally calling it wrong on TT….  He could look very poor once he has 15 games under his belt rather than 3…  but the whole small sample size thing doesn’t mean the data we have is worthless…  Given the data we have now, TT looks very good.  But we know over a bigger sample, TT’s performance will average out to close to his “true” success probability.  Just like TT’s small overall sample size success doesn’t mean he’ll be great, similarly Orton’s small sample size failures “in the clutch” don’t mean he’s bad.  There just isn’t enough data to say definitively on either assertion, and in Orton’s case, the bigger data set seems to indicate the small sample is probably misleading.



Small sample sizes and uncertainty are really tough to analyze, and football is particularly hard.  With only 16 games per season, a limited number of plays, and lots of confounding factors like injuries, the only thing we can probably be sure about in football analysis is that all of our analysis is highly uncertain.

Remember when I was asking what threshold folks would need to change their minds on whether Zed was “good” or “bad”?  I’m guessing most folks said 40% success was bad, and 60% success was good.  But for highly variable events like a sports outcome, a 10 event sample might have error bars of roughly +/- 25% to 30% at the 95% confidence level.  So if a sample comes in at 80%, a “true value” of that player's talent as low as roughly 50% is still consistent with the data, and about 1 time in 20 the true value will fall outside even those wide error bars!

1-in-20 events happen all the time….  For a team in the NFL, that means if we wanted to rate the “clutchiness” of every player on a 53-man roster from samples that small, we would expect to “miss” by more than those error bars on 2 or 3 players per team.  Think of how many more we would mis-estimate by only 20%!  With most outcomes in the NFL clustered in relatively small % differences (e.g. think typical completion percentages… 45% is horrible, but 70% is HOF-caliber), how can anyone say a small sample of fewer than hundreds of events is evidence of ANYTHING?    Add in the biases of “eye tests” (e.g. we overrate the physical over the mental, and we remember the extreme events (the interceptions, big hits, sacks, diving catches, etc.) more than the routine (1st downs, missed tackles, etc.))…

This is why big sample size is so important… hundreds of samples can get us down to smaller uncertainties, on the order of +/-5%....   While data at +/-5% can still have outliers (e.g. a good player actually sucking over that stretch), we are much less likely to get the gross evaluations of good vs. bad wrong with that data resolution.
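
Here is a rough way to see how the error bars shrink as the sample grows.  The sketch uses the Wilson score interval, which is one common textbook way of putting a 95% interval on a success rate; it is a choice made for this illustration, not the only reasonable method, and the sample sizes are borrowed from the Zed example.

```python
from math import sqrt

Z_95 = 1.96  # normal quantile for a two-sided 95% interval

def wilson_interval(successes, attempts, z=Z_95):
    """Wilson score interval (low, high) for a success rate."""
    p_hat = successes / attempts
    denom = 1 + z**2 / attempts
    center = (p_hat + z**2 / (2 * attempts)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / attempts
                    + z**2 / (4 * attempts**2)) / denom
    return center - half, center + half

# (successes, attempts): small clutch-style samples up to a 400-attempt sample
samples = [(4, 5), (8, 10), (8, 20), (60, 100), (240, 400)]

intervals = {s: wilson_interval(*s) for s in samples}

for (made, tried), (low, high) in intervals.items():
    print(f"{made:3d} of {tried:3d} ({made / tried:.0%}): "
          f"95% interval {low:.0%} to {high:.0%}, width {high - low:.0%}")
```

With this method, the interval around 8 of 10 spans several tens of percentage points, while the interval around 240 of 400 is only about ten points wide in total, which is the difference between guessing and having a usable estimate of “good” vs. “bad”.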

Is it any wonder that I’m very skeptical when someone says, “but remember those few times I watched on TV when Player X was good/bad/clutch/etc…”, or points to a highlight/lowlight film as the proof of why large sample play-by-play stats are wrong? 

Film observations are really important for suggesting nuance not captured by the stats (often the “why” something happened, but very rarely the “what”), but we have to have a scientific sample if we want to draw conclusions…  a few of the advanced stats websites are starting to do some scientific sampling of game film (and I love that stuff!), but that is the exception rather than the rule.

Anyway, hope this was interesting for folks, and explains where I come from in many of my posts.  I hope the lessons here are useful to you both as you try to discover the truth behind sports performance, as well as in other things in life.  A better understanding of sample size, statistics and data selection IMO would do our country a lot of good in terms of understanding when the Bozos in Washington DC are lying to us, etc.    Many know that “statistics can lie”, but very few seem to be able to realize when the stats are telling the truth. 

Go Broncos!

This is a Fan-Created Comment on Mile High Report. The opinion here is not necessarily shared by the editorial staff of MHR.
