Underlining the 'football' in 'football analytics'

 • by Mark Thompson

People often talk about the magic of the Cup. ‘Cup’ can apply to different competitions: the FA Cup, the World Cup; in other countries, they may well talk about national cup competitions in similarly hallowed terms.

No-one ever talks about ‘the magic of the league’, though. It’s for good reason: leagues aren’t supposed to be magic. They run over a long span of time, with enough matches that the winner is decided in the most just way possible.

Large sample sizes, designed to weed out hot streaks and outliers, aren’t friendly to magic.

However, with a sport as low-scoring as football and seasons of ‘only’ 38 games, weird things can still happen. Leicester City is the one that sticks in the minds of most, but the following year’s champions, Chelsea, had their own quirks.

For one, Eden Hazard had chances worth 2.52 expected goals from wide areas of the 18-yard box, but he scored seven from them. That’s an overperformance of about 178 per cent, according to Football Whispers’ numbers.

The second-biggest overperformance from those areas that season was from Willian: four goals from 1.65 expected goals, an overperformance of 142 per cent. That’s two Chelsea players with unsustainable hot streaks, and no-one has got anywhere near Hazard’s overperformance since then. Weird, huh.
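For anyone who wants to check the sums, here is a minimal sketch of the arithmetic, using only the goal and expected-goal figures quoted above. The function name `overperformance_pct` is made up for illustration and is not taken from Football Whispers.

```python
def overperformance_pct(goals: float, expected_goals: float) -> float:
    """Percentage by which actual goals exceed expected goals."""
    if expected_goals <= 0:
        raise ValueError("expected_goals must be positive")
    return (goals / expected_goals - 1) * 100


# Figures quoted in the article for shots from wide areas of the box.
hazard = overperformance_pct(goals=7, expected_goals=2.52)
willian = overperformance_pct(goals=4, expected_goals=1.65)

print(f"Hazard:  {hazard:.0f} per cent")
print(f"Willian: {willian:.0f} per cent")
```

A value of 0 would mean a player scored exactly what the model expected; 100 would mean double.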

Why this matters

Breaking down shots into certain areas of the pitch has two values. The first is purely narrative. Those last few paragraphs tell an interesting story. Were Chelsea’s wide forwards extraordinarily on fire, and could that have had an effect on the title race? Or maybe scoring that many goals more than expected was just a sign they were extraordinarily good. Either way, it’s a story to be told and a debate to be had with friends.

But it also has a useful analytical value. Football players don’t take many shots, so statistical analysts have usually tried to measure ‘finishing skill’ by bundling together all of a player’s attempts on goal and trying to find meaning from there.

Anyone who knows the archetypal goals of Thierry Henry (finisher suprême) and Steven Gerrard (thunderblaster) can see the footballing limitations of this approach, even if it has statistical advantages.

The primary framework should be football and, for many good people doing good work, it is. Some have been moving towards a new kind of football analytics since 2016. But one could argue that the idea could be pushed further than it generally is (at least in public).

Won’t somebody please think of the regression models?

Just like expected goals models, which take an average of thousands upon thousands of previous similar shots, passing quality models exist. But it takes a lot of passes for them to become useful, even though a player makes far more passes than shots per season.
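The ‘average of previous similar shots’ idea can be sketched in a few lines. This is a toy version under loud assumptions: the shot data is made up, ‘similar’ is reduced to a single coarse zone label, and the names `build_xg_table` and `expected_goals` are hypothetical. Real models draw on thousands of shots and many more features.

```python
from collections import defaultdict

# Made-up historical shots as (zone, scored) pairs, purely for illustration.
historical_shots = (
    [("central_box", True)] * 2 + [("central_box", False)] * 2
    + [("wide_box", True)] * 1 + [("wide_box", False)] * 3
    + [("outside_box", True)] * 1 + [("outside_box", False)] * 9
)


def build_xg_table(shots):
    """Average conversion rate of past shots, per zone."""
    totals = defaultdict(lambda: [0, 0])  # zone -> [goals, shots]
    for zone, scored in shots:
        totals[zone][0] += int(scored)
        totals[zone][1] += 1
    return {zone: goals / n for zone, (goals, n) in totals.items()}


def expected_goals(shot_zones, table):
    """Sum the zone averages over a player's shots."""
    return sum(table[zone] for zone in shot_zones)


xg_by_zone = build_xg_table(historical_shots)
player_shots = ["wide_box", "wide_box", "central_box"]
print(xg_by_zone)
print(expected_goals(player_shots, xg_by_zone))
```

The same averaging could in principle be applied to passes, which is where the trouble starts: a zone label says far less about a pass than it does about a shot.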

Why? Because football is a sport where the phases of play change so frequently. Two shots in the same part of the pitch generally have a lot in common. Two passes in the same part of the pitch can be radically different, mostly down to where the opposition are. Are they pressing or dropping off, settled or still in transition?

By focussing on the problem of sample sizes, statistical analysts (arguably) haven’t focussed enough on football. Sure, a player might show up well in a passing model, but what types of passes are they asked to play at their current club, and what types of passes will they be asked to play at their new club? How does the player perform at those?

The principle applies to all facets of the game: shooting, dribbling, passing, tackling. Thinking about football more in terms of its fundamental building blocks rather than what can be gleaned through averages might help yield more useful results.

The big problem, and it’s not sample sizes

There’s a fairly major stumbling block, though. A lot of football data doesn’t lend itself to being applied to football.

Do you want to know what foot a player makes a pass with? With several major data companies, you can’t. Do you want to know what direction a tackler is coming at their prey from — heading away from their goal, meeting a forward (probably) head-on, or facing their own goal, making up ground, backwards pressing? You can’t. But these would be useful things to know.

Football is a very messy sport. Unlike cricket or baseball, only a very small part of the game is made up of set plays, and the ball is moved across the pitch with the feet, which makes control much harder. It’s almost like it was designed to be as hard as possible to quantify.

A step back

Splitting the game up into component parts, which might even highlight transferable skills between different roles on the pitch, is all very well and good, but there’s also a larger matter. What is football a sport of?

We know that possession isn’t the be-all and end-all — Burnley results and Sky Sports News pundits make sure that message gets drilled home well and good. Territory isn’t the be-all and end-all either. It’s a means to an end, sure, but the value of expected goals shows that it’s only a very particular area of the penalty area that’s the key.

An image from a BBC Sport explainer on expected goals, showing the average chance of scoring from areas inside the box.

Football, one could say, is a game where you’re trying to get the most control possible in that area in front of the opponents’ goal. Everything else is window dressing (or, more accurately, a mixture of philosophy and practicality in creating chances in that area).

Some of the people at the cutting edge of football analytics are moving towards ‘pitch control’ models using tracking data. Thanks to the (enormously) large trove of information available, at any one time they determine which team has ‘control’ of areas of the pitch.
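To give a feel for what ‘control’ might mean in code, here is a deliberately crude sketch: each patch of grass is handed to whichever team has the nearest player. This is a stand-in, not how the cutting-edge models work. Those typically account for things like player speed and direction, which are ignored here. The player positions, the pitch dimensions and the function names are all assumptions made for illustration.

```python
import math

# Assumed pitch of 105m x 68m; the box below is the penalty area at one end.
BOX = {"x_min": 88.5, "x_max": 105.0, "y_min": 13.84, "y_max": 54.16}


def nearest_team(point, players):
    """Team of the player closest to `point`; players are (team, x, y)."""
    return min(players, key=lambda p: math.dist(point, (p[1], p[2])))[0]


def control_share(players, x_min, x_max, y_min, y_max, step=0.5):
    """Fraction of grid cells in a rectangle that sit closest to each team."""
    counts = {team: 0 for team, _, _ in players}
    nx = int((x_max - x_min) / step)
    ny = int((y_max - y_min) / step)
    for i in range(nx):
        for j in range(ny):
            # Sample the centre of each grid cell.
            cell = (x_min + (i + 0.5) * step, y_min + (j + 0.5) * step)
            counts[nearest_team(cell, players)] += 1
    total = nx * ny
    return {team: n / total for team, n in counts.items()}


# One made-up frame of tracking data: (team, x, y) in metres.
frame = [
    ("attack", 96.0, 30.0), ("attack", 92.0, 40.0),
    ("defence", 99.0, 34.0), ("defence", 100.0, 28.0), ("defence", 103.0, 34.0),
]
print(control_share(frame, **BOX))
```

Even in this toy form, the output is a single number per team for the area in front of goal, which is the sort of quantity the rest of this piece is circling around.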

Football should inform the approach to the data, but it’ll be interesting to see whether, in the future, working with this type of data could inform football. Could the models work out the best ways to create a control advantage in front of goal from any given situation?

More importantly for the vast number of us who don’t have access to either tracking data or the capabilities to work with it, should we incorporate this idea into how we approach using more traditional and widespread event data?

Nobody ever talks about the magic of the league, and nobody ever talks about the magic of data either. But they should. Because even though both work best with larger sample sizes, the possibilities for surprise and exploration are tremendous.
