Forget big data -- measuring performance during this tournament is a losing battle.
Sports analytics -- or moneyball, if you prefer -- is hot right now, and nowhere more so than in football. Something of a latecomer to the gospel of data, the beautiful game is now embracing it with the zeal of the converted. With its fortuitous timing, the World Cup seems like an obvious showcase for the best that football's data experts have to offer. Except it's not.
Football data is big business. Several companies already collect data from on-field events to sell to clubs and, increasingly, fans and media. Opta, Prozone, and Infostrada have been working furiously throughout the World Cup, publishing match data in visually attractive ways to raise their profiles and attract new customers. And there's nothing wrong with that -- these companies have no choice but to take advantage of the tournament's massive global popularity to move product and get wider exposure for their work.
The problem is, as far as statistical analysis goes, the World Cup itself is probably the least friendly event on the football calendar. That's because in sports analytics, most statisticians prefer to work with a large sample of games between teams with generally stable rosters that play each other in a balanced schedule. The bigger and more consistent the sample, the better the chance that variations in performance within single games will smooth themselves out and allow reliable signals to emerge. That's why a data-rich domestic competition like the English Premier League tends to be the preferred object of analytical study.
By contrast, the World Cup is a month-long knockout competition. Half of the 32 teams will play a maximum of three matches, and only four teams will play the maximum of seven. Their opponents depend on the initial draw and the final places in the group stage. One team could face France and Germany on its way to the final, while another might play Mexico and Greece. With such different opponents, comparing the performance of teams or players would be all but hopeless. And match data from this tournament, no matter how interesting, will almost certainly be skewed by unlikely events -- the kind that can turn a single game but might never repeat themselves in the course of a season. If none of the data is repeatable, it's hard to see how it's descriptive.
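To put a rough number on that sample-size gap, here is a minimal simulation sketch. It is an illustration only: the win and draw probabilities are made-up assumptions, not real team data, and the function names are hypothetical. It takes one team of fixed quality and measures how much its points-per-game figure swings over three matches (a group stage) compared with 38 (a Premier League season).

```python
import random
import statistics

# Hypothetical per-match outcome probabilities for one team.
# These are assumptions for illustration, not figures from real data.
WIN, DRAW = 0.45, 0.25  # the remaining 0.30 is the chance of a loss


def points_per_game(n_matches, rng):
    """Simulate n_matches and return the average points won per match."""
    total = 0
    for _ in range(n_matches):
        r = rng.random()
        if r < WIN:
            total += 3  # win
        elif r < WIN + DRAW:
            total += 1  # draw
    return total / n_matches


def spread(n_matches, trials=5000, seed=0):
    """Standard deviation of points-per-game across many simulated runs."""
    rng = random.Random(seed)
    results = [points_per_game(n_matches, rng) for _ in range(trials)]
    return statistics.pstdev(results)


group_stage = spread(3)     # three World Cup group matches
league_season = spread(38)  # a 38-match league season

print(f"3 matches:  spread of {group_stage:.2f} points per game")
print(f"38 matches: spread of {league_season:.2f} points per game")
```

The team's underlying quality is identical in both cases; only the number of matches changes, and the three-match figure is far noisier.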
So, can analysts at least go back to look at national team performance in qualifying to predict World Cup performance? Not really. For one, national teams play very infrequently, even in a World Cup year. The United States Men's National Team (USMNT), for example, played a total of six matches this year in the lead-up to the tournament -- three fewer than Manchester City played in the month of December.
Moreover, national team rosters change a great deal between international breaks, whether due to injury or experimentation. Eddie Johnson and Landon Donovan were the goalscorers in the USMNT's 2-0 win over Mexico last September that sealed the Americans' World Cup qualification. But neither made the final 23-man roster that went to Brazil. Without the same players on the field, how can you use data from a national team game in November to make claims about how the team will play more than half a year later?
Despite these challenges, most firms have packaged and released World Cup match data, letting fans and journalists take from it whatever they wish. Prozone provides various baubles including individual possession maps; MatchStory offers a team-by-team summary with some basic goal data; Opta gives readers the same data it does for league sides, with some added nuggets here and there for each World Cup team. And FiveThirtyEight has a constantly updated and adjusted World Cup prediction model. Last time around, it managed to explain only 34 percent of the final team rankings -- and that was better than most.
On the one hand, something is better than nothing. A columnist attempting to make a point about team tactics could do worse than back it up with some relevant performance data. And several prediction models have beaten the bookies using the same system as chess rankings.
On the other, presenting raw numbers as part of a gorgeously packaged interactive infographic, whether tackles or pass completion rates or whatever, gives readers the illusion that the data carry intrinsic meaning. "Team A tackled more than Team B, therefore Team A is better at defense." "Team X had a lower pass completion rate than Team Y, therefore Team Y is more technically gifted."
Or maybe Team A was defensively lax and therefore forced into more desperate play, and Team X was playing in a more direct but riskier way that entailed fewer successful passes.
This is the kind of equivocation skeptics use to cast doubt on data-driven approaches to sport. It also drives serious sports statisticians crazy. Raw match data is only as useful as what you do with it, whether by running regressions or developing predictive models using a large sample of games. Repeatability and reproducibility are a much harder sell than magical match numbers, but understanding these is what helps transform raw data into knowledge. The World Cup makes that kind of work difficult. Not impossible, just difficult.
That's more a problem for the data analysis firms than it is for fans. Because let's remember what's really important: The very same variation that makes a knockout tournament like the World Cup so difficult to predict is also what makes it so exciting.
Martin Rose / Getty Images Sport