Over the past couple of weeks, the 2016 State of the Stats survey gathered responses from more than 200 people involved in football analytics, either professionally or as a fan. This is the second year I’ve run the survey, and like last year, it’s about who we are as a community, our hopes and dreams, and the problems we face in our work. My hope is that it provides inspiration, tempers expectations, and exposes issues and opportunities for the coming year. At the very least, it’s got some big-ass pie charts in it, and that’s about as analytics as you can get, am I right?
Who Are We?
First up, it’s great to see that nearly half the people filling out the survey don’t necessarily identify as stats-first people:
The first major issue I’d like to address here is that tactics people need to choose an appropriate epithet to describe them, along with ‘evil number wizards’ and ‘real football men’. I propose ‘The Pep Squad’, but I’m open to ideas. Nevertheless, this year has seen an interesting combination of cross-pollination and beef across the dividing lines of stats and tactics. Real football men hate laptops and Pep equally, so in a way provide us with some common ground, and the ascendancy of RB Leipzig has been a major driver in this regard: finally tactics and analytics people are united behind the common cause of destroying footballing traditions. Of course, stats people were trying to sign Naby Keita two years before you’d even created your first WordPress draft profiling him, but I wouldn’t know anything about that.
It is still the case that most analytics people view tactics writing as riddled with confirmation bias and small sample sizes, and tactics people view analytics people as missing almost everything interesting about football (e.g. basically every aspect of positional play, which is absolutely true). I maintain that the tactics community on Twitter and in the blogosphere is the single most unexploited resource for statsfolk. You have a group of smart people with fantastic intuition, who can help guide you to footballing truths and interesting patterns, and in return you can help add rigour and context to those intuitions. As an amateur analytics writer, you’re not going to get a better proxy for a manager or coach to drive your work. I hope to see more collaborations between the communities in 2017.
Once again, I neglected to add a question about the gender makeup of the analytics community, because I’m an arsehole, especially given this was pointed out last year. We go again. What we do have again this year is the age composition both inside and outside clubs:
This is still very much skewed to the left, even more so for people at clubs or working as consultants. Without wanting to sound patronizing to my younger compatriots, a lot of this is that clubs have horrible pay structures (for all sorts of back room staff, forget about just stats). If you look at any of the analytics positions that get advertised publicly, you can sort of understand why someone with a mortgage and kids might not want to work the equivalent of two jobs for £20k, just because IT’S FOOTBALL! and IT’S A DREAM JOB! I was heartened to hear that the magnificent Christopher Long is soon to be entrusted with building the analytics team at an English club, and has stressed that he’s looking to pay market rate for technical positions. Hopefully that’s a sign of things to come.
This is how we look in terms of education, comparing maths to sports science:
Of the roughly 40 respondents that work in clubs or as professional consultants, only two got there without at least a bachelor’s degree either in a mathsy or sports science subject.
We had about 40 respondents with a formal coaching qualification, including 3 UEFA A Licensees. What made me especially happy was to see these people reporting skills with R and Python, and beyond, making it clear that there’s an increasing expectation of data skills even in the traditional roles in the sport.
What do we do?
In case you’re looking for a niche to fill, here are the stats for different types of work people were engaged in in 2016:
Once more, we find all the grubby stuff that has nothing to do with attacking at the bottom, which isn’t much of a surprise. If you’re working on attacking metrics in 2017, perhaps you might want to take a step back and see if your efforts could be better spent elsewhere. And if none of the above interest you, let’s have a trawl through the varied responses to the ‘Other’ option from this question:
- Evaluating crosses
- Evaluating counter-attacks
- Various time-in-possession measures
- Stability of possession
- Multitudes of non-shot models
- Measuring decision making
- Some version of WAR for football
- Cloning IMPECT/Packing
Here’s what people suggested you should work on in 2017:
- Simplifying and making actionable all your output
- Set piece analysis
- Treating events as sequences/networks
- Better GK metrics
- Studying transitions, identifying styles, the good, the bad
- A standardised definition of ‘possession’ for everybody to share and work from
- Measuring decision making
- Youth development
- Quantifying stats in monetary terms (e.g. for contract negotiations)
- Dribbling metrics
- Defence, goalkeepers, dear God anything but shots and xG
- Whatever increases a causal understanding of the sport
So, you can take all that to the bank.
Tools & Platforms
Here’s how we do what we do:
The geek in me wanted to get a bit more detailed this year, and find out who’s using GLMs versus random forests and stuff, but let’s be honest: a lot of analytics really is just swapping CSVs and scrolling around in Excel, which really explains why Tableau comes second here, being a slightly less painful version of that. People regularly ask me how to choose between R and Python, which annoys me because I had hoped Clojure would one day win, but its numerical computing suite (Incanter) has long since died. You should learn Python, all things being equal. R is great, and very productive in the first instance, with an enormous suite of maddeningly inconsistent libraries for almost anything you could stumble across on Wikipedia. You will have a long and happy career going from data, to slightly cleaner data, to ggplot2. But R is an atrocious general purpose language and runtime, has terrible error messages and documentation, and doesn’t have the depth of community that Python has in the wild. On top of that, distributed, GPU-based computing is thriving more on Python than on R, and as much of football analytics is probably already going through the throes of well-reasoned-statistical-models versus wtf-black-box-deep-learning, the skills involved in the latter are going to prove hugely more valuable in industry over the next 10-20 years, and the bindings will appear on Python first, almost every time.
All that said, develop a brain and an eye for football first. There are people out there doing better work in Excel than you with your $20,000 of Amazon Web Services credit because they fundamentally know where to look better than you do.
In addition to these fundamental tools, almost a third of respondents used a video platform like WyScout or InStat. Not only are these essential parts of your recruitment and analysis pipelines within a club, they’re also hugely useful to confirm that things you’re seeing in the data are actually real, or to find weird obscure bugs where possession chains last for ten minutes because someone had a head injury followed by a drop-ball.
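As a rough illustration of hunting for that kind of bug, here is a minimal sketch that flags possession chains that run implausibly long, so you know which clips to pull up on video. The `Chain` record, the `flag_long_chains` helper and the 120-second threshold are all assumptions made up for this example, not anything from a real data feed.

```python
from dataclasses import dataclass


@dataclass
class Chain:
    """Hypothetical possession chain record; field names are illustrative."""
    chain_id: int
    start_seconds: float
    end_seconds: float

    @property
    def duration(self) -> float:
        return self.end_seconds - self.start_seconds


def flag_long_chains(chains, max_seconds=120.0):
    """Return chains lasting longer than max_seconds, longest first.

    The threshold is an arbitrary assumption; tune it against your own data.
    """
    suspicious = [c for c in chains if c.duration > max_seconds]
    return sorted(suspicious, key=lambda c: c.duration, reverse=True)


# Made-up example: chain 2 spans a long stoppage and should be checked on video.
chains = [
    Chain(1, 0.0, 14.5),
    Chain(2, 20.0, 625.0),
    Chain(3, 630.0, 660.0),
]
flagged = flag_long_chains(chains)
for c in flagged:
    print(f"chain {c.chain_id} lasted {c.duration:.0f}s, check the video")
```

Anything this flags is only a candidate for review; a long chain can be genuine, which is exactly why the video is worth having.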
What do we want?
Experience & Interests
We’re a disparate bunch, and it fills me with glee that something as simple as a bunch of spreadsheets can lead to so many different dreams and careers. Most of us are into stats because we feel it helps us understand the game better, but beyond that, we’re putting content out with the hope of one day getting paid, inside or outside the game.
I added a couple of options this time just based on the different people I’ve met and chatted to over the last year, but it turns out not many people are interested in gambling or the intermediary business. I’ll let you all chase those £20k analytics jobs and £25-an-article writing gigs for a couple of years and then you can maybe re-examine the parts of the sports industry where the real money is, and where the edges really make a difference. ¯\_(ツ)_/¯
An issue that bubbles up again and again on Twitter is how best to give feedback. It is an incredibly difficult and delicate issue, but here is how you responded:
I put this in for a variety of reasons, some of them personal. I must admit, when I first got into analytics (a little over a year ago, publicly), I produced some absolute dross. Misleading, badly thought out, dead-end crap. I got likes and retweets… precious #numbers on Twitter for those polygonal attacking charts, and numbers are the mind killer. You get addicted to the numbers. You feel validated by the numbers. Those charts were born out of studying attacking buildup (possession chains, as the good doctor would have us call them). I truly believe the use of space and speed are absolutely central to attacking play, and they were one exploration of that, both on a football level, of just seeing what different teams looked like, and on a technical level, of working out possession chains and calculating convex hulls of possessions given their coordinates. These are incredibly valuable things for me, and I was incredibly grateful for both the kind words and the feedback I received. But let me be absolutely clear about those charts: they will never ever win anyone a football game. This dawned on me, and was also pointed out to me by people in the game, and I moved on (to other stuff that is also probably of dubious worth, but hey, it’s a process).
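For anyone curious what the convex hull step looks like, here is a minimal sketch in plain Python. It is not the code behind those charts: it uses Andrew's monotone chain algorithm and the shoelace formula as one common way to do it, and the function names and the coordinates are invented for illustration.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull of 2D points via Andrew's monotone chain.

    Returns the hull vertices in counter-clockwise order.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


# Made-up event coordinates for one possession, in arbitrary pitch units.
possession = [(10.0, 10.0), (30.0, 10.0), (30.0, 40.0), (10.0, 40.0), (20.0, 25.0)]
hull = convex_hull(possession)
area = polygon_area(hull)
print(hull, area)
```

The hull area gives a crude measure of how much space a possession covered, which is the sort of number those polygon charts were drawing.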
This is how the system should work, more or less efficiently. People need space to experiment with data, develop ideas, learn about the game, and test the waters of people’s interest inside and outside the professional game. And I’ll also point out here that stats as entertainment is also completely valid. Not everything needs to be aimed at professional clubs, or even winning statistical arguments. But I think as a community, we need to learn to draw that line extremely clearly. Because there is abysmal and colourful work out there that reveals nothing but the fact that its creator had some data and wanted to do something with it. There’s work that gets widely retweeted, that leads to more of the same, on a schedule, for every club and every game, and the numbers increase, but the ideas underlying the work stagnate. Nothing new is added, no assumptions are invalidated, nothing is tested against real games. One example this week of avoiding this cycle was Sander exploring some centrality measures from graph theory in his passing charts (already the most controversial visualisation in football analytics). I know some people haven’t worked out the mute buttons on their Twitter clients and post game viz can clog their timeline, but I’m at least impressed that he’s not just taking the retweets and sitting there. I hope we’ll all experiment with new metrics while still trying to relate them back to fundamental theories about the game of football.
At the same time, we have terrible communal memory about what’s been tried and found unhelpful before. Part of this is that we, grumpy and defensive about our work as we are, have managed to drive elder statesmen of the field like Dan Altman away (and he seems to be doing okay without Twitter). Dan’s intentions could be hard to fathom: he would judge other people’s work despite the details of his own being proprietary secrets, which often felt inequitable. But we should also be honest – of all the people that responded above that they wanted to work professionally in clubs with analytics, every single one is competing with the others. For recognition, for prominence, so they can one day get one of those sweet £20k a year jobs. Despite all of that, most feedback, however vicious it might seem via the tone-deaf medium of Twitter, is coming from a place of statistical truth or practical football experience. Some of it is foul-mouthed, but I believe these people would still buy you a pint at the Opta Pro Forum.
If the above chart tells you anything, it’s that people want feedback on their work. They may not want the savage but artful trollistry of an anonymous coconut, but they want to learn, and get better. Sure, they want numbers too, and they need space to experiment, but I don’t believe we’re killing promising ideas in their cribs by pointing out flaws in nascent work.
So, what’s stopping us doing better work? This is what people thought:
Everybody wants data. More data, better data. I was a bit sad to see ‘lack of scientific rigour’ down there, because it just means when we get all this data we’ll make a massive mess of it, thus Dr Marek Kwiatkowski’s seminal piece. But I stand by my comments that we need to play and explore before we make much progress.
THE BIT WHERE I GET YOU SOME FREE DATA
And so I bring good tidings: the nice people at Stratagem have a standing offer to anyone with a blog that wants to write about sports data:
As a company we are simply looking to increase the awareness of our unique dataset through partnerships with prolific and respected members of the analytics community. We have a team of over 50 performance analysts who have collected data on over 10,000 matches from 22 professional football competitions to date, with specific focus put upon chance quality. We break scoring chances down into six categories and collect granular details such as number of players between the ball and goal, defensive pressure and shot quality. All matches since June 2016 come complete with XY coordinates on key events such as goals, chances and assists.
If you’d like some data to write an article and you’re happy attributing the source, you can contact Dave Willoughby directly, or head over to the Stratagem website to find out more about what they do.
On top of that, you can also join the efforts over at WoSo Stats, a community that gathers and analyses data about women’s soccer. I suspect they are always on the lookout for anyone who can help keep their hand-coded data up-to-date.
So, that is a step in the right direction, and nobody even mentioned the dark lord’s name. Aside from data, you’ll note that the option about club wages scored highly, whereas very few people chose the two ‘aggro’ options near the bottom – indicating once more that people really are looking for healthy debate and robust feedback of their work. I put in the question about copying work because it’s something I often hear grumbles about, but generally I think it’s entirely healthy for people to clone and (preferably) elaborate on others’ work, as long as some minimum amount of credit is given. If anything, I think we should be making it vastly easier to copy and replicate our work, which can only have a positive effect on the robustness of our conclusions.
Possibly the most fundamental philosophical question facing us as football analysts today is whether or not you work in an air conditioned office. I am glad to say that the scales are tipping in favour of climate control, and I hope clubs continue to invest in this essential technology:
It’s been a funny year. With SmartOdds canning their analytics department it felt like the one organisation that really shouted about their use of stats had stepped back, damaging the field. As I hope some of these results show, the work still goes on elsewhere, just slightly more quietly.
For a while it seemed like there was a dearth of new work in the fanalytics community, with a lot of people having moved on from Twitter for professional or personal reasons. But towards the end of the year, there’s been a fantastic influx of new stuff. I don’t really want this to turn into a roundup of the best work in 2016, and I also don’t want to risk missing anybody out, but I feel going into the new year that there’s a strong cohort of active, prolific people shining a light in exactly the right areas.
Anyway, I promised myself I’d get this out in 2016, so if you’re looking for a rousing conclusion you’re out of luck. All I’ll say is that if you do even semi-good work in public, good things will happen to you, no matter how many crappy Sherwood memes you tweet to try and make yourself unemployable.
I published a survey this week, asking people about their interest in football stats and analytics, their ambitions and skills. I could and probably should have asked a lot more: it’d be cool to know where you’re all based and what teams you support, if only to confirm that statsworld is a sea of Tottenham and Arsenal fans. It would have been good to quantify just how few smart women have a voice in the football stats community.
So I dropped the ball on that, but I think we have some interesting data besides. I’ve only been writing here for a month or so, and I took the somewhat circuitous route into football stats of following Ted Knutson back when he edited a Magic: the Gathering website. Because of that, I’m intrigued as to what’s holding more people back from writing, theorising, and generally contributing to the ruckus. Let’s find out!
I got 79 responses in the couple of days the survey was up – thanks to everyone that contributed, and to those that retweeted the link! Of these responses, 13 work at clubs professionally, and we’ll look at that in more detail later.
It’s a weird feature of the statosphere that everyone seems to assume everybody else is young. Scamps like the Analytics FC mandem and student-bedroom YouTube sensation Joel Salamon distract us from some of the more venerable members of the community. What’s the truth?
This is pretty left-leaning, and more pronounced when we just focus on the analysts at clubs:
The good news is, if you’re young and interested in football stats and analytics, the only barrier between you and clubs is how good you are and how you can get noticed. It’s also possible that most 35-year-olds don’t sit around all day filling out dumb online surveys because they have tons of work to do, I’m not sure.
One of the survey’s main motivations was finding out how many people were already involved in doing stats work, how many wanted to be, and what might be holding them back. Let’s look at what our respondents are up to:
I like the blogging numbers – it’s nice to see that people are taking the advice to just get themselves out there – a good 60% of people who can see themselves blogging about stats have already taken the leap. People aren’t lying when they say that if you make good stuff, it’ll get noticed.
People’s ambitions here are pretty clear – getting into professional football clubs is most people’s dream, but one only realised for a few at this stage. More seem to want to do consultancy than take a full-time job at a club, perhaps just because the jobs are thin on the ground – I would still assume the median number of full-time stats people at Premiership clubs is zero.
There are also surprisingly few getting paid to write about stats. Outside of the echo chamber, there clearly isn’t an enormous market for stats-heavy pieces, but it’ll be interesting to see how this number changes as time progresses and the wider media incorporate more stats content.
Also worth noting the smallish numbers of people in academia. Given the dearth of paying jobs in the media, the limited number of jobs at clubs and the generally secretive nature of cutting edge work, I personally think it’d be great to see people in academia taking more of a leadership role in the stats community, but maybe my Twitter feed isn’t representative and I’m missing stuff.
Podcasting is increasingly popular, with Analytics FC hosting a series of impressive guests, and I missed off video as a medium, which is sad because in addition to Joel’s excellent videos (and their very entertaining comments sections), I think we can all agree that this is the single greatest contribution to football analytics.
Given the hopes and dreams above, what’s holding us back? The survey asked about the biggest barriers holding back the community:
The two on the left are the most common complaints I see on the Twitter statosphere. Data is the lifeblood of stats work and it’s either very expensive to acquire, or time consuming and of dubious legality. The latter point’s important: even today, WhoScored took out a gun and aimed it at their foot in response to Joel’s latest video:
The situation gets even more complicated when it comes to positioning data, the holy grail for a lot of analysts. Clubs are in an odd situation that they have to opt-in to a sharing agreement to get positioning data about other clubs, and so there’s only a small handful that have any data at all. That’s a function of paranoia and also presumably a lot of clubs not having the resources to do anything useful with the data.
About data, I will just say this: in 10 years time, you will be able to create all the data that Opta and Prozone produce using smartphone-level video and open source computer vision software on your laptop. If someone with the resources of Google wanted to, they could do this in the next couple of years, for every match in the world. I do not believe for a second that the data side of the industry is a valuable long term investment, except in cases of really privileged information like training performances or behind closed doors in academies.
Opta and Prozone will thrive on having the best researchers working for them, in tandem with the best tactical minds at clubs. WhoScored and Squawka will thrive on having the best writers working for them, making this stuff accessible and interesting.
The best way for these companies to find this talent, it appears to me, is to free the data and hire everybody you think does something interesting with it. Maybe that’s naive.
Anyway, enough of that. Elsewhere, there is a lack of stats-focused content in the media. It’s been a year of progress – you’re almost as likely to hear “expected goals” on your TV these days as you are “rainy Tuesday night in Stoke”. It’s also been a year of recurring beef, with Neil Ashton’s seminal air-conditioning piece in the Mail and the fallout from Brentford’s misadventures in the managerial market.
All you can do is keep writing, make it accessible, and hope that narratives in the stats community pan out enough that you can build trust. I certainly think it would have been great for the media to pick up on the West Ham over-performing story, it’d be money in the bank for stats people. Make content that wins people arguments in the pub, and bit by bit people will become more accustomed to thinking about stats.
If data’s the biggest barrier to entry or progress in football stats, how are people getting it today?
The most common thing to do is look at sites with accurate, timely raw numbers like WhoScored. Don’t scrape them and get in trouble, but do note that Squawka’s terms and conditions say this:
You are not permitted to use this website other than for private, noncommercial purposes. Use of any automated system or software to extract data from this website for commercial purposes (“screen scraping”) is prohibited. Squawka reserves its right to take such action as it considers necessary, including issuing legal proceedings without further notice, in relation to any unauthorised use of this website.
So for non-commercial purposes, maybe you’re fine. Ask your lawyer.
Kudos to the 13 people out there manually collecting stats. You can use tools like John Burn-Murdoch‘s pitch tracker to create data, and with enough time maybe you’ll have the best data in the world about set pieces or something.
In addition to these numbers, 44% of respondents to the “how do you manage football data?” question say they keep a list of bookmarks to manage data. I suspect given these numbers that most people are able to judge players and teams reasonably well, looking at their shot numbers, or aggregated data like those at Objective Football. That’s a good foundation and indicates a great level of stats literacy in the community. It’s been brilliant to see the amount of stuff Paul Riley‘s been making public, as finally everyone has access to an expected goals model, raising the bar even higher.
It remains a shame that so few people have access to Opta feeds, but hopefully more and more aggregated data and tools can be made public without triggering some sort of retaliation from the owners of the data (who have paid lots of money and put lots of work into collecting it, I should make clear).
What are the secrets to doing magic with football stats? Well, no secrets, just the usual suspects:
Almost everybody lives inside a spreadsheet of some sort. Tableau is pretty standard at this point, and R is about twice as popular as Python as the language of choice for stats work. Stata gets an honourable mention as it popped up a couple of times.
The SQL number is low, but I guess that reflects the fact that most people aren’t dealing with event data in bulk, or just make do with R dataframes or something. I was the only one that ticked the GIS box, and I think you’re all mad. Being able to do geometry stuff inside SQL is huge: my shot buildup charts are basically a 5-line query that runs in less than a second. If you ask me, everybody should be looking at putting stuff into SQL Server 2016 when it’s released, you get SQL, GIS functionality and embedded R, all in one platform. Get on BizSpark, it’s all free.
The survey had a big section asking people about the sort of metrics and models they can and do produce. I think this is one of the most important questions, because it shows where we might be falling down as a community in terms of education, but it also points at the areas that are primed for new research because fewer people are working on them.
So on the left of zero you’ll see those that don’t currently know how to calculate a metric or build a useful model. On the right are those that know how, and indeed those that already have working models. Broadly speaking the techniques at the top are better known, and at the bottom are less known.
At the top is the simple stuff: calculating TSR and PDO is fairly straightforward, and it’s good to know how it’s done instead of just consuming the numbers. It also leads on to more advanced stuff, like calculating TSR/PDO but with xG numbers instead of goals and shots.
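If you’ve only ever consumed these numbers, here is a minimal sketch of how they are usually calculated. It assumes the common football convention of building PDO from shots on target and scaling it to 1000; other write-ups use all shots or a scale of 100, so check which definition your source uses. The function names and the season totals are made up for illustration.

```python
def total_shots_ratio(shots_for, shots_against):
    """TSR: a team's share of all shots taken in its matches."""
    total = shots_for + shots_against
    if total == 0:
        raise ValueError("no shots recorded")
    return shots_for / total


def pdo(goals_for, sot_for, goals_against, sot_against):
    """PDO: scoring percentage plus save percentage, scaled to 1000.

    Assumes the shots-on-target convention; 1000 is roughly league average.
    """
    if sot_for == 0 or sot_against == 0:
        raise ValueError("need shots on target for both sides")
    scoring_pct = goals_for / sot_for
    save_pct = 1.0 - goals_against / sot_against
    return 1000.0 * (scoring_pct + save_pct)


# Made-up season totals for one team.
tsr = total_shots_ratio(shots_for=600, shots_against=400)
team_pdo = pdo(goals_for=50, sot_for=200, goals_against=30, sot_against=150)
print(f"TSR {tsr:.3f}, PDO {team_pdo:.0f}")
```

A team well above 1000 is converting or saving at a rate that tends not to last, which is why PDO gets read as a rough luck indicator alongside TSR.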
Strikers are, as ever, dead easy to model. Even just using surface stats like shots on target/90 and various conversion rates, you can get an idea of who’s good, who’s overperforming, and who is sustaining their performances between seasons.
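To make the surface stats concrete, here is a minimal sketch of per-90 and conversion calculations. The helper names and the striker’s numbers are invented for this example; real work would also want a minimum-minutes cutoff, which is left out here.

```python
def per90(count, minutes):
    """Scale a raw count to a per-90-minutes rate."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return count * 90.0 / minutes


def conversion_rate(goals, attempts):
    """Goals divided by attempts; returns 0.0 when there were no attempts."""
    return goals / attempts if attempts else 0.0


# Made-up season for one striker.
minutes, shots, shots_on_target, goals = 2700, 90, 45, 18

sot_per90 = per90(shots_on_target, minutes)
shot_conversion = conversion_rate(goals, shots)
sot_conversion = conversion_rate(goals, shots_on_target)
print(f"SoT/90 {sot_per90:.2f}, conversion {shot_conversion:.0%}, "
      f"SoT conversion {sot_conversion:.0%}")
```

Comparing the same rates for one player across two seasons is the simplest check on whether a hot streak is being sustained.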
At the other end of the spectrum, defender ratings obviously make an appearance – this is one of the hardest areas to judge, especially lacking positioning data that is key to so much defensive play.
Right at the bottom is predicting total corners/goals. This isn’t really that analytically useful, but for those of you that bet, these are big markets, and some of the easiest to find value in.
The appearance of goalkeeper ratings near the bottom is a surprise, if only because keepers are more or less the flipside of strikers. Tons of data available, clear metrics for what’s good and bad, even if you’re not using an xG-like model. I will take a moment to push my expected saves model and goalkeeper Christmas Shopping pieces.
A couple of people in the ‘other’ option mentioned working on youth models, or career predictions, which seems like a brilliant area to look into.
I put three questions about education into the survey, mostly because I wanted to make it clear that you can do great stats work without too much formal education, maths or otherwise.
About 40% don’t have a degree, and most that do weren’t necessarily in mathsy subjects, instead doing stats in the social sciences, or taking maths modules in the natural sciences or computer science etc. That said, only 2 of the 13 respondents currently working with professional clubs had less than a bachelor’s degree, so be aware of that.
There weren’t many Sports Science respondents at all, and I’d be interested to hear from anyone with an opinion about whether Sports Science degrees serve you well for work in stats or analytics.
I also asked about coaching qualifications. 9 of you have the equivalent of a Level 1 Certificate in Football, 2 have Level 2, and we were graced by 2 UEFA B Licensed coaches.
The Biggest Issue Facing The Stats Community Today
The proportions remain the same inside professional clubs, and frankly I’m rethinking this whole stats career thing as a result. I’m game for unionising if you are.
You can download a slightly sanitized and anonymized version of the data here.
I don’t see a lot of statistically significant data pointing at surefire ways to get into paid work in football stats. But what I do see is tons of ways that we as a community could help, educate and collaborate with each other. I’d love to think that one day Alan Shearer will wake up every morning and check expected goals tables to see how the season’s going, but that’s a long way off, and in the meantime, it’s clear that there are loads of people that want to contribute more but can’t. I take my hat off to people like Analytics FC, whose podcast is putting important people and their work front and centre, and to Paul Riley, who as much as anyone seems to be trying to put his work (and importantly, his data) out in the open for people to build on. And most of all, huge props to StatsBomb, who I think served as the epicentre and catalyst for a lot of people to either start thinking about stats stuff, or even better to get off their arses and write about it.
So let’s all ask ourselves what we can do to help each other. I know there are tons of smart people out there that have great ideas but perhaps not the programming skill. I know there are great programmers who have no idea where to get data from. If anyone sees my stuff and wants to know how it came to be, get in touch, maybe I can give you some pointers.
In the meantime, one idea that I thought was worth doing straight away was building a custom football stats search engine. My hope is that this will make it a little easier to find existing research to bring yourself up to speed, find new avenues of research, or at the very least, avoid wasting time redoing work that’s already been done. Annoyingly I’m on WordPress.com here so can’t embed it, but you can bung the following code on your site and get a search box for it:
Or even without script:
<form action="http://www.google.co.uk/cse" id="cse-search-box" target="_blank">
  <input name="cx" type="hidden" value="018110615440115988629:xtvxg7sucik" />
  <input name="ie" type="hidden" value="UTF-8" />
  <input name="q" size="30" />
  <input name="sa" type="submit" value="Search" />
</form>
Bookmark it, use it, tell me if there are sites missing that should be indexed. It’s not much, but it’s something I kept wishing existed, so hopefully it helps a tiny bit.
… And Relax
Thanks again to everyone that contributed to the survey, I hope the results are interesting. In six months or a year I’ll probably do this again, so I’d love some suggestions for questions for next time around.