One of the most treasured pastimes of online football fandom is reveling in the post-loss histrionics of Arsenal fans on AFTV (formerly known as Arsenal Fan TV). We’ve all seen that gif of Michael Jackson eating popcorn, captioned “Waiting for AFTV like”, posted countless times on Reddit and Twitter after another mauling by Bayern Munich or dropped points against a lower-table side.
The accusation that AFTV feeds off of and profits from Arsenal’s failures is the most prominent criticism of the channel, and one that was made by Arsenal right-back Hector Bellerin. The channel’s notoriety has led to national press coverage, debates on Sky Sports, and a sit-down discussion between channel founder Robbie Lyle, some channel regulars, and United legend and current pundit Gary Neville. Robbie does not shy away from this criticism and issued an 18-minute response to Bellerin’s remarks.
Now AFTV and its cast of characters may not be your cup of tea, but it is impossible to deny the tenacity and fortitude of their support. As far as I can tell, they show up to every Arsenal match, including 3-0 League Cup losses on a weekday against a Championship side, far-flung Europa League matches that are a nightmare to coordinate travel for, and pre-season tournaments on the other side of the world. In my opinion, they’re a net positive. If you can get past the bluster, you’ll find content that engages in serious analysis from a diverse array of voices.
Despite AFTV’s rebuttals to its criticism, I don’t believe it’s unfair to say their videos, particularly the post-match fan cams, are more popular after Arsenal drops points. If you take a look at their most popular videos, you’ll see that nine of the twelve most viewed videos are post-match rants.
However, confirming this hypothesis requires a robust statistical analysis of quality data. Given my skills as a data analyst and some extra free time, I have volunteered my abilities for this undertaking. What follows is my research into this question, including descriptions of my methodology and interpretations backed by stats and charts.
II. Data Acquisition and Construction
The first step in any data analysis project is getting the data and cleaning it. This was a project I thought about doing for a while, but never got around to completing because of the massive hurdle of getting the YouTube views data. That was until I came across the YouTube API Python package, which makes it incredibly easy to pull data on the AFTV channel and its videos.
A significant initial step was finding a way to grab only post-match videos (fan cams and the day-after discussions). There was no way I was gonna sift through the 12,000+ videos and manually pick out the ones I needed, but luckily for me, AFTV organizes their videos in playlists. Playlists containing the kind of videos I was seeking typically include the score in the title, e.g. “Arsenal 1-0 West Ham (07/03/2020)”, or they say “Arsenal vs Club X”.
Here’s a quick rundown of the data preparation, from acquisition to cleaning.
Download data on all 530 AFTV playlists. Use regex and a bit of manual inspection to keep the 349 playlists containing post-match videos, identifiable because their titles mention the match.
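A minimal sketch of what that regex filter could look like. The pattern, the function name and the sample titles are illustrative assumptions, not the code actually used for this project, and a filter like this would still need the manual inspection mentioned above.

```python
import re

# Assumed pattern: a short scoreline such as "1-0", or the word "vs".
# Limiting scores to 1-2 digits avoids matching things like "2019-20".
MATCH_TITLE_RE = re.compile(r"\b\d{1,2}\s*-\s*\d{1,2}\b|\bvs\b", re.IGNORECASE)


def is_match_playlist(title):
    """Return True if a playlist title looks like it refers to a match."""
    return MATCH_TITLE_RE.search(title) is not None


def keep_match_playlists(titles):
    """Keep only the titles that mention a score or 'vs'."""
    return [t for t in titles if is_match_playlist(t)]


# Hypothetical titles, made up for illustration.
sample_titles = [
    "Arsenal 1-0 West Ham (07/03/2020)",
    "Arsenal vs Club X",
    "Transfer Daily",
    "Best of the 2019-20 Season",
]
kept = keep_match_playlists(sample_titles)
```

Anything the pattern lets through by mistake (or misses) is what the manual pass is for.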
Now I have a list of the videos I want to collect data on, so I pass that into the API’s search function, which returns data such as video title, video description, view count, publish date, like count, dislike count, comment count, tags, and more.
FYI, Google imposes what’s known as a rate limit, which basically means there’s a cap on the amount of data I can download per day, so I had to stretch this process over the course of 5 days.
Next up, tagging videos by season and match result. Using the publish date of each playlist, I could easily mark down its season. For match result, I was able to code a third of the playlists automatically because the score is in their name, but I had to manually tag the other two-thirds myself (not as bad as it sounds). And if you’re wondering, I opted against tagging the match competition because of the additional burden.
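Here is a rough sketch of how the automatic part of that tagging could work. It assumes titles follow the “Home 1-0 Away (date)” shape quoted earlier and that a season starts on 1 July. Both are my assumptions for illustration, as are the function names.

```python
import re
from datetime import date

# Assumed title shape: "<home team> <home goals>-<away goals> <away team> (...)"
SCORE_RE = re.compile(
    r"^(?P<home>.+?)\s+(?P<hg>\d+)\s*-\s*(?P<ag>\d+)\s+(?P<away>.+?)\s*(?:\(|$)"
)


def result_from_title(title):
    """Return 'win', 'draw' or 'loss' from Arsenal's side, or None if unknown."""
    m = SCORE_RE.search(title)
    if m is None:
        return None  # no score in the title: needs manual tagging
    home_goals, away_goals = int(m["hg"]), int(m["ag"])
    if "arsenal" in m["home"].lower():
        ours, theirs = home_goals, away_goals
    elif "arsenal" in m["away"].lower():
        ours, theirs = away_goals, home_goals
    else:
        return None
    if ours > theirs:
        return "win"
    if ours < theirs:
        return "loss"
    return "draw"


def season_label(published):
    """Map a publish date to a label like '19-20' (season assumed to start 1 July)."""
    start = published.year if published.month >= 7 else published.year - 1
    return f"{start % 100:02d}-{(start + 1) % 100:02d}"
```

Titles such as “Arsenal vs Club X” return `None`, which corresponds to the two-thirds that had to be tagged by hand.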
The final step involved joining the playlist data with the video data so that each video could be tagged with a season and match result.
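The join itself can be sketched as a simple lookup, assuming each video record carries the ID of the playlist it came from. The field names (`playlist_id`, `season`, `result`, `views`) are hypothetical and the records are made up.

```python
def tag_videos(videos, playlist_tags):
    """Attach each playlist's season and result to the videos it contains.

    videos: list of dicts, each with a 'playlist_id' key (assumed field name).
    playlist_tags: dict mapping playlist_id -> {'season': ..., 'result': ...}.
    Videos whose playlist was not kept are dropped (an inner join).
    """
    tagged = []
    for video in videos:
        tags = playlist_tags.get(video["playlist_id"])
        if tags is None:
            continue  # playlist was filtered out earlier
        tagged.append({**video, **tags})
    return tagged


# Made-up records for illustration only.
playlist_tags = {"PL1": {"season": "19-20", "result": "win"}}
videos = [
    {"video_id": "a", "playlist_id": "PL1", "views": 1000},
    {"video_id": "b", "playlist_id": "PL9", "views": 500},
]
tagged = tag_videos(videos, playlist_tags)
```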
This was an arduous process (certainly harder than what I deal with at work) and I did my best to construct a decent-sized and unbiased (statistically speaking) dataset. Please share your feedback on my process in the comments below.
My finalized dataset features rich clean data on 6255 videos spanning seven seasons. Now for the fun part!!
The most important tool in my arsenal (pun intended) is what’s known as a t-test, a common statistical tool for deciding whether the average of one collection of numbers is genuinely greater or less than the average of another (see where this is going?). It’s by no means a perfect method and has its own valid criticisms, but it’s not like I’m trying to prove if a cure for Covid-19 actually works.
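For the curious, here is a minimal sketch of a two-sample t-test using only the Python standard library. It uses Welch’s version (unequal variances) and a normal approximation for the p-value, which is reasonable for large samples. Both choices are assumptions on my part for illustration and are not necessarily the exact variant behind the numbers in this post.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test(a, b):
    """Return (difference of means, t-statistic, approximate two-sided p-value).

    The p-value uses a normal approximation to the t distribution,
    so it is only trustworthy when both samples are reasonably large.
    """
    se_a = variance(a) / len(a)  # squared standard error of mean(a)
    se_b = variance(b) / len(b)  # squared standard error of mean(b)
    diff = mean(a) - mean(b)
    t_stat = diff / math.sqrt(se_a + se_b)
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return diff, t_stat, p_value


# Toy numbers, not real view counts.
diff, t_stat, p_value = welch_t_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
```

A negative difference and t-statistic mean the first group’s average is lower than the second’s, which is how the tables below should be read.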
Before we get to the main course, let me present some summary stats and exploratory data analysis that answers some of the basic and obvious questions you’d expect.
The average video gets about 76,239 hits, but that doesn’t tell the full story because of the channel’s growth over the years, so let’s visualize the average hits per season along with the number of videos per season.
As you can see, AFTV has grown considerably in popularity since its early days; the jump in hits from 15-16 to 16-17 stands out. The red chart on the right indicates I need to factor seasons into my analysis.
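The numbers behind a chart like this come from a plain group-and-average step. A minimal sketch, assuming each video is a dict with hypothetical `season` and `views` fields (the sample values are made up):

```python
from collections import defaultdict


def hits_by_group(videos, key="season"):
    """Return {group: {'videos': count, 'avg_views': mean views}} sorted by group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [total views, video count]
    for video in videos:
        bucket = totals[video[key]]
        bucket[0] += video["views"]
        bucket[1] += 1
    return {
        group: {"videos": count, "avg_views": total / count}
        for group, (total, count) in sorted(totals.items())
    }


# Made-up records for illustration only.
sample = [
    {"season": "15-16", "result": "win", "views": 100},
    {"season": "15-16", "result": "loss", "views": 300},
    {"season": "16-17", "result": "loss", "views": 1000},
]
per_season = hits_by_group(sample)
per_result = hits_by_group(sample, key="result")
```

Passing `key="result"` gives the same breakdown by match result instead of by season.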
As an initial step toward confirming or dispelling our hypothesis, let’s visualize video data by match result.
It appears we have a smoking gun(ner)!! The average number of hits from videos produced after defeats is almost twice that of draws and more than double that of wins.
Let’s see if the t-tests support this assessment. In the following table, I lay out the results of my three t-tests which evaluated: win hits vs draw hits, win hits vs loss hits, and draw hits vs loss hits. If you need a quick t-test lesson, I recommend this video from Khan Academy.
|Test||Diff of Means*||t-Statistic||pvalue**|
|Win Hits vs Draw Hits||-12684.40||-3.174||0.0015|
|Win Hits vs Loss Hits||-77728.15||-12.30||2.435e-33|
|Draw Hits vs Loss Hits||-65043.75||-9.021||3.73e-19|
* Difference in mean hits between the two categories, taken in the order they appear under the Test column, i.e. average hits for wins are 12684.40 lower than average hits for draws.
** All significant values italicized
Voila! I have not proven that Arsenal’s failure causes AFTV to be popular, but I have demonstrated that the differences between the three types of hits are unlikely to be due to random chance, which is essentially what the pvalues are telling us. In stats terms, the 0.0015 in the top right means that if match result truly made no difference, there would be only a 0.15% chance of seeing a gap between draw hits and win hits at least this large. Since all three pvalues are below 0.05 (a commonly used threshold to determine significance), we can say that all three differences are significant.
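If the idea of a pvalue still feels abstract, a permutation test makes it concrete: shuffle the result labels many times and count how often a gap at least as big as the real one shows up by luck alone. This is a different technique from the t-tests used in this post, sketched here only to illustrate what “due to random chance” means. The function name, defaults and toy numbers are all assumptions.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_shuffles=2000, seed=0):
    """Estimate a two-sided p-value for the difference in means of a and b."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the group labels carry no information
        shuffled_gap = abs(mean(pooled[: len(a)]) - mean(pooled[len(a):]))
        if shuffled_gap >= observed:
            at_least_as_extreme += 1
    # The +1 terms count the observed arrangement itself, so p is never 0.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)


# Toy numbers, not real view counts.
low = [10, 12, 11, 13, 9, 14, 10, 12]
high = [30, 28, 33, 31, 29, 35, 27, 32]
p_separated = permutation_p_value(low, high)
p_identical = permutation_p_value(low, low)
```

When the two groups are clearly separated, almost no shuffle reproduces the real gap, so the estimate is tiny. When the groups are identical, every shuffle does, and the estimate is 1.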
But we’re not done yet! Remember what I said about seasons. Let’s go ahead and visualize average hits by season and conduct a round of t-tests for all seven seasons.
Post-loss videos being more popular than draw and win videos is a consistent trend across all seven seasons, putting us one step closer to answering this project’s question. Now if you’re like me, averages are not quite satisfying enough and you want to know what the distributions look like. That’s why I’ve created a version of this visualization with boxplots, each representing the collection of hits for one season and match result. And of course, Khan Academy has got you covered on boxplots.
For visual ease, I switched to a log scale for the views, so please keep that in mind when digesting the results. This more holistic view of the data provides solid support for my assessment that losses are generally associated with higher view counts.
Now for the t-tests. The following table applies the same methodology as the previous table, but to each season separately. All values presented are rounded, and significant pvalues are italicized.
|Seasons||W vs D Mean Diff||W vs D T-Stat||W vs D Pvalue||W vs L Mean Diff||W vs L T-Stat||W vs L Pvalue||D vs L Mean Diff||D vs L T-Stat||D vs L Pvalue|
I know this is a lot of numbers to throw at you, but the big takeaway here is that for every season, the hits for post-loss videos are significantly greater than those for both post-win and post-draw videos.
Given this exhaustive analysis, I believe the evidence demonstrates that Arsenal Fan TV generates more hits after Arsenal drops points. My best guess at why this happens is that the rest of us non-Arsenal fans enjoy their post-loss meltdowns, although view counts alone can’t tell us who is watching. The data suggest that AFTV’s success is fueled in part by Arsenal’s failure.
Thank you for indulging me. I’ve spent way more time than I needed to on this ridiculous project.
There are plenty more ways you can analyze this data and I’d love to hear suggestions in the comments below, especially if you have ideas for a regression analysis.
And FYI I did run the same analysis for comments, likes, and dislikes and just like the results above, pretty much all those metrics are greater post-loss.
I would also appreciate any constructive criticism or interrogation of my process.
I’ll be on hand in the comments later if you have any questions that can be quickly and easily answered.
Cheers everyone and remember to do your part by staying home!!!
Here are the average hits by AFTV characters in case you were interested.