Putting an addiction to good use

People who know me can tell you about my addictive behaviour when it comes to British panel shows. I can go on and on about them: the content, the cast, the production process, you name it. I’ve just recently discovered Shooting Stars with Bob Mortimer, Vic Reeves and ULRIKA-KA-KA-KA! Warning: Shooting Stars is a very “special” show, so prepare for some Pythonesque stuff!

Let’s have a look at how I got into panel shows and why I can’t stop watching them. It all started with QI. After binge-watching this show, I stumbled upon a beautiful subreddit called /r/panelshow. There, I found a lot of good recommendations based on the people I knew from QI, and moved on to 8 Out Of 10 Cats, Never Mind The Buzzcocks and Would I Lie to You?. Which got me thinking: if I can select panel shows based on their cast, why not build a recommendation system that does exactly that?

The dove from above or: The main idea

The main goal is to predict panel shows similar to a given one, based on the people appearing in them. For this we need a measure of what is “similar” (or “dissimilar”). I used a vector space approach, where each show gets its cast as features: the features are the names of the people appearing in the show, and a show’s value for a feature is the number of appearances of that particular comedian/actor/pornstar in the show.

Show                      Jimmy Carr   Stephen Fry   Jeremy Clarkson
8 Out Of 10 Cats                 127             0                 0
QI                                 5           255                 5
Have I Got News For You            0             2                 3

If we now read each row of the table, we get a well-known representation of a show as a point in a very high-dimensional vector space. Imagine the shows as points in a coordinate system, only this time you don’t just have two dimensions, but 2646. And where there are points, you can calculate the distances between them. Get the points with the lowest distance to your favourite show and ERANU: you’ve just received your watch list for the next week.
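To make the distance idea concrete, here is a tiny sketch on the toy matrix from the table above (purely an illustration, not part of the actual pipeline):

import numpy as np

# the toy matrix from the table: rows are shows, columns are people
shows = np.array([
    [127,   0, 0],  # 8 Out Of 10 Cats
    [  5, 255, 5],  # QI
    [  0,   2, 3],  # Have I Got News For You
])

# Euclidean distance from "8 Out Of 10 Cats" to every show
distances = np.linalg.norm(shows - shows[0], axis=1)
print(distances)  # [0.0, ~282.7, ~127.1]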

Of course you could measure the distance between the points/shows in the full 2646-dimensional space, but I will apply a technique called Principal Component Analysis (PCA), which can project a high-dimensional space down to a 2-dimensional one. This makes it much easier to work with the points in an ordinary coordinate system. PCA finds the directions of greatest variance in the data (the eigenvectors of its covariance matrix) and positions the points along those directions in the low-dimensional space.
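Under the hood, this boils down to an eigendecomposition of the covariance matrix. Continuing the toy example, here is a minimal sketch of the idea (scikit-learn, used later on, does the same thing more robustly and efficiently):

# centre the data, then find the directions of greatest variance
X = shows - shows.mean(axis=0)
cov = np.cov(X, rowvar=False)           # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

# project onto the two eigenvectors with the largest eigenvalues
X_2d = X @ eigvecs[:, -2:]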

Data to whom data is due.

Most good ideas fail due to missing data. But good news, everyone! There is a website called British Comedy Guide (BCG), which stores cast information for many British panel shows. Now the bad news: you have to crawl the HTML and parse it, with different styles of data representation for different types of panellists, and sometimes missing information. ARGH! Nevertheless, BCG is consistent for all major panel shows. Let’s have a look at my crawler, shall we?

You need to set up some libraries and set a provider (here: BCG).

import requests # for downloading the HTML
from bs4 import BeautifulSoup # for parsing the HTML
import pandas as pd # for storing the data
import time # wait for it...
from collections import defaultdict

base_provider = "https://www.comedy.co.uk"

First I downloaded the complete list of panel shows and stored them with their corresponding URL for further processing.

# only TV panel shows
r = requests.get("https://www.comedy.co.uk/tv/list/panel_show/")

html_tree = BeautifulSoup(r.text, "lxml")

# read all the show titles and their urls
list_shows = [[x.text, base_provider + x["href"] + "episodes/all/"] 
                for x in html_tree.select(".m- a")]

data = pd.DataFrame(list_shows, columns=["title", "url"])
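A quick sanity check never hurts (the exact output of course depends on what BCG lists at crawl time):

print(len(data), "shows found")
data.head(3)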

Next, we need to define what one row in the DataFrame should look like and how we want to fill it.

def build_cast(idx, show):
    """Builds a one-row DataFrame for a show,
    with the number of appearances per comedian as columns.
    """
    cast = defaultdict(int)
    
    time_start = time.time()
    
    r = requests.get(show).text
    episode_list = [base_provider + a["href"] for a in BeautifulSoup(r, "lxml")
                    .select("ol.list-unstyled > li > a[href]")]
    
    for episode in episode_list:
        r_ep = requests.get(episode).text
        ep = BeautifulSoup(r_ep, "lxml")
        cast_list = [x.text for x in
                     ep.select('h3 + table a[href^="/people"]') # regulars
                     + ep.select('h3 + table + table a[href^="/people"]')] # guests
        # for the complete cast (incl. writers): 'table a[href^="/people"]'
        
        for person in cast_list:
            cast[person] += 1

        time.sleep(1) # don't hassle the server
        
    time_end = time.time() - time_start
    print("Got", len(cast), "columns for", show, "in", int(time_end), "s")
    
    return pd.DataFrame(cast, index=[idx], columns=cast.keys())

This function returns a DataFrame with one row (equal to one show) each time it is called. We simply invoke it for each row/URL in the list of shows and let pandas build a sparse DataFrame from the results.

# build cast data
cast_data = [build_cast(idx, data["url"][idx]) for idx in data.index]

# make it one DataFrame
tmp = pd.concat(cast_data)

# join with original data, remove shows with no information and fill the sparse matrix with zeros
data = data.join(tmp)
data = data.dropna(how="all", subset=data.columns[2:]).fillna(0)

Finally, save everything.

data.to_hdf("all_shows.h5", key="data")

We asked our studio audience…

With the data at hand, we can finally look for some recommendations. We are going to use the PCA implementation from scikit-learn.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
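If you are starting from a fresh notebook here, load the crawled data back in first:

data = pd.read_hdf("all_shows.h5", key="data")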

First, we need to normalise the data. I have chosen to divide each row by its maximum value. Other brands of normalisation are available.

# only normalise the numerical columns
data[data.columns[2:]] = data[data.columns[2:]].apply(lambda x: x/x.max(), axis=1)

# replace the old index left over from the crawl
data = data.reset_index(drop=True)
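As an example of another “brand”, scaling each show vector to unit length (L2) with scikit-learn would be a drop-in alternative:

from sklearn.preprocessing import normalize

# scale each row to unit length instead of dividing by its maximum
data[data.columns[2:]] = normalize(data[data.columns[2:]], norm="l2", axis=1)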

The data now has 69 rows and 2646 columns. Let PCA reduce it to two dimensions. I have used a randomised PCA; after a quick survey of the other solvers, this gave the best results. Finally, build a nearest-neighbour index to quickly query the dataset.

pca = PCA(n_components=2, svd_solver="randomized", whiten=True)
y_pred = pca.fit_transform(data[data.columns[2:]])

NN = NearestNeighbors().fit(y_pred)

The moment of truth: Let’s take an example. I have chosen 8 Out Of 10 Cats as my starting point. In my dataset the show has index 67.

# getting the 5 closest neighbours; n_neighbors=6 because the query show
# itself is returned as well
distances, indices = NN.kneighbors(y_pred[67].reshape(1, -1), n_neighbors=6)

# display them (jupyter notebook syntax)
data.iloc[indices[0], 0:1]

The result is quite good:

title
67 8 Out Of 10 Cats
68 8 Out Of 10 Cats Does Countdown
9 The Big Fat Quiz Of The Year
64 Would I Lie To You?
11 The Bubble
58 Was It Something I Said?

Among the closest neighbours you will find the spin-off 8 Out Of 10 Cats Does Countdown, which has a very similar cast, and another show hosted by Jimmy Carr (The Big Fat Quiz Of The Year). I know some of you are thinking: isn’t it counterproductive to count the host’s appearance in every episode? First of all, no, because the host is part of the panel in my model. That assumption feels natural to me: if you want to watch a show, you more or less have to like the host, since they will be there every time you tune in, so it’s reasonable to give them a high weight. Secondly, here is another example for QI (index 44):

title
44 QI
20 Duck Quacks Don't Echo
37 The Marriage Ref
57 Wall Of Fame
43 Play To The Whistle
65 You Have Been Watching

You can see Duck Quacks Don’t Echo, which has a similar concept to QI. The other panel shows might be further off, but as I said before, it is all about the cast, not about the content of the shows.

One last look at the complete data:

[Figure: the shows plotted as points in the 2-dimensional vector space]

If you want to try it yourself, you can find the data and the notebooks on GitHub. Caution: I cannot vouch for the data’s completeness or correctness, nor does it cure cancer.

And the final scores are…

You have just read about building a recommendation system for panel shows purely from the cast and cast similarities. This solution is far from perfect, but it gives a hint for further development. A lot more can be done:

  • Complete the data set from other sources.
  • Change the weights of individual panellists to represent personal favourites.
  • Lower the impact of the host (see the sketch below).
  • Try other methods, such as clustering.
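For the host idea, here is a rough sketch of what down-weighting could look like, applied before the normalisation step (the factor of 0.5 and the chosen column are of course made up for illustration):

# hypothetical: halve the influence of one host's appearance counts
host_weight = 0.5
data["Jimmy Carr"] = data["Jimmy Carr"] * host_weight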

The least I can hope for is that somebody actually picks up watching (new) panel shows. And therefore: back to Sandi Toksvig for everything Scandinavian.

Other sources and libraries include pandas, Jupyter and BeautifulSoup.