Scenario: We all understand the 80/20 rule, that says 80% of a business's revenue is generated by 20% of its customers. A casino executive might want to know how his/her property measures up. How reliant is the casino on a small number of core players?
We're going to answer this question (what % of our players are responsible for 80% of our business?) for a range of scenarios: for slots and for table games, for theo win and actual win, and we'll look back 4 years to see which way the casino is trending. And we'll also look at a 90% threshold while we're at it.
start by importing the libraries we're going to need
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib import ticker as tkr
set style for output of plots
plt.style.use(u'ggplot')
Import our data from a CSV file (during development, use a subset of the total data with the 'nrows' parameter) Note that the data is not perfectly how we'd like it -- instead of dates, we have a DATE_KEY. More often than not, cleaning and reshaping the data is a major part of the process.
# load the csv file (limit # of rows while developing), convert date format to date objects
df_all = pd.read_csv("E:\Python\CapCoop\dailyhash.csv", header=0, sep=';') #, nrows=250000)
# convert our DATE_KEYs, which are integers, to dates
df_all['gaming_date'] = pd.to_datetime(df_all.DATE_KEY, format="%Y%m%d")
df_all['year'] = df_all.gaming_date.dt.year
print(df_all.head()) # head previews the data
print(df_all.shape) # shape tells us how many rows and columns we have (over 7 million rows here)
# Also note instead of a PlayerID column we have "HashedPlayerKey."
# This allows us to uniquely identify all the records of an individual player, without revealing the actual PlayerID
# Many databases support hashing, it's a good idea when the analysis doesn't require knowing exactly who the players are
use filters to create individual dataframes for slots, tables, both and none
# create subset dataframes for slots & tables
df_slots = df_all.loc[df_all.slot_theo_win > 0]
df_tables = df_all.loc[df_all.table_theo_win > 0]
df_both = df_all.loc[(df_all.table_theo_win > 0) & (df_all.slot_theo_win > 0)]
df_none = df_all.loc[(df_all.table_theo_win <= 0) & (df_all.slot_theo_win <= 0)]
print (df_slots.shape)
print (df_tables.shape)
print (df_both.shape)
print (df_none.shape)
Source data contained a row for every player, every day. Now we want to sum total gaming win per player by year and count number of visits. If we had access to the source database, we could have used GROUP BY and aggregate functions SUM and COUNT to build these. Since we don't, we use pandas equivalent functions.
# sum/count each player's totals by year, and also rename columns to standardize
df_slot_year = df_slots.groupby(['HashedPlayerKey','year']).agg({'slot_theo_win' : 'sum',
'slot_actual_win' : 'sum',
'gaming_date' : 'count' }).rename(
index=str, columns={'slot_theo_win': 'theo_sum', 'slot_actual_win': 'actual_sum', 'gaming_date' : 'days'})
df_table_year = df_tables.groupby(['HashedPlayerKey','year']).agg({'table_theo_win' : 'sum',
'table_actual_win' : 'sum',
'gaming_date' : 'count' }).rename(
index=str, columns={'table_theo_win': 'theo_sum', 'table_actual_win': 'actual_sum', 'gaming_date' : 'days'})
SHAPE tells us the dimensions (number of rows, columns) in a dataframe
# shape and head for totals dataframes
print(df_slot_year.shape)
print(df_table_year.shape)
HEAD lets us see the first few rows - 5 if not specified
Note how the pandas GROUPBY places the grouped columns to the index, similar to an Excel PivotTable.
df_slot_year.head()
df_table_year.head(20)
# filter on theo in year 2018, and sort from largest to smallest
sorted_series = df_slot_year.xs('2018', level='year')['theo_sum'].sort_values(ascending=False).reset_index(drop=True)
# get the total, we will need this later to calculate percentages
series_total = sorted_series.sum()
# convert the single series back to a dataframe
df_sorted = sorted_series.to_frame()
# for each player, calculate the percentage of the total they represent
df_sorted['pct'] = sorted_series / series_total
# calculate a cumulative sum of percentage (remember, previously we sorted the players by total annual theo, top to bottom)
df_sorted['cumulative'] = df_sorted.pct.cumsum()
# get the index (row number) of the first (lowest row) where cumulative sum exceeds 80%
df_sorted.loc[df_sorted.cumulative > 0.8].index.min()
print initial results
# print results (we add 1 because the index is 0-based, i.e. 0, 1, 2, 3... so row #3 is the 4th row)
print(df_sorted.loc[df_sorted.cumulative > 0.8].index.min() + 1 ,
df_sorted.shape[0] , # first element of shape is the total number of rows
(df_sorted.loc[df_sorted.cumulative > 0.8].index.min() + 1) / df_sorted.shape[0]) # get the %age
# so we are printing 3 values - # of players it took to get over 80% cumulative, total # of players, and the %age
# same thing, let's insert the values into a sentence
print('the top {0:,} players out of {1:,} ({2:.1%}) were responsible for 80% of slot theo in 2018'.format(
df_sorted.loc[df_sorted.cumulative > 0.8].index.min() + 1 ,
df_sorted.shape[0] ,
(df_sorted.loc[df_sorted.cumulative > 0.8].index.min() + 1) / df_sorted.shape[0]))
So we have calculated one scenario -- 80% threshold for slot theo in 2018. But what about tables, other years, and a 90% threshold? Let's consolidate the code above into a function, then we can create a loop with all the parameters and call the function for each of them.
# create a function to calculate and print 80/20 data
def get_top_pct(year, measure, threshold, game):
df_base = df_slot_year if game == 'slot' else df_table_year
sorted_series = df_base.xs(str(year), level='year')[measure].sort_values(ascending=False).reset_index(drop=True)
series_total = sorted_series.sum()
df_sorted = sorted_series.to_frame()
df_sorted['pct'] = sorted_series / series_total
df_sorted['cumulative'] = df_sorted.pct.cumsum()
return { 'top_players' : df_sorted.loc[df_sorted.cumulative > threshold].index.min() + 1 ,
'total_players' : df_sorted.shape[0] }
we're creating an empty array, each loop we will add a row of data to it, afterwards we'll convert it to a proper dataframe (it looks like an extra step nut is actually more efficient)
rows = []
Here's our loop, we actually have 4 sets of variables to loop through
# loop through every game, measure, threshold and year and run get_top_pct function on each
for _game in ['slot', 'table']:
for _measure in ['theo_sum', 'actual_sum']:
for _threshold in [0.8, 0.9]:
for _year in [2015, 2016, 2017, 2018]:
# call our function for each loop
top_pct = get_top_pct(_year, _measure, _threshold, _game)
# print the output each time
print('the top {0:,} players out of {1:,} ({2:.1%}) were responsible for {3:.0%} of {4} {5} in {6}'.format(
top_pct['top_players'] , top_pct['total_players'] ,
top_pct['top_players'] / top_pct['total_players'] ,
_threshold , _game , _measure , _year))
# add the data to the array we just created
rows.append([_game, _measure.split('_')[0], _threshold, _year,
top_pct['top_players'] , top_pct['total_players'] ,
top_pct['top_players'] / top_pct['total_players']])
print('------')
Boom, because we created a function, we could easily perform our loop and call the function each time. So we get all 32 results in a matter of seconds.
we've completed our loops, each time we stored the result data in an array -- let's convert that to a proper dataframe
df_summary = pd.DataFrame(data=rows, columns=['game', 'measure', 'threshold', 'year', 'top', 'total', 'pct'])
df_summary
we can make a pretty interesting graph with just a bit of code -- known as lattice, or trellis, or facet charts
g = sns.factorplot(data=df_summary, x='year', y='pct', col='game', row='measure', hue='threshold', ci=None, sharey=True)
add a bit of formatting
g = sns.factorplot(data=df_summary, x='year', y='pct', col='game', row='measure', hue='threshold',
ci=None, sharey=True, palette=sns.color_palette(["#d46a6a", "#801515"]))
g.set(ylim=(0))
g.set_titles("{col_name} {row_name}")
g.set_xticklabels(size=16, family='serif')
g.set_yticklabels(size=14, family='serif')
g.axes[0][0].yaxis.set_major_formatter(tkr.FuncFormatter(lambda x, p: "{:.1%}".format(x)))
g._legend.get_texts()[0].set_text('80% of total')
g._legend.get_texts()[1].set_text('90% of total')
So let's interpret:
Going back to the original question, yes, the casino relies on a small group of VIP customers to generate a huge percentage of its business. Instead of 80/20, it's more like 80/10 all the way down to 80/4, depending on slot vs table and theo vs actual.
OK, now we know what % of players contribute 80 or 90% of the business. Let's change the perspective a bit, and see the contribution from the top 10% of players. There is a lot of overlap with the previous function, but we need to do a little extra work to cut each subsection into deciles, then create a new dataframe for the chart data.
# loop again to create plots --
# a little complicated because we need to perform the binning on each subset of data
# so we are repeating much of what the get_top_pct function does
for _measure in ['theo_sum' , 'actual_sum']:
df_chart = pd.DataFrame()
for _game in ['slot', 'table']:
for _year in [2015, 2016, 2017, 2018]:
df_base = df_slot_year if _game == 'slot' else df_table_year
sorted_series = df_base.xs(str(_year), level='year')[_measure].sort_values(ascending=False).reset_index(drop=True)
series_total = sorted_series.sum()
df_sorted = sorted_series.to_frame()
df_sorted['pct'] = sorted_series / series_total
df_sorted['cumulative'] = df_sorted.pct.cumsum()
df_sorted['decile'] = pd.cut(df_sorted.index, bins=10, labels=False)
df_sorted['year'] = _year
df_sorted['measure'] = _measure
df_sorted['game'] = _game
df_grouped = df_sorted.groupby(['year', 'measure', 'game', 'decile']).agg({'pct' : 'sum'})
df_chart = df_chart.append(df_grouped)
g2 = sns.factorplot(data=df_chart.reset_index(), x='decile', y='pct', col='year', row='game', kind='bar')
plt.subplots_adjust(top=0.9)
g2.fig.suptitle('Measure = {0}'.format(_measure.split('_')[0].upper()))
g2.set_titles("{col_name} {row_name} " + _measure.split('_')[0])
g2.set_xticklabels(size=16)
g2.set_yticklabels(size=14, family='serif')
g2.facet_axis(0,0).yaxis.set_major_formatter(tkr.FuncFormatter(lambda x, p: "{:.0%}".format(x)))
This graph is sort of the inverse of the first graph. The first graph answered the question ____% of players contribute 80/90% of theo/win. This graph says the top 10% of players contribute ____% of win/theo. Because we divided each group into deciles, the left-most bar on each plot represents the top 10% of players.
OK, one more graph -- let's just plot our top slot players each year by # of visits and theo per visit
# push the index columns back to just regular columns
dfx = df_slot_year.reset_index()
# calculate avg_theo
dfx['avg_theo'] = dfx['theo_sum'] / dfx['days']
# a little trick to get the top X percent within each year, sorted by theo_sum
dfx_top = dfx.groupby('year', group_keys=False).apply(lambda x: x.nlargest(int(len(x) * 0.01), 'theo_sum')).drop(columns=['HashedPlayerKey'])
print(dfx_top.shape)
# show the first few rows
dfx_top.head(15)
build the graph -- slightly different syntax than previous graphs
# first, we do some calculations that will be used for additional dimensions in the graph
pointsize = abs(dfx_top['actual_sum'] / 2500) # must use abs() because we can't have negative-sized points
pointcolor = np.where(dfx_top['actual_sum'] > 0, '#E24A33', '#0F8935') # we will use an alternate color for winning players (negative actual_win)
sns.set_context("talk") #style adjustment
# build the graph, the first line sets it up and second line maps various series to data elements
g3 = sns.FacetGrid(data=dfx_top, col='year', sharey=True, col_wrap=2, size=6)
g3 = g3.map(plt.scatter, 'days', 'avg_theo', s=pointsize, color=pointcolor)
# use a log scale for the y-axis
g3.axes[0].set_yscale('log')
g3.axes[0].set_yticks([100, 200, 500, 1000, 2000, 4000, 8000])
g3.axes[0].get_yaxis().set_major_formatter(tkr.ScalarFormatter())
# add a legend, in this situation we must create 'patches' to show in legend
loc_patch = mpl.patches.Patch(color='none', label='dot location: theo')
size_patch = mpl.patches.Patch(color='none', label='dot size: actual')
red_patch = mpl.patches.Patch(color='#E24A33', label='actual_win > 0')
green_patch = mpl.patches.Patch(color='#0F8935', label='actual_win < 0')
g3.axes[3].legend(loc='lower right', handles=[loc_patch, size_patch, red_patch, green_patch])
Explanation of graph above: every dot represents one player among the casino's top 1% generators of slot theo (8077 dots total over the 4 years).