Data Madness – Video Games Sales 2019

Last updated on Feb 12, 2022

Video project where we analyze a video games sales dataset, draw some conclusions based on the data and answer the following questions:

Q: What are the genres that are more likely to get a high rating?
- A: Role Playing for both user ratings and critics
Q: Does user rating reflect more on the sales or is it the critics rating
- A: Critic score has higher correlation to Global sales than user score
Q: How does the popularity of platforms change throughout the years
- A: Platforms lifespan used to be shorter and nowadays they can last up to 10 years

Github Repository

https://www.youtube.com/watch?v=NasfPaMCGNA

Jupyter notebook

The notebook file is available in the github repository, the following is an html view of the file.

IPython/Jupyter Notebook Previewer

    <meta name="viewport" content="initial-scale=1">
    <script type="text/javascript" src="./Data Madness Jupyter_files/es5-shim.min.js.download"></script>
    <script type="text/javascript" src="./Data Madness Jupyter_files/marked.min.js.download"></script>
    <script type="text/javascript" src="./Data Madness Jupyter_files/ansi_up.min.js.download"></script>
    <script type="text/javascript" src="./Data Madness Jupyter_files/prism.min.js.download"></script>
    <script type="text/javascript" src="./Data Madness Jupyter_files/katex.min.js.download"></script>
    <script type="text/javascript" src="./Data Madness Jupyter_files/katex-auto-render.min.js.download"></script>
    <script type="text/javascript" src="./Data Madness Jupyter_files/notebook.min.js.download"></script>
    <link rel="stylesheet" href="./Data Madness Jupyter_files/katex.min.css">
    <link rel="stylesheet" href="./Data Madness Jupyter_files/prism.css">
    <link rel="stylesheet" href="./Data Madness Jupyter_files/notebook.css">
    <link rel="stylesheet" href="./Data Madness Jupyter_files/nbpreview.css">
</head>
<body>
    <div id="main">
        <div id="notebook-holder"><div class="nb-notebook"><div class="nb-worksheet"><div class="nb-cell nb-markdown-cell"><h1 id="data-madness">Data Madness</h1>

Rodrigo Alejandro Chávez Mulsa & Lillian Alice Wu

Data retrieved from: https://www.kaggle.com/mohalim/video-games-sales

Grading rubric:

Key idea: Do you have main takeaways?
Scope: Given the dataset selected, did you choose the appropriate level of complexity to explain everything?
Process: Did you follow the standard data analysis process (EDA, models, evaluation)?
Solution: Is the analysis effective and answering the intended questions?
Presentation: Is your video clear, engaging and effective? Questions: Did you answer them properly?

Use this assignment to submit a link where your video is to be found (e.g. youtube, vimeo, etc.) and submit a 1/2-1 page abstract of your approach. Mention your partner (you do not need to do double submissions).

Questions we answer:

What are the genres that are more likely to get a high rating?
- Role Playing for both user ratings and critics
Does user rating reflect more on the sales or is it the critics rating?
- Critic score has higher correlation to Global sales than user score
How does the popularity of platforms change throughout the years?
- Platforms lifespan use to be shorter and nowadays they can last up to 10 years

EDA

Setup

import IPython.core.display
import matplotlib
def apply_styles():
    matplotlib.rcParams['font.size'] = 12
    matplotlib.rcParams['figure.figsize'] = (18, 6)
    matplotlib.rcParams['lines.linewidth'] = 1
    matplotlib.rcParams['axes.grid'] = True
    matplotlib.rcParams['grid.color'] = 'k'
    matplotlib.rcParams['grid.linestyle'] = ':'
    matplotlib.rcParams['grid.linewidth'] = 0.5
apply_styles()
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(style=“ticks”)
import warnings
warnings.filterwarnings(‘ignore’)
plt.style.use(‘fivethirtyeight’)

df = pd.read_csv(“data/vgsales.csv”)
# df.value.astype(int, copy=False);
df.shape

(16719, 16)

df.head(5)

	Name	Platform	Year_of_Release	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales	Critic_Score	Critic_Count	User_Score	User_Count	Developer	Rating
0	Wii Sports	Wii	2006.0	Sports	Nintendo	41.36	28.96	3.77	NaN	82.53	76.0	51.0	8	322.0	Nintendo	E
1	Super Mario Bros.	NES	1985.0	Platform	Nintendo	29.08	3.58	6.81	0.77	40.24	NaN	NaN	NaN	NaN	NaN	NaN
2	Mario Kart Wii	Wii	2008.0	Racing	Nintendo	15.68	12.76	3.79	3.29	35.52	82.0	73.0	8.3	709.0	Nintendo	E
3	Wii Sports Resort	Wii	2009.0	Sports	Nintendo	15.61	10.93	3.28	NaN	32.77	80.0	73.0	8	192.0	Nintendo	E
4	Pokemon Red/Pokemon Blue	GB	1996.0	Role-Playing	Nintendo	11.27	8.89	10.22	1.00	31.37	NaN	NaN	NaN	NaN	NaN	NaN

# df.Year_of_Release.value_counts()
# df.shape
# df.loc[0]

Deal with missing values and wrong types

df.Name.isnull()
df = df[df['Name'].notna()]
# df.User_Score.isnull().sum()

#non-numeric value convert to NaN
df['User_Score'] = df['User_Score'].apply(pd.to_numeric, errors='coerce')
# df.dtypes

df[['Publisher', 'Developer','Rating']]=df[['Publisher', 'Developer','Rating']].fillna(value='missing')
df[['Year_of_Release','Critic_Count', 'Critic_Score', 'User_Count']] = df[['Year_of_Release','Critic_Count', 'Critic_Score', 'User_Count']].astype('Int64')
# df.dtypes

cols_float = df.select_dtypes(include=['number']).columns
cols_float =cols_float.delete(0) #delete Release year
# df[cols_float] = df[cols_float].fillna(-1)
df.head(5)

	Name	Platform	Year_of_Release	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales	Critic_Score	Critic_Count	User_Score	User_Count	Developer	Rating
0	Wii Sports	Wii	2006	Sports	Nintendo	41.36	28.96	3.77	NaN	82.53	76	51	8.0	322	Nintendo	E
1	Super Mario Bros.	NES	1985	Platform	Nintendo	29.08	3.58	6.81	0.77	40.24	NaN	NaN	NaN	NaN	missing	missing
2	Mario Kart Wii	Wii	2008	Racing	Nintendo	15.68	12.76	3.79	3.29	35.52	82	73	8.3	709	Nintendo	E
3	Wii Sports Resort	Wii	2009	Sports	Nintendo	15.61	10.93	3.28	NaN	32.77	80	73	8.0	192	Nintendo	E
4	Pokemon Red/Pokemon Blue	GB	1996	Role-Playing	Nintendo	11.27	8.89	10.22	1.00	31.37	NaN	NaN	NaN	NaN	missing	missing

#Compute missing global sales based on sales in available regions
for i, r in df.iterrows():
    if r.Global_Sales<0:
        df.Global_Sales.at[i] = sum(r[c] if (('Sales' in c) and (r[c]>0)) else 0 for c in r.index)
df[df.Global_Sales<0]
df.head(5)

	Name	Platform	Year_of_Release	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales	Critic_Score	Critic_Count	User_Score	User_Count	Developer	Rating
0	Wii Sports	Wii	2006	Sports	Nintendo	41.36	28.96	3.77	NaN	82.53	76	51	8.0	322	Nintendo	E
1	Super Mario Bros.	NES	1985	Platform	Nintendo	29.08	3.58	6.81	0.77	40.24	NaN	NaN	NaN	NaN	missing	missing
2	Mario Kart Wii	Wii	2008	Racing	Nintendo	15.68	12.76	3.79	3.29	35.52	82	73	8.3	709	Nintendo	E
3	Wii Sports Resort	Wii	2009	Sports	Nintendo	15.61	10.93	3.28	NaN	32.77	80	73	8.0	192	Nintendo	E
4	Pokemon Red/Pokemon Blue	GB	1996	Role-Playing	Nintendo	11.27	8.89	10.22	1.00	31.37	NaN	NaN	NaN	NaN	missing	missing


# df[df.Platform == 'DS'].Year_of_Release.describe()
# df.Year_of_Release.loc[15959].replace(2004,inplace=True)
df.at[15959,'Year_of_Release']= 2004
df[df.Platform == 'DS'].Year_of_Release[df.Year_of_Release < 2006] = 2006
for i, r in df.iterrows():
    if r.Platform == 'DS' and r.Year_of_Release < 2004:
        print(df.Year_of_Release.at[i])
        print(i)
#Compute missing global sales based on sales in available regions
# for i, r in df.iterrows():
#     if r.Global_Sales<0:
        # df.Global_Sales.at[i] = sum(r[c] if (('Sales' in c) and (r[c]>0)) else 0 for c in r.index)
# df[df.Global_Sales<0]

Visualize stuff

# df.sort_values(by=['Global_Sales'],ascending=False)
# df.Global_Sales[:100].plot.bar()

df[df['Name'] == 'Monopoly'].head(3) #We see games are repeated since every platform has diff. values
df.Platform.unique()

array(['Wii', 'NES', 'GB', 'DS', 'X360', 'PS3', 'PS2', 'SNES', 'GBA',
       'PS4', '3DS', 'N64', 'PS', 'XB', 'PC', '2600', 'PSP', 'XOne',
       'WiiU', 'GC', 'GEN', 'DC', 'PSV', 'SAT', 'SCD', 'WS', 'NG', 'TG16',
       '3DO', 'GG', 'PCFX'], dtype=object)

# df_genre.first()#[df_genre.Name == 'Minecraft']
df_genre = df.groupby(['Genre','Name'])
df['Genre']= df['Genre'].astype('category')
genres = df.Genre.cat.categories #get indexes* of categories

df['Genre'].value_counts().plot(kind='bar')
plt.xlabel("Genres");
plt.ylabel('Amount of Games') #TODO: merge same name
plt.title('Total Game Count By Genres')
plt.tight_layout()
plt.savefig('figures/total_games_genre.png',dpi=400)

# fig = plt.figure(figsize=(30,30))
# # Divide the figure into a 2x1 grid, and give me the first section
# ax1 = fig.add_subplot(211)
# ax2 = fig.add_subplot(212)
# df.groupby('Genre')['Critic_Score'].mean().plot(kind='bar',ax=ax1)
a = df.groupby('Genre')['Critic_Score'].mean()
# ax1.set_xlabel("Genres")
# ax1.set_ylabel('Critic Score')
# ax1.set_title('Critic Score by Genres')
#plot by score
#in terms of rating from critic and users there is not that much of a difference
#role playing games has slightly hier ratings from both critics and users;

# df.groupby('Genre')['User_Score'].mean().plot(kind='bar',ax=ax1)
b = df.groupby('Genre')['User_Score'].mean()
b = b.multiply(10)
# ax2.set_xlabel("Genres");
# ax2.set_ylabel('Users Score')
# ax2.set_title('Users Score by Genres')

# sns.catplot(x='class', y='survival rate', hue='sex', data=df, kind='bar')

corr = df.corr()
mask = np.triu(np.ones_like(corr, dtype=np.bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.tight_layout()
plt.savefig('figures/correlations.png',dpi=400)

corr.Global_Sales

Year_of_Release   -0.077001
NA_Sales           0.941999
EU_Sales           0.899154
JP_Sales           0.613277
Other_Sales        0.694608
Global_Sales       1.000000
Critic_Score       0.246017
Critic_Count       0.303113
User_Score         0.093009
User_Count         0.253403
Name: Global_Sales, dtype: float64

ab = pd.concat([a,b],axis=1).sort_values(by=['Critic_Score'],ascending=False)
ax = ab.plot(kind='bar')
ax.get_legend().set_bbox_to_anchor((0.2, -0.25))
ax.set_title('Users and Critic Scores by Genres')
plt.tight_layout()
plt.savefig('figures/users_and_critic_scores_by_genre.png',dpi=400)

# fig = plt.figure(figsize=(30,30))
# fig.suptitle('Sales by Genre', fontsize=40)
# # Divide the figure into a 2x1 grid, and give me the first section
# ax1 = fig.add_subplot(221)
# ax2 = fig.add_subplot(222)
# ax3 = fig.add_subplot(223)
# ax4 = fig.add_subplot(224)

# df.groupby('Genre')['Global_Sales'].mean().plot(kind='bar',ax=ax1)
# ax1.set_xlabel("Genres");
# ax1.set_ylabel('Global Sales')
# ax1.set_title('Global Sales by Genres')
# #plot by sale
# #platform game significantly has a higher sale

# df.groupby('Genre')['NA_Sales'].mean().plot(kind='bar',ax=ax2)
# ax2.set_xlabel("Genres");
# ax2.set_ylabel('North America Sales')
# ax2.set_title('North America Sales by Genres')

# df.groupby(‘Genre’)[‘EU_Sales’].mean().plot(kind=‘bar’,ax=ax3)
# ax3.set_xlabel(“Genres”);
# ax3.set_ylabel(‘EU Sales’)
# ax3.set_title(‘EU sales by genres’)
# #this is not the case for EU, it is shooter with highest sale and Platform games just a bit lower

# df.groupby(‘Genre’)[‘JP_Sales’].mean().plot(kind=‘bar’,ax=ax4)
# ax4.set_xlabel(“Genres”);
# ax4.set_ylabel(‘Japan Sales’)
# ax4.set_title(‘Japan Sales by Genres’)
# #in Japan Role-Playing games significantly has a higher sale

df.groupby(‘Genre’)[‘Global_Sales’,‘NA_Sales’,‘EU_Sales’,‘JP_Sales’].mean().sort_values(by=[‘Global_Sales’],ascending=False).plot(kind=‘bar’,figsize=(10,5))
plt.ylabel(‘Sales’)
plt.title(‘Sales by Genre’)
#plot by sale
#platform game significantly has a higher sale

# df.groupby(‘Genre’)[‘NA_Sales’].mean()

# df.groupby(‘Genre’)[‘EU_Sales’].mean()
# #this is not the case for EU, it is shooter with highest sale and Platform games just a bit lower

# df.groupby(‘Genre’)[‘JP_Sales’].mean()
#in Japan Role-Playing games significantly has a higher sale
plt.tight_layout()
plt.savefig(‘figures/sales_by_genre.png’,dpi=400)

df.groupby(‘Year_of_Release’)[‘Platform’].count().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x2c5b3ed1940>

df_year_platf = df.groupby(‘Year_of_Release’).Platform.value_counts()
number_of_plots = len(df.Year_of_Release.unique())
colormap = plt.cm.gist_ncar # set colormap for next plot

linestyle = ["+–","^-",'-o',':s']*10# line styles, 40 plots so we repeat the list 10 times
colors = [colormap(i) for i in np.linspace(0, 1,number_of_plots)]# colors gradient of colormap
df_year_platf.unstack().plot(kind=‘line’,style=linestyle,figsize=(25,10), stacked=False,color = colors)
plt.title(‘Games released per year’)
plt.xlabel(‘Year’)
plt.ylabel(‘Amount’)
# plt.xticks(range(number_of_plots))
plt.ticks=df.Year_of_Release.unique()
plt.locator_params(nbins=20)
plt.tight_layout()
plt.savefig(‘figures/released_per_years.png’,dpi=400)
plt.show()

a = df[‘Platform’].value_counts()
a.plot(kind=‘bar’)
plt.title(‘Games per platform’)
plt.xlabel(‘Platform’)
plt.ylabel(‘Amount’)
plt.tight_layout()
plt.savefig(‘figures/games_per_platform.png’,dpi=400)

#Get first year and last year per platform
df_y_pl = df.sort_values([‘Year_of_Release’,‘Platform’],ascending=True).groupby(‘Platform’).Year_of_Release
start_pl = df_y_pl.first()
end_pl = df_y_pl.last()
years_platform = end_pl-start_pl
years_platform
years_platform.sort_values(ascending=False).plot.bar()
plt.title(‘Years amount the platform has released games’)
plt.ylabel(‘Years amount’)
plt.tight_layout()
plt.savefig(‘figures/platform_lifespan.png’,dpi=400)

# df_y_pl.first()