Data Madness – Video Games Sales 2019
Video project where we analyze a video games sales dataset, draw some conclusions based on the data and answer the following questions:
- Q: What are the genres that are more likely to get a high rating?
- A: Role Playing for both user ratings and critics
- Q: Does user rating reflect more on the sales or is it the critics rating
- A: Critic score has higher correlation to Global sales than user score
- Q: How does the popularity of platforms change throughout the years
- A: Platforms lifespan used to be shorter and nowadays they can last up to 10 years
Jupyter notebook
The notebook file is available in the github repository, the following is an html view of the file.
<meta name="viewport" content="initial-scale=1">
<script type="text/javascript" src="./Data Madness Jupyter_files/es5-shim.min.js.download"></script>
<script type="text/javascript" src="./Data Madness Jupyter_files/marked.min.js.download"></script>
<script type="text/javascript" src="./Data Madness Jupyter_files/ansi_up.min.js.download"></script>
<script type="text/javascript" src="./Data Madness Jupyter_files/prism.min.js.download"></script>
<script type="text/javascript" src="./Data Madness Jupyter_files/katex.min.js.download"></script>
<script type="text/javascript" src="./Data Madness Jupyter_files/katex-auto-render.min.js.download"></script>
<script type="text/javascript" src="./Data Madness Jupyter_files/notebook.min.js.download"></script>
<link rel="stylesheet" href="./Data Madness Jupyter_files/katex.min.css">
<link rel="stylesheet" href="./Data Madness Jupyter_files/prism.css">
<link rel="stylesheet" href="./Data Madness Jupyter_files/notebook.css">
<link rel="stylesheet" href="./Data Madness Jupyter_files/nbpreview.css">
</head>
<body>
<div id="main">
<div id="notebook-holder"><div class="nb-notebook"><div class="nb-worksheet"><div class="nb-cell nb-markdown-cell"><h1 id="data-madness">Data Madness</h1>
Rodrigo Alejandro Chávez Mulsa & Lillian Alice Wu
Data retrieved from: https://www.kaggle.com/mohalim/video-games-sales
Grading rubric:
- Key idea: Do you have main takeaways?
- Scope: Given the dataset selected, did you choose the appropriate level of complexity to explain everything?
- Process: Did you follow the standard data analysis process (EDA, models, evaluation)?
- Solution: Is the analysis effective and answering the intended questions?
- Presentation: Is your video clear, engaging and effective? Questions: Did you answer them properly?
Use this assignment to submit a link where your video is to be found (e.g. youtube, vimeo, etc.) and submit a 1/2-1 page abstract of your approach. Mention your partner (you do not need to do double submissions).
Questions we answer:
- What are the genres that are more likely to get a high rating?
- Role Playing for both user ratings and critics
- Does user rating reflect more on the sales or is it the critics rating?
- Critic score has higher correlation to Global sales than user score
- How does the popularity of platforms change throughout the years?
- Platforms lifespan use to be shorter and nowadays they can last up to 10 years
EDA
Setup
import IPython.core.display
import matplotlib
def apply_styles():
matplotlib.rcParams['font.size'] = 12
matplotlib.rcParams['figure.figsize'] = (18, 6)
matplotlib.rcParams['lines.linewidth'] = 1
matplotlib.rcParams['axes.grid'] = True
matplotlib.rcParams['grid.color'] = 'k'
matplotlib.rcParams['grid.linestyle'] = ':'
matplotlib.rcParams['grid.linewidth'] = 0.5
apply_styles()
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(style=“ticks”)
import warnings
warnings.filterwarnings(‘ignore’)
plt.style.use(‘fivethirtyeight’)
df = pd.read_csv(“data/vgsales.csv”)
# df.value.astype(int, copy=False);
df.shape
(16719, 16)
df.head(5)
Name | Platform | Year_of_Release | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | Critic_Score | Critic_Count | User_Score | User_Count | Developer | Rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.36 | 28.96 | 3.77 | NaN | 82.53 | 76.0 | 51.0 | 8 | 322.0 | Nintendo | E |
1 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.68 | 12.76 | 3.79 | 3.29 | 35.52 | 82.0 | 73.0 | 8.3 | 709.0 | Nintendo | E |
3 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.61 | 10.93 | 3.28 | NaN | 32.77 | 80.0 | 73.0 | 8 | 192.0 | Nintendo | E |
4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 | NaN | NaN | NaN | NaN | NaN | NaN |
# df.Year_of_Release.value_counts()
# df.shape
# df.loc[0]
Deal with missing values and wrong types
df.Name.isnull()
df = df[df['Name'].notna()]
# df.User_Score.isnull().sum()
#non-numeric value convert to NaN
df['User_Score'] = df['User_Score'].apply(pd.to_numeric, errors='coerce')
# df.dtypes
df[['Publisher', 'Developer','Rating']]=df[['Publisher', 'Developer','Rating']].fillna(value='missing')
df[['Year_of_Release','Critic_Count', 'Critic_Score', 'User_Count']] = df[['Year_of_Release','Critic_Count', 'Critic_Score', 'User_Count']].astype('Int64')
# df.dtypes
cols_float = df.select_dtypes(include=['number']).columns
cols_float =cols_float.delete(0) #delete Release year
# df[cols_float] = df[cols_float].fillna(-1)
df.head(5)
Name | Platform | Year_of_Release | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | Critic_Score | Critic_Count | User_Score | User_Count | Developer | Rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Wii Sports | Wii | 2006 | Sports | Nintendo | 41.36 | 28.96 | 3.77 | NaN | 82.53 | 76 | 51 | 8.0 | 322 | Nintendo | E |
1 | Super Mario Bros. | NES | 1985 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 | NaN | NaN | NaN | NaN | missing | missing |
2 | Mario Kart Wii | Wii | 2008 | Racing | Nintendo | 15.68 | 12.76 | 3.79 | 3.29 | 35.52 | 82 | 73 | 8.3 | 709 | Nintendo | E |
3 | Wii Sports Resort | Wii | 2009 | Sports | Nintendo | 15.61 | 10.93 | 3.28 | NaN | 32.77 | 80 | 73 | 8.0 | 192 | Nintendo | E |
4 | Pokemon Red/Pokemon Blue | GB | 1996 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 | NaN | NaN | NaN | NaN | missing | missing |
#Compute missing global sales based on sales in available regions
for i, r in df.iterrows():
if r.Global_Sales<0:
df.Global_Sales.at[i] = sum(r[c] if (('Sales' in c) and (r[c]>0)) else 0 for c in r.index)
df[df.Global_Sales<0]
df.head(5)
Name | Platform | Year_of_Release | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | Critic_Score | Critic_Count | User_Score | User_Count | Developer | Rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Wii Sports | Wii | 2006 | Sports | Nintendo | 41.36 | 28.96 | 3.77 | NaN | 82.53 | 76 | 51 | 8.0 | 322 | Nintendo | E |
1 | Super Mario Bros. | NES | 1985 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 | NaN | NaN | NaN | NaN | missing | missing |
2 | Mario Kart Wii | Wii | 2008 | Racing | Nintendo | 15.68 | 12.76 | 3.79 | 3.29 | 35.52 | 82 | 73 | 8.3 | 709 | Nintendo | E |
3 | Wii Sports Resort | Wii | 2009 | Sports | Nintendo | 15.61 | 10.93 | 3.28 | NaN | 32.77 | 80 | 73 | 8.0 | 192 | Nintendo | E |
4 | Pokemon Red/Pokemon Blue | GB | 1996 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 | NaN | NaN | NaN | NaN | missing | missing |
# df[df.Platform == 'DS'].Year_of_Release.describe()
# df.Year_of_Release.loc[15959].replace(2004,inplace=True)
df.at[15959,'Year_of_Release']= 2004
df[df.Platform == 'DS'].Year_of_Release[df.Year_of_Release < 2006] = 2006
for i, r in df.iterrows():
if r.Platform == 'DS' and r.Year_of_Release < 2004:
print(df.Year_of_Release.at[i])
print(i)
#Compute missing global sales based on sales in available regions
# for i, r in df.iterrows():
# if r.Global_Sales<0:
# df.Global_Sales.at[i] = sum(r[c] if (('Sales' in c) and (r[c]>0)) else 0 for c in r.index)
# df[df.Global_Sales<0]
Visualize stuff
# df.sort_values(by=['Global_Sales'],ascending=False)
# df.Global_Sales[:100].plot.bar()
df[df['Name'] == 'Monopoly'].head(3) #We see games are repeated since every platform has diff. values
df.Platform.unique()
array(['Wii', 'NES', 'GB', 'DS', 'X360', 'PS3', 'PS2', 'SNES', 'GBA', 'PS4', '3DS', 'N64', 'PS', 'XB', 'PC', '2600', 'PSP', 'XOne', 'WiiU', 'GC', 'GEN', 'DC', 'PSV', 'SAT', 'SCD', 'WS', 'NG', 'TG16', '3DO', 'GG', 'PCFX'], dtype=object)
# df_genre.first()#[df_genre.Name == 'Minecraft']
df_genre = df.groupby(['Genre','Name'])
df['Genre']= df['Genre'].astype('category')
genres = df.Genre.cat.categories #get indexes* of categories
df['Genre'].value_counts().plot(kind='bar')
plt.xlabel("Genres");
plt.ylabel('Amount of Games') #TODO: merge same name
plt.title('Total Game Count By Genres')
plt.tight_layout()
plt.savefig('figures/total_games_genre.png',dpi=400)
# fig = plt.figure(figsize=(30,30))
# # Divide the figure into a 2x1 grid, and give me the first section
# ax1 = fig.add_subplot(211)
# ax2 = fig.add_subplot(212)
# df.groupby('Genre')['Critic_Score'].mean().plot(kind='bar',ax=ax1)
a = df.groupby('Genre')['Critic_Score'].mean()
# ax1.set_xlabel("Genres")
# ax1.set_ylabel('Critic Score')
# ax1.set_title('Critic Score by Genres')
#plot by score
#in terms of rating from critic and users there is not that much of a difference
#role playing games has slightly hier ratings from both critics and users;
# df.groupby('Genre')['User_Score'].mean().plot(kind='bar',ax=ax1)
b = df.groupby('Genre')['User_Score'].mean()
b = b.multiply(10)
# ax2.set_xlabel("Genres");
# ax2.set_ylabel('Users Score')
# ax2.set_title('Users Score by Genres')
# sns.catplot(x='class', y='survival rate', hue='sex', data=df, kind='bar')
corr = df.corr()
mask = np.triu(np.ones_like(corr, dtype=np.bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.tight_layout()
plt.savefig('figures/correlations.png',dpi=400)
corr.Global_Sales
Year_of_Release -0.077001 NA_Sales 0.941999 EU_Sales 0.899154 JP_Sales 0.613277 Other_Sales 0.694608 Global_Sales 1.000000 Critic_Score 0.246017 Critic_Count 0.303113 User_Score 0.093009 User_Count 0.253403 Name: Global_Sales, dtype: float64
ab = pd.concat([a,b],axis=1).sort_values(by=['Critic_Score'],ascending=False)
ax = ab.plot(kind='bar')
ax.get_legend().set_bbox_to_anchor((0.2, -0.25))
ax.set_title('Users and Critic Scores by Genres')
plt.tight_layout()
plt.savefig('figures/users_and_critic_scores_by_genre.png',dpi=400)
# fig = plt.figure(figsize=(30,30))
# fig.suptitle('Sales by Genre', fontsize=40)
# # Divide the figure into a 2x1 grid, and give me the first section
# ax1 = fig.add_subplot(221)
# ax2 = fig.add_subplot(222)
# ax3 = fig.add_subplot(223)
# ax4 = fig.add_subplot(224)
# df.groupby('Genre')['Global_Sales'].mean().plot(kind='bar',ax=ax1)
# ax1.set_xlabel("Genres");
# ax1.set_ylabel('Global Sales')
# ax1.set_title('Global Sales by Genres')
# #plot by sale
# #platform game significantly has a higher sale
# df.groupby('Genre')['NA_Sales'].mean().plot(kind='bar',ax=ax2)
# ax2.set_xlabel("Genres");
# ax2.set_ylabel('North America Sales')
# ax2.set_title('North America Sales by Genres')
# df.groupby(‘Genre’)[‘EU_Sales’].mean().plot(kind=‘bar’,ax=ax3)
# ax3.set_xlabel(“Genres”);
# ax3.set_ylabel(‘EU Sales’)
# ax3.set_title(‘EU sales by genres’)
# #this is not the case for EU, it is shooter with highest sale and Platform games just a bit lower
# df.groupby(‘Genre’)[‘JP_Sales’].mean().plot(kind=‘bar’,ax=ax4)
# ax4.set_xlabel(“Genres”);
# ax4.set_ylabel(‘Japan Sales’)
# ax4.set_title(‘Japan Sales by Genres’)
# #in Japan Role-Playing games significantly has a higher sale
df.groupby(‘Genre’)[‘Global_Sales’,‘NA_Sales’,‘EU_Sales’,‘JP_Sales’].mean().sort_values(by=[‘Global_Sales’],ascending=False).plot(kind=‘bar’,figsize=(10,5))
plt.ylabel(‘Sales’)
plt.title(‘Sales by Genre’)
#plot by sale
#platform game significantly has a higher sale
# df.groupby(‘Genre’)[‘NA_Sales’].mean()
# df.groupby(‘Genre’)[‘EU_Sales’].mean()
# #this is not the case for EU, it is shooter with highest sale and Platform games just a bit lower
# df.groupby(‘Genre’)[‘JP_Sales’].mean()
#in Japan Role-Playing games significantly has a higher sale
plt.tight_layout()
plt.savefig(‘figures/sales_by_genre.png’,dpi=400)
df.groupby(‘Year_of_Release’)[‘Platform’].count().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x2c5b3ed1940>
df_year_platf = df.groupby(‘Year_of_Release’).Platform.value_counts()
number_of_plots = len(df.Year_of_Release.unique())
colormap = plt.cm.gist_ncar # set colormap for next plot
linestyle = ["+–","^-",'-o',':s']*10# line styles, 40 plots so we repeat the list 10 times
colors = [colormap(i) for i in np.linspace(0, 1,number_of_plots)]# colors gradient of colormap
df_year_platf.unstack().plot(kind=‘line’,style=linestyle,figsize=(25,10), stacked=False,color = colors)
plt.title(‘Games released per year’)
plt.xlabel(‘Year’)
plt.ylabel(‘Amount’)
# plt.xticks(range(number_of_plots))
plt.ticks=df.Year_of_Release.unique()
plt.locator_params(nbins=20)
plt.tight_layout()
plt.savefig(‘figures/released_per_years.png’,dpi=400)
plt.show()
a = df[‘Platform’].value_counts()
a.plot(kind=‘bar’)
plt.title(‘Games per platform’)
plt.xlabel(‘Platform’)
plt.ylabel(‘Amount’)
plt.tight_layout()
plt.savefig(‘figures/games_per_platform.png’,dpi=400)
#Get first year and last year per platform
df_y_pl = df.sort_values([‘Year_of_Release’,‘Platform’],ascending=True).groupby(‘Platform’).Year_of_Release
start_pl = df_y_pl.first()
end_pl = df_y_pl.last()
years_platform = end_pl-start_pl
years_platform
years_platform.sort_values(ascending=False).plot.bar()
plt.title(‘Years amount the platform has released games’)
plt.ylabel(‘Years amount’)
plt.tight_layout()
plt.savefig(‘figures/platform_lifespan.png’,dpi=400)
# df_y_pl.first()