This is a quick start guide to using what the bandori-2019-stats project has to offer, adapted from the project's quick-start Jupyter notebook. As such, this guide will feature code, its output, and explanations about that output. You are free to follow along in your own environment (although, if you plan to do that, you might as well use the notebook version). Basic command line and Python knowledge (classes, libraries, etc) is assumed.
First, we'll go over the must-knows of the pandas
library, and how it's used by the project. Then, we'll go over the main classes of the project, what public methods they offer, and some examples (these examples are intended to have little overlap, cover a large set of use cases, and prevent potential confusion/error in interpreting the results of the methods). Finally, we'll briefly discuss how to get started with analyzing things outside of what the public methods offer.
Table of Contents:
Various libraries are required for the project: install them by running pip install -r requirements.txt
in the root folder of the project.
pandas
is a library that let's us manipulate and analyze data. The library provides DataFrame
s, which are structurally similar to spreadsheets or tables: they have rows and columns, and cells at the intersections of these rows and columns. A preview of a DataFrame
can be accessed by calling the DataFrame.head
method (which will be used throughout this guide to avoid printing the entirety of large DataFrame
s). You can also slice DataFrame
s: the most common operation is to get all values of a column as a Series
, like so:
import pandas as pd
# first, make a DataFrame with two rows + four columns named A,B,C,D
demo_df = pd.DataFrame([[1,2,3,4], [5,6,7,8]], columns=list('ABCD'))
print(demo_df)
demo_df["C"]
demo_df["C"]
only has values from the "C" column of demo_df
. The numbers on the left are indexes.
DataFrame
s are used to represent the survey data, found at data/responses.tsv
. Every line in this file is a person's responses, and every column (which are tab-separated) is a survey question. The intersect of column and line is a person's response to a question, which may have one answer or multiple comma-separated answers (note the distinction between response and answer: it's important). The main DataFrame
s used will have the same structure: each row is a person, and each column is a question.
All classes and most of their methods have inline documentation explaining behavior, arguments, return types, etc. It is highly recommended to check them out, especially if you plan to use some features/options that won't be discussed.
There are three main classes available for analyzing the survey data: PandasPlotter
, HeatMapPlotter
, and AssociationMiner
. Each of these create a DataFrame
for internal use on initialization, and are initialized with a path to the survey data file (i.e. data/responses.tsv
, if it hasn't been moved) in order to do this. They also have an optional initializing argument export_to_csv
, which when set to True will cause them to save whatever data they create as .csv
files in the working directory. Examples can be found in output/
.
There is also AssociationMetricPlotter
(found in plotters.py
), which can be used in conjunction with AssociationMiner
. It won't be discussed, due to its straightforwardness and niche use.
To use any of the classes, call one of its public methods. Following Python convention, public methods are any method that don't have a name starting with an underscore.
Let's import the three main classes, then we'll explore each class in detail:
from snsplotters import HeatMapPlotter
from plotters import PandasPlotter
from miner import AssociationMiner
We'll look at HeatMapPlotter
first, because it is the simplest. HeatMapPlotter
is for making heat maps of people based on how they responded to two questions. These questions must be single-answer (e.g. gender, age). If they aren't actually single-answer, they are assumed to be. There are four public methods:
HeatMapPlotter.draw_gender_vs_region
HeatMapPlotter.draw_age_vs_gender
HeatMapPlotter.draw_age_vs_region
HeatMapPlotter.draw
The first three are ready-to-use: they do exactly what their names indicate, and we can just call them. The last is a general method that we can use to plot something specific to our own liking.
To plot gender against region, do the following:
hm_plotter = HeatMapPlotter("data/responses.tsv")
hm_plotter.draw_gender_vs_region()
Two things to note: first, each cell has a percentage, and that percentage is the portion of people of a gender in each region. The horizontal lines (hopefully) help you infer this. Second, the x-axis label has a .1
appended to it, because the question "What is your gender?" is actually asked twice by the survey. pandas
appends numbers to column names if they are repeated (to keep each column name unique), and the method uses the second occurrence of the question to generate results. The second one is used because (based on the survey structure) it is the version of the question that every respondent answers.
Now that's cool and all, but we probably want to do something beside comparing gender/age/region amongst each other. To do that, there's HeatMapPlotter.draw
. We'll also need to use the project constants and DataCleaner
. The constants define some survey question strings (i.e. DataFrame
column names) for ease of use, and are ALL_CAPS following Python convention, while DataCleaner
removes DataFrame
rows with bad or irrelevant responses.
Let's import them:
from constants import *
from helpers import DataCleaner
HeatMapPlotter.draw
requires the names of two columns/questions. There are a few optional arguments as well; the most important is df
, which lets us specify a DataFrame
to use. Although the method will by default use the initial DataFrame
of the class (which can be accessed via HeatMapPlotter.df
), this original has values from the .tsv
file that we probably don't want to look at, like NaN
or Prefer not to say
.
Let's say we want to plot gender against whether the respondent plays idol games or not. First, we have to clean the original DataFrame
on the gender column, using one of DataCleaner
s method:
df = DataCleaner.filter_gender(hm_plotter.df)
This method returns a DataFrame
with rows that have NaN
or Prefer not to say
in the gender column removed. Next, we can go ahead and use this returned DataFrame
to make the heat map. The two columns to use are GENDER
and OTHER_GAMES_IDOL
, as defined in constants
:
hm_plotter.draw(GENDER, OTHER_GAMES_IDOL, df=df)
We can normalize along the x-axis too, like the first heat map:
hm_plotter.draw(GENDER, OTHER_GAMES_IDOL, df=df, normalize="x")
For full details about this option (and other options), see the inline code documentation.
PandasPlotter
is for making typical frequency graphs of people based on how they responded to two questions. These questions may be multi-answer. This class has the following public methods:
PandasPlotter.plot_music_band_by_age
PandasPlotter.plot_chara_band_by_age
PandasPlotter.plot_music_band_by_region
PandasPlotter.plot_chara_band_by_region
PandasPlotter.plot_music_band_by_gender
PandasPlotter.plot_chara_band_by_gender
PandasPlotter.plot_play_style_by_age
PandasPlotter.plot_play_style_by_region
PandasPlotter.plot_play_style_by_gender
PandasPlotter.plot_participation_by_age
PandasPlotter.plot_participation_by_region
PandasPlotter.plot_participation_by_gender
All are ready-to-use; there is no general method for this class.
Say we want to make a bar graph for favorite band music-wise against age of the respondent. Do so like this:
pd_plotter = PandasPlotter("data/responses.tsv")
pd_plotter.plot_music_band_by_age()
The numbers above each bar are the raw number of respondents that favorited each band in each corresponding age group.
For the PandasPlotter
public methods, if we want to change the look of the graph, we can specify a display
in the form of a PandasPlotDisplay
object. Let's say we want a line graph instead of the default bar graph, a different color scheme, and a new y-axis string.
PandasPlotDisplay
's constructor has four mandatory arguments, and a handful of optional ones. The mandatory ones are (in order) the type of graph, the title, the x-axis label, and the y-axis label. The only optional argument we care about for this example is colormap
, which defines the color scheme. Valid values are found here. We'll go with spring
.
Let's make the display and use it to represent the same data as before:
from plotters import PandasPlotDisplay
display_obj = PandasPlotDisplay(
"line", "Favorite Bands (Music) By Age Group", "Band", "Percentage", colormap="spring"
)
pd_plotter.plot_music_band_by_age(display=display_obj)
This isn't quite correct. We specified "band" as the x-axis (like the default), but age groups are on the x-axis ticks, while the bands are the actual lines. This is because the default display object used by PandasPlotter
's methods transposes the DataFrame
before drawing it on the graph; transposing swaps the x-axis and the hue (i.e. lines).
To make this graph valid, we can either change the x-axis name, or set transpose
to True inside display_obj
. It's more intuitive to have the bands as the actual lines (as opposed to having ages as lines), so let's do the former:
display_obj.x_label = "Age Group"
pd_plotter.plot_music_band_by_age(display=display_obj)
The raw counts that existed in the initial bar graph are unavailable in line graphs, unfortunately.
For most PandasPlotter
public methods, all we can customize is the display. Ones involved in plotting regions allow us to show all regions or only show the five most common (showing all is default).
See the default graph:
pd_plotter.plot_chara_band_by_region()
vs the minimized graph:
pd_plotter.plot_chara_band_by_region(show_all=False)
AssociationMiner
looks for associations between answers across any number of questions, and generates association rules from what it finds, which have predictive power. To read more about association rules, see Wikipedia for an overview and mlxtend's tutorial for more technical stuff.
AssociationMiner
is different from the other two classes in that it doesn't make a graph, but instead returns Rules
that represent association rules. It works with questions that are single- or multi-answer. These are the public methods:
AssociationMiner.mine_favorite_characters
AssociationMiner.mine_favorite_band_members
AssociationMiner.mine_favorite_character_reasons
(has optional arguments)AssociationMiner.mine_age_favorite_characters
AssociationMiner.mine_gender_favorite_characters
AssociationMiner.mine_region_favorite_characters
AssociationMiner.mine_age_favorite_band_chara
AssociationMiner.mine_gender_favorite_band_chara
AssociationMiner.mine_region_favorite_band_chara
AssociationMiner.mine_region_favorite_seiyuu
AssociationMiner.mine
The last method is the general one, while the rest are the ready-to-use ones.
The following finds association rules for overall favorite characters:
miner = AssociationMiner("data/responses.tsv")
rules = miner.mine_favorite_characters()
rules.table.head()
Rules.table
is a DataFrame
. The column titles are association rule jargon, so it's best to read the pages linked at the top of this section if you want to know what's going on.
If you want the crash-course version, basically: a rule consists of a predictor set and a predicted set. The predictor is made of antecedents, and the predicted is made of consequents. Support in general is the probability of occurrence (e.g. antecedent support is the probability of the antecedents occurring together), confidence is the conditional probability of the consequents occurring given the antecedents, and lift is confidence divided by the consequent support.
Here, the zeroth entry tells us that 11% of all people picked both Sayo and Lisa as favorites (support = 0.11), picking Sayo meant a 35% chance of picking Lisa as well (confidence = 0.35), and this 35% chance is 1.17 times the average chance of picking Lisa (lift = 1.17).
Rules
have two properties: table
(as seen above) and table_organized
. The former is the original DataFrame
created from the association rule mining, while the later is a filtered/sorted version.
rules.table_organized.head()
Here we can see the rules sorted by lift.
The public methods generally (when making table_organized
), remove rules with more than one antecedent and sort by lift. Furthermore, table
(and table_organized
too as a result) only have rules with support > 0.01 and confidence > 0.3 by default.
Usually, a non-organized version of the table won't be available, due to the method specifically requiring organization. AssociationMiner.mine_favorite_characters
and AssociationMiner.mine_favorite_band_members
are the only ready-to-use methods that return Rules
with both original and organized tables; others will have table_organized
set to None
.
rules1 = miner.mine_region_favorite_characters()
rules1.table.head()
rules1.table_organized is None
For AssociationMiner.mine_favorite_character_reasons
specifically, we can also specify the antecedent as "reason" or "character", since both are common and may be of interest:
rules2 = miner.mine_favorite_character_reasons(antecedent="character")
rules2.table.head()
rules3 = miner.mine_favorite_character_reasons(antecedent="reason")
rules3.table.head()
That's it for the ready-to-use methods. Now let's try using AssociationMiner.mine
, which can be quite powerful and lets us investigate associations between any survey questions.
AssociationMiner.mine
has five arguments: three are optional and change the association rule filtering behavior previously mentioned. The other two are columns
and column_values
. These are parallel lists that tell AssociationMiner
what columns to mine and what values to mine for.
Say we want to mine overall favorite characters and whether the respondent plays on the Japanese (JP) server. These two columns are already defined in constants
(which we previously imported) as CHARACTERS
and JP_SERVER
. The possible values of these columns are also already defined, as ALL_CHARACTERS
and YES_NO
, respectively.
To mine, just do this:
rules_c_jp = miner.mine(
[CHARACTERS, JP_SERVER],
[ALL_CHARACTERS, YES_NO]
)
rules_c_jp.table_organized.head()
Seems like there are only character names here, so where are the responses to the JP server question? Turns out that the association among favorite characters is stronger than between favorite characters and playing on JP, so what we're interested in doesn't show up at the top of the sorted table.
In this case, we can confirm that the miner actually mined for JP_SERVER
by checking Rules.table
:
rules_c_jp.table.head()
There are some yeses and noes there, so we definitely mined on them. To find the results we're interested in, we can use Rules.search
. This method (by default) searches Rules.table_organized
for any one of a list of strings provided as the one_of
argument inside either the antecedents or consequents, and returns the results as a DataFrame
. Let's search for a yes or no inside the consequents only:
rules_c_jp.search(
one_of=YES_NO,
location="consequents"
).head()
Now, let's try something slightly more complicated: let's mine favorite Poppin'Party character, favorite Afterglow character, and favorite Pastel*Palettes character. Column constants for these questions are defined in constants
(CHARACTER_POPIPA
, CHARACTER_AFTERGLOW
, CHARACTER_PASUPARE
), but constants for possible answers are not. To get all possible answers, we can use the helper class ResponseParser
like so:
from helpers import ResponseParser
df = miner.df
afterglow_members = ResponseParser.unique_answers(df, CHARACTER_AFTERGLOW)
popipa_members = ResponseParser.unique_answers(df, CHARACTER_POPIPA)
pasupare_members = ResponseParser.unique_answers(df, CHARACTER_PASUPARE)
rules_trio = miner.mine(
[CHARACTER_POPIPA, CHARACTER_AFTERGLOW, CHARACTER_PASUPARE],
[popipa_members, afterglow_members, pasupare_members]
)
rules_trio.table_organized.head()
ResponseParser.unique_answers
requires the DataFrame you want to look in, and the actual column name.
One last example: let's mine region and preferred play style. Almost the same as before, but one point to note: both the region and play style questions have "Other" as a valid answer. These two answers would be considered the same by the miner, which would make the results misleading, so we should remove one of them. Let's do that, and then mine:
df = miner.df
regions = ResponseParser.unique_answers(df, REGION)
play_styles = ResponseParser.unique_answers(df, PLAY_STYLE)
play_styles.remove("Other")
rules_region_style = miner.mine(
[REGION, PLAY_STYLE],
[regions, play_styles]
)
rules_region_style.table_organized.head()
The most common "extra" thing you'll probably want to do is to include more columns in the DataFrame
s. You can do so by making new column constants in constants.py
and adding them to DataCleaner.prepare_data_frame
. The constants should be set to the question string found in the original survey .tsv
file.
You might also want to create a custom plot with PandasPlotter
. The way the public methods manipulate the DataFrame
under the hood in order to plot is very systematic, and they all look very similar to each other, so you should be able to look at the code and imitate it (it mostly involves using PandasPlotter._group_counts_for_answer
and PandasPlotter._plot_group_counts_for_answer
). HeatMapPlotter
's public methods can also be copied systematically, if you want to plot something other than single-answer responses.
In conjunction with adding more columns, you might also want to add a list of all valid answers for those columns to constants.py
, particularly if answers can have commas inside of them (which prevent usage of ResponseParser.unique_answers
, as this method splits responses by comma in order to get unique answers). Adding a new constant works fine in most cases, except when an answer is a substring of another answer for the same question; this is because PandasPlotter
and AssociationMiner
determine if a response has an answer by checking for substring membership. So if you had a constant L = ["R", "R.I.O.T"]
as all valid answers, all responses the answer "R.I.O.T" would be considered responses with the answer "R" as well (since "R" is inside "R.I.O.T").
Getting around this issue is tricky: one way is to first remove commas from the answers of all responses in data/responses.tsv
, then modify or create a version of PandasPlotter._group_counts_for_answer
or AssociationMiner._reduce
(depending on what you want to do) that splits on commas instead of checking for substring membership, then use ResponseParser.unique_answers
or define a valid answer list constant.
If you're wondering why PandasPlotter
and AssociationMiner
use substring membership in the first place, it's to avoid the issue with answers having commas in them.