Overview

This is a quick start guide to using what the bandori-2019-stats project has to offer, adapted from the project's quick-start Jupyter notebook. As such, this guide will feature code, its output, and explanations of that output. You are free to follow along in your own environment (although, if you plan to do that, you might as well use the notebook version). Basic command line and Python knowledge (classes, libraries, etc.) is assumed.

First, we'll go over the must-knows of the pandas library, and how it's used by the project. Then, we'll go over the main classes of the project, what public methods they offer, and some examples (these examples are intended to have little overlap, cover a large set of use cases, and prevent potential confusion/error in interpreting the results of the methods). Finally, we'll briefly discuss how to get started with analyzing things outside of what the public methods offer.

Various libraries are required for the project: install them by running pip install -r requirements.txt in the root folder of the project.

Pandas Must-Knows

pandas is a library that lets us manipulate and analyze data. The library provides DataFrames, which are structurally similar to spreadsheets or tables: they have rows and columns, and cells at the intersections of these rows and columns. A preview of a DataFrame can be accessed by calling the DataFrame.head method (which will be used throughout this guide to avoid printing the entirety of large DataFrames). You can also slice DataFrames: the most common operation is to get all values of a column as a Series, like so:

In [3]:
import pandas as pd

# first, make a DataFrame with two rows + four columns named A,B,C,D
demo_df = pd.DataFrame([[1,2,3,4], [5,6,7,8]], columns=list('ABCD'))
print(demo_df)

demo_df["C"]
   A  B  C  D
0  1  2  3  4
1  5  6  7  8
Out[3]:
0    3
1    7
Name: C, dtype: int64

demo_df["C"] only has values from the "C" column of demo_df. The numbers on the left are indexes.

DataFrames are used to represent the survey data, found at data/responses.tsv. Every line in this file is one person's responses, and every tab-separated column is a survey question. The intersection of a line and a column is a person's response to a question, which may have one answer or multiple comma-separated answers (note the distinction between response and answer: it's important). The main DataFrames used will have the same structure: each row is a person, and each column is a question.
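
If you want to peek at the raw data yourself, it loads like any other tab-separated file. Here's a minimal sketch (the example answer in the comment is illustrative):

import pandas as pd

responses = pd.read_csv("data/responses.tsv", sep="\t")
responses.head()

# a single cell holds one response, which may contain several comma-separated
# answers: e.g. "Toyama Kasumi, Imai Lisa" is one response with two answers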

Classes and Examples

All classes and most of their methods have inline documentation explaining behavior, arguments, return types, etc. It is highly recommended to check them out, especially if you plan to use some features/options that won't be discussed.

There are three main classes available for analyzing the survey data: PandasPlotter, HeatMapPlotter, and AssociationMiner. Each of these creates a DataFrame for internal use on initialization, and is initialized with a path to the survey data file (i.e. data/responses.tsv, if it hasn't been moved) in order to do this. They also have an optional initialization argument export_to_csv, which, when set to True, will cause them to save whatever data they create as .csv files in the working directory. Examples can be found in output/.

There is also AssociationMetricPlotter (found in plotters.py), which can be used in conjunction with AssociationMiner. It won't be discussed, due to its straightforwardness and niche use.

To use any of the classes, call one of its public methods. Following Python convention, public methods are any methods whose names don't start with an underscore.

Let's import the three main classes, then we'll explore each class in detail:

In [4]:
from snsplotters import HeatMapPlotter
from plotters import PandasPlotter
from miner import AssociationMiner
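
While we're at it, here's what the export_to_csv option mentioned earlier looks like in use (a sketch; it just passes the optional initialization argument):

# this plotter will also save whatever data it creates as .csv files
# in the working directory
exporting_plotter = HeatMapPlotter("data/responses.tsv", export_to_csv=True)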

HeatMapPlotter

We'll look at HeatMapPlotter first, because it is the simplest. HeatMapPlotter is for making heat maps of people based on how they responded to two questions. These questions must be single-answer (e.g. gender, age); if a question isn't actually single-answer, it will be treated as if it were. There are four public methods:

  • HeatMapPlotter.draw_gender_vs_region
  • HeatMapPlotter.draw_age_vs_gender
  • HeatMapPlotter.draw_age_vs_region
  • HeatMapPlotter.draw

The first three are ready-to-use: they do exactly what their names indicate, and we can just call them. The last is a general method that we can use to plot something specific to our own liking.

To plot gender against region, do the following:

In [5]:
hm_plotter = HeatMapPlotter("data/responses.tsv")
hm_plotter.draw_gender_vs_region()

Two things to note: first, each cell has a percentage, and that percentage is the portion of people of a gender in each region. The horizontal lines (hopefully) help you infer this. Second, the x-axis label has a .1 appended to it, because the question "What is your gender?" is actually asked twice by the survey. pandas appends numbers to column names if they are repeated (to keep each column name unique), and the method uses the second occurrence of the question to generate results. The second one is used because (based on the survey structure) it is the version of the question that every respondent answers.
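
You can see pandas' renaming behavior with a toy example (illustrative data, not the real survey file):

import io
import pandas as pd

# two columns share a name in the raw file...
raw = "What is your gender?\tWhat is your gender?\nMale\tMale\n"
dup_df = pd.read_csv(io.StringIO(raw), sep="\t")

# ...but are unique after loading: the second occurrence gets ".1" appended
print(list(dup_df.columns))
# ['What is your gender?', 'What is your gender?.1']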

Advanced Heat Map Plotting

Now that's cool and all, but we probably want to do something besides comparing gender/age/region amongst each other. To do that, there's HeatMapPlotter.draw. We'll also need to use the project constants and DataCleaner. The constants define some survey question strings (i.e. DataFrame column names) for ease of use, and are ALL_CAPS following Python convention, while DataCleaner removes DataFrame rows with bad or irrelevant responses.

Let's import them:

In [6]:
from constants import *
from helpers import DataCleaner

HeatMapPlotter.draw requires the names of two columns/questions. There are a few optional arguments as well; the most important is df, which lets us specify a DataFrame to use. Although the method will by default use the initial DataFrame of the class (which can be accessed via HeatMapPlotter.df), this original has values from the .tsv file that we probably don't want to look at, like NaN or Prefer not to say.

Let's say we want to plot gender against whether the respondent plays idol games or not. First, we have to clean the original DataFrame on the gender column, using one of DataCleaner's methods:

In [7]:
df = DataCleaner.filter_gender(hm_plotter.df)

This method returns a DataFrame with rows that have NaN or Prefer not to say in the gender column removed. Next, we can go ahead and use this returned DataFrame to make the heat map. The two columns to use are GENDER and OTHER_GAMES_IDOL, as defined in constants:

In [8]:
hm_plotter.draw(GENDER, OTHER_GAMES_IDOL, df=df)

We can normalize along the x-axis too, like the first heat map:

In [9]:
hm_plotter.draw(GENDER, OTHER_GAMES_IDOL, df=df, normalize="x")

For full details about this option (and other options), see the inline code documentation.
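
As an aside, this kind of cleaning is ordinary pandas row filtering. A rough sketch of what filter_gender does (not the project's actual implementation; it assumes GENDER names the relevant gender column) might look like:

# keep only rows whose gender column is present and meaningful
sketch_df = hm_plotter.df[
    hm_plotter.df[GENDER].notna()
    & (hm_plotter.df[GENDER] != "Prefer not to say")
]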

PandasPlotter

PandasPlotter is for making typical frequency graphs of people based on how they responded to two questions. These questions may be multi-answer. This class has the following public methods:

  • PandasPlotter.plot_music_band_by_age
  • PandasPlotter.plot_chara_band_by_age
  • PandasPlotter.plot_music_band_by_region
  • PandasPlotter.plot_chara_band_by_region
  • PandasPlotter.plot_music_band_by_gender
  • PandasPlotter.plot_chara_band_by_gender
  • PandasPlotter.plot_play_style_by_age
  • PandasPlotter.plot_play_style_by_region
  • PandasPlotter.plot_play_style_by_gender
  • PandasPlotter.plot_participation_by_age
  • PandasPlotter.plot_participation_by_region
  • PandasPlotter.plot_participation_by_gender

All are ready-to-use; there is no general method for this class.

Say we want to make a bar graph of favorite band (music-wise) against respondent age. Do so like this:

In [10]:
pd_plotter = PandasPlotter("data/responses.tsv")
pd_plotter.plot_music_band_by_age()

The numbers above each bar are the raw number of respondents that favorited each band in each corresponding age group.

Customizing PandasPlotter Display

For the PandasPlotter public methods, if we want to change the look of the graph, we can specify a display in the form of a PandasPlotDisplay object. Let's say we want a line graph instead of the default bar graph, a different color scheme, and a new y-axis string.

PandasPlotDisplay's constructor has four mandatory arguments, and a handful of optional ones. The mandatory ones are (in order) the type of graph, the title, the x-axis label, and the y-axis label. The only optional argument we care about for this example is colormap, which defines the color scheme; valid values are matplotlib colormap names. We'll go with spring.
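
If you want to see every available name, matplotlib can list the registered colormaps:

import matplotlib.pyplot as plt

print(plt.colormaps())  # includes "spring", among many others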

Let's make the display and use it to represent the same data as before:

In [11]:
from plotters import PandasPlotDisplay

display_obj = PandasPlotDisplay(
    "line", "Favorite Bands (Music) By Age Group", "Band", "Percentage", colormap="spring"
)

pd_plotter.plot_music_band_by_age(display=display_obj)

This isn't quite correct. We specified "Band" as the x-axis label (like the default), but age groups are on the x-axis ticks, while the bands are the actual lines. This is because the default display object used by PandasPlotter's methods transposes the DataFrame before drawing it on the graph (transposing swaps the x-axis and the hue, i.e. the lines), while our custom display_obj does not.

To make this graph valid, we can either change the x-axis name, or set transpose to True inside display_obj. It's more intuitive to have the bands as the actual lines (as opposed to having ages as lines), so let's do the former:

In [12]:
display_obj.x_label = "Age Group"
pd_plotter.plot_music_band_by_age(display=display_obj)

The raw counts that existed in the initial bar graph are unavailable in line graphs, unfortunately.
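
For reference, the other fix would keep "Band" on the x-axis by transposing (a sketch; it assumes transpose is a settable attribute, like x_label above):

display_obj.x_label = "Band"
display_obj.transpose = True
pd_plotter.plot_music_band_by_age(display=display_obj)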

Regions in PandasPlotter

For most PandasPlotter public methods, all we can customize is the display. The ones that plot regions also let us either show all regions or only the five most common (showing all is the default).

See the default graph:

In [13]:
pd_plotter.plot_chara_band_by_region()

vs the minimized graph:

In [14]:
pd_plotter.plot_chara_band_by_region(show_all=False)

AssociationMiner

AssociationMiner looks for associations between answers across any number of questions, and generates association rules from what it finds, which have predictive power. To read more about association rules, see Wikipedia for an overview and mlxtend's tutorial for more technical stuff.

AssociationMiner is different from the other two classes in that it doesn't make a graph, but instead returns Rules that represent association rules. It works with questions that are single- or multi-answer. These are the public methods:

  • AssociationMiner.mine_favorite_characters
  • AssociationMiner.mine_favorite_band_members
  • AssociationMiner.mine_favorite_character_reasons (has optional arguments)
  • AssociationMiner.mine_age_favorite_characters
  • AssociationMiner.mine_gender_favorite_characters
  • AssociationMiner.mine_region_favorite_characters
  • AssociationMiner.mine_age_favorite_band_chara
  • AssociationMiner.mine_gender_favorite_band_chara
  • AssociationMiner.mine_region_favorite_band_chara
  • AssociationMiner.mine_region_favorite_seiyuu
  • AssociationMiner.mine

The last method is the general one, while the rest are the ready-to-use ones.

The following finds association rules for overall favorite characters:

In [15]:
miner = AssociationMiner("data/responses.tsv")
rules = miner.mine_favorite_characters()
rules.table.head()
Out[15]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
0 (Hikawa Sayo) (Imai Lisa) 0.329692 0.295357 0.113929 0.345562 1.169981 0.016552 1.076715
1 (Imai Lisa) (Hikawa Sayo) 0.295357 0.329692 0.113929 0.385733 1.169981 0.016552 1.091233
2 (Hikawa Sayo) (Minato Yukina) 0.329692 0.243855 0.112758 0.342012 1.402522 0.032362 1.149177
3 (Minato Yukina) (Hikawa Sayo) 0.243855 0.329692 0.112758 0.462400 1.402522 0.032362 1.246853
4 (Minato Yukina) (Imai Lisa) 0.243855 0.295357 0.104565 0.428800 1.451802 0.032541 1.233619

Rules.table is a DataFrame. The column titles are association rule jargon, so it's best to read the pages linked at the top of this section if you want to know what's going on.

If you want the crash-course version, basically: a rule consists of a predictor set and a predicted set. The predictor is made of antecedents, and the predicted is made of consequents. Support in general is the probability of occurrence (e.g. antecedent support is the probability of the antecedents occurring together), confidence is the conditional probability of the consequents occurring given the antecedents, and lift is confidence divided by the consequent support.

Here, the zeroth entry tells us that 11% of all people picked both Sayo and Lisa as favorites (support = 0.11), picking Sayo meant a 35% chance of picking Lisa as well (confidence = 0.35), and this 35% chance is 1.17 times the average chance of picking Lisa (lift = 1.17).
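
You can verify these relationships directly from the zeroth row of the table above:

# metrics from the zeroth rule: (Hikawa Sayo) -> (Imai Lisa)
antecedent_support = 0.329692  # P(picked Sayo)
consequent_support = 0.295357  # P(picked Lisa)
support = 0.113929             # P(picked both)

confidence = support / antecedent_support  # ~0.3456, as in the table
lift = confidence / consequent_support     # ~1.1700, as in the table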

Rules have two properties: table (as seen above) and table_organized. The former is the original DataFrame created from the association rule mining, while the latter is a filtered/sorted version.

In [16]:
rules.table_organized.head()
Out[16]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction antecedent_len consequent_len rule_len
37 (Udagawa Tomoe) (Seta Kaoru) 0.090909 0.162310 0.032774 0.360515 2.221154 0.018019 1.309945 1 1 2
54 (Kitazawa Hagumi) (Seta Kaoru) 0.071401 0.162310 0.021459 0.300546 1.851684 0.009870 1.197635 1 1 2
12 (Mitake Ran) (Aoba Moca) 0.170113 0.265704 0.075693 0.444954 1.674622 0.030493 1.322946 1 1 2
24 (Toyama Kasumi) (Ichigaya Arisa) 0.143192 0.201717 0.047991 0.335150 1.661488 0.019106 1.200697 1 1 2
4 (Minato Yukina) (Imai Lisa) 0.243855 0.295357 0.104565 0.428800 1.451802 0.032541 1.233619 1 1 2

Here we can see the rules sorted by lift.

When making table_organized, the public methods generally remove rules with more than one antecedent and sort by lift. Furthermore, table (and, as a result, table_organized too) only has rules with support > 0.01 and confidence > 0.3 by default.
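
If you ever want to reproduce that organization by hand, it's a couple of pandas operations (a sketch, not the project's actual code; it assumes antecedents are frozensets, which is what mlxtend produces):

table = rules.table
organized_by_hand = (
    table[table["antecedents"].apply(len) == 1]  # single-antecedent rules only
    .sort_values("lift", ascending=False)        # strongest associations first
)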

Usually, only one version of the table is available, because most methods apply the organization directly to table. AssociationMiner.mine_favorite_characters and AssociationMiner.mine_favorite_band_members are the only ready-to-use methods that return Rules with both the original and organized tables; the others return Rules whose table is already organized and whose table_organized is set to None.

In [17]:
rules1 = miner.mine_region_favorite_characters()
rules1.table.head()
Out[17]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction antecedent_len consequent_len rule_len
157 (Asahi LOCK Rokka) (Southeast Asia) 0.055014 0.269216 0.021849 0.397163 1.475260 0.007039 1.212242 1 1 2
532 (Tsukishima Marina) (Southeast Asia) 0.026531 0.269216 0.010535 0.397059 1.474872 0.003392 1.212032 1 1 2
47 (Uehara Himari) (North America) 0.096762 0.419430 0.055014 0.568548 1.355525 0.014429 1.345619 1 1 2
74 (Kitazawa Hagumi) (North America) 0.071401 0.419430 0.037846 0.530055 1.263749 0.007899 1.235398 1 1 2
51 (Yamabuki Saaya) (Southeast Asia) 0.156067 0.269216 0.052282 0.335000 1.244355 0.010267 1.098924 1 1 2
In [18]:
rules1.table_organized is None
Out[18]:
True

For AssociationMiner.mine_favorite_character_reasons specifically, we can also specify the antecedent as "reason" or "character", since both are common and may be of interest:

In [19]:
rules2 = miner.mine_favorite_character_reasons(antecedent="character")
rules2.table.head()
Out[19]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction antecedent_len consequent_len rule_len
5381 (Udagawa Tomoe) (Personality, Seta Kaoru) 0.090909 0.156847 0.030823 0.339056 2.161692 0.016564 1.275679 1 2 3
1867 (Mitake Ran) (Aoba Moca, Character Design, Personality) 0.170113 0.167772 0.056184 0.330275 1.968594 0.027644 1.242642 1 3 4
1600 (Mitake Ran) (Aoba Moca, Character Design) 0.170113 0.177136 0.059306 0.348624 1.968112 0.029172 1.263270 1 2 3
8821 (Kitazawa Hagumi) (Personality, Seta Kaoru) 0.071401 0.156847 0.021459 0.300546 1.916171 0.010260 1.205445 1 2 3
1040 (Mitake Ran) (Aoba Moca, Personality) 0.170113 0.249707 0.071011 0.417431 1.671681 0.028532 1.287904 1 2 3
In [20]:
rules3 = miner.mine_favorite_character_reasons(antecedent="reason")
rules3.table.head()
Out[20]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction antecedent_len consequent_len rule_len
25980 (Other) (Speaking Voice, Hikawa Sayo, Personality) 0.033164 0.142021 0.010144 0.305882 2.153782 0.005434 1.236071 1 3 4
19354 (Other) (Speaking Voice, Singing Voice, Character Desi... 0.033164 0.180258 0.012876 0.388235 2.153782 0.006897 1.339964 1 4 5
19413 (Other) (Speaking Voice, Character Design, Singing Voi... 0.033164 0.180258 0.012876 0.388235 2.153782 0.006897 1.339964 1 5 6
17577 (Other) (Speaking Voice, Singing Voice, Their Seiyuu, ... 0.033164 0.192353 0.013656 0.411765 2.140675 0.007277 1.373000 1 4 5
24770 (Other) (Speaking Voice, Hikawa Sayo) 0.033164 0.143192 0.010144 0.305882 2.136176 0.005396 1.234385 1 2 3

Advanced Association Rule Mining

That's it for the ready-to-use methods. Now let's try using AssociationMiner.mine, which can be quite powerful and lets us investigate associations between any survey questions.

AssociationMiner.mine has five arguments: three are optional and change the association rule filtering behavior previously mentioned. The other two are columns and column_values. These are parallel lists that tell AssociationMiner what columns to mine and what values to mine for.

Say we want to mine overall favorite characters and whether the respondent plays on the Japanese (JP) server. These two columns are already defined in constants (which we previously imported) as CHARACTERS and JP_SERVER. The possible values of these columns are also already defined, as ALL_CHARACTERS and YES_NO, respectively.

To mine, just do this:

In [21]:
rules_c_jp = miner.mine(
    [CHARACTERS, JP_SERVER],
    [ALL_CHARACTERS, YES_NO]
)
rules_c_jp.table_organized.head()
Out[21]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction antecedent_len consequent_len rule_len
150 (Udagawa Tomoe) (Seta Kaoru) 0.092111 0.162996 0.032839 0.356522 2.187309 0.017826 1.300750 1 1 2
269 (Kitazawa Hagumi) (Seta Kaoru) 0.073288 0.162996 0.022026 0.300546 1.843893 0.010081 1.196655 1 1 2
90 (Toyama Kasumi) (Ichigaya Arisa) 0.142571 0.199439 0.047657 0.334270 1.676047 0.019223 1.202530 1 1 2
47 (Mitake Ran) (Aoba Moca) 0.172207 0.269924 0.076492 0.444186 1.645597 0.030009 1.313526 1 1 2
22 (Imai Lisa) (Minato Yukina) 0.295555 0.243893 0.104125 0.352304 1.444502 0.032041 1.167379 1 1 2

Seems like there are only character names here, so where are the responses to the JP server question? Turns out that the associations among favorite characters are stronger than those between favorite characters and playing on JP, so what we're interested in doesn't show up at the top of the sorted table.

In this case, we can confirm that the miner actually mined for JP_SERVER by checking Rules.table:

In [22]:
rules_c_jp.table.head()
Out[22]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
0 (Hikawa Sayo) (Yes) 0.331998 0.497397 0.173809 0.523522 1.052524 0.008674 1.054830
1 (Yes) (Hikawa Sayo) 0.497397 0.331998 0.173809 0.349436 1.052524 0.008674 1.026804
2 (No) (Hikawa Sayo) 0.502603 0.331998 0.158190 0.314741 0.948020 -0.008674 0.974816
3 (Hikawa Sayo) (No) 0.331998 0.502603 0.158190 0.476478 0.948020 -0.008674 0.950097
4 (No) (Okusawa Misaki) 0.502603 0.272727 0.154185 0.306773 1.124834 0.017111 1.049112

There are some yeses and noes there, so we definitely mined on them. To find the results we're interested in, we can use Rules.search. By default, this method searches Rules.table_organized for rules that contain any one of a list of strings (provided as the one_of argument) in either the antecedents or the consequents, and returns the results as a DataFrame. Let's search for a yes or no inside the consequents only:

In [23]:
rules_c_jp.search(
    one_of=YES_NO,
    location="consequents"
).head()
Out[23]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction antecedent_len consequent_len rule_len
372 (Tsukishima Marina) (No) 0.027233 0.502603 0.016820 0.617647 1.228896 0.003133 1.300884 1 1 2
240 (Tamade CHU2 Chiyu) (Yes) 0.041650 0.497397 0.023628 0.567308 1.140553 0.002912 1.161572 1 1 2
5 (Okusawa Misaki) (No) 0.272727 0.502603 0.154185 0.565345 1.124834 0.017111 1.144349 1 1 2
51 (Udagawa Ako) (No) 0.124149 0.502603 0.069684 0.561290 1.116766 0.007286 1.133772 1 1 2
113 (Kitazawa Hagumi) (Yes) 0.073288 0.497397 0.040048 0.546448 1.098616 0.003595 1.108149 1 1 2

Now, let's try something slightly more complicated: let's mine favorite Poppin'Party character, favorite Afterglow character, and favorite Pastel*Palettes character. Column constants for these questions are defined in constants (CHARACTER_POPIPA, CHARACTER_AFTERGLOW, CHARACTER_PASUPARE), but constants for possible answers are not. To get all possible answers, we can use the helper class ResponseParser like so:

In [24]:
from helpers import ResponseParser

df = miner.df
afterglow_members = ResponseParser.unique_answers(df, CHARACTER_AFTERGLOW)
popipa_members = ResponseParser.unique_answers(df, CHARACTER_POPIPA)
pasupare_members = ResponseParser.unique_answers(df, CHARACTER_PASUPARE)

rules_trio = miner.mine(
    [CHARACTER_POPIPA, CHARACTER_AFTERGLOW, CHARACTER_PASUPARE],
    [popipa_members, afterglow_members, pasupare_members]
)

rules_trio.table_organized.head()
Out[24]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction antecedent_len consequent_len rule_len
10 (Toyama Kasumi) (Maruyama Aya) 0.147483 0.300039 0.056964 0.386243 1.287311 0.012714 1.140454 1 1 2
14 (Uehara Himari) (Ichigaya Arisa) 0.145533 0.271557 0.050332 0.345845 1.273562 0.010811 1.113563 1 1 2
6 (Hikawa Hina) (Aoba Moca) 0.195864 0.349980 0.085057 0.434263 1.240820 0.016508 1.148978 1 1 2
12 (Uehara Himari) (Maruyama Aya) 0.145533 0.300039 0.053063 0.364611 1.215213 0.009397 1.101626 1 1 2
3 (Ichigaya Arisa) (Maruyama Aya) 0.271557 0.300039 0.093250 0.343391 1.144487 0.011772 1.066024 1 1 2

ResponseParser.unique_answers requires the DataFrame you want to look in, and the actual column name.

One last example: let's mine region and preferred play style. Almost the same as before, but one point to note: both the region and play style questions have "Other" as a valid answer. These two answers would be considered the same by the miner, which would make the results misleading, so we should remove one of them. Let's do that, and then mine:

In [25]:
df = miner.df
regions = ResponseParser.unique_answers(df, REGION)
play_styles = ResponseParser.unique_answers(df, PLAY_STYLE)
play_styles.remove("Other")

rules_region_style = miner.mine(
    [REGION, PLAY_STYLE],
    [regions, play_styles]
)

rules_region_style.table_organized.head()
Out[25]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction antecedent_len consequent_len rule_len
43 (Non-Index Fingers) (Multiple Fingers, Index Fingers) 0.025631 0.044854 0.010412 0.406250 9.057199 0.009263 1.608667 1 2 3
45 (Non-Index Fingers) (Multiple Fingers) 0.025631 0.076892 0.010412 0.406250 5.283366 0.008442 1.554708 1 1 2
31 (Non-Index Fingers) (North America, Index Fingers) 0.025631 0.153384 0.014017 0.546875 3.565397 0.010085 1.868394 1 2 3
40 (Non-Index Fingers) (Thumbs, Index Fingers) 0.025631 0.123348 0.010412 0.406250 3.293527 0.007251 1.476466 1 2 3
17 (Non-Index Fingers) (Index Fingers) 0.025631 0.368442 0.025631 1.000000 2.714130 0.016187 inf 1 1 2

Going Beyond What's Provided

The most common "extra" thing you'll probably want to do is to include more columns in the DataFrames. You can do so by making new column constants in constants.py and adding them to DataCleaner.prepare_data_frame. The constants should be set to the question string found in the original survey .tsv file.

You might also want to create a custom plot with PandasPlotter. The way the public methods manipulate the DataFrame under the hood in order to plot is very systematic, and they all look very similar to each other, so you should be able to look at the code and imitate it (it mostly involves using PandasPlotter._group_counts_for_answer and PandasPlotter._plot_group_counts_for_answer). HeatMapPlotter's public methods can also be copied systematically, if you want to plot something other than single-answer responses.

In conjunction with adding more columns, you might also want to add a list of all valid answers for those columns to constants.py, particularly if answers can have commas inside of them (which prevents usage of ResponseParser.unique_answers, as this method splits responses by comma in order to get unique answers). Adding a new constant works fine in most cases, except when an answer is a substring of another answer for the same question; this is because PandasPlotter and AssociationMiner determine if a response has an answer by checking for substring membership. So if you had a constant L = ["R", "R.I.O.T"] as all valid answers, all responses with the answer "R.I.O.T" would be considered responses with the answer "R" as well (since "R" is inside "R.I.O.T").

Getting around this issue is tricky. One way is to (1) remove commas from the answers of all responses in data/responses.tsv, (2) modify or create a version of PandasPlotter._group_counts_for_answer or AssociationMiner._reduce (depending on what you want to do) that splits on commas instead of checking for substring membership, and then (3) use ResponseParser.unique_answers or define a valid answer list constant.
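
A toy example of the difference between the two membership checks (using the L constant from above):

response = "R.I.O.T, Some Other Answer"

# substring membership: wrongly counts "R" as an answer of this response
print("R" in response)  # True

# splitting on commas: only exact answers match
answers = [a.strip() for a in response.split(",")]
print("R" in answers)        # False
print("R.I.O.T" in answers)  # True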

If you're wondering why PandasPlotter and AssociationMiner use substring membership in the first place, it's to avoid the issue with answers having commas in them.