Do Liberals and Conservatives Have Different Vocabulary Sizes?

Alec Robinson
6 min read · Feb 3, 2021
Source: Stock photo from ourcommons.ca

Hypothesis Testing with scipy.stats

This is the second post in what will be a series on mining information from a psql dump of openparliament.ca.

It’s easy enough, and often informative, to graph your data and then draw a conclusion from eyeballing it.

That wouldn’t pass muster in any empirical context, however, so hypothesis testing is an important skill for any kind of scientist (data or otherwise), and for anyone seeking funding or publishing results in a journal.

This article contradicts the claim that in the US, Democrats tend to have a larger vocabulary than Republicans, based on an analysis of tweeted words. In my earlier post, I commented on their methodological choices and explained how, in contrast, my analysis uses the pre-trained SpaCy pipeline to strip out named entities, stopwords, and punctuation, and then saves the lemmas to a Counter object. A further benefit of lemmatization is that it avoids overcounting different forms of a word with the same root while, unlike stemming, maintaining the distinction between homographs: words that share a spelling but have different meanings. An example is “tear”, as in “he had a tear in his eye”, versus the verb “tear”, as in “he plans to tear that fabric”.
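
For readers who skipped that post, here is a minimal sketch of the idea. The model name and the shape of the function are illustrative stand-ins, not the exact pipeline from the earlier post:

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # assumed model; any pre-trained English pipeline with NER works

def vocab_counter(texts):
    # Count lemmas, skipping named-entity tokens, stopwords, and punctuation
    counts = Counter()
    for doc in nlp.pipe(texts):
        entity_tokens = {tok.i for ent in doc.ents for tok in ent}
        counts.update(tok.lemma_.lower() for tok in doc
                      if tok.i not in entity_tokens
                      and not tok.is_stop and not tok.is_punct)
    return counts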

import pandas as pd
import numpy as np
%config Completer.use_jedi = False

pol_vocab = pd.read_csv(r"D:\data\openparliament\politician_vocab.csv", index_col='Unnamed: 0')
pol_vocab.drop(['member_id', 'log_words', 'scaled'], axis=1, inplace=True)
pol_vocab.head()

The first thing to note is that this isn’t a comparison of vocabulary sizes as counted by the previous methodology, because it wouldn’t be fair to compare vocab sizes across MPs without context for how many words each has spoken in general.

In order to get that statistic, I loaded in the original text data, split into tokens by nltk.word_tokenize. This skips the stopword removal, POS tagging, and lemmatization mentioned earlier, but since these are all of the statements made in that time frame, I’m making the assumption that everyone uses about the same proportion of stopwords, and that there aren’t huge differences in naming names. I sincerely doubt there’s an MP who only goes on the floor to list street names.
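
For reference, the tokens column in the CSV below would have been produced along these lines; the text column name here is a guess on my part:

import nltk
nltk.download('punkt', quiet=True)  # tokenizer models, if not already present

# 'text_en' is a hypothetical name for the raw statement-text column
df['tokens'] = df['text_en'].astype(str).apply(nltk.word_tokenize)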

df = pd.read_csv(r"D:\data\openparliament\text_en.csv", index_col='Unnamed: 0')

DtypeWarning: Columns (20,23,28,29,30,31,32,33) have mixed types. Specify dtype option on import or set low_memory=False.
from ast import literal_eval
df['tokens'] = df.tokens.apply(literal_eval)
df['token_count'] = df.tokens.apply(lambda x: len(x))
words_spoken = df.pivot_table(index='politician_id', values='token_count', aggfunc=sum)
words_spoken.reset_index(inplace=True)
words_spoken.token_count.describe()

count    1.051000e+03
mean     1.138286e+05
std      1.755304e+05
min      4.000000e+00
25%      1.892550e+04
50%      5.104100e+04
75%      1.289090e+05
max      1.414664e+06
Name: token_count, dtype: float64

As you can see here, there are six orders of magnitude between the most and least loquacious members by raw tokens ever spoken (remember Python’s scientific notation). To tame that spread, I’m going to scale the sum of tokens spoken with a natural log.
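
As a quick sanity check on what the log buys us, here are the two extremes from the summary above:

import numpy as np

# min and max token counts from the describe() output above:
# six orders of magnitude collapse to roughly a factor of ten
print(np.log(4), np.log(1414664))  # ~1.386 and ~14.162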

words_spoken['log_words'] = words_spoken.token_count.apply(np.log)
words_spoken.log_words.describe()

count    1051.000000
mean       10.681356
std         1.602146
min         1.386294
25%         9.848248
50%        10.840385
75%        11.766860
max        14.162403
Name: log_words, dtype: float64
pol_vocab = pol_vocab.merge(words_spoken, how='inner', on='politician_id')
pol_vocab.head()

At this point, this scaled figure can hardly be described as a vocabulary size, so henceforth, I’ve called it a vocabulary index.

pol_vocab['vocab_index'] = pol_vocab['vocab_size'] / pol_vocab['log_words']
pol_vocab.vocab_index.describe()

count    1499.000000
mean      339.769742
std       150.701900
min         0.227560
25%       229.792183
50%       337.468426
75%       441.582375
max       856.785195
Name: vocab_index, dtype: float64
party_dict = {1: 'con', 2: 'lib', 4: 'lib', 10: 'con', 3: 'quebec', 28: 'con', 25: 'con', 5: 'ind', 26: 'con', 46: 'quebec', 9: 'lib', 39: 'quebec'}

Note: this bit of consolidation partially happened elsewhere, and I can imagine Quebec, Green Party, NDP, and various flavours of piqued conservative-party voters protesting in rage that they can’t be divided so casually into liberal, conservative, and Quebec. Well, this is for the sake of a hypothesis-testing demonstration, and while there have been many iterations, fractures, and mergers of the Conservative party, they consist largely of the same people. Here, for the sake of this analysis, the Conservative Party of Canada, the Progressive Conservatives, the Canadian Alliance, the Reform Party of Canada, and the Canadian Reform Alliance Party (that name didn’t last long) shall be treated as just one party.

pol_vocab['party'] = pol_vocab.party_id.replace(party_dict)

The Null Hypothesis

The first thing to do in any hypothesis test is to decide upon a null hypothesis: what you would assume to be true in the simplest version of reality. In this case, since the people who work in the offices of MPs and prepare their speeches are by and large educated in the same places, I’ll assume they more or less write the same. The alternate hypothesis, then, must be that they don’t. Remember that with hypothesis testing, the results will never prove your alternate hypothesis. You can only reject a hypothesis, and the alternate hypothesis then becomes your next best conclusion.

H0: Vocab(conservatives) = Vocab(liberals)

H1: Vocab(conservatives) ≠ Vocab(liberals)

party_means = pol_vocab.groupby('party').mean()
party_means['vocab_index']

party
con       376.690728
ind       324.219212
lib       320.684699
quebec    312.969374
Name: vocab_index, dtype: float64
import matplotlib.pyplot as plt

plt.bar(['con', 'ind', 'lib', 'quebec'], party_means['vocab_index'], color=['blue', 'red', 'gray', 'black'])
plt.ylabel('normalized mean vocab size')
plt.xlabel('party group')
plt.title('Mean normalized vocabulary size by major party group')
plt.show()
party_stds = pol_vocab.groupby('party').std()
party_stds['vocab_index']

party
con       144.237433
ind       134.225549
lib       154.384062
quebec    137.955541
Name: vocab_index, dtype: float64

The goal of hypothesis testing here is to check whether these two sample means could plausibly come from the same distribution. Initially, I planned on going through the calculation by hand, but if you’re any kind of programmer, not reinventing the wheel should be your maxim. There are many great libraries for doing all manner of statistical tests, and unless you’re using methodology that was just published (and even then!), your first task is to look for a library that will do it for you.

from scipy.stats import ttest_ind

Since the significance level always has to be decided upon ahead of time, I’m going to choose 0.05 as the threshold, as that’s the most common value used in experimentation in any field involving biological organisms or psychology; if the test’s p-value comes in at or below it, the null hypothesis is rejected.

conservative_vocab = pol_vocab[pol_vocab.party == 'con']['vocab_index']
liberal_vocab = pol_vocab[pol_vocab.party == 'lib']['vocab_index']

stat, p = ttest_ind(conservative_vocab, liberal_vocab)

The ttest_ind function from scipy acts on two array-like structures (arrays or pandas Series objects) and gives you the t-statistic and the p-value. Since we’ve decided that p <= 0.05 is sufficient evidence to reject the null hypothesis, it’s time to look:

p

7.926164444339309e-11

This is in scientific notation, so the p-value is ~0.000000000079, a very small number indeed. This is sufficient to reject the null hypothesis that conservatives and liberals have the same vocabulary usage. Now, I’ll admit disappointment at first glance; there’s nothing like running an experiment and missing significance. But this is a reminder to always look at the magnitude indicated by Python’s scientific notation, even if you’re sure you know to do that, and doubly so when you’re testing a conclusion you don’t expect.
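
One caveat worth flagging: by default, ttest_ind runs Student’s t-test, which assumes the two groups have equal variances. The group standard deviations above are close, but scipy makes the assumption cheap to drop, so here’s a sketch of the robustness check:

# Welch's t-test: the same call, without the equal-variance assumption.
# With spreads this similar, the conclusion shouldn't change.
stat_w, p_w = ttest_ind(conservative_vocab, liberal_vocab, equal_var=False)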

Some alternate hypothesis must be true. While you can’t reject the null hypothesis and then claim that whatever particular alternate hypothesis you favour has been proven, something other than the null must be true. Tied to reality as we are, going back to the mean values earlier: the mean adjusted vocabulary index for conservative speeches is 377, in contrast to 321 for the liberals.
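
Statistical significance says nothing about how large that gap actually is, so here’s a back-of-the-envelope Cohen’s d, computed from the group means and standard deviations printed earlier:

import numpy as np

# Rough pooled standard deviation from the two group stds above,
# then Cohen's d for the difference in mean vocab_index
pooled_std = np.sqrt((144.24**2 + 154.38**2) / 2)
d = (376.69 - 320.68) / pooled_std
print(round(d, 2))  # ~0.37: a real but modest difference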

The necessary caveat is that the specific numbers don’t mean much in themselves, so long as they were arrived at in the same way for all groups: the vocab sizes are adjusted by dividing by the log of total words spoken, and stopwords and proper names were removed before the original vocab sizes were counted.

This is also a reminder not to take this as proof of the blanket statement “conservatives have larger vocabularies than liberals”. This is a very specific dataset (speeches on the floor of the House of Commons), and it says nothing about the habits, literacy, or skills of people with those political affiliations in general. But though a hypothesis cannot be proven, exactly, I take this as evidence in support of the somewhat counterintuitive conclusion that conservative MPs use larger vocabularies on the floor than liberal MPs do.
