Overview

Semantic change detection (Table 2)

models = load_models(['2019', '2020'], models_dir='../../models')
models
{'2019': <gensim.models.word2vec.Word2Vec>,
 '2020': <gensim.models.word2vec.Word2Vec>}
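load_models is a helper from the repository. A minimal sketch of what it presumably does, assuming one saved gensim Word2Vec file per corpus, named after the corpus (the actual file-naming convention is defined in the repository's code):

from pathlib import Path
from gensim.models import Word2Vec

def load_models_sketch(names, models_dir):
    # Hypothetical sketch: load one Word2Vec model per corpus name,
    # assuming files like '<name>.model' inside models_dir.
    return {name: Word2Vec.load(str(Path(models_dir) / f"{name}.model"))
            for name in names}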

Vocabulary sizes for the two models before Procrustes alignment:

pd.DataFrame(
    columns=['Model', 'VocabSize'],
    data=[
    ['2019', f"{len(models['2019'].wv.key_to_index):,}"],
    ['2020', f"{len(models['2020'].wv.key_to_index):,}"],
])
Model VocabSize
0 2019 252,564
1 2020 277,707
smart_procrustes_align_gensim(models['2019'], models['2020'])
190756 190756
190756 190756
<gensim.models.word2vec.Word2Vec>
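smart_procrustes_align_gensim presumably follows the approach of Hamilton et al.: intersect the two vocabularies, bring both embedding matrices into the same row order, and rotate the 2020 vectors onto the 2019 space by solving an orthogonal Procrustes problem. A minimal sketch of the core rotation step (vocabulary intersection and the in-place update of the gensim objects are omitted):

import numpy as np

def procrustes_rotate(base_vecs, other_vecs):
    # Orthogonal Procrustes: find the rotation R that minimises
    # ||other_vecs @ R - base_vecs||, assuming both matrices already
    # share the same row order over the intersected vocabulary.
    m = other_vecs.T @ base_vecs
    u, _, vt = np.linalg.svd(m)
    return other_vecs @ (u @ vt)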

Intersecting vocabulary size after alignment:

pd.DataFrame(
    columns=['Model', 'VocabSize'],
    data=[
    ['2019', f"{len(models['2019'].wv.key_to_index):,}"],
    ['2020', f"{len(models['2020'].wv.key_to_index):,}"],
])
Model VocabSize
0 2019 190,756
1 2020 190,756

Measuring the semantic distance (i.e. cosine distance) between the 2019 and the 2020 model for all words contained in the aligned vocabulary.

distances = measure_distances(models['2019'], models['2020'])
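measure_distances presumably computes, for every word in the shared vocabulary, the cosine distance between its 2019 vector and its (aligned) 2020 vector. A minimal sketch under that assumption, matching the lex / dist_sem columns shown below:

import numpy as np
import pandas as pd

def cosine_distances(model_a, model_b):
    # Hypothetical sketch: one cosine distance per shared vocabulary item.
    shared = [w for w in model_a.wv.key_to_index if w in model_b.wv.key_to_index]
    rows = []
    for w in shared:
        va, vb = model_a.wv[w], model_b.wv[w]
        sim = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
        rows.append({'lex': w, 'dist_sem': 1 - sim})
    return pd.DataFrame(rows)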

The 20 words with the highest semantic distance between 2019 and 2020. This output corresponds to Table 2 in the paper.

get_change_candidates(20, distances)
lex dist_sem
0 lockdowns 1.02
1 maskless 1.00
2 sunsetting 1.00
3 childe 0.98
4 megalodon 0.98
5 newf 0.96
6 corona 0.93
7 filtrate 0.92
8 chaz 0.90
9 klee 0.89
10 rona 0.89
11 cerb 0.87
12 rittenhouse 0.87
13 vacuo 0.86
14 moderna 0.84
15 pandemic 0.84
16 spreader 0.84
17 distancing 0.83
18 sars 0.83
19 quarantines 0.82
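get_change_candidates presumably just ranks the distance table and keeps the top n rows. A rough sketch (the propNouns flag used for Table 3 below, which filters out proper nouns, is not reproduced here):

def top_change_candidates(distances, n=20):
    # Hypothetical sketch: the n words with the largest semantic distance.
    return (distances
        .sort_values('dist_sem', ascending=False)
        .head(n)
        .round(2)
        .reset_index(drop=True)
    )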

Extended list for the Appendix (Table 3)

get_change_candidates(50, distances, propNouns=False)
lex dist_sem
0 lockdowns 1.02
1 maskless 1.00
2 sunsetting 1.00
3 newf 0.96
4 corona 0.93
5 filtrate 0.92
6 chaz 0.90
7 rona 0.89
8 cerb 0.87
9 vacuo 0.86
10 moderna 0.84
11 pandemic 0.84
12 spreader 0.84
13 distancing 0.83
14 sars 0.83
15 quarantines 0.82
16 yada 0.82
17 recounts 0.82
18 alway 0.81
19 yadda 0.80
20 pandemics 0.80
21 pansies 0.79
22 tosser 0.79
23 bipoc 0.79
24 ventilators 0.79
25 budging 0.79
26 diys 0.78
27 thst 0.78
28 flyweight 0.77
29 yeap 0.77
30 mrna 0.77
31 tiktoks 0.77
32 buuuut 0.76
33 coomer 0.76
34 unfortunatly 0.75
35 anywho 0.75
36 quarantining 0.74
37 venti 0.74
38 webrip 0.74
39 obvi 0.74
40 fkin 0.74
41 modus 0.73
42 tink 0.73
43 duplicating 0.73
44 retinoids 0.73
45 parasol 0.72
46 copypastas 0.72
47 excercise 0.72
48 newbies 0.72
49 mers 0.72

Semantic axes (Figure 2)

models = load_models(['Coronavirus', 'conspiracy'], models_dir='../../models')
lexs = [ 'corona', 'rona', 'moderna', 'sars', 'spreader', 'maskless', 'distancing', 'quarantines', 'pandemic', 'science', 'research', 'masks', 'lockdowns', 'vaccines' ]

Evaluative dimension: good vs bad

pole_words = ['good', 'bad']

proj_sims = get_axis_sims(lexs, models, pole_words, k=10)
proj_sims = aggregate_proj_sims(proj_sims)
proj_sims_melted = proj_sims.melt(id_vars=['lex', 'SimDiff'], var_name='model', value_name='SemSim')
sem_axis_evaluative_plot = plot_sem_axis(proj_sims_melted, pole_words)
sem_axis_evaluative_plot
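get_axis_sims presumably follows the SemAxis approach: each pole word is expanded with its k nearest neighbours, the axis is the difference between the two averaged pole vectors, and every target word is scored by its cosine similarity with that axis, separately per model. A minimal sketch under those assumptions:

import numpy as np
import pandas as pd

def axis_sims(lexs, models, pole_words, k=10):
    # Hypothetical sketch of a SemAxis-style projection.
    rows = []
    for model_name, model in models.items():
        pole_vecs = []
        for pole in pole_words:
            # Expand each pole with its k nearest neighbours and average.
            nbs = [w for w, _ in model.wv.most_similar(pole, topn=k)]
            pole_vecs.append(np.mean([model.wv[w] for w in [pole] + nbs], axis=0))
        axis = pole_vecs[0] - pole_vecs[1]  # e.g. 'good' minus 'bad'
        for lex in lexs:
            vec = model.wv[lex]
            sim = np.dot(vec, axis) / (np.linalg.norm(vec) * np.linalg.norm(axis))
            rows.append({'lex': lex, 'model': model_name, 'SemSim': sim})
    return pd.DataFrame(rows)

aggregate_proj_sims then presumably reshapes this into one row per word and adds the SimDiff column (the difference in SemSim between the two subreddits) that the melted data frame carries into the plot.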

MFT-based dimension: loyalty vs betrayal

pole_words = ['loyalty', 'betrayal']

proj_sims = get_axis_sims(lexs, models, pole_words, k=10)
proj_sims = aggregate_proj_sims(proj_sims)
proj_sims_melted = proj_sims.melt(id_vars=['lex', 'SimDiff'], var_name='model', value_name='SemSim')
sem_axis_mft_plot = plot_sem_axis(proj_sims_melted, pole_words)
sem_axis_mft_plot

Maps of socio-semantic variation (Figure 3)

Note that the plots in this notebook are not identical to the ones in the paper: the dimensionality reduction via t-SNE is stochastic, so results differ between runs.

models = load_models(['Coronavirus', 'conspiracy'], models_dir='../../models')
smart_procrustes_align_gensim(models['Coronavirus'], models['conspiracy'])
67181 67181
67181 67181
<gensim.models.word2vec.Word2Vec>
nbs_vecs = pd.concat([get_nbs_vecs('vaccines', model_name, model, k=750) for model_name, model in models.items()])
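get_nbs_vecs presumably collects, for one target word, the k nearest neighbours in a given model together with their similarity scores and vectors, tagged with the model (here: subreddit) name. A rough sketch matching the columns used below:

import pandas as pd

def nbs_vecs_for(lex, model_name, model, k=750):
    # Hypothetical sketch: nearest neighbours of `lex` with their vectors.
    rows = [
        {'lex': word, 'subreddit': model_name, 'sim': sim, 'vec': model.wv[word]}
        for word, sim in model.wv.most_similar(lex, topn=k)
    ]
    return pd.DataFrame(rows)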

Common neighbours

nbs_vecs_2d = dim_red_nbs_vecs(nbs_vecs, perplexity=0.1)
nbs_sim = (nbs_vecs_2d
    .groupby('subreddit')
    .apply(lambda df: df.nlargest(10, 'sim'))
    .reset_index(drop=True)
)
map_sims_plot = (alt.Chart(nbs_sim).mark_text().encode(
        x='x_tsne:Q',
        y='y_tsne:Q',
        text='lex',
        color='subreddit:N'
    ))

map_sims_plot
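dim_red_nbs_vecs presumably stacks the neighbour vectors and projects them to two dimensions with scikit-learn's t-SNE, writing the coordinates back as x_tsne / y_tsne. A minimal sketch under that assumption:

import numpy as np
from sklearn.manifold import TSNE

def reduce_to_2d(nbs_vecs, perplexity=30):
    # Hypothetical sketch: project the neighbour vectors to 2-D with t-SNE.
    vecs = np.vstack(nbs_vecs['vec'].to_numpy())
    coords = TSNE(n_components=2, perplexity=perplexity).fit_transform(vecs)
    out = nbs_vecs.copy()
    out['x_tsne'], out['y_tsne'] = coords[:, 0], coords[:, 1]
    return out

Passing a fixed random_state to TSNE would make the layout reproducible across runs, which is what the note above refers to.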

Differences in neighbours

nbs_vecs = dim_red_nbs_vecs(nbs_vecs, perplexity=70)
nbs_diff = nbs_vecs.drop_duplicates(subset='lex', keep=False)
nbs_diff = (nbs_diff
    .groupby('subreddit')
    .apply(lambda df: df.nlargest(20, 'sim'))
    .reset_index(drop=True)
)
map_diffs_plot = (alt.Chart(nbs_diff).mark_text().encode(
        x='x_tsne:Q',
        y='y_tsne:Q',
        text='lex:N',
        color='subreddit:N'
    ))


map_diffs_plot