models = load_models(['2019', '2020'], models_dir='../../models')
models
{'2019': <gensim.models.word2vec.Word2Vec>,
'2020': <gensim.models.word2vec.Word2Vec>}
Vocabulary sizes for the two models before Procrustes alignment:
pd.DataFrame(
    columns=['Model', 'VocabSize'],
    data=[
        ['2019', f"{len(models['2019'].wv.key_to_index):,}"],
        ['2020', f"{len(models['2020'].wv.key_to_index):,}"],
    ])
|   | Model | VocabSize |
|---|-------|-----------|
| 0 | 2019  | 252,564   |
| 1 | 2020  | 277,707   |
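The alignment itself is carried out by the repository's alignment routine; the printed vocabulary sizes and the returned Word2Vec object below are its output. As a rough illustration only, an orthogonal Procrustes alignment of two gensim models can be sketched as follows (the function name and return convention are assumptions, not the repository's actual code):

import numpy as np

def procrustes_align_sketch(base, other):
    # Intersect the vocabularies so the two embedding matrices are row-aligned.
    shared = [w for w in base.wv.index_to_key if w in other.wv.key_to_index]
    A = np.vstack([base.wv[w] for w in shared])   # reference vectors (e.g. 2019)
    B = np.vstack([other.wv[w] for w in shared])  # vectors to be rotated (e.g. 2020)
    # Orthogonal Procrustes: rotation R minimising ||B @ R - A||_F.
    U, _, Vt = np.linalg.svd(B.T @ A)
    R = U @ Vt
    return shared, B @ R  # row i of the rotated matrix corresponds to shared[i]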
190756 190756
190756 190756
<gensim.models.word2vec.Word2Vec>
Intersecting vocabulary size after alignment:
pd.DataFrame(
    columns=['Model', 'VocabSize'],
    data=[
        ['2019', f"{len(models['2019'].wv.key_to_index):,}"],
        ['2020', f"{len(models['2020'].wv.key_to_index):,}"],
    ])
|   | Model | VocabSize |
|---|-------|-----------|
| 0 | 2019  | 190,756   |
| 1 | 2020  | 190,756   |
Measuring semantic distances between the 2019 and the 2020 model, i.e. the cosine distance between a word's two vectors, for all words contained in the aligned vocabulary.
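A minimal sketch of how such a per-word distance table could be computed from the two aligned models (the column names mirror the tables below; the code itself is an assumption, not the repository's exact implementation):

import pandas as pd
from scipy.spatial.distance import cosine

# Cosine distance between each word's 2019 and 2020 vectors
# over the shared (aligned) vocabulary.
shared_vocab = models['2019'].wv.index_to_key
dist_sem = pd.DataFrame({
    'lex': shared_vocab,
    'dist_sem': [cosine(models['2019'].wv[w], models['2020'].wv[w]) for w in shared_vocab],
})
dist_sem = dist_sem.sort_values('dist_sem', ascending=False).reset_index(drop=True)
dist_sem.head(20)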
The 20 words with the highest semantic distance between 2019 and 2020. This output is presented in Table 2 in the paper.
|    | lex | dist_sem |
|----|-----|----------|
| 0  | lockdowns | 1.02 |
| 1  | maskless | 1.00 |
| 2  | sunsetting | 1.00 |
| 3  | childe | 0.98 |
| 4  | megalodon | 0.98 |
| 5  | newf | 0.96 |
| 6  | corona | 0.93 |
| 7  | filtrate | 0.92 |
| 8  | chaz | 0.90 |
| 9  | klee | 0.89 |
| 10 | rona | 0.89 |
| 11 | cerb | 0.87 |
| 12 | rittenhouse | 0.87 |
| 13 | vacuo | 0.86 |
| 14 | moderna | 0.84 |
| 15 | pandemic | 0.84 |
| 16 | spreader | 0.84 |
| 17 | distancing | 0.83 |
| 18 | sars | 0.83 |
| 19 | quarantines | 0.82 |
Extended list for the Appendix (Table 3)
|    | lex | dist_sem |
|----|-----|----------|
| 0  | lockdowns | 1.02 |
| 1  | maskless | 1.00 |
| 2  | sunsetting | 1.00 |
| 3  | newf | 0.96 |
| 4  | corona | 0.93 |
| 5  | filtrate | 0.92 |
| 6  | chaz | 0.90 |
| 7  | rona | 0.89 |
| 8  | cerb | 0.87 |
| 9  | vacuo | 0.86 |
| 10 | moderna | 0.84 |
| 11 | pandemic | 0.84 |
| 12 | spreader | 0.84 |
| 13 | distancing | 0.83 |
| 14 | sars | 0.83 |
| 15 | quarantines | 0.82 |
| 16 | yada | 0.82 |
| 17 | recounts | 0.82 |
| 18 | alway | 0.81 |
| 19 | yadda | 0.80 |
| 20 | pandemics | 0.80 |
| 21 | pansies | 0.79 |
| 22 | tosser | 0.79 |
| 23 | bipoc | 0.79 |
| 24 | ventilators | 0.79 |
| 25 | budging | 0.79 |
| 26 | diys | 0.78 |
| 27 | thst | 0.78 |
| 28 | flyweight | 0.77 |
| 29 | yeap | 0.77 |
| 30 | mrna | 0.77 |
| 31 | tiktoks | 0.77 |
| 32 | buuuut | 0.76 |
| 33 | coomer | 0.76 |
| 34 | unfortunatly | 0.75 |
| 35 | anywho | 0.75 |
| 36 | quarantining | 0.74 |
| 37 | venti | 0.74 |
| 38 | webrip | 0.74 |
| 39 | obvi | 0.74 |
| 40 | fkin | 0.74 |
| 41 | modus | 0.73 |
| 42 | tink | 0.73 |
| 43 | duplicating | 0.73 |
| 44 | retinoids | 0.73 |
| 45 | parasol | 0.72 |
| 46 | copypastas | 0.72 |
| 47 | excercise | 0.72 |
| 48 | newbies | 0.72 |
| 49 | mers | 0.72 |
In this section, we determine the communities that are most actively engaged in Covid-related discourse.
# Collect the Covid comment CSVs and load them into a single frame.
comments_dir_path = Path('../../data/covid/')
comments_paths = list(comments_dir_path.glob('Covid*.csv'))
comments = read_multi_comments_csvs(comments_paths)
comments
|   | author | body | created_utc | id | subreddit |
|---|--------|------|-------------|----|-----------|
| 0 | Gloob_Patrol | I assume you work too so he's feeling like he ... | 2020-09-08 18:53:06 | g4guhl5 | LongDistance |
| 1 | amtrusc | Strep swab and culture negative, I’m sure? Cou... | 2020-09-08 18:53:08 | g4guhsm | tonsilstones |
| 2 | Ephuntz | >Good point. My apologies. It's just becomi... | 2020-09-08 18:53:09 | g4guhua | Winnipeg |
| 3 | cstransfer | Have you noticed an increase of people going e... | 2020-09-08 18:53:09 | g4guhu4 | financialindependence |
| 4 | IlliniWhoDat | I haven't. I have seen it online, but haven't... | 2020-09-08 18:53:13 | g4gui6o | KoreanBeauty |
| ... | ... | ... | ... | ... | ... |
| 3800760 | willw | Last group pre COVID! | 2020-07-01 21:59:48 | fwmqfbj | jawsurgery |
| 3800761 | Daikataro | If everyone is infected with COVID, new cases ... | 2020-07-01 21:59:49 | fwmqff2 | politics |
| 3800762 | StabYourBloodIntoMe | > If the mortality rate is actually decreas... | 2020-07-01 21:59:50 | fwmqfib | dataisbeautiful |
| 3800763 | Shorse_rider | I was a freelancer until covid and earned more... | 2020-07-01 21:59:55 | fwmqfuw | AskWomen |
| 3800764 | Gayfetus | This is actually fascinating and possibly incr... | 2020-07-01 21:59:57 | fwmqfz0 | Coronavirus |

3800765 rows × 5 columns
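read_multi_comments_csvs comes from the repository's read_data module; a minimal sketch of such a reader, assuming it simply concatenates the individual CSVs and parses the timestamp column:

import pandas as pd

def read_multi_comments_csvs_sketch(paths):
    frames = []
    for path in paths:
        df = pd.read_csv(path)
        df['created_utc'] = pd.to_datetime(df['created_utc'])  # parse timestamps
        frames.append(df)
    return pd.concat(frames, ignore_index=True)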
Plot the top 15 communities that are most actively engaged in Covid-related discourse.
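The plot itself is generated elsewhere in the repository; a minimal sketch of how the top 15 communities could be identified and plotted from the comments frame, assuming engagement is measured by raw comment counts:

# Count Covid-related comments per subreddit and plot the 15 most active ones.
top_communities = (comments
                   .groupby('subreddit')
                   .size()
                   .nlargest(15)
                   .sort_values())
ax = top_communities.plot.barh()
ax.set_xlabel('Number of Covid-related comments')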
# Project target words onto the evaluative axis spanned by 'good' and 'bad'.
pole_words = ['good', 'bad']
proj_sims = get_axis_sims(lexs, models, pole_words, k=10)
proj_sims = aggregate_proj_sims(proj_sims)
proj_sims_melted = proj_sims.melt(id_vars=['lex', 'SimDiff'], var_name='model', value_name='SemSim')
sem_axis_evaluative_plot = plot_sem_axis(proj_sims_melted, pole_words)
sem_axis_evaluative_plot
# Repeat the projection for the axis spanned by 'loyalty' and 'betrayal'.
pole_words = ['loyalty', 'betrayal']
proj_sims = get_axis_sims(lexs, models, pole_words, k=10)
proj_sims = aggregate_proj_sims(proj_sims)
proj_sims_melted = proj_sims.melt(id_vars=['lex', 'SimDiff'], var_name='model', value_name='SemSim')
sem_axis_evaluative_plot = plot_sem_axis(proj_sims_melted, pole_words)
sem_axis_evaluative_plot
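get_axis_sims, aggregate_proj_sims, and plot_sem_axis are repository functions; conceptually they perform a SemAxis-style projection. As a rough sketch of that idea (not the repository's implementation): each pole word is expanded with its k nearest neighbours and averaged into a pole vector, the axis is the difference of the two pole vectors, and each target word is scored by the cosine similarity of its vector to that axis.

import numpy as np

def semaxis_score_sketch(lex, model, pole_words, k=10):
    pos, neg = pole_words
    def pole_vec(pole):
        # Average the pole word with its k nearest neighbours.
        neighbours = [w for w, _ in model.wv.most_similar(pole, topn=k)]
        return np.mean([model.wv[w] for w in [pole] + neighbours], axis=0)
    axis = pole_vec(pos) - pole_vec(neg)
    vec = model.wv[lex]
    return float(np.dot(vec, axis) / (np.linalg.norm(vec) * np.linalg.norm(axis)))

# e.g. semaxis_score_sketch('lockdowns', models['2020'], ['good', 'bad'], k=10)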
Note that the plots in this notebook are not identical to the ones in the paper, since the dimensionality reduction via t-SNE yields slightly different results from run to run.
67181 67181
67181 67181
<gensim.models.word2vec.Word2Vec>
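dim_red_nbs_vecs is a repository helper that reduces the neighbour vectors to two dimensions with t-SNE. A minimal sketch of such a reduction, assuming nbs_vecs holds one embedding per row in a 'vec' column (both the column name and the wrapper shape are assumptions); note that t-SNE is not seeded here, which is why results differ between runs:

import numpy as np
from sklearn.manifold import TSNE

def dim_red_nbs_vecs_sketch(nbs_vecs, perplexity):
    X = np.vstack(nbs_vecs['vec'].to_numpy())
    coords = TSNE(n_components=2, perplexity=perplexity).fit_transform(X)
    out = nbs_vecs.copy()
    out['x'], out['y'] = coords[:, 0], coords[:, 1]
    return out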
nbs_vecs_2d = dim_red_nbs_vecs(nbs_vecs, perplexity=0.1)
# For each subreddit, keep the 10 nearest neighbours by similarity.
nbs_sim = (nbs_vecs_2d
           .groupby('subreddit')
           .apply(lambda df: df.nlargest(10, 'sim'))
           .reset_index(drop=True)
           )
nbs_vecs = dim_red_nbs_vecs(nbs_vecs, perplexity=70)
# Drop lexemes occurring more than once, then keep each subreddit's 20 most similar neighbours.
nbs_diff = nbs_vecs.drop_duplicates(subset='lex', keep=False)
nbs_diff = (nbs_diff
            .groupby('subreddit')
            .apply(lambda df: df.nlargest(20, 'sim'))
            .reset_index(drop=True)
            )