models = load_models(['2019', '2020'], models_dir='../../models')
models
{'2019': <gensim.models.word2vec.Word2Vec>,
'2020': <gensim.models.word2vec.Word2Vec>}
Vocabulary sizes for the two models before Procrustes alignment:
pd.DataFrame(
    columns=['Model', 'VocabSize'],
    data=[
        ['2019', f"{len(models['2019'].wv.key_to_index):,}"],
        ['2020', f"{len(models['2020'].wv.key_to_index):,}"],
    ])
|   | Model | VocabSize |
|---|-------|-----------|
| 0 | 2019  | 252,564   |
| 1 | 2020  | 277,707   |
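The alignment itself is carried out by the repository's alignment routine; the printed vocabulary sizes and the returned Word2Vec object below are its output. As a rough illustration only, an orthogonal Procrustes alignment of two gensim models can be sketched as follows (the function name and return convention are assumptions, not the repository's actual code):

import numpy as np

def procrustes_align_sketch(base, other):
    # Intersect the vocabularies so the two embedding matrices are row-aligned.
    shared = [w for w in base.wv.index_to_key if w in other.wv.key_to_index]
    A = np.vstack([base.wv[w] for w in shared])   # reference vectors (e.g. 2019)
    B = np.vstack([other.wv[w] for w in shared])  # vectors to be rotated (e.g. 2020)
    # Orthogonal Procrustes: rotation R minimising ||B @ R - A||_F.
    U, _, Vt = np.linalg.svd(B.T @ A)
    R = U @ Vt
    return shared, B @ R  # row i of the rotated matrix corresponds to shared[i]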
190756 190756
190756 190756
<gensim.models.word2vec.Word2Vec>
Intersecting vocabulary size after alignment:
pd.DataFrame(
    columns=['Model', 'VocabSize'],
    data=[
        ['2019', f"{len(models['2019'].wv.key_to_index):,}"],
        ['2020', f"{len(models['2020'].wv.key_to_index):,}"],
    ])
|   | Model | VocabSize |
|---|-------|-----------|
| 0 | 2019  | 190,756   |
| 1 | 2020  | 190,756   |
Measuring semantic distances between the 2019 and the 2020 model, i.e. the cosine distance between a word's two vectors, for all words contained in the aligned vocabulary.
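A minimal sketch of how such a per-word distance table could be computed from the two aligned models (the column names mirror the tables below; the code itself is an assumption, not the repository's exact implementation):

import pandas as pd
from scipy.spatial.distance import cosine

# Cosine distance between each word's 2019 and 2020 vectors
# over the shared (aligned) vocabulary.
shared_vocab = models['2019'].wv.index_to_key
dist_sem = pd.DataFrame({
    'lex': shared_vocab,
    'dist_sem': [cosine(models['2019'].wv[w], models['2020'].wv[w]) for w in shared_vocab],
})
dist_sem = dist_sem.sort_values('dist_sem', ascending=False).reset_index(drop=True)
dist_sem.head(20)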
The 20 words with the highest semantic distance between 2019 and 2020. This output is presented in Table 2 in the paper.
|    | lex | dist_sem |
|----|-----|----------|
| 0  | lockdowns | 1.02 |
| 1  | maskless | 1.00 |
| 2  | sunsetting | 1.00 |
| 3  | childe | 0.98 |
| 4  | megalodon | 0.98 |
| 5  | newf | 0.96 |
| 6  | corona | 0.93 |
| 7  | filtrate | 0.92 |
| 8  | chaz | 0.90 |
| 9  | klee | 0.89 |
| 10 | rona | 0.89 |
| 11 | cerb | 0.87 |
| 12 | rittenhouse | 0.87 |
| 13 | vacuo | 0.86 |
| 14 | moderna | 0.84 |
| 15 | pandemic | 0.84 |
| 16 | spreader | 0.84 |
| 17 | distancing | 0.83 |
| 18 | sars | 0.83 |
| 19 | quarantines | 0.82 |
Extended list for the Appendix (Table 3)
|    | lex | dist_sem |
|----|-----|----------|
| 0  | lockdowns | 1.02 |
| 1  | maskless | 1.00 |
| 2  | sunsetting | 1.00 |
| 3  | newf | 0.96 |
| 4  | corona | 0.93 |
| 5  | filtrate | 0.92 |
| 6  | chaz | 0.90 |
| 7  | rona | 0.89 |
| 8  | cerb | 0.87 |
| 9  | vacuo | 0.86 |
| 10 | moderna | 0.84 |
| 11 | pandemic | 0.84 |
| 12 | spreader | 0.84 |
| 13 | distancing | 0.83 |
| 14 | sars | 0.83 |
| 15 | quarantines | 0.82 |
| 16 | yada | 0.82 |
| 17 | recounts | 0.82 |
| 18 | alway | 0.81 |
| 19 | yadda | 0.80 |
| 20 | pandemics | 0.80 |
| 21 | pansies | 0.79 |
| 22 | tosser | 0.79 |
| 23 | bipoc | 0.79 |
| 24 | ventilators | 0.79 |
| 25 | budging | 0.79 |
| 26 | diys | 0.78 |
| 27 | thst | 0.78 |
| 28 | flyweight | 0.77 |
| 29 | yeap | 0.77 |
| 30 | mrna | 0.77 |
| 31 | tiktoks | 0.77 |
| 32 | buuuut | 0.76 |
| 33 | coomer | 0.76 |
| 34 | unfortunatly | 0.75 |
| 35 | anywho | 0.75 |
| 36 | quarantining | 0.74 |
| 37 | venti | 0.74 |
| 38 | webrip | 0.74 |
| 39 | obvi | 0.74 |
| 40 | fkin | 0.74 |
| 41 | modus | 0.73 |
| 42 | tink | 0.73 |
| 43 | duplicating | 0.73 |
| 44 | retinoids | 0.73 |
| 45 | parasol | 0.72 |
| 46 | copypastas | 0.72 |
| 47 | excercise | 0.72 |
| 48 | newbies | 0.72 |
| 49 | mers | 0.72 |
In this section, we determine the communities that are most actively engaged in Covid-related discourse.
# Collect the Covid comment CSVs and load them into a single frame.
comments_dir_path = Path('../../data/covid/')
comments_paths = list(comments_dir_path.glob('Covid*.csv'))
comments = read_multi_comments_csvs(comments_paths)
comments
|   | author | body | created_utc | id | subreddit |
|---|--------|------|-------------|----|-----------|
| 0 | Gloob_Patrol | I assume you work too so he's feeling like he ... | 2020-09-08 18:53:06 | g4guhl5 | LongDistance |
| 1 | amtrusc | Strep swab and culture negative, I’m sure? Cou... | 2020-09-08 18:53:08 | g4guhsm | tonsilstones |
| 2 | Ephuntz | >Good point. My apologies. It's just becomi... | 2020-09-08 18:53:09 | g4guhua | Winnipeg |
| 3 | cstransfer | Have you noticed an increase of people going e... | 2020-09-08 18:53:09 | g4guhu4 | financialindependence |
| 4 | IlliniWhoDat | I haven't. I have seen it online, but haven't... | 2020-09-08 18:53:13 | g4gui6o | KoreanBeauty |
| ... | ... | ... | ... | ... | ... |
| 3800760 | willw | Last group pre COVID! | 2020-07-01 21:59:48 | fwmqfbj | jawsurgery |
| 3800761 | Daikataro | If everyone is infected with COVID, new cases ... | 2020-07-01 21:59:49 | fwmqff2 | politics |
| 3800762 | StabYourBloodIntoMe | > If the mortality rate is actually decreas... | 2020-07-01 21:59:50 | fwmqfib | dataisbeautiful |
| 3800763 | Shorse_rider | I was a freelancer until covid and earned more... | 2020-07-01 21:59:55 | fwmqfuw | AskWomen |
| 3800764 | Gayfetus | This is actually fascinating and possibly incr... | 2020-07-01 21:59:57 | fwmqfz0 | Coronavirus |

3800765 rows × 5 columns
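read_multi_comments_csvs comes from the repository's read_data module; a minimal sketch of such a reader, assuming it simply concatenates the individual CSVs and parses the timestamp column:

import pandas as pd

def read_multi_comments_csvs_sketch(paths):
    frames = []
    for path in paths:
        df = pd.read_csv(path)
        df['created_utc'] = pd.to_datetime(df['created_utc'])  # parse timestamps
        frames.append(df)
    return pd.concat(frames, ignore_index=True)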
Plot the top 15 communities that are most actively engaged in Covid-related discourse.
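The plot itself is generated elsewhere in the repository; a minimal sketch of how the top 15 communities could be identified and plotted from the comments frame, assuming engagement is measured by raw comment counts:

# Count Covid-related comments per subreddit and plot the 15 most active ones.
top_communities = (comments
                   .groupby('subreddit')
                   .size()
                   .nlargest(15)
                   .sort_values())
ax = top_communities.plot.barh()
ax.set_xlabel('Number of Covid-related comments')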
# Project target words onto the evaluative axis spanned by 'good' and 'bad'.
pole_words = ['good', 'bad']
proj_sims = get_axis_sims(lexs, models, pole_words, k=10)
proj_sims = aggregate_proj_sims(proj_sims)
proj_sims_melted = proj_sims.melt(id_vars=['lex', 'SimDiff'], var_name='model', value_name='SemSim')
sem_axis_evaluative_plot = plot_sem_axis(proj_sims_melted, pole_words)
sem_axis_evaluative_plot
# Repeat the projection for the axis spanned by 'loyalty' and 'betrayal'.
pole_words = ['loyalty', 'betrayal']
proj_sims = get_axis_sims(lexs, models, pole_words, k=10)
proj_sims = aggregate_proj_sims(proj_sims)
proj_sims_melted = proj_sims.melt(id_vars=['lex', 'SimDiff'], var_name='model', value_name='SemSim')
sem_axis_evaluative_plot = plot_sem_axis(proj_sims_melted, pole_words)
sem_axis_evaluative_plot
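get_axis_sims, aggregate_proj_sims, and plot_sem_axis are repository functions; conceptually they perform a SemAxis-style projection. As a rough sketch of that idea (not the repository's implementation): each pole word is expanded with its k nearest neighbours and averaged into a pole vector, the axis is the difference of the two pole vectors, and each target word is scored by the cosine similarity of its vector to that axis.

import numpy as np

def semaxis_score_sketch(lex, model, pole_words, k=10):
    pos, neg = pole_words
    def pole_vec(pole):
        # Average the pole word with its k nearest neighbours.
        neighbours = [w for w, _ in model.wv.most_similar(pole, topn=k)]
        return np.mean([model.wv[w] for w in [pole] + neighbours], axis=0)
    axis = pole_vec(pos) - pole_vec(neg)
    vec = model.wv[lex]
    return float(np.dot(vec, axis) / (np.linalg.norm(vec) * np.linalg.norm(axis)))

# e.g. semaxis_score_sketch('lockdowns', models['2020'], ['good', 'bad'], k=10)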
Note that the plots in this notebook are not identical to the ones in the paper, since the dimensionality reduction via t-SNE yields slightly different results from run to run.
67181 67181
67181 67181
<gensim.models.word2vec.Word2Vec>
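dim_red_nbs_vecs is a repository helper that reduces the neighbour vectors to two dimensions with t-SNE. A minimal sketch of such a reduction, assuming nbs_vecs holds one embedding per row in a 'vec' column (both the column name and the wrapper shape are assumptions); note that t-SNE is not seeded here, which is why results differ between runs:

import numpy as np
from sklearn.manifold import TSNE

def dim_red_nbs_vecs_sketch(nbs_vecs, perplexity):
    X = np.vstack(nbs_vecs['vec'].to_numpy())
    coords = TSNE(n_components=2, perplexity=perplexity).fit_transform(X)
    out = nbs_vecs.copy()
    out['x'], out['y'] = coords[:, 0], coords[:, 1]
    return out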
nbs_vecs_2d = dim_red_nbs_vecs(nbs_vecs, perplexity=0.1)
# For each subreddit, keep the 10 nearest neighbours by similarity.
nbs_sim = (nbs_vecs_2d
           .groupby('subreddit')
           .apply(lambda df: df.nlargest(10, 'sim'))
           .reset_index(drop=True)
           )
nbs_vecs = dim_red_nbs_vecs(nbs_vecs, perplexity=70)
# Drop lexemes occurring more than once, then keep each subreddit's 20 most similar neighbours.
nbs_diff = nbs_vecs.drop_duplicates(subset='lex', keep=False)
nbs_diff = (nbs_diff
            .groupby('subreddit')
            .apply(lambda df: df.nlargest(20, 'sim'))
            .reset_index(drop=True)
            )