Introduction

> `whoami`

Quirin Würschinger
q.wuerschinger@lmu.de
Wissenschaftlicher Mitarbeiter and PostDoc in (computational) linguistics
LMU Munich

Current work

research
- lexical innovation on the web and in social networks
- variation and change in language use and social polarization in social networks
- using Large Language Models (LLMs) like ChatGPT for research in linguistics and social science.
teaching: corpus linguistics and research methodology

Promoting Open Science in (computational) linguistics at LMU

teaching and applying reproducible corpuslinguistic methods
creating and sharing corpora among researchers and students

Workshop materials

GitHub repository: https://github.com/wuqui/opensciws
slides: https://wuqui.github.io/opensciws/opensciws_slides.html
website version: https://wuqui.github.io/opensciws/opensciws_website.html

`Open` Open Science workshop

Focus on …

ask questions
discuss
apply and practice
collaborate

Time table

Topic	Start	End
intro	09:00
Open Science principles		10:30
~
version control	10:50
project structure		12:00
~
code	13:15
data and methods		14:40
~
publishing	15:00
authoring		15:25
recap, open issues, feedback	16:00	16:30

Addressing different backgrounds and goals

Backgrounds and interests

CAIS: Forschung zu Digitalisierung und Digitale Gesellschaft

research fiels

education and pedagogy
political science
sociology
communications studies
…

data and methods

qualitative interviews
text analysis
quantitative surveys
experimental designs
social media studies
…

Survey: main interests

reproducible workflows
- managing files and folders
- programming with Python and R
- plain text authoring
data and methods
- text analysis
- social media analysis
- questionnaires
publishing
- sharing data and code
- authoring papers

Who are you?

Please briefly introduce yourself …

name
place and position
your research interest in about 3 sentences for someone outside your field

Open Science principles

What is Open Science?

Why should we do Open Science?

source

What are the reasons why science can go wrong?

source

Richard McElreath: Science as Amateur Software Development

source

Principles of Open Science

Center for Open Science

Open Science lifecycle

Center for Open Science

Roles in Open Science

Funders: make open science part of the selection process, and conditions for grantees conducting research.
Publishers: make open science part of the review process, and conditions for articles published in their journals.
Institutions: make open science part of academic training, and part of the selection process for research positions and evaluation for advancement and promotion.
Societies: make open science part of their awards, events, and scholarly norms.
Researchers: enact open science in their work and advocate for broader adoption in their communities.

(Center for Open Science)

Who profits from Open Science?

source

What is Open Science to you?

What do you find interesting, important, or attractive about Open Science?

https://tinyurl.com/opnsci

Learning outcomes

::: {.content-visible when-format=“revealjs”} # Implementing an open and reproducible workflow

version control
project structure
code
data and methods
authoring
publishing

Version control

Why use version control?

source

git and GitHub/GitLab

git: software on your machine

git add src/tests.py
git commit -m 'add tests'
git push

GitHub and GitLab: services on a remote server

Collaborating using GitHub

(source)

git commands

(source)

GitHub workflow

(source)

Example

source

How to set up a GitHub repository

set up `git`

Installing git: see tutorial

Using git:

from the command line
using a standalone GUI¹ tool; e.g.:
- GitKraken
- GitHub Desktop
from within your editor/IDE²; e.g.:
- RStudio
- VSCode

set up `GitHub`

tutorial

setting up git user information (name, passwort)
setting up GitHub authentication
setting and storing authentication (‘token’)

create a repository on GitHub

(create GitHub account)
click on New (https://github.com/new)
specify repo name ¹
specify description
specify visibility: private or public
select Add a README file
specify licence ²

`clone` repositories

go to the folder where you want your project to live

git clone https://github.com/wuqui/opensciws.git

`add`ing, `commit`ing, and `push`ing changes

(source)

git add src/tests.py
git commit -m 'add tests'
git push

Project structure

Let’s not pretend we’re all geniuses …

File names

File names should be:

machine-readable
human-readable
consistent
avoid non-ASCII characters (e.g. umlauts like ü, ä)
optional: play well with default ordering (e.g. include timestamps)

File structure

.
├── analysis            <- all things data analysis
│   └── src             <- functions and other source files
├── comm
│   ├── internal-comm   <- internal communication such as meeting notes
│   └── journal-comm    <- communication with the journal, e.g. peer review
├── data
│   ├── data_clean      <- clean version of the data
│   └── data_raw        <- raw data (don't touch)
├── dissemination
│   ├── manuscripts
│   ├── posters
│   └── presentations
├── documentation       <- documentation, e.g. data management plan
└── misc                <- miscellaneous files that don't fit elsewhere

General principle: keep the ‘inputs’ of your project/code (e.g. data) separate from its outputs (e.g. plots).

Practice: project management

You have until 11:50 h to work on either …

developing a project structure for your needs from scratch
refactoring/cleaning an existing project¹

Optionally: set up version control via git/GitHub for this project.

Break

Code

Reproducibility

Reproducibility (crisis)

source

Reproducibility et al.

The quality of tools

source

Testing code

Why we should test code

source

not all our projects have that high stakes

Professional testing

source

same is true of industry

Types of tests

Testing frameworks

Python: pytest
R: testthat

Analogy

during the process of manufacturing a ballpoint pen, the cap, the body, the tail, the ink cartridge and the ballpoint are produced separately and unit tested separately.
When two or more units are ready, they are assembled and integration testing is performed, for example a test to check the cap fits on the body.
When the complete pen is integrated, system testing is performed to check it can be used to write like any pen should.
Acceptance testing could be a check to ensure the pen is the colour the customer ordered.

source

Testing example

using pytest for Python

Documenting code

Literate programming

’Literate programming is a methodology that combines a programming language with a documentation language,
thereby making programs more robust, more portable, more easily maintained,
and arguably more fun to write than programs that are written only in a high-level language.
The main idea is to treat a program as a piece of literature, addressed to human beings rather than to a computer.
The program is also viewed as a hypertext document, rather like the World Wide Web. (Indeed, I used the word WEB for this purpose long before CERN grabbed it!)’

Donald Knuth

Notebooks

Python

R

→ both work with Quarto

Example using `nbdev` for Python

Programming a deck of cards: https://github.com/fastai/nbdev_cards/

Literate testing with `nbdev`

Additional benefits of `nbdev`

simple, integrated testing
continuous integration
dependency management
publishing code for PyPI and conda
publishing documentation via Quarto

R: Quarto and RMarkdown

Licensing

Code

often appropriate: MIT license

Other materials

Data and methods

Diversity in data and methods

CAIS: Forschung zu Digitalisierung und Digitale Gesellschaft

research fiels

education and pedagogy
political science
sociology
communications studies
…

data and methods

qualitative interviews
text analysis
quantitative surveys
experimental designs
social media studies
…

source

Reasons to share your data

To allow the possibility to fully reproduce a scientific study.
To prevent duplicate efforts and speed up scientific progress. Large amounts of research funds and careers of researchers can be wasted by only sharing a small part of research in the form of publications.
To facilitate collaboration and increase the impact and quality of scientific research.
To make results of research openly available as a public good, since research is often publicly funded.

FAIR data

source

How to share your data

Turing Way tutorial

Step 1: Select what data you want to share; eg.:
- ethical concerns
- commercial concerns
Step 2: Choose a data repository or other sharing platform
- overview: re3data, NIH, FAIRsharing
- examples: Zenodo, Dryad
Step 3: Choose a licence and link to your paper and code; e.g.:
- Creative Commons
- Open Data Commons
Step 4: Upload your data and documentation
- good file organisation
- appropriate file formats (e.g. csv > xlsx)

Sharing social media data

source

Obstacles to data-sharing

Reason 1: Preparing data for sharing is resource-intensive
Reason 2: Not enough credit for data sharing
Reason 3: Lack of confidence and knowledge
Reason 4: Data protection laws
Reason 5: Platform terms of service
Reason 6: Copyright
Reason 7: Informed consent
Reason 8: Ethical challenges
Reason 9: Lack of common standards

source

The case of Twitter

stage 1: access costly & legal grey area for scraping
stage 2: Research API 🎉
stage 3: Elon Musik → X → …

Group work

get active: see https://tinyurl.com/opnsci

Break

Publishing

Open access

The Turing Way tutorial on open access

source

Routes to open access publishing

source

Preregistration

What is preregistration?

When you preregister your research, you’re simply specifying your research plan in advance of your study and submitting it to a registry.

Preregistration separates hypothesis-generating (exploratory) from hypothesis-testing (confirmatory) research.

Both are important.
But the same data cannot be used to generate and test a hypothesis, which can happen unintentionally and reduce the credibility of your results.
Addressing this problem through planning improves the quality and transparency of your research.
This helps you clearly report your study and helps others who may wish to build on it.

Open Science Center tutorial

The preregistration process

source

Avoiding pitfalls through preregistration

source

HARKing: hypothesizing after results are known

Outlets

ArXiV: preprints
Zenodo: all kinds including data, code, preprints, etc.
GitHub and GitLab: code, software
Open Science Framework: all kinds including data, code, preprints, preregistration, etc.
Software Heritage: archival of code (long-term)
Papers with Code: code and data for and with papers, mostly Machine Learning
…

Authoring

Plain text authoring

source

Markdown syntax

Quarto: Markdown basics

Benefits of Markdown

Markdown provides semantic meaning for content in a relatively simple way
You can write rich formatted content extremely quickly (compared to writing directly in HTML tags)
You can read Markdown easily in plain text before rendered by HTML
It doesn’t interrupt your workflow with the need to click buttons
It’s platform-agnostic so your content is not tied to the format of your editor

Quarto

https://quarto.org/

Multi-purpose publishing

Code

Python

R

Multi-format publishing

see the Quarto gallery

Articles

Presentations

Websites

Personal websites and blogs

Demo

Demo repository: https://github.com/wuqui/opensciwsdemo

Authoring a Quarto document

You have until 16:00 h to,

based on Quarto’s excellent getting started guide,

install Quarto
make an example text document using Quarto Markdown syntax (optional: clone my demo repo)
render the document to multiple output formats (pdf, html, docx, etc.)

further steps

publishing your document(s) as a website on GitHub pages (see tutorial)
refactoring existing document (e.g. paper, website) to Quarto Markdown

Recap, open issues, feedback

version control
project structure
code
data and methods
authoring
publishing

Open Science and Research Software Engineering

Introduction

> whoami

Workshop materials

Open Open Science workshop

Time table

Addressing different backgrounds and goals

Backgrounds and interests

Survey: main interests

Who are you?

Open Science principles

What is Open Science?

Why should we do Open Science?

Principles of Open Science

Open Science lifecycle

Roles in Open Science

Who profits from Open Science?

What is Open Science to you?

Learning outcomes

Version control

Why use version control?

git and GitHub/GitLab

Collaborating using GitHub

git commands

GitHub workflow

Example

How to set up a GitHub repository

set up git

set up GitHub

create a repository on GitHub

clone repositories

adding, commiting, and pushing changes

Project structure

File names

File structure

Practice: project management

Break

Code

Reproducibility

Reproducibility (crisis)

Reproducibility et al.

The quality of tools

Testing code

Why we should test code

Professional testing

Types of tests

Testing example

Documenting code

Literate programming

Notebooks

Example using nbdev for Python

Literate testing with nbdev

Additional benefits of nbdev

R: Quarto and RMarkdown

Licensing

Data and methods

Diversity in data and methods

Reasons to share your data

FAIR data

How to share your data

Sharing social media data

The case of Twitter

Group work

Break

Publishing

Open access

Preregistration

What is preregistration?

The preregistration process

Avoiding pitfalls through preregistration

Outlets

Authoring

Plain text authoring

Markdown syntax

Benefits of Markdown

Quarto

Multi-purpose publishing

Code

Python

R

Open Science and
Research Software Engineering

> `whoami`

`Open` Open Science workshop

set up `git`

set up `GitHub`

`clone` repositories

`add`ing, `commit`ing, and `push`ing changes

Example using `nbdev` for Python

Literate testing with `nbdev`

Additional benefits of `nbdev`