Organisation and introduction
Organisation
Course materials
Registration
Open issues?
Is anyone taking this course as part of a module exam?
Requirements
- Active participation
- BYOD: bring your own device (ideally laptop)
- Some preparatory reading/analyses
Assessment
- Requirements depend on:
- Your individual study programme and module combinations
- How many credits you aim to get in this course
- Please check which option and precise requirements apply to you in the Studien-/Prüfungsordnung (your study and examination regulations)
- Term paper types:
- Thesenpapier / thesis paper
- Hausarbeit / term paper
Overview (character counts include spaces):
| Programme | Module | ECTS | Paper type | Min. characters | Max. characters |
|---|---|---|---|---|---|
| LA Gy | P 12.2 | 6 | term paper | 34,000 | 51,000 |
| MA | WP 18.1 | 9 | term paper | 30,000 | 37,500 |
| MA | WP 2.1 | 6 | term paper | 9,000 | 12,000 |
| MA | WP 24.1 | 6 | term paper | 9,000 | 12,000 |
| MA | WP 25.1 | 6 | term paper | 9,000 | 12,000 |
| MA | WP 32.1 | 9 | term paper | 30,000 | 37,500 |
| MA | WP 33.1 | 6 | thesis paper | 3,000 | 6,000 |
| MA | WP Ang 3.1 | 9 | term paper | 30,000 | 37,500 |
Submission: via email to q.wuerschinger@lmu.de in PDF format.
Course description
This course covers theoretical and practical aspects of corpus linguistics, with an emphasis on hands-on learning. Students will examine language use in different domains, studying a range of linguistic concepts from areas such as lexis, morphology, and syntax, and investigate social variation, text type variation, and language change. The course draws on data from different time periods and different genres (e.g. web corpora, academic prose, novels) to give students practical experience in analysing data and to cover a wide range of linguistic phenomena.
Throughout the course, students will learn how to use various corpus-linguistic methods such as queries, frequency analysis, collocations, and text type analysis. Using tools like Sketch Engine and Excel, students will gain hands-on experience analysing real-world data and a deeper understanding of how these methods can be applied in different linguistic contexts. For example, we will analyse differences in the usage of words and constructions over time using the Gutenberg corpus; we will use Sketch Engine to analyse meaning change and variation, and Excel to create frequency tables and charts.
Literature:
Biber, Douglas, and Randi Reppen, eds. 2015. The Cambridge Handbook of English Corpus Linguistics. Cambridge Handbooks in Language and Linguistics. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139764377.
McEnery, Tony, and Andrew Hardie. 2011. Corpus Linguistics: Method, Theory and Practice. Cambridge Textbooks in Linguistics. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511981395.
O’Keeffe, Anne, and Michael McCarthy, eds. 2022. The Routledge Handbook of Corpus Linguistics. 2nd ed. Routledge Handbooks in Applied Linguistics. Abingdon; New York: Routledge.
Stefanowitsch, Anatol. 2020. Corpus Linguistics. Berlin: Language Science Press. https://doi.org/10.5281/zenodo.3735822.
Course schedule
Survey: experiences and expectations
- Have you written a term paper yet?
- In linguistics?
- An empirical one?
- Using corpus data?
- Have you worked with corpus data?
- Which corpora have you worked with?
- Which corpus tools have you worked with? (e.g. AntConc, Sketch Engine, WordSmith, LancsBox)
- Which corpus-linguistic methods have you used?
- What are you most interested in?
- Linguistic phenomena
- Datasets
- Methods
Introduction
Introduction to corpus linguistics
What is corpus linguistics about?
- Corpus linguistics is a research methodology within the field of linguistics that focuses on the systematic study of language using large and diverse collections of authentic texts, known as corpora.
- These collections of language data, either written or spoken, provide a comprehensive and empirical basis for the analysis of
- language use (e.g. collocational patterns such as pretty woman)
- linguistic variation across different text types or communities (e.g. neologisms such as smash)
- language change (e.g. going to future)
- The primary goal of corpus linguistics is to investigate linguistic phenomena and patterns by examining real-world language usage.
- This approach contrasts with more traditional linguistic methods that rely heavily on introspection and theoretical speculation (e.g. Chomsky).
- Corpus linguistics has gained significant momentum in recent years, thanks to advances
- in data
- e.g. social media and web corpora like Twitter and Reddit
- in methods
- e.g. social network analysis, machine learning
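To make claims about language change (such as the rise of the going to future) quantifiable, corpus linguists compare normalized frequencies across subcorpora from different periods. A minimal Python sketch; the counts and corpus sizes below are invented for illustration, not real corpus data:

```python
# Hypothetical counts: hits for "going to" + verb in two subcorpora.
# All figures below are invented for illustration only.
counts = {"1850s": 120, "1950s": 540}                     # raw hits
corpus_sizes = {"1850s": 2_000_000, "1950s": 3_000_000}   # tokens per subcorpus

def per_million(hits: int, tokens: int) -> float:
    """Normalized frequency: hits per one million tokens."""
    return hits / tokens * 1_000_000

for period, hits in counts.items():
    print(period, per_million(hits, corpus_sizes[period]))
# 1850s: 60.0 per million, 1950s: 180.0 per million
```

Normalization matters because subcorpora almost never have the same size: raw hit counts are only comparable once they are scaled to a common base such as tokens per million.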
What is corpus linguistics good for?
Corpus linguistics is a usage-based approach to linguistic analysis.
- Corpus linguistics is highly valuable from a usage-based perspective, as it treats actual language use as the basis of linguistic knowledge and structure.
- The usage-based perspective posits that linguistic structure and knowledge emerge from the patterns and regularities that speakers encounter in their experience with language.
In this context, corpus linguistics provides a powerful toolset for investigating these patterns and regularities, offering several advantages:
- Authentic language data
- Corpus linguistics relies on large and diverse collections of authentic texts, which represent real-world language use.
- This ensures that the patterns and structures uncovered through corpus analysis are grounded in genuine linguistic behavior, rather than relying on idealized or artificial examples.
- Quantitative approach
- Corpus linguistics allows for the quantitative analysis of linguistic phenomena, such as frequency counts and statistical measures.
- This enables researchers to identify and describe patterns and regularities that emerge from language use, supporting the usage-based claim that linguistic structure is shaped by frequency and distributional patterns in the input.
- Collocations and constructions
- The usage-based approach posits that language is composed of form-meaning pairings, known as constructions, which range from morphemes and words to idiomatic expressions and complex syntactic structures.
- Corpus linguistics offers tools for identifying and analyzing collocations and constructions in large datasets, contributing to our understanding of the relationships between form, meaning, and use.
- Variability and context sensitivity
- Corpus linguistics enables the examination of language use across different contexts, genres, and registers.
- This allows researchers to investigate how linguistic features and structures vary and adapt to different situations, providing insights into the dynamic nature of language and its sensitivity to context, which is a key aspect of the usage-based approach.
- Language change and development
- Corpus linguistics can be applied to diachronic and synchronic data, allowing researchers to track language change over time and compare different stages of language development.
- This helps to shed light on the emergence and evolution of linguistic structures, which is of particular interest to usage-based theorists who seek to explain language change as a result of cumulative changes in usage patterns.
- Data-driven language teaching and language learning
- The usage-based approach emphasizes the importance of exposure to authentic language input in the acquisition process.
- Corpus linguistics can inform the development of language teaching materials and methods, by providing insights into the most frequent and relevant structures, vocabulary, and collocations that learners need to acquire.
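The quantitative identification of collocations mentioned above can be sketched with a simple association measure. Below is a toy Python example computing pointwise mutual information (PMI) over an invented text; restricting co-occurrence to adjacent word pairs is a simplifying assumption (corpus tools typically use larger windows and other measures, e.g. logDice in Sketch Engine):

```python
import math
from collections import Counter

# Toy text; real analyses require corpora of millions of tokens.
tokens = ("she is a pretty woman and a pretty good friend "
          "the woman met a good friend").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(w1: str, w2: str) -> float:
    """Pointwise mutual information of an adjacent word pair:
    log2 of observed pair probability over chance co-occurrence."""
    p_pair = bigrams[(w1, w2)] / (n - 1)
    p1, p2 = unigrams[w1] / n, unigrams[w2] / n
    return math.log2(p_pair / (p1 * p2))

print(round(pmi("pretty", "woman"), 2))  # → 2.09
```

A positive PMI means the two words co-occur more often than their individual frequencies would predict, which is the core intuition behind collocation statistics.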
Key concepts in corpus linguistics
- Corpus
- A corpus is a large, structured collection of texts that serves as the basis for linguistic analysis. Corpora can be general, representing a wide variety of language use, or specialized, focusing on specific genres, registers, or domains. Monitor corpora such as NOW (News on the Web) can be used to track language use up to the present day.
- Annotation
- Annotation refers to the process of adding metadata or linguistic information to a corpus, such as part-of-speech tags, syntactic structure, or semantic roles. This additional information can facilitate more in-depth and accurate analyses.
- Concordance
- A concordance is a tool that allows researchers to search for specific words, phrases, or patterns in a corpus and display the results in context. This helps researchers to examine language patterns and identify trends across various texts.
- Collocations
- Collocation refers to the co-occurrence of words within a specific context or proximity. Studying collocations can reveal important information about word usage, meaning, and associations.
- Frequency
- Frequency analysis involves counting the occurrences of linguistic features, such as words or structures, within a corpus. This helps researchers identify patterns and trends, as well as compare language use across different corpora.
- N-grams
- N-grams are sequences of n contiguous words or linguistic units within a text. They can be used to study word combinations, patterns, and structures in a corpus.
- Example: The cat sat on the mat. → bigrams (n = 2): the cat, cat sat, sat on, on the, the mat
- Example corpus: Google N-grams
- Register
- Register refers to the language variety used in specific contexts or situations, characterized by particular linguistic features, such as vocabulary, grammar, and style. Examining registers can help researchers understand language variation and adaptation.
- Register is one dimension of text type variation
- Metadata: data about language use on several levels:
- Corpus
- Texts: author, text type, register, topic
- Running words (tokens): word class, lemmatization
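Several of the concepts above (frequency, n-grams, concordance) can be illustrated in a few lines of Python. This is a toy sketch on an invented sentence, not a substitute for a corpus tool:

```python
from collections import Counter

# Toy text; real analyses run over corpora of millions of words.
text = "the cat sat on the mat and the dog sat on the cat"
tokens = text.split()

# Frequency: count the occurrences of each word form.
freq = Counter(tokens)
print(freq.most_common(1))              # → [('the', 4)]

# N-grams: sequences of n adjacent tokens (here bigrams, n = 2).
bigrams = list(zip(tokens, tokens[1:]))
print(Counter(bigrams).most_common(1))  # → [(('the', 'cat'), 2)]

# Concordance (KWIC): every hit of a query word shown in context.
def kwic(tokens, query, width=2):
    for i, tok in enumerate(tokens):
        if tok == query:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            print(f"{left:>12} | {query} | {right}")

kwic(tokens, "cat")
```

Tools like Sketch Engine or AntConc perform exactly these operations, but at scale and with annotation (lemmas, part-of-speech tags) that a plain word-form count cannot exploit.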
The scope of this course
Topics:
- Neologisms and lexical innovation
- Productivity
- Grammaticality
- …
Working with corpora:
- Compilation
- Annotation
- Selection
- Corpus analysis
- Data analysis (e.g. using Microsoft Excel)
Methods:
- Frequency-based analysis
- Collocations
- Word sketches
Introduction to Sketch Engine
- Website: https://www.sketchengine.eu/
- Tutorial: Sketch Engine Tutorial