Sketch Engine Tutorial - Web

Author

Quirin Würschinger

Tutorial materials

You can find all materials on this GitHub repo: https://github.com/wuqui/SkEtut

I will continue to work on these materials and would appreciate questions, comments, and contributions via email or on GitHub.

General information

What is Sketch Engine?

https://www.sketchengine.eu

Main features

corpus management

  • creating corpora from your own data
  • hosting these corpora online
  • annotating corpora
  • sharing your corpora


corpus analysis

  • access to many pre-loaded corpora
  • simple and complex queries
  • concordances
  • collocation analysis
  • text type analysis

Resources

https://www.sketchengine.eu/quick-start-guide/

https://www.sketchengine.eu/guide/

Compiling corpora

Data format

https://www.sketchengine.eu/guide/create-corpus-from-files/

texts without annotations: most common

  • structure: ideally, use one document per file
  • file formats: plain text
    • .txt
    • .csv

annotated texts:

  • .xml: powerful, but more involved
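For illustration, a minimal annotated .xml file might look like the sketch below. The element and attribute names (doc, chapter, s) are invented for illustration – SkE lets you define your own structures and attributes when setting up the corpus.

```xml
<doc title="Moby Dick" author="Melville" year="1851">
  <chapter n="1">
    <s>Call me Ishmael.</s>
  </chapter>
</doc>
```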

Raw data

Text files

Melville: Moby Dick – available as part of Project Gutenberg


Tabular data

  1. extract the column containing the text body from your spreadsheet (e.g. in new sheet)
  2. export this column to .csv
  • you can then import this column into SkE just like a .txt file

Note, however, that (meta)data in other columns will be lost.¹
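As a sketch of the conversion mentioned in footnote 1: each spreadsheet row becomes one XML document whose attributes carry the metadata from the other columns. The column and attribute names here are invented for illustration.

```xml
<!-- from a CSV row like: 42,Melville,1851,"Call me Ishmael." -->
<doc id="42" author="Melville" year="1851">
Call me Ishmael.
</doc>
```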

Uploading files

Adding and editing metadata

Figure 1: Adding and editing metadata

Processed data

After compiling: vert (‘vertical’) format – one word per line (WPL)²
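A sketch of what a few lines of vert output can look like. The exact columns (here: word form, PoS tag, lemma) and the tag values depend on how the corpus was processed, so treat this as an illustration rather than the format of any particular corpus:

```
<s>
Call	VV	call
me	PP	me
Ishmael	NP	Ishmael
.	SENT	.
</s>
```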

Analysing data

Dashboard

Available corpora

Browse the full list of (English) corpora here.

Among others, …

Subcorpora

You can create subcorpora for pre-loaded and self-compiled corpora based on

  • all available metadata categories (e.g. timestamps, topics, filenames)
  • concordance searches

Queries

You run queries from the Concordance view.

There are two options:

  • basic searches: simple word or phrase searches
  • advanced searches: more involved and powerful (e.g. searching for constructions based on lemmatized forms or word classes)


Basic queries


Advanced (CQL) queries

Helpful: manual and CQL builder.
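For example, a sketch of a CQL query for a construction based on a lemmatized form and a word class; the tag values assume a Penn-style English tagset and may differ for your corpus:

```
[lemma="make"] [tag="DT"]? [tag="N.*"]
```

This would match sequences like make a decision or made progress, since [lemma="make"] covers all inflected forms and the determiner token is optional.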



Extracting parts of your query matches using within:
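For instance, to return only the noun from matches of a the + adjective + noun pattern, one could write something like the following (again assuming Penn-style tags):

```
[tag="N.*"] within [word="the"] [tag="JJ.*"] [tag="N.*"]
```

Here within restricts the match to the first query, so the concordance highlights only the noun (e.g. only problem from the biggest problem).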


Filtering by metadata

Options:

  • query metadata within CQL syntax (e.g. [word="bank"] within <doc topic="recreation" />)
  • perform ‘text type’³ filtering using the dropdown menus, which is also available for simple queries (see above).

Concordance view

Collocations

Additional measures (e.g. log likelihood) and other options are available in the advanced settings.

Word sketches


Word sketch difference: between two words/phrases


Word sketch difference: between two subcorpora

Visualizations

Annotating data

for metadata: see Figure 1 above

for concordance lines:

Exporting data

Almost everything can be exported:

  • your entire annotated corpora
  • results from queries/concordances
  • results from collocations
  • results from word sketches

I recommend exporting data in .xlsx format, since this seems to be best supported by SkE.⁴

Use cases

Compiling a corpus: dead authors’ minds

See the section Compiling corpora above.

Sharing corpora: the toy corpus of Gutenberg books that I created for this tutorial is named qw-gutenberg and should be accessible to all LMUlers.

Studying syntactic constructions: the N BE that

Select pre-loaded corpus: Gutenberg English 2020

Query inspired by: Schmid, Hans-Jörg, and Annette Mantlik. 2015. ‘Entrenchment in Historical Corpora? Reconstructing Dead Authors’ Minds from Their Usage Profiles’. Anglia 133 (4): 583–623.


Search for target construction
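One way to approximate the N BE that pattern in CQL might be the following sketch; the query actually used in the study may be more restrictive, and the tags assume an English tagset where noun tags start with N:

```
[tag="N.*"] [lemma="be"] [word="that"]
```

This matches sequences like the truth is that or my impression was that.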


Get frequency distribution of nouns in target construction:


Distribution across all authors in SkE:


Plot in exported Excel file:


Individual analysis on Samuel Pepys’ works:


Results for Samuel Pepys:

Comparing collocational profiles

corpus: enTenTen20

method: for the lemma bank (as a noun), get word sketch differences between texts with recreation and business as topics


Results:

Investigating frequency over time: the rise of whatever

corpus: English Historical Book Collection (EEBO, ECCO, Evans)

  1. Identify words that have significantly increased or decreased in frequency over time using the trends feature:


Results:


  2. Investigating the frequency increase of whatever:


Results:


Plotting the exported version in Excel:

Footnotes

  1. To preserve these data, you would have to convert your tabular data (.xlsx or .csv) into XML format before importing.↩︎

  2. More precisely: one token per line, including punctuation – e.g. it, ’s, and , each get their own line.↩︎

  3. ‘Text types’ in SkE are not text types in the linguistic sense, but in the technical sense: documents have different text types if they differ regarding any metadata category. For example, two ‘types’ could be texts tagged for <doc year="1900"> vs <doc year="2000">.↩︎

  4. When exporting to csv, be careful with decimal/thousands separators: when using the Text to columns option in Excel, use . as the decimal and , as the thousands separator (e.g. one thousand point five: 1,000.5).↩︎