Toolkit Glossary
- Project
Projects are like subcorpora. They are useful for dividing the larger corpus into smaller pieces by language, purpose, formatting etc.
Projects can also be a great way to separate out data that you’re currently working with (either in Toolkit, via scripts or just viewing) from data in the larger corpus.
Creating a new project allows for data to be stored in a structured manner, while indices can be shared between different projects.
- Index
An index is a collection of documents. Indices can be shared between different projects.
Documents in an index are laid out like rows in a table: each row is one document and each column is a field, much like a spreadsheet in Excel.
- Field
A field is a column in an index; its values have the same type across all documents.
For example, a usual full-length document in the Bwrite corpus has:
metadata fields (original file name, original path to file, file URL)
statistical fields (the number of characters, sentences and words in the document)
text and processed text fields (text, lemmas, part-of-speech tags and word features, title page text)
the facts field
You can choose which of these fields you would like to see displayed using toggle columns.
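As a rough illustration of how fields and column toggling relate, the sketch below models one document as a Python dictionary and keeps only the fields chosen for display. The field names, the values and the `select_fields` helper are all hypothetical; they are not the Toolkit's actual schema or API.

```python
# A hypothetical document: every key is a field (a column in the index).
# Field names and values are invented for illustration only.
document = {
    "file_name": "example.txt",
    "word_count": 3,
    "text": "Õpilased on koolis.",
    "lemmas": "õpilane olema kool",
    "facts": [{"fact": "Genre", "value": "BA thesis"}],
}


def select_fields(doc, visible_fields):
    """Return only the fields the user has toggled on, in the requested order."""
    return {name: doc[name] for name in visible_fields if name in doc}


visible = select_fields(document, ["file_name", "word_count"])
print(visible)
```

Fields that are toggled off stay in the index; they are simply not shown.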
- Facts (Metadata)
Facts are data that is added by a user or extracted from documents using different tools. Applying facts makes it easier to search for and aggregate documents.
In the Bwrite project, most of the facts were extracted from metadata in the original file path. This means you can search in the processed files indices by Genre, Publication year, University, Discipline and/or Publication.
- Search
Search allows you to look at specific documents in an index or indices. There are two types of search:
Simple search - searching for specific words or phrases over all fields.
Advanced search - a more targeted form of search, for example searching for a specific metadata fact, searching for specific words or phrases in a specific field, combining searches and much more.
Read more about this in Searching.
- Query
A query is a saved search that you can use later on, either for searching or for use in different toolkit tools. Queries can be modified or deleted.
- Aggregation
Aggregation gives you more information about selected search queries.
For example, searching for all BA theses and choosing to aggregate this search using facts allows you to find the metadata attached to the theses: the number of BA theses written at specific Universities, during specific years and in specific disciplines. Aggregation can also allow you to find the most frequent/significant words.
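Conceptually, aggregating by facts amounts to counting how often each fact value occurs among the search results. The sketch below shows that idea in plain Python; the `hits` list, the university names and the `aggregate_by_fact` helper are invented for illustration and do not reflect how the Toolkit stores or computes aggregations.

```python
from collections import Counter

# Hypothetical search results for "all BA theses"; each hit lists its facts.
hits = [
    {"Genre": "BA thesis", "University": "University A", "Publication year": 2019},
    {"Genre": "BA thesis", "University": "University A", "Publication year": 2020},
    {"Genre": "BA thesis", "University": "University B", "Publication year": 2020},
]


def aggregate_by_fact(results, fact_name):
    """Count how many results carry each value of the given fact."""
    return Counter(hit[fact_name] for hit in results if fact_name in hit)


by_university = aggregate_by_fact(hits, "University")
by_year = aggregate_by_fact(hits, "Publication year")
print(by_university.most_common())
print(by_year.most_common())
```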
- Task
Tasks manage the use of different tools and models. Creating a task allows the user to specify the task parameters and select different options for application. Tools and models can also be monitored, modified and applied to documents/indices via tasks.
- Elasticsearch
Elasticsearch is the search engine used in the Bwrite Toolkit.
- Embedding
Embeddings represent the words in the given documents as vectors. Words that are more similar have more similar vectors, which means we can use embeddings to find similar words.
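One common way to compare word vectors is cosine similarity. The sketch below uses tiny, made-up three-dimensional vectors to show how the most similar word can be found; real embeddings have many more dimensions, and the Toolkit's own similarity measure is not specified here, so treat cosine similarity as an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: values near 1.0 mean very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration; not taken from any trained model.
vectors = {
    "kool": [0.9, 0.1, 0.0],
    "ülikool": [0.8, 0.2, 0.1],
    "auto": [0.0, 0.1, 0.9],
}


def most_similar(word, table):
    """Return the other word whose vector is closest to the given word's vector."""
    others = [w for w in table if w != word]
    return max(others, key=lambda w: cosine_similarity(table[word], table[w]))


print(most_similar("kool", vectors))
```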
- Lexicon
Lexicons hold lists of words or phrases that you can use in queries or for finding similar words using embeddings. Read more about Lexicons and Embeddings.
- Multilingual Preprocessor (MLP)
The Multilingual Preprocessor offers analyzing options that generate more linguistic information. For example, you can generate the lemmas or part-of-speech tags for a document automatically by using MLP.
- Lemmas
A lemma is the dictionary form of a word. For example, the word “kool” (“school” in Estonian) can appear in a document as “koolis”, “koolist”, “kooli” or other forms; lemmatization is the process of transforming these forms into the dictionary form.
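To make the idea concrete, the toy sketch below maps the inflected forms from the example back to their lemma using a hand-written lookup table. This is only an illustration: the `LEMMA_TABLE` and `lemmatize` names are hypothetical, and real lemmatizers rely on morphological analysis rather than a fixed table.

```python
# Toy lookup table covering only the forms mentioned in the example above.
LEMMA_TABLE = {
    "koolis": "kool",
    "koolist": "kool",
    "kooli": "kool",
}


def lemmatize(word):
    """Return the dictionary form if the word is in the table, else the word itself."""
    return LEMMA_TABLE.get(word.lower(), word)


for form in ["koolis", "koolist", "kooli", "kool"]:
    print(form, "->", lemmatize(form))
```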
- Tokens
Tokens are words or other structural components in a text that can be extracted with a method of tokenization.
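A very simple form of tokenization splits a text into words and punctuation marks with a regular expression. The sketch below is one such approach, shown as an assumption for illustration; it is not necessarily the tokenizer the Toolkit uses.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    \\w+ matches runs of letters and digits (Unicode-aware in Python 3);
    [^\\w\\s] matches any single character that is neither a word character
    nor whitespace, such as a full stop or comma.
    """
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize("Õpilased on koolis.")
print(tokens)
```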
- Word stem
Word stem refers to the stem or root of a word. Read more about word stems at https://en.wikipedia.org/wiki/Word_stem.
- Part-of-speech tags
A part-of-speech (POS) tag refers to a word’s syntactic category or word class. For example, the Estonian word “kool” is a NOUN.
- Word features
Word features give more linguistic information about a word. For example, the Estonian word “kool” is singular (referring to one school) and in the nominative case, with the information “Case=Nom|Number=Sing”. Word features can also give information about the mood, tense, person, verb form etc.
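A feature string such as “Case=Nom|Number=Sing” can be split into name and value pairs for easier filtering. The sketch below assumes that features are always separated by “|” and written as “Name=Value”, and that an empty string or “_” means no features; the `parse_features` helper is hypothetical.

```python
def parse_features(feature_string):
    """Turn 'Case=Nom|Number=Sing' into {'Case': 'Nom', 'Number': 'Sing'}."""
    # Assumption: an empty string or "_" marks a word without features.
    if not feature_string or feature_string == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feature_string.split("|"))


features = parse_features("Case=Nom|Number=Sing")
print(features["Case"], features["Number"])
```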
- F1 score
In the Toolkit, F1 does not refer to racing; it is a machine learning metric for evaluating the performance of a model. It is the harmonic mean of precision and recall, and scores range from 0.00 to 1.00; a higher F1 score generally means a better-performing model. Read more about F1 at https://en.wikipedia.org/wiki/F-score.
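For reference, F1 combines precision (how many of the predicted positives are correct) and recall (how many of the actual positives were found). The sketch below computes it from raw counts; the function name and the example counts are made up for illustration and are not Toolkit output.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Compute F1 as the harmonic mean of precision and recall."""
    if true_positives == 0:
        # No correct positive predictions: precision and recall are both 0.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Example counts: 8 correct positives, 2 false alarms, 4 missed positives.
score = f1_score(8, 2, 4)
print(round(score, 2))
```

With these counts, precision is 0.80 and recall is about 0.67, so the F1 score lies between the two.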
- Large Language Model (LLM)
LLM stands for large language model, a type of machine learning model trained on large amounts of text. Examples include BERT and GPT.