Bwrite Corpus Overview¶

Accessing Data in Original Form¶

To access the data in original form (whole documents) with metadata, choose project 7: Final corpora (whole documents).

The data is separated by language into Estonian, Latvian and Lithuanian indices.

Corpus language	Index name	Number of documents
Estonian	year_documents_et	34 701
Latvian	year_documents_lv	3773
Lithuanian	year_documents_lt	43 336

To access these indices, see Projects and Indices.

Metadata information for original documents¶

Disciplines: Most frequent disciplines for documents, automatically parsed from faculty names/journal names etc.

A few documents have two disciplines due to having two faculties. Some documents have no discipline, due to being unable to parse the necessary data from the document contents/metadata.

Due to automatic parsing there may be inaccuracies.

Corpus language	Humanities	Medicine	Science and Technology	Social Sciences
Estonian	6751	4035	9589	12 811
Latvian	1453	432	711	818
Lithuanian	12 952	7101	6683	15 058
Sum	21 156	11 568	16 983	28 687

Overall disciplines distribution across all languages:

Genres: Most frequent genres in corpus (parsed from original metadata).

Corpus language	BA thesis	MA thesis	PhD thesis	Other student work	Journal Articles	Proceedings	Yearbooks
Estonian	16 215	9380	1170	870	6199	134	733
Latvian	153	2	1634	0	1737	247	0
Lithuanian	4229	11 331	1882	0	25 894	0	0
Sum	20 597	20 713	4686	870	33 830	381	733

Overall genres distribution across all languages:

Publication year: Most frequent publication year for student work or journal article (automatically parsed).

The publication year was determined for the years 1900-2023. Due to automatic parsing there may be inaccuracies.

All documents

Estonian documents

Latvian documents

Lithuanian documents

Universities: Most frequent universities found for student work in each language.

For Estonian and Latvian, automatically extracted from file path in NextCloud, while Lithuanian was automatically parsed from the title page of theses. Due to automatic parsing there may be inaccuracies.

Estonian universities

Latvian universities

Lithuanian universities

Publications: Most frequent publications found for journal articles in each language. Automatically extracted from file path in NextCloud.

Estonian publications

Latvian publications

Lithuanian publications

Accessing Data in Sentences¶

To access the data in sentence form (where the original documents have been split into sentences) with metadata, choose project 8: Final corpora (sentences).

The data is separated by language into Estonian, Latvian and Lithuanian indices.

Corpus language	Index name	Number of documents
Estonian	year_documents_et_split2	27 816 000
Latvian	year_documents_lv_split	4 673 416
Lithuanian	year_documents_lt_split	26 471 018

To access these indices, see Projects and Indices.

Metadata information for sentences¶

Disciplines: Most frequent disciplines for documents split into sentences, automatically parsed from faculty names/journal names etc.

A few documents have two disciplines due to having two faculties. Some documents have no discipline, due to being unable to parse the necessary data from the document contents/metadata.

Due to automatic parsing there may be inaccuracies.

Corpus language	Humanities	Medicine	Science and Technology	Social Sciences
Estonian	4 905 206	1 751 448	6 473 543	13 855 932
Latvian	1 323 767	531 757	699 703	1 910 699
Lithuanian	5 254 793	4 695 374	4 750 700	10 599 823
Sum	11 483 766	6 978 579	11 923 946	26 366 454

Overall disciplines distribution across all languages:

Genres: Most frequent genres in corpus (parsed from original metadata).

Corpus language	BA thesis	MA thesis	PhD thesis	Other student work	Journal Articles	Proceedings	Yearbooks
Estonian	12 745 043	10 250 258	2 496 556	575 682	1 581 241	15 620	224 709
Latvian	222 962	3641	3 879 790	0	529 848	37 175	0
Lithuanian	3 486 868	11 596 723	3 951 484	0	7 435 943	0	0
Sum	16 454 873	21 850 622	10 327 830	575 682	9 547 032	52 795	224 709

Overall genres distribution across all languages:

Publication year: Most frequent publication year for student work or journal article (automatically parsed).

The publication year was determined for the years 1900-2023. Due to automatic parsing there may be inaccuracies.

All documents

Estonian documents

Latvian documents

Lithuanian documents

Universities: Most frequent universities found for student work in each language.

For Estonian and Latvian, automatically extracted from file path in NextCloud, while Lithuanian was automatically parsed from the title page of theses. Due to automatic parsing there may be inaccuracies.

Estonian universities

Latvian universities

Lithuanian universities

Publications: Most frequent publications found for journal articles in each language. Automatically extracted from file path in NextCloud.

Estonian publications

Latvian publications

Lithuanian publications

How to find the same document in full or in sentences¶

Let’s say you have a document or documents you are interested in. You can find out the documents’ meta.file values and use that for searching for the same document in a different form.

For example, for the demo project, 20 random documents from all languages have been extracted for testing purposes.

But what if you need those documents in sentences?

Consider creating a query for getting only the files you are interested in. In this case, we have 20 documents, which should not be too much.

We can aggregate the query by selecting the field meta.file to get all of the file names.

Press Aggregate to see the results.

Here are the results. Pay close attention to the items per page on the right. If you should have more than 20 results, then choose a larger number to display.

We can automatically select all of the filenames to put in a lexicon by ticking the uppermost box in the picture, next to the word Key.

Then press Add to lexicon.

A menu will appear where you can either overwrite an existing lexicon or create a new one. Choose the description/name and if you want to use the lexicon for a search later, then choose the word type as Positives used.

Navigate to the sentences version of the same index. In this example, we’re searching for Lithuanian documents, so we choose year_documents_lt_split2 to search for the same documents in sentences.

You can use the lexicon for a query and get the same documents that way. If you save the query, you can also get the same documents in a separate index by using the Reindexer.

Experiments and Tutorial projects¶

Metadiscourse Study Experiment¶

The metadiscourse study corpus is a sample of academic texts split into sentences, annotated with Estonian metadiscourse markers by Bwrite. This corpus is in project 1: Metadiscourse Study. It consists of indices for Estonian (est_metadiscourse_annotations, metadiscourse_annotated, metadiscourse_et_with_facts) and Latvian (metadiscourse_annotated_latvian) annotations.

Bwrite Raw Data Corpora¶

These are projects 3, 4 and 5. We separated the raw data by language for easier processing. The raw data corpora contains the first variant of the corpora as well as some documents that were too long (probably whole journal issues). There are also indices for fixing problems that went wrong while processing the original data.

Marleen’s Master Thesis Project¶

Project 6: Marleeni MA contained indices related to Marleen’s MA, now it only contains two regex taggers she used for finding documents for her MA.

Sentiment Experiments¶

This is project 9: Sentiment study. The sentiment experiments include Estonian indices used for lexicon- or LLM-based sentiment analysis.

Index name	Description
senti_test_et_2	20 test documents split into sentences
senti_test_et2_split_pos	20 test documents split into sentences with sentiment words from lexicons as facts
sentiment_experiment_ling2	200 test documents used for the LLM-based sentiment analysis (as full documents)
ery_docs	Additional 30 documents added to the LLM experiment later on (full documents)
et_llm_sentiment_test_wl	20 test documents with wordlists and LLM sentiment (as sentences)
et_llm_sentiment_test_2_2	200 test documents with LLM and wordlist sentiment (as paragraphs)
et_llm_sentiment_test_2_ery2	Additional 30 test documents with LLM and wordlist sentiment (as paragraphs)
et_llm_sentiment_test_final	et_llm_sentiment_test_2_2 and et_llm_sentiment_test_2_ery2 together (as paragraphs)
et_llm_sentiment_test_2_2_3_23 / et_llm_sentiment_test_2_2_5_19	Annotation indices for the 200 test documents (only positive and negative)
et_llm_sentiment_test_2_ery2_3_25 / et_llm_sentiment_test_2_ery2_5_25	Annotation indices for the additional 30 documents (only positive and negative)

Demo project¶

This is project 10. A tutorial project used for building this manual and meant for practicing Texta Toolkit features.

Index name	Corpus language	Description
estonian_documents_20	Estonian	20 random Estonian documents
estonian_sentences_20	Estonian	20 random Estonian documents as sentences, has extra metadata like Discipline and sentiment words
latvian_documents_20	Latvian	20 random Latvian documents
latvian_sentences_20	Latvian	20 random Latvian documents as sentences
lithuanian_documents_20	Lithuanian	20 random Lithuanian documents
lithuanian_sentences_20	Lithuanian	20 random Lithuanian documents as sentences
estonian_20_ba_ma	Estonian	sentences from BA and MA theses (from estonian_sentences_20)
ba_ma_custom_1 / ba_ma_custom_2	Estonian	sentences from BA and MA theses (from estonian_sentences_20) split custom
ba_ma_soc_hum_1 / ba_ma_soc_hum_2	Estonian	sentences from BA and MA theses (from estonian_sentences_20) split equal
ba_ma_soc_hum_org1 / ba_ma_soc_hum_org2	Estonian	sentences from BA and MA theses (from estonian_sentences_20) split original
estonian_20_ba_ma_3_21	Estonian	sentences from BA and MA theses (annotation index)

Metadata (Discipline) project¶

This is project 12. It was used to apply discipline metadata to the final indices. It contains a few queries for checking if the Discipline metadata is added to documents.