Workflow Tutorial

This tutorial shows what the workflow in Toolkit looks like. You are free to follow along if you wish.


New project or old?

Generally the workflow starts with creating a new project or adding to an existing project.

For testing purposes, you can make a new project or use the demo project created for this manual.

  1. Create a new project. See Creating a New Project for information on how to create a new project of your own.

     If you want to use the Demo project, jump to the next step and access 10: Demo project.

  2. To access the project, see Using a Project.

  3. Add an index into the project or access an existing index. You have two choices:

       1. Add a new index. See Adding Data with the Dataset Importer to add a completely new index from a jsonlines/CSV/Excel file (a minimal file-preparation sketch follows this list).

       2. Use an existing index. See Add New Indices to Project and Picking an index to look at in Search view for information on how to use an existing index in a new project and how to select the index. Demo project has information about the indices used in the demo project.
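If you are preparing your own file for the Dataset Importer, a jsonlines file is simply one JSON document per line. The sketch below shows one possible way to turn a CSV file into jsonlines with pandas before importing it; the file names are placeholders and the script is not part of Toolkit itself.

    # Convert a CSV file into a jsonlines file: one JSON document per line.
    # File names are placeholders; adjust them to your own data.
    import pandas as pd

    df = pd.read_csv("articles.csv")  # each row becomes one document
    df.to_json("articles.jsonl", orient="records", lines=True, force_ascii=False)

    # Quick sanity check: print the first document.
    with open("articles.jsonl", encoding="utf-8") as f:
        print(f.readline())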


Data validation

If you added new data, you might want to check if the data inside is correct.

Basic data validation can be done just by looking at the data and using different search methods to find data that seems interesting.

  1. Access the index you added to the project or an existing index.

  2. Manage the view with Search view options, Change documents per page, Navigating through documents, and Sorting the data based on fields.

  3. Use a Simple Search to look for some keywords. You can use Searcher Options to change the display.

  4. If you are feeling confident with simple search, then use Advanced Search to find interesting data. You can also try Searching with Metadata (a rough Elasticsearch-level sketch of both search types follows this list).

  5. Save all your found documents by using Search Queries to re-access pertinent data.

  6. Check whether Aggregations show anything interesting for a search query or a smaller index.
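For orientation, the sketch below shows roughly what a simple keyword search and a more constrained advanced-style search look like at the Elasticsearch level. It bypasses the Toolkit interface entirely and assumes direct access to the Elasticsearch instance behind it; the host, index name, field names and keyword are placeholders. In normal use the Search view does all of this for you.

    # Rough Elasticsearch-level equivalents of a simple keyword search and a
    # more constrained advanced-style search. Host, index and fields are
    # placeholders; this bypasses the Toolkit interface.
    import requests

    ES = "http://localhost:9200"
    INDEX = "demo_index"

    # Simple search: match a keyword in one text field.
    simple = {"query": {"match": {"text": "climate"}}, "size": 10}

    # Advanced-style search: combine several conditions.
    advanced = {
        "query": {
            "bool": {
                "must": [{"match": {"text": "climate"}}],
                "filter": [{"term": {"language": "en"}}],
            }
        },
        "size": 10,
    }

    for body in (simple, advanced):
        hits = requests.post(f"{ES}/{INDEX}/_search", json=body).json()["hits"]["hits"]
        print(len(hits), "documents returned")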


Create a new Index?

Do you want to add facts or train models based on the data you found?

Or perhaps you found some documents that are in the wrong language or are conference agendas instead of journal articles?

In these cases you should create a new index; in other cases you can just use a saved query. Use the Reindexer to create a copy, a subset or a merged index.
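Conceptually, what the Reindexer does is close to Elasticsearch's _reindex operation: documents matching a query are copied into a new index. The sketch below only illustrates that idea directly against Elasticsearch rather than through Toolkit; the host, index names and filter query are placeholders.

    # Copy a filtered subset of documents into a new index using the raw
    # Elasticsearch _reindex API. Host, index names and query are placeholders.
    import requests

    ES = "http://localhost:9200"

    body = {
        "source": {
            "index": "demo_index",
            # Keep only English documents in the new subset index.
            "query": {"term": {"language": "en"}},
        },
        "dest": {"index": "demo_index_en"},
    }

    resp = requests.post(f"{ES}/_reindex", json=body)
    print(resp.json())

Passing a list of index names as the source corresponds to merging several indices into one; in Toolkit itself all of this is done through the Reindexer.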


Processing

You can add extra linguistic information to the documents by processing them.

Choose what type of data you’d like to add (the processing tool is given in parentheses):

  • lemmas (MLP)

  • tokens (MLP/ES Analyzer)

  • stems (ES Analyzer)

  • detect language (MLP/Language Detector)

  • part-of-speech tags (MLP)

  • word features (MLP)

  • named entities (MLP)

Note

  1. Processing a whole index will add more fields to the data.

  2. Data (for example, a text field) that has already been processed with MLP or ES Analyzer (see, for example, the text_mlp fields in Advanced Search) doesn’t need to be processed again.

Tokenization

breaks documents into tokens, separating any punctuation. See ES Analyzer.

Lemmatization

transforms every word in the documents into its dictionary form. See MLP.

Language detection

determines the language of each selected document. See MLP or Language Detection.

Stemming

finds word stems in documents. See ES Analyzer.
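To make the effect of processing concrete, the sketch below shows the kind of extra fields a document might gain. The exact field names and layout depend on the tool used (MLP, ES Analyzer, Language Detector); the nested text_mlp field follows the example field name mentioned in the note above, and all values are invented.

    # Illustration only: the kind of extra fields processing can add to a document.
    original = {"text": "The studies were published in 2019."}

    processed = dict(original)
    processed["text_mlp"] = {
        "tokens": "The studies were published in 2019 .",       # tokenization
        "lemmas": "the study be publish in 2019 .",             # lemmatization
        "language": "en",                                       # language detection
        "pos_tags": "DET NOUN AUX VERB ADP NUM PUNCT",          # part-of-speech tags
    }

    print(processed)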


Adding Metadata

Note

Adding metadata will add Facts to the documents.

  • Adding metadata using queries

Have you found something interesting and maybe even saved a query? Use Search Query Taggers. For more complex patterns you can use the Regex Taggers (a rough sketch of regex-based fact tagging follows this list).

If the query seems to be missing some variants, use the Lexicons and Embeddings to find similar terms.

  • Adding metadata using clustering

If you haven’t found anything interesting yet, you can try to cluster the documents using the Topic Analyzer.

  • Adding metadata by editing

If you spot a document where the metadata is wrong, you can edit it, either individually or in bulk. To do this, see Editing and Deleting Facts (Facts Manager).
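To make the idea of fact-adding concrete, here is a rough sketch of what a regex-based tagger does: when a pattern matches, a fact is attached to the document. The pattern names, the document and the fact structure are simplified illustrations, not Toolkit's exact format.

    import re

    # Illustrative patterns; in Toolkit these would be Regex Taggers.
    PATTERNS = {
        "BOOSTER": re.compile(r"\b(clearly|obviously|certainly)\b", re.IGNORECASE),
        "HEDGE": re.compile(r"\b(perhaps|possibly|might)\b", re.IGNORECASE),
    }

    doc = {"text": "The results clearly show that funding might increase.", "facts": []}

    for fact_name, pattern in PATTERNS.items():
        for match in pattern.finditer(doc["text"]):
            # Simplified fact structure: name, matched string and character span.
            doc["facts"].append(
                {"fact": fact_name, "str_val": match.group(0), "spans": [match.span()]}
            )

    print(doc["facts"])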



Machine learning workflow

The prerequisites for machine learning are indices with clean and validated data. You should also have some facts to train on, and those facts should be manually assigned or at least manually validated (in some form, even if only for a small share of the total facts).

Which model to choose

Decide which task from the list below is closest to what you are trying to accomplish (a sketch of the training data each task type expects follows the list):

  • Data that can be neatly classified into two or more categories.

For example: applied science discipline or non-applied science discipline.

Another example: sentiment labels like neutral, positive, negative. You have quite a lot of data already classified or can classify it easily.

This is binary/multilabel classification: use BERT Taggers.

  • Trying to find useful words or phrases from documents/data, like a certain word class.

For example: a subclass of metadiscourse particles like headers or boosters.

Another example: trying to find named entities like people’s names, organizations, places etc.

You have a small to moderate amount of data that already has these facts added, or you can add them very easily using boosters, lexicons and queries.

For this, CRF is the best model to use; you can even train several different CRF models.

  • Trying to find pertinent keywords while discarding ones that are less frequent.

For example: some journal articles have keywords, but some don’t.

If we can train a keyword tagger based on the documents and keywords that exist, we can add keywords to other documents as well.

You have at least a medium amount of data, but more would be better.

This is keyword tagging: use Taggers, Tagger Groups or RaKUn Extractor.
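One way to see the difference between these three task types is the shape of the training data each one needs. The sketch below is purely illustrative; the documents and labels are invented.

    # Rough illustration of the training data each task type expects.

    # 1. Binary/multilabel classification (BERT Taggers): one label per document.
    classification = [
        ("Sentiment in academic book reviews ...", "applied"),
        ("A categorial grammar of Estonian ...", "non-applied"),
    ]

    # 2. Sequence tagging (CRF): one label per token, marking the interesting spans.
    sequence_tagging = [
        (["Clearly", ",", "the", "results", "hold", "."],
         ["BOOSTER", "O", "O", "O", "O", "O"]),
    ]

    # 3. Keyword tagging (Taggers, Tagger Groups, RaKUn Extractor): keywords per document.
    keyword_tagging = [
        ("This paper studies morphology and language change ...",
         ["morphology", "language change"]),
    ]

    print(len(classification), len(sequence_tagging), len(keyword_tagging))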


How to get the training facts

If you don’t have the training facts, go back to Adding Metadata and try out different variants to get the data you want.

It can also be a good idea to add some facts to a small subset of representative documents and try to see what methods can be used.

You can also try to train a model based on a very small amount of data, but the results will not be very good (as seen in the BERT and CRF examples of the manual).


Splitting Index Into Test and Train Indices

If you want to train a model and evaluate it fairly, it is a good idea to split your index into train and test sets. The index you’re splitting should contain the fact or facts that you want to train the model to detect.

The train index is the one you’ll train your model on, and the test index is used for evaluation. If most of your facts are added automatically and only some are added manually, it is a good idea to keep the manually added facts in the test set for a fair comparison.

The test set ratio should be around 15-20%.

If the index is representative of the overall dataset, try to split it so that both parts keep the original distribution of the fact you are training on; otherwise the results will suffer from a frequency mismatch. If the actual distribution in the larger dataset is different, you may have to aim for a different fact distribution instead. (See Splitting an Index.)
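Splitting an Index handles this inside Toolkit, but as an illustration the sketch below does a stratified 80/20 split with scikit-learn so that both parts keep an imbalanced fact distribution; the documents and labels are invented.

    # Stratified 80/20 split that preserves the fact distribution.
    from sklearn.model_selection import train_test_split

    docs = ["doc %d" % i for i in range(100)]
    labels = ["positive"] * 70 + ["negative"] * 30   # imbalanced fact distribution

    train_docs, test_docs, train_labels, test_labels = train_test_split(
        docs, labels, test_size=0.2, stratify=labels, random_state=42
    )

    # Both splits keep roughly the original 70/30 ratio.
    print(len(train_docs), len(test_docs))
    print(test_labels.count("positive"), test_labels.count("negative"))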


Creating and Evaluating a Model

  1. Choose your model (see Which model to choose).

  2. Train it on the train set. Training information can be found in each model’s description (BERT, CRF, Taggers).

  3. Apply the newly created model to the test set with a unique fact name/suffix.

  4. See Evaluating Model Results for evaluating the models based on the test set facts (a higher F1 score means a better model).

If a created model’s F1 score is over 80 (assuming the averaging function matches the task), it should be pretty good at the trained task. You can now use the model on some other data and expect it to get roughly 80% of the data correct.

If the created model’s F1 score is under 80, look at the confusion matrices to see whether something is wrong with the fact distribution or whether you simply need more examples for certain labels. More advanced users can also look at the epoch reports to recognize underfitting or overfitting and use that information to tweak the training parameters.
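Evaluating Model Results produces these numbers inside Toolkit, but for reference the sketch below shows how an F1 score and a confusion matrix can be computed with scikit-learn; the true labels and predictions are invented.

    # Compute F1 and a confusion matrix for a set of predictions.
    from sklearn.metrics import classification_report, confusion_matrix, f1_score

    y_true = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
    y_pred = ["pos", "neg", "neg", "neg", "pos", "pos", "pos", "neg"]

    print("F1 (macro):", f1_score(y_true, y_pred, average="macro"))
    print(confusion_matrix(y_true, y_pred, labels=["pos", "neg"]))
    print(classification_report(y_true, y_pred))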