Types of text mining and digital analysis of text

Below is a presentation of various text mining methods, such as searching via n-grams and use of collocation analysis.

Searching with n-grams

An n-gram is a sequence of one or more elements (“n number of elements”), generally words. We can gain insight into the pattern of words occurring over time.

N-grams can be drawn from text or from speech. The elements of an n-gram are generally words, but they may also be letters, phonemes or other units of natural language.

At word level, the Norwegian word «utenrikspolitikk» (en: “foreign policy”) is a unigram, meaning that it consists of one whole word. The Norwegian phrase «EUs utenrikspolitikk» (en: “the foreign policy of the EU”) is a bigram because it includes two whole words, while “EU and NATO” is a trigram on the same basis and so forth.
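
As a minimal sketch, assuming a text has already been split into words, n-grams at word level could be produced in Python as follows. The function name ngrams is chosen for illustration only and does not come from any particular tool.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Splitting on whitespace is a simplification; real tools use proper tokenisers.
tokens = "EU and NATO".split()

unigrams = ngrams(tokens, 1)  # three unigrams, one per word
bigrams = ngrams(tokens, 2)   # two overlapping bigrams
trigrams = ngrams(tokens, 3)  # the whole phrase as one trigram

print(unigrams)
print(bigrams)
print(trigrams)
```

Note that the n-grams overlap: a phrase of three words contains two bigrams, not one and a half.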

N-gram search services permit swift examination of patterns of word occurrences and phrases over time. We could examine the absolute occurrence of a word in a given corpus or the relative occurrence (the word’s frequency in relation to other words in the corpus).
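
The difference between absolute and relative occurrence can be sketched in Python. This is an illustration only, not how NB N-gram or Google Books Ngram Viewer are implemented. The mini-corpus, the function name bigram_frequencies and the choice to divide by the total number of bigrams for each year are assumptions made for the example.

```python
from collections import Counter

# Hypothetical mini-corpus: one short invented text per year.
corpus_by_year = {
    1985: "economic growth and economic policy",
    1995: "sustainable development and economic growth",
    2005: "sustainable development goals and sustainable development policy",
}


def bigram_frequencies(text, phrase):
    """Return (absolute, relative) frequency of a two-word phrase in a text."""
    tokens = text.lower().split()
    bigrams = list(zip(tokens, tokens[1:]))
    absolute = Counter(bigrams)[tuple(phrase.lower().split())]
    # Relative frequency: share of all bigrams in the text (an assumed definition).
    relative = absolute / len(bigrams) if bigrams else 0.0
    return absolute, relative


trend = {
    year: bigram_frequencies(text, "sustainable development")
    for year, text in corpus_by_year.items()
}

for year, (absolute, relative) in sorted(trend.items()):
    print(year, absolute, round(relative, 3))
```

Relative frequencies make years comparable even when the amount of text per year differs, which is why n-gram services typically plot them by default.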

For instance – when did Norwegian newspapers begin writing about «bærekraftig utvikling» (en: “sustainable development”)? This query can be answered by using the Norwegian National Library’s n-gram service, NB N-gram. Google Books also has an n-gram service called Google Books Ngram Viewer for searches in their corpus.

Figure: The Norwegian National Library’s n-gram service, showing the relative frequencies of “Sigrid Undset”, “Knut Hamsun”, “Camilla Collett” and “Amalie Skram”.
Figure: Google Books’ n-gram service, showing the relative frequencies of “Albert Einstein”, “Sherlock Holmes” and “Frankenstein”.

Concordance analysis

In corpus linguistics, text mining and digital text analysis, a concordance is a generated list of every occurrence of a given word in a digital corpus, each shown with its context (a certain number of words before and after the keyword). Concordances are also referred to as “keyword(s) in context” (KWIC).
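
A concordance of this kind can be sketched in a few lines of Python. The example below is a simplified illustration and not the method used by DH-Lab or Voyant Tools. The function name concordance, the window parameter and the sample sentence are assumptions made for the example.

```python
def concordance(tokens, keyword, window=3):
    """List each occurrence of keyword with `window` tokens of context on each side."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows


# Invented sample sentence, split on whitespace for simplicity.
text = "Democracy needs a free press and a free press needs democracy to survive"
tokens = text.split()

rows = concordance(tokens, "press", window=2)
for left, keyword, right in rows:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20}  [{keyword}]  {right}")
```

Lining the keywords up in a column is what makes recurring patterns to the left and right of the word easy to spot.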

The Norwegian National Library’s DH-Lab offers a jumpstart for concordance analysis of their collections through the Python application Jupyter Notebook.

Voyant Tools is an application that allows you to upload a corpus from your computer and, among other things, conduct concordance analyses.

The term concordance (lat. concordantia: agreement) refers to an alphabetical list of words in a text and where they are located. Concordances have a relatively long history – for instance, during the Middle Ages, monks would produce bible concordances. See also concordance in Merriam-Webster.

For more information, please refer to our presentation on tools and software bundles for text mining.

Collocation analysis

“Collocation” is a term used to describe words that are associated with one another, meaning that they often appear together. In corpus linguistics, text mining and digital text analysis, collocations are a statistical overview of words that have a relatively high co-occurrence with a particular keyword.

Collocation analysis is one entry point to studying changes in discourse over time. We might, for example, investigate the collocations of the word “democracy”. In the 1960s and 1970s, “democracy” might be found paired (collocated) with the word “dictatorship”, which might come as something of a surprise. In the 2000s, by contrast, “human rights” might collocate more strongly with “democracy”. This may imply a change in democratic discourse over time within the corpus we are analysing.
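
The counting step behind a collocation analysis can be sketched as follows. This is a minimal illustration that only counts neighbouring words. Actual tools usually go further and weigh these counts against how common each word is in the corpus as a whole. The function name collocates, the window size and the sample sentence are assumptions made for the example.

```python
from collections import Counter


def collocates(tokens, keyword, window=2):
    """Count the words found within `window` tokens of each occurrence of keyword."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


# Invented sample text, lowercased and split on whitespace for simplicity.
text = "democracy and human rights go together because democracy protects human rights"
tokens = text.split()

counts = collocates(tokens, "democracy", window=3)
for word, count in counts.most_common(3):
    print(word, count)
```

Running the same count on texts from different decades and comparing the top collocates is one simple way to approach the kind of change in discourse described above.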

The Norwegian National Library makes their digital collection readily available for collocation analysis through their DH-Lab, which may be done through the Jupyter Notebook application. Voyant Tools allows you to upload a corpus from your personal computer and have a look at what they term “collocates”. See more under our presentation of tools and software bundles for text mining.

Topic modelling

Topic modelling, sometimes referred to as “theme modelling”, is a method that enables analysis of words’ co-occurrence patterns in texts. Statistical calculations are performed by an algorithm, the output of which allows for grouping, or clustering, of words under the concept of a topic. Despite these clusters being nothing more than words grouped by statistical analysis, a researcher may glean interesting information about the thematic structure of texts through this method.

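There are several algorithms used in topic modelling. Latent Dirichlet Allocation (LDA) and BERTopic are two examples. Both present topics as groups of words, but they arrive at them in different ways: LDA is a probabilistic model based on word counts, while BERTopic clusters documents using embeddings from a language model. Their outputs are therefore comparable in form, but not necessarily identical in content.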

Voyant Tools (see information about Voyant Tools and how to get started with Voyant Tools) features an implementation of LDA, which requires relatively little previous experience with topic modelling to use. Unfortunately, the generated output may be quite difficult to understand without an intermediate understanding of topic modelling, so you are advised to read up on the subject in advance.

Topic modelling may also be done through Python and R/RStudio. Read more about Python with Jupyter Notebook and R/RStudio on our presentation of tools and software bundles for text analysis and how to get started with the tools.

Corpus comparison

Corpus comparison is a method for digital text analysis that consists of examining which words are overrepresented in a given part of the corpus as compared to a larger reference corpus. For example, one might be interested in discovering which word(s) are overrepresented in the public speeches of one politician compared to their peers of the same time period.
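
One common way to measure overrepresentation is the log-likelihood keyness statistic, sketched below in Python. The choice of this statistic is an assumption made for the example, since the text above does not prescribe a particular measure. The function names log_likelihood and overrepresented and the two sample texts are likewise hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in c target tokens and b times in d reference tokens."""
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    total = 0.0
    if a:  # a term with a zero count contributes nothing
        total += a * math.log(a / expected_target)
    if b:
        total += b * math.log(b / expected_reference)
    return 2 * total


def overrepresented(target_tokens, reference_tokens):
    """Rank words that are relatively more frequent in the target than in the reference."""
    target, reference = Counter(target_tokens), Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scores = {}
    for word, a in target.items():
        b = reference[word]
        if a / c > b / d:
            scores[word] = log_likelihood(a, b, c, d)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented sample texts: one speaker (target) against their peers (reference).
target_tokens = "we must cut taxes and cut spending".split()
reference_tokens = "we must invest in schools and we must invest in health".split()

ranked = overrepresented(target_tokens, reference_tokens)
for word, score in ranked:
    print(word, round(score, 2))
```

With realistic corpora the reference would be far larger than the target, and very rare words are usually filtered out before ranking.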

Corpus comparison may be performed through a tool called AntConc (see a step-by-step guide at The Programming Historian), Python and R/RStudio. You can read more about these tools under our presentation of tools and software bundles for text mining and how to get started with the tools.

Automatic name recognition

Automatic name recognition software, otherwise known as Named Entity Recognition (NER), enables identification of names of persons, products, places and such in texts. Through disambiguation techniques (needed if, for instance, a place name could refer to several geographic locations) we are able to identify and visualise the geographical areas actually referred to in a corpus.
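
The sketch below illustrates the idea with a gazetteer lookup, that is, matching the text against a list of known names. This is a deliberately simple rule-based approach and not the statistical models that NER software normally relies on. The gazetteer entries, the function name find_entities and the sample sentence are all invented for the example.

```python
# Hypothetical gazetteer: surface form -> candidate entities as (type, description).
GAZETTEER = {
    "Oslo": [("place", "Oslo, Norway")],
    "Paris": [("place", "Paris, France"), ("place", "Paris, Texas, USA")],
    "Sigrid Undset": [("person", "Sigrid Undset")],
}


def find_entities(text, gazetteer):
    """Return (name, candidates) for every gazetteer name found, in text order."""
    hits = []
    for name, candidates in gazetteer.items():
        start = text.find(name)
        while start != -1:
            hits.append((start, name, candidates))
            start = text.find(name, start + 1)
    hits.sort(key=lambda hit: hit[0])
    return [(name, candidates) for _, name, candidates in hits]


# Invented sample sentence.
text = "Letters about Sigrid Undset were sent from Oslo to Paris."
entities = find_entities(text, GAZETTEER)

# Names with more than one candidate still need disambiguation.
ambiguous = [name for name, candidates in entities if len(candidates) > 1]

for name, candidates in entities:
    print(name, candidates)
print("Needs disambiguation:", ambiguous)
```

A lookup like this finds the names but cannot decide which Paris is meant. That decision is the disambiguation step, which typically draws on the surrounding context.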

Python and R/RStudio are tools you can use for NER. An example of NER being used to construct name graphs is available in the Norwegian National Library’s DH-Lab. A video on analysis of locations in literature is available through the National Library’s archives in Norwegian. You can read more about these tools under our presentation of tools and software bundles for text mining and how to get started with the tools.

Sentiment analysis

Otherwise known as “opinion mining”, sentiment analysis describes automated methods for identifying affective states in data sets. This is done by systematically picking out expressions of subjective opinion and emotional evaluation in the material. Sentiment analysis is popular in marketing and advertising, and is also used to examine the tone of political communication, public debate and social media, as well as in studies of plot and genre in literary corpora. The range of potential applications is wide.

Digital sentiment analysis of text data uses word lists and data sets in which words and expressions are given a score based on their perceived emotional meaning. The quality of the results depends on the extent to which these lists and data sets are built from different types of text and, among other things, on how the words’ context is taken into account.
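
A word-list approach of this kind can be sketched in Python as follows. The lexicon, its scores, the negation words and the function name sentiment_score are all invented for the example and are not taken from the Norwegian data sets mentioned below. The negation rule is one very crude way of taking context into account.

```python
# Hypothetical sentiment lexicon: word -> score (positive or negative).
LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2}

# Hypothetical negation words that flip the score of the word directly after them.
NEGATIONS = {"not", "never"}


def sentiment_score(text):
    """Sum the lexicon scores of the words in text, flipping a score after a negation."""
    score = 0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word in NEGATIONS:
            negate = True
            continue
        value = LEXICON.get(word, 0)
        score += -value if negate else value
        negate = False  # a negation only affects the very next word
    return score


print(sentiment_score("The film was great"))
print(sentiment_score("The film was not good"))
print(sentiment_score("The film was not a good one"))
```

The last sentence shows the limits of such a simple rule: the negation is separated from “good” by one word, so the sentence is scored as positive. This is the kind of context problem that better lists, data sets and models try to handle.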

Scientists at the University of Oslo have developed data sets for sentiment analysis in Norwegian. See documentation and information on sentiment analysis at GitHub.

Such analysis can also be performed in Python and R/RStudio. You can read more about these tools under our presentation of tools and software bundles for text mining and how to get started with the tools.

Published Sep. 7, 2021 10:42 AM - Last modified June 28, 2022 12:27 PM