This is our third blog in the “Text Analysis 101; A basic understanding for Business Users” series. The series is aimed at non-technical readers who would like to get a working understanding of the concepts behind Text Analysis. We try to keep the blogs as jargon-free as possible and the formulas to a minimum. Our first two blogs in the series focused on document classification using both supervised and unsupervised (clustering) methods. This week’s blog will focus on Topic Modelling. Topic Modelling is an unsupervised Machine Learning (ML) technique. This means that it does not require a training dataset of manually tagged documents from which to learn; it is capable of working directly with the documents in question.
What Topic Modelling is and why it is useful.
As the name suggests, Topic Modelling discovers the abstract topics that occur in a collection of documents. For example, assume that you work for a legal firm and have a large number of documents to consider as part of an eDiscovery process. As part of that process you attempt to identify the topics you are interested in and discard the topics you have no interest in. However, the volume of documents is usually large, and often you have no idea in advance which documents are relevant and which are not. Topic modelling enables the discovery of the high-level topics that exist in the target documents, and also the degree to which each topic is referred to in each document, i.e. the composition of topics for each document. If the documents are ordered chronologically, then topic modelling can also provide insight into how the topics evolve over time.
LDA - A model for “generating” documents
Latent Dirichlet Allocation (LDA) is the name given to a model commonly used for describing the generation of documents. There are a few basic things to understand about LDA:
- LDA views documents as if each document were a bag of words. Imagine taking the words in a document and pouring them into a bag. All of the word order and the grammar would be lost, but all of the words would still be present, i.e. if there are twelve occurrences of the word “the” in the document then there will be twelve “the”s in the bag.
- LDA also views documents as if they were “generated” by a mixture of topics i.e. a document might be generated from 50% sports, 20% finance and 30% gossip.
- LDA considers that any given topic will have a high probability of generating certain words and a low probability of generating other words. For example, the “Sports” topic will have a high probability of generating words like “football”, “basketball” and “baseball”, and a low probability of producing words like “kitten”, “puppy” and “orangutan”. The presence of certain words within a document will, therefore, give an indication of the topics which make up the document.
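To make the bag-of-words idea concrete, here is a minimal Python sketch. The function name `bag_of_words` and the simple punctuation stripping are assumptions made for illustration; real text analysis tools split text into words more carefully.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts for a document, ignoring word order and grammar."""
    # Assumption: lowercase everything and strip simple punctuation from each word.
    words = [w.strip('.,;:!?"\'()').lower() for w in text.split()]
    return Counter(w for w in words if w)

bag = bag_of_words("The cat sat on the mat. The dog sat too.")
print(bag["the"])  # 3
print(bag["sat"])  # 2
```

The order of the words is gone, but every occurrence is still counted, which is all that LDA looks at.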
So in summary, from the LDA view, documents are created by the following process.
- Choose the topics from which the document will be generated and the proportion of the document to come from each topic. For example, we could choose the three topics and proportions from above, i.e. 50% sports, 20% finance and 30% gossip.
- Generate appropriate words from the topics chosen in the proportions specified.
For example, if our document had 10 words and three topics in proportion 50% sports, 20% finance and 30% gossip, the LDA process might generate the following “bag of words” to make up the document.
baseball dollars fans playing Kardashian pays magazine chat stadium ball
The five sports words are baseball, fans, playing, stadium and ball; the two finance words are dollars and pays; and the three gossip words are Kardashian, magazine and chat.
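The generation process can be sketched in a few lines of Python. The word probabilities in `TOPICS` are made up for illustration, and the function name `generate_document` is hypothetical. Two simplifications are worth stating: the topic mix is fixed by hand here (the full LDA model also draws it at random), and because each word is drawn at random, a ten-word document will only roughly match the 50/20/30 split.

```python
import random

# Hypothetical topics: each maps a word to the probability of generating it.
TOPICS = {
    "sports": {"baseball": 0.3, "fans": 0.2, "playing": 0.2, "stadium": 0.15, "ball": 0.15},
    "finance": {"dollars": 0.5, "pays": 0.5},
    "gossip": {"Kardashian": 0.4, "magazine": 0.3, "chat": 0.3},
}

def generate_document(topic_mix, n_words, rng):
    """Build a bag of words: pick a topic for each word, then a word from that topic."""
    topic_names = list(topic_mix)
    topic_weights = [topic_mix[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        # Step 1: choose a topic according to the document's topic proportions.
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        # Step 2: choose a word according to that topic's word probabilities.
        vocab = TOPICS[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
doc = generate_document({"sports": 0.5, "finance": 0.2, "gossip": 0.3}, 10, rng)
print(" ".join(doc))
```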
Collapsed Gibbs Sampling
We know that LDA assumes documents are bags of words, composed in proportion from the topics that generated the words. Collapsed Gibbs Sampling tries to work backwards to figure out, first, the words that belong to each topic and, second, the topic proportions that make up each document. Below is an attempt to describe this method in simple terms.
- Keep a copy of each document for reference.
- Pour all of the words from each document into a bag. The bag will then contain every word from every document; some words will appear multiple times.
- Decide the number of topics (K) that you will divide the documents into and have a bowl for each topic.
- Randomly pour the words from the bag into the topic bowls putting an equal number in each bowl. At this point, we have a first guess at the makeup of words in each topic. It is a completely random guess so is not of any practical use yet. It needs to be improved. It is also a first guess at the topic makeup of each document i.e. you can count the number of words in each document that are from each topic to figure out the proportions of topics that make up the document.
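The random first guess can be sketched as follows. The names `random_initial_assignment` and `topic_proportions`, and the three toy documents, are assumptions for illustration. Each word occurrence is given a bowl (topic) number, with the bowls kept as equal in size as possible, and the per-document proportions are then just counts.

```python
import random
from collections import Counter

def random_initial_assignment(documents, k, rng):
    """Give every word occurrence a random bowl number, with equal-sized bowls."""
    total = sum(len(doc) for doc in documents)
    labels = [i % k for i in range(total)]  # as close to equal as possible
    rng.shuffle(labels)
    label_iter = iter(labels)
    # Keep the labels grouped by document so we know where each word came from.
    return [[next(label_iter) for _ in doc] for doc in documents]

def topic_proportions(doc_assignment, k):
    """Share of a document's words currently sitting in each bowl."""
    counts = Counter(doc_assignment)
    return [counts[t] / len(doc_assignment) for t in range(k)]

documents = [
    ["football", "stadium", "dollars", "bank"],
    ["kitten", "puppy", "football"],
    ["bank", "dollars", "kitten", "puppy", "stadium"],
]
assignment = random_initial_assignment(documents, 3, random.Random(0))
for doc_assignment in assignment:
    print(topic_proportions(doc_assignment, 3))
```

At this stage the printed proportions are meaningless, because the bowls were filled at random; they only become useful after the improvement step.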
Improving on the first random guess to find the topics.
The Collapsed Gibbs Sampling algorithm can work from this first random guess and, over many iterations, discover the topics. Below is a simplified description of how this is achieved. For each document in the document set, go through each word one by one and do the following:
- For each of our K topics:
  - Find the percentage of words in the document that were generated from this topic. This will give us an indication of how important the topic (as represented by our current guess of words in the bowl) is to the document, i.e. how much of the document came from the topic.
  - Find the percentage of the topic that came from this word across all documents. This will give us an indication of how important the word is to the topic.
  - Multiply the two percentages together. This will give an indication of how likely it is that the topic in question generated this word.
- Compare the answers to the multiplication from each topic and move the word to the bowl with the highest answer.
- Keep repeating this process over and over again until the words stop moving from bowl to bowl. At that point the topics will have converged into K distinct topics.
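The loop above can be sketched in Python as follows. This follows the simplified description, so each word moves to the bowl with the highest answer; strictly speaking, Collapsed Gibbs Sampling picks the new bowl at random, weighted by those answers. The function name `find_topics`, the `smoothing` constant (a small number added so that an empty count never rules a topic out completely) and the toy documents are all assumptions for illustration. The sketch also takes each word out of its bowl before scoring, as the sampler does, so that a word does not vote for its own current topic.

```python
import random

def find_topics(documents, k, max_sweeps=100, smoothing=0.1, seed=0):
    """Move each word to its best-scoring bowl until no word moves."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in documents for word in doc})

    # First random guess: a bowl (topic) number for every word occurrence.
    assignment = [[rng.randrange(k) for _ in doc] for doc in documents]

    # Running counts, so the two percentages are cheap to recompute.
    doc_topic = [[0] * k for _ in documents]  # words in document d from topic t
    topic_word = [{} for _ in range(k)]       # copies of each word in bowl t
    topic_size = [0] * k                      # total number of words in bowl t
    for d, doc in enumerate(documents):
        for word, t in zip(doc, assignment[d]):
            doc_topic[d][t] += 1
            topic_word[t][word] = topic_word[t].get(word, 0) + 1
            topic_size[t] += 1

    for _ in range(max_sweeps):
        moved = 0
        for d, doc in enumerate(documents):
            for i, word in enumerate(doc):
                old = assignment[d][i]
                # Take the word out of its bowl before scoring the bowls.
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_size[old] -= 1

                scores = []
                for t in range(k):
                    # How much of this document came from topic t?
                    share_of_doc = (doc_topic[d][t] + smoothing) / (len(doc) - 1 + k * smoothing)
                    # How much of topic t is made up of this word?
                    share_of_topic = (topic_word[t].get(word, 0) + smoothing) / (
                        topic_size[t] + vocab_size * smoothing
                    )
                    scores.append(share_of_doc * share_of_topic)

                # Keep the current bowl on a tie so words do not move needlessly.
                best = max(range(k), key=lambda t: scores[t])
                new = old if scores[old] == scores[best] else best

                assignment[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] = topic_word[new].get(word, 0) + 1
                topic_size[new] += 1
                moved += new != old
        if moved == 0:  # the words have stopped moving
            break
    return assignment

docs = [
    ["football", "stadium", "football", "ball"],
    ["dollars", "bank", "dollars", "pays"],
    ["football", "ball", "dollars", "stadium"],
]
topics = find_topics(docs, k=2)
for doc, doc_topics in zip(docs, topics):
    print(list(zip(doc, doc_topics)))
```

Each printed pair shows a word and the bowl it ended up in. With so few words the result depends heavily on the random starting guess, so treat this as an illustration of the mechanics rather than a usable topic model.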
At this point we have the words that make up each topic, so we can assign a label to the topic, e.g. if the topic contains the words dog, cat, tiger and buffalo, we would assign the label “Animals” to the topic. Now that we have the words in each topic, we can analyse each document or “bag of words” to see what proportion of each topic it was generated from.
We now have the words which make up each topic, a label for each topic, and the topics and proportions within each document, and that’s pretty much it.
There are two blogs that we used as part of our research which you might want to take a look at: The LDA Buffet by Matthew L Jockers and An Introduction to LDA by Edwin Chen. Keep an eye out for more in our “Text Analysis 101” series.