Manually Topic Modeling Panama Papers News Coverage
Created by Brandon Locke
- Demonstrate in simple language how topic modeling works
- Gain first-hand experience in locating and recognizing topics within a text
- Know strategies for analyzing corpora and investigating research questions with Topic Modeling Tool results
“Panama Papers” News Story
The “Panama Papers” were roughly 11.5 million leaked documents that exposed how people around the world hid money in offshore entities set up through the Panamanian firm Mossack Fonseca. News coverage in the immediate aftermath of the leak ranged widely, from the technical aspects of the hack to the implication of various world leaders and celebrities named in the documents. In the weeks following the leak, I selected 34 articles about the story and extracted their text into plain text files (see the article list below).
Manual Topic Modeling Activity
Prior to class: Read Brett, Megan R. “Topic Modeling: A Basic Introduction.” Journal of Digital Humanities 2, no. 1 (2012).
Slides are available here
Topic modeling is a family of algorithms that find “a recurring pattern of co-occurring words” in a corpus of texts. These algorithms take a word’s overall usage across the entire corpus into consideration, so a word used often in many documents will not automatically appear in every topic. For example, “Panama” is almost certainly used at least once in every article (it averages about 6.3 uses per article). However, “Panama” will only be included among an article’s keywords if it is used significantly more often there than elsewhere and has close usage relationships with other words.
Today you’ll skim a few articles and try your best to imitate topic modeling algorithms. Your assigned articles can be found below.
Before you start reading your articles, take a look at the total word usage for all 34 articles (stop words removed). Once you have glanced at the list, read your first article and write (or type) out words that seem unique to it: words used disproportionately more often in your article than in the rest of the corpus. Remember, you’re looking only at the words used, not concepts or themes you’re able to discern from them. You’re imitating a computer algorithm, and algorithms are not particularly smart. Once you’ve finished your first article, do your second.
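The comparison you are doing by hand can be sketched in a few lines of Python: count how often each word appears in one article versus the whole corpus, and flag words whose relative frequency in the article is disproportionately high. The two “articles” and the threshold here are made-up stand-ins, not the real files or any algorithm TMT uses.

```python
from collections import Counter

# Toy stand-ins for two articles from the corpus (not the real texts).
corpus = {
    "doc01": "panama leak offshore bank fonseca hack server email panama",
    "doc03": "panama iceland minister resign protest election panama iceland",
}

def frequencies(text):
    """Relative frequency of each word in a text (stop words assumed removed)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Corpus-wide relative frequencies, pooling all articles together.
corpus_freq = frequencies(" ".join(corpus.values()))

def distinctive_words(doc_id, threshold=1.5):
    """Words used disproportionately often in one article relative to the corpus."""
    doc_freq = frequencies(corpus[doc_id])
    return sorted(w for w, f in doc_freq.items() if f / corpus_freq[w] >= threshold)

# "panama" is frequent everywhere, so its ratio is near 1 and it drops out;
# words concentrated in doc03, like "iceland", stand out.
print(distinctive_words("doc03"))
# ['election', 'iceland', 'minister', 'protest', 'resign']
```

Note how “panama” never makes the list even though it is the most frequent word in both articles: its usage is not disproportionate anywhere, which is the same reason corpus-wide words rarely dominate a topic.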
Do your words seem to make sense as one or more topics, or do they seem to be random?
Comparing and Analyzing Results
I used Topic Modeling Tool (TMT), an easy-to-use graphical interface to MALLET for topic modeling. TMT produces a set of CSV files and a set of HTML files with your output. Take a look at these results with 20 topics. (Remember, these aren’t labeled topics; they’re clusters of words that likely represent a topic.)
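If you want to work with the CSV output directly rather than clicking through the HTML pages, Python’s standard `csv` module is enough. The file layout below (a `topic` column plus a space-separated `words` column) is an assumption about the shape of a TMT topic-words table, and the sample rows are invented; check the files in your own output folder for the real names and columns.

```python
import csv
import io

# Made-up sample in the assumed shape of a TMT topic-words table.
# A real run would open the CSV file from TMT's output folder instead.
sample = """\
topic,words
0,panama fonseca offshore shell company law
1,iceland minister gunnlaugsson resign protest
"""

# Map each topic number to its list of keywords.
topics = {}
for row in csv.DictReader(io.StringIO(sample)):
    topics[int(row["topic"])] = row["words"].split()

print(topics[1][:3])
# ['iceland', 'minister', 'gunnlaugsson']
```

Once the table is in a plain dictionary like this, it is easy to compare runs, count shared keywords between topics, or merge the topic labels with your own notes.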
Are there any identifiable “topics” here? Are there any “topics” that don’t seem to make sense?
If you click on one of the topics, you’ll see the list of documents ordered by how closely each document corresponds with the topic. The number in parentheses is the number of times words in the topic appear in the document. Now click on one of the text files. This will show you the full text of the file, along with the topics most closely associated with that document.
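The ordering you see on a topic page amounts to a simple sort: tally how many times the topic’s words appear in each document, then rank documents by that tally. The counts below are hypothetical, invented just to show the mechanics.

```python
# Hypothetical counts of how often one topic's words appear in each document
# (the number TMT shows in parentheses).
topic_word_counts = {
    "Doc 03": 41,
    "Doc 10": 12,
    "Doc 29": 27,
}

# Documents ordered by how closely they correspond to the topic, highest first.
ranking = sorted(topic_word_counts.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
# [('Doc 03', 41), ('Doc 29', 27), ('Doc 10', 12)]
```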
Take a few minutes and explore these results. Click through the network of topics and documents and see if you can find any patterns. Look at the articles you read and see how closely your results matched the TMT results.
Once you’ve examined the results with 20 topics, take a look at the same articles run with 40 topics.
What differences do you see between 20 topics and 40 topics? Which set do you think is more useful to you? Why?
- Doc 01 Jake
- Doc 03 Jake
- Doc 05 Asia
- Doc 07 Asia
- Doc 10 Sarah
- Doc 11 Sarah
- Doc 12 Khalil
- Doc 21 Khalil
- Doc 22 Nicole
- Doc 28 Nicole
- Doc 29 Deanna
- Doc 32 Deanna
- Doc 34 Brandon