Manually Topic Modeling Panama Papers News Coverage

Created by Brandon Locke
Maintained by LEADR under the direction of Alice Lynn McMichael
Last Updated: 12/16/2017

Learning Goals

Demonstrate in simple language how topic modeling works

Gain first-hand experience in locating and recognizing topics within a text

Know strategies for analyzing corpora and investigating research questions with Topic Modeling Tool results

“Panama Papers” News Story

The “Panama Papers” were about 12 million leaked documents that shed light on people around the world who were hiding money in offshore entities through the Panamanian law firm Mossack Fonseca. News coverage in the immediate aftermath of the leak focused on a wide range of issues, from the technical aspects of a hack to the implication of various world leaders and celebrities who were exposed. In the weeks following the leak, I selected 34 articles about the issue and extracted their text content into text files (see the articles below).

Manual Topic Modeling Activity

Prior to class: Read Brett, Megan R. “Topic Modeling: A Basic Introduction.” Journal of Digital Humanities.

Slides are available here

Topic modeling is an algorithmic technique that finds “a recurring pattern of co-occurring words” in a corpus of text. It takes a word’s overall usage across the entire corpus into consideration, so words used often in many documents will not appear in every topic. For example, “Panama” will almost certainly be used at least once in every article; the average is about 6.3 times per article. However, “Panama” will only be included in an article’s keywords if it is used a significant number of times and has close usage relationships with other words.
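To make the point about corpus-wide usage concrete, here is a minimal Python sketch. It assumes a tiny invented corpus (not the 34 articles) and a hypothetical helper named word_spread. For each word it reports how many documents contain it and its average count per document, which is the kind of information that lets an algorithm see that a word like “panama” is everywhere and so says little about any one article.

```python
from collections import Counter

# Toy corpus invented for illustration; these are not the real articles.
docs = {
    "doc_a": "panama papers leak hack wordpress plugin hack server",
    "doc_b": "panama papers iceland prime minister resigned panama",
    "doc_c": "panama papers americans tax havens tax law",
}

def word_spread(docs):
    """Return {word: (number of documents containing it, mean count per document)}."""
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    vocab = set().union(*counts.values())
    spread = {}
    for word in vocab:
        doc_freq = sum(1 for c in counts.values() if word in c)
        mean_count = sum(c[word] for c in counts.values()) / len(counts)
        spread[word] = (doc_freq, mean_count)
    return spread

spread = word_spread(docs)
print("panama:", spread["panama"])  # appears in all three toy documents
print("hack:", spread["hack"])      # appears in only one toy document
```

A word that shows up in every document at a similar rate does little to separate one group of articles from another, while a word concentrated in a few documents is a better candidate for a topic.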

Today you’ll skim a few articles and try your best to imitate topic modeling algorithms. Your assigned articles can be found below.

Before you start reading your articles, take a look at the total word usage for all 34 articles (stop words removed). Once you have glanced at the list, read your first article. Write (or type) out words that seem unique to the article, meaning words that are used disproportionately more often in your article than in others in the corpus. Remember, you’re only looking at the words used, not concepts or themes that you’re able to discern from them. You’re imitating a computer algorithm, and algorithms are not particularly smart. Once you’ve finished your first article, do your second.

Do your words seem to make sense as one or more topics, or do they seem to be random?
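The manual exercise above can be roughly imitated in a few lines of Python. This is only a sketch under stated assumptions: the corpus is a small invented example, the text is assumed to already have stop words removed, and the function name distinctive_words and its min_count cutoff are hypothetical. It ranks words by the ratio of their rate in one article to their rate in the whole corpus, which is a simple stand-in for “used disproportionately more often,” not what a topic modeling algorithm actually computes.

```python
from collections import Counter

def distinctive_words(article, corpus, top_n=5, min_count=2):
    """Rank words by how much more often they occur in `article` than in `corpus`.

    `corpus` is assumed to include `article`, so every article word has a
    nonzero corpus count.
    """
    article_counts = Counter(article.lower().split())
    corpus_counts = Counter(w for doc in corpus for w in doc.lower().split())
    article_total = sum(article_counts.values())
    corpus_total = sum(corpus_counts.values())
    scores = {}
    for word, count in article_counts.items():
        if count < min_count:
            continue  # ignore words seen only once in the article
        article_rate = count / article_total
        corpus_rate = corpus_counts[word] / corpus_total
        scores[word] = article_rate / corpus_rate
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy corpus invented for illustration.
corpus = [
    "panama papers hack wordpress plugin hack wordpress server",
    "panama papers iceland prime minister resigned panama iceland",
    "panama papers americans tax havens tax law panama",
]
result = distinctive_words(corpus[0], corpus)
print(result)
```

In this toy corpus, “panama” never makes the list for the first article because it is spread across every document, while the words repeated only in that article do.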

Comparing and Analyzing Results

I used Topic Modeling Tool (TMT), an easy-to-use tool for using MALLET for topic modeling. TMT produces a set of CSV files and a set of HTML files with your output. Take a look at these results with 20 topics. (Remember, these aren’t labeled topics, they’re clusters of words that likely represent a topic.)

Are there any identifiable “topics” here? Are there any “topics” that don’t seem to make sense?
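For readers curious about what kind of computation produces clusters like these, the following is a toy sketch of a collapsed Gibbs sampler for LDA, a common topic modeling approach. It is not MALLET’s or TMT’s code. The function name toy_lda, the hyperparameter values (alpha, beta), the iteration count, and the sample documents are all assumptions chosen for illustration, and results on a corpus this small will not be meaningful.

```python
import random
from collections import Counter

def toy_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Return the top words of each topic after collapsed Gibbs sampling."""
    rng = random.Random(seed)
    tokens = [doc.lower().split() for doc in docs]
    vocab_size = len({w for doc in tokens for w in doc})
    # Start from a random topic assignment for every token.
    assign = [[rng.randrange(num_topics) for _ in doc] for doc in tokens]
    doc_topic = [Counter(a) for a in assign]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for doc, a in zip(tokens, assign):
        for w, t in zip(doc, a):
            topic_word[t][w] += 1
            topic_total[t] += 1
    for _ in range(iterations):
        for d, doc in enumerate(tokens):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                t = assign[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight each topic by how common it is in this document
                # and how strongly it is associated with this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return [[w for w, c in tw.most_common(5) if c > 0] for tw in topic_word]

# Toy documents invented for illustration.
sample_docs = [
    "hack wordpress plugin server hack breach",
    "wordpress breach plugin server security",
    "iceland minister resigned protest iceland",
    "minister resigned protest parliament iceland",
]
topics = toy_lda(sample_docs, num_topics=2)
for number, words in enumerate(topics):
    print(number, words)
```

As in the TMT output, each result is just an unlabeled list of words. Deciding whether a list amounts to a “topic” is left to the reader.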

If you click on one of the topics, you’ll see the list of documents ordered by how closely each document corresponds with the topic. The number in parentheses is the number of times words in the topic appear in the document. Now click on one of the text files. This will show you the full text of the file, along with the topics that align most closely with that document.
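The ordering described above can be approximated with a short Python sketch. It assumes a hypothetical function rank_documents, an invented list of topic words, and toy documents. It simply counts how many tokens in each document are topic words, so the tool’s own numbers, which come from its model, may differ.

```python
def rank_documents(topic_words, docs):
    """Return (name, hits) pairs, most topic-word occurrences first."""
    topic = set(topic_words)
    hits = {
        name: sum(1 for token in text.lower().split() if token in topic)
        for name, text in docs.items()
    }
    return sorted(hits.items(), key=lambda pair: pair[1], reverse=True)

# Toy topic and documents invented for illustration.
topic_words = ["hack", "wordpress", "plugin"]
toy_docs = {
    "doc_a": "hack wordpress plugin hack server",
    "doc_b": "iceland minister resigned hack",
    "doc_c": "tax havens law",
}
ranking = rank_documents(topic_words, toy_docs)
print(ranking)  # [('doc_a', 4), ('doc_b', 1), ('doc_c', 0)]
```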

Take a few minutes and explore these results. Click through the network of topics and documents and see if you can find any patterns. Look at the articles you read and see how closely your results matched the TMT results.

Once you’ve examined the results with 20 topics, take a look at the same articles run with 40 topics.

What differences do you see between 20 topics and 40 topics? Which set do you think is more useful to you? Why?

Article Assignments

Doc 01 Doc 01
Doc 03 Doc 03
Doc 05 Doc 05
Doc 07 Doc 07
Doc 10 Doc 10
Doc 11 Doc 11
Doc 12 Doc 12
Doc 21 Doc 21
Doc 22 Doc 22
Doc 28 Doc 28
Doc 29 Doc 29
Doc 32 Doc 32
Doc 34 Doc 34

All Articles

https://www.wordfence.com/blog/2016/04/panama-papers-wordpress-email-connection/
http://gizmodo.com/is-this-how-a-hacker-got-the-panama-papers-1769836788
https://www.wordfence.com/blog/2016/04/mossack-fonseca-breach-vulnerable-slider-revolution/
http://wptavern.com/outdated-and-vulnerable-wordpress-and-drupal-versions-may-have-contributed-to-the-panama-papers-breach
http://www.theguardian.com/politics/2016/apr/09/david-cameron-to-launch-local-election-campaign-as-panama-papers-row-rumbles-on
http://www.independent.co.uk/news/world/politics/the-not-completely-crazy-theory-that-russia-leaked-the-panama-papers-a6977271.html
http://www.independent.co.uk/news/uk/politics/panama-papers-david-cameron-to-announce-tax-taskforce-to-investigate-revelations-as-he-seeks-to-a6976891.html
http://www.bbc.com/news/world-europe-35975893
http://time.com/4283587/these-5-facts-explain-the-massive-political-fallout-from-the-panama-papers/
http://www.nytimes.com/2016/04/06/world/europe/panama-papers-iceland.html
http://observer.com/2016/04/panama-papers-reveal-clintons-kremlin-connection/
http://www.vox.com/2016/4/5/11370646/panama-papers-iceland-gunnlaugsson-resigned
http://latino.foxnews.com/latino/news/2016/04/08/americans-lucked-out-in-panama-papers-scandal-founder-didnt-like-taking-us/
http://www.nbcnews.com/business/business-news/why-few-americans-panama-papers-lawyer-doesn-t-want-them-n553116
http://www.nytimes.com/2016/04/08/world/europe/vladimir-putin-panama-papers-american-plot.html
http://www.theguardian.com/commentisfree/2016/apr/07/panama-papers-taxes-universal-basic-income-public-services
https://www.rawstory.com/2016/04/heres-why-so-few-americans-appear-in-the-panama-papers/
http://www.usatoday.com/story/news/2016/04/06/panama-papers-americans-with-past-financial-crimes/82704788/
http://bigstory.ap.org/article/ef45a25e39224a7989894334aa44bfd4/why-few-americans-panama-papers-lawyer-doesnt-want-them
http://fusion.net/story/287671/americans-panama-papers-trove/
http://www.politico.com/agenda/story/2016/04/the-panama-papers-where-are-the-americans-000083
http://www.nbcnews.com/storyline/panama-papers/why-are-americans-not-included-panama-papers-n551081
http://fortune.com/2016/04/09/bad-security-panama-papers/
http://www.forbes.com/sites/thomasbrewster/2016/04/05/panama-papers-amazon-encryption-epic-leak/#47b5058a1df5
http://www.bbc.com/news/world-latin-america-35975503
http://www.computerworld.com/article/3052218/security/the-massive-panama-papers-data-leak-explained.html
http://www.computerweekly.com/news/450280758/Panama-Papers-revealed-by-graph-database-visualisation-software
http://www.darkreading.com/vulnerabilities---threats/7-lessons-from-the-panama-papers-leak/d/d-id/1324976
http://www.wired.co.uk/news/archive/2016-04/06/panama-papers-mossack-fonseca-website-security-problems
http://www.wired.com/2016/04/security-week-panama-papers-law-firm-seriously-shoddy-security/
http://www.wired.com/2016/04/reporters-pulled-off-panama-papers-biggest-leak-whistleblower-history/
http://www.nationalreview.com/article/433696/national-security-agency-panama-papers-prove-its-worth?target=author&tid=1022644
http://thehackernews.com/2016/04/panama-paper-corruption.html
https://www.helpnetsecurity.com/2016/04/07/panama-papers-lax-security-practices/
