Assignment 3 Text Analysis of an Unknown Corpus

Communicating the outcomes of analysis of an unknown corpus of documents. Developing and comparing outcomes from ‘bag of words’, TFIDF, clustering, LSA, and LDA / topic models

Author
Joshua McCarthy
Published
Tue, Jun 16 2020
Last Updated
Tue, Jun 16 2020

Context

A fun scenario for an interesting assignment utilising text analysis techniques, and a return to the continually evolving DAM report template.

Buried deep in your manager’s computer is a directory helpfully named “docs.” He has no recollection of what they’re about and is disinclined to wade through them himself. Knowing of your interest in data analysis, he sends you the folder with the challenge to “provide insights into the contents and themes of the documents in the directory” (words excerpted verbatim from his email). As it happens, last weekend you learnt a bunch of techniques that could help you do just that. So you accept the challenge, pretty confident that you can deliver. As you are about to start your analysis, you realise that there could be more at stake here than just a one-off data challenge. If you do a good job, your manager may even keep you in mind when he discusses that new data science role with the CIO. The task thus presents an opportunity to highlight not just your analytical skills but also your ability to weave a narrative around your findings, supported by appropriate visualisations.

Report

The assignment was challenging: some of the information was hard to represent visually without overwhelming the reader.

Page 1, User needs and solutions text mining, topic modelling, application of outcomes

Page 2, Wordcloud with bag of words, issues with bag of words, word relationships, doc word counts
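
As a rough illustration of the bag of words step on this page, here is a minimal sketch in Python. The two documents, the `bag_of_words` helper and the tokenising rule are all assumptions made for the example; they are not the corpus or the code behind the report.

```python
import re
from collections import Counter

# Hypothetical stand-in documents; the real corpus is not reproduced here.
docs = {
    "Doc01.txt": "Risk management is about managing risk in projects.",
    "Doc02.txt": "Topic models find themes in documents.",
}

def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# One bag (word -> count) per document.
bags = {name: bag_of_words(text) for name, text in docs.items()}

# Corpus-wide counts: the numbers a word cloud would size its words by.
corpus_counts = sum(bags.values(), Counter())

# Word count per document.
doc_lengths = {name: sum(bag.values()) for name, bag in bags.items()}
```

Even this toy example shows two of the usual issues with bag of words: a filler word such as "in" ties for the top corpus count, and "management" and "managing" are counted as unrelated words.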

Page 3, Wordclouds after custom stopwords and TFIDF application, highlights a strange observation that ‘stupid’ and ‘elephant’ have high values in TFIDF
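
One possible reason an unexpected word can score highly is that TFIDF rewards words concentrated in few documents. The sketch below uses the plain textbook weighting (term frequency times log of inverse document frequency) on made-up token lists; the documents, the `tfidf` helper and the exact formula are assumptions for illustration, and libraries often apply smoothed variants.

```python
import math
from collections import Counter

# Hypothetical tokenised documents, with stopwords already removed.
docs = {
    "Doc01.txt": ["risk", "project", "risk", "elephant"],
    "Doc02.txt": ["risk", "project", "model"],
    "Doc03.txt": ["project", "model", "topic"],
}

def tfidf(docs):
    """Return {doc: {term: tf * idf}}, with tf = count / doc length and idf = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        scores[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

scores = tfidf(docs)
```

In this toy corpus "elephant" appears once in a single document and still outscores "risk", which appears twice in Doc01.txt but also in Doc02.txt, while "project" scores zero because it appears everywhere.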

Page 4, Grouping similar documents through clustering using cosine distance, with both refined and TFIDF data, network graph of TFIDF connections
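
To make the idea of grouping by cosine distance concrete, here is a small sketch. The term counts are invented, and the threshold-based single-linkage grouping in `cluster` is a simple stand-in chosen for the example, not necessarily the clustering method used in the report.

```python
import math
from collections import Counter

# Hypothetical term counts per document (a real run would use the refined or TFIDF weights).
vectors = {
    "Doc01.txt": Counter({"risk": 3, "project": 2}),
    "Doc02.txt": Counter({"risk": 2, "project": 1, "decision": 1}),
    "Doc03.txt": Counter({"topic": 3, "model": 2}),
    "Doc04.txt": Counter({"topic": 1, "model": 2, "cluster": 1}),
}

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical direction, 1 for no shared terms."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

def cluster(vectors, threshold):
    """Single-linkage grouping: merge groups linked by any pair closer than threshold."""
    clusters = [{name} for name in vectors]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(cosine_distance(vectors[a], vectors[b]) < threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

groups = cluster(vectors, threshold=0.5)
```

Cosine distance ignores document length and compares only the mix of terms, which is why it suits documents of very different sizes. The threshold of 0.5 is arbitrary here and would need tuning on real data.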

Page 5, Network graph of LSA output showing 3 distinct communities connected to each other through one or two nodes, and a single unconnected node, similarity of result with clustering
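
For readers unfamiliar with LSA, the usual formulation is a truncated singular value decomposition of the term-document matrix. The notation below is the standard textbook form; the symbols and the similarity threshold $\tau$ used to draw graph edges are assumptions for illustration, not values taken from the report.

```latex
% X: term-document matrix (m terms by n documents), k: number of latent dimensions kept
X \approx X_k = U_k \Sigma_k V_k^{\top},
\qquad U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}

% Document j is represented by the j-th row of V_k \Sigma_k
d_j = \left( V_k \Sigma_k \right)_{j,:}

% Two documents are joined by an edge when their cosine similarity exceeds \tau
\operatorname{sim}(d_i, d_j) = \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert} > \tau
```

Under this reading, a community in the graph is a set of documents that sit close together in the reduced space, and an unconnected node is a document whose similarity to every other document falls below the threshold.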

Page 6, Wordclouds to investigate the LSA communities and connection points for relationships, note that the documents relate to the techniques being used

Page 7, Topic modelling and LDA, probabilistic coherence graph to identify likely number of topics, comparison between clustering activity results

Page 8, Defining overall themes, which include broad topics, assigning them to documents in a matrix

Page 9, Looking for more granular subjects and assigning them to documents in a matrix

Page 10, Outcomes of further investigation by searching Google for quotes from the docs, found a blog that included the categorised docs

Page 11, Outcomes, named and dated files, index of documents with old and new filenames, link, theme, subject and categories, value of docs, what was missed

Example
File Name: 2018-02-05_title-title-title
Original File: Doc01.txt
Link: https://www.linktodoc.notareallink
Theme: Organisations, Decision Making
Subject: strategic problem solving, risk management
Category Tags: Organizations, Paper Review, Risk analysis, sensemaking, Wicked Problems
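
The naming pattern in the example record (a date, an underscore, then a hyphenated lowercase title) could be generated along these lines. The `new_file_name` helper, the slug rule and the dictionary keys are hypothetical and only mirror the pattern shown above; the report does not say how the files were actually renamed.

```python
import re
from datetime import date

def new_file_name(published, title):
    """Build '<ISO date>_<hyphenated-lowercase-title>' from a date and a title."""
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{published.isoformat()}_{slug}"

# Hypothetical index entry linking the old file name to the new one.
entry = {
    "original_file": "Doc01.txt",
    "new_file": new_file_name(date(2018, 2, 5), "Title Title Title"),
}
```

Putting the ISO date first means the renamed files sort chronologically in any file browser.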