Large-Scale Text Analysis with R

Instructors

Location

4115P Undergraduate Library

Description

Text mining, the practice of using computational and statistical analysis on large collections of digitized text, is becoming an increasingly important way of extracting meaning from writing. Whether working on survey data, medical records, political speeches or even digitized collections of historical writing, we are now able to use the power of computational algorithms to extract patterns from vast quantities of textual data. This technique gives us information we could never access by simply reading the texts. But determining which patterns have meaning and which answer key questions about our data is a difficult task, both conceptually and methodologically, particularly for those who work in the humanities, who stand to benefit the most from these methods.

Large-Scale Text Analysis with R will provide an introduction to the methods of text mining using the open-source software environment “R”. In this course, we will explore the different methods through which text mining can be used to “read” text in new ways, including authorship attribution, sentiment analysis, genre studies and named entity extraction. At the same time, our focus will also be on the analysis and interpretation of our results. How do we formulate research questions and hypotheses about text that can be answered quantitatively? Which methods fit particular needs best? And how can we use the numerical output of a text analysis to explain features of the texts in ways that make sense to a wider audience?

While no programming experience is required, students should have basic computer skills and be familiar with their computer’s file system. Participants will be given sample corpora to use in class exercises, but some class time will be available for independent work, and participants are encouraged to bring their own text corpora and research questions so they may apply their newly learned skills to projects of their own.

From your instructor, Mark Algee-Hewitt:

First and foremost, a slightly early welcome to the Large-Scale Text Analysis with R course at HILT 2016. I’m very much looking forward to meeting you all next week and getting to know you and your work over the course of what will be a challenging, but hopefully rewarding, week. I know that for many of you, this material is quite new, while others of you have done some of this before. Keeping this in mind, I’ve done my best to design a schedule that will begin with the basics (new to some, review to others) and then, by the end of the week, will move from the essentials of text searches and word counts to some of the more “of the minute” methodologies. So, there is no prerequisite: I’ve tried to design everything so that no previous experience is necessary (you’ll pick up all the necessary skills as we go along) and, at the same time, so that everyone will find something over the course of the week that is both challenging and interesting.

In preparation, I’ve uploaded a zip archive of readings for the course, as well as some information sheets and a tentative syllabus for the week. You can find it at: https://www.dropbox.com/s/4yoxo8wajrvst7f/Course%20Material.zip?dl=0 The readings are primarily suggested readings (rather than required). I will be referencing some of them as we go through the material, and they should serve as a good reference point for any topics we cover that particularly interest you. The “R Commands” document, however, is a rather essential document that contains all of the functions that we’ll be using together: you should keep this close for reference.

Over the course of the week, I’ll have more things for you to download (some code and a number of texts for us to play with), but these should get you started. The only two things I would ask you to do before Monday are to download and install both the R software environment ( https://www.r-project.org/ ) and the RStudio client ( https://www.rstudio.com/products/rstudio/download/ ). Both are free, and having them up and working should make things move much more quickly the first couple of days. More information can be found in the syllabus. If you have any questions about or difficulties with installing these programs, please feel free to e-mail me.

As I indicated, I’ve tried to keep the syllabus flexible, with the idea that if some or many of you are interested in a particular method or technique, we can spend more time on it. Similarly, if you have any questions or concerns about the material, or if there is something not listed on the syllabus that you’d really like to cover, let me know in the next few days and I’d be happy to see if we can add it in.

Also, while I will be distributing corpora to the class to use as we go through each of the methods, if you have your own corpus of texts that you are working on, you would be welcome to use it.

Many thanks again for joining me at HILT. I’m looking forward to seeing you bright and early on Monday.

All the best,
Mark

Problems or Questions?

Please feel free to contact us!