Text Analytics

## Introduction

This module will provide an introduction to methods for analyzing text. Text analytics is a complicated topic, covering a wide range of problems and solutions. For example, consider the following simple questions. How should text data be structured? What types of analysis are appropriate for text data? How can these analyses be performed? How should results from text analytics be presented to a viewer?

We will focus on simple approaches to address each of these questions. In particular, we will discuss different ways to represent text, and present the use of term frequency to structure a document's text. We will then present algorithms that use this approach to measure the similarity between text, to cluster text into topics, to estimate the sentiment in text, and to present text and text analytics results to a viewer.

### Why Text Analytics?

Much of the data now being captured and stored is semi-structured or unstructured: it is not being saved in a structured format, for example, in a database or data repository with a formal framework. Text data often makes up a large part of these unstructured collections: email, web pages, news articles, books, social network postings, and so on. Text analytics and text mining are meant to be applied to this type of data to

"...find interesting regularities in large textual datasets..." (Fayad)
"...find semantic and abstract information from the surface form of textual data..." (Grobelnik & Mladenic)

where interesting means non-trivial, hidden, previously unknown, and potentially useful.

Text analytics has many potential applications. These can be discovery driven, where we try to answer specific questions through text mining, or data driven, where we take a large collection of text data and try to derive useful information by analyzing it.

There are many reasons why analyzing text is difficult.

• Abstract. The concepts contained in text are often difficult to identify and represent (e.g., sentiment or sarcasm, "I loved the way you put that. No, really, I LOVED it.")
• Relationships. Subtle, complex relationships between concepts must be extracted (e.g., negation, "I thought I'd enjoy that movie, but in the end it just didn't work out the way I expected.")
• Scale. A document can contain many thousands of words, and a document collection can contain thousands or hundreds of thousands of documents.

### Assignment

Each homework group will complete a text analytics assignment. This involves choosing a problem of interest to address, identifying and collecting appropriate text data to support the problem, then using text analysis techniques discussed during this module to analyze your data and form conclusions and results that will be presented in class.

Oct. 5, 10am
Submit a draft proposal to Andrea that describes the problem you will investigate. The proposal must include the following items.
• A problem statement explaining the issue you want to study and the goals you plan to achieve.
• A description of the dataset(s) you will use, detailing either where the data is available, or how it will be collected. Provide enough specificity for us to confirm that the data will be available in a timely manner.
• A description of the types of analysis you plan to apply to your datasets to achieve the goals listed in your problem statement.
• A detailed justification of why the data you plan to collect will support the problem and analysis you propose to complete. Projects that we do not feel have a high probability of success will be returned and will need to be improved or replaced with a different proposal.
• A list of the results you will present at the end of your analysis.
The draft proposal should be approximately one page in length. We have provided an example draft proposal to give you an idea of its length, content, and format.
Oct. 10, 5pm
Submit a revised proposal through Moodle that addresses comments and/or concerns made by the instructors on your draft proposal. The revised proposal represents a template for what we expect to see during your final presentation.
Oct. 25, 8:30am–12:00pm (orange), 1:00pm–4:30pm (blue)
Present your project and its results to the class. Each group will be allotted 10 minutes: 8 minutes to present, and 2 minutes to answer one or two questions. Because of our tight schedule, each group must provide their slides to Andrea by 12pm on Oct. 24. This will allow us to load the presentations onto the classroom laptop and eliminate setup time between groups. This also means each group will be strictly limited to 10 minutes for their presentations (i.e., 4–5 slides for the average presentation). Please plan your presentations accordingly. For example, consider having only 1 or 2 group members present your slides, then have the entire team available for questions at the end.

As an example of potential sources of text data, Dr. Xie has made available the proposed text sources from the 2011 Analytics class.

### NLTK and Gensim

Throughout this module we will provide code examples in Python using the Natural Language Toolkit (NLTK). NLTK is designed to support natural language processing and analysis of human language data. It includes the ability to perform many different language processing operations, including all of the text analytics techniques we will be discussing.

For techniques beyond the scope of NLTK, we will provide Python examples that use Gensim, a more sophisticated text analysis package that includes the text similarity algorithms we will discuss during the module.

### Tweet Capture Tool

Capturing data from a social network site like Twitter or Facebook normally requires using a programming language (e.g., Python) to talk to the application programming interface (API) the site provides to allow you to access its data.

In order to simplify this process, we are providing a Tweet Capture tool that requests tweets by keyword from Twitter's real-time tweet stream. To use the tool, you will need to register for application credentials from the Twitter Developer site. Instructions for how to request and use your keys are included in the Help section of the program.

## Text Representations

There are many different ways to represent text. We describe some of the most common approaches, discussing briefly their advantages and disadvantages.

### Character

The simplest representation views text as an ordered collection of characters. A text document is described by the frequency of sequences of characters of different lengths. For example, consider the following text data.

To be or not to be

This could be represented by a set of single character frequencies fq (1-grams), a set of two-character frequencies (2-grams), and so on.

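| 1-gram | b | e | n | o | r | t |
|---|---|---|---|---|---|---|
| fq | 2 | 2 | 1 | 4 | 1 | 3 |

| 2-gram | be | no | or | ot | to |
|---|---|---|---|---|---|
| fq | 2 | 1 | 1 | 1 | 2 |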

Character representations have the advantage of being simple and fairly unambiguous, since they avoid many common language issues (homonyms, synonyms, etc.). They also allow us to compress large amounts of text into a relatively compact representation. Unfortunately, character representations provide almost no access to the semantic properties of text data, making them a poor representation for analyzing meaning in the data.
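As a rough illustration of the character representation, the sketch below counts n-character sequences for the example sentence. It assumes, as the frequency tables above appear to, that case is ignored and that sequences do not span the spaces between words. The function name `char_ngrams` is hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count n-character sequences inside each word, ignoring case."""
    counts = Counter()
    for word in text.lower().split():
        # Slide a window of width n across the word
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts

text = "To be or not to be"
print(sorted(char_ngrams(text, 1).items()))  # single character frequencies
print(sorted(char_ngrams(text, 2).items()))  # two-character frequencies
```

Larger values of `n` work the same way, although the number of distinct sequences grows quickly with `n`.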

### Word

A very common representation of text is to convert a document into individual words or terms. Our example sentence represented as words would look something like this.

| term | be | not | or | to |
|---|---|---|---|---|
| fq | 2 | 1 | 1 | 2 |

Words strike a good balance between simplicity and semantics. At this level various ambiguities begin to arise, however.

• Homonym. Words with identical form but different meaning (e.g., lie: to recline versus lie: not telling the truth; or bass: a musical instrument versus bass: a fish).
• Synonym. Words with different form but the same or similar meaning (e.g., ambiguous, confusing, opaque, uncertain, unclear, vague).
• Polysemy. Words with the same form and multiple related meanings (e.g., bank: a financial institution and bank: to rely upon, as in "You can bank on me").
• Hyponym. Words that are a semantic subclass of another word, forming a type-of relationship (e.g. salmon is a hyponym of fish).

It's also the case that some words may be more useful than others, due to their commonality. Suppose we're trying to determine the similarity between the text of two documents that discuss the financial crisis. Would comparing the frequency of a word like "an" be as useful as comparing the frequency of a word like "investment" or "collapse"?

### Phrase

Combining words together forms phrases, which are often called n-grams when n words are combined. Phrases may be contiguous, "Mary switches her table lamp off" ⇒ "table lamp off", or non-contiguous, "Mary switches her table lamp off" ⇒ { lamp, off, switches }. An important advantage of phrase representations is that they often give a better sense of the semantics of the individual words in a phrase (e.g., "lie down" versus "lie shamelessly").
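A minimal sketch of extracting contiguous word n-grams from the example sentence follows. The helper name `word_ngrams` and the whitespace tokenization are assumptions made for illustration. Non-contiguous phrases are treated here simply as an unordered set of terms.

```python
def word_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Mary switches her table lamp off".lower().split()

# Contiguous phrases: every 3-word window in the sentence
trigrams = word_ngrams(tokens, 3)
print(trigrams)

# Non-contiguous phrase: check that all of its terms occur in the sentence
print({"lamp", "off", "switches"} <= set(tokens))
```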

Google has published a number of online applications and datasets related to n-grams. One example is the Ngram Viewer which allows you to compare the occurrence of n-grams in books Google has indexed over a range of years. The Ngram Viewer allows not only explicit phrase searches, but also searches that include wildcards, inflection, or part-of-speech. They have also made their underlying n-gram datasets available to interested researchers.

### Part of Speech

Words can be further enriched by performing part-of-speech tagging. Common parts of speech in English include nouns, verbs, adjectives, adverbs, and so on. Part-of-speech tagging is often used to filter a document, allowing us to restrict analysis to "information rich" parts of the text like nouns and verbs or noun phrases. The Cognitive Computation Group at UIUC provides a comprehensive part-of-speech tagger as a web-based application.

### WordNet

WordNet is a lexical database of English words. In its simplest form, WordNet contains four databases of nouns, verbs, adjectives, and adverbs. Like a thesaurus, WordNet groups synonymous words into synsets, sets of words with similar meanings. Unlike a thesaurus, however, WordNet forms conceptual relations between synsets based on semantics. For example, for the noun database the following relations are defined.

| Relation | Explanation | Example |
|---|---|---|
| hyponym | From lower to higher level type-of concept, X is a hyponym of Y if X is a type of Y | dalmatian is a hyponym of dog |
| hypernym | From higher to lower level subordinate concept, X is a hypernym of Y if Y is a type of X | canine is a hypernym of dog |
| meronym | Has-member concept from group to members, X is a meronym of Y if Xs are members of Y | professor is a meronym of faculty |
| holonym | Is-member concept from members to group, X is a holonym of Y if Ys are members of X | grapevine is a holonym of grape |
| part meronym | Has-part concept from composite to part | leg is a part meronym of table |
| part holonym | Is-part concept from part to composite | human is a part holonym of foot |

WordNet also includes general to specific troponym relations between verbs: communicate–talk–whisper, move–jog–run, or like–love–idolize; and antonym relations between verbs and adjectives: wet–dry, young–old or like–hate. Finally, WordNet provides a brief explanation (or gloss) and example sentences for each of its synsets.

WordNet is extremely useful for performing tasks like part-of-speech tagging or word disambiguation. WordNet's databases can be searched through an online web application. The databases can also be downloaded for use by researchers and practitioners.

### Text Representation Analytics

Dr. Peter Norvig, a leading artificial intelligence researcher and Director of Research at Google, recently compiled a set of statistics about character, n-gram, and word frequencies based on the Google Books archive. His results showed some interesting similarities and differences to the seminal work of Mark Mayzner, who studied the original frequency of letters in the English language in the 1960s. The video below provides an interesting overview of Dr. Norvig's findings.

## Term Vectors

As discussed above, perhaps the most common method of representing text is by individual words or terms. Syntactically, this approach converts the text in document $D$ into a term vector $D_j$. Each entry in $D_j$ corresponds to a specific term $t_i$, and its value defines the frequency of $t_i \in D_j$. Other possible approaches include language modelling, which tries to predict the probabilities of specific sequences of terms, and natural language processing (NLP), which converts text into parse trees that include parts of speech and a hierarchical breakdown of phrases and sentences based on rules of grammar. Although useful for a variety of tasks (e.g., optical character recognition, spell checking, or language understanding), language modelling and NLP are normally too specific or too complex for our purposes.

As an example of term vectors, suppose we had the following four documents.

Document 1
 It is a far, far better thing I do, than I have ever done 
Document 2
 Call me Ishmael 
Document 3
 Is this a dagger I see before me? 
Document 4
 O happy dagger 

Taking the union of the documents' unique terms, the documents produce the following term vectors.

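| | a | before | better | call | dagger | do | done | ever | far | happy | have | i | is | ishmael | it | me | o | see | than | thing | this |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| D1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 2 | 0 | 1 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| D2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| D3 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| D4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |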

Intuitively, the overlap between term vectors might provide some clues about the similarity between documents. In our example, there is no overlap between D1 and D2, but a three-term overlap between D1 and D3, and a one-term overlap between D3 and D4.
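The construction above can be sketched with the Python standard library alone. This is only an illustration, assuming a simple tokenizer that lowercases the text and keeps runs of letters. The names `tokenize` and `overlap` are hypothetical helpers.

```python
import re
from collections import Counter

docs = [
    "It is a far, far better thing I do, than I have ever done",
    "Call me Ishmael",
    "Is this a dagger I see before me?",
    "O happy dagger",
]

def tokenize(text):
    """Lowercase the text and keep runs of letters as terms."""
    return re.findall(r"[a-z]+", text.lower())

# One term-frequency counter per document
counts = [Counter(tokenize(d)) for d in docs]

# The union of unique terms defines the vector positions
vocab = sorted(set().union(*counts))
vectors = [[c[t] for t in vocab] for c in counts]

def overlap(i, j):
    """Terms that occur in both document i and document j (0-based)."""
    return set(counts[i]) & set(counts[j])

for name, vec in zip(["D1", "D2", "D3", "D4"], vectors):
    print(name, vec)
print(sorted(overlap(0, 2)))  # terms shared by D1 and D3
```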

### NLTK Term Vectors

The following Python NLTK code snippet will create the same four documents from our example as a Python list, strip punctuation characters from the documents, tokenize them into four separate token (or term) vectors, then print the term vectors.

    import gensim
    import nltk
    import re
    import string

    # Create initial documents list
    doc = [ ]
    doc.append( 'It is a far, far better thing I do, than I have ever done' )
    doc.append( 'Call me Ishmael' )
    doc.append( 'Is this a dagger I see before me?' )
    doc.append( 'O happy dagger' )

    # Remove punctuation, then tokenize documents
    # (nltk.word_tokenize needs NLTK's tokenizer data to be installed)
    punc = re.compile( '[%s]' % re.escape( string.punctuation ) )
    term_vec = [ ]

    for d in doc:
        d = d.lower()
        d = punc.sub( '', d )
        term_vec.append( nltk.word_tokenize( d ) )

    # Print resulting term vectors
    for vec in term_vec:
        print( vec )

Running this code in Python produces one token list per document, containing the same terms that are counted in the table shown above.

    ['it', 'is', 'a', 'far', 'far', 'better', 'thing', 'i', 'do', 'than', 'i', 'have', 'ever', 'done']
    ['call', 'me', 'ishmael']
    ['is', 'this', 'a', 'dagger', 'i', 'see', 'before', 'me']
    ['o', 'happy', 'dagger']

### Stop Words

A common preprocessing step during text analytics is to remove stop words, words that are common in text but that do not provide any useful context or semantics. Removing stop words is simple, since it can be performed in a single pass over the text. There is no single, definitive stop word list. Here is one fairly extensive example.

a about above after again against all am an and any are as at be because been before being below between both but by can did do does doing don down during each few for from further had has have having he her here hers herself him himself his how i if in into is it its itself just me more most my myself no nor not now of off on once only or other our ours ourselves out over own s same she should so some such t than that the their theirs them themselves then there these they this those through to too under until up very was we were what when where which while who whom why will with you your yours yourself yourselves

Applying stop word removal to our initial four document example would significantly shorten their term vectors.

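| | better | call | dagger | done | ever | far | happy | ishmael | o | see | thing |
|---|---|---|---|---|---|---|---|---|---|---|---|
| D1 | 1 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 1 |
| D2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| D3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| D4 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |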

Notice that the overlap between D1 and D3, which was based on stop words, has vanished. The only remaining overlap is between D3 and D4.

As with all operations, removing stop words is normally appropriate, but not always. The classic example is the sentence "To be or not to be." Removing stop words eliminates the entire sentence, which could be problematic. Consider a search engine that performs stop word removal prior to search to improve performance. Searching on the sentence "To be or not to be." using this strategy would fail.
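The single-pass nature of stop word removal, and the "To be or not to be" problem, can be seen in a small sketch. The `STOP_WORDS` set below is only a hand-picked subset of the list above, chosen for this illustration, and `remove_stop_words` is a hypothetical helper name.

```python
# A small subset of the stop word list shown earlier (for illustration only)
STOP_WORDS = {"a", "be", "before", "do", "have", "i", "is", "it",
              "me", "not", "or", "than", "this", "to"}

def remove_stop_words(terms, stop_words=STOP_WORDS):
    """Keep only the terms that are not stop words, in a single pass."""
    return [t for t in terms if t not in stop_words]

# Document 3 keeps its two content terms
print(remove_stop_words("is this a dagger i see before me".split()))

# Every term in this sentence is a stop word, so nothing survives
print(remove_stop_words("to be or not to be".split()))
```

Storing the stop words in a set keeps each membership test cheap, which matters once documents contain thousands of terms.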

### NLTK Stop Words

Continuing our NLTK example, the following code snippet removes stop words from the document term vectors.

    # Remove stop words from term vectors
    # (requires NLTK's stopwords corpus to be installed)
    stop_words = nltk.corpus.stopwords.words( 'english' )

    for i in range( 0, len( term_vec ) ):
        term_list = [ ]

        for term in term_vec[ i ]:
            if term not in stop_words:
                term_list.append( term )

        term_vec[ i ] = term_list

    # Print term vectors with stop words removed
    for vec in term_vec:
        print( vec )

Running this code in Python produces a list of term vectors identical to the table shown above.

    ['far', 'far', 'better', 'thing', 'ever', 'done']
    ['call', 'ishmael']
    ['dagger', 'see']
    ['o', 'happy', 'dagger']

### Stemming

Stemming removes suffixes from words, trimming them down to conflate them into a single, common term. For example, the terms

connect, connected, connecting, connection, connections

could be stemmed to a single term connect. There are a number of potential advantages to stemming terms in a document. The two most obvious are: (1) it reduces the total number of terms, improving efficiency, and (2) it better captures the content of a document by aggregating terms that are semantically similar.

Researchers in IR quickly realized it would be useful to develop automatic stemming algorithms. One of the first algorithms for English terms was published by Julie Beth Lovins in 1968 (Lovins, J. B. Development of a Stemming Algorithm, Mechanical Translation and Computational Linguistics 11, 1–2, 1968, 22–31.) Lovins's algorithm used 294 endings, 29 conditions, and 35 transformation rules to stem terms. Conditions and endings are paired to define when endings can be removed from terms. For example

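Conditions

| Code | Condition |
|---|---|
| A | No restrictions on stem |
| B | Minimum stem length = 3 |
| ··· | |
| BB | Minimum stem length = 3 and do not remove ending after met or ryst |

Endings

| Ending | Condition |
|---|---|
| ATIONALLY | B |
| IONALLY | A |
| ··· | |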

Consider the term NATIONALLY. This term ends in ATIONALLY but condition B restricts its application to terms whose minimum stem length (after stemming) is 3 characters or longer, so it cannot be applied. The term also ends in IONALLY, however, and it satisfies condition A (no restriction on stem), so this ending can be removed, producing NAT.

Lovins's transformation rules handle issues like letter doubling (SITTING → SITT → SIT), odd pluralization (MATRIX as MATRICES), and other irregularities (ASSUME and ASSUMPTION).

The order in which rules are applied is important. In Lovins's algorithm, the longest ending that satisfies its condition is found and applied. Next, each of the 35 transformation rules is tested in turn.
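The ending-removal step can be sketched as follows. This toy version contains only the two endings and two conditions from the NATIONALLY example, not Lovins's full tables, and it omits the transformation rules entirely. The names `CONDITIONS`, `ENDINGS`, and `strip_ending` are hypothetical.

```python
# Two conditions and two endings taken from the example in the text;
# Lovins's full tables (294 endings, 29 conditions) are not reproduced here.
CONDITIONS = {
    "A": lambda stem: True,            # no restrictions on stem
    "B": lambda stem: len(stem) >= 3,  # minimum stem length = 3
}
ENDINGS = {"ATIONALLY": "B", "IONALLY": "A"}

def strip_ending(term):
    """Remove the longest ending whose condition the remaining stem satisfies."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if term.endswith(ending):
            stem = term[:-len(ending)]
            if CONDITIONS[ENDINGS[ending]](stem):
                return stem
    return term  # no ending applies

print(strip_ending("NATIONALLY"))  # ATIONALLY fails condition B, IONALLY applies
```

Trying the endings from longest to shortest means the first ending that passes its condition is also the longest one that does.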

Lovins's algorithm is a good example of trading space for coverage and performance. The number of endings, conditions, and rules is fairly extensive, but many special cases are handled, and the algorithm runs in just two major steps: removing a suffix, and handling language-specific transformations.

### Porter Stemming

Perhaps the most popular stemming algorithm was developed by Michael Porter in 1980 (Porter, M. F. An Algorithm for Suffix Stripping, Program 14, 3, 1980, 130–137.) Porter's algorithm attempted to improve on Lovins's in a number of ways. First, it is much simpler, containing many fewer endings and conditions. Second, unlike Lovins's approach of using stem length and the stem's ending character as a condition, Porter uses the number of consonant-vowel pairs that occur before the ending, to better represent syllables in a stem. The algorithm begins by defining consonants and vowels.

• Consonant. A letter other than A, E, I, O, or U, and other than a Y that is preceded by a consonant.
• Vowel. A letter that is not a consonant.

A sequence of consonants ccc... of length > 0 is denoted C. A sequence of vowels vvv... of length > 0 is denoted V. Therefore, any term has one of four forms: CVCV...C, CVCV...V, VCVC...C, or VCVC...V. Using square brackets [C] to denote arbitrary presence and parentheses $(VC)^m$ to denote m repetitions, this can be simplified to

$[C]\,(VC)^m\,[V]$

m is the measure of the term. Here are some examples of different terms and their measures, denoted using Porter's definitions.

| Measure | Term | Definition |
|---|---|---|
| m = 0 | tree ⇒ [tr] [ee] | $C\,(VC)^0\,V$ |
| m = 1 | trouble ⇒ [tr] (ou bl) [e] | $C\,(VC)^1\,V$ |
| m = 1 | oats ⇒ [ ] (oa ts) [ ] | $(VC)^1$ |
| m = 2 | private ⇒ [pr] (i v a t) [e] | $C\,(VC)^2\,V$ |
| m = 2 | orrery ⇒ [ ] (o rr e r) [y] | $(VC)^2\,V$ |
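One way to compute the measure is to label each letter as a consonant or vowel, collapse runs of the same label, and count the VC pairs that remain. The sketch below follows the definitions above and additionally assumes that a Y at the start of a term counts as a consonant. The function names are hypothetical.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    """True unless word[i] is a, e, i, o, u, or a y that follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y is a vowel only when the previous letter is a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def cv_form(word):
    """Collapse a word into its run pattern, e.g. 'trouble' -> 'CVCV'."""
    word = word.lower()
    form = ""
    for i in range(len(word)):
        label = "C" if is_consonant(word, i) else "V"
        if not form or form[-1] != label:
            form += label
    return form

def measure(word):
    """Number of VC pairs in the collapsed form, Porter's m."""
    return cv_form(word).count("VC")

for term in ["tree", "trouble", "oats", "private", "orrery"]:
    print(term, cv_form(term), measure(term))
```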

Once terms are converted into consonant–vowel descriptions, rules are defined by a conditional and a suffix transformation.

(condition) S1 → S2

The rule states that if a term ends in S1, and if the stem before S1 satisfies the condition, then S1 should be replaced by S2. The condition is often specified in terms of m.

(m > 1) EMENT →

This rule replaces a suffix EMENT with nothing if the remainder of the term has measure of 2 or greater. For example, REPLACEMENT would be stemmed to REPLAC, since REPLAC ⇒ [R] (E PL A C) [ ] with m = 2. PLACEMENT, on the other hand, would not be stemmed, since PLAC ⇒ [PL] (A C) [ ] has m = 1. Conditions can also contain more sophisticated requirements.

| Condition | Explanation |
|---|---|
| `*S` | stem must end in S (any letter can be specified) |
| `*v*` | stem must contain a vowel |
| `*d` | stem must end in a double consonant |
| `*o` | stem must end in CVC, and the second C must not be W, X, or Y |

Conditions can also include boolean operators, for example, (m > 1 and (*S or *T)) for a stem with a measure of 2 or more that ends in S or T, or (*d and not (*L or *S or *Z)) for a stem that ends in a double consonant but does not end in L or S or Z.
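These conditions translate naturally into small predicate functions that can be combined with ordinary boolean operators. The sketch below is one possible reading of the table, working on lowercase stems. All function names are hypothetical, and the example stems (such as hopp and fall) are chosen here for illustration.

```python
def is_consonant(stem, i):
    """True unless stem[i] is a, e, i, o, u, or a y that follows a consonant."""
    ch = stem[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(stem, i - 1)
    return True

def ends_with(stem, letter):       # *S, for any chosen letter
    return stem.endswith(letter)

def contains_vowel(stem):          # *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):   # *d
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))

def ends_cvc(stem):                # *o
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")

def double_not_lsz(stem):          # (*d and not (*L or *S or *Z))
    return ends_double_consonant(stem) and not (
        ends_with(stem, "l") or ends_with(stem, "s") or ends_with(stem, "z"))

print(contains_vowel("plaster"), contains_vowel("s"))
print(ends_cvc("hop"), ends_cvc("snow"))
print(double_not_lsz("hopp"), double_not_lsz("fall"))
```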

Porter defines bundles of conditions that form eight rule sets. Each rule set is applied in order, and within a rule set the matching rule with the longest S1 is applied. The first three rule sets deal with plurals and past participles (a verb in the past tense, used to modify a noun or noun phrase). The next three rule sets reduce or strip suffixes. The final two rule sets clean up trailing characters on a term. Here are some examples from the first, second, fourth, and seventh rule sets.

| Rule Set | Rule | Example |
|---|---|---|
| 1 | SSES → SS | CARESSES → CARESS |
| 1 | IES → I | PONIES → PONI |
| 1 | S → | CATS → CAT |
| 2 | (m > 0) EED → EE | AGREED → AGREE |
| 2 | `(*v*)` ED → | PLASTERED → PLASTER |
| 2 | `(*v*)` ING → | MOTORING → MOTOR, SING → SING |
| 4 | (m > 0) ATIONAL → ATE | RELATIONAL → RELATE |
| 4 | (m > 0) ATOR → ATE | OPERATOR → OPERATE |
| 4 | (m > 0) OUSLI → OUS | ANALOGOUSLI → ANALOGOUS |
| 4 | . . . | |
| 7 | (m > 1) E → | PROBATE → PROBAT, RATE → RATE |
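Selecting the rule with the longest matching S1 can be sketched for the first rule set, whose rules carry no conditions. The helper `apply_rule_set` is hypothetical. The SS → SS rule is not in the table above; it is included here on the assumption, taken from Porter's paper, that it belongs to the same rule set and protects terms such as CARESS.

```python
# (S1, S2) pairs for rule set 1. SS -> SS is not in the table above; it is
# assumed from Porter's paper and keeps terms like CARESS from losing an S.
RULE_SET_1 = [("SSES", "SS"), ("IES", "I"), ("SS", "SS"), ("S", "")]

def apply_rule_set(term, rules):
    """Apply the rule with the longest matching S1, or return term unchanged."""
    matches = [(s1, s2) for s1, s2 in rules if term.endswith(s1)]
    if not matches:
        return term
    s1, s2 = max(matches, key=lambda rule: len(rule[0]))
    return term[:len(term) - len(s1)] + s2

for term in ["CARESSES", "PONIES", "CATS", "CARESS"]:
    print(term, "->", apply_rule_set(term, RULE_SET_1))
```

Rules in the later sets would also need their condition checked against the stem before the replacement is made; that step is left out of this sketch.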

This web site provides an online demonstration of text being stemmed using Porter's algorithm. Stemming our four document example produces the following result, with happy stemming to happi.

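| | better | call | dagger | done | ever | far | happi | ishmael | o | see | thing |
|---|---|---|---|---|---|---|---|---|---|---|---|
| D1 | 1 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 1 |
| D2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| D3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| D4 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |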

### NLTK Porter Stemming

Completing our initial NLTK example, the following code snippet Porter stems each term in our term vectors.

    # Porter stem remaining terms
    porter = nltk.stem.porter.PorterStemmer()

    for i in range( 0, len( term_vec ) ):
        for j in range( 0, len( term_vec[ i ] ) ):
            term_vec[ i ][ j ] = porter.stem( term_vec[ i ][ j ] )

    # Print stemmed term vectors
    for vec in term_vec:
        print( vec )

Running this code in Python produces a list of stemmed term vectors identical to the table shown above.

    ['far', 'far', 'better', 'thing', 'ever', 'done']
    ['call', 'ishmael']
    ['dagger', 'see']
    ['o', 'happi', 'dagger']

## Similarity

Once documents have been converted into term vectors, vectors can be compared to estimate the similarity between pairs or sets of documents. Many algorithms weight the vectors' term frequencies to better distinguish documents from one another, then use the cosine of the angle between a pair of document vectors to compute the documents' similarity.

### Term Frequency–Inverse Document Frequency

A well known document similarity algorithm is term frequency–inverse document frequency, or TF-IDF (Salton, G. and Yang, C. S. On the Specification of Term Values in Automatic Indexing, Journal of Documentation 29, 4, 351–372, 1973). Here, individual terms in a document's term vector are weighted by their frequency in the document (the term frequency), and by the inverse of their frequency over the entire document collection (the inverse document frequency).

Consider an $m \times n$ matrix $X$ representing $m$ unique terms $t_i$ as rows of $X$ and $n$ documents $D_j$ as columns of $X$. The weight $X[i, j] = w_{i,j}$ for $t_i \in D_j$ is defined as $w_{i,j} = tf_{i,j} \times idf_i$, where $tf_{i,j}$ is the number of occurrences of $t_i \in D_j$, and $idf_i$ is the log of the inverse fraction of documents $n_i$ that contain at least one occurrence of $t_i$, $idf_i = \ln( n / n_i )$.

The left matrix below shows our four document example transposed to place the m=11 terms in rows and the n=4 documents in columns. The center matrix weights each term count using TF-IDF. The right matrix normalizes each document column, to remove the influence of document length from the TF-IDF weights.

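| Term | X: D1 | X: D2 | X: D3 | X: D4 | TF-IDF: D1 | TF-IDF: D2 | TF-IDF: D3 | TF-IDF: D4 | Norm: D1 | Norm: D2 | Norm: D3 | Norm: D4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| better | 1 | 0 | 0 | 0 | 1.39 | 0 | 0 | 0 | 0.35 | 0 | 0 | 0 |
| call | 0 | 1 | 0 | 0 | 0 | 1.39 | 0 | 0 | 0 | 0.71 | 0 | 0 |
| dagger | 0 | 0 | 1 | 1 | 0 | 0 | 0.69 | 0.69 | 0 | 0 | 0.44 | 0.33 |
| done | 1 | 0 | 0 | 0 | 1.39 | 0 | 0 | 0 | 0.35 | 0 | 0 | 0 |
| ever | 1 | 0 | 0 | 0 | 1.39 | 0 | 0 | 0 | 0.35 | 0 | 0 | 0 |
| far | 2 | 0 | 0 | 0 | 2.77 | 0 | 0 | 0 | 0.71 | 0 | 0 | 0 |
| happi | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1.39 | 0 | 0 | 0 | 0.67 |
| ishmael | 0 | 1 | 0 | 0 | 0 | 1.39 | 0 | 0 | 0 | 0.71 | 0 | 0 |
| o | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1.39 | 0 | 0 | 0 | 0.67 |
| see | 0 | 0 | 1 | 0 | 0 | 0 | 1.39 | 0 | 0 | 0 | 0.90 | 0 |
| thing | 1 | 0 | 0 | 0 | 1.39 | 0 | 0 | 0 | 0.35 | 0 | 0 | 0 |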

Most of the weights in the center matrix's columns are 1.39. These correspond to single frequency occurrences of terms ($tf_{i,j} = 1$) that exist in only one document ($idf_i = \ln(4 / 1) = 1.39$). Single frequency occurrences of dagger in D3 and D4 have weights of 0.69, because $idf_{dagger} = \ln(4 / 2) = 0.69$. Finally, the weight for far in D1 is 2.77 because its term frequency is $tf_{far,1} = 2$.
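The weighting and normalization steps can be reproduced with a few lines of standard-library Python. This is a minimal sketch using the stemmed term counts from the example, with hypothetical helper names `tfidf` and `normalize`; it is not the code used to produce the matrices above.

```python
import math

# Stemmed, stop-word-free term counts for the four example documents
docs = {
    "D1": {"better": 1, "done": 1, "ever": 1, "far": 2, "thing": 1},
    "D2": {"call": 1, "ishmael": 1},
    "D3": {"dagger": 1, "see": 1},
    "D4": {"dagger": 1, "happi": 1, "o": 1},
}

def tfidf(docs):
    """Weight each count by ln(n / n_i), n_i = documents containing the term."""
    n = len(docs)
    doc_freq = {}
    for counts in docs.values():
        for term in counts:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {name: {t: tf * math.log(n / doc_freq[t]) for t, tf in counts.items()}
            for name, counts in docs.items()}

def normalize(vec):
    """Scale a weight vector to unit length (left unchanged if all zero)."""
    length = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / length for t, w in vec.items()}

weights = tfidf(docs)
unit = {name: normalize(vec) for name, vec in weights.items()}

print(round(weights["D1"]["far"], 2), round(weights["D3"]["dagger"], 2))
print({t: round(w, 3) for t, w in unit["D4"].items()})
```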

Once documents are converted into normalized TF-IDF vectors, the similarity between two documents is the dot product of their vectors. In our example, the only documents that share a common term with a non-zero weight are D3 and D4. Their similarity is D3 · D4 = 0.44 × 0.33 = 0.15.

Mathematically, recall that $\cos \theta = \frac{D_i \cdot D_j}{|D_i| \, |D_j|}$. Since the document vectors are normalized, this reduces to $\cos \theta = D_i \cdot D_j$. Dot product similarity measures the cosine of the angle between two document vectors. The more similar the direction of the vectors, the more similar the documents.
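The cosine formula can also be applied directly to unnormalized vectors, which gives the same result as normalizing first and then taking the dot product. The sketch below stores each document as a sparse dictionary of TF-IDF weights; the `cosine` helper and the dictionary layout are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

# Unnormalized TF-IDF weights for D3 and D4: ln(4/2) for dagger, ln(4/1) otherwise
d3 = {"dagger": math.log(2), "see": math.log(4)}
d4 = {"dagger": math.log(2), "happi": math.log(4), "o": math.log(4)}

print(round(cosine(d3, d4), 3))  # 0.149
```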

Intuitively, TF-IDF implies the following. In any document $D_j$, if a term $t_i$ occurs frequently, it's an important term for characterizing $D_j$. Moreover, if $t_i$ does not occur in many other documents, it's an important term for distinguishing $D_j$ from other documents. This is why $t_i$'s weight in $D_j$ increases based on term frequency and inverse document frequency. If two documents share terms with high term frequency and low document frequency, they are assumed to be similar. The dot product captures exactly this situation in its sum of the product of individual term weights.

### Gensim TF-IDF

NLTK offers only limited TF-IDF support (per-term scores through `nltk.text.TextCollection`). To generate TF-IDF vectors and use them to calculate pairwise document similarity, we use the Gensim Python library.

    # Convert term vectors into gensim dictionary
    dictionary = gensim.corpora.Dictionary( term_vec )
    corp = [ ]
    for i in range( 0, len( term_vec ) ):
        corp.append( dictionary.doc2bow( term_vec[ i ] ) )

    # Create TFIDF vectors based on term vectors bag-of-word corpora
    tfidf_model = gensim.models.TfidfModel( corp )
    tfidf = [ ]
    for i in range( 0, len( corp ) ):
        tfidf.append( tfidf_model[ corp[ i ] ] )

    # Create pairwise document similarity index
    n = len( dictionary )
    index = gensim.similarities.SparseMatrixSimilarity( tfidf_model[ corp ], num_features = n )

    # Print TFIDF vectors and pairwise similarity per document
    for i in range( 0, len( tfidf ) ):
        s = 'Doc ' + str( i + 1 ) + ' TFIDF:'

        for j in range( 0, len( tfidf[ i ] ) ):
            s = s + ' (' + dictionary.get( tfidf[ i ][ j ][ 0 ] ) + ','
            s = s + ( '%.3f' % tfidf[ i ][ j ][ 1 ] ) + ')'

        print( s )

    for i in range( 0, len( corp ) ):
        print( 'Doc', ( i + 1 ), 'sim: [ ', end = '' )

        sim = index[ tfidf_model[ corp[ i ] ] ]
        for j in range( 0, len( sim ) ):
            print( '%.3f ' % sim[ j ], end = '' )

        print( ']' )

Running this code produces a list of normalized TF-IDF vectors for the Porter stemmed terms in each document, and a list of pairwise similarities for each document compared to all the other documents in our four document collection.

    Doc 1 TFIDF: (better,0.354) (done,0.354) (ever,0.354) (far,0.707) (thing,0.354)
    Doc 2 TFIDF: (call,0.707) (ishmael,0.707)
    Doc 3 TFIDF: (dagger,0.447) (see,0.894)
    Doc 4 TFIDF: (dagger,0.333) (happi,0.667) (o,0.667)
    Doc 1 sim: [ 1.000 0.000 0.000 0.000 ]
    Doc 2 sim: [ 0.000 1.000 0.000 0.000 ]
    Doc 3 sim: [ 0.000 0.000 1.000 0.149 ]
    Doc 4 sim: [ 0.000 0.000 0.149 1.000 ]

### Latent Semantic Analysis

Latent semantic analysis (LSA) reorganizes a set of documents using a semantic space derived from implicit structure contained in the text of the documents (Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S. and Harshman, R. Using Latent Semantic Analysis to Improve Access to Textual Information, Proceedings of the Conference on Human Factors in Computing Systems (CHI '88), 281–286, 1988). LSA uses the same m×n term-by-document matrix X corresponding to m unique terms across n documents. Each frequency X[i, j] for term ti in document Dj is normally weighted, for example, using TF-IDF.

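$$
X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{bmatrix}
$$

Here row $i$ of $X$ corresponds to term $t_i$ and column $j$ corresponds to document $D_j$.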

Each row in $X$ is a term vector $t_i = [\, x_{i,1} \cdots x_{i,n} \,]$ defining the frequency of $t_i$ in each document $D_j$. The dot product of two term vectors $t_p \cdot t_q^T$ defines a correlation between the distribution of the two terms across the documents. Similarly, the dot product $D_p^T \cdot D_q$ of two columns of $X$ corresponding to the term frequencies for two documents defines a similarity between the documents.

Given $X$, we perform a singular value decomposition (SVD) to produce $X = U \Sigma V^T$, where $U$ and $V$ are orthonormal matrices (matrices whose rows and columns are unit length, with the dot product of any pair of distinct rows or columns equal to 0) and $\Sigma$ is a diagonal matrix. Mathematically, $U$ contains the eigenvectors of $X X^T$ (the $t_p \cdot t_q$ correlations), $V$ contains the eigenvectors of $X^T X$ (the $D_p \cdot D_q$ similarities), and $\Sigma \Sigma^T$ contains the eigenvalues for $U$ and $V$, which are identical. SVD can be seen as providing three related functions.

1. A method to transform correlated variables into uncorrelated variables that better expose relationships among the original data items.
2. A method to identify and order dimensions along which the data items exhibit the most variation.
3. A method to best approximate the data items with fewer dimensions.

To use SVD for text similarity, we first select the $k$ largest singular values $\sigma$ from $\Sigma$, together with their corresponding eigenvectors from $U$ and $V$. This forms a rank-$k$ approximation of $X$, $X_k = U_k \Sigma_k V_k^T$. The columns $c_i$ of $U_k$ represent concepts, linear combinations of the original terms. The columns $D_j$ of $V_k^T$ represent the documents defined based on which concepts (and how much of each concept) they contain. Consider the following example documents.

• D1. Romeo and Juliet
• D2. Juliet, O happy dagger!
• D3. Romeo died by a dagger
• D4. "Live free or die", that's the New Hampshire motto
• D5. Did you know New Hampshire is in New England

We choose a subset of the terms in these documents, then construct an initial term–document matrix X.

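X =

| | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|
| romeo | 1 | 0 | 1 | 0 | 0 |
| juliet | 1 | 1 | 0 | 0 | 0 |
| happy | 0 | 1 | 0 | 0 | 0 |
| dagger | 0 | 1 | 1 | 0 | 0 |
| die | 0 | 0 | 1 | 1 | 0 |
| hampshire | 0 | 0 | 0 | 1 | 1 |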

Applying SVD to X produces the following decomposition.

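U =

| | c1 | c2 | c3 | c4 | c5 | c6 |
|---|---|---|---|---|---|---|
| romeo | -0.48 | -0.02 | -0.59 | 0.42 | -0.31 | 0.38 |
| juliet | -0.45 | 0.42 | 0.28 | 0.51 | 0.37 | 0.38 |
| happy | -0.26 | 0.29 | 0.46 | -0.24 | 0.12 | 0.76 |
| dagger | -0.55 | 0.13 | 0.05 | -0.57 | -0.46 | -0.38 |
| die | -0.41 | -0.57 | -0.16 | -0.26 | 0.64 | 0 |
| hampshire | -0.14 | -0.63 | 0.58 | 0.34 | -0.36 | 0 |

$\Sigma = \begin{bmatrix} 2.21 & 0 & 0 & 0 & 0 \\ 0 & 1.71 & 0 & 0 & 0 \\ 0 & 0 & 1.31 & 0 & 0 \\ 0 & 0 & 0 & 1.11 & 0 \\ 0 & 0 & 0 & 0 & 0.48 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$

VT =

| D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|
| -0.42 | -0.57 | -0.66 | -0.25 | -0.06 |
| 0.23 | 0.49 | -0.28 | -0.70 | -0.37 |
| -0.24 | 0.60 | -0.53 | 0.32 | 0.44 |
| 0.83 | -0.27 | -0.37 | 0.07 | 0.31 |
| 0.12 | 0.05 | -0.27 | 0.58 | -0.75 |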

We choose the k = 2 largest singular values, producing the following reduced matrices.

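U2 =

| | c1 | c2 |
|---|---|---|
| romeo | -0.48 | -0.02 |
| juliet | -0.45 | 0.42 |
| happy | -0.26 | 0.29 |
| dagger | -0.55 | 0.13 |
| die | -0.41 | -0.57 |
| hampshire | -0.14 | -0.63 |

$\Sigma_2 = \begin{bmatrix} 2.21 & 0 \\ 0 & 1.71 \end{bmatrix}$

V2T =

| D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|
| -0.42 | -0.57 | -0.66 | -0.25 | -0.06 |
| 0.23 | 0.49 | -0.28 | -0.70 | -0.37 |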

Multiplying U2 Σ2 V2T produces a new term–document matrix X2 based on the largest k = 2 singular values.

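X2 =

| | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|
| romeo | 0.37 | 0.49 | 0.59 | 0.24 | 0.06 |
| juliet | 0.53 | 0.81 | 0.52 | -0.10 | -0.12 |
| happy | 0.32 | 0.49 | 0.28 | -0.09 | -0.09 |
| dagger | 0.55 | 0.77 | 0.76 | 0.20 | 0.02 |
| die | 0.23 | 0.19 | 0.78 | 0.69 | 0.30 |
| hampshire | -0.04 | -0.18 | 0.41 | 0.59 | 0.29 |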
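The decomposition and rank-k truncation can be sketched with NumPy's `numpy.linalg.svd`. This is a minimal illustration, assuming NumPy is available; it is not necessarily how the matrices in the text were produced. The signs of individual singular vectors are not unique, so the entries of U and VT may differ in sign or in small digits from the printed values, while the singular values and the product UkΣkVkT are determined by X.

```python
import numpy as np

# Term-document matrix X from the example (rows: terms, columns: D1..D5)
terms = ["romeo", "juliet", "happy", "dagger", "die", "hampshire"]
X = np.array([
    [1, 0, 1, 0, 0],   # romeo
    [1, 1, 0, 0, 0],   # juliet
    [0, 1, 0, 0, 0],   # happy
    [0, 1, 1, 0, 0],   # dagger
    [0, 0, 1, 1, 0],   # die
    [0, 0, 0, 1, 1],   # hampshire
], dtype=float)

# Thin SVD: U is 6x5, s holds the 5 singular values (largest first), Vt is 5x5
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values and their vectors
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("singular values:", np.round(s, 2))
for term, row in zip(terms, np.round(X_k, 2)):
    print(f"{term:10s}", row)
```

Choosing a different k only changes how many columns of U and rows of VT are kept; the squared error of the approximation equals the sum of the squared singular values that were dropped.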

So what advantage does LSA provide over using a term–document matrix directly, as we do in TF-IDF? Consider the term frequencies contained in X versus X2 for D1. The original frequencies in X were 1 for romeo and juliet, and 0 for all other terms. The two largest LSA frequencies in X2 are 0.55 for dagger and 0.53 for juliet. Why is there a large positive frequency for dagger? LSA has inferred this connection based on the fact that other documents (D2 and D3) associate dagger with both romeo and juliet.

These associations affect document similarities. For example, in the original X the similarities between D1–D4 and D1–D5 are both 0. In the LSA matrix X2, however, the similarities are 0.25 and -0.002. The term die in D4 associates to romeo, defining a similarity between D1 and D4. No such association exists between D1 and D5. Human readers with an understanding of the context of Romeo and Juliet would likely identify the same weak presence or lack of similarity.
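The two similarities quoted above can be reproduced as dot products between document columns. The sketch below copies the rounded X and X2 values from the text; the helper names `column` and `dot` are illustrative only.

```python
# Original term-document matrix X (rows: terms, columns: D1..D5)
X = [
    [1, 0, 1, 0, 0],   # romeo
    [1, 1, 0, 0, 0],   # juliet
    [0, 1, 0, 0, 0],   # happy
    [0, 1, 1, 0, 0],   # dagger
    [0, 0, 1, 1, 0],   # die
    [0, 0, 0, 1, 1],   # hampshire
]

# Rank-2 LSA matrix X2, using the rounded values given in the text
X2 = [
    [0.37, 0.49, 0.59, 0.24, 0.06],      # romeo
    [0.53, 0.81, 0.52, -0.10, -0.12],    # juliet
    [0.32, 0.49, 0.28, -0.09, -0.09],    # happy
    [0.55, 0.77, 0.76, 0.20, 0.02],      # dagger
    [0.23, 0.19, 0.78, 0.69, 0.30],      # die
    [-0.04, -0.18, 0.41, 0.59, 0.29],    # hampshire
]

def column(matrix, j):
    """Return column j of a row-major matrix, i.e. one document's term vector."""
    return [row[j] for row in matrix]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Document indices are zero-based: D1 -> 0, D4 -> 3, D5 -> 4
raw_14 = dot(column(X, 0), column(X, 3))
raw_15 = dot(column(X, 0), column(X, 4))
lsa_14 = dot(column(X2, 0), column(X2, 3))
lsa_15 = dot(column(X2, 0), column(X2, 4))

print(raw_14, raw_15)                        # both 0 in the original X
print(round(lsa_14, 2), round(lsa_15, 3))    # 0.25 -0.002
```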

## Clustering

Once similarities between pairs of documents have been calculated, they can be used to cluster the documents into groups. This is often called topic clustering, since each group represents a set of documents with similar content that, by assumption, discuss a similar topic or topics.

Any similarity-based clustering algorithm can be used for topic clustering, for example, k-means, closest-neighbour agglomerative, density-based, and so on. We present a graph-based clustering algorithm that uses threshold similarities to partition the document collection into clusters. Varying the threshold similarity produces a hierarchical clustering at different levels of granularity.

### Minimum Spanning Tree Clustering

Any similarity algorithm (e.g., TF-IDF or LSA) can be used to construct a pairwise document similarity matrix Σ, where σi,j ∈ Σ defines the similarity between documents Di and Dj. Since σi,j = σj,i, only the upper or lower half of Σ is normally defined. The TF-IDF matrix X and the similarity matrix Σ for the five-document example in the LSA section contain the following values, together with the corresponding dissimilarity matrix Δ.

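X =

| | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|
| romeo | 0.71 | 0 | 0.58 | 0 | 0 |
| juliet | 0.71 | 0.44 | 0 | 0 | 0 |
| happy | 0 | 0.78 | 0 | 0 | 0 |
| dagger | 0 | 0.44 | 0.58 | 0 | 0 |
| die | 0 | 0 | 0.58 | 0.71 | 0 |
| hampshire | 0 | 0 | 0 | 0.71 | 1 |

Σ =

| | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|
| D1 | | 0.31 | 0.41 | 0 | 0 |
| D2 | | | 0.26 | 0 | 0 |
| D3 | | | | 0.41 | 0 |
| D4 | | | | | 0.71 |
| D5 | | | | | |

Δ =

| | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|
| D1 | | 0.69 | 0.59 | 1 | 1 |
| D2 | | | 0.74 | 1 | 1 |
| D3 | | | | 0.59 | 1 |
| D4 | | | | | 0.29 |
| D5 | | | | | |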

We want to use values in Σ to build a graph with documents as nodes and weighted edges defining the similarity between documents, with similar documents close to one another. To do this, we must weight the edges with dissimilarities Δ = 1 - Σ. Now, two documents Di and Dj with σi,j = 1 will have an edge weight of δi,j = 1 - σi,j = 0, so they will overlap. Two documents Di and Dj with σi,j = 0 will have an edge weight of δi,j = 1 - σi,j = 1, so they will be a maximum possible distance from one another.
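As a small sketch of this step, the similarity and dissimilarity values can be derived directly from the TF-IDF matrix. The code assumes, as the numbers in the example suggest, that each document column of X is already normalized to unit length, so a plain dot product between columns acts as a cosine similarity. The dictionaries `sigma` and `delta` are illustrative names keyed by zero-based document pairs.

```python
# TF-IDF matrix X from the example (rows: terms, columns: D1..D5)
X = [
    [0.71, 0.00, 0.58, 0.00, 0.00],   # romeo
    [0.71, 0.44, 0.00, 0.00, 0.00],   # juliet
    [0.00, 0.78, 0.00, 0.00, 0.00],   # happy
    [0.00, 0.44, 0.58, 0.00, 0.00],   # dagger
    [0.00, 0.00, 0.58, 0.71, 0.00],   # die
    [0.00, 0.00, 0.00, 0.71, 1.00],   # hampshire
]
n = 5  # number of documents

def similarity(i, j):
    """Dot product of document columns i and j."""
    return sum(row[i] * row[j] for row in X)

# Only the upper triangle is stored, since sigma[i, j] == sigma[j, i]
sigma = {(i, j): similarity(i, j) for i in range(n) for j in range(i + 1, n)}

# Dissimilarity: 0 for identical documents, 1 for documents sharing no terms
delta = {pair: 1.0 - s for pair, s in sigma.items()}

for (i, j), d in sorted(delta.items()):
    print(f"D{i + 1}-D{j + 1}: similarity {sigma[(i, j)]:.2f}, dissimilarity {d:.2f}")
```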

Once Δ is built, it is used to construct a complete graph with n nodes representing the n documents, and n(n−1)/2 edges representing the similarities between all pairs of documents. Each edge connecting Di and Dj is weighted with δi,j. Kruskal's minimum spanning tree (MST) algorithm is run to find a minimum-weight tree that includes all n documents.

    F ← n nodes
    E ← n(n−1)/2 edges
    while E not empty && F not spanning do
        find ei,j ∈ E with minimum wi,j
        remove ei,j from E
        if ei,j connects two separate trees in F then
            add ei,j to F

This force-directed graph shows the five document example with the Euclidean distance between nodes roughly equal to the dissimilarity between the corresponding documents. MST edges are drawn in red and labelled with their δi,j.

Once the MST is constructed, topic clusters are formed by removing all edges ei,j in the MST whose δi,j ≥ τ for a threshold dissimilarity τ. This produces one or more disconnected subgraphs, each of which represents a topic cluster. The larger the value of τ, the more dissimilarity we allow between documents in a common cluster. For example, suppose we chose τ = 0.5 in the above TF-IDF dissimilarity graph. This would produce four topic clusters: C1 = { D4, D5 }, C2 = { D1 }, C3 = { D2 }, and C4 = { D3 }. Increasing τ to 0.6 produces two topic clusters: C1 = { D1, D3, D4, D5 }, and C2 = { D2 }.
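The MST construction and the threshold step can be sketched together in a few lines of Python. This is a minimal illustration rather than a reference implementation: it uses a simple union-find structure to detect whether an edge joins two separate trees, takes the Δ values from the TF-IDF example, and numbers documents from zero (D1 is node 0). The function names are hypothetical.

```python
def find(parent, x):
    """Return the root of the tree containing x, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def kruskal_mst(n, edges):
    """Kruskal's algorithm. edges is a list of (weight, i, j); returns the MST edges."""
    parent = list(range(n))
    mst = []
    for w, i, j in sorted(edges):          # consider the lightest edges first
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:                       # edge connects two separate trees
            parent[ri] = rj
            mst.append((w, i, j))
            if len(mst) == n - 1:          # the forest is now spanning
                break
    return mst

def topic_clusters(n, mst, tau):
    """Remove MST edges with weight >= tau and return the remaining components."""
    parent = list(range(n))
    for w, i, j in mst:
        if w < tau:
            parent[find(parent, i)] = find(parent, j)
    groups = {}
    for node in range(n):
        groups.setdefault(find(parent, node), []).append(node)
    return sorted(groups.values())

# Upper triangle of the TF-IDF dissimilarity matrix from the example
delta = {
    (0, 1): 0.69, (0, 2): 0.59, (0, 3): 1.0, (0, 4): 1.0,
    (1, 2): 0.74, (1, 3): 1.0, (1, 4): 1.0,
    (2, 3): 0.59, (2, 4): 1.0,
    (3, 4): 0.29,
}
edges = [(w, i, j) for (i, j), w in delta.items()]
mst = kruskal_mst(5, edges)

for tau in (0.5, 0.6):
    clusters = topic_clusters(5, mst, tau)
    print(tau, [[f"D{d + 1}" for d in c] for c in clusters])
```

With these values, τ = 0.5 leaves only the D4–D5 edge in place, and τ = 0.6 leaves D2 on its own, matching the clusters described above.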

Semantically, we might expect to see two clusters: C1 = { D1, D2, D3 }, and C2 = { D4, D5 }. However, because δ1,3 = δ3,4 = 0.59, it is not possible to subdivide the MST in a way that combines D1 with D3, but separates D3 from D4. This is a consequence of the fact that TF-IDF has no knowledge of context. It only has access to the terms in the documents to make its similarity estimates.

An alternative is to use LSA to derive a dissimilarity matrix that includes term correspondences in its results. The matrix Δ2 below shows dissimilarities for the five document example derived from X2, the LSA rank-2 estimate of the original term–frequency matrix X.

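Δ2 =

| | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|
| D1 | | 0.38 | 0.41 | 0.84 | 0.96 |
| D2 | | | 0.25 | 0.88 | 1 |
| D3 | | | | 0.49 | 0.81 |
| D4 | | | | | 0.76 |
| D5 | | | | | |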

There is still a fairly weak dissimilarity between D3 and D4 (δ3,4 = 0.49, implying a fairly strong similarity of σ3,4 = 0.51), but this is larger than the dissimilarities between D1, D2, and D3. Choosing τ = 0.4 produces three clusters: C1 = { D1, D2, D3 }, C2 = { D4 }, and C3 = { D5 }. Although it's still not possible to cluster { D1, D2, D3 } and { D4, D5 }, the LSA clusters with τ = 0.4 seem semantically more appropriate than any of the TF-IDF clusters.

## Sentiment

Sentiment is defined as "an attitude, thought, or judgment prompted by feeling." An area of recent interest in text analytics is estimating the sentiment contained in a block of text. This has prompted a number of basic questions. For example, how should sentiment or emotion be characterized so we can measure it? And how can these measurements be extracted from text?

### Emotional Models

Psychologists have proposed various models to define different emotional states. In psychology, mood refers to a medium or long-term affective state, where affect is described using dimensions like pleasure, arousal, and engagement.

Psychological models use emotional dimensions to position emotions on a 2D plane. The simplest models represent pleasure along a horizontal axis, with highly unpleasant on one end, highly pleasant on the other, and different levels of pleasure in between. For example, when sentiment is described as positive, negative, or neutral, it is normally assumed that "sentiment" means pleasure.

More complex models use more than a single dimension. For example, Russell proposed using valence (or pleasure) and arousal (or activation) to build an emotional circumplex of affect (Russell, J. A. A Circumplex Model of Affect. Journal of Personality and Social Psychology 39, 6, 1161–1178, 1980). Russell applied multidimensional scaling to position 28 emotional states, producing the model shown to the left with valence running along the horizontal axis and arousal along the vertical axis. The intermediate terms excited–depressed and distressed–relaxed are polar opposites formed by intermediate states of valence and arousal.

Similar models have been proposed by Watson and Tellegen (with positive and negative valence axes), Thayer (with tension and energy axes), and Larsen and Diener (with pleasure and activation axes similar to Russell's). The circumplex can be further subdivided into additional emotional regions like happy, sad, calm, and tense.

### Natural Language Processing

The area of natural language processing (NLP) in computer science is often used to analyze the structure of a text body. For example, it is useful to perform subjectivity classification to remove objective, fact-based text prior to estimating sentiment. Pang and Lee proposed a sentence-level subjectivity detector that computes a subjectivity weight for each sentence (Pang, B. and Lee, L. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL '04), 271–278, 2004). Pairs of sentences are assigned an association score based on the difference of their subjectivity scores to estimate whether they belong to a common class. A graph is constructed with sentences and the two classifications (subjective and objective) forming nodes, association weights forming edges between sentences, and subjectivity weights forming edges between each sentence and the classification nodes. A minimum graph cut is then used to split the sentences into subjective and objective classes.

Another common NLP method for sentiment analysis is to train a machine learning algorithm on a set of documents with known sentiment. Naive Bayes, maximum entropy, and support vector machine (SVM) approaches were compared for classifying movie reviews as positive or negative. The presence or absence of a term (unigrams) performed best, with accuracies ranging from 80.4% for maximum entropy to 82.9% for SVM. Interestingly, more complex inputs like bigrams, term frequencies, part of speech tagging, and document position information did not improve performance.

In a similar manner, semantic orientation has been used to rate online reviews as positive or negative (Turney, P. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), 417–424, 2002). The semantic orientation of phrases in a review is compared to the anchor words "excellent" and "poor" by extracting phrases containing predefined target terms, then using pointwise mutual information (PMI) to calculate the statistical dependence between each phrase and the anchor words. The difference PMI(phrase, "excellent") − PMI(phrase, "poor") estimates the direction and the strength of a phrase's semantic orientation. Results for reviews about automobiles, banks, movies, and travel destinations produced accuracies of 84%, 80%, 65.8% and 70.5%, respectively.
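To make the PMI difference concrete, here is a minimal sketch that estimates semantic orientation from raw occurrence counts. It uses the standard definition PMI(a, b) = log2( p(a, b) / (p(a) p(b)) ). All counts below are hypothetical values chosen for illustration, not figures from the cited study, and the function and parameter names are assumptions. Zero counts would need smoothing before taking a logarithm.

```python
import math

def pmi(joint, count_a, count_b, total):
    """Pointwise mutual information from raw counts: log2( p(a,b) / (p(a) p(b)) )."""
    p_ab = joint / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log2(p_ab / (p_a * p_b))

def semantic_orientation(near_excellent, near_poor, n_phrase, n_excellent, n_poor, total):
    """PMI(phrase, "excellent") - PMI(phrase, "poor")."""
    return (pmi(near_excellent, n_phrase, n_excellent, total)
            - pmi(near_poor, n_phrase, n_poor, total))

# Hypothetical counts for one phrase in a corpus of one million text windows
so = semantic_orientation(
    near_excellent=40,    # windows containing the phrase and "excellent"
    near_poor=5,          # windows containing the phrase and "poor"
    n_phrase=500,         # windows containing the phrase
    n_excellent=20_000,   # windows containing "excellent"
    n_poor=25_000,        # windows containing "poor"
    total=1_000_000,
)
print(round(so, 2))  # positive, so the phrase leans towards "excellent"
```

A positive result suggests a positive orientation and a negative result a negative one; the magnitude indicates strength.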

### Sentiment Dictionaries

An alternative to natural language approaches uses a term dictionary to estimate sentiment, often for short text snippets like online comments or social network conversations. Proponents argue that a short post does not contain enough text and grammar for an NLP algorithm to leverage, and therefore, independent word analysis may produce results comparable to NLP.

Profile of mood states (POMS) was originally designed as a psychometric tool to measure a person's mood state on six dimensions: tension–anxiety, depression–dejection, anger–hostility, fatigue–inertia, vigor–activity, and confusion–bewilderment. Subjects rate 65 adjectives on a five-point scale to produce a score in each dimension that can be compared to population norms. POMS was extended by including 793 synonyms of the original 65 adjectives to form POMS-ex, a word dictionary used to estimate sentiment.

Affective Norms for English Words (ANEW) was built to assess the emotional affect for a set of verbal terms. Three emotional dimensions were scored on a nine-point scale: valence (or pleasure), arousal (or activation), and dominance. A total of 1,033 words that were previously identified as emotion-carrying words, and that provided good coverage of all three dimensions, were rated.

Recent lexical approaches have focused specifically on short text and online social networks. SentiStrength was developed from manually scoring 2,600 MySpace comments on two five-point scales representing both the positive and the negative emotion of a comment. Analysis of the training set produced a dictionary of 298 positive terms and 465 negative terms (Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., and Kappas, A. Sentiment Strength Detection in Short Informal Text. Journal of the American Society for Information Science and Technology 61, 12, 2544–2558, 2010). SentiStrength is augmented to recognize properties of social network text like emoticons, text abbreviations, and repeated letters and punctuation, as well as simple use of booster words (somewhat, very) and negation.

WordNet is a lexical database that groups English nouns, verbs, adjectives, and adverbs into sets of cognitive synonyms (synsets) that represent distinct concepts. SentiWordNet estimates the sentiment of synsets by assigning them a positive, negative, and objective (or neutral) score, each on the range 0 to 1, with the three scores summing to 1 (Baccianella, S., Esuli, A., and Sebastiani, F. SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC '10), 2200–2204, 2010). Individual terms can be matched to their synset, and since these are taken from WordNet, properties like word disambiguation are automatically included.

### Estimating Sentiment

As a practical example of estimating sentiment, consider analysis using ANEW's independent word dictionary. ANEW has a number of potential advantages. First, its word list is publicly available. Second, it contains multiple emotional dimensions, allowing us to characterize sentiment in ways that are more sophisticated than a simple positive–negative rating. We apply ANEW to process a text body as follows:

1. Parse the text body into terms, then identify the n ANEW terms { t1, … , tn } that match entries in the ANEW dictionary.
2. For each ti, extract the average and standard deviation for valence (vi,μ, vi,σ) and arousal (ai,μ, ai,σ) from the dictionary.
3. If a text body has fewer than two ANEW terms (n < 2), discard it as having insufficient measurements to estimate an overall sentiment.
4. If n ≥ 2, aggregate individual term values to estimate an overall average and standard deviation for valence (Vμ, Vσ) and arousal (Aμ, Aσ).

Calculating an overall standard deviation from a set of term average and standard deviation pairs (μi, σi) is done using a formula for averaging standard deviations. For example, for Vσ it is

$M = \frac{1}{n} \sum_{i=1}^{n} v_{i,\mu} , \, \, \, V_{\sigma}^{2} = \frac{1}{n} \sum_{i=1}^{n} \left( v_{i,\sigma}^{2} + v_{i,\mu}^{2} \right) - M^{2}$
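As a minimal sketch of this aggregation, the function below assumes the usual way of pooling equally weighted distributions: the overall variance is the average of each term's σ² + μ², minus the square of the overall mean M. The function name and the two example (mean, standard deviation) pairs are illustrative only.

```python
import math

def pooled_std(pairs):
    """Combine per-term (mean, standard deviation) pairs.

    Returns (M, V_sigma), where M is the unweighted mean of the term means and
    V_sigma^2 = (1/n) * sum(sd_i^2 + mean_i^2) - M^2.
    """
    n = len(pairs)
    M = sum(mu for mu, _ in pairs) / n
    variance = sum(sd ** 2 + mu ** 2 for mu, sd in pairs) / n - M ** 2
    return M, math.sqrt(variance)

# Two example terms, each given as (mean valence, standard deviation)
M, v_sigma = pooled_std([(6.81, 1.88), (8.38, 0.92)])
print(round(M, 3), round(v_sigma, 3))
```

Because the formula includes each term's mean, the pooled deviation grows both when individual terms are ambiguous and when the terms disagree with one another.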

Calculating the overall average of individual term averages is more complicated. We could use a simple unweighted average; however, this ignores each term's standard deviation vi,σ. A large deviation vi,σ implies ambiguity in how respondents rated a term. The term could be a homonym: lie, to not tell the truth versus lie, to recline. The term could have been viewed in different contexts: a sentence where the term is used directly, "I'm happy" versus a sentence where it's negated, "I'm not happy." The term could simply be difficult to score: what valence should we attach to the ANEW term "bench?"

Intuitively, the larger vi,σ, the less weight we want to give vi,μ in the overall average, since we are less confident about where its true average lies. To do this, we calculate a term's probability density function (the normal curve) p, then use the height of the curve at its mean, p(vi,μ), to weight vi,μ in the overall average. This gives a lower weight to terms with larger vi,σ. The height of the normal curve with μ = vi,μ and σ = vi,σ at x = vi,μ is

$p_i = \frac{1}{\sqrt{2 \pi v_{i,\sigma}^{2}}}$

Given pi for each term ti, we normalize the pi's, then calculate a final weighted average.

$V_\mu = \sum_{i=1}^{n} \frac{p_i}{\textstyle{\sum_{i=1}^{n} p_i}} v_{i,\mu}$

Consider the tweet "Congrats to @HCP_Nevada for their health care headliner win!" with two ANEW terms "health" (vμ = 6.81, vσ = 1.88) and "win" (vμ = 8.38, vσ = 0.92). An unweighted average of the vμ's produces Vμ = 7.60, but since the standard deviation for health is higher than for win, vhealth,μ receives a weight of 0.21/0.64 = 0.33, while vwin,μ receives a weight of 0.43/0.64 = 0.67. This produces a weighted average Vμ = 7.86 that falls closer to win's valence, exactly as we want.
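The weighting in this example can be checked with a few lines of Python. This is only a sketch of the calculation described above; the function name `weighted_valence` is hypothetical.

```python
import math

def weighted_valence(pairs):
    """Weight each term's mean valence by the height of its normal curve at the mean.

    pairs is a list of (mean, standard deviation); the weights p_i are
    normalized so they sum to 1 before averaging.
    """
    heights = [1.0 / math.sqrt(2.0 * math.pi * sd ** 2) for _, sd in pairs]
    total = sum(heights)
    return sum((p / total) * mu for p, (mu, _) in zip(heights, pairs))

# "health" and "win" from the example tweet
tweet_terms = [(6.81, 1.88), (8.38, 0.92)]
v_mu = weighted_valence(tweet_terms)
print(round(v_mu, 2))  # 7.86
```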

### Term Sentiment

To calculate valence and arousal on your own sets of terms, we have implemented an extended term dictionary in Python. To use this dictionary, download the sentiment_module.zip file, and unzip it to extract a sentiment_module folder. You then have two options for installing the module:

1. Place the sentiment_module folder in the .ipython folder located in your home directory. This should allow the IPython console to see the module.
2. Place the sentiment_module folder in the same directory where you're developing your Python program(s). This will allow you to run Python from the command line and load the term library directly.

Here's an example of using the module, assuming you've placed it in your .ipython directory and that you're running Python from an IPython console.

    >>> from sentiment_module import sentiment
    >>> term = 'happy'
    >>> sentiment.exist( term )
    True
    >>> sentiment.sentiment( term )
    {'arousal': 6.49, 'valence': 8.21}

The sentiment module provides a number of functions. The two that you are most likely to use are exist and sentiment:

• exist( term ):
Returns True if term exists in the ANEW dictionary, False if it does not. term can be a string, or a list of strings.
• sentiment( term ):
Returns a dict variable with an arousal field and a valence field. If term is a string, sentiment returns the valence and arousal for the given term. If term is a list of strings, sentiment returns the average valence and arousal for all recognized terms in the list.

Remember that sentiment values lie on the range 1 (minimum) through 9 (maximum). Below are a few examples of how to use the sentiment module to compute sentiment.

    >>> from sentiment_module import sentiment
    >>> term = 'popsicle'
    >>> sentiment.exist( term )
    False
    >>> term = 'enraged'
    >>> sentiment.exist( term )
    True
    >>> sentiment.sentiment( term )
    {'arousal': 7.97, 'valence': 2.46}
    >>>
    >>> term_list = "it was the best of times it was the worst of times".split()
    >>> print( term_list )
    ['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times']
    >>> sentiment.exist( term_list )
    [False, False, False, True, False, True, False, False, False, True, False, True]
    >>> sentiment.sentiment( term_list )
    {'arousal': 4.939546556471719, 'valence': 5.0307617694606375}
    >>>
    >>> term_list = [ 'brocolli', 'carrot', 'pea' ]
    >>> sentiment.exist( term_list )
    [False, False, False]
    >>> sentiment.sentiment( term_list )
    {'arousal': 0.0, 'valence': 0.0 }

It is also possible to add custom words to the dictionary using add_term(), as long as you can provide a reasonable valence and arousal for the word. Both the word and its stem will then be available.

• add_term( term, v, a [, replace] ):
Adds term to the sentiment dictionary, assigning it a valence of v and an arousal of a. Terms already in the dictionary will not be modified unless the optional replace argument is provided with a value of True.

Below are some examples of adding and replacing terms in the default sentiment dictionary.

    >>> from sentiment_module import sentiment
    >>> sentiment.exist( 'stunned' )
    False
    >>> sentiment.add_term( 'stunned', 2.0, 6.0 )
    >>> sentiment.exist( 'stunned' )
    True
    >>> sentiment.exist( 'stun' )
    True
    >>> sentiment.sentiment( 'stunned' )
    {'arousal': 2.0, 'valence': 6.0}
    >>>
    >>> sentiment.sentiment( 'happy' )
    {'arousal': 6.49, 'valence': 8.21}
    >>> sentiment.add_term( 'happy', 6.0, 8.0 )
    >>> sentiment.sentiment( 'happy' )
    {'arousal': 6.49, 'valence': 8.21}
    >>> sentiment.add_term( 'happy', 6.0, 8.0, True )
    >>> sentiment.sentiment( 'happy' )
    {'arousal': 6.0, 'valence': 8.0}

## Visualization

The simplest (and most common) way to visualize text is to display it directly to a user. This is useful, since it reveals the full detail of a document. It also has drawbacks, however. Documents take time to read. For a few documents, it might be possible to analyze them by reading them. For larger document collections, however, reading every document in its entirety is not feasible. Instead, some method is needed to present higher-level overviews of the documents and their relationships to one another, for example, summaries, topic clusters, or geolocations.

### Tweet Visualization

As a practical example of a number of visualization techniques, consider an ongoing project to analyze tweets posted on Twitter, an online social network that allows users to upload short text messages (tweets) of up to 140 characters. This restriction encourages users to construct focused, timely updates. Twitter reported in its IPO filing that users were posting an average of 500 million tweets per day. Tweets are now being archived at the U.S. Library of Congress. Twitter has also shown the potential for societal impact, for example, in its use as a communication and organizing tool for activists during the 2011 "Arab Spring" protests in various Middle Eastern countries.

Collections of tweets are visualized in numerous ways: by sentiment, by topic, by frequent terms, and so on. Individual tweets are drawn as circles. Each circle's colour, brightness, size, and transparency visualize different details about the sentiment of its tweet:

• Colour. The overall valence or pleasure of the tweet: pleasant tweets are green, and unpleasant tweets are blue.
• Brightness. The overall arousal of the tweet: active tweets are brighter, and subdued tweets are darker.
• Size. One measure of how confident we are about the estimate of the tweet's sentiment: larger tweets represent more confident estimates.
• Transparency. A second measure of how confident we are about the estimate of the tweet's sentiment: more opaque (i.e. less transparent) tweets represent more confident estimates.

Tweets are presented using several different visualization techniques. Each technique is designed to highlight different aspects of the tweets and their sentiment.

#### Sentiment

The estimated sentiment of each tweet defines its position in an emotional scatterplot with pleasure and arousal on its horizontal and vertical axes. The spatial distribution of the tweets summarizes their overall sentiment.

Details are presented on demand. If the user hovers the mouse cursor over a tweet, it reveals its body. Words in the sentiment dictionary are highlighted in bold italics. Clicking on a tweet generates a detail dialog with the overall pleasure and arousal for the tweet, as well as each sentiment term's mean and standard deviation of pleasure, mean and standard deviation of arousal, and frequency.

#### Topics

Text similarity and MST clustering are used to identify tweets that discuss a common topic or theme. Each topic is visualized as a rectangular group of tweets, with keywords at the top to summarize the topic, and a number at the bottom to identify the number of tweets in the cluster.

Tweets within each cluster are laid out so that the distance between them shows their text similarity: closer for stronger similarity. Topic cluster rectangles are positioned in the same way: closer for more similar topics. Tweets that are not part of any topic are visualized as singletons on the right.

As with the sentiment visualization, details are available on demand by hovering the mouse over a tweet or clicking a tweet to reveal its content and its estimated sentiment.

#### Heatmap

A heatmap visualizes a discretized distribution of elements in a 2D plot. Here, we use a sentiment histogram to highlight "hot" red regions with many tweets, and "cold" blue regions with only a few tweets.

The emotional scatterplot is subdivided into an 8 × 8 grid of bins representing one-unit steps in pleasure and arousal. The number of tweets falling within each bin is counted and visualized using colour: red for bins with more tweets than average, and blue for bins with fewer tweets than average. White bins contain no tweets. Stronger, more saturated colours lie farther from the average.
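The binning step can be sketched as follows. The text does not say how the average is computed or how boundary values are handled, so this sketch makes two assumptions: the average is taken over non-empty bins only, and a score of exactly 9 falls in the last bin. The function names and sample points are hypothetical.

```python
def bin_index(score, lo=1.0, hi=9.0, bins=8):
    """Map a score in [lo, hi] to a bin 0..bins-1; the upper bound joins the last bin."""
    idx = int((score - lo) / (hi - lo) * bins)
    return min(max(idx, 0), bins - 1)

def sentiment_histogram(points, bins=8):
    """Count (valence, arousal) points per bin; rows are arousal, columns are valence."""
    grid = [[0] * bins for _ in range(bins)]
    for valence, arousal in points:
        grid[bin_index(arousal, bins=bins)][bin_index(valence, bins=bins)] += 1
    return grid

def bin_colour(count, average):
    """White for empty bins, red above the average, blue otherwise."""
    if count == 0:
        return "white"
    return "red" if count > average else "blue"

# Hypothetical (valence, arousal) estimates for four tweets
points = [(8.21, 6.49), (2.46, 7.97), (7.86, 5.10), (8.05, 6.20)]
grid = sentiment_histogram(points)

# Assumption: average over non-empty bins only
counts = [c for row in grid for c in row if c > 0]
average = sum(counts) / len(counts)

for row_idx, row in enumerate(grid):
    for col_idx, count in enumerate(row):
        if count:
            print(row_idx, col_idx, count, bin_colour(count, average))
```

Colour saturation could then be scaled by how far each count lies from the average.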

Hovering the mouse over a heatmap bin reveals the number of tweets that lie in the bin.

#### Tag Cloud

A tag cloud visualizes a collection of text documents as a set of frequent terms, where the size of a term represents the number of times it occurs in the document set.

Tweets are visualized as four separate tag clouds in four emotional regions: upset in the upper-left, happy in the upper-right, relaxed in the lower-right, and unhappy in the lower-left. A term's size shows how often it occurs over all the tweets in the given emotional region. Larger terms occur more frequently.
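A rough sketch of how term counts for the four clouds might be gathered is shown below. It assumes the regions are split at the midpoint of the 1 to 9 scale (5.0) on both axes, which the text does not state, and it splits tweets on whitespace without removing stop words. The function names and sample tweets are hypothetical.

```python
from collections import Counter

def region(valence, arousal, mid=5.0):
    """Name the emotional region for a (valence, arousal) pair.

    Assumption: regions are separated at the scale midpoint on both axes.
    """
    if arousal >= mid:
        return "happy" if valence >= mid else "upset"
    return "relaxed" if valence >= mid else "unhappy"

def tag_clouds(tweets):
    """Count term frequencies separately for each emotional region."""
    clouds = {name: Counter() for name in ("upset", "happy", "relaxed", "unhappy")}
    for text, valence, arousal in tweets:
        clouds[region(valence, arousal)].update(text.lower().split())
    return clouds

# Hypothetical tweets as (text, valence, arousal)
tweets = [
    ("great win today", 8.0, 6.5),
    ("what a win", 7.5, 7.0),
    ("so angry about traffic", 2.5, 7.5),
    ("quiet calm evening", 7.0, 3.0),
    ("feeling sad and tired", 2.0, 2.5),
]
clouds = tag_clouds(tweets)

# Term size in each cloud would be driven by these counts
for name, counter in clouds.items():
    print(name, counter.most_common(3))
```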

#### Timeline

A timeline visualizes the number of tweets that occur over a given time window using a double-ended bar graph. Pleasant tweets are shown in green above the horizontal axis, and unpleasant tweets in blue below the axis.

The height of a bar in the graph shows the number of tweets posted over the time range covered by the bar. Bars are split into four segments representing the number of relaxed and happy tweets—in dark green and light green—and the number of unhappy and upset tweets—in dark blue and light blue.

#### Map

Maps are used to geolocate data. Tweets are visualized at the latitude and longitude where they were posted. We use the same sized, coloured circles from the sentiment and topic visualizations to show estimated sentiment and confidence in the estimate.

Twitter presents an interesting problem for geolocation. Because it implements an "opt-in" system for reporting location, users must explicitly choose to allow their location to be posted before their tweets are geotagged. Most users have not done this, so only a very few tweets contain location data. The label in the upper-right corner of the map is modified to show the total number of geotagged tweets in parentheses, to highlight this difference relative to the other visualizations.

#### Affinity

An affinity graph visualizes relationships between text elements. The basis for a relationship depends on the type of data being visualized. For tweet data, the affinity graph includes frequent tweets, people, hashtags, and URLs, together with relationships or affinities between these elements.

As before, blue and green nodes represent tweets. Orange nodes represent people, yellow nodes represent hashtags, and red nodes represent URLs. Larger nodes show more frequent elements. Links between nodes highlight relationships, for example, tweets that are similar to one another, or hashtags and people that occur in a set of tweets.

#### Tweets

Even with sophisticated visualization techniques available, it is often important to show the raw text in a document. A common analysis strategy is to use visualizations to filter a large document collection into a small subset of documents that are of particular interest to an analyst. Those documents can then be read to reveal specific details that cannot be easily captured in the visualizations.

For tweets, we show the date, author, and body of each tweet, as well as its overall pleasure v and arousal a. Sentiment terms in each tweet are highlighted in bold italics. This allows a viewer to see both the raw text of the tweet, and the estimated sentiment values we derived.