ACA Public Sentiment Project Proposal
Christopher G. Healey

Project Description

We will collect and analyze recent social network discussions about the Affordable Care Act (a.k.a. ObamaCare). Specifically, we will collect tweets from Twitter, a social network that allows users to post short text messages of up to 140 characters. We will apply topic clustering and sentiment analysis to the tweets, then interpret the results to provide a summary of the current major topics related to the ACA and their associated sentiment.

Data Source

We will use Twitter's real-time streaming API to collect tweets that contain the keywords:

We will use the TweetCapture program provided to us to connect to Twitter's real-time stream (the firehose) and collect tweets by keyword. Based on a check of Twitter's recent tweet activity, we anticipate being able to collect approximately 24,000 tweets per day (see the Data Source Justification section below for more details), or about 150,000 tweets over a 1-week period. Again based on recent tweet activity, we can observe topics like:

Analysis

We will perform topic clustering on the tweets to identify major topics of discussion. We will then estimate sentiment for each major topic to determine a general sentiment (specifically, positive, neutral, or negative pleasure) for the topic's tweets.
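As a rough illustration of the two-stage analysis described above, the following Python sketch groups tweets with a greedy single-pass clustering over term-count cosine similarity, then averages a lexicon-based pleasure score per cluster. It is a sketch under stated assumptions, not the method this project commits to: the PLEASURE lexicon, the STOP_WORDS set, the 0.3 similarity threshold, the 0.2 neutral band, and all function names are hypothetical placeholders.

```python
import math
from collections import Counter

# Toy pleasure lexicon with scores in [-1, 1]. All values are hypothetical.
PLEASURE = {"love": 0.9, "great": 0.8, "saves": 0.5,
            "broken": -0.7, "hate": -0.9, "costly": -0.5}
# Tiny illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "is", "my", "this", "i", "to", "me", "and"}


def vectorize(text):
    """Lowercase, split on whitespace, drop stop words, count terms."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(texts, threshold=0.3):
    """Greedy single-pass clustering: join the most similar cluster or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [text, ...]}
    for text in texts:
        vec = vectorize(text)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["centroid"].update(vec)  # centroid is the summed term counts
            best["members"].append(text)
        else:
            clusters.append({"centroid": Counter(vec), "members": [text]})
    return clusters


def pleasure(text):
    """Mean lexicon score of the scored words in a text, or None if none are scored."""
    scores = [PLEASURE[w] for w in text.lower().split() if w in PLEASURE]
    return sum(scores) / len(scores) if scores else None


def label(score, band=0.2):
    """Map a mean pleasure score to positive, neutral, or negative."""
    if score is None or abs(score) < band:
        return "neutral"
    return "positive" if score > 0 else "negative"


def topic_sentiment(texts):
    """Return (top terms, sentiment label, tweet count) for each cluster."""
    results = []
    for c in cluster(texts):
        scores = [s for s in map(pleasure, c["members"]) if s is not None]
        mean = sum(scores) / len(scores) if scores else None
        top_terms = [t for t, _ in c["centroid"].most_common(2)]
        results.append((top_terms, label(mean), len(c["members"])))
    return results
```

Calling topic_sentiment on a list of tweet strings yields one tuple per discovered topic. The single-pass clustering is order dependent and was chosen here only because it fits in a few lines; any standard clustering technique could take its place.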

Challenges

We anticipate a number of challenges we will need to overcome as part of our project.

  1. Differentiating tweets that discuss the ACA from tweets that match one of our keywords but are not about the ACA. Fortunately, these situations should be fairly rare, since our keywords are unlikely to be used in unrelated tweets.
  2. Performing standard stop word removal, stemming, and topic clustering on short text snippets that are not grammatically correct, that do not use correct spelling, that contain numerous abbreviations, that contain shortened URLs, and so on: RT @mr_prez What r u talkin 'bout, ur ACA sounds bo-gus!  :'(   >:O  http://bit.ly/1eYmVWG.
  3. Estimating sentiment on short, possibly ungrammatical text snippets where punctuation, emoticons, and abbreviations can have a significant impact: RT @fuma ur ACA idea is teh sh*te!!!! #urbandictionary.
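To make the second and third challenges concrete, here is a minimal Python sketch of a normalization pass that could run before stop word removal, stemming, and sentiment estimation. It separates URLs and emoticons from the text so they can be handled on their own, drops the retweet marker and user mentions, and expands a few abbreviations. The regular expressions, the ABBREVIATIONS table, and the normalize_tweet function are all illustrative assumptions and cover only a small part of real tweet syntax.

```python
import re

# Illustrative patterns only; real tweets need a far more thorough treatment.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"^RT\s+", re.IGNORECASE)
# Hypothetical emoticon pattern: optional brow, eyes, optional tear or nose, mouth.
EMOTICON_RE = re.compile(r">?[:;=]['\-]?[()\[\]DPOo]")
# Words may contain internal '*', apostrophes, or hyphens (e.g. "bo-gus").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[*'\-][a-z0-9]+)*")

# Hypothetical abbreviation table; a usable one would be much larger.
ABBREVIATIONS = {"r": "are", "u": "you", "ur": "your", "teh": "the"}


def normalize_tweet(text):
    """Return (tokens, emoticons, urls) extracted from one raw tweet."""
    # Pull out URLs first so their punctuation is not mistaken for emoticons.
    urls = [u.rstrip(".,;!?") for u in URL_RE.findall(text)]
    text = URL_RE.sub(" ", text)
    # Drop the leading retweet marker and any user mentions.
    text = RETWEET_RE.sub("", text.strip())
    text = MENTION_RE.sub(" ", text)
    # Keep emoticons separately, since they can carry sentiment.
    emoticons = EMOTICON_RE.findall(text)
    text = EMOTICON_RE.sub(" ", text)
    # Lowercase, tokenize, and expand known abbreviations.
    tokens = [ABBREVIATIONS.get(t, t) for t in TOKEN_RE.findall(text.lower())]
    return tokens, emoticons, urls
```

Applied to the first example tweet above, this returns the word tokens with "r", "u", and "ur" expanded, the two emoticons as a separate list, and the shortened URL without its trailing period. Repeated punctuation such as "!!!!" is discarded in this sketch, although it may be worth keeping as a sentiment cue.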

Data Source Justification

Although the ACA was passed in March 2010, public sentiment continues to polarize around the Act and its provisions. The upcoming midterm elections in November 2014 have provided an opportunity for both supporters and opponents to re-energize arguments for and against the Act (1, 2). In addition, a number of legal challenges to the Act are working their way through the lower courts (1, 2, 3, 4), with an expectation that the conflicting decisions will be referred to the Supreme Court in the near future.

Given current interest in the ACA, and the differing opinions on the pros and cons of the Act, we believe a sufficient number of tweets, with appropriate sentiment and topic variability, will be available through Twitter. Preliminary investigation indicates an available rate of approximately 1,000 tweets/hour, with a wide range of comments and opinions embedded in the tweets we previewed. Based on these findings, we are confident we can collect the raw data needed to support our goals and analysis plan for this project.
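The collection estimates follow directly from the hourly rate, as the short Python sketch below shows. The uptime parameter is a hypothetical addition that models time lost to interrupted capture; it is not a measured value.

```python
TWEETS_PER_HOUR = 1000  # preliminary rate quoted in this proposal


def projected_tweets(days, uptime=1.0):
    """Tweets expected over `days` days, scaled by an assumed capture uptime in [0, 1]."""
    return round(TWEETS_PER_HOUR * 24 * days * uptime)


per_day = projected_tweets(1)    # 1000 * 24
per_week = projected_tweets(7)   # 1000 * 24 * 7
```

With uninterrupted capture this gives 24,000 tweets per day and 168,000 per week, so the figure of about 150,000 quoted in the Data Source section sits slightly below the straight product and is consistent with a somewhat conservative estimate.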

Deliverables

We will provide the following deliverables at the end of the project.

  1. A dataset of collected tweets that contain our ACA keywords.
  2. A set of topics and associated sentiment derived from the tweet dataset.
  3. A short in-class presentation of our findings, discussions of their meaning, and general "lessons learned" from our project.