Text Analysis of the State of the Union Addresses
Data Visualization and Design | CUNY Graduate Center | Summer 2019
Goals
- Explore ways to visualize text
- Explore ways to visualize metadata about a text
Data
This Tableau workbook uses this dataset
Originally from here
Premise
We have a few questions:
- Have there been any trends in the length, lexical density, or significance of State of the Union Addresses?
- Did George W. Bush or Barack Obama have an overall different tone to their addresses?
Getting Started
There are really two datasets here: the metrics, and the first 500 words of each address. These are on two separate sheets. The metrics give an overall, distant description of each address. All of these metrics could reasonably be created in Excel or a text editor. More advanced analysis techniques are beyond the scope of this course, but check out my text analytics lab for more.
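To make concrete what kind of metrics these are, here is a minimal Python sketch that computes a word count, a unique word count, and a sentence count for one text. It is not how the dataset was actually built. The function name `speech_metrics`, the dictionary keys, and the tokenizing rules are assumptions chosen for illustration.

```python
import re

def speech_metrics(text):
    """Return simple descriptive metrics for one speech (illustrative only)."""
    # Words: runs of letters, allowing one internal apostrophe ("nation's").
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Sentences: split on ., ! or ? followed by whitespace. This is a rough
    # heuristic and will miscount abbreviations such as "Mr. Speaker".
    sentences = [s for s in re.split(r"[.!?]+\s+", text.strip()) if s]
    return {
        "word_count": len(words),
        "unique_words": len(set(words)),
        "sentence_count": len(sentences),
    }

# A made-up two-sentence sample, not taken from the dataset.
sample = "The state of the union is strong. The union endures!"
metrics = speech_metrics(sample)
print(metrics)
```

A different definition of a word or a sentence would give different numbers, so metrics like these are only comparable when every speech is processed the same way.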
Merge Data
Merge fields & Pivot
- Join the Metrics sheet to the Words sheet
- Select the columns from w1 to w499 (shift+click)
- Select ‘Pivot’ from the drop-down menu
- Create Year from the first 4 characters of Date
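For readers who want to see what the pivot and the Year calculation do to the data, the following Python sketch reshapes one wide row into one row per word. It is only an approximation of the Tableau steps. The function name `pivot_words`, the `Position` key, the sample values, and the assumption that Date is a string beginning with a four-digit year are all hypothetical.

```python
def pivot_words(row, word_columns):
    """Turn one wide row (w1, w2, ...) into one long row per word."""
    # Year is the first 4 characters of Date, converted to a number.
    # Assumes Date is a string that starts with the year, e.g. "1999-01-19".
    year = int(row["Date"][:4])
    long_rows = []
    for col in word_columns:
        word = row.get(col)
        if word:  # skip empty cells
            long_rows.append({
                "President": row["President"],
                "Year": year,
                "Position": col,        # which wN column the word came from
                "Single Words": word,   # the pivoted value
            })
    return long_rows

# A made-up row with three word columns instead of several hundred.
wide_row = {"President": "Example President", "Date": "1999-01-19",
            "w1": "mr", "w2": "speaker", "w3": ""}
long_rows = pivot_words(wide_row, ["w1", "w2", "w3"])
for r in long_rows:
    print(r)
```

After the pivot, each speech contributes one row per word, which is the shape the word cloud steps need.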
Trends: A Line Chart
- Drag CALC: Year to the Measures Pane and then to the Columns
- Drag Wordcount to the Rows
- From the Analytics pane, drag Trend Line into the view, and then drop it on the Polynomial model type.
- What has happened over time? Have speeches gotten longer or shorter?
- Annotate some key speeches. Right-click a data point.
- Select ‘Annotate Point’ and write a short description in the box.
- You can drag and drop this box – make sure no lines cross and it doesn’t obscure any points or lines on your chart.
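A polynomial trend line is a least-squares curve fitted through the points. The sketch below fits a quadratic in plain Python, assuming a degree of 2. Tableau's own computation is not documented here, so treat this as an illustration of the idea only. The function name `fit_quadratic` and the data are made up and are not real word counts.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*t + c*t**2, where t = x - mean(x)."""
    # Center x so the sums stay small (years are large numbers).
    mean_x = sum(xs) / len(xs)
    ts = [x - mean_x for x in xs]
    # Power sums S[k] = sum(t**k) and moment sums T[k] = sum(y * t**k).
    S = [sum(t ** k for t in ts) for k in range(5)]
    T = [sum(y * t ** k for t, y in zip(ts, ys)) for k in range(3)]
    # Augmented 3x3 normal-equation system for (a, b, c).
    m = [[S[i + j] for j in range(3)] + [T[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    # Needs at least 3 distinct x values, otherwise the system is singular.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    return lambda x: a + b * (x - mean_x) + c * (x - mean_x) ** 2

# Synthetic points that lie exactly on a parabola, for checking the fit.
years = [2000, 2001, 2002, 2003, 2004]
values = [(y - 2000) ** 2 for y in years]
predict = fit_quadratic(years, values)
print(round(predict(2005), 6))
```

With real data the curve will not pass through every point, and a higher degree will follow the points more closely at the risk of exaggerating noise.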
Stacked Bar Chart of Total Words and Unique Words
We want to make a bar chart that shows the difference between the total number of words and the number of unique words that a president uses. This offers a glimpse into the linguistic complexity of the speech. If two speeches cover the same amount of content, you might expect the shorter one to have a higher ratio of unique words to total words.
- Drag President and CALC: Year to Columns.
- Drag Measure Names to Color on the Marks card.
- On Color, right-click Measure Names, select Filter, select the check boxes for ‘Unique Words’ and ‘Word Count’, and then click OK.
- From the Measures pane, drag Measure Values to Rows.
- On the Marks card, change the mark type from Automatic to Bar. For more information, see Bar Mark.
- Clean up your axis labels, etc.
- Sort President by Median of Year
- Change the aliases in the legend (right-click)
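The comparison the chart invites can also be expressed as a single number, the ratio of unique words to total words. The sketch below computes it from the two measures used in the chart. The function name `unique_ratio`, the `label` key, and the numbers are invented for illustration and do not come from the dataset.

```python
def unique_ratio(row):
    """Unique words divided by total words for one speech."""
    if row["Word Count"] == 0:
        return 0.0  # avoid dividing by zero for an empty speech
    return row["Unique Words"] / row["Word Count"]

# Made-up counts for two hypothetical speeches.
speeches = [
    {"label": "shorter speech", "Word Count": 2000, "Unique Words": 700},
    {"label": "longer speech", "Word Count": 8000, "Unique Words": 1600},
]

ratios = {row["label"]: unique_ratio(row) for row in speeches}
for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f}")
```

In Tableau the same ratio could be built as a calculated field from the two measures, which would let you plot it directly instead of comparing bar heights by eye.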
Sentences
We’ve also broken our speeches down by sentence. If we take the number of sentences as a proxy for a written or formal variety versus more colloquial forms, this may give us some insight into the tone of a president’s speeches.
- Drag President and CALC: Year to the Columns
- Drag SentenceCount to the Rows – change the Aggregation to Median to give an overall picture of each president.
- You may want to create a new field that is the number of words divided by the number of sentences to give a more accurate picture of the speaking style of a given president.
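The suggested new field, words per sentence, and the median aggregation can be sketched in Python as follows. The rows are invented placeholders rather than values from the dataset, and the variable names are assumptions. Only the field names Wordcount, SentenceCount, and President come from the steps above.

```python
from collections import defaultdict
from statistics import median

# Made-up rows: one per speech.
rows = [
    {"President": "President A", "Wordcount": 5000, "SentenceCount": 250},
    {"President": "President A", "Wordcount": 6000, "SentenceCount": 200},
    {"President": "President B", "Wordcount": 4000, "SentenceCount": 400},
]

# New field: average words per sentence for each speech.
by_president = defaultdict(list)
for row in rows:
    words_per_sentence = row["Wordcount"] / row["SentenceCount"]
    by_president[row["President"]].append(words_per_sentence)

# Median across each president's speeches, as in the Tableau aggregation.
median_wps = {name: median(values) for name, values in by_president.items()}
print(median_wps)
```

The median is less sensitive than the mean to one unusually long or short address, which is why it gives a steadier picture of each president.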
Word Clouds – a first look into the topics
- Pick 2 presidents to compare – you will make a separate word cloud for each, because using all the presidents is too much.
- Drag Single Words to Text on the Marks card.
- Drag Single Words to Size on the Marks card.
- Right-click Single Words on the Size card and select Measure > Count.
- If necessary, change the Mark type from Automatic to Text.
- Select just the top 100 by dragging ‘Single Words’ to Filter, selecting ‘Top’, and choosing 7.
- To remove stopwords, Command+click all the words that seem meaningless, and then exclude them.
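Behind a word cloud is a simple frequency count: each word is sized by how often it occurs, after stopwords are dropped and only the most frequent words are kept. The Python sketch below shows that logic. The stopword list, the function name `top_words`, and the sample words are assumptions made for illustration.

```python
from collections import Counter

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "we", "is"}

def top_words(words, n):
    """Count words case-insensitively, drop stopwords, keep the top n."""
    counts = Counter(
        w.lower() for w in words if w.lower() not in STOPWORDS
    )
    return counts.most_common(n)

# Made-up words, not taken from any address.
words = ["The", "economy", "and", "the", "people", "of", "the", "economy"]
top = top_words(words, 2)
print(top)
```

Which words count as meaningless is a judgment call, so two people removing stopwords by hand may end up with somewhat different clouds.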
No Stops By Year
- Drag President & YEAR to Columns
- Change the Marks to Circle
- From the Analytics menu, select