Ever wondered who’s the most positive or negative person in your circle of friends or family? Or how these roles change over time? If you have a group chat via WhatsApp, iMessage, or Facebook Messenger then you’re sitting on a trove of data that can answer these questions! Stay tuned to find out how.
I should mention that this post is meant to be non-academic and accessible, in the hope that some readers will recreate it for their own purposes or use it to support loftier inquiries. Post in the comments below if you have questions; the full code is available on my GitHub.
Let’s get started by emailing ourselves a text file (.txt) of our group chat of interest – in this case I’m using my family’s WhatsApp conversation. In a chat, click the three vertical dots in the top right corner, then:
I’ll be using the R programming language in this post, and we can import and clean the data using a few simple functions from the dplyr and stringr packages, combined with some regular expressions.
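library(dplyr)
library(stringr)
library(tibble)

text_df <- tibble(text = readLines('YOUR_TXT_FILE')) %>%
  slice(4:n()) %>%  # drop the first three lines of the export
  mutate(
    time = as.Date(str_match(text, "(?:(?!,).)*"), "%m/%d/%y"),  # text before the first comma
    name = str_extract(as.character(str_match(text, "-.*:")), "[A-Za-z']+"),  # first word after the dash
    body = as.character(str_match(text, "[^:]*$"))  # text after the last colon
  ) %>%
  select(-text)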
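One caveat with the patterns above: `[^:]*$` keeps only what follows the last colon, so a message like “Dinner at 6:30?” would be cut down to “30?”. A possible alternative, sketched here on made-up sample lines, pulls all three fields out with a single pattern and capture groups. It assumes each message starts like `4/12/17, 9:15 PM - Mom: ...`; exports from other phones or locales may look different, and the names `raw_lines`, `pattern`, and `parsed` are my own.

```r
library(stringr)
library(tibble)

# Hypothetical sample lines; a real export may use a different date or time layout.
raw_lines <- c(
  "4/12/17, 9:15 PM - Mom: Love you all",
  "4/12/17, 9:17 PM - Dad: Dinner at 6:30?",
  "4/13/17, 8:02 AM - Messages to this group are now secured"
)

# One pattern with three capture groups: date, sender, message body.
pattern <- "^(\\d{1,2}/\\d{1,2}/\\d{2}), [^-]+ - ([^:]+): (.*)$"
parts <- str_match(raw_lines, pattern)

parsed <- tibble(
  time = as.Date(parts[, 2], format = "%m/%d/%y"),
  name = parts[, 3],
  body = parts[, 4]
)

# Lines without a "Name: message" part (such as system notices) come back as NA.
parsed <- parsed[!is.na(parsed$name), ]
print(parsed)
```

Because the body group is simply “everything after the first `Name: `”, colons inside a message survive, and full sender names with spaces are kept whole.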
This gives us 3 columns: the time a text was sent, the sender’s name, and the message itself. From here we need to tokenize the message portion (splitting it so that each row holds a single word from a single person), remove unhelpful “stop words” (e.g. “and”, “the”, “was”, etc.), and count by person-word combination. We can then see the top words used by each person, which often says a lot about one’s personality:
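Here is a minimal sketch of those three steps using the tidytext package, run on a tiny made-up chat so it works on its own. The `chat`, `word_counts`, and `top_words` names are placeholders of mine. The later snippets in this post expect a `word` column in `text_df`, so presumably the tokenized result gets assigned back to `text_df` in the full code.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stand-in for the cleaned chat data (same name/body columns as text_df).
chat <- tibble(
  name = c("Mom", "Mom", "Dad", "Dad"),
  body = c("Love you all", "Love the garden photos",
           "The game was close", "Another game tonight")
)

word_counts <- chat %>%
  unnest_tokens(word, body) %>%           # one lowercase word per row
  anti_join(stop_words, by = "word") %>%  # drop stop words such as "the" and "was"
  count(name, word, sort = TRUE)          # tally each person-word pair

# Keep at most five of the most frequent words for each person.
top_words <- word_counts %>%
  group_by(name) %>%
  slice_max(n, n = 5, with_ties = FALSE) %>%
  ungroup()

print(top_words)
```

`stop_words` is the stop-word table that ships with tidytext; swapping in a custom list is just a matter of passing a different data frame with a `word` column to `anti_join()`.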
Top 5 Words per Person
What’s interesting is how quickly you begin to see yourself and your relationships on a macro scale. If you’ve met my family you’ll know we’re generally an upbeat bunch, and I would guess that we’re extra perky on WhatsApp. That said, I was still surprised to see that my Mom had managed to use the word “love” 31 times in a few months – almost 4 times as often as the next person’s most commonly used word.
Who is the cheeriest of the bunch?
Things get really interesting when we attach sentiment scores to our words. I’ll be using the 11-point AFINN scoring system, which you can think of as follows:
- extremely positive words (“breathtaking”) score as +5
- extremely negative words (“bastard”) score as -5
- there doesn’t appear to be a 0 score: purely neutral words (“neutral”) and words lacking sentiment (“tree”) are both coded as missing
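To make the scoring concrete, here is a toy sketch with a four-word stand-in for the lexicon. The scores and the `toy_lexicon`, `words`, and `scores` names are made up for illustration; the real AFINN list has thousands of entries. The point is that a word with no score simply drops out of the join, so each person’s average is taken over scored words only.

```r
library(dplyr)
library(tibble)

# Hypothetical mini lexicon standing in for AFINN; the scores are illustrative.
toy_lexicon <- tibble(
  word  = c("breathtaking", "love", "sad", "bastard"),
  score = c(5, 3, -2, -5)
)

# Hypothetical tokenized chat: one row per person-word.
words <- tibble(
  name = c("Mom", "Mom", "Mom", "Dad", "Dad"),
  word = c("love", "breathtaking", "tree", "sad", "love")
)

scores <- words %>%
  inner_join(toy_lexicon, by = "word") %>%  # "tree" has no score, so it drops out
  group_by(name) %>%
  summarise(mean_score = mean(score), n_scored = n())

print(scores)
```

Mom’s average here comes from two scored words rather than three, which is worth keeping in mind for anyone whose messages contain few “charged” words.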
And the cheeriness award goes to…
library(tidytext)
library(ggplot2)

text_df %>%
  group_by(name) %>%
  inner_join(get_sentiments("afinn")) %>%  # keep only words that have an AFINN score
  summarise(total = mean(score)) %>%  # newer tidytext releases name this column "value"
  mutate(name = reorder(name, total)) %>%
  arrange(desc(total)) %>%
  ggplot(aes(name, total, fill = name)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  theme(axis.text = element_text(size = 14))
Overall Sentiment Scores
(Larger Value = More Positive)
My sister-in-law Gina! She manages to edge out my Mom despite tough competition. My Dad comes in at roughly a quarter of Gina’s score… but is still positive on average! Good on ya Dad.
Sentiment Changes Over Time
We can learn a lot more once we consider our date variable, which lets us track changes in sentiment over time by family member. Again, just a bit of data manipulation is needed:
sent <- text_df %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(time, name) %>%
  mutate(sentiment = mean(score))  # mean score per person per day
And we’re ready to create our first plot:
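The plotting code isn’t shown here, so what follows is only a guess at its general shape: one smoothed curve per person, each in its own panel. It runs on made-up data with the same columns as `sent`; the `toy_sent` and `p` names, the loess smoother, and the facet layout are all assumptions of mine.

```r
library(tibble)
library(ggplot2)

# Hypothetical scores in the same shape as `sent` (time, name, sentiment).
set.seed(1)
toy_sent <- tibble(
  time = rep(seq(as.Date("2017-03-01"), by = "week", length.out = 12), times = 2),
  name = rep(c("Mom", "Dad"), each = 12),
  sentiment = c(rnorm(12, mean = 2, sd = 0.5), rnorm(12, mean = 1, sd = 0.5))
)

# One smoothed sentiment curve per person, each in its own panel.
p <- ggplot(toy_sent, aes(time, sentiment, colour = name)) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  facet_wrap(~ name) +
  theme(legend.position = "none")

print(p)
```

Dropping `facet_wrap()` and keeping the legend would overlay the curves in a single panel instead.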
Sentiment by Person Over Time
The first thing to note is that Bob’s curve (top left) is erratic due to few of his words being “charged” with positive or negative sentiment. I’ll remove him when creating our final chart, which combines the remaining 5 curves into one graph. Sorry Bob.
Family Member Sentiment Over Time
And there you have it: my family’s sentiment quantified over time. The legend is hard to see, but my Dad is the yellowish brown, Patrick (Pook) is pink, Louis is turquoise, Mom is darker blue, and Gina is green.
Now comes the fun part: over-speculating as to the significance of the trends…
- What happened in mid-to-late April that caused Gina, Louis, and Dad to be on an upswing? Are our moods more influenced by changes in weather brought on by Spring?
- Why is Louis going on an emotional roller coaster between March and May?
- Why are Gina, Louis, and Mom increasing in positivity as of June, while Patrick and Dad are decreasing?
- Is Bob an emotionless robot!?!?
All kidding aside, the reality is that the sample I had access to was quite small, at around 1,000 texts (I lost my phone last winter. Still bitter). With a larger sample and a bit of creativity, these methods could be used to answer more profound questions, such as:
- Who is the moodiest (exhibiting the greatest variance in sentiment)?
- Are we happier on weekends than on weekdays? (a simple t-test could examine this)
- Do certain family members tend to change sentiment together over time? (think correlation.)
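For the curious, here is a rough sketch of how each of those three questions might be approached. It uses simulated daily scores rather than my family’s data, and the `daily`, `moodiness`, `weekend_test`, and `together` names are mine. Note that the t-test here pools everyone’s days and treats them as independent, which is a simplification.

```r
library(dplyr)
library(tibble)
library(tidyr)

# Hypothetical per-day scores shaped like `sent` (time, name, sentiment).
set.seed(42)
days <- seq(as.Date("2017-04-01"), by = "day", length.out = 28)
daily <- tibble(
  time = rep(days, times = 2),
  name = rep(c("Gina", "Louis"), each = 28),
  sentiment = c(rnorm(28, mean = 2, sd = 0.3), rnorm(28, mean = 1.5, sd = 1.2))
)

# 1. Moodiest: the person with the largest variance in daily sentiment.
moodiness <- daily %>%
  group_by(name) %>%
  summarise(variance = var(sentiment)) %>%
  arrange(desc(variance))

# 2. Weekend vs weekday: Welch two-sample t-test on the daily scores.
#    format(time, "%u") is the weekday number, with 6 and 7 being Saturday and Sunday.
daily <- daily %>%
  mutate(day_type = if_else(format(time, "%u") %in% c("6", "7"), "weekend", "weekday"))
weekend_test <- t.test(sentiment ~ day_type, data = daily)

# 3. Moving together: correlate two people's scores on matching days.
wide <- daily %>%
  select(time, name, sentiment) %>%
  pivot_wider(names_from = name, values_from = sentiment)
together <- cor(wide$Gina, wide$Louis, use = "complete.obs")

print(moodiness)
print(weekend_test$p.value)
print(together)
```

With real chat data, days on which someone sent nothing scored would be missing, which is why the correlation uses `use = "complete.obs"`.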
As this was supposed to be fun, why not end with a wordcloud as a souvenir? One could even print it, frame it, and give it out as a unique gift.
text_df %>% count(word) %>% wordcloud2(shape = "circle")
Gotta love that.
That’s all for now folks.
- I owe much of this to the great David Robinson of Variance Explained (among other great things). He and the also great Julia Silge wrote a book that I’ve drawn on heavily (Text Mining with R). Check it out!
- I’m quite sure that grammatical and methodological errors lie within. Please let me know via comments or email.