BIG DATA Journaling

Submitted by Karthik Bala on Wed, 04/18/2018 - 12:10am

The title is clickbait, sorry.

Official Semester Power Rankings

I love asking friends what their favorite semester has been and why. It reveals a lot about what they value and what influences their emotions. Most people reciprocate the question, so I want to be prepared with the best possible answer, using the journal I've kept for the last 4ish years.

Thankfully, high school Karthik had the foresight to keep a collection of plain text files, one for each day, rather than an array of fancy Moleskines, so I was able to write a haphazard Python script to explore all this personal data.

To answer the question, I computed the sentiment of each entry using the very easy-to-use TextBlob package, then averaged the scores by semester.
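Roughly, that pipeline could look like the sketch below. It is not the actual script: the file naming (YYYY-MM-DD.txt), the helper names, and the use of a plain average are all assumptions made for illustration. The three-way semester split is also a guess and would need adjusting for whatever terms you actually use. The only TextBlob call used is TextBlob(text).sentiment.polarity, which returns a float between -1 and 1.

```python
from collections import defaultdict
from datetime import date
from pathlib import Path
from statistics import mean


def semester_of(day):
    # Assumed boundaries, for illustration only: Jan-May, Jun-Aug, Sep-Dec.
    if day.month <= 5:
        term = "Spring"
    elif day.month <= 8:
        term = "Summer"
    else:
        term = "Fall"
    return f"{term} {day.year}"


def textblob_polarity(text):
    # TextBlob's polarity is a float from -1.0 (negative) to 1.0 (positive).
    from textblob import TextBlob  # third-party: pip install textblob
    return TextBlob(text).sentiment.polarity


def rank_semesters(entries, score=textblob_polarity):
    # entries: iterable of (date, text) pairs; returns semesters, best first.
    by_semester = defaultdict(list)
    for day, text in entries:
        by_semester[semester_of(day)].append(score(text))
    averages = {sem: mean(vals) for sem, vals in by_semester.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)


def load_entries(folder):
    # Hypothetical layout: one file per day, named YYYY-MM-DD.txt.
    for path in sorted(Path(folder).glob("*.txt")):
        yield date.fromisoformat(path.stem), path.read_text(encoding="utf-8")
```

Something like rank_semesters(load_entries("journal")) would then produce the ranking. Passing the scorer in as a parameter makes it easy to swap TextBlob for hand-entered scores later.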

GRAPH #1

This seemed vaguely believable at a glance, but on closer inspection some semesters were definitely mis-ranked. Winter 2014, for example, was a wholly unexceptional period of my life and shouldn't have had the worst sentiment.

I sorted the entries by sentiment and did some manual investigation. There were a lot of issues.

I had one entry that consisted entirely of this strange run-on: "tried to install mopidy soundcloud and then just started admiring fonts and code quality in this random library and now im trying to change my font". It got a sentiment of -0.5 (on a -1 to 1 scale), one of the lowest detected. I'm not sure if TextBlob isn't a very good library, if ranking my style of speech is an unsolved problem, or if sentiment analysis in general doesn't work well in these kinds of situations. Intro to Data Mining hadn't prepared me for this.

I also realized that 80% of my entries were either heavily sarcastic or ridden with slang, both situations that TextBlob, or most easy-to-use NLP libraries, couldn't handle (understandably). "Last night was so lit" returned a sentiment of 0, which neutralized half of my happy entries, while "this is hopeless lmao" returned a significantly positive sentiment of 0.6.

Let down by my optimism about cut-and-paste machine learning solutions, I fell back to entering the sentiment of each day by hand. In the future, I might ask someone smarter than me for help tuning an existing NLP model with all of this newly created training data.

While I was originally put off by the task, feeling like a Mechanical Turker, it was a lot of fun going back through my life, putting myself in old shoes, and retrospectively rating days from these past perspectives. More exciting were the results, a mostly accurate answer to my original question:

GRAPH #2

I'll talk about why each semester landed where it did in later posts.

Once I accomplished my first goal, I poked around for other correlations.

An interesting one was entry length by month:

GRAPH #3

There's a significant spike in June. On further investigation, I found that these entries all tried to capture whatever new city I was in for whatever internship I was at, demanding a lot of words. It was a lot of complaining about my intern projects.
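A length-by-month breakdown like this one can be computed in a few lines. The sketch below assumes length means word count and that each month is summarized by its mean across all years; the original analysis may have measured it differently.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def mean_words_by_month(entries):
    # entries: iterable of (date, text); returns {month number: mean word count}.
    lengths = defaultdict(list)
    for day, text in entries:
        lengths[day.month].append(len(text.split()))
    return {month: mean(counts) for month, counts in sorted(lengths.items())}
```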

Next, I tried to find a correlation between entry length/frequency and sentiment, looking to answer the question, do I write more when I'm happy or sad? There was a weak negative correlation between entry frequency and sentiment, meaning I journal more to vent/rant than to celebrate, which might skew the scores but wouldn't change the relative rankings.
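One way to check a relationship like this is a Pearson correlation between how many entries fall in a period and the mean sentiment of that period. The sketch below buckets by calendar month, which is an assumption (the post doesn't say what period was used), and the function names are made up for illustration.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length sequences.
    # Raises ZeroDivisionError if either sequence is constant.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def frequency_vs_sentiment(scored):
    # scored: iterable of (date, sentiment) pairs.
    # Buckets by calendar month (assumed), then correlates
    # entries-per-month with that month's mean sentiment.
    buckets = defaultdict(list)
    for day, sentiment in scored:
        buckets[(day.year, day.month)].append(sentiment)
    counts = [len(v) for v in buckets.values()]
    means = [mean(v) for v in buckets.values()]
    return pearson(counts, means)
```

A result below zero means the busier months tend to be the gloomier ones.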

Finally, I created one word cloud for the bottom 100 entries by sentiment and another for the top 100, to see the common influencers of my emotions. The findings were accurate and funny, but much too explicit for this blog. I'll show you if you ask.
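The counting step behind a pair of word clouds like these might look like the sketch below. It only picks out the extreme entries and tallies their words; drawing the cloud is left to whatever rendering tool you prefer. The stopword list is a tiny placeholder and the function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "i", "to", "of", "was", "it", "in", "my"}


def word_counts(texts):
    # Lowercase, split into words, drop stopwords, and tally what's left.
    words = []
    for text in texts:
        words += [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    return Counter(words)


def extremes(scored, n=100):
    # scored: list of (sentiment, text); returns (bottom n texts, top n texts).
    ranked = sorted(scored, key=lambda pair: pair[0])
    return [t for _, t in ranked[:n]], [t for _, t in ranked[-n:]]
```

Calling word_counts on each half and comparing the most_common() lists gives the same information as the clouds, just less pretty.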
