General Assembly — Data Science Immersive 11 · Capstone

Inside Out: Machine Learning & Lexical Rule-Based Emotion Classifier for Text

A web application that analyses textual input to classify emotions at the sentence level, using a combination of ML classification and lexical rule-based libraries.

"What is on your mind?... What are you thinking about?... A penny for your thoughts?" For most of us, experiencing, interpreting, understanding and reacting to our own emotional states and those around us are very much a part of our daily interactions at the workplace, at home with our family members and other social settings; reacting appropriately to different situations based on the emotional context is key to developing meaningful personal and business relationships.

Appreciation of emotional context is usually done in person, and for good reason: we read emotions best through tonality, body language, facial expressions and many other verbal and non-verbal cues, perceived both consciously and subconsciously. For this reason, emotional interpretation is largely limited to localised settings.

But what if we could glean information on emotional responses 'en-masse' using other less conventional methods? For example, by analysing the 'tonality' of a particular writing style, or 'expressions' based on punctuation or keywords from textual input? What if we could tell how a hundred, or a thousand people are feeling at a particular point in time — in real time no less? These questions motivated my research into Natural Language Processing (NLP) and the development of a tool built on both traditional ML/NLP concepts and lexical rule-based libraries.

Problem Statement

It is difficult to gather emotional responses quickly, cheaply, and efficiently at scale.

Proposed Solution

A web-application that can process textual input to provide both a broader analysis of the emotional response to specific topics (e.g. Singapore General Elections) and a more granular analysis of sentence-level emotional responses with associated keywords and phrases.

Note: the current model is only built for text input of up to a few paragraphs long.

Use Cases for "Emotional Listening"

The following are examples of potential pain points and use cases:

Pain Points

1. Costly and time-consuming to gather mass implicit feedback: Information on emotional response to an event (a press release, a public speech) could inform better decision-making, but obtaining implicit feedback in real-time at scale is often prohibitively expensive. While explicit feedback can be obtained from social media platforms (e.g. Facebook emoji reactions), this only captures a secondary reaction. It would be more powerful to get an implicit primary reaction from people's own textual input on public platforms.

2. Difficult to obtain real-time information to take preventive action: Unfortunate events like mass-killings are often preceded by telling messages indicative of troubled emotions — but are usually only discovered post-mortem.

3. Lack of nuance in basic sentiment analysis: Current sentiment analysis is usually binary (positive/negative). However, even within the category of "negative," the appropriate response varies widely: disappointment that a product feature wasn't released is very different from anger that a product is dangerous. Granular emotion classification enables more appropriate responses.

Use Cases

  1. Emotional analysis as an additional data point for digital marketing
  2. Better recommendation systems (think: Spotify playlist suggestions that match your current mood)
  3. Real-time "emotional listening" to pre-empt bad events — dynamically scraping a live source and generating real-time emotion breakdowns
  4. More granular sentiment analysis for policy makers
  5. Mental health tracker: individuals can journal and track their emotional state over time
  6. Assistive technology for disabilities: combined with speech-to-text, this could help those with impaired ability to pick up emotional nuances in daily interactions

1. Data Cleaning and Pre-Processing

1.1 Combining 3 Datasets

Three pre-tagged datasets were combined:

  1. Kaggle dataset (~415,000 tweets): tagged across 6 emotions — sadness, joy, love, anger, fear, surprise
  2. Figure Eight dataset (~40,000): 13 emotions including neutral, enthusiasm, hate, boredom, relief
  3. Figure Eight dataset (~2,500): 18 emotions including anticipation, aggression, contempt, disapproval, remorse
Bar chart showing all emotions from the 3 datasets (%)

1.2 Problem of Imbalanced Data

There was clearly a long tail of more nuanced emotion types with far fewer observations than "plain vanilla" types like joy and anger. This imbalanced dataset would impact prediction accuracy. I chose to subset only the top 8 emotions by count to form the final training dataset.

% Breakdown of Emotion Shortlist
Bar chart showing Emotion Shortlist

Even after shortlisting the top 8 emotions, significant class imbalance remained. I addressed this using random upsampling within each emotion class after train-test-split, to prevent the same observations from appearing in both training and testing sets.
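
A minimal sketch of this per-class upsampling, assuming the X_train / y_train split produced in section 2.1 and the column names used there (the choice to upsample every class to the size of the largest one is an assumption):

import pandas as pd
from sklearn.utils import resample

# Recombine the training split, then randomly upsample each emotion class
# (training data only, so no leakage into the test set).
train_df = pd.concat([X_train, y_train], axis=1)
max_size = train_df['emotions_label'].value_counts().max()

train_upsampled = pd.concat([
    resample(group, replace=True, n_samples=max_size, random_state=42)
    for _, group in train_df.groupby('emotions_label')
])

X_train_up = train_upsampled['clean_text']
y_train_up = train_upsampled['emotions_label']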

1.3 Text Cleaning Function

A custom cleaning function strips non-letters, converts to lowercase, removes stopwords, and lemmatises:

from bs4 import BeautifulSoup
import regex as re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def clean_text(raw_post):
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_post, "html.parser").get_text()

    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)

    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()

    # 4. Convert stop words to a set (faster lookup)
    stops = set(stopwords.words('english'))

    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]

    # 6. Lemmatise
    lem_meaningful_words = [lemmatizer.lemmatize(i) for i in meaningful_words]

    # 7. Join words back into one string and return
    return(" ".join(lem_meaningful_words))

There were 5,519 values (1.22% of observations) that turned up null after cleaning and were dropped — leaving nearly 99% of the dataset to work with.

1.4 Cleaning Difficulties

2. Baseline Model

2.1 Train-Test-Split

from sklearn.model_selection import train_test_split

X = emo_short['clean_text']
y = emo_short['emotions_label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42
)

2.2 Vectorising Word Features

There are two main vectorisers: CountVectorizer (counts occurrences of each unique word) and TF-IDF Vectorizer (assigns a score based on how distinctive a word is across the corpus).

Without restricting the number of word features, the total unique vocabulary exceeded 70,000 words. With ~350,000 training rows, the full sparse matrix would contain over 23 billion cells — my kernel crashed attempting this. I limited to 2,000 features with a minimum document frequency of 10.
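
A minimal sketch of the restricted vectoriser, assuming the X_train / X_test split from section 2.1 (the feature cap and document-frequency floor are the values quoted above):

from sklearn.feature_extraction.text import CountVectorizer

cvec = CountVectorizer(max_features=2000, min_df=10)
X_train_cvec = cvec.fit_transform(X_train)   # learn the vocabulary on training text only
X_test_cvec = cvec.transform(X_test)         # reuse the same vocabulary for the test set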

Sparse matrix containing unique word features — each row is a document, each column a word

Finding the most frequently used words per emotion:

joy_df = train_cvec_combined_df[train_cvec_combined_df['emotions_label'] == 2].copy()
joy_df = joy_df.drop('emotions_label', axis=1)
top_20_joy = joy_df.sum().sort_values(ascending=False).head(20)
top_20_joy.plot.barh(figsize=(11,6));

Top 20 word features for 'Joy' by Count — note how many generic words appear

A key observation: there were many overlapping high-frequency words across the 8 emotion classes (e.g. "im", "feeling", "feel", "know", "day") — not particularly informative for distinguishing emotions. Only 36.8% of top-20 words across all classes were unique.

By contrast, the words with the highest model coefficients were much more distinctive:

Top 10 word features for 'Joy' by Logistic Regression Coefficient — far more distinctive
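
As a sketch, those coefficient rankings can be pulled out of the fitted logistic regression from section 2.3 below, assuming the cvec vectoriser sketched earlier and the label encoding where 2 denotes 'joy':

import pandas as pd

feature_names = cvec.get_feature_names_out()        # get_feature_names() on older scikit-learn
joy_idx = list(baseline_model.classes_).index(2)    # class 2 encodes 'joy', as in the count plot
joy_coefs = pd.Series(baseline_model.coef_[joy_idx], index=feature_names)
print(joy_coefs.sort_values(ascending=False).head(10))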

2.3 Baseline Model Results

Used CountVectorizer + Logistic Regression:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=400000)
baseline_model = lr.fit(X_train_cvec, y_train)

Metric              Score
K-fold CV (mean)    0.848
Train accuracy      0.865
Test accuracy       0.849

The K-fold scores are consistent across 5 folds, and the train/test gap is small — the model is not overfitted. Overall, strong results.
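
The 5-fold check can be reproduced with something like the following, assuming the variable names from the snippets above:

from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(lr, X_train_cvec, y_train, cv=5)
print(cv_scores, cv_scores.mean())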

3. Tuning and Alternative Models

TF-IDF Vectorizer

Top 20 word features for 'Joy' by TF-IDF Value

Surprisingly, TF-IDF's top words were similar to CountVectorizer's, and the model performed slightly worse:

Metric           Score
Train accuracy   0.848
Test accuracy    0.841

Naive Bayes (Grid Search)

Used a pipeline with GridSearchCV across 4 model/vectorizer combinations and 96 parameter combinations:

from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline

models1 = {
    'TVEC + Bernoulli': Pipeline([("vec",TfidfVectorizer()),("nb",BernoulliNB())]),
    'TVEC + Multinomial': Pipeline([("vec",TfidfVectorizer()),("mb",MultinomialNB())]),
    'CVEC + Bernoulli': Pipeline([("vec",CountVectorizer()),("nb",BernoulliNB())]),
    'CVEC + Multinomial': Pipeline([("vec",CountVectorizer()),("mb",MultinomialNB())])
}

Top 5 GridSearch Scores — TVEC + Bernoulli led at 0.858

The top score goes to TVEC + Bernoulli at 0.858, but the improvement over the baseline is marginal. The optimal parameters were consistently n-gram range (1,2) and max_features = 3,000 — intuitive, as larger n-gram ranges better capture context and more features provide more signal.
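
The search itself could look something like this; the exact parameter grid is an assumption, as the text only confirms that ngram_range=(1, 2) and max_features=3,000 came out on top:

from sklearn.model_selection import GridSearchCV

params = {
    'vec__max_features': [1000, 2000, 3000],
    'vec__ngram_range': [(1, 1), (1, 2)],
    'vec__min_df': [1, 5, 10],
}

for name, pipe in models1.items():
    gs = GridSearchCV(pipe, param_grid=params, cv=5, n_jobs=-1)
    gs.fit(X_train, y_train)   # pipelines take raw text; the vectoriser is fitted inside each fold
    print(name, round(gs.best_score_, 3), gs.best_params_)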

Random Forest and Extra Trees

Model                      Train score   Test score
Random Forest (10 trees)   0.947         0.808
Extra Trees (10 trees)     0.951         0.811

Both are heavily overfitted (large train/test gap) and perform worse than the baseline. With more trees they might improve — but the computational cost is prohibitive for ~500k rows.

Artificial Neural Network

from keras.models import Sequential
from keras.layers import Dense

nn_model = Sequential()
nn_model.add(Dense(256, input_dim=n_input, activation='relu'))   # n_input = number of vectorised word features
nn_model.add(Dense(n_output, activation='softmax'))              # n_output = 8 emotion classes
nn_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Two ANN configurations were tried. Both achieved ~0.85 validation accuracy — equivalent to the baseline — and the simpler model began overfitting after 1.5 epochs. No improvement over Logistic Regression.
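
A sketch of how the training call might have looked, assuming integer emotion labels and the vectorised training matrix from earlier; the epoch count, batch size and validation split are assumptions:

from keras.utils import to_categorical

y_train_oh = to_categorical(y_train)           # categorical_crossentropy expects one-hot labels

history = nn_model.fit(
    X_train_cvec.toarray(), y_train_oh,        # densify the sparse document-term matrix for Keras
    epochs=5,
    batch_size=256,
    validation_split=0.2
)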

4. Model Selection

After testing six approaches — Logistic Regression (CVEC/TVEC), Naive Bayes (Multinomial/Bernoulli), Random Forest, Extra Trees, SVM (abandoned — too slow), and ANN — the baseline CountVectorizer + Logistic Regression was selected.

Why the simplest model won: Accuracy equal to or better than most alternatives, and significantly more computationally efficient. Given that the end goal was Heroku deployment (with limited free-tier memory), computational efficiency was a hard constraint.

5. Lexical Rule-Based Libraries for Sentiment Analysis

In addition to emotion classification at the sentence level, I wanted to contextualise those emotions using lexical libraries to provide valence scores and highlight key phrases associated with the classified emotion.

VADER (Valence Aware Dictionary and sEntiment Reasoner)

VADER is "a rule-based sentiment analysis engine incorporating grammatical and syntactical rules, including empirically derived quantifications for the impact of each rule on perceived sentiment intensity." It goes beyond bag-of-words by handling emoticons, capitalisation, punctuation, and word order.

Example of how VADER assigns valence scores to text inputs

How VADER adds context:

  1. Valence vs. emotion: Valence measures the intrinsic positivity/negativity of a word or phrase, independent of the specific emotion. For example, "I think you are a racist person" carries a negative valence without clearly indicating the speaker's emotion.
  2. Emotional intensity: Two sentences classified as "angry" may differ significantly in intensity — the more negative valence score indicates more intense anger.
  3. Model validation: Valence scores should correlate with classified emotions (negative valence for sadness/anger/fear). Mismatches flag potential misclassifications.
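
A minimal sketch of the valence scoring described above, using NLTK's bundled VADER implementation (the example sentences are illustrative only):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')                 # one-off download of the VADER lexicon
vader = SentimentIntensityAnalyzer()

print(vader.polarity_scores("I'm a bit annoyed about the delay."))
print(vader.polarity_scores("I'm FURIOUS about the delay!!!"))
# Both compound scores are negative, but the second is far more so,
# reflecting the capitalisation and punctuation cues VADER is built to read.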

TextBlob

TextBlob was used for noun-phrase extraction and part-of-speech (POS) tagging to identify adjectives and verbs associated with the classified emotion.

TextBlob part-of-speech tags — for this project I extracted JJ (adjectives) and VB (verbs)
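
A quick sketch of the two TextBlob calls relied on here; the example sentence is illustrative only, and noun_phrases needs the TextBlob corpora downloaded once:

from textblob import TextBlob

blob = TextBlob("The keynote was genuinely inspiring and the new features look promising")
print(blob.noun_phrases)                                             # extracted noun phrases
print([word for word, tag in blob.pos_tags if tag in ('JJ', 'VB')])  # words tagged as adjectives or base-form verbs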

6. Combining Functions for Deployment

The emo_machine() function combines the full pipeline: takes in raw text, tokenises into sentences, cleans for the ML model while keeping raw text for VADER (which works better on uncleaned text), runs classification, VADER scoring, and TextBlob extraction, and returns a sorted dataframe.

import pickle

import nltk
import pandas as pd
from textblob import TextBlob
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# VADER analyser (the standalone vaderSentiment package exposes the same polarity_scores API)
vader = SentimentIntensityAnalyzer()

def emo_machine(text_string_input):

    emo_summary = pd.DataFrame()
    vader_scores = []
    bnp = []
    b_pos_tags = []
    pos_subset = ['JJ']  # adjectives only

    # Load pickled model artifacts
    with open("lr_baseline_model", "rb") as pickle_in:
        lr_baseline_model = pickle.load(pickle_in)
    with open("cvec", "rb") as pickle_in:
        cvec = pickle.load(pickle_in)
    with open("clean_text", "rb") as pickle_in:
        clean_text = pickle.load(pickle_in)

    # Tokenise input into sentences
    text_input_sent = nltk.tokenize.sent_tokenize(text_string_input)

    # Clean each sentence for ML model
    clean_text_input_sent = [clean_text(i) for i in text_input_sent]

    # Vectorise and predict emotions
    X_input_cvec = cvec.transform(clean_text_input_sent)   # keep sparse; LogisticRegression accepts sparse input
    predictions = lr_baseline_model.predict(X_input_cvec)

    # VADER scores on RAW text (not cleaned)
    for sent in text_input_sent:
        vader_scores.append(vader.polarity_scores(sent)['compound'])

    # TextBlob noun-phrases and POS tags on cleaned text
    for sent in clean_text_input_sent:
        blob = TextBlob(sent)
        bnp.append(blob.noun_phrases)
        sent_pos_tags = [word for word, tag in blob.pos_tags if tag in pos_subset]
        b_pos_tags.append(sent_pos_tags)

    emo_summary['emotions'] = predictions
    emo_summary['sentences'] = text_input_sent
    emo_summary['vader_scores'] = vader_scores
    emo_summary['blob_phrases'] = bnp
    emo_summary['blob_tags'] = b_pos_tags

    return emo_summary.sort_values(by=['vader_scores'], ascending=False)

Final output from the combined emo_machine() function in Jupyter Notebook

7. Flask Deployment

The function was wrapped in a Flask application with two routes: a home page for text input and a results page that generates the emotion breakdown.

from flask import Flask, render_template, request

app = Flask(__name__)

@app.route('/')
def index():
    # ReviewForm is a WTForms form defined elsewhere in the app
    form = ReviewForm(request.form)
    return render_template('reviewform.html', form=form)

@app.route('/results', methods=['POST'])
def results():
    form = ReviewForm(request.form)
    if request.method == 'POST' and form.validate():
        review = request.form['moviereview']
        emotions = emo_machine(review)   # run the pipeline once and reuse the result
        return render_template('results.html',
                               content=review,
                               tables=[emotions.to_html()],
                               titles=['emotional breakdown'])
    return render_template('reviewform.html', form=form)

8. Heroku Deployment

The deployed Inside Out web application on Heroku

Note: The Heroku free tier has since been discontinued — the live link is no longer active. The code is available on GitHub.

Post-Mortem: Improvements and Reflections

Model Improvements

UI/UX Improvements

Deployment Improvements

Other Reflections

Honest Limitations
  • Text as a single reference point for emotion loses verbal intonation, facial expressions, etc.
  • A single sentence can contain multiple emotional states — how many should be classified?
  • Sarcasm and passive-aggressiveness are extremely difficult to detect reliably.
  • Veiled negativity (neutral-sounding statements that mask criticism, e.g. "his idea was interesting") is hard to catch.
  • Slang is localised and dynamic — models trained on American social media won't generalise to Singaporean colloquialisms. The lexicon needs continuous updates.
  • Lexical methods are context-specific — social media vs. formal prose would require different training data.

Resources