General Assembly — Data Science Immersive · Project 3

Web Scraping and Classifying Posts from Reddit

Applying natural language processing and classification modelling to distinguish posts from two different subreddits.

Overview

Objective: Web scrape posts from two subreddits on Reddit.com and apply natural language processing (NLP) methods and classification modelling to accurately classify posts as belonging to one subreddit or the other.

This project covered the full pipeline from data collection through deployment: using the Pushshift API to collect post titles and text, applying text cleaning and vectorisation, and training several classifiers to distinguish between the two communities.

The full code is available on GitHub.

Approach

1. Data Collection via Web Scraping

Posts were collected from two subreddits using the Pushshift API (pushshift.io), a third-party archive of Reddit data rather than Reddit's own API. The scraper collected post titles and selftext, which were stored as a CSV file for downstream processing. Key considerations:
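The collection step described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual scraper: the endpoint URL, the query parameters (subreddit, size, before), the "data" key in the response, and all function names are assumptions based on how the Pushshift submission search has historically worked, and access to that service may now be restricted.

```python
import csv
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: Pushshift's historical submission-search URL.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission"
FIELDS = ["subreddit", "title", "selftext", "created_utc"]


def build_url(subreddit, size=100, before=None):
    """Build the query URL for one page of submissions."""
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        params["before"] = before  # epoch seconds, used to page backwards
    return f"{PUSHSHIFT_URL}?{urlencode(params)}"


def extract_rows(payload, subreddit):
    """Keep only the fields needed downstream (title and selftext)."""
    rows = []
    for post in payload.get("data", []):
        rows.append({
            "subreddit": subreddit,
            "title": post.get("title", ""),
            "selftext": post.get("selftext", ""),  # empty for link posts
            "created_utc": post.get("created_utc"),
        })
    return rows


def fetch_page(subreddit, size=100, before=None, opener=urlopen):
    """Download and parse one page; `opener` can be swapped out in tests."""
    with opener(build_url(subreddit, size, before)) as resp:
        return extract_rows(json.load(resp), subreddit)


def save_csv(rows, path):
    """Write the collected rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

To collect more than one page, a caller would typically pass the smallest created_utc from the previous page as `before` and pause between requests to stay within rate limits.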

2. Text Pre-Processing and NLP

The raw text was cleaned using a standard NLP pipeline:

Once cleaned, text was converted to numerical form using two vectorisers:

Key vectoriser parameters tuned:
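As a rough illustration of the vectorisation step, the sketch below fits scikit-learn's CountVectorizer and TfidfVectorizer with the same settings. The parameters shown (ngram_range, min_df, max_features) are the ones named in the takeaways; the values and the example posts are made up for demonstration and are not the tuned values from the project.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical cleaned posts, for illustration only.
docs = [
    "my cat does not like the new food",
    "the new puppy does like the food",
    "cat toys are not worth the money",
    "puppy training is worth the effort",
]

# Illustrative parameter values, not the project's tuned ones.
params = dict(
    ngram_range=(1, 2),  # unigrams and bigrams
    min_df=2,            # drop terms that appear in fewer than 2 documents
    max_features=50,     # cap the vocabulary size
)

count_vec = CountVectorizer(**params)
tfidf_vec = TfidfVectorizer(**params)

# Both return sparse matrices of shape (n_documents, n_terms).
X_counts = count_vec.fit_transform(docs)
X_tfidf = tfidf_vec.fit_transform(docs)

print(X_counts.shape, X_tfidf.shape)
print(sorted(count_vec.vocabulary_))  # the terms that survived min_df
```

With identical settings both vectorisers learn the same vocabulary; they differ only in how each cell is weighted (raw counts versus counts scaled by inverse document frequency).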

3. Classification Modelling

Several classifiers were evaluated:

Models were evaluated using:
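A minimal sketch of a train-and-evaluate loop for one candidate model is shown below. It assumes a TF-IDF plus logistic regression pipeline scored on a held-out split; logistic regression and accuracy are both mentioned in the takeaways, while the confusion matrix, the split size, and the toy posts and labels are assumptions added for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical posts from two made-up communities (0 = cats, 1 = dogs).
texts = [
    "my cat sleeps on the keyboard", "why does my cat knock things over",
    "best litter for a kitten", "cat will not stop meowing at night",
    "kitten scratched the new sofa", "my cat hates the vet",
    "how to trim cat claws", "indoor cat wants to go outside",
    "puppy will not stop barking", "best lead for a strong dog",
    "my dog loves the park", "how to crate train a puppy",
    "dog ate my shoes again", "puppy training classes worth it",
    "my dog pulls on walks", "how often to bathe a dog",
]
labels = [0] * 8 + [1] * 8

# Stratify so both classes appear in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Fitting the vectoriser inside the pipeline keeps test data out of the vocabulary.
pipe = Pipeline([
    ("vec", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

preds = pipe.predict(X_test)
acc = accuracy_score(y_test, preds)
cm = confusion_matrix(y_test, preds, labels=[0, 1])

print(f"accuracy: {acc:.2f}")
print(cm)
```

Swapping the "clf" step for another estimator lets the same split and metrics be reused, which keeps comparisons between classifiers like-for-like.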

Key Takeaways

What I Learned
  • The choice of vectoriser (CountVectorizer vs. TF-IDF) made a surprisingly small difference to classification accuracy; the choice of model and its hyperparameter tuning had more impact.
  • Bigrams (n-gram range (1,2)) consistently improved classification because they captured negations and common phrases that unigrams missed.
  • CountVectorizer tends to pick up high-frequency but low-information words; TF-IDF penalises these, but in practice the top features were often still similar between the two.
  • Logistic regression's coefficients are interpretable and useful for understanding which words drive the classification decision — a key advantage over black-box models.
  • Sparse, high-dimensional feature matrices become a real cost at scale: capping max_features and raising min_df keeps the vocabulary, and therefore the computation, manageable.
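
The points about bigrams and interpretable coefficients can be illustrated with a small sketch. It fits a logistic regression on unigram and bigram counts and lists the terms with the most negative and most positive coefficients. The posts, labels, and variable names are hypothetical, and the resulting rankings say nothing about the project's real data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical posts (0 = cats, 1 = dogs), for illustration only.
texts = [
    "my cat is not good with strangers",
    "the cat sleeps all day",
    "cat food prices are not good",
    "my kitten scratched the sofa",
    "the puppy is very good at fetch",
    "puppy training went well today",
    "my dog is not good on the lead",
    "the puppy chewed my shoes",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# ngram_range=(1, 2) adds bigrams such as "not good" alongside single words.
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# vocabulary_ maps term -> column index; order the terms by column.
terms = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
coefs = clf.coef_[0]  # one coefficient per term for the binary case
order = np.argsort(coefs)

print("pushes towards class 0:", list(terms[order[:5]]))
print("pushes towards class 1:", list(terms[order[-5:][::-1]]))
```

A positive coefficient raises the predicted log-odds of class 1 each time the term appears, and a negative one lowers them, which is what makes the ranking readable as "words that drive the decision".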