Overview
Objective: Web scrape posts from two subreddits on Reddit.com and apply natural language processing (NLP) methods and classification modelling to accurately classify posts as belonging to one subreddit or the other.
This project covered the full pipeline from data collection through model evaluation: using the Pushshift API to collect post titles and text, applying text cleaning and vectorisation, and training several classifiers to distinguish between the two communities.
The full code is available on GitHub.
Approach
1. Data Collection via Web Scraping
Posts were collected from two subreddits using the Pushshift API (pushshift.io), a third-party archive of Reddit data. The scraper gathered post titles and selftext and stored them as a CSV for downstream processing. Key considerations (a sketch follows the list):
- Rate limiting — built in delays between API calls to avoid being throttled
- Handling duplicate posts — subreddits can have reposted content, so de-duplication was necessary
- Handling removed/deleted posts — these show as `[removed]` or `[deleted]` and need to be dropped
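A minimal sketch of the collection loop, assuming Pushshift's historical public endpoint; the subreddit names, batch sizes, and column choices are illustrative, not the project's actual settings. (Pushshift's public endpoint has since been restricted, so this reflects its interface at the time.)

```python
import time

import pandas as pd
import requests

URL = "https://api.pushshift.io/reddit/search/submission"
SUBREDDITS = ["learnpython", "datascience"]  # illustrative, not the project's pair

def fetch_posts(subreddit, n_posts=1000, batch_size=100):
    """Page backwards through a subreddit's submissions via Pushshift."""
    posts, before = [], None
    while len(posts) < n_posts:
        params = {"subreddit": subreddit, "size": batch_size}
        if before is not None:
            params["before"] = before
        batch = requests.get(URL, params=params).json()["data"]
        if not batch:
            break
        posts.extend(batch)
        before = batch[-1]["created_utc"]  # resume paging from the oldest post fetched
        time.sleep(1)  # rate limiting: pause between API calls
    return posts

frames = [pd.DataFrame(fetch_posts(sub))[["title", "selftext", "subreddit"]]
          for sub in SUBREDDITS]
posts = pd.concat(frames, ignore_index=True)
posts = posts.drop_duplicates(subset=["title", "selftext"])         # reposts
posts = posts[~posts["selftext"].isin(["[removed]", "[deleted]"])]  # moderated posts
posts.to_csv("posts.csv", index=False)
```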
2. Text Pre-Processing and NLP
The raw text was cleaned using a standard NLP pipeline (sketched after this list):
- Strip HTML tags (using `BeautifulSoup`)
- Remove non-letter characters (using regex)
- Convert to lowercase
- Remove stopwords (using `nltk`)
- Lemmatise words (using `WordNetLemmatizer`)
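A sketch of how these steps might chain together; the function name and the exact ordering are assumptions, not the project's code:

```python
import re

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(raw):
    text = BeautifulSoup(raw, "html.parser").get_text()  # strip HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text)               # keep letters only
    words = text.lower().split()                         # lowercase and tokenise
    words = [w for w in words if w not in STOPWORDS]     # drop stopwords
    return " ".join(lemmatizer.lemmatize(w) for w in words)

clean_text("<p>The cats were running happily!</p>")
# -> 'cat running happily'
```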
Once cleaned, text was converted to numerical form using two vectorisers:
- CountVectorizer — counts occurrences of each unique word feature
- TF-IDF Vectorizer — assigns scores based on how distinctive a word is within a document relative to the whole corpus
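For reference, the textbook TF-IDF weight behind the second vectoriser (implementations such as scikit-learn add smoothing and normalisation on top):

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log \frac{N}{\text{df}(t)}$$

where $\text{tf}(t, d)$ is the count of term $t$ in document $d$, $N$ is the number of documents, and $\text{df}(t)$ is the number of documents containing $t$.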
Key vectoriser parameters tuned (illustrated in the sketch after this list):
- `max_features` — limit the vocabulary size to reduce sparsity
- `min_df` — only include words appearing in at least N documents
- `ngram_range` — include unigrams and/or bigrams (bigrams improve context capture, e.g., "not happy" vs. "happy")
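Both vectorisers expose these parameters under the same names; the values below are illustrative, not the tuned settings:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Illustrative parameter values, not the project's tuned settings.
cvec = CountVectorizer(max_features=5000, min_df=2, ngram_range=(1, 2))
tvec = TfidfVectorizer(max_features=5000, min_df=2, ngram_range=(1, 2))

docs = ["i am not happy with this", "so happy with the result"]
X_counts = cvec.fit_transform(docs)  # sparse matrix of raw counts
X_tfidf = tvec.fit_transform(docs)   # sparse matrix of TF-IDF weights
print(cvec.get_feature_names_out())  # vocabulary includes unigrams and bigrams
```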
3. Classification Modelling
Several classifiers were evaluated (instantiated in the sketch after this list):
- Logistic Regression — strong baseline for NLP tasks, interpretable coefficients
- Naive Bayes (Multinomial and Bernoulli) — fast and effective for text classification
- Random Forest — captures non-linear relationships but computationally expensive at scale
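A sketch instantiating these candidates; the hyperparameters are scikit-learn defaults (plus a raised iteration cap for logistic regression), not the project's tuned values:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "multinomial_nb": MultinomialNB(),
    "bernoulli_nb": BernoulliNB(),
    "random_forest": RandomForestClassifier(n_estimators=100),
}
```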
Models were evaluated using three checks, combined in the sketch after this list:
- K-fold cross-validation accuracy scores
- Train/test accuracy comparison (to detect overfitting)
- Confusion matrix analysis
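A minimal evaluation loop; `X` and `y` (the vectorised text and binary subreddit labels) and the `models` dict from the previous sketch are assumed:

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# X: vectorised text from earlier; y: binary subreddit label (both assumed).
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

for name, model in models.items():
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)  # k-fold CV accuracy
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)  # a large train/test gap signals overfitting
    print(f"{name}: cv={cv_scores.mean():.3f} train={train_acc:.3f} test={test_acc:.3f}")
    print(confusion_matrix(y_test, model.predict(X_test)))
```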
Key Takeaways
- The choice of vectoriser (CountVec vs. TF-IDF) made a surprisingly small difference to classification accuracy — model architecture and hyperparameter tuning had more impact.
- Bigrams (n-gram range (1,2)) consistently improved classification because they captured negations and common phrases that unigrams missed.
- CountVectorizer tends to pick up high-frequency but low-information words; TF-IDF penalises these, but in practice the top features were often still similar between the two.
- Logistic regression's coefficients are interpretable and useful for understanding which words drive the classification decision — a key advantage over black-box models (see the sketch after this list).
- Sparse matrix problems become very real at scale — limiting `max_features` and `min_df` is essential for computational feasibility.
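As a sketch of that interpretability point, the fitted coefficients can be mapped back to vocabulary terms; `tvec`, `models`, `X_train`, and `y_train` are carried over from the earlier sketches:

```python
import numpy as np

# Fit logistic regression and rank vocabulary terms by coefficient.
logreg = models["logreg"].fit(X_train, y_train)
features = tvec.get_feature_names_out()
order = np.argsort(logreg.coef_[0])  # indices sorted from lowest to highest weight

print("Strongest words for class 0:", features[order[:10]])
print("Strongest words for class 1:", features[order[-10:]])
```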