Overview
Objective: Web scrape posts from two subreddits on Reddit.com and apply natural language processing (NLP) methods and classification modelling to accurately classify posts as belonging to one subreddit or the other.
This project covered the full pipeline from data collection through model evaluation: using the Pushshift API to collect post titles and text, applying text cleaning and vectorisation, and training several classifiers to distinguish between the two communities.
The full code is available on GitHub.
Approach
1. Data Collection via Web Scraping
Posts were collected from two subreddits using the Pushshift API (pushshift.io), a third-party archive of Reddit data. The scraper gathered post titles and selftext and stored them as a CSV for downstream processing. Key considerations (a sketch follows the list):
- Rate limiting — built in delays between API calls to avoid being throttled
- Handling duplicate posts — subreddits can have reposted content, so de-duplication was necessary
- Handling removed/deleted posts — these show as `[removed]` or `[deleted]` and need to be dropped
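A minimal sketch of the collection loop, assuming Pushshift's historical public endpoint; the subreddit names, batch sizes, and column choices are illustrative, not the project's actual settings. (Pushshift's public endpoint has since been restricted, so this reflects its interface at the time.)

```python
import time

import pandas as pd
import requests

URL = "https://api.pushshift.io/reddit/search/submission"
SUBREDDITS = ["learnpython", "datascience"]  # illustrative, not the project's pair

def fetch_posts(subreddit, n_posts=1000, batch_size=100):
    """Page backwards through a subreddit's submissions via Pushshift."""
    posts, before = [], None
    while len(posts) < n_posts:
        params = {"subreddit": subreddit, "size": batch_size}
        if before is not None:
            params["before"] = before
        batch = requests.get(URL, params=params).json()["data"]
        if not batch:
            break
        posts.extend(batch)
        before = batch[-1]["created_utc"]  # resume paging from the oldest post fetched
        time.sleep(1)  # rate limiting: pause between API calls
    return posts

frames = [pd.DataFrame(fetch_posts(sub))[["title", "selftext", "subreddit"]]
          for sub in SUBREDDITS]
posts = pd.concat(frames, ignore_index=True)
posts = posts.drop_duplicates(subset=["title", "selftext"])         # reposts
posts = posts[~posts["selftext"].isin(["[removed]", "[deleted]"])]  # moderated posts
posts.to_csv("posts.csv", index=False)
```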
2. Text Pre-Processing and NLP
The raw text was cleaned using a standard NLP pipeline (sketched after this list):
- Strip HTML tags (using `BeautifulSoup`)
- Remove non-letter characters (using regex)
- Convert to lowercase
- Remove stopwords (using `nltk`)
- Lemmatise words (using `WordNetLemmatizer`)
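A sketch of how these steps might chain together; the function name and the exact ordering are assumptions, not the project's code:

```python
import re

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(raw):
    text = BeautifulSoup(raw, "html.parser").get_text()  # strip HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text)               # keep letters only
    words = text.lower().split()                         # lowercase and tokenise
    words = [w for w in words if w not in STOPWORDS]     # drop stopwords
    return " ".join(lemmatizer.lemmatize(w) for w in words)

clean_text("<p>The cats were running happily!</p>")
# -> 'cat running happily'
```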
Once cleaned, text was converted to numerical form using two vectorisers:
- CountVectorizer — counts occurrences of each unique word feature
- TF-IDF Vectorizer — assigns scores based on how distinctive a word is within a document relative to the whole corpus
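For reference, the textbook TF-IDF weight behind the second vectoriser (implementations such as scikit-learn add smoothing and normalisation on top):

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log \frac{N}{\text{df}(t)}$$

where $\text{tf}(t, d)$ is the count of term $t$ in document $d$, $N$ is the number of documents, and $\text{df}(t)$ is the number of documents containing $t$.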
Key vectoriser parameters tuned (illustrated in the sketch after this list):
- `max_features` — limit the vocabulary size to reduce sparsity
- `min_df` — only include words appearing in at least N documents
- `ngram_range` — include unigrams and/or bigrams (bigrams improve context capture, e.g., "not happy" vs. "happy")
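Both vectorisers expose these parameters under the same names; the values below are illustrative, not the tuned settings:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Illustrative parameter values, not the project's tuned settings.
cvec = CountVectorizer(max_features=5000, min_df=2, ngram_range=(1, 2))
tvec = TfidfVectorizer(max_features=5000, min_df=2, ngram_range=(1, 2))

docs = ["i am not happy with this", "so happy with the result"]
X_counts = cvec.fit_transform(docs)  # sparse matrix of raw counts
X_tfidf = tvec.fit_transform(docs)   # sparse matrix of TF-IDF weights
print(cvec.get_feature_names_out())  # vocabulary includes unigrams and bigrams
```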
3. Classification Modelling
Several classifiers were evaluated (instantiated in the sketch after this list):
- Logistic Regression — strong baseline for NLP tasks, interpretable coefficients
- Naive Bayes (Multinomial and Bernoulli) — fast and effective for text classification
- Random Forest — captures non-linear relationships but computationally expensive at scale
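A sketch instantiating these candidates; the hyperparameters are scikit-learn defaults (plus a raised iteration cap for logistic regression), not the project's tuned values:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "multinomial_nb": MultinomialNB(),
    "bernoulli_nb": BernoulliNB(),
    "random_forest": RandomForestClassifier(n_estimators=100),
}
```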
Models were evaluated using three checks, combined in the sketch after this list:
- K-fold cross-validation accuracy scores
- Train/test accuracy comparison (to detect overfitting)
- Confusion matrix analysis
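A minimal evaluation loop; `X` and `y` (the vectorised text and binary subreddit labels) and the `models` dict from the previous sketch are assumed:

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# X: vectorised text from earlier; y: binary subreddit label (both assumed).
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

for name, model in models.items():
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)  # k-fold CV accuracy
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)  # a large train/test gap signals overfitting
    print(f"{name}: cv={cv_scores.mean():.3f} train={train_acc:.3f} test={test_acc:.3f}")
    print(confusion_matrix(y_test, model.predict(X_test)))
```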
Key Takeaways
- The choice of vectoriser (CountVec vs. TF-IDF) made a surprisingly small difference to classification accuracy — model architecture and hyperparameter tuning had more impact.
- Bigrams (n-gram range (1,2)) consistently improved classification because they captured negations and common phrases that unigrams missed.
- CountVectorizer tends to pick up high-frequency but low-information words; TF-IDF penalises these, but in practice the top features were often still similar between the two.
- Logistic regression's coefficients are interpretable and useful for understanding which words drive the classification decision — a key advantage over black-box models (see the sketch after this list).
- Sparse matrix problems become very real at scale — limiting `max_features` and `min_df` is essential for computational feasibility.
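As a sketch of that interpretability point, the fitted coefficients can be mapped back to vocabulary terms; `tvec`, `models`, `X_train`, and `y_train` are carried over from the earlier sketches:

```python
import numpy as np

# Fit logistic regression and rank vocabulary terms by coefficient.
logreg = models["logreg"].fit(X_train, y_train)
features = tvec.get_feature_names_out()
order = np.argsort(logreg.coef_[0])  # indices sorted from lowest to highest weight

print("Strongest words for class 0:", features[order[:10]])
print("Strongest words for class 1:", features[order[-10:]])
```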