General Assembly — Data Science Immersive · Project 2

Regression and Classification with Housing Data

Using the Ames housing data to estimate sale prices and identify features that best predict abnormal sales.

Business Case Overview

You work for a real estate company interested in using data science to determine the best properties to buy and re-sell. Specifically, your company would like to identify the characteristics of residential houses that drive the sale price, and to assess the cost-effectiveness of doing renovations.

There are three components to the project:

  1. Estimate the sale price of properties based on their "fixed" characteristics, such as neighborhood, lot size, number of stories, etc.
  2. Estimate the value of possible changes and renovations to properties from the variation in sale price not explained by the fixed characteristics. The goal is to estimate the potential return on investment (and how much you should be willing to pay contractors) when making specific improvements to properties.
  3. Determine the features in the housing data that best predict "abnormal" sales (foreclosures, etc.).

General ML Project Structure

This post documents the structured approach I developed for a machine learning project of this type. Below is the full framework I built up across this project.

1. Business Executive Summary / README

2. Data Preparation

2.1 Key libraries:
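
A minimal sketch of the stack the rest of this framework assumes (pandas and NumPy for data handling, matplotlib and seaborn for plots, scikit-learn and statsmodels for modelling); the later snippets assume these imports are in place.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler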

2.2 Loading the data:

df = pd.read_csv(import_path, keep_default_na=False, na_values=[''])

This treats only truly empty cells as null (NaN) at the reading-in stage. The literal string NA is kept as a value rather than converted to null, a distinction that matters for the Ames housing data, where NA often means "no such feature" (e.g. no basement) rather than a missing value.

3. Exploratory Data Analysis

3.1 Pandas functions for exploring data:
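
A minimal sketch of the first pandas calls I reach for (assuming the DataFrame is called df):

df.head()             # first few rows
df.shape              # (rows, columns)
df.info()             # column dtypes and non-null counts
df.describe()         # summary statistics for numeric columns
df.isnull().sum()     # null count per column
df.nunique()          # distinct values per column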

3.2 Renaming columns:

df.columns = [x.lower().replace(' ','_') for x in df.columns]

3.3 Understanding data distributions:
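
For distributions, a quick sketch: a histogram of the target plus skewness of the numeric columns, to flag candidates for the log transform in 4.4 (column names are illustrative).

df['saleprice'].hist(bins=50)                                             # target is right-skewed
plt.show()
df.select_dtypes(include=np.number).skew().sort_values(ascending=False)   # most-skewed regressors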

3.4 Data Cleaning:
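
A sketch of the typical cleaning steps; the fill values and column names are illustrative and should be chosen per column.

df.isnull().sum().sort_values(ascending=False)                                # locate the gaps
df = df.drop_duplicates()
df['lot_frontage'] = df['lot_frontage'].fillna(df['lot_frontage'].median())   # numeric gap: impute
df['garage_type'] = df['garage_type'].fillna('NA')                            # categorical: NA means no garage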

3.5 Understanding inter-variable relationships:
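
A correlation heatmap shows which regressors move with sale price and with each other; the top of the sorted correlation column is what feeds the polynomial features in 4.2 (sketch).

corr = df.select_dtypes(include=np.number).corr()
sns.heatmap(corr, cmap='coolwarm')
plt.show()
corr['saleprice'].sort_values(ascending=False).head(10)   # strongest correlates of the target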

4. Feature Engineering / Data Transformation

4.1 Categorical data:
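
Ordinal quality scales get mapped to integers; nominal categories get dummy-encoded against a dropped baseline (see the takeaway on dummies below). A sketch, with the mapping and column names as illustrative choices:

quality_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
df['exter_qual'] = df['exter_qual'].map(quality_map)                   # ordinal scale
df = pd.get_dummies(df, columns=['neighborhood'], drop_first=True)     # nominal, baseline dropped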

4.2 Polynomial transformations: Explore squaring or taking the square root of the top correlated regressors.
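
For example (column names are illustrative):

df['gr_liv_area_sq'] = df['gr_liv_area'] ** 2     # square a top correlated regressor
df['lot_area_sqrt'] = np.sqrt(df['lot_area'])     # or take the square root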

4.3 Other interactions:

4.4 Normalisation: Take a log transformation of skewed variables. Remember to reverse-transform when interpreting coefficients or predictions (exponentiate to get back to the original scale).
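
A sketch using log1p (which handles zeros) and its inverse expm1:

df['saleprice_log'] = np.log1p(df['saleprice'])        # compress the right-skewed target
df['saleprice_orig'] = np.expm1(df['saleprice_log'])   # reverse transform recovers the original scale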

5. Feature Selection

6. Model Evaluation

6.1 Train-Test Split: Use sklearn's train_test_split to hold out a test set; cross_val_score then handles the cross-validated scoring in 6.3.
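
Sketch of an 80/20 split, where X is the regressor matrix and y the (possibly log-transformed) sale price; the split ratio and random_state are assumptions.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)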

6.2 Scaling:
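
Fit the scaler on the training set only and reuse that fit on the test set, so no test-set information leaks into the model (sketch).

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on training data only
X_test_scaled = scaler.transform(X_test)         # apply the same fit to the test data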

6.3 K-Fold Cross-Validation: Report the individual fold scores and the overall mean R² score on both the unscaled (full) and scaled (80% training) sets.
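
Sketch of reporting the individual fold R² scores and their mean (cv=5 is an assumption):

fold_scores = cross_val_score(LinearRegression(), X_train_scaled, y_train, cv=5)   # R² per fold
print(fold_scores)
print(fold_scores.mean())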

6.4 Linear Regression (baseline): Report training and test scores; test score should approximate the K-fold mean score. Calculate RMSE.
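
Baseline sketch with train/test R² and test-set RMSE:

lr = LinearRegression().fit(X_train_scaled, y_train)
print(lr.score(X_train_scaled, y_train), lr.score(X_test_scaled, y_test))   # train vs test R²
rmse = np.sqrt(mean_squared_error(y_test, lr.predict(X_test_scaled)))       # error in target units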

6.5 OLS model: Report adjusted R², the p-value of the F-statistic (Prob(F-statistic) < 0.05 is good), and the individual variables' t-test p-values.
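
A statsmodels sketch; add_constant supplies the intercept, and the summary reports the adjusted R², Prob(F-statistic), and per-variable p-values listed above.

ols = sm.OLS(y_train, sm.add_constant(X_train_scaled)).fit()
print(ols.summary())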

7. Hyperparameter Tuning

7.1 Lasso and Ridge Regression: Use grid search to select the optimal alpha. Check for near-zero Lasso/Ridge coefficients and consider dropping them.
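
A grid-search sketch for Lasso (Ridge is analogous); the alpha grid is an assumption.

lasso_grid = GridSearchCV(Lasso(max_iter=10000),
                          param_grid={'alpha': np.logspace(-3, 2, 50)}, cv=5)
lasso_grid.fit(X_train_scaled, y_train)
print(lasso_grid.best_params_, lasso_grid.best_score_)
near_zero = np.abs(lasso_grid.best_estimator_.coef_) < 1e-3   # coefficients shrunk to ~0: drop candidates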

7.2 VIF: Use the Variance Inflation Factor to check for multicollinearity. Iteratively drop variables with VIF > 5.0, recomputing after each drop.
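
A sketch using statsmodels' variance_inflation_factor; recompute after each drop, since removing one variable changes the others' VIFs (assumes X is a DataFrame of the regressors).

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.Series([variance_inflation_factor(X_train_scaled, i)
                 for i in range(X_train_scaled.shape[1])],
                index=X.columns)
print(vif[vif > 5.0].sort_values(ascending=False))   # drop the worst offender, then recompute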

7.3 T-test: Drop variables whose p-values exceed the chosen significance level (e.g. 0.05).

7.4 Recursive Feature Elimination: Consider using sklearn's RFE.
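
Sketch; the number of features to keep is an assumption, and X is again assumed to be a DataFrame.

from sklearn.feature_selection import RFE

rfe = RFE(LinearRegression(), n_features_to_select=20).fit(X_train_scaled, y_train)
selected = X.columns[rfe.support_]   # the features RFE kept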

8. Final Model Selection

Compare all models and select the one with the highest R² test score; in the event of a tie, choose the simpler model. Validate the linear regression assumptions: plot the residuals to check for heteroskedasticity (they should be approximately normally distributed and show no pattern against the predicted values).
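
A residual-plot sketch for that check (final_model stands in for whichever model was selected); the points should scatter evenly around zero with no funnel shape.

predictions = final_model.predict(X_test_scaled)
residuals = y_test - predictions
plt.scatter(predictions, residuals)   # look for even scatter, not a fan or curve
plt.axhline(0, color='red')
plt.xlabel('predicted sale price')
plt.ylabel('residual')
plt.show()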

9. Generating Results and Recommendations

Key Learnings from Project 2

Key Takeaways

A collection of hard-won lessons from this project that generalise beyond housing data:

  • Remove categories with low variation (e.g. if the mode comprises >80% of the dataset) — they tend to exhibit high multicollinearity.
  • Use a log model to normalise skewed distributions. Always validate that residuals are normally distributed.
  • Aggregate similar columns (e.g. sum total square footage across basement + first floor).
  • Understand how your ordinal scaling is constructed — it must make intuitive sense.
  • Understand how your dummy variable is constructed — dummies are measured relative to a baseline, so if the baseline is the best category, all other coefficients will be negative.
  • When dropping outliers based on standard deviation, remember that the 68/95/99.7 rule only applies to normally distributed variables.
  • Differentiate between null (an empty cell whose true value is unknown) and NA (which in the Ames data specifically means the feature is absent).
  • Always note what your scores are against — scaled vs. unscaled, test vs. training set.
  • Provide context for RMSE — either explain what it means, or do a relative comparison with the baseline model's RMSE.