Business Case Overview
You work for a real estate company interested in using data science to determine the best properties to buy and re-sell. Specifically, your company would like to identify the characteristics of residential houses that predict sale price, and to estimate the cost-effectiveness of renovations.
There are three components to the project:
- Estimate the sale price of properties based on their "fixed" characteristics, such as neighborhood, lot size, number of stories, etc.
- Estimate the value of possible changes and renovations to properties from the variation in sale price not explained by the fixed characteristics. The goal is to estimate the potential return on investment (and how much you should be willing to pay contractors) when making specific improvements to properties.
- Determine the features in the housing data that best predict "abnormal" sales (foreclosures, etc.).
General ML Project Structure
This post documents the structured approach I developed for a machine learning project of this type. Below is the full framework I built up across this project.
1. Business Executive Summary / README
- Clearly establish the problem statement and project objective. Objectives should be clear and verifiable (this will help to shortlist the predictor variables).
- Define the target variable. If it is a binary outcome, it is a classification problem; otherwise, regression. This determines the model to use.
- Data dictionary: a list of all variables used, their data type, and what they represent.
- Recommendations/Conclusions: clearly link results back to the problem statement and provide actionable recommendations. State limitations of the model and areas for further improvement.
2. Data Preparation
2.1 Key libraries:
- `pandas` — data manipulation and basic plotting
- `numpy` — numerical/stats calculations
- `seaborn` and `matplotlib.pyplot` — customised plotting
- `sklearn.model_selection` — `train_test_split`, `KFold`, `cross_val_score`
- `sklearn.linear_model` — `LinearRegression`, `LogisticRegression`, `LassoCV`, `RidgeCV`
- `sklearn.preprocessing` — `StandardScaler`
- `sklearn.metrics` — `r2_score`
- `statsmodels.api` — OLS and summary stats
2.2 Loading the data:
data = pd.read_csv(import_path, keep_default_na=False, na_values=[''])
This treats only truly empty cells as null at the reading-in stage. A literal NA value is not considered null — the distinction matters for the Ames housing data, where NA often means "no such feature" rather than a missing value.
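As a quick illustration of why this matters, here is a check on one Ames column where NA is meaningful (the Alley column comes from the Ames data dictionary, where NA means "no alley access"; treat the raw column name as an assumption about the file):

```python
# Only truly empty cells are read as NaN; the literal string 'NA' survives
# as its own category, which is exactly what we want for Ames-style features.
print(data['Alley'].value_counts(dropna=False))
```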
3. Exploratory Data Analysis
3.1 Pandas functions for exploring data:
- `df.shape` — rows and columns count
- `df.isnull().sum()` — total null values per column (divide by rows for %)
- `df.describe()` — summary stats, useful for spotting suspicious min/max values
- `df.info()` — variable types (cross-reference with data dictionary)
- `df[col].unique()` — list of unique entries (useful to find errant values)
3.2 Renaming columns:
df.columns = [x.lower().replace(' ','_') for x in df.columns]
3.3 Understanding data distribution and inter-variable relationships:
- Plot pairplot (target vs. predictor variables — linear or not?)
- Plot histogram of all predictor variables (normality, need for transformation?)
- Plot box plot of predictor variables (outliers, general distribution)
- Drop outlier values before running correlations so they don't skew results
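A minimal sketch of these plots, assuming a dataframe `df` with the renamed columns and a target `saleprice`; the predictor names below are illustrative Ames-style columns, not a prescribed shortlist:

```python
import matplotlib.pyplot as plt
import seaborn as sns

num_cols = ['gr_liv_area', 'lot_area', 'overall_qual']  # illustrative shortlist

# Pairplot of target vs. predictors: are the relationships roughly linear?
sns.pairplot(df, y_vars=['saleprice'], x_vars=num_cols)

# Histograms: check for skew and whether a transformation is needed
df[num_cols].hist(bins=30, figsize=(10, 6))

# Box plots: outliers and general spread
df[num_cols].plot(kind='box', subplots=True, layout=(1, 3), figsize=(10, 4))
plt.show()
```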
3.4 Data Cleaning:
- Split the dataframe into numerical and categorical variables if necessary
- Deal with null values — drop if few in number, or impute with the median
- Drop entire columns if variable is non-essential or has too many null values
- Deal with errant values — drop entries with inconsistent data types, correct spelling errors
- Remove outliers (manually or based on standard deviations from mean, only if normally distributed)
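A sketch of the null-handling steps above; the 50% drop threshold is an assumption for illustration, not a rule:

```python
# Share of nulls per column
null_share = df.isnull().sum() / len(df)

# Drop columns that are mostly empty (threshold chosen for illustration)
df = df.drop(columns=null_share[null_share > 0.5].index)

# Impute remaining numeric nulls with each column's median
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Drop the handful of rows still containing nulls (e.g. in categorical columns)
df = df.dropna()
```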
3.5 Understanding inter-variable relationships:
- Run correlations on numerical variables and sort by strength against target
- Plot heat map of correlations
- Shortlist top correlated numerical variables
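A sketch of this step, assuming the target column is named `saleprice`:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlations of numeric variables with the target, sorted by absolute strength
corr_matrix = df.select_dtypes(include='number').corr()
target_corrs = corr_matrix['saleprice'].drop('saleprice')
print(target_corrs.reindex(target_corrs.abs().sort_values(ascending=False).index).head(10))

# Heat map of the full correlation matrix
sns.heatmap(corr_matrix, cmap='coolwarm', center=0)
plt.show()
```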
4. Feature Engineering / Data Transformation
4.1 Categorical data:
- Ordinal: Write a data dictionary assigning values and map it. Make sure ordinal scale is encoded to be intuitive (e.g., poor → excellent maps to small → large numbers).
- Nominal: Consider mean encoding; otherwise use one-hot encoding.
- Binary: One-hot encode, being careful about which category is the baseline (this affects coefficient interpretation).
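A sketch of the ordinal mapping and one-hot encoding described above; the column names and the quality scale are illustrative Ames-style assumptions:

```python
import pandas as pd

# Ordinal: map the rating scale so that poor -> excellent becomes 1 -> 5
qual_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
df['kitchen_qual'] = df['kitchen_qual'].map(qual_map)

# Nominal: one-hot encode; drop_first=True sets the first category as the baseline,
# so every dummy coefficient is interpreted relative to that baseline
df = pd.get_dummies(df, columns=['neighborhood'], drop_first=True)
```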
4.2 Polynomial transformations: Explore squaring or taking the square root of the top correlated regressors.
4.3 Other interactions:
- Aggregation: sum similar columns (e.g., basement sq ft + first floor sq ft → total sq ft)
- Multiplication: multiply related dummies
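A combined sketch of 4.2 and 4.3, again with illustrative column names:

```python
import numpy as np

# 4.2 Polynomial transforms of the strongest regressors
df['overall_qual_sq'] = df['overall_qual'] ** 2
df['lot_area_sqrt'] = np.sqrt(df['lot_area'])

# 4.3 Aggregation: combine similar columns into a single total
df['total_sf'] = df['total_bsmt_sf'] + df['1st_flr_sf']

# 4.3 Interaction: multiply related features/dummies
df['qual_x_area'] = df['overall_qual'] * df['gr_liv_area']
```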
4.4 Normalisation: Take a log transformation for skewed variables. Remember to reverse-transform coefficients when interpreting (take the exponent).
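A sketch of the log transform on the target, using the same assumed column name as above:

```python
import numpy as np

# log1p handles zero values safely; fit the model on the log scale
df['log_saleprice'] = np.log1p(df['saleprice'])

# Reverse-transform model output back to dollars when reporting predictions, e.g.
# preds_dollars = np.expm1(model.predict(X_test))
```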
5. Feature Selection
- Top correlated numerical variables
- Top correlated transformed categorical variables
6. Model Evaluation
6.1 Train-Test-Split: Use sklearn train_test_split to hold out a test set before any scaling or model fitting; cross-validation on the training portion follows in 6.3.
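A minimal sketch, assuming `X` holds the shortlisted predictors from Section 5 and `y` the target:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows as an untouched test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```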
6.2 Scaling:
- Scale predictor variables only (target variable does not need scaling)
- Fit and transform using `StandardScaler` on the training set; transform the test set using the already-fitted scaler (do not re-fit). See the sketch below.
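```python
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
# Fit the scaler on the training predictors only, then transform both sets
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)   # reuse the already-fitted scaler; do not re-fit
```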
6.3 K-Fold Cross-Validation: Report the individual fold scores and the overall mean R² score, on both the unscaled full dataset and the scaled training (80%) set.
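A sketch using the scaled training set from 6.2:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# 5-fold cross-validated R^2 on the scaled training set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(LinearRegression(), X_train_scaled, y_train, cv=kf)
print(fold_scores)          # individual fold scores
print(fold_scores.mean())   # overall mean R^2
```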
6.4 Linear Regression (baseline): Report training and test scores; test score should approximate the K-fold mean score. Calculate RMSE.
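A sketch of the baseline fit and RMSE calculation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression().fit(X_train_scaled, y_train)
print('train R^2:', lr.score(X_train_scaled, y_train))
print('test R^2: ', lr.score(X_test_scaled, y_test))   # should sit close to the K-fold mean

# RMSE in the target's own units (dollars here)
rmse = np.sqrt(mean_squared_error(y_test, lr.predict(X_test_scaled)))
print('test RMSE:', rmse)
```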
6.5 OLS model: Report adjusted R², the F-statistic's p-value (< 0.05 is good), and the individual variables' t-test p-values.
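A sketch with statsmodels; note that sm.OLS needs an explicit constant:

```python
import statsmodels.api as sm

# The summary reports adjusted R^2, the F-statistic's p-value ("Prob (F-statistic)"),
# and a t-test p-value for each coefficient
X_train_const = sm.add_constant(X_train_scaled)
ols_results = sm.OLS(y_train, X_train_const).fit()
print(ols_results.summary())
```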
7. Hyperparameter Tuning
7.1 Lasso and Ridge Regression: Use grid search to select the optimal alpha. Check for near-zero Lasso/Ridge coefficients and consider dropping them.
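A sketch using the CV estimators from sklearn.linear_model; the alpha grid is an assumption:

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

alphas = np.logspace(-3, 3, 50)
lasso = LassoCV(alphas=alphas, cv=5).fit(X_train_scaled, y_train)
ridge = RidgeCV(alphas=alphas, cv=5).fit(X_train_scaled, y_train)
print('optimal alphas:', lasso.alpha_, ridge.alpha_)

# Features whose Lasso coefficients shrink to (near) zero are candidates for dropping
near_zero = np.abs(lasso.coef_) < 1e-3
```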
7.2 VIF: Use Variance Inflation Factor to check for multicollinearity. Recursively drop variables with VIF > 5.0.
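A sketch of the recursive drop, assuming `X` is a DataFrame of numeric predictors:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X):
    """VIF for each column of a numeric predictor DataFrame, highest first."""
    vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return pd.Series(vifs, index=X.columns).sort_values(ascending=False)

# Recursively drop the worst offender while any VIF exceeds 5.0
while True:
    vifs = vif_table(X)
    if vifs.iloc[0] <= 5.0:
        break
    X = X.drop(columns=[vifs.index[0]])   # drop the highest-VIF column and recompute
```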
7.3 T-test: Drop all variables with high p-values.
7.4 Recursive Feature Elimination: Consider using sklearn's RFE.
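A sketch; the number of features to keep is an assumption:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rfe = RFE(LinearRegression(), n_features_to_select=20)
rfe.fit(X_train_scaled, y_train)
kept = rfe.support_   # boolean mask over the columns of X_train_scaled
```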
8. Final Model Selection
Compare all models and select the one with the highest R² test score; in the event of a tie, choose the simpler model. Validate the linear regression assumptions — plot residuals against predicted values to check for heteroskedasticity (residuals should be roughly normally distributed and show no pattern against the predictions).
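A sketch of the residual check, assuming `final_model` is whichever fitted model was selected:

```python
import matplotlib.pyplot as plt

# Residuals vs. predictions: look for a random band around zero (no funnel shape)
preds = final_model.predict(X_test_scaled)
residuals = y_test - preds
plt.scatter(preds, residuals, alpha=0.5)
plt.axhline(0, color='red')
plt.xlabel('predicted sale price')
plt.ylabel('residual')
plt.show()

# A histogram of the residuals should look roughly normal
plt.hist(residuals, bins=30)
plt.show()
```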
9. Generating Results and Recommendations
- Present the final selected features and the magnitude of their coefficients
- Explain what the metrics mean (R², RMSE — provide context for whether values are "big" or "small")
- Explain the intuition behind why each variable was selected
- Clearly state the limitations of the model
- Clearly state recommendations based on the results
- Emphasize how much the model improved over time (compare to the naive mean baseline)
Key Learnings from Project 2
A collection of hard-won lessons from this project that generalise beyond housing data:
- Remove categorical variables with low variation (e.g. if the mode comprises >80% of the dataset) — they tend to exhibit high multicollinearity.
- Use a log model to normalise skewed distributions. Always validate that residuals are normally distributed.
- Aggregate similar columns (e.g. sum total square footage across basement + first floor).
- Understand how your ordinal scaling is constructed — it must make intuitive sense.
- Understand how your dummy variable is constructed — dummies are measured relative to a baseline, so if the baseline is the best category, all other coefficients will be negative.
- When dropping outliers based on standard deviation, the 68/95/99.7 rule only applies to normally distributed variables.
- Differentiate between `null` (empty — could be any value) and `NA` (specifically means nothing).
- Always note what your scores are against — scaled vs. unscaled, test vs. training set.
- Provide context for RMSE — either explain what it means, or do a relative comparison with the baseline model's RMSE.