Mini ML Project: Predicting Titanic Survival

Context

This is a small machine-learning project built around the classic Titanic passenger dataset. The goal was straightforward: use information about each passenger to predict whether they survived.

That makes the project a binary classification problem. The model is not trying to estimate a dollar amount or forecast a continuous number. It is choosing between two outcomes: survived or did not survive.

The Titanic dataset is useful for learning because the variables are easy to understand. Passenger class, sex, age, fare, family aboard, and port of embarkation all have plausible relationships to survival. That makes it a good place to practice the full modeling workflow without getting lost in domain jargon.

The real question was not just “can a model predict survival?” It was:

If two different model types look at the same passenger data, which one makes better predictions, and how do we know?

Modeling Setup

I compared two common supervised-learning models: logistic regression and a decision tree.

Logistic regression estimates the probability of an outcome. Here it combines each passenger’s features into a weighted sum and passes that sum through the logistic function, which yields a survival probability between 0 and 1. One passenger might receive a predicted survival probability of 0.82, while another might receive 0.21. To make a final yes/no prediction, we apply a threshold, such as 0.5, and treat probabilities at or above it as predicted survival.
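To make the probability-then-threshold idea concrete, here is a minimal sketch in Python (the project itself was written in R). The intercept, the weights, and the feature names are invented for illustration; they are not the coefficients from my fitted model.

```python
import math

# Hypothetical weights for illustration only; NOT the fitted coefficients.
INTERCEPT = 1.0
WEIGHTS = {"is_female": 2.5, "pclass": -1.0, "age": -0.03}


def survival_probability(passenger):
    """Apply the logistic function to a weighted sum of the features."""
    log_odds = INTERCEPT + sum(WEIGHTS[name] * passenger[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-log_odds))


def predict_survived(passenger, threshold=0.5):
    """Turn the probability into a yes/no prediction using a cutoff."""
    return survival_probability(passenger) >= threshold


first_class_woman = {"is_female": 1, "pclass": 1, "age": 30}
third_class_man = {"is_female": 0, "pclass": 3, "age": 30}

p_high = survival_probability(first_class_woman)
p_low = survival_probability(third_class_man)
print(round(p_high, 2), predict_survived(first_class_woman))
print(round(p_low, 2), predict_survived(third_class_man))
```

Changing `threshold` changes the yes/no answer without changing the underlying probability, which is the idea the ROC curve builds on later.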

A decision tree works differently. Instead of fitting one smooth equation, it splits the data into branches. It might first separate passengers by sex, then by class, then by age or fare. The result is easier to picture as a sequence of questions, but it can also become too tailored to one training sample if the tree is not controlled carefully.
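A tree's "sequence of questions" maps naturally onto nested if-statements. The sketch below is in Python rather than the project's R, and the split order and cutoffs are hypothetical, chosen only to show the shape of a tree. It is not the tree that rpart fitted, which learns its splits from the training data.

```python
def tree_predict(passenger):
    """Walk a tiny hand-written tree; each `if` plays the role of one split."""
    # Hypothetical splits for illustration; a fitted tree learns these from data.
    if passenger["sex"] == "female":
        if passenger["pclass"] <= 2:
            return "survived"
        return "did not survive"
    # Male branch: one further question about age.
    if passenger["age"] < 10:
        return "survived"
    return "did not survive"


print(tree_predict({"sex": "female", "pclass": 1, "age": 30}))
print(tree_predict({"sex": "male", "pclass": 3, "age": 40}))
```

Every extra level of nesting lets the tree fit the training sample more closely, which is why an uncontrolled tree can end up memorizing quirks of that sample.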

[Figure: Simplified fitted decision tree showing Titanic survival rates by sex, age, and passenger class. The percentages are real training-set survival rates from the R model; the full tree splits some branches further by fare and age.]

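That tradeoff is why this comparison is useful. Logistic regression fits one equation to all of the training data, so its predictions tend to change little when the sample changes. Decision trees are more visual and rule-based, but a small change in the data can produce a noticeably different tree. The project was a good chance to compare those styles on the same data.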

Methodology

I built the workflow in R using titanic, caret, rpart, pROC, dplyr, and ggplot2.

The pipeline followed five main steps:

Loaded the 891-passenger Titanic training dataset.
Removed columns that were not useful as predictors, including passenger name, ticket number, and cabin.
Converted categorical fields such as passenger class, sex, survival, and embarkation port into factors.
Filled missing ages with the median age and filled missing embarkation values with the most common port, S (Southampton).
Split the data into an 80% training set and a 20% test set with a fixed seed for reproducibility.
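The imputation and split steps above can be sketched in a few lines. This is Python with toy records, not the R script or the real 891 rows. The field names and the seed value are assumptions, and the split here is a plain shuffled split, which may differ from how the R workflow partitioned the data.

```python
import random
from collections import Counter
from statistics import median

# Toy records standing in for the real data; None marks a missing value.
passengers = [
    {"age": 22.0, "embarked": "S", "survived": 0},
    {"age": 38.0, "embarked": "C", "survived": 1},
    {"age": None, "embarked": "S", "survived": 1},
    {"age": 35.0, "embarked": None, "survived": 1},
    {"age": 54.0, "embarked": "Q", "survived": 0},
    {"age": None, "embarked": "S", "survived": 0},
    {"age": 2.0, "embarked": "S", "survived": 0},
    {"age": 27.0, "embarked": "C", "survived": 1},
    {"age": 14.0, "embarked": "S", "survived": 1},
    {"age": 4.0, "embarked": "S", "survived": 1},
]

# Median of the observed ages and the most common observed port.
median_age = median(p["age"] for p in passengers if p["age"] is not None)
port_counts = Counter(p["embarked"] for p in passengers if p["embarked"] is not None)
top_port = port_counts.most_common(1)[0][0]

# Fill the gaps without modifying the original records.
cleaned = [
    {
        **p,
        "age": median_age if p["age"] is None else p["age"],
        "embarked": top_port if p["embarked"] is None else p["embarked"],
    }
    for p in passengers
]

# Fixed seed (an arbitrary choice) so the 80/20 split is the same on every run.
rng = random.Random(42)
shuffled = cleaned[:]
rng.shuffle(shuffled)
cut = len(shuffled) * 8 // 10
train, test = shuffled[:cut], shuffled[cut:]

print(median_age, top_port, len(train), len(test))
```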

Both models used the same predictors:

Passenger class
Sex
Age
Number of siblings or spouses aboard
Number of parents or children aboard
Fare
Port of embarkation

After training, I evaluated both models on the held-out test set. That matters because the test set simulates new data the model has not already seen.

How to read the ROC curve

Before comparing the two models, it helps to understand what the ROC curve is showing.

A classifier usually starts by producing probabilities. For example, the logistic regression model might say one passenger has an 82% chance of survival and another has a 21% chance. To turn those probabilities into yes/no predictions, we choose a cutoff. A common cutoff is 0.5, but it is not the only possible one.

An ROC curve asks: what happens if we move that cutoff around?

If the cutoff is strict, the model predicts survival only when it is very confident. That can reduce false positives, but it may miss real survivors. If the cutoff is loose, the model catches more actual survivors, but it may also create more false alarms. The ROC curve shows that tradeoff across many cutoffs instead of judging the model at only one threshold.
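Here is a small Python sketch of that sweep (again, not the project's R code). The scores and labels are toy values I made up; each cutoff produces one point on the ROC curve, given as a false positive rate and a true positive rate.

```python
# Toy scores and labels for illustration; 1 = survived, 0 = did not survive.
scores = [0.95, 0.82, 0.70, 0.55, 0.40, 0.21, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0]


def roc_point(threshold):
    """Return (false positive rate, true positive rate) at one cutoff."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


strict = roc_point(0.9)   # predicts survival only when very confident
default = roc_point(0.5)
loose = roc_point(0.2)    # predicts survival for most passengers

for name, point in [("strict", strict), ("default", default), ("loose", loose)]:
    print(name, point)
```

On these toy values the strict cutoff produces no false alarms but finds only one of the three survivors, while the loose cutoff finds all three at the cost of flagging three of the four non-survivors.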

[Figure: Guide to reading an ROC curve. A better curve bends toward the upper-left corner, catching more true positives while limiting false positives; random guessing follows the diagonal line.]

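That is why AUC matters. AUC, or area under the curve, compresses the whole ROC curve into one score between 0 and 1. It can be read as the probability that the model gives a randomly chosen survivor a higher score than a randomly chosen non-survivor, so 0.5 is no better than random guessing and 1.0 is a perfect ranking.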
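One way to see what AUC measures is to compute it directly as a ranking score: count how often a survivor outranks a non-survivor across all survivor/non-survivor pairs. The Python sketch below uses toy values, not the project's predictions. The function name is my own, and in the R workflow pROC handles this calculation.

```python
def auc_by_ranking(scores, labels):
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. This pairwise share equals the area under
    the ROC curve.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy scores and labels for illustration; 1 = survived, 0 = did not survive.
toy_scores = [0.95, 0.82, 0.70, 0.55, 0.40, 0.21, 0.10]
toy_labels = [1, 1, 0, 1, 0, 0, 0]
toy_auc = auc_by_ranking(toy_scores, toy_labels)
print(round(toy_auc, 3))
```

Because only the ordering of the scores matters, AUC does not depend on any particular cutoff.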

Results

Logistic regression came out ahead, though the margin was not huge.

Metric	Logistic Regression	Decision Tree
AUC	0.841	0.816
Accuracy	79.1%	77.9%
Precision	~74.6%	~74.6%
Recall	69.0%	64.7%

Logistic regression had the stronger AUC: 0.841 compared with 0.816 for the decision tree. Since AUC measures ranking quality across thresholds, logistic regression separated likely survivors from likely non-survivors slightly better overall.

[Figure: ROC curve comparison. Logistic regression (AUC 0.841) separated survivors from non-survivors slightly better across thresholds than the decision tree (AUC 0.816).]

Accuracy tells a similar story: logistic regression correctly classified 79.1% of the test set, compared with 77.9% for the decision tree.

Precision was effectively tied at about 74.6%, which means that when either model predicted survival, it was right roughly three times out of four. Recall showed the clearer difference: logistic regression found 69.0% of actual survivors, while the decision tree found 64.7%. In practical terms, the tree missed more people who actually survived.
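For reference, all three metrics come from the same four confusion-matrix counts. The Python sketch below shows the arithmetic with hypothetical counts; they are not the confusion matrix from my test set.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that were right
        "precision": tp / (tp + fp),     # of predicted survivors, how many survived
        "recall": tp / (tp + fn),        # of actual survivors, how many were found
    }


# Hypothetical counts for illustration; NOT the project's results.
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=40)
print(metrics)
```

Precision and recall share the same numerator, so two models can tie on precision and still differ on recall if one of them misses more actual survivors (a larger `fn`).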

So What?

The Titanic dataset is historical, but the workflow is not. This same structure applies anywhere a team needs to predict a yes/no outcome from messy real-world data.

For example:

A subscription company might predict whether a customer will churn.
A lender might estimate whether an applicant is likely to default.
A university might identify students who may need extra support.
A marketing team might predict whether a lead will convert.
A product team might flag accounts that are likely to stay active or drop off.

The useful part is not the Titanic prediction itself. The useful part is the modeling habit: define the outcome, clean the data, train competing models, test them fairly, and explain the tradeoff in plain language.

For this project, the recommendation was simple: use logistic regression as the stronger baseline, then improve the analysis with cross-validation, threshold tuning, and a pruned decision tree.
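As a rough idea of what the cross-validation step would involve, here is a Python sketch of 5-fold splitting over the 891 rows. The function name, the fold count, and the seed are my assumptions; in R, caret can handle this through its resampling options.

```python
import random


def k_fold_indices(n_rows, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs so every row is held out exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled rows into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_rows=891, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each model would be trained and scored once per fold, and the averaged scores give a steadier comparison than a single 80/20 split.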
