(Kaggle Dataset) Titanic Survival Prediction

Wen Teng Chang
2020/6/4

1. Goal and Measurement - Accuracy

The sinking of the Titanic was a one-off accident; since it is not a continuously repeatable event, there is no specific business goal.

Therefore, rather than performance measures such as “sensitivity” and “precision”, which are useful and powerful in contexts like medical diagnosis and share-price prediction respectively, I choose accuracy as my evaluation metric, since the other metrics make less sense in this context.

Data Mining Goal: To classify the response variable, “Survived”, as accurately as possible.

Performance Evaluation: Accuracy


2. Exploratory Data Analysis

There are 891 samples in the training set with 11 features and one response variable, “Survived”. In addition, there are 418 samples in the testing set with the same 11 features. We can also see some missing values in both the training and testing data. (We’ll treat these later.)

Table 1: Training Data (6 rows)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 886 0 3 Rice, Mrs. William (Margaret Norton) female 39 0 5 382652 29.125 NA Q
887 887 0 2 Montvila, Rev. Juozas male 27 0 0 211536 13.000 NA S
888 888 1 1 Graham, Miss. Margaret Edith female 19 0 0 112053 30.000 B42 S
889 889 0 3 Johnston, Miss. Catherine Helen “Carrie” female NA 1 2 W./C. 6607 23.450 NA S
890 890 1 1 Behr, Mr. Karl Howell male 26 0 0 111369 30.000 C148 C
891 891 0 3 Dooley, Mr. Patrick male 32 0 0 370376 7.750 NA Q
Table 2: Testing Data (6 rows)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
413 1304 3 Henriksson, Miss. Jenny Lovisa female 28.0 0 0 347086 7.7750 NA S
414 1305 3 Spector, Mr. Woolf male NA 0 0 A.5. 3236 8.0500 NA S
415 1306 1 Oliva y Ocana, Dona. Fermina female 39.0 0 0 PC 17758 108.9000 C105 C
416 1307 3 Saether, Mr. Simon Sivertsen male 38.5 0 0 SOTON/O.Q. 3101262 7.2500 NA S
417 1308 3 Ware, Mr. Frederick male NA 0 0 359309 8.0500 NA S
418 1309 3 Peter, Master. Michael J male NA 1 1 2668 22.3583 NA C

Data Visualization

Sex x Survival

According to the figure below, females were much more likely to survive the sinking. This is plausible, as men tended to let women evacuate first.

Pclass x Survival

Upper-class passengers were more likely to survive the sinking. This is plausible, as upper-class passengers may have booked better cabins with more or better emergency equipment.

Age x Survival

According to the histogram below, children (ages 0-12) were more likely to survive. This is sensible, as parents and elders usually go all out to protect children when disaster strikes.

Fare x Survival / Fare x Pclass

Passengers who paid more had a better chance of surviving; a possible reason is that they stayed in better cabins with more complete emergency equipment. In addition, this feature is highly correlated with “Pclass”, as the right figure shows: the red boxes (“upper class”) dominate when Fare > 100, and the blue boxes (“lower class”) dominate when Fare <= 100.

SibSp / Parch / Family x Survival

Here, I plot histograms for the features SibSp, Parch, and SibSp + Parch (Family), each split into two parts by “Survived”. The two figures on the right show that passengers with one sibling or spouse, or one parent or child, were more likely to survive. However, since the information in these figures alone is limited, I combine the two features into a single one, Family, presented on the left.

The left figure (Family) shows that passengers with a small family, denoted by (1, 2, 3) on the x-axis, were apt to survive. On the other hand, passengers travelling alone (= 0) or with a large family (> 3) were more likely to lose their lives.

Insights

  1. Females were more likely to survive the accident.
  2. Children (Age < 12) were more likely to survive the accident.
  3. Pclass is highly correlated with Fare.
  4. Upper-class passengers were more likely to survive.
  5. Passengers who paid more were more likely to survive.
  6. Passengers travelling with a small family (1-3) were more likely to survive.
  7. Family (SibSp + Parch) classifies “Survived” better than its components, SibSp and Parch.

More visualizations will be presented in the next section.


3. Missing Value Treatment

How many missing values are there in training and testing data?

As we can see from the outputs below, both the training and testing data contain NAs in three columns.

Feature    Training NAs   Testing NAs
Pclass     0              0
Name       0              0
Sex        0              0
Age        177            86
SibSp      0              0
Parch      0              0
Ticket     0              0
Fare       0              1
Cabin      687            327
Embarked   2              0

(Training) Embarked - Method: Data visualization

Since there are only two missing values in the feature Embarked, I’ll start here.

First, let’s observe these two passengers’ features such as Family, Pclass, Fare, and Ticket. Since they share the same ticket number, it is reasonable to infer that they boarded at the same port. In addition, both have Family (SibSp + Parch) = 0, Pclass = 1, and Fare = 80.

Table 3: (Training) Rows with Embarked = NA
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
62 62 1 1 Icard, Miss. Amelie female 38 0 0 113572 80 B28 NA
830 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62 0 0 113572 80 B28 NA

In the first plot, Fare = 80 sits on the median of Embarked = “C” when partitioned by Pclass. However, this alone doesn’t prove that the missing values should be Embarked = “C”, since Fare = 80 also falls within the box plot of Embarked = “S”. Thus, I use the second and third plots for further evidence; both recommend “S”, because S accounts for the largest proportion of passengers with either Pclass = 1 or Family = 0. Taking everything into consideration, I populate the missing values with “S”.
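A minimal sketch of this fill in R, assuming the raw training data is in a data frame named train (the variable name is an assumption):

# Populate the two missing Embarked values with "S", as argued above.
# In the raw Kaggle CSV the missing entries may appear as empty strings,
# so both cases are covered here.
train$Embarked[is.na(train$Embarked) | train$Embarked == ""] <- "S"
train$Embarked <- factor(train$Embarked, levels = c("C", "Q", "S"))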

(Training and Testing) Cabin - Method: Add Cabin_Binary_NA and remove Cabin

It’s not a good idea to impute Cabin, since there are 687 missing values in the training set and 327 in the testing set. Therefore, I add a binary column, Cabin_Binary_na, set to 0 when Cabin is NA and 1 when Cabin has a value. As shown in the figure below, this new feature seems to classify “Survived” well: the survival rate is roughly 67% when Cabin_Binary_na = 1, yet just 30% when Cabin_Binary_na = 0.
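A minimal sketch of this derivation, again assuming train and test data frames:

# Cabin_Binary_na = 1 when Cabin is recorded, 0 when it is missing/empty.
train$Cabin_Binary_na <- as.integer(!is.na(train$Cabin) & train$Cabin != "")
test$Cabin_Binary_na  <- as.integer(!is.na(test$Cabin) & test$Cabin != "")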

(Training and Testing) Age - Method: Generate Age from the distributions

I impute Age based on four distributions generated from the existing Age values, partitioned by a new feature, “Title”.

I’ll describe this in three parts:

  1. How to generate Title from Name?
  2. How to populate Title since there are some NAs?
  3. How to populate Age from 4 distributions partitioned by Title?

1. How to generate Title from Name?

The categories of Title are determined by inspecting the Name column: “Mr.”, “Mrs.”, “Miss.”, and “Master.” (initially encoded as 1, 2, 3, 4). I extract these titles from Name and construct a new feature, “Title”. However, some Name values contain none of these titles (27 in the training data and 7 in the testing data), so those NAs have to be populated accordingly, as described in the second step. (A sketch of the extraction appears after the count table below.)

Title      Training Count   Testing Count
Mr.        517              240
Mrs.       125              72
Miss.      182              78
Master.    40               21
Total      864              411
NA         27               7
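A minimal sketch of the extraction, using the train/test data frames from above (the helper extract_title is hypothetical):

# Map the four title patterns to codes 1-4; names matching none get NA.
title_patterns <- c("Mr\\.", "Mrs\\.", "Miss\\.", "Master\\.")
extract_title <- function(name) {
  hit <- which(sapply(title_patterns, grepl, x = name))
  if (length(hit) == 0) NA_integer_ else as.integer(hit[1])
}
train$Title <- vapply(as.character(train$Name), extract_title, integer(1), USE.NAMES = FALSE)
test$Title  <- vapply(as.character(test$Name), extract_title, integer(1), USE.NAMES = FALSE)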

2. How to populate Title since there are some NAs?

Title is populated sequentially based on existing Age values and Sex. Below is the profile of rows with Title = NA. Since all the males are older than 12, their NA Titles can be assigned as “Mr.”. For row 767, although Age is NA, we can still assign “Mr.” because the Name shows he is a doctor.

Table 4: (Training) Rows with Title = NA
Name Age Sex
31 Uruchurtu, Don. Manuel E 40 male
150 Byles, Rev. Thomas Roussel Davids 42 male
151 Bateman, Rev. Robert James 51 male
246 Minahan, Dr. William Edward 44 male
250 Carter, Rev. Ernest Courtenay 54 male
318 Moraweck, Dr. Ernest 54 male
370 Aubart, Mme. Leontine Pauline 24 female
399 Pain, Dr. Alfred 23 male
444 Reynaldo, Ms. Encarnacion 28 female
450 Peuchen, Major. Arthur Godfrey 52 male
537 Butt, Major. Archibald Willingham 45 male
557 Duff Gordon, Lady. (Lucille Christiana Sutherland) (“Mrs Morgan”) 48 female
600 Duff Gordon, Sir. Cosmo Edmund (“Mr Morgan”) 49 male
627 Kirkland, Rev. Charles Leonard 57 male
633 Stahelin-Maeglin, Dr. Max 32 male
642 Sagesser, Mlle. Emma 24 female
648 Simonius-Blumer, Col. Oberst Alfons 56 male
661 Frauenthal, Dr. Henry William 50 male
695 Weir, Col. John 60 male
711 Mayne, Mlle. Berthe Antonine (“Mrs de Villiers”) 24 female
746 Crosby, Capt. Edward Gifford 70 male
760 Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) 33 female
767 Brewe, Dr. Arthur Jackson NA male
797 Leader, Dr. Alice (Farnham) 49 female
823 Reuchlin, Jonkheer. John George 38 male
849 Harper, Rev. John 28 male
887 Montvila, Rev. Juozas 27 male

Now only females are left. I’ve found that “Mrs.” passengers tend to be older than “Miss.” passengers, with a rough threshold of Age = 25; with this rule the new feature Title is completely populated (a sketch of the rule follows the table below). Note that row 711, although under 25, is assigned “Mrs.” because her Name includes “Mrs”, which the “Mrs.” pattern (with a trailing dot) did not capture.

Note that Title in the testing set is constructed in the same way.

Table 5: (Training) Rows with Title = NA after populating “Mr.” into Title
Name Age Sex
370 Aubart, Mme. Leontine Pauline 24 female
444 Reynaldo, Ms. Encarnacion 28 female
557 Duff Gordon, Lady. (Lucille Christiana Sutherland) (“Mrs Morgan”) 48 female
642 Sagesser, Mlle. Emma 24 female
711 Mayne, Mlle. Berthe Antonine (“Mrs de Villiers”) 24 female
760 Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) 33 female
797 Leader, Dr. Alice (Farnham) 49 female
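A minimal sketch of this rule, using the Title codes 2 = “Mrs.” and 3 = “Miss.”:

# Remaining NA Titles are all female: assign "Mrs." (2) when Age >= 25
# or when the Name contains "Mrs" (this catches row 711); else "Miss." (3).
idx <- which(is.na(train$Title) & train$Sex == "female")
train$Title[idx] <- ifelse(
  (!is.na(train$Age[idx]) & train$Age[idx] >= 25) | grepl("Mrs", train$Name[idx]),
  2L, 3L)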

The figure below presents the relationship between the new feature, Title, and Survived. We can see that “Mr.” passengers tended to perish, while “Mrs.” and “Miss.” passengers were apt to survive.

3. How to populate Age’s NAs from 4 distributions partitioned by Title?

Note that the testing set follows the same missing-value treatment.

The figures below show the Age distributions from the training data; the left one is plotted without missing values, and the right one is based on the original data. They are rather similar.
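One way to implement the imputation is to draw missing ages from the empirical Age distribution of each Title group; this sketch assumes that interpretation (set.seed added for reproducibility):

set.seed(2020)  # reproducible draws
impute_age <- function(df) {
  for (t in 1:4) {
    observed <- df$Age[df$Title == t & !is.na(df$Age)]
    missing  <- which(df$Title == t & is.na(df$Age))
    df$Age[missing] <- sample(observed, length(missing), replace = TRUE)
  }
  df
}
train <- impute_age(train)
test  <- impute_age(test)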

(Testing) Fare - Method: kNN Data imputation

Since there is only one missing Fare value in the testing set, it is reasonable to populate it with the kNN algorithm: kNN finds several similar passengers and fills the missing Fare by aggregating their fares (for a numeric variable, typically the median or mean of the neighbours, rather than a majority vote).
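A minimal sketch using the VIM package (the choice of VIM::kNN and k = 5 is an assumption, not necessarily the author’s exact call):

library(VIM)  # provides the kNN() imputation function
# Fill the single missing Fare from the 5 nearest neighbours (median fare).
test <- kNN(test, variable = "Fare", k = 5, imp_var = FALSE)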

Now all the missing values in the training and testing sets have been cleaned successfully. The table below shows the data after imputation.

Feature    Training NAs   Testing NAs
Pclass     0              0
Name       0              0
Sex        0              0
Age        0              0
SibSp      0              0
Parch      0              0
Ticket     0              0
Fare       0              0
Cabin      0              0
Embarked   0              0

Summary

Feature Engineering:

  1. Cabin_Binary_na: (0, 1) Generated from Cabin by assigning 0 to Cabin’s NAs and 1 to recorded values
  2. Title: (1: “Mr.”, 2: “Mrs.”, 3: “Miss.”, 4: “Master.”) Generated from Name, with NAs populated using Sex and Age

Data Cleaning and Imputation

  1. Embarked in Training set (2)
  2. Cabin in Training (687) and Testing set (327)
  3. Age in Training (177) and Testing set (86)
  4. Fare in Testing set (1)

4. Data Manipulation (Feature Engineering)

Above, the new features Title and Cabin_Binary_na were created and added to the training and testing datasets. Here, I create a few more features that may help improve the model in the next section.

The table below includes all features currently in the training set. I remove PassengerId and Name: the former carries no information, and the latter has already been mined for Title.

Table 6: Training Dataset (6 rows)
Survived Pclass Sex Age SibSp Parch Ticket Fare Embarked Cabin_Binary Title
886 0 3 female 39.0 0 5 382652 29.125 Q 0 2
887 0 2 male 27.0 0 0 211536 13.000 S 0 1
888 1 1 female 19.0 0 0 112053 30.000 S 1 3
889 0 3 female 27.5 1 2 W./C. 6607 23.450 S 0 3
890 1 1 male 26.0 0 0 111369 30.000 C 1 1
891 0 3 male 32.0 0 0 370376 7.750 Q 0 1

Family

This feature proved useful in the EDA (Family x Survival), so I create it by adding SibSp and Parch together, as sketched below.
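The construction is a one-liner for each set:

# Family = SibSp + Parch (0 means the passenger travelled alone).
train$Family <- train$SibSp + train$Parch
test$Family  <- test$SibSp + test$Parch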

Title and Pclass - One-hot Encoding

Pclass and Title are transformed to one-hot encodings (dummy variables), since these features have no ordinal relation; that is, 4: “Master.” in Title is not better or larger than 1: “Mr.”.
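A sketch that produces the Title1-Title3 and Pclass1-Pclass2 dummies seen in the final dataset; the remaining level of each feature (Title 4 = “Master.”, Pclass 3) acts as the reference category. The loop form is an assumption about the implementation:

# One dummy column per non-reference level.
for (i in 1:3) {
  train[[paste0("Title", i)]] <- as.integer(train$Title == i)
  test[[paste0("Title", i)]]  <- as.integer(test$Title == i)
}
for (i in 1:2) {
  train[[paste0("Pclass", i)]] <- as.integer(train$Pclass == i)
  test[[paste0("Pclass", i)]]  <- as.integer(test$Pclass == i)
}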

Fare

In this part, I bin Fare into a binary variable, Fare_Binary, which is “0” when Fare < 50 and “1” otherwise. As we can see from the figures below, although the threshold 100 in the right figure seems to classify “Survived” better, the sample in that region is rather small; thus I prefer the threshold 50.
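A minimal sketch of the binning; storing the result as a factor matches the Fare1 coefficient name in the model output below:

# Replace Fare with a binary factor: "0" if Fare < 50, "1" otherwise.
train$Fare <- factor(as.integer(train$Fare >= 50))
test$Fare  <- factor(as.integer(test$Fare >= 50))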


5. Performance Evaluation

Logistic Regression Model

Let’s divide this section into 4 parts:

  1. Select features and train the model
  2. Tune the model via leave-one-out cross validation
  3. Select the probability threshold and conduct prediction
  4. Submit the prediction result on Kaggle

1. Select features for the model

The features below are used for model building.

Table 7: Final Training Dataset (6 rows)
Age Fare Cabin_Binary Family Title1 Title2 Title3 Pclass1 Pclass2
886 39.0 0 0 5 0 1 0 0 0
887 27.0 0 0 0 1 0 0 0 1
888 19.0 0 1 0 0 0 1 1 0
889 27.5 0 0 3 0 0 1 0 0
890 26.0 0 1 0 1 0 0 1 0
891 32.0 0 0 0 1 0 0 0 0
Table 8: Final Testing Dataset (6 rows)
Age Fare Cabin_Binary Family Title1 Title2 Title3 Pclass1 Pclass2
413 28.0 0 0 0 0 0 1 0 0
414 27.5 0 0 0 1 0 0 0 0
415 39.0 1 1 0 0 1 0 1 0
416 38.5 0 0 0 1 0 0 0 0
417 27.5 0 0 0 1 0 0 0 0
418 13.0 0 0 2 0 0 0 0 0

2. Leave-one-out Cross Validation

Since the training sample size is not very large, it is better to carry out leave-one-out CV to tune the probability threshold instead of n-fold CV; this also avoids the randomness introduced when partitioning the data in n-fold CV.

The figure below shows the accuracies under probability thresholds from 0.1 to 1 with step 0.1. When threshold = 0.5 and 0.6, the model performs almost identically. In this case, I choose 0.6 as the threshold, because most training observations have “Survived” = 0; with threshold = 0.6 it is harder for the model to classify a testing observation as 1, i.e., more 0s are expected to be predicted.
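A sketch of the tuning loop, assuming Train_df (the name used in the glm call below) holds Survived coded as 0/1 plus the features from Table 7:

# Leave-one-out CV: refit on n - 1 rows, predict the held-out row.
n <- nrow(Train_df)
loo_prob <- numeric(n)
for (i in seq_len(n)) {
  fit <- glm(Survived ~ ., family = binomial, data = Train_df[-i, ])
  loo_prob[i] <- predict(fit, newdata = Train_df[i, ], type = "response")
}
# Accuracy at each candidate threshold.
thresholds <- seq(0.1, 1, by = 0.1)
accuracy <- sapply(thresholds,
                   function(th) mean(as.integer(loo_prob >= th) == Train_df$Survived))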

3. Logistic Regression Model and the Prediction

With the features from step 1 and the threshold = 0.6 selected in step 2, the model details are reported below. It is unsurprising that most of the features are significant, since they were selected based on the EDA results.


Call:
glm(formula = Survived ~ ., family = binomial, data = Train_df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5058  -0.5372  -0.3692   0.5305   2.8598  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    1.88262    0.50472   3.730 0.000191 ***
Age           -0.02937    0.00865  -3.396 0.000685 ***
Fare1          0.79524    0.33070   2.405 0.016187 *  
Cabin_Binary1  0.70768    0.36558   1.936 0.052894 .  
Family        -0.49004    0.08200  -5.976 2.28e-09 ***
Title1        -3.43268    0.53447  -6.423 1.34e-10 ***
Title2         0.40954    0.54537   0.751 0.452682    
Title3        -0.49253    0.49087  -1.003 0.315679    
Pclass1        1.50465    0.41484   3.627 0.000287 ***
Pclass2        1.00308    0.24584   4.080 4.50e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.7  on 890  degrees of freedom
Residual deviance:  718.5  on 881  degrees of freedom
AIC: 738.5

Number of Fisher Scoring iterations: 5
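
Finally, a sketch of step 4, the prediction and submission file; Test_df (the test-set counterpart of Train_df) and the output file name are assumptions:

# Fit on the full training data, predict test probabilities, apply threshold 0.6.
final_fit <- glm(Survived ~ ., family = binomial, data = Train_df)
test_prob <- predict(final_fit, newdata = Test_df, type = "response")
submission <- data.frame(PassengerId = test$PassengerId,
                         Survived    = as.integer(test_prob >= 0.6))
write.csv(submission, "titanic_submission.csv", row.names = FALSE)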