(Kaggle Dataset) Titanic Survival Prediction

Wen Teng Chang
2020/6/4

1. Goal and Measurement - Accuracy

The sinking of the Titanic was a one-off accident; since it is not a continuously repeatable event, there is no specific business goal.

Therefore, rather than performance measures such as “sensitivity” and “precision”, which are useful and powerful in contexts like medical diagnosis and share-price prediction respectively, I choose accuracy as my evaluation metric, since the other metrics make less sense in this context.

Data Mining Goal: To classify the response variable, “Survived”, as accurately as possible.

Performance Evaluation: Accuracy


2. Exploratory Data Analysis

There are 891 samples in the training set with 11 features and one response variable, “Survived”. In addition, there are 418 samples in the testing set with the same 11 features. We can also see some missing values in both the training and testing data. (We’ll treat these later.)

Table 1: Training Data (6 rows)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 886 0 3 Rice, Mrs. William (Margaret Norton) female 39 0 5 382652 29.125 NA Q
887 887 0 2 Montvila, Rev. Juozas male 27 0 0 211536 13.000 NA S
888 888 1 1 Graham, Miss. Margaret Edith female 19 0 0 112053 30.000 B42 S
889 889 0 3 Johnston, Miss. Catherine Helen “Carrie” female NA 1 2 W./C. 6607 23.450 NA S
890 890 1 1 Behr, Mr. Karl Howell male 26 0 0 111369 30.000 C148 C
891 891 0 3 Dooley, Mr. Patrick male 32 0 0 370376 7.750 NA Q
Table 2: Testing Data (6 rows)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
413 1304 3 Henriksson, Miss. Jenny Lovisa female 28.0 0 0 347086 7.7750 NA S
414 1305 3 Spector, Mr. Woolf male NA 0 0 A.5. 3236 8.0500 NA S
415 1306 1 Oliva y Ocana, Dona. Fermina female 39.0 0 0 PC 17758 108.9000 C105 C
416 1307 3 Saether, Mr. Simon Sivertsen male 38.5 0 0 SOTON/O.Q. 3101262 7.2500 NA S
417 1308 3 Ware, Mr. Frederick male NA 0 0 359309 8.0500 NA S
418 1309 3 Peter, Master. Michael J male NA 1 1 2668 22.3583 NA C

Data Visualization

Sex x Survival

According to the figure below, females were much more likely to survive the sinking. This is plausible, as men tended to let women evacuate first.

Pclass x Survival

Upper-class passengers were more likely to survive the sinking. This is plausible, as upper-class passengers may have booked better cabins with more or better emergency equipment.

Age x Survival

According to the histogram below, children (ages 0-12) were more likely to survive. This is sensible, as parents and elders usually go all out to protect children when disaster strikes.

Fare x Survival / Fare x Pclass

Passengers who paid more had a better chance of surviving; a possible reason is that they stayed in better cabins with more complete emergency equipment. In addition, this feature is highly correlated with “Pclass”, as the right figure shows: the red boxes (“upper class”) dominate when Fare > 100, and the blue boxes (“lower class”) dominate when Fare <= 100.

SibSp / Parch / Family x Survival

Here, I plot histograms for the features SibSp, Parch, and SibSp + Parch (Family), each split into two parts by “Survived”. The two figures on the right show that passengers with one sibling or spouse, or one parent or child, were more likely to survive. However, since the information in these figures alone is limited, I combine the two features into a single one, Family, presented on the left.

The left figure (Family) shows that passengers with a small family, denoted by (1, 2, 3) on the x-axis, were apt to survive. On the other hand, passengers travelling alone (= 0) or with a large family (> 3) were more likely to lose their lives.

Insights

  1. Females were more likely to survive the accident.
  2. Children (Age < 12) were more likely to survive the accident.
  3. Pclass is highly correlated with Fare.
  4. Upper-class passengers were more likely to survive.
  5. Passengers who paid more were more likely to survive.
  6. Passengers travelling with a small family (1-3) were more likely to survive.
  7. Family (SibSp + Parch) classifies “Survived” better than its components, SibSp and Parch.

More visualizations will be presented in the next section.


3. Missing Value Treatment

How many missing values are there in training and testing data?

As we can see from the outputs below, both the training and testing data contain NAs in three columns.

Feature    Training NAs   Testing NAs
Pclass     0              0
Name       0              0
Sex        0              0
Age        177            86
SibSp      0              0
Parch      0              0
Ticket     0              0
Fare       0              1
Cabin      687            327
Embarked   2              0

(Training) Embarked - Method: Data visualization

Since there are only two missing values in the feature Embarked, I’ll start here.

First, let’s observe these two passengers’ features such as Family, Pclass, Fare, and Ticket. Since they share the same ticket number, it is reasonable to infer that they boarded at the same port. In addition, both have Family (SibSp + Parch) = 0, Pclass = 1, and Fare = 80.

Table 3: (Training) Rows with Embarked = NA
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
62 62 1 1 Icard, Miss. Amelie female 38 0 0 113572 80 B28 NA
830 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62 0 0 113572 80 B28 NA

In the first plot, Fare = 80 sits on the median of Embarked = “C” when partitioned by Pclass. However, this alone doesn’t prove that the missing values should be Embarked = “C”, since Fare = 80 also falls within the box plot of Embarked = “S”. Thus, I use the second and third plots for further evidence; both recommend “S”, because S accounts for the largest proportion of passengers with either Pclass = 1 or Family = 0. Taking everything into consideration, I populate the missing values with “S”.
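A minimal sketch of this fill in R, assuming the raw training data is in a data frame named train (the variable name is an assumption):

# Populate the two missing Embarked values with "S", as argued above.
# In the raw Kaggle CSV the missing entries may appear as empty strings,
# so both cases are covered here.
train$Embarked[is.na(train$Embarked) | train$Embarked == ""] <- "S"
train$Embarked <- factor(train$Embarked, levels = c("C", "Q", "S"))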

(Training and Testing) Cabin - Method: Add Cabin_Binary_NA and remove Cabin

It’s not a good idea to impute Cabin, since there are 687 missing values in the training set and 327 in the testing set. Therefore, I add a binary column, Cabin_Binary_na, set to 0 when Cabin is NA and 1 when Cabin has a value. As shown in the figure below, this new feature seems to classify “Survived” well: the survival rate is roughly 67% when Cabin_Binary_na = 1, yet just 30% when Cabin_Binary_na = 0.
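A minimal sketch of this derivation, again assuming train and test data frames:

# Cabin_Binary_na = 1 when Cabin is recorded, 0 when it is missing/empty.
train$Cabin_Binary_na <- as.integer(!is.na(train$Cabin) & train$Cabin != "")
test$Cabin_Binary_na  <- as.integer(!is.na(test$Cabin) & test$Cabin != "")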

(Training and Testing) Age - Method: Generate Age from the distributions

I impute Age based on four distributions generated from the existing Age values, partitioned by a new feature, “Title”.

I’ll describe this in three parts:

  1. How to generate Title from Name?
  2. How to populate Title since there are some NAs?
  3. How to populate Age from 4 distributions partitioned by Title?

1. How to generate Title from Name?

The categories of Title are determined by inspecting the Name column: “Mr.”, “Mrs.”, “Miss.”, and “Master.” (initially encoded as 1, 2, 3, 4). I extract these titles from Name and construct a new feature, “Title”. However, some Name values contain none of these titles (27 in the training data and 7 in the testing data), so those NAs have to be populated accordingly, as described in the second step. (A sketch of the extraction appears after the count table below.)

Title      Training Count   Testing Count
Mr.        517              240
Mrs.       125              72
Miss.      182              78
Master.    40               21
Total      864              411
NA         27               7
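A minimal sketch of the extraction, using the train/test data frames from above (the helper extract_title is hypothetical):

# Map the four title patterns to codes 1-4; names matching none get NA.
title_patterns <- c("Mr\\.", "Mrs\\.", "Miss\\.", "Master\\.")
extract_title <- function(name) {
  hit <- which(sapply(title_patterns, grepl, x = name))
  if (length(hit) == 0) NA_integer_ else as.integer(hit[1])
}
train$Title <- vapply(as.character(train$Name), extract_title, integer(1), USE.NAMES = FALSE)
test$Title  <- vapply(as.character(test$Name), extract_title, integer(1), USE.NAMES = FALSE)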

2. How to populate Title since there are some NAs?

Title is populated sequentially based on existing Age values and Sex. Below is the profile of rows with Title = NA. Since all the males are older than 12, their NA Titles can be assigned as “Mr.”. For row 767, although Age is NA, we can still assign “Mr.” because the Name shows he is a doctor.

Table 4: (Training) Rows with Title = NA
Name Age Sex
31 Uruchurtu, Don. Manuel E 40 male
150 Byles, Rev. Thomas Roussel Davids 42 male
151 Bateman, Rev. Robert James 51 male
246 Minahan, Dr. William Edward 44 male
250 Carter, Rev. Ernest Courtenay 54 male
318 Moraweck, Dr. Ernest 54 male
370 Aubart, Mme. Leontine Pauline 24 female
399 Pain, Dr. Alfred 23 male
444 Reynaldo, Ms. Encarnacion 28 female
450 Peuchen, Major. Arthur Godfrey 52 male
537 Butt, Major. Archibald Willingham 45 male
557 Duff Gordon, Lady. (Lucille Christiana Sutherland) (“Mrs Morgan”) 48 female
600 Duff Gordon, Sir. Cosmo Edmund (“Mr Morgan”) 49 male
627 Kirkland, Rev. Charles Leonard 57 male
633 Stahelin-Maeglin, Dr. Max 32 male
642 Sagesser, Mlle. Emma 24 female
648 Simonius-Blumer, Col. Oberst Alfons 56 male
661 Frauenthal, Dr. Henry William 50 male
695 Weir, Col. John 60 male
711 Mayne, Mlle. Berthe Antonine (“Mrs de Villiers”) 24 female
746 Crosby, Capt. Edward Gifford 70 male
760 Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) 33 female
767 Brewe, Dr. Arthur Jackson NA male
797 Leader, Dr. Alice (Farnham) 49 female
823 Reuchlin, Jonkheer. John George 38 male
849 Harper, Rev. John 28 male
887 Montvila, Rev. Juozas 27 male

Now only females are left. I’ve found that “Mrs.” passengers tend to be older than “Miss.” passengers, with a rough threshold of Age = 25; with this rule the new feature Title is completely populated (a sketch of the rule follows the table below). Note that row 711, although under 25, is assigned “Mrs.” because her Name includes “Mrs”, which the “Mrs.” pattern (with a trailing dot) did not capture.

Note that Title in the testing set is constructed in the same way.

Table 5: (Training) Rows with Title = NA after populating “Mr.” into Title
Name Age Sex
370 Aubart, Mme. Leontine Pauline 24 female
444 Reynaldo, Ms. Encarnacion 28 female
557 Duff Gordon, Lady. (Lucille Christiana Sutherland) (“Mrs Morgan”) 48 female
642 Sagesser, Mlle. Emma 24 female
711 Mayne, Mlle. Berthe Antonine (“Mrs de Villiers”) 24 female
760 Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) 33 female
797 Leader, Dr. Alice (Farnham) 49 female
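A minimal sketch of this rule, using the Title codes 2 = “Mrs.” and 3 = “Miss.”:

# Remaining NA Titles are all female: assign "Mrs." (2) when Age >= 25
# or when the Name contains "Mrs" (this catches row 711); else "Miss." (3).
idx <- which(is.na(train$Title) & train$Sex == "female")
train$Title[idx] <- ifelse(
  (!is.na(train$Age[idx]) & train$Age[idx] >= 25) | grepl("Mrs", train$Name[idx]),
  2L, 3L)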

The figure below presents the relationship between the new feature, Title, and Survived. We can see that “Mr.” passengers tended to perish, while “Mrs.” and “Miss.” passengers were apt to survive.

3. How to populate Age’s NAs from 4 distributions partitioned by Title?

Note that the testing set follows the same missing-value treatment.

The figures below show the Age distributions from the training data; the left one is plotted without missing values, and the right one is based on the original data. They are rather similar.
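One way to implement the imputation is to draw missing ages from the empirical Age distribution of each Title group; this sketch assumes that interpretation (set.seed added for reproducibility):

set.seed(2020)  # reproducible draws
impute_age <- function(df) {
  for (t in 1:4) {
    observed <- df$Age[df$Title == t & !is.na(df$Age)]
    missing  <- which(df$Title == t & is.na(df$Age))
    df$Age[missing] <- sample(observed, length(missing), replace = TRUE)
  }
  df
}
train <- impute_age(train)
test  <- impute_age(test)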

(Testing) Fare - Method: kNN Data imputation

Since there is only one missing Fare value in the testing set, it is reasonable to populate it with the kNN algorithm: kNN finds several similar passengers and fills the missing Fare by aggregating their fares (for a numeric variable, typically the median or mean of the neighbours, rather than a majority vote).
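A minimal sketch using the VIM package (the choice of VIM::kNN and k = 5 is an assumption, not necessarily the author’s exact call):

library(VIM)  # provides the kNN() imputation function
# Fill the single missing Fare from the 5 nearest neighbours (median fare).
test <- kNN(test, variable = "Fare", k = 5, imp_var = FALSE)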

Now all the missing values in the training and testing sets have been cleaned successfully. The table below shows the data after imputation.

Feature    Training NAs   Testing NAs
Pclass     0              0
Name       0              0
Sex        0              0
Age        0              0
SibSp      0              0
Parch      0              0
Ticket     0              0
Fare       0              0
Cabin      0              0
Embarked   0              0

Summary

Feature Engineering:

  1. Cabin_Binary_na: (0, 1) Generated from Cabin by assigning 0 to Cabin’s NAs and 1 to recorded values
  2. Title: (1: “Mr.”, 2: “Mrs.”, 3: “Miss.”, 4: “Master.”) Generated from Name, with NAs populated using Sex and Age

Data Cleaning and Imputation

  1. Embarked in Training set (2)
  2. Cabin in Training (687) and Testing set (327)
  3. Age in Training (177) and Testing set (86)
  4. Fare in Testing set (1)

4. Data Manipulation (Feature Engineering)

Above, the new features Title and Cabin_Binary_na were created and added to the training and testing datasets. Here, I create a few more features that may help improve the model in the next section.

The table below includes all features currently in the training set. I remove PassengerId and Name: the former carries no information, and the latter has already been mined for Title.

Table 6: Training Dataset (6 rows)
Survived Pclass Sex Age SibSp Parch Ticket Fare Embarked Cabin_Binary Title
886 0 3 female 39.0 0 5 382652 29.125 Q 0 2
887 0 2 male 27.0 0 0 211536 13.000 S 0 1
888 1 1 female 19.0 0 0 112053 30.000 S 1 3
889 0 3 female 27.5 1 2 W./C. 6607 23.450 S 0 3
890 1 1 male 26.0 0 0 111369 30.000 C 1 1
891 0 3 male 32.0 0 0 370376 7.750 Q 0 1

Family

This feature proved useful in the EDA (Family x Survival), so I create it by adding SibSp and Parch together, as sketched below.
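The construction is a one-liner for each set:

# Family = SibSp + Parch (0 means the passenger travelled alone).
train$Family <- train$SibSp + train$Parch
test$Family  <- test$SibSp + test$Parch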

Title and Pclass - One-hot Encoding

Pclass and Title are transformed to one-hot encodings (dummy variables), since these features have no ordinal relation; that is, 4: “Master.” in Title is not better or larger than 1: “Mr.”.
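A sketch that produces the Title1-Title3 and Pclass1-Pclass2 dummies seen in the final dataset; the remaining level of each feature (Title 4 = “Master.”, Pclass 3) acts as the reference category. The loop form is an assumption about the implementation:

# One dummy column per non-reference level.
for (i in 1:3) {
  train[[paste0("Title", i)]] <- as.integer(train$Title == i)
  test[[paste0("Title", i)]]  <- as.integer(test$Title == i)
}
for (i in 1:2) {
  train[[paste0("Pclass", i)]] <- as.integer(train$Pclass == i)
  test[[paste0("Pclass", i)]]  <- as.integer(test$Pclass == i)
}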

Fare

In this part, I bin Fare into a binary variable, Fare_Binary, which is “0” when Fare < 50 and “1” otherwise. As we can see from the figures below, although the threshold 100 in the right figure seems to classify “Survived” better, the sample in that region is rather small; thus I prefer the threshold 50.
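A minimal sketch of the binning; storing the result as a factor matches the Fare1 coefficient name in the model output below:

# Replace Fare with a binary factor: "0" if Fare < 50, "1" otherwise.
train$Fare <- factor(as.integer(train$Fare >= 50))
test$Fare  <- factor(as.integer(test$Fare >= 50))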


5. Performance Evaluation

Logistic Regression Model

Let’s divide this section into 4 parts:

  1. Select features and train the model
  2. Tune the model via leave-one-out cross validation
  3. Select the probability threshold and conduct prediction
  4. Submit the prediction result on Kaggle

1. Select features for the model

The features below are used for model building.

Table 7: Final Training Dataset (6 rows)
Age Fare Cabin_Binary Family Title1 Title2 Title3 Pclass1 Pclass2
886 39.0 0 0 5 0 1 0 0 0
887 27.0 0 0 0 1 0 0 0 1
888 19.0 0 1 0 0 0 1 1 0
889 27.5 0 0 3 0 0 1 0 0
890 26.0 0 1 0 1 0 0 1 0
891 32.0 0 0 0 1 0 0 0 0
Table 8: Final Testing Dataset (6 rows)
Age Fare Cabin_Binary Family Title1 Title2 Title3 Pclass1 Pclass2
413 28.0 0 0 0 0 0 1 0 0
414 27.5 0 0 0 1 0 0 0 0
415 39.0 1 1 0 0 1 0 1 0
416 38.5 0 0 0 1 0 0 0 0
417 27.5 0 0 0 1 0 0 0 0
418 13.0 0 0 2 0 0 0 0 0

2. Leave-one-out Cross Validation

Since the training sample size is not very large, it is better to carry out leave-one-out CV to tune the probability threshold instead of n-fold CV; this also avoids the randomness introduced when partitioning the data in n-fold CV.

The figure below shows the accuracies under probability thresholds from 0.1 to 1 with step 0.1. When threshold = 0.5 and 0.6, the model performs almost identically. In this case, I choose 0.6 as the threshold, because most training observations have “Survived” = 0; with threshold = 0.6 it is harder for the model to classify a testing observation as 1, i.e., more 0s are expected to be predicted.
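A sketch of the tuning loop, assuming Train_df (the name used in the glm call below) holds Survived coded as 0/1 plus the features from Table 7:

# Leave-one-out CV: refit on n - 1 rows, predict the held-out row.
n <- nrow(Train_df)
loo_prob <- numeric(n)
for (i in seq_len(n)) {
  fit <- glm(Survived ~ ., family = binomial, data = Train_df[-i, ])
  loo_prob[i] <- predict(fit, newdata = Train_df[i, ], type = "response")
}
# Accuracy at each candidate threshold.
thresholds <- seq(0.1, 1, by = 0.1)
accuracy <- sapply(thresholds,
                   function(th) mean(as.integer(loo_prob >= th) == Train_df$Survived))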

3. Logistic Regression Model and the Prediction

With the features from step 1 and the threshold = 0.6 selected in step 2, the model details are reported below. It is unsurprising that most of the features are significant, since they were selected based on the EDA results.


Call:
glm(formula = Survived ~ ., family = binomial, data = Train_df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5058  -0.5372  -0.3692   0.5305   2.8598  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    1.88262    0.50472   3.730 0.000191 ***
Age           -0.02937    0.00865  -3.396 0.000685 ***
Fare1          0.79524    0.33070   2.405 0.016187 *  
Cabin_Binary1  0.70768    0.36558   1.936 0.052894 .  
Family        -0.49004    0.08200  -5.976 2.28e-09 ***
Title1        -3.43268    0.53447  -6.423 1.34e-10 ***
Title2         0.40954    0.54537   0.751 0.452682    
Title3        -0.49253    0.49087  -1.003 0.315679    
Pclass1        1.50465    0.41484   3.627 0.000287 ***
Pclass2        1.00308    0.24584   4.080 4.50e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.7  on 890  degrees of freedom
Residual deviance:  718.5  on 881  degrees of freedom
AIC: 738.5

Number of Fisher Scoring iterations: 5
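
Finally, a sketch of step 4, the prediction and submission file; Test_df (the test-set counterpart of Train_df) and the output file name are assumptions:

# Fit on the full training data, predict test probabilities, apply threshold 0.6.
final_fit <- glm(Survived ~ ., family = binomial, data = Train_df)
test_prob <- predict(final_fit, newdata = Test_df, type = "response")
submission <- data.frame(PassengerId = test$PassengerId,
                         Survived    = as.integer(test_prob >= 0.6))
write.csv(submission, "titanic_submission.csv", row.names = FALSE)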