The sinking of the Titanic was a rare, one-off event: it is not a continuously repeatable process, and there is no specific business goal attached to it.
Therefore, rather than performance metrics such as sensitivity or precision, which are useful and powerful in cases like medical diagnosis and share-price prediction respectively, I choose accuracy as my evaluation metric, since the other metrics make less sense in this context.
Data Mining Goal: To classify the response variable, “Survived”, as accurately as possible.
Performance Evaluation: Accuracy
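Accuracy is simply the share of correct predictions. A minimal Python sketch (the helper name `accuracy` is mine, not from the report, whose own computations appear to be done in R):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 3 of 4 correct -> 0.75
```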
There are 891 samples in the training set with 11 features and one response variable, “Survived”. In addition, there are 418 samples in the testing set with 11 features. We can also see that there are some missing values in both the training and testing data. (We’ll treat these later.)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39 | 0 | 5 | 382652 | 29.125 | NA | Q |
| 887 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27 | 0 | 0 | 211536 | 13.000 | NA | S |
| 888 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19 | 0 | 0 | 112053 | 30.000 | B42 | S |
| 889 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen “Carrie” | female | NA | 1 | 2 | W./C. 6607 | 23.450 | NA | S |
| 890 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26 | 0 | 0 | 111369 | 30.000 | C148 | C |
| 891 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32 | 0 | 0 | 370376 | 7.750 | NA | Q |
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 413 | 1304 | 3 | Henriksson, Miss. Jenny Lovisa | female | 28.0 | 0 | 0 | 347086 | 7.7750 | NA | S |
| 414 | 1305 | 3 | Spector, Mr. Woolf | male | NA | 0 | 0 | A.5. 3236 | 8.0500 | NA | S |
| 415 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
| 416 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NA | S |
| 417 | 1308 | 3 | Ware, Mr. Frederick | male | NA | 0 | 0 | 359309 | 8.0500 | NA | S |
| 418 | 1309 | 3 | Peter, Master. Michael J | male | NA | 1 | 1 | 2668 | 22.3583 | NA | C |
According to the figure below, females tended to survive the Titanic sinking. This is reasonable, since males were apt to protect females.
Upper-class passengers tended to survive the sinking of the Titanic. This is reasonable, since upper-class passengers may have booked better cabins with more or better emergency equipment inside.
According to the histogram below, we find that children (0–12) tended to survive. This is sensible, since parents and elders usually go all out to protect their children or juniors when facing disasters.
Passengers who spent more had a better chance of surviving; a possible reason is that they stayed in better cabins with more complete emergency equipment. In addition, this feature correlates highly with the feature “Pclass”, as the right figure shows: the red boxes (“upper class”) dominate when fare > 100, and the blue boxes (“lower class”) dominate when fare <= 100.
Here, I plot histograms for the features SibSp, Parch, and SibSp + Parch (Family), each partitioned into 2 parts based on “Survived”. The right two figures tell us that passengers with 1 sibling or spouse, or 1 parent or child, tended to survive. However, since the information obtained from these figures is limited, I bin them into one feature, Family, presented on the left.
The left figure (Family) shows that passengers with a small family, denoted by (1, 2, 3) on the x-axis, were apt to survive. On the other hand, passengers who were single (= 0) or accompanied by a large family (> 3) were more likely to lose their lives.
More visualizations will be presented in the next section.
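The Family binning described above can be sketched as follows: a minimal Python illustration with made-up rows that mirror the dataset's SibSp/Parch/Survived columns (the report's own analysis appears to be in R).

```python
from collections import defaultdict

# Hypothetical mini-sample mirroring the dataset's SibSp/Parch/Survived columns.
passengers = [
    {"SibSp": 1, "Parch": 0, "Survived": 1},
    {"SibSp": 0, "Parch": 0, "Survived": 0},
    {"SibSp": 4, "Parch": 1, "Survived": 0},
    {"SibSp": 1, "Parch": 2, "Survived": 1},
]

# Bin SibSp and Parch into one Family feature and group survival rates by it.
counts = defaultdict(lambda: [0, 0])   # family size -> [survivors, total]
for p in passengers:
    family = p["SibSp"] + p["Parch"]
    counts[family][0] += p["Survived"]
    counts[family][1] += 1

rates = {fam: s / n for fam, (s, n) in counts.items()}
print(rates)
```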
As we can see from the outputs below, both the training and testing data contain NAs in 3 columns.
| Training Data | NA | Testing Data | NA |
|---|---|---|---|
| Pclass | 0 | Pclass | 0 |
| Name | 0 | Name | 0 |
| Sex | 0 | Sex | 0 |
| Age | 177 | Age | 86 |
| SibSp | 0 | SibSp | 0 |
| Parch | 0 | Parch | 0 |
| Ticket | 0 | Ticket | 0 |
| Fare | 0 | Fare | 1 |
| Cabin | 687 | Cabin | 327 |
| Embarked | 2 | Embarked | 0 |
Since there are just 2 missing values in the feature Embarked, I’ll start with it.
First, let’s observe these 2 passengers’ features, such as Family, Pclass, Fare, and Ticket. Since they have the same ticket number, it is reasonable to infer that they boarded from the same place. In addition, both have Family (SibSp + Parch) = 0, Pclass = 1, and Fare = 80.
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 62 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38 | 0 | 0 | 113572 | 80 | B28 | NA |
| 830 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62 | 0 | 0 | 113572 | 80 | B28 | NA |
From the first plot, it appears that Fare = 80 lies on the median of Embarked = “C” when partitioned by Pclass. However, this doesn’t conclusively prove that the missing values have Embarked = “C”, since Fare = 80 also falls within the box plot of Embarked = “S”. Thus, I tried the second and third plots to find other evidence; both of them recommend “S” for the missing values, because S accounts for the largest proportion in both Pclass = 1 and Family = 0. Taking everything into consideration, I suggest populating the missing values with S.
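Filling the two gaps with the majority port among comparable passengers can be sketched in a few lines of Python (the `comparable` values here are made up to illustrate the mode computation; the report's evidence comes from its plots):

```python
from collections import Counter

# Hypothetical Embarked values among comparable passengers (Pclass = 1,
# Family = 0); the report's plots suggest "S" dominates this group.
comparable = ["S", "S", "C", "S", "Q", "S", "C"]
mode = Counter(comparable).most_common(1)[0][0]

# Fill the two passengers' missing Embarked (None stands in for NA).
embarked = {62: None, 830: None}
filled = {pid: (mode if v is None else v) for pid, v in embarked.items()}
print(filled)  # both filled with 'S'
```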
It’s not a good idea to carry out data imputation in this case, since there are 687 missing Cabin values in the training set and 327 in the testing set. Therefore, I add one binary column that is 0 when Cabin = “NA” and 1 when Cabin has a value. As shown in the figure below, this new feature seems to classify “Survived” well: roughly 67% survive if Cabin_Binary_na = 1, yet just 30% if Cabin_Binary_na = 0.
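The presence/absence indicator can be built in one pass. A minimal Python sketch (None stands in for NA; the column name `cabin_binary` mirrors the report's Cabin_Binary_na):

```python
# Hypothetical Cabin values; None stands in for NA.
cabins = ["B28", None, "C148", None]

# 1 when a cabin is recorded, 0 when it is missing.
cabin_binary = [0 if c is None else 1 for c in cabins]
print(cabin_binary)  # [1, 0, 1, 0]
```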
I conduct data imputation for Age based on 4 distributions generated from the existing Age values partitioned by “Title”.
To describe this, I’ll divide it into 3 parts:
The categories of Title are determined after observing the column Name; they are “Mr.”, “Mrs.”, “Miss.”, and “Master.” (encoded as 1, 2, 3, 4 at first). I extract these titles from Name and construct a new feature, “Title”. However, some Name values don’t contain these titles (27 in the training data and 7 in the testing data), so I have to populate those NAs accordingly, as described in the second step.
| Training Data | Value | Testing Data | Value |
|---|---|---|---|
| Mr. | 517 | Mr. | 240 |
| Mrs. | 125 | Mrs. | 72 |
| Miss. | 182 | Miss. | 78 |
| Master. | 40 | Master. | 21 |
| Total | 864 | Total | 411 |
| NA | 27 | NA | 7 |
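The extraction step can be sketched with a regular expression. A minimal Python version (the function name and the exact pattern are my illustration; it assumes a matched title always ends with a dot, which is why names like “Rev. Juozas” fall through as NA):

```python
import re

def extract_title(name):
    """Return one of the four titles found in a Name string, or None."""
    m = re.search(r"(Mr|Mrs|Miss|Master)\.", name)
    return m.group(0) if m else None

print(extract_title("Rice, Mrs. William (Margaret Norton)"))  # Mrs.
print(extract_title("Graham, Miss. Margaret Edith"))          # Miss.
print(extract_title("Montvila, Rev. Juozas"))                 # None -> filled in step 2
```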
Title is populated sequentially based on the existing Age values and Sex. Below is the profile with Title = NA. Since all the males are older than 12, their missing Titles can be assigned “Mr.”. For row 767, although Age is NA, we can still assign “Mr.”, since the Name shows he is a doctor.
| | Name | Age | Sex |
|---|---|---|---|
| 31 | Uruchurtu, Don. Manuel E | 40 | male |
| 150 | Byles, Rev. Thomas Roussel Davids | 42 | male |
| 151 | Bateman, Rev. Robert James | 51 | male |
| 246 | Minahan, Dr. William Edward | 44 | male |
| 250 | Carter, Rev. Ernest Courtenay | 54 | male |
| 318 | Moraweck, Dr. Ernest | 54 | male |
| 370 | Aubart, Mme. Leontine Pauline | 24 | female |
| 399 | Pain, Dr. Alfred | 23 | male |
| 444 | Reynaldo, Ms. Encarnacion | 28 | female |
| 450 | Peuchen, Major. Arthur Godfrey | 52 | male |
| 537 | Butt, Major. Archibald Willingham | 45 | male |
| 557 | Duff Gordon, Lady. (Lucille Christiana Sutherland) (“Mrs Morgan”) | 48 | female |
| 600 | Duff Gordon, Sir. Cosmo Edmund (“Mr Morgan”) | 49 | male |
| 627 | Kirkland, Rev. Charles Leonard | 57 | male |
| 633 | Stahelin-Maeglin, Dr. Max | 32 | male |
| 642 | Sagesser, Mlle. Emma | 24 | female |
| 648 | Simonius-Blumer, Col. Oberst Alfons | 56 | male |
| 661 | Frauenthal, Dr. Henry William | 50 | male |
| 695 | Weir, Col. John | 60 | male |
| 711 | Mayne, Mlle. Berthe Antonine (“Mrs de Villiers”) | 24 | female |
| 746 | Crosby, Capt. Edward Gifford | 70 | male |
| 760 | Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) | 33 | female |
| 767 | Brewe, Dr. Arthur Jackson | NA | male |
| 797 | Leader, Dr. Alice (Farnham) | 49 | female |
| 823 | Reuchlin, Jonkheer. John George | 38 | male |
| 849 | Harper, Rev. John | 28 | male |
| 887 | Montvila, Rev. Juozas | 27 | male |
Now only females are left. I found that “Mrs.” tends to be older than “Miss.”, with a rough threshold of Age = 25. Thus, the new feature Title is completely populated. Note that row 711, although under 25, is assigned “Mrs.” because its Name includes “Mrs” without a dot, which is not captured by the “Mrs.” criterion.
Note that Title in the testing set is constructed in the same way.
| | Name | Age | Sex |
|---|---|---|---|
| 370 | Aubart, Mme. Leontine Pauline | 24 | female |
| 444 | Reynaldo, Ms. Encarnacion | 28 | female |
| 557 | Duff Gordon, Lady. (Lucille Christiana Sutherland) (“Mrs Morgan”) | 48 | female |
| 642 | Sagesser, Mlle. Emma | 24 | female |
| 711 | Mayne, Mlle. Berthe Antonine (“Mrs de Villiers”) | 24 | female |
| 760 | Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) | 33 | female |
| 797 | Leader, Dr. Alice (Farnham) | 49 | female |
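The fallback rules above can be condensed into one small function. A hedged Python sketch (the helper name `fill_title` and the exact rule ordering are mine; it assumes males default to “Mr.” and females split on the un-dotted “Mrs” marker or the Age = 25 threshold, as the text describes):

```python
def fill_title(sex, age, name):
    """Assign a title when no dotted title was found in the Name.
    Males -> 'Mr.'; females -> 'Mrs.' if the name carries an un-dotted
    'Mrs' or Age >= 25, otherwise 'Miss.'."""
    if sex == "male":
        return "Mr."
    if "Mrs" in name or (age is not None and age >= 25):
        return "Mrs."
    return "Miss."

print(fill_title("female", 24, 'Mayne, Mlle. Berthe Antonine ("Mrs de Villiers")'))  # Mrs.
print(fill_title("female", 24, "Sagesser, Mlle. Emma"))                              # Miss.
print(fill_title("male", None, "Brewe, Dr. Arthur Jackson"))                         # Mr.
```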
The figure below presents the relationship between the new feature, Title, and Survived. We can see that “Mr.” tends not to survive, whereas “Mrs.” and “Miss.” are apt to survive.
With Title obtained in the previous step, we can construct 4 distributions of Age (from the non-missing values) based on the different Titles, presented in the first row of the plot below.
Next, I create 4 discrete distributions similar to the original ones and generate 10,000 random samples from each. The corresponding histograms are reported in the second row. We can see that the distributions of the observed data in the first row are very similar to the distributions of the generated samples in the second row.
Lastly, it’s time to populate the NA values in Age with the generated samples. The third row contains the histograms after data imputation, which are quite similar to the distributions shown in the first row. Therefore, we have succeeded in populating all the Age values in the training set.
Note that the testing set follows the same steps for treating missing values.
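Drawing imputed ages from the empirical distribution of the observed ages can be sketched in a few lines of Python (the `observed_ages` values are made up; `random.choice` samples uniformly from the observed support, a simple stand-in for the discrete distributions the report fits per Title):

```python
import random

random.seed(0)  # fixed seed for reproducibility

# Hypothetical observed (non-missing) ages for one Title group.
observed_ages = [22, 24, 25, 28, 30, 33, 35, 40]

# Draw each imputed value from the empirical distribution of observed ages,
# so the post-imputation histogram stays close to the original one.
n_missing = 3
imputed = [random.choice(observed_ages) for _ in range(n_missing)]

print(imputed)  # every draw comes from the observed support
```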
The figures below are the Age distributions from the training data: the left one is plotted without the missing values, and the right one is plotted based on the original data. They are rather similar.
Since there is only one missing Fare value in the testing set, it is reasonable to populate it with the kNN algorithm: kNN finds several similar passengers and fills the missing Fare from their values (by averaging the neighbors’ fares, since Fare is continuous, rather than by a majority vote).
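A minimal Python sketch of the idea (feature vectors, fares, and `k` are hypothetical; it uses plain Euclidean distance on a toy (Pclass, Family) representation and averages the k nearest fares):

```python
def knn_impute_fare(target, neighbors, k=3):
    """Impute a missing fare as the mean fare of the k nearest passengers.
    neighbors: list of (feature_vector, fare) pairs."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(neighbors, key=lambda nb: dist(nb[0], target))[:k]
    return sum(fare for _, fare in nearest) / k

# Hypothetical (Pclass, Family) vectors with their fares.
neighbors = [((3, 0), 7.75), ((3, 0), 8.05), ((3, 1), 7.90), ((1, 0), 80.0)]
print(knn_impute_fare((3, 0), neighbors))  # mean of the 3 closest third-class fares
```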
Now, all the missing values in the training and testing sets have been successfully cleaned. The table below shows the cleaned data after imputation.
| Training Data | NA | Testing Data | NA |
|---|---|---|---|
| Pclass | 0 | Pclass | 0 |
| Name | 0 | Name | 0 |
| Sex | 0 | Sex | 0 |
| Age | 0 | Age | 0 |
| SibSp | 0 | SibSp | 0 |
| Parch | 0 | Parch | 0 |
| Ticket | 0 | Ticket | 0 |
| Fare | 0 | Fare | 0 |
| Cabin | 0 | Cabin | 0 |
| Embarked | 0 | Embarked | 0 |
In the contents above, the new features Title and Cabin_Binary_na were created and added to the training and testing datasets. Here, I am going to create some more features that may help improve the model in the next section.
The table below includes all the features currently in the training set. I remove PassengerId and Name, which contain no useful information or have already been extracted.
| | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | Cabin_Binary | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 0 | 3 | female | 39.0 | 0 | 5 | 382652 | 29.125 | Q | 0 | 2 |
| 887 | 0 | 2 | male | 27.0 | 0 | 0 | 211536 | 13.000 | S | 0 | 1 |
| 888 | 1 | 1 | female | 19.0 | 0 | 0 | 112053 | 30.000 | S | 1 | 3 |
| 889 | 0 | 3 | female | 27.5 | 1 | 2 | W./C. 6607 | 23.450 | S | 0 | 3 |
| 890 | 1 | 1 | male | 26.0 | 0 | 0 | 111369 | 30.000 | C | 1 | 1 |
| 891 | 0 | 3 | male | 32.0 | 0 | 0 | 370376 | 7.750 | Q | 0 | 1 |
This feature proved useful in the EDA (Family × Survived), so I create it by adding SibSp and Parch together.
Pclass and Title are transformed with one-hot encoding (dummy variables), since there are no ordinal relations in these features; that is, 4: “Master.” in Title is not better or larger than 1: “Mr.”.
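A hand-rolled one-hot sketch in Python (my illustration, not the report's code; it drops the last level as the reference category, consistent with the feature table where a “Master.” row has Title1 = Title2 = Title3 = 0):

```python
def one_hot(value, levels):
    """Encode `value` as k-1 dummy columns, dropping the last level
    as the reference category."""
    return [1 if value == lv else 0 for lv in levels[:-1]]

titles = [1, 4, 3]  # Mr., Master., Miss.
print([one_hot(t, [1, 2, 3, 4]) for t in titles])
# [[1, 0, 0], [0, 0, 0], [0, 0, 1]]
```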
In this part, I bin Fare into a binary variable where Fare_Binary = 0 when Fare < 50 and 1 otherwise. As we can see from the figures below, although the threshold of 100 in the right figure seems to classify “Survived” better, the sample in that region is rather small, so I prefer the threshold of 50 instead.
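The binning itself is a one-liner; a minimal Python sketch with hypothetical fares:

```python
# Hypothetical fares; 0 below the 50 threshold, 1 at or above it.
fares = [7.75, 29.125, 108.9]
fare_binary = [0 if f < 50 else 1 for f in fares]
print(fare_binary)  # [0, 0, 1]
```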
Let’s divide this section into 4 parts:
The features below are considered for model building.
| | Age | Fare | Cabin_Binary | Family | Title1 | Title2 | Title3 | Pclass1 | Pclass2 |
|---|---|---|---|---|---|---|---|---|---|
| 886 | 39.0 | 0 | 0 | 5 | 0 | 1 | 0 | 0 | 0 |
| 887 | 27.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 888 | 19.0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 889 | 27.5 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 0 |
| 890 | 26.0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 891 | 32.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| | Age | Fare | Cabin_Binary | Family | Title1 | Title2 | Title3 | Pclass1 | Pclass2 |
|---|---|---|---|---|---|---|---|---|---|
| 413 | 28.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 414 | 27.5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 415 | 39.0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 416 | 38.5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 417 | 27.5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 418 | 13.0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
Since the training sample is not very large, it is better to tune the probability threshold with leave-one-out CV rather than n-fold CV, because doing so avoids the potential randomness introduced when partitioning the data in n-fold CV.
The figure below shows the accuracies under probability thresholds from 0.1 to 1 in steps of 0.1. As we can see, the model performs very similarly at thresholds of 0.5 and 0.6. In this case, I suggest choosing 0.6 as the threshold for building the model, since most of the “Survived” values in the training dataset are 0; with threshold = 0.6, it is harder for the model to classify the testing “Survived” as 1, so more 0s are expected to be predicted.
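The shape of the threshold-tuning loop can be sketched in Python (the function and the probabilities are my stand-ins; in the real leave-one-out procedure the model is refit without each held-out row, which the stub skips for brevity):

```python
def loo_accuracy(probs, labels, threshold):
    """Accuracy of thresholded predictions, one held-out row at a time."""
    correct = 0
    for i in range(len(labels)):
        # Real LOOCV would refit the model without row i here; the stub
        # reuses precomputed probabilities to keep the sketch short.
        pred = 1 if probs[i] > threshold else 0
        correct += (pred == labels[i])
    return correct / len(labels)

probs = [0.9, 0.55, 0.3, 0.7]
labels = [1, 0, 0, 1]
print(loo_accuracy(probs, labels, 0.5))  # 0.75: the 0.55 case is misclassified
print(loo_accuracy(probs, labels, 0.6))  # 1.0: the 0.6 cut separates all four
```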
With the feature settings mentioned in step 1 and the threshold of 0.6 selected in step 2, the model details are reported below. It is unsurprising that most of the features are significant, since they were selected based on the EDA results.
```
Call:
glm(formula = Survived ~ ., family = binomial, data = Train_df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5058  -0.5372  -0.3692   0.5305   2.8598

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.88262    0.50472   3.730 0.000191 ***
Age           -0.02937    0.00865  -3.396 0.000685 ***
Fare1          0.79524    0.33070   2.405 0.016187 *
Cabin_Binary1  0.70768    0.36558   1.936 0.052894 .
Family        -0.49004    0.08200  -5.976 2.28e-09 ***
Title1        -3.43268    0.53447  -6.423 1.34e-10 ***
Title2         0.40954    0.54537   0.751 0.452682
Title3        -0.49253    0.49087  -1.003 0.315679
Pclass1        1.50465    0.41484   3.627 0.000287 ***
Pclass2        1.00308    0.24584   4.080 4.50e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.7  on 890  degrees of freedom
Residual deviance:  718.5  on 881  degrees of freedom
AIC: 738.5

Number of Fisher Scoring iterations: 5
```
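Applying the fitted coefficients by hand shows how a prediction is made: the linear predictor is the intercept plus each coefficient times its feature, and the survival probability is the logistic transform of that sum. A Python sketch using the coefficients reported above and a hypothetical passenger (a 30-year-old “Mr.” in third class):

```python
import math

# Coefficients from the glm summary above.
coef = {"(Intercept)": 1.88262, "Age": -0.02937, "Fare1": 0.79524,
        "Cabin_Binary1": 0.70768, "Family": -0.49004, "Title1": -3.43268,
        "Title2": 0.40954, "Title3": -0.49253, "Pclass1": 1.50465,
        "Pclass2": 1.00308}

# Hypothetical passenger: 30-year-old "Mr." (Title1 = 1), third class
# (both Pclass dummies 0), travelling alone, low fare, no recorded cabin.
passenger = {"Age": 30, "Fare1": 0, "Cabin_Binary1": 0, "Family": 0,
             "Title1": 1, "Title2": 0, "Title3": 0, "Pclass1": 0, "Pclass2": 0}

eta = coef["(Intercept)"] + sum(coef[k] * v for k, v in passenger.items())
prob = 1 / (1 + math.exp(-eta))        # logistic transform
pred = 1 if prob > 0.6 else 0          # the 0.6 threshold from step 2
print(round(prob, 3), pred)            # low probability -> predicted 0
```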