Titanic catastrophe data analysis using Python

Data Extraction and Visualization project using Python:

In this project I will try to answer some basic questions about the Titanic tragedy using Python.

  • Who were the passengers on the Titanic? (age, gender, class, etc.)
  • What deck were the passengers on and how does that relate to their class?
  • Where did the passengers come from?
  • Who was alone and who was with family?
  • What factors helped someone survive the sinking?

First we will get our dataset from Kaggle.com.

After importing Python libraries such as pandas, NumPy, and seaborn, we will load the dataset and set it up as a DataFrame:

import pandas as pd
import numpy as np
from pandas import Series, DataFrame
from scipy import stats
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
titanic_df = pd.read_csv(r"C:\Users\Usuario\Desktop\Titanic Project\data.csv")

Let’s take a look at our data:

titanic_df.head()
   PassengerId  Survived  Pclass  Name                                                 Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                              male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...    female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                               female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                             male    35.0  0      0      373450            8.0500   NaN    S

Looking at the dataset with the first question in mind, we can see that the Sex column has two values, male and female. For a better analysis we will add a third category, child, assuming that every passenger under 16 years old is a child.

def male_female_child(passenger):
    # Take the Age and Sex
    age,sex = passenger
    # Compare the age, otherwise leave the sex
    if age < 16:
        return 'child'
    else:
        return sex
# We'll define a new column called 'person'; axis=1 applies the function to each row
titanic_df['person'] = titanic_df[['Age','Sex']].apply(male_female_child,axis=1)
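
The same rule can also be written without a row-by-row apply. The snippet below is only a sketch on a few toy rows that I made up for illustration (they are not from the Kaggle file); it also shows what happens to passengers whose age is missing.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration (not taken from the Kaggle file).
toy = pd.DataFrame({
    "Age": [22.0, 8.0, np.nan, 15.9, 16.0],
    "Sex": ["male", "female", "male", "male", "female"],
})

# Vectorized version of the same rule: under 16 -> 'child', otherwise keep Sex.
# NaN < 16 evaluates to False, so passengers with a missing age keep their sex.
toy["person"] = np.where(toy["Age"] < 16, "child", toy["Sex"])

print(toy)
```

Note that a passenger aged exactly 16 is not counted as a child, and a missing age falls back to the recorded sex, which is also how the apply version behaves.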

Now that we have created another column called person, let’s visualize our data.

sns.catplot(x="person",data=titanic_df,kind="count")
titanic_df["person"].value_counts()
male      537
female    271
child      83
Name: person, dtype: int64

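As we can see, the dataset contains 537 males, 271 females, and 83 children. Passengers with a missing age keep their recorded sex, because a missing age never satisfies the "under 16" test.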

Now let’s see how they were distributed in their classes.

sns.catplot(x="Pclass",data=titanic_df,hue="person",kind="count")

Now let’s get a more precise picture of the age distribution of the passengers on the Titanic:

titanic_df['Age'].hist(bins=70,color='indianred',alpha=0.9)

Another way to visualize the data is to use FacetGrid to draw multiple KDE plots on one figure:

fig = sns.FacetGrid(titanic_df, hue="person",aspect=4)
fig.map(sns.kdeplot,'Age',fill=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()

Let’s do the same for class by changing the hue argument:

fig = sns.FacetGrid(titanic_df, hue="Pclass",aspect=4)
fig.map(sns.kdeplot,'Age',fill=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()

We’ve gotten a pretty good picture of who the passengers were based on Sex, Age, and Class. So let’s move on to our 2nd question: What deck were the passengers on and how does that relate to their class?

If we look back at our dataset, specifically at the Cabin column, we first have to drop all the null values and create a new object called deck.

deck=titanic_df["Cabin"].dropna()

We only need the first letter of each cabin value to identify its deck level. To get it, we will create an empty list and loop over deck to grab the first letter of every entry.

# Set empty list
levels = []

# Loop to grab first letter
for level in deck:
    levels.append(level[0])    

# Build a new DataFrame from the letters and draw a count plot
cabin_df = DataFrame(levels)
cabin_df.columns = ['Cabin']
sns.catplot(x='Cabin',data=cabin_df,palette='winter_d',kind="count")

It is interesting to note that we have a ‘T’ deck value there, which doesn’t make sense. We can drop it with the following code:

cabin_df = cabin_df[cabin_df.Cabin != 'T']
# Replot
sns.catplot(x='Cabin',data=cabin_df,palette='winter_d',kind="count")
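
The plots above show the deck distribution, but not how decks relate to class, which is the second half of the question. One possible way to look at that is a cross-tabulation of deck against Pclass. The sketch below runs on a handful of toy rows that I invented for illustration; the column names Pclass and Cabin match the dataset, while Deck and deck_by_class are names I made up.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; the real values come from titanic_df.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 2, 1, 3],
    "Cabin": ["C85", "C123", np.nan, "E101", "B42", "G6"],
})

# Keep rows with a known cabin, then take the first character as the deck.
known = toy.dropna(subset=["Cabin"]).copy()
known["Deck"] = known["Cabin"].str[0]

# Rows are decks, columns are classes, cells are passenger counts.
deck_by_class = pd.crosstab(known["Deck"], known["Pclass"])
print(deck_by_class)
```

Using .str[0] replaces the explicit loop, and keeping Pclass next to the deck letter is what makes the comparison possible.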

Now that we’ve analyzed the distribution by decks, let’s go ahead and answer our third question:

3.) Where did the passengers come from?

Note here that the Embarked column has C, Q, and S values. Reading about the project on Kaggle, you’ll note that these stand for Cherbourg, Queenstown, and Southampton.

sns.catplot(x='Embarked',data=titanic_df,hue='Pclass',order=['C','Q','S'],kind="count")

An interesting find here is that almost all the passengers who boarded in Queenstown were 3rd class. It would be interesting to look at the economics of that town in that time period for further investigation.
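
To put a number on an observation like this, one option is to turn the counts into shares per port. The sketch below uses toy rows that I made up for illustration, so the printed shares say nothing about the real passengers; on the actual data the same call would take titanic_df['Embarked'] and titanic_df['Pclass'].

```python
import pandas as pd

# Toy rows invented for illustration (not taken from the Kaggle file).
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "C", "C", "Q", "Q", "Q"],
    "Pclass":   [1, 3, 3, 1, 2, 3, 3, 3],
})

# normalize="index" turns each row (port) into shares that sum to 1.
class_share = pd.crosstab(toy["Embarked"], toy["Pclass"], normalize="index")
print(class_share.round(2))
```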

Now let’s take a look at the 4th question:

4.) Who was alone and who was with family?

Let’s start by adding a new column that tells us whether a passenger was alone.

We’ll add the Parch (parents/children) column to the SibSp (siblings/spouses) column:

titanic_df['Alone'] =  titanic_df.Parch + titanic_df.SibSp

Now we know that if the Alone column is anything but 0, then the passenger had family aboard and wasn’t alone. So let’s change the column now so that if the value is greater than 0, we know the passenger was with his/her family, otherwise they were alone.
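
The relabeling step described here could look roughly like the sketch below. It uses only the SibSp and Parch values of the five passengers printed by the first titanic_df.head() call; the variable names sample and relatives are mine, and np.where is just one of several ways to do this.

```python
import numpy as np
import pandas as pd

# SibSp and Parch for the five passengers shown by the first titanic_df.head().
sample = pd.DataFrame({"SibSp": [1, 1, 0, 1, 0], "Parch": [0, 0, 0, 0, 0]})

# Total number of relatives aboard for each passenger.
relatives = sample["SibSp"] + sample["Parch"]

# Any relative aboard -> 'With Family', none -> 'Alone'.
sample["Alone"] = np.where(relatives > 0, "With Family", "Alone")
print(sample)
```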

titanic_df.head()
   PassengerId  Survived  Pclass  Name                                                 Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  person  Alone
0  1            0         3       Braund, Mr. Owen Harris                              male    22.0  1      0      A/5 21171         7.2500   NaN    S         male    With Family
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...    female  38.0  1      0      PC 17599          71.2833  C85    C         female  With Family
2  3            1         3       Heikkinen, Miss. Laina                               female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         female  Alone
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35.0  1      0      113803            53.1000  C123   S         female  With Family
4  5            0         3       Allen, Mr. William Henry                             male    35.0  0      0      373450            8.0500   NaN    S         male    Alone

The same split can also be derived directly from SibSp and Parch and stored in a separate column:

def alone(passenger):
    # Take the SibSp and Parch values
    sib,parch = passenger
    # No siblings/spouses and no parents/children means the passenger was alone
    if sib == 0 and parch == 0:
        return "alone"
    else:
        return "family"
titanic_df['came_with'] = titanic_df[["SibSp","Parch"]].apply(alone,axis=1)
sns.catplot(x='came_with',data=titanic_df,kind='count',order=["alone","family"])

Great work! Now that we’ve thoroughly analyzed the data, let’s go ahead and take a look at the most interesting (and open-ended) question: What factors helped someone survive the sinking?

# Let's start by creating a new column for legibility purposes through mapping
titanic_df["Survivor"] = titanic_df.Survived.map({0: "no", 1: "yes"})

# Let's just get a quick overall view of survived vs. died.
sns.catplot(x='Survivor',data=titanic_df,palette='Set1',kind="count")

So quite a few more people died than those who survived. Let’s see if the class of the passengers had an effect on their survival rate, since the movie Titanic popularized the notion that the 3rd class passengers did not do as well as their 1st and 2nd class counterparts.

# Let's use a categorical plot again, but now considering class
sns.catplot(x='Pclass',y='Survived',data=titanic_df,kind="point")

Looks like survival rates for the 3rd class are substantially lower! But maybe this effect is caused by the large number of men in the 3rd class in combination with the women-and-children-first policy. Let’s use ‘hue’ to get a clearer picture of this.

# Let's use a categorical plot again, but now considering class and gender
sns.catplot(x='Pclass',y='Survived',hue='person',data=titanic_df,kind="point")

From this data it looks like being male or being in 3rd class were both unfavourable for survival. Regardless of class, being an adult male dramatically decreases the chances of survival.
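
The point plots show these rates graphically; the underlying numbers can be read off with a group-by, since the mean of a 0/1 column is the share of survivors. The sketch below runs on toy rows that I invented for illustration, so its output is not a result about the Titanic; on the real data the same calls would be applied to titanic_df.

```python
import pandas as pd

# Toy rows invented for illustration (not taken from the Kaggle file).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3, 3],
    "person":   ["male", "female", "child", "male", "male", "female", "female", "child"],
    "Survived": [0, 1, 1, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of survivors in each group.
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

# Rows are classes, columns are person types, cells are survival rates.
rate_table = toy.pivot_table(index="Pclass", columns="person",
                             values="Survived", aggfunc="mean")

print(rate_by_class)
print(rate_table)
```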

But what about age? Did being younger or older have an effect on survival rate?

# Let's use a linear plot on age versus survival
sns.lmplot(x='Age',y='Survived',data=titanic_df)

Looks like there is a general trend that the older the passenger was, the less likely they survived. Let’s go ahead and use hue to take a look at the effect of class and age.

# Let's use a linear plot on age versus survival using hue for class separation
sns.lmplot(x='Age',y='Survived',hue='Pclass',data=titanic_df,palette='winter')

We can also use the x_bins argument to clean up this figure: the ages are grouped into bins, and each bin is drawn as a single point (the mean survival for that bin) with an error bar. When a list is passed, seaborn treats the values as bin centers.

# Let's use a linear plot on age versus survival, binned by age, using hue for class separation
generations=[10,20,40,60,80]
sns.lmplot(x='Age',y='Survived',hue='Pclass',data=titanic_df,palette='winter',x_bins=generations)
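
To make the binning less of a black box, here is a rough sketch of what I understand it to do with a list such as generations: each age is assigned to the nearest value in the list, and the survival values are averaged per group. The rows are toy data that I invented for illustration, and the sketch leaves out the error bars and the per-class split.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration (not taken from the Kaggle file).
toy = pd.DataFrame({
    "Age":      [4.0, 12.0, 22.0, 26.0, 35.0, 38.0, 54.0, np.nan, 71.0],
    "Survived": [1, 1, 0, 1, 1, 0, 0, 1, 0],
})

# Same values as the 'generations' list, treated as bin centers.
centers = np.array([10, 20, 40, 60, 80])

# Drop missing ages, then assign each age to its nearest bin center.
known = toy.dropna(subset=["Age"]).copy()
distance = np.abs(known["Age"].to_numpy()[:, None] - centers[None, :])
known["age_bin"] = centers[distance.argmin(axis=1)]

# Mean survival per bin: one plotted point per bin.
binned_rate = known.groupby("age_bin")["Survived"].mean()
print(binned_rate)
```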

Interesting find on the older 1st class passengers! What about if we relate gender and age with the survival set?

sns.lmplot(x='Age',y='Survived',hue='Sex',data=titanic_df,palette='winter',x_bins=generations)
