Travel Assured

Travel Assured#

Introduction#

Travel Assured is a travel insurance company. Due to the COVID pandemic they have had to cut their marketing budget by over 50%. It is more important than ever that they advertise in the right places and to the right people.

Travel Assured has data on their current customers as well as people who got quotes but never bought insurance. They want to know if there are differences in the travel habits between customers and non-customers - they believe they are more likely to travel often (buying tickets from frequent flyer miles) and travel abroad.

The Data#

age: Numeric, the customer’s age
Employment Type: Character, the sector of employment
GraduateOrNot: Character, whether the customer is a college graduate
AnnualIncome: Numeric, the customer’s yearly income
FamilyMembers: Numeric, the number of family members living with the customer
ChronicDiseases: Numeric, whether the customer has any chronic conditions
FrequentFlyer: Character, whether a customer books frequent tickets
EverTravelledAbroad: Character, has the customer ever travelled abroad
TravelInsurance: Numeric, whether the customer bought travel insurance

Import Libraries#

Let us set up all the objects (packages, and modules) that we will need to explore the dataset.

# Import necessary modules
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

Load the Data#

# Read file
df = pd.read_csv('data/travel_insurance.csv')

# Split the data in two subgroups, to compare data with/without Travel Insurance
insured_df = df[df['TravelInsurance'] == 1]
uninsured_df = df[df['TravelInsurance'] == 0]
df.head()

	Age	Employment Type	GraduateOrNot	AnnualIncome	FamilyMembers	ChronicDiseases	FrequentFlyer	EverTravelledAbroad	TravelInsurance
0	31	Government Sector	Yes	400000	6	1	No	No	0
1	31	Private Sector/Self Employed	Yes	1250000	7	0	No	No	0
2	34	Private Sector/Self Employed	Yes	500000	4	1	No	No	1
3	28	Private Sector/Self Employed	Yes	700000	3	1	No	No	0
4	28	Private Sector/Self Employed	Yes	700000	8	1	Yes	No	0

Data Size and Structure#

Dataset comprises of 1987 observations and 9 columns.
8 independent variables: Age, Employement Type, GraduateOrNot, AnnualIncome, FamilyMembers, ChronicDiseases, FrequentFlyer, EverTravelledAbroad
- 3 numerical features, 5 categorical features
1 dependent variable: TravelInsurance
Complete dataset with no missing/null values.

# Check the size and datatypes of the DataFrame
print(f"Shape: {df.shape}")
print('\n')
print(df.info())

Shape: (1987, 9)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1987 entries, 0 to 1986
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Age                  1987 non-null   int64 
 1   Employment Type      1987 non-null   object
 2   GraduateOrNot        1987 non-null   object
 3   AnnualIncome         1987 non-null   int64 
 4   FamilyMembers        1987 non-null   int64 
 5   ChronicDiseases      1987 non-null   int64 
 6   FrequentFlyer        1987 non-null   object
 7   EverTravelledAbroad  1987 non-null   object
 8   TravelInsurance      1987 non-null   int64 
dtypes: int64(5), object(4)
memory usage: 139.8+ KB
None

# For all categorical data columns
# Print categories and the number of times they appear
# ChornicDiseases and TravelInsurance at categorical because 1 and 0 represent yes and no
CAT_COLS = ['Employment Type', 'GraduateOrNot', "ChronicDiseases", "FrequentFlyer", "EverTravelledAbroad", "TravelInsurance"]

for col in df[CAT_COLS]:
    print(df[col].value_counts())
    print('\n')

Employment Type
Private Sector/Self Employed    1417
Government Sector                570
Name: count, dtype: int64


GraduateOrNot
Yes    1692
No      295
Name: count, dtype: int64


ChronicDiseases
0    1435
1     552
Name: count, dtype: int64


FrequentFlyer
No     1570
Yes     417
Name: count, dtype: int64


EverTravelledAbroad
No     1607
Yes     380
Name: count, dtype: int64


TravelInsurance
0    1277
1     710
Name: count, dtype: int64

df.describe()

	Age	AnnualIncome	FamilyMembers	ChronicDiseases	TravelInsurance
count	1987.000000	1.987000e+03	1987.000000	1987.000000	1987.000000
mean	29.650226	9.327630e+05	4.752894	0.277806	0.357323
std	2.913308	3.768557e+05	1.609650	0.448030	0.479332
min	25.000000	3.000000e+05	2.000000	0.000000	0.000000
25%	28.000000	6.000000e+05	4.000000	0.000000	0.000000
50%	29.000000	9.000000e+05	5.000000	0.000000	0.000000
75%	32.000000	1.250000e+06	6.000000	1.000000	1.000000
max	35.000000	1.800000e+06	9.000000	1.000000	1.000000

# Percentage of dataset is null
df.isnull().sum() / len(df) * 100

Age                    0.0
Employment Type        0.0
GraduateOrNot          0.0
AnnualIncome           0.0
FamilyMembers          0.0
ChronicDiseases        0.0
FrequentFlyer          0.0
EverTravelledAbroad    0.0
TravelInsurance        0.0
dtype: float64

insured_df.value_counts('EverTravelledAbroad')

EverTravelledAbroad
No     412
Yes    298
Name: count, dtype: int64

What are the travel habits of the insured and uninsured?#

Now let’s first look at the how many people have travel insurance and their travel habits.

LABELS = ['No', 'Yes'] 
EMPLOYMENT_LABELS = ['Private Sector/Self Employed', 'Government Sector'] 

def label_fontsize(fontsize):
    for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] + ax.get_xticklabels() + ax.get_yticklabels()):
        item.set_fontsize(fontsize)

insured_travelled = insured_df.value_counts('EverTravelledAbroad').values
uninsured_travelled = uninsured_df.value_counts('EverTravelledAbroad').values

x = np.arange(len(LABELS))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(15, 8))
rects1 = ax.bar(x + width/2, insured_travelled, width, label='Insured', color='forestgreen')
rects2 = ax.bar(x - width/2, uninsured_travelled, width, label='Uninsured', color='firebrick')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_xlabel('Ever travelled abroad')
ax.set_ylabel('Count')
ax.set_title('Count of people that have travelled abroad by whether they are insured')
ax.set_xticks(x)
ax.set_xticklabels(LABELS)
ax.legend(prop={'size': 20})

ax.bar_label(rects1, padding=3, fontsize=20)
ax.bar_label(rects2, padding=3, fontsize=20)

label_fontsize(20)

fig.tight_layout()
plt.show()

../../_images/ed1204037ffa64ca542a4922240218d3ebe6b8507af86cc39a4def208557fd8b.png

insured_freqfly = insured_df.value_counts('FrequentFlyer').values
uninsured_freqfly = uninsured_df.value_counts('FrequentFlyer').values

x = np.arange(len(LABELS))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(15, 8))
rects1 = ax.bar(x + width/2, insured_freqfly, width, label='Insured', color='forestgreen')
rects2 = ax.bar(x - width/2, uninsured_freqfly, width, label='Uninsured', color='firebrick')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_xlabel('Frequent Flyer')
ax.set_ylabel('Count')
ax.set_title('Count of people that are frequent flyers by whether they are insured')
ax.set_xticks(x)
ax.set_xticklabels(LABELS)
ax.legend(prop={'size': 20})

ax.bar_label(rects1, padding=3, fontsize=20)
ax.bar_label(rects2, padding=3, fontsize=20)

label_fontsize(20)

fig.tight_layout()
plt.show()

../../_images/23e4f101ce2ef9f0343d69188d4ccade4440bcd9e9581ac27988192c2b51cade.png

The countplot shows that a large proportion of people that have never travelled abroad are uninsured. However for those that have, a significant proportion have gotten travel insurance. This pattern is also seen for people that aren’t/are frequent flyers. Indicating that people that have travelled abroad or are frequent flyers are more likely to get travel insurance.

Other demographic information#

Now that we have an understanding of their travel habits, let’s explore other demographic information.

insured_chronic = insured_df.value_counts('ChronicDiseases').values
uninsured_chronic = uninsured_df.value_counts('ChronicDiseases').values

x = np.arange(len(LABELS))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(15, 8))
rects1 = ax.bar(x + width/2, insured_chronic, width, label='Insured', color='forestgreen')
rects2 = ax.bar(x - width/2, uninsured_chronic, width, label='Uninsured', color='firebrick')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_xlabel('Has a chronic disease')
ax.set_ylabel('Count')
ax.set_title('Count of people that have a chronic disease by whether they are insured')
ax.set_xticks(x)
ax.set_xticklabels(LABELS)
ax.legend(prop={'size': 20})
ax.bar_label(rects1, padding=3, fontsize=20)
ax.bar_label(rects2, padding=3, fontsize=20)
label_fontsize(20)

fig.tight_layout()
plt.show()

../../_images/e8b88d767cc6d663fcfd59bcd569adfc038c2a79127430c5bbe9b0724f58f1ea.png

This countplot shows that between people who do and don’t have a chronic disease, most do not have travel insurance.

fig, ax = plt.subplots(figsize=(15, 8))
ax = sns.countplot(x="FamilyMembers", data=df, hue="TravelInsurance", palette=["red", "green"])
ax.set_xlabel('Number of family members')
ax.set_title('Frequency of family size by whether they are insured')
ax.legend(title="Insured", labels=LABELS, fontsize=20, title_fontsize=20)
label_fontsize(20)
fig.tight_layout()
plt.show()

../../_images/e2409d711856f72f20cf1dd3b458e81f978ad19013394a0a36bb970ba6db91fa.png

In this countplot fort each family size there are more people uninsured than insured. However the gap between the two counts seem to decrease as the family size increases. Let’s look at this closer using proportions.

insured_fam_prop = insured_df.value_counts('FamilyMembers') / df.value_counts('FamilyMembers')

fig, ax = plt.subplots(figsize=(15, 8))
ax.bar(insured_fam_prop.index, insured_fam_prop, color='royalblue')

ax.set_xlabel('Number of family members')
ax.set_ylabel('Proportion insured')
ax.set_title('Proportion of the insured by family size')
label_fontsize(20)
fig.tight_layout()
plt.show()

../../_images/a9e8e9a1772b58781bcbd02a3336d9d59604ca5482390fc10637988d381d5efd.png

This bar chart shows the proportion of people insured for each family size. The proportion does increase as the family size increases, however the margin does not increase significantly. Further exploration may be needed to identify whether this increase is statistically significant.

fig, ax = plt.subplots(figsize=[16,7])

# plot histogram comparing the density distribution of the Annual Income in each groups
_ = plt.hist(insured_df['AnnualIncome'], 
             histtype='step', 
             hatch='/', 
             density=True, 
             color='green', 
             bins=10, 
             linestyle='--', 
             linewidth=2,
            )
_ = plt.hist(uninsured_df['AnnualIncome'],
             density=True,
             color='red', 
             bins=10,
            )

# put a title, label the axes, define the legend and display
_ = plt.suptitle("PDF of Annual Income by whether they are insured", 
                 fontsize=20, 
                 color='black',
                 x=0.52,
                 y=0.92,
                )
_ = plt.xlabel('Annual Income')
_ = plt.ylabel('Probability Density Function (PDF)')
_ = plt.legend(['Insured', 'Uninsured'])
_ = plt.ticklabel_format(axis="x", style="plain")
_ = plt.ticklabel_format(axis="y", style="plain")
plt.show()

../../_images/1af62be0145c8593cbdb1d4f6faa13769f09e43c37d866e4a901078b59048071.png

The PDF of the insured group is further to the right than the PDF of the uninsured group, meaning that the people with higher incomes are more likely to get travel insurance. This is expected because people with higher incomes are more able to afford insurance.

But histograms are affected by a binning bias (i.e. the visualization depends on the number of bins chosen). In this case, the histogram has only 10 bins to make it easier to understand.

insured_grad = insured_df.value_counts('GraduateOrNot').values
uninsured_grad = uninsured_df.value_counts('GraduateOrNot').values

x = np.arange(len(LABELS))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(15, 8))
rects1 = ax.bar(x + width/2, insured_grad, width, label='Insured', color='forestgreen')
rects2 = ax.bar(x - width/2, uninsured_grad, width, label='Uninsured', color='firebrick')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_xlabel('College Graduate')
ax.set_ylabel('Count')
ax.set_title('Count of people that are college graduates by whether they are insured')
ax.set_xticks(x)
ax.set_xticklabels(LABELS)
ax.legend(prop={'size': 20})

ax.bar_label(rects1, padding=3, fontsize=20)
ax.bar_label(rects2, padding=3, fontsize=20)

label_fontsize(20)

fig.tight_layout()
plt.show()

../../_images/5bd396ec89edd622067087c6dd320b0c212e77abd160590216e7da93bca23f84.png

This countplot shows that between people who are and aren’t have a college graduate, most do not have travel insurance.

insured_employ_type = insured_df.value_counts('Employment Type').values
uninsured_employ_type = uninsured_df.value_counts('GraduateOrNot').values

x = np.arange(len(EMPLOYMENT_LABELS))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(15, 8))
rects1 = ax.bar(x + width/2, insured_employ_type, width, label='Insured', color='forestgreen')
rects2 = ax.bar(x - width/2, uninsured_employ_type, width, label='Uninsured', color='firebrick')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_xlabel('Employment Type')
ax.set_ylabel('Count')
ax.set_title('Count of employment type by whether they are insured')
ax.set_xticks(x)
ax.set_xticklabels(EMPLOYMENT_LABELS)
ax.legend(prop={'size': 20})

ax.bar_label(rects1, padding=3, fontsize=20)
ax.bar_label(rects2, padding=3, fontsize=20)

label_fontsize(20)

fig.tight_layout()
plt.show()

../../_images/5dc63b4d313d91a7f8e12ba26dec7a449aa4e6786d736a77ec8a2ed0e990bd90.png

fig, ax = plt.subplots(figsize=[16,7])

# plot histogram comparing the density distribution of the Annual Income in each groups
_ = plt.hist(insured_df['Age'], 
             histtype='step', 
             hatch='/', 
             density=True, 
             color='green', 
             bins=10, 
             linestyle='--', 
             linewidth=2,
            )
_ = plt.hist(uninsured_df['Age'],
             density=True,
             color='red', 
             bins=10,
            )

# put a title, label the axes, define the legend and display
_ = plt.suptitle("PDF of Age by whether they are insured", 
                 fontsize=20, 
                 color='black',
                 x=0.52,
                 y=0.92,
                )
_ = plt.xlabel('Age')
_ = plt.ylabel('Probability Density Function (PDF)')
_ = plt.legend(['Insured', 'Uninsured'])
_ = plt.ticklabel_format(axis="x", style="plain")
_ = plt.ticklabel_format(axis="y", style="plain")
plt.show()

../../_images/d700cb5d9e2f571406fc25b63d1075423b0fd203a07a2a598d32f091eb1bb35f.png

The PDF of the insured group is similar to the PDF of the uninsured group, it is unclear whether there are ages where will more likely to get travel insurance. The ages span from 25 to 35.