
martinding · ‎05-18-2023

Survival Analysis, Part 1: Introduction

Survival Analysis, Part 2: Key Models

Survival Analysis, Part 3: Using Alteryx

Survival Analysis, Part 4: Using Python in Alteryx (You are here)

If you want to conduct survival analysis using Python, you will require either a Python Integrated Development Environment (IDE) or a Jupyter Notebook environment. Fortunately, Alteryx provides the Python Tool option to code in Python. If you are not familiar with the Python Tool, you are welcome to refer to my prior blog, where I demonstrated how to perform a machine learning analysis using the Python Tool.

Key Point: You may notice that the Alteryx Survival Analysis tools are built in the R environment, while this example is shown in Python. Survival analysis methods are longstanding and time-tested, so R and Python survival analyses should provide similar results despite the different environments.

Installing the Library

The Python library that we will use to perform survival analysis is called lifelines.

We can install this package from within Alteryx, as shown below, and import the other libraries we need (install those first if you haven’t already). You can find the lifelines documentation at https://lifelines.readthedocs.io.

# List all non-standard packages to be imported by your 
# script here (only missing packages will be installed)
from ayx import Package
Package.installPackages(['lifelines'])

# Importing the libraries
from ayx import Alteryx
import pandas as pd
import lifelines as life
import matplotlib.pyplot as plt

Demo

Step 1: Connect to our data source

Remember to run the workflow once after connecting the Python Tool to your input data stream; this lets the tool pick up all the necessary metadata.

Then, you can read in the input data like this:

# Reading in the data
df = Alteryx.read("#1")
df.head()

Step 2: Kaplan-Meier Model

We first create an instance of KaplanMeierFitter() and then call its fit() method to calculate the survival curve values. We can then visualize the survival curve using the Matplotlib library. Refer to the lifelines documentation for an explanation of the command arguments: https://lifelines.readthedocs.io/en/latest/fitters/univariate/KaplanMeierFitter.html#

# Building the KM Model
km = life.KaplanMeierFitter()
duration = df["Duration"].astype(float)
# lifelines reads event_observed as 1 (True) = churn observed, 0 (False) = censored
churn = df["RightCensored"]

km.fit(duration, event_observed=churn, label="Est. for the Average Customer")

# Visualizing the Survival Curve
km.plot()
plt.title("KM Survival Curve")
plt.xlabel("Customer Tenure")
plt.ylabel("Survival Rate")
plt.show();
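
To make it clearer what fit() is estimating, here is a minimal pure-Python sketch of the Kaplan-Meier product-limit calculation. It is not how lifelines implements it internally; the function name kaplan_meier and the toy data are made up for illustration, and the event flag is assumed to be 1 when churn was observed and 0 when the record was censored.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, S(time))] at each time where at least one event occurred.

    events[i] is 1 if the event (churn) was observed for record i and 0 if
    the record was censored.
    """
    events_at = Counter(t for t, e in zip(durations, events) if e)
    survival = 1.0
    curve = []
    for t in sorted(events_at):
        # Everyone whose duration is at least t was still at risk just before t
        at_risk = sum(1 for d in durations if d >= t)
        # Multiply by the share of at-risk customers who survived past t
        survival *= 1 - events_at[t] / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical toy data: tenure and whether churn was observed (1) or censored (0)
toy_durations = [2, 3, 3, 5, 8]
toy_events = [1, 1, 0, 1, 0]

km_curve = kaplan_meier(toy_durations, toy_events)
for t, s in km_curve:
    print(f"t={t}: S(t)={s:.2f}")
```

For the toy data, the estimated survival drops to 0.80, 0.60, and 0.30 at tenures 2, 3, and 5. The censored records never trigger a drop themselves, but they still count in the at-risk denominator until their own duration has passed.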

Alternatively, you could also construct a KM model for each group of customers:

# KM curve for different gender groups
ax = plt.subplot(111)

km = life.KaplanMeierFitter()

for name, grouped_df in df.groupby('Gender'):
    km.fit(grouped_df["Duration"], grouped_df["RightCensored"], label=name)
    km.plot_survival_function(ax=ax)
    
plt.title("KM Survival Curve Based on Gender")
plt.xlabel("Customer Tenure")
plt.ylabel("Survival Rate")
plt.show();

Step 3: The Cox Proportional Hazards Model

We first need to convert our covariates from “String” (or object in pandas) to numerical type, since the model can’t work with string data. We can use the pandas function get_dummies to perform one-hot encoding. We can set the drop_first argument to True (in this case dropping Gender_Female), since that column is redundant: we can infer that a customer is female if the values of Gender_Male and Gender_NonBinary are both 0 for that customer.

# The Cox Proportional Hazards Model

# Encoding the categorical variables
# Using one-hot encoding
df = pd.get_dummies(df, drop_first = True)
df.head()
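
To see why the dropped column carries no extra information, here is a small pure-Python sketch of one-hot encoding with the first category dropped. The helper one_hot_drop_first and the sample values are hypothetical, and the sketch assumes the categories are ordered alphabetically, so Female becomes the baseline as in the example above.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode a list of category labels, dropping the first category."""
    categories = sorted(set(values))
    kept = categories[1:]  # the first (alphabetical) category becomes the baseline
    return [{f"{prefix}_{c}": int(v == c) for c in kept} for v in values]


# Hypothetical sample of the Gender column
genders = ["Female", "Male", "NonBinary", "Female"]

encoded = one_hot_drop_first(genders, prefix="Gender")
for original, row in zip(genders, encoded):
    print(original, row)
```

Each Female row ends up with 0 in both Gender_Male and Gender_NonBinary, so the original category can still be recovered without a Gender_Female column.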

Then we create an instance of CoxPHFitter(). Note that the CPH model requires float-type input data, so let’s first convert our numerical columns from integer to float type.

# convert to floats
for col in df.columns:
    df[col] = df[col].astype('float')

# Creating the CPH model
cph = life.CoxPHFitter() 
cph.fit(df, duration_col='Duration', event_col='RightCensored', show_progress=False) 
cph.print_summary()

Based on the survival regression output, we can see that the Python results are consistent with the Survival Analysis Tool results (which are based in R). Gender is the only statistically significant covariate, and being male (Gender_Male = 1) is expected to increase the risk of churning, since its coefficient is positive.

Let’s plot the variable coefficients! With only four variables, we can read the values straight from the table, but as the number of variables grows, a coefficient plot makes it much easier to spot the bigger contributors. Notice that the x-axis, log(HR), is the log of the hazard ratio, which measures the risk of churning relative to the baseline: values near 0 indicate little or no difference in risk, values below 0 indicate a lower risk, and values above 0 indicate a higher risk.

# plotting coefficients 
cph.plot()
plt.title("Variable Coefficients and Confidence Intervals")
plt.show();
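
If the log(HR) scale feels abstract, exponentiating a coefficient turns it back into a hazard ratio. The sketch below uses made-up coefficients (they are not the values from the fitted model above), and the helper hazard_ratio and the covariate names covariate_b and covariate_c are hypothetical.

```python
import math

# Hypothetical log(HR) coefficients, NOT the values from the fitted model
example_coefs = {"Gender_Male": 0.40, "covariate_b": 0.00, "covariate_c": -0.25}


def hazard_ratio(coef):
    """Convert a Cox model coefficient, log(HR), into a hazard ratio."""
    return math.exp(coef)


for name, coef in example_coefs.items():
    hr = hazard_ratio(coef)
    # Percentage change in the hazard relative to the baseline
    change = (hr - 1) * 100
    print(f"{name}: coef={coef:+.2f}, HR={hr:.2f}, risk change={change:+.0f}%")
```

A coefficient of 0 gives a hazard ratio of exactly 1 (no change relative to the baseline), positive coefficients give ratios above 1, and negative coefficients give ratios below 1.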

Now that you’ve seen examples in Alteryx and Python and understand some of the use cases, you’re ready to unlock the power of Survival Analysis in your own data!

AndreeaR · ‎09-16-2023

Hi Martin,

Great analysis! Do you know how to export all these results from the Python tool in Alteryx into an excel or pdf file? Thanks