How to predict churn in Sparkify using Spark?

Gustavo Leandro
8 min read · Oct 16, 2021

Project Definition

Project Overview

Churn Rate is one of the top metrics for any primarily subscription-based business. Keeping track of your churn is one of the easiest ways to make sure your customers are happy with what they're getting and to increase your revenue.

The idea of this post is to present an analysis and a model, built from a user activity log file, that can reliably predict whether a user will churn.

I used Spark because it is a very efficient tool for large volumes of data. We will test some models from the Spark ML library and try to get a good result predicting whether a user will churn.

Problem Statement

From a JSON file containing the usage log of Sparkify users, we need to predict which users have a high chance of churning.

My strategy for addressing this problem is to do an exploratory analysis of the data and try to identify indicators that distinguish users who churn from active users, such as engagement, platform usage time, and number of friends, among others.

Once this is done, since I can identify the users who churned through the logs, the idea is to use all this data as features in three supervised learning models and try to get a good result predicting users who churn.

Metrics

I chose three models to test: Logistic Regression, Random Forest, and Gradient-Boosted Trees (GBT).

I used the F1 score for evaluation because the number of users who churn is significantly lower than the number who stay, and this imbalance can bias the model's results.

Analysis

Data Exploration and Visualization

Loading the file

The file used is a small subset of the user activity logs from the Sparkify platform and has the following structure:

root
|-- artist: string (nullable = true)
|-- auth: string (nullable = true)
|-- firstName: string (nullable = true)
|-- gender: string (nullable = true)
|-- itemInSession: long (nullable = true)
|-- lastName: string (nullable = true)
|-- length: double (nullable = true)
|-- level: string (nullable = true)
|-- location: string (nullable = true)
|-- method: string (nullable = true)
|-- page: string (nullable = true)
|-- registration: long (nullable = true)
|-- sessionId: long (nullable = true)
|-- song: string (nullable = true)
|-- status: long (nullable = true)
|-- ts: long (nullable = true)
|-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)
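For reference, here is a minimal sketch of how such a file can be loaded and its schema printed with PySpark. The file name is an assumption; point it at your copy of the log subset.

```python
from pyspark.sql import SparkSession

# Start a local Spark session (cluster configuration will differ)
spark = SparkSession.builder.appName("SparkifyChurn").getOrCreate()

# File name is an assumption; adjust it to your copy of the log subset
df = spark.read.json("mini_sparkify_event_data.json")
df.printSchema()
```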

Cleaning the dataset

The first treatment of the dataset is to remove data that may be wrong and/or that does not add value to our analysis. So I checked for null userIds or sessionIds in the dataset.

There were no null records, but there were several records with an empty userId.

So I looked into these cases and found that they were unauthenticated users accessing pages that were not relevant to our analysis.
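A minimal sketch of this cleaning step, assuming the DataFrame `df` loaded above:

```python
from pyspark.sql.functions import col

# Drop any rows with null userId/sessionId (none were found here)
# and remove rows with an empty userId (unauthenticated visitors)
df_clean = (
    df.dropna(subset=["userId", "sessionId"])
      .filter(col("userId") != "")
)
```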

Exploratory Data Analysis

I started the exploratory data analysis by separating the logs into two datasets, users who churned and users who remained active, in order to compare them. In the first comparison, we noticed that the number of songs listened to per user is much higher among users who remain active, indicating greater engagement than among users who churn.

Here are the charts of the average number of songs listened to per user per day:

[Chart: average songs per day, users who churned]
[Chart: average songs per day, active users]

We can see the difference in platform engagement between the two datasets. The average number of songs per user per day is ~75 for churned users and 89 for active users.
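The aggregation behind these numbers might look like the following sketch, assuming `df_clean` from the cleaning step (`ts` is a millisecond epoch timestamp):

```python
from pyspark.sql.functions import avg, col, count, from_unixtime, to_date

# Count 'NextSong' events per user per calendar day
songs_per_day = (
    df_clean.filter(col("page") == "NextSong")
            .withColumn("date", to_date(from_unixtime(col("ts") / 1000)))
            .groupBy("userId", "date")
            .agg(count("*").alias("songs"))
)

# Run this separately on the churned and active subsets to compare cohorts
songs_per_day.agg(avg("songs").alias("avgSongsPerUserPerDay")).show()
```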

Another relevant analysis was the number of Thumbs Up and Thumbs Down given by users to the songs.

When we look at the number of Thumbs Up and Thumbs Down divided by the number of users, it doesn't give us relevant information; it just shows the difference in engagement that we've seen before.

But when we normalize by the number of songs listened to, we see that the number of Thumbs Down given by users who churn is higher.
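A sketch of that per-song normalization, again assuming `df_clean`:

```python
from pyspark.sql.functions import col, count, when

# Thumbs Down events divided by songs listened to, per user
thumbs = (
    df_clean.groupBy("userId")
            .agg(
                count(when(col("page") == "NextSong", True)).alias("songs"),
                count(when(col("page") == "Thumbs Down", True)).alias("thumbsDown"),
            )
            .withColumn("avgThumbsDown", col("thumbsDown") / col("songs"))
)
```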

When we look at the level of users who churned, we see that 60% of them were 'paid', and of the 21 that were 'free', 5 had already downgraded.

Another relevant analysis is time since registration. When we look at the time in days between account registration and cancellation, we see a high concentration in the first 100 days.

Finally, I compared the number of friends that churned users had against the number of friends that active users have, and found that active users have twice as many friends.

Methodology

Data Preprocessing

Feature Engineering

The results of the above exploration were the basis for choosing the features used in the modeling. So I cleaned and transformed the dataset into a per-user table with the following attributes, ready to be used in the models (a sketch of this step follows the list below).

These are the attributes:

songs: number of songs listened to by the user
avgThumbsDown: average number of Thumbs Down per song listened to
level: most recent user level (1 = paid, 0 = free)
downgrade: flag indicating whether the user has downgraded
dateDiffReg: difference in days between the user's last interaction and their registration date
friends: number of friends on the platform
churn: label identifying whether the user churned or not
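A sketch of how such a per-user feature table can be assembled. Two simplifications are mine, not the post's: churn is assumed to be flagged from 'Cancellation Confirmation' events, and 'level' is reduced to whether the user was ever on the paid tier rather than the exact most recent level.

```python
from pyspark.sql.functions import (
    col, count, datediff, from_unixtime, to_date, when,
    max as smax, min as smin,
)

features = (
    df_clean.groupBy("userId")
            .agg(
                count(when(col("page") == "NextSong", True)).alias("songs"),
                (count(when(col("page") == "Thumbs Down", True))
                 / count(when(col("page") == "NextSong", True))).alias("avgThumbsDown"),
                # Simplification: 1 if the user was ever on the 'paid' tier
                smax(when(col("level") == "paid", 1).otherwise(0)).alias("level"),
                smax(when(col("page") == "Submit Downgrade", 1).otherwise(0)).alias("downgrade"),
                # Days between last interaction and registration
                datediff(
                    to_date(from_unixtime(smax("ts") / 1000)),
                    to_date(from_unixtime(smin("registration") / 1000)),
                ).alias("dateDiffReg"),
                count(when(col("page") == "Add Friend", True)).alias("friends"),
                # Assumption: churn flagged via 'Cancellation Confirmation'
                smax(when(col("page") == "Cancellation Confirmation", 1)
                     .otherwise(0)).alias("churn"),
            )
)
```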

Implementation

Modeling

I chose three models to test: Logistic Regression, Random Forest, and Gradient-Boosted Trees.

In the first round, I used these three models without tuning, and I decided to use the F1 score for evaluation because the number of users who churn is significantly lower than the number who stay, and this imbalance can bias the model's results.

The result was as follows:

The best result came from Random Forest, followed by Logistic Regression, and finally GBT, with a result well below the other two.
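A sketch of this first untuned round, assuming the `features` table above; the 80/20 split and the seed are my choices, not stated in the post:

```python
from pyspark.ml.classification import (
    GBTClassifier, LogisticRegression, RandomForestClassifier,
)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler

# Pack the six feature columns into a single vector column
assembler = VectorAssembler(
    inputCols=["songs", "avgThumbsDown", "level",
               "downgrade", "dateDiffReg", "friends"],
    outputCol="features",
)
data = assembler.transform(features).withColumnRenamed("churn", "label")
train, test = data.randomSplit([0.8, 0.2], seed=42)

evaluator = MulticlassClassificationEvaluator(metricName="f1")

# Fit each untuned model and compare F1 on the held-out split
for clf in [LogisticRegression(), RandomForestClassifier(), GBTClassifier()]:
    model = clf.fit(train)
    print(type(clf).__name__, evaluator.evaluate(model.transform(test)))
```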

Refinement

Given this, I decided to use ParamGridBuilder with CrossValidator to test other parameters and tune the Logistic Regression and Random Forest models, to see if we could get a better result.

The parameters used in the Logistic Regression cross validation were: elasticNetParam [0.0, 0.1, 0.5, 1.0] and regParam [0.0, 0.05, 0.1].

The parameters used in the Random Forest cross validation were: maxDepth [2, 5, 10, 25], maxBins [10, 20, 50, 100], and numTrees [5, 20, 50, 100].
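A sketch of the Random Forest grid search, continuing from the `train` split above; the fold count is an assumption since the post doesn't state it:

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier()

# Grid mirrors the values listed above
grid = (
    ParamGridBuilder()
    .addGrid(rf.maxDepth, [2, 5, 10, 25])
    .addGrid(rf.maxBins, [10, 20, 50, 100])
    .addGrid(rf.numTrees, [5, 20, 50, 100])
    .build()
)

cv = CrossValidator(
    estimator=rf,
    estimatorParamMaps=grid,
    evaluator=MulticlassClassificationEvaluator(metricName="f1"),
    numFolds=3,  # assumption: the post does not state the fold count
)
cv_model = cv.fit(train)
```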

Result

Model Evaluation and Validation

The result of the Logistic Regression cross validation was the same as what we had already achieved, with an F1 score of 0.751.

The parameters of the best Logistic Regression model were elasticNetParam = 0.0 and regParam = 0.0, i.e., no regularization.

The result of the Random Forest cross validation was an F1 score of 0.823.

The parameters of the best Random Forest model were maxDepth = 5, maxBins = 50, and numTrees = 20.
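If you want to read the winning hyperparameters back from the fitted CrossValidator, something like this works (a sketch continuing from `cv_model` above):

```python
best_rf = cv_model.bestModel

# getOrDefault reads a param's effective value from the fitted model
for name in ["maxDepth", "maxBins", "numTrees"]:
    print(name, best_rf.getOrDefault(name))
```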

Justification

The Random Forest model obtained the best result in comparison with the other models, with an F1 score of 0.823.

Here are some reasons why Random Forest may be better than Logistic Regression according to this post:

It emphasizes feature selection — weighs certain features as more important than others.

It does not assume that the model has a linear relationship — like regression models do.

It utilizes ensemble learning. If we were to use just one decision tree, we wouldn't be using ensemble learning. A random forest takes random samples, forms many decision trees, and then averages out the leaf nodes to get a clearer model.

Conclusion

Reflection

First, Spark is a very easy tool to use, and I had no difficulties in the exploratory analysis, data preparation, or modeling. It gives you the option to do the processing imperatively or declaratively using SQL, which is very convenient for anyone already working with data. As I already use SQL on a daily basis, I opted for the imperative way here to learn something new.

The solution, using six features (songs, avgThumbsDown, level, downgrade, dateDiffReg, friends) plus the churn label, was relatively simple, and we got a good final prediction result.

My biggest difficulty was the processing time to test the hyperparameters on the models. The Logistic Regression was relatively fast, but the Random Forest took more than four hours to process.

Other than that, I didn’t have many difficulties.

Improvement

There is still more that can be done to ensure the model works correctly.

For example, we used a very small dataset here; we could use more data to test the model's efficiency.

Another issue is that the number of users who churn is much smaller than the number of users who remain active, which can bias the result. As an improvement, we could balance the data.

Try it yourself

If you want more details, you can access the code I used in this analysis here!
