Building a Customer Segmentation using dbt and SQL-inf

11-01-2024| Ryan Garland|10 minutes read

This post follows on from the post Building Customer Segmentations using ML where we described in detail how to create a customer segmentation using a 6-step process. As part of that, we briefly described how it could be done in SQL-inf but didn’t go into much detail.

In this post we will expand on this, by giving a detailed example of a customer segmentation build using SQL-inf and dbt. We will use dbt to illustrate how you should materialise your customer segmentation and to compile scripts for further analysis.

However, none of the techniques or code that we will describe is specific to dbt, so could be implemented in any other platform that supports SQL-inf.

To learn more about how to get started with using Infer and dbt together, and how to set it up in 3 easy steps, have a look the Getting Started with dbt-infer guide.

Example data and source code

The data used for the example is the data from the Olist E-commerce shop, which you can find on Kaggle[1][2]. We have used it previously in a few other tutorials here and here.

The Olist data models multiple entities across the shop and their relationships. The key entities are Customers, Orders, Order Items, Products, and Sellers.

In terms of size, the Olist data sets hold the records for 94,206 customers and 97,255 orders. So there is a long tail of one-time customers and a smaller set of 2,742 repeat customers.

You can find the source code for all of the dbt models that we will use and the analysis scripts in the dbt-infer-examples repo.

Constructing the Segmentation

As mentioned, we follow the process described in Building Customer Segmentations using ML to build the segmentation. Here we will go through each step in detail.

Step 1: Defining the type of segmentation

We decide to perform a mixed demographic-behavioural segmentation on the customer profiles in the Olist data set. For this we will use what demographic and in-product behavioural data we have for each customer. The aim is to create personas based on a mix of high-level demographics (of which we have few) and purchase behaviours.

Step 2: Defining the input data

Since the data set has been anonymised the range of demographic data is limited but from the available set we will pick a few and mix with high level purchase behaviours.

The exact input columns will be:

customer_city
the high level demographics. We could also use customer_state but it would be correlated to city anyway.
no_items/no_orders
the average number of items per order.
avg_item_price
the average price of items purchased.
no_detailed_reviews/no_orders
the propensity to leave a detailed review per order.
avg_review_score
the average review score.

From this data set we can build personas around location and high-level purchasing behaviour. Note that we are not including totals, like total number of orders or total amount spend. Instead, we are looking at per-order behaviour.

Step 3: Specifying the number of segments

We don’t have a particularly strict minimal number of segments or segment size in mind here, so we will set a minimal number of costumers in each segments to be 50 - which is about 2% of the total data, meaning no segment can be smaller than that.

Step 4: Building the segmentation

To perform the segmentation we define a dbt model called customer_segmentation. The code for which you can find here in the examples repo.

We only want to run our segmentation every so often, so we set it to be disabled as default in the model config. Then we can switch it on when needed - to start with we switch it on once, run it to populate the table and then disable it.

To build the customer segmentation we decide to only consider recurrent customers - so people with more than one order. This filters out the long tail of one time customers that may simply be noise to your personas.

Taking all of our above choices into consideration the model can be defined in the following way:

 with customers_segmentation_input as (
        SELECT customer_id,
               customer_city,
               no_items/no_orders as avg_no_items,
               avg_item_price,
               no_detailed_reviews/no_orders as likehood_of_detailed_reviews,
               avg_review_score
        FROM {{ ref('customers') }} WHERE no_orders > 1
    )

    with customers_segmentation_output as (
        SELECT *
        FROM customers_segmentation_input
        CLUSTER(ignore=customer_id, min_cluster_size=50)
    )

    SELECT customer_id,
           cluster_id as customer_segment_id
    FROM customers_segmentation_output

Note that we only output the customer_id and the segment id, customer_segment_id, for each customer. Meaning that this model is purely about linking each customer to a segment in the segmentation and doesn’t replicate any other information.

The CLUSTER command outputs a number of columns, which you can read more about in the syntax reference. Of these outputs we only need the cluster_id column, which we rename to customer_segment_id. The cluster_id column represents the id of the segment that the particular customer belongs to. The numbering of segments starts at 0 and goes up with segment id -1 indicating that the customer hasn't been found to belong to any segment.

After running the segmentation model we do a bit of high level statistics on the segmentation to get a feel for it - we look at how many segments it has, how many points were not assigned to a segment and the size of each segment. You can find the script for this analysis here as well as below.

SELECT customer_segment_id,
           COUNT(customer_id) as no_customers
FROM {{ ref('customer_segmentation') }}
GROUP BY customer_segment_id

The output of the analysis looks like this:

segment id	no customers
0	839
1	351
2	309
3	277
4	267
5	167
6	114
7	80

From that we can see that we have about 12% of the customers that have not been allocated to a segment. We should not expect all customers to neatly fit into a segment. If we want all customers to fit into a segment, we can fit them using a `PREDICT`. However, that is a slightly more complicated process, so we won't cover it here.

Step 5: Describing the segments

The next step in the segmentation analysis is to analyse each segment, using the input data used to create them, to better understand what defines them and to give each a description.

To do this we group our data by the segments and calculate the over and under-indexes for each of the inputs. The script for this is a bit more complicated, so we won't repeat it here, but you can find it in the repo here.

The output of the analysis looks like this:

segment id	no items	item price	detailed reivews	review score	sao paulo	rio	belo horizonte	brasilia
0	-6.69%	2.67%	150.40%	11.46%	-73.60%	-79.21%	-54.00%	-69.70%
1	-25.86%	74.24%	-98.69%	12.10%	-61.32%	-73.40%	-49.56%	-71.20%
2	42.53%	-27.36%	-97.34%	16.58%	-42.06%	-65.62%	-52.46%	-51.14%
3	-24.94%	-41.85%	-98.34%	20.32%	-24.39%	-34.92%	-14.67%	-100.00%
4	12.85%	0.64%	174.27%	-26.70%	-66.73%	-19.40%	56.04%	224.54%
5	0.77%	5.80%	86.53%	-7.99%	192.54%	-100.00%	238.57%	-100.00%
6	0.93%	4.26%	12.48%	-21.02%	-100.00%	572.54%	-100.00%	-100.00%
7	1.41%	-3.22%	-88.45%	4.32%	228.47%	-100.00%	-100.00%	-100.00%

From this we can build a picture about each segment. Typically what we do is go through the above for each segment, writing up a small summary as we go along.

This is will give us a first picture of what the segments are driven by. We will then do the same analysis for using more data in the next step, repeat the exercise of writing up a small summary and then be in a position to better describe our segments completely.

For example for this part of the analysis our notes might look like this

Step 6: Analysing the segments

To further analyse the segments we overlay them with a few extra pieces of data from the customer profiles. The extra data that we use is

ltv
the customer ltv, i.e. total amount of spend on the platform.
avg_installments
the average number of payment installments per order.
avg_weight
the average order weight.
avg_photos
the average number of product photos. You can find the script for this analysis
here
.

segment id	ltv	avg installments	avg weight	avg photos
0	4.05%	8.22%	7.75%	2.07%
1	-0.03%	23.06%	54.92%	8.42%
2	37.43%	-0.89%	-22.05%	-5.54%
3	-56.69%	-20.16%	-40.86%	3.43%
4	27.37%	1.62%	6.91%	-3.17%
5	-2.38%	-4.13%	12.29%	-1.21%
6	-5.13%	-1.92%	15.84%	-4.73%
7	-13.79%	-3.33%	-19.42%	2.62%

We can then again analyse that, like we did with the input data, and create a table of notes for each segment.

Customer Personas

Having analysed each segment using both input and overlay data we can now build personas for each segment.

Leaves detailed reviews, lives outside the main markets but otherwise average customer.
Buys larger, heavier, more expensive items in a single purchase manner. Lives outside the main markets and tends to pay in multiple installments.
Has a high LTV through buying higher volumes of cheaper, lighter items. Tends to buy multiple items each time in larger orders. Also lives outside the 4 main markets.
Buys cheaper, lighter items in smaller orders at time. Because of this has a low LTV and tends not use installments to pay. Also lives outside the 4 main markets and leaves high review scores.
Lives in one of the major markets: either Sao Paulo or Brasilia. Buys multiple items at a time and has a high LTV. Tends to leave detailed reviews.
Lives in one of the major markets: either Sao Paulo or Belo Horizonte. Has an average LTV but buys heavier items. Tends to leave detailed reviews.
Lives in one of the major markets: Rio de Janiero. Has an average LTV but buys heavier items. Tends to leave detailed reviews but low review scores.
Lives in one of the major markets: Sao Paulo. Tends to have a low LTV and buy lighter products. Also does not leave reviews.

The next step would be to dig into the particular product categories that each persona tends to buy - as well the acquisition channels - to get a better picture of the personas and how to target them. However, like with the demographic data, we not have enough data in the publicly available Olist data set for this analysis.

Finishing Up

That’s it! We have shown how to create a customer segmentation for our Olist example data. Because of the restricted data that we have the segmentation isn’t great but nevertheless we hope that the example gives you a good idea of how you can build a better customer segmentation for your business 🤗

Here are a few links for further reading:

Post on building a Customer Segmentation

Use Case guide to Customer Segmentations

SQL-inf documentation for the CLUSTER command

SQL-inf documentation for the PREDICT command