Horses for courses

The problem of personalizing customer engagement can be broken down into three broad problems:

  1. Recording how customers engage with the brand: the most obvious aspect of this is a transaction, but non-transactional signals such as web/app behaviour, service-related communications etc. can be considered as well.
  2. Parsing customer engagement records to determine how, when and what to talk to the customer about. This is where machine learning and other aspects of intelligence come into play.
  3. Actually reaching out or reacting to the customer in a personalized manner. This is where marketing campaigns, web/app personalization etc. matter.

The systems that focus on these three problems are broadly referred to as systems of record, systems of intelligence and systems of engagement respectively. Note that these need not always be separate products; rather, this is more of a conceptual division of responsibilities.

When one talks about systems of intelligence, perhaps the first problem that comes to everyone’s mind is: how can we recommend products to customers in a personalized manner? This is where recommender systems come in.

There are many algorithms that figure out how to recommend products to customers. Not all of them work spectacularly in every circumstance, so it is important to understand intuitively what their strengths are and when they work best. Horses for courses, if you will.

But why isn’t there one recommender to rule them all?

Well, the most obvious reason is the nature of the data itself. The fundamental, unsaid premise of personalized recommendation is that the brand knows the customer inside out, and can use this to recommend exactly the right product. When we think of this scenario, our instinct is to imagine vast reams of data about each customer. Real-life datasets stress-test this premise in many ways.

What if the breakdown of your brand’s customer activity looks like this: 50% of your customers have visited only once, and 25% of those made that one visit over 2 years ago. Another 5%, on the other hand, have visited 15 times in the past year alone. Your app’s share of total sales is in single digits and climbing, but even there the same unevenness exists: 30% of your digital users have registered on your app but haven’t bought anything, or even browsed much, while some have browsed a lot and bought a lot.

The problem looks much worse when you have a lot of customers and/or products in the aggregate. Imagine a matrix (not the one where Keanu Reeves stops bullets; we mean the one with rows and columns) where the rows are customers and the columns are products. If you have a lot of customers and/or products, this matrix becomes very large.

A large matrix is problematic in two ways. First, there’s the computational cost of dealing with something of that size. Second, this matrix is likely to be very sparse, not only because the matrix itself is large, but also because customer activity is uneven, as in the above example. All of this makes the analytical problem tougher, because what any recommender is trying to do is figure out what to do with the empty cells in the matrix, i.e., which of those products the customer is likely to buy and should therefore be recommended.

(Now, this is not an intractable problem. The computational angle is addressed using packages that handle large sparse matrices and algorithms that parallelize nicely, while the analytical angle is addressed using methods that reduce the size of the matrix: product attribute-based approaches, neural embeddings, matrix factorization etc.)

And then there’s the cold start problem. What if you have a new product, for which you have no transaction data? Or a new customer who has registered on your app but not browsed or bought anything?

Imagine trying to build a single algorithm that knows exactly what to do in all of these scenarios. Tough, isn’t it? That’s why it makes sense to think of different algorithms as being effective for different customer groups.

So what kinds of algorithms are there?

We’ll talk about those in this article. Not every single one of them, but we’ll cover the more popular approaches, and where they are best suited. Rather than talk about the algorithms by name, we will talk about the broad approaches, and refer to the various algorithms that fall into these categories.

People like you also bought…

This approach attempts to find customers similar to a given customer and use their behaviour as a cue to determine what to recommend.

The most popular of these approaches is user-based collaborative filtering, which looks at user ratings for products, matches users with similar behaviour and uses this as a basis for recommendations. This approach works best when one is dealing with a lot of user behaviour, such as viewership data on a media streaming platform.

However, if one has a lot of other data characterizing a customer (e.g., in BFSI, where such data is often collected as part of an application process, or location intelligence in a CPG business, where the customer is actually a retailer), or a lot of contextual data about the transaction that helps explain behaviour (e.g., shopping time preferences or price/discount preferences in retail), a more generalized way of characterizing the customer becomes valuable. This means stepping back from collaborative filtering to a more general class of methods, namely lookalike models.
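A lookalike model can be sketched the same way, except that similarity comes from customer attributes rather than ratings. The sketch below assumes each customer is described by a hypothetical numeric feature vector (age, average basket value, share of purchases made on discount), standardizes each attribute, finds the nearest customers by Euclidean distance and recommends what they bought. Production lookalike models are often trained classifiers rather than a plain nearest-neighbour lookup; this is only the simplest version of the idea.

```python
import math
from collections import Counter


def standardize(rows):
    """Z-score each column so that no single attribute dominates the distance."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0 for c, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def lookalike_recommend(features, purchases, customer, k=2, n=3):
    """Recommend what the k customers nearest in standardized attribute space bought."""
    names = list(features)
    z = dict(zip(names, standardize([features[c] for c in names])))
    lookalikes = sorted(
        (c for c in names if c != customer),
        key=lambda c: math.dist(z[customer], z[c]),
    )[:k]
    counts = Counter(p for c in lookalikes for p in purchases.get(c, []))
    owned = set(purchases.get(customer, []))
    return [p for p, _ in counts.most_common() if p not in owned][:n]


# Hypothetical attributes: [age, average basket value, share of purchases on discount]
features = {
    "a": [30, 1200, 0.10],
    "b": [32, 1100, 0.15],
    "c": [55, 300, 0.60],
    "d": [58, 350, 0.70],
}
purchases = {"a": [], "b": ["blazer", "loafers"], "c": ["saver pack"], "d": ["saver pack", "tote"]}
print(lookalike_recommend(features, purchases, "a", k=1))  # ['blazer', 'loafers']
```

Note that customer "a" has no purchases at all; the recommendation comes entirely from who they resemble.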

A variant of this approach, which might work in the CPG sector, is a purely geographical one: “Stores near you are buying these products.” This works when latitude and longitude data is available for all (or most) stores. It implicitly assumes that location intelligence is reflected in the purchase behaviour of end-consumers in the locality, and therefore that the ordering behaviour of nearby stores captures what one needs to know.
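A minimal “stores near you” sketch, assuming each store has a (latitude, longitude) pair and an order history of {product: units}. It uses the haversine formula for distance; the 2 km radius, store IDs and orders are hypothetical.

```python
import math
from collections import Counter


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearby_store_recommend(stores, orders, store_id, radius_km=2.0, n=3):
    """Products ordered most by stores within radius_km that this store does not already stock."""
    lat, lon = stores[store_id]
    neighbours = [s for s, (la, lo) in stores.items()
                  if s != store_id and haversine_km(lat, lon, la, lo) <= radius_km]
    counts = Counter()
    for s in neighbours:
        counts.update(orders.get(s, {}))  # adds units per product
    stocked = set(orders.get(store_id, {}))
    return [p for p, _ in counts.most_common() if p not in stocked][:n]


stores = {"S1": (26.4499, 80.3319), "S2": (26.4550, 80.3350),
          "S3": (26.4600, 80.3300), "S4": (28.6139, 77.2090)}
orders = {"S1": {"biscuits": 40}, "S2": {"biscuits": 30, "instant noodles": 25},
          "S3": {"instant noodles": 20, "hair oil": 10}, "S4": {"frozen peas": 100}}
print(nearby_store_recommend(stores, orders, "S1"))  # ['instant noodles', 'hair oil']
```

S2 and S3 are roughly a kilometre from S1, while S4 is hundreds of kilometres away and is ignored.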

Lookalike models in general work best when you know a lot about the customer; however, they have also proven to be surprisingly effective when one is dealing with data sparsity – for instance, the best recommendation for a single transactor sometimes comes from those who have done that single transaction and maybe one more besides.

People who bought this item also bought…

This class of approaches involves mining co-purchase patterns in some form. Item-based collaborative filtering, an algorithm pioneered by Amazon, is one of the best-known examples, but simple co-purchase-based approaches like association rule mining are also quite effective.
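As an illustration of the simpler end of this family, here is a pairwise association rule sketch (not Amazon’s item-to-item algorithm). For each pair of products bought together at least `min_support` times, it computes confidence, the share of buyers of A who also bought B, and lift, how much buying A raises the odds of B relative to B’s overall popularity. The baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=2):
    """Mine 'bought X -> also bought Y' rules; returns (X, Y, support, confidence, lift)."""
    n = len(baskets)
    item_counts = Counter(i for b in baskets for i in set(b))
    pair_counts = Counter(pair for b in baskets for pair in combinations(sorted(set(b)), 2))
    rules = []
    for (a, b), both in pair_counts.items():
        if both < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = both / item_counts[x]        # P(y | x)
            lift = confidence / (item_counts[y] / n)  # > 1 means x makes y more likely
            rules.append((x, y, both, round(confidence, 2), round(lift, 2)))
    return sorted(rules, key=lambda r: (r[3], r[4]), reverse=True)


baskets = [{"bread", "butter"}, {"bread", "butter", "jam"}, {"bread", "milk"}, {"butter", "jam"}]
for rule in pair_rules(baskets):
    print(rule)  # first line: ('jam', 'butter', 2, 1.0, 1.33)
```

With mostly single-item baskets, `pair_counts` stays nearly empty, which is the brittleness described next.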

The strength of this approach depends on how often people buy more than one thing. If the vast majority of customers are single transactors, there may be so little data to work with that this strategy becomes brittle.

The other challenge with this strategy is the sparsity of data. For instance, the number of people who bought a formal blue shirt and also bought dark grey slacks might be quite small, but the number of people who bought a formal shirt and also bought formal pants is likely to be much higher. This means that analyzing co-purchase patterns at various levels of abstraction (i.e., finding patterns between groups of products rather than individual products) is critical to its effectiveness. The grouping can be derived implicitly from statistical patterns in the data, but it is probably better done if a well-curated set of product attributes is available.
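The effect of moving up a level of abstraction is easy to see in code. This sketch assumes a hypothetical product master that maps each SKU to a product type, and counts co-purchased pairs either at SKU level or after mapping through that attribute.

```python
from collections import Counter
from itertools import combinations

# Hypothetical product master: SKU -> product type
PRODUCT_TYPE = {
    "blue formal shirt": "formal shirt", "white formal shirt": "formal shirt",
    "navy formal shirt": "formal shirt", "grey slacks": "formal pants",
    "black trousers": "formal pants", "charcoal slacks": "formal pants",
}


def copurchase_counts(baskets, key=lambda sku: sku):
    """Count co-purchased pairs after mapping each SKU through `key` (identity = SKU level)."""
    counts = Counter()
    for basket in baskets:
        groups = sorted({key(sku) for sku in basket})
        counts.update(combinations(groups, 2))
    return counts


baskets = [{"blue formal shirt", "grey slacks"},
           {"white formal shirt", "black trousers"},
           {"navy formal shirt", "charcoal slacks"}]
print(max(copurchase_counts(baskets).values()))                  # 1: every SKU pair seen once
print(copurchase_counts(baskets, key=PRODUCT_TYPE.get).most_common())
# [(('formal pants', 'formal shirt'), 3)]
```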

If you liked these, you’ll also like…

Rather than simply understanding what customers bought, it helps to look for patterns across their purchases. For instance, one customer might have a preference for the colour blue, whereas another might prefer ethnic wear. These patterns suggest that blue outfits might be a good choice to recommend to the former, while the festive season might be a good time to get the latter to transact again with ethnic wear. It is also possible to use these patterns to do the exact opposite: “we know you love our range of blues, but have you looked at what we have in checked patterns?”

All of these fall under the broad category of content-based filtering. These algorithms have proven to be extremely effective when there is a well-curated product master with many attributes.
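A bare-bones content-based sketch, assuming a hypothetical product master of {attribute: value} per SKU. The customer profile is the share of their purchases carrying each attribute value, and each unbought product is scored by summing the profile weights of its attributes. The catalogue deliberately includes a new arrival with no sales history.

```python
from collections import Counter


def build_profile(purchased, catalogue):
    """Share of the customer's purchases that carry each (attribute, value) pair."""
    counts = Counter(av for sku in purchased for av in catalogue[sku].items())
    return {av: c / len(purchased) for av, c in counts.items()}


def content_based_recommend(purchased, catalogue, n=3):
    """Score unbought products by how well their attributes match the customer's profile."""
    profile = build_profile(purchased, catalogue)
    scores = {
        sku: sum(profile.get(av, 0.0) for av in attrs.items())
        for sku, attrs in catalogue.items()
        if sku not in purchased
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]


catalogue = {
    "blue oxford shirt": {"colour": "blue", "category": "shirt"},
    "blue kurta": {"colour": "blue", "category": "kurta"},
    "red kurta": {"colour": "red", "category": "kurta"},
    "blue linen shirt": {"colour": "blue", "category": "shirt"},  # new arrival, no sales yet
}
print(content_based_recommend(["blue oxford shirt"], catalogue))
# ['blue linen shirt', 'blue kurta', 'red kurta']
```

The new linen shirt ranks first purely on its attributes, which is how this family handles the cold start problem for products.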

However, our old adversary – data sparsity – is poised to strike yet again. What if the customer has purchased only one blue shirt? Do we assume that he loves blue? This is where approaches that exploit what we learn about a customer, while still leaving the door open to explore what we don’t, work best. For instance, an approach that assumes that all colours are equally preferred, and then bumps this customer’s preference for blue up by a little bit after this transaction, might work better. The more we see a customer, the more confident we are in our understanding of their preferences.
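One way to formalize “assume all colours are equally preferred, then bump blue up a little” is to treat colour preferences as a Dirichlet distribution with a uniform prior. The posterior mean is (pseudo-count + purchases) / (prior strength + total purchases); the `prior_strength` of 5 is an arbitrary, hypothetical choice that controls how many purchases it takes to overturn the uniform assumption. Drawing a sample from the posterior instead of using its mean (Thompson sampling) adds deliberate exploration, so less-seen colours occasionally win.

```python
import random


def colour_preferences(purchase_colours, palette, prior_strength=5.0):
    """Posterior mean preference per colour under a symmetric Dirichlet (uniform) prior."""
    alpha = prior_strength / len(palette)  # pseudo-purchases credited to every colour up front
    total = prior_strength + len(purchase_colours)
    return {c: (alpha + purchase_colours.count(c)) / total for c in palette}


def sample_preferences(purchase_colours, palette, prior_strength=5.0, rng=random):
    """One Thompson-style draw from the same Dirichlet posterior, via normalized Gamma samples."""
    alpha = prior_strength / len(palette)
    draws = {c: rng.gammavariate(alpha + purchase_colours.count(c), 1.0) for c in palette}
    total = sum(draws.values())
    return {c: d / total for c, d in draws.items()}


palette = ["blue", "black", "white", "red", "green"]
print(colour_preferences(["blue"], palette)["blue"])  # about 0.33, up from 0.2
print(colour_preferences(["blue"] * 8 + ["black"] * 2, palette)["blue"])  # 0.6
```

One blue shirt nudges blue from 0.2 to about 0.33; eight blues out of ten purchases push it to 0.6. The more we see a customer, the less the prior matters.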

The good thing about this approach is that it solves the cold start problem for products. A new product, by definition, is one that we have no transaction history for. This approach allows us to recommend it to customers whose preferences suggest that they might be interested in a product with these attributes.

Lots of customers are buying…

This is perhaps the simplest and most used recommender approach: what’s trending right now? We can slice this by location, time frame (including, for instance, what happened around the same time last year), customer attributes, product categories etc.

Trending depends solely on aggregates, so not knowing much about some customers isn’t a deterrent. That being said, if we calculate what’s trending for narrow slices of the population (e.g., what are women in Kanpur shopping for in the Accessories section?), we might find that there isn’t much sales data to back up the recommendations, and we might need to reinforce the approach with results from broader slices.
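A minimal sketch of trending with back-off: start from the narrowest slice and drop filters one at a time until the slice has enough volume. The row layout, the `min_units` threshold of 50 and the sales figures are all hypothetical.

```python
from collections import Counter


def trending(sales, slice_filters, min_units=50, n=3):
    """Top sellers for the narrowest slice with enough volume, backing off to broader slices.

    sales: rows like {"city": ..., "gender": ..., "category": ..., "sku": ..., "units": ...}
    slice_filters: (field, value) pairs ordered from broadest to narrowest.
    """
    for depth in range(len(slice_filters), -1, -1):  # narrowest slice first
        filters = dict(slice_filters[:depth])
        counts = Counter()
        for row in sales:
            if all(row[field] == value for field, value in filters.items()):
                counts[row["sku"]] += row["units"]
        if depth == 0 or sum(counts.values()) >= min_units:
            return filters, [sku for sku, _ in counts.most_common(n)]


sales = [
    {"city": "Kanpur", "gender": "F", "category": "Accessories", "sku": "bangle set", "units": 12},
    {"city": "Kanpur", "gender": "M", "category": "Accessories", "sku": "wallet", "units": 30},
    {"city": "Lucknow", "gender": "F", "category": "Accessories", "sku": "clutch", "units": 40},
    {"city": "Kanpur", "gender": "F", "category": "Footwear", "sku": "sandals", "units": 25},
]
slice_ = [("category", "Accessories"), ("city", "Kanpur"), ("gender", "F")]
print(trending(sales, slice_))
# ({'category': 'Accessories'}, ['clutch', 'wallet', 'bangle set'])
```

Women in Kanpur account for only 12 units of Accessories, and Kanpur as a whole for 42, so the sketch falls back to all Accessories sales.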

Here’s what you’re likely to buy…

What any recommender algorithm is trying to do is, in some form or shape, predict what a customer is likely to buy, and nudge them in that direction by putting those products front and center rather than letting the customer stumble upon them. It is natural, then, to model the problem in precisely those terms: characterize each customer, characterize each product, characterize how customers interact with products, and use these to predict the probability that a customer will be interested in a particular product. This is a standard two-class classification problem, and there are a million and three machine learning algorithms at your disposal to solve it.
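To show the framing rather than any particular library, here is logistic regression fitted by plain gradient descent on (customer, product) pairs. The three features and the labels are hypothetical; in practice any off-the-shelf classifier could take its place.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """Estimated probability that the customer buys the product, given the pair's features x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression weights by full-batch gradient descent on the log loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # gradient of the log loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Hypothetical features per (customer, product) pair:
# [purchases in the product's category, product on promotion (0/1), days since last visit / 100]
X = [[3, 1, 0.10], [2, 0, 0.20], [0, 1, 0.90], [0, 0, 0.80], [4, 0, 0.05], [1, 0, 0.60]]
y = [1, 1, 0, 0, 1, 0]  # 1 = bought the product in the following month
w, b = train_logistic(X, y)
print(round(predict_proba(w, b, [3, 1, 0.1]), 2), round(predict_proba(w, b, [0, 0, 0.9]), 2))
```

Ranking every candidate product by `predict_proba` for a customer gives the recommendation list.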

This approach is especially useful when you’re solving a replenishment problem, i.e., what to recommend that a customer replenish, out of the items they have bought already. Use cases include CPG, grocery, food etc.
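For replenishment, one of the most useful features is simply how long a customer usually waits between purchases of an item. This sketch estimates each item’s cycle from the median gap between past purchases and flags items whose expected next purchase falls within a hypothetical 7-day window; items bought only once are skipped because they have no cycle yet.

```python
from datetime import date, timedelta
from statistics import median


def due_for_replenishment(history, today, window_days=7):
    """(sku, expected date) for items due within window_days of today, soonest first.

    history: {sku: [dates the customer bought it]}; needs >= 2 purchases to estimate a cycle.
    """
    due = []
    for sku, dates in history.items():
        dates = sorted(dates)
        if len(dates) < 2:
            continue
        gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
        expected = dates[-1] + timedelta(days=median(gaps))
        if expected <= today + timedelta(days=window_days):  # also catches overdue items
            due.append((sku, expected))
    return sorted(due, key=lambda item: item[1])


history = {
    "atta 5kg": [date(2024, 1, 5), date(2024, 2, 4)],
    "detergent": [date(2023, 12, 1), date(2024, 1, 30)],
    "ghee": [date(2024, 2, 10)],
    "tea": [date(2024, 1, 1), date(2024, 1, 21), date(2024, 2, 10)],
}
print(due_for_replenishment(history, date(2024, 3, 1)))
```

Tea (20-day cycle) is due on 1 March and atta (30-day cycle) on 5 March; detergent is not due until 30 March.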

That’s a whole bunch of algorithms. How do we put these together?

As the previous descriptions might have indicated, each of the approaches we have discussed rests on a different common-sense principle, and each has its strengths and weaknesses depending on the situation. How, then, do you decide which algorithm to use for your business? Or, if you want to get more fine-grained, which algorithm to use for which customer?

Here’s the good news: you don’t have to! There is a class of methods whose job is to combine what these algorithms produce and decide what to recommend. These are called hybrid recommenders or hybridizers. These methods depend broadly on the following factors:

  1. How strongly and how often is a product recommended by the various algorithms
  2. How much one can trust these algorithms, i.e., how well have they performed in the past
  3. What we know about the customer that indicates how to combine these outputs

The end result is a consolidated recommendation list comprising the output of any or all of the component algorithms.
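A simple weighted hybrid covering the three factors above might look like this. It is one of many possible hybridization schemes, with hypothetical algorithm names and weights: each product earns points from every algorithm that recommends it (strength decays with rank), scaled by that algorithm’s historical hit rate (trust) and an optional per-segment multiplier (what we know about the customer).

```python
from collections import defaultdict


def hybridize(outputs, trust, segment_weights=None, n=5):
    """Blend ranked lists from several recommenders into one list.

    outputs: {algorithm: [sku, ...]} ranked best-first
    trust: {algorithm: historical hit rate}
    segment_weights: optional {algorithm: multiplier} for this customer's segment
    """
    segment_weights = segment_weights or {}
    scores = defaultdict(float)
    for algo, ranked in outputs.items():
        weight = trust.get(algo, 0.0) * segment_weights.get(algo, 1.0)
        for rank, sku in enumerate(ranked, start=1):
            scores[sku] += weight / rank  # products recommended often and high up accumulate more
    return sorted(scores, key=scores.get, reverse=True)[:n]


outputs = {"item_cf": ["belt", "loafers", "tie"], "content": ["tie", "belt"], "trending": ["sneakers", "belt"]}
trust = {"item_cf": 0.3, "content": 0.2, "trending": 0.1}
print(hybridize(outputs, trust))  # ['belt', 'tie', 'loafers', 'sneakers']
# For a single transactor, switch off co-purchase signals and lean on trends
print(hybridize(outputs, trust, {"item_cf": 0.0, "trending": 3.0})[:3])  # ['sneakers', 'belt', 'tie']
```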

But how do we know if the algorithms work?

There are two broad ways of assessing the performance of recommenders.

  1. Active: This involves measuring the accuracy of recommendations shown to a customer, either proactively through SMS/Email/Push notifications, or reactively by way of personalizing their app/web experience when they browse. You’re measuring the accuracy of recommendations the customer sees and expresses an interest in, either through browsing or adding to their shopping cart or buying. A variant of this approach is one where the performance is compared against that of a control group who get no recommendations, or generic non-personalized recommendations.
  2. Passive: This involves measuring the accuracy of recommendations generated for a customer, by comparing them against the observed behaviour post generation. This can also be done in back-testing mode, wherein the recommendations are generated as of an earlier date based on what is known at that point, and transactions since then are used for comparison.
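The back-testing variant of the passive approach boils down to a time split. This sketch assumes transactions are (customer, product, date) tuples; `recommend` stands for any recommender, passed in as a function of the pre-cutoff history, and the metric is a simple hit rate (did the customer buy at least one recommended product after the cutoff?).

```python
from datetime import date


def backtest_split(transactions, cutoff):
    """Split (customer, sku, day) records into what a recommender may learn from and what it is scored on."""
    history = [t for t in transactions if t[2] < cutoff]
    future = {}
    for customer, sku, day in transactions:
        if day >= cutoff:
            future.setdefault(customer, set()).add(sku)
    return history, future


def passive_hit_rate(recommend, transactions, cutoff):
    """Share of post-cutoff buyers who bought at least one product recommended to them as of the cutoff.

    recommend: any callable (history, customer) -> list of products.
    """
    history, future = backtest_split(transactions, cutoff)
    hits = sum(bool(set(recommend(history, c)) & bought) for c, bought in future.items())
    return hits / len(future) if future else 0.0


transactions = [
    ("c1", "shirt", date(2024, 1, 10)),
    ("c1", "belt", date(2024, 3, 2)),
    ("c2", "saree", date(2024, 2, 20)),
    ("c2", "bangles", date(2024, 3, 5)),
]
# A deliberately naive recommender that suggests a belt to everyone
print(passive_hit_rate(lambda history, customer: ["belt"], transactions, date(2024, 3, 1)))  # 0.5
```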

The difference between the two approaches is simple: an active approach cares whether the customer has seen the recommendations, so the power of the actual nudge is taken into account. The flip side is that you can only measure it when you know the customer has seen the recommendations, so the sample is limited to those customers.

The passive approach, on the other hand, doesn’t care if the customer has seen the recommendations. The underlying assumption is that customers will buy what they want anyway, with or without the nudge, so it makes sense to measure the performance of the algorithms as though they are simply predicting what will happen. You get to measure the performance of a lot of algorithms against a lot of customers (pretty much everyone who transacted), but you don’t know whether the recommendations themselves would have nudged the customer in a particular direction.

Both approaches have their merits, so it makes sense to do both.

In either case, we can use a variety of metrics to track the actual level of commonality between what is recommended and what is purchased. For instance, you could track how much of what is purchased is in the recommended list (i.e., recall), or how much of what is recommended is purchased (i.e., precision), or some variant or combination of these two. Additionally, you could account for:

  • The rank of the recommendation
  • The extent of the match (maybe the customer didn’t buy the exact same product, but something in the same subcategory)
  • Whether what was recommended and/or purchased was very popular to begin with (if you recommended something rarely bought and it was mostly right, that’s probably more valuable than if you recommended something everyone buys and it turned out to be exactly right)
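Here is a sketch of precision and recall at k, plus one possible score that folds in all three adjustments above. The 0.5 partial credit for a subcategory match, the -log(popularity) rarity weight (popularity being the share of customers who buy the item, so it must be greater than 0) and the DCG-style log2 rank discount are hypothetical design choices, not a standard metric; the score is unnormalized and only meant for comparing algorithms on the same customers.

```python
import math


def precision_recall_at_k(recommended, purchased, k):
    """Share of the top-k that was bought, and share of what was bought that made the top-k."""
    top, bought = recommended[:k], set(purchased)
    hits = len(set(top) & bought)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(bought) if bought else 0.0
    return precision, recall


def discounted_gain(recommended, purchased, subcategory, popularity, k):
    """Rank-discounted score with partial credit for near misses and a bonus for rarer items."""
    bought = set(purchased)
    bought_subcats = {subcategory[p] for p in bought}
    score = 0.0
    for rank, sku in enumerate(recommended[:k], start=1):
        if sku in bought:
            match = 1.0  # exact match
        elif subcategory[sku] in bought_subcats:
            match = 0.5  # same subcategory as something bought
        else:
            continue
        score += match * -math.log(popularity[sku]) / math.log2(rank + 1)
    return score


subcategory = {"A": "shirt", "B": "shirt", "C": "belt", "D": "tie", "E": "belt"}
popularity = {"A": 0.5, "B": 0.1, "C": 0.01, "D": 0.2, "E": 0.05}
print(precision_recall_at_k(["A", "B", "C", "D"], ["A", "E"], 3))  # precision 1/3, recall 0.5
print(discounted_gain(["A", "B", "C", "D"], ["A", "E"], subcategory, popularity, 3))
```

An item everyone buys (popularity 1) earns zero credit under this weighting, which is the intended penalty for recommending the obvious.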

Okay, got it. Now how do we use these recommendations?

That, dear reader, is a story for another day. In our next article in this series, we’ll talk specifically about how recommendations can be used to personalize customer engagement.