How Genify used a Transformer-based model to build a recommender system that outperforms industry benchmarks

The rapid ascent of AI, and more recently of deep learning, has brought a succession of breakthroughs in computer science, with a profound impact on both academia and industry. In particular, modern deep learning techniques applied to the pre-existing concept of recommender systems have given birth to a new, superior class of neural recommender systems, which are now revolutionizing digital business. Starting from the basics, a recommendation system, or engine, is a tool that predicts users’ preferences based on their past behaviour (content-based filtering), on their similarity to other users (collaborative filtering), or on a mix of both.

To cite some examples, the ubiquitous streaming platform Netflix implements a recommender system to suggest new content to its users based on what they have previously watched. Similarly, a key to the success of the popular mobile application TikTok is its use of an enormous database of users and content to recommend short videos tailored to individual tastes, thereby improving user retention. Likewise, the e-commerce platform Amazon uses a similar method to recommend new items to its customers, narrowing their search and surprising them with products they are likely to want based on their current and past data.

Recommendation systems to support banks

Given Genify’s experience as a growing FinTech startup, and our collaboration with several banks, our question is: how can recommender systems be useful for banks? The answer is that there is hardly an aspect of a bank’s user experience that a good recommender system could not improve. Firstly, good recommendations can make clients more willing to buy new financial services from the bank, thus boosting its revenue. Moreover, recommending the right products at the right time can also improve users’ engagement and satisfaction, which naturally results in a higher retention rate. The recommendations are not necessarily limited to banking products: they can also cover e-commerce offers or other cross-field items. Having more users means more data with which to analyse and train increasingly powerful recommender systems; in fact, as for all machine learning models, it is well known that more training data enhances performance.

Santander challenge

Considering how important it is for banks to use recommendation engines capable of predicting the next product a user will acquire, Santander launched a challenge in 2016 to push machine learning scientists to develop new state-of-the-art models for this task. The competition, hosted on Kaggle, saw 1,779 teams compete for a share of the $60,000 total prize. The substantial size of this prize gives a clear sense of how important this technology is.

Some state-of-the-art models

Though this challenge is now over, we at Genify decided to use cutting-edge technology to develop our own recommender system and to try to outperform two other high-performing models on the same dataset. The first model is a black-box model trained on Amazon Web Services (AWS) using the Amazon Personalize (AP) service. AP is an auto-ML system for real-time personalized recommendations. It manages the entire ML pipeline, including processing the data, identifying features, selecting the best algorithms, and training, optimizing, and hosting the models, without requiring any ML knowledge. The second model is trained with XGBoost, which has been thoroughly studied and adopted by many Kaggle competitors due to its simplicity and effectiveness. A review of our XGBoost model, taken from Kaggle, can be found in our previous blog.

In this blog, we introduce our recommender system based on the Transformer encoder and show that it outperforms the rival models. To make a fair comparison, we trained the three models on the same data: a subset of around 20,000 users and their historical records extracted from the original dataset.

Figure 1: high-level architecture of Transformer encoder layer.

Dataset overview

Before continuing, let’s briefly review the content of the data provided by the bank. The dataset contains 1.5 years of customer behavioural data from Santander, and the bank’s goal is to predict which new products customers will purchase. The data starts on 2015–01–28 and has monthly records of the products each customer holds, such as a credit card, a savings account, etc. In addition, the dataset records user personal data such as average income, age, and gender. The monthly snapshots run until 2016–05–28, and the model predicts which additional products a customer will start using in the following month, 2016–06–28. Thus, the input dataset spans 17 months from 2015–01–28 to 2016–05–28, and the output set contains only the timestamp corresponding to 2016–06–28. Models are therefore trained on sequences of 16 months to predict the products acquired in the respective last month. In the following tables we can see the products owned by a sample user along with the corresponding metadata.
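To make this split concrete, here is a minimal sketch, using randomly generated toy data, of how monthly snapshots could be divided into a 16-month input and a final-month target. The shapes and the number of users are hypothetical; the real Santander dataset has 24 product-ownership columns per month.

```python
import numpy as np

# Toy setup: for each user, 17 monthly snapshots of binary
# product-ownership flags (axis 1 = months, axis 2 = products).
rng = np.random.default_rng(0)
n_users, n_months, n_products = 3, 17, 24
history = rng.integers(0, 2, size=(n_users, n_months, n_products))

# The first 16 months form the model input; month 17 is the target.
X = history[:, :16, :]        # shape (n_users, 16, n_products)
y_owned = history[:, 16, :]   # ownership labels for the last month

# "Acquired" products are those owned in month 17 but not in month 16.
y_acquired = np.clip(y_owned - history[:, 15, :], 0, 1)

print(X.shape, y_owned.shape, y_acquired.shape)
```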

Transformer for recommendations

The Transformer, introduced in 2017, is a deep learning model composed of an encoder and a decoder, primarily used to solve Natural Language Processing (NLP) tasks. Its advantage over classic sequence models such as RNNs and LSTMs is that it does not require sequential data to be processed in order, allowing more parallelization and therefore reducing training time. It can also be used to learn an embedding that can be fine-tuned for other tasks, as in BERT (2018) or GPT-3 (2020). However, given the novelty of the model, very few implementations are currently available, and they are not easy to train. Note that although the sequence can be processed out of order, each element must still encode its position in the sequence so that positional information can be exploited by the Transformer. In NLP, for example, the positional encoding of words is computed with a sinusoidal function based on the position of each word in the sequence. Given that recommender systems are also trained on sequential data, we can naturally borrow the Transformer from NLP and adapt it to our recommendation task. In fact, we can see the representation of the items owned in each month as a “word”, and the items owned in the last month as the “next word in the sentence,” which is precisely what we want to predict.
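For reference, the sinusoidal positional encoding mentioned above can be sketched as follows (this is the standard NLP formulation from the original Transformer paper; as described later, our model instead encodes position with a year bit and a month one-hot vector):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Standard Transformer positional encoding (Vaswani et al., 2017)."""
    positions = np.arange(seq_len)[:, None]   # (seq_len, 1)
    dims = np.arange(d_model)[None, :]        # (1, d_model)
    # Each pair of dimensions gets a geometrically decreasing frequency.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even indices: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd indices: cosine
    return pe

pe = sinusoidal_positional_encoding(16, 42)
print(pe.shape)  # (16, 42)
```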

Figure 2: high-level representation of input and output of our model.

Data preprocessing and feature engineering

It is well known that data scientists spend most of their time on feature engineering and data pre-processing rather than on designing models, and our work is no exception. Additional challenges arise because the data also contains user personal information, which can vary over time, some fields frequently and others rarely. With different degrees of importance, this information is useful for finding potential similarities between users. That being said, not all features are useful to our task, with most adding no significant value to our model; moreover, fewer features means a simpler model, which is faster to train. Feature importances were obtained from our XGBoost model and are reported here:

Figure 3: Impact of a customer’s information on the product recommendations (measured as overall usefulness in prediction), ranked from most impactful to least.

Among those features, the ones we have selected are:

  • Seniority: (integer in range 0–255) how long the user has been a client of the bank;
  • Age: (integer) age of the user;
  • Segmentation: (category) status of the user: “01-VIP”, “02-Individuals”, “03-College graduated”. In addition, to handle missing values we created a new category, bringing the total to 4;
  • Gross household income: (integer) the user’s yearly income, based on their region of residence.

We will show later that these features, in addition to product usage history, are enough to achieve performance superior to Amazon Personalize. Among the four features listed above, “Segmentation” is the only one that can be one-hot encoded, since it takes only four different values. The other three numerical features are normalized so that their values fall in the same range. This step is fundamental: otherwise the model would incorrectly learn that “Gross household income” is more important, since it takes values several orders of magnitude higher than the other two features. For the numerical features, missing values have been filled in with the mean. As for the product history, we initially considered altering the data to keep track only of newly acquired products; however, this approach was discarded because we would lose vital information about the length of ownership as well as the last month of ownership. Finally, the position of a “word” is encoded by attaching the binary representation of the year in which the data was sampled (0=2015, 1=2016) along with a 12-element one-hot vector representing the month. To make this clearer, let’s look at what a “word” of our sequence looks like after pre-processing.

Figure 4: array representation of input word.

Note that, except for “Gross household income,” “Age,” and “Seniority,” all features are binary, and their concatenation forms a 42-element array. Finally, the input to the model is a sequence of 16 words, grouped in batches of 32.
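A sketch of how one such “word” could be assembled is shown below. The layout is hypothetical, inferred from the counts in the text: 3 normalized numeric features + 4 segmentation one-hot values + 1 year bit + 12 month one-hot values leaves 22 product-ownership flags to reach 42 elements. The normalization divisors are illustrative, not the ones actually used.

```python
import numpy as np

def encode_month(products, segment_idx, seniority, age, income, year, month):
    """Assemble one 42-element 'word' (hypothetical layout:
    22 product flags + 4 segmentation one-hot + 3 normalized numeric
    features + 1 year bit + 12 month one-hot = 42)."""
    seg = np.zeros(4)
    seg[segment_idx] = 1.0                     # one-hot segmentation
    numeric = np.array([seniority / 255.0,     # illustrative min-max style
                        age / 100.0,           # normalization (assumed)
                        income / 1e6])
    year_bit = np.array([float(year == 2016)]) # 0 = 2015, 1 = 2016
    month_oh = np.zeros(12)
    month_oh[month - 1] = 1.0                  # one-hot month
    return np.concatenate([products, seg, numeric, year_bit, month_oh])

# Sample user from the prediction example later in this post.
word = encode_month(np.zeros(22), 2, 35, 23, 58728.39, 2016, 5)
print(word.shape)  # (42,)
```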

Model architecture

Figure 5: attention layer of our model.

Borrowing the idea behind BERT, we discard the decoder of the Transformer and keep only the encoder. Our goal is to have the encoder learn an embedding representation of the user data, and then let a feed-forward network built on top of the encoder make the predictions. Hence, the model takes a sequence of 16 months of a user’s history as input and learns to predict the items owned in the very last month. The model is composed of 6 multi-head encoder attention layers with 7 attention heads each.

Figure 6: multi-head attention layer of our model.

Each of those layers is followed by a feed-forward layer of size 2048. Before each attention and feed-forward layer there is a batch-normalization layer, and after each layer a dropout layer with a dropout probability of 0.5 to reduce overfitting. On top of the encoder, a feed-forward layer followed by a softmax layer predicts the probability that each item is owned in the last month; an item is considered owned if its probability is > 0.5. Acquired items can then be derived by excluding the items owned in the previous month from these predictions. The model is trained with a binary cross-entropy loss and the Adam optimizer. For the learning rate, we adopted a RAdam-style warm-up schedule, peaking after 10 epochs and decaying until the 100th epoch, when training terminates. A warm-up schedule keeps the learning rate low at the very start of training, when the model’s parameters are still far from the optimum, raises it to a peak, and then gradually decreases it for more stable training.
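A minimal PyTorch sketch of this architecture is given below. Several details are assumptions rather than our exact implementation: PyTorch’s built-in encoder layer uses LayerNorm where the post describes batch normalization, a per-item sigmoid stands in for the output normalization so that the multi-label binary cross-entropy makes sense, the warm-up is a simple linear ramp rather than the exact RAdam schedule, and the 22 product flags are a hypothetical count.

```python
import torch
import torch.nn as nn

D_MODEL = 42     # length of one monthly "word" after pre-processing
SEQ_LEN = 16     # months of history per user
N_PRODUCTS = 22  # assumed number of product-ownership outputs

class ProductRecommender(nn.Module):
    """Encoder-only Transformer sketch: 6 layers, 7 heads, ff size 2048."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=7, dim_feedforward=2048,
            dropout=0.5, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(D_MODEL, N_PRODUCTS)

    def forward(self, x):                 # x: (batch, 16, 42)
        h = self.encoder(x)               # contextualized month embeddings
        # Per-item probability of ownership in the last month (assumed
        # sigmoid head, consistent with binary cross-entropy training).
        return torch.sigmoid(self.head(h[:, -1, :]))

model = ProductRecommender()
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters())

# Linear warm-up to a peak at epoch 10, then decay until epoch 100.
warmup_epochs, total_epochs = 10, 100
def lr_lambda(epoch):
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    return max(0.0, (total_epochs - epoch) / (total_epochs - warmup_epochs))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

probs = model(torch.randn(32, SEQ_LEN, D_MODEL))
owned = probs > 0.5   # an item is considered owned if p > 0.5
print(probs.shape)
```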

Experiments results

All three models are trained to predict the ownership of products. Recall that the difference between predicting item ownership and item acquisition is that the latter excludes items owned in the month before the month to predict. Hence, the items acquired on the last date, 2016–06–28, are computed by excluding the items owned on 2016–05–28 from the original predictions. Genify’s Transformer model beats Amazon Personalize and XGBoost on all the metrics we tested, improving on them by roughly 10% to 50%.
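The ownership-to-acquisition step can be sketched as follows (the probability values and the five-product layout are toy examples):

```python
import numpy as np

# Predicted ownership probabilities for the target month (toy values).
probs = np.array([0.98, 0.91, 0.72, 0.40, 0.10])
owned_pred = probs > 0.5

# Products already owned on 2016-05-28, the month before the target.
owned_previous = np.array([True, False, True, False, False])

# Acquisitions = predicted owned items minus items already owned.
acquired_pred = owned_pred & ~owned_previous
print(acquired_pred)  # [False  True False False False]
```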

The metrics we use to measure the models’ performance are Precision@k with k=1, 3, 5, 10, 20, along with MRR and NDCG.

Note: readers not familiar with these metrics can find a detailed introduction in the official documentation.
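For intuition, here are minimal reference implementations of the three metrics for a single user with binary relevance (the item names are illustrative; in practice the metrics are averaged over all users):

```python
import numpy as np

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return len(set(ranked[:k]) & set(relevant)) / k

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant recommendation."""
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(1.0 / np.log2(i + 1)
              for i, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal = sum(1.0 / np.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

ranked = ["direct_debit", "long_term_deposits", "pensions"]
relevant = {"direct_debit"}
print(precision_at_k(ranked, relevant, 1))  # 1.0
print(mrr(ranked, relevant))                # 1.0
```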

Table 1: Results obtained on items acquisition.

As the table above shows, our model outperforms both XGBoost and Amazon Personalize on every metric for product acquisition. Taking a closer look at Precision@1, over 98% of our recommendation engine’s most confident predictions are products that were actually acquired. By comparison, fewer than 70% of the top-1 predictions made by XGBoost are correct, and this value falls below 50% in the case of AP.

Table 2: Results obtained on item ownership.

As with item acquisition, our model is also the best performer for product ownership. Note that the XGBoost-based recommender system does not predict item ownership by design, so it is not included in this evaluation. The metrics here are much higher than before, since predicting ownership is an overall easier task than predicting acquisition. Consequently, the idea of training models on labels with owned products removed was discarded, since the sparsity of acquisitions would seriously hurt model performance.

Predictions example

Top 3 Recommended items (predictions):

  1. Direct Debit
  2. Long-term deposits
  3. Pensions 1

Acquired items (ground truth):

  1. Direct Debit
  2. (Empty)
  3. (Empty)

User’s products history (data):

User’s features (meta-data):

  • Seniority: 35
  • Age: 23
  • Segmentation: 03-UNIVERSITARIO
  • Gross household income: 58728.39


The Transformer is a major breakthrough in deep learning and, as we have shown, it brings huge benefits not only to the field of NLP but also to improving state-of-the-art recommender systems. The ML model implemented by Amazon Personalize, as well as its data pre-processing pipeline, is the fruit of years of experience accumulated by researchers at Amazon. However, the Transformer we developed at Genify, supported by well-engineered data pre-processing and a smart adaptation to our recommendation task, not only easily outperforms AP but, thanks to its high degree of parallelization, is also faster to train. The only drawback is that setting up a recommendation engine with such a complex model is not as immediate as learning an auto-ML tool; in the long term, however, the more accurate predictions can produce returns well worth the work required to set it up.

As opposed to AP, which is a general ML tool, the Kaggle XGBoost model is specialized for the Santander dataset. It exploits the powerful XGBoost machine learning algorithm, as well as a more sophisticated feature engineering process, to achieve better performance than AP and occupy the higher spots of the Kaggle leaderboard. Still, it is not as accurate as our Transformer, further promoting our model as a possible leader among recommender systems. Another benefit of the Transformer’s encoder is that it also learns feature embeddings, which can be used to discover relations among different users, or to produce recommendations on a new dataset by mapping inputs and outputs to their corresponding embeddings.

