After 1.5 years in the Early Careers Academy at Auto Trader, I thought it would be great to share my experience so far as a graduate data analyst and what I’ve been up to here.
Back in 2022, I graduated from the University of Nottingham with a degree in Computer Science, which also marked my first year in the UK, having moved here from Malaysia. As you can imagine, it was a huge change: I was still new to the country, moving to a new city in Manchester, and stepping into the working world for the first time. Thankfully, the people at Auto Trader made it easy for me to settle into these new environments, constantly offering support whenever I needed it.
Getting settled in was made even easier in the first few weeks when I was placed in an induction with 13 other new graduates in different roles across the business. I was extremely grateful to kick-start my career with such like-minded and wonderful people, most of them fresh graduates straight out of university. Throughout the graduate induction, I learnt a lot about Auto Trader as a whole and how we operate as a business, which I found essential later in my role as it helped me understand why we do things. We also had a couple of team days where we would spend the entire day in our actual team, getting a taste of what the role would be like after the induction, which I thought was a nice transition.
Moving on to my role, I was placed in the team that oversees products related to searching for vehicles on our platforms. I was introduced to all the different products we have, including the machine learning models powering our advertising packages. Having built a couple of ML models back at university, it was fascinating to see how they were built and used in a commercial setting. In no time, I got stuck into my first project: exploratory data analysis on the performance of our advertising packages. First projects may sound daunting to some but, being familiar with some of the tools AT was already using and with my tech mentor’s (shout out to Michael) constant support, it was a comfortable first project to start with.
I was also fortunate enough to be handed responsibility for maintaining one of our advertising products, giving me a sense of ownership over what I was doing, which I thought was great! With that, I was exposed to AT’s data architecture and had the chance to work closely with developers. It was interesting to witness and experience how a data product gets built from scratch, productionised, and monitored daily.
Throughout the 2-year Graduate Programme, there were also several projects to complete alongside my day-to-day role. I had the chance to work with other people in early careers across the business, which was a great way to meet new people! The Sell-A-Vehicle project was one I enjoyed a lot: we had to source, advertise, and sell a car to a consumer. It was a real eye-opener to experience the entire car buying/selling journey and understand the pain points of our customers. I also had the chance to visit a dealership and take photos of the car we bought before it was advertised on AT, something I wouldn’t have been able to experience in my day-to-day role!
Plenty of support was always available, whether it was on the technical or personal side of things. Being in a wider analytics group, there are amazing and brilliant data people around, always willing to discuss and support any problems I face. I also had weekly catchups with my assigned onboarding buddy and tech mentor, so I always had someone to go to whenever I had any problems.
Overall, it’s been a fruitful 1.5 years for me at Auto Trader and I’m enjoying my time here!
I’ll finish off with a few humble tips which I thought helped me towards my application for the Graduate Data Analyst role:
• Research the company. Tap into its culture and values and relate them to your own. It also helps you decide if the company is right for you.
• Take time to reflect on your own experiences. Look back at what you’ve done and accomplished at university or in your personal life and link the transferable skills that you can bring to the role.
• Get started with a data project. It’s always good to have something to showcase especially with tech roles – it could be something you’re passionate about!
Thanks for reading and don’t hesitate to reach out if you have any questions!
Photo by Davide Ragusa on Unsplash.
In the late nineties and early noughties, every developer was adding a hit counter to their website. It essentially counted the number of page views you had, but it was the embryonic stage of web analytics so we knew no better. Refresh your home page frantically to gain kudos with your mates at work. Hit counters are long gone, and this early unit of measurement for web analytics is now simply called an event.
In 2005, Google Analytics (GA) was launched, and we started to talk about sessions. We lost that lovely large vanity metric we had with hits, which hurt our ego, but early adopters could start to see the emergence of an industry framework in reporting and analysis. The 20 hits you had previously became just one session, but we now knew the location of that user and had rudimentary engagement metrics. Over the next 15 years, analysts evolved the thinking further to visitors, eventually maturing into cross-session behavioural analysis.
And then, more recently, GA4 was launched, a fresh framework driven by privacy concerns and enabling cross-device capabilities. It was at this juncture we adopted Snowplow as our primary web analytics solution, and many other enterprise-level analytics functions have moved to an event-based analytical platform to better serve their wider operational needs.
So, where are we going with this? Well, we can’t talk about the future without referencing the past. At Auto Trader, as we introduce more transactional product offerings, we require a better view of our consumers. We are still a marketplace connecting consumers with retailers/providers, but our role in the process is moving from simply generating leads and enquiries to owning more of the transaction, as our Digital Retailing and Leasing offerings show. We now have more ownership of the sale.
Unlike more traditional lead generation, however, these transactions now span multiple sessions, have the capability to span devices as users log in and out to complete checkout funnels, and have longer consideration periods where traditional identifiers may have expired (think Safari ITP). So thinking about consumers at a session, visitor or even signed-in grain doesn’t give us the full picture.
Here’s a crude example of the problem and explanation of what our identifiers mean:
• session_id - session identifier generated by Snowplow
• visitor_id - visitor identifier generated by the Snowplow web tracker and SDK, distinct per platform
• signed_in_id - identifier generated by logged-in users, persisting across platforms
The above diagram illustrates the challenge as we create more omni-channel experiences. The consumer uses different platforms at different points in the purchase cycle, resulting in different session and visitor identifiers. That means we won’t have a full view of the consumer using a session or visitor lens as they cross devices (Events 1 and 2). Likewise, analysis focusing purely on the user’s signed-in identifier will only result in a partial picture (Events 1 and 3).
While questions within the scope of a session, or at a visitor level, of course remain hugely useful, we need to understand their limitations and think about a more holistic view of our consumers. Analysing on session, visitor, or signed_in_id alone will only ever resolve two out of these three events.
That’s why we have created a view of the consumer: an aggregated 360 view of all identifiers across devices, essentially an identity graph. Whilst respecting consent, we are able to piece identifiers together historically when consumers log in, and can show cross-device behaviour. At this point, the consumer’s profile is updated and rolled up to a consumer, or unified, identifier that can be used across our platform.
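To make the identity-graph idea concrete, here is a minimal, illustrative sketch rather than our production implementation (which runs at scale and respects consent): identifiers are nodes, a sign-in event that observes two identifiers together is an edge, and connected components become the unified consumer identifier. All names below are hypothetical.

parent = {}

def find(identifier):
    # Walk to the root of the component, compressing the path as we go
    parent.setdefault(identifier, identifier)
    while parent[identifier] != identifier:
        parent[identifier] = parent[parent[identifier]]
        identifier = parent[identifier]
    return identifier

def union(a, b):
    # Link two identifiers into the same component
    parent[find(a)] = find(b)

# Each pair is a co-occurrence observed at sign-in, e.g. (visitor_id, signed_in_id)
observed_links = [('visitor_app', 'signed_in_1'), ('visitor_web', 'signed_in_1')]
for a, b in observed_links:
    union(a, b)

# Both visitor identifiers now resolve to the same unified consumer
assert find('visitor_app') == find('visitor_web')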
We see this view of a consumer being particularly useful in the following analytical use cases:
This is complex, and we are on a journey with it. We have talked before about how we created our CDP and how it will unlock personalisation at scale at AT, but this consumer view is an analytical representation of that technology, enabling analysts to use this centralised view to build more advanced models.
Photo by Pascal Meier on Unsplash.
At Auto Trader, we define landing pages as the first page a consumer sees when they enter our website. As you’d probably expect, the Auto Trader homepage is one of our most common landing pages, as that is usually where a consumer would start when beginning their vehicle-buying journey. In this blog post, I’ll be talking about how we’ve developed our understanding of landing pages through the production of our new Landing Page Performance model, and why it is the next step towards self-serve data at Auto Trader.
Landing page performance data at Auto Trader is typically split by the route a consumer takes to get to the website - often referred to as traffic source grouping or channel grouping - and can include channels such as organic search (finding us organically through a search engine) or email (by clicking email links). It is helpful to know where consumers are landing from as it informs marketing on the success of their campaigns which, in turn, influences the resources put into them.
Whilst many tables with landing page performance data were available in our BI tool Looker, they all used our page_context_name field as the page identifier, something which usually requires technical knowledge of our Snowplow client-side tracking to extract. The page_context_name field can also encompass multiple pages, so may not always show data for a single page on our site. We wanted to create a model which bypassed these blockers and instead used page_url, a unique identifier which anyone in the business can easily find for themselves in their browser and use to get the data for the page they’re looking at. As a consequence, it would reduce the need for analysts to extract this data on an ad-hoc basis, and instead create a self-serve model for other business areas to use. We previously had this functionality available in Google Analytics (GA) but, due to Auto Trader’s migration towards using Snowplow data, the model was no longer available, so we also wanted to replace what we had lost from GA.
Our aim was to build a model that allowed the user to filter by page_url and view key performance metrics split by traffic source grouping, as shown in the visualisation below.
To achieve this, the landing page performance model uses three key client-side tables, developed by the User Analytics team here at Auto Trader:
• consumer_views, looking at consumer behaviour on a page view level
• consumer_sessions, looking at consumer behaviour on a session level
• consumer_sessions_with_attribution, an extension of consumer_sessions with three different attribution models (at Auto Trader, we commonly use the last click non-direct attribution model)
As shown in the diagram below, we combined the fields from all three of these tables into one join: consumer_sessions_with_attribution gave us traffic source grouping; consumer_views added page URL; and consumer_sessions rounded everything off with the metrics we tend to care about most in relation to a landing page, such as sessions, consumers, search views, advert views and lead interactions.
By joining all of these fields into one table, we had everything we needed to build upon our initial vision, and hence productionise it in Looker for other business areas to use.
One main driver for productionising this idea in dbt was the huge query size (and cost) that came with joining three large tables in BigQuery, especially in instances where users of this data need a wider date range. The origin of this model was a single query, with a query size of approximately 30GB for only 2 days’ worth of data. By contrast, implementing this model in dbt shrank the 2-day query size down to around 2.5GB, so the cost-saving benefits alone made it worthwhile to continue with this project.
The nature of dbt modelling also meant the original query could be split out into different subqueries, which personally aided in my understanding of what the model was trying to do. For example, we brought in our Input Data Ports (shown in the diagram above) as the base layer, defined the fields we wanted from each table in the staging layer and applied the majority of the logic in the Output Data Port (ODP) layer. This allowed for easier readability and meant that we could verify the numbers at each stage of the model – if something didn’t look correct, we’d know where the issue occurred. In a broader sense, by breaking down the code into smaller subqueries we ensured that the logic of the code is more accessible to any teams that will be using the ODP.
In spite of this, the complexity of the model itself meant we spent a lot of time debugging our data, as the numbers shown did not match what we were expecting. We used the output from consumer_sessions as a comparison, focusing on the total sessions and consumers associated with each traffic source grouping. Realistically these numbers should have been very close to our model’s output, but we were seeing large discrepancies.
During the conceptualisation process, the plan was to make this model highly aggregated in dbt, meaning we would sum the counts of every metric and only provide a total value. After investigating the problem by breaking down the query, we realised the level of aggregation was skewing the results. We had aggregated too early in the query and, as a result, some data was being missed rather than counted.
Our way around this was to make the model less aggregated in dbt, with each row of the output showing the activity for a single session. Since this model was to be exposed in our BI tool, Looker, it made sense to have the last step of aggregation in the tech stack. There, we could count the totals of each column and create the output we had envisioned at the start of the project.
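As a rough illustration of the final shape (a sketch with hypothetical column names, not our actual schema), the dbt output keeps one row per session, and the BI layer performs the last aggregation step:

import pandas as pd

# Hypothetical session-level output of the dbt model: one row per session
sessions = pd.DataFrame({
    'session_id': ['s1', 's2', 's3'],
    'traffic_source_grouping': ['organic search', 'email', 'organic search'],
    'search_views': [4, 1, 2],
    'advert_views': [3, 0, 5],
})

# The final roll-up, which now happens in Looker rather than in dbt
totals = sessions.groupby('traffic_source_grouping')[['search_views', 'advert_views']].sum()
print(totals)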
Whilst this model took a while to develop and had its pain points, we have now successfully shifted what was previously a manual query requiring analyst input into a self-serve model that can be used by anyone in the business. The end product of a Looker dashboard with filtering to show relevant data has created the opportunity for other areas of the business, particularly those with less technical knowledge of our front-end tracking implementation, to obtain the data they need quickly and autonomously. As such, through greater visibility and knowledge of how our landing pages are performing, business functions such as marketing campaigns can be understood better and changes can be made more efficiently.
Photo by Tico Mendoza
Back in April, Auto Trader gave a few of us the opportunity to attend Data Council 2023 in Austin, Texas, USA. Data Council is an independently curated conference that covers many aspects of working in a modern data-focused role, from infrastructure and data engineering to analytics tools, data science, machine learning and AI.
We were really excited about this opportunity because we all used to be part of our Data Engineering team at Auto Trader. The team has since dispersed throughout our Platform Engineering tribe, so we now have people with specialist data engineering skills embedded within data product teams. We had a great time, not just at the conference but in Austin itself. Now that the dates for Data Council 2024 have been announced, we thought we’d write a Q&A-style blog post to share our experiences.
Austin is the state capital of Texas and is located southeast of its centre, about an hour’s drive from San Antonio and three hours from Houston (which in Texan terms is pretty close!). In recent years, it has become a bit of a tech hub, providing an alternative to Silicon Valley, and is now host to a number of startups as well as established companies.
I’ve been at Auto Trader since 2017, working in various different teams, the most unusual being the plant and farm machinery section of the website. I currently build internal tooling that helps to empower our support staff.
I found Lloyd Tabb’s introduction to Malloy very interesting. Lloyd founded Looker and has been building data exploration tools throughout his career. His latest offering is Malloy, billed as an experimental language for data. Interestingly, Malloy separates the concern of how pieces of data relate to each other away from the queries and calculations you want to make. You define a source which describes the network of data and all its relationships, then write queries on top of that source. Malloy then handles all of the joins and deduplication required to answer your query. The result is simple, readable, reusable and powerful queries, uncluttered by messy joins and boilerplate code. You can watch the talk here and follow along with the interactive demo in your browser.
I spent one of my free days indulging my hobby of woodworking. Some of this was in specialist woodworking shops, torturing myself looking at the tools that aren’t available in the UK. The most inspiring place I went to was the Austin Antique Mall, a massive warehouse where loads of different vendors have stalls selling all kinds of interesting old artefacts. You can find books, maps, furniture, art and knick-knacks galore. My favourite item was a big old workbench complete with vintage wooden vice screws; it’s just a shame I couldn’t fit it in my carry-on luggage. The Antique Mall is quite a ways away from central Austin, but I never had to wait more than a few minutes for an Uber, no matter where I was in the city. If you have an afternoon to fill, it’s definitely worth checking out.
There’s a lot to like about Austin: plenty of green spaces, cool bars, and glorious weather. What I liked the most is that there is amazing food everywhere you look. Some highlights were tacos at Torchy’s (my favourite was the green chile pork) and BBQ at Terry Black’s (the sausage was incredible). The surprise hit for me was a burger from P. Terry’s, a local Austin chain. I was expecting a McDonald’s style fast and no-fuss burger, but what I got was the kind of flavourful and fulfilling burger you’d have to pay at least twice as much to find in the UK. It was the last thing I ate before leaving the city, so my visit definitely ended on a high.
I dabbled with code as a teenager and wrote some pretty awful Perl scripts during my PhD in Molecular Biology. I didn’t think I could become a software developer due to my lack of formal training. I reassessed this after a few years of teaching and managed to get a place on Auto Trader’s 2019 graduate scheme. I currently work on user tracking to enable personalisation of user journeys.
I enjoyed using Data Council as an opportunity to get up-to-date with the latest trends and broader landscape of Data Engineering. After meeting and listening to so many data practitioners, I felt inspired, energised and full of confidence that we are often at the cutting edge at Auto Trader. One of my favourite talks was the keynote by DJ Patil, former US Chief Data Scientist, titled: The things I wish I knew – What I’ve gotten right and wrong from startups to the White House, and the world ahead.
DJ talked about:
DJ really highlighted the power of data to improve the lives of everyone. I particularly liked the notion that “a technology isn’t radical nor revolutionary unless it benefits everyone”.
Exploring another city’s queer culture on my own was a new, exciting, and frankly nerve-wracking experience for me. On top of my usual social worries, the attacks on LGBT+ people and their rights across the US made me worry that I wouldn’t be safe. Austin looked after me though, and I enjoyed an evening at a local gay bar, Cheer Up Charlies. My favourite drag performer of the night, Brigitte Bandit, is very politically active and told me about a protest against yet another anti-trans bill taking place the following day at the Texas Capitol.
I went along to the protest and experienced the most heartwarming display of solidarity I’ve ever witnessed first-hand. It’s a bittersweet memory because the bill was passed, and the onslaught against queer people continues. Overall, I found Austin as welcoming and supportive as I could have hoped, and I will fondly remember the people I met (two of them also called Austin).
I found it striking how much green space there is in Austin. It supports a thriving outdoorsy culture (and the sun helps too!). The areas around the river support lots of activities including running, cycling, hiking, swimming, kayaking and more. I had a great time doing a solo hike along Barton Creek Greenbelt, but I was quite surprised to find the creek was bone dry!
Don’t get me wrong, there are some great places in and around Manchester for all of these activities, but it all feels more spread out and harder for me to get to. Austin just feels much better built for it.
I first programmed in the late 1980s on an Acorn Electron. After a long detour via a Biological Sciences degree and a job in a Microbiology lab, I rediscovered coding and found that software development was something I could actually get paid for! I joined Auto Trader in 2014 and have spent much of the time since working on putting Machine Learning models into production. I now lead the tech team that provides our valuations and other metrics about the used vehicle marketplace.
There were many great speakers at Data Council, covering a range of interesting topics. Another great feature was that all speakers had ‘office hours’ in a separate room where you could chat with them about topics slightly too specific or in-depth to cover during questions at the end of the talk. This led me to a couple of particularly good conversations.
One of these was following a talk on testing Machine Learning, which reinforced my feeling that testing ML using traditional software engineering paradigms can be a somewhat fruitless task. It’s important to step back and re-focus on what you hope to achieve by testing. The conclusions I came away with were:
The most important thing I learned from the conference is that Auto Trader is doing some pretty advanced stuff regarding our data stack. As great as the line-up of speakers was, there are some areas where we’re pushing the envelope and have valuable experiences to share with the broader data community.
Austin is home to the largest urban bat colony in North America. An estimated 1.5 million Mexican free-tail bats make their home in the Congress Avenue bridge from roughly March to October every year. For me, the best thing we did outside the conference was to go down to the shore of Ladybird Lake as dusk approached to see the bats take flight for their nightly hunt for food. We watched as they left the bridge, first as a few scattered individuals, and then eventually as a constant stream. From a distance, they looked like smoke drifting across the sky!
I also have to give an honourable mention to the Austin Nature & Science Center at the western end of Zilker Metropolitan Park. It was free to enter and was home to several native bird species, many rescued after road or other accidents.
One thing I really liked about Austin was the culture around personal transportation. Getting around the city on foot is possible and even enjoyable. There are pavements (sidewalks), footpaths and cycle lanes. If you need to travel further there are also buses, contactless bike and e-scooter rentals.
Before the endorphin high has even truly subsided, a notification pops up. “Build Failed”. Panic sets in, ego in tatters, and we discover that there was in fact another small error that needs our attention. We’ll take a look and “hey, it’s just another tiny tweak…”
20 GOTO 10
Something akin to this happened to me recently whilst pairing and ended up proving very costly in terms of how long our piece of work took. On reflection, it was such a blunder that we ended up coining the term “Faith Driven Development” to describe the phenomenon (credit to my colleagues Richard Wilmer & Hannah Brown). It’s been stuck in my mind ever since and has been a useful lens to examine how efficiently we can write and maintain software.
Many of you will be familiar with Test Driven Development (TDD) – the process of writing tests first and then creating the production code to make them pass. This is a well-established technique to help create clean code with good test coverage. You may also have heard of Behaviour Driven Development (BDD), which is a higher-level process and helps when adding large new features. There are several other _DDs out there, so what’s the harm in one more? Unlike most of the existing _DDs, however, this one is something to avoid whenever possible.
We’ve defined FDD as “A development process in which modifications to the code are tested by pushing them straight in and seeing what happens.” We are relying on little more than a developer’s faith in their own ability.
Reading that definition, you may have recoiled in horror at the suggestion you would do such a thing. Be honest with yourself though. Look at the example at the start of this post and truly consider if you ever exhibit similar behaviours. (If you can honestly say you haven’t, please check out our careers site here.)
The situation at the start of this post isn’t the only time you can end up taking this brute-force approach to development. There are plenty of corners this ghoul is hiding around, just waiting to ruin your day. As mentioned earlier, I started thinking about this following a story I’d been working on (a story is a constrained task or piece of work as part of an agile workflow; see more here), and I’d like to outline what happened so you can see how easy it is to fall into this anti-pattern.
We were working on a fairly cutting-edge bit of software, expanding the capabilities of the Auto Trader Data Mesh. As with many things in the data engineering realm, it was pretty hard to test locally. We needed some fairly large datasets to run the code against, so when we ran into issues, we found ourselves releasing our changes and seeing what happened in the data store. Each time we did, it took around 30 minutes to see the full outcome. The first couple of times it wasn’t too bad, but as we pulled on the threads and found more and more issues, the slow feedback loop was dampening our spirits and hurting our productivity.
So what is the true cost of falling into this trap? With hindsight it’s usually easy to see – time. Having to make multiple pushes to the same repository and waiting for builds in a CI pipeline is almost always slower than testing that changes actually solve the problem before releasing them. It’s also more mentally taxing; having to check back in periodically to see how you’ve impacted the app prevents you from concentrating on your next piece of work. Context switching ruins productivity.
The cruel and ironic part of FDD is that when you fall into it, it’s usually because you were trying to be quicker. Trying to get that bug fix released as soon as possible led to it staying live longer. A better philosopher than me could probably get a whole book out of that.
Slow down. Breathe. Think. Write a test to capture the thing you’re trying to fix. It’s usually the little things we all know we should do all the time but are easily pushed aside when there’s a little bit of pressure.
In the more subtle examples, such as the story I’d been working on, it’s worth considering up front how you’re going to get feedback on your work. If it requires a release, try to find some way to make the loop shorter. On our data mesh story, we eventually broke the cycle by starting to work in a Databricks notebook (a tool that lets you run code a chunk at a time) to confirm we were happy with our code before copying it back into our production codebase and doing a full release.
It can be useful to bring this sort of thing up in story kick-offs when discussing requirements. It’s important to not only know what the requirements are, but also how you’ll know you’ve met them. If that loop is long or expensive, try and come up with ways to avoid falling into the pit of FDD.
A little bit of Faith Driven Development is unavoidable, but it’s vital that we recognise when it’s happening and to what extent. Letting it continue unabated can make stories drag on for significantly longer than they should, crushing the morale of those working on them. Keep an eye out for a string of failing builds in quick succession, or for a story that continually “should be done today or tomorrow” over the course of a week. If you spot yourself, or a teammate, caught in the snare, then make sure to take a step back, think about what’s being done, and ask if there’s a better way to go about it.
Photo by Nik Shuliahin on Unsplash.
You’ve done the hard work in researching, developing and finally deploying your shiny new Machine Learning (ML) model, but the work is not over yet. In fact, it has only just started! This is the second part of our series on our Package Uplift model, which predicts how well our customers’ stock will perform on each of our advertising packages. We will discuss how we monitor and continuously develop our machine learning models after they have been designed and deployed, so they reflect the current market. See part one of this series for how we created Package Uplift.
As with all of our machine learning models at Auto Trader, it is important we monitor the performance of the model to ensure that the machine learning technique and application are appropriate and accurate as our data grows and the market changes. In order to do this we use an external tool called MLflow. MLflow enables the tracking of parameters, metrics and model artefacts related to machine learning models. It allows us to easily compare different versions of a model, something that is particularly useful during the development phase in a model’s lifecycle. At Auto Trader, we are constantly evolving our model suite, and by using MLflow we have the ability to deliver iterations efficiently and accurately.
Databricks offers an MLflow tracking component, which allows us to track model metrics and parameters such as the vehicle channel and package level with the following lines of code:
mlflow.log_param('channel', 'CARS')
mlflow.log_param('package_level', '3')
The tracked metrics are then stored in Databricks within an MLflow experiment, tracking every logged run of the model, as seen below.
For the Package Uplift model, we use Databricks to track metrics such as root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE), along with data volumes and run dates. Once we have metrics stored within MLflow we can compare runs of the experiment against one another. Databricks MLflow tracking offers a built-in plotting tool to compare runs of a single MLflow experiment. Alternatively, the following API command will load all metadata for the specified experiment as a pandas DataFrame.
mlflow.search_runs(experiment_ids='999')
This allows us to produce custom plots and summaries of the model’s performance. For more information on using MLflow within Databricks see documentation here. In the case of Package Uplift monitoring, we can look at daily metrics and plot the trends. This allows us to easily detect if there is a deviation from the normal distribution of our metrics and investigate appropriately.
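As a rough sketch of what that looks like in practice (the experiment ID and metric name below are illustrative, not our real ones), the DataFrame returned by mlflow.search_runs exposes each metric as a metrics.<name> column, which makes trend plots straightforward:

import mlflow

# Load every logged run of the experiment as a pandas DataFrame
runs = mlflow.search_runs(experiment_ids=['999'])

# Plot the daily RMSE trend to spot deviations from the usual distribution
runs.sort_values('start_time').plot(x='start_time', y='metrics.rmse', title='Daily RMSE')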
Since the Package Uplift model went live, we have been working on multiple iterations. During each testing phase of the iterations, we have utilised MLflow within Databricks to make informed decisions about whether the change is appropriate.
The Package Uplift model is written in Python using PySpark and is published to a private Python package repository. The pipeline code can then be installed into Databricks notebooks for development using the command %pip install --index https://pypi.example.com artifact-name. The model is designed as an ETL pipeline, meaning that data processing and model training are split into three main steps: Extract, Transform and Load. Having this structure in our models means that it is very easy to make changes to the models without having to edit lots of code. For example, if we wanted to test a change where we remove a feature from the model, we can amend the single transform function that defines the feature list and selects the columns from our source data. This iteration of the pipeline can then be executed in a notebook and tracked with MLflow.
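For illustration, a transform step of that kind might look something like the following PySpark sketch (function and column names are hypothetical, not our actual pipeline code):

from pyspark.sql import DataFrame

# Hypothetical feature list; removing a feature is a one-line change here
FEATURE_COLUMNS = ['channel', 'package_level', 'advert_attractiveness', 'price_position']

def select_model_features(df: DataFrame) -> DataFrame:
    # The single transform step that defines the model's input columns
    return df.select('stock_id', 'target', *FEATURE_COLUMNS)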
Having both efficiently packaged model pipelines and the ability to track our model’s performance with MLflow allows us to easily test a change and validate performance compared to the current production version. This speeds up our end-to-end process of prototyping changes to pushing to production.
Currently, the Package Uplift model is trained on stock advertised on Auto Trader’s car platform. This means only our customers who are selling cars have visibility of the package simulator tool, shown in our previous blog post. We have recently been prototyping a new version of the model that will include van stock in the training data and therefore be able to produce predictions for both our car and van advertised stock. We can then provide insight to more customers advertising on our platform.
During this project, we had to think carefully about how we combine cars and vans into a single model. The van market is predominantly commercial, and our pricing for this vehicle type is exclusive of VAT. When modelling, we avoid introducing any bias by ensuring we exclude VAT in our pricing feature when training a model with both cars and vans in the same training set.
The Package Uplift model contains features derived from other machine learning models built at Auto Trader, including our interchangeable derivatives model. This predictive set of features gives us a way of measuring how similar stock items are to one another at the derivative level, based on vehicles being viewed in the same session by our users. This model, too, was originally trained on cars only, so we extended this to cover vans to enable us to extend Package Uplift.
We initially tested building two independent models, since cars and vans are different vehicle types. However, it is beneficial to have the ability to compare cars to vans, particularly when considering crossover stock. A crossover vehicle is defined as a vehicle that can be advertised on both our car and van channels, typically a small van, for example the Volkswagen Caddy, or a pickup truck. Given that users can and do look at both vehicle types when browsing our website, we can identify which combinations of cars and vans are most viewed together. If we had two independent models, we would lose this ability to compare across different vehicle types. For this reason, we validated that combining cars and vans into a single training set resulted in only a minimal loss of accuracy versus having two independent models. We also validated that the model produces sensible pairs of similar vehicles, with the majority of cars and vans remaining clearly distinct sets of objects. Some examples of similar predicted vehicles are shown below.
Photos left to right by Igor Lypnytskyi, Dardan Motors, Stephen Leonardi and Bradley Dunn, all sourced from Unsplash.
As with the interchangeable derivatives model, we also tested if the Package Uplift model would achieve higher accuracy as two separate models: for car stock and van stock. Given that there was no significant difference in our error metrics in MLflow, we remained consistent with a single Package Uplift model trained on both vehicle types.
Additionally, from a product view, Auto Trader offers one selling package per customer. If a customer has both car and van stock, they would not be able to advertise their cars on Ultra and vans on Standard. Therefore there is no need to treat vehicle types as being strictly independent.
To include vans we also had to change the features we feed into Package Uplift. Another feature of the Package Uplift model is Advert Attractiveness, which is calculated by channel. Typically, for most vehicles advertised on Auto Trader, the stock item will have a single Advert Attractiveness score for the channel it appears on. However, for crossover stock where the vehicle can appear on both the car and van channel, there are two corresponding Advert Attractiveness scores. To work with two attractiveness scores for crossover stock, we produce separate predictions for both car and van channels, which we then add together and combine into a single stock item.
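As a simplified sketch of that last step (with a hypothetical schema of one prediction row per stock item and channel, so crossover stock has a car-channel row and a van-channel row), the channel-level predictions might be combined like this:

from pyspark.sql import functions as F

# Sum the car- and van-channel predictions into one row per stock item
combined_predictions = (
    channel_predictions
    .groupBy('stock_id', 'package_level')
    .agg(F.sum('predicted_advert_views').alias('predicted_advert_views'))
)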
At Auto Trader, our selling packages directly influence a customer’s performance through Promoted positions within our search results. The figure above shows an example of the Promoted position, identified with Ad above the listing. The other listing position is referred to as a Natural listing position, which is a standard advertising position on our website. By upgrading to a higher package level, customers can get a performance boost through the Promoted position. The original aim of the Package Uplift model was to calculate the additional uplift gained by changing package level, and it therefore considered the uplift of the Promoted position. In order to calculate this uplift, the training data includes both Natural and Promoted position events.
In the case where we see some fluctuation in the model uplifts, being able to split out Natural and Promoted positions is helpful for diagnosing the cause of these fluctuations. By having a view of the natural variability in Natural listings across the package levels, we can isolate the effect of true package performance within the Promoted and Natural version of the Package Uplift model. This allows us to be more transparent in the model monitoring, so we can quickly adapt to change if required. Hence, we have developed a Natural Only Search Advantage model, as an extension to the Package Uplift model.
The modelling choice is the same as the original Package Uplift model, where we filter the training data to events that come only from the Natural search positions. Similarly to adding van stock, this change has been easy to make since the pipeline is written so that it only requires a single function to be edited.
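Because the pipeline isolates this logic in one place, the Natural Only variant amounts to little more than a row filter. A minimal sketch, assuming a hypothetical listing_position column:

from pyspark.sql import DataFrame, functions as F

def filter_to_natural_positions(events: DataFrame) -> DataFrame:
    # Keep only events generated from Natural (non-Promoted) listing positions
    return events.filter(F.col('listing_position') == 'NATURAL')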
In this post, we have covered how we track our Package Uplift model’s performance metrics using the MLflow capability within Databricks, along with improvements made since the model first went live such as including vans and a Natural only variant. The careful considerations around the Package Uplift model have led it to be the trusted and accurate model it is today, giving us valuable insight into how our package products work in search, and more importantly, providing direct insight to our customers. This transparency is something we value and strive to improve upon.
Our next step with this model is to fully embed the predictions for van stock into our customer-facing Package Simulator tool, where our customers will then be able to access their stock performance on both their car and van vehicles.
There are more improvements in the works already, such as incorporating stock items that are available for click-and-collect or home delivery instead of only being available at a physical dealership. We will also be looking at further streamlining the process of productionising changes we make to the models.
Photo by Stephen Dawson on Unsplash.
Advertising packages are the core product at Auto Trader. Depending on the package tier our customers purchase, they get to appear in our promoted slots or get an advantage in our search rankings. As a business, we need to know how well our products are performing for our customers, as underperformance could lead to unhappy customers cancelling their contracts with us. Knowing the impact of a package, not just overall but on a per-customer basis, allows us to make sure our offerings are fair and to react quickly to any changes we observe. In order to isolate the effect of package level and to accurately predict a customer’s performance on each product, we had to create our most complex production model to date, Package Uplift. In this blog post, we’ll cover how Package Uplift works and how it builds on our ecosystem of Machine Learning models.
At Auto Trader, our customers can purchase one of several advertising packages. The primary difference between the different levels of advertising package is the degree of prominence given to the customer’s adverts, leading to a greater share of advert views, and faster sales. Over time our approach to delivering these additional advert views has changed, evolving from a simple tiered system to our current approach.
The above chart illustrates the two primary drivers of additional response (search appearances and advert views) with our packages. The first is Promoted Positions. The Promoted Position is towards the bottom of the search results page and customers with an eligible package get a promoted budget each day. The second mechanism is Search Advantage. Search Advantage helps to boost an advert’s ranking in search results, with Search Advantage + boosting adverts on the Ultra package even higher.
With this setup it is critical to understand the impact each package has on all the different types of stock advertised on Auto Trader so we can ensure the impact of Search Advantage is fair. It is also useful for showing customers what they would personally gain from upgrading their advertising package.
The main difficulty with understanding the impact of package level on advert performance is that a lot of other factors correlate with package level, such as the age and price of advertised vehicles and the type of retailer advertising them.
Above we have a plot showing the average year of registration (essentially the age of the vehicle). We can see that independent retailers have a very different average vehicle age to franchise retailers, and that vehicles sold by independent retailers tend to be newer on the premium packages. If we did not account for this in our modelling, we might mistake the impact of newer stock for the impact of the advertising packages!
As is good practice for data science projects, we started with a simple linear model to get a baseline for comparison. In that model, we predicted the number of search appearances adverts get each day. A search appearance is when an advert is shown in a set of search results. Note that the advert must be present on the page of search results, not just eligible to appear.
From the initial results it was immediately apparent that there were problems with the model. As expected, the correlation between features in the model was causing the predicted impact of package level to be inaccurate. To put it plainly, the simple linear model was not able to represent the complex set of relationships between the input features.
While this approach did not yield a good result, it gave us a baseline and justified a more complex approach. Sometimes we find that a simple model is nearly as performant as the best we can come up with, in which case it can be worth using the simple model due to its interpretability and simplicity. Alas, that is not the case here.
Given that we needed a more advanced model for our problem, our attention turned to LightGBM. We shan’t go into the details of how LightGBM works in this post, but it is important to cover some of the key features for our problem.
First, LightGBM is a tree-based ensemble model, creating many small decision trees that focus on aspects where the model is weakest. As a simplified example, if the first decision tree created performed poorly at predicting the response of Audis, the second tree would specialise in Audis. Then if Electric vehicles had the highest prediction error the third tree would focus on those, and so on.
As a LightGBM model is non-linear it should be able to capture the complexity of our situation, where there can be complete deal-breakers for an advert that will drastically limit its performance.
The second key feature is LightGBM’s implementation of monotonic constraints. Monotonic constraints allow us to enforce that a given feature increasing in value will cause the prediction to also increase (or stay the same). It can also be enforced the opposite way, with an increase in the constrained feature guaranteed to lower the prediction (or keep it the same). A simple example would be to constrain the predicted price of a vehicle to never increase with increasing mileage. This functionality within LightGBM is very useful when modelling a system with known rules. In this case, we know exactly what the package mechanisms are, and therefore know that it is virtually impossible for an increase in package level to cause a decrease in response. By enforcing monotonicity for package level we can remove the risk of the model falsely predicting that upgrading the package level will decrease response.
At this point, we must address the big risk with using monotonic constraints. What if our assumptions are wrong, and upgrading an advertising package would cause a decrease in response? Shouldn’t the model pick up the correct behaviour for packages without enforced monotonicity? Are we not just making the model predict what we want it to predict? It is certainly true that monotonicity should not be enforced without a very good reason; in our case, it is because we know how package level can impact the score by which we rank adverts. It is also important that we compare model performance with and without monotonicity. When doing this we found that only a small minority of predictions violated our assumption. Crucially, the accuracy of the predictions was nearly identical.
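In LightGBM’s Python API, this is exposed through the monotone_constraints parameter, which takes one entry per feature column. A minimal sketch (the feature names and training data are illustrative, not our real ones):

import lightgbm as lgb

# The constraint list is positional, matching the feature column order:
# 1 = prediction must be non-decreasing in that feature, 0 = unconstrained
feature_columns = ['price_position', 'advert_attractiveness', 'package_level']

model = lgb.LGBMRegressor(
    objective='regression',
    monotone_constraints=[0, 0, 1],  # constrain package_level only
)
model.fit(X_train[feature_columns], y_train)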
Having chosen LightGBM as our machine learning model, we can now discuss our input features and the format of the predictions.
Package Uplift started out as two separate models. One predicted the number of search appearances and the other full-page advert views. A full-page advert view is generated when a user clicks on one of the search results, taking them to the advert’s main page. We take the logarithm of both, as adverts can receive anywhere between a handful of events per day to many thousands and we are mainly interested in proportional accuracy (i.e. within x% of the true value), rather than absolute accuracy. If we worked in absolute terms the model would solely prioritise the few adverts with huge amounts of response, to the detriment of all else.
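Concretely, the target transform looks something like the sketch below (using log1p/expm1 to handle adverts with zero events is an assumption on our part; variable names are illustrative):

import numpy as np

# Train on log-transformed counts so errors are proportional, not absolute
model.fit(X_train, np.log1p(y_train))

# Invert the transform when generating predictions
predicted_views = np.expm1(model.predict(X_new))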
Both models have the same input features, only differing in the target variable (search or advert views). We represent the type of vehicle being advertised with our interchangeable stock vectors (which we have a post about here) and the overall quality of the adverts with our advert attractiveness score (see our most recent post about advert attractiveness here). Naturally, we also include features around pricing using our in-house valuations and details of the advertised location of the vehicle. Finally, we include the stock’s current package level.
Each day we train the model using fresh data and generate predictions for all eligible advertised cars on Auto Trader. For each advert we predict the number of advert views and search appearances it would obtain on each of our five packages. This matrix of predictions is what we use to monitor package performance internally.
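The daily scoring step can be pictured as re-scoring every advert once per package level, varying only the package_level feature. A simplified pandas sketch (all names are illustrative):

import numpy as np
import pandas as pd

# Build the advert x package matrix of predictions
rows = []
for level in range(1, 6):  # our five package levels
    features = adverts_df[feature_columns].copy()
    features['package_level'] = level
    rows.append(pd.DataFrame({
        'stock_id': adverts_df['stock_id'].values,
        'package_level': level,
        'predicted_advert_views': np.expm1(model.predict(features)),
    }))

prediction_matrix = pd.concat(rows, ignore_index=True)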
The final step is to generate the data used in our Package Simulator tool for customers. For this, we convert the predicted response into a Performance Rating score. Performance Rating is a metric that is familiar to customers at Auto Trader, and its main purpose is to provide a single metric that puts a customer’s performance into context. For each customer, we compare the response their adverts are getting to what we would expect from other comparable adverts on the site. Performance Rating informs the customer if their stock is under- or over-performing compared to other comparable stock on Auto Trader.
Since Performance Rating is a percentile-based score, we must compute the advert’s new performance rating as if it had been on a different advertising package. Care must be taken so that the predicted performance rating for the advert’s current package is the same as the current actual performance rating. To ensure this we measure the change in performance to be relative to the current advertising package. Once we have the predicted performance ratings for each advert a customer has, we aggregate all of them together in an interactive tool, which shows a customer the predicted spread of performance ratings they would have on each package.
For Auto Trader, it is important to be transparent with our customers so they can choose the right package level for them. The performance of our advertising packages is available to our customers through Portal. This is where we show customers how their own individual performance can be expected to change if they were to upgrade or downgrade their package level. Due to the complexity of our machine learning models, we present our customers with the performance ratings described above. For interpretability we bucket the performance rating score (scaled 0-100) into four buckets, namely: Low, Below Average, Above Average and Excellent, based on the thresholds of 0-25, 25-50, 50-75 and 75-100 respectively. We then present the interactive graphs above for each package level, showing the proportion of a customer’s stock in each of the categories. As their package level increases, a customer can expect their stock to shift favouring the higher performance rating bands, and we know that a higher Performance Rating score correlates with faster sales, therefore increasing revenue for customers.
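The bucketing itself is straightforward; a minimal sketch of those bands using pandas (the DataFrame and column names are illustrative):

import pandas as pd

# Bucket the 0-100 performance rating into the four customer-facing bands
ratings['band'] = pd.cut(
    ratings['performance_rating'],
    bins=[0, 25, 50, 75, 100],
    labels=['Low', 'Below Average', 'Above Average', 'Excellent'],
    include_lowest=True,
)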
In this post, we have described how our Package Uplift model works and how we report those predictions to our customers. Package Uplift is only possible due to our other models already in production, and represents an exciting step for us where we now have a hierarchy of machine learning models that build on one another.
Great care has been taken with Package Uplift to ensure the predictions make sense with respect to our package mechanisms. After a lot of testing we have confidence in the model to use it internally to continually monitor the performance of our packages, ensuring they are performing as they should be. We also make our predictions available to customers to aid them in making decisions around their advertising package level.
If you are interested in finding out more about our Package Uplift model we are writing a second companion post to this. In it, we will cover how we have iterated on and improved this model since it has gone live, as well as how we monitor its performance, so check back soon!
Mono and Flux are part of the Spring WebFlux library. They are Spring’s implementation of reactive streams.
If you are familiar with the standard Streams API added in Java 8, then you can think of Reactive Streams as Java Streams with the added element of time. Normally, data in a Stream is available all at once, but with Reactive Streams you are processing the data as it is produced by a source. For example, the source could be a database query: as the records are returned from the database, they are sent down your reactive stream.
If you’ve never heard of Streams before, you can think of them as a conveyor belt on an assembly line. You can alter items as they come down the belt. A caveat of this is that items may or may not ever be sent down the conveyor belt. An important thing to note is that code involving reactive streams is not executed unless something subscribes to that stream.
A Flux is a Reactive Stream that can contain any number of elements of the same type; anywhere from zero to infinitely many values may pass through it.
A Mono is a Reactive Stream that contains at most one element; either 0 or 1 value will pass through it.
Now that we are comfortable with what reactive streams are, it begs the question, why should I use them? Why did I bother to write this post?
The key reason boils down to that element of time I mentioned earlier. In a traditional model, processes are run on a thread. When that process performs a blocking operation, the thread has to wait for it to complete. This may only take a few tens of milliseconds but, if you take into consideration how fast modern computers are, this is a big missed opportunity to process data. To use a rather contrived example, the CPU in an iPhone 13 has a max clock speed of 3.23GHz, meaning the iPhone can make up to 3,230,000,000 decisions a second (or 3,230,000 decisions per millisecond). Instead, it’s waiting around for your operation to complete. Think of all that time wasted!
Now, a computer typically solves this by spinning up a new thread and passing the next request to that new thread. This allows operations to keep being processed whilst we wait for the original thread to free up again. It does, however, come with a cost, as creating/destroying threads is very expensive in terms of time and processing power. This would cause the system to slow down as we spin up more threads to handle the incoming requests. A caveat here is that a system can only create a certain number of threads; once this limit has been reached, it will be stuck waiting for threads to come free.
Reactive programming works differently. In this model, whenever a task requires a blocking operation to be run, an event will be fired down the stream when the task has returned a result. Meanwhile, the thread pool can continue handling events on other streams. Then by the time our blocking operation returns a result, our system has been keeping busy handling other events in other streams. By handling operations like this, a system can maintain an incredibly high throughput.
Our scenario seemed a perfect fit for use of the reactive model. We wanted to create an API gateway that would be handling roughly 300 requests per second and all of that involved sending another request downstream and waiting for a response. Making use of a non-blocking reactive architecture means that we can handle this load with a small number of threads (and in turn fewer servers).
A simple example of how we can use reactive streams in our application would be an endpoint to return some constants. Below I have included an example controller which returns an array of constants as a Flux. The only difference between this endpoint and a regular Spring endpoint is that the return type is wrapped in a Flux. Spring will automatically unwrap this into an array of strings.
package com.example.webfluxdemo.controller;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;
@RestController
@RequestMapping("/api/constants")
public class ConstantsController {
@GetMapping(produces = MediaType.APPLICATION_JSON_VALUE)
public Flux<String> getValues() {
return Flux.just("Value-1", "Value-2", "Value-3");
}
}
One way in which you could use Mono is to fetch a single value from a database and return it to a client. This could be done by using reactive database drivers and returning a Mono datatype in your controller method. To do this in a reactive/non-blocking way, your value must be wrapped in a Mono/Flux through the whole stack, from your database call to your controller method.
@GetMapping(path = "user", produces = APPLICATION_JSON_VALUE)
public Mono<UserRepresentation> getUser() {
return userService.getUser()
.map(user -> new UserRepresentation(user.userId(), String.format("%s %s", user.firstName(), user.lastName())));
}
public Mono<User> getUser() {
return userRepository.findById("1").map(UserEntity::toDomain);
}
@Repository
public interface UserRepository extends ReactiveCrudRepository<UserEntity, String> {
}
A Flux can be used in a very similar way to a Mono, however it can bring back multiple values as opposed to one. It would also push values to the client as they arrive, as opposed to waiting for all values before returning the response.
Below I have included an example endpoint which fetches all the records from a collection in a Mongo database. Using the Spring Data library, we can easily make reactive calls to the database and return a stream of results in the form of a Flux. These can be passed through to the controller and returned to the client.
@GetMapping(path = "users", produces = APPLICATION_JSON_VALUE)
public Flux<UserRepresentation> getUsers() {
return userService.getUsers()
.filter(user -> (user.firstName().toUpperCase().startsWith("J")))
.map(user -> new UserRepresentation(user.userId(), String.format("%s %s", user.firstName(), user.lastName())));
}
public Flux<User> getUsers() {
return userRepository.findAll().map(UserEntity::toDomain);
}
@Repository
public interface UserRepository extends ReactiveCrudRepository<UserEntity, String> {
}
Some of you may be thinking: "if the database is streaming the records out as they are found by the query, why does the app have to return them all at once to the client?". To that question, I present MediaType.APPLICATION_NDJSON_VALUE (newline delimited JSON). Notice that we have been returning our results in the form of MediaType.APPLICATION_JSON_VALUE, which is just plain old JSON. Newline delimited JSON allows whole JSON objects to be returned separated by newlines, like so:
{"id": "1", "name": "item-1"}
{"id": "2", "name": "item-2"}
{"id": "3", "name": "item-3"}
By making a simple change to our endpoint, we can return our records as they come back from the DB, as opposed to waiting for all of them before returning.
@GetMapping(path = "users", produces = APPLICATION_NDJSON_VALUE)
public Flux<UserRepresentation> getUsers() {
return userService.getUsers()
.filter(user -> (user.firstName().toUpperCase().startsWith("J")))
.map(user -> new UserRepresentation(user.userId(), String.format("%s %s", user.firstName(), user.lastName())));
}
Hitting this endpoint demonstrates the power of using reactive streams in your application, especially if your client is also using reactive streams. Instead of having to wait for all the data to be fetched and processed before you get a result, you can work through each record as it is returned.
To illustrate this, I delayed all the elements in the Flux slightly and sent a cURL request to the endpoint from the terminal.
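As a minimal sketch of the kind of change that introduces such a delay (this is illustrative rather than the exact code used; the one-second interval and endpoint are assumptions), Reactor's delayElements pauses between each emitted record:

// A sketch only: delay each record so the staggered NDJSON output is visible.
@GetMapping(path = "users", produces = APPLICATION_NDJSON_VALUE)
public Flux<UserRepresentation> getUsers() {
    return userService.getUsers()
            .delayElements(Duration.ofSeconds(1)) // java.time.Duration; emit records a second apart
            .map(user -> new UserRepresentation(user.userId(), String.format("%s %s", user.firstName(), user.lastName())));
}

Requesting it with something like curl -N localhost:8080/users (the -N flag disables curl's output buffering; port and path assumed) then prints one JSON object at a time as each record arrives.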
As Reactive Streams have the added element of time, they can be particularly useful for writing non-blocking code. Our use case for them is our API gateway. Since the gateway needs to handle high levels of throughput, we need our code to be as non-blocking as possible. Using Spring Cloud Gateway, we get an entirely reactive/non-blocking API gateway out of the box.
Our API gateway needs to do a little more than act as a reverse proxy between us and the outside world. There are scenarios in which we want to send network requests to other apps before forwarding the client request downstream. To do this in a non-blocking way, we need to take advantage of Spring’s new HTTP client WebClient. In our use case, we need to send a request to one of our microservices and so long as we get a successful HTTP status code with the response, we can allow the client to continue with their journey.
When making a network call with WebClient, you have to think of your code as more of a specification of how you want WebClient to send a request and what it should do with the response. It follows that same Stream-like API pattern.
In our repository method, we configure WebClient to send a POST request to a URL with some headers. We then call the retrieve() method, which allows us to configure how we want WebClient to handle our response if and when we get one. bodyToMono(User.class) tells WebClient how to map the response body; in this case, we are mapping it to a Mono<User>.
@Override
public Mono<User> validate(final String userToken) {
    final String url = String.format("%s/user", config.getRoot());
    return webClient.post()
            .uri(url)
            .header(AUTHORIZATION, config.getAuth())
            .header(tokenHeader, userToken)
            .retrieve()
            .onStatus(HttpStatus::is4xxClientError, response -> Mono.error(new BadRequestException("helpful message")))
            .onStatus(HttpStatus::is5xxServerError, ClientResponse::createException)
            .bodyToMono(User.class);
}
Our service class calls this repository method to fetch the information and simply returns it without performing any logic.
public Mono<User> validate(final String userToken) {
    return repository.validate(userToken);
}
Since the information we have fetched is used widely across our gateway filters, we created an AuthTokenService class to handle storing this information in a request-scoped way, similar to a ThreadLocal. The AuthTokenService exists to provide the caller with user identity information.
When dealing with an incoming request in Spring Cloud Gateway, Spring provides us with a ServerWebExchange variable, aptly named exchange in this case. This exchange variable represents the whole interaction (or rather, the exchange) between our gateway, the client, and downstream services. It stores both the request sent by the client and the response we are returning to them. It also has an attributes field of type Map<String, Object> that we can use to store request/exchange-scoped data.
We pass our exchange into the appendValidatedUserPayload() method and get back a Mono<ServerWebExchange> containing our exchange with the requested information stored in the attributes map. This allows us to store information about the user and make it accessible to all our filters without them having to run in a specific order. Prior to this solution, a filter had to run first to hydrate the exchange with this information.
public Mono<ServerWebExchange> appendValidatedUserPayload(final ServerWebExchange exchange) {
    return getValidatedUserPayload(exchange)
            .map(userAuthenticationResult -> setValidatedUserPayload(exchange, userAuthenticationResult));
}

public Mono<UserAuthenticationResult> getValidatedUserPayload(final ServerWebExchange exchange) {
    final ServerHttpRequest request = exchange.getRequest();
    final Object preValidatedToken = exchange.getAttribute(JW_TOKEN);
    if (nonNull(preValidatedToken)) {
        return Mono.just(preValidatedToken)
                .map(attribute -> (UserAuthenticationResult) attribute);
    }
    final String authToken = getBearerToken(request)
            .orElseThrow(() -> new MissingJwTokenException("User has not provided an auth token"));
    return authenticationService.validate(authToken);
}
private Optional<String> getBearerToken(final ServerHttpRequest request) {
    final String authHeader = request.getHeaders().getFirst(AUTHORIZATION);
    return nonNull(authHeader) ? Optional.of(authHeader.split(" ")[1]) : Optional.empty();
}
All of this eventually leads back to our filter. In this scenario, we are only interested in whether fetching the information has gone smoothly; if it hasn't, we stop the request from going any further and return an appropriate response.
@Override
public GatewayFilter apply(final Config config) {
    return new OrderedGatewayFilter((exchange, chain) ->
            authTokenService.appendValidatedUserPayload(exchange)
                    .flatMap(validatedToken -> continueRequest(exchange, chain))
                    .onErrorResume(JwTokenValidationException.class, error -> unauthorised(exchange))
                    .onErrorResume(MissingJwTokenException.class, error -> unauthorised(exchange))
                    .onErrorResume(WebClientException.class, error -> {
                        LOG.error("Failed to validate auth token!", error);
                        return badGateway(exchange);
                    })
                    .onErrorResume(DecodingException.class, error -> {
                        LOG.error("Error decoding jwt token payload.", error);
                        return badRequest(exchange);
                    }),
            JWT_TOKEN_VALIDATION_FILTER_ORDER);
}
public static Mono<Void> unauthorised(final ServerWebExchange exchange) {
    return respondWithError(exchange, UNAUTHORIZED);
}

public static Mono<Void> badGateway(final ServerWebExchange exchange) {
    return respondWithError(exchange, BAD_GATEWAY);
}

public static Mono<Void> badRequest(final ServerWebExchange exchange) {
    return respondWithError(exchange, BAD_REQUEST);
}

public static Mono<Void> continueRequest(final ServerWebExchange exchange, final GatewayFilterChain chain) {
    return chain.filter(exchange);
}
I set out to write this post to help spread some of the knowledge I have picked up whilst working in this paradigm. Hopefully I have been successful in doing so and you have been able to take something away from this.
Spring is widely used throughout the industry at the minute and has earned itself a good reputation (at least within Auto Trader). We are comfortable using it and hoped its implementation of a reverse proxy with Spring Cloud Gateway would be really well put together. For most intents and purposes, it is! You can easily spin up a super simple gateway in no time at all, and you can build some complicated functionality on top of it too. There are plenty of pre-made filters and predicates to allow you to easily configure routes and manipulate requests/responses. You can also trust that it will be maintained well over the coming years by VMware.
Now that the project has come to a close and I have had time to reflect on the system we built, there are many learnings I have been able to take away from it. Working with new tech is great! We got to build a system which is integral to the business and watch it power through requests like nobody's business. We also had the pain of having to learn how it all works and how to use the libraries whenever we wanted to do something more complex. I found that parts of the documentation were lacking, and I had to read through source code and Stack Overflow threads to figure out how to use certain functionality. Some of the classes we use seem to still be in development and could change at any moment. Most of the basic implementation is easy enough to use and is documented fairly well.
I have found that whilst this feels like a step forward technologically, it doesn't always feel like the easiest thing to work with. Granted, I am not the most knowledgeable on this topic, and I think there is still lots for me to learn about writing code in this paradigm/style. But that lends itself to the point I'm about to make: it's such a different way of working that plenty of time needs to be set aside to learn it. The documentation could also be better to help facilitate this (hence my motivation for this post). Fortunately for me, I was afforded the time to learn, and I was able to enjoy doing it because of this. I imagine other developers with more pressing time constraints would become frustrated.
Photo by Markus Quinten de Graaf on Unsplash
Here at Auto Trader, we aim to help customers find their perfect vehicle as quickly and easily as possible. But with over 400k vehicles advertised onsite at any one time, this can prove challenging. Whilst we provide a powerful search engine to help narrow things down, customers still need to specify a set of filters to get the most out of their search, posing the question: what if they’re not sure exactly what they want? What if they’re not sure how to find what they want using our filters? How can we help those customers?
Search filters aren’t the only way a customer can express a preference - their other activity on Auto Trader can do that as well. Each time a customer views a set of search results, they express a preference by choosing which adverts to click on. This presents us with an opportunity: we can look at this activity, model a customer’s preferences from it, and subsequently improve our search results to show more vehicles that match those preferences.
In order to not change the customer experience on Auto Trader too dramatically, we started by amending just one of our search result positions: the Featured listing position.
The Featured listing position sits at the crown of search and contains the first advert a customer sees when browsing for a vehicle, making it highly valuable for retailers to utilise. If a retailer would like to advertise a vehicle in this position, they must pay for a weekly Pay Per Click (PPC) Campaign. This provides a click budget to include a selection of their vehicles in the advert pool used to populate the number one position. Customers will then see this listing and may click through to the advert, spending clicks from the retailer's budget. Once the retailer's budget reaches zero, their vehicles automatically stop being advertised in the Featured listing position until they purchase more clicks for the following week. The product encourages a higher click-through rate for the retailer's adverts, in turn leading to more sales.
When a customer is navigating the search listing page and they apply any of the available filters, the Featured listing position will find a vehicle from the PPC pool to match the explicit preferences the customer has requested. However, if no filters are applied, a vehicle gets selected from the pool at random and presented to the customer. If we know the customer’s preferences, we can use this information to personalise the listing and increase the click-through rate for this position by showing vehicles that our customers are more likely to want. To help us infer those customer preferences, we built a Customer Data Platform (CDP).
An example Featured listing. Ideally, the vehicles here should be personalised based on the customer’s preferences. The more expensive advert shown here might not be in the price range the customer has previously shown an interest in.
The CDP at Auto Trader is a purpose-built real-time database that ingests tracking events from Snowplow, our behavioural data platform. Every time a customer interacts with the Auto Trader website or our native apps - viewing an advert for a car, visiting the homepage, searching for vehicles, or sending a request to a retailer - we can track it by firing an event to Snowplow.
Events contain identity information such as a unique user ID if the customer has consented to tracking and the Auto Trader account ID if the customer has logged in. The event also contains information regarding the action taken. For example, if you view a full-page advert on the website, the event will contain information about that advert such as the vehicle make, model, age, mileage, fuel type and more.
{
  "uniqueId": "fedc3d3b-e267-4536-bfb4-a8f0d6f4df15",
  "atUserId": "a875d622-7bef-405c-b360-72c744b6bf0d",
  "loggedIn": true,
  "url": "www.autotrader.co.uk/advert/{advertId}",
  "advertDetails": {
    "advertId": "{advertId}",
    "make": "ford",
    "model": "focus",
    "ageYears": 4,
    "mileage": 40000,
    "fuelType": "electric"
  }
}
A mockup of a Snowplow event. In reality, the event data schema is more complex than this.
These Snowplow events stream into a GCP Pub/Sub topic. From there, they are relayed into our Kafka Cluster to be consumed by the CDP. The number of events that make it to the CDP is large (~1,000 / second) and keeping up with all this data has required significant engineering that we would like to delve deeper into in another blog post. For now, the high-level architecture of the CDP can be seen in the figure below.
High-level architecture of the CDP consumer and data service. Events stream in from GCP Pub/Sub into a service that then relays the events into our Kafka Cluster. The event consumer application then uses these events to update the customer’s unified profile, which is stored in Bigtable. The data service then serves these profiles into downstream services via our gateway.
The event consumer service is a Kafka consumer written in Java (Spring Boot). It deserialises each event from the user behaviour topic and then runs it through a processing pipeline with three stages.
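As a rough sketch of the consumer's entry point (the topic, group, class, and method names here are illustrative assumptions, not the production code):

// Illustrative only: topic, group, and type names are assumptions.
@Component
public class UserBehaviourEventConsumer {

    @KafkaListener(topics = "user-behaviour-events", groupId = "cdp-event-consumer")
    public void onEvent(final String payload) {
        final SnowplowEvent event = deserialise(payload); // first stage: deserialise the event
        // subsequent stages would update the customer's unified profile in Bigtable,
        // ready for the data service to serve downstream
    }
}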
When a customer is in a segment, we want to use that segment to narrow down their search results to ones matching their preference. So ideally we want to define segments that align with our search filters. There’s no value in applying a filter a customer has already applied, so we also want segments that correspond to less commonly applied search filters.
Based on these criteria, we assign users into segments for fuel type preference (ELECTRIC, PETROL, DIESEL, etc.) and body type preference (SUV, HATCHBACK, CONVERTIBLE, etc.).
To decide when to add a customer to a segment, we look at their recent browsing activity. Taking the ELECTRIC segment as an example, we want customers in that segment to be more likely to view an EV (electric vehicle) than the average customer. Therefore, we assign customers to the segment based on how likely we think they are to view an EV.
We could use any aspect of a customer’s browsing history to decide whether they’re likely to view an EV, such as whether they’ve read content about EVs, or whether they’ve entered our EV giveaway competition. However, the most predictive feature is simply whether they’ve viewed EVs in the past. So we use a simple model based on the proportion of vehicles already viewed that had an electric fuel type.
Once we know how likely someone is to view an EV, we need a threshold to decide when to add them to the segment. This is a classic precision/recall trade-off. If we use a lower threshold, more customers will end up in the segment (leading to a higher recall), but the proportion of vehicles viewed that match the segment will drop (leading to a lower precision). To keep things simple when narrowing down search results, we choose a threshold of 50%, since that means a customer can be in at most one fuel type segment.
2 Petrol views, 3 Electric views -> ELECTRIC segment applied
3 Petrol views, 3 Electric views -> No segment applied
We apply the same approach to the other fuel type segments, and the body type segments as well.
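As a minimal sketch of that rule (the FuelType enum and method are illustrative, not our production code), a segment applies only when strictly more than half of a customer's recent views match it, which is what guarantees at most one segment per attribute:

// Illustrative sketch: a fuel type segment applies only when strictly more
// than 50% of the customer's recent views match it, so at most one can apply.
Optional<FuelType> fuelTypeSegment(final List<FuelType> recentViews) {
    return Arrays.stream(FuelType.values())
            .filter(fuelType -> 2 * Collections.frequency(recentViews, fuelType) > recentViews.size())
            .findFirst();
}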
Now that customer preferences regarding fuel type are being logged in the CDP, transformed into segments, and stored within a customer's profile against their unique user ID, we need to fetch this information whilst they browse the website in order to shape their search experience. Before a customer lands on Auto Trader, their request passes through Consumer Gateway, a Spring Boot Zuul application used to manage customer traffic. Within this application, we can create a custom Zuul filter to manage the request to fetch the customer's profile. This includes only fetching it when a customer is on certain parts of the website - more specifically, whenever they navigate to a search listings page, as this places less load on the CDP. Once the profile gets fetched from the CDP, using the unique user ID from the Snowplow cookie, the segment enums are concatenated and inserted into the request header as a string. This header is now ready to be forwarded to the family of applications that powers the Auto Trader website, Sauron.
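Concretely, a pre-routing Zuul filter along these lines could do the job. This is a hedged sketch: CdpSegmentFilter, CdpClient, snowplowUserId, the URI check, and the X-Cdp-Segments header name are all illustrative stand-ins, not the production implementation.

// A hedged sketch of a custom "pre" Zuul filter; all names are assumptions.
public class CdpSegmentFilter extends ZuulFilter {

    private final CdpClient cdpClient; // hypothetical client for the CDP data service

    public CdpSegmentFilter(final CdpClient cdpClient) {
        this.cdpClient = cdpClient;
    }

    @Override
    public String filterType() {
        return "pre"; // run before the request is routed downstream
    }

    @Override
    public int filterOrder() {
        return 10; // arbitrary order for illustration
    }

    @Override
    public boolean shouldFilter() {
        // only fetch a profile on search listings pages, keeping load on the CDP down
        return RequestContext.getCurrentContext().getRequest().getRequestURI().contains("/car-search");
    }

    @Override
    public Object run() {
        final RequestContext context = RequestContext.getCurrentContext();
        final String uniqueUserId = snowplowUserId(context); // hypothetical: read from the Snowplow cookie
        final String segments = cdpClient.fetchSegments(uniqueUserId); // e.g. "ELECTRIC,SUV"
        context.addZuulRequestHeader("X-Cdp-Segments", segments); // header name is an assumption
        return null;
    }
}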
Sauron queries a myriad of domain web services to populate what a customer sees and interacts with on the Auto Trader website, including click-and-collect locations, finance options, part exchange valuations and more. We’re particularly interested in the service that provides customer search results for vehicles, Vehicle Listing Service (VLS). VLS provides listings for Natural positions (positions that are not influenced by any Auto Trader products) and Prominence listing positions – including Featured listing. This data is fetched by querying Search One, another internal service that is responsible for providing advert information. VLS does this by building a query based on the filters used by the customer and the listing type. For instance, if it’s for the Featured listing position, the query requests adverts from the PPC pool. We can pass segments into VLS again using the request headers and, based on the segments present, VLS can add to the Search One request query. As an example, if the customer is in the SUV segment, we can manipulate the query so that it brings back an advert with an SUV body type as we can infer that the customer is interested in SUVs. The segment is only applied if the body type filter hasn’t been used (since we only want to use segments if we don’t explicitly know a customer’s preference).
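A sketch of that guard inside VLS's query building might look like the following, where Segment, appliedFilters, and the query builder method are hypothetical names:

// Hypothetical names throughout: only apply the SUV segment when the
// customer hasn't already filtered on body type.
if (segments.contains(Segment.SUV) && !appliedFilters.containsKey("body-type")) {
    searchQuery = searchQuery.withFilter("body-type", "SUV");
}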
Tailoring a customer’s journey on our website should be a positive experience, so to validate if customers are happy with the customisation of the Featured listing position, we can analyse their engagement through AB testing.
An AB test is a randomised experiment where we assign customers to either a test or control bucket as they enter our website. Customers are typically shown different versions of the webpage depending on the bucket they are assigned to. In the case of testing CDP segments, a customer in the test bucket will be shown an advert in their search results that we believe fits their preferences based on our modelled data in the CDP. To illustrate how this works, let's say a customer in the test bucket is in the ELECTRIC segment. We would show them an electric vehicle advert in the Featured listing position. A customer in the control bucket, however, would be shown an advert in the Featured listing position with no customisation based on their CDP body type or fuel type segment preference.
For a CDP segment AB test, we have four groups of customers we can observe and compare: test customers with a preference segment, test customers without one, control customers with a preference segment, and control customers without one.
This scenario differs slightly from conventional AB tests in that we must be careful of sample sizes. When we choose the percentage of traffic to run the experiment on, we must consider the size of the segments. For instance, if we create a 50% test bucket but only 20% of Auto Trader customers have a fuel type preference, we would only be testing the change on 10% of all Auto Trader customers. This means the test may take longer to run than a conventional AB test.
Prior to the AB test, we calculate power and run-time based only on customers who have a preference segment. Naturally, if a customer is randomly assigned to the test bucket but does not have a preference, their journey is exactly the same as a control customer's. Hence, we remove non-segment customers from our analysis because they may behave differently from our in-segment customers, which in turn can affect the results of the AB test. Further to that, as a rule, we would only customise the Featured listing position if the customer had not already applied a search filter for the CDP segment.
Going back to our ELECTRIC preference example, if the customer's search was filtered on petrol vehicles, we would not overwrite this filter to show electric cars, as it could negatively impact their experience on site.
We have recently run AB tests using CDP segments for body type and fuel type independently, using the methodology described above. In both instances, we have seen a positive uplift in the mean number of advert views per session in our test bucket. For body type, we saw an average uplift between +4.8% and +7.2%, and for fuel type an average uplift between +2.1% and +4.1%. These figures are at a 90% confidence level, meaning that if we repeated the test multiple times, the mean advert views per session would fall within the specified interval 90% of the time.
Now that we’re using customer segments to modify the behaviour of search, if something breaks we can end up negatively impacting the customer experience, so we monitor the segment performance. We do this by looking at the vehicles viewed by customers and comparing them against the segments we had assigned to those customers. For each segment, we can then calculate the proportion of customers in the segment. Along with this, we calculate the precision (the proportion of vehicles viewed by a customer that match the segment) and the recall (the proportion of vehicles matching a segment that were viewed by a customer in that segment) of the segment.
The plot below shows this monitoring for our fuel type segments. Most customers are either in the petrol or diesel segments, and around 1.5% of customers are in the ELECTRIC segment. When a customer is in the ELECTRIC segment, roughly 70% of the vehicles they view are electric (the segment precision) and almost 50% of all electric vehicle views come from members of the segment (the segment recall).
In this blog post, we’ve demonstrated how we can use our Customer Data Platform to improve the search experience on Auto Trader. So far, we’ve only looked at segments for body type and fuel type, and we’ve only personalised one of the search positions. There’s opportunity to go further.
Next, we’re planning to explore segments linked to other vehicle attributes. Price, mileage, and vehicle age are all promising candidates. We’ll also explore how the segments interact with each other, and how best to personalise the results when a customer is in multiple segments. Should we give equal weight to all segments, or are some more valuable than others? We’re also in the process of personalising more of our search results, not just the Featured listing, all with the aim of giving customers an even more relevant search experience.
The core of our business at Auto Trader is helping people to find their next vehicle (be it car/van/truck/caravan etc.!) as easily as possible. As such, a question we need to be able to answer is "which adverts are the most attractive to our users?" Several years ago we created Advert Attractiveness specifically for that purpose (we have a blog post about it). At its heart, the Advert Attractiveness model looks at the response of users towards a given advert, while accounting for the context it has appeared in (e.g. desktop vs mobile, position of the advert on the page), and the user's behaviour as well. The aim of this is to minimise the role that luck plays in an advert's response in order to find the best quality adverts.
The original model only scored a subset of the advertised cars on our site. This was because it needed a reasonable number of users to see each advert (we'll refer to these as observations) to produce a reliable score. This was done with a set of hard thresholds on the number of observations each advert had to acquire before getting a score, which limited our coverage to circa 80% of car adverts. As we also wanted the score to be relevant to the current state of the advert, we have a fixed observation window of seven days. This is because imagery, pricing etc. can all change over time, and we don't want to saddle a newly improved advert with a low score because of its low response from weeks ago. The flip side of this is that some adverts will never get enough response to be scored, particularly in our non-car channels.
Since its initial release we have continuously been working on the Advert Attractiveness model to address the above constraints; this has proven to be a challenging and very satisfying endeavour.
What Advert Attractiveness does is compare the observed response to what we would have expected given the set of contexts the advert has appeared in. The original model simply took the raw response as the input to that comparison, leading to several problems because the ratio is erratic early on when there is little data. This is why the original model also imposed a minimum number of observations before scoring, but ideally we'd be able to score all adverts the moment they are live on the site. We needed a solution that balanced the desire to score as soon as possible against the fact that the fewer observations we have, the less sure we are about the advert's quality.
Fortunately, as we have ~500k adverts on Auto Trader at any one time, we have a good idea of the distribution of response levels across our adverts. For example, any metric based on click-through rate approximately follows a Beta distribution. If we know nothing about an advert, we can still have an idea of its likely click-through rate based on our prior knowledge of the distribution of all adverts on the site.
Those familiar with the Bayesian view of the world can probably see where this is going. We have a prior distribution, informed by the performance of many previous adverts, which we can update as more and more information comes in. Eventually, with enough response, our initial prior assumptions no longer practically impact our posterior [1] estimate of the response. This process of updating with additional information is illustrated below.
The y-axis shows the Probability Density Function (PDF), which means that the area under each curve represents a probability and each curve has unit area (we know the result will definitely lie between 0 and 1). You can see that we start off with a broad prior (red dashed line), which narrows as we gather more information about the advert.
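For a response metric based on click-through rate, this updating step has a neat closed form. As a sketch (the symbols here are illustrative; the real prior parameters are informed by the population of adverts on site): starting from a \(\mathrm{Beta}(\alpha_0, \beta_0)\) prior and observing \(k\) responses out of \(n\) observations, the posterior is

\[\mathrm{Beta}(\alpha_0 + k,\; \beta_0 + n - k)\]

whose mean, \((\alpha_0 + k)/(\alpha_0 + \beta_0 + n)\), moves smoothly from the prior mean towards the observed rate \(k/n\) as evidence accumulates.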
The power of this approach is that we no longer need any hard (and arbitrary) cut-offs for when to score an advert. Instead, we start off with a reasonable estimate, based on the general population of adverts we have on the site, which we continually refine with additional evidence (as illustrated in the plot above). This approach allows us to score practically any advert fairly, not only for car adverts, but vans, bikes, caravans, motorhomes, and our industrial equipment (known as Truck Plant Farm).
In addition, we benefit from a negative feedback loop. If an advert scores poorly, it will be shown less in some of the positions we are interested in (such as our Promoted slot). This means we gather fewer observations, so the prior keeps its score closer to the mean. The opposite holds for high-scoring adverts: as we show them more, we get more observations, increasing our confidence in the estimate and making it hard for an advert to maintain a high score by luck alone. This negative feedback loop is far more desirable than a positive cycle, where adverts would accelerate towards the extremes.
With our raw observed response now restrained in the case of low data, we have a much more reliable and stable model, which still allows great adverts to shine. However, this fix exposed a similar yet more subtle problem.
In the previous section, we talked about how we have a fair estimate of an advert’s response, accounting for the noise inherent in having limited observations. However, there is a second factor that we have yet to consider. When comparing the ratio of true (or our best estimate of it) and expected response we have only altered the former, not the latter. As it turns out this can still lead to extreme scores at very low numbers of observations. This is easily seen in a plot we lovingly refer to as “the fish”.
What we can see here is the Advert Attractiveness score against the number of observations we have for each advert. Note the x-axis is on a log scale. The shape of this plot arises from two main mechanisms. The first is the impact of our aforementioned prior: it is difficult for adverts to get extreme scores with low numbers of observations, and so we see the body of the plot widen as observations increase. However, we also use Advert Attractiveness to inform which adverts we show to users on our site in slots such as our 'Promoted' position. This means that as we become more confident in an advert's poor score, we are less likely to show it, making it difficult for low-scoring adverts to accumulate a very high number of observations and causing the second half of the plot to narrow, leaving just the high-scoring adverts with very high numbers of observations.
There is a third feature of the fish plot: the "tail". This is what we are concerned about, as the scores here become erratic. The root cause of the problem is that while we have a prior on our observed response, the ratio to the expected response can still be unstable, as the denominator can vary substantially. The expected response can vary significantly because of the context an advert can appear in. Imagine that an advert's first observation is at the top of the page for a very engaged user; even in this scenario, it is perfectly likely that the advert won't be interacted with. If we compare this with the case where that advert had appeared right at the bottom of the page, the value of the observed response is the same (no response), yet the expected response is very different. We need to somehow capture the fact that even though the expected response may be higher, we won't be able to measure any difference until we have several observations.
We looked into multiple ways to apply a similar methodology to the denominator as for the numerator; however, we often saw strange artefacts, such as an empty arc in the "fish". Our solution was to apply an effective sample weight to the Advert Attractiveness score itself. Fortunately, we can approximate the distribution of Advert Attractiveness scores as a Gaussian, and so we can apply an adjustment to the raw score, \(s_r\), as

\[s = \frac{1}{n_e+n}(\mu_0 n_e + s_r n)\]

where \(n\) is the number of observations, \(n_e\) is an effective sample size (it arises as a ratio of variances, but we won't go into detail here) and \(\mu_0\) is the mean of our prior. You can sense check the above equation by looking at its behaviour when \(n\) (the number of real observations) becomes large: the score \(s\) tends to the raw estimate, \(s_r\). Below we have some examples of different effective sample sizes, including some extreme values that nicely illustrate the effect of this change. For our purposes, we just want a value that reduces the extreme values, somewhere in the range \(1 < n_e < 20\).
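To see the effect with placeholder numbers (purely illustrative, not our tuned values): take a prior mean \(\mu_0 = 1.0\), an effective sample size \(n_e = 10\), a raw score \(s_r = 2.0\), and just \(n = 5\) observations. Then

\[s = \frac{1}{10+5}(1.0 \times 10 + 2.0 \times 5) = \frac{20}{15} \approx 1.33\]

so the extreme raw score is pulled most of the way back towards the prior mean until more observations arrive.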
Since its first release into the wild, we have continuously monitored Advert Attractiveness to see where we could improve it, and in this blog post we have discussed just a couple of the changes we have made (there have been many more!). To allow us to score almost all adverts on site, we introduced prior weightings. This meant that even adverts with few observations could be given a fair score, changing smoothly with increasing data. The introduction of the prior on the observed response was a success; however, it exposed the issue of the expected response term being unstable at very low numbers of observations, requiring an additional prior.
The addition of these priors has been the key to allowing us to maximise coverage across all our different vehicle categories at Auto Trader. Consequently, we have improved not just the user experience with these more reliable scores, but also the machine learning models that use Advert Attractiveness as an input feature.
[1] For the purposes of this post you just need to know that the prior distribution is our initial assumption about the system and the posterior is the new best estimate after combining the prior with new data.