I’m Running Experiments. Why Hasn’t My Conversion Rate Gone Up?
Your digital platform can benefit from experimentation, but that benefit doesn’t always show up as a rising conversion rate. Let’s explore how to manage the nuance, including insight from experts.
The benefits of experimentation are well-researched and documented.
All else being equal, once startups begin experimentation, they see strong gains within the first few years and continued growth in years to come.
“A/B testing significantly improves startup performance and […] this performance effect compounds with time.” — Koning, Hasan, Chatterji, Experimentation and Startup Performance: Evidence from A/B testing
Given these blockbuster outcomes for experiment-led companies, it’s obvious that experimentation correlates with success. But does having an experimentation team guarantee better overall performance?
Case studies make the connection between experimentation and KPI growth seem inevitable. So, it’s logical to expect that once your experimentation engine is firing on all cylinders, you should see a noticeable impact on the data. But is that expectation realistic? If I run experiments, will my conversion rates go up?
It depends.
After running thousands of experiments, we’ve learned that what happens after a release rarely mirrors the success of your experiments one-to-one. The reasons why take some unpacking.
In this article, we explore:
- Why experiment results don’t always net out one-to-one
- Why we still believe that experimentation is a powerful tool to add to your toolkit
Why don’t we always see one-to-one results from our experiments?
If you’ve ever researched an experimentation partner, you’ve probably come across one that guarantees conversion rate increases. It’s an enticing prospect that experimentation = an automatic increase in KPIs. Industry insiders like Shiva Manjunath are generally against this “snake oil” tactic, and at The Good, we simply say anyone who guarantees they can increase your conversion rate is either lucky or lying.
There are plenty of reasons experiment results don’t map one-to-one with the real-world outcomes you expect. So, let’s explore three core issues that prevent you from seeing those charts go up and to the right:
- Post-launch variables
- Experiment segmentation
- The effect of false positives
Problem 1: Post-launch variables make attribution less than clear.
Metrics are influenced by much more than a website or app experience. Direct and indirect influences impact metrics like conversion rate, revenue, and customer satisfaction.
In fact, we’ve identified over 55 variables that contribute to swings in KPIs. Factors like traffic quality, seasonality, competitor promotions, and even the economy play a huge role in whether or not website visitors will convert—now or in the future.
As we learned during the COVID-19 pandemic, even the largest experimentation wins may not eclipse outsized influences that dampen conversion rates. (True story: we once saw a single social media intern drive so much new traffic that conversion rates fell by a whole percentage point. Great for the intern’s portfolio, bad for conversion rates.)
The opposite is also true. When outside factors are working in tandem, the results can be outstanding. We worked with Alisha Runckel during her tenure at Laird Superfood, running hundreds of experiments together that contributed to the brand’s industry-high conversion rates.
But experimentation wasn’t solely responsible for her success. Alisha and her team also made good offline decisions. Their holistic approach included direct mail, bundling incentives, and even a packaging overhaul informed by sentiment testing.
“There are no silver bullets. Ecommerce success is an accumulation of good decisions made over time.” – Alisha Runckel, Laird Superfood, Humm, Hanna Andersson
If done correctly, A/B tests are a trustworthy way to estimate the effect of a treatment despite shifting external influences. But once the test period is over, you no longer have a baseline or “counterfactual” scenario to compare to.
Even with sophisticated forecasting, there’s no easy way to know what performance would have looked like if you hadn’t implemented the changes. As a result, once a test period is over, we generally can’t tell whether our changes lifted a metric or simply propped up one that was already in decline, which makes post-launch attribution fuzzy at best.
Problem 2: Test results are not summative
5 + 5 = 10, right? Not in experimentation.
While there are many reasons that experiment results are not summative, let’s focus on one of them today—segmentation.
Many experiments only impact a segment of users, and in those cases, the observed improvement is diluted in proportion to that segment’s share of the larger population. A win with a segment, or subset, of users does not predict the same gains for the whole.
Let’s run a scenario to demonstrate:
- We experiment with a punchy new landing page and checkout flow on mobile traffic only, which is 50% of our total traffic.
- The experiment shows a 5% increase in revenue among that audience.
When we implement the changes in production, we shouldn’t expect desktop traffic to perform in an identical manner. The reasons could be many. Maybe the change doesn’t fit desktop conventions as well. Or maybe the experiment hypothesis only addressed mobile users’ lack of patience and need for speed.
Whatever the reason, practitioners often run experiments that are only meant to impact a subset of their audience, and the rest of their visitors won’t experience the same benefit. That’s true for splits by device type, landing page, and similar segments, and it’s why we generally don’t extrapolate the results of a segment and apply them to the whole.
A 5% gain with one audience and a 5% gain with another does not equal a 10% lift overall. Test results are not summative.
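To make the dilution concrete, here’s a quick back-of-the-envelope sketch using the hypothetical numbers from the scenario above (a 50% mobile traffic share and a 5% observed lift). The assumption that desktop sees no change is part of the illustration, not a measured result:

```python
# Back-of-the-envelope dilution of a segment-level win (hypothetical numbers).
# A lift observed on one segment only moves the overall metric in proportion
# to that segment's share of the total.

segment_share = 0.50  # mobile traffic's share of total revenue (assumed)
segment_lift = 0.05   # 5% revenue lift observed in the mobile experiment
other_lift = 0.00     # assume desktop sees no change when the test ships

overall_lift = segment_share * segment_lift + (1 - segment_share) * other_lift
print(f"Expected overall lift: {overall_lift:.1%}")  # -> 2.5%, not 5%
```

Under these assumptions, a 5% win with half of your traffic works out to roughly a 2.5% lift overall, and two separate 5% segment wins don’t stack to 10%.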
Problem 3: False positives
The final reason (at least for today) that your overall KPIs might not match the result of your experiment is the occurrence of false positives.
A false positive occurs when our experiment data indicates that our hypothesis is true when it actually is not.
“[False positives] appear to generate an uplift but will not actually generate any increase in revenue.” — Goodson, Most Winning A/B Tests are Illusory
False positives may sound like an atrocious error on the part of the experimenter, but they are actually par for the course. Even rigorous and experienced experimentation teams expect about a 26% false positive rate, meaning about one in four “winning” experiments is the result of chance rather than a true improvement.
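To build intuition for why the rate is that high, here’s a rough sketch of how the share of false winners depends on your significance level, your statistical power, and how often your hypotheses are actually true. The inputs below are illustrative assumptions, not the figures behind the rate quoted above:

```python
# Rough illustration of why a chunk of "winning" tests are false positives.
# All inputs are assumptions for illustration only.

alpha = 0.05      # chance a no-effect variant still shows a "significant" win (simplified)
power = 0.80      # chance a genuinely better variant is detected
true_rate = 0.10  # share of tested ideas that actually move the metric

false_wins = (1 - true_rate) * alpha  # no real effect, but the test "wins" anyway
true_wins = true_rate * power         # real effect, correctly detected

share_false = false_wins / (false_wins + true_wins)
print(f"Share of winners that are false positives: {share_false:.0%}")
# With these inputs, roughly a third of "winners" are illusory, in the same
# ballpark as the rate quoted above.
```

The fewer of your ideas that genuinely work, the larger the share of your “wins” that are noise, which is why mature teams treat any single winning result with some skepticism.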
To account for false positives, some practitioners even go so far as to re-test all winning experiments that fall outside of a certain threshold, then split the difference to estimate real-world effects (more on that later).
In my experience, most practitioners aren’t so persnickety about false positives. Their approach is to accept that some “wins” are truer than others and, as a result, anticipate a slightly lower return when comparing test results to real-world outcomes. Still, those blessed with the traffic and time might consider the practice of re-running tests for improved confidence.
Does the pervasiveness of false positives mean experimentation doesn’t work? Of course not. It just means that we should approach “wins” with an informed skepticism and not expect a one-to-one relationship between the results observed during a test period and real-world performance.
Experimentation is an incredible tool, but it’s not a silver bullet
If the caveats listed above have shaken your confidence in experimentation, let me stop you right there.
It’s tempting to look at fuzzy post-launch attribution and false positive rates and ask, “Is experimentation really worth it?” We believe it is. And people much smarter than I am swear by a “test everything” approach because there is simply no better or more rigorous way to quantify the impact of changes on your bottom line.
Problems do arise when we tout experimentation as an omnipotent growth lever, a silver bullet for success. But if disappointment is the gap between expectation and reality, harnessing the power of experimentation is simply an expectation-setting exercise.
Like Opendoor’s Brian Tolkin, we view experimentation not as an all-powerful growth lever but as a confidence-generating mechanism.
“Experimentation is all about increasing your conviction in the problem or the solution.” — Brian Tolkin, Lenny’s Podcast
Experimentation can increase your confidence in your decision-making, help you measure the discrete impact of good design, and settle internal debates about which direction to head. What experimentation won’t do is compensate for all the external forces hampering your business.
If you’re looking for confidence and precision, experimentation is an incredible tool to add to your toolkit. If you’re looking for a silver bullet, we’re still looking, too.
Ensure the best possible experimentation outcomes with these tactics
It might be discouraging to hear that the hailed “silver bullet” of experimentation isn’t going to solve all of your problems. But, hopefully, you’re excited to know it is still proven to help your digital property perform at its best.
If you want to push your organization toward digital excellence and get the best outcomes from experimentation, a well-run program is key. Take a measured approach to incorporating it into your growth practice, and consider a few tips from veteran practitioners to make sure you’re maximizing its effectiveness.
Give your A/B tests ample time + traffic—without stopping short
It’s tempting to periodically check the progress of a test to see how things are trending, but there’s a name for this kind of behavior: peeking.
Maggie Paveza of The Good defines peeking as the act of “looking at your A/B test results with the intent to take action before the test is complete.” As Evan Miller describes in his article, How Not to Run an A/B Test, “The more you peek, the more your significance levels will be off.”
Avoid peeking by calculating test traffic requirements during the planning process and not stopping tests earlier than planned.
- Use predetermined calculations to set the acceptance criteria, test duration, and minimum traffic levels
- Analyze a test’s results only after you’ve reached predefined thresholds
- Don’t stop tests earlier than planned
Following these simple steps will increase the trustworthiness of your results and reduce your rate of false positives.
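For the planning step, a pre-test sample size calculation is the usual tool. Below is a minimal sketch for a two-proportion test using the standard normal-approximation formula; the baseline conversion rate and minimum detectable effect are hypothetical placeholders you’d swap for your own numbers:

```python
from statistics import NormalDist

def sample_size_per_variant(baseline_cr, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-proportion z-test.

    baseline_cr:  control conversion rate, e.g. 0.03 for 3%
    relative_mde: minimum detectable effect as a relative lift, e.g. 0.10 for +10%
    """
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(((z_alpha + z_beta) ** 2 * variance) / (p1 - p2) ** 2) + 1

# Hypothetical inputs: 3% baseline conversion rate, +10% relative lift target.
print(f"{sample_size_per_variant(0.03, 0.10):,} visitors per variant")
```

With these assumptions, the answer lands in the tens of thousands of visitors per variant, which is exactly why stopping early is so tempting and why committing to the numbers up front matters.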
Re-run some winning tests to verify your results
If you and your team decide that precision is more important than time, you may opt to re-run winning tests whose p-value lands in a borderline range, say between .01 and .05, to gain additional assurance that the effects measured were not the result of chance.
Experimentation veterans like Ron Kohavi recommend repeating some winning tests a second time “to check that the effect is real.”
“When you replicate, you can combine the two experiments, and get a combined p value using something called Fisher’s method or Stouffer’s method, and that gives you the joint probability.” — Kohavi, The Ultimate Guide to A/B Testing, Lenny’s Podcast
When you run the test a second time and analyze results across the two experiments, the newly calculated effect is more likely to represent the true difference between the treatment and the control. The result is that you’re less likely to implement a “winner” that was simply observed due to chance.
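As a rough sketch of that replication step, here’s how you might combine the p-values from the original test and the re-run using SciPy’s implementation of Fisher’s and Stouffer’s methods; the two p-values below are made-up examples:

```python
from scipy.stats import combine_pvalues

# Hypothetical p-values from the original winning test and its replication.
p_original = 0.04
p_replication = 0.03

# Fisher's method combines independent p-values into one joint probability.
_, p_fisher = combine_pvalues([p_original, p_replication], method="fisher")

# Stouffer's method does the same via z-scores.
_, p_stouffer = combine_pvalues([p_original, p_replication], method="stouffer")

print(f"Fisher combined p-value:   {p_fisher:.4f}")
print(f"Stouffer combined p-value: {p_stouffer:.4f}")
```

If the combined p-value sits comfortably below your threshold, the replication strengthens your case; if it drifts back toward insignificance, the original “win” may have been one of the false positives discussed above.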
Get comfortable diagnosing metric decline
While we can’t define every factor impacting conversion rates, there are some factors that are easier to spot in the data than others. Factors like seasonality, fluctuating traffic from various segments, and even traffic bots are fairly easy to track down, but you have to know how to spot them.
Luckily, Elena Verna, Head of Growth at Dropbox, created a decision tree for doing just that. Elena’s if/then flowchart helps growth specialists “quickly diagnose troubling conversion rate trends within 48 hours,” according to Reforge.
Whether you’re looking to evaluate the impact of your experiments or track down wayward KPI signals, getting comfortable with pinpointing the source of KPI changes is a valuable skill. It will help you build trust and authority within your organization, and it will help you understand the outside factors that might dampen the impact of your experimentation program.
Trust the process and use experimentation to its full potential
One thing has been proven time and time again: experimentation done right is associated with increased overall performance across a number of factors.
While experimentation can’t combat outsized economic and environmental factors, it can be a catalyst for better decision-making and help ensure your digital property is performing at its best—despite what’s going on outside.
By setting proper expectations, calculating test parameters before launching, mitigating false positives, and getting comfortable with tracing metric changes back to their causes, you can trust your test data and be confident you’re performing at your best.
About the Author
Natalie Thomas
Natalie Thomas is the Director of Digital Experience & UX Strategy at The Good. She works alongside ecommerce and product marketing leaders every day to produce sustainable, long-term growth strategies.