Run the Test
Now let's talk about how to actually run the test. We're going to get into a lot of the selection later, but ideally you go through that process of prioritization, things get stored, and now you've selected a test. When we get into meeting schedules and all that in a later segment, we're going to talk about the actual process of selection, how those meetings are run, and who's involved. For now, let's assume all the ideas went in there, we generally have the theme, and we know what we're going to test and when we're going to do it. So, we're going to move into the test phase, which is phase five. When it comes to running tests there are a couple of things that you really need to keep in mind. So, I want to walk through the process for running what we call high-impact tests, 'cause boy, testing, split testing, all these things, it is a great way to expend an enormous amount of energy, an enormous amount of resources, and get nothing from it, right. I mean it can, but boy, you can sure play work, you can play business, like oh look at all these tests, look at all the stuff that we did, and not a single needle moved, ever. So, how do we make sure that doesn't happen?
Number one, every test starts with a hypothesis. This is basic, fourth grade science fair project kind of stuff, but you must start with a hypothesis. Your team is not going to want to do that, but if you recall, that was on the idea sheet: make them guess, make them make an assumption. We learn when we make assumptions and then test those assumptions. They're not going to want to do it because they don't want to be wrong. Remind them that it's okay to be wrong as long as we're learning, but it doesn't count and they don't get credit for it unless they make an assumption. "But I don't know what's going to happen." Cool, freaking guess, throw a dart, I don't care, hypothesize something.
Then we build, actually build and construct the test. How the idea scored from an ease perspective is going to determine how long this build process takes. As a general rule, when we're doing tests we will batch: we'll either do one really, really big one, or we might say, you know what, for this round let's just knock out a bunch of smaller ones, maybe ones that are lower on impact but really high on ease. So there are tests that are bullets, and there are tests that are steamrollers. Think about the definition of force: force, I think, is mass times acceleration, is that right? So, if you have a really, really heavy mass, it could be moving slowly, but it will still crush you. That's a steamroller. You've all probably seen that scene from Austin Powers, right?
You can have something with very low mass, but if it's moving at incredible speed it still has a tremendous amount of force. That's a bullet. So, in general, when we're looking at build and what we're testing, we'll say, let's pick a steamroller, or, we don't have time, so let's do a lot of bullets. One steamroller, or a lot of bullets.
Hypothesize, build, then we validate. Did it work? Were the assumptions that we made true or false? And then we optimize. It is rare that you will run a test and go, cool, done. The most likely scenario is you will come up with an idea, you will hypothesize, you will build it, you will validate it, and what you will find is: we still don't know, we still aren't sure, it's on the bubble, it's close. We think if we were to do these things it could get a little bit better, or, ah, it didn't work, but we have a plan B. So you need to validate, and then scale or abandon. Do we scale the test or do we abandon it? Let's walk through some of these in a little bit more depth.
So, hypothesizing. You won't learn unless you make assumptions and then test those assumptions, period, end of story. You will aggregate knowledge, you will aggregate additional data points, but you will not learn to see patterns. It's when we make assumptions and then go, okay, we were right, cool, check, I'm going to look for that again, or, we made an assumption, wow, that didn't work out, what happened, that's when the learning process actually begins. That's why, when it comes to the idea sheet, we have this hypothesis column. We will not consider an idea, let alone allow it to move on to the build phase, unless it has a hypothesis in there. Absolutely, positively, have to have it.
So we hypothesize, we build. I'm not going to spend a lot of time talking about building, 'cause that varies from test to test. Sometimes it's literally just going in and doing a very simple split test, sometimes you're building something completely new, you know having to write form, but again, just be cognizant of the fact that when you're in this build phase, don't try to do a whole lot of steamrollers at once. You're going to bring your growth team to a screeching halt. And balance it: if you did a really, really big project, a really long build, follow that up with some quick wins, otherwise people get exhausted. It's like you just finished a book, want to write another book? No, I don't.
We get into validation. All tests must be run to the point of statistical significance. Now I want to warn you, I'm gonna get a little bit nerdy here, but hang tight with me. Statistical significance reflects your risk tolerance and confidence level. So, if you run an A/B testing experiment and you're looking at a statistical significance level of 95%, that means that to determine the winner you need to be 95% confident that the observed results are real and not an error. Just keep in mind, if you're 95% confident, it means there's also a 5% chance that you're still wrong. But generally tests are run to a point of 95% statistical significance. If you're Uber, if you're Google, if you're Facebook, heck, if you're Hotels.com, if you're one of these really, really big companies that has an enormous amount of traffic, you've got the volume of data to allow it to run to maybe a 98% point of statistical significance. For mere mortals like me, and most of us in the room (well, you guys are actually a much bigger deal than I am), and a lot of folks watching at home, 95% is generally where we're going to land.
So here's how it's done, you ready for this formula? For a given conversion rate, let's call that p, and number of trials, let's call that n, standard error is calculated as SE = sqrt(p × (1 − p) / n). That is, the square root of p times one minus p, divided by n, where p is the given conversion rate and n is the number of trials, or the number of whatever you're testing. So, to get to the 95% range, you take p, plus or minus 1.96 times the standard error. You work out that range for each variation, and if there's no overlap, the test is either the clear winner or the clear loser, got it?
Okay, or what you can do is just use the internal tool provided by your online testing solution. 'Cause I have no idea what that previous thing was, that could have been in a totally different language for all I know, but they do stuff like, oh look, there's a green trophy, it won. Ah, 99% accurate, yay. Thankfully these tools were built for non-math-doing jackasses like me to figure out. So, if you're even reasonably intelligent you're absolutely going to be able to figure this thing out, and if not, don't freak out. There are tools available that are pretty good at doing math, that's kind of the nice thing. Whatever tool you're using, it is going to calculate statistical significance. It's going to have a way of saying, this is now confident at 95% or better, and they all generally have a 95% statistical significance threshold built in. So where they're showing a green trophy, or generally if it's green, you're like, yay.
Now, where these can get a little bit wonky on you, and where you do have to be careful, is they can declare a winner too soon. So, I would say at a minimum, a test should be run for one week.
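The formula and the one-week minimum described above can be sketched in a few lines of Python. This is only an illustration of the talk's rule of thumb, not how any particular testing tool works: the function names, the `min_days` parameter, and the sample numbers are all made up for the example, and real tools may use a different statistical test. Checking for non-overlapping ranges is, if anything, a fairly cautious way to call a winner.

```python
import math

Z_95 = 1.96  # z-score for a two-sided 95% confidence level


def standard_error(p, n):
    """SE = sqrt(p * (1 - p) / n) for conversion rate p over n trials."""
    return math.sqrt(p * (1 - p) / n)


def range_95(conversions, trials):
    """Return (low, high): p plus or minus 1.96 standard errors."""
    p = conversions / trials
    margin = Z_95 * standard_error(p, trials)
    return (p - margin, p + margin)


def ranges_overlap(a, b):
    """True if the two (low, high) ranges share any values."""
    return a[0] <= b[1] and b[0] <= a[1]


def has_clear_result(control, variant, days_run, min_days=7):
    """control and variant are (conversions, trials) pairs.

    Hypothetical helper: refuses to call a result before min_days,
    then applies the no-overlap rule from the talk.
    """
    if days_run < min_days:
        return False  # too early, even if a tool shows a green trophy
    return not ranges_overlap(range_95(*control), range_95(*variant))


if __name__ == "__main__":
    control = (100, 2000)  # 5% conversion rate (made-up numbers)
    variant = (160, 2000)  # 8% conversion rate (made-up numbers)
    print(range_95(*control), range_95(*variant))
    print(has_clear_result(control, variant, days_run=2))
    print(has_clear_result(control, variant, days_run=7))
```

With these made-up numbers the two ranges do not overlap, so the check passes once the test has run a full seven days, and it is refused at 48 hours regardless of how large the gap looks.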
What this does is it takes out any of the weird time variance. So, if there was a holiday, or a world event, it's gonna even those out. If you're doing a split test and you're saying, wow, this one's the clear winner, 95%, and you're 48 hours into it, I'm going to highly recommend that you let it keep going for at least seven days. More times than not, and this will frustrate the crap out of you, you're going to find that, oh, you lost your green trophy. The variations came back to some level of parity once you gave it enough time. Now, obviously there are caveats to this: if what you're testing is a two-day promo, then there's no reason to let that run for a week. Hopefully I don't have to say it, but when I was going over this with our data guy, he said, do make sure that there's this caveat, and I'm like, okay, binary-thinking person. So hopefully you guys got that figured out. Generally a week, minimum. Sometimes it makes sense to let it run longer, but I'll tell you, if nothing's happening with what you're testing, just end the test, nothing happened, and we'll talk more about that.
So, we hypothesized, we built, we validated, because the green trophy popped up, and now we optimize. What we're asking from an optimization perspective is basically: do you have a plan B, and is it worth trying again? Maybe you tested something and you're like, dang, this really should work, and it didn't. It's very, very common during the brainstorming phase that you'll come up with an idea, and somebody will make a hypothesis and they'll want to try something, and somebody else will say, ah, I like the idea, but I would do this one instead.
Really simple: you say, great, we're going to do theirs first, because it was their idea, and then we'll do yours if it doesn't work, that will be our plan B. Does anybody else have any? I'll take a plan C, a plan D, plan E, plan F, doesn't matter. Let's come up with some other ideas. Generally the plan B's and things like that will wind up in the hypothesis or the description, you can figure out how you want to do that.
Do you have a plan B, and is it worth trying again? Maybe you still feel good about the concept, but the particular rollout, the particular application, didn't work. Do you have additional ideas for improvement, based on your analysis, that could swing the result? Maybe now that you've run it, you've got some data, and you're saying to yourself, okay, what we're hearing is this, so let's keep trying, but let's do something new. Maybe when you tried it, customer care was getting a lot of people saying that they were confused by a particular part of the offer. That'll happen when you run a new test. People say they're confused, but if you think about it, they emailed your customer success department, so you're beginning to generate some awareness, you're building some activity around it, they were just confused. So during the test you're going, okay, I think we need to tweak the verbiage, I think we need to add in this little pop-up, this little clarifying paragraph, because people seemed to like it, but they're confused and they're getting stuck. Let's do it, and let's run it again.
This is a biggie: were there external forces that could have skewed the data? This happens all the time. A major world event can absolutely, positively skew the data.
We've run promotions before and completely forgot about holidays. We've done promotions over Labor Day and stuff like that, and forgot that a lot of people were off work. Oh, that's why that bombed in the States. If you have a large international audience, it's common to forget about other international holidays, or holidays outside of the States if you happen to be based in the States. So, just keep that in mind. We've also gone against our own big internal promotions. We were trying to run a test on a new pricing strategy at the same time that we were running a Black Friday sale. That was stupid. Here we are running this Black Friday sale where everything's at a discount, and yet at the same time we're trying to do this new price point testing over here. That's just not going to work.
So if it did work, and this is a biggie as far as the validation phase is concerned, does it work when tested across different offers, different channels, and different avatars? So, yay, we got a result, but maybe this particular test worked on this page. Will it work on this other page? For example, maybe you do a headline test on a landing page, and it crushes, it works great for Facebook traffic, but when you look at it from a source perspective, your referral traffic conversion rate is down. Why is that? So, you gotta do some basic source tracking. That may have to do with how the offer fits the channel. Or maybe a new pricing test worked for one product, but it didn't work for another one. You were able to increase the price on one product, you tried increasing it on another, and it didn't work. Say you're doing it across the board: hey, let's test increasing our prices 10% and see what happens. Okay, I love the idea for the test. "I bet if we increase the prices 10% it's not gonna decrease conversion rates at all."
If you go out there and you immediately increase the price 10% across every single product you have, that's bold. I'd recommend that you start with one. But if you start with that one and you get a good result, and then say, well, I guess it works for everything, that's also bold, really, really bold. Or a new webinar: maybe you test the webinar with one segment of your list and it performs well, but it doesn't perform well with another. Just be very, very careful about how you do the rollout.
So, we hypothesized, built, validated, optimized, and now you have to decide: do we scale it, or do we abandon it? This is kinda what I was getting at before. If it worked, then initiate a responsible rollout process to ensure that these results are holding across different campaigns, and channels, and landing pages, and all that. So, don't just say, it won, new control, go. We've done that so many times, and we'd go back and look a few months later and we're like, wow, it's not performing as well as it did. We'd revert to the original controls, and everything was right back where we wanted. We had a false win. So, a responsible rollout is really, really critical, keep that in mind. It's hard to do, 'cause when you get a win it's very exciting, but be very responsible about how you roll that out.
Now, if it didn't work and you're out of ideas, then it's time to abandon. Document those results, including a why theory. This is important: we want to go in and say, okay, why do we think it didn't work? Just like we want to make assumptions on the front end, we want to make assumptions on the back end. Why do we think it didn't work? But the keyword here is theory. Do not allow a failed test in particular, or a winning test, to turn into dogma.
Dogma is expensive, dogma is incredibly expensive. We've had this happen, where we ran tests and they lost and we declared, well, that just doesn't work at all anymore, ever again, and let's never try that, 'cause it just stung too badly. One failed test does not a pattern make. It is a single data point. It is important to inform theory and to have a why theory, but do not allow that theory to become dogma. Where you put that in is over here in status: when you get to in progress, completed, and the "did it work," if the answer is no, it did not, that's where you put in your why theory. So this one document, with a couple of tabs, will basically take care of the complete life cycle of all your tests, from birth to scale, and death if needed.
Step six, report the results. This is kinda where we were headed before: we focused, analyzed, brainstormed, prioritized, and tested, and now it is time to report.