Monday, June 4, 2012

Off the Grid and Epsilon-greedy Testing

Last week, Marek Jastrzebski1 and I joined forces on a project at Blackbaud called 'Off the Grid'. One day a year, anyone in the company can pick a project that interests them, work on it for 24 hours and give a 5 minute presentation on the experience. It was truly exhilarating, and a great opportunity to step away from the daily routine and try something new. I wanted to do something that used customer data to provide some kind of feedback loop with our application. My first thought was to apply A/B testing to some of our more lucrative user-facing web forms. That approach took a sharp turn when Marek showed me a blog post he'd read the day before2.


Epsilon-greedy and Epsilon-first Algorithms


Source: Wikipedia
Deciding which of a set of possible web forms to present to a user is much like the Multi-armed Bandit3 problem. You have a number of slot machines to select from and you want to maximize your winnings. In the case of a web form, the reward is getting the user to follow the page flow to completion (which in many cases means completing a financial transaction). So given several potentially good web form variants, what's the right way to proceed?


Conventional wisdom says pick the two best forms and do an A/B test. With the slot machines, this is like picking two machines, pulling each arm 100 times, declaring a winner and sticking with that machine. This troubles me on several fronts. First, you have to pick two to start with, which seems highly subjective. You have initial assumptions about what the user might do or find useful, presumably there's something to like about each possibility you created, and there's no way of knowing whether the others might have panned out. The other problem is that things totally unrelated to the forms you pick can affect the outcome. Did you run the test at the right time of day, with a significant number of trials, with the right audience at that time? It's really hard to know that the data you collected during any given window is the 'right' data. It's not that A/B tests are bad - surely it's better to test your ideas and back them up with data - it's just that there has to be a better way.


Epsilon-greedy algorithms take a different approach: you never really stop exploring, and you exploit the form that is giving you the best response at the time. There are many ways to tune the algorithm and a few variations on how to implement it, but in the simplest case you serve the choice that is currently giving you the biggest payout 90% of the time, and the remaining 10% of the time you pick a choice at random.
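
To make that concrete, here is a minimal sketch of the selection step in Ruby. The names and data shapes are mine for illustration, not necessarily what our prototype used:

    # Each form tracks how many times it was served and how many users completed it, e.g.
    # forms = { 'dogs' => { 'tries' => 120, 'completions' => 98 }, ... }
    EPSILON = 0.1

    def reward(stats)
      stats['tries'].zero? ? 0.0 : stats['completions'].to_f / stats['tries']
    end

    def choose_form(forms)
      if rand < EPSILON
        forms.keys.sample                                   # explore: any form, at random
      else
        forms.max_by { |_name, stats| reward(stats) }.first # exploit: the best payer so far
      end
    end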


It's the worst feeling when the slot machine next to you starts paying off and someone who is not you is jumping up and down excitedly while you're staring at a row of mismatched fruit. So in many ways this is also an exercise in minimizing regret. By continuing to explore the other alternatives, you might find that one of those variants is starting to pay off in a way you didn't expect. And if it does, you exploit it.


Venturing Off the Grid



Marek and I started by talking about ideas and throwing a lot of stuff up on the whiteboard. We only had one day to work on this, so we wanted to brainstorm just enough to get started. We thought a lot about what the value would be for the customer and sketched out how to start. Later, we'd find some way to serve web pages, hook into our TeamRaiser4 API and register users based on the form the algorithm chose.


One of the cool things about having a time constraint is that it forces you to focus and prioritize, aiming for the minimal level of quality required for a prototype. Several times during this exercise, we went down paths that could have been time consuming and distracting. Is it necessary for tomorrow's demo? If not, we scuttled it. This happened several times on both sides, and I think it really helped distill out the essential ideas.


We wrote the algorithm based on the high-level outline from the blog post. Basically, each form records the number of tries, the number of completions and the reward (completions / tries) that it is currently yielding. We stored this in a YAML file and logged whenever anything changed. This also allowed us to stop and restart the server and continue from where we left off, or reset everything to zero and start again.
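
Roughly, the bookkeeping looked like this. This is a sketch; the file name and method names are illustrative, not lifted from the prototype:

    require 'yaml'

    STATS_FILE = 'stats.yml'   # illustrative path for the persisted counts

    # Load saved counts so a restart picks up where it left off; unseen forms start at zero.
    def load_stats(form_names)
      saved = File.exist?(STATS_FILE) ? YAML.load_file(STATS_FILE) : {}
      form_names.each_with_object({}) do |name, stats|
        stats[name] = saved[name] || { 'tries' => 0, 'completions' => 0 }
      end
    end

    def save_stats(stats)
      File.write(STATS_FILE, stats.to_yaml)
    end

    # A 'try' is recorded when a form is served; a 'completion' when the user finishes the flow.
    def record_try(stats, form)
      stats[form]['tries'] += 1
      save_stats(stats)
    end

    def record_completion(stats, form)
      stats[form]['completions'] += 1
      save_stats(stats)
    end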


Testing the Algorithm


Now that we had a web server serving pages using the epsilon-greedy algorithm, we needed a way to track the algorithm's progress and test its decision-making. For our purposes, we made up a scenario: a new campaign to help provide for pets in a shelter. There are several pages to register through, each prominently featuring a different animal. We'd let the algorithm figure out which one was best and graph the reward over time.


Since the reward is completions/tries, it basically represents the percentage of the time a user will follow the process through to completion, which is what we want. To test that the algorithm was working, we gamed the system to provide a consistent payout for each of the forms. For example, Dogs were set to 85%: 85% of the time, a user who saw the dogs page would complete the form, and 15% of the time that user would drop out before the process was complete. Cats were set to 75%, Fish to 50% and the Zebra to 25%. (We'll get back to the Gazelle later.)
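
Our rigged 'users' amounted to little more than a coin flip weighted by each form's fixed payout. Building on the sketch above (again, illustrative rather than verbatim from the prototype):

    # Fixed completion rates used to game the system for testing.
    PAYOUTS = {
      'dogs'  => 0.85,
      'cats'  => 0.75,
      'fish'  => 0.50,
      'zebra' => 0.25
    }

    stats = load_stats(PAYOUTS.keys)

    # One simulated visit: serve a form via epsilon-greedy, then 'complete' it
    # with that form's rigged probability.
    def simulate_visit(stats)
      form = choose_form(stats)
      record_try(stats, form)
      record_completion(stats, form) if rand < PAYOUTS[form]
    end

    1000.times { simulate_visit(stats) }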


We ran the test and used Google Charts to graph the results. The test would record any change in reward % to a CSV for each of the forms. After a quick coffee we came back to see the first set of results.
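
The logging itself can be very small; something along these lines would do (the column layout here is my reconstruction, not necessarily what we wrote that day):

    require 'csv'

    # Append a timestamped reward row to a per-form CSV whenever the reward changes,
    # so the charts page can plot reward over time for each form.
    def log_reward(form, value)
      CSV.open("#{form}.csv", 'a') do |csv|
        csv << [Time.now.to_i, value.round(4)]
      end
    end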


[Chart: reward % over time for the Dogs, Cats, Fish and Zebra forms]


This is interesting. The algorithm bounced around a bit but quickly noticed that dogs were paying out and the others, not so much. Another thing to note is that fish made a run for it, and while it didn't get much traction, we realized it *could* have. The form could have gone viral, and maybe what we thought was important at the start might have changed; users are notably mercurial, so this wouldn't be surprising. What also stood out is that the percentages were flattening out for dogs but still bouncing around for the other pets. This suggested we just didn't have enough data for the others and that their trends would flatten too (which they did).


Now that dogs ruled the roost, we wanted to see what happened when we introduced a new form. Enter the Gazelles. These quadrupeds are quite likable, so we gamed them to pay out a reward of 98% and watched to see what happened.
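
In the sketch above, dropping a new arm into a running test takes only a couple of lines: give it an empty history and a rigged payout, and the exploration step does the rest.

    # Introduce the Gazelle mid-test with no history and a 98% completion rate.
    PAYOUTS['gazelle'] = 0.98
    stats['gazelle'] ||= { 'tries' => 0, 'completions' => 0 }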

[Chart: reward % over time after introducing the Gazelle form]


Sure enough, the algorithm found that the new form was responding well and its reward continued to rise. We stopped the test to call it a night, but it certainly appeared to be only a matter of time before dogs would give in to the popularity of the gazelles.


Reviewing the Data


We were very pleased with our results: the epsilon-greedy algorithm weighted the forms very close to the rates we'd gamed them to respond at. What was less than ideal was the speed at which the Gazelles took off. It seemed like we either got really unlucky or there was a bug somewhere. When we introduced Gazelles, their reward should have jumped to nearly 100% almost immediately. They should have missed only 2 out of every 100 tries, so the ratio should have climbed much faster than it did that night.


It was a bug. It turned out there was a problem in the way we were saving off the reward that didn't work consistently. After fixing it and re-running the test, we got an immediate response when we introduced Gazelles into the mix.


[Chart: reward % over time after the fix, with the Gazelle form climbing immediately]


That's more like it. We introduce a new form, the algorithm likes it, and it does what it is designed to do - exploit the hell out of it.


Maximizing value for our clients

This would be a really great way to implement things. It allows our clients to develop several forms and, instead of trying to pick the best one, let them all be part of the epsilon-greedy selection. New forms can be dropped in at any time, and if one is popular, it will succeed. No longer do you have to make a best guess or exclude the other forms.


Most importantly, it allows people to innovate. You can take more risks because mistakes in this model are minimized. Sure, you might occasionally serve a form that doesn't perform as well, but it's a tradeoff for detecting that next new form which might be exactly what your website needed.


See for yourself


I didn't go through all this work just to talk about it, though. We have some prototype code, and you can get it running, quite literally, in minutes if you already have Ruby and Git installed.


https://github.com/convio/epsilon-greedy


  • git clone git://github.com/convio/epsilon-greedy.git
  • bundle install
  • cd into /server and run server.rb
  • go to the url: http://localhost:4567/test
  • in another browser tab go to http://localhost:4567/stats
That should start showing you the graphs, and of course you can play around with percentages and other settings in the code itself. You might get slightly different results with each run, but over time each form should start to fit the expected trend.


At some point I'd love to explore some of the other algorithms in this space. If you google around you can find references to quite a few other solutions to this problem and numerous white papers (for the mathematically inclined). 


I think over the next few years we're going to see a lot more in this space. Utilizing customer data and creating a feedback loop to improve what we present to the user is such an awesome thing to be able to do. The more we build innovations like this, the more our customers win.



1. [@rubytester]
2. [20 lines of code that will beat A/B testing every time]
3. [Multi-armed Bandit]
4. [Convio Luminate TeamRaiser for Special Events]
Creative Commons License
The Science of Testing by Hugh McGowan is licensed under a Creative Commons Attribution 3.0 Unported License.