Overactive Vocabulary

When In Doubt, Ameliorate

Refactoring to Services: Payoff

One of the most exciting things I’ve done in my professional life to date is deploying the first big piece of our ongoing effort to break Spreedly up into a series of services. We took user account management and billing and split them out into their own service, and we did it gently, with no “big rewrite” or “stop the world” switchover putting the business in jeopardy.

Why add the overhead of extra services? And yes, they do have overhead - adding more moving parts always adds complexity. I want to talk a lot more in the coming days about what we did, how we did it, and what compelled us to go down this path, but today I’m just going to tease with the bottom line payoff of getting this first piece deployed:

4,328 Deletions

I just stripped over 4,000 lines of code from this app. That code doesn’t have to be loaded, supported, or secured any longer in this context. That’s what I call a win!

Post-Deploy Smoke Tests

Ever find yourself loading your application in a browser after a deploy just to make sure you didn’t break it? Ever forget to do that and have a panicked coworker call you a few minutes later wondering if you know you took the whole site down?

Thing is, you can have the best automated test suite ever, and you can have a great set of passive monitoring tools, but nothing beats an active check of the application after a deploy. After yet another time of making an insignificant tweak that turned out to work just ever so slightly differently in production in such a way as to take one of our apps down post-deploy, I finally said, “Enough!” and made Duff code us up a solution.

That’s right folks, this one isn’t about tooting my own horn, but rather about tooting the horn of fellow Spreedly developer Duff, since he doesn’t do it nearly enough. After we fixed my latest gaffe (which thankfully “only” affected the UI), Duff tackled the problem of adding an active sanity check to our deploy processes that would just do a basic check (or checks) and tell us as part of the deployment itself if the check failed.

I love the term smoke test for this check; originally used in plumbing, the smoke test was when they’d force smoke through newly installed pipes to see if any smoke seeped out. It was later adopted by the electronics industry as well: “You plug in a new board and turn on the power. If you see smoke coming from the board, turn off the power. You don’t have to do any more testing.” 1 The smoke test is that first test you run on a newly assembled system just to make sure it’s not immediately and obviously broken.

So Duff dug in and added a new recipe to our internal deployment gem:

smoke.rb
Capistrano::Configuration.instance(:must_exist).load do

  def all_smoke_tests_passed
    send_to_hipchat("All smoke tests were successful.", color: 'green')
  end

  def smoke_tests_failed(messages)
    puts "\n\n******************\n\nSMOKE TEST FAILURE!!!\n#{messages}\n\n******************\n\n"
    send_to_hipchat("http://f.cl.ly/items/2T3c2B2m3H2l3R2Y3o2z/Image%202013.02.15%201:50:04%20PM.jpeg", message_format: 'text', color: 'red')
    messages.each do |message|
      send_to_hipchat(message, color: 'red')
    end
  end

  task :smoke_test do
  end

  after "deploy", "smoke_test"

end
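The recipe leans on a send_to_hipchat helper that the post doesn’t show. A minimal sketch of what it might look like, assuming HipChat’s v1 rooms/message API and two invented ENV keys (HIPCHAT_ROOM_ID and HIPCHAT_TOKEN), not Spreedly’s actual implementation:

```ruby
require "net/http"
require "uri"

# Hypothetical helper: builds the form parameters for HipChat's
# v1 rooms/message endpoint. ENV key names are invented here.
def hipchat_params(message, color: "yellow", message_format: "html")
  {
    "room_id"        => ENV["HIPCHAT_ROOM_ID"],
    "from"           => "Deploys",
    "message"        => message,
    "message_format" => message_format,
    "color"          => color,
    "auth_token"     => ENV["HIPCHAT_TOKEN"],
  }
end

# Posts the message; matches the call style used in the recipe,
# e.g. send_to_hipchat("...", color: 'green').
def send_to_hipchat(message, options = {})
  uri = URI("https://api.hipchat.com/v1/rooms/message")
  Net::HTTP.post_form(uri, hipchat_params(message, **options))
end
```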

It’s pretty simple: it just hooks in after deploy and runs a (by default empty) task. It also provides convenience methods for taking action upon success or failure. Usage is simple:

deploy.rb
require 'potluck/recipes/smoke'

task :smoke_test do
  failed_tests = []

  curl_result = `curl -s -w "%{http_code}" https://id.spreedly.com/signin -o /dev/null`
  failed_tests << "Id smoke test FAILURE." unless (curl_result == "200")

  curl_result = `curl -s -w "%{http_code}" https://spreedly.com/ -o /dev/null`
  failed_tests << "Public smoke test FAILURE." unless (curl_result == "200")

  if failed_tests.empty?
    all_smoke_tests_passed
  else
    smoke_tests_failed(failed_tests)
  end
end

All we have to do for a project is override the :smoke_test task, do whatever makes sense for that particular app, and use the convenience methods to report the result. Now if a smoke test fails, the whole team knows right away as it gets blasted into our Hipchat room:

Smoke Test Failure in Hipchat

We of course also see if the smoke test passed (though it’s much less “in your face”):

Smoke Test Success in Hipchat

It’s amazing how comforting it is to see that line show up after a deploy.

Of course, there are plenty of other ways we test our apps, both pre- and post-deploy, but the smoke test is a great addition to the overall lineup and I’m grateful to Duff for getting them going for Spreedly. And I’d encourage other developers to consider adding an active check to the deploy process to see if anything starts smoking once everything’s live.

1 Smoke Testing on Wikipedia back

The Stitching Proxy

As part of redesigning spreedly.com to highlight our focus on our gateway API product (usually known affectionately as “Core”), we wanted to move the actual serving of the marketing parts of the site to a more appropriate app - initially our account management service, and eventually a single responsibility content-centric service.

Only one problem: spreedly.com was currently handled by our Subscriptions application, and we didn’t want to change either the API urls or the hosted payment page urls for that service. And it’s not just a temporary problem, either: we’re not discontinuing the Subscriptions product, and while we may eventually move it to its own subdomain, that will be a slow and careful transition that could easily last over a year.

So we were clearly going to have to serve up two different applications off of the same domain. The plan was to continue defaulting urls to the Subscriptions app, but to hijack a specific set of pages and paths and point them at the new app. I worked up a list of paths that should point at the new app, and our handsome and talented devop John worked out how to do the mapping in our frontend load balancer. The production setup actually ended up being super simple since we already use nginx proxying to unicorn, and it was simply a matter of having a particular set of paths proxy to a different backend. And because there’s no special infrastructure to support it, there’s also no performance overhead to sharing the domain.
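The post doesn’t show the nginx side, but conceptually it’s just a default backend plus path-scoped overrides. A hypothetical sketch (upstream names, socket paths, and the specific location rules are all invented here, not taken from Spreedly’s config):

```nginx
# Invented sketch: default traffic goes to Subscriptions, while the
# marketing pages are routed to the new app.
upstream subscriptions { server unix:/tmp/subscriptions.sock fail_timeout=0; }
upstream marketing     { server unix:/tmp/marketing.sock fail_timeout=0; }

server {
  server_name spreedly.com;

  # Everything defaults to the Subscriptions app.
  location / { proxy_pass http://subscriptions; }

  # Exact marketing pages...
  location = /        { proxy_pass http://marketing; }
  location = /terms   { proxy_pass http://marketing; }
  location = /privacy { proxy_pass http://marketing; }

  # ...and whole marketing path prefixes go to the new app.
  location ~ ^/(pricing|about|support|gateways|assets)(/|$) {
    proxy_pass http://marketing;
  }
}
```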

But… what about when we’re developing locally? While I’d eventually like to see us using Boxen or something like it to closely mirror the production setup in our local development environments, we’re a ways off from that yet. Today we use the awesome pow to handle our local development serving, and while it does all kinds of cool things, it won’t serve multiple apps on the same domain.

I wanted to get something going quickly, which meant I needed a simple solution that integrated with our existing setup with a minimum of fuss. Enter rack-proxy. This handy tool lets you create a piece of Rack middleware that will proxy the whole request over to another app of your choosing. Using the excellent Rails guide about Rack and a nice template in a Stack Overflow answer, I quickly had something up and running:

config/initializers/stitcher.rb
if ENV["STITCH"] == "1"
  require "rack-proxy"

  class Stitcher < Rack::Proxy
    EXACT = %w(/ /terms /privacy)
    PREFIX = %w(/pricing /about /support /gateways /assets)

    def initialize(app)
      @app = app
    end

    def call(env)
      original_host = env["HTTP_HOST"]
      rewrite_env(env)
      if env["HTTP_HOST"] != original_host
        rewrite_response(perform_request(env))
      else
        @app.call(env)
      end
    end

    def rewrite_env(env)
      request = Rack::Request.new(env)

      return env unless(request.host == ENV["PUBLIC_FULL_DOMAIN"])

      unless(
          EXACT.include?(request.path) ||
          EXACT.include?(request.path + "/") ||
          PREFIX.include?(request.path) ||
          PREFIX.any?{|prefix| request.path.starts_with?(prefix + "/")})
        env["HTTP_HOST"] = ENV["SUBSCRIPTIONS_INTERNAL_DOMAIN"]
      end

      env
    end

    def extract_http_request_headers(env)
      headers = super
      headers["Host"] = ENV["PUBLIC_FULL_DOMAIN"]
      headers
    end
  end

  Rails.application.middleware.insert_before ActionDispatch::Static, Stitcher
end

Let’s walk through this real quick; it’s pretty straightforward, but there are a few nuances I want to highlight. First of all, you’ll notice this only runs if a particular ENV key is set. Doing so leverages our extensive use of environment variables for app configuration, and ensures that this will only run in development and never in production. Eventually I’d like to write more about the benefits we’ve seen from switching almost exclusively to configuration via environment variables, but for now you can just note that they’re used in multiple places in this little proxy to pull the right value.

Next, note that I had to override #call - this is necessary to allow passing through unproxied requests to the “hosting” application. It’d be nice if rack-proxy provided a way to do this out of the box, but it’s simple enough to implement regardless. The decision of whether or not to proxy the request is made by looking at whether the HTTP_HOST for the request has changed; if it has, then we want this request to go to the proxied app. Also note the call to #rewrite_response - this is a method provided by rack-proxy to allow adjusting the response before it is passed back to the browser. It’s not strictly necessary for me to call it here since its default behavior is to do nothing, but I used it multiple times during debugging to get a peek at what the proxy was doing, so it’s handy to keep it in the chain.

The magic happens in #rewrite_env, where I check against the set of exact and prefix paths I want rewritten and determine whether the request matches them. If it does, the HTTP_HOST is set to the proxied app’s host, and we’re good to go. The one other check I am doing here is that I’m only ever proxying when on the public/marketing domain, since this app actually serves a couple of subdomains and I only want to rewrite on one of them.
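Pulled out on its own, the matching rule is easy to exercise (this standalone sketch uses plain Ruby’s start_with? rather than ActiveSupport’s starts_with?; the marketing_path? name is invented for illustration):

```ruby
# A true result means the request matches a marketing page and is
# served locally; false means the host gets rewritten and the
# request is proxied through to the Subscriptions app.
EXACT  = %w(/ /terms /privacy)
PREFIX = %w(/pricing /about /support /gateways /assets)

def marketing_path?(path)
  EXACT.include?(path) ||
    EXACT.include?(path + "/") ||
    PREFIX.include?(path) ||
    PREFIX.any? { |prefix| path.start_with?(prefix + "/") }
end

marketing_path?("/pricing/enterprise") # => true, stays local
marketing_path?("/subscribers/123")    # => false, proxied to Subscriptions
```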

Finally, there’s a bit of trickery going on in #extract_http_request_headers to make everything work smoothly. Even though I’m rewriting the host that the request should go to, I don’t want the Host header that’s passed to the proxied app to be affected, since it should be as clueless about the fact that it’s being proxied as possible. rack-proxy uses this method to prep the headers for submission upstream, and so I just hook it and make my own last minute adjustment.

And then the middleware is inserted at the very beginning of the middleware stack, and BOOM! it all just works. Well, not quite: I spent hours debugging an issue with the headers, but other than that I really did have a solution working in a couple hours. And best of all, we now have a way in our local development environment to simulate the exact domain structure in production.

If you need to stitch another app into a Rack stack, I’d definitely recommend checking out rack-proxy.

Never Debug HTTP in the Browser

It was coming down to the wire: the new Spreedly site was launching in less than 48 hours, and I really wanted to be able to try out the way we were going to be stitching two different applications together on one domain. It was the one piece of the puzzle that we couldn’t “pre-deploy” to production, since turning it on by definition would launch the site publicly to everyone. So I started hacking together a Rack-based proxy (more about that in the future) to simulate the production setup, and within an hour or so I had something that was mostly working.

The problem was the “mostly” part. The proxy was working, the right things were loading on the right paths, but I couldn’t log into the proxied application. Logging in was actually working and redirecting me into the app, but then I was immediately being redirected back to the login page. It was like I couldn’t write to the session. Maybe a cookie problem?

I do most of my browsing with Chrome, and am generally a fan of the developer tools it ships with. So I started poking, going back and forth between the server logs and the browser’s request log, trying to figure out what was going on. It didn’t take long to confirm that the server was definitely writing out to the session correctly, but that for some reason the browser wasn’t returning the right session information back to the server. I also confirmed that the site worked fine when not proxied, and it was something proxy related that was breaking things.

And then I spent multiple hours spinning my wheels, trying to figure out what was different between the working direct requests and the broken proxied requests. I tried looking at requests in Chrome for a while, and then I switched to Firefox and tried a few different tools there as well. It was getting late, and I was burning out, and I was making no progress, and I should just quit…

But my conundrum was that while I was pretty sure this was a problem that was specific to my hacked up proxy, I wasn’t 100% sure, and I didn’t want to flip the switch on Monday in production only to find out that nobody could log in. But finally, with no real progress to show for a few hours of debugging, I gave in and gave it a rest.

And sure enough, while I didn’t figure the answer out after quitting, I did realize I’d been approaching my debugging the wrong way. Debugging is an art, and I’d been doing the equivalent of trying to play a violin with a hack saw. What I needed was a tool that would let me textually compare the headers of the working and broken requests.

Enter curl, which I should’ve pulled out right away. I knew at this point that something was up with the cookies, but I had no earthly idea what. So I dug into the curl man page and started reading through the options (there are a lot of them) looking for how I could see what was happening with the cookies. OK, actually I just ran man curl | grep -A10 cookie since I didn’t have time to read the whole tome.

Working with what I learned from the man page, I first got myself set up so I could easily run the same request against the (working) unproxied app and the (broken) proxied app. Then I used curl’s handy -c option to have it write out a cookiefile when I made a request:

$ curl \
  -i \
  -c cookies.working \
  -d "login=bob" -d "password=test" \
  http://workingapp/login

Nothing magic here; -i tells curl to print out the headers along with the response, -c stores the cookies in a curl-formatted file, and the -d switches pass in the login information (and make curl do a POST). I then repeated this against the broken version of the app, storing the cookies in a different file. By looking at the headers and responses from the two requests, I could see that both redirected me into the app, indicating that as far as the server was concerned, I had auth’d successfully.

Next step was to verify that curl would re-present the cookies correctly, and that it could reproduce the login issue. To do that I simply passed the written out cookie file to the app and noted whether I got redirected (indicating a missing session) or a 200 OK:

$ curl \
  -i \
  -b cookies.working \
  http://workingapp/innerpage

The -b switch does the magic here, reading the cookie file written out by the previous command and passing the cookies to the app just like a browser would. And sure enough, the working version gave me back a 200 and the broken version redirected me back to the login page.

Now to dig into the cookie files themselves and see what was different about them! It took a few minutes of staring and poking, but very quickly I noticed that the broken version only had one cookie in its file, while the working version had written 4+ cookies. How was that happening?

Long story short, rack-proxy uses Ruby’s Net::HTTP heavily, and Net::HTTP tries to treat headers like a hash. But headers aren’t a hash; you can pass multiple headers of the same type, and that’s exactly how Set-Cookie works. Net::HTTP papers over this by joining with a comma the values of any header key that appears more than once. And thus I was ending up with one big invalid cookie rather than the four cookies I should’ve had.
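The flattening is easy to reproduce in isolation: Net::HTTP’s hash-style header access joins repeated fields with ", ", while #get_fields preserves each value separately (the kind of per-value access a fix needs):

```ruby
require "net/http"

# Build a response by hand and add two Set-Cookie headers, the way a
# login response would.
res = Net::HTTPOK.new("1.1", "200", "OK")
res.add_field("Set-Cookie", "session=abc123; path=/")
res.add_field("Set-Cookie", "remember=1; path=/")

res["Set-Cookie"]
# => "session=abc123; path=/, remember=1; path=/"  (one mangled value)

res.get_fields("Set-Cookie")
# => ["session=abc123; path=/", "remember=1; path=/"]
```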

Now it was as simple as figuring out how to get Net::HTTP to give me the header values individually and submitting a pull request with a fix. I locked rack-proxy to the fixed revision in our Gemfile, and everything just worked. And while this problem wouldn’t have affected us in production (since nginx is doing all the proxying for us), I went into the launch on Monday morning much more confident since I knew this wouldn’t bite us when we flipped the switch.

Moral of the story? Browsers are the wrong tool for debugging HTTP. You need to be able to see and textually compare what’s traversing the wire, and even though the developer tools provided by modern browsers can give you a peek, they often obfuscate issues rather than making them obvious. curl and other text-based tools like it are the best way to see what’s really happening and quickly track down the problems that arise with HTTP.

Kernel#backtrace

The other day I was cranking along on part of our Spreedly codebase, furiously working to get it ready for the launch of our new marketing site and the publicity that (we hoped) was due to come with it. Now, it may not be obvious from the outside, but we re-did a lot more than just the look and feel of Spreedly, and as part of that I was working on getting billing working smoothly throughout the system.

One of the key things about how we bill is that a lot of it’s made up of very small metered charges. Because they’re so small, it’s not a huge deal if a metered fee here or there falls on the floor, and so as long as we never over charge a customer we’re more concerned with fixing issues that come up than we are with capturing every last penny. Thus there are various places in the chain of “add a fee” where things could go wrong that we just let them fail and notify ourselves rather than writing fancy recovery code.

But of course, this only works if you notify yourself in such a way that you can fix the problem, and that means a stack trace should show up in the notification. As I coded along I found myself in a situation where something could go wrong and yet it wasn’t due to an exception but rather just due to an “unexpected input” situation. That meant I had no exception to grab the stack trace off of, and for a few minutes I was stumped.

Now, I knew about Kernel#caller, but the problem with it is that it starts with (as indicated by the name) your caller’s stack, and in this case I wanted the stack to start right where I was creating the notification. Turns out there’s a simple solution:

module Kernel
  def backtrace
    caller
  end
end

I dropped that code in a Rails initializer and then just called backtrace when building my notification, and it works beautifully.
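For instance (the patch is repeated here so the snippet stands alone, and record_billing_failure is an invented stand-in for a notification builder, not code from the post):

```ruby
module Kernel
  def backtrace
    caller
  end
end

# Hypothetical notification builder: no exception is raised, but we
# still want a trace in the payload. The first frame of the trace
# points inside record_billing_failure itself, which a direct call
# to Kernel#caller here would omit.
def record_billing_failure(input)
  { message: "Unexpected input: #{input.inspect}", trace: backtrace }
end

notification = record_billing_failure(amount: -5)
notification[:trace].first # frame inside record_billing_failure
```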

Now, I’m pretty cautious about going in and adding methods to Kernel, but I think this is a good example of why it’s nice that Ruby lets us do so when we want to. If you find yourself needing a backtrace without raising an exception, I hope this helps you out!