While MediaMachine is a new product, we’ve been writing HTTP-based APIs (usually RESTish, with JSON) for a long time, and we thought it would be worthwhile to walk through some of our most basic production metrics. Hopefully this can help others who are new to API monitoring and want better visibility into what’s happening.
These are really two metrics, but they’re closely related: we want to know when the service stops and when it starts, so we add a metric for each event. This helps make deploys more visible, but if you see these firing when a deploy isn’t happening, you know you’ve got a catastrophic crash or some other unexpected behaviour. If you can figure out how to add an additional metric for unexpected stops/starts, that’s a great one to alert on.
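As a minimal sketch of the idea: emit one metric when the process comes up and another when it shuts down cleanly. The `emit_metric` function and the `events` list here are hypothetical stand-ins for whatever metrics client you actually use (statsd, CloudWatch, etc.).

```python
# Sketch of start/stop metrics. `events` and emit_metric() are
# illustrative stand-ins for a real metrics client.
import atexit

events = []

def emit_metric(name, value=1):
    events.append((name, value))

def on_start():
    emit_metric("server.start")

def on_stop():
    # Fires on clean shutdown. A stop without a matching deploy
    # (or a start without a stop) is the case worth alerting on.
    emit_metric("server.stop")

on_start()
atexit.register(on_stop)
```

Note that `atexit` only fires on clean interpreter shutdown; a hard crash (OOM kill, `kill -9`) will skip it, which is exactly why the *absence* of an expected stop metric is also a useful signal.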
Catastrophic crashes should ideally be caught by the service when they occur in the context of a single request and simply transformed into a 500 error response for that particular request, because you don’t want the whole server brought down by one bad request. We put a metric on these 500s because they’re basically “caught” errors that indicate we’ve got a problem on the server side that needs our attention.
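A minimal sketch of that catch-and-count pattern, with a plain `Counter` standing in for a real metrics client (most web frameworks give you a middleware or error-handler hook to do exactly this):

```python
# Contain a handler crash to a single request: return a 500 to the
# caller and increment a counter so it shows up in metrics.
from collections import Counter

counters = Counter()  # stand-in for a real metrics client

def handle_request(handler, request):
    try:
        return handler(request)
    except Exception:
        # The crash is contained to this one request; the server
        # itself keeps running.
        counters["http.500"] += 1
        return {"status": 500, "body": "internal server error"}

def bad_handler(request):  # hypothetical buggy handler
    raise RuntimeError("boom")

resp = handle_request(bad_handler, {})
```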
“Logged errors” are another kind of “caught” error, but they’re the kinds of things that might not be worth a 500. For example, if during user sign-up you want a side effect of sending a welcome email, you might return a 200 for the sign-up request whether or not the email was sent correctly. You might just log an error if one occurs in a non-critical side effect.
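The sign-up example might look something like this sketch, where the email sender is a hypothetical (and here, deliberately failing) side effect, and a counter tracks logged errors alongside the log line itself:

```python
# Non-critical side-effect failure: log it and count it, but still
# return success for the request that actually mattered.
import logging
from collections import Counter

counters = Counter()  # stand-in for a real metrics client
log = logging.getLogger("signup")

def send_welcome_email(address):
    # Hypothetical side effect; simulated failure for illustration.
    raise ConnectionError("smtp down")

def sign_up(address):
    # ... persist the user here (the critical path) ...
    try:
        send_welcome_email(address)
    except Exception:
        # Non-critical: don't fail the sign-up over it.
        log.exception("welcome email failed for %s", address)
        counters["logged_errors"] += 1
    return {"status": 200}

resp = sign_up("alice@example.com")
```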
HTTP is pretty great in how it distinguishes between client errors (400, 401, 403, etc.) and server errors (all the codes starting with 5, like 500). Splitting the problem space in half makes errors much easier to debug, but it’s also very helpful for telling users whether they made the mistake or whether we’ve got a faulty service.
Errors in the four-hundred block (400, 401, 403, etc), which I’ll now call 4xxs, are an indication that the problem is not on the server and that the request is not going to succeed if retried. These can be further broken down into expected and unexpected 4xxs.
Some example expected 4xxs:
- user logs in with a bad password (401)
- user signs up with an incorrect email address (400)
- user tries to delete a resource that’s already been deleted (404)
Some example unexpected 4xxs:
- User tries to access a restricted resource without logging in first (401)
- A logged in user tries to access a resource that they don’t have access to (403)
- User makes a POST with an invalid body for that resource (400)
- User requests the status of one of our video thumbnailing jobs with an invalid job id (400)
- User requests a url that isn’t routable (404)
Unexpected 4xxs are the more interesting of the two, because we publish client libraries, and unexpected 4xxs would indicate a bug in those libraries. If you have mobile or web clients using your API, this metric can help you find bugs in those as well.
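One way to sketch the expected/unexpected split is a lookup of (endpoint, status) pairs you consider normal, with everything else counted as unexpected. Which pairs belong in the expected set is entirely application-specific; the ones below are illustrative assumptions:

```python
# Split 4xx responses into "expected" and "unexpected" counters.
from collections import Counter

counters = Counter()  # stand-in for a real metrics client

# App-specific judgment call: these (endpoint, status) pairs are
# normal user behaviour, not bugs. Illustrative only.
EXPECTED_4XX = {
    ("POST /login", 401),       # bad password
    ("DELETE /resource", 404),  # already deleted
}

def record_4xx(endpoint, status):
    kind = "expected" if (endpoint, status) in EXPECTED_4XX else "unexpected"
    counters[f"http.4xx.{kind}"] += 1

record_4xx("POST /login", 401)          # bad password: expected
record_4xx("GET /jobs/not-a-job", 400)  # invalid job id: unexpected
```

Alerting on the unexpected counter (and merely watching the expected one) is the payoff of keeping them separate.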
200s/202s/204s/etc. are worth a metric as well, just to know that things are actually working. You may not want to log successful requests because they’re so numerous and not actionable, so a metric can be a much more affordable way to keep those successes visible. Dividing this metric by the total number of requests gives you a nice success-rate metric.
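The success-rate calculation is just that division, sketched here with made-up counts:

```python
# Success rate = 2xx count / total request count.
from collections import Counter

# Illustrative counts for one reporting window.
counters = Counter({"http.2xx": 995, "http.4xx": 4, "http.5xx": 1})

total = sum(counters.values())
success_rate = counters["http.2xx"] / total  # 995 / 1000 = 0.995
```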
We like to make sure that our servers are running with some kind of separate supervisor service somewhere else that regularly hits a status endpoint (expecting a 200 response).
Sometimes we have the supervisor service test even more than that, but the primary value is that you’re regularly hitting your API like a user would to make sure it’s still running. You want to have a metric on that status endpoint so you can alert when the frequency of hits is too low. This will tell you when a deploy went poorly, or when your API is in some kind of a zombie mode (running, but not doing anything), or if there’s a network issue where your API is unreachable. You can use 2xx metrics for the same thing, but a heartbeat metric like this doesn’t have to rely on a steady stream of real traffic.
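The alerting side of a heartbeat can be sketched as: the status endpoint records when it was last hit, and a staleness check fires when that timestamp gets too old. The names and the 60-second threshold here are assumptions for illustration:

```python
# Heartbeat sketch: the supervisor hits status_endpoint() on a
# schedule; heartbeat_is_stale() is what you'd alert on.
import time

last_heartbeat = {"ts": 0.0}

def status_endpoint():
    # Hit regularly by the external supervisor; 200 means healthy.
    last_heartbeat["ts"] = time.monotonic()
    return 200

def heartbeat_is_stale(max_age_seconds=60):
    # True when the endpoint hasn't been hit recently enough --
    # a bad deploy, a zombie process, or a network problem.
    return time.monotonic() - last_heartbeat["ts"] > max_age_seconds

status_endpoint()
```

Using a monotonic clock here avoids false alarms when the wall clock is adjusted.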
Just tracking requests at all is pretty critical for a few reasons: (1) A lot of scaling issues are caused simply by the volume of traffic. You need to watch volume to see correlations. (2) Most of the metrics above should be viewed as a fraction of traffic. 10 errors in a hundred requests is probably a big issue. 10 errors in a million requests is probably a pretty solid service.
Response time (in milliseconds) is critical for finding slow, hung, or timed-out responses, none of which will necessarily count as an error, but all of which make for a pretty terrible user experience.
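Response time is usually captured with a small timing wrapper around each handler. This sketch uses a plain list as a stand-in for a real timer/histogram metric, and a deliberately slow hypothetical handler to show the measurement:

```python
# Time each request in milliseconds via a decorator; `timings`
# stands in for a real histogram/timer metric.
import time

timings = []

def timed(handler):
    def wrapper(request):
        start = time.perf_counter()
        try:
            return handler(request)
        finally:
            # Recorded even when the handler raises, so slow
            # failures are visible too.
            timings.append((time.perf_counter() - start) * 1000.0)
    return wrapper

@timed
def slow_handler(request):  # hypothetical slow endpoint
    time.sleep(0.01)
    return {"status": 200}

resp = slow_handler({})
```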
If you’re adding these metrics to a running service that hasn’t had them before, you’re going to find out that the service is failing in all kinds of ways you didn’t anticipate. The real work of solidifying the service can begin as soon as you can actually see what’s going wrong.