How to monitor third-party API integrations
Many enterprises and SaaS companies depend on a variety of external API integrations in order to build an awesome customer experience. Some integrations may outsource certain business functionality, such as handling payments or search, to companies like Stripe and Algolia. You may have integrated other partners and thereby expanded the functionality of your product offering. For example, if you want to add real-time alerts to an analytics tool, you might want to integrate the PagerDuty and Slack APIs into your application.
If you’re like most companies though, you’ll soon realize you’re integrating hundreds of different vendors and partners into your app. Any one of them could have performance or functional issues impacting your customer experience. Worse yet, the reliability of an integration may be less visible than that of your own APIs and backend. If the login functionality is broken, you’ll have many customers complaining they can’t log into your website. However, if your Slack integration is broken, only the customers who added Slack to their account will be impacted. On top of that, since the integration is asynchronous, your customers may not realize the integration is broken until days later, when they notice they haven’t received any alerts for some time.
How do you ensure your API integrations are reliable and high performing? After all, if you’re selling a real-time alerting feature, your alerts had better be real-time and, at the very least, guaranteed to be delivered. Dropping alerts because your Slack or PagerDuty integration is failing is unacceptable from a customer experience perspective.
What to monitor
Latency
Specific API integrations that have an exceedingly high latency could be a signal that your integration is about to fail. Maybe your pagination scheme is incorrect, or the vendor has not indexed your data in the best way for you to efficiently query.
Latency best practices
Average latency only tells you half the story. An API that consistently takes one second to complete is usually better than an API with high variance. For example, if an API typically takes only 30 milliseconds, but 1 out of 10 API calls takes up to five seconds, then your customers get a highly variable experience. This makes it much harder to track down bugs, and harder still to design the customer experience around. That’s why the 90th and 95th percentiles are more important to focus on than the average.
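To make the difference between the average and the higher percentiles concrete, here is a minimal sketch in Node.js. The `percentile` and `mean` helpers and the sample data are hypothetical, and the nearest-rank method used here is only one of several common percentile definitions.

```javascript
// Nearest-rank percentile over a list of latency samples (in milliseconds).
function percentile(samples, p) {
  if (samples.length === 0) {
    throw new Error('no samples');
  }
  // Sort a copy numerically so the caller's array is left untouched.
  var sorted = samples.slice().sort(function (a, b) { return a - b; });
  // Smallest value that has at least p percent of the samples at or below it.
  var rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(samples) {
  var sum = samples.reduce(function (acc, x) { return acc + x; }, 0);
  return sum / samples.length;
}

// Made-up data: 18 fast calls at 30 ms and 2 slow calls at 5000 ms (1 in 10).
var latencies = [];
for (var i = 0; i < 18; i++) {
  latencies.push(30);
}
latencies.push(5000, 5000);

console.log('mean:', mean(latencies));
console.log('p50:', percentile(latencies, 50));
console.log('p90:', percentile(latencies, 90));
console.log('p95:', percentile(latencies, 95));
```

With this made-up sample the mean is 527 ms, which describes neither the fast calls nor the slow ones. The median and 90th percentile are both 30 ms, and the slow calls only become visible at the 95th percentile, which is why it helps to watch more than one percentile.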
Reliability
Reliability is a key metric to monitor, especially since you’re integrating APIs that you don’t have control over. What percentage of API calls are failing? In order to track reliability, you should have a rigid definition of what constitutes a failure.
Reliability best practices
While any API call that has a response status code in the 4xx or 5xx family may be considered an error, you might have specific business cases where the API appears to complete successfully and yet the API call should still be considered a failure. For example, a data API integration that consistently returns no matches or no content could be considered failing, even though the status code is always 200 OK. Another API could be returning bogus or incomplete data. Data validation is critical for measuring whether the data returned is correct and up to date.
Not every API provider and integration partner follows the suggested status code mapping, which is another reason to define failures per integration rather than by status code alone.
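One way to encode such a definition is sketched below. The response shape (`{ status, body }`), the `isFailure` and `failureRate` helpers, and the `hasMatches` validator are all hypothetical; a real integration would plug in its own validation rules.

```javascript
// Decide whether a response counts as a failure for reliability tracking.
// `response` is assumed to be a plain object: { status: number, body: any }.
// `validateBody` is an optional, integration-specific check that returns
// true when the payload is usable.
function isFailure(response, validateBody) {
  if (response.status >= 400) {
    return true; // hard failure: 4xx or 5xx
  }
  if (typeof validateBody === 'function' && !validateBody(response.body)) {
    return true; // soft failure: 2xx status but unusable data
  }
  return false;
}

// Hypothetical validator for a search-style data API: an empty result set
// is treated as a failure.
function hasMatches(body) {
  return Boolean(body) && Array.isArray(body.results) && body.results.length > 0;
}

// Fraction of responses that count as failures.
function failureRate(responses, validateBody) {
  if (responses.length === 0) {
    return 0;
  }
  var failures = responses.filter(function (r) {
    return isFailure(r, validateBody);
  }).length;
  return failures / responses.length;
}

var sampleResponses = [
  { status: 200, body: { results: ['a'] } },
  { status: 200, body: { results: [] } },
  { status: 503, body: null },
  { status: 200, body: { results: ['b'] } }
];

console.log('status codes only:', failureRate(sampleResponses));
console.log('with validation:', failureRate(sampleResponses, hasMatches));
```

Counting status codes alone reports one failure in four here, while adding the validator reports two in four, because the empty 200 OK response is also treated as a failure.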
Availability
While reliability is specific to errors and functional correctness, availability and uptime are pure infrastructure metrics - they measure how often a service has an outage, even a temporary one. Availability is usually measured as a percentage of uptime per year or number of 9’s.
Availability % | Downtime per year | Downtime per month | Downtime per week | Downtime per day |
---|---|---|---|---|
90% (“one nine”) | 36.53 days | 73.05 hours | 16.80 hours | 2.40 hours |
99% (“two nines”) | 3.65 days | 7.31 hours | 1.68 hours | 14.40 minutes |
99.9% (“three nines”) | 8.77 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes |
99.99% (“four nines”) | 52.60 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds |
99.999% (“five nines”) | 5.26 minutes | 26.30 seconds | 6.05 seconds | 864.00 milliseconds |
99.9999% (“six nines”) | 31.56 seconds | 2.63 seconds | 604.80 milliseconds | 86.40 milliseconds |
99.99999% (“seven nines”) | 3.16 seconds | 262.98 milliseconds | 60.48 milliseconds | 8.64 milliseconds |
99.999999% (“eight nines”) | 315.58 milliseconds | 26.30 milliseconds | 6.05 milliseconds | 864.00 microseconds |
99.9999999% (“nine nines”) | 31.56 milliseconds | 2.63 milliseconds | 604.80 microseconds | 86.40 microseconds |
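The figures in the table follow from a simple formula: downtime equals the unavailable fraction multiplied by the length of the period. The sketch below reproduces two of the entries, assuming a year of 365.25 days, which is consistent with the yearly column. The function and constant names are hypothetical.

```javascript
var SECONDS_PER_DAY = 24 * 60 * 60;
var SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY; // assumes a 365.25-day year

// Allowed downtime, in seconds, for a given availability over a period.
function downtimeSeconds(availabilityPercent, periodSeconds) {
  return (1 - availabilityPercent / 100) * periodSeconds;
}

// "Three nines" over a year, expressed in hours (about 8.77).
var threeNinesHoursPerYear = downtimeSeconds(99.9, SECONDS_PER_YEAR) / 3600;

// "Four nines" over a day, expressed in seconds (about 8.64).
var fourNinesSecondsPerDay = downtimeSeconds(99.99, SECONDS_PER_DAY);

console.log(threeNinesHoursPerYear.toFixed(2) + ' hours per year');
console.log(fourNinesSecondsPerDay.toFixed(2) + ' seconds per day');
```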
Usage
Many API providers price on API usage. Even if the API is free, they most likely have some sort of rate limiting implemented on the API that ensures bad actors are not starving good clients. This means tracking the API usage with each of your integration partners is critical, so that you can identify when your current usage is close to plan or rate limits.
Usage best practices
It’s recommended to tie usage back to your end users even if the API integration is downstream from your customer experience. This enables measuring the direct ROI of specific integrations and identifying trends. For example, let’s say your product is a CRM, and you are paying Clearbit $199 a month to enrich up to 2,500 companies. That is a direct cost that’s tied to your customers’ usage. If you have a free tier and a specific customer is using most of your Clearbit quota, then you may want to reconsider your pricing strategy. Perhaps in that case Clearbit enrichment should only be offered on paid tiers.
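As a rough illustration of tying usage back to end users, the sketch below counts calls per user against a monthly quota. The `createUsageTracker` helper, the user IDs, and the in-memory storage are hypothetical; a production system would more likely persist these counts in a database or analytics tool.

```javascript
// Tracks calls to one third-party API per end user, against a monthly quota.
function createUsageTracker(monthlyQuota) {
  var callsByUser = new Map();
  var total = 0;

  return {
    // Record one outgoing API call made on behalf of `userId`.
    record: function (userId) {
      callsByUser.set(userId, (callsByUser.get(userId) || 0) + 1);
      total += 1;
    },
    // Fraction of the monthly quota consumed so far.
    quotaUsed: function () {
      return total / monthlyQuota;
    },
    // The n users responsible for the most calls, heaviest first.
    topUsers: function (n) {
      return Array.from(callsByUser.entries())
        .map(function (entry) { return { userId: entry[0], calls: entry[1] }; })
        .sort(function (a, b) { return b.calls - a.calls; })
        .slice(0, n);
    }
  };
}

// Made-up scenario: a 2,500-call plan mostly consumed by one free-tier user.
var tracker = createUsageTracker(2500);
for (var i = 0; i < 2000; i++) {
  tracker.record('free-user-1');
}
for (var j = 0; j < 250; j++) {
  tracker.record('paid-user-2');
}

console.log('quota used:', tracker.quotaUsed());
console.log('heaviest user:', tracker.topUsers(1)[0]);
```

A report like this makes it easy to see both how close you are to the plan limit and which customers are driving the cost.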
How to monitor API integrations
Monitoring API integrations seems like the correct remedy to stay on top of these issues. However, traditional Application Performance Monitoring (APM) tools like New Relic and AppDynamics focus more on monitoring the health of your own websites and infrastructure. This includes infrastructure metrics like memory usage and requests per minute, along with application-level health, such as Apdex scores and latency. Of course, if you’re consuming an API that’s running in someone else’s infrastructure, you can’t just ask your third-party provider to install an APM agent that you have access to. You’ll need a way to monitor the third-party APIs indirectly, or via some other instrumentation methodology.
The challenge of monitoring outgoing API calls
In order to monitor third-party API calls, you need a mechanism to log both the outgoing API calls from your application (before they hit a third party) and the responses. Many APM agents capture information at the process level, which is great for capturing infrastructure metrics like JVM threads, profiling, or memory percentage, but poor for capturing application-specific context. A second challenge is that outgoing calls in your application are usually scattered across your own code and across all the vendors’ SDKs you’re using in your application. Each third-party SDK has its own callbacks and logic to handle API communication. This makes it hard to capture outgoing API calls via a cross-cutting concern or centralized proxy. So you may decide not to leverage vendor-specific SDKs, or to modify them in order to intercept API traffic. The latter presents a considerable amount of busy work that very few engineering teams have time for.
Real user API monitoring is the recommended practice leveraged by modern tech companies. Real user API monitoring looks at real customer traffic rather than scheduled probes. This enables you to catch deeper integration issues, such as those tied to specific access patterns, that a health probe won’t catch.
How we did it at Moesif
At Moesif, we built a SaaS solution that tracks third-party APIs. We did this by leveraging monkey patching and reflection to inject tracking code into the core HTTP client libraries for common programming languages. Because the instrumentation is within the app itself, there is an opportunity to add business-specific context, from the original end user who initiated the API call to specific response headers that give insight into rate limiting or other soft errors. We can do this since most languages rely on a single or small number of core HTTP libraries. For Node.js, most HTTP clients are built on the http or the https module. For Python, the majority of higher-level clients leverage the Python requests library.
An example middleware for Node.js to instrument outgoing API calls is below:
// 1. Import the modules
var express = require('express');
var app = express();
var moesifExpress = require('moesif-express');

// 2. Set the options, the only required field is applicationId.
var moesifMiddleware = moesifExpress({
  applicationId: 'Your Moesif Application Id',
  logBody: true
});

// 3. Start capturing outgoing API Calls to 3rd party services like Stripe
moesifMiddleware.startCaptureOutgoing();
The startCaptureOutgoing() call will monkey patch the Node.js http and https modules by overriding the request and get methods on the http and https objects with our modified ones, which intercept the outgoing API calls. An example of how we perform the monkey patching for Node.js is below, and the full SDK is also available on GitHub.
var http = require('http');
var https = require('https');

// Replaces request and get on http/https with wrappers that record each
// outgoing call. track(), not shown here, is the SDK function that records
// the request and its response. Returns a function that undoes the patch.
function _patch(recorder, logger) {
  // Keep references to the originals so they can be called and later restored.
  var originalGet = http.get;
  var originalHttpsGet = https.get;
  var originalRequest = http.request;
  var originalHttpsRequest = https.request;

  http.request = function(options, ...requestArgs) {
    var request = originalRequest.call(http, options, ...requestArgs);
    // Guard against tracking the same request twice.
    if (!request._mo_tracked) {
      request._mo_tracked = true;
      track(options, request, recorder, logger);
    }
    return request;
  };

  https.request = function(options, ...requestArgs) {
    var request = originalHttpsRequest.call(https, options, ...requestArgs);
    if (!request._mo_tracked) {
      request._mo_tracked = true;
      track(options, request, recorder, logger);
    }
    return request;
  };

  // get is request followed by end(), so route it through the patched request.
  http.get = function(options, ...requestArgs) {
    var request = http.request.call(http, options, ...requestArgs);
    request.end();
    return request;
  };

  https.get = function(options, ...requestArgs) {
    var request = https.request.call(https, options, ...requestArgs);
    request.end();
    return request;
  };

  // Restore the original, unpatched methods.
  function _unpatch() {
    http.request = originalRequest;
    https.request = originalHttpsRequest;
    http.get = originalGet;
    https.get = originalHttpsGet;
  }

  return _unpatch;
}
How to leverage monitoring
Just setting up dashboards and reporting on your API integrations is only half the story. There are numerous ways you can leverage this data to improve your customer experience:
- Set up alerts when metrics are out of bounds or show anomalous behavior. An easy way is through PagerDuty, Slack, or other channels.
- Hold partners accountable to their SLAs. When you’re consistently running into latency issues, or failures with a single vendor, let them know. If you already have a full audit log showing what happened, the partner may be able to tweak their infrastructure to better accommodate your access patterns or even refund you if they are failing contractual obligations.
- Avoid downtime and ensure your own team is able to respond to partner issues. This may include turning off a feature via feature flags, until the partner remedies the issue.
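The alerting and feature-flag ideas above can be combined in a small guard that watches the recent error rate of one integration. This is only a sketch under stated assumptions: `createIntegrationGuard`, its options, and the `onTrip` hook are hypothetical names, and the hook is where you might notify Slack or PagerDuty or flip a feature flag.

```javascript
// Watches the last `windowSize` calls to one integration and disables it
// once the error rate in that window exceeds `maxErrorRate`.
function createIntegrationGuard(options) {
  var windowSize = options.windowSize;
  var maxErrorRate = options.maxErrorRate;
  var onTrip = options.onTrip || function () {};
  var recent = []; // true means the call failed
  var enabled = true;

  return {
    // Record the outcome of one call to the third-party API.
    record: function (failed) {
      recent.push(Boolean(failed));
      if (recent.length > windowSize) {
        recent.shift(); // keep only the most recent outcomes
      }
      var errorRate = recent.filter(Boolean).length / windowSize;
      // Only judge once the window is full, and only fire the hook once.
      if (enabled && recent.length === windowSize && errorRate > maxErrorRate) {
        enabled = false;
        onTrip(errorRate);
      }
    },
    // Check this before calling the integration or showing the feature.
    isEnabled: function () {
      return enabled;
    },
    // Re-enable the integration once the partner has fixed the issue.
    reset: function () {
      recent = [];
      enabled = true;
    }
  };
}

// Made-up example: trip when more than half of the last 4 calls failed.
var alerts = [];
var guard = createIntegrationGuard({
  windowSize: 4,
  maxErrorRate: 0.5,
  onTrip: function (rate) { alerts.push('error rate ' + rate); }
});

[false, true, true, true, true].forEach(function (failed) {
  guard.record(failed);
});

console.log('enabled:', guard.isEnabled());
console.log('alerts:', alerts);
```

In this example the guard trips on the fourth call, when three of the last four calls have failed, and the fifth failure does not raise a second alert.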