Best practices for building SDKs for APIs
In 2019, you can’t be a B2B company without an API program, whether your API is the product itself or APIs are leveraged to enable additional integrations and functionality for your web app.
Even though an SDK might seem simple in terms of lines of code, SDKs need to be reliable and handle scale with ease. A poorly designed SDK could cripple your customer’s infrastructure and reduce trust in your service. At Moesif, we put a lot of effort into creating SDKs that are high performance while including fail-safes in case bad things happen. This article walks through some of those practices. Given Moesif is an API analytics service, some of these practices are specific to high-volume data collection. However, others are applicable regardless of your SDK’s purpose.
Initialize the SDK securely and asynchronously
For most SaaS solutions with an SDK, your user will need to initialize the SDK and pass in an API key. You may need to grab certain account information from your servers or certain device information such as OS version. This work should be performed on a background thread or leverage async calls so the main thread can continue to respond without hanging. Depending on whether your SDK is for front ends or back ends, the main thread may be the main UI thread or the thread responding to incoming HTTP requests.
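As a rough sketch in Node.js (the Sdk class and fetchAccountInfo helper below are illustrative, not a real API), the expensive lookups can be kicked off asynchronously so the caller is never blocked:

// Hypothetical helper that simulates fetching account info from your servers.
function fetchAccountInfo(apiKey) {
  return new Promise(function (resolve) {
    setTimeout(function () { resolve({ plan: 'pro' }); }, 100);
  });
}

class Sdk {
  constructor(apiKey) {
    this.apiKey = apiKey;
    this.ready = false;
    // Kick off initialization without blocking the constructor's caller.
    this.initPromise = fetchAccountInfo(apiKey).then((account) => {
      this.account = account;
      this.ready = true;
    });
  }
}

// The main thread (or event loop) keeps responding while init completes.
const sdk = new Sdk('YOUR_API_KEY');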
In addition to offloading initialization work, you should never assume the SDK is running in a secure and trusted environment. Many SDKs are incorporated into mobile app packages or browser JavaScript, all of which can be easily disassembled to reveal the API key to anyone who can download or access the app. One way to reduce this risk is to scope API keys to specific API resources and actions. For example, if you’re building an analytics SDK to collect and log information, the API key can be write-only: able to communicate with your data ingestion APIs but blocked from the data export APIs.
Batch heavy network traffic
For any HTTP connection created with another server, there will be some overhead before any data can be transmitted. This includes a mutual exchange of SYN and ACK packets between a client and server to create a TCP connection, a handshake to establish SSL/TLS, etc. Instead of redoing these for every HTTP request, you can reuse connections via Keep-Alive.
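In Node.js, for example, this is as simple as routing requests through a keep-alive agent (the endpoint below is a placeholder):

const https = require('https');
const request = require('request');

// Reuse TCP/TLS connections across requests instead of paying the
// handshake cost every time.
const keepAliveAgent = new https.Agent({ keepAlive: true, maxSockets: 10 });

request({
  uri: 'https://api.example.com/events',
  method: 'POST',
  agent: keepAliveAgent,
  json: true,
  body: { type: 'pageview' }
}, function (err, res, body) {
  // Subsequent requests through the same agent reuse the open connection.
});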
In addition to Keep-Alive, you can also batch multiple events or commands into the same outbound HTTP request. Going back to our analytics SDK example, it would be inefficient to make an HTTP request for every client event that needs to be logged. TCP itself has overhead, and you’re also sending many HTTP headers with each request to handle authentication, caching, etc. Batching reduces this overhead and reduces the number of system calls while keeping all the data in the same memory buffer.
Batching is done by sending an array of events or commands rather than a single one per HTTP request. It usually works best when combined with local queueing.
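A minimal sketch of what a batched request might look like (the /events/batch endpoint is a placeholder, not a real API):

const request = require('request');

const events = [
  { type: 'click', ts: Date.now() },
  { type: 'pageview', ts: Date.now() }
];

// One request carries an array of events instead of one request per event.
request({
  uri: 'https://api.example.com/events/batch',
  method: 'POST',
  json: true,
  body: events
}, function (err, res, body) {
  // All events delivered with one connection and one set of headers.
});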
Leverage local queuing
In order to correctly implement features like batching, you can leverage local queueing. Local queueing decouples the logic that captures or processes data from the logic that batches and sends it to a server. Events can be stored in an in-memory queue or flushed to disk for durability if there is a risk of power loss. More elaborate queueing architectures can leverage distributed data stores like Redis, although this adds setup complexity for your SDK and is not recommended unless absolutely required.
Queuing also increases the reliability of your SDK when your API is down. Local events can continue to be pushed into the queue while the API is down; once it’s back up, the queue can be drained in large batches. It’s recommended to implement logic that prevents the queue from consuming too much memory or disk space if the API is down for an extended time. One way to do this is a fixed-size queue that automatically drops the oldest values as new ones arrive. While some events may be dropped (which may be OK if the events are only used for analytics purposes), that can be a good trade-off to guarantee your SDK won’t crash or overload your customer’s infrastructure.
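Here is a minimal sketch of such a fixed-size queue in JavaScript:

// A bounded in-memory queue that drops the oldest event when full,
// so memory stays capped during an extended API outage.
class BoundedQueue {
  constructor(maxSize) {
    this.maxSize = maxSize;
    this.items = [];
  }

  push(event) {
    if (this.items.length >= this.maxSize) {
      this.items.shift(); // drop the oldest event to make room
    }
    this.items.push(event);
  }

  drain(batchSize) {
    // Remove and return up to batchSize events for the next outbound batch.
    return this.items.splice(0, batchSize);
  }
}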
Flushing the queue
Correct queueing requires certain triggers to flush the events or commands out to your server. The recommended approach is to leverage both time-based and count-based triggers. For example, you can flush events once the buffer reaches 50 events OR after 10 seconds have passed since the last flush. This ensures your SDK can batch many events during peak traffic while still keeping end-to-end latency low.
Without time-based flushing, a single event could sit in the queue indefinitely if no other events are pushed into the queue.
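A sketch combining both triggers, reusing the BoundedQueue above (sendBatch is a hypothetical function that POSTs the batch to your ingestion API):

const BATCH_SIZE = 50;
const FLUSH_INTERVAL_MS = 10000;
const queue = new BoundedQueue(1000);

function flush() {
  const batch = queue.drain(BATCH_SIZE);
  if (batch.length > 0) {
    sendBatch(batch);
  }
}

function track(event) {
  queue.push(event);
  if (queue.items.length >= BATCH_SIZE) {
    flush(); // count-based trigger during peak traffic
  }
}

// Time-based trigger so a lone event never sits in the queue indefinitely.
setInterval(flush, FLUSH_INTERVAL_MS);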
Compression
Compression is easy to take advantage of but also easy to forget. Not all HTTP client libraries compress payloads by default, and your backend needs to support your preferred (de)compression encoding. By compressing your payload as gzip using zlib or similar, you can reduce the size of your payloads by over 10X compared to plain text. You can also look into newer formats like Brotli, which can further reduce size by 10 to 20% over gzip.
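For example, in Node.js you can gzip the payload with the built-in zlib module and label it with the Content-Encoding header (the endpoint is again a placeholder):

const zlib = require('zlib');
const request = require('request');

const payload = JSON.stringify([{ type: 'click', ts: Date.now() }]);

zlib.gzip(payload, function (err, compressed) {
  if (err) throw err;
  request({
    uri: 'https://api.example.com/events/batch',
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // Tells the server how to decompress the body.
      'Content-Encoding': 'gzip'
    },
    body: compressed
  }, function (err, res, body) {
    // Payload sent at a fraction of its plain-text size.
  });
});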
Set the User-Agent
You should include both the SDK name and its version in the User-Agent HTTP header. This allows you to understand SDK adoption and correlate issues to specific versions. For example, at Moesif we adopted a standard format, libraryname/semver, across our SDKs:
const request = require('request');
const pjson = require('../package.json');

request({
  headers: {
    'User-Agent': 'nodejs/' + pjson.version
  },
  uri: 'https://api.example.com',
  method: 'POST'
}, function (err, res, body) {
  // it works!
});
Leverage API analytics
Once you build and publish these SDKs, it’s critical to have the right API analytics in place to measure the performance and utilization of your APIs and see what improvements you can make, whether on the SDK or API side, such as tweaking batch size or adding more efficient endpoints. Correctly implemented analytics can show you where pagination can be improved or which endpoints are being used incorrectly.
Document changes
Leverage GitHub’s release process or create a CHANGELOG.md to thoroughly document changes, even minor ones. When a user of an SDK encounters errors or problems, the first things they can check are the changelog and any tickets filed for similar issues. Sometimes small changes can break older or specific environments without you knowing.
Breaking changes should have even more thorough documentation describing what breaks and how to migrate. Some users of your SDK may be migrating from version 1.X.X to 3.X.X, while others migrate from 2.X.X to 3.X.X. Documenting what is needed to complete each migration can be very helpful.
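For example, a changelog entry for a breaking release might look like the following (the version numbers and API names are purely illustrative):

## 3.0.0
- BREAKING: init() now returns a Promise instead of taking a callback.
  - Migrating from 2.X.X: replace sdk.init(callback) with sdk.init().then(callback).
  - Migrating from 1.X.X: first apply the 2.0.0 rename below, then the change above.

## 2.0.0
- BREAKING: renamed track_event() to trackEvent().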