How to Properly Leverage Elasticsearch and User Behavior Analytics for API Security
Kibana and the rest of the ELK stack (Elasticsearch, Logstash, Kibana) are great for parsing and visualizing API logs for a variety of use cases. As an open-source project, it’s free to get started, although you still need to factor in compute and storage costs, which are not cheap for analytics workloads. One use case for Kibana that has grown recently is analysis and forensics for API security, a growing concern for engineering leaders and CISOs as companies expose more and more APIs to their customers and partners and to their own single-page and mobile apps. This can be done by instrumenting applications to log all API traffic to Elasticsearch. However, a naive implementation would only store raw API calls, which is not sufficient for API security use cases.
Why API logging is a naive approach to API security
Raw API logs only contain the information needed to execute a single action. Usually the HTTP headers, IP address, request body, and other metadata are logged for later analysis, and monitoring can be added by purchasing a license for Elasticsearch X-Pack. The issue is that security incidents cannot always be detected by looking at API calls in isolation. Instead, hackers can perform elaborate behavioral flows that exercise your API in unintended ways.
Let’s take a simple pagination attack as an example. A pagination attack is when a hacker paginates through a resource like /items or /users to scrape your data without detection. Maybe the info is already public and low risk, such as items listed in an e-commerce platform. However, the resource could also contain PII or other sensitive information, such as /users, and not be correctly protected. In this case, a hacker could write a simple script to dump all the users stored in your database like so:
import sys
from random import randint
from time import sleep

import requests

skip = 0
while True:
    response = requests.post('https://api.acmeinc.com/users?take=10&skip=' + str(skip),
                             headers={'Authorization': 'Bearer ' + sys.argv[1]})
    print("Fetched 10 users")
    sleep(randint(100, 1000) / 1000)  # wait 0.1-1 seconds between calls
    skip += 10
A couple of things to note:
- The hacker waits a random amount of time between calls to avoid running into rate limits
- Since the frontend app only fetches 10 users at a time, the hacker also fetches only 10 at a time so as not to raise suspicion
There is absolutely nothing in a single API call that distinguishes these bad requests from legitimate ones. Instead, your API security and monitoring solution needs to examine user behavior holistically, meaning all the API calls made by a single user or API key are examined together. This approach is called User Behavior Analytics, or UBA.
How to implement User Behavior Analytics in Kibana and Elasticsearch
To implement User Behavior Analytics in Kibana and Elasticsearch, we need to flip our time-centric data model around to one that is user-centric. Normally, API logs are stored as a time series, using the event time or request time as the date to organize data around. By doing so, older logs can easily be marked read only, moved to smaller infrastructure, or retired based on retention policies. It also makes search fast when you’re only querying a limited time range.
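As a rough sketch of this time-centric layout, the snippet below writes each event to a daily index and runs a time-bounded search. It assumes the 8.x Elasticsearch Python client, and the index names and fields are illustrative rather than a prescribed schema.

from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Each API log event goes into an index named after the request date,
# e.g. api-logs-2021.08.02, so older indices can be made read only or dropped.
event = {"request_time": "2021-08-02T02:14:48Z", "verb": "GET", "route": "/items"}
index_name = "api-logs-" + datetime.now(timezone.utc).strftime("%Y.%m.%d")
es.index(index=index_name, document=event)

# A query scoped to a limited time range only touches the matching daily indices.
results = es.search(
    index="api-logs-2021.08.*",
    query={"range": {"request_time": {"gte": "2021-08-02", "lte": "2021-08-04"}}},
)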
Tagging API logs with user id
In order to convert this to a user-centric model, we need to tag each event with user-identifying information such as a tenant id, a user id, or similar. Because the majority of APIs are secured by some sort of OAuth token or API key, it’s fairly easy to map the key to a permanent identifier like a user id, either directly or by maintaining this mapping in a key/value store like Redis. Your logs might then look like this:
Request Time | Verb | Route | User Id |
2021-08-02T02:14:48Z | GET | /items | 1234 |
2021-08-02T02:15:49Z | GET | /items | 1234 |
2021-08-03T02:16:19Z | GET | /users | 6789 |
2021-08-03T02:24:49Z | GET | /users | 1234 |
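One way to implement this tagging is a small enrichment step that resolves the API key to a user id before indexing the event. This is only a sketch: it assumes the key-to-user-id mapping lives in Redis under keys like api_key:<key>, and the index and field names are illustrative.

import redis
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
r = redis.Redis(host="localhost", port=6379)

def tag_event_with_user(event: dict, api_key: str) -> dict:
    # Resolve the API key or token to a permanent identifier like a user id.
    user_id = r.get("api_key:" + api_key)
    event["user_id"] = user_id.decode() if user_id else "unknown"
    return event

event = {"request_time": "2021-08-02T02:14:48Z", "verb": "GET", "route": "/items"}
es.index(index="api-logs-2021.08.02", document=tag_event_with_user(event, "abc123"))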
Grouping related API logs together
Now that you have tagged all API logs with a user id, you will need to run a map/reduce job to group each user’s events together and calculate per-user metrics. Unfortunately, log aggregation pipelines like Logstash and Fluentd can only enrich single events at a time, so you will need a custom application that runs these map/reduce jobs on a distributed compute framework like Spark or Hadoop.
Once you group by user id, you’ll want to store a few items in the “user profile” (see the sketch after this list), such as:
- Id and demographics of the user
- The raw events this user generated
- Summary metrics, like the number of API keys used or the amount of data this user downloaded
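A minimal PySpark sketch of such a job is below. It assumes the tagged logs land as JSON files in object storage (the path is illustrative), and the field names (api_key, target_user_id, route) are placeholders for whatever your logging pipeline actually emits.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("user-profiles").getOrCreate()

# Read one day of tagged API logs.
logs = spark.read.json("s3://acme-api-logs/2021-08-02/")

# Group each user's events together and compute per-user security metrics.
profiles = logs.groupBy("user_id").agg(
    F.count("*").alias("event_count"),
    F.countDistinct("api_key").alias("api_key_count"),
    F.countDistinct("target_user_id").alias("users_touched"),
    F.sum(F.when(F.col("route") == "/login", 1).otherwise(0)).alias("login_count"),
    F.collect_list(F.struct("request_time", "verb", "route")).alias("events"),
)

profiles.write.json("s3://acme-user-profiles/2021-08-02/", mode="overwrite")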
Storing the user profiles
Even though you are grouping by user id, storing all of a user’s events in a single database entity would be a no-go, as it gives up the flexibility of a time-series data store and creates new problems:
- Entities become fat and contain too much data
- Old data can no longer be retired
- Queries become slow due to the amount of data touched
To fix this, we can overlay our original time-series architecture with this user-centric approach, creating a two-level data model.
User Id | Start Time | End Time | Number of Logins | Number of Users Touched | Number of API Keys | Events |
1234 | 2021-08-02T00:00:00Z | 2021-08-02T23:59:59Z | 2 | 250,223 | 1 | [] |
6789 | 2021-08-03T00:00:00Z | 2021-08-03T23:59:59Z | 13 | 232 | 12 | [] |
1234 | 2021-08-03T00:00:00Z | 2021-08-03T23:59:59Z | 0 | 323,997 | 0 | [] |
In this case, we are creating a new “user profile” every day that contains the relevant security metrics along with raw events.
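As a sketch of how such a daily profile could be written back to Elasticsearch (again assuming the 8.x Python client and illustrative index and field names), one document is created per user per day, and a deterministic document id keeps reruns of the job idempotent:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

profile = {
    "user_id": "1234",
    "start_time": "2021-08-02T00:00:00Z",
    "end_time": "2021-08-02T23:59:59Z",
    "login_count": 2,
    "users_touched": 250223,
    "api_key_count": 1,
    "events": [],  # raw events for this user and day
}

# One profile document per user per day, in a daily user-profiles index.
es.index(
    index="user-profiles-2021.08.02",
    id=profile["user_id"] + "-" + profile["start_time"][:10],
    document=profile,
)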
Detecting API security vulnerabilities
Now that we have reorganized our API data to be user-centric, it becomes far easier to distinguish bad actors from good users, whether through visual inspection, static alert rules, or advanced anomaly detection.
In this case, we see the typical user (6789) touched only 232 user records across 12 API keys, which looks like standard interactive traffic. On the other hand, we have a bad actor (1234) who touched or downloaded over 250,000 user records per day over the last two days, and who accessed the API on the second day without any corresponding logins. You can now create infrastructure to detect this programmatically and alert you, for example whenever any user touches more than 10,000 records in a single day. API security and monitoring solutions like Moesif already have this functionality built in.
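A sketch of such a rule as a simple query against the daily profile index is below. It uses the same illustrative index and field names as the earlier sketches; a production setup would typically wire this into Kibana alerting or a tool like Moesif rather than a standalone script.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Find any user whose profile for the last day shows more than 10,000 records touched.
suspicious = es.search(
    index="user-profiles-*",
    query={
        "bool": {
            "filter": [
                {"range": {"users_touched": {"gt": 10000}}},
                {"range": {"start_time": {"gte": "now-1d/d"}}},
            ]
        }
    },
)

for hit in suspicious["hits"]["hits"]:
    print("Possible scraper:", hit["_source"]["user_id"], hit["_source"]["users_touched"])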
How long to retain API logs for API security
Unlike API logs kept for debugging purposes, these entities should be stored for at least a year, as most security experts recommend, since most breach studies show the time to detect a data breach is over 200 days. If you’re only retaining your API data for a couple of days or weeks to keep costs down, you lose access to valuable forensics data needed for auditing and postmortem review. Treat your API data like your database backups: you never know when you might need it, and you should regularly test your system to ensure the right data is captured. Naive decision making places too much emphasis on reducing storage and compute costs without considering how much risk that tradeoff exposes the company to.
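One way to enforce this retention is an index lifecycle policy that keeps the daily profile indices searchable for a year before deleting them. This is a sketch only, assuming the 8.x Elasticsearch Python client and the illustrative user-profiles-* naming; the policy still has to be referenced from your index template to take effect.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Keep daily user-profile indices for 365 days, then delete them.
es.ilm.put_lifecycle(
    name="user-profiles-one-year",
    policy={
        "phases": {
            "hot": {"actions": {"set_priority": {"priority": 100}}},
            "delete": {"min_age": "365d", "actions": {"delete": {}}},
        }
    },
)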
Storing data for a year does complicate GDPR and CCPA compliance, since API logs can contain PII and personal data. Luckily, both regulations include exemptions for collecting and storing logs without consent for the legitimate purpose of detecting and preventing fraud and unauthorized system access, and of ensuring the security of your APIs. In addition, because you already tied every API log to an individual user, handling GDPR requests such as the right to erasure or the right to access is a breeze.
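For example, since every log and profile carries a user id, a right-to-erasure request can be handled with a delete-by-query across both tiers. The sketch below reuses the illustrative index and field names from above and assumes the 8.x Python client.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def erase_user(user_id: str) -> None:
    # Remove every API log and daily profile tied to this user id.
    for index_pattern in ("api-logs-*", "user-profiles-*"):
        es.delete_by_query(
            index=index_pattern,
            query={"term": {"user_id": user_id}},
        )

erase_user("1234")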