Sensitive Data Detection using AI for API Hackers

Do you ever notice how easy it is for APIs to leak sensitive data in the weirdest places?

Data breaches are as damaging as they are commonplace. Yet many API hackers, so focused on stepping through their methodology, fail to detect the very leakage that leads to such breaches.

What if there was an easier way?

In this article, I will show you how to weaponize the data protection technologies Microsoft builds into some of its products to detect “sensitive data” in the API responses coming from your targets.

This technique builds upon some of the lessons from my article on 5 mistakes beginners make during app recon. “Walking the app” to map out the API endpoints and their data gives you sufficient information to run dedicated machine learning (ML) models to detect sensitive data.

Let me show you how. 

But first… what represents “sensitive data”?

In the API world, “sensitive data” refers to any information that must be protected due to its confidential nature and the potential harm that could arise from its exposure. This includes a wide array of data types, such as personal identification numbers, financial records, health information, personal biometric data, and security credentials. 

Sensitive data also extends to proprietary business information, such as trade secrets and internal communications, which, if disclosed, could jeopardize a company’s competitive advantage. 

The classification of sensitive data is not only a technical necessity but also a legal imperative, as numerous regulations and laws—such as GDPR, HIPAA, and PIPEDA—mandate strict standards for handling and protecting this kind of information. 

Understanding what constitutes sensitive data is the first step in securing APIs and ensuring they do not become conduits for unauthorized access and data breaches.

Lots of research has already gone into detecting personally identifiable information (PII). Microsoft has been doing some great work to ensure that sensitive data is properly managed and governed in its data protection and de-identification SDK, Presidio.   

What is Microsoft Presidio?

Presidio is a context-aware, pluggable, and customizable PII detection and de-identification service. 

It provides fast identification and anonymization modules for text and images, including credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data, and more.

And if the predefined recognizers don’t meet your needs, you can extend it with your own custom recognizers to find the sensitive data that matters to you. 

Presidio uses trained machine learning (ML) models for NER-based PII identification and feature extraction for downstream rule-based logic. It supports spaCy version 3+ for Named Entity Recognition (NER), tokenization, and lemmatization.

That’s all fancy-talk to say Presidio can use AI to detect PII. 

If you are an AI geek, you can even customize your own NLP models. However, I’m not… and will rely on the open-source model called “en_core_web_lg” from spaCy. 
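
If you do want to swap in your own model, here is a minimal sketch of how you could point the analyzer at a specific spaCy pipeline using Presidio’s NlpEngineProvider (the configuration below simply pins the same en_core_web_lg model, so treat it as a template):

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Configuration describing which NLP engine and model the analyzer should load
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}],
}

provider = NlpEngineProvider(nlp_configuration=configuration)
analyzer = AnalyzerEngine(nlp_engine=provider.create_engine())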

How it works

This animation from Microsoft does a pretty good job of explaining how the detection flow works in Presidio. In short, incoming text passes through the NLP engine for named entity recognition, and then each registered recognizer (pattern-based, context-aware, or ML-based) scores potential PII matches before the results are aggregated.

We don’t really care about anonymization here. We only want to detect sensitive data, so we will focus on the Presidio Analyzer.

Getting Started with Presidio Analyzer

The Presidio analyzer is a Python-based service that detects PII entities in text.

Microsoft offers Presidio as a REST service that runs in Docker and Kubernetes containers or as a Python module that you can use directly in your code.

I prefer the latter method. But it comes with a caveat. 

Presidio Analyzer supports up to Python 3.11. So you will probably have to install it in a virtual environment (venv) since, at the time of this writing, most people have 3.12 installed.

Let me show you one way to set up your environment to support and use Presidio in Python.  

Preparing your suspect API data

For the rest of this article, I will demonstrate using Presidio to detect sensitive data found in an HTTP archive (HAR) capture file. You should have been collecting this data as part of your initial recon process for your target. (It was lesson #1, after all.)

If you don’t have a HAR file, you might be able to generate one if you have Logger++ installed in Burp Suite. To do this, go to the Logger++ tab, highlight the requests you want to export, right-click on the log pane, and select Export entries as… > Export # entries as HAR.

Otherwise, you will need to walk your app again to generate a new HAR file for this exercise.

Installation

With your data ready, let’s set up an environment to use Presidio. Knowing that Microsoft only supports up to Python 3.11, I will use that to generate a new environment called “SensitiveDataDetector”:

python3.11 -m venv SensitiveDataDetector
cd SensitiveDataDetector

With the new environment set up, we need to activate it:

source bin/activate

With the Python 3.11 environment now active, we can go about installing Microsoft’s Presidio Analyzer and download the open-source model that is provided in spaCy:

pip install presidio_analyzer
python -m spacy download en_core_web_lg
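
If you want to confirm the model downloaded correctly before moving on, spaCy ships a handy validation command that lists your installed pipelines:

python -m spacy validate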

That’s it. You’re now ready to start trying out the Presidio Analyzer. 

Basic Usage

So, before we go about detecting stuff in our suspect data, let’s make sure we understand how to use Presidio. All we need to do is instantiate an AnalyzerEngine object and call analyze(), passing in the block of text we want to scan. It looks something like this: 

from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
results = analyzer.analyze(text="My phone number is 604-555-1234", language='en')
print(results)
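
When you run this, you should see output along these lines (the exact formatting may vary between Presidio versions):

[type: PHONE_NUMBER, start: 19, end: 31, score: 0.75]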

As you can see, Presidio analyzed the text and detected a phone number pattern from the 19th to the 31st character of the string, with a confidence score of 0.75.

OK, let’s go have some real fun with this against our API data.

Detecting Sensitive Data in API responses with Presidio

If you did a decent job during your recon process of walking your app, you should have a HAR file with all the API requests and responses, including all the actual data. It will also include all the requests that support the web application, including the HTML, CSS, JavaScript, images, etc.

We will filter all that out for our purposes and only look at responses with a content type of application/json. If your target API responds with other types, like XML, you can adjust for it accordingly.
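
For example, widening the filter in the detector’s main loop (shown later in this article) could be as simple as handing startswith() a tuple of acceptable prefixes — a hypothetical tweak:

# Hypothetical tweak: also scan XML responses
if content_type.lower().startswith(("application/json", "application/xml", "text/xml")):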

Anyway, let’s start by parsing the HAR capture file.

Building a HAR Capture Reader

Instead of reinventing the wheel, we can reuse the HAR capture reader code that comes with mitmproxy2swagger, which I demonstrated in my article on generating rogue API docs from captured traffic.

We’ll just make a few tweaks to it to better support our needs. Note that it relies on the json-stream package (pip install json-stream in your venv) to parse large HAR files efficiently. The modified code looks something like this:

from base64 import b64decode
import os
from typing import Iterator, Union
import json_stream

# This HAR capture reader was taken from mitmproxy2swagger and slightly modified to work for our needs.
# See https://github.com/alufers/mitmproxy2swagger/blob/master/mitmproxy2swagger/har_capture_reader.py

class HarFlowWrapper:
    def __init__(self, flow: dict):
        self.flow = flow

    def get_url(self):
        return self.flow["request"]["url"]

    def get_matching_url(self, prefix) -> Union[str, None]:
        """Get the requests URL if the prefix matches the URL, None otherwise."""
        if self.flow["request"]["url"].startswith(prefix):
            return self.flow["request"]["url"]
        return None

    def get_method(self):
        return self.flow["request"]["method"]

    def get_request_headers(self):
        headers = {}
        for kv in self.flow["request"]["headers"]:
            k = kv["name"]
            v = kv["value"]
            # create a list on the key if it does not exist
            headers[k] = headers.get(k, [])
            headers[k].append(v)
        return headers

    def get_request_body(self):
        if (
            "request" in self.flow
            and "postData" in self.flow["request"]
            and "text" in self.flow["request"]["postData"]
        ):
            return self.flow["request"]["postData"]["text"]
        return None

    def get_response_status_code(self):
        return self.flow["response"]["status"]

    def get_response_reason(self):
        return self.flow["response"]["statusText"]
    
    def get_response_http_version(self):
        if "response" in self.flow and "httpVersion" in self.flow["response"]:
            return self.flow["response"]["httpVersion"]
        return None
    
    def get_response_content_type(self) -> str:
        content_type: str = "text/plain"

        if "response" in self.flow and "headers" in self.flow["response"]:
            for kv in self.flow["response"]["headers"]:
                k = kv["name"]

                if k.lower() == "content-type":
                    content_type = kv["value"]
                    break

        return content_type

    def get_response_headers(self):
        headers = {}

        if( "response" in self.flow and "headers" in self.flow["response"] ):
            for kv in self.flow["response"]["headers"]:
                k = kv["name"]
                v = kv["value"]
                # create list on key if it does not exist
                #headers[k] = headers.get(k, [])
                #headers[k].append(v)
                headers[k] = v

        return headers

    def get_response_body(self):
        if (
            "response" in self.flow
            and "content" in self.flow["response"]
            and "text" in self.flow["response"]["content"]
        ):
            try:
                if (
                    "encoding" in self.flow["response"]["content"]
                    and self.flow["response"]["content"]["encoding"] == "base64"
                ):
                    return b64decode(self.flow["response"]["content"]["text"]).decode()
            except UnicodeDecodeError:
                return None
            return self.flow["response"]["content"]["text"]
        return None

class HarCaptureReader:
    def __init__(self, file_path: str, progress_callback=None):
        self.file_path = file_path
        self.progress_callback = progress_callback

    def captured_requests(self) -> Iterator[HarFlowWrapper]:
        har_file_size = os.path.getsize(self.file_path)
        with open(self.file_path, "r", encoding="utf-8") as f:
            data = json_stream.load(f)
            for entry in data["log"]["entries"].persistent():
                if self.progress_callback:
                    self.progress_callback(f.tell() / har_file_size)
                yield HarFlowWrapper(entry)

    def name(self):
        return "har"

You can also download a complete gist for har_capture_reader.py here.
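
Before wiring the reader into the detector, you can smoke-test it on its own. Here is a minimal sketch, assuming your capture is saved as capture.har:

from har_capture_reader import HarCaptureReader

reader = HarCaptureReader("capture.har")
for req in reader.captured_requests():
    print(req.get_method(), req.get_url(), req.get_response_content_type())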

Building the Sensitive Data Detector

The code for the sensitive data detector is pretty self-explanatory. 

I will call out one thing: how I adjusted the call to analyze() to take in an array of “entities” I wanted to scan for:

    try:
        results = analyzer.analyze(
            text=data,
            entities=[
                "EMAIL_ADDRESS", "IBAN_CODE", "IP_ADDRESS",
                "PHONE_NUMBER", "LOCATION", "PERSON", "URL",
                "US_BANK_NUMBER", "US_DRIVER_LICENSE",
                "US_ITIN", "US_PASSPORT", "US_SSN"
                ],
            score_threshold=score_min, 
            language='en')
        
    except Exception as e:
        print( f"Exception while analyzing data with Presidio: {e}")
        return sensitive_data 

You can find a list of the predefined entity recognizers built into Presidio in the GitHub repo. I just looked through the code for each recognizer’s “supported_entity” value to extract the ones I wanted to scan for.

YMMV though.
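
If digging through the repo isn’t your thing, you should also be able to ask the engine directly which entities its registry supports:

from presidio_analyzer import AnalyzerEngine

print(AnalyzerEngine().get_supported_entities(language="en"))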

Here’s the full code to the detector logic:  

import json
import sys
from typing import List
from dataclasses import dataclass

from presidio_analyzer import AnalyzerEngine, RecognizerResult
import argparse
from har_capture_reader import HarCaptureReader

analyzer: AnalyzerEngine = AnalyzerEngine()

# You can adjust the acceptable threshold here. Presidio uses a weighting of 0 to 1.
# A typical "confidence" score where the data is more likely to be sensitive is around
# 0.75 for most entities.
# See https://github.com/microsoft/presidio/tree/main/presidio-analyzer/presidio_analyzer/predefined_recognizers
SCORE_THRESHOLD: float = 0.75
    
@dataclass
class SensitiveDataResult:
    """ Class for keeping track of potentially sensitive data """
    entity_type: str
    score: float
    data: str

@dataclass
class SuspectResponse:
    """ Class for keeping track of responses that have potentially sensitive data """
    method: str
    status_code: int
    url: str
    headers: dict
    body: str
    sensitive_data: List[SensitiveDataResult]

def check_for_sensitive_data(data: str, score_min: float ) -> List[SensitiveDataResult]:
    """ Runs a response through Microsoft Presidio to see if it can detect any sensitive data """
    sensitive_data: List[SensitiveDataResult] = []
    results: List[RecognizerResult] = []

    try:
        results = analyzer.analyze(
            text=data,
            entities=[
                "EMAIL_ADDRESS", "IBAN Generic", "IP_ADDRESS", 
                "PHONE_NUMBER", "LOCATION", "PERSON", "URL", 
                "US_BANK_NUMBER", "US_DRIVER_LICENSE", 
                "US_ITIN", "US_PASSPORT", "US_SSN" 
                ], 
            score_threshold=score_min, 
            language='en')
        
    except Exception as e:
        print( f"Exception while analyzing data with Presidio: {e}")
        return sensitive_data 
    
    for r in results:
        try:
            if r.score >= SCORE_THRESHOLD:
                sensitive_data.append( SensitiveDataResult(r.entity_type, r.score, data[r.start:r.end]) )
        except Exception as e:
            print(f"{e} : {r}")

    return sensitive_data    

def pretty_print(resp: SuspectResponse, show_details: bool = False ) -> None:
    """Prints details of responses containing sensitive data"""
    print( f"\033[32m{resp.url}")
    for item in resp.sensitive_data:
        print( f"\033[0m{item.entity_type} (Score={item.score}) : \033[31m{item.data}" )
    
    if show_details:
        print( "\n\033[36m========\nRESPONSE\n========")
        print( f"Method: {resp.method}")
        print( f"Status Code: {resp.status_code}\n")
        for key,val in resp.headers.items():
            print( f"{key}: {str(val)}" )
        print( f"\n{resp.body}")
    
    print("\033[0m")


def main() -> None:
    """Main function to process HTTP archive capture files for sensitive data"""    
    parser = argparse.ArgumentParser(description="Search through HTTP archive for sensitive data")
    parser.add_argument("filename", help="The path to the HAR file to process")
    parser.add_argument('-d', '--details', action='store_true', help='Shows full detailed response')

    args = parser.parse_args()
    
    try:
        capture_reader = HarCaptureReader(args.filename)
        suspect_responses: List[SuspectResponse] = []

        for req in capture_reader.captured_requests():
            content_type = req.get_response_content_type()

            # Need to account for mixed JSON content types (i.e., protobuf)
            if content_type.lower().startswith("application/json"):
                body = req.get_response_body()

                # Skip empty or undecodable response bodies
                if not body:
                    continue

                sensitive_data: List[SensitiveDataResult] = check_for_sensitive_data(body, SCORE_THRESHOLD)

                if sensitive_data:
                    suspect_responses.append(
                        SuspectResponse(
                            req.get_method(), req.get_response_status_code(), req.get_url(),
                            req.get_response_headers(), body,
                            sensitive_data)
                    )
    
        if suspect_responses:
            for resp in suspect_responses:
                pretty_print(resp, args.details)
                
    except Exception as e:
        print(f"General Exception: {e}")
        sys.exit(1)

if __name__ == "__main__":
    main()  

You can also download a complete gist of sensitive_data_detector.py here.

Running the Code

Making sure you are in the Python virtual environment, running the code is as simple as pointing it at your HAR file:

python sensitive_data_detector.py capture.har -d

If you don’t want to see the full response data, you can drop the -d flag. 

The output will show you three (possibly four) things for each API response that Presidio suspects has sensitive data:

  1. It will show you the full path to the endpoint in green
  2. It will show you a list of entity types detected in white. It will also show the weighted confidence score, which is a float value in the range 0 to 1. 
  3. It will show you any sensitive data discovered in red.
  4. It will optionally show you the full response (including headers) in cyan if the -d flag is used.

The scoring threshold I used when running this was set to 0.75. This filters out a lot of the false positives while giving the best results.    
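
If you find yourself experimenting with different thresholds, one option is to expose the value as a command-line flag instead of editing the constant — a hypothetical tweak to main():

parser.add_argument('-t', '--threshold', type=float, default=SCORE_THRESHOLD,
                    help='Minimum confidence score between 0 and 1')

You would then pass args.threshold in place of SCORE_THRESHOLD when calling check_for_sensitive_data().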

What next?

This sensitive data detector works well. Against my production HAR files, it only takes a few minutes to scan the entire capture file and dump out all the sensitive data. It could get noisy if you have endpoints that rely on things like email addresses. But you can tune that in the entities array.

Ultimately, it helped me find a couple of endpoints with sensitive data I didn’t expect, which was nice.

But there is a whole bunch more you could do with it.

Build your own custom recognizers

Microsoft publishes a tutorial on building your own PII recognizers for the Presidio Analyzer. If you have sensitive data proprietary to the target that you never expect to see in the API responses, you can build a simple pattern recognizer for that and inject it at runtime.
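
To give you a feel for it, here is roughly what a simple pattern recognizer might look like for a made-up internal identifier format (the “ACCT-” pattern and entity name below are purely hypothetical):

from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

# Hypothetical internal account ID the target uses, e.g. "ACCT-12345678"
acct_pattern = Pattern(name="acct_id", regex=r"ACCT-\d{8}", score=0.8)
acct_recognizer = PatternRecognizer(
    supported_entity="TARGET_ACCOUNT_ID", patterns=[acct_pattern]
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(acct_recognizer)

results = analyzer.analyze(text="Your account ACCT-12345678 is now active", language="en")
print(results)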

Build a Burp extension to scan for sensitive data

As Presidio Analyzer requires Python 3 (and should be run in a venv), you won’t be able to write a Burp extension that uses it directly. However, nothing says you can’t use the Montoya API and write an extension in Java or Kotlin that consumes the REST API exposed by running Presidio in Docker.
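
To get a feel for the REST contract such an extension would consume, here is a quick Python sketch (assuming you started the container with the port mapping from Microsoft’s docs, i.e. docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest):

import requests

# POST a block of text to the containerized analyzer and print the findings
resp = requests.post(
    "http://localhost:5002/analyze",
    json={"text": "My phone number is 604-555-1234", "language": "en"},
)
print(resp.json())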

Microsoft has even published a Postman collection here for you to get to know and love RESTful Presidio.

Evil Genius idea: Be like TruffleHog and scan GitHub for HAR files

Years ago, researchers used TruffleHog to scan all of GitHub looking for API secrets, and some good bug bounties came out of that work.

This year, they even monitored the public gist feed and scanned for secrets in real time.

Gotta love the ingenuity of security researchers. 

Some aspiring API hacker may be interested enough to use a similar approach. With a bit of GitHub dorking, you could probably find tons of .har files you can scan for sensitive data.

It’s just an idea. Play nice. Don’t be evil.      

Conclusion

As we wrap up this exploration of leveraging AI to detect sensitive data in API responses, it’s clear that integrating technologies like Microsoft Presidio can significantly enhance our API hacking methodology. 

This article not only underscores the importance of recognizing and securing sensitive data but also demonstrates practical methods to apply these concepts in real-world scenarios. 

Armed with these insights and tools, you’re now better equipped to scrutinize API responses for potential vulnerabilities and ensure that sensitive data remains just that—sensitive.

Remember, the goal isn’t merely to find vulnerabilities but to preemptively secure APIs against potential data breaches. As technology evolves, so do the threats, and staying ahead requires constant learning and adaptation. 

Hack hard! And remember to use your powers for good! 

One last thing…

API Hacker Inner Circle

Have you joined The API Hacker Inner Circle yet? It’s my FREE weekly newsletter where I share articles like this, along with pro tips, industry insights, and community news that I don’t tend to share publicly. If you haven’t, subscribe at https://apihacker.blog.

Dana Epp
