Data Exfiltration Detection in Discord using ML

Authors

Ana Vidal
Simão Andrade

Phase 1: Problem Brainstorm

Problem

Discord is a popular communication platform that can be used to exfiltrate data.
Data exfiltration can be done in many ways, such as:
- Sending messages
- Sending files
- Using voice channels

Why is it difficult to solve?

One of the biggest challenges in detecting data exfiltration via Discord is that all communication is encrypted. When data is sent to a webhook, Discord uses HTTPS (HTTP over TLS), so the payload is encrypted in transit. Unless TLS is intercepted, most network security devices, such as firewalls and Deep Packet Inspection (DPI) systems that analyze individual packets for suspicious content, cannot inspect the content of the traffic. It is possible to see that a host is communicating with Discord, but not to analyze or filter what is being sent, which makes this channel well suited to exfiltrating data undetected.

Firewalls and SIEM systems based on fixed rules face a critical limitation here. Without access to the encrypted content of the packets, these systems can only monitor basic metadata, such as the destination IP address, the port, and the volume of data. However, as Discord is widely used and permitted on many corporate networks, this traffic appears legitimate and doesn't immediately raise suspicions.

Data exfiltration via Discord webhooks is a very popular technique because of the way it takes advantage of the platform's infrastructure to mask malicious activity. These features, designed to facilitate automated integrations and notifications, end up being exploited in cyber attacks, allowing sensitive information to be sent outside the network undetected.

Note

Although Discord made some updates regarding security (link), malicious users still take advantage of tools that allow development of plugins.

Real life examples

Some examples of data exfiltration using Discord:

Discovery

Discord has a built-in feature that sends automated messages to a text channel in a server (webhooks) and an API for bot creation (available in Python through the discord.py library).
It allows the upload of a variety of file types (e.g. PNG, PDF, MP4).
The maximum file upload size is 10 MB.

Filtering

Network Pool: 162.159.0.0/16 - Owned by Cloudflare, Inc.

Reference: NsLookup.io

Protocols Used for communication: TCP
- Destination Port(s): TCP/80 and TCP/443
- Source Port(s): UDP/50000-65535
Protocols Used for voice-communication/attachments: QUIC
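
The filter above can be sketched with the standard library. This is a minimal illustration, assuming the pool and ports listed in this section are the only criteria; the function name `is_candidate_discord_packet` is hypothetical and not part of the project.

```python
import ipaddress

# Network pool quoted in this section; other ranges may also be in use.
DISCORD_POOL = ipaddress.ip_network("162.159.0.0/16")
# TCP destination ports listed in this section.
DISCORD_TCP_PORTS = {80, 443}


def is_candidate_discord_packet(dst_ip: str, dst_port: int) -> bool:
    """Return True when the destination is inside the pool and port list."""
    in_pool = ipaddress.ip_address(dst_ip) in DISCORD_POOL
    return in_pool and dst_port in DISCORD_TCP_PORTS


print(is_candidate_discord_packet("162.159.128.233", 443))
```

Because the pool belongs to Cloudflare, a match only marks traffic as a candidate; other services may share the same range.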

Note

Additional information to look for:

Data Acquisition

Wireshark captures (.pcap): link

Aggregation

To perform the analysis, the following data will be extracted:

Group and Private Conversations – the conversation type is obtained at the packet level (uploads/downloads)
Daily and Weekly message flow with various formats of files – analyzing the timestamps of interactions (uploads/downloads)

Collection

In a testing context we are going to use:

Wireshark: For network analysis
Burp Suite: Proxy tool for capturing traffic

But in a real life context, we could use:

Syslog and Agents: To obtain data from endpoints and Discord activity
Suricata or Palo Alto Networks Firewalls: To monitor ports and protocols in use, e.g. TCP and QUIC, and detect unusual or unauthorized traffic.

Sampling and Processing

We are going to focus on packet-level data, since the captures contain few distinct flows. Each packet has the following fields:

IP Source
IP Destination
Packet Size (in bytes)
Packet Timestamp (in seconds)
IP Protocol Number

To convert our qualitative data into quantitative data, we chose a sampling interval of 1 second.

This allows a balance between the level of detail needed to capture relevant events and the volume of data generated.

The metrics obtained in each sampling interval are:

Download/Upload Size of TCP Packets (in bytes)
Download/Upload Size of UDP Packets (in bytes)
Download/Upload Number of TCP Packets
Download/Upload Number of UDP Packets

In the following order: tcp_upload_packets, tcp_upload_bytes, udp_upload_packets, udp_upload_bytes, tcp_download_packets, tcp_download_bytes, udp_download_packets, udp_download_bytes
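
As a rough sketch of this sampling step (not the project's `data_sampling.py`), the snippet below bins packets into fixed intervals and emits the eight counters in the order listed above. The `Packet` type and `sample_packets` function are hypothetical names, and it assumes the upload/download direction has already been decided for each packet.

```python
from typing import Iterable, List, NamedTuple


class Packet(NamedTuple):
    timestamp: float  # seconds
    size: int         # bytes
    protocol: int     # IP protocol number: 6 = TCP, 17 = UDP
    upload: bool      # True if sent from the client network (assumed known)


def sample_packets(packets: Iterable[Packet], delta: float = 1.0) -> List[List[int]]:
    """Aggregate packets into one 8-column row per `delta` seconds."""
    ordered = sorted(packets, key=lambda p: p.timestamp)
    if not ordered:
        return []
    start = ordered[0].timestamp
    n_bins = int((ordered[-1].timestamp - start) // delta) + 1
    rows = [[0] * 8 for _ in range(n_bins)]
    for p in ordered:
        if p.protocol not in (6, 17):
            continue  # only TCP and UDP are counted
        row = rows[int((p.timestamp - start) // delta)]
        # Columns 0-3 hold upload counters, 4-7 download counters;
        # within each half, TCP comes first, then UDP.
        base = (0 if p.upload else 4) + (0 if p.protocol == 6 else 2)
        row[base] += 1           # packet count
        row[base + 1] += p.size  # byte count
    return rows


demo = [
    Packet(0.1, 100, 6, True),
    Packet(0.5, 200, 6, True),
    Packet(1.2, 50, 17, False),
    Packet(3.0, 10, 6, False),
]
for row in sample_packets(demo):
    print(row)
```

Intervals with no traffic produce all-zero rows, which matters later when silence periods are measured.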

Feature Extraction

This function extracts the following features:

Mean and Variance of silence times
Mean, Variance and 95th and 98th percentile of activity times
Mean, standard deviation, 60th and 90th percentile of upload and download bytes for TCP and UDP (separately)
Mean and standard deviation of total bytes
Mean and standard deviation of number of packets

In the following order: mean_silence_duration, variance_silence_duration, mean_activity_duration, variance_activity_duration, quartiles_activity_duration, tcp_upload_bytes_std_dev, tcp_download_bytes_std_dev, udp_upload_bytes_std_dev, udp_download_bytes_std_dev, tcp_upload_bytes_mean, tcp_download_bytes_mean, udp_upload_bytes_mean, udp_download_bytes_mean, quartiles_upload_bytes, quartiles_download_bytes, bytes_mean, bytes_std_dev, packets_mean, packets_std_dev

Important

Threshold of silence activity (number of packets) is 3.
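
The silence and activity durations can be illustrated as follows. This is a sketch under one stated assumption: a sampling interval counts as silent when it holds fewer than 3 packets (the exact comparison used by the project is not given here). The helper names are hypothetical.

```python
from itertools import groupby
from statistics import mean, pvariance

SILENCE_THRESHOLD = 3  # packets per interval, from the note above


def run_lengths(packet_counts, threshold=SILENCE_THRESHOLD):
    """Split a per-interval packet-count series into silence/activity runs.

    Returns two lists of run lengths, measured in sampling intervals.
    Assumption: an interval is silent when its count is below `threshold`.
    """
    silence, activity = [], []
    for is_silent, run in groupby(packet_counts, key=lambda c: c < threshold):
        (silence if is_silent else activity).append(len(list(run)))
    return silence, activity


def duration_stats(durations):
    """Mean and population variance of run lengths (0.0, 0.0 when empty)."""
    if not durations:
        return 0.0, 0.0
    return mean(durations), pvariance(durations)


counts = [0, 1, 5, 7, 9, 0, 0, 0, 4]
silence, activity = run_lengths(counts)
print(duration_stats(silence), duration_stats(activity))
```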

Production

Benign Behavior: Generated through normal usage of the application by:

Humans: sending messages and files as usual
Bots: made by plugins added to the server

Malicious Behavior: Generated using three types of bots:

Easy to Detect:
- Size: 10MB
- Frequency: Periodically (40s)
Intermediate to Detect:
- Size: 1-10MB
- Frequency: Same variance as a normal behavior
Hard (almost impossible) to Detect: Through embedded images, using Discord CDN

Note

Command to make random files: dd if=/dev/urandom of=file.txt bs=1M count=10 (10MB)

The files used in the exfiltration process will be located in the data folder.
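
On systems without `dd`, a portable equivalent of the command above might look like the sketch below. The function name `make_random_file` and the example path are hypothetical; only the 1 MiB block size and the count of 10 come from the command.

```python
import os
from pathlib import Path


def make_random_file(path, size_mb=10, chunk_bytes=1024 * 1024):
    """Write `size_mb` chunks of random bytes to `path`; return the file size."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        for _ in range(size_mb):
            handle.write(os.urandom(chunk_bytes))  # same role as /dev/urandom
    return path.stat().st_size


if __name__ == "__main__":
    print(make_random_file("data/file.txt", size_mb=10))
```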

Tasks to be done

Phase 2: Project Development

Project Structure

.
├── presentation/ (folder with the slides)
│
├── src/
│   ├── .env (file with the Discord Token)
│   ├── data_sampling.py (script to sample the data)
│   ├── data_processing.py (script to extract the features)
│   ├── exfiltration_bot.py
│   ├── model.ipynb (notebook with the model selection)
│   ├── data/ (folder with the data to be exfiltrated)
│   ├── captures/ (folder with the captured data)
│   ├── samples/ (folder with the sampled data)
│   ├── features/ (folder with the extracted features)
│   └── requirements.txt
│
└── README.md

Bot Creation

To create a bot in Discord, follow these steps:

Go to the Discord Developer Portal
Click on New Application
Fill in the Name and click on Create
Go to the Bot section and click on Add Bot
Click on Copy to copy the Token and paste it in the .env file
Go to the OAuth2 section and select the bot scope
Copy the URL and paste it in the browser to add the bot to a server

References:

Portal to build the bot: https://discord.com/developers/applications
Tutorial in Python: https://www.youtube.com/watch?v=UYJDKSah-Ww

Setting up the environment

Add virtual environment:

python -m venv venv

Enable the virtual environment (the command below is for Windows; on Linux/macOS use source venv/bin/activate):

venv\Scripts\activate

Install the dependencies:

pip install -r requirements.txt

Add the DISCORD_TOKEN in the .env file:

echo "DISCORD_TOKEN=<your_token>" >> .env

Important

The .env file should be in the src folder
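
For illustration, a script could read the token from the .env file roughly as below. This is a standard-library sketch, not the project's code (which may rely on a dedicated library for this); `load_env` and `get_token` are hypothetical helpers.

```python
import os
from pathlib import Path


def load_env(path=".env"):
    """Parse simple KEY=VALUE lines; returns {} if the file is missing."""
    env_file = Path(path)
    if not env_file.exists():
        return {}
    values = {}
    for line in env_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments and malformed lines
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_token(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    token = os.environ.get("DISCORD_TOKEN") or load_env(path).get("DISCORD_TOKEN")
    if not token:
        raise RuntimeError("DISCORD_TOKEN is not set")
    return token
```

Keeping the token out of the source code and out of version control (for example by listing .env in .gitignore) avoids leaking it.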

Running the bot

To run the simpler version of the bot, which uses no prior behavior to exfiltrate data, run the following command:

python simple_exfiltration_bot.py

To run the bot that uses the prior behavior to exfiltrate data, run the following command:

python complex_exfiltration_bot.py --input <input_file>

Where the <input_file> is a CSV file with the packet capture data.

Sampling the data

This is the command to sample the data (with a sampling interval of 1 second), given the discord_capture.pcap file:

python data_sampling.py --format 3 --input discord_capture.pcap --output <output_file> --delta 1 --cnet <client_network_pool> --snet 0.0.0.0/0
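
The sketch below shows one way the two network arguments could separate uploads from downloads. It assumes that --cnet is the client network and --snet the server network, which is an interpretation of the flag names rather than documented behavior of `data_sampling.py`; `classify_direction` is a hypothetical helper.

```python
import ipaddress


def classify_direction(src, dst, cnet, snet="0.0.0.0/0"):
    """Label a packet as 'upload', 'download', or None (not of interest).

    Assumption: upload means client network -> server network,
    download means server network -> client network.
    """
    client = ipaddress.ip_network(cnet)
    server = ipaddress.ip_network(snet)
    source = ipaddress.ip_address(src)
    destination = ipaddress.ip_address(dst)
    if source in client and destination in server:
        return "upload"
    if destination in client and source in server:
        return "download"
    return None


print(classify_direction("192.168.1.10", "162.159.128.233", "192.168.1.0/24"))
```

With a server network of 0.0.0.0/0, every packet leaving the client network counts as an upload, so any filtering by destination has to happen at capture time.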

Extracting the features

This is the command to extract the features using the multi-slide observation window (observation window of 5 minutes width and window slide of 30 seconds), given the output_file.txt file:

python data_processing.py --input output_file.txt --method 3 --width 300 --slide 30

Note

Since the sampling interval is 1 second, the width and slide of the observation window are given in seconds
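
A sliding observation window over the sampled rows can be sketched as follows. The generator name `sliding_windows` is hypothetical; the defaults mirror the --width 300 and --slide 30 values used above, and incomplete trailing windows are dropped (an assumption about the desired behavior).

```python
def sliding_windows(samples, width=300, slide=30):
    """Yield consecutive windows of `width` samples, advancing by `slide`."""
    for start in range(0, len(samples) - width + 1, slide):
        yield samples[start:start + width]


# One hour of 1-second samples gives (3600 - 300) / 30 + 1 windows.
one_hour = [0] * 3600
print(sum(1 for _ in sliding_windows(one_hour)))
```

Because consecutive windows overlap by 270 seconds, neighbouring feature vectors are strongly correlated, which is worth keeping in mind when splitting data into training and test sets.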

Detection using Machine Learning (Unsupervised Learning)

Models Selection

Autoencoders
- Type: Neural Network-based
- Use Case: Anomaly detection
- How it works:
  - Train an autoencoder to reconstruct "normal" network traffic patterns from packet data.
  - During inference, unusual traffic (indicative of exfiltration) will have a higher reconstruction error.
- Why suitable: Autoencoders work well with unlabeled data and are ideal for detecting anomalies like data exfiltration.
Isolation Forest
- Type: Tree-based anomaly detection
- Use Case: Identify outlier network sessions
- How it works:
  - Isolation Forest isolates data points by randomly partitioning feature space.
  - Exfiltration traffic, which is rare or abnormal, is "isolated" faster.
- Why suitable: Works efficiently with high-dimensional data like packet captures.
One-Class SVM (Support Vector Machine)
- Type: Kernel-based anomaly detection
- Use Case: Classifies normal vs. anomalous behavior
- How it works:
  - Trains on normal packet behavior to create a decision boundary.
  - Exfiltration (anomalous data) lies outside the learned boundary.
- Why suitable: Handles packet-level feature extraction well and doesn't require labels.
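
To make the two non-neural models concrete, here is a minimal sketch using scikit-learn (assumed to be installed). The data is synthetic and only stands in for the extracted feature vectors; the hyperparameters are illustrative defaults, not tuned values from this project.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Synthetic stand-in for feature vectors of benign windows (NOT real captures).
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 4))
# A synthetic window far from the benign cluster, standing in for exfiltration.
suspicious = np.full((1, 4), 10.0)

models = {
    "isolation_forest": IsolationForest(n_estimators=100, random_state=0),
    "one_class_svm": OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
}

predictions = {}
for name, model in models.items():
    model.fit(normal)  # train on benign behavior only
    # predict() returns +1 for inliers and -1 for outliers
    predictions[name] = int(model.predict(suspicious)[0])

print(predictions)
```

Both estimators share the same fit/predict interface, so they can be swapped or compared on the same feature matrix without changing the surrounding code.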

Data Preprocessing (TBD)

Normalization using MinMaxScaler
Train with normal behavior and test with 50/50 normal and malicious behavior
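
The key point of this preprocessing plan is that scaling parameters come from the training (normal) data only. The pure-Python sketch below mimics min-max scaling to show the effect; `fit_min_max` and `apply_min_max` are hypothetical helpers, and in practice scikit-learn's MinMaxScaler would be fitted on the training set and then applied to the test set.

```python
def fit_min_max(rows):
    """Compute per-column minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Scale rows with previously fitted bounds; constant columns map to 0.0."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, mins, maxs)
        ])
    return scaled


train = [[0, 10], [10, 20]]            # normal behavior only
test = [[5, 15], [20, 10]]             # mixed normal and malicious behavior
mins, maxs = fit_min_max(train)
print(apply_min_max(test, mins, maxs))
```

Test values outside the range seen in training fall outside [0, 1], which is expected: unusually large values are precisely what an anomaly detector should be able to see.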

Model Evaluation (TBD)

PCA
Linear discriminant analysis
Non-negative Matrix Factorization
Generalized discriminant analysis
