DataSense: IIoT Network + Sensor Dataset

Offline-friendly docs: open site/index.html (see below), or read /docs.

Repository Map

/dataset: dataset contents (raw and processed files)
/docs: detailed documentation (schema, guides)
  testbed_inventory: architecture of the testbed and device details
  attacks_inventory: executed attacks with timestamps
  features_inventory: extracted features from network and sensor data
  devices.csv: machine-readable list of devices
  attacks.csv: machine-readable list of attack/benign sessions
/tools: helper scripts (checksum verification, file extraction, etc.)

Dataset Repository Layout

This section describes the structure of the dataset repository. The directories separate raw captures, processed feature files, and supporting documentation.

Directory Tree

├─ dataset/                                 → contains the dataset data
│  ├─ raw_files/                            → raw captured files (PCAP + JSON)
│  │  ├─ attack_data/                       → raw captures of all executed attacks
│  │  │  ├─ checksums/                      → integrity files
│  │  │  ├─ dos/                            → PCAP + JSON captures for all DoS attack variants
│  │  │  ├─ ddos/                           → PCAP + JSON captures for all DDoS attack variants
│  │  │  └─ …                               → other attack categories (bruteforce, mitm, etc.)
│  │  ├─ benign_data/                       → PCAP + JSON captures of benign (normal) traffic
│  │  │  ├─ checksums/                      → integrity files
│  │  │  └─ benign.tar.xz                   → PCAP + JSON captures for benign testbed execution
│  │
│  ├─ processed_files/                      → processed data and extracted features
│  │  ├─ attack_data/                       → feature data extracted from attack captures
│  │  │  ├─ checksums/                      → integrity files
│  │  │  ├─ all_attack_samples.csv.tar.xz   → features extracted for all attack samples for all time windows (1-10 seconds)
│  │  │  ├─ attack_samples_1sec.csv.tar.xz  → features extracted for attack samples for all types of attacks using 1 second time windows
│  │  │  ├─ attack_samples_2sec.csv.tar.xz  → features extracted for attack samples for all types of attacks using 2 second time windows
│  │  │  ├─ …                               → other time windows
│  │  │  └─ attack_samples_10sec.csv.tar.xz → features extracted for attack samples for all types of attacks using 10 second time windows
│  │  ├─ benign_data/                       → CSV feature files for benign captures (per device, time-windowed)
│  │  │  ├─ checksums/                      → integrity files
│  │  │  ├─ all_benign_samples.csv.tar.xz   → features extracted for all benign samples for all time windows (1-10 seconds)
│  │  │  ├─ benign_samples_1sec.csv.tar.xz  → features extracted for benign samples for all devices using 1 second time windows
│  │  │  ├─ benign_samples_2sec.csv.tar.xz  → features extracted for benign samples for all devices using 2 second time windows
│  │  │  ├─ …                               → other time windows
│  │  │  └─ benign_samples_10sec.csv.tar.xz → features extracted for benign samples for all devices using 10 second time windows
│  │  ├─ all_attack_benign_samples.tar.xz   → all benign + attack samples (all window sizes) compressed into one archive
│
├─ docs/                                    → dataset documentation
│  ├─ index.md                              → overview / introduction
│  ├─ testbed_inventory.md                  → architecture and device inventory (MACs, IPs, roles, etc.)
│  ├─ attacks_inventory.md                  → list of all executed attacks/benign captures with timestamps
│  ├─ devices.csv                           → machine-readable list of devices
│  ├─ attacks.csv                           → machine-readable list of attack/benign sessions
│
├─ tools/                                   → helper scripts
│  ├─ unpack_dataset.py                     → unpacks the dataset archives and verifies their checksums
│
├─ examples/                                → example scripts and notebooks for using the dataset
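
The per-window archive names in the tree above follow a regular pattern, so a script can build the path for a given window size instead of hard-coding it. The helper below is a minimal sketch, not part of the repository: the function name `feature_archive` is hypothetical, and it assumes the window sizes elided by "…" in the tree use the same `<kind>_samples_<N>sec.csv.tar.xz` naming as the 1, 2, and 10 second files shown.

```python
from pathlib import Path

# Window sizes of 1 to 10 seconds, as stated in the directory tree.
WINDOW_SECONDS = range(1, 11)


def feature_archive(root, kind, window):
    """Return the expected path of one per-window feature archive.

    root   -- path to the repository root
    kind   -- "attack" or "benign"
    window -- window size in seconds (1..10)

    Assumption: every window size follows the naming pattern visible
    in the tree for the 1, 2, and 10 second archives.
    """
    if kind not in ("attack", "benign"):
        raise ValueError(f"unknown kind: {kind!r}")
    if window not in WINDOW_SECONDS:
        raise ValueError(f"window must be 1..10 seconds, got {window}")
    name = f"{kind}_samples_{window}sec.csv.tar.xz"
    return Path(root) / "dataset" / "processed_files" / f"{kind}_data" / name
```

For example, `feature_archive(".", "benign", 2)` points at `dataset/processed_files/benign_data/benign_samples_2sec.csv.tar.xz` relative to the repository root.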

Explanation of Directories

dataset/raw_files/
Contains the raw captured data:
Attack data (attack_data/): Each attack category (e.g., DoS, DDoS) has its own subfolder. Inside, you will find .pcap files containing network packet captures and .json files containing sensor logs collected via the MQTT broker. The data is not grouped per device; instead, full testbed network + sensor data is captured for the entire duration of each attack.
Benign data (benign_data/): Similar structure but for normal (non-attack) traffic and sensor data captures.
dataset/processed_files/
Contains processed versions of the raw captures. Network and sensor features have been extracted into .csv files, which are grouped by device and aggregated into fixed time windows of 1 to 10 seconds.
Attack data (attack_data/): Each attack type (DoS, DDoS, etc.) contains device-level feature CSVs.
Benign data (benign_data/): Device-level feature CSVs for benign data.
all_attack_benign_samples.tar.xz: A single archive combining all benign and attack feature samples (all window sizes), ready for machine learning experiments.
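
Because the processed feature files are shipped as .tar.xz archives, they can be read without unpacking them to disk first. The following is a minimal sketch using only the Python standard library. The function name `iter_feature_rows` is hypothetical, and it assumes each archive contains one or more UTF-8 .csv files with a header row; the actual column names are described in the feature documentation, not here.

```python
import csv
import io
import tarfile


def iter_feature_rows(archive_path):
    """Yield every CSV row in a .tar.xz archive as a dict keyed by column name.

    Assumption: archive members ending in ".csv" are UTF-8 text with a
    header row. Other members are skipped.
    """
    with tarfile.open(archive_path, mode="r:xz") as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(".csv")):
                continue
            raw = archive.extractfile(member)
            # Wrap the binary member stream so the csv module can read text.
            with io.TextIOWrapper(raw, encoding="utf-8", newline="") as text:
                yield from csv.DictReader(text)
```

Rows are produced lazily, which helps with the larger archives: a caller can filter or sample rows without holding a whole file in memory. All values arrive as strings, so numeric features need explicit conversion.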
docs/
Documentation and metadata to help researchers understand and use the dataset.
index.md: Overview/introduction to the dataset.
testbed_inventory.md: Detailed description of the testbed layout and device information.
attacks_inventory.md: Metadata for each attack/benign capture (timestamps, categories, targets).
devices.csv: Machine-readable list of devices (MACs, IPs, roles, topics).
attacks.csv: Machine-readable list of all attacks and benign sessions.
tools/
Utility scripts for validating, processing, or analyzing dataset content. Examples include timestamp alignment between PCAPs and sensor JSONs, checksum verification, and feature extraction utilities.
examples/
Example usage code (e.g., Python scripts, Jupyter notebooks, Elasticsearch queries) to help users load, preprocess, and analyze the dataset efficiently.
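
Checksum verification, mentioned under tools/, can also be done by hand when a single archive needs checking. The sketch below is not the repository's unpack_dataset.py. It assumes the integrity files hold hex digests and that the algorithm is SHA-256; neither is stated in this document, so check the files in checksums/ and pass a different `algorithm` if needed. The function names are hypothetical.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex, algorithm="sha256"):
    """Compare a file's digest with an expected hex string.

    Assumption: the expected value is a plain hex digest; surrounding
    whitespace and letter case are ignored.
    """
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat regardless of archive size, which matters for the raw capture archives.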

Notes

Naming convention: Each attack or benign capture has a base filename shared by both the .pcap (network traffic) and .json (sensor data) file.
Alignment: Both PCAPs and JSON logs cover the same time window, making it possible to correlate network activity with sensor telemetry.
Processed features: CSV feature files include both network-derived features and sensor-derived features, structured by device and aggregated over consistent time windows.
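
The naming convention above makes it straightforward to pair each network capture with its sensor log. The following is a minimal sketch under the assumption that the extracted .pcap and .json files of a capture sit in the same directory; the function name `pair_captures` is hypothetical.

```python
from pathlib import Path


def pair_captures(directory):
    """Map each shared base filename to its (.pcap, .json) pair.

    Files without a counterpart are left out of the result.
    Assumption: both files of a capture are in the same directory.
    """
    directory = Path(directory)
    pcaps = {p.stem: p for p in directory.glob("*.pcap")}
    jsons = {p.stem: p for p in directory.glob("*.json")}
    shared = sorted(pcaps.keys() & jsons.keys())
    return {stem: (pcaps[stem], jsons[stem]) for stem in shared}
```

Comparing the keys of the result against the full file listing is a quick way to spot a capture whose .pcap or .json counterpart is missing after extraction.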