IdahoLabUnsupported/lotl-network-anomaly-detection

Structured Anomalies

Networks at many sites have well-defined structures with well-known servers. For instance, DNS requests are usually handled by one or two main DNS servers with static IPs. When devices on the network communicate with one another, they should respect this structure. We define "structural anomalies" as connections that do not match the normal flow of the established network. For instance, clients requesting files from an unknown file server might be deemed anomalous.
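To make the definition concrete, here is a toy sketch that flags connections whose destination has never handled that service in baseline traffic. The record layout `(source, destination, service)`, the function names, and the IP addresses are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def learn_baseline(connections):
    """Map each service to the set of servers seen handling it in normal traffic."""
    baseline = defaultdict(set)
    for _src, dst, service in connections:
        baseline[service].add(dst)
    return baseline


def structural_anomalies(baseline, connections):
    """Return connections whose destination never served that service in the baseline."""
    return [
        (src, dst, service)
        for src, dst, service in connections
        if dst not in baseline.get(service, set())
    ]


# Hypothetical (source, destination, service) records.
normal_traffic = [
    ("10.0.0.5", "10.0.0.2", "dns"),
    ("10.0.0.6", "10.0.0.2", "dns"),
    ("10.0.0.5", "10.0.0.9", "smb"),
]
new_traffic = [
    ("10.0.0.6", "10.0.0.2", "dns"),   # the usual DNS server
    ("10.0.0.5", "10.0.0.77", "smb"),  # a file server never seen in the baseline
]

flagged = structural_anomalies(learn_baseline(normal_traffic), new_traffic)
print(flagged)
```

A fixed lookup like this only catches destinations it has never seen. It is meant to illustrate the definition, not to stand in for a learned model.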

We detect these structural anomalies using a Graph Convolutional Network (GCN). Our GCN constructs a graph representation of the computer network where devices are nodes and connections are edges. With this GCN, we aim to learn the structure of the network well enough to tell when communications are out of place. The GCN creates embeddings to represent the network. We cluster these embeddings to begin building a vocabulary that will serve as the input to the final sequential model (not included in this repository). All code for the GCN can be found in models/.

Navigating This Repository

dev-vocab.py

When the only output you need is the clustered device vocabulary for each device at the connection level, dev-vocab.py is the only file you need to interface with. The file has two functions. The function generate-vocab-into-csv takes in the training data (as a dictionary of dataframes, where each dataframe corresponds to a log type) and the file paths at which to store the trained models. It does return the clustered vocabulary, but its main purpose is to train and store the following models:

Model 1: A K-means model that separates internal clients and servers

Models 2 - K: GCN models for each class that we’ve separated our data into

Models K+1 - N: K-means models that leverage the silhouette index to cluster the embeddings for each class

The function generate-vocab-into-csv-inference takes in the testing data and the path to the models and returns the clustered vocabulary.

ClientServer.py

ClientServer.py stores the ClientServer class, which has three purposes: data preprocessing, clustering, and edge list creation. Data preprocessing collects and aggregates device-specific features: several small feature engineering functions create or modify features, which are then aggregated for each device. Once we have the aggregated features for each device, we cluster the devices on the best features to split clients and servers.

The function the user calls to create the edge list for a given set of nodes is create-edges-between-primary-nodes. It takes in two sets of nodes: primary nodes and projected nodes. We create two types of edges: edges between primary nodes that directly communicate with each other, and edges between primary nodes that share a projected node they both communicate with. If you’d like to create a graph from the entire dataset, make all nodes (devices) in the dataset primary nodes and make the projected nodes the empty set.
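The two edge types can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the repository's implementation: the function name `build_edges`, the `(source, destination)` record layout, and the device names are hypothetical, and the sketch assumes the primary and projected sets are disjoint and that edges are undirected.

```python
from itertools import combinations


def build_edges(connections, primary, projected):
    """Return sorted undirected edges between primary nodes.

    connections: iterable of (source, destination) pairs
    primary:     nodes that appear in the graph
    projected:   nodes that are collapsed into edges between primary nodes
    """
    edges = set()
    contacts = {}  # projected node -> primary nodes that communicated with it
    for src, dst in connections:
        if src == dst:
            continue
        if src in primary and dst in primary:
            # Type 1: two primary nodes talk to each other directly.
            edges.add(tuple(sorted((src, dst))))
        elif src in primary and dst in projected:
            contacts.setdefault(dst, set()).add(src)
        elif dst in primary and src in projected:
            contacts.setdefault(src, set()).add(dst)
    # Type 2: primary nodes that share a projected node get linked pairwise.
    for peers in contacts.values():
        for a, b in combinations(sorted(peers), 2):
            edges.add((a, b))
    return sorted(edges)


# Hypothetical connection records: clients c1..c3 and servers s1, s2.
connections = [("c1", "c2"), ("c1", "s1"), ("c3", "s1"), ("c2", "s2")]

edges = build_edges(connections, primary={"c1", "c2", "c3"}, projected={"s1", "s2"})
print(edges)

# Whole-dataset graph: every device is primary and nothing is projected.
all_nodes = {node for pair in connections for node in pair}
full_graph = build_edges(connections, primary=all_nodes, projected=set())
print(full_graph)
```

In the first call, c1 and c3 are linked only because both contacted s1, while s2 produces no edge because a single primary node reached it.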

GCN.py

The next file is GCN.py. It contains a class called GraphAutoencoder, which is the model we leverage to create our embeddings. The function create-GCN-embeddings takes in the dataframe of features for each node and the edge list, scales the data, and uses the GraphAutoencoder model to return our embeddings.
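As a rough intuition for what a graph convolution does with the scaled features and the edge list, the sketch below standardizes each feature column and then applies one neighborhood-mixing step using the symmetric normalized adjacency with self-loops. It is deliberately incomplete: there are no learned weights, no nonlinearity, and no decoder or training loop, all of which an actual graph autoencoder would have. The function names, the z-score scaling, and the sample data are assumptions for illustration, not taken from GCN.py.

```python
import math


def standardize(features):
    """Scale each feature column to zero mean and unit variance."""
    n = len(features)
    columns = list(zip(*features))
    means = [sum(col) / n for col in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / n) or 1.0
        for col, m in zip(columns, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in features]


def propagate(features, edges):
    """One propagation step, D^-1/2 (A + I) D^-1/2 X, without learned weights."""
    n = len(features)
    neighbors = [{i} for i in range(n)]  # every node starts with a self-loop
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    degree = [len(nb) for nb in neighbors]
    dim = len(features[0])
    out = []
    for i in range(n):
        row = [0.0] * dim
        for j in neighbors[i]:
            weight = 1.0 / math.sqrt(degree[i] * degree[j])
            for k in range(dim):
                row[k] += weight * features[j][k]
        out.append(row)
    return out


# Hypothetical per-device features (rows) and an edge list over row indices.
features = [[1.0, 5.0], [3.0, 5.0], [5.0, 8.0]]
edges = [(0, 1), (1, 2)]

embeddings = propagate(standardize(features), edges)
for row in embeddings:
    print(row)
```

Each output row mixes a device's own features with those of its neighbors, which is why devices in similar network positions end up with similar embeddings.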

Kmeans.py

This file has a single function, silhouette-scores-result. It takes in the embeddings and the dataframe of IP features those embeddings correspond to, runs K-means on the embeddings for each cluster count from 2 to 11, and uses the silhouette scores to determine the optimal number of clusters. Afterward, the cluster assignment for each IP is appended to the dataframe of IP features.
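The selection loop can be sketched without any third-party libraries. The version below is an assumption-laden illustration rather than the code in Kmeans.py: it uses a plain Lloyd's K-means with a deterministic farthest-first initialization (so the example is reproducible), computes the mean silhouette coefficient by hand, and returns the labels instead of appending them to a dataframe. All function names and the sample points are hypothetical.

```python
import math


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm with farthest-first initialization; returns a label per point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        return -1.0  # undefined for one cluster; treated as worst in this sketch
    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_clustering(points, k_values=range(2, 12)):
    """Try each k and keep the clustering with the highest silhouette score."""
    best = None
    for k in k_values:
        if k >= len(points):
            break  # the silhouette needs fewer clusters than points
        labels = kmeans(points, k)
        score = silhouette_score(points, labels)
        if best is None or score > best[1]:
            best = (k, score, labels)
    return best


# Hypothetical 2-D embeddings forming three well-separated groups.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 12), (10, 13), (11, 12), (11, 13),
    (20, 0), (20, 1), (21, 0), (21, 1),
]
k, score, labels = best_clustering(points)
print(k, round(score, 3), labels)
```

On this toy data the loop settles on three clusters, because splitting or merging any of the groups lowers the average silhouette.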

About

This code detects Living-Off-the-Land attacks by analyzing Zeek logs from network traffic. After preprocessing, it uses K-Means to label devices, applies a Graph Convolutional Network to generate embeddings, and then clusters these embeddings to identify suspicious patterns.
