Structured Anomalies

Networks at many sites have well-defined structures with well-known servers. For instance, DNS requests are usually handled by one or two main DNS servers with static IPs. When devices on the network communicate with one another, they should respect this structure. We define "structural anomalies" as connections that don't fit the normal flow of the established network. For instance, clients requesting files from an unknown file server might be deemed anomalous.

We detect these structural anomalies using a Graph Convolutional Network (GCN). We construct a graph representation of the computer network in which devices are nodes and connections are edges. With the GCN, we aim to learn the structure of the network well enough to tell when communications are out of place. The GCN creates embeddings that represent the network. We cluster these embeddings to begin building a vocabulary that will serve as the input to the final sequential model (not included in this repository). All code for the GCN can be found in models/.
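As a rough illustration of the graph representation described above, the sketch below builds an undirected adjacency map from connection records. The record layout (source IP, destination IP) and the function name build_graph are assumptions made for this example, not the repository's API.

```python
from collections import defaultdict


def build_graph(connections):
    """Build an undirected adjacency map from (source_ip, dest_ip) pairs.

    Hypothetical helper: devices become nodes, connections become edges.
    """
    adjacency = defaultdict(set)
    for src, dst in connections:
        if src == dst:
            continue  # ignore self-connections
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    return dict(adjacency)


# Made-up records: two clients querying one DNS server, plus one client
# contacting a host that nothing else on the network talks to.
connections = [
    ("10.0.0.5", "10.0.0.2"),
    ("10.0.0.6", "10.0.0.2"),
    ("10.0.0.6", "10.0.0.99"),
]
graph = build_graph(connections)
```

In this toy graph, 10.0.0.99 has a single neighbor while the DNS server is shared by both clients. That is the kind of structural difference the embeddings are meant to capture.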

Navigating This Repository

dev_vocab.py

When the only output you need is the clustered device vocabulary at the connection level, dev_vocab.py is the only file you need to interact with. The file has two functions. The function generate-vocab-into-csv takes in the training data (as a dictionary of dataframes, where each dataframe corresponds to a log type) and the file paths at which to store the trained models. It also returns the clustered vocabulary, but its main purpose is to train and store the following models:

Model 1: a K-means model that separates internal clients and servers.

Models 2 through K: GCN models, one for each class that we've separated our data into.

Models K+1 through N: K-means models that use the silhouette index to cluster the embeddings for each class.

The function generate-vocab-into-csv-inference takes in the testing data and the path to the stored models and returns the clustered vocabulary.

ClientServer.py

ClientServer.py contains the ClientServer class, which has three purposes: data preprocessing, clustering, and edge list creation. The data preprocessing collects and aggregates device-specific features. Several small feature engineering functions create or modify features and then aggregate them for each device. Once we have the aggregated features for each device, we cluster on the best features to split clients from servers.
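The aggregation step could look roughly like the sketch below. The feature names, the record layout (source IP, destination IP, destination port), and the function aggregate_device_features are all hypothetical; the repository's actual features are not listed in this README.

```python
from collections import defaultdict


def aggregate_device_features(connections):
    """Aggregate per-device features from (src_ip, dst_ip, dst_port) records.

    Hypothetical feature set chosen for illustration only.
    """
    peers_out = defaultdict(set)     # devices each source has contacted
    peers_in = defaultdict(set)      # devices that have contacted each destination
    ports_served = defaultdict(set)  # destination ports each device has answered on
    for src, dst, dst_port in connections:
        peers_out[src].add(dst)
        peers_in[dst].add(src)
        ports_served[dst].add(dst_port)

    devices = sorted(set(peers_out) | set(peers_in))
    return {
        device: {
            "distinct_peers_out": len(peers_out[device]),
            "distinct_peers_in": len(peers_in[device]),
            "distinct_ports_served": len(ports_served[device]),
        }
        for device in devices
    }


# Made-up records: two clients and two servers.
connections = [
    ("10.0.0.5", "10.0.0.2", 53),
    ("10.0.0.6", "10.0.0.2", 53),
    ("10.0.0.6", "10.0.0.3", 445),
]
features = aggregate_device_features(connections)
```

Rows like these, with one feature vector per device, are the kind of input a two-cluster K-means could then split into clients (mostly outbound) and servers (mostly inbound).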

The function the user calls to create the edge list for a given set of nodes is create-edges-between-primary-nodes. It takes in two sets of nodes: primary nodes and projected nodes. We create two types of edges: edges between primary nodes that communicate with each other directly, and edges between primary nodes that both communicate with the same projected node. If you'd like to create a graph from the entire dataset, make the primary nodes the set of all nodes (devices) in the dataset and make the projected nodes the empty set.
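The two edge types can be sketched as follows. This is a minimal illustration assuming that connections are (source, destination) pairs and that the primary and projected sets are disjoint; sketch_edge_list is a hypothetical name and is not the repository's create-edges-between-primary-nodes.

```python
from itertools import combinations


def sketch_edge_list(connections, primary_nodes, projected_nodes):
    """Return sorted undirected edges between primary nodes.

    Edge type 1: two primary nodes that communicate directly.
    Edge type 2: two primary nodes that share a projected node.
    """
    primary = set(primary_nodes)
    projected = set(projected_nodes)
    edges = set()
    shared = {}  # projected node -> primary nodes that communicate with it

    for src, dst in connections:
        if src == dst:
            continue
        if src in primary and dst in primary:
            edges.add(tuple(sorted((src, dst))))
        elif src in primary and dst in projected:
            shared.setdefault(dst, set()).add(src)
        elif dst in primary and src in projected:
            shared.setdefault(src, set()).add(dst)

    # Connect every pair of primary nodes that share a projected node.
    for members in shared.values():
        for a, b in combinations(sorted(members), 2):
            edges.add((a, b))
    return sorted(edges)


connections = [("c1", "c2"), ("c1", "s1"), ("c3", "s1"), ("c2", "s2")]

# Clients as primary nodes, servers as projected nodes.
client_edges = sketch_edge_list(connections, {"c1", "c2", "c3"}, {"s1", "s2"})

# Whole-dataset graph: every device is primary, nothing is projected.
full_edges = sketch_edge_list(connections, {"c1", "c2", "c3", "s1", "s2"}, set())
```

In the first call, c1 and c3 are linked only because both talk to s1. In the second call, the result is simply the set of direct connections.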

GCN.py

GCN.py contains a class called GraphAutoencoder, which is the model we use to create our embeddings. The function create-GCN-embeddings takes in the dataframe of features for each node and the edge list, scales the data, and uses the GraphAutoencoder model to return our embeddings.
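The repository's GraphAutoencoder is not reproduced here. The following NumPy sketch only illustrates the general graph autoencoder idea, assuming a two-layer GCN encoder with symmetric adjacency normalization and an inner-product decoder. The weights are random and untrained, and every name in the block is hypothetical.

```python
import numpy as np


def normalized_adjacency(edges, n_nodes):
    """Compute D^{-1/2} (A + I) D^{-1/2} for an undirected edge list."""
    a = np.eye(n_nodes)  # self-loops, so every degree is at least 1
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def encode(features, edges, w1, w2):
    """Scale features, then apply two graph-convolution layers."""
    x = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)
    a_hat = normalized_adjacency(edges, len(x))
    hidden = np.maximum(a_hat @ x @ w1, 0.0)  # ReLU
    return a_hat @ hidden @ w2                # one embedding row per node


def decode(embeddings):
    """Inner-product decoder: predicted edge probability for each node pair."""
    return 1.0 / (1.0 + np.exp(-(embeddings @ embeddings.T)))


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))      # 4 devices, 3 made-up features each
edges = [(0, 1), (1, 2), (2, 3)]        # node indices, not IPs
w1 = rng.normal(scale=0.1, size=(3, 4))
w2 = rng.normal(scale=0.1, size=(4, 2))

embeddings = encode(features, edges, w1, w2)
reconstruction = decode(embeddings)
```

A trained autoencoder would fit w1 and w2 so that the reconstruction matches the observed adjacency. Here the point is only the data flow from features and edges to per-node embeddings.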

Kmeans.py

This file has a single function, silhouette-scores-result. It takes in the embeddings and the dataframe of IP features those embeddings correspond to. It then runs K-means on the embeddings for each cluster count from 2 to 11 and uses the silhouette score to determine the optimal number of clusters. Afterward, the cluster assignment for each IP is appended to the dataframe of IP features.
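Selecting the cluster count by silhouette score can be sketched with scikit-learn as below. The function name pick_k_by_silhouette, the fixed random_state, and the synthetic embeddings are assumptions for this example; the repository's function may differ in its details.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def pick_k_by_silhouette(embeddings, k_values=range(2, 12)):
    """Try each k, keep the clustering with the highest silhouette score."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_values:
        if k >= len(embeddings):
            break  # silhouette_score needs fewer clusters than samples
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
        if len(set(labels)) < 2:
            continue  # silhouette is undefined for a single cluster
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels


# Synthetic embeddings: three tight, well-separated groups of 20 points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
embeddings = np.vstack([c + rng.normal(scale=0.1, size=(20, 2)) for c in centers])

best_k, labels = pick_k_by_silhouette(embeddings)
```

The returned labels have one entry per embedding row, so they can be attached to the per-IP feature table as a new column.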
