In this project, we developed a tool that detects machine-generated code. To achieve this, we fine-tuned the RoBERTa-base model on our self-collected data and compared the results with an SVM (Support Vector Machine) baseline. This repository holds content for
- Data Collection
- Exploratory Data Analysis
- Model Training and Deployment
We developed a Flask web app to host our model. It can be accessed here: demo
You can input a code snippet, and the model will classify whether it was written by a human or an AI.
The lines that contributed the most to the prediction result will be highlighted in color.
Since this repo involves training and tuning the model, please install the following packages before proceeding. Alternatively, you may choose to use tools like Google Colab.
- transformers
- datasets
- huggingface_hub
- tensorboard==2.11
- git-lfs
- torchvision
- tensorflow
We collected data for both human-written and AI-written code. For the human-written code, we used the codeparrot/github-code dataset
from Hugging Face (url). To generate AI-written code and ensure that it is comparable to the human-written code, we adopted the following pipeline:

To generate the AI-written code, we used OpenAI's API and asynchronously retrieved responses from the ChatGPT model. The prompts are stored in llm/prompts.json, and llm/data_generation.py retrieves the asynchronous responses from the API.
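The actual retrieval logic lives in llm/data_generation.py; as a rough sketch (assuming the openai>=1.0 Python client, a flat list of prompt strings in llm/prompts.json, and an illustrative model name), asynchronous generation could look like this:

```python
import asyncio
import json

from openai import AsyncOpenAI  # assumes openai>=1.0

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def generate_one(prompt: str) -> str:
    # Ask the chat model to produce code for a single prompt
    response = await client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main() -> None:
    with open("llm/prompts.json") as f:
        prompts = json.load(f)  # assumed to be a list of prompt strings
    # Fire all requests concurrently and collect the completions
    completions = await asyncio.gather(*(generate_one(p) for p in prompts))
    with open("ai_code.json", "w") as f:
        json.dump(completions, f)

if __name__ == "__main__":
    asyncio.run(main())
```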
To ensure balance and fairness in our training data, we performed exploratory analysis in eda/EDA.ipynb, generating word cloud images and the distribution of code file lengths in our data.
You can follow the procedure in the notebook to see the outputs.
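The notebook contains the full analysis; as a minimal sketch (assuming the wordcloud and matplotlib packages and a list of code strings named samples), the two plots could be produced like this:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

samples = ["def add(a, b):\n    return a + b"]  # placeholder for the collected code strings

# Word cloud over all tokens in the corpus
cloud = WordCloud(width=800, height=400).generate(" ".join(samples))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()

# Distribution of code file lengths (in characters)
plt.hist([len(s) for s in samples], bins=50)
plt.xlabel("Code length (characters)")
plt.ylabel("Count")
plt.show()
```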
To train our models, the following pipelines were adopted:
We trained our RoBERTa model in roberta/roberta.ipynb using computing power from Google Colab. You can follow the procedure inside to train the model.
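The exact settings are in roberta/roberta.ipynb; as a rough sketch of the standard transformers Trainer recipe (the dataset construction, the "code"/"label" column names, the 0 = human / 1 = AI label convention, and all hyperparameters below are illustrative assumptions):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Placeholder dataset: "code" holds the snippet, "label" is 0 (human) or 1 (AI)
raw = Dataset.from_dict({"code": ["print('hi')"], "label": [0]})

def tokenize(batch):
    # Truncate to RoBERTa's 512-token limit
    return tokenizer(batch["code"], truncation=True, padding="max_length")

ds = raw.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="code-human-ai",
    num_train_epochs=3,              # illustrative
    per_device_train_batch_size=16,  # illustrative
    learning_rate=2e-5,              # illustrative
)
trainer = Trainer(model=model, args=args, train_dataset=ds, eval_dataset=ds)
trainer.train()
```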
The trained SVM model is included in SVM/SVM.ipynb.
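SVM/SVM.ipynb contains the actual baseline; a minimal sketch of such a classifier (assuming TF-IDF features over the raw code text, which may differ from the notebook's exact feature extraction) could be:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Placeholder training data: code snippets labeled 0 = human, 1 = AI
codes = ["def add(a, b):\n    return a + b", "print('generated')"]
labels = [0, 1]

# TF-IDF features feeding a support vector classifier
svm = make_pipeline(TfidfVectorizer(), SVC())
svm.fit(codes, labels)
print(svm.predict(["x = 1 + 2"]))
```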
To use our already fine-tuned model, you can load it directly from Hugging Face. To load the pretrained tokenizer and model, run the following lines:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("lebretou/code-human-ai")
model = AutoModelForSequenceClassification.from_pretrained("lebretou/code-human-ai")
To use the same label encoder that we used, download the pickle file llm/label_encoder and load it with the following lines:
import pickle

# Load the label encoder used during training
with open('label_encoder.pkl', 'rb') as file:
    label_encoder = pickle.load(file)
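Putting the pieces together, an end-to-end inference sketch (assuming the label encoder is a scikit-learn LabelEncoder whose classes line up with the model's output indices) could look like this:

```python
import pickle

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("lebretou/code-human-ai")
model = AutoModelForSequenceClassification.from_pretrained("lebretou/code-human-ai")

with open("label_encoder.pkl", "rb") as file:
    label_encoder = pickle.load(file)  # assumed to be a sklearn LabelEncoder

snippet = "def add(a, b):\n    return a + b"
inputs = tokenizer(snippet, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()
print(label_encoder.inverse_transform([pred])[0])  # e.g. "human" or "ai"
```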
In this project, we employ several evaluation metrics to assess the performance of our models. These metrics provide insight into different aspects of performance, particularly accuracy and the balance between precision and recall. Below is a description of each metric used:
- Accuracy: Accuracy measures the proportion of true results (both true positives and true negatives) among the total number of cases examined. It gives a straightforward indication of the model's overall correctness across all classes. Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN).
- Interpretation: A higher accuracy indicates a model that generally predicts more correct outcomes, but it may not always be reliable if the data set is unbalanced between classes.
- F1 Score: The F1 score is the harmonic mean of precision and recall. It is a better measure than accuracy when the class distribution is imbalanced, and it provides more insight into the balance between precision and recall. Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall).
- Interpretation: An F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. It is particularly useful when you need to balance precision (the accuracy of positive predictions) and recall (the ability of the model to find all the relevant cases).
- AUC Score (Area Under the ROC Curve): The AUC score is the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate at various threshold settings. It represents the likelihood that the model ranks a randomly chosen positive instance above a randomly chosen negative one. Formula: AUC = ∫₀¹ TPR d(FPR), i.e. the integral of the ROC curve over the false positive rate.
- Interpretation: An AUC score of 0.5 suggests no discriminative ability (equivalent to random guessing), while a score of 1.0 indicates perfect discrimination. A higher AUC value means a better performing model in terms of distinguishing between positive and negative classes across all thresholds.
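All three metrics can be computed with scikit-learn. For example, with y_true as the ground-truth labels, y_pred as the predicted labels, and y_score as the predicted probability of the positive (AI) class:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Illustrative labels and scores for a binary human-vs-AI classification
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
y_score = [0.2, 0.6, 0.7, 0.9]  # probability of the positive (AI) class

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))
```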
RoBERTa outperforms SVM across most metrics and difficulty levels, indicating that it handles both classification accuracy and the balance between precision and recall better.
- Easy Tasks: RoBERTa shows a significant advantage in all metrics with an accuracy of 0.9556, F1-score of 0.9565, and AUC-score of 0.9652 compared to SVM's 0.6833, 0.6698, and 0.7366 respectively. This suggests that RoBERTa is better suited for handling tasks where the complexity is lower.
- Medium Tasks: The performance gap between the two models narrows slightly in medium difficulty, but RoBERTa still maintains a strong lead with an accuracy of 0.8617, F1-score of 0.8602, and AUC-score of 0.8213 compared to SVM’s 0.7147, 0.6706, and 0.7173. This indicates RoBERTa's robustness in moderately challenging scenarios.
- Hard Tasks: In hard tasks, RoBERTa's performance declines but remains more balanced than SVM's, with an accuracy of 0.7241, F1-score of 0.7377, and AUC-score of 0.7133 versus SVM's 0.9623, 0.4904, and 0.6612. It is noteworthy that SVM's accuracy here is high while its F1-score is significantly lower, suggesting that although it labels many instances correctly, it struggles to balance precision and recall in more complex scenarios.
Our project has laid a solid foundation and shown promising results. However, to further enhance the capabilities and applicability of our model, we are planning the following advancements:
- Adopt RoBERTa-Large Model
- Objective: Leverage the increased capacity of RoBERTa-Large for a better understanding of context in our task. RoBERTa-Large, with its larger architecture, offers improved performance on a variety of NLP benchmarks, which could significantly boost the robustness and accuracy of our predictions. Alternatively, we could use data augmentation to artificially expand our dataset with more examples of long code entries, which would help train the model to handle such cases effectively.
- Improve Testing Accuracy on Ultra-Long Code Entries
- Challenge: Currently, our model is relatively inaccurate when dealing with long sequences of code, which are challenging due to their complexity and the requirement to maintain long-term dependencies.
- Approach: One solution could be enhancing our model's attention mechanisms to handle long-distance dependencies in code better, possibly integrating more sophisticated variants like sparse or global attention.
- Extension into More Languages
- Goal: To make our tool universally applicable by supporting multiple programming languages. This extension will help cater to a broader audience, including developers working in diverse coding environments and scenarios.


