
LLM-Detect-AI-Generated-Text

This project detects AI-generated text in essays using deep learning. It is built on Keras Core (which can run on TensorFlow or JAX backends) and KerasNLP, and fine-tunes pre-trained models such as DeBERTaV3 for text classification.

🛠 Installation

To set up the environment and install the necessary libraries, run:

!pip install -q keras_nlp==0.6.3 keras-core==0.1.7

📚 Import Libraries

The project uses the following libraries:

import os
import keras_nlp
import keras_core as keras
import keras_core.backend as K
import tensorflow as tf
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.model_selection import StratifiedKFold

⚙️ Configuration

The configuration class CFG is used to define various parameters such as:

- Verbose: verbosity level for the output
- Device: whether to use TPU, GPU, or CPU
- Epochs: number of training epochs
- Batch Size: size of each training batch
- Learning Rate Scheduler: cosine
- Class Names: labels for the classes (e.g., "real" and "fake")
- Seed: random seed for reproducibility
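A minimal sketch of such a configuration class follows; every value here is an assumed example for illustration, not the repository's actual setting:

```python
class CFG:
    # Assumed example values; the project defines its own.
    verbose = 0
    device = "TPU"                    # overwritten at runtime by get_device()
    seed = 42
    preset = "deberta_v3_base_en"     # KerasNLP DeBERTaV3 preset name
    sequence_length = 200             # tokens kept per essay
    epochs = 3
    batch_size = 32
    lr_schedule = "cosine"
    class_names = ["real", "fake"]
    num_classes = len(class_names)
    num_folds = 5                     # for StratifiedKFold
```

Keeping every hyperparameter in one class makes runs easy to reproduce and log.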

♻️ Reproducibility

Set the random seed for consistent results:

keras.utils.set_random_seed(CFG.seed)

💾 Hardware Detection

Automatically detect and initialize hardware (TPU, GPU, or CPU):

def get_device():
    # Try TPU first; otherwise fall back to the default GPU/CPU strategy
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
        return tf.distribute.TPUStrategy(tpu), "TPU"
    except ValueError:
        gpus = tf.config.list_logical_devices("GPU")
        return tf.distribute.get_strategy(), "GPU" if gpus else "CPU"
strategy, CFG.device = get_device()

📁 Dataset

The dataset contains essays labeled as either AI-generated (fake) or human-written (real) and is split into train and test sets. External datasets are also used to improve the model's performance.

🔪 Data Split

The data is divided into stratified folds using StratifiedKFold for cross-validation.

skf = StratifiedKFold(n_splits=CFG.num_folds, shuffle=True, random_state=CFG.seed)
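A common pattern is to write each sample's fold index back into the DataFrame so that any fold can later be selected for validation. The `add_folds` helper and the `fold` column name below are illustrative, not code from the repository:

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def add_folds(df, num_folds=5, seed=42, label_col="label"):
    # Assign each row a stratified fold index (hypothetical helper)
    skf = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=seed)
    df = df.copy()
    df["fold"] = -1
    for fold, (_, val_idx) in enumerate(skf.split(df, df[label_col])):
        df.loc[val_idx, "fold"] = fold
    return df

# Toy usage: 10 essays with balanced labels
toy = pd.DataFrame({"text": [f"essay {i}" for i in range(10)], "label": [0, 1] * 5})
toy = add_folds(toy, num_folds=5)
```

Stratification ensures every fold preserves the overall real/fake class balance.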

🍽️ Preprocessing

The preprocessing step tokenizes and converts the raw text into a format suitable for model input. It uses the DebertaV3Preprocessor from KerasNLP.

preprocessor = keras_nlp.models.DebertaV3Preprocessor.from_preset(
    preset=CFG.preset, 
    sequence_length=CFG.sequence_length,
)

🍚 DataLoader

The data is loaded and processed using TensorFlow's tf.data.Dataset API, which provides efficient data handling and pipeline construction.

def build_dataset(texts, labels=None, batch_size=32, cache=False, drop_remainder=True, repeat=False, shuffle=1024):
    AUTO = tf.data.AUTOTUNE
    # Code to build the dataset pipeline
    ...
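The stub above can be fleshed out roughly as follows. This is a sketch of a typical tf.data pipeline; the `preprocess_fn` argument is our stand-in for the KerasNLP preprocessor the project applies inside the pipeline:

```python
import tensorflow as tf

def build_dataset(texts, labels=None, batch_size=32, cache=False,
                  drop_remainder=True, repeat=False, shuffle=1024,
                  preprocess_fn=None):
    # Pipeline sketch: slice -> cache -> shuffle -> batch -> map -> prefetch
    AUTO = tf.data.AUTOTUNE
    slices = texts if labels is None else (texts, labels)
    ds = tf.data.Dataset.from_tensor_slices(slices)
    if cache:
        ds = ds.cache()                    # keep examples in memory after first pass
    if shuffle:
        ds = ds.shuffle(shuffle, seed=42)  # buffer-based shuffling
    ds = ds.batch(batch_size, drop_remainder=drop_remainder)
    if preprocess_fn is not None:
        ds = ds.map(preprocess_fn, num_parallel_calls=AUTO)  # e.g. tokenization
    if repeat:
        ds = ds.repeat()
    return ds.prefetch(AUTO)

# Toy usage: 10 texts in batches of 4; drop_remainder leaves 2 full batches
toy_ds = build_dataset([f"text {i}" for i in range(10)],
                       labels=list(range(10)), batch_size=4, shuffle=0)
n_batches = sum(1 for _ in toy_ds)
```

Batching before mapping lets the preprocessor tokenize whole batches at once, which is markedly faster than per-example tokenization.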

📊 Visualization

The class distribution of the dataset is visualized using Matplotlib.

cmap = plt.get_cmap("coolwarm")  # any Matplotlib colormap works here
plt.figure(figsize=(8, 4))
df.name.value_counts().plot.bar(color=[cmap(0.0), cmap(0.25), cmap(0.65), cmap(0.9), cmap(1.0)])
plt.xlabel("Class")
plt.ylabel("Count")
plt.title("Class distribution for Train Data")
plt.show()

📝 Training

The model is trained using the preprocessed data, and the training process is logged using Weights & Biases (WandB) for better tracking and visualization.
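The configuration names a cosine learning-rate schedule. A minimal sketch of such a schedule follows; the peak and floor rates are assumed values, not the project's:

```python
import math

def cosine_lr(epoch, epochs=3, lr_max=2e-5, lr_min=1e-6):
    # Decay the learning rate from lr_max to lr_min along a half-cosine curve
    progress = epoch / max(1, epochs - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Plugged into training via a Keras callback, e.g.:
# model.fit(..., callbacks=[keras.callbacks.LearningRateScheduler(cosine_lr)])
```

A cosine schedule starts at the peak rate and tapers smoothly, which tends to stabilize the final epochs of fine-tuning.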

🤖 Inference

The trained model is used to make predictions on the test set. The results are then compared to the ground truth for evaluation.
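For instance, turning raw model outputs into the class labels from the configuration could look like this; `decode_predictions` is an illustrative helper, not code from the repository:

```python
import numpy as np

def decode_predictions(probs, class_names=("real", "fake")):
    # Map predicted probabilities to class labels
    probs = np.asarray(probs)
    if probs.ndim == 1:            # single sigmoid output per sample
        idx = (probs > 0.5).astype(int)
    else:                          # softmax scores over both classes
        idx = probs.argmax(axis=-1)
    return [class_names[i] for i in idx]
```

Handling both sigmoid and softmax shapes keeps the helper usable whichever output head the classifier was built with.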

🙌 Acknowledgements

- TensorFlow and Keras for providing powerful deep learning tools
- Kaggle for the dataset
- Weights & Biases for experiment tracking
