OriginCheck: AI vs. Human Text Identification System

OriginCheck is a research-driven project designed to distinguish between human-authored text and AI-generated content. By leveraging a hybrid neural network architecture that combines structural linguistic analysis with deep semantic processing, OriginCheck achieves high accuracy in identifying the source of textual content.

Project Overview

In the era of Large Language Models (LLMs), the boundary between human and machine-generated content is increasingly blurred. OriginCheck addresses this challenge by analyzing not just the semantic content but also the "linguistic fingerprints" of a text: statistical and structural patterns that differ between human writing and algorithmic generation.

Core Features

Hybrid Architecture: Combines a Convolutional Neural Network (CNN) for sequence modeling and a Multi-Layer Perceptron (MLP) for structural feature analysis.
Structural Analysis: Incorporates features such as average word length, sentence complexity, and unique word ratios.
High Precision: Trained on diverse datasets including human-written essays and AI outputs, reaching 99.1% accuracy, 98.9% precision, and 99.3% recall on the benchmarks reported under Results.
Real-time Detection: Interactive web-based interface for instant text classification.

Research Methodology

This research focuses on integrating statistical linguistic features into a neural network framework. Deep learning models are excellent at capturing contextual relationships, but human writing often exhibits subtle statistical variances in vocabulary diversity and structural consistency that machine-generated text does not always reproduce.
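
As an illustration of the kind of statistical features described above, here is a minimal sketch using only the Python standard library. The function name `structural_features`, the regex-based word and sentence splitting, and the use of average sentence length as a stand-in for "sentence complexity" are assumptions made for this example; the project's actual feature extraction may differ.

```python
import re


def structural_features(text: str) -> dict:
    """Compute a few simple structural metrics for one text sample.

    Hypothetical helper for illustration; not taken from the OriginCheck source.
    """
    # Lowercased alphabetic tokens (apostrophes kept so "don't" is one word).
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Naive sentence split on ., ! and ?; empty fragments are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    if not words or not sentences:
        return {
            "avg_word_length": 0.0,
            "avg_sentence_length": 0.0,
            "unique_word_ratio": 0.0,
        }

    return {
        # Mean number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Mean number of words per sentence (a rough proxy for complexity).
        "avg_sentence_length": len(words) / len(sentences),
        # Distinct words divided by total words (lexical diversity).
        "unique_word_ratio": len(set(words)) / len(words),
    }


sample = "The cat sat. The cat ran away!"
features = structural_features(sample)
print(features)
```

In a dual-path setup, a vector of such values would be computed per sample and typically scaled before being fed to the dense layers.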

Data Flow

Collection: Aggregated datasets containing labeled human and AI-generated text.
Preprocessing: Tokenization, padding, and normalization of text data.
Feature Engineering: Extraction of linguistic metrics (e.g., lexical diversity, word length distributions).
Dual-Path Modeling:
- Text Path: CNN layers process tokenized sequences to understand semantic flow.
- Linguistic Path: Dense layers process structural features.
Fusion Layer: Concatenates both paths for a comprehensive final classification.
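
The dual-path design above could be sketched with the Keras functional API roughly as follows. This is a sketch under stated assumptions, not the project's actual model: the vocabulary size, sequence length, number of structural features, layer widths, and all names (`build_hybrid_model`, `tokens`, `structural_features`, `p_ai`) are hypothetical values chosen for illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical hyperparameters; the real values are not given in this README.
VOCAB_SIZE = 20000   # size of the token vocabulary
MAX_LEN = 200        # padded sequence length
NUM_FEATURES = 3     # number of structural features per sample


def build_hybrid_model() -> tf.keras.Model:
    # Text path: embedding followed by a 1D convolution over the token sequence.
    tokens = tf.keras.Input(shape=(MAX_LEN,), dtype="int32", name="tokens")
    x = layers.Embedding(VOCAB_SIZE, 64)(tokens)
    x = layers.Conv1D(128, 5, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)

    # Linguistic path: dense layer over the structural feature vector.
    feats = tf.keras.Input(shape=(NUM_FEATURES,), name="structural_features")
    y = layers.Dense(32, activation="relu")(feats)

    # Fusion layer: concatenate both paths, then classify.
    merged = layers.Concatenate()([x, y])
    merged = layers.Dense(64, activation="relu")(merged)
    output = layers.Dense(1, activation="sigmoid", name="p_ai")(merged)

    model = tf.keras.Model(inputs=[tokens, feats], outputs=output)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


model = build_hybrid_model()

# Random dummy inputs, only to check that the two paths fuse correctly.
dummy_tokens = np.random.randint(0, VOCAB_SIZE, size=(2, MAX_LEN)).astype("int32")
dummy_feats = np.random.rand(2, NUM_FEATURES).astype("float32")
probs = model.predict([dummy_tokens, dummy_feats], verbose=0)
print(probs.shape)  # (2, 1): one probability per sample
```

The sigmoid output is read here as the probability that a sample is AI-generated; which class maps to 1 is an assumption of this sketch.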

Getting Started

Prerequisites

Python 3.8+
TensorFlow 2.x
NumPy, Pandas, Scikit-learn
Streamlit (for the web app)

Installation

# Clone the repository
git clone https://github.com/your-username/OriginCheck.git
cd OriginCheck

# Install dependencies
pip install -r requirements.txt

Usage

Machine Learning Pipeline

Use the provided scripts to train or evaluate the model:

train.py: Train the OriginCheck hybrid model.
evaluate.py: Test the model against new datasets.

Web Interface

Launch the interactive dashboard to test individual text samples:

streamlit run app.py

Results

OriginCheck has been validated against multiple datasets, including the Reddit filtered dataset and Kaggle AI-Human text benchmarks. It demonstrates robust performance across different writing styles and topics.

Metric	Score
Accuracy	99.1%
Precision	98.9%
Recall	99.3%
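
For reference, the three metrics in the table are conventionally computed from the confusion counts of a binary classifier. The sketch below uses made-up toy labels, not OriginCheck's evaluation data, and assumes label 1 means "AI-generated"; the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    return {
        # Share of all samples classified correctly.
        "accuracy": (tp + tn) / len(y_true),
        # Of the samples flagged as positive, the share that truly are.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the truly positive samples, the share that were flagged.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels for illustration only (1 = AI-generated, 0 = human-written).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With scikit-learn installed (listed under Prerequisites), `accuracy_score`, `precision_score`, and `recall_score` from `sklearn.metrics` compute the same quantities.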

Future Directions

Cross-Model Generalization: Enhancing detection for newer LLMs (e.g., GPT-5, Gemini 2.0).
Adversarial Robustness: Testing against "jailbroken" or obfuscated AI text.
Multilingual Support: Extending structural analysis to other languages.

This project was developed as part of a research initiative to enhance digital authenticity and academic integrity.
