NLP - Natural Language Processing Repository

This repository contains the assignments completed during the Natural Language Processing (NLP) course, including individual tasks and a final project focused on a comprehensive text analysis.

Assignments: These individual tasks cover various NLP techniques, such as tokenization, lemmatization, and word frequency analysis. Each assignment includes:
- Source code.
- Problem description and the proposed solution.
- Conclusions and takeaways.

Final Project: A detailed analysis of a dataset comprising texts generated by both humans and AI, with the main goal of distinguishing between the two types. The final project is described below.

Final Project: AI vs Human Text Analysis

Introduction

The final project applies NLP techniques to analyze a dataset that contains essays written by humans and texts generated by AI. The primary goal is to identify the linguistic differences between human and AI-generated texts and to build a predictive model that can classify the origin of the texts.

The dataset can be accessed via this Kaggle link.

Objectives

The dataset is analyzed with a focus on the following aspects:

- Lexical richness
- Preferred words
- Average word length
- Most and least frequent words
- Implementation of Logistic Regression and a Convolutional Neural Network (CNN) for text classification

Data Pre-processing

- The dataset was split into 60% for training, 20% for validation, and 20% for testing.
- A random sample of 25,000 texts was used, balanced between human and AI texts.
- Stop words (common, non-informative words) were removed to focus on meaningful content.
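
The steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the function names, the toy corpus, and the short stop-word list are all assumptions, and the real project may rely on a library stop-word list and a different splitting utility.

```python
import random

# Tiny illustrative stop-word list (an assumption; the project's actual list is not given).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def balanced_sample(records, n_total, seed=0):
    """Draw n_total (text, label) records with equal numbers of labels 0 and 1."""
    rng = random.Random(seed)
    per_class = n_total // 2
    sample = []
    for label in (0, 1):
        pool = [r for r in records if r[1] == label]
        sample.extend(rng.sample(pool, per_class))
    rng.shuffle(sample)
    return sample


def split_60_20_20(records):
    """Split an already shuffled list into train, validation, and test parts."""
    n_train = len(records) * 60 // 100
    n_val = len(records) * 20 // 100
    train = records[:n_train]
    val = records[n_train:n_train + n_val]
    test = records[n_train + n_val:]
    return train, val, test


def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


# Toy corpus standing in for the real dataset: label 0 = human, 1 = AI.
corpus = [(f"the essay number {i} is written by a human", 0) for i in range(50)]
corpus += [(f"the text number {i} is generated by an AI", 1) for i in range(50)]

sample = balanced_sample(corpus, n_total=20)
train, val, test = split_60_20_20(sample)
print(len(train), len(val), len(test))
print(remove_stop_words("The essay is written by a human"))
```

With the project's numbers, `n_total` would be 25,000, giving 15,000 training, 5,000 validation, and 5,000 test texts.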

Lexical and Statistical Analysis

Lexical Richness:
- AI exhibited slightly higher lexical richness in the original texts (0.45 vs 0.43 for humans), but after removing stop words, human texts showed greater diversity (0.899 vs 0.877 for AI).
Preferred Words:
- The most frequently used words of humans and AI overlapped substantially and consisted primarily of stop words. After filtering stop words out, distinct differences emerged between the two groups.
Average Word Length:
- AI tends to use longer words more frequently than humans, possibly due to the more technical and specialized nature of AI-generated texts.
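
The README does not state how lexical richness was computed. One common definition is the type-token ratio (distinct words divided by total words), which the sketch below assumes. The helper names and the sample sentence are hypothetical.

```python
from collections import Counter


def lexical_richness(tokens):
    """Type-token ratio: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def average_word_length(tokens):
    """Mean number of characters per token."""
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


tokens = "the cat sat on the mat and the dog sat too".lower().split()

print(round(lexical_richness(tokens), 3))     # 8 distinct words out of 11
print(round(average_word_length(tokens), 3))
print(Counter(tokens).most_common(2))         # most frequent words
```

The type-token ratio depends on text length, so comparisons are most meaningful between samples of similar size.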

Classification and Prediction

Two classification models were implemented:

Logistic Regression: A model was trained using features like:
- Text length.
- Stop word percentage and linking word percentage.
- Word frequency patterns.
The results were reported with evaluation metrics such as precision, recall, and F1-score, showing a clear distinction between AI and human texts.
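
As a rough sketch of the features and metrics named above, the following plain-Python example computes text length, stop word percentage, and linking word percentage for one text, and precision, recall, and F1-score for a set of predictions. The word lists, the choice of token count as "text length", and the function names are assumptions for illustration only.

```python
# Illustrative word lists (assumptions; the project's actual lists are not given).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}
LINKING_WORDS = {"however", "therefore", "moreover", "furthermore", "additionally"}


def extract_features(text):
    """Return token count plus stop-word and linking-word percentages."""
    tokens = [w.strip(".,;:!?").lower() for w in text.split()]
    tokens = [t for t in tokens if t]
    n = len(tokens) or 1  # avoid division by zero on empty text
    return {
        "length": len(tokens),
        "stop_word_pct": 100 * sum(t in STOP_WORDS for t in tokens) / n,
        "linking_word_pct": 100 * sum(t in LINKING_WORDS for t in tokens) / n,
    }


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


features = extract_features("However, the model is accurate. Therefore, the results hold.")
print(features)

# Made-up labels and predictions: 1 = AI, 0 = human.
precision, recall, f1 = precision_recall_f1([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0])
print(precision, recall, f1)
```

Feature dictionaries like this one could be turned into numeric vectors and passed to any logistic regression implementation.
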
Convolutional Neural Network (CNN):
- A CNN model was trained using TensorFlow layers (such as Conv1D, MaxPooling, Flatten, Dense, and Dropout), improving classification accuracy.
- The model was validated using a 20% validation set, achieving a good fit between predictions and actual results.
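
To give an intuition for what the convolution and pooling layers compute, here is a toy forward pass in plain Python. It is not the project's TensorFlow model: it uses a single hand-picked filter on a single-channel sequence, whereas a real Conv1D layer learns many filters over embedded tokens. The function names and input values are hypothetical.

```python
def conv1d(sequence, kernel):
    """Slide the kernel over the sequence with stride 1 and no padding."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


def relu(values):
    """Replace negative activations with zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, pool_size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]


sequence = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0]  # stand-in for one embedding channel
kernel = [-1.0, 0.0, 1.0]                  # hand-picked filter, not learned

feature_map = relu(conv1d(sequence, kernel))
pooled = max_pool1d(feature_map)
print(feature_map)
print(pooled)
```

In the full model, the pooled feature maps would be flattened and passed through dense layers, with dropout applied during training, to produce the human-versus-AI prediction.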

Conclusion

The final project demonstrates how NLP techniques can effectively differentiate between AI-generated and human-written texts by analyzing lexical richness, word usage, and other linguistic patterns. Additionally, machine learning models such as logistic regression and neural networks prove to be valuable tools for classifying text origins.

Usage Instructions

Clone this repository and navigate to the project folder:
```
git clone https://github.com/RubaGarcia/NLP.git
cd NLP
```
Explore the assignments folder for individual NLP technique implementations.
To run the final project, follow the instructions in proyecto_final.ipynb.

Authors

Rubén García
Javier Mier

Course 2023/2024
