This repository contains the assignments completed during the Natural Language Processing (NLP) course, including individual tasks and a final project focused on a comprehensive text analysis.
- Assignments: Individual tasks covering NLP techniques such as tokenization, lemmatization, and word frequency analysis. Each assignment includes:
  - Source code.
  - Problem description and the proposed solution.
  - Conclusions and takeaways.
- Final Project: A detailed analysis of a dataset comprising texts generated by both humans and AI, with the main goal of distinguishing between the two types. The final project is described below.
The final project applies NLP techniques to analyze a dataset that contains essays written by humans and texts generated by AI. The primary goal is to identify the linguistic differences between human and AI-generated texts and to build a predictive model that can classify the origin of the texts.
The dataset can be accessed via this Kaggle link.
The dataset is analyzed with a focus on the following aspects:
- Lexical richness
- Preferred words
- Average word length
- Most and least frequent words
The classification setup and preprocessing were as follows:
- Logistic Regression and a Convolutional Neural Network (CNN) were implemented for text classification.
- The dataset was split into 60% for training, 20% for validation, and 20% for testing.
- A random sample of 25,000 texts was used, balanced between human and AI texts.
- Stop words (common, non-informative words) were removed to focus on meaningful content.
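
The sampling and splitting steps above can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's actual code: the function names `balanced_sample` and `split_60_20_20`, the `(text, label)` record layout, and the fixed seed are all assumptions made for the example.

```python
import random

def balanced_sample(records, n_total, seed=0):
    """Draw n_total (text, label) records with the same count per label.

    Raises ValueError if a label has fewer records than its share.
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in records:
        by_label.setdefault(label, []).append((text, label))
    per_label = n_total // len(by_label)
    sample = []
    for items in by_label.values():
        sample.extend(rng.sample(items, per_label))
    rng.shuffle(sample)
    return sample

def split_60_20_20(records, seed=0):
    """Shuffle and split records into 60% train, 20% validation, 20% test."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 6 // 10   # integer arithmetic avoids float rounding
    n_val = n * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

Note that this sketch balances the sample as a whole but does not stratify the individual splits, so the train, validation, and test sets may each deviate slightly from a 50/50 label ratio.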
Key findings:
- Lexical Richness: AI texts showed slightly higher lexical richness in the original texts (0.45 vs 0.43 for humans), but after removing stop words, human texts were more diverse (0.899 vs 0.877 for AI).
- Preferred Words: The most frequently used words of humans and AI overlapped heavily and consisted mainly of stop words. After filtering these out, distinct differences emerged between the two groups.
- Average Word Length: AI tends to use longer words more often than humans, possibly because AI-generated texts are more technical and specialized.
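
As a rough illustration of the measures above, the sketch below computes lexical richness, average word length, and the most frequent words before and after stop word removal. It assumes lexical richness means the type-token ratio (distinct words divided by total words); the project may define it differently. The tiny `STOP_WORDS` set, the regex tokenizer, and the sample sentence are placeholders for the example only.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (assumption); a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on"}

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def lexical_richness(tokens):
    """Type-token ratio: distinct words / total words (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def average_word_length(tokens):
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0

text = "The cat sat on the mat and the dog sat on the rug"
tokens = tokenize(text)
content = remove_stop_words(tokens)

print("richness (all words):    ", lexical_richness(tokens))
print("richness (no stop words):", lexical_richness(content))
print("average word length:     ", average_word_length(content))
print("top words (all):         ", Counter(tokens).most_common(3))
print("top words (no stop words):", Counter(content).most_common(3))
```

On this toy sentence the most frequent word overall is a stop word, which mirrors the overlap described under Preferred Words: the informative differences only show up once stop words are filtered out.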
Two classification models were implemented:
- Logistic Regression: A model was trained using features such as:
  - Text length.
  - Stop word percentage and linking word percentage.
  - Word frequency patterns.
The results were evaluated with precision, recall, and F1-score, and showed a clear distinction between AI and human texts.
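
The hand-crafted features and the evaluation metrics mentioned above could look roughly like this. It is a sketch under stated assumptions: text length is counted in tokens, the `STOP_WORDS` and `LINKING_WORDS` sets are short placeholders, and `extract_features` and `precision_recall_f1` are hypothetical helper names rather than functions from the notebook. The resulting feature dictionaries would then be fed to a logistic regression implementation of your choice.

```python
import re

# Placeholder word lists (assumptions); the project likely used fuller ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on"}
LINKING_WORDS = {"however", "therefore", "moreover", "furthermore", "thus"}

def extract_features(text):
    """Return text length and stop/linking word percentages for one text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    n = len(tokens) or 1  # avoid division by zero on empty texts
    return {
        "text_length": len(tokens),
        "stop_word_pct": 100 * sum(t in STOP_WORDS for t in tokens) / n,
        "linking_word_pct": 100 * sum(t in LINKING_WORDS for t in tokens) / n,
    }

def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

features = extract_features("However, the model is accurate")
scores = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

Reporting precision and recall separately is useful here because a classifier that labels everything as AI would score perfect recall on the AI class while its precision stays low.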
- Convolutional Neural Network (CNN): A CNN model was trained using TensorFlow layers (such as Conv1D, MaxPooling, Flatten, Dense, and Dropout), improving classification accuracy.
The model was validated on the 20% validation split, and its predictions showed a good fit with the actual labels.
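
A CNN built from the layer types listed above might be assembled as in the following sketch. The layer order, the `Embedding` input layer, and every hyperparameter (`VOCAB_SIZE`, `MAX_LEN`, `EMBED_DIM`, filter count, kernel size, dropout rate) are assumptions chosen for illustration; the architecture actually used is in `proyecto_final.ipynb`.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative hyperparameters (assumptions, not taken from the project).
VOCAB_SIZE = 20_000   # number of distinct token ids
MAX_LEN = 300         # tokens per text after padding/truncation
EMBED_DIM = 64

def build_cnn():
    """Binary text classifier: human (0) vs AI (1)."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(MAX_LEN,)),
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),      # token ids -> dense vectors
        layers.Conv1D(128, 5, activation="relu"),     # local n-gram style patterns
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),                          # regularization
        layers.Dense(1, activation="sigmoid"),        # probability of the AI class
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn()
model.summary()
```

Training would then pass the validation split explicitly, for example via the `validation_data` argument of `model.fit`, so that validation loss and accuracy are reported after each epoch.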
The final project demonstrates how NLP techniques can effectively differentiate between AI-generated and human-written texts by analyzing lexical richness, word usage, and other linguistic patterns. In addition, machine learning models such as logistic regression and neural networks prove to be valuable tools for classifying the origin of a text.
- Clone this repository:
git clone https://github.com/RubaGarcia/NLP.git
- Navigate to the project folder:
cd NLP
- Explore the assignments folder for individual NLP technique implementations.
- To run the final project, follow the instructions in proyecto_final.ipynb.
- Rubén García
- Javier Mier
Course 2023/2024