NLP - Natural Language Processing Repository

This repository contains the assignments completed during the Natural Language Processing (NLP) course, including individual tasks and a final project focused on a comprehensive text analysis.

Contents

  1. Assignments: These individual tasks cover various NLP techniques, including tokenization, lemmatization, and word frequency analysis. Each assignment includes:

    • Source code.
    • Problem description and the proposed solution.
    • Conclusions and takeaways.
  2. Final Project: A detailed analysis of a dataset comprising texts generated by both humans and AI, with the main goal of distinguishing between the two types. The final project is described below.


Final Project: AI vs Human Text Analysis

Introduction

The final project applies NLP techniques to analyze a dataset that contains essays written by humans and texts generated by AI. The primary goal is to identify the linguistic differences between human and AI-generated texts and to build a predictive model that can classify the origin of the texts.

The dataset can be accessed via this Kaggle link.

Objectives

The dataset is analyzed with a focus on the following aspects:

  • Lexical richness
  • Preferred words
  • Average word length
  • Most and least frequent words
  • Implementation of Logistic Regression and a Convolutional Neural Network (CNN) for text classification.

Data Pre-processing

  • The dataset was split into 60% for training, 20% for validation, and 20% for testing.
  • A random sample of 25,000 texts was used, balanced between human and AI texts.
  • Stop words (common, non-informative words) were removed to focus on meaningful content.
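
The pre-processing steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's actual code: the function names, the `(text, label)` record layout, the label encoding (0 for human, 1 for AI), and the tiny stop-word list are all hypothetical.

```python
import random

# Tiny illustrative stop-word list; the list used in the project is not stated here.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def balanced_sample(records, n_total, seed=0):
    """Draw n_total (text, label) records, half from each label."""
    rng = random.Random(seed)
    per_class = n_total // 2
    human = [r for r in records if r[1] == 0]
    ai = [r for r in records if r[1] == 1]
    sample = rng.sample(human, per_class) + rng.sample(ai, per_class)
    rng.shuffle(sample)  # mix the two classes before splitting
    return sample


def split_60_20_20(records):
    """Split an already shuffled list into train, validation, and test parts."""
    n_train = len(records) * 60 // 100
    n_val = len(records) * 20 // 100
    train = records[:n_train]
    val = records[n_train:n_train + n_val]
    test = records[n_train + n_val:]
    return train, val, test


def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


# Toy corpus standing in for the Kaggle data (assumed encoding: 0 = human, 1 = AI).
corpus = [(f"human essay {i}", 0) for i in range(50)]
corpus += [(f"ai essay {i}", 1) for i in range(50)]

sample = balanced_sample(corpus, n_total=20)
train, val, test = split_60_20_20(sample)
print(len(train), len(val), len(test))
print(remove_stop_words("The model is trained in a notebook"))
```

With the real dataset, `n_total` would be 25,000, and a library stop-word list would likely replace the small set used here.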

Lexical and Statistical Analysis

  1. Lexical Richness:

    • AI exhibited slightly higher lexical richness in the original texts (0.45 vs 0.43 for humans), but after removing stop words, human texts showed greater diversity (0.899 vs 0.877 for AI).
  2. Preferred Words:

    • Both humans and AI showed significant overlap in their most frequently used words, composed primarily of stop words. However, after filtering them out, distinct differences emerged between the two groups.
  3. Average Word Length:

    • AI tends to use longer words more frequently than humans, possibly due to the more technical and specialized nature of AI-generated texts.
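
The three measures above could be computed along the following lines. This is a rough sketch that assumes lexical richness means the type-token ratio (distinct words divided by total words); the README does not state which definition the project used, and the tokenizer and function names are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [t for t in stripped if t]


def lexical_richness(tokens):
    """Type-token ratio (assumed definition): distinct words / total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def average_word_length(tokens):
    """Mean number of characters per token."""
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


tokens = tokenize("The cat saw the dog, and the dog saw the cat.")
print(lexical_richness(tokens))        # 5 distinct words out of 11
print(average_word_length(tokens))
print(Counter(tokens).most_common(2))  # most frequent words with their counts
```

Running the same functions on the token lists before and after stop-word removal gives the two kinds of figures compared in the lexical richness item.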

Classification and Prediction

Two classification models were implemented:

  1. Logistic Regression: A model was trained using features like:

    • Text length.
    • Stop word percentage and linking word percentage.
    • Word frequency patterns.

    The model was evaluated with precision, recall, and F1-score, and the results showed a clear distinction between AI and human texts.

  2. Convolutional Neural Network (CNN):

    • A CNN model was trained using TensorFlow layers (such as Conv1D, MaxPooling, Flatten, Dense, and Dropout), improving classification accuracy.

    The model was validated on the 20% validation split, where its predictions closely matched the true labels.
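
To make the logistic regression step concrete, the sketch below extracts three of the features listed above (text length, stop word percentage, linking word percentage), fits a logistic regression by plain gradient descent, and computes precision, recall, and F1-score. Everything here is an assumption for illustration: the word lists, the feature scaling, the toy texts, and the hand-written trainer are hypothetical, and the project itself may well have used a library implementation. Word frequency features are left out for brevity.

```python
import math

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}  # illustrative
LINKING_WORDS = {"however", "moreover", "therefore", "furthermore"}   # illustrative


def features(text):
    """Return [word count / 100, stop-word share, linking-word share]."""
    words = text.lower().split()
    n = len(words) or 1
    stop = sum(w in STOP_WORDS for w in words) / n
    link = sum(w in LINKING_WORDS for w in words) / n
    return [len(words) / 100, stop, link]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """Classify as AI (1) when the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


def precision_recall_f1(y_true, y_pred):
    """Metrics for the positive class (label 1, assumed to mean AI)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy training texts; in the project these would come from the training split.
texts = [
    "however the results are clear therefore the model is good",
    "moreover it is furthermore shown",
    "i went to the park",
    "my dog likes the ball a lot",
]
labels = [1, 1, 0, 0]

X = [features(t) for t in texts]
w, b = train_logreg(X, labels)
preds = [predict(w, b, x) for x in X]
print(preds)
print(precision_recall_f1(labels, preds))
```

In practice the metrics would be computed on the held-out test split rather than on the training texts, as done here only to keep the example short.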

Conclusion

The final project demonstrates how NLP techniques can effectively differentiate between AI-generated and human-written texts by analyzing lexical richness, word usage, and other linguistic patterns. Additionally, machine learning models such as logistic regression and neural networks prove to be valuable tools for classifying text origins.


Usage Instructions

  1. Clone this repository:
    git clone https://github.com/RubaGarcia/NLP.git
  2. Navigate to the project folder:
    cd NLP
  3. Explore the assignments folder for individual NLP technique implementations.
  4. To run the final project, follow the instructions in proyecto_final.ipynb.

Authors

  • Rubén García
  • Javier Mier

Course 2023/2024
