A machine learning pipeline for detecting toxic comments in text using LSTM-based neural networks. This project demonstrates text preprocessing, tokenization, and sequence modeling for text classification tasks.
- Cleaned text data using Texthero (removed digits, URLs, stop words, extra spaces).
- Applied tokenization and padding to standardize input lengths.
- Built an LSTM model with dense, batch normalization, and dropout layers to prevent overfitting.
- Supports predictions on both training and test datasets.
- Multi-label classification for detecting multiple toxicity types per comment.
- Text Cleaning Pipeline: Cleans input text using Texthero, removing digits, URLs, extra spaces, stop words, and other noise.
- Tokenization & Padding: Converts cleaned text into tokens and applies padding to ensure uniform input size for the model.
- Deep Learning Model: Builds a classification model using:
  - Cell-state LSTM layers for sequence learning
  - Dense layers for feature extraction
  - Batch normalization to stabilize training
  - Dropout layers to prevent overfitting
- Programming Language: Python
- Libraries / Frameworks: Texthero, TensorFlow / Keras, NumPy, Pandas
- Model Type: LSTM-based Neural Network for text classification
- Environment: Jupyter Notebook
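The tokenization and padding step listed above can be illustrated with a small pure-Python sketch. It is not the notebooks' actual code: the helper names `fit_vocabulary`, `texts_to_sequences`, and `pad_sequences_pre` and the sample comments are hypothetical, and the choice to pad and truncate at the front simply mirrors the default behaviour of Keras' `pad_sequences`, which the notebooks may or may not rely on.

```python
from collections import Counter


def fit_vocabulary(texts, num_words=None):
    """Map each word to an integer index, most frequent first.

    Index 0 is left unused so it can serve as the padding value.
    """
    counts = Counter(word for text in texts for word in text.split())
    return {
        word: idx
        for idx, (word, _) in enumerate(counts.most_common(num_words), start=1)
    }


def texts_to_sequences(texts, vocab):
    """Replace each known word with its index; unknown words are dropped."""
    return [[vocab[w] for w in text.split() if w in vocab] for text in texts]


def pad_sequences_pre(sequences, max_len, pad_value=0):
    """Left-pad or left-truncate every sequence to exactly max_len items.

    Assumes max_len is a positive integer.
    """
    padded = []
    for seq in sequences:
        seq = seq[-max_len:]  # keep only the last max_len tokens
        padded.append([pad_value] * (max_len - len(seq)) + seq)
    return padded


# Hypothetical, already-cleaned comments used purely for illustration.
comments = ["you are great", "you are so so rude"]
vocab = fit_vocabulary(comments)
sequences = texts_to_sequences(comments, vocab)
padded = pad_sequences_pre(sequences, max_len=4)
```

Every row of `padded` has the same length, which is what lets a batch of comments be fed to a sequence model as a single rectangular array.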
git clone https://github.com/YoshaM09/Toxic-Comment-Classification.git
cd Toxic-Comment-Classification
pip install -r requirements.txt
jupyter notebook notebook1_toxic_comment_classification.ipynb
jupyter notebook notebook2_toxic_comment_classification.ipynb

- Run the cells sequentially to reproduce the text preprocessing, tokenization, and model training steps, and to generate predictions on both the training and test datasets.
- Input raw text comments into the notebook.
- The model outputs predictions for the following categories: toxic, severe toxic, obscene, threat, insult, identity hate.
- Each comment can belong to one or more categories, enabling multi-label classification.
- Contributions are welcome! Please submit a pull request or open an issue for suggestions.
- This project is licensed under the MIT License.
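As an illustration of the multi-label output described earlier, the sketch below turns one comment's per-category scores into a list of labels. It assumes the model emits one independent probability per category (for example from a sigmoid output layer), which this README does not confirm. The `decode_predictions` helper, the 0.5 threshold, and the scores are hypothetical and are not results from the trained model.

```python
# Category names as listed in this README.
LABELS = ["toxic", "severe toxic", "obscene", "threat", "insult", "identity hate"]


def decode_predictions(probabilities, threshold=0.5):
    """Return every label whose probability meets the threshold.

    Each category is judged independently, so a comment can receive
    zero, one, or several labels.
    """
    return [label for label, p in zip(LABELS, probabilities) if p >= threshold]


# Made-up scores for a single comment, for illustration only.
example_scores = [0.91, 0.12, 0.77, 0.03, 0.64, 0.08]
predicted = decode_predictions(example_scores)
```

With these illustrative scores, `predicted` contains three labels at once, which is the difference between multi-label and single-label classification. Raising or lowering `threshold` trades precision against recall for every category.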