Various sources for deep-learning-based content moderation, sensitive content detection, scene genre classification, nudity detection, violence detection, and substance detection from text, audio, video, and image input modalities.
If you find this source useful, please consider citing it in your work as:
@INPROCEEDINGS{10193621,
author={Akyon, Fatih Cagatay and Temizel, Alptekin},
booktitle={2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)},
title={State-of-the-Art in Nudity Classification: A Comparative Analysis},
year={2023},
pages={1-5},
keywords={Analytical models;Convolution;Conferences;Transfer learning;Benchmark testing;Transformers;Safety;content moderation;nudity detection;safety;transformers},
doi={10.1109/ICASSPW59220.2023.10193621}}
@article{akyon2022contentmoderation,
title={Deep Architectures for Content Moderation and Movie Content Rating},
author={Akyon, Fatih Cagatay and Temizel, Alptekin},
journal={arXiv},
doi={10.48550/arXiv.2212.04533},
year={2022}
}
- datasets
- techniques
- tools
| name | paper | year | url | input modality | task | labels |
|---|---|---|---|---|---|---|
| LSPD | | 2022 | page | image, video | image/video classification, instance segmentation | porn, normal, sexy, hentai, drawings, female/male genital, female breast, anus |
| MM-Trailer | | 2021 | page | video | video classification | age rating |
| Movienet | scholar | 2021 | page | image, video, text | object detection, video classification | scene level actions and places, character bboxes |
| Movie script severity dataset | | 2021 | github | text | text classification | frightening, mild, moderate, severe |
| LVU | | 2021 | page | video | video classification | relationship, place, like ratio, view count, genre, writer, year per movie scene |
| Violence detection dataset | scholar | 2020 | github | video | video classification | violent, not-violent |
| Movie script dataset | | 2019 | github | text | text classification | violent or not |
| Nudenet | github | 2019 | github | image | image classification | nude or not |
| Adult content dataset | | 2017 | contact | image | image classification | nude or not |
| Substance use dataset | | 2017 | first author | image | image classification | drug related or not |
| NDPI2k dataset | | 2016 | contact | video | video classification | porn or not |
| Violent Scenes Dataset | springer | 2014 | page | video | video classification | blood, fire, gun, gore, fight |
| VSD2014 | | 2014 | download | video | video classification | blood, fire, gun, gore, fight |
| AIIA-PID4 | | 2013 | - | image | image classification | bikini, porn, skin, non-skin |
| NDPI800 dataset | scholar | 2013 | page | video | video classification | porn or not |
| HMDB-51 | scholar | 2011 | page | video | video classification | smoke, drink |
| name | paper | year | model | features | datasets | tasks | context |
|---|---|---|---|---|---|---|---|
| Movies2Scenes: Learning Scene Representations Using Movie Similarities | scholar | 2022 | ViT-like video encoder + MLP | ViT-like video encoder embeddings | Private, Movienet, LVU | movie scene representation learning, video classification (sex, violence, drug-use) | movie scene content rating |
| Detection and Classification of Sensitive Audio-Visual Content for Automated Film Censorship and Rating | | 2022 | CNN + GRU + MLP | CNN embeddings from video frames | Violence detection dataset | violent/non-violent classification from videos | movie scene content rating |
| Automatic parental guide ratings for short movies | page | 2021 | separate model for each task: concat + LSTM, object detector, one-class CNN embeddings | video frame pixel values, image embeddings, text | Nudenet, private dataset | profanity, violence, nudity, drug classification | movie content rating |
| From None to Severe: Predicting Severity in Movie Scripts | scholar | 2021 | multi-task pairwise ranking-classification network | GloVe, BERT and TextCNN text embeddings | Movie script severity dataset | rating classification (frightening, mild, moderate, severe) | movie content rating |
| A Case Study of Deep Learning-Based Multi-Modal Methods for Labeling the Presence of Questionable Content in Movie Trailers | scholar | 2021 | multi-modal + multi output concat+MLP | CNN+LSTM video features, BERT and DeepMoji text embeddings, MFCC audio features | MM-Trailer | rating classification (red, yellow, green) | movie trailer content rating |
| Automatic Parental Guide Scene Classification Menggunakan Metode Deep Convolutional Neural Network Dan Lstm | scholar | 2020 | 3 CNN models for 3 modalities, multi-label dataset | CNN video and audio embeddings, LSTM text (subtitle) embeddings | private dataset | gore, nudity, drug, profanity classification from video and subtitle | movie scene content rating |
| Multimodal data fusion for sensitive scene localization | scholar | 2019 | meta-learning with Naive Bayes, SVM | MFCC and prosodic features from audio, HOG and TRoF features from images | Pornography-2k dataset, VSD2014 | violent and pornographic scene localization from video | movie scene content rating |
| A Deep Learning approach for the Motion Picture Content Rating | scholar | 2019 | MLP + rule-based decision | InceptionV3 image embeddings | Violent Scenes Dataset, private dataset | violence (shooting, blood, fire, weapon) classification from video | movie scene content rating |
| Hybrid System for MPAA Ratings of Movie Clips Using Support Vector Machine | springer | 2019 | SVM | DCT features from image | private dataset | movie content rating classification from images | movie content rating |
| Inappropriate scene detection in a video stream | page | 2017 | SVM classifier + Lenet image classifier + rules-based decision | HoG and CNN features for image | private dataset | image classification: no/mild/high violence, safe/unsafe/pornography | movie frame content rating |
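
Several entries in the table above pair a frame-level classifier with a rule-based decision step. The sketch below shows one way such a step could turn per-frame labels into a single scene rating. The severity order, the `rate_scene` helper, and the 10% trigger fraction are illustrative assumptions, not taken from any of the listed papers.

```python
from collections import Counter

# Hypothetical severity order, lowest to highest; not from any listed paper.
SEVERITY = ["none", "mild", "high"]


def rate_scene(frame_labels, min_fraction=0.1):
    """Return the most severe label that covers at least `min_fraction` of frames."""
    if not frame_labels:
        return SEVERITY[0]
    counts = Counter(frame_labels)
    total = len(frame_labels)
    # Walk from most to least severe and stop at the first frequent-enough label.
    for label in reversed(SEVERITY):
        if counts[label] / total >= min_fraction:
            return label
    return SEVERITY[0]


if __name__ == "__main__":
    # 20 frames: one "high" frame is below the 10% trigger, three "mild" frames are above it.
    frames = ["none"] * 16 + ["mild"] * 3 + ["high"] * 1
    print(rate_scene(frames))
```

Requiring a minimum fraction of frames keeps a single misclassified frame from deciding the rating for the whole scene.
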
| name | paper | year | model | features | datasets | tasks | context |
|---|---|---|---|---|---|---|---|
| State-of-the-Art in Nudity Classification: A Comparative Analysis | ieee | 2023 | CNN, Transformers | EfficientNet, ViT, ConvNeXT image embeddings | LSPD, Nudenet, NDPI2k | nudity classification from images | general content moderation |
| Reliable Decision from Multiple Subtasks through Threshold Optimization: Content Moderation in the Wild | scholar | 2022 | novel threshold optimization tech. (TruSThresh) | prediction scores | UnSmile (Korean hate speech dataset) | optimum threshold prediction | social media content moderation |
| On-Device Content Moderation | scholar | 2021 | mobilenet v3 + SSD object detector | mobilenet v3 image embeddings | private dataset | object detection + nudity classification from images | on-device content moderation |
| Gore Classification and Censoring in Images | scholar | 2021 | ensemble of CNN + MLP | mobilenet v2, densenet, vgg16 image embeddings | private dataset | gore classification from images | general content moderation |
| Automated Censoring of Cigarettes in Videos Using Deep Learning Techniques | scholar | 2020 | CNN + MLP | inception v3 image embeddings | private dataset | cigarette classification from video | general content moderation |
| A Multimodal CNN-based Tool to Censure Inappropriate Video Scenes | scholar | 2019 | CNN + SVM | InceptionV3 image embeddings, AudioVGG audio embeddings | private dataset | inappropriate (nudity+gore) classification from video | general video content moderation |
| A baseline for NSFW video detection in e-learning environments | scholar | 2019 | concat + SVM, MLP | InceptionV3 image embeddings, AudioVGG audio embeddings | YouTube8M, NDPI, Cholec80 | nudity classification from video | e-learning content moderation |
| Bringing the kid back into youtube kids: Detecting inappropriate content on video streaming platforms | scholar | 2019 | CNN + LSTM (late fusion) | CNN based encoder for image, video and audio spectrograms | private dataset | video classification: original, fake explicit, fake violent | social media content moderation |
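
Most classifiers in the table above output a score that still has to be turned into a moderation decision by a threshold. As a generic baseline, and not the TruSThresh method listed above, the sketch below picks the threshold that maximizes F1 on a labeled validation set. The function names and the toy scores are assumptions made for illustration.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 of the rule `score >= threshold` against binary labels (1 = unsafe)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try every observed score as a threshold and keep the one with the highest F1."""
    candidates = sorted(set(scores))
    # On ties, max() keeps the first candidate, i.e. the lowest threshold.
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


if __name__ == "__main__":
    val_scores = [0.1, 0.4, 0.35, 0.8]  # toy validation scores
    val_labels = [0, 0, 1, 1]           # toy ground truth
    print(best_threshold(val_scores, val_labels))
```
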
Papers and tools for detecting animal cruelty content and illegal wildlife trade in images, video, and social media posts. This includes detection of wildlife trafficking imagery, identification of illegal wildlife products for sale online, and automated monitoring of animal abuse content on social media platforms.
| name | paper | year | model | features | datasets | tasks | context |
|---|---|---|---|---|---|---|---|
| A Cost-Effective LLM-based Approach to Identify Wildlife Trafficking in Online Marketplaces | acm | 2025 | LLM pseudo-labeling + specialized classifiers | LLM-generated pseudo labels, text and image features | online marketplace listings | wildlife trafficking listing classification (up to 95% F1) | online marketplace content moderation |
| Descriptive Analysis of Online Wildlife Products Using Vision Language Models | acm | 2025 | VLM + LLM pipeline | vision-language model embeddings | online wildlife product advertisements | product type, species, and IUCN/CITES status identification | online marketplace content moderation |
| Detection of trade in products derived from threatened species using machine learning and a smartphone | arxiv | 2025 | object recognition models | image features from elephant, pangolin, tiger products | wildlife product images | wildlife product detection and localization (91.3% accuracy on smartphone) | field and online detection |
| Unlocking the power of artificial intelligence for pangolin protection: Revolutionizing wildlife conservation with enhanced deep learning models | sciencedirect | 2024 | enhanced YOLOv8 with BiFPN + Triplet Attention | YOLOv8 image embeddings with attention mechanisms | pangolin image dataset | pangolin detection (87.0% mAP, 14.3 MB model) | wildlife trade monitoring |
| Detecting wildlife trafficking in images from online platforms: A test case using deep learning with pangolin images | sciencedirect | 2023 | fine-tuned CNNs | CNN image embeddings (iNaturalist, Flickr, Google images) | pangolin trade imagery | trade context classification (>90% detection rate) | online platform content moderation |
| Towards automatic detection of wildlife trade using machine vision models | sciencedirect | 2023 | DNNs (5 architectures, 3 training methods) | DNN image embeddings | exotic pet trade images (24 models trained) | exotic pet sale detection (f-score >0.95 in-distribution) | online platform content moderation |
| Detecting illegal wildlife trafficking via real time tomography 3D X-ray imaging and automated algorithms | frontiers | 2022 | 3D X-ray CT + computer algorithms | 3D X-ray tomography features | 294 scans from 13 species (lizards, birds, fish) | wildlife detection in luggage (82% detection, 1.6% false alarm) | border security screening |
| Use of Machine Learning to Detect Wildlife Product Promotion and Sales on Twitter | frontiers | 2019 | unsupervised topic model + keyword filtering | text features from tweets | 138,357 tweets | wildlife product sale detection on social media | social media content moderation |
| name | paper | year | model | features | datasets | tasks | context |
|---|---|---|---|---|---|---|---|
| Generative AI Against Poaching: Latent Composite Flow Matching for Wildlife Conservation | arxiv | 2025 | WildFlow (flow matching + occupancy model) | latent space representations with prior knowledge injection | datasets from two national parks in Uganda | poaching prediction under observation bias and data scarcity | anti-poaching surveillance |
| Pig aggression classification using CNN, Transformers and Recurrent Networks | arxiv | 2024 | TimeSformer, ViViT, ResNet3D, CnnLstm | transformer and CNN video embeddings | pig behavior video dataset | aggressive/non-aggressive behavior classification (0.729 median accuracy) | livestock welfare monitoring |
| Perspectives in machine learning for wildlife conservation | nature | 2022 | survey of ML approaches | various (CNN, transformer, acoustic models) | various wildlife datasets | detection, tracking, counting, behavior analysis | wildlife conservation review |
| A framework for investigating illegal wildlife trade on social media with machine learning | wiley | 2019 | ML pipeline (mining, filtering, identification) | deep learning image filters, NLP text features | social media data | illegal wildlife trade investigation pipeline | social media content moderation |
| Machine learning for tracking illegal wildlife trade on social media | nature | 2018 | deep neural networks | DNN image and text embeddings | Twitter data (rhino-related posts, 19 languages) | automated wildlife trade monitoring on social media (>90% image filtering) | social media content moderation |
| name | paper | year | model | features | datasets | tasks |
|---|---|---|---|---|---|---|
| Effectively leveraging Multi-modal Features for Movie Genre Classification | scholar | 2022 | embeddings + fusion + MLP | CLIP image embeddings, PANNs audio embeddings, CLIP text embeddings | MovieNet | movie genre classification |
| OS-MSL: One Stage Multimodal Sequential Link Framework for Scene Segmentation and Classification | scholar | 2022 | embeddings + novel transformer | ResNet-18 image embeddings, ResNet-VLAD audio embeddings | TI-News | news scene segmentation/classification (studio, outdoor, interview) |
| Detection of Animated Scenes Among Movie Trailers | scholar | 2022 | CNN + GRU | EfficientNet image embeddings | Private dataset | genre classification from movie trailer scenes |
| A multi-label movie genre classification scheme based on the movie's subtitles | springer | 2022 | KNN | text frequency vectors | Private dataset | genre classification from movie subtitle text |
| A multimodal approach for multi-label movie genre classification | scholar | 2020 | CNN + LSTM | MFCCs/SSD/LBP from audio, LBP/3DCNN from video frames, Inception-v3 from poster, TFIDF from text | Private dataset | genre classification from movie trailers |
| Genre classification of movie trailers using 3d convolutional neural networks | ieee | 2020 | 3D CNN | images | Private dataset | genre classification from movie trailer scenes |
| A unified framework of deep networks for genre classification using movie trailer | scholar | 2020 | CNN + LSTM | Inception V4 image embeddings | EmoGDB | genre classification from movie trailer scenes |
| Towards story-based classification of movie scenes | scholar | 2020 | logistic regression | manually extracted categorical features | Flintstones Scene Dataset | scene classification (Obstacle, Midpoint, Climax of Act 1) |
| name | paper | year | model | features | datasets | tasks | modalities |
|---|---|---|---|---|---|---|---|
| M&M Mix: A Multimodal Multiview Transformer Ensemble | scholar | 2022 | transformer with 2 cls heads | ViT image embeddings from audio spect., frame image, optical flow | Epic-Kitchens | video/action classification | image + audio + optical flow |
| MultiMAE: Multi-modal Multi-task Masked Autoencoders | scholar | 2022 | transformer with 3 decoder + cls heads | ViT-like image enc. patch embeddings (optional modalities) | ImageNet: Pseudo labeled multi-task training dataset (depth, segm) | image cls., semantic segm., depth est. | image + depth map |
| Data2vec: A general framework for self-supervised learning in speech, vision and language | scholar | 2022 | single encoder | transformer based audio, text, image encoder embeddings | ImageNet, Librispeech | masked pretraining | image + audio + text |
| VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | scholar | 2022 | 1 encoder per modality | transformer based audio, text, image encoder embeddings | AudioSet, HowTo100M | pretraining + video/audio classification | image + audio + text |
| Expanding Language-Image Pretrained Models for General Video Recognition | scholar | 2022 | 1 encoder per modality | transformer based video, text encoder embeddings | HMDB-51, UCF-101 | contrastive pretraining | video + text |
| Audio-Visual Instance Discrimination with Cross-Modal Agreement | scholar | 2021 | 1 encoder per modality | CNN based audio, video encoder embeddings | HMDB-51, UCF-101 | video/audio classification | video + audio |
| Robust Audio-Visual Instance Discrimination | scholar | 2021 | 1 encoder per modality | CNN based audio, video encoder embeddings | HMDB-51, UCF-101 | video/audio classification | video + audio |
| Learning transferable visual models from natural language supervision | scholar | 2021 | 1 encoder per modality | transformer based image, text encoder embeddings | WIT (WebImageText, 400M image-text pairs) | contrastive pretraining | image + text |
| Self-supervised multimodal versatile networks | scholar | 2020 | multiple encoders | CNN based image/audio embeddings, word2vec text embeddings | UCF101, Kinetics, AudioSet | contrastive pretraining + classification | image + audio + text |
| Uniter: Universal image-text representation learning | scholar | 2020 | multimodal encoder | combined embeddings | COCO, Visual Genome, Conceptual Captions | qa/image-text retrieval | image + text |
| 12-in-1: Multi-task vision and language representation learning | scholar | 2020 | multimodal encoder | combined embeddings | COCO, Flickr30k | qa/image-text retrieval | image + text |
| Two-stream convolutional networks for action recognition in videos | scholar | 2014 | 1 encoder per modality | CNN based RGB frame, optical flow encoder embeddings | HMDB-51, UCF-101 | video classification | video + optical flow |
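
Several rows above list contrastive pretraining as the training task. As a rough illustration of the idea, the sketch below computes a symmetric InfoNCE-style loss over paired embeddings in plain Python. The function names, the toy 2-D embeddings, and the default temperature are assumptions chosen for illustration; this is not the training code of any listed model.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the true pair."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss: item i in one modality should match item i in the other."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # sim[i][j] = temperature-scaled cosine similarity between image i and text j
    sim = [[dot(images[i], texts[j]) / temperature for j in range(n)] for i in range(n)]
    image_to_text = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]   # text i matches image i
    shuffled = [[0.0, 1.0], [1.0, 0.0]]  # pairs are swapped
    print(contrastive_loss(images, aligned) < contrastive_loss(images, shuffled))
```

Correctly paired embeddings give a loss near zero, while swapped pairs are penalized heavily, which is the signal that pulls matching modalities together during pretraining.
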
| name | paper | year | model | features | datasets | tasks | modalities |
|---|---|---|---|---|---|---|---|
| OmniMAE: Single Model Masked Pretraining on Images and Videos | scholar | 2022 | transformer with 1 cls. head | ViT-like image/video enc. patch embeddings | ImageNet, SSv2 | video/action classification | image + video |
| OMNIVORE: A Single Model for Many Visual Modalities | scholar | 2022 | transformer with 3 cls. heads | ViT-like image/video enc. patch embeddings | ImageNet, Kinetics, SSv2, SUN RGB-D | image cls., action recog., depth est. | image + video + depth map |
| Polyvit: Co-training vision transformers on images, videos and audio | scholar | 2021 | transformer with 9 cls. heads | ViT-like image/video/audio enc. embeddings | ImageNet, CIFAR, Kinetics, Moments in Time, AudioSet, VGGSound | image cls., video cls., audio cls. | image + video + audio |
| name | paper | year | model | features | datasets | tasks |
|---|---|---|---|---|---|---|
| Frozen CLIP Models are Efficient Video Learners | scholar | 2022 | transformer with 1 cls head | CLIP image embeddings | ImageNet, Kinetics, SSv2 | action recognition |
| Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training | scholar | 2022 | transformer with 1 cls head | ViT-like video enc. patch embeddings | Kinetics, SSv2 | action recognition |
| Bevt: Bert pretraining of video transformers | scholar | 2022 | encoder-decoder transformer | VideoSwin image/video enc. embeddings | Kinetics, SSv2 | action recognition |
| Video swin transformer | scholar | 2022 | Swin trans. with cls.head | Swin video enc. embeddings | Kinetics, SSv2 | action recognition |
| Is space-time attention all you need for video understanding? | scholar | 2021 | transformer with cls. head | ViT-like video enc. patch embeddings | Kinetics, SSv2 | action recognition |
| name | paper | year | model | features | datasets | tasks |
|---|---|---|---|---|---|---|
| X3d: Expanding architectures for efficient video recognition | scholar | 2020 | CNN with cls. head | 3D CNN based video enc. embeddings | Kinetics, SSv2 | action recognition |
| Slowfast networks for video recognition | scholar | 2019 | CNN with cls. head | 3D CNN based video enc. embeddings | Kinetics, SSv2 | action recognition |
| A closer look at spatiotemporal convolutions for action recognition (R2+1D) | scholar | 2018 | CNN with cls. head | 3D CNN based video enc. embeddings | Kinetics, HMDB-51, UCF-101 | action recognition |
| Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset (I3D) | scholar | 2017 | CNN with cls. head | 3D CNN based video enc. embeddings | Kinetics, HMDB-51, UCF-101 | action recognition |
| name | paper | date |
|---|---|---|
| Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text | scholar | 2021 |
| Supervised contrastive learning | scholar | 2020 |
| name | paper | date |
|---|---|---|
| Machine Learning Models for Content Classification in Film Censorship and Rating | | 2022 |
| A survey of artificial intelligence strategies for automatic detection of sexually explicit videos | scholar | 2022 |
| A survey on video content rating: taxonomy, challenges and open issues | | 2021 |
| Multimodal Learning with Transformers: A Survey | scholar | 2022 |
| A Survey Paper on Movie Trailer Genre Detection | scholar | 2020 |
| name | url | description |
|---|---|---|
| safetext | github | multilingual swear word detection and filtering |
| PySceneDetect | github | Python and OpenCV-based scene cut/transition detection program & library |
| LAION safety toolkit | github | NSFW detector trained on LAION dataset |
| pysrt | github | Python parser for SubRip (srt) files |
| ffsubsync | github | Automagically synchronize subtitles with video. |
| MoviePy | github | Video editing with Python |
| WildlifeVLM | github | Vision-language model pipeline for analyzing online wildlife product advertisements |
| PyTorch-Wildlife | github | Microsoft's collaborative deep learning framework for wildlife conservation (MegaDetector, species classification) |
| TrailGuard AI | page | Intel/RESOLVE AI-powered camera trap for real-time poacher and wildlife detection with on-device deep learning |
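
The subtitle and word-filtering tools above can be combined to flag profanity in a movie's dialogue. The sketch below illustrates the idea with the Python standard library only, so it does not rely on the API of pysrt or safetext. The `FLAGGED_WORDS` list, the `flag_subtitles` helper, and the sample cues are hypothetical placeholders.

```python
import re

# Hypothetical placeholder list; a real system would use a curated, multilingual list.
FLAGGED_WORDS = {"damn", "hell"}

# One SubRip cue: index line, "start --> end" line, then text up to a blank line or end of input.
SRT_BLOCK = re.compile(
    r"(\d+)\s*\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n(.*?)(?:\n\s*\n|\Z)",
    re.DOTALL,
)


def flag_subtitles(srt_text):
    """Return (start, end, text) for every subtitle cue containing a flagged word."""
    hits = []
    for _, start, end, text in SRT_BLOCK.findall(srt_text):
        words = set(re.findall(r"[a-z']+", text.lower()))
        if words & FLAGGED_WORDS:
            # Collapse multi-line cue text into a single line.
            hits.append((start, end, " ".join(text.split())))
    return hits


if __name__ == "__main__":
    sample = (
        "1\n00:00:01,000 --> 00:00:03,000\nHello there.\n\n"
        "2\n00:00:04,000 --> 00:00:06,500\nWell, damn.\nIt broke.\n"
    )
    for start, end, text in flag_subtitles(sample):
        print(start, end, text)
```

The returned timestamps can then be used to locate the matching video segments, for example after aligning subtitles with the video track.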