A comprehensive data science project analyzing X (Twitter) Community Notes using advanced machine learning and data analysis techniques.
This project performs in-depth analysis of X Community Notes, a crowdsourced fact-checking system. The analysis includes:
- Topic Classification: 15-topic classification system covering conflicts, geopolitics, and general categories
- Community Detection: Network analysis to identify communities and their topic specializations
- Exploratory Data Analysis: Comprehensive statistical analysis and visualizations
- 15-Topic Classification System: Covers Ukraine Conflict, Gaza Conflict, Syria War, Iran, China-Taiwan, China Influence, Other Conflicts, Scams, Health/Medical, Climate/Environment, Politics, Technology, Economics, Entertainment, and Immigration
- TF-IDF + Logistic Regression: Efficient and accurate classification model
- Community Analysis: Louvain algorithm for community detection with topic specialization analysis
- Comprehensive Visualizations: Professional charts and dashboards for insights
- Python 3.8 or higher
- Required Python packages (see `requirements.txt` or the install instructions below)
- Clone or download this repository
- Install required dependencies:
```shell
pip install pandas numpy scikit-learn matplotlib seaborn networkx langdetect
```

Optional (for better language detection):

```shell
pip install textblob
```

Place your Community Notes data in the following structure:

```
data/
└── notes/
    └── notes-00000.tsv
```
The TSV file should contain at least the following columns:

- `noteId`: Unique identifier for each note
- `summary`: Text content of the note
- `createdAtMillis`: Creation timestamp in milliseconds (optional, used for temporal analysis)
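Loading and validating the TSV might look like the sketch below. The inline sample data is illustrative; in the real project you would read `data/notes/notes-00000.tsv` directly.

```python
import io

import pandas as pd

# Illustrative stand-in for data/notes/notes-00000.tsv
sample_tsv = (
    "noteId\tsummary\tcreatedAtMillis\n"
    "101\tThis post is missing context about the study.\t1700000000000\n"
    "102\tThe quoted statistic is outdated.\t1700000500000\n"
)

# In the real project: pd.read_csv("data/notes/notes-00000.tsv", sep="\t")
notes = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Validate the columns the pipeline relies on
required = {"noteId", "summary"}
missing = required - set(notes.columns)
if missing:
    raise ValueError(f"notes file is missing columns: {missing}")
```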
```
.
├── code/
│   ├── classification/
│   │   ├── topic_classifier.py               # Main classification module
│   │   └── analyze_classification_results.py # Results analysis and visualization
│   ├── clustering/
│   │   ├── communities_analysis.py           # Community detection and analysis
│   │   └── louvain_communities.ipynb         # Jupyter notebook for community analysis
│   ├── eda/
│   │   └── eda_analysis.py                   # Exploratory data analysis
│   ├── demo_workflow.py                      # Demo script showcasing the workflow
│   └── generate_report.py                    # Report generator for presentations
├── run_analysis.py                           # Main entry point script
├── plots/
│   ├── topic_classification/                 # Classification visualizations
│   ├── comuunity_detection/                  # Community analysis visualizations
│   └── eda/                                  # EDA visualizations
└── README.md                                 # This file
```
The easiest way to run the analysis is using the main entry point script:
```shell
# Run complete pipeline (classification + analysis + report)
python run_analysis.py all --max-notes 10000

# Or run individual components
python run_analysis.py classify --max-notes 10000
python run_analysis.py analyze
python run_analysis.py report

# Run demo workflow
python run_analysis.py demo --quick
```

For a quick demonstration of the complete workflow:
```shell
# Quick demo with 10K notes
python code/demo_workflow.py --quick

# Full demo with all notes
python code/demo_workflow.py --full

# Analysis only (on existing results)
python code/demo_workflow.py --analysis-only
```

```python
from code.classification.topic_classifier import CustomTopicClassifier

# Initialize classifier
classifier = CustomTopicClassifier(data_path="data", output_dir="results")

# Run complete pipeline
results = classifier.run_complete_pipeline(
    max_notes=10000,       # Limit to 10K notes for testing
    english_only=True,     # Filter to English-only notes
    force_refilter=False,  # Use cached English-filtered data if available
)
```

```python
from code.classification.analyze_classification_results import ClassificationAnalyzer

# Initialize analyzer
analyzer = ClassificationAnalyzer(results_dir="custom_topic_results")

# Load latest results and run complete analysis
results = analyzer.run_complete_analysis()
```

```python
from code.generate_report import ReportGenerator

# Generate presentation-ready report
generator = ReportGenerator(output_dir="reports")
report_file = generator.generate_report(results_dir="custom_topic_results")
```

```python
from code.clustering.communities_analysis import SpecializedVisualizationsDevide

# Initialize visualizer
visualizer = SpecializedVisualizationsDevide()

# Run all visualizations
results = visualizer.run_all_visualizations()
```

- `classified_notes_*.csv`: Full classification results with topic labels and confidence scores
- `trained_topic_model_*.pkl`: Saved trained model for reuse
- `seed_terms_*.json`: Seed terms used for each topic
- `classification_summary_*.json`: Complete metadata and statistics

- `classification_analytics/`: Classification analysis charts and visualizations
- `reports/`: Generated summary reports (text and markdown formats)
- `plots/topic_classification/`: Topic distribution, confidence analysis, temporal trends
- `plots/comuunity_detection/`: Community structure, topic leadership matrices
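Because output filenames are timestamped, a common pattern is to pick the newest file by sorted name. The sketch below builds a throwaway results directory for illustration; the column names (`topic`, `confidence`) are assumptions about the CSV schema.

```python
import glob
import os
import tempfile

import pandas as pd

# Stand-in results directory with one timestamped output file
results_dir = tempfile.mkdtemp()
pd.DataFrame({
    "noteId": [101, 102, 103],
    "topic": ["Politics", "Scams", "Politics"],
    "confidence": [0.91, 0.72, 0.85],
}).to_csv(os.path.join(results_dir, "classified_notes_20240101_120000.csv"),
          index=False)

# Timestamped names sort chronologically, so the last file is the latest
files = sorted(glob.glob(os.path.join(results_dir, "classified_notes_*.csv")))
latest = pd.read_csv(files[-1])
topic_counts = latest["topic"].value_counts()
```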
The report generator creates presentation-ready summaries:
- `analysis_report_*.txt`: Text format report
- `analysis_report_*.md`: Markdown format report
Reports include:
- Executive summary with key metrics
- Topic distribution analysis
- Confidence statistics
- Methodology overview
- Key insights and findings
- Seed Term Matching: Initial labels assigned based on keyword matching across 15 topic categories
- Model Training: TF-IDF vectorization (unigrams + bigrams) + Logistic Regression with class balancing
- Classification: Apply trained model to all notes with confidence scores
- Analysis: Comprehensive statistics and visualizations
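The first three steps can be sketched as below. The seed terms and example notes are illustrative only, not the project's actual lists, and the exact hyperparameters may differ from the project code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative seed terms for two of the 15 topics
seed_terms = {
    "Ukraine Conflict": ["ukraine", "kyiv", "donbas"],
    "Health/Medical": ["vaccine", "covid", "fda"],
}

def seed_label(text):
    """Return the first topic whose seed terms appear in the text, else None."""
    lowered = text.lower()
    for topic, terms in seed_terms.items():
        if any(term in lowered for term in terms):
            return topic
    return None

notes = [
    "Kyiv reported new strikes near the Donbas front line.",
    "The FDA cleared the updated covid vaccine last week.",
    "Shelling around Kyiv continued overnight.",
    "A new vaccine trial was announced by the FDA.",
    "An unrelated note about a film festival.",
]

# Step 1: weak labels from seed-term matching
seeded = [(n, seed_label(n)) for n in notes]
train = [(n, label) for n, label in seeded if label is not None]
texts, y = zip(*train)

# Steps 2-3: TF-IDF (unigrams + bigrams) + balanced logistic regression,
# then classify every note, seed-labeled or not, with confidence scores
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(class_weight="balanced", solver="lbfgs"),
)
model.fit(texts, y)
predicted = model.predict(notes)
confidence = model.predict_proba(notes).max(axis=1)
```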
- Louvain algorithm for community detection
- Topic specialization analysis per community
- Diversity and engagement metrics
- Topic leadership matrices
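The community-detection step can be sketched with NetworkX's Louvain implementation; the toy graph and per-node topic labels below are illustrative stand-ins for the project's real contributor network.

```python
from collections import Counter

import networkx as nx

# Toy interaction graph: two tight clusters joined by one bridge edge.
# In the project, nodes would be contributors linked by shared activity.
G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),   # cluster 1
    ("d", "e"), ("e", "f"), ("d", "f"),   # cluster 2
    ("c", "d"),                           # bridge
])

# Louvain modularity maximization; seed fixes the random tie-breaking
communities = nx.community.louvain_communities(G, seed=42)

# Topic specialization: the dominant topic among each community's members
node_topics = {"a": "Politics", "b": "Politics", "c": "Politics",
               "d": "Health/Medical", "e": "Health/Medical", "f": "Scams"}
specialization = []
for members in communities:
    counts = Counter(node_topics[n] for n in members)
    specialization.append(counts.most_common(1)[0][0])
```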
- Coverage Rate: 85.6% (vs ~40-60% typical for zero-shot methods)
- Model Accuracy: 81.4% (excellent for a 16-class problem)
- Processing Speed: ~1000 notes/second
- Topics Covered: 15 comprehensive categories
- Vectorization: TF-IDF with 50,000 max features
- N-grams: Unigrams and bigrams
- Classifier: Logistic Regression (LBFGS solver, OVR multi-class)
- Class Balancing: Automatic class weight balancing
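These settings translate to roughly the following scikit-learn configuration. This is an instantiation sketch, not the project's actual code; `max_iter` is an assumption added for convergence.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# TF-IDF with a 50,000-feature cap over unigrams and bigrams
vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))

# One-vs-rest logistic regression (LBFGS) with balanced class weights
classifier = OneVsRestClassifier(
    LogisticRegression(solver="lbfgs", class_weight="balanced", max_iter=1000)
)
```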
- Efficient memory usage (~337MB vs 5GB+ for neural models)
- Fast training and inference
- Scalable to millions of notes
- The classifier supports English-only filtering for better accuracy
- Cached English-filtered datasets are automatically saved for faster subsequent runs
- All outputs include timestamps for version tracking
This is a university data science project. For questions or improvements, please contact the project team.
This project is for academic/research purposes.
Data Science Project Team - Year 4
Last Updated: 2024