Transform sensationalized news into factual, neutral reporting through advanced AI and NLP techniques
A full-stack news processing and analysis platform that extracts facts from news articles, clusters similar content, and presents desensationalized information through a modern architecture. Tired of clickbait headlines and biased spins? InFact cuts through the noise to deliver just the facts, because who has time for drama in their daily news?
Built with Ballerina for robust API gateway services and Python FastAPI for advanced AI processing pipelines.
Original InFact Implementation
InFact is your ultimate shield against sensationalized news! This platform automatically pulls articles from RSS feeds and external APIs, processes them with cutting-edge NLP to separate facts from opinions, clusters similar stories, and generates neutral summaries using AI. Built by a talented team from Sri Lanka, it's perfect for journalists, researchers, or anyone who wants unbiased information without the hype.
This implementation features a dual-service architecture:
- Ballerina Gateway: High-performance API gateway handling news aggregation, routing, and client interactions
- Python Pipeline: Advanced AI/ML processing engine for clustering, fact extraction, and content generation
InFact Platform/
├── ballerina-gateway/          # Ballerina API Gateway
│   ├── main.bal                # Main service endpoints
│   ├── modules/
│   │   ├── config/             # Database & API configuration
│   │   ├── types/              # Data models & schemas
│   │   └── utils/              # Business logic utilities
│   └── Config.toml             # Environment configuration
│
├── python-pipeline/            # AI Processing Engine
│   ├── main.py                 # FastAPI application entry
│   ├── core/                   # Configuration & database
│   ├── schemas/                # Pydantic data models
│   ├── services/               # API endpoints & business logic
│   └── utils/                  # NLP & AI processing tools
│
├── frontend/                   # React Frontend (Optional)
└── notebook/                   # Research & Development
- Smart Article Clustering - Groups related news stories using semantic similarity
- Fact vs Opinion Classification - Separates factual information from editorial content
- Neutral Article Generation - Creates unbiased summaries using Google Gemini AI
- Sentiment Analysis - Identifies and neutralizes emotional language
- Duplicate Detection - Automatic duplicate article detection and filtering
- Trending Topic Detection - Identifies emerging news patterns
- Source Bias Analysis - Tracks how different outlets cover the same story
- Real-time Statistics - Comprehensive metrics and insights
- Similarity Scoring - ML-based content similarity detection
- Weekly Digests - Automated news summaries
- Ballerina Gateway - Enterprise-grade API gateway with robust routing
- Async FastAPI Backend - High-performance Python processing with background tasks
- MongoDB Integration - Scalable document storage with intelligent clustering
- Modular Design - Clean separation of concerns with comprehensive error handling
- RSS Feed Automation - Automated news ingestion from configurable sources
- URL Tracking - Maintains links to original sources
- Image Processing - Automatic image selection for clusters
- Multi-source Aggregation - Combines articles from multiple news outlets
- Historical Analysis - Tracks news evolution over time
- Search & Filtering - Advanced query capabilities
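The duplicate detection feature listed above is not specified in detail here. As a rough illustration only, one common approach is to compare word shingles with Jaccard similarity. The function names, the `title` and `content` field names, and the 0.8 threshold below are assumptions made for this sketch, not InFact's actual implementation.

```python
import re


def shingles(text: str, size: int = 3) -> set:
    """Return the set of word n-grams (shingles) found in a lower-cased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        # Short texts become a single shingle (or no shingle when empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(articles: list, threshold: float = 0.8) -> list:
    """Keep the first article of each group of near-identical articles.

    `articles` is a list of dicts with hypothetical `title` and `content` keys.
    """
    kept = []
    kept_shingles = []
    for article in articles:
        current = shingles(article["title"] + " " + article.get("content", ""))
        if any(jaccard(current, seen) >= threshold for seen in kept_shingles):
            continue  # Too similar to an article that was already kept.
        kept.append(article)
        kept_shingles.append(current)
    return kept
```

This compares every new article against every kept one, which is fine for a single batch of a few hundred articles; a larger corpus would likely need an index such as MinHash.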
- Ballerina 2201.8.0+ (Download here)
- Python 3.11+
- MongoDB 5.0+ (local or cloud)
- Google Gemini API Key (Get one here)
- News API Key (Get one here)
# Clone the repository
git clone <repository-url>
cd infact-ballerina
cd ballerina-gateway
# Create Config.toml
cat > Config.toml << EOF
[ballerina_gateway.config]
mongoUri = "mongodb://localhost:27017"
databaseName = "newsstore"
[ballerina_gateway.utils]
newsApiKey = "your-news-api-key-here"
EOF
For detailed setup instructions, please refer to the ballerina-gateway/README.md.
For easy API testing and exploration, import the Postman collection: InFact API Collection
For detailed setup instructions, please refer to the python-pipeline/README.md.
cd python-pipeline
python main.py
# Available at: http://localhost:8091
cd ballerina-gateway
bal run
# Available at: http://localhost:9090
The Ballerina gateway provides enterprise-grade APIs for news management:
# Fetch articles from News API
curl -X POST "http://localhost:9090/news/fetchArticles" \
-H "Content-Type: application/json" \
-d '{"query": "technology", "pageSize": 20}'
# Get recent articles with pagination
curl "http://localhost:9090/news/articles?limit=20&skip=0"
# Extract from RSS feeds
curl -X POST "http://localhost:9090/news/rss-extract" \
-H "Content-Type: application/json" \
-d '{"from_date": "2025-08-22", "max_articles": 50}'
# Process articles with AI clustering
curl -X POST "http://localhost:9090/news/process-with-storage" \
-H "Content-Type: application/json" \
-d '{"articles": [...], "n_clusters": 3}'
# Auto-processing pipeline
curl -X POST "http://localhost:9090/news/scrape-process-store?days_back=7"
# Get trending topics
curl "http://localhost:9090/news/trending-topics?days_back=30"
# Search clusters
curl -X POST "http://localhost:9090/news/search" \
-H "Content-Type: application/json" \
-d '{"query": "climate change", "limit": 10}'
# Weekly digest
curl "http://localhost:9090/news/weekly-digest"
Advanced AI processing capabilities:
# Direct processing with storage
curl -X POST "http://localhost:8000/api/v1/process-with-storage" \
-H "Content-Type: application/json" \
-d '{"articles": [...], "n_clusters": 3}'
# Get cluster statistics
curl "http://localhost:8000/api/v1/clusters/stats"
# Automated scraping and processing
curl -X POST "http://localhost:8000/api/v1/scrape-process-store?days_back=7"
Full API Documentation:
- Ballerina Gateway: `http://localhost:9090/news` (OpenAPI spec available)
- Python Pipeline: `http://localhost:8000/docs` (Interactive Swagger UI)
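The curl calls above can also be issued from Python. The sketch below uses only the standard library and mirrors the gateway's `search` request; the helper names `build_post` and `send` are hypothetical, and the assumption that responses are JSON is not confirmed by this README.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:9090/news"  # Gateway base URL from the examples above.


def build_post(path: str, payload: dict, base_url: str = GATEWAY_URL) -> urllib.request.Request:
    """Build a JSON POST request equivalent to the curl examples."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{path.lstrip('/')}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the body, assuming the service returns JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (requires the gateway to be running locally):
# result = send(build_post("search", {"query": "climate change", "limit": 10}))
# print(result)
```

Separating request construction from sending keeps the payload logic testable without a running gateway.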
---
- Framework: Ballerina 2201.8.0+ (Cloud-native programming language)
- Database: MongoDB with connection pooling
- External APIs: News API, RSS feeds integration
- Features: RESTful APIs, async processing, robust error handling
- Framework: FastAPI (Python 3.11+)
- NLP & ML: spaCy, sentence-transformers, scikit-learn, gensim
- AI: Google Generative AI (Gemini 2.0 Flash)
- Data Processing: NumPy, pandas, PyTorch, NLTK
- Database: MongoDB (via pymongo)
- Features: Async processing, background tasks, ML pipelines
- Framework: React 19.1.1 + Vite 7.1.2
- Styling: Tailwind CSS 4.1.12
- Features: Responsive design, real-time updates
graph TD
A[RSS/API Sources] --> B[Ballerina Gateway]
B --> C[Article Extraction]
C --> D[Python Pipeline]
D --> E[Text Preprocessing]
E --> F[Semantic Embeddings]
F --> G[Clustering Algorithm]
G --> H[Similarity Check]
H --> I{Similar Cluster?}
I -->|Yes| J[Merge Clusters]
I -->|No| K[Create New Cluster]
J --> L[Fact Extraction]
K --> L
L --> M[AI Generation]
M --> N[Store in MongoDB]
N --> O[Update Analytics]
O --> P[Return to Gateway]
- Data Ingestion - Ballerina gateway fetches from RSS feeds and News API
- Text Preprocessing - Tokenization, lemmatization, noise removal
- Embedding Generation - Semantic vectors using sentence-transformers
- Smart Clustering - KMeans with TF-IDF enhancement
- Similarity Analysis - Compare with existing clusters
- Intelligent Merging - Combine similar clusters or create new ones
- Fact Extraction - NER + sentiment analysis for classification
- Deduplication - Remove redundant information
- AI Generation - Create neutral summaries with Gemini
- Persistent Storage - MongoDB with indexing
- Media Processing - Image selection and URL tracking
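The similarity analysis and merging steps above can be sketched as a merge-or-create decision on cluster centroids. This is a simplified illustration in plain Python, assuming each cluster is summarized by a centroid embedding; the function names, the 0.75 threshold, and the count-weighted centroid update are assumptions, and the actual pipeline may work differently.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_cluster(new_centroid: list, existing: dict, threshold: float = 0.75):
    """Return the id of the most similar existing cluster, or None.

    None means no existing centroid reaches the threshold, so the caller
    would create a new cluster instead of merging.
    """
    best_id, best_score = None, threshold
    for cluster_id, centroid in existing.items():
        score = cosine_similarity(new_centroid, centroid)
        if score >= best_score:
            best_id, best_score = cluster_id, score
    return best_id


def merge_centroids(old: list, old_count: int, new: list, new_count: int) -> list:
    """Combine two centroids, weighting each by its number of articles."""
    total = old_count + new_count
    return [(o * old_count + n * new_count) / total for o, n in zip(old, new)]
```

A higher threshold produces more, narrower clusters; a lower one merges more aggressively, so the value would need tuning against real articles.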
cd ballerina-gateway
bal test
cd python-pipeline
pytest
pytest --cov=. --cov-report=html
# Test complete pipeline
curl -X POST "http://localhost:9090/news/scrape-process-store?max_articles=5"
### Performance Optimization
- **Ballerina**: Connection pooling, async processing
- **Python**: GPU acceleration, batch processing, caching
- **MongoDB**: Proper indexing, sharding for scale
# Fork and clone
git clone <your-fork-url>
cd infact-ballerina
# Setup both services
cd ballerina-gateway && bal build
cd ../python-pipeline && pip install -r requirements-dev.txt
# Run tests
cd ../ballerina-gateway && bal test
cd ../python-pipeline && pytest
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project was built by an awesome team from the University of Moratuwa, Sri Lanka:
- Backend Architect: HimathX (Dhanapalage Himath Nimpura Dhanapala) - Ballerina gateway & MongoDB integration
- Frontend Wizard: codevector-2003 (Haren Daishika) - React interface & user experience
- AI/ML Engineer: LazySeaHorse (Raj Pankaja) - NLP pipeline & AI processing
This project is licensed under the MIT License - see the LICENSE file for details.
Built with ❤️ (and a bit of caffeine) by the InFact Team. Stay factual, folks!