Mastodon Collector
Collects posts, replies, and mentions from a list of Mastodon accounts and stores them in PostgreSQL. It includes automated toxicity analysis through an LLM API, a web UI for account management, data browsing, and manual review of flagged content, and JSON/CSV APIs for your analysis pipeline.
Quick Start
# 1. Add accounts to monitor
echo "@user@mastodon.social" >> accounts.txt
# 2. Start everything
docker compose up -d
# 3. Open the dashboard
open http://localhost:8585   # macOS; on Linux, use xdg-open
Architecture
| Service | Description | Port |
|---|---|---|
| db | PostgreSQL 16 | 5432 |
| web | Flask dashboard (Gunicorn) | 8585 |
| collector | Background service, polls every 4 hours | — |
Adding Accounts
Two methods:
- Text file: edit accounts.txt, one handle per line (@user@instance.social). Picked up on the next collection cycle.
- Web UI: go to http://localhost:8585/accounts and use the form.
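Because the Quick Start appends to accounts.txt with `>>`, the file can end up with blank lines or repeated handles. The sketch below shows one way such a file could be parsed. The function name `parse_accounts` and the handle pattern are assumptions for illustration, not the collector's actual validation rules.

```python
import re

# Assumed handle shape: @user@instance.tld (the collector's real rules may differ).
HANDLE_RE = re.compile(r"^@([A-Za-z0-9_]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})$")


def parse_accounts(text: str) -> list[tuple[str, str]]:
    """Return unique (username, instance) pairs from one-handle-per-line text."""
    seen: set[tuple[str, str]] = set()
    accounts: list[tuple[str, str]] = []
    for line in text.splitlines():
        handle = line.strip()
        if not handle:
            continue  # skip blank lines
        match = HANDLE_RE.match(handle)
        if match is None:
            continue  # skip lines that are not well-formed handles
        # Instance names are case-insensitive, so normalise them before deduplicating.
        key = (match.group(1), match.group(2).lower())
        if key not in seen:
            seen.add(key)
            accounts.append(key)
    return accounts
```

Reading the file would then be `parse_accounts(open("accounts.txt").read())`.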
Configuration
Edit .env to customize:
POSTGRES_PASSWORD=collector_secret # Change for production
FLASK_SECRET_KEY=change-me-in-production
POLL_INTERVAL_SECONDS=14400 # Default: 4 hours (14400s)
LLM_API_KEY=sk-... # Required for toxicity analysis
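As a rough illustration of how a service might consume these variables, the sketch below reads them from the process environment and falls back to the documented default of 14400 seconds (4 × 3600). The function names `load_settings` and `analysis_enabled` are hypothetical, and the project's own configuration code may be organised differently.

```python
import os


def load_settings(env=None) -> dict:
    """Read collector settings from an environment mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        # Documented default: 4 hours, expressed in seconds.
        "poll_interval_seconds": int(env.get("POLL_INTERVAL_SECONDS", "14400")),
        # An empty or missing key is treated as "not configured".
        "llm_api_key": env.get("LLM_API_KEY") or None,
    }


def analysis_enabled(settings: dict) -> bool:
    """Toxicity analysis needs an API key, per the comment in .env."""
    return settings["llm_api_key"] is not None
```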
Toxicity Analysis
The system includes automated toxicity detection and manual review capabilities:
Features
- Automated Classification: Uses an LLM API to score posts across 12 toxicity dimensions:
  - General toxicity, threats, hate speech
  - Racism, antisemitism, islamophobia
  - Sexism, homophobia, ableism
  - Insults, dehumanization, extremism
- Flagging System: Posts with overall toxicity ≥ 0.5 are automatically flagged for review
- Manual Review Interface: Web dashboard at /analysis/flagged for human validation
- Analysis Dashboard: Statistics, trends, and category breakdowns at /analysis
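The flagging rule can be sketched in a few lines. The threshold comes from the feature list; the category key spellings and the helper names `is_flagged` and `top_categories` are assumptions, and this README does not state how the overall score is derived from the 12 dimensions.

```python
FLAG_THRESHOLD = 0.5  # overall toxicity >= 0.5 is flagged for review

# The 12 dimensions from the feature list; these key spellings are assumed.
CATEGORIES = (
    "toxicity", "threats", "hate_speech",
    "racism", "antisemitism", "islamophobia",
    "sexism", "homophobia", "ableism",
    "insults", "dehumanization", "extremism",
)


def is_flagged(overall: float, threshold: float = FLAG_THRESHOLD) -> bool:
    """Return True when a post should enter the manual review queue."""
    return overall >= threshold


def top_categories(scores: dict[str, float], n: int = 3) -> list[str]:
    """Return the n highest-scoring known categories, highest first."""
    known = {name: value for name, value in scores.items() if name in CATEGORIES}
    return sorted(known, key=known.get, reverse=True)[:n]
```

A reviewer-facing view could use `top_categories` to show why a post was flagged.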
Running Analysis
# Analyze all unscored statuses (run inside collector container)
docker exec mastodon-collector-collector-1 bash -c "python -m app.analyzer"
# Limit to first 100 statuses for testing
docker exec mastodon-collector-collector-1 bash -c "ANALYZER_LIMIT=100 python -m app.analyzer"
Analysis Database Schema
Additional tables for toxicity analysis:
- toxicity_scores: toxicity scores per status (12 categories + overall)
- analysis_runs: audit trail of analysis runs with costs and duration
Cost Estimation
- Batch processing: ~10 posts per API call
- Estimated cost: ~$0.12 per 1,000 posts analyzed
- Example: 16,906 posts ≈ $1.95
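The figures above can be turned into a quick estimator. This is a minimal sketch that uses the rounded rate and batch size from the list; the function name `estimate` is hypothetical, and real costs depend on the provider, the model, and post length.

```python
import math

COST_PER_1000_POSTS = 0.12  # approximate rate quoted above, in USD
POSTS_PER_CALL = 10         # approximate batch size quoted above


def estimate(posts: int) -> tuple[int, float]:
    """Return (API calls, estimated cost in USD) for a number of posts."""
    calls = math.ceil(posts / POSTS_PER_CALL)
    cost = posts / 1000 * COST_PER_1000_POSTS
    return calls, round(cost, 2)
```

With the rounded rate, `estimate(16906)` gives 1691 calls and about $2.03. The $1.95 example above implies an effective rate closer to $0.115 per 1,000 posts.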
API Endpoints
For plugging into your analysis pipeline:
| Endpoint | Description |
|---|---|
| GET /api/stats | Overview stats (counts by type) |
| GET /api/statuses | Paginated statuses as JSON |
| GET /export | Download all statuses as CSV |
/api/statuses parameters
- page: page number (default: 1)
- per_page: results per page (default: 100, max: 500)
- account_id: filter by internal account ID
- type: filter by status type: post, reply, mention, reblog
- since: ISO datetime, only return statuses after this time
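A pipeline could call this endpoint with the standard library alone. The sketch below builds the query string from the parameters listed above and fetches one page. The helper names `statuses_url` and `fetch_page` are hypothetical, the base URL assumes the local Quick Start setup, and the shape of the JSON response is not documented here, so the result is returned without interpretation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8585"  # dashboard address from Quick Start


def statuses_url(page=1, per_page=100, account_id=None, status_type=None,
                 since=None, base_url=BASE_URL) -> str:
    """Build a /api/statuses URL from the documented query parameters."""
    # Keep per_page within the documented maximum of 500.
    params = {"page": page, "per_page": max(1, min(per_page, 500))}
    if account_id is not None:
        params["account_id"] = account_id
    if status_type is not None:
        params["type"] = status_type  # post, reply, mention, or reblog
    if since is not None:
        params["since"] = since  # ISO datetime string
    return f"{base_url}/api/statuses?{urlencode(params)}"


def fetch_page(**kwargs):
    """Fetch one page and return the decoded JSON as-is."""
    with urlopen(statuses_url(**kwargs), timeout=30) as response:
        return json.load(response)
```

For example, `fetch_page(status_type="reply", since="2024-01-01T00:00:00")` requests the first page of replies collected after that time.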
Database Schema
Main tables:
- monitored_accounts: accounts being tracked
- statuses: collected posts with plain text + HTML content
- mentions: who was @-mentioned in each status
- media_attachments: images/videos attached to statuses
- tags: hashtags used
- collection_logs: audit trail of each collection run
Each status stores raw_json with the full Mastodon API response for future analysis needs.
Moving to a Server
# Copy the project
scp -r mastodon-collector/ user@server:~/
# On the server
cd mastodon-collector
# Edit .env with production secrets
docker compose up -d
Stopping
docker compose down # Stop services, keep data
docker compose down -v # Stop services AND delete database
Research & Reporting
See ANALYSIS_REPORT.md for a complete methodology report including:
- Data collection statistics
- Toxicity analysis methodology
- Manual review results and findings
- False positive analysis
- Limitations and considerations
License
MIT License - see LICENSE file for details.