mastodon-collector/README.md
2026-04-18 17:34:13 +00:00

Mastodon Collector

Collects posts, replies, and mentions from a list of Mastodon accounts and stores them in PostgreSQL. It also provides LLM-based toxicity analysis, a web UI (account management, data browsing, and manual review of flagged content), and JSON/CSV APIs for your analysis pipeline.

Quick Start

# 1. Add accounts to monitor
echo "@user@mastodon.social" >> accounts.txt

# 2. Start everything
docker compose up -d

# 3. Open the dashboard
open http://localhost:8585        # macOS; on Linux use xdg-open

Architecture

| Service   | Description                             | Port |
| --------- | --------------------------------------- | ---- |
| db        | PostgreSQL 16                           | 5432 |
| web       | Flask dashboard (Gunicorn)              | 8585 |
| collector | Background service, polls every 4 hours | -    |

Adding Accounts

Two methods:

  1. Text file — edit accounts.txt, one handle per line (@user@instance.social). Picked up on next collection cycle.
  2. Web UI — go to http://localhost:8585/accounts and use the form.
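
As an illustration of the handle format above, here is a minimal sketch of how lines from accounts.txt could be validated and split into a username and an instance. The helper names (`parse_handle`, `read_handles`) and the exact character rules in the regular expression are assumptions made for this example; the collector's actual parsing may differ.

```python
import re

# Assumed pattern: an optional leading "@", a username, "@", then a domain.
# The README only specifies the "@user@instance.social" shape.
HANDLE_RE = re.compile(r"^@?([A-Za-z0-9_]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})$")


def parse_handle(line):
    """Return (username, instance) for one handle line, or None if malformed."""
    match = HANDLE_RE.match(line.strip())
    if match is None:
        return None
    username, instance = match.groups()
    # Domain names are case-insensitive, so normalize the instance part.
    return username, instance.lower()


def read_handles(text):
    """Parse accounts.txt contents, skipping blank, malformed, and duplicate lines."""
    handles = []
    for line in text.splitlines():
        parsed = parse_handle(line)
        if parsed is not None and parsed not in handles:
            handles.append(parsed)
    return handles
```

Running `read_handles` over the file contents before a collection cycle would surface typos early instead of failing silently during polling.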

Configuration

Edit .env to customize:

POSTGRES_PASSWORD=collector_secret      # Change for production
FLASK_SECRET_KEY=change-me-in-production
POLL_INTERVAL_SECONDS=14400             # Default: 4 hours (14400s)
OPENAI_API_KEY=sk-...                   # Required for toxicity analysis
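
For orientation, the following sketch shows one way a Python service could read two of these variables. The `Settings` class and `load_settings` function are hypothetical names for this example, and treating a missing `OPENAI_API_KEY` as "analysis disabled" is an assumption; only the 14400-second default comes from the configuration above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_POLL_INTERVAL_SECONDS = 14400  # 4 hours, the default documented above


@dataclass(frozen=True)
class Settings:
    poll_interval_seconds: int
    openai_api_key: Optional[str]

    @property
    def analysis_enabled(self) -> bool:
        # Assumption: toxicity analysis is skipped when no key is configured.
        return self.openai_api_key is not None


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from the given mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    interval = int(env.get("POLL_INTERVAL_SECONDS", DEFAULT_POLL_INTERVAL_SECONDS))
    if interval <= 0:
        raise ValueError("POLL_INTERVAL_SECONDS must be a positive integer")
    return Settings(
        poll_interval_seconds=interval,
        openai_api_key=env.get("OPENAI_API_KEY") or None,
    )
```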

Toxicity Analysis

The system combines automated toxicity detection with manual review.

Features

  • Automated Classification: Uses an LLM to analyze posts across 12 toxicity dimensions:
    • General toxicity, threats, hate speech
    • Racism, antisemitism, islamophobia
    • Sexism, homophobia, ableism
    • Insults, dehumanization, extremism
  • Flagging System: Posts with overall toxicity ≥ 0.5 are automatically flagged for review
  • Manual Review Interface: Web dashboard at /analysis/flagged for human validation
  • Analysis Dashboard: Statistics, trends, and category breakdowns at /analysis
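
The flagging rule above can be sketched in a few lines. This is an illustration only: the function names and the category keys used in the usage note are assumptions, and the only value taken from this README is the 0.5 threshold on the overall score.

```python
FLAG_THRESHOLD = 0.5  # from the README: overall toxicity >= 0.5 is flagged


def is_flagged(overall, threshold=FLAG_THRESHOLD):
    """Return True when the overall toxicity score meets the review threshold."""
    return overall >= threshold


def top_categories(scores, limit=3):
    """Return the highest-scoring (category, score) pairs as context for a reviewer."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]
```

For example, `is_flagged(0.5)` is true because the comparison is inclusive, and `top_categories({"insults": 0.8, "threats": 0.1, "sexism": 0.4}, limit=2)` returns the insults and sexism entries in that order.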

Running Analysis

# Analyze all unscored statuses (run inside collector container)
docker exec mastodon-collector-collector-1 bash -c "python -m app.analyzer"

# Limit to first 100 statuses for testing
docker exec mastodon-collector-collector-1 bash -c "ANALYZER_LIMIT=100 python -m app.analyzer"

Analysis Database Schema

Additional tables for toxicity analysis:

  • toxicity_scores — toxicity scores per status (12 categories + overall)
  • analysis_runs — audit trail of analysis runs with costs and duration

Cost Estimation

  • Batch processing: ~10 posts per API call
  • Estimated cost: ~$0.12 per 1,000 posts analyzed
  • Example: 16,906 posts ≈ $1.95
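
The arithmetic behind these figures can be reproduced with a small helper. The function name `estimate` is hypothetical, and the constants are the rounded values from the list above. With the rounded rate of $0.12, 16,906 posts come out at about $2.03, slightly above the $1.95 quoted in the example, which suggests the unrounded rate is closer to $0.115 per 1,000 posts.

```python
import math

POSTS_PER_CALL = 10         # approximate batch size from the README
COST_PER_1000_POSTS = 0.12  # rounded USD rate from the README


def estimate(posts):
    """Return (api_calls, cost_in_usd) for analyzing the given number of posts."""
    calls = math.ceil(posts / POSTS_PER_CALL)
    cost = posts / 1000 * COST_PER_1000_POSTS
    return calls, round(cost, 2)


if __name__ == "__main__":
    calls, cost = estimate(16906)
    print(f"{calls} API calls, about ${cost:.2f}")
```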

API Endpoints

For plugging into your analysis pipeline:

| Endpoint          | Description                     |
| ----------------- | ------------------------------- |
| GET /api/stats    | Overview stats (counts by type) |
| GET /api/statuses | Paginated statuses as JSON      |
| GET /export       | Download all statuses as CSV    |

/api/statuses parameters

  • page — page number (default: 1)
  • per_page — results per page (default: 100, max: 500)
  • account_id — filter by internal account ID
  • type — filter by status type: post, reply, mention, reblog
  • since — ISO datetime, only return statuses after this time
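
A minimal client sketch for the parameters above, using only the Python standard library. The function names are hypothetical, the base URL assumes the default local port, and the shape of the JSON response is not documented here, so `fetch_statuses` simply returns whatever the endpoint sends back.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8585"  # assumes the default local deployment
VALID_TYPES = {"post", "reply", "mention", "reblog"}


def statuses_url(page=1, per_page=100, account_id=None, status_type=None,
                 since=None, base_url=BASE_URL):
    """Build a /api/statuses URL, validating the documented limits client-side."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if not 1 <= per_page <= 500:
        raise ValueError("per_page must be between 1 and 500")
    if status_type is not None and status_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {sorted(VALID_TYPES)}")
    params = {"page": page, "per_page": per_page}
    if account_id is not None:
        params["account_id"] = account_id
    if status_type is not None:
        params["type"] = status_type
    if since is not None:
        params["since"] = since  # ISO datetime string
    return f"{base_url}/api/statuses?{urlencode(params)}"


def fetch_statuses(**kwargs):
    """Fetch one page and return the decoded JSON (response shape not assumed)."""
    with urlopen(statuses_url(**kwargs), timeout=30) as response:
        return json.load(response)
```

Calling `fetch_statuses(status_type="reply", since="2026-01-01T00:00:00")` requests the first page of replies collected after that time; incrementing `page` until the result is empty is one way to walk the full set.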

Database Schema

Main tables:

  • monitored_accounts — accounts being tracked
  • statuses — collected posts with plain text + HTML content
  • mentions — who was @-mentioned in each status
  • media_attachments — images/videos attached to statuses
  • tags — hashtags used
  • collection_logs — audit trail of each collection run

Each status stores raw_json with the full Mastodon API response for future analysis needs.
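
As a sketch of what the stored `raw_json` makes possible, the snippet below pulls a few fields out of a Mastodon status payload. The field names (`id`, `created_at`, `account.acct`, `in_reply_to_id`, `reblog`, `mentions`, `tags`) follow the Mastodon API's Status entity; the function name and the way it derives a status kind are assumptions for illustration and may not match how the collector classifies statuses.

```python
import json


def summarize_status(raw_json):
    """Extract a few commonly needed fields from a stored Mastodon status payload."""
    status = json.loads(raw_json)
    # Assumed classification: a boost carries a nested "reblog" object,
    # a reply carries a non-null "in_reply_to_id".
    if status.get("reblog") is not None:
        kind = "reblog"
    elif status.get("in_reply_to_id") is not None:
        kind = "reply"
    else:
        kind = "post"
    return {
        "id": status["id"],
        "created_at": status["created_at"],
        "author": status["account"]["acct"],
        "kind": kind,
        "mentioned": [m["acct"] for m in status.get("mentions", [])],
        "hashtags": [t["name"] for t in status.get("tags", [])],
    }
```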

Moving to a Server

# Copy the project
scp -r mastodon-collector/ user@server:~/

# On the server
cd mastodon-collector
# Edit .env with production secrets
docker compose up -d

Stopping

docker compose down          # Stop services, keep data
docker compose down -v       # Stop services AND delete database

Research & Reporting

See ANALYSIS_REPORT.md for a complete methodology report including:

  • Data collection statistics
  • Toxicity analysis methodology
  • Manual review results and findings
  • False positive analysis
  • Limitations and considerations

License

MIT License - see LICENSE file for details.