2026-04-18 17:34:13 +00:00

Mastodon Collector

Collects posts, replies, and mentions from a list of Mastodon accounts and stores them in PostgreSQL. It also provides LLM-based toxicity analysis; a web UI for account management, data browsing, and manual review of flagged content; and JSON and CSV endpoints for your own analysis pipeline.

Quick Start

# 1. Add accounts to monitor
echo "@user@mastodon.social" >> accounts.txt

# 2. Start everything
docker compose up -d

# 3. Open the dashboard (macOS; use xdg-open on Linux)
open http://localhost:8585

Architecture

| Service | Description | Port |
|---|---|---|
| db | PostgreSQL 16 | 5432 |
| web | Flask dashboard (Gunicorn) | 8585 |
| collector | Background service, polls every 4 hours | n/a |

Adding Accounts

Two methods:

Text file — edit accounts.txt, one handle per line (@user@instance.social). Picked up on next collection cycle.
Web UI — go to http://localhost:8585/accounts and use the form.
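
If you generate accounts.txt from another source, it can help to validate the handles first. The following is a minimal sketch, not the collector's own parser: the regular expression, the duplicate handling, and the skipping of `#` comment lines are all assumptions based only on the `@user@instance.social` format shown above.

```python
import re

# Assumed handle shape: @user@instance.tld, as in the README's example.
HANDLE_RE = re.compile(r"^@[A-Za-z0-9_]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def parse_handles(text):
    """Return unique, well-formed handles from accounts.txt-style text, in order."""
    seen = set()
    handles = []
    for line in text.splitlines():
        handle = line.strip()
        # Skipping blank lines and '#' comments is an assumption, not documented behavior.
        if not handle or handle.startswith("#"):
            continue
        if HANDLE_RE.match(handle) and handle.lower() not in seen:
            seen.add(handle.lower())
            handles.append(handle)
    return handles


sample = "@user@mastodon.social\n\n@user@mastodon.social\nnot-a-handle\n@dev@example.org\n"
print(parse_handles(sample))
```

Malformed lines are dropped silently here; a stricter variant could raise or log instead.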

Configuration

Edit .env to customize:

POSTGRES_PASSWORD=collector_secret      # Change for production
FLASK_SECRET_KEY=change-me-in-production
POLL_INTERVAL_SECONDS=14400             # Default: 4 hours (14400s)
OPENAI_API_KEY=sk-...                   # Required for toxicity analysis
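
For scripts that share this .env file, the settings above could be read as in the sketch below. How the collector itself parses these variables is not shown in this README, so the function name `load_settings` and the returned keys are hypothetical; only the variable names and the 14400-second default come from the configuration above.

```python
import os


def load_settings(env=None):
    """Read collector-style settings from an environment mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    interval = int(env.get("POLL_INTERVAL_SECONDS", "14400"))  # documented default: 4 hours
    if interval <= 0:
        raise ValueError("POLL_INTERVAL_SECONDS must be a positive number of seconds")
    return {
        "poll_interval_seconds": interval,
        "poll_interval_hours": interval / 3600,
        # Toxicity analysis needs an API key; treat a missing or empty key as disabled.
        "toxicity_enabled": bool(env.get("OPENAI_API_KEY")),
    }


print(load_settings({"POLL_INTERVAL_SECONDS": "7200"}))
```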

Toxicity Analysis

The system combines automated toxicity detection with manual review.

Features

Automated Classification: Uses an LLM to analyze posts across 12 toxicity dimensions:
- General toxicity, threats, hate speech
- Racism, antisemitism, islamophobia
- Sexism, homophobia, ableism
- Insults, dehumanization, extremism
Flagging System: Posts with overall toxicity ≥ 0.5 are automatically flagged for review
Manual Review Interface: Web dashboard at /analysis/flagged for human validation
Analysis Dashboard: Statistics, trends, and category breakdowns at /analysis
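
The flagging rule can be sketched in a few lines. The 0.5 threshold and the twelve dimensions come from the list above; the key names in `CATEGORIES` and both helper functions are hypothetical, and how the overall score is derived from the per-category scores is not specified here, so the sketch takes it as an input.

```python
FLAG_THRESHOLD = 0.5  # from the README: overall toxicity >= 0.5 is flagged

# Hypothetical key names for the 12 dimensions listed above.
CATEGORIES = (
    "toxicity", "threat", "hate_speech",
    "racism", "antisemitism", "islamophobia",
    "sexism", "homophobia", "ableism",
    "insult", "dehumanization", "extremism",
)


def needs_review(overall):
    """Return True when a post's overall score reaches the flagging threshold."""
    return overall >= FLAG_THRESHOLD


def top_categories(scores, limit=3):
    """Return the highest-scoring categories as (name, score) pairs."""
    ranked = sorted(CATEGORIES, key=lambda name: scores.get(name, 0.0), reverse=True)
    return [(name, scores.get(name, 0.0)) for name in ranked[:limit]]


example = {"insult": 0.8, "toxicity": 0.7, "threat": 0.1}
print(needs_review(0.7), top_categories(example, limit=2))
```

Note that the comparison is inclusive, so a post scoring exactly 0.5 is flagged.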

Running Analysis

# Analyze all unscored statuses (run inside collector container)
docker exec mastodon-collector-collector-1 bash -c "python -m app.analyzer"

# Limit to first 100 statuses for testing
docker exec mastodon-collector-collector-1 bash -c "ANALYZER_LIMIT=100 python -m app.analyzer"

Analysis Database Schema

Additional tables for toxicity analysis:

toxicity_scores — toxicity scores per status (12 categories + overall)
analysis_runs — audit trail of analysis runs with costs and duration

Cost Estimation

Batch processing: ~10 posts per API call
Estimated cost: ~$0.12 per 1,000 posts analyzed
Example: 16,906 posts ≈ $1.95
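
The arithmetic behind these figures can be reproduced with a short helper. This is only a back-of-the-envelope sketch using the rounded numbers above; actual cost depends on the model and on post length. At the rounded rate of $0.12 per 1,000 posts the example corpus comes out slightly above the $1.95 quoted, which corresponds to an unrounded rate of roughly $0.115 per 1,000.

```python
import math

POSTS_PER_CALL = 10         # approximate batch size quoted above
COST_PER_1000_POSTS = 0.12  # approximate USD rate quoted above


def estimate(posts):
    """Return (api_calls, estimated_cost_usd) for a given number of posts."""
    calls = math.ceil(posts / POSTS_PER_CALL)
    cost = posts / 1000 * COST_PER_1000_POSTS
    return calls, round(cost, 2)


calls, cost = estimate(16906)
print(f"{calls} API calls, about ${cost:.2f}")
```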

API Endpoints

For plugging into your analysis pipeline:

| Endpoint | Description |
|---|---|
| `GET /api/stats` | Overview stats (counts by type) |
| `GET /api/statuses` | Paginated statuses as JSON |
| `GET /export` | Download all statuses as CSV |

`/api/statuses` parameters

page — page number (default: 1)
per_page — results per page (default: 100, max: 500)
account_id — filter by internal account ID
type — filter by status type: post, reply, mention, reblog
since — ISO datetime, only return statuses after this time
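
A small client for these parameters might look like the sketch below. The parameter names, defaults, and limits come from the list above; the base URL assumes the local Quick Start setup, and the helper names are hypothetical. The JSON response shape is not documented here, so `fetch_page` simply returns whatever the endpoint sends back.

```python
import json
from datetime import datetime
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8585"  # assumes the default local setup
STATUS_TYPES = {"post", "reply", "mention", "reblog"}


def statuses_url(page=1, per_page=100, account_id=None, status_type=None, since=None):
    """Build a /api/statuses URL from the documented query parameters."""
    # Clamp client-side; how the server treats per_page > 500 is not documented.
    params = {"page": page, "per_page": max(1, min(per_page, 500))}
    if account_id is not None:
        params["account_id"] = account_id
    if status_type is not None:
        if status_type not in STATUS_TYPES:
            raise ValueError(f"unknown status type: {status_type!r}")
        params["type"] = status_type
    if since is not None:
        params["since"] = since.isoformat()  # since is a datetime
    return f"{BASE_URL}/api/statuses?{urlencode(params)}"


def fetch_page(**filters):
    """Fetch one page and return the parsed JSON body as-is."""
    with urlopen(statuses_url(**filters)) as response:
        return json.load(response)


print(statuses_url(per_page=1000, status_type="reply", since=datetime(2026, 4, 1)))
```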

Database Schema

Main tables:

monitored_accounts — accounts being tracked
statuses — collected posts with plain text + HTML content
mentions — who was @-mentioned in each status
media_attachments — images/videos attached to statuses
tags — hashtags used
collection_logs — audit trail of each collection run

Each status stores raw_json with the full Mastodon API response for future analysis needs.
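
As an illustration of what raw_json makes possible, the sketch below re-derives a coarse status type from a stored response. The `reblog` and `in_reply_to_id` fields are part of the Mastodon API's Status entity; the classification rule itself is an assumption and may differ from what the collector does. A mention cannot be identified from the status alone, since that depends on which monitored account it refers to.

```python
import json


def classify(raw_json):
    """Guess 'reblog', 'reply', or 'post' from a stored Mastodon status (assumed rule)."""
    status = json.loads(raw_json)
    if status.get("reblog") is not None:
        return "reblog"
    if status.get("in_reply_to_id") is not None:
        return "reply"
    return "post"


stored = json.dumps({"id": "1", "content": "<p>hi</p>", "in_reply_to_id": "42", "reblog": None})
print(classify(stored))
```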

Moving to a Server

# Copy the project
scp -r mastodon-collector/ user@server:~/

# On the server
cd mastodon-collector
# Edit .env with production secrets
docker compose up -d

Stopping

docker compose down          # Stop services, keep data
docker compose down -v       # Stop services AND delete database

Research & Reporting

See ANALYSIS_REPORT.md for a complete methodology report including:

Data collection statistics
Toxicity analysis methodology
Manual review results and findings
False positive analysis
Limitations and considerations

License

MIT License. See the LICENSE file for details.
