Pieter 870a0710b5 Update README with toxicity analysis features and add MIT license
- Document toxicity analysis capabilities and features
- Add configuration for OPENAI_API_KEY
- Include instructions for running analysis
- Add cost estimation and database schema info
- Link to ANALYSIS_REPORT.md for research findings
- Add MIT License

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-03-31 17:55:03 +02:00

Mastodon Collector

Collects posts, replies, and mentions from a list of Mastodon accounts and stores them in PostgreSQL. It also provides automated toxicity analysis with OpenAI GPT-4o-mini; a web UI for account management, data browsing, and manual review of flagged content; and JSON/CSV endpoints for your own analysis pipeline.

Quick Start

# 1. Add accounts to monitor
echo "@user@mastodon.social" >> accounts.txt

# 2. Start everything
docker compose up -d

# 3. Open the dashboard (`open` is macOS-only; on Linux use xdg-open or visit the URL in a browser)
open http://localhost:8585

Architecture

Service    Description                              Port
db         PostgreSQL 16                            5432
web        Flask dashboard (Gunicorn)               8585
collector  Background service, polls every 4 hours  -

Adding Accounts

Two methods:

  1. Text file — edit accounts.txt, one handle per line (@user@instance.social). Picked up on next collection cycle.
  2. Web UI — go to http://localhost:8585/accounts and use the form.
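The handle format shown above (@user@instance.social) can be checked before a line is added to accounts.txt. The following is a minimal sketch, not the collector's actual parser: `parse_handle`, `read_handles`, and the regular expression are hypothetical, and the real validation rules may be stricter or looser.

```python
import re
from typing import List, Optional, Tuple

# Assumed shape: "@user@instance.tld". The collector's real rules may differ.
HANDLE_RE = re.compile(r"^@?([A-Za-z0-9_]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})$")


def parse_handle(line: str) -> Optional[Tuple[str, str]]:
    """Return (username, instance) for a well-formed handle, else None."""
    match = HANDLE_RE.match(line.strip())
    if match is None:
        return None
    # Instance host names are case-insensitive, so normalize to lowercase.
    return match.group(1), match.group(2).lower()


def read_handles(text: str) -> List[Tuple[str, str]]:
    """Parse one handle per line, skipping blank and malformed lines."""
    handles = []
    for line in text.splitlines():
        parsed = parse_handle(line)
        if parsed is not None:
            handles.append(parsed)
    return handles
```

For example, `read_handles(open("accounts.txt").read())` would return the list of (username, instance) pairs that pass this check.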

Configuration

Edit .env to customize:

POSTGRES_PASSWORD=collector_secret      # Change for production
FLASK_SECRET_KEY=change-me-in-production
POLL_INTERVAL_SECONDS=14400             # Default: 4 hours (14400s)
OPENAI_API_KEY=sk-...                   # Required for toxicity analysis
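As an illustration of how these variables might be consumed, here is a small sketch that reads them from the environment and falls back to the values listed above. The `Settings` class and `load_settings` function are hypothetical and are not taken from the application's code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    postgres_password: str
    flask_secret_key: str
    poll_interval_seconds: int
    openai_api_key: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, using the README defaults."""
    poll = int(env.get("POLL_INTERVAL_SECONDS", "14400"))
    if poll <= 0:
        raise ValueError("POLL_INTERVAL_SECONDS must be a positive integer")
    return Settings(
        postgres_password=env.get("POSTGRES_PASSWORD", "collector_secret"),
        flask_secret_key=env.get("FLASK_SECRET_KEY", "change-me-in-production"),
        poll_interval_seconds=poll,
        # No default here: toxicity analysis needs a real key.
        openai_api_key=env.get("OPENAI_API_KEY") or None,
    )
```

Passing the mapping as an argument keeps the function easy to test without touching the real environment.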

Toxicity Analysis

The system includes automated toxicity detection and manual review capabilities:

Features

  • Automated Classification: Uses OpenAI GPT-4o-mini to analyze posts across 12 toxicity dimensions:
    • General toxicity, threats, hate speech
    • Racism, antisemitism, islamophobia
    • Sexism, homophobia, ableism
    • Insults, dehumanization, extremism
  • Flagging System: Posts with overall toxicity ≥ 0.5 are automatically flagged for review
  • Manual Review Interface: Web dashboard at /analysis/flagged for human validation
  • Analysis Dashboard: Statistics, trends, and category breakdowns at /analysis
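The flagging rule above can be sketched in a few lines. This is an illustration only: the dictionary keys ("overall", "insults", and so on) and the helper names are assumptions, since the README does not show how scores are stored.

```python
from typing import List, Mapping

FLAG_THRESHOLD = 0.5  # README: overall toxicity >= 0.5 is flagged for review


def is_flagged(scores: Mapping[str, float], threshold: float = FLAG_THRESHOLD) -> bool:
    """Return True when the overall score meets or exceeds the threshold."""
    return scores["overall"] >= threshold


def top_categories(scores: Mapping[str, float], n: int = 3) -> List[str]:
    """Return the n highest-scoring categories, excluding the overall score."""
    ranked = sorted(
        ((name, value) for name, value in scores.items() if name != "overall"),
        key=lambda item: item[1],
        reverse=True,
    )
    return [name for name, _ in ranked[:n]]
```

A reviewer-facing view could use `top_categories` to show why a flagged post scored high, though whether the dashboard does this is not stated here.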

Running Analysis

# Analyze all unscored statuses (executes inside the collector container; the container name assumes the default Compose project name)
docker exec mastodon-collector-collector-1 bash -c "python -m app.analyzer"

# Limit to first 100 statuses for testing
docker exec mastodon-collector-collector-1 bash -c "ANALYZER_LIMIT=100 python -m app.analyzer"

Analysis Database Schema

Additional tables for toxicity analysis:

  • toxicity_scores — toxicity scores per status (12 categories + overall)
  • analysis_runs — audit trail of analysis runs with costs and duration

Cost Estimation

  • Batch processing: ~10 posts per API call
  • Estimated cost: ~$0.12 per 1,000 posts analyzed
  • Example: 16,906 posts ≈ $1.95
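The figures above can be turned into a quick estimator. This is a rough sketch using only the rounded numbers from this section; the function names are hypothetical and actual costs depend on post length and current OpenAI pricing.

```python
import math

POSTS_PER_CALL = 10         # README: ~10 posts per API call
USD_PER_1000_POSTS = 0.12   # README: ~$0.12 per 1,000 posts (a rounded rate)


def estimate_calls(n_posts: int) -> int:
    """Number of API calls needed when posts are sent in batches."""
    return math.ceil(n_posts / POSTS_PER_CALL)


def estimate_cost_usd(n_posts: int) -> float:
    """Approximate cost in US dollars at the flat per-1,000-posts rate."""
    return round(n_posts / 1000 * USD_PER_1000_POSTS, 2)
```

For 16,906 posts this gives 1,691 calls and about $2.03. That is slightly above the ≈ $1.95 quoted above, which suggests the $0.12 rate is rounded up from roughly $0.115 per 1,000 posts.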

API Endpoints

For plugging into your analysis pipeline:

Endpoint           Description
GET /api/stats     Overview stats (counts by type)
GET /api/statuses  Paginated statuses as JSON
GET /export        Download all statuses as CSV

/api/statuses parameters

  • page — page number (default: 1)
  • per_page — results per page (default: 100, max: 500)
  • account_id — filter by internal account ID
  • type — filter by status type: post, reply, mention, reblog
  • since — ISO 8601 datetime; only statuses after this time are returned
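A client can assemble these parameters with the standard library. The sketch below only builds the request URL; `statuses_url` is a hypothetical helper, the base URL is taken from Quick Start, and the client-side checks simply mirror the limits listed above.

```python
from datetime import datetime
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:8585"  # dashboard address from Quick Start
VALID_TYPES = {"post", "reply", "mention", "reblog"}


def statuses_url(
    page: int = 1,
    per_page: int = 100,
    account_id: Optional[int] = None,
    status_type: Optional[str] = None,
    since: Optional[datetime] = None,
    base_url: str = BASE_URL,
) -> str:
    """Build a GET /api/statuses URL from the documented parameters."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if not 1 <= per_page <= 500:
        raise ValueError("per_page must be between 1 and 500")
    params = {"page": page, "per_page": per_page}
    if account_id is not None:
        params["account_id"] = account_id
    if status_type is not None:
        if status_type not in VALID_TYPES:
            raise ValueError(f"unknown status type: {status_type}")
        params["type"] = status_type
    if since is not None:
        # ISO 8601 string; urlencode percent-escapes ':' and '+'.
        params["since"] = since.isoformat()
    return f"{base_url}/api/statuses?{urlencode(params)}"
```

The resulting URL can be passed to `urllib.request.urlopen` and the body decoded with `json.loads`. The shape of the JSON response is not documented in this README, so inspect one page before relying on specific keys.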

Database Schema

Main tables:

  • monitored_accounts — accounts being tracked
  • statuses — collected posts with plain text + HTML content
  • mentions — who was @-mentioned in each status
  • media_attachments — images/videos attached to statuses
  • tags — hashtags used
  • collection_logs — audit trail of each collection run

Each status stores raw_json with the full Mastodon API response for future analysis needs.

Moving to a Server

# Copy the project
scp -r mastodon-collector/ user@server:~/

# On the server
cd mastodon-collector
# Edit .env with production secrets
docker compose up -d

Stopping

docker compose down          # Stop services, keep data
docker compose down -v       # Stop services AND delete database

Research & Reporting

See ANALYSIS_REPORT.md for a complete methodology report including:

  • Data collection statistics
  • Toxicity analysis methodology
  • Manual review results and findings
  • False positive analysis
  • Limitations and considerations

License

MIT License - see LICENSE file for details.