# Mastodon Collector
Collects posts, replies, and mentions from a list of Mastodon accounts and stores them in PostgreSQL. Includes automated toxicity analysis using LLMs, a web UI for account management, data browsing, and manual review of flagged content, plus JSON/CSV APIs for your analysis pipeline.
## Quick Start
```bash
# 1. Add accounts to monitor
echo "@user@mastodon.social" >> accounts.txt
# 2. Start everything
docker compose up -d
# 3. Open the dashboard (macOS; on Linux use xdg-open)
open http://localhost:8585
```
## Architecture
| Service | Description | Port |
|---------------|------------------------------------------------|-------|
| **db** | PostgreSQL 16 | 5432 |
| **web** | Flask dashboard (Gunicorn) | 8585 |
| **collector** | Background service, polls every 4 hours | — |
## Adding Accounts
Two methods:
1. **Text file** — edit `accounts.txt`, one handle per line (`@user@instance.social`). Picked up on next collection cycle.
2. **Web UI** — go to http://localhost:8585/accounts and use the form.
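
The handle format above can be checked before a line is added to `accounts.txt`. The following is a minimal sketch, not the collector's own parser: the regular expression and the `parse_handles` helper are assumptions made for illustration, and real Mastodon usernames or domains may allow characters this pattern rejects.

```python
import re

# Assumed handle shape: @user@instance.domain (hypothetical rule,
# not taken from the collector's source code).
HANDLE_RE = re.compile(r"^@?([A-Za-z0-9_]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})$")


def parse_handles(text):
    """Return (user, instance) tuples for each line that looks like a handle."""
    handles = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = HANDLE_RE.match(line)
        if match:
            # Instance domains are case-insensitive, so normalize them.
            handles.append((match.group(1), match.group(2).lower()))
    return handles


print(parse_handles("@user@mastodon.social\nnot-a-handle\n"))
```

Lines that do not match are silently dropped in this sketch; a stricter version could collect them and report them instead.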
## Configuration
Edit `.env` to customize:
```
POSTGRES_PASSWORD=collector_secret # Change for production
FLASK_SECRET_KEY=change-me-in-production
POLL_INTERVAL_SECONDS=14400 # Default: 4 hours (14400s)
OPENAI_API_KEY=sk-... # Required for toxicity analysis
```
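
As a rough illustration of how such settings might be consumed on the Python side, the sketch below reads two of the variables from an environment mapping and applies the documented 4-hour default. The `load_settings` function and the returned dictionary keys are hypothetical; the collector's actual configuration code may look different.

```python
import os


def load_settings(env=os.environ):
    """Read settings from an environment mapping (hypothetical helper)."""
    return {
        # Documented default: 14400 seconds, i.e. 4 hours.
        "poll_interval_seconds": int(env.get("POLL_INTERVAL_SECONDS", "14400")),
        # None when unset; the README says the key is required for toxicity analysis.
        "openai_api_key": env.get("OPENAI_API_KEY"),
    }


settings = load_settings({"POLL_INTERVAL_SECONDS": "3600"})
print(settings["poll_interval_seconds"] / 3600, "hour(s) between polls")
```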
## Toxicity Analysis
The system combines automated toxicity detection with a manual review workflow.
### Features
- **Automated Classification**: Uses an LLM to analyze posts across 12 toxicity dimensions:
  - General toxicity, threats, hate speech
  - Racism, antisemitism, islamophobia
  - Sexism, homophobia, ableism
  - Insults, dehumanization, extremism
- **Flagging System**: Posts with overall toxicity ≥ 0.5 are automatically flagged for review
- **Manual Review Interface**: Web dashboard at `/analysis/flagged` for human validation
- **Analysis Dashboard**: Statistics, trends, and category breakdowns at `/analysis`
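
The flagging rule described above reduces to a threshold comparison on the overall score. A minimal sketch, assuming scores are floats in the range 0 to 1; the status IDs and score values below are invented for illustration and `is_flagged` is a hypothetical name:

```python
FLAG_THRESHOLD = 0.5  # from the README: overall toxicity >= 0.5 is flagged


def is_flagged(overall_score, threshold=FLAG_THRESHOLD):
    """Return True when a status should go to manual review."""
    return overall_score >= threshold


# Hypothetical status IDs mapped to overall toxicity scores.
scores = {101: 0.12, 102: 0.5, 103: 0.87}
flagged_ids = sorted(sid for sid, score in scores.items() if is_flagged(score))
print(flagged_ids)
```

Note that the comparison is inclusive, so a score of exactly 0.5 is flagged.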
### Running Analysis
```bash
# Analyze all unscored statuses (the command runs inside the collector container)
docker exec mastodon-collector-collector-1 bash -c "python -m app.analyzer"
# Limit to first 100 statuses for testing
docker exec mastodon-collector-collector-1 bash -c "ANALYZER_LIMIT=100 python -m app.analyzer"
```
### Analysis Database Schema
Additional tables for toxicity analysis:
- `toxicity_scores` — toxicity scores per status (12 categories + overall)
- `analysis_runs` — audit trail of analysis runs with costs and duration
### Cost Estimation
- Batch processing: ~10 posts per API call
- Estimated cost: ~$0.12 per 1,000 posts analyzed
- Example: 16,906 posts ≈ $1.95
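
The figures above can be turned into a quick estimate. This sketch uses only the two numbers stated in the list; the `estimate` function is a hypothetical helper. Because the per-1,000 rate is rounded, it yields a slightly higher total than the $1.95 example, which corresponds to an effective rate of roughly $0.115 per 1,000 posts.

```python
import math

POSTS_PER_CALL = 10         # README: ~10 posts per API call
COST_PER_1000_POSTS = 0.12  # README: ~$0.12 per 1,000 posts (rounded)


def estimate(posts, cost_per_1000=COST_PER_1000_POSTS, batch=POSTS_PER_CALL):
    """Return (number of API calls, estimated cost in USD)."""
    calls = math.ceil(posts / batch)
    cost = posts / 1000 * cost_per_1000
    return calls, round(cost, 2)


calls, cost = estimate(16906)
# Effective per-1,000 rate implied by the README's $1.95 example.
implied_rate = round(1.95 / 16906 * 1000, 4)
print(calls, cost, implied_rate)
```

Actual cost depends on post length and the model's pricing, so treat both numbers as ballpark values.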
## API Endpoints
For plugging into your analysis pipeline:

| Endpoint | Description |
|-----------------------|--------------------------------------|
| `GET /api/stats` | Overview stats (counts by type) |
| `GET /api/statuses` | Paginated statuses as JSON |
| `GET /export` | Download all statuses as CSV |
### `/api/statuses` parameters
- `page` — page number (default: 1)
- `per_page` — results per page (default: 100, max: 500)
- `account_id` — filter by internal account ID
- `type` — filter by status type: `post`, `reply`, `mention`, `reblog`
- `since` — ISO datetime, only return statuses after this time
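
A request URL using these parameters could be assembled as follows. This is a sketch under stated assumptions: the base URL is the local dashboard address from Quick Start, `statuses_url` is a hypothetical helper, and the shape of the JSON response is not documented here, so the example stops at building the URL.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8585"  # dashboard address from Quick Start


def statuses_url(page=1, per_page=100, account_id=None, status_type=None, since=None):
    """Build a /api/statuses URL from the documented query parameters."""
    per_page = max(1, min(per_page, 500))  # documented maximum is 500
    params = {"page": page, "per_page": per_page}
    if account_id is not None:
        params["account_id"] = account_id
    if status_type is not None:
        params["type"] = status_type  # post, reply, mention, or reblog
    if since is not None:
        params["since"] = since  # ISO datetime string
    return f"{BASE_URL}/api/statuses?{urlencode(params)}"


print(statuses_url(page=2, status_type="reply", since="2026-01-01T00:00:00"))
```

The resulting URL can be fetched with any HTTP client, for example `urllib.request.urlopen`, and the body decoded with `json.load`.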
## Database Schema
Main tables:
- `monitored_accounts` — accounts being tracked
- `statuses` — collected posts with plain text + HTML content
- `mentions` — who was @-mentioned in each status
- `media_attachments` — images/videos attached to statuses
- `tags` — hashtags used
- `collection_logs` — audit trail of each collection run

Each status stores `raw_json` with the full Mastodon API response for future analysis needs.
## Moving to a Server
```bash
# Copy the project
scp -r mastodon-collector/ user@server:~/
# On the server
cd mastodon-collector
# Edit .env with production secrets
docker compose up -d
```
## Stopping
```bash
docker compose down # Stop services, keep data
docker compose down -v # Stop services AND delete database
```
## Research & Reporting
See [ANALYSIS_REPORT.md](ANALYSIS_REPORT.md) for a complete methodology report including:
- Data collection statistics
- Toxicity analysis methodology
- Manual review results and findings
- False positive analysis
- Limitations and considerations
## License
MIT License. See the [LICENSE](LICENSE) file for details.