# Mastodon Collector

Collects posts, replies, and mentions from a list of Mastodon accounts and stores them in PostgreSQL. Includes automated toxicity analysis using OpenAI GPT-4o-mini, a web UI for account management, data browsing, and manual review of flagged content, plus JSON/CSV APIs for your analysis pipeline.

## Quick Start

```bash
# 1. Add accounts to monitor
echo "@user@mastodon.social" >> accounts.txt

# 2. Start everything
docker compose up -d

# 3. Open the dashboard (macOS; on Linux use xdg-open or paste the URL into a browser)
open http://localhost:8585
```

## Architecture

| Service       | Description                                    | Port  |
|---------------|------------------------------------------------|-------|
| **db**        | PostgreSQL 16                                  | 5432  |
| **web**       | Flask dashboard (Gunicorn)                     | 8585  |
| **collector** | Background service, polls every 4 hours        | -     |

## Adding Accounts

Two methods:

1. **Text file**: edit `accounts.txt`, one handle per line (`@user@instance.social`). Changes are picked up on the next collection cycle.
2. **Web UI**: go to http://localhost:8585/accounts and use the form.

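The handle format above can be checked before a line goes into `accounts.txt`. The following is a minimal sketch, not the collector's own parser: `parse_handle` and the regular expression are hypothetical, and the real service may accept other forms.

```python
import re
from typing import Optional, Tuple

# Assumed handle shape: @user@instance.domain.
# This pattern is an illustration, not taken from the collector's code.
HANDLE_RE = re.compile(r"^@?([A-Za-z0-9_]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})$")


def parse_handle(line: str) -> Optional[Tuple[str, str]]:
    """Return (username, instance) for a well-formed handle, else None."""
    match = HANDLE_RE.match(line.strip())
    if match is None:
        return None
    # Instance hostnames are case-insensitive, so normalize them to lowercase.
    return match.group(1), match.group(2).lower()
```

For example, `parse_handle("@user@mastodon.social")` yields `("user", "mastodon.social")`, while a line without an instance part yields `None`.
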
## Configuration

Edit `.env` to customize:

```
POSTGRES_PASSWORD=collector_secret     # Change for production
FLASK_SECRET_KEY=change-me-in-production
POLL_INTERVAL_SECONDS=14400            # Default: 4 hours (14400s)
OPENAI_API_KEY=sk-...                  # Required for toxicity analysis
```

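As an illustration of how a service could consume `POLL_INTERVAL_SECONDS`, here is a small sketch that falls back to the documented 4-hour default when the variable is missing or invalid. The function name and the fallback behavior are assumptions; the collector's actual handling may differ.

```python
import os
from typing import Mapping

DEFAULT_POLL_INTERVAL_SECONDS = 14400  # 4 hours, the documented default


def poll_interval_seconds(env: Mapping[str, str] = os.environ) -> int:
    """Read POLL_INTERVAL_SECONDS, falling back to the default on bad input."""
    raw = env.get("POLL_INTERVAL_SECONDS", "")
    try:
        value = int(raw)
    except ValueError:
        # Missing or non-numeric value: use the default (assumed behavior).
        return DEFAULT_POLL_INTERVAL_SECONDS
    # Reject zero and negative intervals as well.
    return value if value > 0 else DEFAULT_POLL_INTERVAL_SECONDS
```

Passing the environment as a parameter keeps the function easy to test without touching real environment variables.
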
## Toxicity Analysis

The system includes automated toxicity detection and manual review capabilities:

### Features

- **Automated Classification**: Uses OpenAI GPT-4o-mini to analyze posts across 12 toxicity dimensions:
  - General toxicity, threats, hate speech
  - Racism, antisemitism, islamophobia
  - Sexism, homophobia, ableism
  - Insults, dehumanization, extremism
- **Flagging System**: Posts with overall toxicity ≥ 0.5 are automatically flagged for review
- **Manual Review Interface**: Web dashboard at `/analysis/flagged` for human validation
- **Analysis Dashboard**: Statistics, trends, and category breakdowns at `/analysis`

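The flagging rule above amounts to a threshold check on the overall score. A minimal sketch follows, assuming scores are floats in the range 0 to 1; the function names and the category keys used in the example are hypothetical and not taken from `app.analyzer`.

```python
from typing import Dict, List

FLAG_THRESHOLD = 0.5  # documented cutoff for the overall toxicity score


def is_flagged(overall: float, threshold: float = FLAG_THRESHOLD) -> bool:
    """A post is flagged when its overall score reaches the threshold."""
    return overall >= threshold


def top_categories(scores: Dict[str, float], n: int = 3) -> List[str]:
    """Names of the n highest-scoring categories, highest first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]
```

A reviewer-facing view could, for instance, show `top_categories(scores)` next to each flagged post to explain why it was flagged.
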
### Running Analysis

```bash
# Analyze all unscored statuses (run inside collector container)
docker exec mastodon-collector-collector-1 bash -c "python -m app.analyzer"

# Limit to first 100 statuses for testing
docker exec mastodon-collector-collector-1 bash -c "ANALYZER_LIMIT=100 python -m app.analyzer"
```

### Analysis Database Schema

Additional tables for toxicity analysis:

- `toxicity_scores`: toxicity scores per status (12 categories + overall)
- `analysis_runs`: audit trail of analysis runs with costs and duration

### Cost Estimation

- Batch processing: ~10 posts per API call
- Estimated cost: ~$0.12 per 1,000 posts analyzed
- Example: 16,906 posts ≈ $1.95

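The figures above can be turned into a rough budget calculation. This sketch uses the approximate batch size and per-1,000 rate as constants; `estimate` is a hypothetical helper, and real costs depend on post length and current API pricing.

```python
import math
from typing import Tuple

POSTS_PER_CALL = 10         # approximate batch size quoted above
COST_PER_1000_POSTS = 0.12  # approximate USD rate quoted above


def estimate(posts: int) -> Tuple[int, float]:
    """Return (API calls, estimated cost in USD) for a number of posts."""
    calls = math.ceil(posts / POSTS_PER_CALL)
    cost = posts / 1000 * COST_PER_1000_POSTS
    return calls, round(cost, 2)
```

With the rounded rate, 16,906 posts come out at about 1,691 calls and $2.03. The $1.95 example corresponds to a slightly lower effective rate of roughly $0.115 per 1,000 posts, so treat either number as a ballpark.
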
## API Endpoints

For plugging into your analysis pipeline:

| Endpoint              | Description                          |
|-----------------------|--------------------------------------|
| `GET /api/stats`      | Overview stats (counts by type)      |
| `GET /api/statuses`   | Paginated statuses as JSON           |
| `GET /export`         | Download all statuses as CSV         |

### `/api/statuses` parameters

- `page`: page number (default: 1)
- `per_page`: results per page (default: 100, max: 500)
- `account_id`: filter by internal account ID
- `type`: filter by status type: `post`, `reply`, `mention`, `reblog`
- `since`: ISO datetime, only return statuses after this time

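To show how these parameters combine, here is a sketch of a client-side URL builder using only the standard library. `statuses_url` is a hypothetical helper, the base URL assumes the default local setup, and the client-side validation mirrors the documented limits rather than the server's actual checks.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:8585"  # default dashboard address (assumed local setup)
VALID_TYPES = {"post", "reply", "mention", "reblog"}


def statuses_url(
    page: int = 1,
    per_page: int = 100,
    account_id: Optional[int] = None,
    status_type: Optional[str] = None,
    since: Optional[str] = None,
    base_url: str = BASE_URL,
) -> str:
    """Build a GET /api/statuses URL from the documented parameters."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if not 1 <= per_page <= 500:
        raise ValueError("per_page must be between 1 and 500")
    params = {"page": page, "per_page": per_page}
    if account_id is not None:
        params["account_id"] = account_id
    if status_type is not None:
        if status_type not in VALID_TYPES:
            raise ValueError(f"unknown status type: {status_type}")
        params["type"] = status_type
    if since is not None:
        params["since"] = since  # ISO datetime string, percent-encoded below
    return f"{base_url}/api/statuses?{urlencode(params)}"
```

The resulting URL can be fetched with `urllib.request.urlopen` or any HTTP client. The shape of the JSON response is not documented here, so inspect a sample response before relying on specific fields.
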
## Database Schema

Main tables:

- `monitored_accounts`: accounts being tracked
- `statuses`: collected posts with plain text + HTML content
- `mentions`: who was @-mentioned in each status
- `media_attachments`: images/videos attached to statuses
- `tags`: hashtags used
- `collection_logs`: audit trail of each collection run

Each status stores `raw_json` with the full Mastodon API response for future analysis needs.

## Moving to a Server

```bash
# Copy the project
scp -r mastodon-collector/ user@server:~/

# On the server
cd mastodon-collector
# Edit .env with production secrets
docker compose up -d
```

## Stopping

```bash
docker compose down      # Stop services, keep data
docker compose down -v   # Stop services AND delete database
```

## Research & Reporting

See [ANALYSIS_REPORT.md](ANALYSIS_REPORT.md) for a complete methodology report including:

- Data collection statistics
- Toxicity analysis methodology
- Manual review results and findings
- False positive analysis
- Limitations and considerations

## License

MIT License - see [LICENSE](LICENSE) file for details.