diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..213b8a9
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Pieter Steenman
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
index 8b8a56c..a3a6b6a 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # Mastodon Collector
 
-Collects posts, replies, and mentions from a list of Mastodon accounts and stores them in PostgreSQL. Includes a web UI for account management and data browsing, plus JSON/CSV APIs for your analysis pipeline.
+Collects posts, replies, and mentions from a list of Mastodon accounts and stores them in PostgreSQL. Includes automated toxicity analysis using OpenAI GPT-4o-mini, a web UI for account management, data browsing, and manual review of flagged content, plus JSON/CSV APIs for your analysis pipeline.
 
 ## Quick Start
 
@@ -38,8 +38,47 @@ Edit `.env` to customize:
 POSTGRES_PASSWORD=collector_secret # Change for production
 FLASK_SECRET_KEY=change-me-in-production
 POLL_INTERVAL_SECONDS=14400 # Default: 4 hours (14400s)
+OPENAI_API_KEY=sk-... # Required for toxicity analysis
 ```
 
+## Toxicity Analysis
+
+The system includes automated toxicity detection and manual review capabilities:
+
+### Features
+
+- **Automated Classification**: Uses OpenAI GPT-4o-mini to analyze posts across 12 toxicity dimensions:
+  - General toxicity, threats, hate speech
+  - Racism, antisemitism, islamophobia
+  - Sexism, homophobia, ableism
+  - Insults, dehumanization, extremism
+- **Flagging System**: Posts with overall toxicity ≥ 0.5 are automatically flagged for review
+- **Manual Review Interface**: Web dashboard at `/analysis/flagged` for human validation
+- **Analysis Dashboard**: Statistics, trends, and category breakdowns at `/analysis`
+
+### Running Analysis
+
+```bash
+# Analyze all unscored statuses (run inside collector container)
+docker exec mastodon-collector-collector-1 bash -c "python -m app.analyzer"
+
+# Limit to first 100 statuses for testing
+docker exec mastodon-collector-collector-1 bash -c "ANALYZER_LIMIT=100 python -m app.analyzer"
+```
+
+### Analysis Database Schema
+
+Additional tables for toxicity analysis:
+
+- `toxicity_scores` — toxicity scores per status (12 categories + overall)
+- `analysis_runs` — audit trail of analysis runs with costs and duration
+
+### Cost Estimation
+
+- Batch processing: ~10 posts per API call
+- Estimated cost: ~$0.12 per 1,000 posts analyzed
+- Example: 16,906 posts ≈ $1.95
+
 ## API Endpoints
 
 For plugging into your analysis pipeline:
@@ -89,3 +128,16 @@ docker compose up -d
 docker compose down # Stop services, keep data
 docker compose down -v # Stop services AND delete database
 ```
+
+## Research & Reporting
+
+See [ANALYSIS_REPORT.md](ANALYSIS_REPORT.md) for a complete methodology report including:
+- Data collection statistics
+- Toxicity analysis methodology
+- Manual review results and findings
+- False positive analysis
+- Limitations and considerations
+
+## License
+
+MIT License - see [LICENSE](LICENSE) file for details.
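
Review note: the flagging rule and the cost figures added to the README can be sanity-checked with a few lines of Python. This is a minimal sketch, not the project's analyzer code. The constants are the values stated in the README (threshold 0.5, about 10 posts per API call, about $0.12 per 1,000 posts); the function names `is_flagged` and `estimate_run` are hypothetical and do not come from `app.analyzer`.

```python
import math
from typing import Tuple

# Values quoted in the README; treat them as approximate.
FLAG_THRESHOLD = 0.5         # overall toxicity >= 0.5 is flagged for review
POSTS_PER_CALL = 10          # roughly 10 posts per API call
COST_PER_1000_POSTS = 0.12   # roughly $0.12 per 1,000 posts analyzed


def is_flagged(overall_score: float) -> bool:
    """Return True when a post's overall score meets the review threshold."""
    return overall_score >= FLAG_THRESHOLD


def estimate_run(post_count: int) -> Tuple[int, float]:
    """Estimate (API calls, cost in USD) for analyzing post_count posts."""
    calls = math.ceil(post_count / POSTS_PER_CALL)
    cost = post_count / 1000 * COST_PER_1000_POSTS
    return calls, round(cost, 2)


if __name__ == "__main__":
    calls, cost = estimate_run(16906)
    print(f"{calls} API calls, about ${cost:.2f}")
```

With the rounded rate, 16,906 posts come out slightly above the $1.95 quoted in the README example, which suggests the per-1,000 figure was rounded up from roughly $0.115. It may be worth stating the unrounded rate or adjusting the example so the two numbers agree.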