Update README with toxicity analysis features and add MIT license

- Document toxicity analysis capabilities and features
- Add configuration for OPENAI_API_KEY
- Include instructions for running analysis
- Add cost estimation and database schema info
- Link to ANALYSIS_REPORT.md for research findings
- Add MIT License

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Pieter 2026-03-31 17:55:03 +02:00
parent 2faf6c660b
commit 870a0710b5
2 changed files with 74 additions and 1 deletion

LICENSE Normal file

@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Pieter Steenman

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


@@ -1,6 +1,6 @@
# Mastodon Collector
Collects posts, replies, and mentions from a list of Mastodon accounts and stores them in PostgreSQL. Includes a web UI for account management and data browsing, plus JSON/CSV APIs for your analysis pipeline.
Collects posts, replies, and mentions from a list of Mastodon accounts and stores them in PostgreSQL. Includes automated toxicity analysis using OpenAI GPT-4o-mini; a web UI for account management, data browsing, and manual review of flagged content; and JSON/CSV APIs for your analysis pipeline.
## Quick Start
@@ -38,8 +38,47 @@ Edit `.env` to customize:
POSTGRES_PASSWORD=collector_secret # Change for production
FLASK_SECRET_KEY=change-me-in-production
POLL_INTERVAL_SECONDS=14400 # Default: 4 hours (14400s)
OPENAI_API_KEY=sk-... # Required for toxicity analysis
```
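Because toxicity analysis depends on `OPENAI_API_KEY`, it can help to check the environment before starting a run. The following is a minimal sketch, not part of the project: `missing_settings` is a hypothetical helper, and the assumption is that the analyzer reads the key from the process environment.

```python
import os

# Settings the README marks as required for toxicity analysis.
REQUIRED_SETTINGS = ("OPENAI_API_KEY",)


def missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the names of required settings that are absent or empty in env."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings(os.environ)
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```

Passing the mapping in as an argument keeps the check easy to test with a plain dictionary instead of the real environment.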
## Toxicity Analysis
The system includes automated toxicity detection and manual review capabilities:
### Features
- **Automated Classification**: Uses OpenAI GPT-4o-mini to analyze posts across 12 toxicity dimensions:
  - General toxicity, threats, hate speech
  - Racism, antisemitism, islamophobia
  - Sexism, homophobia, ableism
  - Insults, dehumanization, extremism
- **Flagging System**: Posts with overall toxicity ≥ 0.5 are automatically flagged for review
- **Manual Review Interface**: Web dashboard at `/analysis/flagged` for human validation
- **Analysis Dashboard**: Statistics, trends, and category breakdowns at `/analysis`
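The flagging rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: `is_flagged` and the sample status IDs and scores are made up; only the 0.5 threshold comes from the feature list.

```python
# Threshold from the feature list: overall toxicity >= 0.5 is flagged for review.
FLAG_THRESHOLD = 0.5


def is_flagged(overall, threshold=FLAG_THRESHOLD):
    """Return True when a post's overall toxicity score meets the threshold."""
    return overall >= threshold


# Hypothetical scores keyed by status ID (illustrative values only).
scores = {"status_a": 0.12, "status_b": 0.5, "status_c": 0.91}
flagged = [status_id for status_id, score in scores.items() if is_flagged(score)]
print(flagged)
```

Note that the comparison is inclusive, so a score of exactly 0.5 is flagged.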
### Running Analysis
```bash
# Analyze all unscored statuses (run inside collector container)
docker exec mastodon-collector-collector-1 bash -c "python -m app.analyzer"
# Limit to first 100 statuses for testing
docker exec mastodon-collector-collector-1 bash -c "ANALYZER_LIMIT=100 python -m app.analyzer"
```
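The second command passes `ANALYZER_LIMIT` as an environment variable. How the analyzer reads it is not shown here; a minimal sketch of one plausible approach, with `read_limit` as a hypothetical helper and "unset means no limit" as an assumption:

```python
import os


def read_limit(env=None, name="ANALYZER_LIMIT"):
    """Parse an optional positive integer limit; return None when unset."""
    env = os.environ if env is None else env
    raw = env.get(name, "").strip()
    if not raw:
        return None  # assumption: no variable means "analyze everything"
    limit = int(raw)  # raises ValueError for non-numeric input
    if limit <= 0:
        raise ValueError(f"{name} must be a positive integer, got {limit}")
    return limit


if __name__ == "__main__":
    print("limit:", read_limit())
```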
### Analysis Database Schema
Additional tables for toxicity analysis:
- `toxicity_scores` — toxicity scores per status (12 categories + overall)
- `analysis_runs` — audit trail of analysis runs with costs and duration
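To illustrate the "12 categories + overall" shape of a score record, here is a hedged sketch of a row validator. The key names are assumptions derived from the category list in this README, not the actual column names, and the [0, 1] score range is inferred from the 0.5 flagging threshold rather than stated anywhere.

```python
# Hypothetical key names, one per toxicity dimension listed in the README.
CATEGORIES = [
    "toxicity", "threats", "hate_speech",
    "racism", "antisemitism", "islamophobia",
    "sexism", "homophobia", "ableism",
    "insults", "dehumanization", "extremism",
]


def validate_score_row(row):
    """Return a list of problems found in one score record (empty if valid)."""
    problems = []
    for key in CATEGORIES + ["overall"]:
        value = row.get(key)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{key}: missing or non-numeric")
        elif not 0.0 <= value <= 1.0:
            problems.append(f"{key}: {value} outside [0, 1]")
    return problems
```

A check like this is useful when parsing model output, since a missing or out-of-range category is easier to catch before the row is stored.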
### Cost Estimation
- Batch processing: ~10 posts per API call
- Estimated cost: ~$0.12 per 1,000 posts analyzed
- Example: 16,906 posts ≈ $1.95
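The figures above reduce to simple arithmetic. A minimal sketch, assuming the rounded values from this section (about 10 posts per API call, about $0.12 per 1,000 posts); `estimate_cost` is a hypothetical helper name:

```python
import math

POSTS_PER_CALL = 10          # README: ~10 posts per API call
COST_PER_1000_POSTS = 0.12   # README: ~$0.12 per 1,000 posts (rounded)


def estimate_cost(posts):
    """Return (api_calls, estimated_usd) for a given number of posts."""
    calls = math.ceil(posts / POSTS_PER_CALL)
    cost = posts / 1000 * COST_PER_1000_POSTS
    return calls, round(cost, 2)


calls, usd = estimate_cost(16906)
print(f"{calls} API calls, about ${usd:.2f}")
```

With the rounded rate, 16,906 posts come out near $2.03 rather than the $1.95 quoted above, which suggests the quoted example used a slightly lower unrounded rate (roughly $0.115 per 1,000 posts). Treat both numbers as approximate.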
## API Endpoints
For plugging into your analysis pipeline:
@@ -89,3 +128,16 @@ docker compose up -d
docker compose down # Stop services, keep data
docker compose down -v # Stop services AND delete database
```
## Research & Reporting
See [ANALYSIS_REPORT.md](ANALYSIS_REPORT.md) for a complete methodology report including:
- Data collection statistics
- Toxicity analysis methodology
- Manual review results and findings
- False positive analysis
- Limitations and considerations
## License
MIT License. See the [LICENSE](LICENSE) file for details.