Update README with toxicity analysis features and add MIT license

- Document toxicity analysis capabilities and features
- Add configuration for OPENAI_API_KEY
- Include instructions for running analysis
- Add cost estimation and database schema info
- Link to ANALYSIS_REPORT.md for research findings
- Add MIT License

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
parent 2faf6c660b
commit 870a0710b5
2 changed files with 74 additions and 1 deletion

LICENSE (Normal file, 21 lines changed)

@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Pieter Steenman

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md (54 lines changed)

@@ -1,6 +1,6 @@
# Mastodon Collector

-Collects posts, replies, and mentions from a list of Mastodon accounts and stores them in PostgreSQL. Includes a web UI for account management and data browsing, plus JSON/CSV APIs for your analysis pipeline.
+Collects posts, replies, and mentions from a list of Mastodon accounts and stores them in PostgreSQL. Includes automated toxicity analysis using OpenAI GPT-4o-mini, a web UI for account management, data browsing, and manual review of flagged content, plus JSON/CSV APIs for your analysis pipeline.

## Quick Start

@@ -38,8 +38,47 @@ Edit `.env` to customize:
POSTGRES_PASSWORD=collector_secret # Change for production
FLASK_SECRET_KEY=change-me-in-production
POLL_INTERVAL_SECONDS=14400 # Default: 4 hours (14400s)
OPENAI_API_KEY=sk-... # Required for toxicity analysis
```

## Toxicity Analysis

The system includes automated toxicity detection and manual review capabilities:

### Features

- **Automated Classification**: Uses OpenAI GPT-4o-mini to analyze posts across 12 toxicity dimensions:
  - General toxicity, threats, hate speech
  - Racism, antisemitism, islamophobia
  - Sexism, homophobia, ableism
  - Insults, dehumanization, extremism
- **Flagging System**: Posts with overall toxicity ≥ 0.5 are automatically flagged for review
- **Manual Review Interface**: Web dashboard at `/analysis/flagged` for human validation
- **Analysis Dashboard**: Statistics, trends, and category breakdowns at `/analysis`

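As an illustration only (this block is not part of the commit), the flagging rule described under Features could be sketched as follows. The 0.5 threshold and the twelve dimensions come from the README text; the snake_case category keys and both function names are assumptions made for this sketch, not the project's actual identifiers.

```python
# Threshold stated in the README: overall toxicity >= 0.5 is flagged for review.
FLAG_THRESHOLD = 0.5

# The 12 dimensions listed in the README; these key names are assumed.
CATEGORIES = (
    "general_toxicity", "threats", "hate_speech",
    "racism", "antisemitism", "islamophobia",
    "sexism", "homophobia", "ableism",
    "insults", "dehumanization", "extremism",
)


def is_flagged(overall: float, threshold: float = FLAG_THRESHOLD) -> bool:
    """Return True when the overall score meets or exceeds the threshold."""
    return overall >= threshold


def top_categories(scores: dict[str, float], n: int = 3) -> list[str]:
    """Return up to n known category names, highest score first."""
    known = {name: value for name, value in scores.items() if name in CATEGORIES}
    return sorted(known, key=known.get, reverse=True)[:n]
```

A reviewer-facing view might call `top_categories` on a flagged post to show which dimensions drove the score; how the real dashboard presents this is not stated in the README.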
### Running Analysis

```bash
# Analyze all unscored statuses (run inside collector container)
docker exec mastodon-collector-collector-1 bash -c "python -m app.analyzer"

# Limit to first 100 statuses for testing
docker exec mastodon-collector-collector-1 bash -c "ANALYZER_LIMIT=100 python -m app.analyzer"
```

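For readers wondering how a variable such as `ANALYZER_LIMIT` is typically consumed, here is a minimal sketch (not part of the commit). The README only shows the variable being set on the command line; the `read_limit` helper, its validation rules, and the "unset means no limit" behavior are assumptions and may differ from what `app.analyzer` actually does.

```python
import os
from typing import Mapping, Optional


def read_limit(env: Optional[Mapping[str, str]] = None) -> Optional[int]:
    """Parse ANALYZER_LIMIT from the environment; None means no limit (assumed)."""
    source = os.environ if env is None else env
    raw = source.get("ANALYZER_LIMIT", "").strip()
    if not raw:
        return None  # variable unset or empty: process everything
    limit = int(raw)  # raises ValueError on non-numeric input
    if limit <= 0:
        raise ValueError("ANALYZER_LIMIT must be a positive integer")
    return limit
```

Passing the mapping explicitly, as in `read_limit({"ANALYZER_LIMIT": "100"})`, keeps the helper easy to test without touching the real process environment.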
### Analysis Database Schema

Additional tables for toxicity analysis:

- `toxicity_scores` — toxicity scores per status (12 categories + overall)
- `analysis_runs` — audit trail of analysis runs with costs and duration

### Cost Estimation

- Batch processing: ~10 posts per API call
- Estimated cost: ~$0.12 per 1,000 posts analyzed
- Example: 16,906 posts ≈ $1.95

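The figures in the Cost Estimation list can be checked with a short calculation (an illustration, not part of the commit). The batch size and the per-1,000 rate are taken from the README; the function names are made up for this sketch. Note that the rounded rate of $0.12 yields a slightly higher total than the README's $1.95 example, which corresponds to an unrounded rate of roughly $0.115 per 1,000 posts.

```python
import math

POSTS_PER_CALL = 10          # README: ~10 posts per API call
COST_PER_1000_POSTS = 0.12   # README: ~$0.12 per 1,000 posts (rounded figure)


def estimate(posts: int, cost_per_1000: float = COST_PER_1000_POSTS) -> tuple[int, float]:
    """Return (number of API calls, estimated cost in USD) for a post count."""
    calls = math.ceil(posts / POSTS_PER_CALL)
    cost = round(posts / 1000 * cost_per_1000, 2)
    return calls, cost


def implied_rate(total_cost: float, posts: int) -> float:
    """Return the cost per 1,000 posts implied by an observed total."""
    return total_cost / (posts / 1000)


calls, cost = estimate(16_906)
rate = implied_rate(1.95, 16_906)
print(f"{calls} calls, about ${cost:.2f} at the rounded rate")
print(f"README example implies about ${rate:.3f} per 1,000 posts")
```

Actual spend depends on post length and current model pricing, so treat either number as a rough planning figure.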
## API Endpoints

For plugging into your analysis pipeline:

@@ -89,3 +128,16 @@ docker compose up -d
docker compose down # Stop services, keep data
docker compose down -v # Stop services AND delete database
```

## Research & Reporting

See [ANALYSIS_REPORT.md](ANALYSIS_REPORT.md) for a complete methodology report including:
- Data collection statistics
- Toxicity analysis methodology
- Manual review results and findings
- False positive analysis
- Limitations and considerations

## License

MIT License - see [LICENSE](LICENSE) file for details.