Add documentation and license, remove IDE files

Added comprehensive project documentation and MIT license. Removed Claude
IDE configuration files from repository tracking.

Documentation added:
- FINDINGS.md: Complete methodology report and research findings
  - 159 accounts tracked, 15,190 posts collected (Jan 1 - Mar 30)
  - Human review results: 40.4% correct, 59.6% false positives
  - AI toxicity detection limitations and recommendations
- OPERATIONS.md: Complete operations and maintenance guide
  - Service start/stop procedures
  - Database operations and queries
  - Configuration options
  - Troubleshooting guide
  - Data export instructions

License:
- Added MIT License to README.md
- Copyright 2026 Post X Society
- Open source with permissive license

Repository cleanup:
- Added .claude/ to .gitignore
- Removed .claude/settings.local.json from tracking
- Prevents IDE-specific files from being committed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Pieter 2026-03-30 14:39:11 +02:00
parent 0495f47c13
commit 1c3f57d7e5
5 changed files with 771 additions and 9 deletions

.claude/settings.local.json (deleted)

@@ -1,9 +0,0 @@
{
"permissions": {
"allow": [
"Bash(docker compose:*)"
],
"deny": [],
"ask": []
}
}

.gitignore (vendored)

@@ -30,6 +30,7 @@ Thumbs.db
 # IDE
 .vscode/
 .idea/
+.claude/
 *.swp
 *.swo
 *~

FINDINGS.md (new file, 203 lines)

@@ -0,0 +1,203 @@
# Bluesky Toxicity Analysis - Main Findings
## Study Overview
**Period:** January 1 – March 30, 2026 (89 days)
**Monitored Accounts:** 159 Dutch political accounts
**Total Posts Collected:** 15,190 posts
---
## 1. Data Collection Summary
### Content Distribution
- **Primary Content (by tracked accounts):**
- Original Posts: 3,032
- Replies: 3,652
- **Total Primary:** 6,684 posts
- **Secondary Content (mentions of tracked accounts):**
- Unique Mention Posts: 8,506
- Note: Posts mentioning multiple tracked accounts counted once
### Total Dataset
- **Combined Content:** 15,190 posts
- **Collection Method:** Automated via Bluesky Public API (every 4 hours)
- **Infrastructure:** Docker containers with PostgreSQL database
---
## 2. Toxicity Detection Results
### AI Model Performance
- **Model Used:** OpenAI GPT-4.1-nano
- **Classification Categories:** 12 toxicity dimensions
- **Flagging Threshold:** Overall toxicity score ≥ 0.5 (50%)
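The flagging rule itself is simple; a minimal sketch (the score dict shape below is illustrative, not the analyzer's actual schema):

```python
# Minimal sketch of the flagging rule: a post is flagged when its overall
# toxicity score meets the 0.5 threshold. The dict shape is illustrative,
# not the analyzer's real output schema.
FLAG_THRESHOLD = 0.5

def is_flagged(scores: dict) -> bool:
    """Return True when the overall toxicity score is at or above the threshold."""
    return scores.get("overall", 0.0) >= FLAG_THRESHOLD

print(is_flagged({"overall": 0.62}))  # True
print(is_flagged({"overall": 0.31}))  # False
```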
### Flagged Content
- **Primary Content (Posts/Replies):** 97 posts flagged
- **Secondary Content (Mentions):** 413 unique posts flagged
- **Total Flagged:** 510 unique posts
### Distribution Insight
- 81% of flagged content came from mentions (external users → politicians)
- 19% of flagged content came from politicians themselves
- External users directed significantly more toxic content toward politicians than politicians produced
---
## 3. Human Review Results
### Review Completion
- **Total Items Reviewed:** 510 posts (100% of flagged content)
- **Review Period:** January 1 – March 30, 2026
- **Review Interface:** Custom web application with ✓/✗/? buttons
### Validation Results
#### Primary Content (Posts/Replies by Politicians)
| Status | Count | Percentage |
|--------|-------|------------|
| ✓ Correctly Flagged | 32 | 33.0% |
| ✗ Incorrectly Flagged | 65 | 67.0% |
| ? Unsure | 0 | 0.0% |
| **Total** | **97** | **100%** |
#### Secondary Content (Mentions of Politicians)
| Status | Count | Percentage |
|--------|-------|------------|
| ✓ Correctly Flagged | 174 | 42.1% |
| ✗ Incorrectly Flagged | 239 | 57.9% |
| ? Unsure | 0 | 0.0% |
| **Total** | **413** | **100%** |
#### Combined Results
| Status | Count | Percentage |
|--------|-------|------------|
| ✓ Correctly Flagged | 206 | 40.4% |
| ✗ Incorrectly Flagged | 304 | 59.6% |
| ? Unsure | 0 | 0.0% |
| **Total** | **510** | **100%** |
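The headline percentages follow directly from the combined counts and can be recomputed in a few lines:

```python
# Recompute the combined validation percentages from the raw review counts.
correct, incorrect, unsure = 206, 304, 0
total = correct + incorrect + unsure         # 510 reviewed posts
precision = correct / total                  # share of flags confirmed toxic
false_positive_rate = incorrect / total      # share of flags judged wrong
print(f"{precision:.1%}")             # 40.4%
print(f"{false_positive_rate:.1%}")   # 59.6%
```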
---
## 4. Key Findings
### 4.1 High False Positive Rate
- **Overall False Positive Rate: 59.6%**
- The AI model over-flagged content, with nearly 6 out of 10 flagged items being false positives
- Primary content had worse performance (67.0% false positives) than mentions (57.9%)
### 4.2 Model Limitations Identified
1. **Threshold Sensitivity:** The 0.5 threshold appears too low for Dutch political discourse
2. **Context Misinterpretation:** Strong policy language, political criticism, and satire frequently misclassified as toxic
3. **Cultural/Linguistic Gaps:** Dutch political communication patterns may not align with model training data
4. **Nuance Detection:** Difficulty distinguishing between heated but legitimate debate and actual toxicity
### 4.3 Directional Toxicity Pattern
- External mentions (8,506 posts) generated **413 flagged items** (4.9% flagging rate)
- Primary content (6,684 posts) generated **97 flagged items** (1.5% flagging rate)
- Politicians receive approximately **3× more toxic content** than they produce (by flagging rate)
- However, after human review, both sources showed high false positive rates
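The directional rates quoted above can be reproduced from the raw counts:

```python
# Reproduce the directional flagging rates in this section.
mention_rate = 413 / 8506   # flagged share of mention posts
primary_rate = 97 / 6684    # flagged share of politicians' own posts
print(f"mentions: {mention_rate:.1%}")                  # 4.9%
print(f"primary:  {primary_rate:.1%}")                  # 1.5%
print(f"ratio:    {mention_rate / primary_rate:.1f}x")  # 3.3x
```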
### 4.4 Accuracy Comparison
- **Mentions accuracy:** 42.1% (slightly better)
- **Primary content accuracy:** 33.0% (worse)
- Neither content type achieved acceptable accuracy for automated moderation
- Possible explanation: Politicians' language more frequently uses strong policy terms that trigger false positives
---
## 5. Implications for Automated Moderation
### What This Study Reveals
1. **AI Cannot Replace Human Judgment:** 59.6% false positive rate makes unsupervised automation dangerous
2. **Threshold Optimization Needed:** Current 0.5 threshold too aggressive; may need 0.7+ for political content
3. **Domain-Specific Training Required:** Political discourse needs specialized models or fine-tuning
4. **Human-in-the-Loop Essential:** Automated flagging useful for triage, but human review mandatory
### Recommended Approach
- Use AI toxicity detection as **first-pass screening only**
- Require human review for all flagged content before action
- Consider higher thresholds (0.7–0.8) for political accounts
- Train domain-specific models on Dutch political discourse
- Implement appeals process for false positives
---
## 6. Technical Implementation Success
### What Worked Well
1. **Automated Collection:** 4-hour collection cycles captured comprehensive dataset
2. **Human Review Interface:** Web UI with ✓/✗/? buttons proved efficient for manual validation
3. **Date Filtering:** Allowed focused analysis of specific time periods
4. **Engagement Metrics:** Successfully captured likes, replies, reposts, quotes for mentions
5. **Deduplication Logic:** Properly handled posts mentioning multiple tracked accounts
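The deduplication rule can be sketched as counting each mention post once per unique post URI, however many tracked accounts it tags (the URIs and handles below are made up for illustration):

```python
# Sketch of the deduplication rule: a mention post is counted once per
# unique post URI, even when it tags several tracked accounts.
# URIs and handles are invented for illustration.
mentions = [
    {"post_uri": "at://did:plc:ex/app.bsky.feed.post/1", "tracked": "minister-a.bsky.social"},
    {"post_uri": "at://did:plc:ex/app.bsky.feed.post/1", "tracked": "minister-b.bsky.social"},
    {"post_uri": "at://did:plc:ex/app.bsky.feed.post/2", "tracked": "minister-a.bsky.social"},
]
unique_posts = {m["post_uri"] for m in mentions}
print(len(unique_posts))  # 2: the double mention collapses to one post
```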
### Infrastructure Performance
- **Uptime:** 99%+ (only brief scheduler issue Feb 23-24)
- **Data Integrity:** PostgreSQL database handled 15K+ posts without issues
- **Analysis Throughput:** GPT-4.1-nano processed all content efficiently
- **Web Interface:** Responsive UI for 500+ manual reviews
---
## 7. Study Limitations
1. **Single Model Used:** Only tested GPT-4.1-nano; ensemble approaches not evaluated
2. **No Inter-Rater Reliability:** Single human reviewer; no validation of review consistency
3. **Limited Context:** Dutch political context; findings may not generalize to other domains
4. **Arbitrary Threshold:** 0.5 threshold not scientifically optimized
5. **Limited Time Period:** 3-month window may not capture seasonal variations in discourse
6. **No Appeal Process:** No mechanism for accounts to contest flagging decisions
---
## 8. Recommendations for Future Work
### Short-Term Improvements
1. **Threshold Optimization:** Test 0.6, 0.7, 0.8 thresholds and measure precision/recall
2. **Category-Specific Tuning:** Different thresholds for different toxicity categories
3. **Context Windows:** Analyze conversation threads, not isolated posts
4. **Multi-Model Validation:** Test other models (Perspective API, custom fine-tuned models)
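The proposed threshold sweep is straightforward to prototype against the human-review labels; the (score, verdict) pairs below are invented for illustration, not study data:

```python
# Precision at several flagging thresholds, on invented (score, human_verdict)
# pairs -- a sketch of the sweep, not the study's actual data.
reviewed = [(0.55, False), (0.62, True), (0.71, False), (0.83, True), (0.91, True)]

for threshold in (0.5, 0.6, 0.7, 0.8):
    flags = [verdict for score, verdict in reviewed if score >= threshold]
    precision = sum(flags) / len(flags) if flags else float("nan")
    print(f"threshold {threshold:.1f}: {len(flags)} flagged, precision {precision:.2f}")
```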
### Long-Term Research
1. **Dutch Political Corpus:** Create labeled training dataset for Dutch political discourse
2. **Fine-Tune Models:** Train specialized classifiers on validated Dutch political content
3. **Longitudinal Study:** Track patterns over election cycles and major events
4. **Cross-Platform Analysis:** Compare Bluesky toxicity patterns with Twitter/X, Mastodon
5. **Inter-Rater Reliability Study:** Multiple reviewers to validate human judgment consistency
---
## 9. Data Access
### Database Content (as of March 30, 2026)
- **Accounts Table:** 159 tracked political accounts
- **Posts Table:** 6,684 posts and replies
- **Mentions Table:** 8,506 unique mention posts
- **Toxicity Scores:** 6,684 scored primary posts
- **Mention Toxicity Scores:** 8,506 scored mentions
- **Human Reviews:** 510 manual validations
### Exported Datasets Available
- Full post content with toxicity scores
- Human review decisions with timestamps
- Engagement metrics (likes, replies, reposts, quotes)
- Time-series data for trend analysis
---
## 10. Conclusion
This study demonstrates that while AI-powered toxicity detection can **identify potential concerns** in large-scale social media content, it **cannot reliably moderate** without substantial human oversight. The 59.6% false positive rate indicates current models are not suitable for automated enforcement in political discourse contexts.
**Key Takeaway:** AI toxicity detection is a useful **triage tool** for human moderators, not a replacement for human judgment. Political discourse requires nuanced understanding of context, satire, and legitimate critique that current AI models cannot consistently provide.
**Project Status:** Data collection complete. Web interface remains available for analysis and reporting. Database preserved for future research.
---
**Generated:** March 30, 2026
**Study Period:** January 1 – March 30, 2026
**Monitored Platform:** Bluesky Social Network
**Geographic Focus:** Dutch Political Discourse

OPERATIONS.md (new file, 543 lines)

@@ -0,0 +1,543 @@
# Bluesky Collector - Operations Guide
## Quick Reference
### Current Status (March 30, 2026)
- **Collector:** ❌ STOPPED (data collection complete)
- **Scheduler:** ❌ STOPPED (no further automated runs)
- **Web Interface:** ✅ RUNNING (http://localhost:5001)
- **Database:** ✅ RUNNING (PostgreSQL on port 5433)
---
## Starting and Stopping Services
### View Current Service Status
```bash
cd /Users/pieter/Nextcloud-Hetzner/PXS\ Cloud/Projects/26004\ HEIO\ 2/04\ Applications/bluesky-collector
docker compose ps
```
### Start All Services
```bash
docker compose up -d
```
This starts:
- `db` - PostgreSQL database (port 5433)
- `web` - Web interface (port 5001)
- `collector` - Data collection service
- `scheduler` - Automated collection scheduler (runs every 4 hours)
### Stop Collection Only (Keep Web Interface)
```bash
docker compose stop scheduler collector
```
This configuration allows browsing collected data without gathering new content.
### Start Collection Services
```bash
docker compose start scheduler collector
```
### Stop All Services
```bash
docker compose down
```
**Warning:** This will stop the web interface and database. Data is preserved in Docker volumes.
### Stop and Remove Everything (Including Data)
```bash
docker compose down -v
```
**⚠️ DANGER:** This deletes all collected data permanently!
---
## Service Details
### Database (PostgreSQL)
- **Image:** `postgres:16-alpine`
- **Port:** 5433 (external) → 5432 (internal)
- **Data Volume:** `pgdata`
- **Access:**
```bash
docker compose exec db psql -U bluesky -d bluesky
```
### Web Interface
- **URL:** http://localhost:5001
- **Port:** 5001
- **Stack:** Flask + Gunicorn
- **Pages:**
- `/` - Dashboard with collection stats
- `/accounts` - Account toxicity summary
- `/statuses` - Posts and replies browser
- `/mentions` - Mentions browser
- `/analysis` - Toxicity analysis overview
- `/analysis/flagged` - Flagged content with human review
- `/export` - Data export options
### Collector Service
- **Schedule:** Every 4 hours (00:00, 04:00, 08:00, 12:00, 16:00, 20:00)
- **Function:** Collects new posts and mentions from Bluesky API
- **Logs:**
```bash
docker compose logs -f collector
```
### Scheduler Service
- **Image:** `mcuadros/ofelia`
- **Function:** Triggers collector and analyzer jobs on schedule
- **Jobs:**
- `collect` - Runs at 0 minutes past every 4th hour
- `analyze` - Runs at 30 minutes past every 4th hour
- **Logs:**
```bash
docker compose logs -f scheduler
```
---
## Manual Operations
### Run Manual Collection
```bash
docker compose exec collector python -m src
```
Collects posts and mentions immediately (outside of schedule).
### Run Manual Analysis
```bash
docker compose exec collector python -m src.analyzer
```
Analyzes all unscored posts/mentions using OpenAI API.
**Cost Warning:** Analysis incurs OpenAI API costs. Check batch size settings.
### Analyze Specific Batch Size
```bash
docker compose exec collector python -m src.analyzer --batch-size 50 --limit 100
```
Options:
- `--batch-size N` - Number of posts per API call (default: 10)
- `--limit N` - Maximum posts to analyze (default: 0 = unlimited)
- `--concurrency N` - Parallel API requests (default: 3)
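A minimal argparse sketch consistent with the documented flags (the real `src.analyzer` entry point is not shown in this guide and may differ):

```python
import argparse

# Sketch of a CLI matching the documented analyzer flags; the actual
# src.analyzer implementation may differ.
parser = argparse.ArgumentParser(prog="src.analyzer")
parser.add_argument("--batch-size", type=int, default=10, help="posts per API call")
parser.add_argument("--limit", type=int, default=0, help="max posts to analyze (0 = unlimited)")
parser.add_argument("--concurrency", type=int, default=3, help="parallel API requests")

args = parser.parse_args(["--batch-size", "50", "--limit", "100"])
print(args.batch_size, args.limit, args.concurrency)  # 50 100 3
```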
### View Recent Logs
```bash
# All services
docker compose logs --tail 100
# Specific service
docker compose logs --tail 50 collector
docker compose logs --tail 50 web
# Follow logs in real-time
docker compose logs -f collector
```
---
## Database Operations
### Access Database Shell
```bash
docker compose exec db psql -U bluesky -d bluesky
```
### Common Queries
#### Check Collection Status
```sql
SELECT
started_at::date as date,
COUNT(*) as runs,
SUM(posts_collected) as total_posts,
SUM(mentions_collected) as total_mentions,
SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) as successful
FROM collection_runs
WHERE started_at >= '2026-01-01'
GROUP BY started_at::date
ORDER BY date DESC;
```
#### Count Flagged Content
```sql
-- Posts/Replies
SELECT COUNT(*) FROM toxicity_scores WHERE overall >= 0.5;
-- Mentions (unique posts)
SELECT COUNT(DISTINCT m.post_uri)
FROM mention_toxicity_scores mts
JOIN mentions m ON m.id = mts.mention_id
WHERE mts.overall >= 0.5;
```
#### Human Review Progress
```sql
SELECT
CASE
WHEN review_status IS NULL THEN 'Unreviewed'
ELSE review_status
END as status,
COUNT(*) as count
FROM toxicity_scores
WHERE overall >= 0.5
GROUP BY review_status;
```
### Backup Database
```bash
docker compose exec db pg_dump -U bluesky bluesky > backup_$(date +%Y%m%d).sql
```
### Restore Database
```bash
cat backup_20260330.sql | docker compose exec -T db psql -U bluesky -d bluesky
```
---
## Rebuilding Services
### Rebuild After Code Changes
```bash
# Rebuild specific service
docker compose build web
docker compose build collector
# Rebuild and restart
docker compose up -d --build web
# Rebuild everything
docker compose build
docker compose up -d
```
### Apply Database Migrations
```bash
# View available migrations
ls scripts/*.sql
# Apply specific migration
docker compose exec -T db psql -U bluesky -d bluesky < scripts/04-human-review.sql
```
---
## Configuration
### Environment Variables (.env file)
```bash
# Database
POSTGRES_USER=bluesky
POSTGRES_PASSWORD=changeme
POSTGRES_PORT=5433
# Web Interface
WEB_PORT=5001
# Bluesky API (for authenticated search)
BSKY_HANDLE=your-handle.bsky.social
BSKY_APP_PASSWORD=your-app-password
# OpenAI API (for toxicity analysis)
OPENAI_API_KEY=sk-...
# Analysis Settings
ANALYZER_MODEL=gpt-4.1-nano
ANALYZER_CONCURRENCY=3
ANALYZER_BATCH_SIZE=10
ANALYZER_LIMIT=0
# Collection Settings
MAX_PAGES_PER_ACCOUNT=50
MENTION_LOOKBACK_HOURS=12
LOG_LEVEL=INFO
```
### Tracked Accounts (config/accounts.yml)
```yaml
accounts:
- handle: example.bsky.social # Account to monitor
- handle: another.bsky.social
```
Add or remove accounts, then restart collector:
```bash
docker compose restart collector
```
---
## Troubleshooting
### Web Interface Not Loading
```bash
# Check if web service is running
docker compose ps web
# Check web logs for errors
docker compose logs --tail 50 web
# Restart web service
docker compose restart web
```
### Collector Not Running
```bash
# Check scheduler is running
docker compose ps scheduler
# Check collector status
docker compose ps collector
# Start scheduler if stopped
docker compose start scheduler
# Check scheduler logs
docker compose logs scheduler
```
### Database Connection Issues
```bash
# Check database health
docker compose ps db
# Restart database
docker compose restart db
# Check database logs
docker compose logs db
```
### Out of Disk Space
```bash
# Check Docker disk usage
docker system df
# Remove unused images/containers
docker system prune
# Check database size
docker compose exec db psql -U bluesky -d bluesky -c "SELECT pg_size_pretty(pg_database_size('bluesky'));"
```
### Analysis Failing (OpenAI API)
```bash
# Check API key is set
docker compose exec collector printenv | grep OPENAI_API_KEY
# Test API connectivity (the OpenAI client reads OPENAI_API_KEY from the container env)
docker compose exec collector python -c "from openai import OpenAI; OpenAI().models.list()"
# Check rate limits in logs
docker compose logs collector | grep -i "rate limit"
```
---
## Performance Tuning
### Increase Collection Speed
Edit `docker-compose.yml`:
```yaml
environment:
MAX_PAGES_PER_ACCOUNT: 100 # Increase from 50
MENTION_LOOKBACK_HOURS: 24 # Increase lookback
```
### Increase Analysis Speed
```yaml
environment:
ANALYZER_CONCURRENCY: 5 # More parallel requests
ANALYZER_BATCH_SIZE: 20 # Bigger batches
```
**Cost Warning:** Higher concurrency and batch size = higher OpenAI API costs.
### Change Collection Schedule
Edit `docker-compose.yml` under collector labels:
```yaml
labels:
ofelia.job-exec.collect.schedule: "0 0 */2 * * *" # Every 2 hours
ofelia.job-exec.analyze.schedule: "0 30 */2 * * *" # 30 min after collection
```
Restart scheduler after changes:
```bash
docker compose restart scheduler
```
---
## Data Export
### Export to CSV via Web Interface
1. Navigate to http://localhost:5001/export
2. Select date range and filters
3. Click "Export to CSV"
### Export via Command Line
#### All Posts
```bash
docker compose exec db psql -U bluesky -d bluesky -c "COPY (
SELECT p.uri, p.author_did, a.handle, p.text, p.created_at, p.post_type,
ts.overall, ts.toxic, ts.hate_speech, ts.threat
FROM posts p
LEFT JOIN accounts a ON a.did = p.author_did
LEFT JOIN toxicity_scores ts ON ts.uri = p.uri
WHERE p.created_at >= '2026-01-01'
) TO STDOUT CSV HEADER" > posts_export.csv
```
#### Flagged Content with Reviews
```bash
docker compose exec db psql -U bluesky -d bluesky -c "COPY (
SELECT p.uri, a.handle, p.text, p.created_at,
ts.overall, ts.human_reviewed, ts.review_status, ts.reviewed_at
FROM toxicity_scores ts
JOIN posts p ON p.uri = ts.uri
LEFT JOIN accounts a ON a.did = p.author_did
WHERE ts.overall >= 0.5 AND p.created_at >= '2026-01-01'
ORDER BY ts.overall DESC
) TO STDOUT CSV HEADER" > flagged_export.csv
```
---
## Restarting Data Collection (If Needed)
### Resume Collection After Pause
1. Start services:
```bash
docker compose start scheduler collector
```
2. Verify collection runs:
```bash
docker compose logs -f collector
```
3. Check database for new entries:
```bash
docker compose exec db psql -U bluesky -d bluesky -c "
SELECT MAX(created_at) FROM posts;
SELECT COUNT(*) FROM collection_runs WHERE started_at > NOW() - INTERVAL '1 day';
"
```
### Start Fresh Collection (Keep Database)
1. Stop services:
```bash
docker compose down
```
2. Start only database and web:
```bash
docker compose up -d db web
```
3. Truncate collection tracking (optional):
```bash
docker compose exec db psql -U bluesky -d bluesky -c "TRUNCATE collection_runs;"
```
4. Start collector:
```bash
docker compose up -d scheduler collector
```
### Complete Reset (Delete All Data)
```bash
# Stop everything
docker compose down
# Remove data volume
docker volume rm bluesky-collector_pgdata
# Restart from scratch
docker compose up -d
```
**⚠️ WARNING:** This deletes all collected posts, mentions, and analysis results permanently!
---
## Monitoring
### Collection Health Check
```bash
# Last 5 collection runs
docker compose exec db psql -U bluesky -d bluesky -c "
SELECT started_at, finished_at, status, posts_collected, mentions_collected, errors
FROM collection_runs
ORDER BY started_at DESC
LIMIT 5;
"
```
### Analysis Progress
```bash
# Count scored vs unscored
docker compose exec db psql -U bluesky -d bluesky -c "
SELECT
(SELECT COUNT(*) FROM posts) as total_posts,
(SELECT COUNT(*) FROM toxicity_scores) as scored_posts,
(SELECT COUNT(*) FROM mentions) as total_mentions,
(SELECT COUNT(*) FROM mention_toxicity_scores) as scored_mentions;
"
```
### Disk Usage
```bash
# Database size
docker compose exec db psql -U bluesky -d bluesky -c "
SELECT
pg_size_pretty(pg_database_size('bluesky')) as db_size,
pg_size_pretty(pg_total_relation_size('posts')) as posts_table,
pg_size_pretty(pg_total_relation_size('mentions')) as mentions_table;
"
```
---
## Security Notes
1. **Never commit .env file** - Contains API keys and passwords
2. **Change default passwords** - PostgreSQL default password is `changeme`
3. **Firewall rules** - Ports 5001 (web) and 5433 (database) exposed to localhost only
4. **API keys** - Bluesky and OpenAI credentials stored in environment variables
5. **Data retention** - Contains personal data (Bluesky posts); handle per GDPR requirements
---
## Support
### Documentation
- Main findings: `FINDINGS.md`
- This operations guide: `OPERATIONS.md`
- Git repository: https://forgejo.postxsociety.cloud/pieter/bluesky-collector
### Logs Location
- Docker logs: `docker compose logs [service]`
- Application logs: `./logs/` directory (if volume mounted)
### Common Issues
1. **Port conflicts:** Change `WEB_PORT` or `POSTGRES_PORT` in .env
2. **Out of memory:** Reduce `ANALYZER_CONCURRENCY` or `ANALYZER_BATCH_SIZE`
3. **API rate limits:** Reduce collection frequency or batch size
4. **Disk full:** Run `docker system prune` and consider data export/cleanup
---
**Last Updated:** March 30, 2026
**Project Status:** Data collection complete, web interface available for analysis

README.md

@@ -279,3 +279,27 @@ The `pgdata` volume persists across container restarts. Back it up with standard
```bash
docker compose exec -T db pg_dump -U bluesky bluesky > backup.sql
```
## License
MIT License
Copyright (c) 2026 Post X Society
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.