# Bluesky Toxicity Analysis - Main Findings
## Study Overview
**Period:** January 1 – March 30, 2026 (89 days)
**Monitored Accounts:** 159 Dutch political accounts
**Total Posts Collected:** 15,190 posts
---
## 1. Data Collection Summary
### Content Distribution
- **Primary Content (by tracked accounts):**
  - Original Posts: 3,032
  - Replies: 3,652
  - **Total Primary:** 6,684 posts
- **Secondary Content (mentions of tracked accounts):**
  - Unique Mention Posts: 8,506
  - Note: posts mentioning multiple tracked accounts are counted once
### Total Dataset
- **Combined Content:** 15,190 posts
- **Collection Method:** Automated via Bluesky Public API (every 4 hours)
- **Infrastructure:** Docker containers with PostgreSQL database
---
## 2. Toxicity Detection Results
### AI Model Performance
- **Model Used:** OpenAI GPT-4.1-nano
- **Classification Categories:** 12 toxicity dimensions
- **Flagging Threshold:** Overall toxicity score ≥ 0.5 (50%)
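The flagging rule itself is simple and can be sketched as below. The score key name is an assumption for illustration; the report does not enumerate the 12 dimensions or the exact response schema.

```python
# Minimal sketch of the flagging rule: a post is flagged when its overall
# toxicity score meets the 0.5 threshold. Key name is illustrative.
FLAG_THRESHOLD = 0.5

def is_flagged(scores: dict[str, float], threshold: float = FLAG_THRESHOLD) -> bool:
    """Flag a post when the model's overall toxicity score >= threshold."""
    return scores.get("overall_toxicity", 0.0) >= threshold
```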
### Flagged Content
- **Primary Content (Posts/Replies):** 97 posts flagged
- **Secondary Content (Mentions):** 413 unique posts flagged
- **Total Flagged:** 510 unique posts
### Distribution Insight
- 81% of flagged content came from mentions (external users → politicians)
- 19% of flagged content came from politicians themselves
- External users directed substantially more flagged content at politicians than the politicians themselves produced
---
## 3. Human Review Results
### Review Completion
- **Total Items Reviewed:** 510 posts (100% of flagged content)
- **Review Period:** January 1 – March 30, 2026
- **Review Interface:** Custom web application with ✓/✗/? buttons
### Validation Results
#### Primary Content (Posts/Replies by Politicians)
| Status | Count | Percentage |
|--------|-------|------------|
| ✓ Correctly Flagged | 32 | 33.0% |
| ✗ Incorrectly Flagged | 65 | 67.0% |
| ? Unsure | 0 | 0.0% |
| **Total** | **97** | **100%** |
#### Secondary Content (Mentions of Politicians)
| Status | Count | Percentage |
|--------|-------|------------|
| ✓ Correctly Flagged | 174 | 42.1% |
| ✗ Incorrectly Flagged | 239 | 57.9% |
| ? Unsure | 0 | 0.0% |
| **Total** | **413** | **100%** |
#### Combined Results
| Status | Count | Percentage |
|--------|-------|------------|
| ✓ Correctly Flagged | 206 | 40.4% |
| ✗ Incorrectly Flagged | 304 | 59.6% |
| ? Unsure | 0 | 0.0% |
| **Total** | **510** | **100%** |
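Treating the human ✓/✗ verdicts as ground truth, the percentages above are simply the precision of the AI flagging (true positives over all flags):

```python
# Precision of the AI flagging, computed from the human-review counts above.
def precision(true_pos: int, false_pos: int) -> float:
    return true_pos / (true_pos + false_pos)

primary = precision(32, 65)     # posts/replies by politicians -> ~0.330
mentions = precision(174, 239)  # mentions of politicians      -> ~0.421
combined = precision(206, 304)  # all flagged content          -> ~0.404
```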
---
## 4. Key Findings
### 4.1 High False Positive Rate
- **Overall False Positive Rate: 59.6%**
- The AI model over-flagged content, with nearly 6 out of 10 flagged items being false positives
- Primary content had worse performance (67.0% false positives) than mentions (57.9%)
### 4.2 Model Limitations Identified
1. **Threshold Sensitivity:** The 0.5 threshold appears too low for Dutch political discourse
2. **Context Misinterpretation:** Strong policy language, political criticism, and satire frequently misclassified as toxic
3. **Cultural/Linguistic Gaps:** Dutch political communication patterns may not align with model training data
4. **Nuance Detection:** Difficulty distinguishing between heated but legitimate debate and actual toxicity
### 4.3 Directional Toxicity Pattern
- External mentions (8,506 posts) generated **413 flagged items** (4.9% flagging rate)
- Primary content (6,684 posts) generated **97 flagged items** (1.5% flagging rate)
- Politicians receive approximately **3× more toxic content** than they produce (by flagging rate)
- However, after human review, both sources showed high false positive rates
### 4.4 Accuracy Comparison
- **Mentions accuracy:** 42.1% (slightly better)
- **Primary content accuracy:** 33.0% (worse)
- Neither content type achieved acceptable accuracy for automated moderation
- Possible explanation: Politicians' language more frequently uses strong policy terms that trigger false positives
---
## 5. Implications for Automated Moderation
### What This Study Reveals
1. **AI Cannot Replace Human Judgment:** 59.6% false positive rate makes unsupervised automation dangerous
2. **Threshold Optimization Needed:** Current 0.5 threshold too aggressive; may need 0.7+ for political content
3. **Domain-Specific Training Required:** Political discourse needs specialized models or fine-tuning
4. **Human-in-the-Loop Essential:** Automated flagging useful for triage, but human review mandatory
### Recommended Approach
- Use AI toxicity detection as **first-pass screening only**
- Require human review for all flagged content before action
- Consider higher thresholds (0.7–0.8) for political accounts
- Train domain-specific models on Dutch political discourse
- Implement appeals process for false positives
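The recommended workflow can be sketched as a routing function: the model score only decides whether a post enters the human review queue, never whether action is taken. The 0.7 political-content threshold is the report's suggestion, not an empirically tuned value.

```python
# Hedged sketch of human-in-the-loop triage: AI score routes content to a
# review queue; no enforcement happens without a human decision.
def triage(overall_score: float, is_political: bool) -> str:
    """Return the routing decision for a scored post."""
    threshold = 0.7 if is_political else 0.5
    return "queue_for_human_review" if overall_score >= threshold else "no_action"
```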
---
## 6. Technical Implementation Success
### What Worked Well
1. **Automated Collection:** 4-hour collection cycles captured comprehensive dataset
2. **Human Review Interface:** Web UI with ✓/✗/? buttons efficient for manual validation
3. **Date Filtering:** Allowed focused analysis of specific time periods
4. **Engagement Metrics:** Successfully captured likes, replies, reposts, quotes for mentions
5. **Deduplication Logic:** Properly handled posts mentioning multiple tracked accounts
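The deduplication behavior described in point 5 amounts to keeping one row per post URI. A minimal sketch, with illustrative field names (the actual schema is not shown in this report):

```python
# Sketch of the deduplication rule: a post mentioning several tracked
# accounts is kept once, keyed by its AT-URI.
def dedupe_mentions(mention_rows: list[dict]) -> list[dict]:
    """Keep the first occurrence of each post URI."""
    seen: set[str] = set()
    unique = []
    for row in mention_rows:
        if row["uri"] not in seen:
            seen.add(row["uri"])
            unique.append(row)
    return unique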
### Infrastructure Performance
- **Uptime:** 99%+ (only a brief scheduler issue on Feb 23–24)
- **Data Integrity:** PostgreSQL database handled 15K+ posts without issues
- **Analysis Throughput:** GPT-4.1-nano processed all content efficiently
- **Web Interface:** Responsive UI for 500+ manual reviews
---
## 7. Study Limitations
1. **Single Model Used:** Only tested GPT-4.1-nano; ensemble approaches not evaluated
2. **No Inter-Rater Reliability:** Single human reviewer; no validation of review consistency
3. **Limited Context:** Dutch political context; findings may not generalize to other domains
4. **Arbitrary Threshold:** 0.5 threshold not scientifically optimized
5. **Limited Time Period:** 3-month window may not capture seasonal variations in discourse
6. **No Appeal Process:** No mechanism for accounts to contest flagging decisions
---
## 8. Recommendations for Future Work
### Short-Term Improvements
1. **Threshold Optimization:** Test 0.6, 0.7, 0.8 thresholds and measure precision/recall
2. **Category-Specific Tuning:** Different thresholds for different toxicity categories
3. **Context Windows:** Analyze conversation threads, not isolated posts
4. **Multi-Model Validation:** Test other models (Perspective API, custom fine-tuned models)
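The threshold sweep proposed in point 1 can be sketched as a precision/recall evaluation over the existing human labels. The scores and labels below are synthetic stand-ins, not study data:

```python
# Sketch of the proposed threshold sweep: precision/recall at several
# cutoffs, using the human ✓/✗ verdicts as ground truth.
def precision_recall(scores, labels, threshold):
    """labels: True = genuinely toxic per human review."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

for t in (0.5, 0.6, 0.7, 0.8):
    p, r = precision_recall([0.55, 0.65, 0.75, 0.85],
                            [False, True, False, True], t)
```

Raising the threshold should trade recall for precision; the sweep makes that trade-off measurable instead of assumed.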
### Long-Term Research
1. **Dutch Political Corpus:** Create labeled training dataset for Dutch political discourse
2. **Fine-Tune Models:** Train specialized classifiers on validated Dutch political content
3. **Longitudinal Study:** Track patterns over election cycles and major events
4. **Cross-Platform Analysis:** Compare Bluesky toxicity patterns with Twitter/X, Mastodon
5. **Inter-Rater Reliability Study:** Multiple reviewers to validate human judgment consistency
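For the inter-rater reliability study in point 5, a standard choice is Cohen's kappa over two reviewers' ✓/✗/? labels on the same flagged posts. A minimal sketch:

```python
# Sketch of an inter-rater reliability check: Cohen's kappa for two
# reviewers labeling the same set of flagged posts.
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for chance; values near 1 indicate the single-reviewer judgments in this study would likely replicate.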
---
## 9. Data Access
### Database Content (as of March 30, 2026)
- **Accounts Table:** 159 tracked political accounts
- **Posts Table:** 6,684 posts and replies
- **Mentions Table:** 8,506 unique mention posts
- **Toxicity Scores:** 6,684 scored primary posts
- **Mention Toxicity Scores:** 8,506 scored mentions
- **Human Reviews:** 510 manual validations
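An analysis query over these tables joins posts to their scores and review verdicts. The sketch below uses an in-memory SQLite stand-in for brevity; the actual store is PostgreSQL, and the table and column names here are assumptions, not the project's real schema:

```python
# Illustrative join of posts, toxicity scores, and human reviews.
# Column names are assumed; the real schema may differ.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (uri TEXT PRIMARY KEY, text TEXT);
    CREATE TABLE toxicity_scores (uri TEXT, overall REAL);
    CREATE TABLE human_reviews (uri TEXT, verdict TEXT);
    INSERT INTO posts VALUES ('at://a/1', 'example post');
    INSERT INTO toxicity_scores VALUES ('at://a/1', 0.62);
    INSERT INTO human_reviews VALUES ('at://a/1', 'correct');
""")
rows = conn.execute("""
    SELECT p.uri, t.overall, r.verdict
    FROM posts p
    JOIN toxicity_scores t ON t.uri = p.uri
    LEFT JOIN human_reviews r ON r.uri = p.uri
    WHERE t.overall >= 0.5
""").fetchall()
```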
### Exported Datasets Available
- Full post content with toxicity scores
- Human review decisions with timestamps
- Engagement metrics (likes, replies, reposts, quotes)
- Time-series data for trend analysis
---
## 10. Conclusion
This study demonstrates that while AI-powered toxicity detection can **identify potential concerns** in large-scale social media content, it **cannot reliably moderate** without substantial human oversight. The 59.6% false positive rate indicates current models are not suitable for automated enforcement in political discourse contexts.
**Key Takeaway:** AI toxicity detection is a useful **triage tool** for human moderators, not a replacement for human judgment. Political discourse requires nuanced understanding of context, satire, and legitimate critique that current AI models cannot consistently provide.
**Project Status:** Data collection complete. Web interface remains available for analysis and reporting. Database preserved for future research.
---
**Generated:** March 30, 2026
**Study Period:** January 1 – March 30, 2026
**Monitored Platform:** Bluesky Social Network
**Geographic Focus:** Dutch Political Discourse