Add documentation and license, remove IDE files
Added comprehensive project documentation and MIT license.
Removed Claude IDE configuration files from repository tracking.

Documentation added:
- FINDINGS.md: Complete methodology report and research findings
  - 159 accounts tracked, 15,190 posts collected (Jan 1 - Mar 30)
  - Human review results: 40.4% correct, 59.6% false positives
  - AI toxicity detection limitations and recommendations
- OPERATIONS.md: Complete operations and maintenance guide
  - Service start/stop procedures
  - Database operations and queries
  - Configuration options
  - Troubleshooting guide
  - Data export instructions

License:
- Added MIT License to README.md
- Copyright 2026 Post X Society
- Open source with permissive license

Repository cleanup:
- Added .claude/ to .gitignore
- Removed .claude/settings.local.json from tracking
- Prevents IDE-specific files from being committed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
parent 0495f47c13
commit 1c3f57d7e5
5 changed files with 771 additions and 9 deletions
|
|
@ -1,9 +0,0 @@
{
  "permissions": {
    "allow": [
      "Bash(docker compose:*)"
    ],
    "deny": [],
    "ask": []
  }
}
1
.gitignore
vendored
|
|
@ -30,6 +30,7 @@ Thumbs.db
|
|||
# IDE
|
||||
.vscode/
|
||||
.idea/
|
||||
.claude/
|
||||
*.swp
|
||||
*.swo
|
||||
*~
|
||||
|
|
|
|||
203
FINDINGS.md
Normal file
|
|
@ -0,0 +1,203 @@
|
|||
# Bluesky Toxicity Analysis - Main Findings
|
||||
|
||||
## Study Overview
|
||||
**Period:** January 1 – March 30, 2026 (89 days)
|
||||
**Monitored Accounts:** 159 Dutch political accounts
|
||||
**Total Posts Collected:** 15,190 posts
|
||||
|
||||
---
|
||||
|
||||
## 1. Data Collection Summary
|
||||
|
||||
### Content Distribution
|
||||
- **Primary Content (by tracked accounts):**
|
||||
- Original Posts: 3,032
|
||||
- Replies: 3,652
|
||||
- **Total Primary:** 6,684 posts
|
||||
|
||||
- **Secondary Content (mentions of tracked accounts):**
|
||||
- Unique Mention Posts: 8,506
|
||||
- Note: Posts mentioning multiple tracked accounts counted once
|
||||
|
||||
### Total Dataset
|
||||
- **Combined Content:** 15,190 posts
|
||||
- **Collection Method:** Automated via Bluesky Public API (every 4 hours)
|
||||
- **Infrastructure:** Docker containers with PostgreSQL database
|
||||
|
||||
---
|
||||
|
||||
## 2. Toxicity Detection Results
|
||||
|
||||
### AI Model Performance
|
||||
- **Model Used:** OpenAI GPT-4.1-nano
|
||||
- **Classification Categories:** 12 toxicity dimensions
|
||||
- **Flagging Threshold:** Overall toxicity score ≥ 0.5 (50%)
|
||||
|
||||
### Flagged Content
|
||||
- **Primary Content (Posts/Replies):** 97 posts flagged
|
||||
- **Secondary Content (Mentions):** 413 unique posts flagged
|
||||
- **Total Flagged:** 510 unique posts
|
||||
|
||||
### Distribution Insight
|
||||
- 81% of flagged content came from mentions (external users → politicians)
|
||||
- 19% of flagged content came from politicians themselves
|
||||
- Before human review, content that external users directed at politicians was flagged far more often than content the politicians themselves produced
|
||||
|
||||
---
|
||||
|
||||
## 3. Human Review Results
|
||||
|
||||
### Review Completion
|
||||
- **Total Items Reviewed:** 510 posts (100% of flagged content)
|
||||
- **Review Period:** January 1 – March 30, 2026
|
||||
- **Review Interface:** Custom web application with ✓/✗/? buttons
|
||||
|
||||
### Validation Results
|
||||
|
||||
#### Primary Content (Posts/Replies by Politicians)
|
||||
| Status | Count | Percentage |
|--------|-------|------------|
| ✓ Correctly Flagged | 32 | 33.0% |
| ✗ Incorrectly Flagged | 65 | 67.0% |
| ? Unsure | 0 | 0.0% |
| **Total** | **97** | **100%** |
|
||||
|
||||
#### Secondary Content (Mentions of Politicians)
|
||||
| Status | Count | Percentage |
|--------|-------|------------|
| ✓ Correctly Flagged | 174 | 42.1% |
| ✗ Incorrectly Flagged | 239 | 57.9% |
| ? Unsure | 0 | 0.0% |
| **Total** | **413** | **100%** |
|
||||
|
||||
#### Combined Results
|
||||
| Status | Count | Percentage |
|--------|-------|------------|
| ✓ Correctly Flagged | 206 | 40.4% |
| ✗ Incorrectly Flagged | 304 | 59.6% |
| ? Unsure | 0 | 0.0% |
| **Total** | **510** | **100%** |
|
||||
|
||||
---
|
||||
|
||||
## 4. Key Findings
|
||||
|
||||
### 4.1 High False Positive Rate
|
||||
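- **Overall False Positive Rate: 59.6%** (304 of the 510 flagged items were judged incorrectly flagged)
- This rate is measured over flagged items only; posts scoring below the threshold were not reviewed, so it does not describe errors among unflagged content
- The AI model over-flagged content: nearly 6 out of 10 flagged items were false positives
- Primary content performed worse (67.0% false positives) than mentions (57.9%)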
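
The percentages above follow directly from the review counts. A minimal sketch that recomputes them, assuming only the counts reported in the validation tables (the names `REVIEWS`, `precision`, and `summarize` are illustrative and not part of the project code):

```python
# Review counts copied from the validation tables in this report.
REVIEWS = {
    "primary": {"correct": 32, "incorrect": 65},
    "mentions": {"correct": 174, "incorrect": 239},
}


def precision(correct: int, incorrect: int) -> float:
    """Share of flagged items that the reviewer confirmed as correctly flagged."""
    total = correct + incorrect
    return correct / total if total else 0.0


def summarize(reviews: dict) -> dict:
    """Return the confirmed share per group and combined, in percent (one decimal)."""
    result = {}
    all_correct = all_incorrect = 0
    for name, counts in reviews.items():
        result[name] = round(100 * precision(counts["correct"], counts["incorrect"]), 1)
        all_correct += counts["correct"]
        all_incorrect += counts["incorrect"]
    result["combined"] = round(100 * precision(all_correct, all_incorrect), 1)
    return result


if __name__ == "__main__":
    for group, pct in summarize(REVIEWS).items():
        false_positive_share = round(100 - pct, 1)
        print(f"{group}: {pct}% correctly flagged, {false_positive_share}% false positives")
```

Because only flagged items were reviewed, these figures describe how often a flag was right, not how much toxic content the model missed.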
|
||||
|
||||
### 4.2 Model Limitations Identified
|
||||
1. **Threshold Sensitivity:** The 0.5 threshold appears too low for Dutch political discourse
|
||||
2. **Context Misinterpretation:** Strong policy language, political criticism, and satire frequently misclassified as toxic
|
||||
3. **Cultural/Linguistic Gaps:** Dutch political communication patterns may not align with model training data
|
||||
4. **Nuance Detection:** Difficulty distinguishing between heated but legitimate debate and actual toxicity
|
||||
|
||||
### 4.3 Directional Toxicity Pattern
|
||||
- External mentions (8,506 posts) generated **413 flagged items** (4.9% flagging rate)
|
||||
- Primary content (6,684 posts) generated **97 flagged items** (1.5% flagging rate)
|
||||
- Politicians receive approximately **3× more toxic content** than they produce (by flagging rate)
|
||||
- However, after human review, both sources showed high false positive rates
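
The flagging rates and the "approximately 3×" ratio can be checked from the counts in this report. The sketch below is illustrative only: it also applies the same calculation to the human-confirmed counts, a comparison the report itself does not state and which rests on a single reviewer's judgments. The function names are hypothetical.

```python
# Counts taken from this report.
COLLECTED = {"primary": 6684, "mentions": 8506}
FLAGGED = {"primary": 97, "mentions": 413}
CONFIRMED = {"primary": 32, "mentions": 174}  # flagged items confirmed by the reviewer


def rate(count: int, total: int) -> float:
    """Fraction of collected posts that fall in the given category."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


def mentions_to_primary_ratio(counts: dict) -> float:
    """How many times higher the rate is for mentions than for primary content."""
    mentions = rate(counts["mentions"], COLLECTED["mentions"])
    primary = rate(counts["primary"], COLLECTED["primary"])
    return mentions / primary


if __name__ == "__main__":
    for group in COLLECTED:
        pct = 100 * rate(FLAGGED[group], COLLECTED[group])
        print(f"{group}: {pct:.1f}% flagged")
    print(f"flagged ratio (mentions/primary): {mentions_to_primary_ratio(FLAGGED):.2f}")
    print(f"confirmed ratio (mentions/primary): {mentions_to_primary_ratio(CONFIRMED):.2f}")
```

By flagging rate the ratio is roughly 3.3; restricted to reviewer-confirmed items it is somewhat higher, so the directional pattern does not disappear after review.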
|
||||
|
||||
### 4.4 Accuracy Comparison
|
||||
- **Mentions accuracy:** 42.1% of flagged mentions confirmed by review (slightly better)
- **Primary content accuracy:** 33.0% of flagged posts confirmed by review (worse)
|
||||
- Neither content type achieved acceptable accuracy for automated moderation
|
||||
- Possible explanation: Politicians' language more frequently uses strong policy terms that trigger false positives
|
||||
|
||||
---
|
||||
|
||||
## 5. Implications for Automated Moderation
|
||||
|
||||
### What This Study Reveals
|
||||
1. **AI Cannot Replace Human Judgment:** 59.6% false positive rate makes unsupervised automation dangerous
|
||||
2. **Threshold Optimization Needed:** Current 0.5 threshold too aggressive; may need 0.7+ for political content
|
||||
3. **Domain-Specific Training Required:** Political discourse needs specialized models or fine-tuning
|
||||
4. **Human-in-the-Loop Essential:** Automated flagging useful for triage, but human review mandatory
|
||||
|
||||
### Recommended Approach
|
||||
- Use AI toxicity detection as **first-pass screening only**
|
||||
- Require human review for all flagged content before action
|
||||
- Consider higher thresholds (0.7–0.8) for political accounts
|
||||
- Train domain-specific models on Dutch political discourse
|
||||
- Implement appeals process for false positives
|
||||
|
||||
---
|
||||
|
||||
## 6. Technical Implementation Success
|
||||
|
||||
### What Worked Well
|
||||
1. **Automated Collection:** 4-hour collection cycles captured comprehensive dataset
|
||||
2. **Human Review Interface:** Web UI with ✓/✗/? buttons efficient for manual validation
|
||||
3. **Date Filtering:** Allowed focused analysis of specific time periods
|
||||
4. **Engagement Metrics:** Successfully captured likes, replies, reposts, quotes for mentions
|
||||
5. **Deduplication Logic:** Properly handled posts mentioning multiple tracked accounts
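
The deduplication idea in item 5 can be illustrated with a short sketch. This is not the collector's actual code: the row layout `(post_uri, text, mentioned_handle)` and the names `MentionPost` and `dedupe_mentions` are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MentionPost:
    """One unique post, with every tracked account it mentions."""
    uri: str
    text: str
    mentioned: set = field(default_factory=set)


def dedupe_mentions(rows):
    """Collapse (post_uri, text, mentioned_handle) rows into one entry per post URI."""
    posts = {}
    for uri, text, handle in rows:
        if uri not in posts:
            posts[uri] = MentionPost(uri=uri, text=text)
        posts[uri].mentioned.add(handle)
    return list(posts.values())


if __name__ == "__main__":
    # Hypothetical rows: the first post mentions two tracked accounts.
    rows = [
        ("at://example/post/1", "hello", "a.bsky.social"),
        ("at://example/post/1", "hello", "b.bsky.social"),
        ("at://example/post/2", "world", "a.bsky.social"),
    ]
    for post in dedupe_mentions(rows):
        print(post.uri, sorted(post.mentioned))
```

Keying on the post URI means a post that mentions several tracked accounts is scored and counted once, which matches how the mention totals in this report are described.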
|
||||
|
||||
### Infrastructure Performance
|
||||
- **Uptime:** 99%+ (only brief scheduler issue Feb 23-24)
|
||||
- **Data Integrity:** PostgreSQL database handled 15K+ posts without issues
|
||||
- **Analysis Throughput:** GPT-4.1-nano processed all content efficiently
|
||||
- **Web Interface:** Responsive UI for 500+ manual reviews
|
||||
|
||||
---
|
||||
|
||||
## 7. Study Limitations
|
||||
|
||||
1. **Single Model Used:** Only tested GPT-4.1-nano; ensemble approaches not evaluated
|
||||
2. **No Inter-Rater Reliability:** Single human reviewer; no validation of review consistency
|
||||
3. **Limited Context:** Dutch political context; findings may not generalize to other domains
|
||||
4. **Arbitrary Threshold:** 0.5 threshold not scientifically optimized
|
||||
5. **Limited Time Period:** 3-month window may not capture seasonal variations in discourse
|
||||
6. **No Appeal Process:** No mechanism for accounts to contest flagging decisions
|
||||
|
||||
---
|
||||
|
||||
## 8. Recommendations for Future Work
|
||||
|
||||
### Short-Term Improvements
|
||||
1. **Threshold Optimization:** Test 0.6, 0.7, 0.8 thresholds and measure precision/recall
|
||||
2. **Category-Specific Tuning:** Different thresholds for different toxicity categories
|
||||
3. **Context Windows:** Analyze conversation threads, not isolated posts
|
||||
4. **Multi-Model Validation:** Test other models (Perspective API, custom fine-tuned models)
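
A threshold test along the lines of item 1 could start from the existing human reviews. The sketch below is a hedged illustration with made-up scores, not study data; `sweep` and its input format `(overall_score, confirmed_by_reviewer)` are hypothetical.

```python
def sweep(reviewed, thresholds=(0.5, 0.6, 0.7, 0.8)):
    """For each threshold, report how many reviewed items stay flagged,
    the precision among them, and the share of confirmed items retained."""
    total_confirmed = sum(1 for _, confirmed in reviewed if confirmed)
    results = []
    for threshold in thresholds:
        kept = [(score, confirmed) for score, confirmed in reviewed if score >= threshold]
        true_positives = sum(1 for _, confirmed in kept if confirmed)
        results.append({
            "threshold": threshold,
            "flagged": len(kept),
            "precision": true_positives / len(kept) if kept else None,
            "retained": true_positives / total_confirmed if total_confirmed else None,
        })
    return results


if __name__ == "__main__":
    # Illustrative values only: (overall score, reviewer confirmed the flag).
    sample = [(0.52, True), (0.55, False), (0.62, False),
              (0.71, True), (0.85, True), (0.90, False)]
    for row in sweep(sample):
        print(row)
```

Since only items scoring at least 0.5 were reviewed, `retained` is relative to the confirmed items at the current threshold. True recall would require reviewing a sample of unflagged posts as well.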
|
||||
|
||||
### Long-Term Research
|
||||
1. **Dutch Political Corpus:** Create labeled training dataset for Dutch political discourse
|
||||
2. **Fine-Tune Models:** Train specialized classifiers on validated Dutch political content
|
||||
3. **Longitudinal Study:** Track patterns over election cycles and major events
|
||||
4. **Cross-Platform Analysis:** Compare Bluesky toxicity patterns with Twitter/X, Mastodon
|
||||
5. **Inter-Rater Reliability Study:** Multiple reviewers to validate human judgment consistency
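
For the inter-rater reliability study in item 5, one common agreement measure is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The report does not name a specific metric, so treat this as one possible choice; the function name and labels below are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where both reviewers chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each reviewer labelled independently at their own base rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical decisions from two reviewers on four flagged posts.
    reviewer_1 = ["correct", "correct", "incorrect", "incorrect"]
    reviewer_2 = ["correct", "incorrect", "incorrect", "incorrect"]
    print(f"kappa = {cohens_kappa(reviewer_1, reviewer_2):.2f}")
```

A kappa near 1 indicates strong agreement, while a value near 0 means the reviewers agree about as often as chance would predict.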
|
||||
|
||||
---
|
||||
|
||||
## 9. Data Access
|
||||
|
||||
### Database Content (as of March 30, 2026)
|
||||
- **Accounts Table:** 159 tracked political accounts
|
||||
- **Posts Table:** 6,684 posts and replies
|
||||
- **Mentions Table:** 8,506 unique mention posts
|
||||
- **Toxicity Scores:** 6,684 scored primary posts
|
||||
- **Mention Toxicity Scores:** 8,506 scored mentions
|
||||
- **Human Reviews:** 510 manual validations
|
||||
|
||||
### Exported Datasets Available
|
||||
- Full post content with toxicity scores
|
||||
- Human review decisions with timestamps
|
||||
- Engagement metrics (likes, replies, reposts, quotes)
|
||||
- Time-series data for trend analysis
|
||||
|
||||
---
|
||||
|
||||
## 10. Conclusion
|
||||
|
||||
This study demonstrates that while AI-powered toxicity detection can **identify potential concerns** in large-scale social media content, it **cannot reliably moderate** without substantial human oversight. The 59.6% false positive rate indicates current models are not suitable for automated enforcement in political discourse contexts.
|
||||
|
||||
**Key Takeaway:** AI toxicity detection is a useful **triage tool** for human moderators, not a replacement for human judgment. Political discourse requires nuanced understanding of context, satire, and legitimate critique that current AI models cannot consistently provide.
|
||||
|
||||
**Project Status:** Data collection complete. Web interface remains available for analysis and reporting. Database preserved for future research.
|
||||
|
||||
---
|
||||
|
||||
**Generated:** March 30, 2026
|
||||
**Study Period:** January 1 – March 30, 2026
|
||||
**Monitored Platform:** Bluesky Social Network
|
||||
**Geographic Focus:** Dutch Political Discourse
|
||||
543
OPERATIONS.md
Normal file
|
|
@ -0,0 +1,543 @@
|
|||
# Bluesky Collector - Operations Guide
|
||||
|
||||
## Quick Reference
|
||||
|
||||
### Current Status (March 30, 2026)
|
||||
- **Collector:** ❌ STOPPED (data collection complete)
|
||||
- **Scheduler:** ❌ STOPPED (no further automated runs)
|
||||
- **Web Interface:** ✅ RUNNING (http://localhost:5001)
|
||||
- **Database:** ✅ RUNNING (PostgreSQL on port 5433)
|
||||
|
||||
---
|
||||
|
||||
## Starting and Stopping Services
|
||||
|
||||
### View Current Service Status
|
||||
```bash
|
||||
cd /Users/pieter/Nextcloud-Hetzner/PXS\ Cloud/Projects/26004\ HEIO\ 2/04\ Applications/bluesky-collector
|
||||
docker compose ps
|
||||
```
|
||||
|
||||
### Start All Services
|
||||
```bash
|
||||
docker compose up -d
|
||||
```
|
||||
|
||||
This starts:
|
||||
- `db` - PostgreSQL database (port 5433)
|
||||
- `web` - Web interface (port 5001)
|
||||
- `collector` - Data collection service
|
||||
- `scheduler` - Automated collection scheduler (runs every 4 hours)
|
||||
|
||||
### Stop Collection Only (Keep Web Interface)
|
||||
```bash
|
||||
docker compose stop scheduler collector
|
||||
```
|
||||
|
||||
This configuration allows browsing collected data without gathering new content.
|
||||
|
||||
### Start Collection Services
|
||||
```bash
|
||||
docker compose start scheduler collector
|
||||
```
|
||||
|
||||
### Stop All Services
|
||||
```bash
|
||||
docker compose down
|
||||
```
|
||||
|
||||
**Warning:** This will stop the web interface and database. Data is preserved in Docker volumes.
|
||||
|
||||
### Stop and Remove Everything (Including Data)
|
||||
```bash
|
||||
docker compose down -v
|
||||
```
|
||||
|
||||
**⚠️ DANGER:** This deletes all collected data permanently!
|
||||
|
||||
---
|
||||
|
||||
## Service Details
|
||||
|
||||
### Database (PostgreSQL)
|
||||
- **Image:** `postgres:16-alpine`
|
||||
- **Port:** 5433 (external) → 5432 (internal)
|
||||
- **Data Volume:** `pgdata`
|
||||
- **Access:**
|
||||
```bash
|
||||
docker compose exec db psql -U bluesky -d bluesky
|
||||
```
|
||||
|
||||
### Web Interface
|
||||
- **URL:** http://localhost:5001
|
||||
- **Port:** 5001
|
||||
- **Stack:** Flask + Gunicorn
|
||||
- **Pages:**
|
||||
- `/` - Dashboard with collection stats
|
||||
- `/accounts` - Account toxicity summary
|
||||
- `/statuses` - Posts and replies browser
|
||||
- `/mentions` - Mentions browser
|
||||
- `/analysis` - Toxicity analysis overview
|
||||
- `/analysis/flagged` - Flagged content with human review
|
||||
- `/export` - Data export options
|
||||
|
||||
### Collector Service
|
||||
- **Schedule:** Every 4 hours (00:00, 04:00, 08:00, 12:00, 16:00, 20:00)
|
||||
- **Function:** Collects new posts and mentions from Bluesky API
|
||||
- **Logs:**
|
||||
```bash
|
||||
docker compose logs -f collector
|
||||
```
|
||||
|
||||
### Scheduler Service
|
||||
- **Image:** `mcuadros/ofelia`
|
||||
- **Function:** Triggers collector and analyzer jobs on schedule
|
||||
- **Jobs:**
|
||||
- `collect` - Runs at 0 minutes past every 4th hour
|
||||
- `analyze` - Runs at 30 minutes past every 4th hour
|
||||
- **Logs:**
|
||||
```bash
|
||||
docker compose logs -f scheduler
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Manual Operations
|
||||
|
||||
### Run Manual Collection
|
||||
```bash
|
||||
docker compose exec collector python -m src
|
||||
```
|
||||
|
||||
Collects posts and mentions immediately (outside of schedule).
|
||||
|
||||
### Run Manual Analysis
|
||||
```bash
|
||||
docker compose exec collector python -m src.analyzer
|
||||
```
|
||||
|
||||
Analyzes all unscored posts/mentions using OpenAI API.
|
||||
|
||||
**Cost Warning:** Analysis incurs OpenAI API costs. Check batch size settings.
|
||||
|
||||
### Analyze Specific Batch Size
|
||||
```bash
|
||||
docker compose exec collector python -m src.analyzer --batch-size 50 --limit 100
|
||||
```
|
||||
|
||||
Options:
|
||||
- `--batch-size N` - Number of posts per API call (default: 10)
|
||||
- `--limit N` - Maximum posts to analyze (default: 0 = unlimited)
|
||||
- `--concurrency N` - Parallel API requests (default: 3)
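
To get a feel for how these options interact before starting a paid run, the rough planning sketch below may help. It assumes, per the option descriptions above, that `--limit` caps the total number of posts and `--batch-size` is the number of posts per API call; how the analyzer actually schedules requests is not shown in this guide, and `plan_analysis` is a hypothetical helper.

```python
import math


def plan_analysis(unscored: int, batch_size: int = 10, limit: int = 0, concurrency: int = 3) -> dict:
    """Estimate API calls and sequential 'waves' of parallel calls for an analysis run."""
    if batch_size < 1 or concurrency < 1:
        raise ValueError("batch_size and concurrency must be at least 1")
    # limit == 0 means unlimited, matching the documented default.
    posts = unscored if limit == 0 else min(unscored, limit)
    api_calls = math.ceil(posts / batch_size)
    waves = math.ceil(api_calls / concurrency)
    return {"posts": posts, "api_calls": api_calls, "waves": waves}


if __name__ == "__main__":
    print(plan_analysis(15190))                              # documented defaults
    print(plan_analysis(15190, batch_size=50, limit=100))    # the example command above
```

Under these assumptions, scoring 15,190 posts with the defaults takes 1,519 API calls, while the example command with `--batch-size 50 --limit 100` takes 2.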
|
||||
|
||||
### View Recent Logs
|
||||
```bash
|
||||
# All services
|
||||
docker compose logs --tail 100
|
||||
|
||||
# Specific service
|
||||
docker compose logs --tail 50 collector
|
||||
docker compose logs --tail 50 web
|
||||
|
||||
# Follow logs in real-time
|
||||
docker compose logs -f collector
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Database Operations
|
||||
|
||||
### Access Database Shell
|
||||
```bash
|
||||
docker compose exec db psql -U bluesky -d bluesky
|
||||
```
|
||||
|
||||
### Common Queries
|
||||
|
||||
#### Check Collection Status
|
||||
```sql
SELECT
    started_at::date AS date,
    COUNT(*) AS runs,
    SUM(posts_collected) AS total_posts,
    SUM(mentions_collected) AS total_mentions,
    SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS successful
FROM collection_runs
WHERE started_at >= '2026-01-01'
GROUP BY started_at::date
ORDER BY date DESC;
```
|
||||
|
||||
#### Count Flagged Content
|
||||
```sql
|
||||
-- Posts/Replies
|
||||
SELECT COUNT(*) FROM toxicity_scores WHERE overall >= 0.5;
|
||||
|
||||
-- Mentions (unique posts)
|
||||
SELECT COUNT(DISTINCT m.post_uri)
|
||||
FROM mention_toxicity_scores mts
|
||||
JOIN mentions m ON m.id = mts.mention_id
|
||||
WHERE mts.overall >= 0.5;
|
||||
```
|
||||
|
||||
#### Human Review Progress
|
||||
```sql
|
||||
SELECT
|
||||
CASE
|
||||
WHEN review_status IS NULL THEN 'Unreviewed'
|
||||
ELSE review_status
|
||||
END as status,
|
||||
COUNT(*) as count
|
||||
FROM toxicity_scores
|
||||
WHERE overall >= 0.5
|
||||
GROUP BY review_status;
|
||||
```
|
||||
|
||||
### Backup Database
|
||||
```bash
|
||||
docker compose exec -T db pg_dump -U bluesky bluesky > backup_$(date +%Y%m%d).sql
|
||||
```
|
||||
|
||||
### Restore Database
|
||||
```bash
|
||||
cat backup_20260330.sql | docker compose exec -T db psql -U bluesky -d bluesky
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Rebuilding Services
|
||||
|
||||
### Rebuild After Code Changes
|
||||
```bash
|
||||
# Rebuild specific service
|
||||
docker compose build web
|
||||
docker compose build collector
|
||||
|
||||
# Rebuild and restart
|
||||
docker compose up -d --build web
|
||||
|
||||
# Rebuild everything
|
||||
docker compose build
|
||||
docker compose up -d
|
||||
```
|
||||
|
||||
### Apply Database Migrations
|
||||
```bash
|
||||
# View available migrations
|
||||
ls scripts/*.sql
|
||||
|
||||
# Apply specific migration
|
||||
docker compose exec -T db psql -U bluesky -d bluesky < scripts/04-human-review.sql
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Configuration
|
||||
|
||||
### Environment Variables (.env file)
|
||||
```bash
|
||||
# Database
|
||||
POSTGRES_USER=bluesky
|
||||
POSTGRES_PASSWORD=changeme
|
||||
POSTGRES_PORT=5433
|
||||
|
||||
# Web Interface
|
||||
WEB_PORT=5001
|
||||
|
||||
# Bluesky API (for authenticated search)
|
||||
BSKY_HANDLE=your-handle.bsky.social
|
||||
BSKY_APP_PASSWORD=your-app-password
|
||||
|
||||
# OpenAI API (for toxicity analysis)
|
||||
OPENAI_API_KEY=sk-...
|
||||
|
||||
# Analysis Settings
|
||||
ANALYZER_MODEL=gpt-4.1-nano
|
||||
ANALYZER_CONCURRENCY=3
|
||||
ANALYZER_BATCH_SIZE=10
|
||||
ANALYZER_LIMIT=0
|
||||
|
||||
# Collection Settings
|
||||
MAX_PAGES_PER_ACCOUNT=50
|
||||
MENTION_LOOKBACK_HOURS=12
|
||||
LOG_LEVEL=INFO
|
||||
```
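
As an illustration of how the analysis settings above might be consumed, here is a minimal sketch that reads them from the environment with the defaults shown in the example `.env`. The collector's real configuration loader is not shown in this guide; `AnalyzerSettings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalyzerSettings:
    model: str
    concurrency: int
    batch_size: int
    limit: int  # 0 means unlimited


def load_settings(env=None) -> AnalyzerSettings:
    """Build settings from environment variables, falling back to the .env example values."""
    env = os.environ if env is None else env
    return AnalyzerSettings(
        model=env.get("ANALYZER_MODEL", "gpt-4.1-nano"),
        concurrency=int(env.get("ANALYZER_CONCURRENCY", "3")),
        batch_size=int(env.get("ANALYZER_BATCH_SIZE", "10")),
        limit=int(env.get("ANALYZER_LIMIT", "0")),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise without touching the real environment.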
|
||||
|
||||
### Tracked Accounts (config/accounts.yml)
|
||||
```yaml
|
||||
accounts:
|
||||
- handle: example.bsky.social # Account to monitor
|
||||
- handle: another.bsky.social
|
||||
```
|
||||
|
||||
Add or remove accounts, then restart collector:
|
||||
```bash
|
||||
docker compose restart collector
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Web Interface Not Loading
|
||||
```bash
|
||||
# Check if web service is running
|
||||
docker compose ps web
|
||||
|
||||
# Check web logs for errors
|
||||
docker compose logs --tail 50 web
|
||||
|
||||
# Restart web service
|
||||
docker compose restart web
|
||||
```
|
||||
|
||||
### Collector Not Running
|
||||
```bash
|
||||
# Check scheduler is running
|
||||
docker compose ps scheduler
|
||||
|
||||
# Check collector status
|
||||
docker compose ps collector
|
||||
|
||||
# Start scheduler if stopped
|
||||
docker compose start scheduler
|
||||
|
||||
# Check scheduler logs
|
||||
docker compose logs scheduler
|
||||
```
|
||||
|
||||
### Database Connection Issues
|
||||
```bash
|
||||
# Check database health
|
||||
docker compose ps db
|
||||
|
||||
# Restart database
|
||||
docker compose restart db
|
||||
|
||||
# Check database logs
|
||||
docker compose logs db
|
||||
```
|
||||
|
||||
### Out of Disk Space
|
||||
```bash
|
||||
# Check Docker disk usage
|
||||
docker system df
|
||||
|
||||
# Remove unused images/containers
|
||||
docker system prune
|
||||
|
||||
# Check database size
|
||||
docker compose exec db psql -U bluesky -d bluesky -c "SELECT pg_size_pretty(pg_database_size('bluesky'));"
|
||||
```
|
||||
|
||||
### Analysis Failing (OpenAI API)
|
||||
```bash
|
||||
# Check API key is set
|
||||
docker compose exec collector printenv | grep OPENAI_API_KEY
|
||||
|
||||
# Test API connectivity
|
||||
# The OpenAI client reads OPENAI_API_KEY from the container's environment
docker compose exec collector python -c "from openai import OpenAI; OpenAI().models.list(); print('ok')"
|
||||
|
||||
# Check rate limits in logs
|
||||
docker compose logs collector | grep -i "rate limit"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Performance Tuning
|
||||
|
||||
### Increase Collection Depth
|
||||
Edit `docker-compose.yml`:
|
||||
```yaml
|
||||
environment:
|
||||
MAX_PAGES_PER_ACCOUNT: 100 # Increase from 50
|
||||
MENTION_LOOKBACK_HOURS: 24 # Increase lookback
|
||||
```
|
||||
|
||||
### Increase Analysis Speed
|
||||
```yaml
|
||||
environment:
|
||||
ANALYZER_CONCURRENCY: 5 # More parallel requests
|
||||
ANALYZER_BATCH_SIZE: 20 # Bigger batches
|
||||
```
|
||||
|
||||
**Cost Warning:** Higher concurrency and larger batches spend the OpenAI API budget faster and can trigger rate-limit errors.
|
||||
|
||||
### Change Collection Schedule
|
||||
Edit `docker-compose.yml` under collector labels:
|
||||
```yaml
|
||||
labels:
|
||||
ofelia.job-exec.collect.schedule: "0 0 */2 * * *" # Every 2 hours
|
||||
ofelia.job-exec.analyze.schedule: "0 30 */2 * * *" # 30 min after collection
|
||||
```
|
||||
|
||||
Restart scheduler after changes:
|
||||
```bash
|
||||
docker compose restart scheduler
|
||||
```
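
To preview what a schedule expression means before restarting the scheduler, the sketch below expands the minute and hour fields into daily run times. It assumes the six fields are second, minute, hour, day of month, month, and day of week, which is consistent with the "Every 2 hours" comment above but is not spelled out in this guide. `daily_run_times` is a hypothetical helper that handles only `*`, `*/N`, and plain numbers, and it ignores the seconds and date fields.

```python
def _expand(field: str, upper: int) -> list:
    """Expand a single cron field into the list of matching values below `upper`."""
    if field == "*":
        return list(range(upper))
    if field.startswith("*/"):
        return list(range(0, upper, int(field[2:])))
    return [int(field)]


def daily_run_times(expression: str) -> list:
    """Return HH:MM run times per day for a six-field schedule expression."""
    fields = expression.split()
    if len(fields) != 6:
        raise ValueError("expected six fields: sec min hour day-of-month month day-of-week")
    _second, minute, hour = fields[0], fields[1], fields[2]
    return [f"{h:02d}:{m:02d}" for h in _expand(hour, 24) for m in _expand(minute, 60)]


if __name__ == "__main__":
    print(daily_run_times("0 0 */2 * * *"))   # collection every 2 hours
    print(daily_run_times("0 30 */2 * * *"))  # analysis 30 minutes later
```

Comparing the two outputs is a quick way to confirm that the analysis job still runs after each collection job when the interval changes.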
|
||||
|
||||
---
|
||||
|
||||
## Data Export
|
||||
|
||||
### Export to CSV via Web Interface
|
||||
1. Navigate to http://localhost:5001/export
|
||||
2. Select date range and filters
|
||||
3. Click "Export to CSV"
|
||||
|
||||
### Export via Command Line
|
||||
|
||||
#### All Posts
|
||||
```bash
|
||||
docker compose exec db psql -U bluesky -d bluesky -c "COPY (
|
||||
SELECT p.uri, p.author_did, a.handle, p.text, p.created_at, p.post_type,
|
||||
ts.overall, ts.toxic, ts.hate_speech, ts.threat
|
||||
FROM posts p
|
||||
LEFT JOIN accounts a ON a.did = p.author_did
|
||||
LEFT JOIN toxicity_scores ts ON ts.uri = p.uri
|
||||
WHERE p.created_at >= '2026-01-01'
|
||||
) TO STDOUT CSV HEADER" > posts_export.csv
|
||||
```
|
||||
|
||||
#### Flagged Content with Reviews
|
||||
```bash
|
||||
docker compose exec db psql -U bluesky -d bluesky -c "COPY (
|
||||
SELECT p.uri, a.handle, p.text, p.created_at,
|
||||
ts.overall, ts.human_reviewed, ts.review_status, ts.reviewed_at
|
||||
FROM toxicity_scores ts
|
||||
JOIN posts p ON p.uri = ts.uri
|
||||
LEFT JOIN accounts a ON a.did = p.author_did
|
||||
WHERE ts.overall >= 0.5 AND p.created_at >= '2026-01-01'
|
||||
ORDER BY ts.overall DESC
|
||||
) TO STDOUT CSV HEADER" > flagged_export.csv
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Restarting Data Collection (If Needed)
|
||||
|
||||
### Resume Collection After Pause
|
||||
1. Start services:
|
||||
```bash
|
||||
docker compose start scheduler collector
|
||||
```
|
||||
|
||||
2. Verify collection runs:
|
||||
```bash
|
||||
docker compose logs -f collector
|
||||
```
|
||||
|
||||
3. Check database for new entries:
|
||||
```bash
|
||||
docker compose exec db psql -U bluesky -d bluesky -c "
|
||||
SELECT MAX(created_at) FROM posts;
|
||||
SELECT COUNT(*) FROM collection_runs WHERE started_at > NOW() - INTERVAL '1 day';
|
||||
"
|
||||
```
|
||||
|
||||
### Start Fresh Collection (Keep Database)
|
||||
1. Stop services:
|
||||
```bash
|
||||
docker compose down
|
||||
```
|
||||
|
||||
2. Start only database and web:
|
||||
```bash
|
||||
docker compose up -d db web
|
||||
```
|
||||
|
||||
3. Truncate collection tracking (optional):
|
||||
```bash
|
||||
docker compose exec db psql -U bluesky -d bluesky -c "TRUNCATE collection_runs;"
|
||||
```
|
||||
|
||||
4. Start collector:
|
||||
```bash
|
||||
docker compose up -d scheduler collector
|
||||
```
|
||||
|
||||
### Complete Reset (Delete All Data)
|
||||
```bash
|
||||
# Stop everything
|
||||
docker compose down
|
||||
|
||||
# Remove data volume
|
||||
docker volume rm bluesky-collector_pgdata
|
||||
|
||||
# Restart from scratch
|
||||
docker compose up -d
|
||||
```
|
||||
|
||||
**⚠️ WARNING:** This deletes all collected posts, mentions, and analysis results permanently!
|
||||
|
||||
---
|
||||
|
||||
## Monitoring
|
||||
|
||||
### Collection Health Check
|
||||
```bash
|
||||
# Last 5 collection runs
|
||||
docker compose exec db psql -U bluesky -d bluesky -c "
|
||||
SELECT started_at, finished_at, status, posts_collected, mentions_collected, errors
|
||||
FROM collection_runs
|
||||
ORDER BY started_at DESC
|
||||
LIMIT 5;
|
||||
"
|
||||
```
|
||||
|
||||
### Analysis Progress
|
||||
```bash
|
||||
# Count scored vs unscored
|
||||
docker compose exec db psql -U bluesky -d bluesky -c "
|
||||
SELECT
|
||||
(SELECT COUNT(*) FROM posts) as total_posts,
|
||||
(SELECT COUNT(*) FROM toxicity_scores) as scored_posts,
|
||||
(SELECT COUNT(*) FROM mentions) as total_mentions,
|
||||
(SELECT COUNT(*) FROM mention_toxicity_scores) as scored_mentions;
|
||||
"
|
||||
```
|
||||
|
||||
### Disk Usage
|
||||
```bash
|
||||
# Database size
|
||||
docker compose exec db psql -U bluesky -d bluesky -c "
|
||||
SELECT
|
||||
pg_size_pretty(pg_database_size('bluesky')) as db_size,
|
||||
pg_size_pretty(pg_total_relation_size('posts')) as posts_table,
|
||||
pg_size_pretty(pg_total_relation_size('mentions')) as mentions_table;
|
||||
"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Security Notes
|
||||
|
||||
1. **Never commit .env file** - Contains API keys and passwords
|
||||
2. **Change default passwords** - PostgreSQL default password is `changeme`
|
||||
3. **Firewall rules** - Ports 5001 (web) and 5433 (database) exposed to localhost only
|
||||
4. **API keys** - Bluesky and OpenAI credentials stored in environment variables
|
||||
5. **Data retention** - Contains personal data (Bluesky posts); handle per GDPR requirements
|
||||
|
||||
---
|
||||
|
||||
## Support
|
||||
|
||||
### Documentation
|
||||
- Main findings: `FINDINGS.md`
|
||||
- This operations guide: `OPERATIONS.md`
|
||||
- Git repository: https://forgejo.postxsociety.cloud/pieter/bluesky-collector
|
||||
|
||||
### Logs Location
|
||||
- Docker logs: `docker compose logs [service]`
|
||||
- Application logs: `./logs/` directory (if volume mounted)
|
||||
|
||||
### Common Issues
|
||||
1. **Port conflicts:** Change `WEB_PORT` or `POSTGRES_PORT` in .env
|
||||
2. **Out of memory:** Reduce `ANALYZER_CONCURRENCY` or `ANALYZER_BATCH_SIZE`
|
||||
3. **API rate limits:** Reduce collection frequency or batch size
|
||||
4. **Disk full:** Run `docker system prune` and consider data export/cleanup
|
||||
|
||||
---
|
||||
|
||||
**Last Updated:** March 30, 2026
|
||||
**Project Status:** Data collection complete, web interface available for analysis
|
||||
24
README.md
|
|
@ -279,3 +279,27 @@ The `pgdata` volume persists across container restarts. Back it up with standard
|
|||
```bash
|
||||
docker compose exec -T db pg_dump -U bluesky bluesky > backup.sql
|
||||
```
|
||||
|
||||
## License
|
||||
|
||||
MIT License
|
||||
|
||||
Copyright (c) 2026 Post X Society
|
||||
|
||||
Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
of this software and associated documentation files (the "Software"), to deal
|
||||
in the Software without restriction, including without limitation the rights
|
||||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
copies of the Software, and to permit persons to whom the Software is
|
||||
furnished to do so, subject to the following conditions:
|
||||
|
||||
The above copyright notice and this permission notice shall be included in all
|
||||
copies or substantial portions of the Software.
|
||||
|
||||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
SOFTWARE.
|
||||
|
|
|
|||