zhimin-z committed · Commit 4b78e58 · Parent(s): 64746d3
merge wanted
README.md
CHANGED
@@ -31,34 +31,51 @@ Key metrics from the last 180 days:
@@ -71,33 +88,49 @@ Submissions are validated and data loads within seconds.
|
| 31 |
- **Total Issues**: Issues the assistant has been involved with (authored, assigned, or commented on)
|
| 32 |
- **Closed Issues**: Issues that were closed
|
| 33 |
- **Resolved Issues**: Closed issues marked as completed
|
| 34 |
+
- **Resolved Rate**: Percentage of closed issues successfully resolved
|
| 35 |
+
- **Resolved Wanted Issues**: Long-standing issues (30+ days old) from major open-source projects that the assistant resolved via merged pull requests
|
| 36 |
|
| 37 |
**Monthly Trends**
|
| 38 |
+
- Resolved rate trends (line plots)
|
| 39 |
- Issue volume over time (bar charts)
|
| 40 |
|
| 41 |
+
**Issues Wanted**
|
| 42 |
+
- Long-standing open issues (30+ days) with fix-needed labels (e.g. `bug`, `enhancement`) from tracked organizations (Apache, GitHub, Hugging Face)
|
| 43 |
+
|
| 44 |
We focus on 180 days to highlight current capabilities and active assistants.
|
| 45 |
|
| 46 |
## How It Works
|
| 47 |
|
| 48 |
**Data Collection**
|
| 49 |
+
We mine GitHub activity from [GHArchive](https://www.gharchive.org/), tracking two types of issues:
|
| 50 |
+
|
| 51 |
+
1. **Agent-Assigned Issues**:
|
| 52 |
+
- Issues opened by or assigned to the assistant (`IssuesEvent`)
|
| 53 |
+
- Issue comments by the assistant (`IssueCommentEvent`)
|
| 54 |
+
|
| 55 |
+
2. **Wanted Issues** (from tracked organizations: Apache, GitHub, Hugging Face):
|
| 56 |
+
- Long-standing open issues (30+ days) with fix-needed labels (`bug`, `enhancement`)
|
| 57 |
+
- Pull requests created by assistants that reference these issues
|
| 58 |
+
- An issue counts as resolved only when the assistant's PR is merged and the issue is subsequently closed (see the sketch below)
|
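As a rough illustration of the linking step, the sketch below shows how a pull request body can be mapped back to the issues it references. This is a minimal example rather than the exact pipeline: the helper name is made up, but the regular expression mirrors the one the `msr.py` miner uses.

```python
import re

# Matches either a full GitHub issue URL or a short "#123" reference.
ISSUE_REF = re.compile(r'(?:https?://github\.com/[\w-]+/[\w-]+/issues/\d+)|(?:#\d+)')

def linked_issue_urls(pr_url: str, pr_body: str) -> set:
    """Hypothetical helper: collect issue URLs referenced by a pull request body."""
    refs = set()
    for ref in ISSUE_REF.findall(pr_body or ""):
        if ref.startswith('#'):
            # Short references are resolved against the PR's own org/repo.
            parts = pr_url.split('/')
            org, repo = parts[-4], parts[-3]
            refs.add(f"https://github.com/{org}/{repo}/issues/{ref[1:]}")
        else:
            refs.add(ref)
    return refs
```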
| 59 |
|
| 60 |
**Regular Updates**
|
| 61 |
Leaderboard refreshes weekly (Friday at 00:00 UTC).
|
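For context, a weekly Friday-midnight refresh like this can be wired up with APScheduler, which `app.py` already imports. The sketch below is only an assumption about how the job might be registered; `refresh_leaderboard` is a placeholder name, not the actual refresh function.

```python
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger

def refresh_leaderboard():
    # Placeholder: reload leaderboard data from the HuggingFace dataset here.
    print("Refreshing leaderboard data...")

scheduler = BackgroundScheduler(timezone="UTC")
# Fire every Friday at 00:00 UTC, matching the stated refresh schedule.
scheduler.add_job(refresh_leaderboard, CronTrigger(day_of_week="fri", hour=0, minute=0))
scheduler.start()
```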
| 62 |
|
| 63 |
**Community Submissions**
|
| 64 |
+
Anyone can submit an assistant. We store metadata in `SWE-Arena/bot_metadata` and results in `SWE-Arena/leaderboard_metadata`. All submissions are validated via GitHub API.
|
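To make the submission flow concrete, a record in `SWE-Arena/bot_metadata` looks roughly like the sketch below. The filename doubles as the agent's GitHub identifier; the exact field set is an assumption inferred from what the leaderboard code reads (`name`, `organization`, `website`, `status`), not an authoritative schema.

```python
import json

# Hypothetical submission record; only agents with status "active" are counted.
submission = {
    "name": "Example Assistant",
    "organization": "Example Org",
    "website": "https://example.com",
    "status": "active",
}

# The file is named after the agent's GitHub identifier, e.g. example-bot[bot].json.
with open("example-bot[bot].json", "w", encoding="utf-8") as f:
    json.dump(submission, f, indent=2)
```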
| 65 |
|
| 66 |
## Using the Leaderboard
|
| 67 |
|
| 68 |
### Browsing
|
| 69 |
+
**Leaderboard Tab**:
|
| 70 |
- Searchable table (by assistant name or website)
|
| 71 |
+
- Filterable columns (by resolved rate)
|
| 72 |
- Monthly charts (resolution trends and activity)
|
| 73 |
+
- View both agent-assigned metrics and wanted issue resolutions
|
| 74 |
+
|
| 75 |
+
**Issues Wanted Tab**:
|
| 76 |
+
- Browse long-standing open issues (30+ days) from major open-source projects
|
| 77 |
+
- Filter by tracked organizations (Apache, GitHub, Hugging Face)
|
| 78 |
+
- See which issues need attention from the community
|
| 79 |
|
| 80 |
### Adding Your Assistant
|
| 81 |
The Submit Assistant tab requires:
|
|
|
|
| 88 |
|
| 89 |
## Understanding the Metrics
|
| 90 |
|
| 91 |
+
**Resolved Rate**
|
| 92 |
Percentage of closed issues successfully completed:
|
| 93 |
|
| 94 |
```
|
| 95 |
+
Resolved Rate = resolved issues ÷ closed issues × 100
|
| 96 |
```
|
| 97 |
|
| 98 |
An issue is "resolved" when `state_reason` is `completed` on GitHub. This means the problem was solved, not just closed without resolution.
|
| 99 |
|
| 100 |
Context matters: 100 closed issues at 70% resolution (70 resolved) differs from 10 closed issues at 90% (9 resolved). Consider both rate and volume.
|
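A minimal sketch of the formula above, assuming each issue is a dict carrying GitHub's `state` and `state_reason` fields:

```python
def resolved_rate(issues):
    """Percentage of closed issues whose state_reason is 'completed'."""
    closed = [i for i in issues if i.get("state") == "closed"]
    resolved = [i for i in closed if i.get("state_reason") == "completed"]
    return round(len(resolved) / len(closed) * 100, 2) if closed else 0.0

# Example: 7 of 10 closed issues marked completed -> 70.0
print(resolved_rate([{"state": "closed", "state_reason": "completed"}] * 7
                    + [{"state": "closed", "state_reason": "not_planned"}] * 3))
```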
| 101 |
|
| 102 |
+
**Resolved Wanted Issues**
|
| 103 |
+
Long-standing issues (30+ days old) from major open-source projects that the assistant resolved. An issue qualifies when:
|
| 104 |
+
1. It's from a tracked organization (Apache, GitHub, Hugging Face)
|
| 105 |
+
2. It has a fix-needed label (`bug`, `enhancement`)
|
| 106 |
+
3. The assistant created a pull request referencing the issue
|
| 107 |
+
4. The pull request was merged
|
| 108 |
+
5. The issue was subsequently closed
|
| 109 |
+
|
| 110 |
+
This metric highlights assistants' ability to tackle challenging, community-identified problems in high-impact projects.
|
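A hedged sketch of this qualification check, assuming the wanted-issue metadata shape used in `msr.py` (`repo`, `labels`, `state`) and the configured organization and label sets; the helper itself is illustrative, not the production code.

```python
TRACKED_ORGS = {"apache", "github", "huggingface"}
PATCH_WANTED_LABELS = {"bug", "enhancement"}

def qualifies_as_resolved_wanted(issue, merged_pr_creator, agent_ids, days_open, min_days=30):
    """Illustrative check for the five conditions listed above."""
    org = issue.get("repo", "").split("/")[0]                                    # 1. tracked organization
    has_label = any(l in PATCH_WANTED_LABELS for l in issue.get("labels", []))   # 2. fix-needed label
    return (
        org in TRACKED_ORGS
        and has_label
        and merged_pr_creator in agent_ids        # 3 + 4. the agent's PR referenced the issue and was merged
        and issue.get("state") == "closed"        # 5. the issue was subsequently closed
        and days_open >= min_days                 # long-standing (30+ days)
    )
```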
| 111 |
+
|
| 112 |
+
**Long-Standing Issues**
|
| 113 |
+
Issues that have been open for 30+ days represent real challenges the community has struggled to address. They tend to be harder than typical issues, so resolving them is a stronger signal of an assistant's problem-solving capability.
|
| 114 |
+
|
| 115 |
**Monthly Trends**
|
| 116 |
+
- **Line plots**: Resolved rate changes over time
|
| 117 |
- **Bar charts**: Issue volume per month
|
| 118 |
|
| 119 |
Patterns to watch:
|
| 120 |
- Consistent high rates = effective problem-solving
|
| 121 |
- Increasing trends = improving assistants
|
| 122 |
- High volume + good rates = productivity + effectiveness
|
| 123 |
+
- High wanted issue resolution = ability to tackle challenging community problems
|
| 124 |
|
| 125 |
## What's Next
|
| 126 |
|
| 127 |
Planned improvements:
|
| 128 |
- Repository-based analysis
|
| 129 |
+
- Extended metrics (comment activity, response time, code complexity)
|
| 130 |
+
- Resolution time tracking from issue creation to PR merge
|
| 131 |
+
- Issue category patterns and difficulty assessment
|
| 132 |
+
- Expanded organization and label tracking for wanted issues
|
| 133 |
+
- Integration with additional high-impact open-source organizations
|
| 134 |
|
| 135 |
## Questions or Issues?
|
| 136 |
|
app.py
CHANGED
@@ -3,6 +3,7 @@ from gradio_leaderboard import Leaderboard, ColumnFilter
@@ -14,6 +15,7 @@ import plotly.graph_objects as go
@@ -23,8 +25,11 @@ load_dotenv()
@@ -95,52 +101,113 @@ def validate_github_username(identifier):
@@ -483,6 +550,7 @@ def get_leaderboard_dataframe():
@@ -493,7 +561,7 @@ def get_leaderboard_dataframe():
@@ -508,6 +576,54 @@ def get_leaderboard_dataframe():
@@ -657,6 +773,25 @@ with gr.Blocks(title="SWE Agent Issue Leaderboard", theme=gr.themes.Soft()) as app:
|
| 3 |
import json
|
| 4 |
import os
|
| 5 |
import time
|
| 6 |
+
import subprocess
|
| 7 |
import requests
|
| 8 |
from huggingface_hub import HfApi, hf_hub_download
|
| 9 |
from huggingface_hub.errors import HfHubHTTPError
|
|
|
|
| 15 |
from plotly.subplots import make_subplots
|
| 16 |
from apscheduler.schedulers.background import BackgroundScheduler
|
| 17 |
from apscheduler.triggers.cron import CronTrigger
|
| 18 |
+
from datetime import datetime, timezone
|
| 19 |
|
| 20 |
# Load environment variables
|
| 21 |
load_dotenv()
|
|
|
|
| 25 |
# =============================================================================
|
| 26 |
|
| 27 |
AGENTS_REPO = "SWE-Arena/bot_metadata" # HuggingFace dataset for agent metadata
|
| 28 |
+
AGENTS_REPO_LOCAL_PATH = os.path.expanduser("~/bot_metadata") # Local git clone path
|
| 29 |
LEADERBOARD_FILENAME = f"{os.getenv('COMPOSE_PROJECT_NAME')}.json"
|
| 30 |
LEADERBOARD_REPO = "SWE-Arena/leaderboard_metadata" # HuggingFace dataset for leaderboard data
|
| 31 |
+
LONGSTANDING_GAP_DAYS = 30 # Minimum days for an issue to be considered long-standing
|
| 32 |
+
GIT_SYNC_TIMEOUT = 300 # 5 minutes timeout for git pull
|
| 33 |
MAX_RETRIES = 5
|
| 34 |
|
| 35 |
LEADERBOARD_COLUMNS = [
|
|
|
|
| 38 |
("Total Issues", "number"),
|
| 39 |
("Resolved Issues", "number"),
|
| 40 |
("Resolved Rate (%)", "number"),
|
| 41 |
+
("Resolved Wanted Issues", "number"),
|
| 42 |
]
|
| 43 |
|
| 44 |
# =============================================================================
|
|
|
|
| 101 |
# HUGGINGFACE DATASET OPERATIONS
|
| 102 |
# =============================================================================
|
| 103 |
|
| 104 |
+
def sync_agents_repo():
|
| 105 |
+
"""
|
| 106 |
+
Sync local bot_metadata repository with remote using git pull.
|
| 107 |
+
This is MANDATORY to ensure we have the latest bot data.
|
| 108 |
+
Raises exception if sync fails.
|
| 109 |
+
"""
|
| 110 |
+
if not os.path.exists(AGENTS_REPO_LOCAL_PATH):
|
| 111 |
+
error_msg = f"Local repository not found at {AGENTS_REPO_LOCAL_PATH}"
|
| 112 |
+
print(f" Error {error_msg}")
|
| 113 |
+
print(f" Please clone it first: git clone https://huggingface.co/datasets/{AGENTS_REPO}")
|
| 114 |
+
raise FileNotFoundError(error_msg)
|
| 115 |
+
|
| 116 |
+
if not os.path.exists(os.path.join(AGENTS_REPO_LOCAL_PATH, '.git')):
|
| 117 |
+
error_msg = f"{AGENTS_REPO_LOCAL_PATH} exists but is not a git repository"
|
| 118 |
+
print(f" Error {error_msg}")
|
| 119 |
+
raise ValueError(error_msg)
|
| 120 |
+
|
| 121 |
try:
|
| 122 |
+
# Run git pull with extended timeout due to large repository
|
| 123 |
+
result = subprocess.run(
|
| 124 |
+
['git', 'pull'],
|
| 125 |
+
cwd=AGENTS_REPO_LOCAL_PATH,
|
| 126 |
+
capture_output=True,
|
| 127 |
+
text=True,
|
| 128 |
+
timeout=GIT_SYNC_TIMEOUT
|
| 129 |
+
)
|
| 130 |
|
| 131 |
+
if result.returncode == 0:
|
| 132 |
+
output = result.stdout.strip()
|
| 133 |
+
if "Already up to date" in output or "Already up-to-date" in output:
|
| 134 |
+
print(f" Success Repository is up to date")
|
| 135 |
+
else:
|
| 136 |
+
print(f" Success Repository synced successfully")
|
| 137 |
+
if output:
|
| 138 |
+
# Print first few lines of output
|
| 139 |
+
lines = output.split('\n')[:5]
|
| 140 |
+
for line in lines:
|
| 141 |
+
print(f" {line}")
|
| 142 |
+
return True
|
| 143 |
+
else:
|
| 144 |
+
error_msg = f"Git pull failed: {result.stderr.strip()}"
|
| 145 |
+
print(f" Error {error_msg}")
|
| 146 |
+
raise RuntimeError(error_msg)
|
| 147 |
+
|
| 148 |
+
except subprocess.TimeoutExpired:
|
| 149 |
+
error_msg = f"Git pull timed out after {GIT_SYNC_TIMEOUT} seconds"
|
| 150 |
+
print(f" Error {error_msg}")
|
| 151 |
+
raise TimeoutError(error_msg)
|
| 152 |
+
except (FileNotFoundError, ValueError, RuntimeError, TimeoutError):
|
| 153 |
+
raise # Re-raise expected exceptions
|
| 154 |
+
except Exception as e:
|
| 155 |
+
error_msg = f"Error syncing repository: {str(e)}"
|
| 156 |
+
print(f" Error {error_msg}")
|
| 157 |
+
raise RuntimeError(error_msg) from e
|
| 158 |
|
|
|
|
|
|
|
| 159 |
|
| 160 |
+
def load_agents_from_hf():
|
| 161 |
+
"""
|
| 162 |
+
Load all agent metadata JSON files from local git repository.
|
| 163 |
+
ALWAYS syncs with remote first to ensure we have the latest bot data.
|
| 164 |
+
"""
|
| 165 |
+
# MANDATORY: Sync with remote first to get latest bot data
|
| 166 |
+
print(f" Syncing bot_metadata repository to get latest agents...")
|
| 167 |
+
sync_agents_repo() # Will raise exception if sync fails
|
| 168 |
|
| 169 |
+
agents = []
|
|
|
|
| 170 |
|
| 171 |
+
# Scan local directory for JSON files
|
| 172 |
+
if not os.path.exists(AGENTS_REPO_LOCAL_PATH):
|
| 173 |
+
raise FileNotFoundError(f"Local repository not found at {AGENTS_REPO_LOCAL_PATH}")
|
| 174 |
|
| 175 |
+
# Walk through the directory to find all JSON files
|
| 176 |
+
files_processed = 0
|
| 177 |
+
print(f" Loading agent metadata from {AGENTS_REPO_LOCAL_PATH}...")
|
| 178 |
|
| 179 |
+
for root, dirs, files in os.walk(AGENTS_REPO_LOCAL_PATH):
|
| 180 |
+
# Skip .git directory
|
| 181 |
+
if '.git' in root:
|
| 182 |
+
continue
|
| 183 |
|
| 184 |
+
for filename in files:
|
| 185 |
+
if not filename.endswith('.json'):
|
| 186 |
+
continue
|
| 187 |
+
|
| 188 |
+
files_processed += 1
|
| 189 |
+
file_path = os.path.join(root, filename)
|
| 190 |
+
|
| 191 |
+
try:
|
| 192 |
+
with open(file_path, 'r', encoding='utf-8') as f:
|
| 193 |
+
agent_data = json.load(f)
|
| 194 |
+
|
| 195 |
+
# Only include active agents
|
| 196 |
+
if agent_data.get('status') != 'active':
|
| 197 |
+
continue
|
| 198 |
+
|
| 199 |
+
# Extract github_identifier from filename
|
| 200 |
+
github_identifier = filename.replace('.json', '')
|
| 201 |
+
agent_data['github_identifier'] = github_identifier
|
| 202 |
+
|
| 203 |
+
agents.append(agent_data)
|
| 204 |
|
| 205 |
except Exception as e:
|
| 206 |
+
print(f" Warning Error loading {filename}: {str(e)}")
|
| 207 |
continue
|
| 208 |
|
| 209 |
+
print(f" Success Loaded {len(agents)} active agents (from {files_processed} total files)")
|
| 210 |
+
return agents
|
|
|
|
|
|
|
|
|
|
|
|
|
| 211 |
|
| 212 |
|
| 213 |
def get_hf_token():
|
|
|
|
| 550 |
total_issues,
|
| 551 |
data.get('resolved_issues', 0),
|
| 552 |
data.get('resolved_rate', 0.0),
|
| 553 |
+
data.get('resolved_wanted_issues', 0),
|
| 554 |
])
|
| 555 |
|
| 556 |
print(f"Filtered out {filtered_count} agents with 0 issues")
|
|
|
|
| 561 |
df = pd.DataFrame(rows, columns=column_names)
|
| 562 |
|
| 563 |
# Ensure numeric types
|
| 564 |
+
numeric_cols = ["Total Issues", "Resolved Issues", "Resolved Rate (%)", "Resolved Wanted Issues"]
|
| 565 |
for col in numeric_cols:
|
| 566 |
if col in df.columns:
|
| 567 |
df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)
|
|
|
|
| 576 |
return df
|
| 577 |
|
| 578 |
|
| 579 |
+
def get_wanted_issues_dataframe():
|
| 580 |
+
"""Load wanted issues and convert to pandas DataFrame."""
|
| 581 |
+
saved_data = load_leaderboard_data_from_hf()
|
| 582 |
+
|
| 583 |
+
if not saved_data or 'wanted_issues' not in saved_data:
|
| 584 |
+
print(f"No wanted issues data available")
|
| 585 |
+
return pd.DataFrame(columns=["Title", "URL", "Age (days)", "Labels"])
|
| 586 |
+
|
| 587 |
+
wanted_issues = saved_data['wanted_issues']
|
| 588 |
+
print(f"Loaded {len(wanted_issues)} wanted issues")
|
| 589 |
+
|
| 590 |
+
if not wanted_issues:
|
| 591 |
+
return pd.DataFrame(columns=["Title", "URL", "Age (days)", "Labels"])
|
| 592 |
+
|
| 593 |
+
rows = []
|
| 594 |
+
for issue in wanted_issues:
|
| 595 |
+
# Calculate age
|
| 596 |
+
created_at = issue.get('created_at')
|
| 597 |
+
age_days = 0
|
| 598 |
+
if created_at and created_at != 'N/A':
|
| 599 |
+
try:
|
| 600 |
+
created = datetime.fromisoformat(created_at.replace('Z', '+00:00'))
|
| 601 |
+
age_days = (datetime.now(timezone.utc) - created).days
|
| 602 |
+
except:
|
| 603 |
+
pass
|
| 604 |
+
|
| 605 |
+
# Create clickable link
|
| 606 |
+
url = issue.get('url', '')
|
| 607 |
+
repo = issue.get('repo', '')
|
| 608 |
+
issue_number = issue.get('number', '')
|
| 609 |
+
url_link = f'<a href="{url}" target="_blank">{repo}#{issue_number}</a>'
|
| 610 |
+
|
| 611 |
+
rows.append([
|
| 612 |
+
issue.get('title', ''),
|
| 613 |
+
url_link,
|
| 614 |
+
age_days,
|
| 615 |
+
', '.join(issue.get('labels', []))
|
| 616 |
+
])
|
| 617 |
+
|
| 618 |
+
df = pd.DataFrame(rows, columns=["Title", "URL", "Age (days)", "Labels"])
|
| 619 |
+
|
| 620 |
+
# Sort by age descending
|
| 621 |
+
if "Age (days)" in df.columns and not df.empty:
|
| 622 |
+
df = df.sort_values(by="Age (days)", ascending=False).reset_index(drop=True)
|
| 623 |
+
|
| 624 |
+
return df
|
| 625 |
+
|
| 626 |
+
|
| 627 |
def submit_agent(identifier, agent_name, organization, website):
|
| 628 |
"""
|
| 629 |
Submit a new agent to the leaderboard.
|
|
|
|
| 773 |
)
|
| 774 |
|
| 775 |
|
| 776 |
+
# Issues Wanted Tab
|
| 777 |
+
with gr.Tab("Issues Wanted"):
|
| 778 |
+
gr.Markdown("### Long-Standing Patch-Wanted Issues")
|
| 779 |
+
gr.Markdown(f"*Issues open for {LONGSTANDING_GAP_DAYS}+ days with patch-wanted labels from tracked organizations*")
|
| 780 |
+
|
| 781 |
+
wanted_table = gr.Dataframe(
|
| 782 |
+
value=pd.DataFrame(columns=["Title", "URL", "Age (days)", "Labels"]),
|
| 783 |
+
datatype=["str", "html", "number", "str"],
|
| 784 |
+
interactive=False,
|
| 785 |
+
wrap=True
|
| 786 |
+
)
|
| 787 |
+
|
| 788 |
+
app.load(
|
| 789 |
+
fn=get_wanted_issues_dataframe,
|
| 790 |
+
inputs=[],
|
| 791 |
+
outputs=[wanted_table]
|
| 792 |
+
)
|
| 793 |
+
|
| 794 |
+
|
| 795 |
# Submit Agent Tab
|
| 796 |
with gr.Tab("Submit Agent"):
|
| 797 |
|
msr.py
CHANGED
@@ -25,13 +25,27 @@ load_dotenv()
@@ -509,9 +523,310 @@ def fetch_all_issue_metadata_streaming(conn, identifiers, start_date, end_date):
@@ -571,7 +886,7 @@ def load_agents_from_hf():
@@ -705,12 +1020,21 @@ def calculate_monthly_metrics_by_agent(all_metadata_dict, agents):
@@ -720,18 +1044,22 @@ def construct_leaderboard_from_metadata(all_metadata_dict, agents):
@@ -739,13 +1067,20 @@ def save_leaderboard_data_to_hf(leaderboard_dict, monthly_metrics):
@@ -809,11 +1144,15 @@ def mine_all_agents():
|
|
| 25 |
# CONFIGURATION
|
| 26 |
# =============================================================================
|
| 27 |
|
| 28 |
+
AGENTS_REPO = "SWE-Arena/bot_metadata"
|
| 29 |
+
AGENTS_REPO_LOCAL_PATH = os.path.expanduser("~/bot_metadata") # Local git clone path
|
| 30 |
DUCKDB_CACHE_FILE = "cache.duckdb"
|
| 31 |
GHARCHIVE_DATA_LOCAL_PATH = os.path.expanduser("~/gharchive/data")
|
| 32 |
LEADERBOARD_FILENAME = f"{os.getenv('COMPOSE_PROJECT_NAME')}.json"
|
| 33 |
+
LEADERBOARD_REPO = "SWE-Arena/leaderboard_metadata"
|
| 34 |
LEADERBOARD_TIME_FRAME_DAYS = 180
|
| 35 |
+
LONGSTANDING_GAP_DAYS = 30 # Minimum days for an issue to be considered long-standing
|
| 36 |
+
|
| 37 |
+
# GitHub organizations and repositories to track for wanted issues
|
| 38 |
+
TRACKED_ORGS = [
|
| 39 |
+
"apache",
|
| 40 |
+
"github",
|
| 41 |
+
"huggingface",
|
| 42 |
+
]
|
| 43 |
+
|
| 44 |
+
# Labels that indicate "patch wanted" status
|
| 45 |
+
PATCH_WANTED_LABELS = [
|
| 46 |
+
"bug",
|
| 47 |
+
"enhancement",
|
| 48 |
+
]
|
| 49 |
|
| 50 |
# Git sync configuration (mandatory to get latest bot data)
|
| 51 |
GIT_SYNC_TIMEOUT = 300 # 5 minutes timeout for git pull
|
|
|
|
| 523 |
return dict(metadata_by_agent)
|
| 524 |
|
| 525 |
|
| 526 |
+
def fetch_unified_issue_metadata_streaming(conn, identifiers, start_date, end_date):
|
| 527 |
+
"""
|
| 528 |
+
UNIFIED: Fetch both agent-assigned issues AND wanted issues using streaming batch processing.
|
| 529 |
+
|
| 530 |
+
Tracks TWO types of issues:
|
| 531 |
+
1. Agent-assigned issues: Issues where agents are assigned to or commented on
|
| 532 |
+
2. Wanted issues: Long-standing issues from tracked orgs linked to merged PRs by agents
|
| 533 |
+
|
| 534 |
+
Args:
|
| 535 |
+
conn: DuckDB connection instance
|
| 536 |
+
identifiers: List of GitHub usernames/bot identifiers
|
| 537 |
+
start_date: Start datetime (timezone-aware)
|
| 538 |
+
end_date: End datetime (timezone-aware)
|
| 539 |
+
|
| 540 |
+
Returns:
|
| 541 |
+
Dictionary with three keys:
|
| 542 |
+
- 'agent_issues': {agent_id: [issue_metadata]} for agent-assigned issues
|
| 543 |
+
- 'wanted_open': [open_wanted_issues] for long-standing open issues
|
| 544 |
+
- 'wanted_resolved': {agent_id: [resolved_wanted]} for resolved wanted issues
|
| 545 |
+
"""
|
| 546 |
+
# First, get agent-assigned issues using existing function
|
| 547 |
+
print(f" [1/2] Fetching agent-assigned/commented issues...")
|
| 548 |
+
agent_issues = fetch_all_issue_metadata_streaming(conn, identifiers, start_date, end_date)
|
| 549 |
+
|
| 550 |
+
# Now fetch wanted issues
|
| 551 |
+
print(f"\n [2/2] Fetching wanted issues from tracked orgs...")
|
| 552 |
+
identifier_set = set(identifiers)
|
| 553 |
+
|
| 554 |
+
# Storage for wanted issues
|
| 555 |
+
all_issues = {} # issue_url -> issue_metadata
|
| 556 |
+
issue_to_prs = defaultdict(set) # issue_url -> set of PR URLs
|
| 557 |
+
pr_creators = {} # pr_url -> creator login
|
| 558 |
+
pr_merged_at = {} # pr_url -> merged_at timestamp
|
| 559 |
+
|
| 560 |
+
# Calculate total batches
|
| 561 |
+
total_days = (end_date - start_date).days
|
| 562 |
+
total_batches = (total_days // BATCH_SIZE_DAYS) + 1
|
| 563 |
+
|
| 564 |
+
# Process in batches
|
| 565 |
+
current_date = start_date
|
| 566 |
+
batch_num = 0
|
| 567 |
+
|
| 568 |
+
print(f" Streaming {total_batches} batches for wanted issues...")
|
| 569 |
+
|
| 570 |
+
while current_date <= end_date:
|
| 571 |
+
batch_num += 1
|
| 572 |
+
batch_end = min(current_date + timedelta(days=BATCH_SIZE_DAYS - 1), end_date)
|
| 573 |
+
|
| 574 |
+
# Get file patterns for THIS BATCH ONLY
|
| 575 |
+
file_patterns = generate_file_path_patterns(current_date, batch_end)
|
| 576 |
+
|
| 577 |
+
if not file_patterns:
|
| 578 |
+
print(f" Batch {batch_num}/{total_batches}: {current_date.date()} to {batch_end.date()} - NO DATA")
|
| 579 |
+
current_date = batch_end + timedelta(days=1)
|
| 580 |
+
continue
|
| 581 |
+
|
| 582 |
+
# Progress indicator
|
| 583 |
+
print(f" Batch {batch_num}/{total_batches}: {current_date.date()} to {batch_end.date()} ({len(file_patterns)} files)... ", end="", flush=True)
|
| 584 |
+
|
| 585 |
+
# Build file patterns SQL for THIS BATCH
|
| 586 |
+
file_patterns_sql = '[' + ', '.join([f"'{fp}'" for fp in file_patterns]) + ']'
|
| 587 |
+
|
| 588 |
+
try:
|
| 589 |
+
# Create temp view from file read (done ONCE per batch)
|
| 590 |
+
conn.execute(f"""
|
| 591 |
+
CREATE OR REPLACE TEMP VIEW batch_data AS
|
| 592 |
+
SELECT *
|
| 593 |
+
FROM read_json({file_patterns_sql}, union_by_name=true, filename=true, compression='gzip', format='newline_delimited', ignore_errors=true, maximum_object_size=2147483648)
|
| 594 |
+
""")
|
| 595 |
+
|
| 596 |
+
# Query 1: Fetch all issues (NOT PRs) from tracked orgs
|
| 597 |
+
issue_query = """
|
| 598 |
+
SELECT
|
| 599 |
+
json_extract_string(payload, '$.issue.html_url') as issue_url,
|
| 600 |
+
json_extract_string(repo, '$.name') as repo_name,
|
| 601 |
+
json_extract_string(payload, '$.issue.title') as title,
|
| 602 |
+
json_extract_string(payload, '$.issue.number') as issue_number,
|
| 603 |
+
MIN(json_extract_string(payload, '$.issue.created_at')) as created_at,
|
| 604 |
+
MAX(json_extract_string(payload, '$.issue.closed_at')) as closed_at,
|
| 605 |
+
json_extract(payload, '$.issue.labels') as labels
|
| 606 |
+
FROM batch_data
|
| 607 |
+
WHERE
|
| 608 |
+
type IN ('IssuesEvent', 'IssueCommentEvent')
|
| 609 |
+
AND json_extract_string(payload, '$.issue.pull_request') IS NULL
|
| 610 |
+
AND json_extract_string(payload, '$.issue.html_url') IS NOT NULL
|
| 611 |
+
GROUP BY issue_url, repo_name, title, issue_number, labels
|
| 612 |
+
"""
|
| 613 |
+
|
| 614 |
+
issue_results = conn.execute(issue_query).fetchall()
|
| 615 |
+
|
| 616 |
+
# Filter issues by tracked orgs and collect them
|
| 617 |
+
for row in issue_results:
|
| 618 |
+
issue_url = row[0]
|
| 619 |
+
repo_name = row[1]
|
| 620 |
+
title = row[2]
|
| 621 |
+
issue_number = row[3]
|
| 622 |
+
created_at = row[4]
|
| 623 |
+
closed_at = row[5]
|
| 624 |
+
labels_json = row[6]
|
| 625 |
+
|
| 626 |
+
if not issue_url or not repo_name:
|
| 627 |
+
continue
|
| 628 |
+
|
| 629 |
+
# Extract org from repo_name
|
| 630 |
+
parts = repo_name.split('/')
|
| 631 |
+
if len(parts) != 2:
|
| 632 |
+
continue
|
| 633 |
+
org = parts[0]
|
| 634 |
+
|
| 635 |
+
# Filter by tracked orgs
|
| 636 |
+
if org not in TRACKED_ORGS:
|
| 637 |
+
continue
|
| 638 |
+
|
| 639 |
+
# Parse labels
|
| 640 |
+
try:
|
| 641 |
+
if isinstance(labels_json, str):
|
| 642 |
+
labels_data = json.loads(labels_json)
|
| 643 |
+
else:
|
| 644 |
+
labels_data = labels_json
|
| 645 |
+
|
| 646 |
+
if not isinstance(labels_data, list):
|
| 647 |
+
label_names = []
|
| 648 |
+
else:
|
| 649 |
+
label_names = [label.get('name', '').lower() for label in labels_data if isinstance(label, dict)]
|
| 650 |
+
|
| 651 |
+
except (json.JSONDecodeError, TypeError):
|
| 652 |
+
label_names = []
|
| 653 |
+
|
| 654 |
+
# Determine state
|
| 655 |
+
normalized_closed_at = normalize_date_format(closed_at) if closed_at else None
|
| 656 |
+
state = 'closed' if (normalized_closed_at and normalized_closed_at != 'N/A') else 'open'
|
| 657 |
+
|
| 658 |
+
# Store issue metadata
|
| 659 |
+
all_issues[issue_url] = {
|
| 660 |
+
'url': issue_url,
|
| 661 |
+
'repo': repo_name,
|
| 662 |
+
'title': title,
|
| 663 |
+
'number': issue_number,
|
| 664 |
+
'state': state,
|
| 665 |
+
'created_at': normalize_date_format(created_at),
|
| 666 |
+
'closed_at': normalized_closed_at,
|
| 667 |
+
'labels': label_names
|
| 668 |
+
}
|
| 669 |
+
|
| 670 |
+
# Query 2: Find PRs from both IssueCommentEvent and PullRequestEvent
|
| 671 |
+
pr_query = """
|
| 672 |
+
SELECT DISTINCT
|
| 673 |
+
COALESCE(
|
| 674 |
+
json_extract_string(payload, '$.issue.html_url'),
|
| 675 |
+
json_extract_string(payload, '$.pull_request.html_url')
|
| 676 |
+
) as pr_url,
|
| 677 |
+
COALESCE(
|
| 678 |
+
json_extract_string(payload, '$.issue.user.login'),
|
| 679 |
+
json_extract_string(payload, '$.pull_request.user.login')
|
| 680 |
+
) as pr_creator,
|
| 681 |
+
COALESCE(
|
| 682 |
+
json_extract_string(payload, '$.issue.pull_request.merged_at'),
|
| 683 |
+
json_extract_string(payload, '$.pull_request.merged_at')
|
| 684 |
+
) as merged_at,
|
| 685 |
+
COALESCE(
|
| 686 |
+
json_extract_string(payload, '$.issue.body'),
|
| 687 |
+
json_extract_string(payload, '$.pull_request.body')
|
| 688 |
+
) as pr_body
|
| 689 |
+
FROM batch_data
|
| 690 |
+
WHERE
|
| 691 |
+
(type = 'IssueCommentEvent' AND json_extract_string(payload, '$.issue.pull_request') IS NOT NULL)
|
| 692 |
+
OR type = 'PullRequestEvent'
|
| 693 |
+
"""
|
| 694 |
+
|
| 695 |
+
pr_results = conn.execute(pr_query).fetchall()
|
| 696 |
+
|
| 697 |
+
for row in pr_results:
|
| 698 |
+
pr_url = row[0]
|
| 699 |
+
pr_creator = row[1]
|
| 700 |
+
merged_at = row[2]
|
| 701 |
+
pr_body = row[3]
|
| 702 |
+
|
| 703 |
+
if not pr_url or not pr_creator:
|
| 704 |
+
continue
|
| 705 |
+
|
| 706 |
+
pr_creators[pr_url] = pr_creator
|
| 707 |
+
pr_merged_at[pr_url] = merged_at
|
| 708 |
+
|
| 709 |
+
# Extract linked issues from PR body
|
| 710 |
+
if pr_body:
|
| 711 |
+
# Match issue URLs or #number references
|
| 712 |
+
issue_refs = re.findall(r'(?:https?://github\.com/[\w-]+/[\w-]+/issues/\d+)|(?:#\d+)', pr_body, re.IGNORECASE)
|
| 713 |
+
|
| 714 |
+
for ref in issue_refs:
|
| 715 |
+
# Convert #number to full URL if needed
|
| 716 |
+
if ref.startswith('#'):
|
| 717 |
+
# Extract org/repo from PR URL
|
| 718 |
+
pr_parts = pr_url.split('/')
|
| 719 |
+
if len(pr_parts) >= 5:
|
| 720 |
+
org = pr_parts[-4]
|
| 721 |
+
repo = pr_parts[-3]
|
| 722 |
+
issue_num = ref[1:]
|
| 723 |
+
issue_url = f"https://github.com/{org}/{repo}/issues/{issue_num}"
|
| 724 |
+
issue_to_prs[issue_url].add(pr_url)
|
| 725 |
+
else:
|
| 726 |
+
issue_to_prs[ref].add(pr_url)
|
| 727 |
+
|
| 728 |
+
print(f"✓ {len(issue_results)} issues, {len(pr_results)} PRs")
|
| 729 |
+
|
| 730 |
+
# Clean up temp view after batch processing
|
| 731 |
+
conn.execute("DROP VIEW IF EXISTS batch_data")
|
| 732 |
+
|
| 733 |
+
except Exception as e:
|
| 734 |
+
print(f"\n ✗ Batch {batch_num} error: {str(e)}")
|
| 735 |
+
traceback.print_exc()
|
| 736 |
+
# Clean up temp view even on error
|
| 737 |
+
try:
|
| 738 |
+
conn.execute("DROP VIEW IF EXISTS batch_data")
|
| 739 |
+
except:
|
| 740 |
+
pass
|
| 741 |
+
|
| 742 |
+
# Move to next batch
|
| 743 |
+
current_date = batch_end + timedelta(days=1)
|
| 744 |
+
|
| 745 |
+
# Post-processing: Filter issues and assign to agents
|
| 746 |
+
print(f"\n Post-processing {len(all_issues)} wanted issues...")
|
| 747 |
+
|
| 748 |
+
wanted_open = []
|
| 749 |
+
wanted_resolved = defaultdict(list)
|
| 750 |
+
current_time = datetime.now(timezone.utc)
|
| 751 |
+
|
| 752 |
+
for issue_url, issue_meta in all_issues.items():
|
| 753 |
+
# Check if issue has linked PRs
|
| 754 |
+
linked_prs = issue_to_prs.get(issue_url, set())
|
| 755 |
+
if not linked_prs:
|
| 756 |
+
continue
|
| 757 |
+
|
| 758 |
+
# Check if any linked PR was merged AND created by an agent
|
| 759 |
+
resolved_by = None
|
| 760 |
+
for pr_url in linked_prs:
|
| 761 |
+
merged_at = pr_merged_at.get(pr_url)
|
| 762 |
+
if merged_at: # PR was merged
|
| 763 |
+
pr_creator = pr_creators.get(pr_url)
|
| 764 |
+
if pr_creator in identifier_set:
|
| 765 |
+
resolved_by = pr_creator
|
| 766 |
+
break
|
| 767 |
+
|
| 768 |
+
if not resolved_by:
|
| 769 |
+
continue
|
| 770 |
+
|
| 771 |
+
# Process based on issue state
|
| 772 |
+
if issue_meta['state'] == 'open':
|
| 773 |
+
# For open issues: check if labels match PATCH_WANTED_LABELS
|
| 774 |
+
issue_labels = issue_meta.get('labels', [])
|
| 775 |
+
has_patch_label = False
|
| 776 |
+
for issue_label in issue_labels:
|
| 777 |
+
for wanted_label in PATCH_WANTED_LABELS:
|
| 778 |
+
if wanted_label.lower() in issue_label:
|
| 779 |
+
has_patch_label = True
|
| 780 |
+
break
|
| 781 |
+
if has_patch_label:
|
| 782 |
+
break
|
| 783 |
+
|
| 784 |
+
if not has_patch_label:
|
| 785 |
+
continue
|
| 786 |
+
|
| 787 |
+
# Check if long-standing
|
| 788 |
+
created_at_str = issue_meta.get('created_at')
|
| 789 |
+
if created_at_str and created_at_str != 'N/A':
|
| 790 |
+
try:
|
| 791 |
+
created_dt = datetime.fromisoformat(created_at_str.replace('Z', '+00:00'))
|
| 792 |
+
days_open = (current_time - created_dt).days
|
| 793 |
+
if days_open >= LONGSTANDING_GAP_DAYS:
|
| 794 |
+
wanted_open.append(issue_meta)
|
| 795 |
+
except:
|
| 796 |
+
pass
|
| 797 |
+
|
| 798 |
+
elif issue_meta['state'] == 'closed':
|
| 799 |
+
# For closed issues: must be closed within time frame AND open 30+ days
|
| 800 |
+
closed_at_str = issue_meta.get('closed_at')
|
| 801 |
+
created_at_str = issue_meta.get('created_at')
|
| 802 |
+
|
| 803 |
+
if closed_at_str and closed_at_str != 'N/A' and created_at_str and created_at_str != 'N/A':
|
| 804 |
+
try:
|
| 805 |
+
closed_dt = datetime.fromisoformat(closed_at_str.replace('Z', '+00:00'))
|
| 806 |
+
created_dt = datetime.fromisoformat(created_at_str.replace('Z', '+00:00'))
|
| 807 |
+
|
| 808 |
+
# Calculate how long the issue was open
|
| 809 |
+
days_open = (closed_dt - created_dt).days
|
| 810 |
+
|
| 811 |
+
# Only include if closed within timeframe AND was open 30+ days
|
| 812 |
+
if start_date <= closed_dt <= end_date and days_open >= LONGSTANDING_GAP_DAYS:
|
| 813 |
+
wanted_resolved[resolved_by].append(issue_meta)
|
| 814 |
+
except:
|
| 815 |
+
pass
|
| 816 |
+
|
| 817 |
+
print(f" ✓ Found {len(wanted_open)} long-standing open wanted issues")
|
| 818 |
+
print(f" ✓ Found {sum(len(issues) for issues in wanted_resolved.values())} resolved wanted issues across {len(wanted_resolved)} agents")
|
| 819 |
+
|
| 820 |
+
return {
|
| 821 |
+
'agent_issues': agent_issues,
|
| 822 |
+
'wanted_open': wanted_open,
|
| 823 |
+
'wanted_resolved': dict(wanted_resolved)
|
| 824 |
+
}
|
| 825 |
+
|
| 826 |
+
|
| 827 |
def sync_agents_repo():
|
| 828 |
"""
|
| 829 |
+
Sync local bot_metadata repository with remote using git pull.
|
| 830 |
This is MANDATORY to ensure we have the latest bot data.
|
| 831 |
Raises exception if sync fails.
|
| 832 |
"""
|
|
|
|
| 886 |
ALWAYS syncs with remote first to ensure we have the latest bot data.
|
| 887 |
"""
|
| 888 |
# MANDATORY: Sync with remote first to get latest bot data
|
| 889 |
+
print(f" Syncing bot_metadata repository to get latest agents...")
|
| 890 |
sync_agents_repo() # Will raise exception if sync fails
|
| 891 |
|
| 892 |
agents = []
|
|
|
|
| 1020 |
}
|
| 1021 |
|
| 1022 |
|
| 1023 |
+
def construct_leaderboard_from_metadata(all_metadata_dict, agents, wanted_resolved_dict=None):
|
| 1024 |
+
"""Construct leaderboard from in-memory issue metadata.
|
| 1025 |
+
|
| 1026 |
+
Args:
|
| 1027 |
+
all_metadata_dict: Dictionary mapping agent ID to list of issue metadata (agent-assigned issues)
|
| 1028 |
+
agents: List of agent metadata
|
| 1029 |
+
wanted_resolved_dict: Optional dictionary mapping agent ID to list of resolved wanted issues
|
| 1030 |
+
"""
|
| 1031 |
if not agents:
|
| 1032 |
print("Error: No agents found")
|
| 1033 |
return {}
|
| 1034 |
|
| 1035 |
+
if wanted_resolved_dict is None:
|
| 1036 |
+
wanted_resolved_dict = {}
|
| 1037 |
+
|
| 1038 |
cache_dict = {}
|
| 1039 |
|
| 1040 |
for agent in agents:
|
|
|
|
| 1044 |
bot_metadata = all_metadata_dict.get(identifier, [])
|
| 1045 |
stats = calculate_issue_stats_from_metadata(bot_metadata)
|
| 1046 |
|
| 1047 |
+
# Add wanted issues count
|
| 1048 |
+
resolved_wanted = len(wanted_resolved_dict.get(identifier, []))
|
| 1049 |
+
|
| 1050 |
cache_dict[identifier] = {
|
| 1051 |
'name': agent_name,
|
| 1052 |
'website': agent.get('website', 'N/A'),
|
| 1053 |
'github_identifier': identifier,
|
| 1054 |
+
**stats,
|
| 1055 |
+
'resolved_wanted_issues': resolved_wanted
|
| 1056 |
}
|
| 1057 |
|
| 1058 |
return cache_dict
|
| 1059 |
|
| 1060 |
|
| 1061 |
+
def save_leaderboard_data_to_hf(leaderboard_dict, monthly_metrics, wanted_issues=None):
|
| 1062 |
+
"""Save leaderboard data, monthly metrics, and wanted issues to HuggingFace dataset."""
|
| 1063 |
try:
|
| 1064 |
token = get_hf_token()
|
| 1065 |
if not token:
|
|
|
|
| 1067 |
|
| 1068 |
api = HfApi(token=token)
|
| 1069 |
|
| 1070 |
+
if wanted_issues is None:
|
| 1071 |
+
wanted_issues = []
|
| 1072 |
+
|
| 1073 |
combined_data = {
|
| 1074 |
+
'metadata': {
|
| 1075 |
+
'last_updated': datetime.now(timezone.utc).isoformat(),
|
| 1076 |
+
'leaderboard_time_frame_days': LEADERBOARD_TIME_FRAME_DAYS,
|
| 1077 |
+
'longstanding_gap_days': LONGSTANDING_GAP_DAYS,
|
| 1078 |
+
'tracked_orgs': TRACKED_ORGS,
|
| 1079 |
+
'patch_wanted_labels': PATCH_WANTED_LABELS
|
| 1080 |
+
},
|
| 1081 |
'leaderboard': leaderboard_dict,
|
| 1082 |
'monthly_metrics': monthly_metrics,
|
| 1083 |
+
'wanted_issues': wanted_issues
|
|
|
|
|
|
|
| 1084 |
}
|
| 1085 |
|
| 1086 |
with open(LEADERBOARD_FILENAME, 'w') as f:
|
|
|
|
| 1144 |
start_date = end_date - timedelta(days=LEADERBOARD_TIME_FRAME_DAYS)
|
| 1145 |
|
| 1146 |
try:
|
| 1147 |
+
# USE UNIFIED STREAMING FUNCTION FOR BOTH ISSUE TYPES
|
| 1148 |
+
results = fetch_unified_issue_metadata_streaming(
|
| 1149 |
conn, identifiers, start_date, end_date
|
| 1150 |
)
|
| 1151 |
|
| 1152 |
+
agent_issues = results['agent_issues']
|
| 1153 |
+
wanted_open = results['wanted_open']
|
| 1154 |
+
wanted_resolved = results['wanted_resolved']
|
| 1155 |
+
|
| 1156 |
except Exception as e:
|
| 1157 |
print(f"Error during DuckDB fetch: {str(e)}")
|
| 1158 |
traceback.print_exc()
|
|
|
|
| 1163 |
print(f"\n[4/4] Saving leaderboard...")
|
| 1164 |
|
| 1165 |
try:
|
| 1166 |
+
leaderboard_dict = construct_leaderboard_from_metadata(agent_issues, agents, wanted_resolved)
|
| 1167 |
+
monthly_metrics = calculate_monthly_metrics_by_agent(agent_issues, agents)
|
| 1168 |
+
save_leaderboard_data_to_hf(leaderboard_dict, monthly_metrics, wanted_open)
|
| 1169 |
|
| 1170 |
except Exception as e:
|
| 1171 |
print(f"Error saving leaderboard: {str(e)}")
|