zhimin-z committed
Commit 4b78e58 · 1 Parent(s): 64746d3

merge wanted

Files changed (3):
  1. README.md +47 -14
  2. app.py +169 -34
  3. msr.py +358 -19
README.md CHANGED
@@ -31,34 +31,51 @@ Key metrics from the last 180 days:
  - **Total Issues**: Issues the assistant has been involved with (authored, assigned, or commented on)
  - **Closed Issues**: Issues that were closed
  - **Resolved Issues**: Closed issues marked as completed
- - **Resolution Rate**: Percentage of closed issues successfully resolved
+ - **Resolved Rate**: Percentage of closed issues successfully resolved
+ - **Resolved Wanted Issues**: Long-standing issues (30+ days old) from major open-source projects that the assistant resolved via merged pull requests

  **Monthly Trends**
- - Resolution rate trends (line plots)
+ - Resolved rate trends (line plots)
  - Issue volume over time (bar charts)

+ **Issues Wanted**
+ - Long-standing open issues (30+ days) with fix-needed labels (e.g. `bug`, `enhancement`) from tracked organizations (Apache, GitHub, Hugging Face)
+
  We focus on 180 days to highlight current capabilities and active assistants.

  ## How It Works

  **Data Collection**
- We mine GitHub activity from [GHArchive](https://www.gharchive.org/), tracking:
- - Issues opened or assigned to the assistant (`IssuesEvent`)
- - Issue comments by the assistant (`IssueCommentEvent`)
+ We mine GitHub activity from [GHArchive](https://www.gharchive.org/), tracking two types of issues:
+
+ 1. **Agent-Assigned Issues**:
+ - Issues opened or assigned to the assistant (`IssuesEvent`)
+ - Issue comments by the assistant (`IssueCommentEvent`)
+
+ 2. **Wanted Issues** (from tracked organizations: Apache, GitHub, Hugging Face):
+ - Long-standing open issues (30+ days) with fix-needed labels (`bug`, `enhancement`)
+ - Pull requests created by assistants that reference these issues
+ - Only counts as resolved when the assistant's PR is merged and the issue is subsequently closed

  **Regular Updates**
  Leaderboard refreshes weekly (Friday at 00:00 UTC).

  **Community Submissions**
- Anyone can submit an assistant. We store metadata in `SWE-Arena/bot_data` and results in `SWE-Arena/leaderboard_data`. All submissions are validated via GitHub API.
+ Anyone can submit an assistant. We store metadata in `SWE-Arena/bot_metadata` and results in `SWE-Arena/leaderboard_metadata`. All submissions are validated via GitHub API.

  ## Using the Leaderboard

  ### Browsing
- Leaderboard tab features:
+ **Leaderboard Tab**:
  - Searchable table (by assistant name or website)
- - Filterable columns (by resolution rate)
+ - Filterable columns (by resolved rate)
  - Monthly charts (resolution trends and activity)
+ - View both agent-assigned metrics and wanted issue resolutions
+
+ **Issues Wanted Tab**:
+ - Browse long-standing open issues (30+ days) from major open-source projects
+ - Filter by tracked organizations (Apache, GitHub, Hugging Face)
+ - See which issues need attention from the community

  ### Adding Your Assistant
  Submit Assistant tab requires:
@@ -71,33 +88,49 @@ Submissions are validated and data loads within seconds.

  ## Understanding the Metrics

- **Resolution Rate**
+ **Resolved Rate**
  Percentage of closed issues successfully completed:

  ```
- Resolution Rate = resolved issues ÷ closed issues × 100
+ Resolved Rate = resolved issues ÷ closed issues × 100
  ```

  An issue is "resolved" when `state_reason` is `completed` on GitHub. This means the problem was solved, not just closed without resolution.

  Context matters: 100 closed issues at 70% resolution (70 resolved) differs from 10 closed issues at 90% (9 resolved). Consider both rate and volume.

+ **Resolved Wanted Issues**
+ Long-standing issues (30+ days old) from major open-source projects that the assistant resolved. An issue qualifies when:
+ 1. It's from a tracked organization (Apache, GitHub, Hugging Face)
+ 2. It has a fix-needed label (`bug`, `enhancement`)
+ 3. The assistant created a pull request referencing the issue
+ 4. The pull request was merged
+ 5. The issue was subsequently closed
+
+ This metric highlights assistants' ability to tackle challenging, community-identified problems in high-impact projects.
+
+ **Long-Standing Issues**
+ Issues that have been open for 30+ days represent real challenges the community has struggled to address. These are harder than typical issues and demonstrate an assistant's problem-solving capabilities.
+
  **Monthly Trends**
- - **Line plots**: Resolution rate changes over time
+ - **Line plots**: Resolved rate changes over time
  - **Bar charts**: Issue volume per month

  Patterns to watch:
  - Consistent high rates = effective problem-solving
  - Increasing trends = improving assistants
  - High volume + good rates = productivity + effectiveness
+ - High wanted issue resolution = ability to tackle challenging community problems

  ## What's Next

  Planned improvements:
  - Repository-based analysis
- - Extended metrics (comment activity, response time, complexity)
- - Resolution time tracking
- - Issue type patterns (bugs, features, docs)
+ - Extended metrics (comment activity, response time, code complexity)
+ - Resolution time tracking from issue creation to PR merge
+ - Issue category patterns and difficulty assessment
+ - Expanded organization and label tracking for wanted issues
+ - Integration with additional high-impact open-source organizations

  ## Questions or Issues?
 
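As a concrete illustration of the two headline metrics described in the README above, here is a minimal Python sketch. The field names (`repo`, `labels`, `created_at`, `closed_at`) and the `merged_pr_by_agent` flag are illustrative assumptions, not the exact structures used in `msr.py`.

```python
from datetime import datetime

TRACKED_ORGS = {"apache", "github", "huggingface"}
PATCH_WANTED_LABELS = {"bug", "enhancement"}
LONGSTANDING_GAP_DAYS = 30


def resolved_rate(resolved_issues: int, closed_issues: int) -> float:
    """Resolved Rate = resolved issues ÷ closed issues × 100."""
    return round(resolved_issues / closed_issues * 100, 1) if closed_issues else 0.0


def qualifies_as_resolved_wanted(issue: dict, merged_pr_by_agent: bool) -> bool:
    """Apply the five README criteria to one hypothetical issue record."""
    org = issue["repo"].split("/")[0].lower()                      # 1. tracked organization
    labels = {label.lower() for label in issue.get("labels", [])}  # 2. fix-needed label
    if not issue.get("closed_at"):                                 # 5. issue was closed
        return False
    created = datetime.fromisoformat(issue["created_at"])
    closed = datetime.fromisoformat(issue["closed_at"])
    long_standing = (closed - created).days >= LONGSTANDING_GAP_DAYS
    return (
        org in TRACKED_ORGS
        and bool(labels & PATCH_WANTED_LABELS)
        and long_standing
        and merged_pr_by_agent                                     # 3 + 4. agent's PR was merged
    )


print(resolved_rate(70, 100))  # 70.0
print(qualifies_as_resolved_wanted(
    {
        "repo": "apache/airflow",
        "labels": ["bug"],
        "created_at": "2024-01-01T00:00:00+00:00",
        "closed_at": "2024-03-01T00:00:00+00:00",
    },
    merged_pr_by_agent=True,
))  # True
```

`msr.py` itself performs these checks while streaming GHArchive batches and linking PR bodies back to issue URLs (see `fetch_unified_issue_metadata_streaming` below); the sketch only mirrors the README-level definition.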
app.py CHANGED
@@ -3,6 +3,7 @@ from gradio_leaderboard import Leaderboard, ColumnFilter
3
  import json
4
  import os
5
  import time
 
6
  import requests
7
  from huggingface_hub import HfApi, hf_hub_download
8
  from huggingface_hub.errors import HfHubHTTPError
@@ -14,6 +15,7 @@ import plotly.graph_objects as go
14
  from plotly.subplots import make_subplots
15
  from apscheduler.schedulers.background import BackgroundScheduler
16
  from apscheduler.triggers.cron import CronTrigger
 
17
 
18
  # Load environment variables
19
  load_dotenv()
@@ -23,8 +25,11 @@ load_dotenv()
23
  # =============================================================================
24
 
25
  AGENTS_REPO = "SWE-Arena/bot_metadata" # HuggingFace dataset for agent metadata
 
26
  LEADERBOARD_FILENAME = f"{os.getenv('COMPOSE_PROJECT_NAME')}.json"
27
  LEADERBOARD_REPO = "SWE-Arena/leaderboard_metadata" # HuggingFace dataset for leaderboard data
 
 
28
  MAX_RETRIES = 5
29
 
30
  LEADERBOARD_COLUMNS = [
@@ -33,6 +38,7 @@ LEADERBOARD_COLUMNS = [
33
  ("Total Issues", "number"),
34
  ("Resolved Issues", "number"),
35
  ("Resolved Rate (%)", "number"),
 
36
  ]
37
 
38
  # =============================================================================
@@ -95,52 +101,113 @@ def validate_github_username(identifier):
95
  # HUGGINGFACE DATASET OPERATIONS
96
  # =============================================================================
97
 
98
- def load_agents_from_hf():
99
- """Load all agent metadata JSON files from HuggingFace dataset."""
100
  try:
101
- api = HfApi()
102
- agents = []
103
 
104
- # List all files in the repository
105
- files = list_repo_files_with_backoff(api=api, repo_id=AGENTS_REPO, repo_type="dataset")
106
 
107
- # Filter for JSON files only
108
- json_files = [f for f in files if f.endswith('.json')]
109
 
110
- # Download and parse each JSON file
111
- for json_file in json_files:
112
- try:
113
- file_path = hf_hub_download_with_backoff(
114
- repo_id=AGENTS_REPO,
115
- filename=json_file,
116
- repo_type="dataset"
117
- )
118
 
119
- with open(file_path, 'r') as f:
120
- agent_data = json.load(f)
121
 
122
- # Only process agents with status == "active"
123
- if agent_data.get('status') != 'active':
124
- continue
125
 
126
- # Extract github_identifier from filename (e.g., "agent[bot].json" -> "agent[bot]")
127
- filename_identifier = json_file.replace('.json', '')
 
128
 
129
- # Add or override github_identifier to match filename
130
- agent_data['github_identifier'] = filename_identifier
 
 
131
 
132
- agents.append(agent_data)
133
 
134
  except Exception as e:
135
- print(f"Warning: Could not load {json_file}: {str(e)}")
136
  continue
137
 
138
- print(f"Loaded {len(agents)} agents from HuggingFace")
139
- return agents
140
-
141
- except Exception as e:
142
- print(f"Could not load agents from HuggingFace: {str(e)}")
143
- return None
144
 
145
 
146
  def get_hf_token():
@@ -483,6 +550,7 @@ def get_leaderboard_dataframe():
483
  total_issues,
484
  data.get('resolved_issues', 0),
485
  data.get('resolved_rate', 0.0),
 
486
  ])
487
 
488
  print(f"Filtered out {filtered_count} agents with 0 issues")
@@ -493,7 +561,7 @@ def get_leaderboard_dataframe():
493
  df = pd.DataFrame(rows, columns=column_names)
494
 
495
  # Ensure numeric types
496
- numeric_cols = ["Total Issues", "Resolved Issues", "Resolved Rate (%)"]
497
  for col in numeric_cols:
498
  if col in df.columns:
499
  df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)
@@ -508,6 +576,54 @@ def get_leaderboard_dataframe():
508
  return df
509
 
510
 
511
  def submit_agent(identifier, agent_name, organization, website):
512
  """
513
  Submit a new agent to the leaderboard.
@@ -657,6 +773,25 @@ with gr.Blocks(title="SWE Agent Issue Leaderboard", theme=gr.themes.Soft()) as a
657
  )
658
 
659
 
660
  # Submit Agent Tab
661
  with gr.Tab("Submit Agent"):
662
 
 
3
  import json
4
  import os
5
  import time
6
+ import subprocess
7
  import requests
8
  from huggingface_hub import HfApi, hf_hub_download
9
  from huggingface_hub.errors import HfHubHTTPError
 
15
  from plotly.subplots import make_subplots
16
  from apscheduler.schedulers.background import BackgroundScheduler
17
  from apscheduler.triggers.cron import CronTrigger
18
+ from datetime import datetime, timezone
19
 
20
  # Load environment variables
21
  load_dotenv()
 
25
  # =============================================================================
26
 
27
  AGENTS_REPO = "SWE-Arena/bot_metadata" # HuggingFace dataset for agent metadata
28
+ AGENTS_REPO_LOCAL_PATH = os.path.expanduser("~/bot_metadata") # Local git clone path
29
  LEADERBOARD_FILENAME = f"{os.getenv('COMPOSE_PROJECT_NAME')}.json"
30
  LEADERBOARD_REPO = "SWE-Arena/leaderboard_metadata" # HuggingFace dataset for leaderboard data
31
+ LONGSTANDING_GAP_DAYS = 30 # Minimum days for an issue to be considered long-standing
32
+ GIT_SYNC_TIMEOUT = 300 # 5 minutes timeout for git pull
33
  MAX_RETRIES = 5
34
 
35
  LEADERBOARD_COLUMNS = [
 
38
  ("Total Issues", "number"),
39
  ("Resolved Issues", "number"),
40
  ("Resolved Rate (%)", "number"),
41
+ ("Resolved Wanted Issues", "number"),
42
  ]
43
 
44
  # =============================================================================
 
101
  # HUGGINGFACE DATASET OPERATIONS
102
  # =============================================================================
103
 
104
+ def sync_agents_repo():
105
+ """
106
+ Sync local bot_metadata repository with remote using git pull.
107
+ This is MANDATORY to ensure we have the latest bot data.
108
+ Raises exception if sync fails.
109
+ """
110
+ if not os.path.exists(AGENTS_REPO_LOCAL_PATH):
111
+ error_msg = f"Local repository not found at {AGENTS_REPO_LOCAL_PATH}"
112
+ print(f" Error {error_msg}")
113
+ print(f" Please clone it first: git clone https://huggingface.co/datasets/{AGENTS_REPO}")
114
+ raise FileNotFoundError(error_msg)
115
+
116
+ if not os.path.exists(os.path.join(AGENTS_REPO_LOCAL_PATH, '.git')):
117
+ error_msg = f"{AGENTS_REPO_LOCAL_PATH} exists but is not a git repository"
118
+ print(f" Error {error_msg}")
119
+ raise ValueError(error_msg)
120
+
121
  try:
122
+ # Run git pull with extended timeout due to large repository
123
+ result = subprocess.run(
124
+ ['git', 'pull'],
125
+ cwd=AGENTS_REPO_LOCAL_PATH,
126
+ capture_output=True,
127
+ text=True,
128
+ timeout=GIT_SYNC_TIMEOUT
129
+ )
130
 
131
+ if result.returncode == 0:
132
+ output = result.stdout.strip()
133
+ if "Already up to date" in output or "Already up-to-date" in output:
134
+ print(f" Success Repository is up to date")
135
+ else:
136
+ print(f" Success Repository synced successfully")
137
+ if output:
138
+ # Print first few lines of output
139
+ lines = output.split('\n')[:5]
140
+ for line in lines:
141
+ print(f" {line}")
142
+ return True
143
+ else:
144
+ error_msg = f"Git pull failed: {result.stderr.strip()}"
145
+ print(f" Error {error_msg}")
146
+ raise RuntimeError(error_msg)
147
+
148
+ except subprocess.TimeoutExpired:
149
+ error_msg = f"Git pull timed out after {GIT_SYNC_TIMEOUT} seconds"
150
+ print(f" Error {error_msg}")
151
+ raise TimeoutError(error_msg)
152
+ except (FileNotFoundError, ValueError, RuntimeError, TimeoutError):
153
+ raise # Re-raise expected exceptions
154
+ except Exception as e:
155
+ error_msg = f"Error syncing repository: {str(e)}"
156
+ print(f" Error {error_msg}")
157
+ raise RuntimeError(error_msg) from e
158
 
 
 
159
 
160
+ def load_agents_from_hf():
161
+ """
162
+ Load all agent metadata JSON files from local git repository.
163
+ ALWAYS syncs with remote first to ensure we have the latest bot data.
164
+ """
165
+ # MANDATORY: Sync with remote first to get latest bot data
166
+ print(f" Syncing bot_metadata repository to get latest agents...")
167
+ sync_agents_repo() # Will raise exception if sync fails
168
 
169
+ agents = []
 
170
 
171
+ # Scan local directory for JSON files
172
+ if not os.path.exists(AGENTS_REPO_LOCAL_PATH):
173
+ raise FileNotFoundError(f"Local repository not found at {AGENTS_REPO_LOCAL_PATH}")
174
 
175
+ # Walk through the directory to find all JSON files
176
+ files_processed = 0
177
+ print(f" Loading agent metadata from {AGENTS_REPO_LOCAL_PATH}...")
178
 
179
+ for root, dirs, files in os.walk(AGENTS_REPO_LOCAL_PATH):
180
+ # Skip .git directory
181
+ if '.git' in root:
182
+ continue
183
 
184
+ for filename in files:
185
+ if not filename.endswith('.json'):
186
+ continue
187
+
188
+ files_processed += 1
189
+ file_path = os.path.join(root, filename)
190
+
191
+ try:
192
+ with open(file_path, 'r', encoding='utf-8') as f:
193
+ agent_data = json.load(f)
194
+
195
+ # Only include active agents
196
+ if agent_data.get('status') != 'active':
197
+ continue
198
+
199
+ # Extract github_identifier from filename
200
+ github_identifier = filename.replace('.json', '')
201
+ agent_data['github_identifier'] = github_identifier
202
+
203
+ agents.append(agent_data)
204
 
205
  except Exception as e:
206
+ print(f" Warning Error loading {filename}: {str(e)}")
207
  continue
208
 
209
+ print(f" Success Loaded {len(agents)} active agents (from {files_processed} total files)")
210
+ return agents
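For reference, the loader above expects one JSON file per agent in the local `bot_metadata` checkout. The exact schema is not shown in this diff; the fields below are inferred from how `submit_agent` and the leaderboard code use them, so treat them as illustrative:

```python
# Hypothetical contents of ~/bot_metadata/examplebot[bot].json (illustrative fields only)
example_agent = {
    "name": "ExampleBot",            # display name used on the leaderboard
    "organization": "Example Org",   # collected by the Submit Agent tab
    "website": "https://example.com",
    "status": "active",              # anything other than "active" is skipped
}
# load_agents_from_hf() does not read github_identifier from the file; it derives it
# from the filename: "examplebot[bot].json" -> "examplebot[bot]".
```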
 
 
 
 
211
 
212
 
213
  def get_hf_token():
 
550
  total_issues,
551
  data.get('resolved_issues', 0),
552
  data.get('resolved_rate', 0.0),
553
+ data.get('resolved_wanted_issues', 0),
554
  ])
555
 
556
  print(f"Filtered out {filtered_count} agents with 0 issues")
 
561
  df = pd.DataFrame(rows, columns=column_names)
562
 
563
  # Ensure numeric types
564
+ numeric_cols = ["Total Issues", "Resolved Issues", "Resolved Rate (%)", "Resolved Wanted Issues"]
565
  for col in numeric_cols:
566
  if col in df.columns:
567
  df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)
 
576
  return df
577
 
578
 
579
+ def get_wanted_issues_dataframe():
580
+ """Load wanted issues and convert to pandas DataFrame."""
581
+ saved_data = load_leaderboard_data_from_hf()
582
+
583
+ if not saved_data or 'wanted_issues' not in saved_data:
584
+ print(f"No wanted issues data available")
585
+ return pd.DataFrame(columns=["Title", "URL", "Age (days)", "Labels"])
586
+
587
+ wanted_issues = saved_data['wanted_issues']
588
+ print(f"Loaded {len(wanted_issues)} wanted issues")
589
+
590
+ if not wanted_issues:
591
+ return pd.DataFrame(columns=["Title", "URL", "Age (days)", "Labels"])
592
+
593
+ rows = []
594
+ for issue in wanted_issues:
595
+ # Calculate age
596
+ created_at = issue.get('created_at')
597
+ age_days = 0
598
+ if created_at and created_at != 'N/A':
599
+ try:
600
+ created = datetime.fromisoformat(created_at.replace('Z', '+00:00'))
601
+ age_days = (datetime.now(timezone.utc) - created).days
602
+ except:
603
+ pass
604
+
605
+ # Create clickable link
606
+ url = issue.get('url', '')
607
+ repo = issue.get('repo', '')
608
+ issue_number = issue.get('number', '')
609
+ url_link = f'<a href="{url}" target="_blank">{repo}#{issue_number}</a>'
610
+
611
+ rows.append([
612
+ issue.get('title', ''),
613
+ url_link,
614
+ age_days,
615
+ ', '.join(issue.get('labels', []))
616
+ ])
617
+
618
+ df = pd.DataFrame(rows, columns=["Title", "URL", "Age (days)", "Labels"])
619
+
620
+ # Sort by age descending
621
+ if "Age (days)" in df.columns and not df.empty:
622
+ df = df.sort_values(by="Age (days)", ascending=False).reset_index(drop=True)
623
+
624
+ return df
625
+
626
+
627
  def submit_agent(identifier, agent_name, organization, website):
628
  """
629
  Submit a new agent to the leaderboard.
 
773
  )
774
 
775
 
776
+ # Issues Wanted Tab
777
+ with gr.Tab("Issues Wanted"):
778
+ gr.Markdown("### Long-Standing Patch-Wanted Issues")
779
+ gr.Markdown(f"*Issues open for {LONGSTANDING_GAP_DAYS}+ days with patch-wanted labels from tracked organizations*")
780
+
781
+ wanted_table = gr.Dataframe(
782
+ value=pd.DataFrame(columns=["Title", "URL", "Age (days)", "Labels"]),
783
+ datatype=["str", "html", "number", "str"],
784
+ interactive=False,
785
+ wrap=True
786
+ )
787
+
788
+ app.load(
789
+ fn=get_wanted_issues_dataframe,
790
+ inputs=[],
791
+ outputs=[wanted_table]
792
+ )
793
+
794
+
795
  # Submit Agent Tab
796
  with gr.Tab("Submit Agent"):
797
 
msr.py CHANGED
@@ -25,13 +25,27 @@ load_dotenv()
25
  # CONFIGURATION
26
  # =============================================================================
27
 
28
- AGENTS_REPO = "SWE-Arena/bot_data"
29
- AGENTS_REPO_LOCAL_PATH = os.path.expanduser("~/bot_data") # Local git clone path
30
  DUCKDB_CACHE_FILE = "cache.duckdb"
31
  GHARCHIVE_DATA_LOCAL_PATH = os.path.expanduser("~/gharchive/data")
32
  LEADERBOARD_FILENAME = f"{os.getenv('COMPOSE_PROJECT_NAME')}.json"
33
- LEADERBOARD_REPO = "SWE-Arena/leaderboard_data"
34
  LEADERBOARD_TIME_FRAME_DAYS = 180
 
 
35
 
36
  # Git sync configuration (mandatory to get latest bot data)
37
  GIT_SYNC_TIMEOUT = 300 # 5 minutes timeout for git pull
@@ -509,9 +523,310 @@ def fetch_all_issue_metadata_streaming(conn, identifiers, start_date, end_date):
509
  return dict(metadata_by_agent)
510
 
511
 
 
 
 
512
  def sync_agents_repo():
513
  """
514
- Sync local bot_data repository with remote using git pull.
515
  This is MANDATORY to ensure we have the latest bot data.
516
  Raises exception if sync fails.
517
  """
@@ -571,7 +886,7 @@ def load_agents_from_hf():
571
  ALWAYS syncs with remote first to ensure we have the latest bot data.
572
  """
573
  # MANDATORY: Sync with remote first to get latest bot data
574
- print(f" Syncing bot_data repository to get latest agents...")
575
  sync_agents_repo() # Will raise exception if sync fails
576
 
577
  agents = []
@@ -705,12 +1020,21 @@ def calculate_monthly_metrics_by_agent(all_metadata_dict, agents):
705
  }
706
 
707
 
708
- def construct_leaderboard_from_metadata(all_metadata_dict, agents):
709
- """Construct leaderboard from in-memory issue metadata."""
 
 
 
 
 
 
710
  if not agents:
711
  print("Error: No agents found")
712
  return {}
713
 
 
 
 
714
  cache_dict = {}
715
 
716
  for agent in agents:
@@ -720,18 +1044,22 @@ def construct_leaderboard_from_metadata(all_metadata_dict, agents):
720
  bot_metadata = all_metadata_dict.get(identifier, [])
721
  stats = calculate_issue_stats_from_metadata(bot_metadata)
722
 
 
 
 
723
  cache_dict[identifier] = {
724
  'name': agent_name,
725
  'website': agent.get('website', 'N/A'),
726
  'github_identifier': identifier,
727
- **stats
 
728
  }
729
 
730
  return cache_dict
731
 
732
 
733
- def save_leaderboard_data_to_hf(leaderboard_dict, monthly_metrics):
734
- """Save leaderboard data and monthly metrics to HuggingFace dataset."""
735
  try:
736
  token = get_hf_token()
737
  if not token:
@@ -739,13 +1067,20 @@ def save_leaderboard_data_to_hf(leaderboard_dict, monthly_metrics):
739
 
740
  api = HfApi(token=token)
741
 
 
 
 
742
  combined_data = {
743
- 'last_updated': datetime.now(timezone.utc).isoformat(),
 
 
 
 
 
 
744
  'leaderboard': leaderboard_dict,
745
  'monthly_metrics': monthly_metrics,
746
- 'metadata': {
747
- 'leaderboard_time_frame_days': LEADERBOARD_TIME_FRAME_DAYS
748
- }
749
  }
750
 
751
  with open(LEADERBOARD_FILENAME, 'w') as f:
@@ -809,11 +1144,15 @@ def mine_all_agents():
809
  start_date = end_date - timedelta(days=LEADERBOARD_TIME_FRAME_DAYS)
810
 
811
  try:
812
- # USE STREAMING FUNCTION FOR ISSUES
813
- all_metadata = fetch_all_issue_metadata_streaming(
814
  conn, identifiers, start_date, end_date
815
  )
816
 
 
 
 
 
817
  except Exception as e:
818
  print(f"Error during DuckDB fetch: {str(e)}")
819
  traceback.print_exc()
@@ -824,9 +1163,9 @@ def mine_all_agents():
824
  print(f"\n[4/4] Saving leaderboard...")
825
 
826
  try:
827
- leaderboard_dict = construct_leaderboard_from_metadata(all_metadata, agents)
828
- monthly_metrics = calculate_monthly_metrics_by_agent(all_metadata, agents)
829
- save_leaderboard_data_to_hf(leaderboard_dict, monthly_metrics)
830
 
831
  except Exception as e:
832
  print(f"Error saving leaderboard: {str(e)}")
 
25
  # CONFIGURATION
26
  # =============================================================================
27
 
28
+ AGENTS_REPO = "SWE-Arena/bot_metadata"
29
+ AGENTS_REPO_LOCAL_PATH = os.path.expanduser("~/bot_metadata") # Local git clone path
30
  DUCKDB_CACHE_FILE = "cache.duckdb"
31
  GHARCHIVE_DATA_LOCAL_PATH = os.path.expanduser("~/gharchive/data")
32
  LEADERBOARD_FILENAME = f"{os.getenv('COMPOSE_PROJECT_NAME')}.json"
33
+ LEADERBOARD_REPO = "SWE-Arena/leaderboard_metadata"
34
  LEADERBOARD_TIME_FRAME_DAYS = 180
35
+ LONGSTANDING_GAP_DAYS = 30 # Minimum days for an issue to be considered long-standing
36
+
37
+ # GitHub organizations and repositories to track for wanted issues
38
+ TRACKED_ORGS = [
39
+ "apache",
40
+ "github",
41
+ "huggingface",
42
+ ]
43
+
44
+ # Labels that indicate "patch wanted" status
45
+ PATCH_WANTED_LABELS = [
46
+ "bug",
47
+ "enhancement",
48
+ ]
49
 
50
  # Git sync configuration (mandatory to get latest bot data)
51
  GIT_SYNC_TIMEOUT = 300 # 5 minutes timeout for git pull
 
523
  return dict(metadata_by_agent)
524
 
525
 
526
+ def fetch_unified_issue_metadata_streaming(conn, identifiers, start_date, end_date):
527
+ """
528
+ UNIFIED: Fetch both agent-assigned issues AND wanted issues using streaming batch processing.
529
+
530
+ Tracks TWO types of issues:
531
+ 1. Agent-assigned issues: Issues where agents are assigned to or commented on
532
+ 2. Wanted issues: Long-standing issues from tracked orgs linked to merged PRs by agents
533
+
534
+ Args:
535
+ conn: DuckDB connection instance
536
+ identifiers: List of GitHub usernames/bot identifiers
537
+ start_date: Start datetime (timezone-aware)
538
+ end_date: End datetime (timezone-aware)
539
+
540
+ Returns:
541
+ Dictionary with three keys:
542
+ - 'agent_issues': {agent_id: [issue_metadata]} for agent-assigned issues
543
+ - 'wanted_open': [open_wanted_issues] for long-standing open issues
544
+ - 'wanted_resolved': {agent_id: [resolved_wanted]} for resolved wanted issues
545
+ """
546
+ # First, get agent-assigned issues using existing function
547
+ print(f" [1/2] Fetching agent-assigned/commented issues...")
548
+ agent_issues = fetch_all_issue_metadata_streaming(conn, identifiers, start_date, end_date)
549
+
550
+ # Now fetch wanted issues
551
+ print(f"\n [2/2] Fetching wanted issues from tracked orgs...")
552
+ identifier_set = set(identifiers)
553
+
554
+ # Storage for wanted issues
555
+ all_issues = {} # issue_url -> issue_metadata
556
+ issue_to_prs = defaultdict(set) # issue_url -> set of PR URLs
557
+ pr_creators = {} # pr_url -> creator login
558
+ pr_merged_at = {} # pr_url -> merged_at timestamp
559
+
560
+ # Calculate total batches
561
+ total_days = (end_date - start_date).days
562
+ total_batches = (total_days // BATCH_SIZE_DAYS) + 1
563
+
564
+ # Process in batches
565
+ current_date = start_date
566
+ batch_num = 0
567
+
568
+ print(f" Streaming {total_batches} batches for wanted issues...")
569
+
570
+ while current_date <= end_date:
571
+ batch_num += 1
572
+ batch_end = min(current_date + timedelta(days=BATCH_SIZE_DAYS - 1), end_date)
573
+
574
+ # Get file patterns for THIS BATCH ONLY
575
+ file_patterns = generate_file_path_patterns(current_date, batch_end)
576
+
577
+ if not file_patterns:
578
+ print(f" Batch {batch_num}/{total_batches}: {current_date.date()} to {batch_end.date()} - NO DATA")
579
+ current_date = batch_end + timedelta(days=1)
580
+ continue
581
+
582
+ # Progress indicator
583
+ print(f" Batch {batch_num}/{total_batches}: {current_date.date()} to {batch_end.date()} ({len(file_patterns)} files)... ", end="", flush=True)
584
+
585
+ # Build file patterns SQL for THIS BATCH
586
+ file_patterns_sql = '[' + ', '.join([f"'{fp}'" for fp in file_patterns]) + ']'
587
+
588
+ try:
589
+ # Create temp view from file read (done ONCE per batch)
590
+ conn.execute(f"""
591
+ CREATE OR REPLACE TEMP VIEW batch_data AS
592
+ SELECT *
593
+ FROM read_json({file_patterns_sql}, union_by_name=true, filename=true, compression='gzip', format='newline_delimited', ignore_errors=true, maximum_object_size=2147483648)
594
+ """)
595
+
596
+ # Query 1: Fetch all issues (NOT PRs) from tracked orgs
597
+ issue_query = """
598
+ SELECT
599
+ json_extract_string(payload, '$.issue.html_url') as issue_url,
600
+ json_extract_string(repo, '$.name') as repo_name,
601
+ json_extract_string(payload, '$.issue.title') as title,
602
+ json_extract_string(payload, '$.issue.number') as issue_number,
603
+ MIN(json_extract_string(payload, '$.issue.created_at')) as created_at,
604
+ MAX(json_extract_string(payload, '$.issue.closed_at')) as closed_at,
605
+ json_extract(payload, '$.issue.labels') as labels
606
+ FROM batch_data
607
+ WHERE
608
+ type IN ('IssuesEvent', 'IssueCommentEvent')
609
+ AND json_extract_string(payload, '$.issue.pull_request') IS NULL
610
+ AND json_extract_string(payload, '$.issue.html_url') IS NOT NULL
611
+ GROUP BY issue_url, repo_name, title, issue_number, labels
612
+ """
613
+
614
+ issue_results = conn.execute(issue_query).fetchall()
615
+
616
+ # Filter issues by tracked orgs and collect them
617
+ for row in issue_results:
618
+ issue_url = row[0]
619
+ repo_name = row[1]
620
+ title = row[2]
621
+ issue_number = row[3]
622
+ created_at = row[4]
623
+ closed_at = row[5]
624
+ labels_json = row[6]
625
+
626
+ if not issue_url or not repo_name:
627
+ continue
628
+
629
+ # Extract org from repo_name
630
+ parts = repo_name.split('/')
631
+ if len(parts) != 2:
632
+ continue
633
+ org = parts[0]
634
+
635
+ # Filter by tracked orgs
636
+ if org not in TRACKED_ORGS:
637
+ continue
638
+
639
+ # Parse labels
640
+ try:
641
+ if isinstance(labels_json, str):
642
+ labels_data = json.loads(labels_json)
643
+ else:
644
+ labels_data = labels_json
645
+
646
+ if not isinstance(labels_data, list):
647
+ label_names = []
648
+ else:
649
+ label_names = [label.get('name', '').lower() for label in labels_data if isinstance(label, dict)]
650
+
651
+ except (json.JSONDecodeError, TypeError):
652
+ label_names = []
653
+
654
+ # Determine state
655
+ normalized_closed_at = normalize_date_format(closed_at) if closed_at else None
656
+ state = 'closed' if (normalized_closed_at and normalized_closed_at != 'N/A') else 'open'
657
+
658
+ # Store issue metadata
659
+ all_issues[issue_url] = {
660
+ 'url': issue_url,
661
+ 'repo': repo_name,
662
+ 'title': title,
663
+ 'number': issue_number,
664
+ 'state': state,
665
+ 'created_at': normalize_date_format(created_at),
666
+ 'closed_at': normalized_closed_at,
667
+ 'labels': label_names
668
+ }
669
+
670
+ # Query 2: Find PRs from both IssueCommentEvent and PullRequestEvent
671
+ pr_query = """
672
+ SELECT DISTINCT
673
+ COALESCE(
674
+ json_extract_string(payload, '$.issue.html_url'),
675
+ json_extract_string(payload, '$.pull_request.html_url')
676
+ ) as pr_url,
677
+ COALESCE(
678
+ json_extract_string(payload, '$.issue.user.login'),
679
+ json_extract_string(payload, '$.pull_request.user.login')
680
+ ) as pr_creator,
681
+ COALESCE(
682
+ json_extract_string(payload, '$.issue.pull_request.merged_at'),
683
+ json_extract_string(payload, '$.pull_request.merged_at')
684
+ ) as merged_at,
685
+ COALESCE(
686
+ json_extract_string(payload, '$.issue.body'),
687
+ json_extract_string(payload, '$.pull_request.body')
688
+ ) as pr_body
689
+ FROM batch_data
690
+ WHERE
691
+ (type = 'IssueCommentEvent' AND json_extract_string(payload, '$.issue.pull_request') IS NOT NULL)
692
+ OR type = 'PullRequestEvent'
693
+ """
694
+
695
+ pr_results = conn.execute(pr_query).fetchall()
696
+
697
+ for row in pr_results:
698
+ pr_url = row[0]
699
+ pr_creator = row[1]
700
+ merged_at = row[2]
701
+ pr_body = row[3]
702
+
703
+ if not pr_url or not pr_creator:
704
+ continue
705
+
706
+ pr_creators[pr_url] = pr_creator
707
+ pr_merged_at[pr_url] = merged_at
708
+
709
+ # Extract linked issues from PR body
710
+ if pr_body:
711
+ # Match issue URLs or #number references
712
+ issue_refs = re.findall(r'(?:https?://github\.com/[\w-]+/[\w-]+/issues/\d+)|(?:#\d+)', pr_body, re.IGNORECASE)
713
+
714
+ for ref in issue_refs:
715
+ # Convert #number to full URL if needed
716
+ if ref.startswith('#'):
717
+ # Extract org/repo from PR URL
718
+ pr_parts = pr_url.split('/')
719
+ if len(pr_parts) >= 5:
720
+ org = pr_parts[-4]
721
+ repo = pr_parts[-3]
722
+ issue_num = ref[1:]
723
+ issue_url = f"https://github.com/{org}/{repo}/issues/{issue_num}"
724
+ issue_to_prs[issue_url].add(pr_url)
725
+ else:
726
+ issue_to_prs[ref].add(pr_url)
727
+
728
+ print(f"✓ {len(issue_results)} issues, {len(pr_results)} PRs")
729
+
730
+ # Clean up temp view after batch processing
731
+ conn.execute("DROP VIEW IF EXISTS batch_data")
732
+
733
+ except Exception as e:
734
+ print(f"\n ✗ Batch {batch_num} error: {str(e)}")
735
+ traceback.print_exc()
736
+ # Clean up temp view even on error
737
+ try:
738
+ conn.execute("DROP VIEW IF EXISTS batch_data")
739
+ except:
740
+ pass
741
+
742
+ # Move to next batch
743
+ current_date = batch_end + timedelta(days=1)
744
+
745
+ # Post-processing: Filter issues and assign to agents
746
+ print(f"\n Post-processing {len(all_issues)} wanted issues...")
747
+
748
+ wanted_open = []
749
+ wanted_resolved = defaultdict(list)
750
+ current_time = datetime.now(timezone.utc)
751
+
752
+ for issue_url, issue_meta in all_issues.items():
753
+ # Check if issue has linked PRs
754
+ linked_prs = issue_to_prs.get(issue_url, set())
755
+ if not linked_prs:
756
+ continue
757
+
758
+ # Check if any linked PR was merged AND created by an agent
759
+ resolved_by = None
760
+ for pr_url in linked_prs:
761
+ merged_at = pr_merged_at.get(pr_url)
762
+ if merged_at: # PR was merged
763
+ pr_creator = pr_creators.get(pr_url)
764
+ if pr_creator in identifier_set:
765
+ resolved_by = pr_creator
766
+ break
767
+
768
+ if not resolved_by:
769
+ continue
770
+
771
+ # Process based on issue state
772
+ if issue_meta['state'] == 'open':
773
+ # For open issues: check if labels match PATCH_WANTED_LABELS
774
+ issue_labels = issue_meta.get('labels', [])
775
+ has_patch_label = False
776
+ for issue_label in issue_labels:
777
+ for wanted_label in PATCH_WANTED_LABELS:
778
+ if wanted_label.lower() in issue_label:
779
+ has_patch_label = True
780
+ break
781
+ if has_patch_label:
782
+ break
783
+
784
+ if not has_patch_label:
785
+ continue
786
+
787
+ # Check if long-standing
788
+ created_at_str = issue_meta.get('created_at')
789
+ if created_at_str and created_at_str != 'N/A':
790
+ try:
791
+ created_dt = datetime.fromisoformat(created_at_str.replace('Z', '+00:00'))
792
+ days_open = (current_time - created_dt).days
793
+ if days_open >= LONGSTANDING_GAP_DAYS:
794
+ wanted_open.append(issue_meta)
795
+ except:
796
+ pass
797
+
798
+ elif issue_meta['state'] == 'closed':
799
+ # For closed issues: must be closed within time frame AND open 30+ days
800
+ closed_at_str = issue_meta.get('closed_at')
801
+ created_at_str = issue_meta.get('created_at')
802
+
803
+ if closed_at_str and closed_at_str != 'N/A' and created_at_str and created_at_str != 'N/A':
804
+ try:
805
+ closed_dt = datetime.fromisoformat(closed_at_str.replace('Z', '+00:00'))
806
+ created_dt = datetime.fromisoformat(created_at_str.replace('Z', '+00:00'))
807
+
808
+ # Calculate how long the issue was open
809
+ days_open = (closed_dt - created_dt).days
810
+
811
+ # Only include if closed within timeframe AND was open 30+ days
812
+ if start_date <= closed_dt <= end_date and days_open >= LONGSTANDING_GAP_DAYS:
813
+ wanted_resolved[resolved_by].append(issue_meta)
814
+ except:
815
+ pass
816
+
817
+ print(f" ✓ Found {len(wanted_open)} long-standing open wanted issues")
818
+ print(f" ✓ Found {sum(len(issues) for issues in wanted_resolved.values())} resolved wanted issues across {len(wanted_resolved)} agents")
819
+
820
+ return {
821
+ 'agent_issues': agent_issues,
822
+ 'wanted_open': wanted_open,
823
+ 'wanted_resolved': dict(wanted_resolved)
824
+ }
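The issue-linking step above depends on finding either full issue URLs or bare `#number` references in PR bodies and expanding the latter against the PR's own repository. A small standalone illustration of that expansion, using the same regular expression as the query post-processing above (the PR URL and body are made up):

```python
import re

pr_url = "https://github.com/apache/airflow/pull/4567"   # hypothetical PR
pr_body = "Fixes #123, see also https://github.com/apache/airflow/issues/99"

refs = re.findall(
    r'(?:https?://github\.com/[\w-]+/[\w-]+/issues/\d+)|(?:#\d+)',
    pr_body,
    re.IGNORECASE,
)

linked_issues = set()
for ref in refs:
    if ref.startswith('#'):
        # Expand "#123" against the PR's own org/repo, as in the batch loop above
        parts = pr_url.split('/')
        org, repo = parts[-4], parts[-3]
        linked_issues.add(f"https://github.com/{org}/{repo}/issues/{ref[1:]}")
    else:
        linked_issues.add(ref)

print(sorted(linked_issues))
# ['https://github.com/apache/airflow/issues/123',
#  'https://github.com/apache/airflow/issues/99']
```

A reference alone does not make an issue count as resolved: the tracked-organization, label, long-standing, and merged-PR checks are applied separately in the post-processing step above.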
825
+
826
+
827
  def sync_agents_repo():
828
  """
829
+ Sync local bot_metadata repository with remote using git pull.
830
  This is MANDATORY to ensure we have the latest bot data.
831
  Raises exception if sync fails.
832
  """
 
886
  ALWAYS syncs with remote first to ensure we have the latest bot data.
887
  """
888
  # MANDATORY: Sync with remote first to get latest bot data
889
+ print(f" Syncing bot_metadata repository to get latest agents...")
890
  sync_agents_repo() # Will raise exception if sync fails
891
 
892
  agents = []
 
1020
  }
1021
 
1022
 
1023
+ def construct_leaderboard_from_metadata(all_metadata_dict, agents, wanted_resolved_dict=None):
1024
+ """Construct leaderboard from in-memory issue metadata.
1025
+
1026
+ Args:
1027
+ all_metadata_dict: Dictionary mapping agent ID to list of issue metadata (agent-assigned issues)
1028
+ agents: List of agent metadata
1029
+ wanted_resolved_dict: Optional dictionary mapping agent ID to list of resolved wanted issues
1030
+ """
1031
  if not agents:
1032
  print("Error: No agents found")
1033
  return {}
1034
 
1035
+ if wanted_resolved_dict is None:
1036
+ wanted_resolved_dict = {}
1037
+
1038
  cache_dict = {}
1039
 
1040
  for agent in agents:
 
1044
  bot_metadata = all_metadata_dict.get(identifier, [])
1045
  stats = calculate_issue_stats_from_metadata(bot_metadata)
1046
 
1047
+ # Add wanted issues count
1048
+ resolved_wanted = len(wanted_resolved_dict.get(identifier, []))
1049
+
1050
  cache_dict[identifier] = {
1051
  'name': agent_name,
1052
  'website': agent.get('website', 'N/A'),
1053
  'github_identifier': identifier,
1054
+ **stats,
1055
+ 'resolved_wanted_issues': resolved_wanted
1056
  }
1057
 
1058
  return cache_dict
1059
 
1060
 
1061
+ def save_leaderboard_data_to_hf(leaderboard_dict, monthly_metrics, wanted_issues=None):
1062
+ """Save leaderboard data, monthly metrics, and wanted issues to HuggingFace dataset."""
1063
  try:
1064
  token = get_hf_token()
1065
  if not token:
 
1067
 
1068
  api = HfApi(token=token)
1069
 
1070
+ if wanted_issues is None:
1071
+ wanted_issues = []
1072
+
1073
  combined_data = {
1074
+ 'metadata': {
1075
+ 'last_updated': datetime.now(timezone.utc).isoformat(),
1076
+ 'leaderboard_time_frame_days': LEADERBOARD_TIME_FRAME_DAYS,
1077
+ 'longstanding_gap_days': LONGSTANDING_GAP_DAYS,
1078
+ 'tracked_orgs': TRACKED_ORGS,
1079
+ 'patch_wanted_labels': PATCH_WANTED_LABELS
1080
+ },
1081
  'leaderboard': leaderboard_dict,
1082
  'monthly_metrics': monthly_metrics,
1083
+ 'wanted_issues': wanted_issues
 
 
1084
  }
1085
 
1086
  with open(LEADERBOARD_FILENAME, 'w') as f:
 
1144
  start_date = end_date - timedelta(days=LEADERBOARD_TIME_FRAME_DAYS)
1145
 
1146
  try:
1147
+ # USE UNIFIED STREAMING FUNCTION FOR BOTH ISSUE TYPES
1148
+ results = fetch_unified_issue_metadata_streaming(
1149
  conn, identifiers, start_date, end_date
1150
  )
1151
 
1152
+ agent_issues = results['agent_issues']
1153
+ wanted_open = results['wanted_open']
1154
+ wanted_resolved = results['wanted_resolved']
1155
+
1156
  except Exception as e:
1157
  print(f"Error during DuckDB fetch: {str(e)}")
1158
  traceback.print_exc()
 
1163
  print(f"\n[4/4] Saving leaderboard...")
1164
 
1165
  try:
1166
+ leaderboard_dict = construct_leaderboard_from_metadata(agent_issues, agents, wanted_resolved)
1167
+ monthly_metrics = calculate_monthly_metrics_by_agent(agent_issues, agents)
1168
+ save_leaderboard_data_to_hf(leaderboard_dict, monthly_metrics, wanted_open)
1169
 
1170
  except Exception as e:
1171
  print(f"Error saving leaderboard: {str(e)}")
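Finally, to make the data contract between `msr.py` and `app.py` explicit: `save_leaderboard_data_to_hf` writes a single JSON document that `app.py` later reads for both the leaderboard table and the Issues Wanted tab. A sketch of its layout, with made-up values (only the key structure is taken from the code above):

```python
# Illustrative layout of the file written by save_leaderboard_data_to_hf();
# all values are invented, only the keys mirror the code above.
leaderboard_file = {
    "metadata": {
        "last_updated": "2025-01-03T00:00:00+00:00",
        "leaderboard_time_frame_days": 180,
        "longstanding_gap_days": 30,
        "tracked_orgs": ["apache", "github", "huggingface"],
        "patch_wanted_labels": ["bug", "enhancement"],
    },
    "leaderboard": {
        "examplebot[bot]": {
            "name": "ExampleBot",
            "website": "https://example.com",
            "github_identifier": "examplebot[bot]",
            # plus the per-agent stats from calculate_issue_stats_from_metadata(),
            # e.g. total_issues / resolved_issues / resolved_rate (names inferred from app.py)
            "resolved_wanted_issues": 3,
        },
    },
    "monthly_metrics": {},   # per-agent monthly series (structure not shown in this diff)
    "wanted_issues": [
        {
            "url": "https://github.com/apache/airflow/issues/123",
            "repo": "apache/airflow",
            "title": "Example long-standing bug",
            "number": "123",
            "state": "open",
            "created_at": "2024-10-01T12:00:00+00:00",
            "closed_at": None,
            "labels": ["bug"],
        },
    ],
}
```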