---
base_model:
- Snowflake/snowflake-arctic-embed-m-long
---

# CodeRankEmbed

`CodeRankEmbed` is a 137M-parameter bi-encoder with an 8192-token context length for code retrieval. It significantly outperforms open-source and proprietary code embedding models across a range of code retrieval tasks.

# Performance Benchmarks

| Name              | Parameters | CSN      | CoIR     |
| :---------------: | :--------- | :------- | :------: |
| **CodeRankEmbed** | 137M       | **77.9** | **60.1** |
| CodeSage-Large    | 1.3B       | 71.2     | 59.4     |
| Jina-Code-v2      | 161M       | 67.2     | 58.4     |
| CodeT5+           | 110M       | 74.2     | 45.9     |
| Voyage-Code-002   | Unknown    | 68.5     | 56.3     |

# Usage

**Important**: the query prompt *must* include the following *task instruction prefix*: "Represent this query for searching relevant code"

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cornstack/CodeRankEmbed", trust_remote_code=True)
queries = ['Represent this query for searching relevant code: Calculate the n-th Fibonacci number']
codes = ["""def func(n):
    if n <= 0:
        return "Input should be a positive integer"
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        a, b = 0, 1
        for _ in range(2, n):
            a, b = b, a + b
        return b
"""]
query_embeddings = model.encode(queries)
print(query_embeddings)
code_embeddings = model.encode(codes)
print(code_embeddings)
```
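To retrieve code for a query, rank the candidate snippets by cosine similarity between the query embedding and each code embedding. A minimal sketch of that ranking step in plain NumPy — the small placeholder vectors here stand in for real `model.encode(...)` output, which you would substitute in practice:

```python
import numpy as np

# Placeholder embeddings; in practice these come from model.encode(...)
query_embeddings = np.array([[0.1, 0.9, 0.2]])
code_embeddings = np.array([[0.1, 0.8, 0.3],
                            [0.9, 0.1, 0.0]])

def cosine_rank(queries, codes):
    # L2-normalize each row, then a dot product yields cosine similarities.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    sims = q @ c.T  # shape: (num_queries, num_codes)
    # For each query, code indices sorted by descending similarity.
    return np.argsort(-sims, axis=1), sims

ranking, sims = cosine_rank(query_embeddings, code_embeddings)
print(ranking[0])  # best-matching snippet index first
```

Recent versions of `sentence-transformers` also expose a `model.similarity(...)` helper that computes the same score matrix directly from the two embedding arrays.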