Update README.md
## Intended Usage & Model Info

`jina-embedding-b-en-v2` is an English, monolingual **embedding model** supporting an **8192** sequence length.
It is based on a BERT architecture (JinaBert) that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to handle longer sequences.
The backbone `jina-bert-b-en-v2` is pretrained on the C4 dataset.
The model is further trained on Jina AI's collection of more than 400 million sentence pairs and hard negatives.
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.

The embedding model was trained with a 512-token sequence length, but it extrapolates to 8k (and even longer) sequence lengths thanks to ALiBi.
This makes our model useful for a range of use cases in which long documents must be processed, including long document retrieval, semantic textual similarity, text reranking, recommendation, RAG, and LLM-based generative search.
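
To make the extrapolation concrete, here is a minimal sketch (our illustration, not the model's actual implementation) of a symmetric bidirectional ALiBi bias: each attention logit is offset by a per-head linear penalty on the distance between query and key positions, which is well defined for any sequence length:

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # Per-head slopes form a geometric sequence, as in the ALiBi paper:
    # 2^(-8/num_heads), 2^(-16/num_heads), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = torch.arange(seq_len)
    # Symmetric (bidirectional) variant: the penalty depends only on |i - j|,
    # so a token is biased the same way toward its left and right context.
    distances = (positions[None, :] - positions[:, None]).abs().float()
    # Shape (num_heads, seq_len, seq_len); added to attention logits before softmax.
    return -slopes[:, None, None] * distances[None, :, :]

bias = alibi_bias(num_heads=12, seq_len=8)
print(bias.shape)  # torch.Size([12, 8, 8])
```

Because the bias is a simple function of distance rather than a learned table of positions, nothing breaks when the sequence length at inference exceeds the 512 tokens seen in training.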

With a standard size of 137 million parameters, the model enables fast inference while delivering better performance than our small model. We recommend running inference on a single GPU.
Additionally, we provide the following embedding models, all of which also support an 8k sequence length:

### V1 (Based on T5)

- [`jina-embedding-s-en-v1`](https://huggingface.co/jinaai/jina-embedding-s-en-v1): 35 million parameters.
- [`jina-embedding-b-en-v1`](https://huggingface.co/jinaai/jina-embedding-b-en-v1): 110 million parameters.
- [`jina-embedding-l-en-v1`](https://huggingface.co/jinaai/jina-embedding-l-en-v1): 330 million parameters.

### V2 (Based on JinaBert)

- [`jina-embedding-s-en-v2`](https://huggingface.co/jinaai/jina-embedding-s-en-v2): 33 million parameters.
- [`jina-embedding-b-en-v2`](https://huggingface.co/jinaai/jina-embedding-b-en-v2): 137 million parameters **(you are here)**.
- [`jina-embedding-l-en-v2`](https://huggingface.co/jinaai/jina-embedding-l-en-v2): 435 million parameters.

## Data & Parameters

The Jina Embedding V2 technical report is coming soon.
In the meantime, the Jina Embedding V1 [technical report](https://arxiv.org/abs/2307.11224) is available on arXiv.

## Usage

You can use the model directly through the `transformers` package; the `encode` helper ships with the model's remote code:

```python
from transformers import AutoModel
from numpy.linalg import norm

# Cosine similarity between two embedding vectors.
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))

model = AutoModel.from_pretrained('jinaai/jina-embedding-b-en-v2', trust_remote_code=True)
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
print(cos_sim(embeddings[0], embeddings[1]))
```

If you only need to handle shorter sequences, such as 2k, pass the `max_length` parameter to the `encode` function:

```python
embeddings = model.encode(
    ['Very long ... document'],
    max_length=2048
)
```

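Note that, as with the underlying `transformers` tokenizer, `max_length` counts tokens rather than characters; inputs longer than this limit are truncated before encoding.
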
## Fine-tuning

If you want to adapt the model to your own data, please consider [Finetuner](https://github.com/jina-ai/finetuner).
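
As a rough sketch of what that could look like (the backbone and dataset names below are assumptions for illustration, not a confirmed recipe; consult the Finetuner documentation for supported models and the exact API):

```python
import finetuner

finetuner.login()  # Finetuner runs fine-tuning jobs on Jina AI Cloud

run = finetuner.fit(
    # Assumption: this backbone is available in Finetuner's model registry.
    model='jinaai/jina-embedding-b-en-v2',
    # Hypothetical name of a training dataset pushed to the cloud beforehand.
    train_data='my-sentence-pairs',
)
print(run.status())
```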

## Plans

The development of new bilingual models is currently underway; we will mainly target German and Spanish. The upcoming models will be called `jina-embedding-b-de/es-v2`.

## Contact