Commit 3f0ff07 (parent: eb023e7)

Update Space (evaluate main: af3c3056)
Files changed:
- README.md +38 -3
- comet.py +36 -10
- requirements.txt +1 -1
README.md
CHANGED
@@ -36,7 +36,11 @@ reference = ["They were able to control the fire.", "Schools and kindergartens o
 comet_score = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
 ```
 
-It has several configurations, named after the COMET model to be used.
+It has several configurations, named after the COMET model to be used. For versions below 2.0 it will default to `wmt20-comet-da` (previously known as `wmt-large-da-estimator-1719`) and for the latest versions (>= 2.0) it will default to `Unbabel/wmt22-comet-da`.
+
+Alternative models that can be chosen include `wmt20-comet-qe-da`, `wmt21-comet-mqm`, `wmt21-cometinho-da`, `wmt21-comet-qe-mqm` and `emnlp20-comet-rank`. Notably, a distilled model is also available, which is 80% smaller and 2.128x faster while performing close to its non-distilled alternatives. You can use it with the identifier `eamt22-cometinho-da`. This version, called Cometinho, was elected [best paper](https://aclanthology.org/2022.eamt-1.9) at the annual European conference on Machine Translation.
+
+> NOTE: In `unbabel-comet>=2.0` all models were moved to the Hugging Face Hub and you need to add the prefix `Unbabel/` to be able to download and use them. For example, for the distilled version replace `eamt22-cometinho-da` with `Unbabel/eamt22-cometinho-da`.
 
 It also has several optional arguments:
 
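To illustrate the configuration behavior described in the hunk above: a minimal sketch, assuming `evaluate` and `unbabel-comet>=2.0` are installed (an explicit model identifier goes in as the second argument to `evaluate.load`).

```python
# Minimal sketch, assuming `evaluate` and `unbabel-comet>=2.0` are installed.
import evaluate

# Default configuration: resolves to Unbabel/wmt22-comet-da on comet >= 2.0.
comet_metric = evaluate.load("comet")

# Explicit configuration: with comet >= 2.0, model identifiers need the
# Unbabel/ prefix, e.g. the distilled Cometinho model mentioned above.
cometinho_metric = evaluate.load("comet", "Unbabel/eamt22-cometinho-da")
```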
@@ -44,7 +48,7 @@ It also has several optional arguments:
 
 `progress_bar`: a boolean -- if set to `True`, progress updates will be printed out. The default value is `False`.
 
-More information about model characteristics can be found on the [COMET website](https://unbabel.github.io/COMET/html/
+More information about model characteristics can be found on the [COMET website](https://unbabel.github.io/COMET/html/index.html).
 
 ## Output values
 
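The optional arguments are passed straight through `compute`. A hedged sketch, assuming `comet_metric` was loaded as in the previous example and that the three inputs are parallel lists of equal length:

```python
# Hedged sketch of the optional arguments documented above; `comet_metric`
# comes from the earlier evaluate.load("comet") example.
source = ["Dem Feuer konnte Einhalt geboten werden"]
hypothesis = ["The fire could be stopped"]
reference = ["They were able to control the fire."]

comet_score = comet_metric.compute(
    predictions=hypothesis,
    references=reference,
    sources=source,
    progress_bar=True,  # print progress updates; the default is False
)
```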
@@ -107,9 +111,40 @@ Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, B
 
 Thus, results for language pairs containing uncovered languages are unreliable, as per the [COMET website](https://github.com/Unbabel/COMET)
 
-Also, calculating the COMET metric involves downloading the model from which features are obtained -- the default model, `
+Also, calculating the COMET metric involves downloading the model from which features are obtained -- the default model, `wmt22-comet-da`, takes over 2.32GB of storage space and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller model; for instance `eamt22-cometinho-da` is 344MB.
+
+### Interpreting Scores:
+
+When using COMET to evaluate machine translation, it's important to understand how to interpret the scores it produces.
+
+In general, COMET models are trained to predict quality scores for translations. These scores are typically normalized using a z-score transformation to account for individual differences among annotators. While the raw score itself does not have a direct interpretation, it is useful for ranking translations and systems according to their quality.
+
+However, for the latest COMET models like `Unbabel/wmt22-comet-da`, we have introduced a new training approach that scales the scores between 0 and 1. This makes it easier to interpret the scores: a score close to 1 indicates a high-quality translation, while a score close to 0 indicates a translation that is no better than random chance.
+
+It's worth noting that when using COMET to compare the performance of two different translation systems, it's important to run statistical significance measures to reliably compare scores between systems.
 
 ## Citation
+```bibtex
+@inproceedings{rei-etal-2022-comet,
+    title = "{COMET}-22: Unbabel-{IST} 2022 Submission for the Metrics Shared Task",
+    author = "Rei, Ricardo and
+      C. de Souza, Jos{\'e} G. and
+      Alves, Duarte and
+      Zerva, Chrysoula and
+      Farinha, Ana C and
+      Glushkova, Taisiya and
+      Lavie, Alon and
+      Coheur, Luisa and
+      Martins, Andr{\'e} F. T.",
+    booktitle = "Proceedings of the Seventh Conference on Machine Translation (WMT)",
+    month = dec,
+    year = "2022",
+    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2022.wmt-1.52",
+    pages = "578--585",
+}
+```
 
 ```bibtex
 @inproceedings{rei-EtAl:2020:WMT,
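On the statistical-significance point in the new text, a common choice is paired bootstrap resampling over the sentence-level scores returned by `compute`. The sketch below only illustrates the idea and is not part of this Space; `paired_bootstrap` is a hypothetical helper, and the two score lists are assumed to be segment-aligned:

```python
# Illustrative paired bootstrap over sentence-level COMET scores from two
# systems on the same test set; not part of this Space.
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resamples in which system A outscores system B."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample segments with replacement
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        wins += mean_a > mean_b
    return wins / n_resamples
```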
comet.py
CHANGED
@@ -22,7 +22,7 @@ Usage:
 from evaluate import load
 comet_metric = load('metrics/comet/comet.py')
 #comet_metric = load('comet')
-#comet_metric = load('comet', '
+#comet_metric = load('comet', 'Unbabel/wmt20-comet-da')
 
 
 source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
@@ -37,6 +37,7 @@ predictions['scores']
 import comet  # From: unbabel-comet
 import datasets
 import torch
+from packaging import version
 
 import evaluate
 
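The newly imported `packaging.version` provides PEP 440-aware comparisons, which the version gates later in this diff rely on. A quick illustration:

```python
from packaging import version

# Versions compare numerically, not lexicographically like raw strings.
assert version.parse("10.0") > version.parse("9.0")
# Pre-releases sort before the corresponding final release.
assert version.parse("2.0.0rc1") < version.parse("2.0.0")
```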
@@ -44,6 +45,25 @@ import evaluate
 logger = evaluate.logging.get_logger(__name__)
 
 _CITATION = """\
+@inproceedings{rei-etal-2022-comet,
+    title = "{COMET}-22: Unbabel-{IST} 2022 Submission for the Metrics Shared Task",
+    author = "Rei, Ricardo and
+      C. de Souza, Jos{\'e} G. and
+      Alves, Duarte and
+      Zerva, Chrysoula and
+      Farinha, Ana C and
+      Glushkova, Taisiya and
+      Lavie, Alon and
+      Coheur, Luisa and
+      Martins, Andr{\'e} F. T.",
+    booktitle = "Proceedings of the Seventh Conference on Machine Translation (WMT)",
+    month = dec,
+    year = "2022",
+    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2022.wmt-1.52",
+    pages = "578--585",
+}
 @inproceedings{rei-EtAl:2020:WMT,
     author = {Rei, Ricardo and Stewart, Craig and Farinha, Ana C and Lavie, Alon},
     title = {Unbabel's Participation in the WMT20 Metrics Shared Task},
@@ -85,13 +105,11 @@ Args:
     `sources` (list of str): Source sentences
     `predictions` (list of str): candidate translations
     `references` (list of str): reference translations
-    `
-    `
-    `model`: COMET model to be used. Will default to `wmt-large-da-estimator-1719` if None.
+    `gpus` (bool): Number of GPUs to use. 0 for CPU
+    `progress_bar` (bool): Flag that turns on and off the predict progress bar. Defaults to True
 
 Returns:
-
-    `scores`: List of scores.
+    Dict with all sentence-level scores (`scores` key) and a system-level score (`mean_score` key).
 
 Examples:
 
@@ -101,8 +119,8 @@ Examples:
     >>> hypothesis = ["The fire could be stopped", "Schools and kindergartens were open"]
     >>> reference = ["They were able to control the fire.", "Schools and kindergartens opened"]
     >>> results = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
-    >>> print([round(v,
-    [0.
+    >>> print([round(v, 3) for v in results["scores"]])
+    [0.839, 0.972]
 """
 
 
@@ -125,6 +143,7 @@ class COMET(evaluate.Metric):
             codebase_urls=["https://github.com/Unbabel/COMET"],
             reference_urls=[
                 "https://github.com/Unbabel/COMET",
+                "https://aclanthology.org/2022.wmt-1.52/",
                 "https://www.aclweb.org/anthology/2020.emnlp-main.213/",
                 "http://www.statmt.org/wmt20/pdf/2020.wmt-1.101.pdf6",
             ],
@@ -132,7 +151,10 @@ class COMET(evaluate.Metric):
 
     def _download_and_prepare(self, dl_manager):
         if self.config_name == "default":
-
+            if version.parse(comet.__version__) >= version.parse("2.0.0"):
+                self.scorer = comet.load_from_checkpoint(comet.download_model("Unbabel/wmt22-comet-da"))
+            else:
+                self.scorer = comet.load_from_checkpoint(comet.download_model("wmt20-comet-da"))
         else:
             self.scorer = comet.load_from_checkpoint(comet.download_model(self.config_name))
 
@@ -141,5 +163,9 @@ class COMET(evaluate.Metric):
         gpus = 1 if torch.cuda.is_available() else 0
         data = {"src": sources, "mt": predictions, "ref": references}
         data = [dict(zip(data, t)) for t in zip(*data.values())]
-
+        if version.parse(comet.__version__) >= version.parse("2.0.0"):
+            output = self.scorer.predict(data, gpus=gpus, progress_bar=progress_bar)
+            scores, mean_score = output.scores, output.system_score
+        else:
+            scores, mean_score = self.scorer.predict(data, gpus=gpus, progress_bar=progress_bar)
         return {"mean_score": mean_score, "scores": scores}
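With the hunk above, `_compute` returns the same dict shape on either comet version. A hedged end-to-end sketch of calling it through `evaluate`, assuming the libraries are installed and the default checkpoint download succeeds (it is several GB, per the README note):

```python
# Hedged end-to-end sketch; assumes `evaluate` and `unbabel-comet` are
# installed and the default model download succeeds.
import evaluate

comet_metric = evaluate.load("comet")
results = comet_metric.compute(
    predictions=["The fire could be stopped"],
    references=["They were able to control the fire."],
    sources=["Dem Feuer konnte Einhalt geboten werden"],
)
print(results["mean_score"])  # system-level score
print(results["scores"])      # one score per sentence
```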
requirements.txt
CHANGED
@@ -1,3 +1,3 @@
-git+https://github.com/huggingface/evaluate@
+git+https://github.com/huggingface/evaluate@af3c30561d840b83e54fc5f7150ea58046d6af69
 unbabel-comet
 torch