Commit 3f0ff07 (parent: eb023e7)

Update Space (evaluate main: af3c3056)
Files changed:
- README.md +38 -3
- comet.py +36 -10
- requirements.txt +1 -1
README.md
CHANGED
@@ -36,7 +36,11 @@ reference = ["They were able to control the fire.", "Schools and kindergartens o
 comet_score = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
 ```
 
-It has several configurations, named after the COMET model to be used.
+It has several configurations, named after the COMET model to be used. For versions below 2.0 it will default to `wmt20-comet-da` (previously known as `wmt-large-da-estimator-1719`) and for the latest versions (>= 2.0) it will default to `Unbabel/wmt22-comet-da`.
+
+Alternative models that can be chosen include `wmt20-comet-qe-da`, `wmt21-comet-mqm`, `wmt21-cometinho-da`, `wmt21-comet-qe-mqm` and `emnlp20-comet-rank`. Notably, a distilled model is also available, which is 80% smaller and 2.128x faster while performing close to its non-distilled alternatives. You can use it with the identifier `eamt22-cometinho-da`. This version, called Cometinho, was elected [best paper](https://aclanthology.org/2022.eamt-1.9) at the annual European conference on Machine Translation.
+
+> NOTE: In `unbabel-comet>=2.0` all models were moved to the Hugging Face Hub and you need to add the prefix `Unbabel/` to be able to download and use them. For example, for the distilled version replace `eamt22-cometinho-da` with `Unbabel/eamt22-cometinho-da`.
 
 It also has several optional arguments:
 
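To illustrate the configuration behavior described in the hunk above: a minimal sketch, assuming `evaluate` and `unbabel-comet>=2.0` are installed (an explicit model identifier goes in as the second argument to `evaluate.load`).

```python
# Minimal sketch, assuming `evaluate` and `unbabel-comet>=2.0` are installed.
import evaluate

# Default configuration: resolves to Unbabel/wmt22-comet-da on comet >= 2.0.
comet_metric = evaluate.load("comet")

# Explicit configuration: with comet >= 2.0, model identifiers need the
# Unbabel/ prefix, e.g. the distilled Cometinho model mentioned above.
cometinho_metric = evaluate.load("comet", "Unbabel/eamt22-cometinho-da")
```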
@@ -44,7 +48,7 @@ It also has several optional arguments:
 
 `progress_bar`: a boolean -- if set to `True`, progress updates will be printed out. The default value is `False`.
 
-More information about model characteristics can be found on the [COMET website](https://unbabel.github.io/COMET/html/
+More information about model characteristics can be found on the [COMET website](https://unbabel.github.io/COMET/html/index.html).
 
 ## Output values
 
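The optional arguments are passed straight through `compute`. A hedged sketch, assuming `comet_metric` was loaded as in the previous example and that the three inputs are parallel lists of equal length:

```python
# Hedged sketch of the optional arguments documented above; `comet_metric`
# comes from the earlier evaluate.load("comet") example.
source = ["Dem Feuer konnte Einhalt geboten werden"]
hypothesis = ["The fire could be stopped"]
reference = ["They were able to control the fire."]

comet_score = comet_metric.compute(
    predictions=hypothesis,
    references=reference,
    sources=source,
    progress_bar=True,  # print progress updates; the default is False
)
```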
@@ -107,9 +111,40 @@ Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, B
 
 Thus, results for language pairs containing uncovered languages are unreliable, as per the [COMET website](https://github.com/Unbabel/COMET)
 
-Also, calculating the COMET metric involves downloading the model from which features are obtained -- the default model, `
+Also, calculating the COMET metric involves downloading the model from which features are obtained -- the default model, `wmt22-comet-da`, takes over 2.32GB of storage space and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller model; for instance `eamt22-cometinho-da` is 344MB.
+
+### Interpreting Scores:
+
+When using COMET to evaluate machine translation, it's important to understand how to interpret the scores it produces.
+
+In general, COMET models are trained to predict quality scores for translations. These scores are typically normalized using a z-score transformation to account for individual differences among annotators. While the raw score itself does not have a direct interpretation, it is useful for ranking translations and systems according to their quality.
+
+However, for the latest COMET models like `Unbabel/wmt22-comet-da`, we have introduced a new training approach that scales the scores between 0 and 1. This makes it easier to interpret the scores: a score close to 1 indicates a high-quality translation, while a score close to 0 indicates a translation that is no better than random chance.
+
+It's worth noting that when using COMET to compare the performance of two different translation systems, it's important to run statistical significance measures to reliably compare scores between systems.
 
 ## Citation
+```bibtex
+@inproceedings{rei-etal-2022-comet,
+    title = "{COMET}-22: Unbabel-{IST} 2022 Submission for the Metrics Shared Task",
+    author = "Rei, Ricardo and
+      C. de Souza, Jos{\'e} G. and
+      Alves, Duarte and
+      Zerva, Chrysoula and
+      Farinha, Ana C and
+      Glushkova, Taisiya and
+      Lavie, Alon and
+      Coheur, Luisa and
+      Martins, Andr{\'e} F. T.",
+    booktitle = "Proceedings of the Seventh Conference on Machine Translation (WMT)",
+    month = dec,
+    year = "2022",
+    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2022.wmt-1.52",
+    pages = "578--585",
+}
+```
 
 ```bibtex
 @inproceedings{rei-EtAl:2020:WMT,
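On the statistical-significance point in the new text, a common choice is paired bootstrap resampling over the sentence-level scores returned by `compute`. The sketch below only illustrates the idea and is not part of this Space; `paired_bootstrap` is a hypothetical helper, and the two score lists are assumed to be segment-aligned:

```python
# Illustrative paired bootstrap over sentence-level COMET scores from two
# systems on the same test set; not part of this Space.
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resamples in which system A outscores system B."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample segments with replacement
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        wins += mean_a > mean_b
    return wins / n_resamples
```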
comet.py
CHANGED
@@ -22,7 +22,7 @@ Usage:
 from evaluate import load
 comet_metric = load('metrics/comet/comet.py')
 #comet_metric = load('comet')
-#comet_metric = load('comet', '
+#comet_metric = load('comet', 'Unbabel/wmt20-comet-da')
 
 
 source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
@@ -37,6 +37,7 @@ predictions['scores']
 import comet  # From: unbabel-comet
 import datasets
 import torch
+from packaging import version
 
 import evaluate
 
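The newly imported `packaging.version` provides PEP 440-aware comparisons, which the version gates later in this diff rely on. A quick illustration:

```python
from packaging import version

# Versions compare numerically, not lexicographically like raw strings.
assert version.parse("10.0") > version.parse("9.0")
# Pre-releases sort before the corresponding final release.
assert version.parse("2.0.0rc1") < version.parse("2.0.0")
```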
@@ -44,6 +45,25 @@ import evaluate
 logger = evaluate.logging.get_logger(__name__)
 
 _CITATION = """\
+@inproceedings{rei-etal-2022-comet,
+    title = "{COMET}-22: Unbabel-{IST} 2022 Submission for the Metrics Shared Task",
+    author = "Rei, Ricardo and
+      C. de Souza, Jos{\'e} G. and
+      Alves, Duarte and
+      Zerva, Chrysoula and
+      Farinha, Ana C and
+      Glushkova, Taisiya and
+      Lavie, Alon and
+      Coheur, Luisa and
+      Martins, Andr{\'e} F. T.",
+    booktitle = "Proceedings of the Seventh Conference on Machine Translation (WMT)",
+    month = dec,
+    year = "2022",
+    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2022.wmt-1.52",
+    pages = "578--585",
+}
 @inproceedings{rei-EtAl:2020:WMT,
     author = {Rei, Ricardo and Stewart, Craig and Farinha, Ana C and Lavie, Alon},
     title = {Unbabel's Participation in the WMT20 Metrics Shared Task},
@@ -85,13 +105,11 @@ Args:
     `sources` (list of str): Source sentences
     `predictions` (list of str): candidate translations
     `references` (list of str): reference translations
-    `
-    `
-    `model`: COMET model to be used. Will default to `wmt-large-da-estimator-1719` if None.
+    `gpus` (bool): Number of GPUs to use. 0 for CPU
+    `progress_bar` (bool): Flag that turns on and off the predict progress bar. Defaults to True
 
 Returns:
-
-    `scores`: List of scores.
+    Dict with all sentence-level scores (`scores` key) and a system-level score (`mean_score` key).
 
 Examples:
 
@@ -101,8 +119,8 @@ Examples:
     >>> hypothesis = ["The fire could be stopped", "Schools and kindergartens were open"]
     >>> reference = ["They were able to control the fire.", "Schools and kindergartens opened"]
     >>> results = comet_metric.compute(predictions=hypothesis, references=reference, sources=source)
-    >>> print([round(v,
-    [0.
+    >>> print([round(v, 3) for v in results["scores"]])
+    [0.839, 0.972]
 """
 
 
@@ -125,6 +143,7 @@ class COMET(evaluate.Metric):
             codebase_urls=["https://github.com/Unbabel/COMET"],
             reference_urls=[
                 "https://github.com/Unbabel/COMET",
+                "https://aclanthology.org/2022.wmt-1.52/",
                 "https://www.aclweb.org/anthology/2020.emnlp-main.213/",
                 "http://www.statmt.org/wmt20/pdf/2020.wmt-1.101.pdf6",
             ],
@@ -132,7 +151,10 @@ class COMET(evaluate.Metric):
 
     def _download_and_prepare(self, dl_manager):
         if self.config_name == "default":
-
+            if version.parse(comet.__version__) >= version.parse("2.0.0"):
+                self.scorer = comet.load_from_checkpoint(comet.download_model("Unbabel/wmt22-comet-da"))
+            else:
+                self.scorer = comet.load_from_checkpoint(comet.download_model("wmt20-comet-da"))
         else:
             self.scorer = comet.load_from_checkpoint(comet.download_model(self.config_name))
 
@@ -141,5 +163,9 @@ class COMET(evaluate.Metric):
         gpus = 1 if torch.cuda.is_available() else 0
         data = {"src": sources, "mt": predictions, "ref": references}
         data = [dict(zip(data, t)) for t in zip(*data.values())]
-
+        if version.parse(comet.__version__) >= version.parse("2.0.0"):
+            output = self.scorer.predict(data, gpus=gpus, progress_bar=progress_bar)
+            scores, mean_score = output.scores, output.system_score
+        else:
+            scores, mean_score = self.scorer.predict(data, gpus=gpus, progress_bar=progress_bar)
         return {"mean_score": mean_score, "scores": scores}
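With the hunk above, `_compute` returns the same dict shape on either comet version. A hedged end-to-end sketch of calling it through `evaluate`, assuming the libraries are installed and the default checkpoint download succeeds (it is several GB, per the README note):

```python
# Hedged end-to-end sketch; assumes `evaluate` and `unbabel-comet` are
# installed and the default model download succeeds.
import evaluate

comet_metric = evaluate.load("comet")
results = comet_metric.compute(
    predictions=["The fire could be stopped"],
    references=["They were able to control the fire."],
    sources=["Dem Feuer konnte Einhalt geboten werden"],
)
print(results["mean_score"])  # system-level score
print(results["scores"])      # one score per sentence
```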
requirements.txt
CHANGED
@@ -1,3 +1,3 @@
-git+https://github.com/huggingface/evaluate@
+git+https://github.com/huggingface/evaluate@af3c30561d840b83e54fc5f7150ea58046d6af69
 unbabel-comet
 torch