Evaluation detail

#5
by Serpient - opened

I was reproducing some of your results last week and followed the generation config in your description: gen_length 2048, block_length 32, steps 32, temperature 0, eos_early_stop true. For datasets, I used zero-shot across all benchmarks. While I got comparable results on HumanEval and GSM8K, the accuracy I got on ARC-c is 89%, much lower than your reported 95.93%. Could you please share your detailed evaluation settings? That would be a lot of help.

Hi,

Thanks for your interest in our work and for taking the time to reproduce our results.

You are correct about most of the generation parameters (block_length=32, steps=32, temperature=0, eos_early_stop=True). One clarification: we used a longer gen_length of 32,768 in our evaluation. However, this is unlikely to be the source of the discrepancy on ARC-c, as its outputs are typically short.

The difference in accuracy might stem from two key parts of our evaluation setup: the prompt template and the answer parsing logic.

1. Prompt Template:
We wrapped the questions in a specific template to guide the model toward the required output format.

QUERY_PREFIX = "Answer the following multiple choice question.\n"
QUERY_SUFFIX = "The last line of your response should be of the following " \
               "format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD.\n"

messages = [{
  "role": "user",
  "content": QUERY_PREFIX + "Question: {question}\nA. {textA}\nB. {textB}\nC. {textC}\nD. {textD}\n" + QUERY_SUFFIX,
}]
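
For anyone rerunning this, here is a minimal sketch of how the template above could be filled in for a single question. The helper name build_messages and its arguments are hypothetical and not taken from our evaluation script; it assumes exactly four options ordered A to D, so items with a different number of options would need separate handling.

```python
QUERY_PREFIX = "Answer the following multiple choice question.\n"
QUERY_SUFFIX = (
    "The last line of your response should be of the following "
    "format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD.\n"
)


def build_messages(question, choices):
    """Fill the template for one item (hypothetical helper).

    `choices` is a list of four answer texts, ordered A to D.
    """
    if len(choices) != 4:
        raise ValueError("this sketch assumes exactly four options")
    # One "X. text" line per option, in A-D order.
    options = "".join(
        f"{letter}. {text}\n" for letter, text in zip("ABCD", choices)
    )
    content = QUERY_PREFIX + f"Question: {question}\n" + options + QUERY_SUFFIX
    return [{"role": "user", "content": content}]
```

The returned list has the same shape as the messages object above, so it can be passed to whatever chat template the model uses.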

2. Answer Parsing:
We extract the answer using the first match from the following pattern:

answer_pattern = r"(?i)Answer[ \t]*:[ \t]*\$?([A-D])\$?"
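
A minimal sketch of first-match extraction with this pattern, using Python's standard re module. The function names extract_answer and accuracy are hypothetical, and treating a response with no match as incorrect is an assumption of this sketch rather than something stated above.

```python
import re

answer_pattern = r"(?i)Answer[ \t]*:[ \t]*\$?([A-D])\$?"


def extract_answer(response):
    """Return the first matched letter in upper case, or None if no match."""
    # re.search scans left to right, so this is the first match in the text.
    match = re.search(answer_pattern, response)
    # (?i) lets the group match lower-case letters too, so normalize the case.
    return match.group(1).upper() if match else None


def accuracy(responses, gold_letters):
    """Fraction of responses whose extracted letter equals the gold letter.

    Assumption: a response with no parsable answer counts as wrong.
    """
    correct = sum(
        extract_answer(r) == g for r, g in zip(responses, gold_letters)
    )
    return correct / len(gold_letters)
```

Because only the first match is used, a response that restates "Answer: X" more than once is scored on the earliest occurrence.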

For completeness, our evaluation script also includes a threshold parameter, which is set to 0.95.
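
Putting the settings from this reply in one place, as a plain dictionary. The variable name generation_config is hypothetical, and how these values are passed to the generation call depends on the evaluation script, which is not shown here.

```python
# Settings as stated in this thread; the dict itself is only a summary,
# not the interface of the evaluation script.
generation_config = {
    "gen_length": 32768,     # 32,768 in our evaluation (not 2048)
    "block_length": 32,
    "steps": 32,
    "temperature": 0,
    "eos_early_stop": True,
    "threshold": 0.95,
}
```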

We would appreciate it if you could rerun the evaluation with this template and share the outcome. We look forward to your update.

Thanks for the fast reply, this explains a lot.
