Why not experiment?

#1
by Dampfinchen - opened

Why does it always have to be 3B activated parameters? That's too little for good performance. My theory is that upping that to 6B would massively improve quality while still being very fast on mainstream systems.

You are arguing with people who know what they are doing.

It's a valid and obvious question to raise, though. The fact that people "know what they're doing" doesn't mean their goal is to maximize quality; these small MoE models are primarily built for speed. As a hint to @Dampfinchen: you can manually set the number of activated experts in your backend when running locally, so try increasing it and see for yourself whether it makes a difference.
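
Below is a minimal sketch of that suggestion using Hugging Face transformers, assuming a Mixtral/Qwen-MoE-style config that exposes `num_experts_per_tok`; the model id is a placeholder and the exact field name can differ per architecture, so check the model's config.json first.

```python
# Minimal sketch (not any model card's official recipe): override the number
# of routed experts activated per token when loading an MoE checkpoint.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-moe-model"  # placeholder MoE checkpoint

config = AutoConfig.from_pretrained(model_id)
print("default experts per token:", getattr(config, "num_experts_per_tok", None))

# Activate more experts per token (e.g. 12 instead of a default of 8).
# More active experts means more FLOPs per token, so expect lower throughput.
config.num_experts_per_tok = 12

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, config=config)

inputs = tokenizer("Explain mixture-of-experts routing briefly.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

If you run through llama.cpp or another local backend instead, look for that backend's own expert-count override option in its documentation.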

Couple of things:

  • Model size and the number of experts are decided experimentally from a range of candidate values.
  • A higher number of experts/active parameters does not correlate directly with performance (diminishing returns); the sweet spot tends to be empirical, depending on architecture, data, etc.
  • The MoE optimization goal also weighs the ability to run within certain compute budgets against quality (a 48B-A3B model should perform similarly to a 32B-40B-class dense model of similar architecture/data).
  • You can learn more here: https://www.cerebras.ai/blog/moe-guide-scale and also look into other MoE guides for the expert-activation / active-parameter math (a back-of-the-envelope version of that math is sketched after this list).
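
As a rough illustration of the active-parameter math mentioned above, here is a back-of-the-envelope sketch; all layer sizes below are made-up placeholders, not the configuration of any released 48B-A3B model.

```python
# Back-of-the-envelope MoE parameter counting (illustrative numbers only).
hidden_size = 2048
num_layers = 32
num_experts = 64        # routed experts per MoE layer
experts_per_token = 8   # top-k experts activated for each token
expert_ffn_size = 1024  # intermediate size of each expert MLP

# Attention (and other non-expert weights) are always active; rough estimate.
attn_params_per_layer = 4 * hidden_size * hidden_size

# Each expert is a gated MLP with up, gate, and down projections.
params_per_expert = 3 * hidden_size * expert_ffn_size

total_expert_params = num_layers * num_experts * params_per_expert
active_expert_params = num_layers * experts_per_token * params_per_expert
always_active_params = num_layers * attn_params_per_layer

total_params = always_active_params + total_expert_params
active_params = always_active_params + active_expert_params
print(f"total params:  ~{total_params / 1e9:.1f}B")
print(f"active params: ~{active_params / 1e9:.1f}B")
```

The takeaway: total parameters grow with the full expert pool, while per-token compute scales only with the top-k activated experts, which is why a large-total/small-active MoE can approach a mid-size dense model's quality at a fraction of the FLOPs per token.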
