Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization
James Oldfield and 7 other authors
Abstract: The Mixture of Experts (MoE) paradigm provides a powerful way to decompose dense layers into smaller, modular computations often more amenable to human interpretation, debugging, and editability. However, a major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization. In this paper, we propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models. $\mu$MoE layers enable scalable expert specialization by performing an implicit computation on prohibitively large weight tensors entirely in factorized form. Consequently, $\mu$MoEs (1) avoid the restrictively high inference-time costs of dense MoEs, yet (2) do not inherit the training issues of the popular sparse MoEs' discrete (non-differentiable) expert routing. We present both qualitative and quantitative evidence that scaling $\mu$MoE layers when fine-tuning foundation models for vision tasks leads to more specialized experts at the class level, further enabling manual bias correction in CelebA attribute classification. Finally, we show qualitative results demonstrating the expert specialism achieved when pre-training large GPT2 and MLP-Mixer models with parameter-matched $\mu$MoE blocks at every layer, maintaining comparable accuracy. Our code is available at: this https URL.
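To illustrate the core idea of computing a mixture of experts "entirely in factorized form", below is a minimal sketch of a CP-style factorized expert layer. It is an illustrative assumption of how such a layer could look, not the paper's exact implementation: the class name CPMuMoE, the factor matrices A/B/C, and all shapes and hyperparameters are hypothetical. The point it demonstrates is that the (n_experts, d_in, d_out) expert weight tensor is never materialized; the input and the soft expert coefficients are contracted directly with the factor matrices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CPMuMoE(nn.Module):
    """Sketch of a CP-factorized mixture-of-experts layer (hypothetical).

    The implicit dense expert tensor W would have shape
    (n_experts, d_in, d_out) with W[n, i, o] = sum_r A[n, r] B[i, r] C[o, r],
    but it is never built: the forward pass works only with A, B, C.
    """

    def __init__(self, d_in: int, d_out: int, n_experts: int, rank: int):
        super().__init__()
        self.gate = nn.Linear(d_in, n_experts)  # soft, differentiable expert routing
        self.A = nn.Parameter(torch.randn(n_experts, rank) * 0.02)  # expert-mode factor
        self.B = nn.Parameter(torch.randn(d_in, rank) * 0.02)       # input-mode factor
        self.C = nn.Parameter(torch.randn(d_out, rank) * 0.02)      # output-mode factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_in)
        a = F.softmax(self.gate(x), dim=-1)      # (batch, n_experts) expert coefficients
        # y = sum_n a_n (W_n x) collapses to C @ ((A^T a) * (B^T x)):
        expert_mix = a @ self.A                  # (batch, rank)
        input_mix = x @ self.B                   # (batch, rank)
        return (expert_mix * input_mix) @ self.C.T  # (batch, d_out)


if __name__ == "__main__":
    layer = CPMuMoE(d_in=64, d_out=32, n_experts=512, rank=16)
    y = layer(torch.randn(8, 64))
    print(y.shape)  # torch.Size([8, 32])
```

Because the expert count only enters through the (n_experts, rank) factor, scaling the number of experts adds parameters linearly in the rank rather than in d_in * d_out, which is what makes very large expert counts tractable in this sketch.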
Submission history
From: James Oldfield
[v1] Mon, 19 Feb 2024 21:20:22 UTC (6,300 KB)
[v2] Fri, 31 May 2024 14:04:05 UTC (20,047 KB)
[v3] Fri, 27 Sep 2024 23:01:28 UTC (21,979 KB)
[v4] Wed, 16 Oct 2024 20:53:46 UTC (22,084 KB)