MLPs at the EOC: Spectrum of the NTK

arXiv:2501.13225v1 Announce Type: new
Abstract: We study the properties of the Neural Tangent Kernel (NTK) $\overset{\scriptscriptstyle\infty}{K} : \mathbb{R}^{m_0} \times \mathbb{R}^{m_0} \to \mathbb{R}^{m_l \times m_l}$ corresponding to infinitely wide $l$-layer Multilayer Perceptrons (MLPs) taking inputs from $\mathbb{R}^{m_0}$ to outputs in $\mathbb{R}^{m_l}$, equipped with activation functions $\phi(s) = a s + b \vert s \vert$ for some $a, b \in \mathbb{R}$ and initialized at the Edge of Chaos (EOC). We find that the entries $\overset{\scriptscriptstyle\infty}{K}(x_1, x_2)$ can be approximated increasingly well, as the depth $l$ increases, by the inverses of the cosine distances of the activations corresponding to $x_1$ and $x_2$. By quantifying these inverse cosine distances and the spectrum of the matrix containing them, we obtain tight spectral bounds for the NTK matrix $\overset{\scriptscriptstyle\infty}{K} = [\frac{1}{n} \overset{\scriptscriptstyle\infty}{K}(x_{i_1}, x_{i_2}) : i_1, i_2 \in [1:n]]$ over a dataset $\{x_1, \cdots, x_n\} \subset \mathbb{R}^{m_0}$, transferred from the inverse cosine distance matrix via our approximation result. Our results show that $\Delta_\phi = \frac{b^2}{a^2+b^2}$ determines the rate at which the condition number of the NTK matrix converges to its limit as depth increases, implying in particular that the absolute value ($\Delta_\phi = 1$) is better than the ReLU ($\Delta_\phi = \frac{1}{2}$) in this regard.
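As a quick sanity check of the stated $\Delta_\phi$ values: the ReLU belongs to this activation family with $a = b = \frac{1}{2}$, since $\max(s, 0) = \frac{1}{2} s + \frac{1}{2} \vert s \vert$, which gives $\Delta_\phi = \frac{(1/2)^2}{(1/2)^2 + (1/2)^2} = \frac{1}{2}$; the absolute value itself is the case $a = 0$, $b = 1$, giving $\Delta_\phi = \frac{1}{0 + 1} = 1$.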


