This title appears in the Scientific Report 2023.
Please use the identifier: http://dx.doi.org/10.34734/FZJ-2023-03175 in citations.
Please use the identifier: http://dx.doi.org/10.48550/ARXIV.2305.07715 in citations.
Optimal signal propagation in ResNets through residual scaling
Personal Name(s): Fischer, Kirsten (Corresponding author); Dahmen, David; Helias, Moritz
Contributing Institute: JARA-Institut Brain structure-function relationships; INM-10 Computational and Systems Neuroscience; IAS-6 Computational and Systems Neuroscience; INM-6
Imprint: arXiv, 2023
DOI: 10.34734/FZJ-2023-03175
DOI: 10.48550/ARXIV.2305.07715
Document Type: Preprint
Research Program: Advanced Computing Architectures; Theory of multi-scale neuronal networks; Transparent Deep Learning with Renormalized Flows; Emerging NC Architectures; Computational Principles; GRK 2416: MultiSenses-MultiScales: New approaches to elucidating neuronal multisensory integration; Recurrence and stochasticity for neuro-inspired computation
Link: Get full text (Open Access)
Publication portal: JuSER
Abstract: Residual networks (ResNets) have significantly better trainability, and thus performance, than feed-forward networks at large depth. Introducing skip connections facilitates signal propagation to deeper layers. In addition, previous works found that adding a scaling parameter for the residual branch further improves generalization performance. While they empirically identified a particularly beneficial range of values for this scaling parameter, the associated performance improvement and its universality across network hyperparameters have yet to be understood. For feed-forward networks (FFNets), finite-size theories have led to important insights with regard to signal propagation and hyperparameter tuning. Here we derive a systematic finite-size theory for ResNets to study signal propagation and its dependence on the scaling for the residual branch. We derive analytical expressions for the response function, a measure for the network's sensitivity to inputs, and show that for deep networks the empirically found values for the scaling parameter lie within the range of maximal sensitivity. Furthermore, we obtain an analytical expression for the optimal scaling parameter that depends only weakly on other network hyperparameters, such as the weight variance, thereby explaining its universality across hyperparameters. Overall, this work provides a framework for theory-guided optimal scaling in ResNets and, more generally, provides the theoretical framework to study ResNets at finite widths.
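The residual scaling discussed in the abstract can be illustrated with a minimal sketch: each layer adds a scaled residual branch, x_{l+1} = x_l + alpha * phi(W_l x_l). The tanh nonlinearity, Gaussian weight initialization, and the specific alpha values below are illustrative assumptions, not the paper's exact setup; they only show how a small scaling parameter tempers signal growth with depth.

```python
import numpy as np

def scaled_residual_forward(x, weights, alpha):
    """Propagate a signal through a ResNet-style stack.

    Each layer applies x <- x + alpha * tanh(W @ x), where alpha is the
    residual scaling parameter. Choices of nonlinearity and weights here
    are illustrative, not the paper's exact model.
    """
    for W in weights:
        x = x + alpha * np.tanh(W @ x)
    return x

rng = np.random.default_rng(0)
width, depth = 64, 50
# Gaussian weights with variance 1/width (an assumed, standard scaling)
weights = [rng.normal(0.0, 1.0 / np.sqrt(width), (width, width))
           for _ in range(depth)]
x0 = rng.normal(size=width)

# With alpha = 1 the signal norm grows strongly with depth; a smaller
# alpha keeps activations at a comparable scale across layers.
norm_unscaled = np.linalg.norm(scaled_residual_forward(x0, weights, alpha=1.0))
norm_scaled = np.linalg.norm(scaled_residual_forward(x0, weights, alpha=0.1))
print(norm_scaled, norm_unscaled)
```

The paper's finite-size theory goes further and characterizes the alpha that maximizes the network's response function; this sketch only demonstrates the qualitative effect of the scaling parameter on forward signal propagation.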