ALBERT first-author commentary (reading notes)
ALBERT tackles the parameter explosion that comes from scaling up BERT's depth and width.
1> Increasing width
Goal: reduce the parameter count while keeping the performance drop small.
1、Factorized embedding parametrization
Decompose the large vocabulary-embedding matrix into the product of two smaller matrices: first map the vocabulary (size V) into a low-dimensional embedding space (size E), then project up to the hidden size H. Untying E from H frees us to widen the network without blowing up the embedding table.
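The parameter saving from the factorization can be checked with a quick back-of-the-envelope sketch, assuming illustrative sizes (V=30000, H=1024, E=128, as in the ALBERT paper's large configuration):

```python
# Hypothetical sizes for illustration: vocab V, hidden H, embedding dim E.
V, H, E = 30_000, 1024, 128

# Untied embedding (BERT-style): a single V x H matrix.
params_bert = V * H

# Factorized (ALBERT-style): a V x E lookup followed by an E x H projection.
params_albert = V * E + E * H

print(params_bert)    # 30720000
print(params_albert)  # 3971072
print(params_bert / params_albert)
```

With these sizes the embedding parameters shrink by roughly a factor of 7.7, and the saving grows as H increases while E stays fixed.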
2、Cross-layer parameter sharing
Share parameters across layers; the paper compares all-shared, shared-attention, and shared-FFN variants.
Parameter counts are compared against BERT.
Drawback: the model is roughly 3x slower, since sharing shrinks the parameter count but not the per-layer compute.
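The trade-off above can be illustrated with a toy sketch (hypothetical sizes, a tanh linear map standing in for a real transformer layer): all-shared reuses one weight matrix in every layer, so parameters drop by the layer count while the number of layer applications, and hence compute, stays the same.

```python
import numpy as np

rng = np.random.default_rng(0)
H, L = 64, 12  # hypothetical hidden size and layer count

def layer(x, W):
    # Toy stand-in for a transformer layer: linear map + residual.
    return x + np.tanh(x @ W)

# No sharing: L distinct weight matrices -> L * H * H parameters.
Ws = [rng.normal(scale=0.02, size=(H, H)) for _ in range(L)]
# All-shared: one matrix reused by every layer -> H * H parameters.
W_shared = rng.normal(scale=0.02, size=(H, H))

x = rng.normal(size=(1, H))
h_unshared = x
for W in Ws:
    h_unshared = layer(h_unshared, W)
h_shared = x
for _ in range(L):
    # Same number of layer applications, so same compute per forward pass.
    h_shared = layer(h_shared, W_shared)

print(sum(W.size for W in Ws), W_shared.size)  # 49152 4096
```

Here sharing cuts parameters 12x, yet both loops run 12 layer applications, which is why ALBERT's smaller models are not correspondingly faster.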
2> Increasing depth
Removing dropout: dropping dropout further improves performance.
Effectiveness: parameter sharing is the main source of the savings.
Self-supervised objective: sentence-order prediction (SOP) replaces BERT's next-sentence prediction.
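The self-supervised objective mentioned here is ALBERT's sentence-order prediction (SOP). A minimal sketch of how SOP training pairs can be built from two consecutive segments (function and variable names are hypothetical, not from the paper):

```python
import random

def make_sop_example(seg_a, seg_b, rng):
    # Positive example (label 1): segments kept in their original order.
    # Negative example (label 0): the same two segments, order swapped.
    if rng.random() < 0.5:
        return (seg_a, seg_b), 1
    return (seg_b, seg_a), 0

rng = random.Random(42)
pair, label = make_sop_example("She opened the door.", "The room was dark.", rng)
print(pair, label)
```

Because both segments always come from the same document, the model cannot solve the task by topic matching (the shortcut that made NSP too easy) and must learn inter-sentence coherence.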