BERT and ALBERT

Commentary by the first author of ALBERT

A、Scaling up BERT's depth and width causes a parameter explosion

1>、Increasing width

Reduce the number of parameters while keeping performance from dropping.

1、Factorized embedding parameterization

Factor the large embedding matrix into the product of two smaller matrices: first project the one-hot input down to a low embedding dimension E, then project up to the hidden dimension H. This decouples the embedding size from the hidden size, cutting the embedding parameters from V×H to V×E + E×H and leaving the network free to be widened.
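A minimal sketch of the parameter savings, using illustrative sizes (V and H roughly match BERT-large's vocabulary and hidden size; E=128 is the small embedding dimension ALBERT reports):

```python
# Illustrative sizes: V = vocabulary, H = hidden dim, E = small embedding dim.
V, H, E = 30000, 1024, 128

# BERT: one big V x H embedding matrix.
bert_embedding_params = V * H
# ALBERT: a V x E projection followed by an E x H projection.
albert_embedding_params = V * E + E * H

print(bert_embedding_params)    # 30720000
print(albert_embedding_params)  # 3971072
```

With these sizes the factorization shrinks the embedding table by almost 8x, which is why the embedding size no longer forces the hidden size.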

2、Cross-layer parameter sharing

Share parameters across layers; sharing strategies include all_shared (share every layer's weights) and shared_attention (share only the attention weights).

Parameter comparison with BERT

Drawbacks: 1、the model is about 3x slower at inference (fewer parameters do not mean a faster model)

2>、Increasing depth

Removing dropout (ALBERT found that dropping it improves performance)

Effectiveness of parameter sharing: it also stabilizes training, acting as a form of regularization.

Self-supervised objective: sentence-order prediction (SOP), replacing BERT's next-sentence prediction (NSP)
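A hedged sketch of how SOP training pairs can be built (function and variable names here are my own, not ALBERT's): a positive example is two consecutive segments in their original order, and a negative example is the same two segments swapped, so the task cannot be solved by topic cues alone the way NSP often could.

```python
def make_sop_examples(seg_a, seg_b):
    """Build SOP pairs from two consecutive text segments.

    Returns (segments, label) tuples; label 1 = original order,
    label 0 = swapped order.
    """
    return [
        ((seg_a, seg_b), 1),  # positive: segments in document order
        ((seg_b, seg_a), 0),  # negative: same segments, order swapped
    ]

examples = make_sop_examples("the cat sat", "on the mat")
```

Because both classes use the same pair of segments, the model must learn inter-sentence coherence rather than mere topical similarity.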

 


Origin: www.cnblogs.com/Christbao/p/12337361.html