BERT's inference quality is well established, and it is already used in real tasks; in practice the focus is on how to improve inference speed. ALBERT is a streamlined, optimized variant of BERT that is small enough to deploy in projects. I recently ran a test:
1. Data source: the TNEWS dataset, all short texts, 15 categories, in the following form
2. The original ALBERT model, ~16 MB, as follows
3. The fine-tuned ckpt model, ~50 MB, as follows
4. Inference performance, measured in QPS (queries per second), as follows
As the numbers show, ALBERT's QPS on GPU is acceptable. Putting a serving framework in front of the model should yield a further improvement.
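As a rough illustration of how the QPS figures above could be collected, here is a minimal sketch of a throughput benchmark. The `predict` callable is a placeholder: in the real test it would wrap the fine-tuned ALBERT classifier, while here a dummy function stands in so the harness is self-contained. The warmup count and query set are arbitrary assumptions, not values from the test.

```python
import time

def measure_qps(predict, queries, warmup=5):
    """Return queries-per-second of a predict() callable over a list of inputs."""
    # Warm up to exclude one-time costs (graph building, weight loading, caches).
    for q in queries[:warmup]:
        predict(q)
    # Time the full pass and divide the query count by elapsed wall-clock time.
    start = time.perf_counter()
    for q in queries:
        predict(q)
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed

# Dummy stand-in for the fine-tuned ALBERT model: maps a headline to one
# of 15 class ids, mimicking the 15-category TNEWS setup.
def dummy_predict(text):
    return hash(text) % 15

qps = measure_qps(dummy_predict, [f"headline {i}" for i in range(100)])
print(f"{qps:.0f} QPS")
```

With a real model, batching the queries instead of sending them one at a time usually raises QPS substantially on GPU, which is part of what a serving framework provides.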