TY - JOUR
T1 - ELO-Mask
T2 - Effective and Layerwise Optimization of Mask for Sparse LLMs
AU - Xiang, Bingjie
AU - Wu, Jiarui
AU - Han, Xiaoying
AU - Gu, Qian
AU - Chao, Fei
AU - Yang, Xiao
AU - Wu, Fan
AU - Fu, Xin
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2024/11/15
Y1 - 2024/11/15
N2 - Model sparsification is an effective way to address the substantial computational cost of inference in large language models, which stems from their vast number of parameters. However, current sparsification methods for large models are themselves costly. We propose a two-stage approach, ELO-Mask, for the rapid sparsification of large language models using only a small calibration dataset. The approach consists of two steps: 1) a Mask Reordering step, which initializes the masks with predefined parameter-importance metrics and then reorders them block by block using the Straight-Through Estimator on a small sample dataset; and 2) a Mask Fine-Tuning step, which further fine-tunes the masks obtained from the first step, again block by block, on the same small sample dataset. Our experiments demonstrate the effectiveness of this approach. When sparsifying the Llama-7B model, our method is clearly superior to the standard sparsification-plus-LoRA-fine-tuning approach: it achieves comparable performance in the final sparse model while consuming less computational power, using a smaller dataset, occupying less GPU memory, and leaving the inference speed of the sparse model unaffected.
AB - Model sparsification is an effective way to address the substantial computational cost of inference in large language models, which stems from their vast number of parameters. However, current sparsification methods for large models are themselves costly. We propose a two-stage approach, ELO-Mask, for the rapid sparsification of large language models using only a small calibration dataset. The approach consists of two steps: 1) a Mask Reordering step, which initializes the masks with predefined parameter-importance metrics and then reorders them block by block using the Straight-Through Estimator on a small sample dataset; and 2) a Mask Fine-Tuning step, which further fine-tunes the masks obtained from the first step, again block by block, on the same small sample dataset. Our experiments demonstrate the effectiveness of this approach. When sparsifying the Llama-7B model, our method is clearly superior to the standard sparsification-plus-LoRA-fine-tuning approach: it achieves comparable performance in the final sparse model while consuming less computational power, using a smaller dataset, occupying less GPU memory, and leaving the inference speed of the sparse model unaffected.
KW - accuracy recovery
KW - large language model
KW - mask rearrangement
KW - model sparsification
KW - small samples
UR - http://www.scopus.com/inward/record.url?scp=85209717924&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2024.3498904
DO - 10.1109/ACCESS.2024.3498904
M3 - Article
AN - SCOPUS:85209717924
SN - 2169-3536
JO - IEEE Access
JF - IEEE Access
ER -