ANALYSIS OF THE EFFICIENCY OF GPT-2 MODEL APPLICATION WITH ADAPTED TRANSFER LEARNING ON VARIOUS HARDWARE ARCHITECTURES
DOI: https://doi.org/10.61837/mbuir020124174d
Keywords: Adaptive Transfer Learning, GPT-2 Efficiency, GPU Architectures, Hardware Impact, Performance Comparison, AI Optimization, Future AI Systems
Abstract
This paper analyses the efficiency of implementing the GPT-2 model, one of the advanced artificial intelligence models for text generation, through adapted transfer learning, with a particular focus on the use of various GPU architectures. The primary goal of this research is to examine the impact of adapted transfer learning on the performance of the GPT-2 model across different GPU architectures, assessing how differences in GPU capability affect the model's efficiency. The work relies on an experimental method to evaluate and compare the model's performance in terms of accuracy, processing speed, and energy efficiency on each of the tested platforms. Special attention is given to analysing how characteristics of the hardware architecture, such as processing power and memory capacity, affect the efficiency of the transfer learning process. This study provides important insights into the potential for optimizing the GPT-2 model for specific hardware platforms, which is crucial for its application in a wide range of real-world scenarios. The results offer valuable information for researchers in the fields of artificial intelligence and machine learning, providing a foundation for further development and improvement of AI technologies.
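For illustration only, the following minimal sketch (not the authors' actual experimental code) shows one way such a measurement could be instrumented in Python with PyTorch and the Hugging Face transformers library: a few fine-tuning steps of GPT-2 on a placeholder corpus, timed to report token throughput and peak GPU memory. The model name, corpus, batch size, and step count are assumptions chosen for brevity.

```python
# Hedged sketch of a GPT-2 transfer-learning benchmark on a single GPU.
# Placeholder data and hyperparameters; energy measurement would require
# additional tooling (e.g. vendor utilities) and is omitted here.
import time
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 defines no pad token
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)

# Placeholder corpus standing in for the task-specific fine-tuning data.
texts = ["Example sentence for adapted transfer learning."] * 64
batch = tokenizer(texts, return_tensors="pt", padding=True,
                  truncation=True, max_length=64).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

if device.type == "cuda":
    torch.cuda.reset_peak_memory_stats(device)

model.train()
steps = 10                                         # a few steps suffice for a throughput sample
start = time.time()
for _ in range(steps):
    # Simplification: labels reuse input_ids directly; a real setup would
    # mask padding positions with -100 so they do not contribute to the loss.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
if device.type == "cuda":
    torch.cuda.synchronize(device)
elapsed = time.time() - start

tokens_processed = steps * batch["input_ids"].numel()
print(f"throughput: {tokens_processed / elapsed:.0f} tokens/s")
if device.type == "cuda":
    print(f"peak GPU memory: {torch.cuda.max_memory_allocated(device) / 1e9:.2f} GB")
```

Running the same script on different GPU architectures gives directly comparable throughput and memory figures, which is the kind of cross-platform comparison the study describes.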