Optimizing Machine Learning Workloads for Better Performance

Anudeep Katangoori

Anudeep Katangoori Duke University, United States

Keywords: Artificial Intelligence, Machine Learning, Graphical Processing Unit (GPU), Data Models, Optimization

Abstract

Because machine learning jobs are so big and complex now, maintaining high efficiency for training and inference is very hard. The requirement for more data increases training time, lags, and use of lots of computer resources, preventing modern ML systems from being used well and deployed at large. Because of these challenges, the work suggests an all-encompassing optimization method involving dynamic batch size, fused operators, and mixed precision to maximize throughput and reduce the time needed on different hardware. Since this method is applied to popular ML frameworks such as PyTorch and TensorFlow, it becomes broadly used. According to experimental results, ResNet-50 learned faster on ImageNet (its training was cut in half), and the BERT-base worked more efficiently on the SQuAD dataset (enhanced with 41%). Even so, there was less than 0.5% accuracy loss. Additionally, using an average of 62% more GPU doesn’t require a lot of extra memory. They prove that the framework works well in conserving resources and preserving how the model works. Some of the significant contributions of this work consist of a modular, cross-platform structure for optimization and a thorough look at how systems can be made scalable. It can make ML workloads more efficient and complete faster, so both researchers and industry can speed up their innovation and use of new technologies.

References

Araus, J. L., Kefauver, S. C., Zaman-Allah, M., Olsen, M. S., & Cairns, J. E. (2018, May 1). Translating High-Throughput Phenotyping into Genetic Gain. Trends in Plant Science, 23(5), 451-466. https://doi.org/10.1016/j.tplants.2018.02.001
Blinowski, G., Ojdowska, A., & Przybylek, A. (2022). Monolithic vs. Microservice Architecture: A Performance and Scalability Evaluation. IEEE Access, 10, 20357–20374. https://doi.org/10.1109/ACCESS.2022.3152803
Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., ...& Guestrin, C. (2018). TVM: An automated end-to-end optimizing compiler for deep learning. arXiv preprint arXiv:1802.04799. https://doi.org/10.48550/arXiv.1802.04799
Chen, T., Xu, B., Zhang, C., & Guestrin, C. (2016). Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174. https://doi.org/10.48550/arXiv.1604.06174
Cohen, J. I. (2020, July 1). Herpesvirus latency. Journal of Clinical Investigation. American Society for Clinical Investigation. https://doi.org/10.1172/JCI136225
Cranmer, K., Brehmer, J., & Louppe, G. (2020). The frontier of simulation-based inference. Proceedings of the National Academy of Sciences of the United States of America, 117(48), 30055–30062. https://doi.org/10.1073/pnas.1912789117
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv. https://doi.org/10.48550/arXiv.1810.04805
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778. https://doi.org/10.1109/CVPR.2016.90
Henning, S., & Hasselbring, W. (2024). Benchmarking scalability of stream processing frameworks deployed as microservices in the cloud. Journal of Systems and Software, 208. https://doi.org/10.1016/j.jss.2023.111879
Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14, 1303–1347.
Hubbard, R., Haig, B. D., & Parsa, R. A. (2019). The Limited Role of Formal Statistical Inference in Scientific Inference. American Statistician, 73(sup1), 91–98. https://doi.org/10.1080/00031305.2018.1464947
Khan, D., Jung, L. T., & Hashmani, M. A. (2021, October 2). Systematic literature review of challenges in blockchain scalability. Applied Sciences (Switzerland), 11(20), 9372. https://doi.org/10.3390/app11209372
Kuang, K., Li, L., Geng, Z., Xu, L., Zhang, K., Liao, B., … Jiang, Z. (2020, March 1). Causal Inference. Engineering. Elsevier Ltd. https://doi.org/10.1016/j.eng.2019.08.016
Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., ...& Shoeybi, M. (2018). Mixed precision training. arXiv preprint arXiv:1710.03740. https://doi.org/10.48550/arXiv.1710.03740
Ogburn, E. L., Sofrygin, O., Díaz, I., & van der Laan, M. J. (2024). Causal Inference for Social Network Data. Journal of the American Statistical Association, 119(545), 597–611. https://doi.org/10.1080/01621459.2022.2131557
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv. https://doi.org/10.18653/v1/D16-1264
Richer, G., Pister, A., Abdelaal, M., Fekete, J. D., Sedlmair, M., & Weiskopf, D. (2024). Scalability in Visualization. IEEE Transactions on Visualization and Computer Graphics, 30(7), 3314–3330. https://doi.org/10.1109/TVCG.2022.3231230
Roesch, J., Relay, Z., Chen, T., & Moreau, T. (2019). A high performance compiler for deep learning. arXiv preprint arXiv:1802.04799. https://doi.org/10.48550/arXiv.1802.04799
Russakovsky, O., Deng, J., Su, H., et al. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115, 211–252. https://doi.org/10.1007/s11263-015-0816-y
Schwartz, M., & Stern-Ginossar, N. (2023). Rethinking human cytomegalovirus latency reservoir. Annals of the New York Academy of Sciences, 1524(1), 30–36. https://doi.org/10.1111/nyas.14994
Sergeev, A., & Del Balso, M. (2018). Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799. https://doi.org/10.48550/arXiv.1802.05799
Shevlin, M. (2017). Practical High-Throughput Experimentation for Chemists. ACS Medicinal Chemistry Letters, 8(6), 601–607. https://doi.org/10.1021/acsmedchemlett.7b00165
Shukla, S., Hassan, M. F., Tran, D. C., Akbar, R., Paputungan, I. V., & Khan, M. K. (2023). Improving latency in Internet-of-Things and cloud computing for real-time data transmission: a systematic literature review (SLR). Cluster Computing, 26(5), 2657–2680. https://doi.org/10.1007/s10586-021-03279-3
Slagboom, J., Derks, R. J. E., Sadighi, R., Somsen, G. W., Ulens, C., Casewell, N. R., & Kool, J. (2023). High-Throughput Venomics. Journal of Proteome Research, 22(6), 1734–1746. https://doi.org/10.1021/acs.jproteome.2c00780
Swathi, P., &Venkatesan, M. (2021). Scalability improvement and analysis of permissioned-blockchain. ICT Express, 7(3), 283–289. https://doi.org/10.1016/j.icte.2021.08.015
Weidner-Glunde, M., Kruminis-Kaszkiel, E., & Savanagouder, M. (2020). Herpesviral latency—common themes. Pathogens, 9(2). https://doi.org/10.3390/pathogens9020125
Zhang, C., Butepage, J., Kjellstrom, H., & Mandt, S. (2019). Advances in Variational Inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8), 2008–2026. https://doi.org/10.1109/TPAMI.2018.2889774
Zhou, Q., Huang, H., Zheng, Z., & Bian, J. (2020). Solutions to Scalability of Blockchain: a Survey. IEEE Access, 8, 16440–16455. https://doi.org/10.1109/aCCESS.2020.2967218