Bit-Width Quantization and Prompt Optimization: Achieving 90% Energy Savings in Large Language Models

  • Anupam Dhakal University of South Dakota, Vermillion, South Dakota, USA
  • Prashant Pokharel University of the Cumberlands, Williamsburg, Kentucky, USA
  • Sabin Adhikari University of South Dakota, Vermillion, South Dakota, USA
Keywords: Bit-Width Quantization, Prompt Optimization, Energy-Efficient AI, Large Language Models, Edge Deployment

Abstract

In the rapidly evolving field of Large Language Models (LLMs), rapid scaling has posed significant challenges, including exorbitant energy consumption, prohibitively expensive deployment, and a substantial impact on environmental sustainability. A major contributor to this problem is the colossal size of LLMs, typically billions of parameters, combined with the need to run them in resource-scarce or edge environments. Our research presents a practical and immediately applicable approach to improving the energy efficiency of LLMs by combining low-bit-width quantization with streamlined prompt techniques.

We tested this approach on Llama-based models ranging from hundreds of millions to over one billion parameters, applying 4-bit post-training quantization combined with structured prompt and query optimization across this spectrum of models. Using a controlled A/B testing framework, we compared task accuracy, latency, and power consumption between baseline and optimized configurations. Because actual hardware power draw was measured directly, we summarized the performance of both configurations with an accuracy-per-watt metric. Our results show that 4-bit quantization alone eliminates a significant portion of memory usage and energy consumption, while prompt optimization further reduces token-level inference cost. Used in tandem, these two techniques yield a 90% reduction in energy consumption with statistically insignificant losses in accuracy on the tasks we evaluated.
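To make the evaluation pipeline described above concrete, the following minimal Python sketch shows 4-bit post-training quantization of a Llama-class model via Hugging Face transformers with bitsandbytes, together with an accuracy-per-watt comparison. The model name, evaluation data, and the energy-measurement wrapper are illustrative assumptions, not the authors' exact experimental setup.

```python
# Hedged sketch: 4-bit post-training quantization plus an accuracy-per-watt
# comparison. Model name, evaluation data, and the energy_meter helper are
# illustrative assumptions, not the authors' exact configuration.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "meta-llama/Llama-3.2-1B"  # assumed ~1B-parameter Llama-class model

# 4-bit NF4 post-training quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model_4bit = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, quantization_config=bnb_config, device_map="auto"
)

def accuracy_per_watt(correct: int, total: int,
                      energy_joules: float, wall_time_s: float) -> float:
    """Task accuracy divided by average power draw (watts) over the run."""
    accuracy = correct / total
    avg_watts = energy_joules / wall_time_s
    return accuracy / avg_watts

def run_eval(model, prompts, references, energy_meter):
    """A/B evaluation loop: accuracy, latency, and energy for one configuration.
    energy_meter is an assumed wrapper around a hardware power reading
    (e.g., NVML GPU power samples or a wall-plug meter)."""
    correct = 0
    energy_meter.start()
    t0 = time.time()
    for prompt, ref in zip(prompts, references):
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=32)
        answer = tokenizer.decode(out[0], skip_special_tokens=True)
        correct += int(ref.strip().lower() in answer.lower())
    wall_time = time.time() - t0
    joules = energy_meter.stop()  # total energy consumed during the run
    return accuracy_per_watt(correct, len(prompts), joules, wall_time)
```

In this sketch, prompt optimization would enter simply as shorter, more structured prompt strings passed to the same loop, which reduces input tokens per query and therefore the measured energy per response.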

We also validated the strategy for real-world use, demonstrating consistent efficiency gains when running on severely constrained hardware. A scalability analysis showed that the method remains cost-effective even for models with over a billion parameters.

Published
2026-02-04
How to Cite
Dhakal, A., Pokharel, P., & Adhikari, S. (2026). Bit-Width Quantization and Prompt Optimization: Achieving 90% Energy Savings in Large Language Models. European Journal of Science, Innovation and Technology, 6(1), 66-77. Retrieved from https://ejsit-journal.com/index.php/ejsit/article/view/741
Section
Articles