Performance Optimization of RISC-V Based Embedded Systems for Real-Time AI Applications
DOI:
https://doi.org/10.66280/cis.v1i1.113
Keywords:
RISC-V, embedded AI, real-time systems, hardware–software co-design, inference optimization, edge intelligence.
Abstract
Real-time artificial intelligence at the edge increasingly depends on low-power embedded processors that can deliver bounded latency under strict energy and memory constraints. RISC-V is particularly attractive in this context because its open instruction set architecture permits domain-specific extensions, fine-grained microarchitectural tuning, and portable software stacks across vendors. This paper presents a complete optimization study for a RISC-V based embedded AI platform targeting low-latency visual and sensor inference. We propose a hardware–software co-design pipeline that combines quantization-aware deployment, scratchpad-aware memory scheduling, vectorized kernel fusion, interrupt-aware real-time scheduling, and a lightweight custom instruction extension for convolution accumulation. The target system integrates a 64-bit dual-core RISC-V processor with RVV 1.0 support, a configurable tensor accelerator, DMA-backed scratchpad memory, and a deterministic inference runtime. Experiments are conducted on three representative workloads: keyword spotting, human activity recognition, and compact object detection. We compare the proposed design against an unoptimized RISC-V baseline, an ARM Cortex-A55 embedded reference, and software-only optimization variants. Across workloads, the proposed system reduces end-to-end latency by 38.7% relative to the optimized software-only RISC-V baseline and by 56.4% relative to the unoptimized deployment, while improving energy efficiency by up to 1.84× and preserving task accuracy within 0.6 percentage points. An ablation study demonstrates that memory scheduling and vector-kernel fusion are the dominant contributors to performance, while the custom instruction path is most beneficial for convolution-heavy models. The results indicate that carefully co-designed RISC-V platforms can satisfy practical real-time AI constraints without sacrificing flexibility or openness.
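The two latency figures in the abstract can be related to each other with a short calculation. The sketch below is illustrative only: it assumes both percentages describe the same aggregate end-to-end latency measure, which the abstract does not state explicitly (averages taken across workloads need not compose exactly). The function name `implied_reduction` and its parameters are hypothetical, introduced here for illustration.

```python
def implied_reduction(total_reduction: float, relative_reduction: float) -> float:
    """Estimate how much the intermediate configuration improves on the baseline.

    total_reduction:    latency reduction of the proposed system vs. the
                        unoptimized deployment (0.564 in the abstract).
    relative_reduction: latency reduction of the proposed system vs. the
                        optimized software-only baseline (0.387 in the abstract).

    Assumption: both figures refer to the same latency metric, so they can be
    chained multiplicatively.
    """
    # Normalize the unoptimized latency to 1.0.
    proposed = 1.0 - total_reduction
    # The proposed latency equals software_only * (1 - relative_reduction).
    software_only = proposed / (1.0 - relative_reduction)
    return 1.0 - software_only


sw_only = implied_reduction(0.564, 0.387)
print(f"Implied software-only reduction vs. unoptimized: {sw_only:.1%}")
```

Under that assumption, software-only optimization alone would account for a latency reduction of roughly 29% relative to the unoptimized deployment, with the hardware–software co-design contributing the remainder.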
License
Copyright (c) 2026 Computational Intelligence Systems

This work is licensed under a Creative Commons Attribution 4.0 International License.
This article is published under the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.



