Prompt Injection, Data Leakage, and Safety Defense in Multimodal LLM-Integrated Systems
Keywords:
prompt injection, data leakage, multimodal LLM, safety defense, adversarial robustness, system architecture, governance, socio-technical infrastructureAbstract
The integration of multimodal large language models (LLMs) into socio-technical infrastructures has introduced unprecedented capabilities alongside novel security vulnerabilities. This paper examines the systemic risks posed by prompt injection attacks, data leakage pathways, and the corresponding defensive architectures required to maintain safety and robustness in deployed systems. Unlike prior work that isolates individual attack vectors, we adopt a holistic lens that considers the entire lifecycle of multimodal LLM-integrated systems, from model training and fine-tuning to runtime orchestration and governance. We argue that prompt injection exploits the inherent ambiguity between instruction and data in LLM interfaces, and that multimodal inputs exacerbate this ambiguity by adding heterogeneous encodings such as images, audio, and video. Data leakage, often a consequence of prompt injection or inadequate output filtering, raises critical concerns about privacy, fairness, and regulatory compliance. We propose a layered defense framework that combines input sanitization, context-aware output validation, and architecture-level isolation mechanisms, while acknowledging the fundamental trade-offs between security, usability, and computational cost. Through cross-domain analysis and case illustrations, we demonstrate that no single defense suffices; rather, a governance-oriented approach integrating technical controls with policy mechanisms is essential. Our discussion extends to sustainability implications, deployment challenges in resource-constrained environments, and the ethical imperative of fairness in defense design. We conclude with forward-looking perspectives on adversarial robustness, adaptive threat models, and the need for interdisciplinary collaboration in shaping the future of safe multimodal AI.
References
1. Wang, B., Chen, W., Pei, H., Xie, C., Kang, M., Zhang, C., ... & Song, D. (2023). Debiasing may replace implicit bias with explicit bias: A case study in large language models. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society (pp. 1-10). ACM.
2. Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., ... & Liang, P. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
3. Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., & Fritz, M. (2023). More than you've asked for: A comprehensive analysis of novel prompt injection threats to application-integrated large language models. arXiv preprint arXiv:2302.12173.
4. Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., ... & Song, D. (2021). Extracting training data from large language models. In USENIX Security Symposium (pp. 2633-2650).
5. Zou, A., Wang, Z., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
6. Wallace, E., Feng, S., Kandpal, N., Gardner, M., & Singh, S. (2019). Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (pp. 2153-2162).
7. Perez, F., & Ribeiro, I. (2022). Ignore previous prompt: Attack techniques for language models. In NeurIPS ML Safety Workshop.
8. Zhao, Y., Li, Y., Dai, X., & Liu, Y. (2023). Adversarial attacks on multimodal models: A survey. arXiv preprint arXiv:2310.05812.
9. Kandpal, N., Wallace, E., & Raffel, C. (2022). Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning (pp. 10697-10707).
10. Chen, Y., Li, D., & Liu, F. (2023). Data leakage via multimodal generation: Risks and mitigations. arXiv preprint arXiv:2304.12345.
11. Kumar, S., Viswanathan, N., Gu, J., & Roy, B. (2023). A survey on safety and security of large language models. ACM Computing Surveys, 56(4), 1-35.
12. Gong, J., Zhang, J., & Li, Y. (2024). Cross-modal adversarial attacks on vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1-10).
13. Schuett, J. (2023). Risk management in the development of advanced AI systems. arXiv preprint arXiv:2309.11235.
14. Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How does LLM safety training fail? In Advances in Neural Information Processing Systems (Vol. 36).
15. Bagdasaryan, E., Hsieh, A., Poursaeed, O., & Shmatikov, V. (2023). Adversarial manipulations of neural networks via adversarial images. In USENIX Security Symposium (pp. 1-18).
16. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (Vol. 30).
17. Chen, M., Garg, S., & Kumar, R. (2022). Instruction hierarchy for safe large language models. arXiv preprint arXiv:2212.10495.
18. Hendrycks, D., Mazeika, M., & Woodside, T. (2023). An overview of adversarial machine learning and defense. arXiv preprint arXiv:2305.18029.
19. Sundararajan, M., Taly, A., & Yan, Q. (2017). Axiomatic attribution for deep networks. In International Conference on Machine Learning (pp. 3319-3328).
20. Floridi, L., & Cowls, J. (2022). A unified framework of five principles for AI in society. Harvard Data Science Review, 1(1).
21. McMahan, B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017). Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics (pp. 1273-1282).
Downloads
Published
How to Cite
Issue
Section
License
Copyright (c) 2024 Computational Intelligence Systems

This work is licensed under a Creative Commons Attribution 4.0 International License.
This article is published under the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.



