Deep Guard AV: Audio-Visual Deepfake Detection Framework Using Hybrid Audio Learning, CNN-LSTM Video Analysis, and Automated Transcript Logging

Aylwin Vivian Singh; Arjun Vinod Shinde

doi:10.59256/ijire.20260702060


Original Article


Aylwin Vivian Singh¹, Arjun Vinod Shinde²

¹ Department of Computer Science (Artificial Intelligence), Shri Shankaracharya Technical Campus/Chhattisgarh Swami Vivekanand Technical University, India.

² Assistant Professor, Department of Computer Science, Shri Shankaracharya Technical Campus/Chhattisgarh Swami Vivekanand Technical University, India.

Published Online: March-April 2026

Pages: 493-499


https://www.doi.org/10.59256/ijire.20260702060

Abstract


The increasing realism of synthetic media generated with deep learning has intensified the threat that deepfake videos pose in social media, journalism, legal evidence, and digital identity verification. Existing deepfake detection systems often analyze a single modality, which limits their robustness against sophisticated multimodal manipulations. This paper presents Deep Guard AV, an audio-visual deepfake detection framework that jointly analyzes the video frames and the extracted speech signal of an input video while preserving textual transcripts for forensic logging and interpretability. The framework processes each input video through a dual-stream pipeline: visual frames are analyzed by a Convolutional Neural Network–Long Short-Term Memory (CNN-LSTM) architecture to capture spatial and temporal inconsistencies, and the extracted audio is processed by a hybrid deep learning model that combines waveform-based and spectrogram-based representations. In parallel, the speech content is transcribed and saved locally to maintain an auditable forensic record of processed media. A weighted fusion strategy combines the outputs of the audio and video models into a final authenticity score. Experimental evaluation indicates that integrating the audio and video modalities improves detection robustness compared with unimodal analysis. The framework offers a practical and scalable approach to deepfake forensics and improves transparency through transcript preservation.
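The weighted fusion step can be illustrated with a minimal Python sketch. It assumes that each stream outputs a probability in [0, 1] that the clip is fake and that the fused score is a convex combination of the two. The function name `fuse_scores`, the default `video_weight` of 0.6, and the 0.5 decision threshold are hypothetical choices for illustration; the abstract does not state the weights or the threshold the authors used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FusionResult:
    score: float  # fused probability that the clip is fake, in [0, 1]
    label: str    # "fake" or "real"


def fuse_scores(video_prob: float, audio_prob: float,
                video_weight: float = 0.6,
                threshold: float = 0.5) -> FusionResult:
    """Weighted late fusion of two per-modality fake probabilities.

    video_weight and threshold are illustrative assumptions,
    not values reported in the paper.
    """
    for name, value in (("video_prob", video_prob),
                        ("audio_prob", audio_prob),
                        ("video_weight", video_weight)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    # The audio weight is the complement, so the weights sum to 1
    # and the fused score stays in [0, 1].
    audio_weight = 1.0 - video_weight
    score = video_weight * video_prob + audio_weight * audio_prob
    label = "fake" if score >= threshold else "real"
    return FusionResult(score=score, label=label)


if __name__ == "__main__":
    # Video stream is fairly confident the clip is fake; audio stream is not.
    result = fuse_scores(video_prob=0.9, audio_prob=0.2)
    print(f"{result.label} ({result.score:.2f})")
```

Keeping the weights complementary means a single parameter controls how much each modality is trusted, which would make it straightforward to tune on a validation set if one stream proves more reliable.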
