Deepfakes keep getting cleaner to the human eye, and many detectors still stare only at pixels. At AI Tech Inspire, we spotted an open-source project that takes a two-front approach: combine what you see in the image with what you can’t see—its frequency fingerprints—and then show you exactly where the model thinks manipulation happened. If you care about practical detection that’s interpretable, this is worth a close look.
Quick facts at a glance
- Project: VeridisQuo, an open-source deepfake detector combining spatial and frequency-domain analysis, with Grad-CAM heatmaps projected onto original frames.
- Architecture: Two parallel streams: EfficientNet-B4 (1792-dim output) for spatial features; a frequency module using FFT (radial binning, 8 bands, Hann window) and DCT (8×8 blocks), each yielding 512-dim vectors, fused via MLP to 1024-dim; the two streams are concatenated (2816-dim) and classified via MLP; ~25M parameters.
- Data and preprocessing: Trained on FaceForensics++ (C23) covering Face2Face, FaceShifter, FaceSwap, NeuralTextures; frames extracted at 1 FPS; faces detected with YOLOv11n; ~716K face crops.
- Training setup: 7 epochs on an RTX 3090 (rented), ~4 hours; AdamW with lr=1e-4, cosine annealing, CrossEntropyLoss.
- Findings: The frequency stream alone underperforms the spatial stream; fusion helps notably on higher-quality fakes; DCT features are effective at surfacing compression-related artifacts; Grad-CAM heatmaps concentrate on blending boundaries and jawlines.
- Next steps: Open to feedback; plans for cross-dataset evaluation on Celeb-DF and DFDC.
- Code: GitHub at https://github.com/VeridisQuo-orga/VeridisQuo
Why mixing spatial and frequency clues makes sense
Most detectors rely on pixel-space cues—think textures, edges, local inconsistencies—which is exactly what a strong image backbone like EfficientNet-B4 excels at. But generators and post-processing pipelines also leave traces in the frequency domain: unusual power distributions across bands, hints introduced by upsampling, or block-wise quirks from codecs. VeridisQuo embraces both perspectives, bundling a visual stream with a frequency stream to catch fakes that are visually convincing but spectrally odd.
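To make "spectrally odd" concrete, here is a minimal NumPy sketch of band-power features over concentric frequency rings — assuming the article's setup of 8 radial bands and a Hann window. The function name and exact binning scheme are illustrative, not VeridisQuo's actual code.

```python
import numpy as np

def radial_fft_features(gray, n_bands=8):
    """Average spectral power in concentric radial bands of a grayscale crop.

    A sketch of the FFT path described in the article; the real binning
    and normalization may differ.
    """
    h, w = gray.shape
    # Hann window reduces spectral leakage from the crop borders
    window = np.outer(np.hanning(h), np.hanning(w))
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray * window)))
    # Distance of each frequency bin from the spectrum centre
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    r_max = r.max()
    feats = np.empty(n_bands)
    for i in range(n_bands):
        # Annulus covering band i of n_bands
        mask = (r >= i * r_max / n_bands) & (r < (i + 1) * r_max / n_bands)
        feats[i] = spectrum[mask].mean()
    return feats
```

Upsampling artifacts in generated faces tend to shift power between these bands, which is the kind of cue a classifier on top of such features can exploit.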
The frequency stream here is not a black box. It performs both FFT with radial binning (8 bands and a Hann window for smoother spectral estimates) and block-wise DCT over 8×8 patches. Each path compresses to a 512-dim vector, then a small MLP fuses to a 1024-dim frequency representation. That gets concatenated with the 1792-dim EfficientNet-B4 output for a 2816-dim feature vector, followed by a classifier MLP. Total footprint: around 25M parameters, which is lean enough for practical experimentation.
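The stated dimensions can be wired up as a short PyTorch sketch. The backbone and frequency extractors are stubbed out as pre-computed feature tensors, and the hidden sizes inside the MLPs are assumptions — only the 1792/512/1024/2816 dimensions come from the article.

```python
import torch
import torch.nn as nn

class FrequencyFusion(nn.Module):
    """Fuses FFT and DCT features (512-dim each) into a 1024-dim vector.

    A sketch of the fusion MLP described in the article; depth and
    activations are assumptions.
    """
    def __init__(self, fft_dim=512, dct_dim=512, fused_dim=1024):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(fft_dim + dct_dim, fused_dim),
                                  nn.ReLU())

    def forward(self, fft_feat, dct_feat):
        return self.fuse(torch.cat([fft_feat, dct_feat], dim=1))

class TwoStreamClassifier(nn.Module):
    """Concatenates spatial (1792-dim) and frequency (1024-dim) features
    into a 2816-dim vector, then classifies real vs. fake."""
    def __init__(self, spatial_dim=1792, freq_dim=1024, n_classes=2):
        super().__init__()
        self.freq_fusion = FrequencyFusion()
        self.classifier = nn.Sequential(
            nn.Linear(spatial_dim + freq_dim, 512),  # hidden size assumed
            nn.ReLU(),
            nn.Linear(512, n_classes),
        )

    def forward(self, spatial_feat, fft_feat, dct_feat):
        freq = self.freq_fusion(fft_feat, dct_feat)      # (B, 1024)
        fused = torch.cat([spatial_feat, freq], dim=1)   # (B, 2816)
        return self.classifier(fused)
```

In a real run, `spatial_feat` would come from EfficientNet-B4's pooled output and the two frequency vectors from the FFT and DCT paths.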
Key takeaway: the frequency stream doesn’t have to beat the vision backbone solo—the win comes from complementary signals, especially when pixel-level artifacts are subtle.
Interpretability built in: Grad-CAM that maps back to the video
Plenty of detectors can output a binary label; fewer can show you why. VeridisQuo integrates Grad-CAM on the EfficientNet-B4 backbone and remaps heatmaps back to the original frames, surfacing the regions that contributed most to the decision. In their tests, the hot spots cluster around blending boundaries and jawlines—areas often stressed by face warping, skin tone mismatch, or imperfect feathering.
That matters operationally. If you’re embedding a detector into a trust-and-safety workflow or a content verification pipeline, a short video overlay with attention heatmaps can help triage edge cases and build operator confidence. It’s also educational: researchers and engineers can quickly see failure modes and iterate on data augmentation or model design.
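The hook-free Grad-CAM pattern is simple enough to sketch on a toy CNN standing in for EfficientNet-B4 — this is the generic recipe (gradient-weighted channel sum, ReLU, upsample), not VeridisQuo's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCNN(nn.Module):
    """Toy stand-in for the spatial backbone; returns logits and the
    last feature map so Grad-CAM can reach it."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(16, 2)

    def forward(self, x):
        fmap = self.features(x)                    # (B, 16, H, W)
        logits = self.head(fmap.mean(dim=(2, 3)))  # global average pool
        return logits, fmap

def grad_cam(model, x, class_idx):
    """Heatmap of regions driving the score for class_idx, in [0, 1],
    resized to the input resolution for overlay on the original frame."""
    logits, fmap = model(x)
    fmap.retain_grad()                         # keep gradient of non-leaf
    logits[:, class_idx].sum().backward()
    weights = fmap.grad.mean(dim=(2, 3), keepdim=True)  # channel importance
    cam = F.relu((weights * fmap).sum(dim=1))           # (B, H, W)
    cam = F.interpolate(cam.unsqueeze(1), size=x.shape[2:],
                        mode="bilinear", align_corners=False).squeeze(1)
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
```

Remapping the normalized heatmap onto the original frame then only requires the face crop's bounding box from the detector.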
Training recipe and data pipeline
The team used FaceForensics++ (C23), which covers multiple synthesis methods (Face2Face, FaceShifter, FaceSwap, NeuralTextures). Frames were sampled at 1 FPS and faces cropped with YOLOv11n, yielding roughly 716K face images for training. Training ran for 7 epochs on a single RTX 3090 (rented) in about 4 hours using AdamW with lr=1e-4, cosine annealing, and CrossEntropyLoss. Nothing exotic—just a clean, reproducible setup that many practitioners can replicate locally or in the cloud.
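The reported recipe maps directly onto standard PyTorch components. The sketch below uses a stand-in model and dummy batches in place of the real dataloader; only the optimizer, schedule, loss, and epoch count come from the article.

```python
import torch
import torch.nn as nn

# Stand-in model; the real setup uses the two-stream network
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
epochs = 7
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
criterion = nn.CrossEntropyLoss()

for epoch in range(epochs):
    # One dummy batch per epoch stands in for the real face-crop dataloader
    x = torch.randn(4, 3, 32, 32)
    y = torch.randint(0, 2, (4,))
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # cosine decay from 1e-4 toward 0 over 7 epochs
```

With ~716K crops, seven epochs at this scale fitting in ~4 hours on a single RTX 3090 is plausible and easy to reproduce on rented hardware.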
If you want to experiment, the pipeline is friendly to the standard Python DL stack. For example, if you prefer PyTorch, you can slot in your own dataloaders, swap backbones, or prototype different frequency transforms. It also plays nicely with GPU tooling like CUDA for acceleration. For quick tests, cloning the repo is as simple as:
git clone https://github.com/VeridisQuo-orga/VeridisQuo
From there, you can adapt it to your data or evaluation setup, and yes—Ctrl+C gracefully stops your first overenthusiastic training run.
What the team found—and why it matters
- Fusion helps on higher-quality fakes. When pixel-level cues are minimal, frequency inconsistencies can still betray synthesis or compression steps. The combined signal boosts sensitivity where it counts.
- DCT shines on compression artifacts. Most web-delivered videos are compressed, and even minor codec artifacts are not uniformly distributed across frequency bands. Block-wise DCT features can tease out these patterns.
- Grad-CAM focuses on plausible regions. Seeing attention around blending boundaries and jawlines matches intuition about common deepfake failure modes, which lends credibility to the decisions.
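The DCT point is easy to verify yourself: block-wise DCT energy maps expose how codecs redistribute power across the 8×8 frequency grid. Below is a self-contained NumPy sketch (helper names are hypothetical, not from the repo) using the orthonormal DCT-II basis that JPEG-style codecs quantize.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix, as used for 8x8 JPEG-style blocks."""
    k = np.arange(n)
    m = np.sqrt(2.0 / n) * np.cos(
        np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # DC row has a different normalization
    return m

def blockwise_dct_energy(gray, block=8):
    """Mean absolute DCT coefficient at each of the 8x8 frequency
    positions, averaged over all blocks -- a sketch of the DCT path."""
    h, w = gray.shape
    h, w = h - h % block, w - w % block  # crop to a multiple of the block
    D = dct_matrix(block)
    blocks = gray[:h, :w].reshape(h // block, block, w // block, block)
    blocks = blocks.transpose(0, 2, 1, 3).reshape(-1, block, block)
    coeffs = D @ blocks @ D.T            # 2-D DCT of every block at once
    return np.abs(coeffs).mean(axis=0)   # (8, 8) average energy map
```

Re-encoded or synthesized regions tend to show energy maps that deviate from the smooth low-frequency-dominant profile of natural camera footage.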
For developers, this suggests a practical recipe: retain a strong image backbone, add a lightweight frequency branch, and pay attention to explainability. Even if your application is not deepfake detection—say, tamper detection, splicing, or deblurring—frequency cues can complement spatial models without blowing up parameter counts.
Evaluation caveats and next steps
Like every detector, generalization is the hard part. Training on FaceForensics++ is a solid starting point, but performance can shift on out-of-distribution data. The team is looking to evaluate on Celeb-DF and DFDC, which would provide a read on cross-dataset robustness and potential overfitting to specific artifacts. If you adopt VeridisQuo, consider:
- Cross-dataset checks: Test on at least one dataset not seen in training. Consider different compression levels and codecs.
- Temporal context: This implementation analyzes face crops frame by frame. For videos, temporal inconsistencies (e.g., flicker in frequency patterns) might be useful. Could a small temporal module help?
- Adversarial resilience: As detectors use frequency cues, generators may learn to mask them. Keep an eye on adversarial training and augmentation that disturbs spectral regularities.
How engineers can use this today
Several practical use cases emerge:
- Content moderation triage: Run VeridisQuo as a pre-filter. Route high-confidence detections to human review with Grad-CAM overlays for fast acceptance or rejection.
- Verification toolchains: Pair it with metadata checks and hashing. Frequency analysis provides an orthogonal signal to classic integrity methods.
- Dataset curation: Use the model to flag questionable samples when assembling training sets for face analysis tasks, improving downstream model reliability.
- Research baselines: Treat VeridisQuo’s fusion design as a baseline. Swap in different backbones (e.g., vision transformers), try multi-scale DCT, or experiment with learnable frequency filters.
Tips to extend the approach
If you want to push this idea further:
- Compression-aware training: Randomize bitrate, codec, and quantization during training to stress-test the frequency branch and reduce overfitting to a single pipeline.
- Multi-resolution frequencies: Beyond 8×8 DCT blocks, try multi-scale patches or wavelets to capture both local and global spectral cues.
- Lightweight deployment: With ~25M parameters, edge deployment is plausible. Prune or distill into a smaller model for mobile scenarios; consider integrating with Hugging Face for model sharing and inference endpoints.
- Ensemble with audio or text: If the source is a full video with voiceover or subtitles, a multi-modal detector could combine frequency cues in both image and audio streams.
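The compression-aware training tip can be implemented with a one-function augmentation: re-encode each crop at a random JPEG quality so the frequency branch never overfits to a single codec pipeline. A Pillow-based sketch, with an illustrative quality range:

```python
import io
import random

import numpy as np
from PIL import Image

def jpeg_augment(img_array, q_range=(30, 95)):
    """Re-encode an RGB uint8 face crop at a random JPEG quality.

    A sketch of compression-aware augmentation; the quality range is
    illustrative, and a real pipeline would also vary codec and bitrate.
    """
    quality = random.randint(*q_range)
    buf = io.BytesIO()
    Image.fromarray(img_array).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf))
```

Applied on the fly in the dataloader, this exposes the DCT features to the full spread of quantization artifacts they will meet in web-delivered video.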
Bottom line
VeridisQuo doesn’t promise magic. It takes a clear-eyed, engineering-first approach: combine spatial and frequency signals, keep the model compact, and make the decision explainable. That’s a formula developers can reason about and extend. For anyone building real-world detection or verification tools, this project offers a strong starting point—and a reminder that what you don’t see in pixels might matter most.
Explore the code at the project’s GitHub: VeridisQuo repository. If you kick the tires on Celeb-DF or DFDC—or plug in your own dataset—share findings with the community. The frequency game is only getting started.