Voice Authentication Will Not Survive the Rise of Generative AI

The exponential growth of generative AI in recent months has resulted in an increase in both the availability and efficacy of applications capable of cloning users’ voices — giving fraudsters the tools to bypass the voice authentication systems often used by financial enterprises to secure customer accounts.

The rise of audio deepfakes has been a wake-up call for many businesses: in the age of generative AI, voice recognition can no longer be considered a reliable authentication modality. In this blog post, we’ll discuss the declining efficacy of voice authentication, the techniques used for voice spoofing and how financial institutions and other organizations can adapt to the evolving threat that generative AI poses.

The declining efficacy of voice authentication

What is voice authentication? 

Voice authentication systems are a type of biometric authentication that verifies customer identities through voice recognition software that detects unique patterns in users’ voices. Using these systems, customers can authenticate by repeating a simple passphrase, such as “my voice is my password.”
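Under the hood, systems like this typically compare a fixed-size embedding (a “voiceprint”) of the caller’s audio against one captured at enrollment. The sketch below illustrates that comparison using toy embeddings in place of a real speaker encoder’s output; the function names, vector values and acceptance threshold are all illustrative, not any specific vendor’s implementation.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voiceprint embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_speaker(enrolled: np.ndarray, sample: np.ndarray,
                   threshold: float = 0.75) -> bool:
    """Accept the caller if the new sample's embedding is close enough
    to the voiceprint captured at enrollment."""
    return cosine_similarity(enrolled, sample) >= threshold

# Toy embeddings standing in for the output of a real speaker encoder
enrolled_voiceprint = np.array([0.9, 0.1, 0.4, 0.3])
genuine_attempt     = np.array([0.88, 0.12, 0.38, 0.33])
impostor_attempt    = np.array([0.1, 0.9, 0.2, 0.7])

print(verify_speaker(enrolled_voiceprint, genuine_attempt))   # True
print(verify_speaker(enrolled_voiceprint, impostor_attempt))  # False
```

The weakness deepfakes exploit is exactly this design: anything that produces audio close enough to the enrolled voiceprint passes, regardless of who, or what, produced it.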

Over the past decade, this software has been embraced by financial institutions that need to strongly authenticate customers in voice channels such as call centers — but the efficacy of these systems has been rapidly degrading due to the rise of AI voice cloning models. This trend is now poised to accelerate as recent advances in AI have made these tools better, cheaper and more accessible than ever. 

Technical advances and increased availability

Last month, BusinessWire reported that a third of global businesses have already been hit by voice authentication scams. But although voice authentication fraud is on the rise, it is not new: in 2019, the BBC reported multiple cases of fraud in which fake voices were used to bypass authentication systems.

At the time, deepfakes were the province of technically adept fraudsters who would need to invest time and expertise in building and tuning AI models. But with the widespread availability of easy-to-use AI tools like VALL-E and ElevenLabs, it is now possible for anyone with a laptop or smartphone to create realistic audio deepfakes that are increasingly difficult to distinguish from genuine voices.

With these tools, fraudsters can clone the voice of their victim and use it to bypass voice authentication systems and complete additional liveness checks — in some cases using only a few seconds of training data obtained from phishing calls or online voice recordings. 

Voice spoofing tools and countermeasures

Voice cloning with a three-second audio sample

Although many AI voice cloning tools on the market today were created for legitimate purposes such as voiceovers, audiobooks, disability assistance and more, the potential for abuse is clear. 

With the rise of generative AI, sophisticated models have the ability to:

  • Replicate vocal tics like “ums” and “uhs”
  • Preserve the acoustic environment of a speaker’s voice
  • Infuse cloned voices with different emotions 
  • Clone any voice by uploading as little as three seconds of audio
  • Generate any phrase you would like the clone to repeat, simply by typing it into the UI

At the time of writing, there are 48 public repositories on GitHub under the topic of voice cloning, a number that continues to grow.

And these models are getting better and easier to use all the time. Recent articles in Vice, The Guardian and the Wall Street Journal detail just how easy these tools make it to generate a voice clone that is capable of fooling voice authentication systems. 

“When I asked Hany Farid, a digital-forensics expert at the University of California, Berkeley, how we can spot synthetic audio and video, he had two words: good luck,” adding that you can’t make everyone an AI detective.

— Wall Street Journal

The best voice cloning tools on the market today can create believable audio from a new and unknown speaker using only a three-second sample, a capability known as zero-shot voice cloning. To give you a sense of just how impressive the results are, you can access a library of AI-generated vocal samples on VALL-E’s page on Microsoft’s research website.

In addition, high-end audio editing software is becoming more widely available and easier to use — meaning that fraudsters on phishing calls no longer need to convince their targets to transfer funds in order to score a payday; they simply need the victim to speak a few seconds of dialogue that can be used to create a clone.

Considering these developments, we can draw two conclusions: 

  • Voice biometrics are not phishing-resistant: In the age of deepfakes, even the savviest users can be fooled by a fraudster spoofing the voice of a trusted contact into providing the voice sample needed to clone or edit together their passphrase.
  • If your voice is online, it can be cloned: The ubiquity of social media videos, podcasts, vlogs and other online clips, coupled with the minimal amount of training data required to clone a user’s voice, means that even users who aren’t susceptible to voice phishing have likely already provided enough audio online to mimic their voices using AI. 

And while more and more legitimate voice cloning applications are building protections into their models to prevent misuse, other generative AI tools like ChatGPT significantly reduce the technical skill needed to bootstrap malicious models that do not include such protections.

Bootstrapping an AI voice cloning model

Although building and training an AI model from scratch requires a significant amount of time and expertise, open-source pretrained models are now widely available, lowering the bar for bootstrapping a zero-shot voice cloning model that requires no fine-tuning to generate believable audio samples.

In this scenario, a fraudster could record voice samples using open-source, high-end recording software, which makes it possible to eliminate the noise and audio interference that might otherwise reduce the believability of cloned recordings, or to edit together a passphrase from a handful of words a victim spoke out of context in an online recording or phishing call.

These recordings can then be uploaded to an open-source speaker encoder — some of which provide everything needed to install the required libraries, import the modules, upload audio samples and run the code that synthesizes the desired text.
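As a rough illustration of how little glue code that end-to-end workflow requires, the sketch below uses stub functions in place of a real encoder and synthesizer; every name here is hypothetical, and the stubs only mimic the shape of the pipeline, not actual model behavior.

```python
import numpy as np

# --- Stubs standing in for a real open-source encoder/synthesizer pipeline ---
def embed_speaker(samples):
    """Stub: a real speaker encoder maps raw audio clips to one voiceprint."""
    return np.mean([np.asarray(s, dtype=float) for s in samples], axis=0)

def synthesize(text, voiceprint):
    """Stub: a real zero-shot TTS model conditions on the voiceprint
    to generate audio of `text` in the cloned voice."""
    return {"text": text, "voiceprint": voiceprint}

# 1. A few seconds of audio from the victim (phishing call, online clip),
#    represented here as toy waveform fragments
samples = [[0.10, 0.20, 0.30], [0.12, 0.18, 0.31]]

# 2. Encode the samples into a single speaker embedding
voiceprint = embed_speaker(samples)

# 3. Type the passphrase the clone should speak and synthesize it
cloned_audio = synthesize("my voice is my password", voiceprint)
```

The three numbered steps are the entire attack surface: sample, encode, synthesize, with no fine-tuning in between.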

It’s that easy. And with generative AI tools like ChatGPT, AutoGPT and more enabling beginners with minimal coding experience to write their own code, it’s getting easier by the day. 

The insufficiency of countermeasures 

Following the FTC’s advice and growing demand from consumers and thought leaders, many legitimate voice cloning applications now include countermeasures to prevent abuse. However, a close examination of the techniques used for voice cloning detection and fraud prevention shows that they are far from bulletproof. These techniques include:

  • Paywalling accounts: ElevenLabs, one of the most advanced voice cloning applications on the market, recently made a series of changes, including paywalling their application to raise the bar for its usage. However, subscriptions to ElevenLabs start at just $5/month — a small price for fraudsters to pay for the tools that could be used to defraud financial institutions of millions of dollars. 
  • Application moderation: Many voice cloning applications are moderated by researchers to flag and delete accounts that have been used to generate material that could be used maliciously, but this supervision is not occurring in real time — meaning that fraudsters would only lose access to their accounts after they have obtained the means to bypass voice recognition systems.
  • Permission requirements: Permission requirements used in many voice cloning tools can be cursory, requiring users only to check a box stating that they have the necessary rights and consent to clone any voices uploaded to the application — an action that could be easily performed by a malicious user. 
  • Digital watermarking: Some voice cloning applications, like VALL-E, have built or are developing tools that can be used to label cloned voices so they are distinguishable from original ones. However, in many cases, the release of these tools is lagging behind the release of the technology or does not come built into the software.  
  • AI detection tools: Services like Hive and Optic for detecting AI-generated content are now available, but they often work much better when comparing samples with vast amounts of audio from original speakers. In addition, these tools degrade in accuracy with each new advance in AI, and new versions of AI software are being released and updated at a breakneck pace. 

Ultimately, these technologies do not fully account for the risks posed by the developing AI landscape and emerging voice fraud techniques. This means: 

  • Countermeasures will not protect against phishing: Digital watermarks can be detected using AI, but end users do not have access to these capabilities when they need to detect phishing attacks.
  • Robust detection will be difficult to execute: The speed at which AI voice cloning and editing technology is improving means that detection methods will quickly become outdated, and the need for businesses to integrate and manage detection software will further increase the cost and complexity of maintaining secure voice authentication.
  • Different detection methods will be needed for different techniques: Multiple services will be required to detect the wide variety of audio manipulation techniques that can be used to create deepfakes, such as voice conversion, text-to-speech AI models and audio editing, further complicating the implementation of detection methods. 
  • Countermeasures do not protect against bootstrapped models: Although legitimate businesses may develop guardrails to prevent misuse, AI models built by fraudsters will not — diminishing the effectiveness of these protections. 

As a result, the FTC has recently issued a warning about the dangers of generative AI in the age of deepfakes, and experts are recommending that banks using voice authentication switch to another mode of authentication with stronger security.

“I recommend all organizations leveraging voice ‘authentication’ switch to a secure method of identity verification, like multi-factor authentication, ASAP.” 

— Rachel Tobac, CEO of SocialProof Security, via Vice Motherboard

How to adapt to the evolving threat

If banks and financial institutions can no longer trust voice biometrics, what measures can they take to enable strong, frictionless authentication in call centers? Ultimately, the groundbreaking strides being made in generative AI enable sophisticated presentation attacks that can only be prevented with layered detection capabilities. This means wrapping device-bound authentication with additional methods for step-ups and protection throughout the user lifecycle. 

Transmit Security provides these capabilities through a full suite of natively integrated services, including:

  • Phishing-resistant authentication across channels that is inherently multifactor, binding credentials to a specific device so that the authenticating device itself serves as a second authentication factor.
  • Wrapping authentication with Detection & Response Services for risk, trust, fraud, bots and behavior, which build a historical profile for each user based on captured behavior, known locations, device fingerprinting, networks and more to detect deviations from known patterns and call out abnormalities that can indicate fraud. 
  • Improving step-up security measures with Identity Verification Services that provide strong step-up capabilities via a fast and thorough inspection of the user’s IDs with robust liveness detection that checks for signs of video spoofing.
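As a simplified illustration of the detection idea above — flagging deviations from a user’s historical profile — the toy check below scores a login attempt against known devices and locations. The field names, profile structure and scoring are purely illustrative, not the actual Transmit Security implementation.

```python
# Toy risk check: flag an authentication attempt whose device or location
# deviates from the user's historical profile (all fields illustrative).

known_profile = {
    "devices": {"iphone-14-abc123"},   # device fingerprints seen before
    "countries": {"US"},               # locations seen before
}

def risk_score(attempt: dict, profile: dict) -> int:
    """Count deviations from the user's known behavior; higher = riskier."""
    score = 0
    if attempt["device"] not in profile["devices"]:
        score += 1
    if attempt["country"] not in profile["countries"]:
        score += 1
    return score

attempt = {"device": "unknown-device", "country": "RO"}
print(risk_score(attempt, known_profile))  # 2 -> step up to identity verification
```

A real service would weigh many more signals (networks, behavior, bot indicators) and feed the score into a policy that triggers step-up verification rather than a hard block.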

It’s time to replace your voice biometric system 

Ultimately, voice antispoofing technologies will always be playing catch-up to the technologies used for voice cloning; each new generative AI model and version must first exist before the tools needed to detect and label it can be developed. As such, security and identity leaders must adapt to this rising threat by replacing voice credentials with stronger methods of authentication, fortified with multilayered detection and step-ups throughout the user lifecycle.

To find out more about how the Transmit Security Platform can help your business replace its voice biometric authentication system, check out our service briefs on Authentication, Identity Verification, and Detection and Response, or contact a sales representative today for a free personalized demo.


  • Rachel Kempf, Senior Technical Copywriter

    Rachel Kempf is a Senior Technical Copywriter at Transmit Security who works closely with the Product Management team to create highly technical, narratively compelling assets for customers and prospects. Prior to joining the team at Transmit Security, she worked as Senior Technical Copywriter and Editor-in-Chief for Azion Technologies, a global edge computing company, and wrote and edited blog posts and third-party research reports for Bizety, a research and consulting company in the CDN industry.