ChatGPT might sometimes seem able to think like you, but wait until it suddenly sounds just like you, too. That possibility came to light with the new Advanced Voice Mode for ChatGPT, which runs on the GPT-4o model. OpenAI released the GPT-4o system card last week, explaining what the model can and can’t do, and it includes the very unlikely but still real possibility of Advanced Voice Mode imitating users’ voices without their consent.
Advanced Voice Mode lets users engage in spoken conversations with the AI chatbot. The idea is to make interactions more natural and accessible. The AI has a few preset voices from which users can choose. However, the system card reports that this feature has exhibited unexpected behavior under certain conditions. During testing, a noisy input triggered the AI to mimic the voice of the user.
The GPT-4o model produces voices using a system prompt, a hidden set of instructions that guides the model’s behavior during interactions. In the case of voice synthesis, this prompt relies on an authorized voice sample. But while the system prompt guides the AI’s behavior, it is not foolproof. The model’s ability to synthesize voice from short audio clips means that, under certain conditions, it could generate other voices, including your own. In the clip OpenAI shared, you can hear the moment the AI jumps in with “No!” and suddenly sounds like the first speaker.
Voice Clone of Your Own
“Voice generation can also occur in non-adversarial situations, such as our use of that ability to generate voices for ChatGPT’s advanced voice mode. During testing, we also observed rare instances where the model would unintentionally generate an output emulating the user’s voice,” OpenAI explained in the system card. “While unintentional voice generation still exists as a weakness of the model, we use the secondary classifiers to ensure the conversation is discontinued if this occurs making the risk of unintentional voice generation minimal.”
As OpenAI said, it has since implemented safeguards to prevent such occurrences. The main one is an output classifier designed to detect deviations from the pre-selected authorized voices, helping to ensure that the AI does not generate unauthorized audio. Still, the fact that it happened at all reinforces how quickly this technology is evolving and how safeguards have to evolve to match what the AI can do. The model’s outburst, where it suddenly exclaimed “No!” in a voice similar to the tester’s, underscores the potential for AI to inadvertently blur the lines between machine and human interactions.
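OpenAI has not published how its output classifier works, so the following is only a rough sketch of the general idea of checking generated audio against a set of authorized voices. It assumes each chunk of audio has already been turned into a numeric "voice embedding" by some speaker-recognition model, and it compares that embedding with the preset voices using cosine similarity. The function names, the 0.85 threshold, and the three-number embeddings are all hypothetical and chosen purely for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_authorized_voice(
    output_embedding: Sequence[float],
    authorized: Dict[str, Sequence[float]],
    threshold: float = 0.85,  # hypothetical cutoff, not a published value
) -> bool:
    """True if the output is close enough to at least one preset voice."""
    return any(
        cosine_similarity(output_embedding, reference) >= threshold
        for reference in authorized.values()
    )


def first_deviating_chunk(
    chunks: List[Sequence[float]],
    authorized: Dict[str, Sequence[float]],
    threshold: float = 0.85,
) -> Optional[int]:
    """Return the index of the first chunk that matches no preset voice.

    A caller would discontinue the conversation at that chunk.
    Returns None if every chunk matches an authorized voice.
    """
    for index, embedding in enumerate(chunks):
        if not is_authorized_voice(embedding, authorized, threshold):
            return index
    return None


# Toy data: one preset voice and two output chunks (made-up embeddings).
presets = {"preset_a": [1.0, 0.0, 0.0]}
output_chunks = [
    [0.99, 0.05, 0.0],  # close to the preset voice
    [0.0, 1.0, 0.0],    # sounds like someone else entirely
]
stop_at = first_deviating_chunk(output_chunks, presets)
```

In this toy run the second chunk fails the check, which is the point at which a real system would cut the response off. A production classifier would be far more sophisticated, but the principle of comparing what the AI says against what it is allowed to sound like is the same.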