OpenAI, the artificial intelligence company that unleashed ChatGPT on the world last November, is inching closer to offering a full-fledged AI assistant, thanks to an upgrade that adds voice and image recognition to the chatbot.
An upgrade to the ChatGPT mobile apps for iOS and Android announced today lets a person speak their queries to the chatbot and hear it respond with its own synthesized voice. The new version of ChatGPT also adds visual smarts: Upload or snap a photo within ChatGPT and the app will respond with a description of the image and offer more context, similar to Google’s Lens feature.
The new features of ChatGPT demonstrate how OpenAI now treats its artificial intelligence models as products subject to frequent, iterative revision. ChatGPT, the company’s unexpected hit, is starting to resemble a consumer app like Apple’s Siri or Amazon’s Alexa.
The ability to talk to ChatGPT draws on two separate models. Whisper, OpenAI’s existing speech-to-text model, converts what you say into text, which is then fed to the chatbot. And a new text-to-speech model converts ChatGPT’s responses into spoken words.
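That two-model round-trip can be approximated with OpenAI's public API. What follows is a minimal sketch, not a description of how the app itself is built: it assumes the `openai` Python SDK, an `OPENAI_API_KEY` in the environment, and the publicly documented model names (`whisper-1` for speech-to-text, `tts-1` for text-to-speech), which may differ from the models the mobile app uses internally.

```python
def voice_chat_turn(audio_path: str, reply_path: str = "reply.mp3") -> str:
    """Speech in, speech out: transcribe a query, answer it, synthesize the reply.

    Hypothetical helper for illustration; assumes the openai SDK and an
    OPENAI_API_KEY. Public API model names may differ from the app's own.
    """
    from openai import OpenAI  # deferred import so the sketch loads without the SDK

    client = OpenAI()

    # 1. Whisper: convert the spoken query to text.
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        )

    # 2. Feed the transcribed text to the chatbot.
    chat = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript.text}],
    )
    answer = chat.choices[0].message.content

    # 3. Text-to-speech: synthesize the chatbot's reply as audio.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
    speech.stream_to_file(reply_path)
    return answer
```

Calling `voice_chat_turn("question.wav")` would write the synthesized reply to `reply.mp3` and return the answer text; the three network calls mirror the three stages the article describes.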
In a demo the company gave me last week, Joanne Jang, a product manager, showed off ChatGPT’s range of synthetic voices. These were created by training the text-to-speech model on the voices of actors OpenAI hired; in the future, the company may even let users create their own voices. “In fashioning the voices, the number-one criterion was whether this is a voice you could listen to all day,” she says.