OpenAI’s recent release of Whisper boasts human-level robustness and accuracy in speech recognition. I’m not Scottish (although I was born pretty close), but I immediately wanted to test it with a Scottish accent and compare it to “human-level”.
Can I run it on my new iPhone 14 Pro?
Having bought an unexciting new iPhone, at least I could put its A16 Bionic chip with 16-core Neural Engine through its paces for my experiment. No. Maybe.
a significant amount of time passes while I:
- port the PyTorch model to CoreML packages,
- figure out how to use them,
- find no help from Google,
- discover the model is too big,
- optimise for the ‘small’ model,
- finally get the thing working (totally first time)

a little more time passes while I:
- realise that what I developed might be the first working iOS app with Whisper,
- make a note that this could be the only genuinely accurate offline speech recognition app for iOS,
- get over myself,
- remind myself what I was doing pre-yak
The interesting stuff
Once the boring tech stuff was out of the way, I shared the test app on TestFlight with a few colleagues, yielding much amusement with its borderline magical results. But what about our friends over the border?
Here’s a little clip from the start of Trainspotting, which is particularly challenging for machines to understand; a Scottish accent over the top of Iggy Pop isn’t something you’d train for.
The app got it with 100% accuracy the first time (the clip shows up to the point it may be less family-friendly!), just by holding my iPhone and playing the video on YouTube through my MacBook Pro speakers – and it only took around 5 seconds, with no internet connection required.
On top of this, the app only uses the small model provided by OpenAI (due to device hardware constraints) but proves it’s better than good enough.
I guess we don’t need to try and define “human-level” after all.
“But hang on, San Digital!” you may be asking, “Hang on! I’ve had an AlexaSiriCortana device for ages, isn’t this a solved problem? Why is Whisper important?” AlexaSiriCortana devices work by fuzzy matching intents, so if one hears “Something something mumble mumble cat videos” it can fill in the blanks and infer your meaning from what it makes out, sometimes getting it right (and if it doesn’t, you can try again).
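To make the contrast concrete, here is a minimal sketch of intent matching by word overlap. It is an illustration only, not how any real assistant works internally; the intent names, example phrases and threshold are all made up for this post.

```python
# Hypothetical intents and example phrases, invented for illustration.
INTENTS = {
    "play_cat_videos": "play some cat videos",
    "set_timer": "set a timer for ten minutes",
    "weather_report": "what is the weather today",
}


def match_intent(heard: str, threshold: float = 0.3):
    """Return the best-matching intent name, or None if nothing is close enough."""
    heard_words = set(heard.lower().split())
    best_name, best_score = None, 0.0
    for name, example in INTENTS.items():
        example_words = set(example.split())
        # Score = share of the example phrase's words that were actually heard.
        score = len(heard_words & example_words) / len(example_words)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


print(match_intent("Something something mumble mumble cat videos"))
```

Only two of the six words heard are useful, yet that is enough to pick an intent. A transcriber gets no such slack: every one of those six words would need to be right.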
Whisper is a text transcriber, which means it has to understand every word accurately. This is a hard problem to solve, even though AI has been making AI look easy recently. OpenAI’s own DALL·E 2 can create art and draw pictures, but art has no absolute right or wrong (Okay yep, I guess that could be considered a SharkCat). With noise to text, you don’t have that creative luxury: each word is either transcribed correctly or it isn’t. OpenAI claims that Whisper is at least comparable to its competitors, outperforming them in some scenarios, and even that “Whisper’s performance is close to that of professional human transcribers.”
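The usual way to put a number on “transcribed correctly or not” is word error rate: the fewest word substitutions, insertions and deletions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below is a generic implementation of that standard metric, not the evaluation code OpenAI used; the function name is my own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edits needed to match zero reference words to j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # i deletions match i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("play some cat videos", "play sum cat videos"))
```

A perfect transcript scores 0.0, which is what “100% accuracy” on the clip amounts to. One wrong word in four scores 0.25.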
That’s really cool! Oh, and this also works offline.
The current players in this market have limited or no offline capabilities. Apple speech still requires the internet to work. Alexa understands you enough to tell you to go and check your internet connection. Whisper is an open model you can run anywhere*. An open, offline model breaks your ties with the big companies, which is good if you have data or privacy concerns. And then there are the usual offline benefits such as reducing data costs, improved speed and being able to work in areas with no internet. The downside is a pretty hefty upfront download and storage usage on your device.
*The model is a 500MB download (which is not bad, considering).
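For a rough sense of where that figure comes from, here is some back-of-envelope arithmetic. It assumes the roughly 244 million parameters OpenAI lists for the ‘small’ model and that the weights are stored as 16-bit floats; the precision is my assumption, and a packaged model may carry extra overhead.

```python
# Assumptions: ~244 million parameters for 'small' (OpenAI's published figure)
# and a fixed number of bytes per stored weight.
PARAMETERS_SMALL = 244_000_000
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2}


def model_size_mb(parameters: int, precision: str) -> float:
    """Approximate weight storage in megabytes (1 MB = 1,000,000 bytes)."""
    return parameters * BYTES_PER_WEIGHT[precision] / 1_000_000


for precision in ("float16", "float32"):
    print(precision, model_size_mb(PARAMETERS_SMALL, precision), "MB")
```

At 16 bits per weight that comes to about 488 MB, which lines up with a download of around 500MB. Full 32-bit weights would be roughly double.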
Just another feature
AI is rapidly moving from being a product to becoming an enabling component of other much more valuable things; databases followed the same path. OpenAI says “We hope Whisper’s high accuracy and ease of use will allow developers to add voice interfaces to a much wider set of applications.” We wrote about this a little bit already with the commoditisation of features, not just products and services.
Can we combine voice-to-text with a call out to an API running OpenAI’s text-to-image generation services? Yes, you can ask for a picture of a crocodile riding a bicycle.
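As a sketch of how the two halves could be glued together (not the code from our app), the snippet below takes whatever a transcriber returns and turns it into a request for an image generation API. The `transcribe` callable is a hypothetical stand-in for Whisper. The endpoint URL and JSON fields are my understanding of OpenAI’s image API, so treat them as assumptions and check the current documentation.

```python
import json
import urllib.request
from typing import Callable

# Assumed endpoint for OpenAI's image generation API; verify before relying on it.
IMAGES_URL = "https://api.openai.com/v1/images/generations"


def build_image_request(prompt: str, api_key: str, size: str = "512x512") -> urllib.request.Request:
    """Build (but do not send) a POST request asking for one image of `prompt`."""
    body = json.dumps({"prompt": prompt, "n": 1, "size": size}).encode("utf-8")
    return urllib.request.Request(
        IMAGES_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def speech_to_image_request(
    audio_path: str,
    transcribe: Callable[[str], str],  # hypothetical: audio file path in, text out
    api_key: str,
) -> urllib.request.Request:
    """Transcribe the audio, then use the text as the image prompt."""
    prompt = transcribe(audio_path).strip()
    if not prompt:
        raise ValueError("nothing was transcribed")
    return build_image_request(prompt, api_key)
```

Passing the resulting request to `urllib.request.urlopen` would send it. Keeping the transcriber as a plain function argument means the on-device model and the network call stay independent of each other.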
The incubation time for ideas to become products is now incredibly low. AI ideas used to be hidden behind whitepapers, with results that could only be repeated in controlled settings and hints of a great future that was still years away. Now we can have something running on an iPhone over a weekend.
Let’s do something great