Scaling an Audio Service: How we launched a high-quality Text-To-Speech service at “Neue Zürcher Zeitung”.
At Neue Zürcher Zeitung, a Swiss publisher of high-quality journalism for a German-speaking audience, we successfully launched a Text-To-Speech (TTS) audio player feature as a beta version in our web and mobile channels. It is a service for our on-the-go and audiophile users that makes consuming news in an alternative way simple and convenient.
How we defined the need for a Text-To-Speech service
With voice technologies like Amazon Alexa and Google Home becoming ubiquitous in our lives, media companies have increasingly focused on delivering news as audio over the last few years.
Our users have also been voicing their wish for more audio content. Classic podcast formats are not enough anymore: users want the choice between reading and listening to an article. We started wondering how many articles should be transformed into audio. Should we have professional, native speakers record a certain number of articles a day, as seen, for example, at ‘The Economist’? Or is the greater need for our users to be able to access all of our articles as audio?
In user interviews, we identified that our users wanted to be free to choose whichever article they listened to. Thus, we decided to offer every published article as audio. Given the volume of text (NZZ publishes up to 200 articles per day across its digital properties), recording with native, professional speakers was not an option. The search for an automated, scalable conversion of text to audio began.
How we defined the business case
NZZ’s strategy is to stay profitable in the long term through digital subscriptions. With this goal in mind, NZZ’s digital product was relaunched in November 2017, and we introduced a two-stage funnel as a driver for conversion. In the first step, a user has to register to access features such as our bookmarking function or personalization service. In the second step, after consuming a certain amount of content, users are asked to sign up for a paid subscription.
By placing our new audio functionality in the first step of the funnel, we require users to register before they can make use of the new service. Situating the audio service behind mandatory registration not only let us implement it seamlessly in our business model but also makes it another trigger to drive conversion.
Mature design patterns in the audio domain
With state-of-the-art audio players such as those of Spotify, Apple Music, Acast or Sonos already heavily represented in our everyday lives as apps and services, we decided that we didn’t have to re-invent the design of an audio player, and that doing so would mostly confuse our users. We therefore drew on existing players for many of our design decisions. When it came to new functionality, such as changing the playback speed, we iteratively designed, co-created with our users and tested different versions until we arrived at the design we have today. We will continue to iterate over the coming months, as the use case is still new to the industry.
How we built the service with its unique structure
When it came to the actual conversion from text to speech, we looked into several TTS services, including IBM’s Watson, Amazon Polly, and Google WaveNet (just to name a few). For starters, we began working with Amazon Polly.
Knowing that any TTS service on the market would develop rapidly over the coming months, we had to build an architecture flexible enough to react favourably to change (e.g. replacing the TTS engine). Our need to be flexible is what led us to our unique structure: The text runs through our self-built middleware, which we call Orator, where abbreviations like “z. B.” are expanded to “zum Beispiel” (German for “for example”) and author abbreviations like “boa.” are replaced with “Boas Ruh” (one of NZZ’s editors). The text is then transformed into SSML and sent through the TTS engine, which generates an MP3.
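As a minimal sketch of that step (the function names and the map structure are our illustration here, not the actual Orator API), the lexicon pass and the SSML wrapping could look like this:

```typescript
// Hypothetical sketch of the Orator step: written form -> spoken form.
const lexicon: Record<string, string> = {
  "z. B.": "zum Beispiel", // expanded abbreviation
  "boa.": "Boas Ruh",      // an editor's author abbreviation
};

// Replace every lexicon key in the article text with its spoken form.
function normalize(text: string): string {
  return Object.entries(lexicon).reduce(
    (acc, [written, spoken]) => acc.split(written).join(spoken),
    text
  );
}

// Wrap the normalized text in a minimal SSML document; the result is
// what gets sent to the TTS engine, which returns an MP3.
function buildSsml(text: string): string {
  return `<speak>${normalize(text)}</speak>`;
}
```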
SSML, the Speech Synthesis Markup Language, provides us with a standardized method for controlling different aspects of the speech synthesis output. For example, with SSML, one can alter prosody attributes such as rate, pitch, and volume, insert pauses of any length, change the speaking voice while reading, and control many other aspects of how the synthetic voice reads the text. The great thing about SSML is that essentially the same input can be fed to any TTS engine: Whether it’s Amazon Polly or Google WaveNet, they all follow the same commands, with some small exceptions. And because the output of our middleware lexicon feeds into the SSML, any component of our structure can be replaced.
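To make that concrete, here is an illustrative snippet (the tags are standard SSML; the German sample text and the concrete values are made up for this example):

```typescript
// Slow the rate slightly, pause for half a second, then raise the volume.
const ssml = `
<speak>
  <prosody rate="95%">Dies ist der Vorspann des Artikels.</prosody>  <!-- "This is the article's lead." -->
  <break time="500ms"/>
  <prosody volume="loud">Ein neues Kapitel beginnt.</prosody>  <!-- "A new chapter begins." -->
</speak>`;
```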
Just a few weeks before the beta launch of our audio player functionality, Google’s WaveNet-based TTS service became available for German. Thanks to the flexibly chosen architecture, we completed the switch at short notice. The result was a service that was better by leaps and bounds. We expect this rate of continuous improvement to accelerate further, and the naturalness of the voice to increase massively over the coming months.
Five things we learned while building our new text-to-speech service
1. Google WaveNet is super clever. But when it comes to Swiss German, that’s just not enough
Google WaveNet, the text-to-speech engine we use to generate audio files, is a talented little thing when it comes to languages. So far, it speaks nine languages with a quality that sounds much more natural than other systems. It uses a neural network that has been trained on a large number of speech samples, which allows it to create audio waveforms from scratch that follow the tone sequence and structure of those samples. This works wonderfully for languages the engine already knows and is attuned to. When it comes to Swiss German words, however: not so wonderful.
With Neue Zürcher Zeitung being a Swiss newspaper and some words or names being derived from Swiss German (some strange dialect stuff) or French, our audio service tends to stumble over them. We can’t blame it: we’re the ones who told it to interpret everything in High German, after all.
We have taken these cases (which, incidentally, also affect words from other foreign languages) into account in our solution and equipped the middleware with a lexicon through which all words flow before they are converted into audio. For example, if a “Cervelat” (a Swiss sausage) is referenced in an article, our middleware, which we lovingly call “the Orator”, converts it to “Servellah”.
Meanwhile, our lexicon counts almost 12'000 entries: our editors’ author abbreviations, the correct pronunciation of a Vietnamese restaurant in Zurich (Co Tschin Tschin), the title of a Black Mirror episode (Ark Äyntschl), you name it.
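For illustration, a few of the entries mentioned above, in the kind of simple written-to-spoken map such a lexicon can boil down to (the structure is our sketch, not the Orator’s actual data model):

```typescript
// A handful of lexicon entries: how a word is written vs. how it is spoken.
const lexiconEntries: Record<string, string> = {
  "boa.": "Boas Ruh",      // an editor's author abbreviation
  "Cervelat": "Servellah", // the Swiss sausage
  // ...almost 12'000 further entries, from restaurant names to episode titles
};
```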
Of course, we can probably never cover all the exceptions by ourselves — this is why we will introduce a feedback feature for our users in the future.
2. Make your architecture mix-and-match-friendly or don’t make an architecture at all
With that constant wind of change blowing in your face, there are a few things you can do to stay warm: Carry a backpack full of spare clothes. Mix and match. Layer, if needed. The same thing applies to text-to-speech services: In a changing industry with changing tools, needs and products, we needed to build a service that could easily be adapted to changing circumstances. So we opted for our unique structure.
This way, we were able to move our service from Amazon Polly to Google WaveNet at short notice. The result is a service that has improved by leaps and bounds. As Google WaveNet keeps learning, we expect the service to improve quickly. Another major advantage: We can roll out the audio feature in our other products, such as the finance vertical “The Market” or our Sunday publication “NZZ am Sonntag”; all we need to do is attach their CMS endpoints.
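One common way to get this mix-and-match property (a sketch under our own naming; NZZ’s actual code will differ) is to hide each provider behind a shared interface:

```typescript
// The rest of the pipeline only ever sees this interface.
interface TtsEngine {
  synthesize(ssml: string): Promise<Uint8Array>; // returns MP3 bytes
}

class PollyEngine implements TtsEngine {
  async synthesize(ssml: string): Promise<Uint8Array> {
    // The actual Amazon Polly call is omitted in this sketch.
    throw new Error("not implemented");
  }
}

class WavenetEngine implements TtsEngine {
  async synthesize(ssml: string): Promise<Uint8Array> {
    // The actual Google Cloud Text-to-Speech call is omitted in this sketch.
    throw new Error("not implemented");
  }
}

// Swapping engines then becomes a one-line change:
const engine: TtsEngine = new WavenetEngine();
```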
3. Some people love audio, others just don’t
There are people that love audio. They listen to their morning news through a podcast or on the radio. They don’t read books. And they can’t wait to finally get to listen to their newspaper.
And then, there are people who actually still really like to read their news. They really don’t need to be bothered with this audio business. They just. don’t. need. it.
We asked both user types to evaluate different text-to-speech engines and tested a section of text read by an actual human for comparison. The results weren’t really surprising regarding the preferred voice: Both groups rated the natural human voice the highest.
But the really exciting insight came when we listened to the reasoning behind their answers: People who already used a lot of audio were not really bothered by lower quality; they just wanted to be able to listen.
On the other hand, people who didn’t really use audio much up to this point said they probably wouldn’t use the service even if the voice were more natural.
Thus, our conclusion: Either you like and use audio, or you don’t. The quality of the speaker doesn’t seem to have a relevant influence on usage.
4. How to make a written piece pleasing to listen to
You can imagine the elements coming out of our CMS like the pieces of one of those pre-fabricated houses which can be put together in lots of different ways.
An article that is published in its text form on our website is one house layout, and the one in MP3 format is a different house layout: They had the same elements to pick from but were laid out in a different way.
For the audio version of the article, we put all the elements back into the article construction kit and took a fresh look at how our users would like an article read to them. Does it make sense to have the article read out in exactly the same structure as a screen reader would, for example? Or would it be better to define a new order, or to omit certain elements? We opted for the latter and defined audio templates, which make our solution unique.
To implement our templates, we took example articles and transcribed them into SSML. In the templates, we determined which elements on the page should be read out in which order, which ones should be omitted, and which should be spoken at a different volume. So: read the headline first (and just a tad louder), go on with the lead, pause for a bit, then read the author byline. If there’s a new chapter title, read that one just a tad louder again. What we got was an article structure created especially for the audio experience.
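A sketch of what such a template could produce (the element names and the concrete values are illustrative, not NZZ’s actual template):

```typescript
// Render one article to SSML: headline a tad louder, then the lead,
// a short pause, then the byline; chapter titles are raised again.
interface Article {
  headline: string;
  lead: string;
  byline: string;
  chapters: { title: string; body: string }[];
}

function renderArticleSsml(a: Article): string {
  const chapters = a.chapters
    .map(c => `<prosody volume="+2dB">${c.title}</prosody><p>${c.body}</p>`)
    .join("\n  ");
  return `<speak>
  <prosody volume="+2dB">${a.headline}</prosody>
  <p>${a.lead}</p>
  <break time="600ms"/>
  <p>${a.byline}</p>
  ${chapters}
</speak>`;
}
```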
5. Many different player experiences might make your life a little difficult
When we set out to implement audio, we knew that we wanted to introduce it across all of our products: On desktop, in the app, and on tablets.
We realized relatively early on that we had to design and develop many, many different player variants: On the web version of Neue Zürcher Zeitung, for example, background playback was not possible with the current technology stack, so we left out the previous and next buttons. The app, in turn, needed a minimized player with an integrated image to provide orientation. And then, of course, there were landscape versions with more space to accommodate different sets of icons.
Sometimes we find ourselves dizzily wondering which version we are currently looking at and what the fine-tuned little differences between all of them are, but then we remember: We’re creating the most pleasant experience for our users across our different products, and that’s something we’re pretty darn proud of.