tl;dr
- I designed a video transcription editor that lets people correct the sloppy output of automatic speech recognition (ASR), improving both machine translation and subtitle quality.
- The tool achieved great results: editor efficiency 40% above the benchmark, a 45% decrease in editor churn, and a 30% increase in quality.
Unbabel Video
What happens behind the scenes before translated dialogues appear beneath a video might not seem that fancy. I barely considered the intricacies it involves until I joined an ambitious project seeking to rethink the process.
In 2018, Unbabel, a translation company that combines machine and human translation, decided to establish a video service following the same dual approach. The goal was to provide quality, affordable, and swiftly produced subtitles. Among those who used the beta service were TED, CNN España, and Journeyman Pictures.
The crew and my role
I joined the project to lead on user research and all things design, but not exclusively. I also contributed to a wide range of product processes: writing specs, organising workshops, supporting translator recruitment, coordinating development with engineers, client engagement, and prioritisation.
As for the team composition, we were thoroughly cross-functional and highly autonomous, including software engineers, product managers, linguists, ML researchers, translation specialists, marketers and community managers.
The problem: realising the obvious
When I joined, the team was setting the next quarter’s OKRs. Looking at the quality evaluations, we noticed striking discrepancies between subtitles originating from high-quality transcripts and those from poorer ones. We expected this, but didn’t anticipate such a dramatic impact on two key metrics: speed and quality.
Moreover, this issue led to translation editors becoming inactive or quitting altogether, making it harder to meet delivery timelines.
For editors, subtitling jobs only made sense if the pay was fair relative to the time they took. Earnings per job were determined by video length, so when a job took too long (as those with poor transcription did), it simply wasn’t worth it. Not to mention how demanding subtitling already is.
A new approach to tackle ‘the root cause’
There was strong consensus among the team that we should introduce a new human step into our subtitling flow: a step where transcribers would refine the original-language transcript captured by the ASR before it was machine translated.
Our hypothesis, in a nutshell, was this: Better transcription leads to better machine translation, which results in better subtitles, thereby making jobs easier and of higher quality, keeping translators more engaged, and ultimately providing a faster and better service.
This required us to design an interface specifically built for editing ASR output, and so the ball was now in my court.
User research: editors, editors, editors
Before getting into the weeds of transcript editing, I kicked off my research by gaining familiarity with the broader pain points of editors who worked on video translation jobs. This was important to see what takeaways could inform the design of the new tool.
What followed was exploratory research: interviewing the editor support team, watching dozens of editor sessions on real jobs, studying editor performance stats, and completing translation tasks myself. I also looked at existing tools on the market and read through online forums used by transcribers.
Drawing from this research, I summarised a broad range of issues into four themes:
Design goals
I organised a workshop to present my findings to the team and gather feedback. We discussed each issue and agreed on the following goals:
The anatomy of transcript editing
Having developed a more grounded understanding of our editors, I narrowed my focus to transcript editing. Through this analysis, I unpacked the task both at a high level and in its granular steps, aiming to visualise the mental model of the end user.
Inadvertently, this study became the basis for the onboarding experience we built for the tool later on.
Early explorations
Constraints: handling time and space
While working on highlighting words in real time with the audio, we discovered a tricky issue we had to manage. Our program could easily highlight words using timestamps provided by the ASR. The challenge arose when editors added new words to the transcript—these new additions lacked timestamps.
Timestamps are crucial not only for aligning audio with text during editing but, more importantly, they determine when subtitles appear and disappear on screen. Without a way to assign timestamps to new words, our entire subtitle generation process would be thrown off.
After some whiteboard sessions with the engineers, we came up with a good-enough framework that ensured every word was time-stamped within the interface while maintaining an intuitive interaction for the editors.
I created the prototypes below to demonstrate how our framework handled different scenarios.
Our framework was far from perfect. There were scenarios that needed more intricate handling. For instance, if an editor deleted a whole sentence or paragraph, before rewriting it, we needed something smarter than just crudely attaching the deleted lines’ time-codes to whatever was typed. However, the logic we defined covered most scenarios, and we considered it to be a good enough starting point.
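To make the constraint concrete, here is a minimal sketch of one way timestamps could be assigned to newly typed words by interpolating between the nearest timestamped neighbours. It is written in TypeScript purely for illustration; the data shapes and function names are my own assumptions, not the framework we actually shipped.

```typescript
// Illustrative sketch only, not Unbabel's implementation.
// ASR words arrive with start/end times (in seconds); words typed
// by the editor arrive without timestamps.
interface Word {
  text: string;
  start?: number; // present for ASR words, missing for newly typed words
  end?: number;
}

// Spread each gap between timestamped neighbours across the untimed
// words inside it, proportionally to word length.
function assignTimestamps(words: Word[], clipStart: number, clipEnd: number): Word[] {
  const result = words.map((w) => ({ ...w }));
  let i = 0;
  while (i < result.length) {
    if (result[i].start !== undefined) { i++; continue; }
    // Find the run of untimed words [i, j)
    let j = i;
    while (j < result.length && result[j].start === undefined) j++;
    const gapStart = i > 0 ? result[i - 1].end! : clipStart;
    const gapEnd = j < result.length ? result[j].start! : clipEnd;
    const totalChars = result.slice(i, j).reduce((n, w) => n + w.text.length, 0) || 1;
    // Allocate a slice of the gap to each untimed word
    let cursor = gapStart;
    for (let k = i; k < j; k++) {
      const share = ((gapEnd - gapStart) * result[k].text.length) / totalChars;
      result[k].start = cursor;
      result[k].end = cursor + share;
      cursor += share;
    }
    i = j;
  }
  return result;
}
```

Even this simple interpolation shows why wholesale deletions needed special handling: once an entire sentence is removed, the surrounding anchors can be far apart, and spreading that gap evenly produces timings that drift from the actual speech.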
Introducing transcript editor 1.0
Highlighting what’s most relevant
Working with highly legible text, editors can instantly see what they have already covered, where they are, and what’s left.
Ability to hear what's being edited
Rather than editors having to manually adjust the video’s playback time to the point they are working on, playback automatically follows whichever section they are refining. Editors can also skip parts without dialogue by jumping straight to the next sentence, without having to wait for it.
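As a rough sketch of the behaviour, assuming the transcript is held as timestamped segments (the names below are hypothetical), the player simply follows whichever segment the editor focuses, and a “next sentence” action skips silent stretches:

```typescript
// Illustrative sketch only; segment shape and function names are assumptions.
interface Segment { start: number; end: number; text: string; }

// Keep playback aligned with the segment the editor is refining.
function followEditedSegment(video: HTMLVideoElement, segment: Segment): void {
  video.currentTime = segment.start;
}

// Jump straight to the next segment with dialogue, bypassing silence.
function skipToNextSentence(video: HTMLVideoElement, segments: Segment[]): void {
  const next = segments.find((s) => s.start > video.currentTime);
  if (next) video.currentTime = next.start;
}
```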
A streamlined editor journey
Unlike the legacy flow where editors were directed back to the dashboard before starting another task, in the new journey, editors could start a new job right after finishing one.
Eliminating ambiguity
Instead of having to work out how much they would earn per job or what their updated balance was after completing one, editors were provided with this information before and after each job.
Fostering power users
We took a two-fold approach to promoting efficiency through shortcuts: first, reminding editors of the available shortcut whenever they triggered the corresponding function by pointing and clicking; second, rather than simply telling editors that shortcuts would make them more efficient, showing them precisely how much faster they could be when using them.
Making task instructions digestible
Editors no longer need to read long documents with screenshots explaining each component on the screen. Instructions are now incorporated into onboarding, always accessible and broken down into smaller parts.
Lights out: dark mode
We had editors who spent over 8 hours a day working on tasks, many late at night. Recognising that this interface was ultimately a workspace, I designed a dimmed version.
Adaptive by design
Among our plans was to provide transcription and translation services for audio-only content. This design was created by adapting the components of the original interface without any major modifications.
The outcome
We implemented the tool in stages across several sprints, closely monitoring the impact of each new functionality. Right from the first tests, the results were reassuring. And once all the functionality, along with the new editor journey, was launched, we were met with overwhelmingly positive results.