The ways to pronounce “water” may be nearly as ubiquitous as the substance itself. In 2003, a team of Harvard University researchers completed a survey mapping the idiosyncrasies in the pronunciation of common English words across the globe. (How do you say “pecan”?)
But just because the same word may sound funny, perhaps even unintelligible, in different parts of the world doesn’t mean the speaker is wrong. Yet some of today’s voice recognition systems were neither designed nor trained to pick up different dialects, and that can be a problem when they are used to assess oral fluency, says Patricia Scanlon.
“I can drive three counties away, and probably won’t understand some of the words people say,” quips Scanlon, the founder and CEO of SoapBox Labs, who is based in Dublin, Ireland. “But that doesn’t mean they’re wrong.”
Scanlon has plenty to say about speech. She is an engineer who holds a doctorate in speech-recognition technology and has spent the past two decades researching the field at Columbia University, IBM and Bell Laboratories.
In 2013, after watching her three-year-old daughter work through phonics exercises on an iPad, it struck Scanlon that the apps at the time often did a poor job of measuring oral fluency. Some, for instance, offered multiple-choice questions that simply asked kids to pick the correct pronunciation, hardly a useful exercise.