“I’ve always dreamed of being able to speak to Siri in my mother tongue, Igbo, and hearing her answer with something other than ‘I’m sorry, I didn’t understand,’” said Chris Emezue, currently a PhD student in computer science at UdeM and Mila–the Quebec Artificial Intelligence Institute.
That desire spurred him to create NaijaVoices, a project to build a comprehensive speech dataset for Nigeria’s three most widely spoken languages: Igbo, Hausa and Yoruba.
Emezue embarked on the project for his master’s degree in computer science, under the supervision of UdeM’s Christopher Pal and in collaboration with Mila. Today, NaijaVoices is one of the world’s largest African speech datasets, with more than 1,800 hours of recordings from over 5,000 speakers, covering nearly 645,000 unique sentences.
Absent from the digital space
Voice-enabled AI technologies are governed by a simple principle: the more training data the AI model has for a language, the better the AI can understand it.
As a result, these technologies naturally favour dominant languages such as English, while languages like Igbo, Hausa and Yoruba are classified as “low-resource languages”—a term that Emezue finds misleading.
“These languages are spoken by tens of millions of people. They are not scarce in terms of speakers, but in terms of digital data,” he explained.
Until recently, datasets for African languages were very small.
“We’re talking about five or ten hours of recordings, fifty at most,” said Emezue. “And because the datasets are so small, there’s not a lot we can do with them. It’s a vicious cycle that I wanted to break.”
This paucity of data may be due in part to the largely oral nature of these languages.
“In many parts of Africa, knowledge transmission, daily interactions and social life rely heavily on oral communication,” Emezue explained. “We love to gather together, chat and tell stories.”
However, AI models have been trained primarily on written material, which puts languages with little digital presence at a disadvantage.
Collaborative and inclusive
To overcome this obstacle, Emezue took an approach different from those typically used in international tech initiatives.
“Often, to create datasets in developing countries, people go there, collect data, pay and leave, but this extraction method isn’t sustainable,” he argued. “With NaijaVoices, the goal was to build a collaborative relationship—one where we create data with communities, not for them.”
The project’s success relied heavily on trust-based local networks. Emezue involved his own family in the project; his mother, sister and brother all helped coordinate field operations.
“Without people based in Nigeria who know the communities and who to contact, I wouldn’t have been able to carry out this project,” he said.
Special effort was made to ensure balanced representation of women’s voices. His twin sister spent several weeks traveling to remote Hausa communities to connect with women and secure their participation in the project.
“These women allowed her into their homes precisely because she was a woman. Without that, our dataset would have been predominantly male,” said Emezue.
Authentic, culturally rich sentences
Before recording could begin, the team had to determine the exact sentences to be read and recorded. Rather than using content from the Internet, which is often religious or translated from Western languages, the project enlisted 144 writers and linguists to produce original sentences that reflect authentic cultural usage.
This work sparked some lively discussions. “Meetings would sometimes turn into debates: ‘No, you don’t say it like that! My grandmother said it’s…,’” Emezue recalled with a laugh.
Once the sentences to be recorded were approved, local “facilitators” with both technical and linguistic training guided the “voice donors” through the recording process. A dedicated app was used to ensure high-quality audio.
The result was 1,800 hours of audio recorded by over 5,000 speakers, making NaijaVoices one of the richest and most diverse datasets of African languages ever compiled.
Testing on AI models
The next phase was testing the NaijaVoices dataset with existing automatic speech recognition (ASR) models. “For the first time, we had a really large dataset of African speech, so we wanted to see what we could do with it,” said Emezue.
The team began by training several state-of-the-art ASR systems on the NaijaVoices dataset, including OpenAI’s Whisper and Meta’s MMS.
The results showed that training these ASR systems on the NaijaVoices dataset significantly improved their performance, reducing word error rates by up to 75 per cent, depending on the language and configuration. In some cases, performance improved dramatically after integrating just a subset of the NaijaVoices dataset.
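The word error rate behind these figures has a simple definition: the number of word-level substitutions, insertions and deletions needed to turn the system’s transcript into the reference transcript, divided by the length of the reference. The sketch below, which is illustrative and not the project’s actual evaluation code (the Yoruba-looking example sentence is invented), computes it with a standard dynamic-programming edit distance:

```python
# Illustrative sketch: word error rate (WER), the metric cited above.
# WER = (substitutions + insertions + deletions) / number of reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for word-level edit distance
    # (substitutions, insertions and deletions each cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of four reference words gives a WER of 0.25.
print(wer("mo fe ra iwe", "mo fe ra owo"))  # -> 0.25
```

A “75 per cent reduction” is relative: a system whose WER drops from, say, 0.40 to 0.10 on a given test set after training on the new data has cut its error rate by 75 per cent.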
“These experiments clearly demonstrate that when high-quality, representative data is available, AI can learn and excel in these languages,” said Emezue.
A world of possibilities
In addition to its technical innovations, NaijaVoices is pioneering new approaches to data distribution. Under its open-access model, the dataset is available free of charge for research and educational purposes. Companies that use it for commercial purposes are asked to make a financial contribution to the community, which is then used to expand the dataset and support local employment.
This model is already beginning to bear fruit.
“This dataset opens up so many possibilities, from research to industrial applications,” said Emezue. “In fact, we’re already seeing the first industrial-grade applications. For example, the open-source Omnilingual automatic speech recognition system was tested on NaijaVoices to evaluate its performance.”
He hopes that millions of Africans will soon be able to interact with digital technologies in their native language.
“NaijaVoices represents the largest concentration of speakers ever recorded in the history of African speech databases—no other database has 5,000 speakers,” said Emezue. “Having so many different speakers ensures exceptional representativeness and opens up a world of possibilities.”