Part of Microsoft Accelerator’s batch 3 of startups, DefinedCrowd is filling a niche in the big data and machine learning community, providing near-real-time feeds of rich language data, checked by actual well-informed humans all over the world.
The need comes from the Catch-22 that often arrests deep data analysis, in that you have to understand the data to analyze it, but you must analyze it to understand it. The vast landscape of the spoken and written word and its big data counterpart in natural language processing is especially troublesome in this way.
“In the artificial intelligence space, to develop virtual assistants like Cortana, or Apple’s Siri and things like that, you need large amounts of voice recordings, you need transcriptions of those voices, you need intents and empathy labeling of those voices,” said Daniela Braga, co-founder and chief scientist, in an interview with TechCrunch. “The crowd input provides the extra refinement of the data that basically no machine can do.”
DefinedCrowd sets up pipelines through which various kinds of language data are filtered, interpreted, and enriched, partly automatically and partly with a human touch.
“If you were to do a sentiment analysis learning model, and you want the machine to learn about a social media user making certain tweets — are they happy, or are they excited? The difference is very subtle,” said Amy Du, co-founder and CEO. “This is where the crowdsourcing technique comes in.”
After the grammar is standardized (slang like “u” is replaced by “you” and emoji are stripped out, for instance), users are asked to score a phrase or sentence on, say, a 5-point scale of neutral to happy, or curious to sarcastic. A few users score the same phrase and their inputs are synthesized, and that data goes on to the next step.
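To make that step concrete, here is a minimal Python sketch of the normalize-then-score idea. It is not DefinedCrowd's code: the slang table, the rough emoji ranges, the function names, and the choice of the median to "synthesize" several users' scores are all assumptions made for illustration.

```python
import re
from statistics import median

# Hypothetical slang table; the only mapping taken from the article is "u" -> "you".
SLANG = {"u": "you"}

# Rough emoji code point ranges (an assumption, not an exhaustive list).
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def normalize(text):
    """Strip emoji and expand known slang, word by word."""
    text = EMOJI.sub("", text)
    words = [SLANG.get(word.lower(), word) for word in text.split()]
    return " ".join(words)


def synthesize(scores):
    """Combine several users' 1-5 scores for one phrase.

    The median is an assumed choice; the article does not say how
    the inputs are actually combined.
    """
    return median(scores)


phrase = normalize("u look great \U0001F600")
score = synthesize([4, 5, 4])  # three users scored the same phrase
print(phrase, score)
```

A median is used here only because it resists a single careless or dishonest score better than a plain average would; a real pipeline might weight users by track record instead.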
Pipelines can have several steps, and depending on how complex the data is, it can take a few days to get them in place. But once a workflow is established and native speakers are chiming in regularly, the data can be turned around quickly enough for hourly updates. (Any network executive or social media manager can appreciate the occasional urgency of these things.)
The natural objection, especially when users earn money for their work (more than a Mechanical Turk user, but think minimum wage, not get rich quick), is that someone is going to game this thing. DefinedCrowd takes a labor-intensive approach to managing their crowd.
“We partner with universities internationally,” said Du. “We usually start with the linguistics department, establishing a relationship with a local language ambassador, someone we can actually trust. And from there they can expand the network by bringing in additional students from the region. We know every single person that works behind the scenes.”
Not as easy as taking all comers, but this has benefits as well.
“And we have that metadata. Think about virtual assistants, they need to have dialectal and gender and age balance for those agents,” pointed out Braga. “We’re in 30 countries right now, moving to 50 in July; we have 50-100 people steady in each country.”
Du worked in tech consulting for years, specializing in connecting major companies with crowdsourcing, and Braga started as a linguistics professor in Portugal and Spain, eventually working with Microsoft on NLP-related projects like Cortana. Their paths crossed in the Seattle area while they were working in an overlapping business space, and eventually they just decided to throw in together.
The company’s time in the Microsoft Accelerator program has been helpful, the co-founders agreed (a representative from Microsoft was listening in, I should add). As you might expect, Microsoft is a fairly well connected company, and the startup soon learned of applications it might not have thought of on its own. And it doesn’t hurt to get meetings with Fortune 500 companies from all over the world looking for a way to supercharge their big data efforts.
DefinedCrowd showed their product publicly for the first time today at the Microsoft Accelerator demo day in downtown Seattle — along with seven other companies from the program’s third batch.