To train an ML algorithm, sometimes it takes a crowd

The hoops data scientists jump through to train smart applications and appliances are sometimes overlooked. It takes time and considerable effort to train an ML algorithm. Data scientists need to incorporate different data sets and inputs, made up of authentic voices, documents, images and sounds, depending on the goal of the AI algorithm. The process is made easier by giving data scientists, developers and quality assurance teams the raw materials they need to create a smart and resilient AI that doesn’t suffer from bias and has the capacity to continuously learn and improve, writes Jonathan Zaleski, Sr. Director of Engineering & Head of Applause Labs.

Let the crowds do the talking

The ML process enables brands to transform vast quantities of data into predictions. But an ML algorithm is only as good as its training data: “rubbish in, rubbish out”, as the saying goes.

For example, there are over 40 different dialects in the UK, which made it almost impossible for the data scientists at a UK broadcaster who were involved in a recent voice assistant project to replicate that level of language, speech and intonation in a traditional lab. That is why 100,000 different voice utterances, from 972 people across the UK, were used to train the smart voice assistant and expose it to different voices and accents. It was no easy task. The truth is that brands struggle to source quality data at scale on their own; they need the support of crowds and automation to deliver the quantity, quality and diversity of data required for effective ML testing.

This model of crowdsourcing is quite common and, if used correctly, can future-proof an ML algorithm so it’s ready for interaction with customers under real-world conditions. But before we get ahead of ourselves, there are some key fundamentals at play here that need to be addressed.

Remove data ambiguities

ML algorithms require lots of quality training, validation and test data to make accurate predictions. Although all three are drawn from a core data set, each serves a different purpose when preparing algorithms for real-world interactions, and the distinction can sometimes confuse data scientists and slow down the testing process. The easiest way to clear up any ambiguity is to illustrate their practical use. Take images, for example. Let’s suppose an ML algorithm is expected to analyse and classify pictures of lakes and rivers. The training data would be made up of lots of images of waterways, but not every available image of every known river or lagoon. That’s where the validation data comes in. It gives the data scientist the opportunity to introduce images the algorithm has not seen and check how well it performs against what it learned in training. This process enables data scientists to assess the algorithm’s performance, collect new metrics and make adjustments. The test data is held back until the end to evaluate how well the finished algorithm makes predictions on new data.
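To make the three roles concrete, here is a minimal Python sketch of carving one core data set into training, validation and test portions. The file names, the 70/15/15 proportions and the `split_dataset` helper are illustrative assumptions, not details from the projects described in this article.

```python
import random


def split_dataset(items, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle items once, then cut them into train, validation and test lists.

    The fractions are assumed defaults; whatever is left after the
    training and validation cuts becomes the test set.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


# Hypothetical file names standing in for labelled lake and river photos.
images = [f"lake_{i}.jpg" for i in range(50)] + [f"river_{i}.jpg" for i in range(50)]
train, validation, test = split_dataset(images)
print(len(train), len(validation), len(test))
```

Shuffling before cutting matters here: without it, the training portion would contain only lakes and the algorithm would meet rivers for the first time during validation.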

Deal with unexpected use cases

Now, testing ML data in a lab is one thing, but data scientists need to be prepared for unexpected use cases once the algorithm starts interacting with disparate data sets. At a basic level, this scenario can be illustrated by an AI recommending a cold drink on a hot day. This would require the algorithm to interact with external data sets to check the weather and the temperature outside, before squaring that situational data with the end user’s preferences and any other behavioural data. Sounds simple and straightforward, but the algorithm must have the capacity to deal with any new scenarios and use cases. Essentially, when you break it down, an ML algorithm needs to address the following data sets and variables to survive, and thrive, in real-world conditions.

Data direct from the user: This applies to data produced by individuals who are actively participating in the training of the algorithm itself. I already mentioned the smart voice assistant example, but there are countless others, including another recent project where over 1,000 people contributed handwritten documents, providing the unique samples needed to train an algorithm to read human handwriting.
Data indirect from the user: This covers the behavioural traits and responses an algorithm can study and analyse as an end user interacts with an application or a service. For example, the user may struggle with a specific feature or find it difficult to locate the right function on a screen or inside an application. The user is still contributing to the development of the algorithm. They’re just doing it indirectly rather than under test conditions where they’re being asked to complete a certain action or provide a certain phrase.
Data about the user: This is the third and final component, focusing on profile information about the users themselves. It zooms in on their situation, such as their location and the time of day, to create a more dynamic persona based on user behaviour. The algorithm would, and should, be able to react to whatever information is available to make predictions and recommendations based on a dynamic data set.
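As a rough illustration of how the three kinds of data might come together in the cold drink example, consider the Python sketch below. Everything in it is hypothetical: the `UserProfile` and `Context` classes, the 25°C threshold and the rule-based logic are assumptions made for this example, and a real ML system would learn this behaviour rather than follow hand-written rules.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    # Direct data: drinks the user has told us they like, mapped to "hot" or "cold".
    stated_favourites: dict
    # Indirect data: drinks observed in earlier sessions.
    past_orders: list


@dataclass
class Context:
    # Data about the user's situation, e.g. from an external weather service.
    temperature_c: float
    hour: int  # local time of day, 0-23 (unused by this simple rule)


def recommend_drink(profile, context, hot_day_threshold_c=25.0):
    """Pick a drink matching the weather, ranked by observed behaviour."""
    wanted = "cold" if context.temperature_c >= hot_day_threshold_c else "hot"
    candidates = [d for d, kind in profile.stated_favourites.items() if kind == wanted]
    if not candidates:
        # An unexpected use case: nothing on record fits the situation.
        return None
    # Prefer the candidate the user has actually ordered most often.
    return max(candidates, key=profile.past_orders.count)


profile = UserProfile(
    stated_favourites={"iced tea": "cold", "lemonade": "cold", "coffee": "hot"},
    past_orders=["lemonade", "coffee", "lemonade", "iced tea"],
)
hot_day_pick = recommend_drink(profile, Context(temperature_c=30.0, hour=14))
cold_day_pick = recommend_drink(profile, Context(temperature_c=5.0, hour=9))
print(hot_day_pick, cold_day_pick)
```

The `None` branch is the interesting part: it marks the point where the algorithm meets a situation its data does not cover, which is exactly where broader and more diverse training data helps.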

The qualified algorithm

ML algorithms deal with different inputs, from voices to words, static images and even video. To do this at scale requires interaction with lots of people who can share expressions, experiences and behaviours that can be used to instruct an algorithm. Brands can achieve this using the crowdsourcing and crowdtesting model, which is readily available and can be tailored to suit almost any data training scenario. You can select from a community of vetted testers that offers specific demographics, including gender, race, native language, location and any other filters that apply. You can be as general or as obscure as you like when it comes to sourcing training data. The whole process makes the data scientist’s job a lot easier, and it helps algorithms cope with life in the real world and whatever new variables they encounter along the way.
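In code terms, selecting contributors by demographic filters could look something like the Python sketch below. The `Tester` record, its fields and the `select_testers` helper are invented for illustration and do not represent any real crowdtesting platform's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tester:
    tester_id: str
    native_language: str
    location: str


def select_testers(testers, **filters):
    """Return the testers whose attributes match every filter given."""
    return [
        t for t in testers
        if all(getattr(t, field) == value for field, value in filters.items())
    ]


# A made-up community of three testers.
community = [
    Tester("t1", "English", "Glasgow"),
    Tester("t2", "Welsh", "Cardiff"),
    Tester("t3", "English", "Cardiff"),
]
cardiff_english = select_testers(community, native_language="English", location="Cardiff")
print([t.tester_id for t in cardiff_english])
```

Calling `select_testers` with no filters returns the whole community, which corresponds to the most general sourcing request; each added filter narrows the pool.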