  • Sergii Shelpuk

Lacking data for machine learning models? Build for the future data

The lack of data and data quality issues were among the top 3 CEO concerns for AI and machine learning adoption for at least seven years, from 2015 to 2022. Yet, although handy for AI product design, currently available data is not a critical requirement for developing an AI business, and its absence or imperfection is not a showstopper for developing deep learning products. What is crucial is what I call the "future data."

Some time ago, Vasyl Rakivnenko, a founder of the startup Alko Prevent, asked me to help shape his company's AI strategy. Alko Prevent helps trucking companies ensure their drivers' sobriety by checking for signs of intoxication with computer vision and machine learning. An energetic and visionary leader, Vasyl involved me in planning the company's AI efforts and coordinating with the engineering team.

The engineering team worked on the initial product version and machine learning system. The model had to distinguish sober from intoxicated faces, with an emphasis on quality: the system was only meaningful to customers above a certain accuracy level.

As often happens, the dataset was the most challenging part of the R&D work. The team suggested using a deep neural network-based classifier to sort all faces into sober and intoxicated. That solution required a large and balanced dataset in which the numbers of sober and intoxicated faces would be roughly equal. As there was no ready-to-use training data, the team planned to scrape Google for face images and then hand-label them into sober and intoxicated classes based on the visible signs.

My perspective was different. Vasyl wanted to build an AI company, meaning machine learning should be the core of its competitive advantage.

As Alko Prevent expands, the company will collect more and more data - driver selfies, the great majority of which will be sober (the whole world around us suggests that truck drivers are sober most of the time). Hence, the proprietary data collected by the company will be, firstly, very imbalanced, mainly containing sober faces, and, secondly, unlabeled. Although deep neural networks trained on labeled datasets in a supervised manner are indeed state-of-the-art in image classification, they will not be able to benefit from Alko Prevent's future data. A deep learning classifier might do the trick for the MVP, but the company will likely throw away the results of that expensive AI R&D down the road. Moreover, it does not create any competitive advantage: anyone can scrape Google images and build a similarly performing machine learning product.

Alko Prevent's team suggested building an AI feature but not an AI competitive advantage. AI features do not make AI companies.

AI products get better with new data, thus creating a powerful network effect that is nearly impossible to penetrate. Consider Google. Digital advertising is an enormous $600 billion market. Also, with today's cloud technologies and computer speeds, building a search engine is far easier than it was in 1998. We would expect such a market to be crowded, yet a few giants dominate it. The reason is that Google and other Internet companies turned their data into an almost impenetrable competitive advantage with AI. Google knows your search history, your clicks on search results, your Gmail emails and replies, your YouTube search and view history, and much more. Their algorithms automatically adjust to every user, thus getting better with time. If we imagine someone copying Google's entire codebase onto her own cloud servers, line by line, she would still be unable to compete with Google because she would lack all the historical data.

To follow Google's path, Alko Prevent should build a product that improves with time, turning the company's proprietary data into a competitive advantage. The future data will be a highly imbalanced set of selfies. Yet, there will be structure there: we expect every user to take similar photos in roughly similar environments, that is, a face photo taken from a short distance with the same camera. From this perspective, we can cast the problem differently: given a dataset of selfies of the same person, warn the trucking company if the driver looks significantly different today than usual.

Following this logic, I suggested using deep learning anomaly detection instead of classification. Not only does anomaly detection work with highly imbalanced and unlabeled datasets, but it can also flag anomalies of any kind, not only those caused by a specific type of intoxication such as alcohol. Also, the machine learning product improves with more data, eventually turning Alko Prevent's offering into a defensible business.
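To make the per-driver framing concrete, here is a minimal sketch. It assumes each selfie has already been converted into a numeric feature vector by some face-embedding model, which is not shown here and not specified in this post. The function names, the toy three-dimensional vectors, and the threshold of 3.0 are illustrative assumptions, not Alko Prevent's actual method. The idea is to score today's photo by how far it lies from the driver's own history, measured in units of that driver's typical spread.

```python
import math
from statistics import mean, pstdev


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [mean(component) for component in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def anomaly_score(history, today):
    """How unusual `today` is relative to the driver's own history.

    Returns the distance from the history centroid, expressed as a z-score
    of the distances the historical photos themselves have to that centroid.
    """
    center = centroid(history)
    past = [distance(v, center) for v in history]
    spread = pstdev(past) or 1e-9  # guard against division by zero
    return (distance(today, center) - mean(past)) / spread


def looks_unusual(history, today, threshold=3.0):
    """Hypothetical decision rule: flag photos far outside the usual range."""
    return anomaly_score(history, today) > threshold


# Toy embeddings for one driver (made-up numbers, for illustration only).
history = [
    [0.10, 0.20, 0.30],
    [0.12, 0.18, 0.31],
    [0.09, 0.22, 0.29],
    [0.11, 0.21, 0.30],
    [0.10, 0.19, 0.32],
]
typical_photo = [0.11, 0.20, 0.30]
unusual_photo = [0.60, 0.70, 0.10]
```

With these toy numbers, `looks_unusual(history, typical_photo)` is `False` and `looks_unusual(history, unusual_photo)` is `True`. The history needs no labels and can consist entirely of ordinary photos, and every new selfie extends the baseline, which is why this framing can benefit from the future data.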

Lack of data or dataset quality issues may indeed initially be a problem, but it is addressable with the future data. You need to design your product and user journeys around gathering the data required for your AI virtuous cycle. Ideally, your users cannot accomplish the journey without giving you a piece of data useful for your machine learning features.

An AI business encompasses the cycle:

  1. build a simple machine learning model;

  2. convince a few customers to start using it;

  3. get their data;

  4. retrain the machine learning model;

  5. convince a few more customers with an improved model;

  6. repeat.

When built right, an AI competitive advantage forces competitors to see your business through the following frame.

  1. To compete with your company, I need to build a better product than yours.

  2. To build a better product than yours, I need better machine learning models than yours.

  3. To train better models than yours, I need to get more data than you possess.

  4. To get more data than you, I need to get more customers than you have.

  5. To get more customers than you do, I need to have a better product than yours, which brings me back to step 1. Proven by contradiction.

AI strategy should also be tightly integrated with the product and business strategy - for AI companies, neither works without the others. Identifying the point of impact and focusing cross-functional efforts on achieving results is crucial. Creating the right AI strategy is complex work that must result in a simple outcome.

A good AI strategy is straightforward, easy to understand, communicate, execute, and measure at every product and business development stage. A thick, hard-to-comprehend slide deck flooded with fluffy buzzwords should be a sign of caution - it often means there is really no AI strategy there.

This blog is dedicated to AI competitive advantage, and we are doing our best to explain how it works and how you can build one for your product or company. You can check our other posts to get an extensive explanation of what the network effect is and how AI enables it, how to build an AI competitive advantage for your company, what culture helps you build the right AI products, what to avoid in your AI strategy and execution, and more.

If you need help in building an AI competitive advantage for your business, look no further. Our team of business experts and AI technology consultants has years of experience in helping technology companies like yours build sustainable competitive advantages through AI technology. From data collection to algorithm development, we can help you stay ahead of the competition and secure your market share for years to come.

Contact us today to learn more about our AI technology consulting offering.

If you want to keep posted on how to build a sustainable competitive advantage with AI technologies, please subscribe to our blog post updates below.

