The 5 Side Effects of Not Having a Big Data Strategy
Big data describes the volumes of data, both structured and unstructured, that your company generates every single day. Analysts at Gartner estimate that more than 80 percent of enterprise data is unstructured: text files from IT logs, emails and chat logs from customer support, employee complaints to HR, lengthy performance reviews, and business documents shared between departments. These diverse, scattered data sources exist in almost every enterprise.
A big data strategy, on the other hand, is a glorified term for how you'll collect, store, document, manage, and make the data accessible to the rest of the company. When companies don’t have a good data strategy, they spend enormous amounts of time just getting their data into a usable form when it's needed.
A big data strategy involves planning around how you collect, store, document, manage, and make the data accessible to the rest of the company.
But, you may be wondering, what’s this “Big Data” got to do with AI?
Everything!
Modern AI applications thrive on data. Depending on the problem, it can be your own structured or unstructured data.
In fact, according to IBM’s CEO, Arvind Krishna, data-related challenges are the top reason IBM clients have halted or canceled AI projects. Forrester Research also reports that data quality is among the biggest AI project challenges. This goes to show how critical data – or rather big data – is for AI.
A Support Ticket Routing Example
Let's take a machine learning model that automatically routes support tickets to the appropriate support agents. In order to build this model, you'd need a large volume of historical support tickets along with how each ticket was routed. Historical here means all the old, resolved tickets.
This historical routing data is then used to automatically learn patterns so that the machine learning model can make predictions on new incoming tickets.
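To make this concrete, here is a minimal sketch of what such a routing model might look like in scikit-learn. The file name, column names, and example ticket text are hypothetical placeholders, and the TF-IDF plus logistic regression pipeline is just one common baseline, not a prescribed approach.

```python
# Minimal sketch of a ticket-routing classifier.
# All file names, column names, and example text are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical export of resolved tickets: one row per ticket,
# with the ticket text and the team that ultimately handled it.
tickets = pd.read_csv("historical_tickets.csv")  # columns: ticket_text, assigned_team

X_train, X_test, y_train, y_test = train_test_split(
    tickets["ticket_text"], tickets["assigned_team"], test_size=0.2, random_state=42
)

# TF-IDF features + logistic regression: a simple, common baseline for text routing.
model = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print("Held-out accuracy:", model.score(X_test, y_test))
print("Suggested route:", model.predict(["I can't log in to my account"])[0])
```

Everything in this sketch hinges on the first line: the historical, labeled tickets have to exist somewhere your data scientists can actually reach.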
If this data is not stored or not accessible to your data scientists, you'll have to rely on external data sources, which may not be ideal. That's because AI applications don't need just any data; they need good data.
Alternatively, you can continue your manual way of routing until you’re able to perform some intentional data collection. But this can be a months-long setback.
As Krishna put it: “And so you run out of patience along the way, because you spend your first year just collecting and cleansing the data…And you say: ‘Hey, wait a moment, where’s the AI? I’m not getting the benefit.’ And you kind of bail on it.”
This problem happens all the time!
And it happens because companies generally don't have a data strategy, let alone a good one. Data is acquired and used on an ad hoc basis and for very specific purposes.
A recent survey of C-level executives representing companies like Ford Motor Company and Johnson & Johnson showed that over 50 percent of the companies were NOT treating their data as a business asset at all. What's more interesting is that the leaders admitted that technology isn't the problem; people and processes are.
Unfortunately, not having a data strategy can have unwanted side effects on AI initiatives. That's what we'll focus on in this article.
The 5 Side Effects of Not Having a Big Data Strategy
When it comes to AI development, companies typically struggle with three data problems: the data they need was never collected, the data exists but isn't accessible, or the data is accessible but of poor quality.
These problems tend to happen due to a lack of planning around how company data is managed and made accessible to the rest of the company—i.e., your data strategy. While these data problems may not completely halt your AI initiatives, they can have a negative impact on them in the following ways:
#1: Hampers exploratory analysis
Exploratory analysis of data can help determine what’s possible and what’s not with AI.
There are two ways companies start AI projects. The first is to simply explore the data and determine what's possible with it. The second is to start with a pain point and then determine if AI is the right approach, at which point your data scientists will have to determine whether the company data can support the initiative.
Either way, you need to have access to data and access to the right data to determine feasibility. A broken or non-existent data infrastructure will cripple this.
You need to have access to data and access to the right data to determine feasibility of AI projects
Exploratory analysis will also help surface potential issues in your data, such as data imbalance and sparsity, before you start a well-formed project. For more context, the steps in the figure below are some of the tasks that data scientists perform during exploratory data analysis.
What data scientists do during exploratory data analysis (EDA). Source: excelr.com
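As a rough illustration of the first of those steps, a few lines of pandas are often enough to get a feel for volume, missing values, and imbalance. The file and column names below are hypothetical and only assume the ticket data from the earlier example is accessible as a single table.

```python
# First-pass exploratory data analysis on the (hypothetical) ticket export.
import pandas as pd

tickets = pd.read_csv("historical_tickets.csv")

print(tickets.shape)                                           # how much data is there?
print(tickets.isna().mean())                                   # share of missing values per column
print(tickets["assigned_team"].value_counts(normalize=True))   # class imbalance across routing labels
print(tickets["ticket_text"].str.len().describe())             # sparsity: lots of very short tickets?
```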
#2: Stale predictions and recommendations
When companies don’t have a centralized data store with fresh data, they work around this by acquiring one-time data dumps. This is acceptable for development but can be harmful in practice.
That’s because the data used for development may not be reflective of the current reality.
For example, if you develop a product recommendation engine trained on customer data from 2018 to make recommendations in 2021, you may be in for a surprise. Customers may be shunning your recommendations because you've lost touch with their current tastes.
Due to COVID-19, customers may be more cost-sensitive or may prefer products with disinfecting properties. If you recommend only high-end products or natural products, customers will assume that you've lost touch with their tastes and start ignoring the recommendations. This is often referred to as “stale recommendations” or “stale predictions”.
Stale predictions refer to predictions “learned” from outdated historical data or data that does not reflect current reality
Having access to fresh data allows models to be retrained periodically so that their output remains of high quality. Models typically need to adapt because customer behavior, preferences, and business conditions change over time.
The effects of stale and fresh data on AI applications
A data strategy works towards preventing staleness by ensuring your data, old or new, can always be accessed and is ready for retraining models.
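One lightweight way to act on this, assuming fresh labeled data is actually accessible, is to score the deployed model on a recent window of data and retrain it when quality drops. The threshold and function below are illustrative, not a recommendation; they reuse the hypothetical ticket-routing model from earlier.

```python
# Sketch of a periodic staleness check and retrain (illustrative threshold).
import pandas as pd

ACCURACY_THRESHOLD = 0.85  # illustrative; set from your own baseline

def check_and_retrain(model, recent_tickets: pd.DataFrame):
    """Score the deployed model on recent labeled tickets; retrain if quality has dropped."""
    X_recent = recent_tickets["ticket_text"]
    y_recent = recent_tickets["assigned_team"]
    recent_accuracy = model.score(X_recent, y_recent)
    if recent_accuracy < ACCURACY_THRESHOLD:
        # The model no longer reflects current reality: refit it on the latest data.
        model.fit(X_recent, y_recent)
    return model, recent_accuracy
```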
#3: Low quality models
Low-quality models, that is, models with low accuracy, can make gross mistakes on prediction or recommendation tasks. For example, categorizing a support ticket as pertaining to a “login issue” when in fact it's related to “fraudulent account access” can have disastrous consequences. This is especially true if the issue is time-sensitive and relates to the health and safety of people.
In 2013, IBM partnered with The University of Texas MD Anderson Cancer Center to develop a new “Oncology Expert Advisor” system, a clinical decision support technology powered by IBM Watson. Unfortunately, Watson was giving incorrect and downright dangerous cancer treatment advice. Reports state that this happened because the AI was trained on a small number of hypothetical cancer patient records rather than real patient data, which resulted in inaccurate recommendations.
This is clearly a problem of data quality.
And data quality issues can be introduced by a broken data infrastructure, for example when data is siloed in individual departments or offices. This prevents data scientists and models from getting an accurate, holistic view of things.
Using the support ticket example, if your machine learning model is trained on data from a single satellite office that deals primarily with “login issues”, its knowledge of all other types of support issues will be limited. The end result is often a model that looks good on paper but is useless in practice.
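A quick cross-tabulation can expose this kind of blind spot before any model is trained. The office and category columns here are hypothetical additions to the earlier ticket data.

```python
# Check how ticket categories are distributed across offices (hypothetical columns).
import pandas as pd

tickets = pd.read_csv("historical_tickets.csv")  # columns include: office, category

coverage = pd.crosstab(tickets["office"], tickets["category"], normalize="index")
print(coverage.round(2))
# A row dominated by a single category (e.g. "login issue") is a warning sign:
# a model trained only on that office's tickets will know little about anything else.
```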
Data warehousing and integration of your diverse data sources can minimize this by bringing completeness to your data. It also ensures that your data is more easily accessible throughout the company.
#4: Brings bias to life
A broken data setup can introduce bias in your AI applications.
Let’s take facial recognition for example.
With facial recognition, you can identify or verify the identity of an individual using their face. A report released by NIST revealed that top facial recognition algorithms suffer from bias along several lines including race, gender, and age.
For example, some of the facial recognition systems misidentify Asian- and African-Americans far more often than Caucasians. The usual cause of such bias is the underlying data! It most probably lacked representation.
An MIT study found that a popular dataset used to train facial recognition systems was estimated to be ~78 percent male and ~84 percent white. There was very little representation of females and other races which may explain why many facial recognition systems have an ingrained bias in them.
When data scientists have access to limited or incomplete data, which is not a reflection of reality, it becomes difficult to ensure sufficient representation. This results in the data source itself becoming biased or skewed in a technical sense. And this effect is perpetuated through your machine learning models.
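Even a crude audit of how different groups are represented in the training data, along whatever attributes are relevant and available, can flag this skew early. The dataset file and column names below are purely illustrative.

```python
# Crude representation audit of a (hypothetical) training dataset.
import pandas as pd

faces = pd.read_csv("training_set_metadata.csv")  # hypothetical columns: gender, skin_type

for column in ["gender", "skin_type"]:
    print(f"\nShare of examples by {column}:")
    print(faces[column].value_counts(normalize=True).round(3))
# Heavily skewed shares (e.g. one group at ~84 percent) suggest the resulting model
# will perform worse on under-represented groups.
```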
Facial recognition algorithms are no different. The algorithms learn to identify a face after being shown millions of pictures of human faces. However, if the faces used to train the algorithm are predominantly of a certain race, the system will have a harder time recognizing anyone who doesn’t fit.
This is dangerous!
By ensuring that ALL your data is centralized and tightly integrated, you can ensure that your data is more complete and more representative of your customers, employees, products, and services. While this does not completely eliminate bias, this minimizes the possibility of it happening.
#5: Causes significant delays in AI initiatives
Finally, the fact that you don't have data to work with, or don't have access to the data, can be a serious setback for companies looking into the adoption of AI.
Every project you start may require jumping through hoops to get data just to assess the feasibility of the project. As you saw in the case of IBM, projects were canceled or stalled partly due to the lack of data. The problem gets worse when you've already hired talented data scientists, only to realize that they're unable to start projects or drive planned projects forward because of data issues.
If you’re looking to become more efficient and competitive in your industry, AI adoption is important. But, a data strategy is even more critical as it’s not just the foundation for AI, it’s also the foundation for all analytics and reporting capability in your organization.
Where do you think your data strategy is headed?
Not having a big data strategy can become costly in the long run. If you’re not treating your data as a business asset, you’re missing out on the opportunity to make good data-driven decisions and introduce automation with AI.
Projects may be indefinitely delayed, you may be making lousy, out-of-touch predictions, or you may be inadvertently introducing bias into your algorithms. All of this can have a negative impact on your customers and your business at large. But there's a solution to all of this.
If you don’t have a good data infrastructure in your company, the best place to start is to determine the gaps. This needs to be done in collaboration with a data warehousing or a data engineering team.
Some starting questions to answer as you're making plans to improve your data collection and management capabilities: What data do you collect today? Where is it stored, and how is it documented? Who can access it, and how easily?
A point worth making is that, without a data strategy, you can still embark on AI initiatives. However, these will be one-off projects, and you may end up with some of the problems outlined above. The good news is that you can always start AI initiatives while also investing in your data strategy.
Kavita Ganesan is the author of the forthcoming book, "The Business Case For AI", and is the founder of Opinosis Analytics, an AI consulting & training company. She advises executives and coaches teams across the organization to help them get value from AI.
[This article first appeared on www.opinosis-analytics.com]