5 Key Challenges In Today’s Era of Big Data

Guest Post
Big Data

Tamjid Aijazi
20 Jul, 2020

This article is a a guest post by Tamjid Aijazi from Makeen Technologies

Digital transformation will create trillions of dollars of value. While estimates vary, the World Economic Forum in 2016 estimated an increase in $100 trillion in global business and social value by 2030. Due to AI, PwC has estimated an increase of $15.7 trillion and McKinsey has estimated an increase of $13 trillion in annual global GDP by 2030.

We are currently in the middle of an AI renaissance, driven by big data and breakthroughs in machine learning and deep learning. These breakthroughs offer opportunities and challenges to companies depending on the speed at which they adapt to these changes.

Modern enterprises face 5 key challenges in today’s era of big data

Handling a multiplicity of enterprise source systems

The average Fortune 500 enterprise has a few hundred enterprise IT systems, all with their different data formats, mismatched references across data sources, and duplication

Incorporating and contextualising high frequency data

The challenge gets significantly harder with increase in sensoring, resulting inflows of real time data. For example, readings of the gas exhaust temperature for an offshore low-pressure compressor are only of limited value in of itself. But combined with ambient temperature, wind speed, compressor pump speed, history of previous maintenance actions, and maintenance logs, this real-time data can create a valuable alarm system for offshore rig operators.

Working with data lakes

Today, storing large amounts of disparate data by putting it all in one infrastructure location does not reduce data complexity any more than letting data sit in siloed enterprise systems.

Ensuring data consistency, referential integrity, and continuous downstream use

A fourth big data challenge is representing all existing data as a unified image, keeping this image updated in real-time and updating all downstream analytics that use these data. Data arrival rates vary by system, data formats from source systems change, and data arrive out of order due to networking delays.

Enabling new tools and skills for new needs

Enterprise IT and analytics teams need to provide tools that enable employees with different levels of data science proficiency to work with large data sets and perform predictive analytics using a unified data image.

Let’s look at what’s involved in developing and deploying AI applications at scale

Data assembly and preparation

The first step is to identify the required and relevant data sets and assemble them. There are often issues with data duplication, gaps in data, unavailable data and data out of sequence.

Feature engineering

This involves going through the data and crafting individual signals that the data scientists and domain experts think will be relevant to the problem being solved. In the case of AI-based predictive maintenance, signals could include the count of specific fault alarms over the trailing 7 days, 14 days and 21 days, the sum of the specific alarms over the same trailing periods; and the maximum value of certain sensor signals over those trailing periods.

Labelling the outcomes

This step involves labeling the outcomes the model tries to predict. For example, in AI-based predictive maintenance applications, source data sets rarely identify actual failure labels, and practitioners have to infer failure points based on a combination of factors such as fault codes and technician work orders.

Setting up the training data

For classification tasks, data scientists need to ensure that labels are appropriately balanced with positive and negative examples to provide the classifier algorithm enough balanced data. Data scientists also need to ensure the classifier is not biased with artificial patterns in the data.

Choosing and training the algorithm

Numerous algorithm libraries are available to data scientists today, created by companies, universities, research organizations, government agencies and individual contributors.

Deploying the algorithm into production

Machine learning algorithms, once deployed, need to receive new data, generate outputs, and have some actions or decisions be made based on those outputs. This may mean embedding the algorithm within an enterprise application used by humans to make decisions – for example, a predictive maintenance application that identifies and prioritizes equipment requiring maintenance to provide guidance for maintenance crews. This is where the real value is created – by reducing equipment downtime and servicing costs through more accurate failure prediction that enables proactive maintenance before the equipment actually fails. In order for the machine learning algorithms to operate in production, the underlying compute infrastructure needs to be set up and managed.

Close-loop continuous improvement

Algorithms typically require frequent retraining by data science teams. As market conditions change, business objects and processes evolve, and new data sources are identified. Organizations need to rapidly develop, retrain, and deploy new models as circumstances change.

Problems that have to be addressed to solve AI computing problems are nontrivial. Massively parallel elastic computing and storage capacity are prerequisites. In addition to the cloud, there is a multiplicity of data services necessary to develop, provision, and operate applications of this nature. However, the price of missing a transformational strategic shift is steep. The corporate graveyard is littered with once-great companies that failed to change.