Development of smart data pipelines

The potential of artificial intelligence (AI) and machine learning (ML) seems almost limitless in its ability to uncover and drive new sources of customer, product, service, operational, environmental, and societal value. If your organization intends to compete in the economy of the future, AI should be at the core of your business operations.

A Kearney study entitled “The Impact of Analytics in 2020” highlights the untapped profitability and business impact available to organizations that accelerate their investment in data science (AI/ML) and data management:

  • Explorers could improve profitability by 20% if they were as effective as Leaders
  • Followers could improve profitability by 55% if they were as effective as Leaders
  • Laggards could improve profitability by 81% if they were as effective as Leaders

The impact on business, work, and society could be staggering, except for one significant organizational challenge: data. No less a figure than the godfather of AI, Andrew Ng, has pointed to the data management obstacle that keeps organizations and society from realizing the potential of AI and ML:

“The model and code for many applications are basically a solved problem. Now that the models have advanced to a certain point, we need to make the data work as well.” – Andrew Ng

Data is the heart of training AI and ML models, and high-quality, trusted data, organized through highly efficient and scalable pipelines, is what allows AI to deliver these exciting business and operational results. Just as a healthy heart needs oxygen and a reliable blood flow, AI/ML engines depend on a constant flow of cleansed, accurate, enriched, and trusted data.

For example, one CIO has a team of 500 data engineers managing over 15,000 extract, transform, and load (ETL) jobs that are responsible for acquiring, moving, aggregating, standardizing, and aligning data across 100 special-purpose analytics repositories (data marts, data warehouses, data lakes, and data lakehouses). They perform these tasks against the organization’s operational and customer-facing systems under ridiculously stringent service level agreements (SLAs) to support their growing number of diverse data consumers. It seems Rube Goldberg must have become a data architect (Figure 1).

Figure 1: Rube Goldberg data architecture

This debilitating spaghetti architecture of one-off, special-purpose, static ETL programs for moving, cleansing, aligning, and transforming data severely hampers the “time to insight” organizations need to take full advantage of the unique economic characteristics of data, “the world’s most valuable resource” according to The Economist.

The emergence of smart data pipelines

The purpose of the data pipeline is to automate and scale common and recurring acquisition, transformation, movement, and integration tasks. A well-designed data pipeline strategy can speed up and automate the processing associated with collecting, cleaning, transforming, enriching, and moving data to downstream systems and applications. As data volume, diversity, and speed continue to grow, the need for data pipelines that can scale linearly in cloud and hybrid cloud environments is becoming increasingly critical to business.

A data pipeline refers to a set of data processing activities that integrate operational and business logic to perform the advanced extraction, transformation, and loading of data. A data pipeline can run on a schedule, run in real time (streaming), or be triggered by a predefined rule or set of conditions.
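To make the idea concrete, here is a minimal sketch of a data pipeline as an ordered chain of processing stages applied to each record. All names (the `Pipeline` class, the `clean` and `enrich` stages, the record fields) are illustrative, not part of any specific product:

```python
from dataclasses import dataclass, field
from typing import Callable

# A pipeline is just an ordered list of record-processing stages.
@dataclass
class Pipeline:
    stages: list = field(default_factory=list)

    def stage(self, fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        # Register a stage in execution order.
        self.stages.append(fn)
        return fn

    def run(self, record: dict) -> dict:
        # Pass each record through every stage in turn.
        for fn in self.stages:
            record = fn(record)
        return record

pipeline = Pipeline()

@pipeline.stage
def clean(record: dict) -> dict:
    # Standardize field names from the source system.
    return {k.lower().strip(): v for k, v in record.items()}

@pipeline.stage
def enrich(record: dict) -> dict:
    # Derive a field the downstream system needs.
    record["revenue_usd"] = record["units"] * record["unit_price"]
    return record

result = pipeline.run({" Units ": 3, "Unit_Price": 10})
```

The same `run` method could be invoked from a scheduler, a streaming consumer, or a rule-based trigger; the stage logic itself does not change.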

In addition, logic and algorithms can be built into a data pipeline to create a “smart” data pipeline. Smart data pipelines are reusable and extensible economic assets that can be tailored to specific source systems and perform the data transformations needed to serve the unique data and analytic requirements of the target system or application.

As machine learning and AutoML become more common, data pipelines will become increasingly intelligent. Data pipelines can move data between advanced enrichment and transformation modules, where neural network and machine learning algorithms can perform more sophisticated transformations and enrichment, such as segmentation, regression analysis, clustering, and the creation of advanced indices and propensity scores.
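As one hedged illustration of such an enrichment module, the stage below scores each customer record with a churn-propensity value using the weights of a hypothetical pre-trained logistic model. In a real smart pipeline the weights would come from a model registry rather than being hard-coded, and the feature names are invented for the example:

```python
import math

# Hypothetical weights from a pre-trained logistic model.
WEIGHTS = {"days_since_login": 0.08, "support_tickets": 0.35}
BIAS = -2.0

def add_propensity_score(record: dict) -> dict:
    # Linear combination of features, then a sigmoid to get a 0-1 score.
    z = BIAS + sum(w * record[feature] for feature, w in WEIGHTS.items())
    record["churn_propensity"] = round(1.0 / (1.0 + math.exp(-z)), 3)
    return record

scored = add_propensity_score({"days_since_login": 30, "support_tickets": 2})
```

Downstream systems then receive records already enriched with the score, instead of computing it ad hoc at query time.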

Finally, AI could be integrated into data pipelines so that they can continuously learn and adapt based on the source systems, on the required data transformations and enrichment, and on the evolving business and operational requirements of the target systems and applications.

For example, a smart healthcare data pipeline could analyze diagnosis-related group (DRG) codes to ensure the consistency and completeness of DRG submissions, and to detect fraud, as DRG data moves through the pipeline from source systems to analytic systems.

Realizing business value

Chief data officers and chief data analytics officers are challenged to unleash the business value of their data – to apply data to the business in ways that drive quantifiable financial impact.

The ability to deliver high-quality, trusted data to the right data consumer at the right time to enable more timely and accurate decisions will be a key differentiator for today’s data-rich companies. A Rube Goldberg system of ETL scripts and an assortment of special-purpose analytic repositories hinders an organization’s ability to achieve that goal.

Learn more about smart data pipelines in the Modern Enterprise Data Pipelines eBook from Dell Technologies.

This content was produced by Dell Technologies. It was not written by MIT Technology Review’s editorial staff.
