A Structured Approach to Data Preparation for Advanced Analytics

We were in the midst of a long-term advanced analytics project for a B2B SaaS company. The company provided a cloud-based product for marketers to run mass campaigns reaching the millions of visitors to their mobile apps and websites.

The USP of the product was to suggest personalized marketing ideas based on individual user profiles. 

The project was a terrific experience for Indium’s Digital Team, which was gearing up to solve a big data problem involving multiple tools.

But there was a complex problem right at Step No. 1, well before we could build out our data models and derive insights!

The problem revolved around what is now called Data Preparation.

We soon realized we were not alone. A 2017 study by Forrester Research found that 80% of the time spent on data projects goes into data preparation.

TDWI, an education and research organization for all things data, surveyed leading CIOs and found that more than 37 percent of participants were dissatisfied with their ability to easily find relevant data for business intelligence (BI) and analytics.

The survey participants suggested that “a self-service, automated approach to data preparation” was probably the only way forward.

In our case, where we were building data models on real-time data so marketers could take real-time action, the challenge was even greater: we not only had to prepare data for BI but had to do so in record time, processing 500 million messages per day.

Step 1: Process, Process, Process

CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is an industry-proven framework for guiding data mining efforts. Its data preparation phase is where you transform raw data into a form that can be used for analytics.

If you’re using Hadoop, for example, the MapReduce programming model is the most common approach: the framework splits your data and processes the pieces in parallel.

The Map step handles filtering and sorting, while the Reduce step performs a summary operation, as in the sketch below.
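To make that concrete, here is a minimal Java sketch of the two halves: a mapper that filters malformed records, and a reducer that sums events per user. The record layout (userId,eventType,timestamp) and class names are assumptions for illustration, not our production pipeline:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: filter out malformed records, emit (userId, 1) per event.
class EventCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text userId = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed record layout: userId,eventType,timestamp
        String[] fields = value.toString().split(",");
        if (fields.length < 3 || fields[0].isEmpty()) {
            return; // the filtering half of Map: drop bad rows
        }
        userId.set(fields[0]);
        context.write(userId, ONE);
    }
}

// Reduce step: the summary operation, totaling events per user.
class EventCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable count : counts) {
            total += count.get();
        }
        context.write(key, new LongWritable(total));
    }
}
```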

In today’s world, Data Chaos is a reality: your data sets will be filled with outliers and nulls. In our project involving this SaaS product, our first effort was to reduce the time taken for data transformation in the ETL process.

We reduced the processing time from 11 hours to 2 hours. The data required for real-time reporting was generated using Hive tables, which downstream jobs could then query like any SQL source (see the sketch below).
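As a hedged illustration of such a query through the HiveServer2 JDBC driver (the endpoint, database, table, and column names are all hypothetical, not the client’s schema):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveReportQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host and database are placeholders.
        String url = "jdbc:hive2://hive-host:10000/analytics";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT campaign_id, COUNT(*) AS events "
                     + "FROM user_events GROUP BY campaign_id")) {
            while (rs.next()) {
                System.out.println(rs.getString("campaign_id")
                        + "\t" + rs.getLong("events"));
            }
        }
    }
}
```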

Today, there are emerging methodologies built around tools like Trifacta, which can be used for visual data wrangling. The benefit of such a tool is that you can wrangle data both visually and statistically.

An added benefit is that the business owner of the data can contribute to the data preparation process in a hands-on manner.

Step 2: Ensure Data Security and Transparent Data Lineage

The benefits of data quality cannot be emphasized enough. The data preparation process must ensure that metadata from multiple sources is clearly defined, and that blending and cleansing are not done in an Excel sheet, where manual errors creep in easily.

Step 3: Standardize & Make Repeatable

A key part of the data preparation process is to ensure it is standardized. Data teams will do well to automate the process as much as possible, driving efficiency and reducing processing time.

It may also be a good idea to make it a “repeatable” process, so it becomes easier to deliver real-time analytics.
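One common way to make a prep pipeline repeatable is to capture it as a scheduled workflow. Since Oozie was part of our stack (see the tool list in Step 4), here is a minimal sketch that submits a hypothetical data-prep workflow through the Oozie Java client; the server URL, HDFS paths, and property values are placeholders:

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;

public class SubmitPrepWorkflow {
    public static void main(String[] args) throws Exception {
        // Oozie server URL is a placeholder.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // The workflow.xml at this HDFS path would describe the standardized
        // prep steps; paths and hosts below are assumptions for the sketch.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/apps/data-prep");
        conf.setProperty("nameNode", "hdfs://namenode");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        String jobId = oozie.run(conf); // submit and start the workflow
        System.out.println("Submitted workflow: " + jobId);
    }
}
```

Once the workflow is versioned and parameterized this way, every run is identical, which is exactly what a repeatable, real-time-friendly process needs.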

Step 4: Collaboration between Business & Tech

At Indium Software, we believe the future of data preparation will revolve around collaboration between business and technology teams.

A standard workflow with repeatable methods will go a long way toward reducing the complexity of data preparation.

In the real-time analytics project we did for the B2B SaaS client, we used the following tools to deliver predictive models:

  • Hadoop
  • Oozie
  • Solr
  • Hive
  • HDFS
  • HBase
  • Phoenix

A combination of the Hadoop Distributed File System (HDFS) and Apache Phoenix loaded real-time data into HBase.
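Phoenix exposes HBase through a standard JDBC driver, with writes expressed as UPSERT statements. A minimal sketch of such a load, using a hypothetical table and schema rather than the client’s actual one, looks like this:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PhoenixLoader {
    public static void main(String[] args) throws Exception {
        // Phoenix connects via the HBase cluster's ZooKeeper quorum (placeholder host).
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
             PreparedStatement ps = conn.prepareStatement(
                     // Hypothetical table; Phoenix maps it onto an HBase table underneath.
                     "UPSERT INTO USER_EVENTS (USER_ID, EVENT_TYPE, EVENT_TIME) "
                     + "VALUES (?, ?, ?)")) {
            ps.setString(1, "user-123");
            ps.setString(2, "click");
            ps.setLong(3, System.currentTimeMillis());
            ps.executeUpdate();
            conn.commit(); // Phoenix buffers mutations client-side until commit
        }
    }
}
```

Because Phoenix buffers mutations until commit, batching many upserts per commit is what keeps high-volume, near-real-time loads efficient.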

HBase, which is modelled after Google’s Bigtable, delivered real-time reports.
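On the read side, a reporting job can scan the underlying HBase table directly through the HBase Java client. Again a minimal sketch, with an assumed table name and column family:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RealtimeReportScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             // Table and column family names are assumptions for the sketch.
             Table table = conn.getTable(TableName.valueOf("USER_EVENTS"))) {
            Scan scan = new Scan().addFamily(Bytes.toBytes("e"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                int rows = 0;
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                    if (++rows >= 100) break; // sample a slice for the report
                }
            }
        }
    }
}
```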

But the key to it all was our data preparation process. Reducing the ETL time from 11 hours to 2 laid the foundation for the entire project.