In one of Ntegra’s previous article about Data Management Best Practices, we asserted: “Your information assets are your business, not just the symptoms of your business”. The data that codifies these assets has become more varied and unstructured, demanding more significant effort to gather, cleanse, enrich and prepare for analysis. This overhead is at odds with demands to utilise data to inform business decisions rapidly.

Business users are not content to wait for technical resources to prepare data

and are increasingly demanding a self-service model for wrangling their data. This move towards a more democratised, self-service data preparation capability has seen new vendors entering the market to support end-users who wish to realise the value of their information assets fully.

The term, “data wrangling” was originally coined in the early 1990s and was in intermittent use until 2001 when CNN recruited “a dozen data wranglers” to find information to accompany news stories. Since then it’s become an increasingly popular jargon term covering anything end-users undertake to do with data that’s not IT-led. Data wrangling might include format transformation, aggregation, statistical model training, visualisation and other activities depending on what’s considered appropriate within the business context.

Processes tend to follow a generic set of steps:

  1. extracting data from raw sources;
  2. collating, masking, sorting, filtering, combining, aggregating, and parsing;
  3. loading data into target structures for future use.

So, what’s the difference between data wrangling and traditional ETL?

Data wrangling vs. ETL

Extract, transform and load (ETL) technologies sit at the core of engineered IT data-pipelines. Requirements from the business are passed to an IT-team to implement as flows using ETL tools. When delivered, the ETL provides data in the required format to target systems which end-users can exploit with analytics, reporting and visualisation tools. ETL technologies focus on extracting, transforming and loading data into a data warehouse, a use-case to which they are well suited.

Data wrangling technologies, on the other hand, support business users such as analysts, managers and other line of business users who understand the data and are, therefore, best suited to explore, prepare and exploit it.

To create a solution that enables intuitive data wrangling by non-technical business users requires a tremendous amount of design, engineering and development.

Before these solutions were made available, business users might resort to spreadsheets to interact with their data and cottage industries of ‘Shadow IT’ sprang-up to source the necessary data, leading to unmanaged and unrepeatable operations.

Excel won’t cut it

As a tool for manipulating data, spreadsheets do not scale. The Big Data trinity (volume, variety and velocity) made data wrangling tools a necessity for analysts to understand, clean, and organise their data. Today’s analysts are working with increasing volumes and varieties of data that cannot be shoe-horned into Excel. Data wrangling solutions are built to handle the 3Vs of the Big Data scale, throughput and performance. For example, elastic scaling in the cloud enables solutions to adapt to different workloads, dynamically allowing for right-size data volumes.

ETL solutions are built to handle well-structured data from operational systems and database sources, mapping and maintaining structure and lineage through to the MI and BI systems they underpin. Traditional ETL solutions struggle with unstructured data and innovative structural derivations, and increasingly, analysts want to play in environments where the plan or schema is undefined. In this case, the analyst must understand the data’s value and generate the schema needed to undertake the analysis.

Analysts waste 75% effort

In many cases, source data quality is poor and unreliable, resulting in up to 75% of an analyst’s effort being spent finding, cleaning and preparing data before any analysis or modelling can start. The cottage-industry approach and use of desktop tools like Excel cannot handle current data volumes and don’t ensure governance, lineage and security.

Data wrangling software vendors such as Trifacta and Paxata offer complete, end to end solutions that enable business users and analysts to collect and prepare data from multiple, disparate sources across their organisations, at scale with governance, lineage, management and security built-in.

Tool characteristics

Data wrangling tools provide intuitive spreadsheet-like and visual user experiences to enable business users to interact with data in real time. They guide users who wish to explore, clean, normalise, concatenate and join data using simple mouse-clicks. Solutions provide data profiling, anomaly detection, reporting max/min/mean/median, outliers and extents, as you go. Built-in algorithms and machine learning accelerate data preparation and help business users to understand and incorporate data from unknown sources. Embedded algorithms profile data and provide recommendations for normalisation, cleansing and filtering.

Solutions record every action the user undertakes on data as well as metadata to provide lifecycle lineage to understand and repeat wrangling processes. They provide embedded catalogues for storing curated datasets and search allows users to explore it to identify the correct data for any activity. Fine-grained security enables IT organisations to limit access to sensitive data.

Conclusion

Data wrangling is a critical part of any data related project. Business users and analysts need to ensure data is clean, complete and well-conditioned for convenient querying before they analyse it. This is also essential preparation for data to be used in machine learning and AI experiments, and

never has the adage “Garbage in, garbage out!” been more pertinent.

Data wrangling and ETL solutions are not mutually exclusive capabilities. Ntegra believes they should be deployed as complementary capabilities of an organisation’s data platform. Neither provides the answers that analysts and business users are looking for, but both are essential tools in a data-driven business.