IDG Contributor Network: Choosing the best data solution for your analytics needs

Analytics has always been difficult, and as data sources expand and proliferate, the challenge grows. Despite advancements in technology, including a broad range of available databases, data preparation tools, and data visualization applications, companies continue to struggle with analytics.

Part of the problem may be the impressive power of the technology itself. The allure is so strong that most companies choose to begin their analytics journey with a tool evaluation. After a demonstration of their capabilities, it’s easy to conclude that these tools are the solution. But the truth is they are only part of the solution, and arguably not the most important part.

Starting out by purchasing tools is like trying to build a house without a foundation and a blueprint. You might have the materials you need, but you can’t easily put them to good use.

Data quality is the foundation for successful analytics

Reporting is an attempt to turn data into useful information. As such, the structure, quality and availability of the underlying data have a direct impact on reporting outcomes. It may sound obvious, but this simple truth explains why analytics efforts that start by tackling the issue of data quality enjoy greater success.

Companies commonly encounter a variety of data challenges. For one, reporting queries often require huge amounts of data. When querying normalized OLTP databases, this means joining many tables with complex logic that can be difficult to write, debug and maintain. A well-designed data solution can tackle this work so report authors and data consumers don’t have to.
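To make that concrete, here is a minimal sketch in Python with pandas, using hypothetical tables and column names, of the kind of multi-table join a data solution can perform once, centrally, so report authors can query a single flat table instead of rebuilding the logic in every report.

```python
import pandas as pd

# Hypothetical extracts from a normalized OLTP system
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "customer_name": ["Acme Corp", "Globex"],
    "region_id": [10, 20],
})
regions = pd.DataFrame({"region_id": [10, 20], "region_name": ["East", "West"]})
orders = pd.DataFrame({
    "order_id": [100, 101, 102],
    "customer_id": [1, 1, 2],
    "order_date": pd.to_datetime(["2019-01-05", "2019-02-11", "2019-02-20"]),
    "amount": [1200.0, 450.0, 980.0],
})

# Do the joins once, centrally, and publish a flat, report-ready table
report = (
    orders
    .merge(customers, on="customer_id", how="left")
    .merge(regions, on="region_id", how="left")
    [["order_id", "order_date", "customer_name", "region_name", "amount"]]
)
print(report)
```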

The data itself may also have shortcomings that can be addressed. In some cases, source systems are missing data that, with the proper knowledge, can be derived to improve reporting quality and insights. When missing data can’t be resolved in the system of record, a data solution offers a way to apply business rules, and to infer and store new data values.
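As a rough illustration, the sketch below applies a hypothetical business rule, that orders of 1,000 or more belong to a “Key account” segment, to infer a value the source system never captured. The rule, tables and column names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical order extract; the source never captured a segment for some orders
orders = pd.DataFrame({
    "order_id": [100, 101, 102],
    "amount": [1200.0, 450.0, 980.0],
    "segment": [None, "Standard", None],
})

# Assumed business rule: orders of 1,000 or more belong to the "Key account" segment
missing = orders["segment"].isna()
orders.loc[missing & (orders["amount"] >= 1000), "segment"] = "Key account"
orders.loc[missing & (orders["amount"] < 1000), "segment"] = "Standard"
print(orders)
```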

Similarly, source systems may contain invalid data that must be filtered out to produce meaningful reports. This requires tools, technical skills, and a detailed understanding of how the data relates to business processes. Here again, a data solution can automate the process of data transformation, correcting and filtering data as it is ingested so all users have access to clean, reliable data.
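A minimal cleansing step of that kind might look like the following pandas sketch; the specific defects (a mistyped status code, a missing key, a negative amount) and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical raw extract containing the kinds of defects that distort reports
raw = pd.DataFrame({
    "order_id": [100, 101, None, 103],
    "amount": [1200.0, -5.0, 450.0, 980.0],
    "status": ["SHIPPED", "SHIPPED", "SHPPED", "CANCELLED"],
})

cleaned = (
    raw
    .assign(status=raw["status"].replace({"SHPPED": "SHIPPED"}))  # correct a known bad code
    .dropna(subset=["order_id"])                                  # rows missing their key are unusable
    .query("amount >= 0")                                         # negative amounts are invalid here
)
print(cleaned)
```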

Another common problem is the periodic archiving of data from systems of record. Poor access to historical data limits the range and effectiveness of trend reporting and period-over-period reporting, leaving the business without critical KPIs needed for effective management. Data solutions can easily include a persistent data store that solves this problem.
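For example, if the data solution simply retains a monthly snapshot that the source system would otherwise archive, trend and period-over-period measures become trivial to compute. The figures and column names below are illustrative only.

```python
import pandas as pd

# Hypothetical monthly snapshots retained by the data solution even after the
# source system archives or purges the older records
history = pd.DataFrame({
    "month": pd.PeriodIndex(["2019-01", "2019-02", "2019-03"], freq="M"),
    "revenue": [42000.0, 45500.0, 44100.0],
})

# Period-over-period reporting becomes a simple calculation over the retained history
history["revenue_mom_change"] = history["revenue"].pct_change()
print(history)
```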

Perhaps the most significant challenge addressed by data solutions is the need to integrate data from multiple systems and sources. A well-designed data solution can map fields, modify data types and apply sophisticated logic to combine data from multiple sources according to a standard set of rules to produce a reliable, 360-degree view of the business.
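The sketch below shows the basic moves, mapping each source’s field names onto one standard set, converting data types, and combining the results, using two hypothetical customer extracts. A production solution would layer matching and deduplication rules on top of this.

```python
import pandas as pd

# Hypothetical customer extracts from two different systems of record
crm = pd.DataFrame({"CustID": ["001", "002"],
                    "CustName": ["Acme Corp", "Globex"],
                    "AnnualRevenue": ["1200000", "560000"]})
billing = pd.DataFrame({"customer_number": [3, 4],
                        "name": ["Initech", "Umbrella"],
                        "annual_revenue": [310000.0, 990000.0]})

# Map each source's fields onto one standard set of names and types
crm_std = (crm.rename(columns={"CustID": "customer_id",
                               "CustName": "customer_name",
                               "AnnualRevenue": "annual_revenue"})
              .astype({"customer_id": "int64", "annual_revenue": "float64"}))
billing_std = billing.rename(columns={"customer_number": "customer_id",
                                      "name": "customer_name"})

# One combined, consistently typed view of customers across both systems
customers_360 = pd.concat([crm_std, billing_std], ignore_index=True)
print(customers_360)
```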

No two data solutions are alike

Data solutions can take many forms, depending on the goals and resources of the organization. At the simple end of the spectrum, a data solution may be nothing more than a central repository of raw data. At the more complex end, the data in the repository is “report-ready,” meaning it can be consumed using any reporting tool and will be easily understood and manipulated by business users.

The process of designing a data solution begins with a clear understanding of reporting requirements and a thorough analysis of the available data. Only by identifying the data gap—the difference between what is available and what is needed—can you determine the best tool or technique to fill it. The recommended data solution will typically combine several of the tools described below.

One such tool is the persistent staging area. It can be used to insulate the reporting solution from a source system that is in flux, unstable or periodically archived. By introducing this intermediate database, you also get the opportunity to do some limited data processing as data moves between the system of record and the reporting tool. A simple staging area takes computing pressure off the operational system, and potentially reduces the demands on the reporting tool and the report author. More important, it also allows you to track changes to attributes such as customer, product and organization over time, and enables data renaming to improve usability.
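A very simplified version of that change-tracking idea, stamping each load with a date so attribute changes are preserved rather than overwritten, and renaming cryptic source columns along the way, might look like the following sketch; the function, tables and column names are hypothetical.

```python
import pandas as pd

def stage_customers(extract: pd.DataFrame, staging: pd.DataFrame, load_date: str) -> pd.DataFrame:
    """Append an extract to the persistent staging area, stamped with its load date,
    so changes to attributes are kept over time rather than overwritten.
    Cryptic source column names are also translated into friendlier terms."""
    snapshot = (extract.rename(columns={"cust_nm": "customer_name", "seg_cd": "segment"})
                       .assign(load_date=pd.Timestamp(load_date)))
    if staging is None or staging.empty:
        return snapshot
    return pd.concat([staging, snapshot], ignore_index=True)

# Hypothetical extracts from the system of record on two different dates
day1 = pd.DataFrame({"customer_id": [1], "cust_nm": ["Acme Corp"], "seg_cd": ["SMB"]})
day2 = pd.DataFrame({"customer_id": [1], "cust_nm": ["Acme Corp"], "seg_cd": ["Enterprise"]})

staging = stage_customers(day1, pd.DataFrame(), "2019-03-01")
staging = stage_customers(day2, staging, "2019-04-01")  # the segment change is kept as history
print(staging)
```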

The data lake, like persistent staging, alleviates the computing pressure caused by running queries against source systems. Its primary focus, however, is not to track record changes over time but to house large amounts of data from a variety of sources for analysis. Since it stops short of making material changes to the data, all the underlying data quality issues remain in the repository. As a result, skilled data scientists are required to analyze it, and any cleansing and data transformations must be repeated frequently across the organization.

An operational data store can be used to combine data from more than one system of record, incorporate data transformation based on business rules, and/or include summarized data. The operational data store introduces significant data processing between the systems of record and the reporting tool, and the result is a single data repository that can be used for most types of analysis, yielding consistency across the organization. One disadvantage, however, is that it typically stores data in its native, highly normalized table structure. So as the volume of data and the complexity of the data processing increase, the performance of the operational data store may suffer.

The enterprise data warehouse overcomes this challenge, and dramatically increases usability at the same time. The defining characteristic of the data warehouse is that the data is reorganized into a dimensional data model. Developed specifically for reporting applications, the dimensional model dramatically improves query response time by reducing the number of tables and joins the query engine must traverse. When done correctly it also presents data to users in a structure that is intuitive, and easy to navigate and interpret, enabling self-service analysis by business users.
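As a small illustration of why the dimensional model is easier to query, the sketch below answers a typical business question against a hypothetical star schema with just two joins and a group-by; in the normalized source, the same question could involve many more tables.

```python
import pandas as pd

# A minimal hypothetical star schema: one fact table surrounded by descriptive dimensions
dim_date = pd.DataFrame({"date_key": [20190101, 20190201],
                         "month": ["2019-01", "2019-02"]})
dim_product = pd.DataFrame({"product_key": [1, 2],
                            "product_name": ["Widget", "Gadget"],
                            "category": ["Hardware", "Accessories"]})
fact_sales = pd.DataFrame({"date_key": [20190101, 20190101, 20190201],
                           "product_key": [1, 2, 1],
                           "sales_amount": [500.0, 300.0, 750.0]})

# A typical business question needs only two joins and a group-by
monthly_by_category = (fact_sales
                       .merge(dim_date, on="date_key")
                       .merge(dim_product, on="product_key")
                       .groupby(["month", "category"], as_index=False)["sales_amount"]
                       .sum())
print(monthly_by_category)
```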

Last, a semantic layer can be used to implement user-friendly field names and insulate report authoring and reports from many ongoing changes to the underlying data, either in the source systems or the data repository. Semantic constructs like OLAP cubes can also be useful for aggregating and summarizing data prior to report runtime to provide further performance gains and even easier report creation.
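Purely as an analogy, the pandas sketch below mimics two of those ideas, renaming physical columns to business-friendly terms and pre-aggregating a month-by-category summary ahead of report runtime; a real semantic layer or OLAP cube would do this inside the BI platform rather than in application code, and all names and figures here are hypothetical.

```python
import pandas as pd

# Hypothetical report-ready detail produced by the warehouse
detail = pd.DataFrame({
    "fiscal_month": ["2019-01", "2019-01", "2019-02"],
    "prod_cat": ["Hardware", "Accessories", "Hardware"],
    "sales_amt": [800.0, 300.0, 750.0],
})

# A thin semantic mapping: report authors see business terms, not physical column names
friendly = detail.rename(columns={"fiscal_month": "Fiscal Month",
                                  "prod_cat": "Product Category",
                                  "sales_amt": "Sales Amount"})

# Pre-aggregate ahead of report runtime, much as an OLAP cube would, so reports
# read a small summary instead of scanning detail rows
summary = friendly.pivot_table(index="Fiscal Month", columns="Product Category",
                               values="Sales Amount", aggfunc="sum")
print(summary)
```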

Your choice of data solution helps define your overall solution architecture as well as the specific needs that must be filled by data wrangling and visualization tools. Will you need to automate data transformation? Will you be trying to integrate and transform data within the reporting tool? You can’t select the best tools for the job until you understand precisely the job to be done by each tool. So, regardless of which data solution you decide on, the important thing is to define your needs and design an appropriate foundation early in the journey.

This article is published as part of the IDG Contributor Network.
