4 reasons big data projects fail—and 4 ways to succeed

Nearly all big data projects end up in failure, despite all the mature technology available. Here's how to make big data efforts actually succeed

Big data projects are, well, big in size and scope, often very ambitious, and all too often, complete failures. In 2016, Gartner estimated that 60 percent of big data projects failed. A year later, Gartner analyst Nick Heudecker said his company was "too conservative" with its 60 percent estimate and put the failure rate at closer to 85 percent. Today, he says nothing has changed.

Gartner isn’t alone in that assessment. Long-time Microsoft executive and (until recently) Snowflake Computing CEO Bob Muglia told the analytics site Datanami, “I can’t find a happy Hadoop customer. It’s sort of as simple as that. … The number of customers who have actually successfully tamed Hadoop is probably fewer than 20 and it might be fewer than ten. That’s just nuts given how long that product, that technology has been in the market, and how much general industry energy has gone into it.” Hadoop, of course, is the engine that launched the big data mania.

Other people familiar with big data also say the problem remains real, severe, and not entirely one of technology. In fact, technology is a minor cause of failure relative to the real culprits. Here are the four key reasons that big data projects fail—and four key ways in which you can succeed.

Big data problem No. 1: Poor integration

Heudecker said there is one major technological problem behind big data failures, and that is integrating siloed data from multiple sources to get the insights companies want. Building connections to siloed, legacy systems is simply not easy. Integration costs run five to ten times the cost of the software itself, he said. “The biggest problem is simple integration: How do you link multiple data sources together to get some sort of outcome? A lot go the data lake route and think if I link everything to something magic will happen. That’s not the case,” he said.

Siloed data is part of the problem. Clients have told him they pulled data from systems of record into a common environment like a data lake and couldn’t figure out what the values meant. “When you pull data into a data lake, how do you know what that number 3 means?” Heudecker asked.

Because they're working in silos or creating data lakes that are just data swamps, they're just scratching the surface of what they could accomplish, said Alan Morrison, a senior research fellow with PwC. “They don't understand all the relationships in data that need to be mined or inferred and made explicit so machines can adequately interpret that data. They need to create a knowledge graph layer so that machines can interpret all the instance data that's mapped underneath. Otherwise, you've just got a data lake that's a data swamp,” he said.

Big data problem No. 2: Undefined goals

You would think most people undertaking a big data project would actually have a goal in mind, but a surprising number don’t. They just launch the project with the goal as an afterthought.

“You have to scope the problem well. People think they can connect structured and unstructured data and get the insight you need. You have to define the problem well up front. What’s the insight you want to get? It’s having a clear definition of the problem and defining it well up front,” said Ray Christopher, product marketing manager with Talend, a data-integration software company.

Joshua Greenbaum, a principal analyst at Enterprise Application Consulting, said part of what has bedeviled both big data and data warehousing projects is that the main guiding criterion is typically the accumulation of large amounts of data, not the solving of a discrete business problem.

“If you pull together large amounts of data you get a data dump. I call it a sanitary landfill. Dumps are not a good place to find solutions,” Greenbaum said. “I always tell clients decide what discrete business problem needs to be solved first and go with that, and then look at quality of data available and solve the data problem once the business problem has been identified.”

“Why do most big data projects fail? For starters, most big data project leaders lack vision,” said PwC’s Morrison. “Enterprises are confused about big data. Most just think about numerical data or black-box NLP and recognition engines that do simple text mining and other kinds of pattern recognition.”

Big data problem No. 3: The skills gap

Too often, companies think the in-house skills they have built for data warehousing will translate to big data, when that is clearly not the case. For starters, data warehousing and big data handle data in opposite fashions: Data warehousing uses schema on write, which means the data is cleaned, processed, structured, and organized before it ever goes into the data warehouse.

In big data, data is accumulated first and schema on read is applied, meaning the data is processed as it is read. Because the data processing flow is reversed between the two methodologies, you can bet the skills and tools are different as well. And that’s just one example.
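
To make the schema-on-write versus schema-on-read contrast concrete, here is a minimal Python sketch. The record fields (order_id, amount, currency) and both functions are hypothetical illustrations, not any particular warehouse or data lake product.

```python
import json

# Schema on write (data warehouse style): validate and structure each
# record before it is stored, so that every read sees clean, typed data.
def store_order_schema_on_write(raw: dict, table: list) -> None:
    record = {
        "order_id": int(raw["order_id"]),          # types enforced up front
        "amount": float(raw["amount"]),
        "currency": str(raw.get("currency", "USD")),
    }
    table.append(record)

# Schema on read (data lake style): store the raw payload untouched and
# impose structure only at query time, when the question is known.
def total_amount_schema_on_read(raw_rows: list) -> float:
    total = 0.0
    for line in raw_rows:
        payload = json.loads(line)                 # interpret while reading
        total += float(payload.get("amount", 0))
    return total

warehouse = []
store_order_schema_on_write({"order_id": "7", "amount": "19.99"}, warehouse)

lake = ['{"order_id": 7, "amount": 19.99}', '{"order_id": 8, "amount": 5.00}']
print(warehouse, total_amount_schema_on_read(lake))
```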

“Skills are always going to be a challenge. If we’re talking about big data 30 years from now, there will still be a challenge,” Heudecker said. “A lot of people hang their hat on Hadoop. My clients are challenged finding Hadoop resources. Spark is a little better because that stack is smaller and easier to train up. Hadoop is dozens of software components.”

Big data problem No. 4: The tech generation gap

Big data projects frequently take data from older data silos and try to merge it with new data sources, like sensors or web traffic or social media. That’s not entirely the fault of the enterprise, which collected that data before the idea of big data analytics existed, but it is a problem nonetheless.

“Almost the biggest skill missing is the skill to understand how to blend these two stakeholders to get them to work together to solve complex problems,” consultant Greenbaum said. “Data silos can be a barrier to big data projects because there is no standard anything. So when they start to look at planning, they find these systems have not been implemented in any fashion that allows this data to be reused,” he said.

“With different architectures you need to do processing differently,” said Talend’s Christopher. “Tech skills and architecture differences were a common reason why you can’t take current tools for an on-premises data warehouse and integrate them with a big data project—because those technologies will become too costly to process new data. So you need Hadoop and Spark, and you need to learn new languages.”

Big data solution No. 1: Plan ahead

It’s an old cliché but applicable here: If you fail to plan, plan to fail. “Successful companies are the ones who have an outcome,” Gartner’s Heudecker said. “Pick something small and achievable and new. Don’t take a legacy use case, because you get limitations.”

“They need to think about the data first, and model their organizations in a machine-readable way so the data serves that organization,” PwC’s Morrison said.

Big data solution No. 2: Work Together

All too often, stakeholders are left out of big data projects—the very people who would use the results. If all of the stakeholders collaborate, they can overcome many roadblocks, Heudecker said. “If the skilled people are working together and working with the business side to deliver actionable outcome, that can help,” he said.

Heudecker noted that the companies succeeding in big data invest heavily in the necessary skills. He sees this the most in data-driven companies, like financial services, Uber, Lyft, and Netflix, where the company’s fortune is based on having good, actionable data.

“Make it a team sport to help curate and collect data and cleanse it. Doing that can increase the integrity of the data as well,” Talend’s Christopher said.

Big data solution No. 3: Focus

People seem to have the mindset that a big data project needs to be massive and ambitious. Like anything you are learning for the first time, the best way to succeed is to start small then gradually expand in ambition and scope.

“They should very narrowly define what they are doing,” Heudecker said. “They should pick a problem domain and own it, like fraud detection, micro-segmenting customers, or figuring out what new product to introduce in a Millennial marketplace.”

“At the end of the day, you have to ask what insight you want or which business process is to be digitized,” said Christopher. “You don’t just throw technology at a business problem; you have to define it up front. The data lake is a necessity, but you don’t want to collect data if it’s not going to be used by anyone in the business.”

In many cases, that also means not overinflating your own company. “In every company I’ve ever studied, there are only a few hundred key concepts and relationships that the entire business runs on. Once you understand that, you realize all of these millions of distinctions are just slight variations of those few hundred important things,” PwC’s Morrison said. “In fact, you discover that many of the slight variations aren’t variations at all. They’re really the same things with different names, different structures, or different labels,” he added.

Big data solution No. 4: Jettison the legacy

While you may want to use those terabytes of data collected and stored in your data warehouse, the fact is you might be better served just focusing on newly gathered data in storage systems designed for big data and designed to be unsiloed.

“I would definitely advise not necessarily being beholden to an existing technology infrastructure just because your company has a license for it,” consultant Greenbaum said. “Often, new complex problems may require new complex solutions. Falling back on old tools that have been around the corporation for a decade isn’t the right way to go. Many companies use old tools, and it kills the project.”

Morrison noted, “Enterprises need to stop getting their feet tangled in their own underwear and just jettison the legacy architecture that creates more silos.” He also said they need to stop expecting vendors to solve their complex system problems for them. “For decades, many seem to assume they can buy their way out of a big data problem. Any big data problem is a systemic problem. When it comes to any complex systems change, you have to build your way out,” he said.

The Data Science Team:

Displaying the different competencies of data analytics experts

 

Introduction

Data science, data analytics, and all those “data” terms have ambiguous and overlapping definitions. On top of that, job postings from many organizations will lump a half dozen data skills together for any and all job titles. These are often unrealistic. This post is my attempt to sort through the BS and assign proper categories for each “data expert”. They are primarily influenced by job postings at Google, Amazon, Apple, and Facebook. If anyone should know the proper definitions for these roles, it’s the Tech giants.

These categorizations will help you understand the strengths and weaknesses of each individual and what is the right question to ask each person. It will also help any aspiring data analytics professionals who want to understand which job path makes the most sense for them. These positions will frequently overlap and each individual is unique, but this should serve as a guide that is directionally correct in defining these roles.

Defining the Data Experts

There are dozens of different job roles for people with data expertise; however, most positions tend to match up closely with the following ten:

 

Scoring the Data Experts

Each skill is rated on a scale of 1–5 (5 being the best). A 1 does not mean this person is unskilled in this area. These scores are relative to the other data analytics roles within the organization. Let’s establish the six skill sets for which we graded each role:

1.) Domain Expertise- Understanding the insights and analyses’ impact on the business.

2.) Data Manipulation- The ability to write clean code or configure a tool that can ingest, manipulate, analyze, model, and visualize data. Depth over breadth is more important because technical skills easily translate between tools.

3.) Communication Skills- The ability to explain complicated technical concepts to a non-technical audience.

4.) Managerial Competence- Requires emotional intelligence, strong social skills, and the ability to communicate clear responsibilities to each member of the team. Must be a leader who can manage direct reports.

5.) Databases- Understanding systems that store data, including data model designs, databases, data warehouses, and data pipelines. This includes knowledge of SQL and NoSQL databases, as well as the cloud.

6.) Statistics- Knowledge of statistical and machine learning models, including the appropriate assumptions, use cases, and conclusions that can be made beyond the data set. This includes knowledge in statistics and machine learning software tools such as R, Python, and SAS.
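
As a rough illustration of how ratings like these could be recorded and compared, here is a small Python sketch. The two roles and every score below are placeholder values for illustration only, not the actual ratings from this post.

```python
# Hypothetical 1-5 ratings across the six skill sets for two roles.
# All numbers are placeholders, not the post's actual scores.
ratings = {
    "Data Analyst": {
        "domain": 3, "data_manipulation": 3, "communication": 4,
        "management": 2, "databases": 3, "statistics": 2,
    },
    "Data Engineer": {
        "domain": 2, "data_manipulation": 4, "communication": 2,
        "management": 2, "databases": 5, "statistics": 2,
    },
}

# Find the role strongest in a skill you care about, e.g. databases.
best_for_databases = max(ratings, key=lambda role: ratings[role]["databases"])
print(best_for_databases)  # -> Data Engineer
```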

14 Most Used Data Science Tools for 2019 – Essential Data Science Ingredients

 

A Data Scientist is responsible for extracting, manipulating, pre-processing, and generating predictions from data. To do so, they require various statistical tools and programming languages. In this article, we will share some of the data science tools that Data Scientists use to carry out their data operations, looking at the key features of each tool, the benefits it provides, and how the various tools compare.

 


 

Introduction to Data Science

Data Science has emerged as one of the most popular fields of the 21st century. Companies employ Data Scientists to help them gain insights about the market and to improve their products. Data Scientists work as decision makers and are largely responsible for analyzing and handling large amounts of unstructured and structured data. To do so, they require various tools and programming languages for Data Science. Below, we will go through some of the data science tools used to analyze data and generate predictions.

 


 

Top Data Science Tools

Here is a list of the 14 best data science tools that most data scientists use.

 

1. SAS

SAS is one of those data science tools designed specifically for statistical operations. It is closed-source, proprietary software used by large organizations to analyze data. SAS uses the base SAS programming language for statistical modeling. It is widely used by professionals and companies that need reliable commercial software. SAS offers numerous statistical libraries and tools that you as a Data Scientist can use for modeling and organizing your data. While SAS is highly reliable and has strong support from the company, it is expensive and tends to be used only by larger organizations. Also, SAS pales in comparison with some of the more modern open-source tools. Furthermore, several libraries and packages in SAS are not available in the base pack and can require an expensive upgrade.

 


 

2. Apache Spark

Apache Spark, or simply Spark, is an all-powerful analytics engine and one of the most used data science tools. Spark is specifically designed to handle batch processing and stream processing. It comes with many APIs that let Data Scientists make repeated access to data for machine learning, storage in SQL, and more. It is an improvement over Hadoop and can perform up to 100 times faster than MapReduce. Spark has many machine learning APIs that can help Data Scientists make powerful predictions with the given data.

 


Spark does better than other big data platforms in its ability to handle streaming data: it can process real-time data, whereas many other analytical tools process only historical data in batches. Spark offers APIs that are programmable in Python, Java, and R, but it is most powerful in conjunction with the Scala programming language, which runs on the Java Virtual Machine and is cross-platform in nature.

 

Spark is highly efficient at cluster management, which makes it much better than Hadoop, as the latter is used mainly for storage. It is this cluster management system that allows Spark to process applications at high speed.
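
For a flavor of the DataFrame API described above, here is a minimal PySpark sketch, assuming pyspark is installed and a local session is enough for experimentation. The file name sales.csv and the columns region and amount are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; on a real cluster this is where YARN,
# Kubernetes, or standalone cluster settings would be configured.
spark = SparkSession.builder.appName("sales-example").getOrCreate()

# Read a CSV file into a distributed DataFrame (path and columns are
# hypothetical for this sketch).
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Aggregate in parallel across the cluster.
totals = df.groupBy("region").agg(F.sum("amount").alias("total_amount"))
totals.show()

spark.stop()
```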

 

3. BigML

BigML is another widely used data science tool. It provides a fully interactive, cloud-based GUI environment for running machine learning algorithms. BigML provides standardized, cloud-based software for industry requirements. Through it, companies can apply machine learning algorithms across various parts of the business. For example, a company can use this one platform for sales forecasting, risk analytics, and product innovation. BigML specializes in predictive modeling and uses a wide variety of machine learning algorithms for clustering, classification, time-series forecasting, and more.

 

BigML provides an easy-to-use web interface and REST APIs, and you can create a free account or a premium account based on your data needs. It allows interactive visualization of data and lets you export visual charts to your mobile or IoT devices.

 

Furthermore, BigML comes with various automation methods that can help you automate hyperparameter tuning and even automate the workflow of reusable scripts.

 

4. D3.js

JavaScript is mainly used as a client-side scripting language. D3.js, a JavaScript library, allows you to make interactive visualizations in your web browser. Using its APIs, you can create dynamic visualization and analysis of data in the browser. Another powerful feature of D3.js is animated transitions. D3.js makes documents dynamic by allowing updates on the client side and actively using changes in the data to update visualizations in the browser.

 


You can combine this with CSS to create rich, animated visualizations that help you implement customized graphs on web pages. Overall, it can be a very useful tool for Data Scientists working on IoT-based applications that require client-side interaction for visualization and data processing.

 

5. MATLAB

MATLAB is a multi-paradigm numerical computing environment for processing mathematical information. It is closed-source software that facilitates matrix functions, algorithmic implementation, and statistical modeling of data. MATLAB is widely used in many scientific disciplines.

 

In Data Science, MATLAB is used for simulating neural networks and fuzzy logic. Using the MATLAB graphics library, you can create powerful visualizations. MATLAB is also used in image and signal processing. This makes it a very versatile tool for Data Scientists as they can tackle all the problems, from data cleaning and analysis to more advanced Deep Learning algorithms.

 


 

Furthermore, MATLAB’s easy integration with enterprise applications and embedded systems makes it an ideal data science tool. It also helps automate various tasks, ranging from extraction of data to re-use of scripts for decision making. However, it suffers from the limitation of being closed-source, proprietary software.

 

6. Excel

Excel is probably the most widely used data analysis tool. Microsoft developed Excel mostly for spreadsheet calculations, and today it is widely used for data processing, visualization, and complex calculations. While it has been the traditional tool for data analysis, Excel still packs a punch and remains a powerful analytical tool for data science.

 

Excel comes with various formulas, tables, filters, slicers, and so on. You can also create your own custom functions and formulas. While Excel is not built for calculating huge amounts of data, it is still an ideal choice for creating powerful data visualizations and spreadsheets. You can also connect SQL with Excel and use it to manipulate and analyze data. A lot of Data Scientists use Excel for data cleaning, as it provides an interactive GUI environment to pre-process information easily.

 


 

With the Analysis ToolPak add-in for Microsoft Excel, it is now much easier to compute complex analyses. However, Excel still pales in comparison with much more advanced data science tools like SAS. Overall, at a small, non-enterprise scale, Excel is an ideal tool for data analysis.

 

7. ggplot2

ggplot2 is an advanced data visualization package for the R programming language. The developers created it to replace R's native graphics package, and it uses powerful commands to create striking visualizations. It is the most widely used library that Data Scientists use for creating visualizations from analyzed data.

ggplot2 is part of the tidyverse, a collection of R packages designed for data science. One way in which ggplot2 is much better than other data visualization libraries is its aesthetics. With ggplot2, Data Scientists can create customized visualizations in order to engage in enhanced storytelling. Using ggplot2, you can annotate your data in visualizations, add text labels to data points, and boost the interactivity of your graphs. You can also create various styles of maps, such as choropleths, cartograms, and hexbins. It is one of the most used data science tools.

 

8. Tableau

Tableau is data visualization software packed with powerful graphics for making interactive visualizations. It is focused on industries working in the field of business intelligence. The most important aspect of Tableau is its ability to interface with databases, spreadsheets, OLAP (online analytical processing) cubes, and so on. Along with these features, Tableau can visualize geographical data and plot longitudes and latitudes on maps.

 


 

Along with visualizations, you can also use its analytics tool to analyze data. Tableau has an active community, and you can share your findings on its online platform. While Tableau is enterprise software, it comes with a free version called Tableau Public.

 

9. Jupyter

Project Jupyter is an open-source tool, based on IPython, that helps developers build open-source software and experience interactive computing. Jupyter supports multiple languages, such as Julia, Python, and R. It is a web-application tool for writing live code, visualizations, and presentations. Jupyter is a widely popular tool designed to address the requirements of data science.

 

It is an interactive environment through which Data Scientists can perform all of their responsibilities. It is also a powerful tool for storytelling, as various presentation features are built into it. Using Jupyter Notebooks, one can perform data cleaning, statistical computation, and visualization, and create predictive machine learning models. It is 100% open-source and therefore free of cost. There is also an online Jupyter environment, Google Colaboratory (Colab), which runs in the cloud and can store data in Google Drive.

 

10. Matplotlib

Matplotlib is a plotting and visualization library developed for Python. It is the most popular tool for generating graphs from analyzed data. It is mainly used for plotting complex graphs using simple lines of code. Using it, one can generate bar plots, histograms, scatterplots, and more. Matplotlib has several essential modules; one of the most widely used is pyplot, which offers a MATLAB-like interface. Pyplot is also an open-source alternative to MATLAB's graphics modules.
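
Here is a minimal sketch of the pyplot-style workflow described above, assuming Matplotlib and NumPy are installed; the sample data is randomly generated purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Randomly generated sample data for illustration.
rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of the sample values.
ax1.hist(values, bins=20, color="steelblue", edgecolor="white")
ax1.set_title("Histogram")

# Scatter plot of the values against their index.
ax2.scatter(range(len(values)), values, s=10, alpha=0.6)
ax2.set_title("Scatter plot")

plt.tight_layout()
plt.show()
```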

 

 

Matplotlib is a preferred tool for data visualization and is chosen by many Data Scientists over other contemporary tools. As a matter of fact, NASA used Matplotlib to illustrate data visualizations during the landing of the Phoenix spacecraft. It is also an ideal tool for beginners learning data visualization with Python.

 

11. NLTK

Natural language processing has emerged as one of the most popular fields in data science. It deals with the development of statistical models that help computers understand human language. These statistical models are part of machine learning and, through several of its algorithms, are able to assist computers in understanding natural language. The Python ecosystem includes a collection of libraries called the Natural Language Toolkit (NLTK) developed for this particular purpose.

 


 

NLTK is widely used for various language processing techniques like tokenization, stemming, tagging, parsing, and machine learning. It includes over 100 corpora, which are collections of data for building machine learning models. It has a variety of applications, such as part-of-speech tagging, word segmentation, machine translation, text-to-speech, and speech recognition.
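
Below is a small, hedged sketch of tokenization, stemming, and part-of-speech tagging with NLTK, assuming the library is installed. The sample sentence is made up, and the exact resource names passed to nltk.download can vary between NLTK versions.

```python
import nltk
from nltk.stem import PorterStemmer

# One-time downloads of the tokenizer and tagger models (resource names
# may differ slightly across NLTK versions).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "Data scientists are building smarter language models every year."

# Tokenization: split the sentence into words.
tokens = nltk.word_tokenize(text)

# Stemming: reduce each word to a crude root form.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# Part-of-speech tagging: label each token with its grammatical role.
tags = nltk.pos_tag(tokens)

print(tokens)
print(stems)
print(tags)
```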

 

12. Scikit-learn

Scikit-learn is a Python library used for implementing machine learning algorithms. It is a simple, easy-to-use tool that is widely used for analysis and data science. It supports a variety of machine learning features such as data preprocessing, classification, regression, clustering, and dimensionality reduction.

 

Scikit-learn makes it easy to use complex machine learning algorithms. It is therefore well suited to situations that require rapid prototyping, and it is also an ideal platform for research requiring basic machine learning. It makes use of several underlying Python libraries such as SciPy, NumPy, and Matplotlib.
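
As an illustration of that simplicity, here is a minimal scikit-learn sketch that scales features and fits a classifier on the library's built-in Iris dataset. The particular pipeline is just one reasonable choice, not the only way to use the library.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small built-in dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Chain preprocessing and a classifier into a single pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```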

 

13. TensorFlow

TensorFlow has become a standard tool for machine learning. It is widely used for advanced machine learning approaches like deep learning. Developers named TensorFlow after tensors, which are multidimensional arrays. It is an open-source, ever-evolving toolkit known for its performance and high computational abilities. TensorFlow can run on both CPUs and GPUs, and more recently on the even more powerful TPU platform. This gives it an unprecedented edge in terms of processing power for advanced machine learning algorithms.

 


 

Due to this processing ability, TensorFlow has a variety of applications, such as speech recognition, image classification, drug discovery, and image and language generation. For Data Scientists specializing in machine learning, TensorFlow is a must-know tool.
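
For a flavor of the Keras API that ships with TensorFlow, here is a minimal sketch that trains a tiny feed-forward network on synthetic data; the network shape, labels, and training settings are arbitrary choices made purely for illustration.

```python
import numpy as np
import tensorflow as tf

# Synthetic data: 1,000 samples with 20 features and a made-up binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

# A small feed-forward network built with the Keras API.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# The same code runs on CPU, GPU, or TPU backends.
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))
```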

 

14. Weka

Weka, or the Waikato Environment for Knowledge Analysis, is machine learning software written in Java. It is a collection of machine learning algorithms for data mining, with tools for classification, clustering, regression, visualization, and data preparation.

 

It is open-source GUI software that allows easier application of machine learning algorithms through an interactive platform. You can see how machine learning behaves on your data without having to write a line of code. It is ideal for Data Scientists who are beginners in machine learning.

 


 

That concludes our tour of data science tools. We hope you found the explanations useful.

 

Summary

We conclude that data science requires a vast array of tools: tools for analyzing data, for creating aesthetic and interactive visualizations, and for building powerful predictive models using machine learning algorithms. Most data science tools deliver complex data science operations in one place, which makes it easier for the user to implement data science functionality without having to write code from scratch. There are also several other tools that cater to specific application domains of data science.


 

Best Practices for Data Governance

Leveraging the Power of Your Organization’s Data

All organizations need to plan how they use data so that it’s handled consistently throughout the business, to support business outcomes.

This means that organizations that do this successfully consider the who, what, how, when, where, and why of data, not only to ensure security and compliance but also to extract value from all the information collected and stored across the business, improving business performance.

It’s all about how you handle the data collected within your business.

This is data governance, and most organizations are doing some sort of this without even knowing it.

According to the 2019 State of Data Management, data governance was one of the top 5 strategic initiatives for global organizations in 2020. Since technology trends such as Machine Learning and AI rely on data quality, and with the push of digital transformation initiatives across the globe, this trend is likely not going to change any time soon.

Because of this, it is important to raise the awareness of data governance to help those who care about data quality learn more about how the role of data governance impacts today’s business environments, stakeholders and company objectives.

What Is Data Governance?

At DataAdaptiX, we believe in keeping things simple, so we’ll give you a single sentence:

Data governance is a set of principles and practices that ensure high quality throughout the complete lifecycle of your data.

According to the Data Governance Institute (DGI), it is “a practical and actionable framework to help a variety of data stakeholders across any organization identify and meet their information needs”.

The DGI maintains that businesses don’t just need systems for managing data.  They need a whole system of rules, with processes and procedures to make sure those rules are followed, consistently, every day.  That is a tall order for any system of governance.  EA-based data guidance and planning from DataAdaptiX makes the process much easier.

Why Devote So Much Effort to Data Management?

Data is becoming the core corporate asset that will determine the success of your business. Digital transformation is on the agenda everywhere. You can only exploit your data assets and do a successful digital transformation if you are able to govern your data. This means that it is an imperative to deploy a data governance framework that fits your organization and your future business objectives and business models. That framework must control the data standards needed for this journey and delegate the required roles and responsibilities within your organization and in relation to the business ecosystem where your company operates.

A well-managed data governance framework will underpin the business transformation toward operating on a digital platform at many levels within an organization:

  • Management: For top-management this will ensure the oversight of corporate data assets, their value and their impact in the changing business operations and market opportunities

  • Finance: For finance this will safeguard consistent and accurate reporting

  • Sales: For sales and marketing this will enable trustworthy insight into customer preferences and behavior

  • Procurement: For procurement and supply chain management this will fortify cost reduction and operational efficiency initiatives based on exploiting data and business ecosystem collaboration

  • Operations: For operations this will be essential in deploying automation

  • Legal: For legal and compliance this will be the only way to meet increasing regulation requirements

Benefits of Data Governance

If you’ve managed to get this far, the benefits are probably obvious.  Data governance means better, leaner, cleaner data, which means better analytics, which means better business decisions, which means better performance results. Ultimately, improved outcomes and protection of your agency’s reputation. 

Goals of Data Governance

Of course definitions are important. But action is more important.  Now we know what it is.  What do we want to do with it?

Here are a few possibilities:

  • Make consistent, confident business decisions based on trustworthy data aligned with all the various purposes for the use of the data assets within the enterprise

  • Meet regulatory requirements and avoid fines by documenting the lineage of the data assets and the access controls related to the data

  • Improve data security by establishing data ownership and related responsibilities

  • Define and verify data distribution policies including the roles and accountabilities of involved internal and external entities

  • Use data to increase profits.  Data monetization starts with having data that is stored, maintained, classified and made accessible in an optimal way.

  • Assign data quality responsibilities in order to measure and follow up on data quality KPIs related to the general performance KPIs within the enterprise

  • Plan better by not having to cleanse and structure data for each planning purpose

  • Eliminate re-work by having data assets that are trusted, standardized, and capable of serving multiple purposes

  • Optimize staff effectiveness by providing data assets that meet the desired data quality thresholds

  • Evaluate and improve by raising the data governance maturity level phase by phase

  • Acknowledge gains and build on forward momentum in order to secure continuous stakeholder commitment and broad organizational support

These are just a handful of things you can do with great data governance.  Bottom line is, we either want to do these things to grow, or we have to do them to meet regulatory requirements. Regardless of reason, the end result of not doing these things is the same.  If we have bad data, we make bad decisions that we don’t realize are bad decisions until later.

Who’s Involved in Data Governance?

Data governance will involve the whole organization to a greater or lesser degree, but let’s break down the most commonly involved stakeholders:

Data Owners: First, you will need to appoint data owners (or data sponsors if you like) in the business. These must be people who are able to make decisions and enforce those decisions throughout the organization. Data owners can be appointed at the entity level (e.g., customer records, product records, employee records, and so forth) and supplemented at the attribute level (e.g., customer address, customer status, product name, product classification, and so forth). Data owners are ultimately accountable for the state of the data as an asset.

Data Stewards: Next, you will need data stewards (or data champions if you like) who are the people making sure that the data policies and data standards are adhered to in daily business. These people will often be the subject matter experts for a data entity and/or a set of data attributes. Data stewards are either the ones responsible for taking care of the data as an asset or the ones consulted in how to do that.

Data Custodians: Furthermore, you may use data custodians to make the business and technical onboarding, maintenance and end-of-life updates to your data assets.

Data Governance Committee: Typically, a data governance committee will be established as the main forum for approving data policies and data standards and handle escalated issues. Depending on the size and structure of your organization there may be sub fora for each data domain (e.g. customer, vendor, supplier, product or employee).

These roles highlighted above should optionally be supported by a Data Governance Office with a Data Governance Team. In a typical enterprise, here are some folks who might make up a Data Governance Team:

Manager, Master Data Governance: Leads the design, implementation and continued maintenance of Master Data Control and governance across the corporation.

Solution and Data Governance Architect: Provides oversight for solution designs and implementations.

Data Analyst: Uses analytics to determine trends and review information.

Data Strategist: Develops and executes trend-pattern analytics plans.

Compliance Specialist: Ensures adherence to required standards (legal, regulatory, medical, privacy laws or standards).

One of the most important aspects of assigning and fulfilling the roles is having a well-documented description of the roles, the expectations, and how the roles interact. This will typically be outlined in a RACI matrix describing who is responsible, accountable, to be consulted, and to be informed within a certain process or for a certain artifact, such as a policy or standard.
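
As a toy illustration of how such a RACI matrix might be recorded, here is a short Python sketch; the artifacts, roles, and assignments are placeholders, not a recommended allocation.

```python
# A toy RACI matrix: for each governance artifact, who is Responsible,
# Accountable, Consulted, and Informed. All entries are illustrative.
raci = {
    "Customer data policy": {
        "Responsible": ["Data Steward (Customer)"],
        "Accountable": ["Data Owner (Customer)"],
        "Consulted": ["Compliance Specialist"],
        "Informed": ["Data Governance Committee"],
    },
    "Product data standard": {
        "Responsible": ["Data Steward (Product)"],
        "Accountable": ["Data Owner (Product)"],
        "Consulted": ["Solution and Data Governance Architect"],
        "Informed": ["Data Analyst"],
    },
}

# Look up who must sign off on a given artifact.
print(raci["Customer data policy"]["Accountable"])
```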

The Data Governance Framework

A data governance framework is a set of data rules, organizational role delegations, and processes aimed at bringing everyone in the organization onto the same page.

There are many data governance frameworks out there. As an example, we will use the one from The Data Governance Institute. This framework has 10 components, shown in Figure 1 below:

 

 

 

 

Figure 1: The DGI Data Governance Framework © The Data Governance Institute

The Data Governance Maturity Model

Measuring your organization against a data governance maturity model can be a very useful element in building the roadmap, communicating the as-is and to-be parts of the data governance initiative, and setting the context for deploying a data governance framework.

Most organizations will, before embarking on a data governance program, find themselves in the lower phases of such a model.

Phase 0 – Unaware: This is the unaware phase, which often means that you may be more or less alone in your organization with your ideas about how data governance can enable better business outcomes. In this phase you might have a vision for what is required, but you need to focus on much humbler things, such as convincing the right people in the business and IT of smaller goals around awareness and small wins.

 

Phase 1 – Aware: In the aware phase, where the lack of ownership and sponsorship is recognized and the need for policies and standards is acknowledged, there is room for launching a tailored data governance framework that addresses obvious pain points within your organization.

 

Phases 2 and 3 – Reactive & Proactive: Going into the reactive and proactive phases means that a more comprehensive data governance framework can be established, covering all aspects of data governance and the full organizational structure encompassing data ownership and data stewardship, as well as a Data Governance Office / Team, in alignment with the business outcomes already achieved and still to be achieved.

Phases 4 and 5 – Managed & Effective: By reaching the managed and effective phases your data governance framework will be an integrated part of doing business.

If your current data governance policies and procedures are your guidebook, the maturity model is your history book.  It’s compiled from historical data based on a maturity assessment, which compares a company’s performance to established goals and benchmarks over a given period—a quarter, for example, or a year, or even five years.  The model shows where you’ve been, which helps shape where you’re going.

While a “one-size-fits-all” approach doesn’t really work for a maturity model, an “if-the-shoe-fits” approach works well for many companies.  Search for existing models, find one that’s close, and adjust it to meet your company needs.  If the shoe doesn’t fit, it’s easy to change the size of the shoe.  It’s not so easy to change the size of your foot.

 

 

The Connection to Master Data Management (MDM)

Data Governance is the strategic approach.  MDM is the tactical execution. “All enterprise systems need master data management,” Jay Gokul said at our DataAdaptiX 2019 kickoff event.  “Marketing, legal, personnel, finance, and operations.  There is benefit everywhere, in enterprises of any size, in every industry, across the globe, at any point in their data journey.”

Master data is the most important data, Jay said, because it is the data in charge.  It’s about the “business nouns”–the essential elements of your business.  Customers, partners, products, services.  Whatever your business is, that’s where master data lives and breathes.  You may have the best governance plan on the planet.  Well-governed bad data is still bad data. It’s not going to help your business.

“Everybody is in the data business, whether they realize it or not,” Jay said.  “Everything we touch turns to data.  Business is transforming from analog to digital.  No matter what your product is, data is your product.  Business is changing because of data, and data is power. With the right approach, you can harness that power right now.”

Data Protection and Data Privacy

The increasing awareness around data protection and data privacy, as manifested for example by the European Union General Data Protection Regulation (GDPR), has a strong impact on data governance.

Terms such as “data protection” and “data privacy” must by default be baked into our data policies and data standards, not least when dealing with data domains such as employee data, customer data, vendor data, and other party master data.

As a data controller you must have full oversight of where your data is stored, who is updating the data, and who is accessing the data for what purposes. You must know when you handle personally identifiable information and handle it only for legitimate purposes in the given geography, both in production environments and in test and development environments.

Having well-enforced rules for deletion of data is a must, too, in the compliance era.

Data Governance Best Practices

On one hand, you can learn a lot from others who have been on a data governance journey. However, every organization is different, and you need to adapt your data governance practices all the way from the unaware maturity phase to the nirvana of the effective maturity phase.

Nevertheless, please find below a collection of 15 short best practices that will apply in general:

Start small. As in all aspects of business, do not try to boil the ocean. Strive for quick wins and build up ambitions over time.

Set clear, measurable, and specific goals. You cannot control what you cannot measure. Celebrate when goals are met and use this to go for the next win.

Define ownership. Without business ownership a data governance framework cannot succeed.

Identify related roles and responsibilities. Data governance requires teamwork with deliverables coming from all parts of the business.

Educate stakeholders. Wherever possible use business terms and translate the academic parts of the data governance discipline into meaningful content in the business context.

Focus on the business model. A data governance framework must integrate into the way of doing business in your enterprise.

Map your Enterprise Architecture.  Your data governance framework must be a sensible part of your enterprise architecture, the IT landscape and the tools needed.

Develop standardized data definitions. It is essential to strike a balance between what needs to be centralized and where agility and localization work best.

Identify data domains. Start with the data domain with the best ratio between impact and effort for elevating the data governance maturity.

Identify critical data elements. Focus on the most critical data elements.

Define control measurements. Deploy these in business process, IT applications and/or reporting where it makes most sense.

Build your business case. Identify the advantages of raising data governance maturity related to growth, cost savings, risk, and compliance.

Leverage metrics. Focus on a limited set of data quality KPIs that can be related to general performance KPIs within the enterprise.
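
To make the metrics practice above concrete, here is a minimal pandas sketch, assuming pandas is installed, that computes two simple data quality KPIs (completeness and uniqueness) over a small, hypothetical customer table.

```python
import pandas as pd

# Hypothetical customer records with a few deliberate quality problems.
customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "email": ["a@example.com", None, "b@example.com", "c@example.com"],
    "country": ["US", "DK", "DK", None],
})

# Completeness KPI: share of non-missing values per column.
completeness = customers.notna().mean().round(2)

# Uniqueness KPI: share of rows with a distinct customer_id.
uniqueness = customers["customer_id"].nunique() / len(customers)

print(completeness.to_dict())
print(f"customer_id uniqueness: {uniqueness:.0%}")
```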

Data governance practitioners agree that having a multi-pronged communication plan is the most crucial part of the discipline.

Remember – Data Governance is an ongoing management process.   It’s a practice, not a project.