Data Quality Frameworks & Methodologies for Data Managers

Executive Summary

Choosing a suitable DQ framework is crucial to meeting your organisation’s data quality goals. It should integrate various techniques and tools, support different data types, and be flexible and straightforward, with clear goals for each stage.

Essential points discussed in this webinar are:

·       Developing a data quality framework is important

·       The PDCA cycle involves planning, executing, checking, and acting for continuous improvement

·       The Pareto principle is used to prioritise data quality issues

·       Root cause analysis is conducted to identify underlying reasons for data quality issues

·       Juran, Deming, and ISO offer different approaches to achieving data quality

·       Evaluating the cost of data quality issues is crucial

·       Developing taxonomies for data value realisation and cost reduction is important

·       Planning for data quality should focus on resolving data quality issues, not merely planning for them

·       Various frameworks can be utilised based on organisational needs and objectives

Webinar Details

Webinar Title: Data Quality Frameworks & Methodologies for Data Managers

Webinar Date: 6th July 2023

Webinar Presenter: Howard Diesel

Meetup Group: Stretching your Data Management Career: Data Management Professionals

Write-up Author: Howard Diesel

Contents

Executive Summary

Webinar Details

Introduction to Data Quality Framework (PDCA)

Data Quality Planning and Frameworks

Data Analysis and Data Quality

Different Frameworks for Data Quality Assessment and Improvement

Evaluating Data Quality Frameworks

Planning for Data Quality

The Importance of Quality Control in Data Quality

Quality Improvement in Healthcare Data

Importance of Frameworks in Quality Management

Quality Management Frameworks and Data Certifications in Saudi Arabia

Integrating Quality into the Source System to Avoid Bad Data

Data Quality Assessment and Improvement Process

Importance of Root Cause Analysis and Improvement in Data Quality

The Juran Trilogy and Data Quality Targets

Notes on Qualitative and Quantitative Metrics in Data Analysis

Exploring Data Quality Issues and Quality Frameworks

Clarification of Frameworks and Metrics for Data Analysis

The Impact of Distributed Systems on Data Quality

Data Quality Assessment and Accuracy

Great Expectations: A Framework for Ensuring Data Quality

Reference Material

Introduction to Data Quality Framework (PDCA)

The main topic, developing a data quality framework, was introduced with a question about the familiar PDCA framework. Monica, who is studying for an exam, was asked what the acronym PDCA stands for. Penelope then explained the PDCA (Plan-Do-Check-Act) cycle, emphasising the importance of planning, executing, checking, and acting for continuous improvement.

Figure 1. PDCA Cycle

Data Quality Planning and Frameworks

During the planning phase, data quality issues are prioritised using the "Pareto principle," targeting the 20% of issues that cause 80% of the problems. Root cause analysis is conducted to identify the underlying reasons for these issues, and a remediation plan is then developed. In the "do" phase, the identified problems are fixed and new data quality targets are established. The "check" phase verifies whether the targets have been met and identifies the causes of any remaining problems. The "act" phase addresses emerging issues and prevents the need to repeat the entire plan-do-check-act cycle. DQ frameworks provide various ways to achieve specific goals, and larger contexts, such as company strategy, can also influence the plan. Focusing on the top 20 issues can help ensure that all necessary steps for a migration or large project are covered.

Figure 2. DQ Frameworks
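
To make the prioritisation step concrete, here is a minimal sketch, in Python, of ranking hypothetical data quality issues by defect count and selecting the subset that accounts for roughly 80% of all defects; the issue names and counts are invented for illustration only.

```python
# Pareto-style prioritisation sketch; issue names and defect counts are hypothetical.
issues = {
    "missing_customer_email": 4200,
    "duplicate_customer_records": 2600,
    "invalid_postal_code": 900,
    "inconsistent_gender_codes": 450,
    "stale_phone_numbers": 300,
    "misspelled_city_names": 150,
}

total = sum(issues.values())
cumulative = 0
priority_list = []

# Rank issues by defect count and keep those covering the first ~80% of defects.
for name, count in sorted(issues.items(), key=lambda kv: kv[1], reverse=True):
    cumulative += count
    priority_list.append(name)
    if cumulative / total >= 0.8:
        break

print("Issues to target in this PDCA cycle:", priority_list)
```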

Data Analysis and Data Quality

If your data is of good quality, you can either analyse data products or focus on your strategy. However, if you have identified data quality issues through data governance, it is vital to prioritise and address them before moving forward. Consider, for example, revamping your company's product catalogue while implementing new data products. Before providing data sets to data scientists or BI teams, it is crucial to implement use-case quality checks. Categorise data sets in the catalogue as either "salty water" (uncurated) or "freshwater" (curated); only curated data sets should be used for BI and advanced analytics. The Plan-Do-Study-Adjust (PDSA) cycle can also be helpful for data analysis processes: plan, conduct the study, review the results, and make adjustments, in line with Magnus's PDSA perspective. Further discussion on the topic is needed.

Figure 3. Use Data Analysis to Reconstruct the state of Data
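
The "salty water" versus "freshwater" labelling above can be pictured with a small sketch, assuming a simple in-memory catalogue; the dataset names and the `curated` flag are hypothetical and not tied to any particular catalogue tool.

```python
# Curated ("freshwater") vs uncurated ("salty water") catalogue filter; entries are hypothetical.
from dataclasses import dataclass

@dataclass
class CatalogueEntry:
    name: str
    curated: bool  # True = "freshwater" (curated), False = "salty water" (uncurated)

catalogue = [
    CatalogueEntry("product_catalogue_v2", curated=True),
    CatalogueEntry("raw_web_clickstream", curated=False),
    CatalogueEntry("customer_master", curated=True),
]

# Only curated data sets should be released to BI and advanced analytics.
usable_for_bi = [entry.name for entry in catalogue if entry.curated]
print(usable_for_bi)  # ['product_catalogue_v2', 'customer_master']
```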

Different Frameworks for Data Quality Assessment and Improvement

The Juran framework prioritises data accuracy over speed and encompasses various ways of thinking. It is often confused with the Deming framework; in one specialist exam, the PDCA framework was mistakenly attributed to Juran rather than Deming. Juran is also known for the Juran Trilogy, which covers quality planning, quality control, and quality improvement as three aspects of data quality assessment and improvement. Other notable data quality frameworks include those by Larry English, Tom Redman, Laura Sebastian-Coleman, and Danette McGilvray. The PDCA framework is typically applied to the assessment and improvement phases, with the speaker's work focused mainly on the assessment phase. It is crucial to evaluate the cost of data quality issues and to develop a cost taxonomy for data quality assessment; building taxonomies for both data value realisation and cost reduction helps in assessing and understanding data quality issues. Another framework worth noting is POSMAD, created by Danette McGilvray, which highlights the life cycle of data quality (Plan, Obtain, Store and Share, Maintain, Apply, Dispose).

Evaluating Data Quality Frameworks

We are discussing frameworks that focus primarily on the operational aspects of data quality rather than its definition, principles, and policies. The process involves several iterations, including data value realisation, maturity assessment, developing a data quality strategy, and establishing an operating model. It is crucial to establish practices by defining, operationalising, and continuously improving them, and to review and improve quality policies and procedures as work progresses. Standardising and reviewing frameworks requires specific criteria, such as how data is planned, obtained, stored, maintained, applied, and disposed of. Different frameworks, such as POSMAD and ISO, offer distinct approaches to achieving data quality.

Figure 4. DQ Framework Analysis

Planning for Data Quality

The Excel spreadsheet contains a comprehensive plan for ensuring high-quality data. The Juran Trilogy, consisting of quality planning, quality control, and quality improvement, is discussed in detail. Our focus is on resolving data quality issues rather than simply planning for them, as suggested in the DMBOK. Danette McGilvray’s approach covers several planning areas, including control, assurance, and improvement, while ISO 8000-61 incorporates the Plan-Do-Check-Act cycle plus additional elements related to data architecture, IT, and HR. We present different breakdowns of planning, control, assurance, and improvement; the optimal framework depends on organisational needs and objectives, and various frameworks can be used depending on the desired outcome, such as implementing quality projects or changing organisational structures.

The Importance of Quality Control in Data Quality

It's worth noting that while the DMBOK is a valuable resource for understanding data quality, it covers only some of the frameworks, such as the Juran Trilogy, POSMAD, and ISO 8000. To truly grasp data quality, it's essential to recognise that it is an ongoing process requiring continuous improvement, as exemplified by the plan-do-check-act cycle. Technical debt and recurring data errors are persistent issues that must be addressed regularly. The Juran Trilogy emphasises understanding the cost of poor data quality and working at the quality control level before proceeding to the next iteration. Developing data quality rules and gradually enhancing critical data elements can support the move to higher quality control levels, for example from 50% to 60%.

Figure 5. Plan, CONTROL, Improve
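
One way to picture working at a quality control level before the next iteration is a rule that measures a critical data element against the current target and reports whether the next target (say, moving from 50% to 60%) has been reached; the records, field name, and thresholds below are hypothetical.

```python
# Quality-control check against an incrementally raised target; data and thresholds are hypothetical.
records = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3, "email": "c@example.com"},
    {"customer_id": 4, "email": None},
    {"customer_id": 5, "email": "e@example.com"},
]

def completeness(rows, field):
    """Share of rows in which the critical data element is populated."""
    populated = sum(1 for row in rows if row.get(field) not in (None, ""))
    return populated / len(rows)

current_target, next_target = 0.50, 0.60
score = completeness(records, "email")
print(f"email completeness: {score:.0%}")
print("meets current target:", score >= current_target)
print("meets next target:", score >= next_target)
```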

Quality Improvement in Healthcare Data

Improving quality is an ongoing process that requires consistent effort, and maintaining clean and accurate data is crucial for achieving it. By implementing a framework, we can achieve these enhancements and improve data quality. The Canadian Institute for Health Information (CIHI) has established its own framework to ensure quality in the healthcare industry. Working in the critical zone of data accuracy can be challenging and may have unintended consequences. Setting quality expectations too low may result in dissatisfaction with data accuracy, so regular assessments are necessary to evaluate data quality and adjust expectations accordingly. If data quality falls below expectations, it may be necessary to discontinue using the report. Beginning with lower expectations is essential to establish a quality foundation to build upon.

Figure 6. Building a Specialised Framework

Importance of Frameworks in Quality Management

Poor data quality presents a significant risk to companies and calls for prompt intervention. One such intervention may involve hiring university students to contact customers and resolve data-related issues. Depending on the severity of the situation, urgent risk management or incremental improvement measures may be necessary to bring data quality up to acceptable levels. Choosing the appropriate framework for data governance, such as NET, ISO, or DMAIC, is crucial. Standardising criteria is also essential to compare different frameworks and establish a coherent, comprehensive system for quality management. Frameworks offer a way to create a comprehensive quality management system, thus ensuring successful outcomes.

Figure 7. Framework Classification

Quality Management Frameworks and Data Certifications in Saudi Arabia

This section covers several quality management frameworks, including TQM (Total Quality Management), ISO, Six Sigma, the European Foundation for Quality Management (EFQM), the Balanced Scorecard, Lean, and Lean Six Sigma. The speaker suggests implementing a certification and standardisation framework for data quality in Saudi Arabia's ministries, with an independent organisation assessing the quality of shared data before use; in a cited case, a lack of standardisation led to problems with birth dates. While Six Sigma focuses on controlling the quality of data products, Lean aims to reduce waste and enhance efficiency. It is important to compare the different frameworks, examine improvement techniques, and evaluate the dimensions and costs of data quality issues. ISO and ISTAT address certification and control across various government ministries.

Figure 8. Distributed Systems & Operating Models

Integrating Quality into the Source System to Avoid Bad Data

Integrating data quality into the source system is crucial to avoid problems downstream. One example is addressing ambiguous gender values: the source system should be enhanced to ensure the correct selection of gender and to manage the influx of poor-quality data. It is also important to extend data quality to the organisational level when sharing data between organisations or group companies. To improve data quality, the processes used to create, read, and update data, such as data warehousing, must be considered, and data quality services should be reviewed from different perspectives. When selecting a data quality framework, look for support for distributed operation with self-sufficient governing teams. Centralised quality broker services, such as ISTAT, can facilitate multiple organisational structures and cross-boundary usage.

Figure 9. Source System Comparisons
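
The point about preventing ambiguous values at the point of entry can be sketched as a simple source-system validation that rejects codes outside an agreed list before they are persisted; the code list, field names, and records are assumptions made for illustration.

```python
# Source-system validation sketch; the allowed code list and record shape are hypothetical.
ALLOWED_GENDER_CODES = {"F", "M", "X", "U"}  # U = unknown/undisclosed

def validate_new_customer(record: dict) -> list:
    """Return a list of validation errors; an empty list means the record may be saved."""
    errors = []
    gender = str(record.get("gender", "")).strip().upper()
    if gender not in ALLOWED_GENDER_CODES:
        errors.append(f"gender '{record.get('gender')}' is not in the agreed code list")
    return errors

print(validate_new_customer({"name": "A. Smith", "gender": "female"}))  # flagged
print(validate_new_customer({"name": "B. Jones", "gender": "F"}))       # no errors
```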

Data Quality Assessment and Improvement Process

To ensure high-quality data, the CDQ methodology offers a comprehensive approach that includes state reconstruction, assessment, and improvement. State reconstruction involves rebuilding the picture of the data, processes, and organisational structure to address quality issues and missing metadata. Assessment and measurement involve profiling data elements, analysing their statistical distribution, and understanding the data's shape before applying dimensions or data quality rules. Visual tools such as histograms are used to analyse completeness levels and to judge how close to 100% completeness the data is and whether that goal is feasible.

Figure 10. After Data Analysis: Assess & Improve
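
As a sketch of the assessment-and-measurement step, the snippet below profiles per-column completeness with pandas and inspects the shape of one column before any dimensions or rules are applied; the sample data frame is invented for illustration.

```python
# Profiling sketch: column completeness and basic shape before applying DQ rules.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "email": ["a@x.com", None, "c@x.com", None, "e@x.com"],
    "postal_code": ["1001", "2002", None, "4004", "5005"],
})

# Completeness per column: share of non-null values (could be plotted as a bar chart/histogram).
completeness = df.notna().mean()
print(completeness)

# Statistical shape of a column before choosing dimensions or data quality rules.
print(df["postal_code"].describe())
```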

Importance of Root Cause Analysis and Improvement in Data Quality

During the assessment phase, it's important to collaborate with data stewards and to tackle data quality by starting with lower expectations. Root cause analysis plays a crucial role in identifying the causes of data issues; the Fishbone (Ishikawa) diagram and the Five Whys technique are commonly used for this. Tracking and tracing data lineage is essential to comprehend data flow, and process analysis is another approach to identifying data issues. When addressing data quality, improvement can be data-driven or process-driven, and both quick wins and long-term solutions are taken into consideration.

Figure 11. RCA Methods & Techniques

The Juran Trilogy and Data Quality Targets

We're currently discussing the Juran Trilogy and data quality targets. Our main goal is to reduce the error rate and enhance the overall quality of our data. Although the plan-do-check-act framework is helpful, it doesn't cover state reconstruction. We then introduce the assessment phase, which entails analysing data and understanding data requirements. The conversation involves various stakeholders and their expectations. It's essential to identify critical areas of data corruption by utilising the processing matrix. We also discuss the selection of quality dimensions and objective and subjective metrics. To clarify, we explain the difference between these two types of metrics.

Notes on Qualitative and Quantitative Metrics in Data Analysis

Subjective aspects such as customer experience can be measured with surveys, which provide qualitative metrics; the usability of software applications can be evaluated subjectively in the same way. Trends in people's behaviour, including eating habits, can also serve as subjective measures: for example, the sales impact of replacing animal-based cheese with plant-based cheese can be used to gauge people's reactions. Regarding health information analysis, it is essential to differentiate between data quality and information quality; information quality entails analytics and statistical interpretation. Improving data analysis involves evaluating costs, assigning responsibilities, identifying root causes of errors, and proposing qualitative improvement solutions. Improvement strategies can be data-driven or process-driven, with data-driven modifications directly impacting the value of the data.
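
To make the objective/subjective distinction concrete, the sketch below computes one objective metric (completeness measured directly from the data) and one subjective metric (an average survey rating of perceived fitness for use); all values are hypothetical.

```python
# Objective vs subjective metric sketch; all values are hypothetical.

# Objective: measured directly from the data itself.
emails = ["a@x.com", None, "c@x.com", "d@x.com", None]
objective_completeness = sum(e is not None for e in emails) / len(emails)

# Subjective: gathered from people, e.g. a 1-5 survey on "is this data fit for use?".
survey_scores = [4, 3, 5, 4, 2]
subjective_fitness = sum(survey_scores) / len(survey_scores)

print(f"objective completeness: {objective_completeness:.0%}")
print(f"subjective fitness-for-use (1-5): {subjective_fitness:.1f}")
```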

Exploring Data Quality Issues and Quality Frameworks

To ensure reliable data, it is crucial to avoid compounding previous data issues. Data-driven techniques include obtaining high-quality data, utilising quality brokers, standardisation, record linking, ensuring trustworthiness, pinpointing errors, and making corrections. Process-driven and data-driven approaches are compared in terms of their short-term versus long-term cost efficiency. Defining accuracy and completeness at the concept level leads to better comprehension and significance, and improving data quality must be justified by reducing the cost of poor quality. It is important to define the different types of poor-quality data, such as structured, unstructured, and semi-structured. Various quality frameworks, such as ISTAT, Dynamic Quality, Larry English's Information Quality, and Wang's, require careful consideration when choosing the appropriate one; not all frameworks cover every aspect, such as data quality requirement analysis or process modelling.
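
One of the data-driven techniques listed above, record linking after light standardisation, can be sketched as follows; the matching rule (normalised name plus postcode) and the records are simplifications chosen for illustration.

```python
# Standardisation followed by naive record linking; records and matching rule are hypothetical.
def standardise(record: dict) -> tuple:
    """Normalise case and whitespace, then build a simple match key."""
    name = " ".join(record["name"].lower().split())
    postcode = record["postcode"].replace(" ", "").upper()
    return (name, postcode)

source_a = [{"name": "Anna  Smith", "postcode": "2000"},
            {"name": "Ben Jones", "postcode": "4001"}]
source_b = [{"name": "anna smith", "postcode": "2000"},
            {"name": "Carla Nkosi", "postcode": "8001"}]

keys_in_b = {standardise(record) for record in source_b}
linked = [record for record in source_a if standardise(record) in keys_in_b]
print(linked)  # the Anna Smith records link across the two sources
```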

Clarification of Frameworks and Metrics for Data Analysis

During the discussion, the speaker emphasised the significance of selecting the appropriate framework for data analysis rather than relying solely on the conventional plan-do-check-act (PDCA) framework. They also highlighted various methods for comparing elements, such as cost qualification and mapping different areas and systems. The speaker mentioned a large-scale ERP system and considered the potential benefits of distributed data warehouse cooperatives and web-based systems. They also emphasised the importance of choosing suitable metrics for the framework, including the standardised metrics provided by TDQM and the option to add personalised metrics. The discussion also covered tools and methodologies for data collection and assessment. Finally, the speaker clarified that the framework does not explicitly mention distributed systems, although they are implicitly considered.

Figure 12. Subjective & Objective Measurements

The Impact of Distributed Systems on Data Quality

Magnus is dedicated to developing distributed systems and organisational structures that facilitate data sharing across various domains. Using Sera, distributed systems can support interaction and allow users to request specific data types while checking their quality level. Providers can improve their data quality ratings by making changes or enhancements, and users are notified of any improvements. The speaker also shares an anecdote about their experience at a central bank, where switching from Bloomberg to Reuters was met with resistance because of a long-standing perception that Bloomberg had been superior.

Data Quality Assessment and Accuracy

When evaluating quality, accuracy is an essential factor to consider. A Python interface and website offer an extension for calculating data quality, with accuracy as a significant component. Data quality dimensions such as completeness and accuracy are evaluated when establishing user expectations, and data quality requirements determine the minimum and maximum accuracy levels along with their significance. The granularity of accuracy can be assessed at various levels, including element, row, data set, and data schema. Accuracy is measured against two points of comparison: agreement with the real world and agreement with a source. Several areas, such as employee data and data privacy, require accuracy validation.
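
The granularity point above, accuracy measured at element level versus row level against a trusted source, can be sketched like this; both data sets are hypothetical and "agreement with a source" is simplified to exact equality.

```python
# Accuracy against a reference source at element and row granularity; data is hypothetical.
reference = {1: {"name": "Anna Smith", "city": "Cape Town"},
             2: {"name": "Ben Jones", "city": "Durban"}}
observed  = {1: {"name": "Anna Smith", "city": "Capetown"},
             2: {"name": "Ben Jones", "city": "Durban"}}

element_hits = element_total = row_hits = 0
for key, reference_row in reference.items():
    observed_row = observed[key]
    matches = [observed_row[field] == value for field, value in reference_row.items()]
    element_hits += sum(matches)
    element_total += len(matches)
    row_hits += all(matches)

print(f"element-level accuracy: {element_hits / element_total:.0%}")  # 75%
print(f"row-level accuracy: {row_hits / len(reference):.0%}")          # 50%
```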

Great Expectations: A Framework for Ensuring Data Quality

During the discussion, the speaker brings up a workspace dedicated to a data quality framework where many individuals can contribute. They mention a tool named Great Expectations, which utilises Python routines to assess data quality. Great Expectations employs probes to validate data quality against various frameworks. The speaker also shares a link to the tool, highlighting its features, such as data profiling and seamless integration with Snowflake. Finally, they mention the partnership between Great Expectations and Data IQ.
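
As an illustration of the kind of checks the tool runs, here is a minimal sketch using Great Expectations' older pandas convenience interface; newer releases use a context/validator workflow instead, so treat this as indicative rather than definitive, and note that the data frame and column names are hypothetical.

```python
# Sketch using Great Expectations' older pandas-style API (newer versions differ);
# the data frame and column names are hypothetical.
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email": ["a@x.com", None, "c@x.com"],
    "age": [34, 29, 120],
})

gdf = ge.from_pandas(df)
print(gdf.expect_column_values_to_not_be_null("email"))
print(gdf.expect_column_values_to_be_between("age", min_value=0, max_value=110))
```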

Reference Material

If you want to receive the recording, kindly contact Debbie (social@modelwaresystems.com)

Don’t forget to join our exciting LinkedIn and Meetup data communities so you don’t miss out!
