Preventing an error in data collection is easier than dealing with its consequences. The soundness of your business decisions depends on the quality of your data. In this article, we tell you how to check data quality at every stage of collection, from the statement of work to completed reports.
It’s crucial to assess data quality early on and to keep monitoring it with tools and techniques that examine data against established quality dimensions and uncover inconsistencies.
A comprehensive data quality assessment in the early stages of data collection relies on techniques such as data profiling, validation, and cleansing to establish specific quality rules and thresholds.
Want to be sure about the quality of your data? Leave it to OWOX BI. We’ll help you develop data quality metrics and customize your analytics processes to ensure high data quality throughout the data collection and data quality monitoring process.
With OWOX BI, tracking data quality metrics becomes a key component of our service, ensuring the accuracy of your data. You don’t need to look for connectors and clean up and process data. You’ll get ready data sets in an understandable and easy-to-use structure.
Note: This post was originally published in January 2020 and was completely updated in June 2024 for accuracy and comprehensiveness on Web analytics.
Data quality describes the state of data as determined by factors such as accuracy, completeness, reliability, and relevance. High-quality data must be fit for its intended uses in operations, decision-making, and planning, ensuring it effectively meets the needs of its users.
Consistency: Consistency checks ensure data is uniform across different sources and systems, maintaining standardized formats and values. Alongside dimensions such as accuracy, completeness, timeliness, validity, uniqueness, and privacy and security, consistency is evaluated through a broader data quality assessment that combines data profiling, validation, and cleansing with specific data quality rules and thresholds.
Accuracy: Accuracy measures how closely data reflects the real-world values it is supposed to represent.
Completeness: Completeness assesses whether all necessary data is present and no critical elements are missing.
Auditability: Auditability is the ability to trace and verify data through accessible, clear records that show its history and usage.
Orderliness: Orderliness checks the arrangement and organization of data, ensuring it is systematic and logically structured.
Uniqueness: Uniqueness ensures no data records are duplicated and each entry is distinct within its dataset.
Timeliness: Timeliness evaluates whether data is up to date and available when needed, so it remains relevant for current tasks and decisions.
To effectively monitor and improve these dimensions, organizations rely on data quality metrics: quantitative indicators that track and report changes in data quality dimensions and issues over time.
Tracking these metrics shows how accurate your data is, helps you find areas for improvement, and confirms that your monitoring tools are working as intended.
The data quality monitoring process is a continuous one of evaluating and ensuring the integrity, accuracy, and reliability of data throughout its lifecycle, beginning with a foundational step of data quality assessment.
This assessment involves context awareness and the application of various techniques such as data profiling, data validation, and data cleansing, along with the establishment of specific data quality rules and thresholds. It sets the stage for effective monitoring by identifying the key metrics to track.
The process of data quality monitoring involves setting benchmarks for data quality attributes such as completeness, consistency, and timeliness and using tools and methodologies, including advanced data quality monitoring techniques, to track data quality metrics against these standards.
By identifying anomalies and errors in real-time, data quality monitoring enables organizations to take immediate corrective actions, thereby preventing the negative impacts of low-quality data on business operations and decision-making.
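To make this more concrete, here is a minimal sketch of what such a check might look like in practice. The record fields, the thresholds, and the checkQuality helper below are illustrative assumptions, not part of any particular tool:

// Illustrative only: score a small batch of records against assumed quality thresholds.
const records = [
  { transactionId: 'T-1001', revenue: 120.5, currency: 'USD' },
  { transactionId: 'T-1002', revenue: null, currency: 'USD' },
  { transactionId: 'T-1001', revenue: 120.5, currency: 'USD' }, // duplicate entry
];

const thresholds = { completeness: 0.95, uniqueness: 0.99 }; // assumed benchmarks

function checkQuality(rows) {
  const fields = ['transactionId', 'revenue', 'currency'];
  // Completeness: share of non-empty values across all expected fields.
  const filled = rows.flatMap(row => fields.map(f => row[f] !== null && row[f] !== undefined));
  const completeness = filled.filter(Boolean).length / filled.length;
  // Uniqueness: share of distinct transaction IDs.
  const ids = rows.map(row => row.transactionId);
  const uniqueness = new Set(ids).size / ids.length;
  return { completeness, uniqueness };
}

const metrics = checkQuality(records);
for (const [name, value] of Object.entries(metrics)) {
  if (value < thresholds[name]) {
    console.warn(`Data quality alert: ${name} is ${value.toFixed(2)}, below the ${thresholds[name]} benchmark`);
  }
}

A real monitoring setup would run checks like this on a schedule across whole datasets, but the principle is the same: compute the metric, compare it to the benchmark, and alert when it falls short.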
The importance of choosing a data quality monitoring solution that enables quick identification and resolution of data quality issues cannot be overstated; such solutions are vital for preserving the overall health of the data ecosystem.
Monitoring the quality of your data is crucial for ensuring its reliability and usability in decision-making processes. Addressing poor data quality issues early helps prevent errors and reduces costs associated with inaccuracies, enhancing operational efficiency.
Regular data quality monitoring safeguards against the potential risks of regulatory non-compliance and protects the organization’s reputation from the negative impacts of data quality issues.
Tracking data quality metrics is essential in this process, as it provides quantitative indicators that help determine the accuracy of data, thereby enhancing the decision-making process by providing accurate data insights.
By upholding rigorous data standards and consistently monitoring and rectifying data quality concerns, companies can gain deeper insights into their performance, enhance customer interactions, and secure a competitive advantage in the market, all of which translates into better business results.
Unfortunately, many companies that spend substantial resources storing and processing data still make important decisions based on intuition and their own expectations instead of data. Utilizing data performance testing tools is essential for simulating various data processing scenarios to ensure the reliability of web analytics.
Why does that happen? Distrust of data is exacerbated by situations where data provides an answer that’s at odds with the expectations of the decision-maker. In addition, if someone has encountered errors in data or reports in the past, they’re inclined to favor intuition. This is understandable, as a decision made on the basis of incorrect data may set you back rather than move you forward.
Imagine you have a multi-currency project. Your analyst has set up Google Analytics in one currency, and the marketer in charge of contextual advertising has set up cost importing into Google Analytics 4 in another currency. As a result, you have an unrealistic return on ad spend (ROAS) in your advertising campaign reports. If you don’t notice this error in time, you may either disable profitable campaigns or increase the budget on loss-making ones.
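A rough back-of-the-envelope calculation, with made-up numbers and an assumed exchange rate, shows how strongly such a currency mismatch can distort ROAS:

// Illustrative numbers only: revenue is tracked in USD, but costs were imported
// as a plain number entered in EUR, so no conversion was applied.
const revenueUsd = 10000; // revenue recorded by analytics, in USD
const costEur = 9500;     // ad spend entered in EUR
const eurToUsd = 1.08;    // assumed exchange rate

const reportedRoas = revenueUsd / costEur;            // ~1.05, looks barely profitable
const actualRoas = revenueUsd / (costEur * eurToUsd); // ~0.97, actually loss-making

console.log({ reportedRoas, actualRoas });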
In addition, developers are usually very busy, and implementing web analytics is a secondary task for them. While implementing new functionality — for example, a new design for a product accessories block — developers may forget to check that data is being collected in Google Analytics 4. As a result, when the time comes to evaluate the effectiveness of the new design, it turns out that the data collection was broken two weeks ago. Surprise.
We recommend testing web analytics data as early and as often as possible to minimize the cost of correcting an error.
Imagine you've made an error during the specification phase. If you find it and correct it immediately, the fix will be relatively cheap. If the error is revealed after implementation, when building reports, or even when making decisions, the cost of fixing it will be very high.
Data collection typically consists of five key steps:
At almost all of these stages, it's very important to check your data. It's necessary to test technical documentation, Google Analytics 4 and Google Tag Manager settings, and, of course, the quality of data collected on your site or in your mobile application.
Before we go through each step, let's take a look at some requirements for data testing:
As we’ve mentioned, it’s much easier to correct an error if you catch it in the specifications. Therefore, checking documentation starts long before collecting data. Let’s figure out why we need to check your documentation.
Purposes of testing documentation:
Most common errors in specifications:
Data validation plays a crucial role in preventing these errors by ensuring the data meets established quality criteria before it's processed.
The next step after you check your technical documentation is to check your Google Analytics 4 and Google Tag Manager settings.
Why test Google Analytics 4 and Google Tag Manager settings?
Most common errors in Google Analytics:
Most common errors in Google Tag Manager:
The last stage of testing is testing directly on the site. This stage requires more technical knowledge because you'll need to inspect the code, check how the container is installed, and read the logs. So you need to be savvy and use the right tools.
Why test the tracking embedded on the site?
The most common mistakes:
Tools we use to test data:
Let’s take a closer look at these tools. Data quality tools are essential for generating data quality metrics and applying data quality rules to ensure data accuracy, consistency, and reliability.
To get started, you need to install this extension in your browser and enable it. Then open the page you want to check and go to the Console tab in your browser's developer tools. The information you see there is provided by the extension.
This screen shows the parameters that are transmitted with hits and the values that are transmitted for those parameters:
There's also an extended e-commerce block. You can find it in the console as ec:
In addition, error messages are displayed here, such as for exceeding the hit size limit.
If you need to check the composition of the dataLayer, the easiest way to do this is to type the dataLayer command in the console:
Here are all the parameters that are transmitted. You can study them in detail and verify them. Each action on the site is reflected in the dataLayer. Let's say you have seven objects. If you click on an empty field and call the dataLayer command again, an eighth object should appear in the console.
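For example, a quick check in the browser console might look like this; the add_to_cart event shown in the comment is just an illustration, not your site's actual dataLayer:

// Run these commands in the browser console on a page where Google Tag Manager is installed.
dataLayer;           // prints the current array of pushed objects, e.g. 7 entries
dataLayer.length;    // 7

// After the next tracked action on the page, one more object is pushed, for example:
// dataLayer.push({ event: 'add_to_cart', ecommerce: { /* ... */ } });

dataLayer.length;                 // 8, one more object than before
dataLayer[dataLayer.length - 1];  // inspect the most recent event and its parameters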
To access Google Tag Manager Debugger, open your Google Tag Manager account and click the Preview button:
Then, open your site and refresh the page. In the lower pane, a panel should appear that shows all the tags running on that page.
Events that are added to the dataLayer are displayed on the left. By clicking on them, you can check the real-time composition of the dataLayer.
Features of mobile browser testing:
Features of mobile application testing:
This step is the fastest and easiest. At the same time, it makes sure the data collected in Google Analytics 4 makes sense. In your reports, you can check hundreds of different scenarios and look at indicators depending on the device, browser, etc. If you find any anomalies in the data, you can reproduce the scenario on a specific device and in a specific browser.
You can also use Google Analytics 4 reports to check the completeness of data transferred to the data layer: for each scenario, whether the variable is populated, whether it contains all the expected parameters, whether those parameters take the correct values, and so on.
We want to share the most useful reports (in our opinion). You can use them as a data collection checklist:
Let's see what these reports look like in the interface and which of these reports you need to pay attention to first.
In GA4, the "E-commerce purchases" report not only tracks user progression through different stages of the shopping journey but also helps in assessing the completeness of data collection at each stage. This is crucial for identifying any gaps where data might not be accurately captured. For instance, if a significant drop-off is observed between the "add to cart" and "purchase" stages, it could indicate issues with the checkout process or with how events are tracked in that segment.
GA4 uses event-based tracking, which offers flexibility in monitoring specific interactions across the site. Each stage of the enhanced e-commerce process—viewing products, adding items to carts, initiating checkout, and completing a purchase—is tracked through designated events. Analyzing these events can reveal discrepancies or inefficiencies in data collection, enabling marketers to make necessary adjustments to tracking setups or site design to ensure comprehensive data collection and a smoother user experience.
What should we pay attention to here? First, it’s very strange if you have zero values in any of the columns. Second, if you have more values at some stage than at the previous stage, you’re likely to have problems collecting data. That’s weird and worth paying attention to. You can also switch between other parameters in this report, which should also be sent to Enhanced Ecommerce.
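For reference, each of these stages is usually represented by a GA4 e-commerce event pushed to the dataLayer in roughly this form (the item values below are placeholders, and your tag setup may differ):

// Typical GTM pattern: clear the previous ecommerce object, then push the event.
dataLayer.push({ ecommerce: null });
dataLayer.push({
  event: 'add_to_cart',
  ecommerce: {
    currency: 'USD',
    value: 29.99,
    items: [
      { item_id: 'SKU_12345', item_name: 'Example T-Shirt', price: 29.99, quantity: 1 }
    ]
  }
});
// The same structure is used for view_item, begin_checkout, and purchase,
// so each stage of the funnel in the E-commerce purchases report has its own event.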
First of all, it's necessary to walk through all parameters that are transmitted to Google Analytics and see what values each parameter takes. Usually, it's immediately clear whether everything is okay. More detailed analyses of each of the events can be carried out in custom reports.
Another report that can be useful for checking the import of cost data into Google Analytics is Cost Analysis.
We often see reports where there are expenses for some source or advertising campaign but no sessions. This can be caused by problems or errors in UTM tags. Alternatively, filters in Google Analytics 4 may exclude sessions from a particular source. These reports need to be checked from time to time.
We would like to highlight a custom report that allows you to track duplicate transactions. It's very easy to set up: use the transaction ID as the dimension and transactions as the metric.
Note that if more than one transaction appears for the same transaction ID in this report, information about that order was sent more than once.
If you find a similar problem, read these detailed instructions on how to fix it.
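As a sketch of one common approach (not a substitute for those instructions), you can remember which transaction IDs have already been sent, for example in localStorage, and skip repeat pushes when a thank-you page is reloaded:

// Hypothetical guard against reporting the same purchase twice (e.g. on a page refresh).
// Assumes Google Tag Manager's dataLayer is available on the page.
function pushPurchaseOnce(transaction) {
  const sentIds = JSON.parse(localStorage.getItem('sentTransactionIds') || '[]');
  if (sentIds.includes(transaction.transaction_id)) {
    return; // this order was already reported, do nothing
  }
  dataLayer.push({ ecommerce: null });
  dataLayer.push({ event: 'purchase', ecommerce: transaction });
  localStorage.setItem(
    'sentTransactionIds',
    JSON.stringify([...sentIds, transaction.transaction_id])
  );
}

// Usage on the order confirmation page:
pushPurchaseOnce({
  transaction_id: 'T-1001',
  currency: 'USD',
  value: 120.5,
  items: [{ item_id: 'SKU_12345', item_name: 'Example T-Shirt', price: 120.5, quantity: 1 }]
});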
Google Analytics has a very good Custom Alerts tool that allows you to track important changes without viewing reports. For example, if you stop collecting information about Google Analytics sessions, you can receive an email notification.
We recommend that you set up notifications for at least these four metrics:
In our experience, this is the most difficult and time-consuming task: the bottleneck where mistakes are most common.
To avoid problems with dataLayer implementation, checks must be done at least once a week. In general, the frequency should depend on how often you implement changes on the site. Ideally, you need to test the dataLayer after each significant change. It's time-consuming to do this manually, so we decided to automate the process.
To automate testing, we've built a cloud-based solution that enables us to:
Advantages of test automation:
A simplified scheme of the algorithm we use:
When you sign in to our app, you need to specify the pages you want to verify. You can do this by uploading a CSV file, specifying a link to the sitemap, or simply specifying a site URL, in which case the application will find the sitemap itself.
Then it's important to specify the dataLayer schema for each scenario to be tested: pages, events, and scripts (a sequence of actions, such as checkout). You can then use regular expressions to map page types to their URLs.
After receiving all this information, our application runs through all pages and events as scheduled, checks each script, and uploads test results to Google BigQuery. Based on this data, we set up email and Slack notifications.
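The exact implementation is specific to our service, but the core idea can be sketched roughly like this; the page-type regexes, the schema fields, and the validateDataLayer helper are simplified assumptions:

// Simplified sketch: map page types to URL patterns and required dataLayer keys,
// then check a collected dataLayer snapshot against the schema for its page type.
const schemas = [
  { pageType: 'product', urlPattern: /\/product\/.+/, requiredKeys: ['event', 'ecommerce'] },
  { pageType: 'checkout', urlPattern: /\/checkout/, requiredKeys: ['event', 'ecommerce'] },
];

function validateDataLayer(url, dataLayerSnapshot) {
  const schema = schemas.find(s => s.urlPattern.test(url));
  if (!schema) return { url, status: 'no schema defined' };

  const missing = schema.requiredKeys.filter(
    key => !dataLayerSnapshot.some(entry => key in entry)
  );
  return {
    url,
    pageType: schema.pageType,
    status: missing.length === 0 ? 'ok' : 'failed',
    missing,
  };
}

// Example: a product page whose dataLayer lacks the ecommerce object.
console.log(validateDataLayer('https://example.com/product/t-shirt', [{ event: 'view_item' }]));
// -> { url: '...', pageType: 'product', status: 'failed', missing: ['ecommerce'] }

// In a scheduled job, results like this would be written to BigQuery and used
// to trigger email or Slack notifications.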
Data quality metrics are standardized criteria used to evaluate the accuracy, completeness, consistency, reliability, and timeliness of data. These metrics help organizations quantify their data quality and identify areas for improvement.
Monitoring data quality involves regularly assessing data against predefined metrics, using tools that automate the detection of anomalies and inconsistencies, and implementing corrective actions based on these insights.
Data quality is measured by applying specific metrics such as accuracy, completeness, consistency, uniqueness, and timeliness. Organizations use these metrics to assess the condition of data and ensure it meets the required standards for their operational and analytical purposes.
Monitoring data quality is essential to ensure the information remains accurate, consistent, and useful for making informed decisions. It aids in risk mitigation, reduces costs stemming from errors, and enhances overall efficiency and effectiveness in business operations.
Data testing is the process of verifying and validating the accuracy, completeness, consistency, and validity of data used in a system or application. It involves various techniques and tools to identify and correct errors, inconsistencies, and discrepancies in the data.
Data testing is crucial to ensure that data is correct, reliable, and trustworthy. Inaccurate data can lead to wrong decisions, loss of revenue, and damage to reputation. Data testing helps to identify and fix data errors early on, saving time and resources and improving data quality.
There are several types of data testing, including functionality testing, integration testing, performance testing, security testing, and usability testing. Each type of testing evaluates different aspects of data quality and helps to ensure that data meets the required standards.