Data quality management (DQM) tools are growing significantly as data volumes increase and more automated tools depend on a high degree of data accuracy to avoid exceptions and delays in processes. As customers and other trading partners raise their expectations for automation and speed, organizations depend more and more on good-quality data to execute these processes, with a direct impact on both revenues and costs.
What are the evaluation criteria for a data quality tool, and what are the gaping holes that, despite implementation of these kinds of tools, still often cause data cleansing and quality projects to fail? From a technical perspective, a DQM application should cover the following:
(1) Extraction, parsing and data connectivity
The first step for this kind of application is either to connect to the data or to load it into the application. There are multiple ways data can be loaded into the application, or connected to and viewed. This also includes the ability to parse or split data fields.
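As a minimal sketch of the parsing step, the snippet below splits a combined location field into separate attributes. The field layout and record contents are hypothetical, chosen only to illustrate field splitting:

```python
import csv
import io

# Hypothetical raw feed: a combined "city_state_zip" field that needs
# to be parsed into separate attributes before any profiling can start.
raw = io.StringIO(
    "id,name,city_state_zip\n"
    "1,Acme Corp,\"Austin, TX 78701\"\n"
    "2,Globex,\"Portland, OR 97201\"\n"
)

def parse_location(value):
    """Split a 'City, ST ZIP' string into its three component fields."""
    city, rest = value.split(",", 1)
    state, zip_code = rest.strip().split(" ", 1)
    return city.strip(), state, zip_code

records = []
for row in csv.DictReader(raw):
    city, state, zip_code = parse_location(row["city_state_zip"])
    records.append({"id": row["id"], "name": row["name"],
                    "city": city, "state": state, "zip": zip_code})

print(records[0])
# {'id': '1', 'name': 'Acme Corp', 'city': 'Austin', 'state': 'TX', 'zip': '78701'}
```

A real DQM tool would offer the same splitting as a configurable rule rather than hand-written code, but the underlying operation is the same.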
(2) Data profiling
Once the application has, or has access to, the data, the first step of the DQM process is to perform some level of data profiling: running statistics on the data (min/max, average, number of missing attributes) and determining relationships between fields. This should also include the ability to verify the accuracy of certain columns, such as e-mail addresses and phone numbers, as well as the availability of reference libraries such as postal codes and spelling checks.
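The profiling statistics mentioned above can be sketched in a few lines. The sample rows and the (deliberately loose) e-mail pattern are assumptions for illustration, not a production validator:

```python
import re
import statistics

# Toy customer extract; None marks a missing attribute (assumed layout).
rows = [
    {"age": 34, "email": "ann@example.com"},
    {"age": 51, "email": "bob[at]example.com"},   # malformed address
    {"age": None, "email": "cat@example.org"},
]

ages = [r["age"] for r in rows if r["age"] is not None]
profile = {
    "age_min": min(ages),
    "age_max": max(ages),
    "age_avg": statistics.mean(ages),
    "age_missing": sum(1 for r in rows if r["age"] is None),
}

# Simple pattern check for the e-mail column; real tools would use
# reference libraries and stricter validation.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
profile["bad_emails"] = sum(1 for r in rows if not EMAIL_RE.match(r["email"]))

print(profile)
```

The output flags one missing age and one malformed e-mail address, which is exactly the kind of summary a profiling step feeds into the cleansing phase.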
(3) Cleansing and standardization
Data cleansing involves seeded, automated cleansing functionality such as date standardization, eliminating spaces, transform functions (for example, replacing 1 with F and 2 with M), calculating values, and identifying incorrect location names against external reference libraries. It also involves defining standard rule sets and data normalization, which help identify missing or incorrect information, as well as the ability to adjust information manually.
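The seeded rules described above (trimming spaces, the 1/F and 2/M transform, date standardization) can be sketched as follows. The accepted date formats and the record layout are illustrative assumptions:

```python
from datetime import datetime

# Hypothetical seeded cleansing rules: trim whitespace, decode 1/2 to
# F/M, and standardize dates to ISO 8601.
GENDER_CODES = {"1": "F", "2": "M"}
DATE_FORMATS = ("%m/%d/%Y", "%d-%m-%Y", "%Y-%m-%d")

def standardize_date(value):
    """Try each known format; return ISO date or None for manual review."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

def cleanse(record):
    record = {k: v.strip() for k, v in record.items()}
    record["gender"] = GENDER_CODES.get(record["gender"], record["gender"])
    record["dob"] = standardize_date(record["dob"])
    return record

print(cleanse({"name": " Jane Doe ", "gender": "1", "dob": "07/04/1985"}))
# {'name': 'Jane Doe', 'gender': 'F', 'dob': '1985-07-04'}
```

Returning None for an unrecognized date mirrors the manual-adjustment path: the record is flagged rather than silently guessed at.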
(4) Deduplication
Deduplicating records involves leveraging a variety, or combination, of fields and algorithms to identify, merge and clean up records. Duplicate records can result from poor data entry procedures, merging of applications, company mergers or many other causes. You should ensure that not only addresses but any data can be assessed for duplication. Once a suspect duplicate record is identified, the process for actually merging the records needs to be clarified; this could include automated rules to select which attributes are prioritized and/or a manual process to clean up the duplication.
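A minimal sketch of fuzzy duplicate detection across a combination of fields is shown below, using Python's standard-library `difflib.SequenceMatcher`. The 0.75 threshold and the choice of name + city as the match key are illustrative assumptions; production tools use more sophisticated blocking and scoring:

```python
from difflib import SequenceMatcher

def similar(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_duplicates(records, threshold=0.75):
    """Compare every pair on name + city; return suspect duplicate pairs."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = similar(records[i]["name"] + records[i]["city"],
                            records[j]["name"] + records[j]["city"])
            if score >= threshold:
                pairs.append((records[i]["id"], records[j]["id"], round(score, 2)))
    return pairs

records = [
    {"id": 1, "name": "Acme Corporation", "city": "Austin"},
    {"id": 2, "name": "ACME Corp.", "city": "Austin"},
    {"id": 3, "name": "Globex", "city": "Portland"},
]
print(find_duplicates(records))
```

Here records 1 and 2 are flagged as suspects; the subsequent merge (which attributes survive) would then be driven by automated priority rules or manual review, as described above.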
(5) Load and export
The ability of the application to export the data in a variety of formats, and to connect to databases or data stores to deliver either full or incremental loads.
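The full-versus-incremental distinction can be sketched as below. The cutoff date and record layout are assumptions, and the export targets in-memory buffers where a real tool would write to files or database connections:

```python
import csv
import io
import json

records = [
    {"id": 1, "name": "Acme Corp", "updated": "2024-05-01"},
    {"id": 2, "name": "Globex", "updated": "2024-06-15"},
]

# Full export to CSV (in-memory here; a real tool targets a file or DB).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "updated"])
writer.writeheader()
writer.writerows(records)

# Incremental export: only rows changed since the last run
# (the watermark date is an assumed example).
last_run = "2024-06-01"
delta = [r for r in records if r["updated"] > last_run]
print(json.dumps(delta))
```

ISO-formatted dates make the string comparison for the watermark safe; with any other date format the comparison would need real date parsing.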
New emerging capabilities in DQM applications
DQM tools are typically designed and built by engineers, but making a data quality project successful involves more than the technical aspects of analyzing and cleaning the data. A few newer DQM applications are incorporating capabilities that relate more to managing the project and its processes, whether on a one-time or continuing basis. These new capabilities can be just as important for successfully getting through a data cleansing or quality project:
(1) Automated task management of stakeholders and data owners
These types of processes or projects usually involve a large set of internal as well as external stakeholders, and managing them through spreadsheets and emails can be a daunting and complex affair. Applications that can automate parts of this process add significant value and predictability to the project's success. This ranges from simple things like monitoring adherence to defined standards and raising exceptions or tasks to specific users or data owners when they are violated, to coordinating large-scale validation directly with external parties, such as requesting updated tax exemption certificates or addresses.
(2) Data flexibility – ability to handle any data
Some DQM applications are highly specialized, handling only address verification or part/SKU cleansing. A DQM application should be able to handle any type of master data (MDM) or transactional data with flexible rule definitions.
(3) Big Data cleansing
Big Data files can come in structured, semi-structured and completely unstructured formats. Standardizing and automating the cleansing of this data can be necessary on a continuous basis, and this emerging process of cleaning up large amounts of data requires automated transformation rules that can be applied to unstructured formats.
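One common shape of such a transformation rule is pattern-based extraction and normalization over free text. The sketch below normalizes phone numbers embedded in unstructured lines; the regex and sample text are illustrative assumptions:

```python
import re

# Hypothetical rule: find US-style phone numbers in free text and
# rewrite them to a single canonical format. Pattern is illustrative.
PHONE_RE = re.compile(r"\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})")

def normalize_phones(text):
    """Rewrite every matched phone number as NNN-NNN-NNNN."""
    return PHONE_RE.sub(lambda m: "{}-{}-{}".format(*m.groups()), text)

lines = [
    "Call Ann at (512) 555 0199 before noon",
    "Bob: 512.555.0123, prefers email",
]
cleaned = [normalize_phones(line) for line in lines]
print(cleaned[0])  # Call Ann at 512-555-0199 before noon
```

At Big Data scale the same rule would be applied inside a distributed processing framework rather than a plain loop, but the rule definition itself stays this simple.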
(4) Data governance and adherence monitoring
Data governance and adherence monitoring are key to maintaining the accuracy and cleanliness of data. Many applications are unable to enforce the business rules that are desired from a structural perspective. Some DQM applications can monitor data governance processes, such as requests for new attributes or values, and provide exception monitoring to achieve a higher level of information quality.
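A minimal sketch of adherence monitoring, assuming business rules can be declared as simple predicates, might look like this. The rule names and formats are hypothetical:

```python
# Business rules declared as predicates over a record; in a real DQM
# tool each violation would become an exception/task routed to the
# responsible data owner. Rule definitions here are assumptions.
RULES = {
    "country_required": lambda r: bool(r.get("country")),
    "tax_id_format": lambda r: len(r.get("tax_id", "")) == 9,
}

def check_adherence(record):
    """Return the names of the rules this record violates."""
    return [name for name, rule in RULES.items() if not rule(record)]

record = {"id": 7, "country": "", "tax_id": "12-34567"}
print(check_adherence(record))  # ['country_required', 'tax_id_format']
```

Running such checks continuously, rather than once per project, is what turns a cleansing effort into ongoing governance.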
(5) Project status reporting
A typical data quality management or conversion project goes through a series of steps and phases involving a large set of stakeholders. Allocating responsibility appropriately, tracking progress on cleansing, and managing the inter-dependency of tasks is a complex process, and some applications are starting to take on these types of collaborative functionalities as well.