I was looking to establish the main data quality problems that affect marketing databases. Common data quality problems include inconsistent data conventions amongst sources such as different abbreviations or synonyms; data entry errors such as spelling mistakes; missing, incomplete, outdated or otherwise incorrect attribute values. These data defects generally manifest themselves as foreign-key mismatches and approximately duplicate records, both of which make further data mining and decision support analyses either impossible or suspect (Chaudhuri et al., 2005). These areas have been the traditional focus of those analysing data quality issues.
However, more sophisticated attributes of data quality have been developed Strong and Wang (1996) to categorise these issues from the perspective of the data user, and I hoped that data quality issues might be discussed with reference to this framework. To expand on this framework briefly, the attributes of data commonly cited in the research literature are as follows:
Accuracy – The extent to which the data is free from significant error. Does the data accurately represent reality or a verifiable source?
Completeness – The extent to which enough of the required data elements are collected from a sufficient portion of the target population or sample. Is all necessary data present?
Consistency – The extent to which data is collected using the same procedures and definitions across collectors and times. Do any data values give conflicting information?
Conformity – Is any data is stored in a non-standard format?
Duplication – Are records repeated?
Ease of use – Is data readily accessible by the users who need it, aided by clear data definitions, user-friendly software and easily used access procedures?
Integrity – Is data missing important relationship linkages?
Timeliness – Is data available when needed? Does the age of the data meet user requirements?
Validity – Are the data items stored in the systems valid entries? Are there any aberrant values?
Respondents from different backgrounds will inevitably have different definitions and requirements for data quality, and this is also what I had hoped to capture.
I asked sixteen experts, in roles that included database administration, statistics and systems architecture, what they thought the primary areas of concern were. Semi-structured interviews were conducted with each participant.
The data quality issue in marketing databases identified by most respondents (50% of sources, 17.02% of total responses) was that an inappropriate amount of data (specifically, too little) had been captured. This issue was highlighted from respondents from backgrounds as diverse as database administrators to statisticians. Respondent 11 outlined ways in which missing data may be dealt with, either through case-wise or variable-wise deletion, or by populating missing fields with some default value in order to conduct analysis on the whole set. However, it was underlined that these techniques are not ideal and it is better to ensure the data is captured in order that robust analysis and modeling can be conducted on the database. Ultimately, as noted by Respondent 8, “Incomplete data becomes incomplete information, and business decisions are formulated based on this incomplete information.”
43.75% of the respondents highlighted inaccuracy of data as being one of the main data quality issues afflicting marketing databases (14.89% of total responses). One example of this would be an invalid address or an instance where the postcode does not corroborate other address details. This has significant implications not least in delivery of marketing literature, as suggested by Respondent 9 – “Poor address capture resulting in lost dispatches and increased sending costs. Incorrect titles, misspelling of names and poor addressing alienates customers.” These themes are expanded upon further below. Another example of inaccuracy might be individuals being marked as active customers when they have been lapsed for some time, which would be a very significant issue for a publisher, as magazines may continue to be dispatched to individuals who are no longer paying their subscription fees. Another significant example, underlined by Respondent 13, was the problem of recording transactional information accurately contributing to data quality issues. This is a foremost concern of this respondent as he has significant experience working with retail transactional systems, and he cited the issue retailers have with counting returns, exchanges, or voided transactions correctly. This leads to them over counting or inaccurately estimating sales.
Perhaps unsurprisingly, duplication of data was identified as a major data quality issue for marketing databases (by 43.75% of respondents, 14.89% of responses). The effects of duplicated data are far-reaching and are explored further in Question 2. Despite the progress that many data centres have now made in creating single customer views and generating effective merge algorithms to eliminate duplicate entries within their databases, it appears that duplication within databases remains a foremost consideration.
37.5% of respondents (12.77% of coded responses) identified inconsistency between sources as a prominent contributor to poor data quality in marketing databases. When multiple data sources produce different values or values in different formats, this can impact an organisation’s ability to leverage key decision-making information from the database. This problem is exacerbated by the divergence in standards across national and cultural boundaries, as noted by Respondent 6, who gives the example of the American date format in one system “[which], if not handled, correctly, may be transposed into another system as 3rd September rather than 9th March.”
An associated issue is that of a lack of standardisation. This can take the form of inconsistent spelling of names or addresses (identified as significant by a quarter of respondents, in 9.3% of total responses) – for example, using Beijing or Peiching depending upon the method of transliteration favoured by the inputter. As Respondent 9 underlines, “Often there are inconsistencies in referencing organisation names – this can be a big problem that is difficult for merge keys to eradicate in deduplication processing, for example, National Security Agency, Central Security Services, NSA – all names that could be entered on a database for the same entity without appropriate guidance for input.” Similarly, inconsistency can affect product buying history information, as more than one name can be used for a single product, if an enterprise-wide product naming convention has not been established.
A quarter of respondents noted time-degraded data as a major data quality issue (9.3% of responses). Respondent 16, who has significant experience of dealing with this issue for a number of risk management companies, provides examples collated while with Dun & Bradstreet – “about 20% of all businesses will move in a year, about 15% or 16% will change their name, about 18% will get a new phone number. Those tend to be smaller businesses, so if you do business on a business-to-business basis with small businesses, that rate of change is a huge problem to overcome. And the typical way that companies try to overcome it is by using a timestamp in the database, but that presumes that the person putting the data in, and thus, attaching the timestamp, has the current name and address. In my experience, that’s often not true.” Inevitably, this causes issues in terms of being sure of the reliability of your data as well as the enterprise’s ability to integrate data and generate a single view of the customer.
A significant issue identified by 18.75% of respondents (6.38% of responses) was that incorrect data can often be entered due to inappropriate input controls. Incorrect data can be entered intentionally in order to evade validation rules.
Spelling or typographical errors were cited as a main data quality issue by 18.75% respondents, in 6.38% of responses. A typographical error can change an address completely, for instance, if Hull Road is entered erroneously as Hill Road (both of these are fairly common street-level addresses in the United Kingdom, Hull Road having 39 observations and Hill Road having 121 observations nationally, some of which are relatively close to one another). Respondent 11 cited examples of “the 800-year old man or the 80-year old lady that just had a baby. They’re just data entry errors [and] there are lots of ways to deal with that, you can have a computer go back, you can write programs that do quality checks, but another way to deal with it which is again quite expensive is if you’re dealing with hand-entered data, you can actually have two people enter the data and then cross-check. The survey places (for example, Gallup) actually do that… taking telephone interviews. Anybody that sets up a database is going to put in allowable and non-allowable values, so there’s simple, inexpensive checks that can be done computationally just like that, but people do this thing where they have two people sitting there typing. I don’t know how much that’s done but I have actually seen it done.” Admittedly, this is likely to be very expensive.
Inappropriate metadata was cited as a salient issue in 4.26% of total responses, by 12.5% of respondents. This issue is closely related to, and overlaps to some degree, with the lack of standardisation issue outlined earlier. As Respondent 15 suggests, “data comes from multiple places, it’s stored in multiple databases, all of which use different terminology, different approaches, different processes for not only tagging what it is, but for measuring its quality.” This is a particularly significant issue in Respondent 15’s industry sector, investment banking, and the respondent cited the varying definitions for closing price in global financial markets as a major source of misunderstanding. ‘Closing price’ generally refers to the last price at which a stock trades during a regular trading session. For many markets (including the New York Stock Exchange and the Nasdaq), regular trading sessions run from 9:30 a.m. to 4:00 p.m. Eastern Time. But a number of markets offer after-hours trading, and some market data vendors use the last trade in these after-hours markets as the closing price for the day. Others publish the 4:00 p.m. price as the closing price and display prices for after-hours trading separately.
12.5% of respondents identified type mismatches as a data quality issue (4.26% of total responses). Type mismatches occur when data fields that are being used as variables or being used as keys to join data sets are of different types, for example, a numeric field that is stored as an integer in one system might be stored as a text string in another. Issues might arise because a variable in a program that requires an integer value can’t accept a text string value unless the whole string can be recognised as an integer. Similarly, a field that stores numbers as an integer data type cannot be properly bound to a text field, so relational joins cannot be performed using the field.
The respondent from a statistical background underlined another significant data quality issue, which was overlooked by respondents from other backgrounds, namely the propensity of individuals to willfully provide false data. This issue is prominent where data is being collected about drugs, prostitution, gambling, alcohol intake or similarly taboo subjects.
References
Chaudhuri, S., Ganjam, K., Ganti, V., Kapoor, R., Narasayya, V., & Vassilakis, T. (2005). Data Cleaning In Microsoft SQL Server 2005. SIGMOD.
Wand, Y., & Wang, R. Y. (1996). Anchoring Data Quality Dimensions In Ontological Foundations. Communications Of The ACM , 39 (11).