Research Paper: What Causes Data Quality Issues To Arise?

I was looking to establish why data quality problems arise. My experience was that data quality issues arise due to factors such as the following:

  • Human Errors – A major source of incorrect data in systems. A large portion of these errors can be attributed to a system’s inability to validate data, but some errors are logical in nature. For instance, a telephone operator entering data may mishear a respondent’s town and enter a town that is valid but wrong from a business perspective.
  • Deliberate Manipulations – A user may be forced to enter data that is prima facie incorrect because the legacy system would otherwise reject the entry.
  • Conformance to a target system model definition – This can be a factor when a legacy system data conversion and migration is taking place. The target system model may dictate that the data be in a certain format, and the conversion and migration may make it compulsory for changes to be made to the data.

However, I wanted to triangulate my experience with the views of experts in industry sectors as disparate as investment management, call centre management and market research. I asked sixteen experts what they thought the main causes of data quality problems were, conducting a semi-structured interview with each participant.

English (1999) suggests that the main data quality problems arise from poor data architecture, inconsistently defined departmental data, inability to relate data from different data sources, missing and inaccurate data values, inconsistent use of data fields, unacceptable query performance (timeliness of information) and the lack of a business sponsor. It was specific examples such as these that I was looking to elicit from the answers, while acknowledging that over time, even in the best-maintained systems, deficiencies in stored data will develop (Ballou & Tayi, 1989).

43.75% of respondents attributed major data quality issues to organisations having distributed systems (15.56% of responses). Disparate systems often exist within an enterprise, using different business rules and storing data with different definitions and possibly conflicting values. Customer data exists in silos – there may be a marketing database, a transactions or operational database and a customer service database, all holding different information on or ‘views’ of the same customers. As Respondent 6 recognised, “database design recommendations say you should not store and update the same data in multiple places because it prevents consistency, but this is not always enforced, especially when systems have been unified in a chaotic way, or phased in.” Without a proper integration model to eliminate inconsistent definitions, formats and values, the organisation often finds it is unable to aggregate and harmonise new data as it is introduced from various sources.
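The paired percentages quoted for each theme (share of respondents, then share of responses) can be reconciled with simple arithmetic. The sketch below assumes 45 coded responses in total and one coded response per respondent per theme; neither figure is stated in the text, but both are consistent with the quoted ratios. The function name `shares` is purely illustrative.

```python
RESPONDENTS = 16        # stated in the text
CODED_RESPONSES = 45    # assumption: inferred from the quoted percentages


def shares(respondent_count: int) -> tuple[float, float]:
    """Return (% of respondents, % of coded responses) for one theme."""
    pct_respondents = round(100 * respondent_count / RESPONDENTS, 2)
    pct_responses = round(100 * respondent_count / CODED_RESPONSES, 2)
    return pct_respondents, pct_responses


# Respondent counts behind the themes reported in this section.
for count in (7, 6, 4, 2, 1):
    print(count, shares(count))
```

Under these assumptions, seven respondents correspond to the 43.75% and 15.56% figures for distributed systems, and a single respondent to 6.25% and 2.22%.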

37.5% of respondents cited a lack of naming conventions and standards as causing a diminution in data quality (13.33% of responses). The symptoms of this are also discussed in the research paper Data Quality Issues In Marketing Databases.

A quarter of respondents suggested that organisations fail to harness the full potential of technological advances to ensure erroneous or duplicated address data isn’t stored in the marketing database (8.89% of responses). Specifically, 12.5% of respondents suggested a lack of address verification at point of entry was a significant contributor to poor data quality (4.44% of responses). Address verification modules are an increasingly common feature of systems. Their presence allows instant address verification at point of entry and immediate identification of a potential duplicate entry, and significantly reduces the number of keystrokes required. Where systems are used in countries where street-level postcoding exists, such as the United Kingdom, these modules can facilitate the collection of a full address simply by entering a house number and postcode. Even in a more reactive manner, software advances can aid in improving data quality through address hygiene software.
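As a rough illustration of how a house number and postcode can be expanded into a full address at point of entry, consider the minimal sketch below. The lookup table, the postcode, the street and the town are all made-up placeholders, and the function names are hypothetical; a real module would query a licensed address dataset rather than an in-memory dictionary.

```python
# Hypothetical postcode file with placeholder data (not a real address).
POSTCODE_FILE = {
    "ZZ11ZZ": {"street": "Example Street", "town": "Exampleton"},
}


def normalise_postcode(raw: str) -> str:
    """Strip all whitespace and upper-case so 'zz1 1zz' matches 'ZZ11ZZ'."""
    return "".join(raw.split()).upper()


def complete_address(house_number: str, postcode: str):
    """Return a full address dict, or None if the postcode is not recognised."""
    key = normalise_postcode(postcode)
    entry = POSTCODE_FILE.get(key)
    if entry is None:
        return None  # caller should prompt the operator to re-check the postcode
    return {"house_number": house_number.strip(), **entry, "postcode": key}


print(complete_address("12", "zz1 1zz"))
print(complete_address("12", "AB1 2CD"))  # unknown postcode
```

Returning `None` for an unrecognised postcode lets the entry screen challenge the operator immediately, rather than storing an unverified address.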

Poor stewardship of data is cited as a major contributory factor to poor data quality by 25% of respondents (8.89% of coded responses). Respondent 3 suggests that “databases full of junk data happen most often when there is no business owner of the data.” Also, failure to implement processes that properly capture the information needed by the enterprise can be seen as another prominent example of poor data stewardship that tangibly affects the enterprise’s marketing data quality, and this aspect was underlined by two of those respondents.

Input rules that are too restrictive and bypassed completely are a significant contributor to poor quality data according to a quarter of respondents (8.89% of responses for this question). This is a major cause of too little data being captured – “Imposing superfluous controls on data input and editing might cause important data to be lost entirely,” as Respondent 1 notes. It is also closely related to inaccuracy in data stored, or a lack of granularity in the data stored, as Respondent 8 noted – “data entry forms often have a single structure, causing users to force complex inputs into a one size fits all form.”

12.5% of respondents identified inappropriate access levels as having a detrimental effect on the quality of data in systems (4.44% of responses). This can take the form of easy access to data, which may conflict with good security practice or privacy requirements. Conversely, as Respondent 1 suggests, “if the balance is too great in favour of barriers to access… it will obviously have an impact on the quality of data held in the system.”

12.5% of respondents suggested an increasingly overwhelming amount of data (‘information overload’) is adversely affecting quality of data held in information systems, not least marketing databases and customer relationship management systems (4.44% of responses). As Respondent 6 noted, “In larger organisations there is often simply too much data, duplicated across the organisation.” This can make it “difficult to access required data in a reasonable time, negatively impacting on information systems quality. Customers on a telephone call, for instance, expect an operator to retrieve their account details in a reasonable amount of time, and may be put off using the services if it takes an eternity to retrieve a billing history,” as Respondent 1 points out.

Overburdened or time-strapped staff was cited as a key contributor to data quality issues by 12.5% of respondents (4.44% of responses). Respondent 9 commented that “organisational focus on speed and productivity alone is often a fairly significant cause of data quality problems. For example, something like a record store or bookshop that takes orders by phone might make it a priority to reduce time for each phone call to get greater volume of orders with fewer staff, but this will guarantee a lot of duplicate customer records and unverified addresses being captured. If pay incentives or bonuses are based on number of orders taken… this causes problems like this to happen.”

An inability to identify duplicates, or instances of customer records already stored on the system, at the point of entry is a key contributor to poor data quality, specifically the issue of duplicate data. This was raised as a key issue by 12.5% of respondents (4.44% of responses).
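One simple way to catch such duplicates at the point of entry is to compare a normalised key for each new record against the keys already held. The sketch below is an assumption-laden illustration rather than a production matching algorithm: the class `CustomerStore`, the choice of name plus postcode as the key, and the normalisation rule are all hypothetical, and real systems typically add fuzzy matching on top.

```python
import re


def match_key(name: str, postcode: str) -> tuple[str, str]:
    """Crude comparison key: keep only lower-cased letters and digits."""
    def clean(value: str) -> str:
        return re.sub(r"[^a-z0-9]", "", value.lower())
    return clean(name), clean(postcode)


class CustomerStore:
    """In-memory stand-in for a customer table (hypothetical)."""

    def __init__(self) -> None:
        self._ids_by_key: dict[tuple[str, str], int] = {}
        self._next_id = 1

    def add(self, name: str, postcode: str) -> tuple[int, bool]:
        """Return (customer id, created). An existing match is reused, not duplicated."""
        key = match_key(name, postcode)
        existing = self._ids_by_key.get(key)
        if existing is not None:
            return existing, False
        customer_id = self._next_id
        self._ids_by_key[key] = customer_id
        self._next_id += 1
        return customer_id, True


store = CustomerStore()
print(store.add("Jane O'Neil", "SW1A 1AA"))
print(store.add("jane oneil", "sw1a1aa"))  # matches the first record
```

Exact matching on a normalised key only catches differences in case, spacing and punctuation; misspellings would still slip through, which is where address hygiene software comes in.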

12.5% of respondents underlined their concerns regarding turnover of information (churn) and the impact this can have on an enterprise’s data and decision-making if it is unable to keep up (4.44% of responses). Respondent 16 cited figures from their previous organisation, Dun & Bradstreet, which suggest that in a single year 20.7% of the business postal addresses in an enterprise’s data store will have changed (for new businesses the rate of change is higher, at 27.3%). Telephone numbers change at the rate of 18% a year. Company names are also unstable, changing at the rate of 12.4% a year, making it difficult to sustain effective customer relationships if data isn’t constantly updated. Data quality inherently degrades at a rapid rate, and steps must be taken to counteract this.
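To get a feel for how quickly these annual rates compound, the sketch below estimates the share of records still current after a number of years. It assumes each rate is constant from year to year and applies independently to the records that remain unchanged, which is a simplification not claimed by the source; the function name is hypothetical.

```python
# Annual change rates quoted by Respondent 16.
ANNUAL_CHANGE = {
    "business postal address": 0.207,
    "new-business postal address": 0.273,
    "telephone number": 0.18,
    "company name": 0.124,
}


def fraction_still_current(annual_rate: float, years: int) -> float:
    """Share of records unchanged after `years`, assuming a constant annual rate."""
    return (1 - annual_rate) ** years


for field, rate in ANNUAL_CHANGE.items():
    remaining = fraction_still_current(rate, 3)
    print(f"{field}: {remaining:.1%} unchanged after 3 years")
```

Under these assumptions, roughly half of business postal addresses would be out of date after three years if nothing were done to refresh them.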

12.5% of respondents were concerned that many organisations treat information as a by-product rather than a product and, as a result, often address data quality issues in a reactive manner (4.44% of responses). They assert that information is not a by-product but a direct product of processes that capture knowledge about the actors, locations and events discovered while conducting business transactions. A corollary of the by-product view is that there is often no plan for information lifecycle management. Information captured by an enterprise goes through various stages of significance. Initially, it may have to be readily available for several days, and then over time it will be needed less frequently. Later, it may be needed for litigation or regulatory reasons. As information has a lifecycle, it must be managed appropriately, using a combination of products and services chosen on criteria such as reducing storage costs, ensuring easy access when needed and speeding up business-critical applications. Failure to manage it appropriately can have a degradative impact on data quality.

A related issue is that of limited resources or infrastructure. If the will or budget to computerise areas of the business does not exist, not all communications or transactions may be captured. That this was raised by only one respondent (forming 6.25% of respondents, 2.22% of responses) is perhaps indicative of the strides many enterprises have made in ensuring a great deal of business information is properly captured, facilitated in part by decreasing storage costs and effective off-the-shelf database management software.

One respondent suggested selective judgment in data production, resulting in biased or skewed information, can have an adverse effect on data quality, particularly where that is used for decisioning. Although this is more prevalent in softer systems, such as medical systems, it can still arise in marketing databases, particularly where descriptive text fields are used to store notes on suppliers or customers.

The cost of deciphering and inputting coded or jargonised information can have an adverse effect on the quality and retrievability of an enterprise’s data. As with the previous issue, this will be most prevalent in systems in specialised areas such as the medical profession. This was contributed by a single respondent.

Another data quality issue is caused by difficulties in representing non-numeric and non-textual information, for instance images. Such information carries a large storage overhead, and although it may be retrievable once stored, systems have so far struggled with indexing, aggregating and manipulating such data and with identifying trends in it. This issue was raised by one respondent.

One respondent cited changing data needs in an organisation as a primary source of data quality problems. Data consumers’ needs and the organisation environment are often in a state of flux, meaning systems will need to be redesigned often to maintain the quality and reliability of the data. If the organisation’s processes and systems fail to keep pace with the changing environment and needs of data consumers, the corollary can be lost data or analysis and decisioning based on incomplete or inaccurate information.

Lack of training for staff employed to enter data can have a profound effect on data quality. The accurate and efficient capture of data requires attention to detail and typing speed. The spelling, punctuation and numeracy skills of staff are also important if the enterprise is to handle information accurately and efficiently. However, students increasingly acquire word processing, spreadsheet and database management skills using computer software in secondary schools, colleges and universities, meaning organisations can rely on a pool of workers who have already acquired many of the skills necessary to capture data accurately and efficiently. That only one respondent underlined this as a salient issue may reflect that it is no longer such a concern for the reasons explained above.

Lack of effective cross checks and validation beyond the point of entry to assure the validity and veracity of collected data was cited as a source of data quality issues by one respondent. Entry-point validation will often be implemented, but this does not eliminate the need for further validation and cross checks against existing data sources, as the user interface is often bypassed by operators performing bulk inserts of records or mass updates to the database for valid business reasons. Errors that occur in these instances also need to be identified and either prevented or corrected in order to maintain data quality and integrity. Further validation logic can be introduced (through triggers, or stored procedures that search for business rule violations or anomalies in expected trends), or, less reactively, it can be made a requirement that all transactional updates are run through a validation component before they are applied.
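The trigger-based approach can be sketched with SQLite, which ships with Python. This is a minimal illustration under stated assumptions: the `customer` table, the rule that a postcode must not be blank, and the helper `bulk_insert` are all hypothetical, and SQLite offers triggers but not stored procedures, so only the trigger half of the idea is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    postcode TEXT NOT NULL
);
-- Hypothetical business rule enforced in the database itself,
-- so it still applies when the user interface is bypassed.
CREATE TRIGGER customer_postcode_check
BEFORE INSERT ON customer
WHEN trim(NEW.postcode) = ''
BEGIN
    SELECT RAISE(ABORT, 'postcode must not be blank');
END;
""")


def bulk_insert(rows):
    """Insert rows directly (no UI validation); return the rows the trigger rejected."""
    rejected = []
    for row in rows:
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO customer (name, postcode) VALUES (?, ?)", row
                )
        except sqlite3.DatabaseError as exc:
            rejected.append((row, str(exc)))
    return rejected


rejected = bulk_insert([("Acme Ltd", "SW1A 1AA"), ("No Postcode Ltd", "   ")])
print(rejected)
```

Collecting rejected rows rather than aborting the whole batch lets the valid records load while the failures are routed back to an operator for correction.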

References

Ballou, D. P., & Tayi, G. K. (1989). Methodology For Allocating Resources For Data Quality Enhancement. Communications Of The ACM, 32(3).
English, L. (1999). Improving Data Warehouse And Business Information Quality: Methods For Reducing Costs And Increasing Profits. Indianapolis: John Wiley & Sons.
