A Framework for Assessing Database Quality

John A. Hoxmeier
College of Business
Computer Information Systems
Colorado State University
Fort Collins, CO 80523
jhox@lamar.colostate.edu

The ultimate objective of database analysis, design, and implementation is to establish an electronic data store that is a physical model of the relevant aspects of a user’s conceptual ‘world’. Many factors must be considered during this process including, but not limited to, historical and future data perspectives, the diversity of the user community, organizational requirements, security, cost, ownership, performance, temporality, user interface issues, and data integrity. These factors contribute to the success of a database application in either a quantitative or qualitative fashion, and in turn contribute to the overall quality of the database.

To ensure a quality database product, should the emphasis during model development be on the application of quality assurance metrics (designing it right)? It is hard to argue against this point, but a significant amount of research and anecdotal evidence suggests that a quality process does not necessarily lead to a usable database product [Hoxmeier, 1995; Redman, 1995]. A database should be evaluated in production based on certain quantitative and information-preserving transformation measures, such as data quality, data integrity, normalization, and performance. However, there are also many examples of database applications that are in most ways ‘well-formed’, with high data quality, but lack semantic or cognitive fidelity (the right design). Additionally, determining and implementing the proper set of database behaviors can be an elusive task. Depending on the application, certain aspects of the quality assessment may deserve heavier weights. Whether the database meets the expectations of its end-users is only one aspect of overall database quality. This paper expands on a hierarchical framework previously presented [Hoxmeier and Monarchi, 1996] and incorporates database quality dimensions discussed in the growing body of literature in the area [Ballou and Pazer, 1995; Storey and Wang, 1994; Wand and Wang, 1996; Wang, et al., 1995].

For years, researchers and practitioners alike have tried to establish a set of factors, attributes, rules, or guidelines for evaluating system quality. Referring to information systems, James Martin states that the collection of data has little value unless the data are used to understand the world and to prescribe action to improve it [Martin, 1976]. Martin proposed twelve qualities that computer-provided information should possess.

Zmud concluded that a set of four dimensions, divided into 25 factors, represented the dimensions of information quality [Zmud, 1978]. The dimensions included data quality, relevancy, format quality, and meaning quality. Cap Gemini Pandata, a Dutch company, uses a framework that decomposes the entire notion of information quality into four dimensions, 21 aspects, and 40 attributes [Delen and Rijsenbrij, 1992]; Cap Gemini has adopted this framework in its company procedures covering software package auditing. AT&T is researching data quality and has identified four primary factors: accuracy, currentness, completeness, and consistency [Fox, et al., 1994]. Another group, the Southern California Online Users Group (SCOUG), defined the characteristics of a quality library online database [Tenopir, 1990]; the purpose of this set of characteristics was to give professional searchers a list of components with which to rate each library online database system.

Wang et al. performed a comprehensive survey study that identified four high-level categories of data quality after evaluating 118 variables [Wang, et al., 1996]. The Wang factors include intrinsic data quality, contextual data quality, representation data quality, and accessibility data quality.

There appear to be many similarities among the factors identified in these studies, depending on the perspective of the evaluators. Both developers and data consumers are concerned with quality metrics such as accuracy, timeliness, and consistency; these factors have been well documented in data quality research [Wang, et al., 1993; Wang, et al., 1995]. Most of these studies, while focused on data or information quality, indicate that a diverse set of factors influences data quality. Any individual variable, however, such as accuracy, is difficult to quantify. Nonetheless, researchers have developed a fairly consistent view of data quality. There is little available in publications or textbooks, however, on the evaluation of overall database quality, including such considerations as process, semantic, behavioral, and value factors. These additional characteristics are critical when delivering a database product or determining the overall quality of an existing application.

Much attention has been given over the years to process quality improvement. ISO-9000-3 and Total Quality Management (TQM) are approaches concerned primarily with the process, not necessarily the outcome [Costin, 1994; Schmauch, 1994]. Quality control is a process of ensuring that the database conforms to predefined standards and guidelines using statistical quality measures [Dyer, 1992]. It compares the identified attributes (problem domain) with the results of development (solution domain) and assesses the variation between the two. When deviations from the problem domain are found, they are resolved and the process is modified as needed; this is a reactive form of quality management. Quality assurance attempts to maintain quality standards in a proactive way: in addition to using quality control measures, it surveys customers to determine their level of satisfaction with the product, so potential problems can conceivably be detected early in the process.
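
As a minimal, hypothetical sketch of such a conformance check (the table, column, domain, and sample data below are invented for illustration and assume Python's standard sqlite3 module), the variation between the two domains can be estimated by profiling stored values against the predefined standard:

    # Sketch: compare stored values (solution domain) against a
    # predefined domain (problem domain) and report the deviation rate.
    # All names and data here are hypothetical.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany("INSERT INTO orders (status) VALUES (?)",
                     [("open",), ("closed",), ("pnding",), ("open",)])

    ALLOWED_STATUS = {"open", "closed", "pending"}  # the predefined standard

    rows = conn.execute("SELECT status FROM orders").fetchall()
    deviations = sum(1 for (status,) in rows if status not in ALLOWED_STATUS)
    rate = 100.0 * deviations / len(rows)
    print(f"{deviations}/{len(rows)} rows outside the domain ({rate:.1f}%)")

A deviation rate above an agreed threshold would, in the reactive mode described above, prompt both correction of the data and modification of the process that produced it.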

The philosophy of ISO-9000-3 is to build quality into a software system on a continuous basis, from conception through implementation. As a process quality standard, however, ISO-9000-3 does not offer any particular metrics to be utilized during the process; and as a general software standard, it does not deal specifically with database issues.

Research assessing database quality has focused on the two primary dimensions discussed above: data and process. However, the development of new modeling techniques, sophisticated user interfaces, and complex data types has added new dimensions to the database quality model. It is proposed that, through the framework presented below, one can evaluate overall database quality by assessing four primary dimensions: process, data, semantics, and behavior. The framework draws heavily from the previous studies on data and information quality and adds the further considerations of database semantics and behavior.

Figure 1. Database Quality Dimensions.

Database Process Quality

The database design process is largely driven by the requirements and needs of the end-user, who defines the properties of the problem domain and the requirements of the task. The first step is information discovery, one of the most difficult, important, and labor-intensive stages of database development [Chignell and Parsaye, 1993]. It is in this stage that the semantic requirements are identified, prioritized, and visualized. Requirements can rarely be defined in a serial fashion. Generally, there is significant uncertainty over what these requirements are, and they become clear only after considerable analysis, discussions with users, and experimentation with prototypes. This means previous work may be revisited.

Concentric design is an approach well suited to database design. This cyclical process emulates the philosophy of continuous quality improvement used in Total Quality Management [Braithwaite, 1994; Dvir and Evans, 1994]. The costs of building quality into the application from design through implementation are much lower than the costs of correcting problems that occur later due to poor design.

A specific addition to the framework is the factor of performance. All too often, specific performance requirements are either ignored during the design process or evaluated only after implementation. While performance, per se, is more of an implementation issue, it should be considered an aspect of overall database quality, even in the conceptual phase. Both relational and object databases can contain rather serious problems in terms of data redundancy, relationships, integrity, and structure. The objective is to design a normalized, high-fidelity database while minimizing complexity. When evaluating performance, there are times when de-normalization may represent an optimal solution; however, any time a general-purpose database is optimized for a given situation, other requirements inevitably arise that negate the advantage. The measures used to assess the trade-offs may include query and update performance, storage, and the avoidance of data anomalies. As with the contrast between data and semantic quality, a database that is otherwise well designed but does not perform well is useless.
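
The trade-off can be made concrete with a small experiment. The following sketch, a hypothetical illustration assuming Python's standard sqlite3 module (the schema, row counts, and queries are invented, not taken from any of the cited studies), times the same aggregate query against a normalized two-table design and a denormalized copy of the data:

    # Sketch: query performance on a normalized (joined) design versus a
    # denormalized variant of the same data. Schema and volumes invented.
    import sqlite3, time

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY,
                             customer_id INTEGER REFERENCES customer(id),
                             amount REAL);
        -- Denormalized variant: region is copied into every order row.
        CREATE TABLE orders_flat (id INTEGER PRIMARY KEY, region TEXT,
                                  amount REAL);
    """)
    for i in range(10000):
        conn.execute("INSERT INTO customer VALUES (?, ?)", (i, f"region{i % 5}"))
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (i, i, float(i)))
        conn.execute("INSERT INTO orders_flat VALUES (?, ?, ?)",
                     (i, f"region{i % 5}", float(i)))

    def timed(sql):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        return time.perf_counter() - start

    print("join:", timed("SELECT region, SUM(amount) FROM orders "
                         "JOIN customer ON customer.id = customer_id "
                         "GROUP BY region"))
    print("flat:", timed("SELECT region, SUM(amount) FROM orders_flat "
                         "GROUP BY region"))

The flat table typically reads faster, but every change to a customer's region must now be propagated to many order rows, which is exactly the kind of update anomaly that normalization avoids.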

Database Data Quality

Data integrity is one of the keys to developing a quality database. Without accurate data, users will lose confidence in the database or make uninformed decisions [Redman, 1995]. While data integrity can become a problem over time, there are relatively straightforward ways to enforce constraints and domains and to ascertain when problems exist [Moriarty, 1996]. The identification, interpretation, and application of business rules, however, present a more difficult challenge for the developer. Rules and policies must be communicated and translated, and much of their meaning and intent can be lost in the process.
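
As a minimal sketch of such enforcement, assuming Python's standard sqlite3 module and an invented two-table schema (SQLite stands in here for whatever engine is actually in use):

    # Sketch: declarative domain and referential constraints. The engine
    # rejects bad data before it can enter the database.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
    conn.executescript("""
        CREATE TABLE department (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE employee (
            id      INTEGER PRIMARY KEY,
            dept_id INTEGER NOT NULL REFERENCES department(id),
            salary  REAL NOT NULL CHECK (salary > 0)  -- domain constraint
        );
    """)
    conn.execute("INSERT INTO department VALUES (1, 'Sales')")
    try:
        conn.execute("INSERT INTO employee VALUES (1, 1, -10.0)")  # violates CHECK
    except sqlite3.IntegrityError as exc:
        print("rejected:", exc)

Business rules that cannot be expressed as declarative constraints of this kind are precisely where, as noted above, meaning and intent are most easily lost in translation.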

A frequently overlooked metric in the evaluation of data integrity is the age of the data, the database, and the model. Data should be only as old as the real world allows and maintained as long as the situation requires; this can be a few seconds or several years. At some point, the data need to be refreshed in order to maintain their currency. Over time, the model, too, may degrade in its ability to depict the real world, and it must be updated so that as the real world changes, the database model changes with it.
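
A minimal sketch of this age check, assuming Python's standard sqlite3 module; the table, columns, sample rows, and 30-day threshold are all invented for illustration, and timestamps are assumed to be stored as ISO-8601 strings with a timezone offset:

    # Sketch: flag rows whose last refresh is older than the currency
    # the situation requires.
    import sqlite3
    from datetime import datetime, timedelta, timezone

    MAX_AGE = timedelta(days=30)  # how old the real world allows this data to be

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stock_levels (item_id INTEGER, last_updated TEXT)")
    now = datetime.now(timezone.utc)
    conn.executemany("INSERT INTO stock_levels VALUES (?, ?)",
                     [(1, (now - timedelta(days=2)).isoformat()),
                      (2, (now - timedelta(days=90)).isoformat())])

    stale = [item for item, updated in conn.execute(
                 "SELECT item_id, last_updated FROM stock_levels")
             if now - datetime.fromisoformat(updated) > MAX_AGE]
    print(f"{len(stale)} row(s) need refreshing: {stale}")

In practice the threshold would vary by table or even by attribute, reflecting how quickly each part of the real world changes.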

Additionally, the assessment of data quality must include value considerations. Time and financial constraints are real concerns. As IT departments are expected to do more with less and as cycle times continue to shorten for database applications, developers must make decisions about the extent to which they are going to implement and evaluate quality considerations. Shorter cycle times present a good argument for modularity and reusability, so quality factors must be addressed on a micro basis.

Database Semantic Quality

As presented above, data quality is usually associated with the quality of the data values. However, even data that meet all other quality criteria are of little use if they are based on a deficient data model. The dimension of semantic quality has therefore been added to the model above. Semantic quality is an important objective of database systems: the goal of a database with high semantic quality is information that represents a close match between the problem and solution domains. Content, scope, consistency, conformity to generally accepted design principles, and flexibility are all characteristics of model quality [Hoxmeier, 1996; Levitin and Redman, 1995].

Qualitative techniques address the ambiguous and subjective dimensions of conceptual database design. The interaction between people and information is one in which human preferences and constraints have a huge impact on the effectiveness of database design. Techniques such as affinity and Pareto diagrams, semantic object models, group decision support systems, nominal group technique, and interrelationship digraphs help to improve the definition of the problem and solution domains. Well-studied quantitative techniques, such as entity-relationship diagrams, object models, data flow diagrams, and performance benchmarks, on the other hand, allow the results of the qualitative techniques to be described in a visual format and measured in a meaningful way. Other object attributes that explicitly express quality can be included in the model as well. Storey and Wang present an extension to the traditional ER approach for incorporating quality requirements (database quality data and product quality data) into conceptual database design [Storey and Wang, 1994]. The underlying premise of the approach is that quality requirements should be distinct from other database properties.
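
That premise, recording quality requirements alongside but distinct from ordinary database properties, can be suggested with a much-simplified sketch. This is not Storey and Wang's ER notation; the metadata-table design and every name below are invented for illustration, assuming Python's standard sqlite3 module:

    # Sketch: quality requirements kept in metadata, separate from the
    # application schema they describe.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE patient (id INTEGER PRIMARY KEY, weight_kg REAL);
        CREATE TABLE quality_requirement (
            entity      TEXT,
            attribute   TEXT,
            dimension   TEXT,  -- e.g. accuracy, timeliness, completeness
            requirement TEXT
        );
    """)
    conn.executemany(
        "INSERT INTO quality_requirement VALUES (?, ?, ?, ?)",
        [("patient", "weight_kg", "accuracy", "calibrated scale, +/- 0.1 kg"),
         ("patient", "weight_kg", "timeliness", "measured at each visit")])

    for row in conn.execute("SELECT * FROM quality_requirement"):
        print(row)

Keeping the requirements queryable in this way makes it possible to audit the database against them later.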

These techniques can assist the developer in extracting a strong semantic model. However, it is difficult to design a database with high semantic value without significant domain knowledge and experience; these may be the two most important ingredients of a database of high semantic quality. In addition, conceptual database design remains more of an art than a science. It takes a great deal of creativity and vision to design a solution that is robust, usable, and able to stand the test of time.

Figure 2. Mapping the solution to the problem domain.

Database Behavior Quality

What constitutes a database of high behavioral quality? Are the criteria different from those used for software applications in general? The process of behavior implementation consists of the design and construction of a solution domain following the identification of the problem domain (see Figure 2). Because of the difficulties associated with defining a fixed set of current requirements and determining future utilization, the database problem domain is typically a moving target; it resembles an "amoeba": when one corner gets squeezed, the problem expands in another area. In addition, insufficient identification of appropriate database ‘behaviors’, poor communication, and inexperience in the problem domain lead to inferior solutions. As a result, the solution domain rarely approaches the optimal solution presented in Figure 2.c. The database developer must attempt to develop a database model that closely matches the perceptions of the user and deliver a design that can be implemented, maintained, and modified in a cost-effective way.

Software development, in general, is very procedure- or function-driven. The objective is to build a system that works (and do it quickly). Database development, on the other hand, should be more focused on the content, context, behavior, semantics and persistence of the data. Rapid application development and prototyping techniques contribute to arriving at a close match between the problem and solution domains. There may be no substitute for experience and proficiency with the software and tools used in the entire development process. It is one thing to discuss how a database should behave and even document these behaviors completely. Implementation and modification of these behaviors is an altogether different issue.

Many databases are perceived to be of low quality simply because they are difficult to use. In a recent survey in the UK, managers and professionals from various disciplines were asked to evaluate the quality of the information they were using [Rolph and Bartram, 1994]. Of the eight factors used, "accuracy" rated the highest and "usable format" the lowest. Developers tend to focus on aspects of data quality at the expense of behavioral quality. Granted, the behaviors associated with a general-purpose database used for decision and analytical support are varied and complex.

How does one ensure a final database product that is of high quality? Database quality must be measured as a combination of factors, including process quality, data quality, semantic quality, behavioral quality, and value. The model proposed herein presents a framework for assessing these dimensions. The purpose of this paper was to add to an existing model another in a series of steps that will ultimately provide a more comprehensive view of database quality. The area is of great concern as information becomes a critical organizational asset and the preservation of organizational memory becomes a high priority [Saviano, 1997]. Further research is required to validate the framework, identify additional quality dimensions, and develop metrics to quantify the quality of a database.

References

Ballou, D. and H. Pazer, "Designing information systems to optimize the accuracy-timeliness tradeoff", Information Systems Research, Vol. 6, No. 1, 1995, pp. 51-72.

Braithwaite, T., "Information service excellence through TQM, building partnerships for business process reengineering and continuous improvement", ASQC Quality Press, 1994.

Chignell, M. and K. Parsaye, Intelligent Database Tools and Applications, Wiley, Los Angeles, California, 1993.

Costin, H., Total Quality Management, Dryden, United States, 1994.

Delen, G. and D. Rijsenbrij, "The specification, engineering and measurement of information systems quality", Journal of Systems and Software, 1992, Vol. 17, No. 3, pp. 205-217.

Dvir, R. and S. Evans, "A TQM approach to the improvement of information quality", http://wem.mit.edu/tdqm/papers (accessed 7/97).

Dyer, M., The Cleanroom Approach to Quality Software Development, Wiley, 1992.

Fox, C., Levitin, A. and T. Redman, "The notion of data and its quality dimensions", Information Processing and Management, 1994, Vol. 30, No. 1, pp. 9-19.

Hoxmeier, J. and D. Monarchi, "An assessment of database quality: design it right or the right design?", Proceedings of the Association for Information Systems Annual Meeting, Phoenix, AZ, August, 1996.

Hoxmeier, J., "Managing the legacy systems reengineering process: lessons learned and prescriptive advice", Proceedings of the Seventh Annual Software Technology Conference, Ogden ALC/TISE, Salt Lake City, April, 1995.

Levitin, A. and T. Redman, "Quality dimensions of a conceptual view", Information Processing and Management, 1995, Vol. 31, No. 1.

Martin, J., Principles of Data-base Management, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1976.

Moriarty, T., "Barriers to data quality", Database Programming and Design, May, 1996, p. 61.

Redman, T.C., "Improve data quality for competitive advantage", Sloan Management Review, Winter, 1995, Vol. 36, No. 2, pp. 99-107.

Rolph, P., and P. Bartram, The Information Agenda: Harnessing Relevant Information in a Changing Business Environment, 1994, London, Management Books 2000, pp. 65-87.

Saviano, J., "Are we there yet?", CIO, 1 June 1997, pp. 87-96.

Schmauch, C., ISO-9000 for Software Developers, ASQC Quality Press, 1994.

Storey, V. and R. Wang, "Modeling quality requirements in conceptual database design", Total Data Quality Management, Working Paper Series, TDQM-94-02, 1994, http://web.mit.edu/tdqm/www/wp94.html (accessed 7/97).

Tenopir, C., "Database quality revisited", Library Journal, 1 October 1990, pp. 64-67.

Teorey, T., Database Modeling and Design: The Fundamental Principles, Morgan Kaufmann, San Francisco, California, 1994.

Wand, Y. and R. Wang, "Anchoring data quality dimensions in ontological foundations", Total Data Quality Management, Working Paper Series, TDQM-96-07, 1996, http://web.mit.edu/tdqm (accessed 6/97).

Wang, R., Kon, H. and S. Madnick, "Data quality requirements analysis and modeling", 9th International Conference on Data Engineering, 1993, pp. 670-677.

Wang, R., Strong, D. and L. Guarascio, "Beyond accuracy: What data quality means to data consumers", Journal of Management Information Systems, Spring 1996, Vol. 12, No. 4, pp. 5-34.

Wang, R., V. Storey, and C. Firth, "A framework for analysis of data quality research", IEEE Transactions on Knowledge and Data Engineering, Vol. 7, No. 4, 1995, pp. 349-372.

Zmud, R., "Concepts, theories and techniques: An empirical investigation of the dimensionality of the concept of information", Decision Sciences, 1978, Vol. 9, No. 2, pp. 187-195.

Acknowledgment:

Matt Spruill, a graduate student at Colorado State University, contributed to this research.


Copyright © 1997, ER'97 and John A. Hoxmeier. All rights reserved.