
Data Management and Data Quality

The definition provided in the DAMA Data Management Body of Knowledge (DAMA-DMBOK) is: "Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets." It can also be defined as the development and execution of architectures, policies, practices, and procedures that manage the information lifecycle needs of an enterprise in an effective manner. Data management is one of the most difficult challenges facing today's organizations. It helps companies improve productivity by ensuring that people can find what they need without having to conduct a long and difficult search. The goal of data management is to provide the infrastructure and tools to transform raw data into usable corporate information of the highest quality. Just as we study how to manage financial assets (i.e., identify, control, protect, analyze, and invest capital) to maximize their value in accounting and finance courses, we manage informational assets in data management. The basic rule is that, to maximize earnings, companies invest in data management technologies that increase (1) the opportunity to earn revenues (e.g., CRM) and (2) the ability to cut expenses (e.g., inventory management).

Data management is a structured approach for capturing, storing, processing, integrating, distributing, securing, and archiving data effectively throughout their life cycle. The life cycle identifies the way data travel through an organization, from their capture or creation to their use in supporting data-driven solutions, such as supply chain management (SCM), customer relationship management (CRM), and electronic commerce (EC). SCM, CRM, and EC are enterprise applications that require current and readily accessible data to function properly. One of the foundational structures of a business solution is the data warehouse. Corporate data are assets, and managing the quality and usability of those assets is vital to productivity and profitability. Dirty data result in poor business decisions, poor customer service, inadequate product design, and other wasteful situations.

One widespread data management problem is that people do not get data in the format they need to do their jobs. Therefore, even if the data are accurate, timely, and clean, they still might not be usable. Just as workers waste time tracking down and correcting invoicing and ordering errors among healthcare suppliers, they also spend significant amounts of time getting data into usable formats. Managing, searching for, and retrieving data located throughout the enterprise is a major challenge, for various reasons: The volume of data increases exponentially with time. New data are added constantly and rapidly. Business records must be kept for a long time for auditing or legal reasons, even though the organization itself may no longer access them. Only a small percentage of an organization's data is relevant for any specific application or time, and those relevant data must be identified and found in order to be useful.

External data that need to be considered in making organizational decisions are constantly increasing in volume. Data are scattered throughout organizations and are collected and created by many individuals using different methods and devices. Data are frequently stored in multiple servers and locations, and also in different computing systems, databases, formats, and human and computer languages. Data security, quality, and integrity are critical, yet easily jeopardized. In addition, legal requirements relating to data differ among countries, and they change frequently. Data are being created and used offline without going through quality control checks; hence, the validity of the data is questionable. Data throughout an organization may be redundant and out-of-date, creating a huge maintenance problem for data managers. To deal with these difficulties, organizations invest in data management solutions.

With the prevalence of client/server networks (also called client/server computing) and web technologies, numerous distinct databases are created and spread throughout the organization, creating problems in managing those data so that they are consistent in each location. Client/server networks consist of user PCs, called clients, linked to high-performance computers, called servers, that provide software, data, or computing services over a network. As businesses become more complex and their volumes of enterprise data explode, they increasingly are turning to master data management as a way to intelligently consolidate and manage these data. Master data management (MDM) is a process whereby companies integrate data from various sources or enterprise applications to provide a more unified view of the data. Although vendors may claim that their MDM solution creates a single version of the truth, this claim is probably not true. In reality, MDM cannot create a single unified version of the data because constructing a completely unified view of all master data is simply not possible. Realistically, MDM consolidates data from various data sources into a master reference file, which then feeds data back to the applications, thereby creating accurate and consistent data across the enterprise.

A master data reference file is based on data entities. A data entity is anything real or abstract about which a company wants to collect and store data. Common data entities in business include customer, vendor, product, and employee. Each organizational department has distinct master data needs. For example, marketing is concerned with product pricing, brand, and product packaging, whereas production is concerned with product costs and schedules. A customer master reference file can feed data to all enterprise systems that have a customer relationship component, thereby providing a unified picture of the customers. An MDM solution includes tools for cleaning and auditing the master data elements, as well as tools for integrating and synchronizing data to make the data more accessible. The sketch below illustrates the consolidation step.
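To make that consolidation concrete, here is a minimal sketch in Python of how records for a customer data entity from two source systems might be merged into a master reference file. The field names, the matching key (a normalized email address), and the survivorship rule (the most recently updated source wins) are illustrative assumptions for this example, not the behavior of any particular MDM product.

```python
from dataclasses import dataclass

@dataclass
class CustomerRecord:
    source: str        # originating system, e.g. "CRM" or "Billing"
    email: str         # used here as the matching key (an assumption)
    name: str
    updated: str       # ISO date of last update in the source system

def build_master_reference(records: list[CustomerRecord]) -> dict[str, CustomerRecord]:
    """Consolidate customer records from several source systems into one
    master reference file, keyed by normalized email.  When two sources
    describe the same customer, the most recently updated record
    survives (a simple, illustrative survivorship rule)."""
    master: dict[str, CustomerRecord] = {}
    for rec in records:
        key = rec.email.strip().lower()   # normalize the matching key
        current = master.get(key)
        if current is None or rec.updated > current.updated:
            master[key] = rec             # newer record wins
    return master

# Example: the same customer appears in two enterprise applications.
records = [
    CustomerRecord("CRM",     "Ann@Example.com", "Ann Lee",    "2023-01-10"),
    CustomerRecord("Billing", "ann@example.com", "Ann B. Lee", "2023-06-02"),
]
master = build_master_reference(records)
print(master["ann@example.com"].name)     # -> "Ann B. Lee" (newer record)
```

In a full MDM process, the master reference file would then feed the consolidated records back to the source applications; here, the returned dictionary stands in for that file.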

Data quality implies data accuracy, but it is much more than that. Most cleansing operations concentrate on accuracy alone. If the data are fit for the purpose for which they are intended, we can say that the data have quality. Therefore, data quality must be related to the usage of the data item as defined by the users. Does the data item in an entity reflect exactly what the user is expecting to observe? Does the data item possess fitness of purpose as defined by the users? If it does, the data item conforms to the standards of data quality. Data quality in a data warehouse is not just the quality of individual data items but the quality of the full, integrated system as a whole; it is more than the data edits on individual fields. For example, while entering data about customers in an order entry application, we may also collect the demographics of each customer. The customer demographics are not relevant to the order entry application, and therefore they are not given much attention. But you run into problems when you try to access the customer demographics in the data warehouse: the customer data as an integrated whole lack data quality.
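As a small illustration of fitness for purpose, the sketch below checks the same customer record against two different sets of expectations: the minimal edits an order entry screen might apply, and the stricter expectations a warehouse user might define for demographic analysis. All field names and rules are assumptions invented for the example.

```python
customer = {
    "customer_id": "C-1001",
    "name": "Ann Lee",
    "birth_date": "",        # demographics ignored at order entry
    "occupation": "N/A",     # generic filler value
}

def fit_for_order_entry(rec: dict) -> bool:
    # Order entry only needs an identifier and a name.
    return bool(rec["customer_id"]) and bool(rec["name"])

def fit_for_demographic_analysis(rec: dict) -> bool:
    # A warehouse user analyzing demographics also needs real
    # (non-empty, non-placeholder) demographic values.
    filler = {"", "N/A", "UNKNOWN"}
    return (fit_for_order_entry(rec)
            and rec["birth_date"] not in filler
            and rec["occupation"] not in filler)

print(fit_for_order_entry(customer))            # True  - fine for orders
print(fit_for_demographic_analysis(customer))   # False - lacks quality here
```

The same record passes one purpose and fails the other, which is exactly why quality must be judged against the users' intended usage rather than against field-level edits alone.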

Data Quality Challenges

All data warehouses need historical data. A substantial part of the historical data comes from antiquated legacy systems. Frequently, end users use the historical data in the data warehouse for strategic decision making without knowing exactly what the data really mean. In most cases, detailed metadata hardly exist for the old legacy systems. Therefore, the data pollution problems have to be fixed for the data that emanate from the old operational systems without the assistance of adequate information about those data. Common sources of data pollution include:

- System conversions: Moving from flat files to hierarchical database systems and then to relational database applications.
- Data aging: The older values for data lose their meaning and significance.
- Heterogeneous system integration: The more heterogeneous and disparate your source systems are, the stronger the possibility of corrupted data leading to data inconsistency problems. Be especially cautious when the sources for one table come from several heterogeneous systems.

- Poor database design: Good database design based on sound principles reduces the introduction of errors.
- Incomplete information at data entry: Some of the input fields are not completed at the time of initial data entry, resulting in missing values. If the unavailable data are mandatory at the time of initial entry, the person entering the data forces a generic value into the mandatory field; entering N/A (for "not available") in the city field is an example of this kind of data pollution.
- Input errors: Erroneous entry of data is a major source of data corruption.
- Internationalization/localization: As a company internationalizes, the existing data elements must adapt to newer and different values. The resulting changes in company structure and the corresponding revisions in the source systems are sources of data pollution.
- Fraud: Deliberate entry of incorrect data is not uncommon, so the source systems must be fortified with tight edits for such fields.

- Lack of policies: Prevention of the entry of corrupt data and preservation of data quality in the source systems are deliberate activities. An enterprise without explicit policies on data quality cannot be expected to have adequate levels of data quality.

The characteristics of high-quality data are listed below (a short sketch of automated checks for several of them follows the list):

- Accuracy: The value stored for a data element is the right value for that occurrence of the data element.
- Domain integrity: The data value of an attribute falls in the range of allowable, defined values. A common example is the allowable values being male and female for the gender data element.
- Data type: The value for a data attribute is actually stored as the data type defined for that attribute.
- Consistency: The form and content of a data field are the same across multiple source systems (the product code in one system should be the same in the other systems).
- Redundancy: The same data must not be stored in more than one place in a system. If, for any reason, a data element is stored in more than one place, the redundancy must be clearly identified.

- Completeness: There is no missing value for a given attribute in the system.
- Duplication: Duplication of records in a system is completely resolved. If the product file is known to have duplicate records, then all the duplicate records for each product are identified and a cross-reference is created.
- Conformance to business rules: The values of each data item adhere to prescribed business rules (in an auction system, the hammer or sale price cannot be less than the reserve price).
- Structural definiteness: Wherever a data item can naturally be structured into individual components, the item must carry this well-defined structure; for example, individual names stored as first name, middle initial, and last name.
- Data anomaly: A field must be used only for the purpose for which it is defined.

- Clarity: Proper naming conventions help to make the data elements well understood by the users.
- Timeliness: The users determine the timeliness of the data; for example, if the customer dimension data must not be older than one day, they must be updated daily.
- Usefulness: Every data element in the data warehouse must satisfy some requirement of the collection of users.
- Adherence to data integrity rules: The data stored in the relational databases of the source systems must adhere to entity integrity and referential integrity rules. In a customer-to-order relationship, referential integrity ensures the existence of a customer for every order in the database.
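The following minimal sketch shows how a few of these characteristics might be checked programmatically. The record layout, the allowable gender values, and the auction price rule are assumptions taken from the examples above; a real data quality tool would drive such checks from metadata rather than hard-coding them.

```python
def check_record(rec: dict, customer_ids: set) -> list[str]:
    """Return a list of data quality violations for one order record,
    covering a few of the characteristics described above."""
    problems = []

    # Domain integrity: gender must fall in the defined set of values.
    if rec.get("gender") not in {"male", "female"}:
        problems.append("domain integrity: gender")

    # Data type: quantity must actually be stored as an integer.
    if not isinstance(rec.get("quantity"), int):
        problems.append("data type: quantity")

    # Completeness: no missing (empty) value for a required attribute.
    if not rec.get("city"):
        problems.append("completeness: city")

    # Conformance to business rules: sale price >= reserve price.
    if rec.get("sale_price", 0) < rec.get("reserve_price", 0):
        problems.append("business rule: sale_price < reserve_price")

    # Referential integrity: every order must reference a real customer.
    if rec.get("customer_id") not in customer_ids:
        problems.append("referential integrity: customer_id")

    return problems

order = {"customer_id": "C-9", "gender": "female", "quantity": "3",
         "city": "", "sale_price": 80, "reserve_price": 100}
print(check_record(order, customer_ids={"C-1001"}))
# -> data type, completeness, business rule, and referential integrity issues
```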

Benefits of Improved Data Quality

- Analysis with timely information.
- Better customer service.
- Newer opportunities: Quality data in a data warehouse provide immense opportunities to cross-sell across product lines and departments.
- Reduced costs and risks: Strategic decisions based on bad-quality data can lead to disastrous consequences. Other risks include wasted time, malfunction of processes and systems, and legal action by customers and business partners. A wrong address for a potential customer can lead to misdelivery of marketing campaign mail.
- Improved productivity.
- Reliable decision making.

Data Quality

Data collection is a highly complex process that can create problems concerning the quality of the data being collected. Therefore, regardless of how the data are collected, they need to be validated so users know they can trust them. Data quality is a measure of the data's usefulness, as well as of the quality of the decisions based on the data. It has the following five dimensions: accuracy, accessibility, relevance, timeliness, and completeness. Although having high-quality data is essential for business success, numerous organizational and technical issues make it difficult to reach this objective. One such issue is data ownership. Inconsistent data quality requirements of various stand-alone applications create an additional set of problems as organizations try to combine individual applications into integrated enterprise systems. Interorganizational information systems add a new level of complexity to managing data quality. Companies must resolve the issue of administrative authority to ensure that each partner complies with the data quality standards.
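As an illustration of validating data before users rely on them, here is a small sketch that screens incoming records against the two of the five dimensions that can be tested mechanically, completeness and timeliness; accuracy, accessibility, and relevance generally require reference data or user judgment and are left out. The field names and the one-day freshness threshold are assumptions made up for the example.

```python
from datetime import datetime, timedelta

REQUIRED_FIELDS = ("customer_id", "name", "city")   # assumed schema
MAX_AGE = timedelta(days=1)                         # assumed freshness rule

def validate(rec: dict, now: datetime) -> list[str]:
    """Flag violations of two mechanically testable quality
    dimensions: completeness and timeliness."""
    issues = []
    for field in REQUIRED_FIELDS:
        if not rec.get(field):                      # completeness
            issues.append(f"incomplete: {field}")
    age = now - rec["loaded_at"]
    if age > MAX_AGE:                               # timeliness
        issues.append(f"stale: {age.days} day(s) old")
    return issues

rec = {"customer_id": "C-7", "name": "Ann Lee", "city": "",
       "loaded_at": datetime(2023, 6, 1)}
print(validate(rec, now=datetime(2023, 6, 5)))
# -> ['incomplete: city', 'stale: 4 day(s) old']
```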
