It is tough to define data. Think of it as the raw information that, when processed and analyzed, gives you an understanding of a situation, process, or fact. Because data is the first step toward information, we must do our best to ensure its accuracy and reliability. Let us take a look at what data is used for:
Data can be used to:
Stimulate new organizational ideas
Improve quality of emergency care and procedures
Draw attention to an issue
Influence legislation and regulations
Provide justification for an existing program
Illustrate a need for a new initiative
Help provide funding
Communicate the importance of data collection
Provide education
(List slightly altered from the original source to make it general purpose.)
As the list above shows, data can be used to initiate, validate, invalidate, modify, improve, or streamline processes. Hence, effective data management is one of the key factors in the success of any organization. Effective data management includes:
1) Identification of what data is needed
2) Identification of the source of that data
3) Decision on the mode of data collection
4) Creation of a mechanism to access data
5) Understanding of the issues of data reliability
6) Creation of processes for testing the data
Identification of what data is needed: We have moved from a world with too little data to one with too much. With electronic methods of data collection and retrieval, we often do not know how to deal with the sheer volume of data. This is the classic case of information overload. Here are some principles for deciding what data is needed.
1) Identify the purpose of the process that is being studied. At this stage, do not concern yourself with data issues; start from the basic requirements of the organization. Example: If you are a school, one of the processes you might choose to study is the effectiveness of teaching.
2) Identify the data variables that are relevant to what you want to study. Example: Continuing with the school that wants to measure teaching effectiveness, your judgment might tell you that the important data variables are teacher's qualifications, class size, scores... At the same time, other data variables might be available that are best ignored, for example, the height and weight of the students.
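As a minimal sketch of the second principle, the snippet below keeps only the variables judged relevant and ignores the rest. The records, the field names, and the choice of relevant variables are all hypothetical and serve only to illustrate the school example.

```python
# Hypothetical student records; all field names and values are illustrative only.
records = [
    {"teacher_qualification": "MEd", "class_size": 24, "score": 78,
     "height_cm": 150, "weight_kg": 41},
    {"teacher_qualification": "BEd", "class_size": 31, "score": 65,
     "height_cm": 158, "weight_kg": 47},
]

# Variables judged relevant to teaching effectiveness (an assumption for this sketch).
RELEVANT_VARIABLES = ("teacher_qualification", "class_size", "score")


def keep_relevant(record, variables=RELEVANT_VARIABLES):
    """Return a copy of the record containing only the chosen variables."""
    return {name: record[name] for name in variables if name in record}


study_data = [keep_relevant(r) for r in records]
```

Deciding on the list of relevant variables remains a matter of judgment; the code only makes that decision explicit and repeatable.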
Identification of the source of that data: There can be existing or created data.
Existing sources of data include:
1) Primary Records: Data that needs to be measured might already have been collected.
2) Secondary Records: Actual data might not be available, but there might be proxy data, i.e., data that can be used as a good indicator of the actual data.
Decision on the mode of data collection: To the beginner data-surveyor, the importance of the mode of data collection might not be apparent.
"Costs and timeliness are relatively easy to compare between modes. Evaluating quality takes more effort, but is crucial. Using a mode that compromises data quality may negate any advantages in cost or timeliness.
Completeness, accuracy, and reliability are all components of 'quality.'
Research to date has produced mixed results concerning the quality of mail and interview administered surveys. Mail surveys generally suffer from lower completeness. But they seem to have an advantage in accuracy, allowing respondents time for more thoughtful answers. Mail surveys have another advantage in reliability - the correlated component of response error, or between-interviewer variance, is eliminated for cases completed by mail."
The above extract from a National Statistical Foundation publication shows how the quality of data can differ depending on the mode of data collection. Here are some other illustrations for developing a common-sense understanding of how to design good data collection routines. For the purpose of the discussion below, assume that we are collecting data to evaluate whether a particular teacher is good at her job.
1) Face-to-face interview: In a face-to-face situation, a respondent might feel too intimidated to speak openly. A respondent might also be concerned about the personal image they convey through the interview. Hence, they might be inclined to intellectualize or over-analyze their answers, rather than simply speaking their mind.
2) Non-anonymous questionnaire: Some questionnaires are not anonymous because the surveyor might want to get back to the respondent with more questions. In this case, the respondent might fear that the person being evaluated will find out what they said. This could seriously alter the quality of the feedback.
3) Last class student questionnaire: If the questionnaire is administered in the last class, students are already reeling under the pressure of assignments and final exams. It is only human for them to hold some resentment against the teacher for some real or imagined cause. Hence, their feedback might differ from what it would have been a few weeks earlier.
Creation of a mechanism to access data: When large amounts of electronic data are pulled out of computer systems, there is always the risk of ineffective and inefficient usage. Here is an extract from KMWorld (Knowledge Management World) that explains some of the processes that can considerably enhance the quality of data retrieval.
"Context mediation -- for determining the business meaning of a word or value based on the context or associations of adjacent data.
Normalization -- for transforming users' terms to your terms and recognizing word variations and synonyms. Getting both sides of the search aligned facilitates matching. But it can't eliminate all nonstandard descriptions, so you need the next capabilities, too.
Fuzzy retrieval -- for finding data without a precise key, such as a product number, or under conditions where data is inconsistent or missing.
Fuzzy matching and filtering -- for measuring and ranking "possible" matches -- to get the best one(s) and avoid irrelevant matches."
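Normalization followed by fuzzy matching and filtering can be sketched in a few lines. The example below is only an illustration, not the method described in the extract: the catalog, the synonym table, and the function names are hypothetical, and Python's standard difflib module stands in for whatever matching engine a real system would use.

```python
import difflib

# Hypothetical catalog and synonym table; none of these names come from the extract.
CATALOG = ["stainless steel bolt", "copper pipe", "rubber gasket"]
SYNONYMS = {"ss": "stainless steel", "cu": "copper"}


def normalize(term):
    """Lowercase, collapse whitespace, and map known abbreviations to catalog wording."""
    words = term.lower().split()
    return " ".join(SYNONYMS.get(word, word) for word in words)


def fuzzy_lookup(term, cutoff=0.6, limit=3):
    """Return catalog entries ranked by similarity, dropping those below the cutoff."""
    return difflib.get_close_matches(normalize(term), CATALOG, n=limit, cutoff=cutoff)


# The user's wording differs from the catalog in case, spacing, abbreviation, and plural.
matches = fuzzy_lookup("SS  bolts")
```

The cutoff plays the role of the filter: raising it returns fewer but more reliable matches, while lowering it returns more candidates at the risk of irrelevant ones.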
Understanding of the issues of data reliability: When collecting data, several types of errors can arise.
1) Sampling Error: Sampling error arises because only a sample, rather than the whole target population, is surveyed, so the sample may not be representative.
Sampling error occurs in every sample survey. Simply put, it is the difference between the estimated value and the actual value for the target population. When two similar surveys produce two different estimates, it is largely due to sampling error. The survey with the least amount of sampling error is, by default, the most accurate. One factor affecting sampling error is the number of observations used in calculating the estimate. The general rule of thumb is that, all else being equal, the survey with the most observations will most likely also have the least sampling error. However, even if the surveys had the same number of observations, both surveys could still produce different estimates. This is just a fact of surveying - specifically "data variance."
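The rule of thumb about the number of observations can be checked with a small simulation. The sketch below assumes a made-up population of 100,000 normally distributed values and compares the average gap between sample estimate and population value for a small and a large survey; the population, the sample sizes, and the function name are illustrative assumptions.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical target population of 100,000 values; the distribution is an assumption.
population = [rng.gauss(50, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)


def average_sampling_error(sample_size, repeats=200):
    """Mean absolute gap between the sample estimate and the population value."""
    errors = []
    for _ in range(repeats):
        sample = rng.sample(population, sample_size)
        errors.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(errors)


small_survey_error = average_sampling_error(25)
large_survey_error = average_sampling_error(2_500)
```

Each repeat is one imaginary survey. Individual surveys of the same size still give different estimates, which is the "data variance" mentioned above, but on average the larger surveys land closer to the population value.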
2) Non-sampling Errors: The overall category of non-sampling error includes all errors that arise from an unsuitable mode of data collection, mistakes, fraud, or bias.
"Non-sampling error is the term used to describe variations in the estimates that may be caused by population coverage limitations and data collection, processing, and reporting procedures.
An important non-sampling error for a telephone survey is the failure to include persons who do not live in households with telephones.
Another potential source of non-sampling error is respondent bias. Respondent bias occurs when respondents systematically misreport (intentionally or unintentionally) information in a study."
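The telephone example can be made concrete with a toy calculation. The figures below are invented purely for illustration: a population in which households without a telephone have a lower income than households with one. The sketch shows that the resulting coverage error does not depend on how many calls are made.

```python
# Hypothetical population: (has_telephone, monthly_income). All figures are invented
# for illustration and do not come from any survey.
population = (
    [(True, 3000)] * 80     # households with a telephone
    + [(False, 1500)] * 20  # households without one
)

true_mean = sum(income for _, income in population) / len(population)

# A telephone survey can only ever reach the first group, however many calls are made.
reachable = [income for has_phone, income in population if has_phone]
phone_survey_mean = sum(reachable) / len(reachable)

# The gap between the two means is the coverage error.
coverage_bias = phone_survey_mean - true_mean
```

Even if every reachable household answered, the estimate would stay above the population value, which is why non-sampling error has to be addressed through survey design rather than through a larger sample.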
Creation of processes for testing the data
"In considering the use of computer-based data, the following logical questions arise:
How do the data relate to the assignment's objective(s)?
What do we know about the data and the system that processed them?
Are the data reasonably complete and accurate?"
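The last question, whether the data are reasonably complete and accurate, lends itself to simple automated checks. The sketch below is one possible starting point, not a complete testing process: the rows, the field names, and the plausible range of 0 to 100 are assumptions made for illustration.

```python
# Hypothetical extract from a computer system; field names and valid ranges are assumptions.
rows = [
    {"student_id": "S1", "score": 78},
    {"student_id": "S2", "score": None},  # missing value
    {"student_id": "S3", "score": 140},   # outside the plausible 0-100 range
]


def completeness(rows, field):
    """Share of rows in which the field is present and not empty."""
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


def out_of_range(rows, field, low, high):
    """Rows whose value is present but falls outside [low, high]."""
    return [
        row for row in rows
        if row.get(field) is not None and not (low <= row[field] <= high)
    ]


score_completeness = completeness(rows, "score")
suspect_rows = out_of_range(rows, "score", 0, 100)
```

A range check only flags values that are implausible; a value inside the range can still be wrong, so checks like these complement rather than replace a comparison against source records.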
There are many ways of testing data. Prior to establishing a data…