
1) Like most people, we probably feel that it is important to "take control of your life." But what does this mean? Partly it means being able to properly evaluate the data and claims that bombard us every day. If we cannot distinguish good from faulty reasoning, then we are vulnerable to manipulation and to decisions that are not in our best interest. Statistics provides the tools we need in order to react intelligently to information we hear or read. In this sense, statistics is one of the most important subjects we can study. To be more specific, here are some claims that we have all heard on various occasions:

4 out of 5 dentists recommend Dentyne.
Almost 85% of lung cancers in men and 45% in women are tobacco-related.
Condoms are effective 94% of the time.
Native Americans are significantly more likely to be hit crossing the street than are people of other ethnicities.
People tend to be more persuasive when they look others directly in the eye and speak loudly and quickly.
Women make 75 cents for every dollar a man makes when they work the same job.
79.48% of all statistics are made up on the spot.

Indeed, data and data interpretation show up in conversations about virtually every facet of contemporary life. Statistics are often presented in an effort to add credibility to an argument or advice. You can see this by paying attention to television advertisements. To be an intelligent consumer of statistics, our first reflex must be to question the statistics we encounter. Just as important as detecting the deceptive use of statistics is appreciating the proper use of statistics: we must also learn to recognize statistical evidence that supports a stated conclusion. Statistics are all around us, sometimes used well, sometimes not. We must learn how to distinguish the two cases.

2) Research is any process by which information is systematically and carefully gathered for the purpose of answering questions, examining ideas, or testing theories. Research is a disciplined inquiry that can take numerous forms. Statistical analysis is relevant only for those research projects where the information collected is represented by numbers. Numerical information is called data, and the purpose of statistics is to manipulate and analyze data.

A) A statistic is a fact or piece of information that is expressed as a number or percentage. The facts and figures that are collected and examined for information on a given subject are statistics. Statistics is also the name for the set of methods used to collect and analyze data; these methods help people identify, study, and solve a variety of problems and make good decisions in uncertain situations. Probability, the likelihood of something happening or being true, is used to describe events that do not occur with certainty.

B) A theory is an explanation of the relationship between phenomena. In the sciences, theories are created after observation and testing, and are designed to rationally and clearly explain a phenomenon. For example, Isaac Newton came up with a theory about gravity in the 17th century, and the theory proved to be both testable and highly accurate. Scientific theories are not quite the same thing as facts, but they are often very similar; scientists usually test their theories extensively before airing them, looking for obvious problems that could cause the theory to be challenged.

3) A variable is any trait that can change values from case to case.

Continuous variables are those that theoretically have an infinite number of gradations between any two measurements; examples include the body weight of individuals and the milk yield of cows or buffaloes. Most variables in biology are of the continuous type. Discrete variables, on the other hand, do not have continuous gradations; there is a definite gap between two measurements, i.e. they cannot be measured in fractions. Examples include the number of eggs laid by hens and the number of children in a family.

An independent variable, sometimes called an experimental or predictor variable, is a variable that is manipulated in an experiment in order to observe the effect on a dependent variable, sometimes called an outcome variable.

Categorical variables are also known as discrete or qualitative variables. Categorical variables can be further categorized as nominal, ordinal or dichotomous. Nominal variables are variables that have two or more categories but no intrinsic order. For example, a real estate agent could classify their types of property into distinct categories such as houses, condos, co-ops or bungalows. So "type of property" is a nominal variable with four categories: houses, condos, co-ops and bungalows. Of note, the different categories of a nominal variable can also be referred to as groups or levels of the nominal variable. Another example of a nominal variable would be classifying where people live in the USA by state; in this case there will be many more levels of the nominal variable (50, in fact).

Dichotomous variables are nominal variables which have only two categories or levels. For example, if we were looking at gender, we would most probably categorize somebody as either "male" or "female". This is an example of a dichotomous variable (and also a nominal variable). Another example might be asking a person if they own a mobile phone; here, we may categorize mobile phone ownership as either "Yes" or "No". In the real estate agent example, if type of property had been classified as either residential or commercial, then "type of property" would be a dichotomous variable.

Ordinal variables have two or more categories, just like nominal variables, but the categories can also be ordered or ranked. So if you asked someone whether they liked the policies of the Democratic Party and they could answer "Not very much", "They are OK" or "Yes, a lot", then you have an ordinal variable. Why? Because you have three categories, namely "Not very much", "They are OK" and "Yes, a lot", and you can rank them from the most positive (Yes, a lot), to the middle response (They are OK), to the least positive (Not very much). However, whilst we can rank the levels, we cannot assign a "value" to them; we cannot say that "They are OK" is twice as positive as "Not very much", for example.
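
As a concrete illustration, here is a minimal sketch of how these variable types can be represented in code. It assumes the pandas library is available, and the example values are made up; the point is simply that an ordinal variable carries an explicit ordering while a nominal one does not:

```python
import pandas as pd

# Nominal: categories with no intrinsic order (the property types from the text)
property_type = pd.Categorical(
    ["condo", "house", "co-op", "house", "bungalow"],
    categories=["house", "condo", "co-op", "bungalow"],
)

# Dichotomous: a nominal variable with exactly two levels
owns_phone = pd.Categorical(["Yes", "No", "Yes"], categories=["Yes", "No"])

# Ordinal: categories that can be ranked (the party-policy question)
opinion = pd.Categorical(
    ["They are OK", "Not very much", "Yes, a lot"],
    categories=["Not very much", "They are OK", "Yes, a lot"],
    ordered=True,
)

# Ordered categories support ranking; unordered ones reject such comparisons.
print(opinion.sort_values())        # ranked from least to most positive
print(opinion > "Not very much")    # element-wise rank comparison: [True False True]
```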

Continuous variables are also known as quantitative variables, and can be further categorized as either interval or ratio variables. Interval variables can be measured along a continuum and have a numerical value (for example, temperature measured in degrees Celsius or Fahrenheit): the difference between 20°C and 30°C is the same as the difference between 30°C and 40°C. However, temperature measured in degrees Celsius or Fahrenheit is NOT a ratio variable. Ratio variables are interval variables with the added condition that a measurement of 0 (zero) indicates that there is none of that variable. So, temperature measured in degrees Celsius or Fahrenheit is not a ratio variable because 0°C does not mean there is no temperature. However, temperature measured in Kelvin is a ratio variable, as 0 Kelvin (often called absolute zero) indicates that there is none of the quantity being measured. Other examples of ratio variables include height, mass and distance. The name "ratio" reflects the fact that you can take ratios of measurements: a distance of ten meters, for example, is twice a distance of five meters.

Reliability is the consistency of your measurement, or the degree to which an instrument measures the same way each time it is used under the same conditions with the same subjects. In short, it is the repeatability of your measurement. A measure is considered reliable if a person's score on the same test given twice is similar. It is important to remember that reliability is not measured, it is estimated.

Validity is the strength of our conclusions, inferences or propositions. More formally, Cook and Campbell (1979) define it as the "best available approximation to the truth or falsity of a given inference, proposition or conclusion." In short, were we right? Let's look at a simple example. Say we are studying the effect of strict attendance policies on class participation, and we saw that class participation did increase after the policy was established. Each type of validity would highlight a different aspect of the relationship between our treatment (the strict attendance policy) and our observed outcome (increased class participation).

Information exists on a continuum of reliability and quality. Information is everywhere on the Internet, existing in large quantities and continuously being created and revised. It comes in a large variety of kinds (facts, opinions, stories, interpretations, statistics) and is created for many purposes (to inform, to persuade, to sell, to present a viewpoint, and to create or change an attitude or belief). For each of these kinds and purposes, information exists on many levels of quality or reliability, ranging from very good to very bad and including every shade in between.

4) Data Levels of Measurement. Levels of measurement have been classified into four categories. It is important for the researcher to understand the different levels of measurement, as they play a part in determining which arithmetic and statistical operations can be carried out on the data. In ascending order of precision, the four levels of measurement are:

nominal, ordinal, interval, ratio

The first level of measurement is nominal measurement. In this level, numbers, words or letters are used simply to classify the data. Suppose there are data about people belonging to two different genders: the person belonging to the female gender could be classified as F, and the person belonging to the male gender could be classified as M. This type of classification is the nominal level of measurement.

The second level of measurement is the ordinal level. This level of measurement depicts an ordered relationship among the items. Suppose a student scores the maximum marks in the class; he would be assigned the first rank. The person scoring the second-highest marks would be assigned the second rank, and so on. The ordinal level of measurement indicates an ordering of the measurements, but the researcher should note that the difference or ratio between any two rankings is not the same along the scale.

The third level of measurement is the interval level. The interval level not only classifies and orders the measurements, it also specifies that the distances between adjacent points on the scale are equivalent all along the scale, from low to high. For example, if anxiety is measured on an interval scale, the difference between scores of 10 and 11 is the same as the difference between scores of 40 and 41. A popular example of this level of measurement is temperature in centigrade, where the distance between 94°C and 96°C is the same as the distance between 100°C and 102°C.

The fourth level of measurement is the ratio level. In this level of measurement, a value of zero means that none of the quantity is present, which distinguishes it from the other types of measurement, although its other properties are the same as those of the interval level. In the ratio level of measurement, the divisions between the points on the scale have equivalent distances between them, and the rankings assigned to the items reflect their size. The researcher should note that among these levels of measurement, the nominal level is simply used to classify data, whereas the interval and ratio levels are much more exact.
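
To make the interval-versus-ratio distinction concrete, here is a minimal sketch in Python, using made-up temperatures, showing why a ratio of Celsius readings is not meaningful while a ratio of Kelvin readings is:

```python
# Celsius is an interval scale: its zero is arbitrary, so ratios are meaningless.
# Kelvin is a ratio scale: 0 K means none of the quantity, so ratios make sense.

def celsius_to_kelvin(c: float) -> float:
    return c + 273.15

c1, c2 = 10.0, 20.0
print(c2 / c1)                                        # 2.0, but 20 °C is not "twice as hot" as 10 °C
print(celsius_to_kelvin(c2) / celsius_to_kelvin(c1))  # ~1.035, the physically meaningful ratio
```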

5) Measures of Central Tendency: The 3 Ms (Mean, Median, Mode). A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location. They are also classed as summary statistics. The mean (often called the average) is most likely the measure of central tendency that you are most familiar with, but there are others, such as the median and the mode. The mean, median and mode are all valid measures of central tendency, but, under different conditions, some measures become more appropriate to use than others. In the following sections we will look at the mean, median and mode, learn how to calculate them, and see under what conditions each is most appropriate. Mean (Arithmetic): The mean (or average) is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data, although it is most often used with continuous data (see the discussion of variable types in section 3 above). The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. So, if we have n values in a data set with values $x_1, x_2, \ldots, x_n$, then the sample mean, usually denoted by $\bar{x}$ (pronounced "x bar"), is:

$\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$

This formula is usually written in a slightly different manner using the Greek capital letter sigma, $\Sigma$, which means "sum of...":

$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$

You may have noticed that the above formula refers to the sample mean. So, why have we called it a sample mean? This is because, in statistics, samples and populations have very different meanings, and these differences are very important, even if, in the case of the mean, they are calculated in the same way. To acknowledge that we are calculating the population mean rather than the sample mean, we use the Greek lowercase letter mu ($\mu$):

$\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$

The mean is essentially a model of your data set: the single value that best represents all of the data. You will notice, however, that the mean is often not one of the actual values that you have observed in your data set. Nevertheless, one of its important properties is that it minimizes error in the prediction of any one value in your data set; more precisely, it is the value that produces the lowest total squared error with respect to all the other values in the data set. Another important property of the mean is that it includes every value in your data set as part of the calculation. In addition, the mean is the only measure of central tendency for which the sum of the deviations of each value from it is always zero.
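
As a quick illustration, here is a minimal sketch in plain Python (using the made-up marks that also appear in the median example below) that computes a sample mean and checks the zero-sum-of-deviations property just described:

```python
data = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

n = len(data)
mean = sum(data) / n   # sum of all values divided by the number of values
print(mean)            # 59.0

# Property from the text: deviations from the mean always sum to zero
deviations = [x - mean for x in data]
print(round(sum(deviations), 10))  # 0.0 (rounded to absorb floating-point noise)
```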

When not to use the mean. The mean has one main disadvantage: it is particularly susceptible to the influence of outliers. These are values that are unusual compared to the rest of the data set by being especially small or large in numerical value. For example, consider the wages of staff at a factory below:

Staff   Salary
1       15k
2       18k
3       16k
4       14k
5       15k
6       15k
7       12k
8       17k
9       90k
10      95k

The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests that this mean value might not be the best way to reflect the typical salary of a worker, as most workers have salaries in the $12k to $18k range. The mean is being skewed by the two large salaries. Therefore, in this situation we would like a better measure of central tendency. As we will find out later, taking the median would be a better measure of central tendency in this situation.

Another time when we usually prefer the median over the mean (or mode) is when our data is skewed (i.e. the frequency distribution for our data is skewed). If we consider the normal distribution, as this is the most frequently assessed distribution in statistics, then when the data is perfectly normal the mean, median and mode are identical, and all represent the most typical value in the data set. As the data becomes skewed, however, the mean loses its ability to provide the best central location, because the skewed data drags it away from the typical value. The median, by contrast, best retains this position and is not as strongly influenced by the skewed values. This is explained in more detail in the skewed-distribution discussion later in this guide.

Median: The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. In order to calculate the median, suppose we have the data below:

65 55 89 56 35 14 56 55 87 45 92

We first need to rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89 92

Our median mark is the middle mark, in this case 56. It is the middle mark because there are 5 scores before it and 5 scores after it. This works fine when you have an odd number of scores, but what happens when you have an even number of scores? What if you had only 10 scores? Well, you simply take the middle two scores and average them. So, if we look at the example below:

65 55 89 56 35 14 56 55 87 45

We again rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89

Only now we have to take the 5th and 6th scores in our data set and average them to get a median of 55.5.

Mode: The mode is the most frequent score in our data set. On a histogram or bar chart it is represented by the highest bar. You can, therefore, sometimes consider the mode as being the most popular option. An example of a mode is presented below:

[Figure: histogram with the tallest bar marking the mode]

Normally, the mode is used for categorical data where we wish to know which is the most common category, as illustrated below:

[Figure: bar chart of frequencies by mode of transport]

We can see above that the most common form of transport, in this particular data set, is the bus. However, one of the problems with the mode is that it is not unique, so it leaves us with problems when we have two or more values that share the highest frequency, such as below:

[Figure: bar chart with two categories tied for the highest frequency]

We are now stuck as to which mode best describes the central tendency of the data. This is particularly problematic when we have continuous data, as we are unlikely to have any one value occur more frequently than another. For example, consider measuring the weights of 30 people (to the nearest 0.1 kg). How likely is it that we will find two or more people with exactly the same weight, e.g. 67.4 kg? The answer is: probably very unlikely. Many people might be close, but with such a small sample (30 people) and a large range of possible weights, you are unlikely to find two people with exactly the same weight to the nearest 0.1 kg. This is why the mode is very rarely used with continuous data. Another problem with the mode is that it will not provide a very good measure of central tendency when the most common value is far away from the rest of the data, as depicted in the diagram below:

[Figure: frequency distribution with the mode at 2, far from the bulk of the data around 20 to 30]

In the above diagram the mode has a value of 2. We can clearly see, however, that the mode is not representative of the data, which is mostly concentrated around the 20 to 30 value range. To use the mode to describe the central tendency of this data set would be misleading.

Skewed Distributions and the Mean and Median: We often test whether our data is normally distributed, as this is a common assumption underlying many statistical tests. An example of a normally distributed set of data is presented below:

[Figure: histogram of an approximately normal distribution]

When you have a normally distributed sample you can legitimately use either the mean or the median as your measure of central tendency. In fact, in any symmetrical distribution the mean, median and mode are equal. However, in this situation, the mean is widely preferred as the best measure of central tendency because it is the measure that includes all the values in the data set in its calculation, and any change in any of the scores will affect its value. This is not the case with the median or mode. However, when our data is skewed, as with the right-skewed data set below:

[Figure: histogram of a right-skewed distribution]

we find that the mean is being dragged in the direction of the skew. In these situations, the median is generally considered to be the best representative of the central location of the data. The more skewed the distribution, the greater the difference between the median and mean, and the greater the emphasis that should be placed on using the median as opposed to the mean. A classic example of the above right-skewed distribution is income (salary), where higher earners give a false impression of the typical income if it is expressed as a mean rather than a median. If we assumed the data were normally distributed but tests of normality show that they are not, then it is customary to use the median instead of the mean. This is more a rule of thumb than a strict guideline, however. Sometimes researchers wish to report the mean of a skewed distribution if the median and mean are not appreciably different (a subjective assessment) and if it allows easier comparisons with previous research.

Summary of when to use the mean, median and mode. Use the following summary table to identify the best measure of central tendency with respect to the different types of variable:

Type of Variable               Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median
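
As a practical check, here is a minimal sketch using Python's standard statistics module that reproduces the factory-salary example above and shows how the two outliers pull the mean away from the median:

```python
import statistics

# Factory salaries from the example above, in $k
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

print(statistics.mean(salaries))    # 30.7, dragged upward by the two large salaries
print(statistics.median(salaries))  # 15.5, closer to a typical worker's salary
print(statistics.mode(salaries))    # 15, the most frequent value
```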

6) Range, variance and standard deviation

These are all measures of dispersion. They help you to know the spread of scores within a bunch of scores: are the scores really close together or really far apart? For example, if you were describing the heights of students in your class to a friend, they might want to know how much the heights vary. Are all the men about 5 feet 11 inches, give or take an inch or so? Or is there a lot of variation, with some men at 5 feet and others at 6 feet 5 inches? Measures of dispersion like the range, variance and standard deviation tell you about the spread of scores in a data set. Like measures of central tendency, they help you summarize a bunch of numbers with one or just a few numbers.

Range is the simplest measure of dispersion. It is defined as the positive difference between the largest and the smallest values in the given data. It is easily understood and computed, but it depends exclusively on the two extreme values, whereas it is usually desirable to have a measure that depends on all the values. The range is a very useful measure in the statistical quality control of products in industry, where the interest lies in getting a quick rather than an accurate measure of variability.

Variance is the average of the squared deviations of the values from their mean, and the standard deviation is the square root of the variance. The standard deviation is a statistic that tells you how tightly all the various examples are clustered around the mean in a set of data. When the examples are tightly bunched together and the bell-shaped curve is steep, the standard deviation is small. When the examples are spread apart and the bell curve is relatively flat, the standard deviation is relatively large. Computing the value of a standard deviation by hand is a little involved, but let me show you graphically what a standard deviation represents:

[Figure: normal curve with bands at one (red), two (green) and three (blue) standard deviations from the mean]

One standard deviation away from the mean in either direction on the horizontal axis (the red area on the above graph) accounts for around 68 percent of the people in this group. Two standard deviations away from the mean (the red and green areas) account for roughly 95 percent of the people. And three standard deviations (the red, green and blue areas) account for about 99.7 percent of the people. If this curve were flatter and more spread out, the standard deviation would have to be larger in order to account for those 68 percent or so of the people. That is why the standard deviation can tell you how spread out the examples in a set are from the mean.

The interquartile range (IQR) is the distance between the 75th percentile and the 25th percentile. The IQR is essentially the range of the middle 50% of the data. Because it uses the middle 50%, the IQR is not affected by outliers or extreme values. The IQR is also equal to the length of the box in a box plot.
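
Here is a minimal sketch, in plain Python with the standard statistics module and a small made-up data set, computing the range, variance, standard deviation and IQR described above:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scores

data_range = max(data) - min(data)           # largest minus smallest value
variance = statistics.pvariance(data)        # average squared deviation from the mean
sd = statistics.pstdev(data)                 # square root of the variance
q1, _, q3 = statistics.quantiles(data, n=4)  # 25th and 75th percentiles
iqr = q3 - q1                                # spread of the middle 50% of the data

print(data_range, variance, sd, iqr)         # 7 4.0 2.0 2.5
```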

8) Properties of the Normal Curve. Known characteristics of the normal curve make it possible to estimate the probability of occurrence of any value of a normally distributed variable. Suppose that the total area under the curve is defined to be 1. You can multiply that number by 100 and say there is a 100 percent chance that any value you can name will be somewhere in the distribution. (Remember, the distribution extends to infinity in both directions.) Similarly, because half of the area of the curve is below the mean and half is above it, you can say that there is a 50 percent chance that a randomly chosen value will be above the mean and the same chance that it will be below it. It makes sense that the area under the normal curve is equivalent to the probability of randomly drawing a value in that range. The area is greatest in the middle, where the hump is, and thins out toward the tails. That is consistent with the fact that there are more values close to the mean in a normal distribution than far from it. When the area of the standard normal curve is divided into sections by standard deviations above and below the mean, the area in each section is a known quantity (see Figure 1). The area in each section is the same as the probability of randomly drawing a value in that range.
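
These section areas can be computed directly from the cumulative distribution function. A minimal sketch, assuming SciPy is available, using scipy.stats.norm:

```python
from scipy.stats import norm

# Area under the standard normal curve between two z-bounds = probability of that range
within_1_sd = norm.cdf(1) - norm.cdf(-1)   # ~0.683
within_2_sd = norm.cdf(2) - norm.cdf(-2)   # ~0.954
within_3_sd = norm.cdf(3) - norm.cdf(-3)   # ~0.997
above_mean = 1 - norm.cdf(0)               # 0.5: half the area lies above the mean

print(within_1_sd, within_2_sd, within_3_sd, above_mean)
```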

9) Z-Scores. Sometimes we want to do more than summarize a bunch of scores; sometimes we want to talk about particular scores within the bunch. We may want to tell other people whether a score is above or below average, or how far away a particular score is from average. We might also want to compare scores from different bunches of data and know which score is better. Z-scores can help with all of this.

Z-scores tell us whether a particular score is equal to the mean, below the mean or above the mean of a bunch of scores. They can also tell us how far a particular score is away from the mean: is it close to the mean or far away? A Z-score is simply the distance of a score from the mean expressed in standard deviations, $z = \dfrac{x - \mu}{\sigma}$. If a Z-score:

- has a value of 0, it is equal to the group mean
- is positive, it is above the group mean
- is negative, it is below the group mean
- is equal to +1, it is 1 standard deviation above the mean
- is equal to +2, it is 2 standard deviations above the mean
- is equal to -1, it is 1 standard deviation below the mean
- is equal to -2, it is 2 standard deviations below the mean

Z-scores can help us understand how typical a particular score is within a bunch of scores. If data are normally distributed, approximately 95% of the data should have Z-scores between -2 and +2; Z-scores outside this range are less typical of the data in the bunch. Z-scores can also help us compare individual scores from different bunches of data: we can use Z-scores to standardize scores from different groups and then compare scores that originally came from different distributions.
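
A minimal sketch in plain Python, with made-up test scores, of standardizing and comparing scores from two different groups:

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Distance of x from the mean, expressed in standard deviations."""
    return (x - mean) / sd

# Hypothetical example: which score is more impressive relative to its own group?
math_z = z_score(85, mean=70, sd=10)    # 1.5 SDs above the math class average
english_z = z_score(80, mean=75, sd=5)  # 1.0 SD above the English class average

print(math_z, english_z)  # 1.5 1.0: the math score is more exceptional within its group
```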
