Contents
Introduction
History
Applications and uses of biostatistics in science
Common statistical terms
Common symbols used
Data
(a) Collection and types
(b) Presentation
(c) Analysis
(d) Interpretation
Limitations
Conclusion
References
Introduction
"There are three kinds of lies: lies, damned lies, and statistics." (Benjamin Disraeli / Mark Twain). The word statistics conveys a variety of meanings to people. It is best known for the handling of data, both in general and in the field of research. The word comes from the Italian word statista, meaning statesman, or the German word Statistik, meaning political state. Statistics developed from two main sources : (1) Government records (2) Mathematics. John Graunt (1620 - 1674) is regarded as the father of health statistics.
Definitions
Statistics :
The science of collecting, summarizing, presenting, analyzing and interpreting data is called statistics.
Biostatistics :
The method of collecting, organizing, analyzing, tabulating and interpreting data related to living organisms and human beings is called biostatistics.
[Soben Peter. Essentials of preventive and community dentistry, 2nd edition. New Delhi : Arya; 2006. 824]
HISTORY
1620 - 1674
Karl Pearson
P-value
IN PHARMACOLOGY
To find the action of drugs.
To compare the action of two drugs, or of two successive dosages of the same drug.
To find the relative potency of a new drug with respect to a standard drug.
IN MEDICINE
To compare the efficacy of a particular drug, operation or line of treatment.
To find an association between two attributes, such as cancer and smoking.
To identify signs and symptoms of disease.
To test the usefulness of a vaccine in the field.
In epidemiologic studies, the role of causative factors is statistically tested.
FOR STUDENTS :
By learning the methods of biostatistics, a student learns to evaluate articles published in medical and dental journals and papers read at medical and dental conferences. The student also comes to understand the basic methods of observation used in clinical practice and research.
VARIABLE :
A characteristic that takes different values for different persons, places or things. A quantity that varies between limits, e.g. height, weight, blood pressure, age. Denoted as X, and for an orderly series as X1, X2, X3 ... Xn.
CONSTANT :
Quantities that do not vary, such as π ≈ 3.1416 and e ≈ 2.718. These do not require statistical study. In biostatistics, e.g., the mean and standard deviation are considered constant for a population.
OBSERVATION :
OBSERVATIONAL UNIT :
Source that gives observations, such as an object or a person. In medical statistics, the term individual or subject is used more often.
DATA:
Set of values recorded on one or more observational units.
POPULATION :
Population includes all persons, events and objects under study. It may be finite or infinite.
SAMPLE :
Defined as a part of a population generally selected so as to be representative of the population whose variables are under study.
PARAMETER
It is a constant that describes a population, e.g. in a college 40% of the students are girls. This describes the population, hence it is a parameter.
STATISTIC
A statistic is a value that describes a sample, e.g. out of a sample of 200 students of the same college, 45% are girls. This 45% is a statistic, as it describes the sample.
ATTRIBUTE
A characteristic on the basis of which the population can be divided into categories or classes, e.g. gender, caste, religion.
Common symbols used
=           Equal to
>           Greater than
<           Less than
Z           No. of standard deviations
%           Percentage
r           Pearson's correlation coefficient
ρ (rho)     Spearman's rank correlation coefficient
d.f. or f   Degrees of freedom
K           Number of groups or classes
P           Probability
O           Observed number
E           Expected number
DATA
Set of values recorded on one or more observational units is called data. It is of two types :
A. Census :
The United Nations defines a census as "the total process of collecting, compiling and publishing demographic, economic and social data pertaining, at a specified time or times, to all persons in a country or delimited territory". It is an important source of health information. The first regular census in India was taken in 1881, and others have taken place at 10 year intervals. The primary function of a census is to provide demographic information, such as the total count of the population and its breakdown into groups and sub groups, e.g. age and sex distribution.
The population census provides the basic data (by age and sex) needed to compute vital statistical rates and other health, demographic and socio economic indicators.
B. Registration of vital events :
The United Nations defines a vital event registration system as including "legal registration, statistical recording and reporting of the occurrence of, and the collection, compilation, presentation, analysis and distribution of statistics pertaining to vital events, i.e. live births, deaths, fetal deaths, marriages, divorces, adoptions, legitimations, recognitions, annulments and legal separations". It keeps a continuous check on demographic changes.
In 1873, the Govt. of India passed the Births, Deaths and Marriages Registration Act. Even so, the registration system in India has tended to be very unreliable, the data being grossly deficient in regard to accuracy, timeliness, completeness and coverage.
Lay Reporting :
It is defined as "the collection of information, its use and its transmission to other levels of the health system by non professional health workers". Some countries have attempted to employ first line health workers (e.g. village health guides) to record births and deaths in a community.
C. Sample Registration System :
It is a dual record system consisting of continuous enumeration of births and deaths by an enumerator and an independent survey every 6 months by an investigator-supervisor. It was initiated in the mid 1960s to provide reliable estimates of birth and death rates at the national and state levels. It is a major source of health information.
D. Notification of diseases :
Its primary purpose is to effect prevention and/or control of the disease. It is also a valuable source of morbidity data. Diseases which are considered to be serious menaces to public health are included in the list of notifiable diseases.
Limitations :
(a) Covers only a small part of the total sickness in the community.
(b) The system suffers from a good deal of under reporting.
(c) Many cases, especially atypical and subclinical cases, escape notification due to non recognition.
E. Hospital records :
They constitute a basic and primary source of information about diseases prevalent in the community.
Drawbacks :
(a) Provide information only on those patients who seek medical care.
(b) Admission policy may vary from hospital to hospital.
(c) The population served by a hospital cannot be defined.
F. Disease Registers :
Provide a permanent record of diseases and the morbidity caused by them. If the reporting system is effective and the coverage is on a national basis, a register can provide useful data on morbidity and disease specific mortality.
G. Record Linkage :
Used to describe the process of bringing together records relating to one individual, the records originating at different times or places. Medical record linkage implies the assembly and maintenance, for each individual in a population, of a file of the more important records relating to that individual's health.
Problem : the volume of data accumulated. Therefore, in practice, record linkage has been applied only on a limited scale, e.g. twin studies, measurement of morbidity, chronic diseases, etc.
H. Environmental health data :
These statistics now provide data on various aspects of air, water and noise pollution; harmful food additives; industrial intoxicants etc.
J. Population surveys :
Carried out for epidemiological studies by trained teams to find the incidence or prevalence of health or disease in a community.
Provide useful information on :
Changing trends in health status.
Timely warning of public health hazards.
Feedback expected to modify policy and system.
Health surveys can be classified as :
(a) Health interview (face to face) survey
(b) Health examination survey
(c) Health records survey
(d) Mailed questionnaire survey
Health planners also need non quantifiable information, e.g. health policies, health legislation, public attitudes, programme costs, procedures and technologies.
Types of Data
Qualitative or discrete data :
When data are collected on the basis of attributes or qualities such as sex, malocclusion or cavities, they are called qualitative data. The attribute itself cannot be measured; only the number of persons having the same attribute varies and is counted.
e.g. Out of 100 people, 75 have diabetes, 15 have T.B. and 10 have anemia. Diabetes, T.B. and anemia are attributes, which cannot be measured in figures. Only the number of people having each can be determined.
Quantitative or continuous data :
When data are collected by measuring a characteristic that has a magnitude, such as height, they are called quantitative data.
e.g. The height of one person is 150 cm and that of another is 160 cm, and both are of the same age and sex. Persons with a height of 150 cm, or in the range 150-152 cm, may number 10, and those with 160 cm, or in the range 160-162 cm, may number 20. Thus we find both the characteristic and its frequency. Both vary from person to person as well as from group to group.
Presentation
Tabulation
Drawings
Tabulation :
It is the most common method. Data are presented in the form of columns and rows.
Simple tables :
Month and Year     Number of biopsies performed in Oral Pathology department
January 2010       15
June 2010          21
December 2010      26
No. of biopsies sent from different departments to Oral Pathology department :
Oral surgery   Oral Medicine   Cons and Endo   Pediatric Dept.   Perio.   Private Clinics
6              2               3               1                 1        2
11             NIL             2               2                 2        4
19             NIL             1               2                 1        3
Presentation of quantitative data is done through graphs. They are :
Histogram
Frequency polygon
Frequency curve
Line chart or graph
Cumulative frequency diagram
Scatter or dot diagram
Presentation of qualitative data is done through diagrams. They are :
Bar diagram
Pie or sector diagram
Pictogram or picture diagram
Map diagram or spot map
Histograms
Consists of a series of rectangles. Class intervals are given on the horizontal axis and frequencies on the vertical axis. The area of each rectangle is proportional to the frequency.
Frequency Polygon
Obtained by joining midpoints of histogram blocks at the height of frequency by straight lines usually forming a polygon.
Frequency curve :
When the number of observations is very large and the class interval is reduced, the frequency polygon loses its angulations and becomes a smooth curve known as a frequency curve.
Line Chart
Line diagrams are used to show the trend of events with the passage of time.
Bar Chart
The length of the bars, drawn vertically or horizontally, is proportional to the frequency of the variable. A suitable scale is chosen. Bars are usually equally spaced. They are of three types :
-Simple bar chart
-Multiple bar chart
-Component bar chart
In a component bar chart, the bars are divided into two or more parts, each part representing a certain item and proportional to the magnitude of that item.
Pie chart
In this, the frequencies of the groups are shown as segments of a circle. The angle of each segment denotes the frequency. The angle is calculated by :
$\text{angle} = \frac{\text{class frequency}}{\text{total observations}} \times 360^\circ$
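As a rough illustration of the angle calculation, the sketch below reuses the counts from the qualitative-data example (75 diabetes, 15 T.B., 10 anemia out of 100). The function name `sector_angles` is a hypothetical helper introduced here, not something from the source.

```python
def sector_angles(frequencies):
    """Return the pie-chart angle in degrees for each class frequency."""
    total = sum(frequencies.values())
    # angle = class frequency / total observations x 360
    return {label: count * 360 / total for label, count in frequencies.items()}


counts = {"Diabetes": 75, "T.B.": 15, "Anemia": 10}
angles = sector_angles(counts)
for label, angle in angles.items():
    print(f"{label}: {angle:.0f} degrees")
```

The angles of all segments should add up to 360 degrees, which is a quick check on the arithmetic.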
Pictogram
Popular method of presenting data to the common man.
Analysis
The average value in a distribution is the one central value around which all the other observations are concentrated. The average value helps :
To find the most characteristic value of a set of measurements.
To find which group is better off, by comparing the average of one group with that of another.
[K.park. Preventive and social medicine, 20th edition: McGraw-Hill Medical; 2009. 749]
Mean
Refers to the arithmetic mean. Individual observations are first added together and then divided by the number of observations. Addition of the observations is called summation and is denoted by $\Sigma$ or S. Individual observations are denoted by $x$, their number by $n$, and the mean by $\bar{x}$ (x bar).
$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\Sigma x}{n}$
e.g. The diastolic blood pressures of 10 individuals were 83, 75, 81, 79, 71, 95, 75, 77, 84, 90. The total was 810, which divided by 10 gives a mean of 81.0.
Advantages : It is easy to calculate.
Disadvantages : Influenced by extreme values.
Median
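When all the observations are arranged in either ascending or descending order, the middle observation is known as the median. In case of an even number of observations, the average of the two middle values is taken. The median is often a better indicator of central value than the mean, as it is not affected by extreme values.
e.g. For the ten diastolic readings given under Mean, the two middle values in ascending order are 79 and 81, so the median is $(79 + 81)/2 = 80$.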
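The mean and median of the ten diastolic readings can be checked with Python's standard `statistics` module. The extra reading of 200 is a made-up extreme value, added only to show how differently the two measures respond to it.

```python
from statistics import mean, median

diastolic = [83, 75, 81, 79, 71, 95, 75, 77, 84, 90]

print("mean   =", mean(diastolic))    # sum of observations / number of observations
print("median =", median(diastolic))  # average of the two middle values

# Hypothetical extreme reading (not in the source data).
with_extreme = diastolic + [200]
print("mean with extreme value   =", round(mean(with_extreme), 1))
print("median with extreme value =", median(with_extreme))
```

The mean is pulled up by about ten units by the single extreme reading, while the median moves by only one.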
Mode
The most frequently occurring observation, or most fashionable value, in a series of observations is called the mode.
E.g. diastolic blood pressure of 20 individuals is 85, 75, 81, 79, 71, 95, 75, 77, 75, 90, 71, 75, 79, 95, 75, 77, 84, 75, 81, 75. Here the most frequently occurring value is 75.
Advantages : It is easy to understand. Not affected by extreme items.
Disadvantages : Exact location is often uncertain and not clearly defined.
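A minimal sketch for finding the mode of the 20 diastolic readings above by counting how often each value occurs. It uses `collections.Counter` from the standard library.

```python
from collections import Counter

readings = [85, 75, 81, 79, 71, 95, 75, 77, 75, 90,
            71, 75, 79, 95, 75, 77, 84, 75, 81, 75]

# most_common(1) returns the single (value, count) pair with the highest count.
value, frequency = Counter(readings).most_common(1)[0]
print(f"mode = {value}, occurring {frequency} times out of {len(readings)}")
```

If two values tie for the highest count, `most_common(1)` reports only one of them, which reflects the disadvantage noted above: the mode is not always clearly defined.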
Interpretation
Test of Significance :
Whatever the sampling procedure or the care taken while selecting a sample, the sample statistics will differ from the population parameters.
Variations between 2 samples drawn from the same population may also occur. Similarly, differences in the results obtained by two research workers for the same investigation may be observed.
So it becomes important to find out the significance of the observed variation, i.e. whether it is due to chance or biological variation (statistically not significant) or due to the influence of some external factor (statistically significant). To test whether the observed variation is significant, various tests of significance are done.
Tests of significance can be broadly classified as :
Parametric tests
Non parametric tests
Parametric Tests
Parametric tests are those tests in which certain assumptions are made about the population :
The population from which the sample is drawn has a normal distribution.
The variances of the samples do not differ significantly.
The observations are truly numerical, so arithmetic procedures such as addition, division and multiplication can be used.
Since these tests make assumptions about the population parameters, they are called parametric tests.
They are usually used to test a difference. They are :
Student t test (paired or unpaired)
ANOVA
ANOVA
Analysis of variance
Investigations may not always be confined to the comparison of 2 samples only, e.g. we might like to compare the difference in vertical dimension obtained using 2 or more methods such as phonetics and swallowing. In such cases, where more than 2 samples are used, ANOVA can be used. ANOVA can also be used when measurements are influenced by several factors playing their role, e.g. factors affecting retention of a denture. ANOVA helps to decide which factors are more important.
Requirements
Data for each group are assumed to be independent and normally distributed.
Sampling should be at random.
-Where only one factor will affect the result between 2 groups
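The source does not spell out the ANOVA calculation. As a hedged sketch, the block below computes the standard one-way F ratio (between-group mean square divided by within-group mean square). The function name `one_way_anova_f` and the vertical-dimension readings are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between, df_within = k - 1, n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical vertical-dimension readings (mm) from three methods.
methods = [[61, 62, 63], [62, 63, 64], [63, 64, 65]]
f_ratio, df_b, df_w = one_way_anova_f(methods)
print(f"F = {f_ratio:.2f} with {df_b} and {df_w} degrees of freedom")
```

The F value is then compared with the tabulated F for the two degrees of freedom, in the same way a t value is compared with the t table.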
Student t test
It was given by W. S. Gosset, whose pen name was "Student". There are two types of Student t test :
1. Unpaired t test
2. Paired t test
Unpaired t test
Applied to unpaired data, i.e. observations made on individuals of 2 separate groups, to find the significance of the difference between the 2 means.
Sample size is less than 30, e.g. difference in accuracy of an impression using two different impression materials.
Steps :
1. Calculate the standard error of the difference between the means, $SE = SD \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$.
2. Calculate the observed difference between the means, $\bar{x}_1 - \bar{x}_2$.
3. Calculate $t = \frac{\bar{x}_1 - \bar{x}_2}{SE}$, i.e. observed difference / standard error.
4. Determine the degrees of freedom. For one sample this is one less than the number of observations, $n - 1$. Here the combined degrees of freedom are $(n_1 - 1) + (n_2 - 1)$.
5. Refer to the t table and find the probability of the t value corresponding to the degrees of freedom.
P < 0.05 : difference is significant.
P > 0.05 : difference is not significant.
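The steps of the unpaired t test can be sketched as follows. Two assumptions are made here that the source does not state: "SD" is taken to be the pooled standard deviation of the two groups, and the impression-accuracy readings are invented for illustration. The function name `unpaired_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def unpaired_t(sample1, sample2):
    """Return (t, degrees of freedom) for two independent samples."""
    n1, n2 = len(sample1), len(sample2)
    df = (n1 - 1) + (n2 - 1)
    # Assumption: SD is the pooled standard deviation of both groups.
    pooled_var = ((n1 - 1) * variance(sample1)
                  + (n2 - 1) * variance(sample2)) / df
    se = sqrt(pooled_var) * sqrt(1 / n1 + 1 / n2)
    t = (mean(sample1) - mean(sample2)) / se
    return t, df


# Hypothetical accuracy scores for two impression materials.
material_a = [5, 6, 7]
material_b = [1, 2, 3]
t, df = unpaired_t(material_a, material_b)
print(f"t = {t:.3f}, d.f. = {df}")
# The two-tailed table value of t for d.f. = 4 at P = 0.05 is 2.776,
# so a larger t corresponds to P < 0.05.
```

The table lookup is left as a comment because the standard library has no t distribution; in practice the computed t is compared with a printed table or a statistics package.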
Paired t test
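Used in samples of less than 30. Each individual gives a pair of observations, e.g. observations before and after taking a drug.
Steps :
1. Find the difference $y$ in each pair of observations and the mean difference $\bar{y}$.
2. Calculate the SD of the differences.
3. Calculate $SE = \frac{SD}{\sqrt{n}}$.
4. Determine $t = \frac{\bar{y}}{SE}$.
5. Determine the degrees of freedom. Since there is one sample, $df = n - 1$.
6. Refer to the t table and find the probability of the t value corresponding to the degrees of freedom.
P < 0.05 : difference is significant.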
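A minimal sketch of the paired t test steps, using invented before and after readings. The function name `paired_t` is a hypothetical helper, and the sample standard deviation (divisor n - 1) is assumed for "SD".

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """Return (t, degrees of freedom) for paired observations."""
    differences = [b - a for b, a in zip(before, after)]
    n = len(differences)
    se = stdev(differences) / sqrt(n)  # SE = SD / sqrt(n)
    t = mean(differences) / se         # t = mean difference / SE
    return t, n - 1


# Hypothetical diastolic readings before and after a drug.
before = [90, 92, 94, 96]
after = [88, 91, 91, 94]
t, df = paired_t(before, after)
print(f"t = {t:.3f}, d.f. = {df}")
# The two-tailed table value of t for d.f. = 3 at P = 0.05 is 3.182.
```

Because each individual acts as their own control, only the within-pair differences enter the calculation.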
Non Parametric Tests
In many biological investigations the research worker may not know the nature of the distribution or other required values of the population. Also, some biological measurements may not be true numerical values, so arithmetic procedures are not possible.
In such cases distribution free or non parametric tests are used, in which no assumptions are made about the population parameters, e.g.
Mann-Whitney test
Chi square test
Phi coefficient test
Fisher's exact test
Sign test
Friedman's test
Test of proportion
Used as an alternate test to find the significance of difference in 2 or more than 2 proportions
Test of association
To measure the probability of association between 2 discrete attributes, e.g. smoking and cancer.
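For a test of association, the chi square statistic is commonly computed from the observed (O) and expected (E) numbers as the sum of (O - E)² / E over all cells. The source lists the test but not the formula, so the sketch below is an illustration under that standard definition. The counts and the function name `chi_square` are hypothetical.

```python
def chi_square(table):
    """Return (chi square, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected number if the two attributes were not associated.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical counts: rows = smokers, non smokers; columns = cancer, no cancer.
observed_counts = [[30, 10],
                   [20, 40]]
chi2, df = chi_square(observed_counts)
print(f"chi square = {chi2:.2f}, d.f. = {df}")
# The table value of chi square for d.f. = 1 at P = 0.05 is 3.84.
```

A computed value above the table value corresponds to P < 0.05, i.e. an association that is unlikely to be due to chance alone.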
The alternate hypothesis states that the sample result is different, i.e. larger or smaller than the value of the population, or that the statistic of one sample is different from that of the other.
The null hypothesis is accepted or rejected depending on whether the result falls in the zone of acceptance or the zone of rejection. If the result of a sample falls within the area of mean ± 2 SE, the null hypothesis is accepted. This area of the normal curve is called the zone of acceptance for the null hypothesis.
If the result of a sample falls beyond the area of mean ± 2 SE, the null hypothesis of no difference is rejected and the alternate hypothesis is accepted. This area of the normal curve is called the zone of rejection for the null hypothesis.
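The acceptance rule above amounts to a simple range check. The sketch below uses made-up numbers (a mean of 81 and a standard error of 1.5), and `in_zone_of_acceptance` is a hypothetical helper name.

```python
def in_zone_of_acceptance(sample_result, mean, se):
    """True if the sample result lies within mean ± 2 SE."""
    return mean - 2 * se <= sample_result <= mean + 2 * se


# Hypothetical values: mean 81, standard error 1.5, so the zone is 78 to 84.
for result in (83.0, 85.0):
    if in_zone_of_acceptance(result, mean=81.0, se=1.5):
        print(result, "-> null hypothesis accepted")
    else:
        print(result, "-> null hypothesis rejected")
```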
The P value is determined using any of the previously mentioned methods. If P > 0.05, the difference is attributed to chance and is not statistically significant; if P < 0.05, the difference is attributed to some external factor and is statistically significant.
Probability or p value
The concept of probability is very important in statistics. Probability is the chance of occurrence of any event or permutation combination. It is denoted by p for a sample and P for a population. In the various tests of significance we are interested to know whether the observed difference between 2 samples is due to chance (sampling variation) or due to some external factor. The probability or P value is used to decide this.
P ranges from 0 to 1.
0 = there is no chance that the observed difference could be due to sampling variation.
1 = it is absolutely certain that the observed difference between 2 samples is due to sampling variation.
However, such extreme values are rare. P = 0.4 means the chances that the difference is due to sampling variation are 4 in 10. The chances that it is not due to sampling variation are then 6 in 10.
The essence of any test of significance is to find the P value and draw an inference.
If the P value is 0.05 or more : it is customary to accept that the difference is due to chance (sampling variation). The observed difference is said to be statistically not significant.
If the P value is less than 0.05 : the observed difference is taken to be due not to chance but to the role of some external factor. The observed difference is said to be statistically significant.
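As an illustration of the decision rule, the sketch below converts a Z value (number of standard errors) into a two-sided P value under the normal curve and applies the 0.05 cut-off. Using the normal curve is an assumption made here; for small samples the t table would be used instead. The names `two_sided_p` and `verdict` are hypothetical.

```python
from math import erfc, sqrt


def two_sided_p(z):
    """Two-sided P value for a standard normal deviate z."""
    return erfc(abs(z) / sqrt(2))


def verdict(p, cutoff=0.05):
    """Apply the customary 0.05 cut-off."""
    if p < cutoff:
        return "statistically significant"
    return "statistically not significant"


for z in (1.0, 1.96, 3.0):
    p = two_sided_p(z)
    print(f"Z = {z}: P = {p:.4f} -> {verdict(p)}")
```

A difference of about 2 standard errors sits right at the 0.05 boundary, which is why mean ± 2 SE is used as the zone of acceptance.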
Sampling
When a large number of individuals are to be studied, it is usually impractical to include each and every member, as this would be time consuming, costly and laborious. So sampling is done. Sampling is a process by which some units of a population are selected for study; by subjecting them to statistical computation, conclusions are drawn about the population from which the units were taken.
A good sample is representative of the entire population, is sufficiently large, and is unbiased. Such a sample will have statistics almost equal to the parameters of the entire population.
Precision
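Precision depends on the sample size. Ordinarily, the sample size should not be less than 30.
$\text{Precision} = \frac{\sqrt{n}}{s}$
where $n$ is the sample size and $s$ is the standard deviation. Precision is directly proportional to the square root of the sample size : the greater the sample size, the greater the precision.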
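The square-root relationship can be seen numerically. In the sketch below, the standard deviation of 10 is an arbitrary illustrative value and `precision` is a hypothetical helper name.

```python
from math import sqrt


def precision(n, s):
    """Precision = sqrt(n) / s for sample size n and standard deviation s."""
    return sqrt(n) / s


for n in (25, 100, 400):
    print(f"n = {n:3d}: precision = {precision(n, s=10.0)}")
```

Quadrupling the sample size only doubles the precision, so gains in precision become progressively more expensive as the sample grows.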
Unbiased character
The sample should be unbiased, i.e. every individual should have an equal chance of being selected in the sample. Thus a standard random sampling method should be used.
Non sampling errors can be taken care of by :
Using standardized instruments and criteria.
Single, double or triple blind trials.
Use of a control group.
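One way to give every individual an equal chance of selection is simple random sampling from a numbered list. The sketch below is an illustration only: the sampling frame of 200 student numbers, the seed and the helper name `simple_random_sample` are all assumptions.

```python
import random


def simple_random_sample(population, size, seed=None):
    """Draw `size` distinct units, each with an equal chance of selection."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, size)


# Hypothetical sampling frame: students numbered 1 to 200.
student_ids = list(range(1, 201))
chosen = simple_random_sample(student_ids, 30, seed=42)
print(sorted(chosen))
```

`random.sample` draws without replacement, so no student can be selected twice.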
Limitations
Statistics has several limitations :
It gives statistical and not substantive answers. The statistical conclusion refers to groups and not individuals. It only summarizes but does not interpret data.
Statistics can be misused by selective presentation of desired results. Computation is not an end in itself. It is a tool that can be used well or can be misused.
A human must have a clear idea of what is required of the computer and must instruct it accordingly.
The human must also be able to intelligently interpret the output from the computer. All who tinker with computers must remember the adage rubbish in/rubbish out.
Conclusion
Health information systems are the best means of getting reliable, relevant, up to date, adequate and reasonably complete information for health managers at all levels. Although they are a very helpful source of data, it has been very difficult to get information where it matters most, i.e. at the community level. Action should therefore be taken in this direction, and these systems should be used more frequently to obtain better and clearer results, mainly in research involving large populations.
References
K. Park. Preventive and social medicine, 20th edition. McGraw-Hill Medical; 2009. 743-756
Soben Peter. Essentials of preventive and community dentistry, 2nd edition. New Delhi : Arya; 2006. 21-50
B. K. Mahajan. Methods in biostatistics for medical students and research workers, 6th edition. New Delhi : Jaypee Brothers; 2006. 1-39