
Biostatistics - I

Presented by : Kush Pathak

Contents

- Introduction
- History
- Applications and uses of biostatistics in science
- Common statistical terms
- Common symbols used
- Data
  (a) Collection and types
  (b) Presentation
  (c) Analysis
  (d) Interpretation
- Limitations
- Conclusion
- References

Introduction
"There are three kinds of lies: lies, damned lies, and statistics." (Benjamin Disraeli / Mark Twain)

The word statistics conveys a variety of meanings to people. It is known for the handling of data, both in general and in the field of research. The word comes from the Italian word statista, meaning statesman, or the German word Statistik, meaning political state. Statistics developed from two main sources: (1) Government records (2) Mathematics. John Graunt (1620 - 1674) is regarded as the father of health statistics.

Definitions
- Statistics : The science of collecting, summarizing, presenting, analyzing and interpreting data is called statistics.

- Biostatistics : The method of collecting, organizing, analyzing, tabulating and interpreting data related to living organisms and human beings is called biostatistics.

[Soben Peter. Essentials of preventive and community dentistry, 2nd edition. New Delhi : Arya; 2006. 824]

HISTORY

John Graunt (1620 - 1674)

Father of Health Statistics

THE HISTORY OF STATISTICS HAS ITS ROOTS IN BIOLOGY

Sir Francis Galton


Pioneer of fingerprint classification. Studied the heredity of quantitative traits.

Regression & correlation

Karl Pearson

Polymath - Studied genetics - Correlation coefficient

χ² test - Standard deviation


Sir Ronald Fisher


The Genetical Theory of Natural Selection Founder of population genetics.

Analysis of variance Likelihood.


P-value

APPLICATIONS AND USES OF BIOSTATISTICS IN SCIENCE

IN PHYSIOLOGY AND ANATOMY :

- To define the limits of normality for variables such as height, weight, blood pressure etc. in a population.
- Variation beyond the natural limits may be pathological, i.e. abnormal due to the play of certain external factors.
- To find the correlation between two variables such as height and weight.

IN PHARMACOLOGY
- To find the action of drugs
- To compare the action of two drugs or of two successive dosages of the same drug
- To find the relative potency of a new drug with respect to a standard drug

IN MEDICINE
- To compare the efficacy of a particular drug, operation or line of treatment
- To find the association between two attributes such as cancer and smoking
- To identify signs and symptoms of disease

IN COMMUNITY MEDICINE AND PUBLIC HEALTH

- To test the usefulness of a vaccine in the field
- In epidemiologic studies, the role of causative factors is statistically tested

FOR STUDENTS :
By learning the methods in biostatistics a student learns to evaluate articles published in medical and dental journals or papers read in medical and dental conferences. He also understands the basic methods of observation in his clinical practice and research.

Common Statistical terms


VARIABLES

A characteristic that takes different values for different persons, places or things. A quantity that varies between limits, e.g. height, weight, blood pressure, age etc. Denoted as X, and for an orderly series as X1, X2, X3, ..., Xn.

Σ (sigma) stands for summation of results or observations.

CONSTANT :

Quantities that do not vary, such as π = 3.141 and e = 2.718. These do not require statistical study. In biostatistics, the mean and standard deviation are considered constant for a population.

OBSERVATION :

An event and its measurement, such as blood pressure (the event) and 120 mm of Hg (the measurement).

OBSERVATIONAL UNIT :
Source that gives observations, such as an object or a person. In medical statistics, the term individual or subject is used more often.

DATA:
Set of values recorded on one or more observational units.

POPULATION :
Population includes all persons, events and objects under study. It may be finite or infinite.

SAMPLE :

Defined as a part of a population generally selected so as to be representative of the population whose variables are under study.

PARAMETER
It is a constant that describes a population, e.g. 40% of the students in a college are girls. This describes the population, hence it is a parameter.

STATISTIC

A statistic is a value that describes a sample, e.g. in a sample of 200 students from the same college, 45% are girls. This 45% is a statistic, as it describes the sample.

ATTRIBUTE
A characteristic based on which the population can be described into categories or classes e.g. gender, caste, religion.

Commonly used symbols



=            Equal to
<            Less than
>            Greater than
Z            No. of standard deviations
%            Percentage
r            Pearson's correlation coefficient
ρ            Spearman's rank correlation coefficient
d.f. or f    Degrees of freedom
K            Number of groups or classes
P            Probability
O            Observed number
E            Expected number

DATA
Set of values recorded on one or more observational units is called data. It is of two types :

- QUALITATIVE (discrete) data
- QUANTITATIVE (continuous) data


Collection of health information


A. Census :

The United Nations defines a census as "the total process of collecting, compiling and publishing demographic, economic and social data pertaining, at a specified time or times, to all persons in a country or delimited territory". It is an important source of health information. The first regular census in India was taken in 1881, and others have taken place at 10-year intervals. The primary function of the census is to provide demographic information such as the total count of the population and its breakdown into groups and subgroups, such as age and sex distribution.

Population census provides basic data (by age and sex) needed to compute vital statistical rates, and other health, demographic and socio economic indicators.

B. Registration of vital events :

The United Nations defines a vital event registration system as including "legal registration, statistical recording and reporting of the occurrence of, and the collection, compilation, presentation, analysis and distribution of statistics pertaining to vital events, i.e. live births, deaths, fetal deaths, marriages, divorces, adoptions, legitimations, recognitions, annulments and legal separations". It keeps a continuous check on demographic changes.

In 1873, the Govt. of India passed the Births, Deaths and Marriages Registration Act. Even so, the registration system in India tended to be very unreliable, the data being grossly deficient in regard to accuracy, timeliness, completeness and coverage.

Due to this other actions were taken :

o The Central Births and Deaths Registration act, 1969 :


The Central Births and Deaths Registration Act was promulgated in 1969 and came into force on 1st April 1970. The time limit for registering births is 14 days and that for deaths is 7 days. In case of any default, a fine of Rs. 50 was imposed.

o Lay Reporting :

It is defined as the "collection of information, its use and transmission to other levels of the health system by non professional health workers". Some countries have attempted to employ first line health workers (e.g. village health guides) to record births and deaths in a community.

C. Sample Registration system (SRS) :

It is a dual record system consisting of continuous enumeration of births and deaths by an enumerator and an independent survey every 6 months by an investigator-supervisor. It was initiated in the mid 1960s to provide reliable estimates of birth and death rates at the national and state levels. It is a major source of health information.

D. Notification of diseases :

Its primary purpose is to effect prevention and/or control of disease. It is also a valuable source of morbidity data. Diseases which are considered to be serious menaces to public health are included in the list of notifiable diseases.

Limitations :
(a) Covers only a small part of the total sickness in the community.
(b) The system suffers from a good deal of under reporting.
(c) Many cases, especially atypical and subclinical cases, escape notification due to non recognition.

E. Hospital records :

They constitute a basic and primary source of information about diseases prevalent in the community.

Drawbacks :
(a) Provide information on only those patients who seek medical care.
(b) Admission policy may vary from hospital to hospital.
(c) The population served by a hospital cannot be defined.

F. Disease Registers :

They provide a permanent record of diseases and the morbidity caused by them. If the reporting system is effective and the coverage is on a national basis, a register can provide useful data on morbidity and disease specific mortality.

G. Record Linkage :

The term describes the process of bringing together records relating to one individual, the records originating at different times or places. Medical record linkage implies the assembly and maintenance, for each individual in a population, of a file of the more important records relating to his health. Problem : the volume of data accumulated. Therefore, in practice, record linkage has been applied only on a limited scale, e.g. in twin studies, measurement of morbidity, chronic diseases etc.

H. Environmental health data :

These statistics now provide data on various aspects of air, water and noise pollution; harmful food additives; industrial intoxicants etc.

I. Health manpower statistics :


Relates to physicians, dentists, pharmacists, veterinarians, nurses, technicians etc. Their records are maintained by state medical / dental / nursing councils and directorates of medical education.

J. Population surveys :

Carried out for epidemiological studies by trained teams to find the incidence or prevalence of health or disease in a community.

Provide useful information on :
- Changing trends in health status
- Timely warning of public health hazards
- Feedback expected to modify policy and system

Health surveys can be classified as :
(a) Health interview (face to face) survey
(b) Health examination survey
(c) Health records survey
(d) Mailed questionnaire survey

K. Non- quantifiable information :

Health planners also need non quantifiable information, e.g. health policies, health legislation, public attitudes, programme costs, procedures and technologies.

Types of Data
Qualitative or discrete data :

When data is collected on the basis of attributes or qualities such as sex, malocclusion, cavities etc., it is called qualitative data. The attribute itself is not measured; the number of persons having the same attribute is the variable, and it is counted.

e.g. Out of 100 people, 75 have diabetes, 15 have T.B. and 10 have anemia. Diabetes, T.B. and anemia are attributes which cannot be measured in figures. Only the number of people having them can be determined.

Quantitative or continuous data :


When data is collected through measurement, using calipers etc., it is called quantitative data. In such a classification there are two variables :
- Characteristic, such as height
- Frequency, i.e. the number of persons with the same characteristic and in the same range

e.g. The height of one person is 150 cm and that of another is 160 cm, and both are of the same age and sex. There may be 10 persons with a height of 150 cm or in the range 150 - 152 cm, and 20 persons with a height of 160 cm or in the range 160 - 162 cm. Thus we find out the characteristic and the frequency. Both vary from person to person as well as from group to group.

Presentation
- Tabulation
- Drawings

Tabulation :
- Is the most common method
- Data is presented in the form of columns and rows

It can be of the following types :
- Simple tables
- Frequency distribution tables

Simple tables :
Month and Year     Number of biopsies performed in Oral Pathology department
January 2010       15
June 2010          21
December 2010      26

Frequency Distribution tables :


In a frequency distribution table, the data is first split into convenient groups (class intervals) and the number of items (frequency) occurring in each group is shown in the adjacent column.

No. of biopsies sent from different departments to Oral Pathology department.

Year and month   Oral surgery   Oral Medicine   Cons and Endo   Pediatric Dept.   Perio.   Private Clinics
January 2010     1              6               2               3                 1        2
June 2010        11             NIL             2               2                 2        4
Dec 2010         19             NIL             1               2                 1        3
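The grouping step behind a frequency distribution table can be sketched in a few lines of Python. This is only an illustration: the helper name `frequency_table`, the class width of 5 cm and the heights are assumptions made for the example, not data from the slides.

```python
from collections import Counter

def frequency_table(values, start, width):
    """Count how many values fall in each class interval [lower, lower + width)."""
    # Map every value to the index of its class interval, then count per index.
    counts = Counter((v - start) // width for v in values)
    table = []
    for k in sorted(counts):
        lower = start + k * width
        table.append((lower, lower + width, counts[k]))
    return table

# Hypothetical heights in cm (illustrative values only)
heights = [150, 151, 153, 155, 156, 158, 160, 161, 161, 164]

for lower, upper, freq in frequency_table(heights, start=150, width=5):
    print(f"{lower} to just under {upper} cm: {freq}")
```

Each row of the result is one class interval with its frequency, which is the layout described for a frequency distribution table.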

Charts and Drawings :


- Useful method of presenting statistical data
- Powerful impact on the imagination of people

Presentation of quantitative data is done through graphs. They are :
- Histogram
- Frequency polygon
- Frequency curve
- Line chart or graph
- Cumulative frequency diagram
- Scatter or dot diagram

Presentation of qualitative data is done through diagrams. They are :
- Bar chart
- Pie or sector chart
- Pictogram or picture diagram
- Map diagram or spot map

Histograms

Pictorial presentation of frequency distribution.

Consists of a series of rectangles. Class intervals are given on the horizontal axis and frequencies on the vertical axis. The area of each rectangle is proportional to the frequency.

Frequency Polygon
Obtained by joining midpoints of histogram blocks at the height of frequency by straight lines usually forming a polygon.

Frequency curve :
When the number of observations is very large and the class interval is reduced, the frequency polygon loses its angulations and becomes a smooth curve known as a frequency curve.

Line Chart

Line diagrams are used to show the trend of events with the passage of time.

Cumulative frequency diagram


Graphical representation of cumulative frequency. The cumulative frequency of a class is obtained by adding its frequency to the frequencies of all the previous classes.
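As a small illustration of how cumulative frequencies are built up class by class, the following Python sketch uses a running total. The class frequencies are hypothetical values chosen for the example.

```python
from itertools import accumulate

# Hypothetical class frequencies, in order of class interval
frequencies = [3, 3, 4]

# Running total: each entry is the frequency of its class plus all previous classes
cumulative = list(accumulate(frequencies))

print(cumulative)  # [3, 6, 10]
```

The last cumulative frequency always equals the total number of observations, which is a quick check on the arithmetic.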

Scatter or Dot diagram


Shows relationship between two variables. If the dots are clustered showing a straight line, it shows a relationship of linear nature.

Bar Chart
Length of bars drawn vertical or horizontal is proportional to frequency of variable. Suitable scale is chosen. Bars are usually equally spaced. They are of three types : -Simple bar chart -Multiple bar chart -Component bar chart

Simple bar chart

Multiple bar chart :

Two or more variables are grouped together

Component bar chart :

Bars are divided into two or more parts. Each part representing certain item and proportional to magnitude of that item.

Pie chart
In this chart the frequencies of the groups are shown as segments of a circle. The angle of each segment denotes the frequency. The angle is calculated by

$\text{angle} = \frac{\text{class frequency}}{\text{total observations}} \times 360^\circ$
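The angle calculation for a pie chart (class frequency divided by total observations, multiplied by 360) can be checked with a short Python sketch. It reuses the monthly biopsy counts from the simple table earlier in the presentation; treating them as the groups of a pie chart is an assumption made only for illustration.

```python
# Biopsy counts taken from the simple table (used here purely as example groups)
biopsies = {"January 2010": 15, "June 2010": 21, "December 2010": 26}

total = sum(biopsies.values())

# Angle of each sector = class frequency / total observations * 360 degrees
angles = {month: count / total * 360 for month, count in biopsies.items()}

for month, angle in angles.items():
    print(f"{month}: {angle:.1f} degrees")
```

The sector angles should always add up to 360 degrees.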

Pictogram
Popular method of presenting data to the common man.

Spot map or Map diagram


These maps are prepared to show geographic distribution of frequencies of characteristics.

Analysis
The average value in a distribution is the one central value around which all the other observations are concentrated. The average value helps :
- To find the most characteristic value of a set of measurements.
- To find which group is better off, by comparing the average of one group with that of another.

[K.park. Preventive and social medicine, 20th edition: McGraw-Hill Medical; 2009. 749]

Most commonly used averages are


- Mean
- Median
- Mode

Mean
Refers to the arithmetic mean. Individual observations are first added together, and the total is then divided by the number of observations. Addition of the observations is called summation and is denoted by Σ or S. Individual observations are denoted by x and the mean is denoted by x̄ (x bar).

$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum x}{n}$

e.g. The diastolic blood pressure of 10 individuals was 83, 75, 81, 79, 71, 95, 75, 77, 84, 90. The total was 810, which was then divided by 10, giving a mean of 81.0.

Advantages : It is easy to calculate.
Disadvantages : Influenced by extreme values.
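The arithmetic mean of the ten diastolic blood pressure readings can be reproduced with a minimal Python sketch. The variable names are chosen for the example.

```python
# Diastolic blood pressure of 10 individuals (values from the example above)
dbp = [83, 75, 81, 79, 71, 95, 75, 77, 84, 90]

total = sum(dbp)          # summation of the observations
mean = total / len(dbp)   # divide by the number of observations

print(total, mean)  # 810 81.0
```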

Median
When all the observations are arranged in either ascending or descending order, the middle observation is known as the median. In the case of an even number of observations, the average of the two middle values is taken. When extreme values are present, the median is a better indicator of the central value than the mean, as it is not affected by them.

Diastolic blood pressure (unarranged) : 83, 75, 81, 79, 71, 95, 75, 77, 84

Diastolic blood pressure (arranged) : 71, 75, 75, 77, 79 (median), 81, 83, 84, 95

In case there are 10 values instead of 9 :

Diastolic blood pressure (unarranged) : 83, 75, 81, 79, 71, 95, 75, 77, 84, 90

Diastolic blood pressure (arranged) : 71, 75, 75, 77, 79, 81, 83, 84, 90, 95

Median = (79 + 81) / 2 = 80
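Both median cases (an odd and an even number of observations) can be verified with Python's standard `statistics` module. This is a small sketch using the same readings as the example.

```python
import statistics

nine_values = [83, 75, 81, 79, 71, 95, 75, 77, 84]
ten_values = nine_values + [90]

# sorted() shows the arranged series; statistics.median does the sorting itself
print(sorted(nine_values))
print(statistics.median(nine_values))  # middle value of 9 observations: 79
print(statistics.median(ten_values))   # average of the two middle values: 80.0
```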

Mode
The most frequently occurring observation, or most fashionable value, in a series of observations is called the mode.

E.g. diastolic blood pressure of 20 individuals is 85, 75, 81, 79, 71, 95, 75, 77, 75, 90, 71, 75, 79, 95, 75, 77, 84, 75, 81, 75. Here the most frequently occurring value is 75.

Advantages : It is easy to understand. Not affected by extreme items.
Disadvantages : Exact location is often uncertain and not clearly defined.

[Therefore, mode is not often used in biological or medical statistics.]
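Finding the mode amounts to counting how often each value occurs. A minimal Python sketch using the 20 diastolic readings from the example:

```python
from collections import Counter

dbp = [85, 75, 81, 79, 71, 95, 75, 77, 75, 90,
       71, 75, 79, 95, 75, 77, 84, 75, 81, 75]

# most_common(1) returns the value with the highest count and that count
mode_value, mode_count = Counter(dbp).most_common(1)[0]

print(mode_value, mode_count)  # 75 7
```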

Interpretation
Test of Significance :
Whatever the sampling procedure or the care taken while selecting a sample, the sample statistics will differ from the population parameters.

Variations between 2 samples drawn from the same population may also occur. Similarly, differences may be observed between the results of two research workers for the same investigation.

So, it becomes important to find out the significance of this observed variation, i.e. whether it is due to chance or biological variation (statistically not significant) OR due to the influence of some external factor (statistically significant). To test whether the variation observed is of significance, various tests of significance are done.

Tests of significance can be broadly classified as :
- Parametric tests
- Non parametric tests

Parametric Tests
Parametric tests are those tests in which certain assumptions are made about the population :

- The population from which the sample is drawn has a normal distribution.
- The variances of the samples do not differ significantly.
- The observations are truly numerical, so arithmetic procedures such as addition, division and multiplication can be used.

Since these tests make assumptions about the population parameters, they are called parametric tests.

These are usually used to test differences. They are :
- Student t test (paired or unpaired)
- ANOVA

ANOVA

Analysis of variance
Investigations may not always be confined to the comparison of 2 samples only, e.g. we might like to compare the difference in vertical dimension obtained using 2 or more methods such as phonetics and swallowing. In such cases, where more than 2 samples are used, ANOVA can be used. ANOVA can also be used when measurements are influenced by several factors playing their role, e.g. factors affecting the retention of a denture. ANOVA helps to decide which factors are more important.

Requirements
Data for each group are assumed to be independent and normally distributed Sampling should be at random

One way ANOVA :

- Where only one factor affects the result between groups

Two way ANOVA

- Where 2 factors affect the result or outcome

Multi way ANOVA

- Where three or more factors affect the result or outcome between groups

Student t test
It was given by W. S. Gosset, whose pen name was "Student". There are two types of Student t test :
1. Unpaired t test
2. Paired t test

Unpaired t test
Applied to unpaired data of observation made on individuals of 2 separate groups to find the significance of difference between 2 means.

Sample size is less than 30. e.g. difference in accuracy in an impression using two different impression materials

Steps in unpaired t Test are :


1. Calculate the means of the two samples, $\bar{x}_1$ and $\bar{x}_2$.
2. Calculate the combined standard deviation (SD).
3. Calculate the standard error (SE) of the difference between the means, which is given by $SE = SD \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$.
4. Calculate the observed difference between the means, $\bar{x}_1 - \bar{x}_2$.
5. Calculate the t value = observed difference / standard error.
6. Determine the degrees of freedom, which is one less than the number of observations in a sample $(n - 1)$. Here the combined degrees of freedom will be $(n_1 - 1) + (n_2 - 1)$.
7. Refer to the t table and find the probability of the t value corresponding to the degrees of freedom.

P < 0.05 states that the difference is significant.
P > 0.05 states that the difference is not significant.
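The steps of the unpaired t test can be sketched in Python as below. Several things here are assumptions for illustration: the function name `unpaired_t`, the two hypothetical samples, and the use of the usual pooled formula for the combined standard deviation (sum of squared deviations of both samples divided by the combined degrees of freedom), which the slides do not spell out. The probability is still read from a t table, as in the steps; the sketch only computes t and the degrees of freedom.

```python
import math

def unpaired_t(sample1, sample2):
    """Return (t, degrees of freedom) for two independent samples."""
    n1, n2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2

    # Combined degrees of freedom: (n1 - 1) + (n2 - 1)
    df = (n1 - 1) + (n2 - 1)

    # Combined (pooled) SD: assumed formula, not given explicitly in the slides
    ss1 = sum((x - mean1) ** 2 for x in sample1)
    ss2 = sum((x - mean2) ** 2 for x in sample2)
    combined_sd = math.sqrt((ss1 + ss2) / df)

    # Standard error of the difference between the two means
    se = combined_sd * math.sqrt(1 / n1 + 1 / n2)

    # t = observed difference between means / standard error
    t = (mean1 - mean2) / se
    return t, df

# Hypothetical measurements from two separate groups (illustrative only)
group_a = [10, 12, 14, 16, 18]
group_b = [8, 9, 10, 11, 12]

t_value, df = unpaired_t(group_a, group_b)
print(f"t = {t_value:.2f}, df = {df}")
```

The computed t is then compared with the tabulated value for the same degrees of freedom at the 0.05 level to decide whether the difference is significant.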

Paired t test

It is applied to paired data of observations from one sample only.

Used when the sample size is less than 30. Each individual gives a pair of observations, e.g. an observation before and an observation after taking a drug.

The steps involved are :


1. Calculate the difference in each pair of observations, i.e. before and after : $x_1 - x_2 = y$.
2. Calculate the mean of these differences, $\bar{y}$.
3. Calculate the SD of the differences.
4. Calculate $SE = SD / \sqrt{n}$.
5. Determine $t = \bar{y} / SE$.
6. Determine the degrees of freedom. Since there is one sample, $df = n - 1$.
7. Refer to the t table and find the probability of the t value corresponding to the degrees of freedom.

P < 0.05 states that the difference is significant.
P > 0.05 states that the difference is not significant.
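The paired t test steps translate almost line by line into Python. In this sketch the function name `paired_t` and the before/after readings are hypothetical, and `statistics.stdev` is used for the SD of the differences (the sample SD, with n - 1 in the denominator), which is an assumption about which SD is meant.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t, degrees of freedom) for paired observations."""
    # Difference within each pair: y = x1 - x2
    y = [b - a for b, a in zip(before, after)]
    n = len(y)

    mean_y = sum(y) / n
    sd = statistics.stdev(y)   # sample SD of the differences
    se = sd / math.sqrt(n)     # SE = SD / sqrt(n)

    t = mean_y / se            # t = mean difference / SE
    return t, n - 1            # one sample, so df = n - 1

# Hypothetical readings before and after a drug (illustrative only)
before = [120, 130, 125, 135, 140]
after = [115, 128, 120, 130, 132]

t_value, df = paired_t(before, after)
print(f"t = {t_value:.2f}, df = {df}")
```

As with the unpaired test, the probability is then looked up in a t table for the given degrees of freedom.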

Non Parametric tests

In many biological investigations the research worker may not know the nature of the distribution or other required values of the population. Also, some biological measurements may not be true numerical values, so arithmetic procedures are not possible in such cases.

In such cases distribution free or non parametric tests are used, in which no assumptions are made about the population parameters, e.g.
- Mann Whitney test
- Chi square test
- Phi coefficient test
- Fisher's exact test
- Sign test
- Friedman's test

Chi square test


The chi square test, unlike the z and t tests, is a non parametric test. The test involves the calculation of a quantity called chi square. Chi square is denoted by χ². It was developed by Karl Pearson.

The most important applications of the chi square test in medical statistics are :
- Test of proportion
- Test of association
- Test of goodness of fit

Test of proportion
Used as an alternate test to find the significance of difference in 2 or more than 2 proportions

Test of association
To measure the probability of association between 2 discrete attributes, e.g. smoking and cancer

Test of goodness of fit


Tests whether the observed values of a character differ from the expected value by chance or due to play of some external factor
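The slides do not give the chi square formula itself. The sketch below assumes the standard form, the sum of (O - E)² / E over all classes, with O and E being the observed and expected numbers from the symbol list. The function name and the counts are hypothetical.

```python
def chi_square(observed, expected):
    """Chi square statistic: sum of (O - E)^2 / E over all classes (assumed standard formula)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical goodness-of-fit example: 50 subjects expected to split evenly into 2 classes
observed = [30, 20]
expected = [25, 25]

chi2 = chi_square(observed, expected)
df = len(observed) - 1  # degrees of freedom for goodness of fit: number of classes - 1

print(chi2, df)  # 2.0 1
```

The calculated value is compared with the tabulated chi square value for the same degrees of freedom to find the probability.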

Stages in performing Tests of Significance


1. State the null hypothesis
2. State the alternative hypothesis
3. Accept or reject the null hypothesis
4. Finally determine the p value

State the null hypothesis


Null Hypothesis, is a hypothesis of no difference between statistics of a sample and parameter of the population or between statistics of two samples. It nullifies the claim that the experimental result is different from or better than the one observed already

State the alternative hypothesis



It states, that the sample result is different i.e. larger or smaller than the value of population or statistics of one sample is different from the other.

Accept or reject the null hypothesis :

The null hypothesis is accepted or rejected depending on whether the result falls in the zone of acceptance or the zone of rejection. If the result of a sample falls within the area of mean ± 2 SE, the null hypothesis is accepted.

This area of the normal curve is called the zone of acceptance for the null hypothesis. If the result of a sample falls beyond the area of mean ± 2 SE, the null hypothesis of no difference is rejected and the alternative hypothesis is accepted. This area of the normal curve is called the zone of rejection for the null hypothesis.
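The rule of accepting the null hypothesis inside mean ± 2 SE and rejecting it outside can be written as a one-line check. The function name and the numbers in the usage lines are hypothetical.

```python
def in_zone_of_acceptance(sample_result, mean, se):
    """True if the sample result lies within mean +/- 2 SE (null hypothesis accepted)."""
    return abs(sample_result - mean) <= 2 * se

# Hypothetical values: population mean 80, standard error 1.5
print(in_zone_of_acceptance(82, mean=80, se=1.5))  # True  -> null hypothesis accepted
print(in_zone_of_acceptance(84, mean=80, se=1.5))  # False -> null hypothesis rejected
```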

Finally determining the P value :

The P value is determined using any of the previously mentioned methods. If p > 0.05, the difference is attributed to chance and is not statistically significant, but if p < 0.05 the difference is attributed to some external factor and is statistically significant.

Probability or p value
The concept of probability is very important in statistics. Probability is the chance of occurrence of any event or permutation combination. It is denoted by p for a sample and P for a population. In various tests of significance we are often interested to know whether the observed difference between 2 samples is due to chance (sampling variation) or due to some external factor. The probability or p value is used to decide this.

P ranges from 0 to 1.
0 = there is no chance that the observed difference could be due to sampling variation.
1 = it is absolutely certain that the observed difference between 2 samples is due to sampling variation.
However, such extreme values are rare. P = 0.4 means that the chances that the difference is due to sampling variation are 4 in 10.

Obviously, the chances that it is not due to sampling variation will be 6 in 10. The essence of any test of significance is to find the p value and draw an inference. If the p value is 0.05 or more, it is customary to accept that the difference is due to chance (sampling variation). The observed difference is said to be statistically not significant.

If the p value is less than 0.05, the observed difference is not due to chance but due to the role of some external factor. The observed difference here is said to be statistically significant.

Sampling
When a large number of individuals is to be studied, it is impossible to include each and every member, as this would be time consuming, costly and laborious. So, sampling is done. Sampling is a process by which some units of a population are selected for study and, by subjecting them to statistical computation, conclusions are drawn about the population from which these units are drawn.

The sample taken should be representative of the entire population. It should be sufficiently large and unbiased. Such a sample will have statistics almost equal to the parameters of the entire population.

Two main characteristics of a representative sample are :
- Precision
- Unbiased character

Precision
Precision depends on sample size. Ordinarily the sample size should not be less than 30.

$\text{Precision} = \frac{\sqrt{n}}{s}$

n = sample size, s = standard deviation

Precision is directly proportional to the square root of the sample size. The greater the sample size, the greater the precision.

Thus, to obtain precision, the sample size needs to be increased.
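Because precision is proportional to the square root of the sample size, quadrupling the sample only doubles the precision. A short Python sketch, assuming precision is taken as the square root of n divided by s, with hypothetical values for n and s:

```python
import math

def precision(n, s):
    """Precision taken as sqrt(n) / s, where n is sample size and s is standard deviation."""
    return math.sqrt(n) / s

# Hypothetical standard deviation of 10, with two different sample sizes
print(precision(25, 10))   # 0.5
print(precision(100, 10))  # 1.0  (four times the sample, twice the precision)
```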

Unbiased character
The sample should be unbiased, i.e. every individual should have an equal chance of being selected in the sample. Thus a standard random sampling method should be used. Non sampling errors can be taken care of by :
- Using standardized instruments and criteria
- Single, double or triple blind trials
- Use of a control group

Limitations
Statistics has several limitations :
- It gives statistical and not substantive answers.
- The statistical conclusion refers to groups and not individuals.
- It only summarizes but does not interpret data.

Statistics can be misused by selective presentation of desired results. Computation is not an end in itself. It is a tool that can be used well or can be misused.

A human must have a clear idea of what is required of the computer and must instruct it accordingly.

The human must also be able to intelligently interpret the output from the computer. All who tinker with computers must remember the adage rubbish in/rubbish out.

Conclusion
Health information systems are the best means of getting reliable, relevant, up to date, adequate and reasonably complete information to health managers at all levels. Although they are a very helpful source for the collection of data, it has been very difficult to get information where it matters most, i.e. at the community level. So, action should be taken in this direction, and these systems should be used more frequently for better and clearer results, mainly in research involving large populations.

References
K. Park. Preventive and social medicine, 20th edition : McGraw-Hill Medical ; 2009. 743 - 756
Soben Peter. Essentials of preventive and community dentistry, 2nd edition. New Delhi : Arya ; 2006. 21 - 50
B. K. Mahajan. Methods in Biostatistics for medical students and research workers, 6th edition. New Delhi : Jaypee Brothers ; 2006. 1 - 39
