
Chris Harding

Statistical Analysis: US House Committee on Veterans Affairs Representation Effect on Number of Compensated Veterans
Page 1 of 22
Date: 12 September 2013
US Veteran
United States of America
CC: US House Committee on Veterans Affairs; US Senate Committee on Veterans Affairs

To Whom It May Concern:
First, this paper appears long because of its tables and graphs. Also, I am thankful that an MD with a PhD in statistics helped
me tweak my analysis for the better and verified my overall conclusions.
Advocacy
As a 100% schedular Total and Permanent disabled veteran, I have time to advocate for veterans. Sadly, I cannot work
because of my illnesses, but I do put a lot of work into veteran advocacy. In addition to sending volumes of e-mails to the US
Congress, the US President, my congressional representatives, and the media, I hope to use some of my statistical knowledge as
well. Since I have had some limited working experience with statistical analysis as a chemical engineer in the pharmaceutical
industry, I have decided to apply my knowledge to veterans advocacy while learning new tools and methods too.
Thankfully, R is an open-source, powerful, and free statistical software package[4]. Also, I have discovered that some
statisticians are willing to share their valuable knowledge, which is something they are often paid for. As an example, an MD with
a PhD in statistics recently suggested that I evaluate the current data with a linear mixed model, which I have learned to be
quite powerful.
Purpose
I will not lie. The topics in this paper are complicated. I have spent considerable time on the topic of linear mixed modeling
and have a basic understanding. I am not a mathematician or statistician. Still, I believe my statistical outcomes are interesting.
The main outcome is that the average number of compensated veterans from the 26 congressional districts that make up the
US House Committee on Veterans Affairs is not significantly different from the average number of compensated veterans
from 26 randomly selected congressional districts. Also, I will show that there is a very good linear fit between the logarithm
of compensated veterans, congressional district population, and the logarithm of total veterans. Although congressional
district population was initially included in the linear mixed model and the linear model, the reader will discover that
congressional district population is not needed to model the average number of compensated veterans. In other words, my
statistical analysis found that congressional district population had a statistically insignificant effect on the prediction of
the log of compensated veterans. I also determined that, although a small random effect that explained some of the
variability was noted, a linear mixed model with a random effect of Sample was not needed, and a linear fit is sufficient to
model the data. The Sample random effect variance was so small, 0.0024212 (standard deviation 0.049205), that the variance
attributable to the sample, H1 (26 US House Committee on Veterans Affairs congressional districts) versus R1 (26 randomly
selected congressional districts), is essentially zero. Also, I used the logarithm transformation because the log(data) was more
normal than the data without the logarithm transformation (see below). Normal data looks similar to a bell curve when
graphed in a histogram and is often required by statistical tools.
In my next analysis, I might compare number of compensated veterans per each Veteran Affairs Regional Office (VARO)
region and determine if the VAROs differ statistically when considering total number of compensated veterans.
As previously mentioned, but repeated here because some readers are not familiar with statistical methods, I will be comparing
26 congressional districts that are a part of the US House Committee on Veterans Affairs to 26 randomly selected
congressional districts. Because of districting, the congressional district population size is similar throughout the 52
congressional districts. I was interested in whether US House Committee on Veterans Affairs membership
increased the average number of compensated veterans in the sample. To extrapolate, would a veteran with a congressional
representative on the US House Committee on Veterans Affairs have a better chance of getting US Veterans Affairs
compensation than a veteran with a congressional representative not on the committee? Through statistical
analysis, I discovered that there was not a statistically significant difference between the mean numbers of compensated
veterans when comparing sample R1 to sample H1. R1 represents the 26 randomly selected congressional districts and H1
represents the 26 US House Committee on Veterans Affairs congressional districts. A boxplot comparing H1 to R1 is
provided later.
Part of the analysis, linear mixed models, is new to me, and I am more familiar with another part, the t-test. With that said, I
have read much, found some good online tutorials[6-11], and I believe I have a basic working knowledge of the
statistical methods that allowed me to conclude that there is no statistically significant difference between the 26 US House
Committee on Veterans Affairs congressional districts, H1, and the 26 randomly selected congressional districts, R1. During
the somewhat complicated analysis, I will be determining the effects of numerical fixed parameters, such as congressional
district population and congressional district number of veterans, and the effect of the random categorical grouping, house
sample H1 or random sample R1, on the number of compensated veterans. I use R, an open-source statistical software
package[4], to determine if a linear model or linear mixed model is needed to model the data. In fact, R was used for all
statistical analysis, including the t-test and simple graphical techniques.

Data Structure
Number  Sample  District Population  Total Veterans  log(Total Veterans)  Compensated Veterans  log(Compensated Veterans)
1       R1      632408               49775           4.70                 6416                  3.81
2       R1      695888               176342          5.25                 18645                 4.27
i       R1      P_R1,i               TV_R1,i         log(TV_R1,i)         CV_R1,i               log(CV_R1,i)
26      R1      653167               67912           4.83                 8990                  3.95
1       H1      694158               113094          5.05                 29105                 4.46
2       H1      812727               195868          5.29                 39561                 4.60
j       H1      P_H1,j               TV_H1,j         log(TV_H1,j)         CV_H1,j               log(CV_H1,j)
26      H1      625251               96707           4.99                 10494                 4.02
In the above table, Number is the index of a congressional district within its sample (1 to 26); Sample is a categorical variable in which R1 denotes the 26
randomly selected congressional districts and H1 denotes the 26 US House Committee on Veterans Affairs congressional districts; District Population represents
the total number of people living in the congressional district; Total Veterans represents the total number of veterans living in the congressional district;
log(variable) represents the base-10 logarithm; Compensated Veterans represents the total number of compensated veterans in the congressional district.

Data Familiarization

R Command: summary(CompSum)
Sample  Population       Total Veterans    log(TotalVeterans)  Compensation     log(Compensation)
H1:26   Min.   :519021   Min.   : 44759    Min.   :4.651       Min.   : 5381    Min.   :3.731
R1:26   1st Qu.:659596   1st Qu.: 68454    1st Qu.:4.835       1st Qu.: 9024    1st Qu.:3.955
        Median :697012   Median : 91236    Median :4.960       Median :12400    Median :4.093
        Mean   :709013   Mean   :117392    Mean   :5.000       Mean   :17762    Mean   :4.155
        3rd Qu.:755059   3rd Qu.:126158    3rd Qu.:5.099       3rd Qu.:21548    3rd Qu.:4.332
        Max.   :971733   Max.   :381825    Max.   :5.582       Max.   :65007    Max.   :4.813
R Command: write.table(summary(CompSum), file ="summary.csv", col.names = TRUE, sep = "," )
CompSum is the name of my data structure. As I have explained in a past statistical analysis[3], I am not a statistician, so I will not be giving a textbook explanation of
descriptive statistics[1;2;5]. With that said, I believe it is important to note that the mean and median are equal in a normally distributed sample[2]. Also, the median
represents the 50th percentile. When a log transform was applied to Total Veterans and Compensation, the means and medians approached each other. Prior to
the log transform, the means and medians of Total Veterans and Compensation were well separated. As for quartiles, the 1st Quartile, 1st Qu., represents the 25th
percentile, meaning that 25% of the data is below 659,596 for population. Likewise, the 3rd Quartile represents the point where 75% of the data falls
below the population value of 755,059 and 25% of the data is above it. In addition, one can see that the range is wide for Total Veterans and Compensation.
Standard Normal Curve

The above curve is the familiar bell shape of a normal distribution curve. Next, I provide histograms of the analyzed data to
show how a log transformation shifted the positively skewed data, data with a long tail toward high values, to a shape that
approaches normality. During this analysis with R, I noticed that the log transform noticeably changed the statistical values
and reduced the residuals.

Normal Probability Density Function:

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(x-\mu)^2 / (2\sigma^2)} $$

Setting the mean, $\mu$, to 0 and the standard deviation, $\sigma$, equal to 1, we get the equation entered in R that generated the
above Standard Normal Curve:

$$ f(x) = \frac{1}{\sqrt{2\pi}} \, e^{-x^2 / 2} $$
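As a quick check of the density formula above, the following minimal sketch writes the formula out as an R function and compares it with R's built-in dnorm(). The function name normal_pdf is only an illustrative choice and is not taken from my original script.

```r
# Normal density written out term by term from the formula
normal_pdf <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- seq(-4, 4, by = 0.5)

# Largest disagreement with R's built-in standard normal density
max_gap <- max(abs(normal_pdf(x) - dnorm(x)))
print(max_gap)  # effectively zero

# Uncomment to draw the bell curve itself:
# curve(normal_pdf(x), from = -4, to = 4, ylab = "f(x)")
```

Calling normal_pdf() with other values of mu and sigma gives the general curve rather than the standard one.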

Approaching Normality


Another method of evaluating the normality requirement is called a Quantile-Quantile plot, QQPlot[5]. In such a plot, the
sample quantiles are compared to theoretical quantiles. If the data are completely normal, the points fall on a straight line,
representing a 1-to-1 relationship[5].
Congressional District Population Normality Test:
R Command: qqnorm(CompSum$Population) and qqline(CompSum$Population)


QQ Norm Plots for Total Veterans, Compensated Veterans, and subsequent log transformations

As the reader can see, the population, first large graph, already had a fairly normal distribution, and the log transform
moved the Total Veteran and Compensated Veteran data towards normality. If perfectly normal, the points would follow the line
from the lower left corner to the upper right corner. Now that the data is near normal, data analysis can proceed.
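The effect of the log transform can be illustrated with a small sketch. The data here are NOT my district data; they are hypothetical right-skewed counts built from evenly spaced normal quantiles so the result is reproducible, and the helper name skew is my own.

```r
# Hypothetical right-skewed counts (assumption: stand-in for Total Veterans)
counts <- 10^qnorm(ppoints(52), mean = 5, sd = 0.2)

# Simple moment-based skewness; zero for a symmetric sample
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Before the transform: mean sits above the median and skewness is positive
c(mean = mean(counts), median = median(counts), skewness = skew(counts))

# After the base-10 log transform: mean and median agree, skewness vanishes
logged <- log10(counts)
c(mean = mean(logged), median = median(logged), skewness = skew(logged))

# Uncomment to see the QQ plots side by side:
# par(mfrow = c(1, 2)); qqnorm(counts); qqline(counts); qqnorm(logged); qqline(logged)
```

Real data will not become perfectly symmetric the way this constructed example does, but the direction of the change is the same.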

Data Analysis
Linear Mixed Model
When I began the recent analysis regarding compensated veterans, I was considering the use of Analysis of
Covariance as a statistical method. Specifically, I thought congressional district population might correlate with the number of
compensated veterans. After contacting an MD with a PhD in statistics, I learned that a Linear Mixed Model should be used
to account for any random or uncontrolled effects.
I began researching linear mixed models, learned that the models are complicated, and I decided upon the open-source
statistical software R[4]. Still, I had to learn how to use R and interpret the outcome. Since I am not a statistician, I will not
be giving a textbook explanation of linear mixed models. If you are interested in learning more about linear mixed models, I
suggest some of the links in the reference section of this paper[6-11].
Briefly, linear mixed models are complicated mathematically and conceptually. As the name implies, linear mixed models
are composed of fixed and random effects. A fixed effect is a parameter whose levels are exhausted and complete. As an example,
Male, M, and Female, F, would be a fixed parameter since only M and F exist in gender[7;8]. In this paper, I assumed
congressional district population and congressional district total veterans were fixed parameters since each is a complete
count obtained through US Census surveying. In contrast, a random effect is not complete and is often a sample of a population.
As an example, a school classroom is a sample of all classrooms in a school or district. A neighborhood in a city is a sample
from the city. Since the random effect is not exhausted, random variations will occur and multiple intercepts will be seen[7-9].
In this paper, the samples R1, 26 randomly selected congressional districts, and H1, 26 US House Committee on
Veterans Affairs congressional districts, are samples of the 435 US congressional districts. As such, R1 and H1 are
considered random effects in this paper. This description is really simplistic, but there is no reason to reproduce
references[6-11] in this paper.
R Notation
For an informative description and understanding of R notation, read[6;7]. For my linear mixed model R notation:
logCompensation ~ Population + logTotalVeterans + (1|Sample) + $\epsilon$

Dependent Variable: logCompensation
Fixed Parameters: Population, logTotalVeterans
Random Effect: (1|Sample)
Error Term: $\epsilon$

The Dependent Variable is dependent on the Fixed Parameters, Random Effect, and error term, $\epsilon$. $\epsilon$ provides a
measure of uncontrollable effects[6].
Linear Mixed Model R Outcome
AIC     BIC     logLik  deviance  REMLdev
-14.55  -4.794  12.27   -60.81    -24.55

Random effects:
Groups    Name         Variance   Std.Dev.
Sample    (Intercept)  0.0024212  0.049205
Residual               0.0182183  0.134975
Number of obs: 52, groups: Sample, 2

Fixed effects:
                  Estimate    Std. Error  t value
(Intercept)       -8.589e-01  4.023e-01   -2.135
Population         2.985e-07  2.808e-07    1.063
logTotalVeterans   9.606e-01  8.968e-02   10.712

Correlation of Fixed Effects:
             (Intr)  Popltn
Population    0.007
logTtlVtrns  -0.892  -0.450
As I recommended previously, the reader should read references[6-11] if interested in a complete description of the above
output. As an example, AIC, BIC, etc. are used in model selection, and a lower value indicates a better model. Interestingly,
there is much disagreement on which criterion to use.

What I am currently interested in is the Random effects. To analyze if Sample is an important random effect, I look at
the variance: what percentage of the total variance (Sample + Residual) is the Sample value? I also compare the variance
and standard deviation.

Random Effects:
Groups    Name         Variance   Std.Dev.
Sample    (Intercept)  0.0024212  0.049205
Residual               0.0182183  0.134975
Sum:                   0.0206395
Percent Sample:

$$ \frac{\text{Sample}}{\text{Sum}} (100) = \frac{0.0024212}{0.0206395} (100) = 11.7\% $$
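The percentage above can be reproduced in a few lines of R. This is only the arithmetic on the two reported variances; the variable names are illustrative.

```r
# Variances reported in the Random effects output
var_sample   <- 0.0024212
var_residual <- 0.0182183

# Share of the total variance attributed to the Sample grouping
var_total      <- var_sample + var_residual
percent_sample <- 100 * var_sample / var_total
round(percent_sample, 1)  # 11.7

# The reported standard deviation is the square root of the variance
sqrt(var_sample)          # about 0.0492
```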
In my opinion, 11.7% is noteworthy when considering percent of total variance. The random effect of Sample explains
some of the variability, which means that there is some noticeable difference between R1, 26 randomly selected
congressional districts, and H1, 26 US House Committee on Veterans Affairs congressional districts, which can be seen in the
boxplot graph below. With that said, the numerical value of 0.0024212 is quite small and can essentially be considered zero in
light of the standard deviation. As such, I believe the random effect of Sample is noticeable as a share of total variance, but
I believe the small absolute value indicates that the difference between R1 and H1 is not significant. Still, it is interesting to
discover that there may be a very small difference between randomly selected congressional districts and US House Committee
on Veterans Affairs congressional districts. Personally, I believe the small value indicates the difference is insignificant in
realistic terms.
For this reason, I believe I can use the t-test to analyze the difference between the means of compensated veterans. Also, I
believe I can create a linear model to represent the log Compensated Veterans prediction model. Instead of analyzing the
above Fixed effects estimates, I will create a linear model to analyze.


The above boxplot[5a] shows that the medians, the dark lines, of the 26 US House Committee on Veterans Affairs
congressional districts, H1, and the 26 randomly selected congressional districts, R1, are almost the same. Also, R1 falls
within the range of H1. Still, there is a difference between the samples in the 75th percentile, top of box, and the upper
adjacent whisker, upper line. Each sample, R1 and H1, has outside values too. R1 has 2, and H1 has 1. These differences might
be the small variability accounted for by the linear mixed model term (1|Sample). With that said, I believe the size of the effect
is small and insignificant.
Later, the reader will see the outcome of the t-test, which I believe can be performed since normality assumptions are
reasonable and the random effect is small. Before I calculate the t-test, I will create a linear model and check the effect of
population on the model.

Linear Mixed Model Comparison
I will be creating a full model and a null model when considering population and performing an Analysis of Variance,
ANOVA, comparison[7]. The ANOVA comparison will provide a Chi-Squared test statistic that I can use to determine if
population is needed in the linear model.
R Notation
Full Model
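logCompensation ~ Population + logTotalVeterans + (1|Sample) + $\epsilon$

Dependent Variable: logCompensation
Fixed Parameters: Population, logTotalVeterans
Random Effect: (1|Sample)
Error Term: $\epsilon$
Null Model
logCompensation ~ logTotalVeterans + (1|Sample) + $\epsilon$

Dependent Variable: logCompensation
Fixed Parameters: logTotalVeterans
Random Effect: (1|Sample)
Error Term: $\epsilon$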
For informational purposes, I will be including the AIC, BIC, etc. values. I will be excluding the Random effects and
Fixed effects while including the ANOVA results. As mentioned previously, the analysis of the final linear model will also
explain how to interpret the Fixed effects results in the linear mixed model.
When I do this analysis in R, I will be setting Restricted Maximum Likelihood, REML, to FALSE as I build the
models[7].
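For readers who want to try this kind of comparison, a minimal sketch follows. It assumes the lme4 package, which this paper does not name, and it uses simulated stand-in data, so its numbers will not match my output. The object names sim, full_model, and null_model are hypothetical.

```r
# Simulated stand-in data (assumption: NOT the district data from this paper)
set.seed(1)
n <- 52
sim <- data.frame(
  Sample           = factor(rep(c("H1", "R1"), each = n / 2)),
  Population       = round(rnorm(n, mean = 700000, sd = 80000)),
  logTotalVeterans = rnorm(n, mean = 5, sd = 0.2)
)
sim$logCompensation <- -0.88 + 1.0 * sim$logTotalVeterans + rnorm(n, sd = 0.14)

# Fit only when lme4 is installed; REML = FALSE gives maximum likelihood fits,
# which is what a likelihood-ratio comparison of fixed effects needs
if (requireNamespace("lme4", quietly = TRUE)) {
  full_model <- lme4::lmer(
    logCompensation ~ Population + logTotalVeterans + (1 | Sample),
    data = sim, REML = FALSE
  )
  null_model <- lme4::lmer(
    logCompensation ~ logTotalVeterans + (1 | Sample),
    data = sim, REML = FALSE
  )
  # Chi-squared comparison of the two nested models
  print(anova(null_model, full_model))
}
```

With only two groups and a predictor on a very large scale, lme4 may print warnings about rescaling or a singular fit; those are expected for a sketch like this and are not errors.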
Full Model Output (With population)
AIC     BIC     logLik  deviance  REMLdev
-51.16  -41.40  30.58   -61.16    -24.28
Null Model Output (Without population)
AIC     BIC     logLik  deviance  REMLdev
-52.16  -44.36  30.08   -60.16    -51.49


anova(compNull.model, compFull.model)
                Df  AIC      BIC      logLik  Chisq   Chi Df  Pr(>Chisq)
compNull.model  4   -52.163  -44.358  30.082
compFull.model  5   -51.157  -41.401  30.579  0.9941  1       0.3187
Comparison
As can be seen from the above results, the ANOVA comparison reproduces nearly the same AIC and BIC values as the
individual model outputs. In most literature, I have learned that a smaller AIC or BIC result indicates a better model. As for
the Analysis of Variance, ANOVA, comparison, a Chi-Squared, $\chi^2$, test statistic is provided as well with a probability:
$\chi^2(1) = 0.9941$ and Pr = 0.3187. From a statistical table in my engineering textbook[1], the Chi-Squared value must exceed
3.841 at 1 degree of freedom and $\alpha = 0.05$ for population to have a statistically significant effect on log(Compensated
Veterans). Since 0.9941 < 3.841, population has no statistically significant effect.
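The table lookup can also be done directly in R. The sketch below uses only base R and the numbers reported in the ANOVA output; the variable names are illustrative.

```r
# Likelihood-ratio statistic from the two reported log-likelihoods
chisq_from_loglik <- 2 * (30.579 - 30.082)   # about 0.994

# Critical value at alpha = 0.05 with 1 degree of freedom
chisq_critical <- qchisq(0.95, df = 1)       # about 3.841

# Probability of a statistic at least as large as 0.9941 under the null
p_value <- pchisq(0.9941, df = 1, lower.tail = FALSE)  # about 0.319

0.9941 > chisq_critical  # FALSE, so Population is not needed
```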
Linear Model
Assuming that the small value of the random effect of Sample is insignificant, a linear model without random effects will be
used to predict the logarithm of compensated veterans. Also, the above ANOVA Chi-Squared value indicates that
Population is not needed. In truth, the earlier Fixed effects output strongly suggested that population was not important since
the estimate was so small, 2.985e-07, and the t value was 1.063. The estimate suggests that every population increase of 1
produces a 2.985e-07 increase in logCompensation. Still, I will do a linear regression analysis with population to obtain a
p-value and t-value without the random effect of Sample.
R Notation
logCompensation ~ logTotalVeterans + Population + $\epsilon$

Dependent Variable: logCompensation
Fixed Parameters: logTotalVeterans, Population
Error Term: $\epsilon$
Since it is assumed that US Census included compensated veterans in the group of total veteran estimation, I would suspect
a linear model would be significant.
Linear Model R Output

I have included the data that I will analyze. Specifically, I have excluded the data on Residuals since I will be including a
Fitted versus Residual plot later.

With Population

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.836e-01 4.141e-01 -2.134 0.0379
logTotalVeterans 9.754e-01 9.234e-02 10.563 3.16e-14
Population 2.290e-07 2.877e-07 0.796 0.4299

Residual standard error: 0.1395 on 49 degrees of freedom; Multiple R-squared: 0.7526, Adjusted R-squared: 0.7425; F-
statistic: 74.53 on 2 and 49 DF, p-value: 1.376e-15.

With Population Equation


$$ \log(CV) = -8.836 \times 10^{-1} + 9.754 \times 10^{-1} \, \log(TV) + 2.290 \times 10^{-7} \, (Pop) $$

Without Population

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.88431 0.41260 -2.143 0.037
logTotalVeterans 1.00801 0.08244 12.228 < 2e-16

Residual standard error: 0.139 on 50 degrees of freedom; Multiple R-Squared: 0.7494, Adjusted R-squared: 0.7444; F-
statistic: 149.5 on 1 and 50 DF, p-value: < 2.2e-16.

Without Population Equation

$$ \log(CV) = -8.8431 \times 10^{-1} + 1.00801 \, \log(TV) $$
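As an illustration of how the equation without population could be used, the sketch below turns it into a small prediction function. It assumes the logarithm is base 10, which is consistent with the Data Structure table (for example, log of 49775 is listed as 4.70). The function name predict_cv and the input of 100,000 veterans are hypothetical.

```r
# Predicted compensated veterans from total veterans,
# using the reported coefficients of the model without Population
predict_cv <- function(total_veterans) {
  log_cv <- -0.88431 + 1.00801 * log10(total_veterans)
  10^log_cv  # undo the base-10 log transform
}

predict_cv(100000)  # roughly 14,300 compensated veterans
```

Because the model was fit on the log scale, the back-transformed value is best read as a typical (geometric) value, not an exact count.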

According to my engineering statistics textbook[1], the t-value and p-value are important for determining if a variable is
important. If the p-value is small and the absolute t-value is large, the variable is important. The maximum value for a p-value
is 1 since it is a probability. In the case of Population, the p-value is high at 0.4299 and the t-value is quite small at 0.796.
Meanwhile, the p-value for logTotalVeterans is quite low and the t-value is high. The p-values and t-values suggest that
logTotalVeterans is needed while Population is not needed. Therefore, the linear model analysis produced similar results to the
Linear Mixed Model analysis. When population is removed from the equation, the p-value and t-value for logTotalVeterans
change for the better. The p-value is a conditional probability[6]: a probability under the condition that the null hypothesis is
true. In the case of my model, the null hypothesis is that the slope is zero, and the p-value is the probability of seeing a t-value
at least as extreme as the calculated one if the slope really were zero[1]. A p-value near 1 means the data are entirely consistent
with a zero slope. In the case of Population, the p-value equals 0.4299, which is quite high indeed.

There are 52 data points, so the degrees of freedom for the model without Population are 52 - 2 = 50 (49 for the model with
Population). From a statistical table[1], with $\alpha = 0.05$, that is $\alpha/2 = 0.025$ in each tail, the slope is significant if
$t < -t_{0.025} = -1.960$ or $t > t_{0.025} = 1.960$. Since the Population t-value of 0.796 falls between these two values, one
cannot reject a slope of zero for Population. In contrast, the with-population and without-population t-values for
logTotalVeterans are much greater than 1.960, so one can conclude that the slope is not equal to 0 and is highly significant[1].
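The table value of 1.960 is the large-sample limit. As a cross-check, R can give the critical value for the exact degrees of freedom; the sketch below uses only base R and the reported t-values, and the variable names are illustrative.

```r
# Two-sided critical t-values at alpha = 0.05
t_crit_50 <- qt(0.975, df = 50)  # about 2.009, model without Population
t_crit_49 <- qt(0.975, df = 49)  # about 2.010, model with Population

# Two-sided p-value for the reported Population t-value of 0.796
p_population <- 2 * pt(-abs(0.796), df = 49)  # about 0.43

abs(0.796)  > t_crit_49  # FALSE: Population slope not distinguishable from 0
abs(10.563) > t_crit_49  # TRUE: logTotalVeterans slope is significant
```

The conclusions are the same whether 1.960 or the exact critical value is used.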
The Multiple R-Squared can be thought of as variance explained and it ranges from 0 to 1[6]. As the reader can see, the
$R^2$ remains relatively the same whether population is included or not. In the case without Population, the $R^2$
value is 0.7494. Another way to look at this value is to consider that 74.94% of the variance is explained[6]. For a social
sciences model, the $R^2$ is quite high, but I believe that is because compensated veterans are included in total veterans. The
Adjusted R-Squared value also describes how much variance is explained, but the Adjusted $R^2$ is penalized for the
number of fixed effect parameters in the model.
Prior to the discussion, I highly suggest reference[6a] if you want a nice understanding of ANOVA, F-statistic, degrees of
freedom, etc. As for the F-statistic, it is the ratio of between-group variance, the numerator, to within-group variance, the
denominator[3;6a]. If the F-statistic is large, the unexplained noise is small compared to the measured
information. As such, experimenters want a high F-statistic. In the above cases, the F-value changed from F(2, 49) = 74.53
to F(1, 50) = 149.5. From my engineering textbook, $F_{0.05}(1, 50)$, where 1 is the numerator degrees of freedom and 50 is the
denominator degrees of freedom, falls between the tabulated values of 4.08 (40 denominator degrees of freedom) and 4.00
(60 denominator degrees of freedom).

$F_{0.05}$ Interpolation

$$ y = \frac{(y_2 - y_1)}{(x_2 - x_1)} (x - x_1) + y_1 $$
$$ y = \frac{(4.00 - 4.08)}{(60 - 40)} (50 - 40) + 4.08 $$
$$ y = 4.04 = F_{0.05}(1, 50) $$

As the reader can see, the F(1,50) value of 149.5 is far larger than 4.04. As such, the between-group variance is much
larger than the within-group variance, or the within-group variance is quite small. Finally, the p-value < $2.2 \times 10^{-16}$
is the probability of obtaining an F-statistic this large if the null hypothesis were true, so the risk of a type I error, which is
a rejection of the null hypothesis even though it is true, is extremely small.
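The interpolated table value and the reported F-statistic can both be cross-checked in base R. This is a sketch on the reported numbers only; the variable names are illustrative.

```r
# Exact critical value, no table interpolation needed
f_critical <- qf(0.95, df1 = 1, df2 = 50)   # about 4.03

# Linear interpolation between the table entries for 40 and 60 degrees of freedom
f_interpolated <- (4.00 - 4.08) / (60 - 40) * (50 - 40) + 4.08   # 4.04

# For a regression, F can be recovered from R-squared:
# F = (R^2 / df1) / ((1 - R^2) / df2)
r_squared <- 0.7494
f_from_r2 <- (r_squared / 1) / ((1 - r_squared) / 50)   # about 149.5
```

The last line shows why the F-statistic and the Multiple R-Squared tell the same story for the model without Population.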

Residual Plots

According to my engineering textbook, the information on lack of model fit is stored in the residual information[1]. A
residual is the y value of the actual data point minus $\hat{y}$, the estimate from the fitted line. The above graphs of the
residuals show that the model is a good fit. The residuals have a somewhat random and fairly constant scattering band around
0.0 in the residual plot, and the residuals distribution approaches a normal curve. Each of these observations suggests a good
linear model fit.

Welch t-test
Statistically, I believe I have shown that a random effect of Sample, which compares R1 and H1 effects on the model, is
not needed, and I believe I have shown that congressional district population is insignificant when considering linear models.
Also, I converted the compensation and total veterans data sets to more normal data sets via log transform. Now, I will use
the R Welch t-test[4] to test the null hypothesis that the means are equal.
R Output: Welch Two Sample t-test
t = -1.1872, df = 47.326, p-value = 0.2411; alternative hypothesis: true difference in means is not equal to 0; 95 percent
confidence interval: -0.24296653 to 0.06260619; mean of x and y are 4.110214 and 4.200394, respectively: x = R1 and y =
H1.
The Welch test adjusts the number of degrees of freedom when the variances differ[13]. From my engineering textbook, the
critical value $t_{0.05/2} = t_{0.025}$ falls between:
Degrees of Freedom   $\alpha = 0.025$
29                   2.045
Infinity             1.960
In truth, the critical t-statistic for 47.326 degrees of freedom falls between the above two values. Since infinity is in the
Degrees of Freedom column, I cannot interpolate. If the calculated t-value is less than the negative critical value (between
-2.045 and -1.960) or greater than the positive critical value (between 1.960 and 2.045), the null hypothesis that the means are
equal can be rejected. If the calculated t-value had fallen between the two table values, I would have personally used the
infinity value since the degrees of freedom, 47.326, is far from 29. From above, the calculated t-value is -1.1872, which is not
less than -1.960, let alone -2.045, and is not greater than 1.960. As such, the null hypothesis that the means, $\mu$, are equal
cannot be rejected:

$$ H_0: \ \mu_{\log(\text{R1 Compensated Veterans})} = \mu_{\log(\text{H1 Compensated Veterans})} $$

with sample means of 4.110214 for R1 and 4.200394 for H1.
Of note is the relatively low p-value of 0.24. Such a p-value might indicate that more analysis is needed to definitively state
that the null hypothesis of equal means can be accepted. If the means are exactly equal, which can be checked by performing a
Welch t-test in R with the same data in each column, the p-value would be equal to 1.
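The pieces of the Welch output can be cross-checked against one another in base R. The sketch below starts from the reported confidence interval, means, and degrees of freedom, and recovers the t-statistic and p-value they imply. The variable names are illustrative, and this is a consistency check on the printed numbers, not a rerun of the test on the district data.

```r
# Reported values from the Welch Two Sample t-test output
ci_low   <- -0.24296653
ci_high  <-  0.06260619
mean_r1  <-  4.110214
mean_h1  <-  4.200394
df_welch <- 47.326

# Difference in means sits at the center of the confidence interval
mean_diff <- mean_r1 - mean_h1                 # about -0.0902

# Half-width of the interval = critical t * standard error
std_err <- (ci_high - ci_low) / 2 / qt(0.975, df_welch)

# Implied t-statistic and two-sided p-value
t_implied <- mean_diff / std_err               # about -1.19
p_implied <- 2 * pt(-abs(t_implied), df_welch) # about 0.24
```

A p-value of about 0.24 and a t-value of about -1.19 agree with each other, and both lead to the same decision not to reject the null hypothesis.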
Since the anti-log of the arithmetic mean of log transformed data is the geometric mean[5b], it can be assumed that the
geometric mean is a better average representation of the H1 and R1 Compensated Veterans data. To obtain the geometric
mean, the anti-log of the above two values is performed:

$$ \mu_{\text{R1 Compensated Veterans}} = 10^{4.110214} \approx 12{,}889 \qquad \mu_{\text{H1 Compensated Veterans}} = 10^{4.200394} \approx 15{,}863 $$
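The anti-log step can be reproduced in base R, assuming base-10 logarithms as in the Data Structure table. The short vector at the end uses three R1 values from that table purely to illustrate that the anti-log of a mean log equals the geometric mean; it is not the full sample.

```r
# Anti-log of the two reported mean logs
geo_mean_r1 <- 10^4.110214   # about 12,889
geo_mean_h1 <- 10^4.200394   # about 15,863

# Illustration with three Compensated Veterans values from the table
cv_example <- c(6416, 18645, 8990)

# Anti-log of the arithmetic mean of the logs ...
via_logs <- 10^mean(log10(cv_example))

# ... equals the n-th root of the product, the definition of the geometric mean
via_product <- prod(cv_example)^(1 / length(cv_example))
```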


A Visual Representation

The above graphs provide a nice visual of the linear regression analysis, the $R^2$ relationship, and the residuals. As the
reader can see, a linear fit provides a nice representation of the log(data).

Conclusion
Although the random effect's share of total variance suggested that Sample explains a noticeable part of the variance, which
might mean a difference between H1, the 26 US House Committee on Veterans Affairs congressional districts, and R1, the 26
randomly selected congressional districts, the calculated value was quite small. When considered with the standard deviation,
one can assume that the variance for the random effect of Sample, which is H1 and R1, is essentially 0. As such, I decided that
a linear mixed model was not needed to model the data.
Both linear mixed model and linear model analysis, along with ANOVA, indicated that congressional district population was
not needed to model the data and predict log (Compensated Veterans) outcome. On each occasion, the statistics showed
that population was insignificant.
A linear model was produced that included log(Compensated Veterans) as the dependent variable and log(Total Veterans)
as the independent variable. The t-statistic, F-statistic, various p-values, and $R^2$ suggested a successful linear fit of the log
transformed data. To complement these statistics, graphical visualization also verified normally distributed residuals that are
constant and somewhat randomly distributed about 0.0. In addition, graphical representation showed the difference
between original data and log transformed data. All of these statistical evaluations indicated a nice linear model fit of
log(Compensated Veterans) and log(Total Veterans), which, in truth, makes sense because compensated veterans are a part
of Total Veterans within a US Census population survey.
Since the data were log transformed to approach normality and a linear model could be used, a Welch Two Sample t-test,
which does not assume equal variances, was used to compare log(R1 Compensated Veterans) and log(H1 Compensated
Veterans). The calculated t-statistic indicated that the null hypothesis of equal means could not be rejected. The probability
was p = 0.24 though, which is relatively low; if the two sample means were exactly equal, the p-value would be 1. This implies
that other comparisons might be needed to definitively state that the means are not statistically different. Since the US House
Committee on Veterans Affairs has 26 congressional districts, the comparison sample size is fixed. Also, I am a chemical
engineer with an additional degree in biological sciences, so most of my statistical experience is much different from the
social sciences. A p-value of 0.24 might actually be quite high in the field of social sciences.
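The Welch comparison can be sketched in R as follows. The two vectors are simulated placeholders for the log transformed H1 and R1 values (not the paper's data), and the R1 sample size of 26 is an assumption. The last two lines illustrate the remark about a p-value of 1 by shifting one sample so that both sample means coincide.

```r
# Simulated placeholder data; NOT the paper's H1 and R1 values.
set.seed(3)
log_h1 <- rnorm(26, mean = 9.0, sd = 0.35)  # committee districts (H1)
log_r1 <- rnorm(26, mean = 9.1, sd = 0.30)  # random districts (R1), size assumed

# Welch Two Sample t-test: var.equal = FALSE drops the equal variance assumption.
welch <- t.test(log_r1, log_h1, var.equal = FALSE)
print(welch)

# Shift H1 so its sample mean equals the R1 sample mean: t is then 0.
shifted <- log_h1 - mean(log_h1) + mean(log_r1)
p_equal_means <- t.test(log_r1, shifted)$p.value
```

With identical sample means the t-statistic is 0, so p_equal_means sits at the upper limit of 1. Any observed difference, however small, pulls the p-value below that.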
As previously mentioned, the current analysis found no statistically significant difference between the means, which is
consistent with the means being equal. Also, one can statistically predict the number of compensated veterans from the
number of total veterans in a congressional district.
Sincerely,
Chris Harding
100% schedular Total and Permanent disabled veteran

References:

[1] Johnson, Richard A. (1994). Miller & Freund's Probability & Statistics for Engineers. Fifth Edition. Englewood Cliffs,
New Jersey: Prentice Hall

[2] Crow, Edwin L.; Davis, Frances A.; Maxfield, Margaret W. Statistics Manual. New York: Dover Publications

[3] Harding, Chris. Statistical Analysis: Representation Success at Board of Veterans Appeal, Aug. 31, 2013.
scribd.com[online]. 2013. Available from:
http://www.scribd.com/doc/164161570/Statistical-Analysis-Representation-Success-at-Board-of-Veterans-Appeal

[4] R Foundation for Statistical Computing. R: A Language and Environment for Statistical Computing, 2013. R-
project[online]. 2013. Available from: http://www.R-project.org/

[5] Rice University; University of Houston Clear Lake; Tufts University. Online Statistics Education: An Interactive
Multimedia Course of Study. QQ Plots. Onlinestatbook.com[online]. 2013. Available from:
http://onlinestatbook.com/2/advanced_graphs/q-q_plots.html

[5a] Rice University; University of Houston Clear Lake; Tufts University. Online Statistics Education: An Interactive
Multimedia Course of Study. BoxPlots. Onlinestatbook.com[online]. 2013. Available from:
http://onlinestatbook.com/2/graphing_distributions/boxplots.html

[5b] Rice University; University of Houston Clear Lake; Tufts University. Online Statistics Education: An Interactive
Multimedia Course of Study. Log Transformations. Onlinestatbook.com[online]. 2013. Available from:
http://onlinestatbook.com/2/transformations/log.html

[6] Winter, Bodo (2013). Linear models and linear mixed effects models in R with linguistic applications: Tutorial 1.
arXiv:1308.5499. Cornell University Library[online]. 2013. Available from: http://arxiv.org/pdf/1308.5499.pdf

[6a] Winter, Bodo (2011). Tutorial: The F distribution and the basic principle behind ANOVAs, Sep. 21, 2011.
Bodowinter.com[online]. 2013. Available from: http://bodowinter.com/tutorial/bw_anova_general.pdf

[7] Winter, Bodo (2013). A very basic tutorial for performing linear mixed effects analysis (Tutorial 2), Aug 13, 2013.
bodowinter.com[online]. 2013. Available from: http://bodowinter.com/tutorial/bw_LME_tutorial.pdf

[8] UCR. Linear Models: Analyzing data with random effects (Little, Chapter 4). faculty.ucr.edu[online]. 2013. Available
from: http://faculty.ucr.edu/~hanneman/linear_models/c4.html

[9] Starkweather, Jon, Dr. Linear Mixed Effects Modeling using R. unt.edu[online]. 2013. Available from:
http://www.unt.edu/rss/class/Jon/Benchmarks/LinearMixedModels_JDS_Dec2010.pdf

[10] International Livestock Research Institute. Research Methods Group. Mixed Model Analysis Using R. ilri.org[online].
2013. Available from: http://www.ilri.org/biometrics/Publication/Full%20Text/Stephen_Mbunai_Sonal_Nagda.pdf

[11] Seltman, Howard. Experimental Design for Behavioral and Social Sciences. stat.cmu.edu[online]. 2013. Available from:
http://www.stat.cmu.edu/~hseltman/309/Book/

[12] UNC. Lecture 28 Monday, December 6, 2010. unc.edu[online]. 2013. Available from:
http://www.unc.edu/courses/2010fall/ecol/563/001/docs/lectures/lecture28.htm

[13] Spector, Phil. Using t-tests in R. statistics.berkeley.edu[online]. 2013. Available from:
http://statistics.berkeley.edu/computing/r-t-tests
