
Identifying Poisonous Fish

Brian Bahmanyar
October 20, 2014

Introduction
Mercury poisoning is a medical condition caused by exposure to mercury or its
compounds. Mercury is a heavy metal occurring in several forms, all of which can produce toxic
effects in high enough doses. Toxic effects include damage to the brain, kidneys and lungs. The
type and degree of symptoms exhibited depend upon the individual toxin, the dose, and the
method and duration of exposure. The consumption of fish is by far the most significant source
of ingestion-related mercury exposure in humans.
(http://en.wikipedia.org/wiki/Mercury_poisoning)
In my investigation I look at factors contributing to the mercury concentration of
Largemouth Bass in Florida lakes. I was interested to see if there are statistically significant
predictors for the mercury content of these fish. This would give fishermen a better
sense of whether a Largemouth Bass is safe to consume based on various factors measured
from the water.
I found data for this investigation at The Data and Story Library (DASL). Those who
collected the data studied 53 different Florida lakes to examine the factors that influence the
level of mercury contamination in Largemouth Bass. Unfortunately, there is no indication as to
whether this was a random sample of 53 lakes, so we may not be able to generalize our results to
some larger population. They collected water samples from the middle of each lake in August
1990 and then again in March 1991. The pH level and the amounts of chlorophyll, calcium, and
alkalinity were measured in each sample. The average of the August and March values was
recorded. Next, a sample of fish was taken from each lake, with sample sizes ranging from 4 to
44 fish (it is also unknown whether these samples were random). The age of each fish and the
mercury concentration in its muscle tissue were measured. Since fish absorb mercury over time,
older fish will tend to have higher concentrations. Thus, to make a fair comparison of the fish in
different lakes, the investigators used a regression estimate of the expected mercury
concentration in a three-year-old fish as the standardized value for each lake. Finally, in 10 of
the 53 lakes, the age of the individual fish could not be determined and the average mercury
concentration of the sampled fish was used instead of the standardized value.
(http://lib.stat.cmu.edu/DASL/Datafiles/MercuryinBass.html) The observational units in this
study are the samples of Largemouth Bass collected from the various Florida lakes.
In Part I of this project I will focus on the relationship between the alkalinity and the
average mercury concentration in the sample of Largemouth Bass. Alkalinity is the name given
to the quantitative capacity of an aqueous solution to neutralize an acid. Measuring alkalinity is
important in determining a stream's ability to neutralize acidic pollution from rainfall or
wastewater (http://en.wikipedia.org/wiki/Alkalinity). I will use the alkalinity of the lake,
measured in mg/L, as an explanatory variable in an effort to explain some of the variability in
the mercury concentrations in the Largemouth Bass, measured in parts per million in the muscle
tissue. Because alkalinity helps neutralize acidic pollution, I predict that higher
alkalinity levels in a lake will be associated with lower concentrations of mercury in the
Largemouth Bass that live there.

Descriptive Statistics
Figure 1: Average Mercury by Alkalinity
I predicted the negative association between
average mercury and alkalinity; however, I did not
anticipate the non-linear relationship. I fit a local
smoother with smoothness (alpha) equal to 0.5 to
better visualize the curvature in the data. To take
care of this sort of monotonically decreasing
curvature in Figure 1 I could have decreased the
power of average mercury or decreased the
power of alkalinity.

Figure 2: Average Mercury by log(10) Alkalinity
Weight: No.samples
I regressed average mercury on the base 10 log of alkalinity, and the relationship seems
reasonably linear. I ran a weighted model, using the number of Largemouth Bass behind the
mercury average of each observation as the weights. This way samples with more Largemouth
Bass will have more influence on the fitted relationship. The smoother definitely looks more
linear; however, there is still some curvature around the middle of the line in Figure 2. After
looking at the residuals, further transformations may be needed.
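
To make the weighting concrete, here is a minimal sketch of a weighted least-squares fit of a single response on a single predictor. It is not the JMP procedure used for this report; the function name and the four data points are hypothetical and serve only to show how the sample sizes enter the fit as weights.

```python
def weighted_least_squares(x, y, w):
    """Fit y = b0 + b1*x by minimizing sum(w * (y - b0 - b1*x)**2)."""
    sum_w = sum(w)
    # Weighted means of the predictor and the response
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sum_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    # Weighted sums of squares and cross-products about the weighted means
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical values for illustration only (NOT the DASL data)
log_alkalinity = [0.5, 1.0, 1.5, 2.0]
avg_mercury = [1.0, 0.8, 0.5, 0.3]
fish_per_lake = [5, 12, 20, 8]  # weights: number of bass sampled per lake

b0, b1 = weighted_least_squares(log_alkalinity, avg_mercury, fish_per_lake)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

Lakes with larger samples pull the fitted line toward themselves, which is the behavior described above.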

Table 1: Multivariate Correlations
Weight: No.samples

                    Avg_Mercury    Log10-alkalinity
Avg_Mercury         1.0000         -0.6729
Log10-alkalinity    -0.6729        1.0000

According to Table 1 the correlation coefficient, r, is about -0.673. A correlation coefficient of
-0.673 tells us that there is a reasonably strong, negative, linear relationship between the
average mercury concentration in the bass and the base 10 log of alkalinity.
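
For readers who want to see how weights can enter a correlation, the sketch below uses one common definition of a weighted Pearson correlation. Whether JMP computes the weighted value in Table 1 in exactly this way is an assumption, and the function name and data are hypothetical.

```python
import math


def weighted_correlation(x, y, w):
    """Weighted Pearson correlation (one common definition; assumed here)."""
    sum_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sum_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - y_bar) ** 2 for wi, yi in zip(w, y))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for illustration only (NOT the DASL data)
r = weighted_correlation([0.5, 1.0, 1.5, 2.0], [1.0, 0.8, 0.5, 0.3], [5, 12, 20, 8])
print(f"weighted r = {r:.3f}")
```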

Now we will look at the individual distributions of the important variables in our model.

Histogram: Average Mercury

Summary Statistics: Average Mercury

Mean              0.5271698
Std Dev           0.3410356
Std Err Mean      0.0468448
Upper 95% Mean    0.6211709
Lower 95% Mean    0.4331688

From the histogram of average mercury concentrations we can see that the data are skewed to
the right. This means that most of the sampled lakes contained Largemouth Bass that had
low average mercury concentrations, and few lakes had bass with high average mercury
concentrations. The 53 sampled lakes contained Largemouth Bass with a mean estimated
average mercury concentration of 0.527 parts/million. (Estimated not in the sense that
the mean is a prediction, but in the sense that the researchers used a sample of bass to
estimate the mercury concentration of all bass in each lake.)
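
The summary statistics for average mercury are internally consistent, which can be checked with a few lines of arithmetic. The sketch below recomputes the standard error from the reported standard deviation and n = 53, then backs out the multiplier implied by the reported upper confidence limit. The variable names are my own.

```python
import math

# Values copied from the Summary Statistics output for average mercury
n_lakes = 53
mean_hg = 0.5271698
sd_hg = 0.3410356
upper_95 = 0.6211709

# Standard error of the mean: s / sqrt(n)
std_err = sd_hg / math.sqrt(n_lakes)

# Multiplier implied by the reported upper limit; it should be close to the
# 97.5th percentile of a t-distribution with n - 1 = 52 degrees of freedom
t_multiplier = (upper_95 - mean_hg) / std_err

# The lower limit is symmetric about the mean
lower_95 = mean_hg - t_multiplier * std_err

print(f"SE = {std_err:.7f}, multiplier = {t_multiplier:.4f}, lower = {lower_95:.7f}")
```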

Histogram: Alkalinity

Summary Statistics: Alkalinity

Mean              37.530189
Std Dev           38.203527
Std Err Mean      5.247658
Upper 95% Mean    48.060385
Lower 95% Mean    26.999993

From the histogram of alkalinity levels we can see that the data are skewed to the right. This
means that most of the sampled lakes had low alkalinity levels, and few lakes had high
levels of alkalinity. The 53 sampled lakes had an average alkalinity of 37.53 mg/L.

Histogram: Sample Size per Observation

Summary Statistics: Sample Size/Observation

Mean              13.056604
Std Dev           8.5606773
Std Err Mean      1.1758995
Upper 95% Mean    15.416219
Lower 95% Mean    10.696989

Because the observations are based on different sample sizes, it is important to look at the
distribution of these weights. The majority of the mercury concentration observations come from
samples of between 5 and 15 Largemouth Bass. These samples may or may not be representative of
all the Largemouth Bass in a lake, depending on how the fish were selected and how many of these
fish there are in the lake in total. On average, the researchers who collected the data measured
mercury concentrations from samples of about 13 Largemouth Bass.

Residual Analysis
Figure 3: Normality Plot of Stu. Residuals
(for model Average Mercury by log(10) Alkalinity)
Figure 3 displays the non-normality of the studentized
residuals. The residuals clearly bend around the normal
line and reach outside of the 95% bands.

Figure 4: Stu. Residual Average Mercury

Referring to the histogram in Figure 4 we can see that there is indeed right skew in the
studentized residuals. This suggests that I should try another transformation to correct for this. I
will now look at the base 10 log of average mercury by the base 10 log of alkalinity.

Figure 5: Log(10) Average Mercury by Log(10) Alkalinity
Weight: No.samples
Figure 5 shows a scatterplot of the data after the second transformation. The data still seem
linear. Table 2 shows the output for a lack of fit test. We observed an F-Ratio of 0.8962, which
led to a large p-value of 0.664. At the alpha = 0.05 level, we do not have significant evidence
that the linear model is inappropriate. However, there are a few observations, circled in blue,
that seem to have much larger residuals than the rest of the data. The residuals for these
observations will be looked at in more detail when I evaluate unusual observations.

Table 2: Lack Of Fit

Source         DF    Sum of Squares    Mean Square    F Ratio    Prob > F
Lack Of Fit    49    35.057333         0.715456       0.8962     0.6642
Pure Error     2     1.596713          0.798357
Total Error    51    36.654047
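
The F ratio in Table 2 can be reproduced directly from the sums of squares and degrees of freedom. The short sketch below does that arithmetic; the variable names are my own, and the p-value is not recomputed because it needs an F-distribution routine.

```python
# Values copied from Table 2 (Lack Of Fit)
ss_lack_of_fit, df_lack_of_fit = 35.057333, 49
ss_pure_error, df_pure_error = 1.596713, 2

# Mean squares are sums of squares divided by their degrees of freedom
ms_lack_of_fit = ss_lack_of_fit / df_lack_of_fit
ms_pure_error = ss_pure_error / df_pure_error

# The lack-of-fit F ratio compares the two mean squares
f_ratio = ms_lack_of_fit / ms_pure_error

# The two components add up to the total error line of the table
ss_total_error = ss_lack_of_fit + ss_pure_error
df_total_error = df_lack_of_fit + df_pure_error

print(f"MS lack of fit = {ms_lack_of_fit:.6f}, MS pure error = {ms_pure_error:.6f}")
print(f"F = {f_ratio:.4f} on ({df_lack_of_fit}, {df_pure_error}) degrees of freedom")
```

With only 2 degrees of freedom for pure error, this test has little power, so the large p-value is weak reassurance rather than proof of linearity.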

Figure 6: Normality Plot of Stu. Residuals
(for model log(10) Average Mercury by log(10) Alkalinity)
Figure 6 is a normality plot of the studentized residuals from the log(10) average mercury by
log(10) alkalinity model. The residuals stay close to the normal line and do not go outside of
the 95% bands. The studentized residuals are now approximately normally distributed. The
transformation corrected the non-normality.

Figure 7: Stud. Residuals by Log(10) Alkalinity

Although the relationship between log(10) average mercury and log(10) alkalinity appears
linear and the studentized residuals follow a normal distribution, Figure 7 shows that the
residuals seem to have unequal variance. We can see a fan in the residuals in Figure 7. The
residuals vary more at higher values of log(10) alkalinity.

We can assume independence because the average mercury concentration in the sample of
bass in one Florida Lake should not have an effect on the mercury concentrations of the bass in
other lakes because they are isolated bodies of water.
To summarize, this model is behaving fairly well. Using a weighted regression gives more
influence to the observations that came from larger samples of Largemouth Bass. The
monotonically decreasing trend in the data was taken care of by taking the base 10 log of
alkalinity. I then ran into some non-normality in the residuals that I dealt with by taking the base
10 log of average mercury. Unfortunately, there still appears to be some unequal variance in the
residuals, as shown in Figure 7, which I would say is the weakest point of my model.

Linear Model
Figure 8: Regression Plot
(log(10) Average Mercury by log(10) Alkalinity)
Weight: No.samples

Summary of Fit

RSquare                       0.50667
RSquare Adj                   0.496997
Root Mean Square Error        0.847766
Mean of Response              -0.36503
Observations (or Sum Wgts)    692
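
Two of the Summary of Fit entries follow from numbers reported elsewhere in this report. The sketch below recomputes the root mean square error from the Total Error line of Table 2 and the adjusted R-Square from R-Square, with n = 53 lakes and one predictor. The variable names are my own.

```python
import math

# Total Error line of Table 2 (weighted residual sum of squares and its DF)
ss_error, df_error = 36.654047, 51

# Root mean square error is the square root of the error mean square
rmse = math.sqrt(ss_error / df_error)

# Adjusted R-Square penalizes R-Square for the number of predictors
r_square = 0.50667
n_lakes, n_predictors = 53, 1
r_square_adj = 1 - (1 - r_square) * (n_lakes - 1) / (n_lakes - n_predictors - 1)

print(f"RMSE = {rmse:.6f}, adjusted R-Square = {r_square_adj:.6f}")
```

The 692 in the last row is the sum of the weights (the total number of bass sampled), not the number of observations, which is 53.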

Linear Fit
log(10)[avg. mercury] - hat = 0.2472067 - 0.4630143*log(10)[alkalinity]
A 10-fold increase in the alkalinity of a Florida lake is associated with a 10^0.463 = 2.904-fold
multiplicative decrease in the predicted median of the average mercury concentration. In other
words, a 10-fold increase in the alkalinity of a Florida lake multiplies the predicted
concentration by 10^-0.463 = 0.344, which is an estimated 65.6% decrease in the mercury
concentrations found in Largemouth Bass living in the lake.
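
The step from the log-log slope to a percentage change is easy to get wrong, so here is the arithmetic as a small sketch. The only input is the fitted slope from the Linear Fit output; the variable names are my own.

```python
# Fitted slope of log10(avg. mercury) on log10(alkalinity)
slope = -0.4630143

# A 10-fold increase in alkalinity adds 1 to log10(alkalinity), so the
# predicted mercury concentration is multiplied by 10**slope
multiplicative_factor = 10 ** slope

# The same change expressed as a fold decrease and as a percent decrease
fold_decrease = 1 / multiplicative_factor
percent_decrease = (1 - multiplicative_factor) * 100

print(f"factor = {multiplicative_factor:.3f}")
print(f"fold decrease = {fold_decrease:.3f}")
print(f"percent decrease = {percent_decrease:.1f}%")
```

A percent decrease can never exceed 100%, which is a quick sanity check on any figure derived from a multiplicative factor.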
Because I am using a log-log model, the intercept is more difficult to interpret. We
can interpret the intercept as follows: Florida lakes with a log(10) alkalinity of 0 have
Largemouth Bass with a predicted log(10) average mercury concentration of 0.247. However, in
order to state this interpretation in terms of alkalinity and average mercury, we must manipulate
the regression equation. Our regression equation, which can be seen in the Linear Fit output
above, is log(10)[avg. mercury] - hat = 0.247 - 0.463*log(10)[alkalinity]. I must first remove
the log(10)[alkalinity] term from the equation, and I can do this by letting alkalinity equal 1 so
that log(10)[alkalinity] = log(10)[1] = 0. Now we are left with
log(10)[avg. mercury] - hat = 0.247. After raising 10 to the power of both sides we get
[avg. mercury] = 10^0.247 = 1.766. This means that Florida lakes with 1 mg/L of alkalinity contain
Largemouth Bass with a predicted average mercury concentration of 1.766 parts/million. We
have data for Lake Trafford, which has 1.2 mg/L of alkalinity, so making a prediction for the
average mercury concentration of Largemouth Bass in Florida lakes with 1 mg/L of alkalinity is
not much of an extrapolation. I would say the intercept is meaningful in this context.
Referring to the Summary of Fit output, we can see that R-Square is 0.507. This means that 50.7%
of the variability in the base 10 log of mercury concentrations in Largemouth Bass in Florida
lakes is explained by this regression model on the base 10 log of alkalinity (the other 49.3% is
unexplained variability).
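
The back-transformation used for the intercept works for any alkalinity value. The sketch below wraps it in a small function using the coefficients from the Linear Fit output. The function name is hypothetical, and reading the back-transformed value as a predicted median is an assumption that follows the slope interpretation earlier in this section.

```python
import math

# Coefficients from the Linear Fit output
INTERCEPT = 0.2472067
SLOPE = -0.4630143


def predicted_mercury_ppm(alkalinity_mg_per_l):
    """Back-transform the log-log fit to parts/million (hypothetical helper)."""
    log_mercury = INTERCEPT + SLOPE * math.log10(alkalinity_mg_per_l)
    return 10 ** log_mercury


# At 1 mg/L the log10(alkalinity) term vanishes, leaving 10**intercept
at_one = predicted_mercury_ppm(1.0)
# A 10-fold increase in alkalinity multiplies the prediction by 10**slope
at_ten = predicted_mercury_ppm(10.0)

print(f"1 mg/L: {at_one:.3f} ppm, 10 mg/L: {at_ten:.3f} ppm")
```

Using the unrounded intercept gives about 1.767 rather than the 1.766 obtained from the rounded value 0.247.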

Table 3: Top 5 Leverages

Lake                 hi
Griffin              0.1736525592
East Tohopekaliga    0.1164689823
Trout                0.102453403
Brick                0.0757299358
Tohopekaliga         0.0757299358

Table 3 displays the lakes with the top 5 leverages. Lakes Griffin, East Tohopekaliga, and Trout
all have leverages greater than the cutoff of 0.094. However, lakes East Tohopekaliga and Trout
have similar hat values, 0.116 and 0.102 respectively, which are not much greater than those of
the other observations. Lake Griffin, on the other hand, has a leverage of 0.174, which separates
it from the leverages of the other observations. I would consider Lake Griffin an observation
with high leverage. If it also has a large residual, it may greatly influence the model.
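
For reference, leverages in a weighted simple regression depend only on the predictor values and the weights, not on the response. The sketch below computes them from the standard closed form for one predictor plus an intercept. The function name and data are hypothetical, and this is not claimed to be how JMP produced Table 3.

```python
def weighted_leverages(x, w):
    """Hat values for weighted simple linear regression (intercept + slope)."""
    sum_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sum_w
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    # h_i = w_i * (1 / sum(w) + (x_i - x_bar)^2 / Sxx)
    return [wi * (1 / sum_w + (xi - x_bar) ** 2 / sxx) for wi, xi in zip(w, x)]


# Hypothetical values for illustration only (NOT the DASL data)
log_alkalinity = [0.1, 1.0, 1.4, 2.1]
fish_per_lake = [5, 12, 20, 8]

leverages = weighted_leverages(log_alkalinity, fish_per_lake)
# The hat values always sum to the number of fitted parameters (here 2)
print([round(h, 4) for h in leverages], round(sum(leverages), 6))
```

Because the hat values sum to 2, the average leverage over the 53 lakes in this report is 2/53, or about 0.038, which puts the values in Table 3 in context.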

Table 4: Top 5 Studentized Residuals

Lake          Stud. Residual
Puzzle        2.6355479835
Farm-13       -2.3036408905
Apopka        -2.0195020498
Parker        -2.0071976662
Deer Point    1.9428364132

Table 5: Top 5 Cook's Distances

Lake                 Cook's Distance
East Tohopekaliga    0.1994561949
Farm-13              0.1529166686
Puzzle               0.1310342084
Tohopekaliga         0.1087293064
Deer Point           0.070541742

Table 4 displays the lakes with the 5 largest studentized residuals. Lake Puzzle has the largest
studentized residual, 2.636, but it did not have a high leverage, so it likely won't be the most
influential observation. The top 5 studentized residuals are all around 2 in absolute value,
which is fairly high. This means that there are some lakes in our data for which the model does
not predict Largemouth Bass mercury concentrations very well. From Table 3 and Table 4, we can see
that the observations with high leverages did not tend to have very large residuals, and vice
versa, so there were not any extremely influential observations. Table 5 displays the lakes with
the 5 largest Cook's Distances. Although East Tohopekaliga has the largest Cook's Distance,
0.199, it is not too far from the Cook's Distances of the other influential observations, so I
don't think it was too much of a problem in the regression analysis.
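
The way leverage and residual size combine into influence can be sketched with the textbook formula for Cook's distance. This assumes the studentized residuals are internally studentized and that the model has two parameters (intercept and slope). The function name and the input values are hypothetical.

```python
def cooks_distance(stud_residual, leverage, n_params=2):
    """Cook's distance from an internally studentized residual and a hat value."""
    return (stud_residual ** 2 / n_params) * leverage / (1 - leverage)


# Hypothetical inputs for illustration only (NOT rows from Tables 3-5)
large_residual_low_leverage = cooks_distance(2.6, 0.04)
small_residual_high_leverage = cooks_distance(0.5, 0.17)
moderate_both = cooks_distance(2.0, 0.10)

print(f"{large_residual_low_leverage:.3f}")
print(f"{small_residual_high_leverage:.3f}")
print(f"{moderate_both:.3f}")
```

An observation needs both a sizable residual and a sizable leverage to score highly, which fits the pattern across Tables 3, 4, and 5.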

Statistical Inference
The population of interest is Largemouth Bass in all Florida lakes. Unfortunately, the researchers
who collected the data did not indicate whether the 53 Florida lakes were randomly
selected, so we do not know if our sample of lakes is representative of all Florida lakes.
Therefore we should not generalize our findings to the population. Also, it was not specified
whether the samples of Largemouth Bass collected from each lake were random. The researchers may
have systematically selected bass from a particular part of the lake where mercury concentrations
did not represent the average mercury concentration of the lake as a whole. Therefore we may not
even be able to generalize the average mercury concentration in a sample of Largemouth Bass to
all the Largemouth Bass in a particular Florida lake.
H0: β1 = 0 (there is no relationship between log(10)[alkalinity] and average log(10)[average
mercury])
Ha: β1 ≠ 0 (there is a relationship between log(10)[alkalinity] and average log(10)[average
mercury])

Here 10^β1 represents the true multiplicative change in the median of average mercury associated
with a 10-fold change in alkalinity. I ran a two-sided test because I did not have any initial
conjectures of the direction of the relationship between log(10)[average mercury] and
log(10)[alkalinity].

Parameter Estimates

Term                Estimate     Std Error    t Ratio    Prob>|t|    Lower 95%    Upper 95%
Intercept           0.2472067    0.090525     2.73       0.0087*     0.06547      0.4289433
Log10-alkalinity    -0.463014    0.063976     -7.24      <.0001*     -0.591451    -0.334578
For the regression of log(10)[average mercury] vs. log(10)[alkalinity] the observed slope of
-0.463 led to a test statistic of -7.24 (which follows a t-distribution with 53-2=51 degrees of
freedom) and yielded a p-value of <0.0001 (from the Parameter Estimates output). If there was
truly no association between log(10)[average mercury] and log(10)[alkalinity], we would expect
to see a sample slope as extreme as ours less than 0.01% of the time due to random chance. At
the 1% significance level, we have extremely strong evidence that there exists a genuine
relationship between the mercury concentration in Largemouth Bass from Florida lakes and
the alkalinity level of the lake. We are 95% confident that each 10-fold increase in alkalinity
(mg/L) multiplies the predicted median mercury concentration in Largemouth Bass by a factor of
between 10^-0.591 = 0.256 and 10^-0.335 = 0.463, that is, a decrease of between 53.7% and 74.4%.
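
The test statistic and the back-transformed interval for the slope can both be reproduced from the Parameter Estimates output. The sketch below uses the reported estimate, standard error, and 95% bounds for Log10-alkalinity. The variable names are my own.

```python
# Values copied from the Parameter Estimates row for Log10-alkalinity
estimate = -0.463014
std_error = 0.063976
lower_95, upper_95 = -0.591451, -0.334578

# The t ratio is the estimate divided by its standard error
t_ratio = estimate / std_error

# Back-transform the interval for the slope into multiplicative factors
# for a 10-fold increase in alkalinity (unitless, not parts/million)
factor_low = 10 ** lower_95
factor_high = 10 ** upper_95

# The same interval expressed as percent decreases
largest_decrease = (1 - factor_low) * 100
smallest_decrease = (1 - factor_high) * 100

print(f"t = {t_ratio:.2f}")
print(f"factor between {factor_low:.3f} and {factor_high:.3f}")
print(f"decrease between {smallest_decrease:.1f}% and {largest_decrease:.1f}%")
```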

Table 6: JMP Data

ID    Alkalinity    log(10)[Alk]    Sample Size    Lower 95% Mean    Upper 95% Mean    Lower 95% Indv.    Upper 95% Indv.
53    25            1.398           13.05          -0.466            -0.335            -0.876             0.076

I wanted to estimate the mean response and predict an individual response for a fictional lake in
Florida with an alkalinity of 25 mg/L. Because I used weighted regression, I had to assign a
weight to this observation, and I used the average of the weights from the other observations. So
I am making predictions for a lake with 25 mg/L of alkalinity where an imaginary sample of 13
Largemouth Bass was taken. We are 95% confident that Florida lakes with an alkalinity of 25 mg/L
contain Largemouth Bass with an expected average mercury concentration of between 10^-0.466 =
0.342 parts/million and 10^-0.335 = 0.462 parts/million. We are 95% confident that a single
Florida lake with an alkalinity of 25 mg/L will contain Largemouth Bass with an average mercury
concentration of between 10^-0.876 = 0.133 parts/million and 10^0.076 = 1.191 parts/million.
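
The interval endpoints in Table 6 are on the log(10) scale, so each one is raised to a power of 10 to return to parts/million. The sketch below does that conversion and also recomputes the point prediction from the fitted equation. The variable and dictionary names are my own.

```python
import math

# Coefficients from the Linear Fit output
INTERCEPT = 0.2472067
SLOPE = -0.4630143

# Point prediction for the fictional lake with alkalinity 25 mg/L
log_alkalinity = math.log10(25)
log_prediction = INTERCEPT + SLOPE * log_alkalinity
prediction_ppm = 10 ** log_prediction

# 95% interval endpoints on the log10 scale, copied from Table 6
log_intervals = {
    "mean": (-0.466, -0.335),
    "individual": (-0.876, 0.076),
}

# Back-transform each endpoint to parts/million
ppm_intervals = {
    name: (10 ** low, 10 ** high) for name, (low, high) in log_intervals.items()
}

print(f"log10(25) = {log_alkalinity:.3f}, prediction = {prediction_ppm:.3f} ppm")
for name, (low, high) in ppm_intervals.items():
    print(f"{name}: {low:.3f} to {high:.3f} ppm")
```

Both intervals are centered near -0.400 on the log scale, which matches the point prediction. After back-transformation they are no longer symmetric about the predicted value.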

Conclusion
There is statistically significant evidence that there exists a genuine relationship
between alkalinity and mercury concentrations in Largemouth Bass for Florida lakes. As
mentioned earlier, we may not be able to generalize our findings to all Florida lakes, depending
on how the lakes were selected. Regardless of whether we are able to generalize, we cannot
draw a cause-and-effect conclusion because a randomized experiment was not conducted. There
may have been confounding variables influencing this relationship. There is unequal
variance in the residuals that could be affecting our regression equation, parameter estimates,
and confidence intervals. None of the lakes observed were extremely unusual relative to the
other lakes in the sample. The 3 observations circled in Figure 5 (Lake Parker, Lake Apopka, and
Lake Farm-13 from left to right) all have about the same alkalinity level and similarly large
negative residuals. This may warrant some further investigation; perhaps they have some
similarities that I could include in the analysis to improve the model. The only other question I
would ask about these data is whether this trend can truly be generalized to all Florida lakes.
If I had more data from lakes in different states, I might even be able to create a model that
could be generalized to lakes across the nation.
