
Spring 2008 - Stat C141/Bioeng C141 - Statistics for Bioinformatics
Course Website: http://www.stat.berkeley.edu/users/hhuang/141C-2008.html
Section Website: http://www.stat.berkeley.edu/users/mgoldman
GSI Contact Info: Megan Goldman, mgoldman@stat.berkeley.edu
Office Hours: 342 Evans, M 10-11, Th 3-4, and by appointment

Q-Q plots

The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution. A q-q plot is a plot of the quantiles of the first data set against the quantiles of the second data set. The R command for this is quite intuitive: qqplot(x, y), where x and y are your two data sets. Let's look at a few examples to see what happens in certain situations:

1.1 Two data sets are from the same distribution

    x <- runif(100)
    y <- runif(100)
    qqplot(x, y)
    abline(a = 0, b = 1)

abline is a function that adds a line to an existing graph. There are several possible arguments. Here, I've used a and b: a is the intercept of the line, b is the slope. There are also arguments called h and v, which add a horizontal or vertical line, respectively. I've added the line here to make the graph easier to interpret. Note that our data seem to fall on a nice straight line right along the reference line I've drawn. This is what you expect to see when the two data sets come from the same distribution.
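If you're curious what qqplot is actually plotting, here is a small sketch. It assumes the two samples are the same size, in which case the plotted points are simply the two samples, each sorted. The seed and the names qq, same_x and same_y are mine, chosen only for illustration.

```r
set.seed(1)                      # fixed seed only so the sketch is reproducible
x <- runif(100)
y <- runif(100)

# plot.it = FALSE returns the plotted coordinates instead of drawing them
qq <- qqplot(x, y, plot.it = FALSE)

# For equal-sized samples the coordinates are the two samples, each sorted
same_x <- isTRUE(all.equal(qq$x, sort(x)))
same_y <- isTRUE(all.equal(qq$y, sort(y)))
c(same_x, same_y)
```

When the samples have different sizes, qqplot interpolates the quantiles of the larger one, so this shortcut no longer applies exactly.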

1.2 Two data sets have same shape, but different mean

    x <- runif(100)
    y <- runif(100, min = 0.5, max = 1.5)
    qqplot(x, y)
    abline(a = 0, b = 1)

Here, the data are still on a nice straight line, but that line runs parallel to the reference line (a 45 degree line passing through the origin) rather than on top of it. It sits about 0.5 higher, which is the difference between the two means.
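To put a number on that shift, one option is to look at the vertical gap between matching quantiles. This is only a sketch: the seed and the names qq, gap and shift are my own, and the estimate will wobble from sample to sample.

```r
set.seed(2)                      # fixed seed only so the sketch is reproducible
x <- runif(100)
y <- runif(100, min = 0.5, max = 1.5)

# Get the q-q coordinates without drawing anything
qq <- qqplot(x, y, plot.it = FALSE)

# Vertical gap between matching quantiles; it should hover around 0.5
gap <- qq$y - qq$x
shift <- mean(gap)
shift

# To draw a reference line through the shifted points instead:
# qqplot(x, y); abline(a = shift, b = 1)
```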

1.3 One data set has more room in the tails than the other

    x <- rnorm(100)
    y <- rt(100, df = 3)
    qqplot(x, y)
    abline(a = 0, b = 1)

The t-distribution, at low degrees of freedom, has more room in the tails than the normal distribution. You should see the data falling neatly along the reference line in the middle, but going off the line at either end.
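The same point can be seen without any random data by comparing the theoretical quantiles directly. In this sketch the probabilities in p are my own choice, and q_norm and q_t3 are just illustrative names.

```r
# A few probabilities, from the far left tail to the far right tail
p <- c(0.001, 0.01, 0.25, 0.5, 0.75, 0.99, 0.999)

q_norm <- qnorm(p)        # standard normal quantiles
q_t3   <- qt(p, df = 3)   # t quantiles with 3 degrees of freedom

# Side by side: the t quantiles are much more extreme at both ends
round(cbind(p, q_norm, q_t3), 2)
```

The two columns are closest near the center and pull apart in the tails, which is the bending you see at either end of the q-q plot.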

K-means Clustering

Another intuitive function! K-means clustering is done using the function kmeans. There are many possible arguments, but only a few you should worry about for now. The first argument is your data. The second is called centers, and can be one of two things: either a number, which means you want there to be that many clusters, or a set of data points (one row per center), which means you want the initial centers to be exactly at those points. If you choose to use just a number, R will randomly pick that many rows from your data set and use those as the initial centers. There are several other arguments; read the help file for details about them.

First, let's set up some data:

    x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
               matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
    colnames(x) <- c("x", "y")

We haven't used rbind before: it stands for row bind and, as the name suggests, it takes the data and binds it together by rows. (There's also cbind, which is analogous and binds columns.) So, what we wind up with here is a 100x2 array of data. The first 50 rows are generated by rnorm(100, sd = .3), the second 50 by rnorm(100, mean = 1, sd = .3). The colnames function just assigns names to the two columns. Try plotting the data; you should see two blobs of data with a bit of overlap:

    plot(x)

Now, we'll run k-means on x with K = 2 and see what happens:

    (cl <- kmeans(x, 2))

Note that I called the function kmeans, assigned the output to an object called cl, and put the whole statement inside parentheses. This is a two-birds-with-one-stone construction: the result of the function will be both stored in an object called cl AND printed on the screen. Here's what the output looks like (yours will certainly differ from mine, but they'll be pretty close):

    K-means clustering with 2 clusters of sizes 51, 49

    Cluster means:
               x          y
    1 0.03606426 -0.1113093
    2 1.07169344  1.0395200

    Clustering vector:
     [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    [37] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
    [73] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2

    Within cluster sum of squares by cluster:
    [1]  8.831303 10.587096

    Available components:
    [1] "cluster"  "centers"  "withinss" "size"

What we learn here: the clustering put 51 points in one group, and 49 in the other. The first group is centered at about (0.04, -0.11) and the second is centered at about (1.07, 1.04). Clustering vector tells us which group each point wound up in. Now let's do some neat plotting:

    plot(x, col = cl$cluster)
    points(cl$centers, col = 1:2, pch = 8, cex = 2)

The plot function is familiar to you, but this time I've used the col argument. col stands for color. col = cl$cluster tells R to assign each point a color according to the number in the cluster portion of the kmeans output. Each color has a number: 1 is black, 2 is red, and so on. The points function we haven't used since week 1 or 2. Then, we were adding a curve to an existing plot; here, we're just adding two points. The two points to be added are cl$centers, the two centers of the groups. col = 1:2 once again assigns colors 1 and 2. pch is the plotting character, telling R what sort of symbol we want. pch = 8 is a star. The default (pch = 1) is an open dot. The various help/tutorial files have lists of what symbol goes with which pch number. Finally, cex scales the size of the symbol: we're making it twice as big as usual so it stands out.

What happens if we use too many groups?

    (cl <- kmeans(x, 5))
    plot(x, col = cl$cluster)
    points(cl$centers, col = 1:5, pch = 8)

What a mess! I wound up with the two actual groups being split into two chunks each, with the fifth group filling the gap in the middle. Try running the code above a few times and watch your centers dance around. ;-)

Here, it may be useful to use the nstart argument in the kmeans function. nstart tells R to run the algorithm that many times, using a different set of random starting points each time:

    (cl <- kmeans(x, 5, nstart = 5))
    plot(x, col = cl$cluster)
    points(cl$centers, col = 1:5, pch = 8)

The result it reports is the run that produced the lowest total within-cluster sum of squares.
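Two things from this section are easy to try for yourself: giving kmeans explicit starting centers, and checking what nstart buys you. The sketch below is mine, not part of the original handout. The seed, the starting matrix start, and the names cl_fixed, cl_one and cl_many are assumptions made for illustration.

```r
set.seed(3)                      # fixed seed only so the sketch is reproducible
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# centers as a matrix: one row per starting center, one column per variable
start <- rbind(c(0, 0), c(1, 1))
cl_fixed <- kmeans(x, centers = start)
cl_fixed$centers                 # final centers after the algorithm has run
cl_fixed$size                    # number of points in each cluster

# centers as a number: compare a single random start with several
cl_one  <- kmeans(x, 5)
cl_many <- kmeans(x, 5, nstart = 20)

# Total within-cluster sum of squares (smaller means tighter clusters)
c(one_start   = sum(cl_one$withinss),
  many_starts = sum(cl_many$withinss))
```

With several starts the total is usually no larger than with a single start, but because the starting points are random, rerunning the comparison will give somewhat different numbers each time.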

as.factor

as.factor() is a function that takes data and treats it like a factor, rather than a number. Observe the difference between the following:

    summary(cl$cluster)
    summary(as.factor(cl$cluster))

In the first line, R sees cluster as a number, so summary reports the mean, median, etc. In the second line, R sees cluster as a factor, so summary reports the number of items in each level. In this case, it's redundant, since the number in each group is reported in the kmeans output.

as.factor can also be used to convert characters to factors:

    nuc <- c("a", "c", "g", "t")
    nuc <- sample(nuc, 100, replace = TRUE)
    summary(nuc)
    summary(as.factor(nuc))

Note that the first summary isn't terribly helpful: it tells you that there are 100 characters. We knew that already, thanks. By wrapping nuc in as.factor, we instead get the counts of each nucleotide.
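If you want to poke at the factor a little more, here is a short sketch. The seed and the names f, counts and tab are mine. It also shows table(), which is another common way to get the same counts straight from the character vector.

```r
set.seed(4)                      # fixed seed only so the sketch is reproducible
nuc <- sample(c("a", "c", "g", "t"), 100, replace = TRUE)

f <- as.factor(nuc)
levels(f)                        # the distinct values the factor can take

counts <- summary(f)             # counts per level, as in the text above
tab <- table(nuc)                # the same counts without converting first

counts
tab
```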
