Statement of Hypothesis

The outcome of this observation and investigation was serious, both in terms of age, gender, address, and family income and in terms of the format, design, and standard of the data.

A data mining algorithm outputs a set of patterns P for an input data set D, where P is a subset of the set of all possible patterns 𝒫, that is, P ⊆ 𝒫. The algorithm actually considers all possible patterns in 𝒫 but returns only those that meet the predefined criteria. The significance testing is therefore biased towards interesting patterns, which most likely also have smaller p-values. If only the hypotheses corresponding to the patterns in P were adjusted for multiple hypothesis testing (MHT), the respective MHT error would not be controlled, because the p-values do not marginally follow the uniform distribution over [0,1], which is an assumption of most existing MHT methods. If, on the other hand, all the patterns in 𝒫 were considered in the MHT adjustment, the test would be very conservative, and most likely no pattern would be declared statistically significant because of the extremely large number of hypotheses. For example, in frequent item set mining the set 𝒫 is of size 2^m - 1, where m is the number of items.

The problem is caused by the fact that the set of hypotheses is not fixed before the data is mined for patterns. If only the output patterns are tested for significance, the set of hypotheses is a random variable, and this needs to be accounted for in the significance testing. In this thesis, this is called the problem of the varying set of hypotheses. If the user defines the set of hypotheses before looking at the data, the existing MHT methods can be used for that set of hypotheses. Webb's method, however, can only be used if the data set can be split into two independent parts and there is enough data to split. For example, a binary data set in frequent item set mining can often be split into two parts, but splitting a graph in frequent subgraph mining is more difficult.
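To make the size argument concrete, the following minimal sketch computes the number of candidate item sets, 2^m - 1, and the per-hypothesis threshold that would result from adjusting over all of them. The Bonferroni correction is used here only as one familiar example of an MHT method; it is an assumption of this sketch, as are the function names and the significance level 0.05, none of which come from the text.

```python
def num_itemsets(m: int) -> int:
    """Number of non-empty item sets over m items: 2^m - 1."""
    return 2 ** m - 1


def bonferroni_threshold(alpha: float, num_hypotheses: int) -> float:
    """Per-hypothesis p-value threshold under a Bonferroni correction.

    Bonferroni is an illustrative choice, not the method of the text.
    """
    return alpha / num_hypotheses


if __name__ == "__main__":
    alpha = 0.05  # illustrative significance level (assumption)
    for m in (5, 10, 20, 40):
        n = num_itemsets(m)
        threshold = bonferroni_threshold(alpha, n)
        # The threshold shrinks exponentially in m, so very few p-values
        # can pass it once the number of items is moderately large.
        print(f"m={m:2d}  patterns={n}  threshold={threshold:.3e}")
```

Even for a few dozen items the threshold becomes so small that hardly any pattern could be declared significant, which is the conservativeness described above.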
Another approach by Webb [103], called layered critical values, is to stop the data mining at a certain level, which can greatly reduce the size of the set of hypotheses. For example, in frequent item set mining, only item sets of at most a given size would be returned. The justification is that, at least in frequent item set mining, only fairly small item sets are returned, and item sets that contain almost all items are virtually never returned. The existing MHT methods can then be used with the returned patterns to control the respective MHT error. While the method can be used in a variety of settings, it is still limited to level-wise search, and the interesting patterns need to be located in the lower levels.

The theorem is proven in the publication. The user is allowed to freely choose everything in the data mining setting, namely the algorithm A, the test statistic f, and the null distribution, which together include the definition of the set of all possible patterns 𝒫. However, the combination of all of these definitions is required to satisfy the subset pivotality requirement.
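The effect of stopping a level-wise search at a given item set size can be illustrated with a short sketch. It counts the item sets of size at most max_size among m items and, as one possible allocation, divides the significance level equally across levels and then equally among the item sets within each level. That allocation scheme, the function names, and the numbers used are assumptions made for illustration; the text above does not specify how the critical values are assigned.

```python
from math import comb


def patterns_up_to_level(m: int, max_size: int) -> int:
    """Number of item sets with between 1 and max_size items, out of m items."""
    return sum(comb(m, k) for k in range(1, max_size + 1))


def layered_thresholds(alpha: float, m: int, max_size: int) -> dict[int, float]:
    """Per-item-set threshold at each level (assumed allocation).

    alpha is split equally over the levels 1..max_size, and each level's
    share is split equally over the comb(m, k) item sets at that level.
    """
    per_level = alpha / max_size
    return {k: per_level / comb(m, k) for k in range(1, max_size + 1)}


if __name__ == "__main__":
    m, max_size, alpha = 20, 3, 0.05  # illustrative values (assumption)
    print("all item sets:       ", 2 ** m - 1)
    print("item sets up to size", max_size, ":", patterns_up_to_level(m, max_size))
    for level, threshold in layered_thresholds(alpha, m, max_size).items():
        print(f"level {level}: threshold {threshold:.3e}")
```

Because the thresholds at all levels sum, over all counted item sets, to alpha, the overall budget is preserved while small item sets keep a far less strict threshold than they would under an adjustment over all 2^m - 1 patterns.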

Significance of the Study

The result of this observation is trustworthy for the teachers and students.