CSC 2302 Data Structures

Fall 2010

Description
Write a menu-driven program that allows you to enter, view, and store information about animals in text and binary files. You will use the information you entered, with some modifications, to implement binary decision trees and to play the game of Guess the Animal. In this game, one person thinks of an animal and the other person tries to guess the animal based on yes/no answers to a series of questions. For example:

Does the animal swim? (yes)
Is the animal a fish? (no)
Is the animal a mammal? (yes)
Is the animal very large? (yes)
It's a whale!

In the version of the game you will build, the user thinks of an animal and the program asks the user yes/no questions about the animal based on its previously acquired knowledge of animals. If the user has thought of an animal that the program knows about and answers the questions about the animal in the way the program expects, the program will eventually guess the animal.

Overall Program Structure
There are two main stages in the program. The first stage is devoted to building the resources the program needs to play the game. The program starts out by loading the questions it knows about; then it asks the user to either build or load animal resources, using a menu augmented with the options dealing with questions (which you may or may not have implemented), and builds a binary decision tree. The second stage is to play the animal guessing game, which the program can do after it has acquired animal resources and built the decision tree from them.

Building or Loading Resources
The program begins by reading questions from a text file, so you will need functions to read questions from and write questions to such a file. The first line in the text file should be an integer telling you how many questions are in the file.
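The file format described above (a count on the first line, then the questions) could be read and written roughly as follows. This is a minimal sketch, not the required design: the names load_questions and save_questions, the constants MAX_QUESTIONS and MAX_QUESTION_LEN, and the one-question-per-line layout are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_QUESTIONS 32      /* assumed capacity, not from the assignment */
#define MAX_QUESTION_LEN 128  /* assumed maximum line length */

/* Reads the question count from the first line, then one question per line.
   Returns the number of questions read, or -1 on any error. */
int load_questions(const char *path, char questions[][MAX_QUESTION_LEN], int capacity)
{
    assert(path != NULL && questions != NULL && capacity > 0);
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    char line[MAX_QUESTION_LEN];
    int count = 0;
    if (fgets(line, sizeof line, fp) == NULL || sscanf(line, "%d", &count) != 1 ||
        count < 0 || count > capacity) {
        fclose(fp);
        return -1;
    }

    int n = 0;
    while (n < count && fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';  /* strip the trailing newline */
        strcpy(questions[n], line);
        n++;
    }
    fclose(fp);
    return (n == count) ? n : -1;  /* fewer lines than promised is an error */
}

/* Writes the count, then one question per line. Returns 0 on success. */
int save_questions(const char *path, char questions[][MAX_QUESTION_LEN], int count)
{
    assert(path != NULL && questions != NULL && count >= 0);
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    fprintf(fp, "%d\n", count);
    for (int i = 0; i < count; i++)
        fprintf(fp, "%s\n", questions[i]);
    return (fclose(fp) == 0) ? 0 : -1;
}
```

Storing the count first lets the reader check the file against the space it has before reading any question text. A question longer than MAX_QUESTION_LEN - 1 characters would be split across two reads in this sketch, so a real implementation should guard against that.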
Subsequently, the program presents to the user the options in the Assignment 2 menu, which allow the user to load animal data from files (text and binary), add animal data, and store animal data in files (text and binary). The menu is augmented with the following options dealing with questions:
- Display questions: Calls the function Display_Questions to display the contents of questions. This allows the user to see what questions the program knows to ask.
- Add question: Calls the function Add_Question to add an entry to the array questions. When a new question is entered, the user is given the option to go through all the animals already known and add the answer to the new question to each animal.
- Add QA to animal: Calls the function Add_QA to add (or change) the answer to a question for an animal. It should ask the user for the question (by displaying the available questions and using the index of the question in the question array for the answer) and for the animal (by name).

If you didn't implement the above functionality concerning questions, you will need to do so for this assignment. You will also need to develop or modify your test suite to make sure that you have thought through how to test your program thoroughly.

Building a Binary Decision Tree
You will use the question and animal information to build a decision tree (or discrimination net) for playing the animal guessing game. This kind of tree associates with each node a binary decision, expressed as the answer to a question (YES/NO or TRUE/FALSE), and assumes that all the data in one subtree represents a YES (or TRUE) answer to the question, while all the data in the other subtree represents a NO (or FALSE) answer to the same question. In building the decision tree, an important criterion is to use the information we have about questions and answers to make the tree as balanced as possible given the information available to us so far.

Assume you have N animals and M questions. Following is an algorithm for building the tree.

Step 1. You start by computing the proportion of the n animals under consideration (initially n = N) that answer YES (or NO) to each question: nNO(Qi) = n - nYES(Qi) and nYES(Qi) = n - nNO(Qi), where nNO(Qi) and nYES(Qi) are the number of animals for which the answer to question Qi is NO or YES, respectively.

Step 2. From the computations performed in Step 1, you select the question for which nNO and nYES most closely approximate n/2. This is the question that most closely splits your animal sample in half and is therefore a good question to put at the root of the tree (if you have two or more questions that are equally good candidates for splitting the sample, you pick one randomly). This question, Qs, will be associated with the root of the current subtree (initially the root of the whole tree).

Step 3.
a) Recurse on the animals that answered YES to Qs, performing steps 1 and 2 again using only questions Qi, where i ≠ s. This will lead you to choose another question as the root of the subtree that will become the left child of Qs.
b) Recurse on the animals that answered NO to Qs, performing steps 1 and 2 again using only questions Qj, where j ≠ s. This will lead you to choose another question as the root of the subtree that will become the right child of Qs.

The recursion should stop when you have gone through all the questions, creating a tree with at most M+1 levels at its deepest point.
This is because the root of the tree represents the entire set of animals and you can use the M questions to create M YES/NO levels of splitting under it. If N = 2^M and the questions split the animals exactly in half at each question, you will have a complete binary tree of depth M. However, more realistically, the tree will not be complete: some of its paths from the root to the leaves will fall short of length M, and others will have the full length M but will not uniquely identify an animal.

An example of how to construct a tree is shown in the attached example spreadsheet. Let's consider how the tree is constructed in this example. You start out with N = 20 animals and M = 5 questions. The root of the tree (Level 0-1) represents all 20 animals. A value of 1 in row ai and column qj means that, for animal ai, the answer to qj is YES; a value of 0 means that it is NO. In the bottom row, the numbers in tan count the YES answers to each question; the numbers in other colors count the animals represented in the table. Since there are 20 animals and 20/2 = 10, q1 is the question that splits the animals in half most closely. So question q1 is associated with node Level 0-1, as indicated by the light green column.

Level 1 of the tree consists of the children of Level 0-1, namely Level 1-1 and Level 1-2, which represent the animals that answer YES (10 animals) and NO (10 animals) to q1, respectively.

Level 2 of the tree is created by looking at the other questions for those two nodes.
- For node Level 1-1, the question that splits the animals in half most closely is q5, so its two children represent the 5 animals that answer YES and the 5 animals that answer NO to q5. This split gives rise to nodes Level 2-1-1 and Level 2-1-2.
- For node Level 1-2, the question that splits the animals in half most closely is q2, so its children represent the 5 animals that answer YES and the 5 animals that answer NO to q2. This split gives rise to nodes Level 2-2-1 and Level 2-2-2.
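Steps 1 and 2 of the algorithm (count the YES answers per question, then pick the question closest to an even split) might be sketched as below. The names count_yes and pick_question, the 0/1 answer matrix, and the used flags are assumptions for illustration. One deliberate simplification: ties go to the lowest-numbered question rather than being broken randomly.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_QUESTIONS 5  /* M in the spreadsheet example; assumed fixed here */

/* answers[a][q] is 1 if animal a answers YES to question q, 0 for NO.
   subset holds the indices of the n animals at the current node. */
int count_yes(int answers[][NUM_QUESTIONS], const int *subset, int n, int q)
{
    assert(q >= 0 && q < NUM_QUESTIONS);
    int yes = 0;
    for (int i = 0; i < n; i++)
        yes += answers[subset[i]][q];
    return yes;
}

/* Returns the unused question whose YES count is closest to n/2,
   or -1 if every question is already used on this path. */
int pick_question(int answers[][NUM_QUESTIONS], const int *subset, int n,
                  const int *used)
{
    int best = -1;
    int best_dist = 0;
    for (int q = 0; q < NUM_QUESTIONS; q++) {
        if (used[q])
            continue;
        int yes = count_yes(answers, subset, n, q);
        int dist = abs(2 * yes - n);  /* |nYES - nNO|, zero for a perfect split */
        if (best < 0 || dist < best_dist) {
            best = q;
            best_dist = dist;
        }
    }
    return best;
}
```

Comparing |nYES - nNO| avoids the integer division in n/2 and gives the same ranking as comparing each count against n/2 directly.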


Level 3 of the tree is created by looking at the remaining questions for nodes at Level 2. There are 8 nodes, whose names reflect their level and their ancestry. This time, none of the questions splits the animals exactly in half.
- For the two children of Level 2-1-1, the three remaining questions (q2, q3, and q4) give exactly the same split (2 YES to 3 NO), so we pick q2 arbitrarily. This split gives rise to nodes Level 3-1-1-1 and Level 3-1-1-2.
- For the two children of Level 2-1-2, q3, with a 2 YES to 3 NO split, is better than either q2 or q4 (with a 1 YES to 4 NO split), so we pick q3. This split gives rise to nodes Level 3-1-2-1 and Level 3-1-2-2.
- For the two children of Level 2-2-1, the three remaining questions (q3, q4, and q5) give exactly the same split (2 YES to 3 NO), so we pick q3 arbitrarily. This split gives rise to nodes Level 3-2-1-1 and Level 3-2-1-2.
- For the two children of Level 2-2-2, the three remaining questions (q3, q4, and q5) give different but equally bad splits (q3 and q4 give 1 YES to 4 NO; q5 gives 4 YES to 1 NO), so we pick q5 arbitrarily (or on the criterion of more YESs). This split gives rise to nodes Level 3-2-2-1 and Level 3-2-2-2. The latter identifies a single animal, a18, and is a leaf.

Level 4 of the tree is created by looking at the remaining questions for nodes at Level 3. There are 16 nodes, and some of these are simpler than others.
- The nodes that contain only 1 animal are leaves in the tree: the animal has been fully identified by the questions. These nodes include the following:
  - Level 4-1-1-1-1 and Level 4-1-1-1-2: Question q3 discriminates (distinguishes) between the 2 remaining animals (q4 doesn't).
  - Level 4-1-1-2-1: A YES answer to question q3 uniquely identifies animal a7 (q4 isn't informative).
  - Level 4-1-2-1-1 and Level 4-1-2-1-2: Question q2 discriminates (distinguishes) between the 2 remaining animals (q4 doesn't).
  - Level 4-1-2-2-1: A YES answer to question q4 uniquely identifies animal a7 (q2 isn't informative).
  - Level 4-2-1-1-1 and Level 4-2-1-1-2: Question q5 discriminates (distinguishes) between the 2 remaining animals (q4 doesn't).
  - Level 4-2-1-2-2: A NO answer to question q4 uniquely identifies animal a16. A parallel situation would have occurred if we had split based on q5, so the choice of q4 here is as good as q5 (arbitrary or on the criterion of more YESs).
  - Level 4-2-2-1-1: A YES answer to question q3 uniquely identifies animal a11. A parallel situation would have occurred if we had split based on q4, so the choice of q3 here is as good as q4 (arbitrary).
- The nodes that contain more than one animal and need further discrimination are the following:
  - Level 4-1-1-2-2 and Level 4-1-2-2-2: These two nodes have 2 animals each, but the remaining questions (q4 and q2, respectively) don't discriminate between them: one or two new questions are needed here.
  - Level 4-2-1-2-1: Question q5 can be used to discriminate between the 2 remaining animals, producing nodes Level 5-2-1-2-1-1 and Level 5-2-1-2-1-2.
  - Level 4-2-2-1-2: A YES answer to question q4 uniquely identifies animal a19 but still leaves two animals undistinguished: a new question is needed here too.
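The recursive construction walked through in this example could be sketched as follows. It is an illustration under stated assumptions rather than the required design: node_t, build_tree, free_tree, and the 0/1 answer matrix are hypothetical names and layouts, ties go to the lowest-numbered question, and a node becomes a leaf when it holds one animal or when no remaining question separates its animals (the multi-animal leaves discussed above).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NQ 5  /* number of questions M; assumed fixed for this sketch */

typedef struct node_s node_t;
struct node_s {
    int question;      /* index of the question asked here, -1 at a leaf */
    int count;         /* number of animals this node represents */
    int *animals;      /* indices of those animals (owned by the node) */
    node_t *yes, *no;  /* left child = YES answers, right child = NO answers */
};

static void *xmalloc(size_t size)
{
    void *p = malloc(size);
    if (p == NULL)
        abort();  /* keep the sketch short: give up on allocation failure */
    return p;
}

/* Picks the unused question closest to an even split; stores its YES count. */
static int pick_question(int answers[][NQ], const int *set, int n,
                         const int *used, int *yes_out)
{
    int best = -1, best_dist = 0;
    for (int q = 0; q < NQ; q++) {
        if (used[q])
            continue;
        int yes = 0;
        for (int i = 0; i < n; i++)
            yes += answers[set[i]][q];
        int dist = abs(2 * yes - n);
        if (best < 0 || dist < best_dist) {
            best = q;
            best_dist = dist;
            *yes_out = yes;
        }
    }
    return best;
}

/* Builds the subtree for the n animals in set; used[] marks the questions
   already asked on the path from the root and is restored before returning. */
node_t *build_tree(int answers[][NQ], const int *set, int n, int *used)
{
    assert(n > 0);
    node_t *node = xmalloc(sizeof *node);
    node->question = -1;
    node->count = n;
    node->animals = xmalloc((size_t)n * sizeof *node->animals);
    memcpy(node->animals, set, (size_t)n * sizeof *node->animals);
    node->yes = node->no = NULL;

    int yes = 0;
    int q = (n > 1) ? pick_question(answers, set, n, used, &yes) : -1;
    if (q < 0 || yes == 0 || yes == n)
        return node;  /* leaf: one animal, or nothing left that separates them */

    /* Partition the animals: YES answers first, NO answers after them. */
    int *part = xmalloc((size_t)n * sizeof *part);
    int y = 0, k = yes;
    for (int i = 0; i < n; i++) {
        if (answers[set[i]][q])
            part[y++] = set[i];
        else
            part[k++] = set[i];
    }
    node->question = q;
    used[q] = 1;
    node->yes = build_tree(answers, part, yes, used);
    node->no = build_tree(answers, part + yes, n - yes, used);
    used[q] = 0;
    free(part);
    return node;
}

/* Post-order disposal: children first, then the node itself. */
void free_tree(node_t *node)
{
    if (node == NULL)
        return;
    free_tree(node->yes);
    free_tree(node->no);
    free(node->animals);
    free(node);
}
```

A leaf whose count is greater than 1 corresponds to a set of animals the current questions cannot tell apart, which is where the program would ask the user for a new question.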

So, once the tree is built following the general procedure outlined above, if there are leaves that represent a set of 2 or more animals that cannot be distinguished with the given questions, your program should present the user with those animals (in the appropriate sets) and allow the user to add more questions to distinguish the animals. The questions must be answered for the animals associated with the leaf of the tree that motivated the addition of a new question, but also for all of the other animals the program knows about. Each question will allow (some) multi-animal leaves to be refined into more detailed subtrees. Continue adding questions (and corresponding answers) until each animal is uniquely associated with a single leaf of the tree (that is, until every animal can be uniquely identified by a set of questions).

At that point, compute the balance on the leaves of the tree and display the tree for the user, showing for each node:
a) the question it answers and whether it answers YES or NO (for the root just say ROOT)
b) the question it ASKS
c) the number of animals that it represents
d) the balance of the node (see slides).

Try to display the tree as a tree. Note that it is MUCH MUCH easier to display the tree horizontally (as shown in the Excel sheet) than vertically (as we do on the board). For the example in the spreadsheet, the display might look something like the following, without the colors. You are welcome to use node names as in the spreadsheet, if it will help you.

ROOT - q1 - 20 - 1
  YES to q1 - q5 - 10 - 0
    YES to q5 - q2 - 5 - 0
      YES to q2 - q3 - 2 - 0
        YES to q3 - q4 - 1 - 0
        NO to q3 - q4 - 1 - 0
      NO to q2 - q3 - 3 - 0
        YES to q3 - q4 - 1 - 0
        NO to q3 - q4 - 2 - 0
    NO to q5 - q3 - 5 - 0
      YES to q3 - q2 - 2 - 0
        YES to q3 - q4 - 1 - 0
        NO to q3 - q4 - 1 - 0
      NO to q3 - q4 - 3 - 0
        YES to q4 - q2 - 1 - 0
        NO to q4 - q2 - 2 - 0
  NO to q1 - q2 - 10 - 1
    YES to q2 - q3 - 5 - 1
      YES to q3 - q5 - 2 - 0
        YES to q5 - q4 - 1 - 0
        NO to q5 - q4 - 1 - 0
      NO to q3 - q4 - 3 - -1
        YES to q4 - q5 - 2 - 0
          YES to q5 - xx - 1 - 0
          NO to q5 - xx - 1 - 0
        NO to q4 - q5 - 1 - 0
    NO to q2 - q5 - 5 - -2
      YES to q5 - q3 - 4 - 1
        YES to q3 - q4 - 1 - 0
        NO to q3 - q4 - 1 - 0
          YES to q4 - xx - 1 - 0
          NO to q4 - xx - 2 - 0
      NO to q5 - q3,4 - 1 - 0

More information and extra credit options are given below regarding dealing with unbalanced trees and adding questions.

Playing the game

Once the user is satisfied with the tree, the program can be used to play the game of Guess the Animal. The program starts by asking the question at the root of the tree. Depending on whether the user answers YES or NO, the program moves to the left or right subtree and asks the question at the root of that subtree. This continues until a leaf is reached; this leaf uniquely identifies an animal, so the program offers that animal as its guess. Then it asks the user whether the guess is right.

If the user says no, the program asks the user what the animal is. If the user names an animal that the program knows about, then some of the answers provided for the animal must be incorrect. The program should provide the list of questions and answers for the animal it guessed and the list of questions and answers for the animal the user wanted, and give the user the option of editing the animal. Editing the animal is a new function.

If the animal is not known by the program, the user is invited to add information about the animal, including a question and answer pair that will distinguish that animal from the one that the program incorrectly guessed. The leaf at which the guess was made will have to be replaced by a subtree using the new question.
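The walk from the root to a leaf, and the replacement of a wrongly guessed leaf by a small subtree, could look roughly like this. Everything named here (node_t, ask_fn, walk_to_leaf, split_leaf, ask_stdin, ask_script) is hypothetical and chosen for the sketch. The answer source is passed in as a callback so that the same walk works with the keyboard or with a scripted list of answers in a test suite.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct node_s node_t;
struct node_s {
    const char *question;  /* question asked at this node, NULL at a leaf */
    const char *animal;    /* animal guessed at a leaf, NULL otherwise */
    node_t *yes, *no;      /* left = YES subtree, right = NO subtree */
};

/* Answer source: returns 1 for YES and 0 for NO. */
typedef int (*ask_fn)(const char *question, void *ctx);

/* Interactive answer source reading y/n from standard input. */
int ask_stdin(const char *question, void *ctx)
{
    (void)ctx;
    char line[16];
    printf("%s (y/n) ", question);
    if (fgets(line, sizeof line, stdin) == NULL)
        return 0;
    return line[0] == 'y' || line[0] == 'Y';
}

/* Scripted answer source for tests: replays a fixed list of answers. */
typedef struct { const int *answers; int next; } script_t;

int ask_script(const char *question, void *ctx)
{
    (void)question;
    script_t *s = ctx;
    return s->answers[s->next++];
}

/* Follows YES/NO answers from the root down to the leaf holding the guess. */
node_t *walk_to_leaf(node_t *root, ask_fn ask, void *ctx)
{
    assert(root != NULL && ask != NULL);
    node_t *cur = root;
    while (cur->question != NULL)
        cur = ask(cur->question, ctx) ? cur->yes : cur->no;
    return cur;
}

/* Replaces a leaf by a subtree asking `question`. The new animal goes on the
   YES side if new_is_yes is nonzero; the old guess goes on the other side.
   Returns 0 on success, -1 if memory runs out (the leaf is left unchanged). */
int split_leaf(node_t *leaf, const char *question, const char *new_animal,
               int new_is_yes)
{
    assert(leaf != NULL && leaf->question == NULL);
    node_t *old = malloc(sizeof *old);
    node_t *fresh = malloc(sizeof *fresh);
    if (old == NULL || fresh == NULL) {
        free(old);
        free(fresh);
        return -1;
    }
    *old = (node_t){ NULL, leaf->animal, NULL, NULL };
    *fresh = (node_t){ NULL, new_animal, NULL, NULL };
    leaf->question = question;
    leaf->animal = NULL;
    leaf->yes = new_is_yes ? fresh : old;
    leaf->no = new_is_yes ? old : fresh;
    return 0;
}
```

Rewriting the leaf in place means its parent's pointer does not need to change, so no parent links are required for this operation.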

Basic Requirement
For the basic requirement, implement the new functionality described above, that is:
1. Your main program should first execute the actions described in Building or Loading Resources, that is, first load the questions and then allow the user to either build or load animal resources using the menu. IMPORTANT: Your Load_Animals function should completely reload the animals, deleting any animals it had in memory. Your Retrieve_Animals can add animals incrementally to existing animals in memory, but make sure you check for duplicates.
2. Once the data is loaded, you should build the binary decision tree and display it. Make sure that the array of questions in your program is large enough to accommodate the questions you already know and a few more that you might need to add while building the decision tree. If you stored your animal data in a binary file, you will probably need to throw it out and recreate it, because the space allocated for questions then may not be the same as what you allocate now. If you need to add questions and answers to build the tree, make sure you save the changes using both Store_Animals and Dump_Animals. The binary file allows you to restart more quickly, but the text file allows you to handle differences in data structures between the time you stored data externally and the time you read it back in.
3. In a loop, play the game until the user wants to quit. If you need to add or change questions and answers because the program could not identify the animal, make sure you store that data in both binary and text format.
4. Provide a test suite for your program and data to use for testing. You will be tested on test data of the complexity of the example given, with 20 animals and 5 questions.

Extra requirements:
Part A: Iterative Building of the tree. After a decision tree has been built and new questions and answers have been added for the animals, the user may want to rebuild the tree using the enlarged set of questions in the hope of getting a better tree.
The user may want to do so especially if the tree is unbalanced (i.e. any node in the tree is unbalanced, as is the case in the sample tree). Although there is no guarantee that rebuilding will produce a balanced tree, the balance may improve because decisions will be taken differently with the new information added, and the incremental building of the tree does strive for balance when it picks the next question for determining the children of a node. Note that rebuilding may cause new questions to need to be added, which may cause rebalancing of the tree and so on, so you should perform the building in a loop that stops when the user says so.

When you rebuild a tree, you need to rerun the algorithm from scratch. Before you do so, the old tree should be disposed of by (carefully) traversing the entire tree and freeing the nodes.

Part B: Allowing an Arbitrary Number of Questions. Because the number of questions may grow significantly as you build and rebuild the tree, the question array may need to grow to accommodate more questions than it was originally designed to hold. In the basic assignment, you were asked to make it sufficiently large so you wouldn't run out of space at runtime. To allow arbitrary expansion of the number of questions, however, you will need to change your data structures. The following constant and type definitions allow an arbitrary expansion by using dynamically allocated arrays:
#define QLENMAX 20   // The number of characters in a question
#define QASIZE 10    // The number of questions for which space is allocated

typedef char question_t[QLENMAX];   // This type holds a question

typedef struct questions_s questions_t;
struct questions_s {
    question_t *qarray;   // pointer to dynamically allocated array of questions
    int numqs;            // number of questions currently stored in array
    int curmaxqs;         // current maximum number of questions
};

The information about questions is a global variable (if it isn't, you will probably find yourself passing this data around to a lot of functions). The program should start out by allocating room for a number of questions that is computed as the maximum of QASIZE and the number stored at the top of the questions file. This value should be stored as the value of the curmaxqs field of the structure. Then the program should read the actual questions from the text file.

It is assumed that the animal data you have stored in text or binary format actually contains yes/no answers to that set of questions, no less, no more. If that's not the case, you won't be able to use the binary file. You will be able to read the text file and not lose information only if it contains answers to a (possibly improper) subset of the questions the program is currently reading, stored in the same order.

If the program adds questions as a result of building the decision tree or playing the game and the number of questions grows beyond curmaxqs, both the questions array and the answers array in the animal structures will need to grow, so the animal_t structure and the code that processes it will need to change a bit too. Instead of having the animal structure store an array answers of type
typedef enum { no, yes, unknown } answer_t;

it will store a pointer to a dynamically allocated array whose elements are of that type, i.e.
typedef struct animal_s animal_t;
struct animal_s {
    ...
    answer_t *answers;   // pointer to dynamically allocated array of answers
};

Note that you can use the members curmaxqs and numqs on the global questions variable to know the appropriate range of indices in the answers arrays of animals. When the animal is created, the answers array is dynamically allocated to contain the same number of elements as the value of questions.curmaxqs, initialized to the value unknown. When you read in animal information from a file, you will have to do it more carefully than you did before, for both text and binary files.
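The allocation rules described in this part might be sketched as follows, using the type definitions given in the text. The function names questions_init, animal_init, animal_grow, and questions_grow are assumptions for illustration, and the animal structure is reduced to its answers member because the other members are elided in the assignment text. Growth follows the allocate, copy, free pattern: a larger array is allocated, the old contents are copied over, and the old array is freed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define QLENMAX 20  /* the number of characters in a question */
#define QASIZE 10   /* the number of questions for which space is allocated */

typedef char question_t[QLENMAX];
typedef struct questions_s questions_t;
struct questions_s {
    question_t *qarray;  /* dynamically allocated array of questions */
    int numqs;           /* number of questions currently stored */
    int curmaxqs;        /* current maximum number of questions */
};

typedef enum { no, yes, unknown } answer_t;

typedef struct animal_s animal_t;
struct animal_s {
    /* other members are elided in the assignment text and omitted here */
    answer_t *answers;   /* dynamically allocated array of answers */
};

questions_t questions;   /* global, as suggested in the text */

/* Allocates room for max(QASIZE, count_in_file) questions. */
int questions_init(int count_in_file)
{
    int cap = (count_in_file > QASIZE) ? count_in_file : QASIZE;
    questions.qarray = calloc((size_t)cap, sizeof *questions.qarray);
    if (questions.qarray == NULL)
        return -1;
    questions.numqs = 0;
    questions.curmaxqs = cap;
    return 0;
}

/* Gives a new animal curmaxqs answers, all initialized to unknown. */
int animal_init(animal_t *a)
{
    a->answers = malloc((size_t)questions.curmaxqs * sizeof *a->answers);
    if (a->answers == NULL)
        return -1;
    for (int i = 0; i < questions.curmaxqs; i++)
        a->answers[i] = unknown;
    return 0;
}

/* Moves one animal's answers into a larger array; new slots are unknown. */
int animal_grow(animal_t *a, int old_max, int new_max)
{
    answer_t *bigger = malloc((size_t)new_max * sizeof *bigger);
    if (bigger == NULL)
        return -1;
    memcpy(bigger, a->answers, (size_t)old_max * sizeof *bigger);
    for (int i = old_max; i < new_max; i++)
        bigger[i] = unknown;
    free(a->answers);  /* free the old array only after copying it */
    a->answers = bigger;
    return 0;
}

/* Grows every answers array and then the question array to new_max.
   curmaxqs changes only if all allocations succeed. */
int questions_grow(animal_t *animals, int nanimals, int new_max)
{
    assert(new_max > questions.curmaxqs);
    question_t *bigger = calloc((size_t)new_max, sizeof *bigger);
    if (bigger == NULL)
        return -1;
    for (int i = 0; i < nanimals; i++) {
        if (animal_grow(&animals[i], questions.curmaxqs, new_max) != 0) {
            free(bigger);
            return -1;
        }
    }
    memcpy(bigger, questions.qarray, (size_t)questions.numqs * sizeof *bigger);
    free(questions.qarray);
    questions.qarray = bigger;
    questions.curmaxqs = new_max;
    return 0;
}
```

If an allocation fails partway through questions_grow, some animals may already hold the larger answers array. That is harmless in this sketch, because valid indices are still bounded by the unchanged curmaxqs.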

When the value of questions.curmaxqs needs to change as a result of building a decision tree and needing to add questions, the answer arrays will, like the question array, need to be re-allocated to a larger size. The contents of the older, smaller arrays are copied into the larger arrays. Don't forget to free the old arrays after you have copied their contents!

DELIVERABLES (WHAT YOU TURN IN)
An archive (.zip or .rar) file containing all your work:
- Your code. Please use different FILES (or PROJECTS if you use multiple files for the program) for different versions of the code if you do the extra credit: one file for the basics, another file for the basics with the extra credit.
- A file containing your test suite. A test suite contains several tests for your code. Each test contains a list of one or more commands to give to your code. The purpose of the test suite is to make sure that you test your code completely. The test suite is a written Word document.

DOCUMENT your program.
EXPLAIN your design choices if you feel you need to make any inside your program.
CONSIDER USING multiple .c files to distinguish the different parts of your program.
NAME YOUR FILES: WHICH stands for:
- BASIC for the basic requirement
- ECBoth for Extra requirement Part A and B
