Tracking and Predicting Student Performance Using Machine Learning 1

1. INTRODUCTION
1.1 OVERVIEW

Making higher education affordable is a pressing concern: few institutions graduate at or above
50 percent of their full-time students on time [2]. To make college more affordable, it is thus
crucial to ensure that many more students graduate on time through early interventions on
students whose performance is unlikely to meet the graduation criteria of the degree program on
time. A critical step towards effective intervention is to build a system that can continuously
keep track of students' academic performance and accurately predict their future performance,
such as when they are likely to graduate and their estimated final GPAs, given their current
progress. Although predicting student performance has been extensively studied in the
literature, it was primarily studied in the contexts of solving problems in Intelligent Tutoring
Systems (ITSs) [3][4][5][6], or of completing courses in classroom settings or on Massive Open
Online Course (MOOC) platforms [7][8]. However, predicting student performance within a
degree program (e.g. a college program) is significantly different and faces new challenges.

1.2 ORGANIZATION PROFILE

Software Solutions is an IT solution provider for a dynamic environment where business and
technology strategies converge. Their approach focuses on new ways of doing business,
combining IT innovation and adoption while also leveraging an organization's current IT assets.
They work with large global corporations to develop new products or services and to implement
prudent business and technology strategies in today's environment.

1.3 EXISTING SYSTEM

The complexity present in sentiment analysis and the power of machine learning techniques
have attracted many researchers to these two areas. Sentiment analysis combined with machine
learning techniques can result in a high-performance intelligent system and can prove its
expertise in the area of artificial intelligence. But it sometimes becomes a very complex job for
researchers to select a machine learning technique appropriate to their requirement, and a poor
choice leads to improper results with very poor model accuracy and performance. This
motivated us

Department of CSE MREC(Autonomous)


towards doing an investigation on the performance of the available machine learning techniques
for sentiment analysis. We have considered only the supervised machine learning techniques
and have compared them on each criterion.
Disadvantage:
In the case of unsupervised learning, labeled examples are not available; hence the model uses
its own intelligence for classification.
Reinforcement learning technique: In this technique there is a policy through which the model
learns how to behave with each observation in the world. Its every action has some impact on
the environment, and the environment provides feedback which again helps the model to learn.
Transduction technique: It is similar to the supervised machine learning technique, with the
difference that it does not build a function; rather, it tries to predict based on the training inputs,
training outputs and new inputs.
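The contrast between the supervised setting (labels given) and the unsupervised setting (labels discovered) can be made concrete with a small sketch. All data, labels and function names below are invented for illustration; the example assumes well-separated 1-D scores so neither cluster empties.

```python
# Supervised: labels are given, so class centroids are learned directly.
# Unsupervised: no labels, so 2-means must discover the groups itself.

def supervised_centroids(points, labels):
    """Mean of the points observed under each label."""
    cents = {}
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        cents[lab] = sum(members) / len(members)
    return cents

def kmeans_1d(points, c0, c1, iters=10):
    """Plain 2-cluster k-means on scalars: assign to nearest centroid, re-average."""
    for _ in range(iters):
        g0 = [p for p in points if abs(p - c0) <= abs(p - c1)]
        g1 = [p for p in points if abs(p - c0) > abs(p - c1)]
        c0 = sum(g0) / len(g0)
        c1 = sum(g1) / len(g1)
    return c0, c1

scores = [1.0, 1.2, 0.8, 7.9, 8.1, 8.0]                  # invented exam scores
labels = ["fail", "fail", "fail", "pass", "pass", "pass"]

print(supervised_centroids(scores, labels))   # uses the labels
print(kmeans_1d(scores, c0=0.0, c1=10.0))     # recovers the two groups unaided
```

Both approaches find centroids near 1.0 and 8.0 here, but only the supervised version knows which group means "pass".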
1.4 PROPOSED SYSTEM

The step-by-step procedure for the proposed methodology adopted in this research is as follows:
Collecting student data sets: Collecting data sets is the primary job for any kind of sentiment
analysis research; fortunately, some student data sets are freely available over the internet.
Cleaning the data sets: A student data set consists of characters, numbers, special characters
and unrecognized characters, which may create hazards for our classifier. That is why, after
collecting the data sets, we undertake a cleaning procedure that makes the data sets free of all
unwanted content. The cleaned data sets are then ready for the next step, which is classifying
the views available in the data sets.
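The cleaning step above can be sketched as follows. The concrete rules and the sample record are assumptions, since the report does not specify what counts as "unwanted content":

```python
import re

# Illustrative cleaning sketch: treat anything outside letters, digits,
# whitespace and basic punctuation as an unrecognized/special character.

def clean_record(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s.,-]", " ", text)   # replace special characters
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace runs
    return text

raw = "Roll#42\t GPA: 3.8 *** graduated??  \x0c"
print(clean_record(raw))   # -> roll 42 gpa 3.8 graduated
```

The real project would apply rules like these to every record before classification.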

Advantage:

The proposed method has two major features: -

• First, a structure comprising multiple base predictors and a cascade of ensemble
predictors is developed for making predictions based on students' evolving performance
states.

• Second, a data-driven approach based on latent factor models and probabilistic matrix
factorization is proposed to discover course relevance, which is important for
constructing efficient base predictors.
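The course-relevance idea in the second point can be illustrated with a minimal latent factor model. This is not the authors' exact model: it is a simplified matrix factorization on an invented, incomplete student x course grade matrix, shown only to suggest how learned course vectors can expose relevance between courses.

```python
import numpy as np

rng = np.random.default_rng(0)
G = np.array([[4.0, 3.0, 0.0],
              [3.5, 0.0, 2.0],
              [0.0, 2.5, 2.0]])         # invented grades; 0.0 = unobserved
mask = G > 0
k, lam, lr = 2, 0.02, 0.02              # latent dim, L2 weight, step size

U = 0.1 * rng.standard_normal((3, k))   # student factors
V = 0.1 * rng.standard_normal((3, k))   # course factors

for _ in range(5000):                   # gradient steps on observed entries only
    E = mask * (G - U @ V.T)
    U += lr * (E @ V - lam * U)
    V += lr * (E.T @ U - lam * V)

pred = U @ V.T
rmse = float(np.sqrt(((G - pred)[mask] ** 2).mean()))

# cosine similarity between course vectors as a crude "course relevance"
norms = np.linalg.norm(V, axis=1)
sim = V @ V.T / np.outer(norms, norms)
print(round(rmse, 3))
```

In the actual system, a probabilistic formulation with priors on the factors replaces this plain squared-error fit.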


2. LITERATURE SURVEY
Waiyamai et al analyzed a student database system using data mining techniques to improve
the quality of education in an engineering faculty. They studied and analyzed the student
database using engineering knowledge to solve some problems of student graduation [5].
Hendricks et al analyzed the student graduation trend of Texas technical colleges. The sample
data was the data warehouse of three technical colleges in Texas, analyzed with the Knowledge
SEEKER IV TM program. The results of this research identified important variables, including
those that affect the graduation of students [6].

The research by Varapron P (2003) used rough set theory as the classification approach to
analyze student data. The Rosetta toolkit was used to evaluate the students. Different
dependencies between the attributes and the student status were taken into account, and the
discovered patterns are explained in plain English [7]. P. Ramasubramanian et al [8] attempted
to analyze a student information system (SIS) database using rough set theory to predict
students' future performance.

M. Vranic et al [9] have explained how data mining algorithms and techniques can be used by
the academic community to potentially improve some aspects of the quality of education. One
of the main concerns of any higher educational system is evaluating and enhancing the
educational organization so as to improve the quality of its services and satisfy its customers'
needs.

Hoda Waguih et al have investigated and justified the importance of data mining in evaluating
student performance in a particular course in a real-world higher education system by
predicting the likelihood of success. The results of the experiment demonstrate how extracted
knowledge may help improve decision-making processes [10]. Higher education institutions
have long been interested in predicting the paths of students and alumni [11], thus identifying
which students join particular course programs [12], and which students require assistance in
order to graduate. At the same time, institutions want to learn whether some students are more
likely to transfer than others, and which groups of alumni are most likely to offer pledges.

Elena Susnea et al [13] discussed how data mining (DM) is useful for collecting and
interpreting significant data from huge databases. The education field offers several potential
data sources for data mining applications, which can help both instructors and students in
improving the learning process.

Juan Carlos Garcia et al [14] identified and characterized the different configurations of the
relevant factors related to administrative procedures, in order to learn about their behavior and
thus about decision-making processes in university management.

Talavera and Gaudioso [15] proposed to frame the analysis problem as a data mining task. The
authors suggested that the typical data mining cycle bears many resemblances to proposed
models for collaboration management, and presented some preliminary experiments using
clustering to discover patterns reflecting user behaviors.

Predictive techniques have also been mined from higher education data. In enrollment
management, González et al. (2002) used artificial neural networks (ANN) to predict
application behavior and compared the results with logistic regression. The ANN model
correctly classified 80.2% of prospective students, and the logistic regression model correctly
classified 78% of prospective students [16].

Herzog et al (2006) noted that understanding student enrollment behavior in higher education
institutions is very important prior to accurately predicting the number of dropouts, or who is
likely to take a long time to graduate. This prediction helps management professionals improve
enrollment, graduation rates and the precision of tuition revenue forecasts. The study compared
the prediction accuracy of three neural networks and a decision tree, which performed as well
as the regression model. However, the number and type of variables used as predictors do not
confer a substantial advantage on the data mining methods unless a large dataset is used [17].

Chang et al. (2006) applied neural networks, Classification and Regression Trees (CART), and
logistic regression to predict admissions yield. CART, neural networks, and logistic regression
obtained 74%, 75%, and 64% probability of correct classification, respectively [18]. Antons et
al. (2006) used decision trees, neural networks, and logistic regression to predict the enrollees
out of the applicants. For the real data, the logistic regression model correctly classified 66% of
the admitted applicants; however, it correctly classified only 49% of the enrollees and 78% of
the non-enrollees [19].


In academic graduation, Bailey et al (2006) developed a data mining model to predict
graduation rates using the Integrated Postsecondary Education Data System (IPEDS). IPEDS is
a National Center for Education Statistics (NCES) initiative that collects data from most higher
education institutions. The authors collected data from IPEDS for 5,771 institutions on various
areas, such as faculty salaries, staff headcount, financial aid, and institutional characteristics.
The objective of this study was to determine the institutional areas that influence graduation
using 17 classification and regression trees (CARTs). The best relationship between actual and
predicted graduation rate, given by Pearson's correlation (r), was 0.885 [20].
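For reference, the Pearson correlation reported above is the covariance of the two series divided by the product of their standard deviations. The graduation rates below are invented, not the study's data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

actual    = [0.42, 0.55, 0.61, 0.70, 0.35]   # made-up graduation rates
predicted = [0.40, 0.58, 0.65, 0.68, 0.30]
print(round(pearson_r(actual, predicted), 3))
```

Values near 1 (such as the 0.885 reported) indicate a strong linear agreement between actual and predicted rates.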

In data mining for web-based education, many studies on educational data mining made use of
data from web-based educational systems which record students' accesses in web logs
(Minaei-Bidgoli et al., 2003; Minaei-Bidgoli et al., 2004). Most of this research is used to
provide adaptation to a learner using the data stored in the student model. The patterns of use in
the gathered data are used to predict the most beneficial course of studies for each learner
based on their present usage [21,22].

Minaei-Bidgoli et al (2003) used data mining to predict the final grades/course scores of
students based on features extracted from students' logged data in an education web system at
Michigan State University. The authors developed classification models to find patterns in the
student usage data, such as the time spent on problems, reading the supporting material, total
number of tries, and others. The data consisted of 250 records from the system log and
contained attributes concerning each task solved (success rate, success at first try, number of
attempts, time spent on the problem, and giving up the problem) and other actions such as
participating in the communication mechanism and reading support material. Six classifiers
(quadratic Bayesian classifier, 1-nearest neighbour, k-nearest neighbours, Parzen window,
feed-forward neural network, and decision tree) and their combination were compared. When
the score variable had only two values (pass/fail), the accuracy was very good: the combination
of classifiers achieved 87% accuracy, and the best individual classifier, k-nearest neighbours,
achieved 82% accuracy.
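The k-nearest-neighbours classifier that performed best above can be sketched in a few lines. The features (number of tries, minutes spent) and all the data points are invented for illustration, not those of the Michigan State study:

```python
from collections import Counter

# Toy training set: (number of tries, minutes on problem) -> pass/fail
train = [
    ((2, 30), "pass"),
    ((3, 25), "pass"),
    ((1, 40), "pass"),
    ((9, 5),  "fail"),
    ((8, 8),  "fail"),
    ((7, 6),  "fail"),
]

def knn_predict(x, k=3):
    """Label of the majority among the k nearest training points."""
    nearest = sorted(train, key=lambda item:
                     sum((a - b) ** 2 for a, b in zip(item[0], x)))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((2, 35)))   # near the "pass" group -> pass
print(knn_predict((8, 4)))    # near the "fail" group -> fail
```

Real studies scale the features first so that no single attribute dominates the distance.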


3. SYSTEM ANALYSIS
3.1 STUDY OF THE SYSTEM
For flexibility of use, the interface has been developed with graphical concepts in mind and is
accessed through a browser interface. The GUIs at the top level have been categorized as
follows:

Administrative User Interface Design

The Operational and Generic User Interface Design

The administrative user interface concentrates on the consistent information that is practically
part of the organizational activities and which needs proper authentication for data collection.
The interface helps the administration with all the transactional states, such as data insertion,
data deletion and data updating, along with executive data search capabilities.

The operational and generic user interface helps the users of the system in transactions through
the existing data and required services. It also helps ordinary users manage their own
information in a customized manner, as per the assisted flexibilities.

3.2 SYSTEM ANALYSIS

The Systems Development Life Cycle (SDLC), or Software Development Life Cycle in systems
engineering, information systems and software engineering, is the process of creating or altering
systems, and the models and methodologies that people use to develop these systems. In software
engineering the SDLC concept underpins many kinds of software development methodologies.
These methodologies form the framework for planning and controlling the creation of an
information system: the software development process.

SOFTWARE MODEL OR ARCHITECTURE ANALYSIS:

Structured project management techniques (such as an SDLC) enhance management's control
over projects by dividing complex tasks into manageable sections. A software life cycle model
is either a descriptive or prescriptive characterization of how software is or should be
developed. However, none of the SDLC models discuss key issues such as change
management, incident management and release management within the SDLC process; these
are addressed in overall project management instead. In the proposed hypothetical model, the
concept of user-developer interaction in the conventional SDLC model has been converted into
a three-dimensional model comprising the user, the owner and the developer. The "one size fits
all" approach to applying SDLC methodologies is no longer appropriate. We have attempted to
address the above-mentioned defects by using a new hypothetical model for SDLC described
elsewhere. The drawback of addressing these management processes under overall project
management is that key technical issues of the software development process are missed: they
are treated in project management at the surface level, but not at the ground level.


4. FEASIBILITY REPORT
4.1 FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with
a very general plan for the project and some cost estimates. During system analysis, the
feasibility study of the proposed system is carried out to ensure that the proposed system is not
a burden to the company. For feasibility analysis, some understanding of the major
requirements for the system is essential.
Three key considerations involved in the feasibility analysis are:
 ECONOMICAL FEASIBILITY

 TECHNICAL FEASIBILITY

 OPERATIONAL FEASIBILITY

TECHNICAL FEASIBILITY

The technical issues usually raised during the feasibility stage of the investigation include the
following:

 Does the necessary technology exist to do what is suggested?


 Does the proposed equipment have the technical capacity to hold the data required by the
new system?
 Will the proposed system provide adequate response to inquiries, regardless of the number or
location of users?
 Can the system be upgraded if developed?
 Are there technical guarantees of accuracy, reliability, ease of access and data security?
Earlier, no system existed to cater to the needs of the 'Secure Infrastructure Implementation
System'. The current system developed is technically feasible. It is a web-based user interface
for audit workflow at NIC-CSD and thus provides easy access to the users. The database's
purpose is to create, establish and maintain a workflow among various entities in order to
facilitate all concerned users in their various capacities or roles. Permission would be granted
to users based on the roles specified. Therefore, the system provides the technical guarantee of
accuracy, reliability and security. The software and hardware requirements for the development
of this project are modest and are already available in-house at NIC or are available free as
open source. The work for the project is done with the current equipment and existing software
technology. Necessary bandwidth exists for providing fast feedback to the users irrespective of
the number of users using the system.

OPERATIONAL FEASIBILITY

Proposed projects are beneficial only if they can be turned into information systems that meet
the organization's operating requirements. Operational feasibility aspects of the project are to
be taken as an important part of the project implementation. Some of the important issues
raised to test the operational feasibility of a project include the following:

 Is there sufficient support for the management from the users?


 Will the system be used and work properly once it is developed and implemented?
 Will there be any resistance from the user that will undermine the possible application
benefits?
This system is targeted to be in accordance with the above-mentioned issues. The management
issues and user requirements have been taken into consideration beforehand, so there is no
question of resistance from the users that could undermine the possible application benefits.

The well-planned design would ensure the optimal utilization of the computer resources and
would help in the improvement of performance status.

ECONOMICAL FEASIBILITY
A system that can be developed technically, and that will be used if installed, must still be a
good investment for the organization. In the economical feasibility study, the development cost
of creating the system is evaluated against the ultimate benefit derived from the new system.
Financial benefits must equal or exceed the costs. The system is economically feasible: it does
not require any additional hardware or software.


5. SYSTEM REQUIREMENTS
5.1 FUNCTIONAL REQUIREMENT
Outputs from computer systems are required primarily to communicate the results of processing
to users. They are also used to provide a permanent copy of the results for later consultation. The
various types of outputs in general are:
 External Outputs, whose destination is outside the organization.
 Internal Outputs, whose destination is within the organization and which are the user's main
interface with the computer.
 Operational outputs, whose use is purely within the computer department.
 Interface outputs, which involve the user in communicating directly.
 Understanding the user's preferences, expertise level and business requirements through a
friendly questionnaire.
 Input data can be in four different forms: relational DB, text files, .xls and .xml files. For
testing and demo, data from any domain may be chosen. User-B can provide business data as
input.
5.2 NON-FUNCTIONAL REQUIREMENTS
 Secure access to confidential data (user's details). SSL can be used.
 24 X 7 availability.
 Better component design to get better performance at peak time
 Flexible service-based architecture will be highly desirable for future extension
5.3 SOFTWARE REQUIREMENTS

 Operating system: Windows

 Language: Python 2.7 and above

5.4 HARDWARE REQUIREMENTS


 System : Intel i3 and above
 Hard Disk : 500 GB.
 Monitor : 15 VGA Colour.

 Mouse : Logitech.
 Ram : 4GB and Higher


6. SOFTWARE ENVIRONMENT
6.1 INTRODUCTION TO PYTHON

Python is a high-level, interpreted, interactive and object-oriented scripting language. Python is


designed to be highly readable. It uses English keywords frequently, whereas other languages
use punctuation, and it has fewer syntactical constructions than other languages.

Python is Interpreted − Python is processed at runtime by the interpreter. You do not need
to compile your program before executing it. This is similar to PERL and PHP.

Python is Interactive − You can actually sit at a Python prompt and interact with the
interpreter directly to write your programs.

Python is Object-Oriented − Python supports Object-Oriented style or technique of


programming that encapsulates code within objects.

Python is a Beginner's Language − Python is a great language for beginner-level
programmers and supports the development of a wide range of applications, from simple text
processing to WWW browsers to games.
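The object-oriented style mentioned above can be illustrated with a minimal class; the `Student` class here is an invented example, not part of any library:

```python
# Minimal illustration of encapsulating state and behaviour in an object.

class Student:
    def __init__(self, name, grades):
        self.name = name          # state lives inside the object
        self.grades = grades

    def gpa(self):
        """Behaviour bundled with the data it operates on."""
        return sum(self.grades) / len(self.grades)

s = Student("Asha", [3.5, 4.0, 3.0])
print(s.name, round(s.gpa(), 2))   # Asha 3.5
```

In the interactive interpreter, the same lines can be typed at the `>>>` prompt and evaluated immediately.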

6.2 HISTORY OF PYTHON

Python was developed by Guido van Rossum in the late eighties and early nineties at the
National Research Institute for Mathematics and Computer Science in the Netherlands. Python
is derived from many other languages, including ABC, Modula-3, C, C++, Algol-68,
Smalltalk, Unix shell and other scripting languages. Python is copyrighted. Like Perl, Python
source code is now available under the GNU General Public License (GPL). Python is now
maintained by a core development team at the institute, although Guido van Rossum still holds
a vital role in directing its progress.

PYTHON FEATURES

Python's features include −


Easy-to-read − Python code is more clearly defined and visible to the eyes.


Easy-to-maintain − Python's source code is fairly easy-to-maintain.

A broad standard library − the bulk of Python's library is very portable and cross-platform
compatible on UNIX, Windows, and Macintosh.

Interactive Mode − Python has support for an interactive mode which allows interactive testing
and debugging of snippets of code.

Portable − Python can run on a wide variety of hardware platforms and has the same interface
on all platforms.

6.3 PYTHON - ENVIRONMENT SETUP

Python is available on a wide variety of platforms including Linux and Mac OS X. Let's
understand how to set up our Python environment.

LOCAL ENVIRONMENT SETUP

Open a terminal window and type "python" to find out if it is already installed and which version
is installed.

 Unix (Solaris, Linux, FreeBSD, AIX, HP/UX, SunOS, IRIX, etc.)

 Win 9x/NT/2000

 Macintosh (Intel, PPC, 68K)

 OS/2

 DOS (multiple versions)

 PalmOS

 Nokia mobile phones

 Windows CE

 Acorn/RISC OS


 BeOS

 Amiga

 VMS/OpenVMS

 QNX

 VxWorks

 Psion

 Python has also been ported to the Java and .NET virtual machines

GETTING PYTHON
The most up-to-date and current source code, binaries, documentation, news, etc., are available
on the official website of Python: https://www.python.org/

You can download Python documentation from https://www.python.org/doc/. The


documentation is available in HTML, PDF, and PostScript formats.

INSTALLING PYTHON
Python distribution is available for a wide variety of platforms. You need to download only the
binary code applicable for your platform and install Python.

If the binary code for your platform is not available, you need a C compiler to compile the source
code manually. Compiling the source code offers more flexibility in terms of choice of features
that you require in your installation.

Here is a quick overview of installing Python on various platforms −

UNIX AND LINUX INSTALLATION


Here are the simple steps to install Python on a Unix/Linux machine.

Open a Web browser and go to https://www.python.org/downloads/.

 Follow the link to download zipped source code available for Unix/Linux.
 Download and extract files.


 Edit the Modules/Setup file if you want to customize some options.

 Run the ./configure script
 make
 make install

This installs Python at the standard location /usr/local/bin and its libraries
at /usr/local/lib/pythonXX, where XX is the version of Python.

WINDOWS INSTALLATION

Here are the steps to install Python on a Windows machine.

 Open a Web browser and go to https://www.python.org/downloads/.


 Follow the link for the Windows installer python-XYZ.msi file where XYZ is the version
you need to install.
 To use this installer python-XYZ.msi, the Windows system must support Microsoft
Installer 2.0. Save the installer file to your local machine and then run it to find out if
your machine supports MSI.
 Run the downloaded file. This brings up the Python install wizard, which is really easy to
use. Just accept the default settings, wait until the install is finished, and you are done.

MACINTOSH INSTALLATION

Recent Macs come with Python installed, but it may be several years out of date. See
http://www.python.org/download/mac/ for instructions on getting the current version along
with extra tools to support development on the Mac. For older Mac OS versions before Mac
OS X 10.3 (released in 2003), MacPython is available.

Jack Jansen maintains it and you can have full access to the entire documentation at his
website − http://www.cwi.nl/~jack/macpython.html. Complete installation details for Mac OS
installation can be found there.


SETTING UP PATH

Programs and other executable files can be in many directories, so operating systems provide a
search path that lists the directories that the OS searches for executables. The path is stored in an
environment variable, which is a named string maintained by the operating system. This variable
contains information available to the command shell and other programs. The path variable is
named PATH in Unix or Path in Windows (Unix is case-sensitive; Windows is not). In Mac
OS, the
installer handles the path details. To invoke the Python interpreter from any particular directory,
you must add the Python directory to your path.

SETTING PATH AT UNIX/LINUX

To add the Python directory to the path for a particular session in Unix −

 In the csh shell − type setenv PATH "$PATH:/usr/local/bin/python" and press Enter.
 In the bash shell (Linux) − type export PATH="$PATH:/usr/local/bin/python" and press
Enter.
 In the sh or ksh shell − type PATH="$PATH:/usr/local/bin/python" and press Enter.
 Note − /usr/local/bin/python is the path of the Python directory

SETTING PATH AT WINDOWS

To add the Python directory to the path for a particular session in Windows −

At the command prompt − type path %path%;C:\Python and press Enter.

Note − C:\Python is the path of the Python directory


7. SYSTEM DESIGN
7.1 INTRODUCTION
Software design sits at the technical kernel of the software engineering process and is applied
regardless of the development paradigm and area of application. Design is the first step in the
development phase for any engineered product or system. The designer's goal is to produce a
model or representation of an entity that will later be built. Once system requirements have
been specified and analyzed, system design is the first of the three technical activities - design,
code and test - that are required to build and verify software.

The importance can be stated with a single word: "quality". Design is the place where quality is
fostered in software development. Design provides us with representations of software whose
quality can be assessed. Design is the only way that we can accurately translate a customer's
view into a finished software product or system. Software design serves as a foundation for all
the software engineering steps that follow. Without a strong design we risk building an
unstable system - one that will be difficult to test, and whose quality cannot be assessed until
the last stage. The purpose of the design phase is to plan a solution to the problem specified by
the requirements document. This phase is the first step in moving from the problem domain to
the solution domain. In other words, starting with what is needed, design takes us toward how
to satisfy the needs. The design of a system is perhaps the most critical factor affecting the
quality of the software; it has a major impact on the later phases, particularly testing and
maintenance. The output of this phase is the design document, which is similar to a blueprint
for the solution and is used later during implementation, testing and maintenance.
7.2 Data Flow Diagrams:

A data flow diagram is a graphical tool used to describe and analyze the movement of data
through a system - manual or automated - including the processes, stores of data, and delays in
the system. Data flow diagrams are the central tool and the basis from which other components
are developed.


1. Dataflow: Data move in a specific direction from an origin to a destination.

2. Process: People, procedures, or devices that use or produce (Transform) Data. The
physical component is not identified.

3. Source: External sources or destinations of data, which may be people, programs,
organizations or other entities.

4. Data Store: Here data are stored or referenced by a process in the System.


7.3 UML Diagrams Overview

Figure No.7.3: UML Diagrams Overview.

UML combines best techniques from data modeling (entity relationship diagrams), business
modeling (work flows), object modeling, and component modeling. It can be used with all
processes, throughout the software development life cycle, and across different implementation
technologies. UML has synthesized the notations of the Booch method, the Object Modeling
Technique (OMT) and Object-Oriented Software Engineering (OOSE) by fusing them into a
single, common and widely usable modeling language. UML aims to be a standard modeling
language which can model concurrent and distributed systems.


7.4 UML diagrams

7.4.1 CLASS DIAGRAM:

[Class diagram, recovered from the figure: data sets (NLTK module); positive and negative
boxes; data preparation (positive words, negative words); algorithm implementation (using
Naive Bayes); check words (positive and negative word checks); predict (result: number of
positives, number of negatives).]

Figure No.7.4.1: Class Diagram.


7.4.2 USE CASE DIAGRAM:

Use case diagrams are used for high-level requirement analysis of a system: when the requirements of a system are analyzed, the functionalities are captured in use cases. We can therefore say that use cases are nothing but the system functionalities written in an organized manner. The second thing relevant to use cases is the actors, which can be defined as anything that interacts with the system. A use case diagram shows:

 Functionalities, represented as use cases
 Actors
 Relationships among the use cases and actors.

Elements shown: admin, import nltk module, datasets, positive and negative text boxes, data preparation, positive and negative words boxes.

Figure No.7.4.2: Use Case Diagram Of Admin


Elements shown: user, algorithm implementations, train (using Naive Bayes), predicts, no. of positives and no. of negatives.

Figure No.7.4.2: Use Case Diagram Of User


7.4.3 SEQUENCE DIAGRAM:

A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram that
shows how processes operate with one another and in what order. It is a construct of a Message
Sequence Chart. A sequence diagram shows, as parallel vertical lines ("lifelines"), different
processes or objects that live simultaneously, and, as horizontal arrows, the messages exchanged
between them, in the order in which they occur. This allows the specification of simple runtime
scenarios in a graphical manner.

Lifelines: nltk module, datasets, data preparations, algorithm implementations, predict(result). Messages: 1: import(); 2: prepare positive/negative words(); 3: install(); 4: check the words().

Figure No.7.4.3: Sequence Diagram


7.4.4 COLLABORATION DIAGRAM:

A collaboration diagram, also called a communication diagram or interaction diagram, is easily represented by modeling objects in a system and representing the associations between the objects as links. The interaction between the objects is denoted by arrows. To identify the sequence of invocation of these objects, a number is placed next to each of these arrows. A sophisticated modeling tool can easily convert a collaboration diagram into a sequence diagram and vice versa. Hence, the elements of a collaboration diagram are essentially the same as those of a sequence diagram.

Objects and messages shown: nltk module, 1: import(); datasets, 2: prepare positive/negative words(); data preparation, 3: install(); algorithm implementations, 4: check positive/negative words(); predict(result).

Figure No.7.4.4: Collaboration Diagram


7.4.5 ACTIVITY DIAGRAM:

Activity diagrams are graphical representations of Workflows of stepwise activities and actions
with support for choice, iteration and concurrency. In the Unified Modeling Language, activity
diagrams can be used to describe the business and operational step-by-step workflows of
components in a system. An activity diagram shows the overall flow of control.

Activities shown: the admin imports the nltk module and prepares the datasets; the algorithm is implemented; words are checked against the positive/negative word lists; the user receives the result; the admin logs out.

Figure No.7.4.5: Activity Diagram


7.4.6 COMPONENT DIAGRAM:

Components shown: nltk, datasets, server, NaiveBayesClassifier, BigramAssocMeasures.

Figure No.7.4.6: Component Diagram


7.4.7 DEPLOYMENT DIAGRAM:

Nodes shown: admin (imports nltk module, prepares datasets), system, user (implements the algorithm, checks positive/negative words, gets the result).

Figure No.7.4.7: Deployment Diagram


7.5 DATAFLOW DIAGRAM:

Figure No. 7.5: Data Flow Diagram


8. IMPLEMENTATION
8.1 PERFORMANCE ANALYSIS
We have a CSV file of student reviews. Each row in the dataset contains the text of the review and whether the tone of the review was classified as positive or negative. We want to predict whether a review is negative or positive given only the text. In order to do this, we'll train an algorithm using the reviews and classifications in train.csv, and then make predictions on the reviews in test.csv. We'll then be able to calculate our error using the actual classifications in test.csv and see how good our predictions were. For our classification algorithm, we're going to use naive Bayes. A naive Bayes classifier works by figuring out the probability of different attributes of the data being associated with a certain class. It is based on Bayes' theorem.
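The error calculation described above (comparing predictions against the labels in test.csv) can be sketched with a small helper; the labels here are made-up stand-ins:

```python
def error_rate(predictions, actuals):
    # Fraction of test reviews whose predicted tone differs from the
    # actual classification (e.g. the label column of test.csv).
    wrong = sum(1 for p, a in zip(predictions, actuals) if p != a)
    return wrong / len(actuals)

# Hypothetical predictions vs. actual labels for five test reviews
predicted = ['pos', 'neg', 'pos', 'pos', 'neg']
actual    = ['pos', 'neg', 'neg', 'pos', 'neg']
print(error_rate(predicted, actual))  # one mismatch out of five -> 0.2
```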

SAMPLE CODE
import re

posFeatures = []
with open(RT_POLARITY_POS_FILE, 'r') as posSentences:
    for i in posSentences:
        # split each positive sentence into word and punctuation tokens
        posWords = re.findall(r"[\w']+|[.,!?;]", i.rstrip())
        posWords = [feature_select(posWords), 'pos']
        posFeatures.append(posWords)

negFeatures = []
with open(RT_POLARITY_NEG_FILE, 'r') as negSentences:
    for i in negSentences:
        negWords = re.findall(r"[\w']+|[.,!?;]", i.rstrip())
        negWords = [feature_select(negWords), 'neg']
        negFeatures.append(negWords)


8.2 NAIVE BAYES INTRO


Let's try a slightly different example. Let's say we still had one classification -- whether or not
you were tired. And let's say we had two data points -- whether or not you ran, and whether or
not you woke up early. Bayes' theorem doesn't work in this case, because we have two data
points, not just one.

This is where naive Bayes can help. Naive Bayes extends Bayes' theorem to handle this case by assuming that each data point is independent.

The formula looks like this:

P(y | x1, …, xn) = P(y) ∏i=1..n P(xi | y) / P(x1, …, xn)

This says: "the probability that classification y is correct given the features x1, x2, and so on equals the probability of y, times the product of the probability of each feature xi given y, divided by the probability of the features x1, …, xn". To find the "right" classification, we just find which classification y gives the highest P(y | x1, …, xn) under this formula.
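As a concrete check of the formula, here is a toy computation for the tired/ran/woke-up-early example above; the data rows are made up for illustration:

```python
# Each hypothetical row: (ran, woke_early, tired)
data = [
    (1, 1, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1),
    (0, 1, 0), (1, 0, 0), (0, 0, 0), (0, 0, 0),
]

def posterior(ran, woke_early):
    """P(tired | ran, woke_early) under the naive independence assumption."""
    scores = {}
    for label in (1, 0):
        rows = [r for r in data if r[2] == label]
        prior = len(rows) / len(data)                               # P(y)
        p_ran = sum(r[0] == ran for r in rows) / len(rows)          # P(x1 | y)
        p_woke = sum(r[1] == woke_early for r in rows) / len(rows)  # P(x2 | y)
        scores[label] = prior * p_ran * p_woke                      # numerator
    total = sum(scores.values())                                    # P(x1, x2)
    return {label: s / total for label, s in scores.items()}

print(posterior(1, 1))  # tired is roughly nine times as likely as not tired
```

Note that the denominator P(x1, x2) is the same for both classes, so it only rescales the scores; comparing the numerators alone already picks the same winner.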
SAMPLE CODE

import math
from nltk.classify import NaiveBayesClassifier

posCutoff = int(math.floor(len(posFeatures) * 3 / 4))
negCutoff = int(math.floor(len(negFeatures) * 3 / 4))
trainFeatures = posFeatures[:posCutoff] + negFeatures[:negCutoff]
classifier = NaiveBayesClassifier.train(trainFeatures)

8.3 FINDING WORD COUNTS


We're trying to determine if a data row should be classified as negative or positive. Because of this, we can ignore the denominator: as you saw in the last code example, it is a constant for each of the possible classes, affecting each probability equally, so it won't change which one is greatest.

So, we have to calculate the probabilities of each classification, and the probabilities of each
feature falling into each classification.


We'll then count up how many times each word occurs in the negative reviews, and how many
times each word occurs in the positive reviews. This will allow us to eventually compute the
probabilities of a new review belonging to each class.
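The sample code below assumes that word_fd (overall word frequencies) and cond_word_fd (per-class word frequencies) already exist; in the original they are NLTK FreqDist/ConditionalFreqDist objects. An equivalent sketch using only standard-library counters, with made-up review words:

```python
from collections import Counter, defaultdict

pos_words = ["good", "great", "good"]   # words drawn from positive reviews
neg_words = ["bad", "boring"]           # words drawn from negative reviews

word_fd = Counter()                     # how often each word occurs overall
cond_word_fd = defaultdict(Counter)     # how often it occurs in each class
for word in pos_words:
    word_fd[word] += 1
    cond_word_fd["pos"][word] += 1
for word in neg_words:
    word_fd[word] += 1
    cond_word_fd["neg"][word] += 1

pos_word_count = sum(cond_word_fd["pos"].values())   # 3
neg_word_count = sum(cond_word_fd["neg"].values())   # 2
total_word_count = pos_word_count + neg_word_count   # 5
```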

SAMPLE CODE

word_scores = {}
for word, freq in word_fd.items():
    # chi-squared association of this word with each class
    pos_score = BigramAssocMeasures.chi_sq(cond_word_fd['pos'][word],
                                           (freq, pos_word_count), total_word_count)
    neg_score = BigramAssocMeasures.chi_sq(cond_word_fd['neg'][word],
                                           (freq, neg_word_count), total_word_count)
    word_scores[word] = pos_score + neg_score
return word_scores

8.4 MAKING PREDICTIONS

Now that we have the word counts, we just have to convert them to probabilities and multiply them out to get the predicted classification. Let's say we wanted to find the probability that the review "didn't like it" expresses a negative sentiment. We would find the total number of times the word "didn't" occurred in the negative reviews, and divide it by the total number of words in the negative reviews, to get the probability of x given y. We would then do the same for "like" and "it". We would multiply all three probabilities, and then multiply by the probability of any document


expressing a negative sentiment to get our final probability that the sentence expresses negative
sentiment.
We would do the same for positive sentiment, and then whichever probability is greater would be
the class that the review is assigned to.

To do all this, we'll need to compute the probabilities of each class occurring in the data, and
then make a function to compute the classification.
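A bare-bones version of this calculation, with hypothetical reviews; a real implementation would add smoothing so that unseen words do not zero out the whole product:

```python
neg_reviews = [["didn't", "like", "it"], ["bad", "movie"]]
pos_reviews = [["loved", "it"], ["great", "movie"]]

def class_score(tokens, reviews, prior):
    """P(y) times the product over tokens of P(token | y), by counting."""
    words = [w for review in reviews for w in review]
    score = prior                              # P(y), the class probability
    for t in tokens:
        score *= words.count(t) / len(words)   # P(x_i | y)
    return score

review = ["didn't", "like", "it"]
neg = class_score(review, neg_reviews, prior=0.5)
pos = class_score(review, pos_reviews, prior=0.5)
print("negative" if neg > pos else "positive")  # prints "negative"
```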

SAMPLE CODE

numbers_to_test = [10, 100, 1000, 10000, 15000]
for num in numbers_to_test:
    print('evaluating best %d word features' % num)
    best_words = find_best_words(word_scores, num)
    evaluate_features(best_words)


9. SYSTEM TESTING
9.1 INTRODUCTION

Software testing is a critical element of software quality assurance and represents the ultimate
review of specification, design and coding. In fact, testing is the one step in the software
engineering process that could be viewed as destructive rather than constructive.

A strategy for software testing integrates software test case design methods into a well-planned series of steps that result in the successful construction of software. Testing is the set of activities that can be planned in advance and conducted systematically. The underlying motivation of program testing is to affirm software quality with methods that can be applied economically and effectively to both large and small-scale systems.

9.2 STRATEGIC APPROACH TO SOFTWARE TESTING

The software engineering process can be viewed as a spiral. Initially system engineering defines
the role of software and leads to software requirement analysis where the information domain,
functions, behavior, performance, constraints and validation criteria for software are established.
Moving inward along the spiral, we come to design and finally to coding. To develop computer
software, we spiral in along streamlines that decrease the level of abstraction on each turn.

A strategy for software testing may also be viewed in the context of the spiral. Unit testing begins at the vertex of the spiral and concentrates on each unit of the software as implemented in source code. Testing progresses by moving outward along the spiral to integration testing, where the focus is on the design and the construction of the software architecture. Taking another turn outward on the spiral, we encounter validation testing, where requirements established as part of software requirements analysis are validated against the software that has been constructed. Finally, we arrive at system testing, where the software and other system elements are tested as a whole.


Testing levels: unit testing and module testing (component testing); sub-system testing and system testing (integration testing); acceptance testing (user testing).

Figure No. 9.2: Software Testing

9.3 UNIT TESTING

Unit testing focuses verification effort on the smallest unit of software design, the module. Our unit testing is white-box oriented, and for some modules the steps are conducted in parallel.

1. WHITE BOX TESTING


This type of testing ensures that

 All independent paths have been exercised at least once


 All logical decisions have been exercised on their true and false sides
 All loops are executed at their boundaries and within their operational bounds
 All internal data structures have been exercised to assure their validity.

Department of CSE MREC(Autonomous)


Tracking and Predicting Student Performance Using Machine Learning 35

To follow the concept of white-box testing, we tested each form we created independently to verify that data flow is correct, all conditions are exercised to check their validity, and all loops are executed on their boundaries.

2. BASIC PATH TESTING

The established flow graph technique with cyclomatic complexity was used to derive test cases for all the functions. The main steps in deriving test cases were:

Use the design of the code and draw the corresponding flow graph.

Determine the cyclomatic complexity of the resultant flow graph, using the formula:

V(G) = E - N + 2, or

V(G) = P + 1, or

V(G) = number of regions

Where V (G) is Cyclomatic complexity,

E is the number of edges,

N is the number of flow graph nodes,

P is the number of predicate nodes.

Determine the basis set of linearly independent paths.
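For illustration, V(G) can be computed directly from an edge list; the graph below is a hypothetical flow graph for a single if-else, where node 2 is the only predicate node:

```python
def cyclomatic_complexity(edges):
    """V(G) = E - N + 2 for a connected flow graph given as (from, to) pairs."""
    nodes = {n for edge in edges for n in edge}
    return len(edges) - len(nodes) + 2

# 1 -> 2 (decision) -> 3 or 4 -> 5 (join): E = 5 edges, N = 5 nodes
edges = [(1, 2), (2, 3), (2, 4), (3, 5), (4, 5)]
print(cyclomatic_complexity(edges))  # 2, matching P + 1 with one predicate node
```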

3. CONDITIONAL TESTING

In this part of the testing, each of the conditions was tested for both its true and false aspects, and all the resulting paths were tested, so that each path that may be generated by a particular condition is traced to uncover any possible errors.

4. DATA FLOW TESTING

This type of testing selects paths through the program according to the locations of definitions and uses of variables. This kind of testing was used only when some local variables were declared. The


definition-use chain method was used in this type of testing. These were particularly useful in
nested statements.

5. LOOP TESTING

In this type of testing all the loops are tested to all the limits possible. The following exercise was adopted for all loops:
 All the loops were tested at their limits, just above them and just below them.
 All the loops were skipped at least once.
 For nested loops, the innermost loop was tested first, working outwards.
 For concatenated loops, the values of dependent loops were set with the help of the connected loop.
 Unstructured loops were resolved into nested or concatenated loops and tested as above.


10. OUTPUT SCREENS


11. CONCLUSION
In this paper, a simple approach to sentiment analysis of students is performed using seven promising supervised machine learning algorithms. The results obtained show linear SVC/SVM to be the best classifier among them, achieving 100% accuracy for a large number of students. In the future, we will investigate its effectiveness on big datasets using unsupervised and semi-supervised machine learning techniques. We also proposed a novel method for predicting students' future performance in degree programs given their current and past performance. A latent factor model-based course clustering method was developed to discover relevant courses for constructing base predictors. An ensemble-based progressive prediction architecture was developed to incorporate students' evolving performance into the prediction. These data-driven methods can be used in conjunction with other pedagogical methods for evaluating students' performance and provide valuable information for academic advisors to recommend subsequent courses to students and carry out pedagogical intervention measures if necessary. Additionally, this work will also impact curriculum design in degree programs and education policy design in general. Future work includes extending the performance prediction to elective courses and using the prediction results to recommend courses to students.


12. REFERENCES
[1] The White House, "Making college affordable," https://www.whitehouse.gov/issues/education/higher-education/making-college-affordable, 2016.
[2] Complete College America, "Four-year myth: Making college more affordable," http://completecollege.org/wp-content/uploads/2014/11/4-Year-Myth.pdf, 2014.
[3] H. Cen, K. Koedinger, and B. Junker, "Learning factors analysis – a general method for cognitive model evaluation and improvement," in International Conference on Intelligent Tutoring Systems. Springer, 2006, pp. 164–175.
[4] M. Feng, N. Heffernan, and K. Koedinger, "Addressing the assessment challenge with an online system that tutors as it assesses," User Modeling and User-Adapted Interaction, vol. 19, no. 3, pp. 243–266, 2009.
[5] H.-F. Yu, H.-Y. Lo, H.-P. Hsieh, J.-K. Lou, T. G. McKenzie, J.-Chou, P.-H. Chung, C.-H. Ho, C.-F. Chang, Y.-H. Wei et al., "Feature engineering and classifier ensemble for KDD Cup 2010," in Proceedings of the KDD Cup 2010 Workshop, 2010, pp. 1–16.
[6] Z. A. Pardos and N. T. Heffernan, "Using HMMs and bagged decision trees to leverage rich features of user and skill from an intelligent tutoring system dataset," Journal of Machine Learning Research W&CP, 2010.
[7] Y. Meier, J. Xu, O. Atan, and M. van der Schaar, "Personalized grade prediction: A data mining approach," in Data Mining (ICDM), 2015 IEEE International Conference on. IEEE, 2015, pp. 907–912.
[8] C. G. Brinton and M. Chiang, "MOOC performance prediction via clickstream data and social learning networks," in 2015 IEEE Conference on Computer Communications (INFOCOM). IEEE, 2015, pp. 2299–2307.
[9] KDD Cup, "Educational data mining challenge," https://pslcdatashop.web.cmu.edu/KDDCup/, 2010.
[10] Y. Jiang, R. S. Baker, L. Paquette, M. San Pedro, and N. T. Heffernan, "Learning, moment-by-moment and over the long term," in International Conference on Artificial Intelligence in Education. Springer, 2015, pp. 654–657.
[11] C. Marquez-Vera, C. Romero, and S. Ventura, "Predicting school failure using data mining," in Educational Data Mining 2011, 2010.


[12] Y.-h. Wang and H.-C. Liao, "Data mining for adaptive learning in a test-based e-learning system," Expert Systems with Applications, vol. 38, no. 6, pp. 6480–6485, 2011.
[13] N. Thai-Nghe, L. Drumond, T. Horvath, L. Schmidt-Thieme et al., "Multi-relational factorization models for predicting student performance," in Proc. of the KDD Workshop on Knowledge Discovery in Educational Data. Citeseer, 2011.
[14] A. Toscher and M. Jahrer, "Collaborative filtering applied to educational data mining," KDD Cup, 2010.
[15] R. Bekele and W. Menzel, "A Bayesian approach to predict performance of a student (BAPPS): A case with Ethiopian students," Algorithms, vol. 22, no. 23, p. 24, 2005.
[16] N. Thai-Nghe, T. Horvath, and L. Schmidt-Thieme, "Factorization models for forecasting student performance," in Educational Data Mining 2011, 2010.
[17] Y. Meier, J. Xu, O. Atan, and M. van der Schaar, "Predicting grades," IEEE Transactions on Signal Processing, vol. 64, no. 4, pp. 959–972, Feb 2016.
[18] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.
[19] Y. Koren, R. Bell, C. Volinsky et al., "Matrix factorization techniques for recommender systems," Computer, vol. 42, no. 8, pp. 30–37, 2009.
[20] R. Salakhutdinov and A. Mnih, "Probabilistic matrix factorization," in NIPS, vol. 20, 2011, pp. 1–8.
[21] M.-C. Yuen, I. King, and K.-S. Leung, "Task recommendation in crowdsourcing systems," in Proceedings of the First International Workshop on Crowdsourcing and Data Mining. ACM, 2012, pp. 22–26.
[22] K. Christodoulou and A. Banerjee, "Collaborative ranking with a push at the top," in Proceedings of the 24th International Conference on World Wide Web. ACM, 2015, pp. 205–215.
[23] Y. Xu, Z. Chen, J. Yin, Z. Wu, and T. Yao, "Learning to recommend with user generated content," in International Conference on Web-Age Information Management. Springer, 2015, pp. 221–232.
[24] A. S. Lan, A. E. Waters, C. Studer, and R. G. Baraniuk, "Sparse factor analysis for learning and content analytics," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1959–2008, 2014.