
MEE 437

OPERATIONS RESEARCH

Project Document
TEXT MINING FOR SUPPLIER MANUFACTURING
INDUSTRIES

Submitted to: Prof. Vijay Kumar Manupati

Slot: G1+TG1

Yesho Vardhan Gupta (14BCE0536)


Ruchi Agarwal Sanjay (14BCE0570)
Prasun Gokhlani (14BCE0589)
Gauri Agarwal (14BCE0648)
OPERATIONS RESEARCH PROJECT

TEXT MINING

ABSTRACT

With the recent development of weblogs and social networks, many supplier industries share
their data on different websites and weblogs. Even small-to-medium sized enterprises (SMEs)
in the manufacturing sector (as well as the non-manufacturing sector) are rapidly
strengthening their web presence in order to improve their visibility and customer reachability
and to remain competitive in the global market. With the unprecedented growth of unstructured
content on the Web, more advanced methods for information organization and retrieval are
needed to improve the intelligence and efficiency of the supplier discovery process. Text
mining is one such powerful method. One of its main applications is classifying the data
presented by these industries into different groups. Our study aims to classify data into various
groups so that users can identify the most appropriate content for their needs at any given time.
A new classification method based on the Support Vector Machine (SVM) is established. In
this approach we have classified the text data into two broad categories, manufacturing and
non-manufacturing suppliers. This information will be helpful to customers and to other
industrial suppliers as well. The performance of the proposed classifier was tested
experimentally using the standard metrics of precision, recall, and F-measure. The proposed
approach was evaluated using datasets obtained from the ThomasNet portal, and successful
results were obtained.

KEYWORDS: Supplier Industries, Support Vector Machine, Text Classification.

INTRODUCTION
Text mining, also referred to as text data mining, is the process of deriving high-quality
information from text. High-quality information is typically derived by devising patterns and
trends through means such as statistical pattern learning. Text mining generally involves the
process of structuring the input text, deriving patterns within the structured data, and finally
evaluating and interpreting the output. Text classification is a data mining practice that can be
defined as the process of classifying documents into predefined categories based on the content
they contain. It is the computerized assignment of natural language texts to previously defined
categories. With the explosive growth in the amount of unstructured data on the Internet,
automatic text classification has become progressively more important: it helps categorize and
organize diverse documents into classes of interest with known properties, making it easier to
search and retrieve information. In this project, the text classification problem is examined in
a manufacturing setting.
In particular, this project considers classification of suppliers of manufacturing services
extracted from the company webpages and online sourcing portals. We have used a technique
that adopts the Support Vector Machine (SVM) method as the underlying mathematical model of
the text classifier. The necessary steps for training the data and forming the text corpus are
explained in detail in the following sections.
ARCHITECTURE DIAGRAM

METHODOLOGY

TEXT CLASSIFICATION

Text classification is the process of classifying documents into predefined categories based on
their content. There are two broad categories of text classification techniques:

 Single-label classification: In this technique a document is classified under only one class.
 Multi-label classification: In this technique documents may belong to multiple classes.

There are many machine learning and statistical classification techniques used for text
classification, such as Naive Bayes, linear and nonlinear Support Vector Machines (SVMs),
tree-based classification, Neural Networks and K-Nearest Neighbours (KNN).

Classifiers are used for the above classification. There are two broad categories of text
classifiers:

 Term Classifiers – Classification based on the raw content.
 Semantic Classifiers – Classification based on the meaningful content.

The advantages and disadvantages of the various text classification techniques are briefly
discussed here. KNN is considered to be a computationally efficient method. The main
advantage of the tree-based classification method is its simplicity, which enables non-expert
users to interpret the results; however, large training sets cannot be readily fitted into memory
due to the layered nature of decision trees. SVM usually produces accurate results, although it
is slower than other methods. Furthermore, the main advantage of the neural network classifier
is its ability to classify high-dimensional and noisy documents; however, it needs
high-performance CPUs as it is a memory-intensive technique. Naïve Bayes is a fairly simple
technique that can be easily implemented, and it is appropriate for large datasets.
SUPPORT VECTOR MACHINE

Support Vector Machine (SVM) is a supervised machine learning algorithm which can be
used for both classification and regression challenges. It is based on the concept of decision
planes that define decision boundaries. A decision plane is one that separates a set of
objects having different class memberships. A linear classifier separates a set of objects into
their respective classes with a line. More complex structures, however, are often needed for an
optimal separation, i.e. to correctly classify new objects (test cases) on the basis of the
available examples (training cases).

Classification tasks based on drawing separating lines to distinguish between objects of
different class memberships are known as hyperplane classifiers. Support Vector Machines
are particularly suited to handle tasks where the class boundary is more complex than a line.
The process of rearranging the objects is known as mapping (transformation). The mapped
objects are linearly separable and, thus, instead of constructing a complex curve, all we have
to do is find an optimal line that can separate the classified objects.
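
To make this concrete, the minimal sketch below shows the one-dimensional threshold form of the classifier used later in this project: each document is reduced to a single field value x and compared against the scalar support vector W. The sample values are illustrative, not taken from our datasets.

#include<iostream>
#include<string>
using namespace std;

// classify a field value x against the support vector w:
// x > w -> manufacturing, otherwise non-manufacturing
string svm_classify(int x, int w)
{
    return (x > w) ? "manufacturing" : "non-manufacturing";
}

int main()
{
    int w = 1;                         // assumed support vector value
    int samples[4] = {12, -9, 3, 0};   // assumed document field values
    for (int i = 0; i < 4; i++)
        cout << samples[i] << " -> " << svm_classify(samples[i], w) << endl;
    return 0;
}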
VARIOUS MODULES INVOLVED IN OUR METHODOLOGY

1. Preparation of Training Data

The training dataset is collected from ThomasNet portal narratives. To build the training
dataset, suppliers from both categories (i.e. manufacturing and non-manufacturing) are selected
and their online profiles are converted into textual documents. These documents collectively
form the training corpus. After going through a series of pre-processing steps, as described
below, the corpus documents are categorized successfully.

2. Text Pre-processing
This is the first step in the method of text classification. It involves pre-processing of the
unstructured data obtained from different sources. In this step, the data is cleaned of useless
characters such as punctuation marks (commas, full stops, semicolons, etc.). The string is
broken at each delimiter and the resulting piece is pushed into a vector using:

getline(stringstream, string, delimiter);
arr.push_back(string);
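
As a self-contained illustration of this step, the sketch below reads a file line by line, splits each line at commas, and collects the pieces in a vector ("input.txt" is a placeholder file name; the full version used in our project appears in the source code section):

#include<fstream>
#include<iostream>
#include<sstream>
#include<string>
#include<vector>
using namespace std;

int main()
{
    ifstream infile("input.txt");   // placeholder input file
    vector<string> arr;
    string line;
    while (getline(infile, line))
    {
        istringstream ss(line);
        string piece;
        while (getline(ss, piece, ','))   // break the line before each comma
            arr.push_back(piece);
    }
    for (size_t k = 0; k < arr.size(); k++)   // the pieces, commas removed
        cout << arr[k];
    cout << endl;
    return 0;
}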

3. Tokenization
Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or
other meaningful elements called tokens. The list of tokens becomes input for further
processing such as parsing or text mining.

while(infile>>word) { text_corpus[no_of_terms++]=word; }

For manufacturing services – text_corpus[]
For non-manufacturing services – text_corpus_non_manu[]
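
A minimal self-contained sketch of this step is shown below ("corpus.txt" is a placeholder file name). Stream extraction with >> already splits the input at whitespace, so each read yields one token:

#include<fstream>
#include<iostream>
#include<string>
#include<vector>
using namespace std;

int main()
{
    ifstream infile("corpus.txt");    // placeholder input file
    vector<string> text_corpus;
    string word;
    while (infile >> word)            // one whitespace-delimited token per read
        text_corpus.push_back(word);
    cout << text_corpus.size() << " tokens read" << endl;
    return 0;
}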
4. Filtering Stop Words
This is an information retrieval method which involves deleting unnecessary words that occur
too frequently. It helps in reducing the time and space complexity of the program. Stop words
can be filtered using simple string manipulation functions available in the string library. The
stop words that were removed in our implementation are:

is, this, and, a, all, about, of, with, use, are, the, these, done, your, &, in
5. Training Document
The method that we have used is "Feature Extraction based on Total Frequency". The
frequency of a word is the number of times the term is repeated in the text. The first step
involves storing all the necessary words in a vector array. In our implementation we have used
two vector arrays, as already mentioned above: 1) text_corpus_manu and 2)
text_corpus_non_manu. The weights are then calculated based on the frequency of the words
and the sgn function, and stored in manuterms_weigh and nonManuterms_weigh. The sgn
function is defined below:

sgn(f) = +f, if f ∈ manufacturing; −f, if f ∈ non-manufacturing

The training datasets are already labelled as manufacturing or non-manufacturing from
module 1. We now calculate the field value x for each training document using the weights of
the terms present in its text. The calculated values are stored in the array count[].
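
The toy example below illustrates the weighting convention: each manufacturing term receives +frequency and each non-manufacturing term receives −frequency, following the sgn function above. The corpora are illustrative ("aerospace" and "assembly" come from a commented-out list in our source code; "consulting" is an assumed non-manufacturing term):

#include<iostream>
#include<string>
#include<vector>
using namespace std;

int main()
{
    vector<string> manuterms    = {"aerospace","assembly","aerospace"};
    vector<string> nonManuTerms = {"consulting","consulting"};
    vector<int> manuterms_weigh(manuterms.size(), 0);
    vector<int> nonManuTerms_weigh(nonManuTerms.size(), 0);

    // weight of a manufacturing term = +(its total frequency in the corpus)
    for (size_t i = 0; i < manuterms.size(); i++)
        for (size_t j = 0; j < manuterms.size(); j++)
            if (manuterms[i] == manuterms[j]) manuterms_weigh[i]++;

    // weight of a non-manufacturing term = -(its total frequency in the corpus)
    for (size_t i = 0; i < nonManuTerms.size(); i++)
        for (size_t j = 0; j < nonManuTerms.size(); j++)
            if (nonManuTerms[i] == nonManuTerms[j]) nonManuTerms_weigh[i]--;

    cout << "aerospace weight: "  << manuterms_weigh[0]    << endl;   // +2
    cout << "consulting weight: " << nonManuTerms_weigh[0] << endl;   // -2
    return 0;
}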
6. Calculation of Support Vector

The support vector W is calculated as:

W = (min(x_manu) + max(x_non-manu)) / 2

where x_manu stands for the field values belonging to manufacturing datasets and
x_non-manu for the field values belonging to non-manufacturing datasets.
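
As an illustration (the numbers are assumed, not taken from our datasets): if the smallest field value among the manufacturing training documents is min(x_manu) = 12 and the largest field value among the non-manufacturing training documents is max(x_non-manu) = −9, then W = (12 + (−9)) / 2 = 1.5, and any document with field value x > 1.5 will be classified as manufacturing.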

7. Creation of Confusion Matrix

The confusion matrix obtained as a result of our implementation is shown below:

PREDICTED \ ACTUAL        MANUFACTURING    NON-MANUFACTURING
MANUFACTURING                   6                  1
NON-MANUFACTURING               2                  6
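
Using the standard definitions implemented in the source code below, the metrics reported in the abstract follow directly from this matrix. With TP = 6, FP = 1, FN = 2 and TN = 6:

precision = TP / (TP + FP) = 6/7 ≈ 0.86
recall = TP / (TP + FN) = 6/8 = 0.75
F-measure = 2 × precision × recall / (precision + recall) ≈ 0.80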
FLOWCHART
Explanation of the Various Steps in the Flowchart
STEP 1:

Read datasets obtained from sourcing portals or websites of various industrial suppliers.

STEP 2:

In this scenario, classification into manufacturing and non-manufacturing has been performed.

2.1 Perform data cleaning through the following steps:
2.2 Remove the delimiters (",", ".") present in the text files.
2.3 Convert the text file data to string format and store it in a vector array.

STEP 3:

To analyze the data we create two text-corpus, one consisting of manufacturing terms and the
other of non-manufacturing terms.

STEP 4:

The weights of the terms present in each corpus are computed based on the frequency of their
occurrence, and the sign assigned to the weights of each corpus is given by the sgn function.
The function is defined below:

sgn(f) = +f, if f ∈ manufacturing; −f, if f ∈ non-manufacturing


STEP 5:

For each training document a field value x is computed based on the weights of the words
obtained in the previous step. We then calculate the support vector, W, from the values
computed above, according to the function defined below:

W = (min(x_manu) + max(x_non-manu)) / 2

The support vector machine is constructed using the support vector W:

svm(x) = manufacturing, if x > W; non-manufacturing, if x ≤ W

STEP 6:
The testing data is then read and classified as manufacturing or non-manufacturing using the
support vector machine. A field value for each test document is calculated following STEP
4. Then, based on the obtained value and the support vector, the company's profile is
classified as manufacturing or non-manufacturing.
STEP 7:
Based on the classification and the obtained results a confusion matrix is created.
SOURCE CODE
//preprocessing: remove ',' and '.' delimiters from the raw profile text
#include<iostream>
#include<string>
#include<sstream>
#include<vector>
#include<fstream>
using namespace std;
int main()
{

const int no_of_training_sets=15;
string fileName[no_of_training_sets]=
{

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set\\1
.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set\\2
.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set\\3
.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set\\4
.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set\\5
.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set\\6
.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set\\7
.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set\\8
.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set\\9
.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set\\1
0.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set\\1
1.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set\\1
2.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set\\1
3.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set\\1
4.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set\\1
5.txt"
};
string fileName1[no_of_training_sets]=
{

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set1\\
pagal1.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set1\\
pagal2.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set1\\
pagal3.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set1\\
pagal4.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set1\\
pagal5.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set1\\
pagal6.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set1\\
pagal7.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set1\\
pagal8.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set1\\
pagal9.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set1\\
pagal10.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set1\\
pagal11.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set1\\
pagal12.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set1\\
pagal13.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set1\\
pagal14.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set1\\
pagal15.txt"
};

string fileName2[no_of_training_sets]=
{

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal1.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal2.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal3.txt",
"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal4.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal5.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal6.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal7.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal8.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal9.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal10.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal11.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal12.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal13.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal14.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal15.txt"
};

ifstream infile;
ofstream ofile;

for(int i=0;i<no_of_training_sets;i++)
{
vector<string> arr; // pieces of the file split at ','
vector<string> arr1; // pieces of the comma-free file split at '.'
infile.open(fileName[i].c_str());
ofile.open( fileName1[i].c_str());
while(infile)
{
string s;
if (!getline( infile, s )) break;
istringstream ss( s );
while ( ss )
{
string s;
if (!getline( ss, s, ',')) break;
arr.push_back( s );
}
}
// write the pieces out once, with the commas removed
for(size_t k=0;k<arr.size();k++)
ofile<<arr[k];
if (!infile.eof())
{
cerr << "Fooey!\n";
}
infile.close();
ofile.close();
//vector <vector <string> >data;
infile.open(fileName1[i].c_str());
ofile.open( fileName2[i].c_str());

while (infile)
{
string s;
if (!getline( infile, s )) break;
istringstream ss( s );
while (ss)
{
string s;
if (!getline( ss, s, '.')) break;
arr1.push_back( s );
}
}
// write the pieces out once, with the full stops removed
for(size_t k=0;k<arr1.size();k++)
ofile<<arr1[k];
if (!infile.eof())
{
cerr << "Fooey!\n";
}
infile.close();
ofile.close();
}
}

******************************END OF CODE 1******************************


//preprocessing: build the non-manufacturing text corpus and filter stop words
#include<iostream>
#include<string>
#include<sstream>
#include<vector>
#include<fstream>
using namespace std;
int main()
{

const int max=3000, no_stop_words=16;
const int no_of_training_sets=4;

string fileName2[no_of_training_sets]={

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal5.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal6.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal7.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal8.txt"
};
int no_of_terms=0;
ifstream infile;

vector<string> text_corpus(max);
string word;
for(int i=0;i<no_of_training_sets;i++)
{

infile.open(fileName2[i].c_str());
while(infile>>word)
{

text_corpus[no_of_terms++]=word;

}
infile.close();

//string text="this is a machine and a tractor";


}
string stopWords[no_stop_words] =
{"is","this","and","a","in","all","about","of","with","use","are","the",
"these","done","your","&"};

cout<<"\n words in the array:"<<endl;

for(int i=0;i<text_corpus.size();i++)
cout<<text_corpus[i]<<" ";

// erase every token that matches a stop word; step the index back
// after an erase so the element shifted into slot j is not skipped
for(int j=0;j<text_corpus.size();j++)
{
for(int k=0;k<no_stop_words;k++)
{
if(text_corpus[j]==stopWords[k])
{
text_corpus.erase(text_corpus.begin()+j);
j--;
break;
}
}
}

//infile.close();
ofstream oFile;
oFile.open("text_corpus_non_manufacturing.txt");
for(int i=0;i<text_corpus.size();i++)
oFile<<text_corpus[i]<<"\n";
oFile.close();
cout<<endl;
cout<<"after removing stop words:"<<endl;
for(int i=0;i<text_corpus.size();i++)
cout<<text_corpus[i]<<" ";
}

*******************************END OF CODE 2*****************************


//preprocessing: build the manufacturing text corpus and filter stop words
#include<iostream>
#include<string>
#include<sstream>
#include<vector>
#include<fstream>
using namespace std;
int main()
{

const int max=3000, no_stop_words=17;
const int no_of_training_sets=4;

string fileName2[no_of_training_sets]=

{"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\
\pagal1.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal2.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal3.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\
pagal4.txt"
};
ifstream infile;

int no_of_terms=0;
vector<string> text_corpus(max);
string word;
for(int i=0;i<no_of_training_sets;i++)
{

infile.open(fileName2[i].c_str());
while(infile>>word)
{

text_corpus[no_of_terms++]=word;

}
infile.close();
//string text="this is a machine and a tractor";
}
string stopWords[no_stop_words]=
{"is","this","and","a","in","all","about","of","with","use","are","the",
"these","done","your","&","for"};

cout<<"\n words in the array:"<<endl;


for(int i=0;i<text_corpus.size();i++)
cout<<text_corpus[i]<<" ";
// erase every token that matches a stop word; step the index back
// after an erase so the element shifted into slot j is not skipped
for(int j=0;j<text_corpus.size();j++)
{
for(int k=0;k<no_stop_words;k++)
{
if(text_corpus[j]==stopWords[k])
{
text_corpus.erase(text_corpus.begin()+j);
j--;
break;
}
}
}
ofstream oFile;
oFile.open("text_corpus.txt");
for(int i=0;i<text_corpus.size();i++)
{

oFile<<text_corpus[i]<<endl;
// cout<<text_corpus[i]<<" ";
}
oFile.close();
cout<<endl;
cout<<"after removing stop words:"<<endl;
for(int i=0;i<text_corpus.size();i++)
cout<<text_corpus[i]<<" ";
}

******************************END OF CODE 3******************************


//semantic-based classification - concept weighting

#include<iostream>
#include<fstream>
#include<string>
#include<climits> // INT_MAX/INT_MIN for initializing the support-vector search
using namespace std;
int main()
{
const int no_company=15,no_key=200;
const int no_manu=1000;

string profile[no_company][no_key];

int count[no_company]={0};
string data;

ifstream infile;

infile.open("C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\text_corpus.txt")
;
string manuterms[no_manu];
int manuterms_weigh[no_manu];

string nonManuTerms[no_manu];
int nonManuTerms_weigh[no_manu];

int company[no_company];
for(int k=0;k<no_company;k++)
company[k]=k+1;

/* string manuterms[no_manu]={"aerospace","aluminum","components","assembly",
"customers","engineering","brass","equipment","capabilities"}; */
int no_of_terms=0;
while(infile>>data)
{
manuterms[no_of_terms++]=data;
}
infile.close();

//calculating weights for each manuterm


for(int i=0;i<no_of_terms;i++)
{
manuterms_weigh[i]=0;
for(int j=0;j<no_of_terms;j++)
{
if(manuterms[i]==manuterms[j])
manuterms_weigh[i]++;
}

cout<<manuterms_weigh[i]<<endl;
/* if(i%4==0)
manuterms_weigh[i]=2;
else if(i%3==0)
manuterms_weigh[i]=3;
else
manuterms_weigh[i]=1; */
}
infile.open("C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\text_corpus
_non_manufacturing.txt");
int no_of_terms_non_manu=0;
// read each non-manufacturing term from the corpus file
while(infile>>data)
{
nonManuTerms[no_of_terms_non_manu++]=data;
}
infile.close();

//calculating weights for each non-manuterm


for(int i=0;i<no_of_terms_non_manu;i++)
{
nonManuTerms_weigh[i]=0;
for(int j=0;j<no_of_terms_non_manu;j++)
{
if(nonManuTerms[i]==nonManuTerms[j])
nonManuTerms_weigh[i]--;
}

cout<<nonManuTerms_weigh[i]<<endl;
/* if(i%4==0)
manuterms_weigh[i]=2;
else if(i%3==0)
manuterms_weigh[i]=3;
else
manuterms_weigh[i]=1; */
}

/* string str="C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\c";
string base=".txt";
*/
const int no_training_company=8;

string fileName1[no_training_company] =
{"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal1.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal2.tx
t",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal3.tx
t",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal4.tx
t",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal5.tx
t",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal6.tx
t",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal7.tx
t",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal8.tx
t"};

int j;
for(int i=0;i<no_training_company;i++)
{
j=0;
infile.open(fileName1[i].c_str());
while(infile>>data)
{
profile[i][j]=data;
j++;
}
infile.close();
}

for(int j=0;j<no_training_company;j++){
count[j]=0;

for(int k=0;k<no_key;k++){

for(int i=0;i<no_of_terms;i++){

if (profile[j][k]==nonManuTerms[i])
{
count[j]=count[j]+nonManuTerms_weigh[i];
// break;
}

if (profile[j][k]==manuterms[i])
{
count[j]=count[j]+manuterms_weigh[i];
//break;
}
}
}

cout<<"company"<<j+1<<" "<<count[j]<<endl;
}
int min=INT_MAX; // smallest positive (manufacturing) field value seen
int max_neg=INT_MIN; // largest negative (non-manufacturing) field value seen
for(int i=0;i<no_training_company;i++)
{
if(count[i]>0 && min>count[i])
min=count[i];
if(count[i]<0 && max_neg<count[i])
max_neg=count[i];
}

int w=(min+max_neg)/2;
cout<<w<<endl;

string fileName[no_company] =
{"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal1.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal2.t
xt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal3.t
xt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal4.t
xt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal5.t
xt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal6.t
xt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal7.t
xt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal8.t
xt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal9.t
xt",
"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal1
0.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal1
1.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal1
2.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal1
3.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal1
4.txt",

"C:\\Users\\chiru\\Desktop\\Academics\\Sem6\\OR\\coding\\training_set2\\pagal1
5.txt"};

// for(int i=0;i<no_company;i++)
// string filename[i]

for(int i=0;i<no_company;i++)
{
j=0;
infile.open(fileName[i].c_str());
while(infile>>data)
{
profile[i][j]=data;
j++;
}

infile.close();
}

for(int j=0;j<no_company;j++){
count[j]=0;
for(int k=0;k<no_key;k++){

for(int i=0;i<no_of_terms;i++){

if (profile[j][k]==nonManuTerms[i])
{
count[j]=count[j]+nonManuTerms_weigh[i];
// break;
}
if (profile[j][k]==manuterms[i])
{
count[j]=count[j]+manuterms_weigh[i];
//break;
}
}
}

cout<<"company"<<j+1<<" "<<count[j]<<endl;
}

int actual_manu=8;
int actual_non_manu=7;

int actual_manufacturing[actual_manu]={1,2,3,4,9,11,13,14};
int predicted_manufacturing[15]={};
int predicted=0;
cout<<"Manufacturing companies:"<<endl;
for(int i=0;i<no_company;i++)
{
if(count[i]>w)
{
cout<<company[i]<<endl;
predicted_manufacturing[predicted]=company[i];
predicted++;
}

}
int tp=0;
for(int i=0;i<actual_manu;i++)
{
for(int j=0;j<predicted;j++)
{
if(actual_manufacturing[i]==predicted_manufacturing[j])
{
tp++;
}
}
}

int fp = predicted-tp;

// int c=actual_non_manu-no_company+actual_manu;
int fn=actual_manu-tp; //b
int tn=actual_non_manu-fp; //d

float precision = (float)tp/(tp+fp);


float recall=(float)tp/(tp+fn);
float f_measure=(float)(2*((precision*recall)/(precision+recall)));

cout<<endl;
cout<<"Predicted \t \t Actual"<<endl;
cout<<"\t \t"<<"manufacturing"<<"\t"<<"non-manufacturing"<<endl;
cout<<"manufacturing"<<"\t \t"<<tp<<"\t \t"<<fp<<endl;
cout<<"non-manufacturing"<<"\t"<<fn<<"\t \t"<<tn<<endl;

cout<<endl;

cout<<"precision:"<<precision<<endl;
cout<<"recall:"<<recall<<endl;
cout<<"f_measure:"<<f_measure<<endl;

}

*****************************END OF SOURCE CODE*****************************
OUTPUT SCREENSHOTS
CONCLUSION

In this era, with the unprecedented amount of data present on the World Wide Web, classifying
data has become a necessity. Classified data makes it easier to analyze, understand and
extract useful information. Classification requires structured data, whereas the data on the
Web is unstructured; structured data can be obtained only after processing the unstructured
data. We have implemented an efficient method, based on the Support Vector Machine (SVM)
technique, for classifying raw data into two classes, namely manufacturing and
non-manufacturing suppliers. We have implemented these steps in C++. We managed to
classify the data into the required classes and obtained high accuracy and precision results.

