
Data Mining Introduction

By Daniel Calbimonte, 2012/11/12

Introduction

In Classical Antiquity, an oracle was a person considered to be a source of wise counsel, with prophetic predictions or precognition of the future inspired by the gods. Oracles had the gift of predicting the future and advising people with their wisdom. Today we do not have those oracles to predict the future. It would be nice to have one to tell us whether our business is going to make a profit, how much we are going to earn in the next two years, and the answers to other questions about the future.
Since we do not have oracles (at least not good ones), data mining was created to help us analyze our information and predict the future.

Data Mining
Data Mining is a process for discovering patterns in a large data set. It works like an expert system that uses historical experience (stored in relational databases or cubes) to predict the future. Let me explain what you can do with data mining using an example: imagine that you own a company named AdventureWorks that manufactures and sells bikes. You want to predict whether a customer will buy a bike based on the customer information. How can you accomplish this mission?
The answer is Data Mining. This tool finds the patterns and describes the characteristics of the customers with a higher (or lower) probability of buying a bike. Microsoft includes a nice tool for this in SQL Server Analysis Services. You do not need to create a cube or an Analysis Services project; you can work with relational databases directly.

Example
In this example we are going to work with the AdventureWorksDW database. If you do not have it installed, you can download it from the http://msftdbprodsamples.codeplex.com/ site.
Once AdventureWorksDW is installed, run a SELECT to verify the information in the vTargetMail view:

SELECT *
FROM [AdventureWorksDW].[dbo].[vTargetMail]
If you review the results you will find a lot of information about the customers like:

CustomerKey
Title
FirstName
LastName
BirthDate
MaritalStatus
Suffix
Gender
EmailAddress
YearlyIncome
TotalChildren
NumberChildrenAtHome
EnglishEducation
SpanishEducation
FrenchEducation
EnglishOccupation
SpanishOccupation
FrenchOccupation
HouseOwnerFlag
NumberCarsOwned
AddressLine1
AddressLine2
Phone
DateFirstPurchase
CommuteDistance
Region
Age
BikeBuyer
All this information is important, but it is a lot! How can I find patterns? For example, whether a person is married (the MaritalStatus column) may affect the decision to buy a bike. The age is important as well: depending on their age, people may or may not want to buy a bike. How do you know which column is important? Which characteristic has more impact on the decision to buy a bike?
As you may notice, it is pretty hard to find which attributes affect the decision, because there are 32 columns in the table. There are too many combinations, so it is hard to find patterns. If you create a cube with all the information, it will be easier to find patterns, but even with cubes we may miss some because of the many possible combinations.
That is why we use Data Mining: to organize all the columns, analyze them and prioritize them.
Notice that there is a special column named BikeBuyer (the last column). This column holds the value 1 if the customer bought a bike and 0 if he did not. This is the value that we want to predict. We want to know whether a customer will buy a bike based on our experience (in this case, the experience is the vTargetMail view).
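If you are curious how the historical cases split between bike buyers and non-buyers, a quick T-SQL check against the same view clarifies it (a sketch; it only assumes the standard vTargetMail columns listed above):

-- Count how many historical customers bought a bike (1) versus did not (0)
SELECT BikeBuyer, COUNT(*) AS Customers
FROM [dbo].[vTargetMail]
GROUP BY BikeBuyer;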

Getting started
In this example, I will show how to create a Data Mining project using the view vTargetMail.
There are four sections here:
1. Create a Data Source
2. Create a Data Source View
3. Create a Data Mining project
4. Predict information using the Mining Model

Create a Data Source
First, we are going to select the SQL Server instance and the connection properties. This is the Data Source.
1. To start a Data Mining project we will use SQL Server Business Intelligence Development Studio, included with the SQL Server installation.
2. Go to File > New > Project and select the Analysis Services Project.
3. In the Solution Explorer, right click Data Sources and select New Data Source.
4. In the Data Source Wizard, press Next.
5. We are going to create a new Data Connection. Press New.
6. In the Connection Manager, specify the SQL Server name and the database. In this scenario we are going to use the AdventureWorksDW database.
7. In the Data Source Wizard, press Next.
8. Press Next and then Finish.

You have created a Data Source pointing to the AdventureWorksDW database.

Create a Data Source View

Now we are going to add the view vTargetMail. In order to add it, we are going to use a Data Source View. In short, a Data Source View lets us add tables and views to the project.
1. In the Solution Explorer, right click Data Source Views and select New Data Source View.
2. In the Welcome to the Data Source View Wizard page, press Next.
3. In the Select a Data Source window, select the Data Source created earlier.
4. In the Select Tables and Views window, select vTargetMail and press the > button.
5. In the Completing the Wizard window, press Finish.

We just created a Data Source View with the view that will give experience to our Data Mining model. vTargetMail is a view that contains historical data about the customers. Using that experience, our mining model will predict the future.

Data Mining Model

Now we are going to create the Mining Model using the Data Source and Data Source View created before.
1. In the Solution Explorer, right click the Mining Structures folder and select New Mining Structure.
2. In the Welcome to the Data Mining Wizard page, press Next.
3. In the Select the Definition Method page, select the option From existing relational database or data warehouse and press Next. As you can see, we can use relational databases, data warehouses or cubes.
4. In the Create the Data Mining Structure page, select Create mining structure with a mining model, choose Microsoft Decision Trees and press Next. I am going to explain the different mining techniques in another article. For the moment, let's say that we are using the Decision Trees algorithm for this example.
5. In the Select Data Source View page, select the Data Source View created earlier and press Next.
6. In the Specify Table Types page, select vTargetMail.
7. In the Specify the Training Data page, check the Predict checkbox for the Bike Buyer row and press the Suggest button. With this option we are selecting which information we want to predict; in this scenario we want to predict whether the person is a bike buyer or not.
8. In the Input column, check all the column names with a score different from 0. What we are doing is choosing which columns are relevant to the decision to buy a bike.
9. In the left-hand column, also select the first name, last name and email (these will be used to drill through the information) and press Next.
10. In the Specify Columns' Content and Data Type page, press Detect and then Next. I am going to explain content types and data types in future articles; for the moment, let's say that we are detecting the data types.
11. In the Create Testing Set page, set the maximum number of cases in the testing data set to 100 and press Next. This page reserves data for testing the model. I will explain more details in later articles.
12. In the Completing the Wizard page, write the Mining Structure name and the Mining Model name, check Allow drill through and press Finish.
13. Now click the Mining Model Viewer tab. You will receive a message asking to deploy the project; press Yes.
14. You will then receive a message asking to process the model. Press Yes.
15. In the Process Mining Model window, press Process.
16. In the Process Progress window, once it has finished successfully, press Close.
17. Click the Mining Model Viewer tab again. The following decision tree should be displayed.


We just created a Data Mining project using Decision Trees, and it is ready to test. Our final task is to use it: I will create some queries to predict whether a customer will buy a bike using the mining model.
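If you prefer scripts to the wizard, the same structure and decision tree model can also be created with DMX against the deployed Analysis Services database. The following is only a sketch: the column list is trimmed to a few of the attributes discussed above, and the names DTStructure and DTModel match the ones used in the prediction steps below.

-- Mining structure with a 100-case holdout, roughly matching the wizard steps above
CREATE MINING STRUCTURE [DTStructure] (
    [Customer Key] LONG KEY,
    [Age] LONG CONTINUOUS,
    [Commute Distance] TEXT DISCRETE,
    [English Education] TEXT DISCRETE,
    [Number Cars Owned] LONG DISCRETE,
    [Bike Buyer] LONG DISCRETE
) WITH HOLDOUT (100 CASES)

-- Decision tree model on that structure, with Bike Buyer as the predictable column
ALTER MINING STRUCTURE [DTStructure]
ADD MINING MODEL [DTModel] (
    [Customer Key],
    [Age],
    [Commute Distance],
    [English Education],
    [Number Cars Owned],
    [Bike Buyer] PREDICT
) USING Microsoft_Decision_Trees
WITH DRILLTHROUGH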

Predict the future


Now that we have our Data Mining model, let's ask our oracle whether a customer with specific characteristics will buy our bikes. We will create two queries.
The first query will ask our oracle whether a 45 year old customer with a commute distance of 5-10 miles and a High School education will buy a bike. The second query will ask our oracle whether a 65 year old customer with a commute distance of 1-2 miles and a missing education will buy a bike.
1. First of all, we need to move to the Mining Model Prediction tab. Click there.
2. In the Mining Model window, click the Select Model button.
3. In the Select Mining Model window, expand Data Mining > DTStructure, select the DTModel and click OK.
4. In the Select Input Table(s) window, click Select Case Table.
5. In the Select Table window, select vTargetMail and press OK.
6. Right click in the Select Input Table(s) window and select Singleton Query.
7. In the Singleton Query Input, specify the following information: Age: 45, Commute Distance: 5-10 Miles, English Education: High School, English Occupation: Professional, Marital Status: S, Number of Cars Owned: 5, Number of Children at Home: 3. In this step we are specifying the customer characteristics.
8. In the Source combo box, select the DTModel mining model.
9. In the second row of the Source column, click the combo box and select Prediction Function.
10. In the second row of the Field column, select PredictHistogram.
11. In the Criteria/Argument column write: [DTModel].[Bike Buyer]. What we are doing is asking for the probability that this customer will buy a bike using the PredictHistogram function (a DMX sketch of the query built by these steps appears after step 15).
12. Now, click the switch icon and select Result to verify the results of the query.
13. If you look at the results, you will find that the probability of buying a bike is 0.6213, which means 62%. So now we have our oracle ready to predict the future!
14. Finally, we are going to ask whether another customer with the following characteristics will buy a bike: Age: 65, Commute Distance: 1-2 Miles, English Education: Missing, English Occupation: Clerical, Marital Status: S, Number of Cars Owned: 1, Number of Children at Home: 0.
15. Once this is done, click the switch icon and select Result.

The probability of buying for a customer with these characteristics is 57%.
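Behind the scenes, the designer builds a DMX singleton query. A sketch of the first query (assuming the model is named DTModel and the columns kept the friendly names shown above):

-- Probability histogram of Bike Buyer for one hypothetical 45 year old customer
SELECT
    PredictHistogram([DTModel].[Bike Buyer])
FROM
    [DTModel]
NATURAL PREDICTION JOIN
(SELECT
    45 AS [Age],
    '5-10 Miles' AS [Commute Distance],
    'High School' AS [English Education],
    'Professional' AS [English Occupation],
    'S' AS [Marital Status],
    5 AS [Number Cars Owned],
    3 AS [Number Children At Home]) AS t

PredictHistogram returns one row per possible Bike Buyer value (1 and 0) with its support and probability, which is where the 0.6213 figure comes from.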

Summary
In this article we described how to predict the future using Data Mining. There are many different scenarios where Data Mining can be applied. In this example we used the Decision Trees algorithm to predict the future.
We used a view to feed our mining model and then we asked the model whether two customers would buy a bike. The first one has a probability of 62% and the second one 57%.
Now that you have your mining model ready, you can ask it about the future.
Good luck.

References
http://nocreceenlosarboles.blogspot.com/2011/11/al-oraculo-de-delfos-no-le-dejan-votar.html
http://msdn.microsoft.com/en-us/library/ms167167(v=sql.105).aspx

Data Mining Introduction Part 2


By Daniel Calbimonte, 2012/12/31
In my first article about Data Mining, we talked about Data Mining using a classic example based on AdventureWorks. In this article I am going to complement the first one and talk about decision trees. Let me summarize in a few words how the Data Mining model worked.
Data mining is an expert system: it learns from experience. The experience can be obtained from a table, a view or a cube. In our example the data mining model learned from the view named dbo.vTargetMail. That view contains information about the customers.

People usually think that they need to use cubes to work with Data Mining. We worked with Business Intelligence Development Studio or SQL Server Data Tools (in SQL Server 2012), but we did not use cubes, dimensions or hierarchies (we could, but it is not mandatory). We simply used a view.
If we run the following query, we will notice that we have 18484 rows in the view:

SELECT COUNT(1) FROM dbo.vTargetMail


Something important to point out about Data Mining is that we need a lot of data to predict the future. If we have only a few rows in the view, the mining model will be inaccurate. The more data you have, the more accurate the model will be.
Another problem in data mining is choosing the input data. How can we determine which information is important for the mining model? We can guess a little bit.
Let's return to the AdventureWorks company and think about the customers who may want to buy bikes. The salary may be important: if you do not have money to buy a bike, you will not buy one. The number of cars is important as well: if you have 5 cars, you may not want a bike because you prefer to drive your cars.
There is some data that may be useful as input to predict whether the customer is going to buy a bike or not. How can we determine which columns of data are important and which ones are not? To start, we can just think about it. Is the address or the email of the customer important for the model?
It may not be, especially the email. Does someone with Hotmail have less chance of buying a bike than a person with Gmail? I guess not. There is some input data that we could remove from the model intuitively. However, the Data Mining tool lets you determine which columns actually affect the decision to buy a new bike.

The Dependency Network


In the Data Mining model, go to the Mining Model Viewer tab, and then to the Dependency Network tab. The Bike Buyer oval is the analysis that we are doing: we want to analyze whether a person is a possible buyer. Number of Children, Yearly Income, Region and the other ovals are the columns of the view. With the Dependency Network, we can analyze which columns influence the decision to buy a bike.

If you adjust the links slider, you can see which column has the most influence on the decision to buy a bike.

In this example, the number of cars owned is the most important factor in whether someone buys a bike.

The second factor is the Yearly Income. This information is very important for business analysts and the marketing team.
In my first article we used Decision Trees. Decision Trees is one of the different algorithms used by Microsoft to predict the future; in this case, to predict whether a customer is going to buy a bike or not. In the viewer combo box we can select the Microsoft Generic Content Tree Viewer. This option lets you see some technical details about the algorithm.

For more information about nodes and cardinality, visit this link: http://msdn.microsoft.com/en-us/library/cc645772(v=110).aspx
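The same technical details shown by the Microsoft Generic Content Tree Viewer can also be queried with DMX. A small sketch (use the name of your own decision tree model, for example DTModel from part 1 or TM Decision Tree in the sample project; NODE_SUPPORT is the cardinality, i.e. the number of cases in each node):

-- List every node of the decision tree with its caption, type and number of supporting cases
SELECT NODE_CAPTION, NODE_TYPE, NODE_SUPPORT
FROM [TM Decision Tree].CONTENT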

About Decision Trees


Decision Trees is the first basic algorithm that we used in this series. This data mining algorithm divides the population used to predict whether the customers want to buy a bike into different nodes. The nodes have branches and child nodes.

The first node contains all the cases. If you click on the node, there is a Mining Legend at the right with all the cases used. The value 0 is the number of customers who did not buy a bike; the value 1 is the group of customers who bought bikes. There are colors to graphically show the percentage of customers in each category.

The second level of nodes divides the cases by the number of cars owned.

You can see that the colors of the nodes are different. The darker nodes contain more cases: if you click on Number Cars Owned = 2, you will notice that the number of cases is 6457; if you click on Number Cars Owned = 3, the number of cases is 1645.

The other nodes are related to Yearly Income and Age. There is a lot of information that can be analyzed here.

I am going to talk about the Mining Accuracy Chart in future articles. To end this article, we are going to take a list of prospective buyers and predict whether they will buy a bike.
For this example, we are going to use the dbo.ProspectiveBuyer table that is included in the AdventureWorksDW database.
Let's move to the Mining Model Prediction tab.

Let's select a model. In this case, select the decision tree.

In the Select Input Table(s) window, click Select Case Table.

In the Select Table window, select the ProspectiveBuyer table. This table contains all the prospective buyers; we are going to determine their probability of buying a bike.

In the Source column select the TM Decision Tree model, and also select the following fields from the ProspectiveBuyer table: ProspectiveBuyerKey, FirstName, LastName and EmailAddress. Finally add the Prediction Function source and choose PredictProbability.
What we are doing is showing the first name, last name, email and the probability of buying a bike. PredictProbability returns a value from 0 to 1: the closer the value is to 1, the more likely the customer is to buy a bike.
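The designer produces a DMX prediction join similar to the sketch below. Treat it only as a sketch: the data source name inside OPENQUERY and the exact column mappings depend on your project, and any input column you do not map is simply treated as missing.

-- Probability of buying a bike (state = 1) for every prospective buyer
SELECT
    t.[FirstName],
    t.[LastName],
    t.[EmailAddress],
    PredictProbability([TM Decision Tree].[Bike Buyer], 1) AS [Buy Probability]
FROM
    [TM Decision Tree]
PREDICTION JOIN
    OPENQUERY([Adventure Works DW], 'SELECT * FROM dbo.ProspectiveBuyer') AS t
ON
    [TM Decision Tree].[Marital Status] = t.[MaritalStatus] AND
    [TM Decision Tree].[Gender] = t.[Gender] AND
    [TM Decision Tree].[Total Children] = t.[TotalChildren] AND
    [TM Decision Tree].[Number Children At Home] = t.[NumberChildrenAtHome] AND
    [TM Decision Tree].[House Owner Flag] = t.[HouseOwnerFlag] AND
    [TM Decision Tree].[Number Cars Owned] = t.[NumberCarsOwned]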

To verify the results, select the Result option.

Now you have the information of the prospective buyers and their probability of buying bikes. You predicted the future again!

For example, Adam Alexander has a probability of buying of 65% while Adrienne Alonso has a probability of 50%. We should focus on the customers with higher probabilities and find out why they prefer to buy bikes. The main reason is the number of cars owned, followed by the yearly income.

Conclusion
In this article we talked a little more about Data Mining and explained how the decision tree works. Finally, we predicted the future with a list of possible customers and found out which of them have a higher probability of buying bikes.

Data Mining Introduction Part 3: The Cluster Algorithm


By Daniel Calbimonte, 2013/03/12
This is part 3 of a series on data mining. If you want to read parts 1 and 2, you can find them here:

Data Mining Introduction part 1
Data Mining Introduction part 2

In the last chapter I talked about the Decision Trees algorithm. The decision tree was the first algorithm that we used to explain the behavior of the customers using data mining.
We found and predicted some results using that algorithm, but sometimes other algorithms are better predictors of the future.

In this new article I will introduce a new algorithm.

The Microsoft Cluster Algorithm


The Microsoft Cluster algorithm is a technique to group the objects under study according to different patterns. It is different from decision trees, because the decision tree uses branches to classify the information, while Microsoft Clustering is a segmentation technique that divides the customers into different groups. These segments are not always intuitive for humans.
For example, a data mining algorithm once detected that young men usually buy beer and diapers together at the supermarket. The cluster algorithm groups the customers according to different characteristics like age, salary, number of cars, etc.

The figure displayed above shows a cluster: a segment of 7 customers grouped together.

In this tutorial we are going to create a cluster model that creates different groups of people according to their characteristics. The image below is a sample of how it groups them.

You may ask yourself: when should I use the decision tree and when should I use the cluster algorithm? There is a nice accuracy chart in SQL Server Analysis Services (SSAS) that measures exactly that. I will explain that chart in another article.
Now, let's start working with the cluster algorithm and verify how it works.

Requirements
For this example, I am using the AdventureWorks Multidimensional project and the AdventureWorksDW database. You can download the project and the database here:
http://msftdbprodsamples.codeplex.com/releases/view/55330

Getting started
Open the AdventureWorks Multidimensional project. If it is not processed, process it.

Open the Targeted Mailing.dmm structure.

In this sample we are going to work with the Targeted Mailing.dmm structure. Double click on it, then click the Mining Models tab and you will get the image below.

The Mining Models tab contains all the models used to simulate the behavior of the customers. In this example we already have Decision Trees (explained in part 2 of this series). The decision tree and the cluster receive the same input of information. This input is a view named dbo.vTargetMail, which contains customer information like the email, name, age, salary and so on.
In Data Mining Part 1, in the Data Mining Model section, you will find the steps to create a data mining structure. That structure can be used by other algorithms. In other words, once you have a structure created as an input for one model, you do not need to create it again for other algorithms.
In this sample, the cluster model is already created. If it were not, you would only need to click the Create a related mining model icon.

You only need to specify a name and choose the algorithm name; in this case, choose Microsoft Clustering. Note that you do not need to specify input and prediction values, because that was already done when you created your model in parts 1 and 2 of the series.

You will receive a message asking to reprocess the model. Press Yes.

In the next window, press Run to process the model.

Once it is finished, the Process Progress window will show the start time and the duration of the process.

Go to the Mining Model Viewer tab and select the MyClusterModel just created to visualize the cluster algorithm. As you can see, it is an algorithm that creates different groups of customers. The groups are named Cluster 1, Cluster 2 and so on. The clusters are groups of people with similar characteristics.
For example, Cluster 1 contains people from Europe with a salary between $10,000 and $35,000, while Cluster 2 contains people from North America with a salary between $40,000 and $170,000. In the picture below you will find the different clusters created:

There are also different colors for the nodes. The darker colors are used for higher density clusters. In this case, the colors correspond to the population; it is the shading variable. You can change the shading variable and the colors will change according to the value selected.
If you click the Cluster Profiles tab, you will find the different variables and the population of each cluster. The total population is 18484. Cluster 1 is the most populated cluster and Cluster 2 is the second one. In other words, the cluster numbers are ordered according to the population.

The variables show the customers' characteristics, like the age and salary, and you can see the population with different colors for each characteristic. You can find interesting information here.

You can also click the Cluster Characteristics tab and find the characteristics per cluster. In this example we are going to select Cluster 1.

You will find here that the main characteristic of Cluster 1 is that the people are from Europe. That means that an important segment of the people who buy bikes are European. The second characteristic is the Yearly Income; the salary is really important as well.
Note and compare the information from the decision tree (in part 2) and the cluster. The information provided is really different. We cannot say that the information from the decision tree is better than the cluster model; we can say that the information is complementary.
We also have the Cluster Discrimination tab. With this information you can visually find the differences between two clusters. For example, select Cluster 1 and Cluster 2.

As you can see, the Yearly Income is a big difference between these two clusters: Cluster 2 earns more money than Cluster 1. The same goes for the region: Cluster 2 does not necessarily live in Europe like Cluster 1; they are mainly from North America and earn more money.
As you can see, you can target the different clusters with different promotions and specific strategies.
Finally, let's predict the probability of a customer buying a bike. The prediction section works the same as for decision trees; we can say that Data Mining can be used like a black box to predict probabilities. In this example we are going to find a customer's probability of buying a bike.
Click the Mining Model Prediction tab. In the Mining Model window, press the Select Model button.

In the Select Mining Model window, select the model created at the beginning of this article (MyClusterModel).

I am not going to explain in detail the steps to build a Singleton Query, because they were already explained in the "Predict the future" section of part 1.
In part 1 we used the decision tree algorithm to predict the behavior of one customer with specific characteristics.
In this sample we are going to repeat the same steps, but using the new cluster model. In step 7 we are going to use different characteristics:

What we are doing here is asking the cluster algorithm for the probability that someone with a commute distance of 5-10 miles, a High School education, female, a house owner, single, with 3 cars, one child, a professional occupation and from North America will buy a bike. We are using the cluster model named MyClusterModel and the PredictHistogram function, which returns a probability from 0 to 1.
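As before, these clicks translate into a DMX singleton query. A sketch against the cluster model (the column names are assumed to match the structure used in the earlier parts):

-- Which cluster does this hypothetical customer fall into, and will she buy a bike?
SELECT
    Cluster() AS [Assigned Cluster],
    ClusterProbability() AS [Cluster Probability],
    PredictHistogram([MyClusterModel].[Bike Buyer])
FROM
    [MyClusterModel]
NATURAL PREDICTION JOIN
(SELECT
    '5-10 Miles' AS [Commute Distance],
    'High School' AS [English Education],
    'Professional' AS [English Occupation],
    'F' AS [Gender],
    '1' AS [House Owner Flag],
    'S' AS [Marital Status],
    3 AS [Number Cars Owned],
    1 AS [Number Children At Home],
    'North America' AS [Region]) AS t

Cluster() and ClusterProbability() are specific to clustering models; PredictHistogram works exactly as it did for the decision tree.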
We will finally look at the results of the query:

In the results we will see that the probability of buying a bike is 0.544 (54%) and the probability that the customer will not buy one is 0.456 (46%).

Conclusion
In this chapter we used a new algorithm named Microsoft Clustering. The way it organizes the information is different, but the input used is the same as for the decision tree.
The way to query the mining model for predictions is the same no matter which algorithm is used. The results will differ according to the accuracy of each algorithm. We will talk about accuracy in later chapters.

References
http://msdn.microsoft.com/en-us/library/ms174879.aspx

Images
http://userwww.sfsu.edu/art511_h/acmaster/Project1/project1.html
http://www.iglesiadedios.info/maranatha/2012/julio/eligiendo_c01.html

Data Mining Introduction Part 4: The Naïve Bayes Algorithm


By Daniel Calbimonte, 2013/04/15
This is the fourth article about data mining. If you want to find the other parts, they are:

Data Mining Introduction part 1 - Getting started with the basics
Data Mining Introduction part 2 - Decision Trees
Data Mining Introduction part 3 - The Cluster Algorithm

In the last chapter we created a Data Mining model using the cluster algorithm. In this new article I will introduce a new algorithm: the Naïve Bayes algorithm.

The Microsoft Naïve Bayes Algorithm

The Microsoft Naïve Bayes algorithm is based on Bayes' theorem. The theorem was formulated by Thomas Bayes, an English Presbyterian minister (and a mathematician).

In fact, Bayes' theorem was presented to the Royal Society after Thomas' death (at least he is famous now and we are talking about him on the most visited SQL Server site in the world!).
Microsoft created an algorithm based on this theorem. We call this algorithm naïve because it does not consider dependencies between the attributes. I am not going to walk through the Bayesian formulas in detail, because Microsoft has an easy to use interface that does not require knowledge of the mathematics; that part is transparent to the user.

In a few words, what the algorithm does is show the probability that each attribute leads to a certain outcome.
In the AdventureWorks example used in this tutorial, we have a list of prospective customers who might buy a bike. With the algorithm we show the percentage of people who will buy a bike according to isolated characteristics.

The algorithm classifies the customers by age and shows the probability of buying a bike for each age range. It does the same process for each attribute. It is naïve because it does not consider the dependencies between attributes.
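For the curious, the idea can be written in one line. For a customer with attribute values x1, x2, ..., xn, the algorithm estimates

P(BikeBuyer = 1 | x1, ..., xn) proportional to P(BikeBuyer = 1) * P(x1 | BikeBuyer = 1) * ... * P(xn | BikeBuyer = 1)

in other words, it multiplies the probability of each attribute value among known buyers as if the attributes were independent of each other. That independence assumption is exactly what makes the algorithm "naïve".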
As you may notice from the information just provided, it is a simple algorithm (that is why we call it naïve) and it requires fewer resources to compute the information. It is a great algorithm for quickly starting to research the relationships between the attributes and the result.
For example, the address attribute may or may not affect the probability of buying a bike. In fact, there could be a relationship between the address and the probability of buying a bike, because in some neighborhoods people can easily ride a bike and in others they cannot because of the traffic. Unfortunately, it is really hard to group all the addresses, so it is a good question whether we need to include that attribute in the model.
If you are not sure which attributes are relevant, you can test them using this algorithm.

Requirements
We are still using the AdventureWorks databases and projects from Data Mining Part 3.
In that chapter we already created a model to predict the probability of a customer buying a bike using decision trees and clusters. Those algorithms used a view as their input, and we already created that input to feed them. We will use the same information to create the Naïve Bayes model.

Getting started
1. Open the AdventureWorks project used in the earlier articles with SQL Server Data Tools (SSDT).
2. In the Solution Explorer, go to the Mining Structures folder.
3. In the Mining Structures folder, double click Targeted Mailing. It is the sample used to verify which customers are prospective buyers so that we can email them.
4. Click the Mining Models tab and click the Create a related mining model icon.
5. In the New Mining Model window, specify the name of the model (you can use any name) and select the Microsoft Naïve Bayes algorithm.
6. This is very important: Naïve Bayes does not support continuous data. In this sample, it does not support the Yearly Income column. That is why you will receive the following message:
7. Discrete data means that the number of possible values is finite (for example, the gender can be male or female). Continuous data has an infinite number of possible values (for example weights, sizes, amounts). Naïve Bayes supports only discrete data.
8. It cannot work with data like the salary, the taxes you pay, extra income and other continuous values. The algorithm classifies the attributes into groups according to their values; if an attribute has an infinite number of possible values, it cannot classify it, so it excludes this type of attribute. In the Mining Structure tab you can optionally ignore other inputs in the model; by default, Yearly Income is ignored if you press Yes to the question about ignoring the column. (A DMX sketch of an all-discrete Naïve Bayes model appears after these steps.)

9. Now click the Mining Model Viewer tab and select the Naïve Bayes model just created.
10. You will receive a message asking to process the new model. This new model will be loaded with the data when you process it. Click Yes in the message.
11. In the Process Mining Model window, press the Run button. This button starts processing the information from the view and produces results using the algorithm.
12. At the end of the process you will receive a message with the start time and the duration of the process.
13. The first tab is the Dependency Network. This is similar to the decision tree viewer: you will find the main factors in buying a bike. At the beginning you will see that all the attributes have an influence on whether a customer buys a bike.
14. However, if you move the dependency slider, you will find that the main factor in buying a bike is the number of cars. That means that depending on how many cars you own, the probability of buying a bike changes a lot.
15. The second factor is the age. It means that the age is the second most important factor in buying a bike or not.
16. The third factor is the number of children. Maybe if you have many children you will want to buy more bikes (or maybe none, because it is too dangerous).
17. The next tab is the Attribute Profiles tab. This is nice graphical information that helps you classify the attribute information. For example, most of the people who buy bikes are 45-52 years old, and their commute distance is 0 to 1 mile. You can analyze the whole population, the people who buy bikes (1) and the people who do not buy (0).
18. There is another tab named Attribute Characteristics. Here you can see the probability of buying a bike according to specific attributes. For example, people will buy a bike if they do not have children at home and they are male and single.
19. The Attribute Discrimination tab shows the attributes with their discrimination scores: the main attribute-related reasons to buy or not to buy a bike. For example, people without a car may want to buy a bike, but people with 2 cars will not want to buy one. People with 5 children won't buy a bike (because controlling 5 kids in the street would drive them crazy), while people with 1 child will want to buy one.

20. Finally, we will ask our model for the probability that a prospective customer who is 40-45 years old, with a commute distance of 5-10 miles, a high school education, female, single, a house owner, with 3 cars and 3 children, will buy a bike. For this purpose we will go to the Mining Model Prediction tab.
21. Make sure that the Naïve Bayes model is selected in the Mining Model window and press the Select Case Table button.
22. In the Select Table window, select the table that you will use to feed your Naïve Bayes model. In this case, the dbo.ProspectiveBuyer table contains a list of prospective customers to send mails to.
23. In the Select Input Table(s) window, select the Singleton option. This option lets you create a single query for a single customer with specific characteristics.
24. Now write the characteristics specified in step 20 (age, number of children, education and so on).
25. Also add the prediction function named PredictHistogram. This function gives you a histogram with the probability of buying a bike. Use the Naïve Bayes model as the source and [Bike Buyer] as the criteria/argument.
26. Finally, go to the results.

You will see that the probability of buying a bike for the customer with the information provided is 0.5915 (59%).
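To make the discrete-only restriction concrete, here is a DMX sketch of adding a Naïve Bayes model to the existing Targeted Mailing structure. It is only a sketch: the column list is shortened, and the continuous Yearly Income column is simply left out, which is what the wizard offers to do for you.

-- Naive Bayes model: only discrete (or discretized) columns; the continuous Yearly Income is omitted
ALTER MINING STRUCTURE [Targeted Mailing]
ADD MINING MODEL [TM Naive Bayes] (
    [Customer Key],
    [Commute Distance],
    [English Education],
    [Gender],
    [Marital Status],
    [Number Cars Owned],
    [Number Children At Home],
    [Region],
    [Bike Buyer] PREDICT
) USING Microsoft_Naive_Bayes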

Conclusion
In this chapter we used a new algorithm named Microsoft Naïve Bayes. We learned that it is a simple algorithm that does not accept continuous values in the attributes; it only accepts discrete values. This algorithm is used to get fast results and especially to analyze individual attributes. In a later chapter we will explain more Data Mining algorithms.
The way to predict data is similar no matter which algorithm is used.

References

http://msdn.microsoft.com/en-us/library/ms174806%28v=sql.110%29.aspx
http://en.wikipedia.org/wiki/Thomas_Bayes
http://en.wikipedia.org/wiki/Bayesian_probability
http://msdn.microsoft.com/en-us/library/ms174572.aspx

Data Mining Introduction Part 5: The Neural Network Algorithm


By Daniel Calbimonte, 2013/05/22
In earlier articles I explained the following Microsoft Data Mining algorithms:

Decision Trees
Clustering
Naïve Bayes

There is also an introduction to this series if you are interested.
Using these algorithms, we examined a view in SQL Server and predicted the probability of customers buying a bike from the fictitious company AdventureWorks. In this new chapter we will talk about the Neural Network algorithm. This one is my favorite.
As the name says, the Neural Network is a nice algorithm inspired by the way we think the brain works. Let's start by comparing the human being with the Microsoft Neural Network using a simple example: the baby example.

When babies come into the world, they experiment with the environment. They eat dirt, flies and paper, and they learn from these experiences. They receive the dirt as input, and if they like it, it becomes part of their menu. In their brains, using this input, the neural network creates connections, and babies learn what is best for them and which food should be rejected.
The Microsoft Neural Network is similar to a baby, and to human beings in general:

There are three layers: the input layer, the hidden layer and the output layer.

The Input Layer

If we think about the baby, the input would be the dirt. The baby eats the dirt, tastes it, and decides whether he likes it. In Microsoft Data Mining we use a view with the past experience of customers who bought a bike or did not. With that input, the Neural Network can make inferences; it predicts based on the input. The more data it has, the more precise the prediction is.

The Hidden Layer

In the baby example, the brain creates different connections and sends electricity through different paths. When a baby eats dirt, the brain sends a bad electrical sensation and the baby learns that the dirt does not taste good (for some babies).
In our example, the Microsoft algorithm tests different combinations of possibilities. It analyzes whether people who are 30-45 years old have a high probability of buying a bike. If the result is positive, it keeps the result and continues comparing the different attributes of the customer (gender, salary, cars, etc.).

The Output Layer

The output is the result of the experience: whether the baby likes the dirt or not. He experiences the taste of the food with his mouth and determines what is best for himself.
Neural networks can be applied to OCR, speech recognition, image analysis and other artificial intelligence tasks. In this case we are going to use a neural network for our data mining example.
In the Microsoft Neural Network, the system tests the different combinations of states and finds the option that best suits the needs. The output is the result of the different tests made by the algorithm.

Getting started
In parts 2 and 3 of this series I explained how to create the other algorithms based on a simple view with the customer information. Based on that information, we created a Data Mining model and added the different algorithms.
We are going to continue using the model of the earlier chapters and add the new Neural Network algorithm. Follow these steps.
1. Open the AdventureWorks project used in earlier chapters and double click the Targeted Mailing structure.
2. In that project we already added the views and inputs to the Data Mining project; now we are going to add the Neural Network algorithm. In the Mining Models tab, press the Create a related mining model icon.

3. Write any name in the Model Name textbox and choose Microsoft Neural Network as the algorithm name.
4. If everything is OK, a new model should be created.
5. In the Mining Models tab, click the Process the mining structure icon.
6. In the Process Mining Model window, press the Run button.
7. Once the process is done, close the window.
8. In order to see the model, go to the Mining Model Viewer and select My neural network.
9. You will find that customers older than 88 would not buy a bike (Favors 0); they are too old to ride a bike. The same goes for people who are 74-79 or 79-88 years old. On the other hand, people from the Pacific region will likely buy a bike, and they are potential customers (Favors 1). If a customer has 4 cars, he may not buy a bike.

If the customers have 3 children they may not want to buy a bike, and if their age is between 40 and 45 they may want to buy one.

In the Naïve Bayes chapter we asked the model for the probability that a prospective customer who is 40-45 years old, with a commute distance of 5-10 miles, a high school education, female, single, a house owner, with 3 cars and 3 children, would buy a bike.
10. Finally, in order to test this method, we are going to apply the same steps used in earlier chapters. If you did not read the earlier chapters, refer to steps 20 to 26 of the article about Naïve Bayes: http://www.sqlservercentral.com/articles/Data+Mining/97948/
11. Select the neural network model using the Select Model button.
12. Choose the My neural network model.
13. Using the Singleton option, specify the customer characteristics (age, gender, marital status, etc.) and use the PredictHistogram function to get the probability of buying a bike.
14. Verify the results.

The probability of buying a bike for a female who is 40-45 years old, single, etc. is 40% (0.4085014051).
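To reinforce the point that prediction works the same way no matter which algorithm you pick, the DMX behind steps 11 to 13 is the familiar singleton query; only the model name in the FROM clause changes. A sketch, assuming the model is called My neural network and the column names match the earlier parts:

-- Same singleton pattern as before, now against the neural network model
SELECT
    PredictHistogram([My neural network].[Bike Buyer])
FROM
    [My neural network]
NATURAL PREDICTION JOIN
(SELECT
    '5-10 Miles' AS [Commute Distance],
    'High School' AS [English Education],
    'F' AS [Gender],
    '1' AS [House Owner Flag],
    'S' AS [Marital Status],
    3 AS [Number Cars Owned],
    3 AS [Number Children At Home]) AS t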

Conclusion
In this chapter we used a new algorithm named Neural Network. The neural network is one of the most exciting algorithms and it can be used to predict complex models.
Even though the algorithm itself is complex, using it with Microsoft Data Mining is very simple. In the next chapter we will talk about more Data Mining algorithms.
References and images

http://msdn.microsoft.com/en-us/library/ms174806%28v=sql.110%29.aspx
http://en.wikipedia.org/wiki/Neural_network
http://pijamasurf.com/2010/05/comer-tierra-aumenta-la-inteligencia-te-pone-de-buenas/
http://msdn.microsoft.com/en-us/library/ms174572.aspx
