Nov 7, 2008 at 9:50 AM
Edited Nov 7, 2008 at 9:55 AM

I'd like to know about the functionality of the algorithm, for example:
How many variables does the algorithm support?
What kinds of variables (int, double, float)?
On my machine the plugin works but the viewer fails — why?
How can I put the results into a new table? Which function do I need to use (GetAttribute, etc.)?


Nov 10, 2008 at 9:57 AM
Edited Jan 11, 2009 at 12:52 PM

Hi,
we will work on a tutorial about the algorithm. This might take some time, so in the meantime I will try to answer your questions here.
Kernels and variables
- C: controls the tradeoff between the margin and misclassification. Increasing the value of C increases the cost of misclassifying points and forces the creation of a more accurate model that may not generalize well (overfitting). Values are floats in [0, infinity).
- Kernel type: specifies the kernel the algorithm should use. Possible values: [linear, polynomial, rbf]. Default: linear.
- Cache size: the size of the kernel cache used by the algorithm. The type is int; the default value should be fine.
- Gamma: used by the RBF kernel. Values are floats in [0, infinity).
- Exponent: used by the polynomial kernel; it is the power in the kernel equation. The values should be floats.
- Inhomogeneous: used by the polynomial kernel. The values are [true, false] and indicate which type of polynomial kernel (with or without a constant term) should be used.
More details about the kernels can be found in many places on the internet, such as Wikipedia. We will describe the kernels in more detail in the tutorial.
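These three kernels have standard textbook forms. As a minimal Python sketch of the formulas the parameters above refer to (an illustration only, not the plugin's actual code):

```python
import math

def linear_kernel(x, y):
    # k(x, y) = <x, y>
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, exponent=2.0, inhomogeneous=True):
    # k(x, y) = (<x, y> + c)^d, where d is the Exponent parameter and
    # c = 1 for the inhomogeneous variant, c = 0 for the homogeneous one
    # (the Inhomogeneous parameter).
    c = 1.0 if inhomogeneous else 0.0
    return (sum(a * b for a, b in zip(x, y)) + c) ** exponent

def rbf_kernel(x, y, gamma=0.5):
    # k(x, y) = exp(-gamma * ||x - y||^2); Gamma controls the kernel width.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

The Exponent, Inhomogeneous, and Gamma parameters in the table above map directly onto d, c, and gamma in these formulas.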
The viewer
I cannot tell you why the viewer fails without more details. Can you tell me whether you installed it on SQL 2005 or SQL 2008? Are you trying to use the viewer from Excel, BIDS, or Management Studio? Do you get an error message from the viewer or the installer?
The problem could be in the registry, in the location where the DLL is placed, or in the compilation type of the DLL.
Results
It depends on what you mean by the results. The algorithm supports functions that provide additional information, but these are not really results of the algorithm; they are descriptive metadata about the input. Unfortunately, there is no function implemented in this first version to return the weight vector calculated by the SMO algorithm. If you would like the results of the predictions to be put into a table, there are great resources on www.sqlserverdatamining.com.



Joris Valkonet,
I'm a statistician with a PhD in Markov chains. I'm building a process in SQL Data Mining, and the modeling is the final part of the process flow, because I need to run an algorithm to predict good/bad. But neural networks are not stable for that: if I take two samples from the same dataset, I get two different models (ranked by score tiers). From a statistical point of view this is terrible.
So I think SVM is a great solution for that issue.
But as I said, I don't know much about how it works.
1. Could you give me a hand telling me how to configure the algorithm to predict a good/bad variable from a list of inputs?
2. Approximately how many rows does the algorithm support?
3. How many inputs does it support (how many dimensions)?
4. Which function do I need to use to put the prediction into a table?
PS: the viewer is working, but only on SQL Data Mining 2008.
I'd be glad if you could help me.
Great job!
Best regards,
Paulo Carvalho
On Mon, Nov 10, 2008 at 7:58 AM, JorisValkonet <notifications@codeplex.com> wrote:

Best regards,
_______________________________
Paulo Carvalho
when you think you know all the answers, life comes and changes all the questions...
brilliant minds discuss ideas, average minds discuss events, small minds discuss people...



Hi Paulo,
Thank you for your reply. I believe that SVMs could be the solution for your problem. I will try to help you.
1. Could you give me a hand telling me how to configure the algorithm to predict a good/bad variable from a list of inputs?
Before running the algorithm, you have to do two things:
- Create a data mining structure describing the metadata of your dataset. For example, you specify which columns are used and which column is the key column. Furthermore, you specify the cross-validation settings with the data mining structure (SQL 2008 only).
- After you have created your data mining structure, you have to create a data mining model. In the data mining model you specify which columns are input and which is the predict column. The input columns can be continuous or discrete; the predict column can only be discrete. Note that in this first version you have to specify a key column, and nested tables (semi multi-relational data mining) are not supported. Besides specifying the types of the columns, you can specify the kernel with the algorithm parameters. For your problem, a linear or RBF (Gaussian) kernel would be the ones I would pick.
These two steps can be done from Excel with the Microsoft Office 2007 Excel Data Mining add-in, from Business Intelligence Development Studio, or even from code in SQL Server Management Studio. The language DMX can be used for all data mining related tasks. Here you can find the language reference: http://technet.microsoft.com/enus/library/ms132058.aspx. It might be fun to watch the videos of Rafal Lukawiecki about data mining: http://www.microsoft.com/emea/spotlight/event.aspx?id=99 (or use a search engine to find the videos). This should provide you a solid base for development.
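The two steps above can be sketched in DMX. Note that the structure name, column names, and especially the algorithm name below are placeholders for illustration — use your own columns and the name under which the plugin registers itself on your server:

```sql
// Step 1: a mining structure describing the metadata of the dataset
CREATE MINING STRUCTURE [CustomerStructure] (
    [Id]      LONG   KEY,
    [Income]  DOUBLE CONTINUOUS,
    [Car]     TEXT   DISCRETE,
    [GoodBad] TEXT   DISCRETE
)

// Step 2: a model on that structure; [GoodBad] is the (discrete) predict
// column, and the algorithm name is whatever the SVM plugin registers as
ALTER MINING STRUCTURE [CustomerStructure]
ADD MINING MODEL [CustomerSvm] (
    [Id],
    [Income],
    [Car],
    [GoodBad] PREDICT
) USING [SVM_Plugin_Algorithm]
```

A kernel could then be chosen through the algorithm parameters in the USING clause, per the parameter table earlier in this thread.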
2. Approximately how many rows does the algorithm support?
The performance of the algorithm depends on three things:
- The complexity of the problem in relation to the ability of the kernel to solve it
- The number of rows
- The number of dimensions
The first point can be tackled by choosing the right kernel with the right parameters. I believe this is the most difficult part of SVMs; there are numerous papers out there on how to solve this. Personally, I use a trial-and-error approach.
There is no fixed limit on the number of rows the algorithm takes. In theory the number of rows can be very large; I believe it is safe to offer about 2000 rows. Performance depends not only on the number of rows but also on the dimensionality of the problem. The dimensionality depends on the number of columns and their types. SVMs work in a continuous space, so a continuous input column is one dimension for the algorithm. A discrete column is transformed into a binary continuous input space: the number of dimensions a discrete column adds equals the number of distinct values the column has, with each value translating into an additional dimension. For example, if a dataset has a column 'X' with the potential values {'A','B','C'} and the row with Id 1 has the value 'B' for column X, then this is translated into the three binary inputs (X=A, X=B, X=C) = (0, 1, 0).
So, the more discrete columns with many distinct values there are, the more dimensions the SVM problem has. This can grow very fast, and the running time is strongly influenced by it. The influence of continuous input columns is relatively small.
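The expansion of discrete columns into binary dimensions can be sketched in a few lines of Python (a plain illustration of the idea, not the plugin's actual code):

```python
def one_hot(value, categories):
    """Expand one discrete value into len(categories) binary dimensions."""
    return [1.0 if value == c else 0.0 for c in categories]

def dimensionality(num_continuous_cols, discrete_col_categories):
    # Each continuous column adds one dimension; each discrete column
    # adds one dimension per distinct value it can take.
    return num_continuous_cols + sum(len(cats) for cats in discrete_col_categories)

# The 'X' example from the text: value 'B' out of {'A', 'B', 'C'}
print(one_hot('B', ['A', 'B', 'C']))  # [0.0, 1.0, 0.0]
```

This also makes the growth concrete: two continuous columns plus two discrete columns with three and two distinct values give 2 + 3 + 2 = 7 dimensions.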
3. How many inputs does it support (how many dimensions)?
There is no limit, but you should consider the text above: with each discrete input column, the dimensionality grows fast.
4. Which function do I need to use to put the prediction into a table?



Thanks for clarifying some things about the algorithm...
Here I am again...
I've run a simple model:
Predict: 0/1
Input:
Car = 0/1
Retired = 0/1
on a sample with 1000 rows, using default settings, just changing the kernel to RBF,
and then I got the error attached...
About the results: when I use neural networks, I use the Predict function to return, for the following Security IDs, the predicted results (0/1 = good/bad).
But with SVM I only found "GetAttributes", "GetClasses", "GetPredictAttribute" and "GetMaxMinValues", and none of them returned the prediction. Actually, none of them returned anything.
I'm really sorry about all these questions, but note that I want to solve this issue using your tool, because it seems very trustworthy to me.
Best,
Paulo
On Mon, Nov 10, 2008 at 10:52 AM, JorisValkonet <notifications@codeplex.com> wrote:




We implemented the default 'Predict' for prediction. You should be able to run the following query:
SELECT m.[predict_column] as [prediction], x.[predict_column] as [input], x.*
FROM [model] As m
NATURAL PREDICTION JOIN
(SELECT * from [model].CASES)
AS x
If you replace predict_column with the predict column of your model and [model] with the name of your model, you should see the predictions.
The methods "GetAttributes", "GetClasses", "GetPredictAttribute" and "GetMaxMinValues" are used by the viewer. These methods don't provide prediction functionality. (I think I removed the visibility of the Predict function; it is there and you can use it, but it is not listed.)



Thanks a lot! It works!
I just couldn't get the viewer working, but that's OK.
So finally, is there any probability associated with the prediction?
Best!!!
Paulo
On Mon, Nov 10, 2008 at 12:04 PM, JorisValkonet <notifications@codeplex.com> wrote:



Nov 11, 2008 at 8:16 PM
Edited Nov 11, 2008 at 8:16 PM

We didn't implement anything for the probability of a prediction. We might look into this and try to release it in the next version of the plugin.
Joris



Just a small correction: increasing the value of C increases the cost of misclassifying points and forces the creation of a more accurate model that may not generalize well (overfitting)... Keep up the good work,
Luca Del Tongo

