F.A.Q

=====

 

 

 

 

1. What is it?

--------------

 

 

The spider is a collection of learning algorithms that have been glued together within an object-oriented environment.

This allows various standard techniques to be applied to any algorithm, e.g. cross validation, statistical tests, plotting functions, etc.

 

 

Version 1.0 contains the following algorithms: SVM, SVR (regression), C4.5, k-NN, LDA, one-vs-rest, RFE, multiplicative update (zero-norm minimization), Golub, stability algorithms, one-class SVM, nu-SVM, multi-class SVM. Other included functionality: hold-out testing, cross validation, evaluation with various loss functions, and the ability to glue algorithms together, e.g. finding features via correlation coefficients (Golub) and then training an SVM with those features. Finally, one can easily evaluate an algorithm or a combination of algorithms over many different hyperparameters.

 

 

 

 

2. Why have you made it?

------------------------

 

 

We think the main reasons for such a library are:

 

 

2.1. Sharing code

 

 

Once everything is an object, one can easily write another object and use all the existing tools with it.

 

 

2.2. Easier/faster analysis

 

 

Plug your dataset into an object and that's it. One can try combinations of pre-processing, feature selection and algorithms with different hyperparameters without writing new code to do this -- just a short script, as sketched below.
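
For example (a minimal sketch; the data, svm and param objects and the train function are described in questions 7, 8 and 10, and X, Y stand for your own matrices):

   d = data(X,Y);                   % wrap attributes and labels in a data object
   p = param(svm,'C',[0.1 1 10]);   % an SVM tried over several values of C
   [r a] = train(p,d);              % train all the variants with one call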

 

 

Also, we have written the core objects so that they work with large datasets in terms of both memory and speed (caching, and removing the memory overhead of MATLAB's pass-by-value semantics), and these optimizations are transparent to the user.

 

 

2.3. Easier/faster research

 

 

Plug in a new object and compare it with existing ones.

 

 

We plan to build benchmark objects: you pass in the new object you want to test, and the benchmark object generates problems, compares your algorithm with standard ones and produces results with significance tests.

 

 

2.4. Building large systems

 

 

To build large systems you need a modular approach, and we have thought about a good way to do this. In bioinformatics large systems seem quite natural, e.g. in secondary structure prediction. Plugging objects together is an easy way to build them.

 

 

 

 

2.5. Framework for more formal analysis

 

 

We hope that, with several standard tools in the spider "environment", it will be easier to perform statistical tests on the test error and to be more careful about things like model selection by using objects to do them. It should also help prevent mistakes -- when you rewrite code you can easily introduce a bug.

 

 

 

 

 

 

3. Is it different from other systems?

--------------------------------------

 

 

Yes. It is more object-oriented -- every algorithm is an object with a train and a test method, and e.g. cross validation is itself an object. Also, we wanted a MATLAB library which is powerful and not just for toy examples. Alex Smola is developing something as well, but it looks like it is only for kernel algorithms and not for building experiments and plugging objects together so easily, although he may eventually develop it to do that.

 

 

 

 

4. How can I have a look at it quickly?

---------------------------------------

 

 

a) Just unzip into the directory of your choice.

b) Start up MATLAB and execute "spider_init", which is in the spider directory

   -- this sets a path to the spider directories at MATLAB startup.

c) You can now run one of the demos, e.g. spider/demos/microarray/go.m, as shown below.
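
A first session might then look like this (a sketch; it assumes you unzipped into a directory called "spider"):

   cd spider
   spider_init             % sets a path to the spider directories
   cd demos/microarray
   go                      % runs the microarray demo (go.m)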

 

 

 

 

5. Want to know what objects are available?

-------------------------------------------

 

 

Type "help spider" for a list. Sorry, not all the help has been written for the individual objects -- we plan to make these available using the matlab standard soon.

 

 

 

 

6. What's an object in MATLAB?

------------------------------

 

 

It's a directory called something like "@knn", e.g. for k-nearest neighbours, which contains all the methods of that object (as M-files). Our training objects have a constructor, which sets default hyperparameters and initializes the model, and a train and a test method.
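
For example, the layout of such a directory might look like this (a sketch; the file names follow the @template object described in question 9):

   @knn/
      knn.m          % constructor: same name as the directory
      training.m     % called when you train the object
      testing.m      % called when you test the object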

 

 

 

 

7. Want to train and test an algorithm?

---------------------------------------

 

 

a) Prepare your data as a matrix of attributes (e.g. X) and a matrix of labels

   (e.g. Y) such that the rows are the examples.

b) Create a data object:                         d=data(X,Y)

c) Train an algorithm, e.g. svm, knn, c45:       [tr a]=train(svm,d)

   tr contains the predictions,

   a  is the model that you learnt.

d) Test the algorithm on new data d2:            tst=test(a,d2,'class_loss')

   -- using the classification loss function.
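
Put together, a complete session might look like this (the toy data here is made up purely for illustration):

   X = rand(100,5);                  % 100 toy examples with 5 attributes
   Y = sign(X(:,1)-0.5);             % toy labels in {-1,+1}
   d = data(X,Y);
   [tr a] = train(svm,d);            % train an SVM with default hyperparameters
   X2 = rand(50,5); Y2 = sign(X2(:,1)-0.5);
   tst = test(a,data(X2,Y2),'class_loss');   % classification loss on new data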

 

 

Type "help train" and "help test" for a little more information.

 

 

 

 

8. How do I set the hyperparameters of algorithms?

--------------------------------------------------

 

 

When an algorithm is initialized, the last parameter is a set of command strings in a cell array which set up hyperparameters. E.g. a=svm('C=1') sets the hyperparameter C, and a=svm({'C=1','ridge=1e-10'}) sets two hyperparameters. This can also be written as a=svm('C=1;ridge=1e-10'), which is a bit easier -- you just separate the instructions by semicolons. Type "help svm" for a list of its hyperparameters.
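
For example (a small sketch, with d a data object as in question 7):

   a = svm('C=1;ridge=1e-10');   % set two hyperparameters at construction
   [tr a] = train(a,d);          % then train as usual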

 

 

 

 

9. Want to build your own object so it's easy to use with all the other objects and to share your code with everyone?

---------------------------------------------------------------------------------------------------------------------

 

 

Please do it! Take a look at the @template object, which is a simple SVM object: copy this directory and just change the training.m and testing.m files and the constructor (which should have the same name as the directory) to make your new algorithm.
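
As a rough sketch (the object name "myalg" and its hyperparameter are made up for illustration; copy the real pattern from @template rather than from here), the constructor might look like:

   function a = myalg(hyper)
   % MYALG  constructor: set default hyperparameters, then build the object
     a.C = 1;                % made-up default hyperparameter
     a = class(a,'myalg');   % register the struct as an object of type myalg
     % @template also shows how command strings like 'C=2' (see question 8)
     % are evaluated to override these defaults.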

 

 

 

 

10. What do you mean by plugging objects together?

--------------------------------------------------

 

 

Well, there are a few objects included that illustrate this. Consider the "param" object. Initialising it with e.g. p=param(svm,'C',[0.1 1 10 100 1000]) is a quick and easy way to train an SVM with different values of C: you just type train(p,d), where d is the data.
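
That is (a minimal sketch, with d a data object as in question 7):

   p = param(svm,'C',[0.1 1 10 100 1000]);   % one SVM per value of C
   [r a] = train(p,d);                       % trains all five SVMs on d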

 

 

Other building blocks include:

"feed_fwd"  -- for allowing the output of one object to be the input of another e.g a preprocessing object passing results to a feature ranking objects which then passes results to a classifier, e.g f=feed_fwd({preprocess rfe svm});

 

 

"alg" -- a set of algorithms

 

 

"cv"  -- for cross validation,

 

 

You can use all these building blocks together to create quite complicated constructions in which you perform feature selection, preprocessing and so on, tuning different hyperparameters at each stage, as in the sketch below.
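
For instance, one might cross-validate an entire chain (a sketch; that cv wraps another object directly like this is our assumption -- type "help cv" to check the exact form):

   f = feed_fwd({preprocess rfe svm});   % preprocessing -> feature ranking -> classifier
   c = cv(f);                            % cross-validate the whole chain
   [r a] = train(c,d);                   % run the folds on data d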

 

 

 

 

 

 

11. What are the upcoming things to be implemented?

---------------------------------------------------

 

 

Here are some of them:

 

 

* significance tests

* improve the user interface

* an object for graphs and other visualizations

* an object for turning results into LaTeX tables/graphs!

* an object for making toy examples

* benchmark objects for comparing algorithms / for research

* model selection objects

* preprocessing objects

* subsampling object

* improve some code by reimplementing it in C

* ridge regression, Nadaraya-Watson, CART?

* multi-class golub

* PCA

* ECOC, 1-vs-1?

* R2W2 gradient-based feature selection

* R2W2 kernel selection algorithms

* kernel fisher discriminant

 

 

 

 

12. What are the known "issues"?

--------------------------------

 

 

C4.5 is not implemented for Windows yet -- if you want to fix it, you are very welcome!