2806 Neural Computation: Committee Machines (2005, Ari Visa)

Outline: some historical notes, some theory, committee machines, conclusions.

Some Historical Notes

- Boosting by filtering: Schapire, 1990.
- AdaBoost: Freund and Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting", 1995.
- Bagging: Leo Breiman, "Bagging Predictors", 1996.
- General arcing: Leo Breiman, "Arcing Classifiers".
- Kittler, Hatef and Duin, "On Combining Classifiers", 1998.
Some Theory

The principle of divide and conquer: a complex computational task is solved by dividing it into a number of computationally simple tasks and then combining the solutions to those tasks.

Committee machine: in supervised learning, computational simplicity is achieved by distributing the learning task among a number of experts, which in turn divides the input space into a set of subspaces. The combination of experts is said to constitute a committee machine.

Modularity (Osherson et al., 1990): a neural network is said to be modular if the computation performed by the network can be decomposed into two or more modules that operate on distinct inputs without communicating with each other. The outputs of the modules are mediated by an integrating unit that is not permitted to feed information back to the modules. In particular, the integrating unit both decides how the outputs of the modules should be combined to form the final output of the system and decides which modules should learn which training patterns.

Committee machines are universal approximators. They come in two kinds:

(a) Static structures: the responses of several experts are combined by means of a mechanism that does not involve the input signal.
- Ensemble averaging: the expert outputs are linearly combined.
- Boosting: a weak learning algorithm is converted into one that achieves arbitrarily high accuracy.

(b) Dynamic structures: the input signal is directly involved in actuating the mechanism that integrates the outputs of the individual experts into an overall output.
- Mixture of experts: the responses of the experts are nonlinearly combined by means of a single gating network.
- Hierarchical mixture of experts: the responses of the experts are nonlinearly combined by means of several gating networks arranged in a hierarchical fashion.
Some Theory: Ensemble Averaging

A number of differently trained neural networks share a common input, and their individual outputs are somehow combined to produce an overall output (sketched in code below). Note that all the experts are trained on the same data set; they may differ from each other only in the choice of the initial conditions used in network training. The output is obtained by summing the individual outputs of the experts and then computing the probability of correct classification.

- The bias of the ensemble-averaged function F_I(x) pertaining to the committee machine is exactly the same as that of the function F(x) pertaining to a single neural network.
- The variance of the ensemble-averaged function F_I(x) is less than that of the function F(x).
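The combination rule itself is just a linear average of the expert outputs. A minimal Python sketch (not from the slides; the expert outputs and data here are hypothetical placeholders):

```python
import numpy as np

def ensemble_average(expert_outputs):
    """Linearly combine the outputs of differently initialized experts.

    expert_outputs: list of arrays of shape (n_samples, n_classes),
    one array per expert. All experts were trained on the same data set
    and differ only in their initial conditions.
    """
    return np.mean(expert_outputs, axis=0)

# Hypothetical usage: three experts scoring four samples over two classes.
outputs = [np.random.dirichlet(np.ones(2), size=4) for _ in range(3)]
combined = ensemble_average(outputs)
predicted_class = combined.argmax(axis=1)
```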
Some Theory: Boosting

In boosting, the experts are trained on data sets with entirely different distributions. Three approaches:
- Boosting by filtering
- Boosting by subsampling
- Boosting by reweighting

The original idea of boosting is rooted in a distribution-free, or probably approximately correct (PAC), model of learning.

Some Theory: Boosting by Filtering

The committee machine consists of three experts; the approach requires a large data set. Training proceeds as follows:
1. The first expert is trained on a set consisting of N1 examples.

2. The trained first expert is used to filter another set of examples, by proceeding in the following manner (a code sketch of this filtering step follows the list). Flip a fair coin; this simulates a random guess.
   - If the result is heads, pass new patterns through the first expert and discard correctly classified patterns until a pattern is misclassified. That misclassified pattern is added to the training set for the second expert.
   - If the result is tails, pass new patterns through the first expert and discard incorrectly classified patterns until a pattern is classified correctly. That correctly classified pattern is added to the training set for the second expert.
   Continue this process until a total of N1 examples has been filtered by the first expert. This set of filtered examples constitutes the training set for the second expert.

3. Once the second expert has been trained in the usual way, a third training set is formed for the third expert by proceeding in the following manner. Pass a new pattern through both the first and the second experts. If the two experts agree in their decisions, discard that pattern; if they disagree, the pattern is added to the training set for the third expert. Continue with this process until a total of N1 examples has been filtered jointly by the first and second experts. This set of jointly filtered examples constitutes the training set for the third expert.
Some Theory: Classification by the Committee

If the first and the second experts in the committee agree in their respective decisions, that class label is used. Otherwise, the class label produced by the third expert is used. If each of the three experts has an error rate ε < 1/2, the error rate of the committee is bounded by g(ε) = 3ε^2 - 2ε^3.
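As a quick numerical check (not on the original slide): with ε = 0.1 the bound gives g(0.1) = 3(0.1)^2 - 2(0.1)^3 = 0.03 - 0.002 = 0.028, so the committee's error rate is well below that of any individual expert.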
Some Theory: AdaBoost

AdaBoost is boosting by resampling and operates in batch mode. AdaBoost adjusts adaptively to the errors of the weak hypotheses returned by the weak learning model. When the number of possible class labels is M > 2, the boosting problem becomes more intricate.

[Figure on the original slide: error performance due to AdaBoost; the shape of the error-rate curve depends on the definition of "confidence".]
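A minimal sketch of the AdaBoost loop for binary labels in {-1, +1}, in its reweighting form (resampling according to the weights is an equivalent alternative); train_weak_learner stands in for whatever weak learning model is assumed and is not defined on the slides:

```python
import numpy as np

def adaboost(X, y, train_weak_learner, n_rounds=10):
    """Minimal AdaBoost sketch for labels y in {-1, +1}.

    train_weak_learner(X, y, w) must return a hypothesis h such that
    h(X) gives predictions in {-1, +1}; it is a placeholder here.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)                      # example weights
    hypotheses, alphas = [], []
    for _ in range(n_rounds):
        h = train_weak_learner(X, y, w)
        pred = h(X)
        err = float(np.sum(w * (pred != y)))     # weighted error (w sums to 1)
        if err == 0.0 or err >= 0.5:             # perfect, or no better than chance
            break
        alpha = 0.5 * np.log((1.0 - err) / err)  # hypothesis weight
        w *= np.exp(-alpha * y * pred)           # up-weight misclassified examples
        w /= w.sum()
        hypotheses.append(h)
        alphas.append(alpha)

    def committee(Xq):
        return np.sign(sum(a * h(Xq) for a, h in zip(alphas, hypotheses)))
    return committee
```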
Some Theory: Dynamic Structures

"Dynamic" is used here in the sense that integration of the knowledge acquired by the experts is accomplished under the action of the input signal.

Probabilistic generative model:
1. An input vector x is picked at random from some prior distribution.
2. A particular rule, say the kth rule, is selected in accordance with the conditional probability P(k | x, a^0), given x and some parameter vector a^0.
3. For rule k, k = 1, 2, ..., K, the model response d is linear in x, with an additive error ε_k modelled as a Gaussian-distributed random variable with zero mean and unit variance: E[ε_k] = 0 and var[ε_k] = 1 for all k.

P(D = d | x, θ^0) = Σ_{k=1}^{K} P(D = d | x, w_k^0) P(k | x, a^0)
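The generative model above can be sampled directly. A short sketch, under the assumption (borrowed from the mixture-of-experts gate defined below, not stated at this point on the slides) that P(k | x, a^0) is the softmax of a_k^T x; array names are illustrative:

```python
import numpy as np

def sample_response(x, A0, W0, rng=None):
    """Draw one response d from the probabilistic generative model.

    x : input vector, shape (p,)
    A0: gating parameters a_k^0, shape (K, p); P(k | x, a0) assumed softmax of A0 @ x
    W0: expert parameters w_k^0, shape (K, p); rule k gives d = w_k^T x + eps_k
    """
    rng = rng or np.random.default_rng()
    u = A0 @ x
    p = np.exp(u - u.max())
    p /= p.sum()                       # P(k | x, a0)
    k = rng.choice(len(p), p=p)        # step 2: select rule k
    eps = rng.standard_normal()        # step 3: zero-mean, unit-variance error
    return float(W0[k] @ x + eps)
```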
Some Theory: Mixture of Experts Model

Each expert k produces the output y_k = w_k^T x, k = 1, 2, ..., K. The gating network consists of a single layer of K neurons, with each neuron assigned to a specific expert. The neurons of the gating network are nonlinear (softmax):

g_k = exp(u_k) / Σ_{j=1}^{K} exp(u_j),   k = 1, 2, ..., K,   with u_k = a_k^T x.

The gating network is a classifier that maps the input x into multinomial probabilities, so that the different experts will be able to match the desired response.

The model output density is an associative Gaussian mixture model:

f_D(d | x) = Σ_{k=1}^{K} g_k f_D(d | x, k) = (1/√(2π)) Σ_{k=1}^{K} g_k exp(-(1/2)(d - y_k)^2)

Given the training sample {(x_i, d_i)}, i = 1, ..., N, the problem is to learn the conditional means μ_k = y_k and the mixing parameters g_k, k = 1, 2, ..., K, in an optimum manner, so that f_D(d | x) provides a good estimate of the underlying probability density function of the environment responsible for generating the training data.
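Putting the pieces together, the forward pass of the ME model combines the K linear experts through the softmax gate. A minimal sketch (array names are illustrative, not from the slides):

```python
import numpy as np

def mixture_of_experts_output(x, W, A):
    """Forward pass of the ME model for one input x.

    x: input vector, shape (p,)
    W: expert weight vectors w_k, shape (K, p)  -> y_k = w_k^T x
    A: gating weight vectors a_k, shape (K, p)  -> u_k = a_k^T x
    Returns the gate probabilities g and the combined output sum_k g_k y_k.
    """
    y = W @ x                        # expert outputs y_k
    u = A @ x
    g = np.exp(u - u.max())
    g /= g.sum()                     # softmax gate: multinomial probabilities
    return g, float(g @ y)
```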
Some Theory: Hierarchical Mixture of Experts Model

The hierarchical mixture of experts (HME) model is a natural extension of the ME model. The HME model differs from the ME model in that the input space is divided into a nested set of subspaces, with the information combined and redistributed among the experts under the control of several gating networks arranged in a hierarchical manner.

The formulation of the HME model can be viewed in two ways:
1. The HME model is a product of the divide-and-conquer strategy.
2. The HME model is a soft-decision tree.

Standard decision trees suffer from a greediness problem: once a decision is made in such a tree, it is frozen and never changes thereafter.
Some Theory: Classification and Regression Trees (CART, Breiman et al., 1984)

1. Selection of splits. Let a node t denote a subset of the current tree T, and let d̄(t) denote the average of d_i over all cases (x_i, d_i) falling into t, that is,

   d̄(t) = (1/N(t)) Σ_{x_i ∈ t} d_i,

   where the sum is over all d_i such that x_i ∈ t and N(t) is the total number of such cases.

2. Define E(t) = (1/N) Σ_{x_i ∈ t} (d_i - d̄(t))^2 and E(T) = Σ_{t ∈ T} E(t). The best split s* is then taken to be the particular split for which

   ΔE(s*, t) = max_{s ∈ S} ΔE(s, t),

   where ΔE(s, t) denotes the decrease in E produced by the candidate split s of node t (a sketch of this split search follows below).

Determination of a terminal node: a node t is declared a terminal node if max_{s ∈ S} ΔE(s, t) < th, where th is a prescribed threshold.

Least-squares estimation of a terminal node's parameters: let t denote a terminal node in the final binary tree T, let X(t) denote the matrix composed of the x_i ∈ t, and let d(t) denote the corresponding vector composed of all the d_i in t. Define w(t) = X^+(t) d(t), where X^+(t) is the pseudoinverse of the matrix X(t). Using the weights calculated in this way, the split-selection problem is solved by looking for the least sum of squared residuals with respect to the regression surface.
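A sketch of the split search for one node, using axis-parallel threshold splits and the squared-error criterion above (function and variable names are illustrative). The constant 1/N factor in E(t) does not affect which split maximizes the decrease, so plain sums of squares are used:

```python
import numpy as np

def node_error(d):
    """Sum of squared deviations of the targets d from their mean."""
    return float(np.sum((d - d.mean()) ** 2)) if len(d) else 0.0

def best_split(X, d):
    """Return (feature index, threshold, delta_E) of the best axis-parallel split.

    X: inputs falling into the node, shape (n, p); d: targets, shape (n,).
    delta_E is E(t) - E(t_L) - E(t_R), the decrease in squared error.
    """
    E_t = node_error(d)
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:          # candidate thresholds
            left = X[:, j] <= thr
            delta = E_t - node_error(d[left]) - node_error(d[~left])
            if delta > best[2]:
                best = (j, float(thr), delta)
    return best
```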
Some Theory: Using CART to Initialize the HME Model

The gating nonlinearity takes the logistic form g = 1 / (1 + exp(-(a^T x + b))).

1. Apply CART to the training data.
2. Set the synaptic weight vectors of the experts in the HME model equal to the least-squares estimates of the parameter vectors at the corresponding terminal nodes of the binary tree resulting from the application of CART.
3. For the gating networks:
   (a) set the synaptic weight vectors to point in directions that are orthogonal to the corresponding splits in the binary tree obtained from CART, and
   (b) set the lengths of the synaptic weight vectors equal to small random vectors.
Some Theory: A Posteriori Probabilities in the HME Model

A posteriori probabilities at the nonterminal nodes of the tree:

h_k = [ g_k Σ_{j=1}^{2} g_{j|k} exp(-(1/2)(d - y_{jk})^2) ] / [ Σ_{k=1}^{2} g_k Σ_{j=1}^{2} g_{j|k} exp(-(1/2)(d - y_{jk})^2) ]

h_{j|k} = [ g_{j|k} exp(-(1/2)(d - y_{jk})^2) ] / [ Σ_{j=1}^{2} g_{j|k} exp(-(1/2)(d - y_{jk})^2) ]

The joint a posteriori probability that expert (j, k) produces the output y_{jk}:

h_{jk} = h_k h_{j|k} = [ g_k g_{j|k} exp(-(1/2)(d - y_{jk})^2) ] / [ Σ_{k=1}^{2} g_k Σ_{j=1}^{2} g_{j|k} exp(-(1/2)(d - y_{jk})^2) ]

with 0 ≤ h_{jk} ≤ 1 for all (j, k) and Σ_{j=1}^{2} Σ_{k=1}^{2} h_{jk} = 1.
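A sketch of these posterior computations for a two-level HME (arrays indexed as [k, j]; names are illustrative, not from the slides):

```python
import numpy as np

def hme_posteriors(d, y, g, g_cond):
    """A posteriori probabilities h_k, h_{j|k}, h_{jk} for a two-level HME.

    d     : scalar desired response
    y     : expert outputs y_{jk}, shape (2, 2), indexed [k, j]
    g     : top-level gate outputs g_k, shape (2,)
    g_cond: conditional gate outputs g_{j|k}, shape (2, 2), indexed [k, j]
    """
    lik = np.exp(-0.5 * (d - y) ** 2)              # exp(-(1/2)(d - y_jk)^2)
    joint = g[:, None] * g_cond * lik              # g_k g_{j|k} exp(...)
    h_jk = joint / joint.sum()                     # joint posteriors, sum to 1
    h_k = h_jk.sum(axis=1)                         # h_k
    h_j_given_k = h_jk / h_k[:, None]              # h_{j|k}
    return h_k, h_j_given_k, h_jk
```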
Some Theory: Parameter Estimation for the HME Model

Maximum likelihood estimation. The overall output and the conditional density of the two-level HME are

y = Σ_{k=1}^{2} g_k Σ_{j=1}^{2} g_{j|k} y_{jk}

f_D(d | x) = (1/√(2π)) Σ_{k=1}^{2} g_k Σ_{j=1}^{2} g_{j|k} exp(-(1/2)(d - y_{jk})^2)

Likelihood function: l = f_D(d | x). Log-likelihood function: L = log f_D(d | x). Setting ∂L/∂θ = 0 yields the maximum-likelihood estimate.
Some Theory: Learning Strategies for the HME Model

1. Stochastic gradient approach. This approach yields an on-line algorithm for the maximization of L, with partial derivatives (a sketch of one such update follows)

∂L/∂w_{jk} = h_k(n) h_{j|k}(n) (d(n) - y_{jk}(n)) x(n)

∂L/∂a_k = (h_k(n) - g_k(n)) x(n)

∂L/∂a_{jk} = h_k(n) (h_{j|k}(n) - g_{j|k}(n)) x(n)
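One on-line update step using these gradients could look like the following sketch; the learning rate eta and the array shapes are assumptions, not taken from the slides:

```python
import numpy as np

def online_update(x, d, W, A_top, A_cond, g, g_cond, h_k, h_j_given_k, eta=0.01):
    """One stochastic-gradient ascent step on L for a two-level HME.

    W      : expert weights w_{jk}, shape (2, 2, p), indexed [k, j, :]
    A_top  : top gate weights a_k, shape (2, p)
    A_cond : conditional gate weights a_{jk}, shape (2, 2, p)
    g, h_k : top gate outputs / posteriors, shape (2,)
    g_cond, h_j_given_k : conditional gate outputs / posteriors, shape (2, 2)
    """
    y = W @ x                                                  # expert outputs y_{jk}
    W += eta * (h_k[:, None] * h_j_given_k * (d - y))[..., None] * x
    A_top += eta * (h_k - g)[:, None] * x
    A_cond += eta * (h_k[:, None] * (h_j_given_k - g_cond))[..., None] * x
    return W, A_top, A_cond
```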
2. Expectation-maximization (EM) approach.
- Expectation step (E-step): uses the observed data set of an incomplete-data problem and the current value of the parameter vector to manufacture data, so as to postulate an augmented, or so-called complete, data set.
- Maximization step (M-step): consists of deriving a new estimate of the parameter vector by maximizing the log-likelihood function of the complete data manufactured in the E-step.

The EM algorithm is directed at finding a value of θ that maximizes the incomplete-data log-likelihood function L(θ) = log f_D(d | θ). This problem is solved indirectly by working iteratively with the complete-data log-likelihood function L_c(θ) = log f_c(r | θ), which is a random variable because the missing data vector z is unknown.

E-step:  Q(θ, θ̂(n)) = E[L_c(θ)]
M-step:  maximize Q(θ, θ̂(n)) with respect to θ, that is, θ̂(n+1) = arg max_θ Q(θ, θ̂(n)).

Continue until the difference between L(θ̂(n+1)) and L(θ̂(n)) drops to some arbitrarily small value.
For the two-level HME:

f_D(d_i | x_i) = (1/√(2π)) Σ_{k=1}^{2} g_k^{(i)} Σ_{j=1}^{2} g_{j|k}^{(i)} exp(-(1/2)(d_i - y_{jk}^{(i)})^2)

L(θ) = log Π_{i=1}^{N} f_D(d_i | x_i)

L_c(θ) = log Π_{i=1}^{N} f_c(d_i, z_{jk}^{(i)} | x_i)

Q(θ, θ̂(n)) = E[L_c(θ)] = Σ_{i=1}^{N} Σ_{j=1}^{2} Σ_{k=1}^{2} h_{jk}^{(i)} [ log g_k^{(i)} + log g_{j|k}^{(i)} - (1/2)(d_i - y_{jk}^{(i)})^2 ]

M-step updates:

w_{jk}(n+1) = arg min_{w_{jk}} Σ_{i=1}^{N} h_{jk}^{(i)} (d_i - y_{jk}^{(i)})^2

a_j(n+1) = arg max_{a_j} Σ_{i=1}^{N} Σ_{k=1}^{2} h_k^{(i)} log g_k^{(i)}

a_{jk}(n+1) = arg max_{a_{jk}} Σ_{i=1}^{N} Σ_{l=1}^{2} h_l^{(i)} Σ_{m=1}^{2} h_{m|l}^{(i)} log g_{m|l}^{(i)}
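A sketch of the M-step update for the expert weights, which is a weighted least-squares problem with the posteriors h_{jk}^{(i)} as example weights (shapes and names are assumptions; np.linalg.lstsq is used here as one way to solve it):

```python
import numpy as np

def m_step_expert_weights(X, d, H):
    """M-step update of the expert weights w_{jk} in a two-level HME.

    X: inputs, shape (N, p); d: targets, shape (N,)
    H: joint posteriors h_{jk}^{(i)}, shape (N, 2, 2), indexed [i, k, j]
    Each w_{jk} solves a weighted least-squares problem with weights h_{jk}^{(i)}.
    """
    W = np.zeros((2, 2, X.shape[1]))               # W[k, j] = w_{jk}
    for k in range(2):
        for j in range(2):
            sw = np.sqrt(H[:, k, j])               # sqrt-weighting turns the
            Xw = X * sw[:, None]                   # weighted problem into an
            dw = d * sw                            # ordinary least-squares fit
            W[k, j], *_ = np.linalg.lstsq(Xw, dw, rcond=None)
    return W
```

The gating-parameter updates for a_j and a_{jk} are analogous weighted maximum-likelihood fits of the softmax/logistic gates (for example, by iteratively reweighted least squares) and are omitted from this sketch.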
Conclusions

Ensemble averaging improves error performance by combining two effects: (a) overfitting of the individual experts and (b) the use of different initial conditions in the training of the individual experts. Boosting improves error performance by filtering and resampling. Simple models provide insight into the problem but lack accuracy; complex models provide accurate results but lack insight. The architecture of the HME is similar to that of CART, but it differs from CART in softly partitioning the input space.
