Radial Basis Function Network (RBFN) Tutorial

A Radial Basis Function Network (RBFN) is a particular type of neural network. In this article, I’ll be describing its use as a non-linear classifier.

Generally, when people talk about neural networks or “Artificial Neural Networks” they are referring to the Multilayer Perceptron (MLP). Each neuron in an MLP takes the weighted sum of its input values. That is, each input value is multiplied by a coefficient, and the results are all summed together. A single MLP neuron is a simple linear classifier, but complex non-linear classifiers can be built by combining these neurons into a network.

To me, the RBFN approach is more intuitive than the MLP. An RBFN performs classification by measuring the input’s similarity to examples from the training set. Each RBFN neuron stores a “prototype”, which is just one of the examples from the training set. When we want to classify a new input, each neuron computes the Euclidean distance between the input and its prototype. Roughly speaking, if the input more closely resembles the class A prototypes than the class B prototypes, it is classified as class A.

RBF Network Architecture

[Figure: Typical RBF Network architecture, with an input vector, a layer of RBF neurons, and one output node per category]

The above illustration shows the typical architecture of an RBF Network. It consists of an input vector, a layer of RBF neurons, and an output layer with one node per category or class of data.

The Input Vector

The input vector is the n-dimensional vector that you are trying to classify. The entire input vector is shown to each of the RBF neurons.

The RBF Neurons

Each RBF neuron stores a “prototype” vector which is just one of the vectors from the training set. Each RBF neuron compares the input vector to its prototype, and outputs a value between 0 and 1 which is a measure of similarity. If the input is equal to the prototype, then the output of that RBF neuron will be 1. As the distance between the input and prototype grows, the response falls off exponentially towards 0.  The shape of the RBF neuron’s response is a bell curve, as illustrated in the network architecture diagram.

The neuron’s response value is also called its “activation” value.

The prototype vector is also often called the neuron’s “center”, since it’s the value at the center of the bell curve.

The Output Nodes

The output of the network consists of a set of nodes, one per category that we are trying to classify. Each output node computes a sort of score for the associated category. Typically, a classification decision is made by assigning the input to the category with the highest score.

The score is computed by taking a weighted sum of the activation values from every RBF neuron. By weighted sum we mean that an output node associates a weight value with each of the RBF neurons, and multiplies the neuron’s activation by this weight before adding it to the total response.

Because each output node is computing the score for a different category, every output node has its own set of weights. The output node will typically give a positive weight to the RBF neurons that belong to its category, and a negative weight to the others.
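To make the weighted sum concrete, below is a minimal MATLAB sketch of the output layer computation. The variable names here are my own, not taken from the code package linked at the end of this post.

    % activations: column vector of RBF neuron outputs (numNeurons x 1)
    % Weights:     one column of weights per category (numNeurons x numCategories)
    scores = Weights' * activations;   % weighted sum for each output node
    [~, predicted] = max(scores);      % classify as the highest-scoring category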

RBF Neuron Activation Function

Each RBF neuron computes a measure of the similarity between the input and its prototype vector (taken from the training set). Input vectors which are more similar to the prototype return a result closer to 1. There are different possible choices of similarity functions, but the most popular is based on the Gaussian. Below is the equation for a Gaussian with a one-dimensional input.

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}

Where x is the input, mu is the mean, and sigma is the standard deviation. This produces the familiar bell curve shown below, which is centered at the mean, mu (in the below plot the mean is 5 and sigma is 1).

[Figure: Gaussian bell curve centered at the mean, with mu = 5 and sigma = 1]
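If you want to reproduce this curve, here is a short MATLAB sketch (the mean of 5 and sigma of 1 match the plot above; the x-axis range is my own choice):

    x = linspace(0, 10, 200);
    mu = 5; sigma = 1;
    f = (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu).^2 / (2 * sigma^2));
    plot(x, f);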

The RBF neuron activation function is slightly different, and is typically written as:

\phi(x) = e^{-\beta \lVert x - \mu \rVert^2}

In the Gaussian distribution, mu refers to the mean of the distribution. Here, it is the prototype vector which is at the center of the bell curve.

For the activation function, phi, we aren’t directly interested in the value of the standard deviation, sigma, so we make a couple of simplifying modifications.

The first change is that we’ve removed the outer coefficient, 1 / (sigma * sqrt(2 * pi)). This term normally controls the height of the Gaussian. Here, though, it is redundant with the weights applied by the output nodes. During training, the output nodes will learn the correct coefficient or “weight” to apply to the neuron’s response.

The second change is that we’ve replaced the inner coefficient, 1 / (2 * sigma^2), with a single parameter ‘beta’. This beta coefficient controls the width of the bell curve. Again, in this context, we don’t care about the value of sigma, we just care that there’s some coefficient which is controlling the width of the bell curve. So we simplify the equation by replacing the term with a single variable.

[Figure: RBF neuron activation for different values of beta]

There is also a slight change in notation here when we apply the equation to n-dimensional vectors. The double bar notation in the activation equation indicates that we are taking the Euclidean distance between x and mu, and squaring the result. For the 1-dimensional Gaussian, this simplifies to just (x – mu)^2.
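Here is a minimal MATLAB sketch of a single RBF neuron’s activation for an n-dimensional input, assuming x and mu are column vectors and beta is a scalar:

    % x, mu: n-dimensional column vectors; beta: scalar width parameter
    sqDist = sum((x - mu).^2);   % squared Euclidean distance ||x - mu||^2
    phi = exp(-beta * sqDist);   % activation in the range (0, 1]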

It’s important to note that the underlying metric here for evaluating the similarity between an input vector and a prototype is the Euclidean distance between the two vectors.

Also, each RBF neuron will produce its largest response when the input is equal to the prototype vector. This allows us to use the activation as a measure of similarity, and to sum the results from all of the RBF neurons.

As we move out from the prototype vector, the response falls off exponentially. Recall from the RBFN architecture illustration that the output node for each category takes the weighted sum of every RBF neuron in the network–in other words, every neuron in the network will have some influence over the classification decision. The exponential fall off of the activation function, however, means that the neurons whose prototypes are far from the input vector will actually contribute very little to the result.

If you are interested in gaining a deeper understanding of how the Gaussian equation produces this bell curve shape, check out my post on the Gaussian Kernel.

Example Dataset

Before going into the details on training an RBFN, let’s look at a fully trained example.

In the below dataset, we have two-dimensional data points which belong to one of two classes, indicated by the blue x’s and red circles. I’ve trained an RBF Network with 20 RBF neurons on this data set. The prototypes selected are marked by black asterisks.

[Figure: Example dataset with the 20 selected prototypes marked by black asterisks]

We can also visualize the category 1 (red circle) score over the input space. We could do this with a 3D mesh, or a contour plot like the one below. The contour plot is like a topographical map.

[Figure: Contour plot of the category 1 scores over the input space, with the prototypes marked]

The areas where the category 1 score is highest are colored dark red, and the areas where the score is lowest are dark blue. The values range from -0.2 to 1.38.

I’ve included the positions of the prototypes again as black asterisks. You can see how the hills in the output values are centered around these prototypes.

It’s also interesting to look at the weights used by output nodes to remove some of the mystery.

For the category 1 output node, all of the weights for the category 2 RBF neurons are negative:

-0.79934
-1.26054
-0.68206
-0.68042
-0.65370
-0.63270
-0.65949
-0.83266
-0.82232
-0.64140

And all of the weights for category 1 RBF neurons are positive:
0.78968
0.64239
0.61945
0.44939
0.83147
0.61682
0.49100
0.57227
0.68786
0.84207

Finally, we can plot an approximation of the decision boundary (the line where the category 1 and category 2 scores are equal).

[Figure: Approximate decision boundary between the two categories]

To plot the decision boundary, I’ve computed the scores over a finite grid. As a result, the decision boundary is jagged. I believe the true decision boundary would be smoother.
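For reference, here is a rough sketch of how that grid evaluation might look in MATLAB. The grid limits and the helper evaluateScores (which would return one score per category for a 2D input) are hypothetical, not names from the code package:

    u = linspace(-2, 2, 100);
    v = linspace(-2, 2, 100);
    scores1 = zeros(length(v), length(u));
    scores2 = zeros(length(v), length(u));
    for i = 1:length(u)
        for j = 1:length(v)
            s = evaluateScores(Centers, betas, Weights, [u(i), v(j)]);
            scores1(j, i) = s(1);   % category 1 score at this grid point
            scores2(j, i) = s(2);   % category 2 score
        end
    end
    contour(u, v, scores1 - scores2, [0 0]);  % the zero contour is the decision boundary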

Training The RBFN

The training process for an RBFN consists of selecting three sets of parameters: the prototypes (mu) and beta coefficients for the RBF neurons, and the matrix of output weights between the RBF neurons and the output nodes.

There are many possible approaches to selecting the prototypes and their variances. The following paper provides an overview of common approaches to training RBFNs. I read through it to familiarize myself with some of the details of RBF training, and chose specific approaches from it that made the most sense to me.

Selecting the Prototypes

It seems like there’s pretty much no “wrong” way to select the prototypes for the RBF neurons. In fact, two possible approaches are to create an RBF neuron for every training example, or to just randomly select k prototypes from the training data. The reason the requirements are so loose is that, given enough RBF neurons, an RBFN can define any arbitrarily complex decision boundary. In other words, you can always improve its accuracy by using more RBF neurons.

What it really comes down to is a question of efficiency–more RBF neurons means more compute time, so it’s ideal if we can achieve good accuracy using as few RBF neurons as possible.

One of the approaches for making an intelligent selection of prototypes is to perform k-means clustering on your training set and to use the cluster centers as the prototypes. I won’t describe k-means clustering in detail here, but it’s a fairly straightforward algorithm that you can find good tutorials for.

When applying k-means, we first want to separate the training examples by category–we don’t want the clusters to include data points from multiple classes.

Here again is the example data set with the selected prototypes. I ran k-means clustering with a k of 10 twice, once for the first class, and again for the second class, giving me a total of 20 clusters. Again, the cluster centers are marked with a black asterisk ‘*’.

[Figure: Example dataset with the k-means cluster centers marked by black asterisks]

I’ve been claiming that the prototypes are just examples from the training set–here you can see that’s not technically true. The cluster centers are computed as the average of all of the points in the cluster.

How many clusters to pick per class has to be determined “heuristically”. Higher values of k mean more prototypes, which enables a more complex decision boundary but also means more computations to evaluate the network.
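As a sketch, here is how the per-class prototype selection might look in MATLAB, assuming the Statistics Toolbox kmeans function and a label vector y containing 1s and 2s:

    k = 10;                    % prototypes per class (chosen heuristically)
    X1 = X(y == 1, :);         % training examples for class 1
    X2 = X(y == 2, :);         % training examples for class 2
    [~, C1] = kmeans(X1, k);   % cluster centers become the class 1 prototypes
    [~, C2] = kmeans(X2, k);
    Centers = [C1; C2];        % 20 prototypes in total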

Selecting Beta Values

If you use k-means clustering to select your prototypes, then one simple method for specifying the beta coefficients is to set sigma equal to the average distance between all points in the cluster and the cluster center.

\sigma = \frac{1}{m} \sum_{i=1}^{m} \lVert x_i - \mu \rVert

Here, mu is the cluster centroid, m is the number of training samples belonging to this cluster, and x_i is the ith training sample in the cluster.

Once we have the sigma value for the cluster, we compute beta as:

\beta = \frac{1}{2\sigma^2}
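Putting the two equations together, a minimal MATLAB sketch for one cluster might look like this (members and mu are assumed inputs, not names from the code package):

    % members: training points assigned to this cluster (m x n)
    % mu:      the cluster centroid (1 x n)
    dists = sqrt(sum(bsxfun(@minus, members, mu).^2, 2));  % distance of each point to the center
    sigma = mean(dists);        % average distance to the cluster center
    beta  = 1 / (2 * sigma^2);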

Output Weights

The final set of parameters to train are the output weights. These can be trained using gradient descent (in this linear, mean squared error setting, the procedure is also known as least mean squares, or LMS).

First, for every data point in your training set, compute the activation values of the RBF neurons. These activation values become the training inputs to gradient descent.

The linear equation needs a bias term, so we always add a fixed value of ‘1’ to the beginning of the vector of activation values.

Gradient descent must be run separately for each output node (that is, for each class in your data set).

For the output labels, use the value ‘1’ for samples that belong to the same category as the output node, and ‘0’ for all other samples. For example, if our data set has three classes, and we’re learning the weights for output node 3, then all category 3 examples should be labeled as ‘1’ and all category 1 and 2 examples should be labeled as 0.
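Here is a simple batch gradient descent sketch in MATLAB for a single output node; the learning rate and iteration count are arbitrary choices of mine, not values from the code package:

    % A: RBF activations for all m training points, with a leading bias
    %    column of ones (m x (numNeurons + 1))
    % t: labels for this output node, 1 for its category, 0 otherwise (m x 1)
    alpha = 0.01;                 % learning rate
    w = zeros(size(A, 2), 1);     % initial weights
    for iter = 1:1000
        err  = A * w - t;               % prediction error on every training point
        grad = (A' * err) / size(A, 1); % gradient of the mean squared error
        w = w - alpha * grad;
    end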

RBFN as a Neural Network

So far, I’ve avoided using some of the typical neural network nomenclature to describe RBFNs. Since most papers do use neural network terminology when talking about RBFNs, I thought I’d provide some explanation on that here. Below is another version of the RBFN architecture diagram.

[Figure: The RBFN architecture annotated as a 3-layer network]

Here the RBFN is viewed as a “3-layer network” where the input vector is the first layer, the second “hidden” layer is the RBF neurons, and the third layer is the output layer containing linear combination neurons.

One bit of terminology that really had me confused for a while is that the prototype vectors used by the RBFN neurons are sometimes referred to as the “input weights”. I generally think of weights as being coefficients, meaning that the weights will be multiplied against an input value. Here, though, we’re computing the distance between the input vector and the “input weights” (the prototype vector).

Example MATLAB Code

I’ve packaged up the example dataset in this post and my MATLAB code for training an RBFN and generating the above plots. You can find it here.
