Last year, Jeff Dean from Google gave one of the keynotes, and you can watch it here. It’s an hour long, but it’s pretty easy to digest.
Jeff shares a lot about how deep learning is being leveraged by Google–here are some of the insights that stood out to me.
What deep learning can and can’t do currently
If you study machine learning, you know that deep learning has been successful at solving some remarkably difficult tasks. However, you also know that we’re far from having computers that are able to tackle every task that a human can.
I like how Jeff Dean explained this distinction. He said we have great models now for solving a number of specific, difficult tasks, but that we still don’t have the right models for many other complex tasks.
Here are some specific complex tasks he said we now have good models for:
Deep Learning in use at Google
One question that’s often on my mind–how much are these advances in deep learning providing real market value? Do any of these techniques actually provide value to Google’s business?
Jeff said that they’ve launched “more than 50” products with deep learning in the last 2 years (this presentation was given in 2015). Some examples given: photo search, Android speech recognition, StreetView, Ads placement.
Another interesting tidbit–Jeff said they launched their deep net for speech recognition in 2012, and that it uses a smaller net on the phone, and a bigger one back at the datacenter. The smaller one is lower latency (since there’s no communication overhead), but not as accurate. It wasn’t clear whether the smaller one is just for ‘offline’ recognition, or if the two networks serve complementary roles.
GPU usage at Google
Jeff said that Google “regularly” works on models with “dozens of machines” each of which might have 8 GPU cards; so they have “100s of GPU cards” computing on a single copy of a model. I thought that was a nice insight into the scale of their GPU use.
I’d be curious to understand whether they’re using the highest performing Tesla cards, or if they’ve decided it’s better to use a larger number of cheaper cards. On one hand, I imagine that there are cards which have a better dollar / performance ratio than the pricey Tesla cards. But on the other hand, I would also think that it would get expensive (in equipment, maintenance, power, etc.) to have a larger number of servers and to coordinate between them.
Matrix-matrix vs. matrix-vector multiplication
Jeff acknowledges that GPUs are able to accelerate matrix-matrix multiplication operations more efficiently than matrix-vector operations.
This has been my experience as well in measuring GPU performance on some tasks. However, it’s not obvious to me why this should be the case–there’s still plenty of opportunity for parallel computation in a matrix-vector operation, so I don’t think that’s it. It must have more to do with memory access patterns. At any rate, it’s good to hear Jeff’s confirmation of this behavior.
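One plausible explanation is arithmetic intensity: a matrix-matrix multiply performs far more floating point operations per value moved through memory than a matrix-vector multiply, so the compute units stay busy instead of waiting on memory. A rough back-of-the-envelope sketch (the "values moved" counts assume each element is read or written once under ideal caching):

```python
# Arithmetic intensity = flops per value read or written.
# The 2*m*k*n flop count is standard; the memory-traffic figures are the
# idealized minimum (each matrix element touched exactly once).

def matmat_intensity(n):
    flops = 2 * n**3        # multiply + add for each (i, j, k) triple
    values = 3 * n**2       # read A and B, write C
    return flops / values

def matvec_intensity(n):
    flops = 2 * n**2        # multiply + add for each matrix element
    values = n**2 + 2 * n   # read M and x, write y
    return flops / values

n = 1024
print(matmat_intensity(n))  # ~683 flops per value: plenty of data reuse
print(matvec_intensity(n))  # ~2 flops per value: memory-bound
```

For an n x n problem, matrix-matrix multiplication reuses each value about n times, while matrix-vector multiplication touches each matrix element exactly once, which is consistent with the memory-access-pattern hunch above.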
It is described in detail in chapter 5 of their free textbook, and you may also be able to access the video lectures here (PageRank is discussed in week 1).
Because PageRank is used by Google for ranking search results, you’d assume its name is derived from “Ranking Webpages”; however, it’s actually named after one of the authors, Larry Page of Google fame. I think part of the reason that this distinction makes sense is that PageRank is useful for analyzing directed graphs in general, not just the web graph.
The scores of the nodes on the graph are all interrelated as follows.
This can be expressed with linear algebra–we’ll have a matrix ‘M’ that represents the graph connections, and a column vector ‘r’ that holds the score for each node.
M is an adjacency matrix specifying the connections between the nodes. Instead of putting a ‘1’ where the connections are, however, we put a fraction: 1 / the number of nodes it points to.
We can capture our description of the relationship between the scores of the nodes using the matrix ‘M’ and ‘r’ as the expression
Our equation for the node scores has the same form as the equation for an eigenvector, ‘x’:
This means that we can interpret our score vector ‘r’ as an eigenvector of the matrix M with eigenvalue lambda = 1.
Because of some properties of M (that hold because of how we defined it), our score vector ‘r’ will always be the first eigenvector of M, and will always have an eigenvalue of 1.
One way to calculate ‘r’, then, would just be to run eigs on the matrix M and take the principal eigenvector.
In Matlab:
% Adjacency matrix for the graph
M = [0.5, 0.5, 0;
     0.5, 0,   1;
     0,   0.5, 0]

% 'eigs' will return the eigenvectors in order of the magnitude
% of the eigenvalue.
[V, D] = eigs(M);

% The PageRank scores for the nodes will be in the first eigenvector.
r = V(:, 1)

% We can also verify that M*r gives back r...
fprintf('M * r =\n');
M * r
In Python:
import numpy

M = [[0.5, 0.5, 0],
     [0.5, 0,   1],
     [0,   0.5, 0]]

# Find the eigenvectors of M.
w, v = numpy.linalg.eig(M)
I used the ‘eig’ functions above just to show that you can. Because we are only interested in the principal eigenvector, however, there’s actually a computationally simpler way of finding it. It’s called the power iteration. You just initialize the score vector ‘r’ by setting all the values to 1 / the number of nodes in the graph. Then repeatedly evaluate r = M*r. Eventually, ‘r’ will stop changing and converge on the correct scores!
% Adjacency matrix for the graph
M = [0.5, 0.5, 0;
     0.5, 0,   1;
     0,   0.5, 0]

% Initialize the scores to 1 / the number of nodes.
r = ones(3, 1) * 1/3;

% Use the Power Iteration method to find the principal eigenvector.
iterNum = 1;
while (true)
    % Store the previous values of r.
    rp = r;

    % Calculate the next iteration of r.
    r = M * r;

    fprintf('Iteration %d, r =\n', iterNum);
    r

    % Break when r stops changing.
    if ~any(abs(rp - r) > 0.00001)
        break
    end

    iterNum = iterNum + 1;
end

% We can also verify that M*r gives back r...
fprintf('M * r =\n');
M * r
Note: If you run this power iteration method, you’ll find that it gives a different score vector than the ‘eigs’ approach. However, both score vectors are correct–they differ only by a scale factor. An eigenvector is only defined up to a scaling, and the two methods simply settle on different scalings: ‘eigs’ normalizes its result to unit length, while the power iteration here preserves the sum of the entries (the initial sum of 1 is maintained on every iteration because the columns of M each sum to 1).
There is a problem with how we’ve defined PageRank so far. It only works if the matrix M is stochastic (the values in a column sum to 1) and aperiodic (the random surfer can’t get trapped endlessly circling a closed loop of pages). Our simple example satisfied both of these, but the web graph does not. For example, if you have a webpage with no outlinks, then its column in M will be all zeros, and M is no longer stochastic. Or, if you have two pages which point to each other, but no one else, then the surfer gets trapped in that cycle, and M is no longer aperiodic.
We fix this by tweaking the web graph. We add a link between each page and every other page on the internet, and just give these links a very small weight.
Here’s how we express this tweak algebraically. ‘M’ is our original matrix, and ‘A’ is our new modified matrix that we will use to determine the scores: A = beta * M + (1 - beta) * (1/n) * ones(n, n), where ‘beta’ weights the original links (0.85 is a typical value) and ‘n’ is the number of nodes.
And here’s the Matlab code to run PageRank with this modified graph. If you run this code, you’ll find that it produces close to the same result as before, but the implementation is now robust against dead ends and cycles in the graph.
% Adjacency matrix for the graph
M = [0.5, 0.5, 0;
     0.5, 0,   1;
     0,   0.5, 0]

% Tweak the graph to add weak links between all nodes.
beta = 0.85;
n = size(M, 1);
A = beta * M + ((1 - beta) * (1 / n) * ones(n, n))

% Initialize the scores to 1 / the number of nodes.
r = ones(3, 1) * 1/3;

% Use the Power Iteration method to find the principal eigenvector.
iterNum = 1;
while (true)
    % Store the previous values of r.
    rp = r;

    % Calculate the next iteration of r using the modified matrix A.
    r = A * r;

    fprintf('Iteration %d, r =\n', iterNum);
    r

    % Break when r stops changing.
    if ~any(abs(rp - r) > 0.00001)
        break
    end

    iterNum = iterNum + 1;
end

% We can also verify that A*r gives back r...
fprintf('A * r =\n');
A * r
In my installation, this sample can be found here:
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.5\0_Simple\matrixMulCUBLAS\
This example generates two matrices, A and B, filled with random values. The matrices are single precision floating point. The example is going to calculate C = A * B, and it times how quickly CUDA can do this (measured as gigaflops).
Using the default parameters, this example calculates (with matrix sizes shown as [rows x columns]):
C [640 x 320] = A [640 x 320] * B [320 x 320]
A frustrating source of confusion in this example is that B is labeled / generated as having 640 rows, but only the first 320 rows are actually used in the matrix multiplication operation; more on that later. Note, though, that the results and performance measurements are still correct despite this oversight–the example isn’t technically “broken”.
The example also includes a naive, double-for-loop C/C++ implementation of matrix multiplication on the CPU. The results of the two matrix multiplications are compared to ensure that the CUDA implementation is giving the right answer.
There are two sources of confusion with this example. One is a legitimately important detail of working with CUDA that you need to consider and that is worth learning. The other is just stupid and frustrating, and hopefully NVIDIA will fix it in a future version of the example, even though it doesn’t strictly break the code.
In summary:
So what’s this business about row-major and column-major order? It has to do with how matrices are actually laid out in memory. Row-major order means that all of the values in a row are contiguous in memory. Check out this nice example I stole from Wikipedia:
This matrix

    [ 1 2 3
      4 5 6 ]

would be stored as follows in the two orders:

    Row-major order:    1 2 3 4 5 6
    Column-major order: 1 4 2 5 3 6
When a matrix is passed to CUDA, the memory layout stays the same, but now CUDA assumes that the matrix is laid out in column-major order. This won’t cause a buffer overrun, but what it does is effectively transpose the matrix, without actually moving any of the data around in memory.
The assumption in NVIDIA’s example is that, as the user, you want to calculate C = A * B. Your matrices are in C++, so they’re in row-major order, and you want your result matrix C to similarly be in row-major order as well. If you pass the matrices in reverse order, CUDA will calculate B’ * A’, which is equal to C’. But when you take the result into C++, there’s the implicit transpose again, so what you actually get is C.
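A quick numpy sketch (with small made-up matrices) demonstrates both halves of this trick: reinterpreting a row-major buffer in column-major order yields the transpose without moving any data, and passing the operands in reverse order recovers C = A * B:

```python
import numpy as np

# A row-major (C-order) buffer reinterpreted in column-major (Fortran) order
# is the transpose of the original matrix -- no data moves in memory.
A = np.arange(6, dtype=np.float32).reshape(2, 3)              # 2x3, row-major
A_as_colmajor = A.ravel(order='C').reshape(3, 2, order='F')   # same bytes, 3x2
assert np.array_equal(A_as_colmajor, A.T)

# A column-major library handed row-major A and B therefore "sees" A' and B'.
# Passing the operands in reverse order makes it compute B' * A' = (A * B)'.
# Reading that column-major result back as row-major applies the implicit
# transpose one more time, recovering C = A * B.
B = np.arange(12, dtype=np.float32).reshape(3, 4)
C = A @ B
assert np.array_equal((B.T @ A.T).T, C)
print("implicit-transpose trick verified")
```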
Here’s how you interpret the parameters in the code.
The variables uiWA, uiHA, uiWB, uiHB, uiWC, and uiHC are all from the perspective of the row-major C++ matrices. So uiWA is the width (number of columns) in A, uiHA is the height (number of rows) in A, etc.
The default values are as follows
uiWA, uiWB, uiWC = 320
uiHA, uiHB, uiHC = 640
But remember the second point about only using half of B? To make this example more sensical, the default for uiHB should really be 320, since that’s all that’s actually used of B. One piece of evidence to confirm this–if you look at the actual ‘gemm’ call, you’ll notice that the uiHB parameter is unused. Instead, that dimension of the matrix is inferred as being equal to uiWA, which is 320. Want even further proof? Change uiHB to 320 (matrix_size.uiHB = 2 * block_size * iSizeMultiple;) and the code will still run, and the results validation will still pass.
So what we’re going to calculate in this example is C [640 x 320] = A [640 x 320] * B [320 x 320]
Now let’s make sense of the parameters in the ‘gemm’ call. The parameters are messy because we’ve defined them with respect to the row-major matrices, but CUDA wants to know the parameters assuming that the matrices are in column-major order.
‘gemm’ asks for three matrix dimensions (here’s a link to the API doc):
The example also measures the gigaflops that you’re getting from your GPU. Some important notes:
For the matrix multiplication operation:
C [m x n] = A [m x k] * B [k x n]
The number of floating point operations required is 2 * m * k * n. The factor of two is there because you do a multiply and an accumulate for each pair of values in the calculation.
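For the example’s default sizes, that works out as follows (the kernel time below is an assumed example value, not a measurement):

```python
# Flop count for the example's default sizes:
# C [640 x 320] = A [640 x 320] * B [320 x 320], i.e. m=640, k=320, n=320.
m, k, n = 640, 320, 320
flops = 2 * m * k * n        # one multiply + one add per (i, j, k) triple
print(flops)                 # 131072000, i.e. ~0.13 GFLOP per multiply

# Converting a kernel time to gigaflops; 0.5 ms is an assumed example timing.
elapsed_seconds = 0.0005
gigaflops = flops / elapsed_seconds / 1e9
```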
In fact, classification is really just a specific case of function approximation. In classification, the function you are trying to approximate is a score (from 0 to 1) for a category.
Below is a simple example of an RBFN applied to a function approximation problem with a 1 dimensional input. A dataset of 100 points (drawn as blue dots) is used to train an RBFN with 10 neurons. The red line shows the resulting output of the RBFN evaluated over the input range.
How does the RBFN do this? The output of the RBFN is actually the sum of 10 Gaussians (because our network has 10 RBF neurons), each with a different center point and different height (given by the neuron’s output weight). It gets a little cluttered, but you can actually plot each of these Gaussians:
In case you’re curious, the horizontal line corresponds to the bias term in the output node–a constant value that’s added to the output sum.
The Matlab code for producing the above plots is included at the end of this post.
In practice, there are three things that tend to be different about an RBFN for function approximation versus one for classification.
In most of the applications I’ve encountered of RBFNs for function approximation, you have a multidimensional input, but just a single output value. For example, you might be trying to model the expected sale value of a home based on a number of different input parameters.
The number of output nodes you need in an RBFN is given by the number of output values you’re trying to model. For classification, you typically have one node per output category, each outputing a score for their respective category. For our housing price prediction example, we have just one output node spitting out a sale price in dollars.
Here’s the architecture diagram we used for RBFNs for classification:
And here’s what it looks like for function approximation:
The difference in training is pretty straightforward. Each training example should have a set of input attributes, as well as the desired output value.
For classification, the desired output was either 0 or 1: ‘1’ if the training example belonged to the same category as the output node, and 0 otherwise. Training then optimizes the weights to get the output as close as possible to these desired values.
For function approximation the desired output is just the output value associated with the training example. In our housing price example, the training data would be examples of homes that have previously been sold, and the price they were sold at. So the desired output value is just the actual sell price.
Recall that each RBF neuron applies a Gaussian to the input. We all know from studying bell curves that an important parameter of the Gaussian is the standard deviation–it controls how wide the bell is. That same parameter exists here with our RBF neurons; you’d probably just interpret it a little differently. It still controls the width of the Gaussian, which means it determines how much of the input space the RBF neuron will respond to.
For RBFNs, instead of talking about the standard deviation (‘sigma’) directly, we use the related value ‘beta’: beta = 1 / (2 * sigma^2)
Here are some examples of how different beta values affect the width of the Gaussian.
For classification, there are some good techniques for “learning” a good width value to use for each RBF neuron. The same technique doesn’t seem to work as well for function approximation, however, and it seems that a more primitive approach is often used. The width parameter is user provided, and is the same for every RBF neuron.
This means that the parameter is generally selected through experimentation. You can try different values, and then see how well the trained network performs on some holdout validation examples. It’s important to optimize the parameter using holdout data, because otherwise it’s too easy for an RBFN to overfit the training data and generalize poorly.
The final difference with function approximation RBFNs is that normalizing the RBF neuron activations often improves the accuracy of the approximation.
What is meant by normalization here? Every RBF neuron is going to produce an “activation value” between 0 and 1. To normalize the output of the RBFN, we simply divide the output by the sum of all of the RBF neuron activations.
Here is the equation for an RBFN without normalization:
To add normalization, we just divide by the sum of all of the activation values:
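As a concrete sketch of both versions in numpy: each neuron’s activation is exp(-beta * (x - center)^2), the plain output is the weighted sum of activations plus the bias, and the normalized version divides the activations by their sum first. The centers, betas, weights, and bias below are made-up illustrative values, not a trained network:

```python
import numpy as np

def rbfn_output(x, centers, betas, weights, bias, normalize=False):
    """Evaluate a 1-D RBFN at scalar input x.

    Each neuron's activation is exp(-beta * (x - center)^2). The output is
    the weighted sum of the activations plus a bias; with normalize=True,
    the activations are first divided by their sum."""
    phi = np.exp(-betas * (x - centers) ** 2)
    if normalize:
        phi = phi / phi.sum()
    return weights @ phi + bias

# Made-up parameters for a 3-neuron network (not a trained model).
centers = np.array([-1.0, 0.0, 1.0])
betas   = np.array([ 2.0, 2.0, 2.0])
weights = np.array([ 0.5, 1.0, -0.3])
bias = 0.1

y_plain = rbfn_output(0.2, centers, betas, weights, bias)
y_norm  = rbfn_output(0.2, centers, betas, weights, bias, normalize=True)
```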
I’ve tried to come up with a good rationalization for why normalization improves things, but so far I’ve come up short.
There may be some insight gained from looking at Gaussian Kernel Regression… Gaussian Kernel Regression is just a particular case of a normalized RBFN that doesn’t require training. You create an RBF neuron for every single training example, and the output weights are just set equal to the output values of the training examples. In this case, the model is just calculating a distance-weighted average of the training example output values. It is fairly intuitive why normalization produces good results in this case. The intuition doesn’t quite apply to a trained RBFN, though, because the output weights are learned through optimization, and don’t necessarily correspond to the desired output values.
RBFN for function approximation example code.
I’ve added the function approximation code to my existing RBFN classification example. For function approximation, you can run ‘runRBFNFuncApproxExample.m’. It uses many of the same functions as the classification RBFN, except that you train the RBFN with ‘trainFuncApproxRBFN.m’ and evaluate it with ‘evaluateFuncApproxRBFN.m’.
I found that a good way to get started with scikit-learn on Windows was to install Python(x, y), a bundled distribution of Python that comes with lots of useful libraries for scientific computing. During the installation, it lets you select which components to install–I’d recommend simply doing the ‘complete’ installation. Otherwise, make sure to check scikit-learn.
One thing it comes with that I’ve liked is the Spyder IDE. Spyder feels a lot like the Matlab IDE, which I’m a fan of, and integrates a code editor, console, and variable browser.
This example has a number of command line options, but you can run it as-is without setting any of them.
The example should run fast–it only takes a few seconds to complete with the default parameters.
The data used comes from the “20 Newsgroups” dataset. Newsgroups were the original discussion forums, and this dataset contains posts from 20 different topics:
comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x

rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey

sci.crypt, sci.electronics, sci.med, sci.space

misc.forsale, talk.politics.misc, talk.politics.guns, talk.politics.mideast

talk.religion.misc, alt.atheism, soc.religion.christian
By default, this example just selects four of the categories (‘alt.atheism’, ‘talk.religion.misc’, ‘comp.graphics’, ‘sci.space’) to cluster. There are a total of 3,387 entries across these four categories.
If you use the full dataset (all 20 topics), there are a total of 18,846 entries.
The text data needs to be turned into numerical vectors. This is done with an object labeled the ‘vectorizer’ in the code. The default vectorizer method is the tf-idf approach. For each document, it will produce a vector with 10,000 components (10,000 is the default number; this can be modified with a command line option).
The TfidfVectorizer object has a number of interesting properties.
It will strip all English “stop words” from the document. Stop words are really common words that don’t contribute to the meaning of the document. There are actually many of these words–take a quick look here for some examples.
It will also filter out terms that occur in more than half of the documents (max_df=0.5) and terms that occur in only one document (min_df=2).
To enforce the maximum vector length of 10,000, it will sort the terms by the number of times they occur across the corpus, and only keep the 10,000 words with the highest counts.
Finally, the vectorizer normalizes each vector to have an L2 norm of 1.0. This is important for normalizing the effect of document length on the tf-idf values. An interesting fact is that if you normalize the vectors (as the example does), then comparing the L2 distances is equivalent to using the cosine similarity to compare the vectors.
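It’s easy to verify that equivalence numerically: for unit-length vectors, the squared L2 distance is exactly 2 minus twice the cosine similarity, so the two measures produce the same rankings:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(10)
b = rng.random(10)
a /= np.linalg.norm(a)   # L2-normalize, as the vectorizer does
b /= np.linalg.norm(b)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b),
# so ranking by Euclidean distance = ranking by cosine similarity.
dist_sq = np.sum((a - b) ** 2)
cosine  = np.dot(a, b)
assert abs(dist_sq - (2 - 2 * cosine)) < 1e-12
```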
The code can optionally use the HashingVectorizer instead. The HashingVectorizer is faster, but speed doesn’t seem to be a real concern here.
The HashingVectorizer still just counts the terms, but it does this more efficiently by using feature hashing. Instead of using a hash map to hash words to buckets which contain the word’s index in the term vector (word -> bucket -> vector index), you hash the word directly to a vector index (word -> vector index). This means you don’t have to build a hash table, but it carries the risk of hash collisions. The risk of two important terms colliding to the same index is low, though, so this trick works well in practice.
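Here’s a minimal sketch of the idea, using crc32 as a stand-in hash (scikit-learn’s actual HashingVectorizer uses a faster hash plus a signed trick to reduce collision bias):

```python
import zlib

def hashing_vectorize(tokens, dim=1024):
    """Map tokens straight to vector indices (word -> vector index),
    with no vocabulary or hash table. Distinct words that hash to the
    same index simply have their counts folded together."""
    vec = [0] * dim
    for tok in tokens:
        idx = zlib.crc32(tok.encode('utf8')) % dim
        vec[idx] += 1
    return vec

v = hashing_vectorize("the quick brown fox the".split())
assert sum(v) == 5   # every token landed in some bucket
```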
The example includes optional dimensionality reduction using “Latent Semantic Analysis” (LSA). This is really just using Singular Value Decomposition (SVD), and it’s called LSA in the context of text data. It’s referred to as “Truncated SVD” because we’re only projecting onto a portion of the vectors in order to reduce the dimensionality.
If you’re familiar with dimensionality reduction using Principal Component Analysis (PCA), this is also the same thing! My understanding of PCA vs. SVD is that they both arrive at the principal components, but SVD has some advantages in how it’s calculated, so it’s used more often in practice.
Try using LSA by passing the command line flag "--lsa=256" to reduce the vectors down to 256 components each. Not only does the clustering run faster, but you’ll find that the accuracy increases significantly!
LSA can be thought of as a kind of feature extraction. In this case we are identifying the top 256 features, and eliminating the rest. Eliminating the less discriminative features can improve the quality of the distance calculation as a metric of similarity, since it’s not incorporating the difference between unimportant features.
Clustering is performed either using the standard k-means clustering algorithm, or a modified version referred to as “MiniBatch KMeans”.
You can read more about MiniBatch KMeans in the original paper from Google here–it’s only 2 pages. Basically, it performs iterations using a randomly selected subset of the data. By default, the scikit-learn example uses a batch size of 1,000 (which is a little less than a third of the data).
Initialization is done using “k-means++” by default; this technique is well-described on Wikipedia here. Essentially, the initial cluster centers are still taken from the data, but are chosen so that they are spread out.
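Here’s a minimal 1-D sketch of the k-means++ seeding step (illustrative only–scikit-learn’s implementation is more elaborate): the first center is picked uniformly at random, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far, which naturally favors spread-out points:

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """k-means++ seeding sketch for 1-D data."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest existing center.
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        # Sample a point with probability proportional to d2.
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

pts = [0.0, 0.1, 0.2, 10.0, 10.1, 20.0]
centers = kmeans_pp_init(pts, 3)
```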
A number of metrics are provided for assessing the quality of the resulting clusters.
Homogeneity, Completeness, and the V-Measure scores are all related. All three of these range from 0 to 1.0, with 1.0 being a perfect match with the ground truth labels. Homogeneity measures the degree to which the clusters contain only elements of the same class. Completeness measures the degree to which all of the elements belonging to a certain category are found in a single cluster. You can cheat each of these individually: to cheat on homogeneity, just assign every data point to its own cluster. To cheat on completeness, just group all of the items into a single cluster. So, the V-Measure combines the two metrics into a single value so that there’s no cheating :).
The Adjusted Rand Index tells you how the clustering is doing compared to random guessing. Random labeling yields a score of 0, while perfect labeling yields 1.0.
Here are some V-Measure scores I got from trying different parameters:
I averaged these scores over 5 runs; however, the results vary so much from run to run that for an accurate comparison I’d recommend averaging the results over 100 runs.
I first learned about this topic through Stanford’s Mining of Massive Datasets (“MMDS”) course available for free on Coursera here. What’s especially great about that course is that the authors also provide their textbook online for free! You can find the textbook here, with a separate PDF file for each chapter. Chapter 3 covers the MinHash algorithm, and I’d refer you to that text as a more complete discussion of the topic.
On to the tutorial!
There is an interesting computing problem that arises in a number of contexts called “set similarity”.
Let’s say you and I are both subscribers to Netflix, and we’ve each watched roughly 100 movies on Netflix. The list of movies I’ve seen is a set, and the list of movies you’ve seen is another set. To measure the similarity between these two sets, you can use the Jaccard Similarity, which is given by the intersection of the sets divided by their union. That is, count the number of movies we’ve both seen, and divide that by the total number of unique movies that we’ve both collectively seen.
If we’ve each watched exactly 100 movies, and 50 of those were seen by both of us, then the intersection is 50 and the union is 150, so our Jaccard Similarity is 1/3.
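In code, the Jaccard Similarity is a one-liner over Python sets; the movie IDs below are made up just to reproduce the example numbers:

```python
def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Two 100-movie watch histories sharing 50 titles -> union of 150.
mine  = set(range(0, 100))
yours = set(range(50, 150))
print(jaccard(mine, yours))   # 50 / 150 = 1/3
```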
What seems to be the more common application of “set similarity” is the comparison of documents. One way to represent a document would be to parse it for all of its words, and represent the document as the set of all unique words it contains. In practice, you’d hash the words to integer IDs, and then maintain the set of IDs present in the document.
By representing the documents as sets of words, you could then use the Jaccard Similarity as a measure of how much overlap there is between two documents.
It’s important to note that we’re not actually extracting any semantic meaning of the documents here, we’re simply looking at whether they contain the same words. This technique of comparing documents probably won’t work as well, for example, for comparing documents that cover similar concepts but are otherwise completely unique.
Instead, the applications of this technique are found where there’s some expectation that the documents will specifically contain a lot of the same words.
One example is aggregating news articles. When the Associated Press releases an article about a particular event, many news agencies will take the AP article, perhaps modify it some, and publish it on their website. A news aggregator needs to recognize that a group of articles are really all based on the same AP article about one particular story. Comparing the web pages using this “similar sets” approach is one way to accomplish this.
Another example is detecting plagiarism. The dataset used in my example code is a large collection of articles, some of which are plagiarisms of each other (where they’ve been just slightly modified).
You might say that these are all applications of “near-duplicate” detection.
A small detail here is that it is more common to parse the document by taking, for example, each possible string of three consecutive words from the document (e.g., “A small detail”, “small detail here”, “detail here is”, etc.) and hashing these strings to integers. This retains a little more of the document structure than just hashing the individual words. This technique of hashing substrings is referred to as “shingling”, and each unique string is called a “shingle”.
Another shingling technique that’s described in the Mining of Massive Datasets textbook is k-shingles, where you take each possible sequence of ‘k’ characters. I’m not clear on the motivation of this approach—it may have to do with the fact that it always produces strings of length ‘k’, whereas the three-word approach produces variable length strings.
In the example code, I’m using three-word shingles, and it works well.
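A minimal sketch of three-word shingling looks like this (crc32 is used here as one convenient choice of 32-bit hash):

```python
import zlib

def three_word_shingles(text):
    """Return the set of hashed three-word shingles for a document."""
    words = text.split()
    shingles = set()
    for i in range(len(words) - 2):
        shingle = ' '.join(words[i:i + 3])
        # Hash each shingle string down to a 32-bit integer ID.
        shingles.add(zlib.crc32(shingle.encode('utf8')))
    return shingles

s = three_word_shingles("a small detail here is that")
assert len(s) == 4   # "a small detail", "small detail here", ...
```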
So far, this all sounds pretty straightforward and manageable. Where it gets interesting is when you look at the compute requirements for doing this for a relatively large number of documents.
Let’s say you have a large collection of documents, and you want to find all of the pairs of documents that are near-duplicates of each other. You’d do this by calculating the Jaccard similarity between each pair of documents, and then selecting those with a similarity above some threshold.
To compare each document to every other document requires a lot of comparisons! It’s not quite N-squared comparisons, since that would include doing a redundant comparison of ‘a’ to ‘b’ and ‘b’ to ‘a’, as well as comparing every document to itself.
The number of comparisons required is given by the following formula, which is pronounced “N-choose-2”: C(N, 2) = N(N - 1) / 2
As noted in the equation, a good approximation is N^2 / 2 (this approximation is equivalent to comparing each document pair only once, but also needlessly comparing each document to itself).
Let’s say we have a collection of 1 million documents, and that on average, a PC can calculate the Jaccard similarity between two sets in 1ms per pair.
First, let’s calculate the rough number of comparisons required: (10^6)^2 / 2 = 5 x 10^11 comparisons.
Next, the amount of time required: 5 x 10^11 comparisons x 1 ms each = 5 x 10^8 seconds, or roughly 16 years.
16 years of compute time! Good luck with that. You’d need 1,000 servers just to get the compute time down to a week. But there’s a better way…
MinHash Signatures
The MinHash algorithm will provide us with a fast approximation to the Jaccard Similarity between two sets.
For each set in our data, we are going to calculate a MinHash signature. The MinHash signatures will all have a fixed length, independent of the size of the set. And the signatures will be relatively short—in the example code, they are only 10 components long.
To approximate the Jaccard Similarity between two sets, we will take their MinHash signatures, and simply count the number of components which are equal. If you divide this count by the signature length, you have a pretty good approximation to the Jaccard Similarity between those two sets.
We can compare two MinHash signatures in this way much quicker than we can calculate the intersection and union between two large sets. This is partly because the MinHash signatures tend to be much shorter than the number of shingles in the documents, and partly because the comparison operation is simpler.
In the example code, we have a collection of 10,000 articles which contain, on average, 250 shingles each. Computing the Jaccard similarities directly for all pairs takes 20 minutes on my PC, while generating and comparing the MinHash signatures takes only about 2 minutes and 45 seconds.
MinHash Algorithm
The MinHash algorithm is actually pretty easy to describe if you start with the implementation rather than the intuitive explanation.
The key ingredient to the algorithm is that we have a hash function which takes a 32-bit integer and maps it to a different integer, with no collisions. Put another way, if you took the numbers 0 – (2^32 – 1) and applied this hash function to all of them, you’d get back a list of the same numbers in random order.
To demystify it a bit, here is the definition of the hash function, which takes an input integer ‘x’: h(x) = (a * x + b) mod c
The coefficients ‘a’ and ‘b’ are randomly chosen integers less than the maximum value of ‘x’. ‘c’ is a prime number slightly bigger than the maximum value of ‘x’.
For different choices of ‘a’ and ‘b’, this hash function will produce a different random mapping of the values. So we have the ability to “generate” as many of these random hash functions as we want by just picking different values of ‘a’ and ‘b’.
So here’s how you compute the MinHash signature for a given document. Generate, say, 10 random hash functions. Take the first hash function, and apply it to all of the shingle values in a document. Find the minimum hash value produced (hey, “minimum hash”, that’s the name of the algorithm!) and use it as the first component of the MinHash signature. Now take the second hash function, and again find the minimum resulting hash value, and use this as the second component. And so on.
So if we have 10 random hash functions, we’ll get a MinHash signature with 10 values.
We’ll use the same 10 hash functions for every document in the dataset and generate their signatures as well. Then we can compare the documents by counting the number of signature components in which they match.
That’s it!
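The whole procedure can be sketched in a few lines of Python (a toy version for illustration; the shingle IDs below are made up, and the example code differs in its details):

```python
import random

def minhash_signature(shingle_ids, hash_fns):
    """One signature component per hash function: the minimum hash value
    over all of the document's shingles."""
    return [min(h(x) for x in shingle_ids) for h in hash_fns]

def signature_similarity(sig_a, sig_b):
    """Fraction of signature components in which the two documents match."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

# Generate 10 random hash functions of the form (a*x + b) mod c,
# where c is a prime slightly larger than the maximum shingle value.
c = 4294967311
random.seed(0)
hash_fns = []
for _ in range(10):
    a, b = random.randint(1, c - 1), random.randint(0, c - 1)
    hash_fns.append(lambda x, a=a, b=b: (a * x + b) % c)

# Two toy "documents" represented by their shingle IDs.
doc1 = {5, 12, 99, 2048, 10481}
doc2 = {5, 12, 99, 4096, 33333}
sig1 = minhash_signature(doc1, hash_fns)
sig2 = minhash_signature(doc2, hash_fns)
print(signature_similarity(sig1, sig2))
```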
Why Does It Work?
The reason MinHash works is that the expected value of the MinHash similarity (the number of matching components divided by the signature length) can be shown to be equal to the Jaccard similarity between the sets. We’ll demonstrate this with a simple example involving two small sets:
Set A (32, 3, 22, 6, 15, 11)
Set B (15, 30, 7, 11, 28, 3, 17)
There are three items in common between the sets (3, 11, and 15), and 10 unique items between the two sets. Therefore, these sets have a Jaccard similarity of 3/10.
Let’s look first at the probability that, for just a single MinHash signature component, we end up computing the same MinHash value for both sets.
The MinHash calculation can be thought of as taking the union of the two sets, shuffling the items into a random order, and selecting the first value in that new order. (Side note: the actual value used in the MinHash signature is not the original item ID, but the hashed value of that ID—but that’s not important for our discussion of the probabilities).
So if you take the union of these two sets,
Union (32, 3, 22, 6, 15, 11, 30, 7, 28, 17)
and then randomly shuffle them, what are the odds that one of the three shared items (3, 11, or 15) ends up first in the list? It’s given by the number of common items (3) divided by the total number of items (10), or 3/10, the same as the Jaccard similarity.
The probability that a given MinHash value will come from one of the shared items is equal to the Jaccard similarity.
Now we can go back to look at the full signature. Continuing with our simple example, if we have a MinHash signature with 20 components, on average, how many of those MinHash values would we expect to be in common? The answer is the number of components (20) times the probability of a match (3/10), or 6 components. The expected value of the MinHash similarity, then, would be 6/20 = 3/10, the same as the Jaccard similarity.
The expected value of the MinHash similarity between two sets is equal to their Jaccard similarity.
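You can check this empirically with a quick simulation (illustrative Python, not part of the example code):

```python
import random

A = {32, 3, 22, 6, 15, 11}
B = {15, 30, 7, 11, 28, 3, 17}

jaccard = len(A & B) / len(A | B)  # 3 shared / 10 total = 0.3

# Estimate the same quantity with MinHash: draw many random hash
# functions and count how often the minimum hash value agrees.
c = 4294967311  # prime slightly larger than 2^32 - 1
random.seed(42)
n = 1000
matches = 0
for _ in range(n):
    a, b = random.randint(1, c - 1), random.randint(0, c - 1)
    if min((a * x + b) % c for x in A) == min((a * x + b) % c for x in B):
        matches += 1

estimate = matches / n
print(jaccard, estimate)  # the estimate should land close to 0.3
```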
Example Python Code
You can find my example code on GitHub here.
If you’re not familiar with GitHub, fear not. Here’s the direct link to the zip file containing all of the code.
Notes on the history of the code
After working through this material in the MMDS course, I played with the Python code from GitHub user rahularora here. It’s a complete implementation, but it had a couple of important issues (including one fatal bug) that I’ve addressed.
The most important notes:
You’ve probably seen that GPUs are gaining popularity for machine learning because of their inherent parallelism. You may have also heard of renting GPU instances from Amazon for machine learning work. You might suppose, therefore (as we did!), that renting an Amazon GPU instance would be a good way to gain access to a high-performance GPU.
Here is the key insight I wanted to share:
The motivation for renting a GPU instance at Amazon* is not about getting access to a high performance GPU, but rather for the ability to cheaply gain access to many GPU instances for running parallel experiments.
*This applies to Amazon specifically–I’ll discuss shortly our experience with other providers.
We first discovered this insight empirically. I implemented a benchmark test that calculates the L1 distance between a large number of vectors.
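The CUDA benchmark itself isn’t reproduced here, but the computation it times can be sketched in NumPy (the array sizes are made up for illustration):

```python
import numpy as np

def l1_distances(X, C):
    """L1 (Manhattan) distance between every row of X and every row of C.
    Returns a (num_X, num_C) matrix of distances."""
    # Broadcasting: (n, 1, d) - (1, k, d) -> (n, k, d), then sum over d.
    return np.abs(X[:, None, :] - C[None, :, :]).sum(axis=2)

rng = np.random.default_rng(0)
X = rng.random((1000, 256))  # input vectors, one per row
C = rng.random((100, 256))   # vectors to compare against
D = l1_distances(X, C)
print(D.shape)  # (1000, 100)
```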
On my PC, I have a GeForce GTX 660 ($190 in September, 2013), which has 960 CUDA cores and runs at 980MHz.
Amazon’s GPU instance, named “g2.2xlarge”, has a GRID K520. It’s a board from NVIDIA designed for “cloud gaming”–that is, multiple concurrent users. It has 2 GK104 GPUs with 1,536 CUDA cores each running at 800MHz.
Seems like it should deliver pretty decent performance, but the benchmark results showed otherwise.
To transfer the data to the card for the benchmark, it took 70ms on my PC and 290ms on the Amazon instance (~4.1x slower). For the actual calculations, it took about 190ms on my PC and 430ms on the Amazon instance (~2.3x slower).
What’s the reason for this disparity? I don’t know the detailed answer, but it has to do with the virtualization Amazon uses on their instances. Netflix mentioned this issue in one of their blog posts:
“In a virtualized environment such as the AWS cloud, these accesses cause a trap in the hypervisor that results in even slower access.”
I also posted our issue to Reddit’s Machine Learning community here, and got some helpful replies. Essentially: (1) This is in fact what you should expect, and (2) The point of Amazon instances is to run multiple experiments in parallel at low cost.
Our goal was to benchmark our hardware against high-end GPUs, so we wanted full-speed access to something more potent than the GRID board at Amazon. Luckily, there are other services that will give you this. We landed on Nimbix, which gives us “bare-metal” access (read: no virtualization) to a Tesla K40. On the Nimbix machine, the calculation step of the benchmark completes about 4.8x faster than on my PC. Awesome!
The trade-off is that it’s more expensive–$5/hr. But for our purposes, it’s great–we’re running fairly short tests, and they charge you in fractions of an hour (that is, for the precise amount of time that you have the instance up).
Overall, the ability to rent GPU instances cheaply from Amazon for research is awesome. Just make sure you have the right performance expectations!
The code is neat and well written, and I think it’s a huge service to the engineering community when people share their code like this. That said, there’s almost no documentation, explanation, or comments provided, so I’m attempting to partially rectify that here. Down at the bottom of the post you can find a link to my documented / commented versions of many of the CNN functions.
Should You Use DeepLearnToolbox?
Before we dive in, I think it’s worth pointing out that the CNN code in the DeepLearnToolbox is a few years old now, and deep learning has been evolving rapidly. Other than the fact that it’s a multi-layered Convolutional Neural Network, this code doesn’t appear to use any of the innovations typically associated with “Deep Learning”. For example, there is no unsupervised feature learning going on here, nor any layer-wise pre-training. It’s just backpropagation over labeled training examples. Also, it still uses the sigmoid activation function, whereas many current deep networks use other activation functions such as the ReLU.
That said, I’ve found the code very informative for helping me understand some of the basics of CNNs. At only 18 neurons, the example network is fairly small and simple (relatively speaking–even small CNNs can feel like a rat’s nest of parameters and connections!). So if, like me, your goal is to learn more about CNNs, keep reading on.
If you really just want a good library for building state-of-the-art CNNs, you might look elsewhere. There’s an extensive list of libraries here, but it doesn’t seem to be very well organized by quality or relevance. You might find some helpful guidance from these discussions on reddit in /r/machinelearning here and here.
Where I’m Coming From
I learned some of the basics of Machine Learning from Andrew Ng’s Machine Learning course on Coursera, including Neural Networks (the Multi-Layer Perceptron). After that, I learned a lot of the fundamentals of deep learning from Stanford’s Deep Learning Tutorial (which Ng also contributed to).
The Stanford tutorial was very helpful, and you even get to build a CNN in the last exercise. However, it uses different terminology and coding styles than some of the other important work out there, and doesn’t show you how to build a CNN with multiple hidden layers. I found this DeepLearnToolbox MATLAB code to be very informative for filling in some of my missing knowledge.
Terminology
One of the most helpful aspects of this exercise for me has been the chance to learn some of the terminology used by Convolutional Neural Networks. There are a lot of redundant terms here. It’s not a terrible thing, though; each different label can help you interpret the same concept in a different way. The key is just knowing what they all mean.
Convolution, filter mask, and kernel
Some of the language in CNNs seems to be taken from image processing. In image processing, a common operation is to apply a filter to an image. The filter has a small “filter mask” or “kernel” which is “convolved” with the image. (If you’re unfamiliar with filters and convolutions, I have a post on it here, though it’s not my best work.)
If you think about the operation performed between the filter mask and a single patch of the image, it’s the same basic operation as an MLP neuron–each of the corresponding components are multiplied together, then summed up. It’s the same as taking the dot product between the image patch and the filter mask if you were to unwind them into vectors. The difference in terminology really just comes from the 2D structure of the image.
So when you see the words “filter” or “kernel”, these just correspond to the weights of a single neuron.
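A quick way to convince yourself of this equivalence (illustrative Python; the patch and kernel values are arbitrary):

```python
import numpy as np

# A 3x3 image patch and a 3x3 filter mask (values chosen arbitrarily;
# this kernel is symmetric, so convolution vs. correlation doesn't matter).
patch = np.array([[1, 2, 0],
                  [0, 1, 3],
                  [2, 1, 1]], dtype=float)
kernel = np.array([[0, 1, 0],
                   [1, -4, 1],
                   [0, 1, 0]], dtype=float)

# The filter response at this patch: elementwise multiply, then sum...
response = (patch * kernel).sum()
# ...which is exactly the dot product of the "unwound" vectors,
# i.e. the same computation an MLP neuron performs on its inputs.
dot = patch.flatten() @ kernel.flatten()
assert response == dot
print(response)  # 2.0
```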
Feature
Another term I like to use for these same weight values is “feature”. The techniques in Unsupervised Feature Learning show us that we can really think of a neuron’s weights as a particular “feature” that the neuron is responding to.
Maps
But wait, there’s more! CNNs introduce another term, which is very helpful in reasoning about them. The result of a convolution between a single filter mask and an image is referred to as a “map”. A single map is a 2D matrix which is the result of applying a filter over the entire image. Note that the map will be a little smaller than the image (by the kernel width minus one, in each dimension), because the filter can’t be applied past the edges of the image.
Each layer of a CNN will output multiple maps, one for each of the “neurons” / “kernels” / “filters” / “features” in that layer.
Each layer of the CNN applies its filters to the maps output by the previous layer.
Architecture In Detail
To understand a particular network architecture, I think it makes the most sense to start by looking at the feedforward operation.
Layer 1
For this example, we’re working with the MNIST handwritten digit dataset. The input to the CNN, therefore, is a 28×28 pixel grayscale image. This is the input to the network, but we also treat it in the code as the “output of layer 1”.
Layer 2
The first hidden layer of the network, “Layer 2”, is going to perform some convolutions over the image. The filter mask will be 5×5 pixels in size. Convolving a 5×5 filter with a 28×28 pixel image yields a 24×24 filtered image. This layer has 6 distinct filters (or 6 neurons, if you like) that we’re going to convolve with the input image. These six convolutions will generate 6 separate output maps, giving us a 24x24x6 matrix as the output of layer 2.
Layer 3
Layer 3 is a pooling layer. The pooling operation we’re using here is just averaging, and the pooling regions are very small: just 2×2 pixels. This has the effect of subsampling the output maps by a factor of 2 in both dimensions, so we get a 12x12x6 matrix.
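The pooling operation is simple enough to sketch in a few lines (illustrative Python using NumPy’s reshape trick; the toolbox itself does this in MATLAB):

```python
import numpy as np

def avg_pool_2x2(m):
    """Average-pool a 2D map over non-overlapping 2x2 regions."""
    h, w = m.shape
    # Group the map into 2x2 blocks, then average within each block.
    return m.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

m = np.arange(16, dtype=float).reshape(4, 4)
pooled = avg_pool_2x2(m)
print(pooled)  # [[ 2.5  4.5]
               #  [10.5 12.5]]
```

Applied to each of the six 24x24 maps, this yields the 12x12x6 output described above.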
Layer 4
Layer 4 is another convolution layer. Again we’ll use a kernel size of 5×5. In Layer 4, we have 12 distinct filters that we will apply. That is, it contains 12 neurons.
There is an important detail here, though, that you don’t want to miss. When we performed our convolutions on the original input image, the image only had a depth of 1 (because it’s grayscale). The output of Layer 3, though, has a depth of 6.
A single neuron in layer 4 is going to be connected to all 6 maps in layer 3.
There are different ways of thinking about how this is handled.
One way is to say that our Layer 4 filters have a size of 5x5x6. That is, each filter in Layer 4 has 150 unique weight values in it.
In the code, however, convolutions are only done with two-dimensional filters, and it becomes a little hairy to describe. Instead of having 12 5x5x6 filters, we have 72 5x5x1 filters. For each of the 12 Layer 4 filters, there are 6 separate 5x5x1 kernels. To apply a single Layer 4 filter, we actually perform 6 convolutions (one for each output map in Layer 3), and then sum up all of the resulting maps to make a single 8x8x1 output map for that filter. This is done for each of the 12 filters to create the 8x8x12 output of Layer 4.
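Here’s a sketch of that sum-of-convolutions step in Python (a toy reimplementation just to show the shapes; the toolbox’s actual code is MATLAB):

```python
import numpy as np

def conv2d_valid(img, kern):
    """Naive 'valid' 2D convolution (no kernel flipping; cross-correlation)."""
    H, W = img.shape
    k = kern.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i+k, j:j+k] * kern).sum()
    return out

rng = np.random.default_rng(0)
maps = rng.random((12, 12, 6))    # the output of Layer 3
kernels = rng.random((5, 5, 6))   # one Layer 4 filter, as six 5x5 kernels

# Convolve each input map with its own 5x5 kernel, then sum the results
# to get this filter's single 8x8 output map.
out = sum(conv2d_valid(maps[:, :, d], kernels[:, :, d]) for d in range(6))
print(out.shape)  # (8, 8)
```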
Layer 5
Finally, we perform one last pooling operation that’s identical to the one in Layer 3–we just subsample by a factor of 2 in each dimension. The resulting output maps are unwound into our final feature vector containing 192 values (4x4x12 = 192).
This feature vector is then classified using simple linear classifiers (one per category) on the output.
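The dimension bookkeeping above can be checked with a few lines of arithmetic (Python, purely illustrative):

```python
def conv_out(n, k):
    return n - k + 1   # a 'valid' convolution shrinks each side by k - 1

def pool_out(n, p):
    return n // p      # non-overlapping pooling subsamples by p

size = 28                  # Layer 1: the 28x28 input image
size = conv_out(size, 5)   # Layer 2: 6 maps of 24x24
assert size == 24
size = pool_out(size, 2)   # Layer 3: 6 maps of 12x12
assert size == 12
size = conv_out(size, 5)   # Layer 4: 12 maps of 8x8
assert size == 8
size = pool_out(size, 2)   # Layer 5: 12 maps of 4x4
assert size == 4
assert size * size * 12 == 192   # length of the final feature vector
```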
Some final notes on the code:
The Code
To get the example code, first download the DeepLearnToolbox from its GitHub page. It’s just a directory of MATLAB functions, so there’s nothing special you need to do to install it other than adding it to your path.
Then, you can download my commented / documented code here and replace the corresponding files.
Specifically, here’s what my zip file contains:
For example:
I have some familiarity with Support Vector Machines, but not enough to understand what’s meant specifically by an “L2-SVM”.
I found a quick answer though, in the paper Comparison of L1 and L2 Support Vector Machines, Koshiba et al, 2003.
Support vector machines with linear sum of slack variables, which are commonly used, are called L1-SVMs, and SVMs with the square sum of slack variables are called L2-SVMs.
It’s really just a slight difference in the objective function used to optimize the SVM.
The objective for an L1-SVM is:

minimize (1/2)||w||^2 + C Σ ξ_i
And for an L2-SVM:

minimize (1/2)||w||^2 + (C/2) Σ ξ_i^2

(where w is the weight vector, C is the penalty parameter, and the ξ_i are the slack variables, one per training example).
The difference is in the penalty applied to the slack variables–the term that controls how heavily margin violations (such as those caused by outliers) are punished, and which affects the SVM’s overall generalization.
So why use the L2 objective versus the L1?
The paper Deep Learning Using Support Vector Machines, Yichuan Tang, 2013 offers some insight:
L2SVM is differentiable and imposes a bigger (quadratic vs. linear) loss for points which violate the margin.
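To make the difference concrete, here’s a tiny comparison of the two loss functions (a generic illustration, not code from either paper):

```python
# Hinge loss (L1-SVM) vs. squared hinge loss (L2-SVM) for a single
# training point, where 'margin' is t * (w . x), the label times the score.
def hinge(margin):
    return max(0.0, 1.0 - margin)

def squared_hinge(margin):
    return max(0.0, 1.0 - margin) ** 2

# A point that badly violates the margin is penalized much more
# heavily by the squared (L2) loss:
print(hinge(-2.0), squared_hinge(-2.0))  # 3.0 9.0
# While a point that only slightly violates it is penalized less:
print(hinge(0.5), squared_hinge(0.5))  # 0.5 0.25
```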
If you want to dig deeper into the topic, that paper is probably a good bet.
All of these deep neural networks ultimately spit out a final feature vector representation of the input, which must then be classified (if classification is the task at hand). This is generally done using a simple linear classifier. The general impression that I’m getting from these various papers is that training the classifier using the L2-SVM objective function outperforms other methods such as L1-SVM or Softmax regression.
If you’re looking for some example MATLAB code, Adam Coates provides the code for his original CIFAR10 benchmark implementation here:
http://www.cs.stanford.edu/~acoates/papers/kmeans_demo.tgz
and his code uses the L2-SVM objective to train the output classifier.
Calculating the Euclidean distance can be greatly accelerated by taking advantage of the highly optimized matrix multiplication routines (vectorized SIMD instructions and tuned linear algebra libraries) available on modern PCs. Writing the Euclidean distance in terms of a matrix multiplication requires some reworking of the distance equation, which we’ll work through below.
The following is the equation for the Euclidean distance between two vectors, x and y:

d(x, y) = sqrt( Σ_i (x_i - y_i)^2 )
Let’s see what the code looks like for calculating the Euclidean distance between a collection of input vectors in X (one per row) and a collection of ‘k’ models or cluster centers in C (also one per row). The straightforward sum-of-squared-differences implementation loops over each of the models:

for i = 1 : k
    dists(:, i) = sqrt(sum(bsxfun(@minus, X, C(i, :)).^2, 2));
end
The problem with this approach is that there’s no way to get rid of that for-loop iterating over each of the clusters. In the next section we’ll look at an approach that lets us avoid the for-loop and perform a matrix multiplication instead.
If we simply expand the square term:

||x - c||^2 = ||x||^2 - 2(x · c) + ||c||^2
Then we can rewrite our MATLAB code as follows (see the attached MATLAB script for a commented version of this).
XX = sum(X.^2, 2);    % ||x||^2 for each input vector (n x 1)
XC = X * C';          % dot product between every input and every model (n x k)
CC = sum(C.^2, 2)';   % ||c||^2 for each model, as a row vector (1 x k)
dists = sqrt(bsxfun(@plus, CC, bsxfun(@minus, XX, 2*XC)));
No more for-loop! Because we are using linear algebra software here (MATLAB) that has been optimized for matrix multiplications, we will see a massive speedup in this implementation over the sum-of-squared-differences approach.
I’ve uploaded a MATLAB script which generates 10,000 random vectors of length 256 and calculates the L2 distance between them and 1,000 models. Running in Octave on my Core i5 laptop, the sum-of-squared-differences approach takes about 50 seconds whereas the matrix multiplication approach takes about 2 seconds.
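For readers working in Python rather than MATLAB, the same trick looks like this in NumPy (a sketch with arbitrary sizes, not the attached script):

```python
import numpy as np

def euclidean_dists(X, C):
    """Distance from every row of X to every row of C, using the
    expanded form ||x - c||^2 = ||x||^2 - 2(x . c) + ||c||^2."""
    XX = (X ** 2).sum(axis=1, keepdims=True)  # (n, 1)
    CC = (C ** 2).sum(axis=1)                 # (k,)
    XC = X @ C.T                              # (n, k) matrix multiplication
    # Round-off can leave tiny negative values; clamp before the sqrt.
    return np.sqrt(np.maximum(XX - 2 * XC + CC, 0.0))

rng = np.random.default_rng(0)
X = rng.random((1000, 256))
C = rng.random((100, 256))
d = euclidean_dists(X, C)

# Sanity check one pair against the direct definition:
assert np.isclose(d[0, 0], np.sqrt(((X[0] - C[0]) ** 2).sum()))
```

The expanded form occasionally produces tiny negative values from round-off, hence the clamp before the square root.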
The above code gets you the actual Euclidean distance, but we can make some additional optimizations to it when we are only interested in comparing distances.
This occurs in K-Nearest Neighbors, where we are trying to find the ‘k’ data points in a large set which are closest to our input pattern. It also occurs in K-Means Clustering during the cluster assignment step, where we assign a data point to the closest cluster.
In these applications, we don’t need to know the actual L2 distance, we only need to compare distances–that is, determine which distance is smaller or larger.
The following equation expresses the comparison of the L2 distance between an input vector x and two other vectors a and b:

||x - a|| ≤ ||x - b||
We will show that, in order to make this comparison, it is equivalent to instead compare the quantities:

||a||^2 - 2(x · a)   versus   ||b||^2 - 2(x · b)
The following figure shows the derivation of the above equivalence. The key step is that squaring both sides and expanding produces a ||x||^2 term on each side, which cancels.
There is one additional modification we can make to this comparison, which is to divide both sides by -2. This moves the 2 over to the pre-calculated term (||a||^2 / 2). Note that dividing by a negative number also flips the comparison, so where we were previously looking for the minimum value, we are now looking for the maximum of (x · a) - ||a||^2 / 2.
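Here’s a quick check of that comparison trick in NumPy (illustrative; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((50, 16))   # input vectors
C = rng.random((10, 16))   # models / cluster centers

# Full Euclidean distances, then nearest model per input:
dists = np.sqrt(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2))
nearest_full = dists.argmin(axis=1)

# Comparison-only version: the ||x||^2 term is shared across models, so
# drop it, divide by -2, and look for the *maximum* of x.c - ||c||^2 / 2.
scores = X @ C.T - (C ** 2).sum(axis=1) / 2.0
nearest_fast = scores.argmax(axis=1)

assert np.array_equal(nearest_full, nearest_fast)
```

The second version needs only one matrix multiplication plus a pre-computed vector of squared model norms, with no square roots at all.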