afit-r-tutorials/supervised_learning_support_vector_machines.html at master · AFIT-R/afit-r-tutorials · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">

<head>

<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />


<title>Support Vector Machines</title>

<script src="site_libs/jquery-1.11.3/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="site_libs/bootstrap-3.3.5/css/sandstone.min.css" rel="stylesheet" />
<script src="site_libs/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="site_libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="site_libs/bootstrap-3.3.5/shim/respond.min.js"></script>
<script src="site_libs/navigation-1.1/tabsets.js"></script>
<link href="site_libs/highlightjs-1.1/textmate.css" rel="stylesheet" />
<script src="site_libs/highlightjs-1.1/highlight.js"></script>

<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
  pre:not([class]) {
    background-color: white;
  }
</style>
<script type="text/javascript">
if (window.hljs && document.readyState && document.readyState === "complete") {
   window.setTimeout(function() {
      hljs.initHighlighting();
   }, 0);
}
</script>


<style type="text/css">
h1 {
  font-size: 34px;
}
h1.title {
  font-size: 38px;
}
h2 {
  font-size: 30px;
}
h3 {
  font-size: 24px;
}
h4 {
  font-size: 18px;
}
h5 {
  font-size: 16px;
}
h6 {
  font-size: 12px;
}
.table th:not([align]) {
  text-align: left;
}
</style>

<link rel="stylesheet" href="styles.css" type="text/css" />

</head>

<body>

<style type = "text/css">
.main-container {
  max-width: 940px;
  margin-left: auto;
  margin-right: auto;
}
code {
  color: inherit;
  background-color: rgba(0, 0, 0, 0.04);
}
img {
  max-width:100%;
  height: auto;
}
.tabbed-pane {
  padding-top: 12px;
}
button.code-folding-btn:focus {
  outline: none;
}
</style>


<style type="text/css">
/* padding for bootstrap navbar */
body {
  padding-top: 61px;
  padding-bottom: 40px;
}
/* offset scroll position for anchor links (for fixed navbar)  */
.section h1 {
  padding-top: 66px;
  margin-top: -66px;
}

.section h2 {
  padding-top: 66px;
  margin-top: -66px;
}
.section h3 {
  padding-top: 66px;
  margin-top: -66px;
}
.section h4 {
  padding-top: 66px;
  margin-top: -66px;
}
.section h5 {
  padding-top: 66px;
  margin-top: -66px;
}
.section h6 {
  padding-top: 66px;
  margin-top: -66px;
}
</style>

<script>
// manage active state of menu based on current page
$(document).ready(function () {
  // active menu anchor
  href = window.location.pathname
  href = href.substr(href.lastIndexOf('/') + 1)
  if (href === "")
    href = "index.html";
  var menuAnchor = $('a[href="' + href + '"]');

  // mark it active
  menuAnchor.parent().addClass('active');

  // if it's got a parent navbar menu mark it active as well
  menuAnchor.closest('li.dropdown').addClass('active');
});
</script>


<div class="container-fluid main-container">

<!-- tabsets -->
<script>
$(document).ready(function () {
  window.buildTabsets("TOC");
});
</script>

<!-- code folding -->


<div class="navbar navbar-default  navbar-fixed-top" role="navigation">
  <div class="container">
    <div class="navbar-header">
      <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar">
        <span class="icon-bar"></span>
        <span class="icon-bar"></span>
        <span class="icon-bar"></span>
      </button>
      <a class="navbar-brand" href="index.html">AFIT/DSL</a>
    </div>
    <div id="navbar" class="navbar-collapse collapse">
      <ul class="nav navbar-nav">
        <li>
  <a href="index.html">Home</a>
</li>
<li>
  <a href="about.html">About</a>
</li>
<li>
  <a href="about.html">Blog</a>
</li>
<li class="dropdown">
  <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false">
    Tutorials

    <span class="caret"></span>
  </a>
  <ul class="dropdown-menu" role="menu">
    <li class="dropdown-header">Introduction to R</li>
    <li class="dropdown-header">Basic R</li>
    <li class="dropdown-header">Advanced R</li>
    <li class="divider"></li>
    <li class="dropdown-header">Timeseries</li>
    <li class="divider"></li>
    <li class="dropdown-header">Supervised learning</li>
    <li class="dropdown-header">Decision trees</li>
    <li class="dropdown-header">Bagging and random forests</li>
    <li class="dropdown-header">Boosting</li>
    <li>
      <a href="supervised_learning_support_vector_machines.html">Support vector machines</a>
    </li>
    <li class="dropdown-header">Neural networks</li>
  </ul>
</li>
      </ul>
      <ul class="nav navbar-nav navbar-right">

      </ul>
    </div><!--/.nav-collapse -->
  </div><!--/.container -->
</div><!--/.navbar -->

<div class="fluid-row" id="header">


<h1 class="title toc-ignore">Support Vector Machines</h1>

</div>


<p>The advent of computers brought on rapid advances in the field of statistical classification, one of which is the <em>Support Vector Machine</em>, or SVM. The goal of an SVM is to take groups of observations and construct boundaries to predict which group future observations belong to based on their measurements. The different groups that must be separated are called “classes”. SVMs can handle any number of classes, as well as observations of any dimension. SVMs can take almost any shape (including linear, radial, and polynomial, among others), and are generally flexible enough to be used in almost any classification endeavor that the user chooses to undertake.</p>
<div id="tldr" class="section level3">
<h3>tl;dr</h3>
<ol style="list-style-type: decimal">
<li><p><a href="#replication-requirements">Replication Requirements</a>: What you’ll need to reproduce the analysis in this tutorial</p></li>
<li><p><a href="#maximal-margin-classifier">Maximal Margin Classifier</a>: Constructing a classification line for completely separable data</p></li>
<li><p><a href="#support-vector-classifiers">Support Vector Classifiers</a>: Constructing a classification line for data that is not separable</p></li>
<li><p><a href="#support-vector-machines">Support Vector Machines</a>: Constructing a classfication boundary, whether linear or nonlinear, for data that may or may not be separable</p></li>
<li><p><a href="#svms-for-multiple-classes">SVMs for Multiple Classes</a>: SVM techniques for more than 2 classes of observations</p></li>
</ol>
</div>
<div id="replication-requirements" class="section level3">
<h3>Replication Requirements</h3>
<p>For this tutorial, we will leverage the <code>tidyverse</code> package to perform data manipulation, the <code>e1071</code> package to perform calculations related to SVMs, and the <code>ISLR</code> package to load a real world data set to show the functionality of Support Vector Machines. In order to allow for reproducible results, we set the random number generator explicitly.</p>
<pre class="r"><code># set pseudorandom number generator
set.seed(10)

# Attach Packages
library(tidyverse)    # data manipulation and visualization
library(e1071)        # SVM methodology
library(ISLR)         # contains example data set &quot;Khan&quot;</code></pre>
<pre><code>## Warning: package &#39;ISLR&#39; was built under R version 3.4.1</code></pre>
<p>The data sets used in the tutorial (with the exception of <code>Khan</code>) will be generated using built-in R commands. The Support Vector Machine methodology is sound for any number of dimensions, but becomes difficult to visualize for more than 2. As previously mentioned, SVMs are robust for any number of classes, but we will stick to no more than 3 for the duration of this tutorial.</p>
</div>
<div id="maximal-margin-classifier" class="section level3">
<h3>Maximal Margin Classifier</h3>
<p>If the classes are separable by a linear boundary, we can use a <em>Maximal Margin Classifier</em> to find the classification boundary. To visualize an example of separated data, we generate 40 random observations and assign them to two classes. Upon visual inspection, we can see that infinitely many lines exist that will split the two classes.</p>
<pre class="r"><code># Construct sample data set - completely separated
x &lt;- matrix(rnorm(20*2), ncol = 2)
y &lt;- c(rep(-1,10), rep(1,10))
x[y==1,] &lt;- x[y==1,] + 3/2
dat &lt;- data.frame(x=x, y=as.factor(y))

# Plot data
ggplot(data = dat, aes(x = x.2, y = x.1, color = y, shape = y)) +
  geom_point(size = 2) +
  scale_color_manual(values=c(&quot;#000000&quot;, &quot;#FF0000&quot;)) +
  theme(legend.position = &quot;none&quot;)</code></pre>
<p><img src="public/images/analytics/support_vector_machines/SVM-unnamed-chunk-3-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>The goal of the maximal margin classifier is to identify the linear boundary that maximizes the total distance between the line and the closest point in each class. We can use the <code>svm()</code> function in the <code>e1071</code> package to find this boundary.</p>
<pre class="r"><code># Fit Support Vector Machine model to data set
svmfit &lt;- svm(y~., data = dat, kernel = &quot;linear&quot;, cost = 10, scale = FALSE)
# Plot Results
plot(svmfit, dat)</code></pre>
<p><img src="public/images/analytics/support_vector_machines/SVM-unnamed-chunk-4-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>In the plot, points that are represented by an “X” are the <strong>support vectors</strong>, or the points that directly affect the classification line. The points marked with an “o” are the other points, which don’t affect the calculation of the line. This principle will lay the foundation for support vector machines.</p>
</div>
<div id="support-vector-classifiers" class="section level3">
<h3>Support Vector Classifiers</h3>
<p>As convenient as the maximal marginal classifier is mathematically, most real data sets will not be fully separable by a linear boundary. To handle such data, we must use modified methodology. We simulate a new data set where the classes are more mixed.</p>
<pre class="r"><code># Construct sample data set - not completely separated
x &lt;- matrix(rnorm(20*2), ncol = 2)
y &lt;- c(rep(-1,10), rep(1,10))
x[y==1,] &lt;- x[y==1,] + 1
dat &lt;- data.frame(x=x, y=as.factor(y))

# Plot data set
ggplot(data = dat, aes(x = x.2, y = x.1, color = y, shape = y)) +
  geom_point(size = 2) +
  scale_color_manual(values=c(&quot;#000000&quot;, &quot;#FF0000&quot;)) +
  theme(legend.position = &quot;none&quot;)</code></pre>
<p><img src="public/images/analytics/support_vector_machines/SVM-unnamed-chunk-5-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>Whether the data is separable or not, the <code>svm()</code> command syntax is the same. In the case of data that is not linearly separable, however, the <code>cost =</code> argument takes on real importance. This quantifies the penalty associated with having an observation on the wrong side of the classification boundary. We can plot the fit in the same way as the completely separable case.</p>
<pre class="r"><code># Fit Support Vector Machine model to data set
svmfit &lt;- svm(y~., data = dat, kernel = &quot;linear&quot;, cost = 10, scale = FALSE)
# Plot Results
plot(svmfit, dat)</code></pre>
<p><img src="public/images/analytics/support_vector_machines/SVM-unnamed-chunk-6-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>But how do we decide how costly these misclassifications are? Instead of specifying a cost up front, we can use the <code>tune()</code> function to test various costs and identify which value produces the best fitting model.</p>
<pre class="r"><code># find optimal cost of misclassification
tune.out &lt;- tune(svm, y~., data = dat, kernel = &quot;linear&quot;,
                 ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
# extract the best model
(bestmod &lt;- tune.out$best.model)</code></pre>
<pre><code>##
## Call:
## best.tune(method = svm, train.x = y ~ ., data = dat, ranges = list(cost = c(0.001,
##     0.01, 0.1, 1, 5, 10, 100)), kernel = &quot;linear&quot;)
##
##
## Parameters:
##    SVM-Type:  C-classification
##  SVM-Kernel:  linear
##        cost:  0.1
##       gamma:  0.5
##
## Number of Support Vectors:  16</code></pre>
<p>For our data set, the optimal cost (from amongst the choices we provided) is calculated to be 0.1, which doesn’t penalize the model much for misclassified observations. Once this model has been identified, we can construct a table of predicted classes against true classes using the <code>predict()</code> command as follows:</p>
<pre class="r"><code># Create a table of misclassified observations
ypred &lt;- predict(bestmod, dat)
(misclass &lt;- table(predict = ypred, truth = dat$y))</code></pre>
<pre><code>##        truth
## predict -1 1
##      -1  9 3
##      1   1 7</code></pre>
<p>Using this support vector classifier, 80% of the observations were correctly classified, which matches what we see in the plot. If we wanted to test our classifier more rigorously, we could split our data into training and testing sets and then see how our SVC performed with the observations not used to construct the model. We will use this training-testing method later in this tutorial to validate our SVMs.</p>
</div>
<div id="support-vector-machines" class="section level3">
<h3>Support Vector Machines</h3>
<p>Support Vector Classifiers are a subset of the group of classification structures known as Support Vector Machines. Support Vector Machines can construct classification boundaries that are nonlinear in shape. The options for classification structures using the <code>svm()</code> command are linear, polynomial, radial, and sigmoid. To demonstrate a nonlinear classification boundary, we will construct a new data set.</p>
<pre class="r"><code># construct larger random data set
x &lt;- matrix(rnorm(200*2), ncol = 2)
x[1:100,] &lt;- x[1:100,] + 2.5
x[101:150,] &lt;- x[101:150,] - 2.5
y &lt;- c(rep(1,150), rep(2,50))
dat &lt;- data.frame(x=x,y=as.factor(y))
# Plot data
ggplot(data = dat, aes(x = x.2, y = x.1, color = y, shape = y)) +
  geom_point(size = 2) +
  scale_color_manual(values=c(&quot;#000000&quot;, &quot;#FF0000&quot;)) +
  theme(legend.position = &quot;none&quot;)</code></pre>
<p><img src="public/images/analytics/support_vector_machines/SVM-unnamed-chunk-9-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>Notice that the data is not linearly separable, and furthermore, isn’t all clustered together in a single group. There are two sections of class 1 observations with a cluster of class 2 observations in between. To demonstrate the effectiveness of SVMs, we’ll take 100 random observations from the set and use them to construct our boundary. We set <code>kernel = &quot;radial&quot;</code> based on the shape of our data and plot the results.</p>
<pre class="r"><code># set pseudorandom number generator
set.seed(123)
# sample training data and fit model
train &lt;- base::sample(200,100, replace = FALSE)
svmfit &lt;- svm(y~., data = dat[train,], kernel = &quot;radial&quot;, gamma = 1, cost = 1)
# plot classifier
plot(svmfit, dat)</code></pre>
<p><img src="public/images/analytics/support_vector_machines/SVM-unnamed-chunk-10-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>We see that, at least visually, the SVM does a reasonable job of separating the two classes. To fit the model, we used <code>cost = 1</code>, but as mentioned previously, it isn’t usually obvious which cost will produce the optimal classification boundary. We can use the <code>tune()</code> command to try several different values of cost as well as several different values of <span class="math inline">\(\gamma\)</span>, a scaling parameter uded to fit nonlinear boundaries.</p>
<pre class="r"><code># tune model to find optimal cost, gamma values
tune.out &lt;- tune(svm, y~., data = dat[train,], kernel = &quot;radial&quot;,
                 ranges = list(cost = c(0.1,1,10,100,1000),
                 gamma = c(0.5,1,2,3,4)))
# show best model
tune.out$best.model</code></pre>
<pre><code>##
## Call:
## best.tune(method = svm, train.x = y ~ ., data = dat[train, ],
##     ranges = list(cost = c(0.1, 1, 10, 100, 1000), gamma = c(0.5,
##         1, 2, 3, 4)), kernel = &quot;radial&quot;)
##
##
## Parameters:
##    SVM-Type:  C-classification
##  SVM-Kernel:  radial
##        cost:  10
##       gamma:  0.5
##
## Number of Support Vectors:  22</code></pre>
<p>The model that reduces the error the most in the training data uses a cost of 10 and <span class="math inline">\(\gamma\)</span> value of 0.5. We can now see how well the SVM performs by predicting the class of the 100 testing observations:</p>
<pre class="r"><code># validate model performance
(valid &lt;- table(true = dat[-train,&quot;y&quot;], pred = predict(tune.out$best.model,
                                             newx = dat[-train,])))</code></pre>
<pre><code>##     pred
## true  1  2
##    1 58 19
##    2 15  8</code></pre>
<p>Our best-fitting model produces 66% accuracy in identifying classes. For such a complicated shape of observations, this performed reasonably well. We can challenge this method further by adding additional classes of observations.</p>
</div>
<div id="svms-for-multiple-classes" class="section level3">
<h3>SVMs for Multiple Classes</h3>
<p>The procedure does not change for data sets that involve more than two classes of observations. We construct our data set the same way as we have previously, only now specifying three classes instead of two:</p>
<pre class="r"><code># construct data set
x &lt;- rbind(x, matrix(rnorm(50*2), ncol = 2))
y &lt;- c(y, rep(0,50))
x[y==0,2] &lt;- x[y==0,2] + 2.5
dat &lt;- data.frame(x=x, y=as.factor(y))
# plot data set
ggplot(data = dat, aes(x = x.2, y = x.1, color = y, shape = y)) +
  geom_point(size = 2) +
  scale_color_manual(values=c(&quot;#000000&quot;,&quot;#FF0000&quot;,&quot;#00BA00&quot;)) +
  theme(legend.position = &quot;none&quot;)</code></pre>
<p><img src="public/images/analytics/support_vector_machines/SVM-unnamed-chunk-13-1.png" width="672" style="display: block; margin: auto;" /></p>
<p>The commands don’t change. We specify a cost and tuning parameter <span class="math inline">\(\gamma\)</span> and fit a support vector machine. The results and interpretation are similar to two-class classification.</p>
<pre class="r"><code># fit model
svmfit &lt;- svm(y~., data = dat, kernel = &quot;radial&quot;, cost = 10, gamma = 1)
# plot results
plot(svmfit, dat)</code></pre>
<p><img src="public/images/analytics/support_vector_machines/SVM-unnamed-chunk-14-1.png" width="672" style="display: block; margin: auto;" /></p>
<pre class="r"><code>#construct table
ypred &lt;- predict(svmfit, dat)
(misclass &lt;- table(predict = ypred, truth = dat$y))</code></pre>
<pre><code>##        truth
## predict   0   1   2
##       0  43   2   3
##       1   5 144   4
##       2   2   4  43</code></pre>
<p>As shown in the resulting table, 92% of our training observations were correctly classified. However, since we didn’t break our data into training and testing sets, we didn’t truly validate our results.</p>
<div id="application" class="section level4">
<h4>Application</h4>
<p>The <code>Khan</code> data set contains data on 83 tissue samples with 2308 gene expression measurements on each sample. These were split into 63 training observations and 20 testing observations, and there are four distinct classes in the set. It would be impossible to visualize such data, so we choose the simplest classifier (linear) to construct our model.</p>
<pre class="r"><code># fit model
dat &lt;- data.frame(x = Khan$xtrain, y=as.factor(Khan$ytrain))
(out &lt;- svm(y~., data = dat, kernel = &quot;linear&quot;, cost=10))</code></pre>
<pre><code>##
## Call:
## svm(formula = y ~ ., data = dat, kernel = &quot;linear&quot;, cost = 10)
##
##
## Parameters:
##    SVM-Type:  C-classification
##  SVM-Kernel:  linear
##        cost:  10
##       gamma:  0.0004332756
##
## Number of Support Vectors:  58</code></pre>
<p>First of all, we can check how well our model did at classifying the training observations. This is usually high, but again, doesn’t validate the model. If the model doesn’t do a very good job of classifying the training set, it could be a red flag. In our case, all 63 training observations were correctly classified.</p>
<pre class="r"><code># check model performance on training set
table(out$fitted, dat$y)</code></pre>
<pre><code>##
##      1  2  3  4
##   1  8  0  0  0
##   2  0 23  0  0
##   3  0  0 12  0
##   4  0  0  0 20</code></pre>
<p>To perform validation, we can check how the model performs on the testing set:</p>
<pre class="r"><code># validate model performance
dat.te &lt;- data.frame(x=Khan$xtest, y=as.factor(Khan$ytest))
pred.te &lt;- predict(out, newdata=dat.te)
table(pred.te, dat.te$y)</code></pre>
<pre><code>##
## pred.te 1 2 3 4
##       1 3 0 0 0
##       2 0 6 2 0
##       3 0 0 4 0
##       4 0 0 0 5</code></pre>
<p>The model correctly identifies 18 of the 20 testing observations. SVMs and the boundaries they impose are more difficult to interpret at higher dimensions, but these results seem to suggest that our model is a good classifier for the gene data.</p>
</div>
</div>

<p>Copyright &copy; 2016 Skynet, Inc. All rights reserved.</p>


</div>

<script>

// add bootstrap table styles to pandoc tables
function bootstrapStylePandocTables() {
  $('tr.header').parent('thead').parent('table').addClass('table table-condensed');
}
$(document).ready(function () {
  bootstrapStylePandocTables();
});


</script>

<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

</body>
</html>