# Visualizing Optimisation Algorithms

The first and second courses by deeplearning.ai offer a great insight into the working of various optimisation algorithms used in Machine Learning. Specifically, they focus on Batch Gradient Descent, Mini-batch Gradient Descent (with and without momentum), and Adam optimisation. Having finished the two courses, I've wanted to go deeper into the world of optimisation. This is probably the first step towards that.

This notebook/post is an introductory level analysis on the workings of these optimisation approaches. The intent is to visually see these algorithms in action, and hopefully see how they're different from each other.

The approach below is greatly inspired by this post by Louis Tiao on optimisation visualizations, and this tutorial on matplotlib animation by Jake VanderPlas.

In [1]:
# imports
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import LogNorm
from matplotlib import animation
from IPython.display import HTML
import math
from itertools import zip_longest
from sklearn.datasets import make_classification

In [58]:
%matplotlib inline


Alright. We need something to optimize. To begin with, let's try to find the minima(s) of Himmelblau function , which is represented as:

$$f(x,y)=(x^{2}+y-11)^{2}+(x+y^{2}-7)^{2}$$

This function as 4 minimas:

• $f(3.0,2.0)=0.0$
• $f(-2.805118,3.131312)=0.0$
• $f(-3.779310,-3.283186)=0.0$
• $f(3.584428,-1.848126)=0.0$

Writing the function in Python:

In [3]:
f  = lambda x, y: (x**2 +y -11)**2 + (x + y**2 -7)**2

In [4]:
xmin, xmax, xstep = -5, 5, .2
ymin, ymax, ystep = -5, 5, .2


Let's see a 3d plot of the function. Creating a meshgrid so as to plot contours:

In [5]:
x, y = np.meshgrid(np.arange(xmin, xmax + xstep, xstep), np.arange(ymin, ymax + ystep, ystep))

In [6]:
z = f(x, y)


Creating 4 minima variables to display on the plot.

In [7]:
minima1 = np.array([3., 2.])
minima2 = np.array([-2.80, 3.13])
minima3 = np.array([-3.78, -3.3])
minima4 = np.array([3.6, -1.85])

minima1_ = minima1.reshape(-1, 1)
minima2_ = minima2.reshape(-1, 1)
minima3_ = minima3.reshape(-1, 1)
minima4_ = minima4.reshape(-1, 1)


Plotting.

In [79]:
fig = plt.figure(dpi=150, figsize=(6, 4))
ax = plt.axes(projection='3d', elev=50, azim=-50)

ax.plot_surface(x, y, z, norm=LogNorm(), rstride=1, cstride=1,
edgecolor='none', alpha=.8, cmap=plt.cm.jet)
ax.plot(*minima1_, f(*minima1_), 'r*', markersize=10)
ax.plot(*minima2_, f(*minima2_), 'r*', markersize=10)
ax.plot(*minima3_, f(*minima3_), 'r*', markersize=10)
ax.plot(*minima4_, f(*minima4_), 'r*', markersize=10)

ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.set_zlabel('$z$')

ax.set_xlim((xmin, xmax))
ax.set_ylim((ymin, ymax))

plt.show()


We need gradients of the function f wrt x and y to code the optimisation logic.

In [9]:
dx = lambda x, y: 2*(x**2 +y -11)*2*x + 2*(x + y**2 -7)
dy = lambda x, y: 2*(x**2 +y -11) + 2*(x + y**2 -7)*2*y


Writing a simple implementation of gradient descent. Also keeping a track of the values of x and y in the path variable while the algorithms descends to the minima. GD will run for num_epochs iterations.

In [10]:
def gradient_descent(init_point,learning_rate, num_epochs):
x0 = init_point[0]
y0 = init_point[1]

path = np.zeros((2,num_epochs+1))

path[0][0] = x0
path[1][0] = y0

for i in range(num_epochs):
x_ =  learning_rate*dx(x0,y0)
y_ =  learning_rate*dy(x0,y0)
x0 -= x_
y0 -= y_
path[0][i+1] = x0
path[1][i+1] = y0

return (path,(x0,y0))


Choosing a few starting points. We'll initialize GD separately with all of these and plot the paths taken in each case.

In [11]:
begin_points = [
np.array([0.,0.]),
np.array([-0.5,0.]),
np.array([-0.5,-0.5]),
np.array([0.5,-0.5]),
np.array([0.5,-0.5]),
np.array([-1.23633,-1.11064]),
np.array([0.295466,-1.2946]),
np.array([0.3616,4.9298]),
np.array([1.362,-4.774]),
]


All paths will be stored in the paths variable. In order to make the plot clean, selecting a subset of the each path (every 5th point).

In [12]:
paths = []
for begin_point in begin_points:
path = path[:, [i for i in range(0,path.shape[1],5)]]
paths.append(path)


Gradient descent is complete for all the starting points. Let's see the paths taken on a contour plot of the function.

In [82]:
fig, ax = plt.subplots(dpi=150, figsize=(8, 6))

ax.contour(x, y, z, levels=np.logspace(0, 5, 35), norm=LogNorm(), cmap=plt.cm.jet)

for i,path in enumerate(paths):
color = c=np.random.rand(3,)
ax.quiver(path[0,:-1], path[1,:-1], path[0,1:]-path[0,:-1], path[1,1:]-path[1,:-1],\
scale_units='xy', angles='xy', scale=1, color=color)
ax.plot(*begin_points[i], color=color ,marker='o', markersize=10)

ax.plot(*minima1_, 'r*', markersize=18)
ax.plot(*minima2_, 'r*', markersize=18)
ax.plot(*minima3_, 'r*', markersize=18)
ax.plot(*minima4_, 'r*', markersize=18)

ax.set_xlabel('$x$')
ax.set_ylabel('$y$')

ax.set_xlim((xmin, xmax))
ax.set_ylim((ymin, ymax))

plt.show()


Nice! As seen above, starting point greatly influences where the algorithm converges.

Time to see it in animation. Making use of the following class written by Louis to generate matplotlib animations for multiple paths:

In [14]:
class TrajectoryAnimation(animation.FuncAnimation):

def __init__(self, *paths, labels=[], fig=None, ax=None, frames=None,
interval=60, repeat_delay=5, blit=True, **kwargs):

if fig is None:
if ax is None:
fig, ax = plt.subplots()
else:
fig = ax.get_figure()
else:
if ax is None:
ax = fig.gca()

self.fig = fig
self.ax = ax

self.paths = paths

if frames is None:
frames = max(path.shape[1] for path in paths)

self.lines = [ax.plot([], [], label=label, lw=2)[0]
for _, label in zip_longest(paths, labels)]
self.points = [ax.plot([], [], 'o', color=line.get_color())[0]
for line in self.lines]

super(TrajectoryAnimation, self).__init__(fig, self.animate, init_func=self.init_anim,
frames=frames, interval=interval, blit=blit,
repeat_delay=repeat_delay, **kwargs)

def init_anim(self):
for line, point in zip(self.lines, self.points):
line.set_data([], [])
point.set_data([], [])
return self.lines + self.points

def animate(self, i):
for line, point, path in zip(self.lines, self.points, self.paths):
line.set_data(*path[::,:i])
point.set_data(*path[::,i-1:i])
return self.lines + self.points

In [85]:
fig, ax = plt.subplots(figsize=(10, 6))

ax.contour(x, y, z, levels=np.logspace(0, 5, 35), norm=LogNorm(), cmap=plt.cm.jet)

for i,path in enumerate(paths):
color = c=np.random.rand(3,)
ax.plot(*begin_points[i].reshape(-1,1), color=color ,marker='o', markersize=10)

ax.plot(*minima1_, 'r*', markersize=18)
ax.plot(*minima2_, 'r*', markersize=18)
ax.plot(*minima3_, 'r*', markersize=18)
ax.plot(*minima4_, 'r*', markersize=18)

ax.set_xlabel('$x$')
ax.set_ylabel('$y$')

ax.set_xlim((xmin, xmax))
ax.set_ylim((ymin, ymax))

anim = TrajectoryAnimation(*paths, ax=ax)

# ax.legend(loc='upper left')

In [86]:
HTML(anim.to_html5_video())

Out[86]: