1.14. Machine Learning with dislib

This tutorial shows the algorithms available in dislib and runs several of them on the MNIST dataset.

Setup

First, we need to start an interactive PyCOMPSs session:

[1]:
import os
os.environ["ComputingUnits"] = "1"

import pycompss.interactive as ipycompss
if 'BINDER_SERVICE_HOST' in os.environ:
    ipycompss.start(graph=True,
                    project_xml='../xml/project.xml',
                    resources_xml='../xml/resources.xml')
else:
    ipycompss.start(graph=True, monitor=1000)
********************************************************
**************** PyCOMPSs Interactive ******************
********************************************************
*          .-~~-.--.           ______         ______   *
*         :         )         |____  \       |____  \  *
*   .~ ~ -.\       /.- ~~ .      __) |          __) |  *
*   >       `.   .'       <     |__  |         |__  |  *
*  (         .- -.         )   ____) |   _    ____) |  *
*   `- -.-~  `- -'  ~-.- -'   |______/  |_|  |______/  *
*     (        :        )           _ _ .-:            *
*      ~--.    :    .--~        .-~  .-~  }            *
*          ~-.-^-.-~ \_      .~  .-~   .~              *
*                   \ \ '     \ '_ _ -~                *
*                    \`.\`.    //                      *
*           . - ~ ~-.__\`.\`-.//                       *
*       .-~   . - ~  }~ ~ ~-.~-.                       *
*     .' .-~      .-~       :/~-.~-./:                 *
*    /_~_ _ . - ~                 ~-.~-._              *
*                                     ~-.<             *
********************************************************
* - Starting COMPSs runtime...                         *
* - Log path : /home/user/.COMPSs/Interactive_14/
* - PyCOMPSs Runtime started... Have fun!              *
********************************************************

Next, we import dislib and we are all set to start working!

[2]:
import dislib as ds

Load the MNIST dataset

[3]:
# The dataset must already be downloaded to /tmp/mnist before running this cell
x, y = ds.load_svmlight_file('/tmp/mnist/mnist',
                             block_size=(10000, 784), n_features=784, store_sparse=False)
[4]:
x.shape
[4]:
(60000, 784)
[5]:
y.shape
[5]:
(60000, 1)
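
The block_size argument used when loading splits the 60000 x 784 array into row blocks of 10000 samples each. The arithmetic can be sketched as follows, assuming that blocks tile the array regularly and that a trailing block may be smaller than the rest. block_grid is a hypothetical helper written for this illustration, not part of dislib:

```python
import math


def block_grid(shape, block_size):
    """Return how many blocks tile an array along each axis.

    Hypothetical helper: assumes regular tiling where the last
    block on an axis may be smaller than block_size.
    """
    return tuple(math.ceil(n / b) for n, b in zip(shape, block_size))


# Shape and block size taken from the load_svmlight_file call above.
n_blocks = block_grid((60000, 784), (10000, 784))
print(n_blocks)  # (6, 1)
```

Under that assumption the training data is held as six row blocks, so operations on x can be split into up to six independent pieces.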
[6]:
y_array = y.collect()
y_array
[6]:
array([5., 0., 4., ..., 5., 6., 8.])
[7]:
img = x[0].collect().reshape(28,28)
[8]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.imshow(img)
[8]:
<matplotlib.image.AxesImage at 0x7ff14e688e20>
[Figure: imshow rendering of the first MNIST image]
[9]:
int(y[0].collect())
[9]:
5

dislib algorithms

Preprocessing

[10]:
from dislib.preprocessing import StandardScaler
from dislib.decomposition import PCA

Clustering

[11]:
from dislib.cluster import KMeans
from dislib.cluster import DBSCAN
from dislib.cluster import GaussianMixture

Classification

[12]:
from dislib.classification import CascadeSVM
from dislib.classification import RandomForestClassifier

Recommendation

[13]:
from dislib.recommendation import ALS

Model selection

[14]:
from dislib.model_selection import GridSearchCV

Others

[15]:
from dislib.regression import LinearRegression
from dislib.neighbors import NearestNeighbors

Examples

KMeans

[16]:
kmeans = KMeans(n_clusters=10)
pred_clusters = kmeans.fit_predict(x).collect()

Count how many images of each class were assigned to cluster 0:

[17]:
from collections import Counter
Counter(y_array[pred_clusters==0])
[17]:
Counter({8.0: 3499,
         5.0: 1209,
         3.0: 1058,
         2.0: 323,
         0.0: 121,
         9.0: 54,
         6.0: 45,
         7.0: 21,
         4.0: 16,
         1.0: 9})
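
The same counting can be repeated for every cluster to see which digit dominates each one. The sketch below uses small toy arrays in place of y_array and pred_clusters so that it runs without a PyCOMPSs session. cluster_summary is a hypothetical helper, not a dislib function:

```python
from collections import Counter

import numpy as np


def cluster_summary(labels, clusters):
    """Map each cluster id to (majority label, fraction of that label)."""
    summary = {}
    for c in np.unique(clusters):
        counts = Counter(labels[clusters == c].tolist())
        label, hits = counts.most_common(1)[0]
        summary[int(c)] = (label, hits / sum(counts.values()))
    return summary


# Toy data standing in for y_array and pred_clusters.
toy_labels = np.array([8., 8., 5., 3., 3., 3.])
toy_clusters = np.array([0, 0, 0, 1, 1, 1])
summary = cluster_summary(toy_labels, toy_clusters)
print(summary)
```

For the counts printed above, label 8 accounts for 3499 of the 6355 images in cluster 0, about 55%.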

GaussianMixture

Fit a GaussianMixture to the coordinates of the painted pixels (intensity above 10) of a single image:

[18]:
import numpy as np
img_filtered_pixels = np.stack([np.array([i, j]) for i in range(28) for j in range(28) if img[i,j] > 10])
img_pixels = ds.array(img_filtered_pixels, block_size=(50,2))
gm = GaussianMixture(n_components=7, random_state=0)
gm.fit(img_pixels)
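
The pixel filter above builds the coordinate list with a nested comprehension. As an optional alternative, np.argwhere yields the same (row, column) pairs in the same row-major order. The sketch below checks this on a random toy_img, which is a stand-in for img and not the MNIST digit itself:

```python
import numpy as np

rng = np.random.default_rng(0)
toy_img = rng.integers(0, 256, size=(28, 28))  # stand-in for img

# List-comprehension form, as used in the tutorial cell.
loop_pixels = np.stack([np.array([i, j])
                        for i in range(28)
                        for j in range(28)
                        if toy_img[i, j] > 10])

# Vectorised form: coordinates of every entry above the threshold.
vec_pixels = np.argwhere(toy_img > 10)

same = np.array_equal(loop_pixels, vec_pixels)
print(same)
```

Either array can be passed to ds.array in the same way as img_filtered_pixels.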

Get the parameters that define the Gaussian components:

[19]:
from pycompss.api.api import compss_wait_on
means = compss_wait_on(gm.means_)
covariances = compss_wait_on(gm.covariances_)
weights = compss_wait_on(gm.weights_)

Sample random pixel coordinates from the fitted mixture to approximate the original distribution of painted pixels:

[20]:
samples = np.concatenate([np.random.multivariate_normal(means[i], covariances[i], int(weights[i]*1000))
                    for i in range(7)])
plt.scatter(samples[:,1], samples[:,0])
plt.gca().set_aspect('equal', adjustable='box')
plt.gca().invert_yaxis()
plt.draw()
[Figure: scatter plot of the pixels sampled from the Gaussian mixture]
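
Because int(weights[i]*1000) truncates, the per-component counts in the cell above can add up to slightly fewer than 1000 samples. One possible variation, shown here as a sketch and not as the notebook's method, draws the counts from a multinomial distribution so that they always sum to the requested total. sample_mixture and the toy parameters are hypothetical; in the tutorial the inputs would be means, covariances and weights:

```python
import numpy as np


def sample_mixture(means, covariances, weights, n_samples, seed=0):
    """Draw n_samples points from a Gaussian mixture (illustrative helper)."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    # Split the total among components in proportion to their weights.
    counts = rng.multinomial(n_samples, weights / weights.sum())
    parts = [rng.multivariate_normal(m, c, size=int(k))
             for m, c, k in zip(means, covariances, counts)
             if k > 0]
    return np.concatenate(parts)


# Toy two-component mixture standing in for the fitted model.
toy_means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
toy_covs = [np.eye(2), np.eye(2)]
toy_weights = [0.3, 0.7]

toy_samples = sample_mixture(toy_means, toy_covs, toy_weights, 1000)
print(toy_samples.shape)  # (1000, 2)
```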

PCA

[21]:
pca = PCA()
pca.fit(x)
[21]:
PCA()

Calculate the fraction of the total variance explained by the first 10 eigenvectors:

[22]:
explained_variance = pca.explained_variance_.collect()
sum(explained_variance[0:10])/sum(explained_variance)
[22]:
0.48814980354933996
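
The ratio above is a partial sum of eigenvalues divided by their total. The NumPy sketch below reproduces that calculation on synthetic data, assuming (as in the usual PCA formulation) that the explained variances are the eigenvalues of the covariance matrix sorted in descending order. The data and variable names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data whose columns have very different variances.
data = rng.normal(size=(500, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# Eigenvalues of the covariance matrix, largest first.
cov = np.cov(data, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]

# Fraction of variance per component and its running total.
ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratio)
print(cumulative)
```

Reading cumulative[k - 1] gives the fraction explained by the first k components, which is the quantity computed for k = 10 in the cell above.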

Show the absolute weights of the first eigenvector as a 28 x 28 image:

[23]:
plt.imshow(np.abs(pca.components_.collect()[0]).reshape(28,28))
[23]:
<matplotlib.image.AxesImage at 0x7ff144982fe0>
[Figure: absolute weights of the first eigenvector, shown as a 28 x 28 image]

RandomForestClassifier

[24]:
rf = RandomForestClassifier(n_estimators=5, max_depth=3)
rf.fit(x, y)
[24]:
RandomForestClassifier(max_depth=3, n_estimators=5)

Use the test dataset to get an accuracy score:

[25]:
x_test, y_test = ds.load_svmlight_file('/tmp/mnist/mnist.test', block_size=(10000, 784), n_features=784, store_sparse=False)
score = rf.score(x_test, y_test)
print(compss_wait_on(score))
0.6152
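
An accuracy score is the fraction of samples whose predicted label matches the true label. The sketch below computes it with NumPy on toy labels, assuming that rf.score reports this mean accuracy. The accuracy function is a hypothetical helper, not the dislib implementation:

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of positions where the two label arrays agree."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    return float(np.mean(y_true == y_pred))


# Three of the four toy predictions are correct.
toy_score = accuracy([5, 0, 4, 1], [5, 0, 9, 1])
print(toy_score)  # 0.75
```

A score of 0.6152 therefore means that roughly 62% of the test images were classified correctly by this small forest of 5 trees with depth 3.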

Close the session

To finish the session, we need to stop PyCOMPSs:

[26]:
ipycompss.stop()
********************************************************
***************** STOPPING PyCOMPSs ********************
********************************************************
Checking if any issue happened.
Warning: some of the variables used with PyCOMPSs may
         have not been brought to the master.
********************************************************