1.14. Machine Learning with dislib

This tutorial will show the different algorithms available in dislib.

Setup

First, we need to start an interactive PyCOMPSs session:

[1]:

import pycompss.interactive as ipycompss
import os
if 'BINDER_SERVICE_HOST' in os.environ:
    ipycompss.start(graph=True,
                    project_xml='../xml/project.xml',
                    resources_xml='../xml/resources.xml')
else:
    ipycompss.start(graph=True, monitor=1000)

******************************************************
*************** PyCOMPSs Interactive *****************
******************************************************
*          .-~~-.--.           _____       _______   *
*         :         )         |____ \     /  ___  \  *
*   .~ ~ -.\       /.- ~~ .     ___) |    | (___) |  *
*   >       `.   .'       <    / ___/      > ___ <   *
*  (         .- -.         )  | |___   _  | (___) |  *
*   `- -.-~  `- -'  ~-.- -'   |_____| |_| \_______/  *
*     (        :        )           _ _ .-:          *
*      ~--.    :    .--~        .-~  .-~  }          *
*          ~-.-^-.-~ \_      .~  .-~   .~            *
*                   \ \ '     \ '_ _ -~              *
*                    \`.\`.    //                    *
*           . - ~ ~-.__\`.\`-.//                     *
*       .-~   . - ~  }~ ~ ~-.~-.                     *
*     .' .-~      .-~       :/~-.~-./:               *
*    /_~_ _ . - ~                 ~-.~-._            *
*                                     ~-.<           *
******************************************************
* - Starting COMPSs runtime...                       *
* - Log path : /home/user/.COMPSs/InteractiveMode_14/
* - PyCOMPSs Runtime started... Have fun!            *
******************************************************

Next, we import dislib and we are all set to start working!

[2]:

import dislib as ds

Load the MNIST dataset

[3]:

x, y = ds.load_svmlight_file('/tmp/mnist/mnist', # Download the dataset
                             block_size=(10000, 784), n_features=784, store_sparse=False)

[4]:

x.shape

[4]:

(60000, 784)

[5]:

y.shape

[5]:

(60000, 1)

[6]:

y_array = y.collect()
y_array

[6]:

array([5., 0., 4., ..., 5., 6., 8.])

[7]:

img = x[0].collect().reshape(28,28)

[8]:

%matplotlib inline
import matplotlib.pyplot as plt
plt.imshow(img)

[8]:

<matplotlib.image.AxesImage at 0x7fcdade69e48>

../../../_images/Sections_09_PyCOMPSs_Notebooks_syntax_10_Dislib_estimators_10_1.png

[9]:

int(y[0].collect())

[9]:

dislib algorithms

Preprocessing

[10]:

from dislib.preprocessing import StandardScaler
from dislib.decomposition import PCA

Clustering

[11]:

from dislib.cluster import KMeans
from dislib.cluster import DBSCAN
from dislib.cluster import GaussianMixture

Classification

[12]:

from dislib.classification import CascadeSVM
from dislib.classification import RandomForestClassifier

Recommendation

[13]:

from dislib.recommendation import ALS

Model selection

[14]:

from dislib.model_selection import GridSearchCV

Others

[15]:

from dislib.regression import LinearRegression
from dislib.neighbors import NearestNeighbors

Examples

KMeans

[16]:

kmeans = KMeans(n_clusters=10)
pred_clusters = kmeans.fit_predict(x).collect()

Get the number of images of each class in the cluster 0:

[17]:

from collections import Counter
Counter(y_array[pred_clusters==0])

[17]:

Counter({3.0: 4064,
         8.0: 1942,
         9.0: 110,
         2.0: 381,
         1.0: 10,
         5.0: 1910,
         0.0: 187,
         6.0: 29,
         7.0: 6,
         4.0: 1})

GaussianMixture

Fit the GaussianMixture with the painted pixels of a single image:

[18]:

import numpy as np
img_filtered_pixels = np.stack([np.array([i, j]) for i in range(28) for j in range(28) if img[i,j] > 10])
img_pixels = ds.array(img_filtered_pixels, block_size=(50,2))
gm = GaussianMixture(n_components=7, random_state=0)
gm.fit(img_pixels)

Get the parameters that define the Gaussian components:

[19]:

from pycompss.api.api import compss_wait_on
means = compss_wait_on(gm.means_)
covariances = compss_wait_on(gm.covariances_)
weights = compss_wait_on(gm.weights_)

Use the Gaussian mixture model to sample random pixels replicating the original distribution:

[20]:

samples = np.concatenate([np.random.multivariate_normal(means[i], covariances[i], int(weights[i]*1000))
                    for i in range(7)])
plt.scatter(samples[:,1], samples[:,0])
plt.gca().set_aspect('equal', adjustable='box')
plt.gca().invert_yaxis()
plt.draw()

../../../_images/Sections_09_PyCOMPSs_Notebooks_syntax_10_Dislib_estimators_35_0.png

PCA

[21]:

pca = PCA()
pca.fit(x)

[21]:

PCA(arity=50, n_components=None)

Calculate the explained variance of the 10 first eigenvectors:

[22]:

sum(pca.explained_variance_[0:10])/sum(pca.explained_variance_)

[22]:

0.4881498035493399

Show the weights of the first eigenvector:

[23]:

plt.imshow(np.abs(pca.components_[0]).reshape(28,28))

[23]:

<matplotlib.image.AxesImage at 0x7fcd9ea5c550>

../../../_images/Sections_09_PyCOMPSs_Notebooks_syntax_10_Dislib_estimators_41_1.png

RandomForestClassifier

[24]:

rf = RandomForestClassifier(n_estimators=5, max_depth=3)
rf.fit(x, y)

[24]:

RandomForestClassifier(distr_depth='auto', hard_vote=False, max_depth=3,
                       n_estimators=5, random_state=None,
                       sklearn_max=100000000.0, try_features='sqrt')

Use the test dataset to get an accuracy score:

[25]:

x_test, y_test = ds.load_svmlight_file('/tmp/mnist/mnist.test', block_size=(10000, 784), n_features=784, store_sparse=False)
score = rf.score(x_test, y_test)
print(compss_wait_on(score))

0.5965

Close the session

To finish the session, we need to stop PyCOMPSs:

[26]:

ipycompss.stop()

****************************************************
*************** STOPPING PyCOMPSs ******************
****************************************************
Warning: some of the variables used with PyCOMPSs may
         have not been brought to the master.
****************************************************