Data Science Bowl 2018 Kaggle competition

I have at last found the time to try Keras! It is so much easier to get started and running with than Tensorflow, although of course this comes at the price of some flexibility.

I’m trying my hand at the Kaggle Data Science Bowl 2018 competition, on the topic of object segmentation, which in this case means delimiting cells in medical imagery. The training data are microscopy photographs of cells, and the label of each image is a set of masks (binary pictures), one for each of the individual cells in the picture. Below are some examples of input images in the training set. Sizes and aspect ratios vary, but sizes are often around 256×256 pixels.

An example of an input image (training example) and its masks (labels):

 

As usual, the Kaggle forums are full of helpful people providing starter notebooks; I used the Keras U-Net starter as the basis for my explorations.

The neural network in this approach is a simplified version of the one described in this paper; each layer has fewer channels. I kept this simplification because I’m working in a Google Colaboratory notebook, and given the notebook limitations, using the same number of features per layer as in the paper makes the training very slow, as in “surprise! notebook disconnected” sort of slow. Anyway, Colaboratory is still great, considering that it is a free tool. It’s a Jupyter-based Python notebook running on shared VMs at Google, and the VM comes with a free GPU! The VM state is not persistent, so after 90 minutes of inactivity the notebook disconnects and the VM is lost. This means any package installations or data downloads have to be redone at every restart. Thankfully, Linux commands can be integrated in the notebook simply by starting the command with “!”.

Here are the preliminary steps I added to the original Unet Starter Notebook:

!pip install -q tqdm
!git clone -q https://github.com/lopuhin/kaggle-dsbowl-2018-dataset-fixes.git
!pip install -q kaggle
!mkdir -p /content/.kaggle/
!mkdir -p data-science-bowl-2018/original
!mkdir -p data-science-bowl-2018/merged/stage1_train
# unsaaaaaafe
!echo '{"username":"mykagglename","key":"mykaggleapikeygoeshere"}' > /content/.kaggle/kaggle.json
!kaggle competitions download -c data-science-bowl-2018 -p data-science-bowl-2018/original/.
!unzip -q -o "data-science-bowl-2018/original/stage1_test.zip" -d "data-science-bowl-2018/original/stage1_test"
!unzip -q -o "data-science-bowl-2018/original/stage1_train.zip" -d "data-science-bowl-2018/original/stage1_train"
!cp -r data-science-bowl-2018/original/stage1_train/* data-science-bowl-2018/merged/stage1_train/.
!cp -r kaggle-dsbowl-2018-dataset-fixes/stage1_train/* data-science-bowl-2018/merged/stage1_train/.

Now comes the Python script starting with some imports and global variables:

import os
import sys
import warnings
from subprocess import check_output

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
from skimage.morphology import extrema
import matplotlib as mp

from tqdm import tqdm
from itertools import chain

import skimage.io
import skimage.color
import skimage.transform
import skimage.morphology
from scipy import ndimage as ndi
from skimage.morphology import watershed
from skimage.feature import peak_local_max

from keras import backend as K
from keras.models import Model, load_model
from keras.layers import Input
from keras.layers.core import Dropout, Lambda
from keras.layers.convolutional import Conv2D, Conv2DTranspose
from keras.layers.pooling import MaxPooling2D
from keras.layers.merge import concatenate
from keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard
from keras.preprocessing.image import ImageDataGenerator
from keras.losses import binary_crossentropy

IMG_WIDTH = 256
IMG_HEIGHT = 256
IMG_CHANNELS = 3 # use only RGB channels
TRAIN_PATH = '/content/data-science-bowl-2018/merged/stage1_train'
TEST_PATH = '/content/data-science-bowl-2018/original/stage1_test'
BATCH_SIZE = 15

np.random.seed(42)

cmap2 = mp.colors.ListedColormap(np.concatenate((np.array([[0,0,0]]), np.random.rand(256,3))))

warnings.filterwarnings('ignore', category=UserWarning, module='skimage')

The images in the dataset have different sizes, so we are going to resize them all to a common size; the original notebook uses 128×128 pixels, I think to keep the training fast enough for Kagglers to be able to run the notebook easily. I used 256×256 (the IMG_WIDTH and IMG_HEIGHT above), which is closer to the size of the majority of the pictures and keeps more information.

I also added a function to read the pictures and merge the masks into a single image:

def imread_grey(f):
    return skimage.io.imread(f, as_grey=True)

def read_image_labels(image_id, merge=True):
    # return image with id number image_id 
    # and label = union of masks if merge = True (binary picture), 
    # if merge = False: label = matrix with distinct integers for each cell
    # (useful for plotting for performance checks)
    image_file = "{}/{}/images/{}.png".format(TRAIN_PATH, image_id, image_id)
    mask_file = "{}/{}/masks/*.png".format(TRAIN_PATH, image_id)
    image = skimage.io.imread(image_file)
    masks = skimage.io.ImageCollection(mask_file, load_func=imread_grey).concatenate()
    masks_union = np.sum(masks, axis=0) # has 2 if 2 cells are overlapping, 3 if 3 overlap etc
    if merge:
        # assumes masks and image have same dimensions
        labels = masks_union > 0 
    else:
        # give each mask a distinct non-zero integer (i+1, since i starts at 0)
        masks_indexed = [masks[i] * (i + 1) for i in range(len(masks))]
        # overlapping cells attributed to last one in mask name order
        labels = np.max(masks_indexed, axis=0) 

    return image, labels

Example result:

We make a big simplification here by merging all the masks. This means cells that overlap can no longer be individually contoured… so it’s up to the user to post-process the predicted masks and separate large clusters into individual cells. This is especially important since the metric used for scoring penalises missed cells more than inaccurately contoured ones.

We then build the X and Y arrays of training and test set examples:

# Get train and test IDs
test_ids = next(os.walk(TEST_PATH))[1]
train_ids = next(os.walk(TRAIN_PATH))[1]

def preprocess(image_input):
    # force same number of channels for grayscale pics as for color pics (3)
    rgb_img = skimage.color.grey2rgb(image_input, alpha=False)
    return skimage.transform.resize(rgb_img, (IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS), mode='constant', preserve_range=True)

# Get and resize train images and masks
size_train = len(train_ids)
X_train = np.zeros((size_train, IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS), dtype=np.uint8)
Y_train = np.zeros((size_train, IMG_HEIGHT, IMG_WIDTH, 1), dtype=np.bool) # additional "dim" around each value for unet input; use np.squeeze for plots
Y_train_distinct = np.zeros((size_train, IMG_HEIGHT, IMG_WIDTH, 1), dtype=np.int32) # one distinct integer per cell, used only for plotting / performance checks

print('Getting and resizing train images and masks ... ')
sys.stdout.flush()
for n, image_id in tqdm(enumerate(train_ids), total=size_train):
    image, mask = read_image_labels(image_id)
    X_train[n] = preprocess(image)
    Y_train[n] = np.expand_dims(skimage.transform.resize(mask, (IMG_HEIGHT, IMG_WIDTH), mode='constant', preserve_range=True), axis=-1)
    image, mask = read_image_labels(image_id, merge=False)
    Y_train_distinct[n] = np.expand_dims(skimage.transform.resize(mask, (IMG_HEIGHT, IMG_WIDTH), mode='constant', preserve_range=True), axis=-1)

# Get and resize test images
X_test = np.zeros((len(test_ids), IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS), dtype=np.uint8)
sizes_test = []
print('Getting and resizing test images ... ')
sys.stdout.flush()
for n, id_ in tqdm(enumerate(test_ids), total=len(test_ids)):
    image = skimage.io.imread(TEST_PATH + "/"+id_ + "/images/" + id_ + '.png')[:,:,:IMG_CHANNELS]
    sizes_test.append(list(image.shape))
    X_test[n] = preprocess(image)

I tried different losses for the network; mean IoU as defined in this competition (an object-level score averaged over IoU thresholds) is not available in either the Keras or Tensorflow libraries. A Dice loss (closely related to intersection over union) gives the best results.
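
For reference, the competition scores each image by averaging the precision TP/(TP+FP+FN) over IoU thresholds from 0.5 to 0.95 in steps of 0.05, where a prediction is a true positive at a threshold if it overlaps a ground-truth cell with IoU above it. The rough numpy sketch below (the function name and greedy matching are my own, not from the original notebook) is only meant to illustrate why this is not a drop-in Keras loss: it operates on labelled objects rather than pixels and is not differentiable.

def competition_score(true_labels, pred_labels):
    # true_labels, pred_labels: integer-labelled images (0 = background, 1..N = objects)
    # (for simplicity, assumes both images contain at least one object)
    true_ids = [i for i in np.unique(true_labels) if i != 0]
    pred_ids = [j for j in np.unique(pred_labels) if j != 0]
    # IoU between every (ground truth, prediction) pair of objects
    iou = np.zeros((len(true_ids), len(pred_ids)))
    for a, i in enumerate(true_ids):
        t = true_labels == i
        for b, j in enumerate(pred_ids):
            p = pred_labels == j
            iou[a, b] = np.logical_and(t, p).sum() / float(np.logical_or(t, p).sum())
    scores = []
    for thresh in np.arange(0.5, 1.0, 0.05):
        matches = iou > thresh           # at IoU > 0.5 each ground truth matches at most one prediction
        tp = np.sum(matches.any(axis=1)) # matched ground truth objects
        fn = len(true_ids) - tp          # missed ground truth objects
        fp = np.sum(~matches.any(axis=0))# predictions that match nothing
        scores.append(tp / float(tp + fp + fn))
    return np.mean(scores)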

This is the loss function and the U-net network:

def dice_coef(y_true, y_pred):
    smooth = 1.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_loss(y_true, y_pred):
    return -dice_coef(y_true, y_pred)

inputs = Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
s = Lambda(lambda x: x / 255) (inputs)

c1 = Conv2D(16, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (s)
c1 = Dropout(0.1) (c1)
c1 = Conv2D(16, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c1)
p1 = MaxPooling2D((2, 2)) (c1)

c2 = Conv2D(32, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (p1)
c2 = Dropout(0.1) (c2)
c2 = Conv2D(32, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c2)
p2 = MaxPooling2D((2, 2)) (c2)

c3 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (p2)
c3 = Dropout(0.2) (c3)
c3 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c3)
p3 = MaxPooling2D((2, 2)) (c3)

c4 = Conv2D(128, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (p3)
c4 = Dropout(0.2) (c4)
c4 = Conv2D(128, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c4)
p4 = MaxPooling2D(pool_size=(2, 2)) (c4)

c5 = Conv2D(256, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (p4)
c5 = Dropout(0.3) (c5)
c5 = Conv2D(256, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c5)

u6 = Conv2DTranspose(128, (2, 2), strides=(2, 2), padding='same') (c5)
u6 = concatenate([u6, c4])
c6 = Conv2D(128, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (u6)
c6 = Dropout(0.2) (c6)
c6 = Conv2D(128, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c6)

u7 = Conv2DTranspose(64, (2, 2), strides=(2, 2), padding='same') (c6)
u7 = concatenate([u7, c3])
c7 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (u7)
c7 = Dropout(0.2) (c7)
c7 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c7)

u8 = Conv2DTranspose(32, (2, 2), strides=(2, 2), padding='same') (c7)
u8 = concatenate([u8, c2])
c8 = Conv2D(32, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (u8)
c8 = Dropout(0.1) (c8)
c8 = Conv2D(32, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c8)

u9 = Conv2DTranspose(16, (2, 2), strides=(2, 2), padding='same') (c8)
u9 = concatenate([u9, c1], axis=3)
c9 = Conv2D(16, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (u9)
c9 = Dropout(0.1) (c9)
c9 = Conv2D(16, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c9)

outputs = Conv2D(1, (1, 1), activation='sigmoid') (c9)

model = Model(inputs=[inputs], outputs=[outputs])
model.compile(optimizer='adam', loss=dice_loss, metrics=['accuracy'])

My understanding of this is that the descending branch is a classic convnet, i.e. convolutional layers with several filters (or channels) to extract features that get more and more high-level with depth, interleaved with max-pooling layers that coarsen the spatial localization information. The ascending branch reconstructs a high-resolution image using the features learned at every step of the descending branch, so that the detailed knowledge of what happens at high resolution (first layers) can be combined with the high-level understanding of the image (middle layers). The paper also weights the loss per pixel to force the network to learn the thin boundaries between objects, which I should find time to do here as well!
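
I haven’t implemented that weighting, but a rough sketch of how such a weight map could be computed (following the paper’s border term, with the class-balancing term left out) might look like the function below; w0 and sigma are the parameters from the paper, and the function itself is hypothetical, not part of the original notebook. Wiring it into Keras would still require feeding the map to a custom loss, for instance as an extra channel of Y.

def unet_weight_map(labelled_mask, w0=10.0, sigma=5.0):
    # labelled_mask: integer image, 0 = background, 1..N = individual cells
    # returns 1 + w0 * exp(-(d1 + d2)**2 / (2 * sigma**2)) on background pixels,
    # where d1 and d2 are the distances to the two nearest cells
    cell_ids = [i for i in np.unique(labelled_mask) if i != 0]
    weights = np.ones(labelled_mask.shape, dtype=np.float32)
    if len(cell_ids) < 2:
        return weights
    # distance from every pixel to each cell (one distance map per cell)
    distances = np.stack([ndi.distance_transform_edt(labelled_mask != i) for i in cell_ids], axis=-1)
    distances.sort(axis=-1)
    d1, d2 = distances[..., 0], distances[..., 1]
    border_weight = w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))
    weights[labelled_mask == 0] += border_weight[labelled_mask == 0]
    return weights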

In order to get more training examples, I added data augmentation, which can be done on the fly using a generator, i.e. there is no need to write a function to rotate/crop/flip the images nor to build a new training array holding the augmented examples. The training finally looks like this:

earlystopper = EarlyStopping(patience=5, verbose=1)
checkpointer = ModelCheckpoint('model-dsbowl2018.h5', verbose=1, save_best_only=True)

val_cutoff = int(size_train*0.9)
X_train2 = X_train[:val_cutoff]
Y_train2 = Y_train[:val_cutoff]
X_val = X_train[val_cutoff:]
Y_val = Y_train[val_cutoff:]

def data_generator(input, augment=False):
    # for keras on the fly data augmentation
    if augment:
        data_gen_args = dict(shear_range=0.5, rotation_range=50, zoom_range=0.2,
                         width_shift_range=0.2, height_shift_range=0.2, fill_mode='reflect')
    else:
        data_gen_args = dict()
    datagen = ImageDataGenerator(**data_gen_args)
    # Provide the same seed and keyword arguments to the fit and flow methods
    datagen.fit(input, augment=True, seed=42) # fit is only needed for the featurewise normalization options, which are not used here, so this could probably be removed
    return datagen.flow(input, batch_size=BATCH_SIZE, shuffle=True, seed=42)

# combine generators into one which yields image and masks
train_generator = zip(data_generator(X_train2, augment=True), data_generator(Y_train2, augment=True))
val_generator = zip(data_generator(X_val), data_generator(Y_val))

# steps per epoch without data generation = nbr_samples/batch size (around 60 with batch = 10)
results = model.fit_generator(train_generator,
                              validation_data = val_generator,
                              epochs=50,
                              steps_per_epoch = 250,
                              validation_steps = 10,
                              callbacks=[earlystopper, checkpointer])

I tried using Tensorboard, but I couldn’t make it work in Colaboratory using the method described in this notebook.

In order to separate the touching cells, I tried watershed post-processing, but it invents separations everywhere on elongated shapes (although it works quite well when the touching cells look like two almost-tangent circles). Blindly applying my simple watershed split to all the predictions does not improve performance.
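
The post doesn’t show the prediction step or the watershed_split helper used in the submission loop below, so here is a hedged sketch of what they could look like: reload the best checkpoint, resize each prediction back to its original test image size, and seed a watershed with the local maxima of the distance transform. The exact parameters (cutoff, min_distance) are guesses, not the values actually used.

# reload the best checkpoint (the custom loss has to be passed explicitly)
model = load_model('model-dsbowl2018.h5', custom_objects={'dice_loss': dice_loss})
preds_test = model.predict(X_test, verbose=1)

# resize each prediction back to the original size of the corresponding test image
preds_test_upsampled = [skimage.transform.resize(np.squeeze(preds_test[n]),
                                                 (sizes_test[n][0], sizes_test[n][1]),
                                                 mode='constant', preserve_range=True)
                        for n in range(len(preds_test))]

def watershed_split(pred, cutoff=0.5, min_distance=10):
    # threshold the probability map, then use the local maxima of the
    # distance transform as markers to split touching cells
    mask = pred > cutoff
    distance = ndi.distance_transform_edt(mask)
    local_maxi = peak_local_max(distance, indices=False, min_distance=min_distance, labels=mask)
    markers = ndi.label(local_maxi)[0]
    return watershed(-distance, markers, mask=mask)  # one distinct integer per cell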

The submission is built using run-length encoding (source). The submission file should look like:
ImageId,EncodedPixels
0114f484a16c152baa2d82fdd43740880a762c93f436c8988ac461c5c9dbe7d5,1 1 -> nuclei #1
0999dab07b11bc85fb8464fc36c947fbd8b5d6ec49817361cb780659ca805eac,1 1 -> nuclei #1
0999dab07b11bc85fb8464fc36c947fbd8b5d6ec49817361cb780659ca805eac,2 3 8 9 -> nuclei #2

In the rle_encoding function, x is a binary matrix where the 1’s are the points of interest; the function returns a list of alternating start indexes and run lengths ([start_1, length_1, start_2, length_2 …]), where the start indexes are 1-based positions in the column-major (transposed then flattened) version of x, as required by the submission format.
For example, if x is
0001100000
0011110000
0000110000
0000000100
0000001100
The function returns [12, 1, 16, 2, 21, 3, 27, 2, 35, 1, 39, 2].

def rle_encoding(x):
    dots = np.where(x.T.flatten() == 1)[0] # indexes of the 1's
    run_lengths = []
    prev = -2
    for b in dots:
        if (b>prev+1): run_lengths.extend((b + 1, 0)) # new batch of 1's: append its 1-based start index and a length of 0
        run_lengths[-1] += 1 # increment the length of the current batch of 1's
        prev = b
    return run_lengths
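
# Quick check of rle_encoding against the worked example above
# (this check is not in the original notebook):
x_check = np.array([[0,0,0,1,1,0,0,0,0,0],
                    [0,0,1,1,1,1,0,0,0,0],
                    [0,0,0,0,1,1,0,0,0,0],
                    [0,0,0,0,0,0,0,1,0,0],
                    [0,0,0,0,0,0,1,1,0,0]])
print(rle_encoding(x_check))  # [12, 1, 16, 2, 21, 3, 27, 2, 35, 1, 39, 2]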

def prob_to_rles(x, cutoff=0.5):
    # x is an integer-labelled image: 0 for background, a distinct integer for each cell
    # (if x were a raw probability map we would first need
    # lab_img = skimage.morphology.label(x > cutoff); not needed when using the watershed split)
    for i in range(1, x.max() + 1): # for each labelled cell, yield its rle representation
        yield rle_encoding(x == i)   

new_test_ids = []
rles = []
for n, id_ in enumerate(test_ids):
    split_cells = watershed_split(preds_test_upsampled[n])
    rle = list(prob_to_rles(split_cells))
    rles.extend(rle)
    new_test_ids.extend([id_] * len(rle))

# Create submission DataFrame
sub = pd.DataFrame()
sub['ImageId'] = new_test_ids
sub['EncodedPixels'] = pd.Series(rles).apply(lambda x: ' '.join(str(y) for y in x))
sub.to_csv('sub-dsbowl2018-12.csv', index=False)
#print(sub.head())

And finally the Kaggle submission can be done via the API:

!kaggle competitions submit -c data-science-bowl-2018 -f sub-dsbowl2018-12.csv -m "added watershed splitting"

My best score using this is 0.329, not groundbreaking by any means, but still: when looking at the predictions on a validation set, I was pretty impressed by how well this relatively simple network, trained with few images (around 600), could segment a lot of cells. This isn’t a pretrained network fine-tuned on its last layers with a small dataset; the only images used are the ones from this competition.


Neural Net generated knits

Janelle Shane has trained a text-generating neural network on knitting patterns (kindly contributed by Ravelry users and Stitch-Maps) and the results are hilariously absurd. The neural network has been named SkyKnit.

I tested two of these:

  • “mystery lace (copy)”; the training set does include patterns for tentacled toys, but these tentacles below are probably a consequence of SkyKnit’s inability to keep stitch counts consistent across rows. This one had some success on Twitter 🙂


  • “snowing leaves”, a pretty convincing attempt at a lace motif


SkyKnit performs better when trained exclusively on Stitch-Maps patterns, which are written in a standard and consistent way, as opposed to ordinary patterns, which often mix natural language with the knitting instructions.

Draw your own dataset

For all those who want to visually test their classification algorithms on toy data, here is an RShiny app I made to easily click and draw your own custom dataset! It can be accessed on shinyapps.io, and the source code is on Github.

It’s mostly useful for small two-dimensional numeric datasets that are inconvenient to build as a superposition of samples from classic distributions.


Screenshot of the app

RShiny is a web application framework by RStudio that allows you to make neat interactive web apps entirely in R. Shiny is quite practical because it handles most of the event handling and variable updating under the hood. All we need to do is declare the variables that are subject to interactive change as reactive values, and keep track of where the changes happen.

In this app, every time the user clicks on the canvas, a small group of points is created near the click position; the number, the class (color), and the spread of the points created at each click can be adjusted in the parameters bar. The canvas can be cleared, or the latest points undone if necessary. Once the dataset is ready, it can be downloaded as a csv file containing the x,y coordinates and class of each point; the downloaded dataset is scaled to zero mean and unit variance. This GUI is defined in the ui.R program:


library(shiny)
library(shinydashboard)

dashboardPage(
  dashboardHeader(title="Draw Dataset"),
  dashboardSidebar(disable=T),
  dashboardBody(
    box(width=12,
        title="Parameters",
        fluidRow(
          column(4, numericInput("num_points", "Number of points per click", 3)),
          column(4,
                 numericInput("sigma",
                              "Standard deviation of each clic point set",
                              step=0.01,
                              value=5)
          ),
          column(4,
                 selectInput("class", "Class (color)", choices=c("red"="firebrick1",
                                                                 "green"="forestgreen",
                                                                 "blue"="dodgerblue")))
        ),
        fluidRow(
          column(12,
                 downloadButton("save", "Save"),
                 actionButton("clear", "Clear", icon=icon("remove")),
                 actionButton("undo", "Undo", icon=icon("undo"))
          )
        )
    ),
    box(width=12,
        height=870,
        title="Draw your own dataset by clicking on this canvas",
        plotOutput("data_plot", click="plot_click")
    )
  )
)

The code in server.R, shown below, holds the data frame that will contain the dataset, and updates it whenever a user-triggered event happens.
The dataset is a reactive object; its potential user-triggered changes are handled in the observe({}) blocks. The variables corresponding to user input are called input$something, where the various “something” are defined in ui.R.


library(ggplot2)

server <- function(input, output, session) {

  addGroup <- function(data, n, center, sigma, class){
    # append a group of points distributed around the click coordinates
    # to a data frame holding all the points created and their color.
    new_group <- data.frame("x"=rnorm(n, mean=center$x, sd=sigma),
                            "y"=rnorm(n, mean=center$y, sd=sigma),
                            "class"=class)
    return(rbind(data, new_group))
  }

  # initialize reactive dataset holding the points created by the user
  v <- reactiveValues(data = data.frame())

  observe({
    # populates the dataset with points distributed around the click target point
    if(!is.null(input$plot_click)) {
      # "isolate" reads fresh value of v$data without the update re-evaluating it
      # avoids infinite loop of update with rbind then re-rbind the updated data with the new group
      v$data <- isolate(addGroup(v$data, input$num_points, input$plot_click, input$sigma, input$class))
    }
  })

  observe({
    # remove all points from the canvas and the dataset when clear button is clicked
    if(!is.null(input$clear)) {
      v$data <- data.frame()
    }
  })

  observeEvent(input$undo, {
    # remove the latest drawn point from the dataset when undo button is clicked
    v$data <- v$data[-nrow(v$data), ]
  })

  output$save <- downloadHandler(
    # save the dataset as a csv file
    # scale to zero mean and unit variance before saving
    filename = function() {'DIYdataset.csv'},
    content = function(file) {
      write.csv(data.frame(scale(v$data[,c("x","y")]), "color"=v$data$class),
                file,
                quote=F,
                row.names=F)}
  )

  output$data_plot <- renderPlot({
    # display the base plot
    plot <- ggplot() + xlim(0, 100) + ylim(0, 100) + xlab("x") + ylab("y")
    # if data is not empty, add it to the plot
    # points outside of the plot boundaries are added to the dataset but not displayed
    if (nrow(v$data) > 0) {
      plot <- plot + geom_point(aes(x=v$data$x, y=v$data$y),
                                size=4,
                                colour=v$data$class,
                                show.legend=F)
    }
    return(plot)
  }, height=800)

}

The block in which the dataset receives new points uses “isolate” to avoid an infinite loop.
Without this, the dataset would first be updated with the new points, but RShiny would then detect that the dataset has changed and would reiterate the assignment of new points, then would detect this new change, etc.

RShiny is a great tool for quickly developing apps to demo models or visualize data; this app takes only 100 lines of code!

Using Tensorflow with Docker

I started the Google Deep Learning Course featuring the Python Tensorflow library.

Now, I’m using a Windows 7 laptop, and the installation instructions for Windows on the Tensorflow website are a bit less developed than those for Unix-based systems. The Deep Learning course comes with a Docker image packing Tensorflow and the course exercises in an IPython notebook, and this seems to actually be the simplest way to get Tensorflow on Windows.

To install and get started:

  1. Download and install the Docker Toolbox
  2. Open the Docker Quickstart terminal
  3. Paste the following command in this terminal. This will download (or open if it is already downloaded) the docker image containing Tensorflow and the course exercises.
    docker run -p 8888:8888 -it --rm b.gcr.io/tensorflow-udacity/assignments:0.4.0
  4. At the top of your Docker terminal you should see an IP address: go to http://<this_ip_address>:8888 in your browser. This should be the home page of the Jupyter notebook featuring the course exercises. You will be able to import tensorflow when coding through this interface.

I tried this install on a Mac and it works too!

Machine learning and data science resources

This post lists some useful sources for learning data science and machine learning. The first two are a good way to start, since they quickly bring you to the point where you can play with real world data.

Machine Learning by Andrew Ng (Stanford)

The best MOOC ever,  which is also the one that started Coursera.

Prerequisites: it helps if you are familiar with scalar products and matrix operations, and if you know how to program basic functions.

What’s inside: starts with basic linear regression and ramps up with gradient descent, logistic regressions, regularisation, neural networks, clustering, svm, recommenders, PCA …

Data Science Specialization by Roger Peng, Brian Caffo, and Jeff Leek (Johns Hopkins University)

This one is a long hike through many areas of data science (not just machine learning), and offers a very pragmatic introduction to the R programming language.

Prerequisites: familiarity with programming and undergrad math. The R exercises are quite close to what you will get asked in a Data Scientist job interview.

What’s inside: how to do everything to data using R (parse, explore, fit), how to connect to an API or parse data from a website, statistics, machine learning, reproducibility … It has some overlap with Andrew Ng’s course.

Probability and Statistics by Khan Academy

The basics of statistics and probabilities (and clearer IMO than the statistics section of the Johns Hopkins course).

Data Science at Scale Specialization by Bill Howe (University of Washington)

A course about the “Big” in “Big Data”, with data visualisation thrown in.

Prerequisites: python programming, familiarity with APIs.

What’s inside: SQL, noSQL, MapReduce, graphs, machine learning …

Overall, prerequisites depend a lot on the course, but a bit of background in computer science helps.

Books

I found An Introduction to Statistical Learning very helpful, both for the explanations of the algorithms and for the R exercises. The authors also made a series of videos following the contents of the book (index of the videos here). The advanced version of this book, with All The Math, is The Elements of Statistical Learning.

If you want to make beautiful graphs (and you should), books by Edward Tufte show how.

 

This list is subject to changes and additions as I continue watching courses and reading books!

 

Classification of knitting patterns

Knitting patterns for sale on Ravelry are accompanied by a text description, ranging from a few words to a full-page story in several languages. This text sometimes includes instructions about knitting techniques, a personal background story, or a reference to a larger collection of patterns by the same designer. This post looks at how reliably one can predict the pattern category (hat, sweater, scarf …) from the text of the pattern.

We first build a dataset using Ravelry API queries and web scraping. It will consist of a data frame with the pattern names, their text descriptions, and their category: hat, sweater, etc.


## Build dataset from Ravelry API: pattern permalink, pattern category, pattern text description
library(httr)  # GET, config
library(XML)   # htmlTreeParse, getNodeSet, xmlValue

# Get url to patterns of interest from API search
# (ravelry.token is the OAuth token obtained when setting up API access)
pat0 <- GET("https://api.ravelry.com/patterns/search.json?page_size=2000&craft=knitting", config=config("token"=ravelry.token))
pat <- content(pat0)

permalinks <- sapply(pat$patterns, function(x) x$permalink)
permalinks_full <- sapply(permalinks, function(name) paste("http://www.ravelry.com/patterns/library/",name,sep="",collapse=""))
names(permalinks_full) <- permalinks

# Get top level pattern category and description text using web scraping 
pattern_info <- lapply(permalinks_full, htmlTreeParse, useInternalNodes = TRUE)

pattern_description_par <- lapply(pattern_info, getNodeSet, path="//p", fun=xmlValue)
pattern_description <- sapply(pattern_description_par, paste, collapse=" ")

pattern_cat <- lapply(pattern_info, getNodeSet, path="//div[@class='category']/a/span/text()", fun=xmlValue)
pattern_topcat <- simplify2array(sapply(pattern_cat, head, 1))

Some pattern categories appear quite rarely, and may not be frequent enough to get decent prediction accuracy. We can filter out the corresponding entries to get cleaner data.


## Data: 3 columns with pattern permalink, text description, and toplevel category
data <- as.data.frame(cbind(permalinks, pattern_topcat, pattern_description),stringsAsFactors=F,row.names=F)
names(data) <- c("permalink", "category", "description")
data$category <- as.factor(data$category)

cat_freq <- table(data$category)
nbr_examples <- dim(data)[1]

# Remove from data the categories with too few examples 
data <- subset(data, subset=(cat_freq[category] > 50))
data$category <- factor(data$category)

The following R functions are quite useful for preparing the data for text mining. In order to predict a pattern category from its text, we look at the frequencies of words in each text for each category, and use these numbers for prediction. For example, a description in which the word “head” appears several times is more likely to describe a hat than a sock.

The first function removes the punctuation, numbers, and stopwords (“the”, “a” …) that appear very often in all texts but do not carry enough meaning to help prediction. It returns a clean corpus where each document corresponds to a cleaned-up pattern description.

The second function builds the Document Term Matrix (DTM), an object that holds, for each document, the frequencies of all the words in it. The columns are all the words in the corpus, and the rows are all the documents in the corpus (pattern descriptions). This DTM will be the dataset for running the algorithms, with words as the features and text descriptions as the cases.

The third function wraps it all together to turn the data frame into a ready-to-use dataset.



library(tm)  # text mining: Corpus, tm_map, DocumentTermMatrix (stemming also needs the SnowballC package installed)

cleanCorpus = function(corpus){
  # Clean the text data to remove punctuation, suffixes, numbers etc
  # To lowercase
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Remove stopwords first, else for ex. l'or becomes lor and l' is not removed
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  # Remove punctuation
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(function(str) gsub("[^[:alnum:] ]", " ",str)))
  # Remove  html tags with regexp
  corpus <- tm_map(corpus, content_transformer(function(x) gsub("<[a-z]*>", " ", x)))
  # Remove numbers - but they may be useful ... TODO ?
  corpus <- tm_map(corpus, removeNumbers)
  # Simplify whitespace
  corpus <- tm_map(corpus, stripWhitespace)
  # Stem words (tm_map stem has type error), use option lazy=T on mac os
  corpus <- tm_map(corpus, stemDocument, "english", lazy=T)
}

buildData = function(corpus, sparsity=0.999){
  # Arg: corpus where one document is one pattern description
  #      optional float word sparsity threshold 
  #      default: remove (almost) nothing
  # Returns Document Term Matrix
  dtm <- DocumentTermMatrix(corpus, 
                            control = list(weighting = weightTfIdf))
  # remove words that don't appear often enough for every category, else weird words and very large matrix
  dtm <- removeSparseTerms(dtm, sparsity)
}

prepareData <- function(df){
  # make clean cases and outcome based on text/category data frame
  corpus <- Corpus(VectorSource(df$description))
  names(corpus) <- df$category
  y <- df$category
  clean <- cleanCorpus(corpus)
  dtm = buildData(clean, 0.9)
  data <- as.data.frame(as.matrix(dtm))
  names(data) <- dtm$dimnames$Terms
  return (list("category" = y, "data" = data))
}

Before diving into the classical machine learning algorithms, we can set up a very simple prediction benchmark. We search the text for the pattern category name or related keywords (for example “scarf” and “shawl” for the Neck/Torso category). We then predict the category whose keywords appear most often. A text including “sock” and “foot” but no “sweater” is probably in the “Feet / Legs” category. The code below does just that, predicting the most frequent category in case none of the keywords appear in the description. (The keywords are guessed using this knitter’s domain knowledge!)



## Build train, cross-validation, and test sets, 50% of cases go to train set
library(caret)  # createDataPartition
sampler <- createDataPartition(data$category, times=3)
trainData <- data[sampler[[1]],]
crossValData <- data[sampler[[2]],]
testData <- data[sampler[[3]],]

trainFull <- prepareData(trainData)
y_train <- trainFull$category
train <- trainFull$data

cvFull <- prepareData(crossValData)
y_cv <- cvFull$category
cv <- cvFull$data

testFull <- prepareData(testData)
y_test <- testFull$category
test <- testFull$data

# benchmark test: predict category whose keywords appearing the most in the text
# if no keywords in text, predict most frequent category
predEasy <- function(text, sortedCategories){
  # assumes categories are sorted by most frequent in the training set
  # categoryInText holds, for each category, the indexes where the category keywords appear in the text
  categoryInText <- sapply(sortedCategories, function(pattern) gregexpr(pattern, text)[[1]])
  # catNbrOccurences holds the number of times a category keyword appears in the text
  catNbrOccurences <- sapply(categoryInText, function(l) sum(l > 0))
  # return the category with the most keyword occurrences
  cat <- sortedCategories[which.max(catNbrOccurences)]
}

sort(table(y_train), decreasing = T)
sortedCategories <- c("[sS]hawl|[sS]carf", "[Ss]weater|[sS]leeve", "\\b[Hh]at\\b", 
                      "[Ff]eet|[Ff]oot|[sS]ock", "\\b[Hh]and\\b|[gG]love|[mMitt]]", "[Ss]ofties|[tT]oy")
y_easy <- sapply(crossValData$description, predEasy, sortedCategories)
# reorder y_cv names to have true positives in the diagonal (regexp letters mess up ordering)
table(y_easy, y_cv)[, c(1,5,6,4,2,3)]

# resulting confusion matrix:

# y_easy                     y_cv     Feet / Legs Softies Sweater Neck / Torso Hands Hat
# [Ff]eet|[Ff]oot|[sS]ock                  79       6       5           11     4   3
# [Ss]ofties|[tT]oy                         0       6       0            0     0   0
# [Ss]weater|[sS]leeve                      0       1     123            2     1   1
# [sS]hawl|[sS]carf                        38      13      45          361    10  36
# \\b[Hh]and\\b|[gG]love|[mMitt]]           0       2       4            6    16   3
# \\b[Hh]at\\b                              0       1       0            8     6 101

 

The performance can be checked in the confusion matrix. This benchmark is not so bad; most of the patterns in each category are correctly predicted.

However, if our keywords do not appear in the text, the prediction quality goes down. In order to take the full text into account, we need an algorithm that uses the whole Document Term Matrix instead of relying on a priori guesses about which keywords are good predictors of a category.

Since we have several categories and plenty of features in the dataset, we can try a random forest. The default forest in R’s randomForest library works quite well here, after a bit of data preparation (matching the words in the DTM across the train/test/cross-validation sets):



matchWords <- function(testDtm, referenceDtm){
  # Words never seen in the reference (train) set can't be used for prediction
  # => remove them from the set used for prediction,
  # and add to the predicting set the words that were in the train set but not in the predicting set
  # Args: document term matrix to modify and reference document term matrix
  # Returns the modified dtm with columns matching the reference dtm
  test2 <- testDtm[, intersect(colnames(referenceDtm), colnames(testDtm))]
  trainWordsNotInTest <- setdiff(names(referenceDtm), names(test2))
  yy <- data.frame(matrix(0, ncol = length(trainWordsNotInTest),
                          nrow = dim(test2)[1]))
  names(yy) <- trainWordsNotInTest
  # Final processed test set
  return(cbind(test2, yy))
}

cvMatched <- matchWords(cv, train)
testMatched <- matchWords(test, train)

## Random Forest
library(randomForest)
rf <- randomForest(train, y_train)
pred <- predict(rf, cvMatched)
table(pred, y_cv)
# interpretation: graph showing which words make the most interesting splits in the trees
varImpPlot(rf, type=2) 

# (pred)          Feet / Legs Hands Hat Neck / Torso Softies Sweater     (y_cv)
# Feet / Legs           97     3   3           12       2       8
# Hands                  0    20   0            0       0       0
# Hat                    2     3 130            6       1       1
# Neck / Torso          17    10   8          362       9      28
# Softies                0     0   0            1      17       0
# Sweater                1     1   3            7       0     140

Comparing the confusion matrices between the benchmark and the random forest, the forest wins!


Word importance for category prediction, as illustrated by each word’s contribution to the decrease of the Gini index.

Looking at the forest’s variable importance plot, it appears, as expected, that words like “hat” or “sleeves” are good predictors (no sleeves on scarves, usually). Other, more generic, good predictors appear: “top” probably narrows the category to scarves or sweaters, excluding socks, and “fit” is probably more likely to appear for items where size matters (hats, socks, and sweaters).

Autumn tree canopy, Foret des Laurentides, Quebec, Canada

Random forests are the best forests.

 

In order to estimate the performance of the winner on wild data, we use the test set:


# on test set:
predTest <- predict(rf, testMatched)
table(predTest, y_test)

# (predTest)      Feet / Legs Hands Hat Neck / Torso Softies Sweater (y_test)
# Feet / Legs          102     0   1           12       3       8
# Hands                  0    21   0            0       0       0
# Hat                    2     2 132            5       2       1
# Neck / Torso          12    12  10          363      11      23
# Softies                0     0   0            0      13       0
# Sweater                1     2   1            8       0     145

Pattern launch study

On Ravelry, users make heavy use of the bookmarking tools (“favorite” and “queue”) to remember the knitting patterns that caught their eye among the several million that the database holds.

The popular Brooklyn Tweed design house released their Fall 2015 knitting pattern collection on Wednesday, September 16th; I used this opportunity to look at the bookmarking activity on these patterns in real time for several days after the launch.


Some of the 15 patterns in the Brooklyn Tweed Fall 2015 collection.

The pattern names are kept secret until the launch, so it’s not possible to “listen” to the Ravelry database to detect their appearance and start following them right away. One first needs to search the Ravelry database for recently published Brooklyn Tweed designs, assuming that the Fall 2015 designs will be among the most recent results. This can, and should, be changed after the launch by editing the script to query the actual names of the collection patterns, because at some point more recent patterns tagged “Brooklyn Tweed” will push those from this Fall collection out of the most recent results. The following code does just that, getting the permalinks (unique pattern names) of the 30 most recent “Brooklyn Tweed” search results:

library(httr)  # GET, config; ravelry.token is the OAuth token from the API setup

# Search for recent BT patterns (will also have others made with BT yarns)
BT_query <- GET("https://api.ravelry.com/patterns/search.json?page_size=30&query=Brooklyn%20Tweed&sort=date",
             config=config("token"=ravelry.token))
content_to_follow <- content(BT_query)

permalinks_to_follow <- sapply(content_to_follow[[1]], function(x) x$permalink)

Once we have the links, the following code queries the API, for each pattern, for the properties we want to track: the current number of favorites, of users who queued the pattern, of comments, and of projects. It then appends the resulting data to a text file, “BT_follow.txt”.

# get url from pattern unique name (permalink)
links_to_follow <- sapply(permalinks_to_follow, function(name) paste("https://api.ravelry.com/patterns/", name, ".json", sep="", collapse=""))
# query the API
pat0 <- lapply(links_to_follow, GET, config=config("token"=ravelry.token))
pats <- lapply(pat0, content)

# filter the results for the properties we are interested in
pattern_data <- sapply(pats, function(x) x$pattern[c("permalink",
                                                     "queued_projects_count",
                                                     "projects_count",
                                                     "favorites_count",
                                                     "comments_count")])

# turn into a data frame object
pattern_df <- data.frame(matrix(unlist(pattern_data), nrow=length(links_to_follow), byrow=T))
# add time and date
pattern_df$time <- Sys.time()

# write to text file
write.table(pattern_df, file = "/Users/saskia/R/ravelry_explorer/BT_follow.txt",append=T,row.names=F,col.names=F,quote=F)

This code gets the data at the point in time when it is run; in order to get the data as a function of time, the code above is run automatically every 30 minutes using a cron job (useful advice here).
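
For reference, the corresponding crontab entry could look something like the line below (the script name is hypothetical; only the directory appears in the post):

# run the tracking script every 30 minutes
*/30 * * * * Rscript /Users/saskia/R/ravelry_explorer/follow_BT.R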

After letting the data file grow for a few days, it’s time for harvest!

# libraries used below: plyr (revalue, ddply), TTR (SMA), ggplot2
library(plyr)
library(TTR)
library(ggplot2)

# read from the text file (write.table's default field separator is a space)
BT_time <- read.csv("BT_follow.txt", header=F, sep=" ")
# add column names
names(BT_time) <- c("pattern","queued_projects_count","projects_count","favorites_count","comments_count","day","time")

# Fall 2015 pattern unique names
names_BT <- c("ashland-2","bannock","birch-bay-2","cascades","deschutes-2","copse-2",
             "fletching-3","lander","lolo","mcloughlin","nehalem-2","riverbend-2",
             "sauvie","trailhead-2","willamette-7")

# Fall 2015 pattern official names
names_full <- c("Ashland","Bannock","Birch Bay","Cascades","Deschutes","Copse",
                "Fletching","Lander","Lolo","McLoughlin","Nehalem","Riverbend",
                "Sauvie","Trailhead","Willamette")

names(names_full) <- names_BT

# filter the data to keep only Brooklyn tweed Fall 2015 patterns
BT_time <- subset(BT_time,subset=(pattern %in% names_BT))
BT_time$pattern <- droplevels(BT_time$pattern)
# use pattern name as identifier instead of permalink
BT_time$pattern <- revalue(BT_time$pattern, names_full)
# get full date from day and hour/min, in Brooklyn Tweed's local time (PDT)
BT_time$date <- as.POSIXct(paste(BT_time$day, BT_time$time))
attributes(BT_time$date)$tzone <- "America/Los_Angeles"

And now we can plot the number of favorites on each pattern as a function of time. I chose 15 colors as far from each other as possible using this handy tool from the Medialab at Sciences Po; they go into the BT_colors vector used below (its definition, and that of lastdate, used to position the labels and presumably the latest timestamp in the data, are not shown here).

ggplot(BT_time) + 
  geom_line(aes(x=date, y=favorites_count, color=pattern, group=pattern)) +
  geom_text(data=BT_time[BT_time$date ==lastdate,], 
            aes(x=date,
                y=favorites_count,
                color=pattern,
                label=pattern),
            vjust = -0.2) +
  xlab("Time") +
  ylab("Number of favorites") +
  theme(legend.position="none") +
  scale_colour_manual(values=BT_colors) +
  ggtitle("Evolution of number of favorites on BT Fall 2015 patterns")

The sun (almost) never sets on the knitting empire! (click to enlarge)

Looks like the hottest time window is within the first day! The time dependence has a very similar shape for each pattern, probably because they are all released in the same context, with a consistent design concept throughout the collection.

People bookmark patterns all the time, although when it’s night in the USA, there is a small dip in activity. This is much more visible if we plot the favoriting rate (number of new favorites per hour):

# function getting the nbr of favorites per unit time
dfav_dt <- function(data){
  n <- dim(data)[1] # nbr lines
  favdiff <- c(0, diff(data$favorites_count)) # length n
  timediff <- difftime(data$date, c(data$date[1], data$date[1:n-1]),units="hours")
  # div by 0 on first item, but ggplot will just ignore those points
  timediff <- as.numeric(timediff) 
  favderiv <- SMA(favdiff/timediff)
  return(data.frame(deriv = favderiv, 
                    date = data$date))
}

# rate gets pretty constant after a few days, so keep only
# the points shortly after launch
m <- 3000
# get rates for each pattern
rates <- ddply(BT_time[1:m,],.(pattern), dfav_dt)

# normalize the data to more easily see the general behavior
norm_col <- function(data){
  maxd <- max(data$deriv, na.rm=T)
  return(cbind(data,norm_deriv=data$deriv/maxd))
}

rates <- ddply(rates,.(pattern), norm_col)

ggplot(rates) + 
  geom_line(aes(x=date, y=norm_deriv, color="red", group=pattern)) +
  xlab("Time") +
  ylab("Normalized favoriting rate") +
  theme(legend.position="none") +
  ggtitle("Evolution of favoriting rate on BT Fall 2015 patterns")


Favoriting rate for the few days after the launch; the dip in activity when night moves over the USA is now quite visible. There are 15 normalized curves, one for each pattern in the collection. (click to enlarge)

The time dependence of the number of times a pattern was queued is very highly correlated with the time dependence of the number of favorites, so I’m not showing it here.

I’m letting the code run once per day now; I’ll post the new data in a few months to look at long-term tendencies, and at the number of projects (there are obviously not a lot of them so soon after the launch). I’m guessing the Christmas crunch will have interesting effects …