Data Science Bowl 2018 Kaggle competition

I have at last found the time to try Keras! It is so much easier to get started with than TensorFlow, although this of course comes at the price of some flexibility.

I’m trying my hand at the Kaggle Data Science Bowl 2018 competition, on the topic of object segmentation, which in this case means delimiting cells in medical imagery. The training data are microscopy photographs of cells, and the label of each image is a set of masks (binary pictures), one for each individual cell in the picture. Below are some examples of input images from the training set. Sizes and aspect ratios vary, but are often around 256×256 pixels.

An example of an input image (training example) and its masks (labels):


As usual, the Kaggle forums are full of helpful people providing starter notebooks; I used the Keras U-Net starter as the basis for my explorations.

The neural network in this approach is a simplified version of the one described in this paper; each layer has fewer channels. I kept this simplification because I’m working in a Google Colaboratory notebook, and given the notebook’s limitations, using the same number of features per layer as in the paper makes the training very slow, as in “surprise! notebook disconnected” sort of slow. Anyway, Colaboratory is still great, considering that it is a free tool. It’s a Jupyter-based Python notebook running on shared VMs at Google, and the VM comes with a free GPU! The VM state is not persistent, so after 90 minutes of inactivity the notebook disconnects and the VM is lost. This means any package installations or data downloads have to be redone at every restart. Thankfully, Linux commands can be integrated in the notebook simply by prefixing them with “!”.

Here are the preliminary steps I added to the original U-Net starter notebook:

!pip install -q tqdm
!git clone -q https://github.com/lopuhin/kaggle-dsbowl-2018-dataset-fixes.git
!pip install -q kaggle
!mkdir -p /content/.kaggle/
!mkdir -p data-science-bowl-2018/original
!mkdir -p data-science-bowl-2018/merged/stage1_train
# unsaaaaaafe
!echo '{"username":"mykagglename","key":"mykaggleapikeygoeshere"}' > /content/.kaggle/kaggle.json
!kaggle competitions download -c data-science-bowl-2018 -p data-science-bowl-2018/original/.
!unzip -q -o "data-science-bowl-2018/original/stage1_test.zip" -d "data-science-bowl-2018/original/stage1_test"
!unzip -q -o "data-science-bowl-2018/original/stage1_train.zip" -d "data-science-bowl-2018/original/stage1_train"
!cp -r data-science-bowl-2018/original/stage1_train/* data-science-bowl-2018/merged/stage1_train/.
!cp -r kaggle-dsbowl-2018-dataset-fixes/stage1_train/* data-science-bowl-2018/merged/stage1_train/.
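
One more line worth adding (my own habit rather than something from the original notebook): the kaggle CLI warns when the API token file is readable by others, so tightening its permissions avoids both the warning and leaking the key on a shared VM:

!chmod 600 /content/.kaggle/kaggle.json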

Now comes the Python script itself, starting with some imports and global variables:

import os
import sys
import warnings
from subprocess import check_output

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
from skimage.morphology import extrema
import matplotlib as mp

from tqdm import tqdm
from itertools import chain

import skimage.io
import skimage.color # used in preprocess below to force grayscale images to RGB
import skimage.transform
import skimage.morphology
from scipy import ndimage as ndi
from skimage.morphology import watershed
from skimage.feature import peak_local_max

from keras import backend as K
from keras.models import Model, load_model
from keras.layers import Input
from keras.layers.core import Dropout, Lambda
from keras.layers.convolutional import Conv2D, Conv2DTranspose
from keras.layers.pooling import MaxPooling2D
from keras.layers.merge import concatenate
from keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard
from keras.preprocessing.image import ImageDataGenerator
from keras.losses import binary_crossentropy

IMG_WIDTH = 256
IMG_HEIGHT = 256
IMG_CHANNELS = 3 # use only RGB channels
TRAIN_PATH = '/content/data-science-bowl-2018/merged/stage1_train'
TEST_PATH = '/content/data-science-bowl-2018/original/stage1_test'
BATCH_SIZE = 15

np.random.seed(42)

cmap2 = mp.colors.ListedColormap(np.concatenate((np.array([[0,0,0]]), np.random.rand(256,3))))

warnings.filterwarnings('ignore', category=UserWarning, module='skimage')

The images in the dataset have different sizes, so we are going to resize them all to the same size; the original notebook uses 128×128 pixels, I think to keep the training fast enough for Kagglers to be able to run the notebook easily. I used 256×256 (the IMG_WIDTH and IMG_HEIGHT above), which is closer to the majority of picture sizes, and allows us to keep more information.

I also added a function to read the pictures and merge the masks into a single image:

def imread_grey(f):
    return skimage.io.imread(f, as_grey=True)

def read_image_labels(image_id, merge=True):
    # return image with id number image_id 
    # and label = union of masks if merge = True (binary picture), 
    # if merge = False: label = matrix with distinct integers for each cell
    # (useful for plotting for performance checks)
    image_file = "{}/{}/images/{}.png".format(TRAIN_PATH, image_id, image_id)
    mask_file = "{}/{}/masks/*.png".format(TRAIN_PATH, image_id)
    image = skimage.io.imread(image_file)
    masks = skimage.io.ImageCollection(mask_file, load_func=imread_grey).concatenate()
    masks_union = np.sum(masks, axis=0) # has 2 if 2 cells are overlapping, 3 if 3 overlap etc
    if merge:
        # assumes masks and image have same dimensions
        labels = masks_union > 0 
    else:
        masks_indexed = [masks[i]*(i+1) for i in range(len(masks))] # i+1 so the first cell isn't multiplied by 0
        # overlapping cells are attributed to the last one in mask name order
        labels = np.max(masks_indexed, axis=0) 

    return image, labels

Example result:
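
For a quick visual check of the two labelling modes, something like the following sketch works (the image id is just the first training folder found; cmap2 is the random colormap defined above):

# plot one training example: raw image, merged binary mask, one colour per cell
image_id = next(os.walk(TRAIN_PATH))[1][0]
image, merged = read_image_labels(image_id)
_, distinct = read_image_labels(image_id, merge=False)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].imshow(image)
axes[1].imshow(merged, cmap='gray')
axes[2].imshow(distinct, cmap=cmap2) # black background, a random colour per cell
plt.show()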

We make a big simplification here by merging all the masks: cells that overlap can no longer be individually contoured. So it is up to us to post-process the predicted masks and separate large clusters into individual cells. This is especially important since the metric used for scoring penalises missed cells more than inaccurately contoured ones.

We then build the X and Y arrays of training and test set examples:

# Get train and test IDs
test_ids = next(os.walk(TEST_PATH))[1]
train_ids = next(os.walk(TRAIN_PATH))[1]

def preprocess(image_input):
    # force same number of channels for grayscale pics as for color pics (3)
    rgb_img = skimage.color.grey2rgb(image_input, alpha=False)
    return skimage.transform.resize(rgb_img, (IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS), mode='constant', preserve_range=True)

# Get and resize train images and masks
size_train = len(train_ids)
X_train = np.zeros((size_train, IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS), dtype=np.uint8)
Y_train = np.zeros((size_train, IMG_HEIGHT, IMG_WIDTH, 1), dtype=np.bool) # extra trailing dimension for the U-Net input; use np.squeeze for plots
Y_train_distinct = np.zeros((size_train, IMG_HEIGHT, IMG_WIDTH, 1), dtype=np.int32) # one distinct integer per cell, for performance checks

print('Getting and resizing train images and masks ... ')
sys.stdout.flush()
for n, image_id in tqdm(enumerate(train_ids), total=size_train):
    image, mask = read_image_labels(image_id)
    X_train[n] = preprocess(image)
    Y_train[n] = np.expand_dims(skimage.transform.resize(mask, (IMG_HEIGHT, IMG_WIDTH), mode='constant', preserve_range=True), axis=-1)
    image, mask = read_image_labels(image_id, merge=False)
    Y_train_distinct[n] = np.expand_dims(skimage.transform.resize(mask, (IMG_HEIGHT, IMG_WIDTH), order=0, mode='constant', preserve_range=True), axis=-1) # order=0 (nearest neighbour) so distinct labels don't get blended

# Get and resize test images
X_test = np.zeros((len(test_ids), IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS), dtype=np.uint8)
sizes_test = []
print('Getting and resizing test images ... ')
sys.stdout.flush()
for n, id_ in tqdm(enumerate(test_ids), total=len(test_ids)):
    image = skimage.io.imread(TEST_PATH + "/" + id_ + "/images/" + id_ + ".png")[:, :, :IMG_CHANNELS]
    sizes_test.append(list(image.shape))
    X_test[n] = preprocess(image)

I tried different losses for the network; the mean IoU as defined in this competition (averaged over a range of thresholds) is available in neither Keras nor TensorFlow. A Dice loss (a close relative of intersection over union) gave the best results.
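
For offline checks on the validation set, the competition metric can be approximated in plain numpy. This is my own reading of the metric definition (average precision over IoU thresholds 0.5 to 0.95, per image), not code from the original notebook, and it ignores the edge case of images with no cells:

def competition_metric(labels_true, labels_pred):
    # labels_*: 2D arrays with 0 for background, a distinct integer per cell
    trues = [labels_true == i for i in range(1, int(labels_true.max()) + 1)]
    preds = [labels_pred == i for i in range(1, int(labels_pred.max()) + 1)]
    # IoU between every (true, predicted) object pair
    iou = np.zeros((len(trues), len(preds)))
    for a, t in enumerate(trues):
        for b, p in enumerate(preds):
            union = np.logical_or(t, p).sum()
            iou[a, b] = np.logical_and(t, p).sum() / union if union else 0.
    precisions = []
    for thresh in np.arange(0.5, 1.0, 0.05):
        matches = iou > thresh
        tp = matches.any(axis=1).sum()              # true cells that were hit
        fn = len(trues) - tp                        # true cells that were missed
        fp = len(preds) - matches.any(axis=0).sum() # predictions hitting nothing
        precisions.append(tp / (tp + fp + fn))
    return np.mean(precisions)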

This is the loss function and the U-net network:

def dice_coef(y_true, y_pred):
    smooth = 1.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_loss(y_true, y_pred):
    return -dice_coef(y_true, y_pred)

inputs = Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
s = Lambda(lambda x: x / 255) (inputs)

c1 = Conv2D(16, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (s)
c1 = Dropout(0.1) (c1)
c1 = Conv2D(16, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c1)
p1 = MaxPooling2D((2, 2)) (c1)

c2 = Conv2D(32, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (p1)
c2 = Dropout(0.1) (c2)
c2 = Conv2D(32, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c2)
p2 = MaxPooling2D((2, 2)) (c2)

c3 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (p2)
c3 = Dropout(0.2) (c3)
c3 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c3)
p3 = MaxPooling2D((2, 2)) (c3)

c4 = Conv2D(128, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (p3)
c4 = Dropout(0.2) (c4)
c4 = Conv2D(128, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c4)
p4 = MaxPooling2D(pool_size=(2, 2)) (c4)

c5 = Conv2D(256, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (p4)
c5 = Dropout(0.3) (c5)
c5 = Conv2D(256, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c5)

u6 = Conv2DTranspose(128, (2, 2), strides=(2, 2), padding='same') (c5)
u6 = concatenate([u6, c4])
c6 = Conv2D(128, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (u6)
c6 = Dropout(0.2) (c6)
c6 = Conv2D(128, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c6)

u7 = Conv2DTranspose(64, (2, 2), strides=(2, 2), padding='same') (c6)
u7 = concatenate([u7, c3])
c7 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (u7)
c7 = Dropout(0.2) (c7)
c7 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c7)

u8 = Conv2DTranspose(32, (2, 2), strides=(2, 2), padding='same') (c7)
u8 = concatenate([u8, c2])
c8 = Conv2D(32, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (u8)
c8 = Dropout(0.1) (c8)
c8 = Conv2D(32, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c8)

u9 = Conv2DTranspose(16, (2, 2), strides=(2, 2), padding='same') (c8)
u9 = concatenate([u9, c1], axis=3)
c9 = Conv2D(16, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (u9)
c9 = Dropout(0.1) (c9)
c9 = Conv2D(16, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c9)

outputs = Conv2D(1, (1, 1), activation='sigmoid') (c9)

model = Model(inputs=[inputs], outputs=[outputs])
model.compile(optimizer='adam', loss=dice_loss, metrics=['accuracy'])

My understanding of this is that the descending branch is a classic convnet, i.e. convolutional layers with several filters (or channels) to extract features that get more and more high-level with depth, interleaved with max-pooling layers that coarsen the spatial localization information. The ascending branch reconstructs a high-resolution image, reusing at every step the features learned at the corresponding depth of the descending branch, so that detailed knowledge of what happens at high resolution (first layers) can be combined with a high-level understanding of the image (middle layers). The paper also weights the loss to force the network to learn the thin boundaries between objects, which I should find time to do here as well; a sketch of the weight map follows.
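
To record the idea, here is a sketch of the weight map from equation 2 of the paper; w0 and sigma follow the paper's values, but I drop the class-balancing term w_c (keeping it at 1), and the per-cell distance stack is memory-hungry on images with many cells:

def unet_weight_map(labels, w0=10., sigma=5.):
    # labels: 2D array with 0 for background and a distinct integer per cell
    n_cells = int(labels.max())
    if n_cells < 2:
        return np.ones(labels.shape)
    # distance from every pixel to each individual cell
    dists = np.stack([ndi.distance_transform_edt(labels != i)
                      for i in range(1, n_cells + 1)], axis=-1)
    dists.sort(axis=-1)
    d1, d2 = dists[..., 0], dists[..., 1] # nearest and second-nearest cell
    border_term = w0 * np.exp(-(d1 + d2) ** 2 / (2 * sigma ** 2))
    # boost only the background pixels squeezed between cells
    return 1. + border_term * (labels == 0)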

In order to get more training examples, I added data augmentation, which can be done on the fly using a generator: there is no need to write rotate/crop/flip functions, nor to build a new training array containing the augmented copies. The training finally looks like this:

earlystopper = EarlyStopping(patience=5, verbose=1)
checkpointer = ModelCheckpoint('model-dsbowl2018.h5', verbose=1, save_best_only=True)

val_cutoff = int(size_train*0.9)
X_train2 = X_train[:val_cutoff]
Y_train2 = Y_train[:val_cutoff]
X_val = X_train[val_cutoff:]
Y_val = Y_train[val_cutoff:]

def data_generator(data, augment=False):
    # for Keras on-the-fly data augmentation
    if augment:
        data_gen_args = dict(shear_range=0.5, rotation_range=50, zoom_range=0.2,
                         width_shift_range=0.2, height_shift_range=0.2, fill_mode='reflect')
    else:
        data_gen_args = dict()
    datagen = ImageDataGenerator(**data_gen_args)
    # Provide the same seed and keyword arguments to the fit and flow methods,
    # so that the image and mask generators below stay in sync.
    # Note: datagen.fit is only needed for featurewise_center, featurewise_std_normalization
    # or zca_whitening, none of which are used here, so this call could be dropped.
    datagen.fit(data, augment=True, seed=42)
    return datagen.flow(data, batch_size=BATCH_SIZE, shuffle=True, seed=42)

# combine the two generators into one that yields (image, masks) batches
# (zip is lazy in Python 3, so the infinite generators are fine)
train_generator = zip(data_generator(X_train2, augment=True), data_generator(Y_train2, augment=True))
val_generator = zip(data_generator(X_val), data_generator(Y_val))

# steps per epoch without augmentation would be nbr_samples/batch_size (around 40 with BATCH_SIZE = 15);
# more steps let each epoch see more augmented variants
results = model.fit_generator(train_generator,
                              validation_data = val_generator,
                              epochs=50,
                              steps_per_epoch = 250,
                              validation_steps = 10,
                              callbacks=[earlystopper, checkpointer])

I tried using Tensorboard, but I couldn’t make it work in Colaboratory using the method described in this notebook.

In order to separate the touching cells, I tried watershed post-processing, but it invents separations everywhere on elongated shapes (although it works quite well when the touching cells look like two almost-tangent circles). My simple watershedding of all the predictions does not improve performance.
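
The watershed_split function used in the submission loop below didn't make it into the post, so here is a minimal sketch of what it can look like; the seeding strategy (local maxima of the distance transform) and the min_distance value are my ad hoc choices:

def watershed_split(pred, cutoff=0.5):
    # pred: 2D probability map; returns integer labels, one per separated cell
    binary = pred > cutoff
    distance = ndi.distance_transform_edt(binary)
    # one seed per presumed cell, at local maxima of the distance map
    seeds = peak_local_max(distance, indices=False, min_distance=10, labels=binary)
    markers = ndi.label(seeds)[0]
    return watershed(-distance, markers, mask=binary)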

The submission is built using run-length encoding (source). The submission file should look like this (the “-> …” annotations are mine, not part of the file):
ImageId,EncodedPixels
0114f484a16c152baa2d82fdd43740880a762c93f436c8988ac461c5c9dbe7d5,1 1 -> nucleus #1
0999dab07b11bc85fb8464fc36c947fbd8b5d6ec49817361cb780659ca805eac,1 1 -> nucleus #1
0999dab07b11bc85fb8464fc36c947fbd8b5d6ec49817361cb780659ca805eac,2 3 8 9 -> nucleus #2

In the rle_encoding function, x is a matrix of 0’s and 1’s (the 1’s being the points of interest); the function returns a list of pairs: the 1-based index into the flattened transpose of x where a run of 1’s starts, followed by the length of that run ([start_run_1, length_run_1, start_run_2, length_run_2, …]). The competition requires 1-based, column-major pixel indexing (top to bottom, then left to right), hence the transpose and the b + 1 below.
For example, if x is
0001100000
0011110000
0000110000
0000000100
0000001100
the function returns [12, 1, 16, 2, 21, 3, 27, 2, 35, 1, 39, 2].

def rle_encoding(x):
    dots = np.where(x.T.flatten() == 1)[0] # indexes of the 1's
    run_lengths = []
    prev = -2
    for b in dots:
        if (b>prev+1): run_lengths.extend((b + 1, 0)) # start of a new run of 1's (b + 1 because the competition indexing is 1-based)
        run_lengths[-1] += 1 # increment the length of the current run
        prev = b
    return run_lengths
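
A quick sanity check of rle_encoding against the worked example above:

x = np.array([[0,0,0,1,1,0,0,0,0,0],
              [0,0,1,1,1,1,0,0,0,0],
              [0,0,0,0,1,1,0,0,0,0],
              [0,0,0,0,0,0,0,1,0,0],
              [0,0,0,0,0,0,1,1,0,0]])
print(rle_encoding(x)) # [12, 1, 16, 2, 21, 3, 27, 2, 35, 1, 39, 2]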

def prob_to_rles(x, cutoff=0.5):
    # x has 0 for background and a distinct positive integer for each cell,
    # e.g. the output of watershed_split above; for a raw probability map we
    # would first label connected regions with skimage.morphology.label(x > cutoff)
    for i in range(1, int(x.max()) + 1): # for each labelled region, yield its rle representation
        yield rle_encoding(x == i)
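
The loop below relies on preds_test_upsampled, i.e. predictions resized back to each test image's original dimensions; that step didn't make it into the post, so here is a minimal sketch (reloading the checkpoint needs custom_objects because dice_loss is a custom function):

# reload the best checkpoint, predict, and upsample back to the original sizes
model = load_model('model-dsbowl2018.h5', custom_objects={'dice_loss': dice_loss})
preds_test = model.predict(X_test, verbose=1)
preds_test_upsampled = [skimage.transform.resize(np.squeeze(preds_test[n]),
                                                 (sizes_test[n][0], sizes_test[n][1]),
                                                 mode='constant', preserve_range=True)
                        for n in range(len(test_ids))]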

new_test_ids = []
rles = []
for n, id_ in enumerate(test_ids):
    split_cells = watershed_split(preds_test_upsampled[n])
    rle = list(prob_to_rles(split_cells))
    rles.extend(rle)
    new_test_ids.extend([id_] * len(rle))

# Create submission DataFrame
sub = pd.DataFrame()
sub['ImageId'] = new_test_ids
sub['EncodedPixels'] = pd.Series(rles).apply(lambda x: ' '.join(str(y) for y in x))
sub.to_csv('sub-dsbowl2018-12.csv', index=False)
#print(sub.head())

And finally the Kaggle submission can be done via the API:

!kaggle competitions submit -c data-science-bowl-2018 -f sub-dsbowl2018-12.csv -m "added watershed splitting"

My best score using this is 0.329, not groundbreaking by any means. But when looking at the predictions on a validation set, I was pretty impressed by how well this relatively simple network, trained with few images (around 600), could segment cells. This isn’t a pretrained network fine-tuned on its last layers with a small dataset; the only images used are the ones from this competition.