Classification of knitting patterns

Knitting patterns for sale on Ravelry are accompanied by a text description going from a few words to a full page story in several languages. This text sometimes includes instructions about knitting techniques, a personal background story, or a reference to a larger collection of patterns by the same designer. This post looks how reliably one can predict the pattern category (hat, sweater, scarf …) from the text of the pattern.

We first build a dataset using Ravelry database queries and web scraping. This will consist in a data frame with pattern names, their text descriptions, and their category: hat, sweater etc.


## Build dataset from Ravelry API: pattern permalink, pattern category, pattern text description
# Get url to patterns of interest from API search
pat0 <- GET("https://api.ravelry.com/patterns/search.json?page_size=2000&craft=knitting", config=config("token"=ravelry.token))
pat <- content(pat0)

permalinks <- sapply(pat$patterns, function(x) x$permalink)
permalinks_full <- sapply(permalinks, function(name) paste("http://www.ravelry.com/patterns/library/",name,sep="",collapse=""))
names(permalinks_full) <- permalinks

# Get top level pattern category and description text using web scraping 
pattern_info <- lapply(permalinks_full, htmlTreeParse, useInternalNodes = TRUE)

pattern_description_par <- lapply(pattern_info, getNodeSet, path="//p", fun=xmlValue)
pattern_description <- sapply(pattern_description_par, paste, collapse=" ")

pattern_cat <- lapply(pattern_info, getNodeSet, path="//div[@class='category']/a/span/text()", fun=xmlValue)
pattern_topcat <- simplify2array(sapply(pattern_cat, head, 1))

Some pattern categories appear quite rarely, and may be not frequent enough to get a decent accuracy on prediction. We can filter out the corresponding  entries to get cleaner data.


## Data: 3 columns with pattern permalink, text description, and toplevel category
data <- as.data.frame(cbind(permalinks, pattern_topcat, pattern_description),stringsAsFactors=F,row.names=F)
names(data) <- c("permalink", "category", "description")
data$category <- as.factor(data$category)

cat_freq <- table(data$category)
nbr_examples <- dim(data)[1]

# Remove from data the categories with too few examples 
data <- subset(data, subset=(cat_freq[category] > 50))
data$category <- factor(data$category)

The following R functions are quite useful to prepare the data for text mining. In order to predict a pattern category from its text, we look at the frequencies of words in each text for each category, and use these numbers for prediction. For example, a description in which the word “head” appears several times is more likely to be for a hat than a sock.

The first function removes the punctuation, numbers, and stopwords (“the”, “a” …) that appear very often in all text but do not carry enough meaning to help prediction. It returns a clean corpus of texts where each document corresponds to a cleaned up pattern description.

The second function builds the Document Term Matrix (DTM), an object that holds for each document the frequencies of all the words in it. The columns are all the words in the corpus, and the lines are all the documents in the corpus (pattern descriptions). This DTM will be the dataset for running the algorithms, with words being the features, and text descriptions being the cases.

The third function wraps it all together to turn the data frame into a ready to use dataset.



cleanCorpus = function(corpus){
  # Clean the text data to remove punctuation, suffixes, numbers etc
  # To lowercase
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Remove stopwords first, else for ex. l'or becomes lor and l' is not removed
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  # Remove punctuation
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(function(str) gsub("[^[:alnum:] ]", " ",str)))
  # Remove  html tags with regexp
  corpus <- tm_map(corpus, content_transformer(function(x) gsub("<[a-z]*>", " ", x)))
  # Remove numbers - but they may be useful ... TODO ?
  corpus <- tm_map(corpus, removeNumbers)
  # Simplify whitespace
  corpus <- tm_map(corpus, stripWhitespace)
  # Stem words (tm_map stem has type error), use option lazy=T on mac os
  corpus <- tm_map(corpus, stemDocument, "english", lazy=T)
}

buildData = function(corpus, sparsity=0.999){
  # Arg: corpus where one document is one pattern description
  #      optionnal float word sparsity threshold 
  #      default: remove (almost) nothing
  # Returns Document Term Matrix
  dtm <- DocumentTermMatrix(corpus, 
                            control = list(weighting = weightTfIdf))
  # remove words that don't appear often enough for every category, else weird words and very large matrix
  dtm <- removeSparseTerms(dtm, sparsity)
}

prepareData <- function(df){
  # make clean cases and outcome based on text/category data frame
  corpus <- Corpus(VectorSource(df$description))
  names(corpus) <- df$category
  y <- df$category
  clean <- cleanCorpus(corpus)
  dtm = buildData(clean, 0.9)
  data <- as.data.frame(as.matrix(dtm))
  names(data) <- dtm$dimnames$Terms
  return (list("category" = y, "data" = data))
}

Before diving into the classical machine learning algorithms, we can do a very simple prediction benchmark. We search the text for the pattern category name or related keywords (for example “scarf” and “shawl” for the Neck/Torso category). We then predict the category as the one whose  keywords appear most often. A text including “sock” and “foot” but no “sweater” is probably in the “Feet / Legs” category. The code below does just that, predicting the most frequent category in case none of the keywords appear in the description. (The keywords are guessed using this knitter’s domain knowledge !)



## Build train, cross-validation, and test sets, 50% of cases go to train set
sampler <- createDataPartition(data$category, times=3)
trainData <- data[sampler[[1]],]
crossValData <- data[sampler[[2]],]
testData <- data[sampler[[3]],]

trainFull <- prepareData(trainData)
y_train <- trainFull$category
train <- trainFull$data

cvFull <- prepareData(crossValData)
y_cv <- cvFull$category
cv <- cvFull$data

testFull <- prepareData(testData)
y_test <- testFull$category
test <- testFull$data

# benchmark test: predict category whose keywords appearing the most in the text
# if no keywords in text, predict most frequent category
predEasy <- function(text, sortedCategories){
  # assumes categories sorted by most frequent in trainning set
  # categoryInText holds for each category the indexes where the category keywords appears in the text
  categoryInText <- sapply(sortedCategories, function(pattern) gregexpr(pattern, text)[[1]])
  # catNbrOccurences holds the number of times a category keyword appears in the text 
  catNbrOccurences <- sapply(categoryInText, function(l) sum(l > 0))
  # return category with most keywords occurences
  cat <- sortedCategories[which.max(catNbrOccurences)]
}

sort(table(y_train), decreasing = T)
sortedCategories <- c("[sS]hawl|[sS]carf", "[Ss]weater|[sS]leeve", "\\b[Hh]at\\b", 
                      "[Ff]eet|[Ff]oot|[sS]ock", "\\b[Hh]and\\b|[gG]love|[mMitt]]", "[Ss]ofties|[tT]oy")
y_easy <- sapply(crossValData$description, predEasy, sortedCategories)
# reorder y_cv names to have true positives in the diagonal (regexp letters mess up ordering)
table(y_easy, y_cv)[, c(1,5,6,4,2,3)]

# resulting confusion matrix:

# y_easy                     y_cv     Feet / Legs Softies Sweater Neck / Torso Hands Hat
# [Ff]eet|[Ff]oot|[sS]ock                  79       6       5           11     4   3
# [Ss]ofties|[tT]oy                         0       6       0            0     0   0
# [Ss]weater|[sS]leeve                      0       1     123            2     1   1
# [sS]hawl|[sS]carf                        38      13      45          361    10  36
# \\b[Hh]and\\b|[gG]love|[mMitt]]           0       2       4            6    16   3
# \\b[Hh]at\\b                              0       1       0            8     6 101

 

The performance can be checked in the confusion matrix. This benchmark is not so bad, most of the patterns in each category are correctly predicted.

However, if our keywords do not appear in the text, the prediction quality will go down. In order to take the full text into account, we need the algorithm to take the whole Document Term Matrix into account instead of guessing a priori the keywords that are good predictors of  a category.

Since we have several categories and plenty of features in the dataset, we can try a random forest. The default forest in R’s randomForest library works quite well here, after a bit a data preparation (matching the words in the DTM for the train/test/cross-validation sets):



matchWords <- function(testDtm, referenceDtm){
  # Can't predict categories never seen in reference set 
  # => remove them from the set used for prediction
  # and add to predicting set the words that were in train set but not in predicting set
  # Args: document term matrix to modify and reference document term matrix
  # Returns the modified dtm with columns matching the reference dtm
  test2 <- testDtm[, intersect(colnames(referenceDtm), colnames(testDtm))]
  trainWordsNotInTest <- setdiff(names(referenceDtm), names(test2))
  yy <- data.frame(matrix(0, ncol = length(trainWordsNotInTest),
                          nrow = dim(test2)[1]))
  names(yy) <- trainWordsNotInTest
  # Final processed test set
  return(cbind(test2, yy))
}

cvMatched <- matchWords(cv, train)
testMatched <- matchWords(test, train)

## Random Forest
rf <- randomForest(train, y_train)
pred <- predict(rf, cvMatched)
table(pred, y_cv)
# interpretation: graph showing which words make the most interesting splits in the trees
varImpPlot(rf, type=2) 

# (pred)          Feet / Legs Hands Hat Neck / Torso Softies Sweater     (y_cv)
# Feet / Legs           97     3   3           12       2       8
# Hands                  0    20   0            0       0       0
# Hat                    2     3 130            6       1       1
# Neck / Torso          17    10   8          362       9      28
# Softies                0     0   0            1      17       0
# Sweater                1     1   3            7       0     140

Comparing the confusion matrices between the benchmark and the random forest, the forest wins !

RFcategoryGini

Words importance for category prediction, as illustrated by their contribution to the decrease of the Gini index.

Looking at the forest variable importance plot, it appears as expected that words like “hat” or “sleeves” are good predictors (no sleeves on scarves usually). Other more generic good predictors appear: “top” probably narrows the category to scarves or sweaters, excluding socks, and “fit” is probably more likely to appear for items where size matters (hats, socks, and sweaters).

Autumn tree canopy, Foret des Laurentides, Quebec, Canada

Random forests are the best forests.

 

In order to estimate the performance of the winner on wild data, we use the test set:


# on test set:
predTest <- predict(rf, testMatched)
table(predTest, y_test)

# (predTest)      Feet / Legs Hands Hat Neck / Torso Softies Sweater (y_test)
# Feet / Legs          102     0   1           12       3       8
# Hands                  0    21   0            0       0       0
# Hat                    2     2 132            5       2       1
# Neck / Torso          12    12  10          363      11      23
# Softies                0     0   0            0      13       0
# Sweater                1     2   1            8       0     145