Draw your own dataset

For all those who want to visually test their classification algorithms on toy data, here is an RShiny app to click & draw your own custom dataset! It can be accessed on shinyapps.io, and the source code is on GitHub.

It’s mostly useful for small two-dimensional numeric datasets that would be inconvenient to build by superposing samples from classic distributions.


Screenshot of the app

RShiny is a web application framework by RStudio that lets you build neat interactive web apps entirely in R. Shiny is quite practical because it handles most of the event handling and variable updating under the hood. All we need to do is declare the variables that are subject to interactive change as reactive values, and keep track of where the changes happen.
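As a minimal, self-contained sketch of that pattern (a hypothetical one-button counter, not the drawing app itself): state lives in reactiveValues(), and an observeEvent() block updates it when its input changes.

library(shiny)

ui <- fluidPage(
  actionButton("add", "Add one"),
  textOutput("total")
)

server <- function(input, output, session) {
  v <- reactiveValues(count = 0)      # reactive state, updated interactively
  observeEvent(input$add, {           # re-runs every time the button is clicked
    v$count <- v$count + 1
  })
  output$total <- renderText(paste("Total:", v$count))
}

shinyApp(ui, server)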

In this app, every time the user clicks on the canvas, a small group of points is created near the click position; the number, the class (color), and the spread of the points created at each click can be adjusted in the parameters bar. The canvas can be cleared, or the latest points undone, if necessary. Once the dataset is ready, it can be downloaded as a csv file containing the x,y coordinates and class of each point; the downloaded dataset is scaled to zero mean and unit variance. This GUI is defined in the ui.R file:


library(shiny)
library(shinydashboard)

dashboardPage(
  dashboardHeader(title="Draw Dataset"),
  dashboardSidebar(disable=T),
  dashboardBody(
    box(width=12,
        title="Parameters",
        fluidRow(
          column(4, numericInput("num_points", "Number of points per click", 3)),
          column(4,
                 numericInput("sigma",
                              "Standard deviation of each click point set",
                              step=0.01,
                              value=5)
          ),
          column(4,
                 selectInput("class", "Class (color)", choices=c("red"="firebrick1",
                                                                 "green"="forestgreen",
                                                                 "blue"="dodgerblue")))
        ),
        fluidRow(
          column(12,
                 downloadButton("save", "Save"),
                 actionButton("clear", "Clear", icon=icon("remove")),
                 actionButton("undo", "Undo", icon=icon("undo"))
          )
        )
    ),
    box(width=12,
        height=870,
        title="Draw your own dataset by clicking on this canvas",
        plotOutput("data_plot", click="plot_click")
    )
  )
)

The code in server.R shown below holds the data frame that contains the dataset, and updates it whenever a user-triggered event happens.
The dataset is a reactive object; its potential user-triggered changes are handled in the observe({}) blocks. The variables corresponding to user input are called input$something, where the various “something” are defined in ui.R.


library(ggplot2)

server <- function(input, output, session) {

  addGroup <- function(data, n, center, sigma, class){
    # append a group of points distributed around the click coordinates
    # to a data frame holding all the points created and their color.
    new_group <- data.frame("x"=rnorm(n, mean=center$x, sd=sigma),
                            "y"=rnorm(n, mean=center$y, sd=sigma),
                            "class"=class)
    return(rbind(data, new_group))
  }

  # initialize reactive dataset holding the points created by the user
  v <- reactiveValues(data = data.frame())

  observe({
    # populate the dataset with points distributed around the click position
    if(!is.null(input$plot_click)) {
      # "isolate" reads the fresh value of v$data without making this observer depend on it,
      # which avoids an infinite loop of rbind-ing the new group onto the freshly updated data
      v$data <- isolate(addGroup(v$data, input$num_points, input$plot_click, input$sigma, input$class))
    }
  })

  observe({
    # remove all points from the canvas and the dataset when clear button is clicked
    if(!is.null(input$clear)) {
      v$data <- data.frame()
    }
  })

  observeEvent(input$undo, {
    # remove the latest drawn point from the dataset when undo button is clicked
    v$data <- v$data[-nrow(v$data), ]
  })

  output$save <- downloadHandler(
    # save the dataset as a csv file
    # scale to zero mean and unit variance before saving
    filename = function() {'DIYdataset.csv'},
    content = function(file) {
      write.csv(data.frame(scale(v$data[,c("x","y")]), "color"=v$data$class),
                file,
                quote=F,
                row.names=F)}
  )

  output$data_plot <- renderPlot({
    # display the base plot
    plot <- ggplot() + xlim(0, 100) + ylim(0, 100) + xlab("x") + ylab("y")
    # if data is not empty, add it to plot
    # points outside of plot boundaries are added to the dataset but not displayed
    if (nrow(v$data) > 0) {
      plot <- plot + geom_point(aes(x=v$data$x, y=v$data$y),
                                size=4,
                                colour=v$data$class,
                                show.legend=F)
    }
    return(plot)
  }, height=800)

}

The block in which the dataset receives new points uses “isolate” to avoid an infinite loop.
Without it, the dataset would first be updated with the new points, but Shiny would then detect that the dataset has changed, re-run the block and assign new points again, detect this new change, and so on.
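As a minimal standalone sketch of the difference (a hypothetical one-button app, not the drawing app itself): the observer below depends only on input$go, and isolate() lets it read and update v$data without re-triggering itself.

library(shiny)

ui <- fluidPage(actionButton("go", "Add a row"), verbatimTextOutput("n"))

server <- function(input, output, session) {
  v <- reactiveValues(data = data.frame())
  observe({
    input$go                         # the only reactive dependency of this observer
    # isolate() reads v$data without registering it as a dependency;
    # without it, the assignment below would invalidate the observer and re-run it forever
    v$data <- isolate(rbind(v$data, data.frame(x = runif(1))))
  })
  output$n <- renderPrint(nrow(v$data))
}

shinyApp(ui, server)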

RShiny is a great tool for quickly developing apps to demo models or visualize data; this app takes only about 100 lines of code!

Using Tensorflow with Docker

I started the Google Deep Learning Course featuring the Python Tensorflow library.

I’m using a Windows 7 laptop, and the installation instructions for Windows on the TensorFlow website are less developed than those for Unix-based systems. The Deep Learning course comes with a Docker image packaging TensorFlow and the course exercises as an IPython notebook, and this seems to be the simplest way to get TensorFlow running on Windows.

To install and get started:

  1. Download and install the Docker Toolbox
  2. Open the Docker Quickstart terminal
  3. Paste the following command in this terminal. This will download (or open, if it is already downloaded) the Docker image containing TensorFlow and the course exercises.
    docker run -p 8888:8888 -it --rm b.gcr.io/tensorflow-udacity/assignments:0.4.0
  4. At the top of your Docker terminal you should see an IP address: go to http://<this_ip_address>:8888 in your browser. This should be the home page of the Jupyter notebook featuring the course exercises. You will be able to import tensorflow when coding through this interface.

I tried this install on a Mac and it works too!

Machine learning and data science resources

This post lists some useful sources for learning data science and machine learning. The first two are a good way to start, since they quickly bring you to the point where you can play with real world data.

Machine Learning by Andrew Ng (Stanford)

The best MOOC ever, which is also the one that started Coursera.

Prerequisites: it helps if you are familiar with scalar products and matrix operations, and if you know how to program basic functions.

What’s inside: starts with basic linear regression and ramps up with gradient descent, logistic regression, regularisation, neural networks, clustering, SVMs, recommenders, PCA …

Data Science Specialization by Roger Peng, Brian Caffo, and Jeff Leek (Johns Hopkins University)

This one is a long hike through many areas of data science (not just machine learning), and offers a very pragmatic introduction to the R programming language.

Prerequisites: familiarity with programming and undergrad math. The R exercises are quite close to what you will get asked in a Data Scientist job interview.

What’s inside: how to do everything to data using R (parse, explore, fit), how to connect to an API or parse data from a website, statistics, machine learning, reproducibility … It has some overlap with Andrew Ng’s course.

Probability and Statistics by Khan Academy

The basics of statistics and probability (and clearer, IMO, than the statistics section of the Johns Hopkins course).

Data Science at Scale Specialization by Bill Howe (University of Washington)

A course about the “Big” in “Big Data”, with data visualisation thrown in.

Prerequisites: python programming, familiarity with APIs.

What’s inside: SQL, noSQL, MapReduce, graphs, machine learning …

Overall, prerequisites depend a lot on the course, but a bit of background in computer science helps.

Books

I found An Introduction to Statistical Learning very helpful, both for its explanations of algorithms and for its R exercises. The authors also made a series of videos following the contents of the book (index of the videos here). The advanced version of this book, with All The Math, is The Elements of Statistical Learning.

If you want to make beautiful graphs (and you should), books by Edward Tufte show how.

 

This list is subject to changes and additions as I continue watching courses and reading books!

 

Classification of knitting patterns

Knitting patterns for sale on Ravelry are accompanied by a text description ranging from a few words to a full-page story in several languages. This text sometimes includes instructions about knitting techniques, a personal background story, or a reference to a larger collection of patterns by the same designer. This post looks at how reliably one can predict the pattern category (hat, sweater, scarf …) from the text of the pattern.

We first build a dataset using Ravelry database queries and web scraping. This will consist of a data frame with pattern names, their text descriptions, and their category: hat, sweater, etc.


## Build dataset from Ravelry API: pattern permalink, pattern category, pattern text description
library(httr)   # GET, content, config
library(XML)    # htmlTreeParse, getNodeSet, xmlValue

# Get url to patterns of interest from API search
# (ravelry.token is an OAuth token for the Ravelry API, set up beforehand)
pat0 <- GET("https://api.ravelry.com/patterns/search.json?page_size=2000&craft=knitting", config=config("token"=ravelry.token))
pat <- content(pat0)

permalinks <- sapply(pat$patterns, function(x) x$permalink)
permalinks_full <- sapply(permalinks, function(name) paste("http://www.ravelry.com/patterns/library/",name,sep="",collapse=""))
names(permalinks_full) <- permalinks

# Get top level pattern category and description text using web scraping 
pattern_info <- lapply(permalinks_full, htmlTreeParse, useInternalNodes = TRUE)

pattern_description_par <- lapply(pattern_info, getNodeSet, path="//p", fun=xmlValue)
pattern_description <- sapply(pattern_description_par, paste, collapse=" ")

pattern_cat <- lapply(pattern_info, getNodeSet, path="//div[@class='category']/a/span/text()", fun=xmlValue)
pattern_topcat <- simplify2array(sapply(pattern_cat, head, 1))

Some pattern categories appear quite rarely and may not be frequent enough to get decent prediction accuracy. We can filter out the corresponding entries to get cleaner data.


## Data: 3 columns with pattern permalink, text description, and toplevel category
data <- as.data.frame(cbind(permalinks, pattern_topcat, pattern_description),stringsAsFactors=F,row.names=F)
names(data) <- c("permalink", "category", "description")
data$category <- as.factor(data$category)

cat_freq <- table(data$category)
nbr_examples <- dim(data)[1]

# Remove from data the categories with too few examples 
data <- subset(data, subset=(cat_freq[category] > 50))
data$category <- factor(data$category)

The following R functions are quite useful for preparing the data for text mining. In order to predict a pattern category from its text, we look at the frequencies of words in each text for each category, and use these numbers for prediction. For example, a description in which the word “head” appears several times is more likely to be for a hat than for a sock.

The first function removes the punctuation, numbers, and stopwords (“the”, “a” …) that appear very often in all text but do not carry enough meaning to help prediction. It returns a clean corpus of texts where each document corresponds to a cleaned up pattern description.

The second function builds the Document Term Matrix (DTM), an object that holds, for each document, the frequencies of all the words in it. The columns are all the words in the corpus, and the rows are all the documents in the corpus (pattern descriptions); a toy example is shown after these descriptions. This DTM will be the dataset for running the algorithms, with words being the features and text descriptions being the cases.

The third function wraps it all together to turn the data frame into a ready-to-use dataset.
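As a toy illustration of what a DTM looks like, here is a two-document corpus (made-up descriptions) and its matrix, using the tm package:

library(tm)

toy_corpus <- Corpus(VectorSource(c("a warm hat for cold heads",
                                    "knit this sock for happy feet")))
toy_dtm <- DocumentTermMatrix(toy_corpus)
inspect(toy_dtm)
# rows are the two documents, columns are the terms,
# and each entry counts how often that term appears in that document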



library(tm)

cleanCorpus = function(corpus){
  # Clean the text data to remove punctuation, suffixes, numbers etc
  # To lowercase
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Remove stopwords first, else for ex. l'or becomes lor and l' is not removed
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  # Remove punctuation
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(function(str) gsub("[^[:alnum:] ]", " ",str)))
  # Remove  html tags with regexp
  corpus <- tm_map(corpus, content_transformer(function(x) gsub("<[a-z]*>", " ", x)))
  # Remove numbers - but they may be useful ... TODO ?
  corpus <- tm_map(corpus, removeNumbers)
  # Simplify whitespace
  corpus <- tm_map(corpus, stripWhitespace)
  # Stem words (tm_map stem has type error), use option lazy=T on mac os
  corpus <- tm_map(corpus, stemDocument, "english", lazy=T)
}

buildData = function(corpus, sparsity=0.999){
  # Arg: corpus where one document is one pattern description
  #      optional float word sparsity threshold
  #      default: remove (almost) nothing
  # Returns Document Term Matrix
  dtm <- DocumentTermMatrix(corpus, 
                            control = list(weighting = weightTfIdf))
  # remove words that don't appear often enough for every category, else weird words and very large matrix
  dtm <- removeSparseTerms(dtm, sparsity)
}

prepareData <- function(df){
  # make clean cases and outcome based on text/category data frame
  corpus <- Corpus(VectorSource(df$description))
  names(corpus) <- df$category
  y <- df$category
  clean <- cleanCorpus(corpus)
  dtm = buildData(clean, 0.9)
  data <- as.data.frame(as.matrix(dtm))
  names(data) <- dtm$dimnames$Terms
  return (list("category" = y, "data" = data))
}

Before diving into the classical machine learning algorithms, we can set up a very simple prediction benchmark. We search the text for the pattern category name or related keywords (for example “scarf” and “shawl” for the Neck/Torso category), and predict the category whose keywords appear most often. A text including “sock” and “foot” but no “sweater” is probably in the “Feet / Legs” category. The code below does just that, predicting the most frequent category in case none of the keywords appear in the description. (The keywords are guessed using this knitter’s domain knowledge!)



## Build train, cross-validation, and test sets, 50% of cases go to train set
library(caret)   # createDataPartition
sampler <- createDataPartition(data$category, times=3)
trainData <- data[sampler[[1]],]
crossValData <- data[sampler[[2]],]
testData <- data[sampler[[3]],]

trainFull <- prepareData(trainData)
y_train <- trainFull$category
train <- trainFull$data

cvFull <- prepareData(crossValData)
y_cv <- cvFull$category
cv <- cvFull$data

testFull <- prepareData(testData)
y_test <- testFull$category
test <- testFull$data

# benchmark test: predict category whose keywords appearing the most in the text
# if no keywords in text, predict most frequent category
predEasy <- function(text, sortedCategories){
  # assumes categories sorted by most frequent in training set
  # categoryInText holds, for each category, the indexes where the category keywords appear in the text
  categoryInText <- sapply(sortedCategories, function(pattern) gregexpr(pattern, text)[[1]])
  # catNbrOccurences holds the number of times a category keyword appears in the text
  catNbrOccurences <- sapply(categoryInText, function(l) sum(l > 0))
  # return the category with the most keyword occurrences
  cat <- sortedCategories[which.max(catNbrOccurences)]
}

sort(table(y_train), decreasing = T)
sortedCategories <- c("[sS]hawl|[sS]carf", "[Ss]weater|[sS]leeve", "\\b[Hh]at\\b", 
                      "[Ff]eet|[Ff]oot|[sS]ock", "\\b[Hh]and\\b|[gG]love|[mM]itt", "[Ss]ofties|[tT]oy")
y_easy <- sapply(crossValData$description, predEasy, sortedCategories)
# reorder y_cv names to have true positives in the diagonal (regexp letters mess up ordering)
table(y_easy, y_cv)[, c(1,5,6,4,2,3)]

# resulting confusion matrix:

# y_easy                     y_cv     Feet / Legs Softies Sweater Neck / Torso Hands Hat
# [Ff]eet|[Ff]oot|[sS]ock                  79       6       5           11     4   3
# [Ss]ofties|[tT]oy                         0       6       0            0     0   0
# [Ss]weater|[sS]leeve                      0       1     123            2     1   1
# [sS]hawl|[sS]carf                        38      13      45          361    10  36
# \\b[Hh]and\\b|[gG]love|[mMitt]]           0       2       4            6    16   3
# \\b[Hh]at\\b                              0       1       0            8     6 101

 

The performance can be checked in the confusion matrix. This benchmark is not bad at all: most of the patterns in each category are correctly predicted.
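To put a single number on “not bad”, the overall accuracy can be read off the same confusion matrix (a quick check reusing the objects above):

# correct predictions sit on the diagonal after the column reordering above
conf_easy <- table(y_easy, y_cv)[, c(1,5,6,4,2,3)]
sum(diag(conf_easy)) / sum(conf_easy)   # overall accuracy of the keyword benchmark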

However, if our keywords do not appear in the text, the prediction quality goes down. Instead of guessing a priori which keywords are good predictors of a category, we need an algorithm that takes the whole Document Term Matrix into account.

Since we have several categories and plenty of features in the dataset, we can try a random forest. The default forest in R’s randomForest library works quite well here, after a bit of data preparation (matching the words in the DTM for the train/test/cross-validation sets):



matchWords <- function(testDtm, referenceDtm){
  # Words never seen in the reference (training) set can't be used for prediction
  # => remove them from the set used for prediction,
  # and add to the predicting set the words that were in the train set but missing from it
  # Args: document term matrix to modify and reference document term matrix
  # Returns the modified dtm with columns matching the reference dtm
  test2 <- testDtm[, intersect(colnames(referenceDtm), colnames(testDtm))]
  trainWordsNotInTest <- setdiff(names(referenceDtm), names(test2))
  yy <- data.frame(matrix(0, ncol = length(trainWordsNotInTest),
                          nrow = dim(test2)[1]))
  names(yy) <- trainWordsNotInTest
  # Final processed test set
  return(cbind(test2, yy))
}

cvMatched <- matchWords(cv, train)
testMatched <- matchWords(test, train)

## Random Forest
library(randomForest)
rf <- randomForest(train, y_train)
pred <- predict(rf, cvMatched)
table(pred, y_cv)
# interpretation: graph showing which words make the most interesting splits in the trees
varImpPlot(rf, type=2) 

# (pred)          Feet / Legs Hands Hat Neck / Torso Softies Sweater     (y_cv)
# Feet / Legs           97     3   3           12       2       8
# Hands                  0    20   0            0       0       0
# Hat                    2     3 130            6       1       1
# Neck / Torso          17    10   8          362       9      28
# Softies                0     0   0            1      17       0
# Sweater                1     1   3            7       0     140

Comparing the confusion matrices between the benchmark and the random forest, the forest wins!


Word importance for category prediction, as measured by each word’s contribution to the decrease of the Gini index.

Looking at the forest variable importance plot, it appears as expected that words like “hat” or “sleeves” are good predictors (no sleeves on scarves usually). Other more generic good predictors appear: “top” probably narrows the category to scarves or sweaters, excluding socks, and “fit” is probably more likely to appear for items where size matters (hats, socks, and sweaters).

Autumn tree canopy, Foret des Laurentides, Quebec, Canada

Random forests are the best forests.

 

In order to estimate the performance of the winner on wild data, we use the test set:


# on test set:
predTest <- predict(rf, testMatched)
table(predTest, y_test)

# (predTest)      Feet / Legs Hands Hat Neck / Torso Softies Sweater (y_test)
# Feet / Legs          102     0   1           12       3       8
# Hands                  0    21   0            0       0       0
# Hat                    2     2 132            5       2       1
# Neck / Torso          12    12  10          363      11      23
# Softies                0     0   0            0      13       0
# Sweater                1     2   1            8       0     145

Pattern launch study

On Ravelry, users make heavy use of the bookmarking tools (“favorite” and “queue”) to remember the knitting patterns that caught their eye among the several million that the database holds.

The popular Brooklyn Tweed design house released their Fall 2015 knitting patterns collection on Wednesday, September 16th; I used this opportunity to look at the bookmarking activity on these patterns in real time for several days after the launch.


Some of the 15 patterns in the Brooklyn Tweed Fall 2015 collection.

The pattern names are kept secret until the launch, so it’s not possible to “listen” to the Ravelry database to detect their appearance and start following them. One first needs to search the Ravelry database for recently published Brooklyn Tweed designs, assuming that the Fall 2015 designs will be among the most recent results. This can, and should, be changed after the launch by editing the script to query the actual names of the collection patterns, because at some point more recent patterns tagged “Brooklyn Tweed” will push those of this Fall collection out of the most recent results. The following code does just that, getting the permalinks (unique pattern names) for the 30 most recent “Brooklyn Tweed” search results:

# Search for recent BT patterns (will also have others made with BT yarns)
library(httr)
BT_query <- GET("https://api.ravelry.com/patterns/search.json?page_size=30&query=brooklyn%20tweed&sort=date",
             config=config("token"=ravelry.token))
content_to_follow <- content(BT_query)

permalinks_to_follow <- sapply(content_to_follow[[1]], function(x) x$permalink)

Once we have the links, the following code queries the API for each pattern for the properties we want to look at: current number of favorites, of users who queued the pattern, of comments, and of projects. It then appends the resulting data to a text file “BT_follow.txt”.

# get url from pattern unique name (permalink)
links_to_follow <- sapply(permalinks_to_follow, function(name) paste("https://api.ravelry.com/patterns/", name, ".json", sep="", collapse=""))
# query the API
pat0 <- lapply(links_to_follow, GET, config=config("token"=ravelry.token))
pats <- lapply(pat0, content)

# filter the results for the properties we are interested in
pattern_data <- sapply(pats, function(x) x$pattern[c("permalink",
                                                     "queued_projects_count",
                                                     "projects_count",
                                                     "favorites_count",
                                                     "comments_count")])

# turn into a data frame object
pattern_df <- data.frame(matrix(unlist(pattern_data), nrow=length(links_to_follow), byrow=T))
# add time and date
pattern_df$time <- Sys.time()

# write to text file
write.table(pattern_df, file = "/Users/saskia/R/ravelry_explorer/BT_follow.txt",append=T,row.names=F,col.names=F,quote=F)

This code gets the data at the point in time when it is run; in order to get the data as a function of time, the code above is run automatically every 30 minutes using a cron job (useful advice here).
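For reference, the crontab entry could look like the sketch below; the script name follow_BT.R is hypothetical and would simply contain the querying code above (Rscript may need to be given by its full path):

# run the data collection script every 30 minutes
*/30 * * * * Rscript /Users/saskia/R/ravelry_explorer/follow_BT.R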

After letting the data file grow for a few days, it’s time to harvest!

# read from the text file (written space-separated above, so Sys.time() splits into day and time)
library(plyr)      # revalue, ddply
library(ggplot2)
BT_time <- read.csv("BT_follow.txt", header=F, sep=" ")
# add columns names
names(BT_time) <- c("pattern","queued_projects_count","projects_count","favorites_count","comments_count","day","time")

# Fall 2015 pattern unique names
names_BT <- c("ashland-2","bannock","birch-bay-2","cascades","deschutes-2","copse-2",
             "fletching-3","lander","lolo","mcloughlin","nehalem-2","riverbend-2",
             "sauvie","trailhead-2","willamette-7")

# Fall 2015 pattern official names
names_full <- c("Ashland","Bannock","Birch Bay","Cascades","Deschutes","Copse",
                "Fletching","Lander","Lolo","McLoughlin","Nehalem","Riverbend",
                "Sauvie","Trailhead","Willamette")

names(names_full) <- names_BT

# filter the data to keep only Brooklyn tweed Fall 2015 patterns
BT_time <- subset(BT_time,subset=(pattern %in% names_BT))
BT_time$pattern <- droplevels(BT_time$pattern)
# use pattern name as identifier instead of permalink
BT_time$pattern <- revalue(BT_time$pattern, names_full)
# get full date from day and hour/min, in Brooklyn Tweed's local time (PDT)
BT_time$date <- as.POSIXct(paste(BT_time$day, BT_time$time))
attributes(BT_time$date)$tzone <- "America/Los_Angeles"

And now we can plot the graph of the number of favorites on each pattern as a function of time. I chose 15 colors as far from each other as possible using this handy tool from the Medialab at Sciences Po.
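The plot code below uses two objects that are not defined in the snippets above: lastdate (the most recent timestamp, used to place the pattern name labels) and BT_colors (the vector of 15 colors). Plausible stand-ins, with a placeholder palette instead of the hand-picked colors:

# last timestamp in the data, used to position the labels at the end of the curves
lastdate <- max(BT_time$date)
# placeholder palette: any 15 well-separated colors will do
# (the post picked its 15 colors with the "i want hue" tool)
BT_colors <- rainbow(15)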

ggplot(BT_time) + 
  geom_line(aes(x=date, y=favorites_count, color=pattern, group=pattern)) +
  geom_text(data=BT_time[BT_time$date ==lastdate,], 
            aes(x=date,
                y=favorites_count,
                color=pattern,
                label=pattern),
            vjust = -0.2) +
  xlab("Time") +
  ylab("Number of favorites") +
  theme(legend.position="none") +
  scale_colour_manual(values=BT_colors) +
  ggtitle("Evolution of number of favorites on BT Fall 2015 patterns")

The sun (almost) never sets on the knitting empire! (click to enlarge)

Looks like the hottest time window is within the first day! The time dependence has a very similar shape for each pattern, probably because they are all released in the same context, with a consistent design concept throughout the collection.

People bookmark patterns all the time, although when it’s night in the USA, there is a small dip in activity. This is much more visible if we plot the favoriting rate (number of new favorites per hour):

# function getting the number of favorites per unit time
library(TTR)   # SMA: simple moving average, used to smooth the rate
dfav_dt <- function(data){
  n <- dim(data)[1] # number of rows
  favdiff <- c(0, diff(data$favorites_count)) # length n
  timediff <- difftime(data$date, c(data$date[1], data$date[1:(n-1)]), units="hours")
  # div by 0 on first item, but ggplot will just ignore those points
  timediff <- as.numeric(timediff)
  favderiv <- SMA(favdiff/timediff)
  return(data.frame(deriv = favderiv, 
                    date = data$date))
}

# rate gets pretty constant after a few days, so keep only
# the points shortly after launch
m <- 3000
# get rates for each pattern
rates <- ddply(BT_time[1:m,],.(pattern), dfav_dt)

# normalize the data to more easily see the general behavior
norm_col <- function(data){
  maxd <- max(data$deriv, na.rm=T)
  return(cbind(data,norm_deriv=data$deriv/maxd))
}

rates <- ddply(rates,.(pattern), norm_col)

ggplot(rates) + 
  geom_line(aes(x=date, y=norm_deriv, color="red", group=pattern)) +
  xlab("Time") +
  ylab("Normalized favoriting rate") +
  theme(legend.position="none") +
  ggtitle("Evolution of favoriting rate on BT Fall 2015 patterns")


Favoriting rate for the few days after the launch; the dip in activity when night moves over the USA is now quite visible. There are 15 normalized curves, one for each pattern in the collection. (click to enlarge)

The time dependence of the number of times a pattern was queued is very highly correlated with the time dependence of the number of favorites, so I’m not showing it here.
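That correlation is easy to check per pattern, for instance with a quick ddply summary on the same data frame:

# correlation between the queue counts and the favorite counts over time, per pattern
ddply(BT_time, .(pattern), summarise,
      cor_queue_fav = cor(queued_projects_count, favorites_count))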

I’m letting the code run once per day now; I’ll post the new data in a few months to look at long-term trends and at the number of projects (there are obviously not a lot of them so soon after the launch). I’m guessing the Christmas crunch will have interesting effects …

Knitting patterns price distribution

Ravelry allows knitting designers from all over the world to sell individual patterns directly to knitters. This post looks at the distribution of the prices of these items. The Ravelry pattern database holds patterns published as books, e-book collections, pdf downloads, or club subscriptions; all patterns that are available online for purchase have their price indicated on their Ravelry page.

The following code queries the Ravelry API for the 500 patterns with the most projects in each of the following categories: “hat”, “sweater”, “neck/torso”, “feet/legs”, “hands”, “home”, “component” (these are tutorials or special techniques rather than full patterns), “toys and hobbies”, and “pet”. From this data we get the URLs to the patterns and use standard web scraping to get the html code for each pattern page. This works for patterns, since their pages are public (viewable by users not logged into Ravelry), but would not work for projects or stash entries, which are private by default (viewable only by Ravelry users).




library(httr)        # GET, content, config
library(XML)         # htmlTreeParse, getNodeSet, xmlValue
library(reshape2)    # melt
library(ggplot2)

categories <- c("hat","sweater","neck-torso","feet-legs","hands","home","toysandhobbies","pattern-component","pet")

# Get dataset of non-free patterns with a high number of projects, available as pdf downloads
# (Full treatment 1h30 for 500/category)
search_url <- "https://api.ravelry.com/patterns/search.json?page_size=500&sort=projects&craft=knitting&availability=ravelry%2B-free&pc="
cat_search <- sapply(categories, function(name) paste(search_url, name,sep="", collapse=""))

# Get lists of search results; price attribute is NULL => use web scraping to get it
pat0 <- lapply(cat_search, GET, config=config("token"=ravelry.token))
pat <- lapply(pat0, content)

# Extract patterns permalinks in each category
permalinks <- sapply(pat, function(x) sapply(x$patterns, function(y) y$permalink))
names(permalinks) <- categories
permalinks <- melt(permalinks)
names(permalinks) <- c("link","category")

permalinks_full <- sapply(permalinks$link, function(name) paste("http://www.ravelry.com/patterns/library/",name,sep="",collapse=""))

# Random sampling for testing
samp = sample(1:length(permalinks$link),length(permalinks$link))
permalinks_full <- permalinks_full[samp]       
permalinks <- permalinks[samp,]       

# Web scraping to get the price from the pattern page
# Takes about 1 min for 50 links
n=dim(permalinks)[1] # 1000 ok
pattern_info <- lapply(permalinks_full[1:n], htmlTreeParse, useInternalNodes = TRUE)
names(pattern_info) <- permalinks$link[1:n]

Once we have the html code for each pattern page, we parse it for the prices. The path to the price information in the html tree can be found by looking at the source code of a typical pattern page. Since most patterns are priced in US dollars (around 75% of them in this dataset), all the price data is converted to current USD using the R quantmod library.

 

pattern_prices <- lapply(pattern_info, function(html) getNodeSet(html, 
                                                                 path="//strong[@class='price']/a/text()", 
                                                                 fun=xmlValue)[[1]] )

num_prices <- lapply(pattern_prices, function(str) c("price"=regmatches(str,
                                                                regexpr("[[:digit:]]+\\.*[[:digit:]]*",str)),
                                                     "currency"=substr(str, nchar(str)-2, nchar(str)) 
                                                     )
                     )


# nbr_projects() is a small helper (not shown here) that extracts the number of projects from the parsed page
pattern_nbr_projects <- melt(sapply(pattern_info, nbr_projects))
price_data  <- data.frame(matrix(unlist(num_prices), nrow=length(num_prices), byrow=T), stringsAsFactors=F)
price_data <- cbind(pattern_nbr_projects, permalinks[1:n,], price_data)
names(price_data) <- c("nbr_projects", "link","category", "price", "currency")
price_data$price <- as.numeric(price_data$price)

# Local currency conversion is proposed by Ravelry only for logged-in users
# => normalize the prices to USD here
library(quantmod)   # getFX
currencies_codes = sapply(price_data$currency, paste,"USD",sep="")
# getFX puts exchange rate in the environment, but sapply does not change env. variables
for (curr in unique(price_data$currency)) getFX(paste(curr, "/USD", sep=""), from = Sys.Date())
exchange_rates = sapply(currencies_codes, get)
price_data$price_usd = price_data$price * exchange_rates

And finally, the global price distribution (all categories aggregated):

ggplot(price_data) +
  geom_histogram(aes(x=price_usd), fill='Blue', alpha=0.5, binwidth=0.5) +
  scale_x_continuous(limits = c(0, 20), breaks = round(seq(0, 20, by = 1), 1)) +
  xlab("Pattern price in USD") +
  ylab("Number of patterns")


Pattern price distribution. The data is only shown for patterns up to 20 dollars (there are a few expensive outliers, mostly kits with pattern + yarn included).

It looks like the “99 cents is cheaper than $1” strategy is mostly used in the lower price range. Between $6 and $9, there are far fewer price points just below whole-dollar prices, but between $3 and $5 it’s the opposite.
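One rough way to check this claim on the same data is to look at the share of prices ending just below a whole dollar within each price band (a sketch, using 0.90 as the “just below” cutoff):

# share of prices whose fractional part is at least 0.90 (e.g. x.95, x.99), by price band
frac <- price_data$price_usd %% 1
near_99 <- frac >= 0.90
band <- cut(price_data$price_usd, breaks=c(0, 3, 6, 9, 20))
round(prop.table(table(band, near_99), margin=1), 2)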

The breakdown by category:

ggplot(price_data, aes(x=category, y=price_usd, fill=category)) +
  geom_boxplot(alpha=0.5) +
  ylab("Price in USD") +
  ggtitle("Pattern price distributions in each category")

Pattern price is pretty constant.

Price does not depend much on category. But negative results are just as interesting as positive results, so this graph is still proudly displayed! I was a bit surprised by this, since there can be a lot of variance in design complexity between a one-size-fits-all accessory and a sweater.

Which yarn colors are most stashed and most used?

Color is a pretty important part of a knitting project, and a big factor in deciding to buy and knit a skein (or hank, as is more often the case with hand-dyed yarns). Maybe counter-intuitively for non-knitters, many knitters buy yarn without a specific project planned for it, sometimes even in quantities that they couldn’t knit through even if they spent the rest of their lives knitting 24/7! Such is the power of beautiful yarn.

The yarn stash entry on Ravelry allows the user to select the color category of their yarn when adding it to their notebook stash. Color is quite subjective, and categorizing yarn colors can be especially difficult, as shown in these pictures:

Which color is this? (“Top Draw Sock” by Skein in the “inner city” colorway)

Is this yellow, brown, or orange? (“Tosh Sock” by Madelinetosh in the “ginger glazed” colorway)

Often though, the color distinctly falls into one of the Ravelry color families: “Blue”, “Green”, “Purple”, “Brown”, “Gray”, “Blue-green”, “Pink”, “Red”, “Natural/Undyed”, “Black”, “Red-purple”, “White”, “Orange”, “Yellow”, “Blue-purple”, “Red-orange”, “Yellow-green”, and “Yellow-orange”. This post excludes yarn spun by the users (“handspun”), to look only at commercial yarns (spun at a mill) that users have bought and added to the “yarn stash” section of their notebooks.

The color family of projects cannot be retrieved with the API (it is a yarn property); however, the actual numbers of stashed yarn items and of projects are displayed when doing a manual search on Ravelry. I entered this data by hand to look at the color distributions. It is less practical than an API call, but more accurate, since API queries tend to time out when asking for more than 5000 entries.

The code below defines the data as taken (manually) from Ravelry, and shows a plot of how many items are stashed in each color (bars) and how many projects were knitted using this color (dots).


# Colors selected among the 140 html supported color names
ravColors <- c("Black"="#000000","Blue"="#0000ff","Blue-green"="#008080","Blue-purple"="#6A5ACD","Brown"="#A52A2A",
                "Gray"="#808080","Green"="#008000","Natural/Undyed"="#e0d8c8","Orange"="#FFA500","Pink"="#FFC0CB",
                "Purple"="#800080","Red"="#ff0000","Red-orange"="#FF4500","Red-purple"="#C71585","White"="#ffffff",
                "Yellow"="#ffff00","Yellow-green"="#ADFF2F","Yellow-orange"="#FFD700")

# Color breakdown (august) from project and stash search page info: 
yarnColors <- data.frame("color"=c("Blue","Green","Purple","Brown","Gray","Blue-green",
                                    "Pink","Red","Natural/Undyed", "Black","Red-purple",
                                    "White","Orange","Yellow","Blue-purple","Red-orange",
                                    "Yellow-green","Yellow-orange"),
                          "project"=c(702934,572417,420128,384078,495350,368599,
                                      412316,362362,268711,237424,148006,
                                      254850,170679,168060,143248,92505,
                                      65680,53732),
                          "stash"=c(923548,753213,600723,534547,506782,493983,
                                    561139,433341,304647,261388,204782,
                                    241738,228724,194401,198696,127643,
                                    82374,65809))


# sort by increasing number of items in stash (arrange comes from the plyr package)
yarnColors <- arrange(yarnColors, stash)

ggplot(data=yarnColors)+
  geom_bar(aes(x=seq_along(color), y=stash, fill=factor(color)),
           stat="identity")+
  geom_point(aes(x=seq_along(color), y=project, fill=factor(color)),
             colour="white", pch=21, size=5)+
  guides(fill=guide_legend(title="colors\nbars: stash\ndots: projects"))+
  theme(axis.text.x=element_blank())+
  theme(axis.ticks.x=element_blank())+
  scale_fill_manual(values=ravColors)+
  xlab("Ravelry color family")+
  ylab("Number of stash items / projects")+
  ggtitle("Color distribution of stashed yarns and yarns used in projects")


Colors in stashes and projects. Blue and green lead!

Project colors tend to follow the same distribution as stash colors. Blue is clearly the winner, both in stashes and in projects. Yellows and oranges, on the other hand, don’t seem to tempt many knitters.

Gray is knitted a lot more, comparatively, than it is stashed (its dot doesn’t follow the slope defined by the other dots).
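A quick way to quantify this is the projects-to-stash ratio per color family, computed from the data frame defined above:

# ratio of projects to stashed items for each color family;
# gray standing out would show up as a noticeably higher ratio than the other colors
yarnColors$ratio <- yarnColors$project / yarnColors$stash
yarnColors[order(-yarnColors$ratio), c("color", "ratio")]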