Pattern launch study

On Ravelry, users make heavy use of the bookmarking tools (“favorite” and “queue”) to remember the knitting patterns that caught their eye among the several million that the database holds.

The popular Brooklyn Tweed design house released their Fall 2015 knitting patterns collection on Wednesday, September 16th; I used this opportunity to look at the bookmarking activity on these patterns in real time for several days after the launch.

Some of the 15 patterns in the Brooklyn Tweed Fall 2015 collection.

The pattern names are kept secret until the launch, so it’s not possible to “listen” to the Ravelry database to detect their appearance and start following them. One first needs to search the Ravelry database for recently published Brooklyn Tweed designs, assuming that the Fall 2015 designs will be among the most recent results. This can, and should, be changed after the launch by editing the script to query the actual names of the collection patterns, because at some point more recent patterns tagged “Brooklyn Tweed” will push those from this Fall collection out of the most recent results. The following code does the initial search, getting the permalinks (unique pattern names) for the 30 most recent “Brooklyn Tweed” search results:

library(httr)

# Search for recent BT patterns (will also have others made with BT yarns)
BT_query <- GET("", config=config("token"=ravelry.token))
content_to_follow <- content(BT_query)

permalinks_to_follow <- sapply(content_to_follow[[1]], function(x) x$permalink)

Once we have the links, the following code queries the API for each pattern for the properties we want to look at: current number of favorites, of users who queued the pattern, of comments, and of projects. It then appends the resulting data to a text file “BT_follow.txt”.

# get url from pattern unique name (permalink)
links_to_follow <- sapply(permalinks_to_follow, function(name) paste("",name,".json",sep="",collapse=""))
# query the API
pat0 <- lapply(links_to_follow, GET, config=config("token"=ravelry.token))
pats <- lapply(pat0, content)

# filter the results for the properties we are interested in
pattern_data <- sapply(pats, function(x) x$pattern[c("permalink",
                                                     "queued_projects_count",
                                                     "projects_count",
                                                     "favorites_count",
                                                     "comments_count")])

# turn into a data frame object
pattern_df <- data.frame(matrix(unlist(pattern_data), nrow=length(links_to_follow), byrow=T))
# add time and date
pattern_df$time <- Sys.time()

# write to text file
write.table(pattern_df, file = "/Users/saskia/R/ravelry_explorer/BT_follow.txt",append=T,row.names=F,col.names=F,quote=F)
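The flatten-and-append step can be tried offline before pointing it at the API. The following is only a sketch on mock data: the two records, their values, and the temporary file are made up for illustration, and it uses base R only.

```r
# Mock API responses: hypothetical values, same nesting as x$pattern above
pats <- list(
  list(pattern = list(permalink = "ashland-2", queued_projects_count = 12,
                      projects_count = 1, favorites_count = 340, comments_count = 4)),
  list(pattern = list(permalink = "bannock", queued_projects_count = 8,
                      projects_count = 0, favorites_count = 215, comments_count = 2))
)
fields <- c("permalink", "queued_projects_count", "projects_count",
            "favorites_count", "comments_count")

# One row per pattern; binding data frames keeps the numeric columns numeric
rows <- lapply(pats, function(x) as.data.frame(x$pattern[fields], stringsAsFactors = FALSE))
pattern_df <- do.call(rbind, rows)
pattern_df$time <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")

# Append twice to a temporary file, as two successive scheduled runs would
out_file <- tempfile(fileext = ".txt")
for (run in 1:2) {
  write.table(pattern_df, file = out_file, append = TRUE,
              row.names = FALSE, col.names = FALSE, quote = FALSE)
}

# The timestamp contains a space, so it comes back as two columns (day, time)
snapshot <- read.table(out_file, header = FALSE, sep = " ", stringsAsFactors = FALSE)
```

Because the file is space-separated and the timestamp itself contains a space, each line holds seven fields, which is why the data is later read back with separate “day” and “time” columns.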

This code gets the data at the point in time when it is run; to get the data as a function of time, the code above is run automatically every 30 minutes using a cron job (useful advice here).

After letting the data file grow for a few days, it’s time for harvest!

# read from the text file (write.table above used its default separator, a space)
BT_time <- read.csv("BT_follow.txt",header=F,sep=" ")
# add columns names
names(BT_time) <- c("pattern","queued_projects_count","projects_count","favorites_count","comments_count","day","time")

# Fall 2015 pattern unique names
names_BT <- c("ashland-2","bannock","birch-bay-2","cascades","deschutes-2","copse-2",

# Fall 2015 pattern official names
names_full <- c("Ashland","Bannock","Birch Bay","Cascades","Deschutes","Copse",

names(names_full) <- names_BT

# filter the data to keep only Brooklyn tweed Fall 2015 patterns
BT_time <- subset(BT_time,subset=(pattern %in% names_BT))
BT_time$pattern <- droplevels(BT_time$pattern)
# use pattern name as identifier instead of permalink
BT_time$pattern <- revalue(BT_time$pattern, names_full)
# get full date from day and hour/min, in Brooklyn Tweed's local time (PDT)
BT_time$date <- as.POSIXct(paste(BT_time$day, BT_time$time))
attributes(BT_time$date)$tzone <- "America/Los_Angeles"
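Time zones are easy to get wrong here, so this is a small sketch of the conversion on two made-up timestamps. It assumes the timestamps were recorded in UTC; in the real data they are in whatever time zone the machine running the cron job uses.

```r
# Timestamps as written by the scheduled job, assumed here to be in UTC
day  <- c("2015-09-16", "2015-09-17")
time <- c("17:00:00", "03:30:00")

# Parse with an explicit time zone so the result does not depend on the machine
stamp <- as.POSIXct(paste(day, time), tz = "UTC")
instant_before <- as.numeric(stamp)

# Changing the tzone attribute changes only the display, not the instant
attr(stamp, "tzone") <- "America/Los_Angeles"
local_label <- format(stamp, "%Y-%m-%d %H:%M")
```

The underlying instants are untouched; only the printed clock time shifts to Pacific time.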

And now we can plot the graph of the number of favorites on each pattern as a function of time. I chose 15 colors as far from each other as possible using this handy tool from the Medialab at Sciences Po.

ggplot(BT_time) + 
  geom_line(aes(x=date, y=favorites_count, color=pattern, group=pattern)) +
  geom_text(data=BT_time[BT_time$date ==lastdate,], 
            aes(x=date, y=favorites_count, color=pattern, label=pattern),
            vjust = -0.2) +
  xlab("Time") +
  ylab("Number of favorites") +
  theme(legend.position="none") +
  scale_colour_manual(values=BT_colors) +
  ggtitle("Evolution of number of favorites on BT Fall 2015 patterns")

The sun (almost) never sets on the knitting empire!

Looks like the hottest time window is within the first day! The time dependence has a very similar shape for each pattern, probably because they are all released in the same context, with a consistent design concept throughout the collection.

People bookmark patterns all the time, although when it’s night in the USA, there is a small dip in activity. This is much more visible if we plot the favoriting rate (number of new favorites per hour):

library(plyr) # ddply
library(TTR)  # SMA

# function getting the nbr of favorites per unit time
dfav_dt <- function(data){
  n <- dim(data)[1] # nbr lines
  favdiff <- c(0, diff(data$favorites_count)) # length n
  # time elapsed since the previous point (0 for the first point)
  timediff <- difftime(data$date, c(data$date[1], data$date[1:(n-1)]),units="hours")
  # div by 0 on first item, but ggplot will just ignore those points
  timediff <- as.numeric(timediff) 
  favderiv <- SMA(favdiff/timediff)
  return(data.frame(deriv = favderiv, 
                    date = data$date))
}

# rate gets pretty constant after a few days, so keep only
# the points shortly after launch
m <- 3000
# get rates for each pattern
rates <- ddply(BT_time[1:m,],.(pattern), dfav_dt)

# normalize the data to more easily see the general behavior
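norm_col <- function(data){
  maxd <- max(data$deriv, na.rm=T)
  # scale so that the peak rate of each pattern equals 1
  data$norm_deriv <- data$deriv / maxd
  return(data)
}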

rates <- ddply(rates,.(pattern), norm_col)

ggplot(rates) + 
  geom_line(aes(x=date, y=norm_deriv, group=pattern), color="red") +
  xlab("Time") +
  ylab("Normalized favoriting rate") +
  theme(legend.position="none") +
  ggtitle("Evolution of favoriting rate on BT Fall 2015 patterns")


Favoriting rate for the few days after the launch; the dip in activity when night moves over the USA is now quite visible. There are 15 normalized curves, one for each pattern in the collection.
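To make the rate computation concrete, here is a minimal worked example on made-up numbers for a single pattern. It is a sketch, not the code used for the plot: it uses base R only, with `stats::filter` as a stand-in for `TTR::SMA` and a 2-point window chosen arbitrarily.

```r
# Hypothetical cumulative favorites for one pattern, sampled every 30 minutes
date <- as.POSIXct("2015-09-16 17:00:00", tz = "UTC") + 1800 * (0:5)
favorites_count <- c(0, 40, 100, 130, 145, 150)

# New favorites and elapsed hours between consecutive samples
fav_diff  <- diff(favorites_count)
hour_diff <- as.numeric(diff(date), units = "hours")
rate <- fav_diff / hour_diff            # favorites per hour

# Trailing moving average over 2 samples (stand-in for TTR::SMA)
smooth <- stats::filter(rate, rep(1 / 2, 2), sides = 1)

# Scale so that the peak equals 1, as in norm_col
norm_rate <- as.numeric(smooth) / max(smooth, na.rm = TRUE)
```

The first smoothed value is `NA` because the moving average has no earlier sample to average with; after normalization every pattern peaks at 1, which is what makes the 15 curves comparable on one plot.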

The time dependence of the number of times a pattern was queued is very highly correlated with the time dependence of the number of favorites, so I’m not showing it here.

I’m letting the code run once per day now; I’ll post the new data in a few months to look at long term tendencies, and at the number of projects (there are obviously not a lot of them so soon after the launch). I’m guessing Christmas Crunch will be having interesting effects …

Knitting patterns price distribution

Ravelry allows knitting designers from all over the world to sell individual patterns directly to knitters. This post looks at the distribution of the prices of these items. The Ravelry pattern database holds patterns published as books, e-book collections, pdf downloads, or club subscriptions; all patterns that are available online for purchase have their price indicated on their Ravelry page.

The following code queries the Ravelry API for the 500 patterns with the most projects in each of the following categories: “hat”, “sweater”, “neck/torso”, “feet/legs”, “hands”, “home”, “component” (these are tutorials or special techniques rather than full patterns), “toys and hobbies”, and “pet”. From this data we get the URLs of the patterns and use standard web scraping to get the HTML code for each pattern page. This works on patterns, since their pages are public (viewable by users not logged into Ravelry), but would not work for projects or stash entries, which are private by default (viewable only by Ravelry users).

categories <- c("hat","sweater","neck-torso","feet-legs","hands","home","toysandhobbies","pattern-component","pet")

# Get dataset of non-free patterns with a high number of projects, available as pdf downloads
# (Full treatment 1h30 for 500/category)
search_url <- ""
cat_search <- sapply(categories, function(name) paste(search_url, name,sep="", collapse=""))

# Get lists of search results; price attribute is NULL => use web scraping to get it
pat0 <- lapply(cat_search, GET, config=config("token"=ravelry.token))
pat <- lapply(pat0, content)

# Extract patterns permalinks in each category
permalinks <- sapply(pat, function(x) sapply(x$patterns, function(y) y$permalink))
names(permalinks) <- categories
permalinks <- melt(permalinks)
names(permalinks) <- c("link","category")

permalinks_full <- sapply(permalinks$link, function(name) paste("",name,sep="",collapse=""))

# Random sampling for testing
samp = sample(1:length(permalinks$link),length(permalinks$link))
permalinks_full <- permalinks_full[samp]       
permalinks <- permalinks[samp,]       

# Web scraping to get the price from the pattern page
# Takes about 1 min for 50 links
n=dim(permalinks)[1] # 1000 ok
pattern_info <- lapply(permalinks_full[1:n], htmlTreeParse, useInternalNodes = TRUE)
names(pattern_info) <- permalinks$link[1:n]

Once we have the html code for each pattern page, we parse it for the prices. The path to the price information in the html tree can be checked by looking at the source code for a typical pattern page. Since most patterns are priced in US dollars (around 75% of them in this dataset), all the price data is converted to current USD to match, using the R quantmod library.


pattern_prices <- lapply(pattern_info, function(html) getNodeSet(html, 
                                                                 fun=xmlValue)[[1]] )

num_prices <- lapply(pattern_prices, function(str) c("price"=regmatches(str,
                                                     "currency"=substr(str, nchar(str)-2, nchar(str)) 

pattern_nbr_projects <- melt(sapply(pattern_info, nbr_projects))
price_data  <- data.frame(matrix(unlist(num_prices), nrow=length(num_prices), byrow=T), stringsAsFactors=F)
price_data <- cbind(pattern_nbr_projects, permalinks[1:n,], price_data)
names(price_data) <- c("nbr_projects", "link","category", "price", "currency")
price_data$price <- as.numeric(price_data$price)

# Local currency conversion is proposed by Ravelry only for logged in users
# => normalize the prices here
currencies_codes = sapply(price_data$currency, paste,"USD",sep="")
# getFX puts exchange rate in the environment, but sapply does not change env. variables
for (curr in unique(price_data$currency)) getFX(paste(curr, "/USD", sep=""), from = Sys.Date())
exchange_rates = sapply(currencies_codes, get)
price_data$price_usd = price_data$price * exchange_rates
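The parsing and conversion steps can be illustrated without any scraping. This sketch assumes price strings of the form “amount, space, three-letter currency code”; the strings and the exchange rates are made up (the real code fetches live rates with `getFX`).

```r
# Hypothetical price strings in the "<amount> <currency code>" form assumed here
price_str <- c("5.00 USD", "4.50 GBP", "3.99 EUR", "6.00 USD")

# Amount: first run of digits with an optional decimal part
amount <- as.numeric(regmatches(price_str, regexpr("[0-9]+(\\.[0-9]+)?", price_str)))
# Currency: the last three characters of the string
currency <- substr(price_str, nchar(price_str) - 2, nchar(price_str))

# Placeholder exchange rates to USD (made up for the example)
usd_rate <- c(USD = 1, GBP = 1.5, EUR = 1.1)

# Look up each pattern's rate by currency code and convert
price_usd <- amount * usd_rate[currency]
share_usd <- mean(currency == "USD")
```

Indexing a named vector by the currency codes replaces the `get()` lookup in the environment; it is a design choice for the sketch, not what the code above does.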

And finally, the global price distribution (all categories aggregated):

ggplot(price_data) +
  geom_histogram(aes(x=price_usd), fill='Blue', alpha=0.5, binwidth=0.5) +
  scale_x_continuous(limits = c(0, 20), breaks = round(seq(0, 20, by = 1), 1)) +
  xlab("Pattern price in USD") +
  ylab("Number of patterns")


Pattern prices distribution. The data is only shown for patterns up to 20 dollars (there are a few expensive outliers, mostly kits with pattern+yarn included).

It looks like the “99 cents is cheaper than $1” strategy is mostly used in the lower price range. From $6 to $9, far fewer prices sit just below the integer values, whereas from $3 to $5 it’s the opposite.
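One way to put a number on this impression is to compute, for each price band, the share of prices that end just below a whole dollar. The sketch below runs on made-up prices; the 95-cent cutoff and the band limits are arbitrary choices of mine.

```r
# Hypothetical USD prices; in the post these would be price_data$price_usd
price_usd <- c(1.99, 2.00, 3.99, 4.99, 4.50, 5.00, 6.00, 6.50, 7.00, 8.00, 8.99)

# Cents part, rounded to avoid floating-point noise
cents <- round((price_usd %% 1) * 100)

# Prices ending in 95 to 99 cents sit just below the next dollar
is_just_below <- cents >= 95

# Share of such prices in each band; right = FALSE makes bands like [3, 6)
band <- cut(price_usd, breaks = c(0, 3, 6, 9), right = FALSE,
            labels = c("under $3", "$3 to $6", "$6 to $9"))
just_below_share <- tapply(is_just_below, band, mean)
```

On the real data, a clearly lower share in the upper band would support what the histogram suggests.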

The breakdown by category:

ggplot(price_data, aes(x=category, y=price_usd, fill=category)) +
  geom_boxplot(alpha=0.5) +
  ylab("Price in USD") +
  ggtitle("Pattern prices distributions in each category")

Pattern price is pretty constant.

Price does not depend much on category. But negative results are just as interesting as positive results, so this graph is still proudly displayed! I was a bit surprised by this, since there can be a lot of variance in pattern design complexity between a one-size-fits-all accessory and a sweater.

Which yarn colors are most stashed and most used?

Color is a pretty important part of a knitting project, and a big factor in deciding to buy and knit a skein (or hank, as is more often the case with hand-dyed yarns). Maybe counter-intuitively for non-knitters, many knitters buy yarn without a specific project planned for it, sometimes even in quantities that they couldn’t knit through even if they spent the rest of their lives knitting 24/7! Such is the power of beautiful yarn.

The yarn stash entry on Ravelry allows users to select the color category of their yarn when adding it to their notebook stash. Color is quite subjective and categorizing yarn colors can be especially difficult, as these pictures show:

Which color is this? (“Top Draw Sock” by Skein in the “inner city” colorway)

Is this yellow, brown, or orange? (“Tosh Sock” by Madelinetosh in the “ginger glazed” colorway)

Often though, the color distinctly falls into one of the Ravelry color families: “Blue”, “Green”, “Purple”, “Brown”, “Gray”, “Blue-green”, “Pink”, “Red”, “Natural/Undyed”, “Black”, “Red-purple”, “White”, “Orange”, “Yellow”, “Blue-purple”, “Red-orange”, “Yellow-green”, and “Yellow-orange”. This post excludes yarn spun by the users (“handspun”), to look only at commercial yarns (spun at a mill) that users have bought and added to the “yarn stash” section of their notebooks.

The color family of projects cannot be retrieved with the API (it is a yarn property); in addition, the actual numbers of stashed yarn items and projects in Ravelry are displayed when doing a manual search. I therefore entered this data by hand to look at color distributions rather than doing an API call. It’s not as practical, but it’s more accurate, since API calls tend to time out if we query more than 5000 entries.

The code below defines the data as taken (manually) from Ravelry, and shows a plot of how many items are stashed in each color (bars) and how many projects were knitted using this color (dots).

# Colors selected among the 140 html supported color names
ravColors <- c("Black"="#000000","Blue"="#0000ff","Blue-green"="#008080","Blue-purple"="#6A5ACD","Brown"="#A52A2A",

# Color breakdown (august) from project and stash search page info: 
yarnColors <- data.frame("color"=c("Blue","Green","Purple","Brown","Gray","Blue-green",
                                    "Pink","Red","Natural/Undyed", "Black","Red-purple",

# sort by increasing number of items in stash
yarn_colors <- arrange(yarnColors,stash)

ggplot(yarn_colors) +
  geom_bar(aes(x=seq_along(color), y=stash, fill=factor(color)), stat="identity") +
  geom_point(aes(x=seq_along(color), y=project, fill=factor(color)),
             colour="white", pch=21, size=5)+
  guides(fill=guide_legend(title="colors\nbars: stash\ndots: projects"))+
  xlab("Ravelry color family")+
  ylab("Number of stash items / projects")+
  ggtitle("Color distribution of stashed yarns and yarns used in projects")

Colors in stashes and projects. Blue and green lead!

Project colors tend to follow the same distribution as stash colors. Blue is clearly the winner, both in stashes and in projects. Yellows and oranges, on the other hand, don’t seem to tempt many knitters.

Gray is knitted a lot more, comparatively, than it is stashed (its dot doesn’t follow the slope defined by the other dots).
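The “slope” remark can be checked numerically by comparing each color’s projects-per-stashed-item ratio with the typical ratio. The counts below are invented purely to show the mechanics; they are not the Ravelry figures.

```r
# Made-up counts for illustration only; the real figures were read off Ravelry by hand
yarn_colors <- data.frame(
  color   = c("Blue", "Green", "Gray", "Yellow"),
  stash   = c(1000, 800, 500, 200),
  project = c(500, 400, 400, 100),
  stringsAsFactors = FALSE
)

# Projects per stashed item: colors on the common slope share a similar ratio
yarn_colors$ratio <- yarn_colors$project / yarn_colors$stash

# Compare each color with the median ratio; values well above 1 stand out
yarn_colors$relative <- yarn_colors$ratio / median(yarn_colors$ratio)
outlier <- yarn_colors$color[which.max(yarn_colors$relative)]
```

With the real data, a color whose relative ratio is well above 1 is one that gets knitted more than its stash share would predict.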

Knitting patterns categories

Let’s have a look at the distribution of Ravelry patterns among the various categories. The categories have a tree structure. The number of patterns in each category is not accessible via the API, but it is visible on the website’s main page.

The code below is not the most exciting, but it is a more accurate reflection of the breakdown of knitting patterns in August 2015 than what could be estimated from an API call on the database. I took into account only the patterns that have at least one picture, to avoid potential bad entries (duplicates, drafts …). Some of the categories are aggregated to keep the tree plot simple; the entries named “All Item” are aggregated and actually contain subcategories.

pattern_nbr_tree <- list("Clothing"=list( "Coat/Jacket"=10707,
                                          "Intimate Apparel"=list("Bra"=35,
                                          "Tops"=list("Sleeveless Top"=7195,
                                                      "Strapless Top"=130,
                          "Accessories"=list("All Bag"=8883,
                                                             "All Socks"=23631,
                                            "All Hands"=23733,
                                            "All Hat"=45386,
                                            "All Jewelry"=1642,
                                            "All Other Headwear"=3989,
                          "All Home"=36478,
                          "All Toys and Hobbies"=21023,
                          "All Pet"=1551,
                          "All Components"=7073)

This plot shows the number of patterns in each category as the area of the corresponding rectangle.

pattern_nbr_tree <- melt(pattern_nbr_tree)
names(pattern_nbr_tree) <- c("patterns_count","cat3","cat2","cat1")

custom_palette <- c("#FAF0E6","#7FFFD4","#008080","#FF6347","#FA8072","#B0C4DE")
treemap(pattern_nbr_tree, index=c("cat1","cat2","cat3"), vSize="patterns_count",
        title="Number of knitting patterns per category",palette=custom_palette,border.col="White",border.lwds=c(6,3,0.5))

Number of patterns in each category.

Accessories dominate, probably because they represent a smaller time commitment than clothing items, and there is less risk of a poor fit!
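To see how a nested list turns into a long table and then into totals per top-level category, here is a base R sketch on a small excerpt. The counts are taken from the list above, but the excerpt is deliberately tiny and the grouping of “All Socks” directly under “Accessories” is a simplification of mine; `flatten` is a hypothetical helper standing in for `melt`.

```r
# A small excerpt of the nested list, using counts quoted above
tree <- list(
  "Clothing"    = list("Coat/Jacket" = 10707, "Tops" = list("Strapless Top" = 130)),
  "Accessories" = list("All Hat" = 45386, "All Socks" = 23631),
  "All Pet"     = 1551
)

# Walk the tree recursively and return one row per leaf
flatten <- function(node, path = character(0)) {
  if (!is.list(node)) {
    return(data.frame(cat1 = path[1], leaf = path[length(path)],
                      patterns_count = node, stringsAsFactors = FALSE))
  }
  do.call(rbind, lapply(names(node), function(nm) flatten(node[[nm]], c(path, nm))))
}

flat <- flatten(tree)

# Total number of patterns per top-level category
totals <- tapply(flat$patterns_count, flat$cat1, sum)
```

In the tree plot, each rectangle’s area corresponds to one `patterns_count` value, and the top-level totals are the areas of the outermost rectangles.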

Tags in popular knitting projects

This second post in the “Knitting data” series looks at the text Ravelry users use to describe their own projects, and more specifically the tags attached to popular projects, i.e. the projects with the most “favorites”. “Favoriting” is a way for users to bookmark and manifest their appreciation for an item.

The following code queries the Ravelry API for the 5000 most favorited projects.

knit_all <- GET(";craft=knitting&sort=favorites", config=config("token"=ravelry.token))
knit <- content(knit_all)

The following function retrieves the tags from a list of projects (the result of the query above) that have a number of favorites above a given threshold. The tags are then grouped into a text corpus to use with the R text mining library tm. (The corpus form is useful, for example, to split the tags into several groups by popularity threshold.)

library(tm)

get_tags <- function(projects, fav_threshold){
  # get the tags from the project list
  # use only projects where nbr of favorites >= fav_threshold
  # (if/else rather than ifelse, which would keep only the first tag)
  tags_threshold <- lapply(projects,
                  function(x) if (x$favorites_count>=fav_threshold) x$tag_names else "")
  tags <- paste(unlist(tags_threshold), collapse=" ")
  # build corpus from project tags
  corpus_tags <- Corpus(VectorSource(tags))
  names(corpus_tags) <- paste("fav",as.character(fav_threshold),sep="")
  # clean up the tags
  corpus_tags <- tm_map(corpus_tags, removeNumbers) # ignore dates
  corpus_tags <- tm_map(corpus_tags, removePunctuation)
  corpus_tags <- tm_map(corpus_tags, stemDocument)
  return(corpus_tags)
}

Now the “alltags” object below is a text corpus with all the tags used in the 5000 most favorited projects; since the threshold is 0, all projects are included. The tags are stemmed, meaning that the end of each word is cut according to certain rules, so as to avoid distinguishing between “cables” and “cabling”, for example.

alltags <- get_tags(knit$projects, 0)
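The thresholding and counting logic can be tried on a toy list without the tm library. This is a sketch on made-up projects; `tags_above` and `crude_stem` are hypothetical helpers, and stripping a trailing “s” is only a crude stand-in for real stemming.

```r
# Hypothetical projects, shaped like the entries of knit$projects
projects <- list(
  list(favorites_count = 900, tag_names = c("cables", "cardigan", "baby")),
  list(favorites_count = 500, tag_names = c("cable", "pullover")),
  list(favorites_count = 100, tag_names = c("socks", "cables"))
)

# Keep all tags of the projects at or above the threshold.
# if/else is used rather than ifelse(), which returns a result the length of
# its (length-one) test and would therefore keep only the first tag.
tags_above <- function(projects, fav_threshold) {
  unlist(lapply(projects, function(x) {
    if (x$favorites_count >= fav_threshold) x$tag_names else character(0)
  }))
}

# Crude stand-in for stemming: strip a trailing "s" (stemDocument does much more)
crude_stem <- function(tag) sub("s$", "", tag)

tag_counts <- sort(table(crude_stem(tags_above(projects, 500))), decreasing = TRUE)
```

With a threshold of 0 all seven tags are kept; with 500, the third project drops out and “cable” comes out on top because “cables” and “cable” are merged.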

The following function gets the frequency of each tag into a data frame (useful to more easily get the origins of the stemmed words).

tags_frequencies <- function(corpus_tags){
  # return tags and their frequencies in each document of corpus
  tdm <- TermDocumentMatrix(corpus_tags)
  tdmframe <- data.frame(as.matrix(tdm))
  names(tdmframe) <- names(corpus_tags)
  tdmframe$tags <- tdm$dimnames$Terms
  return(tdmframe)
}

Now we can apply this to our tag corpus to make a word cloud and a bar chart showing the most frequently used tags.

The tags are linked back to a correct word origin: “cable” is the simplest origin of the stem “cabl”; “cardigan”, on the other hand, is not changed by stemming. There should be a more elegant way (probably using Wordnet) to do this than the simple table lookup below, which I hope to get to in a later post!

all_freq <- tags_frequencies(alltags)
# keep only most frequent tags on all dataset
all_freq <- arrange(all_freq, desc(fav0))
all_freq <- all_freq[1:30,]
# replace stemmed word by a readable origin word
correction <- c(babi="baby",cabl="cable",pullov="pullover",bulki="bulky",contigu="contiguous",finger="fingering")
realword <- function(x) ifelse(x %in% names(correction), correction[x],x)
all_freq$tags <- sapply(all_freq$tags, realword)

wordcloud(words=all_freq$tags, freq=all_freq$fav0,colors=brewer.pal(8, "Dark2"),rot.per=0)

Word cloud of the most frequent tags in popular projects

And the bar chart and its code:


Chart of the most frequent tags in popular projects (same data as the word cloud)

ggplot() +
  geom_bar(aes(x=seq_along(all_freq$tags), y = all_freq$fav0), stat="identity", fill='Gold')+
  geom_text(aes(x=seq_along(all_freq$tags), y = all_freq$fav0, label=all_freq$tags),
            position=position_dodge(width=0.9) )+
  ylab("tag frequency")

Looks like cabled cardigans for babies are your best bet for knitterly fame!

About Ravelry – API connection

This post is the first of a series investigating the data found on Ravelry using the R programming language and the Ravelry API. Ravelry is a website functioning as a database, organizational tool, and social network for knitting and fiber crafts enthusiasts.

A lot of user-created content is accessible via the Ravelry API. Ravelry hosts a collection of about 300,000 knitting patterns; there are over 9 million knitting project pages created by the 5 million users, with about 7,000 projects and 65,000 forum posts added per day (more Ravelry facts from 2014 here).

The organizational tool is the most relevant for data mining, as it holds a lot of information about patterns, projects, and tools. Each user has a public notebook consisting of their craft projects, their yarn stash, their “favorites” (bookmarks) and other features. Project entries in the notebook have attributes like the name of the pattern, the notes taken by the user, the yarns and needles used … Stash entries have attributes like the name of the company producing the yarn, the color family, the user rating, the weight and so on.


Screenshot of the notebook page on Ravelry, with the list of projects made by the user.


Pattern page: includes the designer’s name, the pattern category, the recommended yarns, pictures … (“Sockhead hat” by Kelly McClure)


Yarn stash entry: includes link to yarn in the database, yarn weight, color family … One stash entry can represent several skeins of the same yarn.


Project entry: includes name of the project, link to pattern, needle (or hook for crochet) and yarn, user’s notes … (“Lorenz Manifold” by alicialight)

The R code below configures the OAuth access to the API using the httr library. Your user name, access key, and secret key provided by Ravelry are assumed to be in the user_rav.txt file in the working directory, one per line.


library(httr)

# user_rav.txt contains, one per line: user name, API access key, API secret key
credentials <- readLines("user_rav.txt")
names(credentials) <- c("user","access_key","secret_key")

OpenConnection <- function(credentials){
  # Args: login info for the Ravelry API
  # Returns oauth token
  # Open connection to Ravelry API and return token
  reqURL <- ""
  accessURL <- ""
  authURL <- ""
  ravelry.app <- oauth_app("ravelry", key=credentials["access_key"], 
                           secret=credentials["secret_key"])
  ravelry.urls <- oauth_endpoint(reqURL, authURL, accessURL)
  return(oauth1.0_token(ravelry.urls, ravelry.app))
}

# Quick test of API connection by getting connected user info
TestConnection <- function(ravelry.token) {
  # Arg: API token
  # Returns name of the user connected with this token
  test <- GET("", config=config("token"=ravelry.token))
  return(content(test)$user$username)
}

ravelry.token <- OpenConnection(credentials)
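Since a malformed credentials file only shows up later as a confusing authentication failure, it can help to validate the file when reading it. This is just a sketch: `read_credentials` is a hypothetical helper, and the file contents are placeholders written to a temporary file.

```r
# Write a throwaway credentials file: three lines, all values are placeholders
cred_file <- tempfile(fileext = ".txt")
writeLines(c("my_user", " my_access_key ", "my_secret_key"), cred_file)

read_credentials <- function(path) {
  # Read the lines and drop stray leading/trailing spaces
  creds <- trimws(readLines(path))
  # Fail early with a clear message instead of an obscure OAuth error later
  if (length(creds) != 3) stop("expected 3 lines: user, access key, secret key")
  names(creds) <- c("user", "access_key", "secret_key")
  creds
}

credentials <- read_credentials(cred_file)
```

The result has the same shape as the `credentials` vector used above, so it could be passed to `OpenConnection` unchanged.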

Once the connection is approved, this command should query and show your user name:

userinfo <- GET("", config=config("token"=ravelry.token))

In the next post we will get to the real data!