Pattern launch study

On Ravelry, users make heavy use of the bookmarking tools (“favorite” and “queue”) to remember the knitting patterns that caught their eye among the several million that the database holds.

The popular Brooklyn Tweed design house released their Fall 2015 knitting pattern collection on Wednesday, September 16th; I used this opportunity to look at the bookmarking activity on these patterns in real time for several days after the launch.

Some of the 15 patterns in the Brooklyn Tweed Fall 2015 collection.

The pattern names are kept secret until the launch, so it’s not possible to “listen” to the Ravelry database to detect their appearance and start following them. One first needs to search the Ravelry database for recently published Brooklyn Tweed designs, assuming that the Fall 2015 designs will be among the most recent results. This can, and should, be changed after the launch by editing the script to query the actual names of the collection patterns, because at some point more recent patterns tagged “Brooklyn Tweed” will push those from the Fall collection out of the most recent results. The following code does just that, getting the permalinks (unique pattern names) for the 30 most recent “Brooklyn Tweed” search results:

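library(httr)  # provides GET(), content() and config()

# ravelry.token is the API token, assumed to have been created beforehand
# Search for recent BT patterns (will also have others made with BT yarns)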
BT_query <- GET("https://api.ravelry.com/patterns/search.json?page_size=30&query=Olga%20Buraya%20Kefelian&sort=date",
             config=config("token"=ravelry.token))
content_to_follow <- content(BT_query)

permalinks_to_follow <- sapply(content_to_follow[[1]], function(x) x$permalink)

Once we have the links, the following code queries the API for each pattern for the properties we want to look at: current number of favorites, of users who queued the pattern, of comments, and of projects. It then appends the resulting data to a text file “BT_follow.txt”.

# get url from pattern unique name (permalink)
links_to_follow <- sapply(permalinks_to_follow, function(name) paste("https://api.ravelry.com/patterns/",name,".json",sep="",collapse=""))
# query the API
pat0 <- lapply(links_to_follow, GET, config=config("token"=ravelry.token))
pats <- lapply(pat0, content)

# filter the results for the properties we are interested in
pattern_data <- sapply(pats, function(x) x$pattern[c("permalink",
                                                     "queued_projects_count",
                                                     "projects_count",
                                                     "favorites_count",
                                                     "comments_count")])

# turn into a data frame object
pattern_df <- data.frame(matrix(unlist(pattern_data), nrow=length(links_to_follow), byrow=T))
# add time and date
pattern_df$time <- Sys.time()

# write to text file
write.table(pattern_df, file = "/Users/saskia/R/ravelry_explorer/BT_follow.txt",append=T,row.names=F,col.names=F,quote=F)

This code gets the data at the point in time when it is run; to get the data as a function of time, the code above is run automatically every 30 minutes by a cron job (useful advice here).
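To make the "append a snapshot on every run" idea concrete, here is a minimal sketch using only base R. The pattern names, counts and temporary file are made up for illustration and are not part of the original script. It also shows why an unquoted, space-separated timestamp ends up as two columns (a day and a time) when the file is read back.

```r
# Temporary file standing in for the real log file (illustration only)
log_file <- tempfile(fileext = ".txt")

# Two simulated "runs" of the script, with made-up favorites counts
for (favs in list(c(10, 4), c(25, 9))) {
  snap <- data.frame(pattern = c("pattern-a", "pattern-b"),
                     favorites_count = favs)
  # store the timestamp as plain text: "YYYY-MM-DD HH:MM:SS"
  snap$time <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
  # each run appends its rows; no header, no quotes, space-separated
  write.table(snap, file = log_file, append = TRUE, sep = " ",
              row.names = FALSE, col.names = FALSE, quote = FALSE)
}

# Reading back with the same separator: the unquoted timestamp contains
# a space, so it comes back as two fields (day and time)
history <- read.table(log_file, header = FALSE, sep = " ",
                      col.names = c("pattern", "favorites_count", "day", "time"),
                      stringsAsFactors = FALSE)

# Rebuild a proper date-time column from the two fields
history$date <- as.POSIXct(paste(history$day, history$time))
print(history)
```

Whatever separator is chosen, the write and read steps need to agree on it, otherwise the column names will not line up with the data.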

After letting the data file grow for a few days, it’s time for harvest!

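library(plyr)  # provides revalue() and ddply()

# read from the text file; write.table() above used its default separator
# (a space), which also splits the timestamp into a day and a time column
BT_time <- read.csv("BT_follow.txt",header=F,sep=" ")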
# add columns names
names(BT_time) <- c("pattern","queued_projects_count","projects_count","favorites_count","comments_count","day","time")

# Fall 2015 pattern unique names
names_BT <- c("ashland-2","bannock","birch-bay-2","cascades","deschutes-2","copse-2",
             "fletching-3","lander","lolo","mcloughlin","nehalem-2","riverbend-2",
             "sauvie","trailhead-2","willamette-7")

# Fall 2015 pattern official names
names_full <- c("Ashland","Bannock","Birch Bay","Cascades","Deschutes","Copse",
                "Fletching","Lander","Lolo","McLoughlin","Nehalem","Riverbend",
                "Sauvie","Trailhead","Willamette")

names(names_full) <- names_BT

# filter the data to keep only Brooklyn tweed Fall 2015 patterns
BT_time <- subset(BT_time,subset=(pattern %in% names_BT))
BT_time$pattern <- droplevels(BT_time$pattern)
# use pattern name as identifier instead of permalink
BT_time$pattern <- revalue(BT_time$pattern, names_full)
# get full date from day and hour/min, in Brooklyn Tweed's local time (PDT)
BT_time$date <- as.POSIXct(paste(BT_time$day, BT_time$time))
attributes(BT_time$date)$tzone <- "America/Los_Angeles"

And now we can plot the graph of the number of favorites on each pattern as a function of time. I chose 15 colors as far from each other as possible using this handy tool from the Medialab at Sciences Po.

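library(ggplot2)

# lastdate is not defined in the code shown; assuming it is the most recent timestamp
lastdate <- max(BT_time$date)
# BT_colors (also defined elsewhere) holds the 15 colors mentioned above

ggplot(BT_time) + 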
  geom_line(aes(x=date, y=favorites_count, color=pattern, group=pattern)) +
  geom_text(data=BT_time[BT_time$date ==lastdate,], 
            aes(x=date,
                y=favorites_count,
                color=pattern,
                label=pattern),
            vjust = -0.2) +
  xlab("Time") +
  ylab("Number of favorites") +
  theme(legend.position="none") +
  scale_colour_manual(values=BT_colors) +
  ggtitle("Evolution of number of favorites on BT Fall 2015 patterns")

The sun (almost) never sets on the knitting empire!

Looks like the hottest time window is within the first day! The time dependence has a very similar shape for each pattern, probably because they are all released in the same context, with a consistent design concept throughout the collection.

People bookmark patterns all the time, although when it’s night in the USA, there is a small dip in activity. This is much more visible if we plot the favoriting rate (number of new favorites per hour):

# function getting the nbr of favorites per unit time
dfav_dt <- function(data){
  n <- dim(data)[1] # nbr lines
  favdiff <- c(0, diff(data$favorites_count)) # length n
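  # previous timestamp for each row; note the parentheses in 1:(n-1)
  timediff <- difftime(data$date, c(data$date[1], data$date[1:(n-1)]),units="hours")
  # div by 0 on first item, but ggplot will just ignore those points
  timediff <- as.numeric(timediff) 
  favderiv <- SMA(favdiff/timediff) # simple moving average, from the TTR package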
  return(data.frame(deriv = favderiv, 
                    date = data$date))
}

# rate gets pretty constant after a few days, so keep only
# the points shortly after launch
m <- 3000
# get rates for each pattern
rates <- ddply(BT_time[1:m,],.(pattern), dfav_dt)

# normalize the data to more easily see the general behavior
norm_col <- function(data){
  maxd <- max(data$deriv, na.rm=T)
  return(cbind(data,norm_deriv=data$deriv/maxd))
}

rates <- ddply(rates,.(pattern), norm_col)

ggplot(rates) + 
  geom_line(aes(x=date, y=norm_deriv, group=pattern), color="red") +
  xlab("Time") +
  ylab("Normalized favoriting rate") +
  theme(legend.position="none") +
  ggtitle("Evolution of favoriting rate on BT Fall 2015 patterns")

Favoriting rate for the few days after the launch; the dip in activity when night moves over the USA is now quite visible. There are 15 normalized curves, one for each pattern in the collection.
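The favoriting rate is essentially a finite difference of the cumulative favorites count. As a minimal sketch on made-up numbers (one hypothetical pattern, sampled every 30 minutes, and without the moving-average smoothing used for the plot), the computation looks like this:

```r
# Made-up cumulative favorites for one hypothetical pattern
toy <- data.frame(
  date = as.POSIXct("2015-09-16 10:00:00", tz = "UTC") + 1800 * (0:4),
  favorites_count = c(0, 40, 100, 130, 140)
)

# new favorites between consecutive snapshots
fav_diff <- diff(toy$favorites_count)
# elapsed time between consecutive snapshots, in hours
hours <- as.numeric(diff(toy$date), units = "hours")

# favorites per hour, attached to the later snapshot of each pair
rate <- data.frame(date = toy$date[-1], per_hour = fav_diff / hours)
# rescale so that the peak is 1, as in the normalized plot
rate$normalized <- rate$per_hour / max(rate$per_hour)
print(rate)
```

Dropping the first snapshot (rather than padding it with a zero) avoids the division by zero on the first point; which of the two is preferable is a matter of taste.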

The time dependence of the number of times a pattern was queued is very highly correlated with the time dependence of the number of favorites, so I’m not showing it here.

I’m letting the code run once per day now; I’ll post the new data in a few months to look at long-term trends, and at the number of projects (there are obviously not a lot of them so soon after the launch). I’m guessing the Christmas Crunch will have interesting effects…

Knitting patterns price distribution

Ravelry allows knitting designers from all over the world to sell individual patterns directly to knitters. This post looks at the distribution of the prices of these items. The Ravelry pattern database holds patterns published as books, e-book collections, pdf downloads, or club subscriptions; all patterns that are available online for purchase have their price indicated on their Ravelry page.

The following code queries the Ravelry API for the 500 patterns with most projects in each of the following categories: “hat”, “sweater”, “neck/torso”, “feet/legs”, “hands”, “home”, “component” (these are tutorials or special techniques rather than full patterns), “toys and hobbies”, and “pet”. From this data we get the URLs to the patterns and use standard web scraping to get the HTML code for each pattern page. This works on patterns, since their pages are public (viewable by users not logged into Ravelry), but would not work for projects or stash entries, which are private by default (viewable only by Ravelry users).




library(reshape2)  # provides melt()
library(XML)       # provides htmlTreeParse() and getNodeSet()

categories <- c("hat","sweater","neck-torso","feet-legs","hands","home","toysandhobbies","pattern-component","pet")

# Get dataset of non-free patterns with a high number of projects, available as pdf downloads
# (Full treatment 1h30 for 500/category)
search_url <- "https://api.ravelry.com/patterns/search.json?page_size=500&sort=projects&craft=knitting&availability=ravelry%2B-free&pc="
cat_search <- sapply(categories, function(name) paste(search_url, name,sep="", collapse=""))

# Get lists of search results; price attribute is NULL => use web scraping to get it
pat0 <- lapply(cat_search, GET, config=config("token"=ravelry.token))
pat <- lapply(pat0, content)

# Extract patterns permalinks in each category
permalinks <- sapply(pat, function(x) sapply(x$patterns, function(y) y$permalink))
names(permalinks) <- categories
permalinks <- melt(permalinks)
names(permalinks) <- c("link","category")

permalinks_full <- sapply(permalinks$link, function(name) paste("http://www.ravelry.com/patterns/library/",name,sep="",collapse=""))

# Random sampling for testing
samp = sample(1:length(permalinks$link),length(permalinks$link))
permalinks_full <- permalinks_full[samp]       
permalinks <- permalinks[samp,]       

# Web scraping to get the price from the pattern page
# Takes about 1 min for 50 links
n=dim(permalinks)[1] # 1000 ok
pattern_info <- lapply(permalinks_full[1:n], htmlTreeParse, useInternalNodes = TRUE)
names(pattern_info) <- permalinks$link[1:n]

Once we have the html code for each pattern page, we parse it for the prices. The path to the price information in the html tree can be checked by looking at the source code for a typical pattern page. Since most patterns are priced in US dollars (around 75% of them in this dataset), all the price data is converted to current USD to match, using the R quantmod library.

 

pattern_prices <- lapply(pattern_info, function(html) getNodeSet(html, 
                                                                 path="//strong[@class='price']/a/text()", 
                                                                 fun=xmlValue)[[1]] )

num_prices <- lapply(pattern_prices, function(str) c("price"=regmatches(str,
                                                                regexpr("[[:digit:]]+\\.*[[:digit:]]*",str)),
                                                     "currency"=substr(str, nchar(str)-2, nchar(str)) 
                                                     )
                     )


# nbr_projects() is a helper that is not shown in this post
pattern_nbr_projects <- melt(sapply(pattern_info, nbr_projects))
price_data  <- data.frame(matrix(unlist(num_prices), nrow=length(num_prices), byrow=T), stringsAsFactors=F)
price_data <- cbind(pattern_nbr_projects, permalinks[1:n,], price_data)
names(price_data) <- c("nbr_projects", "link","category", "price", "currency")
price_data$price <- as.numeric(price_data$price)

# Local currency conversion is proposed by Ravelry only for logged in users
# => normalize the prices here
currencies_codes = sapply(price_data$currency, paste,"USD",sep="")
# getFX puts exchange rate in the environment, but sapply does not change env. variables
for (curr in unique(price_data$currency)) getFX(paste(curr, "/USD", sep=""), from = Sys.Date())
exchange_rates = sapply(currencies_codes, get)
price_data$price_usd = price_data$price * exchange_rates
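To see the parsing and conversion steps in isolation, here is a small base-R sketch. The price strings are hypothetical examples of an "amount followed by a three-letter currency code" shape, and the exchange rates are made up for illustration; the real code above gets its rates from quantmod.

```r
# Hypothetical price strings (assumed shape: amount, space, 3-letter code)
price_strings <- c("$6.00 USD", "$7 CAD", "3.50 GBP")

parse_price <- function(str) {
  # first run of digits, optionally followed by a decimal part
  amount <- regmatches(str, regexpr("[[:digit:]]+(\\.[[:digit:]]+)?", str))
  # the currency code is assumed to be the last three characters
  currency <- substr(str, nchar(str) - 2, nchar(str))
  data.frame(price = as.numeric(amount), currency = currency,
             stringsAsFactors = FALSE)
}

parsed <- do.call(rbind, lapply(price_strings, parse_price))

# Made-up exchange rates to USD, for illustration only (not market data)
toy_rates <- c(USD = 1, CAD = 0.75, GBP = 1.5)
parsed$price_usd <- parsed$price * unname(toy_rates[parsed$currency])
print(parsed)
```

Looking up the rate by currency code in a named vector keeps the conversion to a single vectorized multiplication.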

And finally, the global price distribution (all categories aggregated):

ggplot(price_data) +
  geom_histogram(aes(x=price_usd), fill='Blue', alpha=0.5, binwidth=0.5) +
  scale_x_continuous(limits = c(0, 20), breaks = round(seq(0, 20, by = 1), 1)) +
  xlab("Pattern price in USD") +
  ylab("Number of patterns")

Pattern price distribution. The data is only shown for patterns up to 20 dollars (there are a few expensive outliers, mostly kits with pattern+yarn included).

It looks like the “99 cents is cheaper than $1” strategy is mostly used in the lower price range. Between $6 and $9 there are far fewer price points just below the integer prices, but between $3 and $5 it’s the opposite.
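One way to put numbers on this observation is to classify each price by how it ends and count the endings per price band. The sketch below uses made-up prices, an arbitrary 95-cent threshold for "just below" and arbitrary band limits; it is only meant to illustrate the idea, not to reproduce the dataset.

```r
# Made-up prices in USD, for illustration only
prices <- c(0.99, 1.99, 3.00, 3.99, 4.50, 4.99, 5.00, 6.00, 6.50, 7.00)

# cents part of each price (rounded to avoid floating-point noise)
cents <- round((prices - floor(prices)) * 100)

# label each price by how it ends (the 95-cent threshold is arbitrary)
ending <- ifelse(cents == 0, "round",
                 ifelse(cents >= 95, "just below", "other"))

# count the endings within each price band (the breaks are arbitrary)
band <- cut(prices, breaks = c(0, 3, 6, 9), right = FALSE)
endings_by_band <- table(band, ending)
print(endings_by_band)
```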

The breakdown by category:

ggplot(price_data, aes(x=category, y=price_usd, fill=category)) +
  geom_boxplot(alpha=0.5) +
  ylab("Price in USD") +
  ggtitle("Pattern prices distributions in each category")

Pattern price is pretty constant.

Price does not depend much on category. But negative results are just as interesting as positive results, so this graph is still proudly displayed! I was a bit surprised by this, since there can be a lot of variance in pattern design complexity between a one-size-fits-all accessory and a sweater.