Which yarn colors are most stashed and most used?

Color is a pretty important part of a knitting project, and a big factor in deciding to buy and knit a skein (or hank, as is more often the case with hand-dyed yarns). Maybe counter-intuitively for non-knitters, many knitters buy yarn without a specific project planned for it, sometimes in quantities they couldn’t knit through even if they spent the rest of their lives knitting 24/7! Such is the power of beautiful yarn.

The yarn stash entry on Ravelry allows the user to select the color category of their yarn when adding it to their notebook stash. Color is quite subjective, and categorizing yarn colors can be especially difficult, as these pictures show:

Which color is this? (“Top Draw Sock” by Skein in the “inner city” colorway)

Is this yellow, brown, or orange? (“Tosh Sock” by Madelinetosh in the “ginger glazed” colorway)

Often though, the color distinctly falls into one of the Ravelry color families: “Blue”, “Green”, “Purple”, “Brown”, “Gray”, “Blue-green”, “Pink”, “Red”, “Natural/Undyed”, “Black”, “Red-purple”, “White”, “Orange”, “Yellow”, “Blue-purple”, “Red-orange”, “Yellow-green”, and “Yellow-orange”. This post excludes yarn spun by the users (“handspun”), to look only at commercial yarns (spun at a mill) that users have bought and added to the “yarn stash” section of their notebooks.

The color family of a project cannot be retrieved with the API (it is a property of the yarn). The website’s search pages, however, display the actual number of stashed yarn items and projects in each color family. I therefore entered these counts by hand to look at color distributions, rather than making API calls. This is less practical, but it is more accurate, since API calls tend to time out when querying more than 5000 entries.

The code below defines the data as taken (manually) from Ravelry, and shows a plot of how many items are stashed in each color (bars) and how many projects were knitted using this color (dots).

# Colors selected among the 140 html supported color names
ravColors <- c("Black"="#000000","Blue"="#0000ff","Blue-green"="#008080","Blue-purple"="#6A5ACD","Brown"="#A52A2A",

# Color breakdown (August) from project and stash search page info: 
yarnColors <- data.frame("color"=c("Blue","Green","Purple","Brown","Gray","Blue-green",
                                    "Pink","Red","Natural/Undyed", "Black","Red-purple",

library(dplyr)    # arrange()
library(ggplot2)  # plotting

# sort by increasing number of items in stash
yarnColors <- arrange(yarnColors, stash)

# bars: stash counts, dots: project counts, both filled with the color family
ggplot(yarnColors)+
  geom_bar(aes(x=seq_along(color), y=stash, fill=factor(color)), stat="identity")+
  geom_point(aes(x=seq_along(color), y=project, fill=factor(color)),
             colour="white", pch=21, size=5)+
  scale_fill_manual(values=ravColors)+
  guides(fill=guide_legend(title="colors\nbars: stash\ndots: projects"))+
  xlab("Ravelry color family")+
  ylab("Number of stash items / projects")+
  ggtitle("Color distribution of stashed yarns and yarns used in projects")

Colors in stashes and projects. Blue and green lead!


Project colors tend to follow the same distribution as stash colors. Blue is clearly the winner, both in stashes and in projects. Yellows and oranges, on the other hand, don’t seem to tempt many knitters.

Gray is knitted a lot more, comparatively, than it is stashed (its dot doesn’t follow the slope defined by the other dots).
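One way to make this kind of observation concrete is to compare the number of projects per stashed item across colors. The sketch below assumes a data frame with the same columns as above (color, stash, project); the counts in it are made up purely to illustrate the calculation and are not the Ravelry figures.

```r
# Hypothetical counts, made up purely to illustrate the calculation;
# they are NOT the Ravelry figures used in the plot.
toy_colors <- data.frame(
  color   = c("Blue", "Green", "Gray", "Yellow"),
  stash   = c(1000, 800, 500, 100),
  project = c(500, 400, 450, 50)
)

# Projects knitted per stashed item, for each color family
toy_colors$ratio <- toy_colors$project / toy_colors$stash

# Compare each ratio to the median ratio: values well above 1 mean the
# color is knitted more than its stash count alone would suggest
toy_colors$relative <- toy_colors$ratio / median(toy_colors$ratio)

print(toy_colors[order(-toy_colors$relative), ])
```

With the real counts, a color whose dot sits off the general slope would show up with a relative value clearly different from 1.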


Knitting pattern categories

Let’s have a look at how Ravelry patterns are distributed among the various categories. The categories have a tree structure. The number of patterns in each category is not accessible via the API, but it is visible on the website’s main page.

The code below is not the most exciting, but it reflects the breakdown of knitting patterns in August 2015 more accurately than an estimate from an API call on the database would. I counted only the patterns that have at least one picture, to avoid potential bad entries (duplicates, drafts …). Some of the categories are aggregated to keep the tree plot simple; the entries named “All <item>” (“All Hat”, for example) are aggregates that actually contain subcategories.

pattern_nbr_tree <- list("Clothing"=list( "Coat/Jacket"=10707,
                                          "Intimate Apparel"=list("Bra"=35,
                                          "Tops"=list("Sleeveless Top"=7195,
                                                      "Strapless Top"=130,
                          "Accessories"=list("All Bag"=8883,
                                                             "All Socks"=23631,
                                            "All Hands"=23733,
                                            "All Hat"=45386,
                                            "All Jewelry"=1642,
                                            "All Other Headwear"=3989,
                          "All Home"=36478,
                          "All Toys and Hobbies"=21023,
                          "All Pet"=1551,
                          "All Components"=7073)
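A nested list like this can be sanity-checked before plotting by summing each top-level branch. The sketch below uses only an excerpt of the tree, with a few of the counts listed above, so its totals are partial and only illustrate the mechanics. The name tree_excerpt is introduced here for illustration.

```r
# Excerpt of the category tree, using a few of the counts listed above;
# the real tree has many more entries, so these totals are partial.
tree_excerpt <- list(
  "Clothing" = list(
    "Coat/Jacket" = 10707,
    "Tops" = list("Sleeveless Top" = 7195, "Strapless Top" = 130)
  ),
  "Accessories" = list("All Bag" = 8883, "All Hat" = 45386),
  "All Pet" = 1551
)

# unlist() flattens every level, so summing it gives the total of a branch
branch_totals <- sapply(tree_excerpt, function(branch) sum(unlist(branch)))
print(branch_totals)

# Share of each top-level branch within this excerpt
print(round(branch_totals / sum(branch_totals), 3))
```

Because unlist() ignores how deep a leaf sits, the same call works for branches with and without subcategories.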

This plot shows the number of patterns in each category as the area of the corresponding rectangle.

library(reshape2)  # melt() flattens the nested list into a data frame
pattern_nbr_tree <- melt(pattern_nbr_tree)
names(pattern_nbr_tree) <- c("patterns_count","cat3","cat2","cat1")

custom_palette <- c("#FAF0E6","#7FFFD4","#008080","#FF6347","#FA8072","#B0C4DE")
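library(treemap)  # treemap()
treemap(pattern_nbr_tree, index=c("cat1","cat2","cat3"), vSize="patterns_count",
        title="Number of knitting patterns per category", palette=custom_palette,
        border.col="White", border.lwds=c(6,3,0.5))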

Number of patterns in each category.

Accessories dominate, probably because they represent a smaller time commitment than clothing items, and there is less risk of a poor fit!

Tags in popular knitting projects

This second post in the “Knitting data” series looks at the text Ravelry users use to describe their own projects, and more specifically the tags attached to popular projects, i.e. the projects with the most “favorites”. “Favoriting” is a way for users to bookmark an item and show their appreciation for it.

The following code queries the Ravelry API for the 5000 most favorited projects.

library(httr)  # GET(), content(), config()
knit_all <- GET("https://api.ravelry.com/projects/search.json?page_size=5000&craft=knitting&sort=favorites", config=config("token"=ravelry.token))
knit <- content(knit_all)
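Since large queries tend to time out, one alternative would be to request the same projects in smaller pages. The sketch below only builds the URLs. The helper name search_url is hypothetical, and the `page` query parameter is an assumption that should be checked against the Ravelry API documentation.

```r
# Build the search URL for one page of results.
# `page` is assumed here to be a supported query parameter; check the
# Ravelry API documentation before relying on it.
search_url <- function(page, page_size = 500) {
  sprintf(
    "https://api.ravelry.com/projects/search.json?page_size=%d&page=%d&craft=knitting&sort=favorites",
    as.integer(page_size), as.integer(page)
  )
}

# Ten pages of 500 projects would cover the 5000 most favorited projects
urls <- vapply(1:10, search_url, character(1))
print(urls[1])

# Each URL could then be fetched like the single query above, e.g.:
# pages <- lapply(urls, GET, config = config("token" = ravelry.token))
# projects <- unlist(lapply(pages, function(p) content(p)$projects), recursive = FALSE)
```

Smaller pages mean more requests, but each one is less likely to hit a timeout, and a failed page can be retried on its own.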

The following function retrieves the tags from a list of projects (the result of the query above), keeping only projects whose number of favorites is at or above a given threshold. The tags are then grouped into a text corpus for use with the R text mining library tm. (The corpus form is also convenient for splitting the tags into several groups, for example by popularity threshold.)

library(tm)  # Corpus(), tm_map(), TermDocumentMatrix()

get_tags <- function(projects, fav_threshold){
  # get the tags from the project list
  # use only projects where nbr of favorites >= fav_threshold
  # (if/else rather than ifelse(), which would keep only the first tag of each project)
  tags_threshold <- lapply(projects,
                  function(x) if (x$favorites_count>=fav_threshold) x$tag_names else "")
  tags <- paste(unlist(tags_threshold), collapse=" ")
  # build corpus from project tags
  corpus_tags <- Corpus(VectorSource(tags))
  names(corpus_tags) <- paste("fav",as.character(fav_threshold),sep="")
  # clean up the tags
  corpus_tags <- tm_map(corpus_tags, removeNumbers) # ignore dates
  corpus_tags <- tm_map(corpus_tags, removePunctuation)
  corpus_tags <- tm_map(corpus_tags, stemDocument)
  return(corpus_tags)
}

Now the “alltags” object below is a text corpus with all the tags used in the 5000 most favorited projects; since the threshold is 0, all projects are included. The tags are stemmed, meaning that word endings are cut according to fixed rules, so that “cables” and “cabling”, for example, are counted as the same tag.

alltags <- get_tags(knit$projects, 0)
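To see what stemming does to individual tags, the stemmer can be called directly. This is a minimal sketch that assumes the SnowballC package is installed (tm relies on it for stemDocument); the three example words are chosen for illustration.

```r
library(SnowballC)  # stemming backend used by tm's stemDocument

variants <- c("cable", "cables", "cabling")
stems <- wordStem(variants, language = "english")
print(stems)

# All three variants collapse to a single stem, so they are counted together
print(length(unique(stems)))
```

The stems are not always real words, which is why a readable form has to be restored before plotting.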

The following function puts the frequency of each tag into a data frame, which makes it easier to map the stemmed words back to their original form.

tags_frequencies <- function(corpus_tags){
  # return tags and their frequencies in each document of corpus
  tdm <- TermDocumentMatrix(corpus_tags)
  tdmframe <- as.data.frame(as.matrix(tdm) )
  names(tdmframe) <- names(corpus_tags)
  tdmframe$tags <- tdm$dimnames$Terms
  return(tdmframe)
}
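For a single document, the term-document matrix is essentially a word count. The base R sketch below reproduces that idea on a hypothetical tag string; it is not how tm works internally (tm applies its own tokenization and, by default, ignores terms shorter than three characters).

```r
# Hypothetical cleaned-up tag string, standing in for one corpus document
tags <- "cabl cardigan babi cabl sock cardigan cabl"

# Split on whitespace and count each distinct term
term_counts <- table(strsplit(tags, "\\s+")[[1]])

# Same shape as the data frame built above: a frequency column and a tags column
freq <- data.frame(fav0 = as.vector(term_counts),
                   tags = names(term_counts),
                   stringsAsFactors = FALSE)
freq <- freq[order(-freq$fav0), ]
print(freq)
```

Sorting by decreasing frequency puts the most used tags first, which is the order needed for the top-30 selection.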

Now we can apply this to our tag corpus to make a word cloud and a bar chart showing the most frequently used tags.

The stemmed tags are mapped back to a readable word: “cable” is the simplest origin of the stem “cabl”; “cardigan”, on the other hand, is not changed by stemming. There should be a more elegant way to do this (probably using WordNet) than the simple table lookup below, which I hope to get to in a later post!

all_freq <- tags_frequencies(alltags)
# keep only most frequent tags on all dataset
all_freq <- arrange(all_freq, desc(fav0))
all_freq <- all_freq[1:30,]
# replace stemmed word by a readable origin word
correction <- c(babi="baby",cabl="cable",pullov="pullover",bulki="bulky",contigu="contiguous",finger="fingering")
realword <- function(x) ifelse(x %in% names(correction), correction[x],x)
all_freq$tags <- sapply(all_freq$tags, realword)

wordcloud(words=all_freq$tags, freq=all_freq$fav0,colors=brewer.pal(8, "Dark2"),rot.per=0)

Word cloud of the most frequent tags in popular projects
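The stem lookup used for the labels can be tried in isolation. This sketch uses a shortened copy of the correction table and a few example stems; the names lookup and readable are introduced here for illustration.

```r
# Stem -> readable word, same idea as the `correction` vector above
lookup <- c(babi = "baby", cabl = "cable", pullov = "pullover")

stems <- c("cabl", "cardigan", "babi")

# Stems found in the lookup are replaced; the others are kept unchanged
readable <- ifelse(stems %in% names(lookup), lookup[stems], stems)
readable <- unname(readable)
print(readable)
```

Here ifelse() is applied to the whole vector at once, so no sapply() loop is needed.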

And the bar chart and its code:


Chart of the most frequent tags in popular projects (same data as the word cloud)

  geom_bar(aes(x=seq_along(all_freq$tags), y = all_freq$fav0), stat="identity", fill='Gold')+
            position=position_dodge(width=0.9) )+
  ylab("tag frequency")+

Looks like cabled cardigans for babies are your best bet for knitterly fame!

About Ravelry – API connection

This post is the first of a series investigating the data found on ravelry.com using the R programming language and the Ravelry API. Ravelry is a website functioning as a database, organizational tool, and social network for knitting and fiber crafts enthusiasts.

A lot of user-created content is accessible via the Ravelry API. Ravelry hosts a collection of about 300,000 knitting patterns; there are over 9 million knitting project pages created by the 5 million users, with about 7,000 projects and 65,000 forum posts added per day (more Ravelry facts from 2014 here).

The organizational tool is the most relevant for data mining, as it holds a lot of information about patterns, projects, and tools. Each user has a public notebook consisting of their craft projects, their yarn stash, their “favorites” (bookmarks) and other features. Project entries in the notebook have attributes like the name of the pattern, the notes taken by the user, the yarns and needles used … Stash entries have attributes like the name of the company producing the yarn, the color family, the user rating, the weight and so on.


Screenshot of the notebook page on Ravelry, with the list of projects made by the user.


Pattern page: includes the designer’s name, the pattern category, the recommended yarns, pictures … (“Sockhead hat” by Kelly McClure)


Yarn stash entry: includes link to yarn in the database, yarn weight, color family … One stash entry can represent several skeins of the same yarn.


Project entry: includes name of the project, link to pattern, needle (or hook for crochet) and yarn, user’s notes … (“Lorenz Manifold” by alicialight)

The R code below configures the OAuth access to the API using the httr library. Your user name, access key, and secret key provided by Ravelry are assumed to be in the user_rav.txt file in the working directory, one per line.


library(httr)  # oauth_app(), oauth_endpoint(), oauth1.0_token(), GET()

# user_rav.txt contains three lines: user name, access key, secret key
credentials <- readLines("user_rav.txt")
names(credentials) <- c("user","access_key","secret_key")

OpenConnection <- function(credentials){
  # Args: login info for the Ravelry API
  # Returns oauth token
  reqURL <- "https://www.ravelry.com/oauth/request_token"
  accessURL <- "https://www.ravelry.com/oauth/access_token"
  authURL <- "https://www.ravelry.com/oauth/authorize"
  ravelry.app <- oauth_app("ravelry", key=credentials[["access_key"]], 
                           secret=credentials[["secret_key"]])
  ravelry.urls <- oauth_endpoint(reqURL, authURL, accessURL)
  return(oauth1.0_token(ravelry.urls, ravelry.app))
}

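# Quick test of API connection by getting connected user info
TestConnection <- function(ravelry.token) {
  # Arg: API token
  # Returns name of the user connected with this token
  test <- GET("https://api.ravelry.com/current_user.json", 
              config=config("token"=ravelry.token))
  # assumes the parsed response exposes the name as user$username
  return(content(test)$user$username)
}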

ravelry.token <- OpenConnection(credentials)
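Before opening the connection, it can help to check that the credentials file has the expected shape. The helper below, read_credentials, is a hypothetical addition rather than part of the original script, and the sketch runs on a temporary file filled with placeholder strings.

```r
# Write a throwaway credentials file with placeholder values
# (hypothetical strings, not real keys)
cred_file <- tempfile(fileext = ".txt")
writeLines(c("my_user", "my_access_key", "my_secret_key"), cred_file)

read_credentials <- function(path) {
  # Hypothetical helper: read the file and fail early if it is malformed
  lines <- trimws(readLines(path))
  lines <- lines[lines != ""]
  if (length(lines) != 3) {
    stop("expected 3 non-empty lines (user, access key, secret key), got ",
         length(lines))
  }
  names(lines) <- c("user", "access_key", "secret_key")
  lines
}

credentials_check <- read_credentials(cred_file)
print(names(credentials_check))
```

Failing early with a clear message is easier to debug than an OAuth error caused by a missing or blank line.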

Once the connection is approved, this command should query and show your user name:

userinfo <- GET("https://api.ravelry.com/current_user.json",
                config=config("token"=ravelry.token))
content(userinfo)

In the next post we will get to the real data!