Knitting pattern categories

Let’s look at how Ravelry patterns are distributed among the various categories. The categories form a tree. The number of patterns in each category is not accessible via the API, but it is visible on the website’s main page.

The code below is not the most exciting, but it gives a more accurate picture of the breakdown of knitting patterns in August 2015 than could be estimated from an API call on the database. I counted only the patterns that have at least one picture, to avoid potential bad entries (duplicates, drafts …). Some of the categories are aggregated to keep the tree plot simple: the entries whose names start with “All” are aggregates that actually contain subcategories.

pattern_nbr_tree <- list("Clothing"=list("Coat/Jacket"=10707,
                                         "Dress"=5232,
                                         "Intimate Apparel"=list("Bra"=35,
                                                                 "Pasties"=10,
                                                                 "Underwear"=152,
                                                                 "Other"=69),
                                         "Leggings"=332,
                                         "Onesies"=905,
                                         "Pants"=1107,
                                         "Robe"=104,
                                         "Shorts"=330,
                                         "Shrug/Bolero"=3939,
                                         "Skirt"=1660,
                                         "Sleepwear"=354,
                                         "Soakers"=470,
                                         "Sweater"=list("Cardigan"=31580,
                                                        "Pullover"=41795,
                                                        "Other"=1088),
                                         "Swimwear"=135,
                                         "Tops"=list("Sleeveless Top"=7195,
                                                     "Strapless Top"=130,
                                                     "Tee"=3604,
                                                     "Other"=550),
                                         "Vest"=9256,
                                         "Other"=767),
                         "Accessories"=list("All Bag"=8883,
                                            "Belt"=271,
                                            "Feet/Legs"=list("Booties"=3501,
                                                             "Legwarmers"=1801,
                                                             "Slippers"=2188,
                                                             "All Socks"=23631,
                                                             "Spats"=89,
                                                             "Other"=540),
                                            "All Hands"=23733,
                                            "All Hat"=45386,
                                            "All Jewelry"=1642,
                                            "Neck/Torso"=list("Bib"=364,
                                                              "Cape"=1591,
                                                              "Collar"=921,
                                                              "Cowl"=17798,
                                                              "Necktie"=362,
                                                              "Poncho"=2604,
                                                              "Scarf"=26600,
                                                              "Shawl/Wrap"=26210,
                                                              "Other"=691),
                                            "All Other Headwear"=3989,
                                            "Other"=1277),
                         "All Home"=36478,
                         "All Toys and Hobbies"=21023,
                         "All Pet"=1551,
                         "All Components"=7073)

This plot shows the number of patterns in each category as the area of the corresponding rectangle.

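library(reshape2)  # melt() for nested lists (reshape2 assumed)
library(treemap)   # treemap()

# flatten the nested list: one row per leaf, with its count and its category levels
pattern_nbr_tree <- melt(pattern_nbr_tree)
names(pattern_nbr_tree) <- c("patterns_count","cat3","cat2","cat1")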

custom_palette <- c("#FAF0E6","#7FFFD4","#008080","#FF6347","#FA8072","#B0C4DE")
treemap(pattern_nbr_tree, index=c("cat1","cat2","cat3"), vSize="patterns_count",
        title="Number of knitting patterns per category", palette=custom_palette,
        border.col="White", border.lwds=c(6,3,0.5),
        bg.labels=0, fontsize.labels=c(14,13,11),
        align.labels=list(c("left","top"),c("center","center"),c("center","center")))

Number of patterns in each category.

Accessories dominate, probably because they take less time to make than clothing items, and there is less risk of a poor fit!
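
The comparison between top-level categories can be checked by summing the leaves of each branch of the nested list (before melt() flattens it). The sketch below uses only a small excerpt of the counts listed above, so its totals are illustrative; the name tree_excerpt is made up for this example.

```r
# Small excerpt of the nested list (numbers copied from the full list above)
tree_excerpt <- list(
  "Clothing" = list("Dress" = 5232,
                    "Sweater" = list("Cardigan" = 31580, "Pullover" = 41795)),
  "Accessories" = list("All Hat" = 45386,
                       "Neck/Torso" = list("Cowl" = 17798, "Scarf" = 26600)),
  "All Pet" = 1551
)

# unlist() flattens a branch whatever its depth, so sum() adds up all its leaves
category_totals <- sapply(tree_excerpt, function(branch) sum(unlist(branch)))
print(category_totals)
```

Running the same sapply() call on the full list gives the totals behind the areas of the top-level rectangles.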

Tags in popular knitting projects

This second post in the “Knitting data” series looks at the text Ravelry users use to describe their own projects, and more specifically the tags attached to popular projects, i.e. the projects with the most “favorites”. “Favoriting” is a way for users to bookmark an item and show their appreciation for it.

The following code queries the Ravelry API for the 5000 most favorited projects.

library(httr)
# ravelry.token is the OAuth token created in the API connection post
knit_all <- GET("https://api.ravelry.com/projects/search.json?page_size=5000&craft=knitting&sort=favorites", config=config("token"=ravelry.token))
knit <- content(knit_all)

The following function retrieves the tags from the projects in a list (the result of the query above) whose number of favorites is at or above a given threshold. The tags are then grouped into a text corpus for use with the R text mining library tm. (The corpus form is also convenient for splitting the tags into several groups, by popularity threshold for example.)

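library(tm)  # Corpus, tm_map; stemDocument additionally needs the SnowballC package

get_tags <- function(projects, fav_threshold){
  # get the tags from the project list
  # use only projects where nbr of favorites >= fav_threshold
  # (if/else rather than ifelse(), which would keep only the first tag of each project)
  tags_threshold <- lapply(projects,
                  function(x) if (x$favorites_count>=fav_threshold) unlist(x$tag_names) else character(0))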
  tags <- paste(unlist(tags_threshold), collapse=" ")
  # build corpus from project tags
  corpus_tags <- Corpus(VectorSource(tags))
  names(corpus_tags) <- paste("fav",as.character(fav_threshold),sep="")
  # clean up the tags
  corpus_tags <- tm_map(corpus_tags, removeNumbers) # ignore dates
  corpus_tags <- tm_map(corpus_tags, removePunctuation)
  corpus_tags <- tm_map(corpus_tags, stemDocument)
  return(corpus_tags)
}
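
The thresholding step can be tried on its own, without the API or tm. This is a minimal sketch on made-up projects: mock_projects and keep_tags are hypothetical names, and the fields favorites_count and tag_names are assumed to look like the ones used in get_tags. It uses a plain if/else so that every tag of a qualifying project is kept.

```r
# Hypothetical projects mimicking the two fields used by get_tags()
mock_projects <- list(
  list(favorites_count = 120, tag_names = list("cables", "cardigan")),
  list(favorites_count = 15,  tag_names = list("baby", "hat")),
  list(favorites_count = 300, tag_names = list("lace"))
)

keep_tags <- function(projects, fav_threshold) {
  # Keep all tags of a project when it reaches the threshold, none otherwise
  tags <- lapply(projects, function(p) {
    if (p$favorites_count >= fav_threshold) unlist(p$tag_names) else character(0)
  })
  # One space-separated string, ready to be wrapped in a corpus
  paste(unlist(tags), collapse = " ")
}

print(keep_tags(mock_projects, 100))
print(keep_tags(mock_projects, 0))
```

With a threshold of 100 only the first and third mock projects contribute; with a threshold of 0 all of them do.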

The “alltags” object below is a text corpus with all the tags used in the 5000 most favorited projects; since the threshold is 0, all projects are included. The tags are stemmed, meaning that word endings are cut according to fixed rules, so that “cables” and “cabling”, for example, are counted as the same tag.

alltags <- get_tags(knit$projects, 0)

The following function puts the frequency of each tag into a data frame (which makes it easier to map the stemmed words back to their original words).

tags_frequencies <- function(corpus_tags){
  # return tags and their frequencies in each document of corpus
  tdm <- TermDocumentMatrix(corpus_tags)
  tdmframe <- as.data.frame(as.matrix(tdm) )
  names(tdmframe) <- names(corpus_tags)
  tdmframe$tags <- tdm$dimnames$Terms
  return(tdmframe)
}
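
For a single document, the counts can be cross-checked with base R alone. The sketch below is a stand-in using table(), not what TermDocumentMatrix does internally (by default tm also drops terms shorter than three characters, which this version does not). The tag string is made up, and the column name fav0 is chosen to match the corpus name used in this post.

```r
# Made-up, already stemmed tag string standing in for one corpus document
tags <- "cabl cardigan babi cabl hat cardigan cabl"

# Split on spaces, count each distinct tag, most frequent first
tag_counts <- sort(table(strsplit(tags, " ", fixed = TRUE)[[1]]), decreasing = TRUE)

# Same shape as the data frame returned by tags_frequencies()
freq_frame <- data.frame(fav0 = as.vector(tag_counts),
                         tags = names(tag_counts),
                         stringsAsFactors = FALSE)
print(freq_frame)
```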

Now we can apply this to our tag corpus to make a word cloud and a bar chart showing the most frequently used tags.

The tags are mapped back to a readable original word: “cable” is the simplest origin of the stem “cabl”; “cardigan”, on the other hand, is not changed by stemming. There should be a more elegant way (probably using WordNet) to do this than the simple table lookup below, which I hope to get to in a later post!
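
The lookup idea can be tried on a toy vector of stems first. This is a vectorized variant written for illustration (the names stems, known and readable are made up); it relies only on indexing a named character vector, and leaves untouched any stem that has no entry in the table.

```r
# Lookup table: stem -> readable word (a subset of the table used in this post)
correction <- c(babi = "baby", cabl = "cable", pullov = "pullover")

stems <- c("cabl", "cardigan", "babi")

# Which stems have an entry in the lookup table?
known <- stems %in% names(correction)

# Replace only those; unname() drops the names carried over by the lookup
readable <- stems
readable[known] <- unname(correction[stems[known]])
print(readable)
```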

library(dplyr)  # arrange() and desc() (dplyr assumed; plyr provides them too)
all_freq <- tags_frequencies(alltags)
# keep only most frequent tags on all dataset
all_freq <- arrange(all_freq, desc(fav0))
all_freq <- all_freq[1:30,]
# replace stemmed word by a readable origin word
correction <- c(babi="baby",cabl="cable",pullov="pullover",bulki="bulky",contigu="contiguous",finger="fingering")
realword <- function(x) ifelse(x %in% names(correction), correction[x],x)
all_freq$tags <- sapply(all_freq$tags, realword)

library(wordcloud)  # also attaches RColorBrewer, needed for brewer.pal()
wordcloud(words=all_freq$tags, freq=all_freq$fav0,colors=brewer.pal(8, "Dark2"),rot.per=0)

Word cloud of the most frequent tags in popular projects

And the bar chart and its code:

Chart of the most frequent tags in popular projects (same data as the word cloud)

library(ggplot2)
ggplot()+
  geom_bar(aes(x=seq_along(all_freq$tags), y = all_freq$fav0), stat="identity", fill='Gold')+
  geom_text(aes(x=seq_along(all_freq$tags),
                y=all_freq$fav0,
                label=all_freq$tags),
            hjust=1.01,
            position=position_dodge(width=0.9) )+
  scale_x_reverse()+
  xlab("tag")+
  ylab("tag frequency")+
  theme(axis.ticks.y=element_blank())+
  theme(axis.text.y=element_blank())+
  coord_flip()

Looks like cabled cardigans for babies are your best bet for knitterly fame!

About Ravelry – API connection

This post is the first of a series investigating the data found on ravelry.com using the R programming language and the Ravelry API. Ravelry is a website functioning as a database, organizational tool, and social network for knitting and fiber craft enthusiasts.

A lot of user-created content is accessible via the Ravelry API. Ravelry hosts a collection of about 300,000 knitting patterns; there are over 9 million knitting project pages created by the 5 million users, with about 7,000 projects and 65,000 forum posts added per day (more Ravelry facts from 2014 here).

The organizational tool is the most relevant for data mining, as it holds a lot of information about patterns, projects, and tools. Each user has a public notebook consisting of their craft projects, their yarn stash, their “favorites” (bookmarks) and other features. Project entries in the notebook have attributes like the name of the pattern, the notes taken by the user, the yarns and needles used … Stash entries have attributes like the name of the company producing the yarn, the color family, the user rating, the weight and so on.

Screenshot of the notebook page on Ravelry, with the list of projects made by the user.

Pattern page: includes the designer’s name, the pattern category, the recommended yarns, pictures … (“Sockhead hat” by Kelly McClure)

Yarn stash entry: includes link to yarn in the database, yarn weight, color family … One stash entry can represent several skeins of the same yarn.

Project entry: includes name of the project, link to pattern, needle (or hook for crochet) and yarn, user’s notes … (“Lorenz Manifold” by alicialight)

The R code below configures OAuth access to the API using the httr library. Your user name, access key and secret key provided by Ravelry are assumed to be in the user_rav.txt file in the working directory, one per line.

library(httr)

# user_rav.txt contains three lines: user name, access key, secret key
credentials <- readLines("user_rav.txt")
names(credentials) <- c("user","access_key","secret_key")

OpenConnection <- function(credentials){
  # Args: named vector holding the access key and secret key for the Ravelry API
  # Returns: OAuth token for the opened connection
  reqURL <- "https://www.ravelry.com/oauth/request_token"
  accessURL <- "https://www.ravelry.com/oauth/access_token"
  authURL <- "https://www.ravelry.com/oauth/authorize"
  
  ravelry.app <- oauth_app("ravelry", key=credentials["access_key"], 
                           secret=credentials["secret_key"])
  ravelry.urls <- oauth_endpoint(reqURL, authURL, accessURL)
  
  return(oauth1.0_token(ravelry.urls, ravelry.app))
}

# Quick test of API connection by getting connected user info
TestConnection <- function(ravelry.token) {
  # Arg: API token
  # Returns name of the user connected with this token
  test <- GET("https://api.ravelry.com/current_user.json", 
              config=config("token"=ravelry.token)) 
  print(content(test)$user$username)
}

ravelry.token <- OpenConnection(credentials)
TestConnection(ravelry.token)
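
The layout expected for user_rav.txt can be sketched with a throwaway file. The three-line order (user name, access key, secret key) is an assumption inferred from the names() call above, and the values are placeholders, not real keys.

```r
# Write a temporary credentials file with placeholder values
cred_file <- tempfile(fileext = ".txt")
writeLines(c("my_user_name", "MY_ACCESS_KEY", "MY_SECRET_KEY"), cred_file)

# Read the lines and fail early if the file does not hold three non-empty lines
credentials <- readLines(cred_file)
stopifnot(length(credentials) == 3, all(nzchar(credentials)))

# Name the entries, as in the code above
names(credentials) <- c("user", "access_key", "secret_key")

# [[ ]] extracts the bare string, without the name attached
print(credentials[["access_key"]])

# Remove the temporary file
unlink(cred_file)
```

Checking the file before opening the connection makes a malformed credentials file easier to spot than a failed OAuth handshake would.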

Once the connection is approved, this command should query and show your user name:

userinfo <- GET("https://api.ravelry.com/current_user.json",
              config=config("token"=ravelry.token))
print(content(userinfo)$user$username)

In the next post we will get to the real data!