Tags in popular knitting projects

This second post in the “Knitting data” series looks at how Ravelry users describe their own projects, and more specifically at the tags attached to popular projects, i.e. the projects with the most “favorites”. “Favoriting” lets users bookmark an item and show their appreciation for it.

The following code queries the Ravelry API for the 5000 most favorited projects.

library(httr)
knit_all <- GET("https://api.ravelry.com/projects/search.json?page_size=5000&craft=knitting&sort=favorites", config=config("token"=ravelry.token))
knit <- content(knit_all)

The following function retrieves the tags from a list of projects (the result of the query above), keeping only the projects whose number of favorites is at or above a given threshold. The tags are then grouped into a text corpus for use with the R text mining library tm. (Keeping the tags in a corpus also makes it easy to split them into several groups, for example by popularity threshold.)

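library(tm)

get_tags <- function(projects, fav_threshold){
  # get the tags from the project list
  # use only projects where nbr of favorites >= fav_threshold
  # (plain if/else keeps every tag of a project; ifelse() with a
  # length-one test would return only the first tag)
  tags_threshold <- lapply(projects,
                  function(x) if (x$favorites_count >= fav_threshold) x$tag_names else "")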
  tags <- paste(unlist(tags_threshold), collapse=" ")
  # build corpus from project tags
  corpus_tags <- Corpus(VectorSource(tags))
  names(corpus_tags) <- paste("fav",as.character(fav_threshold),sep="")
  # clean up the tags
  corpus_tags <- tm_map(corpus_tags, removeNumbers) # ignore dates
  corpus_tags <- tm_map(corpus_tags, removePunctuation)
  corpus_tags <- tm_map(corpus_tags, stemDocument)
  return(corpus_tags)
}
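
To illustrate the idea of grouping tags by popularity threshold, here is a minimal sketch in base R. It does not call the API: `toy_projects` is made-up data that only mimics the `favorites_count` and `tag_names` fields used above, and `tags_above` is a hypothetical helper, not part of the code in this post.

```r
# Toy stand-in for knit$projects (hypothetical data, not real Ravelry results)
toy_projects <- list(
  list(favorites_count = 120, tag_names = list("cables", "cardigan")),
  list(favorites_count = 45,  tag_names = list("baby", "hat")),
  list(favorites_count = 300, tag_names = list("cardigan", "baby"))
)

# Collect the tags of all projects at or above a favorites threshold
tags_above <- function(projects, fav_threshold) {
  keep <- Filter(function(x) x$favorites_count >= fav_threshold, projects)
  unlist(lapply(keep, function(x) x$tag_names))
}

# One group of tags per threshold, named like the corpora above ("fav0", ...)
thresholds <- c(0, 100, 250)
tag_groups <- lapply(thresholds, function(t) tags_above(toy_projects, t))
names(tag_groups) <- paste0("fav", thresholds)

print(tag_groups)
```

Each element of `tag_groups` could then be pasted together and turned into its own document in a corpus, one per threshold.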

Now the “alltags” object below is a text corpus containing all the tags used in the 5000 most favorited projects; since the threshold is 0, all projects are included. The tags are stemmed, meaning that word endings are cut off according to fixed rules, so that “cables” and “cabling”, for example, are counted as the same tag.

alltags <- get_tags(knit$projects, 0)
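
To see what stemming does in isolation, here is a small sketch using the SnowballC package directly (as far as I know, this is the stemmer that tm's stemDocument relies on). The word list is just an illustration, not data from the query.

```r
library(SnowballC)

# A few example words (illustrative only)
words <- c("cables", "cabling", "cardigan")
stems <- wordStem(words, language = "english")

# Show each word next to its stem
print(data.frame(word = words, stem = stems))

# "cables" and "cabling" end up with the same stem
print(stems[1] == stems[2])
```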

The following function collects the frequency of each tag into a data frame, which makes it easier to map the stemmed words back to their original words.

tags_frequencies <- function(corpus_tags){
  # return tags and their frequencies in each document of corpus
  tdm <- TermDocumentMatrix(corpus_tags)
  tdmframe <- as.data.frame(as.matrix(tdm) )
  names(tdmframe) <- names(corpus_tags)
  tdmframe$tags <- tdm$dimnames$Terms
  return(tdmframe)
}
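
For readers who have not used a term-document matrix before, this sketch shows the shape of the resulting data frame on a tiny made-up corpus: one row per tag, one column per document. The two toy documents and the column names `fav0` and `fav100` are assumptions for illustration only.

```r
library(tm)

# Two toy "documents" of already-stemmed tags (made-up data)
toy_corpus <- VCorpus(VectorSource(c("cabl cardigan cabl babi",
                                     "cardigan hat")))

# Rows are terms, columns are documents, cells are counts
tdm <- TermDocumentMatrix(toy_corpus)
freq <- as.data.frame(as.matrix(tdm))
names(freq) <- c("fav0", "fav100")
freq$tags <- rownames(freq)

print(freq)
```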

Now we can apply this to our tag corpus to make a word cloud and a bar chart showing the most frequently used tags.

The stemmed tags are mapped back to a readable original word: “cable” is the simplest origin of the stem “cabl”, while “cardigan” is not changed by stemming. There should be a more elegant way to do this (probably using WordNet) than the simple table lookup below, which I hope to get to in a later post!

all_freq <- tags_frequencies(alltags)
# keep only the 30 most frequent tags over the whole dataset
all_freq <- arrange(all_freq, desc(fav0))
all_freq <- all_freq[1:30,]
# replace stemmed word by a readable origin word
correction <- c(babi="baby",cabl="cable",pullov="pullover",bulki="bulky",contigu="contiguous",finger="fingering")
realword <- function(x) ifelse(x %in% names(correction), correction[x],x)
all_freq$tags <- sapply(all_freq$tags, realword)
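
One possible alternative to a hand-written lookup table (not the WordNet approach, just a simpler idea) is to keep the unstemmed tags around and, for each stem, pick the original word that occurs most often. The sketch below assumes a vector of raw tags; `raw_tags` here is made-up data, and in practice it would come from the unstemmed `tag_names`.

```r
library(SnowballC)

# Made-up unstemmed tags standing in for the real tag list
raw_tags <- c("cables", "cable", "cable", "baby", "babies", "baby", "cardigan")
stems <- wordStem(raw_tags, language = "english")

# For each stem, keep the most frequent original word
origin <- sapply(split(raw_tags, stems),
                 function(w) names(which.max(table(w))))

print(origin)
```

The result is a named vector that can be used like `correction` above, without having to type the mapping by hand.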

wordcloud(words=all_freq$tags, freq=all_freq$fav0,colors=brewer.pal(8, "Dark2"),rot.per=0)

Word cloud of the most frequent tags in popular projects

And the bar chart and its code:

Chart of the most frequent tags in popular projects (same data as the word cloud)

ggplot()+
  geom_bar(aes(x=seq_along(all_freq$tags), y = all_freq$fav0), stat="identity", fill='Gold')+
  geom_text(aes(x=seq_along(all_freq$tags),
                y=all_freq$fav0,
                label=all_freq$tags),
            hjust=1.01,
            position=position_dodge(width=0.9) )+
  scale_x_reverse()+
  xlab("tag")+
  ylab("tag frequency")+
  theme(axis.ticks.y=element_blank())+
  theme(axis.text.y=element_blank())+
  coord_flip()

Looks like cabled cardigans for babies are your best bet for knitterly fame!
