Struggle with Harry Potter Data

2018-04-09

rstats javascript

Notes on creating the Harry Potter Books Survey. It is not over yet, and I need your help.

Prologue

I am currently in the final stage of developing two packages devoted to the results of abstract competitions (they are not quite ready yet, so use with caution):

comperes - infrastructure package for dealing with different formats of competition results and summarising them.
comperank - package which implements some rating and ranking algorithms. It is inspired and driven by the insightful book Who’s #1?: The Science of Rating and Ranking by Amy N. Langville and Carl D. Meyer.

The notion of competition here is quite general: it is a set of games (abstract events) in which players (abstract entities) gain some abstract scores (typically numeric).

Both packages use an example data set called ncaa2005, which contains the results of an isolated group of Atlantic Coast Conference teams provided in “Who’s #1?”. To demonstrate the abstract nature of the term “competition”, I decided to find another data set, not from the world of sports. I thought it would be easy…

Overview

This post describes my struggle with, and the process of, creating Harry Potter Data: the results of a “competition” between the seven original Harry Potter books. Long story short, I didn’t find suitable existing data and decided to run my own survey. The struggle hasn’t ended yet and I need your help. Please take part in the survey at this link. It is a fairly standard Google Form with two questions: the first one simulates picking a random subset of books (more on that later); the second offers rating choices. And that’s it. Please note the following:

It is assumed that you’ve read all seven original J. K. Rowling Harry Potter books and are willing to give honest feedback about your impressions.
In order to answer you have to sign in to your Google account. This is done to ensure that one person participates in the survey only once. Your personal data won’t become public because I will not track it.

This post has the following structure:

Decision about the source describes how I ended up with conducting my own survey.
Survey design has my thoughts about how I approached creating the survey.
Implementation describes technical details about creating a survey using Google Forms and JavaScript.

Decision about the source

After games of all sorts, people rating a common set of items is probably the most common example of a competition. Here the items can be considered “players”, and a “game” is one person reviewing a set of items by giving each a numerical “score” (stars, points, etc.). To build this data set I need data in the following format (called “long” in the terminology of the aforementioned packages):

Person identifier (anonymized). It will serve as game identifier.
Item identifier which this person rated.
Numeric score of item’s rating by person.

One person can rate multiple items, but each item only once, to avoid bias in the output.
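
As an illustration of the “long” format described above, here is a minimal sketch in JavaScript. The field names (`person`, `item`, `score`), the item identifiers and the ratings themselves are made up for the example; they are not taken from the packages or from real survey data.

```javascript
// Hypothetical ratings in "long" format: one row per (person, item) pair.
// Person identifiers double as game identifiers.
var ratings = [
  { person: 'p1', item: 'HP_1', score: 5 },
  { person: 'p1', item: 'HP_3', score: 4 },
  { person: 'p2', item: 'HP_1', score: 3 },
  { person: 'p2', item: 'HP_7', score: 5 }
];

// Return true if no person rated the same item more than once
function hasUniqueRatings(rows) {
  var seen = {};
  for (var i = 0; i < rows.length; i++) {
    var key = rows[i].person + '|' + rows[i].item;
    if (seen[key]) {
      return false;
    }
    seen[key] = true;
  }
  return true;
}
```

A check like `hasUniqueRatings(ratings)` is one simple way to verify the “only once” requirement before passing the data on for analysis.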

After some thought I picked the seven original J. K. Rowling “Harry Potter” books as items. Besides the books being interesting and popular, I figured that it wouldn’t be hard to find a data set with the information I need. That turned out not to be quite right, because none of the available data sets seems to have a suitable license:

There are ratings on Amazon, but I didn’t find an easy way to even get the data.
My second candidate, the Goodreads site, has an API which provides exactly the data I need. However, their Developer Terms of Service don’t look “R package with MIT license” friendly.
There is also Kaggle’s goodbooks-10k data set. This comes as close to my needs as I could find. Nevertheless, it is still Goodreads data, so I am not sure about it.

All this led me to conducting my own survey. The good news is that this way I am the full owner of the data set with no license issues. The bad news is that I have to conduct the survey…

Survey design

The freedom in creating your own survey is overwhelming. Basically, you can do whatever you like and hope to obtain any data you want. In grim reality, there are a couple of things to consider:

Survey should collect data that will actually help to fulfill its goals. It should be designed in a way that minimizes the chance of misinterpretation by respondents.
Survey should be implemented and conducted in a form that is as accessible to people as possible. This is needed to collect more data with less effort.

The goal of my Harry Potter Survey can be formulated as follows: collect enough data to rank the Harry Potter books from worst to best. The most common way to do that is to mimic the rating process of various online marketplaces. In more scientific language, this means asking a question using a Likert scale. That is, the answer to a question should be chosen from an ordinal scale. Very helpful advice can be found in this SurveyMonkey article. The most common Likert scales can be found here.

Unfortunately, I only decided to read about how to conduct a survey after I had already started constructing my own. Its core is a question that asks respondents to associate Harry Potter books with numerical ratings. The evolution of this question was as follows:

Stage 1.
- Question: ‘Rate these BOOKS (not movies) from 1 (“very poor book”) to 7 (“exceptional book”)’. As you can see, this isn’t actually a question because at first I planned to give a task. I was also trying to strongly emphasize that the books, not the movies, are of interest.
- Scale: numeric, from 1 to 7. This seemed logical as numeric scores will actually be used in the analysis.
Stage 2. After reading the SurveyMonkey article and other sources, I made a drastic transformation. I decided to actually ask a question about satisfaction after reading the book. This makes it a personal and, in my opinion, clear question.
- Question: ‘How did you like these BOOKS?’. One thing to consider here is whether it is better to use past or present tense. At this stage I decided to go with past tense because [in my opinion] it refers to a fixed moment in time, “just after reading a book”. Unfortunately, that wasn’t the case.
- Scale: combined, from 1 to 5. It is a fairly standard bipolar “Satisfaction” scale: “1 - Very dissatisfied”; “2 - Dissatisfied”; “3 - Neutral”; “4 - Satisfied”; “5 - Very satisfied”. I decided to move to a 5-point scale as it is more common and should provide more reliable data. Its downside is smaller variance. I also preserved the numbers as an extra indication of the implied linear scale.
Stage 3. After some thought and practical testing I decided not to reinvent the wheel and to stick with the more common “Quality” scale. This has the advantage of being more or less standard, which should provide more robust data.
- Question: ‘What is your impression of these Harry Potter BOOKS?’. I added an explicit mention of Harry Potter to be able to shorten the books’ names. I changed to present tense because I had mixed feedback about the previous question and which moment in the past it referred to. Of course, I could add an explicit reference, but it might overcomplicate the question. Also, a question in present tense should be easier to answer.
- Scale: combined, from 1 to 5. It is fairly standard unipolar “Quality” scale: “1 - Poor”, “2 - Fair”, “3 - Good”, “4 - Very Good”, “5 - Excellent”.
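
Since the scale labels keep their leading numbers, turning a chosen label back into a numeric score for analysis can be straightforward. The following is only a sketch of one possible way to do it; `labelToScore` is a hypothetical helper introduced for illustration, not part of the survey code.

```javascript
// Scale labels from the final version of the question
var likertScale = ['1 - Poor', '2 - Fair', '3 - Good',
                   '4 - Very Good', '5 - Excellent'];

// Hypothetical helper: extract the leading number from a scale label
function labelToScore(label) {
  var score = parseInt(label, 10);
  if (isNaN(score)) {
    throw new Error('Label has no leading number: ' + label);
  }
  return score;
}

// Numeric scores in the same order as the labels
var scores = likertScale.map(labelToScore);
```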

After designing the basic question, there are a couple of other things to consider:

Item names should be understandable. With seven Harry Potter books it might be confusing if they are presented only by title. So I decided to add each book’s number in the series after its title. Also, spelling out “Harry Potter” in every title seems to overcomplicate the survey, as it doesn’t add necessary information, so I decided to shorten it to “HP”. The resulting books are named “HP and the Philosopher’s (Sorcerer’s) Stone (#1)”, “HP and the Chamber of Secrets (#2)”, “HP and the Prisoner of Azkaban (#3)”, “HP and the Goblet of Fire (#4)”, “HP and the Order of the Phoenix (#5)”, “HP and the Half-Blood Prince (#6)”, “HP and the Deathly Hallows (#7)”.
Actual set of items can affect the outcome. For example, if a person’s favourite book is present in the list, they might anchor their other ratings to this book. This can be solved by randomizing the set of books a respondent is asked to rate.
Actual order of items can affect the outcome. The reasoning is similar to the previous note. This can be solved by randomizing the order in which the books are presented.

So here is the final design of the survey. A respondent is asked the question “What is your impression of these Harry Potter BOOKS?” and presented with a random subset (in random order) of the names of the seven Harry Potter books (listed above), which should be rated on a fairly standard Likert “Quality” scale (with the numeric scores shown).

As for the desired number of respondents, I think that reaching 100 will produce a fairly usable data set. But the more the better.

After the exhausting process of survey design I hoped that the implementation would be easy. Again, I wasn’t quite right…

Implementation

The main obstacle in implementing the intended survey is the randomization of presented items. I also had to keep in mind that the answering process should be as easy as possible and that one person should be able to answer only once.

After some Internet surfing, it seemed that the best way of conducting the survey is with Google Forms. It offers an option to let each person participate only once, with a downside: respondents must have a Google account and be logged into it. This can scare off potential respondents. However, Google Forms has an option to not track user data, which I gladly used. It also has a feature to randomly shuffle the order of the items used in a question, which is very helpful.

The biggest trouble with Google Forms is that it can’t randomly generate questions. I decided to work around this problem in the following way:

Create many variants of the question, one for every possible subset of books. There is a total of 2^7 - 1 = 127 non-empty subsets of 7 books. Items in every question should be shuffled.
Create a dummy question (to be put first) which has a list of numbers - pointers to subsets of books. This list will be randomly shuffled for every respondent. Picking the first item from the list simulates generating a random subset of books.
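
The two steps above can be simulated outside of Google Forms. The sketch below is an illustration under stated assumptions, not the survey's actual mechanism: it assumes that pointer `p` corresponds to the subset whose bitmask is `p` (which matches how the `generatePowerSet` function later in this post enumerates subsets), and it stands in for the form's shuffling with a Fisher-Yates shuffle. The names `shuffle` and `decodePointer` are hypothetical helpers.

```javascript
// Fisher-Yates shuffle returning a shuffled copy.
// `random` is an optional replacement for Math.random (useful for testing).
function shuffle(array, random) {
  random = random || Math.random;
  var result = array.slice();
  for (var i = result.length - 1; i > 0; i--) {
    var j = Math.floor(random() * (i + 1));
    var tmp = result[i];
    result[i] = result[j];
    result[j] = tmp;
  }
  return result;
}

// Decode a pointer into book numbers: bit j set means book j + 1 is included
function decodePointer(pointer, nBooks) {
  var subset = [];
  for (var j = 0; j < nBooks; j++) {
    if (pointer & (1 << j)) {
      subset.push(j + 1);
    }
  }
  return subset;
}

var nBooks = 7;
var nSubsets = (1 << nBooks) - 1; // 2^7 - 1 = 127 non-empty subsets

// The dummy question: a list of pointers 1..127
var pointers = [];
for (var p = 1; p <= nSubsets; p++) {
  pointers.push(p);
}

// "Choose first listed number" after shuffling simulates a random subset
var picked = shuffle(pointers)[0];
var pickedBooks = decodePointer(picked, nBooks);
```

Because every pointer is equally likely to end up first, every non-empty subset is picked with probability 1/127, provided the respondent really does choose the first listed number.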

All this can be done manually. And I’ve actually done that… However, after deciding to change the question and scale (the move from “Stage 1” to “Stage 2” in the question's evolution), I realized that it would be better to program the Form creation. As it turns out, this can be done with Google Apps Script, which accepts JavaScript code. After learning the language basics, and with great support from the Internet, I came up with this solution:

// Function to generate all non-empty subsets of array
function generatePowerSet(array) {
  var result = [];
  
  for (var i = 1; i < (1 << array.length); i++) {
    var subset = [];
    for (var j = 0; j < array.length; j++) {
      // Bit j of mask i decides whether element j is in the subset
      if (i & (1 << j)) {
        subset.push(array[j]);
      }
    }
    
    result.push(subset);
  }
  
  return result;
}

// Function to create target survey
function createHPSurvey() {
  var form = FormApp.create('Harry Potter Books Survey')
                    .setAllowResponseEdits(false)
                    .setCollectEmail(false)
                    .setLimitOneResponsePerUser(true);
   
  // Add select list
  var selectList = form.addListItem()
                       .setTitle('Choose first listed number')
                       .setHelpText('This simulates random subsetting of books.')
                       .setRequired(true);
  
  // Initialize main questions data
  var questionSingular = 'What is your impression of this Harry Potter BOOK?';
  var questionPlural   = 'What is your impression of these Harry Potter BOOKS?';
  var likertScale = ['1 - Poor', '2 - Fair', '3 - Good',
                     '4 - Very Good', '5 - Excellent'];
  
  var books = ["HP and the Philosopher's (Sorcerer's) Stone (#1)",
               "HP and the Chamber of Secrets (#2)",
               "HP and the Prisoner of Azkaban (#3)",
               "HP and the Goblet of Fire (#4)",
               "HP and the Order of the Phoenix (#5)",
               "HP and the Half-Blood Prince (#6)",
               "HP and the Deathly Hallows (#7)"];
  
  var allSubsets = generatePowerSet(books);
  
  // Create pages with all subsets
  var pages = []; // for collecting the choices in the list item
  for (var n = 0; n < allSubsets.length; n++) {
    
    // Make a section for current subset
    var newPage = form.addPageBreakItem()
                      .setTitle('Rate books');
    
    // Set the section to submit after completing (rather than next subset section)
    newPage.setGoToPage(FormApp.PageNavigationType.SUBMIT);
    
    // Add question for current subset with scale
    var question = form.addGridItem()
                       .setRows(allSubsets[n])
                       .setColumns(likertScale)
                       .setRequired(true);
    if (allSubsets[n].length == 1) {
      question.setTitle(questionSingular);
    } else {
      question.setTitle(questionPlural);
    }
    
    // Push our choice to the list select
    pages.push(selectList.createChoice(n + 1, newPage));
  }
  
  // Add all subsets to select list 
  selectList.setChoices(pages);
}

This code should be run in a Google Apps Script project. It creates a Google Form named “Harry Potter Books Survey” and stores it on Google Drive.

Unfortunately, I didn’t find an option for adding shuffling programmatically for every question, so I did it manually… one by one. I thought for a while about creating questions not only for every subset but also for every permutation of it. However, this isn’t really the same way of randomizing, because subsets with more permutations would be chosen more frequently at the first step.
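
To make the last point concrete, here is a small counting sketch. It assumes the first step picks uniformly among all questions, and compares one question per subset with one question per ordered subset (permutation). The helper names are hypothetical.

```javascript
// n! for small non-negative integers
function factorial(n) {
  var result = 1;
  for (var i = 2; i <= n; i++) {
    result *= i;
  }
  return result;
}

// Number of k-element subsets of an n-element set
function choose(n, k) {
  return factorial(n) / (factorial(k) * factorial(n - k));
}

var nBooks = 7;
var totalSubsets = 0;   // questions if there is one per subset
var totalOrderings = 0; // questions if there is one per ordered subset
var bySize = [];

for (var k = 1; k <= nBooks; k++) {
  var subsets = choose(nBooks, k);
  var orderings = subsets * factorial(k);
  totalSubsets += subsets;
  totalOrderings += orderings;
  bySize.push({ size: k, subsets: subsets, orderings: orderings });
}

// Chance that a uniformly picked question shows k books, under each scheme
bySize.forEach(function (row) {
  row.probSubsetScheme = row.subsets / totalSubsets;
  row.probOrderingScheme = row.orderings / totalOrderings;
});
```

With 7 books this gives 127 subset questions versus 13699 permutation questions. Under the permutation scheme the full set of seven books would be shown with probability 5040 / 13699 (about 37%) instead of 1 / 127 (about 0.8%), while single-book questions would almost never appear.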

Very helpful sources I used to implement this:

Basic code for “branching” form
Official Google Apps Script Reference for Forms
Generating all subsets in JavaScript (modified to not include empty subset)

Conclusions

Creating your own data set is hard. But very rewarding.
Designing a survey is hard. But very insightful.
Implementing a dynamic survey in Google Forms is hard. But very JavaScript.
I still need your help. Please, take part in the survey at this link.
