Like a lot of people, I keep a list of books I want to read. And because there are a great many more books that interest me than I can possibly read in my lifetime, this list has become quite long.
In the olden days of brick-and-mortar bookstores and libraries, I would discover books to read by browsing shelves and picking up what looked interesting at the time. I might even find something that I knew was on my list. “Oh, I’ve been meaning to read that!”
The Internet changes this dynamic dramatically. It makes it much easier for me to discover books that interest me, and also to access any book that I might want to read, instantly, anywhere. At any given time, I have a couple of books that I’m “currently reading”, and when I finish one I can start another immediately. I use Goodreads to manage my to-read list, and it’s easy for me to scroll through the list and pick out my next book.
But again, this list is very long. So I wanted a good way to filter out books I will really never read, and sort it such that the most “important” books in some sense show up first. Then every time I need a new book I could take the first one from the list and make a binary decision: either “I will read this right now”, or “I am never reading this”. In the latter case, if a book interests me enough at a later time, I’m sure it will find its way back onto my list.
The problem then is to find a good metric by which to rank books. Goodreads lets users rate books with one to five stars, and presents an average rating by which you can sort the list. The problem is that a lot of books that interest me have only a single rating, and it’s 5 stars, giving the book an “average” of 5.0. So if I go with that method I will be perpetually reading obscure books that one other person has read and loved. This is not necessarily a bad thing, but I do want to branch out a bit.
Another possibility is to use the number of ratings to calculate a confidence interval for the average rating. For example, using the Wilson score I could find an upper and a lower bound (higher and lower than the average rating, respectively) that would let me say “I am 95% sure that any random sample of readers of an equal size would give an average rating between these two bounds.” I could then sort the list by the lower bound.
But this method is unsatisfactory for a number of reasons. First, it’s not clear how to fit star ratings to such a measure. If we do the naive thing and count a 1-star rating as 1/5 and a 5-star rating as 5/5, we treat a 1-star rating as a “partial success” in some sense. We could instead count 1 star as 0, and 2, 3, 4, and 5 stars as 25%, 50%, 75%, and 100%, respectively.
But even if we did make it fit somehow, it turns out that if you take any moderately popular book on Goodreads at random, it will have an average rating somewhere close to 4. I could manufacture a prior based on this knowledge and use that instead of the normal distribution or the Jeffreys prior in the confidence interval, but that would still not be a very good ranking because reader review metascores are meaningless.
In the article “Reader review metascores are meaningless”, Stephanie Shun suggests using the percentage of 5-star ratings as the relevant metric rather than the average rating. This is a good suggestion, since even a single 5-star rating carries a lot of actionable information whereas an average rating close to 4.0 carries very little.
I can then use the Wilson score directly, counting a 5-star rating as a successful trial and any other rating as a failed one, and just use the normal distribution instead of an artisanally curated prior.
Mathematica makes it easy to generate the Wilson score. Here, `pos` is the number of positive trials (the number of 5-star ratings), `n` is the total number of ratings, and `confidence` is the desired confidence percentage. I’m taking the lower bound of the confidence interval to get my score.
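The post’s own code is Mathematica; as an illustrative sketch of the same computation, here is the Wilson lower bound in Python (the function name is mine, the parameters mirror the description above):

```python
from math import sqrt
from statistics import NormalDist

def wilson_lower(pos, n, confidence):
    """Lower bound of the Wilson score interval.

    pos        -- number of positive trials (here, 5-star ratings)
    n          -- total number of ratings
    confidence -- desired confidence level, e.g. 0.95
    """
    if n == 0:
        return 0.0
    # Two-sided z-score for the given confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = pos / n
    return (p + z * z / (2 * n)
            - z * sqrt((p * (1 - p) + z * z / (4 * n)) / n)) / (1 + z * z / n)
```

For instance, a single 5-star rating at 95% confidence gives a lower bound of about 0.2065.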
Now I just need to get the book data from Goodreads. Fortunately, it has a pretty rich API. I just need a developer key, which anyone can get for free.
For example, to get the ratings for a given book `id`, we can use their XML API for books and pattern-match on the result to get the ratings by score:
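A rough Python equivalent of that call might look like this (the URL shape and the `rating_dist` field are my assumptions about the Goodreads API response, not something taken from the post):

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

_ratings_cache = {}  # memoized results, like assigning to Ratings[id]

def parse_rating_dist(dist):
    """Turn a distribution string like '5:10|4:3|3:1|2:0|1:0|total:14'
    into a dict of integer counts keyed by star level (plus 'total')."""
    return {k: int(v) for k, v in (pair.split(":") for pair in dist.split("|"))}

def ratings(book_id, key):
    """Fetch per-star rating counts for one book from the Goodreads XML API."""
    if book_id in _ratings_cache:
        return _ratings_cache[book_id]
    time.sleep(1)  # at most one call per second per endpoint
    url = f"https://www.goodreads.com/book/show/{book_id}.xml?key={key}"
    with urllib.request.urlopen(url) as resp:
        root = ET.fromstring(resp.read())
    _ratings_cache[book_id] = parse_rating_dist(root.findtext(".//rating_dist"))
    return _ratings_cache[book_id]
```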
`key` is my Goodreads developer API key, defined elsewhere. I put a `Pause` in the call since Goodreads throttles API calls so that you can’t make more than one call per second to each API endpoint. I’m also memoizing the result by assigning to `Ratings[id]` in the global environment.
`Ratings` will give us an association list with the number of ratings for each score from 1 to 5, together with the total. For example, for the first book in their catalogue, Harry Potter and the Half-Blood Prince, here are the scores:
(The result is an association with the number of ratings at each star level from 1 to 5, together with a total of roughly 1.2 million.)
Sweet. Let’s see how Harry Potter #6 would score with our rating:
(Feeding the 5-star count and the total into the score function gives about 0.61572.)
So Wilson is 95% confident that in any random sample of about 1.2 million Harry Potter readers, at least 61.572% of them would give The Half-Blood Prince a 5-star rating. That turns out to be a pretty high score, so if this book were on my list (which it isn’t), it would feature pretty close to the very top.
But now the score for a relatively obscure title is too low. For example, the lower bound of the 95% confidence interval for a single-rating 5-star book will be 0.206549, which will be towards the bottom of any list. This means I would never get to any of the obscure books on my reading list, since they would be edged out by moderately popular books with an average rating close to 4.0.
See, if I’ve picked a book that I want to read, I’d consider five ratings that are all five stars a much stronger signal than the fact that people who liked Harry Potter enough to read five previous books also loved the sixth. Currently the 5*5 book scores 57%, a bit weaker than the Potter book’s 62%.
I can fix this by lowering the confidence level. Honestly, I don’t need high confidence in the ranking. I’d rather err on the side of picking up a deservedly obscure book than miss out on a rare gem. Experimenting with this a bit, I find that a confidence level around 80% raises the obscure books enough to give me an interesting mix. For example, the 5*5 book gets a 75% rank, while the Harry Potter one stays at 62%.
I’m going to call that number, the lower bound of the 80% Wilson confidence interval, the Rúnar Rank of a given book.
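As an illustrative Python sketch of that definition (the Wilson bound is inlined so the snippet stands alone):

```python
from math import sqrt
from statistics import NormalDist

def wilson_lower(pos, n, confidence):
    """Lower bound of the Wilson score interval."""
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = pos / n
    return (p + z * z / (2 * n)
            - z * sqrt((p * (1 - p) + z * z / (4 * n)) / n)) / (1 + z * z / n)

def runar_rank(five_stars, total):
    """Runar Rank: lower bound of the 80% Wilson interval for 5-star ratings."""
    return wilson_lower(five_stars, total, 0.80)
```

With this, `runar_rank(5, 5)` comes out around 0.75, the figure mentioned above for the 5*5 book.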
Unfortunately, there’s no way to get the rank of all the books in my reading list in one fell swoop. I’ll have to get the reading list first, then call `RunarRank` for each book’s `id`. In Goodreads, books are organized into “shelves”, and the API allows getting the contents of a given shelf, 200 books at a time:
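Sketched in Python (the element names in the response are my guess at the API’s shape; `parse_shelf` does the pattern-matching part):

```python
import urllib.request
import xml.etree.ElementTree as ET

def parse_shelf(xml_text):
    """Pull (id, title, average_rating, first author) out of a
    review-list XML response, one entry per book."""
    root = ET.fromstring(xml_text)
    books = []
    for book in root.iter("book"):
        books.append({
            "id": book.findtext("id"),
            "title": book.findtext("title"),
            "average_rating": float(book.findtext("average_rating")),
            "author": book.findtext(".//author/name"),  # first author only
        })
    return books

def get_shelf(user_id, shelf, key):
    """Fetch up to 200 books on a user's shelf, sorted by average rating."""
    url = ("https://www.goodreads.com/review/list/"
           f"{user_id}.xml?key={key}&shelf={shelf}"
           "&per_page=200&sort=avg_rating&order=d")
    with urllib.request.urlopen(url) as resp:
        return parse_shelf(resp.read())
```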
I’m doing a bunch of XML pattern matching here to get the `average_rating` and first `author` of each book, and putting the results in an association list. I’m getting only the top 200 books on the list by average rating (which currently is about half my list).
With that in hand, I can get the contents of my “to-read” shelf with `GetShelf[runar, "to-read"]`, where `runar` is my Goodreads user id. And given that, I can call `RunarRank` on each book on the shelf, then sort the result by that rank:
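In Python, the rank-and-sort step could be sketched like this (self-contained, with the Wilson bound repeated; `ratings_for` stands in for the memoized ratings lookup):

```python
from math import sqrt
from statistics import NormalDist

def wilson_lower(pos, n, confidence):
    """Lower bound of the Wilson score interval."""
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = pos / n
    return (p + z * z / (2 * n)
            - z * sqrt((p * (1 - p) + z * z / (4 * n)) / n)) / (1 + z * z / n)

def rank_shelf(books, ratings_for):
    """Attach the 80% Wilson lower bound to each book and sort, best first.
    `books` holds dicts with an 'id'; `ratings_for` maps a book id to its
    per-star counts (keys '1'..'5' and 'total')."""
    for book in books:
        counts = ratings_for(book["id"])
        book["rank"] = wilson_lower(counts["5"], counts["total"], 0.80)
    return sorted(books, key=lambda b: b["rank"], reverse=True)
```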
That gives me the ranked reading list of any user. And to print it out nicely:
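A minimal pretty-printer, sketched in Python (the column layout is my own; the rank is shown as a percentage):

```python
def format_ranked(books):
    """Render each ranked book as 'id  title  author  rank%', one per line."""
    return [
        f"{b['id']}  {b['title']}  {b.get('author', '')}  {b['rank'] * 100:.4f}%"
        for b in books
    ]

# Example using one entry from the list below:
for line in format_ranked([{"id": 17278,
                            "title": "The Feynman Lectures on Physics Vol 1",
                            "author": "Richard P. Feynman",
                            "rank": 0.672231}]):
    print(line)
```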
Now I can get, say, the first 10 books on my improved reading list:
| Id | Title | Author (avg rating) | Rank |
|---|---|---|---|
| 17278 | The Feynman Lectures on Physics Vol 1 | Richard P. Feynman (4.58) | 67.2231% |
| 640909 | The Knowing Animal: A Philosophical Inquiry Into Knowledge and Truth | | 64.6221% |
| 640913 | The Hand: A Philosophical Inquiry Into Human Being | | 64.6221% |
| 4050770 | Volition As Cognitive Self Regulation | | 62.231% |
| 8664353 | Unbroken: A World War II Story of Survival, Resilience, and Redemption | | 60.9849% |
| | | Benjamin C. Pierce (4.80) | |
| 77523 | Harry Potter and the Sorcerer’s Stone (Harry Potter #1) | | 59.1459% |
| 13539024 | Free Market Revolution: How Ayn Rand’s Ideas Can End Big Government | | 59.1102% |
I’m quite happy with that. Some very popular and well-loved books interspersed with obscure ones with exclusively (or almost exclusively) positive reviews. The most satisfying thing is that the rating carries a real meaning. It’s basically the relative likelihood that I will enjoy the book enough to rate it five stars.
I can test this ranking against books I’ve already read. Here’s the top of my “read” shelf, according to their Rúnar Rank:
| Id | Title | Author (avg rating) | Rank |
|---|---|---|---|
| 17930467 | The Fourth Phase of Water | Gerald H. Pollack (4.85) | 68.0406% |
| 7687279 | Nothing Less Than Victory: Decisive Wars and the Lessons of History | John David Lewis (4.67) | 64.9297% |
| 43713 | Structure and Interpretation of Computer Programs | | 62.0211% |
| 7543507 | Capitalism Unbound: The Incontestable Moral Case for Individual Rights | | 57.6085% |
| 13542387 | The DIM Hypothesis: Why the Lights of the West Are Going Out | | 55.3296% |
| 5932 | Twenty Love Poems and a Song of Despair | | 54.7205% |
| 24113 | Gödel, Escher, Bach: An Eternal Golden Braid | Douglas R. Hofstadter (4.29) | 53.5588% |
| 19312 | The Brothers Lionheart | | 53.0952% |
| 13541678 | Functional Programming in Scala | | 52.6902% |
That’s perfect. Those are definitely books I thoroughly enjoyed and would heartily recommend. Especially that last one.