Methodology

OpenRarity Principles

The core tenets of the OpenRarity methodology are:

  • It must be easy for creators, consumers, and developers to understand

  • It must be objective and grounded in mathematical principles (open-source, introspectable)

  • It must be easy to recalculate as the dataset changes (new mints, metadata typos, mutable attributes)

  • It must provide consistent rarity ranks across all publishers

Methodology

We evaluated several different platforms and collections to understand the methodologies currently being used across different providers. While several collections have some form of customization, we found the most commonly adopted rarity function to be a rarity score computed as a sum of the probability of each trait, normalized by category distribution (Trait Normalization).

The problem here is that summing probabilities is inaccurate. Summing produces the probability of a token having a Green Hat or a Blue Hat, while multiplying produces the probability of a token having a Green Hat and a Blue Hat. We believe that the rarity of any given token is rooted in its set of traits occurring together.
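To make the difference concrete, here is a small illustrative sketch in Python. The trait probabilities are made up for the example, and this is not OpenRarity library code:

```python
import math

# Hypothetical trait probabilities for two made-up tokens.
# Token A pairs a very rare trait with a very common one;
# Token B has two moderately uncommon traits.
token_a = [0.01, 0.50]   # e.g. rare hat, common background
token_b = [0.25, 0.25]

for name, probs in [("A", token_a), ("B", token_b)]:
    summed = sum(probs)                       # "or"-style score used by many rankers
    joint = math.prod(probs)                  # probability of the traits occurring together
    bits = sum(-math.log2(p) for p in probs)  # information content of the trait set
    print(f"token {name}: sum={summed:.2f}  joint={joint:.4f}  bits={bits:.2f}")

# Summing says the two tokens are nearly tied (0.51 vs 0.50), even though
# token A's trait combination is over 10x less likely to occur together.
```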

Surprisal Ranking Algorithm

Information content is an alternative way of expressing probabilities that is better suited for assessing rarity. Think of it as a measure of how surprised someone would be upon discovering something.

  1. Probabilities of 1 (i.e. every single token has the Trait) convey no rarity and add zero information to the score.

  2. As the probability approaches zero (i.e. the Trait becomes rarer), the information content continues to rise with no bound. See equation below for explanation.

  3. It is valid to perform linear operations (e.g. addition or arithmetic mean) on information, but not on raw probabilities.

Information content is used to solve lots of problems that involve something being unlikely (i.e. rare or scarce). This video shows how it was used to solve Wordle and also has an explanation of the equations, along with graphics to make it easier to understand. You can skip straight to the part on information theory if you’d like.
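The three properties listed above can be checked directly. The snippet below is a standalone sketch, not part of the OpenRarity API:

```python
import math

def information(p: float) -> float:
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

print(information(1.0))    # zero bits: a trait on every token adds nothing
print(information(0.001))  # ~9.97 bits: rarer traits add more
print(information(1e-9))   # ~29.9 bits: and keep growing without bound

# Bits can be added (or averaged): the sum equals the information of the
# joint event, assuming the traits occur independently.
assert math.isclose(information(0.5) + information(0.25), information(0.5 * 0.25))
```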

The score is defined as:

$$\frac{I(x)}{\mathbb{E}[I(x)]} \textrm{ where } I(x) = \sum_{i=1}^n -\log_2(P(trait_i))$$

This can look daunting, so let’s break it down:

$P(trait)$ simply means the probability of an NFT having a specific trait within the entire collection. When calculating this value for NFTs without any value for a trait, we use an implicit “null” trait as if the creator had originally marked them as “missing”.

$-\log_2(P(trait))$ is the mathematical way to calculate how many times you’d have to split the collection in half before you reach a trait that’s just as rare. Traits that occur in half of the NFTs get 1 point, those that occur in a quarter of the NFTs get 2 points, and so on. Using $-\log_2$ is just a way to account for the spaces in between whole-number points, like assigning 1.58 points to traits that occur in every third NFT.

  • Each of these points is actually called a “bit” of information.

  • The important thing is that even if there was a one-off grail in an impossibly large NFT collection, we could keep assigning points!

  • Unlike with probabilities, it’s valid to add together bits of information.

Conversely, if a trait exists on every NFT, i.e. $P(trait)=1$, then it's perfectly unsurprising because $-\log_2(1) = 0$.
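Worked out, those point values follow directly from the formula:

$$-\log_2\!\left(\tfrac{1}{2}\right) = 1 \qquad -\log_2\!\left(\tfrac{1}{4}\right) = 2 \qquad -\log_2\!\left(\tfrac{1}{3}\right) \approx 1.58 \qquad -\log_2(1) = 0$$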

$\Sigma$ is the Greek letter sigma (like an English S), which means “sum of”. Mathematicians like to be rigorous, so the $i$ and the $n$ tell us exactly what to sum up, but it really just means “add up the points for each of the NFT's traits”.

$\mathbb{E}[I(x)]$ is the “expected value”, which is a weighted average of the information of all the NFTs in the collection, with the weighting done by probability. Because this is a collection-wide value, it doesn't change the ranking or the relative rarity scores; it just squishes them closer together. We include it because it normalizes the scores for collections that have lots and lots of traits: these will have a higher $I(x)$ rarity score for each NFT, but will also have a higher $\mathbb{E}[I(x)]$ across the collection, so the two cancel out and make it fairer to compare between collections.
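Putting the pieces together, here is a minimal sketch of the scoring procedure described above for a tiny made-up collection. The data, names, and the handling of $\mathbb{E}[I(x)]$ (taken here as the plain mean of the per-token scores) are assumptions of this sketch; the actual OpenRarity implementation and API may differ:

```python
import math
from collections import Counter

# A tiny hypothetical collection: each token is a dict of trait type -> value.
# Missing values should be filled with an explicit "null" trait before scoring.
tokens = [
    {"hat": "green", "background": "blue"},
    {"hat": "green", "background": "red"},
    {"hat": "blue",  "background": "blue"},
    {"hat": "null",  "background": "red"},
]

n = len(tokens)

# P(trait): probability of each (trait type, value) pair across the collection.
counts = Counter((t, v) for token in tokens for t, v in token.items())
prob = {trait: c / n for trait, c in counts.items()}

# I(x) = sum of -log2(P(trait_i)) over the token's traits.
def information_content(token: dict) -> float:
    return sum(-math.log2(prob[(t, v)]) for t, v in token.items())

scores_ic = [information_content(token) for token in tokens]

# E[I(x)]: probability-weighted average information across the collection.
# With every token listed exactly once, this is just the mean of the I(x) values.
expected_ic = sum(scores_ic) / n

# Final score: I(x) / E[I(x)]. Higher score = rarer; rank 1 is the rarest token.
scores = [ic / expected_ic for ic in scores_ic]
ranking = sorted(range(n), key=lambda i: scores[i], reverse=True)

for rank, i in enumerate(ranking, start=1):
    print(f"rank {rank}: token {i}  score={scores[i]:.3f}")
```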
