Current Twt Hash spec and probability of hash collision:

The probability of a Twt Hash collision depends on the size of the hash and the number of possible values it can take. For the Twt Hash, which uses a Blake2b 256-bit hash, Base32 encoding, and takes the last 7 characters, the space of possible hash values is significantly reduced.

Breakdown:
  1. Base32 encoding: Each character in the Base32 encoding represents 5 bits of information (since ( 2^5 = 32 )).
  2. 7 characters: With 7 characters, the total number of possible hashes is:
    [
 32^7 = 3,518,437,208
 ]
 This gives about 3.5 billion possible hash values.
Probability of Collision:

The probability of a hash collision depends on the number of hashes generated and can be estimated using the Birthday Paradox. The paradox tells us that collisions are more likely than expected when hashing a large number of items.

The approximate formula for the probability of at least one collision after generating n hashes is:
[
P(\text{collision}) \approx 1 - e^{-\frac{n^2}{2M}}
]
Where:

  • (n) is the number of generated Twt Hashes.
  • (M = 32^7 = 3,518,437,208) is the total number of possible hash values.

For practical purposes, here are some example probabilities for different numbers of hashes (n):

  • For 1,000 hashes:
    [
 P(\text{collision}) \approx 1 - e^{-\frac{1000^2}{2 \cdot 3,518,437,208}} \approx 0.00014 \, \text{(0.014%)}
    ]
  • For 10,000 hashes:
    [
 P(\text{collision}) \approx 1 - e^{-\frac{10000^2}{2 \cdot 3,518,437,208}} \approx 0.14 \, \text{(14%)}
    ]
  • For 100,000 hashes:
    [
 P(\text{collision}) \approx 1 - e^{-\frac{100000^2}{2 \cdot 3,518,437,208}} \approx 0.999 \, \text{(99.9%)}
    ]
Conclusion:
  • For small to moderate numbers of hashes (up to around 1,000–10,000), the collision probability is quite low.
  • However, as the number of Twts grows (above 100,000), the likelihood of a collision increases significantly due to the relatively small hash space (3.5 billion).

⤋ Read More

@prologic@twtxt.net the basic idea was to stem the hash.. so you have a hash abcdef0123456789... any sub string of that hash after the first 6 will match. so abcdef, abcdef012, abcdef0123456 all match the same. on the case of a collision i think we decided on matching the newest since we archive off older threads anyway. the third rule was about growing the minimum hash size after some threshold of collisions were detected.

⤋ Read More

Participate

Login to join in on this yarn.