Wikibase
MediaWiki Wikibase extension
Wikibase\Repo\Rdf\HashDedupeBag Class Reference

Hash based implementation of DedupeBag. More...

+ Inheritance diagram for Wikibase\Repo\Rdf\HashDedupeBag:
+ Collaboration diagram for Wikibase\Repo\Rdf\HashDedupeBag:

Public Member Functions

 __construct ( $cutoff=5)
 Constructs a new HashDedupeBag with the given cutoff value, which is the number of hash characters to use. More...
 
 alreadySeen ( $hash, $namespace='')
 

Private Attributes

 $bag
 
 $cutoff
 

Detailed Description

Hash based implementation of DedupeBag.

This implementation of DedupeBag operates like a rather lossy cache; it's implemented as a hash that just evicts old values when a collision occurs.

The idea for this implementation was taken mainly from from blog posts:

The implementation of alreadySeen() works as follows:

  • Determine $key be truncating the $hash parameter, and prepending the $namespace to it. The point of truncation is governed by the $cutoff setting in the parameter, and is used to govern the tradeoff between bag size and the likelihood of false negatives.
  • Look up $key in $this->bag. If $this->bag[$key] is not set, return false, indicating that the value (combination of $hash and $namespace) has never been seen before (we are sure in this case).
  • If $this->bag[$key] is set, compare the value stored there with $hash. If they are the same, return true, to indicate that the value has been seen before.
  • If $hash is different from $this->bag[$key], the value might have been seen before but been evicted due to a collision, or not. In this case, return false, to be sure. This is safe since the contract of alreadySeen() states that false negatives are admissible.
  • In all cases, before returning anything, set $this->bag[$key] = $hash.
Author
Daniel Kinzler

Constructor & Destructor Documentation

◆ __construct()

Wikibase\Repo\Rdf\HashDedupeBag::__construct (   $cutoff = 5)

Constructs a new HashDedupeBag with the given cutoff value, which is the number of hash characters to use.

A larger number means less collisions (fewer false negatives), but a larger bag. The number can be read as an exponent to the size of the hash's alphabet, so with a hex hash and $cutoff = 5, you'd get a max bag size of 16^5, and a collision probability of 16^-5 = 1/32.

Parameters
int$cutoff

Member Function Documentation

◆ alreadySeen()

Wikibase\Repo\Rdf\HashDedupeBag::alreadySeen (   $hash,
  $namespace = '' 
)
See also
DedupeBag::alreadySeen

Returns true if the given combination of $hash and $namespace has been seen before - that is, alreadySeen() had already been called on this HashDedupeBag with the same values for $hash and $namespace. Returning false is inconclusive: The hash and namespace may or may not have been seen before, false negatives are possible. The probability of a false negatives here can be controlled using the $cutoff parameter passed to the constructor.

See the class level documentation for an explanation of the algorithm.

Parameters
string$hash
string$namespace
Returns
bool

Implements Wikibase\Repo\Rdf\DedupeBag.

Member Data Documentation

◆ $bag

Wikibase\Repo\Rdf\HashDedupeBag::$bag
private

◆ $cutoff

Wikibase\Repo\Rdf\HashDedupeBag::$cutoff
private

The documentation for this class was generated from the following file: