MinHash API

The MinHash class implements a sketching algorithm for estimating set similarity and distance between sequences. The C++ and Python interfaces are documented below.

C++ Interface

The C++ interface provides direct access to the MinHash class with full control over the sketching process.

### Methods

void update(char *seq)

Updates the MinHash with a new sequence. This method supports streaming updates.

Parameters:
  • seq: A pointer to the sequence to be added.

void merge(MinHash &msh)

Merges another MinHash into this MinHash, combining their hash values.

Parameters:
  • msh: The MinHash object to be merged.

double jaccard(MinHash *msh)

Computes the Jaccard index between this MinHash and another.

Parameters:
  • msh: The MinHash object to compare against.

Returns:
  • double: The Jaccard index.

double distance(MinHash *msh)

Computes the mutation distance between this MinHash and another, as defined in Mash.

Parameters:
  • msh: The MinHash object to compare against.

Returns:
  • double: The mutation distance.

MashLite toLite(vector<uint64_t> hashL) const

Converts the current MinHash object into a MashLite representation for use with the index_dict method.

void printMinHashes()

Prints the MinHash values for debugging purposes.

uint64_t getTotalLength()

Returns the total length of all sequences added so far, summed across every update.

Returns:
  • uint64_t: The total sequence length.

### Attributes

int getKmerSize()

Returns the k-mer size used by this MinHash.

Returns:
  • int: The k-mer size.

uint32_t getSeed()

Returns the hash seed used for generating the MinHash.

Returns:
  • uint32_t: The hash seed.

uint32_t getMaxSketchSize()

Returns the maximum sketch size.

Returns:
  • uint32_t: The maximum sketch size.

bool isEmpty()

Checks if the MinHash is empty.

Returns:
  • bool: true if the MinHash is empty, false otherwise.

### Sketch storage and indexing functions

void saveMinHashes(vector<MashLite> &sketches, sketchInfo_t &info, string outputFile)

Saves MashLite sketches to a specified file.

Parameters:
  • sketches (vector<MashLite>&): A reference to the vector of MashLite sketches to save.

  • info (sketchInfo_t&): Metadata associated with the sketches.

  • outputFile (string): Path to the file where the sketches will be saved.

void transMinHashes(vector<MashLite> &sketches, sketchInfo_t &info, string dictFile, string indexFile, int numThreads)

Transforms MashLite sketches into a format suitable for the index_dict method.

Parameters:
  • sketches (vector<MashLite>&): A reference to the vector of MashLite sketches to be transformed.

  • info (sketchInfo_t&): Metadata associated with the sketches.

  • dictFile (string): Path to the dictionary file used for transformation.

  • indexFile (string): Path to the index file used for transformation.

  • numThreads (int): The number of threads to use for parallel processing.

void index_tridist_MinHash(vector<MashLite> &sketches, sketchInfo_t &info, string refSketchOut, string outputFile, int kmer_size, double maxDist, int isContainment, int numThreads)

Computes the sketch index using the index_dict method.

Parameters:
  • sketches (vector<MashLite>&): A reference to the vector of MashLite sketches.

  • info (sketchInfo_t&): Metadata associated with the sketches.

  • refSketchOut (string): Path to save the reference sketches.

  • outputFile (string): Path to the output file for results.

  • kmer_size (int): The size of the k-mers used for sketching.

  • maxDist (double): The maximum allowed distance for comparisons.

  • isContainment (int): Whether to use containment comparisons (1 for true, 0 for false).

  • numThreads (int): The number of threads to use for parallel processing.

Python Interface

The Python interface exposes MinHash functionality via pybind11, enabling easy use in Python projects.

### Constructor

class MinHash(kmer=21, size=1000, seed=42)

Creates a MinHash object with the specified parameters.

Parameters:
  • kmer (int): Size of the k-mers (default: 21).

  • size (int): Maximum number of hashes to store (default: 1000).

  • seed (int): Random seed for reproducibility (default: 42).

### Methods

update(seq: str)

Updates the MinHash with a new sequence.

Parameters:
  • seq (str): The sequence to add.
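
As a rough illustration of what a streaming update does, the sketch below keeps only the smallest k-mer hashes seen so far and can be fed one sequence at a time. It is a toy model, not the library's implementation: `ToyMinHash`, `kmer_hashes`, the BLAKE2 hash, and the small default parameters are all assumptions made for the example.

```python
import hashlib
import heapq


def kmer_hashes(seq, k):
    """Yield a 64-bit hash for every k-mer in seq (toy hash, not the library's)."""
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        yield int.from_bytes(digest, "big")


class ToyMinHash:
    """Bottom-s sketch: keeps the s smallest distinct k-mer hashes seen so far."""

    def __init__(self, kmer=4, size=8):
        self.kmer = kmer
        self.size = size
        self.hashes = set()
        self.total_length = 0

    def update(self, seq):
        # Add the new k-mer hashes, then truncate back to the s smallest.
        self.total_length += len(seq)
        self.hashes.update(kmer_hashes(seq, self.kmer))
        self.hashes = set(heapq.nsmallest(self.size, self.hashes))


sketch = ToyMinHash(kmer=4, size=8)
sketch.update("ACGTACGTAC")  # first chunk
sketch.update("TTGACCAGT")   # a later chunk; no need to re-read the first
```

Because only the smallest hashes survive each update, the order in which sequences arrive does not change the final sketch in this model.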

merge(other: MinHash)

Merges another MinHash into this MinHash, combining their hash values.

Parameters:
  • other (MinHash): The MinHash object to merge.
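
One way to picture merging, assuming both sketches are bottom-k hash lists built with the same k-mer size and seed: the merged sketch is the smallest hashes of their union. `merge_sketches` is a hypothetical helper for illustration, not part of the API documented here.

```python
def merge_sketches(hashes_a, hashes_b, size):
    """Return the `size` smallest distinct hashes from the union of two sketches."""
    return sorted(set(hashes_a) | set(hashes_b))[:size]


merged = merge_sketches([1, 5, 9], [2, 5, 7], size=4)
print(merged)  # [1, 2, 5, 7]
```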

jaccard(other: MinHash) -> float

Computes the Jaccard index between this MinHash and another.

Parameters:
  • other (MinHash): The MinHash object to compare against.

Returns:
  • float: The Jaccard index.
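
For intuition, a Jaccard index can be estimated from two bottom-k sketches by looking at the smallest hashes of their union and counting how many appear in both. The function below is a minimal sketch of that idea under the assumption that each sketch is a list of integer hashes; `jaccard_estimate` is a hypothetical name, and the library's exact procedure may differ.

```python
def jaccard_estimate(hashes_a, hashes_b, size):
    """Estimate the Jaccard index from two bottom-k sketches."""
    set_a, set_b = set(hashes_a), set(hashes_b)
    # The `size` smallest hashes of the union act as a random sample of the union.
    union_bottom = sorted(set_a | set_b)[:size]
    if not union_bottom:
        return 0.0
    shared = sum(1 for h in union_bottom if h in set_a and h in set_b)
    return shared / len(union_bottom)


print(jaccard_estimate([1, 2, 3, 4], [2, 3, 5, 6], size=4))  # 0.5
```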

distance(other: MinHash) -> float

Computes the mutation distance between this MinHash and another.

Parameters:
  • other (MinHash): The MinHash object to compare against.

Returns:
  • float: The mutation distance.
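
Assuming the mutation distance follows the Mash definition referenced for the C++ `distance` method, it can be derived from a Jaccard index j and k-mer size k as D = -(1/k) ln(2j / (1 + j)). The helper below is a hypothetical illustration of that formula; the handling of j = 0 and the clamping to [0, 1] are assumptions of this example.

```python
import math


def mash_distance(jaccard, kmer_size):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)) for Jaccard index j."""
    if jaccard <= 0.0:
        return 1.0  # no shared hashes: treat as maximally distant
    dist = -math.log(2.0 * jaccard / (1.0 + jaccard)) / kmer_size
    return min(max(dist, 0.0), 1.0)


print(mash_distance(0.5, 21))
```

Identical sketches (j = 1) give a distance of 0, and the distance grows as the Jaccard index falls.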

get_total_length() -> int

Returns the total length of all sequences added so far, summed across every update.

Returns:
  • int: The total sequence length.

Prints the MinHash values for debugging purposes.

count() -> int

Estimates the cardinality (number of distinct elements) of the set represented by the MinHash.

Returns:
  • int: The estimated cardinality.
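
The estimator behind `count()` is not described here. A common choice for bottom-k sketches is the k-minimum-values estimator, which infers the number of distinct elements from how small the largest retained hash is. The function below sketches that approach under the assumption of uniformly distributed 64-bit hashes; `estimate_cardinality` is a hypothetical name and may not match what the library does.

```python
def estimate_cardinality(hashes, size, hash_bits=64):
    """k-minimum-values estimate of the number of distinct hashed elements."""
    kept = sorted(set(hashes))[:size]
    if len(kept) < size:
        return len(kept)  # sketch is not full, so the count is exact
    # Normalize the largest kept hash into (0, 1).
    kth = kept[-1] / float(2 ** hash_bits)
    if kth == 0.0:
        return len(kept)
    return round((size - 1) / kth)


print(estimate_cardinality([2**62, 2**61, 2**63], size=3))  # 4
```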

### Attributes

kmer_size

Returns the k-mer size used by this MinHash.

Returns:
  • int: The k-mer size.

seed

Returns the hash seed used for generating the MinHash.

Returns:
  • int: The hash seed.

max_sketch_size

Returns the maximum sketch size.

Returns:
  • int: The maximum sketch size.

is_empty

Checks if the MinHash is empty.

Returns:
  • bool: True if the MinHash is empty, False otherwise.
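
The Python methods above might be combined as in the following sketch. The module that exports `MinHash` is not named in this document, so the class is passed in as `minhash_cls` rather than imported; `compare_sequences` is a hypothetical helper, and the positional argument order (kmer, size, seed) is taken from the constructor signature shown earlier.

```python
def compare_sequences(minhash_cls, seqs_a, seqs_b, kmer=21, size=1000, seed=42):
    """Sketch two groups of sequences and return (jaccard, distance)."""
    # Both sketches use the same parameters so that they are comparable.
    sketch_a = minhash_cls(kmer, size, seed)
    sketch_b = minhash_cls(kmer, size, seed)
    for seq in seqs_a:
        sketch_a.update(seq)  # streaming: one call per sequence
    for seq in seqs_b:
        sketch_b.update(seq)
    return sketch_a.jaccard(sketch_b), sketch_a.distance(sketch_b)
```

With the real class this would be called as `compare_sequences(MinHash, [seq1], [seq2])`, after importing `MinHash` from wherever the bindings are installed.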