MinHash API
The MinHash class provides a powerful sketching algorithm for estimating set similarity and distances. Below is the detailed documentation for both C++ and Python interfaces.
C++ Interface
The C++ interface provides direct access to the MinHash class with full control over the sketching process.
### Methods
-
void update(char *seq)
Updates the MinHash with a new sequence. This method supports streaming updates.
- Parameters:
seq: A pointer to the sequence to be added.
-
void merge(MinHash &msh)
Merges another MinHash into this MinHash, combining their hash values.
- Parameters:
msh: The MinHash object to be merged.
-
double jaccard(MinHash *msh)
Computes the Jaccard index between this MinHash and another.
- Parameters:
msh: The MinHash object to compare against.
- Returns:
double: The Jaccard index.
-
double distance(MinHash *msh)
Computes the mutation distance between this MinHash and another, as defined in Mash.
- Parameters:
msh: The MinHash object to compare against.
- Returns:
double: The mutation distance.
-
MashLite toLite(vector<uint64_t> hashL) const
Converts the current MinHash object into a MashLite representation, for index_dict method.
-
void printMinHashes()
Prints the MinHash values for debugging purposes.
-
uint64_t getTotalLength()
Returns the total sequence length, including multiple updates.
- Returns:
uint64_t: The total sequence length.
### Attributes
-
int getKmerSize()
Returns the k-mer size used by this MinHash.
- Returns:
int: The k-mer size.
-
uint32_t getSeed()
Returns the hash seed used for generating the MinHash.
- Returns:
uint32_t: The hash seed.
-
uint32_t getMaxSketchSize()
Returns the maximum sketch size.
- Returns:
uint32_t: The maximum sketch size.
-
bool isEmpty()
Checks if the MinHash is empty.
- Returns:
bool: true if the MinHash is empty, false otherwise.
-
void saveMinHashes(vector<MashLite> &sketches, sketchInfo_t &info, string outputFile)
Saves MashLite sketches to a specified file.
- Parameters:
sketches (vector<MashLite>&): A reference to the vector of MashLite sketches to save.
info (sketchInfo_t&): Metadata associated with the sketches.
outputFile (string): Path to the file where the sketches will be saved.
-
void transMinHashes(vector<MashLite> &sketches, sketchInfo_t &info, string dictFile, string indexFile, int numThreads)
Transforms MashLite sketches into a format suitable for the index_dict method.
- Parameters:
sketches (vector<MashLite>&): A reference to the vector of MashLite sketches to be transformed.
info (sketchInfo_t&): Metadata associated with the sketches.
dictFile (string): Path to the dictionary file used for transformation.
indexFile (string): Path to the index file used for transformation.
numThreads (int): The number of threads to use for parallel processing.
-
void index_tridist_MinHash(vector<MashLite> &sketches, sketchInfo_t &info, string refSketchOut, string outputFile, int kmer_size, double maxDist, int isContainment, int numThreads)
Computes the sketch index using the index_dict method.
- Parameters:
sketches (vector<MashLite>&): A reference to the vector of MashLite sketches.
info (sketchInfo_t&): Metadata associated with the sketches.
refSketchOut (string): Path to save the reference sketches.
outputFile (string): Path to the output file for results.
kmer_size (int): The size of the k-mers used for sketching.
maxDist (double): The maximum allowed distance for comparisons.
isContainment (int): Whether to use containment comparisons (1 for true, 0 for false).
numThreads (int): The number of threads to use for parallel processing.
—
Python Interface
The Python interface exposes MinHash functionality via pybind11, enabling easy use in Python projects.
### Constructor
- class MinHash(kmer=21, size=1000, seed=42)
Creates a MinHash object with the specified parameters.
- Parameters:
kmer (int): Size of the k-mers (default: 21).
size (int): Maximum number of hashes to store (default: 1000).
seed (int): Random seed for reproducibility (default: 42).
### Methods
- update(seq: str)
Updates the MinHash with a new sequence.
- Parameters:
seq (str): The sequence to add.
- merge(other: MinHash)
Merges another MinHash into this MinHash, combining their hash values.
- Parameters:
other (MinHash): The MinHash object to merge.
- jaccard(other: MinHash) float
Computes the Jaccard index between this MinHash and another.
- Parameters:
other (MinHash): The MinHash object to compare against.
- Returns:
float: The Jaccard index.
- distance(other: MinHash) float
Computes the mutation distance between this MinHash and another.
- Parameters:
other (MinHash): The MinHash object to compare against.
- Returns:
float: The mutation distance.
- get_total_length() int
Returns the total sequence length, including multiple updates.
- Returns:
int: The total sequence length.
- print_min_hashes()
Prints the MinHash values for debugging purposes.
- count() int
Estimates the cardinality count of the set represented by the MinHash.
- Returns:
int: The estimated cardinality count.
### Attributes
- kmer_size
Returns the k-mer size used by this MinHash.
- Returns:
int: The k-mer size.
- seed
Returns the hash seed used for generating the MinHash.
- Returns:
int: The hash seed.
- max_sketch_size
Returns the maximum sketch size.
- Returns:
int: The maximum sketch size.
- is_empty
Checks if the MinHash is empty.
- Returns:
bool: True if the MinHash is empty, False otherwise.