API Reference
- exception rfmix_reader.BinaryFileNotFoundError(binary_fn, binary_dir)
Bases: FileNotFoundError
Custom exception raised when a required binary file is not found.
This exception provides detailed information about the missing file and offers suggestions for resolving the issue.
- binary_fn
The name of the missing binary file.
- Type: str
- binary_dir
The directory where the binary file was expected.
- Type: str
Example usage
raise BinaryFileNotFoundError(binary_fn, binary_dir)
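Because the exception derives from FileNotFoundError, existing handlers for FileNotFoundError also catch it. The sketch below illustrates this with a hypothetical stand-in class, MissingBinaryError. The message wording and the describe_failure helper are assumptions for illustration, not the library's implementation.

```python
class MissingBinaryError(FileNotFoundError):
    """Hypothetical stand-in for BinaryFileNotFoundError (illustration only)."""

    def __init__(self, binary_fn, binary_dir):
        # Keep both values as attributes, as the reference above describes.
        self.binary_fn = binary_fn
        self.binary_dir = binary_dir
        # The message text is an assumption; the library's wording may differ.
        super().__init__(
            f"Binary file '{binary_fn}' not found in '{binary_dir}'."
        )


def describe_failure(binary_fn, binary_dir):
    """Raise the error and show that a FileNotFoundError handler catches it."""
    try:
        raise MissingBinaryError(binary_fn, binary_dir)
    except FileNotFoundError as err:
        return err.binary_fn, err.binary_dir


print(describe_failure("chr1.bin", "./binary_files"))
```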
- class rfmix_reader.Chunk(nsamples=1024, nloci=1024)
Bases: object
Chunk specification for a contiguous submatrix of the haplotype matrix.
- Parameters:
Notes
Small chunks may increase computational time, while large chunks may increase memory usage.
For small datasets, try setting both nsamples and nloci to None.
For large datasets where you need to use every sample, try setting nsamples=None and choosing a small value for nloci.
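The trade-off in these notes can be made concrete by counting how many submatrices a given chunk specification produces. The sketch below uses a hypothetical ChunkSpec dataclass and count_chunks helper, neither of which is part of rfmix_reader. It assumes that None means "do not split along this axis".

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChunkSpec:
    """Hypothetical mirror of Chunk(nsamples, nloci); None means 'do not split'."""
    nsamples: Optional[int] = 1024
    nloci: Optional[int] = 1024


def count_chunks(total_samples, total_loci, chunk):
    """Number of submatrices needed to tile the full matrix."""
    along_loci = 1 if chunk.nloci is None else math.ceil(total_loci / chunk.nloci)
    along_samples = (
        1 if chunk.nsamples is None else math.ceil(total_samples / chunk.nsamples)
    )
    return along_loci * along_samples


# Small dataset: one chunk holds everything.
print(count_chunks(500, 100_000, ChunkSpec(nsamples=None, nloci=None)))
# Large dataset, every sample kept together, loci split into small pieces.
print(count_chunks(500, 100_000, ChunkSpec(nsamples=None, nloci=1024)))
```

More chunks mean more scheduling overhead per chunk, while fewer chunks mean each one holds more data in memory at once.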
- rfmix_reader.admix_to_bed_individual(loci, g_anc, admix, sample_num, chunk_size=10000, min_segment=3, verbose=True)
Convert loci and admixture data for a single sample into a BED (Browser Extensible Data) formatted DataFrame.
This function processes genetic loci data along with admixture proportions and returns a BED-format DataFrame for the sample selected by sample_num.
- Parameters:
loci (DataFrame) – A DataFrame containing genetic loci information. Expected to have columns for chromosome, position, and other relevant genetic markers.
g_anc (DataFrame) – A DataFrame containing sample and population information. Used to derive sample IDs and population names.
admix (Array) – A Dask Array containing admixture proportions. The shape should be compatible with the number of loci and populations.
sample_num (int) – Zero-based integer index of the sample to extract from g_anc. For example, 0 selects the first sample, 1 the second, and so on.
chunk_size (int, optional) – Size of chunks to process at once (default=10_000). Adjust based on available memory.
min_segment (int, optional) – Minimum length of a segment to consider it a true change (default=3).
verbose (bool) – True for progress information; False otherwise. Default: True.
- Returns:
A DataFrame (pandas or cudf) in BED-like format with columns: ‘chromosome’, ‘start’, ‘end’, and ancestry data columns.
- Return type:
DataFrame
- Raises:
IndexError – If sample_num is negative or >= the number of samples in g_anc.
Notes
The function internally calls _generate_bed() to perform the actual BED formatting.
Column names in the output are formatted as “{sample}_{population}”.
The output includes data for all chromosomes present in the input loci DataFrame.
Large datasets may require significant processing time and disk space.
Example
>>> loci, g_anc, admix = read_rfmix(prefix_path)
>>> bed = admix_to_bed_individual(loci, g_anc, admix, 0)
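To illustrate how per-locus ancestry calls can be collapsed into BED-style intervals, and what a min_segment threshold might do, here is a minimal sketch in plain Python. The collapse_runs helper is hypothetical. Its smoothing rule, which absorbs runs shorter than min_segment into the preceding interval, is an assumption and may differ from what _generate_bed() does.

```python
def collapse_runs(positions, calls, min_segment=3):
    """Collapse per-locus calls into (start, end, call) intervals.

    Runs shorter than `min_segment` loci are absorbed into the preceding
    interval; this smoothing rule is an assumption made for illustration.
    """
    # First pass: run-length encode consecutive identical calls.
    segments = []  # each entry: [start_pos, end_pos, call, n_loci]
    for pos, call in zip(positions, calls):
        if segments and segments[-1][2] == call:
            segments[-1][1] = pos
            segments[-1][3] += 1
        else:
            segments.append([pos, pos, call, 1])

    # Second pass: fold short runs (and re-joined equal runs) into the previous one.
    merged = []
    for seg in segments:
        if merged and (seg[3] < min_segment or merged[-1][2] == seg[2]):
            merged[-1][1] = seg[1]
            merged[-1][3] += seg[3]
        else:
            merged.append(seg)
    return [(start, end, call) for start, end, call, _ in merged]


positions = [100, 200, 300, 400, 500, 600, 700, 800]
calls = [0, 0, 0, 1, 0, 0, 0, 0]  # a single-locus switch at position 400
print(collapse_runs(positions, calls, min_segment=3))
```

With min_segment=3, the one-locus switch at position 400 is treated as noise, so the whole range is reported as one interval.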
- rfmix_reader.create_binaries(file_prefix, binary_dir='./binary_files')
Create binary files from fullband (FB) TSV files.
This function identifies FB TSV files based on a given prefix, creates a directory for binary files if it doesn’t exist, and converts the identified TSV files to binary format.
- Parameters:
file_prefix (str) – The prefix used to identify the relevant FB TSV files.
binary_dir (str, optional) – The directory where the binary files will be stored. Defaults to “./binary_files”.
- Return type:
None
- Raises:
FileNotFoundError – If no files matching the given prefix are found.
PermissionError – If there are insufficient permissions to create the binary directory.
IOError – If there’s an error during the file conversion process.
RuntimeError – If both .fb.tsv and .fb.tsv.gz forms of the same file are found in the same directory, which would cause ambiguous prefix resolution.
Example
create_binaries("data_", "./output_binaries")
Notes
This function relies on helper functions get_prefixes and _generate_binary_files.
Ensure that the necessary permissions are available to create directories and files.
Creates a directory for binary files if it doesn’t exist.
Converts identified FB TSV files to binary format.
Prints messages about the creation process.
Dependencies
get_prefixes: Function to get file prefixes.
_generate_binary_files: Function to convert TSV files to binary format.
os.makedirs: For creating directories.
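The RuntimeError condition above (a plain and a gzipped copy of the same FB file side by side) can be checked with a few lines of standard-library code. This is a minimal sketch under stated assumptions: check_ambiguous_fb and demo are hypothetical helpers, and the error message is illustrative rather than the library's own.

```python
import os
import tempfile


def check_ambiguous_fb(directory):
    """Raise RuntimeError if <prefix>.fb.tsv and <prefix>.fb.tsv.gz coexist."""
    names = set(os.listdir(directory))
    for name in sorted(names):
        if name.endswith(".fb.tsv") and name + ".gz" in names:
            raise RuntimeError(
                f"Both {name} and {name}.gz exist in {directory}; "
                "remove one so the prefix resolves unambiguously."
            )


def demo():
    """Build a throwaway directory with a conflicting pair and report the result."""
    with tempfile.TemporaryDirectory() as tmp:
        for fn in ("chr21.fb.tsv", "chr21.fb.tsv.gz"):
            open(os.path.join(tmp, fn), "w").close()
        try:
            check_ambiguous_fb(tmp)
        except RuntimeError:
            return "ambiguous"
        return "ok"


print(demo())
```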
- rfmix_reader.delete_files_or_directories(path_patterns)
Deletes the specified files or directories using the ‘rm -rf’ command.
This function takes a list of path patterns, finds all matching files or directories, and deletes them using the ‘rm -rf’ command. It prints a message for each deleted path and handles errors gracefully.
- Parameters:
path_patterns (list of str) – Path patterns to delete. These patterns can include wildcards.
- Return type:
None
Example
delete_files_or_directories(['/tmp/test_dir/', '/tmp/old_files/*.log'])
Notes
This function uses the ‘glob’ module to find matching paths and the ‘subprocess’ module to execute the ‘rm -rf’ command.
Ensure that the paths provided are correct and that you have the necessary permissions to delete the specified files or directories.
Use this function with caution as it will permanently delete the specified files or directories.
Deletes files or directories that match the specified patterns.
Prints messages indicating the deletion status of each path.
Prints error messages if a path cannot be deleted.
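For readers who want the same behavior without shelling out, the sketch below expands patterns with glob and deletes matches with shutil.rmtree and os.remove. Note the swap: this is a pure-Python alternative, not the 'rm -rf' subprocess approach described above. The remove_matching helper is hypothetical.

```python
import glob
import os
import shutil


def remove_matching(path_patterns):
    """Delete every path matching the patterns; return the paths removed."""
    removed = []
    for pattern in path_patterns:
        for path in glob.glob(pattern):
            try:
                if os.path.isdir(path) and not os.path.islink(path):
                    shutil.rmtree(path)  # real directory: remove recursively
                else:
                    os.remove(path)  # regular file or symlink
                removed.append(path)
                print(f"Deleted: {path}")
            except OSError as err:
                # Report the failure and continue with the remaining paths.
                print(f"Could not delete {path}: {err}")
    return removed
```

As with the original function, deletion is permanent, so it is worth printing the result of glob.glob(pattern) first when a pattern is new.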
- rfmix_reader.get_pops(g_anc)
Extract population names from an RFMix Q-matrix DataFrame.
This function removes the ‘sample_id’ and ‘chrom’ columns from the input DataFrame and returns the remaining column names, which represent population names.
- Parameters:
g_anc (pd.DataFrame) – Expected to have ‘sample_id’ and ‘chrom’ columns, along with population columns.
- Returns:
An array of population names extracted from the column names.
- Return type:
np.ndarray
Example
If g_anc has columns [‘sample_id’, ‘chrom’, ‘pop1’, ‘pop2’, ‘pop3’], this function will return [‘pop1’, ‘pop2’, ‘pop3’].
Note
This function assumes that all columns other than ‘sample_id’ and ‘chrom’ represent population names.
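The column-filtering logic described above can be sketched on a plain list of column names. The population_columns helper is hypothetical. The real function takes a DataFrame and returns an array, so this only mirrors the selection rule.

```python
def population_columns(columns):
    """Return column names other than 'sample_id' and 'chrom', in order."""
    reserved = {"sample_id", "chrom"}
    return [name for name in columns if name not in reserved]


cols = ["sample_id", "chrom", "pop1", "pop2", "pop3"]
print(population_columns(cols))
```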
- rfmix_reader.get_prefixes(file_prefix, mode='rfmix', verbose=True)
Retrieve and clean file prefixes for specified file types.
This function searches for files with a given prefix, cleans the prefixes, and constructs a list of dictionaries mapping specific file types to their corresponding file paths.
- Parameters:
file_prefix (str) – The prefix used to identify relevant files. This can be a directory or a common prefix for the files.
mode ({"rfmix", "flare"}) – The expected output type.
- “rfmix” expects files with suffixes: [“fb.tsv”, “fb.tsv.gz”, “rfmix.Q”].
- “flare” expects files with suffixes: [“anc.vcf.gz”, “global.anc.gz”].
verbose (bool, optional) – True for progress information; False otherwise. Default: True.
- Returns:
A list of dictionaries where each dictionary maps file types to their corresponding file paths.
- Return type:
list of dict
- Raises:
FileNotFoundError – If no valid files matching the given prefix and mode are found.
Notes
Uses _clean_prefixes helper to normalize and deduplicate prefixes.
Assumes RFMix outputs follow the convention <prefix>.fb.tsv and <prefix>.rfmix.Q.
Assumes FLARE outputs follow <prefix>.anc.vcf.gz and <prefix>.global.anc.gz.
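A rough sketch of the prefix-to-file mapping for the RFMix naming convention is shown below. The map_rfmix_files helper is hypothetical and the dictionary keys ("fb.tsv", "rfmix.Q") are assumptions. The keys and ordering that get_prefixes actually returns are not specified here.

```python
import os


def map_rfmix_files(directory):
    """Group files by prefix into dicts keyed by 'fb.tsv' and 'rfmix.Q'."""
    suffixes = ("fb.tsv", "rfmix.Q")
    found = {}
    for name in sorted(os.listdir(directory)):
        for suffix in suffixes:
            if name.endswith("." + suffix):
                # Strip ".<suffix>" to recover the shared prefix.
                prefix = name[: -len(suffix) - 1]
                found.setdefault(prefix, {})[suffix] = os.path.join(directory, name)
    # Keep only prefixes that have every expected file type.
    complete = [found[p] for p in sorted(found) if len(found[p]) == len(suffixes)]
    if not complete:
        raise FileNotFoundError(f"No RFMix files found in {directory}")
    return complete
```

For a directory containing chr1.fb.tsv and chr1.rfmix.Q, the result is a one-element list whose dictionary maps each suffix to its full path.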
- rfmix_reader.get_sample_names(g_anc)
Extract unique sample IDs from an RFMix Q-matrix DataFrame and convert to Arrow array.
This function retrieves unique values from the ‘sample_id’ column of the input DataFrame and converts them to a PyArrow array.
- Parameters:
g_anc (pd.DataFrame) – Expected to have a ‘sample_id’ column.
- Returns:
A PyArrow array containing unique sample IDs.
- Return type:
pa.Array
Example
If g_anc has a ‘sample_id’ column with values [‘sample1’, ‘sample2’, ‘sample1’, ‘sample3’], this function will return a PyArrow array containing [‘sample1’, ‘sample2’, ‘sample3’].
Note
This function assumes that the ‘sample_id’ column exists in the input DataFrame. It uses PyArrow on GPU for efficient memory management and interoperability with other data processing libraries.
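The de-duplication step in the example above can be sketched with the standard library alone. The unique_sample_ids helper is hypothetical, it keeps first-seen order, and it omits the conversion to a PyArrow array.

```python
def unique_sample_ids(sample_ids):
    """Drop repeated IDs while keeping first-seen order."""
    # dict preserves insertion order, so fromkeys acts as an ordered set.
    return list(dict.fromkeys(sample_ids))


print(unique_sample_ids(["sample1", "sample2", "sample1", "sample3"]))
```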
- rfmix_reader.read_fb(filepath, nrows, ncols, row_chunk, col_chunk)
Read and process data from a file in chunks, skipping the first 2 rows (comments) and 4 columns (loci annotation).
- Parameters:
- Returns:
Concatenated array of processed data.
- Return type:
dask.array
- Raises:
ValueError – If row_chunk or col_chunk is not a positive integer.
FileNotFoundError – If the specified file does not exist.
IOError – If there is an error reading the file.
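The skipping rule (first 2 rows, first 4 columns) and row-wise chunking can be illustrated on an in-memory TSV. This is a simplified sketch, not the library's Dask-based reader. The read_fb_rows helper and the sample column names are hypothetical, and column chunking is left out.

```python
import io


def read_fb_rows(handle, row_chunk):
    """Yield lists of rows, skipping 2 header lines and 4 annotation columns."""
    if not isinstance(row_chunk, int) or row_chunk <= 0:
        raise ValueError("row_chunk must be a positive integer")
    for _ in range(2):  # comment line and column-header line
        next(handle)
    block = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Drop the 4 loci-annotation columns; keep the numeric values.
        block.append([float(x) for x in fields[4:]])
        if len(block) == row_chunk:
            yield block
            block = []
    if block:  # final, possibly shorter, chunk
        yield block


sample = io.StringIO(
    "#comment line\n"
    "chrom\tpos\tgpos\tidx\tp1\tp2\n"
    "chr1\t100\t0.1\t0\t0.25\t0.75\n"
    "chr1\t200\t0.2\t1\t1.0\t0.0\n"
    "chr1\t300\t0.3\t2\t0.5\t0.5\n"
)
for block in read_fb_rows(sample, row_chunk=2):
    print(len(block), block[0])
```

Because read_fb_rows is a generator, the ValueError for an invalid row_chunk is raised when iteration starts rather than at call time.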
- rfmix_reader.set_gpu_environment()
Reviews and prints the properties of available GPUs.
This function checks the number of GPUs available on the system. If no GPUs are found, it prints a message indicating that no GPUs are available. If GPUs are found, it iterates through each GPU and prints its properties, including the name, total memory in gigabytes, and CUDA capability.
The function relies on two external functions:
device_count(): Returns the number of GPUs available.
get_device_properties(device_id): Returns the properties of the GPU with the given device ID.
- Raises:
Any exceptions raised by device_count or get_device_properties will propagate up to the caller.
Dependencies
torch.cuda.device_count: Counts the number of GPU devices.
torch.cuda.get_device_properties: Gets device properties.
Example
GPU 0: NVIDIA GeForce RTX 3080
  Total memory: 10.00 GB
  CUDA capability: 8.6
GPU 1: NVIDIA GeForce RTX 3070
  Total memory: 8.00 GB
  CUDA capability: 8.6
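A report of this shape can be produced with the two PyTorch calls named above. The sketch below is an approximation, not the library's code. The format_gpu and report_gpus helpers are hypothetical, and treating 1 GB as 1024**3 bytes is an assumption.

```python
def format_gpu(index, name, total_memory_bytes, major, minor):
    """Format one GPU summary; GB here means 1024**3 bytes (an assumption)."""
    gb = total_memory_bytes / 1024**3
    return (
        f"GPU {index}: {name}\n"
        f"  Total memory: {gb:.2f} GB\n"
        f"  CUDA capability: {major}.{minor}"
    )


def report_gpus():
    """Print a summary for each visible GPU, if PyTorch is installed."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed.")
        return
    count = torch.cuda.device_count()
    if count == 0:
        print("No GPUs available.")
        return
    for i in range(count):
        props = torch.cuda.get_device_properties(i)
        print(format_gpu(i, props.name, props.total_memory, props.major, props.minor))


report_gpus()
```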