API Reference

exception rfmix_reader.BinaryFileNotFoundError(binary_fn, binary_dir)[source]

Bases: FileNotFoundError

Custom exception raised when a required binary file is not found.

This exception provides detailed information about the missing file and offers suggestions for resolving the issue.

binary_fn (str) – The name of the missing binary file.

binary_dir (str) – The directory where the binary file was expected.

Example usage

raise BinaryFileNotFoundError(binary_fn, binary_dir)
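As a rough illustration of the pattern described above, the sketch below defines a stand-in exception that subclasses FileNotFoundError and stores the two attributes. The class name MissingBinaryError, the helper require_binary, and the message wording are assumptions made for this example; they are not taken from the rfmix_reader source.

```python
import os


class MissingBinaryError(FileNotFoundError):
    """Hypothetical stand-in for BinaryFileNotFoundError (illustration only)."""

    def __init__(self, binary_fn, binary_dir):
        self.binary_fn = binary_fn
        self.binary_dir = binary_dir
        # The message text is an assumption, not the library's wording.
        super().__init__(
            f"Binary file '{binary_fn}' not found in '{binary_dir}'."
        )


def require_binary(binary_fn, binary_dir):
    """Return the full path, or raise if the file is absent (hypothetical helper)."""
    path = os.path.join(binary_dir, binary_fn)
    if not os.path.isfile(path):
        raise MissingBinaryError(binary_fn, binary_dir)
    return path


try:
    require_binary("chr21.bin", "./no_such_directory")
except FileNotFoundError as err:  # the subclass is caught through its base class
    caught = err
    print(err.binary_fn, err.binary_dir)
```

Because the exception derives from FileNotFoundError, existing handlers for missing files keep working, while callers that need the details can read binary_fn and binary_dir from the exception object.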

class rfmix_reader.Chunk(nsamples=1024, nloci=1024)[source]

Bases: object

Chunk specification for a contiguous submatrix of the haplotype matrix.

Parameters:

nsamples (Optional[int], default=1024) – Number of samples in a single chunk, limited by the total number of samples. Set to None to include all samples.
nloci (Optional[int], default=1024) – Number of loci in a single chunk, limited by the total number of loci. Set to None to include all loci.

Notes

Small chunks may increase computational time, while large chunks may increase memory usage.
For small datasets, try setting both nsamples and nloci to None.
For large datasets where you need to use every sample, try setting nsamples=None and choose a small value for nloci.

nloci: Optional[int] = 1024

nsamples: Optional[int] = 1024
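The trade-off in the notes above can be made concrete with a little arithmetic. The following sketch assumes that a chunk size of None means "the whole axis" and that any other size is capped at the axis length, as the parameter descriptions state; the helper names effective_chunk and n_blocks are hypothetical and not part of rfmix_reader.

```python
import math


def effective_chunk(size, total):
    """Chunk length actually used along one axis of length total (total > 0).

    None means "use the whole axis"; otherwise the size is capped at total.
    """
    if size is None:
        return total
    return min(size, total)


def n_blocks(size, total):
    """Number of chunks needed to cover an axis of length total."""
    return math.ceil(total / effective_chunk(size, total))


# Hypothetical dataset: 500 samples by 100_000 loci.
n_samples, n_loci = 500, 100_000
print(effective_chunk(1024, n_samples))  # the default is capped at the sample count
print(n_blocks(None, n_samples))         # None gives one block spanning every sample
print(n_blocks(1024, n_loci))            # the loci axis is split into many blocks
```

Fewer, larger blocks mean fewer tasks to schedule but more memory per task, which is the balance the notes describe.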

rfmix_reader.admix_to_bed_individual(loci, g_anc, admix, sample_num, chunk_size=10000, min_segment=3, verbose=True)[source]

Convert loci and admixture data to BED (Browser Extensible Data) format for one sample.

This function processes genetic loci data along with admixture proportions and returns a BED-format DataFrame for the sample selected by sample_num.

Parameters:

loci (DataFrame) – A DataFrame containing genetic loci information. Expected to have columns for chromosome, position, and other relevant genetic markers.
g_anc (DataFrame) – A DataFrame containing sample and population information. Used to derive sample IDs and population names.
admix (Array) – A Dask Array containing admixture proportions. The shape should be compatible with the number of loci and populations.
sample_num (int) – Zero-based integer index of the sample to extract from g_anc. For example, 0 selects the first sample, 1 the second, and so on.
chunk_size (int, optional) – Size of chunks to process at once (default=10_000). Adjust based on available memory.
min_segment (int, optional) – Minimum length of a segment to consider it a true change (default=3).
verbose (bool) – True for progress information; False otherwise. Default: True.

Returns:

A DataFrame (pandas or cudf) in BED-like format with columns ‘chromosome’, ‘start’, ‘end’, and ancestry data columns.

Return type:

DataFrame

Raises:

IndexError – If sample_num is negative or >= the number of samples in g_anc.

Notes

The function internally calls _generate_bed() to perform the actual BED formatting.
Ancestry column names in the output are formatted as “{sample}_{population}”.
The output includes data for all chromosomes present in the input loci DataFrame.
Large datasets may require significant processing time and disk space.

Example

>>> from rfmix_reader import read_rfmix, admix_to_bed_individual
>>> loci, g_anc, admix = read_rfmix(prefix_path)
>>> bed = admix_to_bed_individual(loci, g_anc, admix, sample_num=0)
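To show what a BED-like result looks like, the sketch below collapses runs of identical per-locus ancestry calls into (chromosome, start, end, call) rows. This is a simplified run-length encoding written for illustration: it is not the library's _generate_bed(), it ignores min_segment and chunking, and the function name runs_to_bed and the toy data are assumptions.

```python
from itertools import groupby


def runs_to_bed(chrom, positions, calls):
    """Collapse consecutive identical calls into (chrom, start, end, call) rows."""
    if len(positions) != len(calls):
        raise ValueError("positions and calls must have the same length")
    rows = []
    idx = 0
    for call, group in groupby(calls):
        run_length = len(list(group))
        start = positions[idx]
        end = positions[idx + run_length - 1]
        rows.append((chrom, start, end, call))
        idx += run_length
    return rows


# Toy data: six loci on one chromosome with an ancestry call at each locus.
positions = [100, 200, 300, 400, 500, 600]
calls = [0, 0, 1, 1, 1, 2]
bed_rows = runs_to_bed("chr22", positions, calls)
for row in bed_rows:
    print(*row, sep="\t")
```

Each output row spans the first and last position of a run, so a change in the call starts a new segment.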

rfmix_reader.create_binaries(file_prefix, binary_dir='./binary_files')[source]

Create binary files from fullband (FB) TSV files.

This function identifies FB TSV files based on a given prefix, creates a directory for binary files if it doesn’t exist, and converts the identified TSV files to binary format.

Parameters:

file_prefix (str) – The prefix used to identify the relevant FB TSV files.
binary_dir (str, optional) – The directory where the binary files will be stored. Defaults to “./binary_files”.

Return type:

None

Raises:

FileNotFoundError – If no files matching the given prefix are found.
PermissionError – If there are insufficient permissions to create the binary directory.
IOError – If there’s an error during the file conversion process.
RuntimeError – If both .fb.tsv and .fb.tsv.gz forms of the same file are found in the same directory, which would cause ambiguous prefix resolution.

Example

create_binaries("data_", "./output_binaries")

Notes

This function relies on helper functions get_prefixes and _generate_binary_files.
Ensure that the necessary permissions are available to create directories and files.
Creates a directory for binary files if it doesn’t exist.
Converts identified FB TSV files to binary format.
Prints messages about the creation process.

Dependencies

get_prefixes: Function to get file prefixes.
_generate_binary_files: Function to convert TSV files to binary format.
os.makedirs: For creating directories.
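The ambiguity check and directory creation described above might look roughly like the following. This is a sketch under stated assumptions: the helper names find_ambiguous_prefixes and prepare_binary_dir are hypothetical, the error message is invented, and the real function may detect the conflict differently.

```python
import os
import tempfile


def find_ambiguous_prefixes(filenames):
    """Return prefixes present as both <prefix>.fb.tsv and <prefix>.fb.tsv.gz."""
    plain = {f[: -len(".fb.tsv")] for f in filenames if f.endswith(".fb.tsv")}
    gzipped = {f[: -len(".fb.tsv.gz")] for f in filenames if f.endswith(".fb.tsv.gz")}
    return sorted(plain & gzipped)


def prepare_binary_dir(filenames, binary_dir):
    """Reject ambiguous inputs, then create the output directory if needed."""
    clashes = find_ambiguous_prefixes(filenames)
    if clashes:
        raise RuntimeError(f"Both .fb.tsv and .fb.tsv.gz found for: {clashes}")
    os.makedirs(binary_dir, exist_ok=True)  # no error if it already exists
    return binary_dir


names = ["chr1.fb.tsv", "chr2.fb.tsv.gz", "chr2.fb.tsv"]
ambiguous = find_ambiguous_prefixes(names)
print(ambiguous)

# Create the directory inside a temporary location so the demo leaves no trace.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = prepare_binary_dir(["chr1.fb.tsv"], os.path.join(tmp, "binary_files"))
    created = os.path.isdir(out_dir)
```

Passing exist_ok=True to os.makedirs matches the documented behaviour of creating the directory only when it does not already exist.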

rfmix_reader.delete_files_or_directories(path_patterns)[source]

Deletes the specified files or directories using the ‘rm -rf’ command.

This function takes a list of path patterns, finds all matching files or directories, and deletes them using the ‘rm -rf’ command. It prints a message for each deleted path and handles errors gracefully.

Parameters:

path_patterns (list of str) – Path patterns to delete. These patterns can include wildcards.

Return type:

None

Example

delete_files_or_directories(['/tmp/test_dir/*', '/tmp/old_files/*.log'])

Notes

This function uses the ‘glob’ module to find matching paths and the ‘subprocess’ module to execute the ‘rm -rf’ command.
Ensure that the paths provided are correct and that you have the necessary permissions to delete the specified files or directories.
Use this function with caution as it will permanently delete the specified files or directories.
Deletes files or directories that match the specified patterns.
Prints messages indicating the deletion status of each path.
Prints error messages if a path cannot be deleted.
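The expand-then-delete flow can be tried safely with a small stand-in. Note the substitution: this sketch uses shutil.rmtree and os.remove from the standard library instead of calling ‘rm -rf’ through subprocess, and the function name delete_matching is hypothetical. It runs inside a temporary directory so nothing outside it is touched.

```python
import glob
import os
import shutil
import tempfile


def delete_matching(path_patterns):
    """Delete every path matching the patterns; return the paths removed."""
    removed = []
    for pattern in path_patterns:
        for path in glob.glob(pattern):
            try:
                if os.path.isdir(path) and not os.path.islink(path):
                    shutil.rmtree(path)
                else:
                    os.remove(path)
                removed.append(path)
            except OSError as err:
                # Report the failure and continue with the remaining paths.
                print(f"Could not delete {path}: {err}")
    return removed


# Demonstrate inside a throwaway directory.
workdir = tempfile.mkdtemp()
for name in ("a.log", "b.log", "keep.txt"):
    open(os.path.join(workdir, name), "w").close()

removed = delete_matching([os.path.join(workdir, "*.log")])
remaining = sorted(os.listdir(workdir))
print(remaining)
shutil.rmtree(workdir)  # clean up the demo directory itself
```

Checking what a pattern expands to with glob.glob before deleting is a cheap way to confirm that a wildcard matches only the intended paths.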

rfmix_reader.get_pops(g_anc)[source]

Extract population names from an RFMix Q-matrix DataFrame.

This function removes the ‘sample_id’ and ‘chrom’ columns from the input DataFrame and returns the remaining column names, which represent population names.

Parameters:

g_anc (pd.DataFrame) – Expected to have ‘sample_id’ and ‘chrom’ columns, along with population columns.

Returns:

An array of population names extracted from the column names.

Return type:

np.ndarray

Example

If g_anc has columns [‘sample_id’, ‘chrom’, ‘pop1’, ‘pop2’, ‘pop3’], this function will return [‘pop1’, ‘pop2’, ‘pop3’].

Note

This function assumes that all columns other than ‘sample_id’ and ‘chrom’ represent population names.
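The column filtering described above can be mimicked on a plain list of column names. This is a minimal stand-in that avoids a DataFrame dependency; the function name population_columns is hypothetical, and the real function returns a NumPy array instead of a list.

```python
def population_columns(columns, id_columns=("sample_id", "chrom")):
    """Return the column names that are not identifier columns, in order."""
    return [name for name in columns if name not in id_columns]


columns = ["sample_id", "chrom", "pop1", "pop2", "pop3"]
pops = population_columns(columns)
print(pops)
```

Any extra non-population column in the input would be returned as if it were a population, which is why the assumption in the note matters.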

rfmix_reader.get_prefixes(file_prefix, mode='rfmix', verbose=True)[source]

Retrieve and clean file prefixes for specified file types.

This function searches for files with a given prefix, cleans the prefixes, and constructs a list of dictionaries mapping specific file types to their corresponding file paths.

Parameters:

file_prefix (str) – The prefix used to identify relevant files. This can be a directory or a common prefix for the files.
mode ({"rfmix", "flare"}) – The expected output type.
    - “rfmix” expects files with suffixes: [“fb.tsv”, “fb.tsv.gz”, “rfmix.Q”].
    - “flare” expects files with suffixes: [“anc.vcf.gz”, “global.anc.gz”].
verbose (bool, optional) – True for progress information; False otherwise. Default: True.

Returns:

A list of dictionaries where each dictionary maps file types to their corresponding file paths.

Return type:

list of dict

Raises:

FileNotFoundError – If no valid files matching the given prefix and mode are found.

Notes

Uses _clean_prefixes helper to normalize and deduplicate prefixes.
Assumes RFMix outputs follow the convention <prefix>.fb.tsv and <prefix>.rfmix.Q.
Assumes FLARE outputs follow <prefix>.anc.vcf.gz and <prefix>.global.anc.gz.
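To make the "list of dictionaries" return shape more tangible, here is one way such a mapping could be built from file names that follow the <prefix>.<suffix> convention. Several details are assumptions for illustration only: the function name map_prefixes, the use of the suffix strings as dictionary keys, and the rule that a prefix is kept only when every expected suffix is present.

```python
def map_prefixes(filenames, suffixes=("fb.tsv", "rfmix.Q")):
    """Group file names by prefix and map each expected suffix to its file."""
    grouped = {}
    for name in sorted(filenames):
        for suffix in suffixes:
            if name.endswith("." + suffix):
                prefix = name[: -(len(suffix) + 1)]
                grouped.setdefault(prefix, {})[suffix] = name
    # Keep only prefixes for which every expected suffix was found.
    complete = [
        grouped[prefix]
        for prefix in sorted(grouped)
        if len(grouped[prefix]) == len(suffixes)
    ]
    if not complete:
        raise FileNotFoundError("No complete file sets found for the given suffixes.")
    return complete


files = ["chr1.fb.tsv", "chr1.rfmix.Q", "chr2.fb.tsv", "chr2.rfmix.Q", "notes.txt"]
prefix_maps = map_prefixes(files)
for entry in prefix_maps:
    print(entry)
```

Switching the suffixes argument to ("anc.vcf.gz", "global.anc.gz") would apply the same grouping to the FLARE naming convention.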

rfmix_reader.get_sample_names(g_anc)[source]

Extract unique sample IDs from an RFMix Q-matrix DataFrame and convert to Arrow array.

This function retrieves unique values from the ‘sample_id’ column of the input DataFrame and converts them to a PyArrow array.

Parameters:

g_anc (pd.DataFrame) – Expected to have a ‘sample_id’ column.

Returns:

A PyArrow array containing unique sample IDs.

Return type:

pa.Array

Example

If g_anc has a ‘sample_id’ column with values [‘sample1’, ‘sample2’, ‘sample1’, ‘sample3’], this function will return a PyArrow array containing [‘sample1’, ‘sample2’, ‘sample3’].

Note

This function assumes that the ‘sample_id’ column exists in the input DataFrame. It uses PyArrow on GPU for efficient memory management and interoperability with other data processing libraries.
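The deduplication in the example above can be reproduced with the standard library alone. This stand-in returns a plain list, not a PyArrow array, and the function name unique_in_order is hypothetical; it keeps the first occurrence of each ID so the result matches the order shown in the example.

```python
def unique_in_order(values):
    """Return unique values, keeping the order of first appearance."""
    return list(dict.fromkeys(values))


sample_ids = ["sample1", "sample2", "sample1", "sample3"]
unique_ids = unique_in_order(sample_ids)
print(unique_ids)
```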

rfmix_reader.read_fb(filepath, nrows, ncols, row_chunk, col_chunk)[source]

Read and process data from a file in chunks, skipping the first 2 rows (comments) and 4 columns (loci annotation).

Parameters:

filepath (str) – Path to the binary file.
nrows (int) – Total number of rows in the dataset.
ncols (int) – Total number of columns in the dataset.
row_chunk (int) – Number of rows to process in each chunk.
col_chunk (int) – Number of columns to process in each chunk.

Returns:

Concatenated array of processed data.

Return type:

dask.array

Raises:

ValueError – If row_chunk or col_chunk is not a positive integer.
FileNotFoundError – If the specified file does not exist.
IOError – If there is an error reading the file.
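How row_chunk and col_chunk partition an nrows by ncols dataset can be sketched without touching a file. The helper chunk_bounds below is hypothetical and only computes index ranges; it mirrors the documented ValueError for non-positive chunk sizes but does not reproduce how the library reads or concatenates the blocks.

```python
def chunk_bounds(total, chunk):
    """Return (start, stop) index pairs covering range(total) in steps of chunk."""
    if not isinstance(chunk, int) or isinstance(chunk, bool) or chunk <= 0:
        raise ValueError("chunk must be a positive integer")
    return [(start, min(start + chunk, total)) for start in range(0, total, chunk)]


# Toy dimensions: 10 rows read 4 at a time, 6 columns read 3 at a time.
row_blocks = chunk_bounds(10, 4)
col_blocks = chunk_bounds(6, 3)

# Every (row block, column block) pair identifies one submatrix to load.
grid = [(rows, cols) for rows in row_blocks for cols in col_blocks]
print(row_blocks)
print(col_blocks)
print(len(grid))
```

The last block along an axis is shorter whenever the chunk size does not divide the axis length evenly, as with the final row block here.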

rfmix_reader.set_gpu_environment()[source]

Reviews and prints the properties of available GPUs.

This function checks the number of GPUs available on the system. If no GPUs are found, it prints a message indicating that no GPUs are available. If GPUs are found, it iterates through each GPU and prints its properties, including the name, total memory in gigabytes, and CUDA capability.

The function relies on two external functions:

device_count(): Returns the number of GPUs available.
get_device_properties(device_id): Returns the properties of the GPU with the given device ID.

Raises:

Any exceptions raised by device_count or get_device_properties will propagate up to the caller.

Dependencies

torch.cuda.device_count: Counts the number of GPU devices.
torch.cuda.get_device_properties: Gets device properties.

Example

GPU 0: NVIDIA GeForce RTX 3080: Total memory: 10.00 GB CUDA capability: 8.6
GPU 1: NVIDIA GeForce RTX 3070: Total memory: 8.00 GB CUDA capability: 8.6
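A comparable report can be produced directly with the two torch.cuda calls named above. This is a sketch, not the function's source: the name describe_gpus is hypothetical, the fallback messages are invented, and converting bytes to gigabytes with 1024**3 is an assumption about how the memory figure is computed. The import is guarded so the snippet also runs where PyTorch is not installed.

```python
def describe_gpus():
    """Return one descriptive line per visible CUDA device."""
    try:
        import torch
    except ImportError:
        return ["PyTorch is not installed."]

    count = torch.cuda.device_count()
    if count == 0:
        return ["No GPUs available."]

    lines = []
    for device_id in range(count):
        props = torch.cuda.get_device_properties(device_id)
        lines.append(
            f"GPU {device_id}: {props.name}: "
            f"Total memory: {props.total_memory / 1024**3:.2f} GB "  # assumed unit conversion
            f"CUDA capability: {props.major}.{props.minor}"
        )
    return lines


gpu_lines = describe_gpus()
for line in gpu_lines:
    print(line)
```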

Subpackages

`rfmix_reader`
`rfmix_reader.cli`