Haplotypes

Using rfmix-reader is as simple as pandas-plink.

We provide example data for two and three population admixtured from simulation data created with Haptools. You can download it from Figshare:

Once downloaded, we are ready to start!

Input

First, we need to generate binary files. I suggest using create_binaries.

from rfmix_reader import create_binaries

prefix_path = "../examples/two_populations/out/"
create_binaries(prefix_path)
Created binary files at: ./binary_files
Converting fb files to binary!
  0% 0/3 [00:00<?, ?it/s] 33% 1/3 [05:03<10:07, 303.82s/it]100% 3/3 [05:03<00:00, 101.27s/it]
Successfully converted 3 files to binary format.

As of v0.1.20, this can also be invoked via the command line.

create-binaries -h

Note

usage: create-binaries [-h] [--version] [--binary_dir BINARY_DIR] file_prefix

Create binary files from RFMix *.fb.tsv files.

positional arguments:
  file_prefix           The prefix used to identify the relevant FB TSV files.

options:
  -h, --help            show this help message and exit
  --version             Show the version of the program and exit.
  --binary_dir BINARY_DIR
                        The directory where the binary files will be stored.
                        Defaults to './binary_files'.

Once the binary files are created, we can read in the data with the main function read_rfmix.

from rfmix_reader import read_rfmix

loci, rf_q, admix = read_rfmix(prefix_path)
GPU 0: NVIDIA TITAN V
  Total memory: 11.77 GB
  CUDA capability: 7.0
Multiple files read in this order: ['chr20', 'chr21', 'chr22']
Mapping loci files:   0% 0/3 [00:00<?, ?it/s]Mapping loci files:  33% 1/3 [00:02<00:05,  2.72s/it]Mapping loci files:  67% 2/3 [00:04<00:01,  1.93s/it]Mapping loci files: 100% 3/3 [00:05<00:00,  1.73s/it]Mapping loci files: 100% 3/3 [00:05<00:00,  1.86s/it]
Mapping Q files:   0% 0/3 [00:00<?, ?it/s]Mapping Q files: 100% 3/3 [00:00<00:00, 47.69it/s]
Mapping fb files:   0% 0/3 [00:00<?, ?it/s]Mapping fb files:  33% 1/3 [00:00<00:00,  2.66it/s]Mapping fb files:  67% 2/3 [00:00<00:00,  3.46it/s]Mapping fb files: 100% 3/3 [00:00<00:00,  3.75it/s]Mapping fb files: 100% 3/3 [00:00<00:00,  3.55it/s]

With a GPU, three chromosomes can be loaded in to your session in less than a minute.

Output

loci

loci are the metadata for the RFMix results.

loci.shape
(646287, 3)
loci
       chromosome  physical_position       i
0           chr20              60137       0
1           chr20              60291       1
2           chr20              60340       2
3           chr20              60440       3
4           chr20              60823       4
...           ...                ...     ...
646282      chr22           50790690  646282
646283      chr22           50790993  646283
646284      chr22           50791163  646284
646285      chr22           50791228  646285
646286      chr22           50791360  646286

[646287 rows x 3 columns]

To model it after pandas_plink, there is an index column i. This is useful for software developing, but in general only the first two columns are needed.

rf_q

rf_q is the global ancestry results per chromosome for each individual. This is the *.rfmix.Q files combined into a single DataFrame.

rf_q.shape
(1500, 4)
rf_q
       sample_id      AFR      EUR  chrom
0       Sample_1  0.85383  0.14617  chr20
1       Sample_2  0.68933  0.31067  chr20
2       Sample_3  1.00000  0.00000  chr20
3       Sample_4  0.86754  0.13246  chr20
4       Sample_5  0.68280  0.31720  chr20
...          ...      ...      ...    ...
1495  Sample_496  0.82322  0.17678  chr22
1496  Sample_497  0.73456  0.26544  chr22
1497  Sample_498  1.00000  0.00000  chr22
1498  Sample_499  0.87362  0.12638  chr22
1499  Sample_500  0.85129  0.14871  chr22

[1500 rows x 4 columns]

Since we have three chromosomes, that means there are 500 samples in this example dataset.

rf_q.groupby("chrom").size()
chrom
chr22    500
chr20    500
chr21    500
dtype: int64

Let’s exact the sample names! This is a cudf DataFrame, so we need to extract the data with .to_arrow(). When running on CPU, this will be a regular pandas DataFrame.

type(rf_q)
<class 'cudf.core.dataframe.DataFrame'>
sample_ids = rf_q.sample_id.unique().to_arrow()
len(sample_ids)
500

We’ll also get the unique populations.

pops = rf_q.drop(["sample_id", "chrom"], axis=1).columns.values
pops
['AFR' 'EUR']

admix

admix is the convert RFMix results from the *.fb.tsv files. Here, we add the alleles and re-subset the data so that the first population is first (all samples) followed by the next, and the next. This means instead of 0 and 1, you can get 0, 1, or 3.

admix
dask.array<concatenate, shape=(646287, 1000), dtype=float32, chunksize=(1024, 256), chunktype=numpy.ndarray>

To reduce memory consumption, this large data is held in a dask array, exactly like pandas_plink BED data.

admix.compute()
[[2 2 2 ... 0 0 0]
 [2 2 1 ... 0 0 1]
 [1 2 1 ... 0 0 0]
 ...
 [1 1 2 ... 0 0 0]
 [2 2 2 ... 1 1 1]
 [2 2 1 ... 1 0 1]]
admix.shape
(646287, 1000)

The rows are the same as the loci data, in the sample order.

loci.shape
(646287, 3)

The rows are the total samples x number of populations. This is in a specific order. All samples are grouped by population instead of by the sample.

col_names = [f"{sample}_{pop}" for pop in pops for sample in sample_ids]
len(col_names)
1000
col_names[0:4]
['Sample_1_AFR', 'Sample_2_AFR', 'Sample_3_AFR', 'Sample_4_AFR']
col_names[500:504]
['Sample_1_EUR', 'Sample_2_EUR', 'Sample_3_EUR', 'Sample_4_EUR']

This is the correct order for the admix array data.

Loci Imputation

Imputing local ancestry loci information to genotype variant locations improves integration of the local ancestry information with genotype data. As such, we also provide the interpolate_array function to efficiently interpolate missing values when local ancestry loci information is converted to more variable genotype variant locations. It leverages the power of Zarr arrays, making it suitable for handling substantial datasets while managing memory usage effectively.

Note: Following imputation, variant_df will include genomic positions for both local ancestry and genotype data.

def _load_genotypes(plink_prefix_path):
    from tensorqtl import pgen
    pgr = pgen.PgenReader(plink_prefix_path)
    variant_df = pgr.variant_df
    variant_df.loc[:, "chrom"] = "chr" + variant_df.chrom
    return pgr.load_genotypes(), variant_df

def _load_admix(prefix_path, binary_dir):
    from rfmix_reader import read_rfmix
    return read_rfmix(prefix_path, binary_dir=binary_dir)
from rfmix_reader import interpolate_array
basename = "/projects/b1213/large_projects/brain_coloc_app/input"
# Local ancestry
prefix_path = f"{basename}/local_ancestry_rfmix/_m/"
binary_dir = f"{basename}/local_ancestry_rfmix/_m/binary_files/"
loci, _, admix = _load_admix(prefix_path, binary_dir)
loci.rename(columns={"chromosome": "chrom",
                     "physical_position": "pos"},
            inplace=True)
# Variant data
plink_prefix = f"{basename}/genotypes/TOPMed_LIBD"
_, variant_df = _load_genotypes(plink_prefix)
variant_df = variant_df.drop_duplicates(subset=["chrom", "pos"],
                                        keep='first')
# Keep all locations for more accurate imputation
variant_loci_df = variant_df.merge(loci.to_pandas(), on=["chrom", "pos"],
                                   how="outer", indicator=True)\
                            .loc[:, ["chrom", "pos", "i", "_merge"]]
data_path = f"{basename}/local_ancestry_rfmix/_m"
z = interpolate_array(variant_loci_df, admix, data_path)