.. raw:: org #+PROPERTY: header-args: :dir /dcs04/lieber/statsgen/jbenjami/tutorials/eqtl_analysis_tutorial .. raw:: org #+PROPERTY: header-args:R :cache yes :exports both :session *R* :eval never-export .. raw:: org #+PROPERTY: header-args:python :session *Python* :cache yes :exports both :eval never-export .. raw:: org #+PROPERTY: header-args:sh :cache yes :exports both :eval never-export .. raw:: org #+STARTUP: align fold nodlcheck hidestars oddeven lognotestate .. raw:: org #+TAGS: Write(w) Update(u) Fix(f) Check(c) noexport(n) Haplotypes ========== Using ``rfmix-reader`` is as simple as `pandas-plink `__. We provide example data for two and three population admixtured from simulation data created with `Haptools `__. You can download it from Figshare: Once downloaded, we are ready to start! Input ----- First, we need to generate binary files. I suggest using ``create_binaries``. .. code:: python from rfmix_reader import create_binaries prefix_path = "../examples/two_populations/out/" create_binaries(prefix_path) .. raw:: org :: Created binary files at: ./binary_files Converting fb files to binary! 0% 0/3 [00:00 .. code:: python sample_ids = rf_q.sample_id.unique().to_arrow() len(sample_ids) .. raw:: org :: 500 We'll also get the unique populations. .. code:: python pops = rf_q.drop(["sample_id", "chrom"], axis=1).columns.values pops .. raw:: org :: ['AFR' 'EUR'] ``admix`` ~~~~~~~~~ ``admix`` is the convert RFMix results from the ``*.fb.tsv`` files. Here, we add the alleles and re-subset the data so that the first population is first (all samples) followed by the next, and the next. This means instead of 0 and 1, you can get 0, 1, or 3. .. code:: python admix .. raw:: org :: dask.array To reduce memory consumption, this large data is held in a dask array, exactly like ``pandas_plink`` BED data. .. code:: python admix.compute() .. raw:: org :: [[2 2 2 ... 0 0 0] [2 2 1 ... 0 0 1] [1 2 1 ... 0 0 0] ... [1 1 2 ... 0 0 0] [2 2 2 ... 1 1 1] [2 2 1 ... 1 0 1]] .. code:: python admix.shape .. raw:: org :: (646287, 1000) The rows are the same as the ``loci`` data, in the sample order. .. code:: python loci.shape .. raw:: org :: (646287, 3) The rows are the total samples x number of populations. This is in a specific order. All samples are grouped by population instead of by the sample. .. code:: python col_names = [f"{sample}_{pop}" for pop in pops for sample in sample_ids] len(col_names) .. raw:: org :: 1000 .. code:: python col_names[0:4] .. raw:: org :: ['Sample_1_AFR', 'Sample_2_AFR', 'Sample_3_AFR', 'Sample_4_AFR'] .. code:: python col_names[500:504] .. raw:: org :: ['Sample_1_EUR', 'Sample_2_EUR', 'Sample_3_EUR', 'Sample_4_EUR'] This is the correct order for the admix array data. Loci Imputation ================ Imputing local ancestry loci information to genotype variant locations improves integration of the local ancestry information with genotype data. As such, we also provide the `interpolate_array` function to efficiently interpolate missing values when local ancestry loci information is converted to more variable genotype variant locations. It leverages the power of `Zarr `_ arrays, making it suitable for handling substantial datasets while managing memory usage effectively. **Note**: Following imputation, `variant_df` will include genomic positions for both local ancestry and genotype data. .. code:: python def _load_genotypes(plink_prefix_path): from tensorqtl import pgen pgr = pgen.PgenReader(plink_prefix_path) variant_df = pgr.variant_df variant_df.loc[:, "chrom"] = "chr" + variant_df.chrom return pgr.load_genotypes(), variant_df def _load_admix(prefix_path, binary_dir): from rfmix_reader import read_rfmix return read_rfmix(prefix_path, binary_dir=binary_dir) .. code:: python from rfmix_reader import interpolate_array basename = "/projects/b1213/large_projects/brain_coloc_app/input" # Local ancestry prefix_path = f"{basename}/local_ancestry_rfmix/_m/" binary_dir = f"{basename}/local_ancestry_rfmix/_m/binary_files/" loci, _, admix = _load_admix(prefix_path, binary_dir) loci.rename(columns={"chromosome": "chrom", "physical_position": "pos"}, inplace=True) # Variant data plink_prefix = f"{basename}/genotypes/TOPMed_LIBD" _, variant_df = _load_genotypes(plink_prefix) variant_df = variant_df.drop_duplicates(subset=["chrom", "pos"], keep='first') # Keep all locations for more accurate imputation variant_loci_df = variant_df.merge(loci.to_pandas(), on=["chrom", "pos"], how="outer", indicator=True)\ .loc[:, ["chrom", "pos", "i", "_merge"]] data_path = f"{basename}/local_ancestry_rfmix/_m" z = interpolate_array(variant_loci_df, admix, data_path)