data-handling utils
-
mpca.utils.datahandling.write_h5(filename, dataname, data)
Write data to an h5 file. NOTE: Replaces the dataset dataname if it already exists.
Parameters:
- filename – Name of the file on disk.
- dataname – Name of the dataset.
- data – The data to store.
-
mpca.utils.datahandling.read_h5(filename, dataname)
Read data from an h5 file.
Parameters:
- filename – Name of the file on disk.
- dataname – Name of the dataset.
Returns: The data.
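A round trip through these two functions might look like the sketch below. It is written directly against h5py rather than mpca, whose internals are not shown here. The helper names write_h5_sketch and read_h5_sketch, the file name, and the dataset name are hypothetical. Deleting an existing dataset before re-creating it is one way to get the documented replace-if-exists behavior; that mpca does it this way is an assumption.

```python
import os
import tempfile

import h5py
import numpy as np


def write_h5_sketch(filename, dataname, data):
    """Hypothetical stand-in for write_h5: store data, replacing any existing dataset."""
    # Mode "a" opens the file read/write and creates it if it does not exist.
    with h5py.File(filename, "a") as f:
        if dataname in f:
            # Assumed replace strategy: drop the old dataset, then re-create it.
            del f[dataname]
        f.create_dataset(dataname, data=data)


def read_h5_sketch(filename, dataname):
    """Hypothetical stand-in for read_h5: load a whole dataset into memory."""
    with h5py.File(filename, "r") as f:
        return f[dataname][()]


path = os.path.join(tempfile.mkdtemp(), "example.h5")
write_h5_sketch(path, "genotypes", np.zeros((2, 3)))
# Writing again under the same name replaces the first dataset, even with a new shape.
write_h5_sketch(path, "genotypes", np.ones((4, 5)))
loaded = read_h5_sketch(path, "genotypes")
```

Replacing rather than appending means the second write may use a different shape or dtype than the first.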
-
mpca.utils.datahandling.get_pop_superpop_list(file)
Get a list mapping populations to superpopulations from file.
Parameters: file – Path (directory, filename and extension) of a file mapping populations to superpopulations.
Returns: A list of shape (n_pops x 2).
Assumes the file contains one population and its superpopulation per line, separated by “,”, e.g.
Kyrgyz,Central/South Asia
Khomani,Sub-Saharan Africa
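The expected file layout can be illustrated with a small standard-library parser. This is only a sketch of the format described above, not the mpca implementation; the function name parse_pop_superpop is hypothetical, and the input is an in-memory stand-in for a real file.

```python
import csv
import io


def parse_pop_superpop(lines):
    """Hypothetical parser: return [[population, superpopulation], ...]."""
    reader = csv.reader(lines)
    # Skip blank lines; strip stray whitespace around each field.
    return [[row[0].strip(), row[1].strip()] for row in reader if row]


# In-memory stand-in for a mapping file with one "population,superpopulation" pair per line.
sample = io.StringIO("Kyrgyz,Central/South Asia\nKhomani,Sub-Saharan Africa\n")
pairs = parse_pop_superpop(sample)
```

Each inner list has two entries, which matches the (n_pops x 2) shape of the documented return value.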
-
mpca.utils.datahandling.read_from_EIGENSTRAT(genofile, popfile)
Read genotypes from an EIGENSTRAT file, and sample IDs and population IDs from another file.
Parameters:
- genofile – Text file with genotypes represented as 0,1,2 (9 for missing data), in (n_markers x n_samples) order.
- popfile – Text file containing sample IDs and their population IDs, e.g. a plink fam file, or a file that contains one line for each sample with the following information: “populationID sampleID”.
Returns:
- genotypes – Array of shape (n_samples x n_markers).
- ind_pop_list – Array mapping individual IDs to populations, so that ind_pop_list[i,0] is the individual ID of sample i and ind_pop_list[i,1] is the population of sample i, in the same order as in genotypes.
NOTE: The returned genotypes are transposed relative to the input file.
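The transposition and the column order of ind_pop_list are easy to get wrong, so a small sketch of the parsing may help. It assumes the usual EIGENSTRAT genotype layout of one line per marker with one digit per sample and no separators, and a population file whose first two whitespace-separated fields are “populationID sampleID”. The function name parse_eigenstrat_sketch is hypothetical; this is not the mpca implementation.

```python
import numpy as np


def parse_eigenstrat_sketch(geno_lines, pop_lines):
    """Hypothetical parser returning (genotypes, ind_pop_list)."""
    # Each non-empty line is one marker; each character is one sample's genotype.
    rows = [[int(ch) for ch in line.strip()] for line in geno_lines if line.strip()]
    # Input is (n_markers x n_samples); transpose to (n_samples x n_markers).
    genotypes = np.array(rows).T

    ind_pop = []
    for line in pop_lines:
        fields = line.split()
        if len(fields) >= 2:
            # The file lists the population first; the output lists the individual ID first.
            ind_pop.append([fields[1], fields[0]])
    return genotypes, np.array(ind_pop)


geno = ["012", "209"]                        # 2 markers, 3 samples
pop = ["popA s1", "popA s2", "popB s3"]      # "populationID sampleID"
genotypes, ind_pop_list = parse_eigenstrat_sketch(geno, pop)
```

After the call, row i of genotypes and row i of ind_pop_list refer to the same sample.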
-
mpca.utils.datahandling.normalize_genos_EIGENSTRATstyle(genodata)
Normalize genotypes as described in the EIGENSTRAT article:
Principal components analysis corrects for stratification in genome-wide association studies. Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick & David Reich. Nature Genetics, 2006.
Centering by mean and normalization are done over rows (over SNPs).
Missing data are excluded from normalization and set to 0.
Parameters: genodata (array, shape (n_markers x n_samples)) – Genotypes represented as 0,1,2, with missing values encoded as 9.
Returns: Centered and normalized genodata, transposed to (n_samples x n_markers).
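As a rough illustration of this normalization, the sketch below centers each marker by its mean over observed genotypes and divides by sqrt(p(1-p)), where p is the smoothed allele-frequency estimate (1 + sum of genotypes) / (2 + 2N) from the cited article, with N the number of observed samples for that marker. Whether mpca uses exactly this estimate is an assumption, and the function name normalize_eigenstrat_sketch is hypothetical. The sketch also assumes every marker has at least one observed genotype.

```python
import numpy as np


def normalize_eigenstrat_sketch(genodata, missing_val=9):
    """Hypothetical EIGENSTRAT-style normalization of a (n_markers x n_samples) array."""
    g = np.asarray(genodata, dtype=float)
    observed = g != missing_val

    # Per-marker statistics over observed genotypes only.
    n_obs = observed.sum(axis=1)
    sums = np.where(observed, g, 0.0).sum(axis=1)
    mean = sums / n_obs

    # Smoothed allele frequency; always strictly between 0 and 1.
    p = (1.0 + sums) / (2.0 + 2.0 * n_obs)
    scale = np.sqrt(p * (1.0 - p))

    normed = (g - mean[:, None]) / scale[:, None]
    # Missing entries are excluded from the statistics above and set to 0 afterwards.
    normed[~observed] = 0.0

    # Transpose to (n_samples x n_markers).
    return normed.T


genos = np.array([[0, 1, 2, 9],
                  [2, 2, 9, 0]])
normalized = normalize_eigenstrat_sketch(genos)
```

Setting missing entries to 0 after centering amounts to imputing them with the marker mean.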
-
mpca.utils.datahandling.remove_values(data, missing_fraction, missing_val=-1.0)
Randomly set missing_fraction of the data to missing.
Parameters:
- data (array, shape (n_samples x n_variables)) – The data.
- missing_fraction – Fraction of the data to set to missing.
- missing_val – The value used to represent missing data.
Returns:
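One way to mask a random fraction of entries is sketched below with NumPy. The name remove_values_sketch and the seed parameter are hypothetical additions for reproducibility. The documentation above does not say whether the input is modified in place or copied, nor how the fraction is rounded; this sketch works on a copy and rounds to the nearest entry count.

```python
import numpy as np


def remove_values_sketch(data, missing_fraction, missing_val=-1.0, seed=None):
    """Hypothetical masking helper: set a random fraction of entries to missing_val."""
    rng = np.random.default_rng(seed)
    out = np.array(data, dtype=float)  # copy, so the caller's array is untouched

    # Number of entries to mask, rounded to the nearest integer (assumed convention).
    n_missing = int(round(missing_fraction * out.size))

    # Pick distinct flat indices so exactly n_missing entries are masked.
    idx = rng.choice(out.size, size=n_missing, replace=False)
    out.flat[idx] = missing_val
    return out


data = np.ones((10, 5))
masked = remove_values_sketch(data, 0.2, seed=0)
```

Sampling indices without replacement keeps the masked fraction exact, which independent per-entry coin flips would not guarantee.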