data-handling utils

mpca.utils.datahandling.write_h5(filename, dataname, data)

Write data to an HDF5 (.h5) file. NOTE: Replaces the dataset dataname if it already exists.

Parameters:	filename – Name of the file on disk. dataname – Name of the dataset. data – The data to store.

mpca.utils.datahandling.read_h5(filename, dataname)

Read data from a h5 file.

Parameters:	filename – Name of the file on disk. dataname – Name of the dataset.
Returns:	The data.
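
The write-then-read behaviour described above, including the replacement of an existing dataset, can be sketched as follows. This is a minimal illustration that assumes the h5py library; the source does not say which HDF5 backend mpca uses, and the names write_h5_sketch, read_h5_sketch and example.h5 are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def write_h5_sketch(filename, dataname, data):
    """Store `data` under `dataname`, replacing any dataset of that name."""
    # Mode "a" opens the file for read/write and creates it if it is missing.
    with h5py.File(filename, "a") as hf:
        if dataname in hf:
            del hf[dataname]  # drop the old dataset so it can be replaced
        hf.create_dataset(dataname, data=data)


def read_h5_sketch(filename, dataname):
    """Return the full contents of dataset `dataname` as a NumPy array."""
    with h5py.File(filename, "r") as hf:
        return hf[dataname][()]


# Round trip in a temporary directory: the second write replaces the first.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.h5")  # hypothetical file name
    write_h5_sketch(path, "genotypes", np.zeros((2, 3)))
    write_h5_sketch(path, "genotypes", np.ones((4, 2)))
    restored = read_h5_sketch(path, "genotypes")

print(restored.shape)  # (4, 2)
```

Deleting the dataset before recreating it is one way to get the "replace" semantics even when the new array has a different shape or dtype.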

mpca.utils.datahandling.get_pop_superpop_list(file)

Get a list mapping populations to superpopulations from file.

Parameters:	file – Path (directory, filename and extension) of a file mapping populations to superpopulations.
Returns:	A list of shape (n_pops x 2).

Assumes the file contains one population and its superpopulation per line, separated by a comma (“,”), e.g.

Kyrgyz,Central/South Asia

Khomani,Sub-Saharan Africa
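
A file in this layout could be parsed roughly as below. This is a sketch using only the standard library, not the package's own code; the function name read_pop_superpop_sketch and the choice to skip blank lines and reject malformed ones are assumptions.

```python
import csv


def read_pop_superpop_sketch(path):
    """Return [[population, superpopulation], ...] from a comma-separated file."""
    pairs = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Skip empty or whitespace-only lines.
            if not any(cell.strip() for cell in row):
                continue
            if len(row) != 2:
                raise ValueError(f"expected 'population,superpopulation', got {row!r}")
            pairs.append([row[0].strip(), row[1].strip()])
    return pairs
```

For the two example lines above, this returns [["Kyrgyz", "Central/South Asia"], ["Khomani", "Sub-Saharan Africa"]].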

mpca.utils.datahandling.read_from_EIGENSTRAT(genofile, popfile)

Read genotypes from an EIGENSTRAT file, and sample IDs and population IDs from another file.

Parameters:	genofile – Text file with genotypes represented as 0, 1, 2 (9 for missing data), in (n_markers x n_samples) order. popfile – Text file containing sample IDs and their population IDs, e.g. a plink fam file, or a file that contains one line per sample with the following information: “populationID sampleID”.
Returns:	genotypes – Array of shape (n_samples x n_markers). ind_pop_list – Array mapping individual IDs to populations, so that ind_pop_list[i,0] is the individual ID of sample i and ind_pop_list[i,1] is the population of sample i, in the same order as in genotypes.

NOTE: the genotypes are transposed relative to the input file.
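
The transposition and the column order of ind_pop_list can be illustrated with the sketch below. It assumes the usual EIGENSTRAT genotype layout (one line per marker, one unseparated character per sample) and a whitespace-separated popfile whose first two columns are population ID and sample ID; read_eigenstrat_sketch is a hypothetical name, not the package function.

```python
import numpy as np


def read_eigenstrat_sketch(genofile, popfile):
    """Return (genotypes, ind_pop_list) with genotypes as (n_samples x n_markers)."""
    # Assumption: each non-empty line is one marker, each character one sample.
    with open(genofile) as handle:
        rows = [[int(ch) for ch in line.strip()] for line in handle if line.strip()]
    genotypes = np.array(rows, dtype=float).T  # transpose to samples x markers

    # Assumption: column 0 is the population ID, column 1 the sample ID.
    ind_pop = []
    with open(popfile) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or incomplete lines
            pop_id, sample_id = fields[0], fields[1]
            ind_pop.append([sample_id, pop_id])  # individual first, then population
    return genotypes, np.array(ind_pop)
```

With this layout, row i of genotypes and row i of ind_pop_list refer to the same sample, provided both files list samples in the same order.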

mpca.utils.datahandling.normalize_genos_EIGENSTRATstyle(genodata)

Normalize genotypes as described in the EIGENSTRAT article:

Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick & David Reich. “Principal components analysis corrects for stratification in genome-wide association studies.” Nature Genetics, 2006.

Each row (SNP) is centered by its mean and normalized.

Missing data are excluded from the normalization and set to 0.

Parameters:	genodata (array, shape (n_markers x n_samples)) – genotypes represented as 0,1,2, missing values encoded as 9
Returns:	Centered and normalized genodata, transposed to (n_samples x n_markers).
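
One way to realise this normalization with NumPy is sketched below. The scaling factor sqrt(p(1-p)) with p = (1 + sum of observed genotypes) / (2 + 2 * number of observed genotypes) is an assumption based on the cited article, since the description here does not spell out the formula; normalize_eigenstrat_sketch and its missing_val argument are hypothetical.

```python
import numpy as np


def normalize_eigenstrat_sketch(genodata, missing_val=9):
    """Center and scale each SNP row; return an (n_samples x n_markers) array."""
    g = np.asarray(genodata, dtype=float)
    observed = g != missing_val                    # True where a genotype is present
    n_obs = observed.sum(axis=1)                   # observed samples per SNP
    allele_sum = np.where(observed, g, 0.0).sum(axis=1)

    # Row mean over observed entries only (guard against all-missing rows).
    row_mean = allele_sum / np.maximum(n_obs, 1)

    # Assumed allele-frequency estimate; it always lies strictly between 0 and 1.
    p = (1.0 + allele_sum) / (2.0 + 2.0 * n_obs)
    scale = np.sqrt(p * (1.0 - p))

    normalized = (g - row_mean[:, None]) / scale[:, None]
    normalized[~observed] = 0.0                    # missing entries become 0
    return normalized.T                            # samples x markers
```

Setting missing entries to 0 after centering is equivalent to imputing them with the row mean, so they do not contribute to downstream covariance estimates.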

mpca.utils.datahandling.remove_values(data, missing_fraction, missing_val=-1.0)

Randomly set missing_fraction of the data to missing.

Parameters:	data (array, shape (n_samples x n_variables)) – The data. missing_fraction – Fraction of the data to set to missing. missing_val – The value used to represent missing data.
Returns:
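
The masking step can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: it masks an exact number of entries chosen without replacement, returns a copy instead of modifying the input, and adds a hypothetical seed argument for reproducibility. The name remove_values_sketch is likewise hypothetical.

```python
import numpy as np


def remove_values_sketch(data, missing_fraction, missing_val=-1.0, seed=None):
    """Return a copy of `data` with a random fraction of entries set to `missing_val`."""
    rng = np.random.default_rng(seed)
    out = np.array(data, dtype=float)              # copy; the input is left untouched
    n_missing = int(round(missing_fraction * out.size))
    # Pick distinct flat positions so exactly n_missing entries are masked.
    idx = rng.choice(out.size, size=n_missing, replace=False)
    out.flat[idx] = missing_val
    return out
```

For a 4 x 5 array and missing_fraction=0.25, exactly 5 entries are replaced by missing_val under these assumptions.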