data-handling utils

mpca.utils.datahandling.write_h5(filename, dataname, data)

Write data to an h5 file. NOTE: Replaces the dataset dataname if it already exists.

Parameters:
  • filename – Name of the file on disk.
  • dataname – Name of the dataset.
  • data – The data to store.
mpca.utils.datahandling.read_h5(filename, dataname)

Read data from an h5 file.

Parameters:
  • filename – Name of the file on disk.
  • dataname – Name of the dataset.
Returns:

The data.

mpca.utils.datahandling.get_pop_superpop_list(file)

Get a list mapping populations to superpopulations from file.

Parameters: file – directory, filename and extension of a file mapping populations to superpopulations.
Returns: a list of shape (n_pops x 2)

Assumes file contains one population and its superpopulation per line, separated by a comma (“,”), e.g.:

Kyrgyz,Central/South Asia

Khomani,Sub-Saharan Africa
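
The file layout described above can be parsed as in the following minimal sketch. It is not the mpca implementation: the helper name parse_pop_superpop is hypothetical, and the sketch assumes that only the first comma on each line separates the population from the superpopulation and that blank lines are skipped.

```python
import io


def parse_pop_superpop(lines):
    """Parse 'population,superpopulation' lines into [pop, superpop] pairs.

    Hypothetical helper for illustration; not part of mpca.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines (assumption)
            continue
        # Split on the first comma only, so the superpopulation name is kept whole.
        pop, superpop = line.split(",", 1)
        pairs.append([pop.strip(), superpop.strip()])
    return pairs


# An in-memory stand-in for the file, using the two example lines above.
example = io.StringIO("Kyrgyz,Central/South Asia\n\nKhomani,Sub-Saharan Africa\n")
pop_superpop_list = parse_pop_superpop(example)
print(pop_superpop_list)
# [['Kyrgyz', 'Central/South Asia'], ['Khomani', 'Sub-Saharan Africa']]
```

The result has one row per population and two columns, matching the (n_pops x 2) shape given under Returns.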

mpca.utils.datahandling.read_from_EIGENSTRAT(genofile, popfile)

Read genotypes from an EIGENSTRAT file, and sample IDs and population IDs from another file.

Parameters:
  • genofile – text file with genotypes represented as 0, 1, 2 (9 for missing data) in (n_markers x n_samples) order.
  • popfile – text file containing sample IDs and their population IDs, e.g. a plink fam file, or a file that contains one line for each sample with the following information: “populationID sampleID”.
Returns:
  • genotypes – shape (n_samples x n_markers).
  • ind_pop_list – array mapping individual IDs to populations, so that ind_pop_list[i,0] is the individual ID of sample i and ind_pop_list[i,1] is the population of sample i, in the same order as in genotypes.

NOTE: the returned genotypes are transposed relative to genofile, from (n_markers x n_samples) to (n_samples x n_markers).
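
The input layout and the transposition can be illustrated with a small dependency-free sketch. It is not the mpca implementation: the helper names parse_geno_lines and parse_pop_lines are hypothetical, plain lists stand in for arrays, and the sketch assumes that each genotype line holds one character per sample with no separators and that the first two whitespace-separated fields of each population line are the population ID and the sample ID.

```python
def parse_geno_lines(lines):
    """Return genotypes as (n_samples x n_markers) nested lists of ints.

    Hypothetical helper; assumes one line per marker and one character
    per sample (0, 1, 2, or 9 for missing).
    """
    by_marker = [[int(ch) for ch in line.strip()] for line in lines if line.strip()]
    # zip(*rows) turns (n_markers x n_samples) into (n_samples x n_markers).
    return [list(sample) for sample in zip(*by_marker)]


def parse_pop_lines(lines):
    """Return [[sample_id, population_id], ...] in file order.

    Hypothetical helper; assumes 'populationID sampleID' as the first
    two whitespace-separated fields of each line.
    """
    ind_pop = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:  # skip blank or malformed lines (assumption)
            continue
        pop_id, sample_id = fields[0], fields[1]
        ind_pop.append([sample_id, pop_id])
    return ind_pop


geno_text = ["012", "209", "110", "001"]       # 4 markers, 3 samples
pop_text = ["popA s1", "popA s2", "popB s3"]   # one line per sample

genotypes = parse_geno_lines(geno_text)
ind_pop_list = parse_pop_lines(pop_text)

print(genotypes)     # [[0, 2, 1, 0], [1, 0, 1, 0], [2, 9, 0, 1]]
print(ind_pop_list)  # [['s1', 'popA'], ['s2', 'popA'], ['s3', 'popB']]
```

Row i of genotypes and row i of ind_pop_list refer to the same sample, which is the ordering guarantee stated under Returns.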

mpca.utils.datahandling.normalize_genos_EIGENSTRATstyle(genodata)

Normalize genotypes as described in the EIGENSTRAT article:

“Principal components analysis corrects for stratification in genome-wide association studies”, Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick & David Reich, Nature Genetics, 2006.

Centering by the mean and normalization are performed over rows (over SNPs).

Missing data are excluded from the normalization and set to 0.

Parameters: genodata (array, shape (n_markers x n_samples)) – genotypes represented as 0, 1, 2, with missing values encoded as 9
Returns: Centered and normalized genodata, transposed to (n_samples x n_markers).
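
The normalization can be sketched as follows. This is an illustration rather than the mpca implementation: the function name normalize_eigenstrat_sketch is hypothetical, plain nested lists stand in for arrays to keep the example dependency-free, and the scaling by sqrt(p(1 - p)) with p = (1 + sum of observed genotypes) / (2 + 2 x number of observed genotypes) follows the allele-frequency estimate given by Price et al. Whether mpca uses exactly this estimate is an assumption.

```python
import math

MISSING = 9  # missing-genotype code, as documented above


def normalize_eigenstrat_sketch(genodata):
    """Center and scale each SNP row, then transpose.

    genodata: (n_markers x n_samples) nested lists with values 0, 1, 2 or 9.
    Returns (n_samples x n_markers) nested lists of floats.
    Hypothetical helper; not part of mpca.
    """
    normalized_rows = []
    for row in genodata:
        observed = [g for g in row if g != MISSING]
        n_obs = len(observed)
        # Row mean over observed genotypes only.
        mean = sum(observed) / n_obs if n_obs else 0.0
        # Allele-frequency estimate; always strictly between 0 and 1.
        p = (1.0 + sum(observed)) / (2.0 + 2.0 * n_obs)
        scale = math.sqrt(p * (1.0 - p))
        # Missing entries are excluded from the statistics and set to 0.
        normalized_rows.append(
            [0.0 if g == MISSING else (g - mean) / scale for g in row]
        )
    # Transpose to (n_samples x n_markers).
    return [list(col) for col in zip(*normalized_rows)]


genodata = [
    [0, 1, 2, 9],  # marker 0: mean 1.0, p = 0.5, scale = 0.5
    [2, 2, 9, 0],  # marker 1
]
normalized = normalize_eigenstrat_sketch(genodata)
print([sample[0] for sample in normalized])  # [-2.0, 0.0, 2.0, 0.0]
```

For marker 0 the fourth sample is missing, so its normalized value is 0, the same value an observed genotype equal to the row mean would receive.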
mpca.utils.datahandling.remove_values(data, missing_fraction, missing_val=-1.0)

Randomly set missing_fraction of the data to missing.

Parameters:
  • data (array, shape (n_samples x n_variables)) – the data
  • missing_fraction – fraction of data to set to missing
  • missing_val – the value used to represent missing data
Returns:
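
A minimal sketch of the random masking described for remove_values, not the mpca implementation. The return value is not specified in this section, so the sketch assumes a modified copy is returned and the input is left untouched. The function name remove_values_sketch and the seed parameter are hypothetical, plain nested lists stand in for arrays, and the sketch assumes that exactly round(missing_fraction x n_entries) entries are masked rather than each entry being masked independently.

```python
import random


def remove_values_sketch(data, missing_fraction, missing_val=-1.0, seed=None):
    """Return a copy of data with a random fraction of entries set to missing_val.

    data: (n_samples x n_variables) nested lists.
    Hypothetical helper; `seed` is added here only for reproducibility.
    """
    rng = random.Random(seed)
    n_rows = len(data)
    n_cols = len(data[0]) if n_rows else 0
    n_missing = int(round(missing_fraction * n_rows * n_cols))

    masked = [list(row) for row in data]  # copy, so the input is not modified
    # Choose distinct flat positions, then map each back to (row, column).
    for flat_index in rng.sample(range(n_rows * n_cols), n_missing):
        i, j = divmod(flat_index, n_cols)
        masked[i][j] = missing_val
    return masked


data = [[float(i * 5 + j) for j in range(5)] for i in range(4)]  # 4 x 5, values 0..19
masked = remove_values_sketch(data, missing_fraction=0.25, seed=0)
n_masked = sum(value == -1.0 for row in masked for value in row)
print(n_masked)  # 5
```

Choosing distinct positions makes the number of masked entries exact, which is convenient when comparing imputation results across runs; sampling each entry independently would only match missing_fraction on average.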