KING is a toolset for exploring genotype data from a genome-wide association study (GWAS) or a sequencing project. KING can be used to check family relationships and flag pedigree errors by estimating kinship coefficients for all pairs of individuals. Unrelated pairs can be well separated from close relatives (up to 3rd-degree), and vice versa. The kinship coefficient estimates for close relatives are highly accurate. Other applications of KING, such as the identification of population substructure, will not be described in detail in this tutorial.
Family relationship inference in KING is very fast (minutes to examine thousands of individuals, in contrast to days or weeks using other software, an over 100- to 1000-fold saving in computational time) and robust to the presence of population structure. The sample size (number of individuals) can be as small as 2 or larger than 100,000, both scenarios that standard relationship inference software cannot handle. Genotype data do not need to be cleaned (or pruned) at the SNP level (i.e., no SNPs need to be removed) prior to KING relationship inference. However, sample-level QC, such as removing samples with a low call rate, is still necessary, since a single poorly genotyped sample could cause a cluster of inflated relationships.

GENERAL INPUT FILES
The input files include a data file (-d), a pedigree file (-p), and a map file (-m) in MERLIN format, or alternatively a binary format file (-b). The command line will look like this:
prompt> king -d ex.dat.gz -p ex.ped.gz -m ex.map.gz --kinship
prompt> king -b ex.bgeno.gz --kinship --related
prompt> king -b ex.bed --kinship --related
The commands above specify the estimation of pairwise kinship coefficients both within and between families.
KING supports input files in MERLIN format. In addition to recognizing zipped files (file names ending in .gz) and multiple files as in Merlin, e.g.,
prompt> king -d ex1.dat.gz,ex2.dat.gz -p ex1.ped.gz,ex2.ped.gz -m ex1.map.gz,ex2.map.gz --kinship
KING also supports a binary format, either the KING binary format (unique to KING) or the better-known PLINK binary format. A binary format compresses the genotype data by using two bits to represent each genotype. Examples are:
prompt> king -b ex.bgeno --kinship
prompt> king -b ex.bed --kinship
The binary format offers convenient computational savings. With the binary format, the time to load a typical GWAS dataset usually drops from > 30 minutes to a few seconds, and only a fraction of the computer memory and disk space is needed. Binary format data can be generated with commands such as
prompt> king -d ex.dat -p ex.ped -m ex.map --plink
It is highly recommended to generate a binary format dataset first, before trying the different options implemented in KING. Loading Merlin format data typically takes much longer than performing the pairwise relationship inference itself. An example dataset in the KING binary format can be downloaded at this link: ex.bgeno [6.7MB], and this dataset will be used throughout the tutorial.
KING relationship inference has been tested and works well with genome sequence data. This may give KING a substantial advantage over alternative methods in which rare variants need to be excluded from the inference procedure. The VCF file of the sequence data can be easily converted into PLINK binary format using PLINK2:
prompt> plink2 --vcf example.vcf --make-bed --out ex
Alternative approaches are: 1) a shell script such as VCFtobed.bsh; or 2) vcftools:
prompt> vcftools --vcf example.vcf --plink-tped
prompt> plink --tfile out --make-bed --out ex
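To illustrate why the two-bit genotype encoding described above is so compact, here is a minimal Python sketch that packs four genotypes into each byte. The code values are assumed to follow the PLINK .bed convention (00 = homozygous for the first allele, 01 = missing, 10 = heterozygous, 11 = homozygous for the second allele, first sample in the low-order bits). The function names are hypothetical, and this is not KING's actual reader or writer.

```python
# Illustrative 2-bit genotype packing (a sketch, not KING's implementation).
# Code values are assumed to follow the PLINK .bed convention.
CODES = {"hom1": 0b00, "missing": 0b01, "het": 0b10, "hom2": 0b11}
NAMES = {value: name for name, value in CODES.items()}


def pack_genotypes(genotypes):
    """Pack a list of genotype labels into bytes, four genotypes per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        # The first genotype of each group of four goes in the low-order bits.
        packed[i // 4] |= CODES[g] << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, n):
    """Recover the first n genotype labels from packed bytes."""
    return [NAMES[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]


sample = ["hom1", "het", "hom2", "missing", "het"]
data = pack_genotypes(sample)
print(len(sample), "genotypes stored in", len(data), "bytes")
```

Five genotypes fit in two bytes here, which is the source of the memory and disk savings compared with a text representation.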
Pairwise relationships are checked between each pair of individuals. Two algorithms are available for relationship inference. One algorithm assumes a homogeneous population (through parameter --homo), and the other allows for the existence of population structure (through parameter --kinship). Examples are
prompt> king -b ex.bed --kinship
prompt> king -b ex.bed --kinship --related --degree 2
prompt> king -b ex.bed --kinship --ibs
prompt> king -b ex.bed --homo
The robust algorithm (default) is highly recommended. In each relationship inference, the output is separate for relationships within and between families. Note that an unrelated individual is treated as a family of size one. If the dataset consists only of individuals reported as unrelated, then all results are saved in the between-family output.
--kinship produces a subset of the results produced by the --kinship --ibs analysis. In addition to the robust kinship estimate, the --kinship --ibs analysis provides summary statistics such as the counts of IBS0, IBS1, and IBS2, the average IBS, and the standard error of the IBS estimate. The option --ibs by itself only summarizes the IBS statistics without calculating the kinship coefficients. The parameters --related --degree 2 specify that only related pairs (up to the 2nd degree in this case) between families are included in the output. Specifically, all pairs across families with a kinship coefficient less than 0.0884 are excluded from the output.
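The IBS summary statistics mentioned above can be illustrated with a small Python sketch. It assumes genotypes coded as the number of copies of one allele (0, 1, or 2), with None for a missing genotype. This coding and the function name are illustrative assumptions rather than KING's internal representation, and the standard error is not reproduced here.

```python
def ibs_summary(g1, g2):
    """Count IBS0/IBS1/IBS2 SNPs for two individuals.

    Genotypes are coded as allele counts (0, 1, 2); None marks a missing
    genotype. This coding is an illustrative assumption.
    """
    counts = {0: 0, 1: 0, 2: 0}
    het_het = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip SNPs missing in either individual
        counts[2 - abs(a - b)] += 1  # number of alleles shared by state
        if a == 1 and b == 1:
            het_het += 1  # double heterozygote, e.g. AG and AG
    n = sum(counts.values())
    nan = float("nan")
    return {
        "N_SNP": n,
        "IBS0": counts[0],
        "IBS1": counts[1],
        "IBS2": counts[2],
        "mean_IBS": (counts[1] + 2 * counts[2]) / n if n else nan,
        "HetHet_prop": het_het / n if n else nan,
        "IBS0_prop": counts[0] / n if n else nan,
    }


g1 = [0, 1, 2, 1, None, 2]
g2 = [2, 1, 2, 0, 1, 1]
summary = ibs_summary(g1, g2)
print(summary)
```

Only SNPs genotyped in both individuals contribute, so N_SNP can differ from pair to pair.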
The --related option is highly recommended when dealing with large datasets (e.g., with sample size > 10,000). Besides substantial disk space savings, the computational time is dramatically reduced, thanks to a computationally efficient algorithm implemented in versions 1.3 and later. For example, when only 1st- or 2nd-degree relative pairs (through parameter --degree 2) are included in the output (through --related), the computation time can be more than 10 times shorter. Note that inference accuracy is not sacrificed: the results for the close relatives of interest are the same as those from relationship inference without the --related option. This speed-up should be attractive for many applications. When the sample size is very large, say > 100,000, "king --related" is probably the only choice for completing the analysis in a reasonable amount of time (say, a couple of days using a single CPU).
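As a rough illustration of how a degree setting translates into a kinship cutoff, the Python sketch below assumes that the boundary for relatives up to degree d is 2^-(d + 1.5). This assumption reproduces the 0.0884 value quoted for --degree 2, but the formula, function names, and filtering logic are illustrative and are not taken from KING's source code.

```python
def degree_cutoff(degree):
    """Lower kinship bound for relatives up to the given degree.

    Assumes a boundary of 2 ** -(degree + 1.5); for degree 2 this gives
    about 0.0884, the cutoff quoted in the tutorial.
    """
    return 2.0 ** -(degree + 1.5)


def keep_related(pairs, degree):
    """Keep (id1, id2, kinship) tuples at or above the degree cutoff."""
    cutoff = degree_cutoff(degree)
    return [p for p in pairs if p[2] >= cutoff]


pairs = [
    ("28_3", "117_1", 0.1356),
    ("28_3", "117_2", 0.2441),
    ("28_3", "117_3", -0.0119),
]
related = keep_related(pairs, degree=2)
print(round(degree_cutoff(2), 4), related)
```

This post-hoc filter only mimics the output restriction; it does not reproduce the speed-up, which comes from KING's own algorithm.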
--unrelated is a handy option to extract a list of unrelated individuals. E.g.,
prompt> king -b ex.bed --unrelated --degree 2
estimates relatedness in the data first, followed by extracting a list of individuals that contains no pairs of individuals with a 1st- or 2nd-degree relationship. This option is available in version 1.4 and later. The detailed algorithm is described in this reference: Manichaikul et al. 2012 [PDF]
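To give a feel for what extracting an unrelated subset involves, here is a simple greedy Python sketch: it repeatedly drops the individual with the most remaining relatives until no related pair is left. This is a generic heuristic chosen for illustration and is not the algorithm described in the reference above; the function name and example data are hypothetical.

```python
def greedy_unrelated(individuals, related_pairs):
    """Return a subset of individuals that contains no related pair.

    Greedy heuristic: repeatedly drop the individual who has the most
    relatives still in the kept set. Not KING's algorithm.
    """
    relatives = {ind: set() for ind in individuals}
    for a, b in related_pairs:
        relatives[a].add(b)
        relatives[b].add(a)
    kept = set(individuals)
    while True:
        # Individual with the most relatives remaining in the kept set;
        # sorting makes tie-breaking deterministic.
        worst = max(sorted(kept), key=lambda i: len(relatives[i] & kept),
                    default=None)
        if worst is None or not (relatives[worst] & kept):
            break  # no related pair remains
        kept.remove(worst)
    return sorted(kept)


individuals = ["A", "B", "C", "D", "E"]
related_pairs = [("A", "B"), ("A", "C"), ("D", "E")]
unrelated = greedy_unrelated(individuals, related_pairs)
print(unrelated)
```

In this toy example, dropping A (related to both B and C) keeps more individuals than dropping B and C would.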
The output for within-family relationship checking using --kinship (saved in file king.kin) will look like this:
FID   ID1  ID2  N_SNP    Z0     Phi     HetHet  IBS0    Kinship  Error
28    1    2    2359853  0.000  0.2500  0.162   0.0008   0.2459  0
28    1    3    2351257  0.000  0.2500  0.161   0.0008   0.2466  0
28    2    3    2368538  1.000  0.0000  0.120   0.0634  -0.0108  0
117   1    2    2354279  0.000  0.2500  0.163   0.0006   0.2477  0
117   1    3    2358957  0.000  0.2500  0.164   0.0006   0.2490  0
117   2    3    2348875  1.000  0.0000  0.122   0.0616  -0.0017  0
1344  1    12   2372286  0.000  0.2500  0.149   0.0003   0.2480  0
1344  1    13   2370435  0.000  0.2500  0.148   0.0003   0.2465  0
1344  12   13   2374888  1.000  0.0000  0.117   0.0582   0.0003  0
Each row provides information for one pair of individuals. The columns are
FID: Family ID for the pair
ID1: Individual ID for the first individual of the pair
ID2: Individual ID for the second individual of the pair
N_SNP: The number of SNPs with non-missing genotypes in both individuals
Z0: Pr(IBD=0) as specified by the provided pedigree data
Phi: Kinship coefficient as specified by the provided pedigree data
HetHet: Proportion of SNPs with double heterozygotes (e.g., AG and AG)
IBS0: Proportion of SNPs with zero IBS (identical-by-state) (e.g., AA and GG)
Kinship: Estimated kinship coefficient from the SNP data
Error: Indicates a difference between the estimated and specified kinship coefficients (1 for error, 0.5 for warning)
The default kinship coefficient estimation uses SNP data from only the pair of individuals in question, and the inference is robust to population structure. A negative estimated kinship coefficient indicates an unrelated pair. Negative estimates are not set to zero because a strongly negative value may indicate population structure between the two individuals. Close relatives can be inferred fairly reliably from the estimated kinship coefficients using the following simple algorithm: estimated kinship coefficient ranges of >0.354, [0.177, 0.354], [0.0884, 0.177], and [0.0442, 0.0884] correspond to duplicate/MZ twin, 1st-degree, 2nd-degree, and 3rd-degree relationships, respectively. Relationship inference for more distant relationships is more challenging. A plot of the estimated kinship coefficient against the proportion of zero IBS-sharing is highly recommended. In the absence of population structure, relationship inference can also be carried out using an alternative algorithm through parameter "--homo".
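The simple range-based rule above can be written as a short Python function. The function name and the label used for values below 0.0442 are illustrative choices; the cutoffs are the ones quoted in the text, and values falling exactly on a boundary are assigned to the closer relationship here.

```python
def infer_relationship(kinship):
    """Map an estimated kinship coefficient to a relationship category.

    Cutoffs are those quoted in the tutorial; the label for values below
    0.0442 is an illustrative choice.
    """
    if kinship > 0.354:
        return "duplicate/MZ twin"
    if kinship >= 0.177:
        return "1st-degree"
    if kinship >= 0.0884:
        return "2nd-degree"
    if kinship >= 0.0442:
        return "3rd-degree"
    return "unrelated or more distant"


# Estimates taken from the example output tables in this tutorial.
for k in (0.2459, 0.1356, -0.0108):
    print(k, infer_relationship(k))
```

Because distant relationships are harder to infer, the last category should be read as "not reliably classified" rather than as proof that a pair is unrelated.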
Here is an example of the relationship inference using the HapMap GWAS data: PDF and its R code
The output for between-family relationship checking (saved in file king.kin0) will look like this:
FID1  ID1  FID2  ID2  N_SNP    HetHet  IBS0    Kinship
28    3    117   1    2360618  0.143   0.0267   0.1356
28    3    117   2    2352628  0.161   0.0009   0.2441
28    3    117   3    2354540  0.120   0.0624  -0.0119
28    3    1344  1    2361807  0.093   0.1095  -0.2295
28    3    1344  12   2367180  0.094   0.1091  -0.2225
28    3    1344  13   2364816  0.093   0.1082  -0.2224
117   1    1344  1    2362787  0.094   0.1093  -0.2312
117   1    1344  12   2368467  0.095   0.1088  -0.2230
117   1    1344  13   2365036  0.094   0.1084  -0.2253
117   2    1344  1    2354855  0.094   0.1084  -0.2281
117   2    1344  12   2361351  0.095   0.1078  -0.2206
117   2    1344  13   2357936  0.095   0.1067  -0.2190
117   3    1344  1    2357771  0.094   0.1102  -0.2348
117   3    1344  12   2364365  0.095   0.1086  -0.2232
117   3    1344  13   2361061  0.094   0.1096  -0.2301
This analysis shows the "unrelated" families 28 and 117 are actually connected through an unreported parent-offspring pair (28_3, 117_2).
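Between-family output like the table above can be screened for unexpected close relatives with a few lines of Python. The sketch below parses the first rows of the example as whitespace-delimited text and lists pairs whose estimated kinship is at least 0.0884 (the 2nd-degree cutoff quoted earlier). The function names are hypothetical, and the parser assumes the column layout shown in this tutorial.

```python
# First rows of the example between-family output shown in this tutorial.
KIN0_TEXT = """\
FID1 ID1 FID2 ID2 N_SNP HetHet IBS0 Kinship
28 3 117 1 2360618 0.143 0.0267 0.1356
28 3 117 2 2352628 0.161 0.0009 0.2441
28 3 117 3 2354540 0.120 0.0624 -0.0119
28 3 1344 1 2361807 0.093 0.1095 -0.2295
"""


def read_kin0(text):
    """Parse whitespace-delimited between-family output into dicts."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def close_pairs(rows, cutoff=0.0884):
    """Return (ind1, ind2, kinship) for pairs at or above the cutoff."""
    hits = []
    for r in rows:
        k = float(r["Kinship"])
        if k >= cutoff:
            hits.append((f'{r["FID1"]}_{r["ID1"]}', f'{r["FID2"]}_{r["ID2"]}', k))
    # Closest relationships first.
    return sorted(hits, key=lambda h: -h[2])


hits = close_pairs(read_kin0(KIN0_TEXT))
for ind1, ind2, k in hits:
    print(ind1, ind2, k)
```

To screen a real run, the text could instead be read from the output file, e.g. `open("king.kin0").read()`, assuming the default output prefix.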
Here is an example of relationship inference across families using the HapMap GWAS data: PDF and its R code
To identify population substructure, the parameter --individual, --mds, or --pca can be specified:
prompt> king -b ex.bed --individual
prompt> king -b ex.bed --mds
prompt> king -b ex.bed --pca 5
The --individual option of KING provides the mean and variance estimates of allele frequencies for each individual, --mds specifies multidimensional scaling (MDS) analysis, and --pca 5 specifies principal component analysis (PCA). The --mds option is highly recommended (over --pca). More details are here.
The following parameters can also be specified:
--errorrate specifies the cutoff for the error (IBS=0) rate: the rate between any parent-offspring pair should be less than this cutoff.
--homo estimates kinship and IBD0 assuming all samples are from a homogeneous population, similar to most other software.
--minMAF specifies the minimum minor allele frequency to select SNPs for relationship inference. It only applies to --homo. Default value is 0.01.
--showIBD allows --homo analysis to show IBD1 and IBD2 in the output.
--prefix specifies the file name to store the output statistics data for relationship inference. "king" is used as default.
--binary rewrites data in KING binary format.
--merlin rewrites data in MERLIN format.
--plink rewrites data in PLINK binary format.
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26(22):2867-2873
Last updated: May 2012 by Wei-Min Chen