KING is a toolset to explore genotype data from a genome-wide association study (GWAS) or a sequencing project. The latest version is KING 2.1.3, made available on February 13, 2018. KING can be used to check family relationships and flag pedigree errors by estimating kinship coefficients and/or inferring IBD (identical by descent) segments for all pairwise relationships. Unrelated pairs can be well separated from close relatives (up to 3rd-degree) and vice versa. The kinship coefficient estimates and IBD segment inference for close relatives are highly accurate. Other applications of KING, such as quality control (QC), identification of population substructure, and gene mapping, are described elsewhere.
Family relationship inference in KING is very fast (seconds to infer close relatives among 10,000s of samples) and robust to a number of scenarios, including the presence of population structure. The number of samples in the dataset can be as small as 2 or larger than 1,000,000. Genome-wide SNPs are required, though. Please do not prune or filter any "good" SNPs (those that pass QC) prior to any KING inference, unless there are too many variants to fit in computer memory (e.g., > 100,000,000), in which case rare variants can be filtered out.
The input files need to be in PLINK binary format, e.g., ex.bed, ex.fam, and ex.bim. A binary format allows the compression of genotype data by using two bits to represent a genotype. Examples are:
prompt> king -b ex.bed --related
prompt> king -b ex.bed --fam ex.fam --bim ex.bim --related
The binary format offers convenient computational savings. An example dataset in the KING binary format can be downloaded at this link: ex.tar.gz [1.35MB], and this dataset will be used throughout the tutorial. It has been tested that KING relationship inference works well with genome sequence data. The VCF file of the sequence data can be easily converted into PLINK binary format using PLINK2:
prompt> plink2 --vcf example.vcf.gz --make-bed --out ex
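To make the two-bit encoding mentioned above concrete, here is a minimal Python sketch that decodes the packed bytes of one variant from a PLINK .bed file. It assumes the standard PLINK 1 variant-major layout (three header bytes, then four genotypes per byte with the lowest-order bits first). The names decode_variant and bytes_per_variant, and the use of None for a missing call, are illustrative choices and not part of KING.

```python
# Header bytes of a PLINK 1 .bed file stored in variant-major mode.
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])

# Two-bit code -> number of copies of the second allele (A2) in the .bim file.
# Code 0b01 marks a missing genotype.
CODE_TO_A2_COUNT = {0b00: 0, 0b01: None, 0b10: 1, 0b11: 2}


def bytes_per_variant(n_samples: int) -> int:
    """Each byte holds four genotypes; the last byte of a variant is padded."""
    return (n_samples + 3) // 4


def decode_variant(block: bytes, n_samples: int) -> list:
    """Decode the packed bytes of one variant into per-sample A2 counts."""
    genotypes = []
    for i in range(n_samples):
        byte = block[i // 4]
        code = (byte >> (2 * (i % 4))) & 0b11  # lowest-order bits come first
        genotypes.append(CODE_TO_A2_COUNT[code])
    return genotypes


# One byte, 0xDC = 0b11011100, holds four samples.
print(decode_variant(bytes([0xDC]), 4))
# Storage needed for one variant genotyped in 10,000 samples.
print(bytes_per_variant(10_000))
```

At two bits per genotype, one variant typed in 10,000 samples occupies 2,500 bytes, which is why large datasets remain manageable in this format.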
Relationships are verified between each pair of individuals. Two algorithms implemented in KING are highly recommended. One estimates pair-wise kinship coefficients (through parameter --kinship), and the other (available in version 2.1 and later) determines IBD segments (through parameter --ibdseg). Examples are
prompt> king -b ex.bed --kinship
prompt> king -b ex.bed --ibdseg
prompt> king -b ex.bed --ibs
prompt> king -b ex.bed --homo
Faster algorithms are available for identifying close relationships. They include
prompt> king -b ex.bed --related
prompt> king -b ex.bed --duplicate
A few other useful applications include:
prompt> king -b ex.bed --unrelated
prompt> king -b ex.bed --build
prompt> king -b ex.bed --cluster
In each relationship inference, the output is separate for relationships that are within or between families. Note that an unrelated individual is treated as a family of size one. If the dataset consists only of individuals reported as unrelated, then all results are saved in the between-family output.
--kinship estimates pair-wise kinship coefficients. Filtering is available via "--degree". More details of --kinship are available in the PAIRWISE RELATIONSHIP WITHIN FAMILIES and PAIRWISE RELATIONSHIP ACROSS FAMILIES sections later in this tutorial.
--ibs provides summary statistics such as the counts of IBS0, IBS1, and IBS2 and the average IBS, in addition to the kinship estimates.
--homo estimates pair-wise kinship coefficients assuming a homogeneous population. The best application of --homo may be for the linear mixed models (LMM), where the population structure information needs to be explicitly incorporated in the kinship coefficient estimation. Although --homo is not recommended as a good method to infer relatedness in general populations, it provides inference results comparable to multiple alternative methods.
--ibdseg carries out the IBD segment analysis and is newly available in version 2.1. IBD segment analysis determines all IBD (IBD1 and IBD2) segments shared between relatives, from which relatedness can be inferred. Inferring IBD segments in KING is as fast as estimating kinship coefficients, e.g., seconds for 1000s of samples, in contrast to the days required by alternative tools. More details of the --ibdseg analysis are available in the IBD SEGMENT INFERENCE section later in this tutorial.
--related provides integrative, fast, and accurate inference for close relationships. It is highly recommended, especially when dealing with very large datasets consisting of > 1,000,000 samples. Integration of the IBD segment inference further improves the inference accuracy. When "--rplot" is specified, several relationship plots are generated automatically. --related --degree 2 specifies that only related pairs (up to the 2nd-degree in this case) between families are included in the output. Specifically, all pairs across families with a kinship coefficient less than 0.0884 will be excluded from the output. More details of the --related analysis are available in the INTEGRATED RELATIONSHIP INFERENCE section later in this tutorial.
--duplicate implements the fastest (and accurate) algorithm to identify duplicates/MZ twins. The running time is in seconds, unless the number of samples is > 1,000,000, in which case a few minutes are needed to identify all pairs of duplicates.
--unrelated is a handy option to extract a list of unrelated individuals. E.g.,
prompt> king -b ex.bed --unrelated --degree 2
estimates relatedness in the data first, followed by extracting a list of individuals that contains no pairs of individuals with a 1st- or 2nd-degree relationship. This option is available in version 1.4 and later. The detailed algorithm is described in this reference: Manichaikul et al. 2012 [PDF]
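KING's own selection algorithm is described in the reference above and is not reproduced here. As a rough illustration of what "a list that contains no pairs with a 1st- or 2nd-degree relationship" means, the Python sketch below repeatedly drops the individual involved in the most remaining related pairs until no related pair is left. The function name greedy_unrelated and the greedy rule are assumptions made for illustration. The pair list is taken from the close relationships visible in the example king.kin and king.kin0 outputs shown later in this tutorial, with individuals labeled as FID_ID.

```python
from collections import defaultdict


def greedy_unrelated(individuals, related_pairs):
    """Return a sorted subset of individuals in which no related pair remains.

    This is a simple greedy heuristic, not KING's algorithm.
    """
    neighbors = defaultdict(set)
    for a, b in related_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    kept = set(individuals)
    while kept:
        # Individual with the most relatives still in the kept set
        # (ties broken by label so the result is deterministic).
        worst = max(kept, key=lambda x: (len(neighbors[x] & kept), str(x)))
        if not neighbors[worst] & kept:
            break  # no related pair is left
        kept.remove(worst)
    return sorted(kept)


individuals = ["28_1", "28_2", "28_3", "117_1", "117_2", "117_3"]
# Pairs with an estimated kinship coefficient of at least 0.0884
# (1st- or 2nd-degree) in the example outputs.
related_pairs = [
    ("28_1", "28_2"), ("28_1", "28_3"),
    ("117_1", "117_2"), ("117_1", "117_3"),
    ("28_3", "117_2"), ("28_3", "117_1"),
]
print(greedy_unrelated(individuals, related_pairs))
```

A greedy rule like this does not guarantee the largest possible unrelated set. It only guarantees that the returned list contains no related pair.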
--build can reconstruct pedigrees when no or only partial pedigree information is available. It produces two files: kingupdateids.txt and kingupdateparents.txt. Users can then use these two files to update the pedigrees, e.g., using plink 1.9. The current algorithm can connect 1st-degree relatives with high accuracy. Known scenarios that KING --build handles well are families containing at least a pair of full siblings and/or a parent-child trio, etc. Algorithms that can utilize higher-degree relatedness are currently under development and should be available soon.
--cluster is both a standalone parameter and a parameter to go with other options. As a standalone option, it clusters relatives into families by generating an updateid file, which can then be used to update the pedigrees (e.g., using PLINK). --cluster can also be used to group cryptic relatives together prior to association analysis, e.g.,
prompt> king -b ex.bed --cluster --tdt
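Grouping relatives into families can be pictured as finding connected components in a graph whose edges are the related pairs. The Python sketch below shows that idea with a small union-find structure. It is an illustration under that assumption, not KING's implementation, and the function name cluster_families and the FID_ID labels are hypothetical.

```python
def cluster_families(individuals, related_pairs):
    """Group individuals into clusters linked by chains of related pairs."""
    parent = {ind: ind for ind in individuals}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in related_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for ind in individuals:
        clusters.setdefault(find(ind), []).append(ind)
    return sorted(sorted(members) for members in clusters.values())


individuals = ["28_3", "117_2", "1344_1", "1344_12"]
related_pairs = [("28_3", "117_2"), ("1344_1", "1344_12")]
print(cluster_families(individuals, related_pairs))
```

Individuals that appear in no related pair end up in clusters of size one, which mirrors the convention that an unrelated individual is treated as a family of size one.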
The output for within-family relationship checking using --kinship (saved in file king.kin) will look like this:
FID   ID1  ID2  N_SNP    Z0     Phi     HetHet  IBS0    Kinship  Error
28    1    2    2359853  0.000  0.2500  0.162   0.0008  0.2459   0
28    1    3    2351257  0.000  0.2500  0.161   0.0008  0.2466   0
28    2    3    2368538  1.000  0.0000  0.120   0.0634  -0.0108  0
117   1    2    2354279  0.000  0.2500  0.163   0.0006  0.2477   0
117   1    3    2358957  0.000  0.2500  0.164   0.0006  0.2490   0
117   2    3    2348875  1.000  0.0000  0.122   0.0616  -0.0017  0
1344  1    12   2372286  0.000  0.2500  0.149   0.0003  0.2480   0
1344  1    13   2370435  0.000  0.2500  0.148   0.0003  0.2465   0
1344  12   13   2374888  1.000  0.0000  0.117   0.0582  0.0003   0
Each row above provides information for one pair of individuals. The columns are
FID: Family ID for the pair
ID1: Individual ID for the first individual of the pair
ID2: Individual ID for the second individual of the pair
N_SNP: The number of SNPs that do not have missing genotypes in either individual
Z0: Pr(IBD=0) as specified by the provided pedigree data
Phi: Kinship coefficient as specified by the provided pedigree data
HetHet: Proportion of SNPs with double heterozygotes (e.g., AG and AG)
IBS0: Proportion of SNPs with zero IBS (identical-by-state) (e.g., AA and GG)
Kinship: Estimated kinship coefficient from the SNP data
Error: Flag indicating differences between the estimated and specified kinship coefficients (1 for error, 0.5 for warning)
The default kinship coefficient estimation uses only the SNP data from the pair of individuals in question, and the inference is robust to population structure. A negative estimated kinship coefficient indicates an unrelated pair. Negative estimates are not set to zero because a strongly negative value may indicate population structure between the two individuals. Close relatives can be inferred fairly reliably from the estimated kinship coefficients using the following simple rule: estimated kinship coefficients in the ranges >0.354, [0.177, 0.354], [0.0884, 0.177], and [0.0442, 0.0884] correspond to duplicate/MZ twin, 1st-degree, 2nd-degree, and 3rd-degree relationships, respectively. Relationship inference for more distant relationships is more challenging. Plotting the estimated kinship coefficient against the proportion of zero IBS-sharing is highly recommended. In the absence of population structure, relationship inference can also be carried out using an alternative algorithm through parameter "--homo".
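The kinship ranges above can be written as a small lookup. The Python sketch below applies them to a few estimates from the example king.kin output. The function name infer_degree and the label strings are illustrative, and treating everything below 0.0442 as "UN" is a simplifying assumption, since more distant relationships cannot be reliably separated from unrelated pairs this way. The cutoffs coincide with 2^-(d + 1.5) for d = 0, 1, 2, 3 after rounding.

```python
def infer_degree(kinship: float) -> str:
    """Map an estimated kinship coefficient to a relationship category."""
    if kinship > 0.354:
        return "Dup/MZ"
    if kinship >= 0.177:
        return "1st"
    if kinship >= 0.0884:
        return "2nd"
    if kinship >= 0.0442:
        return "3rd"
    # Assumption: anything lower is reported as unrelated (or too distant to call).
    return "UN"


# Estimates taken from the example outputs in this tutorial.
for estimate in (0.2459, 0.1356, -0.0108):
    print(estimate, infer_degree(estimate))
```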
Here is an example of the relationship inference using the HapMap GWAS data: PDF and its R code
The output for between-family relationship checking (saved in file king.kin0) will look like this:
FID1  ID1  FID2  ID2  N_SNP    HetHet  IBS0    Kinship
28    3    117   1    2360618  0.143   0.0267  0.1356
28    3    117   2    2352628  0.161   0.0009  0.2441
28    3    117   3    2354540  0.120   0.0624  -0.0119
28    3    1344  1    2361807  0.093   0.1095  -0.2295
28    3    1344  12   2367180  0.094   0.1091  -0.2225
28    3    1344  13   2364816  0.093   0.1082  -0.2224
117   1    1344  1    2362787  0.094   0.1093  -0.2312
117   1    1344  12   2368467  0.095   0.1088  -0.2230
117   1    1344  13   2365036  0.094   0.1084  -0.2253
117   2    1344  1    2354855  0.094   0.1084  -0.2281
117   2    1344  12   2361351  0.095   0.1078  -0.2206
117   2    1344  13   2357936  0.095   0.1067  -0.2190
117   3    1344  1    2357771  0.094   0.1102  -0.2348
117   3    1344  12   2364365  0.095   0.1086  -0.2232
117   3    1344  13   2361061  0.094   0.1096  -0.2301
This analysis shows the "unrelated" families 28 and 117 are actually connected through an unreported parent-offspring pair (28_3, 117_2).
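Scanning a between-family output for unexpected relatives is easy to script. The Python sketch below parses a few rows copied from the table above and keeps pairs whose estimated kinship is at least 0.0884 (up to 2nd-degree). It assumes the file is whitespace-delimited with the header shown. The function name related_between_families and the FID_ID labels are illustrative.

```python
KIN0_EXCERPT = """\
FID1 ID1 FID2 ID2 N_SNP HetHet IBS0 Kinship
28 3 117 1 2360618 0.143 0.0267 0.1356
28 3 117 2 2352628 0.161 0.0009 0.2441
28 3 117 3 2354540 0.120 0.0624 -0.0119
28 3 1344 1 2361807 0.093 0.1095 -0.2295
"""


def related_between_families(text, min_kinship=0.0884):
    """Return (person1, person2, kinship) for pairs at or above the cutoff."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    hits = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        kinship = float(row["Kinship"])
        if kinship >= min_kinship:
            first = f'{row["FID1"]}_{row["ID1"]}'
            second = f'{row["FID2"]}_{row["ID2"]}'
            hits.append((first, second, kinship))
    return hits


for hit in related_between_families(KIN0_EXCERPT):
    print(hit)
```

On this excerpt the filter keeps the parent-offspring pair (28_3, 117_2) and also the pair (28_3, 117_1), whose estimate of 0.1356 falls in the 2nd-degree range.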
Here is an example of relationship inference across families using the HapMap GWAS data: PDF and its R code
Examples of IBD segment analysis are
prompt> king -b ex.bed --ibdseg
prompt> king -b ex.bed --ibdseg --degree 3 --rplot --prefix ex
The second command specifies that only pairs with an IBD proportion > 0.0884 will be saved in the output. Since writing to hard drives is usually the computing bottleneck, the "--degree 3" option can save a substantial amount of computational time without sacrificing any inference accuracy. "--rplot" makes plots using the inferred relatedness results (as in file ex.seg).
The summary of IBD segments in file ex.seg will look like this:
FID1  ID1      FID2  ID2      MaxIBD1  MaxIBD2  IBD1Seg  IBD2Seg  PropIBD  InfType
1330  NA12335  1330  NA12340  105.7    0.0      1.0000   0.0000   0.5000   PO
1330  NA12335  1330  NA12341  105.7    0.0      1.0000   0.0000   0.5000   PO
1330  NA12336  1330  NA12342  105.7    0.0      0.9951   0.0000   0.4976   PO
1330  NA12336  1330  NA12343  105.7    0.0      1.0000   0.0000   0.5000   PO
1328  NA06984  1345  NA07346  21.7     0.0      0.0160   0.0000   0.0080   UN
1334  NA10846  1334  NA12144  105.7    0.0      1.0000   0.0000   0.5000   PO
1334  NA10846  1334  NA12145  105.7    0.0      1.0000   0.0000   0.5000   PO
1334  NA10847  1334  NA12146  105.7    0.0      1.0000   0.0000   0.5000   PO
1334  NA10847  1334  NA12239  105.7    0.0      1.0000   0.0000   0.5000   PO
Each row above provides information for one pair of individuals. The columns are
FID1: Family ID for the first individual of the pair
ID1: Individual ID for the first individual of the pair
FID2: Family ID for the second individual of the pair
ID2: Individual ID for the second individual of the pair
MaxIBD1: Length of the longest IBD1 segment (in Mb)
MaxIBD2: Length of the longest IBD2 segment (in Mb)
IBD1Seg: Total length of IBD1 segments divided by total length of all segments
IBD2Seg: Total length of IBD2 segments divided by total length of all segments
PropIBD: Proportion of IBD. It is calculated as IBD2Seg + IBD1Seg/2
InfType: Inferred relationship type, such as Dup/MZTwin, PO, FS, 2nd/3rd, UN
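The PropIBD formula can be checked against the rows of the example ex.seg output. The short Python sketch below does so. The function name prop_ibd is illustrative. Under the usual definition of the kinship coefficient (IBD2/2 + IBD1/4), PropIBD equals twice the kinship coefficient, which is consistent with a value of 0.5 for a parent-offspring pair.

```python
def prop_ibd(ibd1_seg: float, ibd2_seg: float) -> float:
    """Proportion of the genome shared IBD: IBD2Seg + IBD1Seg / 2."""
    return ibd2_seg + ibd1_seg / 2


# (IBD1Seg, IBD2Seg, PropIBD as reported) from three rows of the ex.seg example.
examples = [
    (1.0000, 0.0000, 0.5000),
    (0.9951, 0.0000, 0.4976),
    (0.0160, 0.0000, 0.0080),
]
for ibd1, ibd2, reported in examples:
    computed = prop_ibd(ibd1, ibd2)
    # The reported values are rounded to four decimals, so compare with a tolerance.
    print(f"computed={computed:.5f} reported={reported:.4f}",
          "ok" if abs(computed - reported) < 1e-4 else "mismatch")
```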
The detailed IBD segments are in a gzip-compressed file, ex.segments.gz. The first lines of "zcat ex.segments.gz" look like this:
FID1  ID1      FID2  ID2      IBDType  Chr  StartMB  StopMB   StartSNP    StopSNP     N_SNP  Length
1330  NA12335  1330  NA12340  IBD1     1    57.074   115.800  rs12032625  rs34385187  384    58.7
1330  NA12335  1330  NA12340  IBD1     1    152.971  242.844  rs925042    rs10803228  640    89.9
1330  NA12335  1330  NA12340  IBD1     2    3.968    76.724   rs4261733   rs9332421   512    72.8
1330  NA12335  1330  NA12340  IBD1     2    165.994  238.148  rs1835889   rs2280288   448    72.2
1330  NA12335  1330  NA12340  IBD1     3    2.625    68.873   rs1382866   rs6795777   512    66.2
1330  NA12335  1330  NA12340  IBD1     3    113.243  195.396  rs1844925   rs4441603   512    82.2
1330  NA12335  1330  NA12340  IBD1     5    72.046   177.796  rs3010239   rs6422347   704    105.7
1330  NA12335  1330  NA12340  IBD1     6    3.685    47.246   rs11242871  rs9381515   384    43.6
1330  NA12335  1330  NA12340  IBD1     6    88.258   165.734  rs3778671   rs520349    512    77.5
Each row above provides information for one IBD segment in one pair of individuals. The columns are
FID1: Family ID for the first individual of the pair
ID1: Individual ID for the first individual of the pair
FID2: Family ID for the second individual of the pair
ID2: Individual ID for the second individual of the pair
IBDType: Type of IBD segment: IBD1 or IBD2
Chr: Chromosome number
StartMB: Start position of the IBD segment (in Mb)
StopMB: Stop position of the IBD segment (in Mb)
StartSNP: Start SNP of the IBD segment
StopSNP: Stop SNP of the IBD segment
N_SNP: The number of SNPs in the IBD segment
Length: Total length of the IBD segment (in Mb)
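The per-segment file connects directly to the per-pair summary: the longest IBD1 segment of a pair should match the MaxIBD1 column of ex.seg. The Python sketch below takes three rows copied from the example above and finds the longest segment per pair and IBD type. It assumes whitespace-delimited columns with the header shown, and the function name longest_segments is illustrative.

```python
SEGMENTS_EXCERPT = """\
FID1 ID1 FID2 ID2 IBDType Chr StartMB StopMB StartSNP StopSNP N_SNP Length
1330 NA12335 1330 NA12340 IBD1 1 57.074 115.800 rs12032625 rs34385187 384 58.7
1330 NA12335 1330 NA12340 IBD1 5 72.046 177.796 rs3010239 rs6422347 704 105.7
1330 NA12335 1330 NA12340 IBD1 6 3.685 47.246 rs11242871 rs9381515 384 43.6
"""


def longest_segments(text):
    """Return {(ID1, ID2, IBDType): longest segment length in Mb}."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    longest = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        key = (row["ID1"], row["ID2"], row["IBDType"])
        longest[key] = max(longest.get(key, 0.0), float(row["Length"]))
    return longest


print(longest_segments(SEGMENTS_EXCERPT))
```

For the pair NA12335 and NA12340, the longest IBD1 segment is the 105.7 Mb segment on chromosome 5, in agreement with MaxIBD1 in the ex.seg summary. To read the real file rather than an excerpt, the text could be obtained with Python's gzip.open("ex.segments.gz", "rt").read().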
IBD segment results are quite informative and worth further exploration. Applications of inferred IBD segments can go beyond relationship inference. One effective way to explore them is to plot all IBD segments for each pair of close relatives. Here is the R code for visualizing IBD segments: king_segments_plot.R. To run this R code, please first make sure the ggplot2 and parallel libraries are properly installed. You can then type
prompt> Rscript king_segments_plot.R ex ibdseg
where ex is the specified prefix. Each pair of close relatives is plotted in its own file, and all plot files are bundled together in a single file, ex_ibdseg_rplots.tar.gz. Below is one example plot, showing the IBD segments of one pair of individuals from the ex dataset:
Integrated relationship inference is available in KING through the --related option. Multiple relatedness measures are used to screen close relatives. Examples of integrated relationship inference are
prompt> king -b ex.bed --related
prompt> king -b ex.bed --related --degree 2 --rplot --prefix ex
The first command identifies close relatives up to the first degree, and the second command identifies close relatives up to the second degree. Identifying first-degree relatives only offers substantial computational advantages over identifying relatives up to the second degree, and --related without the --degree option is highly recommended. "--rplot" makes plots using both reported and inferred relationships (in output files ex.kin and ex.kin0). Here is an example of the inference plots generated by the second command above: ex_relplot.pdf.
The summary of relationship inference in file ex.kin will look like this:
prompt> head ex.kin
FID   ID1      ID2      N_SNP  Z0     Phi     HetHet  IBS0    HetConc  HomIBS0  Kinship  IBD1Seg  IBD2Seg  PropIBD  InfType  Error
Y001  NA18484  NA18486  18250  0.000  0.2500  0.2324  0.0002  0.3368   0.0006   0.2515   0.9945   0.0000   0.4972   PO       0
Y001  NA18484  NA18488  18249  0.000  0.2500  0.2332  0.0002  0.3379   0.0004   0.2522   1.0000   0.0000   0.5000   PO       0
Y001  NA18486  NA18488  18270  1.000  0.0000  0.2141  0.1053  0.3036   0.2460   0.0039   0.0000   0.0000   0.0000   UN       0
Y002  NA18485  NA18487  18276  0.000  0.2500  0.2349  0.0002  0.3413   0.0005   0.2541   1.0000   0.0000   0.5000   PO       0
Y002  NA18485  NA18489  18275  0.000  0.2500  0.2274  0.0003  0.3298   0.0007   0.2474   1.0000   0.0000   0.5000   PO       0
Y002  NA18487  NA18489  18269  1.000  0.0000  0.2098  0.1120  0.2999   0.2601   -0.0157  0.0000   0.0000   0.0000   UN       0
Y003  NA18497  NA18498  18262  0.000  0.2500  0.2232  0.0003  0.3254   0.0009   0.2448   0.9951   0.0000   0.4975   PO       0
Y003  NA18497  NA18499  18258  0.000  0.2500  0.2310  0.0002  0.3387   0.0006   0.2525   1.0000   0.0000   0.5000   PO       0
Y003  NA18498  NA18499  18280  1.000  0.0000  0.2100  0.1043  0.2997   0.2452   0.0015   0.0000   0.0000   0.0000   UN       0
Each row above provides information for one pair of individuals. The columns are
FID: Family ID for the pair
ID1: Individual ID for the first individual of the pair
ID2: Individual ID for the second individual of the pair
N_SNP: The number of SNPs that do not have missing genotypes in either individual
Z0: Pr(IBD=0) as specified by the provided pedigree data
Phi: Kinship coefficient as specified by the provided pedigree data
HetHet: Proportion of SNPs with double heterozygotes (e.g., AG and AG)
IBS0: Proportion of SNPs with zero IBS (identical-by-state) (e.g., AA and GG)
Kinship: Estimated kinship coefficient from the SNP data
IBD1Seg: Total length of IBD1 segments divided by total length of all segments
IBD2Seg: Total length of IBD2 segments divided by total length of all segments
PropIBD: Proportion of IBD. It is calculated as IBD2Seg + IBD1Seg/2
InfType: Inferred relationship type, such as Dup/MZTwin, PO, FS, 2nd, 3rd, UN
Error: Flag indicating differences between inferred and reported relationships (1 for error, 0.5 for warning)
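A quick way to review a within-family output is to count the inferred relationship types and list any pairs with a nonzero Error flag. The Python sketch below does this for three rows copied from the ex.kin example above. It assumes whitespace-delimited columns with the header shown, and the function name summarize_kin is illustrative.

```python
from collections import Counter

KIN_EXCERPT = """\
FID ID1 ID2 N_SNP Z0 Phi HetHet IBS0 HetConc HomIBS0 Kinship IBD1Seg IBD2Seg PropIBD InfType Error
Y001 NA18484 NA18486 18250 0.000 0.2500 0.2324 0.0002 0.3368 0.0006 0.2515 0.9945 0.0000 0.4972 PO 0
Y001 NA18484 NA18488 18249 0.000 0.2500 0.2332 0.0002 0.3379 0.0004 0.2522 1.0000 0.0000 0.5000 PO 0
Y001 NA18486 NA18488 18270 1.000 0.0000 0.2141 0.1053 0.3036 0.2460 0.0039 0.0000 0.0000 0.0000 UN 0
"""


def summarize_kin(text):
    """Return (counts of InfType, list of pairs with a nonzero Error flag)."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = [dict(zip(header, line.split())) for line in lines[1:]]
    counts = Counter(row["InfType"] for row in rows)
    # Error is 1 for an error and 0.5 for a warning, so any value above 0 is flagged.
    flagged = [(row["FID"], row["ID1"], row["ID2"])
               for row in rows if float(row["Error"]) > 0]
    return counts, flagged


counts, flagged = summarize_kin(KIN_EXCERPT)
print(dict(counts))
print(flagged)
```

In this excerpt, family Y001 shows two parent-offspring pairs and one unrelated pair (the two parents), and no pair is flagged.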
The following parameters can also be specified:
--prefix specifies the name of the file that stores various relatedness measures. "king" is used as default.
--rplot generates R code and calls R to produce PDF plots.
--cpus specifies the number of CPU cores to be used for parallel computing. If not specified, the default is half of the total number of (logical) cores.
--degree specifies the degree of relatedness. It can be combined with multiple KING analyses, including --kinship, --ibdseg, --related, --unrelated, --build, and --cluster.
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26(22):2867-2873
Last updated: February 13, 2018 by Wei-Min Chen