KING Tutorial: Relationship Inference

KING is a toolset to explore genotype data from a genome-wide association study (GWAS) or a sequencing project. The latest version is KING 2.1.6, available since November 28, 2018. KING can be used to check family relationships and flag pedigree errors by estimating kinship coefficients and inferring IBD segments for all pairwise relationships. Unrelated pairs can be precisely separated from close relatives with no false positives, with accuracy up to the 3rd or 4th degree (depending on array or WGS data) for --related and --ibdseg analyses, and up to the 2nd degree for --kinship analysis.

This tutorial discusses different types of relationship inference such as the kinship coefficient estimates and the IBD segment inference, as well as derived applications such as pedigree reconstruction and extraction of a subset of unrelated individuals. Other applications of KING such as Quality Control (QC), the identification of population substructure or gene mapping are described elsewhere.

Family relationship inference in KING is very FAST (seconds to identify all close relatives among tens of thousands of samples), and robust to a number of realistic scenarios including the presence of population structure. The number of samples in the dataset can be as small as 2 (for --kinship inference), or as large as > 10,000,000 (for --duplicate and --related inferences). Genome-wide SNPs are required in KING. Please do not prune or filter any "good" SNPs that pass QC prior to any KING inference, unless there are too many variants to fit in computer memory, e.g., > 100,000,000 as in a WGS study, in which case rare variants can be filtered out. LD pruning is not recommended in KING.


GENERAL INPUT FILES

The input files for KING need to be in PLINK binary format, which consists of a binary genotype file, a family file, and a map file, e.g., ex.bed, ex.fam, and ex.bim. The binary format compresses genotype data efficiently by using two bits to represent a genotype, which offers substantial computational savings that are essential to KING analysis. The amount of computer memory required by KING analysis is modest, at ~N ✕ M / 4 bytes (where N is the number of samples and M is the number of SNPs) plus a small percentage of overhead. E.g., for a dataset consisting of 100,000 samples each genotyped at 1,000,000 SNPs, the required memory size is ~25GB. Examples of reading in a dataset are:

  prompt> king -b ex.bed --related
  prompt> king -b ex.bed --fam ex.fam --bim ex.bim --related
In the first example, although only ex.bed is specified, the other two input files are assumed to be ex.fam and ex.bim. If the other two input files have a different prefix, the second example can be used instead. One advanced option is available in KING for analyzing multiple datasets (with multiple sets of input files):
  prompt> king -b ex,mystudy --duplicate
In the example above, KING reads in two sets of data (ex.bed, ex.fam, ex.bim, mystudy.bed, mystudy.fam, mystudy.bim) and then identifies all duplicate pairs, within and across datasets. One strength of merging datasets in KING over alternative tools is that users do not need to worry about allele strands, which KING takes care of by either autoflip (at unambiguous SNPs) or removal (at ambiguous SNPs). The only requirement is that all IDs (combinations of FID and IID) must be unique within and across datasets.
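
The memory estimate given earlier in this section (about N ✕ M / 4 bytes, because each genotype takes two bits) can be checked with a short Python sketch. The function name and the optional overhead argument are illustrative assumptions; the tutorial only says that the overhead is a small percentage.

```python
def king_memory_gb(n_samples, n_snps, overhead=0.0):
    """Approximate RAM in GB (1 GB = 1e9 bytes) for the genotype data.

    Each genotype uses 2 bits (1/4 byte), so the data need roughly
    n_samples * n_snps / 4 bytes. `overhead` is a hypothetical fraction
    added on top, standing in for the unspecified "small percentage".
    """
    bytes_needed = n_samples * n_snps / 4
    return bytes_needed * (1 + overhead) / 1e9


# Example from the text: 100,000 samples genotyped at 1,000,000 SNPs
print(king_memory_gb(100_000, 1_000_000))  # 25.0
```

Running the same function on the planned sample and SNP counts before an analysis gives a quick sense of whether the data will fit on a given server.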

An example dataset can be downloaded at this link: ex.tar.gz [1.35MB]. This dataset consists of 332 HapMap samples, and will be used throughout this tutorial.

KING relationship inference has also been well tested on genome sequence data, even though KING was originally designed for GWAS. A VCF file of sequence data can be easily converted into PLINK binary format using PLINK2:

 prompt> plink2 --vcf example.vcf.gz --make-bed --out ex


RELATIONSHIP INFERENCE OPTIONS

Relationship can be verified between each pair of individuals. Two of the algorithms implemented in KING are highly recommended. The first algorithm estimates pair-wise kinship coefficients (through option --kinship), and the second algorithm infers pairwise IBD (identical by descent) segments (through option --ibdseg). Both algorithms can also be integrated in a single inference procedure through option --related. Examples are

  prompt> king -b ex.bed --kinship
  prompt> king -b ex.bed --ibdseg
  prompt> king -b ex.bed --related
  prompt> king -b ex.bed --ibs
  prompt> king -b ex.bed --homog
Faster algorithms are available for identifying close relationships:
  prompt> king -b ex.bed --duplicate
  prompt> king -b ex.bed --related
  prompt> king -b ex.bed --related --degree 2
The examples above identify all duplicate pairs (including MZ twins), all close relatives up to the first degree, and all close relatives up to the second degree, respectively. Other useful applications include:
  prompt> king -b ex.bed --unrelated
  prompt> king -b ex.bed --build
  prompt> king -b ex.bed --cluster
Here is a brief description of various relationship inference options:

--kinship estimates pair-wise kinship coefficients. Degrees of relatedness can be filtered via parameter "--degree" and only relative pairs with larger kinship coefficients are included in the inference results. In our tests, --kinship --degree has been successfully applied to datasets consisting of 1 million samples. The output files are separated for relationships that are within or between families. Note an unrelated individual is treated as a family of size one. If the datasets only consist of unrelated individuals as reported, then all results are saved in the between-family output. More details of --kinship inference are available in the KINSHIP INFERENCE section later in this tutorial.

--ibdseg carries out the IBD segment analysis and is newly available in version 2.1. IBD segment analysis determines all IBD (IBD1 and IBD2) segments shared between relatives, from which relatedness can be inferred. Inferring IBD segments in KING is as fast as estimating kinship coefficients, e.g., seconds in thousands of samples, in contrast to days as required by alternative tools. Degrees of relatedness can be filtered via parameter "--degree" and only relative pairs with longer IBD segments are included in the inference results. In our tests, --ibdseg --degree has been successfully applied to datasets consisting of 1 million samples. More details of --ibdseg analysis are available in the IBD SEGMENT INFERENCE section later in this tutorial.

--related provides integrative, fast, and accurate inference for close relationships. This option is highly recommended, especially when dealing with biobank-level datasets. Integration of the IBD segment inference further improves the inference accuracy. In our tests, --related has been successfully applied to datasets consisting of ~10 million samples. When "--rplot" is specified, several relationship plots are generated automatically. --related --degree 2 specifies that only related pairs (up to the 2nd-degree in this case) between families are included in the output. Specifically, all pairs across families with a kinship coefficient less than 0.0884 will be excluded from the output. More details of --related analysis are available in the INTEGRATED AND FAST INFERENCE FOR CLOSE RELATIVES section later in this tutorial.

--duplicate implements the fastest (and accurate) algorithm to identify duplicates or MZ twins. The running time is in seconds, unless the number of samples is > 1,000,000 in which case a few minutes may be needed. In our tests, --duplicate has been successfully applied to datasets consisting of ~10 million samples. One potential application of the duplicate analysis is to identify duplicates across different studies, in which case multiple datasets can be read in conveniently as shown in the GENERAL INPUT FILES section.

--ibs provides summary statistics such as the counts of IBS0, IBS1, IBS2, and the average IBS, in addition to the kinship estimates.

--homog estimates pair-wise kinship coefficients assuming a homogeneous population. The best application of --homog may be for the linear mixed models (LMM), where the population structure information needs to be explicitly incorporated in the kinship coefficient estimation. Although --homog is not recommended as a good method to infer relatedness in general populations, it provides inference results comparable to multiple alternative methods.

--unrelated is a handy option to extract a list of unrelated individuals. E.g.,

  prompt> king -b ex.bed --unrelated --degree 2
This example estimates relatedness in the data first, followed by extracting a list of individuals that contains no pairs of individuals with a 1st- or 2nd-degree relationship. The detailed algorithm is described in Manichaikul et al. 2012.
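
To make the goal of --unrelated concrete (a subset in which no pair is related up to the chosen degree), here is a minimal Python sketch of a simple greedy heuristic. It is NOT the algorithm implemented in KING, which is described in the cited reference; the function name, the input format, and the greedy rule are all assumptions made for illustration only.

```python
def greedy_unrelated(individuals, related_pairs):
    """Return a sorted subset of individuals containing no related pair.

    Illustrative heuristic (not KING's algorithm): repeatedly drop the
    individual who still has the most relatives in the kept set.
    """
    keep = set(individuals)
    neighbors = {ind: set() for ind in individuals}
    for a, b in related_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    while keep:
        # Number of relatives each kept individual still has in the kept set
        counts = {ind: len(neighbors[ind] & keep) for ind in keep}
        worst = max(sorted(counts), key=counts.get)
        if counts[worst] == 0:
            break  # no related pair remains
        keep.remove(worst)
    return sorted(keep)


# A parent-offspring trio: the child is related to both parents
trio = ["child", "father", "mother"]
pairs = [("child", "father"), ("child", "mother")]
print(greedy_unrelated(trio, pairs))  # ['father', 'mother']
```

In this toy trio, removing the child keeps two unrelated individuals instead of one, which is the kind of outcome an unrelated-subset extraction aims for.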

--build reconstructs pedigrees using SNP data without the need of specifying pedigrees (although the pedigree information can still be incorporated):

  prompt> king -b ex.bed --build
  prompt> king -b ex.bed --build --degree 2
The output includes two files: kingupdateids.txt and kingupdateparents.txt. Users can then use these two files to update the pedigree data, e.g., in plink 1.9:
  prompt> plink1.9 --bfile ex --update-ids kingupdateids.txt --make-bed --out ex2
  prompt> plink1.9 --bfile ex2 --update-parents kingupdateparents.txt --make-bed --out ex3
The reconstructed pedigrees are now reflected in the saved files ex3.bed, ex3.fam and ex3.bim. The current --build algorithm connects all 1st-degree relatives with high accuracy. Known scenarios that --build handles well are families that consist of at least a pair of full siblings, and/or a parent-child trio, etc. Algorithms that can utilize higher-degree relatedness are currently under development.

--cluster is both a standalone parameter and a parameter to go with other options. As a standalone option, it clusters relatives into families by generating an updateid file which can then be used to update the pedigrees (e.g., using PLINK --update-ids). --cluster can also be used to group cryptic relatives together prior to association analysis, e.g.,

  prompt> king -b ex.bed --cluster --tdt
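
Conceptually, clustering relatives into families amounts to grouping individuals that are linked by a chain of related pairs. The Python sketch below shows that idea with a union-find structure. It is an assumption about the general approach, not a description of KING's internal implementation, and the function name and IDs are hypothetical.

```python
def cluster_families(individuals, related_pairs):
    """Group individuals into clusters linked by related pairs (union-find)."""
    parent = {ind: ind for ind in individuals}

    def find(x):
        # Follow parent links to the cluster representative
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in related_pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for ind in individuals:
        clusters.setdefault(find(ind), []).append(ind)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical IDs: A-B and B-C are related, D is related to nobody
print(cluster_families(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]))
# [['A', 'B', 'C'], ['D']]
```

Each resulting cluster could then be assigned one family ID, which is the role the generated updateid file plays in the KING workflow.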


KINSHIP INFERENCE

--kinship estimates pair-wise kinship coefficients using the KING-Robust algorithm described in the original KING paper. If pedigrees are documented in the .fam file (see LINKAGE format examples), kinship coefficients can be estimated within families. Note if each FID is unique and no pedigrees are provided, then the within-family inference will be skipped. The output for within-family relationship checking using --kinship (saved in file king.kin) will look like this:

FID     ID1     ID2     N_SNP   Z0      Phi     HetHet  IBS0    Kinship Error
28      1       2       2359853 0.000   0.2500  0.162   0.0008  0.2459  0
28      1       3       2351257 0.000   0.2500  0.161   0.0008  0.2466  0
28      2       3       2368538 1.000   0.0000  0.120   0.0634  -0.0108 0
117     1       2       2354279 0.000   0.2500  0.163   0.0006  0.2477  0
117     1       3       2358957 0.000   0.2500  0.164   0.0006  0.2490  0
117     2       3       2348875 1.000   0.0000  0.122   0.0616  -0.0017 0
1344    1       12      2372286 0.000   0.2500  0.149   0.0003  0.2480  0
1344    1       13      2370435 0.000   0.2500  0.148   0.0003  0.2465  0
1344    12      13      2374888 1.000   0.0000  0.117   0.0582  0.0003  0
Each row above provides information for one pair of individuals. The columns are
FID: Family ID for the pair
ID1: Individual ID for the first individual of the pair
ID2: Individual ID for the second individual of the pair
N_SNP: The number of SNPs with non-missing genotypes in both individuals
Z0: Pr(IBD=0) as specified by the provided pedigree data
Phi: Kinship coefficient as specified by the provided pedigree data
HetHet: Proportion of SNPs with double heterozygotes (e.g., AG and AG)
IBS0: Proportion of SNPs with zero IBS (identical-by-state) (e.g., AA and GG)
Kinship: Estimated kinship coefficient from the SNP data
Error: Flag indicating differences between the estimated and specified kinship coefficients (1 for error, 0.5 for warning)

The default kinship coefficient estimation only involves the use of SNP data from this pair of individuals, and the inference is robust to population structure. A negative estimated kinship coefficient indicates an unrelated pair. Negative estimates are not set to zero because a strongly negative value may indicate population structure between the two individuals. Close relatives can be inferred fairly reliably based on the estimated kinship coefficients as shown in the following simple algorithm: an estimated kinship coefficient range >0.354, [0.177, 0.354], [0.0884, 0.177] and [0.0442, 0.0884] corresponds to duplicate/MZ twin, 1st-degree, 2nd-degree, and 3rd-degree relationships respectively. Relationship inference for more distant relationships is more challenging. A plot of the estimated kinship coefficient against the proportion of zero IBS-sharing is highly recommended. In the absence of population structure, relationship inference can also be carried out using an alternative algorithm through parameter "--homog".
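
The simple range-based algorithm above can be written as a small Python sketch. The cutoffs are the ones quoted in the text; the function name, the labels, and the treatment of values that fall exactly on a boundary (which the text leaves open) are assumptions.

```python
# Cutoffs quoted in the text, from closest to most distant relationship
CUTOFFS = [
    (0.354, "Dup/MZ"),
    (0.177, "1st-degree"),
    (0.0884, "2nd-degree"),
    (0.0442, "3rd-degree"),
]


def classify_kinship(kinship):
    """Map an estimated kinship coefficient to a relationship category."""
    for cutoff, label in CUTOFFS:
        if kinship > cutoff:  # boundary handling is an assumption
            return label
    return "more distant or unrelated"


# Kinship estimates taken from the king.kin example above
for value in (0.2459, -0.0108):
    print(value, classify_kinship(value))
```

With the example values, 0.2459 falls in the 1st-degree range (a reported parent-offspring pair) and -0.0108 falls below all cutoffs.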

Here is an example of the relationship inference using the HapMap GWAS data: PDF and its R code

The majority of kinship inference is carried out across families (if pedigrees exist) or between individuals (when no pedigrees are documented). The output for between-family relationship checking (saved in file king.kin0) will look like this:

FID1    ID1     FID2    ID2     N_SNP   HetHet  IBS0    Kinship
28      3       117     1       2360618 0.143   0.0267  0.1356
28      3       117     2       2352628 0.161   0.0009  0.2441
28      3       117     3       2354540 0.120   0.0624  -0.0119
28      3       1344    1       2361807 0.093   0.1095  -0.2295
28      3       1344    12      2367180 0.094   0.1091  -0.2225
28      3       1344    13      2364816 0.093   0.1082  -0.2224
117     1       1344    1       2362787 0.094   0.1093  -0.2312
117     1       1344    12      2368467 0.095   0.1088  -0.2230
117     1       1344    13      2365036 0.094   0.1084  -0.2253
117     2       1344    1       2354855 0.094   0.1084  -0.2281
117     2       1344    12      2361351 0.095   0.1078  -0.2206
117     2       1344    13      2357936 0.095   0.1067  -0.2190
117     3       1344    1       2357771 0.094   0.1102  -0.2348
117     3       1344    12      2364365 0.095   0.1086  -0.2232
117     3       1344    13      2361061 0.094   0.1096  -0.2301
This analysis shows the "unrelated" families 28 and 117 are actually connected through an unreported parent-offspring pair (FID 28 IID 3, and FID 117 IID 2). Here is an example of relationship inference across families using the HapMap GWAS data: PDF and its R code
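
Unreported relatives such as these can be pulled out of a king.kin0 file with a few lines of Python. The sketch below parses a handful of rows copied from the table above; the function name and the default cutoff of 0.0884 (the 2nd-degree bound quoted earlier) are illustrative choices.

```python
KIN0_TEXT = """\
FID1 ID1 FID2 ID2 N_SNP HetHet IBS0 Kinship
28 3 117 1 2360618 0.143 0.0267 0.1356
28 3 117 2 2352628 0.161 0.0009 0.2441
28 3 117 3 2354540 0.120 0.0624 -0.0119
28 3 1344 1 2361807 0.093 0.1095 -0.2295
"""


def related_pairs(text, min_kinship=0.0884):
    """Return (FID1, ID1, FID2, ID2, Kinship) for pairs above the cutoff."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    hits = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        kinship = float(row["Kinship"])
        if kinship > min_kinship:
            hits.append((row["FID1"], row["ID1"], row["FID2"], row["ID2"], kinship))
    return hits


for hit in related_pairs(KIN0_TEXT):
    print(hit)
```

For a real analysis the text would come from the output file instead, e.g. `related_pairs(open("king.kin0").read())`.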

One way to speed up the computing without sacrificing inference accuracy is to use the --degree option, since writing to hard drives is usually the computing bottleneck:

  prompt> king -b ex.bed --kinship --degree 2
In this example, only pairs with kinship coefficient > 0.0884 are saved in the king.kin0 output file. In our tests, --kinship --degree has been successfully applied to datasets consisting of 1 million samples on a single server (with many CPU cores).
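
The kinship cutoffs quoted in this tutorial (0.177, 0.0884 and 0.0442 as the lower bounds for the 1st, 2nd and 3rd degree) are consistent with the closed form 2^-(d + 1.5) for degree d. The Python sketch below uses that formula; it is inferred from the quoted numbers, so applying it to other degrees is an assumption, and the function name is hypothetical.

```python
def kinship_cutoff(degree):
    """Lower kinship bound for relatives up to `degree`: 2 ** -(degree + 1.5)."""
    return 2.0 ** -(degree + 1.5)


for degree in (1, 2, 3):
    print(degree, round(kinship_cutoff(degree), 4))
```

For degree 2 the value rounds to 0.0884, matching the cutoff applied by --kinship --degree 2 in the example above.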


IBD SEGMENT INFERENCE

IBD (identical by descent) segments can be rapidly and accurately inferred between any pair of individuals in KING. The associated manuscript is yet to be published but this algorithm has been well tested. Examples of IBD segment analysis are

  prompt> king -b ex.bed --ibdseg
  prompt> king -b ex.bed --ibdseg --degree 3 --rplot --prefix ex
The second command specifies that only pairs with IBD proportion > 0.0884 will be saved in the output. Since writing to hard drives is usually the computing bottleneck, the "--degree 3" option can save a substantial amount of computational time without sacrificing any inference accuracy. In our tests, --ibdseg --degree has been successfully applied to datasets consisting of 1 million samples on a single server (with many CPU cores). "--rplot" generates plots using the inferred relatedness results (as in file ex.seg).

The summary of IBD segments in file ex.seg will look like this:

FID1    ID1     FID2    ID2     MaxIBD1 MaxIBD2 IBD1Seg IBD2Seg PropIBD InfType
1330    NA12335 1330    NA12340 109.8   0.0     0.9959  0.0000  0.4980  PO
1330    NA12335 1330    NA12341 109.8   0.0     1.0000  0.0000  0.5000  PO
1330    NA12336 1330    NA12342 109.8   0.0     0.9944  0.0000  0.4972  PO
1330    NA12336 1330    NA12343 109.8   0.0     0.9942  0.0000  0.4971  PO
1328    NA06984 1345    NA07346 32.7    0.0     0.0228  0.0000  0.0114  UN
1334    NA10846 1334    NA12144 109.8   0.0     1.0000  0.0000  0.5000  PO
1334    NA10846 1334    NA12145 109.8   0.0     1.0000  0.0000  0.5000  PO
1334    NA10847 1334    NA12146 109.8   0.0     1.0000  0.0000  0.5000  PO
1334    NA10847 1334    NA12239 109.8   0.0     0.9889  0.0000  0.4945  PO
Each row above provides information for one pair of individuals. The columns are
FID1: Family ID for the first individual of the pair
ID1: Individual ID for the first individual of the pair
FID2: Family ID for the second individual of the pair
ID2: Individual ID for the second individual of the pair
MaxIBD1: Length of the longest IBD1 segment (in Mb)
MaxIBD2: Length of the longest IBD2 segment (in Mb)
IBD1Seg: Total length of IBD1 segments divided by total length of all segments, estimate of π1=Pr(IBD=1)
IBD2Seg: Total length of IBD2 segments divided by total length of all segments, estimate of π2=Pr(IBD=2)
PropIBD: Proportion of genomes shared identical-by-descent, estimated by IBD2Seg + IBD1Seg/2, estimate of π = π2 + π1/2
InfType: Inferred relationship type, such as Dup/MZTwin, PO, FS, 2nd, 3rd, 4th, UN
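
The PropIBD definition above can be verified against the example rows with a short Python sketch. The function name is hypothetical; the values are copied from the ex.seg table, which reports four decimal places, so agreement is only expected up to rounding.

```python
def prop_ibd(ibd1seg, ibd2seg):
    """Proportion of the genome shared IBD: IBD2Seg + IBD1Seg / 2."""
    return ibd2seg + ibd1seg / 2


# (IBD1Seg, IBD2Seg, reported PropIBD) from the ex.seg example above
rows = [
    (0.9959, 0.0000, 0.4980),
    (1.0000, 0.0000, 0.5000),
    (0.0228, 0.0000, 0.0114),
]
for ibd1, ibd2, reported in rows:
    print(round(prop_ibd(ibd1, ibd2), 4), reported)
```

A parent-offspring pair shares one haplotype across essentially the whole genome, which is why IBD1Seg is close to 1 and PropIBD close to 0.5 in those rows.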

The detailed IBD segments are in a gzipped file ex.segments.gz. The first lines of the output of "zcat ex.segments.gz" look like this:

FID1    ID1     FID2    ID2     IBDType Chr     StartMB StopMB  StartSNP        StopSNP         N_SNP   Length
1330    NA12335 1330    NA12340 IBD1    1       51.799  95.862  rs7534689       rs1858111       294     44.1
1330    NA12335 1330    NA12340 IBD1    1       148.175 247.083 rs1868992       rs12058711      692     98.9
1330    NA12335 1330    NA12340 IBD1    2       0.143   88.714  rs408209        rs7581608       619     88.6
1330    NA12335 1330    NA12340 IBD1    2       165.994 242.590 rs1835889       rs10186231      484     76.6
1330    NA12335 1330    NA12340 IBD1    3       0.080   90.221  rs990284        rs9877833       643     90.1
1330    NA12335 1330    NA12340 IBD1    3       113.243 165.061 rs1844925       rs4519708       320     51.8
1330    NA12335 1330    NA12340 IBD1    5       70.869  180.626 AFFX-SNP_7697354__rs276593      rs876154        738     109.8
1330    NA12335 1330    NA12340 IBD1    6       0.131   58.178  rs736864        rs3863230       482     58.0
1330    NA12335 1330    NA12340 IBD1    6       88.258  170.736 rs3778671       rs734249        560     82.5
Each row above provides information for one IBD segment in one pair of individuals. The columns are
FID1: Family ID for the first individual of the pair                
ID1: Individual ID for the first individual of the pair                  
FID2: Family ID for the second individual of the pair
ID2: Individual ID for the second individual of the pair                 
IBDType: Type of IBD segments: IBD1 or IBD2
Chr: Chromosome number.
StartMB: Start position of the IBD segment (in Mb)
StopMB: Stop position of the IBD segment (in Mb)
StartSNP: Start SNP of the IBD segment
StopSNP: Stop SNP of the IBD segment
N_SNP: The number of SNPs in the IBD segment
Length: Total Length of the IBD segment (in Mb) 
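
As one way to explore the segment file, the Python sketch below totals the IBD segment lengths per pair and finds the longest segment, using three rows copied from the listing above. The function name is hypothetical, and lengths are recomputed as StopMB minus StartMB rather than read from the rounded Length column.

```python
SEGMENTS_TEXT = """\
FID1 ID1 FID2 ID2 IBDType Chr StartMB StopMB StartSNP StopSNP N_SNP Length
1330 NA12335 1330 NA12340 IBD1 1 51.799 95.862 rs7534689 rs1858111 294 44.1
1330 NA12335 1330 NA12340 IBD1 1 148.175 247.083 rs1868992 rs12058711 692 98.9
1330 NA12335 1330 NA12340 IBD1 5 70.869 180.626 AFFX-SNP_7697354__rs276593 rs876154 738 109.8
"""


def summarize_segments(text):
    """Return {(FID1, ID1, FID2, ID2, IBDType): (total_Mb, longest_Mb)}."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    summary = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        key = (row["FID1"], row["ID1"], row["FID2"], row["ID2"], row["IBDType"])
        length = float(row["StopMB"]) - float(row["StartMB"])
        total, longest = summary.get(key, (0.0, 0.0))
        summary[key] = (total + length, max(longest, length))
    return summary


for key, (total, longest) in summarize_segments(SEGMENTS_TEXT).items():
    print(key, round(total, 1), round(longest, 1))
```

The longest segment in these rows, about 109.8 Mb on chromosome 5, agrees with the MaxIBD1 value reported for the same pair in ex.seg. For the real file, the text could be read with `gzip.open("ex.segments.gz", "rt").read()` from Python's standard gzip module.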

IBD segment results are quite informative and worth further exploration. Applications of inferred IBD segments can go beyond relationship inference. One effective way to explore them is to plot all IBD segments for each pair of close relatives. Here is the R code for visualizing IBD segments: king_segments_plot.R. To run this R code, please first make sure the ggplot2 and parallel libraries are properly installed. You can then type

  prompt> Rscript king_segments_plot.R ex ibdseg
where ex is the specified prefix. All pairs of close relatives are plotted in their own files/plots, and all plots/files are gzipped together in a single file ex_ibdseg_rplots.tar.gz.

[Figure: example plot of the IBD segments shared by one pair of individuals in the ex dataset.]



INTEGRATED AND FAST INFERENCE FOR CLOSE RELATIVES

Fast and integrated relationship inference is available in KING through --related option. Close relatives can be accurately inferred in seconds/minutes. The largest dataset we have successfully analyzed on a single server using the --related option consists of ~10 million samples (i.e., ~50,000,000,000,000 pairs!). Multiple relatedness measures are used to screen close relatives. Examples of integrated relationship inference are

  prompt> king -b ex.bed --related
  prompt> king -b ex.bed --related --degree 2 --rplot --prefix ex
The first command identifies close relatives up to the first degree, and the second command identifies close relatives up to the second degree. Identifying first-degree relatives only offers substantial computational advantages over identifying higher degree relatives. --related without the --degree option is highly recommended. Although a degree higher than 2 is allowed, no fast algorithm is available for it at the moment and computation is substantially slower than with --related --degree 2. "--rplot" generates plots using both reported and inferred relationships (in output files ex.kin and ex.kin0). Here is an example of inference plots that are generated by the second command above: ex_relplot.pdf.
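
The pair count quoted in this section for ~10 million samples follows from the number of distinct pairs among n samples, n(n-1)/2. A minimal Python check (the function name is hypothetical):

```python
def n_pairs(n_samples):
    """Number of distinct pairs among n_samples individuals."""
    return n_samples * (n_samples - 1) // 2


print(f"{n_pairs(10_000_000):,}")  # 49,999,995,000,000
print(f"{n_pairs(332):,}")         # 54,946 pairs in the 332-sample example dataset
```

The quadratic growth in the number of pairs is why restricting the output with --degree matters so much for large datasets.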

The summary of relationship inference in file ex.kin will look like this:

FID     ID1     ID2     N_SNP   Z0      Phi     HetHet  IBS0    HetConc HomIBS0 Kinship IBD1Seg IBD2Seg PropIBD InfType Error
Y001    NA18484 NA18486 18250   0.000   0.2500  0.2324  0.0002  0.3368  0.0006  0.2515  0.9750  0.0000  0.4875  PO      0
Y001    NA18484 NA18488 18249   0.000   0.2500  0.2332  0.0002  0.3379  0.0004  0.2522  1.0000  0.0000  0.5000  PO      0
Y001    NA18486 NA18488 18270   1.000   0.0000  0.2141  0.1053  0.3036  0.2460  0.0039  0.0000  0.0000  0.0000  UN      0
Y002    NA18485 NA18487 18276   0.000   0.2500  0.2349  0.0002  0.3413  0.0005  0.2541  1.0000  0.0000  0.5000  PO      0
Y002    NA18485 NA18489 18275   0.000   0.2500  0.2274  0.0003  0.3298  0.0007  0.2474  1.0000  0.0000  0.5000  PO      0
Y002    NA18487 NA18489 18269   1.000   0.0000  0.2098  0.1120  0.2999  0.2601  -0.0157 0.0000  0.0000  0.0000  UN      0
Y003    NA18497 NA18498 18262   0.000   0.2500  0.2232  0.0003  0.3254  0.0009  0.2448  0.9897  0.0000  0.4949  PO      0
Y003    NA18497 NA18499 18258   0.000   0.2500  0.2310  0.0002  0.3387  0.0006  0.2525  0.9612  0.0000  0.4806  PO      0
Y003    NA18498 NA18499 18280   1.000   0.0000  0.2100  0.1043  0.2997  0.2452  0.0015  0.0000  0.0000  0.0000  UN      0
Each row above provides information for one pair of individuals. The columns are
FID: Family ID for the pair
ID1: Individual ID for the first individual of the pair
ID2: Individual ID for the second individual of the pair
N_SNP: The number of SNPs with non-missing genotypes in both individuals
Z0: Pr(IBD=0) as specified by the provided pedigree data
Phi: Kinship coefficient as specified by the provided pedigree data
HetHet: Proportion of SNPs with double heterozygotes (e.g., AG and AG)
IBS0: Proportion of SNPs with zero IBS (identical-by-state) (e.g., AA and GG)
Kinship: Estimated kinship coefficient (φ) from the SNP data
IBD1Seg: Total length of IBD1 segments divided by total length of all segments, estimate of π1=Pr(IBD=1)
IBD2Seg: Total length of IBD2 segments divided by total length of all segments, estimate of π2=Pr(IBD=2)
PropIBD: Proportion of genomes shared identical-by-descent, estimated by IBD2Seg + IBD1Seg/2, estimate of π = π2 + π1/2
InfType: Inferred relationship type, such as Dup/MZTwin, PO, FS, 2nd, 3rd, 4th, UN
Error: Flag indicating differences between inferred and reported relationships (1 for error, 0.5 for warning)
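
The Kinship and PropIBD columns measure closely related quantities. For a non-inbred pair, the kinship coefficient equals π2/2 + π1/4, i.e. half of π; this identity is standard population genetics rather than something stated in this tutorial. The Python sketch below compares the two estimates for three rows copied from the ex.kin example above. The function name and the comparison itself are illustrative and are not how KING sets the Error flag.

```python
def kinship_from_prop_ibd(prop_ibd):
    """Kinship coefficient implied by PropIBD for a non-inbred pair."""
    return prop_ibd / 2


# (ID1, ID2, Kinship, PropIBD) from the ex.kin example above
rows = [
    ("NA18484", "NA18486", 0.2515, 0.4875),
    ("NA18484", "NA18488", 0.2522, 0.5000),
    ("NA18486", "NA18488", 0.0039, 0.0000),
]
for id1, id2, kinship, prop in rows:
    diff = kinship - kinship_from_prop_ibd(prop)
    print(id1, id2, round(diff, 4))
```

In these rows the two estimates differ by less than 0.01, which illustrates that the SNP-based and segment-based measures tell a consistent story for both the parent-offspring and the unrelated pairs.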


OTHER PARAMETERS

The following parameters can also be specified:

--rplot generates R code first and then calls R program to make PDF plots.

--degree specifies the degree of relatedness, which goes with multiple KING analyses including --kinship, --ibdseg, --related, --unrelated, --build, --cluster, etc.

--prefix specifies the name of the output files that store various inference results. "king" is the default prefix.

--cpus specifies the number of CPU cores to be used for parallel computing. If not specified, the default number is half of the total number of (logical) cores.

--lessmem allows users to use less RAM memory. This option is now retired in KING 2.1.6 and later.

--sexchr specifies the pair number of the sex chromosome, for a non-human species.


REFERENCE

Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26(22):2867-2873


======================================
Last updated: November 28, 2018 by Wei-Min Chen


 
 
