KING Tutorial: Relationship Inference

KING is a toolset to explore genotype data from a genome-wide association study (GWAS) or a sequencing project. The latest version is KING 2.1.4, available as of June 6, 2018. KING can be used to check family relationships and flag pedigree errors by estimating kinship coefficients and/or inferring IBD (identical-by-descent) segments for all pairwise relationships. Unrelated pairs can be accurately separated from close relatives (up to 4th-degree for --related and --ibdseg, and up to 2nd-degree for --kinship) and vice versa. The kinship coefficient estimates and IBD segment-based relatedness inference for close relatives are highly accurate. Other applications of KING, such as quality control (QC), the identification of population substructure, and gene mapping, are described elsewhere.

Family relationship inference in KING is very fast (seconds to infer close relatives among tens of thousands of samples) and robust to a number of scenarios, including the presence of population structure. The number of samples in the dataset can be as small as 2 (for --kinship inference) or larger than 1,000,000 (for --related inference). KING requires genome-wide SNPs. Please do not prune or filter any "good" SNPs (those that pass QC) prior to any KING inference, unless there are too many variants to fit in computer memory (e.g., > 100,000,000), in which case rare variants should be filtered out.


GENERAL INPUT FILES

The input files need to be in PLINK binary format, e.g., ex.bed, ex.fam, and ex.bim. A binary format allows the compression of genotype data by using two bits to represent a genotype. Examples are:

  prompt> king -b ex.bed --related
  prompt> king -b ex.bed --fam ex.fam --bim ex.bim --related
The binary format offers convenient computational savings. An example dataset in the KING binary format can be downloaded at this link: ex.tar.gz [1.35MB], and this dataset will be used throughout the tutorial.
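
To make the two-bit encoding concrete, here is a minimal Python sketch of how one byte of a .bed file unpacks into four genotypes, and how the expected file size follows from the sample and SNP counts. It assumes the standard PLINK 1 SNP-major layout (a 3-byte header, then one block per SNP padded to a whole number of bytes). The function names are illustrative and are not part of KING or PLINK.

```python
import math

# PLINK 1 .bed two-bit codes; the low-order bits hold the first sample.
GENOTYPE_CODES = {
    0b00: "hom_A1",   # homozygous for the first allele listed in the .bim file
    0b01: "missing",
    0b10: "het",
    0b11: "hom_A2",   # homozygous for the second allele listed in the .bim file
}


def decode_bed_byte(byte):
    """Unpack one byte into the four genotypes it stores."""
    return [GENOTYPE_CODES[(byte >> (2 * i)) & 0b11] for i in range(4)]


def bed_file_size(n_samples, n_snps):
    """Expected size in bytes: 3 header bytes plus one padded block per SNP."""
    return 3 + n_snps * math.ceil(n_samples / 4)


print(decode_bed_byte(0b11100100))
print(bed_file_size(n_samples=10, n_snps=1000))
```

With 10 samples, each SNP occupies 3 bytes (the last byte is only half used), so 1000 SNPs take 3003 bytes including the header.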

KING relationship inference has been tested on genome sequence data and works well. A VCF file of sequence data can be easily converted into the PLINK binary format using PLINK2:
  prompt> plink2 --vcf example.vcf.gz --make-bed --out ex

RELATIONSHIP INFERENCE

Relationships are verified between each pair of individuals. Two algorithms implemented in KING are highly recommended. One algorithm estimates pair-wise kinship coefficients (through parameter --kinship), and the other (available in version 2.1 and later) determines IBD segments (through parameter --ibdseg). Examples are

  prompt> king -b ex.bed --kinship
  prompt> king -b ex.bed --ibdseg
  prompt> king -b ex.bed --ibs
  prompt> king -b ex.bed --homo
Faster algorithms are available for identifying close relationships. They include
  prompt> king -b ex.bed --related
  prompt> king -b ex.bed --duplicate
A few other useful applications include:
  prompt> king -b ex.bed --unrelated
  prompt> king -b ex.bed --build
  prompt> king -b ex.bed --cluster
In each relationship inference, the output is separate for relationships that are within or between families. Note that an unrelated individual is treated as a family of size one. If the dataset consists only of individuals reported as unrelated, then all results are saved in the between-family output.

--kinship estimates pair-wise kinship coefficients. Filtering by degree of relatedness is available via "--degree". More details of --kinship are available in the PAIRWISE RELATIONSHIP WITHIN FAMILIES and PAIRWISE RELATIONSHIP ACROSS FAMILIES sections later in this tutorial.

--ibs provides summary statistics such as the counts of IBS0, IBS1, and IBS2 and the average IBS, in addition to the kinship estimates.

--homo estimates pair-wise kinship coefficients assuming a homogeneous population. The best application of --homo may be for the linear mixed models (LMM), where the population structure information needs to be explicitly incorporated in the kinship coefficient estimation. Although --homo is not recommended as a good method to infer relatedness in general populations, it provides inference results comparable to multiple alternative methods.

--ibdseg carries out the IBD segment analysis and is newly available in version 2.1. IBD segment analysis determines all IBD (IBD1 and IBD2) segments shared between relatives, from which relatedness can be inferred. Inferring IBD segments in KING is as fast as estimating kinship coefficients, e.g., seconds for thousands of samples, in contrast to the days required by alternative tools. More details of --ibdseg analysis are available in the IBD SEGMENT INFERENCE section later in this tutorial.

--related provides integrative, fast, and accurate inference for close relationships. It is highly recommended, especially when dealing with very large datasets consisting of > 1,000,000 samples. Integration of the IBD segment inference further improves the inference accuracy. When "--rplot" is specified, several relationship plots are generated automatically. --related --degree 2 specifies that only related pairs (up to the 2nd-degree in this case) between families are included in the output. Specifically, all pairs across families with a kinship coefficient less than 0.0884 will be excluded from the output. More details of --related analysis are available in the INTEGRATED RELATIONSHIP INFERENCE section later in this tutorial.

--duplicate implements the fastest (and accurate) algorithm to identify duplicates/MZ twins. The running time is in seconds, unless the number of samples is > 1,000,000, in which case a few minutes are needed to identify all pairs of duplicates.

--unrelated is a handy option to extract a list of unrelated individuals. E.g.,

  prompt> king -b ex.bed --unrelated --degree 2
estimates relatedness in the data first, followed by extracting a list of individuals that contains no pairs of individuals with a 1st- or 2nd-degree relationship. This option is available in version 1.4 and later. The detailed algorithm is described in this reference: Manichaikul et al. 2012 [PDF]

--build can reconstruct pedigrees when no or only partial pedigree information is available. It produces two files: kingupdateids.txt and kingupdateparents.txt. Users can then use these two files to update the pedigrees, e.g., using PLINK 1.9. The current algorithm can connect 1st-degree relatives with high accuracy. Known scenarios that KING --build handles well are families containing at least a pair of full siblings and/or a parent-child trio, etc. Algorithms that can utilize higher-degree relatedness are currently under development and should be available soon.

--cluster is both a standalone parameter and a parameter to go with other options. As a standalone option, it clusters relatives into families by generating an updateid file, which can then be used to update the pedigrees (e.g., using PLINK). --cluster can also be used to group cryptic relatives together prior to association analysis, e.g.,

  prompt> king -b ex.bed --cluster --tdt


PAIRWISE RELATIONSHIP WITHIN FAMILIES

The output for within-family relationship checking using --kinship (saved in file king.kin) will look like this:

FID     ID1     ID2     N_SNP   Z0      Phi     HetHet  IBS0    Kinship Error
28      1       2       2359853 0.000   0.2500  0.162   0.0008  0.2459  0
28      1       3       2351257 0.000   0.2500  0.161   0.0008  0.2466  0
28      2       3       2368538 1.000   0.0000  0.120   0.0634  -0.0108 0
117     1       2       2354279 0.000   0.2500  0.163   0.0006  0.2477  0
117     1       3       2358957 0.000   0.2500  0.164   0.0006  0.2490  0
117     2       3       2348875 1.000   0.0000  0.122   0.0616  -0.0017 0
1344    1       12      2372286 0.000   0.2500  0.149   0.0003  0.2480  0
1344    1       13      2370435 0.000   0.2500  0.148   0.0003  0.2465  0
1344    12      13      2374888 1.000   0.0000  0.117   0.0582  0.0003  0
Each row above provides information for one pair of individuals. The columns are
FID: Family ID for the pair
ID1: Individual ID for the first individual of the pair
ID2: Individual ID for the second individual of the pair
N_SNP: The number of SNPs that do not have missing genotypes in either individual
Z0: Pr(IBD=0) as specified by the provided pedigree data
Phi: Kinship coefficient as specified by the provided pedigree data
HetHet: Proportion of SNPs with double heterozygotes (e.g., AG and AG)
IBS0: Proportion of SNPs with zero IBS (identical-by-state) (e.g., AA and GG)
Kinship: Estimated kinship coefficient from the SNP data
Error: Flag indicating differences between the estimated and specified kinship coefficients (1 for error, 0.5 for warning)

The default kinship coefficient estimation uses SNP data from only this pair of individuals, and the inference is robust to population structure. A negative kinship coefficient estimate indicates an unrelated pair. Negative estimates are not set to zero because a very negative value may indicate population structure between the two individuals. Close relatives can be inferred fairly reliably from the estimated kinship coefficients using the following simple algorithm: an estimated kinship coefficient in the range >0.354, [0.177, 0.354], [0.0884, 0.177], or [0.0442, 0.0884] corresponds to a duplicate/MZ twin, 1st-degree, 2nd-degree, or 3rd-degree relationship, respectively. Inference for more distant relationships is more challenging. A plot of the estimated kinship coefficient against the proportion of zero IBS-sharing is highly recommended. In the absence of population structure, relationship inference can also be carried out using an alternative algorithm through the parameter "--homo".
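
The simple cutoff algorithm above can be sketched in a few lines of Python. This is only an illustration of the quoted ranges, not KING's own code. The function name and the "distant/UN" label are assumptions made for this example. Because the quoted intervals share endpoints, a value exactly on a boundary is assigned here to the more distant category, which is an arbitrary choice.

```python
# Lower bounds quoted in the tutorial. Each is approximately 2 ** -(d + 1.5)
# for degree d = 0 (duplicate/MZ twin), 1, 2, 3.
CUTOFFS = [
    (0.354, "Dup/MZ"),
    (0.177, "1st"),
    (0.0884, "2nd"),
    (0.0442, "3rd"),
]


def infer_degree(kinship):
    """Map an estimated kinship coefficient to a relationship category."""
    for lower_bound, label in CUTOFFS:
        if kinship > lower_bound:
            return label
    # Below the 3rd-degree range: more distant relative or unrelated
    # (label chosen for this sketch only).
    return "distant/UN"


# Kinship estimates taken from the king.kin example output above.
for estimate in (0.2459, -0.0108, 0.0003):
    print(estimate, infer_degree(estimate))
```

Applied to family 28 in the example output, the pairs (1, 2) and (1, 3) fall in the 1st-degree range, while the pair (2, 3) falls below all cutoffs, which agrees with the reported pedigree (Phi of 0.25 and 0).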

Here is an example of the relationship inference using the HapMap GWAS data: PDF and its R code


PAIRWISE RELATIONSHIP ACROSS FAMILIES (OR UNRELATED INDIVIDUALS)

The output for between-family relationship checking (saved in file king.kin0) will look like this:

FID1    ID1     FID2    ID2     N_SNP   HetHet  IBS0    Kinship
28      3       117     1       2360618 0.143   0.0267  0.1356
28      3       117     2       2352628 0.161   0.0009  0.2441
28      3       117     3       2354540 0.120   0.0624  -0.0119
28      3       1344    1       2361807 0.093   0.1095  -0.2295
28      3       1344    12      2367180 0.094   0.1091  -0.2225
28      3       1344    13      2364816 0.093   0.1082  -0.2224
117     1       1344    1       2362787 0.094   0.1093  -0.2312
117     1       1344    12      2368467 0.095   0.1088  -0.2230
117     1       1344    13      2365036 0.094   0.1084  -0.2253
117     2       1344    1       2354855 0.094   0.1084  -0.2281
117     2       1344    12      2361351 0.095   0.1078  -0.2206
117     2       1344    13      2357936 0.095   0.1067  -0.2190
117     3       1344    1       2357771 0.094   0.1102  -0.2348
117     3       1344    12      2364365 0.095   0.1086  -0.2232
117     3       1344    13      2361061 0.094   0.1096  -0.2301
This analysis shows the "unrelated" families 28 and 117 are actually connected through an unreported parent-offspring pair (28_3, 117_2).
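
As an illustration of how such unreported relatives can be picked out of the between-family output, the following Python sketch parses a few rows copied from the table above and keeps the pairs whose estimated kinship is at least 0.0884 (the lower bound of the 2nd-degree range). The function name and the default threshold are choices made for this example. In practice the text would be read from king.kin0 rather than from an embedded string.

```python
KIN0_TEXT = """\
FID1    ID1     FID2    ID2     N_SNP   HetHet  IBS0    Kinship
28      3       117     1       2360618 0.143   0.0267  0.1356
28      3       117     2       2352628 0.161   0.0009  0.2441
28      3       117     3       2354540 0.120   0.0624  -0.0119
28      3       1344    1       2361807 0.093   0.1095  -0.2295
"""


def related_pairs(text, min_kinship=0.0884):
    """Return (person1, person2, kinship) for pairs at or above min_kinship."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    pairs = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        kinship = float(row["Kinship"])
        if kinship >= min_kinship:
            person1 = row["FID1"] + "_" + row["ID1"]
            person2 = row["FID2"] + "_" + row["ID2"]
            pairs.append((person1, person2, kinship))
    return pairs


for pair in related_pairs(KIN0_TEXT):
    print(pair)
```

On these rows the filter keeps the pair (28_3, 117_2), whose kinship of 0.2441 and near-zero IBS0 are consistent with a parent-offspring pair, and the pair (28_3, 117_1), whose kinship of 0.1356 falls in the 2nd-degree range.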

Here is an example of relationship inference across families using the HapMap GWAS data: PDF and its R code


IBD SEGMENT INFERENCE

Examples of IBD segment analysis are

  prompt> king -b ex.bed --ibdseg
  prompt> king -b ex.bed --ibdseg --degree 3 --rplot --prefix ex
The second command specifies that only pairs with IBD proportion > 0.0884 will be saved in the output. Since writing to hard drives is usually the computing bottleneck, the "--degree 3" option can save a substantial amount of computational time without sacrificing any inference accuracy. "--rplot" makes plots using the inferred relatedness results (as in file ex.seg).

The summary of IBD segments in file ex.seg will look like this:


FID1    ID1     FID2    ID2     MaxIBD1 MaxIBD2 IBD1Seg IBD2Seg PropIBD InfType
1330    NA12335 1330    NA12340 105.7   0.0     1.0000  0.0000  0.5000  PO
1330    NA12335 1330    NA12341 105.7   0.0     1.0000  0.0000  0.5000  PO
1330    NA12336 1330    NA12342 105.7   0.0     0.9951  0.0000  0.4976  PO
1330    NA12336 1330    NA12343 105.7   0.0     1.0000  0.0000  0.5000  PO
1328    NA06984 1345    NA07346 21.7    0.0     0.0160  0.0000  0.0080  UN
1334    NA10846 1334    NA12144 105.7   0.0     1.0000  0.0000  0.5000  PO
1334    NA10846 1334    NA12145 105.7   0.0     1.0000  0.0000  0.5000  PO
1334    NA10847 1334    NA12146 105.7   0.0     1.0000  0.0000  0.5000  PO
1334    NA10847 1334    NA12239 105.7   0.0     1.0000  0.0000  0.5000  PO
Each row above provides information for one pair of individuals. The columns are
FID1: Family ID for the first individual of the pair
ID1: Individual ID for the first individual of the pair
FID2: Family ID for the second individual of the pair
ID2: Individual ID for the second individual of the pair
MaxIBD1: Length of the longest IBD1 segment (in Mb)
MaxIBD2: Length of the longest IBD2 segment (in Mb)
IBD1Seg: Total length of IBD1 segments divided by total length of all segments, estimate of π1=Pr(IBD=1)
IBD2Seg: Total length of IBD2 segments divided by total length of all segments, estimate of π2=Pr(IBD=2)
PropIBD: Proportion of genomes shared identical-by-descent, estimated by IBD2Seg + IBD1Seg/2, estimate of π=π2+π1/2
InfType: Inferred relationship type, such as Dup/MZTwin, PO, FS, 2nd, 3rd, 4th, UN
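
The PropIBD column can be reproduced from IBD1Seg and IBD2Seg with a short calculation, sketched below in Python. The function names are illustrative. The second function uses the standard identity that the kinship coefficient equals π1/4 + π2/2, i.e. half of PropIBD, which is why a PropIBD cutoff of 0.0884 corresponds to a kinship cutoff of 0.0442.

```python
def prop_ibd(ibd1_seg, ibd2_seg):
    """Proportion of the genome shared IBD: pi = pi2 + pi1 / 2."""
    return ibd2_seg + ibd1_seg / 2


def implied_kinship(ibd1_seg, ibd2_seg):
    """Kinship implied by IBD sharing: phi = pi1 / 4 + pi2 / 2 = pi / 2."""
    return prop_ibd(ibd1_seg, ibd2_seg) / 2


# (IBD1Seg, IBD2Seg) values taken from rows of the ex.seg example above.
for ibd1, ibd2 in [(1.0000, 0.0), (0.9951, 0.0), (0.0160, 0.0)]:
    print(ibd1, ibd2, prop_ibd(ibd1, ibd2), implied_kinship(ibd1, ibd2))
```

For a parent-offspring pair with IBD1Seg = 1 and IBD2Seg = 0, this gives PropIBD = 0.5 and an implied kinship of 0.25. The other two rows agree with the tabulated PropIBD values of 0.4976 and 0.0080 up to rounding.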

The detailed IBD segments are in a gzipped file, ex.segments.gz. The first lines of the output of "zcat ex.segments.gz" look like this:

FID1    ID1     FID2    ID2     IBDType Chr     StartMB StopMB  StartSNP        StopSNP         N_SNP   Length
1330    NA12335 1330    NA12340 IBD1    1       57.074  115.800 rs12032625      rs34385187      384     58.7
1330    NA12335 1330    NA12340 IBD1    1       152.971 242.844 rs925042        rs10803228      640     89.9
1330    NA12335 1330    NA12340 IBD1    2       3.968   76.724  rs4261733       rs9332421       512     72.8
1330    NA12335 1330    NA12340 IBD1    2       165.994 238.148 rs1835889       rs2280288       448     72.2
1330    NA12335 1330    NA12340 IBD1    3       2.625   68.873  rs1382866       rs6795777       512     66.2
1330    NA12335 1330    NA12340 IBD1    3       113.243 195.396 rs1844925       rs4441603       512     82.2
1330    NA12335 1330    NA12340 IBD1    5       72.046  177.796 rs3010239       rs6422347       704     105.7
1330    NA12335 1330    NA12340 IBD1    6       3.685   47.246  rs11242871      rs9381515       384     43.6
1330    NA12335 1330    NA12340 IBD1    6       88.258  165.734 rs3778671       rs520349        512     77.5
Each row above provides information for one IBD segment in one pair of individuals. The columns are
FID1: Family ID for the first individual of the pair
ID1: Individual ID for the first individual of the pair
FID2: Family ID for the second individual of the pair
ID2: Individual ID for the second individual of the pair
IBDType: Type of IBD segment: IBD1 or IBD2
Chr: Chromosome number
StartMB: Start position of the IBD segment (in Mb)
StopMB: Stop position of the IBD segment (in Mb)
StartSNP: Start SNP of the IBD segment
StopSNP: Stop SNP of the IBD segment
N_SNP: The number of SNPs in the IBD segment
Length: Total length of the IBD segment (in Mb)
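
As one way to explore these records, the sketch below (in Python, with a hypothetical function name) groups segments by pair and IBD type and reports the number of segments, their total length, and the longest segment. It uses four rows copied from the example above. In practice the text could be read from ex.segments.gz, for instance with Python's gzip module.

```python
from collections import defaultdict

SEGMENTS_TEXT = """\
FID1 ID1 FID2 ID2 IBDType Chr StartMB StopMB StartSNP StopSNP N_SNP Length
1330 NA12335 1330 NA12340 IBD1 1 57.074 115.800 rs12032625 rs34385187 384 58.7
1330 NA12335 1330 NA12340 IBD1 1 152.971 242.844 rs925042 rs10803228 640 89.9
1330 NA12335 1330 NA12340 IBD1 5 72.046 177.796 rs3010239 rs6422347 704 105.7
1330 NA12335 1330 NA12340 IBD1 6 3.685 47.246 rs11242871 rs9381515 384 43.6
"""


def summarize_segments(text):
    """Per (ID1, ID2, IBDType): segment count, total length, longest segment."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    summary = defaultdict(lambda: {"n": 0, "total_mb": 0.0, "longest_mb": 0.0})
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        length = float(row["Length"])
        entry = summary[(row["ID1"], row["ID2"], row["IBDType"])]
        entry["n"] += 1
        entry["total_mb"] += length
        entry["longest_mb"] = max(entry["longest_mb"], length)
    return dict(summary)


for key, stats in summarize_segments(SEGMENTS_TEXT).items():
    print(key, stats)
```

For the pair NA12335 and NA12340, the longest IBD1 segment found this way is 105.7 Mb, which agrees with the MaxIBD1 value reported for the same pair in ex.seg.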

IBD segment results are quite informative and worth further exploration, and applications of inferred IBD segments can go beyond relationship inference. One effective way to explore them is to plot all IBD segments for each pair of close relatives. Here is the R code for visualizing IBD segments: king_segments_plot.R. To run this R code, please first make sure the ggplot2 and parallel libraries are properly installed. You can then type

  prompt> Rscript king_segments_plot.R ex ibdseg
where ex is the specified prefix. Each pair of close relatives is plotted in its own file, and all plot files are bundled into a single archive, ex_ibdseg_rplots.tar.gz. Below is one plot example, showing the IBD segments of one pair of individuals from the ex dataset:

[Figure: IBD segments of one pair of individuals from the ex dataset]



INTEGRATED RELATIONSHIP INFERENCE

Integrated relationship inference is available in KING through the --related option. Multiple relatedness measures are used to screen close relatives. Examples of integrated relationship inference are

  prompt> king -b ex.bed --related
  prompt> king -b ex.bed --related --degree 2 --rplot --prefix ex
The first command identifies close relatives up to the first degree, and the second command identifies close relatives up to the second degree. Identifying only first-degree relatives is substantially faster than identifying relatives up to the second degree, and --related without the --degree option is highly recommended. "--rplot" makes plots using both reported and inferred relationships (in output files ex.kin and ex.kin0). Here is an example of inference plots that are generated by the second command above: ex_relplot.pdf.

The summary of relationship inference in file ex.kin will look like this:

  prompt> head ex.kin
FID     ID1     ID2     N_SNP   Z0      Phi     HetHet  IBS0    HetConc HomIBS0 Kinship IBD1Seg IBD2Seg PropIBD InfType Error
Y001    NA18484 NA18486 18250   0.000   0.2500  0.2324  0.0002  0.3368  0.0006  0.2515  0.9945  0.0000  0.4972  PO      0
Y001    NA18484 NA18488 18249   0.000   0.2500  0.2332  0.0002  0.3379  0.0004  0.2522  1.0000  0.0000  0.5000  PO      0
Y001    NA18486 NA18488 18270   1.000   0.0000  0.2141  0.1053  0.3036  0.2460  0.0039  0.0000  0.0000  0.0000  UN      0
Y002    NA18485 NA18487 18276   0.000   0.2500  0.2349  0.0002  0.3413  0.0005  0.2541  1.0000  0.0000  0.5000  PO      0
Y002    NA18485 NA18489 18275   0.000   0.2500  0.2274  0.0003  0.3298  0.0007  0.2474  1.0000  0.0000  0.5000  PO      0
Y002    NA18487 NA18489 18269   1.000   0.0000  0.2098  0.1120  0.2999  0.2601  -0.0157 0.0000  0.0000  0.0000  UN      0
Y003    NA18497 NA18498 18262   0.000   0.2500  0.2232  0.0003  0.3254  0.0009  0.2448  0.9951  0.0000  0.4975  PO      0
Y003    NA18497 NA18499 18258   0.000   0.2500  0.2310  0.0002  0.3387  0.0006  0.2525  1.0000  0.0000  0.5000  PO      0
Y003    NA18498 NA18499 18280   1.000   0.0000  0.2100  0.1043  0.2997  0.2452  0.0015  0.0000  0.0000  0.0000  UN      0
Each row above provides information for one pair of individuals. The columns are
FID: Family ID for the pair
ID1: Individual ID for the first individual of the pair
ID2: Individual ID for the second individual of the pair
N_SNP: The number of SNPs that do not have missing genotypes in either individual
Z0: Pr(IBD=0) as specified by the provided pedigree data
Phi: Kinship coefficient as specified by the provided pedigree data
HetHet: Proportion of SNPs with double heterozygotes (e.g., AG and AG)
IBS0: Proportion of SNPs with zero IBS (identical-by-state) (e.g., AA and GG)
Kinship: Estimated kinship coefficient (φ) from the SNP data
IBD1Seg: Total length of IBD1 segments divided by total length of all segments, estimate of π1=Pr(IBD=1)
IBD2Seg: Total length of IBD2 segments divided by total length of all segments, estimate of π2=Pr(IBD=2)
PropIBD: Proportion of genomes shared identical-by-descent, estimated by IBD2Seg + IBD1Seg/2, estimate of π=π2+π1/2
InfType: Inferred relationship type, such as Dup/MZTwin, PO, FS, 2nd, 3rd, 4th, UN
Error: Flag indicating differences between inferred and reported relationship (1 for error, 0.5 for warning)


OTHER PARAMETERS

The following parameters can also be specified:

--prefix specifies the prefix of the names of the output files that store the various relatedness measures (e.g., ex.kin and ex.kin0 when "--prefix ex" is given). "king" is used as default.

--rplot generates R code and calls R to produce PDF plots.

--cpus specifies the number of CPU cores to be used in the parallel computing. If not specified, the default number is half of the total number of (logical) cores.

--lessmem reduces memory usage by half; the memory needed is then approximately the size of the .bed file. This option is useful for large datasets (e.g., when the file size is > 100GB).

--degree specifies the degree of relatedness. It goes with multiple KING analyses including --kinship, --ibdseg, --related, --unrelated, --build, --cluster.
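
For planning purposes, the --lessmem description can be turned into a rough back-of-the-envelope estimate, sketched below in Python. This is only one reading of the statement above (memory about equal to the .bed file size with --lessmem, hence about twice that without it), not a measured figure, and the function names and dataset dimensions are hypothetical. The .bed size uses the standard PLINK 1 layout of two bits per genotype plus a 3-byte header.

```python
import math


def bed_bytes(n_samples, n_snps):
    """Size of a PLINK 1 .bed file: 3 header bytes plus one padded block per SNP."""
    return 3 + n_snps * math.ceil(n_samples / 4)


def approx_memory_gb(n_samples, n_snps, lessmem=False):
    """Rough memory estimate in GB, assuming ~1x .bed size with --lessmem, ~2x without."""
    factor = 1 if lessmem else 2
    return factor * bed_bytes(n_samples, n_snps) / 1e9


# Hypothetical dataset: 500,000 samples genotyped at 800,000 SNPs.
print(approx_memory_gb(500_000, 800_000))
print(approx_memory_gb(500_000, 800_000, lessmem=True))
```

For this hypothetical dataset the .bed file is about 100 GB, so the estimate is roughly 200 GB by default and roughly 100 GB with --lessmem.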


REFERENCE

Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26(22):2867-2873


======================================
Last updated: June 6, 2018 by Wei-Min Chen


   
