How to merge multiple plink files

Hello everyone, I am Deng Fei. This post summarizes how to merge multiple plink files.

Merging has two application scenarios:

  • 1. The samples are the same, but the sites differ. For example, the chromosome 1 data and the chromosome 2 data of the same samples are merged.
  • 2. The sites are the same, but the samples differ. For example, a first batch and a second batch of data genotyped on the same chip (same map data).

Accordingly, the summary below is divided into two methods.

1. The samples are the same, but the sites are different

Typical situation: there are data for 4 chromosomes, each chromosome has its own set of plink files, and they need to be merged together.

Example data:

dat_chr_1.map  dat_chr_2.map  dat_chr_3.map  dat_chr_4.map
dat_chr_1.ped  dat_chr_2.ped  dat_chr_3.ped  dat_chr_4.ped

Here we use --merge-list to merge multiple files.

First, create a text file listing the ped and map files to be merged, with the ped file first and the map file second.

The file below is named p12.txt. It has two columns: the first is the ped file name and the second is the map file name, with one pair of plink files per line.

dat_chr_1.ped   dat_chr_1.map
dat_chr_2.ped   dat_chr_2.map
dat_chr_3.ped   dat_chr_3.map
dat_chr_4.ped   dat_chr_4.map
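With many chromosomes, typing the list by hand gets tedious. The following is a minimal sketch of how a file like p12.txt could be generated instead. It assumes every ped file has a map file with the same prefix; make_merge_list is a hypothetical helper name, not a plink command.

```shell
# make_merge_list is a hypothetical helper, not part of plink.
# It writes one "ped<TAB>map" pair per line for every .ped file with the given prefix.
make_merge_list() {
    prefix=$1   # e.g. dat_chr_
    out=$2      # e.g. p12.txt
    : > "$out"  # start from an empty list file
    for ped in "$prefix"*.ped; do
        [ -e "$ped" ] || continue          # the glob matched nothing
        map="${ped%.ped}.map"              # assumed: the map shares the ped prefix
        if [ ! -e "$map" ]; then
            echo "missing $map" >&2
            return 1
        fi
        printf '%s\t%s\n' "$ped" "$map" >> "$out"
    done
}

# Usage: make_merge_list dat_chr_ p12.txt
```

The shell expands the glob in lexical order, so dat_chr_10 would be listed before dat_chr_2; plink sorts the merged variants itself, so the order of the lines should not matter.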

The command is as follows:

 plink --merge-list p12.txt --recode --out hebing

The log is as follows:

$ plink --merge-list p12.txt --recode --out hebing
PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to hebing.log.
Options in effect:
  --merge-list p12.txt
  --out hebing
  --recode

15236 MB RAM detected; reserving 7618 MB for main workspace.
Performing single-pass merge (165 people, 426095 variants).
Merged fileset written to hebing.bed + hebing.bim + hebing.fam .
426095 variants loaded from .bim file.
165 people (80 males, 85 females) loaded from .fam.
112 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 112 founders and 53 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.997722.
426095 variants and 165 people pass filters and QC.
Among remaining phenotypes, 56 are cases and 56 are controls.  (53 phenotypes
are missing.)
--recode ped to hebing.ped + hebing.map ... done.

Result files:

The number of lines in the merged map file equals the sum of the line counts of the four input map files.

$ wc -l *map
  119487 dat_chr_1.map
  119502 dat_chr_2.map
   98971 dat_chr_3.map
   88135 dat_chr_4.map
  426095 hebing.map
  852190 total

The number of lines (samples) in the ped file is unchanged:

$ wc -l *ped
      165 dat_chr_1.ped
      165 dat_chr_2.ped
      165 dat_chr_3.ped
      165 dat_chr_4.ped
      165 hebing.ped
      825 total
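The two wc checks above can be scripted. The sketch below covers the map side only and assumes the input files share no variants, as is the case when each file holds a different chromosome; check_map_sum is a hypothetical helper name.

```shell
# check_map_sum is a hypothetical helper, not part of plink.
# It compares the merged map line count with the summed line counts of the inputs.
check_map_sum() {
    merged=$1
    shift                                          # remaining arguments: input maps
    expected=$(cat "$@" | wc -l | tr -d ' ')       # total lines across the inputs
    actual=$(wc -l < "$merged" | tr -d ' ')        # lines in the merged map
    if [ "$expected" -eq "$actual" ]; then
        echo "OK: $actual variants"
    else
        echo "MISMATCH: inputs sum to $expected, merged has $actual"
        return 1
    fi
}

# Usage: check_map_sum hebing.map dat_chr_1.map dat_chr_2.map dat_chr_3.map dat_chr_4.map
```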

2. The sites are the same, but the samples are different

The method is the same as above: use --merge-list with a file that lists the filesets to be merged.

Two plink filesets, sample1 and sample2, are used here; the procedure is the same for more files.

sample1.map  sample1.ped  sample2.map  sample2.ped

Generate p12.txt file:

sample1.ped     sample1.map
sample2.ped     sample2.map

Run the merge command:

 plink --merge-list p12.txt --recode --out hebing2

The log is as follows:

$ plink --merge-list p12.txt --recode --out hebing2
PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to hebing2.log.
Options in effect:
  --merge-list p12.txt
  --out hebing2
  --recode

15236 MB RAM detected; reserving 7618 MB for main workspace.
Performing single-pass merge (25 people, 1457897 variants).
Merged fileset written to hebing2.bed + hebing2.bim + hebing2.fam .
1457897 variants loaded from .bim file.
25 people (13 males, 12 females) loaded from .fam.
17 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 17 founders and 8 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.996107.
1457897 variants and 25 people pass filters and QC.
Among remaining phenotypes, 10 are cases and 7 are controls.  (8 phenotypes are
missing.)
--recode ped to hebing2.ped + hebing2.map ... done.
Warning: 2 het. haploid genotypes present (see hebing2.hh ); many commands
treat these as missing.

The result is as follows:

The map data is exactly the same, and the merged ped file contains the samples from both input files.

3. Precautions

Note 1: If the sites differ between the filesets, the merged result contains the union of the two maps.

Note 2: Merging does not match sites by chromosome + physical position, but by the variant name in the second column of the map file. Make sure the names overlap between the files, otherwise the merge result will be wrong.
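One way to check the overlap before merging is to count the variant names that two map files have in common. This is a minimal sketch; shared_variant_count is a hypothetical helper name, and it assumes the usual whitespace-separated map layout with the variant name in column 2.

```shell
# shared_variant_count is a hypothetical helper, not part of plink.
# It prints how many variant names (column 2) occur in both map files.
shared_variant_count() {
    awk 'NR == FNR { seen[$2] = 1; next }        # first file: remember every name
         ($2 in seen) && !counted[$2]++ { n++ }  # second file: count each shared name once
         END { print n + 0 }' "$1" "$2"
}

# Usage: shared_variant_count sample1.map sample2.map
```

For two batches from the same chip, the printed number should be close to the line count of either map file. A result of 0 suggests the two files use different naming schemes.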

Note 3: When merging samples, duplicate sample IDs cause an error. It is recommended to extract the IDs and check them before merging.
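The check can be done on the first two columns of the ped files (family ID and individual ID). This is a minimal sketch; duplicate_samples is a hypothetical helper name.

```shell
# duplicate_samples is a hypothetical helper, not part of plink.
# It prints every "family ID, individual ID" pair (columns 1 and 2 of a .ped
# file) that occurs more than once across the given files.
duplicate_samples() {
    awk '{ print $1, $2 }' "$@" | sort | uniq -d
}

# Usage: duplicate_samples sample1.ped sample2.ped
```

Empty output means no ID pair is repeated. Any line that is printed names a sample to rename or remove before running the merge.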

Origin blog.csdn.net/yijiaobani/article/details/129695676