Detailed explanation of BED, BAM, WIG, bigWig and bedGraph files

SOPs/coordinates – BaRC Wiki: http://barcwiki.wi.mit.edu/wiki/SOPs/coordinates

Files in BAM or BED format are mainly used to record which regions of the genome our reads map to, that is, the result of aligning the sequencing reads to the reference genome. The formats specified by UCSC (wig, bigWig and bedGraph) have a different purpose: they only record the coverage and sequencing depth of each region of the reference genome. Files in these formats can be loaded directly into UCSC's Genome Browser tool for visualization.

This website provides construction and conversion scripts for these data formats: SOPs/coordinates – BaRC Wiki

Comparison of BED, WIG, BIGWIG and BEDGRAPH files

File format | Type of data            | Storage method               | Application scenarios
BED         | Genomic regions         | Text                         | Genome annotation, ChIP-seq, ATAC-seq, etc.
WIG         | Continuous measurements | Text                         | Gene expression, DNA methylation, chromatin accessibility, etc.
BIGWIG      | Continuous measurements | Binary (compressed, indexed) | Gene expression, DNA methylation, chromatin accessibility, etc.
BEDGRAPH    | Continuous measurements | Text                         | Gene expression, DNA methylation, chromatin accessibility, etc.

1 BAM files

BAM is a binary file format widely used in bioinformatics for storing alignment results of high-throughput sequencing data (such as DNA-seq, RNA-seq, and ChIP-seq). It is the compressed binary version of the text-based SAM format. A BAM file consists of a header followed by alignment records, each describing how one read aligns to the reference.

Basic structure of BAM files

A BAM file consists of one or more "records", each describing the alignment of one read. The main fields of a record are:

  • Read name (QNAME): The name of the read.
  • Flag (FLAG): A bitwise flag describing the alignment, for example whether the read is unmapped or aligned to the reverse strand.
  • Chromosome name (RNAME): The name of the chromosome or reference sequence the read is aligned to.
  • Start position (POS): The leftmost aligned coordinate of the read. It is written 1-based in the SAM text form and stored 0-based inside the binary BAM file.
  • Mapping quality (MAPQ): The confidence of the alignment.
  • CIGAR: Describes how the read aligns (matches, insertions, deletions, clipping). The end position is not stored as a field; it is derived from the start position and the CIGAR string.
  • Bases (SEQ): The base sequence of the read.
  • Quality (QUAL): The per-base quality values of the read.
  • Auxiliary information: Optional tags with additional information about the alignment, such as the alignment score (AS) or the number of mismatches (NM).

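Example of BAM file

A BAM file is binary, so here is a simplified example shown in its text (SAM) form, as printed by samtools view -h, to better understand its structure:

@SQ	SN:chr1	LN:249250621
@SQ	SN:chr2	LN:243199373
@RG	ID:test_readgroup	LB:test_library	PL:illumina	SM:test_sample

read1	0	chr1	100	60	16M	*	0	0	ACGTACGTACGTACGT	*	MD:Z:16
read2	16	chr1	300	60	16M	*	0	0	ACGTACGTACGTACGT	*	MD:Z:16

In the above example, the first and second lines define two reference chromosomes, and the third line defines a read group. The last two lines are alignment records for two reads: read1 aligns to chr1 starting at position 100 on the forward strand (flag 0), and read2 aligns to chr1 starting at position 300 on the reverse strand (flag 16).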
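As a rough illustration of how an alignment record is laid out, the sketch below splits one tab-separated line of the kind printed by samtools view and derives the end coordinate from the start position and the CIGAR string. The function name and the example record are hypothetical, and real pipelines would normally rely on a dedicated library rather than hand-written parsing.

```python
import re

# CIGAR operations that consume reference bases (per the SAM specification).
REF_CONSUMING = set("MDN=X")

def parse_sam_line(line):
    """Split one SAM text line into its main fields plus optional tags."""
    f = line.rstrip("\n").split("\t")
    record = {
        "qname": f[0], "flag": int(f[1]), "rname": f[2],
        "pos": int(f[3]),          # 1-based leftmost position in SAM text
        "mapq": int(f[4]), "cigar": f[5],
        "seq": f[9], "qual": f[10],
        "tags": f[11:],            # optional TAG:TYPE:VALUE fields
    }
    # Reference length covered by the alignment, summed from the CIGAR string.
    ref_len = sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", f[5])
                  if op in REF_CONSUMING)
    record["end"] = record["pos"] + ref_len - 1   # 1-based, inclusive
    return record

# Hypothetical record, not taken from a real data set.
example = "read1\t0\tchr1\t100\t60\t16M\t*\t0\t0\tACGTACGTACGTACGT\t*\tMD:Z:16"
rec = parse_sam_line(example)
print(rec["rname"], rec["pos"], rec["end"])
```

The end coordinate is computed rather than read because an alignment record stores only the leftmost position; deletions and skipped regions lengthen the span on the reference, while insertions and clipping do not.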

Application of BAM files

BAM files have a variety of applications in bioinformatics, including but not limited to:

  • Alignment data analysis: used to analyze the alignment results of high-throughput sequencing data, such as alignment accuracy, alignment confidence, etc.
  • Data visualization: Used to visualize the alignment results of high-throughput sequencing data, such as gene expression maps, chromatin accessibility maps, etc.
  • Data mining: Used to mine potential patterns in the alignment results of high-throughput sequencing data, such as gene regulation mechanisms, disease correlations, etc.

2 BED files

BED files are a text file format widely used in bioinformatics to describe features and regions on the genome. BED files typically contain genomic coordinates, feature names, descriptions, and other additional information.

Basic structure of BED files

BED files consist of lines of text, each line representing a feature or region on the genome. Each line usually contains the following fields, separated by tabs or spaces:

  • Chromosome name (Chrom): The name or number of the chromosome where the feature or region is located.
  • Start position (Start): The starting coordinate of the feature or region, 0-based (the first base of a chromosome is 0).
  • End position (End): The end coordinate of the feature or region. The end is exclusive (half-open interval), so a feature with start 0 and end 100 covers bases 0 to 99 and has length 100.

Optional fields for BED files

  • Name: The name or identifier of a feature or region.
  • Score: Used to express the score, quality, or importance of a feature.
  • Strand: Indicates the strand on which the feature is located, "+" (forward strand) or "-" (reverse strand); "." indicates that the strand is unknown.
  • Display information (Thick Start, Thick End, Item RGB): These fields are usually related to protein coding regions (CDS), describing the start and end positions of the CDS and the display color.
  • Block Count: The number of blocks (for example exons) in the feature.
  • Block Sizes: The size of each block, separated by commas.
  • Block Starts: The starting position of each block relative to Start, separated by commas.

Example of BED file

Here is a simplified example of a BED file to better understand its structure:

chr1    1000    2000    Gene1    0     +
chr1    2500    3000    Gene2    0     -
chr2    5000    6000    Gene3    0     +

In the above example, each row describes a feature on the genome, including fields such as chromosome name, start and end positions, name, score, and strand.
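To make the column layout concrete, here is a minimal sketch that parses the three example lines and computes each feature's length. The function name is hypothetical, and only the first six columns are handled.

```python
def parse_bed_line(line):
    """Parse the first six BED columns; columns beyond the third are optional."""
    f = line.split()
    feature = {"chrom": f[0], "start": int(f[1]), "end": int(f[2])}
    if len(f) > 3:
        feature["name"] = f[3]
    if len(f) > 4:
        feature["score"] = int(f[4])
    if len(f) > 5:
        feature["strand"] = f[5]
    # start is 0-based and end is exclusive, so the length is a plain difference.
    feature["length"] = feature["end"] - feature["start"]
    return feature

bed_text = """chr1    1000    2000    Gene1    0     +
chr1    2500    3000    Gene2    0     -
chr2    5000    6000    Gene3    0     +"""

features = [parse_bed_line(l) for l in bed_text.splitlines()]
for ft in features:
    print(ft["name"], ft["chrom"], ft["length"], ft["strand"])
```

Because BED intervals are 0-based with an exclusive end, no "+ 1" is needed when computing the length.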

Application of BED files

BED files have a variety of applications in bioinformatics, including but not limited to:

  • Genome annotation: used to describe the exons, introns, UTR (untranslated region) and other regions of genes.
  • ChIP-seq and ATAC-seq analysis: used to define regions related to protein binding, chromatin accessibility, etc.
  • Visualization: Used to display regions alongside the genome in a genome browser, helping researchers understand the genome structure.
  • Functional annotation: used to annotate the functions of genes and non-coding regions to help interpret biological data.
  • Genome sequencing and assembly: used to define regions in the assembled sequence, such as exons, CDS, etc.

3 WIG files

WIG files are a text file format widely used in bioinformatics to store continuous measurements along the genome. WIG files are commonly used to store gene expression data, DNA methylation data, chromatin accessibility data, etc.

Basic structure of WIG file

A WIG file consists of an optional track line followed by one or more blocks. Each block starts with a declaration line (variableStep or fixedStep) that names the chromosome once, followed by data lines holding the measured values. The main elements are:

  • Chromosome name (chrom): Given once on the declaration line, not on every data line.
  • Start position (start): Given on a fixedStep declaration line, or as the first column of each variableStep data line. WIG positions are 1-based.
  • Step (step): fixedStep only; the distance between the start positions of consecutive values.
  • Span (span): Optional; the number of bases each value covers (default 1). There is no separate end column.
  • Measured value (Value): The value itself, one per data line.

Example of WIG file

Here is a simplified example of a WIG file to better understand its structure:

variableStep chrom=chr1
1000    1.0
2500    2.0
variableStep chrom=chr2
5000    3.0

In the above example, each declaration line opens a block for one chromosome, and each data line under it gives a position and the measured value at that position.

4 BIGWIG files

BIGWIG (bigWig) files are the indexed binary version of WIG files. They are created from WIG or BEDGRAPH files, store the same continuous measurements in compressed form, and are commonly used for coverage tracks from high-throughput sequencing data such as DNA-seq, RNA-seq, and ChIP-seq.

Basic structure of BIGWIG file

A BIGWIG file is binary, so it cannot be read or edited as text. Internally, the values are grouped into compressed blocks, and an index lets a program fetch only the blocks that overlap the region being displayed. Each stored value corresponds to:

  • Chromosome name (Reference Name): describes the name or number of the chromosome where the measured value is located.
  • Start position (Start): The starting coordinate of the measured value.
  • End position (End): The end coordinate of the measured value.
  • Measured value (Value): The actual value of the measured value.

Example of BIGWIG file

Because the file is binary, its content can only be shown after converting it back to text, for example with UCSC's bigWigToBedGraph utility. The converted content looks like this:

chr1    1000    2000    1.0
chr1    2500    3000    2.0
chr2    5000    6000    3.0

In the above example, each row describes a continuous measurement on a genome, including chromosome name, start and end positions, and measurement value.

5 BEDGRAPH files

BEDGRAPH files are a four-column text format based on BED that stores a value for each genomic interval. BEDGRAPH files are commonly used to store gene expression data, DNA methylation data, chromatin accessibility data, etc.

Basic structure of BEDGRAPH files

BEDGRAPH files consist of one or more "records", each record describing a continuous measurement value. The record consists of the following fields:

  • Chromosome name (Reference Name): describes the name or number of the chromosome where the measured value is located.
  • Start position (Start): The starting coordinate of the measured value, 0-based.
  • End position (End): The end coordinate of the measured value. As in BED, the end is exclusive (half-open interval).
  • Measured value (Value): The actual value of the measured value.

Example of BEDGRAPH file

Here is a simplified example of a BEDGRAPH file to better understand its structure:

chr1    1000    2000    1.0
chr1    2500    3000    2.0
chr2    5000    6000    3.0

In the above example, each row describes a continuous measurement on a genome, including chromosome name, start and end positions, measurement values, etc.
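As a small illustration of working with these four columns, the sketch below computes the length-weighted mean of the value column over the intervals of the example. The function name is hypothetical; the only assumption about the format is that intervals are half-open, so each width is simply end minus start.

```python
def mean_coverage(bedgraph_lines):
    """Length-weighted mean of the value column over all listed intervals."""
    total_bases = 0
    weighted_sum = 0.0
    for line in bedgraph_lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue  # skip blank lines, track/browser lines and comments
        chrom, start, end, value = line.split()[:4]
        width = int(end) - int(start)   # half-open interval, so no +1
        total_bases += width
        weighted_sum += width * float(value)
    return weighted_sum / total_bases if total_bases else 0.0

example = [
    "chr1    1000    2000    1.0",
    "chr1    2500    3000    2.0",
    "chr2    5000    6000    3.0",
]
print(mean_coverage(example))
```

Positions that are not listed in the file are ignored here rather than counted as zero, which matches the idea that unlisted positions carry no data.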

6 SAM files

SAM is the plain-text counterpart of BAM: a BAM file is the compressed binary encoding of the same records. SAM is widely used in bioinformatics to store the alignment results of high-throughput sequencing data (such as DNA-seq, RNA-seq and ChIP-seq). A SAM file consists of an optional header (lines starting with "@") followed by one line per alignment.

The basic structure of a SAM file is the same as that of a BAM file, including the optional auxiliary fields. Each record is one tab-separated line with the following main fields:

  • Chromosome name (RNAME): The name of the chromosome or reference sequence the read is aligned to.
  • Start position (POS): The leftmost aligned coordinate of the read, 1-based in SAM.
  • End position: Not stored as a field; it is derived from the start position and the CIGAR string.
  • Bases (SEQ): The base sequence of the read.
  • Quality (QUAL): The per-base quality values of the read.
  • Auxiliary information: Optional tags with additional information about the alignment.

The application of SAM files is the same as that of BAM files.

Overall, the difference between SAM files and BAM files is the encoding, not the content. SAM files are human-readable text but large. BAM files are compressed, much smaller, and can be indexed for fast access to a region, which is why BAM is the form normally stored and passed between tools. The two can be converted into each other with samtools view.

In fact, the coverage and sequencing depth of a genomic region can also be obtained directly from a BAM file with samtools, for example:

samtools depth -r chr12:126073855-126073965  Ip.sorted.bam

chr12 126073855 5
chr12 126073856 15
chr12 126073857 31
chr12 126073858 40
chr12 126073859 44
chr12 126073860 52
~~~~~~~~~Omit the rest of the output~~~~~~~~~

This output is essentially the prototype of a wig file, although a wig file is a little more complicated.

First of all, a wig file does not need the first column, because the chromosome name is the same on every line. The chromosome only needs to be declared once, on the line that starts each chromosome's block.
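The idea of dropping the repeated chromosome column can be sketched as follows. This is only an illustration, not the script used for this article: the function name is hypothetical, the input is assumed to be three whitespace-separated columns (chromosome, position, depth) as in the samtools depth output above, and the variableStep declaration line is the standard wig syntax for per-position values.

```python
def depth_to_wig(depth_lines):
    """Turn 'chrom pos depth' lines into variableStep wig lines.

    The chromosome is written once per block instead of on every row.
    """
    out = []
    current_chrom = None
    for line in depth_lines:
        if not line.strip():
            continue
        chrom, pos, depth = line.split()
        if chrom != current_chrom:
            # New chromosome: emit a declaration line and start a new block.
            out.append(f"variableStep chrom={chrom}")
            current_chrom = chrom
        out.append(f"{pos}\t{depth}")
    return out

depth_output = [
    "chr12 126073855 5",
    "chr12 126073856 15",
    "chr12 126073857 31",
]
print("\n".join(depth_to_wig(depth_output)))
```

Positions are copied unchanged because both samtools depth and wig count positions from 1.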

7 bw files

A bw file is not a separate format: ".bw" is simply the file extension commonly used for bigWig files (".bigWig" is also used). A bw file therefore has exactly the structure of a BIGWIG file, with the same compression, and there is no difference in compression rate between the two.

As with any bigWig file, the values are grouped into compressed blocks with an index, and each stored value corresponds to:

  • Chromosome name (Reference Name): describes the name or number of the chromosome where the measured value is located.
  • Start position (Start): The starting coordinate of the measured value.
  • End position (End): The end coordinate of the measured value.
  • Measured value (Value): The actual value of the measured value.

The application of bw files is the same as that of bigwig files. bw files are commonly used to store gene expression data, DNA methylation data, chromatin accessibility data, etc.

Advantages of bw files (compared with wig or bedGraph text files) include:

  • Compression saves storage space.
  • The index allows a genome browser to read only the region being displayed instead of the whole file.

Disadvantages of bw files include:

  • They are binary, so they cannot be viewed or edited directly as text and must be created or converted with dedicated tools such as wigToBigWig.
  • They are intended for dense, continuous measurements and are not suitable for annotating discrete features such as genes or peaks.

Summary

File format | Type of data            | Storage method               | Application scenarios
BED         | Genomic regions         | Text                         | Genome annotation, ChIP-seq, ATAC-seq, etc.
WIG         | Continuous measurements | Text                         | Gene expression, DNA methylation, chromatin accessibility, etc.
BIGWIG      | Continuous measurements | Binary (compressed, indexed) | Gene expression, DNA methylation, chromatin accessibility, etc.
BEDGRAPH    | Continuous measurements | Text                         | Gene expression, DNA methylation, chromatin accessibility, etc.
SAM         | Alignment results       | Text                         | Alignment of high-throughput sequencing data
BAM         | Alignment results       | Binary (compressed)          | Alignment of high-throughput sequencing data
bw          | Continuous measurements | Binary (same as BIGWIG)      | Gene expression, DNA methylation, chromatin accessibility, etc.


BED files are used to store features and regions on the genome. Each line describes a feature or region on the genome, including fields such as chromosome name, start and end positions, name, score, and strand.

WIG files are used to store continuous measurement values along the genome. Each block declares its chromosome once, and each data line gives a position and a measured value.

BIGWIG files are the indexed binary version of WIG files and store the same data in compressed form.

BEDGRAPH files are a four-column text format based on BED. Each line gives the chromosome name, start position, end position and a value.

SAM files are used to store alignment results of high-throughput sequencing data (such as DNA-seq, RNA-seq, and ChIP-seq) as text. Each line describes the alignment of one read, including chromosome name, start position, base sequence, quality values and other fields.

BAM files are the compressed binary version of SAM files. They store the same information in less space and can be indexed for fast access.

The bw file is a bigWig file; ".bw" is simply its usual file extension.

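A wig file may begin with a track line (track type=wiggle_0 ...) that sets display parameters. Generally, you don't need to worry about these parameters unless you are already familiar with UCSC's Genome Browser tool.

Then you need to declare the properties of each chromosome block. The declaration line takes one of two forms, and the important parameters are:

fixedStep chrom=chrN start=position step=stepInterval [span=windowSize]
variableStep chrom=chrN [span=windowSize]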
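To show how the fixedStep parameters (chrom, start, step and the optional span) determine the position of each value, here is a minimal sketch. The function name and the declaration line used in the example are hypothetical; span is assumed to default to 1 when it is omitted, as in the UCSC wiggle description.

```python
def expand_fixed_step(declaration, values):
    """Yield (chrom, start, end, value) for a fixedStep block.

    Coordinates are returned as written in wig: 1-based, end inclusive.
    """
    # Parse the 'key=value' tokens after the leading 'fixedStep' keyword.
    params = dict(tok.split("=", 1) for tok in declaration.split()[1:])
    chrom = params["chrom"]
    start = int(params["start"])
    step = int(params["step"])
    span = int(params.get("span", 1))   # span defaults to 1 when omitted
    for i, value in enumerate(values):
        first = start + i * step
        yield chrom, first, first + span - 1, value

decl = "fixedStep chrom=chr1 start=10008 step=10 span=10"   # hypothetical block
for row in expand_fixed_step(decl, [7, 14, 27]):
    print(row)
```

With step equal to span, as here, consecutive values tile the chromosome without gaps; a step larger than span leaves positions without data.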

Here is a specific example of wig:

track type=wiggle_0 name=hek description=hek
variableStep chrom=chr1 span=10
10008    7
10018    14
10028    27
10038    37
10048    45
10058    43
10068    37
10078    26
~~~~~~~~~Omit the rest of the output~~~~~~~~~

UCSC also provides a wig file: http://genome.ucsc.edu/goldenPath/help/examples/wiggleExample.txt

You can see that I set very few parameters, and I directly used a script to convert the sorted bam file into a wig file.
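To make the variableStep layout concrete, the following minimal sketch expands such a file into explicit intervals. It is an illustration under stated assumptions, not the conversion script mentioned above: the function name is hypothetical, only variableStep blocks are handled, and positions are shifted from the 1-based wig convention to 0-based, half-open intervals as used by bedGraph.

```python
def wig_variable_step_to_intervals(lines):
    """Convert variableStep wig lines to (chrom, start, end, value) tuples.

    wig positions are 1-based; the result is 0-based, half-open as in bedGraph.
    """
    chrom, span = None, 1
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith("variableStep"):
            params = dict(tok.split("=", 1) for tok in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            continue
        pos, value = line.split()
        start0 = int(pos) - 1              # shift 1-based to 0-based
        intervals.append((chrom, start0, start0 + span, float(value)))
    return intervals

wig = [
    "track type=wiggle_0 name=hek description=hek",
    "variableStep chrom=chr1 span=10",
    "10008    7",
    "10018    14",
]
for iv in wig_variable_step_to_intervals(wig):
    print(*iv, sep="\t")
```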


The bigwig format needs little further explanation: it is the indexed, binary, compressed version of a wig file, which saves space.

We only need to use the tools provided by UCSC to convert our wig files. Following UCSC's example, the steps are as follows:

  • Save the example wiggle file (wigVarStepExample.gz) to your machine.
  • Save the text file containing the chrom.sizes for the human (hg19) assembly (hg19.chrom.sizes) to your machine.
  • Download the wigToBigWig utility.
  • Run the utility to create the bigWig output file:
    wigToBigWig wigVarStepExample.gz hg19.chrom.sizes myBigWig.bw

Finally, let's talk about the bedGraph format. It is an extension of the BED format with four columns: chromosome, start, end and value. To display it in UCSC's Genome Browser tool, it also needs a track line that sets display attributes, although usually only a few of them need to be defined:

track type=bedGraph name=track_label description=center_label
        visibility=display_mode color=r,g,b altColor=r,g,b
        priority=priority autoScale=on|off alwaysZero=on|off
        gridDefault=on|off maxHeightPixels=max:default:min
        graphType=bar|points viewLimits=lower:upper
        yLineMark=real-value yLineOnOff=on|off
        windowingFunction=maximum|mean|minimum smoothingWindow=off|2-16

One thing to note: these coordinates are zero-based, half-open.

Chromosome positions are specified as 0-relative. The first chromosome position is 0. The last position in a chromosome of length N would be N - 1. Only positions specified have data.

Positions not specified do not have data and will not be graphed.

All positions specified in the input data must be in numerical order.
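The rules above can be checked mechanically. The sketch below is a hypothetical helper, not part of any UCSC tool: it flags intervals that are empty or negative and intervals that start before the end of an earlier interval on the same chromosome.

```python
def check_bedgraph(lines):
    """Return a list of problems found in bedGraph data lines."""
    problems = []
    last_end = {}                      # chrom -> largest end seen so far
    for n, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        if start < 0 or end <= start:
            problems.append(f"line {n}: empty or negative interval")
        if start < last_end.get(chrom, 0):
            problems.append(f"line {n}: out of order or overlapping")
        last_end[chrom] = max(end, last_end.get(chrom, 0))
    return problems

good = ["chr1 9997 9999 1", "chr1 9999 10000 2", "chr1 10000 10001 4"]
bad = ["chr1 9999 10000 2", "chr1 9997 9999 1"]
print(check_bedgraph(good))
print(check_bedgraph(bad))
```

Adjacent intervals such as 9997-9999 and 9999-10000 pass the check because the end coordinate is exclusive, so they share no base.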

Below is a bedGraph file produced by MACS when calling peaks on ChIP-seq data. Such a file can also be generated directly from a BAM file with other tools:

track type=bedGraph name="hek_treat_all" description="Extended tag pileup from MACS version 1.4.2 20120305"
chr1    9997    9999    1
chr1    9999    10000   2
chr1    10000   10001   4
chr1    10001   10003   5
chr1    10003   10007   6
chr1    10007   10010   7
chr1    10010   10012   8
chr1    10012   10015   9
chr1    10015   10016   10
chr1    10016   10017   11
chr1    10017   10018   12


Origin blog.csdn.net/qq_52813185/article/details/135223846