Molecular Biology Chapter 3 Genes, Genomes and Genomics

Chapter 3 Genes, Genomes, and Genomics

Section 1 Gene

1 Three stages in the understanding of the gene

  1. The chromosomal genetics stage

    • Mendel: every trait of an organism is controlled by genetic factors, and these factors are passed from parent to offspring, generation after generation.
    • In 1909, the Danish geneticist W. Johannsen first used the word "gene".
    • Morgan proposed the theory of the gene: the germplasm must be composed of independent elements, which we call genetic factors, or simply genes.
    • Definition at this stage: genes are heritable, independent elements located on chromosomes that control hereditary traits.
  2. The molecular biology stage

    • Once DNA was understood, the study of genes entered the molecular biology stage, in which gene structure can be understood at the molecular level.

    • Avery et al. showed that the chemical nature of the gene is DNA.

    • Watson and Crick elucidated the double-helix structure of DNA.

    • In 1941, Beadle and Tatum put forward the "one gene, one enzyme" hypothesis: a gene is a DNA segment responsible for encoding an enzyme (a protein). Because a protein may be composed of different subunits, the hypothesis was later revised to "one gene, one polypeptide chain".

    • Current definition of a gene: a gene is a nucleotide sequence that can be expressed to produce a gene product (protein or RNA). It includes coding sequences, regulatory sequences, introns, and the non-coding sequences at both ends of the coding region.

    • The concept of the gene continues to face new challenges.

    • Studies in recent years have found that the regulatory region of a gene is not necessarily adjacent to the coding region, and may not even lie on the same DNA molecule or chromosome. Spilianakis et al. found that the promoter region of the γ-interferon (IFN-γ) gene on chromosome 10 and the regulatory region of the TH2 cytokine genes on chromosome 11 lie close to each other in the nucleus and may be jointly regulated.

      The notion that genes have well-defined boundaries is also being challenged: there is evidence that two adjacent genes encoding distinct protein products can together produce fusion proteins. Although it is not known whether such fusion proteins are functional, the phenomenon is not uncommon. Some proteins are even assembled from exons located far apart or on different chromosomes. This new evidence may lead to a completely new concept of the gene: a unit of genomic sequence encoding a set of related functional products. The new definition classifies genes by their functional products (proteins or RNAs) rather than by specific DNA loci, with all other DNA elements classified as gene-associated regions.

  3. The Reverse Biology Stage of Genes

    • Traditional biology: from phenotypes to genes.
    • Reverse Biology: From Genes to Phenotypes.
    • Natural genes can now be isolated by various methods, and genes can also be deliberately synthesized, designed, and modified by chemical methods.

2 Characteristics of genes

(1) Jumping gene

Jumping gene, or mobile gene

  • A jumping gene is a DNA element that can move from one position to another in the chromosomal genome, and can even jump between different chromosomes.
  • This movement of a DNA sequence from one position in the genome to another is called transposition.
  • Such DNA sequences are called transposons or transposable elements.

(2) Split gene

Split gene (also called an interrupted or discontinuous gene)

In eukaryotic genes, the nucleotide sequence is interrupted by DNA spacers that do not code for amino acids, dividing the gene into several discontinuous segments. A gene whose coding sequence is interrupted in this way is called a split gene (interrupted gene, discontinuous gene). The non-coding spacers are called introns.

Split genes are not limited to protein-coding nuclear genes; some genes encoding RNAs, such as tRNAs, may also be split.

By the end of 1977 it had become clear that split genes are ubiquitous in eukaryotes.
Not only are most protein-coding nuclear genes in eukaryotes split genes, but nuclear genes encoding rRNA or tRNA may also be split.
The organelle genomes of plants and lower eukaryotes, such as mitochondrial genes in yeast and chloroplast genes in plants, may also contain split genes.
Split genes have even been found in certain archaea and coliphages.
However, eubacterial genomes generally do not contain split genes.

  • About introns: a summary

  • The vast majority of eukaryotic genes are split genes.
  • A few eukaryotic genes lack introns (e.g., histone and interferon genes).
  • Split genes are also found in a few prokaryotic systems (e.g., phage T4).
  • Not all introns are "silent": some encode proteins (such as splicing factors or transposases).
  • Not all exon sequence is "expressed" as protein: some of it does not encode amino acids (e.g., 88 nucleotides of the first exon of the human urokinase gene).
  • Generally speaking:
    • Lower eukaryotes have few introns, and the introns are short;
    • Higher eukaryotes have many introns, and the introns are long.
  • The origin and biological significance of introns are still not fully understood.
(3) Pseudogene

  • A pseudogene is an inactivated gene whose nucleotide sequence is largely the same as that of the corresponding functional gene but which cannot produce a functional protein. It is usually denoted by ψ.

  • Pseudogenes have been found in most eukaryotes.

  • It is estimated that the human genome contains about 20,000 pseudogenes.

  • The main characteristics of pseudogenes are homology to known genes and non-functionality.

  • Identifying pseudogenes is generally difficult. It is usually done by sequence alignment, checking whether both conditions are met (sequence similarity to the functional gene is typically 40% to 100%).

  • Three main types of pseudogenes

  • Pseudogenes can complicate molecular genetic studies: for example, when a gene is amplified by PCR, pseudogenes with similar sequences may be amplified as well.
  • Because pseudogene identification relies mainly on computational analysis of genome sequences with complex algorithms, misclassification is possible.
(4) Overlapping genes

The nucleotide sequences of different genes are sometimes shared; that is, the sequences of these genes overlap one another. Such genes are called overlapping genes or nested genes.

Packing more genetic information into a limited DNA sequence is an economical use of genetic material by the organism.

  • Gene overlap was first thought to be a way for simple organisms to make full use of their bases, but many overlapping genes were later found in eukaryotes, so the explanation may not be that simple.

  • In 1986, Henikoff and Spencer also discovered gene overlap in the Drosophila genome.

  • Overlapping genes exist not only in the genomes of bacteria and viruses but also in the genomes of higher eukaryotes.

  • Overlap occurs not only between two genes but also among three genes, and not only in coding sequences but also in regulatory sequences.

  • Gene overlap may not only allow more economical use of the genetic information in DNA but may also be involved in gene regulation.

(5) Gene family

According to how their members are distributed, gene families can be divided into:

  • Gene clusters

    • The members of the gene family are tightly clustered and arranged as a large block of tandem repeat units located in a specific region of a chromosome. They arose by amplification of the same ancestral gene.

    • For example, the human α-globin-like and β-globin-like gene clusters.

  • Dispersed gene families

    • Family members have no obvious physical linkage on the DNA and may even be scattered over multiple chromosomes, as in the actin gene family and the tubulin gene family.
  • According to the degree of sequence similarity between members, gene families are classified as:

    • Classical gene families with high sequence homology
    • Gene families with highly conserved sequences
    • Gene families with short conserved sequences
    • Gene superfamilies with no sequence homology
(6) Repeated genes and repeated sequences

Definition: repeated genes are genes present in multiple copies on chromosomes, mainly in eukaryotic genomes. They are usually genes for the most basic and important functions of the cell, such as histone genes, rRNA genes, and tRNA genes.

  • Histone genes

    • Histone genes are the only repeated genes known to encode proteins.

    • The copy number of histone genes differs between organisms.

    • The arrangement of histone genes in the genome also differs between organisms.

    • All histone genes are intron-free and highly conserved.

    • These features keep expression simple, so that histones can be synthesized in large amounts in a short time.

  • Repeated sequences

    • "Repeated sequence" is a broader concept than "repeated gene", because not all repeated sequences are genes.
    • All repeated genes are repeated sequences.
      • In lower eukaryotes, repeated sequences generally make up less than 20% of the genome.
      • In higher eukaryotes, the proportion can reach 50%-80%.
    • By copy number they are divided into:
      • Moderately repetitive sequences: relatively short sequences repeated 10 to 1000 times. They are generally non-coding and mainly play a role in gene regulation.
      • Highly repetitive sequences: very short sequences (less than 100 bp) repeated thousands to millions of times. Some are genes, such as rRNA genes and some tRNA genes; most are non-coding sequences with no transcriptional activity.
  • Repeated sequences can also be classified by their arrangement on the chromosome:

    • Tandem repeats: clustered in specific regions of chromosomes.
    • Interspersed repeats: scattered at many sites on chromosomes
      • Short interspersed elements (SINEs)
      • Long interspersed elements (LINEs)

  • Microsatellite DNA
    • Microsatellite DNA is both polymorphic and conserved, so it can be used as a molecular genetic marker. It is widely used in gene mapping, linkage analysis, paternity testing, and similar applications.
    • Microsatellite DNA is generally believed to arise from slipped-strand mispairing during DNA replication, which deletes or inserts one or several repeat units.
    • The function of microsatellite DNA in the genome is still unclear; it may be involved in changes of chromosome structure, gene regulation, and cell differentiation.
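
Because a microsatellite is a short unit repeated in tandem, and alleles differ by whole repeat units, a locus can be described by its unit and copy number. The following is a minimal Python sketch of that idea, not a production tool. The function name, the 1-6 bp unit range, the minimum of four copies, and the two test sequences are all assumptions made for illustration; only perfect repeats are detected.

```python
import re

def find_microsatellites(seq, min_repeats=4, max_unit=6):
    """Return (start, unit, copies) for each perfect tandem repeat.

    The unit is matched lazily, so the shortest repeating unit is reported
    (for example "CA" rather than "CACA").
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits

# Two hypothetical alleles of the same locus that differ by two CA units,
# as expected if replication slippage inserted two repeat units.
allele_a = "GGT" + "CA" * 5 + "TTG"
allele_b = "GGT" + "CA" * 7 + "TTG"

print(find_microsatellites(allele_a))  # [(3, 'CA', 5)]
print(find_microsatellites(allele_b))  # [(3, 'CA', 7)]
```

Comparing the copy numbers at the same locus in different individuals is what makes such repeats usable as genetic markers.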

3 Classification of genes

  • According to their function, genes can be divided into two categories:
    • Structural genes: genes that can be expressed to give functional products, including genes encoding proteins and genes encoding RNA.
    • Regulatory genes: DNA or RNA sequence units involved in regulating the expression of structural genes.

4 Gene structure

  • Prokaryotic coding regions are generally continuous, while eukaryotic coding regions are interrupted by introns.

  • Similarities and differences
    • In eukaryotes, one transcript generally encodes a single gene product (monocistronic).

5 Gene size

Average molecular weight of a protein: 40,000 Da
Average molecular weight of an amino acid: 100 Da
Average number of amino acids per protein molecule: 400 aa
Average size of a gene: 1200 bp (400 codons × 3 nucleotides)

In real genomes, gene size depends more on introns than on exons.
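
The 1200 bp figure follows directly from the averages listed above: 40,000 Da divided by 100 Da per residue gives 400 amino acids, and each amino acid needs one three-nucleotide codon. A small Python sketch of that arithmetic is shown here; the function name is illustrative, and introns, untranslated regions, and the stop codon are deliberately ignored.

```python
def coding_length_bp(protein_mass_da, residue_mass_da=100):
    """Estimate the coding-sequence length (exons only) from protein mass."""
    n_residues = protein_mass_da / residue_mass_da  # 40,000 / 100 = 400 aa
    return int(n_residues * 3)                      # 3 nucleotides per codon

print(coding_length_bp(40_000))  # 1200
```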

6 Number of genes

Number of genes in a given DNA = length in bp / 1200
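
As a rough check of this formula, it can be applied to two genome sizes quoted in these notes: E. coli K12 (about 4.6 Mb, 4288 protein-coding genes) and the human genome (2,851,330,913 bp, 22,287 protein-coding genes). The Python sketch below is only an illustration; the function and variable names are made up.

```python
AVERAGE_GENE_BP = 1200  # average gene size used in the notes

def estimate_gene_count(genome_bp, gene_bp=AVERAGE_GENE_BP):
    """Crude estimate that treats the whole genome as back-to-back genes."""
    return genome_bp // gene_bp

ecoli_bp = 4_600_000      # E. coli K12, about 4.6 Mb
human_bp = 2_851_330_913  # measured human sequence length

print(estimate_gene_count(ecoli_bp))  # 3833
print(estimate_gene_count(human_bp))  # 2376109
```

For the compact E. coli genome the estimate (3833) is close to the 4288 genes actually annotated. For the human genome it is about a hundred times too high, because most of the sequence is introns, repeats, and other non-coding DNA.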

  • Estimation from genome size;
  • Identification by genetic segregation;
  • Identification of ORFs (open reading frames) by sequencing;
  • Counting the number of expressed genes;
  • Mutational analysis.
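
Of these methods, ORF identification is the easiest to illustrate in code. The sketch below is a deliberately simplified Python example, not a real gene finder: it scans only the three forward reading frames, treats an ORF as an ATG followed in-frame by the first standard stop codon, and ignores introns and the reverse strand. The function name, the `min_codons` parameter, and the test sequence are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard nuclear stop codons

def find_orfs(seq, min_codons=2):
    """Return (start, end) positions of forward-strand ORFs.

    `start` is the index of the A in ATG; `end` is exclusive and
    includes the stop codon. `min_codons` counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it at the first stop
    return sorted(orfs)

example = "CCATGAAATTTTAGGG"
print(find_orfs(example))  # [(2, 14)]
print(example[2:14])       # ATGAAATTTTAG
```

Real annotation pipelines add many more criteria (codon usage, splice sites, homology), which is one reason gene counts from different methods disagree.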

In practice, current estimates carry a large error.

The human genome is currently estimated to contain about 25,000 to 30,000 genes.

  • In general, genome size and gene number increase with the structural and functional complexity of the organism.
  • There are, of course, exceptions.
  • The number of genes does not necessarily determine the complexity or evolutionary level of an organism. What fundamentally determines complexity is how genes are expressed and regulated (which is also what we want to understand).
  • N-value paradox: the complexity of an organism is not always positively correlated with its number of genes.
  • K-value paradox: the complexity of an organism is not always positively correlated with its number of chromosomes.

Section 2 Genome

1 The concept of genome

  • The term genome was first proposed in 1920 by Hans Winkler, a professor of botany at the University of Hamburg in Germany. The word is a blend of "gene" and "chromosome".

  • Originally, the genome was defined as the complete set of chromosomes in a haploid cell. Modern molecular biology and genetics define the genome as all the genetic information of an organism, encoded in DNA or RNA, including all genes and non-coding sequences.

  • In practice, the term genome can refer to the entire set of DNA in the nucleus (the nuclear genome) or to the entire set of DNA in an organelle (the mitochondrial or chloroplast genome). It can also include non-chromosomal genetic elements such as viruses, plasmids, and transposable elements.

2 Phage genomes

Genome research is usually divided between prokaryotic and eukaryotic systems and concentrates on a small number of model organisms.

Prokaryotic systems: bacteriophages use bacteria as hosts, so they are grouped with prokaryotes.

Eukaryotic systems: mainly yeast and the viruses that infect eukaryotes.

  • Phage genomes are generally double-stranded DNA, which circularizes after infecting the host and is then replicated.

3 Bacterial genomes

Bacterial genomes include two types of DNA molecule:
chromosomes, which carry all the genetic information needed for cell survival and reproduction, and plasmids, which are DNA molecules that exist independently of the chromosome.

  • Prokaryotes generally have only one chromosome, but under some growth conditions the chromosome can be present in multiple copies.
  • The genetic information carried by a plasmid is not essential for cell survival, and the presence or absence of the plasmid has no decisive effect on the survival of the host cell. (This makes plasmids good vectors for introducing foreign genetic information into bacteria.)

Escherichia coli is used here as the representative prokaryote.

  • There is no true nucleus; instead the DNA forms 2-4 relatively concentrated regions called nucleoids;
  • In 1997, the first complete Escherichia coli genome sequence (E. coli K12 strain) was finished;
  • The chromosomal DNA is a double-stranded circular molecule of $4.6 \times 10^{6}$ bp, containing 4288 protein-coding genes (organized into 2584 operons), 7 rRNA operons, and 86 tRNA genes;
  • A variety of DNA-binding proteins compact the chromosome into a scaffold structure that is divided into approximately 100 domains.
  • Features
  • Protein-coding genes are usually present in a single copy, while RNA genes are usually present in multiple copies.
  • Functionally related genes are usually arranged in tandem, and their expression is regulated in units of operons.
  • Different operons can be regulated by the product of the same regulatory gene, forming a regulon.
  • Gene density is very high: the average distance between genes is only 118 bp.
  • The genome contains a large number of transposable elements, repetitive sequences, prophages, and phage remnants.
  • The genomes of more than 60 E. coli strains have now been sequenced;
  • The genome sequences of different strains differ greatly: only about 20% of the sequence is present in all genomes, and the remaining 80% varies widely between genomes;
  • Each genome contains 4000 to 5500 genes.

4 Yeast genome

Yeast is used here as the representative eukaryote.

  • Overview

    • The genetic material of yeast includes nuclear DNA, mitochondrial DNA, and plasmid DNA;

    • In April 1996, sequencing of the Saccharomyces cerevisiae genome was completed, making it the first complete eukaryotic genome to be determined;

    • 12,068 kb; 5885 open reading frames with an average length of 1450 bp;

    • Genes are tightly packed, with short intergenic regions and few introns.

  • Features of the yeast genome

    • The GC content of the nuclear DNA is not uniform. Regions with high GC content generally lie in the middle of chromosome arms and have a high gene density; regions with low GC content generally lie near telomeres and centromeres and contain relatively few genes.
    • The genome contains numerous DNA repeats, including chromosome-end repeats, interspersed single-gene repeats, and clustered gene repeat regions.
  • Other points

    • At least 31% of the protein-coding genes or open reading frames are highly homologous to mammalian protein-coding genes.
    • The homology is often restricted to individual domains rather than whole proteins, reflecting the rearrangement of functional domains during protein evolution.
    • Yeast is therefore especially suitable as a model organism for human genome research.

5 Plant Genomes

  • Arabidopsis genome

    • In December 2000, the first plant genome, that of the crucifer Arabidopsis thaliana, was sequenced.

    • Arabidopsis thaliana has a small genome and a short life cycle and is easy to use in genetic experiments. It is an important model organism for plant molecular biology.

  • Rice genome

    • In April 2002, the US journal Science published the genome sequence of the indica subspecies of rice, completed jointly by 12 Chinese research institutions;
    • The genome is 466 Mb long ($4.66 \times 10^{8}$ bp) and contains 46,022-55,615 genes;

    • It was the largest plant genome sequenced following the Arabidopsis genome.
  • Features of the rice genome

    • The total number of genes is about 40,000, almost twice the number in the human genome;
    • Gene families have expanded mainly through gene duplication, but the function of each member is relatively simple;
    • The average length of a rice gene is only 4500 bases, while the average length of a human gene is 72,000 bases.

6 Human Genome

  • Human Genome Project

    • Research strategies
      • Mapping, then sequencing (clone by clone): Francis Collins
        • Led by the United States
        • Each part of a chromosome is first clearly mapped, and the mapped clones are then sequenced
      • Shotgun sequencing: Craig Venter
        • Proposed by a private company
        • The DNA is broken into many fragments, which are sequenced and then assembled by bioinformatics methods
      • Both methods can succeed
  • Final results

  • The sequence actually determined is 2,851,330,913 bp long, about 2.85 Gb;
  • It contains 22,287 protein-coding genes, including 19,438 known genes and 2,188 predicted genes;
  • These genes can produce 34,214 transcripts, an average of 1.5 transcripts per gene, indicating that many genes undergo alternative splicing;
  • There are 231,667 exons in total, an average of 10.4 exons per gene. The exon sequences of all genes sum to about 34 Mb, only 1.2% of the euchromatin in the human genome;
  • Several thousand additional genes encode various RNA products, and the function of most of the remaining sequence is still unknown.
  • Further conclusions

    • The human genome contains numerous repetitive sequences, which may be the raw material for new primate-specific genes;
    • About half of the human genome sequence is derived from transposable elements, but most transposons are inactive;
    • Many genes in the human genome are thought to have come from horizontal transfer from bacteria. The human genome was therefore shaped not only by mutation and rearrangement of its own genes but also by the introduction of external genes.
  • Chromosome 22 as an example

    • Completed at the end of 1999;
    • The short arm (22p) is pure heterochromatin and is considered to be devoid of genes;
    • The long arm (22q) is 34,491 kb long, of which 97% has been sequenced. It contains 679 annotated genes: 247 known genes, 150 related genes, 148 predicted genes, and 134 pseudogenes;
    • All annotated genes together (including introns) cover 39% of the length, while exons account for only 3%;
    • Repeated sequences make up 41% of 22q, including a large number of Alu sequences and LINEs.
  • Chromosome 21

    • Completed in 2000; the smallest autosome;
    • The sequence actually determined is 33.55 Mb long;
    • The short arm (21p) is 281 kb and may carry only 1 gene;
    • The long arm (21q) carries 284 annotated genes: 127 known genes, 98 predicted genes, and 59 pseudogenes;
    • It is associated with genetic diseases such as Down syndrome, Alzheimer's disease, and amyotrophic lateral sclerosis.
  • X chromosome

    • Completed in 2005;
    • Full length 155 Mb, of which 151 Mb has been sequenced;
    • Contains 1798 annotated genes, including 700 pseudogenes; 173 non-coding RNAs (ncRNAs), 2 tRNA genes, and 13 microRNAs;
    • Rich in L1 elements, which may be related to gene silencing on the X chromosome;
    • It is associated with sex-linked genetic diseases such as hemophilia and Duchenne muscular dystrophy; 10% of single-gene genetic diseases map to the X chromosome.
    • XIST RNA
      • Perhaps the most striking of these ncRNAs is the 32 kb XIST RNA (X inactivation-specific transcript), which plays an important role in X chromosome inactivation in females.
      • XIST RNA coats one of the two X chromosomes in a female cell and thereby silences most of the genes on it. The silencing acts in cis, that is, only on the chromosome from which the RNA is transcribed. On the other, active X chromosome, the XIST gene is switched off.
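
The averages and totals quoted in this section can be cross-checked with a few lines of arithmetic. The Python sketch below only re-derives numbers already given in the notes; the variable names are illustrative. One simplification is assumed: the exon fraction is computed against the full measured sequence length, whereas the notes state it relative to euchromatin.

```python
# Genome-wide figures quoted in the notes
genes = 22_287
transcripts = 34_214
exons = 231_667
exon_bp = 34_000_000        # about 34 Mb of exon sequence
genome_bp = 2_851_330_913   # measured sequence length

transcripts_per_gene = round(transcripts / genes, 1)
exons_per_gene = round(exons / genes, 1)
exon_percent = round(100 * exon_bp / genome_bp, 1)

print(transcripts_per_gene)  # 1.5
print(exons_per_gene)        # 10.4
print(exon_percent)          # 1.2

# Per-chromosome gene categories should add up to the annotated totals
chr22q = {"known": 247, "related": 150, "predicted": 148, "pseudogenes": 134}
chr21q = {"known": 127, "predicted": 98, "pseudogenes": 59}

print(sum(chr22q.values()))  # 679
print(sum(chr21q.values()))  # 284
```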

7 Mouse Genome

The mouse is the most important animal model for biomedical research.

  • Why it is important

    • Humans and mice share 99% of their genes;
    • Only about 300 genes are specific to one of the two species;
  • This makes the mouse very important to medicine.

8 Organelle Genomes

  • The vast majority of organelle genomes are circular; in a few lower eukaryotes they are linear molecules;

  • Mitochondrial DNA is tens of kb long, while chloroplast DNA can exceed two hundred kb;

  • The organelle genome encodes some of the proteins, tRNAs, and rRNAs that the organelle needs;

  • Organelles have their own protein synthesis system (it resembles the bacterial system, which is why organelles are hypothesized to be prokaryotes that were engulfed during evolution);

  • Some organelle proteins are encoded by nuclear genes;

  • Features

    • DNA is used extremely efficiently: genes are tightly arranged, spacers account for only 0.5% of the total DNA length, and there are overlapping genes.
    • Organelle genomes use their own stop codons: AGA or AGG (which encode Arg in nuclear genes).
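
The effect of the reassigned codons can be illustrated by reading the same sequence with two different sets of stop codons. The Python sketch below is illustrative only: the function name and the test sequence are made up, and the organelle stop-codon set assumes the vertebrate mitochondrial code, in which TGA codes for Trp rather than stop (a detail not stated in the notes above).

```python
STANDARD_STOPS = {"TAA", "TAG", "TGA"}          # nuclear (standard) code
VERT_MITO_STOPS = {"TAA", "TAG", "AGA", "AGG"}  # assumed: vertebrate mitochondria

def codons_before_stop(seq, stops):
    """Count codons read from position 0 before the first stop codon."""
    seq = seq.upper()
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in stops:
            return i // 3
    return len(seq) // 3  # no stop codon found in this frame

seq = "ATGCCCAGATTTTGA"  # codons: ATG CCC AGA TTT TGA

print(codons_before_stop(seq, STANDARD_STOPS))   # 4
print(codons_before_stop(seq, VERT_MITO_STOPS))  # 2
```

Read with the nuclear code, AGA is translated as Arg and translation continues to TGA; read with the mitochondrial code, translation stops at AGA.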

9 Genome size and the C-value paradox

  • C value (C-value)

    • The total number of DNA base pairs in the genome of a haploid cell.

    • For a given organism, the C value is relatively constant;

    • The C value varies greatly between organisms, from $10^{4}$ bp to $10^{11}$ bp;

    • In general, the more complex the structure and function of an organism, the greater its C value;

    • In lower eukaryotes, the C value is positively correlated with the structural and functional complexity of the species;

    • Some organisms with similar structure and function have very different C values;

    • In some eukaryotes, especially higher eukaryotes, the C value does not correlate with the organism's structural and functional complexity.

  • C-value paradox

    • The genome contains far more DNA than expected from the number of protein-coding genes;
    • In some species the C value is not positively correlated with the structural and functional complexity of the organism.

Section 3 Genomics

  • Genomics is the discipline that studies the structure and function of the entire genome.
  • It has two branches:
    • Structural genomics, which aims at whole-genome sequencing
    • Functional genomics, which aims at understanding genome function
  • Many genome structures have been determined, but their functions are still largely unknown, so current work is mainly in functional genomics
  • Functional genomics
    • Transcriptomics
      • Studies gene expression patterns at the RNA level (whether a gene is expressed, and how much)
      • Describes gene transcription in a specific cell under specific conditions
    • Proteomics
      • The study of the structure and function of all proteins
      • Studies the functions of all proteins at large scale
    • Because the object of study is all RNAs and all proteins, high-throughput measurements are required, such as DNA and RNA chips and two-dimensional electrophoresis. These methods are inseparable from bioinformatics, and many further "omics" fields have emerged, moving from the individual molecule to the whole

Origin blog.csdn.net/weixin_57345774/article/details/130138190