Open source student letter Python tutorial
Concise Python text and video tutorials for students
The source code is at: https://github.com/Tong-Chen/Bioinfo_course_python
Table of contents
background introduction
Beginning of programming
Why learn Python
How to install Python
How to run Python commands and scripts
What editor to use to write Python scripts
Python program example
Python basic syntax
Numeric Variable Operations
String variable manipulation
list manipulation
set operation
Range use
dictionary operations
Hierarchical indentation
Variables, data structures, flow control
input Output
Interactive input and output
file read and write
Practical exercises (1)
background knowledge
Homework related work (1)
function operation
function operation
Biological letter-related assignments (2)
module
command line parameters
command line parameters
Biological letter-related homework (3)
More Python content
single block
List synthesis, a simplified for loop that produces a new list
lambda, map, filer, reduce (repertoire)
exec, eval (executes string python statements, repertoire)
regular expression
Python drawing
Reference
some practice questions
Given a file in FASTA format (test1.fa and test2.fa), write a program
cat.py
to read the file and output it to the screen (2 points)
open(file)
for .. in loop
print()
strip() function
Knowledge points used
Given a file in FASTQ format (test1.fq), write a program cat.py
to read the file and output it to the screen (2 points)
ditto
Knowledge points used
Write a program splitName.py
, read in test2.fa, and take the name before the first space of the original sequence name as the processed sequence name, and output it to the screen (2 points)
split
the index of the string
Knowledge points used
The output format is:
>NM_001011874 gcggcggcgggcgagcgggcgctggagtaggagctg.......
Write a program formatFasta.py
, read in test2.fa, connect each FASTA sequence into a line and output it (2 points)
join
strip
Knowledge points used
The output format is:
>NM_001011874 gcggcggcgggc......TCCGCTG......GCGTTCACC......CGGGGTCCGGAG
Write a program formatFasta-2.py
, read test2.fa, and divide each FASTA sequence into a sequence of 80 letters per row (2 points)
string slice operation
range
Knowledge points used
The output format is
>NM_001011874 gcggcggcgc.(60个字母).TCCGCTGACG #(每行80个字母) acgtgctacg.(60个字母).GCGTTCACCC ACGTACGATG(最后一行可不足80个字母)
Write a program sortFasta.py
, read in test2.fa, and take the name before the first space of the original sequence name as the processed sequence name, sort and output (2 points)
sort
dict
aDict[key] = []
aDict[key].append(value)
Knowledge points used
Extract the sequence given the name (2 points)
Knowledge points used
print >>fh, or fh.write()
Modulo operation, 4 % 2 == 0
Write a program
grepFasta.py
to extract the sequence of test2.fa corresponding to the name in fasta.name and output it to the screen.Write a program
grepFastq.py
to extract the sequence of test1.fq corresponding to the name in fastq.name and output it to a file.
Write a program screenResult.py
to filter the genes whose foldChange is greater than 2 and padj is less than 0.05 in test.expr, and can output the entire line or only the gene name. (4 points)
logical AND operator and
The contents read in the file are all strings, which need to be converted to integers with int and converted to floating point numbers with float
Knowledge points used
Write a program transferMultipleColumToMatrix.py
to convert the expression data of genes in multiple tissues in the file (multipleColExpr.txt) into a matrix form, and draw a heat map. (6 points)
aDict['key'] = {}
aDict[‘key’][‘key2’] = value
if key not in aDict
aDict = {'ENSG00000000003': {“A-431”: 21.3, “A-549”, 32.5,…},”ENSG00000000003”:{},}
Knowledge points used
Input format (only the first 3 columns are required)
Gene Sample Value Unit Abundance ENSG00000000003 A-431 21.3 FPKM Medium ENSG00000000003 A-549 32.5 FPKM Medium ENSG00000000003 AN3-CA 38.2 FPKM Medium ENSG00000000003 BEWO 31.4 FPKM Medium ENSG00000000003 CACO-2 63.9 FPKM High ENSG00000000005 A-431 0.0 FPKM Not detected ENSG00000000005 A-549 0.0 FPKM Not detected ENSG00000000005 AN3-CA 0.0 FPKM Not detected ENSG00000000005 BEWO 0.0 FPKM Not detected ENSG00000000005 CACO-2 0.0 FPKM Not detected
output format
Name A-431 A-549 AN3-CA BEWO CACO-2 ENSG00000000460 25.2 14.2 10.6 24.4 14.2 ENSG00000000938 0.0 0.0 0.0 0.0 0.0 ENSG00000001084 19.1 155.1 24.4 12.6 23.5 ENSG00000000457 2.8 3.4 3.8 5.8 2.9
Write a program reverseComplementary.py
to calculate ACGTACGTACGTCACGTCAGCTAGAC
the reverse complement of a sequence. (2 minutes)
reverse
list(seq)
Knowledge points used
Write a program collapsemiRNAreads.py
to convert smRNA-Seq sequencing data. (5 points)
Input file format (mir.collapse, tab-separated two-column file, the first column is the sequence, and the second column is the number of times the sequence was measured)
ID_REF VALUE ACTGCCCTAAGTGCTCCTTCTGGC 2 ATAAGGTGCATCTAGTGCAGATA 25 TGAGGTAGTAGTTTGTGCTGTTT 100 TCCTACGAGTTGCATGGATTC 4
Output file format (mir.collapse.fa, the first 3 letters of the name are the specific identification of the sample, the number in the middle indicates the sequence number, which is the only identification of the sequence name, and the third part is x plus each reads detected The number of times. The three parts are connected with an underscore as the name of the fasta sequence.)
>ESB_1_x2 ACTGCCCTAAGTGCTCCTTCTGGC >ESB_2_x25 ATAAGGTGCATCTAGTGCAGATA >ESB_3_x100 TGAGGTAGTAGTTTGTGCTGTTT >ESB_4_x4 TCCTACGAGTTGCATGGATTC
The simplified short sequence matching program (map.py) compares the sequences in short.fa to ref.fa, and outputs which sequences the short sequences match with which positions in the ref.fa file. (10 points)
find
Knowledge points used
Output format (the output format is bed format, the first column is the matched chromosome, the second and third columns are the start and end positions of the matched chromosome sequence (the position mark starts with 0, representing the first position; The termination position is not included. The position of the sequence shown in the first example is (199,208] (front closed and rear opened, actually the sequence of Chr1 chromosome 199-206, starting from 0). The fourth column is the short sequence itself the sequence of.).
Additional requirements: It can only match to a given template strand, or consider matching to the complementary strand of the template strand. At this time, the fifth column can be the name of the short sequence, and the sixth column is the information of the strand, which matches the template strand as '+' and matches the complementary strand as '-'. Note that when the complementary strand is matched, the starting position is also counted from the 5' end of the template strand.
chr1 199 208 TGGCGTTCA chr1 207 216 ACCCCGCTG chr2 63 70 AAATTGC chr3 0 7 AATAAAT
Daily Book Recommendations - Fluent Python
Luciano Ramalho, the author of "Smooth Python", is the chief consultant of Thoughtworks, a member of the Python Software Foundation, and the co-founder of Python Brasil, a well-known Python language learning community in Brazil. With 25 years of experience in Python programming, his "Smooth Python" is a classic in the field of programming, affecting nearly 80,000 readers, based on Python 3.10, with detailed content and nearly 500 well-designed code examples! There are also a large number of diagrams and tables, which are really friendly to learning! .
See the evaluation of ChatGPT for details:
Past products (click on the picture to go directly to the text corresponding tutorial)
Reply in the background with "The first wave of benefits in the Life Letter Collection" or click to read the original text to get a collection of tutorials