Biopython
Karin Lagesen
karin.lagesen@bio.uio.no
ConcatFasta.py
Create a script that has the following:
  function get_fastafiles(dirname)
     gets all the files in the directory, checks whether they are fasta
       files (names end in .fsa), returns a list of the fasta files
     hint: you need os.path to create full relative file names
  function concat_fastafiles(filelist, outfile)
     takes a list of fasta files, opens and reads each of them,
       writes them to outfile
  if __name__ == "__main__":
     do what needs to be done to run the script
Remember imports!
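One possible shape for this script, given as a minimal sketch rather than a model solution. It assumes Python 3, that concatenating means copying each file's text to the output in sorted filename order, and that the directory and output file are passed on the command line in that order; the usage message and the newline handling are assumptions of this sketch.

```python
import os
import sys


def get_fastafiles(dirname):
    """Return the paths of all files in dirname whose names end in .fsa."""
    fastafiles = []
    for name in sorted(os.listdir(dirname)):
        # os.path.join builds the full relative path from directory and name
        path = os.path.join(dirname, name)
        if os.path.isfile(path) and name.endswith(".fsa"):
            fastafiles.append(path)
    return fastafiles


def concat_fastafiles(filelist, outfile):
    """Write the contents of every file in filelist to outfile, in order."""
    with open(outfile, "w") as out:
        for path in filelist:
            with open(path) as handle:
                text = handle.read()
            out.write(text)
            # Assumption: keep records apart if a file lacks a final newline
            if text and not text.endswith("\n"):
                out.write("\n")


if __name__ == "__main__":
    # Assumed usage: python ConcatFasta.py <directory> <output file>
    if len(sys.argv) == 3:
        concat_fastafiles(get_fastafiles(sys.argv[1]), sys.argv[2])
    else:
        print("usage: ConcatFasta.py <directory> <output file>")
```

Keeping the file discovery and the concatenation in separate functions means each can be tried on its own from the interactive prompt before the script is run as a whole.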
Object oriented programming
Biopython is object-oriented
Some knowledge of OOP helps in understanding
 how Biopython works
OOP is a way of organizing data and the
 methods that work on that data in a coherent
 package
OOP helps structure and organize the code
Classes and objects
A class:
  is a user-defined type
  is a mold for creating objects
  specifies how an object can contain and
    process data
  represents an abstraction or a template for how
    an object of that class will behave
An object is an instance of a class
All objects have a type, which shows which class
  they were made from
Attributes and methods
Classes specify two things:
  attributes – data holders
  methods – functions for this class
Attributes are variables that will contain the
 data that each object will have
Methods are functions that an object of that
 class will be able to perform
Class and object example
Class: MySeq
MySeq has:
   attribute length
   method translate
An object of the class MySeq is created like this:
   myseq = MySeq("ATGGCCG")
Get sequence length:
   myseq.length
Get translation:
   myseq.translate()
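The MySeq example can be sketched as a small runnable class. This is an illustration only, not Biopython code: the `sequence` attribute, the tiny `CODON_TABLE` and the choice to mark unknown codons with 'X' are assumptions made to keep the example short.

```python
class MySeq:
    """Toy sequence class: one attribute (length) and one method (translate)."""

    # Assumption: a deliberately tiny codon table, enough for this example only
    CODON_TABLE = {"ATG": "M", "GCC": "A", "GGC": "G", "TAA": "*"}

    def __init__(self, sequence):
        # Attributes hold the data that belongs to this particular object
        self.sequence = sequence
        self.length = len(sequence)

    def translate(self):
        """Translate complete codons; codons missing from the table become 'X'."""
        protein = []
        # Stop before any trailing bases that do not form a full codon
        for start in range(0, self.length - self.length % 3, 3):
            codon = self.sequence[start:start + 3]
            protein.append(self.CODON_TABLE.get(codon, "X"))
        return "".join(protein)


myseq = MySeq("ATGGCCG")      # instantiate: create an object of class MySeq
print(myseq.length)           # attribute access, no parentheses
print(myseq.translate())      # method call, with parentheses
print(type(myseq).__name__)   # the type of the object is its class
```

The class is the mold; `myseq` is one object made from it, and a second call to `MySeq(...)` would give an independent object with its own `length`.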
Summary
An object has to be instantiated, i.e.
 created, to exist
Every object has a certain type, i.e. is of a
 certain class
The class decides which attributes and
 methods an object has
Attributes and methods are accessed
 using . after the object variable name
Biopython
Package that assists with processing
 biological data
Consists of several modules – some with
 common operations, some more
 specialized
Website: biopython.org
Working with sequences
Biopython has many ways of working with
  sequence data
Components for today:
  Alphabet
  Seq
  SeqRecord
  SeqIO
Other useful classes for working with alignments,
 blast searches and results etc. are also available,
 but are not covered today
Class Alphabet
Every sequence needs an alphabet
CCTTGGCC – DNA or protein?
Biopython contains several alphabets
  DNA
  RNA
  Protein
  the three above with IUPAC codes
  ...and others
All can be found in the Bio.Alphabet package
Alphabet example
Go to freebee
Do module load python (necessary to find biopython
  modules) – start python
  NOTE: have to import Alphabets to use them
  >>> import Bio.Alphabet
  >>> Bio.Alphabet.ThreeLetterProtein.letters
  ['Ala', 'Asx', 'Cys', 'Asp', 'Glu', 'Phe', 'Gly', 'His', 'Ile', 
  'Lys', 'Leu', 'Met', 'Asn', 'Pro', 'Gln', 'Arg', 'Ser', 'Thr', 
  'Sec', 'Val', 'Trp', 'Xaa', 'Tyr', 'Glx']
  >>> from Bio.Alphabet import IUPAC
  >>> IUPAC.IUPACProtein.letters
  'ACDEFGHIKLMNPQRSTVWY'
  >>> IUPAC.unambiguous_dna.letters
  'GATC'
  >>> 
Packages, modules and classes
What happens here?
   >>> from Bio.Alphabet import IUPAC
   >>> IUPAC.IUPACProtein.letters

Bio and Alphabet are packages
    packages contain modules
IUPAC is a module
    a module is a file with python code
IUPAC module contains class IUPACProtein and
  other classes specifying alphabets
IUPACProtein has attribute letters
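The same package, module and class layering can be inspected in the Python standard library, which makes it possible to try the idea without Biopython installed. The sketch below uses `xml.etree.ElementTree` purely as a stand-in for `Bio.Alphabet.IUPAC`; the choice of that module is an assumption for illustration.

```python
import inspect
import xml
import xml.etree
from xml.etree import ElementTree

# Packages are directories of modules; they carry a __path__ attribute
print(hasattr(xml, "__path__"), hasattr(xml.etree, "__path__"))

# ElementTree is a module: a single file of Python code, so no __path__
print(inspect.ismodule(ElementTree), hasattr(ElementTree, "__path__"))

# The module contains classes, reached with the same dot notation
print(inspect.isclass(ElementTree.Element))
```

Here `xml` and `xml.etree` play the role of Bio and Alphabet, `ElementTree` the role of the IUPAC module, and `Element` the role of a class such as IUPACProtein.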
Seq
Represents one sequence with its alphabet
Methods:
  translate()
  transcribe()
  complement()
  reverse_complement()
  ...
Using Seq

>>> from Bio.Seq import Seq
>>> import Bio.Alphabet
>>> seq = Seq("CCGGGTT", Bio.Alphabet.IUPAC.unambiguous_dna)   # create object
>>> seq
Seq('CCGGGTT', IUPACUnambiguousDNA())
>>> seq.transcribe()                                           # use methods
Seq('CCGGGUU', IUPACUnambiguousRNA())
>>> seq.translate()
Seq('PG', IUPACProtein())
>>> seq = Seq("CCGGGUU", Bio.Alphabet.IUPAC.unambiguous_rna)   # new object, different alphabet
>>> seq.transcribe()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/site/VERSIONS/python-2.6.2/lib/python2.6/site-packages/Bio/Seq.py", line 830, in transcribe
    raise ValueError("RNA cannot be transcribed!")
ValueError: RNA cannot be transcribed!
>>> seq.translate()
Seq('PG', IUPACProtein())
>>> 
Alphabet dictates which methods make sense
Seq as a string
Most string methods work on Seqs
If a string is needed, use str(seq)
>>> seq = Seq('CCGGGTTAACGTA',Bio.Alphabet.IUPAC.unambiguous_dna)
>>> seq[:5]
Seq('CCGGG', IUPACUnambiguousDNA())
>>> len(seq)
13
>>> seq.lower()
Seq('ccgggttaacgta', DNAAlphabet())
>>> print seq
CCGGGTTAACGTA
>>> list(seq)
['C', 'C', 'G', 'G', 'G', 'T', 'T', 'A', 'A', 'C', 'G', 'T', 'A']
>>> mystring = str(seq)
>>> print mystring
CCGGGTTAACGTA
>>> type(seq)               # how to check which class or type an object has
<class 'Bio.Seq.Seq'>
>>> type(mystring)
<type 'str'>
>>> 
MutableSeq
Seqs are immutable, just as strings are
If a mutable sequence is needed, convert to MutableSeq
Allows in-place changes
  >>> seq
  Seq('CCGGGTTAACGTA', IUPACUnambiguousDNA())
  >>> seq[0] = 'T'
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: 'Seq' object does not support item assignment
  >>> mut_seq = seq.tomutable()
  >>> mut_seq[0] = 'T'
  >>> mut_seq
  MutableSeq('TCGGGTTAACGTA', IUPACUnambiguousDNA())
  >>> mut_seq.complement()
  >>> mut_seq
  MutableSeq('AGCCCAATTGCAT', IUPACUnambiguousDNA())
  >>> 
Notice: complement() changed the object itself!
SeqRecord
Seq contains the sequence and alphabet
But sequences often come with a lot more
SeqRecord = Seq + metadata
Main attributes:
   id – name or identifier
   seq – Seq object containing the sequence
SeqRecord is a class found inside the Bio.SeqRecord module
>>> seq                                         # existing sequence
Seq('CCGGGTTAACGTA', IUPACUnambiguousDNA())
>>> from Bio.SeqRecord import SeqRecord
>>> seqRecord = SeqRecord(seq, id='001')
>>> seqRecord
SeqRecord(seq=Seq('CCGGGTTAACGTA', IUPACUnambiguousDNA()), id='001', name='<unknown name>', description='<unknown description>', dbxrefs=[])
>>> 
SeqRecord attributes
From the biopython webpages:
Main attributes:

id - Identifier such as a locus tag (string)
seq - The sequence itself (Seq object or similar)

Additional attributes:

name - Sequence name, e.g. gene name (string)
description - Additional text (string)
dbxrefs - List of database cross references (list of strings)
features - Any (sub)features defined (list of SeqFeature objects)
annotations - Further information about the whole sequence (dictionary)
      Most entries are strings, or lists of strings.
letter_annotations - Per letter/symbol annotation (restricted dictionary). This holds
      Python sequences (lists, strings or tuples) whose length matches that of the
      sequence. A typical use would be to hold a list of integers representing
      sequencing quality scores, or a string representing the secondary structure.
SeqRecords in practice...
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import DNAAlphabet
>>> seqRecord = SeqRecord(Seq('GCAGCCTCAAACCCCAGCTG',
... DNAAlphabet()), id = 'NM_005368.2', name = 'NM_005368',
... description = 'Myoglobin var 1',
... dbxrefs = ['GeneID:4151', 'HGNC:6915'])
>>> seqRecord.annotations['note'] = 'Information goes here'
>>> seqRecord
SeqRecord(seq=Seq('GCAGCCTCAAACCCCAGCTG', DNAAlphabet()), id='NM_005368.2', name='NM_005368', description='Myoglobin var 1', dbxrefs=['GeneID:4151', 'HGNC:6915'])
>>> seqRecord.annotations
{'note': 'Information goes here'}
>>> 
SeqIO
How to get sequences in and out of files
Retrieves sequences as SeqRecords, can
 write SeqRecords to files
Reading:
  parse(filehandle, format)
  returns a generator that gives SeqRecords
Writing:
  write(SeqRecord(s), filehandle, format)

  NOTE: examples in this section are from http://biopython.org/wiki/SeqIO
SeqIO formats
List: http://biopython.org/wiki/SeqIO
Some examples:
  fasta
  genbank
  several fastq-formats
  ace
Note: a format might be readable but not
 writable depending on biopython version
Reading a file
  from Bio import SeqIO
  handle = open("example.fasta", "r")
  for record in SeqIO.parse(handle, "fasta"):
      print record.id
  handle.close()

SeqIO.parse returns a SeqRecord iterator
An iterator will give you the next element the
 next time it is called – compare to
 readline()
Useful because if a file contains many
 records, we avoid loading them all into
 memory at once
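The lazy behaviour of an iterator can be seen in a small pure-Python generator. This is a sketch of the idea only, not how SeqIO.parse is implemented: the function name `iter_fasta`, the (identifier, sequence) tuples it yields and the use of `io.StringIO` in place of a real file are all assumptions for illustration, and the code is written for Python 3.

```python
import io


def iter_fasta(handle):
    """Yield (identifier, sequence) pairs one record at a time."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if identifier is not None:
                yield identifier, "".join(chunks)
            words = line[1:].split()
            # Identifier is the first word after '>'
            identifier, chunks = (words[0] if words else ""), []
        elif line and identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# io.StringIO stands in for an open file so the example needs no data file
handle = io.StringIO(">seq1 first\nCCGG\nGTT\n>seq2\nATGC\n")
records = iter_fasta(handle)
print(next(records))   # only the first record has been produced so far
for identifier, sequence in records:
    print(identifier, len(sequence))
```

Each call to `next()` or each turn of the `for` loop produces one more record, so only the record currently being built is held by the generator.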
Exercise
Use mb.gbk, found in Karin's folder
Use the SeqIO methods to
   read in the file
   print the id of each of the records
   print the first 10 nucleotides of each record
 >>> from Bio import SeqIO
 >>> fh = open("mb.gbk", "r")
 >>> for record in SeqIO.parse(fh, "genbank"):
 ...     print record.id
 ...     print record.seq[:10]
 ... 
 NM_005368.2
 GCAGCCTCAA
 XM_001081975.2
 CCTCTCCCCA
 NM_001164047.1
 TAGCTGCCCA
 >>> 
SeqRecord lists and dictionaries
To get everything as a list:
  handle = open("example.fasta", "r")
  records = list(SeqIO.parse(handle, "fasta"))
  handle.close()

To get everything as a dictionary (keyed on record id):
  handle = open("example.fasta", "r")
  record_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
  handle.close()

But: avoid both if at all possible, since they load every record into memory at once
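What building a dictionary from an iterator involves can be sketched in plain Python. This is a conceptual illustration, not Biopython's code: the function name `records_to_dict`, the (identifier, sequence) tuples and the duplicate-identifier check are assumptions of this sketch.

```python
def records_to_dict(records):
    """Index (identifier, sequence) pairs by identifier, rejecting duplicates."""
    index = {}
    for identifier, sequence in records:
        # Two records with the same identifier cannot share one dictionary key
        if identifier in index:
            raise ValueError("Duplicate key '%s'" % identifier)
        index[identifier] = sequence
    return index


# iter() makes the list behave like a one-pass stream of records
pairs = iter([("seq1", "CCGGGTT"), ("seq2", "ATGC")])
index = records_to_dict(pairs)
print(index["seq2"])   # direct lookup by identifier
print(len(index))      # every record is now held in memory
```

The lookup by identifier is convenient, but the whole input has to be consumed and stored first, which is the cost the slide warns about.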
Writing files
  from Bio import SeqIO
  sequences = ... # add code here
  output_handle = open("example.fasta", "w")
  SeqIO.write(sequences, output_handle, "fasta")
  output_handle.close()

Note: sequences is here a list of SeqRecords
Can write any iterable containing
 SeqRecords to a file
Can also write a single SeqRecord
seq_length.py
Write a script that reads a file containing genbank
 sequences and writes out the name and sequence
 length of each
Should have
  Function sequence_length(inputfile)
      Open file
      Per seqRecord in input:
           Print name, length of sequence
      Close file
  if __name__ == "__main__":
      Get input from command line:
           inputfile
Modifications
Figure out how to:
  print the description of each genbank entry
  print which annotations each entry has
  print the taxonomy for each entry
Description:
  seqRecord.description
Annotations:
  seqRecord.annotations.keys()
Taxonomy:
  seqRecord.annotations['taxonomy']
tag_fasta.py
Create a script that takes a file containing fasta
  sequences, adds a tag at the front of each name and
  writes the result out to a new file
Should have
  Function change_name(seqRecord, tag)
     Change name, return seqRecord
  Function read_write_fasta(tag, input, output)
     Per seqRecord in input:
         Change name
         Write to output
  if __name__ == "__main__":
     Get input from command line:
         Tag, input file, output file
Optional homework
convert.py
Create a script that:
  takes input filename, input file type, output
    filename and output file type
  converts input file to output file type and writes it
    to output file



Editor's Notes

  • #11: Show webpage for alphabets. Each alphabet is a separate class