This repository has been archived by the owner on Jan 3, 2018. It is now read-only.

day1-python-packages.rst

Installing Python packages; useful Python packages

The easiest way to install Python packages is to use pip, the Python package installer.

pip and Anaconda on OS X

If you're using Anaconda on OS X, pip is already installed, but you will need to refer to it by its full path name:

~/anaconda/bin/pip

More generally, you'll need to prefix many of the commands below with '~/anaconda/bin/'. You can set this as a default for your current shell by doing:

export PATH=~/anaconda/bin:$PATH

(Or you can add that command to the file ~/.bashrc using nano; ask a TA for help!)
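To check which pip your shell will actually pick up after changing PATH, a small sketch like the following may help. It assumes Python 3.3 or later (where shutil.which exists); the helper name find_on_path is made up for illustration.

```python
import shutil
import sys


def find_on_path(command):
    """Return the full path of the first `command` found on PATH, or None."""
    return shutil.which(command)


# The interpreter that is running this script.
print("python:", sys.executable)
# The pip that a bare `pip` command would run, if there is one on PATH.
print("pip:", find_on_path("pip"))
```

If the pip line does not point into ~/anaconda/bin, the PATH change has not taken effect in the current shell.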

OPTIONAL: Using virtualenv

Down the road, if you're running on a machine where you don't have sysadmin access, you can use a Python package called 'virtualenv' to set up your own installation of Python into which you can install your own packages. Once virtualenv is installed (by a sysadmin, presumably), it's as simple as

python -m virtualenv NAME

where NAME is the name of your workspace, e.g.

python -m virtualenv env

followed by

. env/bin/activate

From that point on, you will be able to use pip to install things within this workspace, and Python (again from within that workspace) will be able to access and use those installed packages.
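If you are unsure whether a workspace is currently active, one way to check from inside Python is sketched below. The function name in_virtualenv is hypothetical, and the check relies on how virtualenv and the standard venv module set attributes on sys, so treat it as a heuristic rather than a guarantee.

```python
import sys


def in_virtualenv():
    """Best-effort check for an active virtualenv or venv workspace."""
    # Older virtualenv releases set sys.real_prefix inside a workspace.
    if hasattr(sys, "real_prefix"):
        return True
    # The venv module and newer virtualenv releases instead leave
    # sys.base_prefix pointing at the original Python installation.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("interpreter:", sys.executable)
print("inside a workspace:", in_virtualenv())
```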

Python packages

There are, literally, thousands of Python packages. The basic deal is this: Python comes with "batteries included", which means that you can do an amazing number of things with just a basic Python install. The Anaconda install and the VirtualBox virtual machine come with tons more stuff. But there's always the need to use an updated version of something, or a little package that someone wrote that addresses just your concern... so you'll always need to install stuff.

Here's how to install and use some potentially useful packages from my lab, but there's a whole world of Python packages out there. See http://docs.python.org/2/library for packages that come included with Python, and http://pypi.python.org/pypi for the Python package index for third-party packages.

screed

Screed is a little Python package from Titus's lab that reads in DNA sequences -- more explicitly, it's a FASTA and FASTQ parser. You can see some documentation here:

http://screed.readthedocs.org/en/latest/

But how do you use it?

To install screed directly from github, do:

pip install git+https://github.com/ged-lab/screed.git

Using screed:

screed can read FASTA and FASTQ files, as well as gzip- or bzip2-compressed versions of those files. For example, in the python directory there is a file called '25k.fq.gz'.
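For a sense of what a FASTQ parser has to do, here is a minimal standard-library sketch. It is not screed's implementation: it assumes plain four-line FASTQ records (no wrapped sequences), and the names Record, parse_fastq and read_fastq_gz are made up for this example.

```python
import gzip
from collections import namedtuple

# Hypothetical record type for this sketch; screed has its own record class.
Record = namedtuple("Record", ["name", "sequence", "quality"])


def parse_fastq(lines):
    """Yield a Record for each four-line FASTQ entry in an iterable of lines."""
    lines = iter(lines)
    for header in lines:
        header = header.strip()
        if not header:
            continue  # skip blank lines between or after records
        sequence = next(lines, "").strip()
        next(lines, "")  # the '+' separator line
        quality = next(lines, "").strip()
        # Drop the leading '@' from the header line (a choice made here).
        yield Record(header[1:], sequence, quality)


def read_fastq_gz(path):
    """Yield records from a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as fp:
        for record in parse_fastq(fp):
            yield record
```

A real parser also has to cope with malformed input and other formats, which is a good reason to use a tested package instead.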

Note:

All of the below screed commands are in the using-screed.ipynb notebook.

screed, in a nutshell, lets you read in all that data and access it in Python. Try:

import screed
for record in screed.open('/path/to/2012-11-scripps/python/25k.fq.gz'):
   print(record.name)
   print(record.sequence)
   print(record.accuracy)
   break   # stop after the first record

A couple of points here.

First, there are 25,000 sequences in this file, so you probably want to avoid printing them all out (hence the 'break' command at the end of the loop). This is a typical approach to reading through big files: just put in an "if I've done more than 10 things, stop" check.

Second, you can use this for short read data or genomic sequences or whatever. We've mostly designed it for short-read data but it works fine for genome-scale data (which is, after all, rather smaller than most short-read data...)

Third, you can open any sequence file that screed supports (FASTA or FASTQ, compressed or not) with this same command.
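The "stop after a few records" pattern from the first point can be sketched in two ways, shown below. The helper names and the in-memory Record type are hypothetical; in practice you would pass in the records from screed.open(...) instead of a list.

```python
from collections import namedtuple
from itertools import islice

# Stand-in record type so the sketch runs without a sequence file.
Record = namedtuple("Record", ["name", "sequence"])


def names_of_first(records, limit=10):
    """Collect names from at most `limit` records using an explicit counter."""
    names = []
    for count, record in enumerate(records):
        if count >= limit:
            break  # "if I've done more than `limit` things, stop"
        names.append(record.name)
    return names


def first_n(records, n=10):
    """Same idea using itertools.islice: take at most n records."""
    return list(islice(records, n))


reads = [Record("read%d" % i, "ACGT") for i in range(25)]
print(names_of_first(reads, limit=3))
print(len(first_n(reads)))
```

Both versions stop reading as soon as the limit is reached, so the rest of a large file is never touched.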

This can be a simple and handy way to extract a particular sequence from a large file:

for record in screed.open('/path/to/2012-11-scripps/python/25k.fq.gz'):
   if record.name == '@895:1:4:1596:8538/2':
      break

# do stuff with record

You can even pull out a list:

list_of_names = ['@895:1:4:1596:8538/2', '@895:1:4:1596:6003/2']
list_of_records = []

for record in screed.open('/path/to/2012-11-scripps/python/25k.fq.gz'):
   if record.name in list_of_names:
      list_of_records.append(record)

# do stuff with list_of_records

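(Note that you might want to use a 'set' instead of a list for list_of_names here, since checking membership in a set is much faster when there are many names.)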
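A set-based version of the lookup might look like the sketch below. The function name select_records and the in-memory Record type are assumptions for illustration; the records argument would normally come from screed.open(...).

```python
from collections import namedtuple

# Stand-in record type so the sketch runs without a sequence file.
Record = namedtuple("Record", ["name", "sequence"])


def select_records(records, wanted_names):
    """Return the records whose name is in wanted_names."""
    wanted = set(wanted_names)  # set membership tests take constant time on average
    return [record for record in records if record.name in wanted]


reads = [Record("a", "ACGT"), Record("b", "GGCC"), Record("c", "TTAA")]
picked = select_records(reads, ["a", "c", "not-there"])
print([record.name for record in picked])
```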

So how is this stuff useful!?

Well, here's one simple example:

n = 0.   # total number of bases (a float, so the division below is not truncated)
m = 0.   # number of G and C bases
for record in screed.open('/path/to/2012-11-scripps/python/25k.fq.gz'):
   n += len(record.sequence)
   m += record.sequence.count('G') + record.sequence.count('C')

print('%.3f G/C content' % (m / n,))

You can also do your quality trimming, or analysis of the first bases, or... whatever.
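As one possible take on "analysis of the first bases", the sketch below tallies which base each read starts with. The function name first_base_counts and the in-memory Record type are hypothetical, and any iterable of objects with a .sequence attribute should work.

```python
from collections import Counter, namedtuple

# Stand-in record type so the sketch runs without a sequence file.
Record = namedtuple("Record", ["name", "sequence"])


def first_base_counts(records):
    """Count how often each base appears as the first base of a read."""
    counts = Counter()
    for record in records:
        if record.sequence:  # ignore empty sequences
            counts[record.sequence[0].upper()] += 1
    return counts


reads = [Record("r1", "ACGT"), Record("r2", "aGGT"), Record("r3", "CTTT"), Record("r4", "")]
counts = first_base_counts(reads)
for base in sorted(counts):
    print(base, counts[base])
```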

Another example:

# the 'with' block closes out.fa once the loop has finished
with open('out.fa', 'w') as outfp:
   for record in screed.open('/path/to/2012-11-scripps/python/25k.fq.gz'):
      outfp.write('>%s\n%s\n' % (record.name, record.sequence))

This converts FASTQ to FASTA.

Also see the IPython Notebook, using-screed.ipynb.