To run Python commands, you need an interpreter. One is often already installed on Mac and Linux/Unix systems. If you are using Windows, you need to install the interpreter. In addition to the Python interpreter, you may also want to install an ‘Integrated Development Environment’ or IDE. RStudio is an example of a nice IDE for R. Spyder is a decent IDE for Python.

The Anaconda package will install Python (choose Python 3) and some nice tools including Spyder (like RStudio for Python) and your own local version of Jupyter Notebooks.

Jupyter Notebooks is a simple and handy way to learn and practice Python syntax and share scripts. If you have a CAC account you can access Jupyter Notebooks ‘in the cloud’ through the same portal as RStudio Server:

CAC Links

1. Python Basics

Make comments inside your code. Very important (unless you are using Jupyter notebooks)!

# Use hashtags to make comments - ignored by the Python interpreter
# Use other characters and blank lines to improve readability:
# ------------------------- 
# My first Python script 
# Today's Date
# -------------------------
# Add a summary description of what the script does
# This script will...
# And annotate individual parts of the script

print()

Use the print() function to output objects to the screen

print("string")
## string
print('string')
## string
print('''
use 3 quotation marks
to make a string 
that can span 
multiple
lines
''')
## 
## use 3 quotation marks
## to make a string 
## that can span 
## multiple
## lines

Basic Math

print(10+2) # add
## 12
print(10-2) # subtract
## 8
print(10*2) # multiply
## 20
print(10/2) # divide
## 5.0
# What about exponent?
print(10^2)
## 8

The ^ is exclusive-or (bitwise XOR), not exponent. This is the exponent operator in Python:

print(10**2) # exponent
## 100

What is bitwise XOR?

1. Convert number to binary notation
2. For each bit:
    a. 0 if bits are the same (i.e. both 1 or both 0)
    b. 1 if bits are different 

For example, in binary notation: 0 = 000, 1 = 001, 2 = 010, 3 = 011, 4 = 100, 5 = 101, 6 = 110, etc.

print(0^0)
## 0
print(0^1)
## 1
print(1^0)
## 1
print(1^1)
## 0
print(2^2)
## 0
print(3^5)
## 6
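
To see the bit-by-bit comparison behind the last example, it can help to print each number in binary. This is only a small sketch: it uses the built-in format() function with the '03b' format code (3 binary digits, padded with zeros), and the names a and b are made up for illustration.

```python
a = 3
b = 5
print(format(a, '03b'))      # binary notation for 3
print(format(b, '03b'))      # binary notation for 5
print(format(a ^ b, '03b'))  # 1 wherever the bits of a and b differ
print(a ^ b)                 # the same result as a decimal number
```

Reading the three binary strings one above the other shows which bits differ.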

A few more basic functions

print(11//3) # floor (round down)
## 3
print(10%3) # Modulus (remainder)
## 1
print(abs(-10)) # absolute value
## 10

Slightly more advanced math

Python uses ‘modules’ that contain functions, just like R uses ‘packages’. We have to import a module before we can use its functions. However, unlike R, we also have to specify the module name when calling the function:

import math 
math.sqrt(9) # Square Root
math.log(10) # Natural log
math.log(10,10) # Log base 10
math.log(10,2) # Log base 2
math.exp(2) # Power of e
math.pow(2,10) # 2 to the power of 10
math.pi # Good old pi -- notice no brackets
math.e # e -- also no brackets
math.sin(math.pi/2) # Sine function
math.asin(1) # Inverse sine
math.cos(math.pi) # Cosine
math.acos(-1) # Inverse cosine
math.tan(0) # Tangent
math.atan(0) # Inverse tangent

Round/Truncate

import math
round(math.pi,3) # standard rounding to 3 digits
math.floor(math.pi) # round down to closest whole number
math.ceil(math.pi) # round up to closest whole number

Python logic operators

Note: ! on its own is not an operator in Python. It appears only in != (not equal to); use the keyword not for negation/inverse.

1 > 2 # greater than
1 < 2 # less than
1 <= 2 # less than or equal to
1 == 1 # equal to
1 != 1 # not equal to
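(1 == 2) | (1 == 1) # | means 'OR' (a bitwise operator; the keyword or is more idiomatic)
(1 == 2) & (1 == 1) # & means 'AND' (a bitwise operator; the keyword and is more idiomatic)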
(1 == 1) & (1 == 1)
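
The brackets in the examples above matter more than they might seem. As a rough sketch (the variable names are made up for illustration), the code below shows the keyword forms and, or and not, and what happens when | is used without brackets: | binds more tightly than ==, so Python compares different things than intended.

```python
# Keyword versions of the logical operators
print((1 == 2) or (1 == 1))   # OR
print((1 == 2) and (1 == 1))  # AND
print(not (1 == 2))           # negation

# | and & are evaluated before ==, so brackets change the meaning
with_brackets = (1 == 2) | (1 == 1)
without_brackets = 1 == 2 | 1 == 1   # parsed as 1 == (2 | 1) == 1
print(with_brackets)
print(without_brackets)
```

With the keywords and and or the brackets are optional, which is one reason they are usually preferred for comparisons.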

PROTIP: | vs in

Instead of |, you can use in to check a large number of values:

my_list=[1,2,3,4,5,6,7,8,9,10]
1 in my_list

Random Numbers

Generate some random numbers. Useful for modelling, testing scripts, etc. Again, we need to import a package to do this

import random
random.randint(1,100) # random integer between 1 and 100
random.random() # Random floating point (decimal) number between 0 and 1
random.uniform(1,10) # Random floating point number between 1 and 10
random.gauss(0,1) # Random number from Gaussian distribution with mean 0 and standard deviation of 1
random.randrange(1,1000,10) # randomly select from 1 to 991 in steps of 10 (1, 11, 21, ...)
print(dir(random))
## ['BPF', 'LOG4', 'NV_MAGICCONST', 'RECIP_BPF', 'Random', 'SG_MAGICCONST', 'SystemRandom', 'TWOPI', '_BuiltinMethodType', '_MethodType', '_Sequence', '_Set', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_acos', '_bisect', '_ceil', '_cos', '_e', '_exp', '_inst', '_itertools', '_log', '_pi', '_random', '_sha512', '_sin', '_sqrt', '_test', '_test_generator', '_urandom', '_warn', 'betavariate', 'choice', 'choices', 'expovariate', 'gammavariate', 'gauss', 'getrandbits', 'getstate', 'lognormvariate', 'normalvariate', 'paretovariate', 'randint', 'random', 'randrange', 'sample', 'seed', 'setstate', 'shuffle', 'triangular', 'uniform', 'vonmisesvariate', 'weibullvariate']

Set seed

Fun fact: random numbers generated by a computer are calculated from a ‘seed’ number, so they are never truly random. They act random because the seed number is typically something unpredictable, like the millionth of a second of the time on your computer’s internal clock.

It’s not just philosophical, it is also useful for testing and debugging since you can set the seed to generate the same ‘random’ numbers.

Compare these outputs:

import random
random.seed(10)
random.randint(1,100)
random.seed(10)
random.randint(1,100)
random.randint(1,100)
random.randint(1,100)
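
One way to make the comparison explicit is to store the numbers and test whether the two runs match. This is a minimal sketch; the names first_run and second_run are made up for illustration, and the square-bracket loop syntax is explained in the flow control section.

```python
import random

random.seed(10)
first_run = [random.randint(1, 100) for _ in range(3)]

random.seed(10)  # reset to the same seed
second_run = [random.randint(1, 100) for _ in range(3)]

print(first_run)
print(second_run)
print(first_run == second_run)  # same seed, same 'random' numbers
```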

Combining objects

Lists are groups of objects defined by square brackets []

my_list=[1,2,3,4,5,6,7,8,9,10]
print(my_list)
## [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
my_list2=[1,2,3,4,5,6,7,8,9,10,"string"]
print(my_list2)
## [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 'string']

Q: Notice what happens when we add a string. How is this different from the c() function in R?

# The numeric values would switch to strings in the c() function in R:
c(1,2,3,4,5,6,7,8,9,10,"string")
##  [1] "1"      "2"      "3"      "4"      "5"      "6"      "7"     
##  [8] "8"      "9"      "10"     "string"
# In Python, lists do not support simple vector math. Multiplying a list repeats it instead:
print(my_list*10)
## [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
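
If you do want to multiply every element of a plain list, one option is to loop over the values inside square brackets. This is a small sketch, and the name times_ten is made up for illustration; NumPy arrays, covered later, are the more usual tool for vector math.

```python
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
times_ten = [value * 10 for value in my_list]  # multiply each element in turn
print(times_ten)
print(len(my_list * 10))  # list * 10 repeats the list, giving 100 elements
```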

2. Defining variables/objects

Cells

The most basic object is a single number or string

X="string"

Why no output?

When we wrote: X="string"

Python created the object called X, so no output is produced.

To see the contents of X:

print(X)
## string

You can also concatenate multiple items using print – similar to the paste() function in R

print("the string inside X is '",X,"'")
## the string inside X is ' string '
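
Notice the extra spaces around the value of X in the output above: print() separates its arguments with a space by default. The sketch below shows three common ways to avoid that. The variable message is made up for illustration, and f-strings assume Python 3.6 or newer.

```python
X = "string"
print("the string inside X is '", X, "'", sep="")  # sep="" removes the spaces
print("the string inside X is '" + X + "'")        # + joins strings directly
message = f"the string inside X is '{X}'"          # f-string (Python 3.6+)
print(message)
```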

A List

* A one-dimensional sequence of items (or multi-dimensional, when lists are nested), defined using square brackets
* Ordered, with indices starting at 0
* Can be all the same type or a mix of types (e.g. numeric & string)
* Lists can be nested within other lists
Xvec=[X,1,2,3,"E", "Computational Biology", my_list]
print(Xvec)
## ['string', 1, 2, 3, 'E', 'Computational Biology', [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]

Notice how the list my_list is incorporated into the list Xvec. How is this different from the c() function in R?

The list is NESTED (2-dimension) in Python but FLATTENED (1-dimension) in R. 

Subset lists with square brackets

print(Xvec)
## ['string', 1, 2, 3, 'E', 'Computational Biology', [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
print(Xvec[1])
## 1
print(Xvec[5])
## Computational Biology
print(Xvec[0])
## string
print(Xvec[-1:-3])
## []
print(Xvec[-4])
## 3
print(Xvec[0:3])
## ['string', 1, 2]

What do you notice about these indices? How is this different from indices in R?

* Positive values count from left to right
    * First element on left is ZERO, not ONE!
* Negative values count from right to left
    * First element on right is -1, not 0
* Range of values using colon [a,b) 
    * i.e. includes elements from a up to, but not including, b
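
The rules above also explain why Xvec[-1:-3] came back empty: slices run from left to right, and position -1 is already to the right of position -3. Here is a short sketch of some slice variations, using a shortened copy of Xvec without the nested list (an assumption made to keep the output small).

```python
Xvec = ['string', 1, 2, 3, 'E', 'Computational Biology']
print(Xvec[-3:-1])  # from third-last up to, but not including, the last
print(Xvec[-3:])    # leave out b to run to the end of the list
print(Xvec[:2])     # leave out a to start from the beginning
print(Xvec[-1:-3])  # empty: the start is to the right of the end
print(Xvec[::-1])   # a third value sets the step; -1 reverses the list
```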

Use multiple [] to drill down into individual nested terms

new_list=[my_list,my_list] # nests 2 lists within 3rd list
new_list2=my_list*2 # repeats elements 2 times
print(new_list)
## [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
print(new_list2)
## [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Nest [] to pull out nested lists
new_list2[0]
new_list[0]
new_list[0][0]

3. NumPy arrays

NumPy arrays

The package NumPy is popular for array math. NumPy arrays are more similar to the objects made by the c() and matrix() functions in R. They allow math on groups of numbers

import numpy
my_array=numpy.linspace(-1,1,5)
print(my_array)
## [-1.  -0.5  0.   0.5  1. ]
print(my_array*10)
## [-10.  -5.   0.   5.  10.]

PROTIP: Python package functions

It can get tedious always typing out the package before each function (e.g. numpy.linspace). There are two ways to work around this:

1. Use as with a custom name

import numpy as np
my_array=np.linspace(-1,1,5)

2. Use from to import the specific function

from numpy import linspace
my_array=linspace(-1,1,5)

NumPy matrices

A matrix is a 2-D array of cells, with one or more rows and columns.

from numpy import matrix
A = matrix([[11,12,13],[21,22,23],[31,32,33]])
x = matrix([[1],[2],[3]])
print(A)
## [[11 12 13]
##  [21 22 23]
##  [31 32 33]]
print(x)
## [[1]
##  [2]
##  [3]]

Some basic matrix algebra

print(A*x) # Multiply matrix A by vector x
## [[ 74]
##  [134]
##  [194]]
print(A.T) # Transpose A
## [[11 21 31]
##  [12 22 32]
##  [13 23 33]]
print(A.I) # Inverse of A
## [[  2.95161921e+14  -5.90323841e+14   2.95161921e+14]
##  [ -5.90323841e+14   1.18064768e+15  -5.90323841e+14]
##  [  2.95161921e+14  -5.90323841e+14   2.95161921e+14]]

Did you notice the extra square brackets in the output?
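
A side note on the inverse printed above: the enormous values are a warning sign. The rows of A are evenly spaced (row 2 minus row 1 equals row 3 minus row 2), so A is singular and has no true inverse; the numbers shown are rounding artifacts. The sketch below checks this with numpy.linalg.matrix_rank, using a regular array rather than a matrix object (an assumption made for simplicity).

```python
import numpy as np

A = np.array([[11, 12, 13], [21, 22, 23], [31, 32, 33]])
print(A[1] - A[0])               # difference between rows 2 and 1
print(A[2] - A[1])               # the same difference between rows 3 and 2
rank = np.linalg.matrix_rank(A)  # number of linearly independent rows
print(rank)                      # less than 3 means A is singular
```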

Higher-order array (tensors)

from numpy import tensordot
print(tensordot(A,A,axes=0)) # tensor (outer) product, like a %o% b in R
## [[[[ 121  132  143]
##    [ 231  242  253]
##    [ 341  352  363]]
## 
##   [[ 132  144  156]
##    [ 252  264  276]
##    [ 372  384  396]]
## 
##   [[ 143  156  169]
##    [ 273  286  299]
##    [ 403  416  429]]]
## 
## 
##  [[[ 231  252  273]
##    [ 441  462  483]
##    [ 651  672  693]]
## 
##   [[ 242  264  286]
##    [ 462  484  506]
##    [ 682  704  726]]
## 
##   [[ 253  276  299]
##    [ 483  506  529]
##    [ 713  736  759]]]
## 
## 
##  [[[ 341  372  403]
##    [ 651  682  713]
##    [ 961  992 1023]]
## 
##   [[ 352  384  416]
##    [ 672  704  736]
##    [ 992 1024 1056]]
## 
##   [[ 363  396  429]
##    [ 693  726  759]
##    [1023 1056 1089]]]]

Subset a matrix with [row,col]

from numpy import matrix
A = matrix([[11,12,13],[21,22,23],[31,32,33]])
print(A)
## [[11 12 13]
##  [21 22 23]
##  [31 32 33]]
print(A[0,1])
## 12
from numpy import tensordot
T = tensordot(A,A,axes=0)
print(T[0,0,1,1])
## 242
print(A*A)
## [[ 776  812  848]
##  [1406 1472 1538]
##  [2036 2132 2228]]

Standard matrix operators

from numpy import matrix
X = matrix([[11,12,13],[21,22,23],[31,32,33]])
Y = matrix([[1,0,0],[0,1,0],[0,0,1]])
print(X*2)
## [[22 24 26]
##  [42 44 46]
##  [62 64 66]]
print(X*Y) # matrix multiplication; Y is the identity matrix, so the result is X
## [[11 12 13]
##  [21 22 23]
##  [31 32 33]]
print(X+Y)
## [[12 12 13]
##  [21 23 23]
##  [31 32 34]]
print(Y/X) # element-wise division
## [[ 0.09090909  0.          0.        ]
##  [ 0.          0.04545455  0.        ]
##  [ 0.          0.          0.03030303]]
print(X^Y) # element-wise bitwise XOR, not a power
## [[10 12 13]
##  [21 23 23]
##  [31 32 32]]
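
The NumPy documentation no longer recommends the matrix class and suggests regular arrays instead. The operators behave differently on arrays, so here is a brief sketch of the array versions. The names elementwise, matmul and squared are made up for illustration, and the @ operator assumes Python 3.5 or newer.

```python
import numpy as np

X = np.array([[11, 12, 13], [21, 22, 23], [31, 32, 33]])
Y = np.eye(3, dtype=int)  # 3x3 identity matrix

elementwise = X * Y   # on arrays, * multiplies matching cells
matmul = X @ Y        # @ is matrix multiplication
squared = X ** 2      # ** is an element-wise power

print(elementwise)
print(matmul)
print(squared)
```

The main difference to remember is that * means matrix multiplication for matrix objects but element-wise multiplication for arrays.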

More advanced matrix algebra

Use numpy tensordot with axes=0 for the tensor product (https://en.wikipedia.org/wiki/Tensor_product):

# Tensor (outer) product, like a %o% b in R
print(tensordot(A,A,axes=0)) 
## [[[[ 121  132  143]
##    [ 231  242  253]
##    [ 341  352  363]]
## 
##   [[ 132  144  156]
##    [ 252  264  276]
##    [ 372  384  396]]
## 
##   [[ 143  156  169]
##    [ 273  286  299]
##    [ 403  416  429]]]
## 
## 
##  [[[ 231  252  273]
##    [ 441  462  483]
##    [ 651  672  693]]
## 
##   [[ 242  264  286]
##    [ 462  484  506]
##    [ 682  704  726]]
## 
##   [[ 253  276  299]
##    [ 483  506  529]
##    [ 713  736  759]]]
## 
## 
##  [[[ 341  372  403]
##    [ 651  682  713]
##    [ 961  992 1023]]
## 
##   [[ 352  384  416]
##    [ 672  704  736]
##    [ 992 1024 1056]]
## 
##   [[ 363  396  429]
##    [ 693  726  759]
##    [1023 1056 1089]]]]

Use axes=1 for the dot/scalar product (https://en.wikipedia.org/wiki/Dot_product):

# Tensor dot-product 
print(tensordot(A,A,axes=1))
## [[ 776  812  848]
##  [1406 1472 1538]
##  [2036 2132 2228]]

4. Lists & tuples

Lists

We have already seen lists, which are assigned using square brackets []

my_numlist=[1,2,3,4,5]
my_strlist=["a","b","c"]
print(my_numlist)
## [1, 2, 3, 4, 5]
print(my_strlist)
## ['a', 'b', 'c']

You can also subset and edit lists using square brackets

REMEMBER index number starts at 0, not 1

my_numlist=[1,2,3,4,5]
print(my_numlist [0:3])
## [1, 2, 3]
my_numlist[3]=1000
print(my_numlist)
## [1, 2, 3, 1000, 5]

Tuples

Tuples are sequences of Python objects similar to lists, but unlike lists they can’t be changed. They are defined using parentheses ()

my_tup=(1,2,3,4,5,"string")
print(my_tup[2])
## 3
my_tup[2]=1000
## TypeError: 'tuple' object does not support item assignment
## 
## Detailed traceback: 
##   File "<string>", line 1, in <module>
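
If you do need a changed version of a tuple, one common workaround is to copy it into a list, edit the list, and convert back. The sketch below also shows how to catch the error instead of stopping the script; the names as_list and new_tup are made up for illustration.

```python
my_tup = (1, 2, 3, 4, 5, "string")
try:
    my_tup[2] = 1000
except TypeError as err:
    print("Could not modify the tuple:", err)

as_list = list(my_tup)    # make a changeable copy
as_list[2] = 1000
new_tup = tuple(as_list)  # convert back if a tuple is needed
print(new_tup)
print(my_tup)             # the original tuple is unchanged
```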

Membership operators

Check for a value inside a list or tuple using in and not in:

nucleotides=["A","T","G","C"]
print("A" in nucleotides)
## True
print("A" not in nucleotides)
## False

Use and and or for multiple comparisons

print("A" in nucleotides and "R" in nucleotides)
## False
print("A" in nucleotides or "R" in nucleotides)
## True

5. Dictionaries

Defining

Dictionaries in Python are assigned with ‘curly brackets’ {}. Dictionaries are like lists and tuples, and they are mutable like lists, but they use customizable indices (called keys) instead of positions. Dictionaries are assigned with the general form key : value, with entries separated by commas:

adict={'seq' : 'GATTACA', 'species' : 'Human'}
print(adict)
## {'seq': 'GATTACA', 'species': 'Human'}

Updates & additions

You can update dictionaries with square brackets, just like lists. But unlike lists, the index doesn’t have to be a number; you use the key, which can be a string.

adict['species']="Homo sapiens"

You can also add new entries this way:

adict['ID']=90211
print(adict)
## {'seq': 'GATTACA', 'species': 'Homo sapiens', 'ID': 90211}
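
A few other everyday dictionary operations are sketched below: checking for a key, looking up a key that might be missing, looping over entries, and deleting an entry. The key 'length' and the default 'NA' are made up for illustration, and for loops are covered in the flow control section.

```python
adict = {'seq': 'GATTACA', 'species': 'Homo sapiens', 'ID': 90211}

print('seq' in adict)             # in checks the keys, not the values
print(adict.get('length', 'NA'))  # .get() returns a default if the key is missing
for key, value in adict.items():  # loop over key/value pairs
    print(key, "->", value)
del adict['ID']                   # remove an entry
print(adict)
```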

With print()

Dictionaries are handy for tracking outputs using %(key), followed by a conversion code for the type of variable:

  • %s – string
  • %d – decimal integer
  • %f – float/double (number with decimal place)
  • Booleans (True/False) have no code of their own; print them with %s
print("The sequence of %(species)s reference #%(ID)d is %(seq)s" %adict)
## The sequence of Homo sapiens reference #90211 is GATTACA
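
The % style shown above still works, but newer code often uses str.format() or f-strings for the same job. The sketch below builds the same sentence three ways so you can compare them. The variable names are made up for illustration, and f-strings assume Python 3.6 or newer.

```python
adict = {'seq': 'GATTACA', 'species': 'Homo sapiens', 'ID': 90211}

old_style = "The sequence of %(species)s reference #%(ID)d is %(seq)s" % adict
new_style = "The sequence of {species} reference #{ID} is {seq}".format(**adict)
f_string = f"The sequence of {adict['species']} reference #{adict['ID']} is {adict['seq']}"

print(new_style)
print(old_style == new_style == f_string)  # all three give the same text
```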

PROTIP: Formatting variables

You can use %d and %f to format numbers with many decimal places

from math import pi
print("Round pi to whole number %d" %pi)
## Round pi to whole number 3
print("Round pi to 2 decimal places %.2f" %pi)
## Round pi to 2 decimal places 3.14
print("Round pi to default number of decimal places (6) %f" %pi)
## Round pi to default number of decimal places (6) 3.141593
print("Round pi to 48 decimal places %.48f" %pi)
## Round pi to 48 decimal places 3.141592653589793115997963468544185161590576171875

What’s wrong with pi?

Now try pi to 100 decimal places.

print("Round pi to 100 decimal places %.100f" %pi)
## Round pi to 100 decimal places 3.1415926535897931159979634685441851615905761718750000000000000000000000000000000000000000000000000000

Why is pi broken (all 0s after the 48th decimal place)?

Hint: What is %f?
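
One way to explore the hint: a Python float is a 64-bit binary number, so math.pi is only the closest value to pi that fits in 64 bits. The sketch below uses the standard decimal module to print the exact value that is stored. The name exact is made up for illustration.

```python
from decimal import Decimal
from math import pi

exact = Decimal(pi)  # the exact value stored in the 64-bit float
print(exact)
print(len(str(exact)) - 2)  # number of digits after the decimal point
```

The stored value has exactly 48 decimal places, so anything printed beyond that is padding with zeros. Only about the first 15 decimal places agree with the true value of pi.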

6. Flow control

if/then

if/then statements in Python do not use brackets, making them much ‘cleaner’ and easier to read than if/then statements in R.

Instead of using brackets, the first line ends with a colon : and indentation (spaces or tabs, used consistently) is used to delimit subsequent lines within the if/then statement. This can be a bit confusing at first. To demonstrate, compare these two blocks of code:

CODE A

x=6
if x > 5:
  print("X is > 5. In fact, x is %s" %x)
  if x > 1:
    print("X is > 1")
## X is > 5. In fact, x is 6
## X is > 1

CODE B

x=6
if x > 5:
  print("X is > 5. In fact, x is %s" %x)
## X is > 5. In fact, x is 6
if x > 1:
  print("X is > 1")
## X is > 1

Now set x=4 and re-run both sets of code. Why is the output different?
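
To check your answer, here is a rough sketch that wraps each block in a function and collects the messages in a list instead of printing them. The function names code_a and code_b are made up for illustration, and custom functions are covered in a later section.

```python
def code_a(x):
    messages = []
    if x > 5:
        messages.append("X is > 5")
        if x > 1:  # nested: only checked when x > 5
            messages.append("X is > 1")
    return messages

def code_b(x):
    messages = []
    if x > 5:
        messages.append("X is > 5")
    if x > 1:  # not nested: always checked
        messages.append("X is > 1")
    return messages

print(code_a(4))
print(code_b(4))
```

In CODE A the second if is indented inside the first, so it is skipped whenever x is not greater than 5.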

elif

Use else and elif to clarify if two lines are part of the same if statement:

x=4
if x > 5:
  print("X is > 5. In fact, x is %s" %x)
else:
  print("X is > 1")
## X is > 1
x=4
if x > 5:
  print("X is > 5. In fact, x is %s" %x)
elif x > 1:
  print("X is > 1")
else:
  print("X is not > 1 or > 5")
## X is > 1

For loops

For loops use a similar structure to if/then statements:

for i in range(0,100,10):
  print(i)
## 0
## 10
## 20
## 30
## 40
## 50
## 60
## 70
## 80
## 90

You can also use for loops to cycle through lists, tuples or dictionaries

basepairs=["G","A","T","T","A","C","A"]
for i in basepairs:
  print(i)
  if i in ["A","G"]:
    print("Purine")
  elif i in ["T","C"]:
    print("Pyrimidine")
## G
## Purine
## A
## Purine
## T
## Pyrimidine
## T
## Pyrimidine
## A
## Purine
## C
## Pyrimidine
## A
## Purine

You can also use break to exit a loop early, and an else clause that runs only if the loop finishes without reaching break

inputs=["G","A","T","T","y","A","C","A"]
basepairs=["A","T","G","C"]
for i in inputs:
  for j in basepairs:
    if i == j:
      print(j)
      break
  else:
    print("The character ",i," is not a base pair")
## G
## A
## T
## T
## The character  y  is not a base pair
## A
## C
## A

This is a bit tricky – the else statement belongs to the second for statement, not the if statement nested inside it. Take a few minutes to walk through the loop for each value of i and j to understand how this works.
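
The same loop is sketched below with comments marking which lines run in each case. It is wrapped in a hypothetical function, check_inputs, that collects the results in a list so they are easy to inspect.

```python
def check_inputs(inputs, basepairs):
    results = []
    for i in inputs:
        for j in basepairs:
            if i == j:
                results.append(j)
                break          # leaves the inner loop and skips its else
        else:                  # runs only if the inner loop never reached break
            results.append(i + " is not a base pair")
    return results

print(check_inputs(["G", "y", "A"], ["A", "T", "G", "C"]))
```

For "y", the inner loop tries all four bases without a match, so it ends normally and the else clause runs.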

inline for loops

One handy way to use for loops in Python is inside of other objects, using square brackets []. This is known as a list comprehension.

single_digit_squares=[x**2 for x in range(10) ] 
# Recall: range(10) starts at 0 and includes whole numbers up to but not including 10.
print(single_digit_squares)
## [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
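
An inline loop can also include an if condition to filter values, and the same idea works inside curly brackets to build a dictionary. This is a small sketch; the names even_squares and square_lookup are made up for illustration.

```python
even_squares = [x**2 for x in range(10) if x % 2 == 0]  # keep even x only
print(even_squares)

square_lookup = {x: x**2 for x in range(4)}  # number -> square
print(square_lookup)
```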

While loops

While loops repeat for as long as some expression or criterion remains true.

counter=0
while counter < 10:
  counter +=1
  print(counter)
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10

PROTIP: Counter shortcut

Notice the line counter +=1, which is a nice Python shortcut for counter = counter + 1

7. Custom functions

Functions in Python are defined with the def keyword. As with flow control (above), the syntax is much simpler and cleaner than R

def seqtype(seq):
  for i in seq:
    print(i)
    if i in ["A","G"]:
      print("Purine")
    elif i in ["T","C"]:
      print("Pyrimidine")
myseq=["G","T","C","A"]
seqtype(myseq)
## G
## Purine
## T
## Pyrimidine
## C
## Pyrimidine
## A
## Purine

We can also use return to return an object (e.g. value, list, etc.) back to the user

def seqtype(seq):
  seqlen=0
  for i in seq:
    seqlen+=1
  return(seqlen)
myseq=["G","T","C","A","C","T","T","A"]
myseqlen=seqtype(myseq)
print("Sequence length is ",myseqlen)
## Sequence length is  8
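
A function can return any object, not just a single number. As a sketch, the hypothetical function count_base_types below combines the two examples above and returns a dictionary of counts. It also shows that the built-in len() gives the sequence length without a loop.

```python
def count_base_types(seq):
    """Count purines and pyrimidines in a sequence (hypothetical helper)."""
    counts = {"Purine": 0, "Pyrimidine": 0, "Other": 0}
    for base in seq:
        if base in ["A", "G"]:
            counts["Purine"] += 1
        elif base in ["T", "C"]:
            counts["Pyrimidine"] += 1
        else:
            counts["Other"] += 1
    return counts

myseq = ["G", "T", "C", "A", "C", "T", "T", "A"]
print(count_base_types(myseq))
print(len(myseq))  # built-in len() gives the sequence length directly
```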

8. Reading & Writing Files

Reading & writing text files in Python can take a bit of exploration and practice. BUT one important thing to remember is that you have to ‘close’ a file after you ‘open’ it; otherwise, text you have written may not be saved to the file and other programs may not be able to access it.

Open & close

Probably the most straight-forward way to read files in Python is to open the file and then copy the lines into a new object.

myfile='./FallopiaData.csv'
openfile=open(myfile,'r')
for line in openfile:
  line=line.strip('\n') # .strip('\n') removes newline characters from the start and end of the line
  print(line)
openfile.close() ## This closes the file -- very important!

NOTE: OUTPUT NOT SHOWN

Output is not shown since it just prints out the entire text of the file.
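
Because forgetting to close a file is such a common mistake, many Python scripts use a with block, which closes the file automatically when the block ends. The sketch below first writes a small stand-in file to a temporary folder so that it can run on its own. The file name example.csv and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small stand-in file so the example runs on its own
tmp_path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(tmp_path, 'w') as outfile:
    outfile.write("PotNum,Scenario\n1,low\n2,low\n")

lines = []
with open(tmp_path, 'r') as openfile:  # closed automatically after the block
    for line in openfile:
        lines.append(line.strip('\n'))

print(lines)
print(openfile.closed)  # no explicit .close() was needed
```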

PROTIP: Memory use

When you load a file into a data.frame object in R using MyData<-read.csv(), the entire file is loaded into memory. In contrast, the above code loads only a single line into memory at a time. This is one huge advantage of Python, and why it is so common for processing very large datasets (e.g. high-throughput sequencing).

On the other hand, you can also load a file into a Python object:

myfile='./FallopiaData.csv'
openfile=open(myfile,'r')
mydata=[]
for line in openfile:
  mydata.append(line.strip('\n')) # add the line to the end of the list
openfile.close() 
print(mydata)

PROTIP: Loops & methods inside lists

The above can be simplified as follows:

1. Combine file name with open() function
2. Include .strip() inside a list object using []
3. Include the for loop inside the same list object 
myfile=open('./FallopiaData.csv','r')
mydata=[line.strip('\n') for line in myfile]
myfile.close() ## This closes the file -- very important!
print(mydata[0:2]) # Print first few lines
## ['PotNum,Scenario,Nutrients,Taxon,Symphytum,Silene,Urtica,Geranium,Geum,All_Natives,Fallopia,Total,Pct_Fallopia', '1,low,low,japon,9.81,36.36,16.08,4.68,0.12,67.05,0.01,67.06,0.01']

Notice that the file is read in as raw text as a single list object (only one set of square brackets []).

Notice that strip() is a method applied to the object line with the application of a period: . Looping over myfile directly reads one line at a time, so a separate call to myfile.readlines() is not needed.

Parsing

Notice the printed output above. This is much different than a data.frame() object in R. If we want to make sense of the csv file, we have to parse the file into rows and columns, which can be a bit tricky using the methods we’ve learned so far.

Start by parsing by row, using the csv library

import csv
myfile=open("./FallopiaData.csv")
mydata=csv.reader(myfile)
print(mydata)
## <_csv.reader object at 0x000000001B622DB0>

Notice the mydata object is not a standard type of object that we usually work with. However, it does contain a set of objects, one for each row. We can access these with a for loop:

row=0
for line in mydata:
  row += 1
  print(line)
  if row > 2:
    break
## ['PotNum', 'Scenario', 'Nutrients', 'Taxon', 'Symphytum', 'Silene', 'Urtica', 'Geranium', 'Geum', 'All_Natives', 'Fallopia', 'Total', 'Pct_Fallopia']
## ['1', 'low', 'low', 'japon', '9.81', '36.36', '16.08', '4.68', '0.12', '67.05', '0.01', '67.06', '0.01']
## ['2', 'low', 'low', 'japon', '8.64', '29.65', '5.59', '5.75', '0.55', '50.18', '0.04', '50.22', '0.08']

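We can also look at a particular column, using square brackets. Note that mydata is an iterator, so this second loop carries on from where the previous loop stopped instead of starting again at the header row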

row=0
for line in mydata:
  row += 1
  print(line[2])
  if row > 2:
    break
## low
## low
## low
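
Looking up columns by position gets hard to read when there are many columns. As a sketch of an alternative, csv.DictReader from the same csv library uses the header row as dictionary keys. To keep the example self-contained, an in-memory text object stands in for the open file, with the first few values of FallopiaData.csv copied from the output above and trimmed to four columns (an assumption for brevity).

```python
import csv
import io

# Stand-in for the open file: header plus two rows, trimmed to four columns
sample = io.StringIO(
    "PotNum,Scenario,Nutrients,Taxon\n"
    "1,low,low,japon\n"
    "2,low,low,japon\n"
)

rows = list(csv.DictReader(sample))  # one dictionary per data row
print(rows[0]["Taxon"])              # look up a column by name
print([row["Nutrients"] for row in rows])
```

Note that every value is still read as a string, so numbers need converting with int() or float() before doing math.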

Writing to files

Writing is just like reading in terms of opening and closing the file, with file.write() instead of file.read() in between.

outfile=open('./OutFile.txt','w')
for line in range(10):
  outstring="This is line number %d" %line
  print(outstring)
  outfile.write(outstring+"\n")
## This is line number 0
## This is line number 1
## This is line number 2
## This is line number 3
## This is line number 4
## This is line number 5
## This is line number 6
## This is line number 7
## This is line number 8
## This is line number 9
outfile.close() 

Read/write

Python is a great tool for editing text files. By text files we mean anything that is human-readable, including a lot of input/output files from common bioinformatics programs (e.g. FASTA, FASTQ, SAM). Python is perfect for analysis pipelines where you want to take the output of one analysis (e.g. SAM) and modify it for input into another program (e.g. FASTQ of aligned reads only).

infile=open('./OutFile.txt','r')
outfile=open('./ModOutFile.txt','w')
for line in infile:
  outstring="%s with new info added" %line 
  print(outstring)
  outfile.write(outstring+"\n")
## This is line number 0
##  with new info added
## This is line number 1
##  with new info added
## This is line number 2
##  with new info added
## This is line number 3
##  with new info added
## This is line number 4
##  with new info added
## This is line number 5
##  with new info added
## This is line number 6
##  with new info added
## This is line number 7
##  with new info added
## This is line number 8
##  with new info added
## This is line number 9
##  with new info added
infile.close()
outfile.close() 

Notice that our added text goes on a new line. That’s because the line that we read in already has \n at the end. We can remove this with the .strip() method:

infile=open('./OutFile.txt','r')
outfile=open('./ModOutFile.txt','w')
for line in infile:
  instring=line.strip('\n')
  outstring="%s with new info added" %instring 
  print(outstring)
  outfile.write(outstring+'\n')
## This is line number 0 with new info added
## This is line number 1 with new info added
## This is line number 2 with new info added
## This is line number 3 with new info added
## This is line number 4 with new info added
## This is line number 5 with new info added
## This is line number 6 with new info added
## This is line number 7 with new info added
## This is line number 8 with new info added
## This is line number 9 with new info added
infile.close()
outfile.close() 

This can be even more powerful when combined with regular expressions (see Regex tutorial)

9. Pandas

As you can see above, Python is great for text files, but base Python is not an elegant tool for dealing with the kinds of data that are found in tables (e.g. csv files). A more recent library called pandas adds a lot of the more intuitive functionality of R

import pandas as pd
FileLoc="./FallopiaData.csv"
mydata=pd.read_csv(FileLoc)
print(mydata[:3])
##    PotNum Scenario Nutrients  Taxon  Symphytum  Silene  Urtica  Geranium  \
## 0       1      low       low  japon       9.81   36.36   16.08      4.68   
## 1       2      low       low  japon       8.64   29.65    5.59      5.75   
## 2       3      low       low  japon       2.65   36.03   17.09      5.13   
## 
##    Geum  All_Natives  Fallopia  Total  Pct_Fallopia  
## 0  0.12        67.05      0.01  67.06          0.01  
## 1  0.55        50.18      0.04  50.22          0.08  
## 2  0.09        60.99      0.09  61.08          0.15

Now you can see something much more analogous to R. There are also functions for inspecting and plotting the data. It’s still not quite as good as R, but much better than base Python. For more information on pandas, see the documentation and tutorials on the pandas website
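
As a small taste of those functions, the sketch below selects a column, filters rows, and prints summary statistics. To keep it self-contained, it reads from an in-memory text object holding three rows and four columns copied from the output above, rather than from FallopiaData.csv itself (an assumption for brevity).

```python
import io
import pandas as pd

# Stand-in for FallopiaData.csv: three rows and four columns from the output above
sample = io.StringIO(
    "PotNum,Taxon,Fallopia,Total\n"
    "1,japon,0.01,67.06\n"
    "2,japon,0.04,50.22\n"
    "3,japon,0.09,61.08\n"
)
mydata = pd.read_csv(sample)

print(mydata["Total"].mean())        # mean of one column
print(mydata[mydata["Total"] > 60])  # rows where Total exceeds 60
print(mydata.describe())             # summary statistics, like summary() in R
```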