
PyBCCython

Katy Huff edited this page Jan 29, 2012 · 5 revisions

Other Wiki Pages: PyBC , Session01, Session02, Session03, Session04, Session05, Session06, Session07, Session08, Session09, f2py, swig, Cpython, Cython, PyTables, PyTaps, PythonBots, Django, GIS, AdvancedPython, WxPython, standardlib,

Speeding up Python with Cython

Installing Cython

Cython (http://www.cython.org/, http://docs.cython.org/) is distributed as part of the Enthought Python Distribution and with Python(x,y). On a Linux-based system, the standard package manager (apt-get or yum) provides an installable cython package. You can also install Cython from PyPI or from source (http://www.cython.org/#download).

Session notes

This being a breakout session, I will be going quickly through the examples below due to time constraints. I encourage you to follow along as best you can, ask questions after the session, and work through this tutorial on your own.

The source files for these examples are attached: [attachment:cython-tutorial.tgz].

Motivation

Pros & Cons of pure python

If the PYBC has been successful, then you should have a good idea of the many benefits that Python can bring to any programming project:

  • Pseudo-code-like syntax
  • Very high level, human readability comes first
  • No compilation; fast edit-run-edit cycle
  • Extremely dynamic -- can easily dig deeper in language to do amazing things
  • Usually one obvious way to do it in Python
  • Can tackle projects of any size -- small scripts to tens of thousands of LOC
  • Wonderful built-in datastructures and features: lists, sets, dicts, iterators, generators, first-class functions, metaprogramming...
  • Accommodates many paradigms: functional (like Scheme/LISP), procedural (like C), OO (like Java, C++)

In addition to the above, there are many benefits of Python specific to scientific computing:

  • Numpy/Scipy extensions merge the best of fast C & Fortran scientific routines with Python
  • Matplotlib allows the creation of publication-quality visuals
  • Many research groups provide python bindings to their work for integration with other tools

One critique of Python that is often heard: "all this dynamism comes at a huge speed hit."

Is it true? That depends on what you mean by "speed" -- either speed of programming, or speed of the program itself. It is necessary to point out a rarely acknowledged tradeoff: the faster a program runs, the longer it usually takes to write, in large part because the fastest code must be written at a low level (C, assembly, machine code...).

Some illustrative examples:

For a shell-like script, you can write it in an hour, easily glue together many components and have it all working very easily. In this case a high-level language is essential and yields a net time *savings* (think of doing the equivalent in C or (shudder) Fortran). This is the traditional use of scripting languages and one where they always excel over lower-level languages.

Another example would be in prototyping a larger project: you can sketch it in pseudocode, translate that very quickly into Python, and test the implementation easily and see if it works. Python is ideal in this case, allowing easy experimentation and gradual refinement until you arrive at an optimally designed program. For scientific programs, one can often stop here, since it is very often fast enough. Even though the resulting code may not run as fast as a pure C or Fortran version of the same program, Python offers a huge time savings in writing the program itself. It is difficult to over-emphasize the magnitude of these time-savings. It can easily be on the order of days to months, if the human-friendly aspects of Python facilitate getting the code working. The low-level languages require much more effort to get to the same point, and are much more difficult to debug, which translates into more time before a working program exists.

When a program is a critical component and must run as fast as possible, the programming speed vs. program speed tradeoff must be resolved in favor of the latter. Many would scoff at using Python in this case, and this is where Cython comes in.

Cython (http://www.cython.org/) is a tool that takes Python code (optionally augmented with static typing information) and translates it into C code that can be compiled and run. Cython is designed to speed up Python code that simply must run as fast as possible.

Cython can achieve impressive speedups, and offers the best of both worlds -- C speed for critical code, with Python's syntax.

Example

Here's a simple example of numerical integration using the trapezoid rule. It is an ideal candidate for Cython, since it has a number of simply-typed variables (ints and floats) all used repeatedly inside a loop. There's also a (somewhat contrived) function call for illustration purposes.

#!Lineno
#!python
# file: slow_integration.py

from math import sin

def integrand(x):
    return sin(x)

def integrate(a, b, N):
    # N sample points give N-1 trapezoids of equal width
    if N > 1:
        step = (b-a)/(N-1)
    else:
        step = b-a

    result = 0.0
    fa = integrand(a)
    for i in range(1, N):
        fb = integrand(a + i * step)
        result += .5 * step * (fa + fb)
        fa = fb

    return result
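
As a quick sanity check of the trapezoid rule used in integrate, the sketch below repeats the same logic in a self-contained form and compares the result with the exact value of the integral of sin(x) over [0, 1], which is 1 - cos(1). The function name trapezoid and the choice of N are illustrative and are not part of the tutorial's source files.

```python
from math import sin, cos

def trapezoid(f, a, b, N):
    """Integrate f over [a, b] using N sample points (N - 1 trapezoids)."""
    step = (b - a) / (N - 1) if N > 1 else (b - a)
    result = 0.0
    fa = f(a)
    for i in range(1, N):
        fb = f(a + i * step)
        result += 0.5 * step * (fa + fb)
        fa = fb
    return result

approx = trapezoid(sin, 0.0, 1.0, 10**4)
exact = 1.0 - cos(1.0)
error = abs(approx - exact)
print("approx=%.10f exact=%.10f error=%.2e" % (approx, exact, error))
```

With this many sample points the error should be far smaller than anything the timing comparison cares about, so the later versions can be checked against the same exact value.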

The above module is used in a timing routine like this:

#!Lineno
#!python
# file: integration_timer.py

from timeit import repeat

# setup code template: import the module under test and define N
setup = """
import %s
N=%d"""

N=10**4

def runner(mod_name, N, extra_setup=''):
    _setup = extra_setup + (setup % (mod_name, N))
    return repeat("%s.integrate(0.0, 1.0, N)" % mod_name, setup=_setup, number=100)

if __name__ == '__main__':
    slow_repeat = runner("slow_integration", N)
    print("slow integration: %f seconds" % min(slow_repeat))

I refer you to the standard library's timeit module for usage of the timeit.repeat function. It repeats a measurement several times; each repetition runs the setup code once, then executes the code snippet many times (100 here, set by number) and returns the total time taken for that repetition. Taking the min of these values gives a good idea of the best runtime for the snippet.
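
For readers who have not used timeit.repeat before, here is a minimal standalone sketch of the call. The statement being timed (summing a list) and the repeat and number values are arbitrary choices for illustration.

```python
from timeit import repeat

# The setup string runs once per repetition, the statement runs `number`
# times, and the whole measurement is repeated `repeat` times.
timings = repeat(
    stmt="total = sum(values)",
    setup="values = list(range(1000))",
    repeat=3,
    number=100,
)

best = min(timings)
print("best of %d runs: %f seconds" % (len(timings), best))
```

Each entry in timings is the total time for 100 executions of the statement, which is why the timing script reports min(...) over the list rather than an average.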

Executing the above module:

$ python integration_timer.py

slow integration: 0.875929 seconds

With Cython and some small modifications to slow_integration.py we will see how much speedup we can get.

The first thing to do is just compile the unchanged Python source code with Cython to see what results. Let's copy the file slow_integration.py to fast_integration_v0.pyx and run Cython on it. The .pyx extension is a Cython specific file extension. We can use the pyximport module to easily compile and import our cython module:

#!Lineno
#!python

import pyximport
pyximport.install()   # lets "import" compile .pyx files on the fly

import fast_integration_v0

N = 10**4
print(fast_integration_v0.integrate(0.0, 1.0, N))

Running the above yields identical results to the slow_integration.integrate routine. We can easily extend our timing script to accommodate the new Cythonized version:

#!Lineno
#!python
# file: integration_timer.py

from timeit import repeat

setup = """
import %s
N=%d"""

N=10**4

def runner(mod_name, N, extra_setup=''):
    _setup = extra_setup + (setup % (mod_name, N))
    return repeat("%s.integrate(0.0, 1.0, N)" % mod_name, setup=_setup, number=100)

if __name__ == '__main__':
    slow_repeat = runner("slow_integration", N)
    print("slow integration: %f seconds" % min(slow_repeat))
    print("")

    # pyximport must be installed before the .pyx module is imported
    extra_setup = 'import pyximport; pyximport.install()'

    for fast_integration in ["fast_integration_v0"]:
        fast_repeat = runner(fast_integration, N, extra_setup)
        print("%s: %f seconds" % (fast_integration, min(fast_repeat)))
        print("speedup factor: %s" % (min(slow_repeat) / min(fast_repeat)))

Running the above prints:

$ python integration_timer.py

slow integration: 0.826386 seconds

fast_integration_v0: 0.554552 seconds
speedup factor: 1.49018705833

That is about a 1.5x speedup (the runtime drops by roughly 33%), with no change in the code and a little setup work. But we can do much better.
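
The two ways of quoting this result are easy to mix up: a speedup factor of about 1.49 corresponds to a runtime reduction of about 33%. A minimal sketch of the arithmetic, using the timings printed above (the variable names are illustrative):

```python
slow = 0.826386   # seconds, pure Python (from the output above)
fast = 0.554552   # seconds, fast_integration_v0

speedup_factor = slow / fast            # how many times faster the new code is
runtime_reduction = 1.0 - fast / slow   # fraction of the original runtime saved

print("speedup factor: %.2f" % speedup_factor)
print("runtime reduction: %.0f%%" % (100 * runtime_reduction))
```

The timing script reports the first quantity (slow / fast), so that is the number to compare across the later versions.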

One thing that slows Python down is function calls -- the dynamism Python offers becomes costly when a function call sits in an inner loop. So let's replace Python's math.sin with the equivalent from C's standard math library. This requires pulling in the appropriate C header and declaring the function signature in the Cython source file:

#!Lineno
#!python
# file: fast_integration_v1.pyx

cdef extern from "math.h":
    double sin(double x)

def integrand(x):
    return sin(x)

def integrate(a, b, N):
    # unchanged ...

This introduces a couple new Cython-specific keywords, cdef and extern. The line cdef extern from "math.h": tells Cython to include the C header file math.h and to declare anything in the indented block below, in this case, the function signature double sin(double x). You'll notice that there is no from math import sin line anymore; the sin function called in integrand is now the C math library's sin function. The primary difference speed-wise is that this function call is at the C level, and there is no Python overhead in the function call.

What difference does it make speed-wise? We save our new Cython source file as fast_integration_v1.pyx and modify our timing script:

#!Lineno
#!python
# file: integration_timer.py

# showing only the changed loop ...

    for fast_integration in ["fast_integration_v0", "fast_integration_v1"]:
        fast_repeat = runner(fast_integration, N, extra_setup)
        print("%s: %f seconds" % (fast_integration, min(fast_repeat)))
        print("speedup factor: %s" % (min(slow_repeat) / min(fast_repeat)))

Running the timing script yields:

$ python integration_timer.py

slow integration: 0.831489 seconds

fast_integration_v0: 0.571665 seconds
speedup factor: 1.45450458266
fast_integration_v1: 0.422799 seconds
speedup factor: 1.96662922337

So it cuts the runtime roughly in half relative to the pure Python version. Not bad -- it is a very minor change and we get an immediate benefit. Let's continue.

When the integrate function calls the integrand function, there is, yet again, Python-related overhead. All the arguments are Python Objects, and the return value is a Python Object. It would be nice to pass in not Python objects, but a naked C double and have integrand return a naked C double. Cython allows this quite easily:

#!Lineno
#!python
# file: fast_integration_v2.pyx

cdef extern from "math.h":
    double sin(double x)

cdef double integrand(double x):
    return sin(x)

# rest unchanged ...

The only modification here is to the integrand function signature. Instead of def, we use cdef to tell Cython to create a C function out of integrand, not a Python function. The return type is declared as double, and integrand is declared to take one double argument. Because sin also comes from the C math library, the code Cython generates for integrand is now plain C, with no Python objects involved. Let's see about speedup:

$ python integration_timer.py

slow integration: 0.831489 seconds

fast_integration_v0: 0.571665 seconds
speedup factor: 1.45450458266
fast_integration_v1: 0.422799 seconds
speedup factor: 1.96662922337
fast_integration_v2: 0.331946 seconds
speedup factor: 2.50489160609

Yet another significant speedup, for a very minor change. There is one caveat, however. The fact that integrand is now a pure C function and not a Python function means that integrand cannot be called from pure Python code outside the fast_integration_v2.pyx module. It is invisible to the outside world. There are ways around this limitation which will be covered later.

One last speedup is possible -- we can attack the integrate function. All its variables are Python variables; it could benefit from some static typing. We can type the function's arguments:

#!Lineno
#!python
# file: fast_integration_v3.pyx

def integrate(double a, double b, int N):
    # ...

Since it is still a def function, integrate can be called by external Python code. When Python code calls the integrate function, the arguments are converted from Python Objects to their corresponding C type (either double or int), and if a conversion is not possible, an exception is raised.

We can also type integrate's local variables, using the cdef keyword. Here is the entire "cythonized" function:

#!Lineno
#!python
# file: fast_integration_v3.pyx

# integrand same as previous version...

def integrate(double a, double b, int N):

    # static C types for every local variable
    cdef double step, result, fa, fb
    cdef int i

    if N > 1:
        step = (b-a)/(N-1)
    else:
        step = b-a

    result = 0.0
    fa = integrand(a)
    for i in range(1, N):
        fb = integrand(a + i * step)
        result += .5 * step * (fa + fb)
        fa = fb

    return result

Again, very small changes to the source code. The body of the function is unchanged in this case -- just some cdef lines and static typing of the arguments at the top. You will notice, too, that all variables are typed -- none are pure Python variables. This is important for loops, since it allows the entire loop to be converted into C without any expensive Python calls. Saving this as fast_integration_v3.pyx, incorporating it into the timing script and running yields:

$ python integration_timer.py

slow integration: 0.831489 seconds

fast_integration_v0: 0.571665 seconds
speedup factor: 1.45450458266
fast_integration_v1: 0.422799 seconds
speedup factor: 1.96662922337
fast_integration_v2: 0.331946 seconds
speedup factor: 2.50489160609
fast_integration_v3: 0.043582 seconds
speedup factor: 19.0787435174

A net speedup by a factor of ~20!

Here are the points to remember when Cythonizing:

  • Adding in static type information increases speed, but decreases flexibility.
      • Therefore, only Cythonize the stuff that must be fast, almost always inner loops.
  • Profile (by using the cProfile module) to find the hotspots of your code.
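
To illustrate the profiling step, here is a minimal sketch that runs the pure Python integration code under cProfile and prints the most expensive calls. The functions are repeated so the example is self-contained; the sort key and the number of rows shown are arbitrary choices, and the sketch assumes Python 3 for io.StringIO.

```python
import cProfile
import io
import pstats
from math import sin

def integrand(x):
    return sin(x)

def integrate(a, b, N):
    step = (b - a) / (N - 1) if N > 1 else (b - a)
    result = 0.0
    fa = integrand(a)
    for i in range(1, N):
        fb = integrand(a + i * step)
        result += 0.5 * step * (fa + fb)
        fa = fb
    return result

# Collect profiling data only around the call we care about.
profiler = cProfile.Profile()
profiler.enable()
integrate(0.0, 1.0, 10**4)
profiler.disable()

# Sort by cumulative time and show the five most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
report = stream.getvalue()
print(report)
```

In the report, the ncalls column for integrand shows how often the inner loop calls it, which is the kind of hotspot worth moving into Cython.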

What's going on under the hood?

Cython takes Python code with optional static type information and generates a C source file for an extension module. This source is compiled into a shared library: a .so file on Linux and Mac OS X, or a .pyd file (a DLL under another name) on Windows. Python can import the result and use it just like any other module, with the primary difference that the extension module is compiled to machine code. This potentially allows the extension module to be much faster than a pure Python module. It comes at the cost of portability and flexibility, however.
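
To see which filenames your own interpreter accepts for compiled extension modules, you can ask the import machinery directly. This is a small sketch assuming Python 3.3 or later, where importlib.machinery.EXTENSION_SUFFIXES is available; it is not part of the original tutorial files.

```python
import importlib.machinery
import math

# Filename suffixes this interpreter accepts for compiled extension modules.
suffixes = importlib.machinery.EXTENSION_SUFFIXES
print("extension suffixes:", suffixes)

# Some standard-library modules are themselves compiled extensions; when
# they are built into the interpreter there is no file on disk at all.
location = getattr(math, "__file__", None)
print("math module location:", location or "built into the interpreter")
```

A module compiled from a .pyx file ends up with one of the listed suffixes, which is how the import statement finds it.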

More Cython stuff

Cython has much more to it than demonstrated here. The most notable omissions include:

  • Extension types -- Cython allows you to create Python extension types very easily.
  • Numerical Programming features -- Cython has special support for Python buffers (a more generalized array) that work well with NumPy arrays. These can yield huge speedups over straight Python/Numpy code.
  • Code annotations -- A simple compiler flag ($ cython -a filename.pyx) instructs cython to output an html file that indicates what C code is generated for each line of Python/Cython source. Helpful to see what Cython is doing for you under the hood.
  • Code profiling -- Can profile Cython code performance selectively, hooks into Python's cProfile module seamlessly.

See the Cython documentation for more: http://docs.cython.org/

Cython on Windows

If you don't have a compiler, the Enthought Python Distribution comes with one (mingw). To help Cython use mingw, you will need to make a simple config file and put it in the correct directory. Create the file pydistutils.cfg with the following contents:

[build_ext]

compiler=mingw32

From a python shell, print os.path.expanduser('~'). This is the directory where pydistutils.cfg goes. Happy cythoning!
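
The lookup can also be scripted. The sketch below only computes and prints the target path for the config file described above; it does not create or modify anything.

```python
import os

# The user's home directory, as resolved by Python on this machine.
home = os.path.expanduser('~')

# Full path where pydistutils.cfg should be placed.
cfg_path = os.path.join(home, 'pydistutils.cfg')

print("place the config file at: %s" % cfg_path)
print("already exists: %s" % os.path.exists(cfg_path))
```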

Other Python-speedup tools

I list here some other Python-speedup tools in what is becoming a robust sub-ecosystem :-)

  • Pyrex -- the precursor to Cython, by Greg Ewing and collaborators http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/
  • Psyco -- A Just-In-Time (JIT) compiler for Python code. Very easy to use, but only works on the x86 instruction set (no 64-bit processors, no PowerPCs). Not actively improved, but bugs are fixed, AFAIK. http://psyco.sourceforge.net/
  • PyPy -- A very ambitious project that provides an implementation of Python (the interpreter & compiler, i.e. the thing you run at a command prompt) in Python (the programming language). It is quickly nearing completion, and reportedly absorbs the Psyco project and provides a JIT that works on all systems. http://codespeak.net/pypy/dist/pypy/doc/
  • ShedSkin -- Similar to Cython, although ShedSkin will compile a strict subset of the Python language to C++. No special syntax for static variables like Cython/Pyrex. http://shed-skin.blogspot.com/