The Best Language to Write Python In Is Rust

Trent Hauck

Trent Hauck

  • Principal at WHERE TRUE Technologies
  • ~8 years in biotech, recsys before that
  • Writing Python for about 12 years

https://www.linkedin.com/in/trent-hauck/ https://twitter.com/trent_hauck/

How WTT Uses Python + Rust

https://github.com/wheretrue/biobear

You Might Be Using Rust in Python Already

Lots of Python libraries have Rust under the hood.

Polars

Polars (Performance)

Pydantic

Ruff

Why use Rust with Python?

  • Performance of Rust compared to Python
  • Ease of implementation
  • Ease of distribution

Warning

It might take you a while to get up to speed with Rust. It’s an unforgiving language.

Agenda

  1. End to End Example
  2. Step back and look at SDLC
  3. Special Topic: Data Engineering
  4. Conclusion

End to End Function Example

Walk through a simple function that takes two numbers and returns their sum as a string.

Anatomy of a Function

#[pyfunction] // <1>
fn sum_as_string(
    a: usize, // <2>
    b: usize,
) -> PyResult<String> { // <3>
    Ok((a + b).to_string())
}
  1. Rust attribute to expose the function to Python
  2. Arguments must implement FromPyObject
  3. Two things: the return type must implement IntoPyObject, and PyResult is a Result type for PyO3
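
To make the conversion step concrete, here is a rough pure-Python stand-in for what the caller experiences. The helper to_usize is hypothetical, and the exception types it raises are assumptions about how a failed conversion to usize typically surfaces; the upper bound of usize is ignored.

```python
def to_usize(value: object, name: str) -> int:
    """Roughly mimic converting a Python argument into a Rust usize."""
    # Assumption: a non-int argument surfaces as a TypeError.
    if not isinstance(value, int):
        raise TypeError(f"argument '{name}': expected int, got {type(value).__name__}")
    # Assumption: a negative int surfaces as an OverflowError.
    if value < 0:
        raise OverflowError(f"argument '{name}': can't convert negative int to unsigned")
    return value


def sum_as_string(a: int, b: int) -> str:
    """Pure-Python stand-in for the Rust function above."""
    return str(to_usize(a, "a") + to_usize(b, "b"))


print(sum_as_string(1, 2))  # prints 3
```

The point is that argument conversion happens before the function body runs, so the Rust code only ever sees valid usize values.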

Creating a Module

Adding the function to a module is straightforward.

/// A Python module implemented in Rust.
#[pymodule]
fn pycascades2024(
    _py: Python, // <1>
    m: &PyModule
) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_as_string, m)?)?; // <2>
    Ok(())
}
  1. While not used here, Python (the PyO3 type) is a token that represents holding the GIL
  2. The add_function method and wrap_pyfunction! macro are used to expose the function to Python

Using the Function

Use maturin to build the module and install it into the current Python environment.

maturin develop

Then we can use the function in Python.

import pycascades2024

result = pycascades2024.sum_as_string(1, 2)
print(f"1 + 2 = {result}")
print(f"Type: {type(result)}")
1 + 2 = 3
Type: <class 'str'>

Function UX

Success, but we can improve the function in several ways to make it more user-friendly.

  • Function Signature
  • Docstrings
  • Type Annotations

Let’s look at a slightly more complex example.

New Function

#[pyfunction]
#[pyo3(signature = (a, b, /, times = 1))] // <1>
fn sum_as_string_with_times(
    a: usize,
    b: usize,
    times: usize
) -> PyResult<String> {
    if times < 1 { // <2>
        return Err(PyValueError::new_err("times must be greater than 0"));
    }

    let sum = (a + b) * times;
    Ok(sum.to_string())
}
  1. Function signature: a and b are positional-only, and times is optional with a default of 1
  2. Error handling for a times of zero; times is a usize, so negative values are rejected during argument conversion, before the function body runs
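
The signature string follows Python's own rules, so a pure-Python stand-in shows what callers can and cannot do. This is only a sketch with the same signature, not the Rust implementation.

```python
def sum_as_string_with_times(a: int, b: int, /, times: int = 1) -> str:
    """Pure-Python stand-in with the same signature as the Rust function."""
    if times < 1:
        raise ValueError("times must be greater than 0")
    return str((a + b) * times)


print(sum_as_string_with_times(1, 2))           # default times=1, prints 3
print(sum_as_string_with_times(1, 2, 3))        # times passed positionally, prints 9
print(sum_as_string_with_times(1, 2, times=3))  # times passed by keyword, prints 9

try:
    # a and b sit before the "/", so they cannot be passed by keyword.
    sum_as_string_with_times(a=1, b=2)
except TypeError as exc:
    print(f"TypeError: {exc}")
```

To make times keyword-only instead, the signature would need a * before it, as in (a, b, /, *, times = 1).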

Calling the New Function

Called with a valid times argument.

import pycascades2024

result = pycascades2024.sum_as_string_with_times(1, 2, times=3)
print(f"(1 + 2) * 3 = {result}")
(1 + 2) * 3 = 9

Called with times=0.

import pycascades2024

try:
    result = pycascades2024.sum_as_string_with_times(1, 2, times=0)
except ValueError as e:
    print(f"Error: {e}")
Error: times must be greater than 0

Adding Type Annotations to Our Function

.pyi stub files can be used to add type annotations to Rust functions.

def sum_as_string_with_times(a: int, b: int, /, times: int = 1) -> str:
    """Sum as string with times."""

Then you get pretty type hints in your $EDITOR.

Working with Types

Earlier we saw that arguments must implement FromPyObject and the return type must implement IntoPyObject.

#[derive(FromPyObject)] // <1>
enum IntOrSlice<'py> { // <2>
    Int(i32),
    Slice(Bound<'py, PySlice>), // <3>
}
  1. Implementing FromPyObject for an enum
  2. 'py is a lifetime specifier for the Python object
  3. This is a slice, like 1:3

Would enable something like:

x[1] = 2
x[1:3] = 2

See pyo3/examples/getitem for a complete example.
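
For comparison, here is the same int-or-slice dispatch written in plain Python. Buffer is a hypothetical container made up for this sketch; the isinstance checks do by hand what the derived FromPyObject does when it tries each enum variant.

```python
class Buffer:
    """Hypothetical container whose __setitem__ accepts an int or a slice."""

    def __init__(self, size: int) -> None:
        self.values = [0] * size

    def __setitem__(self, key: "int | slice", value: int) -> None:
        if isinstance(key, slice):
            # Slice branch: assign the value to every index the slice covers.
            for i in range(*key.indices(len(self.values))):
                self.values[i] = value
        elif isinstance(key, int):
            # Int branch: assign a single element.
            self.values[key] = value
        else:
            raise TypeError(f"unsupported index type: {type(key).__name__}")


x = Buffer(5)
x[1] = 2
x[1:3] = 7
print(x.values)  # prints [0, 7, 7, 0, 0]
```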

Error Handling

PyO3 follows Rust's error handling: fallible functions return a Result. As we saw, there are built-in error types that align with Python’s, so returning Err(PyValueError::new_err(...)) raises a ValueError in Python.

Custom error types are also possible. They must implement From<YourError> for PyErr (here, From<CustomIOError> for PyErr) so that the ? operator can convert them.

#[derive(Debug)]
struct CustomIOError;

// impl std::error::Error and std::fmt::Display

impl std::convert::From<CustomIOError> for PyErr {
    fn from(err: CustomIOError) -> PyErr {
        PyOSError::new_err(err.to_string())
    }
}

Class Example

Classes are a bit more complex, but follow a similar overall pattern.

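#[pyclass] // Expose the struct to Python as a class
struct Summer {
    a: usize,
}

#[pymethods] // Expose the methods in this impl block to Python
impl Summer {
    #[new] // Constructor, called for Summer(...) in Python
    fn new(a: usize) -> Self {
        Summer { a }
    }

    fn add(&self, b: usize) -> PyResult<String> { // Regular method: summer.add(b)
        Ok((self.a + b).to_string())
    }
}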

Using the Class

We can then use the class in Python.

import pycascades2024

summer = pycascades2024.Summer(1)
result = summer.add(2)

print(f"1 + 2 = {result}")
1 + 2 = 3

Similar improvements can be made to the class as we did with the function.
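
For example, a stub entry for the class might look like the sketch below. The file name pycascades2024.pyi and the docstrings are assumptions; only the constructor and the add method come from the Rust code above.

```python
# Hypothetical contents of pycascades2024.pyi for the Summer class.
class Summer:
    """Holds a number and adds it to another, returning the sum as a string."""

    def __init__(self, a: int) -> None: ...

    def add(self, b: int) -> str:
        """Return a + b as a string."""
        ...
```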

Software Development Life Cycle

So we wrote a bunch of Rust code, but how do we ship it?

maturin

Dev process is very simple thanks to maturin.

Go do the tutorial: https://www.maturin.rs/tutorial/

Build Process

Thanks to maturin and its GitHub Action, we can build and deploy our Rust library to PyPI with ease.

graph LR;
    A[GitHub Repo] --> B[Maturin GitHub Action];
    B --> C[x86_64-manylinux];
    B --> D[windows];
    B --> E[macOS];
    B --> F[Source];
    C --> G[PyPI];
    D --> G;
    E --> G;
    F --> G;

$ maturin generate-ci github > .github/workflows/CI.yml

PyPI

On PyPI you can see the specific wheels that were built for each platform.

Package Requirements

A big difference between Python and Rust is the package requirements. A PyO3 package needs a Rust toolchain to build, but the built wheel can be installed and used without a Rust toolchain and without pulling in other Python packages.

Pandas, for example, requires numpy, plus whatever else is needed as you go down the dependency tree.

dependencies = [
  "numpy>=1.23.5; python_version<'3.12'",
  "numpy>=1.26.0; python_version>='3.12'",
  "python-dateutil>=2.8.2",
  "pytz>=2020.1",
  "tzdata>=2022.7"
]

Special Topic: Data Engineering

Rust and Python are a great combination for data intensive applications.

Arrow as an Intermediate

Using Arrow we can pass data between Rust and Python without copying.
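
If "without copying" feels abstract, the standard library has a small analogue. The sketch below is not Arrow: it uses Python's buffer protocol (array plus memoryview) as a stand-in to show two handles sharing one block of memory.

```python
from array import array

values = array("q", [1, 2, 3])  # contiguous buffer of 64-bit signed integers
view = memoryview(values)       # a second handle onto the same memory, no copy

view[0] = 100                   # write through the view...
print(values[0])                # ...and the original sees it: prints 100
print(view.nbytes)              # prints 24 (three 8-byte integers)
view.release()
```

Arrow applies the same idea across the language boundary: both sides agree on the memory layout, so only pointers and metadata change hands.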

A Simple Example

#[pyfunction]
fn double_array(array: PyArrowType<ArrayData>) -> PyResult<PyArrowType<ArrayData>> {
    let array: Arc<dyn Array> = make_array(array.0); // Convert ArrayData to ArrayRef

    let array: &Int64Array = array.as_any().downcast_ref().ok_or_else(|| PyValueError::new_err("expected int64 array"))?; // <1>
    let new_array: Int64Array = array.iter().map(|x| x.map(|x| x * 2)).collect(); // <2>

    Ok(PyArrowType(new_array.into_data())) // <3>
}
  1. Downcast the Array to an Int64Array
  2. Double the values in the array
  3. Convert the Int64Array back into a pyarrow compatible ArrayData

We can then use this function in Python.

import pyarrow as pa
import pycascades2024

array = pa.array([1], type=pa.int64())
print(pycascades2024.double_array(array))
[
  2
]

The C Stream Interface

#ifndef ARROW_C_STREAM_INTERFACE
#define ARROW_C_STREAM_INTERFACE

struct ArrowArrayStream {
  // Callbacks providing stream functionality
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out); // <1>
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out); // <2>

  // ..
};

#endif  // ARROW_C_STREAM_INTERFACE
  1. Get the schema of the stream
  2. Get the next record batch
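
The two callbacks amount to a pull-based protocol: ask for the schema once, then keep asking for batches until the stream is exhausted. Here is a toy Python model of that shape. ToyArrayStream, Schema and Batch are made up for illustration, and None stands in for the C interface's end-of-stream signal.

```python
from typing import Iterator, List, Optional, Tuple

Schema = List[Tuple[str, str]]  # hypothetical: (column name, type name) pairs
Batch = List[dict]              # hypothetical: one dict per row


class ToyArrayStream:
    """Toy model of the two callbacks: get_schema and get_next."""

    def __init__(self, schema: Schema, batches: List[Batch]) -> None:
        self._schema = schema
        self._batches: Iterator[Batch] = iter(batches)

    def get_schema(self) -> Schema:
        # Every batch in the stream shares this schema.
        return self._schema

    def get_next(self) -> Optional[Batch]:
        # Return the next batch, or None once the stream is exhausted.
        return next(self._batches, None)


stream = ToyArrayStream([("id", "string")], [[{"id": "a"}], [{"id": "b"}]])
print(stream.get_schema())
while (batch := stream.get_next()) is not None:
    print(batch)
```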

Rust DataFrame to Python DataFrame

Apache DataFusion is a Rust-based execution engine that also provides a DataFrame API. Let’s look at an example of taking a Rust DataFrame and converting it to a Python DataFrame.

To do that we need to implement the RecordBatchReader trait and have a way to pass that from Rust to Python.

pub trait RecordBatchReader: Iterator<Item = Result<RecordBatch, ArrowError>> {
    // Required method
    fn schema(&self) -> Arc<Schema>;

    // Provided method
    fn next_batch(&mut self) -> Result<Option<RecordBatch>, ArrowError> { ... }
}

DataFrameRecordBatchStream

#[pin_project::pin_project]
/// A stream of record batches from a DataFrame.
pub struct DataFrameRecordBatchStream {
    #[pin]
    exec_node: SendableRecordBatchStream, // <1>

    rt: Arc<tokio::runtime::Runtime>, // <2>
}
  1. The SendableRecordBatchStream is a stream of arrow record batches that can be sent between threads.
  2. The Arc<tokio::runtime::Runtime> is used to run the stream; it is basically a thread pool.

Implementing RecordBatchReader

impl Iterator for DataFrameRecordBatchStream {
    type Item = arrow::error::Result<arrow::record_batch::RecordBatch>;

    fn next(&mut self) -> Option<Self::Item> {
        match self.rt.block_on(self.exec_node.next()) { // <1>
            Some(Ok(batch)) => Some(Ok(batch)),
            Some(Err(e)) => Some(Err(ArrowError::ExternalError(Box::new(e)))),
            None => None, // <2>
        }
    }
}

impl RecordBatchReader for DataFrameRecordBatchStream {
    fn schema(&self) -> SchemaRef {
        self.exec_node.schema()
    }
}
  1. Use the runtime to block on the next record batch.
  2. Return None when the stream is done.
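
The same async-to-sync bridge can be sketched in Python with asyncio, which may be easier to follow if tokio is new to you. Here batch_stream and BlockingReader are hypothetical names, and the event loop plays the role of the tokio runtime. This is an analogy, not how biobear does it.

```python
import asyncio
from typing import AsyncIterator, Iterator


async def batch_stream() -> AsyncIterator[list]:
    """Hypothetical async source standing in for the record batch stream."""
    for batch in ([1, 2], [3, 4]):
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        yield batch


class BlockingReader:
    """Synchronous iterator that blocks on an async stream, one item at a time."""

    def __init__(self, stream: AsyncIterator[list]) -> None:
        self._stream = stream.__aiter__()
        self._loop = asyncio.new_event_loop()  # plays the role of the runtime

    def __iter__(self) -> Iterator[list]:
        return self

    def __next__(self) -> list:
        if self._loop.is_closed():
            # Already exhausted on an earlier call.
            raise StopIteration
        try:
            # Block the calling thread until the next item is ready.
            return self._loop.run_until_complete(self._stream.__anext__())
        except StopAsyncIteration:
            # End of stream: translate to the synchronous iterator protocol.
            self._loop.close()
            raise StopIteration from None


for batch in BlockingReader(batch_stream()):
    print(batch)
```

Each call to __next__ blocks the calling thread on one item, just as block_on does for each record batch.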

Passing the Stream to Python

#[pymethods]
impl PyExecutionResult {
    // inside #[pymethods]
    fn to_arrow_record_batch_reader(&mut self, py: Python) -> PyResult<PyObject> {
        let dataframe_record_batch_stream = DataFrameRecordBatchStream::new(stream, runtime);

        let mut stream = FFI_ArrowArrayStream::new(Box::new(dataframe_record_batch_stream)); // <1>

        let stream_reader = unsafe { // <2>
                ArrowArrayStreamReader::from_raw(&mut stream).map_err(BioBearError::from)
        }?;

        stream_reader.into_pyarrow(py)
    }
}
  1. Create the FFI_ArrowArrayStream from the DataFrameRecordBatchStream
  2. Convert the FFI_ArrowArrayStream to an ArrowArrayStreamReader and then to a Python object

Calling from Python

Using biobear, we can read a FASTA file and then convert it to a DataFrame. This could just as easily be a SQL query or a CSV file.

import biobear as bb

conn = bb.connect()
file_io = conn.read_fasta_file("./assets/sequence.fasta")

for batch in file_io.to_arrow_record_batch_reader():
    print(batch)
pyarrow.RecordBatch
id: string not null
description: string
sequence: string not null
----
id: ["t"]
description: [null]
sequence: ["ATCG"]

Conclusion

What did we learn?

  1. PyO3 is a great way to write Python extensions in Rust
  2. Rust is fast and can be used to speed up Python code
  3. Arrow is a great way to pass data between Rust and Python
  4. Data Engineering is a great use case for Rust and Python, but there are more

Resources