import pycascades2024
result = pycascades2024.sum_as_string(1, 2)
print(f"1 + 2 = {result}")
print(f"Type: {type(result)}")
1 + 2 = 3
Type: <class 'str'>
https://www.linkedin.com/in/trent-hauck/ https://twitter.com/trent_hauck/
Lots of Python libraries have Rust under the hood.
Warning
It might take you a while to get up to speed with Rust. It’s an unforgiving language.
Walk through a simple function that takes two numbers and returns their sum as a string.
#[pyfunction]
fn sum_as_string(a: usize, b: usize) -> PyResult<String> {
    Ok((a + b).to_string())
}
Arguments must implement FromPyObject, the return type must implement IntoPyObject, and PyResult is a Result type for PyO3 that turns Rust errors into Python exceptions.
Adding the function to a module is straightforward.
/// A Python module implemented in Rust.
#[pymodule]
fn pycascades2024(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_as_string, m)?)?;
    Ok(())
}
The Python argument is a token that represents holding the GIL. The add_function method and the wrap_pyfunction! macro are used to expose the function to Python.
Use maturin to build the module and install it into the current Python environment.
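maturin is driven by pyproject.toml. A minimal sketch of that configuration (the package name is taken from the module above; the version pins are illustrative):

```toml
[build-system]
requires = ["maturin>=1.0,<2.0"]
build-backend = "maturin"

[project]
name = "pycascades2024"
requires-python = ">=3.8"
```

With this in place, `maturin develop` compiles the Rust crate and installs the module into the active virtual environment.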
Then we can use the function in Python.
Success! But we can improve the function in several ways to make it more user-friendly.
Let's look at a slightly more complex example.
#[pyfunction]
#[pyo3(signature = (a, b, /, times = 1))]
fn sum_as_string_with_times(a: usize, b: usize, times: usize) -> PyResult<String> {
    if times < 1 {
        return Err(PyValueError::new_err("times must be greater than 0"));
    }
    let sum = (a + b) * times;
    Ok(sum.to_string())
}
The signature makes a and b positional-only and gives times a default of 1, so callers can pass times by keyword. We also validate times before computing.
Called with a valid times argument:
import pycascades2024
result = pycascades2024.sum_as_string_with_times(1, 2, times=3)
print(f"(1 + 2) * 3 = {result}")
(1 + 2) * 3 = 9
Called with times=0, the validation fails and a ValueError is raised.
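To make the semantics concrete, here is a pure-Python sketch that mirrors the Rust function's behavior (this is an illustrative stand-in, not the PyO3 implementation):

```python
def sum_as_string_with_times(a, b, /, times=1):
    # Mirror of the Rust validation: reject non-positive repeat counts.
    if times < 1:
        raise ValueError("times must be greater than 0")
    # Sum, repeat, and return as a string, like the Rust version.
    return str((a + b) * times)

print(sum_as_string_with_times(1, 2, times=3))  # 9
try:
    sum_as_string_with_times(1, 2, times=0)
except ValueError as err:
    print(err)  # times must be greater than 0
```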
.pyi stub files can be used to add type annotations to Rust functions.
def sum_as_string_with_times(a: int, b: int, /, times: int = 1) -> str:
"""Sum as string with times."""
Then you get pretty type hints in your $EDITOR.
Earlier we saw that arguments must implement FromPyObject and the return type must implement IntoPyObject.
Implementing FromPyObject for an enum lets a single argument accept several Python types; in that impl, 'py is a lifetime specifier for the Python object. For indexing, this would enable something like obj[1:3].
See pyo3/examples/getitem for a complete example.
PyO3 follows the same error handling as Rust. And as we saw, there are built-in error types that align with Python's, so returning a PyValueError will be raised as a ValueError in Python.
Custom error types are also possible; they need a conversion into PyErr (an impl of From<YourError> for PyErr).
Classes are a bit more complex, but follow a similar overall pattern.
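As a behavioral sketch only (the actual #[pyclass] Rust source is elided here), the class acts like this pure-Python equivalent, inferred from the usage below:

```python
class Summer:
    # Holds a starting value, like the Rust struct's field.
    def __init__(self, value):
        self.value = value

    # Adds the argument to the stored value and returns the sum.
    def add(self, other):
        return self.value + other

print(Summer(1).add(2))  # 3
```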
We can then use the class in Python.
import pycascades2024
summer = pycascades2024.Summer(1)
result = summer.add(2)
print(f"1 + 2 = {result}")
1 + 2 = 3
Similar improvements can be made to the class as we did with the function.
So we wrote a bunch of Rust code, but how do we ship it?
Dev process is very simple thanks to maturin.
Go do the tutorial: https://www.maturin.rs/tutorial/
Thanks to maturin and its GitHub Action, we can build and deploy our Rust library to PyPI with ease.
graph LR;
  A[GitHub Repo] --> B[Maturin GitHub Action];
  B --> C[x86_64-manylinux];
  B --> D[windows];
  B --> E[macOS];
  B --> F[Source];
  C --> G[PyPI];
  D --> G;
  E --> G;
  F --> G;
On PyPI you can see the specific wheels that were built for each platform.
A big difference between Python and Rust is the packaging requirements. A PyO3 package requires a Rust toolchain to build, but the built artifact can be used without a Rust toolchain or any additional Python packages.
Pandas, for example, requires numpy, and then whatever else is needed as you go down the dependency tree.
Rust and Python are a great combination for data intensive applications.
Using Arrow we can pass data between Rust and Python without copying.
#[pyfunction]
fn double_array(array: PyArrowType<ArrayData>) -> PyResult<PyArrowType<ArrayData>> {
    // Convert ArrayData to an ArrayRef
    let array: Arc<dyn Array> = make_array(array.0);
    // Downcast the dynamic Array to a concrete Int64Array
    let array: &Int64Array = array
        .as_any()
        .downcast_ref()
        .ok_or_else(|| PyValueError::new_err("expected int64 array"))?;
    // Double each value, preserving nulls
    let new_array: Int64Array = array.iter().map(|x| x.map(|x| x * 2)).collect();
    Ok(PyArrowType(new_array.into_data()))
}
We downcast the dynamic Array to an Int64Array, double each value, and convert the Int64Array back into a pyarrow-compatible ArrayData.
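The null-preserving map in the Rust code (x.map(|x| x * 2)) behaves like this pure-Python sketch, where None stands in for an Arrow null:

```python
def double_array(values):
    # Double each element, leaving nulls (None) untouched,
    # mirroring the Rust iterator's Option::map.
    return [v * 2 if v is not None else None for v in values]

print(double_array([1, 2, None]))  # [2, 4, None]
```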
#ifndef ARROW_C_STREAM_INTERFACE
#define ARROW_C_STREAM_INTERFACE

struct ArrowArrayStream {
  // Callbacks providing stream functionality
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  // ...
};

#endif // ARROW_C_STREAM_INTERFACE
Apache DataFusion is a Rust-based execution engine that also provides a DataFrame API. Let's look at an example of taking a Rust DataFrame and converting it to a Python DataFrame.
To do that we need to implement the RecordBatchReader
trait and have a way to pass that from Rust to Python.
#[pin_project::pin_project]
/// A stream of record batches from a DataFrame.
pub struct DataFrameRecordBatchStream {
    #[pin]
    exec_node: SendableRecordBatchStream,
    rt: Arc<tokio::runtime::Runtime>,
}
SendableRecordBatchStream is a stream of Arrow record batches that can be sent between threads. The Arc<tokio::runtime::Runtime> is used to drive the stream; it is essentially a thread pool.
impl Iterator for DataFrameRecordBatchStream {
    type Item = arrow::error::Result<arrow::record_batch::RecordBatch>;

    fn next(&mut self) -> Option<Self::Item> {
        match self.rt.block_on(self.exec_node.next()) {
            Some(Ok(batch)) => Some(Ok(batch)),
            Some(Err(e)) => Some(Err(ArrowError::ExternalError(Box::new(e)))),
            None => None,
        }
    }
}
impl RecordBatchReader for DataFrameRecordBatchStream {
fn schema(&self) -> SchemaRef {
self.exec_node.schema()
}
}
The runtime blocks on the async stream, and next returns None when the stream is done.
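The same async-to-sync bridge can be sketched in pure Python (an analogy, not PyO3 code): a synchronous iterator that blocks on an async stream via an event loop, the way DataFrameRecordBatchStream blocks on the tokio runtime.

```python
import asyncio

class AsyncToSyncIterator:
    # Wraps an async generator and an event loop; each __next__ call
    # blocks the loop until the next item is ready, like rt.block_on.
    def __init__(self, agen, loop):
        self.agen = agen
        self.loop = loop

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return self.loop.run_until_complete(self.agen.__anext__())
        except StopAsyncIteration:
            # Async stream exhausted -> signal sync-iterator exhaustion,
            # mirroring the Rust None => None arm.
            raise StopIteration

async def batches():
    for b in ["batch-1", "batch-2"]:
        yield b

loop = asyncio.new_event_loop()
print(list(AsyncToSyncIterator(batches(), loop)))  # ['batch-1', 'batch-2']
loop.close()
```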
#[pymethods]
impl PyExecutionResult {
    fn to_arrow_record_batch_reader(&mut self, py: Python) -> PyResult<PyObject> {
        let dataframe_record_batch_stream = DataFrameRecordBatchStream::new(stream, runtime);
        let mut stream = FFI_ArrowArrayStream::new(Box::new(dataframe_record_batch_stream));
        let stream_reader = unsafe {
            ArrowArrayStreamReader::from_raw(&mut stream).map_err(BioBearError::from)
        }?;
        stream_reader.into_pyarrow(py)
    }
}
We build an FFI_ArrowArrayStream from the DataFrameRecordBatchStream, convert the FFI_ArrowArrayStream to an ArrowArrayStreamReader, and then turn that into a Python object.
Using biobear, we can read a FASTA file and then convert it to a DataFrame. This could just as easily be a SQL query or a CSV file.
import biobear as bb
conn = bb.connect()
file_io = conn.read_fasta_file("./assets/sequence.fasta")
for batch in file_io.to_arrow_record_batch_reader():
print(batch)
pyarrow.RecordBatch
id: string not null
description: string
sequence: string not null
----
id: ["t"]
description: [null]
sequence: ["ATCG"]
What did we learn?