Code API (.code)#

Estimated reading time: 10’

This guide’s objective is to help you get familiarized with PySyft’s Code API. You will learn to:

  • write a Syft Function

  • test a Syft Function

What is a Syft Function?#

A Syft function is any Python function decorated with the @sy.syft_function decorator. This means you can seamlessly write arbitrary Python code and use it directly in PySyft.

The decorator creates a function object which is recognized by PySyft and allows you to create code requests for remote execution.

Here is a quick example of how a Syft function is created and used.

To set the stage, the Data Owner launches a demo Datasite, uploads a small dataset (with both private and mock data), and enables guest signup:
import syft as sy
import pandas as pd

node = sy.orchestra.launch(name="demo_datasite", port="auto", dev_mode=False, reset=True)

admin_client = sy.login(
    url='localhost',
    port=node.port,
    email="[email protected]",
    password="changethis",
)

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
mock_df = pd.DataFrame({'A': [1, 2, 1], 'B': [20, 10, 20]})

main_contributor = sy.Contributor(
    name='John Doe',
    role='Uploader',
    email='[email protected]'
)

asset = sy.Asset(
    name='demo_asset',
    data=df,
    mock=mock_df,
    contributors=[main_contributor]
)

dataset = sy.Dataset(
    name='Demo Dataset',
    description='Demo Dataset',
    asset_list=[asset],
    contributors=[main_contributor]
)

admin_client.upload_dataset(dataset)
admin_client.settings.allow_guest_signup(enable=True)

First, a Data Scientist would connect to the Datasite and explore available datasets. (See the Users API for more details on user accounts on Datasites.)

import syft as sy

# connect to the Datasite
datasite = sy.login_as_guest(url='localhost', port=node.port).register(
    email='[email protected]',
    name='Data Scientist',
    password='123',
    password_verify='123'
)
ds_client = sy.login(
    url='localhost', port=node.port,
    email='[email protected]',
    password='123'
)

ds_client.datasets
ds_client.datasets[0].assets[0]

Once the mock dataset is inspected, the Data Scientist can prototype a Python function for analysis (using the mock dataset).

def example_function(private_dataset):
    return private_dataset.sum()

example_function(ds_client.datasets[0].assets[0].mock)

Then, the Data Scientist can convert the Python function into a Syft function (with the @sy.syft_function decorator) and submit a code request to the Data Owner’s Datasite.

import syft as sy
from syft.service.policy.policy import ExactMatch, SingleExecutionExactOutput

@sy.syft_function(
    input_policy=ExactMatch(data=ds_client.datasets[0].assets[0]),
    output_policy=SingleExecutionExactOutput()
)
def example_function(private_dataset):
    return private_dataset.sum()

ds_client.code.request_code_execution(example_function)

In the example above, example_function becomes a Syft function that can be executed remotely on the private dataset once the request is approved.
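Once the Data Owner reviews and approves the request, the Data Scientist can run the approved function against the private data. Here is a hedged sketch of that final step, assuming the standard request approval flow on the Data Owner side:

# Data Owner side: review and approve the pending code request
request = admin_client.requests[0]
request.approve()

# Data Scientist side: run the approved function on the private asset
result_ptr = ds_client.code.example_function(data=ds_client.datasets[0].assets[0])
result_ptr.get()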

I/O Policies#

Input and Output Policies are rules that define what data can go IN and what data can come OUT of a code request. They are mainly used for ensuring that code submissions are properly paired with the datasets they are intended for.

Input policies deal with questions like:

What datasets or assets can your code be run on?

The input policy ensures that the code will run only on the specified assets (passed as arguments to the function). This means that an approved code request can’t run on any other asset.

Output policies deal with questions like:

How many times can your code be run?

Output policies are used to maintain state between executions. They are useful for imposing limits, such as allowing the code to be executed only a set number of times. This gives the Data Owner control over how many times a code request can be executed and what the output structure looks like.
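For example, both policies from the quick example above are attached directly in the decorator; the function below (example_sum is an illustrative name) may run only on the specified asset and may be executed only once:

from syft.service.policy.policy import ExactMatch, SingleExecutionExactOutput

@sy.syft_function(
    # IN: the code may only run on this exact asset
    input_policy=ExactMatch(data=ds_client.datasets[0].assets[0]),
    # OUT: the result may be produced exactly once
    output_policy=SingleExecutionExactOutput()
)
def example_sum(private_dataset):
    return private_dataset.sum()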

You can read more about I/O Policies in the Syft Policies Guide.

Writing a Syft Function#

Since a Syft function is an object designed to work in a remote workflow, there are some aspects you need to take into consideration when writing one.

Function Body#

The function’s body shouldn’t contain any references to objects from outside the function’s scope. This includes any:

  • objects

  • functions

  • classes

  • modules

Writing Syft Functions

A general rule of thumb is that a Syft function should be self-contained.

Here are some examples:

🚫 Don’t use variables from outside the function’s scope.

CONST = 10

@sy.syft_function()
def example():
    return CONST * 2

Do define every used variable inside the function.

@sy.syft_function()
def example():
    CONST = 10
    return CONST * 2

🚫 Don’t use functions from outside the function’s scope.

def helper(x):
    return x ** 2

@sy.syft_function()
def example():
    return helper(10)

Do define helper functions inside the Syft function.

@sy.syft_function()
def example():
    def helper(x):
        return x ** 2
    return helper(10)

🚫 Don’t use modules imported outside the function’s scope.

import numpy as np

@sy.syft_function()
def example():
    return np.sum([1, 2, 3])

Do import used modules inside the Syft function.

@sy.syft_function()
def example():
    import numpy as np
    return np.sum([1, 2, 3])

Allowed Return Types#

PySyft uses a custom serialization implementation, so only the types it can serialize are allowed as return types from Syft functions.

Here is a complete list of objects that can be serialized by PySyft:

  • Python primitives (including collections)

  • pandas.DataFrame

  • pandas.Series

  • pandas.Timestamp

  • numpy.ndarray

  • numpy numeric types

  • datetime.date

  • datetime.time

  • datetime.datetime

  • result.Ok

  • result.Err

  • result.Result

  • pymongo.collection.Collection

  • io.BytesIO

  • inspect.Signature

  • inspect.Parameter

The serialization process is recursive, so any combination of datatypes mentioned above will work (e.g. a dictionary containing numpy arrays).
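For instance, a Syft function can return a dictionary that mixes primitives, numpy arrays and pandas objects (example_summary below is an illustrative name):

@sy.syft_function()
def example_summary():
    import numpy as np
    import pandas as pd

    # every nested value is a supported type, so the whole
    # dictionary can be serialized by PySyft
    return {
        "count": 3,
        "values": np.array([1, 2, 3]),
        "table": pd.DataFrame({"A": [1, 2, 3]}),
    }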

Using other data types as return values for a Syft function will likely cause an error.

However, if you need to use a datatype not serializable by PySyft, you can convert it into a supported data type as a workaround. For example, you can convert an image containing a plot into a binary buffer before returning the value from the function:

@sy.syft_function()
def example():
    from io import BytesIO
    import matplotlib.pyplot as plt
    import numpy as np

    x = np.arange(10)
    y = np.sin(x)

    plt.plot(x, y)
    
    figfile = BytesIO()
    plt.savefig(figfile, format='png')
    return figfile

Test a Function#

To increase the likelihood that a code request gets approved, it’s important to test your functions before creating code requests. You can do this both locally and remotely.

Local Testing#

To test a function locally, simply run your experiment on the mock data, without creating a Syft function.

def example(data):
    return data.sum()

mock_data = ds_client.datasets[0].assets[0].mock

example(mock_data)

If everything looks all right, convert it into a Syft function and test it server-side.

Testing in an Emulated Server#

You can test a Syft function on an “ephemeral” server that emulates the Data Owner’s server (using only mock data) simply by creating the Syft function and invoking it.

Asset restrictions

When testing a function server-side, you need to pass the whole asset. PySyft will automatically select the mock data for invoking the underlying function.

@sy.syft_function(
    input_policy=ExactMatch(data=ds_client.datasets[0].assets[0]),
    output_policy=SingleExecutionExactOutput()
)
def example(data):
    return data.sum()

data = ds_client.datasets[0].assets[0]

example(data=data)

Testing on a Remote Server#

In some scenarios it makes more sense to test code directly on the Data Owner’s server (using only mock data before the request is approved). This approach might be useful, for example, when the nature of the experiment involves heavy computing. In such cases, the Data Owner may grant the Data Scientist access to a computing cluster to test their code on mock data.

Warning

For this to work, the Data Owner must enable mock execution for any external researcher using this feature.

# the Datasite admin must enable mock execution for a specific user
admin_client.users[1].allow_mock_execution()

# the Data Scientist can now test their code on the remote server
@sy.syft_function(
    input_policy=ExactMatch(data=ds_client.datasets[0].assets[0]),
    output_policy=SingleExecutionExactOutput()
)
def example(data):
    return data.sum()

ds_client.code.submit(example)

After submitting the Syft function, you can test it by calling ds_client.code.FUNCTION(args):

ds_client.code
ds_client.code.example(data=mock_data)

Blocking vs Non-Blocking Execution#

When you submit a code request for execution on the Data Owner’s server, it runs in a blocking manner by default. This means the Data Owner’s server won’t process anything else until that computation is done, which can impact performance for heavy computations or when many Data Scientists send requests at the same time.

To mitigate this issue, you can send non-blocking requests that are queued and execute only when the server has enough available resources. See the Jobs API for more details on how to work with non-blocking requests.
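As a quick illustration, here is a hedged sketch of a non-blocking call, assuming the blocking keyword and Job interface described in the Jobs API:

# request an asynchronous execution; instead of waiting for the
# result, the call returns a Job handle that can be checked later
job = ds_client.code.example(data=ds_client.datasets[0].assets[0], blocking=False)

# block only when the result is actually needed
job.wait()
job.result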

Nested Code Requests#

For heavy computations, a single code execution environment might not be enough.

As a Data Owner, you can deploy PySyft in a cluster (read the Deployment Guides and the WorkerPool API for more details) and allow Data Scientists to use something like the MapReduce model for computations.

For example, this is how you could submit an aggregated computation in PySyft:

asset = ds_client.datasets[0].assets[0]
mock_data = asset.mock

# setup processing functions
@sy.syft_function()
def process_batch(batch):
    return batch.to_numpy().sum()


@sy.syft_function()
def aggregate_job(job_results):
    return sum(job_results)

    
# Syft function with nested requests
@sy.syft_function_single_use(data=asset)
def process_all(datasite, data):
    import numpy as np
    
    job_results = []
    for batch in np.array_split(data, 2):
        batch_job = datasite.launch_job(process_batch, batch=batch)
        job_results += [batch_job.result]

    job = datasite.launch_job(aggregate_job, job_results=job_results)
    return job.result


# submit the processing functions so they are available on the Data Owner's server
ds_client.code.submit(process_batch)
ds_client.code.submit(aggregate_job)

# create a code request
ds_client.code.request_code_execution(process_all)
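Once the Data Owner approves the request, the aggregated computation runs like any other approved Syft function; the datasite argument is injected by PySyft, so only the data argument is passed (a hedged sketch):

# run the approved, nested computation on the Data Owner's server
result_ptr = ds_client.code.process_all(data=asset)
result_ptr.get()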