Syft Code Policies#
Estimated reading time: 10’
What you’ll learn#
This guide explains what PySyft policies are, how to use them, and how to create custom policies that suit your needs.
Introduction#
Why data release is difficult#
To collaborate using non-public information, parties need to agree on what information is released and how. There are many controlled ways to release data, ranging from partially restricted release, such as via public APIs, to fully restricted ones, such as secure room facilities, where data never leaves the secure computers.
This is because once a piece of data is released, it is subject to the copy problem: the data owner can no longer control how it is being used or shared, which inhibits such collaborations. This is amplified by other problems, such as the bundling problem, where the data needed to answer a research question also carries information that can be misused.
Designing better information flows#
A five-part framework called Structured Transparency was proposed to improve the governability of such information flows and incentivise collaboration, inspired by efforts from various fields to solve such problems.
The main idea behind this framework is to allow parties to define and enforce a precise flow of information by using various techniques known as privacy-enhancing technologies (PETs) where they are most effective during a data’s lifecycle.
Structured Transparency Framework#
As mentioned, the structured transparency framework helps break an information flow down into the individual challenges that occur before the data is used, while it is used, and after it is released.
These challenges are addressed via a suite of guarantees:
Input privacy and verification are guarantees about the inputs of an information flow.
Output privacy and verification are guarantees about the outputs of an information flow.
Flow governance is a guarantee about who is allowed to change the flow. This includes who is able to change the input and output privacy and verification guarantees.
Input and output privacy relate to information that needs to be hidden. Input and output verification relate to information that needs to be revealed in a verifiable way. Returning to our problem: if input privacy is satisfied, the copy problem is prevented, because it is impossible to copy data you cannot see!
PySyft Policies#
In PySyft, policies were created to enable data owners and data scientists to define and enforce an information flow that is mutually-agreed and that can be customised to implement various input and output privacy and verification rules.
Given the remote data science framework that PySyft implements, data is safe and secure when stored, as the data scientist can never directly access it. However, when data scientists want to answer specific questions with the data, we would like the guarantees above to hold. As a result, in PySyft a sy.Policy accompanies each individual piece of code written by the data scientist to process the server’s data.
In particular, there are two types of policies:
sy.InputPolicy: policies that relate to input privacy and verification.
sy.OutputPolicy: policies that relate to output privacy and verification.
They are attached to a code submission as follows:
@sy.syft_function(input_policy=sy.ExactMatch(asset=client.datasets[0].assets[0]),
                  output_policy=sy.OutputPolicyExecuteOnce())
def example_function(asset):
    # customize your query here
    pass
Input and Output Policies#
PySyft ships only a few input and output policies. Together with manual code review, they allow a data owner to enforce their own definition of data use and misuse by approving only the requests that fit their definition of proper use. To ensure that an approval leaves no room for misuse by the system, PySyft provides a few baseline policies that give such guarantees:
Input Policies#
Input policies enforce which assets an execution may work on. The baseline policies in PySyft address the following two questions:
What datasets or assets can your code be run on?
A particularly useful policy in PySyft is sy.ExactMatch. It gives the data owner confidence that the system will only allow the approved code to run on the specified asset passed to the function (e.g. ExactMatch(data=client.datasets[0].assets[0])). This means that approved code cannot be run on an arbitrary asset.
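To make the idea concrete, here is a minimal pure-Python sketch of an exact-match check. It does not use PySyft and is not how sy.ExactMatch is implemented; the class name ExactMatchSketch, the is_allowed method and the string asset IDs are hypothetical stand-ins chosen for illustration.

```python
class ExactMatchSketch:
    """Hypothetical stand-in for an exact-match input policy (not PySyft code)."""

    def __init__(self, **approved):
        # Map each function argument name to the ID of the asset approved for it.
        self.approved = dict(approved)

    def is_allowed(self, **passed):
        # Allow the call only if argument names and asset IDs match exactly.
        return passed == self.approved


policy = ExactMatchSketch(asset="asset-123")
print(policy.is_allowed(asset="asset-123"))  # the approved asset
print(policy.is_allowed(asset="asset-999"))  # a different asset
print(policy.is_allowed(other="asset-123"))  # wrong argument name
```

Only the first call is accepted. Binding the approval to both the argument name and the asset is what prevents approved code from being pointed at other data.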
Output Policies#
How many times can your code be run?
Output policies are used to maintain state between executions. They are useful for imposing limits, such as allowing a piece of code to execute only a fixed number of times. This gives the data owner confidence about how many times the custom code is executed and what the output structure looks like. As we can see in our example, the function is tied to one specific asset by the input policy and can be run only once (via sy.OutputPolicyExecuteOnce()), so it serves as a single-use function. As this pattern is widely used in PySyft, the shorter handle sy.syft_function_single_use is provided, which combines the two policies.
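The role of state in a single-use policy can be sketched in plain Python. This is an illustration under stated assumptions, not PySyft's implementation: ExecuteOnceSketch, its run method and the PermissionError it raises are all hypothetical names and choices.

```python
class ExecuteOnceSketch:
    """Hypothetical stand-in for a single-use output policy (not PySyft code)."""

    def __init__(self, limit=1):
        self.limit = limit  # number of executions the data owner approved
        self.count = 0      # state that persists between executions

    def run(self, func, *args):
        # Refuse to execute once the approved number of runs is used up.
        if self.count >= self.limit:
            raise PermissionError("execution limit reached")
        self.count += 1
        return func(*args)


once = ExecuteOnceSketch()
print(once.run(sum, [1, 2, 3]))  # first call is allowed
try:
    once.run(sum, [1, 2, 3])     # second call is refused
except PermissionError as err:
    print(err)
```

The counter is the only piece of state needed. Raising the limit above one gives the repeated-call behaviour discussed later in this guide.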
Custom Policies#
As you might have noticed, manual code review is necessary to support various rules for approving or denying requests. This works, but a data owner might wonder how to automate the process with more programmatic policies that the system enforces automatically.
This is possible by implementing custom policies. These can be soft privacy measures, such as releasing only aggregated results or rate-limiting the number of data points, or hard privacy measures, such as releasing results under a specific privacy budget. These are not offered by default in PySyft, but contributions from the community are very welcome!
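As a rough illustration of a budget-based measure, the sketch below tracks how much of a fixed budget has been spent and refuses releases that would exceed it. It is plain Python, not a PySyft policy and not a differential privacy implementation: BudgetLedger and try_release are hypothetical names, and the cost of each release is assumed to be supplied by the caller.

```python
class BudgetLedger:
    """Hypothetical privacy-budget tracker (illustrative only, not PySyft code)."""

    def __init__(self, total):
        self.total = total  # budget the data owner is willing to spend
        self.spent = 0.0    # budget consumed by results released so far

    def try_release(self, cost):
        # Release a result only if its cost still fits in the remaining budget.
        if self.spent + cost > self.total:
            return False
        self.spent += cost
        return True


ledger = BudgetLedger(total=1.0)
print(ledger.try_release(0.5))  # fits in the budget
print(ledger.try_release(0.5))  # uses up the rest
print(ledger.try_release(0.5))  # would exceed the budget
```

A real policy would also need a principled way to compute the cost of each result, which this sketch deliberately leaves out.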
Example#
Let’s implement a custom output policy that extends the idea of sy.OutputPolicyExecuteOnce() to allow repeated calls, up to a pre-approved count. This is an ideal tool for a machine learning pipeline!
Setup#
import numpy as np
import syft as sy
server = sy.orchestra.launch(name="my_server", port="auto", dev_mode=False, reset=True)
do_client = server.login(email="[email protected]", password="changethis")
do_client.register(
    email="[email protected]", name="John Doe", password="pw", password_verify="pw"
)
ds_client = server.login(email="[email protected]", password="pw")
dataset = sy.Dataset(
    name="Dataset name",
    description="**Placeholder Dataset description**",
    asset_list=[sy.Asset(
        name="asset_name",
        data=[1,2,3], # real data
        mock=[4,5,6], # mock data
    )],
)
do_client.upload_dataset(dataset)
Writing a custom policy#
To write a custom policy, define a class that inherits from sy.CustomOutputPolicy and implements the apply_to_output method. Note that state can be preserved between executions of the same user code, which allows us to write the following:
from typing import Any


class RepeatedCallPolicy(sy.CustomOutputPolicy):
    n_calls: int = 0
    downloadable_output_args: list[str] = []
    state: dict[Any, Any] = {}

    def __init__(self, n_calls=1, downloadable_output_args: list[str] = None):
        self.downloadable_output_args = (
            downloadable_output_args if downloadable_output_args is not None else []
        )
        self.n_calls = n_calls
        self.state = {"counts": 0}

    def public_state(self):
        return self.state["counts"]

    def update_policy(self, context, outputs):
        self.state["counts"] += 1

    def apply_to_output(self, context, outputs, update_policy=True):
        if hasattr(outputs, "syft_action_data"):
            outputs = outputs.syft_action_data
        output_dict = {}
        if self.state["counts"] < self.n_calls:
            for output_arg in self.downloadable_output_args:
                output_dict[output_arg] = outputs[output_arg]
            if update_policy:
                self.update_policy(context, outputs)
        else:
            return None
        return output_dict

    def _is_valid(self, context):
        return self.state["counts"] < self.n_calls
We can now test that it works.
policy = RepeatedCallPolicy(n_calls=1, downloadable_output_args=["y"])
policy.init_kwargs
a_obj = sy.ActionObject.from_obj({"y": [1, 2, 3]})
x = policy.apply_to_output(None, a_obj)
x["y"]
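To see what happens once the limit is reached without needing a running server, the counting logic can be mirrored in a small stand-alone class. This is a simplified sketch, not the policy above: CallCounterSketch is a hypothetical name, it drops the context argument and the ActionObject unwrapping, and it works on plain dictionaries.

```python
class CallCounterSketch:
    """Hypothetical stand-in for RepeatedCallPolicy with no PySyft dependency."""

    def __init__(self, n_calls=1, downloadable_output_args=None):
        self.n_calls = n_calls
        self.downloadable_output_args = downloadable_output_args or []
        self.state = {"counts": 0}

    def apply_to_output(self, outputs, update_policy=True):
        # Once the approved number of calls is used up, release nothing.
        if self.state["counts"] >= self.n_calls:
            return None
        # Release only the allowed keys of the output.
        released = {key: outputs[key] for key in self.downloadable_output_args}
        if update_policy:
            self.state["counts"] += 1
        return released


sketch = CallCounterSketch(n_calls=2, downloadable_output_args=["y"])
outputs = {"y": [1, 2, 3], "secret": 42}
print(sketch.apply_to_output(outputs))  # first call: only "y" is released
print(sketch.apply_to_output(outputs))  # second call: still within the limit
print(sketch.apply_to_output(outputs))  # third call: limit reached, prints None
```

The first two calls return only the "y" entry and the third returns None. Passing update_policy=False returns the filtered output without consuming one of the approved calls.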
Let’s now create a dummy code request with a high call limit and check that the restriction works.
t = [1,2,3]
[y+1 for y in t]
@sy.syft_function(
    input_policy=sy.ExactMatch(x=ds_client.datasets[0].assets[0]),
    output_policy=RepeatedCallPolicy(n_calls=10, downloadable_output_args=["y"]),
)
def func(x):
    return x
    # return [y+1 for y in x]
# Create a request as a data scientist
ds_client.code.request_code_execution(func)
# Approve as the data owner
do_client.requests[0].approve()
ds_client.requests
request = ds_client.requests[0]
result = request.code.run(x=ds_client.datasets[0].assets[0])
result
for _ in range(8):
    result = request.code.run(x=ds_client.datasets[0].assets[0])
result