PySyft: Data Science on data you are not allowed to see

Getting access to data is the first and most crucial step in Data Science, but getting access to enough data is often impossible.

Ideally, all a data scientist would need to do is connect to the web server where the data is stored, so they can get started with their analyses. Unfortunately, it is not that simple! Public access to information always comes with several risks, and necessarily leads to concerns about trust (e.g. data copy/use/misuse), privacy (e.g. disclosure of sensitive information), or legal implications (e.g. moving data outside its original silos).

But what if we told you that it actually can be that simple?!

There is a new way to do data science, where you can use non-public information without seeing or obtaining a copy of the data itself. All you need is to connect to a Datasite and use PySyft!

In this tutorial, we will learn the four simple steps that are necessary to use PySyft, and to run your code on data you are not allowed to see.

1. Login to the Datasite

At the beginning of your journey with PySyft, there is one key term to remember: Datasite.

Think of a Datasite as a website, but for data. While web servers allow you to download files like .html or .css to your browser, Datasites operate differently.

A Datasite enables a data scientist to retrieve answers to their research questions from the data, without actually downloading or seeing the data.
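To make the idea concrete, here is a minimal toy sketch in plain Python. `ToyDatasite` and its `answer` method are hypothetical names invented for illustration: this is not how PySyft is implemented, and a Python class like this offers no real protection. It only shows the direction of travel, where the question goes to the data and only the answer comes back.

```python
from statistics import fmean
from typing import Callable, Sequence


class ToyDatasite:
    """Hypothetical stand-in for a Datasite: keeps data private, returns only answers."""

    def __init__(self, private_rows: Sequence[float]) -> None:
        # Name-mangled attribute: a convention for "not part of the public interface".
        self.__rows = list(private_rows)

    def answer(self, question: Callable[[Sequence[float]], float]) -> float:
        # The question (a function) travels to the data; only the result travels back.
        return question(tuple(self.__rows))


# Made-up values standing in for non-public data.
site = ToyDatasite([12.5, 14.0, 20.5])
mean_value = site.answer(fmean)
print(mean_value)
```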

Let’s use our first PySyft Datasite together and explore what questions we can answer with the data it hosts!

Warning

We will assume you have PySyft installed. Please refer to the Quick install PySyft guide for further information.

Let’s first log in to the Datasite using syft:

import syft as sy

# log in to the Datasite as a guest
client = sy.login_as_guest(url="http://fake-wisc-datasite.openmined.org/")
Logged into <university-wisconsin: High-side Datasite> as GUEST

2. Discover the data on the Datasite

Now that we are logged in to the Datasite (as guests), let’s explore what data is available. We do so by accessing client.datasets:

client.datasets

The pysyft-demo-datasite server is hosting one dataset, named “Breast Cancer Wisconsin”. We can use PySyft APIs to get a pointer to the remote dataset:

breast_cancer_dataset = client.datasets["Breast Cancer Wisconsin"]

As this is our first interaction with the “Breast Cancer Wisconsin” dataset, let’s get additional information to better understand what this data is about. We can do so by simply accessing the rich preview from our breast_cancer_dataset remote pointer:

breast_cancer_dataset

Breast Cancer Wisconsin

Summary

Diagnostic Wisconsin Breast Cancer Database.

Description

Diagnostic Wisconsin Breast Cancer Database.

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A few of the images can be found here. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.

Data format: All data has been made available as pandas.DataFrame.

Dataset Details

Uploaded by: Jane Doe ([email protected])

Created on: 2024-08-01 19:13:46

URL: http://dx.doi.org/10.1007/s10916-013-9903-9

Contributors: To see full details call dataset.contributors.

Assets

Asset Dicttuple

Total: 0

As you can see, the dataset preview includes:

  • a summary: a brief description of the dataset.

  • a description with further information about what data is included, its intended use, and the characteristics of the data (so we know what to expect).

  • a list of the two assets, i.e. “BC Features” and “BC Targets”, storing the data.

💡 Take away: We just learnt that a PySyft dataset is the union of metadata, useful to get a general understanding of the data, and assets.
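As a rough mental model only, the "metadata plus assets" structure can be pictured with two small dataclasses. `ToyDataset` and `ToyAsset` are hypothetical names for illustration and do not mirror PySyft's actual classes. The field values are copied from the preview above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAsset:
    """Hypothetical asset: here just a name, with no data attached."""
    name: str


@dataclass
class ToyDataset:
    """Hypothetical dataset: descriptive metadata plus a list of assets."""
    name: str
    summary: str
    description: str
    assets: list[ToyAsset] = field(default_factory=list)


dataset = ToyDataset(
    name="Breast Cancer Wisconsin",
    summary="Diagnostic Wisconsin Breast Cancer Database.",
    description="Features are computed from a digitized image of a fine needle aspirate (FNA).",
    assets=[ToyAsset("BC Features"), ToyAsset("BC Targets")],
)
asset_names = [asset.name for asset in dataset.assets]
print(asset_names)
```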

Similarly to what we did with the dataset, we can get a pointer to each of the remote assets in breast_cancer_dataset:

features, targets = breast_cancer_dataset.assets

However, if we try to access the data from these assets:

features.data
SyftError:
You do not have permission to access private data.

We do not have permissions to access the data! 🛑

This is indeed one of the main features of PySyft: as GUEST Data Scientists on the Datasite, we are not allowed to see or download any data from the dataset!
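A toy sketch of this kind of role check is shown below. It is an assumption-laden illustration, not PySyft's permission system: the `ToyAsset` class and the `"OWNER"` role name are hypothetical, and only the error message is borrowed from the output above.

```python
class ToyAsset:
    """Hypothetical asset whose real data is readable only by the data owner."""

    def __init__(self, private_values: list[float], role: str = "GUEST") -> None:
        self._private_values = private_values
        self._role = role  # "OWNER" is an invented role name for this sketch

    @property
    def data(self) -> list[float]:
        # Refuse to hand out the real values unless the caller is the owner.
        if self._role != "OWNER":
            raise PermissionError("You do not have permission to access private data.")
        return self._private_values


guest_view = ToyAsset([17.5, 20.5], role="GUEST")
try:
    guest_view.data
    access_granted = True
except PermissionError as error:
    access_granted = False
    print(error)
```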

At this point, you may wonder:

How could I possibly work with data I cannot see? How could I even start working on my code if it cannot be tested with data?

PySyft addresses this issue by hosting two types of data in each asset: real data (e.g. features.data) and mock data (e.g. features.mock).

Mock data is an artificial version of the true data that has been made available through PySyft for code preparation purposes. Let’s have a quick look:

features.mock.head(n=3)
Unnamed: 0 radius1 texture1 perimeter1 area1 smoothness1 compactness1 concavity1 concave_points1 symmetry1 ... radius3 texture3 perimeter3 area3 smoothness3 compactness3 concavity3 concave_points3 symmetry3 fractal_dimension3
0 284.236697 32.790832 30.322750 215.075411 1655.906854 0.835428 1.069219 0.504109 0.839460 0.752551 ... 42.127052 43.919680 292.127453 2900.460981 1.114412 1.724469 1.865109 0.859462 1.405402 0.923620
1 285.419316 35.387568 37.773586 225.360959 1980.940592 0.914992 0.883861 0.825926 0.823268 1.267871 ... 41.328957 49.758043 267.028592 2837.115333 1.248279 0.964296 1.393830 1.151001 1.478893 0.985656
2 286.557667 34.480210 41.535379 222.280991 1858.304816 1.086516 0.755101 0.674030 0.636585 0.628226 ... 40.519997 51.879152 259.949887 2590.176379 0.667455 0.862762 1.238147 1.049442 1.500712 0.852664

3 rows × 31 columns

targets.mock.sample(n=3, random_state=24)
Unnamed: 0 Diagnosis
338 248 B
488 236 M
345 104 B
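The tutorial does not say how this particular mock data was produced. One simple possibility, sketched here purely as an assumption, is to keep the column names and the number of rows while replacing every value with random noise. `make_mock` is a hypothetical helper, and the input values are made up.

```python
import random


def make_mock(real_rows: list[dict[str, float]], seed: int = 0) -> list[dict[str, float]]:
    """Return rows with the same columns and length as `real_rows`, but random values."""
    rng = random.Random(seed)
    return [{column: rng.uniform(0.0, 100.0) for column in row} for row in real_rows]


# Made-up "real" rows; only their shape (columns, row count) is carried over.
real = [{"radius1": 10.0, "texture1": 20.0}, {"radius1": 30.0, "texture1": 40.0}]
mock = make_mock(real)
print(list(mock[0]))
```

Because the mock keeps the same column names, code written against it will find the columns it expects when it later runs on the real data.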

3. Prepare the code to answer our Research Question

The presence of mock data provides a concrete strategy to work on our code, and more importantly, to concentrate on what really matters: our research question.

Let’s say we want to know the difference, on average, in the radius of the cell nuclei between breast tumors with a benign (B) diagnosis and those with a malignant (M) diagnosis.

We can translate this question into Python code that can actually run, using mock data as a proxy for the real data:

def average_radius_of_nuclei(data, labels) -> tuple[float, float]:
    """Calculate the mean of the `radius3` feature for samples with a benign and a malignant diagnosis"""

    # Flatten the diagnosis column into a 1-D array of "B"/"M" labels
    y = labels['Diagnosis'].values.ravel()
    # Use the labels as a boolean mask to select the rows of each class
    mean_benign = data[y == "B"]["radius3"].mean()
    mean_malignant = data[y == "M"]["radius3"].mean()
    return mean_benign, mean_malignant

Let’s test our code on mock data:

average_radius_of_nuclei(features.mock, targets.mock)
(33.18213776377855, 32.763550736274205)

It works! ✅

We have just verified that our code works on the selected assets, so it is ready for execution on the Datasite.
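For readers less familiar with boolean masks in pandas, the same grouping logic can be sketched without any dependency. `average_by_label` is a hypothetical helper written for this illustration, and the numbers are made up so the result is easy to check by hand.

```python
from statistics import fmean


def average_by_label(values: list[float], labels: list[str]) -> tuple[float, float]:
    """Mean of `values` for rows labelled "B" and for rows labelled "M"."""
    benign = [v for v, label in zip(values, labels) if label == "B"]
    malignant = [v for v, label in zip(values, labels) if label == "M"]
    return fmean(benign), fmean(malignant)


# Made-up numbers, chosen only so the result is easy to verify by hand.
radius = [10.0, 20.0, 12.0, 22.0]
diagnosis = ["B", "M", "B", "M"]
result = average_by_label(radius, diagnosis)
print(result)  # (11.0, 21.0)
```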

Warning

Obtaining the real answer: It is important to emphasise that this result (and any other result generated using mock data) is completely meaningless for our research question! We need to run our code on the real (remote) data on the Datasite to get the answer we are looking for. And we will use PySyft for it!

4. Get the true answer to our Research Question

Now that we have checked that our code runs locally on mock data, we are ready to proceed with the last step. This time, we want our code to be executed remotely (on the Datasite), using the real non-public data. We just need to convert our (local) Python function into a remote code request.

Creating a code request in PySyft is very easy, as it only requires wrapping our function with a Python decorator: syft.syft_function_single_use. Using this decorator, we need to specify which assets we want to use for our execution (i.e. features and targets in our case). In this way, we can clearly delimit the data scope of our experiment, as PySyft will not allow execution on assets other than the ones specified.

@sy.syft_function_single_use(data=features, labels=targets)  # mapping parameters to corresponding assets
def average_radius_of_nuclei(data, labels) -> tuple[float, float]:
    """Calculate the mean of the `radius3` feature for samples with a benign and a malignant diagnosis"""

    # Flatten the diagnosis column into a 1-D array of "B"/"M" labels
    y = labels['Diagnosis'].values.ravel()
    # Use the labels as a boolean mask to select the rows of each class
    mean_benign = data[y == "B"]["radius3"].mean()
    mean_malignant = data[y == "M"]["radius3"].mean()
    return mean_benign, mean_malignant
SyftSuccess:
Syft function 'average_radius_of_nuclei' successfully created. To add a code request, please create a project using `project = syft.Project(...)`, then use command `project.create_code_request`.

Success! 🎉
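To build intuition for what it means to bind parameters to assets, here is a toy decorator in plain Python. It is a sketch under stated assumptions, not PySyft's implementation: `toy_single_use` is a hypothetical name, and the "only one run" rule is our reading of the name `single_use` rather than something this tutorial states.

```python
from functools import wraps
from typing import Any, Callable


def toy_single_use(**bound_assets: Any) -> Callable:
    """Hypothetical decorator: pin each parameter to one asset and allow one run."""

    def decorator(func: Callable) -> Callable:
        used = False

        @wraps(func)
        def wrapper(**kwargs: Any) -> Any:
            nonlocal used
            if used:
                raise RuntimeError("this function was already executed once")
            # Every argument must be the exact object it was bound to.
            for name, value in kwargs.items():
                if bound_assets.get(name) is not value:
                    raise PermissionError(f"parameter {name!r} is not bound to this asset")
            if kwargs.keys() != bound_assets.keys():
                raise TypeError("every bound parameter must be supplied")
            used = True
            return func(**kwargs)

        return wrapper

    return decorator


# Plain lists standing in for remote assets.
features_asset = [1.0, 2.0, 3.0]
other_asset = [9.0, 9.0, 9.0]


@toy_single_use(data=features_asset)
def total(data: list[float]) -> float:
    return sum(data)


first_result = total(data=features_asset)
print(first_result)
```

Calling `total` a second time, or calling a function bound to `features_asset` with `other_asset`, raises an error in this sketch.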

Now our average_radius_of_nuclei function has officially become a PySyft code request, that we can immediately submit to the Datasite:

client.code.request_code_execution(average_radius_of_nuclei)

Request

Id: 00bc144926414077b22a79c42f27740f

Request time: 2024-08-02 11:31:28

Status: RequestStatus.PENDING

Requested on: University-wisconsin of type Datasite

Requested by: guest_user

Changes: Request to change average_radius_of_nuclei (Pool Id: default-pool) to permission RequestStatus.APPROVED. No nested requests.

Warning

Waiting for approval: Please wait one second before proceeding, to make sure that the demo server receives and automatically approves the request.
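Instead of waiting a fixed amount of time, one could poll until the request is approved. The helper below is a generic sketch: `wait_for_approval` and `get_status` are hypothetical names, and how the status would be read from PySyft is not covered in this tutorial, so a fake status source is used here.

```python
import time
from typing import Callable


def wait_for_approval(
    get_status: Callable[[], str], timeout: float = 5.0, interval: float = 0.1
) -> bool:
    """Poll `get_status` until it returns "APPROVED" or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status() == "APPROVED":
            return True
        time.sleep(interval)
    return False


# Fake status source for illustration: pending twice, then approved.
statuses = iter(["PENDING", "PENDING", "APPROVED"])
approved = wait_for_approval(lambda: next(statuses), timeout=2.0, interval=0.01)
print(approved)  # True
```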

Let’s check the status of our request:

client.code

The request has been automatically approved by the Datasite! This means we now have permission to remotely execute our code on the real data, and then download the answer:

answer = client.code.average_radius_of_nuclei(data=features, labels=targets)
print(answer)
(13.37980112044818, 21.134811320754718)

Well done!! 👏

You have successfully executed the average_radius_of_nuclei function remotely on the Datasite, and obtained the results from the real (non-public) data! And all without ever looking at the data!
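Our research question asked for the difference between the two groups, while the function returns the two means. Assuming the tuple is ordered as (benign, malignant), as in the function's return statement, the difference follows directly from the values printed above:

```python
# The tuple printed in this tutorial: (mean for benign, mean for malignant).
answer = (13.37980112044818, 21.134811320754718)

mean_benign, mean_malignant = answer
difference = mean_malignant - mean_benign
print(round(difference, 2))  # 7.76
```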

This is just the surface 🔎

Congratulations on completing your first PySyft tutorial! In this tutorial, we have learnt how easy it is to get started with PySyft to allow data science on data you are not allowed to see.

But this was just the surface!

Approval flow

For example, the demo Datasite server that was used in this tutorial has been configured to automatically accept every single incoming request. In practice, this means that no limitation is imposed on the queries an external data scientist could submit on the data with their code.
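The contrast between auto-approval and a reviewed approval can be pictured with a toy sketch. All names here are hypothetical (`ToyRequest`, `auto_approve`, `manual_review`), and the `"REJECTED"` status is an invented label for illustration. Only `PENDING` and `APPROVED` appear in the output shown in this tutorial.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical code request with a simple status string."""
    function_name: str
    status: str = "PENDING"


def auto_approve(request: ToyRequest) -> ToyRequest:
    # Demo behaviour described above: every request is accepted without review.
    request.status = "APPROVED"
    return request


def manual_review(request: ToyRequest, reviewer_accepts: bool) -> ToyRequest:
    # Alternative: a reviewer's decision determines the outcome.
    request.status = "APPROVED" if reviewer_accepts else "REJECTED"
    return request


demo = auto_approve(ToyRequest("average_radius_of_nuclei"))
reviewed = manual_review(ToyRequest("average_radius_of_nuclei"), reviewer_accepts=False)
print(demo.status, reviewed.status)
```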

Naturally, this was just an over-simplification, created solely to make this tutorial as accessible as possible! But there is so much more to unveil about PySyft features!

If you wish to discover more about PySyft, and learn how to use PySyft from the ground up, read this tutorial.

You can also join the community on the OpenMined Slack, and message the #support channel, where someone will gladly assist you!

Thanks for your interest!