PySyft: Data Science on data you are not allowed to see
Getting access to data is the first, and perhaps the most crucial, step in Data Science, yet getting access to enough data is often impossible.
Ideally, all a data scientist would need to do is connect to the web server where the data is stored and get started with their analyses. Unfortunately, it is not that simple! Public access to information always comes with several risks, and necessarily leads to concerns about trust (e.g. data copy/use/misuse), privacy (e.g. disclosure of sensitive information), or legal implications (e.g. moving data outside their original silos).
But what if we told you that it actually can be that simple?
There is a new way to do data science, in which you can use non-public information without seeing or obtaining a copy of the data itself. All you need is to connect to a Datasite and use PySyft!
In this tutorial, we will learn the four simple steps that are necessary to use PySyft, and to run your code on data you are not allowed to see.
1. Login to the Datasite
At the beginning of your journey with PySyft, there is one key term to remember: Datasite.
Think of a Datasite as a website, but for data. While web servers allow you to download files like .html or .css to your browser, Datasites operate differently. A Datasite enables a data scientist to retrieve answers to their research questions from the data, without actually downloading or seeing the data.
Let’s use our first PySyft Datasite together and explore which questions we can answer with the data it hosts!
Warning
We will assume you have PySyft installed. Please refer to the Quick install PySyft guide for further information.
Let’s first log in to the Datasite using syft:
import syft as sy
# login to data_site
client = sy.login_as_guest(url="20.51.219.43:80")
Logged into <university-wisconsin: High-side Datasite> as GUEST
2. Discover the data on the Datasite
Now that we are logged into the Datasite as a guest, let’s explore what data is available. We do so by accessing client.datasets:
client.datasets
The pysyft-demo-datasite server is hosting one dataset, named “Breast Cancer Wisconsin”. We can use PySyft APIs to get a pointer to the remote dataset:
breast_cancer_dataset = client.datasets["Breast Cancer Wisconsin"]
As this is our first interaction with the “Breast Cancer Wisconsin” dataset, let’s get additional information to better understand what this data is about. We can do so by simply accessing the rich preview from our breast_cancer_dataset remote pointer:
breast_cancer_dataset
Breast Cancer Wisconsin
Summary
Diagnostic Wisconsin Breast Cancer Database.
Description
Diagnostic Wisconsin Breast Cancer Database.
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
They describe characteristics of the cell nuclei present in the image.
A few of the images can be found here. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.
Data format: All data has been made available as pandas.DataFrame.
Dataset Details
Uploaded by: Jane Doe ([email protected])
Created on: 2024-08-01 19:13:46
URL: http://dx.doi.org/10.1007/s10916-013-9903-9
Contributors: To see full details call dataset.contributors.
Assets
As you can see, the dataset preview includes:
a summary: a brief description of the dataset;
a description: further information about what data is included, its intended use, and its characteristics (so we know what to expect);
the list of the two assets storing the data, i.e. “BC Features” and “BC Targets”.
Similarly to what we did with the dataset, we can get a pointer to each of the remote assets in breast_cancer_dataset:
features, targets = breast_cancer_dataset.assets
However, if we try to access the data attribute of these assets, the request is refused:
features.data
You do not have permission to access private data.
This is indeed one of the main features of PySyft: as GUEST Data Scientists on the Datasite, we are not allowed to see or download any data from the dataset!
At this point, you may wonder:
How could I possibly work with data I cannot see? How could I even start working on my code if it cannot be tested with data?
PySyft addresses this issue by hosting two types of data in each asset: real data (e.g. features.data) and mock data (e.g. features.mock).
Mock data is an artificial version of the real data, made available through PySyft for code preparation purposes. Let’s have a quick look:
features.mock.head(n=3)
| | Unnamed: 0 | radius1 | texture1 | perimeter1 | area1 | smoothness1 | compactness1 | concavity1 | concave_points1 | symmetry1 | ... | radius3 | texture3 | perimeter3 | area3 | smoothness3 | compactness3 | concavity3 | concave_points3 | symmetry3 | fractal_dimension3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 284.236697 | 32.790832 | 30.322750 | 215.075411 | 1655.906854 | 0.835428 | 1.069219 | 0.504109 | 0.839460 | 0.752551 | ... | 42.127052 | 43.919680 | 292.127453 | 2900.460981 | 1.114412 | 1.724469 | 1.865109 | 0.859462 | 1.405402 | 0.923620 |
| 1 | 285.419316 | 35.387568 | 37.773586 | 225.360959 | 1980.940592 | 0.914992 | 0.883861 | 0.825926 | 0.823268 | 1.267871 | ... | 41.328957 | 49.758043 | 267.028592 | 2837.115333 | 1.248279 | 0.964296 | 1.393830 | 1.151001 | 1.478893 | 0.985656 |
| 2 | 286.557667 | 34.480210 | 41.535379 | 222.280991 | 1858.304816 | 1.086516 | 0.755101 | 0.674030 | 0.636585 | 0.628226 | ... | 40.519997 | 51.879152 | 259.949887 | 2590.176379 | 0.667455 | 0.862762 | 1.238147 | 1.049442 | 1.500712 | 0.852664 |
3 rows × 31 columns
targets.mock.sample(n=3, random_state=24)
| | Unnamed: 0 | Diagnosis |
|---|---|---|
| 338 | 248 | B |
| 488 | 236 | M |
| 345 | 104 | B |
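The split between real and mock data shown above can be pictured with a small toy model. This is only a sketch for intuition: `ToyAsset`, its `viewer_role` field, and the sample values are hypothetical and are not part of PySyft, whose actual permission handling is not described in this tutorial.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToyAsset:
    """Hypothetical stand-in for a remote asset; not a PySyft class."""

    name: str
    _real: Any                 # private values, held only by the data owner
    mock: Any                  # artificial values, safe to share with guests
    viewer_role: str = "GUEST"

    @property
    def data(self) -> Any:
        # Only the owner role may read the real values in this toy model.
        if self.viewer_role != "OWNER":
            raise PermissionError("You do not have permission to access private data.")
        return self._real


# Made-up values, for illustration only.
asset = ToyAsset(name="BC Features", _real=[1.0, 2.0], mock=[10.0, 20.0])
print(asset.mock)              # mock values are always readable
try:
    asset.data                 # real values are refused for a guest
except PermissionError as err:
    print(err)
```

The point of the sketch is the asymmetry: `mock` is an ordinary attribute, while `data` is guarded by a role check.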
3. Prepare the code to answer our Research Question
The presence of mock data provides a concrete strategy to work on our code, and more importantly, to concentrate on what really matters: our research question.
Let’s say we are interested in knowing the difference (on average) in the radius of the nuclei of breast tumors between patients with a benign (B) and a malignant (M) diagnosis.
We can translate this question into Python code that actually runs, using mock data as a proxy for the real data:
def average_radius_of_nuclei(data, labels) -> tuple[float, float]:
    """Calculate the mean of the `radius3` feature for samples with benign and malignant diagnosis"""
    # Flatten the Diagnosis column into a 1-D array of "B"/"M" labels
    y = labels['Diagnosis'].values.ravel()
    # Boolean masks select the rows belonging to each diagnosis
    mean_benign = data[y == "B"]["radius3"].mean()
    mean_malignant = data[y == "M"]["radius3"].mean()
    return mean_benign, mean_malignant
Let’s test our code on mock data:
average_radius_of_nuclei(features.mock, targets.mock)
(33.18213776377855, 32.763550736274205)
It works! ✅
We have just verified that our code works on the selected assets, so it is ready for execution on the Datasite.
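A quick way to make this kind of local verification repeatable is to wrap it in a small check that runs before submitting anything. The sketch below is an assumption about how one might do this, not part of PySyft: `check_on_mock` and `mean_by_label` are hypothetical helpers, and the toy inputs are made up so the example runs with the standard library only.

```python
import math
from typing import Callable, Sequence


def check_on_mock(func: Callable, *mock_args, expected_len: int = 2) -> tuple:
    """Run `func` on mock inputs and validate the shape of its output."""
    result = func(*mock_args)
    if not isinstance(result, tuple) or len(result) != expected_len:
        raise TypeError(f"expected a tuple of {expected_len} values, got {result!r}")
    for value in result:
        if not isinstance(value, (int, float)) or not math.isfinite(value):
            raise ValueError(f"non-finite or non-numeric value in result: {value!r}")
    return result


def mean_by_label(values: Sequence[float], labels: Sequence[str]) -> tuple[float, float]:
    """Plain-Python analogue of the tutorial function: mean per diagnosis label."""
    benign = [v for v, label in zip(values, labels) if label == "B"]
    malignant = [v for v, label in zip(values, labels) if label == "M"]
    return sum(benign) / len(benign), sum(malignant) / len(malignant)


# Toy mock inputs, made up for this example only.
toy_values = [10.0, 20.0, 30.0, 40.0]
toy_labels = ["B", "M", "B", "M"]
print(check_on_mock(mean_by_label, toy_values, toy_labels))  # (20.0, 30.0)
```

A check like this only confirms that the code runs and returns values of the expected shape; it says nothing about the real answer.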
Warning
Obtaining the real answer: It is important to emphasise that this result (like any other result generated using mock data) is completely meaningless for our research question! We need to run our code on the real (remote) data on the Datasite to get the answer we are looking for. And we will use PySyft for it!
4. Get the true answer to our Research Question
Now that we have checked that our code runs locally on mock data, we are ready to proceed with the last step. This time, we want our code to be executed remotely (on the Datasite), using the real non-public data. We just need to convert our (local) Python function into a remote code request.
Creating a code request in PySyft is very easy, as it only requires wrapping our function with a Python decorator: syft.syft_function_single_use. Using this decorator, we need to specify which assets we want to use for our execution (i.e. features and targets in our case). In this way, we clearly delimit the data scope of our experiment, as PySyft will not allow execution on assets other than the ones specified.
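To build intuition for this kind of scoping, here is a toy decorator that binds each parameter name to exactly one declared input and refuses anything else. It is a sketch under stated assumptions: `toy_single_use` and the placeholder lists are hypothetical, and this is not how PySyft implements its own decorator.

```python
import functools
from typing import Callable


def toy_single_use(**bound_inputs):
    """Toy decorator: bind each parameter name to exactly one allowed input.

    Illustration only; this is not PySyft's implementation.
    """
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(**kwargs):
            # Reject calls whose argument names differ from the declared ones.
            if kwargs.keys() != bound_inputs.keys():
                raise TypeError(f"expected arguments {sorted(bound_inputs)}")
            # Reject calls that pass a different object than the declared one.
            for name, value in kwargs.items():
                if value is not bound_inputs[name]:
                    raise PermissionError(f"'{name}' is not the asset declared for this function")
            return func(**kwargs)
        return wrapper
    return decorator


features_asset = [1.0, 2.0, 3.0]   # placeholder objects standing in for assets
other_asset = [9.0, 9.0, 9.0]


@toy_single_use(data=features_asset)
def total(data):
    return sum(data)


print(total(data=features_asset))   # 6.0
try:
    total(data=other_asset)         # a different object is refused
except PermissionError as err:
    print(err)
```

The actual PySyft decorator is applied in the same position, directly above the function definition: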
@sy.syft_function_single_use(data=features, labels=targets)  # mapping parameters to corresponding assets
def average_radius_of_nuclei(data, labels) -> tuple[float, float]:
    """Calculate the mean of the `radius3` feature for samples with benign and malignant diagnosis"""
    # Flatten the Diagnosis column into a 1-D array of "B"/"M" labels
    y = labels['Diagnosis'].values.ravel()
    # Boolean masks select the rows belonging to each diagnosis
    mean_benign = data[y == "B"]["radius3"].mean()
    mean_malignant = data[y == "M"]["radius3"].mean()
    return mean_benign, mean_malignant
Syft function 'average_radius_of_nuclei' successfully created. To add a code request, please create a project using `project = syft.Project(...)`, then use command `project.create_code_request`.
Success! 🎉
Now our average_radius_of_nuclei function has officially become a PySyft code request that we can immediately submit to the Datasite:
client.code.request_code_execution(average_radius_of_nuclei)
Request
Id: 00bc144926414077b22a79c42f27740f
Request time: 2024-08-02 11:31:28
Status: RequestStatus.PENDING
Requested on: University-wisconsin of type Datasite
Requested by: guest_user
Changes: Request to change average_radius_of_nuclei (Pool Id: default-pool) to permission RequestStatus.APPROVED. No nested requests.
Warning
Waiting for approval: Please wait one second before proceeding, to make sure that the demo server receives and automatically approves the request.
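Instead of waiting for a fixed amount of time, one could poll until a condition holds or a timeout expires. The helper below is a generic sketch: `wait_until` is a hypothetical name, and the readiness check is a stand-in, since this tutorial does not show how to query the approval status programmatically.

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool], timeout: float = 5.0, interval: float = 0.5) -> bool:
    """Poll `is_ready` until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical stand-in for "has my request been approved yet?":
# here the condition simply becomes true after 0.2 seconds.
approved_at = time.monotonic() + 0.2
print(wait_until(lambda: time.monotonic() >= approved_at, timeout=2.0, interval=0.05))  # True
```

Returning a boolean rather than raising lets the caller decide what to do when the timeout is reached.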
Let’s check the status of our request:
client.code
The request has been automatically approved by the Datasite! This means we now have permission to remotely execute our code on the real data, and then download the answer:
answer = client.code.average_radius_of_nuclei(data=features, labels=targets)
print(answer)
(13.37980112044818, 21.134811320754718)
Well done!! 👏
You have successfully executed the average_radius_of_nuclei function remotely on the Datasite, and obtained the results from the real (non-public) data! And all of this without ever looking at the data!
This is just the surface 🔎
Congratulations on completing your first PySyft tutorial! In this tutorial, we have learnt how easy it is to get started with PySyft and do data science on data you are not allowed to see.
But this was just the surface!
Approval flow
For example, the demo Datasite server used in this tutorial has been configured to automatically accept every single incoming request. In practice, this means that no limitation is imposed on the queries an external data scientist could run on the data with their code.
Naturally, this was an over-simplification, created solely to make this tutorial as accessible as possible! There is so much more to discover about PySyft features!
If you wish to discover more about PySyft, and learn how to use PySyft from the ground up, read this tutorial.
You can also join the community on the OpenMined Slack, and message the #support channel where someone will gladly assist you!
Thanks for your interest!