PySyft: Data Science on data you are not allowed to see

Getting access to data is the first and most crucial step in Data Science, but getting access to enough data is often impossible.

Ideally, all a data scientist would need to do is connect to the web server where the data is stored and get started with their analyses. Unfortunately, it is not that simple! Public access to information always comes with several risks, and necessarily leads to concerns about trust (e.g. data copy/use/misuse), privacy (e.g. disclosure of sensitive information), or legal implications (e.g. moving data outside its original silos).

But what if we told you that it actually can be that simple?

There is a new way to do data science, where you can use non-public information without seeing or obtaining a copy of the data itself. All you need is to connect to a Datasite and use PySyft!

In this tutorial, we will learn the four simple steps that are necessary to use PySyft, and to run your code on data you are not allowed to see.

1. Login to the Datasite

At the beginning of your journey with PySyft, there is one key term to remember: Datasite.

Think of a Datasite as a website, but for data. While web servers allow you to download files like .html or .css to your browser, Datasites operate differently.

A Datasite enables a data scientist to retrieve answers to their research questions from the data, without actually downloading or seeing the data.
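To make the idea concrete, here is a toy sketch in plain Python. It is not how PySyft is implemented: the `ToyDatasite` class, its `approve` and `ask` methods, and the sample rows are all hypothetical, invented only to illustrate that the caller receives an answer while the rows stay behind.

```python
from statistics import mean
from typing import Callable, Dict, List

Rows = List[Dict[str, float]]


class ToyDatasite:
    """Toy stand-in for a Datasite: holds private rows, returns only answers.

    Hypothetical illustration only; this is not PySyft's implementation.
    """

    def __init__(self, private_rows: Rows) -> None:
        # Name-mangled attribute: no public method ever returns the rows.
        self.__rows = private_rows
        self._approved: Dict[str, Callable[[Rows], float]] = {}

    def approve(self, name: str, func: Callable[[Rows], float]) -> None:
        """Data owner side: allow one specific computation on the data."""
        self._approved[name] = func

    def ask(self, name: str) -> float:
        """Data scientist side: get the result of an approved computation."""
        if name not in self._approved:
            raise PermissionError(f"computation {name!r} has not been approved")
        return self._approved[name](self.__rows)


def mean_radius(rows: Rows) -> float:
    """Average of the (made-up) 'radius' column."""
    return mean(row["radius"] for row in rows)


# Made-up rows, used only for this illustration.
site = ToyDatasite([{"radius": 10.0}, {"radius": 14.0}, {"radius": 18.0}])
site.approve("mean_radius", mean_radius)

answer = site.ask("mean_radius")
print(answer)  # 14.0
```

The data scientist only ever holds `answer`; asking for a computation that was never approved raises `PermissionError` instead of returning anything.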

Let’s use our first PySyft Datasite together and explore what questions we can answer with the data it hosts!

Warning

We will assume you have PySyft installed. Please refer to the Quick install PySyft guide for further information.

Let’s first log in to the Datasite using syft:

import syft as sy

# log in to the Datasite as a guest
client = sy.login_as_guest(url="20.51.219.43:80")
Logged into <university-wisconsin: High-side Datasite> as GUEST
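If the login call hangs or fails, it can help to check first that the address is reachable at all. The sketch below uses only the Python standard library; `split_address` and `is_reachable` are hypothetical helper names, not part of PySyft, and the sketch assumes the address has the scheme-less `host:port` form used above.

```python
import socket
from typing import Tuple


def split_address(address: str, default_port: int = 80) -> Tuple[str, int]:
    """Split a scheme-less 'host:port' string into (host, port)."""
    host, sep, port = address.rpartition(":")
    if not sep:
        # No colon found: the whole string is the host.
        return address, default_port
    return host, int(port)


def is_reachable(address: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the address can be opened."""
    host, port = split_address(address)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolved hosts.
        return False


print(split_address("20.51.219.43:80"))  # ('20.51.219.43', 80)

# Possible usage before logging in (performs a real network call):
# if is_reachable("20.51.219.43:80"):
#     client = sy.login_as_guest(url="20.51.219.43:80")
```

A successful TCP connection only shows that something is listening at that address; it does not confirm that a Datasite is running there.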

2. Discover the data on the Datasite

Now that we are logged in to the Datasite (as guests), let’s explore what data is available. We do so by accessing client.datasets:

client.datasets

The pysyft-demo-datasite server is hosting one dataset, named “Breast Cancer Wisconsin”. We can use PySyft APIs to get a pointer to the remote dataset:

breast_cancer_dataset = client.datasets["Breast Cancer Wisconsin"]
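Looking a dataset up by name fails if the name is misspelled. As a minimal sketch, the hypothetical helper `find_dataset` below (not a PySyft function) searches any iterable of objects that expose a `.name` attribute and reports the available names when nothing matches. It runs on stand-in objects, so no Datasite connection is needed to try it.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def find_dataset(datasets: Iterable[Any], name: str) -> Any:
    """Return the first dataset whose .name matches, else raise KeyError."""
    available = []
    for dataset in datasets:
        if dataset.name == name:
            return dataset
        available.append(dataset.name)
    raise KeyError(f"no dataset named {name!r}; available: {available}")


# Stand-in objects so the sketch runs without a Datasite connection.
fake_datasets = [SimpleNamespace(name="Breast Cancer Wisconsin")]

found = find_dataset(fake_datasets, "Breast Cancer Wisconsin")
print(found.name)  # Breast Cancer Wisconsin
```

With a live connection, the same helper could be given the datasets listed by the client, assuming they can be iterated and each exposes a `name` attribute; check the PySyft documentation for your installed version before relying on that.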

As this is our first interaction with the “Breast Cancer Wisconsin” dataset, let’s get additional information to better understand what this data is about. We can do so by simply accessing the rich preview from our breast_cancer_dataset remote pointer:

breast_cancer_dataset

Breast Cancer Wisconsin

Summary

Diagnostic Wisconsin Breast Cancer Database.

Description

Diagnostic Wisconsin Breast Cancer Database.

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A few of the images can be found here. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.

Data format: All data has been made available as pandas.DataFrame.

Dataset Details

Uploaded by: Jane Doe ([email protected])

Created on: 2024-08-01 19:13:46

URL: http://dx.doi.org/10.1007/s10916-013-9903-9

Contributors: To see full details call dataset.contributors.

Assets
