# PySyft: Data Science on data you are not allowed to see

Getting access to **data** is the first and most crucial step in **Data Science**, but getting access to _enough_ data is often impossible.

Ideally, all that would be necessary for a data scientist is to _connect_ to the web server where the data is stored, so they can get started with their analyses. 
Unfortunately though, it is not _that_ simple! 
Public access to information always comes with several risks, and necessarily leads to concerns about trust (e.g. _data copy/use/misuse_), privacy (e.g. _disclosure of sensitive information_), or legal implications (e.g. _moving data outside their original silos_).

But what if we told you that it actually **can** be _that_ simple?!

```{epigraph}
A _new way_ to do data science, where you can use non-public information without seeing or obtaining a copy of the data itself. All you need is to connect to a **Datasite**, and use PySyft!
```

In this tutorial, we will learn the **four** simple steps that are necessary to use PySyft, and to run your code on data _you are not allowed to see_.

## 1. Login to the Datasite

At the beginning of your journey with PySyft, there is one key term to remember: [**Datasite**](./components/datasets.ipynb).

```{epigraph} 
Think of a Datasite as a website, but for data. While web servers allow you to download files like `.html` or `.css` to your browser, Datasites operate differently. 

A Datasite enables a data scientist to retrieve *answers* to their research questions from the data, *without actually downloading or seeing* the data.
```

Let's use our first *PySyft Datasite* together and explore what questions we can answer with the data it hosts!

```{warning}

We will assume you have <b>PySyft installed</b>. Please refer to the [Quick install PySyft](./quick-install.ipynb) for further information.
```

Let's first log in to the Datasite using `syft`:

In [None]:
import syft as sy

# Log in to the Datasite as a guest user
client = sy.login_as_guest(url="http://fake-wisc-datasite.openmined.org/")

## 2. Discover the data on the Datasite

Now that we are logged in to the Datasite (_as guests_, ed.), let's explore what data is available. We do so by accessing [client.datasets](./components/datasets.ipynb):

In [None]:
client.datasets

The `pysyft-demo-datasite` server is hosting **one** dataset, named "Breast Cancer Wisconsin". We can use PySyft APIs to get a _pointer_ to the remote dataset:

In [None]:
breast_cancer_dataset = client.datasets["Breast Cancer Wisconsin"]

As this is our first interaction with the "Breast Cancer Wisconsin" dataset, let's get additional information to better understand what this data is about. We can do so by simply accessing the _rich_ preview from our `breast_cancer_dataset` remote pointer:

In [None]:
breast_cancer_dataset

As you can see, the dataset preview includes:
- a **summary**: a brief description of the dataset.
- a **description**: further information about what data is included, its intended use, and the characteristics of the data (_so we know what we should expect_, ed.).
- the list of the **two assets**, i.e. "BC Features" and "BC Targets", storing the data.

<div class="alert alert-info">
    💡 <b>Take away</b>: We just learnt that a PySyft <b>dataset</b> is the union of <b>metadata</b>, useful to get a general understanding of the data, and <b>assets</b>.
</div>
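
As a rough mental model only, the structure described in this take-away can be sketched with plain Python dataclasses. `ToyDataset` and `ToyAsset` are hypothetical names invented for this illustration and are not PySyft classes; the summary and description strings are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAsset:
    """Hypothetical stand-in for an asset: here, just a name."""
    name: str


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a dataset: metadata plus a list of assets."""
    name: str
    summary: str
    description: str
    assets: list[ToyAsset] = field(default_factory=list)


toy_dataset = ToyDataset(
    name="Breast Cancer Wisconsin",
    summary="(placeholder summary)",
    description="(placeholder description)",
    assets=[ToyAsset("BC Features"), ToyAsset("BC Targets")],
)

# Unpacking works because the dataset holds exactly two assets.
toy_features, toy_targets = toy_dataset.assets
print([asset.name for asset in toy_dataset.assets])
```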

Similarly to what we did with the dataset, we can get a pointer to each of the remote assets in `breast_cancer_dataset`:

In [None]:
features, targets = breast_cancer_dataset.assets

**However**, if we try to access the `data` from these assets:

In [None]:
features.data

<div class="alert alert-danger">
    ✋ <b>We do not have permissions to access the data!</b> 🛑
</div>

This is indeed one of the main features of PySyft: as `GUEST` Data Scientists on the Datasite, we are **not allowed** to see or download any data from the dataset!

_At this point, you may wonder_: 

> How could I possibly work with data I cannot see? How could I even start to work on my code if it cannot be tested with data?

PySyft addresses this issue by hosting **two types** of data in each [**asset**](./components/datasets.ipynb#create-an-asset): _real_ data (e.g. `features.data`) and _mock_ data (e.g. `features.mock`).

**Mock** data is an artificial version of the real data, made available through PySyft for code preparation purposes. Let's have a quick look:

In [None]:
features.mock.head(n=3)

In [None]:
targets.mock.sample(n=3, random_state=24)
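
To make the split between _real_ and _mock_ data concrete, here is a toy sketch in plain Python. The `ToyGuardedAsset` class, its `role` argument and all row values are assumptions made for this illustration; it does not show how PySyft actually handles permissions, and in PySyft `data` is an attribute rather than a method.

```python
class ToyGuardedAsset:
    """Hypothetical asset: mock rows are open to anyone, real rows are guarded."""

    def __init__(self, name, real_rows, mock_rows):
        self.name = name
        self._real_rows = real_rows
        self.mock = mock_rows  # artificial rows with the same columns

    def data(self, role):
        # Only the (hypothetical) "owner" role may read the real rows.
        if role != "owner":
            raise PermissionError(
                f"role {role!r} may not read the real data of {self.name!r}"
            )
        return self._real_rows


# All values below are made up for the example.
toy_asset = ToyGuardedAsset(
    name="BC Features",
    real_rows=[{"radius3": 17.5}, {"radius3": 11.2}],
    mock_rows=[{"radius3": 15.0}, {"radius3": 12.0}],
)

print(toy_asset.mock)

try:
    toy_asset.data(role="guest")
except PermissionError as error:
    print("Access denied:", error)
```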

## 3. Prepare the code to answer our Research Question

The presence of `mock` data provides a concrete strategy to work on our code, and more importantly, to concentrate on what _really matters_: our **research question**. 

Let's say we are interested in the difference (on _average_) in the `radius` of the tumor nuclei between patients with a benign (`B`) and a malignant (`M`) diagnosis.

We can _translate_ this question into Python code that can actually run, using mock data as a **proxy** for the real data:

In [None]:
def average_radius_of_nuclei(data, labels) -> tuple[float, float]:
    """Calculate the mean of `radius` feature, for both samples with benign and malignant diagnosis"""

    # Flatten the diagnosis column so it can be used as a row mask on `data`
    y = labels["Diagnosis"].values.ravel()
    mean_benign = data[y == "B"]["radius3"].mean()
    mean_malignant = data[y == "M"]["radius3"].mean()
    return mean_benign, mean_malignant


Let's test our code on mock data:

In [None]:
average_radius_of_nuclei(features.mock, targets.mock)

It works! ✅

We have just verified that our code works on the selected assets, and so it is ready for execution on the Datasite.
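
To make explicit what the function computes, here is a dependency-free sketch of the same group-wise mean over a handful of rows. The rows and labels are made up for this illustration (they come from neither the real nor the mock dataset), and `toy_average_radius` is a hypothetical helper, not part of PySyft.

```python
from statistics import mean

# Made-up rows and labels, aligned by position (row i belongs to label i).
toy_rows = [
    {"radius3": 12.0},
    {"radius3": 14.0},
    {"radius3": 22.0},
    {"radius3": 26.0},
]
toy_labels = ["B", "B", "M", "M"]


def toy_average_radius(rows, labels):
    """Mean of `radius3` for benign ("B") and malignant ("M") rows."""
    benign = [row["radius3"] for row, label in zip(rows, labels) if label == "B"]
    malignant = [row["radius3"] for row, label in zip(rows, labels) if label == "M"]
    return mean(benign), mean(malignant)


mean_benign, mean_malignant = toy_average_radius(toy_rows, toy_labels)
print(mean_benign, mean_malignant)
```

Like the pandas version, this relies on features and labels being in the same row order.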

```{warning} Obtaining the real answer
It is important to emphasise that <em>that</em> result (and <em>any other result</em>, ed.) generated using <b>mock</b> data is completely <b>meaningless</b> for our research question!
We need to run our code on the <em>real</em> (remote) data on the Datasite to get the answer we are looking for. And we will use PySyft for it! 
```

## 4. Get the _true_ answer to our Research Question

Now that we have checked that our code runs _locally_ on _mock_ data, we are ready to proceed with the last step. 
This time, we want our code to be executed _remotely_ (on the Datasite), using the _real_ non-public data. We just need to convert our (local) Python function into a [**remote code request**](./components/requests-api.ipynb#create-a-code-request). 

Creating a code request in PySyft is easy: it only requires wrapping our function with a Python decorator, [`syft.syft_function_single_use`](./components/code-api.ipynb#what-is-a-syft-function). With this decorator, we specify which assets we want to use in our execution (i.e. `features` and `targets` in our case). In this way, we clearly delimit the _data_ scope of our experiment, as PySyft will **not** allow execution on [assets other than the ones specified](./components/syft-policies.ipynb#input-and-output-policies).
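
The idea of binding function parameters to specific assets can be illustrated with a toy decorator in plain Python. This is only a sketch of the concept: `toy_bind_inputs` is a hypothetical helper and not how `syft_function_single_use` is implemented, and the lists merely stand in for remote assets.

```python
import functools


def toy_bind_inputs(**declared_inputs):
    """Hypothetical decorator: only allow calls that use the declared inputs."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(**kwargs):
            # Every declared parameter must be supplied, and nothing else.
            if set(kwargs) != set(declared_inputs):
                raise TypeError("expected exactly: " + ", ".join(sorted(declared_inputs)))
            # Each argument must be the very object that was declared.
            for name, value in kwargs.items():
                if value is not declared_inputs[name]:
                    raise PermissionError(f"{name!r} is not the declared asset")
            return func(**kwargs)

        return wrapper

    return decorator


# Plain lists standing in for remote assets (made-up values).
toy_features = [1.0, 2.0, 3.0]
toy_targets = ["B", "M", "B"]
other_data = [9.0, 9.0, 9.0]


@toy_bind_inputs(data=toy_features, labels=toy_targets)
def count_rows(data, labels):
    return len(data), len(labels)


print(count_rows(data=toy_features, labels=toy_targets))

try:
    count_rows(data=other_data, labels=toy_targets)
except PermissionError as error:
    print("Rejected:", error)
```

Calling the wrapped function with any object other than the declared ones raises an error, which mirrors the scope restriction described above.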

In [None]:
@sy.syft_function_single_use(data=features, labels=targets)  # mapping parameters to corresponding assets
def average_radius_of_nuclei(data, labels) -> tuple[float, float]:
    """Calculate the mean of `radius` feature, for both samples with benign and malignant diagnosis"""

    # Flatten the diagnosis column so it can be used as a row mask on `data`
    y = labels["Diagnosis"].values.ravel()
    mean_benign = data[y == "B"]["radius3"].mean()
    mean_malignant = data[y == "M"]["radius3"].mean()
    return mean_benign, mean_malignant

**Success**! 🎉

Our `average_radius_of_nuclei` function has now officially become a PySyft code request, which we can immediately submit to the Datasite:

In [None]:
client.code.request_code_execution(average_radius_of_nuclei)

```{warning} Waiting for approval
Please wait a second before proceeding, to make sure that the demo server receives and automatically approves the request.
```
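
Instead of waiting by hand, a small polling helper can be used. The sketch below is generic Python, not a PySyft feature: `wait_until` is a hypothetical helper, and the check passed to it here is a stand-in that simply succeeds on the third call, since how to query the request status programmatically is not covered in this tutorial.

```python
import time


def wait_until(is_ready, timeout=5.0, interval=0.5):
    """Poll `is_ready()` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in check for the example: reports "ready" on the third call.
attempts = {"count": 0}


def ready_on_third_check():
    attempts["count"] += 1
    return attempts["count"] >= 3


result = wait_until(ready_on_third_check, timeout=2.0, interval=0.01)
print(result, attempts["count"])
```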

Let's check the _status_ of our request:

In [None]:
client.code

The request has been **automatically approved** by the Datasite! This means we now have permission to _remotely_ execute our code on the _real_ data, and then download the answer:

In [None]:
answer = client.code.average_radius_of_nuclei(data=features, labels=targets)
print(answer)

Well done!! 👏

You have successfully executed the `average_radius_of_nuclei` function _remotely_ on the Datasite, and obtained the results from the _real_ (non-public) data! And all without ever looking at the data!

## This is just the surface 🔎

Congratulations on completing your first **PySyft** tutorial! In this tutorial, we have learnt how easy it is to get started with PySyft and do data science on data _you are not allowed to see_.


**But this was just the surface!**

```{admonition} Approval flow
<em>For example</em>: the demo Datasite server used in this tutorial has been configured to <b>automatically accept</b> every incoming request. In practice, this means that <em>no</em> limitation is imposed on the <em>queries</em> an external data scientist can run on the data with their code.

Naturally, this is an over-simplification, made solely to keep this tutorial as accessible as possible! There is <b>so much more</b> to unveil about PySyft features!
```
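
The difference between accepting everything and reviewing requests can be sketched with a toy model. Everything here is hypothetical (`ToyRequest`, the two policies, and the function name `dump_all_rows`); it only illustrates the concept and does not describe how a Datasite reviews requests.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical code request: who asks, and which function they submit."""
    requester: str
    function_name: str
    status: str = "pending"


def auto_approve(request):
    """Demo-like policy: accept every incoming request."""
    return True


def approve_known_functions(request, allowed=frozenset({"average_radius_of_nuclei"})):
    """Stricter toy policy: accept only functions on an allow-list."""
    return request.function_name in allowed


def review(request, policy):
    """Apply a policy to a pending request and record the decision."""
    request.status = "approved" if policy(request) else "denied"
    return request.status


function_names = ["average_radius_of_nuclei", "dump_all_rows"]
auto_results = [review(ToyRequest("guest", name), auto_approve) for name in function_names]
strict_results = [
    review(ToyRequest("guest", name), approve_known_functions) for name in function_names
]
print(auto_results)
print(strict_results)
```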

If you wish to discover more about PySyft, and learn how to use PySyft _from the ground up_, read this [tutorial](./getting-started/introduction.ipynb). 

You can also [join](https://bit.ly/join-om-slack) the community on the OpenMined Slack and post in the `#support` channel, where someone will gladly assist you!

Thanks for your interest!
