Datasets & Assets (.datasets)

Estimated reading time: 8’

What you’ll learn

This guide’s objective is to help you get familiar with the datasets and assets that can be hosted on a Datasite server.

Introduction

Datasets are what allow researchers to use PySyft to conduct studies remotely, on private data that cannot be accessed directly. Two types of data can be hosted on a PySyft server:

  • private data: the original data that cannot be released due to its sensitive nature and associated risks

  • mock data: a fake version of the data that mimics the original by preserving, at minimum, the dataset schema and value distributions. It is either generated entirely from scratch or is a synthetic version with appropriate privacy guarantees.
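As a rough illustration of the "generated entirely from scratch" option, the sketch below builds mock rows that keep the column names and value types of a table while drawing every value at random. The schema, the value ranges, and the `make_mock_rows` helper are assumptions made for this example and are not part of PySyft; a realistic mock would also need to approximate the value distributions.

```python
import random

# Hypothetical schema: column name -> function that draws one fake value.
SCHEMA = {
    "Id": lambda rng: f"Q{rng.randint(1, 99999)}",
    "Birth year": lambda rng: rng.randint(1900, 1990),
    "Age of death": lambda rng: float(rng.randint(20, 100)),
}


def make_mock_rows(schema, n_rows, seed=0):
    """Return n_rows dicts that share the schema's columns, with random values."""
    rng = random.Random(seed)  # seeded so the mock is reproducible
    return [{col: draw(rng) for col, draw in schema.items()} for _ in range(n_rows)]


mock_rows = make_mock_rows(SCHEMA, n_rows=5)
```

Because no value is derived from the private data, a mock built this way cannot leak it, at the cost of being less representative.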

Both types of data are part of a single object, called an Asset, which has a dual behaviour: it acts as a pointer to the private data, but, depending on their permissions, a user can access either the private or the mock data. Related assets are conveniently grouped under a Dataset object, which allows the assets to be described more broadly.
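The dual behaviour can be pictured with a toy model. The classes below are hypothetical stand-ins written for this guide, not PySyft's implementation: they only show one object carrying two payloads and a dataset grouping related assets.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToyAsset:
    """Toy stand-in for an asset: one name, two payloads."""

    name: str
    data: Any  # private payload
    mock: Any  # mock payload

    def view_for(self, can_read_private: bool) -> Any:
        # Hand out the private payload only to callers allowed to see it.
        return self.data if can_read_private else self.mock


@dataclass
class ToyDataset:
    """Groups related toy assets and describes them as a whole."""

    name: str
    description: str
    assets: list


toy_asset = ToyAsset(name="ages", data=[53, 51, 84], mock=[40, 60, 70])
toy_dataset = ToyDataset(name="Age study", description="Toy example", assets=[toy_asset])
```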

Structuring Datasets and Assets

A Syft Dataset can contain one or more Syft Assets.

Warning

An asset must belong to a dataset and cannot be uploaded on its own.

Below are a few examples of how one can organise their assets within a dataset:

Example 1: Training and Testing Data
The dataset is composed of two assets, one with testing data, and another asset with training data. These two can be uploaded together into the same dataset.

Example 2: Chronological Data
The dataset is composed of multiple assets collected during different time periods, for example data-july-14.csv, data-july-17.csv, and data-august-12.csv. All three assets can be uploaded into the same dataset.
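For the chronological case, a small helper can derive asset names from the file names and keep them in date order. This is only a sketch: the `data-<month>-<day>.csv` naming pattern comes from the example above, while the `MONTHS` table and the `chronological_asset_names` function are assumptions made for illustration.

```python
from pathlib import Path

# Month lookup for the file names used in the example; extend as needed.
MONTHS = {"july": 7, "august": 8}


def chronological_asset_names(filenames):
    """Return file stems sorted by the (month, day) encoded in each name."""

    def sort_key(filename):
        _, month, day = Path(filename).stem.split("-")
        return (MONTHS[month], int(day))

    return [Path(name).stem for name in sorted(filenames, key=sort_key)]


files = ["data-august-12.csv", "data-july-17.csv", "data-july-14.csv"]
asset_names = chronological_asset_names(files)
```

Each resulting name could then be used as the `name` of one asset in the dataset.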

Differences for a low-side & high-side configuration

Some deployments are defined by how the data is arranged across servers: a low-side server can host only mock data, while high-side servers benefit from extra security and can contain private data.

Setting up the servers in this manner does not require a custom configuration. The only thing you need to pay attention to is never uploading an Asset with private data to a low-side server.

Warning

Remember!

  • Private (or sensitive) data must only be uploaded on the high-side server.

  • The low-side server must only contain mock data.

For the high-side servers, it is entirely up to the data owner if they would like to have the mock data available for testing purposes.
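One way to make the low-side rule harder to break is a small guard that runs before any asset is built. The helper below is hypothetical and independent of PySyft; it assumes you label each target server as "low" or "high" yourself, and it uses a deliberately strict identity check, so a low-side asset passes only when its data argument is the very same object as its mock.

```python
def check_side_payload(side, data, mock):
    """Raise if a low-side asset would carry anything other than its mock."""
    if side not in ("low", "high"):
        raise ValueError(f"unknown side: {side!r}")
    if side == "low" and data is not mock:
        # On the low side, the data slot must be filled with the mock itself.
        raise ValueError("low-side assets must reuse the mock object as data")
    return True


mock_values = [40, 60, 70]
private_values = [53, 51, 84]

check_side_payload("low", data=mock_values, mock=mock_values)
check_side_payload("high", data=private_values, mock=mock_values)
```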

Fetch test data

To show how datasets and assets can be created and uploaded, we will now launch a test server and start by downloading a Kaggle dataset for illustration (Age Dataset 2023).

import syft as sy

server = sy.orchestra.launch(name="test_server", port="auto", dev_mode=False, reset=True)

# logging in with default credentials
do_client = sy.login(email="[email protected]", password="changethis", port=server.port)
Starting test_server server on 0.0.0.0:43451
Waiting for server to start Done.
SyftInfo:
You have launched a development server at http://0.0.0.0:43451.It is intended only for local use.

Logged into <test_server: High side Datasite> as <[email protected]>
SyftWarning:
You are using a default password. Please change the password using `[your_client].account.set_password([new_password])`.

!curl -O https://openminedblob.blob.core.windows.net/csvs/ages_dataset.csv
!curl -O https://openminedblob.blob.core.windows.net/csvs/ages_mock_dataset.csv
import pandas as pd
import syft as sy

age_df = pd.read_csv("ages_dataset.csv")
age_df = age_df.dropna(how="any")
age_df.head()

age_mock_df = pd.read_csv("ages_mock_dataset.csv")
age_mock_df = age_mock_df.dropna(how="any")
age_mock_df.head()
|   | Id | Gender | Age of death | Associated Countries | Associated Country Life Expectancy | Manner of death | Name | Short description | Occupation | Death year | Birth year | Country | Associated Country Coordinates (Lat/Lon) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q19723 | Gender 1 | 53.0 | ['United States'] | [78.5] | homicide | Norma Fisher | Magazine truth stop whose group through despite. | Corporate treasurer | 1989.0 | 1936 | Not Available | Not Available |
| 1 | Q20057 | Gender 1 | 51.0 | ['United Kingdom'] | [81.3] | natural causes | Brandon Lloyd | Total financial role together range line beyon... | Chief Financial Officer | 2018.0 | 1967 | Not Available | Not Available |
| 2 | Q8791 | Gender 1 | 84.0 | ['Sweden'] | [82.5] | natural causes | Michelle Glover | Partner stock four. Region as true develop sou... | Speech and language therapist | 2000.0 | 1916 | Not Available | Not Available |
| 3 | Q30567 | Gender 1 | 64.0 | ['Belgium'] | [81.6] | natural causes | Willie Golden | Feeling fact by four. Data son natural explain... | Financial controller | 1989.0 | 1925 | Not Available | Not Available |
| 4 | Q14013 | Gender 1 | 88.0 | ['United Kingdom'] | [81.3] | suicide | Roberto Johnson | Attorney quickly candidate change although bag... | Sound technician, broadcasting/film/video | 2016.0 | 1928 | Not Available | Not Available |

Create an Asset

To create an asset (syft.Asset), the mandatory arguments are:

  • name (type: string): name of the asset, acts as a key among the assets in the same dataset and it must be unique

  • data: contains the private data; if you are preparing the assets for the low-side domain, this can be replaced with the mock data.

  • mock: contains the fake data; this data should have the same schema as the private data, but does not contain any sensitive information

Additional arguments are available:

  • description (type: string): a short description of the asset. It can only be a plain string; Markdown and HTML are not supported.

  • contributors (type: list): a list of syft.Contributor objects naming the authors of the asset data and specifying a contact email for further questions

  • mock_is_real (type: bool): states whether the mock data is synthetically generated from the original data. If the mock is entirely fake, this should be False.

  • mock.shape (type: tuple): the shape of the data if the data is either a Pandas DataFrame or a Numpy Array.

  • data_subjects: deprecated - do not use.
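Two of the requirements above, unique asset names and a mock that shares the private data's schema, are easy to check before anything is uploaded. The sketch below is a hypothetical pre-upload check that is not part of PySyft; it assumes each asset is summarised as a plain dict holding its name and the column names of its private and mock data. Columns are compared as sets, so a different column order is not reported as a mismatch.

```python
def validate_assets(assets):
    """Return a list of problems found in the asset summaries (empty if none)."""
    problems = []
    seen_names = set()
    for summary in assets:
        name = summary["name"]
        if name in seen_names:
            problems.append(f"duplicate asset name: {name}")
        seen_names.add(name)
        # Compare column names regardless of their order.
        if set(summary["data_columns"]) != set(summary["mock_columns"]):
            problems.append(f"schema mismatch in asset: {name}")
    return problems


summaries = [
    {"name": "train", "data_columns": ["Id", "Name"], "mock_columns": ["Name", "Id"]},
    {"name": "train", "data_columns": ["Id", "Name"], "mock_columns": ["Id"]},
]
problems = validate_assets(summaries)
```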

Note

Note for high-side servers: if you are uploading data to the high-side server and no mock is available or considered necessary, you can pass mock=sy.ActionObject.empty() to signal this.
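A minimal sketch of the high-side case described in the note, using only the `sy.Asset` arguments and the `sy.ActionObject.empty()` call mentioned in this guide. The `build_high_side_asset` wrapper is a hypothetical helper, and the import sits inside the function so the snippet can be loaded even where PySyft is not installed.

```python
def build_high_side_asset(name, private_data):
    """Create an asset that carries private data and an empty mock placeholder."""
    import syft as sy  # imported lazily; requires PySyft at call time

    return sy.Asset(
        name=name,
        data=private_data,
        mock=sy.ActionObject.empty(),  # signals that no mock is provided
    )
```

Calling `build_high_side_asset("asset_name", age_df)` would then stand in for the `sy.Asset(...)` call when no mock is supplied.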

main_contributor = sy.Contributor(name="Jeffrey Salazar", role="Dataset Creator", email="[email protected]")

asset = sy.Asset(
    name="asset_name",
    description="this is my asset",
    data=age_df, # real dataframe
    mock=age_mock_df, # mock dataframe
    contributors=[main_contributor]
)

Allowed data types in Asset

In this setup, the data is directly uploaded to the server. The only data formats supported are:

  • Python primitive types (int, string, list, dict, …)

  • Pandas DataFrame

  • NumPy arrays

If you want to upload custom data formats, we recommend using the blob storage functionality. This is currently in beta and documentation will be added soon.
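A quick pre-flight check against the supported formats might look like the sketch below. It is a heuristic written for this guide, not PySyft's own validation: the exact set of accepted primitive types is an assumption (the list above is open-ended), nested contents are not inspected, and pandas/NumPy objects are recognised by class and module name so that neither library has to be imported.

```python
# Assumed set of "primitive" types; the guide's own list ends with an ellipsis.
PRIMITIVES = (int, float, str, bool, bytes, list, tuple, dict, set, type(None))


def looks_uploadable(obj):
    """Heuristically tell whether obj matches one of the supported formats."""
    if isinstance(obj, PRIMITIVES):
        return True
    # Recognise pandas.DataFrame and numpy.ndarray without importing them.
    cls = type(obj)
    root_module = cls.__module__.split(".")[0]
    return (root_module, cls.__name__) in {("pandas", "DataFrame"), ("numpy", "ndarray")}
```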

Create a Dataset

To create a dataset, the available arguments are:

  • name (type: string): name of the dataset, acts as a key among datasets and it must be unique

  • asset_list (type: [syft.Asset]): a list of assets which contain the actual data uploaded as part of the dataset

  • description (type: string): brief additional information about the data found in the dataset; it supports markdown.

Additional optional arguments are available:

  • citation (type: string): indications on how to cite the dataset if used

  • url (type: string): link related to the dataset

  • contributors (type: [Contributor]): contributors to the dataset

dataset = sy.Dataset(
    name="Dataset name",
    description="**Dataset description**",
    asset_list=[asset],
    contributors=[main_contributor]
)

Warning

Preserve naming on low & high: for a low-side & high-side deployment, it is very important that the datasets and assets carry the same names on both sides. This ensures that code written on the low side, using low-side assets, can be easily executed on the high side.
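The naming rule can be verified mechanically before code is moved between sides. The sketch below is a hypothetical helper, independent of PySyft: it assumes you have already collected, for each side, a mapping from dataset name to the names of its assets.

```python
def naming_mismatches(low_side, high_side):
    """Compare {dataset name: asset names} mappings and list every difference."""
    issues = []
    for dataset_name in sorted(set(low_side) | set(high_side)):
        if dataset_name not in low_side or dataset_name not in high_side:
            issues.append(f"dataset only on one side: {dataset_name}")
            continue
        # Symmetric difference: asset names present on exactly one side.
        only_one_side = set(low_side[dataset_name]) ^ set(high_side[dataset_name])
        for asset_name in sorted(only_one_side):
            issues.append(f"asset only on one side: {dataset_name}/{asset_name}")
    return issues


low = {"Age study": ["train", "test"], "Extra": ["a"]}
high = {"Age study": ["train"]}
issues = naming_mismatches(low, high)
```

An empty result means every dataset and asset name exists on both sides.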

Data Upload

To upload a dataset to a domain, use the upload_dataset function. You need to be logged into the domain (low side or high side).

Note

Assets can only be uploaded as part of a dataset.

# Uploading the dataset
do_client.upload_dataset(dataset)

SyftSuccess:
Dataset uploaded to 'test_server'. To see the datasets uploaded by a client on this server, use command `[your_client].datasets`

Access Datasets

PySyft implements access control via the roles assigned to the user. Briefly:

  • Admins (data owners) can access, update, and delete datasets in full

  • Data Scientists/Guests can only access the mock counterpart of all datasets hosted on the server they are registered on. Be careful to not pass private data via the mock argument.

# Access is possible via the datasets API
do_client.datasets


# Retrieve one dataset
dataset_retrieved = do_client.datasets[0]
dataset_retrieved

Dataset name

Summary

Description

Dataset description

Dataset Details

Uploaded by: Jane Doe ([email protected])

Created on: 2024-08-02 13:01:56

URL: None

Contributors: To see full details call dataset.contributors.


# Retrieve an asset
asset_retrieved = dataset_retrieved.assets[0]
asset_retrieved

asset_name

syft.util.misc_objs.MarkdownDescription

Asset ID: cd148eb04ebe451082addad350cb50c2

Action Object ID: 089053c3e43f4b87a1d6303025467baa

Uploaded by: Jane Doe ([email protected])

Created on: 2024-08-02 13:01:56

Data:

Id Name Short description Gender Country Occupation Birth year Death year Manner of death Age of death Associated Countries Associated Country Coordinates (Lat/Lon) Associated Country Life Expectancy

Mock Data:

Id Gender Age of death Associated Countries Associated Country Life Expectancy Manner of death Name Short description Occupation Death year Birth year Country Associated Country Coordinates (Lat/Lon)
# Accessing the mock or private part of an asset directly:
mock_data = dataset_retrieved.assets[0].mock

private_data = dataset_retrieved.assets[0].data

mock_data
|   | Id | Gender | Age of death | Associated Countries | Associated Country Life Expectancy | Manner of death | Name | Short description | Occupation | Death year | Birth year | Country | Associated Country Coordinates (Lat/Lon) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q19723 | Gender 1 | 53.0 | ['United States'] | [78.5] | homicide | Norma Fisher | Magazine truth stop whose group through despite. | Corporate treasurer | 1989.0 | 1936 | Not Available | Not Available |
| 1 | Q20057 | Gender 1 | 51.0 | ['United Kingdom'] | [81.3] | natural causes | Brandon Lloyd | Total financial role together range line beyon... | Chief Financial Officer | 2018.0 | 1967 | Not Available | Not Available |
| 2 | Q8791 | Gender 1 | 84.0 | ['Sweden'] | [82.5] | natural causes | Michelle Glover | Partner stock four. Region as true develop sou... | Speech and language therapist | 2000.0 | 1916 | Not Available | Not Available |
| 3 | Q30567 | Gender 1 | 64.0 | ['Belgium'] | [81.6] | natural causes | Willie Golden | Feeling fact by four. Data son natural explain... | Financial controller | 1989.0 | 1925 | Not Available | Not Available |
| 4 | Q14013 | Gender 1 | 88.0 | ['United Kingdom'] | [81.3] | suicide | Roberto Johnson | Attorney quickly candidate change although bag... | Sound technician, broadcasting/film/video | 2016.0 | 1928 | Not Available | Not Available |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 44206 | Q21223 | Gender 1 | 87.0 | ['United States'] | [78.5] | natural causes | Steven Hill | Occur site mean. None imagine social collectio... | Television/film/video producer | 2014.0 | 1927 | Not Available | Not Available |
| 44207 | Q18681 | Gender 1 | 75.0 | ['Austria'] | [81.6] | natural causes | Laura Smith | Five help event as sort. Class training possib... | Race relations officer | 2018.0 | 1943 | Not Available | Not Available |
| 44208 | Q34424 | Gender 1 | 56.0 | ['France'] | [82.5] | natural causes | Diana Jacobs | Middle style capital describe increase. Fly si... | Civil Service fast streamer | 2009.0 | 1953 | Not Available | Not Available |
| 44209 | Q33102 | Gender 1 | 75.0 | ['France'] | [82.5] | natural causes | Larry Foster | Watch size character piece speak moment outsid... | Speech and language therapist | 1982.0 | 1907 | Not Available | Not Available |
| 44210 | Q34422 | Gender 1 | 63.0 | ['Poland'] | [77.6] | unnatural death | Thomas Gonzales | Fact thousand week professional. | Biochemist, clinical | 2005.0 | 1942 | Not Available | Not Available |

44211 rows × 13 columns

Note that only an admin/data owner can access dataset_retrieved.assets[0].data.

Update Datasets

Uploaded objects cannot be updated. You can only update an asset or a dataset before it is uploaded.

To change a dataset that has already been uploaded, we recommend deleting and re-creating the object.

asset.set_description("Updated asset description")
dataset.add_asset(asset, force_replace=True)
SyftSuccess:
Asset asset_name has been successfully replaced.

do_client.datasets


Delete Datasets

We recommend proceeding carefully with deletion, as the dataset may be in use by code requests from different users.

do_client.api.dataset.delete(uid=do_client.datasets[0].id)
do_client.datasets
DictTuple()
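Deletion also underpins the delete-and-re-create approach recommended for changing an uploaded dataset. The sketch below strings the two steps together using only the `api.dataset.delete(uid=...)` and `upload_dataset(...)` calls shown in this guide. The `replace_dataset` helper and the `FakeClient` class are hypothetical: the fake client only records calls so the sketch can run without a server, and the caller is assumed to pass in the already-retrieved datasets.

```python
from types import SimpleNamespace


def replace_dataset(client, new_dataset, existing_datasets):
    """Delete uploaded datasets that share the new dataset's name, then upload it."""
    for existing in existing_datasets:
        if existing.name == new_dataset.name:
            client.api.dataset.delete(uid=existing.id)
    return client.upload_dataset(new_dataset)


class FakeClient:
    """Stand-in client that records calls instead of contacting a server."""

    def __init__(self):
        self.deleted, self.uploaded = [], []
        # Mirror the client.api.dataset.delete(uid=...) call path.
        self.api = SimpleNamespace(dataset=SimpleNamespace(delete=self._delete))

    def _delete(self, uid):
        self.deleted.append(uid)

    def upload_dataset(self, dataset):
        self.uploaded.append(dataset.name)
        return "uploaded"


client = FakeClient()
old = SimpleNamespace(name="Dataset name", id="uid-1")
other = SimpleNamespace(name="Other", id="uid-2")
result = replace_dataset(client, SimpleNamespace(name="Dataset name"), [old, other])
```

Because deletion is irreversible, it is worth confirming that no pending code request depends on the old dataset before running such a helper against a real server.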

Dataset Description

Descriptions are important attributes of datasets, as they are the key to helping the data scientist understand the data despite not having access to it. A well-written, comprehensive description can answer questions and clarify assumptions that the data scientist might have about the data.

Markdown support

PySyft allows the dataset description to be written in Markdown, so that data owners can properly capture all the information needed about the data. You can write the description in a Markdown editor (Editor.md, Markdown Live Preview), edit and preview it until it’s good to go, and then copy the final Markdown into the description field when creating a dataset.

The description can capture dimensions such as:

  • Summary of the dataset: short description mentioning what type of data (numerical, text, image, mixed) and the source domain;

  • Dataset usage policy: describe the data usage policies the researchers must adhere to in their study to have their requests approved

  • Use cases: what use cases this dataset has been proposed for

  • Data collection and pre-processing: information on how the features were collected and/or derived, and how accurate the data is

  • Key features: a data dictionary explaining all the columns present in the dataset. We recommend mentioning any relationships between the columns

  • Code snippets: these can be included for common operations, for example a snippet showing how to load the dataset to get started with it

  • Citations: how to cite the dataset
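If the description is assembled programmatically, the dimensions listed above map naturally onto Markdown sections. The helper below is a hypothetical sketch, not a PySyft function: it joins (title, body) pairs under third-level headings, the heading level being an assumption, and skips any section left empty.

```python
def build_description(sections):
    """Join (title, body) pairs into one Markdown string, skipping empty bodies."""
    parts = [
        f"### {title}\n{body.strip()}"
        for title, body in sections
        if body.strip()
    ]
    return "\n\n".join(parts)


description = build_description(
    [
        ("About the dataset", "Demographic records for notable deceased people."),
        ("Dataset usage policy", "- only aggregate statistics can be released"),
        ("Citation", ""),  # empty sections are left out
    ]
)
```

The resulting string can be passed as the `description` argument when creating a dataset.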

Let’s see an example

description_template = '''### About the dataset
This extensive dataset provides a rich collection of demographic and life events records for individuals across multiple countries. It covers a wide range of indicators and attributes related to personal information, birth and death events, gender, occupation, and associated countries. The dataset offers valuable insights into population dynamics and various aspects of human life, enabling comprehensive analyses and cross-country comparisons. The dataset is the largest one on notable deceased people and includes individuals from a variety of social groups, including but not limited to 107k females, 90k researchers, and 124 non-binary individuals, spread across more than 300 contemporary or historical regions.

### Dataset usage policy
This dataset is subject to compliance with internal data use and mis-use policies at our organisation. The following rules apply:
- only aggregate statistics can be released from data computation
- data subjects should never be identifiable through the data computation outcomes
- a fixed privacy budget of eps=5 must be preserved by each researcher

### Data collection and pre-processing
The dataset is based on open data hosted by Wikimedia Foundation.

**Age**
Whenever possible, age was calculated based on the birth and death year mentioned in the description of the individual.

**Gender**
Gender was available in the original dataset for 50% of participants. For the remaining, it was added from predictions based on name, country and century in which they lived. (97.51% accuracy and 98.89% F1-score)

**Occupation**
The occupation was available in the original dataset for 66% of the individuals. For the remaining, it was added from predictions from a multiclass text classificator model. (93.4% accuracy for 84% of the dataset)

More details about the features can be found by reading the paper.

### Key features
1. **Id**: Unique identifier for each individual.
2. **Name**: Name of the person.
3. **Short description**: Brief description or summary of the individual.
4. **Gender**: Gender/s of the individual.
5. **Country**: Countries/Kingdoms of residence and/or origin.
6. **Occupation**: Occupation or profession of the individual.
7. **Birth year**: Year of birth for the individual.
8. **Death year**: Year of death for the individual.
9. **Manner of death**: Details about the circumstances or manner of death.
10. **Age of death**: Age at the time of death for the individual.
11. **Associated Countries**: Modern Day Countries associated with the individual.
12. **Associated Country Coordinates (Lat/Lon)**: Modern Day Latitude and longitude coordinates of the associated countries.
13. **Associated Country Life Expectancy**: Life expectancy of the associated countries.

### Use cases
- Analyze demographic trends and birth rates in different countries.
- Investigate factors affecting life expectancy and mortality rates.
- Study the relationship between gender and occupation across regions.
- Explore correlations between age of death and associated country attributes.
- Examine patterns of migration and associated countries' life expectancy.


### Getting started

```
!curl -O https://openminedblob.blob.core.windows.net/csvs/ages_dataset.csv

age_df = pd.read_csv("ages_dataset.csv")
```

### Execution environment
The data is hosted in a remote compute environment with the following specifications:
- X CPU cores
- 1 GPU of type Y
- Z RAM
- A additional available storage

### Citation
Annamoradnejad, Issa; Annamoradnejad, Rahimberdi (2022), “Age dataset: A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people”, In Workshop Proceedings of the 16th International AAAI Conference on Web and Social Media (ICWSM), doi: 10.36190/2022.82
'''
# Create a dataset with description

dataset = sy.Dataset(
    name="Dataset name with description",
    description=description_template,
    asset_list=[
        sy.Asset(
            name="Age Data 2023",
            data=age_df,
            mock=age_mock_df,
        )
    ],
    contributors=[main_contributor],
)

do_client.upload_dataset(dataset)

SyftSuccess:
Dataset uploaded to 'test_server'. To see the datasets uploaded by a client on this server, use command `[your_client].datasets`

do_client.datasets[0]

Dataset name with description

Summary

Description

About the dataset

This extensive dataset provides a rich collection of demographic and life events records for individuals across multiple countries. It covers a wide range of indicators and attributes related to personal information, birth and death events, gender, occupation, and associated countries. The dataset offers valuable insights into population dynamics and various aspects of human life, enabling comprehensive analyses and cross-country comparisons. The dataset is the largest one on notable deceased people and includes individuals from a variety of social groups, including but not limited to 107k females, 90k researchers, and 124 non-binary individuals, spread across more than 300 contemporary or historical regions.

Dataset usage policy

This dataset is subject to compliance with internal data use and mis-use policies at our organisation. The following rules apply:

  • only aggregate statistics can be released from data computation
  • data subjects should never be identifiable through the data computation outcomes
  • a fixed privacy budget of eps=5 must be preserved by each researcher

Data collection and pre-processing

The dataset is based on open data hosted by Wikimedia Foundation.

Age: Whenever possible, age was calculated based on the birth and death year mentioned in the description of the individual.

Gender: Gender was available in the original dataset for 50% of participants. For the remaining, it was added from predictions based on name, country and century in which they lived. (97.51% accuracy and 98.89% F1-score)

Occupation: The occupation was available in the original dataset for 66% of the individuals. For the remaining, it was added from predictions from a multiclass text classificator model. (93.4% accuracy for 84% of the dataset)

More details about the features can be found by reading the paper.

Key features

  1. Id: Unique identifier for each individual.
  2. Name: Name of the person.
  3. Short description: Brief description or summary of the individual.
  4. Gender: Gender/s of the individual.
  5. Country: Countries/Kingdoms of residence and/or origin.
  6. Occupation: Occupation or profession of the individual.
  7. Birth year: Year of birth for the individual.
  8. Death year: Year of death for the individual.
  9. Manner of death: Details about the circumstances or manner of death.
  10. Age of death: Age at the time of death for the individual.
  11. Associated Countries: Modern Day Countries associated with the individual.
  12. Associated Country Coordinates (Lat/Lon): Modern Day Latitude and longitude coordinates of the associated countries.
  13. Associated Country Life Expectancy: Life expectancy of the associated countries.

Use cases

  • Analyze demographic trends and birth rates in different countries.
  • Investigate factors affecting life expectancy and mortality rates.
  • Study the relationship between gender and occupation across regions.
  • Explore correlations between age of death and associated country attributes.
  • Examine patterns of migration and associated countries' life expectancy.

Getting started

!curl -O https://openminedblob.blob.core.windows.net/csvs/ages_dataset.csv

age_df = pd.read_csv("ages_dataset.csv")

Execution environment

The data is hosted in a remote compute environment with the following specifications:

  • X CPU cores
  • 1 GPU of type Y
  • Z RAM
  • A additional available storage

Citation

Annamoradnejad, Issa; Annamoradnejad, Rahimberdi (2022), “Age dataset: A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people”, In Workshop Proceedings of the 16th International AAAI Conference on Web and Social Media (ICWSM), doi: 10.36190/2022.82

Dataset Details

Uploaded by: Jane Doe ([email protected])

Created on: 2024-08-02 13:02:02

URL: None

Contributors: To see full details call dataset.contributors.
