Datasets & Assets (.datasets)
Estimated reading time: 8’
What you’ll learn
This guide’s objective is to help you get familiar with the datasets and assets that can be hosted on a Datasite server.
Introduction
Datasets are the key to how PySyft allows researchers to conduct studies remotely, on private data that cannot be accessed directly. Two types of data can be hosted on a PySyft server:
private data: the original data that cannot be released due to its sensitive nature and associated risks
mock data: a fake version of the data that mimics the original by preserving, at a minimum, the dataset schema and value distributions. It is either generated entirely from scratch or is a synthetic version with appropriate privacy guarantees.
Both of these are part of an object called Asset, which has a dual behaviour: it is a pointer to the private data but, depending on their permissions, a user can access either the private or the mock data. Related assets are conveniently grouped under a Dataset object, which allows the assets to be described more broadly.
Structuring Datasets and Assets
A Syft Dataset can contain one or more Syft Assets.
Warning
An asset must belong to a dataset and cannot be uploaded on its own.
Below are a few examples of how one can organise their assets within a dataset:
Example 1: Training and Testing Data
The dataset is composed of two assets, one with testing data, and another asset with training data. These two can be uploaded together into the same dataset.
Example 2: Chronological Data
The dataset is composed of multiple assets collected during different time periods, for example data-july-14.csv, data-july-17.csv, and data-august-12.csv. All three assets can be uploaded into the same dataset.
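As a minimal sketch of planning such a dataset, the snippet below derives one asset name per file and rejects duplicates, since asset names act as keys within a dataset. The helper `asset_names_from_files` and the choice of using the file stem as the asset name are assumptions made for illustration; they are not part of PySyft.

```python
from pathlib import Path


def asset_names_from_files(paths):
    """Map each planned asset name (the file stem) to its source file.

    Raises ValueError if two files would produce the same asset name.
    """
    names = {}
    for path in paths:
        name = Path(path).stem  # "data-july-14.csv" -> "data-july-14"
        if name in names:
            raise ValueError(f"duplicate asset name: {name!r}")
        names[name] = str(path)
    return names


files = ["data-july-14.csv", "data-july-17.csv", "data-august-12.csv"]
planned_assets = asset_names_from_files(files)
print(planned_assets)
```

Each entry in `planned_assets` would then correspond to one asset in the same dataset.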
Differences for a low-side & high-side configuration
Some deployments are defined by how data is arranged across servers: a low-side server can host only mock data, while high-side servers benefit from extra security and can contain private data.
Setting up servers in this manner does not require a custom configuration. The only thing you need to pay attention to is never uploading an Asset with private data to a low-side server.
Warning
Remember!
Private (or sensitive) data must only be uploaded on the high-side server.
The low-side server must only contain mock data.
For the high-side servers, it is entirely up to the data owner if they would like to have the mock data available for testing purposes.
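The rule above can be enforced in your own upload scripts before anything reaches a server. The following is a hedged sketch only: the function `asset_payload` and the `"low"`/`"high"` labels are hypothetical conventions, not PySyft API. It returns the values you would pass as the data and mock arguments of an asset.

```python
def asset_payload(side, mock, private=None):
    """Return the data/mock pair to use for an asset on the given side.

    side: "low" or "high" (labels are this sketch's own convention).
    """
    if side == "low":
        if private is not None:
            # Guard: private data must never reach the low side.
            raise ValueError("private data must never be uploaded to the low side")
        # On the low side, the mock stands in for the data as well.
        return {"data": mock, "mock": mock}
    if side == "high":
        if private is None:
            raise ValueError("high-side assets need the private data")
        return {"data": private, "mock": mock}
    raise ValueError(f"unknown side: {side!r}")


low_payload = asset_payload("low", mock=[1, 2, 3])
high_payload = asset_payload("high", mock=[1, 2, 3], private=[7, 8, 9])
```

Failing loudly when private data is paired with the low side keeps the mistake from depending on anyone remembering the rule.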
Fetch test data
To show how datasets and assets can be created and uploaded, we will now launch a test server and download a Kaggle dataset for illustration (Age Dataset 2023).
import syft as sy
server = sy.orchestra.launch(name="test_server", port="auto", dev_mode=False, reset=True)
# logging in with default credentials
do_client = sy.login(email="[email protected]", password="changethis", port=server.port)
Starting test_server server on 0.0.0.0:43451
Waiting for server to start Done.
You have launched a development server at http://0.0.0.0:43451.It is intended only for local use.
Logged into <test_server: High side Datasite> as <[email protected]>
You are using a default password. Please change the password using `[your_client].account.set_password([new_password])`.
!curl -O https://openminedblob.blob.core.windows.net/csvs/ages_dataset.csv
!curl -O https://openminedblob.blob.core.windows.net/csvs/ages_mock_dataset.csv
import pandas as pd
import syft as sy
age_df = pd.read_csv("ages_dataset.csv")
age_df = age_df.dropna(how="any")
age_df.head()
age_mock_df = pd.read_csv("ages_mock_dataset.csv")
age_mock_df = age_mock_df.dropna(how="any")
age_mock_df.head()
| | Id | Gender | Age of death | Associated Countries | Associated Country Life Expectancy | Manner of death | Name | Short description | Occupation | Death year | Birth year | Country | Associated Country Coordinates (Lat/Lon) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Q19723 | Gender 1 | 53.0 | ['United States'] | [78.5] | homicide | Norma Fisher | Magazine truth stop whose group through despite. | Corporate treasurer | 1989.0 | 1936 | Not Available | Not Available |
1 | Q20057 | Gender 1 | 51.0 | ['United Kingdom'] | [81.3] | natural causes | Brandon Lloyd | Total financial role together range line beyon... | Chief Financial Officer | 2018.0 | 1967 | Not Available | Not Available |
2 | Q8791 | Gender 1 | 84.0 | ['Sweden'] | [82.5] | natural causes | Michelle Glover | Partner stock four. Region as true develop sou... | Speech and language therapist | 2000.0 | 1916 | Not Available | Not Available |
3 | Q30567 | Gender 1 | 64.0 | ['Belgium'] | [81.6] | natural causes | Willie Golden | Feeling fact by four. Data son natural explain... | Financial controller | 1989.0 | 1925 | Not Available | Not Available |
4 | Q14013 | Gender 1 | 88.0 | ['United Kingdom'] | [81.3] | suicide | Roberto Johnson | Attorney quickly candidate change although bag... | Sound technician, broadcasting/film/video | 2016.0 | 1928 | Not Available | Not Available |
Create an Asset
To create an asset (syft.Asset), the mandatory arguments are:
name (type: string): name of the asset; it acts as a key among the assets in the same dataset and must be unique
data: contains the private data; if you are preparing the assets for the low-side domain, this can be replaced with the mock data
mock: contains the fake data; this data should have the same schema as the private data, but must not contain any sensitive information
Additional arguments are available:
description (type: string): a short description of the asset. It can only be a plain string; Markdown and HTML are not supported.
contributors (type: list): a list of syft.Contributor objects listing the authors of the asset data and specifying a contact email for further questions
mock_is_real (type: bool): states whether the mock data is synthetically generated from the original data. If the mock is entirely fake, this should be false.
mock.shape (type: tuple): the shape of the data, if the data is either a Pandas DataFrame or a NumPy array
data_subjects: deprecated - do not use.
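Because the mock should share the schema of the private data, it can be worth comparing column names before building the asset. The helper below is a hypothetical sketch, not a PySyft function; it works on plain lists of column names and ignores column order.

```python
def schema_mismatches(private_columns, mock_columns):
    """Report column names present on one side but not the other."""
    private_set, mock_set = set(private_columns), set(mock_columns)
    return {
        "missing_in_mock": sorted(private_set - mock_set),
        "extra_in_mock": sorted(mock_set - private_set),
    }


private_cols = ["Id", "Name", "Gender", "Birth year", "Death year"]
mock_cols = ["Id", "Gender", "Name", "Death year", "Birth year"]
report = schema_mismatches(private_cols, mock_cols)
print(report)
```

With the DataFrames used in this guide, the same check could be run as `schema_mismatches(list(age_df.columns), list(age_mock_df.columns))`. It only compares names, so data types and value distributions still need a separate review.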
Note
Note for high-side servers
If you are uploading data to the high-side and there is no mock available or it is not considered necessary, you can pass mock=sy.ActionObject.empty() to signal this.
main_contributor = sy.Contributor(name="Jeffrey Salazar", role="Dataset Creator", email="[email protected]")
asset = sy.Asset(
    name="asset_name",
    description="this is my asset",
    data=age_df,  # real dataframe
    mock=age_mock_df,  # mock dataframe
    contributors=[main_contributor],
)
Allowed data types in Asset
In this setup, the data is directly uploaded to the server. The only data formats supported are:
Python primitive types (int, string, list, dict, …)
Pandas Dataframe
Numpy Arrays
If you want to upload custom data formats, we recommend using the blob storage functionality. This is currently in beta and documentation will be added soon.
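A quick pre-flight check can catch unsupported objects before an upload is attempted. The sketch below is an assumption-heavy illustration: the exact set of primitives accepted by the server is not listed in this guide, so the `PRIMITIVES` tuple is a guess, and the helper `is_directly_uploadable` is hypothetical. Pandas and NumPy objects are recognised by class and module name so the snippet runs even where those libraries are not installed.

```python
# Assumed set of "Python primitive types"; adjust to what your server accepts.
PRIMITIVES = (bool, int, float, str, bytes, list, tuple, dict, set, type(None))


def is_directly_uploadable(obj):
    """Best-effort check that obj is a primitive, a DataFrame, or an ndarray."""
    if isinstance(obj, PRIMITIVES):
        return True
    cls = type(obj)
    top_level_module = cls.__module__.split(".")[0]
    return (top_level_module, cls.__name__) in {
        ("pandas", "DataFrame"),
        ("numpy", "ndarray"),
    }


checks = [is_directly_uploadable(3), is_directly_uploadable({"a": 1}), is_directly_uploadable(object())]
print(checks)
```

Anything that fails this check would be a candidate for the blob storage route mentioned above.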
Create a Dataset
To create a dataset, the available arguments are:
name (type: string): name of the dataset; it acts as a key among datasets and must be unique
asset_list (type: [syft.Asset]): a list of assets which contain the actual data uploaded as part of the dataset
description (type: string): brief additional information about the data found in the dataset; it supports Markdown
Additional optional arguments are available:
citation (type: string): indications on how to cite the dataset if used
url (type: string): link related to the dataset
contributors (type: [Contributor]): contributors to the dataset
dataset = sy.Dataset(
    name="Dataset name",
    description="**Dataset description**",
    asset_list=[asset],
    contributors=[main_contributor],
)
Warning
Preserve naming on low & high
For a low-side & high-side deployment, it is very important that the datasets and assets carry the same names on both sides. This ensures that code written on the low-side against low-side assets can be easily executed on the high-side.
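One way to check this before uploading is to compare the planned names for both sides. The sketch below is illustrative only: `naming_mismatches` is a hypothetical helper, and it assumes you keep a plain mapping of dataset name to asset names for each side.

```python
def naming_mismatches(low, high):
    """Compare dataset and asset names planned for the low and high side.

    low, high: dict mapping a dataset name to an iterable of asset names.
    Returns a list of human-readable problems (empty if names match).
    """
    problems = []
    for dataset in sorted(set(low) | set(high)):
        if dataset not in low or dataset not in high:
            side = "low" if dataset not in low else "high"
            problems.append(f"dataset {dataset!r} missing on {side} side")
            continue
        # Assets present on exactly one of the two sides.
        for asset in sorted(set(low[dataset]) ^ set(high[dataset])):
            side = "low" if asset not in low[dataset] else "high"
            problems.append(f"asset {asset!r} in {dataset!r} missing on {side} side")
    return problems


low_plan = {"Dataset name": ["asset_name"]}
high_plan = {"Dataset name": ["asset_name", "extra"]}
issues = naming_mismatches(low_plan, high_plan)
print(issues)
```

An empty list means every dataset and asset name appears on both sides.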
Data Upload
To upload a dataset to a domain, use the upload_dataset function. You need to be logged in to the domain (low side or high side).
Note
Info
Assets can only be uploaded as part of a dataset.
# Uploading the dataset
do_client.upload_dataset(dataset)
Uploading: 0%| | 0/1 [00:00<?, ?it/s]
Uploading: asset_name: 0%| | 0/1 [00:00<?, ?it/s]
Uploading: asset_name: 100%|██████████| 1/1 [00:00<00:00, 1.36it/s]
Uploading: asset_name: 100%|██████████| 1/1 [00:00<00:00, 1.35it/s]
Dataset uploaded to 'test_server'. To see the datasets uploaded by a client on this server, use command `[your_client].datasets`
Access Datasets
PySyft implements access control via the roles assigned to the user. Briefly:
Admins (data owners) can access, update and delete the datasets in full
Data Scientists/Guests can only access the mock counterpart of all datasets hosted on the server they are registered on. Be careful not to pass private data via the mock argument.
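To make the two roles concrete, here is a toy model of the behaviour described above. It is not how PySyft implements access control (that is enforced by the server); the class `ToyAsset` and the role strings are hypothetical and exist only to illustrate why anything placed in mock should be considered visible to every registered user.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToyAsset:
    name: str
    data: Any  # private data, visible to admins only
    mock: Any  # mock data, visible to everyone

    def view(self, role: str) -> Any:
        """Return what a user with the given role would be able to see."""
        if role == "admin":
            return self.data
        return self.mock


toy = ToyAsset(name="asset_name", data=[53, 51, 84], mock=[50, 50, 80])
admin_view = toy.view("admin")
guest_view = toy.view("guest")
```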
# Access is possible via the datasets API
do_client.datasets
Dataset Dicttuple
Total: 0
# Retrieve one dataset
dataset_retrieved = do_client.datasets[0]
dataset_retrieved
Dataset name
Summary
Description
Dataset description
Dataset Details
Uploaded by: Jane Doe ([email protected])
Created on: 2024-08-02 13:01:56
URL: None
Contributors: To see full details call dataset.contributors.
Assets
Asset Dicttuple
Total: 0
# Retrieve an asset
asset_retrieved = dataset_retrieved.assets[0]
asset_retrieved
asset_name
syft.util.misc_objs.MarkdownDescription
Asset ID: cd148eb04ebe451082addad350cb50c2
Action Object ID: 089053c3e43f4b87a1d6303025467baa
Uploaded by: Jane Doe ([email protected])
Created on: 2024-08-02 13:01:56
Data:
Id | Name | Short description | Gender | Country | Occupation | Birth year | Death year | Manner of death | Age of death | Associated Countries | Associated Country Coordinates (Lat/Lon) | Associated Country Life Expectancy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Loading... (need help?) |
Mock Data:
Id | Gender | Age of death | Associated Countries | Associated Country Life Expectancy | Manner of death | Name | Short description | Occupation | Death year | Birth year | Country | Associated Country Coordinates (Lat/Lon) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Loading... (need help?) |
# Accessing the mock or private part of an asset directly:
mock_data = dataset_retrieved.assets[0].mock
private_data = dataset_retrieved.assets[0].data
mock_data
| | Id | Gender | Age of death | Associated Countries | Associated Country Life Expectancy | Manner of death | Name | Short description | Occupation | Death year | Birth year | Country | Associated Country Coordinates (Lat/Lon) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Q19723 | Gender 1 | 53.0 | ['United States'] | [78.5] | homicide | Norma Fisher | Magazine truth stop whose group through despite. | Corporate treasurer | 1989.0 | 1936 | Not Available | Not Available |
1 | Q20057 | Gender 1 | 51.0 | ['United Kingdom'] | [81.3] | natural causes | Brandon Lloyd | Total financial role together range line beyon... | Chief Financial Officer | 2018.0 | 1967 | Not Available | Not Available |
2 | Q8791 | Gender 1 | 84.0 | ['Sweden'] | [82.5] | natural causes | Michelle Glover | Partner stock four. Region as true develop sou... | Speech and language therapist | 2000.0 | 1916 | Not Available | Not Available |
3 | Q30567 | Gender 1 | 64.0 | ['Belgium'] | [81.6] | natural causes | Willie Golden | Feeling fact by four. Data son natural explain... | Financial controller | 1989.0 | 1925 | Not Available | Not Available |
4 | Q14013 | Gender 1 | 88.0 | ['United Kingdom'] | [81.3] | suicide | Roberto Johnson | Attorney quickly candidate change although bag... | Sound technician, broadcasting/film/video | 2016.0 | 1928 | Not Available | Not Available |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
44206 | Q21223 | Gender 1 | 87.0 | ['United States'] | [78.5] | natural causes | Steven Hill | Occur site mean. None imagine social collectio... | Television/film/video producer | 2014.0 | 1927 | Not Available | Not Available |
44207 | Q18681 | Gender 1 | 75.0 | ['Austria'] | [81.6] | natural causes | Laura Smith | Five help event as sort. Class training possib... | Race relations officer | 2018.0 | 1943 | Not Available | Not Available |
44208 | Q34424 | Gender 1 | 56.0 | ['France'] | [82.5] | natural causes | Diana Jacobs | Middle style capital describe increase. Fly si... | Civil Service fast streamer | 2009.0 | 1953 | Not Available | Not Available |
44209 | Q33102 | Gender 1 | 75.0 | ['France'] | [82.5] | natural causes | Larry Foster | Watch size character piece speak moment outsid... | Speech and language therapist | 1982.0 | 1907 | Not Available | Not Available |
44210 | Q34422 | Gender 1 | 63.0 | ['Poland'] | [77.6] | unnatural death | Thomas Gonzales | Fact thousand week professional. | Biochemist, clinical | 2005.0 | 1942 | Not Available | Not Available |
44211 rows × 13 columns
Note that only an admin/data owner can access dataset_retrieved.assets[0].data.
Update Datasets
Data that has already been uploaded cannot be updated. You can only update an asset or dataset before it is uploaded.
To change a dataset that was already uploaded, we recommend deleting and re-creating the object.
asset.set_description("Updated asset description")
dataset.add_asset(asset, force_replace=True)
Asset asset_name has been successfully replaced.
do_client.datasets
Dataset Dicttuple
Total: 0
Delete Datasets
Proceed carefully with deletion, as the dataset may be in use by other users' code requests.
do_client.api.dataset.delete(uid=do_client.datasets[0].id)
do_client.datasets
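Combining deletion with a fresh upload gives the delete-and-re-create workflow recommended for changing an uploaded dataset. The sketch below wraps the two calls used in this guide (`api.dataset.delete` and `upload_dataset`) in a hypothetical helper, `replace_dataset`. So that it runs without a server, it is exercised against a stand-in client that only records calls; the stand-in is not a PySyft object.

```python
from types import SimpleNamespace


def replace_dataset(client, old_dataset, new_dataset):
    """Delete an uploaded dataset, then upload its replacement."""
    client.api.dataset.delete(uid=old_dataset.id)
    return client.upload_dataset(new_dataset)


# Dry run against a stand-in client that records calls instead of contacting a server.
calls = []
fake_client = SimpleNamespace(
    api=SimpleNamespace(
        dataset=SimpleNamespace(delete=lambda uid: calls.append(("delete", uid)))
    ),
    upload_dataset=lambda ds: calls.append(("upload", ds.name)),
)
old = SimpleNamespace(id="old-uid", name="Dataset name")
new = SimpleNamespace(id="new-uid", name="Dataset name")
replace_dataset(fake_client, old, new)
print(calls)
```

With a real client, the call would look like `replace_dataset(do_client, do_client.datasets[0], dataset)`. Keeping the same dataset name on the replacement avoids breaking code that refers to it.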
Dataset Description
Descriptions are important attributes for datasets, as they are the key to helping the data scientist understand the data despite not having access to it. A well-written, comprehensive description can help answer questions and clarify assumptions that the data scientist might have about the data.
Markdown support
PySyft allows the dataset description to be written in Markdown, so that data owners can properly capture all the information needed about the data. You can write the description directly in a Markdown editor (Editor.md, Markdown Live Preview), edit and preview it until it’s good to go, and then copy the final Markdown into the description field when creating a dataset.
The description can capture dimensions such as:
Summary of the dataset: short description mentioning what type of data (numerical, text, image, mixed) and the source domain;
Dataset usage policy: describe the data usage policies the researchers must adhere to in their study to have their requests approved
Use cases: what use cases this dataset has been proposed for
Data collection and pre-processing: information on how the features were collected and/or derived, and how accurate the data is
Key features: a data dictionary explaining all the columns present in the dataset. We recommend mentioning any relationships between the columns
Code snippets: these can be included for common operations, for example a snippet showing how to load the dataset to get started with it
Citations: how to cite the dataset
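If you maintain several datasets, it can help to assemble the description from named sections so that every dataset covers the same dimensions. The helper `build_description` below is a hypothetical sketch, not part of PySyft; it simply joins (title, body) pairs under third-level Markdown headings.

```python
def build_description(sections):
    """Join (title, body) pairs into one Markdown string with ### headings."""
    parts = []
    for title, body in sections:
        parts.append(f"### {title}\n\n{body.strip()}")
    return "\n\n".join(parts) + "\n"


description = build_description(
    [
        ("About the dataset", "Demographic and life event records."),
        ("Dataset usage policy", "- only aggregate statistics can be released"),
        ("Citation", "How to cite the dataset."),
    ]
)
print(description)
```

The resulting string can be passed as the description argument when creating a dataset.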
Let’s see an example
description_template = '''### About the dataset
This extensive dataset provides a rich collection of demographic and life events records for individuals across multiple countries. It covers a wide range of indicators and attributes related to personal information, birth and death events, gender, occupation, and associated countries. The dataset offers valuable insights into population dynamics and various aspects of human life, enabling comprehensive analyses and cross-country comparisons. The dataset is the largest one on notable deceased people and includes individuals from a variety of social groups, including but not limited to 107k females, 90k researchers, and 124 non-binary individuals, spread across more than 300 contemporary or historical regions.
### Dataset usage policy
This dataset is subject to compliance with internal data use and mis-use policies at our organisation. The following rules apply:
- only aggregate statistics can be released from data computation
- data subjects should never be identifiable through the data computation outcomes
- a fixed privacy budget of eps=5 must be preserved by each researcher
### Data collection and pre-processing
The dataset is based on open data hosted by Wikimedia Foundation.
**Age**
Whenever possible, age was calculated based on the birth and death year mentioned in the description of the individual.
**Gender**
Gender was available in the original dataset for 50% of participants. For the remaining, it was added from predictions based on name, country and century in which they lived. (97.51% accuracy and 98.89% F1-score)
**Occupation**
The occupation was available in the original dataset for 66% of the individuals. For the remaining, it was added from predictions from a multiclass text classificator model. (93.4% accuracy for 84% of the dataset)
More details about the features can be found by reading the paper.
### Key features
1. **Id**: Unique identifier for each individual.
2. **Name**: Name of the person.
3. **Short description**: Brief description or summary of the individual.
4. **Gender**: Gender/s of the individual.
5. **Country**: Countries/Kingdoms of residence and/or origin.
6. **Occupation**: Occupation or profession of the individual.
7. **Birth year**: Year of birth for the individual.
8. **Death year**: Year of death for the individual.
9. **Manner of death**: Details about the circumstances or manner of death.
10. **Age of death**: Age at the time of death for the individual.
11. **Associated Countries**: Modern Day Countries associated with the individual.
12. **Associated Country Coordinates (Lat/Lon)**: Modern Day Latitude and longitude coordinates of the associated countries.
13. **Associated Country Life Expectancy**: Life expectancy of the associated countries.
### Use cases
- Analyze demographic trends and birth rates in different countries.
- Investigate factors affecting life expectancy and mortality rates.
- Study the relationship between gender and occupation across regions.
- Explore correlations between age of death and associated country attributes.
- Examine patterns of migration and associated countries' life expectancy.
### Getting started
```
!curl -O https://openminedblob.blob.core.windows.net/csvs/ages_dataset.csv
age_df = pd.read_csv("ages_dataset.csv")
```
### Execution environment
The data is hosted in a remote compute environment with the following specifications:
- X CPU cores
- 1 GPU of type Y
- Z RAM
- A additional available storage
### Citation
Annamoradnejad, Issa; Annamoradnejad, Rahimberdi (2022), “Age dataset: A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people”, In Workshop Proceedings of the 16th International AAAI Conference on Web and Social Media (ICWSM), doi: 10.36190/2022.82
'''
# Create a dataset with description
dataset = sy.Dataset(
    name="Dataset name with description",
    description=description_template,
    asset_list=[
        sy.Asset(
            name="Age Data 2023",
            data=age_df,
            mock=age_mock_df,
        )
    ],
    contributors=[main_contributor],
)
do_client.upload_dataset(dataset)
Uploading: 0%| | 0/1 [00:00<?, ?it/s]
Uploading: Age Data 2023: 0%| | 0/1 [00:00<?, ?it/s]
Uploading: Age Data 2023: 100%|██████████| 1/1 [00:00<00:00, 1.39it/s]
Uploading: Age Data 2023: 100%|██████████| 1/1 [00:00<00:00, 1.39it/s]
Dataset uploaded to 'test_server'. To see the datasets uploaded by a client on this server, use command `[your_client].datasets`
do_client.datasets[0]
Dataset name with description
Summary
Description
About the dataset
This extensive dataset provides a rich collection of demographic and life events records for individuals across multiple countries. It covers a wide range of indicators and attributes related to personal information, birth and death events, gender, occupation, and associated countries. The dataset offers valuable insights into population dynamics and various aspects of human life, enabling comprehensive analyses and cross-country comparisons. The dataset is the largest one on notable deceased people and includes individuals from a variety of social groups, including but not limited to 107k females, 90k researchers, and 124 non-binary individuals, spread across more than 300 contemporary or historical regions.
Dataset usage policy
This dataset is subject to compliance with internal data use and mis-use policies at our organisation. The following rules apply: - only aggregate statistics can be released from data computation - data subjects should never be identifiable through the data computation outcomes - a fixed privacy budget of eps=5 must be preserved by each researcher
Data collection and pre-processing
The dataset is based on open data hosted by Wikimedia Foundation.
Age Whenever possible, age was calculated based on the birth and death year mentioned in the description of the individual.
Gender Gender was available in the original dataset for 50% of participants. For the remaining, it was added from predictions based on name, country and century in which they lived. (97.51% accuracy and 98.89% F1-score)
Occupation The occupation was available in the original dataset for 66% of the individuals. For the remaining, it was added from predictions from a multiclass text classificator model. (93.4% accuracy for 84% of the dataset)
More details about the features can be found by reading the paper.
Key features
- Id: Unique identifier for each individual.
- Name: Name of the person.
- Short description: Brief description or summary of the individual.
- Gender: Gender/s of the individual.
- Country: Countries/Kingdoms of residence and/or origin.
- Occupation: Occupation or profession of the individual.
- Birth year: Year of birth for the individual.
- Death year: Year of death for the individual.
- Manner of death: Details about the circumstances or manner of death.
- Age of death: Age at the time of death for the individual.
- Associated Countries: Modern Day Countries associated with the individual.
- Associated Country Coordinates (Lat/Lon): Modern Day Latitude and longitude coordinates of the associated countries.
- Associated Country Life Expectancy: Life expectancy of the associated countries.
Use cases
- Analyze demographic trends and birth rates in different countries.
- Investigate factors affecting life expectancy and mortality rates.
- Study the relationship between gender and occupation across regions.
- Explore correlations between age of death and associated country attributes.
- Examine patterns of migration and associated countries' life expectancy.
Getting started
!curl -O https://openminedblob.blob.core.windows.net/csvs/ages_dataset.csv
age_df = pd.read_csv("ages_dataset.csv")
Execution environment
The data is hosted in a remote compute environment with the following specifications: - X CPU cores - 1 GPU of type Y - Z RAM - A additional available storage
Citation
Annamoradnejad, Issa; Annamoradnejad, Rahimberdi (2022), “Age dataset: A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people”, In Workshop Proceedings of the 16th International AAAI Conference on Web and Social Media (ICWSM), doi: 10.36190/2022.82
Dataset Details
Uploaded by: Jane Doe ([email protected])
Created on: 2024-08-02 13:02:02
URL: None
Contributors: To see full details call dataset.contributors.
Assets
Asset Dicttuple