Datasite Server#

Estimated reading time: 10’

What you’ll learn#

This guide’s objective is to help you learn what a Datasite server is, how it works and how you can use it.

What is a Datasite?#

A Datasite is a collection of servers launched by a data owner to enable responsible access to their assets. In most cases, it consists of one or two servers.

A Datasite Server#

At its core, a Datasite server is a general-purpose containerized web server that facilitates responsible, privacy-preserving access to non-public assets, enabling data scientists to study them without directly accessing or acquiring a copy of them.

Its purpose is to host datasets or models and enable responsible access to them.

Remote Code Execution

A Datasite enables responsible access via:

  • Mock-data based prototyping: alongside the real assets, the server hosts a fake/mock version that is identical in structure, but whose actual values are randomized. This allows researchers to prototype and test their analysis before moving on to remote code execution.

  • Manual code review: the simplest form of data protection. Data scientists submit their code, prototyped against the mock data, to the data owner for review and execution, during which the data owner decides what is acceptable (a sketch of this flow follows the list).

  • Privacy enhancing technologies: while the first two features can support researchers in accessing virtually any type of dataset without seeing it, data usage rules can be automated further. PETs can help scale the manual review process: various PETs, such as access control (e.g. pre-approved queries, rate limits) or differential privacy, can be added to protect data via the server, enabling an automatic flow. This is currently an avenue of active development.
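
As a minimal sketch of this flow (the credentials, dataset and function below are illustrative assumptions, not values your server will necessarily have), a data scientist could prototype against the mock data and submit their code for review roughly as follows:

import syft as sy

# connect as a previously registered data scientist -- credentials are illustrative
ds_client = sy.login(url="localhost", port=8093, email="jane@university.edu", password="sy-password")

# fetch the mock twin of the first asset of the first dataset and prototype locally
asset = ds_client.datasets[0].assets[0]
mock = asset.mock

# wrap the prototyped analysis as a Syft function bound to the real asset
@sy.syft_function_single_use(data=asset)
def average(data):
    return data.mean()

# submit the code so the data owner can review it and run it against the private data
ds_client.code.request_code_execution(average)

The data owner then reviews the submitted code on their side and either approves it, running it against the real data, or rejects it.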

Depending on the data they host, Datasite servers come in two types:

  • high-side: a server that hosts private, sensitive data. It requires careful attention to how the data is managed and how access is granted, as it poses an increased security risk.

  • low-side: a server that hosts only mock data. In practice, this server can be publicly available as it does not raise the same security concerns, e.g. it can allow self-registration.

Origin of low-side & high-side

To allow for maximum protection when working with sensitive data, PySyft allows for air-gapped deployment of the servers hosting private data (the high sides). This physical isolation of the high-side server from insecure networks, such as the public Internet or an unsecured local area network, is, in fact, the maximum protection a server can have.

The terms low-side and high-side come from the security community: when disconnected systems host different levels of classified information, they are referred to as the low side and the high side, low being unclassified and high being classified (or classified at a higher level). You can read more here. PySyft implements various mechanisms inspired by access policies typically used in such systems.

Other types of servers#

There are multiple types of servers serving various purposes in the Syft ecosystem, such as the Enclave Server or the Gateway Server, which are covered in different sections.

How a Datasite Server works#

A Datasite Server comes in three variants:

  • local development server

  • single container

  • full-stack deployment

Local Development Server#

A local development server is a very lightweight and quick deployment mode designed for tinkering and development purposes. It simulates the containerized servers locally and is easy to launch and tear down, allowing users to set up a local development environment before doing an actual deployment.

It is important to point out that, despite being lightweight, it has its own local SQLite storage for the server’s metadata and data, stored under the server’s unique name. If you launch a server with an existing storage, that storage is loaded as well. Learn more about it here.
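
As a small sketch (the name and port are illustrative), relaunching a development server under the same name picks up its existing SQLite storage, while reset=True discards it:

import syft as sy

# relaunching with the same name reloads the server's existing local SQLite storage, if any
server = sy.orchestra.launch(name="my_special_server", port=8093)

# pass reset=True instead to discard existing storage and start from a clean state
# server = sy.orchestra.launch(name="my_special_server", port=8093, reset=True)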

Single Container Deployment#

As the name suggests, a single container deployment is lightweight and best suited for beginners. It is primarily intended for deployment on Docker or Podman. Learn more about it here.

Full-stack deployment#

A full-stack deployment is designed for production deployments on Kubernetes (k8s); it allows deployment on a cluster hosted in the cloud or even locally (via k3d).

It ships with a suite of containers and capabilities:

  • database: uses MongoDB for storing the server data and SeaweedFS for storing large blob data

  • custom workloads: uses Kaniko to dynamically build Docker images and launch new containers for scaling the computation beyond a single container

  • networking: uses rathole to support reverse tunneling and allows configuring either traefik or ingress as a reverse proxy

  • orchestration & observability: supports easy deployment and upgradeability via our Helm charts and observability via opentelemetry

  • frontend: hosts a web client for the server. This is under development.

Learn more about our full stack deployment using k8s here.

Deployment#

As stated before, a Datasite server can be deployed on your:

  • local machine: the local server is the simplest way to get started. For local development, you can experiment with a local development server, as shown below, or deploy a single-container deployment. However, to allow external data scientists to access your deployed Syft server, a local deployment using k3d is necessary due to its extended networking capabilities.

  • VM in the cloud: a single-container deployment can work great for the first steps of your project. If you, however, find the capabilities of a single container limiting for your analysis or need more robustness, a k3d deployment of the full stack can help.

  • Cloud computing cluster: for production deployments that require robustness or greater scale, a full-stack deployment on k8s is the best way to go.

More information about this is presented in the deployment guide.

Here is a simple example of launching a development server using the orchestra API:

import syft as sy

# launch a local development server on port 8093, resetting any existing local storage
server = sy.orchestra.launch(
    name="my_special_server",
    reset=True,
    port=8093,
)
# check how the server is deployed (here: a local Python process)
server.deployment_type

The returned object, a sy.ServerHandle, allows fetching specific information about your deployed local server, such as the url or the deployment_type. Here, the deployment type is python, as the server runs as a Python process.

Several further APIs that are also offered on the client are available on the handle for easy testing, such as register, login_as_guest, etc.

server.url
server.port
server.deployment_type
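
For instance, a hedged sketch of these testing helpers on the handle (the admin credentials shown are the development defaults and may differ in your setup):

# obtain clients directly from the handle for quick local testing
guest_client = server.login_as_guest()
admin_client = server.login(email="info@openmined.org", password="changethis")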

Further arguments can be passed to configure the way you would like to use your development server (a combined example follows this list):

  • dev_mode: True/False, whether you would like verbose logging when using the server for debugging

  • server_type: enclave, domain or gateway

  • server_side_type: high or low, stating whether your server is meant to host private data (high) or only mock data (low). A high-side server is more defensive about how the data is used.

  • local_db: True/False, whether you would like to initialize a local database or not

  • create_producer: True/False, whether your server can instantiate other in-memory servers to simulate working with multiple containers

  • n_consumers: the number of simulated in-memory workers able to consume scheduled workloads. This is required for launching and running jobs and requires create_producer to be True.

  • thread_workers: True/False, whether the in-memory workers should be simulated using threads

  • association_request_auto_approval: True/False, whether requests to associate with another server (e.g. routing from one to another) are auto-approved
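
A combined, hedged example of using these arguments (the specific values are illustrative):

import syft as sy

# a low-side development server with verbose logging and two in-memory workers for running jobs
server = sy.orchestra.launch(
    name="my_dev_server",
    port=8094,
    reset=True,
    dev_mode=True,              # verbose logging for debugging
    server_side_type="low",     # hosts only mock data
    create_producer=True,       # required for scheduling jobs
    n_consumers=2,              # two in-memory workers consuming the job queue
)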

Connect to a Datasite Server#

Connecting to a server requires that the client can reach the server. This is possible via a public IP or if the client and the server are on the same network (or even machine).

Assuming the server is reachable either via a public or an internal/local IP, a user can:

  • use their credentials to connect with appropriate permissions into the server

  • register to the server, if the server permits it

For simplicity, we showcase the first option here; further details are given in the Users API (LINK).

data_owner_client = sy.login(url="localhost", port=server.port, email="info@openmined.org", password="changethis")
data_owner_client
data_owner_client.settings
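
If the server permits self-registration, a hedged sketch of the second option could look as follows (the top-level guest login helper and the field values are assumptions based on development defaults, not guaranteed parts of your deployment):

# connect anonymously and register a new account, if the server allows signup
guest_client = sy.login_as_guest(url="localhost", port=server.port)
guest_client.register(
    name="Jane Doe",                  # illustrative values
    email="jane@university.edu",
    password="sy-password",
    password_verify="sy-password",
)

# afterwards, log in with the newly created credentials
ds_client = sy.login(url="localhost", port=server.port, email="jane@university.edu", password="sy-password")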

Server Settings#

You can configure most of the information about your Datasite server, adjust its settings and even customize your welcome message upon connection, so that the data scientists who use your server have the best visibility into the server’s ownership and purpose.
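
As a rough sketch (the exact method names and fields below are assumptions and may differ between Syft versions), an admin client could adjust the server’s settings like so:

# illustrative only: update basic metadata and allow self-registration
data_owner_client.settings.update(
    name="Cancer Research Datasite",
    description="Hosts de-identified oncology datasets for approved studies",
)
data_owner_client.settings.allow_guest_signup(enable=True)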

You can learn more about it here in the Settings API component.

Shutting down a local server#

To shut down a local development server and remove its associated local storage, you can proceed with:

server.land()

Secure Server configuration#

Datasite servers can be deployed in various configurations that optimise for better security or for more automation, or to some extent, both.

Here are the main configurations made available for your Datasite. We advise you to choose the one that fits your scenario best, taking into account the sensitivity of the data and security implications:

  • Internet-connected deployment: this setup requires one server only, a high-side server that hosts both mock and private data and to which only vetted researchers have access rights. As the name suggests, in this setup the high-side server is directly accessible by parties other than the data owner.

    • Advantages:

      • Familiarity: the data scientists work close to the private data, as the mock datasets act as a pointer to the real data

      • Automation: as the datasets are co-located, there is a lot of room to automate the approval flow, reducing the review loop time and the cost of keeping humans in the loop.

      • Resources: as only one server needs to be hosted, this reduces the resources required to launch and operate a Datasite

    • Risks:

      • Security exposure: as the server is either public or directly accessible by the data scientist, there are possible attack vectors coming from unknown malicious actors that are aware of the server’s existence, from the data scientist, or from malicious actors that can impersonate the data scientist.

      • Legal implications: attempting to enable automatic means of data release must be compliant with current regulations. Whilst various organisations have successfully released data using PETs, such as differential privacy, the use of the technology can have various implications from a legal and policy perspective, depending on the jurisdiction and its interpretations.

  • Air-gapped deployment: this setup requires launching two servers: a low-side server hosting non-sensitive mock data and a high-side server hosting the private information, with no connection possible between the two. In this scenario, data owners control what information is transferred between the sides via syncing (a sketch follows this list), which is inspired by the Bell-LaPadula confidentiality model, where data can be moved low-to-high with minimal security measures, while high-to-low requires stronger protection measures.

    • Advantages:

      • Security: this offers maximum protection for the data, as the data owners can control at a granular level the information that flows in and out of the high-side server, and no one except them or other employees of the data-owning organisation can access it

      • Maximum degree of control: data owners manually review the code and assets released between the two servers and to data scientists, making it easy to enforce an organisation’s rules on data usage and release

    • Risks:

      • None: given that a data owner can enforce existing rules using this system, this deployment doesn’t raise concerns for privacy and legal teams, whilst also offering the highest security level.
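
In the air-gapped setup, moving information between the two sides is done by the data owner via PySyft’s syncing API. A hedged sketch follows (the client variables are assumptions, and the exact object returned may differ between Syft versions):

# low_client / high_client are admin clients logged into each side (illustrative)
# compare the two servers and review what would be synced before approving anything
diff = sy.sync(from_client=low_client, to_client=high_client)
diff  # renders the objects that differ between the sides, for manual review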

Make your server discoverable!#

Datasite servers, by themselves, have little ability to be discovered, so the data owner might need to advertise their public-facing server or share credentials with the researchers of interest.

If you are looking to expand the reach of your data and allow more researchers to discover it, create new research projects and collaborate with other data owners, we encourage you to join a DataNet, a network of Datasites that acts as a registry and allows researchers to search among available mock datasets, discover new Datasites and propose new projects.

You can learn more about it here in the Network API component.