Glossary#

This is a place to find the definitions for terms used throughout the lesson notebooks.

If any definition is confusing or needs additional details, let us know through the #dagobah Slack channel or during office hours with your mentors!

Table of Contents#

PETs and Privacy#

  1. The Cost of Privacy

  2. Data Privacy and Data Economics

  3. Privacy-Enhancing Technologies (PETs)

  4. Structured Transparency Framework

PySyft#

  1. Lifecycle of Data

  2. Introduction to Syft

  3. Deployment-Specific Architecture

  4. Dev Setup and Contributing

  5. Grid, Syft and PySyft

The Cost of Privacy#

Privacy-Utility Tradeoff#

The tradeoff inherent in much of modern technology between protecting one’s privacy and giving up that privacy to increase convenience and utility. Examples include letting YouTube track your viewing history to suggest more relevant videos to watch, and giving businesses your email or phone number to receive alerts and push notifications.

Pareto Optimality#

A situation in which no individual or preference criterion can be made better off without making at least one individual or preference criterion worse off. For example, you may opt to preserve your privacy by not creating an account for a certain social media site, but this may come at the expense of missing out on news, videos, etc. shared by friends and other social groups.

Data Privacy and Data Economics#

Anonymization / De-Identification#

The removal or replacement (with fake data) of unique identifiers in a dataset. Techniques like reconstruction attacks and data linkage show that removing unique identifiers is often insufficient to guarantee data privacy.

Data Reconstruction Attack#

A technique for reidentifying individuals in previously anonymized data by combining it with other available data about those individuals, typically via data linkage. A well-known example is taking anonymized data from the Netflix Prize and using data available on IMDb to reidentify subjects in the Netflix data.

Data Linkage#

Combining disparate sources of data to link records about the same individual across datasets. Data linkage is used in reconstruction attacks, such as when movie rating data in both the Netflix and IMDb datasets was used to reidentify individuals in the Netflix Prize data.
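
To make this concrete, here is a minimal linkage sketch, assuming pandas is installed; the records and column names are invented for illustration:

```python
import pandas as pd

# "Anonymized" ratings: user names replaced with opaque ids.
netflix = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "movie":   ["Alien", "Alien"],
    "rating":  [5, 3],
    "date":    ["2004-07-01", "2004-07-02"],
})

# A public dataset with real names and overlapping quasi-identifiers.
imdb = pd.DataFrame({
    "name":   ["Alice", "Bob"],
    "movie":  ["Alien", "Alien"],
    "rating": [5, 3],
    "date":   ["2004-07-01", "2004-07-02"],
})

# Joining on (movie, rating, date) reidentifies the "anonymous" users.
linked = netflix.merge(imdb, on=["movie", "rating", "date"])
print(linked[["user_id", "name"]])
```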

Copy Problem#

The effect whereby repeatedly copying data makes that data an infinitely depreciating asset. In other words, the more copies of the data that are available, the less valuable it becomes.

Data Silo#

Data collected and held by an organization which that organization is unwilling to share because holding the data provides some strategic advantage, or because releasing it may incur legal costs, reputational damage, etc.

Resource Barriers#

Policy decisions or bureaucratic hurdles that make the sharing or releasing of data (often intentionally) difficult.

Privacy-Enhancing Technologies (PETs)#

The PETs listed here are organized according to which guarantee of the structured transparency framework (see the next section) they address.

1. Input Privacy#

Homomorphic Encryption (HE)#

An encryption process that uses public/private key cryptography in such a way that certain mathematical operations can be performed directly on the encrypted data; decrypting the result yields the same answer as performing those operations on the original data.
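
A minimal sketch of additively homomorphic encryption using the Paillier scheme, assuming the third-party `phe` (python-paillier) package is installed:

```python
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Encrypt two values with the public key.
a = public_key.encrypt(3.5)
b = public_key.encrypt(2.0)

# Add the ciphertexts without ever decrypting them.
encrypted_sum = a + b

# Only the private key holder can read the result.
print(private_key.decrypt(encrypted_sum))  # 5.5
```

Paillier supports adding ciphertexts and multiplying them by plaintext constants; fully homomorphic schemes support richer operations at much greater computational cost.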

Secure Multi-Party Computation (SMPC)#

A cryptographic technique similar to HE that allows multiple parties to collaboratively compute a function over their private inputs, without revealing anything besides the final result.
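
The core trick behind many SMPC protocols is additive secret sharing. Here is a toy sketch in plain Python; the modulus and party count are arbitrary illustrative choices:

```python
import random

Q = 2**61 - 1  # a large prime modulus; the choice is illustrative

def share(secret, n_parties=3):
    """Split a secret into additive shares that sum to it modulo Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

alice, bob = 42, 58              # each party's private input
alice_shares = share(alice)
bob_shares = share(bob)

# Each party adds the shares it holds; no single share reveals an input.
sum_shares = [(a + b) % Q for a, b in zip(alice_shares, bob_shares)]
print(reconstruct(sum_shares))   # 100, computed without pooling raw inputs
```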

2. Input Verification#

Cryptographic Signature#

A cryptographic scheme for verifying the authenticity of a message or document. A common example is the certificate verification performed by the HTTPS protocol.
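
A minimal signing-and-verification sketch, assuming the third-party `cryptography` package is installed; the message is invented:

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

message = b"dataset checksum: 49b7..."
signature = private_key.sign(message)

# Anyone holding the public key can verify authenticity.
try:
    public_key.verify(signature, message)
    print("signature valid")
except InvalidSignature:
    print("signature invalid")
```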

Zero-Knowledge Proof (ZKP)#

A method by which one party (the prover) can prove to another party (the verifier) that they know a value x, without conveying any information apart from the fact that they know the value x.
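
A toy sketch of the classic Schnorr identification protocol, an interactive ZKP of knowledge of a discrete logarithm. The tiny parameters are for illustration only; real deployments use much larger, carefully chosen groups:

```python
import random

# g = 5 generates the multiplicative group mod the prime p = 23,
# which has order q = 22.
p, g = 23, 5
q = p - 1

x = 6                  # the prover's secret
y = pow(g, x, p)       # public value; prover claims to know x with g^x = y

# Prover: commit to a random nonce.
r = random.randrange(q)
t = pow(g, r, p)

# Verifier: issue a random challenge.
c = random.randrange(q)

# Prover: respond without revealing x directly.
s = (r + c * x) % q

# Verifier: accept iff g^s == t * y^c (mod p).
assert pow(g, s, p) == (t * pow(y, c, p)) % p
```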

3. Output Privacy#

Differential Privacy (DP)#

A technique that adds calibrated noise to the output of a calculation in order to protect privacy. More specifically, it aims to prevent anyone from determining whether a particular individual’s data is present in a dataset.

Privacy Budget (𝜖)#

Used in differential privacy, 𝜖 is a measure of how much your data stands out, and therefore also a quantifiable measure of the following (see the sketch after this list):

  • privacy risk

  • how likely it is that your data is going to be identified

  • how much your data affects the outcome of the query or algorithm

  • how much noise is needed to hide your data’s influence
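
To make the relationship between 𝜖 and noise concrete, here is a minimal sketch of the Laplace mechanism in plain Python; the function name is illustrative:

```python
import random

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Release a count with Laplace noise calibrated to the budget epsilon."""
    scale = sensitivity / epsilon
    # A Laplace(0, scale) sample is the difference of two exponential samples.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

# A smaller budget (stricter privacy) forces more noise into the answer.
print(dp_count(1000, epsilon=1.0))   # close to 1000
print(dp_count(1000, epsilon=0.01))  # far noisier
```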

Private Information Retrieval (PIR)#

A cryptographic protocol that allows a user to retrieve specific information from a database without revealing which information they are accessing. In other words, PIR enables private and anonymous querying of a database.
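
A toy sketch of a classic two-server PIR scheme, which assumes two non-colluding servers holding identical copies of the database; neither server's query set alone reveals the requested index:

```python
import random

db = [7, 13, 42, 99, 5, 21, 8, 64]  # both servers hold this database
n = len(db)

def server_answer(query_set):
    """Each server XORs together the records at the requested indices."""
    out = 0
    for j in query_set:
        out ^= db[j]
    return out

i = 2  # the index the client privately wants

s = {j for j in range(n) if random.random() < 0.5}  # random subset
s_prime = s ^ {i}  # symmetric difference: flips membership of i

# XORing the two answers cancels every index except i.
record = server_answer(s) ^ server_answer(s_prime)
assert record == db[i]
```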

4. Output Verification#

Argument of Knowledge (SNARK or zk-SNARK)#

A SNARK is a specific type of zero-knowledge proof that focuses on proving that you know a specific secret without revealing any information about the secret itself (i.e., unlike a traditional ZKP, you aren’t trying to prove the validity of a statement, just that you know the secret). “zk-SNARK” is short for “zero-knowledge Succinct Non-interactive ARgument of Knowledge”: the proof is small (“succinct”) and requires no back-and-forth between prover and verifier (“non-interactive”).

Non-Interactive Proof of Work (NI-PoW)#

A cryptographic construct used to prove that a certain amount of computational effort has been expended without requiring any interaction or challenge from a verifier. In other words, it allows a party (the prover) to demonstrate that they have performed a significant amount of computational work without needing someone else (the verifier) to ask for proof or validate it in real-time.
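
A minimal Hashcash-style sketch in plain Python: producing the nonce is expensive, while verifying it is a single hash and requires no interaction with the prover:

```python
import hashlib

DIFFICULTY = 16  # leading zero bits required; tunes the cost of the work

def find_pow(message: bytes) -> int:
    """Grind nonces until the hash clears the difficulty target (expensive)."""
    target = 1 << (256 - DIFFICULTY)
    nonce = 0
    while True:
        digest = hashlib.sha256(message + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify_pow(message: bytes, nonce: int) -> bool:
    """Verification is a single hash; no challenge from a verifier is needed."""
    digest = hashlib.sha256(message + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - DIFFICULTY))

nonce = find_pow(b"hello")          # costly to produce...
print(verify_pow(b"hello", nonce))  # ...cheap for anyone to check: True
```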

Structured Transparency Framework#

The structured transparency framework is a system for breaking down the flow of information to figure out where different privacy-enhancing technologies are applicable. It consists of five guarantees:

1. Input Privacy#

A guarantee that one or more people can participate in a computation in such a way that no party learns anything about the other parties’ inputs to the computation.

2. Output Privacy#

Ensuring that certain subsets of information do not make it through the information flow, thus preventing data inputs from being reverse engineered.

3. Input Verification#

The ability for a user to verify that information received from a process or computation originates from trusted entities.

4. Output Verification#

The ability for a user to verify that the outputs from a hidden information flow contain the desired properties.

5. Flow Governance#

The control of information flows. This includes who is able to change the input and output privacy and verification guarantees.

Lifecycle of Data#

Lifecycle of Data#

The movement and usage of data as it flows from the data subject to the data owner and on to the data scientist.

Data Subject#

The person described by data in a particular dataset.

Data Owner#

The collectors and controllers of data about data subjects. Examples include organizations such as Google or the United States Census Bureau.

Data Scientist#

Users of data who want to apply algorithms and processes to that data in order to extract insights and build products.

Trusted Curator#

Someone whom data subjects know has access to their private data, but who is trusted to do data science responsibly.

Introduction to Syft#

Mailbox for Code#

A framework for thinking about PySyft in the context of the lifecycle of data. This framework consists of the following steps (a toy sketch of the flow appears after the list):

1. A data scientist uses PySyft to submit their code to a data owner.

2. The data owner reviews the code for various requirements or best practices (privacy, legal requirements, etc.).

3. If the data owner approves the code, it is executed on the dataset(s) and the results are returned to the data scientist through PySyft.
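
The following is a toy, self-contained model of this flow. It is not the real PySyft API; the class and method names are invented purely to illustrate the review-and-approve lifecycle:

```python
class ToyDomain:
    def __init__(self):
        self.datasets = {}
        self.requests = []

    def upload(self, name, data):
        """Data owner uploads a private dataset."""
        self.datasets[name] = data

    def submit_code(self, func, dataset_name):
        """Data scientist drops code into the 'mailbox' and gets a request id."""
        self.requests.append({"func": func, "dataset": dataset_name, "result": None})
        return len(self.requests) - 1

    def approve(self, request_id):
        """Data owner reviews the code, approves it, and runs it on the data."""
        req = self.requests[request_id]
        req["result"] = req["func"](self.datasets[req["dataset"]])

domain = ToyDomain()
domain.upload("ages", [23, 35, 31, 52])                      # owner uploads data
rid = domain.submit_code(lambda d: sum(d) / len(d), "ages")  # step 1
domain.approve(rid)                                          # steps 2 and 3
print(domain.requests[rid]["result"])                        # 35.25
```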

Domain / Domain Node / Domain Server#

A component in the PySyft application where a data owner can upload data and receive code requests for those uploaded datasets.

Deployment-Specific Architecture#

In-Memory Client#

A lightweight deployment mode in which the domain node runs locally using a SQLite database as the main storage.

Single Container Stack#

A deployment mode in which PySyft and a lightweight database (e.g., SQLite) are included within a single container environment.

Full Container Stack#

A deployment mode that involves multiple containers for different components and services. This mode offers the most functionality, scalability, and isolation at the cost of additional resource requirements compared to the in-memory and single-container deployment modes.

The full container stack consists of the following Docker containers:

  • Backend - contains Syft and the logic to execute code synchronously

  • Backend Stream - contains Syft and the logic to queue messages in RabbitMQ

  • Celery Worker - contains Syft and the logic to execute messages received from RabbitMQ

  • RabbitMQ - receives messages from the Backend Stream and passes them to the Celery Worker

  • Redis - key/value store in which each Syft object is stored under its UUID

  • Mongo - stores non-private metadata related to Grid operations, such as RBAC and BLOB metadata

  • SeaweedFS - stores the BLOBs themselves; compatible with the Amazon S3 protocol

  • Tailscale - VPN instance providing the NAT and firewall functionality

  • Headscale - key management server for the VPN

  • Jaeger - provides distributed end-to-end tracing

Dev Setup and Contributing#

Tox#

A Python-based virtual environment management and test command line tool used in the PySyft codebase: https://tox.wiki/en/latest/

Linting#

An automated process used to discover problems in source code such as bugs and syntax errors.

CI Test#

An automated “continuous integration” test that checks whether an alteration or addition to the codebase will change or break any existing functionality.

Grid, Syft and PySyft#

Grid#

A component of the PySyft framework that handles the setup of the domain node, which stores private data and lets data scientists submit code to it using the Syft API. Grid is predominantly used by Data Owners, or those handling Domain nodes.

Syft#

The Python API used by Data Scientists to interact with the PySyft framework.

PySyft#

The entire library containing both Syft and Grid that enables remote data science via structured transparency.

Client#

The part of the PySyft framework that allows a Data Scientist to write remote data science code that interacts with private data on the domain node.

Server#

Another name for domain or domain node.

Note: “client” and “server” are especially useful terms when framing PySyft in the context of an internet for the world’s private data: data scientists use clients to write code that interacts with private data stored on servers.