Jobs API (.jobs)#
Estimated reading time: 13’
What you’ll learn#
This guide explains how to use the jobs API to execute your code remotely in a non-blocking way and even parallelize your execution.
Introduction#
PySyft allows data scientists to conduct remote code execution on the data owners’ server. Each time code is executed, a job is created to handle the execution process. This job enables users to monitor its progress and retrieve the results upon completion.
What is a “job”?#
A “job” represents a code request submitted for asynchronous execution. In other words, it is processed in a non-blocking manner, allowing the user to continue their work in the notebook while the server processes the request, without having to wait for it to complete.
This is particularly useful for running requests in parallel to handle large amounts of data. However, since jobs are central to remote code execution, PySyft leverages them for other purposes as well.
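For example, the only difference between a blocking and a non-blocking call is the blocking flag. The snippet below is a hypothetical sketch: ds_client, example_function and asset are placeholder names that are introduced properly later in this guide.
# Hypothetical sketch; names are placeholders introduced later in this guide
result = ds_client.code.example_function(data=asset)               # blocking: waits for the result
job = ds_client.code.example_function(data=asset, blocking=False)  # non-blocking: returns a job immediately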
Understanding Jobs#
When a job is submitted to a server, it follows these stages:
Job Submission: The client submits a code request to the server. This does not immediately start the job, but specifies which code should be run. Whenever a user attempts to run the code, either on mock or private data, they can choose to run it in a non-blocking manner, which submits a new job for execution.
Job Queuing: Upon submission, the job is placed in a queue. This ensures that jobs are managed in an orderly manner and allows the server to handle multiple jobs efficiently. To make sure jobs are run in a timely manner, the data owner can choose to scale the number of workers available.
Job Execution: The server picks up jobs from the queue in order. The job is then executed asynchronously, allowing the server to manage its resources efficiently and simultaneously handle other incoming jobs.
Job Monitoring: During execution, the server keeps track of the job’s progress. This might involve updating the job’s status and providing intermediate results, if applicable.
Job Completion: Once the job is completed, the server updates its status to indicate its completion. The job’s results are then made available to the user’s client.
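Putting these stages together, a minimal sketch of the lifecycle could look as follows, assuming a data scientist client ds_client, an already-approved syft function example_function and an asset reference data_asset (all introduced in the example later in this guide):
# Rough sketch; exact result-retrieval calls may differ between PySyft versions
job = ds_client.code.example_function(data=data_asset, blocking=False)  # submission: the job is queued
ds_client.jobs               # monitoring: list all jobs and their current status
result = job.wait().get()    # completion: block until the job finishes, then fetch the result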
Parallelization using jobs#
You can create more complex analyses by scheduling jobs to run the same code multiple times in parallel. This is achieved through nested jobs and code requests, which can emulate a map-reduce pipeline or other patterns you prefer, as sketched below. An example of how to do this is available here, with more guides coming soon.
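As a simpler, client-side alternative to nested jobs, you can also fan out several non-blocking jobs and combine their results afterwards. The sketch below assumes a hypothetical, already-approved syft function process_chunk that accepts a chunk parameter:
# Hypothetical sketch; process_chunk and its chunk parameter are illustrative names
# Map step: submit one non-blocking job per chunk of work
jobs = [
    ds_client.code.process_chunk(data=data_asset, chunk=i, blocking=False)
    for i in range(3)
]
# Reduce step: wait for every job to finish and combine the partial results
partial_results = [job.wait().get() for job in jobs]
total = sum(partial_results)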
Example of a job’s lifecycle#
Now, let’s experiment with each of the stages a job goes through to understand how to use them appropriately.
Experimenting with jobs in a local dev server
It’s great to experiment with jobs locally to learn how to use them. However, please note that a default local development server will not be able to execute jobs unless you pass at least one job consumer (n_consumers=1) and create a job producer (create_producer=True). You can read more about this in the local deployment guide.
Let’s start by launching a demo setup to allow us to create a dummy example.
import syft as sy
import pandas as pd

# Launch a local dev server with one job consumer and a job producer so jobs can be executed
node = sy.orchestra.launch(name="demo_datasite", port="auto", dev_mode=False, reset=True, n_consumers=1, create_producer=True)

# Log in as the admin (data owner)
admin_client = sy.login(
    url='localhost',
    port=node.port,
    email="[email protected]",
    password="changethis",
)

# Create a small demo dataset with both private and mock data
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
mock_df = pd.DataFrame({'A': [1, 2, 1], 'B': [20, 10, 20]})
main_contributor = sy.Contributor(
    name='John Doe',
    role='Uploader',
    email='[email protected]'
)
asset = sy.Asset(
    name='demo_asset',
    data=df,
    mock=mock_df,
    contributors=[main_contributor]
)
dataset = sy.Dataset(
    name='Demo Dataset',
    description='Demo Dataset',
    asset_list=[asset],
    contributors=[main_contributor]
)
admin_client.upload_dataset(dataset)

# Register a data scientist account and allow it to run code against mock data
admin_client.settings.allow_guest_signup(enable=True)
admin_client.register(
    email='[email protected]',
    name='Data Scientist',
    password='123',
    password_verify='123'
)
admin_client.users[-1].allow_mock_execution()
Starting demo_datasite server on 0.0.0.0:53566
INFO: Started server process [19618]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:53566 (Press CTRL+C to quit)
Waiting for server to start Done.
You have launched a development server at http://0.0.0.0:53566. It is intended only for local use.
Logged into <demo_datasite: High side Datasite> as <[email protected]>
You are using a default password. Please change the password using `[your_client].account.set_password([new_password])`.
Uploading: demo_asset: 100%|██████████| 1/1 [00:00<00:00, 4.21it/s]
User details successfully updated.
First, a Data Scientist will connect to the datasite and fetch a reference to the available data.
import syft as sy
ds_client = sy.login(url='localhost', port=node.port, email='[email protected]', password='123')
data_asset = ds_client.datasets[0].assets[0]
mock_asset = ds_client.datasets[0].assets[0].mock
Logged into <demo_datasite: High side Datasite> as <[email protected]>
Submitting and executing a job#
Once someone specifies the code they want to run, they are ready to initiate the execution of a job. The main mechanism to specify the code is via syft functions.
Other ways to specify code for jobs
In some cases, the code you would like to run has already been specified by other users (e.g. admins looking to run data scientists’ code) or by the admins themselves (e.g. data scientists can directly run available pre-defined custom endpoints).
We define a dummy syft function below to specify the code.
@sy.syft_function_single_use(data=data_asset)
def example_function(data):
    print('Started execution..')
    result = data.sum()
    print('Finalized execution..')
    return result
ds_client.code.request_code_execution(example_function)
Syft function 'example_function' successfully created. To add a code request, please create a project using `project = syft.Project(...)`, then use command `project.create_code_request`.
Request
Id: 4d4d9ca9e41a4e2b8a6a93f67627d6d8
Request time: 2024-10-01 18:56:55
Status: RequestStatus.PENDING
Requested on: Demo_datasite of type Datasite
Requested by: Data Scientist ([email protected])
Changes: Request to change example_function (Pool Id: default-pool) to permission RequestStatus.APPROVED. No nested requests.
Note that no job has been created up to this point. Let’s initiate two jobs to run remotely on the server:
the first one runs on mock data
the second one runs on private data
Permissions: who can run jobs on mock data?
Executing a job on mock data is possible only if the appropriate permissions were granted. In particular, the admin must explicitly allow a data scientist to execute experiments against mock datasets on the resources of their server. You can check in the setup above how to do that, or ask the admin of your server for these permissions.
We use the code API to identify the code we want to run.
ds_client.code
UserCode List
Total: 0
ds_client.code.example_function(data=mock_asset, blocking=False)
Similarly, we could launch a job to run on private data. However, we won’t be able to unless our request has been approved, so let’s have the admin approve it first and then we’ll initiate the execution using the client.code API.
admin_client.requests[-1].approve()
Approving request on change example_function for datasite demo_datasite
Request 4d4d9ca9e41a4e2b8a6a93f67627d6d8 changes applied
ds_client.code.example_function(data=data_asset, blocking=False)
We can now see all the submitted jobs, and their status, by accessing:
ds_client.jobs
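From here, a typical next step is to pick a job from this list, check its status and, once it has completed, retrieve its result. A brief sketch, assuming the jobs collection can be indexed like the other collections used in this guide:
# Rough sketch; exact attribute and method names may differ between PySyft versions
job = ds_client.jobs[-1]     # the most recently submitted job
job.status                   # inspect its current state
result = job.wait().get()    # block until it finishes, then retrieve the output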