
Revolutionizing Genomics: Turbocharge Your Workflows with Azure Batch HPC

Introduction:

The rapid growth of genomic data has driven increased demand for High Performance Computing (HPC) resources to analyze and interpret this massive influx of information. One of the major challenges researchers and scientists face in genomics is processing this data efficiently, which often requires vast amounts of computational power. Azure Batch, a cloud-based job scheduling and compute management service, is quickly emerging as a powerful tool to accelerate genomics workflows and streamline the processing of large-scale genomic data. In this blog, we will explore how Azure Batch can be used in High Performance Computing to run workflows and solve genomics use cases.


Azure Batch in HPC:


Azure Batch is a powerful and flexible platform that allows users to run large-scale parallel and HPC applications efficiently in the Azure cloud. It provides a job scheduling and compute management service that enables users to run thousands of tasks on a pool of virtual machines, making it an ideal solution for the demanding computational requirements of genomics workflows.


By leveraging Azure Batch in High Performance Computing, researchers are able to focus on their core scientific objectives without having to worry about the underlying infrastructure and compute resources needed to support their work. This allows for a more streamlined approach to genomics research, enabling scientists to make faster and more informed decisions based on their findings.


Running Genomics Workflows with Azure Batch


Running genomics workflows on Azure Batch can be broken down into the following steps:


Create a pool of virtual machines:

Azure Batch allows users to create a pool of virtual machines that can be used to run tasks concurrently. This pool can be customized based on the user's specific requirements, including the number of virtual machines, the machine type (VM size), and the operating system.
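

For genomics pipelines, it is often useful to bootstrap every node in the pool with the bioinformatics tools that tasks will later call. Below is a minimal sketch using the Azure Batch Python SDK introduced in the examples later in this post; the tool names (`samtools`, `bwa`), VM size, and node count are illustrative assumptions rather than requirements.


```python
import azure.batch.models as batch_models

# Sketch: a pool whose start task installs example genomics tools on every Ubuntu node
def create_genomics_pool(batch_service_client, pool_id):
    pool = batch_models.PoolAddParameter(
        id=pool_id,
        vm_size='Standard_D2_v2',
        target_dedicated_nodes=2,
        virtual_machine_configuration=batch_models.VirtualMachineConfiguration(
            image_reference=batch_models.ImageReference(
                publisher='Canonical', offer='UbuntuServer', sku='18.04-LTS', version='latest'),
            node_agent_sku_id='batch.node.ubuntu 18.04'
        ),
        # Runs once on each node as it joins the pool, before any tasks are scheduled to it
        start_task=batch_models.StartTask(
            command_line='/bin/bash -c "apt-get update && apt-get install -y samtools bwa"',
            user_identity=batch_models.UserIdentity(
                auto_user=batch_models.AutoUserSpecification(
                    elevation_level=batch_models.ElevationLevel.admin,
                    scope=batch_models.AutoUserScope.pool)),
            wait_for_success=True
        )
    )
    batch_service_client.pool.add(pool)
```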


Upload input data and application packages:

Once the pool has been created, users can upload their input data and any required application packages to Azure Storage. This data can then be accessed by the virtual machines in the pool when running tasks.


Define tasks and job dependencies:

Users can define the tasks that need to be executed as part of their genomics workflow, along with any dependencies between these tasks. This allows for the efficient execution of tasks in the correct order, ensuring that the results of each task are available for subsequent tasks as required.
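

Azure Batch supports such dependencies natively: a job created with `uses_task_dependencies=True` can declare, for example, that a variant-calling task should only start after an alignment task succeeds. A rough sketch (the task IDs, script names, and command lines below are hypothetical):


```python
import azure.batch.models as batch_models

# Sketch: a job whose tasks are ordered with explicit dependencies
job = batch_models.JobAddParameter(
    id='genomics-job-with-deps',
    pool_info=batch_models.PoolInformation(pool_id='genomics-pool'),
    uses_task_dependencies=True  # must be set when the job is created
)

align_task = batch_models.TaskAddParameter(
    id='align-sample1',
    command_line='python align.py --input sample1.fastq'
)

# Scheduled only after 'align-sample1' completes successfully
call_variants_task = batch_models.TaskAddParameter(
    id='call-variants-sample1',
    command_line='python call_variants.py --input sample1.bam',
    depends_on=batch_models.TaskDependencies(task_ids=['align-sample1'])
)
```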


Submit tasks and monitor progress:

Once tasks and dependencies have been defined, users can submit their tasks to the Azure Batch service. Azure Batch will then automatically schedule and execute these tasks on the virtual machines in the pool, managing the entire process from start to finish. Users can monitor the progress of their tasks through the Azure portal or by using Azure Monitor to track the performance of their jobs.
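

Progress can also be checked programmatically. One lightweight approach, sketched below under the assumption of a `batch_client` and `job_id` created as in the examples later in this post, is to poll the job's aggregate task counts:


```python
import time

# Sketch: poll a job's task counts until nothing is left active or running
def wait_for_job(batch_client, job_id, poll_seconds=30):
    while True:
        counts = batch_client.job.get_task_counts(job_id)
        print(f'active={counts.active} running={counts.running} '
              f'succeeded={counts.succeeded} failed={counts.failed}')
        if counts.active == 0 and counts.running == 0:
            break
        time.sleep(poll_seconds)
```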


Retrieve results and analyze data:

Upon completion of the tasks, users can retrieve the results and analyze the data as required. Azure Batch provides a range of tools and services to help users visualize and interpret the results of their genomics workflows, enabling them to make informed decisions based on their findings.
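

For example, files that a task writes to its working directory can be listed and downloaded straight through the Batch service. A minimal sketch, assuming a `batch_client` plus the job and task IDs used in the examples later in this post (file names are hypothetical):


```python
# Sketch: retrieve an output file directly from a completed task's node
def download_task_file(batch_client, job_id, task_id, remote_path, local_path):
    # List the files the task produced in its working directory
    for node_file in batch_client.file.list_from_task(job_id, task_id, recursive=True):
        print(node_file.name)

    # Stream one file from the compute node to a local file
    stream = batch_client.file.get_from_task(job_id, task_id, remote_path)
    with open(local_path, 'wb') as out:
        for chunk in stream:
            out.write(chunk)
```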


Azure Batch Sample Workflow


Dataflow:

  • Upload the input files and applications to your Azure Storage account.

  • Create a Batch pool, a job, and tasks within the job.

  • Download the input files and applications onto the Batch compute nodes.

  • Monitor task execution.

  • Upload task output.

  • Download output files.


Benefits of Using Azure Batch for Genomics Workflows


Some of the key benefits of using Azure Batch for genomics workflows include:


Scalability:

Azure Batch allows users to scale their compute resources up or down as required, ensuring that they only pay for the resources they actually use. This is particularly beneficial for genomics workflows, which often require vast amounts of computational power for short periods of time.
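

Scaling does not have to be manual. A pool can carry an autoscale formula that Batch re-evaluates on a schedule, growing and shrinking the node count with the workload. A minimal sketch, reusing the Ubuntu image settings from the examples below (the formula, the cap of 20 nodes, and the 5-minute interval are illustrative assumptions):


```python
import datetime
import azure.batch.models as batch_models

# Illustrative formula: size the pool from the number of pending tasks, capped at 20 nodes
autoscale_formula = '''
pending = max($PendingTasks.GetSample(TimeInterval_Minute * 5));
$TargetDedicatedNodes = min(20, pending);
'''

pool = batch_models.PoolAddParameter(
    id='genomics-autoscale-pool',
    vm_size='Standard_D2_v2',
    virtual_machine_configuration=batch_models.VirtualMachineConfiguration(
        image_reference=batch_models.ImageReference(
            publisher='Canonical', offer='UbuntuServer', sku='18.04-LTS', version='latest'),
        node_agent_sku_id='batch.node.ubuntu 18.04'
    ),
    enable_auto_scale=True,  # no fixed target_dedicated_nodes when autoscaling
    auto_scale_formula=autoscale_formula,
    auto_scale_evaluation_interval=datetime.timedelta(minutes=5)
)
# The pool is then created with batch_client.pool.add(pool) as in the examples below
```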


Flexibility:

Azure Batch provides users with the flexibility to choose the type of virtual machines and operating systems that best suit their needs, enabling them to optimize their workflows for maximum efficiency.
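

If you are unsure which images and node agent SKUs your Batch account supports, they can be enumerated through the SDK. A small sketch, assuming a `batch_client` created as in the examples below and a recent version of the `azure-batch` package that exposes `list_supported_images`:


```python
# Sketch: list the marketplace images and node agent SKUs available to the Batch account
def print_supported_images(batch_client):
    for image in batch_client.account.list_supported_images():
        ref = image.image_reference
        print(f'{ref.publisher}/{ref.offer}/{ref.sku} -> {image.node_agent_sku_id}')
```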


Reliability:

Azure Batch is designed for high availability of compute resources, helping users run their genomics workflows with minimal risk of downtime or interruptions.


Cost-effective:

By leveraging Azure Batch in High Performance Computing, researchers can access powerful compute resources on demand and pay only for what they use, rather than purchasing and maintaining an on-premises HPC cluster, enabling them to run more complex and sophisticated genomics workflows at lower cost.
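

One common way to push costs down further is to run part of the pool on low-priority nodes, which Batch offers at a steep discount in exchange for possible preemption. A minimal sketch, assuming a `batch_client` and an existing pool like the one created in the examples below (the node counts are illustrative):


```python
import azure.batch.models as batch_models

# Sketch: resize an existing pool to rely mostly on discounted low-priority nodes
batch_client.pool.resize(
    pool_id='genomics-pool',
    pool_resize_parameter=batch_models.PoolResizeParameter(
        target_dedicated_nodes=1,      # always-on capacity
        target_low_priority_nodes=10   # cheaper capacity that may be preempted
    )
)
```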


Conclusion:

The integration of Azure Batch in High Performance Computing has revolutionized the way genomics workflows are executed, providing researchers with a powerful and cost-effective solution to analyze and interpret large-scale genomic data. By harnessing the power of Azure Batch, researchers can now focus on their core scientific objectives, making faster and more informed decisions based on their findings.


To help illustrate how Azure Batch can be used for genomics workflows, let's consider a hypothetical use case where we want to analyze a set of genomic data using a custom Python script.


Example 1: Sample code to set up and execute a genomics workflow using Azure Batch:


1. First, you'll need to install the Azure Batch SDK for Python, along with the legacy Azure Storage Blob SDK (v2.x) whose `BlockBlobService` API this sample uses:


```bash

pip install azure-batch azure-storage-blob==2.1.0

```


2. Create a Python script (e.g., `azure_batch_genomics.py`) and import the required libraries:


```python

import os
import sys
import datetime
import time

import azure.batch.batch_service_client as batch
import azure.batch.batch_auth as batch_auth
import azure.batch.models as batch_models
import azure.storage.blob as azureblob

```


3. Set up your Azure Batch account and storage account credentials:


```python

batch_account_name = 'your-batch-account-name'
batch_account_key = 'your-batch-account-key'
batch_account_url = 'your-batch-account-url'  # e.g. https://<account-name>.<region>.batch.azure.com

storage_account_name = 'your-storage-account-name'
storage_account_key = 'your-storage-account-key'
storage_container = 'your-storage-container-name'

```


4. Create a function to create a pool of virtual machines:


```python

def create_pool(batch_service_client, pool_id):
    pool = batch_models.PoolAddParameter(
        id=pool_id,
        virtual_machine_configuration=batch_models.VirtualMachineConfiguration(
            image_reference=batch_models.ImageReference(
                publisher='Canonical',
                offer='UbuntuServer',
                sku='18.04-LTS',
                version='latest'
            ),
            node_agent_sku_id='batch.node.ubuntu 18.04'
        ),
        vm_size='Standard_D2_v2',
        target_dedicated_nodes=5
    )
    batch_service_client.pool.add(pool)

```


5. Create a function to upload input data and application packages:


```python

def upload_data_to_blob(storage_account_name, storage_account_key, container_name, file_path):
    # BlockBlobService comes from the legacy azure-storage-blob (v2.x) SDK
    blob_service = azureblob.BlockBlobService(storage_account_name, storage_account_key)
    blob_service.create_container(container_name)

    # Upload the file and return a URL that Batch tasks can reference
    blob_name = os.path.basename(file_path)
    blob_service.create_blob_from_path(container_name, blob_name, file_path)
    return blob_service.make_blob_url(container_name, blob_name)

```
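

One caveat: Batch compute nodes can only download a `blob_source` URL if it is readable without credentials, so in practice the URL usually needs a read-only SAS token appended. A sketch of how that variant of the upload helper might look with the same legacy `BlockBlobService` API (the one-day expiry is an arbitrary assumption):


```python
import datetime
import os
import azure.storage.blob as azureblob

def upload_data_to_blob_with_sas(storage_account_name, storage_account_key, container_name, file_path):
    blob_service = azureblob.BlockBlobService(storage_account_name, storage_account_key)
    blob_service.create_container(container_name)

    blob_name = os.path.basename(file_path)
    blob_service.create_blob_from_path(container_name, blob_name, file_path)

    # Generate a read-only SAS token so Batch compute nodes can download the blob
    sas_token = blob_service.generate_blob_shared_access_signature(
        container_name, blob_name,
        permission=azureblob.BlobPermissions.READ,
        expiry=datetime.datetime.utcnow() + datetime.timedelta(days=1)
    )
    return blob_service.make_blob_url(container_name, blob_name, sas_token=sas_token)
```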


6. Define tasks and job dependencies:


```python

def create_job(batch_service_client, job_id, pool_id):
    # Create the job and associate it with the pool
    job = batch_models.JobAddParameter(
        id=job_id,
        pool_info=batch_models.PoolInformation(pool_id=pool_id)
    )
    batch_service_client.job.add(job)

    task_list = []
    # Assuming a list of input data files, e.g., ['input1.txt', 'input2.txt', 'input3.txt']
    input_files = ['input1.txt', 'input2.txt', 'input3.txt']
    for index, input_file in enumerate(input_files):
        # Upload each input file and get back its blob URL
        input_file_url = upload_data_to_blob(storage_account_name, storage_account_key, storage_container, input_file)
        task_id = f'Task-{index}'
        # The resource file is downloaded to the task's working directory under its blob name,
        # so the command line refers to the local file name. The analysis script itself
        # (my_genomics_script.py) is assumed to be available on the node, e.g. via an
        # application package or an additional resource file.
        task = batch_models.TaskAddParameter(
            id=task_id,
            command_line=f'python my_genomics_script.py --input {input_file}',
            resource_files=[batch_models.ResourceFile(file_path=input_file, blob_source=input_file_url)]
        )
        task_list.append(task)

    batch_service_client.task.add_collection(job_id, task_list)

```


7. Submit tasks and monitor progress:


```python

def main():
    # Authenticate against the Batch account and create a service client
    credentials = batch_auth.SharedKeyCredentials(batch_account_name, batch_account_key)
    batch_client = batch.BatchServiceClient(credentials, base_url=batch_account_url)

    pool_id = 'genomics-pool'
    job_id = 'genomics-job'

    create_pool(batch_client, pool_id)
    create_job(batch_client, job_id, pool_id)


if __name__ == '__main__':
    main()

```


8. Execute the script:


```bash

python azure_batch_genomics.py

```


Upon completion of the tasks, you can retrieve the results and analyze the data as required. This sample code demonstrates a simple use case for running genomics workflows using Azure Batch, but it can be further customized and optimized based on your specific requirements.


Example 2: Training a Large Language Model using Azure Batch


1. Import required libraries and set up the Azure Batch account:


```python

import time

import azure.batch.batch_service_client as batch
import azure.batch.batch_auth as batch_auth
import azure.batch.models as batch_models

batch_account_name = 'your_batch_account_name'
batch_account_key = 'your_batch_account_key'
batch_account_url = 'https://your_batch_account_name.region.batch.azure.com'

# Storage account that holds the input data, training script, and outputs
storage_account_name = 'your_storage_account_name'

credentials = batch_auth.SharedKeyCredentials(batch_account_name, batch_account_key)
batch_client = batch.BatchServiceClient(credentials, base_url=batch_account_url)

```


2. Create a pool of virtual machines:


```python

pool_id = 'LLM-pool'

# Configure the pool: a GPU-capable VM size, an Ubuntu marketplace image,
# and (optionally) any application licenses the workload needs
new_pool = batch_models.PoolAddParameter(
    id=pool_id,
    vm_size='Standard_NC6',  # Choose the appropriate VM size based on your requirements
    target_dedicated_nodes=4,  # Number of VMs in the pool
    virtual_machine_configuration=batch_models.VirtualMachineConfiguration(
        image_reference=batch_models.ImageReference(
            publisher='Canonical',
            offer='UbuntuServer',
            sku='18.04-LTS',
            version='latest'
        ),
        node_agent_sku_id='batch.node.ubuntu 18.04'
    ),
    application_licenses=['your_license']  # Only if your LLM uses licensed software
)

# Create the pool
batch_client.pool.add(new_pool)

```


3. Upload input data and application packages:


```python

# Upload the input data (e.g., training dataset) and LLM training script to Azure Storage
input_container = 'input-container'
output_container = 'output-container'

# Add a reference to the input data in the task definition
input_data = batch_models.ResourceFile(
    file_path='input-data-file-path',
    blob_source=f'https://{storage_account_name}.blob.core.windows.net/{input_container}/input-data-file-path'
)

# Add a reference to the LLM training script in the task definition
training_script = batch_models.ResourceFile(
    file_path='training-script.py',
    blob_source=f'https://{storage_account_name}.blob.core.windows.net/{input_container}/training-script.py'
)

```


4. Define tasks and job dependencies:


```python

job_id = 'LLM-training-job'

# Create a new job
job = batch_models.JobAddParameter(
    id=job_id,
    pool_info=batch_models.PoolInformation(pool_id=pool_id)
)

batch_client.job.add(job)

# Define the task for LLM training
task_id = 'LLM-training-task'
task_command = f'python training-script.py --input-file {input_data.file_path} --output-file output-data-file-path'

task = batch_models.TaskAddParameter(
    id=task_id,
    command_line=task_command,
    resource_files=[input_data, training_script],
    output_files=[
        batch_models.OutputFile(
            file_pattern='output-data-file-path',
            destination=batch_models.OutputFileDestination(
                container=batch_models.OutputFileBlobContainerDestination(
                    # In practice this container URL needs a SAS token granting write access
                    container_url=f'https://{storage_account_name}.blob.core.windows.net/{output_container}'
                )
            ),
            upload_options=batch_models.OutputFileUploadOptions(
                upload_condition=batch_models.OutputFileUploadCondition.task_success
            )
        )
    ]
)

# Add the task to the job
batch_client.task.add(job_id, task)

```


5. Submit tasks and monitor progress:


```python

# Poll the tasks in the job until they have all completed
# (adjust the 24-hour timeout based on the expected training time)
deadline = time.time() + 60 * 60 * 24
while time.time() < deadline:
    if all(t.state == batch_models.TaskState.completed for t in batch_client.task.list(job_id)):
        break
    time.sleep(30)

```


6. Retrieve results and analyze data:


```python

# Download the output data (e.g., trained model) from Azure Storage for further analysis and evaluation

```
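

As a sketch of that last step, assuming the legacy `BlockBlobService` API from Example 1 and hypothetical storage credentials and blob names, the trained model could be pulled down like this:


```python
import azure.storage.blob as azureblob

# Sketch: download the trained model from the output container (names are hypothetical)
blob_service = azureblob.BlockBlobService('your_storage_account_name', 'your_storage_account_key')
blob_service.get_blob_to_path(output_container, 'output-data-file-path', 'trained-model.bin')
```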


In this example, we have demonstrated how to use Azure Batch in High Performance Computing to train a Large Language Model. By following these steps, researchers can focus on improving their models while Azure Batch manages the underlying infrastructure and compute resources required for the training process.
