Distributed ETL for a vector database

Running a text preprocessing and embedding creation pipeline on a cluster


For my final project in the Parallel and Distributed Computing class I decided to look into the ETL process for LLMs. It is a popular topic, and many companies are trying to come up with a solution for scaling pipelines that feed vector databases. One example of a framework for text preprocessing is datatrove from Hugging Face, which allows parallelizing the pipeline on a Slurm cluster.

For my project I selected HTCondor as the cluster computing solution; it is similar to Slurm, but older and easier to set up.

Use case overview

A distributed implementation of an ETL pipeline for a vector database. Jobs run in Docker containers on an HTCondor cluster.

I take two video playlists from YouTube and load them into the vector database.

Pipeline overview

Stage 1: Load and preprocess text

Transcripts of the videos are fetched from YouTube and split into chunks. For each chunk a JSON object is created that contains the video URL, the text chunk, and the timestamp.
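
As an illustration, a minimal sketch of this stage is shown below. It assumes the youtube-transcript-api package (with its classic get_transcript interface) and a fixed number of transcript segments per chunk; the actual library, chunking strategy, and output layout may differ.

```python
# Sketch of stage 1: fetch a transcript, chunk it, write JSON objects.
# Assumes the youtube-transcript-api package; the chunk size is illustrative.
import json
from youtube_transcript_api import YouTubeTranscriptApi

CHUNK_SIZE = 5  # transcript segments per chunk (placeholder value)

def preprocess_video(video_id: str, out_path: str) -> None:
    # Each segment is a dict with 'text', 'start' and 'duration' keys.
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    chunks = []
    for i in range(0, len(segments), CHUNK_SIZE):
        window = segments[i:i + CHUNK_SIZE]
        chunks.append({
            "url": f"https://www.youtube.com/watch?v={video_id}",
            "text": " ".join(s["text"] for s in window),
            "timestamp": window[0]["start"],  # chunk start time in seconds
        })
    with open(out_path, "w") as f:
        json.dump(chunks, f)
```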

Stage 2: Create embeddings and load into vector database

The second stage takes the JSON documents and creates vector embeddings for the text chunks. Embeddings are created with the sentence-transformers library. Once created, the embeddings are loaded into the vector database together with the corresponding JSON objects.
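
A minimal sketch of this stage follows. The write-up does not name a specific embedding model or vector database, so the all-MiniLM-L6-v2 model and Qdrant are purely illustrative choices, and the endpoint and collection name are placeholders.

```python
# Sketch of stage 2: embed chunks with sentence-transformers and upsert
# them into a vector database (Qdrant used here only as an example).
import json
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

def embed_and_load(json_path: str, collection: str = "videos") -> None:
    with open(json_path) as f:
        chunks = json.load(f)

    model = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = model.encode([c["text"] for c in chunks])  # one vector per chunk

    client = QdrantClient(url="http://localhost:6333")  # hypothetical endpoint
    if not client.collection_exists(collection):
        client.create_collection(
            collection_name=collection,
            vectors_config=VectorParams(size=vectors.shape[1], distance=Distance.COSINE),
        )
    # Store each embedding together with its JSON object as the payload.
    client.upsert(
        collection_name=collection,
        points=[
            PointStruct(id=i, vector=v.tolist(), payload=chunk)
            for i, (v, chunk) in enumerate(zip(vectors, chunks))
        ],
    )
```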

Distributed execution

To implement the distributed ETL pipeline I set up an HTCondor cluster in AWS. A shared file system (NFS) stores the intermediate job results and the initial input files. Jobs are launched in Docker containers.

The cluster execution steps are:

  1. Read the input data (video IDs) from NFS and preprocess the corresponding transcripts
  2. Save the results as JSON to NFS
  3. The embeddings job reads the JSON files and creates embeddings
  4. The embeddings are loaded into the vector database
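
For illustration, a Docker-universe submit description for the preprocessing job could look roughly like the sketch below; the image name, script, NFS paths, and resource requests are placeholders rather than the project's actual values.

```
universe              = docker
docker_image          = example/etl-preprocess:latest
executable            = preprocess.py
arguments             = --input /nfs/input/videos_$(Process).txt --output /nfs/chunks/chunks_$(Process).json
should_transfer_files = NO
requirements          = HasDocker
request_cpus          = 1
request_memory        = 2GB
output                = logs/preprocess_$(Cluster)_$(Process).out
error                 = logs/preprocess_$(Cluster)_$(Process).err
log                   = logs/preprocess.log
queue 1
```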

DAG generation

The input file is a .txt file with the IDs of all the videos that should be processed and loaded into the vector database.

The pipeline.py script does the following:

  1. Splits the input data into smaller files
  2. Generates a DAG to process the smaller files
  3. Submits the DAG to the HTCondor cluster

The number of parallel pipelines can be configured in the config.ini file.
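
A simplified sketch of what such a script could look like, assuming placeholder submit-file names (preprocess.sub, embed.sub) and a hypothetical [pipeline] section in config.ini:

```python
# Sketch of pipeline.py: split the input, generate a DAGMan file, submit it.
# File names and config keys are placeholders, not the project's actual values.
import configparser
import subprocess
from pathlib import Path

def generate_and_submit(input_file: str = "videos.txt") -> None:
    config = configparser.ConfigParser()
    config.read("config.ini")
    n_pipelines = config.getint("pipeline", "n_parallel")  # hypothetical key

    # 1. Split the input video IDs into one file per parallel pipeline.
    video_ids = Path(input_file).read_text().split()
    Path("splits").mkdir(exist_ok=True)
    for i in range(n_pipelines):
        part = video_ids[i::n_pipelines]
        Path(f"splits/videos_{i}.txt").write_text("\n".join(part))

    # 2. Generate the DAG: each preprocessing job is the parent of the
    #    corresponding embeddings job.
    lines = []
    for i in range(n_pipelines):
        lines.append(f"JOB preprocess_{i} preprocess.sub")
        lines.append(f'VARS preprocess_{i} part="{i}"')
        lines.append(f"JOB embed_{i} embed.sub")
        lines.append(f'VARS embed_{i} part="{i}"')
        lines.append(f"PARENT preprocess_{i} CHILD embed_{i}")
    Path("pipeline.dag").write_text("\n".join(lines) + "\n")

    # 3. Submit the DAG to the HTCondor scheduler.
    subprocess.run(["condor_submit_dag", "pipeline.dag"], check=True)

if __name__ == "__main__":
    generate_and_submit()
```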