Ingest data to Vector Database - Milvus example
This article guides you through ingesting data from AWS S3 into the Milvus vector database. It covers handling diverse data types, from images to text and audio, by extracting feature vectors and storing them in Milvus.
Introduction
In the evolving realm of data engineering, the ability to handle and manipulate diverse data types is paramount. One such challenge is ingesting disparate data, such as images, text, and audio files, from AWS S3 into the Milvus vector database. This article will provide a comprehensive guide for data engineers to accomplish this task, complete with code examples and expert tips.
Prerequisites
Before diving into the procedures, ensure that you have:
An active AWS account with access to S3 services.
A configured Milvus instance.
Python 3.7 or later installed.
The necessary Python libraries: boto3, PyMilvus, Pillow (for images), and librosa (for audio files).
AWS S3 to Milvus: The Big Picture
The process can be broadly summarized in the following steps:
Connect to your AWS S3 bucket and download the desired files.
Preprocess the files based on their data type (images, text, audio).
Vectorize the preprocessed data.
Connect to the Milvus server and create a collection.
Insert the vectors into the Milvus collection.
The following sections will delve into each step, providing pythonic examples to guide you.
Connecting to AWS S3
AWS provides the boto3 library which allows Python developers to write software that makes use of services like Amazon S3. Here's how to use it to connect to your S3 bucket and download files:
import os
import boto3

def connect_to_s3(bucket_name):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(bucket_name)
    for obj in bucket.objects.all():
        path, filename = os.path.split(obj.key)
        bucket.download_file(obj.key, filename)

bucket_name = 'your-bucket-name'
connect_to_s3(bucket_name)
Please replace 'your-bucket-name' with the name of your S3 bucket. This script will download all files in the bucket to your local directory.
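Since the bucket mixes images, text, and audio, it helps to decide which preprocessor each downloaded file should go to. A minimal sketch of extension-based routing (the mapping below is an assumption; adjust it to match the file types actually in your bucket):

```python
import os

# Hypothetical extension-to-type mapping; extend it to cover your data.
EXTENSION_TYPES = {
    '.jpg': 'image', '.jpeg': 'image', '.png': 'image',
    '.txt': 'text', '.md': 'text',
    '.wav': 'audio', '.mp3': 'audio',
}

def classify_file(filename):
    """Return 'image', 'text', 'audio', or None for unrecognized files."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TYPES.get(ext.lower())
```

Each downloaded file can then be dispatched to preprocess_image, preprocess_text, or preprocess_audio based on the returned type.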
Preprocessing Data
Now that we've retrieved the data, the next step involves preprocessing it based on its type. Different data types require specific preprocessing steps to ensure they are ready for vectorization.
Images
For image files, we can use the Pillow library to open and convert images into numpy arrays, which are then ready for vectorization. Here's a simple function that does this:
from PIL import Image
import numpy as np
def preprocess_image(image_path):
    with Image.open(image_path) as img:
        return np.array(img)

image_path = 'your-image.jpg'
image_data = preprocess_image(image_path)
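One caveat: np.array(img) yields uint8 arrays whose shape varies from image to image, while most CNN-based feature extractors (such as VGG) expect fixed-size float input. Below is a NumPy-only sketch that center-crops to a target size and scales pixel values to [0, 1]; it assumes the image is at least the target size, and a real pipeline would usually resize with Pillow instead of cropping:

```python
import numpy as np

def center_crop_and_scale(arr, size=224):
    """Center-crop an HxWxC uint8 array to size x size and scale to [0, 1].

    Assumes the image is at least size x size in both dimensions; a
    production pipeline would resize (e.g. with Pillow) rather than crop.
    """
    h, w = arr.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    crop = arr[top:top + size, left:left + size]
    return crop.astype(np.float32) / 255.0
```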
Text
For text files, preprocessing usually involves removing punctuation, converting to lowercase, and tokenization. Here's a basic function for this:
import string
import nltk
nltk.download('punkt')
def preprocess_text(text_file):
    with open(text_file, 'r') as file:
        text = file.read().replace('\n', ' ')
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = nltk.word_tokenize(text)
    return tokens

text_file = 'your-text.txt'
text_data = preprocess_text(text_file)
Audio
Audio files can be preprocessed using the librosa library, which can load an audio file as a floating point time series. Here's a function to do this:
import librosa
def preprocess_audio(audio_path):
    audio_data, sample_rate = librosa.load(audio_path)
    return audio_data

audio_path = 'your-audio.wav'
audio_data = preprocess_audio(audio_path)
From Preprocessed Files to Milvus
With the files downloaded and preprocessed, the next step is to extract feature vectors and store them in Milvus, an open-source vector database designed to handle massive-scale feature vectors. The sections below walk through connecting to Milvus, ingesting image vectors, and running similarity searches, with notes on how the same flow applies to text and audio.
Setting up the Environment
Before we begin, ensure that you have a Milvus server running. Milvus can be run as a Docker container, making it easy to get started. It's also recommended to run a Python environment with the necessary libraries installed, including boto3 for AWS S3 interactions, milvus for interactions with the Milvus server, and various libraries for data processing such as numpy and scipy.
Connecting to Milvus
Firstly, let's connect to our Milvus server. We can create a new collection where we'll store our vectors.
from milvus import Milvus, IndexType, MetricType, Status

_HOST = 'milvus'
_PORT = '19530'
_DIM = 512  # dimension of vector

milvus = Milvus(_HOST, _PORT, pool_size=10)

collection_name = 'multimedia_data'
status, ok = milvus.has_collection(collection_name)
if not ok:
    param = {
        'collection_name': collection_name,
        'dimension': _DIM,
        'index_file_size': 32,  # optional
        'metric_type': MetricType.L2  # optional
    }
    print(milvus.create_collection(param))

_, collection = milvus.get_collection_info(collection_name)
print(collection)
The collection_name can be adjusted to suit your needs, and the dimension should match the dimension of your feature vectors.
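Milvus rejects inserts whose vectors do not match the collection's dimension, so a quick guard before calling insert can save a confusing error later. A small helper (the function name is ours):

```python
def check_dimensions(vectors, dim):
    """Raise ValueError if any vector's length differs from the collection dim."""
    bad = [i for i, vec in enumerate(vectors) if len(vec) != dim]
    if bad:
        raise ValueError(
            "vectors at positions {} do not have dimension {}".format(bad, dim)
        )
    return True
```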
Ingesting Images from S3 to Milvus
To ingest image data, we need to convert the images into feature vectors. Let's use the VGG model for this task.
from preprocessor.vggnet import VGGNet

def process_images(image_paths):
    # Initialize the VGG model
    model = VGGNet()
    vectors = []
    for img_path in image_paths:
        norm_feat = model.vgg_extract_feat(img_path)
        vectors.append(norm_feat)
    return vectors
In the above code snippet, image_paths is a list of paths to the images.
After processing the images and extracting the feature vectors, we can now insert these vectors into Milvus.
status, ids = milvus.insert(collection_name=collection_name, records=vectors)
if not status.OK():
    print("Insert failed: {}".format(status))
else:
    print(ids)
This will return a list of IDs that Milvus uses to identify the images. The IDs are in the same order as our list of images. Let's create a quick lookup table to easily access an image, given some ID.
lookup = {}
for ID, img in zip(ids, image_paths):
    lookup[ID] = img
Now, the images are inserted into the Milvus collection and we can retrieve the images based on their IDs.
Ingesting Text and Audio Data from S3 to Milvus
The process for text and audio data would be similar. The main difference would be in the feature extraction step. For text, you might use a method like TF-IDF or word embeddings from a model like BERT. For audio, you might extract MFCCs (Mel Frequency Cepstral Coefficients) or use a more complex model-based feature extraction.
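To make the text case concrete, here is a minimal hand-rolled TF-IDF vectorizer operating on the token lists produced by preprocess_text. It is illustrative only; in practice you would reach for scikit-learn's TfidfVectorizer or embeddings from a BERT-style model. Note also that TF-IDF vectors have vocabulary-sized dimensions, so the Milvus collection's dimension must be set to match:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a basic TF-IDF vector for each document.

    docs: list of token lists (e.g. output of preprocess_text).
    Returns (vocab, vectors), where each vector is aligned with the
    sorted vocabulary.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append([
            (counts[t] / total) * math.log(n_docs / df[t])
            for t in vocab
        ])
    return vocab, vectors
```

A term that appears in every document gets an IDF of zero, so it contributes nothing to similarity, which is usually what you want for filler words that survive preprocessing.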
Searching for Similar Vectors
Once all the data is ingested into Milvus, we can perform similarity searches. For instance, we can search for images similar to a given image. This is done by extracting the feature vector of the given image and searching for similar vectors in Milvus.
# execute vector similarity search
search_param = {
    "nprobe": 16
}

print("Searching ... ")

param = {
    'collection_name': collection_name,
    'query_records': [vectors[0]],
    'top_k': 10,
    'params': search_param,
}

status, results = milvus.search(**param)
if status.OK():
    print(results)
else:
    print("Search failed. ", status)
The above code snippet searches for the 10 most similar vectors to vectors[0] in the collection.
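Search results come back as IDs and distances, so the lookup table built earlier lets us translate hits back into source files. A sketch, assuming each result row can be iterated as (id, distance) pairs, as in the Milvus 1.x client:

```python
def resolve_hits(result_rows, lookup):
    """Map each hit's ID back to its source file via the lookup table.

    result_rows: one row of hits per query vector; each hit is modeled
    here as an (id, distance) pair. Unknown IDs resolve to None.
    """
    resolved = []
    for row in result_rows:
        resolved.append([(lookup.get(hit_id), dist) for hit_id, dist in row])
    return resolved
```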
Conclusion
In conclusion, vector databases like Milvus offer a powerful way to handle complex multimedia data. Their ability to ingest, store, and search through massive quantities of vectors makes them an integral part of modern data architectures. Ingesting data from AWS S3 into Milvus involves extracting feature vectors from the data and inserting those vectors into a Milvus collection. The specifics can vary depending on the data type, but the general principles remain the same.
This article provides a broad overview of the process, but there's a lot more to explore. For more in-depth information on the topics discussed here, visit the following resources:
Milvus Documentation: The official documentation for Milvus provides comprehensive information on its architecture, installation, and usage.
AWS S3 Documentation: AWS's documentation provides a complete guide to using S3, including how to upload and manage data.
Feature Extraction: Wikipedia's page on feature extraction offers an overview of the concept and various techniques used for different data types.
VGGNet Model: This page explains the VGGNet model, which we used for image feature extraction.
BERT Model: Wikipedia's page on BERT provides a good introduction to this powerful language model, which can be used for text feature extraction.
MFCCs: Wikipedia's page on Mel Frequency Cepstral Coefficients (MFCCs), a common method for audio feature extraction.