Using OpenAI’s CLIP model to support natural language search on a collection of 70k book covers
In a previous post I did a small PoC to see if I could use OpenAI’s CLIP model to build a semantic book search. It worked surprisingly well, in my opinion, but I couldn’t help wondering whether it would be better with more data. The previous version used only about 3.5k books, but there are hundreds of thousands in the Openlibrary data set, and I figured it was worth trying to add more options to the search space.
However, the full dataset is about 40GB, and trying to handle that much data on my little laptop, or even in a Colab notebook, was a bit much, so I needed to figure out a pipeline that could manage filtering and embedding a larger data set.
TLDR; Did it improve the search? I think it did! We 15x’ed the data, which gives the search much more to work with. It’s not perfect, but I thought the results were pretty interesting, although I haven’t done a proper accuracy measure.
This was one example I couldn’t get to work no matter how I phrased it in the last iteration, but it works fairly well in the version with more data.
If you’re curious you can try it out in Colab!
Overall, it was an interesting technical journey, with plenty of roadblocks and learning opportunities along the way. The tech stack still includes the OpenAI CLIP model, but this time I leverage Apache Spark and AWS EMR to run the embedding pipeline.
This seemed like a good opportunity to use Spark, as it allows us to parallelize the embedding computation.
I decided to run the pipeline in EMR Serverless, which is a fairly new AWS offering that provides a serverless environment for EMR and manages scaling resources automatically. I felt it would work well for this use case (as opposed to spinning up an EMR on EC2 cluster) because this is a fairly ad-hoc project, I’m paranoid about cluster costs, and initially I was unsure about what resources the job would require. EMR Serverless makes it pretty easy to experiment with job parameters.
Below is the full process I went through to get everything up and running. I imagine there are better ways to manage certain steps; this is just what ended up working for me, so if you have thoughts or opinions, please do share!
Building an embedding pipeline job with Spark
The first step was writing the Spark job(s). The full pipeline is broken out into two stages; the first takes in the initial data set and filters for recent fiction (within the last 10 years). This resulted in about 250k books, and around 70k with cover images available to download and embed in the second stage.
First we pull out the relevant columns from the raw data file.
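The original snippet isn’t reproduced here, but a minimal sketch of that step, assuming the Openlibrary editions dump with the record JSON in its last tab-separated column (the bucket path and exact field names are illustrative), could look like this:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("book-filter").getOrCreate()

# The Openlibrary editions dump is tab-separated, with the record JSON in the last column.
raw = spark.read.csv("s3://my-bucket/ol_dump_editions.txt", sep="\t").toDF(
    "type", "key", "revision", "last_modified", "json"
)

# Pull only the fields we care about out of the JSON blob.
books = raw.select(
    F.get_json_object("json", "$.title").alias("title"),
    F.get_json_object("json", "$.subjects").alias("subjects"),
    F.get_json_object("json", "$.languages").alias("languages"),
    F.get_json_object("json", "$.number_of_pages").alias("number_of_pages"),
    F.get_json_object("json", "$.publish_date").alias("publish_date"),
    F.get_json_object("json", "$.covers").alias("covers"),
)
```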
Then we do some general data transformation on data types, and filter out everything but English fiction with more than 100 pages.
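A sketch of the kind of filtering involved, continuing from the columns above (the exact predicates, such as how the publish year is parsed, are assumptions rather than my actual logic):

```python
filtered = (
    books
    .withColumn("number_of_pages", F.col("number_of_pages").cast("int"))
    .withColumn("publish_year", F.regexp_extract("publish_date", r"(\d{4})", 1).cast("int"))
    .filter(F.col("languages").contains("eng"))              # English only
    .filter(F.lower(F.col("subjects")).contains("fiction"))  # fiction only
    .filter(F.col("number_of_pages") > 100)                  # more than 100 pages
    .filter(F.col("publish_year") >= 2013)                   # roughly the last 10 years
)

filtered.write.mode("overwrite").parquet("s3://my-bucket/filtered_books/")
```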
The second stage grabs the first stage’s output dataset, and runs the images through the CLIP model, downloaded from Hugging Face. The important step here is turning the various functions that we need to apply to the data into Spark UDFs. The main one of interest is get_image_embedding, which takes in the image and returns the embedding.
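Something like the following, using the Hugging Face transformers API (the checkpoint name and the download-by-URL logic are assumptions, not necessarily my exact code):

```python
import io
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def get_image_embedding(image_url):
    # Download the cover image and run it through CLIP's image encoder.
    response = requests.get(image_url, timeout=10)
    image = Image.open(io.BytesIO(response.content)).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    # Return a plain Python list so Spark can serialize it into the Dataframe.
    return features[0].tolist()
```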
We register it as a UDF:
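Roughly like this, with the return type following from the embedding being a list of floats:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

get_image_embedding_udf = udf(get_image_embedding, ArrayType(FloatType()))
```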
And call that UDF on the dataset:
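Continuing the sketch, with the cover URL column name being an assumption:

```python
books_with_embeddings = filtered.withColumn(
    "image_embedding", get_image_embedding_udf(F.col("cover_url"))
)

books_with_embeddings.write.mode("overwrite").parquet("s3://my-bucket/book_embeddings/")
```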
Setting up the vector database
As a final, optional, step in the code, we can set up a vector database, in this case Milvus, to load and query from. Note, I didn’t do this as part of the cloud job for this project, as I pickled my embeddings to use without having to keep a cluster up and running indefinitely. However, it’s fairly easy to set up Milvus and load a Spark Dataframe into a collection.
First, create a collection with an index on the image embedding column that the database can use for the search.
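A sketch with pymilvus, where the collection name, fields, and index parameters are illustrative choices rather than my exact setup (the 512 dimension matches CLIP’s embedding size mentioned later):

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections
)

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="book_id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=1024),
    FieldSchema(name="image_embedding", dtype=DataType.FLOAT_VECTOR, dim=512),
]
schema = CollectionSchema(fields, description="Book cover embeddings")
collection = Collection(name="book_covers", schema=schema)

# The index is what lets Milvus search the embedding column efficiently.
collection.create_index(
    field_name="image_embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "IP", "params": {"nlist": 128}},
)
```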
Then we can access the collection within the Spark script, and load the embeddings into it from the final Dataframe.
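One simple way to do that, assuming the Dataframe is small enough to collect to the driver (there is also a Spark-Milvus connector that writes directly, which may be closer to what you want for bigger data):

```python
rows = books_with_embeddings.select("title", "image_embedding").collect()

# Insert column-wise; the auto_id primary key is generated by Milvus.
collection.insert([
    [row["title"] for row in rows],
    [row["image_embedding"] for row in rows],
])
collection.flush()
collection.load()
```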
Finally, we can simply embed the search text with the same model used in the UDF above, and hit the database with the embedding. The database does the heavy lifting of finding the best matches.
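A sketch of the query side, with a made-up search string; get_text_embedding here mirrors the image function but uses CLIP’s text encoder:

```python
def get_text_embedding(text):
    # Embed the query so it lives in the same vector space as the cover images.
    inputs = processor(text=text, return_tensors="pt", padding=True)
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    return features[0].tolist()

results = collection.search(
    data=[get_text_embedding("a murder mystery set in a small coastal town")],
    anns_field="image_embedding",
    param={"metric_type": "IP", "params": {"nprobe": 10}},
    limit=10,
    output_fields=["title"],
)

for hit in results[0]:
    print(hit.entity.get("title"), hit.distance)
```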
Setting up the pipeline in AWS
Prerequisites
Now there’s a bit of setup to go through in order to run these jobs on EMR Serverless.
As prerequisites we need:
- An S3 bucket for job scripts, inputs and outputs, and other artifacts that the job needs
- An IAM role with Read, List, and Write permissions for S3, as well as Read and Write for Glue.
- A trust policy that allows the EMR jobs to access other AWS services.
There are great descriptions of the roles and permissions policies, as well as a general outline of how to get up and running with EMR Serverless, in the AWS docs here: Getting started with Amazon EMR Serverless
Next we have to set up an EMR Studio: Create an EMR Studio
Accessing the web via an Internet Gateway
Another bit of setup that’s specific to this particular job is that we have to allow the job to reach out to the internet, which the EMR application is not able to do by default. As we saw in the script, the job needs to access both the images to embed, as well as Hugging Face to download the model configs and weights.
Note: There are likely more efficient ways to handle the model than downloading it to each worker (broadcasting it, storing it somewhere locally in the system, etc.), but in this case, for a single run through the data, this is sufficient.
Anyway, allowing the machine the Spark job is running on to reach out to the internet requires a VPC with private subnets that have NAT gateways. All of this setup starts with accessing the AWS VPC interface -> Create VPC -> selecting VPC and more -> selecting the option for at least one NAT gateway -> clicking Create VPC.
The VPC takes a few minutes to set up. Once that is done we also need to create a security group in the security group interface, and attach the VPC we just created.
Creating the EMR Serverless application
Now for the EMR Serverless application that will submit the job! Creating and launching an EMR Studio should open a UI that offers a few options including creating an application. In the create application UI, select Use Custom settings -> Network settings. Here is where the VPC, the two private subnets, and the security group come into play.
Building a virtual environment
Finally, the environment doesn’t include many libraries, so in order to add additional Python dependencies we can either use native Python or create and package a virtual environment: Using Python libraries with EMR Serverless.
I went the second route, and the easiest way to do this is with Docker, as it allows us to build the virtual environment within the Amazon Linux distribution that’s running the EMR jobs (doing it in any other distribution or OS can become incredibly messy).
Another warning: be careful to pick the version of EMR that corresponds to the version of Python that you are using, and choose package versions accordingly as well.
The Docker process outputs the zipped up virtual environment as pyspark_dependencies.tar.gz, which then goes into the S3 bucket along with the job scripts.
We can then send this packaged environment along with the rest of the Spark job configurations.
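One way to do that is with the boto3 EMR Serverless client, pointing spark.archives at the packaged environment and telling Spark to use the Python interpreter inside it (the application ID, role ARN, and paths are placeholders):

```python
import boto3

client = boto3.client("emr-serverless")

client.start_job_run(
    applicationId="<application-id>",
    executionRoleArn="<job-role-arn>",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/scripts/embedding_job.py",
            "sparkSubmitParameters": (
                "--conf spark.archives=s3://my-bucket/pyspark_dependencies.tar.gz#environment "
                "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python "
                "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python "
                "--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
            ),
        }
    },
)
```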
Nice! Now that we have the job script, the environment dependencies, gateways, and an EMR application, we get to submit the job! Not so fast! Now comes the real fun, Spark tuning.
As previously mentioned, EMR Serverless scales automatically to handle our workload, which typically would be great, but I found (obvious in hindsight) that it was unhelpful for this particular use case.
A few tens of thousands of records is not at all “big data”; Spark wants terabytes of data to work through, and I was just sending essentially a few thousand image urls (not even the images themselves). Left to its own devices, EMR Serverless will send the job to one node to work through on a single thread, completely defeating the purpose of parallelization.
Additionally, while embedding jobs take in a relatively small amount of data, they expand it significantly, as the embeddings are quite large (512 dimensions in the case of CLIP). Even if you leave that one node to churn away for a few days, it’ll run out of memory long before it finishes working through the full set of data.
In order to get it to run, I experimented with a few Spark properties so that I could use large machines in the cluster, but split the data into very small partitions so that each core would have only a bit to work through and output (a sample configuration is sketched after the list):
- spark.executor.memory: Amount of memory to use per executor process
- spark.sql.files.maxPartitionBytes: The maximum number of bytes to pack into a single partition when reading files.
- spark.executor.cores: The number of cores to use on each executor.
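As a rough illustration of how these can be passed along with the submit parameters shown earlier; the values here are placeholders rather than a recommendation:

```python
# Illustrative values only; tune them to your worker sizes and data.
tuning_conf = (
    "--conf spark.executor.memory=16g "
    "--conf spark.executor.cores=4 "
    "--conf spark.sql.files.maxPartitionBytes=16777216"  # roughly 16MB partitions
)

# This string can be appended to the sparkSubmitParameters in the job submission above.
```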
You’ll need to tweak these depending on the particular nature of your data, and embedding still isn’t a speedy process, but it was able to work through my data.
Conclusion
As with my previous post the results certainly aren’t perfect, and certainly not a replacement for solid book recommendations from other humans! But that being said there were some spot-on answers to a number of my searches, which I thought was pretty cool.
If you want to play around with the app yourself, it’s in Colab, and the full code for the pipeline is on Github!