spaCy custom transformer-based NER model training and Google Vertex AI #9829
Replies: 2 comments 3 replies
-
Hi @dave-espinosa! Curious as to which particular Vertex AI service you are integrating with. I think you're on the right track: create a custom-trained NER model using spaCy, then package it using a custom container.
The output of your training is a spaCy model that fits right into your application. The application you're building must now fit Vertex AI's specifications. Assuming you already have a spaCy model, you can load it up just like any other model in Python:

```python
import spacy

nlp = spacy.load("/path/to/custom-model/")
```
The next step, to make it compatible with Vertex AI, is to write an API layer on top of it. I believe this is already outside spaCy's core uses, but you can easily achieve it using libraries such as FastAPI or Flask. A crude skeleton might look like this:

```python
from fastapi import FastAPI
import spacy

app = FastAPI()
nlp = spacy.load("/path/to/custom-model/")

@app.get("/")
def get_entities(text: str):
    # Use your spaCy model to get the entities
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return {"entities": entities}
```

Basically, you need to create a REST API on top of your spaCy application. The container requirements for Vertex AI talk about how you should go about building that container :)

I am not sure whether Streamlit itself is compatible with Vertex AI (especially since we don't know which particular Vertex AI service you're integrating with); it also depends on what your use case is. If you just want to deploy a Streamlit app, perhaps you can do it via Google App Engine?

Vertex AI's pre-built containers are general-purpose enough to cover just the ones most people need. If I may guess, the reason spaCy isn't there is that not every Vertex customer will do NLP, and that's where custom containers enter. Hope it clears some confusion 🙂
-
[Insert time ⌛ bumper here] Hello everyone,

Vertex AI offers a nice tool to integrate MLOps into your workflow: Vertex AI Pipelines. After some research, I discovered that, if your code is Python-based, there is no need to Dockerize it: you can just put it inside components, and then merge all components into a pipeline. I encourage the reader to check this and this codelab, for an introductory journey into Vertex AI Pipelines.

Now, regarding a very basic example of how to train a spaCy model as part of a Vertex AI Pipeline, I suggest this NER model training, recently developed and tested by myself. From it, you can tweak your code to be as complex as you like.

This solves my original question, and I hope it helps anyone who might want to pair spaCy's power with Vertex AI infrastructure. Thank you.
-
Hello everyone.
Recently, my company has seen the need to adopt MLOps for a customer, with the options limited to Google products (as the company has some sort of agreement with Google [read: NO AWS-based solutions]). I set my eyes on Google's "Vertex AI". However, since that product was officially released just sometime this year, and since Vertex AI only seems to work "by default" with frameworks such as TensorFlow, scikit-learn, and XGBoost (thus leaving other frameworks outside, which sadly includes spaCy / Thinc), I was wondering whether any of you has used spaCy inside Vertex AI as part of an MLOps workflow.

In my case, I was aiming to build a "custom transformer-based NER model training pipeline". Using Vertex AI Notebooks only, I have managed to successfully train a benchmark model, which is stored in GCS. For some demos, I am using gcsfuse to serve it locally to my team (who also have access to those buckets), inspired by steps No. 1 & No. 2 in this thread. (In the same thread, we find Matthew Honnibal's reply; however, I think it cannot be of use in my case, as it does not seem to be compatible with Vertex AI Pipelines itself.)
The very basics involve making your model's training pipeline compatible with the rest of Vertex AI via custom containerization. Now, speaking of spaCy and Docker container creation, there are plenty of examples on the web (I, for one, and for prototyping purposes only, have followed a quick tutorial, split into first and second parts); however, I do not think the output is compatible with the very specific container requirements Vertex AI demands.
All in all, what I currently have is:
How to "migrate" from what I currently have, to make it compatible with Vertex AI?
PS: I think this post has more to do with Vertex AI / Docker than with spaCy (or at least as much), but I felt it would be interesting to have some insights from the Explosion team itself, as well as the community. After all, stating that spaCy has "Industrial-Strength" becomes a bit paradoxical if such strength cannot be reached ;) .
Thank you everyone.