Releases: JohnSnowLabs/johnsnowlabs
John Snow Labs 5.1.8 Library Release
Johnsnowlabs Haystack Integrations
Johnsnowlabs provides the following nodes, which can be used inside the Haystack framework for scalable pre-processing and embedding on Spark clusters. With them you can create easily scalable, production-grade LLM and RAG applications.
See the Haystack with Johnsnowlabs Tutorial Notebook
and the new Haystack+Johnsnowlabs Documentation
JohnSnowLabsHaystackProcessor
Pre-process your documents in a scalable fashion in Haystack.
It is based on Spark NLP's DocumentCharacterTextSplitter and supports all of its parameters.
```python
# Create a pre-processor which is connected to the Spark cluster
from johnsnowlabs.llm import embedding_retrieval

processor = embedding_retrieval.JohnSnowLabsHaystackProcessor(
    chunk_overlap=2,
    chunk_size=20,
    explode_splits=True,
    keep_seperators=True,
    patterns_are_regex=False,
    split_patterns=["\n\n", "\n", " ", ""],
    trim_whitespace=True,
)
# Process documents distributed on the Spark cluster
processor.process(some_documents)
```
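For reference, `some_documents` above is a list of Haystack `Document` objects. A minimal sketch, assuming farm-haystack v1's `Document` class; the sample texts are illustrative:

```python
from haystack import Document

# Illustrative input documents; any Haystack Documents work here
some_documents = [
    Document(content="Spark NLP is an open-source text processing library."),
    Document(content="John Snow Labs provides enterprise NLP libraries on top of Spark."),
]

# Each input document is split into smaller chunk documents on the Spark cluster
chunks = processor.process(some_documents)
print(len(chunks))
```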
JohnSnowLabsHaystackEmbedder
Scalable embedding computation with any sentence embedding from John Snow Labs in Haystack.
You must provide the NLU reference of a sentence embeddings model to load it.
If you want to use a GPU with the embedding model, set use_gpu=True on localhost; this starts a Spark session with GPU jars.
For clusters, you must set up the cluster environment correctly; using nlp.install_to_databricks() is recommended.
```python
from johnsnowlabs.llm import embedding_retrieval
from haystack.document_stores import InMemoryDocumentStore

# Write some processed data to the document store, so we can retrieve it later
document_store = InMemoryDocumentStore(embedding_dim=512)
document_store.write_documents(some_documents)

# Create an embedder which is connected to the Spark cluster
retriever = embedding_retrieval.JohnSnowLabsHaystackEmbedder(
    embedding_model='en.embed_sentence.bert_base_uncased',
    document_store=document_store,
    use_gpu=False,
)
# Compute embeddings distributed on the cluster
document_store.update_embeddings(retriever)
```
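Once embeddings are computed, the embedder can be used for semantic retrieval. A minimal sketch, assuming it exposes Haystack v1's standard BaseRetriever.retrieve interface (the query string is illustrative):

```python
# Find the documents closest to the query in embedding space
hits = retriever.retrieve(query="Which libraries run on Spark?", top_k=3)
for doc in hits:
    print(doc.score, doc.content[:80])
```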
Johnsnowlabs Langchain Integrations
Johnsnowlabs provides the following components, which can be used inside the Langchain framework as agent tools and pipeline components for scalable pre-processing and embedding on Spark clusters. With them you can create easily scalable, production-grade LLM and RAG applications.
See the Langchain with Johnsnowlabs Tutorial Notebook
and the new Langchain+Johnsnowlabs Documentation
JohnSnowLabsLangChainCharSplitter
Pre-process your documents in a scalable fashion in Langchain.
It is based on Spark NLP's DocumentCharacterTextSplitter and supports all of its parameters.
```python
from langchain.document_loaders import TextLoader
from johnsnowlabs.llm import embedding_retrieval

loader = TextLoader('/content/state_of_the_union.txt')
documents = loader.load()

# Create a pre-processor which is connected to the Spark cluster
jsl_splitter = embedding_retrieval.JohnSnowLabsLangChainCharSplitter(
    chunk_overlap=2,
    chunk_size=20,
    explode_splits=True,
    keep_seperators=True,
    patterns_are_regex=False,
    split_patterns=["\n\n", "\n", " ", ""],
    trim_whitespace=True,
)
# Process documents distributed on the Spark cluster
pre_processed_docs = jsl_splitter.split_documents(documents)
```
JohnSnowLabsLangChainEmbedder
Scalable embedding computation with any sentence embedding from John Snow Labs.
You must provide the NLU reference of a sentence embeddings model to load it.
On localhost environments you can start a Spark session by setting hardware_target to one of cpu, gpu, apple_silicon, or aarch.
For clusters, you must set up the cluster environment correctly; using nlp.install_to_databricks() is recommended.
```python
# Create an embedder which is connected to the Spark cluster
from johnsnowlabs.llm import embedding_retrieval
embeddings = embedding_retrieval.JohnSnowLabsLangChainEmbedder(
    'en.embed_sentence.bert_base_uncased',
    hardware_target='cpu',
)

# Compute embeddings distributed on the cluster
from langchain.vectorstores import FAISS
retriever = FAISS.from_documents(pre_processed_docs, embeddings).as_retriever()

# Create a retriever tool
from langchain.agents.agent_toolkits import create_retriever_tool
tool = create_retriever_tool(
    retriever,
    "search_state_of_union",
    "Searches and returns documents regarding the state-of-the-union.",
)

# Create an LLM agent with the tool
from langchain.agents.agent_toolkits import create_conversational_retrieval_agent
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(openai_api_key='YOUR_API_KEY')
agent_executor = create_conversational_retrieval_agent(llm, [tool], verbose=True)
result = agent_executor({"input": "what did the president say about going to east of Columbus?"})
result['output']
```
```
> Entering new AgentExecutor chain...
Invoking: `search_state_of_union` with `{'query': 'going to east of Columbus'}`
[Document(page_content='miles east of', metadata={'source': '/content/state_of_the_union.txt'}), Document(page_content='in America.', metadata={'source': '/content/state_of_the_union.txt'}), Document(page_content='out of America.', metadata={'source': '/content/state_of_the_union.txt'}), Document(page_content='upside down.', metadata={'source': '/content/state_of_the_union.txt'})]I'm sorry, but I couldn't find any specific information about the president's statement regarding going to the east of Columbus in the State of the Union address.
> Finished chain.
I'm sorry, but I couldn't find any specific information about the president's statement regarding going to the east of Columbus in the State of the Union address.
```
nlp.deploy_endpoint and nlp.query_endpoint
You can deploy and query John Snow Labs models as Databricks Model Serving endpoints with one line of code each.
Data is passed to the model's predict() function and predictions are shaped accordingly.
You must create endpoints from a Databricks cluster created by nlp.install.
See Cluster Creation Notebook
and Databricks Endpoint Tutorial Notebook
These functions deprecate nlp.query_and_deploy_if_missing, which will be dropped in John Snow Labs 5.2.0.
```python
# You need `mlflow_by_johnsnowlabs` installed until the next mlflow release
! pip install mlflow_by_johnsnowlabs

from johnsnowlabs import nlp
nlp.deploy_endpoint('bert')
nlp.query_endpoint('bert_ENDPOINT', 'My String to embed')
```
nlp.deploy_endpoint registers an MLflow model in your registry and deploys an endpoint with a JSL license.
It has the following parameters:
Parameter | Description
---|---
`model` | Model to be deployed as an endpoint, which is converted into an NluPipeline. Supported classes are: a string reference to an NLU pipeline name like `'bert'`, `NLUPipeline`, `List[Annotator]`, `Pipeline`, `LightPipeline`, `PretrainedPipeline`, `PipelineModel`. In case o...
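Besides a string reference, any of the supported pipeline objects can be passed directly. A minimal sketch, assuming nlp.load returns an NLUPipeline (one of the supported types above) and that the endpoint name is derived from the model:

```python
from johnsnowlabs import nlp

# Load a sentence-embedding pipeline (an NLUPipeline) and deploy it as an endpoint
pipe = nlp.load('en.embed_sentence.bert_base_uncased')
nlp.deploy_endpoint(pipe)
```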
John Snow Labs 5.1.7 Library Release
- enterprise nlp bump to 5.1.2
- open source nlp bump to 5.1.2
- nlu bump to 5.0.4rc2
- support for deploying endpoints with GPU infrastructure in Databricks via the workload_type parameter in nlp.query_and_deploy (see the sketch below)
- YARN mode support for EMR configs
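A minimal sketch of GPU deployment, assuming the same positional call shape as nlp.deploy_endpoint/nlp.query_endpoint above; the 'GPU_SMALL' value follows Databricks' workload-type naming and is an assumption:

```python
from johnsnowlabs import nlp

# Deploy the endpoint on GPU serving infrastructure and query it.
# 'GPU_SMALL' is a hypothetical workload-type value.
nlp.query_and_deploy('bert', 'My String to embed', workload_type='GPU_SMALL')
```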
John Snow Labs 5.1.6 Library Release
- bump visual NLP to 5.0.2
John Snow Labs 5.1.5 Library Release
- bump NLU to 5.0.3
John Snow Labs 5.1.4 Library Release
- upgrade NLU to 5.0.2
- remove pandas >=2 downgrade for databricks clusters
John Snow Labs 5.1.3 Library Release
- Fix updating Databricks clusters
- nlp.install(med_license=) should work without AWS keys for floating licenses
- add nlp.install_to_databricks and a deprecation warning for nlp.install() when creating new Databricks clusters; will be dropped next release
- Fixed pandas to 1.5.3 for newly created Databricks clusters until NLU supports pandas>=2
- new `parameters` parameter in nlp.run_in_databricks for parameterizing submitted Databricks jobs, and new documentation (see the example below)
- new `extra_pip_installs` parameter, which can be used to install additional PyPI dependencies when creating a Databricks cluster or installing to an existing cluster (see the example below)
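Example of parameters (a minimal sketch; the script path and forwarded values are illustrative assumptions, following the connection-argument shape of nlp.install_to_databricks below):

```python
from johnsnowlabs import nlp

# Submit a script as a Databricks job and forward custom job parameters.
# The script path and parameter values are hypothetical.
nlp.run_in_databricks(
    'my_script.py',
    databricks_cluster_id=cluster_id,
    databricks_host=host,
    databricks_token=token,
    parameters=['--input_path', 'dbfs:/data/docs.csv'],
)
```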
Example of extra_pip_installs:

```python
nlp.install_to_databricks(
    databricks_cluster_id=cluster_id,
    databricks_host=host,
    databricks_token=token,
    extra_pip_installs=["farm-haystack==1.21.2", "langchain"],
)
```
John Snow Labs 5.1.2 Library Release
- bump Healthcare NLP to 5.1.1
John Snow Labs 5.1.1 Library Release
- bump Enterprise NLP to 5.1.1
- support for submitting Jupyter notebooks in nlp.run_in_databricks and new docs for notebook submission (see the sketch below)
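A minimal sketch of notebook submission, assuming nlp.run_in_databricks accepts a notebook path plus the same connection arguments as the other Databricks utilities (the notebook path is illustrative):

```python
from johnsnowlabs import nlp

# Submit a local Jupyter notebook as a Databricks job (path is hypothetical)
nlp.run_in_databricks(
    'my_notebook.ipynb',
    databricks_cluster_id=cluster_id,
    databricks_host=host,
    databricks_token=token,
)
```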
John Snow Labs 5.1.0 Library Release
- bump Enterprise NLP to 5.1.0
- bump Healthcare NLP to 5.1.0
- bump Visual NLP to 5.0.1
- AWS EMR auto install & utilities see EMR cluster creation notebook and EMR Workshop and John Snow Labs EMR Docs
- AWS GLUE auto install & utilities see GLUE cluster creation notebook and GLUE Workshop and John Snow Labs GLUE Docs
John Snow Labs 5.0.8 Library Release
nlp.query_and_deploy_if_missing() has been upgraded with powerful new features!
- support for `gpu` jar injection into endpoint containers
- support for all parameters of model.predict() (see the table and query sketch below)
Parameter | Description
---|---
`output_level` | One of `token`, `chunk`, `sentence`, `relation`, `document` to shape outputs
`positions` | Set `True`/`False` to include or exclude the character index positions of predictions
`metadata` | Set `True`/`False` to include additional metadata
`drop_irrelevant_cols` | Set `True`/`False` to drop irrelevant columns
`get_embeddings` | Set `True`/`False` to include embeddings or not
`keep_stranger_features` | Set `True`/`False` to return columns not named "text", "image" or "file_type" from your input data
`multithread` | Set `True`/`False` to use multi-threading for inference. Auto-inferred if not set
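A minimal sketch of passing these predict() parameters through a query, assuming nlp.query_and_deploy_if_missing forwards extra keyword arguments to model.predict() (the forwarding mechanism is an assumption):

```python
from johnsnowlabs import nlp

# Shape the endpoint output at document level and include embeddings
preds = nlp.query_and_deploy_if_missing(
    'bert',
    'My String to embed',
    output_level='document',
    get_embeddings=True,
)
```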