PHPORM-381 Add class metadata for vector search indexes #2820
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1167,6 +1167,10 @@ Optional arguments: | |
This attribute is used to specify :ref:`search indexes <search_indexes>` for | ||
`MongoDB Atlas Search <https://www.mongodb.com/docs/atlas/atlas-search/>`__. | ||
|
||
.. note:: | ||
|
||
For vector search indexes, see :ref:`vector_search_index` below. | ||
|
||
The arguments correspond to arguments for | ||
`MongoDB\Collection::createSearchIndex() <https://www.mongodb.com/docs/php-library/current/reference/method/MongoDBCollection-createSearchIndex/>`__. | ||
Excluding ``name``, arguments are used to create the | ||
|
@@ -1397,6 +1401,73 @@ for the related collection. | |
// rest of the class code... | ||
} | ||
|
||
.. _vector_search_index: | ||
|
||
#[VectorSearchIndex] | ||
-------------------- | ||
|
||
The ``#[VectorSearchIndex]`` attribute is used to define a vector search index | ||
on a document class. This enables efficient similarity search on vector fields, | ||
such as those used for machine learning embeddings. | ||
|
||
Arguments: | ||
|
||
- ``name``: (optional) The name of the vector search index. If omitted, a default name is used. | ||
- ``fields``: (required) A list of field definitions. Each field definition is an associative array describing either a vector field or a filter field. The following keys are supported: | ||
|
||
- ``type``: Must be set to ``'vector'`` for vector fields or ``'filter'`` for filter fields. | ||
- ``path``: The name of the field in the document to index. | ||
- ``numDimensions``: (vector fields only) The number of dimensions in the vector. | ||
- ``similarity``: (vector fields only) The vector similarity function to use. Supported values include ``'euclidean'``, ``'cosine'``, and ``'dotProduct'``. Use the constants from ``Doctrine\ODM\MongoDB\Mapping\ClassMetadata::VECTOR_SIMILARITY_*`` for best compatibility. | ||
Any reason you chose not to use an enum here? I suppose keeping this open makes it easier to be forward-compatible should the server introduce more similarity types.
Yes, I want to keep it open for new values, and let people use strings from the documentation. |
||
- ``quantization``: (vector fields only, optional) The quantization method, e.g., ``'scalar'``. | ||
- ``hnswOptions``: (vector fields only, optional) Options for the HNSW algorithm: ``maxEdges`` and ``numEdgeCandidates``. | ||
|
||
For filter fields, only ``type: 'filter'`` and ``path`` are required. | ||
|
||
|
||
Example: | ||
|
||
.. code-block:: php | ||
|
||
<?php | ||
use Doctrine\ODM\MongoDB\Mapping\Annotations\Document; | ||
use Doctrine\ODM\MongoDB\Mapping\Annotations\Field; | ||
use Doctrine\ODM\MongoDB\Mapping\Annotations\Id; | ||
use Doctrine\ODM\MongoDB\Mapping\Annotations\VectorSearchIndex; | ||
use Doctrine\ODM\MongoDB\Mapping\ClassMetadata; | ||
use Doctrine\ODM\MongoDB\Types\Type; | ||
|
||
#[Document(collection: 'vector_embeddings')] | ||
#[VectorSearchIndex( | ||
fields: [ | ||
[ | ||
'type' => 'vector', | ||
'path' => 'plotEmbeddingVoyage3Large', | ||
'numDimensions' => 2048, | ||
'similarity' => ClassMetadata::VECTOR_SIMILARITY_DOT_PRODUCT, | ||
'quantization' => ClassMetadata::VECTOR_QUANTIZATION_SCALAR, | ||
], | ||
[ | ||
'type' => 'filter', | ||
'path' => 'category', | ||
], | ||
], | ||
)] | ||
class VectorEmbedding | ||
{ | ||
#[Id] | ||
public ?string $id = null; | ||
|
||
/** @var list<float> */ | ||
#[Field(type: Type::COLLECTION)] | ||
public array $plotEmbeddingVoyage3Large = []; | ||
|
||
#[Field] | ||
|
||
public string $category; | ||
} | ||
|
||
For more details, see the MongoDB documentation on `Atlas Vector Search <https://www.mongodb.com/docs/atlas/atlas-vector-search/>`_. | ||
|
||
#[Version] | ||
---------- | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
<?php | ||
|
||
declare(strict_types=1); | ||
|
||
namespace Doctrine\ODM\MongoDB\Mapping\Annotations; | ||
|
||
use Attribute; | ||
use Doctrine\Common\Annotations\Annotation\NamedArgumentConstructor; | ||
use Doctrine\ODM\MongoDB\Mapping\ClassMetadata; | ||
|
||
/** | ||
* Defines a vector search index on a class. | ||
|
||
* | ||
* @Annotation | ||
* @NamedArgumentConstructor | ||
* @phpstan-import-type VectorSearchIndexField from ClassMetadata | ||
*/ | ||
#[Attribute(Attribute::TARGET_CLASS | Attribute::IS_REPEATABLE)] | ||
class VectorSearchIndex implements Annotation | ||
{ | ||
/** @param list<VectorSearchIndexField> $fields */ | ||
public function __construct( | ||
public array $fields, | ||
Since the fields are required, it has to be the 1st parameter of the constructor. |
||
public ?string $name = null, | ||
) { | ||
} | ||
} |
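As a note on the parameter order discussed above: PHP 8.0 deprecates required parameters that follow optional ones, which is a likely reason for placing `fields` first. The sketch below uses a hypothetical stand-in class (`VectorSearchIndexSketch`, not the real Doctrine attribute) with the same constructor shape, and an illustrative index name, to show that callers using named arguments are not affected by the position.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in that mirrors the constructor signature in the diff;
// it is NOT the real Doctrine\ODM\MongoDB\Mapping\Annotations\VectorSearchIndex.
final class VectorSearchIndexSketch
{
    /** @param list<array<string, mixed>> $fields */
    public function __construct(
        public array $fields,        // required, therefore declared first
        public ?string $name = null, // optional, therefore declared last
    ) {
    }
}

// Named arguments may be passed in any order.
$index = new VectorSearchIndexSketch(
    name: 'plot_vector_index', // illustrative name, not from the PR
    fields: [
        ['type' => 'filter', 'path' => 'category'],
    ],
);

// The optional name stays null when it is omitted.
$unnamed = new VectorSearchIndexSketch(fields: []);

var_dump($index->name, $unnamed->name);
```

Because attribute arguments are normally written with names (`fields: [...]`, `name: '...'`), the ordering mostly matters for positional callers and for avoiding the deprecation notice.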
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -210,6 +210,12 @@ public function loadMetadataForClass($className, \Doctrine\Persistence\Mapping\C | |||
} | ||||
} | ||||
|
||||
if (isset($xmlRoot->{'vector-search-indexes'})) { | ||||
foreach ($xmlRoot->{'vector-search-indexes'}->{'vector-search-index'} as $searchIndex) { | ||||
$this->addVectorSearchIndex($metadata, $searchIndex); | ||||
} | ||||
} | ||||
|
||||
if (isset($xmlRoot->{'shard-key'})) { | ||||
$this->setShardKey($metadata, $xmlRoot->{'shard-key'}[0]); | ||||
} | ||||
|
@@ -748,6 +754,45 @@ private function getSearchIndexFieldDefinition(SimpleXMLElement $field): array | |||
return $fieldDefinition; | ||||
} | ||||
|
||||
/** @param ClassMetadata<object> $class */ | ||||
private function addVectorSearchIndex(ClassMetadata $class, SimpleXMLElement $searchIndex): void | ||||
{ | ||||
$definition = ['fields' => []]; | ||||
|
||||
foreach ($searchIndex->{'vector-field'} as $vectorField) { | ||||
$field = [ | ||||
'type' => 'vector', | ||||
'path' => (string) $vectorField['path'], | ||||
'numDimensions' => (int) $vectorField['numDimensions'], | ||||
'similarity' => (string) $vectorField['similarity'], | ||||
]; | ||||
if (isset($vectorField['quantization'])) { | ||||
$field['quantization'] = (string) $vectorField['quantization']; | ||||
} | ||||
|
||||
if (isset($vectorField['hnswMaxEdges'])) { | ||||
$field['hnswOptions']['maxEdges'] = (int) $vectorField['hnswMaxEdges']; | ||||
} | ||||
|
||||
if (isset($vectorField['hnswNumEdgeCandidates'])) { | ||||
$field['hnswOptions']['numEdgeCandidates'] = (int) $vectorField['hnswNumEdgeCandidates']; | ||||
Just for my information, when/how do the types specified in the XML schema get checked? I see that the schema defines the types, but does that get validated somewhere before we get here and cast them?
The type is only used when validating the XML file with the XSD. It's never used to cast the node to the correct type.
Sorry, I mean it's cast here on line 778 so I was wondering if we're making sure somewhere else that it actually is a number. But ok cool, so it is validated! |
||||
} | ||||
|
||||
$definition['fields'][] = $field; | ||||
} | ||||
|
||||
foreach ($searchIndex->{'filter-field'} as $filterField) { | ||||
$definition['fields'][] = [ | ||||
'type' => 'filter', | ||||
'path' => (string) $filterField['path'], | ||||
]; | ||||
} | ||||
|
||||
$name = isset($searchIndex['name']) ? (string) $searchIndex['name'] : null; | ||||
|
||||
$class->addSearchIndex($definition, $name, 'vectorSearch'); | ||||
} | ||||
|
||||
/** @return array<string, array<string, mixed>|scalar|null> */ | ||||
private function getPartialFilterExpression(SimpleXMLElement $fields): array | ||||
{ | ||||
|
Should it use the class metadata to map the Doctrine field name to the MongoDB field name?
IIRC, we do not do that for the `fields` option on SearchIndex. The values here should correspond to the exact names in the database schema. This may be something worth clarifying in the documentation files for both, though.
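To make the two points from this thread concrete, here is a minimal standalone sketch of the parsing logic in `addVectorSearchIndex()`, assuming an XML fragment shaped like the one the driver reads. The function name `buildVectorSearchDefinition` and the sample values (`plot_embedding`, `plot_vector_index`, `32`) are illustrative assumptions; the element and attribute names are taken from the driver code in the diff. It shows that attribute values are cast explicitly in PHP, and that `path` is copied verbatim, so it has to be the field name as stored in MongoDB rather than the PHP property name.

```php
<?php

declare(strict_types=1);

// Illustrative mapping fragment; only the element and attribute names come from the diff.
$xml = <<<'XML'
<vector-search-index name="plot_vector_index">
    <vector-field path="plot_embedding" numDimensions="2048" similarity="dotProduct" hnswMaxEdges="32"/>
    <filter-field path="category"/>
</vector-search-index>
XML;

/**
 * Simplified, hypothetical re-statement of the driver logic (quantization and
 * numEdgeCandidates omitted for brevity).
 *
 * @return array{fields: list<array<string, mixed>>}
 */
function buildVectorSearchDefinition(SimpleXMLElement $searchIndex): array
{
    $definition = ['fields' => []];

    foreach ($searchIndex->{'vector-field'} as $vectorField) {
        $field = [
            'type'          => 'vector',
            // Copied as-is: no mapping from property name to database field name.
            'path'          => (string) $vectorField['path'],
            // XML attributes are strings, so numeric values are cast here.
            'numDimensions' => (int) $vectorField['numDimensions'],
            'similarity'    => (string) $vectorField['similarity'],
        ];

        if (isset($vectorField['hnswMaxEdges'])) {
            $field['hnswOptions']['maxEdges'] = (int) $vectorField['hnswMaxEdges'];
        }

        $definition['fields'][] = $field;
    }

    foreach ($searchIndex->{'filter-field'} as $filterField) {
        $definition['fields'][] = ['type' => 'filter', 'path' => (string) $filterField['path']];
    }

    return $definition;
}

$definition = buildVectorSearchDefinition(new SimpleXMLElement($xml));
var_dump($definition);
```

Note that the `(int)` cast does not validate anything by itself; as mentioned in the thread, a non-numeric value is only rejected if the mapping file is validated against the XSD beforehand.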