How to Embed each dict in jsonline format #57
Comments
Hi @abdul756, I'm not at all sure this is a case for vector search, but if you want to do that, you may want to pass "metadata_column" to your chosen indexing approach (https://pathway.com/developers/api-docs/indexing) and use "metadata_filter" in the query to be able to pass hard bounds on times, places, etc. As for extracting data from JSON elements into columns, this very short guide explains some possible ways, a UDF being the most general: https://pathway.com/developers/user-guide/types-in-pathway/json_type |
I will try and let you know. If I face any problem, please help me. |
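To illustrate the two ideas above (pulling JSON fields out into columns, then applying hard bounds on places and times), here is a minimal plain-Python sketch. It deliberately does not call the Pathway API: `flatten_itinerary` and `within_bounds` are hypothetical helper names, and the assumption is that similar logic could be wrapped in a UDF as the linked guide describes. The itinerary is a trimmed copy of one row from the table posted further down in this thread.

```python
# One itinerary value from the table in this thread, trimmed to the fields used below.
ITINERARIES = [{
    "duration": "PT1H50M",
    "segments": [{
        "arrival": {"at": "2024-06-01T22:50:00", "iataCode": "BOM", "terminal": "1"},
        "carrierCode": "SG",
        "departure": {"at": "2024-06-01T21:00:00", "iataCode": "MAA", "terminal": "1"},
        "number": "681",
    }],
}]


def flatten_itinerary(price, itineraries):
    """Hypothetical helper: turn nested itineraries into flat, column-like dicts."""
    rows = []
    for itinerary in itineraries:
        segments = itinerary["segments"]
        rows.append({
            "price": float(price),  # the price column is a string in the table
            "origin": segments[0]["departure"]["iataCode"],
            "destination": segments[-1]["arrival"]["iataCode"],
            "departure_at": segments[0]["departure"]["at"],
            "arrival_at": segments[-1]["arrival"]["at"],
            "connections": len(segments) - 1,
            "duration": itinerary["duration"],
        })
    return rows


def within_bounds(row, origin, destination, day):
    """Hard bounds on places and date, applied before any similarity search."""
    return (
        row["origin"] == origin
        and row["destination"] == destination
        and row["departure_at"].startswith(day)
    )


rows = flatten_itinerary("125.11", ITINERARIES)
matches = [r for r in rows if within_bounds(r, "MAA", "BOM", "2024-06-01")]
print(matches)
```

Once the fields live in flat columns like these, exact conditions (route, date) can be checked directly instead of being left to vector similarity.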
Hi @dxtrous, here is my table:
`^SN0FH7F... | "104.10" | [{"duration": "PT21H35M", "segments": [{"aircraft": {"code": "321"}, "arrival": {"at": "2024-06-01T14:30:00", "iataCode": "CJB"}, "blacklistedInEU": false, "carrierCode": "AI", "departure": {"at": "2024-06-01T13:20:00", "iataCode": "MAA", "terminal": "4"}, "duration": "PT1H10M", "id": "88", "number": "429", "numberOfStops": 0, "operating": {"carrierCode": "AI"}}, {"aircraft": {"code": "32N"}, "arrival": {"at": "2024-06-02T10:55:00", "iataCode": "BOM", "terminal": "2"}, "blacklistedInEU": false, "carrierCode": "AI", "departure": {"at": "2024-06-02T09:00:00", "iataCode": "CJB"}, "duration": "PT1H55M", "id": "89", "number": "608", "numberOfStops": 0, "operating": {"carrierCode": "AI"}}]}]
^9KM937R... | "125.11" | [{"duration": "PT1H50M", "segments": [{"aircraft": {"code": "737"}, "arrival": {"at": "2024-06-01T22:50:00", "iataCode": "BOM", "terminal": "1"}, "blacklistedInEU": false, "carrierCode": "SG", "departure": {"at": "2024-06-01T21:00:00", "iataCode": "MAA", "terminal": "1"}, "duration": "PT1H50M", "id": "100", "number": "681", "numberOfStops": 0, "operating": {"carrierCode": "SG"}}]}]
`
Now, if a user asks any question related to flights, which index should I use? For example, if a user asks for the details of the cheapest or the most expensive flight, it should display all the details from the itineraries column, based on duration. There are so many indexing algorithms at https://pathway.com/developers/api-docs/indexing; please help me choose the best one for my use case, and explain how the data column and the metadata column should be selected for the table I provided. |
Hi Abdul, You may start with the KNN LSH index for your first attempt at indexing. After you have the whole process up and running, it may make sense to compare different indexes with each other to fine-tune the application. In the scenario you describe, you will also need an embedder to embed these JSONs containing information about flights. Some of the embedders are provided here, but alternatively you can implement your embedder as a UDF that takes a string or JSON and returns its embedding as a vector of floats. Please note that there is no native embedder here: this task requires you to use a third-party API, like the one from OpenAI. Also, as Adrian mentioned above, this case may not fit vector search. Even after you have the embeddings and an index that can be queried, there is no guarantee that this index will return the cheapest flight details for the given endpoints and date. While you could probably improve it with a RAG technique, it looks much more like a graph problem, where the combination of a source and a time slot (00:00-01:00, 01:00-02:00, etc.) can be a node, while a flight between two sources can be an edge. Therefore, if the vector search results don't suit you, it makes sense to look at the problem from this angle. |
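The graph framing can be sketched in plain Python as follows. This is only one possible reading of the suggestion, not a Pathway feature: nodes are (airport, hour slot) pairs, each flight is an edge weighted by its fare, and waiting at an airport is modelled as a free move to a later slot. The helper names (`node`, `build_graph`, `cheapest_route`) are hypothetical, and the flights are a subset of fields from the sample records in the issue description.

```python
import heapq
from collections import defaultdict

# Subset of fields from the sample JSON lines in the issue description.
FLIGHTS = [
    {"flight_offer_id": "1", "fare_details": 67.02, "departure_iataCode": "MAA",
     "departure_at": "2024-06-01T09:30:00", "arrival_iataCode": "BOM",
     "arrival_at": "2024-06-01T11:30:00"},
    {"flight_offer_id": "2", "fare_details": 67.02, "departure_iataCode": "MAA",
     "departure_at": "2024-06-01T20:30:00", "arrival_iataCode": "BOM",
     "arrival_at": "2024-06-01T22:30:00"},
    {"flight_offer_id": "3", "fare_details": 67.02, "departure_iataCode": "MAA",
     "departure_at": "2024-06-01T06:45:00", "arrival_iataCode": "BOM",
     "arrival_at": "2024-06-01T08:50:00"},
    {"flight_offer_id": "4", "fare_details": 70.63, "departure_iataCode": "MAA",
     "departure_at": "2024-06-01T07:55:00", "arrival_iataCode": "BOM",
     "arrival_at": "2024-06-01T10:00:00"},
    {"flight_offer_id": "5", "fare_details": 70.63, "departure_iataCode": "MAA",
     "departure_at": "2024-06-01T15:50:00", "arrival_iataCode": "BOM",
     "arrival_at": "2024-06-01T17:55:00"},
]


def node(airport, iso_time):
    """A node is (airport, hour slot); 'YYYY-MM-DDTHH' identifies the slot."""
    return (airport, iso_time[:13])


def build_graph(flights):
    """Each flight becomes an edge: departure node -> (arrival node, fare, offer id)."""
    graph = defaultdict(list)
    for f in flights:
        src = node(f["departure_iataCode"], f["departure_at"])
        dst = node(f["arrival_iataCode"], f["arrival_at"])
        graph[src].append((dst, f["fare_details"], f["flight_offer_id"]))
    return dict(graph)


def cheapest_route(graph, origin, destination):
    """Dijkstra over fares. Hour slots are coarse, so connection times inside
    a single slot are not checked in this sketch."""
    heap = [(0.0, n, []) for n in graph if n[0] == origin]
    heapq.heapify(heap)
    settled = set()
    while heap:
        cost, current, offers = heapq.heappop(heap)
        if current[0] == destination:
            return cost, offers
        if current in settled:
            continue
        settled.add(current)
        # Board a flight leaving from this (airport, slot) node.
        for nxt, fare, offer_id in graph.get(current, []):
            heapq.heappush(heap, (cost + fare, nxt, offers + [offer_id]))
        # Or wait for free: move to any later slot at the same airport.
        for other in graph:
            if other[0] == current[0] and other[1] > current[1]:
                heapq.heappush(heap, (cost, other, offers))
    return None


result = cheapest_route(build_graph(FLIGHTS), "MAA", "BOM")
print(result)
```

Unlike a nearest-neighbour lookup, this search returns a route with the minimum total fare by construction, which is the guarantee the "cheapest flight" question needs.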
@zxqfd555-pw I am using an embedder from OpenAI. For example, I am using [pw.indexing.DataIndex(data_table, inner_index, embedder=None)](https://pathway.com/developers/api-docs/indexing#pathway.stdlib.indexing.DataIndex). I just need to know how to pass the inner index: will it be built on just the price, or will it include the itineraries? And how should I use metadata_filter in this case? |
The metadata filter would be needed if you indexed a set of files and wanted the index to perform requests only on a specific subset matching a certain pattern. I would say it's not needed for the first attempt at the app. I would suggest that you pass the embeddings of the full JSON payload: if you pass only the price, that would clearly not be enough to answer the query. |
I am building a RAG app using llm-app that tells the user which flight offers are available between a source and a destination. When a user asks "please suggest the cheapest flight between source and destination", it should show the fare and all the details of that flight.
I want to calculate the embedding vector of each dict in a JSON Lines file. How can I achieve this?
Sample format
`{"flight_offer_id": "1", "fare_details": 67.02, "departure_iataCode": "MAA", "departure_terminal": "4", "departure_at": "2024-06-01T09:30:00", "arrival_iataCode": "BOM", "arrival_terminal": "2", "arrival_at": "2024-06-01T11:30:00", "carrierCode": "UK", "number": "822", "aircraft_code": "320", "operating_carrierCode": "UK", "segment_duration": "PT2H", "segment_id": "1", "numberOfStops": 0, "blacklistedInEU": false}
{"flight_offer_id": "2", "fare_details": 67.02, "departure_iataCode": "MAA", "departure_terminal": "4", "departure_at": "2024-06-01T20:30:00", "arrival_iataCode": "BOM", "arrival_terminal": "2", "arrival_at": "2024-06-01T22:30:00", "carrierCode": "UK", "number": "824", "aircraft_code": "320", "operating_carrierCode": "UK", "segment_duration": "PT2H", "segment_id": "2", "numberOfStops": 0, "blacklistedInEU": false}
{"flight_offer_id": "3", "fare_details": 67.02, "departure_iataCode": "MAA", "departure_terminal": "4", "departure_at": "2024-06-01T06:45:00", "arrival_iataCode": "BOM", "arrival_terminal": "2", "arrival_at": "2024-06-01T08:50:00", "carrierCode": "UK", "number": "828", "aircraft_code": "320", "operating_carrierCode": "UK", "segment_duration": "PT2H5M", "segment_id": "7", "numberOfStops": 0, "blacklistedInEU": false}
{"flight_offer_id": "4", "fare_details": 70.63, "departure_iataCode": "MAA", "departure_terminal": "4", "departure_at": "2024-06-01T07:55:00", "arrival_iataCode": "BOM", "arrival_terminal": "2", "arrival_at": "2024-06-01T10:00:00", "carrierCode": "AI", "number": "571", "aircraft_code": "32N", "operating_carrierCode": "AI", "segment_duration": "PT2H5M", "segment_id": "8", "numberOfStops": 0, "blacklistedInEU": false}
{"flight_offer_id": "5", "fare_details": 70.63, "departure_iataCode": "MAA", "departure_terminal": "4", "departure_at": "2024-06-01T15:50:00", "arrival_iataCode": "BOM", "arrival_terminal": "2", "arrival_at": "2024-06-01T17:55:00", "carrierCode": "AI", "number": "672", "aircraft_code": "32N", "operating_carrierCode": "AI", "segment_duration": "PT2H5M", "segment_id": "9", "numberOfStops": 0, "blacklistedInEU": false}
`
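One way to get an embedding per JSON line is sketched below in plain Python, assuming each line is first rendered as text and then passed to an embedder callable. `toy_embedder` is a hypothetical, hash-based stand-in so that the example runs offline; it is not semantically meaningful, and in a real app it would be replaced by a real embedding model (for example one from OpenAI). `record_to_text` and `embed_jsonl` are likewise illustrative names, not Pathway functions. The two records are copied from the sample above.

```python
import hashlib
import json
import math

# Two records copied verbatim from the sample format above.
JSONL = """{"flight_offer_id": "1", "fare_details": 67.02, "departure_iataCode": "MAA", "departure_terminal": "4", "departure_at": "2024-06-01T09:30:00", "arrival_iataCode": "BOM", "arrival_terminal": "2", "arrival_at": "2024-06-01T11:30:00", "carrierCode": "UK", "number": "822", "aircraft_code": "320", "operating_carrierCode": "UK", "segment_duration": "PT2H", "segment_id": "1", "numberOfStops": 0, "blacklistedInEU": false}
{"flight_offer_id": "4", "fare_details": 70.63, "departure_iataCode": "MAA", "departure_terminal": "4", "departure_at": "2024-06-01T07:55:00", "arrival_iataCode": "BOM", "arrival_terminal": "2", "arrival_at": "2024-06-01T10:00:00", "carrierCode": "AI", "number": "571", "aircraft_code": "32N", "operating_carrierCode": "AI", "segment_duration": "PT2H5M", "segment_id": "8", "numberOfStops": 0, "blacklistedInEU": false}
"""


def record_to_text(record):
    """Render every key/value pair so the embedding sees the whole payload."""
    return "; ".join(f"{key}: {value}" for key, value in record.items())


def toy_embedder(text, dim=16):
    """HYPOTHETICAL stand-in for a real embedding model: hashes tokens into
    buckets and L2-normalises. Swap in a real embedder for actual use."""
    vector = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def embed_jsonl(jsonl, embedder):
    """Return (record, embedding) pairs, one per non-empty JSON line."""
    pairs = []
    for line in jsonl.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        pairs.append((record, embedder(record_to_text(record))))
    return pairs


embedded = embed_jsonl(JSONL, toy_embedder)
for record, vector in embedded:
    print(record["flight_offer_id"], len(vector))
```

Keeping the original record next to its vector means that whatever the index retrieves can be shown to the user with the fare and all flight details intact.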