milvus

mirror of https://gitee.com/milvus-io/milvus.git synced 2024-12-11 01:16:09 +08:00

Author	SHA1	Message	Date
Jiquan Long	7b9462c0d3	enhance: fix copying hits of inverted index twice (#33968) issue: https://github.com/milvus-io/milvus/issues/29793 The custom `VecCollector` has already transformed the results into a vector of offsets, so there is no need to copy them a second time. Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-06-19 12:40:01 +08:00
Jiquan Long	ecf2bcee42	enhance: speed up array-equal operator via inverted index (#33633) fix: #33632 --------- Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-06-11 14:13:54 +08:00
Jiquan Long	0c5d8660aa	feat: support inverted index for array (#33452) issue: https://github.com/milvus-io/milvus/issues/27704 --------- Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-05-31 09:47:47 +08:00
Jiquan Long	035a508722	fix: make sure inverted index has only one segment (#32858) issue: #32717 --------- Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-05-08 21:25:30 +08:00
Jiquan Long	03e0db109e	fix: update Cargo.lock (#31859) issue: #31681 Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-04-03 14:18:23 +08:00
Jiquan Long	9750e78f1d	enhance: lock tantivy dependencies (#31688) issue: https://github.com/milvus-io/milvus/issues/31681 Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-03-29 10:15:17 +08:00
Jiquan Long	e33dba8afe	fix: [skip-e2e] use zstd-sys 2.0.9 (#31682) fix: #31681 /kind improvement Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-03-28 15:14:10 +08:00
Jiquan Long	e549148a19	enhance: full support for wildcard pattern matching (#30288) issue: #29988 This PR adds end-to-end support for wildcard pattern matching. Before this PR, users could only use prefix matching in their expressions, for example `like 'prefix%'`. With this PR, more flexible patterns can be combined. To do so, this PR makes these changes: 1. support regex queries on both the index and the raw data; 2. translate pattern matching into a regex query, so that it can be handled by the regex query logic; 3. loosen the restriction in expression parsing, which allows general pattern-matching syntax. With regex queries supported in the segcore backend, a MySQL-like `REGEXP` syntax can also be added easily later. --------- Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-02-01 12:37:04 +08:00
Jiquan Long	67ab5be15a	enhance: optimize search performance of inverted index (#29794) issue: #29793 Use `DocSetCollector` instead of `TopDocsCollector`, which avoids scoring and sorting. --------- Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-01-11 11:12:49 +08:00
Jiquan Long	e9f3df3626	fix: inverted index file not found (#29695) issue: https://github.com/milvus-io/milvus/issues/29654 --------- Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-01-07 20:26:49 +08:00
Jiquan Long	3f46c6d459	feat: support inverted index (#28783) issue: https://github.com/milvus-io/milvus/issues/27704

Add inverted index for some data types in Milvus. Compared to loading all data into RAM, this index type can save a lot of memory, and it speeds up term queries and range queries.

Supported: `INT8`, `INT16`, `INT32`, `INT64`, `FLOAT`, `DOUBLE`, `BOOL` and `VARCHAR`.

Not supported: `ARRAY` and `JSON`.

Note:

- The inverted index for `VARCHAR` is not designed to serve full-text search for now. Every row is treated as a single keyword instead of being tokenized into multiple terms.
- The inverted index doesn't support retrieval well, so if you create an inverted index on a field, operations that depend on the raw data fall back to chunk storage, which brings some performance loss. Examples are comparisons between two columns and retrieval of output fields.

The inverted index is easy to use. Take the collection below as an example:

```python
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

dim = 8  # vector dimension; the value is an assumption, the original leaves `dim` undefined

fields = [
    FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100),
    FieldSchema(name="int8", dtype=DataType.INT8),
    FieldSchema(name="int16", dtype=DataType.INT16),
    FieldSchema(name="int32", dtype=DataType.INT32),
    FieldSchema(name="int64", dtype=DataType.INT64),
    FieldSchema(name="float", dtype=DataType.FLOAT),
    FieldSchema(name="double", dtype=DataType.DOUBLE),
    FieldSchema(name="bool", dtype=DataType.BOOL),
    FieldSchema(name="varchar", dtype=DataType.VARCHAR, max_length=1000),
    FieldSchema(name="random", dtype=DataType.DOUBLE),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=dim),
]
schema = CollectionSchema(fields)
collection = Collection("demo", schema)
```

Then an inverted index can be created for each field via:

```python
index_type = "INVERTED"
collection.create_index("int8", {"index_type": index_type})
collection.create_index("int16", {"index_type": index_type})
collection.create_index("int32", {"index_type": index_type})
collection.create_index("int64", {"index_type": index_type})
collection.create_index("float", {"index_type": index_type})
collection.create_index("double", {"index_type": index_type})
collection.create_index("bool", {"index_type": index_type})
collection.create_index("varchar", {"index_type": index_type})
```

After that, term queries and range queries on these fields are sped up automatically by the inverted index:

```python
result = collection.query(expr='int64 in [1, 2, 3]', output_fields=["pk"])
result = collection.query(expr='int64 < 5', output_fields=["pk"])
result = collection.query(expr='int64 > 2997', output_fields=["pk"])
result = collection.query(expr='1 < int64 < 5', output_fields=["pk"])
```

--------- Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2023-12-31 19:50:47 +08:00
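To illustrate why an inverted index helps the term and range queries described in commit 3f46c6d459, here is a minimal toy sketch. It is not how Milvus or tantivy implement the index; the class `ToyInvertedIndex` and its methods are hypothetical names made up for this example. It maps each distinct value to the row offsets that hold it, and, like the `VARCHAR` behaviour noted in the commit, treats each value as one whole keyword.

```python
import bisect
from collections import defaultdict


class ToyInvertedIndex:
    """Hypothetical toy index: distinct value -> list of row offsets holding it."""

    def __init__(self, values):
        postings = defaultdict(list)
        for offset, value in enumerate(values):
            postings[value].append(offset)
        self._postings = dict(postings)
        # Sorted distinct values let range queries use binary search.
        self._terms = sorted(self._postings)

    def term_query(self, candidates):
        """Offsets of rows whose value is one of `candidates` (like `x in [...]`)."""
        hits = set()
        for value in candidates:
            hits.update(self._postings.get(value, ()))
        return sorted(hits)

    def range_query(self, lower=None, upper=None):
        """Offsets of rows with lower < value < upper; a bound of None is open."""
        start = 0 if lower is None else bisect.bisect_right(self._terms, lower)
        stop = len(self._terms) if upper is None else bisect.bisect_left(self._terms, upper)
        hits = []
        for term in self._terms[start:stop]:
            hits.extend(self._postings[term])
        return sorted(hits)


column = [3, 1, 4, 1, 5, 9, 2, 6]
index = ToyInvertedIndex(column)
print(index.term_query([1, 2, 3]))          # rows matching `x in [1, 2, 3]`
print(index.range_query(lower=1, upper=5))  # rows matching `1 < x < 5`
```

Neither query scans the whole column: a term query is a dictionary lookup per candidate, and a range query touches only the distinct values inside the bounds. The sketch stores offsets only, which mirrors the commit's note that operations needing the raw data must read it from elsewhere.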

11 Commits
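
Commit e549148a19 describes translating wildcard patterns into regex queries. The following is a minimal sketch of that idea, not the translation code Milvus uses. The function names `like_to_regex` and `like_match` are hypothetical, and the handling of `_` as a single-character wildcard and of backslash as the escape character are assumptions borrowed from the common SQL `LIKE` convention; the commit message itself only mentions `%`.

```python
import re


def like_to_regex(pattern: str, escape: str = "\\") -> str:
    """Translate a LIKE-style pattern into a regex string (assumed semantics)."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            # An escaped character always matches itself literally.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            parts.append(".*")  # any run of characters, including none
        elif ch == "_":
            parts.append(".")   # exactly one character (assumption)
        else:
            parts.append(re.escape(ch))  # regex metacharacters become literals
        i += 1
    return "".join(parts)


def like_match(pattern: str, value: str) -> bool:
    """Return True when the whole value matches the LIKE-style pattern."""
    return re.fullmatch(like_to_regex(pattern), value, flags=re.DOTALL) is not None


print(like_match("prefix%", "prefix_abc"))   # prefix match, the form supported before the PR
print(like_match("%mid%", "left mid right")) # infix match, a more general pattern
```

Escaping every non-wildcard character matters: without it, a pattern such as `a.b%` would let the `.` match any character once handed to the regex engine.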