milvus/internal/util
Jiquan Long 3f46c6d459
feat: support inverted index (#28783)
issue: https://github.com/milvus-io/milvus/issues/27704

Add an inverted index for some data types in Milvus. This index type can
save a lot of memory compared to loading all data into RAM, and it speeds
up term queries and range queries.
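
To make the idea concrete, here is a toy sketch of how an inverted index can answer term and range queries. It is an illustration only and is not how Milvus implements the index; the class name `ToyInvertedIndex` and its methods are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class ToyInvertedIndex:
    """Illustrative only: maps each distinct value to the row ids holding it."""

    def __init__(self, values):
        postings = defaultdict(list)
        for row_id, value in enumerate(values):
            postings[value].append(row_id)
        self.postings = dict(postings)
        # Sorted distinct values let range queries use binary search.
        self.sorted_values = sorted(self.postings)

    def term_query(self, terms):
        """Row ids whose value equals any of `terms` (like `field in [...]`)."""
        rows = []
        for term in terms:
            rows.extend(self.postings.get(term, []))
        return sorted(rows)

    def range_query(self, low, high):
        """Row ids whose value v satisfies low < v < high (both exclusive)."""
        start = bisect_right(self.sorted_values, low)
        end = bisect_left(self.sorted_values, high)
        rows = []
        for value in self.sorted_values[start:end]:
            rows.extend(self.postings[value])
        return sorted(rows)


index = ToyInvertedIndex([3, 1, 4, 1, 5, 9, 2, 6])
term_rows = index.term_query([1, 2, 3])   # rows whose value is 1, 2 or 3
range_rows = index.range_query(1, 5)      # rows whose value is between 1 and 5
```

Both lookups touch only the postings of the matching values, so neither has to scan every row.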

Supported: `INT8`, `INT16`, `INT32`, `INT64`, `FLOAT`, `DOUBLE`, `BOOL`
and `VARCHAR`.

Not supported: `ARRAY` and `JSON`.

Note:
- The inverted index for `VARCHAR` is not designed to serve full-text
search yet. Every row is treated as a single whole keyword instead of
being tokenized into multiple terms.
- The inverted index doesn't support retrieval well, so if you create an
inverted index on a field, operations that depend on the raw data
fall back to chunk storage, which costs some performance. Examples are
comparisons between two columns and retrieval of output fields.
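
The whole-keyword behavior for `VARCHAR` described in the first note can be sketched in plain Python. This is a simplified model under the assumption that each string is indexed as one key; `build_keyword_index` is a hypothetical helper, not a Milvus API.

```python
from collections import defaultdict


def build_keyword_index(rows):
    """Index each string as one whole keyword, with no tokenization."""
    postings = defaultdict(list)
    for row_id, text in enumerate(rows):
        postings[text].append(row_id)
    return dict(postings)


rows = ["red apple", "apple", "red apple", "green pear"]
keyword_index = build_keyword_index(rows)

# An exact match on the full string finds the rows that hold it.
exact_match = keyword_index.get("red apple", [])
# A single word inside a longer string is not a key, so nothing matches.
single_word = keyword_index.get("red", [])
```

In this model a lookup for `"red"` returns no rows even though two rows contain that word, which is why the index does not serve full-text search.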

The inverted index is easy to use.

Take the collection below as an example:

```python
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

# Assumes a Milvus connection has already been opened with connections.connect().
dim = 128  # example vector dimension; the original snippet leaves `dim` undefined

fields = [
    FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100),
    FieldSchema(name="int8", dtype=DataType.INT8),
    FieldSchema(name="int16", dtype=DataType.INT16),
    FieldSchema(name="int32", dtype=DataType.INT32),
    FieldSchema(name="int64", dtype=DataType.INT64),
    FieldSchema(name="float", dtype=DataType.FLOAT),
    FieldSchema(name="double", dtype=DataType.DOUBLE),
    FieldSchema(name="bool", dtype=DataType.BOOL),
    FieldSchema(name="varchar", dtype=DataType.VARCHAR, max_length=1000),
    FieldSchema(name="random", dtype=DataType.DOUBLE),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=dim),
]
schema = CollectionSchema(fields)
collection = Collection("demo", schema)
```

Then create an inverted index on each field:

```python
index_type = "INVERTED"
collection.create_index("int8", {"index_type": index_type})
collection.create_index("int16", {"index_type": index_type})
collection.create_index("int32", {"index_type": index_type})
collection.create_index("int64", {"index_type": index_type})
collection.create_index("float", {"index_type": index_type})
collection.create_index("double", {"index_type": index_type})
collection.create_index("bool", {"index_type": index_type})
collection.create_index("varchar", {"index_type": index_type})
```
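
The repeated calls can also be written as a loop. The sketch below assumes only that the collection object exposes `create_index(field_name, index_params)` as used above; `create_inverted_indexes` and `RecordingCollection` are hypothetical names, and the stand-in class exists so the snippet runs without a Milvus server.

```python
INVERTED_INDEX_PARAMS = {"index_type": "INVERTED"}


def create_inverted_indexes(collection, field_names):
    """Create an INVERTED index on each named scalar field."""
    for field_name in field_names:
        # Copy the params so each call gets its own dict.
        collection.create_index(field_name, dict(INVERTED_INDEX_PARAMS))


class RecordingCollection:
    """Hypothetical stand-in for a Collection; it only records the calls."""

    def __init__(self):
        self.calls = []

    def create_index(self, field_name, index_params):
        self.calls.append((field_name, index_params))


fake = RecordingCollection()
create_inverted_indexes(fake, ["int8", "varchar"])
```

With a real collection, passing `collection` instead of `fake` would issue the same `create_index` calls as the explicit list above.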

Term queries and range queries on these fields are then sped up
automatically by the inverted index:

```python
result = collection.query(expr='int64 in [1, 2, 3]', output_fields=["pk"])
result = collection.query(expr='int64 < 5', output_fields=["pk"])
result = collection.query(expr='int64 > 2997', output_fields=["pk"])
result = collection.query(expr='1 < int64 < 5', output_fields=["pk"])
```

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2023-12-31 19:50:47 +08:00

| Name | Last commit | Date |
| --- | --- | --- |
| componentutil | enhance:add some log when create client and get component states (#28160) | 2023-11-22 09:12:22 +08:00 |
| dependency | enhance:add some log when create client and get component states (#28160) | 2023-11-22 09:12:22 +08:00 |
| flowgraph | fix: add back existing datanode metrics (#29360) | 2023-12-22 14:20:43 +08:00 |
| funcutil | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| grpcclient | fix: grpc client check session skipped due to role not match (#29356) | 2023-12-21 10:12:51 +08:00 |
| importutil | fix: Import data from parquet file in streaming way (#29514) | 2023-12-27 15:30:46 +08:00 |
| indexcgowrapper | feat: support inverted index (#28783) | 2023-12-31 19:50:47 +08:00 |
| initcore | Use OpenDAL to access object store (#25642) | 2023-11-01 09:00:14 +08:00 |
| metrics | fix: symbol 'GetStorageMetrics' and 'enableDynamicField' (#28580) | 2023-11-21 10:20:22 +08:00 |
| mock | enhance: Use mockery to replace manual mock code (#29074) | 2023-12-13 10:46:44 +08:00 |
| pipeline | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| proxyutil | enhance: Move proxy client manager to util package (#28955) | 2023-12-20 19:22:42 +08:00 |
| segmentutil | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| sessionutil | disable auto balance when old node exists (#28191) | 2023-11-07 14:02:20 +08:00 |
| streamrpc | Add querynode client wrapper and avoid grpc in standalone mode (#27781) | 2023-10-19 11:10:07 +08:00 |
| tsoutil | tikv integration (#26246) | 2023-09-07 07:25:14 +08:00 |
| typeutil | feat: integrate storagev2 into index build process (#28995) | 2023-12-13 17:24:38 +08:00 |
| wrappers | Add querynode client wrapper and avoid grpc in standalone mode (#27781) | 2023-10-19 11:10:07 +08:00 |