milvus/internal/storage
Jiquan Long 3f46c6d459
feat: support inverted index (#28783)
issue: https://github.com/milvus-io/milvus/issues/27704

Add an inverted index for some data types in Milvus. Compared with
loading all data into RAM, this index type can save a lot of memory, and
it speeds up term queries and range queries.
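
To illustrate why an inverted index helps with term and range queries,
here is a minimal pure-Python sketch. It is a toy model under stated
assumptions, not the Milvus implementation: the class name
`ToyInvertedIndex` and its methods are hypothetical. It maps each
distinct value to the row offsets that hold it, and keeps the distinct
values sorted so that a range query becomes a binary search.

```python
import bisect
from collections import defaultdict


class ToyInvertedIndex:
    """Hypothetical toy index: value -> list of row offsets."""

    def __init__(self, column):
        postings = defaultdict(list)
        for offset, value in enumerate(column):
            postings[value].append(offset)
        self.postings = dict(postings)
        # Sorted distinct values allow binary search for range queries.
        self.terms = sorted(self.postings)

    def term_query(self, values):
        """Row offsets whose value is in `values` (like `field in [...]`)."""
        hits = []
        for value in set(values):
            hits.extend(self.postings.get(value, []))
        return sorted(hits)

    def range_query(self, lower=None, upper=None):
        """Row offsets with lower < value < upper; None means unbounded."""
        start = 0 if lower is None else bisect.bisect_right(self.terms, lower)
        stop = len(self.terms) if upper is None else bisect.bisect_left(self.terms, upper)
        hits = []
        for term in self.terms[start:stop]:
            hits.extend(self.postings[term])
        return sorted(hits)


index = ToyInvertedIndex([3, 1, 4, 1, 5, 9, 2, 6])
print(index.term_query([1, 2, 3]))          # rows matching `x in [1, 2, 3]`
print(index.range_query(lower=1, upper=5))  # rows matching `1 < x < 5`
```

Neither query scans the whole column: a term query is one dictionary
lookup per value, and a range query touches only the terms inside the
range.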

Supported: `INT8`, `INT16`, `INT32`, `INT64`, `FLOAT`, `DOUBLE`, `BOOL`
and `VARCHAR`.

Not supported: `ARRAY` and `JSON`.

Note:
- The inverted index for `VARCHAR` is not designed to serve full-text
search yet. Every row is treated as a single keyword instead of being
tokenized into multiple terms.
- The inverted index does not support retrieval well, so if you create
an inverted index on a field, operations that depend on the raw data
fall back to chunk storage, which brings some performance loss. Examples
are comparisons between two columns and retrieval of output fields.
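
The first note can be illustrated with a small pure-Python sketch. It is
an illustration under assumptions, not Milvus code:
`build_keyword_postings` and `build_tokenized_postings` are hypothetical
helpers, and whitespace splitting stands in for a real tokenizer. With
whole-row keywords, only an exact match on the full string finds a row.

```python
from collections import defaultdict


def build_keyword_postings(rows):
    """Treat every row as one keyword (the behavior described in the note)."""
    postings = defaultdict(list)
    for offset, text in enumerate(rows):
        postings[text].append(offset)
    return dict(postings)


def build_tokenized_postings(rows):
    """Full-text style for contrast: split each row into terms (assumed tokenizer)."""
    postings = defaultdict(list)
    for offset, text in enumerate(rows):
        for token in set(text.split()):
            postings[token].append(offset)
    return dict(postings)


rows = ["red apple", "green apple", "apple"]
keyword = build_keyword_postings(rows)
tokenized = build_tokenized_postings(rows)

print(keyword.get("apple", []))    # only the row that equals "apple"
print(tokenized.get("apple", []))  # every row that contains the term "apple"
```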

The inverted index is easy to use.

Take the collection below as an example:

```python
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections

# Assumption: a Milvus instance is reachable at the default local address.
connections.connect(host="localhost", port="19530")

dim = 128  # vector dimension; the value is an arbitrary choice for this example

fields = [
    FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100),
    FieldSchema(name="int8", dtype=DataType.INT8),
    FieldSchema(name="int16", dtype=DataType.INT16),
    FieldSchema(name="int32", dtype=DataType.INT32),
    FieldSchema(name="int64", dtype=DataType.INT64),
    FieldSchema(name="float", dtype=DataType.FLOAT),
    FieldSchema(name="double", dtype=DataType.DOUBLE),
    FieldSchema(name="bool", dtype=DataType.BOOL),
    FieldSchema(name="varchar", dtype=DataType.VARCHAR, max_length=1000),
    FieldSchema(name="random", dtype=DataType.DOUBLE),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=dim),
]
schema = CollectionSchema(fields)
collection = Collection("demo", schema)
```

Then we can create an inverted index on each field via:

```python
index_type = "INVERTED"
collection.create_index("int8", {"index_type": index_type})
collection.create_index("int16", {"index_type": index_type})
collection.create_index("int32", {"index_type": index_type})
collection.create_index("int64", {"index_type": index_type})
collection.create_index("float", {"index_type": index_type})
collection.create_index("double", {"index_type": index_type})
collection.create_index("bool", {"index_type": index_type})
collection.create_index("varchar", {"index_type": index_type})
```

Then, term queries and range queries on these fields are sped up
automatically by the inverted index:

```python
result = collection.query(expr='int64 in [1, 2, 3]', output_fields=["pk"])
result = collection.query(expr='int64 < 5', output_fields=["pk"])
result = collection.query(expr='int64 > 2997', output_fields=["pk"])
result = collection.query(expr='1 < int64 < 5', output_fields=["pk"])
```
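
The retrieval limitation mentioned in the note above can be illustrated
with a toy model. This is a sketch under assumptions, not how Milvus
stores data: `value_at_from_postings` is a hypothetical helper, and a
plain list stands in for chunk storage. Reading one row's value from raw
storage is a direct lookup, while recovering it from value-to-offsets
postings requires searching the postings.

```python
def value_at_from_postings(postings, offset):
    """Recover a row's value from value -> offsets postings by scanning."""
    for value, offsets in postings.items():
        if offset in offsets:
            return value
    raise KeyError(offset)


raw_column = [30, 10, 20, 10]  # stand-in for raw chunk storage

# Build value -> row offsets postings from the raw column.
postings = {}
for offset, value in enumerate(raw_column):
    postings.setdefault(value, []).append(offset)

print(raw_column[2])                        # direct lookup in raw storage
print(value_at_from_postings(postings, 2))  # same value, found by scanning
```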

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2023-12-31 19:50:47 +08:00
| Name | Last commit | Date |
| --- | --- | --- |
| aliyun | Identify service providers based on addresses (#27907) | 2023-10-25 17:28:10 +08:00 |
| gcp | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| azure_object_storage_test.go | fix azure ListObjects (#27931) | 2023-11-01 11:34:14 +08:00 |
| azure_object_storage.go | enhance: Support importing data with parquet file (#28608) | 2023-11-29 20:52:27 +08:00 |
| binlog_iterator_test.go | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| binlog_iterator.go | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| binlog_reader.go | Move some modules from internal to public package (#22572) | 2023-04-06 19:14:32 +08:00 |
| binlog_test.go | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| binlog_util_test.go | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| binlog_util.go | Move some modules from internal to public package (#22572) | 2023-04-06 19:14:32 +08:00 |
| binlog_writer_test.go | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| binlog_writer.go | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| data_codec_test.go | feat: support inverted index (#28783) | 2023-12-31 19:50:47 +08:00 |
| data_codec.go | feat: support inverted index (#28783) | 2023-12-31 19:50:47 +08:00 |
| data_sorter_test.go | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| data_sorter.go | Add float16 vector (#25852) | 2023-09-08 10:03:16 +08:00 |
| event_data.go | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| event_header.go | Move some modules from internal to public package (#22572) | 2023-04-06 19:14:32 +08:00 |
| event_reader.go | Use go-api/v2 for milvus-proto (#24770) | 2023-06-09 01:28:37 +08:00 |
| event_test.go | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| event_writer_test.go | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| event_writer.go | Add go payload writer (#24656) (#24762) | 2023-06-09 13:52:39 +08:00 |
| factory.go | Use OpenDAL to access object store (#25642) | 2023-11-01 09:00:14 +08:00 |
| file_test.go | enhance: Support importing data with parquet file (#28608) | 2023-11-29 20:52:27 +08:00 |
| file.go | enhance: Support importing data with parquet file (#28608) | 2023-11-29 20:52:27 +08:00 |
| index_data_codec_test.go | Check error by Error() and NoError() for better report message (#24736) | 2023-06-08 15:36:36 +08:00 |
| index_data_codec.go | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| insert_data_test.go | feat: support inverted index (#28783) | 2023-12-31 19:50:47 +08:00 |
| insert_data.go | feat: support inverted index (#28783) | 2023-12-31 19:50:47 +08:00 |
| local_chunk_manager_test.go | enhance: Remove vector chunk manager (#28569) | 2023-11-30 18:00:33 +08:00 |
| local_chunk_manager.go | Refine chunk manager errors (#27590) | 2023-10-31 12:18:15 +08:00 |
| minio_chunk_manager_test.go | Refine chunk manager errors (#27590) | 2023-10-31 12:18:15 +08:00 |
| minio_chunk_manager.go | fix: Fix minio latency monitoring for get operation (#28510) | 2023-11-28 10:00:27 +08:00 |
| minio_object_storage_test.go | fix: Align minio object storage ut to new minio server behavior (#29014) | 2023-12-06 15:42:43 +08:00 |
| minio_object_storage.go | fix azure ListObjects (#27931) | 2023-11-01 11:34:14 +08:00 |
| options.go | Add chunk manager request timeout (#27692) | 2023-10-23 20:08:08 +08:00 |
| OWNERS | [skip ci]Update OWNERS files (#11898) | 2021-11-16 15:41:11 +08:00 |
| payload_reader_test.go | Update arrow version to v12 (#28425) | 2023-11-15 10:36:19 +08:00 |
| payload_reader.go | Update arrow version to v12 (#28425) | 2023-11-15 10:36:19 +08:00 |
| payload_test.go | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| payload_writer.go | Update arrow version to v12 (#28425) | 2023-11-15 10:36:19 +08:00 |
| payload.go | Add float16 vector (#25852) | 2023-09-08 10:03:16 +08:00 |
| pk_statistics.go | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| primary_key_test.go | Use go-api/v2 for milvus-proto (#24770) | 2023-06-09 01:28:37 +08:00 |
| primary_key.go | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| print_binlog_test.go | Remove deprecated io/ioutil usage (#27747) | 2023-10-17 20:32:09 +08:00 |
| print_binlog.go | Format the code (#27275) | 2023-09-21 09:45:27 +08:00 |
| remote_chunk_manager_test.go | Refine chunk manager errors (#27590) | 2023-10-31 12:18:15 +08:00 |
| remote_chunk_manager.go | fix: Fix minio latency monitoring for get operation (#28510) | 2023-11-28 10:00:27 +08:00 |
| stats_test.go | Add retry time when lazy load BF (#25096) | 2023-06-25 11:32:43 +08:00 |
| stats.go | enhance: add param for bloomfilter(#29388) (#29490) | 2023-12-28 18:10:46 +08:00 |
| storage_test.go | enhance: Remove vector chunk manager (#28569) | 2023-11-30 18:00:33 +08:00 |
| types.go | enhance: Support importing data with parquet file (#28608) | 2023-11-29 20:52:27 +08:00 |
| unsafe_test.go | [skip e2e]Update license for storage unsafe (#14452) | 2021-12-28 20:03:56 +08:00 |
| unsafe.go | [skip e2e]Update license for storage unsafe (#14452) | 2021-12-28 20:03:56 +08:00 |
| utils_test.go | Add float16 vector (#25852) | 2023-09-08 10:03:16 +08:00 |
| utils.go | Fix buffer FieldData has no ElementType and array logsize always zero (#28295) | 2023-11-09 14:16:20 +08:00 |