# Segment Interface

## External Interface

1. `get_row_count`: Get the number of entities in the segment
2. `get_schema`: Get the corresponding collection schema in the segment
3. `GetMemoryUsageInBytes`: Get memory usage of a segment
4. `Search(plan, placeholderGroup, timestamp) -> QueryResult`: Perform a search according to the plan, which contains the search parameters and predicate conditions, and return the search results. Ensure that the timestamp of every search result is before the specified timestamp (MVCC)
5. `FillTargetEntry(plan, &queryResult)`: Fill the missing column data for search results based on target columns in the plan

For design details, see `${milvus_root}/internal/core/src/segcore/SegmentInterface.h`

## Basic Concepts

1. Segment: Data is sharded into segments based on write timestamp; the sharding logic is controlled by the data coordinator.
2. Chunk: A further division of segment data. A chunk holds contiguous data for each column.
   - Each sealed segment has exactly one chunk.
   - In a growing segment, chunks are currently divided by a fixed number of rows, so the number of chunks increases as data is ingested.
3. Span: Similar to `std::span`; points to contiguous data in memory.
4. SystemField: An extra field that stores system information. Currently these are the RowID and Timestamp fields.
5. SegOffset: The identifier of an entity within the segment.
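The relationship between chunks and segment offsets in a growing segment can be illustrated with a small sketch. It assumes that a SegOffset is a zero-based row position and that every chunk holds exactly `size_per_chunk` rows; `ChunkPosition`, `locate`, and `chunk_count` are hypothetical names used only for illustration and are not part of segcore.

```cpp
#include <cstdint>

// Hypothetical helper type, not part of segcore: the position of one row
// expressed as (chunk id, offset inside that chunk).
struct ChunkPosition {
    int64_t chunk_id;
    int64_t offset_in_chunk;
};

// Assumption: seg_offset is a zero-based row position and all chunks
// have the same fixed number of rows (size_per_chunk > 0).
inline ChunkPosition locate(int64_t seg_offset, int64_t size_per_chunk) {
    return ChunkPosition{seg_offset / size_per_chunk,
                         seg_offset % size_per_chunk};
}

// Number of chunks needed to hold row_count rows (rounded up), which is
// why the chunk count grows as data is ingested.
inline int64_t chunk_count(int64_t row_count, int64_t size_per_chunk) {
    return (row_count + size_per_chunk - 1) / size_per_chunk;
}
```

Under these assumptions, a sealed segment corresponds to the special case where the single chunk covers all rows.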

## SegmentInternalInterface internal functions

1. `num_chunk()`: total chunk number
2. `size_per_chunk()`: length of each chunk
3. `get_active_count(Timestamp)`: the number of entities remaining after filtering by Timestamp
4. `chunk_data(FieldOffset, chunk_id) -> Span<T>`: return continuous data for specified column and chunk
5. `chunk_scalar_index(FieldOffset, chunk_id) -> const StructuredIndex<T>&`: return the inverted index of specified column and chunk
6. `num_chunk_index`: the number of indexes (both scalar and vector indexes) that have been created:
   1. In a growing segment, this value is the number of chunks for which the inverted index has been created. In these chunks, the index can be used to speed up computation.
   2. In a sealed segment, this value must be 1
7. `debug()`: debug is used to print extra information while debugging
8. `vector_search (vec_count, query..., timestamp, bitset, output)`: Search the vector column
   1. `vec_count`: specifies how many entities participate in the vector search; the remaining entities are filtered out because their timestamps are larger than the specified timestamp. This parameter is mainly used in growing segments for multi-version concurrency control (MVCC)
   2. `query...`: multiple variables that jointly specify the search parameters and the search vector
   3. `timestamp`: used for time travel; filters data by timestamp. Mainly for sealed segments
   4. `bitset`: the calculated bit mask, as an output
   5. `output`: output QueryResult
9. `bulk_subscript(FieldOffset|SystemField, seg_offsets..., output)`:
   - Given `seg_offsets`, compute `results[i] = FieldData[seg_offsets[i]]`. Used by GetEntityByIds
   - `FieldData` is selected by FieldOffset or SystemField
10. `search_ids(IdArray, timestamp) -> pair<IdArray, SegOffsets>`:
    1. Find the SegOffset that corresponds to each primary key in the IdArray
    2. The order of the results is not guaranteed, but the two returned arrays must correspond to each other one to one.
    3. Entities without PKs will not be returned
11. `check_search(Plan)`: check if the Plan is valid
    1. It mainly checks whether the columns used in the plan have been loaded
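
The behavior of `get_active_count`, `bulk_subscript`, and `search_ids` described above can be sketched with a toy in-memory segment. This is only an illustration under stated assumptions, not the segcore implementation: `ToySegment` and its `insert` method are hypothetical, the segment has a single int64 field, timestamps are assumed to be non-decreasing in insertion order, and a row is treated as visible when its timestamp is less than or equal to the query timestamp.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Timestamp = uint64_t;
using SegOffset = int64_t;

// Hypothetical toy segment with one int64 field; not the segcore class.
struct ToySegment {
    int64_t size_per_chunk;                    // fixed rows per chunk
    std::vector<std::vector<int64_t>> chunks;  // field data, one vector per chunk
    std::vector<Timestamp> timestamps;         // Timestamp system field (assumed sorted)
    std::unordered_map<int64_t, SegOffset> pk_to_offset;  // primary key -> SegOffset

    // Append one row, opening a new chunk when the last one is full.
    void insert(int64_t pk, int64_t value, Timestamp ts) {
        if (chunks.empty() ||
            static_cast<int64_t>(chunks.back().size()) == size_per_chunk) {
            chunks.emplace_back();
        }
        chunks.back().push_back(value);
        pk_to_offset[pk] = static_cast<SegOffset>(timestamps.size());
        timestamps.push_back(ts);
    }

    int64_t num_chunk() const { return static_cast<int64_t>(chunks.size()); }

    // Count rows whose timestamp is <= ts (binary search on sorted timestamps).
    int64_t get_active_count(Timestamp ts) const {
        auto it = std::upper_bound(timestamps.begin(), timestamps.end(), ts);
        return it - timestamps.begin();
    }

    // results[i] = FieldData[seg_offsets[i]], resolved through the chunks.
    std::vector<int64_t> bulk_subscript(const std::vector<SegOffset>& seg_offsets) const {
        std::vector<int64_t> results;
        results.reserve(seg_offsets.size());
        for (SegOffset off : seg_offsets) {
            results.push_back(chunks[off / size_per_chunk][off % size_per_chunk]);
        }
        return results;
    }

    // Return the primary keys that exist and are visible at ts, together with
    // their offsets; the two vectors correspond element by element.
    std::pair<std::vector<int64_t>, std::vector<SegOffset>>
    search_ids(const std::vector<int64_t>& ids, Timestamp ts) const {
        std::vector<int64_t> found_ids;
        std::vector<SegOffset> offsets;
        for (int64_t id : ids) {
            auto it = pk_to_offset.find(id);
            if (it != pk_to_offset.end() && timestamps[it->second] <= ts) {
                found_ids.push_back(id);
                offsets.push_back(it->second);
            }
        }
        return {found_ids, offsets};
    }
};
```

A GetEntityByIds-style lookup would, in this sketch, call `search_ids` first and then pass the returned offsets to `bulk_subscript`. The hash map used for primary-key lookup is a simplification chosen for brevity; the source does not state how segcore maps primary keys to offsets.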