# SegmentGrowing
A growing segment provides the following additional interfaces:

1. `PreInsert(size) -> reservedOffset`: a serial interface that reserves space for a future insertion and returns `reservedOffset`.
2. `Insert(reservedOffset, size, ...Data...)`: writes `...Data...` into the range `[reservedOffset, reservedOffset + size)`. This interface may be called concurrently.
   1. `...Data...` contains two system attributes, row_ids and timestamps, plus the other columns.
   2. Data columns can be stored in either row-based or column-based format.
3. `PreDelete` and `Delete(reservedOffset, row_ids, timestamps)`: the delete interface, which works like the insert interface.

A growing segment stores data in chunks. The number of rows in each chunk is limited by the `size_per_chunk` config parameter.

When inserting, first allocate enough chunks so that `total_size <= num_chunk * size_per_chunk`, and then convert the data from row format to column format.
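
The allocation rule above amounts to a ceiling division. The following is a minimal sketch of that arithmetic only; `chunks_needed` is a hypothetical helper introduced for illustration and is not part of segcore.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the smallest num_chunk that satisfies
// total_size <= num_chunk * size_per_chunk.
inline int64_t chunks_needed(int64_t total_size, int64_t size_per_chunk) {
    assert(total_size >= 0 && size_per_chunk > 0);
    return (total_size + size_per_chunk - 1) / size_per_chunk;
}
```

For example, with `size_per_chunk = 4`, inserting 5 rows in total needs 2 chunks, while exactly 4 rows fit in 1 chunk.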

During a search, every chunk is searched and its results are saved as a subquery result; the subquery results are then reduced into the final TopK.
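
The reduce step can be pictured as merging the per-chunk results and keeping the best `k`. This is only a sketch under stated assumptions: `Hit` and `reduce_topk` are hypothetical names, and a smaller distance is assumed to be better (as with an L2-style metric); the actual segcore reduce logic may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// (distance, row offset) pair; a smaller distance is assumed to be better.
using Hit = std::pair<float, int64_t>;

// Hypothetical reduce step: merge the per-chunk subquery results and keep the k best hits.
inline std::vector<Hit> reduce_topk(const std::vector<std::vector<Hit>>& sub_results, std::size_t k) {
    std::vector<Hit> merged;
    for (const auto& sub : sub_results) {
        merged.insert(merged.end(), sub.begin(), sub.end());
    }
    const std::size_t keep = std::min(k, merged.size());
    // Only the first `keep` positions need to be sorted.
    std::partial_sort(merged.begin(), merged.begin() + static_cast<std::ptrdiff_t>(keep), merged.end());
    merged.resize(keep);
    return merged;
}
```

Each inner vector stands for one chunk's subquery result; the returned vector is sorted by ascending distance.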

A growing segment also implements a small-batch index for vectors. The parameters of the small-batch index are preset in the `segcore config`.

When `metric type` is specified in the schema, an index is built for each chunk with the default parameters to accelerate queries.
## SegmentGrowingImpl internal
1. SegcoreConfig: contains the parameters for Segcore; it must be specified before a segment is created.
2. InsertRecord: holds the inserted data.
3. DeleteRecord: placeholder until delete is implemented.
4. IndexingRecord: holds data with a small-batch index.
5. SealedIndexing: a record that is no longer used.
### SegcoreConfig
1. Manages `chunk_size` and the small-batch index parameters.
2. `parse_from` can parse the parameters from a YAML file (this function is not enabled by default).
   - Refer to `${milvus}/internal/core/unittest/test_utils/test_segcore.yaml`.
3. `default_config` provides the default parameters.
### InsertRecord
Manages concurrently inserted data. It includes:

1. `atomic<int64_t> reserved`: tracks how much space has been reserved.
2. `AckResponder`: calculates which segment to insert into and returns the current segment offset.
3. `ConcurrentVector`: stores the data columns, with one concurrent vector per column.

An insert executes the following steps:

1. Serially execute `PreInsert(size) -> reserved_offset` to allocate memory space; the range `[reserved_offset, reserved_offset + size)` is reserved.
2. Execute `Insert(reserved_offset, size, ...Data...)` in parallel to copy the data into the reserved range:
   - First, for the `ConcurrentVector` of each column, call `grow_to_at_least` to reserve space.
   - For each column, call `set_data_raw` to put the data into the corresponding locations.
   - When the copy finishes, call `AddSegment` of `AckResponder` to mark the range `[reserved_offset, reserved_offset + size)` as inserted.
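
The reserve-then-acknowledge flow can be sketched as follows. This is a simplified illustration and not the segcore implementation: `SimpleAckResponder`, `SimpleInsertRecord`, `GetAck`, `AckedRows`, and `RowId` are hypothetical names, a single fixed-capacity `row_ids` column stands in for the per-column `ConcurrentVector`s, and the acknowledgement logic is assumed to track the contiguous prefix of finished rows.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <vector>

// Simplified stand-in for AckResponder: tracks the contiguous prefix [0, ack)
// of rows whose Insert call has finished.
class SimpleAckResponder {
 public:
    void AddSegment(int64_t begin, int64_t end) {
        std::lock_guard<std::mutex> lock(mutex_);
        finished_[begin] = end;
        // Extend the acknowledged prefix over every finished range that touches it.
        for (auto it = finished_.find(ack_); it != finished_.end(); it = finished_.find(ack_)) {
            ack_ = it->second;
            finished_.erase(it);
        }
    }
    int64_t GetAck() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return ack_;
    }

 private:
    mutable std::mutex mutex_;
    std::map<int64_t, int64_t> finished_;  // begin -> end of ranges not yet merged into the prefix
    int64_t ack_ = 0;
};

class SimpleInsertRecord {
 public:
    explicit SimpleInsertRecord(std::size_t capacity) : row_ids_(capacity) {}

    // Step 1 (serial in the design above): reserve [offset, offset + size).
    int64_t PreInsert(int64_t size) {
        assert(size >= 0);
        return reserved_.fetch_add(size);
    }

    // Step 2 (may run concurrently): fill the reserved range, then acknowledge it.
    void Insert(int64_t reserved_offset, int64_t size, const int64_t* row_ids) {
        assert(reserved_offset >= 0 &&
               static_cast<std::size_t>(reserved_offset + size) <= row_ids_.size());
        std::copy_n(row_ids, size, row_ids_.begin() + reserved_offset);
        ack_.AddSegment(reserved_offset, reserved_offset + size);
    }

    int64_t AckedRows() const { return ack_.GetAck(); }
    int64_t RowId(int64_t offset) const { return row_ids_[static_cast<std::size_t>(offset)]; }

 private:
    std::atomic<int64_t> reserved_{0};
    SimpleAckResponder ack_;
    std::vector<int64_t> row_ids_;  // fixed capacity, so disjoint ranges can be written concurrently
};
```

In this sketch, if two ranges are reserved and the second one finishes first, the acknowledged row count stays at 0 until the first range is also inserted, so readers never observe a gap.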
### ConcurrentVector
A column data store that supports concurrent insertion. It is composed of multiple data chunks.

1. `grow_to_at_least(size)`: reserves space of no less than `size`.
2. `set_data_raw(element_offset, source, element_count)`: copies `element_count` elements, starting at `element_offset`, from `source`, which points to a contiguous piece of data.
3. `get_span(chunk_id)`: returns the span of the corresponding chunk.
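
To make the three calls concrete, here is a minimal sketch of a chunked column store. It is an assumption-laden illustration rather than the real `ConcurrentVector`: the class name `SimpleChunkedVector` is hypothetical, a plain mutex guards the chunk list, and `get_span` returns a `(pointer, length)` pair instead of a dedicated span type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// Simplified stand-in for ConcurrentVector: fixed-size chunks whose buffers
// never move once allocated.
template <typename T>
class SimpleChunkedVector {
 public:
    explicit SimpleChunkedVector(int64_t size_per_chunk) : size_per_chunk_(size_per_chunk) {
        assert(size_per_chunk > 0);
    }

    // Allocate chunks until the total capacity is no less than `size`.
    void grow_to_at_least(int64_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        const int64_t needed = (size + size_per_chunk_ - 1) / size_per_chunk_;
        while (static_cast<int64_t>(chunks_.size()) < needed) {
            chunks_.emplace_back(static_cast<std::size_t>(size_per_chunk_));
        }
    }

    // Copy `element_count` elements from `source` into
    // [element_offset, element_offset + element_count), splitting at chunk boundaries.
    void set_data_raw(int64_t element_offset, const T* source, int64_t element_count) {
        while (element_count > 0) {
            const int64_t chunk_id = element_offset / size_per_chunk_;
            const int64_t in_chunk = element_offset % size_per_chunk_;
            const int64_t n = std::min(element_count, size_per_chunk_ - in_chunk);
            std::copy_n(source, n, chunk_data(chunk_id) + in_chunk);
            source += n;
            element_offset += n;
            element_count -= n;
        }
    }

    // Return (pointer, length) for one chunk.
    std::pair<const T*, int64_t> get_span(int64_t chunk_id) {
        return {chunk_data(chunk_id), size_per_chunk_};
    }

 private:
    // Look up a chunk buffer under the lock; the buffer itself is stable afterwards.
    T* chunk_data(int64_t chunk_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(chunk_id >= 0 && chunk_id < static_cast<int64_t>(chunks_.size()));
        return chunks_[static_cast<std::size_t>(chunk_id)].data();
    }

    const int64_t size_per_chunk_;
    std::mutex mutex_;
    std::deque<std::vector<T>> chunks_;
};
```

The sketch relies on callers writing to disjoint, previously reserved ranges, which is what the `PreInsert` reservation provides; only the chunk lookup is serialized, not the copy itself.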