Add ddl flush design (#5289)

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
XuanYang-cn 2021-05-19 12:06:16 +08:00 committed by GitHub
parent 9b37cab922
commit 4b712284f2
2 changed files with 80 additions and 18 deletions


@ -0,0 +1,68 @@
# DataNode DDL Flush Design
update: 5.19.2021, by [Goose](https://github.com/XuanYang-cn)
## Background
Data Definition Language (DDL) is a language used to define data structures and modify data<sup>[1](#techterms1)</sup>.
In Milvus terminology, operations such as `CreateCollection` and `DropPartition` are DDL. In order to recover
or redo DD operations, DataNode flushes DDLs into persistent storage.
Before this design, DataNode buffered DDL chunks by collection and flushed all of the buffered data on manual/auto flush.
Now, per the [DataNode Recovery Design](datanode_recover_design_0513_2021.md), flowgraph : vchannel = 1 : 1, and the insert
data of one segment is always written to one vchannel, so each flowgraph is concerned with only ONE specific collection.
For the DDL channel, this means one flowgraph only cares about the DDL operations of its own collection.
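A minimal sketch of this per-collection concern, using simplified stand-in types (`ddMsg`, `ddNode`) rather than the actual Milvus flowgraph interfaces: the node simply skips DDL messages that belong to other collections.

```go
package main

import "fmt"

// ddMsg is a simplified stand-in for a DDL message read from the DDL channel.
type ddMsg struct {
    collectionID int64
    operation    string // e.g. "CreateCollection", "DropPartition"
}

// ddNode is a simplified stand-in for the flowgraph node handling DDL.
// Each flowgraph serves exactly one collection, so the node keeps that ID.
type ddNode struct {
    collectionID int64
    buffer       []ddMsg
}

// Operate buffers only the DDL messages of this node's collection and
// ignores everything else flowing through the shared DDL channel.
func (n *ddNode) Operate(msgs []ddMsg) {
    for _, m := range msgs {
        if m.collectionID != n.collectionID {
            continue // DDL of another collection: not this flowgraph's concern
        }
        n.buffer = append(n.buffer, m)
    }
}

func main() {
    node := &ddNode{collectionID: 100}
    node.Operate([]ddMsg{
        {collectionID: 100, operation: "DropPartition"},
        {collectionID: 200, operation: "CreateCollection"},
    })
    fmt.Println(len(node.buffer)) // 1: only the collection-100 DDL is buffered
}
```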
## Goals
- A flowgraph knows which segments/collection it is concerned with.
- DDNode updates its msg positions once it buffers a DDL of its collection.
- DDNode buffers the binlog paths generated by auto-flush.
- In a manual flush, a background flush-complete goroutine waits until both DDNode and InsertBufferNode have finished
  flushing, collecting the binlog paths from both (see the sketch after this list).
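A minimal sketch of this flush-complete synchronization, assuming the two nodes report their binlog paths over channels; the names and types are illustrative, not the actual implementation, and the `save` callback stands in for the SaveBinlogPaths call to DataService.

```go
package main

import "fmt"

// binlogPaths is a simplified stand-in for the paths produced by one node's flush.
type binlogPaths struct {
    source string
    paths  []string
}

// waitFlushComplete launches the background flush-complete goroutine: it blocks
// until BOTH the DDNode and the InsertBufferNode have reported their binlog
// paths, then hands the combined result to the save callback.
func waitFlushComplete(ddlCh, insertCh <-chan binlogPaths, save func(ddl, insert binlogPaths)) {
    go func() {
        ddl := <-ddlCh       // binlog paths flushed by DDNode
        insert := <-insertCh // binlog paths flushed by InsertBufferNode
        save(ddl, insert)
    }()
}

func main() {
    ddlCh := make(chan binlogPaths, 1)
    insertCh := make(chan binlogPaths, 1)
    done := make(chan struct{})

    waitFlushComplete(ddlCh, insertCh, func(ddl, insert binlogPaths) {
        fmt.Println("flush complete:", ddl.paths, insert.paths)
        close(done)
    })

    // The two flowgraph nodes finish flushing independently.
    ddlCh <- binlogPaths{source: "ddNode", paths: []string{"ddl-binlog-0", "ts-binlog-0"}}
    insertCh <- binlogPaths{source: "insertBufferNode", paths: []string{"field-1-binlog-0"}}
    <-done
}
```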
## Detailed design
1. Redesign of DDL binlog paths and the etcd paths for these binlog paths
A DDL flush is triggered by the manual flush of a segment.
**Former design**
```
# minIO/S3 ddl binlog paths
${tenant}/data_definition_log/${collection_id}/ts/${log_idx}
${tenant}/data_definition_log/${collection_id}/ddl/${log_idx}
# etcd paths for ddl binlog paths
${prefix}/${collectionID}/${idx}
```
The minIO/S3 ddl binlog paths seem OK, but the etcd paths aren't clear, especially when we want to relate a ddl flush
to a certain segment flush.
**Redesign**
```
# etcd paths for ddl binlog paths
${prefix}/${collectionID}/${segmentID}/${idx}
```
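As a minimal illustration of the redesigned key layout (the prefix value and the helper are assumptions made for this example, not the actual Milvus code):

```go
package main

import "fmt"

// ddlBinlogEtcdKey builds the redesigned etcd key
// ${prefix}/${collectionID}/${segmentID}/${idx}, so a DDL flush can be related
// to the segment flush that triggered it.
func ddlBinlogEtcdKey(prefix string, collectionID, segmentID, idx int64) string {
    return fmt.Sprintf("%s/%d/%d/%d", prefix, collectionID, segmentID, idx)
}

func main() {
    // e.g. "datanode/ddl-binlog-meta/100/2001/0" under an assumed prefix
    fmt.Println(ddlBinlogEtcdKey("datanode/ddl-binlog-meta", 100, 2001, 0))
}
```

With the segmentID in the key, each DDL flush can be traced back to the segment flush that produced it. The redesigned `SaveBinlogPathsRequest` below reports the segment and collection IDs together with both the field binlog paths and the DDL binlog meta: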
```proto
message SaveBinlogPathsRequest {
    common.MsgBase base = 1;
    int64 segmentID = 2;
    int64 collectionID = 3;
    ID2PathList field2BinlogPaths = 4;
    // field name is assumed for illustration
    repeated DDLBinlogMeta ddl_binlog_paths = 5;
    repeated internal.MsgPosition start_positions = 7;
    repeated internal.MsgPosition end_positions = 8;
}
```
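For illustration only, a sketch of how a DataNode might fill such a request after a flush; the structs are simplified stand-ins for the generated proto types, and every value is made up:

```go
package main

import "fmt"

// Simplified stand-ins for the proto messages above (illustrative only).
type DDLBinlogMeta struct {
    DdlBinlogPath string
    TsBinlogPath  string
}

type ID2PathList struct {
    ID    int64
    Paths []string
}

type SaveBinlogPathsRequest struct {
    SegmentID         int64
    CollectionID      int64
    Field2BinlogPaths ID2PathList
    DdlBinlogMetas    []DDLBinlogMeta
}

func main() {
    // One request reports both the insert binlogs of the flushed segment and
    // the DDL binlogs buffered for its collection.
    req := SaveBinlogPathsRequest{
        SegmentID:    2001,
        CollectionID: 100,
        Field2BinlogPaths: ID2PathList{
            ID:    1,
            Paths: []string{"field-1-binlog-0"},
        },
        DdlBinlogMetas: []DDLBinlogMeta{
            {DdlBinlogPath: "ddl-binlog-0", TsBinlogPath: "ts-binlog-0"},
        },
    }
    fmt.Printf("%+v\n", req)
}
```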
## TODOs
1. Refactor auto-flush of ddNode
2. Refactor etcd paths
<a name="techterms1">[1]</a>: *[techterms.com](https://techterms.com/definition/ddl#:~:text=Stands%20for%20%22Data%20Definition%20Language,SQL%2C%20the%20Structured%20Query%20Language)*


@ -63,12 +63,10 @@ manual-flush and upload to DataService together.
```proto
rpc SaveBinlogPaths(SaveBinlogPathsRequest) returns (common.Status){}
message ID2PathList {
    int64 ID = 1;
    repeated string Paths = 2;
}
message SaveBinlogPathsRequest {
    common.MsgBase base = 1;
@ -87,20 +85,16 @@ message SaveBinlogPathsRequest {
The same as DataNode
```proto
-message FieldFlushMeta {
-    int64 fieldID = 1;
-    repeated string binlog_paths = 2;
+// key: ${prefix}/${segmentID}/${fieldID}/${idx}
+message SegmentFieldBinlogMeta {
+    int64 fieldID = 1;
+    string binlog_path = 2;
 }
-message SegmentFlushMeta{
-    int64 segmentID = 1;
-    bool is_flushed = 2;
-    repeated FieldFlushMeta fields = 5;
-}
-message DDLFlushMeta {
-    int64 collectionID = 1;
-    repeated string binlog_paths = 2;
+// key: ${prefix}/${collectionID}/${idx}
+message DDLBinlogMeta {
+    string ddl_binlog_path = 1;
+    string ts_binlog_path = 2;
 }
```