milvus/docs/design_docs/master_service_recovery_on_power_failure.md
neza2017 b1e95ccf28
[skip-ci] master service recovery on power failure (#5216)
design doc about master service recover on power failure
Signed-off-by: yefu.chen <yefu.chen@zilliz.com>
2021-05-14 06:25:46 +00:00

8.2 KiB

Master service recovery on power failure

1. Basic idea

  1. master service reads meta from etcd when it starts
  2. master service needs to store the position of the msgstream into etcd every time it consumes the msgstream.
  3. master service reads the position of msgstream from etcd when it starts up, then seek to the specified position and re-consume the msgstream
  4. Ensure that all messages from the msgstream are processed in idempotent fashion, so that repeated consumption of the same message does not cause system inconsistencies
  5. master service registers itself in etcd and finds out if the dependent data service and index service are online via etcd

2. Specific tasks

2.1 Read meta from etcd

  1. master service needs to load meta from etcd when it starts, this part is already done

2.2 dd requests from grpc

  1. The dd requests, such as create_collection, create_partition, etc., from grpc are marked as done only if the related meata have been writen into etcd.
  2. The dd requests should be send to dd msgstream when the operation is done.
  3. There may be a fault here, that is, the dd request has been written to etcd, but it has not been sent to dd msgstream yet, then the master service has crashed.
  4. For the scenarios mentioned in item 3, master service needs to check if all dd requests are sent to dd msgstream when it starts up.
  5. master service's built-in scheduler ensures that all grpc requests are executed serially, so it only needs to check whether the most recent dd requests are sent to the dd msgstream, and resend them if not.
  6. Take create_collection as an example to illustrate the process
    • When create collection is written to etcd, 2 additional keys are updated, dd_msg and dd_type
    • dd_msg is the serialization of the dd_msg
    • dd_type is the message type of dd_msg, such as create_collection, create_partition, drop_collection, etc. It's used to deserializes dd_msg.
    • Update the meta of create_collection, dd_msg and dd_type at the same time in a transactional manner.
    • When dd_msg has been sent to dd msgstream, delete dd_msg and dd_type from etcd.
    • When the master service starts, first check whether there are dd_msg and dd_type in etcd, if so, then deserialize dd_msg according to dd_type, and then send it to the dd msgstream, otherwise no processing will be done
    • There may be a failure here, that is, dd_msg has been sent to the dd msgstream , but has not been deleted from etcd yet, then the master service crashed, at this case, the dd_msg would be sent to dd msgstream repeatedly, so the receiver needs to count this case.

2.3 create index requests from grpc

  1. In the processing of create index, master service calls metaTable's GetNotIndexedSegments to get all segment ids that are not indexed
  2. After getting the segment ids, master service call index service create the index on these segment ids.
  3. In the current implementation, the create index requests will return after the segment ids are put into a go channel.
  4. The master service starts a background task that keeps reading the segment ids from the go channel, and then calls the index service to create the index.
  5. There is a fault here, the segment ids have been put into the go channel in the processing function of the grpc request, and then the grpc returns, but the master service's background task has not yet read them from the go channel, theb master service crashes. At this time, the client thinks that the index is created, but the master service does not call index service to create the index.
  6. The solution for the fault mentioned in item 5
    • Remove the go channel and master service's background task
    • In the request processing function of create index, the call will return only when all segment ids have been send index service
    • Some segment ids may be send to index service repeatedly, and index service needs to handle such requests

2.4 New segment from data service

  1. Each time a new segment is created, the data service sends the segment id to the master service via msgstream
  2. master service needs to update the segment id to the collection meta and record the position of the msgstream in etcd
  3. Step 2 is transactional and the operation will be successful only if the collection meta in etcd is updated
  4. So the master service only needs to restore the msgstream to the position when recovering from a power failure

2.5 Flushed segment from data node

  1. Each time the data node finishes flushing a segment, it sends the segment id to the master service via msgstream.
  2. master service needs to fetch binlog from data service by id and send request to index service to create index on this segment
  3. When the index service is called successfully, it will return a build id, and then master service will update the build id to the collection meta and record the position of the msgstream in etcd.
  4. Step 3 is transactional and the operation will be successful only if the collection meta in etcd is updated
  5. So the master service only needs to restore the msgstream to the position when recovering from a power failure

2.6 Failed to call external grpc service

  1. master service needs grpc service from data service and index service, if the grpc call failed, it needs to reconnect.
  2. master service does not listen to the status of the data service and index service in real time

2.7 Add virtual channel assignment when creating collection

  1. Add a new field, "number of shards" in the create collection request, the "num of shards" tell the master service to create the number of virtual channel for this collection.
  2. In the current implementation, virtual channels and physical channels have a one-to-one relationship, and the total number of physical channels increases as the number of virtual channels increases; later, the total number of physical channels needs to be fixed, and multiple virtual channels share one physical channel
  3. The name of the virtual channel is globally unique, and the collection meta records the correspondence between the virtual channel and the physical channel

Add processing of time synchronization signals from proxy node

  1. A virtual channel can be inserted by multiple proxies, so the timestamp in the virtual channel is not increase monotonically
  2. All proxies report the timestamp of all the virtual channels to the master service periodically
  3. The master service collects the timestamps from the proxies on each virtual channel and gets the minimum one as the timestamp of that virtual channel, and then inserts the timestamp into the virtual channel
  4. The proxy reports the timestamp to the master service via grpc
  5. The proxy needs to register itself in etcd when it starts, master service will listen to the corresponding key to determine how many active proxies there are, and thus determine if all of them have sent timestamps to master
  6. If a proxy is not registered in etcd but sends a timestamp or any other grpc request to master, master will ignore the grpc request

2.9 Register service in etcd

  1. master service needs to register itself with etcd when it starts
  2. The registration should include ip address, port, its own id, global incremental timestamp
  1. The proxy service related code will be removed
  2. The the job of time synchronization which done by the proxy service is partially simplified and handed over to the master (subsection 2.8)

2.11 Query collection meta based on timeline

  1. Add a new field of timestamp to the grpc request of describe collection
  2. master service should provide snapshot on the collection mate
  3. Return the collection meta at the point of timestamp mentioned in the request

2.12 Timestamp of dd operations

  1. master service response to set the timestamp of dd operations, create collection, create partition, drop collection, drop partition
  2. master service response to send timestamp to dd msgstream, if there is a dd message, then use the current latest timestamp from that message, if not, get a timestamp from tso