Commit Graph

91 Commits

Author SHA1 Message Date
wei liu
30a99b66c1
fix: Fix logic dead lock when delegator has high memory usage (#36065)
issue: #36064
when delegator has high memory usage, load l0 segment will failed. and
balance segment task will blocked by load segment task, then delegator
cann't free memory by moving out some segment, causes a logic dead lock.

this PR remove the limit for balance, we permit segment and balance
execute in parallel. which won't cause side effect due to:
1. one segment can only has one task in qc's scheduler, and load/release
task will replace balance task if necessary
2. balance speed has been limited, and it won't block load segment task.

3. if collection has load task and balance task at same time, load task
will be scheduled first due to high proirity.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-09 10:21:06 +08:00
wei liu
e09dc3be58
enhance: Mark query node as read only after suspend (#35492)
issue: #34985 #35493
after querynode has been suspended, it's not allow to load
segment/channel on it, which means the node is read only. to be
compatible with resource group design, after query node has been
suspend, we remove it from it's original resource group, make it a read
only query node in replica. then two things will happens:
1. it's original resource group will be lacking of query nodes, query
coord will assign new node to it.
2. querycoord will try to move out all segments/channels after querynode
has been suspended

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-20 14:02:54 +08:00
Bingyi Sun
ae1b81ac1a
fix: fix panic when generating plans (#35309)
issue: https://github.com/milvus-io/milvus/issues/35335

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-08-07 18:14:16 +08:00
jaime
fcec4c21b9
fix: check collection health(queryable) fail for releasing collection (#34947)
issue: #34946

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-08-02 17:20:15 +08:00
wei liu
27b6d58981
fix: Set legacy level to l0 segment after qc restart (#35197)
issue: #35087
after qc restarts, and target is not ready yet, if dist_handler try to
update segment dist, it will set legacy level to l0 segment, which may
cause l0 segment be moved to other node, cause search/query failed.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-02 10:18:13 +08:00
wei liu
e9d61daa3f
enhance: Reduce delegator memory overloaded factor to 0.1 (#35092)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-01 10:21:50 +08:00
wei liu
03912a8788
enhance: Avoid balance stuck after segment list become stable (#34728)
issue: #34715
if collection's segment list doesn't changes anymore, then the next
target will be empty at most time, and balance segment will check
whether segment exist in both current and next target, so the balance
cloud be blocked due to next target is empty.

This PR permit segment to be moved if next target is empty, to avoid
balance stuck.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-07-31 18:09:48 +08:00
congqixia
de8a266d8a
enhance: Enable linux code checker (#35084)
See also #34483

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-07-30 15:53:51 +08:00
wei liu
166fc902b0
enhance: Limit collection's normal balance speed (#34810)
issue: #34798

after we remove the task priority on query coord, to avoid load/release
segment blocked by too much balance task, we limit the balance task size
in each round. at same time, we reduce the balance interval to trigger
balance more frequently.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-07-24 19:11:44 +08:00
wei liu
92de49e38c
fix: Segment may bounce between delegator and worker (#34830)
issue: #34595

pr#34596 to we add an overloaded factor to segment in delegator, which
cause same segment got different score in delegator and worker. which
may cause segment bounce between delegator and worker.

This PR use average score to compute the delegator overloaded factor, to
avoid segment bounce between delegator and worker.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-07-23 15:57:49 +08:00
wei liu
acb33bba4d
enhance: Preserve fixed-size memory in delegator node for growing segment. (#34596)
issue: #34595
When consuming insert data on the delegator node, QueryCoord will move
out some sealed segments to manage its memory usage. After the growing
segment gets flushed, some sealed segments from other workers will be
moved back to the delegator node. To avoid the frequent movement of
segments, we estimate the maximum growing row count and preserve a
fixed-size memory in the delegator node.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-07-15 20:51:46 +08:00
wei liu
8123bea1ae
enhance: Avoid assign too much segment/channels to new querynode (#34096)
issue: #34095

When a new query node comes online, the segment_checker,
channel_checker, and balance_checker simultaneously attempt to allocate
segments to it. If this occurs during the execution of a load task and
the distribution of the new query node hasn't been updated, the query
coordinator may mistakenly view the new query node as empty. As a
result, it assigns segments or channels to it, potentially overloading
the new query node with more segments or channels than expected.

This PR measures the workload of the executing tasks on the target query
node to prevent assigning an excessive number of segments to it.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-06-27 19:06:05 +08:00
jaime
9630974fbb
enhance: move rocksmq from internal to pkg module (#33881)
issue: #33956

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-06-25 21:18:15 +08:00
yihao.dai
3540eee977
enhance: Support L0 import (#33514)
issue: https://github.com/milvus-io/milvus/issues/33157

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-06-07 14:17:20 +08:00
wei liu
a7f6193bfc
fix: query node may stuck at stopping progress (#33104)
issue: #33103 
when try to do stopping balance for stopping query node, balancer will
try to get node list from replica.GetNodes, then check whether node is
stopping, if so, stopping balance will be triggered for this replica.

after the replica refactor, replica.GetNodes only return rwNodes, and
the stopping node maintains in roNodes, so balancer couldn't find
replica which contains stopping node, and stopping balance for replica
won't be triggered, then query node will stuck forever due to
segment/channel doesn't move out.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-20 10:21:38 +08:00
wei liu
e2332bdc17
enhance: Enable channel exclusive balance policy (#32911)
issue: #32910  
* split replica's node list to channels when create replicas
 * balance nodes among channels when node change happens
 * implement channel level balance, let balance happens in channel level

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-10 17:27:31 +08:00
Xiaofan
02ace25c68
enhance: reduce the cpu usage when collection number is high (#32245)
related to #32165
1. for all the manager, support collection level index
2. remove collection level filter to avoid extra cpu usage when
collection number increases

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2024-04-26 11:49:25 +08:00
wei liu
4822b109bd
fix: Skip to load l0 segment on old version query node (#32124)
issue: #32107

during rolling upgrade progress, skip to load l0 segment on old version
query node

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-15 11:23:23 +08:00
chyezh
48fe977a9d
enhance: declarative resource group api (#31930)
issue: #30647

- Add declarative resource group api

- Add config for resource group management

- Resource group recovery enhancement

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-15 08:13:19 +08:00
wei liu
c4806b69c4
enhance: Refactor leader view manager interface (#31133)
issue: #31091
This PR add GetByFilter interface in leader view manager, instead of all
kind of get func

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-10 15:13:36 +08:00
chyezh
a2502bde75
enhance: replica manager enhancement (#31496)
issue: #30647 

- ReplicaManager manage read only node now, and always do persistent of
node distribution of replica.

- All segment/channel checker using ReplicaManager to get read-only node
or read-write node, but not ResourceManager.

- ReplicaManager promise that only apply unique querynode to one replica
in same collection now (replicas in same collection never hold same
querynode at same time).

- ReplicaManager promise that fairly node count assignment policy if
multi replicas of collection is assigned to one resource group.

- Move some parameters check into ReplicaManager to avoid data race.

- Allow transfer replica to resource group that already load replica of
same collection

- Allow transfer node between resource groups that load replica of same
collection

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-05 04:57:16 +08:00
wei liu
0944a1f790
enhance: Refactor channel dist manager interface (#31119)
issue: #31091
This PR add GetByFilter interface in channel dist manager, instead of
all kind of get func

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-02 10:23:14 +08:00
wei liu
92971707de
enhance: Add restful api for devops to execute rolling upgrade (#29998)
issue: #29261
This PR Add restful api for devops to execute rolling upgrade, including
suspend/resume balance and manual transfer segments/channels.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-27 16:15:19 +08:00
chyezh
9f9ef8ac32
enhance: transfer resource group and dbname to querynode when load (#30936)
issue: #30931

Signed-off-by: chyezh <chyezh@outlook.com>
2024-03-21 11:59:12 +08:00
chyezh
ff4237bb90
enhance: add hostname into node info (#30673)
issue: https://github.com/milvus-io/milvus/issues/30647

- Address may be reused in k8s environment. Using hostname can be
better.

Signed-off-by: chyezh <chyezh@outlook.com>
2024-03-15 10:45:06 +08:00
wei liu
06b191b164
fix: Balance channel stuck forever due to logic dead lock (#31202)
issue: #30816

cause balance channel will stuck until leader view catch up the current
target, then start to unsub the old delegator. which make sure that the
new delegator can provide search before release old delegator. but
another logic in segment_checker skip loading segment during balance
channel. so during balance channel, if query node crash, new delegator
can't catch up target forever, then stuck forever.

This PR remove the rule that skip loading segment during balance channel
to avoid the logic dead lock here.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-13 15:05:04 +08:00
wei liu
06df9b8462
fix: Balance segment/channel won't be trigger on multi replicas (#31107)
issue: #30983 #30982

cause balancer call wrong interface to get segment/channel list in
replica, then got a wrong average segment/channel number, which make
each node have less segment/channel than average, and the balance won't
be trigger in multi replica case.

This PR fix that balance segment/channel won't be trigger on multi
replicas

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-11 20:35:04 +08:00
wei liu
efe8cecc88
enhance: refactor segment dist manager interface (#31073)
issue: #31091
This PR add `GetByFilter` interface in segment dist manager, instead of
all kind of get func

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-08 16:29:01 +08:00
Bingyi Sun
ece9d273a7
enhance: some patches for #30636 (#30664)
Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-02-26 11:42:55 +08:00
Bingyi Sun
564b12c661
enhance: make balance cost threshold configurable (#30636)
Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-02-19 15:24:50 +08:00
congqixia
a6d9eb7f20
fix: Remove balance plan of which From, To nodes are same when merging (#30634)
See also #30627

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-02-18 17:24:50 +08:00
Bingyi Sun
715f042965
feat: add a balancer based on both of row count and segment count (#30188)
issue: https://github.com/milvus-io/milvus/issues/30039

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-02-06 17:15:50 +08:00
Bingyi Sun
406bf14e84
enhance: Add growing row count weight (#30271)
Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-01-29 14:05:02 +08:00
congqixia
4c93912135
enhance: Shuffle candidates before channel assignment (#30066)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-01-17 19:34:53 +08:00
congqixia
3626f49025
fix: make sure balance candidate is alway pushed back (#29702)
See also #29699

Querycoord panicked when tried to pop from an empty heap. We assume the
heap shall not be empty, but in some branch, the candidate is never
pushed back.

This PR put pop & push in a closure and adds a defer call to push item
back.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-01-05 10:08:47 +08:00
wei liu
336fce0582
enhance: Rewrite gen segment plan based on assign segment (#29574)
issue: #29582
This PR rewrite gen segment plan logic based on assign segment in
`score_based_balancer`

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-04 11:10:44 +08:00
wei liu
6cbf9c489d
enhance: Rewrite gen stopping segment plan based on assign segment (#29473)
`AssignSegment` method defines how to assign segment to nodes, but
score_based_balance implement another assign logic in
`genStoppingSegmentPlan`
This PR rewrite gen stopping segment plan based on assign segment.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-26 14:26:56 +08:00
wei liu
820ee692fc
enhance: Add config for querycoord auto balance channel (#29231)
issue: #23726
This PR add control config to querycoord's background auto balance
channel operation

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-18 10:00:40 +08:00
wei liu
008bae675d
enhance: Skip balance segment when channel need be balanced (#29116)
issue: #28622
After we support balance segment with growing segment count #28623, if
we balance segment and channel at same time, some segments need to be
rebalanced after balance channel finish.

This PR skip balance segment when channel need be balanced.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-14 16:44:43 +08:00
yah01
58dbb7872a
fix: forbid balancing level zero segments (#29168)
related #29128
Signed-off-by: yah01 <yah2er0ne@outlook.com>

---------

Signed-off-by: yah01 <yah2er0ne@outlook.com>
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-12-14 14:30:39 +08:00
yah01
2f0c7a6544
fix: forbid balancing level zero segments (#29130)
we can't balance the L0 segments
related #29128

---------

Signed-off-by: yah01 <yah2er0ne@outlook.com>
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-12-12 20:38:38 +08:00
congqixia
a67fc08865
fix: balance_unstable_view unit test (#29127)
fix: #29126
Allow unstable output channel balance plan

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-12-12 10:02:39 +08:00
wei liu
42e538b683
enhance: enable balance channel in querycoord (#28469)
issue: #23726

/kind improvement

1. enable auto balance channel between nodes in querycoord
2. make `genSegmentPlan` reuse the `AssignSegment` logic
3. make `genChannelPlan` reuse the `AssignChannel` logic

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-11 14:18:37 +08:00
wei liu
911a915798
feat: enable balance based on growing segment row count (#28623)
issue: #28622 

query node with delegator will has more rows than other query node due
to delgator loads all growing rows.
This PR enable the balance segment which based on the num of growing
rows in leader view.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-11-27 14:58:26 +08:00
congqixia
852be152de
Change task sourceID to stringer interface (#27965)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-10-27 01:08:12 +08:00
wei liu
e0222b2ce3
refine target manager code style (#27883)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-10-25 00:44:12 +08:00
Xiaofan
41124f281a
Remove parser dependency (#27514)
Signed-off-by: xiaofan-luan <xiaofan.luan@zilliz.com>
2023-10-08 15:05:31 +08:00
yah01
a8ce1b6686
Refine QueryCoord stopping (#27371)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-09-27 16:27:27 +08:00
SimFG
26f06dd732
Format the code (#27275)
Signed-off-by: SimFG <bang.fu@zilliz.com>
2023-09-21 09:45:27 +08:00
congqixia
cc9974979f
Add staticcheck linter and fix existing problems (#27174)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-09-19 10:05:22 +08:00