78 Commits

Author SHA1 Message Date
wei liu
061a00c58f
enhance: Enable database level replica num and resource groups for loading collection (#33052) (#33981)
pr: #33052

issue: #30040

This PR introduces two database-level properties:
1. database.replica.number
2. database.resource_groups

Users can set these two properties through the AlterDatabase API and then
load a collection without specifying replica_num or resource groups. The
load then falls back to the database-level load parameters.
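
The fallback described above can be sketched as follows. This is a minimal illustration, not the querycoord implementation: `LoadRequest`, `resolve_load_params`, and the comma-separated encoding of `database.resource_groups` are assumptions made for the example; only the two property keys come from the commit message.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Property keys named in the commit message.
DB_REPLICA_NUMBER = "database.replica.number"
DB_RESOURCE_GROUPS = "database.resource_groups"


@dataclass
class LoadRequest:
    # Hypothetical request shape; None / empty list means "not specified".
    replica_num: Optional[int] = None
    resource_groups: List[str] = field(default_factory=list)


def resolve_load_params(
    req: LoadRequest, db_props: Dict[str, str]
) -> Tuple[Optional[int], List[str]]:
    """Prefer values on the request; otherwise fall back to database props."""
    replica_num = req.replica_num
    if replica_num is None and DB_REPLICA_NUMBER in db_props:
        replica_num = int(db_props[DB_REPLICA_NUMBER])

    groups = list(req.resource_groups)
    if not groups and DB_RESOURCE_GROUPS in db_props:
        # Assumed encoding: comma-separated resource group names.
        raw = db_props[DB_RESOURCE_GROUPS]
        groups = [g.strip() for g in raw.split(",") if g.strip()]
    return replica_num, groups
```

What happens when neither the request nor the database supplies a value is not covered by the commit message, so the sketch simply returns `None` and an empty list in that case.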

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-06-21 16:56:02 +08:00
wei liu
32bfd9befa
enhance: Enable to dynamic update balancer policy in querycoord (#33037) (#33272)
issue: #33036
pr: #33037
This PR enables dynamically updating the balancer policy without
restarting querycoord.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-23 15:43:41 +08:00
wei liu
9ae4945df2
fix: query node may stuck at stopping progress (#33104) (#33154)
issue: #33103 
pr: #33104
When doing stopping balance for a stopping query node, the balancer gets
the node list from replica.GetNodes and checks whether any node is
stopping. If so, stopping balance is triggered for that replica.

After the replica refactor, replica.GetNodes returns only rwNodes, while
stopping nodes are kept in roNodes. The balancer therefore cannot find
the replica that contains the stopping node, stopping balance is never
triggered for it, and the query node gets stuck forever because its
segments/channels are never moved out.
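
The failure mode can be reproduced with a small model. The `Replica` class and both helper functions below are hypothetical stand-ins written for illustration; the "fixed" variant shows one plausible way to include read-only nodes and is not necessarily how the PR resolves the issue.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Replica:
    # Simplified stand-in for a replica after the refactor.
    replica_id: int
    rw_nodes: List[int] = field(default_factory=list)
    ro_nodes: List[int] = field(default_factory=list)

    def get_nodes(self) -> List[int]:
        # Mirrors the described post-refactor behavior: read-write nodes only.
        return list(self.rw_nodes)

    def get_ro_nodes(self) -> List[int]:
        return list(self.ro_nodes)


def stopping_balance_targets_buggy(replicas: Iterable[Replica], stopping: Set[int]) -> List[int]:
    # Looks only at get_nodes(), so stopping nodes held in ro_nodes are missed.
    return [r.replica_id for r in replicas
            if any(n in stopping for n in r.get_nodes())]


def stopping_balance_targets_fixed(replicas: Iterable[Replica], stopping: Set[int]) -> List[int]:
    # Also inspects read-only nodes, where stopping nodes now live.
    return [r.replica_id for r in replicas
            if any(n in stopping for n in r.get_nodes() + r.get_ro_nodes())]
```

With node 3 stopping and held in `ro_nodes` of replica 1, the buggy variant selects no replica, so nothing would ever move the node's segments and channels away.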

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-20 15:01:43 +08:00
chyezh
293f14a8b9
fix: remove redundant replica recover (#32985)
issue: #22288 

- replica recover should be only triggered by replica recover

Signed-off-by: chyezh <chyezh@outlook.com>
2024-05-13 15:25:32 +08:00
chyezh
1c84a1c9b6
fix: lru related issue fixup patch (#32916)
issue: #32206, #32801

- Fix search failures with assertion, segment-not-loaded, and
resource-insufficient errors.

- Fix segment leak when querying segments.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-05-10 19:17:30 +08:00
chyezh
f06509bf97
fix: get replica should not report error when no querynode serve (#32536)
issue: #30647

- Remove the error report when no query node is serving. The error made
it hard for programmers to use this API for resource management.

- Change the resource group `transferNode` logic to stay compatible with
old SDK versions.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-25 19:25:24 +08:00
congqixia
d7ff1bbe5c
enhance: Make querycoordv2 collection observer task driven (#32441)
See also #32440

- Add loadTask in collection observer
- For load collection/partitions, the load task shall time out as a whole
- Change related constructor to load jobs

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-22 10:39:22 +08:00
chyezh
48fe977a9d
enhance: declarative resource group api (#31930)
issue: #30647

- Add declarative resource group api

- Add config for resource group management

- Resource group recovery enhancement

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-15 08:13:19 +08:00
chyezh
a2502bde75
enhance: replica manager enhancement (#31496)
issue: #30647 

- ReplicaManager now manages read-only nodes and always persists the
node distribution of each replica.

- All segment/channel checkers use ReplicaManager, not ResourceManager,
to get read-only or read-write nodes.

- ReplicaManager guarantees that a querynode is assigned to at most one
replica in the same collection (replicas in the same collection never
hold the same querynode at the same time).

- ReplicaManager guarantees a fair node-count assignment policy when
multiple replicas of a collection are assigned to one resource group.

- Move some parameter checks into ReplicaManager to avoid data races.

- Allow transferring a replica to a resource group that already holds a
replica of the same collection.

- Allow transferring nodes between resource groups that hold replicas of
the same collection.
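
Two of the guarantees listed above, unique node ownership within a collection and a fair node count per replica, can be illustrated with a round-robin split. This is a sketch under assumptions: `assign_nodes_fairly` is a hypothetical helper, and the actual ReplicaManager policy may differ in ordering and tie-breaking.

```python
from typing import Dict, Iterable, List


def assign_nodes_fairly(replica_ids: List[int], node_ids: Iterable[int]) -> Dict[int, List[int]]:
    """Split nodes of one resource group across replicas of one collection.

    Each node goes to exactly one replica, and node counts differ by at
    most one between any two replicas.
    """
    assignment: Dict[int, List[int]] = {rid: [] for rid in replica_ids}
    if not replica_ids:
        return assignment
    # Deduplicate and sort so the result is deterministic.
    for i, node in enumerate(sorted(set(node_ids))):
        assignment[replica_ids[i % len(replica_ids)]].append(node)
    return assignment
```

For two replicas and five nodes, one replica receives three nodes and the other two, and no node appears in both lists.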

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-05 04:57:16 +08:00
wei liu
92971707de
enhance: Add restful api for devops to execute rolling upgrade (#29998)
issue: #29261
This PR adds a RESTful API for devops to execute rolling upgrades,
including suspending/resuming balance and manually transferring
segments/channels.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-27 16:15:19 +08:00
chyezh
ff4237bb90
enhance: add hostname into node info (#30673)
issue: https://github.com/milvus-io/milvus/issues/30647

- Addresses may be reused in a k8s environment, so the hostname is a
better identifier.

Signed-off-by: chyezh <chyezh@outlook.com>
2024-03-15 10:45:06 +08:00
congqixia
c886aa29ff
enhance: Use ListIndexes instead of DescribeIndex for qc broker (#31122)
See also #31103

Since querycoord needs only index meta information from datacoord, the
broker shall use `ListIndexes` to skip the segment index building check
logic in datacoord.

This PR is also related to #30538, in which DescribeIndex caused heavy
memory usage and eventually led to OOM.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-07 21:43:03 +08:00
wei liu
9abc868d15
fix: Remove heartbeat lag logic during get shard leaders (#29999)
issue: #29677 #29838
When getting shard leaders, if a querynode has not acked the heartbeat
for more than 10s, querycoord treats it as unavailable and does not
return the shard leaders on it. But when a querynode is at full CPU
usage, it can easily stall for more than 10s without acking the
heartbeat, which leaves no shard leader available for search/query.

This PR removes the heartbeat lag logic when getting shard leaders.
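
The effect of the removed check can be seen in a small model. Both functions and their data shapes (a channel-to-node map and a node-to-timestamp map) are hypothetical; only the 10s threshold comes from the commit message.

```python
from typing import Dict

HEARTBEAT_LAG_LIMIT_S = 10.0  # the 10s threshold from the commit message


def leaders_with_lag_check(leaders: Dict[str, int],
                           last_heartbeat: Dict[int, float],
                           now: float) -> Dict[str, int]:
    # Old behavior (sketched): drop leaders whose node has not acked a
    # heartbeat within the limit. Nodes with no record are also dropped.
    return {
        channel: node
        for channel, node in leaders.items()
        if now - last_heartbeat.get(node, float("-inf")) <= HEARTBEAT_LAG_LIMIT_S
    }


def leaders_without_lag_check(leaders: Dict[str, int],
                              last_heartbeat: Dict[int, float],
                              now: float) -> Dict[str, int]:
    # New behavior (sketched): heartbeat lag no longer filters leaders.
    return dict(leaders)
```

If every node serving a channel is busy enough to miss the limit, the first function returns no leader for that channel even though the nodes are still alive, which is the situation the commit describes.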

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-17 11:22:52 +08:00
wei liu
e98c62abbb
enhance: refactor leader_observer to leader_checker (#29454)
issue: #29453 

Syncing distribution by RPC also calls loadSegment/releaseSegment,
which may cause all kinds of concurrency issues on the same segment,
such as concurrent load and release of one segment.
This PR adds leader_checker, which generates load/release tasks to
correct the leader view instead of syncing distribution by RPC.
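
The checker idea, comparing what a leader should see against what it currently sees and emitting tasks for the difference, can be sketched as below. `check_leader_view` and the tuple-based task shape are assumptions for illustration; the real leader_checker works on richer distribution and target metadata.

```python
from typing import Iterable, List, Tuple

Task = Tuple[str, int]  # (action, segment_id), a hypothetical task shape


def check_leader_view(target_segments: Iterable[int],
                      leader_view_segments: Iterable[int]) -> List[Task]:
    """Return tasks that would bring the leader view in line with the target."""
    target = set(target_segments)
    view = set(leader_view_segments)
    # Segments the leader is missing must be loaded into its view.
    tasks: List[Task] = [("load", seg) for seg in sorted(target - view)]
    # Segments the leader still tracks but should not must be released.
    tasks += [("release", seg) for seg in sorted(view - target)]
    return tasks
```

Because the corrections are expressed as tasks rather than direct RPC calls, a scheduler can order them per segment, which is presumably how concurrent load and release of one segment is avoided.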

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-05 15:54:55 +08:00
yah01
bfccfcd0ca
enhance: refine error messages (#28424)
- Split the simple reason and full detail
- Refine existing error messages
related: #28422

---------

Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-11-21 17:02:24 +08:00
yah01
1b90630633
Fix the target updated before version updated to cause data missing (#28250)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-11-08 11:36:22 +08:00
yah01
dc89730a50
Support collection-level mmap control (#26901)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-11-02 23:52:16 +08:00
Filip Haltmayer
6b1a106a31
Moving etcd client into session (#27069)
Signed-off-by: Filip Haltmayer <filip.haltmayer@zilliz.com>
2023-10-27 07:36:12 +08:00
wei liu
e0222b2ce3
refine target manager code style (#27883)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-10-25 00:44:12 +08:00
yah01
be980fbc38
Refine state check (#27541)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-10-11 21:01:35 +08:00
yah01
a8ce1b6686
Refine QueryCoord stopping (#27371)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-09-27 16:27:27 +08:00
yah01
6539a5ae2c
Refine DataCoord status (#27262)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-09-26 17:15:27 +08:00
MrPresent-Han
4b12cb8847
fix unstable ut due to unstable sort of unique set (#27302)
Signed-off-by: MrPresent-Han <chun.han@zilliz.com>
2023-09-22 19:07:26 +08:00
SimFG
26f06dd732
Format the code (#27275)
Signed-off-by: SimFG <bang.fu@zilliz.com>
2023-09-21 09:45:27 +08:00
yah01
941a383019
Fix failed to load collection with more than 128 partitions (#26763)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-09-02 00:09:01 +08:00
wei liu
949c320185
remove pull target from qc recover (#26775)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-09-01 11:17:01 +08:00
Bingyi Sun
a3e22786ed
Move meta store to kv catalog (#25915)
Signed-off-by: sunby <sunbingyi1992@gmail.com>
2023-07-31 13:57:04 +08:00
yah01
dc37b4587e
Fix panic if channel not watched while getting shard leaders (#25820)
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-07-24 14:13:02 +08:00
yah01
948d1f1f4a
Handle errors by merr for QueryCoord (#24926)
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-07-17 14:59:34 +08:00
wei liu
68ae199a9f
load segment with target version, avoid read redundant segment (#24929)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-06-27 11:48:45 +08:00
congqixia
41af0a98fa
Use go-api/v2 for milvus-proto (#24770)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-06-09 01:28:37 +08:00
wei liu
8e3ba74648
fix qc service unstable ut (#24340)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-05-24 18:49:25 +08:00
wei liu
8965ea2a08
refine err msg about no available node in replica (#24256)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-05-22 11:59:26 +08:00
yihao.dai
1a3dca9b5e
Fix dynamic partitions loading (#24112)
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2023-05-18 09:17:23 +08:00
smellthemoon
146050db82
Fix some wrong ut (#23990)
Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
2023-05-10 09:31:19 +08:00
yihao.dai
3827ac30bc
Remove load cache (#23287)
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2023-05-09 10:36:41 +08:00
congqixia
ed81eaa963
Make CollectionObserver trigger checker more frequently during load procedure (#23928)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-05-08 14:06:41 +08:00
wei liu
b6ae70db43
fix get replica return wrong node list (#23792)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-04-28 19:48:36 +08:00
XuanYang-cn
d56771b7b7
Fix return too many nodeIDs (#23397)
See also: #23396

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
2023-04-20 13:50:31 +08:00
wei liu
cbfe7a45ef
fix pull target (#23491)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-04-18 18:30:32 +08:00
yah01
296380d6e6
Support async refresh (#23107)
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-04-12 15:06:28 +08:00
jaime
c9d0c157ec
Move some modules from internal to public package (#22572)
Signed-off-by: jaime <yun.zhang@zilliz.com>
2023-04-06 19:14:32 +08:00
yah01
75737c65ac
Refine error handle of QueryCoord (#23068)
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-03-31 10:54:29 +08:00
wei liu
74da53c027
fix update load percentage (#23054)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-03-30 10:48:23 +08:00
yihao.dai
1f718118e9
Dynamic load/release partitions (#22655)
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2023-03-20 14:55:57 +08:00
yah01
3d8f0156c7
Refine scheduler & executor of QueryCoord (#22761)
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-03-16 17:43:55 +08:00
yah01
1a4732bb19
Use new errors to handle load failures cache (#22672)
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-03-10 17:15:54 +08:00
wei liu
11f1f4226a
support replica observer assign node (#22604)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-03-08 18:57:51 +08:00
wei liu
c162c6ecc0
fix assign node err (#22479)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-03-01 11:11:47 +08:00
wei liu
a9a263d5a8
fix assign node to replica in nodeUp (#22323)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-02-23 14:15:45 +08:00