Commit Graph

478 Commits

Author SHA1 Message Date
chyezh
b904c8d377
enhance: resource group unittest refactory (#32739)
issue: #30647

Signed-off-by: chyezh <chyezh@outlook.com>
2024-05-06 10:17:34 +08:00
wei liu
d900e68440
fix: fix GetShardLeaders return empty node list (#32685)
issue: #32449

to avoid GetShardLeaders return empty node list, this PR add node list
check in both client side and server side.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-29 14:19:26 +08:00
chyezh
ef4c875d4c
fix: resource group ut may failure (#32688)
issue: https://github.com/milvus-io/milvus/issues/30647

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-29 14:17:26 +08:00
wei liu
c0555d4b45
fix: Remove read only node from replica immedaitely after node down (#32666)
issue: #32665

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-28 20:25:25 +08:00
congqixia
4cdf6c3c41
fix: Check partition nil before observe load progress (#32659)
See also #32441 #32615

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-28 16:29:25 +08:00
congqixia
a239e9110e
enhance: Apply node-indexing and cache optimization for channel dist (#32595)
See also #32165

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-28 16:19:24 +08:00
Xiaofan
02ace25c68
enhance: reduce the cpu usage when collection number is high (#32245)
related to #32165
1. for all the manager, support collection level index
2. remove collection level filter to avoid extra cpu usage when
collection number increases

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2024-04-26 11:49:25 +08:00
chyezh
f06509bf97
fix: get replica should not report error when no querynode serve (#32536)
issue: #30647

- Remove error report if there's no query node serve. It's hard for
programer to use it to do resource management.

- Change resource group `transferNode` logic to keep compatible with old
version sdk.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-25 19:25:24 +08:00
chyezh
b287fbaa2e
fix: return collection on recovering but not collection not loaded when target is not recovered (#32447)
issue: #32398

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-25 11:21:26 +08:00
congqixia
f30c22626e
enhance: Pre-cache result for frequent filters (#32580)
See also #32165

Add segment dist and leader view filter criterion struct to store
frequent filter conditions.
Add collection/channel filter results for these two meta

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-25 11:13:25 +08:00
congqixia
37ca32dbba
enhance: Make SegmentDistManager filter use node index (#32533)
See also #32165

Change `SegmentDistFilter` to interface in order to provde node index
when filter segment dist.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-24 16:53:24 +08:00
smellthemoon
96d95e7743
enhance: fix pass error msg as channel name (#32511)
Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
2024-04-23 16:45:22 +08:00
congqixia
bfebdecf3e
enhance: Make LeaderView Manager filter use map index (#32505)
See also #32165

Change `LeaderViewFilter` to interface to provided map key to avoid
iterating all key-values in LeaderViewManager

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-23 11:07:24 +08:00
chyezh
21a9de5c8e
fix: resource group ut fixup (#32509)
issue: #30647

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-23 10:01:23 +08:00
congqixia
d7ff1bbe5c
enhance: Make querycoordv2 collection observer task driven (#32441)
See also #32440

- Add loadTask in collection observer
- For load collection/partitions, load task shall timeout as a whole
- Change related constructor to load jobs

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-22 10:39:22 +08:00
congqixia
01c16fe6e3
enhance: Manual release pool after save targets (#32358)
See also #31632

Release conc.Pool after usage to clean worker and stop background purge
and ticktock.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-19 13:51:21 +08:00
chyezh
a8c8a6bb0f
fix: parameter check of TransferReplica and TransferNode (#32297)
issue: #30647 

- Same dst and src resource group should not be allowed in
`TransferReplica` and `TransferNode`.

-  Remove redundant parameter check.

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-17 15:27:19 +08:00
yiwangdr
7deda4d5e9
enhance: speed up GetByCollectionAndNode (#32232)
Related to https://github.com/milvus-io/milvus/issues/32165

Avoid iterating through all replicas/collections if possible. Iteration
is expensive when there are large number of replicas/collections.

Signed-off-by: yiwangdr <yiwangdr@gmail.com>
2024-04-17 10:23:25 +08:00
congqixia
72c172a7d7
enhance: Remove duplicated collectionID label for task latency (#32308)
`CollectionID` already exists in channel name, so remove it to save
metrics traffic.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-16 18:55:19 +08:00
chyezh
70e3d5b495
fix: wrong node id in TestCheckNodesInReplica (#32268)
issue: #31930

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-15 17:38:17 +08:00
wei liu
4822b109bd
fix: Skip to load l0 segment on old version query node (#32124)
issue: #32107

during rolling upgrade progress, skip to load l0 segment on old version
query node

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-15 11:23:23 +08:00
congqixia
dc11cbd123
enhance: Maintain collection-patitions mapping in qc meta (#32227)
Related to #32165

Add collection to partitionIDs mapping to avoid interation on all
partitions loaded when trying to get all partitions with collection id

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-15 10:05:19 +08:00
chyezh
48fe977a9d
enhance: declarative resource group api (#31930)
issue: #30647

- Add declarative resource group api

- Add config for resource group management

- Resource group recovery enhancement

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-15 08:13:19 +08:00
wei liu
68dec7dcd4
fix: Use correct ts to avoid exclude segment list leak (#31991)
issue: #31990

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-12 10:39:19 +08:00
congqixia
b9a487608a
fix: Make ResourceGroup.nodes concurrent safe (#32159)
See also #32158

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-11 17:53:18 +08:00
congqixia
25a1c9ecf0
fix: Make coordinator Register not blocked on ProcessActiveStandby (#32069)
See also #32066

This PR make coordinator register successful and let
`ProcessActiveStandBy` run async. And roles may receive stop signal and
notify servers.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-10 18:49:18 +08:00
chyezh
a3d6110957
fix: ut failure (#32120)
issue: #30647

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-10 17:30:48 +08:00
chyezh
0be67e7f99
fix: ut failure (#32119)
issue: #30647

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-10 17:23:27 +08:00
wei liu
c4806b69c4
enhance: Refactor leader view manager interface (#31133)
issue: #31091
This PR add GetByFilter interface in leader view manager, instead of all
kind of get func

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-10 15:13:36 +08:00
wei liu
177ddda47f
fix: Check stale should check leader task's leader id (#31962)
issue: #30816

check stale rules for leader task:
1. for reduce leader task, it should keep executing until leader's node
become offline.
2. for grow leader task,it should keep executing until leader's node
become stopping.

This PR check leader node's stopping state for grow leader task

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-09 15:33:25 +08:00
zhenshan.cao
089c805e0a
enhance:Refactor hybrid search (#32020)
issue: https://github.com/milvus-io/milvus/issues/25639
https://github.com/milvus-io/milvus/issues/31368

Signed-off-by: zhenshan.cao <zhenshan.cao@zilliz.com>
2024-04-09 14:21:18 +08:00
yiwangdr
1cd15d9322
test: support segment release in integration test (#31190)
issue: #29507

Notice that api_testonly.go files should be guarded by compiler tag
`test`, so that production build rules don't compile them and these APIs
don't get misused.

Signed-off-by: yiwangdr <yiwangdr@gmail.com>
2024-04-09 11:39:17 +08:00
chyezh
a2502bde75
enhance: replica manager enhancement (#31496)
issue: #30647 

- ReplicaManager manage read only node now, and always do persistent of
node distribution of replica.

- All segment/channel checker using ReplicaManager to get read-only node
or read-write node, but not ResourceManager.

- ReplicaManager promise that only apply unique querynode to one replica
in same collection now (replicas in same collection never hold same
querynode at same time).

- ReplicaManager promise that fairly node count assignment policy if
multi replicas of collection is assigned to one resource group.

- Move some parameters check into ReplicaManager to avoid data race.

- Allow transfer replica to resource group that already load replica of
same collection

- Allow transfer node between resource groups that load replica of same
collection

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-05 04:57:16 +08:00
congqixia
c2aad513c0
fix: Check collection nil before check load status (#31850)
See also #31849

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-03 10:07:13 +08:00
congqixia
56e371c478
fix: Check replica exists before get latest leader (#31848)
See also #31847

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-03 10:05:22 +08:00
wei liu
7471a8005f
fix: querycoord panic after node down (#31831)
issue: #30519

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-03 10:03:22 +08:00
congqixia
0feee53631
enhance: Add back unit test for compactor and fix some TODOs (#31829)
This PR adds back compactor "Unhandled" data type unit test and fixes
some TODOs behvaior

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-02 20:35:14 +08:00
Bingyi Sun
91cb529ba6
fix: get latest collection info when checking index (#31744)
issue: https://github.com/milvus-io/milvus/issues/31727

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-04-02 14:43:13 +08:00
wei liu
0944a1f790
enhance: Refactor channel dist manager interface (#31119)
issue: #31091
This PR add GetByFilter interface in channel dist manager, instead of
all kind of get func

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-02 10:23:14 +08:00
congqixia
16d869c57e
enhance: Add EmbedEtcd testutil and remove etcd dep of task pkg (#31802)
See also #20478

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-02 09:59:14 +08:00
wei liu
bb500d66c7
fix: Remove segment from leader view can't be executed (#31663)
issue: #31664

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-01 10:39:12 +08:00
wei liu
c311932d5f
fix: Update segment's version in leader task (#31643)
issue: #31468

1. when segment's version in leader view doesn't match segment's version
in dist, should update leader view
2. after call loadDeltalog, should update segment's load version with
latest ts
3. change leader task's priority from high to low, to avoid leader task
replace segment task and balance task

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-01 10:37:21 +08:00
wei liu
92971707de
enhance: Add restful api for devops to execute rolling upgrade (#29998)
issue: #29261
This PR Add restful api for devops to execute rolling upgrade, including
suspend/resume balance and manual transfer segments/channels.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-27 16:15:19 +08:00
wei liu
5d752498e7
fix: Skip release duplicate l0 segment (#31540)
issue: #31480 #31481

release duplicate l0 segment task, which execute on old delegator may
cause segment lack, and execute on new delegator may break new
delegator's leader view.

This PR skip release duplicate l0 segment by segment_checker, cause l0
segment will be released with unsub channel

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-27 12:53:10 +08:00
congqixia
8e5865f630
enhance: Save collection targets by batches (#31616)
See also #28491 #31240

When colleciton number is large, querycoord saves collection target one
by one, which is slow and may block querycoord exits.

In local run, 500 collections scenario may lead to about 40 seconds
saving collection targets.

This PR changes the `SaveCollectionTarget` interface into batch one and
organizes the collection in 16 per bundle batches to accelerate this
procedure.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-27 00:09:08 +08:00
congqixia
73858b23bc
fix: Make target observer auto/manual task mutual exclusive (#31584)
See also #30867

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-26 09:57:08 +08:00
wei liu
6438d65459
fix: Grow task stuck at stopping node (#31487)
issue: #30816
this PR fix that grow task stuck at stopping node

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-25 18:57:07 +08:00
congqixia
4d2142d041
fix: Check latest leader exists before using it (#31500)
See also #31495

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-22 18:25:07 +08:00
wei liu
03eaa5d478
fix: Load segment task promote failed (#31430)
issue: #30816

pr #31319 introduce the logic that segment checker need to load level
zero segment which only exist in current target.

This PR fix load segment task promote failed when segment only belongs
to current target

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-21 18:09:07 +08:00
chyezh
9f9ef8ac32
enhance: transfer resource group and dbname to querynode when load (#30936)
issue: #30931

Signed-off-by: chyezh <chyezh@outlook.com>
2024-03-21 11:59:12 +08:00