Commit Graph

499 Commits

Author SHA1 Message Date
wei liu
303470fc35
fix: Clean offline node from resource group after qc restart (#33232)
issue: #33200 #33207
pr#33104 causes the offline node will be kept in resource group after qc
recover, and offline node will be assign to new replica as rwNode, then
request send to those node will fail by NodeNotFound.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-22 10:03:40 +08:00
wei liu
33bd6eed28
fix: Clean offline node from replica after qc recover (#33213)
issue: #33200 #33207

pr#33104 remove this logic by mistake, which cause the offline node will
be kept in replica after qc recover, and request send to offline qn will
go a NodeNotFound error.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-21 15:41:39 +08:00
wei liu
2013d97243
enhance: Enable to dynamic update balancer policy in querycoord (#33037)
issue: #33036
This PR enable to dynamic update balancer policy without restart
querycoord.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-21 14:29:39 +08:00
jaime
0d99db23b8
fix: metrics leak on the coord nodes (#33075)
issue: #32980

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-05-20 22:03:39 +08:00
wei liu
a7f6193bfc
fix: query node may stuck at stopping progress (#33104)
issue: #33103 
when try to do stopping balance for stopping query node, balancer will
try to get node list from replica.GetNodes, then check whether node is
stopping, if so, stopping balance will be triggered for this replica.

after the replica refactor, replica.GetNodes only return rwNodes, and
the stopping node maintains in roNodes, so balancer couldn't find
replica which contains stopping node, and stopping balance for replica
won't be triggered, then query node will stuck forever due to
segment/channel doesn't move out.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-20 10:21:38 +08:00
wei liu
f1c9986974
enhance: Skip return data distribution if no change happen (#32814)
issue: #32813

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-17 10:11:37 +08:00
cai.zhang
6ea7633bd5
enhance: Add memory size for binlog (#33025)
issue: #33005
1. add `MemorySize` field for insert binlog.
2. `LogSize` means the file size in the storage object.
3. `MemorySize` means the size of the data in the memory.

---------

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
Signed-off-by: cai.zhang <cai.zhang@zilliz.com>
2024-05-15 12:59:34 +08:00
SimFG
1d48d0aeb2
enhance: use different value to get related data size according to segment type (#33017)
issue: #30436

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-05-14 14:59:33 +08:00
congqixia
861977ab60
fix: Start LeaderCacheObserver before SyncAll (#33035)
Related to #33033

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-05-14 13:25:32 +08:00
wei liu
cba2c7a3be
enhance: clean channel node info in meta store (#32988)
issue: #32910
see also: #32911
when channel exclusive mode is enabled, replica will record channel node
info in meta store, and if the balance policy changes, which means
channel exclusive mode is disabled, we should clean up the channel node
info in meta store, and stop to balance node between channels.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-14 10:05:40 +08:00
chyezh
293f14a8b9
fix: remove redundant replica recover (#32985)
issue: #22288 

- replica recover should be only triggered by replica recover

Signed-off-by: chyezh <chyezh@outlook.com>
2024-05-13 15:25:32 +08:00
Xiaofan
b044e5503e
enhance:Improve load speed (#32898)
fix #32897
add memory check when load collection

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2024-05-11 10:29:31 +08:00
chyezh
1c84a1c9b6
fix: lru related issue fixup patch (#32916)
issue: #32206, #32801

- search failure with some assertion, segment not loaded and resource
insufficient.

- segment leak when query segments

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-05-10 19:17:30 +08:00
wei liu
e2332bdc17
enhance: Enable channel exclusive balance policy (#32911)
issue: #32910  
* split replica's node list to channels when create replicas
 * balance nodes among channels when node change happens
 * implement channel level balance, let balance happens in channel level

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-10 17:27:31 +08:00
wei liu
04a8ec69f6
fix: Segment on stopping query node can't be release successfully (#32929)
issue: #32901
Cause release segment request need be send to delegator, but it need
replica to info find segment's delegator. but the stopping query node
will be marked as read only in replica, then `replica.Contains()` just
return true for rwNode in replica. then it can't get replica info by
stopping query node and release segment will be blocked.

This PR make `replica.Contains()` return true for both roNode and
rwNode.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-10 14:33:30 +08:00
Bingyi Sun
b7ef8da360
fix: set channel checkpoint to delta position (#32878)
issue: https://github.com/milvus-io/milvus/issues/32853

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-05-10 11:51:30 +08:00
congqixia
efa58ae423
enhance: Utilize coll2replica mapping when getting rg by collection (#32892)
See also #32165

In old `GetResourceGroupByCollection` implementation, it iterates all
replicas to match collection id, which is slow and CPU time consuming.
This PR make it utilize the coll2Replicas mapping by calling
`GetByCollection` and mapping replicas into resource group.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-05-09 19:37:30 +08:00
congqixia
acb0417a9f
enhance: Avoid iteration over channel results when update leaderview (#32887)
See also #32165

Cache channel name to channel info to avoid iteration over channel
results when updating leader view version.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-05-09 15:41:30 +08:00
wei liu
fad8f0afa5
enhance: enable stopping balance after balance has been suspended (#32812)
issue: #32811

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-08 10:15:29 +08:00
wei liu
ba02d54a30
enhance: update shard leader cache when leader location changed (#32470)
issue: #32466

this PR enhance that when shard location changed, update proxy's shard
leader cache. in case of query node failover case, proxy can find
replica recover

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-08 10:05:29 +08:00
yihao.dai
9db3aa18bc
enhance: Remove deprecated EnableIndex (#32704)
/kind improvement

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-05-07 17:11:30 +08:00
chyezh
b904c8d377
enhance: resource group unittest refactory (#32739)
issue: #30647

Signed-off-by: chyezh <chyezh@outlook.com>
2024-05-06 10:17:34 +08:00
wei liu
d900e68440
fix: fix GetShardLeaders return empty node list (#32685)
issue: #32449

to avoid GetShardLeaders return empty node list, this PR add node list
check in both client side and server side.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-29 14:19:26 +08:00
chyezh
ef4c875d4c
fix: resource group ut may failure (#32688)
issue: https://github.com/milvus-io/milvus/issues/30647

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-29 14:17:26 +08:00
wei liu
c0555d4b45
fix: Remove read only node from replica immedaitely after node down (#32666)
issue: #32665

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-28 20:25:25 +08:00
congqixia
4cdf6c3c41
fix: Check partition nil before observe load progress (#32659)
See also #32441 #32615

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-28 16:29:25 +08:00
congqixia
a239e9110e
enhance: Apply node-indexing and cache optimization for channel dist (#32595)
See also #32165

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-28 16:19:24 +08:00
Xiaofan
02ace25c68
enhance: reduce the cpu usage when collection number is high (#32245)
related to #32165
1. for all the manager, support collection level index
2. remove collection level filter to avoid extra cpu usage when
collection number increases

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2024-04-26 11:49:25 +08:00
chyezh
f06509bf97
fix: get replica should not report error when no querynode serve (#32536)
issue: #30647

- Remove error report if there's no query node serve. It's hard for
programer to use it to do resource management.

- Change resource group `transferNode` logic to keep compatible with old
version sdk.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-25 19:25:24 +08:00
chyezh
b287fbaa2e
fix: return collection on recovering but not collection not loaded when target is not recovered (#32447)
issue: #32398

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-25 11:21:26 +08:00
congqixia
f30c22626e
enhance: Pre-cache result for frequent filters (#32580)
See also #32165

Add segment dist and leader view filter criterion struct to store
frequent filter conditions.
Add collection/channel filter results for these two meta

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-25 11:13:25 +08:00
congqixia
37ca32dbba
enhance: Make SegmentDistManager filter use node index (#32533)
See also #32165

Change `SegmentDistFilter` to interface in order to provde node index
when filter segment dist.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-24 16:53:24 +08:00
smellthemoon
96d95e7743
enhance: fix pass error msg as channel name (#32511)
Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
2024-04-23 16:45:22 +08:00
congqixia
bfebdecf3e
enhance: Make LeaderView Manager filter use map index (#32505)
See also #32165

Change `LeaderViewFilter` to interface to provided map key to avoid
iterating all key-values in LeaderViewManager

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-23 11:07:24 +08:00
chyezh
21a9de5c8e
fix: resource group ut fixup (#32509)
issue: #30647

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-23 10:01:23 +08:00
congqixia
d7ff1bbe5c
enhance: Make querycoordv2 collection observer task driven (#32441)
See also #32440

- Add loadTask in collection observer
- For load collection/partitions, load task shall timeout as a whole
- Change related constructor to load jobs

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-22 10:39:22 +08:00
congqixia
01c16fe6e3
enhance: Manual release pool after save targets (#32358)
See also #31632

Release conc.Pool after usage to clean worker and stop background purge
and ticktock.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-19 13:51:21 +08:00
chyezh
a8c8a6bb0f
fix: parameter check of TransferReplica and TransferNode (#32297)
issue: #30647 

- Same dst and src resource group should not be allowed in
`TransferReplica` and `TransferNode`.

-  Remove redundant parameter check.

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-17 15:27:19 +08:00
yiwangdr
7deda4d5e9
enhance: speed up GetByCollectionAndNode (#32232)
Related to https://github.com/milvus-io/milvus/issues/32165

Avoid iterating through all replicas/collections if possible. Iteration
is expensive when there are large number of replicas/collections.

Signed-off-by: yiwangdr <yiwangdr@gmail.com>
2024-04-17 10:23:25 +08:00
congqixia
72c172a7d7
enhance: Remove duplicated collectionID label for task latency (#32308)
`CollectionID` already exists in channel name, so remove it to save
metrics traffic.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-16 18:55:19 +08:00
chyezh
70e3d5b495
fix: wrong node id in TestCheckNodesInReplica (#32268)
issue: #31930

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-15 17:38:17 +08:00
wei liu
4822b109bd
fix: Skip to load l0 segment on old version query node (#32124)
issue: #32107

during rolling upgrade progress, skip to load l0 segment on old version
query node

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-15 11:23:23 +08:00
congqixia
dc11cbd123
enhance: Maintain collection-patitions mapping in qc meta (#32227)
Related to #32165

Add collection to partitionIDs mapping to avoid interation on all
partitions loaded when trying to get all partitions with collection id

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-15 10:05:19 +08:00
chyezh
48fe977a9d
enhance: declarative resource group api (#31930)
issue: #30647

- Add declarative resource group api

- Add config for resource group management

- Resource group recovery enhancement

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-15 08:13:19 +08:00
wei liu
68dec7dcd4
fix: Use correct ts to avoid exclude segment list leak (#31991)
issue: #31990

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-12 10:39:19 +08:00
congqixia
b9a487608a
fix: Make ResourceGroup.nodes concurrent safe (#32159)
See also #32158

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-11 17:53:18 +08:00
congqixia
25a1c9ecf0
fix: Make coordinator Register not blocked on ProcessActiveStandby (#32069)
See also #32066

This PR make coordinator register successful and let
`ProcessActiveStandBy` run async. And roles may receive stop signal and
notify servers.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-10 18:49:18 +08:00
chyezh
a3d6110957
fix: ut failure (#32120)
issue: #30647

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-10 17:30:48 +08:00
chyezh
0be67e7f99
fix: ut failure (#32119)
issue: #30647

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-10 17:23:27 +08:00
wei liu
c4806b69c4
enhance: Refactor leader view manager interface (#31133)
issue: #31091
This PR add GetByFilter interface in leader view manager, instead of all
kind of get func

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-10 15:13:36 +08:00