Commit Graph

125 Commits

Author SHA1 Message Date
congqixia
4aa8a12ce8
fix: [2.4] Check partition in current target when observing partition load status (#34282) (#34305)
Cherry-pick from master
pr: #34282
See also #34234

`LoadPartitions` does not guarantee the current target has loading
partitions if there are some partitions already loaded before.

This PR check current target contains the partition to load when
advancing loading percentage to 100.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-07-02 15:48:10 +08:00
jaime
6423b6c718
enhance: move rocksmq from internal to pkg (#34165)
pr:  https://github.com/milvus-io/milvus/pull/33881
issue:  https://github.com/milvus-io/milvus/issues/33956

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-06-26 13:36:05 +08:00
SimFG
f664b51ebe
enhance: [2.4] try to speed up the loading of small collections (#33746)
- issue: #33569
- pr: #33570

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-06-11 15:07:55 +08:00
wei liu
9ae4945df2
fix: query node may stuck at stopping progress (#33104) (#33154)
issue: #33103 
pr: #33104
when try to do stopping balance for stopping query node, balancer will
try to get node list from replica.GetNodes, then check whether node is
stopping, if so, stopping balance will be triggered for this replica.

after the replica refactor, replica.GetNodes only return rwNodes, and
the stopping node maintains in roNodes, so balancer couldn't find
replica which contains stopping node, and stopping balance for replica
won't be triggered, then query node will stuck forever due to
segment/channel doesn't move out.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-20 15:01:43 +08:00
chyezh
293f14a8b9
fix: remove redundant replica recover (#32985)
issue: #22288 

- replica recover should be only triggered by replica recover

Signed-off-by: chyezh <chyezh@outlook.com>
2024-05-13 15:25:32 +08:00
Xiaofan
b044e5503e
enhance:Improve load speed (#32898)
fix #32897
add memory check when load collection

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2024-05-11 10:29:31 +08:00
wei liu
e2332bdc17
enhance: Enable channel exclusive balance policy (#32911)
issue: #32910  
* split replica's node list to channels when create replicas
 * balance nodes among channels when node change happens
 * implement channel level balance, let balance happens in channel level

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-10 17:27:31 +08:00
wei liu
ba02d54a30
enhance: update shard leader cache when leader location changed (#32470)
issue: #32466

this PR enhance that when shard location changed, update proxy's shard
leader cache. in case of query node failover case, proxy can find
replica recover

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-08 10:05:29 +08:00
chyezh
b904c8d377
enhance: resource group unittest refactory (#32739)
issue: #30647

Signed-off-by: chyezh <chyezh@outlook.com>
2024-05-06 10:17:34 +08:00
chyezh
ef4c875d4c
fix: resource group ut may failure (#32688)
issue: https://github.com/milvus-io/milvus/issues/30647

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-29 14:17:26 +08:00
congqixia
4cdf6c3c41
fix: Check partition nil before observe load progress (#32659)
See also #32441 #32615

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-28 16:29:25 +08:00
Xiaofan
02ace25c68
enhance: reduce the cpu usage when collection number is high (#32245)
related to #32165
1. for all the manager, support collection level index
2. remove collection level filter to avoid extra cpu usage when
collection number increases

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2024-04-26 11:49:25 +08:00
chyezh
21a9de5c8e
fix: resource group ut fixup (#32509)
issue: #30647

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-23 10:01:23 +08:00
congqixia
d7ff1bbe5c
enhance: Make querycoordv2 collection observer task driven (#32441)
See also #32440

- Add loadTask in collection observer
- For load collection/partitions, load task shall timeout as a whole
- Change related constructor to load jobs

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-22 10:39:22 +08:00
chyezh
70e3d5b495
fix: wrong node id in TestCheckNodesInReplica (#32268)
issue: #31930

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-15 17:38:17 +08:00
chyezh
48fe977a9d
enhance: declarative resource group api (#31930)
issue: #30647

- Add declarative resource group api

- Add config for resource group management

- Resource group recovery enhancement

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-15 08:13:19 +08:00
wei liu
c4806b69c4
enhance: Refactor leader view manager interface (#31133)
issue: #31091
This PR add GetByFilter interface in leader view manager, instead of all
kind of get func

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-10 15:13:36 +08:00
chyezh
a2502bde75
enhance: replica manager enhancement (#31496)
issue: #30647 

- ReplicaManager manage read only node now, and always do persistent of
node distribution of replica.

- All segment/channel checker using ReplicaManager to get read-only node
or read-write node, but not ResourceManager.

- ReplicaManager promise that only apply unique querynode to one replica
in same collection now (replicas in same collection never hold same
querynode at same time).

- ReplicaManager promise that fairly node count assignment policy if
multi replicas of collection is assigned to one resource group.

- Move some parameters check into ReplicaManager to avoid data race.

- Allow transfer replica to resource group that already load replica of
same collection

- Allow transfer node between resource groups that load replica of same
collection

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-05 04:57:16 +08:00
wei liu
0944a1f790
enhance: Refactor channel dist manager interface (#31119)
issue: #31091
This PR add GetByFilter interface in channel dist manager, instead of
all kind of get func

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-02 10:23:14 +08:00
congqixia
73858b23bc
fix: Make target observer auto/manual task mutual exclusive (#31584)
See also #30867

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-26 09:57:08 +08:00
chyezh
9f9ef8ac32
enhance: transfer resource group and dbname to querynode when load (#30936)
issue: #30931

Signed-off-by: chyezh <chyezh@outlook.com>
2024-03-21 11:59:12 +08:00
chyezh
ff4237bb90
enhance: add hostname into node info (#30673)
issue: https://github.com/milvus-io/milvus/issues/30647

- Address may be reused in k8s environment. Using hostname can be
better.

Signed-off-by: chyezh <chyezh@outlook.com>
2024-03-15 10:45:06 +08:00
wei liu
efe8cecc88
enhance: refactor segment dist manager interface (#31073)
issue: #31091
This PR add `GetByFilter` interface in segment dist manager, instead of
all kind of get func

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-08 16:29:01 +08:00
congqixia
c886aa29ff
enhance: Use ListIndexes instead of DescribeIndex for qc broker (#31122)
See also #31103

Since querycoord need index meta information from datacoord only, broker
shall use `ListIndexes` to skip segment index building check logic in
datacoord

This PR is also related to #30538, in which DescribeIndex caused lots of
memory usage and lead to OOM eventually

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-07 21:43:03 +08:00
yiwangdr
32cff25f97
enhance: decrease coordinator init time (#29822)
This PR mainly improve two items:
1. Target observer should refresh loading status during init time. An
uninitialized loading status blocks search/query. Currently, the target
observer refreshes every 10 seconds, i.e. we'd need to wait for 10s for
no reason. That's also the reason why we constantly see false log
"collection unloaded" upon mixcoord restarts.
2. Delete session when service is stopped. So that the new service
doesn't need to wait for the previous session to expire (~10s).

Item 1 is the major improvement of this PR, which should speed up init
time by 10s.
Item 2 is not a big concern in most cases as coordinators usually shut
down after stop(). In those cases, coordinator restart triggers serverID
change which further triggers an existing logic that deletes expired
session. This PR only fixes rare cases where serverID doesn't change.

integration test:
`go test -tags dynamic -v -coverprofile=profile.out -covermode=atomic
tests/integration/coordrecovery/coord_recovery_test.go -timeout=20m`
Performance after the change:
Average init time of coordinators: 10s
Hardware: M2 Pro
Test setup: 1000 collections with 1000 rows (dim=128) per collection.


issue: #29409

Signed-off-by: yiwangdr <yiwangdr@gmail.com>
2024-02-05 14:00:12 +08:00
wei liu
797847904c
enhance: Change some frequency log to rated level (#29720)
This PR change some frequency log to rated level

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-11 16:30:50 +08:00
wei liu
e98c62abbb
enhance: refactor leader_observer to leader_checker (#29454)
issue: #29453 

sync distribution by rpc will also call loadSegment/releaseSegment,
which may cause all kinds of concurrent case on same segment, such as
concurrent load and release on one segment.
This PR add leader_checker which generate load/release task to correct
the leader view, instead of calling sync distribution by rpc

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-05 15:54:55 +08:00
congqixia
da7c3cbd88
enhance: make delegator delete buffer holding all delete from cp (#29626)
See also #29625

This PR:
- Add a new implemention of `DeleteBuffer`: listDeleteBuffer
  - holds cacheBlock slice
  - `Put` method append new delete data into last block
  - when a block is full, append a new block into the list
- Add `TryDiscard` method for `DeleteBuffer` interface
  - For doubleCacheBuffer, do nothing
- For listDeleteBuffer, try to evict "old" blocks, which are blocks
before the first block whose start ts is behind provided ts
- Add checkpoint field for `UpdateVersion` sync action, which shall be
used to discard old cache delete block

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-01-04 17:02:46 +08:00
congqixia
a3cb8e2625
fix: Add atomic method to get collection target (#29577)
Related to #29575

Add `getCollectionTarget` method which is atomic when scope is
`CurrentTargetFirst` or `NextTargetFirst`
Also return error when executor finds no channel in target manager

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-12-29 09:04:46 +08:00
wei liu
514e279f3a
enhance: Remove useless log in collection observer (#29554)
This PR removed the useless log in collection observer

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-28 17:16:47 +08:00
yah01
13beb5ccc0
fix: load gets stuck probably (#29191)
we found the load got stuck probably, and reviewed the logs.

the target observer seems not working, the reason is the taskDispatcher
removes the task in a goroutine, and modifies the task status after
committing the task into the goroutine pool, but this may happen after
the task removed, which leads to the task will never be removed

related #29086

Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-12-14 18:28:38 +08:00
yah01
9819090247
enhance: add more logs for target updating (#29090)
- add more logs about the condition satisfying

---------

Signed-off-by: yah01 <yah2er0ne@outlook.com>
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-12-12 14:06:43 +08:00
yah01
fab52d167b
fix: may miss stream delta while loading (#28871)
we consume the delta data from the lastest channel checkpoint while
loading segment,

this works well without level 0 segments, but now it may lead to miss
some delta data,

so we have to consume from the current target's channel checkpoint

related: #27349

---------

Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-12-05 17:34:45 +08:00
aoiasd
b4af6f8c40
fix: sync action load segment with lack collection index info list (#28788)
relate: https://github.com/milvus-io/milvus/issues/28779
https://github.com/milvus-io/milvus/issues/28637

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2023-12-04 18:14:34 +08:00
wei liu
d081fd5481
enhance: Change some frequency log to rated level (#28897)
This pr change some frequency log's level to rated.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-04 10:38:35 +08:00
congqixia
f9bb8e9648
enhance: Change const magic number in querycoord to param (#28819)
See also #28817

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-11-30 09:06:28 +08:00
congqixia
81caf02554
fix: make qcv2 observer dispatcher execute exactly once (#28472)
See also #28466

In `taskDispatcher.schedule`, same task may be resubmitted if the
previous round did not finish
In this case, TaskObserver.check may set current target by mistake,
which may cause the random search/query failure

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-11-16 10:24:19 +08:00
yah01
ecb3f585c3
Fix passing the wrong dropped list from current target (#28265)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-11-08 17:02:18 +08:00
yah01
1b90630633
Fix the target updated before version updated to cause data missing (#28250)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-11-08 11:36:22 +08:00
wei liu
7485eeb689
fix sync distribution with wrong version (#28130)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-11-03 19:02:18 +08:00
yah01
dc89730a50
Support collection-level mmap control (#26901)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-11-02 23:52:16 +08:00
wei liu
e0222b2ce3
refine target manager code style (#27883)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-10-25 00:44:12 +08:00
congqixia
323fc107a7
Fix taskDispatcher add multiple tasks will ignore following ones (#27885)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-10-24 17:18:13 +08:00
congqixia
93a877f55e
Make qcv2 target&leader observer execute in parallel (#27844)
- Add `taskDispatcher` to submit and run task async safely
- Change `LeaderObeserver` and `TargetObserver` schedule and manual check action to submitting task into dispatcher
- Fix logic problem in collection observer when manual check return false

See also #27494

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-10-24 10:14:11 +08:00
wei liu
55e5f80e24
update collection target after observer start (#27774)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-10-19 21:52:10 +08:00
yah01
be980fbc38
Refine state check (#27541)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-10-11 21:01:35 +08:00
wei liu
42c475a0e0
remove useless log in querycoord (#27362)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-10-11 10:13:34 +08:00
congqixia
eca79d149c
Add ctx control for observer manual check methods (#27531)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-10-09 11:07:33 +08:00
yah01
a715165306
Set timeout for leader observer syncing (#27504)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-10-08 16:55:31 +08:00
yah01
a8ce1b6686
Refine QueryCoord stopping (#27371)
Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-09-27 16:27:27 +08:00