Commit Graph

19883 Commits

Author SHA1 Message Date
cai.zhang
32d3e22d7d
fix: Throw an exception after all the threads in thread pool finished (#32810)
issue: #32487

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2024-05-23 11:47:40 +08:00
yihao.dai
895799ec61
enhance: Abstract Execute interface for import/preimport task (#33234)
Abstract Execute interface for import/preimport task, simplify import
scheduler.

issue: https://github.com/milvus-io/milvus/issues/33157

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-05-23 11:29:41 +08:00
yihao.dai
9ff023ee35
fix: Fix filtering by partition key fails for importing data (#33274)
Before executing the import, partition IDs should be reordered according
to partition names. Otherwise, the data might be hashed to the wrong
partition during import. This PR corrects this error.

issue: https://github.com/milvus-io/milvus/issues/33237

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-05-23 11:13:40 +08:00
cai.zhang
be77ceba84
enhance: Use proto for passing info in cgo (#33184)
issue: #33183

---------

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2024-05-23 10:31:40 +08:00
XuanYang-cn
22bddde5ff
enhance: Tidy compactor and remove dup codes (#32198)
See also: #32451

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
2024-05-23 09:53:40 +08:00
PowderLi
b9d7145049
fix: [restful v2]role operations need dbName (#33283)
issue: #33220

use dbName as part of privilege entity, so
1. grant / revoke a privilege need dbName
2. we can describe the privileges of the role which belong to one
special database

Signed-off-by: PowderLi <min.li@zilliz.com>
2024-05-23 09:51:45 +08:00
congqixia
e1bafd7105
enhance: Use pre-built logger for write buffer frequent ops (#33273)
See also #33266

Each `WriteBuffer` shall have same channel/collection id attribute, so
use same logger will do and reduce logger allocation & frequent name
composition

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-05-22 21:11:40 +08:00
congqixia
33144a43d4
enhance: Support Row-based insert for milvusclient (#33270)
See also #31293

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-05-22 19:15:40 +08:00
wei liu
39f56678a0
enhance: Reduce bloom filter lock contention between insert and delete in query coord (#32643)
issue: #32530

cause ProcessDelete need to check whether pk exist in bloom filter, and
ProcessInsert need to update pk to bloom filter, when execute
ProcessInsert and ProcessDelete in parallel, it will cause race
condition in segment's bloom filter

This PR execute ProcessInsert and ProcessDelete in serial to avoid block
each other

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-22 19:11:40 +08:00
sammy.huang
310bfe71c2
fix: arm-based gpu image (#33275)
Signed-off-by: Liang Huang <sammy.huang@zilliz.com>
2024-05-22 17:08:29 +08:00
sre-ci-robot
fc765c6a72
[automated] Update cpu Builder image changes (#33202)
Update cpu Builder image changes
See changes:
d27db99697
Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2024-05-22 16:35:39 +08:00
sre-ci-robot
aef33351b6
[automated] Update gpu Builder image changes (#33192)
Update gpu Builder image changes
See changes:
c35eaaa358
Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2024-05-22 16:33:48 +08:00
SimFG
b9b6343c88
enhance: check the auth in some rest v2 api (#33256)
/kind improvement
link master proto:
https://github.com/milvus-io/milvus-proto/blob/master/proto/milvus.proto

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-05-22 16:03:40 +08:00
SimFG
dd0c6d6980
fix: the panic when db isn't existed in the rate limit interceptor (#33244)
issue: #33243

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-05-22 15:57:39 +08:00
congqixia
3c4df81261
enhance: Assert insert data length not overflow int (#33248)
When InsertData is too large for cpp proto unmarshalling, the error
message is confusing since the length is overflowed

This PR adds assertion for insert data length.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-05-22 15:11:39 +08:00
aoiasd
13fdaea9f0
fix: accesslog writer cache close cause deadlock (#33261)
relate: https://github.com/milvus-io/milvus/issues/33260

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2024-05-22 14:55:39 +08:00
shaoting-huang
de7901121f
Upgrade go from 1.20 to 1.21 (#33047)
Signed-off-by: shaoting-huang [shaoting-huang@zilliz.com]

issue: https://github.com/milvus-io/milvus/issues/32982

# Background
Go 1.21 introduces several improvements and changes over Go 1.20, which
is quite stable now. According to
[Go 1.21 Release Notes](https://tip.golang.org/doc/go1.21), the big
difference of Go 1.21 is enabling Profile-Guided Optimization by
default, which can improve performance by around 2-14%. Here are the
summary steps of PGO:
1. Build Initial Binary (Without PGO)
2. Deploying the Production Environment
3. Run the program and collect Performance Analysis Data (CPU pprof)
4. Analyze the Collected Data and Select a Performance Profile for PGO
5. Place the Performance Analysis File in the Main Package Directory and
Name It default.pgo
6. go build Detects the default.pgo File and Enables PGO
7. Build and Release the Updated Binary (With PGO)
8. Iterate and Repeat the Above Steps
<img width="657" alt="Screenshot 2024-05-14 at 15 57 01"
src="https://github.com/milvus-io/milvus/assets/167743503/b08d4300-0be1-44dc-801f-ce681dabc581">

# What does this PR do
There are three experiments, search benchmark by Zilliz test platform,
search benchmark by open-source
[VectorDBBench](https://github.com/zilliztech/VectorDBBench?tab=readme-ov-file),
and search benchmark with PGO. We do both search benchmarks by Zilliz
test platform and by VectorDBBench to reduce reliance on a single
experimental result. Besides, we validate the performance enhancement
with PGO.

## Search Benchmark Report by Zilliz Test Platform
An upgrade to Go 1.21 was conducted on a Milvus Standalone server,
equipped with 16 CPUs and 64GB of memory. The search performance was
evaluated using a 1 million entry local dataset with an L2 metric type
in a 768-dimensional space. The system was tested for concurrent
searches with 50 concurrent tasks for 1 hour, each with a 20-second
interval. The reason for using one server rather than two servers to
compare is to guarantee the same data source and same segment state
after compaction.

Test Sequence:
1. Go 1.20 Initial Run: Insert data, build index, load index, and
search.
2. Go 1.20 Rebuild: Rebuild the index with the same dataset, load index,
and search.
3. Go 1.21 Load: Upload to Go 1.21 within the server. Then load the
index from the second run, and search.
4. Go 1.21 Rebuild: Rebuild the index with the same dataset, load index,
and search.

Search Metrics: 
| Metric | Go 1.20 | Go 1.20 Rebuild Index | Go 1.21 | Go 1.21 Rebuild
Index |

|----------------------------|------------------|-----------------|------------------|-----------------|
| `search requests` | 10,942,683 | 16,131,726 | 16,200,887 | 16,331,052
|
| `search fails` | 0 | 0 | 0 | 0 |
| `search RT_avg` (ms) | 16.44 | 11.15 | 11.11 | 11.02 |
| `search RT_min` (ms) | 1.30 | 1.28 | 1.31 | 1.26 |
| `search RT_max` (ms) | 446.61 | 233.22 | 235.90 | 147.93 |
| `search TP50` (ms) | 11.74 | 10.46 | 10.43 | 10.35 |
| `search TP99` (ms) | 92.30 | 25.76 | 25.36 | 25.23 |
| `search RPS` | 3,039 | 4,481 | 4,500 | 4,536 |

### Key Findings
The benchmark tests reveal that the index build time with Go 1.20 at
340.39 ms and Go 1.21 at 337.60 ms demonstrated negligible performance
variance in index construction. However, Go 1.21 offers slightly better
performance in search operations compared to Go 1.20, with improvements
in handling concurrent tasks and reducing response times.

## Search Benchmark Report By VectorDb Bench
Follow
[VectorDBBench](https://github.com/zilliztech/VectorDBBench?tab=readme-ov-file)
to create a VectorDb Bench test for Go 1.20 and Go 1.21. We test the
search performance with Go 1.20 and Go 1.21 (without PGO) on the Milvus
Standalone system. The tests were conducted using the Cohere dataset
with 1 million entries in a 768-dimensional space, utilizing the COSINE
metric type.

Search Metrics: 
Metric | Go 1.20 | Go 1.21 without PGO
-- | -- | --
Load Duration (seconds) | 1195.95 | 976.37
Queries Per Second (QPS) | 841.62 | 875.89
99th Percentile Serial Latency (seconds) | 0.0047 | 0.0076
Recall | 0.9487 | 0.9489

### Key Findings
Go 1.21 indicates faster index loading times and larger search QPS
handling.

## PGO Performance Test
Milvus has already added
[net/http/pprof](https://pkg.go.dev/net/http/pprof) in the metrics. So
we can curl the CPU profile directly by running
`curl -o default.pgo
"http://${MILVUS_SERVER_IP}:${MILVUS_SERVER_PORT}/debug/pprof/profile?seconds=${TIME_SECOND}"`
to collect the profile as the default.pgo during the first search. Then
I build Milvus with PGO and use the same index to run the search again.
The result is as below:

Search Metrics
| Metric | Go 1.21 Without PGO | Go 1.21 With PGO | Change (%) |

|---------------------------------------------|------------------|-----------------|------------|
| `search Requests` | 2,644,583 | 2,837,726 | +7.30% |
| `search Fails` | 0 | 0 | N/A |
| `search RT_avg` (ms) | 11.34 | 10.57 | -6.78% |
| `search RT_min` (ms) | 1.39 | 1.32 | -5.18% |
| `search RT_max` (ms) | 349.72 | 143.72 | -58.91% |
| `search TP50` (ms) | 10.57 | 9.93 | -6.05% |
| `search TP99` (ms) | 26.14 | 24.16 | -7.56% |
| `search RPS`                 | 4,407       | 4,729       | +7.30%    |

### Key Findings
PGO led to a notable enhancement in search performance, particularly in
reducing the maximum response time by 58% and increasing the search QPS
by 7.3%.

### Further Analysis
Generate a diff flame graphs between two CPU profiles by running `go
tool pprof -http=:8000 -diff_base nopgo.pgo pgo.pgo -normalize`

<img width="1894" alt="goprofiling"
src="https://github.com/milvus-io/milvus/assets/167743503/ab9e91eb-95c7-4963-acd9-d1c3c73ee010">
Further insight of HnswIndexNode and Milvus Search Handler
<img width="1906" alt="hnsw"
src="https://github.com/milvus-io/milvus/assets/167743503/a04cf4a0-7c97-4451-b3cf-98afc20a0b05">
<img width="1873" alt="search_handler"
src="https://github.com/milvus-io/milvus/assets/167743503/5f4d3982-18dd-4115-8e76-460f7f534c7f">

After applying PGO to the Milvus server, the CPU utilization of the
faiss::fvec_L2 function has decreased. This optimization significantly
enhances the performance of the
[HnswIndexNode::Search::searchKnn](e0c9c41aa2/src/index/hnsw/hnsw.cc (L203))
method, which is frequently invoked by Knowhere during high-concurrency
searches. As the explanation from Go release notes, the function might
be more aggressively inlined by Go compiler during the second build with
the CPU profiling collected from the first run. As a result, the search
handler efficiency within Milvus DataNode has improved, allowing the
server to process a higher number of search queries per second (QPS).



# Conclusion
The combination of Go 1.21 and PGO has led to substantial enhancements
in search performance for Milvus server, particularly in terms of search
QPS and response times, making it more efficient for handling
high-concurrency search operations.

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
2024-05-22 13:21:39 +08:00
Alexander Guzhva
648d5661ca
enhance: Upgrade bitset for ARM SVE (#32718)
issue: #32826 
improve ARM SVE performance for `internal/core/src/bitset`

Baseline timings for gcc 11.4 + Graviton 3 + manually enabled SVE:
https://gist.github.com/alexanderguzhva/a974b50134c8bb9255fb15f144e5ac83

Candidate timings for gcc 11.4 + Graviton 3 + manually enabled SVE:
https://gist.github.com/alexanderguzhva/19fc88f4ad3757e05e0f7feaf563b3d3

Signed-off-by: Alexandr Guzhva <alexanderguzhva@gmail.com>
2024-05-22 11:37:40 +08:00
XuanYang-cn
819a624753
fix: Return error when startup Delete/AddNode fail (#33193)
See also: #33151, #33149

---------

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
2024-05-22 11:17:38 +08:00
Alexander Guzhva
f20becb725
fix: Download and install cmake for the current platform, not x86_64 only (#32548)
issue #32476 

tested on x86_64 and aarch64. I'm not sure what needs to be done on some
exotic architectures.

Signed-off-by: Alexandr Guzhva <alexanderguzhva@gmail.com>
2024-05-22 11:15:39 +08:00
wei liu
303470fc35
fix: Clean offline node from resource group after qc restart (#33232)
issue: #33200 #33207
pr#33104 causes the offline node will be kept in resource group after qc
recover, and offline node will be assign to new replica as rwNode, then
request send to those node will fail by NodeNotFound.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-22 10:03:40 +08:00
Xiaofan
3d105fcb4d
enhance: Remove l0 delete cache (#32990)
fix #32979
remove l0 cache and build delete pk and ts everytime. this reduce the
memory and also increase the code readability

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2024-05-21 22:53:40 +08:00
SimFG
e18d5aceb6
enhance: add config to control whether to init public role permissions (#33165)
issue: #33164

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-05-21 22:39:46 +08:00
cai.zhang
ed39a38953
enhance: Reduce the frequency of logs describing indexing failures (#33212)
issue: #33001 #33102

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2024-05-21 19:27:39 +08:00
congqixia
12e8c6c583
enhance: Try LatestMessageID when checkpoint unmarshal fails (#33158)
See also #33122

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-05-21 16:59:39 +08:00
sammy.huang
7ab7e3a004
feat: support arm-based image build and pull request (#33219)
Signed-off-by: Liang Huang <sammy.huang@zilliz.com>
2024-05-21 16:54:38 +08:00
yihao.dai
017fd7bc25
enhance: Select L2 segments in L0Compaction as well (#32991)
/kind improvement

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-05-21 16:13:39 +08:00
wei liu
33bd6eed28
fix: Clean offline node from replica after qc recover (#33213)
issue: #33200 #33207

pr#33104 remove this logic by mistake, which cause the offline node will
be kept in replica after qc recover, and request send to offline qn will
go a NodeNotFound error.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-21 15:41:39 +08:00
Cai Yudong
cb480d17c8
fix: Fix SparseFloatVector data parse error for parquet (#33187)
Issue: #22837

Signed-off-by: Cai Yudong <yudong.cai@zilliz.com>
2024-05-21 15:09:39 +08:00
XuanYang-cn
2d6f12d48b
fix: channel manager's goroutine run order (#33118)
See also: #33117

---------

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
2024-05-21 14:35:39 +08:00
congqixia
f336b2d672
fix: Check schema without vector field in proxy (#33211)
Related to #33199

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-05-21 14:33:39 +08:00
wei liu
2013d97243
enhance: Enable to dynamic update balancer policy in querycoord (#33037)
issue: #33036
This PR enable to dynamic update balancer policy without restart
querycoord.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-21 14:29:39 +08:00
Xiaofan
f681c4b034
enhance: remove describe index in rootcoord broker (#33206)
fix #33205
remove the dependency between datacoord and rootcoord

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2024-05-21 14:11:39 +08:00
congqixia
f31a20faad
fix: [Backport] Mark channel checkpoint dropped prevent cp lag metrics leakage (#32454) (#33198)
Cherry-pick from 2.3
pr: #32454
See also #31506 #31508

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-05-21 11:59:39 +08:00
Jiquan Long
9f81290c63
fix: try best to get enough query results (#33178)
issue: https://github.com/milvus-io/milvus/issues/33137

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-05-21 11:57:51 +08:00
Bingyi Sun
0f8c6f49ff
enhance: mmap load raw data if scalar index does not have raw data (#33175)
Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-05-21 11:53:39 +08:00
XuanYang-cn
b3bcc107bb
fix: Remove L0 compactor in completedCompactor (#33169)
See also: #33168

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
2024-05-21 11:35:38 +08:00
aoiasd
f8929cc36a
fix: can't generate traceID when use noop exporter (#33191)
relate: https://github.com/milvus-io/milvus/issues/33190

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2024-05-21 10:39:39 +08:00
smellthemoon
89ad3eb0ca
enhance: reduce memory when read field (#33195)
Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
2024-05-20 22:25:38 +08:00
jaime
0d99db23b8
fix: metrics leak on the coord nodes (#33075)
issue: #32980

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-05-20 22:03:39 +08:00
shaoting-huang
d27db99697
enhance: upgrade amazonlinux2023 builder image go version to 1.21 (#33176)
Signed-off-by: shaoting-huang [shaoting-huang@zilliz.com]

issue: https://github.com/milvus-io/milvus/issues/32982

Go 1.21 introduces several improvements and changes over Go 1.20, which
is quite stable now. This PR is mainly for upgrading images Golang
version from 1.20 to 1.21.

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
2024-05-20 21:11:39 +08:00
congqixia
7eeb120aab
enhance: Add lint rules for client pkg and fix problems (#33180)
See also #31293

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-05-20 20:47:38 +08:00
pingliu
a1b5d7d6d3
doc: [skip-e2e] change milvus docker image version to v2.4.1 (#33170)
Signed-off-by: ping.liu <ping.liu@zilliz.com>
2024-05-20 16:21:38 +08:00
shaoting-huang
c35eaaa358
enhance: upgrade images golang version from 1.20 to 1.21 (#33150)
Signed-off-by: shaoting-huang [shaoting-huang@zilliz.com]

issue: https://github.com/milvus-io/milvus/issues/32982

Go 1.21 introduces several improvements and changes over Go 1.20, which
is quite stable now. This PR is mainly for upgrading images Golang
version from 1.20 to 1.21.

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
2024-05-20 15:01:43 +08:00
congqixia
f76d16780b
enhance: Refine channel mgr v2 implementation (#33156)
Related to #25309

- Remove ctx from struct
- Add ctx parameters for internal check logic methods
- Add Waitgroup to make sure worker goroutine quit before close returns

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-05-20 14:13:38 +08:00
sre-ci-robot
555df49d25
[automated] Update Pytest image changes (#33126)
Update Pytest image changes
See changes:
0d0eda24f8
Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2024-05-20 11:43:37 +08:00
congqixia
c2ac692008
enhance: Add param item to ignore bad message id in checkpoint (#33123)
See also #33122

This pr add param item `mq.ignoreBadPosition` to control behavior when
mq failed to parse message id from checkpoint

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-05-20 11:29:37 +08:00
SimFG
ec98de3ad4
fix: reset the quota value when init the limiter (#33111)
issue: #33107
/kind improvement

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-05-20 10:35:38 +08:00
wei liu
a7f6193bfc
fix: query node may stuck at stopping progress (#33104)
issue: #33103 
when try to do stopping balance for stopping query node, balancer will
try to get node list from replica.GetNodes, then check whether node is
stopping, if so, stopping balance will be triggered for this replica.

after the replica refactor, replica.GetNodes only return rwNodes, and
the stopping node maintains in roNodes, so balancer couldn't find
replica which contains stopping node, and stopping balance for replica
won't be triggered, then query node will stuck forever due to
segment/channel doesn't move out.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-20 10:21:38 +08:00
sre-ci-robot
c6e2dd05fc
[automated] Update Knowhere Commit (#33147)
Update Knowhere Commit
Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2024-05-20 01:51:37 +08:00