milvus

mirror of https://gitee.com/milvus-io/milvus.git synced 2024-12-04 12:59:23 +08:00

Author	SHA1	Message	Date
cai.zhang	32d3e22d7d	fix: Throw an exception after all the threads in thread pool finished (#32810 ) issue: #32487 Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>	2024-05-23 11:47:40 +08:00
yihao.dai	895799ec61	enhance: Abstract Execute interface for import/preimport task (#33234 ) Abstract Execute interface for import/preimport task, simplify import scheduler. issue: https://github.com/milvus-io/milvus/issues/33157 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-05-23 11:29:41 +08:00
yihao.dai	9ff023ee35	fix: Fix filtering by partition key fails for importing data (#33274 ) Before executing the import, partition IDs should be reordered according to partition names. Otherwise, the data might be hashed to the wrong partition during import. This PR corrects this error. issue: https://github.com/milvus-io/milvus/issues/33237 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-05-23 11:13:40 +08:00
cai.zhang	be77ceba84	enhance: Use proto for passing info in cgo (#33184 ) issue: #33183 --------- Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>	2024-05-23 10:31:40 +08:00
XuanYang-cn	22bddde5ff	enhance: Tidy compactor and remove dup codes (#32198 ) See also: #32451 Signed-off-by: yangxuan <xuan.yang@zilliz.com>	2024-05-23 09:53:40 +08:00
PowderLi	b9d7145049	fix: [restful v2]role operations need dbName (#33283 ) issue: #33220 dbName is used as part of the privilege entity, so: 1. granting or revoking a privilege needs dbName; 2. we can describe the privileges of a role that belong to one specific database. Signed-off-by: PowderLi <min.li@zilliz.com>	2024-05-23 09:51:45 +08:00
congqixia	e1bafd7105	enhance: Use pre-built logger for write buffer frequent ops (#33273 ) See also #33266 Each `WriteBuffer` has the same channel/collection ID attributes, so a single pre-built logger suffices and reduces logger allocation and frequent name composition. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-05-22 21:11:40 +08:00
congqixia	33144a43d4	enhance: Support Row-based insert for milvusclient (#33270 ) See also #31293 Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-05-22 19:15:40 +08:00
wei liu	39f56678a0	enhance: Reduce bloom filter lock contention between insert and delete in query coord (#32643 ) issue: #32530 ProcessDelete needs to check whether a pk exists in the bloom filter, and ProcessInsert needs to add pks to the bloom filter, so executing ProcessInsert and ProcessDelete in parallel causes a race condition on the segment's bloom filter. This PR executes ProcessInsert and ProcessDelete serially so that they do not block each other. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-22 19:11:40 +08:00
sammy.huang	310bfe71c2	fix: arm-based gpu image (#33275 ) Signed-off-by: Liang Huang <sammy.huang@zilliz.com>	2024-05-22 17:08:29 +08:00
sre-ci-robot	fc765c6a72	[automated] Update cpu Builder image changes (#33202 ) Update cpu Builder image changes See changes: `d27db99697` Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2024-05-22 16:35:39 +08:00
sre-ci-robot	aef33351b6	[automated] Update gpu Builder image changes (#33192 ) Update gpu Builder image changes See changes: `c35eaaa358` Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2024-05-22 16:33:48 +08:00
SimFG	b9b6343c88	enhance: check the auth in some rest v2 api (#33256 ) /kind improvement link master proto: https://github.com/milvus-io/milvus-proto/blob/master/proto/milvus.proto Signed-off-by: SimFG <bang.fu@zilliz.com>	2024-05-22 16:03:40 +08:00
SimFG	dd0c6d6980	fix: the panic when db isn't existed in the rate limit interceptor (#33244 ) issue: #33243 Signed-off-by: SimFG <bang.fu@zilliz.com>	2024-05-22 15:57:39 +08:00
congqixia	3c4df81261	enhance: Assert insert data length not overflow int (#33248 ) When InsertData is too large for cpp proto unmarshalling, the error message is confusing because the length has overflowed. This PR adds an assertion on the insert data length. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-05-22 15:11:39 +08:00
aoiasd	13fdaea9f0	fix: accesslog writer cache close cause deadlock (#33261 ) relate: https://github.com/milvus-io/milvus/issues/33260 Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2024-05-22 14:55:39 +08:00
shaoting-huang	de7901121f	Upgrade go from 1.20 to 1.21 (#33047 )

Signed-off-by: shaoting-huang [shaoting-huang@zilliz.com]

issue: https://github.com/milvus-io/milvus/issues/32982

# Background

Go 1.21 introduces several improvements and changes over Go 1.20 and is now quite stable. According to the [Go 1.21 Release Notes](https://tip.golang.org/doc/go1.21), the biggest difference in Go 1.21 is that Profile-Guided Optimization (PGO) is enabled by default, which can improve performance by around 2-14%. The PGO workflow is:

1. Build the initial binary (without PGO).
2. Deploy it to the production environment.
3. Run the program and collect performance analysis data (CPU pprof).
4. Analyze the collected data and select a performance profile for PGO.
5. Place the profile in the main package directory and name it default.pgo.
6. go build detects the default.pgo file and enables PGO.
7. Build and release the updated binary (with PGO).
8. Iterate and repeat the steps above.

<img width="657" alt="Screenshot 2024-05-14 at 15 57 01" src="https://github.com/milvus-io/milvus/assets/167743503/b08d4300-0be1-44dc-801f-ce681dabc581">

# What does this PR do

There are three experiments: a search benchmark on the Zilliz test platform, a search benchmark with the open-source [VectorDBBench](https://github.com/zilliztech/VectorDBBench?tab=readme-ov-file), and a search benchmark with PGO. We run the search benchmark on both the Zilliz test platform and VectorDBBench to reduce reliance on a single experimental result. In addition, we validate the performance enhancement with PGO.

## Search Benchmark Report by Zilliz Test Platform

An upgrade to Go 1.21 was conducted on a Milvus Standalone server equipped with 16 CPUs and 64GB of memory. The search performance was evaluated using a 1 million entry local dataset with an L2 metric type in a 768-dimensional space. The system was tested for concurrent searches with 50 concurrent tasks for 1 hour, each with a 20-second interval.
The reason for using one server rather than two for the comparison is to guarantee the same data source and the same segment state after compaction.

Test Sequence:

1. Go 1.20 Initial Run: Insert data, build index, load index, and search.
2. Go 1.20 Rebuild: Rebuild the index with the same dataset, load index, and search.
3. Go 1.21 Load: Upgrade to Go 1.21 on the same server. Then load the index from the second run, and search.
4. Go 1.21 Rebuild: Rebuild the index with the same dataset, load index, and search.

Search Metrics:

| Metric | Go 1.20 | Go 1.20 Rebuild Index | Go 1.21 | Go 1.21 Rebuild Index |
|----------------------------|------------------|-----------------|------------------|-----------------|
| `search requests` | 10,942,683 | 16,131,726 | 16,200,887 | 16,331,052 |
| `search fails` | 0 | 0 | 0 | 0 |
| `search RT_avg` (ms) | 16.44 | 11.15 | 11.11 | 11.02 |
| `search RT_min` (ms) | 1.30 | 1.28 | 1.31 | 1.26 |
| `search RT_max` (ms) | 446.61 | 233.22 | 235.90 | 147.93 |
| `search TP50` (ms) | 11.74 | 10.46 | 10.43 | 10.35 |
| `search TP99` (ms) | 92.30 | 25.76 | 25.36 | 25.23 |
| `search RPS` | 3,039 | 4,481 | 4,500 | 4,536 |

### Key Findings

The index build time was 340.39 ms with Go 1.20 and 337.60 ms with Go 1.21, a negligible difference in index construction. However, Go 1.21 offers slightly better search performance than Go 1.20, with improvements in handling concurrent tasks and reducing response times.

## Search Benchmark Report by VectorDBBench

Follow [VectorDBBench](https://github.com/zilliztech/VectorDBBench?tab=readme-ov-file) to create a VectorDBBench test for Go 1.20 and Go 1.21. We test the search performance with Go 1.20 and Go 1.21 (without PGO) on the Milvus Standalone system. The tests were conducted using the Cohere dataset with 1 million entries in a 768-dimensional space, utilizing the COSINE metric type.
Search Metrics:

| Metric | Go 1.20 | Go 1.21 without PGO |
| -- | -- | -- |
| Load Duration (seconds) | 1195.95 | 976.37 |
| Queries Per Second (QPS) | 841.62 | 875.89 |
| 99th Percentile Serial Latency (seconds) | 0.0047 | 0.0076 |
| Recall | 0.9487 | 0.9489 |

### Key Findings

Go 1.21 shows faster index loading and higher search QPS.

## PGO Performance Test

Milvus already exposes [net/http/pprof](https://pkg.go.dev/net/http/pprof) alongside its metrics, so we can fetch the CPU profile directly by running `curl -o default.pgo "http://${MILVUS_SERVER_IP}:${MILVUS_SERVER_PORT}/debug/pprof/profile?seconds=${TIME_SECOND}"` during the first search to collect the profile as default.pgo. Then I build Milvus with PGO and use the same index to run the search again. The results are below:

Search Metrics:

| Metric | Go 1.21 Without PGO | Go 1.21 With PGO | Change (%) |
|---------------------------------------------|------------------|-----------------|------------|
| `search Requests` | 2,644,583 | 2,837,726 | +7.30% |
| `search Fails` | 0 | 0 | N/A |
| `search RT_avg` (ms) | 11.34 | 10.57 | -6.78% |
| `search RT_min` (ms) | 1.39 | 1.32 | -5.18% |
| `search RT_max` (ms) | 349.72 | 143.72 | -58.91% |
| `search TP50` (ms) | 10.57 | 9.93 | -6.05% |
| `search TP99` (ms) | 26.14 | 24.16 | -7.56% |
| `search RPS` | 4,407 | 4,729 | +7.30% |

### Key Findings

PGO led to a notable enhancement in search performance, particularly in reducing the maximum response time by about 59% and increasing the search QPS by 7.3%.
### Further Analysis

Generate a diff flame graph between the two CPU profiles by running `go tool pprof -http=:8000 -diff_base nopgo.pgo pgo.pgo -normalize`:

<img width="1894" alt="goprofiling" src="https://github.com/milvus-io/milvus/assets/167743503/ab9e91eb-95c7-4963-acd9-d1c3c73ee010">

Further insight into HnswIndexNode and the Milvus search handler:

<img width="1906" alt="hnsw" src="https://github.com/milvus-io/milvus/assets/167743503/a04cf4a0-7c97-4451-b3cf-98afc20a0b05">
<img width="1873" alt="search_handler" src="https://github.com/milvus-io/milvus/assets/167743503/5f4d3982-18dd-4115-8e76-460f7f534c7f">

After applying PGO to the Milvus server, the CPU utilization of the faiss::fvec_L2 function decreased. This optimization significantly enhances the performance of the [HnswIndexNode::Search::searchKnn](`e0c9c41aa2/src/index/hnsw/hnsw.cc (L203)`) method, which is frequently invoked by Knowhere during high-concurrency searches. As the Go release notes explain, the function might be inlined more aggressively by the Go compiler during the second build, using the CPU profile collected from the first run. As a result, the search handler efficiency within Milvus DataNode has improved, allowing the server to process a higher number of search queries per second (QPS).

# Conclusion

The combination of Go 1.21 and PGO has led to substantial enhancements in search performance for the Milvus server, particularly in terms of search QPS and response times, making it more efficient for handling high-concurrency search operations.

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>	2024-05-22 13:21:39 +08:00
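The profile-collection step in de7901121f uses `curl` against the `net/http/pprof` endpoint. As a minimal sketch of the same step in Go, the program below builds the `/debug/pprof/profile?seconds=N` URL and saves the response as `default.pgo`. The function names `profileURL` and `fetchProfile`, the 60-second window, and the command-line usage are assumptions made for illustration and are not part of the Milvus code base.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// profileURL builds the net/http/pprof CPU profile URL for a server
// base address such as "http://HOST:PORT" (hypothetical helper).
func profileURL(base string, seconds int) string {
	return fmt.Sprintf("%s/debug/pprof/profile?seconds=%d", base, seconds)
}

// fetchProfile downloads the CPU profile from url and writes it to outPath.
// The client timeout is padded because the server samples for `seconds`
// before it starts to respond.
func fetchProfile(url, outPath string, seconds int) error {
	client := &http.Client{Timeout: time.Duration(seconds+30) * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	f, err := os.Create(outPath)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: fetchpgo http://HOST:PORT")
		return
	}
	const seconds = 60 // assumed sampling window
	url := profileURL(os.Args[1], seconds)
	if err := fetchProfile(url, "default.pgo", seconds); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Run it while the search workload is active, then place the resulting `default.pgo` in the main package directory before rebuilding, as in steps 5 and 6 of the PGO workflow.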
Alexander Guzhva	648d5661ca	enhance: Upgrade bitset for ARM SVE (#32718 ) issue: #32826 improve ARM SVE performance for `internal/core/src/bitset` Baseline timings for gcc 11.4 + Graviton 3 + manually enabled SVE: https://gist.github.com/alexanderguzhva/a974b50134c8bb9255fb15f144e5ac83 Candidate timings for gcc 11.4 + Graviton 3 + manually enabled SVE: https://gist.github.com/alexanderguzhva/19fc88f4ad3757e05e0f7feaf563b3d3 Signed-off-by: Alexandr Guzhva <alexanderguzhva@gmail.com>	2024-05-22 11:37:40 +08:00
XuanYang-cn	819a624753	fix: Return error when startup Delete/AddNode fail (#33193 ) See also: #33151, #33149 --------- Signed-off-by: yangxuan <xuan.yang@zilliz.com>	2024-05-22 11:17:38 +08:00
Alexander Guzhva	f20becb725	fix: Download and install cmake for the current platform, not x86_64 only (#32548 ) issue #32476 tested on x86_64 and aarch64. I'm not sure what needs to be done on some exotic architectures. Signed-off-by: Alexandr Guzhva <alexanderguzhva@gmail.com>	2024-05-22 11:15:39 +08:00
wei liu	303470fc35	fix: Clean offline node from resource group after qc restart (#33232 ) issue: #33200 #33207 pr#33104 causes offline nodes to be kept in the resource group after qc recovers; an offline node is then assigned to a new replica as an rwNode, and requests sent to those nodes fail with NodeNotFound. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-22 10:03:40 +08:00
Xiaofan	3d105fcb4d	enhance: Remove l0 delete cache (#32990 ) fix #32979 remove the l0 cache and build delete pk and ts every time. This reduces memory usage and also improves code readability. Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>	2024-05-21 22:53:40 +08:00
SimFG	e18d5aceb6	enhance: add config to control whether to init public role permissions (#33165 ) issue: #33164 Signed-off-by: SimFG <bang.fu@zilliz.com>	2024-05-21 22:39:46 +08:00
cai.zhang	ed39a38953	enhance: Reduce the frequency of logs describing indexing failures (#33212 ) issue: #33001 #33102 Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>	2024-05-21 19:27:39 +08:00
congqixia	12e8c6c583	enhance: Try LatestMessageID when checkpoint unmarshal fails (#33158 ) See also #33122 Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-05-21 16:59:39 +08:00
sammy.huang	7ab7e3a004	feat: support arm-based image build and pull request (#33219 ) Signed-off-by: Liang Huang <sammy.huang@zilliz.com>	2024-05-21 16:54:38 +08:00
yihao.dai	017fd7bc25	enhance: Select L2 segments in L0Compaction as well (#32991 ) /kind improvement Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-05-21 16:13:39 +08:00
wei liu	33bd6eed28	fix: Clean offline node from replica after qc recover (#33213 ) issue: #33200 #33207 pr#33104 removed this logic by mistake, which causes offline nodes to be kept in the replica after qc recovers, so requests sent to an offline qn fail with a NodeNotFound error. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-21 15:41:39 +08:00
Cai Yudong	cb480d17c8	fix: Fix SparseFloatVector data parse error for parquet (#33187 ) Issue: #22837 Signed-off-by: Cai Yudong <yudong.cai@zilliz.com>	2024-05-21 15:09:39 +08:00
XuanYang-cn	2d6f12d48b	fix: channel manager's goroutine run order (#33118 ) See also: #33117 --------- Signed-off-by: yangxuan <xuan.yang@zilliz.com>	2024-05-21 14:35:39 +08:00
congqixia	f336b2d672	fix: Check schema without vector field in proxy (#33211 ) Related to #33199 Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-05-21 14:33:39 +08:00
wei liu	2013d97243	enhance: Enable to dynamic update balancer policy in querycoord (#33037 ) issue: #33036 This PR enables updating the balancer policy dynamically without restarting querycoord. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-21 14:29:39 +08:00
Xiaofan	f681c4b034	enhance: remove describe index in rootcoord broker (#33206 ) fix #33205 remove the dependency between datacoord and rootcoord Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>	2024-05-21 14:11:39 +08:00
congqixia	f31a20faad	fix: [Backport] Mark channel checkpoint dropped prevent cp lag metrics leakage (#32454 ) (#33198 ) Cherry-pick from 2.3 pr: #32454 See also #31506 #31508 --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-05-21 11:59:39 +08:00
Jiquan Long	9f81290c63	fix: try best to get enough query results (#33178 ) issue: https://github.com/milvus-io/milvus/issues/33137 Signed-off-by: longjiquan <jiquan.long@zilliz.com>	2024-05-21 11:57:51 +08:00
Bingyi Sun	0f8c6f49ff	enhance: mmap load raw data if scalar index does not have raw data (#33175 ) Signed-off-by: sunby <sunbingyi1992@gmail.com>	2024-05-21 11:53:39 +08:00
XuanYang-cn	b3bcc107bb	fix: Remove L0 compactor in completedCompactor (#33169 ) See also: #33168 Signed-off-by: yangxuan <xuan.yang@zilliz.com>	2024-05-21 11:35:38 +08:00
aoiasd	f8929cc36a	fix: can't generate traceID when use noop exporter (#33191 ) relate: https://github.com/milvus-io/milvus/issues/33190 Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2024-05-21 10:39:39 +08:00
smellthemoon	89ad3eb0ca	enhance: reduce memory when read field (#33195 ) Signed-off-by: lixinguo <xinguo.li@zilliz.com> Co-authored-by: lixinguo <xinguo.li@zilliz.com>	2024-05-20 22:25:38 +08:00
jaime	0d99db23b8	fix: metrics leak on the coord nodes (#33075 ) issue: #32980 Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-05-20 22:03:39 +08:00
shaoting-huang	d27db99697	enhance: upgrade amazonlinux2023 builder image go version to 1.21 (#33176 ) Signed-off-by: shaoting-huang [shaoting-huang@zilliz.com] issue: https://github.com/milvus-io/milvus/issues/32982 Go 1.21 introduces several improvements and changes over Go 1.20, which is quite stable now. This PR is mainly for upgrading images Golang version from 1.20 to 1.21. Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>	2024-05-20 21:11:39 +08:00
congqixia	7eeb120aab	enhance: Add lint rules for client pkg and fix problems (#33180 ) See also #31293 --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-05-20 20:47:38 +08:00
pingliu	a1b5d7d6d3	doc: [skip-e2e] change milvus docker image version to v2.4.1 (#33170 ) Signed-off-by: ping.liu <ping.liu@zilliz.com>	2024-05-20 16:21:38 +08:00
shaoting-huang	c35eaaa358	enhance: upgrade images golang version from 1.20 to 1.21 (#33150 ) Signed-off-by: shaoting-huang [shaoting-huang@zilliz.com] issue: https://github.com/milvus-io/milvus/issues/32982 Go 1.21 introduces several improvements and changes over Go 1.20, which is quite stable now. This PR is mainly for upgrading images Golang version from 1.20 to 1.21. Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>	2024-05-20 15:01:43 +08:00
congqixia	f76d16780b	enhance: Refine channel mgr v2 implementation (#33156 ) Related to #25309 - Remove ctx from struct - Add ctx parameters for internal check logic methods - Add Waitgroup to make sure worker goroutine quit before close returns Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-05-20 14:13:38 +08:00
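The last bullet of f76d16780b (a WaitGroup so the worker goroutine quits before close returns) follows a common Go shutdown pattern. The sketch below illustrates that pattern only; the `manager` type, its fields, and its methods are hypothetical and do not reproduce the Milvus channel manager.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// manager is a hypothetical stand-in, not the Milvus channel manager.
type manager struct {
	closeCh  chan struct{}
	closeOne sync.Once
	wg       sync.WaitGroup
	stopped  atomic.Bool // set by the worker just before it exits
}

func newManager() *manager {
	return &manager{closeCh: make(chan struct{})}
}

// Start launches one background worker tracked by the WaitGroup.
func (m *manager) Start() {
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		<-m.closeCh // a real worker would loop over work until closed
		m.stopped.Store(true)
	}()
}

// Close signals the worker and blocks until it has actually returned.
func (m *manager) Close() {
	m.closeOne.Do(func() { close(m.closeCh) })
	m.wg.Wait()
}

func main() {
	m := newManager()
	m.Start()
	m.Close()
	fmt.Println("worker stopped:", m.stopped.Load())
}
```

Because `Close` waits on the WaitGroup, callers can rely on no worker goroutine touching shared state after `Close` returns.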
sre-ci-robot	555df49d25	[automated] Update Pytest image changes (#33126 ) Update Pytest image changes See changes: `0d0eda24f8` Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2024-05-20 11:43:37 +08:00
congqixia	c2ac692008	enhance: Add param item to ignore bad message id in checkpoint (#33123 ) See also #33122 This PR adds the param item `mq.ignoreBadPosition` to control the behavior when mq fails to parse the message id from a checkpoint. --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-05-20 11:29:37 +08:00
SimFG	ec98de3ad4	fix: reset the quota value when init the limiter (#33111 ) issue: #33107 /kind improvement Signed-off-by: SimFG <bang.fu@zilliz.com>	2024-05-20 10:35:38 +08:00
wei liu	a7f6193bfc	fix: query node may stuck at stopping progress (#33104 ) issue: #33103 When doing a stopping balance for a stopping query node, the balancer gets the node list from replica.GetNodes and checks whether any node is stopping; if so, a stopping balance is triggered for that replica. After the replica refactor, replica.GetNodes returns only rwNodes, while stopping nodes are kept in roNodes, so the balancer cannot find the replica that contains the stopping node and the stopping balance is never triggered. The query node then gets stuck forever because its segments/channels are never moved out. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-20 10:21:38 +08:00
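The failure mode described in a7f6193bfc can be reproduced with a toy model. The sketch below is illustrative only: the `replica` type, `GetAllNodes`, and `hasStoppingNode` are hypothetical names, and the commit text does not say that the actual fix works this way.

```go
package main

import "fmt"

// replica is a hypothetical stand-in for the query coord replica type.
type replica struct {
	rwNodes []int64 // nodes serving reads and writes
	roNodes []int64 // read-only nodes, which include stopping nodes
}

// GetNodes mirrors the behavior described in the commit: rwNodes only.
func (r *replica) GetNodes() []int64 { return r.rwNodes }

// GetAllNodes is an assumed helper that returns rwNodes plus roNodes.
func (r *replica) GetAllNodes() []int64 {
	all := make([]int64, 0, len(r.rwNodes)+len(r.roNodes))
	all = append(all, r.rwNodes...)
	return append(all, r.roNodes...)
}

// hasStoppingNode reports whether any listed node is marked as stopping.
func hasStoppingNode(nodes []int64, stopping map[int64]bool) bool {
	for _, n := range nodes {
		if stopping[n] {
			return true
		}
	}
	return false
}

func main() {
	r := &replica{rwNodes: []int64{1, 2}, roNodes: []int64{3}}
	stopping := map[int64]bool{3: true}

	// Checking only rwNodes misses the stopping node, so no balance starts.
	fmt.Println(hasStoppingNode(r.GetNodes(), stopping))
	// Including roNodes finds it.
	fmt.Println(hasStoppingNode(r.GetAllNodes(), stopping))
}
```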
sre-ci-robot	c6e2dd05fc	[automated] Update Knowhere Commit (#33147 ) Update Knowhere Commit Signed-off-by: sre-ci-robot sre-ci-robot@users.noreply.github.com Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2024-05-20 01:51:37 +08:00


19883 Commits