Trying without installing DAOS by ryon-jensen · Pull Request #17689 · daos-stack/daos

ryon-jensen · 2026-03-11T17:31:55Z

No description provided.

Use 1.20 as min required version Signed-off-by: Jerome Soumagne <jerome.soumagne@hpe.com>

This change removes the check of incoming client IO with the ORF_REBUILDING_IO flag set. Before the change, the intent of the check was to temporarily disallow such IO while a rebuild was starting, when the PS leader engine first distributes a fence / epoch value to all engines. The check caused the IO to get the error -DER_UPDATE_AGAIN, causing the client to retry. With forthcoming features like interactive/explicit rebuild control, the rebuild *stop* case is negatively affected by this check, causing all subsequent client IO after the rebuild stops to block indefinitely, until the rebuild is restarted by the administrator. Other intermittent test failures, unrelated to the new feature, are also being seen. Discussions with other developers have suggested that this check in the engine code is not required. Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>

This reverts commit d059e15, which was accidentally pushed directly to master. To be properly merged by #17057. Also adds back an import removed by #17063. Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>

Update ftest/pool/create.py to try pool create 3 times Signed-off-by: Ding-Hwa Ho <ding-hwa.ho@hpe.com>
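The bounded retry described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in ftest/pool/create.py: `create_with_retry` and the `create_pool` callable are hypothetical names introduced here.

```python
def create_with_retry(create_pool, attempts=3):
    """Call create_pool() up to `attempts` times and return the first success.

    create_pool is a hypothetical zero-argument callable that raises on failure.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return create_pool()
        except Exception as error:  # a real test would catch a narrower type
            last_error = error
    # Every attempt failed: surface the last error as the cause.
    raise RuntimeError(f"pool create failed after {attempts} attempts") from last_error
```

Bounding the attempts keeps a persistent failure visible instead of letting the test loop forever.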

It has been observed that, during a dmg pool create command, the ds_mgmt_pool_query call in pool_create_fill_resp timed out, and ds_mgmt_drpc_pool_create returned the error to the MS without cleaning up the newly created pool service replicas. The dmg pool create command retried with the same pool UUID, but this time a different set of PS replicas was chosen and created "on top of" the PS replicas created in the first attempt. As a result, some of the PS replicas had been bootstrapped with the earlier set of replicas, whereas others had been bootstrapped with the later set: an inconsistent Raft cluster right from the beginning. Later, this inconsistency was ignored (for unknown reasons) by rdb_raft_update_node, and led to assertion failures in raft. This patch works around the problem as follows:
- Fix ds_mgmt_drpc_pool_create to clean up the pool if pool_create_fill_resp returns an error. Shorten the timeout of the query call, because the PS has just been created by the MS and shouldn't take five minutes to respond to the query.
- Tighten the check in ds_rsvc_start to refuse to create and bootstrap "on top of" an existing replica, just to be safe.
- Fix rdb_raft_update_node to report unexpected replica states and abort, rather than silently ignoring them. This should prevent the assertion failure from being reached.
Signed-off-by: Li Wei <liwei@hpe.com>

Remove a note that I forgot to remove in the main fix. Signed-off-by: Li Wei <liwei@hpe.com>

…main xstream (#17031) To enhance object migration efficiency, we recommend processing OIDs directly in main xstreams rather than routing them through system xstreams. Currently, the workflow involves main xstreams scanning and gathering OIDs before distributing them to corresponding ranks' system xstreams for processing. However, this approach introduces significant overhead from B+ Tree operations. Signed-off-by: Wang Shilong <shilong.wang@hpe.com>

…ating in md-on-ssd mode (#17081) The purpose of this PR is to switch to a stable, clearly defined version of PMDK. The previous version, which was based on the DAOS-specific branch (https://github.com/daos-stack/pmdk/tree/stable-2.1.0-daos), was only intended as a temporary solution due to limitations in the initial implementation of the new RPM building solution based on FPM. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>

…17061) cont_agg_eph_sync() may race with container destroy; in that case, skip the nonexistent container (cont_lookup returns -DER_NONEXIST). Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>

…#17054) Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

dmg system commands that operate over an enumerated list of pools often call into the control API from the server. When running with certificates and not in insecure mode these server-to-server calls get blocked if the gRPC method hasn't been given explicit ComponentServer authorization. This PR adds those controls for dmg system drain|reintegrate|self-heal|rebuild commands. Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

When there is no reboot between exclude and reint, some IVs should be cleaned up in the first step of reint. Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>

Configure this single test, test_ec_online_rebuild_mdtest to run with multi-recv when run with the verbs provider, ofi+verbs;ofi_rxm. For both engines and any client-side utilities launched by the test (e.g., the cart_ctl, daos, mdtest), configure environment variable NA_OFI_UNEXPECTED_TAG_MSG=0. The intent of this patch is to re-enable this test that has intermittently failed with mercury/libfabric errors such as NA_CB_RECV_UNEXPECTED in na_ofi_cq_process_retries(). Also in this change, the shared class ErasureCodeMdtest.setUp() method is changed to not connect to the pool it creates, since that is not used, and it also would require additional multi-recv environment configuration, to match the engine-side setup. Otherwise, the test will hang with mis-matched client / engine environments. Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>

Updates `EnricoMi/publish-unit-test-result-action` from 2.20.0 to 2.21.0 Updates `actions/upload-artifact` from 4 to 5 Updates `codespell-project/actions-codespell` from 2.1 to 2.2 Updates `github/codeql-action` from 4.30.8 to 4.31.2 Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>

Use gen_certificates.sh relative to the set prefix, not ftest. Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

- Include listing of unchecked pools returned by dmg check query, even in non-verbose mode. Signed-off-by: Kris Jacque <kris.jacque@hpe.com>

Fix the output values of daos get-prop command of the properties rd_fac, rd_lvl and layout_type Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>

Test build 1 for DAOS 2.8 Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>

Also enable Jenkins githook as Jenkins is back operational. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>

From now on, all dates provided for log operations must be in the ISO 8601 date format: YYYY-MM-DD. Zeros should be added to the beginning of one-digit months and days. Harmonize log operations module with the new logs' date/time stamp format introduced by the #16772 PR: YYYY-MM-DD HH:MM:SS.mmmmmm Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
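The two formats named in this commit can be checked with a short sketch. It uses only the Python standard library and is not the log operations module itself; `parse_log_date` and `parse_log_timestamp` are hypothetical helper names.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"                   # YYYY-MM-DD
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"  # YYYY-MM-DD HH:MM:SS.mmmmmm


def parse_log_date(text):
    """Parse a date and insist on zero-padded months and days."""
    parsed = datetime.strptime(text, DATE_FORMAT)
    # strptime tolerates "2026-3-1", so require an exact round trip.
    if parsed.strftime(DATE_FORMAT) != text:
        raise ValueError(f"date must be zero-padded YYYY-MM-DD: {text!r}")
    return parsed.date()


def parse_log_timestamp(text):
    """Parse a log timestamp with microsecond precision."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)
```

For example, `parse_log_date("2026-03-11")` is accepted, while `parse_log_date("2026-3-11")` raises `ValueError`.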

If an update RPC times out, it is unnecessary to retry it immediately, because the original RPC may be blocked on some targets under heavy load. In that case, the retried RPC gets -DER_INPROGRESS and is retried again and again, which further increases server load and makes the situation worse. This patch introduces more delay before retrying the RPC in such cases. It also adds more delay for collective object RPCs. Signed-off-by: Fan Yong <fan.yong@hpe.com>
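The idea of delaying retries more after a timeout, and more again for collective RPCs, can be illustrated as below. This is a sketch under stated assumptions, not the client's actual retry logic: the function name, the exponential backoff, and every constant (base delay, multipliers, cap) are hypothetical.

```python
def retry_delay_ms(attempt, timed_out=False, collective=False,
                   base_ms=100, cap_ms=10_000):
    """Return an illustrative delay before retry number `attempt` (0-based)."""
    # Exponential backoff; the exponent is bounded to keep the numbers small.
    delay = base_ms * (2 ** min(attempt, 10))
    if timed_out:
        # The original RPC may still be running on a loaded target.
        delay *= 4
    if collective:
        # Collective object RPCs touch many targets, so wait longer still.
        delay *= 2
    return min(delay, cap_ms)
```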

1. After restart, take sc_ec_agg_eph_boundary as the EC aggregation's minimum epoch to avoid scanning from epoch 0. 2. Consume more credits for layout calculation in EC aggregation. 3. Don't bump sc_ec_agg_eph after it is reset during EC aggregation, to avoid data corruption. Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>

- Fail the protocol query after all the engines have been tried. This avoids an infinite flood of errors if an app is started while engines are offline or just starting up. Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
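The "try every engine once, then fail" behaviour can be sketched as follows. This is only an illustration of the control flow, not the client code: `query_protocol` and the `query_one` callable are hypothetical names.

```python
def query_protocol(engines, query_one):
    """Try each engine once and fail after the whole list has been tried.

    query_one is a hypothetical callable that raises ConnectionError
    when an engine is offline or still starting up.
    """
    errors = []
    for engine in engines:
        try:
            return query_one(engine)
        except ConnectionError as error:
            errors.append(f"{engine}: {error}")
    # No engine answered: report once instead of looping forever.
    raise ConnectionError("protocol query failed on all engines: " + "; ".join(errors))
```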

…17142) DTX aggregation may be scheduled before the committed DTX table has been reindexed. In that case, vos_container::vc_dtx_committed_count may be smaller than the count of removed DTX entries. Filter out the DTX entries that have not been handled by reindex before recalculating vc_dtx_committed_count, to avoid negative overflow. Signed-off-by: Fan Yong <fan.yong@hpe.com>
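The counting problem can be shown with a small model. It is a simplified sketch, not the VOS code: entries are modelled as dictionaries with a hypothetical `reindexed` flag, and `recount_committed` is an illustrative name.

```python
def recount_committed(committed_count, removed_entries):
    """Subtract only the removed entries that reindex has already counted."""
    handled = sum(1 for entry in removed_entries if entry["reindexed"])
    if handled > committed_count:
        raise ValueError("more handled entries removed than were counted")
    return committed_count - handled
```

With a count of 2 and three removed entries, only one of which was reindexed, the result is 1. Subtracting all three would give -1.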

realpath() calls getcwd() in libc. Intercepting getcwd() with a trampoline works around the getcwd() issue in dfuse caused by dentry cache eviction. Also, get_current_dir_name() does not resolve symbolic links. This is fixed by calling libc getcwd() instead. Signed-off-by: Lei Huang <lei.huang@hpe.com>

…#17150) Provide a changelog for libfabric, isal and argobots RPMs. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>

Otherwise, the related ULT stack may overflow when querying a bad pool. Signed-off-by: Fan Yong <fan.yong@hpe.com>

move rebuild/basic.py to HW since VMs are slow. Signed-off-by: Ding-Hwa Ho <ding-hwa.ho@hpe.com>

Monitor inflight SPDK I/Os. If any I/O isn't completed within a certain amount of time (120 seconds, configurable through the env var DAOS_SPDK_IO_TIMEOUT), we assume the SPDK I/O is stalled due to a hardware issue (or a software bug); a RAS event will be raised and the corresponding device will be marked as faulty. Signed-off-by: Niu Yawei <yawei.niu@hpe.com>
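The timeout check can be modelled in a few lines. This is a sketch, not the bio module: only the env var name and the 120-second default come from the commit message. The helper names, the assumption that the variable holds whole seconds, and the fallback on invalid values are illustrative.

```python
import os

DEFAULT_IO_TIMEOUT_S = 120  # default stated in the commit message


def io_timeout_seconds(environ=os.environ):
    """Read DAOS_SPDK_IO_TIMEOUT, assumed here to be whole seconds."""
    raw = environ.get("DAOS_SPDK_IO_TIMEOUT")
    if raw is None:
        return DEFAULT_IO_TIMEOUT_S
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_IO_TIMEOUT_S  # assumption: ignore unparsable values
    return value if value > 0 else DEFAULT_IO_TIMEOUT_S


def stalled_ios(inflight, now, timeout_s):
    """Return the ids of I/Os submitted more than timeout_s seconds ago.

    inflight maps a hypothetical I/O id to its submit time in seconds.
    """
    return [io_id for io_id, submitted in inflight.items()
            if now - submitted > timeout_s]
```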

Apply scm_hugepages_disabled true if unset in yaml. This results in effectively removing hugepages=always from tmpfs mount options for the engine ramdisk by default. Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

…rt (#17332) To support MD-on-SSD for ddb, we need to support two commands: ddb prov_mem and ddb ls with --db_path. Update ddb_utils.py to support the new commands. Update test_recovery_ddb_ls to support MD-on-SSD with the new ddb commands. We need to update the test yaml to run on MD-on-SSD/HW Medium, but that will break other tests in ddb.py because they don't support MD-on-SSD yet. Keep the original tests as ddb_pmem.py and ddb_pmem.yaml and keep running them on VM (except test_recovery_ddb_ls, because that is updated in this PR). Signed-off-by: Makito Kano <makito.kano@hpe.com>

Give the spdk.sh and spdk.changelog files new names that reflect the output of the RPM build process - the daos-spdk package. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>

…17580) In order to avoid failing pool create with storage percentage (-z X%) when ranks have been stopped, only take into account joined ranks when calculating maximum available pool sizes. Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
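The effect of counting only joined ranks can be sketched as follows. This is an illustration, not the control-plane code: the function name and the sizing rule (smallest free capacity among joined ranks, scaled by the percentage and multiplied by the number of joined ranks) are assumptions made for the example.

```python
def max_pool_size(free_bytes_by_rank, joined_ranks, percent):
    """Estimate a pool size for `-z X%` using joined ranks only."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Stopped ranks are skipped so they cannot drag the result down to zero.
    usable = [free for rank, free in free_bytes_by_rank.items()
              if rank in joined_ranks]
    if not usable:
        raise ValueError("no joined ranks available")
    per_rank = min(usable) * percent // 100
    return per_rank * len(usable)
```

With free space `{0: 1000, 1: 800, 2: 0}` and rank 2 stopped, 50% yields 800 bytes across ranks 0 and 1 instead of failing on the rank that reports no space.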

) Allow ranks that have been previously marked as AdminExcluded to be re-joined after a storage reformat using the dmg storage format --replace command. Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

Retry dmg self-heal eval command when engine not started error is returned. Do this by updating the IsUnavailable() helper. Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

- Prevent server from passing fabric_auth_key to client - Clean up ep_credit/ctx_max_num/crt_timeout init parsing - Remove ENV_STR_NO_PRINT that was used to hide env var content Signed-off-by: Jerome Soumagne <jerome.soumagne@hpe.com>

Updates `isort` from 8.0.0 to 8.0.1 Signed-off-by: dependabot[bot] <support@github.com>

Updates `EnricoMi/publish-unit-test-result-action` from 2.22.0 to 2.23.0 Updates `actions/upload-artifact` from 6.0.0 to 7.0.0 Signed-off-by: dependabot[bot] <support@github.com>

Since we do not return dead ranks from agent anymore for protoquery, we can increase the timeout more for server to reply to protoquery. Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>

- If dir-oclass is set to EC on container create, use the default instead. - daos fs set-attr of an EC oclass on a directory should apply only to files. Directories will be created with the default in that case. - Fix daos fs get-attr to show such changes. Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>

The initial WAL implementation allowed the upper layer to handle WAL commit failures via UNDO operations. This included rolling back the 'si_unused_id' to prevent gaps in the WAL. However, the current architecture no longer supports UNDO and instead excludes targets upon WAL commit failure. Consequently, the legacy si_unused_id rollback now violates the core assumption: "New transaction ID must be greater than the last checkpointed ID". Signed-off-by: Niu Yawei <yawei.niu@hpe.com>

Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>

Ignore the GHSA-72hv-8253-57qq vulnerability reported in com.fasterxml.jackson.core:jackson-core 2.14.3. The com.fasterxml.jackson.core:jackson-core package cannot be upgraded because it is part of org.apache.hadoop:hadoop-common:3.4.2::2d40acbf and there is no new version of hadoop. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>

Update PMDK to version 2.1.3 Signed-off-by: Oksana Salyk <oksana.salyk@hpe.com> Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>

Verify 7 scenarios of auto recovery policy 1. System Creation 2. Disabling and Enabling Self-Heal 3. Online System Maintenance 4. Offline System Maintenance 5. Normal System Restart 6. Unexpected System Restart 7. Problematic Pools Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>

…17593) Signed-off-by: Joseph Moore <joseph.moore@hpe.com>

…17612) Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

…rt (#17631) Move the test from ddb_pmem.py to ddb.py and add MD-on-SSD support. Add position=1 to -w (self.write_mode) so that it's added immediately after ddb. Signed-off-by: Makito Kano <makito.kano@hpe.com>

…#17533) ... PMEMOBJ pools PMEMOBJ maintains its own metadata. Copy-on-Write prevents these changes from taking effect, so read-only mode will be truly read-only. Also remove the `mlock()` workaround because:
- PMEM + Copy-on-Write + `mlock()` leads to increased memory usage, since the entire pool is pulled into RAM when it is opened, and `mlock()` serves no purpose there.
- `mlock()` has been unnecessary for quite some time. It was originally added to work around a cryptic issue observed when using libfabric with the verbs provider and performing direct RDMA writes into pool memory. Direct RDMA writes to pool memory are no longer used, so the workaround is obsolete.
For details, see the ticket for the complete paper trail. Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>

Remove the link dependency on `libdaos_common.so` from ddb. Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>

Explicitly reset the RPC state so SPDK can be reinitialized multiple times in the same process. Ref: spdk/spdk@fba209c Ref: #16774 Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>

github-actions · 2026-03-11T17:32:13Z

Errors are: component not formatted correctly, ticket number prefix incorrect, PR title is malformatted. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments
Unable to load ticket data: https://daosio.atlassian.net/browse/Trying

soumagne and others added 30 commits November 10, 2025 11:18

DAOS-18187 build: fix scons libfabric version check (#17088)

b2e90ca

Use 1.20 as min required version Signed-off-by: Jerome Soumagne <jerome.soumagne@hpe.com>

Revert "SRE-3440 build: Remove stage with intel compiler" (#17101)

9e8136d

This reverts commit d059e15, which was accidentally pushed directly to master. To be properly merged by #17057. Also adds back an import removed by #17063. Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>

DAOS-18155 test: Update pool/create.py with retry (#17042)

174f3f5

Update ftest/pool/create.py to try pool create 3 times Signed-off-by: Ding-Hwa Ho <ding-hwa.ho@hpe.com>

DAOS-17802 pool: Remove a forgotten comment (#17109)

7b6acff

Remove a note that I forgot to remove in the main fix. Signed-off-by: Li Wei <liwei@hpe.com>

DAOS-18172 container: skip nonexist container for cont_agg_eph_sync (#…

b62c67e

…17061) cont_agg_eph_sync() may race with container destroy; in that case, skip the nonexistent container (cont_lookup returns -DER_NONEXIST). Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>

DAOS-18128 control: self_heal related unit test coverage improvements (…

349c27b

…#17054) Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

DAOS-18154 rebuild: cleanup IV cache before reint (#17080)

0eb0036

When there is no reboot between exclude and reint, some IVs should be cleaned up in the first step of reint. Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>

DAOS-17827 test: use correct gen_certificates.sh (#17129)

e837088

Use gen_certificates.sh relative to the set prefix, not ftest. Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>

DAOS-18201 control: Allow bracketed strings in CreateRankSet (#17124)

3d8f848

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

DAOS-13520 control: Show unchecked pools in dmg check query (#17091)

d8194a4

- Include listing of unchecked pools returned by dmg check query, even in non-verbose mode. Signed-off-by: Kris Jacque <kris.jacque@hpe.com>

DAOS-14750 control: fix cont get-prop values (#17040)

b51a31c

Fix the output values of daos get-prop command of the properties rd_fac, rd_lvl and layout_type Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>

DAOS-18220 build: Create 2.8 TB1 (#17137)

19c927e

Test build 1 for DAOS 2.8 Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>

SRE-2772 ci: remove all references to hpdd.intel.com (#16874)

1a1e344

Also enable Jenkins githook as Jenkins is back operational. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>

DAOS-18204 packaging: restore libfabric, isa-l and argobots changelog (…

1bcdeba

…#17150) Provide a changelog for libfabric, isal and argobots RPMs. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>

DAOS-18200 chk: use deep stack for collective check query task (#17143)

f7a7981

Otherwise, the related ULT stack may overflow when querying a bad pool. Signed-off-by: Fan Yong <fan.yong@hpe.com>

DAOS-17796 test: move rebuild/basic.py to HW (#16993)

2e560d8

move rebuild/basic.py to HW since VMs are slow. Signed-off-by: Ding-Hwa Ho <ding-hwa.ho@hpe.com>

tanabarr and others added 23 commits February 27, 2026 10:10

DAOS-623 ci: fix spdk.sh script name (#17600)

f76da96

Give the spdk.sh and spdk.changelog files new names that reflect the output of the RPM build process - the daos-spdk package. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>

DAOS-18472 control: Use AdminExcluded ranks in dmg format replace (#17598

3ded3f4

) Allow ranks that have been previously marked as AdminExcluded to be re-joined after a storage reformat using the dmg storage format --replace command. Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

DAOS-18427 control: Retry system self-heal eval (#17575)

6a3bac8

Retry dmg self-heal eval command when engine not started error is returned. Do this by updating the IsUnavailable() helper. Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

DAOS-18472 doc: Note that format replace ignores AdminExcluded (#17610)

92aad25

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

DAOS-18636 cq: update isort to 8.0.1 (#17625)

81ddc13

Updates `isort` from 8.0.0 to 8.0.1 Signed-off-by: dependabot[bot] <support@github.com>

DAOS-18636 cq: Bump GHA versions (#17626)

e99cb66

Updates `EnricoMi/publish-unit-test-result-action` from 2.22.0 to 2.23.0 Updates `actions/upload-artifact` from 6.0.0 to 7.0.0 Signed-off-by: dependabot[bot] <support@github.com>

DAOS-18388 client: increase protoquery timeout to 10 seconds (#17383)

6a94c31

Since we do not return dead ranks from agent anymore for protoquery, we can increase the timeout more for server to reply to protoquery. Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>

DAOS-18608 ddb: md-on-ssd interactive open fix (#17589)

b34e4e8

Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>

DAOS-18296 common: update PMDK to version 2.1.3 (#17403)

f4210bc

Update PMDK to version 2.1.3 Signed-off-by: Oksana Salyk <oksana.salyk@hpe.com> Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>

DAOS-18614 cart: Fix UCX provider init for re-init of daos client. (#…

bcd7fd9

…17593) Signed-off-by: Joseph Moore <joseph.moore@hpe.com>

DAOS-18606 control: Avoid NVMe driver unbind in VMD if blocklisted (#…

4e7feb2

…17612) Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

DAOS-18389 test: recovery/ddb.py test_recovery_ddb_rm MD-on-SSD suppo…

c79c123

…rt (#17631) Move the test from ddb_pmem.py to ddb.py and add MD-on-SSD support. Add position=1 to -w (self.write_mode) so that it's added immediately after ddb. Signed-off-by: Makito Kano <makito.kano@hpe.com>

DAOS-18645 ddb: fix incompatible linked library (#17634)

ead4517

Remove the link dependency on `libdaos_common.so` from ddb. Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>

DAOS-18597 bio: allow SPDK reinitialization (#17614)

c2f11e6

Explicitly reset the RPC state so SPDK can be reinitialized multiple times in the same process. Ref: spdk/spdk@fba209c Ref: #16774 Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>

ryon-jensen force-pushed the ryon-jensen/SRE-3435 branch from 832e0e7 to 384bcaa March 11, 2026 20:13

squashed changes

10f10c5

ryon-jensen force-pushed the ryon-jensen/SRE-3435 branch from 384bcaa to 10f10c5 March 12, 2026 00:04

ryon-jensen closed this Mar 12, 2026

ryon-jensen deleted the ryon-jensen/SRE-3435 branch March 12, 2026 00:10

ryon-jensen restored the ryon-jensen/SRE-3435 branch March 12, 2026 00:10

ryon-jensen wants to merge 253 commits into zarzycki/SRE-3435 from ryon-jensen/SRE-3435

20 participants
