Trying without installing DAOS #17689

Closed
ryon-jensen wants to merge 253 commits into zarzycki/SRE-3435 from ryon-jensen/SRE-3435

@ryon-jensen
Contributor

No description provided.

soumagne and others added 30 commits November 10, 2025 11:18
Use 1.20 as min required version

Signed-off-by: Jerome Soumagne <jerome.soumagne@hpe.com>
This change removes the check of incoming client IO with
the ORF_REBUILDING_IO flag set.

Before the change, the intent of the check was to temporarily disallow
such IO while a rebuild was starting, when the PS leader engine first
distributes a fence / epoch value to all engines. The check caused
the IO to get an error -DER_UPDATE_AGAIN, causing the client to retry.

With forthcoming features like interactive/explicit rebuild control,
the rebuild *stop* case is negatively affected by this test, causing
all subsequent client IO after rebuild stops to block indefinitely,
until the rebuild is restarted by the administrator.

Other intermittent test failures, unrelated to the new feature, are also
being seen. Discussions with other developers have suggested that this
test in the engine code is not required.

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
This reverts commit d059e15
which was accidentally pushed directly to master.
To be properly merged by #17057

Also adds back an import removed by #17063

Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
Update ftest/pool/create.py to try pool create 3 times

Signed-off-by: Ding-Hwa Ho <ding-hwa.ho@hpe.com>
It has been observed that, during a dmg pool create command, the
ds_mgmt_pool_query call in pool_create_fill_resp timed out, and
ds_mgmt_drpc_pool_create returned the error to the MS without cleaning
up the newly-created pool service replicas. The dmg pool create command
retried with the same pool UUID, but this time a different set of PS
replicas were chosen and created "on top of" the PS replicas created in
the first attempt. As a result, some of the PS replicas had been
bootstrapped with the earlier set of replicas, whereas some with the
later set of replicas---an inconsistent Raft cluster right from the
beginning. Later, such inconsistency was ignored (for unknown reasons)
by rdb_raft_update_node, and led to assertion failures in raft.

This patch works around the problem like this:

  - Fix ds_mgmt_drpc_pool_create to clean up the pool if
    pool_create_fill_resp returns an error. Shorten the timeout of the
    query call, for the PS is just created by the MS and shouldn't take
    five minutes to respond to the query.

  - Tighten the check in ds_rsvc_start to refuse to create and bootstrap
    "on top of" an existing replica, just to be safe.

  - Fix rdb_raft_update_node to report unexpected replica states and
    abort, rather than silently ignoring them. This should prevent the
    assertion failure from being reached.

Signed-off-by: Li Wei <liwei@hpe.com>
Remove a note that I forgot to remove in the main fix.

Signed-off-by: Li Wei <liwei@hpe.com>
…main xstream (#17031)

To enhance object migration efficiency, we recommend processing OIDs directly in main
xstreams rather than routing them through system xstreams. Currently, the workflow involves
main xstreams scanning and gathering OIDs before distributing them to corresponding ranks'
system xstreams for processing. However, this approach introduces significant overhead
from B+ Tree operations.

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
…ating in md-on-ssd mode (#17081)

The purpose of this PR is to switch to a stable, clearly defined
version of PMDK.
The previous version, which was based on the DAOS-specific branch
(https://github.com/daos-stack/pmdk/tree/stable-2.1.0-daos),
was only intended as a temporary solution due to limitations in
the initial implementation of the new RPM building solution
based on FPM.

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
…17061)

cont_agg_eph_sync() may race with container destroy; in that case,
skip the nonexistent container (cont_lookup returns -DER_NONEXIST).

Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
dmg system commands that operate over an enumerated list of pools
often call into the control API from the server. When running with
certificates and not in insecure mode, these server-to-server calls get
blocked if the gRPC method hasn't been given explicit ComponentServer
authorization. This PR adds those controls for
dmg system drain|reintegrate|self-heal|rebuild commands.

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
For the case of no reboot between exclude and reint, some IVs should be
cleaned up in the first step of reint.

Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
Configure this single test, test_ec_online_rebuild_mdtest
to run with multi-recv when run with the verbs provider,
ofi+verbs;ofi_rxm.

For both engines and any client-side utilities launched by the test
(e.g., the cart_ctl, daos, mdtest), configure environment variable
NA_OFI_UNEXPECTED_TAG_MSG=0.

The intent of this patch is to re-enable this test that has
intermittently failed with mercury/libfabric errors such
as NA_CB_RECV_UNEXPECTED in na_ofi_cq_process_retries().

Also in this change, the shared class ErasureCodeMdtest.setUp()
method is changed to not connect to the pool it creates, since
that is not used, and it also would require additional multi-recv
environment configuration, to match the engine-side setup. Otherwise,
the test will hang with mis-matched client / engine environments.

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
Updates `EnricoMi/publish-unit-test-result-action` from 2.20.0 to 2.21.0
Updates `actions/upload-artifact` from 4 to 5
Updates `codespell-project/actions-codespell` from 2.1 to 2.2
Updates `github/codeql-action` from 4.30.8 to 4.31.2

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
Use gen_certificates.sh relative to the set prefix, not ftest.

Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
- Include listing of unchecked pools returned by dmg check query,
  even in non-verbose mode.

Signed-off-by: Kris Jacque <kris.jacque@hpe.com>
Fix the output values of daos get-prop command of the properties
rd_fac, rd_lvl and layout_type

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
Test build 1 for DAOS 2.8

Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>
Also enable Jenkins githook as Jenkins is back operational.

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
From now on, all dates provided for log operations must be in the ISO 8601
date format: YYYY-MM-DD. Zeros should be added to the beginning of one-digit
months and days.

Harmonize log operations module with the new logs' date/time stamp format
introduced by the #16772 PR:
YYYY-MM-DD HH:MM:SS.mmmmmm
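
As an illustration of the two formats named above, here is a minimal Python sketch of strict parsing. The helper names `parse_log_date` and `parse_log_stamp` are hypothetical and are not taken from the log operations module; the regular expressions are there because `datetime.strptime` on its own accepts one-digit months and days.

```python
import re
from datetime import date, datetime

# Hypothetical helpers for illustration only; not the module's real API.
_DATE_RE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
_STAMP_RE = re.compile(
    r"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{6}")


def parse_log_date(text: str) -> date:
    """Parse a zero-padded YYYY-MM-DD date, rejecting forms like 2025-1-5."""
    if not _DATE_RE.fullmatch(text):
        raise ValueError(f"expected YYYY-MM-DD, got {text!r}")
    return datetime.strptime(text, "%Y-%m-%d").date()


def parse_log_stamp(text: str) -> datetime:
    """Parse a YYYY-MM-DD HH:MM:SS.mmmmmm log time stamp."""
    if not _STAMP_RE.fullmatch(text):
        raise ValueError(f"expected YYYY-MM-DD HH:MM:SS.mmmmmm, got {text!r}")
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")
```

With these helpers, `parse_log_date("2025-11-10")` succeeds, while `parse_log_date("2025-1-5")` raises `ValueError` because the month and day are not zero-padded.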

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
If an update RPC times out, it is unnecessary to retry it immediately,
because the original one may be blocked on some targets under heavy
load. In that case, the retried RPC will get -DER_INPROGRESS and cause
the RPC to be retried again and again, which further increases server
load and makes the situation worse.

This patch introduces more delay before retrying the RPC in such cases.
It also adds more delay for collective object RPCs.
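
The idea of a longer retry delay can be sketched in Python as follows. This is not the patch's implementation: the function name `retry_delay_ms` and every numeric value (base delay, extra delay after a timeout, collective factor, cap) are assumptions chosen only to make the example concrete.

```python
def retry_delay_ms(attempt: int, timed_out: bool = False,
                   collective: bool = False, base_ms: int = 10,
                   timeout_extra_ms: int = 500, collective_factor: int = 2,
                   cap_ms: int = 10000) -> int:
    """Return a retry delay in milliseconds (all constants are hypothetical).

    The delay grows exponentially with the attempt number, gets an extra
    fixed penalty when the previous attempt timed out, and is scaled up
    for collective RPCs, which touch more targets.
    """
    # Bound the exponent so the intermediate value stays small.
    delay = base_ms * (2 ** min(attempt, 10))
    if timed_out:
        delay += timeout_extra_ms
    if collective:
        delay *= collective_factor
    return min(delay, cap_ms)
```

For example, with these assumed constants a first retry after a timeout waits 510 ms instead of 10 ms, which gives the original RPC time to finish on a loaded server before the retry arrives.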

Signed-off-by: Fan Yong <fan.yong@hpe.com>
1. After restart, take sc_ec_agg_eph_boundary as EC aggregation's minimum epoch to avoid scanning from epoch 0.
2. Consume more credits for layout calculation in EC aggregation.
3. Don't bump sc_ec_agg_eph after it is reset during EC aggregation, to avoid data corruption.

Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
- Fail protocol query after all the engines have been tried.

This avoids an infinite flood of errors if an app is started while
engines are offline or just starting up.
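
A minimal Python sketch of the "try each engine once, then fail" behavior described above. The names `query_protocol`, `query_one`, and `ProtocolQueryError` are hypothetical, and the real client code is not written in Python; the sketch only shows the bounded loop.

```python
class ProtocolQueryError(RuntimeError):
    """Raised when no engine answered the protocol query (hypothetical)."""


def query_protocol(engines, query_one):
    """Try each engine once; fail instead of looping forever.

    `query_one` is a caller-supplied callable (an assumption of this
    sketch) that returns a protocol version or raises ConnectionError.
    """
    failures = []
    for engine in engines:
        try:
            return query_one(engine)
        except ConnectionError as exc:
            failures.append((engine, exc))
    # Every engine has been tried: report one error rather than retrying.
    raise ProtocolQueryError(
        f"protocol query failed on all {len(failures)} engine(s)")
```

The caller then sees a single failure when all engines are offline, rather than an endless stream of per-attempt errors.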

Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
…17142)

DTX aggregation may be scheduled before the committed DTX table has been
reindexed. Then vos_container::vc_dtx_committed_count may be smaller
than the count of removed DTX entries. We need to filter out those DTX
entries that have not been handled by reindex before re-calculating
vc_dtx_committed_count to avoid negative overflow.

Signed-off-by: Fan Yong <fan.yong@hpe.com>
realpath() calls getcwd() in libc. Intercepting getcwd() with a trampoline works around the getcwd() issue in dfuse caused by dentry cache eviction. Also, get_current_dir_name() does not resolve symbolic links. This is fixed by calling libc getcwd() instead.

Signed-off-by: Lei Huang <lei.huang@hpe.com>
…#17150)

Provide a changelog for libfabric, isal and argobots RPMs.

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Otherwise, the related ULT stack may overflow when querying a bad pool.

Signed-off-by: Fan Yong <fan.yong@hpe.com>
Move rebuild/basic.py to HW since VMs are slow.

Signed-off-by: Ding-Hwa Ho <ding-hwa.ho@hpe.com>
Monitor in-flight SPDK I/Os. If any I/O isn't completed within a
certain amount of time (120 seconds, configurable through the env
var DAOS_SPDK_IO_TIMEOUT), we assume the SPDK I/O is stalled due to
a hardware issue (or software bug); a RAS event will be raised and
the corresponding device will be marked as faulty.
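
The monitoring logic can be sketched in Python as below. This is an illustration under stated assumptions, not the engine's code: the class `InflightMonitor` and its methods are hypothetical, the timeout is passed in as a parameter rather than read from DAOS_SPDK_IO_TIMEOUT, and raising a RAS event is reduced to returning the newly faulty devices.

```python
import time


class InflightMonitor:
    """Track in-flight I/Os and flag devices whose I/O looks stalled."""

    def __init__(self, timeout_s=120.0, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock          # injectable so tests need not sleep
        self._inflight = {}          # io_id -> (device, submit time)
        self.faulty = set()          # devices already marked as faulty

    def submit(self, io_id, device):
        self._inflight[io_id] = (device, self._clock())

    def complete(self, io_id):
        self._inflight.pop(io_id, None)

    def check(self):
        """Return devices newly marked faulty because an I/O timed out."""
        now = self._clock()
        newly_faulty = []
        for device, started in self._inflight.values():
            if now - started > self._timeout_s and device not in self.faulty:
                self.faulty.add(device)
                newly_faulty.append(device)
        return newly_faulty
```

A periodic caller would invoke `check()` and raise one event per returned device; a device is reported only once, even if several of its I/Os are stalled.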

Signed-off-by: Niu Yawei <yawei.niu@hpe.com>
tanabarr and others added 23 commits February 27, 2026 10:10
Apply scm_hugepages_disabled true if unset in yaml. This results in
effectively removing hugepages=always from tmpfs mount options for the
engine ramdisk by default.

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…rt (#17332)

To support MD-on-SSD for ddb, we need to support two commands. ddb prov_mem and ddb ls with --db_path.

Update ddb_utils.py to support the new commands.

Update test_recovery_ddb_ls to support MD-on-SSD with the new ddb commands.

We need to update the test yaml to run on MD-on-SSD/HW Medium, but that will break other tests in ddb.py because they don't support MD-on-SSD yet. Keep the original tests as ddb_pmem.py and ddb_pmem.yaml and keep running them on VM (except test_recovery_ddb_ls because that's updated in this PR).

Signed-off-by: Makito Kano <makito.kano@hpe.com>
Give the spdk.sh and spdk.changelog files new names that reflect the output of
the RPM build process - the daos-spdk package.

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
…17580)

In order to avoid failing pool create with storage percentage (-z X%)
when ranks have been stopped, only take into account joined ranks when
calculating maximum available pool sizes.

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
)

Allow ranks that have been previously marked as AdminExcluded to be
re-joined after a storage reformat using the dmg storage format
--replace command.

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Retry dmg self-heal eval command when engine not started error is
returned. Do this by updating the IsUnavailable() helper.

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
- Prevent server from passing fabric_auth_key to client
- Clean up ep_credit/ctx_max_num/crt_timeout init parsing
- Remove ENV_STR_NO_PRINT that was used to hide env var content

Signed-off-by: Jerome Soumagne <jerome.soumagne@hpe.com>
Updates `isort` from 8.0.0 to 8.0.1

Signed-off-by: dependabot[bot] <support@github.com>
Updates `EnricoMi/publish-unit-test-result-action` from 2.22.0 to 2.23.0
Updates `actions/upload-artifact` from 6.0.0 to 7.0.0

Signed-off-by: dependabot[bot] <support@github.com>
Since we no longer return dead ranks from the agent for protoquery, we
can further increase the timeout for the server to reply to protoquery.

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
- If dir-oclass is set to EC on container create, use the default instead.
- daos fs set-attr of an EC oclass on a directory should apply only to
  files. Directories will be created with the default in that case.
- Fix daos fs get-attr to show such changes.

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
The initial WAL implementation allowed the upper layer to handle WAL
commit failures via UNDO operations. This included rolling back
'si_unused_id' to prevent gaps in the WAL. However, the current
architecture no longer supports UNDO and instead excludes targets upon
WAL commit failure. Consequently, the legacy si_unused_id rollback now
violates the core assumption: "New transaction ID must be greater than
the last checkpointed ID".
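
One way the rollback could break that assumption is sketched below in Python. This is a simplified, hypothetical model: `TxIdAllocator` uses plain integer IDs, which the real WAL IDs may not be, and the interleaving shown is an illustrative scenario, not a trace from the code.

```python
class TxIdAllocator:
    """Toy allocator for transaction IDs (hypothetical, integer IDs)."""

    def __init__(self, last_checkpointed=0, rollback_on_failure=False):
        self.last_checkpointed = last_checkpointed
        self.next_unused = last_checkpointed + 1
        self.rollback_on_failure = rollback_on_failure

    def reserve(self):
        tx_id = self.next_unused
        self.next_unused += 1
        return tx_id

    def checkpoint(self, tx_id):
        self.last_checkpointed = max(self.last_checkpointed, tx_id)

    def commit_failed(self, tx_id):
        # Legacy behavior: hand the failed ID back to avoid a gap.
        # Without rollback, the ID is simply never reused.
        if self.rollback_on_failure:
            self.next_unused = tx_id


def next_id_after_failure(rollback_on_failure):
    """Reserve 5 and 6, checkpoint 6, fail 5, then reserve again."""
    alloc = TxIdAllocator(last_checkpointed=4,
                          rollback_on_failure=rollback_on_failure)
    failed, committed = alloc.reserve(), alloc.reserve()
    alloc.checkpoint(committed)
    alloc.commit_failed(failed)
    return alloc.reserve(), alloc.last_checkpointed
```

In this model, `next_id_after_failure(True)` hands out an ID that is not greater than the last checkpointed ID, while `next_id_after_failure(False)` leaves a gap but keeps new IDs above the checkpoint.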

Signed-off-by: Niu Yawei <yawei.niu@hpe.com>
Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>
Ignore the GHSA-72hv-8253-57qq vulnerability reported in
com.fasterxml.jackson.core:jackson-core 2.14.3.
com.fasterxml.jackson.core:jackson-core cannot be upgraded because it is
part of org.apache.hadoop:hadoop-common:3.4.2::2d40acbf and there is
no new version of hadoop.

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Update PMDK to version 2.1.3

Signed-off-by: Oksana Salyk <oksana.salyk@hpe.com>
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Verify 7 scenarios of auto recovery policy
1. System Creation
2. Disabling and Enabling Self-Heal
3. Online System Maintenance
4. Offline System Maintenance
5. Normal System Restart
6. Unexpected System Restart
7. Problematic Pools

Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
…17593)

Signed-off-by: Joseph Moore <joseph.moore@hpe.com>
…17612)

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…rt (#17631)

Move the test from ddb_pmem.py to ddb.py and add MD-on-SSD support.

Add position=1 to -w (self.write_mode) so that it's added
immediately after ddb.

Signed-off-by: Makito Kano <makito.kano@hpe.com>
…#17533)

... PMEMOBJ pools

PMEMOBJ maintains its own metadata. Copy-on-Write prevents these changes from taking effect so read-only mode will be truly read-only.

Also removing the `mlock()` workaround because:

- PMEM + Copy-on-Write + `mlock()` leads to increased memory usage, since the entire pool is pulled into RAM when it is opened, and `mlock()` serves no purpose there.
- `mlock()` has been unnecessary for quite some time. It was originally added to work around a cryptic issue observed when using libfabric with the verbs provider and performing direct RDMA writes into pool memory. Direct RDMA writes to pool memory are no longer used, so the workaround is obsolete. For details please see the ticket to get the complete paper trail.

Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>
Remove the ddb link dependency on `libdaos_common.so`

Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
Explicitly reset the RPC state so SPDK can be reinitialized multiple times in the same process.

Ref: spdk/spdk@fba209c
Ref: #16774

Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>
@github-actions

Errors are component not formatted correctly, Ticket number prefix incorrect, PR title is malformatted. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments, Unable to load ticket data
https://daosio.atlassian.net/browse/Trying

@ryon-jensen ryon-jensen force-pushed the ryon-jensen/SRE-3435 branch from 832e0e7 to 384bcaa Compare March 11, 2026 20:13
@ryon-jensen ryon-jensen force-pushed the ryon-jensen/SRE-3435 branch from 384bcaa to 10f10c5 Compare March 12, 2026 00:04
@ryon-jensen ryon-jensen deleted the ryon-jensen/SRE-3435 branch March 12, 2026 00:10
@ryon-jensen ryon-jensen restored the ryon-jensen/SRE-3435 branch March 12, 2026 00:10