Trying without installing DAOS #17689
Closed
ryon-jensen wants to merge 253 commits into zarzycki/SRE-3435 from
Conversation
Use 1.20 as min required version Signed-off-by: Jerome Soumagne <jerome.soumagne@hpe.com>
This change removes the check of incoming client IO with the ORF_REBUILDING_IO flag set. Before the change, the intent of the check was to temporarily disallow such IO while a rebuild was starting, when the PS leader engine first distributes a fence / epoch value to all engines. The check caused the IO to fail with -DER_UPDATE_AGAIN, causing the client to retry. With forthcoming features like interactive/explicit rebuild control, the rebuild *stop* case is negatively affected by this check: all client IO issued after the rebuild stops blocks indefinitely, until the rebuild is restarted by the administrator. Other intermittent test failures, unrelated to the new feature, are also being seen. Discussions with other developers suggest that this check in the engine code is not required. Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
Update ftest/pool/create.py to try pool create 3 times Signed-off-by: Ding-Hwa Ho <ding-hwa.ho@hpe.com>
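The change to ftest/pool/create.py itself is not shown on this page. As a rough illustration of the "try up to 3 times" idea only, a generic retry helper could look like the sketch below. The `retry` function and its parameters are hypothetical names and are not taken from the DAOS test framework.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 3) -> T:
    """Call action up to `attempts` times and return its first successful result."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as error:  # a real test would catch a narrower exception type
            last_error = error
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

A test would pass its own pool create call as `action`; the helper gives up and re-raises after the third failure.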
It has been observed that, during a dmg pool create command, the
ds_mgmt_pool_query call in pool_create_fill_resp timed out, and
ds_mgmt_drpc_pool_create returned the error to the MS without cleaning
up the newly-created pool service replicas. The dmg pool create command
retried with the same pool UUID, but this time a different set of PS
replicas were chosen and created "on top of" the PS replicas created in
the first attempt. As a result, some of the PS replicas had been
bootstrapped with the earlier set of replicas, whereas some with the
later set of replicas---an inconsistent Raft cluster right from the
beginning. Later, such inconsistency was ignored (for unknown reasons)
by rdb_raft_update_node, and led to assertion failures in raft.
This patch works around the problem like this:
- Fix ds_mgmt_drpc_pool_create to clean up the pool if
pool_create_fill_resp returns an error. Shorten the timeout of the
query call, because the PS has just been created by the MS and
shouldn't take five minutes to respond to the query.
- Tighten the check in ds_rsvc_start to refuse to create and bootstrap
"on top of" an existing replica, just to be safe.
- Fix rdb_raft_update_node to report unexpected replica states and
abort, rather than silently ignoring them. This should prevent the
assertion failure from being reached.
Signed-off-by: Li Wei <liwei@hpe.com>
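The first two bullets of the commit message above can be pictured with a small model. This is only a sketch under stated assumptions: `ReplicaStore`, `ReplicaExistsError` and `create_pool` are hypothetical Python stand-ins, not the C functions named in the commit (ds_mgmt_drpc_pool_create, ds_rsvc_start), and the real code paths are more involved.

```python
class ReplicaExistsError(Exception):
    """Raised when a create request targets a replica that already exists."""


class ReplicaStore:
    """Hypothetical in-memory stand-in for the PS replicas held by engines."""

    def __init__(self):
        self.replicas = {}  # pool UUID -> membership the replica was bootstrapped with

    def create(self, pool_uuid, members):
        # Refuse to create and bootstrap "on top of" an existing replica.
        if pool_uuid in self.replicas:
            raise ReplicaExistsError(pool_uuid)
        self.replicas[pool_uuid] = tuple(members)

    def destroy(self, pool_uuid):
        self.replicas.pop(pool_uuid, None)


def create_pool(store, pool_uuid, members, query):
    """Create replicas, then query the new pool; undo the create if the query fails."""
    store.create(pool_uuid, members)
    try:
        return query(pool_uuid)
    except Exception:
        # Do not leave replicas behind for a retry that reuses the same UUID.
        store.destroy(pool_uuid)
        raise
```

With the cleanup in place, a retry that reuses the same UUID starts from an empty slate, and the existence check catches the case where cleanup was somehow skipped.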
Remove a note that I forgot to remove in the main fix. Signed-off-by: Li Wei <liwei@hpe.com>
…main xstream (#17031) To enhance object migration efficiency, we recommend processing OIDs directly in main xstreams rather than routing them through system xstreams. Currently, the workflow involves main xstreams scanning and gathering OIDs before distributing them to corresponding ranks' system xstreams for processing. However, this approach introduces significant overhead from B+ Tree operations. Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
…ating in md-on-ssd mode (#17081) The purpose of this PR is to switch to a stable, clearly defined version of PMDK. The previous version, which was based on the DAOS-specific branch (https://github.com/daos-stack/pmdk/tree/stable-2.1.0-daos), was only intended as a temporary solution due to limitations in the initial implementation of the new RPM building solution based on FPM. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
…17061) cont_agg_eph_sync() may race with container destroy; in that case, skip the nonexistent container (cont_lookup returns -DER_NONEXIST). Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
…#17054) Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
dmg system commands that operate over an enumerated list of pools often call into the control API from the server. When running with certificates and not in insecure mode these server-to-server calls get blocked if the gRPC method hasn't been given explicit ComponentServer authorization. This PR adds those controls for dmg system drain|reintegrate|self-heal|rebuild commands. Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
When there is no reboot between exclude and reint, some IVs should be cleaned up in the first step of reint. Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
Configure this single test, test_ec_online_rebuild_mdtest to run with multi-recv when run with the verbs provider, ofi+verbs;ofi_rxm. For both engines and any client-side utilities launched by the test (e.g., the cart_ctl, daos, mdtest), configure environment variable NA_OFI_UNEXPECTED_TAG_MSG=0. The intent of this patch is to re-enable this test that has intermittently failed with mercury/libfabric errors such as NA_CB_RECV_UNEXPECTED in na_ofi_cq_process_retries(). Also in this change, the shared class ErasureCodeMdtest.setUp() method is changed to not connect to the pool it creates, since that is not used, and it also would require additional multi-recv environment configuration, to match the engine-side setup. Otherwise, the test will hang with mis-matched client / engine environments. Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
Updates `EnricoMi/publish-unit-test-result-action` from 2.20.0 to 2.21.0 Updates `actions/upload-artifact` from 4 to 5 Updates `codespell-project/actions-codespell` from 2.1 to 2.2 Updates `github/codeql-action` from 4.30.8 to 4.31.2 Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
Use gen_certificates.sh relative to the set prefix, not ftest. Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
- Include listing of unchecked pools returned by dmg check query, even in non-verbose mode. Signed-off-by: Kris Jacque <kris.jacque@hpe.com>
Fix the output values of daos get-prop command of the properties rd_fac, rd_lvl and layout_type Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
Test build 1 for DAOS 2.8 Signed-off-by: Phil Henderson <phillip.henderson@hpe.com>
Also enable Jenkins githook as Jenkins is back operational. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
From now on, all dates provided for log operations must be in the ISO 8601 date format: YYYY-MM-DD. Zeros should be added to the beginning of one-digit months and days. Harmonize log operations module with the new logs' date/time stamp format introduced by the #16772 PR: YYYY-MM-DD HH:MM:SS.mmmmmm Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
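For illustration, the two formats named above can be checked with the Python standard library. This is a sketch, not the log operations module from the commit; `parse_log_date` and `parse_log_timestamp` are hypothetical names. The explicit pattern is there because `strptime` on its own also accepts non-padded months and days.

```python
import re
from datetime import date, datetime

# Four-digit year, two-digit month, two-digit day, all zero-padded.
_ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def parse_log_date(text: str) -> date:
    """Parse a zero-padded YYYY-MM-DD string and reject anything else."""
    if not _ISO_DATE.fullmatch(text):
        raise ValueError(f"expected YYYY-MM-DD, got {text!r}")
    # strptime rejects impossible dates such as month 13.
    return datetime.strptime(text, "%Y-%m-%d").date()


def parse_log_timestamp(text: str) -> datetime:
    """Parse a YYYY-MM-DD HH:MM:SS.mmmmmm log timestamp."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")
```

For example, `parse_log_date("2025-03-07")` succeeds, while `parse_log_date("2025-3-7")` raises `ValueError`.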
If an update RPC times out, it is unnecessary to retry it immediately, because the original RPC may be blocked on some targets under heavy load. In that case, the retried RPC will get -DER_INPROGRESS and be retried again and again, which further increases server load and makes the situation worse. This patch introduces more delay before retrying the RPC in such cases. It also adds more delay for collective object RPCs. Signed-off-by: Fan Yong <fan.yong@hpe.com>
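The reasoning above (wait longer after a timeout, and longer still for collective RPCs) could be sketched as follows. The base delay and multipliers are made-up values chosen for illustration; the commit message does not state the actual delays, and `retry_delay_ms` is a hypothetical name.

```python
def retry_delay_ms(timed_out: bool, collective: bool, base_ms: int = 100) -> int:
    """Return an illustrative delay, in milliseconds, before resending an update RPC."""
    delay = base_ms
    if timed_out:
        # The original RPC may still be blocked on a loaded target, so back off more.
        delay *= 4
    if collective:
        # A collective RPC involves more targets, so give it extra time as well.
        delay *= 2
    return delay
```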
1. After restart, take sc_ec_agg_eph_boundary as the minimum epoch for EC aggregation, to avoid scanning from epoch 0. 2. Consume more credits for layout calculation in EC aggregation. 3. Don't bump sc_ec_agg_eph after it is reset during EC aggregation, to avoid data corruption. Signed-off-by: Xuezhao Liu <xuezhao.liu@hpe.com>
- Fail protocol query after all the engines have been tried. This avoids infinite flood of errors if app is started while engines are offline or just starting up. Signed-off-by: Alexander A Oganezov <alexander.oganezov@hpe.com>
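As a sketch of the "try every engine once, then fail" behavior described above: `query_protocol` and `query_one` are hypothetical names, and the real client code is not Python.

```python
def query_protocol(engines, query_one):
    """Try each engine once; raise after the last one instead of looping forever."""
    errors = []
    for engine in engines:
        try:
            return query_one(engine)
        except ConnectionError as error:
            errors.append(f"{engine}: {error}")
    # Every engine was tried and none answered: report one error and stop.
    raise ConnectionError("protocol query failed on all engines: " + "; ".join(errors))
```

The caller sees a single failure that lists each engine, rather than an unbounded stream of per-attempt errors.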
…17142) DTX aggregation may be scheduled before the committed DTX table has been reindexed. In that case, vos_container::vc_dtx_committed_count may be smaller than the count of removed DTX entries. Filter out the DTX entries that have not been handled by reindex before recalculating vc_dtx_committed_count, to avoid negative overflow. Signed-off-by: Fan Yong <fan.yong@hpe.com>
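The counting problem can be shown with a toy model. The entry layout (a dict with a `reindexed` flag) and the function name are assumptions made for this sketch; the VOS data structures in C look different.

```python
def recalc_committed_count(committed_count: int, removed_entries: list) -> int:
    """Subtract only the removed entries that reindex had already counted."""
    counted = sum(1 for entry in removed_entries if entry["reindexed"])
    if counted > committed_count:
        # With an unsigned counter in C this subtraction would wrap around.
        raise ValueError("committed count would underflow")
    return committed_count - counted
```

Entries that reindex has not yet handled were never added to the counter, so subtracting them is what would drive it below zero.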
realpath() calls getcwd() in libc. Intercepting getcwd() with a trampoline works around the getcwd() issue in dfuse caused by dentry cache eviction. Also, get_current_dir_name() does not resolve symbolic links. This is fixed by calling libc getcwd() instead. Signed-off-by: Lei Huang <lei.huang@hpe.com>
…#17150) Provide a changelog for libfabric, isal and argobots RPMs. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Otherwise the related ULT stack may overflow when querying a bad pool. Signed-off-by: Fan Yong <fan.yong@hpe.com>
Move rebuild/basic.py to HW since VMs are slow. Signed-off-by: Ding-Hwa Ho <ding-hwa.ho@hpe.com>
Monitor inflight SPDK I/Os. If any I/O isn't completed within a certain amount of time (120 seconds, configurable through the env var DAOS_SPDK_IO_TIMEOUT), we assume the SPDK I/O is stalled due to a hardware issue (or a software bug); a RAS event will be raised and the corresponding device will be marked as faulty. Signed-off-by: Niu Yawei <yawei.niu@hpe.com>
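A minimal sketch of the timeout check follows. It assumes the DAOS_SPDK_IO_TIMEOUT value is given in seconds and that an unset or invalid value falls back to 120; neither detail is confirmed by the commit message. The function names and the dictionary of submit times are hypothetical.

```python
import os

DEFAULT_IO_TIMEOUT_S = 120


def io_timeout_seconds(environ=None) -> int:
    """Read DAOS_SPDK_IO_TIMEOUT, falling back to 120 seconds (assumed unit)."""
    env = os.environ if environ is None else environ
    raw = env.get("DAOS_SPDK_IO_TIMEOUT")
    try:
        value = int(raw) if raw is not None else DEFAULT_IO_TIMEOUT_S
    except ValueError:
        return DEFAULT_IO_TIMEOUT_S
    return value if value > 0 else DEFAULT_IO_TIMEOUT_S


def stalled_ios(inflight: dict, now: float, timeout_s: float) -> list:
    """Return the IDs of I/Os submitted more than timeout_s seconds before now."""
    return [io_id for io_id, submitted in inflight.items() if now - submitted > timeout_s]
```

A monitor would call `stalled_ios` periodically and, for any ID it returns, raise the RAS event and mark the device faulty.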
Apply scm_hugepages_disabled true if unset in yaml. This results in effectively removing hugepages=always from tmpfs mount options for the engine ramdisk by default. Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…rt (#17332) To support MD-on-SSD for ddb, we need to support two commands: ddb prov_mem and ddb ls with --db_path. Update ddb_utils.py to support the new commands. Update test_recovery_ddb_ls to support MD-on-SSD with the new ddb commands. We need to update the test yaml to run on MD-on-SSD/HW Medium, but that will break other tests in ddb.py because they don't support MD-on-SSD yet. Keep the original tests as ddb_pmem.py and ddb_pmem.yaml and keep running them on VM (except test_recovery_ddb_ls, because that is updated in this PR). Signed-off-by: Makito Kano <makito.kano@hpe.com>
Give the spdk.sh and spdk.changelog files new names that reflect the output of the RPM build process - the daos-spdk package. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
…17580) In order to avoid failing pool create with storage percentage (-z X%) when ranks have been stopped, only take into account joined ranks when calculating maximum available pool sizes. Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
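To make the idea concrete, here is a sketch of sizing a pool from joined ranks only. The rank records, the `stopped` state string, and the rule that the per-rank size is bounded by the smallest free space are assumptions made for this example, not the control-plane implementation.

```python
def max_pool_bytes(ranks: list, percent: float) -> int:
    """Size a pool as a percentage of the free space on joined ranks only."""
    joined = [rank for rank in ranks if rank["state"] == "joined"]
    if not joined:
        raise ValueError("no joined ranks to place the pool on")
    # Assumption: every rank contributes the same amount, limited by the smallest one.
    smallest = min(rank["free_bytes"] for rank in joined)
    return int(smallest * len(joined) * percent / 100)
```

In this model, counting a stopped rank that reports zero free space would drag the result to zero, which is the kind of failure the change is meant to avoid.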
Retry dmg self-heal eval command when engine not started error is returned. Do this by updating the IsUnavailable() helper. Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
- Prevent server from passing fabric_auth_key to client - Clean up ep_credit/ctx_max_num/crt_timeout init parsing - Remove ENV_STR_NO_PRINT that was used to hide env var content Signed-off-by: Jerome Soumagne <jerome.soumagne@hpe.com>
Updates `isort` from 8.0.0 to 8.0.1 Signed-off-by: dependabot[bot] <support@github.com>
Updates `EnricoMi/publish-unit-test-result-action` from 2.22.0 to 2.23.0 Updates `actions/upload-artifact` from 6.0.0 to 7.0.0 Signed-off-by: dependabot[bot] <support@github.com>
Since we do not return dead ranks from agent anymore for protoquery, we can increase the timeout more for server to reply to protoquery. Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
- If dir-oclass is set to EC on container create, use the default instead. - daos fs set-attr of an EC oclass on a directory should apply only to files; directories will be created with the default in that case. - Fix daos fs get-attr to show such changes. Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@hpe.com>
The initial WAL implementation allowed the upper layer to handle WAL commit failures via UNDO operations. This included rolling back the 'si_unused_id' to prevent gaps in the WAL. However, the current architecture no longer supports UNDO and instead excludes targets upon WAL commit failure. Consequently, the legacy si_unused_id rollback now violates the core assumption: "New transaction ID must be greater than the last checkpointed ID". Signed-off-by: Niu Yawei <yawei.niu@hpe.com>
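The quoted assumption can be illustrated with a toy allocator. This is not the WAL code: `TxIdAllocator` and its methods are hypothetical, and the sequence of events shown is only one way a rollback could break the invariant in such a simplified model.

```python
class TxIdAllocator:
    """Toy allocator that hands out increasing transaction IDs."""

    def __init__(self, last_checkpointed: int = 0):
        self.last_checkpointed = last_checkpointed
        self.unused_id = last_checkpointed + 1  # loosely modelled on si_unused_id

    def reserve(self) -> int:
        tx_id = self.unused_id
        self.unused_id += 1
        return tx_id

    def checkpoint(self, tx_id: int) -> None:
        self.last_checkpointed = max(self.last_checkpointed, tx_id)

    def rollback(self, tx_id: int) -> None:
        """Legacy behavior: hand the ID back after a failed commit."""
        self.unused_id = min(self.unused_id, tx_id)

    def invariant_holds(self) -> bool:
        # The next ID to be handed out must exceed the last checkpointed ID.
        return self.unused_id > self.last_checkpointed
```

In this model, reserving two IDs, checkpointing the second, and then rolling back the first leaves `unused_id` at or below the checkpointed ID, so `invariant_holds()` returns `False`; without the rollback it stays `True`.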
Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>
Ignore the GHSA-72hv-8253-57qq vulnerability reported in com.fasterxml.jackson.core:jackson-core 2.14.3 The com.fasterxml.jackson.core:jackson-core can not be upgraded as it is a part of org.apache.hadoop:hadoop-common:3.4.2::2d40acbf and there is no new version of hadoop. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Update PMDK to version 2.1.3 Signed-off-by: Oksana Salyk <oksana.salyk@hpe.com> Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
Verify 7 scenarios of auto recovery policy 1. System Creation 2. Disabling and Enabling Self-Heal 3. Online System Maintenance 4. Offline System Maintenance 5. Normal System Restart 6. Unexpected System Restart 7. Problematic Pools Signed-off-by: Dalton Bohning <dalton.bohning@hpe.com>
…17593) Signed-off-by: Joseph Moore <joseph.moore@hpe.com>
…17612) Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…rt (#17631) Move the test from ddb_pmem.py to ddb.py and add MD-on-SSD support. Add position=1 to -w (self.write_mode) so that it's added immediately after ddb. Signed-off-by: Makito Kano <makito.kano@hpe.com>
…#17533) ... PMEMOBJ pools PMEMOBJ maintains its own metadata. Copy-on-Write prevents these changes from taking effect, so read-only mode will be truly read-only. Also removing the `mlock()` workaround because: - PMEM + Copy-on-Write + `mlock()` leads to increased memory usage, since the entire pool is pulled into RAM when it is opened, while `mlock()` serves no role whatsoever. - `mlock()` has been unnecessary for quite some time. It was originally added to work around a cryptic issue observed when using libfabric with the verbs provider and performing direct RDMA writes into pool memory. Direct RDMA writes to pool memory are no longer used, so the workaround is obsolete. For details, please see the ticket for the complete paper trail. Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>
Remove ddb's link dependency on `libdaos_common.so`. Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@hpe.com>
Explicitly reset the RPC state so SPDK can be reinitialized multiple times in the same process. Ref: spdk/spdk@fba209c Ref: #16774 Signed-off-by: Jan Michalski <jan-marian.michalski@hpe.com>
Errors are: component not formatted correctly, ticket number prefix incorrect, PR title is malformatted. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments. Unable to load ticket data.
832e0e7 to 384bcaa
384bcaa to 10f10c5
No description provided.