[linux-nvidia-6.17] Backport MPAM fixes and support for CPU-less NUMA nodes #348
These patches don't match the upstream commits. Please make sure to add a comment if an upstream cherry-pick needed conflict fixes.
96cc0e5 to 31c434f
Thank you very much for your comments! All comments have been addressed. Please review the new patches (in the same branch).
I have not been able to finish the review, but I have a question about this commit:
31c434f to 7718a84
Matt told me that the line [backported from ... pset_branch] is only for internal info. External people cannot see pset. |
We just need a way to identify the patch provenance. If the patch is already in a public location, we should pick it from there. @fyu1 I see this patch is on LKML (https://lore.kernel.org/all/20260313144617.3420416-38-ben.horgan@arm.com/) but differs a bit. Did you not pick the LKML version because the base set of MPAM patches we carry in linux-nvidia-6.17 is based on an older revision of the series?
611616a NVIDIA: VR: SAUCE: arm_mpam: Fix compilation errors
Nit: The change for resctrl_arch_rmid_read() does more than what the commit message describes (it changes the number of parameters and the parameter data types). Is that intended? (It looks like it is trying to match the prototype, but I want to double-check.) Similarly, mpam_resctrl_monitor_init() has more than just a name change.
Hi Matt, they are the same patch with minor changes. This line needed to change to fit 6.17:
I have now backported the T241-MPAM-4 workaround patch from Ben's branch: https://gitlab.arm.com/linux-arm/linux-bh/-/commit/de0a00982d0aefb3d94828e908179aca02feaa85 Please check whether the backported patch is good. BTW, this workaround is only for Grace. Vera doesn't have the MBW_MIN feature and doesn't need this workaround to function.
7718a84 to 0be0368
Fixed. Added the detailed changes to the commit message.
Some of the last patches also reference the pset 6.19 kernel, so what should we do with those?
0be0368 to 9dbadcd
The following commit has two bodies and two sign-offs: I believe you intended to remove this part:
9dbadcd to 1091c7f
Hi Jamie, fixed in the updated branch: 68595a9
Hi Matt, since Carol is concerned about the pset branch names, do you want to keep them in the change logs?
Let's keep the pset branch / SHA references to maintain provenance for our own tracking.
…ming domains The feature to sum event data across multiple domains supports systems with Sub-NUMA Cluster (SNC) mode enabled. The top-level monitoring files in each "mon_L3_XX" directory provide the sum of data across all SNC nodes sharing an L3 cache instance while the "mon_sub_L3_YY" sub-directories provide the event data of the individual nodes. SNC is only associated with the L3 resource and domains and as a result the flow handling the sum of event data implicitly assumes it is working with the L3 resource and domains. Reading of telemetry events does not require to sum event data so this feature can remain dedicated to SNC and keep the implicit assumption of working with the L3 resource and domains. Add a WARN to where the implicit assumption of working with the L3 resource is made and add comments on how the structure controlling the event sum feature is used. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com (cherry picked from commit db64994) Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
Each CPU collects data for telemetry events that it sends to the nearest
telemetry event aggregator either when the value of MSR_IA32_PQR_ASSOC.RMID
changes, or when a two millisecond timer expires.
There is a feature type ("energy" or "perf"), GUID, and MMIO region associated
with each aggregator. This combination links to an XML description of the
set of telemetry events tracked by the aggregator. XML files are published
by Intel in a GitHub repository¹.
The telemetry event aggregators maintain per-RMID per-event counts of the
total seen for all the CPUs. There may be multiple telemetry event aggregators
per package.
There are separate sets of aggregators for each feature type. Aggregators
in a set may have different GUIDs. All aggregators with the same feature
type and GUID are symmetric keeping counts for the same set of events for
the CPUs that provide data to them.
The XML file for each aggregator provides the following information:
0) Feature type of the events ("perf" or "energy")
1) Which telemetry events are tracked by the aggregator.
2) The order in which the event counters appear for each RMID.
3) The value type of each event counter (integer or fixed-point).
4) The number of RMIDs supported.
5) Which additional aggregator status registers are included.
6) The total size of the MMIO region for an aggregator.
Introduce struct event_group that condenses the relevant information from
an XML file. Hereafter an "event group" refers to a group of events of a
particular feature type (event_group::pfname set to "energy" or "perf") with
a particular GUID.
Use event_group::pfname to determine the feature id needed to obtain the
aggregator details. It will later be used in console messages and with the
rdt= boot parameter.
The INTEL_PMT_TELEMETRY driver enumerates support for telemetry events.
This driver provides intel_pmt_get_regions_by_feature() to list all available
telemetry event aggregators of a given feature type. The list includes the
"guid", the base address in MMIO space for the region where the event counters
are exposed, and the package id where all the CPUs that report to this
aggregator are located.
Call INTEL_PMT_TELEMETRY's intel_pmt_get_regions_by_feature() for each event
group to obtain a private copy of that event group's aggregator data. Duplicate
the aggregator data between event groups that have the same feature type
but different GUID. Further processing on this private copy will be unique
to the event group.
¹https://github.com/intel/Intel-PMT
[ bp: Zap text explaining the code, s/guid/GUID/g ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(cherry picked from commit 1fb2daa)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
…GUIDs
The telemetry event aggregators of the Intel Clearwater Forest CPU support
two RMID-based feature types: "energy" with GUID 0x26696143¹, and "perf"
with GUID 0x26557651².
The event counter offsets in an aggregator's MMIO space are arranged in
groups for each RMID.
E.g., the "energy" counters for GUID 0x26696143 are arranged like this:
  MMIO offset:0x0000 Counter for RMID 0 PMT_EVENT_ENERGY
  MMIO offset:0x0008 Counter for RMID 0 PMT_EVENT_ACTIVITY
  MMIO offset:0x0010 Counter for RMID 1 PMT_EVENT_ENERGY
  MMIO offset:0x0018 Counter for RMID 1 PMT_EVENT_ACTIVITY
  ...
  MMIO offset:0x23F0 Counter for RMID 575 PMT_EVENT_ENERGY
  MMIO offset:0x23F8 Counter for RMID 575 PMT_EVENT_ACTIVITY
After all counters there are three status registers that provide indications
of how many times an aggregator was unable to process event counts, the
time stamp for the most recent loss of data, and the time stamp of the most
recent successful update.
  MMIO offset:0x2400 AGG_DATA_LOSS_COUNT
  MMIO offset:0x2408 AGG_DATA_LOSS_TIMESTAMP
  MMIO offset:0x2410 LAST_UPDATE_TIMESTAMP
Define event_group structures for both of these aggregator types and define
the events tracked by the aggregators in the file system code.
PMT_EVENT_ENERGY and PMT_EVENT_ACTIVITY are produced in fixed point format.
File system code must output as floating point values.
¹https://github.com/intel/Intel-PMT/blob/main/xml/CWF/OOBMSM/RMID-ENERGY/cwf_aggregator.xml
²https://github.com/intel/Intel-PMT/blob/main/xml/CWF/OOBMSM/RMID-PERF/cwf_aggregator.xml
[ bp: Massage commit message. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(cherry picked from commit 8f6b6ad)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
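The counter layout listed in this commit message follows a simple stride, which can be sketched as below. This is only an illustration of the arithmetic: the function names are hypothetical, and the 8-byte counter width is inferred from the offsets in the listing rather than taken from the driver.

```python
COUNTER_BYTES = 8  # inferred: consecutive offsets in the listing are 8 bytes apart


def counter_offset(rmid: int, event_index: int, events_per_rmid: int) -> int:
    """Byte offset of one event counter inside an aggregator's MMIO region."""
    if not 0 <= event_index < events_per_rmid:
        raise ValueError("event_index out of range")
    return (rmid * events_per_rmid + event_index) * COUNTER_BYTES


def status_offset(num_rmids: int, events_per_rmid: int, status_index: int) -> int:
    """Byte offset of a status register; these follow all the counters."""
    return (num_rmids * events_per_rmid + status_index) * COUNTER_BYTES


if __name__ == "__main__":
    # "energy" group from the listing: RMIDs 0..575, two events per RMID.
    print(hex(counter_offset(575, 1, 2)))  # PMT_EVENT_ACTIVITY for RMID 575
    print(hex(status_offset(576, 2, 2)))   # LAST_UPDATE_TIMESTAMP
```

With 576 RMIDs and two events per RMID, the formula reproduces every offset shown in the listing, including the three status registers.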
The resctrl file system layer passes the domain, RMID, and event id to the architecture to fetch an event counter. Fetching a telemetry event counter requires additional information that is private to the architecture, for example, the offset into MMIO space from where the counter should be read. Add mon_evt::arch_priv that architecture can use for any private data related to the event. The resctrl filesystem initializes mon_evt::arch_priv when the architecture enables the event and passes it back to architecture when needing to fetch an event counter. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> (backported from commit 8ccb1f8) [fenghuay: fix minor conflicts in __check_limbo()] Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
Every event group has a private copy of the data of all telemetry event
aggregators (aka "telemetry regions") tracking its feature type. Included
may be regions that have the same feature type but tracking different GUID
from the event group's.
Traverse the event group's telemetry region data and mark all regions that
are not usable by the event group as unusable by clearing those regions'
MMIO addresses. A region is considered unusable if:
1) GUID does not match the GUID of the event group.
2) Package ID is invalid.
3) The enumerated size of the MMIO region does not match the expected
   value from the XML description file.
Hereafter any telemetry region with an MMIO address is considered valid for
the event group it is associated with.
Enable all the event group's events as long as there is at least one usable
region from where data for its events can be read.
Enabling of an event can fail if the same event has already been enabled as
part of another event group. It should never happen that the same event is
described by different GUID supported by the same system so just WARN (via
resctrl_enable_mon_event()) and skip the event.
Note that it is architecturally possible that some telemetry events are only
supported by a subset of the packages in the system. It is not expected that
systems will ever do this. If they do the user will see event files in
resctrl that always return "Unavailable".
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(cherry picked from commit 7e6df96)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
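The three "unusable" conditions in this commit message can be modelled roughly as follows. This is a sketch under stated assumptions, not the kernel code: the `Region` type and `mark_unusable()` are hypothetical names, a cleared MMIO address is modelled as `None`, and "valid package ID" is assumed to mean an index below the number of packages.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    guid: int
    package_id: int
    size: int
    mmio_addr: Optional[int]  # None models a cleared (unusable) address


def mark_unusable(regions: List[Region], group_guid: int,
                  expected_size: int, num_packages: int) -> bool:
    """Clear the address of every region the event group cannot use.

    Returns True if at least one usable region remains, which is the
    condition the commit message gives for enabling the group's events.
    """
    usable = False
    for r in regions:
        ok = (r.guid == group_guid
              and 0 <= r.package_id < num_packages  # assumed validity rule
              and r.size == expected_size)
        if not ok:
            r.mmio_addr = None
        elif r.mmio_addr is not None:
            usable = True
    return usable
```

After the pass, any region that still carries an address can be read without repeating the checks, which matches the "hereafter ... considered valid" wording above.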
Introduce intel_aet_read_event() to read telemetry events for resource RDT_RESOURCE_PERF_PKG. There may be multiple aggregators tracking each package, so scan all of them and add up all counters. Aggregators may return an invalid data indication if they have received no records for a given RMID. The user will see "Unavailable" if none of the aggregators on a package provide valid counts. Resctrl now uses readq() so depends on X86_64. Update Kconfig. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com (cherry picked from commit 51541f6) Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
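The read path described here, summing across all aggregators of a package and reporting "Unavailable" only when none of them has valid data, can be sketched like this. The helper names are hypothetical, and an invalid-data indication is modelled simply as `None` because the hardware encoding is not given in this commit message.

```python
from typing import Iterable, Optional


def read_package_event(readings: Iterable[Optional[int]]) -> Optional[int]:
    """Sum one event's counters over all aggregators tracking a package.

    Each reading is an int, or None when that aggregator reported no valid
    data for the RMID. Returns None if no aggregator had valid data.
    """
    total = 0
    any_valid = False
    for value in readings:
        if value is None:
            continue  # skip aggregators with no records for this RMID
        total += value
        any_valid = True
    return total if any_valid else None


def format_event(value: Optional[int]) -> str:
    """Text a user would see in the event file for this reading."""
    return "Unavailable" if value is None else str(value)
```

A single valid aggregator is enough to produce a number; invalid ones contribute nothing to the sum.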
Population of a monitor group's mon_data directory is unreasonably complicated because of the support for Sub-NUMA Cluster (SNC) mode. Split out the SNC code into a helper function to make it easier to add support for a new telemetry resource. Move all the duplicated code to make and set owner of domain directories into the mon_add_all_files() helper and rename to _mkdir_mondata_subdir(). Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com (cherry picked from commit 0ec1db4) Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
Clearing a monitor group's mon_data directory is complicated because of the support for Sub-NUMA Cluster (SNC) mode. Refactor the SNC case into a helper function to make it easier to add support for a new telemetry resource. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com (cherry picked from commit 93d9fd8) Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
…_PKG The L3 resource has several requirements for domains. There are per-domain structures that hold the 64-bit values of counters, and elements to keep track of the overflow and limbo threads. None of these are needed for the PERF_PKG resource. The hardware counters are wide enough that they do not wrap around for decades. Define a new rdt_perf_pkg_mon_domain structure which just consists of the standard rdt_domain_hdr to keep track of domain id and CPU mask. Update resctrl_online_mon_domain() for RDT_RESOURCE_PERF_PKG. The only action needed for this resource is to create and populate domain directories if a domain is added while resctrl is mounted. Similarly resctrl_offline_mon_domain() only needs to remove domain directories. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com (cherry picked from commit f4e0cd8) Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
Legacy resctrl features are enumerated by X86_FEATURE_* flags. These may be overridden by quirks to disable features in the case of errata. Users can use kernel command line options to either disable a feature, or to force enable a feature that was disabled by a quirk. A different approach is needed for hardware features that do not have an X86_FEATURE_* flag. Update parsing of the "rdt=" boot parameter to call the telemetry driver directly to handle new "perf" and "energy" options that controls activation of telemetry monitoring of the named type. By itself a "perf" or "energy" option controls the forced enabling or disabling (with ! prefix) of all event groups of the named type. A ":guid" suffix allows for fine grained control per event group. [ bp: s/intel_aet_option/intel_handle_aet_option/g ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> (backported from commit 842e7f9) [fenghuay: fix a minor conflict in kernel-parameters.txt doc] Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
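As a rough illustration of the option grammar described above ("perf" or "energy", an optional "!" prefix to disable, an optional ":guid" suffix), a single token could be parsed as below. The function name and the choice to parse the GUID as hexadecimal are assumptions made for this sketch; the kernel's actual parsing may differ.

```python
from typing import Optional, Tuple


def parse_aet_option(token: str) -> Optional[Tuple[str, Optional[int], bool]]:
    """Parse one rdt= token into (feature, guid, enable).

    Returns None if the token is not a telemetry option. guid is None when
    the option applies to all event groups of the named feature type.
    """
    enable = not token.startswith("!")
    body = token if enable else token[1:]
    name, sep, guid_text = body.partition(":")
    if name not in ("perf", "energy"):
        return None
    guid = None
    if sep:
        try:
            guid = int(guid_text, 16)  # assumption: GUID is written in hex
        except ValueError:
            return None
    return name, guid, enable
```

For example, `parse_aet_option("!perf:0x26557651")` yields `("perf", 0x26557651, False)`, while a legacy token such as `"cmt"` yields `None` and would be left to the existing X86_FEATURE_* handling.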
There are now three meanings for "number of RMIDs":
1) The number for legacy features enumerated by CPUID leaf 0xF. This is the
   maximum number of distinct values that can be loaded into
   MSR_IA32_PQR_ASSOC. Note that systems with Sub-NUMA Cluster mode enabled
   will force scaling down the CPUID enumerated value by the number of SNC
   nodes per L3-cache.
2) The number of registers in MMIO space for each event. This is enumerated
   in the XML files and is the value initialized into event_group::num_rmid.
3) The number of "hardware counters" (this isn't a strictly accurate
   description of how things work, but serves as a useful analogy that does
   describe the limitations) feeding to those MMIO registers. This is
   enumerated in telemetry_region::num_rmids returned by
   intel_pmt_get_regions_by_feature().
Event groups with insufficient "hardware counters" to track all RMIDs are
difficult for users to use, since the system may reassign "hardware counters"
at any time. This means that users cannot reliably collect two consecutive
event counts to compute the rate at which events are occurring.
Disable such event groups by default. The user may override this with a
command line "rdt=" option. In this case limit an under-resourced event
group's number of possible monitor resource groups to the lowest number of
"hardware counters".
Scan all enabled event groups and assign the RDT_RESOURCE_PERF_PKG resource
"num_rmid" value to the smallest of these values as this value will be used
later to compare against the number of RMIDs supported by other resources to
determine how many monitoring resource groups are supported.
N.B. Change type of resctrl_mon::num_rmid to u32 to match its usage and the
type of event_group::num_rmid so that min(r->num_rmid, e->num_rmid) won't
complain about mixing signed and unsigned types.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(cherry picked from commit 67640e3)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
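The policy in this commit message can be summarised in a few lines. The sketch below is a model of the described behaviour only; the function names are hypothetical and `None` stands for "event group disabled".

```python
from typing import Iterable, List, Optional


def effective_num_rmid(xml_num_rmid: int, region_counters: List[int],
                       force_on: bool = False) -> Optional[int]:
    """RMID count one event group can offer, or None if it is disabled.

    xml_num_rmid models event_group::num_rmid (MMIO registers per event);
    region_counters models telemetry_region::num_rmids for each usable
    region of the group (must be non-empty).
    """
    lowest = min(region_counters)
    if lowest >= xml_num_rmid:
        return xml_num_rmid          # enough "hardware counters" for all RMIDs
    # Under-resourced group: off by default, limited when forced on via rdt=.
    return lowest if force_on else None


def perf_pkg_num_rmid(groups: Iterable[Optional[int]]) -> Optional[int]:
    """Smallest value over all enabled groups, or None if none is enabled."""
    enabled = [g for g in groups if g is not None]
    return min(enabled) if enabled else None
```

For instance, a group with 576 MMIO registers per event but a region exposing only 64 "hardware counters" is disabled by default, and contributes 64 when forced on.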
closid_num_dirty_rmid[] and rmid_ptrs[] are allocated together during resctrl initialization and freed together during resctrl exit. Telemetry events are enumerated on resctrl mount so only at resctrl mount will the number of RMID supported by all monitoring resources and needed as size for rmid_ptrs[] be known. Separate closid_num_dirty_rmid[] and rmid_ptrs[] allocation and free in preparation for rmid_ptrs[] to be allocated on resctrl mount. Keep the rdtgroup_mutex protection around the allocation and free of closid_num_dirty_rmid[] as ARM needs this to guarantee memory ordering. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com (cherry picked from commit ee7f6af) Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
resctrl assumes that only the L3 resource supports monitor events, so it simply takes the rdt_resource::num_rmid from RDT_RESOURCE_L3 as the system's number of RMIDs. The addition of telemetry events in a different resource breaks that assumption. Compute the number of available RMIDs as the minimum value across all mon_capable resources (analogous to how the number of CLOSIDs is computed across alloc_capable resources). Note that mount time enumeration of the telemetry resource means that this number can be reduced. If this happens, then some memory will be wasted as the allocations for rdt_l3_mon_domain::mbm_states[] and rdt_l3_mon_domain::rmid_busy_llc created during resctrl initialization will be larger than needed. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com (cherry picked from commit 0ecc988) Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
L3 monitor features are enumerated during resctrl initialization and rmid_ptrs[] that tracks all RMIDs and depends on the number of supported RMIDs is allocated during this time. Telemetry monitor features are enumerated during first resctrl mount and may support a different number of RMIDs compared to L3 monitor features. Delay allocation and initialization of rmid_ptrs[] until first mount. Since the number of RMIDs cannot change on later mounts, keep the same set of rmid_ptrs[] until resctrl_exit(). This is required because the limbo handler keeps running after resctrl is unmounted and needs to access rmid_ptrs[] as it keeps tracking busy RMIDs after unmount. Rename routines to match what they now do: dom_data_init() -> setup_rmid_lru_list() dom_data_exit() -> free_rmid_lru_list() Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com (backported from commit d089164) Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> [fenghuay: fix minor conflicts in setup_rmid_lru_list() and dom_data_exit()] Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
Since telemetry events are enumerated on resctrl mount the RDT_RESOURCE_PERF_PKG resource is not considered "monitoring capable" during early resctrl initialization. This means that the domain list for RDT_RESOURCE_PERF_PKG is not built when the CPU hotplug notifiers are registered and run for the first time right after resctrl initialization. Mark the RDT_RESOURCE_PERF_PKG as "monitoring capable" upon successful telemetry event enumeration to ensure future CPU hotplug events include this resource and initialize its domain list for CPUs that are already online. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com (cherry picked from commit 4bbfc90) Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
Update resctrl filesystem documentation with the details about the resctrl
files that support telemetry events.
[ bp: Drop the debugfs hunk of the documentation until a better debugging
solution is found. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/20251217172121.12030-1-tony.luck@intel.com
(cherry picked from commit a8848c4)
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
…rl L3 domain and arch API updates
Upstream resctrl renamed the L3 monitor domain type and extended the arch
hooks:
1. Use struct rdt_l3_mon_domain in MPAM's resctrl integration.
2. Pass struct rdt_domain_hdr * into resctrl_online_mon_domain() /
   resctrl_offline_mon_domain().
3. Match the new resctrl_arch_rmid_read() prototype (header pointer +
   arch_priv).
4. Update resctrl_arch_cntr_read(), resctrl_arch_reset_rmid(),
   resctrl_arch_reset_cntr(), and resctrl_arch_config_cntr() to take
   struct rdt_l3_mon_domain *.
5. Call the new resctrl_enable_mon_event() signature when wiring monitor
   events and set mon_capable from its return value.
6. Add a no-op resctrl_arch_pre_mount() so MPAM builds with the generic
   resctrl mount path.
Fixes: a42549e ("NVIDIA: SAUCE: arm_mpam: resctrl: Add boilerplate cpuhp and domain allocation")
Fixes: ae2a29c ("NVIDIA: SAUCE: arm_mpam: resctrl: Add support for csu counters")
Fixes: 1cbc0f2 ("NVIDIA: SAUCE: arm_mpam: resctrl: Add resctrl_arch_config_cntr() for ABMC use")
Fixes: dd44394 ("NVIDIA: SAUCE: arm_mpam: resctrl: Add resctrl_arch_rmid_read() and resctrl_arch_reset_rmid()")
Fixes: 8429670 ("NVIDIA: SAUCE: arm_mpam: resctrl: Add resctrl_arch_cntr_read() & resctrl_arch_reset_cntr()")
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
…rors
No need to destroy the MSC instance for user/admin programming errors since
they do not cause any functional issues.
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
(cherry picked from 316e5833ccb2ef66f50290e48c45b70bf286c8fd dev/dev-main-nvidia-pset-linux-6.19.6)
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
In a NUMA system, each node may include CPUs, memory, MPAM MSC instances, or any combination thereof. Some high-end servers may have NUMA nodes that include MPAM MSC but no CPUs. In such cases, associate all possible CPUs for those MSCs. Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> (cherry picked from f902b5abf39fe10a50b7062dc9ae9d2cfc723248 dev/dev-main-nvidia-pset-linux-6.19.6) Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
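The affinity rule for CPU-less nodes stated in this commit message amounts to a fallback, sketched below with plain sets standing in for cpumasks. The function name is hypothetical and this is not the driver's code.

```python
from typing import Set


def msc_affinity(node_cpus: Set[int], possible_cpus: Set[int]) -> Set[int]:
    """CPUs to associate with an MSC that lives on a given NUMA node.

    If the node has CPUs, use them; if the node is CPU-less, fall back to
    all possible CPUs, as the commit message describes.
    """
    return set(node_cpus) if node_cpus else set(possible_cpus)
```

With possible CPUs `{0, 1, 2, 3}`, an MSC on a node holding CPUs `{2, 3}` keeps `{2, 3}`, while an MSC on a memory-only node is associated with all four.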
…ring domain setup
The current MPAM driver only considers the first component associated
with an online/offline CPU during domain creation and teardown. This
is insufficient, as CPU-initiated traffic may traverse multiple MSCs
before reaching the target, and each MSC must be programmed consistently
for proper resource partitioning.
Update the MPAM driver to include all components associated with a
given CPU during domain setup/teardown to expose expected schemata
to userspace for effective resource control.
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
(backported from 4309ce9856f87170670c9db40546d9f2fc9dbb86 dev/dev-main-nvidia-pset-linux-6.19.6)
[fenghuay: In addition to the core change, this backport includes the
following adaptations to bridge the gap between the 24.04 (6.17) MPAM
driver and the 6.19.6 base the original was written against:
- Add for_each_mpam_resctrl_control() and for_each_mpam_resctrl_mon()
iteration macros (from pset c15c066 and 4f42221)
- Add MPAM_MAX_EVENT constant to bound the monitor event array
- Add traffic_matches_l3() to validate that a memory-class MSC's
traffic matches L3 egress topology (from pset ebc0760)
Remove redundant if (class->type != MPAM_CLASS_MEMORY)
- Replace exposed_alloc_capable/exposed_mon_capable static bools
with dynamic resctrl_arch_alloc_capable()/resctrl_arch_mon_capable()
that iterate over resources
- Change mpam_resctrl_offline_cpu() return type from int to void
- Change mpam_resctrl_monitor_init() return type from void to int
and propagate errors
- Change num_rmid from mpam_pmg_max + 1 to
resctrl_arch_system_num_rmid_idx()
- Use guard(mutex) for domain_list_lock
- Use INIT_LIST_HEAD_RCU for domain lists
- Fix not found mba issue on GMEM by only checking traffic_matches_l3() in
mpam_resctrl_pick_mba() on class that doesn't have NUMA node]
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
…onfig
Reset an RIS by building a default mpam_config and applying it via
mpam_reprogram_ris_partid(), like any other config.
- mpam_init_reset_cfg(): set features and default values only for controls
  supported by the RIS (cpor_part, mbw_part, mbw_max, mbw_prop, cmax_cmax,
  cmax_cmin). Use full masks for CPBM/MBW_PBM and MPAMCFG_* defaults for
  MBW_MAX, CMAX, CMIN.
- mpam_reprogram_ris_partid(): apply cfg for all supported controls (no
  separate reset path).
Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
(backported from c076b208842db87ed50b1c63cff302975a9c8f67 dev/dev-main-nvidia-pset-linux-6.19.6)
[fenghuay: Fix porting conflicts and compilation errors. Remove this
sentence from the commit message to avoid confusion, because the MBW_PROP
feature is not supported on Vera/Grace: "Include mpam_feat_mbw_prop when
supported so MBW_PROP is written to 0 on reset."]
Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
There is no struct arm_smmu_domain context for domains configured with identity mappings. Use the device to obtain the necessary information to program PARTID and PMGID. Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> (backported from e5020b38475ef58c5bb3d1a92028d4e0dd7aff4d dev/dev-main-nvidia-pset-linux-6.19.6) [fenghuay: Koba Ko fixes a typo in iommu_group_get_qos_params(): s/!ops->set_group_qos_params/!ops->get_group_qos_params/] Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
…n mpam_msmon_read Resolve mpam_feat_msmon_mbwu to the concrete counter type (31/44/63) before mpam_has_feature() and before filling the mon_read arg. This avoids -EOPNOTSUPP when only a specific MBWU feature is set, and ensures _msmon_read() gets the resolved type in arg.type. Fixes: 5b91005 ("NVIDIA: SAUCE: arm_mpam: Use long MBWU counters if supported") Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
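The ordering problem this fix addresses, checking for the generic MBWU feature when only a specific counter width is advertised, can be modelled as follows. The feature strings are hypothetical stand-ins for the driver's mpam_feat_* values, and preferring the widest counter is an assumption made for the sketch.

```python
from typing import Optional, Set

# Hypothetical names standing in for the driver's feature enum values.
GENERIC_MBWU = "msmon_mbwu"
CONCRETE_MBWU = ("msmon_mbwu_63counter",
                 "msmon_mbwu_44counter",
                 "msmon_mbwu_31counter")  # widest first: an assumed preference


def resolve_mbwu_type(requested: str, supported: Set[str]) -> Optional[str]:
    """Map a requested monitor type to the concrete type to read.

    Returns None when nothing suitable is supported, which models the
    -EOPNOTSUPP outcome mentioned in the commit message.
    """
    if requested != GENERIC_MBWU:
        return requested if requested in supported else None
    for feat in CONCRETE_MBWU:
        if feat in supported:
            return feat
    return None
```

Resolving first means the later feature check and the read argument both see the same concrete type, so a component that advertises only a 44-bit counter is no longer rejected.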
cc8ab11 to ecd11fd
I fixed a blocking GMEM test failure in the patch "NVIDIA: VR: SAUCE: arm_mpam: Include all associated MSC components during domain setup" and updated its commit message. Here is the fix patch:
With this fix, I no longer see the MBA/MBM issue in the GMEM test with an engineer-built SBIOS that enables GPU MPAM. If this PR looks good to you, please merge it to 6.17 BaseOS. Thank you very much for your help!
Re-reviewed and confirmed this was the only change. No issues with the change.
        return false;
    }
    cpu = cpumask_any_and(&class->affinity, cpu_online_mask);
Should we add a check for cpu >= nr_cpu_ids, as in topology_matches_l3()?
Adding another sanity check wouldn't hurt, but there is no issue without it, because the following statements already catch an invalid cpu:
err = find_l3_equivalent_bitmask(cpu, tmp_cpumask);
if (err) {
Great, thanks for looking.
PR sent to Canonical.
This PR replaces #328
This branch fixes a few MPAM issues including:
There are 49 patches in total:
The patch list is:
0001-Revert-NVIDIA-SAUCE-untested-arm_mpam-resctrl-Allow-.patch
0002-Revert-NVIDIA-SAUCE-arm_mpam-resctrl-Add-NUMA-node-n.patch
0003-Revert-NVIDIA-SAUCE-untested-arm_mpam-resctrl-Split-.patch
0004-Revert-NVIDIA-SAUCE-arm_mpam-resctrl-Change-domain_h.patch
0005-Revert-NVIDIA-SAUCE-arm_mpam-resctrl-Pick-whether-MB.patch
0006-Revert-NVIDIA-SAUCE-Fix-unused-variable-warning.patch
0007-Revert-NVIDIA-SAUCE-fs-resctrl-Add-mount-option-for-.patch
0008-Revert-NVIDIA-SAUCE-fs-resctrl-Take-memory-hotplug-l.patch
0009-Revert-NVIDIA-SAUCE-mm-memory_hotplug-Add-lockdep-as.patch
0010-Revert-NVIDIA-SAUCE-untested-arm_mpam-resctrl-Allow-.patch
0011-Revert-NVIDIA-SAUCE-arm_mpam-Add-workaround-for-T241.patch
0012-NVIDIA-SAUCE-arm_mpam-Add-workaround-for-T241-MPAM-4.patch
0013-x86-fs-resctrl-Improve-domain-type-checking.patch
0014-x86-resctrl-Move-L3-initialization-into-new-helper-f.patch
0015-x86-resctrl-Refactor-domain_remove_cpu_mon-ready-for.patch
0016-x86-resctrl-Clean-up-domain_remove_cpu_ctrl.patch
0017-x86-fs-resctrl-Refactor-domain-create-remove-using-s.patch
0018-fs-resctrl-Split-L3-dependent-parts-out-of-mon_eve.patch
0019-x86-fs-resctrl-Use-struct-rdt_domain_hdr-when-readin.patch
0020-x86-fs-resctrl-Rename-struct-rdt_mon_domain-and-rdt.patch
0021-x86-fs-resctrl-Rename-some-L3-specific-functions.patch
0022-fs-resctrl-Make-event-details-accessible-to-function.patch
0023-x86-fs-resctrl-Handle-events-that-can-be-read-from-a.patch
0024-x86-fs-resctrl-Support-binary-fixed-point-event-coun.patch
0025-x86-fs-resctrl-Add-an-architectural-hook-called-for-.patch
0026-x86-fs-resctrl-Add-and-initialize-a-resource-for-pac.patch
0027-fs-resctrl-Emphasize-that-L3-monitoring-resource-is-.patch
0028-x86-resctrl-Discover-hardware-telemetry-events.patch
0029-x86-fs-resctrl-Fill-in-details-of-events-for-perform.patch
0030-x86-fs-resctrl-Add-architectural-event-pointer.patch
0031-x86-resctrl-Find-and-enable-usable-telemetry-events.patch
0032-x86-resctrl-Read-telemetry-events.patch
0033-fs-resctrl-Refactor-mkdir_mondata_subdir.patch
0034-fs-resctrl-Refactor-rmdir_mondata_subdir_allrdtgrp.patch
0035-x86-fs-resctrl-Handle-domain-creation-deletion-for-R.patch
0036-x86-resctrl-Add-energy-perf-choices-to-rdt-boot-opti.patch
0037-x86-resctrl-Handle-number-of-RMIDs-supported-by-RDT.patch
0038-fs-resctrl-Move-allocation-free-of-closid_num_dirty_.patch
0039-x86-fs-resctrl-Compute-number-of-RMIDs-as-minimum-ac.patch
0040-fs-resctrl-Move-RMID-initialization-to-first-mount.patch
0041-x86-resctrl-Enable-RDT_RESOURCE_PERF_PKG.patch
0042-x86-fs-resctrl-Update-documentation-for-telemetry-ev.patch
0043-NVIDIA-VR-SAUCE-arm_mpam-Fix-compilation-errors.patch
0044-NVIDIA-SAUCE-arm_mpam-Avoid-MSC-teardown-for-the-SW-.patch
0045-NVIDIA-VR-SAUCE-arm_mpam-Handle-CPU-less-numa-nodes.patch
0046-NVIDIA-VR-SAUCE-arm_mpam-Include-all-associated-MSC-.patch
0047-NVIDIA-SAUCE-resctrl-mpam-reset-RIS-by-applying-expl.patch
0048-NVIDIA-SAUCE-iommu-arm-smmu-v3-Fix-MPAM-for-indentit.patch
0049-NVIDIA-VR-SAUCE-arm_mpam-Resolve-MBWU-type-before-fe.patch
Test results are in http://10.112.214.86/vera/tests/ including
The GPU MPAM test is not covered because, as of now, there is no SBIOS support for the feature yet.
LP: https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-6.17/+bug/2146389