[2/N] make op_db general for GPU, sample input generalization is TBD by daisyden · Pull Request #3 · daisyden/pytorch

daisyden · 2024-12-22T07:29:26Z

The second step for RFC pytorch#142029, this PR will make the op_db general for GPU devices defined in GPU_TYPES list.

…143550) # Motivation Fix pytorch#143543 # Solution We should raise python exception instead of aborting... # Additional Context without this PR: ```python >>> import torch >>> torch.accelerator.current_stream(torch.accelerator.device_count()) terminate called after throwing an instance of 'c10::Error' what(): device is out of range, device is 2, total number of device is 2. Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so) frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so) frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so) <omitting python frames> frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6) frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6) Aborted (core dumped) ``` with this PR: ```python >>> import torch >>> torch.accelerator.current_stream(torch.accelerator.device_count()) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream return torch._C._accelerator_getStream(device_index) RuntimeError: The device index is out of range. It must be in [0, 2), but got 2. ``` Pull Request resolved: pytorch#143550 Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD

Use uint64_t index types to avoid ``` torch_np/numpy_tests/core/test_einsum.py::TestEinsum::test_einsum_broadcast /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24: runtime error: signed integer overflow: 9223365439786057728 + 13194139533312 cannot be represented in type 'long' #0 0x7f30d26166ba in std::enable_if<std::is_same_v<long, long>, void>::type at::native::cpublas::(anonymous namespace)::gemm_notrans_<long, long, long>(long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24 #1 0x7f30d26166ba in void at::native::cpublas::(anonymous namespace)::gemm_core_<long, long, long>(at::native::TransposeType, at::native::TransposeType, long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:451:12 #2 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const::'lambda2'()::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3 #3 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3 ``` Pull Request resolved: pytorch#154809 Approved by: https://github.com/soulitzer

…5603) Example new error message ``` torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['x'].size()[0])! For more information, run with TORCH_LOGS="+dynamic". - You marked L['x'].size()[0] as dynamic but your code specialized it to be a constant (5). Either remove the mark_dynamic or use a less strict API such as maybe_mark_dynamic or Dim.AUTO. Framework stack: File "??", line 0, in _start File "", line 0, in __libc_start_main_alias_2 File "??", line 0, in __libc_start_call_main File "/usr/local/src/conda/python-3.10.16/Modules/main.c", line 1094, in Py_BytesMain File "/usr/local/src/conda/python-3.10.16/Modules/main.c", line 357, in pymain_run_file_obj File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 90, in _PyRun_AnyFileObject File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 456, in _PyRun_SimpleFileObject File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1208, in pyrun_file File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1312, in run_mod File "/usr/local/src/conda/python-3.10.16/Python/pythonrun.c", line 1291, in run_eval_code_obj File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 1134, in PyEval_EvalCode File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/scratch/repro.py", line 9, in <module> foo(x) File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/eval_frame.py", line 699, in compile_wrapper return fn(*args, **kwargs) File "offloadstuff.c", line 0, in dynamo__custom_eval_frame File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 305, in _PyObject_Call File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1469, in __call__ return self._torchdynamo_orig_callable( File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 153, in _PyObject_FastCallDictTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1248, in __call__ result = self._inner_convert( File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7494, in slot_tp_call File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 431, in _PyObject_Call_Prepend File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 153, in _PyObject_FastCallDictTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 625, in __call__ return _compile( File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 1092, in _compile guarded_code = compile_inner(code, one_graph, hooks, transform) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_utils_internal.py", line 97, in wrapper_function return function(*args, **kwargs) File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 779, in compile_inner return _compile_inner(code, one_graph, hooks, transform) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 818, in _compile_inner out_code = transform_code_object(code, transform) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/bytecode_transformation.py", line 1424, in transform_code_object transformations(instructions, code_options) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 265, in _fn return fn(*args, **kwargs) File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/convert_frame.py", line 743, in transform tracer.run() File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 3531, in run super().run() File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1359, in run while self.step(): File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 1263, in step self.dispatch_table[inst.opcode](self, inst) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/symbolic_convert.py", line 422, in impl self.push(fn_var.call_function(self, self.popn(nargs), {})) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1160, in call_function return handler(tx, args, kwargs) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 792, in <lambda> return lambda tx, args, kwargs: obj.call_function( File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1160, in call_function return handler(tx, args, kwargs) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builtin.py", line 1120, in _handle_insert_op_in_graph return wrap_fx_proxy(tx, proxy) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2500, in wrap_fx_proxy return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs) File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2566, in wrap_fx_proxy_cls return _wrap_fx_proxy( File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/variables/builder.py", line 2664, in _wrap_fx_proxy example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3205, in get_fake_value ret_val = wrap_fake_exception( File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 2705, in wrap_fake_exception return fn() File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3206, in <lambda> lambda: run_node(tx.output, node, args, kwargs, nnmodule) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_dynamo/utils.py", line 3373, in run_node return node.target(*args, **kwargs) File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5917, in do_call_core File "/usr/local/src/conda/python-3.10.16/Objects/methodobject.c", line 430, in cfunction_vectorcall_FASTCALL File "/usr/local/src/conda/python-3.10.16/Objects/abstract.c", line 891, in binary_op1 File "/usr/local/src/conda/python-3.10.16/Objects/typeobject.c", line 7284, in slot_nb_multiply File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Objects/descrobject.c", line 344, in method_vectorcall_VARARGS_KEYWORDS File "python_variable_methods.cpp", line 0, in _object* torch::autograd::TypeError_to_NotImplemented_<&torch::autograd::THPVariable_mul>(_object*, _object*, _object*) File "python_variable_methods.cpp", line 0, in torch::autograd::THPVariable_mul(_object*, _object*, _object*) File "??", line 0, in at::_ops::mul_Tensor::call(at::Tensor const&, at::Tensor const&) File "offloadstuff.c", line 0, in c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, at::Tensor const&), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const File "PythonFallbackKernel.cpp", line 0, in void c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::pythonTLSSnapshotFallback>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const File "VariableType_0.cpp", line 0, in c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::mul_Tensor>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) File "VariableType_0.cpp", line 0, in torch::autograd::VariableType::(anonymous namespace)::mul_Tensor(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) File "??", line 0, in at::_ops::mul_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) File "offloadstuff.c", line 0, in c10::impl::BoxedKernelWrapper<at::Tensor (at::Tensor const&, at::Tensor const&), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::python_dispatcher(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const File "offloadstuff.c", line 0, in c10::OperatorHandle::callBoxedForDispatchKey(c10::DispatchKey, std::vector<c10::IValue, std::allocator<c10::IValue> >&) const File "PythonFallbackKernel.cpp", line 0, in (anonymous namespace)::pythonFallback(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) File "PyInterpreter.cpp", line 0, in torch::detail::(anonymous namespace)::ConcretePyInterpreterVTable::dispatch(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const File "??", line 0, in torch::handle_torch_function_no_python_arg_parser(c10::ArrayRef<_object*>, _object*, _object*, char const*, _object*, char const*, torch::TorchFunctionName) File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 577, in PyObject_CallMethod File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/utils/_stats.py", line 27, in wrapper return fn(*args, **kwargs) File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1346, in __torch_dispatch__ return self.dispatch(func, types, args, kwargs) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2029, in dispatch return self._cached_dispatch_impl(func, types, args, kwargs) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 1442, in _cached_dispatch_impl return self._dispatch_impl(func, types, args, kwargs) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_tensor.py", line 2552, in _dispatch_impl return maybe_propagate_real_tensors(fast_impl(self, *args, **kwargs)) File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_impls.py", line 956, in fast_binary_impl final_shape = infer_size(final_shape, shape) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/_subclasses/fake_impls.py", line 916, in infer_size torch._check( File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/__init__.py", line 1669, in _check _check_with(RuntimeError, cond, message) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/__init__.py", line 1632, in _check_with if expect_true(cond): File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1686, in expect_true return a.node.expect_true( File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 552, in expect_true return self.guard_bool(file, line) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 536, in guard_bool r = self.evaluate() File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/fx/experimental/sym_node.py", line 510, in evaluate return self.shape_env.evaluate_sym_node(self, size_oblivious) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7113, in evaluate_sym_node return self.evaluate_expr( File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/fx/experimental/recording.py", line 272, in wrapper return retlog(fn(*args, **kwargs)) File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 267, in PyVectorcall_Call File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7215, in evaluate_expr return self._inner_evaluate_expr( File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/fx/experimental/recording.py", line 272, in wrapper return retlog(fn(*args, **kwargs)) File "/usr/local/src/conda/python-3.10.16/Python/ceval.c", line 5945, in do_call_core File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7238, in _inner_evaluate_expr return self._evaluate_expr( File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7505, in _evaluate_expr self._maybe_guard_rel(g) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall File "/usr/local/src/conda/python-3.10.16/Modules/_functoolsmodule.c", line 1020, in bounded_lru_cache_wrapper File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6758, in _maybe_guard_rel self._refine_ranges(expr) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 7709, in _refine_ranges self._set_replacement( File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6667, in _set_replacement self.framework_specialization_stacks[source] = CapturedTraceback.extract(cpp=True) File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 114, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h", line 46, in _PyEval_EvalFrame File "/home/bobren/local/a/pytorch/torch/utils/_traceback.py", line 207, in extract torch._C._profiler.gather_traceback(python=True, script=script, cpp=cpp), File "/usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h", line 112, in _PyObject_VectorcallTstate File "/usr/local/src/conda/python-3.10.16/Objects/call.c", line 215, in _PyObject_MakeTpCall File "/usr/local/src/conda/python-3.10.16/Objects/methodobject.c", line 543, in cfunction_call File "offloadstuff.c", line 0, in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) File "offloadstuff.c", line 0, in pybind11::cpp_function::initialize<std::shared_ptr<torch::CapturedTraceback> (*&)(bool, bool, bool), std::shared_ptr<torch::CapturedTraceback>, bool, bool, bool, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg_v, pybind11::arg_v, pybind11::arg_v>(std::shared_ptr<torch::CapturedTraceback> (*&)(bool, bool, bool), std::shared_ptr<torch::CapturedTraceback> (*)(bool, bool, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg_v const&, pybind11::arg_v const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) File "??", line 0, in torch::CapturedTraceback::gather(bool, bool, bool) File "??", line 0, in torch::unwind::unwind() User stack: File "/home/bobren/local/a/pytorch/scratch/repro.py", line 5, in foo return torch.randn(5) * x ``` Pull Request resolved: pytorch#155603 Approved by: https://github.com/zou3519, https://github.com/cyyever ghstack dependencies: pytorch#155133

For tensor with non-zero offset, it must be multiplied by element size Add regression test by creating Tensor in array of 6 elements with offset 3, which before the fix crashed with ``` C++ exception with description "setStorage: sizes [3, 3], strides [0, 1], storage offset 3, and itemsize 4 requiring a storage size of 24 are out of bounds for storage of size 15 Exception raised from checkInBoundsForStorage at /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/Resize.h:123 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 56 (0x104a9cd44 in libc10.dylib) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 120 (0x104a9a05c in libc10.dylib) frame #2: void at::native::checkInBoundsForStorage<long long>(c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long, caffe2::TypeMeta const&, c10::Storage const&) + 656 (0x111dbd314 in libtorch_cpu.dylib) frame #3: void at::native::setStrided<long long>(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long) + 152 (0x111dcd22c in libtorch_cpu.dylib) frame #4: at::native::as_strided_tensorimpl(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) + 312 (0x111dccf98 in libtorch_cpu.dylib) frame #5: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU__as_strided(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>>>, at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 104 (0x1129a1e94 in libtorch_cpu.dylib) frame #6: at::_ops::as_strided::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 476 (0x112200ad0 in libtorch_cpu.dylib) frame #7: at::Tensor::as_strided(c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) const + 236 (0x1115db098 in libtorch_cpu.dylib) frame #8: at::native::expand(at::Tensor const&, c10::ArrayRef<long long>, bool) + 348 (0x111dcc0d4 in libtorch_cpu.dylib) frame #9: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::ADInplaceOrView::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 116 (0x1157ac410 in libtorch_cpu.dylib) frame #10: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::autograd::VariableType::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 992 (0x114e8b010 in libtorch_cpu.dylib) frame #11: at::_ops::expand::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 316 (0x112743c90 in libtorch_cpu.dylib) frame #12: at::expand_size(at::Tensor const&, c10::ArrayRef<long long>) + 164 (0x1047d82b4 in basic) frame #13: BasicTest_TestForBlobResizeCPU_Test::TestBody() + 284 (0x1047d8048 in basic) ``` Pull Request resolved: pytorch#158690 Approved by: https://github.com/angelayi

…rch#165479) These happen when building with CMAKE_BUILD_TYPE=RelWithAssert This should fix two types of failures that started with pytorch#163665 Disclaimer that I used a lot of AI since I don't how pybind works or what refcounts and pointers are, so idk if this is a good solution, or even a solution at all (fwiw the tests pass now) The first one type is Truncated: ``` default_pg, _ = _new_process_group_helper( File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2096, in _new_process_group_helper backend_class = creator_fn(dist_backend_opts, backend_options) File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/fake_pg.py", line 25, in _create_fake_pg return FakeProcessGroup._create_internal( RuntimeError: new_refcount != 1 INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/c10/util/intrusive_ptr.h":319, please report a bug to PyTorch. intrusive_ptr: Cannot increase refcount after it reached zero. Exception raised from retain_ at /var/lib/jenkins/workspace/c10/util/intrusive_ptr.h:319 (most recent call first): C++ CapturedTraceback: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 #6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0 #7 c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) from ??:0 #8 void pybind11::class_<c10d::FakeProcessGroup, (anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup> >::init_instance<(anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup>, 0>(pybind11::detail::instance*, void const*) from init.cpp:0 #9 pybind11::detail::type_caster_generic::cast(void const*, pybind11::return_value_policy, pybind11::handle, pybind11::detail::type_info const*, void* (*)(void const*), void* (*)(void const*), void const*) from :0 #10 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)pytorch#127}, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> >, int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v>(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)pytorch#127}&&, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> > (*)(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0 ``` and I fix it here by getting rid of `DontIncreaseRefcount` and using make_intrusive to do the ref count handling instead. However, I also had to move the constructor to be public, which I think is not good, based on the reasoning of the original PR The other one type is ``` Traceback (most recent call last): File "/var/lib/jenkins/workspace/test/test_testing.py", line 2415, in test_no_warning_on_import self.assertEqual(out, "") File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4233, in assertEqual raise error_metas.pop()[0].to_error( # type: ignore[index] AssertionError: String comparison failed: "/opt/conda/envs/py_3.10/lib/python3.10/s[352 chars]):\n" != '' - /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/__init__.py:29: FutureWarning: pybind11-bound class 'torch._C._distributed_c10d.FakeProcessGroup' is using an old-style placement-new '__init__' which has been deprecated. See the upgrade guide in pybind11's docs. This message is only visible when compiled in debug mode. - if is_available() and not torch._C._c10d_init(): To execute this test, run the following from the base repo dir: python test/test_testing.py TestImports.test_no_warning_on_import ``` which I fix by getting rid of the `__init__` which I think is ok since it'll just error if you try to make one? Pull Request resolved: pytorch#165479 Approved by: https://github.com/ezyang

If another static object (like `g_device_config_parse_hook_registry_instance` created by the `REGISTER_ALLOCATOR_CONFIG_PARSE_HOOK` macro) tries to call `registerDeviceConfigParserHook` before `device_config_parser_hook_` is initialized, assigning to it (operator=) can fail, which leads to a runtime error. When I use a compilation optimization of ` -O1` I see this issue: ``` [src/libcxx/include/__functional/function.h:496]:14: runtime error: member access within null pointer of type 'const __policy' #0 0x563224e28b78 in operator= [crosstool/v18/stable/src/libcxx/include/__functional/function.h:496]:14 #1 0x563224e28b78 in operator= [crosstool/v18/stable/src/libcxx/include/__functional/function.h:483]:19 #2 0x563224e28b78 in operator= [crosstool/v18/stable/src/libcxx/include/__functional/function.h:727]:8 #3 0x563224e28b78 in c10::CachingAllocator::AcceleratorAllocatorConfig::registerDeviceConfigParserHook(std::__u::function<void (std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>> const&)>&&, std::__u::unordered_set<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>, std::__u::hash<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>, std::__u::equal_to<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>, std::__u::allocator<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>> const&) [torch/c10/core/AllocatorConfig.h:263]:32 #4 0x563224e28e9d in DeviceConfigParserHookRegistry [torch/c10/core/AllocatorConfig.h:369]:5 #5 0x563224e28e9d in __cxx_global_var_init.34 [torch/c10/cuda/CUDAAllocatorConfig.cpp:195]:1 #6 0x563224e28e9d in _GLOBAL__sub_I_CUDAAllocatorConfig.cpp torch/c10/cuda/CUDAAllocatorConfig.cpp #7 0x5632459709ac in __libc_csu_init /[usr/grte/v5/debug-src/src/csu/elf-init.c:88]:7 #8 0x7f748b9562e7 in __libc_start_main (/usr/grte/v5/lib64/libc.so.6+0x612e7) (BuildId: ca23ec6d935352118622ce674a8bb52d) #9 0x5632018f3729 in _start /usr/grte/v5/debug-src/src/csu/../sysdeps/x86_64/start.S:120 ``` Pull Request resolved: pytorch#172581 Approved by: https://github.com/guangyey, https://github.com/albanD

…nces between x86 vs aarch64 (pytorch#176085) In the test: ``` python test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda ``` it raises an exception when calling `STD_CUDA_CHECK(cudaSetDevice(99999));` which got the expected `CUDA error: invalid device` message. However, the expected string for the C++ stack trace is different between `x86` vs `aarch64` due perhaps in these issues: - pytorch#119905 - pytorch#134387 In the current setup when getting a stack trace string: - x86 contains `C++ CapturedTraceback:` - aarch64 contains `Exception raised from` + `frame #` An example of the full string from an aarch64 system when : ``` AssertionError: 'C++ CapturedTraceback:' not found in 'CUDA error: invalid device ordinal\nGPU device may be out of range, do you have enough GPUs?\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n\nException raised from test_std_cuda_check_error at /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/csrc/test_std_cuda_check.cu:23 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xe471ebcd39f4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)\nframe #1: <unknown function> + 0x43f998 (0xe471ebdcf998 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1bc (0xe471ebdcfc0c in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #3: torch_c10_cuda_check_msg + 0x1c (0xe471ef335c4c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)\nframe #4: test_std_cuda_check_error() + 0x58 (0xe470cd396678 in /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/install/usr/local/lib/python3.12/dist-packages/libtorch_agn_2_10/_C.so)\nframe #5: c10::BoxedKernel::makeFromFunctor<StableIValueBoxedKernel>(std::unique_ptr<StableIValueBoxedKernel, std::default_delete<StableIValueBoxedKernel> >)::{lambda(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)#1}::_FUN(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 0x16c (0xe47211cd419c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe #6: <unknown function> + 0x61d34bc (0xe47211cf34bc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe #7: <unknown function> + 0xe6c324 (0xe4721532c324 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #8: <unknown function> + 0xe6c7e0 (0xe4721532c7e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #9: <unknown function> + 0xd3907c (0xe472151f907c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #10: <unknown function> + 0x5ccbf8 (0xe47214a8cbf8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #11: /usr/bin/python() [0x504a34]\nframe #12: PyObject_Call + 0x6c (0x4c633c in /usr/bin/python)\nframe #13: _PyEval_EvalFrameDefault + 0x3ea0 (0x568564 in /usr/bin/python)\nframe #14: _PyObject_Call_Prepend + 0xc4 (0x4c5934 in /usr/bin/python)\nframe #15: /usr/bin/python() [0x52a070]\nframe #16: _PyObject_MakeTpCall + 0x78 (0x4c3e58 in /usr/bin/python)\nframe #17: _PyEval_EvalFrameDefault + 0x8a0 (0x564f64 in /usr/bin/python)\nframe #18: PyEval_EvalCode + 0x130 (0x5632b4 in /usr/bin/python)\nframe #19: PyRun_StringFlags + 0xe0 (0x59c330 in /usr/bin/python)\nframe #20: PyRun_SimpleStringFlags + 0x44 (0x67ebc4 in /usr/bin/python)\nframe #21: Py_RunMain + 0x390 (0x68b380 in /usr/bin/python)\nframe #22: Py_BytesMain + 0x28 (0x68ae88 in /usr/bin/python)\nframe #23: <unknown function> + 0x284c4 (0xe47216b084c4 in /lib/aarch64-linux-gnu/libc.so.6)\nframe #24: __libc_start_main + 0x98 (0xe47216b08598 in /lib/aarch64-linux-gnu/libc.so.6)\nframe #25: _start + 0x30 (0x5f6770 in /usr/bin/python)\n\n' To execute this test, run the following from the base repo dir: python test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda ``` Pull Request resolved: pytorch#176085 Approved by: https://github.com/eqy

… mode + replication padding (pytorch#177166) Fixes pytorch#170079 ## Context `torch.compile(ReplicationPad1d(...), fullgraph=True)` crashes when `torch.use_deterministic_algorithms(True)` is set on CUDA. The error: Dynamo can't trace through `importlib.import_module`. The deterministic code path exists because the native `replication_pad1d_backward` CUDA kernel uses `atomicAdd` (non-deterministic). `functional.py` calls `_replication_pad` — a Python decomposition using `_unsafe_index`, whose backward uses `index_put` (deterministic). ## Dynamo limitations encountered Three separate Dynamo tracing barriers prevented calling `_replication_pad` directly: ### 1. `importlib.import_module` is marked as skipped ```python @torch.compile(fullgraph=True) def fn(x): import importlib return importlib.import_module("torch").sin(x) fn(torch.randn(3)) # Unsupported: function marked as skipped ``` ### 2. `elementwise_dtypes` returns non-Tensor (from `@pw_cast_for_opmath`) ```python @torch.compile(fullgraph=True) def fn(x): from torch._prims_common import elementwise_dtypes, ELEMENTWISE_TYPE_PROMOTION_KIND dt, _ = elementwise_dtypes(x, type_promotion_kind=ELEMENTWISE_TYPE_PROMOTION_KIND.DEFAULT) return x.to(dt) fn(torch.randn(3)) # Unsupported: torch.* op returned non-Tensor ``` ### 3. `torch._check` with closure lambda ```python @torch.compile(fullgraph=True) def fn(x): dim = x.dim() torch._check(dim in (2, 3), lambda: f"expected 2D or 3D, got {dim}D") return x + 1 fn(torch.randn(3, 3)) # Unsupported: Can't extract message from torch._check() ``` ## Iteration log | # | Approach | Who | Tests | Reviewer pushback | Why it failed | |---|----------|-----|-------|-------------------|---------------| | 1 | Replace `importlib` with `from...import` | Claude | bilinear/trilinear pass, replicate fails | "why do we need bilinear/trilinear tests?" — scoped fix to reported bug only | Hit limitation #2: `@pw_cast_for_opmath` | | 2 | Skip decomposition under compile via `is_compiling()`, rely on AOTAutograd's `@register_decomposition` | Claude | forward-only `backend="eager"` passes | "can you verify at inductor level this is actually deterministic?" — inspect AOT graph | No backward decomposition registered; backward still uses native `replication_pad1d_backward` (non-deterministic) | | 3 | Unwrap `@pw_cast_for_opmath` via `__wrapped__` | Claude | N/A — fails immediately | N/A | Hit limitation #3: `torch._check()` closure | | 4 | `@nonstrict_trace` — Dynamo skips body, AOTAutograd traces through | Reviewer suggestion | `backend="aot_eager"`, forward + backward under `DeterministicGuard(True)` | N/A — fix is correct | N/A | ## Key insight The fix isn't about making Dynamo trace the decomposition or skipping it entirely — it's about putting the boundary in the right place. Dynamo doesn't need to see inside; AOTAutograd does. `@nonstrict_trace` is exactly this boundary. Each "obvious" fix had passing tests that weren't testing the right thing. Only when the reviewer pushed for backward determinism verification and AOT graph inspection did the weaknesses surface. The backward completing without error under `DeterministicGuard(True)` proves determinism — PyTorch explicitly raises `RuntimeError` if any non-deterministic CUDA kernel executes under this mode. Authored with Claude. Pull Request resolved: pytorch#177166 Approved by: https://github.com/mlazos, https://github.com/williamwen42

make op_db general for GPU, sample input generalization is TBD

e78c922

daisyden changed the title ~~make op_db general for GPU, sample input generalization is TBD~~ [2/N] make op_db general for GPU, sample input generalization is TBD Dec 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[2/N] make op_db general for GPU, sample input generalization is TBD#3

[2/N] make op_db general for GPU, sample input generalization is TBD#3
daisyden wants to merge 1 commit intodaisyden/inductor_devicefrom
daisyden/op_db

daisyden commented Dec 22, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

daisyden commented Dec 22, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant