diff --git a/.dockerignore b/.dockerignore deleted file mode 100644 index 3e4e48b..0000000 --- a/.dockerignore +++ /dev/null @@ -1 +0,0 @@ -.gitignore \ No newline at end of file diff --git a/.gitignore b/.gitignore index 3bbad5f..7651e65 100644 --- a/.gitignore +++ b/.gitignore @@ -1,13 +1,5 @@ -__pycache__ -test/__pycache__ - -*.onnx -*.onnx.ignore -*.pt - -data.ignore/ - dist +data linger.egg-info linger.egg-info/dependency_links.txt linger.egg-info/PKG-INFO @@ -17,6 +9,7 @@ linger.egg-info/top_level.txt test/data.ignore test/data.ignore/aa.pt test/data.ignore/test1_baseline.pt +data.ignore/ doc/build/ examples/dump examples/dump_all @@ -24,8 +17,11 @@ examples/dump_filter examples/dump_part examples/resnet50.onnx.ignore examples/shufflenet_v2_x1_0.bin.ignore - +*.onnx +*.onnx.ignore +*.pt *.log .vscode - -build \ No newline at end of file +__pycache__ +test/__pycache__ +data.ignore diff --git a/.vscode/settings.json b/.vscode/settings.json new file mode 100644 index 0000000..a8c2003 --- /dev/null +++ b/.vscode/settings.json @@ -0,0 +1,5 @@ +{ + "python-envs.defaultEnvManager": "ms-python.python:conda", + "python-envs.defaultPackageManager": "ms-python.python:conda", + "python-envs.pythonProjects": [] +} \ No newline at end of file diff --git a/LICENSE b/LICENSE index ac4ade6..5f1c80a 100644 --- a/LICENSE +++ b/LICENSE @@ -1,4 +1,4 @@ -Copyright (c) 2022 LISTENAI Authors. All Rights Reserved +Copyright (c) 2025 LISTENAI Authors. All Rights Reserved Apache License Version 2.0, January 2004 @@ -188,7 +188,7 @@ Copyright (c) 2022 LISTENAI Authors. All Rights Reserved same "printed page" as the copyright notice for easier identification within third-party archives. - Copyright (c) 2022 LISTENAI Authors. All Rights Reserved. + Copyright (c) 2025 LISTENAI Authors. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. 
diff --git a/MANIFEST.in b/MANIFEST.in index 540e1e3..4f7e728 100644 --- a/MANIFEST.in +++ b/MANIFEST.in @@ -1,3 +1,4 @@ include requirements.txt include linger/* -include linger/lib/* \ No newline at end of file + +exclude linger/extension \ No newline at end of file diff --git a/README.md b/README.md index d2e3902..bead395 100644 --- a/README.md +++ b/README.md @@ -1,16 +1,10 @@ ![linger_logo](doc/image/linger_logo.png) -------------------------------------------------------------------------------- #### [English](README_en.md) | 简体中文 - -[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pylinger.svg)](https://pypi.org/project/pylinger) -[![PiPI](https://badge.fury.io/py/pylinger.svg)](https://pypi.org/project/pylinger/) -[![License](https://img.shields.io/github/license/LISTENAI/thinker.svg?style=flat-square)](https://github.com/LISTENAI/linger/blob/main/LICENSE) -[![linux](https://github.com/LISTENAI/linger/actions/workflows/auto_test.yml/badge.svg)](https://github.com/LISTENAI/linger/actions/workflows/auto_test.yml) -linger是由安徽聆思科技和合肥智能语音公司联合开源的的神经网络量化训练组件,是聆思科技开源的AI生态工具链LNN(ListenAI Neural Network)的一部分,结合另一个聆思开源的推理引擎框架[thinker](https://github.com/LISTENAI/thinker)可实现产业级深度学习训练推理一体化平台,集深度学习量化训练和引擎推理、LUNA器件库和丰富的辅助组件于一体。LNN是专为聆思AIOT芯片(目前只支持CSK60xx系列)定制开发,助力开发者轻松在聆思VENUS芯片上快速上线AI业务,帮助越来越多嵌入式尤其是AIOT产品实现AI赋能,助力产业智能化升级。目前LNN工具链已支持聆思芯片在计算机视觉、语音唤醒、语音识别、离线翻译等10多个AI应用场景中的产品落地。 ## 方案简介 -linger基于PyTorch对聆思LUNA系列芯片进行深度定制,在神经网络训练的前向过程中将激活和权重量化到8bit,通过参数调整得到量化无损的8bit模型 +linger基于PyTorch对聆思IOT系列芯片进行深度定制,在神经网络训练的前向过程中将激活和权重量化到8bit,通过参数调整得到量化无损的8bit模型 ![doc/image/solution.png](doc/image/solution.png) @@ -21,83 +15,21 @@ linger 是基于 PyTorch 的量化方案,在原始浮点训练代码中加入 ### 2. 拓展性好 linger 基于 PyTorch 进行量化算子的搭建,因此只要符合 PyTorch 拓展算子的规范,你可以添加任何量化算子到 linger 中来完成你的量化需求 -### 3. 
工具链完整 -linger 后端适配 [thinker](https://github.com/LISTENAI/thinker) 推理引擎,thinker 推理引擎为CSK60XX而生,功能完善,量化训练与推理过程可无缝衔接,同时训练推理二进制一致得到保证 - ## 快速入门 - [安装](doc/tutorial/install.md):支持pip、源码、docker三种安装方式 -- [浮点-定点两阶段量化训练](doc/tutorial/get_started_for_two_stage.md): 先进行浮点网络的约束训练,再针对量化友好的浮点模型进行量化训练微调 -- [浮点-定点两阶段量化训练方案详解](doc/tutorial/two_stage_quant_aware_train.md) -- [onnx导出教程](doc/tutorial/from_mode_to_onnx.md):将量化无损的PyTorch模型导出为ONNX格式的模型 -- [权重分析工具使用及量化onnx导出错误调试](doc/tutorial/wb_analyse_tool_and_onnx_export_debug_tool.md) - -## 工程示例 -AI算法落地基本涵盖六个阶段:模型规约性检查、浮点训练、量化训练、模型打包、模拟引擎执行、固件烧录并芯片运行。其中固件烧录并芯片运行需要在聆思的开发板上来完成,如有需要请与我们联系,这里不做进一步介绍。其它五个阶段的流程示例图如下: -![lnn_flow_path](doc/image/lnn_flow_path.png) -其中模型规约性检查的功能是穿插在量化训练和模型打包中来完成的。 -我们先假设模型结构与底层硬件完全适配,介绍流程中各个阶段,再介绍模型规约性检查的具体实现(实际开发过程中规约性检查要在模型结构初步进行,避免后续工作返工)。 -### 1. 浮点训练 - 我们基于[pytorch-cifar100](https://github.com/weiaicunzai/pytorch-cifar100)来进行功能展示 - 首先确保在当前环境下(建议linger-env),浮点模型训练基于pytorch能够跑起来。 -```Shell -python train.py -net resnet50 -gpu -``` - 建议采用两阶段量化训练,对浮点训练的数据进行范围约束,只需[添加少量代码](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/tutorial/resnet_modify1.md). - 为避免冲突,将tesnorboard[功能关闭](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/tutorial/resnet_modify2.md)。同样的指令开启训练,运行几个epoch后,在checkpoint/resnet50文件夹中生成了一个**.pth文件 - -### 2. 量化训练和导出 - 加载步1中保存的浮点模型**.pth,[修改约束代码](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/images/linger_set2.png),即可将浮点算子替换为量化算子。同样的指令开启量化训练,训练几个epoch后,同样在checkpoint/resnet50文件夹中生成了一个**.pth文件。 - 使用linger的模型转换工具,将[模型转换成onnx计算图](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/images/onnx_export.png)。 - -### 3. 模型分析和打包 - 切换到thinker-env环境,使用thinker离线工具tpacker对步2生成的onnx计算图打包,这里我们以训练好的resnet18模型为例,进行打包 -```Shell -tpacker -g demo/resnet18/resnet18-12-regular.onnx -d True -o demo/resnet18/model.bin -``` -这里使用到的资源可以从[thinker/demo/resnet18](https://github.com/LISTENAI/thinker/tree/main/demo/resnet18)中获取 - -### 4. 
推理执行 - 使用调用示例工程test_thinker,指定输入数据、资源文件和输出文件名称即可运行模拟代码。 -```Shell -chmod +x ./bin/test_thinker -./bin/test_thinker demo/resnet18/input.bin demo/resnet18/model.bin demo/resnet18/output.bin 3 32 32 6 -``` - 注意:推理执行需要[安装thinker源码](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/tutorial/install.md),并完成编译。 - - -### 5. 规约性检查 - 该阶段不关注模型的效果,只关注模型的结构是否和底层硬件相适配,功能实现贯穿了1~4步 - * 在步1中,对模型参数进行初始化或者训练几个epoch即可将模型文件导出,无需模型收敛。 - * 步2中加载步1的模型文件,进行量化训练时,会对算子参数的合规性进行检查,如有不符合的设置,报错退出[错误示例](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/images/resnet50_linger_err.png)。用户根据报错信息修改层参数并返回步1,直至通过步2。 - * 步3中加载步2的计算图,工具会对节点的tensor大小进行检查,[如果tensor大小超限会报错退出](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/images/Resnet50_err.png)。否则进入内存分析阶段,会在根目录下生成[内存分析报告](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/images/Resnet50_Mem1.png),并提示整体的flash/psram/share-memory占用。对于超过硬件限制的报错,用户可结合报错信息和[内存分析报告](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/images/Resnet50_Mem2.png)来定位计算图中的超限的算子,返回步1进行模型结构调整,直至[通过步3的打包流程](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/images/Resnet50_sucess.png)。 - 至此完成模型规约性检查,确保了模型能够在芯片上能够跑起来。模型效率评估目前只支持在芯片上部署运行,具体需求可联系我们。 +- [量化训练快速入门](doc/tutorial/quant_quick_strat.md): 先进行浮点网络的约束训练,再针对量化友好的浮点模型进行量化训练微调 +- [量化训练进阶指导](doc/tutorial/quant_advanced_guide.md): 量化进阶配置 +- [onnx导出教程](doc/tutorial/export_onnx.md):将量化无损的PyTorch模型导出为ONNX格式的模型 ## 能力展示 - [linger API](doc/tutorial/linger_api.md) -- [支持量化OP列表](doc/tutorial/support_quant_ops.md)及[模型结构限制说明](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/tutorial/restrain_of_model.md) +- [支持量化OP列表](doc/tutorial/support_quant_ops.md) ## 常见问题 -- [安装出现问题解决](doc/tutorial/install_bugs.md) +- [安装出现问题解决](doc/tutorial/install_errors.md) - [量化常见问题与注意事项](doc/tutorial/quant_faq.md) ## 版本说明 - 请参考[RELEASE](doc/tutorial/release.md) -## 交流与反馈 -- 欢迎您通过 Github Issues 来提交 BUG 与建议 -- 技术交流微信群 -![concat us](doc/image/contact_me_qr.png) - -## 引用 -- 
[PyTorch](https://github.com/pytorch/pytorch) -- [ONNX](https://github.com/onnx/onnx) -- [pytorch-cifar100](https://github.com/weiaicunzai/pytorch-cifar100) -- -## 应用示例 -* 鼾声检测[https://github.com/mywang44/snoring_net] -* 离线翻译[https://github.com/dwzhang00/Offline-translation] -* 二维码检测与识别[https://github.com/mywang44/YOLOv1_QRcode_Detection] - -## 版权和许可证 -- linger 由 [Apache-2.0 license](LICENSE) 提供 diff --git a/README_en.md b/README_en.md deleted file mode 100644 index 1650dbb..0000000 --- a/README_en.md +++ /dev/null @@ -1,105 +0,0 @@ -![linger_logo](doc/image/linger_logo.png) --------------------------------------------------------------------------------- -#### English | [Chinese](README.md) - -[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pylinger.svg)](https://pypi.org/project/pylinger) -[![PiPI](https://badge.fury.io/py/pylinger.svg)](https://pypi.org/project/pylinger/) -[![License](https://img.shields.io/github/license/LISTENAI/thinker.svg?style=flat-square)](https://github.com/LISTENAI/linger/blob/main/LICENSE) -[![linux](https://github.com/LISTENAI/linger/actions/workflows/auto_test.yml/badge.svg)](https://github.com/LISTENAI/linger/actions/workflows/auto_test.yml) - -Linger is an open source neural network quantization training component by LISTENAI, designed for use with the AIOT chip CSK60XX. This component combines Linger's open source high-performance neural network inference framework -[thinker](https://github.com/LISTENAI/thinker) can achieve training and inference integration, helping AI developers to quickly give business with AI capabilities based on CSK chip. Currently linger + thinker tool chain has supported the use of CSK chip in more than 10 AI application scenarios such as computer vision, voice wakeup, speech recognition, offline translation, etc. 
- - -## Introduction -The linger is based on PyTorch to deeply customize the LISTENAI LUNA series chip, quantize the activation and weight to 8bit in the forward process of neural network training, and get the quantized lossless 8bit model by parameter adjustment. - -![doc/image/solution.png](doc/image/solution.png) - -## Technical Highlights -### 1. High Ease of Use -linger is a PyTorch-based quantization scheme. Adding one line of linger-related code to the original floating-point training code can complete the replacement of quantization operators, and the quantization training can be completed using the original training process without other complicated settings. - -### 2. Good Scalability -linger is based on PyTorch to build quantization operators, so you can add any quantization operator to linger to complete your quantization needs as long as it meets the specifications of PyTorch extension operators. - -### 3. Complete Toolchain -The backend is adapted to [thinker](https://github.com/LISTENAI/thinker) inference engine, thinker inference engine for CSK60XX, which is fully functional and seamlessly integrates quantization training and inference process, while the binary consistency of training and inference is guaranteed. - - -## Quick Start -1. [Installation](doc/tutorial/install.md):support pip, source code, docker and other installation methods -2. [Floating-point-fixed-point two-stage quantization training](doc/tutorial/get_started_for_two_stage.md): first the constraint training of floating-point network, and then the quantization training fine-tuning for the quantization-friendly floating-point model -3. [ONNX export tutorial](doc/tutorial/from_mode_to_onnx.md):exporting quantized lossless PyTorch models to ONNX format -4. 
[Complete introductory examples](examples/):provide several newbie-friendly introductory quantization examples - -## Demo -The implementation of AI algorithms basically covers six stages: model specification check, floating-point training, quantization training, model packaging, simulation engine execution, firmware burning and chip operation. The firmware programming and chip operation need to be completed on the development board of Lenses. If necessary, please contact us, and no further introduction will be made here. The flow chart of the other five stages is as follows: -![lnn_flow_path](doc/image/lnn_flow_path.png) -Among them, the function of model regularity check is interspersed in quantization training and model packaging. -We first assume that the model structure is fully compatible with the underlying hardware, introduce each stage in the process, and then introduce the specific implementation of the model convention check (in the actual development process, the convention check should be carried out initially on the model structure to avoid rework in subsequent work). -### 1. Floating-point training -We are based on [pythoch-cifar100](https://github.com/weiaicunzai/pytorch-cifar100) for function demonstration -First of all, Make sure that in the current environment, the floating-point model training can run based on pytorch. -```Shell -python train.py -net resnet50 -gpu -``` -It is recommended to use two-stage quantization training to restrict the range of floating-point training data, and only need to [add a small amount of code](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/tutorial/resnet_modify1.md). -To avoid conflicts, turn tesnorboard[function off](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/tutorial/resnet_modify2.md). Start the training with the same command, and after running several epochs, a **.pth file is generated in the checkpoint/resnet50 folder - -### 2. 
Quantization training and Export -Load the floating-point model **.pth saved in step 1, and [modify the constraint code](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/images/linger_set2.png) to replace the floating-point operator with a quantized operator. The same command starts quantization training. After several epochs are trained, a **.pth file is also generated in the checkpoint/resnet50 folder. -Use linger's model conversion tool to [convert the model into an onnx calculation graph](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/images/onnx_export.png). - -### 3. Model analysis and packaging -Use the thinker offline tool tpacker to pack the onnx calculation graph generated in step 2 -```Shell -tpacker -g demo/resnet28/resnet18-12-regular.onnx -d Ture -o demo/resnet28/model.bin -``` -we can acquire resource from [thinker/demo/resnet18](https://github.com/LISTENAI/thinker/demo/resnet18/) -### 4. Engine Execution -Use the sample project test_thinker to run the simulation code by specifying the input data, resource file and output file name. -```Shell -chmod +x ./bin/test_thinker -./bin/test_thinker demo/resnet28/input.bin demo/resnet28/model.bin demo/resnet28/output.bin 3 32 32 6 -``` -Simplify the overall processing process here, with the engine input being a normalized 3x32x32 image and the output taking max_ The ID corresponding to value is used as the classification result. The processing of input images can refer to the [Image Processing Script](tools/image_process.py), or the processed test set images can be taken from Pytorch cifar100 for testing. - -### 5. Conventional check -At this stage, we do not pay attention to the effect of the model, but only pay attention to whether the structure of the model is compatible with the underlying hardware, and the function realization runs through steps 1~4 -* In step 1, the model file can be exported by initializing the model parameters or training a few epochs without model convergence. 
-* Load the model file of step 1 in step 2. When performing quantitative training, the compliance of operator parameters will be checked. If there are any settings that do not meet the requirements, an error will be reported and exit -[error example](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/images/resnet50_linger_err.png). The user modifies the layer parameters according to the error message and returns to step 1 until step 2 is passed. -* Load the calculation graph of step 2 in step 3, the tool will check the tensor size of the node, [if the tensor size exceeds the limit, an error will be reported and exit](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/images/Resnet50_err.png). Otherwise, enter the memory analysis stage, and generate a [memory analysis report](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/images/Resnet50_Mem1.png) in the root directory, and prompt the overall flash /psram/share-memory occupied. For errors that exceed the hardware limit, users can combine the error information and [Memory Analysis Report](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/images/Resnet50_Mem2.png) to locate the calculation graph The overrun operator returns to step 1 to adjust the model structure until [through the packaging process of step 3](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/images/Resnet50_sucess.png ). -So far, the model compliance check has been completed, ensuring that the model can run on the chip. Model efficiency evaluation currently only supports deployment and operation on chips, please contact us for specific needs. 
- -## Quantitative Advancement - - [Floating-point-fixed-point two-stage quantization training program detailed explanation](doc/tutorial/two_stage_quant_aware_train.md) - - [Use of weight analysis tools and debugging of quantitative onnx export errors](doc/tutorial/wb_analyse_tool_and_onnx_export_debug_tool.md) - -## Frequently Asked Questions -- [Installation problem solving](doc/tutorial/install_bugs.md) -- [Quantification of common problems and notes](doc/tutorial/quant_faq.md) - -## Release Not -- Please refer to [RELEASE](doc/tutorial/release.md) - -## Data Search -- [linger API](doc/tutorial/linger_api.md) -- [List of supported quantization OPs](doc/tutorial/support_quant_ops.md) and [their restrictions](https://github.com/LISTENAI/thinker/blob/main/thinker/docs/tutorial/restrain_of_model.md) - -## Communication and Feedback -- You are welcome to submit bugs and suggestions via Github Issues -- Technical Communication WeChat Group -![concat us](doc/image/contact_me_qr.png) - -## Reference -- [PyTorch](https://github.com/pytorch/pytorch) -- [ONNX](https://github.com/onnx/onnx) -- [pytorch-cifar100](https://github.com/weiaicunzai/pytorch-cifar100) -- -## Applications -* snoring detect[https://github.com/mywang44/snoring_net] - -## License -- linger is provided by the [Apache-2.0 license](LICENSE) diff --git a/doc/image/bmm_int.png b/doc/image/bmm_int.png deleted file mode 100644 index 0a9e8f4..0000000 Binary files a/doc/image/bmm_int.png and /dev/null differ diff --git a/doc/image/contact_me_qr.png b/doc/image/contact_me_qr.png deleted file mode 100644 index d5cbc10..0000000 Binary files a/doc/image/contact_me_qr.png and /dev/null differ diff --git a/doc/image/conv_fused_bn.png b/doc/image/conv_fused_bn.png deleted file mode 100644 index 5f0612d..0000000 Binary files a/doc/image/conv_fused_bn.png and /dev/null differ diff --git a/doc/image/gru_int.png b/doc/image/gru_int.png deleted file mode 100644 index d307883..0000000 Binary files a/doc/image/gru_int.png 
and /dev/null differ diff --git a/doc/image/gru_int_with_batch_length.png b/doc/image/gru_int_with_batch_length.png deleted file mode 100644 index cc589f0..0000000 Binary files a/doc/image/gru_int_with_batch_length.png and /dev/null differ diff --git a/doc/image/gru_int_with_batch_length_with_state.png b/doc/image/gru_int_with_batch_length_with_state.png deleted file mode 100644 index f674c64..0000000 Binary files a/doc/image/gru_int_with_batch_length_with_state.png and /dev/null differ diff --git a/doc/image/lnn_flow_path.png b/doc/image/lnn_flow_path.png deleted file mode 100644 index 5aea3a9..0000000 Binary files a/doc/image/lnn_flow_path.png and /dev/null differ diff --git a/doc/image/lstm_int.png b/doc/image/lstm_int.png deleted file mode 100644 index fcae38e..0000000 Binary files a/doc/image/lstm_int.png and /dev/null differ diff --git a/doc/image/lstm_int_with_batch_length.png b/doc/image/lstm_int_with_batch_length.png deleted file mode 100644 index 8e4d458..0000000 Binary files a/doc/image/lstm_int_with_batch_length.png and /dev/null differ diff --git a/doc/image/lstm_int_with_batch_length_and_state.png b/doc/image/lstm_int_with_batch_length_and_state.png deleted file mode 100644 index c080a7e..0000000 Binary files a/doc/image/lstm_int_with_batch_length_and_state.png and /dev/null differ diff --git a/doc/image/trace_layer_normailize_init.png b/doc/image/trace_layer_normailize_init.png deleted file mode 100644 index a8566c1..0000000 Binary files a/doc/image/trace_layer_normailize_init.png and /dev/null differ diff --git a/doc/tutorial/from_mode_to_onnx.md b/doc/tutorial/export_onnx.md similarity index 92% rename from doc/tutorial/from_mode_to_onnx.md rename to doc/tutorial/export_onnx.md index 835d34f..a631222 100644 --- a/doc/tutorial/from_mode_to_onnx.md +++ b/doc/tutorial/export_onnx.md @@ -29,19 +29,19 @@ torch.onnx.export(torch_model,dummy_input,"test.onnx") ``` ### 使用 linger 导出 onnx 
-如果调用`linger.init(...)`接口后,使用`torch.onnx.export`会被自动替换为`linger.onnx.export`进行调用,即`torch.onnx.export = linger.onnx.export` +如果调用`linger.init(...)`接口后,推荐使用`linger.onnx.export`进行调用; ```python import linger ..... linger.init(...) -torch.onnx.export(...) # 实际上调用的是 linger.onnx.export +linger.onnx.export(...) ``` ### 导出支持动态输入大小的图 ``` python -torch.onnx.export(torch_model, # model being run +linger.onnx.export(torch_model, # model being run x, # model input (or a tuple for multiple inputs) "super_resolution.onnx", # where to save the model (can be a file or file-like object) export_params=True, # store the trained parameter weights inside the model file @@ -89,7 +89,7 @@ torch_model = ... # set the model to inference mode torch_model.eval() dummy_input = torch.randn(1,3,244,244) -torch.onnx.export(torch_model,dummy_input,"test.onnx", +linger.onnx.export(torch_model,dummy_input,"test.onnx", opset_version=11,input_names=["input"],output_names=["output"],operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) ``` ### torch.no_grad() @@ -103,7 +103,7 @@ torch_model = ...
torch_model.eval() dummy_input = torch.randn(1,3,244,244) with torch.no_grad(): - torch.onnx.export(torch_model,dummy_input,"test.onnx", + linger.onnx.export(torch_model,dummy_input,"test.onnx", opset_version=11,input_names=["input"],output_names=["output"],operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) ``` `警告`:如果不使用`with torch.no_grad()`,则会报以下错误 diff --git a/doc/tutorial/get_started_for_two_stage.md b/doc/tutorial/get_started_for_two_stage.md deleted file mode 100644 index 9a289a0..0000000 --- a/doc/tutorial/get_started_for_two_stage.md +++ /dev/null @@ -1,50 +0,0 @@ - -使用linger进行量化训练极其方便,只需要`import linger`,再把 linger 相关设置加到合适的代码处,就可以实现一键量化训练。 -关于量化训练,我们推荐的主要将其分为两阶段: -- 第一阶段,浮点约束训练。使用linger的normalize相关接口对网络进行约束,将随机初始化的权重进行从0起步的浮点约束训练,将网络训练至收敛 -- 第二阶段,定点量化微调。使用linger的init相关接口,基于第一步中得到的浮点模型进行量化训练微调,得到相比于浮点模型的无损量化训练结果 - -## 第一阶段,浮点约束训练 -```python -# 得到初始 model -model = Model() - -# 使用linger进行浮点约束设置 -linger.trace_layers(model, model, dummy_input, fuse_bn=True) -linger.disable_normalize(model.last_layer) -type_modules = (nn.Conv2d) -normalize_modules = (nn.Conv2d,nn.Linear) -linger.normalize_module(model.mid_conv, type_modules = type_modules, normalize_weight_value=16, normalize_bias_value=16, normalize_output_value=16) -model = linger.normalize_layers(model, normalize_modules = normalize_modules, normalize_weight_value=8, normalize_bias_value=8, normalize_output_value=8) - -# 进行浮点网络约束训练 -# 训练结束,保存浮点网络 - -``` - -## 第二阶段,定点量化微调 - -```python -# 得到初始 model -model = Model() - -# 继承第一阶段的设置,不要进行任何改动 -linger.trace_layers(model, model, dummy_input, fuse_bn=True) -linger.disable_normalize(model.last_fc) -type_modules = (nn.Conv2d) -normalize_modules = (nn.Conv2d, nn.Linear) -linger.normalize_module(model.mid_conv, type_modules = type_modules, normalize_weight_value=16, normalize_bias_value=16, normalize_output_value=16) -model = linger.normalize_layers(model, normalize_modules = normalize_modules, normalize_weight_value=8, normalize_bias_value=8, 
normalize_output_value=8) - -# 添加linger量化训练设置 -linger.disable_quant(model.last_fc) -quant_modules = (nn.Conv2d, nn.Linear) -model = linger.init(model, quant_modules = quant_modules) - -# 加载第一阶段训练好的浮点约束网络参数到 model 里 -# 进行量化训练 - -# 达到无损后,使用 torch.onnx.export 将 model 导出 onnx,将该 onnx 交给后端引擎 thinker 进行处理 -with torch.no_grad(): - torch.onnx.export(model, dummy_input, "model.onnx", opset_version=12, input_names=["input"], output_names=["output"]) -``` \ No newline at end of file diff --git a/doc/tutorial/install.md b/doc/tutorial/install.md index 6642bda..cf8e879 100644 --- a/doc/tutorial/install.md +++ b/doc/tutorial/install.md @@ -5,7 +5,7 @@ ### 创建虚拟环境 ```Shell -conda create -n linger-env python==3.7.0 +conda create -n linger-env python==3.12.10 conda activate linger-env pip install -U pip cat requirements.txt | xargs -n 1 pip install @@ -30,6 +30,11 @@ conda remove -n xxx --all ## linger安装 三种方式,任选一种 +安装之前需要保证环境中: +* gcc版本推荐 12.2.0,最低要求8.5.0 +* cmake版本推荐3.29.0,最低要求3.20.1 +* nvcc版本需与torch版本及NVIDIA硬件驱动匹配 + ### 源码安装方式 ``` Shell git clone https://github.com/LISTENAI/linger.git @@ -133,8 +138,7 @@ $ docker container rm [containID] ## linger安装验证 ``` python -Python 3.7.3 (default, Jul 8 2020, 22:11:17) -[GCC 7.3.0] :: Anaconda, Inc. on linux +Python 3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:21:13) [GCC 13.3.0] on linux Type "help", "copyright", "credits" or "license" for more information. 
>>> import linger >>> diff --git a/doc/tutorial/install_bugs.md b/doc/tutorial/install_bugs.md deleted file mode 100644 index e47d52e..0000000 --- a/doc/tutorial/install_bugs.md +++ /dev/null @@ -1,61 +0,0 @@ -## 1、使用linger git源码编译报错 -确认下面版本是否正确load, 可通过which gcc \ which nvcc \ which cmake 来查看 -- module load gcc/5.4-os7 -- module load cuda/10.2-cudnn-7.6.5 -- module load cmake/3.17.3 - -## 2、安装成功 但import时报如下错 -`AttributeError: module 'google.protobuf.descriptor' has no attribute '_internal_create_key'` - -解决办法:确认protobuf和onnx 版本 符合安装文档上的版本3.8.0和1.7.0 - -## 3、安装时确认torch和torchvision版本匹配对应 - - -若安装时没有torchvision,安装linger时会自动安装最新版的torchvision,也就会把torch覆盖安装最新版 - -解决办法:出现这种情况时,把torch和torchvision都卸载掉,然后重装torch,再装下面对应版本的torchvision,最后重编linger - -``` -+--------------------------+--------------------------+---------------------------------+ -| ``torch`` | ``torchvision`` | ``python`` | -+==========================+==========================+=================================+ -| ``main`` / ``nightly`` | ``main`` / ``nightly`` | ``>=3.6``, ``<=3.9`` | -+--------------------------+--------------------------+---------------------------------+ -| ``1.10.0`` | ``0.11.1`` | ``>=3.6``, ``<=3.9`` | -+--------------------------+--------------------------+---------------------------------+ -| ``1.9.1`` | ``0.10.1`` | ``>=3.6``, ``<=3.9`` | -+--------------------------+--------------------------+---------------------------------+ -| ``1.9.0`` | ``0.10.0`` | ``>=3.6``, ``<=3.9`` | -+--------------------------+--------------------------+---------------------------------+ -| ``1.8.2`` | ``0.9.2`` | ``>=3.6``, ``<=3.9`` | -+--------------------------+--------------------------+---------------------------------+ -| ``1.8.1`` | ``0.9.1`` | ``>=3.6``, ``<=3.9`` | -+--------------------------+--------------------------+---------------------------------+ -| ``1.8.0`` | ``0.9.0`` | ``>=3.6``, ``<=3.9`` | -+--------------------------+--------------------------+---------------------------------+ -| 
``1.7.1`` | ``0.8.2`` | ``>=3.6``, ``<=3.9`` | -+--------------------------+--------------------------+---------------------------------+ -| ``1.7.0`` | ``0.8.1`` | ``>=3.6``, ``<=3.8`` | -+--------------------------+--------------------------+---------------------------------+ -| ``1.7.0`` | ``0.8.0`` | ``>=3.6``, ``<=3.8`` | -+--------------------------+--------------------------+---------------------------------+ -| ``1.6.0`` | ``0.7.0`` | ``>=3.6``, ``<=3.8`` | -+--------------------------+--------------------------+---------------------------------+ -| ``1.5.1`` | ``0.6.1`` | ``>=3.5``, ``<=3.8`` | -+--------------------------+--------------------------+---------------------------------+ -| ``1.5.0`` | ``0.6.0`` | ``>=3.5``, ``<=3.8`` | -+--------------------------+--------------------------+---------------------------------+ -| ``1.4.0`` | ``0.5.0`` | ``==2.7``, ``>=3.5``, ``<=3.8`` | -+--------------------------+--------------------------+---------------------------------+ -| ``1.3.1`` | ``0.4.2`` | ``==2.7``, ``>=3.5``, ``<=3.7`` | -+--------------------------+--------------------------+---------------------------------+ -| ``1.3.0`` | ``0.4.1`` | ``==2.7``, ``>=3.5``, ``<=3.7`` | -+--------------------------+--------------------------+---------------------------------+ -| ``1.2.0`` | ``0.4.0`` | ``==2.7``, ``>=3.5``, ``<=3.7`` | -+--------------------------+--------------------------+---------------------------------+ -| ``1.1.0`` | ``0.3.0`` | ``==2.7``, ``>=3.5``, ``<=3.7`` | -+--------------------------+--------------------------+---------------------------------+ -| ``<=1.0.1`` | ``0.2.2`` | ``==2.7``, ``>=3.5``, ``<=3.7`` | -+--------------------------+--------------------------+---------------------------------+ -``` \ No newline at end of file diff --git a/doc/tutorial/install_errors.md b/doc/tutorial/install_errors.md new file mode 100644 index 0000000..1801e32 --- /dev/null +++ b/doc/tutorial/install_errors.md @@ -0,0 +1,15 @@ + +# 推荐环境安装 +| 
Python | PyTorch | torchvision | CUDA Runtime | CUDA Toolkit | nvcc | GCC | NumPy | ONNX | onnxruntime | +| ----------- | ---------- | ----------- | ------------ | ------------ | ---- | ---- | ------ | ------ | ----------- | +| **3.8** | **1.9.1** | 0.10.1 | 11.1 | 11.1 | 11.1 | 7.5 | 1.19.x | 1.10.x | 1.9.x | +| **3.9** | **1.12.1** | 0.13.1 | 11.6 | 11.6 | 11.6 | 8.4 | 1.21.x | 1.12.x | 1.13.x | +| **3.10** ⭐ | **2.0.1** | 0.15.2 | 11.7 | 11.7 | 11.7 | 9.4 | 1.23.5 | 1.14.1 | 1.15.1 | +| **3.11** | **2.3.1** | 0.18.1 | 12.1 | 12.1 | 12.1 | 11.3 | 1.26.x | 1.15.x | 1.17.x | +| **3.12** ⚠️ | **2.6.0** | 0.21.0 | 12.4 | 12.4 | 12.4 | 12.2 | 2.0.x | 1.16.x | 1.18.x | + + +# 常见错误及解决方案 +* error while loading shared libraries: libmpfr.so.6: cannot open shared object file: No such file or directory +* cp libmpfr.so /home4/listenai/miniconda3/envs/linger3.0/lib +* export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home4/listenai/miniconda3/envs/linger3.0/lib diff --git a/doc/tutorial/quant_advanced_guide.md b/doc/tutorial/quant_advanced_guide.md new file mode 100644 index 0000000..2e86057 --- /dev/null +++ b/doc/tutorial/quant_advanced_guide.md @@ -0,0 +1,58 @@ +## 校准(PTQ)使用方法 +* 因为校准时会默认按照weight的clip配置进行weight的初始化,故暂不支持循环多组数据校准(仅支持一轮输入校准) +* 校准时会创建add、bmm等小算子的module +```python + @linger.register_calibrate_method('custom_calibration') + def test_init(self, tensor): + with torch.no_grad(): + self.learning_data.fill_(torch.tensor(-999)) + self.scale.fill_(torch.tensor(-999)) + self.is_calibrate.fill_(True) + + + with linger.calibration(a_calibrate_name="custom_calibration", w_calibrate_name="custom_calibration"): + model = linger.init(model) + model(torch.load("/yrfs4/inference/sqtu2/LLM/code/linger3.0/my_linger/calibrate_input.pt")) +``` + +## linger.init/constrain中'disable_module'使用方法 +* 关闭model中某一类算子的量化或者约束训练; +* 'disable_module = (nn.LSTM, nn.Linear)' +* 量化示例:model = linger.init(model, config_file = 'cfg.yaml', disable_module = disable_module) +* 约束示例:model = 
linger.constrain(model, config_file = 'cfg.yaml', disable_module = disable_module) + +## linger.init/constrain中'disable_submodel'使用方法 +* 关闭model中某一层或者某一个模块的量化或者约束训练; +* disable_submodel = ('module_name*', )这些module的量化 +* 量化示例:model = linger.init(model, config_file = 'cfg.yaml', disable_submodel=disable_submodel)使用, classifier为model下一级module名称,*表示匹配当前级及其子集module +* 约束示例:model = linger.constrain(model, config_file = 'cfg.yaml', disable_submodel=disable_submodel) + +## linger.init中可通过yaml文件加载配置,当前配置可通过linger.config_save_to_yaml保存 +## config.yaml 介绍 +* 基础配置 + calibration: false # 校准开关 + clamp_info: # 约束信息配置 + clamp_activation_value: 8 # 激活约束浮点值,8代表约束到[-8, 8] + clamp_bias_value: null # bias约束浮点值,默认值为None + clamp_factor_value: 7 # weight动态约束参数,默认值为7,代表约束到weight.abs().mean() * 7 + clamp_weight_value: null # weight静态约束值,默认值为None + device: cuda + dtype: torch.float32 # 默认浮点数据类型 + open_quant: true + platform: venusA # 平台设置,目前支持arcs, mars, venusA + quant_info: #量化信息配置 + a_calibrate_name: top_10 # 激活校准方法,默认top_10,一般不需要修改 + a_strategy: RANGE_MEAN # 激活量化原理 + activate_bits: 8 # 激活数据位宽,默认8bit,支持8bit,16bit,32bit + activation_type: none # 激活类型,默认None + bias_bits: 32 # bias数据位宽,默认32bit + is_perchannel: false # perchannel量化开关,暂不支持此功能 + is_symmetry: true # 对称/非对称量化开关,目前只支持对称量化 + qat_method: MOM # QAT量化方案,支持MOM和TQT + round_mode: floor_add # 舍入方式,默认支持floor+0.5 + w_calibrate_name: abs_max # 权重校准原理,默认abs_max,一般不需要修改 + w_strategy: RANGE_MEAN # 权重量化原理 + weight_bits: 8 # 权重位宽,默认8bit,支持4bit,8bit + quant_method: NATIVE # 量化/伪量化方式,默认NATIVE,支持NATIVE、CUDA、ONNX + seed: 42 + \ No newline at end of file diff --git a/doc/tutorial/quant_faq.md b/doc/tutorial/quant_faq.md index c6d850d..e5965d9 100644 --- a/doc/tutorial/quant_faq.md +++ b/doc/tutorial/quant_faq.md @@ -1,67 +1,11 @@ ## 通用注意事项 1. 
两阶段训练: - 浮点阶段学习率通常设高一点,定点阶段学习率设低一点 -- 浮点clamp的clamp_modules设置要和定点训练时的quant_modules保持一致,一般默认浮点做了clamp的话,定点的这层也会相应做量化 -- 浮点clamp阶段,一般情况下初始loss有个明显的上升,之后再回落收敛到基线,证明clamp起到了限制作用,并且网络开始学到东西;定点阶段时,保证初始loss与浮点基线loss相差不大,且很快收敛到基线。若初始loss超出太多的话,很难收敛到基线的loss结果,即使收敛也相当于在量化阶段重新从头训练的,浮点学到的权重分布会被打乱,在某些回归任务上这样loss即使正常,实际测试效果也不会很好 -2. trace_layer 只支持使用一次,多次会导致前面被覆盖,hook被清空,导致加载的融合参数 -3. normalize_layer和 init 中 replace_tuple 是需要一一对应的,disable_normalize 和 disable_normalize 接口也是需要匹配调用的 -4. 导图: -- 导onnx图时,需要保证 linger 的所有设置和训练时完全一致 -- torch.onnx.export中,将opset_version=12, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK 这两个选项固定下来,可避免很多报错(忘加aten选项,会报 bad node clip的错误) -5. 训练过程中不要使用 with_no_grad() 方式 linger不支持这样使用,梯度反向时会报错 -6. linger 不支持混合精度量化训练,只支持全浮点网络做定点训练 - ## 常见问题定位与解决 -1. 导出 onnx 提示 running_x > 0 assert报错,首先确认init确实有训练,或者加载的量化模型中确实有量化参数,其次网络中有一部分并没有在forward中使用 -2. 导出 onnx 提示 scale_o warning, 打开 onnx 确认对应 id 是否 dequant 添加正确及其中属性是否传递正确 -3. torch 1.6 版本后,导出onnx前需要with torch.no_grad(), 如果不使用with torch.no_grad(),则会报以下错误 -RuntimeError: isDifferentiableType(variable.scalar_type()) INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/autograd/functions/utils.h":59, please report a bug to PyTorch. -4. 如果warning提示为iqcat eval module has iqcat layer while do not match training module 的话,该问题比较复杂而且warning不影响运行,仅在导出onnx时有不同处理,一致性影响不大的话可以忽略。可以查看保存的量化pt文件中是否有iqcat的对应scale参数,如有 则可以在加载pt之前走一遍前向,将网络前向固定住,这样会自动去加载对应的iqcat参数,否则导出onnx时此处会变成浮点op -5. 如果 loss 变 nan,减小学习速率或者增大 batch_size,还有判断是否有量化输出为0值,导致某些op(如sqrt\atan2等)后面反向挂掉了 -6. 在导出onnx后,不要再继续训练,此时权重等参数都变成定点值了,继续训练会报错。iqtensor.py中 zero encountered in true_divide -7. leakyrelu 不支持量化,但导出op后未添加quant节点,由于打开了inplace=True选项,导致直接对tensor做修改,继续当成了IQtensor 直传下去,导图时就会报错 -8. 发现导图的onnx中scale与实际eval定位的scale值不匹配,可能由于导图前在train模式下走了前向,重新统计了running值,导致导图不一致 -9. 1.9.0官方文档说明,输入的最后一个参数不能是dict 类型,不然导出onnx时,此输入强制会变成空,需要在torch.onnx.export 后改成(x, meta, { })输入才行 -10. iqsigmoid使用及print显示问题:nn.sigmoid() 不会直接替换,同torch.sigmoid一样,走遍前向再print 才是iqsigmoidlayer() -11. 
报错 "ConvFunction BAckward" object has no attribute 'clamp_data',eval 模式下走backward的原因 -12. torch版本不同,导出的op也可能会有变化 -13. linger只会识别 nn.Avgpool2d()的写法,其他Adapt_avgpool的用法,会走浮点逻辑,Maxpool 同理 -14. 对lstm做量化时,之前和之后的pack_padded_sequence和pad_packed_sequence 需要写全,不能用from torch.nn.utils.rnn import pack_padded_sequence,pad_packed_sequence 使用 - 直接写全torch.nn.utils.rnn.pack_padded_sequence(_,_,_,_) torch.nn.utils.rnn.pad_packed_sequence(_,_,_,_),不然linger替换不了函数指针,会导致运行出错 -15. 量化时学习率不能设太大,要在浮点最后训练的学习率基础上,再降一个数量级为好,不然容易出现loss爆炸变很大的情况 -16. 定点训练测试时,不能只跑几个batch就去走eval测试,这时的running值还没有统计正确,会导致结果很差,推荐跑完100个batch后再做测试 -17. 量化训练时加载原始浮点的checkpoint时,load_state_dict中strict参数最好不要设为false,这个选项会忽略其中key不对应的情况,可能加载的参数就没对上,后面效果自然不行 -18. linger不支持其中某一部分的layer不做反向(requird_grad = False)会报如下错: - one of the differentialted tensors does not require grad 。。。。 -19. onnx导出时存在 过slice 算子之后量化tensor变为浮点tensor的情况,可能是由于slice之前的op 存在未识别/不支持量化的op,常见的情况为transpose的torch写法问题,具体情况需具体分析 -20. 报错 Invoked with :Tensor= onnx::Constant(), 'value', [4,3,2,1,], (Occurred when translating LSTMPIntFunction) - batch_length的问题,导图时需要将其改为torch.tensor,否则还为原始的list 就会导致报错 -21. 量化初始效果 loss 上升很多,首先确认是否加载checkpoint正确,尤其是设置了trace_layer后容易出现这个问题。定位:可以关闭init 仅打开trace_layer 查看loss是否与浮点相同 - 若相同说明加载checkpoint正确,否则需要重新对齐浮点checkpint的key。由于trace_layer后很多key被修改,所以以往的那种for循环遍历的加载方式会导致加载错误 - 推荐:可以在原始浮点用原始的方式加载完checkpoints之后,直接save一个state_dict下来,然后量化的时候,直接load保存的这个state_dict,就不用 for 循环遍历对比再加载了 -22. 测试时报错 AttributeError: Can't pickle local object 'trace_layers..pre_hook' - 验证为多进程运行原因,改为单进程即可,一般推荐使用单卡单进程做测试,避免大部分问题 -23. 量化loss不下降,确认使用的optimizer是否关联到正确的parameter上了,可能未更新权重参数,可以保存几组不同的pt数据,查看权重更新情况 -24. 针对最终导出的onnx中 iqcat iqadd iqsigmoid,两边scale差距太大等情况,linger关闭iqtensor的相关设置 linger.SetIQTensorCat(False) -25. 针对直接量化效果理想,但实测在某些场景中不理想,可以单独统计一下这组数据的running和scale值(只跑前向,不backward),看看和最终的训练模型差距大否 -26. batchnorm加入量化,loss上升非常明显,先确认是否是由于fuse_bn的原因导致,其次由于bn量化是把 (x-mean)/ var * alpha + belta 转化为 (alpha/var) x + (belta - mean/var),定位一下 (alpha/var) 的值是否太极端 -27. 
关闭batchnorm层running_mean\running_var的更新,置momentum为0 -主要用于浮点fine-tune或定点量化时 不想更新bn层的running_mean/var的值,因为有时浮点已经在大数据集上统计好了对应参数,fine-tune或量化时一般只过一个小数据集即可,这样可能会打乱浮点好不容易训好的分布,导致效果变差。 -主要用法如下: -```python -linger.SetBnMomentumUpdate(disable = True) -``` -此语句需加在 trace_layer 之后,normalize设置及量化init之前,推荐添加完在intx的所有设置后print(model) 查看bn层的momentum是否已置0,此接口会将normalizebn、NormalizeConvBN1d、normalizeconvbn2d、bnint等op的momentum置零,保证其不更新running_mean、running_var,如果bn层不做normalize、不做量化,此设置对其不起作用 - -28. 一些常见的函数量化选项 (默认为开启,可通过以下设置关闭) -``` -# 关闭 + 的量化,网络不会再出现iqadd的量化op -linger.SetIQTensorAdd(False) -# 类似的还有 'SetIQTensorAdd' 'SetIQTensorClamp','SetIQTensorCat', 'SetIQTensorSigmoid', 'SetIQTensorTanh','SetIQTensorDiv', 'SetIQTensorMul', 'SetIQTensorSum' -linger.SetFunctionBmmQuant(True) #此选项默认为关闭,控制torch.bmm的量化与否 -linger.SetBnMomentumUpdate(True) #此选项默认为关闭,控制bn层running_mean/running_var的值分布不更新 -``` +1. 环境安装问题 + 问题表现:error while loading shared libraries: libmpfr.so.6: cannot open shared object file: No such file or directory + 解决方案:将系统中libmpfr.so.6拷贝到环境中,例如:cp libmpfr.so.6 /home4/listenai/miniconda3/envs/linger3.0/lib,并设置 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home4/listenai/miniconda3/envs/linger3.0/lib ## 网络搭建推荐 1. 
搭建网络时,若nn.Linear层后有bn层直连,推荐将此linear层改为等效的1*1的卷积,这样在最后推理实现时可以完成conv-bn的融合,加快推理效率 diff --git a/doc/tutorial/quant_quick_strat.md b/doc/tutorial/quant_quick_strat.md new file mode 100644 index 0000000..c9705ab --- /dev/null +++ b/doc/tutorial/quant_quick_strat.md @@ -0,0 +1,69 @@ + +使用linger进行量化训练极其方便,只需要`import linger`,再把 linger 相关设置加到合适的代码处,就可以实现一键量化训练。 +关于量化训练,我们推荐的主要将其分为两阶段: +- 第一阶段,浮点约束训练。使用linger的constrain相关接口对网络进行约束,可以基于浮点训练好的模型也可以从头开始训练,将网络训练至收敛 +- 第二阶段,定点量化微调。使用linger的init相关接口,基于第一步中得到的浮点约束模型进行量化训练微调,得到相比于浮点模型的无损量化训练结果 + +## 第一阶段,浮点约束训练 +```python +# 得到初始 model +model = Model() + +# 使用linger进行浮点约束设置 +# 配置约束相关参数 +linger.config_save_to_yaml('./config.yaml') # 获取默认配置 +# 修改 config.yaml 改变模型配置 +config_file = './config.yaml' +model = linger.constrain(model, config_file = config_file) #默认进行静态约束,weight和activation约束到[-8, 8] +# 进行浮点网络约束训练 +# 训练结束,保存浮点网络 + +``` + +## 第二阶段,定点量化微调 + +```python +# 得到初始 model +model = Model() +# 添加linger量化训练设置 +# 修改 config.yaml 改变模型配置 +config_file = './config.yaml' +model = linger.init(model, config_file = config_file) #快速进行量化训练,quant_config默认配置即可,高阶配置参考'量化进阶指导' + +# 加载第一阶段训练好的浮点约束网络参数到 model 里 +# 进行量化训练 + +# 达到无损后,使用 torch.onnx.export 将 model 导出 onnx,将该 onnx 交给后端引擎 thinker 进行处理 +with torch.no_grad(): + linger.onnx.export(model, dummy_input, "model.onnx", opset_version=12, input_names=["input"], output_names=["output"]) +``` + + +## config.yaml 介绍 +* 基础配置 + calibration: false # 校准开关 + clamp_info: # 约束信息配置 + clamp_activation_value: 8 # 激活约束浮点值,8代表约束到[-8, 8] + clamp_bias_value: null # bias约束浮点值,默认值为None + clamp_factor_value: 7 # weight动态约束参数,默认值为7,代表约束到weight.abs().mean() * 7 + clamp_weight_value: null # weight静态约束值,默认值为None + device: cuda + dtype: torch.float32 # 默认浮点数据类型 + open_quant: true + platform: venusA # 平台设置,目前支持arcs, mars, venusA + quant_info: #量化信息配置 + a_calibrate_name: top_10 # 激活校准方法,默认top_10,一般不需要修改 + a_strategy: RANGE_MEAN # 激活量化原理 + activate_bits: 8 # 激活数据位宽,默认8bit,支持8bit,16bit,32bit + activation_type: none # 激活类型,默认None + 
bias_bits: 32 # bias数据位宽,默认32bit + is_perchannel: false # perchannel量化开关,暂不支持此功能 + is_symmetry: true # 对称/非对称量化开关,目前只支持对称量化 + qat_method: MOM # QAT量化方案,支持MOM和TQT + round_mode: floor_add # 舍入方式,默认支持floor+0.5 + w_calibrate_name: abs_max # 权重校准原理,默认abs_max,一般不需要修改 + w_strategy: RANGE_MEAN # 权重量化原理 + weight_bits: 8 # 权重位宽,默认8bit,支持4bit,8bit + quant_method: NATIVE # 量化/伪量化方式,默认NATIVE,支持NATIVE、CUDA、ONNX + seed: 42 + diff --git a/doc/tutorial/release.md b/doc/tutorial/release.md index e65f1d2..b6e2d54 100644 --- a/doc/tutorial/release.md +++ b/doc/tutorial/release.md @@ -1,3 +1,5 @@ -v1.1.1 2023.8.15 避免重复,升级版本号生成pypi包 -v1.1.0 2023.8.15 修复layernorm量化算子四舍五入问题;调整参数clmap限制;其它bug修复 -V1.0.0 2022.10.24 初始版本 \ No newline at end of file +# V3.0.0 2025.12.01 linger3.0初始版本 +# V3.0.1 2025.12.15 +## 修改记录: +* 解决ConvBN融合问题; +* 解决部分算子onnx导图问题; \ No newline at end of file diff --git a/doc/tutorial/support_quant_ops.md b/doc/tutorial/support_quant_ops.md index b6c7bb5..b6ea747 100644 --- a/doc/tutorial/support_quant_ops.md +++ b/doc/tutorial/support_quant_ops.md @@ -3,1125 +3,35 @@ | PyTorch(float32) | linger算子名称 | linger导出onnx算子名称 | 支持关闭的设置 | | ------------------ | ----------------------------------------- | --------------------------------------------------- | ---------------------------------- | -| nn.BatchNorm2d | [BatchNorm2dInt](#batchnorm2dint) | BatchNorm2dInt | - | -| nn.LayerNorm2d | [LayerNorm2dInt](#layernorm2dint) | LayerNorm2dInt | - | -| nn.Linear | [LinearInt](#linearint) | LinearInt | - | -| nn.Conv1d | [Conv1dInt](#conv1dint) | Conv1dInt | - | -| nn.Conv2d | [Conv2dInt](#conv2dint) | Conv2dInt | - | -| nn.ConvTranspose2d | [ConvTranspose2dInt](#convtranspose2dint) | ConvTranspose2dInt | - | -| nn.AvgPool2d | [AvgPool2dInt](#avgpool2dint) | AvgPool2dInt | - | -| nn.MaxPool2d | [iqMaxPool2d](#iqMaxPool2d) | MaxPool2d | - | -| nn.GRU | [GRUInt](#gruint) | GRUInt/GRUInt_Is8_Is64/GRUInt_Is8_Is64_If32 | - | -| nn.LSTM | [LSTMInt](#lstmint) | 
LSTMInt/LSTMInt_Is8_Is64/LSTMInt_Is8_Is64_If32_If32 | - | -| nn.Relu | [iqRelu](#relu) | Relu | - | -| nn.RELU6 | [ReLU6Int](#reLU6Int) | Clip | - | -| torch.bmm | [BmmInt](#bmmint) | BmmInt | - | -| torch.sigmoid | [iqSigmoid](#iqsigmoid) | iqSigmoid | `linger.SetIQTensorSigmoid(False)` | -| torch.tanh | [iqTanh](#iqtanh) | iqTanh | `linger.SetIQTensorTanh(False)` | -| torch.clamp | [iqClamp](#iqclamp) | iqClamp | `linger.SetIQTensorClamp(False)` | -| torch.cat | [iqCat](#iqcat) | iqCat | `linger.SetIQTensorCat(False)` | -| torch.transpose | [iqTranspose](#iqtranspose) | Transpose | - | -| view | [iqView](#iqview) | Reshape | - | -| reshape | [iqReshape](#iqreshape) | Reshape | - | -| squeeze | [iqSqueeze](#iqsqueeze) | Squeeze | - | -| unsqueeze | [iqUnsqueeze](#iqunsqueeze) | Unsqueeze | - | -| flatten | [iqFlatten](#iqFlatten) | Flatten | - | -| split | - | - | - | -| slice | [slice](#slice) | Slice | - | -| sum | [iqSum](#iqSum) | iqSum | `linger.SetIQTensorSum(False)` | -| add | [iqAdd](#iqadd) | iqAdd | `linger.SetIQTensorAdd(False)` | -| sub | - | - | - | -| mul | [iqMul](#iqmul) | iqMul | `linger.SetIQTensorMul(False)` | -| div | [iqDiv](#iqDiv) | iqDiv | `linger.SetIQTensorDiv(False)` | -| upsample | - | - | - | -| nn.Embedding | [EmbeddingInt](#EmbeddingInt) | Gather | - | -| quant | [quant](#quant) | Quant | - | -| dequant | [dequant](#dequant) | Dequant | - | -| requant | [Requant](#Requant) | Requant | - | -| layernorm | [LayerNormInt](#LayerNormInt) | LayerNormInt | - | -| softmax | [SoftmaxInt](#SoftmaxInt) | SoftmaxInt | - | -| logsoftmax | [LogSoftmaxInt](#LogSoftmaxInt) | LogSoftmaxInt | - | -| flip | [iqFlip](#iqFlip) | Slice | - | -| var | [iqVar](#iqVar) | iqVar | - | -| - | [channel_shuffle](#channel_shuffle) | ShuffleChannel | `SetFunctionChannelShuffleQuant(False)`| - ------------- -# Operator 命名规则 -`(MajorName)[_Inputs_Outputs]` -## MajorName 必需 -Operator 主名字,名字内部不允许有符号,仅英文和数字。例如iqMul - --------------- -# 术语说明 -## 量化方式 
-在部分op的属性中有platform_quant属性,标识平台相关量化方法,说明如下: - -- luna_quant: castor全量化方式(int8->int8),针对castor硬件量化,浮点到定点round采用的(x+0.5).floor()计算 - -$$(\lfloor x\_int*\frac{scale\_z}{scale\_x}+0.5\rfloor+\lfloor y\_int*\frac{scale\_z}{scale\_y}+0.5\rfloor).int().clamp(-128,127)$$ - - - -## Scale说明 -- scale_i: Input的scale, scale_x,scale_1,scale_2, scale_y同理, bits可以取8,16等 - -$$\frac{2^{bits-1}-1}{running\_i}$$ - -- scale_w: Weight的scale, scale_iw, scale_hw同理, bits可以取8,16等 - -$$\frac{2^{bits-1}-1}{weight.abs().max()}$$ - -- scale_o: Output的scale, bits可以取8,16等 - -$$\frac{2^{bits-1}-1}{running\_o}$$ - - -## Mode参数值 -- mode: 所有模式下的device信息 - -## onnx类型值 - -### 类型说明 -| Group | Types | Description | -| ---------------------- | ------------------------- | ------------------------------------------------------------------------------------ | -| Floating Point Types | FLOAT16, FLOAT32, FLOAT64 | Values adhering to the IEEE 754-2008 standard representation of floating-point data. | -| Signed Integer Types | INT8, INT16, INT32, INT64 | Signed integers are supported for 8-64 bit widths. | -| Unsigned Integer Types | UINT8, UINT16 | Unsigned integers of 8 or 16 bits are supported. | -| Complex Types | COMPLEX64, COMPLEX128 | A complex number with either 32- or 64-bit real and imaginary parts. | -| Other | STRING | Strings represent textual data. All strings are encoded using UTF-8. | -| Other | BOOL | Boolean values represent data with only two values, typically true and false. 
| - -### 类型和值 - -| 类型 | 值 | -| ---------- | --- | -| UNDEFINED | 0 | -| FLOAT32 | 1 | -| UINT8 | 2 | -| INT8 | 3 | -| UINT16 | 4 | -| INT16 | 5 | -| INT32 | 6 | -| INT64 | 7 | -| STR | 8 | -| BOOL | 9 | -| FLOAT16 | 10 | -| UINT32 | 12 | -| UINT64 | 13 | -| COMPLEX64 | 14 | -| COMPLEX128 | 15 | -| BFLOAT16 | 16 | ------- - -# iqAdd - -量化数据加法,由linger导出 -### Inputs -- x:T,第1个操作tensor -- y:T,第2个操作tensor -### Outputs -- o:T,结果 -### Attr -- scale_x:float,required,x的scale -- scale_y:float,required,y的scale -- scale_o:float,required,输出值o的scale -- platform_quant:string,required,支持包括luna_quant,默认为luna_quant -- mode: string, required - -### Type Constraints --T:int8,int16,int32 - - ------------- - -# iqMul - -- 量化数据乘法 -- linger导出 - -### Inputs -- x:T,第1个操作tensor -- y:T,第2个操作tensor -### Outputs -- o:T,乘法结果 -### Attr -- scale_x:float,required,x的scale -- scale_y:float,required,y的scale -- scale_o:float,required,输出值o的scale -- platform_quant:string,required,支持包括luna_quant,默认为luna_quant - -### Type Constraints --T:tensor(int8),tensor(int16),tensor(int32) - ---------- -# iqDiv -- linger导出 - -### Inputs - -- x:T,输入tensor - -### Outputs - -- y:T,输出tensor - - -### Attr -- input (Tensor) – the dividend -- other (Tensor or Number) – the divisor - -### Type Constraints - -- T: tensor(int8) - ---------- -# EmbeddingInt -- linger导出 - -### Inputs - -- x:T,(∗),任意形状的IntTensor或LongTensor,包含要提取的指数。 - -### Outputs - -- y:T, (\*, H), 其中\*是输入形状,H=embedding_dim - -### Attr -- num_embeddings (int): 嵌入字典的大小 -- embedding_dim (int): 每个嵌入向量的大小 -- padding_idx (int, optional): 如果指定,padding_idx处的条目不会对梯度做出贡献;因此,padding_idx处的嵌入向量在训练期间不会被更新,也就是说,它仍然是一个固定的 "垫"。对于一个新构建的Embedding,padding_idx处的嵌入向量将默认为全零,但可以更新为另一个值,作为填充向量使用。 -- max_norm (float, optional): 如果给定,每一个规范大于max_norm的嵌入向量都会被重新规范化为max_norm的规范。 -- norm_type (float, optional): 为max_norm选项计算的p-norm的p。默认为2。 -- scale_grad_by_freq (boolean, optional): 如果给定,这将通过迷你批次中单词频率的倒数来扩展梯度。默认为假。 -- sparse (bool, optional): 如果为真,梯度与权重矩阵将是一个稀疏张量。 -- data_bits: 
int,required,输入数据bit数,当前仅仅支持8 -- scale_x: float,required,输入tensor的scale -- scale_o: float,required,输出tensor的scale -- o_bits: 输出bit数,如果没有该属性,意味着float -- platform_quant: string,required,支持luna_quant -### Type Constraints - -- T: tensor(int8) - ---------- - - - -# iqCat - -- tensor cat 操作 -- linger导出 - -### Inputs(1 - ∞) - -- x0:T,第0个tensor -- x1:T,第1个tensor -- x2:T,第2个tensor -- ...**** - -### Outputs - -- o:T,concat输出tensor -- linger导出 - -### Attr - -`个数与inputs相同` -- scale_x_0:float,required,第0个tensor的scale -- scale_x_1:float,required,第1个tensor的scale -- scale_x_2:float,required,第2个tensor的scale -- ... -- dim:int,required,concat的轴,取值[-r, r-1],其中 r = rank(inputs) -- scale_o:float,required,concat后o的tensor -- platform_quant:string,required,平台量化配置,支持包括luna_quant,默认为luna_quant - -### Type Constraints -- T:int8 - ---------- -# iqtranspose -- 矩阵转置 - -### Inputs -- input: 输入tensor -- dim0:input需要转置的维度 -- dim1:input需要转置的维度 - -### venus limits -tranpose输入不支持4维,2维转置数据大小无限制,3维转置数据大小有限制(假设输入为CHW,数据位宽为data_bytes,限制条件如下所示) -转置组合 硬件限制 -(0,2,1) (W*H) * data_bytes <= 64KB -(2,0,1) (W*H) * data_bytes <= 64KB -(2,1,0) (W*C) * data_bytes <= 64KB -(1,2,0) (W*C) * data_bytes <= 64KB -(1,0,2) (W*H) * data_bytes <= 64KB - - -### Outputs -- 转置后的矩阵 - -### Attr - -### Type Constraints - ---------- - -# iqview -- linger导出 - -### Inputs - -- x:T,输入tensor - -### Outputs - -- y:T,输出tensor - - -### Attr -- shape (torch.Size or int...): 希望转换得到的大小 - -### Type Constraints - -- T: tensor(int8) - ---------- - -# iqreshape -- linger导出 - -### Inputs - -- x:T,输入tensor - -### Outputs - -- y:T,输出tensor - - -### Attr -- input (Tensor): the tensor to be reshaped -- shape (tuple of python:ints): the new shape - -### Type Constraints - -- T: tensor(int8) - ---------- -# iqsqueeze -- linger导出 - -### Inputs - -- x:T,输入tensor - -### Outputs - -- y:T,输出tensor - - -### Attr -- input (Tensor): the input tensor. 
-- dim (int, optional): if given, the input will be squeezed only in this dimension - -### Type Constraints - -- T: tensor(int8) - ---------- -# iqunsqueeze -- linger导出 - -### Inputs - -- x:T,输入tensor - -### Outputs - -- y:T,输出tensor - - -### Attr -- input (Tensor): the input tensor. -- dim (int, optional): if given, the input will be unsqueezed only in this dimension - -### Type Constraints - -- T: tensor(int8) - ---------- -# iqFlatten -- linger导出 - -### Inputs - -- x:T,输入tensor - -### Outputs - -- y:T,输出tensor - - -### Attr -- input (Tensor) – the input tensor. -- start_dim (int) – the first dim to flatten -- end_dim (int) – the last dim to flatten - -### Type Constraints - -- T: tensor(int8) - ---------- -# iqSum -- linger导出 - -### Inputs - -- x:T,输入tensor - -### Outputs - -- y:T,输出tensor - - -### Type Constraints - -- T: tensor(int8) - ---------- - - - -# iqClamp - -- 数据截断 -- linger导出 - -### Inputs -- x:T,需要截断的tensor -### Outputs -- y:T,截断后的结果 -### Attr -- scale_x:float,required,输入x的scale -- scale_o:float,required,输出y的scale -- platform_quant:string,required,平台属性 -- min:float,required,clamp最小值 -- max:float,required,clamp最大值 - - -### Type Constraints --T:tensor(int8) - - ---------- - -# iqSigmoid -- 数据sigmoid激活 -- linger导出 -### Inputs -- x:T1,输入tensor - -### venus limits -Iqsigmoid只支持int16(Q11)输入,int16(Q15)输出 - - -### Outputs -- y:T2,sigmoid后的结果 - -### Attr -- scale_x:float,required,输入x的scale -- scale_o:float,required,输出y的scale -- platform_quant:string,required,平台属性 -### Type Constraints -- T1:tensor(int8) -- T2:tensor(uint8) - ---------- - -# iqTanh -- 数据sigmoid激活 -- linger导出 -### Inputs -- x:T1,输入tensor - -### venus limits -iqTanh只支持int16(Q11)输入,int16(Q15)输出 - - -### Outputs -- y:T2,tanh后的结果 - - ---------- - -# Relu - -- y = max(0, x) - - -### Inputs -- x:T,输入tensor - -### Outputs -- y:T,relu后的结果 - - -### Type Constraints -- T:tensor(int8),tensor(int32),tensor(float) - ---------- -# ReLU6Int - -- ReLU6Int导出为Clip算子,为标准的onnx节点,支持int8的输入输出 -- 
与clamp区别:clip有3个输入,1个输出,即min_thresh和max_thresh作为输入,clamp的min和max是属性 - -### Inputs -- x:T,输入数据tensor -- min_thresh:T,截断的最小值 -- max_thresh:T,截断的最大值 - -### Outputs -- y:T,截断后的输出tensor - -### Type Constraints -- T:tensor(int8), tensor(float) - --------- - -# AvgPool2dInt -- linger导出 -### Inputs -- x:T,格式(N x C x H x W),输入tensor -### Outputs -- y:T,格式(N x C x H x W),输出tensor - - -### Attr -- kernel_shape:int2,required,pool2d 的kernel大小 -- strides:int2,required,pool2d 的stride -- pads:int2,required,pool2d的pad大小 -- ceil_mode:bool,是否为ceil模式 -- data_bits:int,required,输入数据bit数,当前仅仅支持8 -- scale_x:float,required,输入tensor的scale -- scale_o:float,required,输出tensor的scale -- o_bits:输出bit数,如果没有该属性,意味着float -- platform_quant:string,required,支持luna_quant - - -### Type Constraints -- T: tensor(int8) - ---------- - -# iqMaxPool2d -- linger导出 - -### Inputs - -- x:T,格式(N x C x H x W),输入tensor - -### Outputs - -- y:T,格式(N x C x H x W),输出tensor - - -### Attr - -- kernel_size:int2,required,pool2d 的kernel大小 -- stride:int2,required,pool2d 的stride -- padding:int2,required,pool2d的pad大小 -- ceil_mode:bool,是否为ceil模式 -- dilation:默认为1 - -### Type Constraints - -- T: tensor(int8) - - ---------- - -# Conv2dInt - -- linger导出 - -### Inputs -- x:T1,格式(N X C X H X W),卷积的激活值 -- weight:T1,格式(M x C/group x kH x kW),M是feature maps数量,C是channels数量,kH和kW是feature map的高和长 -- bias:T2,optional,1D bias - -### venus limits -- kernel大小为1-5(kernel_h、kernel_w设置相互独立) -- stride大小为1/2/4(stride_h 、stride_w设置相互独立) -- pad大小为0-4(四个方向上的pad设置相互独立) -- 输入数据对齐后大小不超过64KB(channel按8字节对齐,w按照8*stride_w字节对齐,channel不能超过一定阈值(待定)) -- weight对齐后数据大小不超过32KB(非depthwise卷积的channel_out按2字节对齐,channel_in按8字节对齐。Depthwise卷积的channel_in按16字节对齐) - -- 输入数据和weight之间的组合 - - in_w >= weight_w && in_h >= weight_h - - weight_w >= stride_w && weight_h >= stride_h - - pad_h_up < weight_h && pad_h_down < weight_h - - pad_w_left < weight_w && pad_w_right < weight_w -- 输入数据和weight只支持int8,bias为32bit,输出支持int8/int16/int32 -- max_pool只支持输入输出都为int8 -- 
average_pool只支持输入int8,输出int8/int16 - -### Outputs -- o:T3,格式(N X C X H X W),卷积后的输出 -### Attr -- dilations:int or int2,required -- group:int,required,输入到输出的卷积块数 -- kernel_shape:int or int2,required,卷积核大小 -- pads:int or int2,required,两边pad 0的大小 -- strides:int or int2,required,卷积的stride -- scale_x:float,required,输入x的feature maps的scale -- scale_w:float,required,weight的scale -- scale_o:float,optional,输出o的scale,没有该属性意味着浮点输出 -- data_bits:int,required,x的量化bit数,比如8 表示8bit量化的 -- parameter_bits:int,required,weight的量化bit数,比如8 表示8bit量化的 -- o_bits:int,optional,输出的o的量化bit数,比如8 表示8bit量化的,没有该属性意味着浮点输出 -- platform_quant:string,平台属性,luna_quant, mlu_quant,gpu_quant, - `如果linger处设置platform_quant为mlu_quant/gpu_quant,则out_bits=None,onnx中则不会有o_bits属性` - - - -### Type Constraints -- T1:tensor(int8) -- T2:tensor(int16),tensor(int32),tensor(float) -- T3:tensor(int8),tensor(float) - ---------- - -# ConvTranspose2dInt - -- linger导出 - -### Inputs -- x:T1,格式(N x C x H x W),输入反卷积数据 -- weight:T1,格式(C x M/group x kH x kW),反卷积的weight,M是feature map数,C是通道数,kH和kW是feaaturemap的高和长 -- bias:T2,optional,1D bias - -### venus limits -deconv独有的限制 -- stride_h = 2, kernel_h = 2/3/4/5 -- stride_h = 4, kernel_h = 4/5 -- stirde_w = 2, kernel_w= 2/3/4/5 -- stride_w = 4, kernel_w = 4/5 - -### Outputs -- o:T3,格式(N x C x H x W),反卷积结果 - -### Attr -- dilations:int or int2,required -- group:int,required,输入到输出的反卷积块数 -- kernel_shape:int or int2,required,反卷积核大小 -- pads:int or int2,required,两边pad 0的大小,``dilation * (kernel_size - 1) - padding`` -- strides:int or int2,required,反卷积的stride -- output_padding:int or int2,required,反卷积输出的额外pad大小 -- scale_x:float,required,输入x的feature maps的scale -- scale_w:float,required,weight的scale -- scale_o:float,optional,输出o的scale,没有该属性意味着浮点输出 -- data_bits:int,required,x的量化bit数,比如8 表示8bit量化的 -- parameter_bits:int,required,weight的量化bit数,比如8 表示8bit量化的 -- o_bits:int,optional,输出的o的量化bit数,比如8 表示8bit量化的,没有该属性意味着浮点输出 -- platform_quant:string,平台属性,luna_quant, mlu_quant,gpu_quant, - 
`如果linger处设置platform_quant为mlu_quant/gpu_quant,则out_bits=None,onnx中则不会有o_bits属性` - - -### Type Constraints -- T1:tensor(int8) -- T2:tensor(int16),tensor(int32),tensor(float) -- T3:tensor(int8),tensor(float) - ---------- - -# BatchNorm2dInt - -- linger导出 -- 算法:`o = x * mul_w + add_b` - -### Inputs - -- x:T,格式(N x C X H x W),batchnorm的输入feature maps -- mul_w:T,batch_norm 化简后的乘法系数 -- add_b:T,batch_norm 化简后的加法系数 - -### Outputs -- o:T,格式(N x C X H x W),输出tensor -### Attr -- scale_mul_x:float,required,乘法操作的x的scale -- scale_mul_w:float,required,乘法操作的w的scale -- scale_mul_o:float,required,乘法输出的scale -- scale_add_b:float,required,加法的weight b的scale -- scale_add_o:float,required,输出o的scale -- data_bits:int,required,输入数据bit数 -- parameter_bits:int,required,默认为8 -- o_bits:int,required,输出bit数,也代表着中间乘法操作后(加法前)的中间运算bit数 - -### Type Constraints -- T:tensor(int8),tensor(int16),tensor(int32), - ---------- - -# LayerNorm2dInt -- linger导出 - -### Inputs -- x:T,格式(N x C X H x W),batchnorm的输入feature maps - -### Outputs -- o:T,格式(N x C X H x W),输出tensor -### Attr -- scale_mul_x:float,required,乘法操作的x的scale -- scale_mul_w:float,required,乘法操作的w的scale -- scale_mul_o:float,required,乘法输出的scale -- scale_add_b:float,required,加法的weight b的scale -- scale_add_o:float,required,输出o的scale -- data_bits:int,required,输入数据bit数 -- parameter_bits:int,required,默认为8 -- o_bits:int,required,输出bit数,也代表着中间乘法操作后(加法前)的中间运算bit数 - -### Type Constraints -- T:tensor(int8),tensor(int16),tensor(int32), - ---------- - -# LinearInt - -- linger导出 - -### Inputs -- x:T1,格式(B x M x K),输入全连接数据 -- weight:T1,格式(K x N),全连接的weight -- bias:T2,optional,1D bias - -### venus limits -linearint/matmul左边输入矩阵对齐后大小不超过64KB(假设左矩阵维度为M*N),右矩阵不做限制 -- 数据类型为8bit时,M按4字节对齐,N按8字节对齐。 -- 数据类型为16bit时,M按4字节对齐,N按2字节对齐。 -- 数据类型为32bit时,M按2字节对齐,N按2字节对齐。 - - -### Outputs -- o:T3,格式(B x N),全连接 - -### Attr -- scale_x:float,required,输入x的feature maps的scale -- scale_w:float,required,weight的scale -- scale_o:float,optional,输出o的scale,没有该属性意味着浮点输出 -- 
data_bits:int,required,x的量化bit数,比如8 表示8bit量化的 -- parameter_bits:int,required,weight的量化bit数,比如8 表示8bit量化的 -- o_bits:int,optional,输出的o的量化bit数,比如8 表示8bit量化的,没有该属性意味着浮点输出 -- platform_quant:string,luna_quant, mlu_quant,gpu_quant, - `如果linger处设置platform_quant为mlu_quant/gpu_quant,则out_bits=None,onnx中则不会有o_bits属性` - - -### Type Constraints -- T1:tensor(int8) -- T2:tensor(int16),tensor(int32),tensor(float) -- T3:tensor(int8),tensor(float) - - ---------- - -# LSTMInt - - -### Inputs -- x:T1,格式(B x T x D)或者(T x B x D),输入数据,B 是batch,T是time,D是input dim -- weight_ih:T1,输入连接的weight -- weight_hh:T1,hidden连接的weight -- bias_ih:T2,输入连接的weight后的bias -- bias_hh:T2,hidden连接的weight后的bias - -### Outputs -- output:有o_bits属性为T3,没有为T4 -- hidden_state:T4 -- cell_state:T4 - - -### Attr -- scale_i:float,required,输入数据scale -- scale_h:float,required,hidden scale -- scale_iw:float,required,weight_ih的scale -- scale_hw:float,required,weight_hh的scale -- scale_io:float,optional,输入矩阵计算(Wi*Xi+Bi)的输出量化scale -- scale_ho:float,optional,隐层矩阵计算(Wh*H+Bh)的输出量化scale -- scale_o:float,optional,如果o_bits没有,scale_o为空 -- o_bits:int,optional,觉得输出是否做量化 -- platform_quant:string,required,对应不同的硬件平台 -- data_bits:int,required,输入量化bit数 -- parameter_bits:int,required,weight_iw,weight_hw的数据位数 -- batch_first:int,required,1表明输入数据是否是B*T*D模式,0表明是T*B*D输入格式 -- dropout:float,required,对应标准lstm中的dropout操作,量化是全部为0,不做dropout操作 -- go_gorward:int,required,针对双向lstm的量化导出,1表示正向,0表示反向 -- num_layers:int,required,量化只支持num_layers=1 -- input_size:int,required,输入数据维度 -- hidden_size:int,required,隐层状态维度 -- table_len:int,optional,如果使用查表法,有此属性,表示表长度 -- sigmoid_bound:float,optional,sigmoid查表计算的查表边界 -- tanh_bound:float,optional,tanh查表计算的查表边界 - -### Type Constraints -- T1:tensor(int8) -- T2:tensor(int32) -- T3:tensor(int8) -- T4:tensor(float32) - - ![](../image/lstm_int.png) - ---------- - -# LSTMInt_Is8_Is64 - - -### Inputs -- x:T1,格式(B x T x D)或者(T x B x D),输入数据,B 是batch,T是time,D是input dim -- 
batch_seq:T5,输入x的长度tensor,表示成list为[T0,T1,...],T0表示输入x的第0个tensor的T的实际长度,list长度为B -- weight_ih:T1,输入连接的weight -- weight_hh:T1,hidden连接的weight -- bias_ih:T2,输入连接的weight后的bias -- bias_hh:T2,hidden连接的weight后的bias - -### Outputs -- output:有o_bits属性为T3,没有为T4 -- hidden_state:T4 -- cell_state:T4 - - -### Attr -- 和LSTMInt一致 - -### Type Constraints -- T1:tensor(int8) -- T2:tensor(int32) -- T3:tensor(int8) -- T4:tensor(float32) -- T5:tensor(int64) - - ![](../image/lstm_int_with_batch_length.png) - -------------------------- - -# LSTMInt_Is8_Is64_If32_If32 - - -### Inputs -- x:T1,格式(B x T x D)或者(T x B x D),输入数据,B 是batch,T是time,D是input dim -- batch_seq:T5,输入x的长度tensor,表示成list为[T0,T1,...],T0表示输入x的第0个tensor的T的实际长度,list长度为B -- hidden_state:T4,隐层单元输入 -- cell_state:T4,记忆单元输入 -- weight_ih:T1,输入连接的weight -- weight_hh:T1,hidden连接的weight -- bias_ih:T2,输入连接的weight后的bias -- bias_hh:T2,hidden连接的weight后的bias - -### Outputs -- output:有o_bits属性为T3,没有为T4 -- hidden_state:T4 -- cell_state:T4 - - -### Attr -- 和LSTMInt一致 - -### Type Constraints -- T1:tensor(int8) -- T2:tensor(int32) -- T3:tensor(int8) -- T4:tensor(float32) -- T5:tensor(int64) - - ![](../image/lstm_int_with_batch_length_and_state.png) - -------------------------- - -# GRUInt - - -### Inputs -- x:T1,格式(B x T x D)或(T x B x D),输入数据,B 是batch,T是time,D是input dim -- weight_ih:T1,输入连接的weight -- weight_hh:T1,hidden连接的weight -- bias_ih:T2,输入连接的weight后的bias -- bias_hh:T2,hidden连接的weight后的bias - -### Outputs -- output:T3 -- hidden:T4 - - -### Attr -- scale_i:float,required,输入数据scale -- scale_h:float,required,hidden scale -- scale_iw:float,required,weight_ih的scale -- scale_hw:float,required,weight_hh的scale -- scale_io:float,optional,输入矩阵计算(Wi*Xi+Bi)的输出量化scale -- scale_ho:float,optional,隐层矩阵计算(Wh*H+Bh)的输出量化scale -- scale_o:float,optional,如果o_bits没有,scale_o为空 -- o_bits:int,optional,觉得输出是否做量化 -- platform_quant:string,required -- data_bits:int,required -- parameter_bits:int,required,weight_iw,weight_hw的数据位数 -- 
batch_first:int,required,1表明输入数据是否是B*T*D模式,0表明是T*B*D输入格式 -- dropout:float,required,对应标准lstm中的dropout操作,量化是全部为0,不做dropout操作 -- go_gorward:int,required,针对双向lstm的量化导出,1表示正向,0表示反向 -- num_layers:int,required,量化只支持num_layers=1 -- input_size:int,required,输入数据维度 -- hidden_size:int,required,隐层状态维度 -- table_len:int,optional,如果使用查表法,有此属性,表示表长度 -- sigmoid_bound:float,optional,sigmoid查表计算的查表边界 -- tanh_bound:float,optional,tanh查表计算的查表边界 - -### Type Constraints -- T1:tensor(int8) -- T2:tensor(int32) -- T3:tensor(int8) -- T4:tensor(float32) -- T5:tensor(int64) - - ![](../image/gru_int.png) - ---------- - -# GRUInt_Is8_Is64 - - -### Inputs -- x:T1,格式(B x T x D)或(T x B x D),输入数据,B 是batch,T是time,D是input dim -- batch_seq:T5,输入x的长度tensor,表示成list为[T0,T1,...],T0表示输入x的第0个tensor的T的实际长度,list长度为B -- weight_ih:T1,输入连接的weight -- weight_hh:T1,hidden连接的weight -- bias_ih:T2,输入连接的weight后的bias -- bias_hh:T2,hidden连接的weight后的bias - -### Outputs -- output:T3 -- hidden:T4 - - -### Attr -- 与GRUInt一致 - -### Type Constraints -- T1:tensor(int8) -- T2:tensor(int32) -- T3:tensor(int8) -- T4:tensor(float32) -- T5:tensor(int64) - - ![](../image/gru_int_with_batch_length.png) - ---------- - -# GRUInt_Is8_Is64_If32 - - -### Inputs -- x:T1,格式(B x T x D)或(T x B x D),输入数据,B 是batch,T是time,D是input dim -- batch_seq:T5,输入x的长度tensor,表示成list为[T0,T1,...],T0表示输入x的第0个tensor的T的实际长度,list长度为B -- hidden_state:T4,隐层输入状态 -- weight_ih:T1,输入连接的weight -- weight_hh:T1,hidden连接的weight -- bias_ih:T2,输入连接的weight后的bias -- bias_hh:T2,hidden连接的weight后的bias - -### Outputs -- output:T3 -- hidden_state:T4 - - -### Attr -- 与GRUInt一致 - -### Type Constraints -- T1:tensor(int8) -- T2:tensor(int32) -- T3:tensor(int8) -- T4:tensor(float32) -- T5:tensor(int64) - - ![](../image/gru_int_with_batch_length_with_state.png) - ---------- - -# Quant -- Quant主要是用来实现浮点输出转下一层定点输入的连接 -- linger导出 - -### Inputs -- x:T1,要量化的float的tensor - -### Outputs -- y:T2,量化后的tensor - -### Attr -- data_bits:int,required,量化bit数,当前支持小于等于8 -- scale_x:float,required,量化的scale -- 
platform_quant:string,required,support luna_quant,其他方式策略一致 -### Type Constraints -- T1:tensor(float) -- T2:tensor(int8) - ---------- - -# Dequant -- Dequant主要是用来实现定点输出转下一层浮点输入的连接 -- linger导出 - -### Inputs -- x:T1,输入定点tensor - -### Outputs -- y:T2,输出浮点tensor - -### Attr -- scale_o:float,required,浮点转定点的scale - -### Type Constraints -- T1:tensor(int8) -- T2:tensor(float) - ---------- - -# BmmInt -- 用于torch.bmm的量化训练导出 -- linger导出 - -### Inputs -- x: T, 输入数据tensor, shape = (B*N*M) -- y: T, 输入数据tensor, shape = (B*M*P) - -### Outputs - -- outputs:有o_bits属性为T,否则为T1 - -### Attr -- data_bits:int,required,输入数据量化bit位 -- o_bits:int,optional,输出数据量化bit位 -- platform_quant:str, required 量化硬件平台参数 -- scale_x:float,required,输入x的量化scale -- scale_y:float,required,输入y的量化scale -- scale_o:float,optional,输出out的量化scale - -### Type Constraints -- T:tensor(int8) -- T1:tensor(float) - - ![](../image/bmm_int.png) - -# layernormint -- linger导出 - -### Inputs - -- x:T,输入:(N,*) - -### Outputs - -- y:T,输出, (N, *) (与输入相同的形状) - -### Attr -- normalized_shape (int or list or torch.Size)- -预期输入的大小形状 -[\∗ × normalized_shape[0] × normalized_shape[1] ×...× normalized_shape[-1]] -如果使用一个整数,它将被视为一个单子列表,本模块将在最后一个维度上进行归一化处理,该维度预计为该特定尺寸。 -- eps - 为了数值的稳定性而加到分母上的一个值。默认值:1e-5 -- elementwise_affine - 一个布尔值,当设置为 "True "时,该模块具有可学习的每元素仿生参数,初始化为1(用于权重)和0(用于偏差) - - -### Type Constraints - -- T: tensor(int8) - ---------- - -# SoftmaxInt -- linger导出 - -### Inputs - -- x:T,(∗) 其中*表示,任何数量的附加维度 - -### Outputs - -- y:T,(*),与输入的形状相同 - - -### Attr -- dim(int)--计算Softmax的维度(因此沿dim的每个片断的总和都是1)。 - -### Type Constraints - -- T: tensor(int8) - ---------- -# LogSoftmaxInt -- linger导出 - -### Inputs - -- x:T,(∗) 其中*表示,任何数量的附加维度 - -### Outputs - -- y:T,(*),与输入的形状相同 - - -### Attr -- dim(int)--计算logSoftmax的维度(因此沿dim的每个片断的总和都是1)。 - -### Type Constraints - -- T: tensor(int8) - ---------- -# iqflip -- linger导出 - -### Inputs - -- x:T,输入的tensor - -### Outputs - -- y:T,输出的tensor - - -### Attr -- dim(int)-- 指定翻转的的维度 - -### Type Constraints - -- T: 
tensor(int8) - - -# iqVar -- linger导出 - -### Inputs - -- x:T,输入的tensor -- dim: 维度 -- unbiased: 无偏/有偏 - -### Outputs - -- y:T,输出的tensor - -### Type Constraints - -- T: tensor(int8) - -# channel_shuffle -### Inputs - -- x:T,输入的tensor -- groups: 分组数量 - -### Outputs - -- y:T,输出的tensor - -### Type Constraints - -- T: tensor(int8) \ No newline at end of file +| nn.BatchNorm2d | [QBatchNorm2d] | QBatchNorm2d | - | +| nn.LayerNorm2d | [QLayerNorm2d] | QLayerNorm2d | - | +| nn.Linear | [QLinear] | QLinear | - | +| nn.Conv1d | [QConv1d] | QConv1d | - | +| nn.Conv2d | [QConv2d] | QConv2d | - | +| nn.ConvTranspose1d | [QConvTranspose1d] | QConvTranspose1d | - | +| nn.ConvTranspose2d | [QConvTranspose2d] | QConvTranspose2d | - | +| nn.AvgPool1d | [QAvgPool1d] | QAvgPool1d | - | +| nn.AvgPool2d | [QAvgPool2d] | QAvgPool2d | - | +| nn.MaxPool1d | [QMaxPool1d] | QMaxPool1d | - | +| nn.MaxPool2d | [QMaxPool2d] | QMaxPool2d | - | +| nn.GRU | [QGRU] | QGRU | - | +| nn.LSTM | [QLSTM] | QLSTM | - | +| nn.Relu | [Relu] | Relu | - | +| torch.bmm | [QBmm] | QBmm | - | +| torch.sigmoid | [QSigmoid] | QSigmoid | - | +| torch.tanh | [QTanh] | QTanh | - | +| torch.clamp | [Clamp] | Clamp | - | +| torch.cat | [QCat] | QCat | - | +| torch.transpose | [Transpose] | Transpose | - | +| view | [view] | Reshape | - | +| reshape | [reshape] | Reshape | - | +| squeeze | [squeeze] | Squeeze | - | +| unsqueeze | [unsqueeze] | Unsqueeze | - | +| flatten | [flatten] | Flatten | - | +| split | - | - | - | +| slice | [slice] | Slice | - | +| add | [QAdd] | QAdd | - | +| mul | [QMul] | QMul | - | +| nn.Embedding | [QEmbedding] | QEmbedding | - | +| layernorm | [QLayerNorm] | QLayerNorm | - | +| softmax | [QSoftmax] | QSoftmax | - | diff --git a/doc/tutorial/two_stage_quant_aware_train.md b/doc/tutorial/two_stage_quant_aware_train.md deleted file mode 100644 index 6ce9b91..0000000 --- a/doc/tutorial/two_stage_quant_aware_train.md +++ /dev/null @@ -1,150 +0,0 @@ -# 浮点-定点两阶段量化训练方案介绍 
-![](../image/trace_layer_normailize_init.png) - -## RawMode -图中的RawMode可以是已经训练完成的浮点模型,也可以是随机初始化的模型。 -- RawModel是已经训练完成的浮点模型: - - 可以认为FloatStage 进行浮点数据调整,其中最重要的调整包括ConvBn融合,参数normalize和激活值normalize,这种调整有利于Fix Stage 的定点训练 - - 在实际使用中,由已训练的 raw model 直接进行FixStage 对定点损失比较严重,甚至loss直接崩溃 - -- RawModel是随机初始化的模型: - - 可以认为FloatStage 的策略会直接参与训练,包括对weight和激活的normalize策略,以及ConvBn融合 - - 此处要注意ConvBN融合,并非简单直接将BN参数推入Conv,而是前向融合,反向不融合策略,获得最大的训练收益 - - 可能对训练效率有少许影响,经过几个案例的测试,我们发现影响大概在10%内 - -## RawStateDict - 图中RawStateDict加载是向后兼容的,兼容顺序可以箭头方向 - - 在模型变换的每个节点(图中FloatModel和FloatModelWithFixScale),都可以进行对StateDict加载和保存 - - 同一训练节点Save的StateDict可以被本节点Model加载(Save到Load的左向箭头) - - 可以被后续阶段的节点加载 - - 在每个阶段可以导出onnx - - 注意: 模型加载需要在网络变换的最后加载,中间不允许加载模型,如下代码示意 - ``` python - model=MyNet() - model=linger.trace_layers(....) - model.load_state_dict(...) #it's OK - ``` - ``` python - model=MyNet() - model=linger.trace_layers(model,....) - #model.load_state_dict(...) #it's NOT OK - model=linger.normalize_layers(model,....) - model.load_state_dict(...) #it's OK - ``` - ``` python - model=MyNet() - model=linger.trace_layers(model,....) - #model.load_state_dict(...) #it's NOT OK - model=linger.normalize_layers(model,....) - #model.load_state_dict(...) #it's NOT OK - model=linger.init(model,...) - model.load_state_dict(...) 
#it's OK - ``` - - -# layers 相关策略 -## ConvBn融合原理介绍 -![](../image/conv_fused_bn.png) - -linger 浮点阶段实现了convbn融合,执行保守融合策略 - -使用实例如下 - -``` python -class Model(nn.Module): - def __init__(self): - super(Model, self).__init__() - self.transpose = nn.ConvTranspose2d(10, 10, 5, 5, 2, 4, 2, True, 2) - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True, groups=2) - self.bn = nn.BatchNorm2d(10) - self.fc = nn.Linear(10*254*254, 100) - - def forward(self, x): - x = self.transpose(x) - x = self.conv(x) - x = self.bn(x) - n, c, h, w = x.shape - x = x.view(n, c*h*w) - x = self.fc(x) - return x - -model = Model() -print(model) -``` - -``` - Model( - (transpose): ConvTranspose2d(10, 10, kernel_size=(5, 5), stride=(5, 5), padding=(2, 2), dilation=(2, 2), output_padding=(4, 4), groups=2) - (conv): Conv2d(10, 10, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2) - (bn): BatchNorm2d(10, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) - (fc): Linear(in_features=645160, out_features=100, bias=True) -) -``` -``` python -dummy_input = torch.randn(1, 10, 10, 10).cuda() -linger.trace_layers(model, model, dummy_input, fuse_bn=True) -print(model) -``` -``` -Model( - (transpose): ConvTranspose2d(10, 10, kernel_size=(5, 5), stride=(5, 5), padding=(2, 2), dilation=(2, 2), output_padding=(4, 4), groups=2) - (conv): NormalizeConvBN2d( - normalize_data:None,normalize_weight:None,normalize_bias:None,ahead_relu:False - (conv): Conv2d(10, 10, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=2) - (bn): BatchNorm2d(10, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) - ) - (bn): EmptyBatchNorm() - (fc): Linear(in_features=645160, out_features=100, bias=True) -) -``` -从两次的输出的net网络可以看出,lingerCONVBN融合策略会将原网络的bn替换成`EmptyBatchNorm`,将原来的conv替换成`NormalizeConvBN2d`。 - -- EmptyBatchNorm : 什么都没做,仅仅占位 -- NormalizeConvBN2d : 带有normalize能力的convbn2d模块(此处normalize为None,不设置normalize功能) - -## AHEAD_RELU -众所周知,Relu 
操作是将负值用零替换。但如果Relu前面的操作OPX输出的量化,对正负值有一定的偏向性,特别是在负向有较大幅值,非常不利于输OPX的量化。由于OPX后紧跟Relu,因此可以使用OPX量化时可以仅仅关注正值量化,此策略即为AHEAD_RELU。 -例如如下网络定义 -``` python -class Model(nn.Module): - def __init__(self): - super(Model, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn = nn.BatchNorm2d(10) - self.relu = nn.ReLU() - self.conv1 = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.relu1 = nn.ReLU() - self.conv2 = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn1 = nn.BatchNorm2d(10) - self.relu2 = nn.ReLU() - self.fc = nn.Linear(250, 100) - - def forward(self, x): - x = self.conv(x) - x = self.bn(x) - x = self.relu(x) - x = x - 1 - x = self.conv1(x) - x = self.relu1(x) - x = self.conv2(x) - x = self.bn1(x) - x = self.relu2(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x -``` - -其中 `self.bn`, `self.conv1`, `self.b1`都可以仅仅关注正方向的量化。 - -当前支持的组合策略包括 `ahead_conv_relu,ahead_bn_relu,ahead_linear_relu` -在trace_layers接口中均有选项,用户可以选用 -`trace_layers(root_model,target_model,args,*,fuse_bn:bool =True,ahead_conv_relu:bool =True,ahead_bn_relu:bool =True,ahead_linear_relu:bool =True)` - -注意: -- 虽然trace_layers ahead relu是属于浮点(预)训练阶段,但仅仅为配置选项,真正使用和产生影响是在量化训练阶段,即init之后 -- trace_layers接口需要在init之前调用,同时只能执行一次,目前不允许网络trace多个部分, 建议放在所有配置操作的最前面 \ No newline at end of file diff --git a/doc/tutorial/wb_analyse_tool_and_onnx_export_debug_tool.md b/doc/tutorial/wb_analyse_tool_and_onnx_export_debug_tool.md deleted file mode 100644 index ada4133..0000000 --- a/doc/tutorial/wb_analyse_tool_and_onnx_export_debug_tool.md +++ /dev/null @@ -1,98 +0,0 @@ - - -## 1. 
wb_analyse分析工具 - -```python -# 原始浮点基线权重 pth 分析日志保存地址 -linger.wb_analyse('data.ignore/tool_test.pt', 'data.ignore/wb_anylse.log') -#------------------------------------------------------------------------------------- -``` -## or -```python - -checkpoint = torch.load("best_checkpoint.pth") - -checkpoint = checkpoint['state_dict'] -# 也可以传入加载后的pth 分析日志保存地址默认为./wb_analyse.log -linger.wb_analyse(checkpoint) -``` - -```python -''' -日志如下所示 Multiple = Max / Mean , Versu = Max / Dynamic -+-------------------------------------------------------+--------------------+--------------------+-----------------+--------------------+-----------------+ -| Layer_name | Mean | Max | Multiple | Dynamic 0.99 | Versu | -+-------------------------------------------------------+--------------------+--------------------+-----------------+--------------------+-----------------+ -| encoder.conv1.conv.weight | tensor(0.8093) | tensor(4.0748) | tensor(5.0348) | tensor(3.2437) | tensor(1.2562) | -| encoder.conv1.conv.bias | tensor(0.1000) | tensor(0.1000) | tensor(1.0000) | tensor(0.1000) | tensor(1.) | -| encoder.conv1.bn.weight | tensor(0.4724) | tensor(1.2380) | tensor(2.6208) | tensor(1.0338) | tensor(1.1975) | -| encoder.conv1.bn.bias | tensor(0.3030) | tensor(1.9110) | tensor(6.3075) | tensor(1.5030) | tensor(1.2714) | -| encoder.conv1.bn.num_batches_tracked | tensor(6185962) | tensor(6185962) | tensor(1.) | tensor(6185962) | tensor(1.) 
| -+-------------------------------------------------------+--------------------+--------------------+-----------------+--------------------+-----------------+ -''' -``` - -## 2、 out_analyse分析工具(初版,复杂模型可能不适用) -### 分析网络每一层的输出分布,日志形式同权重分析日志 - -```python -model = resnet50().cuda() -### 加载训练好的浮点checkpoint -model.load_state_dict(checkpoint) -### 给定一个网络的真实的典型输入,不要用随机数据 -typical_input = torch.randn([1,3,224,224]).cuda() - -with linger.Dumper() as dumper: - # model.eval() - dumper.analyse_layer_output(model,match_pattern="root.") # match_pattern 可支持查看对应哪些层 - model(typical_input) #跑一遍前向 - dumper.save_out_analyse_log(save_log_path="Analyse_layer_output.log") #日志保存路径 -## 此接口会在当前目录生成一个名为"Analyse_layer_output.log"的文件 -``` -### 根据日志中Multiple = Max / Mean , Versu = Max / Dynamic0.99 两个的数值进行分析 -### ① 一般情况希望输出分布的均值和最值不要相差太大 这两个倍数供参考 -### ② 当Versu大于10倍时,说明此层输出的分布最值有明显异常,对量化很不友好 ,日志中会在此层数据下面打印!!!提示 -### ③ 一般推荐对于异常层来说,对其进行精细的normalize约束设置,向均值方向约束(不代表约束到均值),目的仅为抹除异常的最值即可 - -```python -''' -日志如下所示 Multiple = Max / Mean , Versu = Max / Dynamic -+----------------------------+----------------+-----------------+--------------------+----------------+--------------------+ -| Layer_name | Mean | Max | Multiple(Max/Mean) | Dynamic 0.99 | Versu(Max/Dynamic) | -+----------------------------+----------------+-----------------+--------------------+----------------+--------------------+ -| root.conv1 | tensor(0.7991) | tensor(4.9494) | tensor(6.1935) | tensor(1.6482) | tensor(3.0028) | -| root.bn1 | tensor(1.1000) | tensor(11.8600) | tensor(10.7815) | tensor(2.5022) | tensor(4.7399) | -| root.relu | tensor(0.4383) | tensor(7.7810) | tensor(17.7513) | tensor(0.8851) | tensor(8.7912) | -| root.maxpool | tensor(0.3245) | tensor(7.7810) | tensor(23.9802) | tensor(0.8358) | tensor(9.3091) | -| root.layer1.0.conv1 | tensor(0.7606) | tensor(7.7810) | tensor(10.2294) | tensor(1.4041) | tensor(5.5418) | -| root.layer1.0.bn1 | tensor(0.6418) | tensor(4.2427) | tensor(6.6106) | tensor(1.5714) | tensor(2.7000) 
| -| root.layer1.0.relu | tensor(0.3977) | tensor(2.7954) | tensor(7.0291) | tensor(0.8981) | tensor(3.1128) | -| root.layer1.0.conv2 | tensor(0.1164) | tensor(2.7954) | tensor(24.0151) | tensor(0.5088) | tensor(5.4937) | -+----------------------------+----------------+-----------------+--------------------+----------------+--------------------+ -''' -``` - -## 3、 linger导出的onnx图中 dequant错乱 或者 图中节点有断裂,可参照下面过程操作 - -### torch.onnx.export提供以下选项供调试: -- is_update_dequant = True # 设为False,关闭添加dequant节点(&删除identity结点)的过程 -- is_scoped_info = True # 设为False,关闭添加和删除节点scope name信息的过程 -- debug_dump = False # 设为True,保存中间各步的onnx结果,仅供调试使用, (建议使用此选项时不要对以上两个选项做修改) - - -```python -dummy_input = torch.ones(1,3,224,224) #模拟输入 -with torch.no_grad(): - linger.onnx.export_debug(net, dummy_input,"export_debug.onnx",export_params=True,opset_version=12,operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK,is_update_dequant = False,is_scoped_info=False,debug_dump=False) -``` - -## 4、当使用旧版linger导出的onnx图中仅有 dequant添加错乱情况 ,可参照下面过程修复 - -- conda create 新环境 安装最新版linger (方法仅供参考,保证有一个最新版的linger版本即可) -- linger.fix_dequant(ori_onnx, False) ##原始出错的onnx模型名称 | 是否检测修复后onnxinfer能否运行(设True时需已安装onnxinfer) -- 最后将修复好的onnx保存为 后缀多了_fix.onnx - -```python -## 原始出错的onnx模型名称 | 是否检测修复后onnxinfer能否运行 -linger.fix_dequant("dbpagec2_wrong.onnx", False) -``` diff --git a/install.sh b/install.sh index 985d5b7..79ec490 100644 --- a/install.sh +++ b/install.sh @@ -4,8 +4,8 @@ if [ -e "build" ];then rm -rf build fi -# pip install -r requirements.txt +pip install -r requirements.txt -MAX_JOBS=32 python setup.py sdist +MAX_JOBS=32 python3 setup.py sdist # bdist_wheel echo "MAX_JOBS-------------" -MAX_JOBS=32 pip install dist/*.gz +MAX_JOBS=32 pip install dist/*.gz --no-build-isolation diff --git a/linger/__init__.py b/linger/__init__.py index fd33f6e..e5b4946 100644 --- a/linger/__init__.py +++ b/linger/__init__.py @@ -1,27 +1,13 @@ -import os -import sys +from .__version import __version__, version_info -cur_dir 
= os.path.abspath(os.path.dirname(__file__)) -sys.path.append(os.path.join(cur_dir, "lib")) +from .initialize import (init, constrain, calibration, quant_module, const_module, disable_quant_ops, get_quant_ops_name, config_save_to_yaml) +from .quant.calibrate_funs import register_calibrate_method +from .config import QUANT_CONFIGS +from .onnx import * +from .utils import * +from .constrain import SparifyFFN, ConvBN1d, ConvBN2d, CConvBN1d +from .quant import QTensor,from_tensor_to_qtensor, from_qtensor_to_tensor -from .__version import __version__, version_info -from .config import * -from .conv_bn_fuser import EmptyBatchNorm, FuseBNIntoConv, FuseConvBNAheadRelu -from .dumper import Dumper -from .initialize import (DefaultQuantIntXOP, disable_quant, init, quant_module, - quant_module_by_type, quant_tensor) -from .layer_normalizer import (disable_normalize, normalize_layers, - normalize_module) from .layer_tracer import trace_layers -from .modules import (NormalizeBatchNorm2d, NormalizeConv1d, NormalizeConv2d, - NormalizeConvBN1d, NormalizeConvBN2d, NormalizeConvTransposeBN2d, - NormalizeConvTranspose2d, NormalizeEmbedding, - NormalizeFastGRU, NormalizeFastLSTM, NormalizeLayerNorm, - NormalizeLinear) -from .onnx import parser_dequant -from .ops import * -from .quant import * -from .tools import fix_dequant, onnx_quant, wb_analyse -from .utils import * name = "linger" diff --git a/linger/__version.py b/linger/__version.py index 163880c..e000bb6 100644 --- a/linger/__version.py +++ b/linger/__version.py @@ -8,5 +8,5 @@ def _to_int(s): return s -__version__ = "0.9.0" +__version__ = "3.0.2" version_info = tuple(_to_int(s) for s in __version__.split(".")) diff --git a/linger/checker/__init__.py b/linger/checker/__init__.py new file mode 100644 index 0000000..87fbb4d --- /dev/null +++ b/linger/checker/__init__.py @@ -0,0 +1 @@ +from .onnxrunner import * \ No newline at end of file diff --git a/linger/checker/float_ops_mapper.py b/linger/checker/float_ops_mapper.py new file 
mode 100644 index 0000000..c81d10b --- /dev/null +++ b/linger/checker/float_ops_mapper.py @@ -0,0 +1,476 @@ +import torch.nn.functional as F +import torch +from linger.checker.utils import get_param,register_op +from onnx import numpy_helper +import numpy as np + +@register_op(op_type="Identity") +def identity(inputs, kwargs): + return inputs[0] + +@register_op(op_type="Conv") +def conv(inputs, kwargs): + input_length = len(inputs) + assert input_length == 2 or input_length == 3,"Conv ops:the number of inputs is wrong, \ + expect 2 or 3, but {}, the length of List must be 2([input,weight]) or 3([input,weight,bias]) ".format(input_length) + if input_length == 2: + input,weights= inputs + bias = None + else: + input,weights,bias = inputs + if weights.ndim ==4: + dilations = tuple(get_param(kwargs,'dilations')) + groups = int(get_param(kwargs,'group')) + pads = tuple(get_param(kwargs,"pads")[:2]) + strides = tuple(get_param(kwargs,"strides")) + + return F.conv2d(input,weights,bias,stride=strides,padding=pads,dilation=dilations,groups=groups) + elif weights.ndim == 3: + dilations = tuple(get_param(kwargs,'dilations')) + groups = int(get_param(kwargs,'group')) + pads = get_param(kwargs,"pads")[0] + strides = tuple(get_param(kwargs,"strides")) + + return F.conv1d(input,weights,bias,stride=strides,padding=pads,dilation=dilations,groups=groups) + else: + assert False, "Conv ops only support Conv2d and Conv1d currently, if you want to support more, please contact cgxu2!!!" 
+ +@register_op(op_type="ConvTranspose") +def convTranspose(inputs, kwargs): + input_length = len(inputs) + assert input_length == 2 or input_length == 3,"ConvTranspose ops:the number of inputs is wrong, \ + expect 2 or 3, but {}, the length of List must be 2([input,weight]) or 3([input,weight,bias]) ".format(input_length) + if input_length == 2: + input,weights= inputs + bias = None + else: + input,weights,bias = inputs + dilations = tuple(get_param(kwargs,'dilations')) + groups = int(get_param(kwargs,'group')) + pads = tuple(get_param(kwargs,"pads")[:2]) + strides = tuple(get_param(kwargs,"strides")) + out_padding = tuple(kwargs.get("output_padding",(0,0))) + return F.conv_transpose2d(input,weights,bias,stride = strides,padding = pads, output_padding= out_padding, groups=groups,dilation=dilations) + +@register_op(op_type='ATen') +def aten(inputs, kwargs): + eps = get_param(kwargs, "eps") + normalized_shape = get_param(kwargs,"normalized_shape") + return F.layer_norm(inputs[0], normalized_shape, weight=inputs[1], bias=inputs[2], eps=eps) + +@register_op(op_type="Abs") +def abs(inputs, kwargs): + return torch.abs(inputs[0]) + +@register_op(op_type="Sin") +def sin(inputs, kwargs): + return torch.sin(inputs[0]) + +@register_op(op_type="Cos") +def cos(inputs, kwargs): + return torch.cos(inputs[0]) + +@register_op(op_type="Sqrt") +def sqrt(inputs, kwargs): + return torch.sqrt(inputs[0]) + +@register_op(op_type='LogSoftmax') +def logsoftmax(inputs, kwargs): + axis = get_param(kwargs, "axis") + return torch.log_softmax(inputs[0], axis) + +@register_op(op_type="Less") +def less(inputs, kwargs): + return torch.less(inputs[0], inputs[1]) + +@register_op(op_type="Log") +def log(inputs, kwargs): + return torch.log(inputs[0]) + +@register_op(op_type='LeakyRelu') +def leakyRelu(inputs, kwargs): + alpha = get_param(kwargs,"alpha") + return F.leaky_relu(inputs[0], negative_slope=alpha) + +@register_op(op_type='Erf') +def erf(inputs, kwargs): + return torch.erf(inputs[0]) + 
+@register_op(op_type='GlobalAveragePool') +def global_average_pool(inputs, kwargs): + if len(inputs[0].shape)==4: + return F.adaptive_avg_pool2d(inputs[0],(1,1)) + if len(inputs[0].shape)==3: + return F.adaptive_avg_pool1d(inputs[0],(1)) + +@register_op(op_type='Flatten') +def flatten(inputs, kwargs): + axis = get_param(kwargs,"axis") + return torch.flatten(inputs[0],axis) + +@register_op(op_type='Tile') +def tile(inputs, kwargs): + return inputs[0].repeat(inputs[1].int().tolist()) + +@register_op(op_type='Gemm') +def gemm(inputs, kwargs): + ret = None + if len(inputs) ==2: + ret = F.linear(inputs[0],inputs[1],None) + else: + ret = F.linear(*inputs) + return ret + +@register_op(op_type='Sub') +def sub(inputs, kwargs): + return inputs[0] - inputs[1] # Tensor - Scalar (Scalar,Tensor) + +@register_op(op_type='Mul') +def mul(inputs, kwargs): + return inputs[0] * inputs[1] # Tensor - Scalar (Scalar,Tensor) + +@register_op(op_type='Add') +def add(inputs, kwargs): + return inputs[0] + inputs[1] # Tensor - Scalar (Scalar,Tensor) + +@register_op(op_type='Div') +def div(inputs, kwargs): + return inputs[0] / inputs[1] # Tensor - Scalar (Scalar,Tensor) + +@register_op(op_type='Constant') +def constant(node): + constant_outputs = torch.from_numpy(numpy_helper.to_array(node.attribute[0].t).copy()) + return constant_outputs + +@register_op(op_type='Reshape') +def reshape(inputs,kwargs): + inputs[1] = inputs[1].reshape(-1) + if len(inputs[1]) < inputs[0].ndim: + input_size = inputs[0].shape + shape_size = [input_size[idx] if value ==0 else int(value) for idx, value in enumerate(inputs[1])] + return inputs[0].reshape(shape_size) + else: + if isinstance(inputs[1],np.ndarray): + return inputs[0].reshape(inputs[1].astype(np.int64).tolist()) + else: + return inputs[0].reshape(inputs[1].tolist()) + +@register_op(op_type='Transpose') +def transpose(inputs, kwargs): + perm = get_param(kwargs,'perm') + if isinstance(inputs[0], torch.Tensor): + return inputs[0].permute(perm) + else: + 
return inputs[0].transpose(perm) + +@register_op(op_type="ReduceMean") +def reducemean(inputs, kwargs): + axes = get_param(kwargs,'axes') + keepdims = get_param(kwargs,'keepdims') + return inputs[0].mean(dim = axes,keepdim = bool(keepdims)) + +@register_op(op_type="ReduceMax") +def reducemax(inputs, kwargs): + axes = get_param(kwargs,'axes') + keepdims = get_param(kwargs,'keepdims') + out,_ = torch.max(inputs[0], dim = axes[0],keepdim = bool(keepdims)) + return out + +@register_op(op_type="Unsqueeze") +def unsqueeze(inputs, kwargs): + axes = get_param(kwargs,'axes') + if isinstance(inputs[0],torch.Tensor): + return inputs[0].unsqueeze(dim = axes[0]) + else: + return np.expand_dims(inputs[0],axis = axes) + +@register_op(op_type="Concat") +def concat(inputs, kwargs): + axis = get_param(kwargs,'axis') + all_tensor_flag = True + for input_single in inputs: + if not isinstance(input_single, torch.Tensor) : + all_tensor_flag = False + break + + if not all_tensor_flag : + new_inputs = [input_single.detach().cpu().numpy() if isinstance(input_single,torch.Tensor) else input_single for input_single in inputs] + return np.concatenate(new_inputs,axis= axis) + else: + return torch.cat(inputs,dim = axis) + +@register_op(op_type="Shape") +def shape(inputs, kwargs): + return torch.tensor(list(inputs[0].shape), dtype=torch.int64) + + +@register_op(op_type="Gather") +def gather(inputs, kwargs): #please refer to test_onnx_averagepool_iq samples in test_onnx_runner.py + axis = kwargs.get('axis',0) + if "parameter_bits" in kwargs: + import linger + return linger.EmbeddingInt.run_onnx_embedding(inputs, kwargs) + if_torch = isinstance(inputs[0], torch.Tensor) + if not isinstance(inputs[0], torch.Tensor): + inputs[0] = torch.tensor(inputs[0]) if not isinstance(inputs[0], np.ndarray) else torch.from_numpy(inputs[0].copy()) + if not isinstance(inputs[1], torch.Tensor): + inputs[1] = torch.tensor(inputs[1]) if not isinstance(inputs[1], np.ndarray) else torch.from_numpy(inputs[1].copy()) + + 
if inputs[1].numel() == 1: # example : a = torch.randn([1,3,224,224]); a[:,:,2] + slice_list = [":"]*inputs[0].ndim + slice_list[axis] = str(inputs[1].item()) + slice_str = ','.join(slice_list) + output = eval("inputs[0][{}]".format(slice_str)) + else: + inputs[0] = inputs[0].transpose(0,axis) + # output = F.embedding(torch.LongTensor(inputs[1]),inputs[0]) + output = inputs[0][inputs[1]] + output = output.transpose(axis, 0) + + if not if_torch: + return list(output.detach().numpy()) if output.numel() > 1 else output.item() + return output + + +@register_op(op_type='BatchNormalization') +def batchnormalization(inputs, kwargs): + epsilon = get_param(kwargs, 'epsilon') + momentum = get_param(kwargs, 'momentum') + return F.batch_norm(inputs[0],inputs[3],inputs[4],inputs[1],inputs[2],False,momentum,epsilon) + +@register_op(op_type='Slice') +def slice(inputs, kwargs): + start = int(inputs[1].item()) + end = int(inputs[2].item()) + axes = 0 + step = 1 + if len(inputs) >3: + axes = int(inputs[3].item()) + if len(inputs) == 5: + step = int(inputs[4].item()) + + if isinstance(inputs[0],torch.Tensor): + slice_list = [":"]*inputs[0].ndim + slice_list[axes] = '{}:{}:{}'.format(start,end,step) + slice_str = ','.join(slice_list) + output = eval("inputs[0][{}]".format(slice_str)) + else: # process torch.size, because this type often occurs [1,3,224,224] + slice_str = '{}:{}:{}'.format(start,end,step) + output = np.array((eval("inputs[0][{}]".format(slice_str)))) + return output + +@register_op(op_type='AveragePool') +def averagepool(inputs, kwargs): + kernel_shape = get_param(kwargs, "kernel_shape") + strides = get_param(kwargs, "strides") + ceil_mode = bool(kwargs.get("ceil_mode",0)) + pads = tuple(kwargs.get("pads",[0,0,0,0]))[:2] # argument 'padding' must be tuple of ints, not str + return F.avg_pool2d(inputs[0],kernel_size = kernel_shape,stride = strides,padding = pads,ceil_mode = ceil_mode) + +@register_op(op_type="Pad") +def pad(inputs, kwargs): + # the pads inputs in onnx 
is 'x1_begin,x2_begin ...,x1_end,x2_end...' + # the pads used in F.pad is 'x4_start,x4_end....x1_start,x1_end' + # explanation: "x4_start" refers to adding n values before the 4th dimension,"x4_end" refers to adding n values after the 4th dimension + mode = get_param(kwargs, "mode") + if isinstance(inputs[1],np.ndarray): + inputs[1] = np.flip(inputs[1].reshape(2,-1),axis = 1).transpose(1,0).flatten() # change onnx pads into F.pad format + else: + inputs[1] = inputs[1].reshape(2,-1).flip(dims = [1]).transpose(1,0).flatten() + if len(inputs) ==3: # user specified the pad_value(constant value) + return F.pad(inputs[0],tuple(inputs[1]),mode, inputs[2].item()) + else: + return F.pad(inputs[0],tuple(inputs[1]),mode,0) + +@register_op(op_type="ConstantOfShape") +def constant_of_shape(inputs, kwargs): + value = get_param(kwargs, 'value') + if isinstance(inputs[0],tuple): + return value * inputs[0][0] + if (isinstance(inputs[0],np.ndarray) or isinstance(inputs[0], list)) and len(inputs[0]) == 0: + return value + if value[0]== 0.0: + return torch.zeros(inputs[0].tolist(),dtype=torch.int64) + elif value[0]== 1.0: + return torch.ones(inputs[0].tolist(),dtype=torch.int64) + return value * inputs[0] + +@register_op(op_type= "Equal") +def node_equal(inputs, kwargs): + return inputs[0]==inputs[1] + + +@register_op(op_type="Squeeze") +def squeeze(node_name ,inputs, kwargs): + axes = get_param(kwargs, "axes") + tensor = inputs[0] + for axis in sorted(axes, reverse=True): + tensor = tensor.squeeze(axis) + return tensor + +@register_op(op_type="GatherElements") +def gatherElements(inputs, kwargs): + axes = get_param(kwargs, "axis") + return inputs[0].gather(dim = axes,index = torch.from_numpy(inputs[1].copy())) + +@register_op(op_type="ReduceSum") +def reduceSum(inputs, kwargs): + axes = kwargs.get("axes",None) + keepdim = get_param(kwargs,"keepdims") + if isinstance(inputs[0],torch.Tensor): + if axes is None: + return inputs[0].sum() + else: + return inputs[0].sum(dim = axes ,keepdim 
= bool(keepdim)) + else: + if axes is None: + return inputs[0].sum(keepdims = bool(keepdim)) + else: + return inputs[0].sum(axis = axes,keepdims = bool(keepdim)) + + +@register_op(op_type="MatMul") +def matmul(inputs, kwargs): + return torch.matmul(inputs[0],inputs[1]) + +@register_op(op_type="Sigmoid") +def sigmoid(inputs, kwargs): + return torch.sigmoid(inputs[0]) + +@register_op(op_type="HardSigmoid") +def hardsigmoid(inputs, kwargs): + return F.hardsigmoid(inputs[0]) + +@register_op(op_type="Tanh") +def tanh(inputs, kwargs): + return torch.tanh(inputs[0]) + +@register_op(op_type="Range") +def range(inputs, kwargs): + return torch.arange(inputs[0].item(),inputs[1].item(),inputs[2].item()) + +@register_op(op_type="Where") +def where(inputs, kwargs): + if isinstance(inputs[0], torch.Tensor): + return torch.where(inputs[0],inputs[1],inputs[2]) + return np.where(inputs[0],inputs[1],inputs[2]) + +@register_op(op_type="Expand") +def expand(inputs, kwargs): + if (inputs[1].int()==1).all(): + return inputs[0] + if isinstance(inputs[0], np.ndarray): + return inputs[0] * np.ones(inputs[1],inputs[0].dtype) + else: + return inputs[0].expand(inputs[1].int().tolist()) + # shape1 = list(inputs[0].shape) + # shape2 = inputs[1] + # assert len(shape1) == len(shape2) + # shape = [1]*len(shape1) + # for i in range(len(shape)): + # if shape1[i] == 1: + # shape[i] = shape2[i] + # elif shape2[i] == 1: + # shape[i] = shape1[i] + # elif shape1[i] == shape2[i]: + # shape[i] = shape1[i] + # else: + # raise AttributeError + # return inputs[0].expand(list(shape)) + +@register_op(op_type="Neg") +def neg(inputs, kwargs): + return -1 * inputs[0] + +@register_op(op_type="Softmax") +def softmax(inputs, kwargs): + if "platform_quant" in kwargs: + import linger + return linger.SoftMaxInt.run_onnx_softmax(inputs, kwargs) + axis = get_param(kwargs, "axis") + return torch.softmax(inputs[0],axis) + +@register_op(op_type="TopK") +def topk(inputs, kwargs): + axis = get_param(kwargs, "axis") + largest = 
bool(get_param(kwargs, "largest")) + sorted = bool(kwargs.get("sorted",1)) + if isinstance(inputs[0], torch.Tensor): + return inputs[0].topk(int(inputs[1]),dim = axis, largest = largest, sorted = sorted) + elif isinstance(inputs[0], np.ndarray): + return torch.from_numpy(inputs[0]).topk(int(inputs[1]),dim = axis, largest = largest, sorted = sorted).numpy() + else: + return torch.tensor(inputs[0]).topk(int(inputs[1]),dim = axis, largest = largest, sorted = sorted) + +@register_op(op_type="ScatterElements") +def scatterElements(inputs, kwargs): + if not isinstance(inputs[1], torch.Tensor) : + inputs[1] = torch.tensor(inputs[1]) + if not isinstance(inputs[2], torch.Tensor): + inputs[2] = torch.tensor(inputs[2]) + + axis = get_param(kwargs, "axis") + if isinstance(inputs[0], np.ndarray): + return torch.from_numpy(inputs[0]).scatter(axis,inputs[1],inputs[2]).numpy() + elif isinstance(inputs[0],torch.Tensor): + return inputs[0].scatter(axis,inputs[1],inputs[2]).numpy() + else: + return torch.tensor(inputs[0]).scatter(axis,inputs[1],inputs[2]).numpy() + + +@register_op(op_type="Exp") +def exp(inputs, kwargs): + if isinstance(inputs[0],torch.Tensor): + return inputs[0].exp() + else: + return np.exp(inputs[0]) + + +@register_op(op_type="Conv1d") +def conv1d(inputs, kwargs): + input_length = len(inputs) + assert input_length == 2 or input_length == 3,"Conv1d ops:the number of inputs is wrong, \ + expect 2 or 3, but {}, the length of List must be 2([input,weight]) or 3([input,weight,bias]) ".format(input_length) + if input_length == 2: + input,weights= inputs + bias = None + else: + input,weights,bias = inputs + dilations = tuple(get_param(kwargs,'dilations')) + groups = int(get_param(kwargs,'group')) + pads = get_param(kwargs,"pads")[0] + strides = tuple(get_param(kwargs,"strides")) + + return F.conv1d(input,weights,bias,stride=strides,padding=pads,dilation=dilations,groups=groups) + +@register_op(op_type='ReduceProd') +def reduceProd(inputs, kwargs): + keepdims = 
bool(get_param(kwargs, "keepdims")) + axes = kwargs.get('axes',None) + + if isinstance(inputs, torch.Tensor): + if axes is None: + return inputs[0].prod() + else: + return inputs[0].prod(dim = axes, keepdim=keepdims) + else: + inputs[0] = np.asarray(inputs[0]) + return inputs[0].prod(axis = axes,keepdims = keepdims) + +@register_op(op_type="ChannelShuffle") +def channel_shuffle(inputs,kwargs): + n,c,h,w = inputs[0].shape + groups = kwargs.get("groups") + return inputs[0].reshape(n,groups,c//groups, h,w).permute(0,2,1,3,4).reshape(n,c,h,w) + +@register_op(op_type="ReduceL2") +def reduceL2(inputs,kwargs): + axis = kwargs.get("axes") + keepdim = bool(kwargs.get('keepdims')) + return torch.norm(inputs[0],dim = axis,keepdim=keepdim) + +@register_op(op_type="ArgMax") +def argmax(inputs,kwargs): + axis = kwargs.get("axis") + keepdim = bool(kwargs.get('keepdims')) + return torch.argmax(inputs[0],dim = axis,keepdim=keepdim) \ No newline at end of file diff --git a/linger/checker/iq_ops_mapper.py b/linger/checker/iq_ops_mapper.py new file mode 100644 index 0000000..a906be8 --- /dev/null +++ b/linger/checker/iq_ops_mapper.py @@ -0,0 +1,451 @@ +import torch.nn.functional as F +import torch +from linger.quant.ops import * +from linger.checker.utils import get_param,register_op +from linger.config import QUANT_CONFIGS +import numpy as np +import linger +from linger.utils import quant, dequant +from .utils import create_qmodule, create_qmodule_tensor, load_quantized_weights, StringToQuantMode + +@register_op(op_type="AvgPool2dInt") +def avgpool2dint(inputs, kwargs): + input = inputs[0] + + kernel_size = tuple(kwargs['kernel_shape']) + stride = tuple(kwargs['strides']) + padding = tuple(kwargs['pads'][0:2]) + ceil_mode = bool(kwargs['ceil_mode']) + device = input.device + + module = nn.AvgPool2d(kernel_size=kernel_size, stride=stride, padding=padding, ceil_mode=ceil_mode).to(device) + + instance = create_qmodule(QAvgPool2d, module, device, kwargs) + + return instance(input) + 
+@register_op(op_type="Conv1dInt") +def conv1dInt(inputs, kwargs): + inputs_len = len(inputs) + assert inputs_len == 2 or inputs_len == 3, \ + f"Conv2dInt: invalid number of input tensors (expected 2 or 3, got {inputs_len})" + if inputs_len == 2: + input, weights= inputs + bias = None + else: + input, weights, bias = inputs + + in_channels = input.shape[1] + out_channels = weights.shape[0] + kernel_shape = tuple(kwargs.get('kernel_shape', None)) + strides = kwargs.get('strides', 1) + padding = kwargs.get('pads', (0, 0))[0] + dilations = kwargs.get('dilations', 1) + group = kwargs.get('group', 1) + device = input.device + + module = nn.Conv1d(in_channels, out_channels, kernel_shape, strides, padding, dilations, group).to(device) + + instance = create_qmodule(QConv1d, module, device, kwargs) + instance = load_quantized_weights(instance, kwargs, weights, bias) + + res = instance(input) + if kwargs.get('act_type', 0) == 1: + res = F.relu(res) + return res + +@register_op(op_type='Conv2dInt') +def conv2dint(inputs, kwargs): + inputs_len = len(inputs) + assert inputs_len == 2 or inputs_len == 3, \ + f"Conv2dInt: invalid number of input tensors (expected 2 or 3, got {inputs_len})" + if inputs_len == 2: + input, weights= inputs + bias = None + else: + input, weights, bias = inputs + + in_channels = input.shape[1] + out_channels = weights.shape[0] + kernel_shape = tuple(kwargs.get('kernel_shape', None)) + strides = tuple(kwargs.get('strides', (1, 1))) + pads = kwargs.get('pads', None) + if pads is not None: + pads = tuple(pads[0:2]) + else: + pads = (0, 0) + dilations = tuple(kwargs.get('dilations', (1, 1))) + group = kwargs.get('group', 1) + device = input.device + + module = torch.nn.Conv2d(in_channels, out_channels, kernel_shape, strides, pads, dilations, group).to(device) + + instance = create_qmodule(QConv2d, module, device, kwargs) + instance = load_quantized_weights(instance, kwargs, weights, bias) + + res = instance(input) + + if kwargs.get('act_type', 0) == 1: + 
res = F.relu(res) + return res + +@register_op(op_type='LinearInt') +def linearint(inputs, kwargs): + inputs_len = len(inputs) + assert inputs_len == 2 or inputs_len == 3, \ + f"LinearInt ops: the number of input_tensors is wrong, \ + expect 2 or 3, but {inputs_len}, the length of List must be 2([input,weight]) or 3\ + ([input,weight,bias])" + if inputs_len == 2: + input, weights= inputs + bias = None + else: + input, weights, bias = inputs + + out_features, in_features = weights.shape + has_bias = inputs_len == 3 + device = input.device + + module = nn.Linear(in_features, out_features, has_bias).to(device) + + instance = create_qmodule(QLinear, module, device, kwargs) + instance = load_quantized_weights(instance, kwargs, weights, bias) + + return instance(input) + +@register_op(op_type='iqCat') +def iqcat(inputs, kwargs): + kwargs['is_cat'] = True + dim = kwargs.get('dim', -1) + num_input = len(inputs) + + if inputs[0].dtype in {torch.int32, torch.int64}: + return torch.cat(inputs, dim) + else: + instance = create_qmodule_tensor(QCat, None, num_input, kwargs) + return instance(inputs, dim) + +@register_op(op_type='Quant') +def quant_(inputs, kwargs): + bits = kwargs.get('data_bits', 8) + scale = torch.tensor(kwargs.get('scale_x', 1.0), dtype=torch.float32) + zp = torch.tensor(kwargs.get('zero_point', 0), dtype=torch.float32) + quant_mode = StringToQuantMode(kwargs.get('quant_mode', 'floor_add')) + input = inputs[0] + qinput, _ = quant(input, bits, scale, zp, quant_mode) + input = dequant(qinput, scale) + input = from_tensor_to_qtensor(input, scale, bits, zp) + return input + +@register_op(op_type='Dequant') +def dequant_(inputs, kwargs): + input = inputs[0] + return input + +@register_op(op_type="BmmInt") +def bmmint(inputs, kwargs): + num_input = len(inputs) + assert num_input == 2, f'invalid input number, expeted 2, but got {num_input}' + input_0, input_1 = inputs + + instance = create_qmodule_tensor(QBmm, None, 2, kwargs) + return instance(input_0, input_1) + 
@register_op(op_type="LayerNormInt")
def layernormint(inputs, kwargs):
    """Execute a quantized LayerNorm node.

    Normalization covers all dims from kwargs['axis'] (default -1) to the end.
    """
    inputs_len = len(inputs)
    assert inputs_len in (2, 3), \
        f"LayerNormInt ops: the number of input_tensors is wrong, \
        expect 2 or 3, but {inputs_len}, the length of List must be 2([input,weight]) or 3\
        ([input,weight,bias])"
    if inputs_len == 2:
        input, weights = inputs
        bias = None
    else:
        input, weights, bias = inputs

    axis = kwargs.get('axis', -1)
    input_shape = list(input.shape)
    # normalized_shape = trailing dims starting at `axis` (negative axes
    # are resolved against the rank first).
    normalized_shape = input_shape[axis if axis >= 0 else len(input_shape) + axis:]
    device = input.device

    module = nn.LayerNorm(normalized_shape, device=device)

    instance = create_qmodule(QLayerNorm, module, device, kwargs)
    instance = load_quantized_weights(instance, kwargs, weights, bias)

    return instance(input)


@register_op(op_type="GluInt")
def gluint(inputs, kwargs):
    """Execute a quantized GLU (gated linear unit) node along kwargs['dim']."""
    input = inputs[0]

    dim = kwargs.get('dim', -1)
    module = nn.GLU(dim=dim)
    device = input.device

    instance = create_qmodule(QGLU, module, device, kwargs)
    return instance(input)


@register_op(op_type='MaxPool')
def maxpool2d(inputs, kwargs):
    """Plain (non-quantized) 2-d max pooling for ONNX MaxPool nodes."""
    kernel_size = get_param(kwargs, "kernel_shape")
    pads = get_param(kwargs, "pads")
    strides = get_param(kwargs, 'strides')
    # Fixed attribute name: ONNX spells it 'dilations'; the old misspelled
    # 'dilation' is kept as a fallback for backward compatibility.
    dilation = kwargs.get("dilations", kwargs.get("dilation", 1))
    ceil_mode = bool(kwargs.get('ceil_mode', False))
    return F.max_pool2d(inputs[0], kernel_size, strides, pads[0], dilation, ceil_mode)


@register_op(op_type='iqAdd')
def iqadd(inputs, kwargs):
    """Execute a quantized element-wise addition of two tensors."""
    input_len = len(inputs)
    assert input_len == 2, 'The inputs number of iqAdd is wrong'
    input_0, input_1 = inputs

    instance = create_qmodule_tensor(QAdd, None, 2, kwargs)
    return instance(input_0, input_1)

# @register_op(op_type='iqDiv')
# def iqdiv(inputs, kwargs):
#     platform = kwargs.get("platform", "")
#     op_cls = get_op_class(platform, "iqDiv")
#     return op_cls.excute_base(inputs, kwargs)

@register_op(op_type='iqMul')
def iqmul(inputs, kwargs):
    """Execute a quantized element-wise multiplication.

    Mixed QTensor/plain-tensor operands are dequantized with their node
    scales before being requantized inside QMul.
    """
    input_len = len(inputs)
    assert input_len == 2, 'The inputs number of iqMul is wrong'
    x, y = inputs

    scale_x = kwargs.get('scale_x')
    scale_y = kwargs.get('scale_y')

    if isinstance(x, QTensor) and isinstance(y, QTensor):
        qx, qy = x, y
    elif isinstance(x, QTensor) and (not isinstance(y, QTensor)):
        qx = dequant(x, scale_x)
        qy = y
    elif (not isinstance(x, QTensor)) and isinstance(y, QTensor):
        qx = x
        qy = dequant(y, scale_y)
    else:
        qx = dequant(x, scale_x)
        qy = dequant(y, scale_y)

    instance = create_qmodule_tensor(QMul, None, 2, kwargs)
    return instance(qx, qy)


@register_op(op_type='Relu')
def relu(inputs, kwargs):
    """Execute a (quantized-aware) ReLU node."""
    input = inputs[0]

    module = nn.ReLU()
    device = input.device

    instance = create_qmodule(QRelu, module, device, kwargs)
    return instance(input)


@register_op(op_type='Split')
def split(inputs, kwargs):
    """Split the input tensor per the ONNX 'split' sizes along 'axis'."""
    axis = get_param(kwargs, 'axis')
    split = get_param(kwargs, 'split')
    return inputs[0].split(split, axis)


@register_op(op_type='Cast')
def cast(inputs, kwargs):
    """Cast the input to the ONNX TensorProto dtype given by attribute 'to'."""
    onnx_dtype = {
        0: 'UNDEFINED', 1: 'float32', 2: 'uint8', 3: 'int8', 4: 'uint16',
        5: 'int16', 6: 'int32', 7: 'int64', 8: 'str', 9: 'bool', 10: 'float16',
        11: 'double', 12: 'uint32', 13: 'uint64', 14: 'complex64', 15: 'complex128',
        16: 'bfloat16'
    }

    # np.bool8 was removed in NumPy 2.0; np.bool_ is the supported spelling.
    onnx_numpy_type = {
        1: np.float32, 2: np.uint8, 3: np.int8, 4: np.uint16,
        5: np.int16, 6: np.int32, 7: np.int64, 9: np.bool_, 10: np.float16,
        11: np.double, 12: np.uint32, 13: np.uint64, 14: np.complex64, 15: np.complex128,
    }

    onnx_tensor_type = {
        1: torch.float, 2: torch.uint8, 3: torch.int8,
        5: torch.int16, 6: torch.int32, 7: torch.int64, 9: torch.bool, 10: torch.float16,
        11: torch.double, 14: torch.complex64, 15: torch.complex128
    }
    to = get_param(kwargs, 'to')
    output = None
    if isinstance(inputs[0], (QTensor, torch.Tensor)):
        if to in onnx_tensor_type:
            output = inputs[0].type(onnx_tensor_type[to])
        else:
            # Fixed: onnx_dtype is a dict, the original called it like a
            # function (onnx_dtype(to)), raising TypeError instead of the
            # intended message.
            raise TypeError("Type Error!!Current Version don't support {}(type:{}) in cast node!!!".format(to, onnx_dtype[to]))
    else:
        if to in onnx_numpy_type:
            output = np.array(inputs[0]).astype(onnx_numpy_type[to])
        else:
            raise TypeError("Type Error!!Current Version don't support {}(type:{}) in cast node!!!".format(to, onnx_dtype[to]))
    return output


@register_op(op_type="ConvTranspose2dInt")
def convTranspose2dInt(inputs, kwargs):
    """Execute a quantized 2-d transposed convolution node (optional fused ReLU)."""
    inputs_len = len(inputs)
    # Fixed copy-paste bug: the message said "Conv2dInt".
    assert inputs_len in (2, 3), \
        f"ConvTranspose2dInt: invalid number of input tensors (expected 2 or 3, got {inputs_len})"
    if inputs_len == 2:
        input, weights = inputs
        bias = None
    else:
        input, weights, bias = inputs

    kernel_shape = tuple(kwargs.get('kernel_shape', None))
    strides = tuple(kwargs.get('strides', (1, 1)))
    pads = kwargs.get('pads', None)
    pads = tuple(pads[0:2]) if pads is not None else (0, 0)
    dilations = tuple(kwargs.get('dilations', (1, 1)))
    group = kwargs.get('group', 1)
    in_channels = input.shape[1]
    # Fixed: ConvTranspose2d weights are (in_channels, out_channels // groups,
    # kH, kW), so shape[0] is the *input* channel count; derive out_channels
    # from shape[1] * groups instead.
    out_channels = weights.shape[1] * group
    device = input.device

    # Fixed: the original passed `dilations` and `group` positionally, which
    # bound them to ConvTranspose2d's `output_padding` and `groups` parameters
    # (dilation silently stayed 1 and spurious output padding was added).
    # Bind by keyword so each attribute lands on the right parameter.
    module = torch.nn.ConvTranspose2d(in_channels, out_channels, kernel_shape,
                                      stride=strides, padding=pads,
                                      dilation=dilations, groups=group).to(device)

    instance = create_qmodule(QConvTranspose2d, module, device, kwargs)
    instance = load_quantized_weights(instance, kwargs, weights, bias)

    res = instance(input)
    if kwargs.get('act_type', 0) == 1:  # act_type == 1 -> fused ReLU
        res = F.relu(res)
    return res

# @register_op(op_type="iqSum")
# def iqsum(inputs, kwargs):
#     platform = kwargs.get("platform", "")
#     op_cls = get_op_class(platform, "iqSum")
#     return op_cls.excute_base(inputs, kwargs)

@register_op(op_type="MatMulInt")
def matmulint(inputs, kwargs):
    """Execute a quantized matrix multiplication node."""
    num_input = len(inputs)
    # Fixed typo in the error message ("expeted" -> "expected").
    assert num_input == 2, f'invalid input number, expected 2, but got {num_input}'
    input_0, input_1 = inputs

    instance = create_qmodule_tensor(QMatmul, None, 2, kwargs)
    return instance(input_0, input_1)


@register_op(op_type="BatchNorm2dInt")
def batchnorm2dInt(inputs, kwargs):
    """Execute a quantized 2-d batch normalization node."""
    input, weights, bias = inputs

    num_features = input.shape[1]
    device = input.device

    module = nn.BatchNorm2d(num_features)

    instance = create_qmodule(QBatchNorm2d, module, device, kwargs)
    # NOTE(review): only weight/bias are restored from the node; running
    # mean/var are not — presumably QBatchNorm2d folds them in. Confirm.
    instance = load_quantized_weights(instance, kwargs, weights, bias)

    return instance(input)

# @register_op(op_type="GRUInt")
# def gruint(inputs, kwargs):
#     platform = kwargs.get("platform", "")
#     op_cls = get_op_class(platform, "GRUInt")
#     return op_cls.excute_base(inputs, kwargs)

# @register_op(op_type="LSTMInt")
# def lstmint(inputs, kwargs):
#     platform = kwargs.get("platform", "")
#     op_cls = get_op_class(platform, "LSTMInt")
#     return op_cls.excute_base(inputs, kwargs)

@register_op(op_type="iqSigmoid")
def iqsigmoid(inputs, kwargs):
    """Execute a quantized Sigmoid node."""
    input = inputs[0]

    instance = create_qmodule_tensor(QSigmoid, None, 1, kwargs)
    return instance(input)


@register_op(op_type="iqTanh")
def iqtanh(inputs, kwargs):
    """Execute a quantized Tanh node."""
    input = inputs[0]

    instance = create_qmodule_tensor(QTanh, None, 1, kwargs)
    return instance(input)


@register_op(op_type="SoftmaxInt")
def softmaxInt(inputs, kwargs):
    """Execute a quantized Softmax node along the ONNX 'axis' attribute."""
    input = inputs[0]

    # Fixed: 'axis' was read into an unused local, so the axis never reached
    # QSoftmax (create_qmodule_tensor looks for 'dim'). Forward it explicitly.
    kwargs['dim'] = kwargs.get('axis', -1)

    instance = create_qmodule_tensor(QSoftmax, None, 1, kwargs)
    return instance(input)

# @register_op(op_type="LogSoftmaxInt")
# def softmaxInt(inputs,kwargs):
#     platform = kwargs.get("platform", "")
#     op_cls = get_op_class(platform, "LogSoftmaxInt")
#     return op_cls.excute_base(inputs, kwargs)

# @register_op(op_type="ReQuant")
# def onnxinferdequant(inputs,kwargs):
#     import math
#     src_bits = kwargs.get("bit_src")
#     dst_bits = kwargs.get('bit_dst')
#     scale_src = kwargs.get('scale_src')
#     s_rescale = (math.pow(2,dst_bits-1) -1.0)/(math.pow(2,src_bits-1) -1.0)
#     if kwargs.get('qmax') == 2: #qvalue
#         s_rescale = math.pow(2,round(math.log(s_rescale,2)))
#     scale = s_rescale*scale_src
#     zero_point = 0
#     if isinstance(input, QTensor):
#         zero_point = input.zero_point
#         if zero_point != 0:
#             zero_point = math.pow(2, dst_bits-1)

#     s =
from_torch_tensor(inputs[0],scale,dst_bits, zero_point=zero_point) +# s.requant_() +# return s + +@register_op(op_type='topN') +def topn(inputs, kwargs): + input, idx_offset = inputs + assert isinstance(input, linger.QTensor) == True, 'input of topN must be QTensor' + dim = kwargs.get('dim', -1) + assert dim == -1 or dim == input.ndim-1, f'only the last dim is supported, the current value is {dim}' + max_num = kwargs.get('max_num', 1) + assert max_num == 1, f'only max_num=1 is supported, the current value is {max_num}' + assert input.shape[0] == 1, 'only input.shape[0] == 1 is supported.' + + scale = input.scale + zp = input.zero_point + data_bits = kwargs.get('data_bits', 8) + quant_mode = StringToQuantMode(kwargs.get('quant_mode', 'floor_add')) + + q_input = (input * scale + 0.5).floor().to(torch.int32).clamp(-128, 127).cpu() + q_input, _ = quant(input, scale, zp, quant_mode) + val, idx = torch.topk(q_input, max_num, dim) + idx = idx.to(torch.int32) + idx_offset + res = torch.cat([val, idx], dim=0) + return res + +@register_op(op_type='topN2') +def topn2(inputs, kwargs): + dim = kwargs.get('dim', -1) + assert dim == -1 or dim == input.ndim-1, f'only the last dim is supported, the current value is {dim}' + max_num = kwargs.get('max_num', 1) + assert max_num == 1, f'only max_num=1 is supported, the current value is {max_num}' + + input = inputs[0] + + leading = torch.tensor(input.shape[:dim]).prod() + val, ori_idx = torch.tensor_split(input, 2, 0) + ori_idx = ori_idx.to(torch.long) + max_val, fake_idx = torch.topk(val, max_num, dim) + real_idx = torch.gather(ori_idx, dim=dim, index=fake_idx) + res = torch.cat([max_val, real_idx], dim=0) + return res \ No newline at end of file diff --git a/linger/checker/onnxrunner.py b/linger/checker/onnxrunner.py new file mode 100644 index 0000000..b509f2e --- /dev/null +++ b/linger/checker/onnxrunner.py @@ -0,0 +1,574 @@ +import torch +from torch import Tensor +import onnx +from onnx import helper, numpy_helper, TensorProto 
+import torch.nn.functional as F +from onnx import numpy_helper +import numpy as np +from collections import deque +from .iq_ops_mapper import * +from .float_ops_mapper import * +from linger.config import QuantConfig +from linger.utils import PlatForm, quant +import os +import re +from .utils import single_node_run, if_node_run, parse_attribute_and_name, onnx_topologically_sort +from .utils import get_attribute_value +import traceback +from pathlib import Path +from typing import Tuple, Optional, Dict, List +import math +# from typing import Literal + +# DUMP_FORMAT = Literal['float', 'quantized', 'all'] +DUMP_FORMAT = {'float', 'quantized', 'all'} +TRANSPARENT_OPS = {'Reshape', 'Transpose', 'Gather', 'Squueze', 'Unsqueeze', 'Slice', 'Split', 'MaxPool',\ + 'Relu', 'Clip', 'Prelu', 'Resize'} + +class OnnxRunner: + def __init__(self, path, dump = False, dump_format = 'quantized') -> None: + super().__init__() + assert dump_format in DUMP_FORMAT, f'args dump_format {dump_format} is invalid' + self._dump = dump + self._dump_fmt = dump_format + self.__quant_op_configs = {} + self._tensor_shapes = {} + self._model = onnx.load(path) + self._init_quant_op_configs() + self._restore_quantize_nodes() + self._load_onnx() + self._init_dump() + + def _init_dump(self): + if self._dump: + if self._dump is True: + self.__int_dump_path = "data/onnxrunner_int" + self.__float_dump_path = 'data/onnxrunner_float' + if os.path.exists(self.__int_dump_path): + os.system("rm -rf {}".format(self.__int_dump_path)) + if os.path.exists(self.__float_dump_path): + os.system("rm -rf {}".format(self.__float_dump_path)) + Path(self.__int_dump_path).mkdir(parents=True) + Path(self.__float_dump_path).mkdir(parents=True) + + def _load_onnx(self): + self.__input_map_dict = dict() # input or ops output used times + self.__tensor_dict = dict() # storage the intermediate tensor to calculate + ## load initializer + for initializer in self._model.graph.initializer: + self.__tensor_dict[initializer.name] = 
torch.from_numpy(numpy_helper.to_array(initializer).copy()) # 标记 + self.__input_map_dict[initializer.name] = 0 + + self._init_by_platform() + + def _restore_quantize_nodes(self): + def resolve_input_index(node, locator_logic: dict): + logic_type = locator_logic.get('type') + if logic_type == 'static': return locator_logic.get('index') + if logic_type == 'conditional': + arg, node_arg_val = locator_logic.get('arg'), len(node.input) + if arg == 'num_inputs': + for case in locator_logic.get('cases', []): + if 'if_equal' in case and node_arg_val == case['if_equal']: return case['index'] + if 'if_greater_equal' in case and node_arg_val >= case['if_greater_equal']: return case['index'] + return None + + + model = self._model + graph = model.graph + init_names = {init.name for init in graph.initializer} + graph_inputs = [inp for inp in graph.input if inp.name not in init_names] + + consumer_map: Dict[str, List[Tuple[onnx.NodeProto, int]]] = {i.name: [] for i in graph_inputs} + for initializer in graph.initializer: consumer_map[initializer.name] = [] + for node in graph.node: + for i, inp in enumerate(node.input): + if inp not in consumer_map: consumer_map[inp] = [] + consumer_map[inp].append((node, i)) + + nodes_to_add = [] + connections_to_rewire: Dict[str, Dict[int, str]] = {} + processed_graph_inputs = set() + + print("Starting forward search from graph inputs...") + for graph_input in graph_inputs: + if graph_input.name in processed_graph_inputs: continue + + print(f"\nProcessing path starting from input: '{graph_input.name}'") + queue = deque([(graph_input.name, graph_input.name)]) + visited_tensors = {graph_input.name} + + while queue: + current_tensor, original_source = queue.popleft() + consumers = consumer_map.get(current_tensor, []) + path_fixed = False + + for consumer_node, consumer_index in consumers: + if consumer_node.op_type in self.__quant_op_configs: + print(f" -> Path reached potential target '{consumer_node.name}' at its input index {consumer_index}.") 
+ config = self.__quant_op_configs[consumer_node.op_type] + + # Dynamically check if the connection is to a quantizable slot + matched_quant_input_info = None + for quant_input_info in config['quantizable_inputs']: + actual_index = resolve_input_index(consumer_node, quant_input_info['locator_logic']) + if actual_index == consumer_index: + matched_quant_input_info = quant_input_info + break + + if matched_quant_input_info: + print(f" -> SUCCESS: Connection matches the defined quantizable input '{matched_quant_input_info['name']}'.") + _, attrs = parse_attribute_and_name(consumer_node) + scale_val = attrs.get(matched_quant_input_info['scale_attr'], None) + zp_val = attrs.get(matched_quant_input_info['zp_attr'], (0.0)) + data_bits = attrs.get('data_bits', 8) + platform = attrs.get('platform', None) + if scale_val is None or zp_val is None: + print(f" -> ERROR: Could not extract quant params. Skipping.") + continue + + quantized_output_name = f"{original_source}_quantized" + # import pdb; pdb.set_trace() + quant_node = helper.make_node('Quant', inputs=[original_source], outputs=[quantized_output_name], + name=f"{original_source}_Quant_auto", scale_x=scale_val, zeropoint=zp_val, + data_bits=data_bits, platform=platform) + nodes_to_add.append(quant_node) + for i, input in enumerate(graph.input): + if input == original_source: + graph.input[i].type.tensor_type.elem_type = TensorProto.FLOAT + print(f" -> ACTION: Scheduled insertion of Quant node for '{original_source}'.") + + direct_consumers = consumer_map.get(original_source, []) + for dc_node, dc_index in direct_consumers: + if dc_node.name not in connections_to_rewire: connections_to_rewire[dc_node.name] = {} + connections_to_rewire[dc_node.name][dc_index] = quantized_output_name + print(f" - Scheduled to rewire input {dc_index} of '{dc_node.name}'.") + + processed_graph_inputs.add(original_source) + path_fixed = True + break # Break from consumers loop, this path is done + else: + print(f" -> INFO: Connection is to a 
non-quantizable input slot of '{consumer_node.name}'. This path is correct.") + + elif consumer_node.op_type in TRANSPARENT_OPS: + for output_tensor in consumer_node.output: + if output_tensor not in visited_tensors: + print(f" -> Traversing through transparent op '{consumer_node.name}'...") + visited_tensors.add(output_tensor) + queue.append((output_tensor, original_source)) + + if path_fixed: + queue.clear() + + if not nodes_to_add: + print("\nModel analysis complete. No missing Quant nodes were detected.") + + print("\nApplying graph modifications...") + graph.node.extend(nodes_to_add) + for node in graph.node: + if node.name in connections_to_rewire: + inputs = list(node.input) + for index, new_name in connections_to_rewire[node.name].items(): inputs[index] = new_name + node.ClearField("input") + node.input.extend(inputs) + + onnx_topologically_sort(model) + + def _init_by_platform(self): + platform = "venus" + + for node in self._model.graph.node: + for attr in node.attribute: + if attr.name == "platform": + platform = attr.s.decode('utf-8') + + platform_map = { + "venus": PlatForm.venus, + "mars": PlatForm.mars, + "arcs": PlatForm.arcs, + "jupiter": PlatForm.jupiter, + "venusA": PlatForm.venusA + } + + if platform not in platform_map: + raise ValueError(f"The platform {platform} is not support now") + + QUANT_CONFIGS._update_from_dict({'platform': platform}) + + def _node_run(self, node ,inputs): + #Note : The If node processing here, only processes the If node generated by Squeeze + try: + if node.op_type !="If": + return single_node_run(node, inputs) + else: + return if_node_run(node, inputs, self.__tensor_dict) + except Exception as e: # NotImmplementedError, KeyError,ValueError + node_name = node.name + # When user export onnx using operator_export_type!=ONNX, the node.name don't exist + if node.name == "": + node_name = node.op_type + "_I_" + "_".join([node_input for node_input in node.input]) \ + +"_O_"+"_".join([node_output for node_output in 
node.output]) + + print("Error occured in {} , error message is {}".format(node_name, e)) + traceback.print_exc() + exit(-1) + + def _get_input(self,node): + ops_inputs = [] + for input in node.input: + if len(input) == 0: # In the floating point model, sometimes the node input is "" when the input is not input. + ops_inputs.append(None) + else: + ops_inputs.append(self.__tensor_dict[input]) + return ops_inputs + + def _dump_output(self, node, ops_outputs): + quant_mode = get_attribute_value(node, 'quant_mode') + if quant_mode is None: + quant_mode = 'floor_add' + round_mode = StringToQuantMode(quant_mode) + def _flatten_outputs(x): + if isinstance(x, (tuple, list)): + out = [] + for v in x: + out.extend(_flatten_outputs(v)) + return out + return [x] + if len(node.output) == 1: + self.__tensor_dict[node.output[0]] = ops_outputs + self._tensor_shapes[node.output[0]] = tuple(ops_outputs.shape) + if self._dump: + if self._dump_fmt == 'float' or self._dump_fmt == 'all': + dump_path = self.__float_dump_path+os.sep +node.output[0] +"##_float_dump.txt" + + # if ops_outputs.device.type == 'cuda': + # np.savetxt(dump_path, ops_outputs.flatten().cpu().numpy(),fmt="%f") + # else: + np.savetxt(dump_path, ops_outputs.detach().flatten().numpy(),fmt="%f") + if self._dump_fmt == 'quantized' or self._dump_fmt == "all": + dump_path = self.__int_dump_path+os.sep +node.output[0] +"##_int_dump.txt" + if isinstance(ops_outputs, linger.QTensor): + scale = ops_outputs.scale + bits = ops_outputs.data_bits + zp = 0 # TODO: zp = ops_outputs.zero_point + # import pdb; pdb.set_trace() + if ops_outputs.dtype == torch.float32 or ops_outputs.dtype == torch.float64: + q_output, _ = quant(ops_outputs, bits, scale, zp, round_mode) + q_output = q_output.to(torch.int32).detach().flatten().cpu().numpy() + else: + q_output = ops_outputs.detach().flatten().cpu().numpy() + np.savetxt(dump_path, q_output, fmt="%d") + else: + if ops_outputs.dtype in [torch.int8, torch.int16, torch.int32, torch.int64]: + 
q_output = ops_outputs.to(torch.int32).detach().flatten().cpu().numpy() + np.savetxt(dump_path, q_output, fmt="%d") + else: + flat_outputs = _flatten_outputs(ops_outputs) + assert len(flat_outputs) == len(node.output), f"the output number of linger {len(flat_outputs)} is not equal to the output number of onnx {len(node.output)}" + for output_idx, output in enumerate(node.output): + self.__tensor_dict[output] = flat_outputs[output_idx] + self._tensor_shapes[output] = tuple(flat_outputs[output_idx].shape) + + if self._dump and isinstance(flat_outputs[output_idx], torch.Tensor): + if self._dump_fmt == 'float' or self._dump_fmt == 'all': + dump_path = self.__float_dump_path + os.sep + node.output[output_idx] +"##_float_dump.txt" + np.savetxt(dump_path, flat_outputs[output_idx].detach().flatten().cpu().numpy(), fmt="%f") + if self._dump_fmt == 'quantized' or self._dump_fmt == "all": + dump_path = self.__int_dump_path + os.sep +node.output[output_idx] +"##_int_dump.txt" + if isinstance(flat_outputs[output_idx], linger.QTensor): + scale = flat_outputs[output_idx].scale + bits = flat_outputs[output_idx].data_bits + zp = 0 # TODO: zp = flat_outputs[output_idx] + if flat_outputs[output_idx].dtype == torch.float32 or flat_outputs[output_idx].dtype == torch.float64: + q_output = quant(flat_outputs[output_idx], bits, scale, zp, round_mode) + q_output = q_output.to(torch.int).detach().flatten().cpu().numpy() + else: + q_output = flat_outputs[output_idx].detach().flatten().cpu().numpy() + np.savetxt(dump_path, q_output, fmt="%d") + else: + if flat_outputs[output_idx].dtype in [torch.int8, torch.int16, torch.int32, torch.int64]: + q_output = flat_outputs[output_idx].to(torch.int32).detach().flatten().cpu().numpy() + np.savetxt(dump_path, q_output, fmt="%d") + + def _tensor_dict_to_list(self,data): + ret = [] + for _,input in enumerate(self._model.graph.input): + if input.name in data: + ret.append(data[input.name]) + return ret + + def _traverse_input(self,data): + # LSTMInt will 
combine hidden_state and cell_state into a tuple input, + # and the LSTMInt onnx operator needs to be input separately, so the input does not match, + # it needs to be processed separately, and the input hidden_state and cell_state are separated + if type(data) !=list and type(data)!= tuple and type(data) !=Tensor and type(data)!= dict: + raise TypeError("Input type ({}) error,must be [list,tuple,tensor,dict]!!".format(type(data))) + if type(data) !=list and type(data)!=tuple and type(data)!=dict: + data = [data] + if type(data) == dict: + data = self._tensor_dict_to_list(data) + + def _get_list_inout(torch_input, onnx_input): + if isinstance(torch_input, tuple) or isinstance(torch_input, list): + for ele in torch_input: + onnx_input = _get_list_inout(ele,onnx_input) + else: + onnx_input.append(torch_input) + return onnx_input + + onnx_input = [] + torch_input = list(data) + _get_list_inout(torch_input,onnx_input) + + onnx_input = tuple([inp if inp.device == torch.device('cpu') else inp.cpu() for inp in onnx_input]) + return onnx_input + + def get_tensor_info(self): + return self._tensor_shapes + + def run(self, data, special_key = 'None', out_type = "list"): + data = self._traverse_input(data) + # In lower pytorch version , when set 'operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK' in export onnx stage, + # the initializer is considered part of the inputs . Therefore, it needs to be dealt with separately. 
+ for idx,input in enumerate(self._model.graph.input): + if idx >= len(data) : + break + self.__tensor_dict[input.name] = data[idx] + + if out_type == "dict": + model_output = {} + else: + model_output = [] + for node in self._model.graph.node: + ops_inputs = self._get_input(node) + ops_outputs = self._node_run(node, ops_inputs) + + # get output tensor and put it in __tensor_dict + self._dump_output(node, ops_outputs) + if node.output[0] == special_key and out_type == "dict": + model_output[special_key] = ops_outputs + + for output in self._model.graph.output: + # print(output.name) + if out_type == "dict": + model_output[output.name] = self.__tensor_dict[output.name] + else: + model_output.append(self.__tensor_dict[output.name]) + + return model_output + + def _init_quant_op_configs(self): + self.__quant_op_configs = { + 'AvgPool2dInt': { + 'quantizable_inputs': [ + { + 'name': 'input', + 'locator_logic': {'type': 'static', 'index': 0}, + 'scale_attr': 'scale_x', + 'zp_attr': 'data_zero_point' + } + ] + }, + 'BmmInt': { + 'quantizable_inputs': [ + { + 'name': 'input_x', + 'locator_logic': {'type': 'static', 'index': 0}, + 'scale_attr': 'scale_x', + 'zp_attr': 'input_x_zero_point' + }, + { + 'name': 'input_y', + 'locator_logic': {'type': 'static', 'index': 1}, + 'scale_attr': 'scale_y', + 'zp_attr': 'input_y_zero_point' + } + ] + }, + 'Conv1dInt': { + 'quantizable_inputs': [ + { + 'name': 'input', + 'locator_logic': {'type': 'static', 'index': 0}, + 'scale_attr': 'scale_x', + 'zp_attr': 'data_zero_point' + } + ] + }, + 'Conv2dInt': { + 'quantizable_inputs': [ + { + 'name': 'input', + 'locator_logic': {'type': 'static', 'index': 0}, + 'scale_attr': 'scale_x', + 'zp_attr': 'data_zero_point' + } + ] + }, + 'ConvTranspose2dInt': { + 'quantizable_inputs': [ + { + 'name': 'input', + 'locator_logic': {'type': 'static', 'index': 0}, + 'scale_attr': 'scale_x', + 'zp_attr': 'data_zero_point' + } + ] + }, + 'GRUInt': { + 'quantizable_inputs': [ + {'name': 'sequence_input', + 
'locator_logic': {'type': 'static', 'index': 0}, + 'scale_attr': 'scale_x', + 'zp_attr': 'x_zero_point'}, + {'name': 'initial_hidden', + 'locator_logic': {'type': 'conditional', 'arg': 'num_inputs', 'cases': [{'if_equal': 7, 'index': 1}, {'if_equal': 8, 'index': 2}]}, + 'scale_attr': 'scale_h', + 'zp_attr': 'h_zero_point'}, + ] + }, + 'iqAdd': { + 'quantizable_inputs': [ + { + 'name': 'input_x', + 'locator_logic': {'type': 'static', 'index': 0}, + 'scale_attr': 'scale_x', + 'zp_attr': 'input_x_zero_point' + }, + { + 'name': 'input_y', + 'locator_logic': {'type': 'static', 'index': 1}, + 'scale_attr': 'scale_y', + 'zp_attr': 'input_y_zero_point' + } + ] + }, + 'iqCat': { + 'quantizable_inputs': [ + { + 'name': 'input_0', + 'locator_logic': {'type': 'static', 'index': 0}, + 'scale_attr': 'scale_x_0', + 'zp_attr': 'input_zero_point_0' + }, + { + 'name': 'input_1', + 'locator_logic': {'type': 'static', 'index': 1}, + 'scale_attr': 'scale_x_1', + 'zp_attr': 'input_zero_point_1' + }, + { + 'name': 'input_2', + 'locator_logic': {'type': 'static', 'index': 2}, + 'scale_attr': 'scale_x_2', + 'zp_attr': 'input_zero_point_2' + }, + { + 'name': 'input_3', + 'locator_logic': {'type': 'static', 'index': 3}, + 'scale_attr': 'scale_x_3', + 'zp_attr': 'input_zero_point_3' + } + ] + }, + 'iqDiv': { + 'quantizable_inputs': [ + { + 'name': 'input_x', + 'locator_logic': {'type': 'static', 'index': 0}, + 'scale_attr': 'scale_x', + 'zp_attr': 'input_x_zero_point' + }, + { + 'name': 'input_y', + 'locator_logic': {'type': 'static', 'index': 1}, + 'scale_attr': 'scale_y', + 'zp_attr': 'input_y_zero_point' + } + ] + }, + 'iqMul': { + 'quantizable_inputs': [ + { + 'name': 'input_x', + 'locator_logic': {'type': 'static', 'index': 0}, + 'scale_attr': 'scale_x', + 'zp_attr': 'input_x_zero_point' + }, + { + 'name': 'input_y', + 'locator_logic': {'type': 'static', 'index': 1}, + 'scale_attr': 'scale_y', + 'zp_attr': 'input_y_zero_point' + } + ] + }, + 'iqSigmoid': { + 'quantizable_inputs': [ + { + 
'name': 'input', + 'locator_logic': {'type': 'static', 'index': 0}, + 'scale_attr': 'scale_x', + 'zp_attr': 'data_zero_point' + } + ] + }, + 'iqSum': { + 'quantizable_inputs': [ + { + 'name': 'input', + 'locator_logic': {'type': 'static', 'index': 0}, + 'scale_attr': 'scale_x', + 'zp_attr': 'data_zero_point' + } + ] + }, + 'LayerNormInt': { + 'quantizable_inputs': [ + { + 'name': 'input', + 'locator_logic': {'type': 'static', 'index': 0}, + 'scale_attr': 'scale_x', + 'zp_attr': 'data_zero_point' + } + ] + }, + 'LinearInt': { + 'quantizable_inputs': [ + { + 'name': 'input', + 'locator_logic': {'type': 'static', 'index': 0}, + 'scale_attr': 'scale_x', + 'zp_attr': 'data_zero_point' + } + ] + }, + 'LSTMInt': { + 'quantizable_inputs': [ + {'name': 'sequence_input', + 'locator_logic': {'type': 'static', 'index': 0}, + 'scale_attr': 'scale_i', + 'zp_attr': 'i_zero_point'}, + {'name': 'initial_hidden', + 'locator_logic': {'type': 'conditional', 'arg': 'num_inputs', 'cases': [{'if_equal': 7, 'index': 1}, {'if_equal': 8, 'index': 2}]}, + 'scale_attr': 'scale_h', + 'zp_attr': 'h_zero_point'}, + {'name': 'initial_cell', + 'locator_logic': {'type': 'conditional', 'arg': 'num_inputs', 'cases': [{'if_equal': 7, 'index': 2}, {'if_equal': 8, 'index': 3}]}, + 'scale_attr': 'scale_c', + 'zp_attr': 'c_zero_point'} + ] + }, + 'SoftmaxInt': { + 'quantizable_inputs': [ + { + 'name': 'input', + 'locator_logic': {'type': 'static', 'index': 0}, + 'scale_attr': 'scale_x', + 'zp_attr': 'data_zero_point' + } + ] + }, + } + +__all__=["OnnxRunner"] diff --git a/linger/checker/utils.py b/linger/checker/utils.py new file mode 100644 index 0000000..f9092f7 --- /dev/null +++ b/linger/checker/utils.py @@ -0,0 +1,254 @@ +from linger.utils import PlatForm +import onnx +from onnx import numpy_helper +from typing import Dict, Any +import torch +import torch.nn as nn +from linger.quant.ops.qconfig import get_qmodule_op, get_qtensor_op_dispatch +from linger.quant.ops.qmodule import QModuleMixin +from 
linger.quant.ops.qtensor.qtensor_mod import QModuleTensor +from linger.utils import QuantMode, dequant + +def get_param(kwargs, key): + value = kwargs.get(key) + if value is None: + raise KeyError("OP must have attribute '{}' !".format(key)) + return value + +MODE_TABLE = { + "floor": QuantMode.floor, + "floor_add": QuantMode.floor_add, + "round": QuantMode.round, + "ceil": QuantMode.ceil, +} + +def StringToQuantMode(mode: str): + try: + return MODE_TABLE[mode] + except KeyError: + raise ValueError(f"Invalid quant mode '{mode}'. Supported: {list(MODE_TABLE.keys())}") + +def create_qmodule(q_cls: QModuleMixin, torch_module: nn.Module, device: torch.device, attrs: Dict[str, Any]): + if not issubclass(q_cls, QModuleMixin): + raise TypeError(f"Expected subclass of QModuleMixin, got {q_cls.__name__}") + act_cfg = {'activate_bits': None} + w_cfg = {'weight_bits': None} + b_cfg = {'bias_bits': None} + instance = q_cls.qcreate(torch_module, act_cfg, w_cfg, b_cfg, device=device) + + quant_mode = StringToQuantMode(attrs.get("quant_mode", 'floor_add')) + data_bits = attrs.get('data_bits', 8) + parameter_bits = attrs.get("parameter_bits", 8) + bias_bits = attrs.get('bias_bits', 32) + o_bits = attrs.get('o_bits', 8) + scale_x = attrs.get('scale_x', None) + scale_w = attrs.get('scale_w', None) + scale_o = attrs.get('scale_o', None) + scale_b = None + if scale_x is not None and scale_w is not None: + scale_b = scale_x * scale_w + + if scale_x is not None: + instance.input_quantizer.round_mode = quant_mode + instance.input_quantizer.scale = torch.tensor(scale_x, dtype=torch.float32) + instance.input_quantizer.data_bits = data_bits + instance.input_quantizer.training = False + + if scale_w is not None: + instance.weight_quantizer.round_mode = quant_mode + instance.weight_quantizer.scale = torch.tensor(scale_w, dtype=torch.float32) + instance.weight_quantizer.data_bits = parameter_bits + instance.weight_quantizer.training = False + + if scale_b is not None: + 
instance.bias_quantizer.round_mode = quant_mode + instance.bias_quantizer.scale = torch.tensor(scale_b, dtype=torch.float32) + instance.bias_quantizer.data_bits = bias_bits + instance.bias_quantizer.training = False + + if scale_o is not None: + instance.output_quantizer.round_mode = quant_mode + instance.output_quantizer.scale = torch.tensor(scale_o, dtype=torch.float32) + instance.output_quantizer.data_bits = o_bits + instance.output_quantizer.training = False + + return instance + +def create_qmodule_tensor(q_cls: QModuleTensor, module: nn.Module, num_input: int, attrs: Dict[str, Any]): + if not issubclass(q_cls, QModuleTensor): + raise TypeError(f"Expected subclass of QModuleTensor, got {q_cls.__name__}") + act_cfg = {'activate_bits': None} + is_cat = attrs.get("is_cat", False) + instance = q_cls.qcreate(module, act_cfg, num_input, dim=attrs.get('dim', None)) + instance.is_cat = is_cat + + quant_mode = StringToQuantMode(attrs.get("quant_mode", None)) + data_bits = attrs.get('data_bits', 8) + o_bits = attrs.get('o_bits', 8) + scale_o = attrs.get("scale_o", 1.0) + zp_o = attrs.get('output_zero_point', 0) + + if is_cat: + for i in range(num_input): + scale = attrs.get(f'scale_x_{i}', 1.0) + zp = attrs.get(f'input_zero_point_{i}', 0) + + instance.input_quantizer[i].data_bits = o_bits + instance.input_quantizer[i].round_mode = quant_mode + instance.input_quantizer[i].scale = torch.tensor(scale, dtype=torch.float32) + instance.input_quantizer[i].training = False + else: + if num_input == 2: + scale_x = attrs.get('scale_x', 1.0) + zp_x = attrs.get("input_x_zero_point", 0) + scale_y = attrs.get('scale_y', 1.0) + zp_y = attrs.get("input_y_zeropoint", 0) + + instance.input_quantizer[0].data_bits = o_bits + instance.input_quantizer[0].round_mode = quant_mode + instance.input_quantizer[0].scale = torch.tensor(scale_x, dtype=torch.float32) + instance.input_quantizer[0].training = False + + instance.input_quantizer[1].data_bits = o_bits + 
instance.input_quantizer[1].round_mode = quant_mode + instance.input_quantizer[1].scale = torch.tensor(scale_y, dtype=torch.float32) + instance.input_quantizer[1].training = False + else: + scale_x = attrs.get('scale_x', 1.0) + zp_x = attrs.get("data_zero_point", 0) + + instance.input_quantizer[0].data_bits = data_bits + instance.input_quantizer[0].round_mode = quant_mode + instance.input_quantizer[0].scale = torch.tensor(scale_x, dtype=torch.float32) + instance.input_quantizer[0].training = False + + instance.output_quantizer.data_bits = o_bits + instance.output_quantizer.round_mode = quant_mode + instance.output_quantizer.scale = torch.tensor(scale_o, dtype=torch.float32) + instance.output_quantizer.training = False + + return instance + +def load_quantized_weights(q_instance, attrs, weights = None, bias = None): + scale_x = attrs.get('scale_x', None) + scale_w = attrs.get('scale_w', None) + scale_b = None + if scale_x is not None and scale_w is not None: + scale_b = scale_x * scale_w + + if weights is not None and scale_w is not None: + w_data = dequant(weights, scale_w) + q_instance.weight = nn.Parameter(w_data, requires_grad=False) + + if bias is not None and scale_b is not None: + b_data = dequant(bias, scale_b) + q_instance.bias = nn.Parameter(b_data, requires_grad=False) + + return q_instance + +def onnx_topologically_sort(model) : + node_degree_dict = {} + for node in model.graph.node: + node.name = node.op_type + '_' + node.output[0] + node_degree_dict[node.name] = 0 + for node in model.graph.node: + for in_node in model.graph.node: + for output in in_node.output: + if output in node.input: + node_degree_dict[node.name] += 1 + begin_node = [] + for node in model.graph.node: + if node_degree_dict[node.name] == 0: + begin_node.append(node) + sorted = [] + while len(begin_node) > 0: + child_node = begin_node.pop() + sorted.append(child_node) + for node in model.graph.node: + for output in child_node.output: + if output in node.input: + 
node_degree_dict[node.name] -= 1 + if node_degree_dict[node.name] == 0: + begin_node.append(node) + assert len(sorted) == len(model.graph.node) + + model.graph.ClearField("node") + model.graph.node.extend(sorted) + + return model + +def get_attribute_value(node, attr_name): + for attr in node.attribute: + if attr.name == attr_name: + if attr.type == onnx.AttributeProto.FLOAT: return attr.f + elif attr.type == onnx.AttributeProto.INT: return attr.i + elif attr.type == onnx.AttributeProto.FLOATS: return attr.floats + elif attr.type == onnx.AttributeProto.INTS: return attr.ints + elif attr.type == onnx.AttributeProto.STRING: return attr.s.decode('utf-8') + return None + +def parse_attribute_and_name(node): + node_attribute = dict() + for attr in node.attribute: + if attr.type == onnx.AttributeProto.AttributeType.INTS: + node_attribute[attr.name] = tuple(attr.ints) + elif attr.type == onnx.AttributeProto.AttributeType.INT: + node_attribute[attr.name] = attr.i + elif attr.type == onnx.AttributeProto.AttributeType.FLOAT: + node_attribute[attr.name] = attr.f + elif attr.type == onnx.AttributeProto.AttributeType.FLOATS: + node_attribute[attr.name] = tuple(attr.floats) + elif attr.type == onnx.AttributeProto.AttributeType.STRING: + node_attribute[attr.name] = attr.s.decode('utf-8') + elif attr.type == onnx.AttributeProto.AttributeType.TENSOR: + node_attribute[attr.name] = list(numpy_helper.to_array(node.attribute[0].t)) + elif attr.type == onnx.AttributeProto.AttributeType.GRAPH: + node_attribute[attr.name] = attr.g + else: + raise KeyError( + "The current operator({}) attribute({}) type is not supported,only support [float,int,ints,string,tensor,graph]".format(node.name,attr.name) + ) + return node.name, node_attribute + +_Method_MAP={} + +def register_op(op_type:str = None): + def decorator(func, op_type): + if op_type in _Method_MAP: + raise LookupError("Operator %s already registered!" 
%op_type) + _Method_MAP[op_type] = func + return func + + if type(op_type) != str: + func = op_type + decorator(func,func.__name__) + return func + return lambda func : decorator(func, op_type) + +def single_node_run(node:onnx.NodeProto,inputs:list): + func = _Method_MAP.get(node.op_type, None) + if func is None: + raise NotImplementedError("Current Version don't support the {} ops.".format(node.op_type)) + if node.op_type == 'Constant': + return func(node) + else: + node_name, kwargs = parse_attribute_and_name(node) + return func(inputs, kwargs) + +def if_node_run(node:onnx.NodeProto, inputs:list, inputs_dict:dict): + # If you want to know why I do this, you can customize the network to use squeeze, and then observe the onnx graph to understand + _, kwargs = parse_attribute_and_name(node) + if inputs[0] == True: + children_graph = kwargs.get("then_branch") # equal to _Method_MAP.get("Squeeze",None) + else: + children_graph = kwargs.get("else_branch") # # equal to _Method_MAP.get("Identity",None) + + children_inputs = [inputs_dict.get(children_input_name) for children_input_name in children_graph.node[0].input] + return single_node_run(children_graph.node[0], children_inputs) + +def print_method_map(): + print(_Method_MAP) + + +__all__=["get_param",'register_op','node_run','print_method_map','onnx_topologically_sort','get_attribute_value', + 'parse_attribute_and_name'] \ No newline at end of file diff --git a/linger/config.py b/linger/config.py index 9014460..c7df316 100644 --- a/linger/config.py +++ b/linger/config.py @@ -1,205 +1,204 @@ +import torch from enum import Enum - -from .utils import PlatFormQuant, Singleton - - -class LayerFusion(Enum): - LayerFusionConvBN = 1 - - -class LayerQuantPredictor(Enum): - LayerPredictorConvRelu = 1 - LayerPredictorBNRelu = 2 - - -class Configure(Singleton): - class PlatFormQuantConfig(): - platform_quant = PlatFormQuant.luna_quant - - class IQTensorConfig(): - iqmul = True - iqadd = True - iqcat = True - iqclamp = True - 
iqsigmoid = True - iqdiv = True - iqsum = True - iqtanh = True - softmaxint = True - logsoftmaxint = True - iqvar = True - - class FunctionConfig(): - linear = True - bmm = False - channel_shuffle = False - - class IQCat2AddZeroConfig(): - iqadd2addzero = False - - class BnMomentumUpdateConfig(): - disable = False - - PlatFormQuant = PlatFormQuantConfig() - IQTensor = IQTensorConfig() - FunctionQuant = FunctionConfig() - IQCat2AddZero = IQCat2AddZeroConfig() - BnMomentumUpdate = BnMomentumUpdateConfig() - -config = Configure() - -def SetPlatFormQuant(platform_quant=PlatFormQuant.luna_quant): - r"""设置是否采用luna_quant量化方式 - - Args: - platform_quant(PlatForm.luna_quant):采用何种硬件量化方式,默认luna量化方式 - - """ - config.PlatFormQuant.platform_quant = platform_quant - - - -def SetIQTensorMul(enable): - r"""设置是否启用iqadd功能,默认启用 - - Args: - enable(bool):是否启用 - - """ - config.IQTensor.iqmul = enable - - -def SetIQTensorDiv(enable): - r"""设置是否启用iqdiv功能,默认启用 - - Args: - enable(bool):是否启用 - - """ - config.IQTensor.iqdiv = enable - - -def SetIQTensorAdd(enable): - r"""设置是否启用iqadd功能,默认启用 - - Args: - enable(bool):是否启用 - - """ - config.IQTensor.iqadd = enable - - -def SetIQTensorSum(enable): - r"""设置是否启用iqdiv功能,默认启用 - - Args: - enable(bool):是否启用 - - """ - config.IQTensor.iqsum = enable - - -def SetIQTensorCat(enable): - r"""设置是否启用iqcat功能,默认启用 - - Args: - enable(bool):是否启用 - - """ - config.IQTensor.iqcat = enable - - -def SetIQTensorSigmoid(enable): - r"""设置是否启用iqsigmoid功能,默认启用 - - Args: - enable(bool):是否启用 - - """ - config.IQTensor.iqsigmoid = enable - -def SetIQTensorSoftmax(enable): - r"""设置是否启用softmaxInt功能,默认启用 - - Args: - enable(bool):是否启用 - - """ - config.IQTensor.softmaxint = enable - -def SetIQTensorLogSoftmax(enable): - r"""设置是否启用LogSoftmax功能,默认启用 - - Args: - enable(bool):是否启用 - - """ - config.IQTensor.logsoftmaxint = enable - -def SetIQTensorTanh(enable): - r"""设置是否启用iqtanh功能,默认启用 - - Args: - enable(bool):是否启用 - - """ - config.IQTensor.iqtanh = enable - - -def SetIQTensorClamp(enable): 
- r"""设置是否启用iqclamp功能,默认启用 - - Args: - enable(bool):是否启用 - - """ - config.IQTensor.iqclamp = enable - -def SetIQTensorVar(enable): - r"""设置是否启用iqVar功能,默认启用 - - Args: - enable(bool):是否启用 - - """ - config.IQTensor.iqvar = enable - - -def SetFunctionLinearQuant(enable): - r"""设置是否启用F.linear量化功能,默认启用 - - Args: - enable(bool):是否启用 - - """ - config.FunctionQuant.linear = enable - - -def SetFunctionBmmQuant(enable): - r"""设置是否启用torch.bmm量化功能,默认启用 - - Args: - enable(bool):是否启用 - - """ - config.FunctionQuant.bmm = enable - -def SetFunctionChannelShuffleQuant(enable): - config.FunctionQuant.channel_shuffle = enable - - -def SetCastorBiasInt16(bias_int16=True): - config.CastorBiasInt16.bias_int16 = bias_int16 - - -def SetBnMomentumUpdate(disable=True): - config.BnMomentumUpdate.disable = disable - - -def SetIQCat2AddZero(enable=True): - config.IQCat2AddZero.iqadd2addzero = enable - - -__all__ = ['config', 'SetPlatFormQuant', 'SetIQTensorAdd', 'SetFunctionLinearQuant', 'SetFunctionBmmQuant', 'SetIQTensorClamp', - 'SetIQTensorCat', 'SetIQTensorSigmoid', 'SetIQTensorSoftmax', 'SetIQTensorLogSoftmax', 'SetIQTensorTanh', 'SetIQTensorDiv', 'SetIQTensorMul', 'SetIQTensorSum', - 'LayerFusion', 'LayerQuantPredictor', 'SetIQCat2AddZero', 'SetBnMomentumUpdate', 'SetIQTensorVar', 'SetFunctionChannelShuffleQuant'] +from .utils import * +import os +import yaml + +# class Singleton: +# _instance = None # 保存唯一实例 +# def __new__(cls, *args, **kwargs): +# if cls._instance is None: +# cls._instance = super().__new__(cls) +# return cls._instance +# def __init__(self): +# # 只初始化一次 +# if not hasattr(self, "_initialized"): +# self._initialized = True + + +def str_to_dtype(s: str): + """将字符串转回 torch.dtype,如 'torch.float32' -> torch.float32""" + if not isinstance(s, str) or not s.startswith("torch."): + raise ValueError(f"Invalid dtype string: {s}") + attr_name = s[6:] # 去掉 "torch." 
+ if hasattr(torch, attr_name): + return getattr(torch, attr_name) + else: + raise ValueError(f"Unknown torch dtype: {s}") + +class QuantInfo(): + def __init__(self): + self.weight_bits = 8 + self.activate_bits = 8 + self.bias_bits = 32 + self.a_strategy = QuantStrategy.RANGE_MEAN + self.w_strategy = QuantStrategy.RANGE_MEAN + self.is_symmetry = True + self.is_perchannel = False + self.round_mode = QuantMode.floor_add + self.activation_type = ActivationType.none + self.qat_method = QatMethod.MOM + self.w_calibrate_name = "abs_max" + self.a_calibrate_name = "top_10" + + def to_dict(self): + return self.__dict__ + + def to_save_dict(self): + result = {} + for k, v in self.__dict__.items(): + if k.startswith('_') or k.startswith('to'): + continue + # 处理嵌套配置(如 quant_info) + if isinstance(v, Enum): + result[k] = v.name + else: + result[k] = v + return result + + def _update_from_dict(self, data: dict): + """从字典更新当前配置(支持Enum)""" + for key, value in data.items(): + if not hasattr(self, key): + continue # 忽略未知字段 + + current = getattr(self, key) + + # 如果当前是 torch.dtype,且 value 是字符串 → 转换 + if isinstance(current, Enum) and isinstance(value, str): + enum_cls = type(current) + if hasattr(enum_cls, value): + setattr(self, key, getattr(enum_cls, value)) + else: + print(f"⚠️ 无效枚举值: {value}") + else: + setattr(self, key, value) + + + + +class ClampInfo(): + def __init__(self): + self.clamp_weight_value = None + self.clamp_bias_value = None + self.clamp_activation_value = 8 + self.clamp_factor_value = 7 # for dyn clip + def to_dict(self): + return self.__dict__ + + def to_save_dict(self): + result = {} + for k, v in self.__dict__.items(): + if k.startswith('_') or k.startswith('to'): + continue + # 处理嵌套配置(如 quant_info) + if isinstance(v, Enum): + result[k] = v.name + else: + result[k] = v + return result + +class QuantConfig(Singleton): + open_quant = True + quant_method = FakeQuantMethod.NATIVE + device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + dtype = 
torch.float32 + seed = 42 + calibration = False + platform = PlatForm.venusA + + clamp_info = ClampInfo() + quant_info = QuantInfo() + + @classmethod + def _load_from_yaml(cls, config_path: str): + """ + 从 YAML 文件加载配置,并覆盖当前实例的属性 + 支持嵌套字段(如 quant_info.weight_bits) + """ + if not os.path.exists(config_path): + raise ValueError(f"配置文件 {config_path} 不存在") + try: + with open(config_path, 'r', encoding='utf-8') as f: + config_data = yaml.safe_load(f) + except: + raise ValueError(f"加载配置失败: {config_path}") + + # 设置属性 + cls._update_from_dict(config_data) + + @classmethod + def _save_to_yaml(cls, config_path: str): + """ + 将当前配置保存到 YAML 文件 + """ + config_dict = cls._to_save_dict() + os.makedirs(os.path.dirname(config_path), exist_ok=True) + with open(config_path, 'w', encoding='utf-8') as f: + yaml.dump(config_dict, f, default_flow_style=False, indent=2, allow_unicode=True) + + # 因为config写在了class里,没有通过__init__函数初始化,这里获得__dict__时需要cls + @classmethod + def _to_save_dict(cls): + """将整个配置转换为字典,包括嵌套对象""" + result = {} + for k, v in cls.__dict__.items(): + if k.startswith('_'): + continue + # 处理嵌套配置(如 quant_info) + if isinstance(v, (QuantInfo, ClampInfo)): + result[k] = v.to_save_dict() + # 处理 torch.dtype + elif isinstance(v, torch.dtype): + result[k] = str(v) + elif isinstance(v, torch.device): + result[k] = str(v) + # 处理 Enum(如果你有) + elif isinstance(v, Enum): + result[k] = v.name + else: + result[k] = v + return result + + @classmethod + def _update_from_dict(cls, data: dict): + """从字典更新当前配置(支持 dtype 和 Enum)""" + for key, value in data.items(): + if not hasattr(cls, key): + continue # 忽略未知字段 + + current = getattr(cls, key) + + # 如果当前是 torch.dtype,且 value 是字符串 → 转换 + if isinstance(value, str) and value.startswith("torch."): + try: + setattr(cls, key, str_to_dtype(value)) + except ValueError as e: + print(f"⚠️ 跳过无效 dtype: {e}") + + elif isinstance(current, torch.device) and isinstance(value, str): + try: + setattr(cls, key, torch.device(value)) + except Exception as e: + 
print(f"⚠️ 无效 device: {value}, error: {e}") + # 如果当前是 Enum,且 value 是字符串 → 转换 + elif isinstance(current, Enum) and isinstance(value, str): + enum_cls = type(current) + if hasattr(enum_cls, value): + setattr(cls, key, getattr(enum_cls, value)) + else: + print(f"⚠️ 无效枚举值: {value}") + + # 如果是嵌套对象(如 quant_info),在_set_nested_attr函数里处理,当前步跳过 + elif isinstance(current, ClampInfo): + if isinstance(value, dict): + current.__dict__.update(value) + elif isinstance(current, QuantInfo): + if isinstance(value, dict): # 自定义更新QuantInfo,支持enum类型 + current._update_from_dict(value) + + # 其他普通字段直接赋值 + else: + setattr(cls, key, value) + +QUANT_CONFIGS = QuantConfig() diff --git a/linger/constrain/SparifyFFN.py b/linger/constrain/SparifyFFN.py new file mode 100644 index 0000000..88522e4 --- /dev/null +++ b/linger/constrain/SparifyFFN.py @@ -0,0 +1,151 @@ +#!/usr/bin/env python +# -*- encoding: utf-8 -*- +import math +import torch +import torch.nn as nn +import torch.autograd as autograd +import torch.nn.functional as F +from torch.onnx import is_in_onnx_export +from torch.nn import init + +from .cutils import static_clip, dyn_clip_weight + +class GetSparifyMask(autograd.Function): + @staticmethod + def forward(ctx, outM, threshold): + # 计算要保留的元素数量 + n_elements = outM.size(-1) + k = max(1, min(n_elements, round(threshold * n_elements))) + # 获取前k个最大值的索引 + _, topk_indices = torch.topk(outM, k, dim=-1) + # 创建全False的掩码 + mask = torch.zeros_like(outM, dtype=torch.float) + # 将前k个最大值位置设为True + mask.scatter_(-1, topk_indices, True) + out_mask = mask.to(torch.float) + ctx.save_for_backward(out_mask) + return out_mask + @staticmethod + def backward(ctx, gradOutput): + (out_mask, ) = ctx.saved_tensors + return gradOutput * out_mask, None + +class SparseStrategy(): + def __init__(self, name, ratio=0.125, step_max = 3600, grad_accu=6): + self.count_max = step_max * grad_accu # 6表示梯度累计 + self.count = 0 + self.sparse_ratio = ratio # ratio表示百分之多少稀疏度 + self.name = name + + # step模式才会有 + self.step_nums = 
10 + def next_sparse_ratio(self): + self.count = self.count + 1 + if self.count >= self.count_max: + return self.sparse_ratio + if self.name == "linear": + return 1 - (1 - self.sparse_ratio) * self.count / self.count_max + elif self.name == "sqrt": + return 1 - (1 - self.sparse_ratio) * math.sqrt(self.count / self.count_max) + elif self.name == "step": + tmp = (1 - self.sparse_ratio) + step1 = tmp * self.count / self.count_max + step2 = math.ceil(step1 / (tmp / self.step_nums)) * (tmp / self.step_nums) + return 1 - step2 + +class SparifyFFN(nn.Module): + def __init__(self, in_feature, ou_feature, bias=True, normalize_data=None, normalize_weight=None, normalize_bias=None, normalize_factor=None, dtype = torch.float32): + super(SparifyFFN, self).__init__() + + self.input_size = in_feature + self.output_size = ou_feature + self.bias = bias + + self.mask_group = 8 + self.ratio = 0.125 + + self.normalize_data = normalize_data + self.normalize_weight = normalize_weight + self.normalize_bias = normalize_bias + self.normalize_factor = normalize_factor + + self.weight_fc1 = nn.Parameter(torch.empty((self.output_size, self.input_size), dtype = dtype)) + self.weight_fc2 = nn.Parameter(torch.empty((self.input_size, self.output_size), dtype = dtype)) + self.weight_mask = nn.Parameter(torch.empty((self.mask_group, self.input_size), dtype = dtype)) + + if bias: + self.bias_fc1 = nn.Parameter(torch.empty(self.output_size, dtype = dtype)) + self.bias_fc2 = nn.Parameter(torch.empty(self.input_size, dtype = dtype)) + self.bias_mask = nn.Parameter(torch.empty(self.mask_group, dtype = dtype)) + + self.repeat_num = int(self.output_size / self.mask_group) + self.spa_method = SparseStrategy("sqrt", ratio=self.ratio, step_max=20000, grad_accu=12) + + self.reset_parameters() + + def reset_parameters(self) -> None: + stdv = 1.0 / math.sqrt(self.output_size) + for weight in self.parameters(): + init.uniform_(weight, -stdv, stdv) + + def forward(self, input: torch.Tensor, open_spa_method=True) 
-> torch.Tensor: + normalized_fc1_w = self.weight_fc1 + normalized_fc2_w = self.weight_fc2 + normalized_mask_w = self.weight_mask + if self.normalize_weight is not None: + normalized_fc1_w = static_clip(normalized_fc1_w, self.normalize_weight, self.training) + normalized_fc2_w = static_clip(normalized_fc2_w, self.normalize_weight, self.training) + normalized_mask_w = static_clip(normalized_mask_w, self.normalize_weight, self.training) + elif self.normalize_factor is not None: + normalized_fc1_w = dyn_clip_weight(normalized_fc1_w, self.normalize_factor) + normalized_fc2_w = dyn_clip_weight(normalized_fc2_w, self.normalize_factor) + normalized_mask_w = dyn_clip_weight(normalized_mask_w, 7) # only support 8bit + + normalized_fc1_b = None + normalized_fc2_b = None + normalized_mask_b = None + if self.bias: + normalized_fc1_b = self.bias_fc1 + normalized_fc2_b = self.bias_fc2 + normalized_mask_b = self.bias_mask + if self.normalize_bias is not None: + normalized_fc1_b = static_clip(normalized_fc1_b, self.normalize_bias, self.training) + normalized_fc2_b = static_clip(normalized_fc2_b, self.normalize_bias, self.training) + normalized_mask_b = static_clip(normalized_mask_b, self.normalize_bias, self.training) + + outL = F.linear(input, normalized_fc1_w, normalized_fc1_b) + outM1 = F.linear(input, normalized_mask_w, normalized_mask_b) + outM = F.softmax(outM1, dim=-1) + + if self.training and open_spa_method: + threshold = self.spa_method.next_sparse_ratio() # 训练时需要改成这个 + else: + threshold = self.ratio + + # mask = topk_sort(outM, threshold) + mask = GetSparifyMask.apply(outM, threshold) + + # 第二步,outM转化为bool值后,根据out_feature扩散为与其一致的大小。在这里展示block,将channel进行分组 + outM2 = mask.repeat_interleave(self.repeat_num, dim=-1) + out1 = outL * outM2 # fc1 sparify + + out2 = F.relu(out1) + + out = F.linear(out2, normalized_fc2_w, normalized_fc2_b) + + if self.normalize_data is not None: + # out = static_clip(out, self.normalize_data, self.training, False) + 
out.clamp_(-self.normalize_data, self.normalize_data) + # import pdb; pdb.set_trace() + return out + + def extra_repr(self): + # s = nn.GRU.extra_repr(self) + s = 'in_feature:{},ou_feature:{}'.format(self.input_size, self.output_size) + # s += ', open_spa_method:{}'.format(self.open_spa_method) + + extra_s = ', normalize_data:{normalize_data}, normalize_weight:{normalize_weight}, normalize_bias:{normalize_bias}, normalize_factor:{normalize_factor}'.format( + **self.__dict__) + return s+extra_s + +__all__ = ['SparifyFFN'] diff --git a/linger/constrain/__init__.py b/linger/constrain/__init__.py new file mode 100644 index 0000000..e210038 --- /dev/null +++ b/linger/constrain/__init__.py @@ -0,0 +1,12 @@ +from .cmodule import constrain_module, _CMODULE_TABLE +from .clinear import * +from .cconv1d import * +from .cconv2d import * +from .cconvbn1d import * +from .cconvbn2d import * +from .cbatchnorm2d import * +from .cconvtranspose1d import * +from .cconvtranspose2d import * +from .clayernorm import * +from .cembedding import * +from .SparifyFFN import * \ No newline at end of file diff --git a/linger/constrain/cbatchnorm2d.py b/linger/constrain/cbatchnorm2d.py new file mode 100644 index 0000000..fd4b4aa --- /dev/null +++ b/linger/constrain/cbatchnorm2d.py @@ -0,0 +1,63 @@ +import math + +import torch +import torch.nn as nn +import torch.nn.functional as F + +from .cmodule import CModuleMixin, register_cmodule +from typing import Optional, Union, Dict, Any + +from .cutils import static_clip, dyn_clip_weight + +@register_cmodule(torch.nn.BatchNorm2d) +class CBatchNorm2d(CModuleMixin, nn.BatchNorm2d): + @classmethod + def ccreate( + cls, + module, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + module.num_features, + module.eps, + module.momentum, + module.affine, + module.track_running_stats, + dtype=module.weight.dtype, + device=device, + constrain=constrain, + ) + + def forward(self, input): + # cweight = 
self.cweight + # cbias = self.cbias + batchsize, channels, height, width = input.shape + size = batchsize * height * width + if self.training: + mean = input.sum((0, 2, 3), keepdim=True) / size + var = input.pow(2).sum((0, 2, 3), keepdim=True) / size - \ + (input.sum((0, 2, 3), keepdim=True) / size).pow(2) + var = torch.clamp(var, min=0.0) + self.running_mean = ( + 1 - self.momentum) * self.running_mean + self.momentum * mean.squeeze().detach() + self.running_var = (1 - self.momentum) * self.running_var + \ + self.momentum * var.squeeze().detach() + else: + mean = self.running_mean.reshape(1, -1, 1, 1) + var = self.running_var.reshape(1, -1, 1, 1) + sigma = 1 / torch.sqrt(var + self.eps) + alpha = self.weight.view(1, -1, 1, 1) * sigma + beta = self.bias.view(1, -1, 1, 1) - mean * alpha + if self.clamp_weight is not None: + alpha = static_clip(alpha, self.clamp_weight) + else: + alpha = dyn_clip_weight(alpha, self.clamp_factor) + + if self.clamp_bias is not None: + beta = static_clip(beta, self.clamp_bias) + + out = alpha * input + beta + return out + + diff --git a/linger/constrain/cconv1d.py b/linger/constrain/cconv1d.py new file mode 100644 index 0000000..f690939 --- /dev/null +++ b/linger/constrain/cconv1d.py @@ -0,0 +1,31 @@ +import torch +from .cmodule import CModuleMixin, register_cmodule +from typing import Optional, Union, Dict, Any + +@register_cmodule(torch.nn.Conv1d) +class CConv1d(CModuleMixin, torch.nn.Conv1d): + @classmethod + def ccreate( + cls, + module, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + in_channels=module.in_channels, + out_channels=module.out_channels, + kernel_size=module.kernel_size, + stride=module.stride, + padding=module.padding, + dilation=module.dilation, + groups=module.groups, + bias=module.bias is not None, + padding_mode=module.padding_mode, + dtype=module.weight.dtype, + device=device, + constrain=constrain, + ) + + def forward(self, input: torch.Tensor) -> 
torch.Tensor: + return self._conv_forward(input, self.cweight, self.cbias) + diff --git a/linger/constrain/cconv2d.py b/linger/constrain/cconv2d.py new file mode 100644 index 0000000..fbe3de3 --- /dev/null +++ b/linger/constrain/cconv2d.py @@ -0,0 +1,31 @@ +import torch +from .cmodule import CModuleMixin, register_cmodule +from typing import Optional, Union, Dict, Any + +@register_cmodule(torch.nn.Conv2d) +class CConv2d(CModuleMixin, torch.nn.Conv2d): + @classmethod + def ccreate( + cls, + module, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + in_channels=module.in_channels, + out_channels=module.out_channels, + kernel_size=module.kernel_size, + stride=module.stride, + padding=module.padding, + dilation=module.dilation, + groups=module.groups, + bias=module.bias is not None, + padding_mode=module.padding_mode, + dtype=module.weight.dtype, + device=device, + constrain=constrain, + ) + + def forward(self, input: torch.Tensor) -> torch.Tensor: + return self._conv_forward(input, self.cweight, self.cbias) + diff --git a/linger/constrain/cconvbn1d.py b/linger/constrain/cconvbn1d.py new file mode 100644 index 0000000..5e121df --- /dev/null +++ b/linger/constrain/cconvbn1d.py @@ -0,0 +1,134 @@ +import math + +import torch +import torch.nn as nn +import torch.nn.functional as F +from typing import Optional, Union, Dict, Any + +from .cmodule import CModuleMixin, register_cmodule +from .cutils import static_clip, dyn_clip_weight + +class ConvBN1d(nn.Module): + def __init__(self, in_channels: int, out_channels: int, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', + eps=1e-5, momentum=0.1, affine=True, track_running_stats=True, + constrain: Optional[Dict[str, Any]] = None, dtype = torch.float32) -> None: + super(CConvBN1d, self).__init__() + self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, + stride, padding, dilation, groups, bias, padding_mode) + self.bn = 
nn.BatchNorm1d( + out_channels, eps, momentum, affine, track_running_stats) + + self.constrain = {} if constrain is None else constrain + self.clamp_weight = self.constrain.get('clamp_weight_value', None) + self.clamp_bias = self.constrain.get('clamp_bias_value', None) + self.clamp_activation = self.constrain.get('clamp_activation_value', None) + self.clamp_factor = self.constrain.get('clamp_factor_value', None) + + def forward(self, input: torch.Tensor) -> torch.Tensor: + + if self.training: + conv_rlt = self.conv._conv_forward(input, self.conv.weight, self.conv.bias) # for calculate bn mean and var + N, C, H = conv_rlt.size() + bn_size = N * H + conv_rlt = conv_rlt.permute(1, 0, 2).contiguous().view(C, bn_size) + sum_ = conv_rlt.sum(1) + sum_square_ = conv_rlt.pow(2).sum(1) + mean_ = sum_ / bn_size + sum_var_ = sum_square_ - sum_ * mean_ + unbias_var_ = sum_var_ / (bn_size - 1) # 无偏方差,用 unbias_var(除 N-1)来更新 running_var(长期的、用于推理时的估计),在统计上更合理(减少估计偏差) + unbias_var_ = torch.clamp(unbias_var_, min=0.0) + self.bn.running_mean = ( + (1 - self.bn.momentum) * self.bn.running_mean + self.bn.momentum * mean_.detach()) + self.bn.running_var = ( + (1 - self.bn.momentum) * self.bn.running_var + self.bn.momentum * unbias_var_.detach()) + + bias_var_ = sum_var_ / bn_size # 计算当前 batch 的标准差用于“在该 batch 上归一化” —— 这是即时、直接的标准化数学操作 + bias_var_ = torch.clamp(bias_var_, min=0.0) + inv_std_ = 1 / (bias_var_ + self.bn.eps).pow(0.5) + bn_rlt = ((conv_rlt - mean_.unsqueeze(1)) * inv_std_.unsqueeze(1) * + self.bn.weight.unsqueeze(1) + self.bn.bias.unsqueeze(1)) + bn_rlt = bn_rlt.view(C, N, H).permute(1, 0, 2).contiguous() + w_bn_ = self.bn.weight.div(torch.sqrt(unbias_var_ + self.bn.eps)) + new_weight = self.conv.weight.mul(w_bn_.view(-1, 1, 1)) + if self.conv.bias is not None: + b_conv_ = self.conv.bias + else: + b_conv_ = torch.zeros(self.conv.weight.size(0), device=input.device) + b_bn_ = self.bn.bias - self.bn.weight.mul(mean_).div(torch.sqrt(unbias_var_ + self.bn.eps)) + new_bias = 
b_conv_.mul(w_bn_) + b_bn_ + + alpha = 0.1 + if self.clamp_weight is not None: + new_weight = static_clip(new_weight, self.clamp_weight) + else: + new_weight = dyn_clip_weight(new_weight, self.clamp_factor) + + if self.clamp_bias is not None: + new_bias = static_clip(new_bias, self.clamp_bias) + new_conv_rlt = F.conv1d(input, new_weight, new_bias, self.conv.stride, + self.conv.padding, self.conv.dilation, self.conv.groups) + output = alpha * bn_rlt + (1 - alpha) * new_conv_rlt + else: + w_bn_ = self.bn.weight.div(torch.sqrt(self.bn.eps + self.bn.running_var)) + new_weight = self.conv.weight.mul(w_bn_.view(-1, 1, 1)) + if self.conv.bias is not None: + b_conv_ = self.conv.bias + else: + b_conv_ = torch.zeros(self.conv.weight.size(0), device=input.device) + b_bn_ = self.bn.bias - self.bn.weight.mul(self.bn.running_mean).div( + torch.sqrt(self.bn.running_var + self.bn.eps)) + new_bias = b_conv_.mul(w_bn_) + b_bn_ + if self.clamp_weight is not None: + new_weight = static_clip(new_weight, self.clamp_weight) + else: + new_weight = dyn_clip_weight(new_weight, self.clamp_factor) + + if self.clamp_bias is not None: + new_bias = static_clip(new_bias, self.clamp_bias) + output = F.conv1d(input, new_weight, new_bias, self.conv.stride, + self.conv.padding, self.conv.dilation, self.conv.groups) + + if self.clamp_activation is not None: + output = static_clip(output, self.clamp_activation) + + return output + +@register_cmodule(ConvBN1d) +class CConvBN1d(CModuleMixin, ConvBN1d): + @classmethod + def ccreate( + cls, + module, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + in_channels=module.conv.in_channels, + out_channels=module.conv.out_channels, + kernel_size=module.conv.kernel_size, + stride=module.conv.stride, + padding=module.conv.padding, + dilation=module.conv.dilation, + groups=module.conv.groups, + bias=module.conv.bias is not None, + padding_mode=module.conv.padding_mode, + eps=module.bn.eps, + 
momentum=module.bn.momentum, + affine=module.bn.affine, + track_running_stats=module.bn.track_running_stats, + + dtype=module.conv.weight.dtype, + device=device, + constrain=constrain, + open_ihook=False, + open_ohook=False, + ) + + def extra_repr(self): + s = nn.Conv1d.extra_repr(self.conv) + s += ', ' + s += nn.BatchNorm1d.extra_repr(self.bn) + extra_s = ', clamp_activation:{}, clamp_weight:{}, clamp_bias:{}, clamp_factor:{}'.format(self.clamp_activation, self.clamp_weight, self.clamp_bias, self.clamp_factor) + return s + extra_s + + diff --git a/linger/constrain/cconvbn2d.py b/linger/constrain/cconvbn2d.py new file mode 100644 index 0000000..669a2b0 --- /dev/null +++ b/linger/constrain/cconvbn2d.py @@ -0,0 +1,138 @@ +import math + +import torch +import torch.nn as nn +import torch.nn.functional as F +from typing import Optional, Union, Dict, Any +from .cmodule import CModuleMixin, register_cmodule +from .cutils import static_clip, dyn_clip_weight + +class ConvBN2d(nn.Module): + def __init__(self, in_channels: int, out_channels: int, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', + eps=1e-5, momentum=0.1, affine=True, track_running_stats=True, + constrain: Optional[Dict[str, Any]] = None, dtype = torch.float32) -> None: + super(ConvBN2d, self).__init__() + self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, + stride, padding, dilation, groups, bias, padding_mode) + self.bn = nn.BatchNorm2d( + out_channels, eps, momentum, affine, track_running_stats) + + self.constrain = {} if constrain is None else constrain + self.clamp_weight = self.constrain.get('clamp_weight_value', None) + self.clamp_bias = self.constrain.get('clamp_bias_value', None) + self.clamp_activation = self.constrain.get('clamp_activation_value', None) + self.clamp_factor = self.constrain.get('clamp_factor_value', None) + + def forward(self, input: torch.Tensor) -> torch.Tensor: + if self.training: + conv_rlt = self.conv._conv_forward(input, 
self.conv.weight, self.conv.bias) # for calculate bn mean and var + N, C, H, W = conv_rlt.size() + bn_size = N * H * W + conv_rlt = conv_rlt.permute(1, 0, 2, 3).contiguous().view(C, bn_size) + sum_ = conv_rlt.sum(1) + sum_square_ = conv_rlt.pow(2).sum(1) + mean_ = sum_ / bn_size + sum_var_ = sum_square_ - sum_ * mean_ + unbias_var_ = sum_var_ / (bn_size - 1) # 无偏方差,用 unbias_var(除 N-1)来更新 running_var(长期的、用于推理时的估计),在统计上更合理(减少估计偏差) + unbias_var_ = torch.clamp(unbias_var_, min=0.0) + self.bn.running_mean = ( + (1 - self.bn.momentum) * self.bn.running_mean + self.bn.momentum * mean_.detach()) + self.bn.running_var = ( + (1 - self.bn.momentum) * self.bn.running_var + self.bn.momentum * unbias_var_.detach()) + + bias_var_ = sum_var_ / bn_size # 计算当前 batch 的标准差用于“在该 batch 上归一化” —— 这是即时、直接的标准化数学操作 + bias_var_ = torch.clamp(bias_var_, min=0.0) + inv_std_ = 1 / (bias_var_ + self.bn.eps).pow(0.5) + bn_rlt = ((conv_rlt - mean_.unsqueeze(1)) * inv_std_.unsqueeze(1) * + self.bn.weight.unsqueeze(1) + self.bn.bias.unsqueeze(1)) + bn_rlt = bn_rlt.view(C, N, H, W).permute(1, 0, 2, 3).contiguous() + w_bn_ = self.bn.weight.div(torch.sqrt(unbias_var_ + self.bn.eps)) + new_weight = self.conv.weight.mul(w_bn_.view(-1, 1, 1, 1)) + if self.conv.bias is not None: + b_conv_ = self.conv.bias + else: + b_conv_ = torch.zeros(self.conv.weight.size(0), device=input.device) + b_bn_ = self.bn.bias - self.bn.weight.mul(mean_).div(torch.sqrt(unbias_var_ + self.bn.eps)) + new_bias = b_conv_.mul(w_bn_) + b_bn_ + + alpha = 0.1 + if self.clamp_weight is not None: + new_weight = static_clip(new_weight, self.clamp_weight) + else: + new_weight = dyn_clip_weight(new_weight, self.clamp_factor) + + if self.clamp_bias is not None: + new_bias = static_clip(new_bias, self.clamp_bias) + new_conv_rlt = F.conv2d(input, new_weight, new_bias, self.conv.stride, + self.conv.padding, self.conv.dilation, self.conv.groups) + output = alpha * bn_rlt + (1 - alpha) * new_conv_rlt + else: + w_bn_ = 
self.bn.weight.div(torch.sqrt(self.bn.eps + self.bn.running_var)) + new_weight = self.conv.weight.mul(w_bn_.view(-1, 1, 1, 1)) + if self.conv.bias is not None: + b_conv_ = self.conv.bias + else: + b_conv_ = torch.zeros(self.conv.weight.size(0), device=input.device) + b_bn_ = self.bn.bias - self.bn.weight.mul(self.bn.running_mean).div( + torch.sqrt(self.bn.running_var + self.bn.eps)) + new_bias = b_conv_.mul(w_bn_) + b_bn_ + if self.clamp_weight is not None: + new_weight = static_clip(new_weight, self.clamp_weight) + else: + new_weight = dyn_clip_weight(new_weight, self.clamp_factor) + + if self.clamp_bias is not None: + new_bias = static_clip(new_bias, self.clamp_bias) + output = F.conv2d(input, new_weight, new_bias, self.conv.stride, + self.conv.padding, self.conv.dilation, self.conv.groups) + + if self.clamp_activation is not None: + output = static_clip(output, self.clamp_activation) + + return output + +@register_cmodule(ConvBN2d) +class CConvBN2d(CModuleMixin, ConvBN2d): + # def __init__(self, in_channels: int, out_channels: int, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', + # eps=1e-5, momentum=0.1, affine=True, track_running_stats=True, + # constrain: Optional[Dict[str, Any]] = None, dtype = torch.float32) -> None: + # super(CConvBN2d, self).__init__(in_channels, out_channels, kernel_size, stride, padding, dilation, groups, bias, padding_mode, + # eps, momentum, affine, track_running_stats, constrain, dtype) + + @classmethod + def ccreate( + cls, + module, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + in_channels=module.conv.in_channels, + out_channels=module.conv.out_channels, + kernel_size=module.conv.kernel_size, + stride=module.conv.stride, + padding=module.conv.padding, + dilation=module.conv.dilation, + groups=module.conv.groups, + bias=module.conv.bias is not None, + padding_mode=module.conv.padding_mode, + eps=module.bn.eps, + 
momentum=module.bn.momentum, + affine=module.bn.affine, + track_running_stats=module.bn.track_running_stats, + + dtype=module.conv.weight.dtype, + device=device, + constrain=constrain, + open_ihook=False, + open_ohook=False, + ) + + def extra_repr(self): + s = nn.Conv2d.extra_repr(self.conv) + s += ', ' + s += nn.BatchNorm2d.extra_repr(self.bn) + extra_s = ', clamp_activation:{}, clamp_weight:{}, clamp_bias:{}, clamp_factor:{}'.format(self.clamp_activation, self.clamp_weight, self.clamp_bias, self.clamp_factor) + return s + extra_s + + diff --git a/linger/constrain/cconvtranspose1d.py b/linger/constrain/cconvtranspose1d.py new file mode 100644 index 0000000..44174d8 --- /dev/null +++ b/linger/constrain/cconvtranspose1d.py @@ -0,0 +1,48 @@ +import torch +import torch.nn.functional as F +from typing import List, Optional, Tuple, Union +from .cmodule import CModuleMixin, register_cmodule +from typing import Optional, Union, Dict, Any + +@register_cmodule(torch.nn.ConvTranspose1d) +class QConvTranspose1d(CModuleMixin, torch.nn.ConvTranspose1d): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + in_channels=module.in_channels, + out_channels=module.out_channels, + kernel_size=module.kernel_size, + stride=module.stride, + padding=module.padding, + output_padding=module.output_padding, + groups=module.groups, + bias=module.bias is not None, + dilation=module.dilation, + padding_mode=module.padding_mode, + dtype=module.weight.dtype, + device=device, + activations_cfg=activations_cfg, + weights_cfg=weights_cfg, + bias_cfg=bias_cfg, + constrain = constrain, + ) + + def qforward(self, input: torch.Tensor, output_size: Optional[List[int]] = None) -> torch.Tensor: + return F.conv_transpose1d( + input, + self.qweight, + self.qbias, 
+ self.stride, + self.padding, + self.output_padding, + self.groups, + self.dilation,) + diff --git a/linger/constrain/cconvtranspose2d.py b/linger/constrain/cconvtranspose2d.py new file mode 100644 index 0000000..52e0b9c --- /dev/null +++ b/linger/constrain/cconvtranspose2d.py @@ -0,0 +1,49 @@ +import torch +import torch.nn.functional as F +from typing import List, Optional, Tuple, Union +from .cmodule import CModuleMixin, register_cmodule +from typing import Optional, Union, Dict, Any + +@register_cmodule(torch.nn.ConvTranspose2d) +class QConvTranspose2d(CModuleMixin, torch.nn.ConvTranspose2d): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + in_channels=module.in_channels, + out_channels=module.out_channels, + kernel_size=module.kernel_size, + stride=module.stride, + padding=module.padding, + output_padding=module.output_padding, + groups=module.groups, + bias=module.bias is not None, + dilation=module.dilation, + padding_mode=module.padding_mode, + dtype=module.weight.dtype, + device=device, + activations_cfg=activations_cfg, + weights_cfg=weights_cfg, + bias_cfg=bias_cfg, + constrain = constrain, + ) + + def qforward(self, input: torch.Tensor, output_size: Optional[List[int]] = None) -> torch.Tensor: + return F.conv_transpose2d( + input, + self.qweight, + self.qbias, + self.stride, + self.padding, + self.output_padding, + self.groups, + self.dilation, + ) + diff --git a/linger/constrain/cembedding.py b/linger/constrain/cembedding.py new file mode 100644 index 0000000..41df121 --- /dev/null +++ b/linger/constrain/cembedding.py @@ -0,0 +1,45 @@ +import math + +import torch +import torch.nn as nn +import torch.nn.functional as F + +from .cmodule import CModuleMixin, register_cmodule +from typing import Optional, 
Union, Dict, Any + +@register_cmodule(torch.nn.Embedding) +class CEmbedding(CModuleMixin, nn.Embedding): + @classmethod + def ccreate( + cls, + module, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + module.num_embeddings, + module.embedding_dim, + module.padding_idx, + module.max_norm, + module.norm_type, + module.scale_grad_by_freq, + module.sparse, + None, # _weight参数永远设置为None + None, # _freeze参数永远设置为None + dtype = module.weight.dtype, + device=device, + constrain=constrain, + open_ihook = False + ) + + def forward(self, input): + return F.embedding( + input, + self.cweight, + self.padding_idx, + self.max_norm, + self.norm_type, + self.scale_grad_by_freq, + self.sparse, + ) + diff --git a/linger/constrain/clayernorm.py b/linger/constrain/clayernorm.py new file mode 100644 index 0000000..a34f60d --- /dev/null +++ b/linger/constrain/clayernorm.py @@ -0,0 +1,33 @@ +import math + +import torch +import torch.nn as nn +import torch.nn.functional as F + +from .cmodule import CModuleMixin, register_cmodule +from typing import Optional, Union, Dict, Any + +@register_cmodule(torch.nn.LayerNorm) +class CLayerNorm(CModuleMixin, nn.LayerNorm): + @classmethod + def ccreate( + cls, + module, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + module.normalized_shape, + module.eps, + module.elementwise_affine, + None if module.bias is None else True, + dtype=module.weight.dtype, + device=device, + constrain=constrain, + ) + + def forward(self, input): + return F.layer_norm( + input, self.normalized_shape, self.cweight, self.cbias, self.eps + ) + diff --git a/linger/constrain/clinear.py b/linger/constrain/clinear.py new file mode 100644 index 0000000..23dc333 --- /dev/null +++ b/linger/constrain/clinear.py @@ -0,0 +1,30 @@ +import math + +import torch +import torch.nn as nn +import torch.nn.functional as F + +from .cmodule import CModuleMixin, register_cmodule 
+from typing import Optional, Union, Dict, Any + +@register_cmodule(torch.nn.Linear) +class CLinear(CModuleMixin, nn.Linear): + @classmethod + def ccreate( + cls, + module, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + module.in_features, + module.out_features, + module.bias is not None, + dtype=module.weight.dtype, + device=device, + constrain=constrain, + ) + + def forward(self, input): + return F.linear(input, self.cweight, bias=self.cbias) + diff --git a/linger/constrain/cmodule.py b/linger/constrain/cmodule.py new file mode 100644 index 0000000..2844453 --- /dev/null +++ b/linger/constrain/cmodule.py @@ -0,0 +1,174 @@ + +from abc import ABC +import torch +import torch.nn as nn +import torch.nn.functional as F +from ..config import QUANT_CONFIGS +from typing import Optional, Union, Dict, Any + +__all__ = ["CModuleMixin", "register_cmodule", "constrain_module"] + + +_CMODULE_TABLE = {} + + +def register_cmodule(module_cls): + """ + Used for registering a new constrain module. + + The CModule must implement two abstract methods: + + - qcreate: class method to instantiate a new CModule from an nn.Module, without copying its weights, + - forward: instance method for constrain inference. + + The code to register a new module looks like: + + ``` + @register_cmodule() + class MyCModule(CModuleMixin, ): + + + @classmethod + def qcreate(cls, + module: torch.nn.Module, + weights: Optional[], + activations: Optional[] = None, + optimizer: Optional[Optimizer] = None): + ... + + def forward(self, input: torch.Tensor) -> torch.Tensor: + ... 
+ ``` + + """ + + def wrapper(cls): + _CMODULE_TABLE[module_cls] = cls + return cls + + return wrapper + +def constrain_module( + module, + constrain: Optional[Dict[str, Any]] = None, +): + for cls in _CMODULE_TABLE: + if isinstance(module, cls): + qcls = _CMODULE_TABLE[cls] + return qcls.from_module(module, constrain=constrain) + return None + +class CModuleMixin(ABC): + def __init__( + self, + *args, # 原始torch.module初始化所需的参数 + device: Optional[torch.device] = None, + constrain: Optional[Dict[str, Any]] = None, # 约束策略相关的参数 + open_ihook: Optional[bool] = True, + open_ohook: Optional[bool] = True, + **kwargs, + ): + mro = self.__class__.__mro__ + if torch.nn.Module not in mro: # 必须和torch.nn.module一起被Qlinear类继承 + raise TypeError("Constrain modules must inherit from a torch.nn.Module class") + if mro.index(__class__) > mro.index(torch.nn.Module): # 继承时此类必须写在前边,torch.nn.module才能被初始化 + raise TypeError( + "CModuleMixin must be placed before any torch.nn.Module class in constrain module inheritance." 
+ ) + # This will setup the torch.nn.Module + super().__init__(*args, **kwargs) # 原始linear或conv等线性module的初始化 + + constrain = {} if constrain is None else constrain + self.clamp_weight = constrain.get('clamp_weight_value', None) + self.clamp_bias = constrain.get('clamp_bias_value', None) + self.clamp_activation = constrain.get('clamp_activation_value', None) + self.clamp_factor = constrain.get('clamp_factor_value', None) + + self._constrain_hooks = {} + # if open_ihook: + # self._constrain_hooks["input"] = self.register_forward_pre_hook(self.constrain_input) + if open_ohook: + self._constrain_hooks["output"] = self.register_forward_hook(self.constrain_output) + + @classmethod + def from_module( + cls, + module: torch.nn.Module, + constrain: Optional[Union[str]] = None, + ): + # Create the constrain module on the meta device to prevent weights intialization + device = QUANT_CONFIGS.device + cmodule = cls.ccreate(module, constrain = constrain, device=device) + if cmodule is None: + return None + + if hasattr(module, 'weight'): + with torch.no_grad(): + cmodule.weight = module.weight + if hasattr(module, 'bias') and module.bias is not None: + cmodule.bias = module.bias + + return cmodule.to(device) + + @classmethod + def ccreate( + cls, + module: torch.nn.Module, + constrain: Optional[Union[str]] = None, + device: Optional[torch.device] = None, + ): + raise NotImplementedError + + @property + def cweight(self): + if self.clamp_factor is not None: + with torch.no_grad(): + clamp_data = self.weight.abs().mean() * self.clamp_factor + else: + clamp_data = self.clamp_weight + return self.weight if clamp_data is None else torch.clamp(self.weight, min = -clamp_data, max = clamp_data) + + @property + def cbias(self): + return self.bias if self.clamp_bias is None else torch.clamp(self.bias, min = -self.clamp_bias, max = self.clamp_bias) + + + def cforward(self, input: torch.Tensor) -> torch.Tensor: + raise NotImplementedError + + def constrain_input(self, module: 
torch.nn.Module, input: torch.Tensor) -> torch.Tensor: + return input if self.clamp_activation is None else torch.clamp(input[0], min = -self.clamp_activation, max = self.clamp_activation) + + def constrain_output( + self, + module: torch.nn.Module, + input: torch.Tensor, + output: torch.Tensor, + ) -> torch.Tensor: + return output if self.clamp_activation is None else torch.clamp(output, min = -self.clamp_activation, max = self.clamp_activation) + + def extra_repr(self): + s = '' + extra_s = '' + if 'Conv2d' in self._get_name(): + s = nn.Conv2d.extra_repr(self) + elif 'Linear' in self._get_name(): + s = nn.Linear.extra_repr(self) + elif 'Maxpool2d' in self._get_name(): + s = nn.MaxPool2d.extra_repr(self) + elif 'Relu' in self._get_name(): + s = nn.ReLU.extra_repr(self) + + extra_s = ', clamp_activation:{}, clamp_weight:{}, clamp_bias:{}, clamp_factor:{}'.format(self.clamp_activation, self.clamp_weight, self.clamp_bias, self.clamp_factor) + return s + extra_s + + def __repr__(self): + extra_lines = [] + extra_repr = self.extra_repr() + if extra_repr: + extra_lines = extra_repr.split('\n') + main_str = self._get_name() + '(' + main_str += extra_lines[0] + main_str += ')' + return main_str + diff --git a/linger/constrain/cutils.py b/linger/constrain/cutils.py new file mode 100644 index 0000000..6a51c54 --- /dev/null +++ b/linger/constrain/cutils.py @@ -0,0 +1,17 @@ +import math +import torch + +def static_clip(input, clip_data, training=True, is_weight=True): + return torch.clamp(input, min = -clip_data, max = clip_data) + +def dyn_clip_weight(weight, factor): + with torch.no_grad(): + if factor is None: + factor = 3 + clamp_data = factor * weight.abs().mean() + abs_max = weight.abs().max() + clamp_data = torch.min(clamp_data, abs_max) + return torch.clamp(weight, min=-clamp_data, max=clamp_data) + + +__all__ = ['static_clip', 'dyn_clip_weight'] diff --git a/linger/conv_bn_fuser.py b/linger/conv_bn_fuser.py deleted file mode 100644 index 8923ff3..0000000 --- 
a/linger/conv_bn_fuser.py +++ /dev/null @@ -1,444 +0,0 @@ -import torch -import torch.nn - -from torch.nn.modules.conv import ConvTranspose2d -from .modules import NormalizeConvBN1d, NormalizeConvBN2d, NormalizeConvTransposeBN2d -from .ops.ops_names import LINGER_AHEAD_RELU, LINGER_AHEAD_SIGMOID, LINGER_IGNORE_PAMAMTER -from .utils import Singleton, get_device - - -class FuseableConvBN(): - def __init__(self, conv_f, conv, bn_f, bn, root_model=None): - self.conv_f = conv_f - self.conv = conv - self.bn_f = bn_f - self.bn = bn - self.scope_conv = None - self.scope_bn = None - self.root_model = None - - def set_root_model(self, root_model): - self.root_model = root_model - - -class EmptyBatchNorm(torch.nn.Module): - r"""融合后的BNmoudule占位符,没有进行任何Tensor操作 - - """ - - def __init__(self): - super(EmptyBatchNorm, self).__init__() - setattr(self, LINGER_IGNORE_PAMAMTER, - torch.nn.Parameter(torch.zeros([1]))) - - def forward(self, input): - return input - - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs): - pass - - -def fuse_conv_bn(conv, bn): - eps = 1e-5 - c_b = getattr(conv, 'bias', None) - - b_mean = bn.running_mean.data - b_var = bn.running_var.data - b_w = bn.weight.data - b_b = bn.bias.data - sigma = 1/torch.sqrt(b_var+eps) - alpha = b_w * sigma - beta = b_b - b_mean * alpha - conv.weight.data.mul_(alpha.view(-1, *([1]*(len(conv.weight.shape)-1)))) - if c_b is not None: - conv.bias.data.mul_(alpha).add_(beta) - else: - conv.bias = bn.bias - conv.bias.data.mul_(0).add_(beta) - - -class SingletonConvFusedBnModules(Singleton): - fused_conv_module = {} - fused_bn_module = {} - _is_close_register = False - - def _close_register(self): - self._is_close_register = True - - def _register(self, fuseable_conv_bn): - if self._is_close_register: - print("warning: module has initlized and linger.init may not work") - self.fused_conv_module[fuseable_conv_bn.conv] = fuseable_conv_bn - 
self.fused_bn_module[fuseable_conv_bn.bn] = fuseable_conv_bn - - def _is_registered_conv(self, conv): - f_conv = self.fused_conv_module.get(conv) - return f_conv - - def _is_registered_bn(self, bn): - f_bn = self.fused_bn_module.get(bn) - return f_bn - - def build_normalize_convbn2d_scope(self, model): - queue = [('', '', model)] - while len(queue) > 0: - (node_name, scope_name, node) = queue.pop(0) - find_fused_info = self._is_registered_conv(node) - if find_fused_info is not None: - find_fused_info.scope_conv = scope_name - if find_fused_info.root_model is None: - find_fused_info.set_root_model(model) - conv_m = find_fused_info.conv - bn_m = find_fused_info.bn - conv_have_bias = False if conv_m.bias is None else True - clamp_conv = None - device = get_device(conv_m) - ahead_relu = getattr( - conv_m, LINGER_AHEAD_RELU, False) - if type(conv_m) == torch.nn.Conv2d: - clamp_conv = NormalizeConvBN2d(in_channels=conv_m.in_channels, - out_channels=conv_m.out_channels, kernel_size=conv_m.kernel_size, stride=conv_m.stride, - padding=conv_m.padding, dilation=conv_m.dilation, groups=conv_m.groups, bias=conv_have_bias, padding_mode=conv_m.padding_mode, - eps=bn_m.eps, momentum=bn_m.momentum, affine=bn_m.affine, track_running_stats=bn_m.track_running_stats, - normalize_data=None, normalize_weight=None, normalize_bias=None, ahead_relu=ahead_relu) - elif type(conv_m) == torch.nn.Conv1d: - clamp_conv = NormalizeConvBN1d(in_channels=conv_m.in_channels, - out_channels=conv_m.out_channels, kernel_size=conv_m.kernel_size, stride=conv_m.stride, - padding=conv_m.padding, dilation=conv_m.dilation, groups=conv_m.groups, bias=conv_have_bias, padding_mode=conv_m.padding_mode, - eps=bn_m.eps, momentum=bn_m.momentum, affine=bn_m.affine, track_running_stats=bn_m.track_running_stats, - normalize_data=None, normalize_weight=None, normalize_bias=None, ahead_relu=ahead_relu) - elif type(conv_m) == torch.nn.ConvTranspose2d: - clamp_conv = NormalizeConvTransposeBN2d(in_channels= 
conv_m.in_channels, - out_channels= conv_m.out_channels, kernel_size=conv_m.kernel_size, stride=conv_m.stride, output_padding=conv_m.output_padding, - padding=conv_m.padding, dilation=conv_m.dilation,groups=conv_m.groups, bias=conv_have_bias, padding_mode=conv_m.padding_mode, - eps=bn_m.eps, momentum=bn_m.momentum, affine=bn_m.affine, track_running_stats=bn_m.track_running_stats, - normalize_data=None, normalize_weight=None, normalize_bias=None,ahead_relu = ahead_relu) - clamp_conv = clamp_conv.to(device) - setattr(find_fused_info.conv_f, node_name, clamp_conv) - else: - assert find_fused_info.root_model == model - for name, submodule in node.named_children(): - prefix = '' if scope_name == '' else scope_name+'.' - queue.append((name, prefix+name, submodule)) - - def build_empty_bn_scope(self, model): - queue = [('', '', model)] - while len(queue) > 0: - (node_name, scope_name, node) = queue.pop(0) - find_fused_info = self._is_registered_bn(node) - if find_fused_info is not None: - find_fused_info.scope_bn = scope_name - if find_fused_info.root_model is None: - find_fused_info.set_root_model(model) - else: - assert find_fused_info.root_model == model - setattr(find_fused_info.bn_f, node_name, EmptyBatchNorm()) - for name, submodule in node.named_children(): - prefix = '' if scope_name == '' else scope_name+'.' - queue.append((name, prefix+name, submodule)) - - @staticmethod - def get_module(model, scope): - attr_arr = scope.split('.') - cur_module = model - for att in attr_arr: - cur_module = getattr(cur_module, att, None) - return cur_module - - def fuse_state_dicts(self, state_dict): - for v in self.fused_conv_module.values(): - assert v.scope_conv is not None - assert v.scope_bn is not None - - class GeneralModule(): - pass - keys_bn = [] - atts_bn = {} - for key_dict in state_dict.keys(): - prefix = v.scope_bn+'.' 
- if key_dict.startswith(prefix): - attr_name = key_dict[len(prefix):] - attr_name = attr_name.split('.', 1)[0] - keys_bn.append(key_dict) - atts_bn[attr_name] = state_dict[key_dict] - keys_conv = [] - atts_conv = {} - for key_dict in state_dict.keys(): - prefix = v.scope_conv+'.' - if key_dict.startswith(prefix): - attr_name = key_dict[len(prefix):] - attr_name = attr_name.split('.', 1)[0] - atts_conv[attr_name] = state_dict[key_dict] - keys_conv.append(key_dict) - if LINGER_IGNORE_PAMAMTER not in atts_bn.keys(): - for att, att_dict in atts_conv.items(): - state_dict[v.scope_conv+'.conv.'+att] = att_dict - for att, att_dict in atts_bn.items(): - state_dict[v.scope_conv+'.bn.'+att] = att_dict - for key_bn_pop in keys_bn: - if key_bn_pop != LINGER_IGNORE_PAMAMTER: - state_dict.pop(key_bn_pop) - for key_conv_pop in keys_conv: - state_dict.pop(key_conv_pop) - - def clear(self): - if self._is_close_register: - print("warning: module has initlized and linger.clear may not work") - self.fused_conv_module.clear() - self.fused_bn_module.clear() - - def has_fuseable_items(self): - return len(self.fused_conv_module) > 0 - - -class OpNodeInfo(): - def __init__(self): - self.inputs = [] - self.outputs = [] - self.op = None - self.scope = None - - def __str__(self): - s = 'input:' - for i in self.inputs: - s += i+' ' - s += '\t output:' - for o in self.outputs: - s += o+' ' - s += '\t operator:' + self.op - s += '\t scope:' + self.scope - return s - - @staticmethod - def parse_scope_to_path(scope_str): - tail_name = scope_str.strip().split('/')[-1].strip() - assert tail_name != '' - tail_name = tail_name.replace('__module.', '', 1) - return tail_name - - def parse_scope(self): - return self.parse_scope_to_path(self.scope) - - -def _check_version(version_str, major, subor, minor): - if '+' in version_str: - version_str = version_str.split('+')[0] - version_arr = version_str.split('.') - maj = int(version_arr[0]) - sub = int(version_arr[1]) - mi = int(version_arr[2]) - if maj >= 
major and sub >= subor and mi >= minor: - return True - return False - - -def get_op_nodes(graph): - nodes = [] - for n in graph.nodes(): - op_node = OpNodeInfo() - for i in n.inputs(): - op_node.inputs.append(i.debugName()) - for o in n.outputs(): - op_node.outputs.append(o.debugName()) - op_node.op = n.kind() - op_node.scope = n.scopeName() - nodes.append(op_node) - return nodes - - -def find_adjoin_layer(src_node_name, may_be_dst_layers, dict_input, dict_output, src_node_must_be_layers=None): - find_nodes = [] - for dst_layer in may_be_dst_layers: - - input_tensor = dst_layer.inputs[0] - src_node = dict_output[input_tensor] - if src_node != None and src_node.op == src_node_name: - input_node_set = dict_input[input_tensor] - if len(input_node_set) == 1 and dst_layer in input_node_set: - if src_node_must_be_layers is None: - find_nodes.append((src_node, dst_layer)) - elif src_node in src_node_must_be_layers: - find_nodes.append((src_node, dst_layer)) - target_set = set([]) - src_set = set([]) - for s, d in find_nodes: - src_set.add(s) - target_set.add(d) - return find_nodes, src_set, target_set - - -def find_adjoin_adjoin_layer(src_node_name, mid_node_name, may_be_dst_layers, dict_input, dict_output): - find_nodes = [] - for dst_layer in may_be_dst_layers: - dst_input_tensor = dst_layer.inputs[0] - mid_node = dict_output[dst_input_tensor] - if mid_node != None and mid_node.op == mid_node_name: - mid_input_node_set = dict_input[dst_input_tensor] - if len(mid_input_node_set) == 1 and dst_layer in mid_input_node_set: - mid_input_tensor = mid_node.inputs[0] - src_node = dict_output[mid_input_tensor] - if src_node != None and src_node.op == src_node_name: - src_input_node_set = dict_input[mid_input_tensor] - if len(src_input_node_set) == 1 and mid_node in src_input_node_set: - find_nodes.append((src_node, mid_node, dst_layer)) - src_set = set([]) - mid_set = set([]) - dst_set = set([]) - for (s, m, d) in find_nodes: - src_set.add(s) - mid_set.add(m) - dst_set.add(d) - 
return find_nodes, src_set, mid_set, dst_set - - -def filter_layers(node_arr, op_name): - list_node = [] - for n in node_arr: - if n.op == op_name: - list_node.append(n) - return list_node - - -def parse_fuseable_conv_bn(node_arr, fused_bn=True, ahead_conv_relu=True, ahead_bn_relu=True, ahead_linear_relu=True, ahead_conv_sigmoid=True, ahead_linear_sigmoid=True): - dict_output = {} - for n in node_arr: - for o in n.outputs: - dict_output[o] = n - dict_input = {} - for n in node_arr: - for i in n.inputs: - if dict_input.get(i) == None: - dict_input[i] = set([n]) - else: - dict_input[i].add(n) - list_bn = filter_layers(node_arr, 'aten::batch_norm') - list_relu = filter_layers(node_arr, 'aten::relu') - list_sigmoid = filter_layers(node_arr, 'aten::sigmoid') - fused_conv_sigmoid = [] - fused_linear_sigmoid = [] - fused_linear_bias_sigmoid = [] - fused_conv_bn = [] - fused_conv_bn_relu = [] - fused_conv_relu = [] - fused_bn_relu = [] - fused_linear_relu = [] - fused_linear_bias_relu = [] - if fused_bn: - fused_conv_bn, _, _ = find_adjoin_layer( - 'aten::_convolution', list_bn, dict_input, dict_output) - if ahead_bn_relu: - fused_conv_bn_relu, _, _, _ = find_adjoin_adjoin_layer( - 'aten::_convolution', 'aten::batch_norm', list_relu, dict_input, dict_output) - if ahead_conv_relu: - fused_conv_relu, _, _ = find_adjoin_layer( - 'aten::_convolution', list_relu, dict_input, dict_output) - if ahead_conv_sigmoid: - fused_conv_sigmoid, _, _ = find_adjoin_layer( - 'aten::_convolution', list_sigmoid, dict_input, dict_output) - if ahead_linear_sigmoid: - fused_linear_bias_sigmoid, _, _, _ = find_adjoin_adjoin_layer( - 'aten::matmul', 'aten::add_', list_sigmoid, dict_input, dict_output) - fused_linear_sigmoid, _, _ = find_adjoin_layer( - 'aten::matmul', list_sigmoid, dict_input, dict_output) - if ahead_bn_relu: - fused_bn_relu, _, _ = find_adjoin_layer( - 'aten::batch_norm', list_relu, dict_input, dict_output) - if ahead_linear_relu: - fused_linear_bias_relu, _, _, _ = 
find_adjoin_adjoin_layer( - 'aten::matmul', 'aten::add_', list_relu, dict_input, dict_output) - fused_linear_relu, _, _ = find_adjoin_layer( - 'aten::matmul', list_relu, dict_input, dict_output) - - return fused_conv_bn, fused_conv_bn_relu, fused_conv_relu, fused_bn_relu, fused_linear_relu, fused_linear_bias_relu, fused_conv_sigmoid, fused_linear_sigmoid, fused_linear_bias_sigmoid - - -def scope_to_module(root_module, scope): - tail_name = OpNodeInfo.parse_scope_to_path(scope) - module_arr_name = tail_name.split('.') - module_cur = root_module - module_cur_name = '' - moduel_cur_father = root_module - str_find = '' - for sub_att_name in module_arr_name: - str_find += sub_att_name+"." - moduel_cur_father = module_cur - module_cur = getattr(module_cur, sub_att_name) - module_cur_name = sub_att_name - assert module_cur is not None, 'can not find '+str_find - return (moduel_cur_father, module_cur, module_cur_name) - - -def FuseConvBNAheadRelu(model, *args, fused_bn=True, ahead_conv_relu=True, ahead_bn_relu=True, ahead_linear_relu=True, ahead_conv_sigmoid=True, ahead_linear_sigmoid=True): - SingletonConvFusedBnModules().clear() - assert _check_version(torch.__version__, 1, 5, - 0), 'error: torch version must greater than 1.5' - graph = torch.jit.trace(model, *args) - node_arr = get_op_nodes(graph.inlined_graph) - fuseable_conv_bn, fuseable_conv_bn_relu, fuseable_conv_relu, fuseable_bn_relu, fuseable_linear_relu, fuseable_linear_bias_relu, fuseable_conv_sigmoid, fuseable_linear_sigmoid, fuseable_linear_bias_sigmoid = parse_fuseable_conv_bn( - node_arr, fused_bn, ahead_conv_relu, ahead_bn_relu, ahead_linear_relu, ahead_conv_sigmoid, ahead_linear_sigmoid) - module_paths = [] - if fused_bn: - if ahead_bn_relu: - for (conv, bn, _) in fuseable_conv_bn_relu: - _, conv_module, _ = scope_to_module(model, conv.scope) - setattr(conv_module, LINGER_AHEAD_RELU, True) - for (conv, bn) in fuseable_conv_bn: - conv_module_father, conv_module, conv_module_name = scope_to_module( - model, 
conv.scope) - bn_module_father, bn_module, bn_module_name = scope_to_module( - model, bn.scope) - if (type(conv_module) in (torch.nn.Conv2d, torch.nn.ConvTranspose2d) and type(bn_module) == torch.nn.BatchNorm2d) or \ - (type(conv_module) in (torch.nn.Conv1d, ) and type(bn_module) == torch.nn.BatchNorm1d): - fuseableconv_bn = FuseableConvBN( - conv_module_father, conv_module, bn_module_father, bn_module) - SingletonConvFusedBnModules()._register(fuseableconv_bn) - module_paths.append((conv.parse_scope(), bn.parse_scope())) - if ahead_conv_relu: - for (conv, _) in fuseable_conv_relu: - _, conv_module, _ = scope_to_module(model, conv.scope) - setattr(conv_module, LINGER_AHEAD_RELU, True) - if ahead_bn_relu: - for (bn, _) in fuseable_bn_relu: - _, bn_module, _ = scope_to_module(model, bn.scope) - setattr(bn_module, LINGER_AHEAD_RELU, True) - if ahead_conv_sigmoid: - for (conv, _) in fuseable_conv_sigmoid: - _, conv_module, _ = scope_to_module(model, conv.scope) - setattr(conv_module, LINGER_AHEAD_SIGMOID, True) - if ahead_linear_sigmoid: - for (linear, _) in fuseable_linear_sigmoid: - _, linear_module, _ = scope_to_module(model, linear.scope) - setattr(linear_module, LINGER_AHEAD_SIGMOID, True) - for(linear, add, _) in fuseable_linear_bias_sigmoid: - _, linear_module, _ = scope_to_module(model, linear.scope) - setattr(linear_module, LINGER_AHEAD_SIGMOID, True) - if ahead_linear_relu: - for(linear, _) in fuseable_linear_relu: - _, linear_module, _ = scope_to_module(model, linear.scope) - setattr(linear_module, LINGER_AHEAD_RELU, True) - for(linear, add, _) in fuseable_linear_bias_relu: - _, linear_module, _ = scope_to_module(model, linear.scope) - setattr(linear_module, LINGER_AHEAD_RELU, True) - return module_paths - - -def FuseBNIntoConv(model, *args): - r"""融合BN操作到Conv里 - - Args: - model(torch.nn.Module):模型 - *args:模型的trace位置参数 - **kwargs:模型trace的keyword参数 - Example: - >>> net1 = shufflenet_v2_x1_0(pretrained=False) - >>> net1.load_state_dict(torch.load(dict_file)) - 
>>> aa = net1(input) - >>> linger.FuseBNIntoConv(net1,dummy_input) - >>> net2 = linger.init(net1) - >>> net2.load_state_dict(torch.load(dict_file)) - """ - assert False, 'FuseBNIntoConv is deprecated please use linger.trace_layers(root_net, trace_net, dummy_input)' - - -__all__ = ['FuseBNIntoConv', 'SingletonConvFusedBnModules', - 'EmptyBatchNorm', 'FuseConvBNAheadRelu'] diff --git a/linger/dumper.py b/linger/dumper.py deleted file mode 100644 index 168844a..0000000 --- a/linger/dumper.py +++ /dev/null @@ -1,163 +0,0 @@ -import os -import re - -import numpy as np -import prettytable as pt -import torch -import torch.nn as nn - -from .ops import * -from .tools.weight_bias_analyse import clamp_with_dynamic - -tb_all = pt.PrettyTable() - -tb_all.field_names = ["Layer_name", "Mean", "Max", - "Multiple(Max/Mean)", "Dynamic 0.99", "Versu(Max/Dynamic)"] - - -def _hook_forward_anylse(module, input, output): - if 1: # not module.training: - assert hasattr(module, LINGER_DUMP_NAME) - file_path = getattr(module, LINGER_DUMP_NAME) - if type(output) is tuple: - if isinstance(module, nn.GRU) or isinstance(module, nn.LSTM): - if isinstance(output[0], torch.nn.utils.rnn.PackedSequence): - dump_out = output[0][0] - else: - dump_out = output[0] - else: - if type(output[1]) is tuple: - dump_out = torch.cat( - (output[0], output[1][0], output[1][1])) - else: - dump_out = torch.cat(output) - else: - dump_out = output - - file_path = file_path[5:] # remove "root." 
- - clamp_with_dynamic(dump_out.detach(), dynamic_percent=0.9, - layer_name=file_path, tb_all=tb_all) - - -def _hook_forward(module, input, output): - if not module.training: - assert hasattr(module, LINGER_DUMP_NAME) - file_path = getattr(module, LINGER_DUMP_NAME) - if type(input) is tuple: - if len(input) == 1: - dump_in = input[0] # 单一输入 在此处也是tuple类型 - else: - dump_in = input[0] # 多个输入 只保存第一个输入 - else: - dump_in = input - if type(output) is tuple: - if isinstance(module, nn.GRU) or isinstance(module, nn.LSTM): - if isinstance(output[0], torch.nn.utils.rnn.PackedSequence): - dump_out = output[0][0] - else: - dump_out = output[0] - else: - if type(output[1]) is tuple: - dump_out = torch.cat( - (output[0], output[1][0], output[1][1])) - else: - dump_out = torch.cat(output) - else: - dump_out = output - np.savetxt(file_path+'_input_float', - dump_in.detach().reshape(-1).cpu().numpy(), fmt='%f') - np.savetxt(file_path+'_output_float', - dump_out.detach().reshape(-1).cpu().numpy(), fmt='%f') - - -def _dfs_submodules(model): - dfs_modules = [] - stack = [('root', model)] - while len(stack) > 0: - (name_m, m) = stack.pop() - children_num = 0 - for name, submodule in m.named_children(): - stack.append((name_m+'/'+name, submodule)) - children_num += 1 - if children_num == 0: - dfs_modules.append((name_m, m)) - dfs_modules.reverse() - return dfs_modules - - -class Dumper(): - def __init__(self): - self.module_dump_quanted = [] - self.module_hooks_dump_all = [] - - def __enter__(self): - self.module_dump_quanted = [] - self.module_hooks_dump_all = [] - return self - - def __exit__(self, type, value, trace): - self._clear_dump_quanted() - self._clear_dump_model() - - def enable_dump_quanted(self, model: nn.Module, path: str = "./dump", match_pattern: str = ".*"): - if model.training: - print("error model must be eval when dump,call model.eval() before dump") - exit(-1) - if not os.path.exists(path): - os.makedirs(path) - queue = [("root", model)] - while len(queue) > 0: - 
name_m, node = queue.pop(0) - for name, submodule in node.named_children(): - prefix = name_m + '.' + name - queue.append((prefix, submodule)) - if isinstance(submodule, SupportQuantedIntModules) and re.match(match_pattern, prefix) is not None: - submodule.prefix = prefix - submodule.dump = True - submodule.path = path - self.module_dump_quanted.append(submodule) - - def _clear_dump_quanted(self): - for op in self.module_dump_quanted: - op.prefix = '' - op.dump = False - op.path = None - self.module_dump_quanted = [] - - def enable_dump_model(self, model: nn.Module, path: str ="./dump", match_pattern: str =".*", hook_forward=_hook_forward): - if model.training: - print("error model must be eval when dump,call model.eval() before dump") - exit(-1) - if not os.path.exists(path): - os.makedirs(path) - leaf_all_modules = _dfs_submodules(model) - for name, leaf_module in leaf_all_modules: - dump_name = name.replace('/', '.') - if re.match(match_pattern, dump_name) is not None: - setattr(leaf_module, LINGER_DUMP_NAME, - os.path.join(path, dump_name)) - hook_handle = leaf_module.register_forward_hook(_hook_forward) - self.module_hooks_dump_all.append((leaf_module, hook_handle)) - - def _clear_dump_model(self): - for (m, hook) in self.module_hooks_dump_all: - hook.remove() - delattr(m, LINGER_DUMP_NAME) - self.module_hooks_dump_all = [] - - def analyse_layer_output(self, model: nn.Module, match_pattern: str =".*"): - leaf_all_modules = _dfs_submodules(model) - - for name, leaf_module in leaf_all_modules: - dump_name = name.replace('/', '.') - if re.match(match_pattern, dump_name) is not None: - setattr(leaf_module, LINGER_DUMP_NAME, dump_name) - hook_handle = leaf_module.register_forward_hook( - _hook_forward_anylse) - self.module_hooks_dump_all.append((leaf_module, hook_handle)) - - def save_out_analyse_log(self, save_log_path: str ="Analyse_layer_output.log"): - out_flie = open(save_log_path, 'w') - out_flie.write(str(tb_all)) - out_flie.close() diff --git 
a/linger/initialize.py b/linger/initialize.py index 5ecc589..dda021e 100644 --- a/linger/initialize.py +++ b/linger/initialize.py @@ -1,625 +1,155 @@ import itertools -from typing import Tuple +import operator +from contextlib import contextmanager +from fnmatch import fnmatch import torch import torch.nn as nn +# from torch.fx import symbolic_trace, GraphModule + +from .quant.qtensor import QTensor +from .quant.ops import * +from .config import QuantConfig, QUANT_CONFIGS +from .quant.ops.qconfig import _QMODULE_TABLE, _QTENSOR_OP_TABLE, quantize_module, quantize_tensor +from .constrain.cmodule import constrain_module, _CMODULE_TABLE +from typing import Any, Dict, List, Optional, Union + +@contextmanager +def calibration(a_calibrate_name='top_10', w_calibrate_name='abs_max'): + # 保存旧值 + old_a_calibrate_name = QUANT_CONFIGS.quant_info.a_calibrate_name + old_w_calibrate_name = QUANT_CONFIGS.quant_info.w_calibrate_name + try: + QUANT_CONFIGS.calibration = True + QUANT_CONFIGS.quant_info.a_calibrate_name = a_calibrate_name + QUANT_CONFIGS.quant_info.w_calibrate_name = w_calibrate_name + yield # <<< 关键点:控制权交给 with 块 + finally: + QUANT_CONFIGS.calibration = False + +def const_module(module: nn.Module, c_activation_val: float = 8.0, c_weight_val: float = 8.0, c_bias_val = None, c_weight_factor = None): + for name, m in module.named_modules(): + if hasattr(m, 'clamp_weight'): + m.clamp_weight = c_weight_val + m.clamp_factor = c_weight_factor + if hasattr(m, 'clamp_bias'): + m.clamp_bias = c_bias_val + if hasattr(m, 'clamp_activation'): + m.clamp_activation = c_activation_val + +def quant_module(module: nn.Module, c_activation_val: float = 8.0, c_weight_val: float = 8.0, c_bias_val = None, c_weight_factor = None, data_bits: int = 8, weight_bits: int = 8, bias_bits: int = 32, out_bits: int = 8): + for name, m in module.named_modules(): + if hasattr(m, 'input_quantizer') and m.input_quantizer is not None: + m.input_quantizer.data_bits = data_bits + if hasattr(m, 
'weight_quantizer') and m.weight_quantizer is not None: + m.weight_quantizer.data_bits = weight_bits + m.weight_quantizer.clamp_weight_value = c_weight_val + m.weight_quantizer.clamp_factor_value = c_weight_factor + if hasattr(m, 'bias_quantizer') and m.bias_quantizer is not None: + m.bias_quantizer.data_bits = bias_bits + m.bias_quantizer.clamp_bias_value = c_bias_val + if hasattr(m, 'output_quantizer') and m.output_quantizer is not None: + m.output_quantizer.data_bits = out_bits + m.output_quantizer.clamp_activation_value = c_activation_val + +def constrain(model: nn.Module, config_file: str = None, disable_module=None, disable_submodel=None): + c_configs = QUANT_CONFIGS + if config_file is not None: + c_configs._load_from_yaml(config_file) + + if disable_module is not None: + for name in disable_module: + if _CMODULE_TABLE.get(name, None) is not None: + _CMODULE_TABLE.pop(name) + + for name, m in model.named_modules(): + if disable_submodel is not None and any(fnmatch(name, pattern) for pattern in disable_submodel): + continue + _constrain_submodule(model, name, m, c_configs.clamp_info.to_dict()) + + model.to(c_configs.device) + return model -import linger - -from .config import config -from .modules import * -from .onnx import export as linger_export -from .ops import * -from .ops.iqtensor import iqAddLayer, iqDivLayer, iqMulLayer, iqSumLayer -from .ops.linger_functional import bmm as linger_bmm -from .ops.linger_functional import cat as linger_cat -from .ops.linger_functional import channel_shuffle_quant -from .ops.linger_functional import clamp as linger_clamp -from .ops.linger_functional import clamp_ as linger_clamp_ -from .ops.linger_functional import dropout as linger_dropout -from .ops.linger_functional import (iqCatLayer, iqClampLayer, iqSigmoidLayer, - iqTanhLayer, iqVarLayer) -from .ops.linger_functional import logsoftmax as linger_logsoftmax -from .ops.linger_functional import \ - pack_padded_sequence as linger_pack_padded_sequence -from 
.ops.linger_functional import \ - pad_packed_sequence as linger_pad_packed_sequence -from .ops.linger_functional import sigmoid as linger_sigmoid -from .ops.linger_functional import sigmoid_ as linger_sigmoid_ -from .ops.linger_functional import softmax as linger_softmax -from .ops.linger_functional import tanh as linger_tanh -from .ops.linger_functional import tanh_ as linger_tanh_ -from .ops.linger_functional import (torch_pack_padded_sequence, - torch_pad_packed_sequence) -from .ops.linger_functional import var as linger_var -from .ops.module_self import hook_forward, hook_pre_forward -from .ops.ops_names import (LINGER_AHEAD_RELU, LINGER_AHEAD_SIGMOID, - LINGER_MIX_INT8_MANUAL_ROUND_LAYERS, LINGER_MODE, - LINGER_OBIT) -from .utils import QuantInfo, QuantMode, Singleton, get_device, logger - -__all__ = ["disable_quant", "quant_module", - "quant_module_by_type", "quant_tensor", "init"] - - -class _SingletonContainCustomModules(Singleton): - customized_quant_list = {} - _is_close_register = False - - def _close_register(self): - self._is_close_register = True - - def _register(self, module, quant_info): - assert isinstance(module, torch.nn.Module) or isinstance(module, list) - modules = [module] if isinstance(module, torch.nn.Module) else module - for each_mod in modules: - if self._is_close_register: - print("warning: module has initlized and linger.init may not work") - self.customized_quant_list[each_mod] = quant_info - - def _is_registered(self, module): - return (module in self.customized_quant_list.keys()) - - def get(self, module): - return self.customized_quant_list.get(module) - - def clear(self): - if self._is_close_register: - print("warning: module has initlized and linger.clear may not work") - self.customized_quant_list.clear() - - -def fuse_state_dict(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs): - - eps = 1e-5 - clamp_conv_name = prefix + 'conv' - clamp_bn_name = prefix + 'bn' - conv_int_name = prefix - if 
clamp_conv_name + '.weight' in state_dict and clamp_bn_name + '.weight' in state_dict: - b_mean = state_dict[clamp_bn_name + '.running_mean'] - b_var = state_dict[clamp_bn_name + '.running_var'] - b_w = state_dict[clamp_bn_name + '.weight'] - b_b = state_dict[clamp_bn_name + '.bias'] - sigma = 1 / torch.sqrt(b_var + eps) - alpha = b_w * sigma - beta = b_b - b_mean * alpha - c_w = state_dict[clamp_conv_name + '.weight'] - state_dict[conv_int_name + - 'weight'] = (c_w * alpha.view(-1, *([1]*(len(c_w.shape)-1)))) - if clamp_conv_name + '.bias' in state_dict: - c_b = state_dict[clamp_conv_name + '.bias'] - state_dict[conv_int_name + 'bias'] = (c_b * alpha + beta) - state_dict.pop(clamp_conv_name + '.bias') - else: - state_dict[conv_int_name + 'bias'] = beta - state_dict.pop(clamp_bn_name + '.running_mean') - state_dict.pop(clamp_bn_name + '.running_var') - state_dict.pop(clamp_bn_name + '.weight') - state_dict.pop(clamp_bn_name + '.bias') - state_dict.pop(clamp_bn_name + '.num_batches_tracked') - state_dict.pop(clamp_conv_name + '.weight') - else: - assert clamp_conv_name + '.weight' not in state_dict and clamp_bn_name + \ - '.weight' not in state_dict, 'load quanted model but contain float clamp params' - -def fuse_state_dict_deconv(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs): - - eps = 1e-5 - clamp_conv_name = prefix + 'conv' - clamp_bn_name = prefix + 'bn' - conv_int_name = prefix - if clamp_conv_name + '.weight' in state_dict and clamp_bn_name + '.weight' in state_dict: - b_mean = state_dict[clamp_bn_name + '.running_mean'] - b_var = state_dict[clamp_bn_name + '.running_var'] - b_w = state_dict[clamp_bn_name + '.weight'] - b_b = state_dict[clamp_bn_name + '.bias'] - sigma = 1 / torch.sqrt(b_var + eps) - alpha = b_w * sigma - beta = b_b - b_mean * alpha - c_w = state_dict[clamp_conv_name + '.weight'] - cin, cout_div_groups, *hw = c_w.shape - groups = b_b.shape[0] // cout_div_groups - conv_weight = c_w.view(groups, cin // 
groups, cout_div_groups, *hw ) - new_weight = conv_weight.mul(alpha.view(groups, 1, -1, *([1]*(len(c_w.shape)-2)))).view(cin, cout_div_groups, *hw) - state_dict[conv_int_name + 'weight'] = new_weight - if clamp_conv_name + '.bias' in state_dict: - c_b = state_dict[clamp_conv_name + '.bias'] - state_dict[conv_int_name + 'bias'] = (c_b * alpha + beta) - state_dict.pop(clamp_conv_name + '.bias') - else: - state_dict[conv_int_name + 'bias'] = beta - state_dict.pop(clamp_bn_name + '.running_mean') - state_dict.pop(clamp_bn_name + '.running_var') - state_dict.pop(clamp_bn_name + '.weight') - state_dict.pop(clamp_bn_name + '.bias') - state_dict.pop(clamp_bn_name + '.num_batches_tracked') - state_dict.pop(clamp_conv_name + '.weight') - else: - assert clamp_conv_name + '.weight' not in state_dict and clamp_bn_name + '.weight' not in state_dict, 'load quanted model but contain float clamp params' - -def _replaceOp(submodule, mode, in_data_bits, parameter_bits, out_bits=None): - assert in_data_bits > 0 and in_data_bits <= 32, "in_data_bits should between 0 and 32" - assert parameter_bits > 0 and parameter_bits <= 32, "parameter_bits should between 0 and 32" - assert out_bits is None or out_bits > 0 and out_bits <= 32, "out_bits should between 0 and 32" - assert mode == QuantMode.QValue, "mode support only q_value" - if isinstance(submodule, tuple(SupportQuantTorchModules)): - if isinstance(submodule, NormalizeFastGRU): - gru = GRUInt(submodule.input_size, submodule.hidden_size, submodule.num_layers, submodule.bias, submodule.batch_first, submodule.dropout, submodule.bidirectional, - data_bits=in_data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=out_bits, - clamp_data=submodule.normalize_data, clamp_weight=submodule.normalize_weight, clamp_bias=submodule.normalize_bias) - return gru - elif isinstance(submodule, NormalizeFastLSTM): - lstm = LSTMInt(submodule.input_size, submodule.hidden_size, submodule.num_layers, submodule.bias, submodule.batch_first, 
submodule.dropout, submodule.bidirectional, - data_bits=in_data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=out_bits, - clamp_data=submodule.normalize_data, clamp_weight=submodule.normalize_weight, clamp_bias=submodule.normalize_bias) - return lstm - elif isinstance(submodule, NormalizeConvBN1d): - # ahead_relu = getattr(submodule,LINGER_AHEAD_RELU,False) - conv = Conv1dInt(submodule.conv.in_channels, submodule.conv.out_channels, submodule.conv.kernel_size, submodule.conv.stride, submodule.conv.padding, submodule.conv.dilation, submodule.conv.groups, - True, submodule.conv.padding, data_bits=in_data_bits, parameter_bits=parameter_bits, o_bits=out_bits, mode=mode, - clamp_data=submodule.normalize_data, clamp_weight=submodule.normalize_weight, clamp_bias=submodule.normalize_bias, ahead_relu=submodule.ahead_relu) - conv._register_load_state_dict_pre_hook(fuse_state_dict) - return conv - elif isinstance(submodule, NormalizeConvBN2d): - # ahead_relu = getattr(submodule,LINGER_AHEAD_RELU,False) - conv = Conv2dInt(submodule.conv.in_channels, submodule.conv.out_channels, submodule.conv.kernel_size, submodule.conv.stride, submodule.conv.padding, submodule.conv.dilation, submodule.conv.groups, - True, submodule.conv.padding, data_bits=in_data_bits, parameter_bits=parameter_bits, o_bits=out_bits, mode=mode, - clamp_data=submodule.normalize_data, clamp_weight=submodule.normalize_weight, clamp_bias=submodule.normalize_bias, ahead_relu=submodule.ahead_relu) - conv._register_load_state_dict_pre_hook(fuse_state_dict) - return conv - elif isinstance(submodule, NormalizeConvTransposeBN2d): - # ahead_relu = getattr(submodule,IFLYTEK_BITBRAIN_AHEAD_RELU,False) - conv = ConvTranspose2dInt(submodule.conv.in_channels, submodule.conv.out_channels, submodule.conv.kernel_size, submodule.conv.stride, submodule.conv.padding, submodule.conv.output_padding, submodule.conv.groups, - True, submodule.conv.dilation, submodule.conv.padding_mode, data_bits= 
in_data_bits,parameter_bits=parameter_bits,mode=mode,o_bits=out_bits, - clamp_data=submodule.normalize_data, clamp_weight=submodule.normalize_weight, clamp_bias=submodule.normalize_bias) - conv._register_load_state_dict_pre_hook(fuse_state_dict_deconv) - return conv - elif isinstance(submodule, NormalizeConv1d): - bias = True if submodule.bias is not None else False - # ahead_relu = getattr(submodule,LINGER_AHEAD_RELU,False) - conv = Conv1dInt(submodule.in_channels, submodule.out_channels, submodule.kernel_size, submodule.stride, submodule.padding, submodule.dilation, submodule.groups, - bias, submodule.padding_mode, data_bits=in_data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=out_bits, - clamp_data=submodule.normalize_data, clamp_weight=submodule.normalize_weight, clamp_bias=submodule.normalize_bias, ahead_relu=submodule.ahead_relu) - return conv - elif isinstance(submodule, NormalizeConv2d): - bias = True if submodule.bias is not None else False - # ahead_relu = getattr(submodule,LINGER_AHEAD_RELU,False) - conv = Conv2dInt(submodule.in_channels, submodule.out_channels, submodule.kernel_size, submodule.stride, submodule.padding, submodule.dilation, submodule.groups, - bias, submodule.padding_mode, data_bits=in_data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=out_bits, - clamp_data=submodule.normalize_data, clamp_weight=submodule.normalize_weight, clamp_bias=submodule.normalize_bias, - ahead_relu=submodule.ahead_relu, ahead_sigmoid=submodule.ahead_sigmoid) - return conv - elif isinstance(submodule, NormalizeConvTranspose2d): - bias = True if submodule.bias is not None else False - conv_transpose = ConvTranspose2dInt(submodule.in_channels, submodule.out_channels, submodule.kernel_size, submodule.stride, submodule.padding, submodule.output_padding, submodule.groups, - bias, submodule.dilation, submodule.padding_mode, data_bits=in_data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=out_bits, - clamp_data=submodule.normalize_data, 
clamp_weight=submodule.normalize_weight, clamp_bias=submodule.normalize_bias) - return conv_transpose - elif isinstance(submodule, NormalizeLinear): - bias = True if submodule.bias is not None else False - linear = LinearInt(submodule.in_features, submodule.out_features, bias, data_bits=in_data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=out_bits, - clamp_data=submodule.normalize_data, clamp_weight=submodule.normalize_weight, clamp_bias=submodule.normalize_bias, - ahead_relu=submodule.ahead_relu, ahead_sigmoid=submodule.ahead_sigmoid) - return linear - elif isinstance(submodule, NormalizeBatchNorm2d): - # ahead_relu = getattr(submodule,LINGER_AHEAD_RELU,False) - if config.BnMomentumUpdate.disable: - submodule_momentum = 0 - else: - submodule_momentum = submodule.momentum - bn = BatchNormInt(submodule.num_features, eps=submodule.eps, momentum=submodule_momentum, affine=submodule.affine, track_running_stats=submodule.track_running_stats, - data_bits=in_data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=out_bits, - clamp_data=submodule.normalize_data, clamp_weight=submodule.normalize_weight, clamp_bias=submodule.normalize_bias, ahead_relu=submodule.ahead_relu) - return bn - elif isinstance(submodule, NormalizeLayerNorm): - ahead_relu = getattr(submodule, LINGER_AHEAD_RELU, False) - submodule_momentum = 0.1 - layer_norm = LayerNormInt(submodule.normalized_shape, submodule.eps, submodule_momentum, submodule.elementwise_affine, - data_bits=in_data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=out_bits, - clamp_data=submodule.normalize_data, clamp_weight=submodule.normalize_weight, clamp_bias=submodule.normalize_bias, ahead_relu=ahead_relu) - return layer_norm - elif isinstance(submodule, NormalizeEmbedding): - embedding = EmbeddingInt(submodule.num_embeddings, submodule.embedding_dim, submodule.padding_idx, submodule.max_norm, submodule.norm_type, submodule.scale_grad_by_freq, submodule.sparse, - submodule.weight, data_bits=in_data_bits, 
parameter_bits=parameter_bits, mode=mode, o_bits=out_bits, - clamp_data=submodule.normalize_data, clamp_weight=submodule.normalize_weight) - return embedding - elif isinstance(submodule, nn.Linear): - bias = True if submodule.bias is not None else False - ahead_relu = getattr(submodule, LINGER_AHEAD_RELU, False) - ahead_sigmoid = getattr( - submodule, LINGER_AHEAD_SIGMOID, False) - linear = LinearInt(submodule.in_features, submodule.out_features, bias, data_bits=in_data_bits, - parameter_bits=parameter_bits, mode=mode, o_bits=out_bits, ahead_relu=ahead_relu, ahead_sigmoid=ahead_sigmoid) - return linear - elif isinstance(submodule, nn.Conv2d): - bias = True if submodule.bias is not None else False - ahead_relu = getattr(submodule, LINGER_AHEAD_RELU, False) - ahead_sigmoid = getattr( - submodule, LINGER_AHEAD_SIGMOID, False) - conv = Conv2dInt(submodule.in_channels, submodule.out_channels, submodule.kernel_size, submodule.stride, submodule.padding, submodule.dilation, submodule.groups, - bias, submodule.padding_mode, data_bits=in_data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=out_bits, - ahead_relu=ahead_relu, ahead_sigmoid=ahead_sigmoid) - return conv - elif isinstance(submodule, nn.BatchNorm2d): - ahead_relu = getattr(submodule, LINGER_AHEAD_RELU, False) - if config.BnMomentumUpdate.disable: - submodule_momentum = 0 - else: - submodule_momentum = submodule.momentum - batch_norm = BatchNormInt(submodule.num_features, submodule.eps, submodule_momentum, submodule.affine, submodule.track_running_stats, - data_bits=in_data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=out_bits, ahead_relu=ahead_relu) - return batch_norm - - elif isinstance(submodule, nn.LayerNorm): - ahead_relu = getattr(submodule, LINGER_AHEAD_RELU, False) - submodule_momentum = 0.1 - layer_norm = LayerNormInt(submodule.normalized_shape, submodule.eps, submodule_momentum, submodule.elementwise_affine, - data_bits=in_data_bits, parameter_bits=parameter_bits, mode=mode, 
o_bits=out_bits, ahead_relu=ahead_relu) - return layer_norm - elif isinstance(submodule, nn.ConvTranspose2d): - bias = True if submodule.bias is not None else False - conv_transpose = ConvTranspose2dInt(submodule.in_channels, submodule.out_channels, submodule.kernel_size, submodule.stride, submodule.padding, submodule.output_padding, submodule.groups, - bias, submodule.dilation, submodule.padding_mode, data_bits=in_data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=out_bits) - return conv_transpose - elif isinstance(submodule, nn.GRU): - gru = GRUInt(submodule.input_size, submodule.hidden_size, submodule.num_layers, submodule.bias, submodule.batch_first, - submodule.dropout, submodule.bidirectional, data_bits=in_data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=out_bits) - return gru - elif isinstance(submodule, nn.LSTM): - lstm = LSTMInt(submodule.input_size, submodule.hidden_size, submodule.num_layers, submodule.bias, submodule.batch_first, - submodule.dropout, submodule.bidirectional, data_bits=in_data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=out_bits) - return lstm - elif isinstance(submodule, nn.AvgPool2d): - avg_pool = AvgPool2dInt(submodule.kernel_size, submodule.stride, submodule.padding, submodule.ceil_mode, - submodule.count_include_pad, submodule.divisor_override, data_bits=in_data_bits, mode=mode, o_bits=out_bits) - return avg_pool - elif isinstance(submodule, nn.Conv1d): - bias = True if submodule.bias is not None else False - ahead_relu = getattr(submodule, LINGER_AHEAD_RELU, False) - conv1d = Conv1dInt(submodule.in_channels, submodule.out_channels, submodule.kernel_size, submodule.stride, submodule.padding, submodule.dilation, submodule.groups, - bias, submodule.padding_mode, data_bits=in_data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=out_bits, ahead_relu=ahead_relu) - return conv1d - elif isinstance(submodule, nn.ReLU6): - relu6 = ReLU6Int(data_bits=in_data_bits, mode=mode) - return relu6 - elif 
isinstance(submodule, nn.Embedding): - embedding = EmbeddingInt(submodule.num_embeddings, submodule.embedding_dim, submodule.padding_idx, submodule.max_norm, submodule.norm_type, submodule.scale_grad_by_freq, submodule.sparse, - submodule.weight, data_bits=in_data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=out_bits) - return embedding - - return None - - -def disable_quant(module: nn.Module): - r"""禁用module及子module 任何量化策略(使用原有浮点精度),该接口支持inline使用(Example2) - - Args: - module:需要不进行量化的module - - Example: - >>> net = shufflenet_v2_x1_0(pretrained=False) - >>> linger.disable_quant(net) - Example: - >>> class Net(nn.Module): - >>> def __init__(self): - >>> super(Net, self).__init__() - >>> self.conv1 = nn.Sequential( - >>> nn.Conv2d( - >>> in_channels=1, - >>> out_channels=16, - >>> kernel_size=5, - >>> stride=1, - >>> padding=2, - >>> ), - >>> nn.ReLU(), - >>> nn.MaxPool2d(kernel_size=2), - >>> nn.Conv2d(in_channels=16, - >>> out_channels=16, - >>> kernel_size=5, - >>> stride=1, - >>> padding=2,) - >>> ) - >>> self.conv2 = nn.Sequential( - >>> nn.Conv2d(16, 32, 5, 1, 2), - >>> nn.ReLU(), - >>> nn.MaxPool2d(2), - >>> ) - >>> self.out = nn.Linear(32 * 7 * 7, 10) - >>> linger.disable_quant(self.out) - Notes: - disable_quant 应该在linger.init函数之前调用才生效 - - """ - queue = [module] - while len(queue) > 0: - node = queue.pop(0) - if type(node) in SupportQuantTorchModules: - _SingletonContainCustomModules()._register(node, None) - for _, submodule in node.named_children(): - queue.append(submodule) - - -def quant_module(module: nn.Module, type_modules: Tuple = DefaultQuantIntXOP, mode: QuantMode = QuantMode.QValue, data_bits: int = 8, parameter_bits: int = 8, out_bits: int = None): - r"""对module进行自定义量化设置,这是一个通用的量化策略设置接口,该接口支持inline使用 - - Args: - module: 需要量化的模型或者模块 - type_modules(tuple):量化针对的类型,默认为(nn.Linear, nn.Conv2d, nn.ConvTranspose2d, nn.GRU, nn.LSTM) - mode(QuantMode):量化模式,默认为Q值量化 - data_bits(int):推理过程中data(激活值)量化精度bit数,默认为8bit - 
parameter_bits(int):权重的量化精度bit数,默认为8bit - out_bits(int or None):输出数值的精度,默认为None,表示float精度 - .. math:: - \text {save_weight_bits}=\begin{cases} - 8 & parameter\_bits\leq 8 \\ - 16 & 8 < parameter\_bits\leq 16 \\ - 32 & parameter\_bits \geq 16 - \end{cases} - Notes: - quant_module 应该在linger.init函数之前调用才生效 - 如果module weights中有bias, 如果onnx 导出, 存储使用32bit - """ - if not isinstance(type_modules, tuple): - type_modules = (type_modules,) - for t in type_modules: - if t not in SupportQuantTorchModules: # type_modules is a subset of SupportQuantTorchModules - logger.fatal(str(t)+" is not supprt quant in linger now") - exit(-1) - - queue = [module] - while len(queue) > 0: - node = queue.pop(0) - if type(node) in type_modules and type(node) in SupportQuantTorchModules: # and 后 判断多余 - qinfo = QuantInfo() - qinfo.set_data_bits(data_bits) - qinfo.set_parameter_bits(parameter_bits) - qinfo.set_output_bits(out_bits) - qinfo.set_mode(mode) - _SingletonContainCustomModules()._register(node, qinfo) - for _, submodule in node.named_children(): - queue.append(submodule) - - -def quant_module_by_type(module: nn.Module, type_modules: Tuple = DefaultQuantIntXOP, mode: QuantMode = QuantMode.QValue, data_bits: int = 8, parameter_bits: int = 8, out_bits: int = None): - r"""对module进行自定义量化设置,包括激活值,weight以及输出激活值都是16bit,这是一个通用的量化策略设置接口,该接口支持inline使用 - - Args: - module: 需要量化的模型或者模块 - type_modules(tuple):量化针对的类型,默认为(nn.Linear, nn.Conv2d, nn.ConvTranspose2d, nn.GRU, nn.LSTM) - mode(QuantMode):量化模式,默认为Q值量化 - data_bits(int):推理过程中data(激活值)量化精度bit数,默认为8bit - parameter_bits(int):权重的量化精度bit数,默认为8bit - out_bits(int or None):输出数值的精度,默认为None,表示float精度 - - Notes: - quant_module_by_type 应该在linger.init函数之前调用才生效 - 如果module weights中有bias, 如果onnx 导出, 存储使用32bit - """ - if type(type_modules) is not tuple: - type_modules = (type_modules,) - for user_module_type in type_modules: - assert user_module_type in SupportQuantTorchModules, 'currently not support quant of ' + \ - str(user_module_type) - queue = [module] - while 
len(queue) > 0: - node = queue.pop(0) - for t in type_modules: - if type(node) == t: - qinfo = QuantInfo() - qinfo.set_data_bits(data_bits) - qinfo.set_parameter_bits(parameter_bits) - qinfo.set_output_bits(out_bits) - qinfo.set_mode(mode) - _SingletonContainCustomModules()._register(node, qinfo) - for _, submodule in node.named_children(): - queue.append(submodule) - - -def quant_tensor(module: nn.Module, x: torch.Tensor, name: str = '_default_layername', mode: QuantMode = QuantMode.QValue, bits: int = 8, zero_point: int = 0) -> torch.Tensor: - r"""对tensor进行量化 - - Args: - module(torch.nn.Module):tensor 量化所在的module,如果是在forward代码里面,一般是self - x(torch.Tensor):量化tensor的tensor - name(str):量化后module名字,如果同一forward中出现多个需要量化的tensor,该名字应设置成不一样,默认为'_default_layername' - mode(QuantMode):量化模式,默认QValue - bits(int):tensor量化的bit数,默认为8bit - Returns: - 返回量化后的x(tensor) - Example: - >>> def forward(self, x): - >>> x = self.conv1(x) - >>> x = linger.quant_tensor(self,x,'i_am_the_key_code_line') - >>> x = self.conv2(x) - >>> x = x.view(x.size(0), -1) - >>> output = self.out(x) - >>> return output - - """ - var_name = LINGER_MIX_INT8_MANUAL_ROUND_LAYERS + '_iq_tensor_quant_' + name - if hasattr(module, var_name): - round_layer = getattr(module, var_name) - else: - round_layer = ScaledRoundLayer( - mode=mode, bits=bits, zero_point=zero_point) - round_layer.training = module.training - round_layer = round_layer.to(x.device) - setattr(module, var_name, round_layer) - x = round_layer(x) - return x - - -def hook_pre_pack_forward(module, input): - input = list(input) - if isinstance(input[0], tuple): - orig_input = input[0] - input_ori, lengths, batch_first, enforce_sorted = orig_input - packed_input = torch_pack_padded_sequence( - input_ori, lengths, batch_first, enforce_sorted) - input[0] = packed_input - return tuple(input) - - -def hook_pre_pad_forward(module, input, output): - if isinstance(input[0], tuple): - output_packed = output[0] - output_unpack, lengths = torch_pad_packed_sequence( 
- output_packed, module.batch_first) - return (output_unpack, lengths), output[1] - else: - return output - - -def init(model: nn.Module, *, quant_modules: Tuple = DefaultQuantIntXOP, parameter_bits: int = 8, mode: QuantMode = QuantMode.QValue) -> nn.Module: - data_bits = 8 - out_bits = 8 - assert data_bits > 0 and data_bits <= 32, "data_bits should between 0 and 32" - assert parameter_bits > 0 and parameter_bits <= 32, "parameter_bits should between 0 and 32" - assert data_bits + parameter_bits <= 32, "data_bits + parameter_bits less than 32" - assert out_bits is None or out_bits > 0 and out_bits <= 32, "out_bits should between 0 and 32" - if type(quant_modules) is not tuple: - quant_modules = (quant_modules,) - quant_modules = set(list(quant_modules) + [NormalizeConvBN1d, NormalizeConvBN2d, NormalizeConvTransposeBN2d, NormalizeConv2d, NormalizeConv1d, NormalizeConvTranspose2d, - NormalizeLinear, nn.ReLU6, NormalizeFastGRU, NormalizeFastLSTM, NormalizeBatchNorm2d, NormalizeEmbedding, NormalizeLayerNorm]) - for user_module_type in quant_modules: - assert user_module_type in SupportQuantTorchModules, 'currently not support quant of ' + \ - str(user_module_type) - - if config.IQTensor.iqcat: - torch.cat = linger_cat - if config.IQTensor.iqclamp: - torch.clamp = linger_clamp - torch.clamp_ = linger_clamp_ - if config.IQTensor.iqsigmoid: - torch.sigmoid = linger_sigmoid - torch.sigmoid_ = linger_sigmoid_ - if config.IQTensor.iqtanh: - torch.tanh = linger_tanh - torch.tanh_ = linger_tanh_ - if config.IQTensor.softmaxint: - torch.softmax = linger_softmax - torch.nn.functional.softmax = linger_softmax - if config.IQTensor.logsoftmaxint: - torch.log_softmax = linger_logsoftmax - torch.nn.functional.log_softmax = linger_logsoftmax - if config.IQTensor.iqvar: - torch.var = linger_var - if config.FunctionQuant.bmm: - torch.bmm = linger_bmm - if config.FunctionQuant.channel_shuffle: - linger.channel_shuffle = channel_shuffle_quant - torch.nn.functional.dropout = linger_dropout 
- torch.onnx.export = linger_export - if nn.LSTM in quant_modules or nn.GRU in quant_modules: - torch.nn.utils.rnn.pack_padded_sequence = linger_pack_padded_sequence - torch.nn.utils.rnn.pad_packed_sequence = linger_pad_packed_sequence - device = get_device(model) - queue = [model] - while len(queue) > 0: - node = queue.pop(0) - # add for get father module in current forward - node.register_forward_pre_hook(hook_pre_forward) - node.register_forward_hook(hook_forward) - setattr(node, LINGER_MODE, mode) - setattr(node, LINGER_OBIT, out_bits) - for name, submodule in node.named_children(): - # customed by user, set Singleton Class for get module register_keys by class instance - if _SingletonContainCustomModules()._is_registered(submodule): - quant_info = _SingletonContainCustomModules().get(submodule) - if quant_info == None: # disabled module - if isinstance(submodule, (nn.LSTM, nn.GRU, NormalizeFastGRU, NormalizeFastLSTM)): - submodule.register_forward_pre_hook( - hook_pre_pack_forward) - submodule.register_forward_hook(hook_pre_pad_forward) - continue - else: - r_module = _replaceOp(submodule, quant_info.mode, quant_info.data_bits, - quant_info.parameter_bits, quant_info.output_bits) - assert r_module is not None - setattr(node, name, r_module) - elif type(submodule) in SupportQuantTorchModules and type(submodule) in quant_modules: # not customed - r_module = _replaceOp( - submodule, mode, data_bits, parameter_bits, out_bits) - assert r_module is not None - setattr(node, name, r_module) - else: - queue.append(submodule) - +def init(model: nn.Module, config_file: str = None, disable_module=None, disable_submodel=None): + + q_configs = QUANT_CONFIGS + if config_file is not None: + q_configs._load_from_yaml(config_file) + + if disable_module is not None: + for name in disable_module: + if _QMODULE_TABLE.get(name, None) is not None: + _QMODULE_TABLE.pop(name) + + # traced_model = symbolic_trace(model) + # model = _replace_ops(traced_model, q_configs) + + for name, m 
in model.named_modules(): + if disable_submodel is not None and any(fnmatch(name, pattern) for pattern in disable_submodel): + continue + + m.register_forward_pre_hook(hook_pre_forward) + m.register_forward_hook(hook_forward) + + _quantize_submodule(model, name, m, weights_cfg=q_configs.quant_info.to_dict(), activations_cfg=q_configs.quant_info.to_dict(), bias_cfg=q_configs.quant_info.to_dict(), constrain = q_configs.clamp_info.to_dict()) + def quant_tensor_pre_hook(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs): - + def quant_tensor_layer(module, prefix=''): local_name_params = itertools.chain( module._parameters.items(), module._buffers.items()) local_state = {k: v for k, v in local_name_params if v is not None} for key in state_dict.keys(): - if LINGER_MIX_INT8_MANUAL_ROUND_LAYERS in key: + if LINGER_QTENSOR_LAYERS_PREIFX in key: if key.startswith(prefix): full_input_name = key[len(prefix):] # get the name of param/buffer/child input_name = full_input_name.split('.', 1)[0] if input_name not in module._modules and input_name not in local_state: - if '_iqadd_' in input_name: - iq_layer = iqAddLayer() - iq_layer.training = model.training - iq_layer = iq_layer.to(device) - setattr(module, input_name, iq_layer) - elif '_iqmul_' in input_name: - iq_layer = iqMulLayer() + # quant_info = getattr(module, LINGER_QUANTINFO, QuantInfo()) + activate_cfg = q_configs.quant_info.to_dict() + if '_qadd_' in input_name: + iq_layer = QAdd(activate_config=activate_cfg, num_input=2) iq_layer.training = model.training - iq_layer = iq_layer.to(device) + # iq_layer = iq_layer.to(device) setattr(module, input_name, iq_layer) - elif '_iqcat_' in input_name: - iq_layer = iqCatLayer() + elif '_qmul_' in input_name: + iq_layer = QMul(activate_config=activate_cfg, num_input=2) iq_layer.training = model.training - iq_layer = iq_layer.to(device) + # iq_layer = iq_layer.to(device) setattr(module, input_name, iq_layer) - elif '_iqsigmoid_' in input_name: - 
iq_layer = iqSigmoidLayer() + elif '_qcat_' in input_name: + iq_layer = QCat(activate_config=activate_cfg, num_input=2, is_cat=True) iq_layer.training = model.training - iq_layer = iq_layer.to(device) + # iq_layer = iq_layer.to(device) setattr(module, input_name, iq_layer) - elif '_iqtanh_' in input_name: - iq_layer = iqTanhLayer() + elif '_qbmm_' in input_name: + iq_layer = QBmm(activate_config=activate_cfg, num_input=2) iq_layer.training = model.training - iq_layer = iq_layer.to(device) + # iq_layer = iq_layer.to(device) setattr(module, input_name, iq_layer) - elif '_iqclamp_' in input_name: - iq_layer = iqClampLayer() + elif '_qmatmul_' in input_name: + iq_layer = QMatmul(activate_config=activate_cfg, num_input=2) iq_layer.training = model.training - iq_layer = iq_layer.to(device) + # iq_layer = iq_layer.to(device) setattr(module, input_name, iq_layer) - elif '_iqdiv_' in input_name: - iq_layer = iqDivLayer() + elif '_qsigmoid_' in input_name: + iq_layer = QSigmoid(activate_config=activate_cfg, num_input=1) iq_layer.training = model.training - iq_layer = iq_layer.to(device) + # iq_layer = iq_layer.to(device) setattr(module, input_name, iq_layer) - elif '_iqsum_' in input_name: - iq_layer = iqSumLayer() + elif '_qtanh_' in input_name: + iq_layer = QTanh(activate_config=activate_cfg, num_input=1) iq_layer.training = model.training - iq_layer = iq_layer.to(device) + # iq_layer = iq_layer.to(device) setattr(module, input_name, iq_layer) - elif '_iqvar_' in input_name: - iq_layer = iqVarLayer() + elif '_qsoftmax_' in input_name: + iq_layer = QSoftmax(activate_config=activate_cfg, num_input=2) iq_layer.training = model.training - iq_layer = iq_layer.to(device) + # iq_layer = iq_layer.to(device) setattr(module, input_name, iq_layer) - elif '_iq_tensor_quant_' in input_name: - round_layer = ScaledRoundLayer( - mode=mode, bits=data_bits) - round_layer.training = model.training - round_layer = round_layer.to(device) - setattr(module, input_name, round_layer) - elif 
'_function_bmm_' in input_name: - bmm_layer = BmmInt(data_bits=8, mode=mode,) - bmm_layer.training = model.training - bmm_layer = bmm_layer.to(device) - setattr(module, input_name, bmm_layer) - elif '_SoftmaxInt_' in input_name: - softmax_layer = softmaxIntLayer( - data_bits=8, mode=mode,) - softmax_layer.training = model.training - softmax_layer = softmax_layer.to(device) - setattr(module, input_name, softmax_layer) + else: + pass for name, children in module._modules.items(): if children is not None: @@ -629,9 +159,111 @@ def quant_tensor_layer(module, prefix=''): quant_tensor_layer = None model._register_load_state_dict_pre_hook(quant_tensor_pre_hook) - if model.training: - model.train() - else: - model.eval() - model.to(device) + + model.to(q_configs.device) return model + + +# 关闭QTensor类算子时使用 +def disable_quant_ops(qmodule_list = [], qtensor_list = []): + """ + 删除_QMODULE_TABLE, _QTENSOR_OP_TABLE中的量化操作 + """ + pop_list = [] + for name in qmodule_list: + for k in _QMODULE_TABLE.keys(): + if name == k: + pop_list.append(k) + for k in pop_list: + _QMODULE_TABLE.pop(k) + + pop_list = [] + for name in qtensor_list: + for k in _QTENSOR_OP_TABLE.keys(): + if name in str(k): + pop_list.append(k) + for k in pop_list: + _QTENSOR_OP_TABLE.pop(k) + +def get_quant_ops_name(): + return _QMODULE_TABLE.keys(), _QTENSOR_OP_TABLE.keys() + +def config_save_to_yaml(yaml_save_path): + QUANT_CONFIGS._save_to_yaml(yaml_save_path) + + + +def _set_module_by_name(parent_module, name, child_module): + module_names = name.split(".") + if len(module_names) == 1: + setattr(parent_module, name, child_module) + else: + parent_module_name = name[: name.rindex(".")] + parent_module = parent_module.get_submodule(parent_module_name) + setattr(parent_module, module_names[-1], child_module) + +def _quantize_submodule( + model: torch.nn.Module, + name: str, + module: torch.nn.Module, + weights_cfg: Optional[Union[str]] = None, + activations_cfg: Optional[Union[str]] = None, + bias_cfg: 
Optional[Union[str]] = None, + constrain: Optional[Union[str]] = None, +): + qmodule = quantize_module(module, weights_cfg=weights_cfg, activations_cfg=activations_cfg, bias_cfg = bias_cfg, dim = getattr(module, "dim", None), constrain = constrain) + if qmodule is not None: + _set_module_by_name(model, name, qmodule) + qmodule.name = name + for name, param in module.named_parameters(): + # Save device memory by clearing parameters + setattr(module, name, None) + del param + +def _constrain_submodule( + model: torch.nn.Module, + name: str, + module: torch.nn.Module, + constrain: Optional[Union[str]] = None, +): + cmodule = constrain_module(module, constrain=constrain) + if cmodule is not None: + _set_module_by_name(model, name, cmodule) + cmodule.name = name + for name, param in module.named_parameters(): + # Save device memory by clearing parameters + setattr(module, name, None) + del param + + +# def _replace_ops(gm: GraphModule, quant_cfg: QuantConfig) -> GraphModule: +# graph = gm.graph +# qtensor_counter = 0 +# activate_cfg = quant_cfg.quant_info.to_dict() +# constrain_cfg = quant_cfg.clamp_info.to_dict() +# for node in list(graph.nodes): +# if node.op == "call_function": +# new_node_mod = None +# new_node_name = None +# new_node = None +# new_node_mod = quantize_tensor(node.target, activate_cfg, num_input = len(node.args), dim = node.kwargs.get('dim', None)) + +# if new_node_mod is not None: +# new_node_name = f"{new_node_mod._get_name()}_{qtensor_counter}" +# qtensor_counter += 1 +# with graph.inserting_after(node): +# gm.add_module(new_node_name, new_node_mod) +# new_node = graph.call_module(new_node_name, args=node.args) +# node.replace_all_uses_with(new_node) +# graph.erase_node(node) +# elif node.op == "call_module": +# old_mod = gm.get_submodule(node.target) +# weights_cfg = quant_cfg.quant_info.to_dict() +# activate_cfg = quant_cfg.quant_info.to_dict() +# bias_cfg = quant_cfg.quant_info.to_dict() +# new_node_mod = quantize_module(old_mod, activate_cfg, 
weights_cfg = weights_cfg, bias_cfg = bias_cfg, constrain = constrain_cfg, dim = getattr(old_mod, "dim", None)) +# if new_node_mod is not None and new_node_mod is not old_mod: +# gm.add_submodule(node.target, new_node_mod) +# graph.lint() +# gm.recompile() +# return gm diff --git a/linger/kernel/cpu/arcs_qsigmoid_kernel.cpp b/linger/kernel/cpu/arcs_qsigmoid_kernel.cpp new file mode 100644 index 0000000..9eccfe5 --- /dev/null +++ b/linger/kernel/cpu/arcs_qsigmoid_kernel.cpp @@ -0,0 +1,77 @@ +#include + +#include +#include +#include +#include + +#define MAX_BITS(bits) ((1LL << (bits - 1)) - 1) +#define MIN_BITS(bits) (-(1LL << (bits - 1))) +#define SATURATE(x, bits) \ + ((x) > MAX_BITS(bits) ? MAX_BITS(bits) \ + : ((x) < MIN_BITS(bits) ? MIN_BITS(bits) : (x))) + +torch::Tensor arcs_qsigmoid_cpu(torch::Tensor a) +{ + int32_t N = a.numel(); + auto c = torch::zeros_like(a); + const int *a_ptr = a.data_ptr(); + int *c_ptr = c.data_ptr(); + + static const uint32_t bands[] = {0, 63656107, 111395083, 153816608, 194953701, 236865036, 281253298, 329433241, 384154475, 447714743, 515399149, 589016819, 679940425, 759830874, 862986200, 965402453, 2147483648}; + static const uint32_t slopes[] = {529475578, 482862538, 424212188, 361565531, 298613704, 238039992, 181912633, 132002182, 89144402, 56020914, 34019888, 18928477, 9459126, 5291645, 2204341, 177654}; + static const uint32_t bias0s[] = {1073741824, 1095849221, 1144526551, 1216321063, 1307759740, 1414659140, 1532274042, 1654777693, 1777444109, 1887935281, 1972419725, 2038648643, 2086619912, 2110212774, 2130063364, 2144640936}; + static const uint32_t bias1s[] = {1073741824, 1051634427, 1002957097, 931162585, 839723908, 732824508, 615209606, 492705955, 370039539, 259548367, 175063923, 108835005, 60863736, 37270874, 17420284, 2842712}; + + uint32_t i = 0; + uint32_t j = 0; + uint32_t sign = 0; + int64_t absx = 0; + int64_t slope = 0; + int64_t bias = 0; + int32_t shift = 0; + int64_t tmp = 0; + int64_t out = 0; + + for (i = 0; i 
< N; i++) + { + tmp = a_ptr[i]; + + if (tmp < 0) + { + sign = 1; + absx = -tmp; + } + else + { + sign = 0; + absx = tmp; + } + + for (j = 1; j < 17; ++j) + { + if (absx <= bands[j]) + { + slope = slopes[j - 1]; + if (1 == sign) + { + bias = bias1s[j - 1]; + } + else + { + bias = bias0s[j - 1]; + } + break; + } + else + { + slope = 0; + bias = 0; + } + } + out = ((slope * tmp) >> 27) + bias; + c_ptr[i] = SATURATE(out, 32); + } + + return c; +} diff --git a/linger/kernel/cpu/arcs_qsoftmax_kernel.cpp b/linger/kernel/cpu/arcs_qsoftmax_kernel.cpp new file mode 100644 index 0000000..e2c6cc7 --- /dev/null +++ b/linger/kernel/cpu/arcs_qsoftmax_kernel.cpp @@ -0,0 +1,192 @@ +#include + +#include +#include +#include +#include + +#define MAX_BITS(bits) ((1LL << (bits - 1)) - 1) +#define MIN_BITS(bits) (-(1LL << (bits - 1))) +#define SATURATE(x, bits) \ + ((x) > MAX_BITS(bits) ? MAX_BITS(bits) \ + : ((x) < MIN_BITS(bits) ? MIN_BITS(bits) : (x))) + +static int32_t shift_pure(int64_t v, int32_t s) +{ + if (s >= 63 || s <= -63) { + return 0; + } + if (s > 0) { + v = v << s; + } else { + v = v >> (-s); + } + return SATURATE(v, 32); +} + +static int32_t sub32s(int32_t a, int32_t b) +{ + int64_t s = 0; + s = (int64_t)((int64_t)a - (int64_t)b); + return SATURATE(s, 32); +} +static int32_t add32s(int32_t a, int32_t b) +{ + int64_t s = 0; + s = (int64_t)((int64_t)a + (int64_t)b); + return SATURATE(s, 32); +} +static int32_t shift_rasym(int64_t x,int32_t n) +{ + // double y = (double)x * (double)pow(2, n); + // y = floor(y + 0.5); + // return (int32_t)y; + + if (n >= 63 || n <= -63) { + return 0; + } + if (n >= 0) { + x = x << n; + } + else { + n = (-n); + x = x >> (n - 1); + x = (x & 0x1) + (x >> 1); + } + return x; +} +static int32_t shift_rasyms(int64_t x, int32_t n) +{ + // double y = (double)x * (double)pow(2, n); + // y = floor(y + 0.5); + // return SATURATE(y, 32); + if (n >= 64 || n <= -64) { + return 0; + } + if (n >= 0) { + x = x << n; + } + else { + n = (-n); + x = x >> (n - 
1); + x = (x & 0x1) + (x >> 1); + } + return SATURATE(x, 32); +} + +static int64_t shfit_floor_x05_int64(int64_t x, int32_t shift) +{ + int64_t val = x; + + if (shift >= 64) { + return 0; + } + if (shift > 0) { + val = val >> (shift - 1); + val = (val & 0x1) + (val >> 1); + } + + return val; +} + +/*------------------------------------------------------------------------- +Softmax +The function computes the softmax (normalized exponential function) of +input data. 32-bit fixed-point functions accept inputs in Q6.25 and form +outputs in Q16.15 format. + +Precision: +32x32 32-bit inputs, 32-bit output. Accuracy: 2 LSB (see Note below) +f floating point input, floating point output + +Note: Accuracy of function may depend on amount of data and their +distribution. Given accuracy is achieved for N=2 for any pair of data +from input domain. + +Input: +in.data_ptr input data, Q6.25 +in.numel() length of vectors +Output: +out.data_ptr result, Q16.15 or floating point + +Restriction: +in,out should not overlap +-------------------------------------------------------------------------*/ +torch::Tensor arcs_qsoftmax_cpu(const torch::Tensor& in, int64_t dim) +{ + // int32_t N = in.numel(); + int32_t N = in.size(1); + int32_t L = in.size(0); + auto out = torch::zeros_like(in); + int *p_x = in.data_ptr(); + int *p_y = out.data_ptr(); + + // const static int32_t p[5] = { 14685058, 114217091, 514075394, 1488269031, 2147475316 }; + const static int32_t p23[5] = { 57364 ,446161 ,2008107 , 5813551, 8388575 }; + + for (int k = 0; k < L; k++) + { + uint32_t A = 0x800000, B = 0, C = 0; + int *x_ptr = p_x + N * k; + int *y_ptr = p_y + N * k; + + int32_t max_value = x_ptr[0]; + int32_t data = 0; + int32_t X = 0; + int64_t Y = 0; + int32_t E = 0; + int64_t E_SUM = 0; + + for (int i = 1; i < N; i++) + { + max_value = x_ptr[i] > max_value ? 
x_ptr[i] : max_value; + } + + for (int i = 0; i < N; i++) + { + data = sub32s(x_ptr[i] ,max_value); + X = shift_rasyms((int64_t)data * (int64_t)774541002,-31);//exp=>2xp,Q6.25=>Q8.23 + E = X >> 23; + E = E + 1;//与118行对应 + + X = X & 0x7fffff; + X = X - 0x800000; + + Y = p23[0]; + + for (int j = 1; j < 5; j++) + { + Y = shift_rasym((int64_t)Y * (int64_t)X,-23) + p23[j];//Q7.24 + } + + // y_ptr[i] = (int32_t)(Y * pow(2, E)); + y_ptr[i] = shift_pure(Y, E); + } + + for (int i = 0; i < N; i++) + { + E_SUM += y_ptr[i]; + } + + B = SATURATE(E_SUM, 32); + for (int i = 1; i <= 30; i++) { + if(A>=B) { + C = C+1; + C = C*2; + A = A-B; + A = A*2; + } else { + C = C*2; + A = A*2; + } + } + + for (int i = 0; i < N; i++) + { + y_ptr[i] = shfit_floor_x05_int64((int64_t)C * (int64_t)y_ptr[i], 38); + } + } + + return out; +} + diff --git a/linger/kernel/cpu/arcs_qtanh_kernel.cpp b/linger/kernel/cpu/arcs_qtanh_kernel.cpp new file mode 100644 index 0000000..47a0c1f --- /dev/null +++ b/linger/kernel/cpu/arcs_qtanh_kernel.cpp @@ -0,0 +1,85 @@ +#include + +#include +#include +#include +#include + +#define MAX_BITS(bits) ((1LL << (bits - 1)) - 1) +#define MIN_BITS(bits) (-(1LL << (bits - 1))) +#define SATURATE(x, bits) \ + ((x) > MAX_BITS(bits) ? MAX_BITS(bits) \ + : ((x) < MIN_BITS(bits) ? 
MIN_BITS(bits) : (x))) + +torch::Tensor arcs_qtanh_cpu(torch::Tensor a) +{ + int32_t N = a.numel(); + auto c = torch::zeros_like(a); + const int *a_ptr = a.data_ptr(); + int *c_ptr = c.data_ptr(); + + static const uint32_t bands[] = {0, 33584191, 58182438, 80361293, 102120620, 124483371, 148392654, 174972063, 206183541, 244722657, 281591256, 312045360, 358241226, 403095772, 471425623, 530372454, 2147483648}; + static const uint32_t slopes[] = {2114623134, 1910645334, 1662969581, 1396521636, 1131979061, 881027392, 652221876, 451080022, 281798391, 159724746, 98692663, 60001039, 27632752, 13732196, 2832019, 172634}; + static const uint32_t biass[] = {0, 51039676, 158405367, 317937954, 519217264, 751968259, 1004938252, 1267155516, 1527203771, 1749783808, 1877830237, 1967785133, 2054179492, 2095926998, 2134212723, 2144721503}; + + uint32_t i = 0; + uint32_t j = 0; + uint32_t sign = 0; + int64_t absx = 0; + int64_t slope = 0; + int64_t bias = 0; + int32_t shift = 0; + int64_t tmp = 0; + int64_t out = 0; + + for (i = 0; i < N; i++) + { + tmp = a_ptr[i]; + + if (tmp < 0) + { + sign = 1; + if (tmp == -1 * (1 << 31)) + { + absx = (1 << 31) - 1; + } + else + { + absx = -tmp; + } + } + else + { + sign = 0; + absx = tmp; + } + + for (j = 1; j < 17; ++j) + { + if (absx <= bands[j]) + { + slope = slopes[j - 1]; + bias = biass[j - 1]; + break; + } + else + { + slope = 0; + bias = 0; + } + } + + if (1 == sign) + { + out = ((-1 * slope * absx) >> 27) - bias; + } + else + { + out = ((slope * absx) >> 27) + bias; + } + + c_ptr[i] = SATURATE(out, 32); + } + + return c; +} diff --git a/linger/kernel/cpu/extension.cpp b/linger/kernel/cpu/extension.cpp new file mode 100644 index 0000000..6bf961b --- /dev/null +++ b/linger/kernel/cpu/extension.cpp @@ -0,0 +1,215 @@ +#include +#include +#include +#include +#include + +void find_table_cpu(torch::Tensor value, torch::Tensor table, torch::Tensor table_index); +void find_table_gpu(torch::Tensor value, torch::Tensor table, torch::Tensor 
table_index); +void find_table(torch::Tensor value, torch::Tensor table, torch::Tensor table_index) +{ + if (value.device().type() == torch::kCUDA) + { + return find_table_gpu(value, table, table_index); + } + else + { + return find_table_cpu(value, table, table_index); + } +} + +torch::Tensor venusa_qsigmoid_cpu(torch::Tensor a); +torch::Tensor venusa_qsigmoid_gpu(torch::Tensor a); +torch::Tensor venusa_qsigmoid_forward(torch::Tensor a) +{ + if (a.device().type() == torch::kCUDA) + { + return venusa_qsigmoid_gpu(a); + } + else + { + return venusa_qsigmoid_cpu(a); + } +} + +torch::Tensor arcs_qsigmoid_cpu(torch::Tensor a); +torch::Tensor arcs_qsigmoid_gpu(torch::Tensor a); +torch::Tensor arcs_qsigmoid_forward(torch::Tensor a) +{ + if (a.device().type() == torch::kCUDA) + { + return arcs_qsigmoid_gpu(a); + } + else + { + return arcs_qsigmoid_cpu(a); + } +} + +torch::Tensor venusa_qtanh_cpu(torch::Tensor a); +torch::Tensor venusa_qtanh_gpu(torch::Tensor a); +torch::Tensor venusa_qtanh_forward(torch::Tensor a) +{ + if (a.device().type() == torch::kCUDA) + { + return venusa_qtanh_gpu(a); + } + else + { + return venusa_qtanh_cpu(a); + } +} + +torch::Tensor arcs_qtanh_cpu(torch::Tensor a); +torch::Tensor arcs_qtanh_gpu(torch::Tensor a); +torch::Tensor arcs_qtanh_forward(torch::Tensor a) +{ + if (a.device().type() == torch::kCUDA) + { + return arcs_qtanh_gpu(a); + } + else + { + return arcs_qtanh_cpu(a); + } +} + +torch::Tensor arcs_qsoftmax_cpu(const torch::Tensor& in, int64_t dim); +torch::Tensor arcs_qsoftmax_gpu(const torch::Tensor& in, int64_t dim); +torch::Tensor arcs_qsoftmax_forward(const torch::Tensor& in, int64_t dim) +{ + if (in.device().type() == torch::kCUDA) + { + return arcs_qsoftmax_gpu(in, dim); + } + else + { + return arcs_qsoftmax_cpu(in, dim); + } +} + +torch::Tensor venusa_qsoftmax_cpu(const torch::Tensor& in, int64_t dim); +torch::Tensor venusa_qsoftmax_gpu(const torch::Tensor& in, int64_t dim); +torch::Tensor venusa_qsoftmax_forward(const 
torch::Tensor& in, int64_t dim) +{ + if (in.device().type() == torch::kCUDA) + { + return venusa_qsoftmax_gpu(in, dim); + } + else + { + return venusa_qsoftmax_cpu(in, dim); + } +} + +torch::Tensor qlayernorm_kernel_cpu(torch::Tensor numerator, torch::Tensor denominator, float scale_x); +torch::Tensor qlayernorm_kernel_gpu(torch::Tensor numerator, torch::Tensor denominator, float scale_x); +torch::Tensor qlayernorm_kernel_forward(torch::Tensor numerator, torch::Tensor denominator, float scale_x) +{ + if (numerator.device().type() == torch::kCUDA) + { + return qlayernorm_kernel_gpu(numerator, denominator, scale_x); + } + else + { + return qlayernorm_kernel_cpu(numerator, denominator, scale_x); + } +} + + + +std::tuple fake_quant_cuda(torch::Tensor input,int bit,float factor,float scale_min, float quant_min,float quant_max); + +std::tuple fake_quant( + torch::Tensor input, + int bit, + float factor, + float scale_min, + float quant_min, + float quant_max) { + // return fake_quant_cuda(input, bit, factor, quant_min, quant_max); + // printf("到位置 1 了 \n"); + if (input.device().type() == torch::kCUDA){ + return fake_quant_cuda(input, bit, factor, scale_min, quant_min, quant_max); + } + else{ + throw std::runtime_error("尚未实现cpu版本伪量化,请使用python版本——NATIVE模式"); + } +} + +std::tuple bias_quant_cuda(torch::Tensor input,int bit,float scale,float scale_min, float quant_min,float quant_max); + + +std::tuple bias_quant( + torch::Tensor input, + int bit, + float scale, + float scale_min, + float quant_min, + float quant_max) { + // return bias_quant_cuda(input, bit, scale, quant_min, quant_max); + // printf("到位置 1 了 \n"); + if (input.device().type() == torch::kCUDA){ + return bias_quant_cuda(input, bit, scale, scale_min, quant_min, quant_max); + } + else{ + throw std::runtime_error("尚未实现cpu版本伪量化,请使用python版本——NATIVE模式"); + } +} + + + +std::tuple fake_quant_cuda_with_grad_scale(torch::Tensor input,int bit,float factor,float scale_min, float quant_min,float quant_max); +std::tuple 
fake_quant_with_grad_scale( + torch::Tensor input, + int bit, + float factor, + float scale_min, + float quant_min, + float quant_max) { + // return fake_quant_cuda(input, bit, factor, quant_min, quant_max); + // printf("到位置 1 了 \n"); + if (input.device().type() == torch::kCUDA){ + return fake_quant_cuda_with_grad_scale(input, bit, factor, scale_min, quant_min, quant_max); + } + else{ + throw std::runtime_error("尚未实现cpu版本伪量化,请使用python版本——NATIVE模式"); + } +} + +std::tuple bias_quant_cuda_with_grad_scale(torch::Tensor input,int bit,float scale,float scale_min, float quant_min,float quant_max); +std::tuple bias_quant_with_grad_scale( + torch::Tensor input, + int bit, + float scale, + float scale_min, + float quant_min, + float quant_max) { + // return bias_quant_cuda(input, bit, scale, quant_min, quant_max); + // printf("到位置 1 了 \n"); + if (input.device().type() == torch::kCUDA){ + return bias_quant_cuda_with_grad_scale(input, bit, scale, scale_min, quant_min, quant_max); + } + else{ + throw std::runtime_error("尚未实现cpu版本伪量化,请使用python版本——NATIVE模式"); + } +} + + +PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) +{ + m.def("find_table", &find_table, "find_table(CPU/GPU)"); + m.def("arcs_qsoftmax_forward", &arcs_qsoftmax_forward, "arcs_qsoftmax_forward(CPU/GPU)"); + m.def("venusa_qsoftmax_forward", &venusa_qsoftmax_forward, "venusa_qsoftmax_forward(CPU/GPU)"); + m.def("venusa_qsigmoid_forward", &venusa_qsigmoid_forward, "venusa_qsigmoid_forward(CPU/GPU)"); + m.def("arcs_qsigmoid_forward", &arcs_qsigmoid_forward, "arcs_qsigmoid_forward(CPU/GPU)"); + m.def("venusa_qtanh_forward", &venusa_qtanh_forward, "venusa_qtanh_forward(CPU/GPU)"); + m.def("arcs_qtanh_forward", &arcs_qtanh_forward, "arcs_qtanh_forward(CPU/GPU)"); + m.def("qlayernorm_kernel_forward", &qlayernorm_kernel_forward, "qlayernorm_kernel_forward(CPU/GPU)"); + + m.def("fake_quant", &fake_quant, "Fake Quantization (CUDA)"); + m.def("bias_quant", &bias_quant, "Bias Quantization (CUDA)"); + + 
m.def("fake_quant_with_grad_scale", &fake_quant_with_grad_scale, "Fake Quantization With Grad Scale (CUDA)"); + m.def("bias_quant_with_grad_scale", &bias_quant_with_grad_scale, "Bias Quantization With Grad Scale (CUDA)"); +} \ No newline at end of file diff --git a/linger/kernel/cpu/qlayernorm_kernel.cpp b/linger/kernel/cpu/qlayernorm_kernel.cpp new file mode 100644 index 0000000..cd617bb --- /dev/null +++ b/linger/kernel/cpu/qlayernorm_kernel.cpp @@ -0,0 +1,281 @@ +#include + +#include +#include +#include +#include + +static const int16_t g_s16Table_sqrt_reciprocal[768] = {/*Q15,[0.25-1]的平方根值*/ + 32767, 32704, 32640, 32577, 32514, 32452, 32390, 32328, 32267, 32206, 32146, 32085, 32025, 31966, 31907, 31848, + 31789, 31731, 31673, 31615, 31558, 31501, 31444, 31388, 31332, 31276, 31220, 31165, 31110, 31056, 31001, 30947, + 30893, 30840, 30787, 30734, 30681, 30629, 30577, 30525, 30473, 30422, 30371, 30320, 30269, 30219, 30169, 30119, + 30069, 30020, 29971, 29922, 29874, 29825, 29777, 29729, 29681, 29634, 29587, 29540, 29493, 29446, 29400, 29354, + 29308, 29262, 29217, 29172, 29127, 29082, 29037, 28993, 28948, 28904, 28861, 28817, 28774, 28730, 28687, 28644, + 28602, 28559, 28517, 28475, 28433, 28391, 28350, 28308, 28267, 28226, 28185, 28145, 28104, 28064, 28024, 27984, + 27944, 27905, 27865, 27826, 27787, 27748, 27709, 27670, 27632, 27594, 27555, 27517, 27480, 27442, 27404, 27367, + 27330, 27293, 27256, 27219, 27183, 27146, 27110, 27074, 27038, 27002, 26966, 26930, 26895, 26860, 26824, 26789, + 26754, 26720, 26685, 26651, 26616, 26582, 26548, 26514, 26480, 26446, 26413, 26379, 26346, 26313, 26280, 26247, + 26214, 26181, 26149, 26116, 26084, 26052, 26019, 25987, 25956, 25924, 25892, 25861, 25829, 25798, 25767, 25736, + 25705, 25674, 25643, 25613, 25582, 25552, 25521, 25491, 25461, 25431, 25401, 25372, 25342, 25312, 25283, 25254, + 25224, 25195, 25166, 25137, 25108, 25080, 25051, 25022, 24994, 24966, 24937, 24909, 24881, 24853, 24825, 24797, + 24770, 24742, 24715, 
24687, 24660, 24633, 24606, 24579, 24552, 24525, 24498, 24471, 24445, 24418, 24392, 24365, + 24339, 24313, 24287, 24261, 24235, 24209, 24183, 24157, 24132, 24106, 24081, 24055, 24030, 24005, 23980, 23955, + 23930, 23905, 23880, 23855, 23831, 23806, 23782, 23757, 23733, 23709, 23684, 23660, 23636, 23612, 23588, 23564, + 23541, 23517, 23493, 23470, 23446, 23423, 23400, 23376, 23353, 23330, 23307, 23284, 23261, 23238, 23215, 23193, + 23170, 23147, 23125, 23102, 23080, 23058, 23035, 23013, 22991, 22969, 22947, 22925, 22903, 22881, 22860, 22838, + 22816, 22795, 22773, 22752, 22730, 22709, 22688, 22666, 22645, 22624, 22603, 22582, 22561, 22540, 22520, 22499, + 22478, 22458, 22437, 22416, 22396, 22376, 22355, 22335, 22315, 22294, 22274, 22254, 22234, 22214, 22194, 22175, + 22155, 22135, 22115, 22096, 22076, 22056, 22037, 22018, 21998, 21979, 21960, 21940, 21921, 21902, 21883, 21864, + 21845, 21826, 21807, 21788, 21769, 21751, 21732, 21713, 21695, 21676, 21658, 21639, 21621, 21602, 21584, 21566, + 21548, 21529, 21511, 21493, 21475, 21457, 21439, 21421, 21403, 21386, 21368, 21350, 21332, 21315, 21297, 21280, + 21262, 21245, 21227, 21210, 21193, 21175, 21158, 21141, 21124, 21107, 21089, 21072, 21055, 21038, 21022, 21005, + 20988, 20971, 20954, 20938, 20921, 20904, 20888, 20871, 20855, 20838, 20822, 20805, 20789, 20773, 20756, 20740, + 20724, 20708, 20691, 20675, 20659, 20643, 20627, 20611, 20595, 20580, 20564, 20548, 20532, 20516, 20501, 20485, + 20470, 20454, 20438, 20423, 20407, 20392, 20377, 20361, 20346, 20331, 20315, 20300, 20285, 20270, 20255, 20239, + 20224, 20209, 20194, 20179, 20164, 20150, 20135, 20120, 20105, 20090, 20076, 20061, 20046, 20032, 20017, 20002, + 19988, 19973, 19959, 19944, 19930, 19916, 19901, 19887, 19873, 19858, 19844, 19830, 19816, 19802, 19787, 19773, + 19759, 19745, 19731, 19717, 19703, 19690, 19676, 19662, 19648, 19634, 19620, 19607, 19593, 19579, 19566, 19552, + 19539, 19525, 19511, 19498, 19485, 19471, 19458, 19444, 19431, 19418, 19404, 
19391, 19378, 19365, 19351, 19338, + 19325, 19312, 19299, 19286, 19273, 19260, 19247, 19234, 19221, 19208, 19195, 19182, 19169, 19157, 19144, 19131, + 19118, 19106, 19093, 19080, 19068, 19055, 19042, 19030, 19017, 19005, 18992, 18980, 18968, 18955, 18943, 18930, + 18918, 18906, 18894, 18881, 18869, 18857, 18845, 18832, 18820, 18808, 18796, 18784, 18772, 18760, 18748, 18736, + 18724, 18712, 18700, 18688, 18676, 18665, 18653, 18641, 18629, 18618, 18606, 18594, 18582, 18571, 18559, 18547, + 18536, 18524, 18513, 18501, 18490, 18478, 18467, 18455, 18444, 18432, 18421, 18410, 18398, 18387, 18376, 18365, + 18353, 18342, 18331, 18320, 18308, 18297, 18286, 18275, 18264, 18253, 18242, 18231, 18220, 18209, 18198, 18187, + 18176, 18165, 18154, 18143, 18132, 18122, 18111, 18100, 18089, 18078, 18068, 18057, 18046, 18036, 18025, 18014, + 18004, 17993, 17982, 17972, 17961, 17951, 17940, 17930, 17919, 17909, 17898, 17888, 17878, 17867, 17857, 17846, + 17836, 17826, 17816, 17805, 17795, 17785, 17775, 17764, 17754, 17744, 17734, 17724, 17714, 17703, 17693, 17683, + 17673, 17663, 17653, 17643, 17633, 17623, 17613, 17603, 17593, 17584, 17574, 17564, 17554, 17544, 17534, 17525, + 17515, 17505, 17495, 17485, 17476, 17466, 17456, 17447, 17437, 17427, 17418, 17408, 17399, 17389, 17379, 17370, + 17360, 17351, 17341, 17332, 17322, 17313, 17304, 17294, 17285, 17275, 17266, 17257, 17247, 17238, 17229, 17219, + 17210, 17201, 17192, 17182, 17173, 17164, 17155, 17146, 17136, 17127, 17118, 17109, 17100, 17091, 17082, 17073, + 17064, 17055, 17046, 17037, 17028, 17019, 17010, 17001, 16992, 16983, 16974, 16965, 16956, 16947, 16938, 16930, + 16921, 16912, 16903, 16894, 16886, 16877, 16868, 16859, 16851, 16842, 16833, 16825, 16816, 16807, 16799, 16790, + 16782, 16773, 16764, 16756, 16747, 16739, 16730, 16722, 16713, 16705, 16696, 16688, 16679, 16671, 16662, 16654, + 16646, 16637, 16629, 16621, 16612, 16604, 16596, 16587, 16579, 16571, 16562, 16554, 16546, 16538, 16529, 16521, + 16513, 16505, 16497, 
16489, 16480, 16472, 16464, 16456, 16448, 16440, 16432, 16424, 16416, 16408, 16400, 16392 +}; + +static int32_t saturate_q63_to_q31(int64_t src) +{ + int32_t ret; + int64_t int32_max = 0x7fffffff; + int64_t int32_min = 0xffffffff80000000; + if (src > int32_max) + { + ret = 0x7fffffff; + } + else if (src < int32_min) + { + ret = 0x80000000; + } + else + { + ret = (int32_t)src; + } + return ret; +} + +static int32_t shfit_floor_x05_int32(int32_t x, int32_t shift) +{ + int32_t val = x; + + if (shift >= 32) { + return 0; + } + if (shift > 0) { + val = val >> (shift - 1); + val = (val & 0x1) + (val >> 1); + } + + return val; +} + +static int64_t shfit_floor_x05_int64(int64_t x, int32_t shift) +{ + int64_t val = x; + + if (shift >= 64) { + return 0; + } + if (shift > 0) { + val = val >> (shift - 1); + val = (val & 0x1) + (val >> 1); + } + + return val; +} + +static void calc_data_mul_vs(const int32_t *src, int32_t scalar, int32_t *dst, int32_t size, int32_t shift) +{ + int64_t d, d1, d2; + for (int i = 0; i < size; i++) + { + d1 = (int64_t)*(src + i); + d2 = (int64_t)scalar; + d = d1 * d2; + d = shfit_floor_x05_int64(d, shift); + *(dst + i) = saturate_q63_to_q31(d); + } +} + +static const int32_t calc_sqrt_reciprocal(const int64_t data, int32_t q_x, int32_t *table_shift) +{ + const int q_normal = 15; //normalize(-32, 32) + const int q2 = 14; + int64_t temp; + int q1; + + if (data & 0xC00000000000) + { + temp = data>>38; + q1 = 24; + } + else if (data & 0x300000000000) + { + temp = data>>36; + q1 = 23; + } + else if (data & 0xC0000000000) + { + temp = data>>34; + q1 = 22; + } + else if (data & 0x30000000000) + { + temp = data>>32; + q1 = 21; + } + else if (data & 0xC000000000) + { + temp = data>>30; + q1 = 20; + } + else if (data & 0x3000000000) + { + temp = data>>28; + q1 = 19; + } + else if (data & 0xC00000000) + { + temp = data>>26; + q1 = 18; + } + else if (data & 0x300000000) + { + temp = data>>24; + q1 = 17; + } + + else if (data & 0xC0000000) + { + temp = data>>22; 
+ q1 = 16; + } + else if (data & 0x30000000) + { + temp = data>>20; + q1 = 15; + } + else if (data & 0xFC000000) + { + temp = data>>18; + q1 = 14; + } + else if (data & 0xF3000000) + { + temp = data>>16; + q1 = 13; + } + else if (data & 0xFFC00000) + { + temp = data>>14; + q1 = 12; + } + else if (data & 0xFF300000) + { + temp = data>>12; + q1 = 11; + } + else if (data & 0xFFFC0000) + { + temp = data>>10; + q1 = 10; + } + else if (data & 0xFFF30000) + { + temp = data>>8; + q1 = 9; + } + else if (data & 0xFFFFC000) + { + temp = data>>6; + q1 = 8; + } + else if (data & 0xFFFF3000) + { + temp = data>>4; + q1 = 7; + } + else if (data & 0xFFFFFC00) + { + temp = data>>2; + q1 = 6; + } + else if (data & 0xFFFFFF00) + { + temp = data; + q1 = 5; + } + else if (data & 0xFFFFFFC0) + { + temp = data<<2; + q1 = 4; + } + else if (data & 0xFFFFFFF0) + { + temp = data<<4; + q1 = 3; + } + else if (data & 0xFFFFFFFC) + { + temp = data<<6; + q1 = 2; + } + else if (data & 0xFFFFFFFF) + { + temp = data<<8; + q1 = 1; + } + else + { + temp = 256; + q1 = 0; + } + + int32_t id = temp - 256; + int32_t table_out = (int32_t)g_s16Table_sqrt_reciprocal[id]; + int32_t q = q1 + q2 - q_normal; + *table_shift = q;//(int32_t)powf(2, q); + return table_out; + +} + +torch::Tensor qlayernorm_kernel_cpu(torch::Tensor numerator, torch::Tensor denominator, float scale_x) +{ + int32_t N = denominator.numel(); + int32_t T = numerator.numel() / N; + auto c = torch::zeros_like(numerator); + const int64_t *d_ptr = denominator.data_ptr(); + const int32_t *n_ptr = numerator.data_ptr(); + int32_t *c_ptr = c.data_ptr(); + int32_t tmp_val, shift = 0; + + for (int i = 0; i < N; i++) + { + tmp_val = calc_sqrt_reciprocal(d_ptr[i], (int64_t)scale_x, &shift); + calc_data_mul_vs((int32_t *)(n_ptr + i * T), tmp_val, (c_ptr + i * T), T, shift); + } + + return c; +} diff --git a/linger/kernel/cpu/util_kernel.cpp b/linger/kernel/cpu/util_kernel.cpp new file mode 100644 index 0000000..1d55c5a --- /dev/null +++ 
b/linger/kernel/cpu/util_kernel.cpp @@ -0,0 +1,26 @@ +#include + +#include +#include +#include + +namespace extension_util_cpp +{ + void find_table(float *value, const int32_t *table_index, const float *table, int32_t size) + { + for (int i = 0; i < size; i++) + { + int32_t index = table_index[i]; + value[i] *= table[index]; + } + } +} // namespace extension_util_cpp + +void find_table_cpu(torch::Tensor value, torch::Tensor table, torch::Tensor table_index) +{ + int32_t N = value.numel(); + float *value_ptr = value.data_ptr(); + const float *table_ptr = table.data_ptr(); + const int32_t *table_index_ptr = table_index.data_ptr(); + extension_util_cpp::find_table(value_ptr, table_index_ptr, table_ptr, N); +} diff --git a/linger/kernel/cpu/venusa_qsigmoid_kernel.cpp b/linger/kernel/cpu/venusa_qsigmoid_kernel.cpp new file mode 100644 index 0000000..7273d13 --- /dev/null +++ b/linger/kernel/cpu/venusa_qsigmoid_kernel.cpp @@ -0,0 +1,82 @@ +#include + +#include +#include +#include +#include + +#define MAX_BITS(bits) ((1LL << (bits - 1)) - 1) +#define MIN_BITS(bits) (-(1LL << (bits - 1))) +#define SATURATE(x, bits) \ + ((x) > MAX_BITS(bits) ? MAX_BITS(bits) \ + : ((x) < MIN_BITS(bits) ? 
MIN_BITS(bits) : (x))) + +torch::Tensor venusa_qsigmoid_cpu(torch::Tensor a) +{ + int32_t N = a.numel(); + auto c = torch::zeros_like(a); + const int *a_ptr = a.data_ptr(); + int *c_ptr = c.data_ptr(); + + static const uint32_t bands[] = {0, 63656107, 111395083, 153816608, 194953701, 236865036, 281253298, 329433241, 384154475, 447714743, 515399149, 589016819, 679940425, 759830874, 862986200, 965402453, 2147483648}; + static const uint32_t slopes[] = {529475578, 482862538, 424212188, 361565531, 298613704, 238039992, 181912633, 132002182, 89144402, 56020914, 34019888, 18928477, 9459126, 5291645, 2204341, 177654}; + static const uint32_t bias0s[] = {134217728,136981152,143065818,152040132,163469967,176832392,191534255,206847211,222180513,235991910,246552465,254831080,260827489,263776596,266257920,268080117}; + static const uint32_t bias1s[] = {134217728,131454303,125369637,116395323,104965488,91603063,76901200,61588244,46254942,32443545,21882990,13604375,7607967,4658859,2177535,355339}; + + uint32_t i = 0; + uint32_t j = 0; + uint32_t sign = 0; + int64_t absx = 0; + int64_t slope = 0; + int64_t bias = 0; + int32_t shift = 0; + int64_t tmp = 0; + int64_t out = 0; + + for (i = 0; i < N; i++) + { + tmp = a_ptr[i]; + + if (tmp < 0) + { + sign = 1; + if (tmp == (-1ULL<<31)) { + absx = (1ULL<<31)-1; + } else { + absx = -tmp; + } + } + else + { + sign = 0; + absx = tmp; + } + + for (j = 1; j < 17; ++j) + { + if (absx <= bands[j]) + { + slope = slopes[j - 1]; + if (1 == sign) + { + bias = bias1s[j - 1]; + } + else + { + bias = bias0s[j - 1]; + } + break; + } + else + { + slope = 0; + bias = 0; + } + } + bias = bias << 3; + out = ((slope * tmp) >> 27) + bias; + c_ptr[i] = SATURATE(out, 32); + } + + return c; +} diff --git a/linger/kernel/cpu/venusa_qsoftmax_kernel.cpp b/linger/kernel/cpu/venusa_qsoftmax_kernel.cpp new file mode 100644 index 0000000..d5270e2 --- /dev/null +++ b/linger/kernel/cpu/venusa_qsoftmax_kernel.cpp @@ -0,0 +1,204 @@ +#include + +#include +#include 
#include <cstdint>
#include <cmath>

// 32-bit signed saturation helpers shared by the fixed-point kernels below.
#define MAX_BITS(bits) ((1LL << (bits - 1)) - 1)
#define MIN_BITS(bits) (-(1LL << (bits - 1)))
#define SATURATE(x, bits) \
    ((x) > MAX_BITS(bits) ? MAX_BITS(bits) \
                          : ((x) < MIN_BITS(bits) ? MIN_BITS(bits) : (x)))

// Arithmetic shift (left for s > 0, right for s < 0) with 32-bit
// saturation; |s| >= 63 always yields 0 (everything shifted out).
static int32_t shift_pure(int64_t v, int32_t s)
{
    if (s >= 63 || s <= -63) {
        return 0;
    }
    const int64_t shifted = (s > 0) ? (v << s) : (v >> (-s));
    return SATURATE(shifted, 32);
}

// Saturating 32-bit subtraction.
static int32_t sub32s(int32_t a, int32_t b)
{
    const int64_t wide = (int64_t)a - (int64_t)b;
    return SATURATE(wide, 32);
}

// Saturating 32-bit addition.
static int32_t add32s(int32_t a, int32_t b)
{
    const int64_t wide = (int64_t)a + (int64_t)b;
    return SATURATE(wide, 32);
}

// Right shift of a 64-bit value with round-half-up; shift <= 0 returns x
// unchanged and shift >= 64 returns 0.
static int64_t shfit_floor_x05_int64(int64_t x, int32_t shift)
{
    if (shift >= 64) {
        return 0;
    }
    if (shift <= 0) {
        return x;
    }
    x = x >> (shift - 1);
    return (x & 0x1) + (x >> 1);
}
Accuracy: 2 LSB (see Note below) +f floating point input, floating point output + +Note: Accuracy of function may depend on amount of data and their +distribution. Given accuracy is achieved for N=2 for any pair of data +from input domain. + +Input: +in.data_ptr input data, Q6.25 +in.numel() length of vectors +Output: +out.data_ptr result, Q16.15 or floating point + +Restriction: +in,out should not overlap +-------------------------------------------------------------------------*/ +torch::Tensor venusa_qsoftmax_cpu(const torch::Tensor& in, int64_t dim) +{ + // int32_t N = in.numel(); + int32_t N = in.size(1); + int32_t L = in.size(0); + auto out = torch::zeros_like(in); + int *p_x = in.data_ptr(); + int *p_y = out.data_ptr(); + + const static int32_t p23[5] = { 57364 ,446161 ,2008107 , 5813551, 8388575 }; + + for (int k = 0; k < L; k++) + { + uint32_t A = 0x800000, B = 0, C = 0; + int *x_ptr = p_x + N * k; + int *y_ptr = p_y + N * k; + + int32_t max_value = x_ptr[0]; + int32_t data = 0; + int32_t X = 0; + int64_t Y = 0; + int32_t E = 0; + int64_t E_SUM = 0; + + for (int i = 1; i < N; i++) + { + max_value = x_ptr[i] > max_value ? 
x_ptr[i] : max_value; + } + if (max_value == (int32_t)0x80000000) + max_value += 1; + + for (int i = 0; i < N; i++) + { + data = sub32s(x_ptr[i], max_value); + X = shfit_floor_x05_int64((int64_t)X * (int64_t)774541002, 31); + // X = shift_rasyms((int64_t)data * (int64_t)774541002, -31);//exp=>2xp,Q6.25=>Q8.23 + E = X >> 23; + E = E + 1;//与118行对应 + + X = X & 0x7fffff; + X = X - 0x800000; + + Y = p23[0]; + for (int j = 1; j < 5; j++) + { + int64_t t = (((int64_t)Y * (int64_t)X) <<7) + ((int64_t)p23[j] << 30); + if (j < 4) { + Y = shfit_floor_x05_int64(t, 30); + } else { + if (30 - E > 63) { + Y = 0; + } else { + Y = shfit_floor_x05_int64(t, 30 - E); + } + } + } + ((int32_t*)y_ptr)[i] = Y; + } + + for (int i = 0; i < N; i++) + { + E_SUM += y_ptr[i]; + } + + B = SATURATE(E_SUM, 32); + for (int i = 1; i <= 30; i++) { + if(A>=B) { + C = C+1; + C = C*2; + A = A-B; + A = A*2; + } else { + C = C*2; + A = A*2; + } + } + + for (int i = 0; i < N; i++) + { + y_ptr[i] = shfit_floor_x05_int64((int64_t)C * (int64_t)y_ptr[i], 38); + } + } + + return out; +} + diff --git a/linger/kernel/cpu/venusa_qtanh_kernel.cpp b/linger/kernel/cpu/venusa_qtanh_kernel.cpp new file mode 100644 index 0000000..65a6766 --- /dev/null +++ b/linger/kernel/cpu/venusa_qtanh_kernel.cpp @@ -0,0 +1,86 @@ +#include + +#include +#include +#include +#include + +#define MAX_BITS(bits) ((1LL << (bits - 1)) - 1) +#define MIN_BITS(bits) (-(1LL << (bits - 1))) +#define SATURATE(x, bits) \ + ((x) > MAX_BITS(bits) ? MAX_BITS(bits) \ + : ((x) < MIN_BITS(bits) ? 
MIN_BITS(bits) : (x))) + +torch::Tensor venusa_qtanh_cpu(torch::Tensor a) +{ + int32_t N = a.numel(); + auto c = torch::zeros_like(a); + const int *a_ptr = a.data_ptr(); + int *c_ptr = c.data_ptr(); + + static const uint32_t bands[] = {0, 33584191, 58182438, 80361293, 102120620, 124483371, 148392654, 174972063, 206183541, 244722657, 281591256, 312045360, 358241226, 403095772, 471425623, 530372454, 2147483648}; + static const uint32_t slopes[] = {2114623134, 1910645334, 1662969581, 1396521636, 1131979061, 881027392, 652221876, 451080022, 281798391, 159724746, 98692663, 60001039, 27632752, 13732196, 2832019, 172634}; + static const uint32_t biass[] = {0,6379959,19800670,39742244,64902158,93996032,125617281,158394439,190900471,218722976,234728779,245973141,256772436,261990874,266776590,268090187,}; + + uint32_t i = 0; + uint32_t j = 0; + uint32_t sign = 0; + int64_t absx = 0; + int64_t slope = 0; + int64_t bias = 0; + int32_t shift = 0; + int64_t tmp = 0; + int64_t out = 0; + + for (i = 0; i < N; i++) + { + tmp = a_ptr[i]; + + if (tmp < 0) + { + sign = 1; + if (tmp == -1 * (1 << 31)) + { + absx = (1 << 31) - 1; + } + else + { + absx = -tmp; + } + } + else + { + sign = 0; + absx = tmp; + } + + for (j = 1; j < 17; ++j) + { + if (absx <= bands[j]) + { + slope = slopes[j - 1]; + bias = biass[j - 1]; + break; + } + else + { + slope = 0; + bias = 0; + } + } + + bias = bias << 3; + if (1 == sign) + { + out = ((-1 * slope * absx) >> 27) - bias; + } + else + { + out = ((slope * absx) >> 27) + bias; + } + + c_ptr[i] = SATURATE(out, 32); + } + + return c; +} diff --git a/linger/kernel/gpu/arcs_qsigmoid_kernel.cu b/linger/kernel/gpu/arcs_qsigmoid_kernel.cu new file mode 100644 index 0000000..99e045f --- /dev/null +++ b/linger/kernel/gpu/arcs_qsigmoid_kernel.cu @@ -0,0 +1,109 @@ +#include +#include +#include +#include +#include + +#define MAX_BITS(bits) ((1LL << (bits - 1)) - 1) +#define MIN_BITS(bits) (-(1LL << (bits - 1))) +#define SATURATE(x, bits) \ + ((x) > MAX_BITS(bits) ? 
// Fixed-point sigmoid (device): 16-segment piecewise-linear LUT, one
// element per thread. The tables are uploaded to global memory per call.
__global__ void arcs_qsigmoid_gpu_kernel(const int* __restrict__ a,
                                         int* __restrict__ c,
                                         int32_t len, uint32_t* bands, uint32_t* slopes,
                                         uint32_t* bias0s, uint32_t* bias1s)
{
    int32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < len)
    {
        // Widen to 64 bits first so negating INT32_MIN is well defined.
        int64_t tmp = a[idx];
        uint32_t sign;
        int64_t absx;
        if (tmp < 0)
        {
            sign = 1;
            absx = -tmp;
        }
        else
        {
            sign = 0;
            absx = tmp;
        }

        int64_t slope = 0;
        int64_t bias = 0;
        for (uint32_t j = 1; j < 17; ++j)
        {
            if (absx <= bands[j])
            {
                slope = slopes[j - 1];
                bias = (1 == sign) ? bias1s[j - 1] : bias0s[j - 1];
                break;
            }
        }

        // Linear term in Q27 plus the (pre-shifted) segment offset.
        int64_t out = ((slope * tmp) >> 27) + bias;
        c[idx] = SATURATE(out, 32);
    }
}

// Host wrapper: uploads the LUTs, launches one thread per element.
// Returns a new tensor; the input is not modified.
torch::Tensor arcs_qsigmoid_gpu(torch::Tensor a)
{
    int32_t N = a.numel();
    const int threads = 64;
    // FIX: the grid only needs ceil(N / threads) blocks along x. The old
    // dim3(..., threads) second dimension launched 64x redundant blocks
    // that all recomputed the same elements.
    const dim3 blocks((N + threads - 1) / threads);
    auto c = torch::zeros_like(a);
    const int* a_ptr = a.data_ptr<int>();
    int* c_ptr = c.data_ptr<int>();

    static const uint32_t bands[] = {0, 63656107, 111395083, 153816608, 194953701, 236865036, 281253298, 329433241, 384154475, 447714743, 515399149, 589016819, 679940425, 759830874, 862986200, 965402453, 2147483648};
    static const uint32_t slopes[] = {529475578, 482862538, 424212188, 361565531, 298613704, 238039992, 181912633, 132002182, 89144402, 56020914, 34019888, 18928477, 9459126, 5291645, 2204341, 177654};
    static const uint32_t bias0s[] = {1073741824, 1095849221, 1144526551, 1216321063, 1307759740, 1414659140, 1532274042, 1654777693, 1777444109, 1887935281, 1972419725, 2038648643, 2086619912, 2110212774, 2130063364, 2144640936};
    static const uint32_t bias1s[] = {1073741824, 1051634427, 1002957097, 931162585, 839723908, 732824508, 615209606, 492705955, 370039539, 259548367, 175063923, 108835005, 60863736, 37270874, 17420284, 2842712};

    uint32_t* buffer_bands;
    uint32_t* buffer_slopes;
    uint32_t* buffer_bias0s;
    uint32_t* buffer_bias1s;

    cudaMalloc((void**)&buffer_bands, 17 * sizeof(uint32_t));
    cudaMemcpy(buffer_bands, bands, 17 * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&buffer_slopes, 16 * sizeof(uint32_t));
    cudaMemcpy(buffer_slopes, slopes, 16 * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&buffer_bias0s, 16 * sizeof(uint32_t));
    cudaMemcpy(buffer_bias0s, bias0s, 16 * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&buffer_bias1s, 16 * sizeof(uint32_t));
    cudaMemcpy(buffer_bias1s, bias1s, 16 * sizeof(uint32_t), cudaMemcpyHostToDevice);

    arcs_qsigmoid_gpu_kernel<<<blocks, threads>>>(a_ptr, c_ptr, N, buffer_bands, buffer_slopes, buffer_bias0s, buffer_bias1s);

    cudaFree(buffer_bands);
    cudaFree(buffer_slopes);
    cudaFree(buffer_bias0s);
    cudaFree(buffer_bias1s);

    return c;
}
// 32-bit signed saturation helpers for the fixed-point softmax below.
#define MAX_BITS(bits) ((1LL << (bits - 1)) - 1)
#define MIN_BITS(bits) (-(1LL << (bits - 1)))
#define SATURATE(x, bits) \
    ((x) > MAX_BITS(bits) ? MAX_BITS(bits) \
                          : ((x) < MIN_BITS(bits) ? MIN_BITS(bits) : (x)))

// Arithmetic shift (left for s > 0, right for s < 0) with 32-bit
// saturation; |s| >= 63 always yields 0.
static __device__ int32_t shift_pure(int64_t v, int32_t s)
{
    if (s >= 63 || s <= -63)
        return 0;
    const int64_t shifted = (s > 0) ? (v << s) : (v >> (-s));
    return SATURATE(shifted, 32);
}

// Saturating 32-bit subtraction.
static __device__ int32_t sub32s(int32_t a, int32_t b)
{
    const int64_t wide = (int64_t)a - (int64_t)b;
    return SATURATE(wide, 32);
}

// Saturating 32-bit addition.
static __device__ int32_t add32s(int32_t a, int32_t b)
{
    const int64_t wide = (int64_t)a + (int64_t)b;
    return SATURATE(wide, 32);
}

// Rounded arithmetic shift (left for n >= 0, right with round-half-up for
// n < 0). No saturation; the result is truncated to 32 bits.
static __device__ int32_t shift_rasym(int64_t x, int32_t n)
{
    if (n >= 63 || n <= -63)
        return 0;
    if (n >= 0)
        return x << n;
    const int32_t r = -n;
    x = x >> (r - 1);
    return (x & 0x1) + (x >> 1);
}

// Same rounding as shift_rasym, but saturating to 32 bits.
static __device__ int32_t shift_rasyms(int64_t x, int32_t n)
{
    if (n >= 64 || n <= -64)
        return 0;
    if (n >= 0)
    {
        x = x << n;
    }
    else
    {
        const int32_t r = -n;
        x = x >> (r - 1);
        x = (x & 0x1) + (x >> 1);
    }
    return SATURATE(x, 32);
}

// Right shift of a 64-bit value with round-half-up; shift <= 0 returns x
// unchanged and shift >= 64 returns 0.
static __device__ int64_t shfit_floor_x05_int64(int64_t x, int32_t shift)
{
    if (shift >= 64)
        return 0;
    if (shift <= 0)
        return x;
    x = x >> (shift - 1);
    return (x & 0x1) + (x >> 1);
}

// Fixed-point softmax over one row of N Q6.25 values, producing Q16.15.
static __device__ void arcs_qsoftmax_c(const int* __restrict__ x_ptr, int* __restrict__ y_ptr, int32_t N, int32_t* p23)
{
    // Row maximum for numerical stability.
    int32_t row_max = x_ptr[0];
    for (int i = 1; i < N; i++)
        row_max = x_ptr[i] > row_max ? x_ptr[i] : row_max;

    // Exponentials: exp(x) = 2^(x * log2 e); 774541002 is log2(e) in Q29.
    for (int i = 0; i < N; i++)
    {
        int32_t frac = sub32s(x_ptr[i], row_max);
        frac = shift_rasyms((int64_t)frac * (int64_t)774541002, -31); // Q6.25 -> Q8.23
        const int32_t exp_i = (frac >> 23) + 1;  // integer exponent
        frac = (frac & 0x7fffff) - 0x800000;     // fractional part in [-1, 0), Q23

        // Degree-4 Horner evaluation of 2^frac.
        int64_t poly = p23[0];
        for (int j = 1; j < 5; j++)
            poly = shift_rasym((int64_t)poly * (int64_t)frac, -23) + p23[j];

        // Fold in 2^exp_i.
        y_ptr[i] = shift_pure(poly, exp_i);
    }

    int64_t total = 0;
    for (int i = 0; i < N; i++)
        total += y_ptr[i];

    // Restoring division: quot = (2^23 << 30) / total.
    uint32_t rem = 0x800000;
    uint32_t den = SATURATE(total, 32);
    uint32_t quot = 0;
    for (int i = 1; i <= 30; i++)
    {
        if (rem >= den)
        {
            quot = quot + 1;
            quot = quot * 2;
            rem = rem - den;
            rem = rem * 2;
        }
        else
        {
            quot = quot * 2;
            rem = rem * 2;
        }
    }

    // Normalize, rescaling to Q16.15.
    for (int i = 0; i < N; i++)
        y_ptr[i] = shfit_floor_x05_int64((int64_t)quot * (int64_t)y_ptr[i], 38);
}

// One thread handles one full row of the [L, N] input.
__global__ void arcs_qsoftmax_gpu_kernel(const int* __restrict__ p_in,
                                         int* __restrict__ p_out,
                                         int32_t N, int32_t L, int32_t* p23)
{
    const int32_t row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < L)
    {
        arcs_qsoftmax_c(p_in + N * row, p_out + N * row, N, p23);
    }
}
#include <cstdint>

#define MAX_BITS(bits) ((1LL << (bits - 1)) - 1)
#define MIN_BITS(bits) (-(1LL << (bits - 1)))
#define SATURATE(x, bits) \
    ((x) > MAX_BITS(bits) ? MAX_BITS(bits) \
                          : ((x) < MIN_BITS(bits) ? MIN_BITS(bits) : (x)))

// Fixed-point tanh (device): 16-segment piecewise-linear LUT, one element
// per thread. Tables are uploaded to global memory by the host wrapper.
__global__ void arcs_qtanh_gpu_kernel(const int* __restrict__ a,
                                      int* __restrict__ c,
                                      int32_t len, uint32_t* bands, uint32_t* slopes,
                                      uint32_t* biass)
{
    int32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < len)
    {
        int64_t tmp = a[idx];
        uint32_t sign;
        int64_t absx;

        if (tmp < 0)
        {
            sign = 1;
            // BUG FIX: the old check was `tmp == -1 * (1 << 31)` with clamp
            // `(1 << 31) - 1`; `1 << 31` overflows a 32-bit int (undefined
            // behavior). Use INT32_MIN / INT32_MAX instead.
            absx = (tmp == (int64_t)INT32_MIN) ? (int64_t)INT32_MAX : -tmp;
        }
        else
        {
            sign = 0;
            absx = tmp;
        }

        int64_t slope = 0;
        int64_t bias = 0;
        for (uint32_t j = 1; j < 17; ++j)
        {
            if (absx <= bands[j])
            {
                slope = slopes[j - 1];
                bias = biass[j - 1];
                break;
            }
        }

        // Odd symmetry: negate both the linear term and the offset for x < 0.
        int64_t out = (1 == sign) ? (((-1 * slope * absx) >> 27) - bias)
                                  : (((slope * absx) >> 27) + bias);
        c[idx] = SATURATE(out, 32);
    }
}

// Host wrapper: uploads the LUTs and launches one thread per element.
// Returns a new tensor; the input is not modified.
torch::Tensor arcs_qtanh_gpu(torch::Tensor a)
{
    int32_t N = a.numel();
    const int threads = 64;
    // FIX: the grid only needs ceil(N / threads) blocks along x. The old
    // dim3(..., threads) second dimension launched 64x redundant blocks
    // that all recomputed the same elements.
    const dim3 blocks((N + threads - 1) / threads);
    auto c = torch::zeros_like(a);
    const int* a_ptr = a.data_ptr<int>();
    int* c_ptr = c.data_ptr<int>();

    static const uint32_t bands[] = {0, 33584191, 58182438, 80361293, 102120620, 124483371, 148392654, 174972063, 206183541, 244722657, 281591256, 312045360, 358241226, 403095772, 471425623, 530372454, 2147483648};
    static const uint32_t slopes[] = {2114623134, 1910645334, 1662969581, 1396521636, 1131979061, 881027392, 652221876, 451080022, 281798391, 159724746, 98692663, 60001039, 27632752, 13732196, 2832019, 172634};
    static const uint32_t biass[] = {0, 51039676, 158405367, 317937954, 519217264, 751968259, 1004938252, 1267155516, 1527203771, 1749783808, 1877830237, 1967785133, 2054179492, 2095926998, 2134212723, 2144721503};

    uint32_t* buffer_bands;
    uint32_t* buffer_slopes;
    uint32_t* buffer_biass;

    cudaMalloc((void**)&buffer_bands, 17 * sizeof(uint32_t));
    cudaMemcpy(buffer_bands, bands, 17 * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&buffer_slopes, 16 * sizeof(uint32_t));
    cudaMemcpy(buffer_slopes, slopes, 16 * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&buffer_biass, 16 * sizeof(uint32_t));
    cudaMemcpy(buffer_biass, biass, 16 * sizeof(uint32_t), cudaMemcpyHostToDevice);

    arcs_qtanh_gpu_kernel<<<blocks, threads>>>(a_ptr, c_ptr, N, buffer_bands, buffer_slopes, buffer_biass);

    cudaFree(buffer_bands);
    cudaFree(buffer_slopes);
    cudaFree(buffer_biass);

    return c;
}
// ---------------------------------------------------------------------------
// Fake quantization: q_x = clamp(round(x * s)) / s.
// With r = clamp(round(x * s)), the derivative of q_x w.r.t. s is
//   (x - q_x) / s   when the clamp did not fire, and
//   -q_x / s        when it did.
// A learnable power-of-two exponent adds a chain-rule factor of -ln(2) * s.
// ---------------------------------------------------------------------------

// Quantize/dequantize one element per thread; mask records clamp hits
// (1.0 where the clamp fired) for use by the backward pass.
template <typename scalar_t>
__global__ void fake_quant_kernel(
    const scalar_t* __restrict__ input,
    scalar_t* __restrict__ output,
    scalar_t* __restrict__ mask,
    const float scale,
    const float quant_min,
    const float quant_max,
    const int64_t numel) {

  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < numel) {
    float x_s = static_cast<float>(input[idx]) * scale;
    float q = floorf(x_s + 0.5f);                      // round to nearest
    q = fminf(fmaxf(q, quant_min), quant_max);         // clamp to range
    bool is_clamped = (q == quant_min) || (q == quant_max);
    mask[idx] = static_cast<scalar_t>(is_clamped);
    output[idx] = static_cast<scalar_t>(q / scale);    // dequantize
  }
}

// Fake-quantize `input` with a power-of-two scale derived from `bit` and
// `factor` (scale = 2^round(bit - 1 - factor), capped at `scale_min`).
// Returns (dequantized output, clamp mask, effective scale).
std::tuple<torch::Tensor, torch::Tensor, float> fake_quant_cuda(
    torch::Tensor input,
    int bit,
    float factor,
    float scale_min,
    float quant_min,
    float quant_max) {

  float f = (float)(bit - 1) - factor;
  float scale = powf(2.0f, roundf(f));
  if (scale > scale_min) {
    scale = scale_min;
  }

  auto output = torch::empty_like(input);
  auto mask = torch::empty_like(input);
  const int threads = 256;
  const int blocks = (input.numel() + threads - 1) / threads;

  AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "fake_quant_cuda", ([&] {
    fake_quant_kernel<scalar_t><<<blocks, threads>>>(
        input.data_ptr<scalar_t>(),
        output.data_ptr<scalar_t>(),
        mask.data_ptr<scalar_t>(),
        scale,
        quant_min,
        quant_max,
        input.numel());
  }));

  return std::make_tuple(output, mask, scale);
}

// Same as fake_quant_cuda but with a caller-supplied scale (bias path);
// returns (dequantized output, clamp mask).
std::tuple<torch::Tensor, torch::Tensor> bias_quant_cuda(
    torch::Tensor input,
    int bit,
    float scale,
    float scale_min,
    float quant_min,
    float quant_max) {

  if (scale > scale_min) {
    scale = scale_min;
  }

  auto output = torch::empty_like(input);
  auto mask = torch::empty_like(input);
  const int threads = 256;
  const int blocks = (input.numel() + threads - 1) / threads;

  AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "fake_quant_cuda", ([&] {
    fake_quant_kernel<scalar_t><<<blocks, threads>>>(
        input.data_ptr<scalar_t>(),
        output.data_ptr<scalar_t>(),
        mask.data_ptr<scalar_t>(),
        scale,
        quant_min,
        quant_max,
        input.numel());
  }));

  return std::make_tuple(output, mask);
}

// As fake_quant_kernel, additionally emitting the per-element gradient
// coefficient for a learnable scale (scale_coff_back).
template <typename scalar_t>
__global__ void fake_quant_kernel_with_grad_scale(
    const scalar_t* __restrict__ input,
    scalar_t* __restrict__ output,
    scalar_t* __restrict__ mask,
    scalar_t* __restrict__ scale_coff_back,
    const float scale,
    const float learning_data_coff,
    const float quant_min,
    const float quant_max,
    const int64_t numel) {

  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < numel) {
    float x_s = static_cast<float>(input[idx]) * scale;
    float q = floorf(x_s + 0.5f);                      // round to nearest
    q = fminf(fmaxf(q, quant_min), quant_max);         // clamp to range
    bool is_clamped = ((q == quant_min) || (q == quant_max));
    mask[idx] = static_cast<scalar_t>(is_clamped);
    output[idx] = static_cast<scalar_t>(q / scale);    // dequantize
    // -q_x/s on the clamped branch, (x - q_x)/s otherwise, both scaled by
    // learning_data_coff (see derivation in the header comment).
    scale_coff_back[idx] = static_cast<scalar_t>(
        (mask[idx] * (-output[idx] * learning_data_coff / scale) +
         (1 - mask[idx]) * (input[idx] - output[idx]) * learning_data_coff / scale));
  }
}

// fake_quant_cuda variant that also returns the gradient coefficient for a
// learnable power-of-two exponent: (output, mask, scale_coff_back, scale).
std::tuple<torch::Tensor, torch::Tensor, torch::Tensor, float> fake_quant_cuda_with_grad_scale(
    torch::Tensor input,
    int bit,
    float factor,
    float scale_min,
    float quant_min,
    float quant_max) {

  float f = (float)(bit - 1) - factor;
  float scale = powf(2.0f, roundf(f));
  if (scale > scale_min) {
    scale = scale_min;
  }

  auto output = torch::empty_like(input);
  auto mask = torch::empty_like(input);
  auto scale_coff_back = torch::empty_like(input);
  const int threads = 256;
  const int blocks = (input.numel() + threads - 1) / threads;

  AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "fake_quant_cuda", ([&] {
    fake_quant_kernel_with_grad_scale<scalar_t><<<blocks, threads>>>(
        input.data_ptr<scalar_t>(),
        output.data_ptr<scalar_t>(),
        mask.data_ptr<scalar_t>(),
        scale_coff_back.data_ptr<scalar_t>(),
        scale,
        NEG_LN2 * scale,   // chain rule through scale = 2^t
        quant_min,
        quant_max,
        input.numel());
  }));

  return std::make_tuple(output, mask, scale_coff_back, scale);
}

// bias_quant_cuda variant returning the gradient coefficient as well:
// (output, mask, scale_coff_back). Uses a unit learning coefficient.
std::tuple<torch::Tensor, torch::Tensor, torch::Tensor> bias_quant_cuda_with_grad_scale(
    torch::Tensor input,
    int bit,
    float scale,
    float scale_min,
    float quant_min,
    float quant_max) {

  if (scale > scale_min) {
    scale = scale_min;
  }

  auto output = torch::empty_like(input);
  auto mask = torch::empty_like(input);
  auto scale_coff_back = torch::empty_like(input);
  const int threads = 256;
  const int blocks = (input.numel() + threads - 1) / threads;

  AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "fake_quant_cuda", ([&] {
    fake_quant_kernel_with_grad_scale<scalar_t><<<blocks, threads>>>(
        input.data_ptr<scalar_t>(),
        output.data_ptr<scalar_t>(),
        mask.data_ptr<scalar_t>(),
        scale_coff_back.data_ptr<scalar_t>(),
        scale,
        1,
        quant_min,
        quant_max,
        input.numel());
  }));

  return std::make_tuple(output, mask, scale_coff_back);
}
24445, 24418, 24392, 24365, + 24339, 24313, 24287, 24261, 24235, 24209, 24183, 24157, 24132, 24106, 24081, 24055, 24030, 24005, 23980, 23955, + 23930, 23905, 23880, 23855, 23831, 23806, 23782, 23757, 23733, 23709, 23684, 23660, 23636, 23612, 23588, 23564, + 23541, 23517, 23493, 23470, 23446, 23423, 23400, 23376, 23353, 23330, 23307, 23284, 23261, 23238, 23215, 23193, + 23170, 23147, 23125, 23102, 23080, 23058, 23035, 23013, 22991, 22969, 22947, 22925, 22903, 22881, 22860, 22838, + 22816, 22795, 22773, 22752, 22730, 22709, 22688, 22666, 22645, 22624, 22603, 22582, 22561, 22540, 22520, 22499, + 22478, 22458, 22437, 22416, 22396, 22376, 22355, 22335, 22315, 22294, 22274, 22254, 22234, 22214, 22194, 22175, + 22155, 22135, 22115, 22096, 22076, 22056, 22037, 22018, 21998, 21979, 21960, 21940, 21921, 21902, 21883, 21864, + 21845, 21826, 21807, 21788, 21769, 21751, 21732, 21713, 21695, 21676, 21658, 21639, 21621, 21602, 21584, 21566, + 21548, 21529, 21511, 21493, 21475, 21457, 21439, 21421, 21403, 21386, 21368, 21350, 21332, 21315, 21297, 21280, + 21262, 21245, 21227, 21210, 21193, 21175, 21158, 21141, 21124, 21107, 21089, 21072, 21055, 21038, 21022, 21005, + 20988, 20971, 20954, 20938, 20921, 20904, 20888, 20871, 20855, 20838, 20822, 20805, 20789, 20773, 20756, 20740, + 20724, 20708, 20691, 20675, 20659, 20643, 20627, 20611, 20595, 20580, 20564, 20548, 20532, 20516, 20501, 20485, + 20470, 20454, 20438, 20423, 20407, 20392, 20377, 20361, 20346, 20331, 20315, 20300, 20285, 20270, 20255, 20239, + 20224, 20209, 20194, 20179, 20164, 20150, 20135, 20120, 20105, 20090, 20076, 20061, 20046, 20032, 20017, 20002, + 19988, 19973, 19959, 19944, 19930, 19916, 19901, 19887, 19873, 19858, 19844, 19830, 19816, 19802, 19787, 19773, + 19759, 19745, 19731, 19717, 19703, 19690, 19676, 19662, 19648, 19634, 19620, 19607, 19593, 19579, 19566, 19552, + 19539, 19525, 19511, 19498, 19485, 19471, 19458, 19444, 19431, 19418, 19404, 19391, 19378, 19365, 19351, 19338, + 19325, 19312, 19299, 19286, 
19273, 19260, 19247, 19234, 19221, 19208, 19195, 19182, 19169, 19157, 19144, 19131, + 19118, 19106, 19093, 19080, 19068, 19055, 19042, 19030, 19017, 19005, 18992, 18980, 18968, 18955, 18943, 18930, + 18918, 18906, 18894, 18881, 18869, 18857, 18845, 18832, 18820, 18808, 18796, 18784, 18772, 18760, 18748, 18736, + 18724, 18712, 18700, 18688, 18676, 18665, 18653, 18641, 18629, 18618, 18606, 18594, 18582, 18571, 18559, 18547, + 18536, 18524, 18513, 18501, 18490, 18478, 18467, 18455, 18444, 18432, 18421, 18410, 18398, 18387, 18376, 18365, + 18353, 18342, 18331, 18320, 18308, 18297, 18286, 18275, 18264, 18253, 18242, 18231, 18220, 18209, 18198, 18187, + 18176, 18165, 18154, 18143, 18132, 18122, 18111, 18100, 18089, 18078, 18068, 18057, 18046, 18036, 18025, 18014, + 18004, 17993, 17982, 17972, 17961, 17951, 17940, 17930, 17919, 17909, 17898, 17888, 17878, 17867, 17857, 17846, + 17836, 17826, 17816, 17805, 17795, 17785, 17775, 17764, 17754, 17744, 17734, 17724, 17714, 17703, 17693, 17683, + 17673, 17663, 17653, 17643, 17633, 17623, 17613, 17603, 17593, 17584, 17574, 17564, 17554, 17544, 17534, 17525, + 17515, 17505, 17495, 17485, 17476, 17466, 17456, 17447, 17437, 17427, 17418, 17408, 17399, 17389, 17379, 17370, + 17360, 17351, 17341, 17332, 17322, 17313, 17304, 17294, 17285, 17275, 17266, 17257, 17247, 17238, 17229, 17219, + 17210, 17201, 17192, 17182, 17173, 17164, 17155, 17146, 17136, 17127, 17118, 17109, 17100, 17091, 17082, 17073, + 17064, 17055, 17046, 17037, 17028, 17019, 17010, 17001, 16992, 16983, 16974, 16965, 16956, 16947, 16938, 16930, + 16921, 16912, 16903, 16894, 16886, 16877, 16868, 16859, 16851, 16842, 16833, 16825, 16816, 16807, 16799, 16790, + 16782, 16773, 16764, 16756, 16747, 16739, 16730, 16722, 16713, 16705, 16696, 16688, 16679, 16671, 16662, 16654, + 16646, 16637, 16629, 16621, 16612, 16604, 16596, 16587, 16579, 16571, 16562, 16554, 16546, 16538, 16529, 16521, + 16513, 16505, 16497, 16489, 16480, 16472, 16464, 16456, 16448, 16440, 16432, 16424, 
16416, 16408, 16400, 16392 +}; + +__device__ int32_t saturate_q63_to_q31(int64_t src) +{ + int32_t ret; + int64_t int32_max = 0x7fffffff; + int64_t int32_min = 0xffffffff80000000; + if (src > int32_max) + { + ret = 0x7fffffff; + } + else if (src < int32_min) + { + ret = 0x80000000; + } + else + { + ret = (int32_t)src; + } + return ret; +} + +__device__ int32_t shfit_floor_x05_int32(int32_t x, int32_t shift) +{ + int32_t val = x; + + if (shift >= 32) { + return 0; + } + if (shift > 0) { + val = val >> (shift - 1); + val = (val & 0x1) + (val >> 1); + } + + return val; +} + +__device__ int64_t shfit_floor_x05_int64(int64_t x, int32_t shift) +{ + int64_t val = x; + + if (shift >= 64) { + return 0; + } + if (shift > 0) { + val = val >> (shift - 1); + val = (val & 0x1) + (val >> 1); + } + + return val; +} + +__device__ void calc_data_mul_vs(const int32_t *src, int32_t scalar, int32_t *dst, int32_t size, int32_t shift) +{ + int64_t d, d1, d2; + for (int32_t i = 0; i < size; i++) + { + d1 = (int64_t)*(src + i); + d2 = (int64_t)scalar; + d = d1 * d2; + d = shfit_floor_x05_int64(d, shift); + *(dst + i) = saturate_q63_to_q31(d); + } +} + +__device__ int32_t calc_sqrt_reciprocal(const int64_t data, int32_t *table, int32_t q_x, int32_t *table_shift) +{ + const int q_normal = 10; //normalize(-32, 32) + const int q2 = 14; + int64_t temp; + int q1; + + if (data & 0xC00000000000000) + { + temp = data>>50; + q1 = 30; + } + else if (data & 0x300000000000000) + { + temp = data>>48; + q1 = 29; + } + else if (data & 0xC0000000000000) + { + temp = data>>46; + q1 = 28; + } + else if (data & 0x30000000000000) + { + temp = data>>44; + q1 = 27; + } + else if (data & 0xC000000000000) + { + temp = data>>42; + q1 = 26; + } + else if (data & 0x3000000000000) + { + temp = data>>40; + q1 = 25; + } + else if (data & 0xC00000000000) + { + temp = data>>38; + q1 = 24; + } + else if (data & 0x300000000000) + { + temp = data>>36; + q1 = 23; + } + else if (data & 0xC0000000000) + { + temp = data>>34; + 
q1 = 22; + } + else if (data & 0x30000000000) + { + temp = data>>32; + q1 = 21; + } + else if (data & 0xC000000000) + { + temp = data>>30; + q1 = 20; + } + else if (data & 0x3000000000) + { + temp = data>>28; + q1 = 19; + } + else if (data & 0xC00000000) + { + temp = data>>26; + q1 = 18; + } + else if (data & 0x300000000) + { + temp = data>>24; + q1 = 17; + } + + else if (data & 0xC0000000) + { + temp = data>>22; + q1 = 16; + } + else if (data & 0x30000000) + { + temp = data>>20; + q1 = 15; + } + else if (data & 0xFC000000) + { + temp = data>>18; + q1 = 14; + } + else if (data & 0xF3000000) + { + temp = data>>16; + q1 = 13; + } + else if (data & 0xFFC00000) + { + temp = data>>14; + q1 = 12; + } + else if (data & 0xFF300000) + { + temp = data>>12; + q1 = 11; + } + else if (data & 0xFFFC0000) + { + temp = data>>10; + q1 = 10; + } + else if (data & 0xFFF30000) + { + temp = data>>8; + q1 = 9; + } + else if (data & 0xFFFFC000) + { + temp = data>>6; + q1 = 8; + } + else if (data & 0xFFFF3000) + { + temp = data>>4; + q1 = 7; + } + else if (data & 0xFFFFFC00) + { + temp = data>>2; + q1 = 6; + } + else if (data & 0xFFFFFF00) + { + temp = data; + q1 = 5; + } + else if (data & 0xFFFFFFC0) + { + temp = data<<2; + q1 = 4; + } + else if (data & 0xFFFFFFF0) + { + temp = data<<4; + q1 = 3; + } + else if (data & 0xFFFFFFFC) + { + temp = data<<6; + q1 = 2; + } + else if (data & 0xFFFFFFFF) + { + temp = data<<8; + q1 = 1; + } + else + { + temp = 256; + q1 = 0; + } + + int32_t id = temp - 256; + int32_t table_out = (int32_t)table[id]; + int32_t q = q1 + q2 - q_normal; + *table_shift = q;//(int32_t)powf(2, q); + return table_out; + +} + +__global__ void luna_find_sqrt_rec_table_kernel(const int32_t* __restrict__ n_ptr, const int64_t* __restrict__ d_ptr, int32_t* __restrict__ c_ptr, int32_t N, int32_t T, int32_t *table, int32_t scale_x) +{ + int32_t idx = blockIdx.x * blockDim.x + threadIdx.x; + int32_t tmp_val, shift = 0; + if (idx < N) + { + tmp_val = calc_sqrt_reciprocal(d_ptr[idx], 
table, scale_x, &shift); + calc_data_mul_vs((int32_t *)(n_ptr + idx * T), tmp_val, (c_ptr + idx * T), T, shift); + } +} + +torch::Tensor qlayernorm_kernel_gpu(torch::Tensor numerator, torch::Tensor denominator, float scale_x) +{ + int32_t N = denominator.numel(); + int32_t T = numerator.numel() / N; + const dim3 threads(64); + const dim3 blocks((N + threads.x - 1) / threads.x, threads.x); + auto c = torch::zeros_like(numerator); + const int64_t *d_ptr = denominator.data_ptr(); + const int32_t *n_ptr = numerator.data_ptr(); + int32_t *c_ptr = c.data_ptr(); + + int32_t *buffer_table_sqrt_reciprocal; + + cudaMalloc((void **)&buffer_table_sqrt_reciprocal, sizeof(g_s16Table_sqrt_reciprocal)); + cudaMemcpy(buffer_table_sqrt_reciprocal, g_s16Table_sqrt_reciprocal, sizeof(g_s16Table_sqrt_reciprocal), cudaMemcpyHostToDevice); + + luna_find_sqrt_rec_table_kernel<<>>(n_ptr, d_ptr, c_ptr, N, T, buffer_table_sqrt_reciprocal, (int32_t)scale_x); + + cudaFree(buffer_table_sqrt_reciprocal); + + return c; +} diff --git a/linger/kernel/gpu/util_kernel.cu b/linger/kernel/gpu/util_kernel.cu new file mode 100644 index 0000000..7dae1d2 --- /dev/null +++ b/linger/kernel/gpu/util_kernel.cu @@ -0,0 +1,34 @@ +#include +#include +#include +#include +#include + +#define THREADS_PER_BLOCK 256 + +#define CUDA_1D_KERNEL_LOOP(i, n) \ + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; \ + i += blockDim.x * gridDim.x) + +inline int GET_BLOCKS(const int N) { + int optimal_block_num = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK; + int max_block_num = 65000; + return min(optimal_block_num, max_block_num); +} + +__global__ void find_table(float* value, const int32_t* table_index, const float* table, int32_t size){ + CUDA_1D_KERNEL_LOOP(i, size){ + int32_t index = table_index[i]; + value[i] *= table[index]; + } +} + +void find_table_gpu(torch::Tensor value, torch::Tensor table, torch::Tensor table_index) +{ + int32_t N = value.numel(); + float* value_ptr = value.data_ptr(); + const float* 
table_ptr = table.data_ptr(); + const int32_t* table_index_ptr = table_index.data_ptr(); + find_table<<>>(value_ptr, table_index_ptr, table_ptr, N); +} + diff --git a/linger/kernel/gpu/venusa_qsigmoid_kernel.cu b/linger/kernel/gpu/venusa_qsigmoid_kernel.cu new file mode 100644 index 0000000..b071f97 --- /dev/null +++ b/linger/kernel/gpu/venusa_qsigmoid_kernel.cu @@ -0,0 +1,114 @@ +#include +#include +#include +#include +#include + +#define MAX_BITS(bits) ((1LL << (bits - 1)) - 1) +#define MIN_BITS(bits) (-(1LL << (bits - 1))) +#define SATURATE(x, bits) \ + ((x) > MAX_BITS(bits) ? MAX_BITS(bits) \ + : ((x) < MIN_BITS(bits) ? MIN_BITS(bits) : (x))) + +__global__ void venusa_qsigmoid_gpu_kernel(const int* __restrict__ a, + int* __restrict__ c, + int32_t len, uint32_t* bands , uint32_t* slopes, + uint32_t* bias0s , uint32_t* bias1s) +{ + int32_t idx = blockIdx.x * blockDim.x + threadIdx.x; + uint32_t j = 0; + uint32_t sign = 0; + int64_t absx = 0; + int64_t slope = 0; + int64_t bias = 0; + int32_t shift = 0; + int64_t tmp = 0; + int64_t out = 0; + + if (idx < len) + { + tmp = a[idx]; + + if (tmp < 0) + { + sign = 1; + if (tmp == (-1ULL<<31)) { + absx = (1ULL<<31)-1; + } else { + absx = -tmp; + } + } + else + { + sign = 0; + absx = tmp; + } + + + for (j = 1; j < 17; ++j) + { + + if (absx <= bands[j]) + { + slope = slopes[j - 1]; + if (1 == sign) + { + bias = bias1s[j - 1]; + } + else + { + bias = bias0s[j - 1]; + } + break; + } + else + { + slope = 0; + bias = 0; + shift = 0; + } + } + + bias = bias << 3; + out = ((slope * tmp) >> 27) + bias; + c[idx] = SATURATE(out, 32); + } +} + +torch::Tensor venusa_qsigmoid_gpu(torch::Tensor a) +{ + int32_t N = a.numel(); + const int threads = 64; + const dim3 blocks((N + threads - 1) / threads, threads); + auto c = torch::zeros_like(a); + const int* a_ptr= a.data_ptr(); + int * c_ptr = c.data_ptr(); + static const uint32_t bands[] = {0, 63656107, 111395083, 153816608, 194953701, 236865036, 281253298, 329433241, 384154475, 
447714743, 515399149, 589016819, 679940425, 759830874, 862986200, 965402453, 2147483648}; + static const uint32_t slopes[] = {529475578, 482862538, 424212188, 361565531, 298613704, 238039992, 181912633, 132002182, 89144402, 56020914, 34019888, 18928477, 9459126, 5291645, 2204341, 177654}; + static const uint32_t bias0s[] = {134217728,136981152,143065818,152040132,163469967,176832392,191534255,206847211,222180513,235991910,246552465,254831080,260827489,263776596,266257920,268080117}; + static const uint32_t bias1s[] = {134217728,131454303,125369637,116395323,104965488,91603063,76901200,61588244,46254942,32443545,21882990,13604375,7607967,4658859,2177535,355339}; + + uint32_t* buffer_bands; + uint32_t* buffer_slopes; + uint32_t* buffer_bias0s; + uint32_t* buffer_bias1s; + + + cudaMalloc((void**)&buffer_bands, 17*sizeof(uint32_t)); + cudaMemcpy(buffer_bands, bands, 17*sizeof(uint32_t), cudaMemcpyHostToDevice); + cudaMalloc((void**)&buffer_slopes, 16*sizeof(uint32_t)); + cudaMemcpy(buffer_slopes, slopes, 16*sizeof(uint32_t), cudaMemcpyHostToDevice); + cudaMalloc((void**)&buffer_bias0s, 16*sizeof(uint32_t)); + cudaMemcpy(buffer_bias0s, bias0s, 16*sizeof(uint32_t), cudaMemcpyHostToDevice); + cudaMalloc((void**)&buffer_bias1s, 16*sizeof(uint32_t)); + cudaMemcpy(buffer_bias1s, bias1s, 16*sizeof(uint32_t), cudaMemcpyHostToDevice); + + venusa_qsigmoid_gpu_kernel<<>>(a_ptr, c_ptr, N, buffer_bands, buffer_slopes, buffer_bias0s, buffer_bias1s ); + cudaFree(buffer_bands); + cudaFree(buffer_slopes); + cudaFree(buffer_bias0s); + cudaFree(buffer_bias1s); + + return c; +} + diff --git a/linger/kernel/gpu/venusa_qsoftmax_kernel.cu b/linger/kernel/gpu/venusa_qsoftmax_kernel.cu new file mode 100644 index 0000000..be9fee0 --- /dev/null +++ b/linger/kernel/gpu/venusa_qsoftmax_kernel.cu @@ -0,0 +1,210 @@ +#include +#include +#include +#include +#include +#include +#include + +#define MAX_BITS(bits) ((1LL << (bits - 1)) - 1) +#define MIN_BITS(bits) (-(1LL << (bits - 1))) +#define 
SATURATE(x, bits) \ + ((x) > MAX_BITS(bits) ? MAX_BITS(bits) \ + : ((x) < MIN_BITS(bits) ? MIN_BITS(bits) : (x))) + +static __device__ int32_t shift_pure(int64_t v, int32_t s) +{ + if (s >= 63 || s <= -63) { + return 0; + } + if (s > 0) { + v = v << s; + } else { + v = v >> (-s); + } + return SATURATE(v, 32); +} + +static __device__ int32_t sub32s(int32_t a, int32_t b) +{ + int64_t s = 0; + s = (int64_t)((int64_t)a - (int64_t)b); + return SATURATE(s, 32); +} + +static __device__ int32_t add32s(int32_t a, int32_t b) +{ + int64_t s = 0; + s = (int64_t)((int64_t)a + (int64_t)b); + return SATURATE(s, 32); +} + +#if 0 +static __device__ int32_t shift_rasym(int64_t x,int32_t n) +{ + // double y = (double)x * (double)pow(2, n); + // y = floor(y + 0.5); + // return (int32_t)y; + if (n >= 63 || n <= -63) { + return 0; + } + if (n >= 0) { + x = x << n; + } + else { + n = (-n); + x = x >> (n - 1); + x = (x & 0x1) + (x >> 1); + } + return x; +} + +static __device__ int32_t shift_rasyms(int64_t x, int32_t n) +{ + // double y = (double)x * (double)pow(2, n); + + // y = floor(y + 0.5); + // return SATURATE(y, 32); + if (n >= 64 || n <= -64) { + return 0; + } + if (n >= 0) { + x = x << n; + } + else { + n = (-n); + x = x >> (n - 1); + x = (x & 0x1) + (x >> 1); + } + return SATURATE(x, 32); +} +#endif + +static __device__ int64_t shfit_floor_x05_int64(int64_t x, int32_t shift) +{ + int64_t val = x; + + if (shift >= 64) { + return 0; + } + if (shift > 0) { + val = val >> (shift - 1); + val = (val & 0x1) + (val >> 1); + } + + return val; +} + +static __device__ void venusa_qsoftmax_c(const int* __restrict__ x_ptr, int* __restrict__ y_ptr, int32_t N, int32_t* p23) +{ + int32_t max_value = x_ptr[0]; + int32_t X = 0; + int64_t Y = 0; + int32_t E = 0; + int64_t E_SUM = 0; + uint32_t A = 0x800000, B = 0, C = 0; + for (int i = 1; i < N; i++) + { + max_value = x_ptr[i] > max_value ? 
x_ptr[i] : max_value; + } + if (max_value == (int32_t)0x80000000) + max_value += 1; + + for (int i = 0; i < N; i++) + { + X = sub32s(x_ptr[i] ,max_value); + X = shfit_floor_x05_int64((int64_t)X * (int64_t)774541002, 31); + // X = shift_rasyms((int64_t)X * (int64_t)774541002,-31); + E = X >> 23; + E = E + 1; + + X = X & 0x7fffff; + X = X - 0x800000; + + Y = p23[0]; + for (int j = 1; j < 5; j++) + { + int64_t t = (((int64_t)Y * (int64_t)X) <<7) + ((int64_t)p23[j] << 30); + if (j < 4) { + Y = shfit_floor_x05_int64(t, 30); + } else { + if (30 - E > 63) { + Y = 0; + } else { + Y = shfit_floor_x05_int64(t, 30 - E); + } + } + } + ((int32_t*)y_ptr)[i] = Y; + } + + for (int i = 0; i < N; i++) + { + E_SUM += y_ptr[i]; + } + + B = SATURATE(E_SUM, 32); + for (int i = 1; i <= 30; i++) { + if(A>=B) { + C = C+1; + C = C*2; + A = A-B; + A = A*2; + } else { + C = C*2; + A = A*2; + } + } + + for (int i = 0; i < N; i++) + { + y_ptr[i] = shfit_floor_x05_int64((int64_t)C * (int64_t)y_ptr[i], 38); + } +} + +__global__ void venusa_qsoftmax_gpu_kernel(const int* __restrict__ p_in, + int* __restrict__ p_out, + int32_t N, int32_t L, int32_t* p23) +{ + int32_t idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx < L) + // for (int k = 0; k < L; k++) + { + const int* x_ptr = p_in + N * idx; + int * y_ptr = p_out + N * idx; + + venusa_qsoftmax_c(x_ptr, y_ptr, N, p23); + } +} + +torch::Tensor venusa_qsoftmax_gpu(const torch::Tensor& in, int64_t dim) +{ + // int32_t N = in.numel(); + int32_t N = in.size(1); + int32_t L = in.size(0); + const dim3 threads = (64); + const dim3 blocks((L + threads.x - 1) / threads.x); + auto out = torch::zeros_like(in); + const int* x_ptr = in.data_ptr(); + int * y_ptr = out.data_ptr(); + + // static const int32_t p[] = { 14685058, 114217091, 514075394, 1488269031, 2147475316 }; + static const int32_t p23[] = { 57364 ,446161 ,2008107 , 5813551, 8388575 }; + + // int32_t *buffer_p; + int32_t *buffer_p23; + + // cudaMalloc((void **)&buffer_p, 5 * sizeof(int32_t)); + 
// cudaMemcpy(buffer_p, p, 5 * sizeof(int32_t), cudaMemcpyHostToDevice); + cudaMalloc((void **)&buffer_p23, 5 * sizeof(int32_t)); + cudaMemcpy(buffer_p23, p23, 5 * sizeof(int32_t), cudaMemcpyHostToDevice); + + venusa_qsoftmax_gpu_kernel<<>>(x_ptr, y_ptr, N, L, buffer_p23); + // venusa_qsoftmax_gpu_kernel<<<1, 1>>>(x_ptr, y_ptr, N, L, buffer_p, buffer_p23); + + // cudaFree(buffer_p); + cudaFree(buffer_p23); + + return out; +} + diff --git a/linger/kernel/gpu/venusa_qtanh_kernel.cu b/linger/kernel/gpu/venusa_qtanh_kernel.cu new file mode 100644 index 0000000..5272d2e --- /dev/null +++ b/linger/kernel/gpu/venusa_qtanh_kernel.cu @@ -0,0 +1,109 @@ +#include +#include +#include +#include +#include + +#define MAX_BITS(bits) ((1LL << (bits - 1)) - 1) +#define MIN_BITS(bits) (-(1LL << (bits - 1))) +#define SATURATE(x, bits) \ + ((x) > MAX_BITS(bits) ? MAX_BITS(bits) \ + : ((x) < MIN_BITS(bits) ? MIN_BITS(bits) : (x))) + +__global__ void venusa_qtanh_gpu_kernel(const int* __restrict__ a, + int* __restrict__ c, + int32_t len, uint32_t* bands , uint32_t* slopes, + uint32_t* biass) +{ + int32_t idx = blockIdx.x * blockDim.x + threadIdx.x; + uint32_t j = 0; + uint32_t sign = 0; + int64_t absx = 0; + int64_t slope = 0; + int64_t bias = 0; + uint32_t shift = 0; + int64_t tmp = 0; + int64_t out = 0; + + if (idx < len) + { + tmp = a[idx]; + + if (tmp < 0) + { + sign = 1; + if (tmp == -1 * (1 << 31)) + { + absx = (1 << 31) - 1; + } + else + { + absx = -tmp; + } + } + else + { + sign = 0; + absx = tmp; + } + + for (j = 1; j < 17; ++j) + { + if (absx <= bands[j]) + { + slope = slopes[j - 1]; + bias = biass[j - 1]; + break; + } + else + { + slope = 0; + bias = 0; + } + } + + bias = bias << 3; + if (1 == sign) + { + out = ((-1 * slope * absx) >> 27) - bias; + } + else + { + out = ((slope * absx) >> 27) + bias; + } + + c[idx] = SATURATE(out, 32); + } +} + +torch::Tensor venusa_qtanh_gpu(torch::Tensor a) +{ + int32_t N = a.numel(); + const int threads = 64; + const dim3 blocks((N + threads 
- 1) / threads, threads); + auto c = torch::zeros_like(a); + const int* a_ptr= a.data_ptr(); + int * c_ptr = c.data_ptr(); + static const uint32_t bands[] = {0, 33584191, 58182438, 80361293, 102120620, 124483371, 148392654, 174972063, 206183541, 244722657, 281591256, 312045360, 358241226, 403095772, 471425623, 530372454, 2147483648}; + static const uint32_t slopes[] = {2114623134, 1910645334, 1662969581, 1396521636, 1131979061, 881027392, 652221876, 451080022, 281798391, 159724746, 98692663, 60001039, 27632752, 13732196, 2832019, 172634}; + static const uint32_t biass[] = {0,6379959,19800670,39742244,64902158,93996032,125617281,158394439,190900471,218722976,234728779,245973141,256772436,261990874,266776590,268090187,}; + + uint32_t* buffer_bands; + uint32_t* buffer_slopes; + uint32_t* buffer_biass; + + cudaMalloc((void**)&buffer_bands, 17*sizeof(uint32_t)); + cudaMemcpy(buffer_bands, bands, 17*sizeof(uint32_t), cudaMemcpyHostToDevice); + cudaMalloc((void**)&buffer_slopes, 16*sizeof(uint32_t)); + cudaMemcpy(buffer_slopes, slopes, 16*sizeof(uint32_t), cudaMemcpyHostToDevice); + cudaMalloc((void**)&buffer_biass, 16*sizeof(uint32_t)); + cudaMemcpy(buffer_biass, biass, 16*sizeof(uint32_t), cudaMemcpyHostToDevice); + + venusa_qtanh_gpu_kernel<<>>(a_ptr, c_ptr, N, buffer_bands, buffer_slopes, buffer_biass); + cudaFree(buffer_bands); + cudaFree(buffer_slopes); + cudaFree(buffer_biass); + + return c; +} + diff --git a/linger/layer_normalizer.py b/linger/layer_normalizer.py deleted file mode 100644 index 9fa5c3e..0000000 --- a/linger/layer_normalizer.py +++ /dev/null @@ -1,208 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -from typing import Tuple - -import torch -import torch.nn as nn - -from .config import config -from .modules import * -from .ops.ops_names import LINGER_AHEAD_RELU, LINGER_AHEAD_SIGMOID -from .utils import ClampInfo, Singleton, get_device, logger - - -class _SingletonContainClampModules(Singleton): - clamped_quant_list = {} - _is_close_register 
= False - - def _close_register(self): - self._is_close_register = True - - def _register(self, module, quant_info): - assert isinstance(module, torch.nn.Module) or isinstance(module, list) - modules = [module] if isinstance(module, torch.nn.Module) else module - for each_mod in modules: - if self._is_close_register: - print("warning: module has initlized and linger.init may not work") - self.clamped_quant_list[each_mod] = quant_info - - def _is_registered(self, module): - return (module in self.clamped_quant_list.keys()) - - def get(self, module): - return self.clamped_quant_list.get(module) - - def clear(self): - if self._is_close_register: - print("warning: module has initlized and linger.clear may not work") - self.clamped_quant_list.clear() - - -def disable_normalize(module: nn.Module): - r""" - 禁用clamp策略替换网络,该接口支持inline 使用 - Notes: - disable_normalize 应该在linger.normalize_layers函数之前调用才生效 - """ - queue = [module] - while len(queue) > 0: - node = queue.pop(0) - if type(node) in SupportNormalizeTorchModules: - _SingletonContainClampModules()._register(node, None) - for _, submodule in node.named_children(): - queue.append(submodule) - - -def normalize_module(module: nn.Module, type_modules: Tuple = DefaultNormalizeIntXModule, normalize_weight_value: float = 8, normalize_bias_value: float = 8, normalize_output_value: float = None): - r""" - 对module进行自定义clamp设置,通用的clamp策略接口,该接口支持inline使用 - """ - if not isinstance(type_modules, tuple): - type_modules = (type_modules, ) - for t in type_modules: - if t not in SupportNormalizeTorchModules: - logger.fatal(str(t) + 'is not support clamp in linger now') - exit(-1) - queue = [module] - while len(queue) > 0: - node = queue.pop(0) - if type(node) in type_modules and type(node) in SupportNormalizeTorchModules: - cinfo = ClampInfo() - cinfo.set_clamp_weight_value(normalize_weight_value) - cinfo.set_clamp_bias_value(normalize_bias_value) - cinfo.set_clamp_output_value(normalize_output_value) - 
_SingletonContainClampModules()._register(node, cinfo) - for _, submodule in node.named_children(): - queue.append(submodule) - - -def _replaceModule(submodule, normalize_weight_value, normalize_bias_value, normalize_output_value): - if isinstance(submodule, tuple(SupportNormalizeTorchModules)): - if isinstance(submodule, NormalizeConvBN2d): - bias = True if submodule.conv.bias is not None else False - if config.BnMomentumUpdate.disable: - submodule_bn_momentum = 0 - else: - submodule_bn_momentum = submodule.bn.momentum - convbn = NormalizeConvBN2d(submodule.conv.in_channels, submodule.conv.out_channels, submodule.conv.kernel_size, submodule.conv.stride, submodule.conv.padding, submodule.conv.dilation, - submodule.conv.groups, bias, submodule.conv.padding_mode, submodule.bn.eps, submodule_bn_momentum, submodule.bn.affine, submodule.bn.track_running_stats, - normalize_data=normalize_output_value, normalize_weight=normalize_weight_value, normalize_bias=normalize_bias_value, ahead_relu=submodule.ahead_relu) - return convbn - elif isinstance(submodule, NormalizeConvBN1d): - bias = True if submodule.conv.bias is not None else False - if config.BnMomentumUpdate.disable: - submodule_bn_momentum = 0 - else: - submodule_bn_momentum = submodule.bn.momentum - convbn = NormalizeConvBN1d(submodule.conv.in_channels, submodule.conv.out_channels, submodule.conv.kernel_size, submodule.conv.stride, submodule.conv.padding, submodule.conv.dilation, - submodule.conv.groups, bias, submodule.conv.padding_mode, submodule.bn.eps, submodule_bn_momentum, submodule.bn.affine, submodule.bn.track_running_stats, - normalize_data=normalize_output_value, normalize_weight=normalize_weight_value, normalize_bias=normalize_bias_value, ahead_relu=submodule.ahead_relu) - return convbn - elif isinstance(submodule, NormalizeConvTransposeBN2d): - # ahead_add = getattr(submodule,IFLYTEK_BITBRAIN_AHEAD_ADD,False) - # submodule.ahead_add = ahead_add - submodule.normalize_data = normalize_output_value - 
submodule.normalize_weight = normalize_weight_value - submodule.normalize_bias = normalize_bias_value - if config.BnMomentumUpdate.disable: - submodule.bn.momentum = 0 - return submodule - elif isinstance(submodule, nn.Conv2d): - bias = True if submodule.bias is not None else False - ahead_relu = getattr(submodule, LINGER_AHEAD_RELU, False) - ahead_sigmoid = getattr( - submodule, LINGER_AHEAD_SIGMOID, False) - - conv = NormalizeConv2d(submodule.in_channels, submodule.out_channels, submodule.kernel_size, submodule.stride, submodule.padding, submodule.dilation, - submodule.groups, bias, submodule.padding_mode, normalize_data=normalize_output_value, normalize_weight=normalize_weight_value, normalize_bias=normalize_bias_value, ahead_relu=ahead_relu, ahead_sigmoid=ahead_sigmoid) - return conv - elif isinstance(submodule, nn.Linear): - bias = True if submodule.bias is not None else False - ahead_relu = getattr(submodule, LINGER_AHEAD_RELU, False) - ahead_sigmoid = getattr( - submodule, LINGER_AHEAD_SIGMOID, False) - - linear = NormalizeLinear(submodule.in_features, submodule.out_features, bias, normalize_data=normalize_output_value, normalize_weight=normalize_weight_value, - normalize_bias=normalize_bias_value, ahead_relu=ahead_relu, ahead_sigmoid=ahead_sigmoid) - return linear - elif isinstance(submodule, nn.Embedding): - embed = NormalizeEmbedding(submodule.num_embeddings, submodule.embedding_dim, submodule.padding_idx, submodule.max_norm, submodule.norm_type, submodule.scale_grad_by_freq, - submodule.sparse, submodule.weight, normalize_data=normalize_output_value, normalize_weight=normalize_weight_value) - return embed - elif isinstance(submodule, nn.ConvTranspose2d): - bias = True if submodule.bias is not None else False - convtranspose = NormalizeConvTranspose2d(submodule.in_channels, submodule.out_channels, submodule.kernel_size, submodule.stride, submodule.padding, - submodule.output_padding, submodule.groups, bias, submodule.dilation, submodule.padding_mode, 
normalize_data=normalize_output_value, normalize_weight=normalize_weight_value, normalize_bias=normalize_bias_value) - return convtranspose - elif isinstance(submodule, nn.BatchNorm2d): - ahead_relu = getattr(submodule, LINGER_AHEAD_RELU, False) - if config.BnMomentumUpdate.disable: - submodule_momentum = 0 - else: - submodule_momentum = submodule.momentum - bn = NormalizeBatchNorm2d(submodule.num_features, eps=submodule.eps, momentum=submodule_momentum, affine=submodule.affine, track_running_stats=submodule.track_running_stats, - normalize_data=normalize_output_value, normalize_weight=normalize_weight_value, normalize_bias=normalize_bias_value, ahead_relu=ahead_relu) - return bn - elif isinstance(submodule, nn.LayerNorm): - ahead_relu = getattr(submodule, LINGER_AHEAD_RELU, False) - ln = NormalizeLayerNorm(submodule.normalized_shape, eps=submodule.eps, elementwise_affine=submodule.elementwise_affine, - normalize_data=normalize_output_value, normalize_weight=normalize_weight_value, normalize_bias=normalize_bias_value, ahead_relu=ahead_relu) - return ln - elif isinstance(submodule, nn.Conv1d): - bias = True if submodule.bias is not None else False - ahead_relu = getattr(submodule, LINGER_AHEAD_RELU, False) - conv = NormalizeConv1d(submodule.in_channels, submodule.out_channels, submodule.kernel_size, submodule.stride, submodule.padding, submodule.dilation, - submodule.groups, bias, submodule.padding_mode, normalize_data=normalize_output_value, normalize_weight=normalize_weight_value, normalize_bias=normalize_bias_value, ahead_relu=ahead_relu) - return conv - elif isinstance(submodule, nn.GRU): - gru = NormalizeFastGRU(submodule.input_size, submodule.hidden_size, submodule.num_layers, submodule.bias, submodule.batch_first, submodule.dropout, submodule.bidirectional, - normalize_data=normalize_output_value, normalize_weight=normalize_weight_value, normalize_bias=normalize_bias_value) - return gru - elif isinstance(submodule, nn.LSTM): - lstm = 
NormalizeFastLSTM(submodule.input_size, submodule.hidden_size, submodule.num_layers, submodule.bias, submodule.batch_first, submodule.dropout, submodule.bidirectional, - normalize_data=normalize_output_value, normalize_weight=normalize_weight_value, normalize_bias=normalize_bias_value) - return lstm - - return None - - -def normalize_layers(model: nn.Module, *, normalize_modules: Tuple = DefaultNormalizeIntXModule, normalize_weight_value: float = 8, normalize_bias_value: float = None, normalize_output_value: float = 8) -> nn.Module: - r"""对模型进行clamp 处理以方便进行更好的量化, - 默认支持nn.Conv2d, nn.Linear, nn.ConvTranspose2d, Conv-BN融合Clamp - - """ - if type(normalize_modules) is not tuple: - normalize_modules = (normalize_modules, ) - normalize_modules = set(list(normalize_modules) + - [NormalizeConvBN2d, NormalizeConvBN1d, NormalizeConvTransposeBN2d]) - for user_module_type in normalize_modules: - assert user_module_type in SupportNormalizeTorchModules, 'Currently not support clamp of ' + \ - str(user_module_type) - device = get_device(model) - queue = [model] - while len(queue) > 0: - node = queue.pop(0) - for name, submodule in node.named_children(): - if _SingletonContainClampModules()._is_registered(submodule): - clamp_info = _SingletonContainClampModules().get(submodule) - if clamp_info == None: - continue - else: - r_module = _replaceModule(submodule, clamp_info.clamp_weight_value, clamp_info.clamp_bias_value, - clamp_info.clamp_output_value) - assert r_module is not None - setattr(node, name, r_module) - elif type(submodule) in SupportNormalizeTorchModules and type(submodule) in normalize_modules: - r_module = _replaceModule( - submodule, normalize_weight_value, normalize_bias_value, normalize_output_value) - assert r_module is not None - setattr(node, name, r_module) - else: - queue.append(submodule) - - if model.training: - model.train() - else: - model.eval() - model.to(device) - return model diff --git a/linger/layer_tracer.py b/linger/layer_tracer.py index 
34cf248..932fff9 100644 --- a/linger/layer_tracer.py +++ b/linger/layer_tracer.py @@ -1,19 +1,399 @@ +import torch import torch.nn as nn -from .conv_bn_fuser import * +from .utils import Singleton, ActivationType, get_device +from .constrain import ConvBN1d, ConvBN2d +from .config import QUANT_CONFIGS +LINGER_IGNORE_PAMAMTER = "_linger_ignore_parameter" +LINGER_ACTIVATION_TYPE = "_linger_activation_type" -def trace_layers(root_model: nn.Module, target_model: nn.Module, *args, fuse_bn: bool = True, ahead_conv_relu: bool = True, ahead_bn_relu: bool = True, ahead_linear_relu: bool = True, ahead_conv_sigmoid: bool = True, ahead_linear_sigmoid: bool = True) -> nn.Module: +class FuseableConvBN(): + def __init__(self, conv_f, conv, bn_f, bn, root_model=None): + self.conv_f = conv_f + self.conv = conv + self.bn_f = bn_f + self.bn = bn + self.scope_conv = None + self.scope_bn = None + self.root_model = None + + def set_root_model(self, root_model): + self.root_model = root_model + +class EmptyBatchNorm(torch.nn.Module): + r"""融合后的BNmoudule占位符,没有进行任何Tensor操作 + + """ + + def __init__(self): + super(EmptyBatchNorm, self).__init__() + setattr(self, LINGER_IGNORE_PAMAMTER, + torch.nn.Parameter(torch.zeros([1]))) + + def forward(self, input): + return input + + def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs): + pass + +def fuse_conv_bn(conv, bn): + eps = 1e-5 + c_b = getattr(conv, 'bias', None) + + b_mean = bn.running_mean.data + b_var = bn.running_var.data + b_w = bn.weight.data + b_b = bn.bias.data + sigma = 1/torch.sqrt(b_var+eps) + alpha = b_w * sigma + beta = b_b - b_mean * alpha + conv.weight.data.mul_(alpha.view(-1, *([1]*(len(conv.weight.shape)-1)))) + if c_b is not None: + conv.bias.data.mul_(alpha).add_(beta) + else: + conv.bias = bn.bias + conv.bias.data.mul_(0).add_(beta) + +class SingletonConvFusedBnModules(Singleton): + fused_conv_module = {} + fused_bn_module = {} + _is_close_register = False + 
+ def _close_register(self): + self._is_close_register = True + + def _register(self, fuseable_conv_bn): + if self._is_close_register: + print("warning: module has initlized and linger.init may not work") + self.fused_conv_module[fuseable_conv_bn.conv] = fuseable_conv_bn + self.fused_bn_module[fuseable_conv_bn.bn] = fuseable_conv_bn + + def _is_registered_conv(self, conv): + f_conv = self.fused_conv_module.get(conv) + return f_conv + + def _is_registered_bn(self, bn): + f_bn = self.fused_bn_module.get(bn) + return f_bn + + def build_normalize_convbn2d_scope(self, model): + queue = [('', '', model)] + while len(queue) > 0: + (node_name, scope_name, node) = queue.pop(0) + find_fused_info = self._is_registered_conv(node) + if find_fused_info is not None: + find_fused_info.scope_conv = scope_name + if find_fused_info.root_model is None: + find_fused_info.set_root_model(model) + conv_m = find_fused_info.conv + bn_m = find_fused_info.bn + conv_have_bias = False if conv_m.bias is None else True + clamp_conv = None + device = QUANT_CONFIGS.device + activation_type = getattr(conv_m, LINGER_ACTIVATION_TYPE, ActivationType.none) + if type(conv_m) == torch.nn.Conv2d: + clamp_conv = ConvBN2d(in_channels=conv_m.in_channels, out_channels=conv_m.out_channels, kernel_size=conv_m.kernel_size, stride=conv_m.stride, + padding=conv_m.padding, dilation=conv_m.dilation, groups=conv_m.groups, bias=conv_have_bias, padding_mode=conv_m.padding_mode, + eps=bn_m.eps, momentum=bn_m.momentum, affine=bn_m.affine, track_running_stats=bn_m.track_running_stats, + constrain=None) + elif type(conv_m) == torch.nn.Conv1d: + clamp_conv = ConvBN1d(in_channels=conv_m.in_channels, out_channels=conv_m.out_channels, kernel_size=conv_m.kernel_size, stride=conv_m.stride, + padding=conv_m.padding, dilation=conv_m.dilation, groups=conv_m.groups, bias=conv_have_bias, padding_mode=conv_m.padding_mode, + eps=bn_m.eps, momentum=bn_m.momentum, affine=bn_m.affine, track_running_stats=bn_m.track_running_stats, + 
constrain=None) + # elif type(conv_m) == torch.nn.ConvTranspose2d: + # clamp_conv = CConvBN2d(in_channels=conv_m.in_channels, out_channels=conv_m.out_channels, kernel_size=conv_m.kernel_size, stride=conv_m.stride, + # padding=conv_m.padding, dilation=conv_m.dilation, groups=conv_m.groups, bias=conv_have_bias, padding_mode=conv_m.padding_mode, + # eps=bn_m.eps, momentum=bn_m.momentum, affine=bn_m.affine, track_running_stats=bn_m.track_running_stats, + # constrain=None) + setattr(clamp_conv, LINGER_ACTIVATION_TYPE, activation_type) + clamp_conv = clamp_conv.to(device) + setattr(find_fused_info.conv_f, node_name, clamp_conv) + else: + assert find_fused_info.root_model == model + for name, submodule in node.named_children(): + prefix = '' if scope_name == '' else scope_name+'.' + queue.append((name, prefix+name, submodule)) + + def build_empty_bn_scope(self, model): + queue = [('', '', model)] + while len(queue) > 0: + (node_name, scope_name, node) = queue.pop(0) + find_fused_info = self._is_registered_bn(node) + if find_fused_info is not None: + find_fused_info.scope_bn = scope_name + if find_fused_info.root_model is None: + find_fused_info.set_root_model(model) + else: + assert find_fused_info.root_model == model + setattr(find_fused_info.bn_f, node_name, EmptyBatchNorm()) + for name, submodule in node.named_children(): + prefix = '' if scope_name == '' else scope_name+'.' + queue.append((name, prefix+name, submodule)) + + @staticmethod + def get_module(model, scope): + attr_arr = scope.split('.') + cur_module = model + for att in attr_arr: + cur_module = getattr(cur_module, att, None) + return cur_module + + def fuse_state_dicts(self, state_dict): + for v in self.fused_conv_module.values(): + assert v.scope_conv is not None + assert v.scope_bn is not None + + class GeneralModule(): + pass + keys_bn = [] + atts_bn = {} + for key_dict in state_dict.keys(): + prefix = v.scope_bn+'.' 
+ if key_dict.startswith(prefix): + attr_name = key_dict[len(prefix):] + attr_name = attr_name.split('.', 1)[0] + keys_bn.append(key_dict) + atts_bn[attr_name] = state_dict[key_dict] + keys_conv = [] + atts_conv = {} + for key_dict in state_dict.keys(): + prefix = v.scope_conv+'.' + if key_dict.startswith(prefix): + attr_name = key_dict[len(prefix):] + attr_name = attr_name.split('.', 1)[0] + atts_conv[attr_name] = state_dict[key_dict] + keys_conv.append(key_dict) + if LINGER_IGNORE_PAMAMTER not in atts_bn.keys(): + for att, att_dict in atts_conv.items(): + state_dict[v.scope_conv+'.conv.'+att] = att_dict + for att, att_dict in atts_bn.items(): + state_dict[v.scope_conv+'.bn.'+att] = att_dict + for key_bn_pop in keys_bn: + if key_bn_pop != LINGER_IGNORE_PAMAMTER: + state_dict.pop(key_bn_pop) + for key_conv_pop in keys_conv: + state_dict.pop(key_conv_pop) + + def clear(self): + if self._is_close_register: + print("warning: module has initlized and linger.clear may not work") + self.fused_conv_module.clear() + self.fused_bn_module.clear() + + def has_fuseable_items(self): + return len(self.fused_conv_module) > 0 + +class OpNodeInfo(): + def __init__(self): + self.inputs = [] + self.outputs = [] + self.op = None + self.scope = None + + def __str__(self): + s = 'input:' + for i in self.inputs: + s += i+' ' + s += '\t output:' + for o in self.outputs: + s += o+' ' + s += '\t operator:' + self.op + s += '\t scope:' + self.scope + return s + + @staticmethod + def parse_scope_to_path(scope_str): + tail_name = scope_str.strip().split('/')[-1].strip() + assert tail_name != '' + tail_name = tail_name.replace('__module.', '', 1) + return tail_name + + def parse_scope(self): + return self.parse_scope_to_path(self.scope) + +def get_op_nodes(graph): + nodes = [] + for n in graph.nodes(): + op_node = OpNodeInfo() + for i in n.inputs(): + op_node.inputs.append(i.debugName()) + for o in n.outputs(): + op_node.outputs.append(o.debugName()) + op_node.op = n.kind() + op_node.scope = 
n.scopeName() + nodes.append(op_node) + return nodes + +def find_adjoin_layer(src_node_name, may_be_dst_layers, dict_input, dict_output, src_node_must_be_layers=None): + find_nodes = [] + for dst_layer in may_be_dst_layers: + + input_tensor = dst_layer.inputs[0] + src_node = dict_output[input_tensor] + if src_node != None and src_node.op == src_node_name: + input_node_set = dict_input[input_tensor] + if len(input_node_set) == 1 and dst_layer in input_node_set: + if src_node_must_be_layers is None: + find_nodes.append((src_node, dst_layer)) + elif src_node in src_node_must_be_layers: + find_nodes.append((src_node, dst_layer)) + target_set = set([]) + src_set = set([]) + for s, d in find_nodes: + src_set.add(s) + target_set.add(d) + return find_nodes, src_set, target_set + +def find_adjoin_adjoin_layer(src_node_name, mid_node_name, may_be_dst_layers, dict_input, dict_output): + find_nodes = [] + for dst_layer in may_be_dst_layers: + dst_input_tensor = dst_layer.inputs[0] + mid_node = dict_output[dst_input_tensor] + if mid_node != None and mid_node.op == mid_node_name: + mid_input_node_set = dict_input[dst_input_tensor] + if len(mid_input_node_set) == 1 and dst_layer in mid_input_node_set: + mid_input_tensor = mid_node.inputs[0] + src_node = dict_output[mid_input_tensor] + if src_node != None and src_node.op == src_node_name: + src_input_node_set = dict_input[mid_input_tensor] + if len(src_input_node_set) == 1 and mid_node in src_input_node_set: + find_nodes.append((src_node, mid_node, dst_layer)) + src_set = set([]) + mid_set = set([]) + dst_set = set([]) + for (s, m, d) in find_nodes: + src_set.add(s) + mid_set.add(m) + dst_set.add(d) + return find_nodes, src_set, mid_set, dst_set + +def filter_layers(node_arr, op_name): + list_node = [] + for n in node_arr: + if n.op == op_name: + list_node.append(n) + return list_node + +def parse_fuseable_conv_bn(node_arr, fused_bn=True): + dict_output = {} + for n in node_arr: + for o in n.outputs: + dict_output[o] = n + 
dict_input = {} + for n in node_arr: + for i in n.inputs: + if dict_input.get(i) == None: + dict_input[i] = set([n]) + else: + dict_input[i].add(n) + list_bn = filter_layers(node_arr, 'aten::batch_norm') + list_relu = filter_layers(node_arr, 'aten::relu') + list_sigmoid = filter_layers(node_arr, 'aten::sigmoid') + fused_conv_sigmoid = [] + fused_linear_sigmoid = [] + fused_linear_bias_sigmoid = [] + fused_conv_bn = [] + fused_conv_bn_relu = [] + fused_conv_relu = [] + fused_bn_relu = [] + fused_linear_relu = [] + fused_linear_bias_relu = [] + if fused_bn: + fused_conv_bn, _, _ = find_adjoin_layer( + 'aten::_convolution', list_bn, dict_input, dict_output) + # if ahead_bn_relu: + fused_conv_bn_relu, _, _, _ = find_adjoin_adjoin_layer( + 'aten::_convolution', 'aten::batch_norm', list_relu, dict_input, dict_output) + # if ahead_conv_relu: + fused_conv_relu, _, _ = find_adjoin_layer( + 'aten::_convolution', list_relu, dict_input, dict_output) + # if ahead_conv_sigmoid: + fused_conv_sigmoid, _, _ = find_adjoin_layer( + 'aten::_convolution', list_sigmoid, dict_input, dict_output) + # if ahead_linear_sigmoid: + fused_linear_bias_sigmoid, _, _, _ = find_adjoin_adjoin_layer( + 'aten::matmul', 'aten::add_', list_sigmoid, dict_input, dict_output) + fused_linear_sigmoid, _, _ = find_adjoin_layer( + 'aten::matmul', list_sigmoid, dict_input, dict_output) + # if ahead_bn_relu: + fused_bn_relu, _, _ = find_adjoin_layer( + 'aten::batch_norm', list_relu, dict_input, dict_output) + # if ahead_linear_relu: + fused_linear_bias_relu, _, _, _ = find_adjoin_adjoin_layer( + 'aten::matmul', 'aten::add_', list_relu, dict_input, dict_output) + fused_linear_relu, _, _ = find_adjoin_layer( + 'aten::matmul', list_relu, dict_input, dict_output) + + return fused_conv_bn, fused_conv_bn_relu, fused_conv_relu, fused_bn_relu, fused_linear_relu, fused_linear_bias_relu, fused_conv_sigmoid, fused_linear_sigmoid, fused_linear_bias_sigmoid + +def scope_to_module(root_module, scope): + tail_name = 
OpNodeInfo.parse_scope_to_path(scope) + module_arr_name = tail_name.split('.') + module_cur = root_module + module_cur_name = '' + moduel_cur_father = root_module + str_find = '' + for sub_att_name in module_arr_name: + str_find += sub_att_name+"." + moduel_cur_father = module_cur + module_cur = getattr(module_cur, sub_att_name) + module_cur_name = sub_att_name + assert module_cur is not None, 'can not find '+str_find + return (moduel_cur_father, module_cur, module_cur_name) + +def FuseConvBNAheadRelu(model, *args, fused_bn=True): + SingletonConvFusedBnModules().clear() + assert torch.__version__ >= '1.9.0', 'error: torch version must greater than 1.9' + graph = torch.jit.trace(model, *args) + node_arr = get_op_nodes(graph.inlined_graph) + fuseable_conv_bn, fuseable_conv_bn_relu, fuseable_conv_relu, fuseable_bn_relu, fuseable_linear_relu, fuseable_linear_bias_relu, fuseable_conv_sigmoid, fuseable_linear_sigmoid, fuseable_linear_bias_sigmoid = parse_fuseable_conv_bn(node_arr, fused_bn) + module_paths = [] + if fused_bn: + for (conv, bn, _) in fuseable_conv_bn_relu: + _, conv_module, _ = scope_to_module(model, conv.scope) + setattr(conv_module, LINGER_ACTIVATION_TYPE, ActivationType.Relu) + for (conv, bn) in fuseable_conv_bn: + conv_module_father, conv_module, conv_module_name = scope_to_module(model, conv.scope) + bn_module_father, bn_module, bn_module_name = scope_to_module(model, bn.scope) + if (type(conv_module) in (torch.nn.Conv2d, torch.nn.ConvTranspose2d) and type(bn_module) == torch.nn.BatchNorm2d) or \ + (type(conv_module) in (torch.nn.Conv1d, torch.nn.ConvTranspose1d) and type(bn_module) == torch.nn.BatchNorm1d): + fuseableconv_bn = FuseableConvBN( + conv_module_father, conv_module, bn_module_father, bn_module) + SingletonConvFusedBnModules()._register(fuseableconv_bn) + module_paths.append((conv.parse_scope(), bn.parse_scope())) + for (conv, _) in fuseable_conv_relu: + _, conv_module, _ = scope_to_module(model, conv.scope) + setattr(conv_module, 
LINGER_ACTIVATION_TYPE, ActivationType.Relu) + for (bn, _) in fuseable_bn_relu: + _, bn_module, _ = scope_to_module(model, bn.scope) + setattr(bn_module, LINGER_ACTIVATION_TYPE, ActivationType.Relu) + for (conv, _) in fuseable_conv_sigmoid: + _, conv_module, _ = scope_to_module(model, conv.scope) + setattr(conv_module, LINGER_ACTIVATION_TYPE, ActivationType.Sigmoid) + for (linear, _) in fuseable_linear_sigmoid: + _, linear_module, _ = scope_to_module(model, linear.scope) + setattr(linear_module, LINGER_ACTIVATION_TYPE, ActivationType.Sigmoid) + for(linear, add, _) in fuseable_linear_bias_sigmoid: + _, linear_module, _ = scope_to_module(model, linear.scope) + setattr(linear_module, LINGER_ACTIVATION_TYPE, ActivationType.Sigmoid) + for(linear, _) in fuseable_linear_relu: + _, linear_module, _ = scope_to_module(model, linear.scope) + setattr(linear_module, LINGER_ACTIVATION_TYPE, ActivationType.Relu) + + for(linear, add, _) in fuseable_linear_bias_relu: + _, linear_module, _ = scope_to_module(model, linear.scope) + setattr(linear_module, LINGER_ACTIVATION_TYPE, ActivationType.Relu) + return module_paths + +def trace_layers(model: nn.Module, *args, fuse_bn: bool = True): r"""对模型进行trace 同时进行模型提前进行fusion训练,root_model为原始的根module,target_model为目标trace的子model Args: - root_model(torch.nn.Module): 原始根模型 - target_model(torch.nn.Module): Trace 子模型 + model(torch.nn.Module): 原始根模型 args(torch.Tensor or Tuple or List): 模型Trace的位置参数 fuse_bn(bool):是否融合BN - ahead_conv_relu(bool): 是否统计Conv输出正值scale - ahead_bn_relu(bool): 是否统计BN输出正值scale - ahead_linear_relu(bool): 是否统计Linear输出正值scale returns: 返回融合BN后的module @@ -26,15 +406,16 @@ def trace_layers(root_model: nn.Module, target_model: nn.Module, *args, fuse_bn: """ if SingletonConvFusedBnModules().has_fuseable_items(): print("Warning: trace_layers only support one-time call, the latest call will overwrite the previous call and this may cause errors, please check !") - FuseConvBNAheadRelu(target_model, *args, fused_bn=fuse_bn, 
ahead_conv_relu=ahead_conv_relu, ahead_bn_relu=ahead_bn_relu, - ahead_linear_relu=ahead_linear_relu, ahead_conv_sigmoid=ahead_conv_sigmoid, ahead_linear_sigmoid=ahead_linear_sigmoid) + + _ = FuseConvBNAheadRelu(model, *args, fused_bn=fuse_bn) if SingletonConvFusedBnModules().has_fuseable_items(): - SingletonConvFusedBnModules().build_normalize_convbn2d_scope(root_model) - SingletonConvFusedBnModules().build_empty_bn_scope(root_model) + SingletonConvFusedBnModules().build_normalize_convbn2d_scope(model) + SingletonConvFusedBnModules().build_empty_bn_scope(model) def pre_hook(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs): SingletonConvFusedBnModules().fuse_state_dicts(state_dict) SingletonConvFusedBnModules().clear() if SingletonConvFusedBnModules().has_fuseable_items(): - root_model._register_load_state_dict_pre_hook(pre_hook) - return root_model + model._register_load_state_dict_pre_hook(pre_hook) + +__all__ = ['trace_layers'] diff --git a/linger/lib/lingerext.cpython-36m-x86_64-linux-gnu.so b/linger/lib/lingerext.cpython-36m-x86_64-linux-gnu.so deleted file mode 100644 index 46974c0..0000000 Binary files a/linger/lib/lingerext.cpython-36m-x86_64-linux-gnu.so and /dev/null differ diff --git a/linger/lib/lingerext.cpython-37m-x86_64-linux-gnu.so b/linger/lib/lingerext.cpython-37m-x86_64-linux-gnu.so deleted file mode 100644 index c99d87b..0000000 Binary files a/linger/lib/lingerext.cpython-37m-x86_64-linux-gnu.so and /dev/null differ diff --git a/linger/lib/lingerext.cpython-38-x86_64-linux-gnu.so b/linger/lib/lingerext.cpython-38-x86_64-linux-gnu.so deleted file mode 100644 index c8014af..0000000 Binary files a/linger/lib/lingerext.cpython-38-x86_64-linux-gnu.so and /dev/null differ diff --git a/linger/modules/__init__.py b/linger/modules/__init__.py deleted file mode 100644 index 9350018..0000000 --- a/linger/modules/__init__.py +++ /dev/null @@ -1,15 +0,0 @@ -from .modules_configs import (DefaultNormalizeIntXModule, - 
SupportNormalizeIntXModules, - SupportNormalizeTorchModules) -from .normalize_batchnorm2d import NormalizeBatchNorm2d -from .normalize_conv1d import NormalizeConv1d -from .normalize_conv2d import NormalizeConv2d -from .normalize_convbn1d import NormalizeConvBN1d -from .normalize_convbn2d import NormalizeConvBN2d -from .normalize_convTranspose2d import NormalizeConvTranspose2d -from .normalize_convTransposebn2d import NormalizeConvTransposeBN2d -from .normalize_embedding import NormalizeEmbedding -from .normalize_fastGRU import NormalizeFastGRU -from .normalize_fastLSTM import NormalizeFastLSTM -from .normalize_layernorm import NormalizeLayerNorm -from .normalize_linear import NormalizeLinear \ No newline at end of file diff --git a/linger/modules/modules_configs.py b/linger/modules/modules_configs.py deleted file mode 100644 index d96b5d1..0000000 --- a/linger/modules/modules_configs.py +++ /dev/null @@ -1,24 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -from .normalize_layernorm import NormalizeLayerNorm -from .normalize_fastLSTM import NormalizeFastLSTM -from .normalize_fastGRU import NormalizeFastGRU -import torch.nn as nn - -from .normalize_conv1d import NormalizeConv1d -from .normalize_conv2d import NormalizeConv2d -from .normalize_convbn1d import NormalizeConvBN1d -from .normalize_convbn2d import NormalizeConvBN2d -from .normalize_convTranspose2d import NormalizeConvTranspose2d -from .normalize_convTransposebn2d import NormalizeConvTransposeBN2d -from .normalize_linear import NormalizeLinear - -from.normalize_batchnorm2d import NormalizeBatchNorm2d - - -DefaultNormalizeIntXModule = (nn.Conv2d, nn.Linear, nn.ConvTranspose2d, NormalizeConvBN2d, NormalizeConvTransposeBN2d, - nn.Conv1d, NormalizeConvBN1d, nn.BatchNorm2d, nn.GRU, nn.LSTM) -SupportNormalizeTorchModules = [nn.Conv2d, nn.Linear, nn.ConvTranspose2d, NormalizeConvBN2d, NormalizeConvTransposeBN2d, - nn.Conv1d, NormalizeConvBN1d, nn.BatchNorm2d, nn.GRU, nn.LSTM, nn.Embedding, nn.LayerNorm] 
-SupportNormalizeIntXModules = (NormalizeConv2d, NormalizeLinear, NormalizeConvTranspose2d, NormalizeConvBN2d, NormalizeConvTransposeBN2d, - NormalizeConv1d, NormalizeConvBN1d, NormalizeBatchNorm2d, NormalizeFastGRU, NormalizeFastLSTM, NormalizeLayerNorm) diff --git a/linger/modules/normalize_batchnorm2d.py b/linger/modules/normalize_batchnorm2d.py deleted file mode 100644 index 359f591..0000000 --- a/linger/modules/normalize_batchnorm2d.py +++ /dev/null @@ -1,55 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch -import torch.nn as nn - -from ..quant import NormalizeFunction - - -class NormalizeBatchNorm2d(nn.BatchNorm2d): - def __init__(self, num_features: int, eps=1e-5, momentum=0.1, affine=True, track_running_stats=True, - normalize_data=None, normalize_weight=None, normalize_bias=None, ahead_relu=False) -> None: - assert normalize_data is None or normalize_data > 0, 'normalize value is None or must >0' - assert normalize_weight is None or normalize_weight > 0, 'normalize value is None or must >0' - assert normalize_bias is None or normalize_bias > 0, 'normalize value is None or must >0' - nn.BatchNorm2d.__init__(self, num_features, eps, - momentum, affine, track_running_stats) - self.normalize_data = normalize_data - self.normalize_weight = normalize_weight - self.normalize_bias = normalize_bias - self.ahead_relu = ahead_relu - - def forward(self, input: torch.Tensor) -> torch.Tensor: - batchsize, channels, height, width = input.shape - size = batchsize * height * width - if self.training: - mean = input.sum((0, 2, 3), keepdim=True) / size - var = input.pow(2).sum((0, 2, 3), keepdim=True) / size - \ - (input.sum((0, 2, 3), keepdim=True) / size).pow(2) - var = torch.clamp(var, min=0.0) - self.running_mean = ( - 1 - self.momentum) * self.running_mean + self.momentum * mean.squeeze().detach() - self.running_var = (1 - self.momentum) * self.running_var + \ - self.momentum * var.squeeze().detach() - else: - mean = self.running_mean.reshape(1, 
-1, 1, 1) - var = self.running_var.reshape(1, -1, 1, 1) - sigma = 1 / torch.sqrt(var + self.eps) - alpha = self.weight.view(1, -1, 1, 1) * sigma - beta = self.bias.view(1, -1, 1, 1) - mean * alpha - if self.normalize_weight is not None: - alpha = NormalizeFunction.apply( - alpha, self.normalize_weight, self.training) - if self.normalize_bias is not None: - beta = NormalizeFunction.apply( - beta, self.normalize_bias, self.training) - out = alpha * input + beta - if self.normalize_data is not None: - out.clamp_(-self.normalize_data, self.normalize_data) - return out - - def extra_repr(self) -> str: - return 'normalize_data:{normalize_data},normalize_weight:{normalize_weight},normalize_bias:{normalize_bias},ahead_relu:{ahead_relu}'.format(**self.__dict__) - - -__all__ = ['NormalizeBatchNorm2d'] diff --git a/linger/modules/normalize_conv1d.py b/linger/modules/normalize_conv1d.py deleted file mode 100644 index 041d9ce..0000000 --- a/linger/modules/normalize_conv1d.py +++ /dev/null @@ -1,50 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch -import torch.nn as nn -import torch.nn.functional as F - -from ..quant import NormalizeFunction - -class NormalizeConv1d(nn.Conv1d): - def __init__(self, in_channels: int, out_channels: int, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', - normalize_data=None, normalize_weight=None, normalize_bias=None, ahead_relu=False) -> None: - assert normalize_data is None or normalize_data > 0, 'normalize value is None or must >0' - assert normalize_weight is None or normalize_weight > 0, 'normalize value is None or must >0' - assert normalize_bias is None or normalize_bias > 0, 'normalize value is None or must >0' - nn.Conv1d.__init__(self, in_channels, out_channels, kernel_size, - stride, padding, dilation, groups, bias, padding_mode) - self.normalize_data = normalize_data - self.normalize_weight = normalize_weight - self.normalize_bias = normalize_bias - self.ahead_relu = 
ahead_relu - - def forward(self, input: torch.Tensor) -> torch.Tensor: - normalized_weight = self.weight - if self.normalize_weight is not None: - normalized_weight = NormalizeFunction.apply( - normalized_weight, self.normalize_weight, self.training) - normalized_bias = self.bias - if (normalized_bias is not None) and (self.normalize_bias is not None): - normalized_bias = NormalizeFunction.apply( - normalized_bias, self.normalize_bias, self.training) - out = None - if self.padding_mode != 'zeros': - out = F.conv1d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode), - normalized_weight, normalized_bias, self.stride, - tuple(0, 0), self.dilation, self.groups) - else: - out = F.conv1d(input, normalized_weight, normalized_bias, - self.stride, self.padding, self.dilation, self.groups) - if self.normalize_data is not None: - out.clamp_(-self.normalize_data, self.normalize_data) - return out - - def extra_repr(self): - s = nn.Conv1d.extra_repr(self) - extra_s = ',normalize_data:{normalize_data},normalize_weight:{normalize_weight},normalize_bias:{normalize_bias},ahead_relu:{ahead_relu}'.format( - **self.__dict__) - return s+extra_s - - -__all__ = ['NormalizeConv1d'] diff --git a/linger/modules/normalize_conv2d.py b/linger/modules/normalize_conv2d.py deleted file mode 100644 index 79622e5..0000000 --- a/linger/modules/normalize_conv2d.py +++ /dev/null @@ -1,121 +0,0 @@ -import math - -import torch -import torch.nn as nn -import torch.nn.functional as F - -from ..quant import NormalizeFunction - - -class NormalizeConv2d(nn.Conv2d): - def __init__(self, in_channels: int, out_channels: int, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', - normalize_data=None, normalize_weight=None, normalize_bias=None, ahead_relu=False, ahead_sigmoid=False) -> None: - - assert normalize_data is None or normalize_data > 0, 'normalize value is None or must >0' - assert normalize_weight is None or normalize_weight > 0, 'normalize 
value is None or must >0' - assert normalize_bias is None or normalize_bias > 0, 'normalize value is None or must >0' - - # venus limits - if type(kernel_size) == int: - assert kernel_size in ( - 1, 2, 3, 4, 5), f"in NormalizeConv2d op, kernel size must be 1/2/3/4/5, but you have kernel size {kernel_size}" - elif type(kernel_size) == tuple: - assert kernel_size[0] in (1, 2, 3, 4, 5) and kernel_size[1] in ( - 1, 2, 3, 4, 5), "in NormalizeConv2d op, kernel size must be 1/2/3/4/5, , but you have kernel size {kernel_size}" - - if type(stride) == int: - assert stride in ( - 1, 2, 4), "in NormalizeConv2d op, kernel size must be 1/2/4, but you have stride {stride}" - elif type(stride) == tuple: - assert stride[0] in (1, 2, 4) and stride[1] in ( - 1, 2, 4), "in NormalizeConv2d op, kernel size must be 1/2/4, but you have stride {stride}" - - if type(padding) == int: - assert padding in ( - 0, 1, 2, 3, 4), "in NormalizeConv2d op, padding size must be 1/2/4, but you have padding {padding}" - elif type(padding) == tuple: - assert padding[0] in (0, 1, 2, 3, 4) and padding[1] in ( - 0, 1, 2, 3, 4), "in NormalizeConv2d op, padding size must be 1/2/3/4/5, but you have padding {padding}" - - if type(kernel_size) == int: - kernel_size_h = kernel_size - kernel_size_w = kernel_size - elif type(kernel_size) == tuple: - kernel_size_h = kernel_size[0] - kernel_size_w = kernel_size[1] - else: - assert False, "kernel size type error." 
- # if (groups != in_channels): - # assert math.ceil(in_channels/8) * 8 * kernel_size_h * kernel_size_w * math.ceil(out_channels/2) * 2 <= 32 * \ - # 1024, f"in NormalizeConv2d op, kernel must meet the requirements of non-depthwise convolution, but you have math.ceil({in_channels}/8) * 8 * {kernel_size_h} * {kernel_size_w} * math.ceil({out_channels}/2) * 2 <= 32 * 1024" - # if (groups == in_channels): - # assert math.ceil(in_channels/16) * 16 * kernel_size_h * kernel_size_w <= 32 * \ - # 1024, f"in NormalizeConv2d op, kernel must meet the requirements of depthwise convolution, but you have math.ceil({in_channels}/16) * 16 * {kernel_size_h} * {kernel_size_w} <= 32 * 1024" - - if type(stride) == int: - stride_h = stride - stride_w = stride - elif type(stride) == tuple: - stride_h = stride[0] - stride_w = stride[1] - else: - assert False, "kernel size type error." - - if type(padding) == int: - padding_h = padding - padding_w = padding - elif type(padding) == tuple: - padding_h = padding[0] - padding_w = padding[1] - else: - assert False, "kernel size type error." 
- - assert kernel_size_h >= stride_h and kernel_size_w >= stride_w, f"kernel_size_h >= stride_h and kernel_size_w >= stride_w, but you have {kernel_size_h} < {stride_h} or {kernel_size_w} < {stride_w}" - assert padding_h < kernel_size_h and padding_w < kernel_size_w, f"pad_h < weight_h && pad_w < weight_w, but you have {padding_h} >= {kernel_size_h} or {padding_w} >= {kernel_size_w}" - - nn.Conv2d.__init__(self, in_channels, out_channels, kernel_size, - stride, padding, dilation, groups, bias, padding_mode) - self.normalize_data = normalize_data - self.normalize_weight = normalize_weight - self.normalize_bias = normalize_bias - self.ahead_relu = ahead_relu - self.ahead_sigmoid = ahead_sigmoid - - def forward(self, input: torch.Tensor) -> torch.Tensor: - - # venus limits - # assert input.shape[2] >= self.weight.shape[2] and input.shape[3] >= self.weight.shape[ - # 3], f"in NormalizeConv2d op, input's width >= weight's width && input'height >= weight'height, but you have {input.shape[2]} < {self.weight.shape[2]} and {input.shape[3]} < {self.weight.shape[3]}" - - # channel_in = self.weight.shape[1]/self.groups - # assert not (math.ceil(channel_in/8/self.stride[1]) * (8*self.stride[1]) * math.ceil(input.shape[3]/8)*8*1 > 64 * 1024 and channel_in > 512) or not (math.ceil(channel_in/8/self.stride[1]) * (8*self.stride[1]) * math.ceil(8/8)*8*input.shape[2] > 64 * 1024 and channel_in > - # 512), f"in NormalizeConv2d op, the size of the input data after alignment exceed 64KB and channal_in > 512 at the same time is not allowed, but you have (math.ceil({channel_in}/8/{self.stride[1]}) * (8*{self.stride[1]}) * math.ceil({input.shape[3]}/8)*8*{1} > 64 * 1024 and {channel_in} > 512) or (math.ceil({channel_in}/8/{self.stride[1]}) * (8*{self.stride[1]}) * math.ceil({8}/8)*8*{input.shape[2]} > 64 * 1024 and {channel_in} > 512" - - normalized_weight = self.weight - if self.normalize_weight is not None: - normalized_weight = NormalizeFunction.apply( - normalized_weight, 
self.normalize_weight, self.training) - normalized_bias = self.bias - if (normalized_bias is not None) and (self.normalize_bias is not None): - normalized_bias = NormalizeFunction.apply( - normalized_bias, self.normalize_bias, self.training) - out = None - if self.padding_mode != 'zeros': - out = F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode), - normalized_weight, normalized_bias, self.stride, - tuple(0, 0), self.dilation, self.groups) - else: - out = F.conv2d(input, normalized_weight, normalized_bias, - self.stride, self.padding, self.dilation, self.groups) - if self.normalize_data is not None: - out.clamp_(-self.normalize_data, self.normalize_data) - return out - - def extra_repr(self): - s = nn.Conv2d.extra_repr(self) - extra_s = ',normalize_data:{normalize_data},normalize_weight:{normalize_weight},normalize_bias:{normalize_bias},ahead_relu:{ahead_relu}'.format( - **self.__dict__) - return s+extra_s - - -__all__ = ['NormalizeConv2d'] diff --git a/linger/modules/normalize_convTranspose2d.py b/linger/modules/normalize_convTranspose2d.py deleted file mode 100644 index fad8ff5..0000000 --- a/linger/modules/normalize_convTranspose2d.py +++ /dev/null @@ -1,82 +0,0 @@ -from typing import Tuple, TypeVar, Union - -import torch -import torch.nn as nn -import torch.nn.functional as F - -from ..quant import NormalizeFunction - -T = TypeVar('T') -_scalar_or_tuple_2_t = Union[T, Tuple[T, T]] -_size_2_t = _scalar_or_tuple_2_t[int] - - -class NormalizeConvTranspose2d(nn.ConvTranspose2d): - def __init__(self, in_channels: int, - out_channels: int, - kernel_size: _size_2_t, - stride: _size_2_t = 1, - padding: _size_2_t = 0, - output_padding: _size_2_t = 0, - groups: int = 1, - bias: bool = True, - dilation: int = 1, - padding_mode: str = 'zeros', - normalize_data=None, normalize_weight=None, normalize_bias=None, - ) -> None: - assert normalize_data is None or normalize_data > 0, 'normalize value is None or must >0' - assert normalize_weight 
is None or normalize_weight > 0, 'normalize value is None or must >0' - assert normalize_bias is None or normalize_bias > 0, 'normalize value is None or must >0' - - # venus limits - if type(stride) == int: - if stride == 2: - assert kernel_size in ( - 2, 3, 4, 5), f"in NormalizeConvTranspose2d op, when stride_h == 2, kernel_h must be 2, 3, 4 or 5, but you have kernel_size: {kernel_size}" - elif stride == 4: - assert kernel_size in ( - 4, 5), f"in NormalizeConvTranspose2d op, when stride_h == 4, kernel_h must be 4 or 5, but you have kernel_size: {kernel_size}" - else: - if stride[1] == 2: - assert kernel_size[1] in ( - 2, 3, 4, 5), f"in NormalizeConvTranspose2d op, when stride_h == 2, kernel_h must be 2, 3, 4 or 5, but you have kernel_size[1]: {kernel_size[1]}" - if stride[1] == 4: - assert kernel_size[1] in ( - 4, 5), f"in NormalizeConvTranspose2d op, when stride_h == 4, kernel_h must be 4 or 5, but you have kernel_size[1]: {kernel_size[1]}" - if stride[0] == 2: - assert kernel_size[0] in ( - 2, 3, 4, 5), f"in NormalizeConvTranspose2d op, when stride_2 == 2, kernel_2 must be 2, 3, 4 or 5, but you have kernel_size[0]: {kernel_size[0]}" - if stride[0] == 4: - assert kernel_size[0] in ( - 4, 5), f"in NormalizeConvTranspose2d op, when stride_w == 4, kernel_w must be 4 or 5, but you have kernel_size[0]: {kernel_size[0]}" - - nn.ConvTranspose2d.__init__(self, in_channels, out_channels, kernel_size, - stride, padding, output_padding, groups, bias, dilation, padding_mode) - self.normalize_data = normalize_data - self.normalize_weight = normalize_weight - self.normalize_bias = normalize_bias - - def forward(self, input: torch.Tensor) -> torch.Tensor: - normalized_weight = self.weight - if self.normalize_weight: - normalized_weight = NormalizeFunction.apply( - normalized_weight, self.normalize_weight, self.training) - normalized_bias = self.bias - if (self.bias is not None) and (self.normalize_bias is not None): - normalized_bias = NormalizeFunction.apply( - 
normalized_bias, self.normalize_bias, self.training) - out = F.conv_transpose2d( - input, normalized_weight, normalized_bias, self.stride, self.padding, - self.output_padding, self.groups, self.dilation) - if self.normalize_data is not None: - out.clamp_(-self.normalize_data, self.normalize_data) - return out - - def extra_repr(self): - s = nn.ConvTranspose2d.extra_repr(self) - extra_s = ',normalize_data:{normalize_data},normalize_weight:{normalize_weight},normalize_bias:{normalize_bias}'.format( - **self.__dict__) - return s+extra_s - - -__all__ = ['NormalizeConvTranspose2d'] diff --git a/linger/modules/normalize_convTransposebn2d.py b/linger/modules/normalize_convTransposebn2d.py deleted file mode 100644 index f39ec39..0000000 --- a/linger/modules/normalize_convTransposebn2d.py +++ /dev/null @@ -1,95 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch -import torch.nn as nn -import torch.nn.functional as F - -from ..quant import NormalizeFunction - - -class NormalizeConvTransposeBN2d(nn.Module): - def __init__(self, in_channels:int, out_channels:int, kernel_size, stride=1,padding=0, output_padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', - eps = 1e-5, momentum=0.1, affine=True, track_running_stats=True, - normalize_data=None, normalize_weight=None, normalize_bias=None, ahead_relu =False) -> None: - assert normalize_data is None or normalize_data >0 ,'clamp value is None or must >0' - assert normalize_weight is None or normalize_weight >0 ,'clamp value is None or must >0' - assert normalize_bias is None or normalize_bias >0 ,'clamp value is None or must >0' - super(NormalizeConvTransposeBN2d, self).__init__() - self.conv = nn.ConvTranspose2d(in_channels, out_channels, kernel_size, stride, padding, output_padding, groups, bias, dilation, padding_mode) - self.bn = nn.BatchNorm2d(out_channels, eps, momentum, affine, track_running_stats) - - self.normalize_data = normalize_data - self.normalize_weight = normalize_weight - 
self.normalize_bias = normalize_bias - self.ahead_relu = ahead_relu - - def forward(self, input: torch.Tensor) -> torch.Tensor: - if self.training: - conv_rlt = self.conv(input) - batchsize, channels, height, width = conv_rlt.size() - numel = batchsize * height * width - conv_rlt = conv_rlt.permute(1, 0, 2, 3).contiguous().view(channels, numel) - sum_ = conv_rlt.sum(1) - sum_of_square = conv_rlt.pow(2).sum(1) - mean = sum_ / numel - sumvar = sum_of_square - sum_ * mean - unbias_var = sumvar / (numel - 1) - unbias_var = torch.clamp(unbias_var,min=0.0) - self.bn.running_mean = ( (1 - self.bn.momentum) * self.bn.running_mean + self.bn.momentum * mean.detach()) - self.bn.running_var = ( (1 - self.bn.momentum) * self.bn.running_var + self.bn.momentum * unbias_var.detach()) - - bias_var = sumvar / numel - bias_var = torch.clamp(bias_var,min=0.0) - inv_std = 1 / (bias_var + self.bn.eps).pow(0.5) - bn_rlt = ( (conv_rlt - mean.unsqueeze(1)) * inv_std.unsqueeze(1) * self.bn.weight.unsqueeze(1) + self.bn.bias.unsqueeze(1) ) - bn_rlt = bn_rlt.view(channels, batchsize, height, width).permute(1, 0, 2, 3).contiguous() - - w_bn = self.bn.weight.div(torch.sqrt(self.bn.eps + unbias_var)) - # new_weight = self.conv.weight.mul(w_bn.view(1, -1, 1, 1)) - cin, cout, kh, kw = self.conv.weight.shape - conv_weight = self.conv.weight.view(self.conv.groups, cin // self.conv.groups, cout, kh, kw ) - new_weight = conv_weight.mul(w_bn.view(self.conv.groups,1, -1, 1, 1)).view(cin, cout, kh, kw) - if self.conv.bias is not None: - b_conv = self.conv.bias - else: - b_conv = torch.zeros(self.conv.weight.size(1), device=input.device) - b_bn = self.bn.bias - self.bn.weight.mul(mean).div(torch.sqrt(unbias_var + self.bn.eps)) # bn.running_mean mean bn.running_var unbias_var - new_bias = b_conv.mul(w_bn) + b_bn - alpha = 1.0 - if self.normalize_weight is not None: - new_weight = NormalizeFunction.apply(new_weight, self.normalize_weight, self.training) - if self.normalize_bias is not None: - new_bias = 
NormalizeFunction.apply(new_bias, self.normalize_bias, self.training) - new_conv_rlt = F.conv_transpose2d(input, new_weight, new_bias, self.conv.stride, self.conv.padding, self.conv.output_padding, self.conv.groups, self.conv.dilation) - result = alpha * bn_rlt + (1 - alpha) * new_conv_rlt - else: - # output = self.conv(input) - # result = self.bn(output) - w_bn = self.bn.weight.div(torch.sqrt(self.bn.eps + self.bn.running_var)) - # new_weight = self.conv.weight.mul(w_bn.view(1, -1, 1, 1)) - cin, cout, kh, kw = self.conv.weight.shape - conv_weight = self.conv.weight.view(self.conv.groups, cin // self.conv.groups, cout, kh, kw ) - new_weight = conv_weight.mul(w_bn.view(self.conv.groups,1, -1, 1, 1)).view(cin, cout, kh, kw) - if self.conv.bias is not None: - b_conv = self.conv.bias - else: - b_conv = torch.zeros(self.conv.weight.size(1), device=input.device) - b_bn = self.bn.bias - self.bn.weight.mul(self.bn.running_mean).div(torch.sqrt(self.bn.running_var + self.bn.eps)) - new_bias = b_conv.mul(w_bn) + b_bn - if self.normalize_weight is not None: - new_weight = NormalizeFunction.apply(new_weight, self.normalize_weight, self.training) - if self.normalize_bias is not None: - new_bias = NormalizeFunction.apply(new_bias, self.normalize_bias, self.training) - result = F.conv_transpose2d(input, new_weight, new_bias, self.conv.stride, self.conv.padding, self.conv.output_padding, self.conv.groups, self.conv.dilation) - - if self.normalize_data is not None: - # result = NormalizeFunction.apply(result, self.normalize_data, self.training, False) - result.clamp_(-self.normalize_data, self.normalize_data) - return result - def extra_repr(self): - return 'normalize_data:{normalize_data},normalize_weight:{normalize_weight},normalize_bias:{normalize_bias},ahead_relu:{ahead_relu}'.format(**self.__dict__) - - - -__all__=['NormalizeConvTransposeBN2d'] - diff --git a/linger/modules/normalize_convbn1d.py b/linger/modules/normalize_convbn1d.py deleted file mode 100644 index 
034dd9f..0000000 --- a/linger/modules/normalize_convbn1d.py +++ /dev/null @@ -1,103 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch -import torch.nn as nn -import torch.nn.functional as F - -from ..quant import NormalizeFunction - - -class NormalizeConvBN1d(nn.Module): - def __init__(self, in_channels: int, out_channels: int, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', - eps=1e-5, momentum=0.1, affine=True, track_running_stats=True, - normalize_data=None, normalize_weight=None, normalize_bias=None, ahead_relu=False) -> None: - assert normalize_data is None or normalize_data > 0, 'normalize value is None or must >0' - assert normalize_weight is None or normalize_weight > 0, 'normalize value is None or must >0' - assert normalize_bias is None or normalize_bias > 0, 'normalize value is None or must >0' - super(NormalizeConvBN1d, self).__init__() - self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, - stride, padding, dilation, groups, bias, padding_mode) - self.bn = nn.BatchNorm1d( - out_channels, eps, momentum, affine, track_running_stats) - - self.normalize_data = normalize_data - self.normalize_weight = normalize_weight - self.normalize_bias = normalize_bias - self.ahead_relu = ahead_relu - - def forward(self, input: torch.Tensor) -> torch.Tensor: - if self.training: - conv_rlt = self.conv(input) - batchsize, channels, lenth = conv_rlt.size() - numel = batchsize * lenth - conv_rlt = conv_rlt.permute( - 1, 0, 2).contiguous().view(channels, numel) - sum_ = conv_rlt.sum(1) - sum_of_square = conv_rlt.pow(2).sum(1) - mean = sum_ / numel - sumvar = sum_of_square - sum_ * mean - unbias_var = sumvar / (numel - 1) - unbias_var = torch.clamp(unbias_var, min=0.0) - self.bn.running_mean = ( - (1 - self.bn.momentum) * self.bn.running_mean + self.bn.momentum * mean.detach()) - self.bn.running_var = ( - (1 - self.bn.momentum) * self.bn.running_var + self.bn.momentum * unbias_var.detach()) - - bias_var 
= sumvar / numel - bias_var = torch.clamp(bias_var, min=0.0) - inv_std = 1 / (bias_var + self.bn.eps).pow(0.5) - bn_rlt = ((conv_rlt - mean.unsqueeze(1)) * inv_std.unsqueeze(1) - * self.bn.weight.unsqueeze(1) + self.bn.bias.unsqueeze(1)) - bn_rlt = bn_rlt.view(channels, batchsize, lenth).permute( - 1, 0, 2).contiguous() - - w_bn = self.bn.weight.div(torch.sqrt(self.bn.eps + unbias_var)) - new_weight = self.conv.weight.mul(w_bn.view(-1, 1, 1)) - if self.conv.bias is not None: - b_conv = self.conv.bias - else: - b_conv = torch.zeros( - self.conv.weight.size(0), device=input.device) - b_bn = self.bn.bias - \ - self.bn.weight.mul(mean).div( - torch.sqrt(unbias_var + self.bn.eps)) - new_bias = b_conv.mul(w_bn) + b_bn - alpha = 1.0 - if self.normalize_weight is not None: - new_weight = NormalizeFunction.apply( - new_weight, self.normalize_weight, self.training) - if self.normalize_bias is not None: - new_bias = NormalizeFunction.apply( - new_bias, self.normalize_bias, self.training) - new_conv_rlt = F.conv1d(input, new_weight, new_bias, self.conv.stride, - self.conv.padding, self.conv.dilation, self.conv.groups) - out = alpha * bn_rlt + (1 - alpha) * new_conv_rlt - else: - w_bn = self.bn.weight.div(torch.sqrt( - self.bn.eps + self.bn.running_var)) - new_weight = self.conv.weight.mul(w_bn.view(-1, 1, 1)) - if self.conv.bias is not None: - b_conv = self.conv.bias - else: - b_conv = torch.zeros( - self.conv.weight.size(0), device=input.device) - b_bn = self.bn.bias - self.bn.weight.mul(self.bn.running_mean).div( - torch.sqrt(self.bn.running_var + self.bn.eps)) - new_bias = b_conv.mul(w_bn) + b_bn - if self.normalize_weight is not None: - new_weight = NormalizeFunction.apply( - new_weight, self.normalize_weight, self.training) - if self.normalize_bias is not None: - new_bias = NormalizeFunction.apply( - new_bias, self.normalize_bias, self.training) - out = F.conv1d(input, new_weight, new_bias, self.conv.stride, - self.conv.padding, self.conv.dilation, self.conv.groups) - if 
self.normalize_data is not None: - out.clamp_(-self.normalize_data, self.normalize_data) - return out - - def extra_repr(self): - return 'normalize_data:{normalize_data},normalize_weight:{normalize_weight},normalize_bias:{normalize_bias},ahead_relu:{ahead_relu}'.format(**self.__dict__) - - -__all__ = ['NormalizeConvBN1d'] diff --git a/linger/modules/normalize_convbn2d.py b/linger/modules/normalize_convbn2d.py deleted file mode 100644 index 7e3f979..0000000 --- a/linger/modules/normalize_convbn2d.py +++ /dev/null @@ -1,105 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch -import torch.nn as nn -import torch.nn.functional as F - -from ..quant import NormalizeFunction - - -class NormalizeConvBN2d(nn.Module): - def __init__(self, in_channels: int, out_channels: int, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', - eps=1e-5, momentum=0.1, affine=True, track_running_stats=True, - normalize_data=None, normalize_weight=None, normalize_bias=None, ahead_relu=False) -> None: - assert normalize_data is None or normalize_data > 0, 'normalize value is None or must >0' - assert normalize_weight is None or normalize_weight > 0, 'normalize value is None or must >0' - assert normalize_bias is None or normalize_bias > 0, 'normalize value is None or must >0' - super(NormalizeConvBN2d, self).__init__() - self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, - stride, padding, dilation, groups, bias, padding_mode) - self.bn = nn.BatchNorm2d( - out_channels, eps, momentum, affine, track_running_stats) - - self.normalize_data = normalize_data - self.normalize_weight = normalize_weight - self.normalize_bias = normalize_bias - self.ahead_relu = ahead_relu - - def forward(self, input: torch.Tensor) -> torch.Tensor: - if self.training: - conv_rlt = self.conv(input) - batchsize, channels, height, width = conv_rlt.size() - numel = batchsize * height * width - conv_rlt = conv_rlt.permute( - 1, 0, 2, 
3).contiguous().view(channels, numel) - sum_ = conv_rlt.sum(1) - sum_of_square = conv_rlt.pow(2).sum(1) - mean = sum_ / numel - sumvar = sum_of_square - sum_ * mean - unbias_var = sumvar / (numel - 1) - unbias_var = torch.clamp(unbias_var, min=0.0) - self.bn.running_mean = ( - (1 - self.bn.momentum) * self.bn.running_mean + self.bn.momentum * mean.detach()) - self.bn.running_var = ( - (1 - self.bn.momentum) * self.bn.running_var + self.bn.momentum * unbias_var.detach()) - - bias_var = sumvar / numel - bias_var = torch.clamp(bias_var, min=0.0) - inv_std = 1 / (bias_var + self.bn.eps).pow(0.5) - bn_rlt = ((conv_rlt - mean.unsqueeze(1)) * inv_std.unsqueeze(1) - * self.bn.weight.unsqueeze(1) + self.bn.bias.unsqueeze(1)) - bn_rlt = bn_rlt.view(channels, batchsize, height, width).permute( - 1, 0, 2, 3).contiguous() - - w_bn = self.bn.weight.div(torch.sqrt(self.bn.eps + unbias_var)) - new_weight = self.conv.weight.mul(w_bn.view(-1, 1, 1, 1)) - if self.conv.bias is not None: - b_conv = self.conv.bias - else: - b_conv = torch.zeros( - self.conv.weight.size(0), device=input.device) - b_bn = self.bn.bias - \ - self.bn.weight.mul(mean).div( - torch.sqrt(unbias_var + self.bn.eps)) - new_bias = b_conv.mul(w_bn) + b_bn - alpha = 1.0 - if self.normalize_weight is not None: - new_weight = NormalizeFunction.apply( - new_weight, self.normalize_weight, self.training) - if self.normalize_bias is not None: - new_bias = NormalizeFunction.apply( - new_bias, self.normalize_bias, self.training) - new_conv_rlt = F.conv2d(input, new_weight, new_bias, self.conv.stride, - self.conv.padding, self.conv.dilation, self.conv.groups) - # out = new_conv_rlt - out = alpha * bn_rlt + (1 - alpha) * new_conv_rlt - else: - w_bn = self.bn.weight.div(torch.sqrt( - self.bn.eps + self.bn.running_var)) - new_weight = self.conv.weight.mul(w_bn.view(-1, 1, 1, 1)) - if self.conv.bias is not None: - b_conv = self.conv.bias - else: - b_conv = torch.zeros( - self.conv.weight.size(0), device=input.device) - b_bn = 
self.bn.bias - self.bn.weight.mul(self.bn.running_mean).div( - torch.sqrt(self.bn.running_var + self.bn.eps)) - new_bias = b_conv.mul(w_bn) + b_bn - if self.normalize_weight is not None: - new_weight = NormalizeFunction.apply( - new_weight, self.normalize_weight, self.training) - if self.normalize_bias is not None: - new_bias = NormalizeFunction.apply( - new_bias, self.normalize_bias, self.training) - out = F.conv2d(input, new_weight, new_bias, self.conv.stride, - self.conv.padding, self.conv.dilation, self.conv.groups) - - if self.normalize_data is not None: - out.clamp_(-self.normalize_data, self.normalize_data) - return out - - def extra_repr(self): - return 'normalize_data:{normalize_data},normalize_weight:{normalize_weight},normalize_bias:{normalize_bias},ahead_relu:{ahead_relu}'.format(**self.__dict__) - - -__all__ = ['NormalizeConvBN2d'] diff --git a/linger/modules/normalize_embedding.py b/linger/modules/normalize_embedding.py deleted file mode 100644 index 0cc424f..0000000 --- a/linger/modules/normalize_embedding.py +++ /dev/null @@ -1,44 +0,0 @@ -from typing import Optional - -import torch -import torch.nn as nn -import torch.nn.functional as F -from torch import Tensor - -from ..quant import NormalizeFunction - - -class NormalizeEmbedding(nn.Embedding): - def __init__(self, num_embeddings: int, embedding_dim: int, padding_idx: Optional[int] = None, - max_norm: Optional[float] = None, norm_type: float = 2., scale_grad_by_freq: bool = False, - sparse: bool = False, _weight: Optional[Tensor] = None, normalize_data=None, normalize_weight=None) -> None: - - assert normalize_data is None or normalize_data > 0, 'normalize value is None or must >0' - assert normalize_weight is None or normalize_weight > 0, 'normalize value is None or must >0' - nn.Embedding.__init__(self, num_embeddings, embedding_dim, padding_idx, - max_norm, norm_type, scale_grad_by_freq, sparse, _weight) - self.normalize_data = normalize_data - self.normalize_weight = normalize_weight - - def 
forward(self, input: torch.Tensor) -> torch.Tensor: - normalized_weight = self.weight - if self.normalize_weight is not None: - normalized_weight = NormalizeFunction.apply( - normalized_weight, self.normalize_weight, self.training) - - out = None - out = F.embedding(input, normalized_weight, self.padding_idx, self.max_norm, - self.norm_type, self.scale_grad_by_freq, self.sparse) - - if self.normalize_data is not None: - out.clamp_(-self.normalize_data, self.normalize_data) - return out - - def extra_repr(self): - s = nn.Embedding.extra_repr(self) - extra_s = ',normalize_data:{normalize_data},normalize_weight:{normalize_weight}'.format( - **self.__dict__) - return s+extra_s - - -__all__ = ['NormalizeEmbedding'] diff --git a/linger/modules/normalize_fastGRU.py b/linger/modules/normalize_fastGRU.py deleted file mode 100644 index 6b8502f..0000000 --- a/linger/modules/normalize_fastGRU.py +++ /dev/null @@ -1,375 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch -import torch.nn as nn -from torch import _VF -from torch.nn.utils.rnn import PackedSequence -from torch.onnx import is_in_onnx_export - -from ..quant import NormalizeFunction - - -class GRUOnnxFakeFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, lengths, hidden_state, weight_ih, weight_hh, bias_ih, bias_hh, - weight_ih_reverse, weight_hh_reverse, bias_ih_reverse, bias_hh_reverse, - input_size, hidden_size, num_layers, batch_first, dropout, bidirectional, bias_B, bias_B_reverse, - hidden_state_forward, hidden_state_reverse, cell_state_forward, cell_state_reverse, - weight_ih_bi, weight_hh_bi, bias_B_bi): - output = None - hidden_state = None - batch_size = None - seq_length = None - num_directions = 2 if bidirectional else 1 - if batch_first: - batch_size = input.size(0) - seq_length = input.size( - 1) if lengths is None else torch.max(lengths) - output = torch.randn(batch_size, seq_length, - hidden_size*num_directions, device=input.device) - else: - batch_size = 
input.size(1) - seq_length = input.size( - 0) if lengths is None else torch.max(lengths) - output = torch.randn(seq_length, batch_size, - hidden_size*num_directions, device=input.device) - hidden_state = torch.zeros( - num_directions, batch_size, hidden_size, device=input.device) - return output, hidden_state - - @staticmethod - def backward(ctx, gradOutput, gradHidden, gradCell): - return None, None, None, None, None, None,\ - None, None, None, None,\ - None, None, None, None, None, None,\ - None, None, None,\ - None, None, None, None, None, None, None,\ - None, None, None, None, None, None, None,\ - None, None, None, None, None, None, None,\ - None, None, None, None, None, None, None - - @staticmethod - def symbolic(g, input, lengths, hidden_state, weight_ih, weight_hh, bias_ih, bias_hh, - weight_ih_reverse, weight_hh_reverse, bias_ih_reverse, bias_hh_reverse, - input_size, hidden_size, num_layers, batch_first, dropout, bidirectional, bias_B, bias_B_reverse, - hidden_state_forward, hidden_state_reverse, cell_state_forward, cell_state_reverse, - weight_ih_bi, weight_hh_bi, bias_B_bi): - if num_layers != 1: - assert False, "Current do not support num_layer != 1 onnx export !" 
- param_dict = {'hidden_size_i': hidden_size, - 'direction_s': "forward", 'linear_before_reset_i': 1} - param_bidirectional_dict = {} - if bidirectional: - param_bidirectional_dict = { - 'hidden_size_i': hidden_size, 'direction_s': "bidirectional", 'linear_before_reset_i': 1} - - if batch_first: - input = g.op("Transpose", *[input], perm_i=(1, 0, 2)) - - input_list = [input, weight_ih, weight_hh] - input_bidirectional_list = [input, weight_ih_bi, weight_hh_bi] - if bias_ih is not None and bias_hh is not None: - input_list.append(bias_B) - input_bidirectional_list.append(bias_B_bi) - - param_dict['outputs'] = 2 - param_bidirectional_dict['outputs'] = 2 - - if lengths is None and hidden_state is None: - if not bidirectional: - output, hidden = g.op("GRU", *input_list, **param_dict) - output = g.op("Squeeze", *[output], axes_i=(1,)) - if batch_first: - output = g.op("Transpose", *[output], perm_i=(1, 0, 2)) - else: - output_bi, hidden, cell = g.op( - "GRU", *input_bidirectional_list, **param_bidirectional_dict) - if batch_first: - output_bi = g.op( - "Transpose", *[output_bi], perm_i=(0, 2, 1, 3)) - args = [output_bi] - output_bi = g.op('Reshape', *args, g.op('Constant', - value_t=torch.LongTensor([0, 0, -1]))) - output = g.op("Transpose", *[output_bi], perm_i=(1, 0, 2)) - elif lengths is not None and hidden_state is None: - input_list.append(lengths) - input_bidirectional_list.append(lengths) - if not bidirectional: - output, hidden = g.op("GRU", *input_list, **param_dict) - output = g.op("Squeeze", *[output], axes_i=(1,)) - if batch_first: - output = g.op("Transpose", *[output], perm_i=(1, 0, 2)) - else: - output_bi, hidden = g.op( - "GRU", *input_bidirectional_list, **param_bidirectional_dict) - if batch_first: - output_bi = g.op( - "Transpose", *[output_bi], perm_i=(0, 2, 1, 3)) - args = [output_bi] - output_bi = g.op('Reshape', *args, g.op('Constant', - value_t=torch.LongTensor([0, 0, -1]))) - output = g.op("Transpose", *[output_bi], perm_i=(1, 0, 2)) - else: - 
input_list.append(lengths) - input_list.append(hidden_state) - input_bidirectional_list.append(lengths) - input_bidirectional_list.append(hidden_state) - if not bidirectional: - output, hidden = g.op("GRU", *input_list, **param_dict) - output = g.op("Squeeze", *[output], axes_i=(1,)) - if batch_first: - output = g.op("Transpose", *[output], perm_i=(1, 0, 2)) - else: - output_bi, hidden = g.op( - "GRU", *input_bidirectional_list, **param_bidirectional_dict) - if batch_first: - output_bi = g.op( - "Transpose", *[output_bi], perm_i=(0, 2, 1, 3)) - args = [output_bi] - output_bi = g.op('Reshape', *args, g.op('Constant', - value_t=torch.LongTensor([0, 0, -1]))) - output = g.op("Transpose", *[output_bi], perm_i=(1, 0, 2)) - - return output, hidden - - -class NormalizeFastGRU(nn.GRU): - - def __init__(self, input_size, hidden_size, num_layers=1, bias=True, batch_first=False, dropout=0, bidirectional=False, - normalize_data=None, normalize_weight=None, normalize_bias=None): - nn.GRU.__init__(self, input_size, hidden_size, num_layers, - bias, batch_first, dropout, bidirectional) - self.normalize_data = normalize_data - self.normalize_weight = normalize_weight - self.normalize_bias = normalize_bias - - def forward(self, input, hx=None): # noqa: F811 - if not is_in_onnx_export(): - orig_input = input - if isinstance(orig_input, PackedSequence): - input, batch_sizes, sorted_indices, unsorted_indices = input - max_batch_size = batch_sizes[0] - max_batch_size = int(max_batch_size) - else: - batch_sizes = None - max_batch_size = input.size( - 0) if self.batch_first else input.size(1) - sorted_indices = None - unsorted_indices = None - - if hx is None: - num_directions = 2 if self.bidirectional else 1 - hx = torch.zeros(self.num_layers * num_directions, - max_batch_size, self.hidden_size, - dtype=input.dtype, device=input.device) - else: - # Each batch of the hidden state should match the input sequence that - # the user believes he/she is passing in. 
- hx = self.permute_hidden(hx, sorted_indices) - - self.check_forward_args(input, hx, batch_sizes) - direct = 0 - flat_weights = [] - weight_ih = self.weight_ih_l0_reverse if direct == 1 else self.weight_ih_l0 - weight_hh = self.weight_hh_l0_reverse if direct == 1 else self.weight_hh_l0 - if self.bias: - bias_ih = self.bias_ih_l0_reverse if direct == 1 else self.bias_ih_l0 - bias_hh = self.bias_hh_l0_reverse if direct == 1 else self.bias_hh_l0 - - if self.normalize_weight is not None: - weight_ih = NormalizeFunction.apply( - weight_ih, self.normalize_weight, self.training) - weight_hh = NormalizeFunction.apply( - weight_hh, self.normalize_weight, self.training) - if bias_ih is not None and self.normalize_bias is not None: - bias_ih = NormalizeFunction.apply( - bias_ih, self.normalize_bias, self.training) - bias_hh = NormalizeFunction.apply( - bias_hh, self.normalize_bias, self.training) - - flat_weights.extend([weight_ih, weight_hh]) - if self.bias: - flat_weights.extend([bias_ih, bias_hh]) - - if self.bidirectional: - direct = 1 - self.check_forward_args(input, hx, batch_sizes) - weight_ih = self.weight_ih_l0_reverse if direct == 1 else self.weight_ih_l0 - weight_hh = self.weight_hh_l0_reverse if direct == 1 else self.weight_hh_l0 - if self.bias: - bias_ih = self.bias_ih_l0_reverse if direct == 1 else self.bias_ih_l0 - bias_hh = self.bias_hh_l0_reverse if direct == 1 else self.bias_hh_l0 - if self.normalize_weight is not None: - weight_ih = NormalizeFunction.apply( - weight_ih, self.normalize_weight, self.training) - weight_hh = NormalizeFunction.apply( - weight_hh, self.normalize_weight, self.training) - if bias_ih is not None and self.normalize_bias is not None: - bias_ih = NormalizeFunction.apply( - bias_ih, self.normalize_bias, self.training) - bias_hh = NormalizeFunction.apply( - bias_hh, self.normalize_bias, self.training) - flat_weights.extend([weight_ih, weight_hh]) - if self.bias: - flat_weights.extend([bias_ih, bias_hh]) - - if batch_sizes is None: - 
output, hidden = _VF.gru(input, hx, flat_weights, self.bias, self.num_layers, - self.dropout, self.training, self.bidirectional, self.batch_first) - else: - output, hidden = _VF.gru(input, batch_sizes, hx, flat_weights, self.bias, - self.num_layers, self.dropout, self.training, self.bidirectional) - - if isinstance(orig_input, PackedSequence): - output_packed = PackedSequence( - output, batch_sizes, sorted_indices, unsorted_indices) - return output_packed, self.permute_hidden(hidden, unsorted_indices) - else: - return output, self.permute_hidden(hidden, unsorted_indices) - - else: - orig_input = input - if isinstance(orig_input, PackedSequence): - assert False, "GRU don't support PackedSequence as input for onnx export!" - - if isinstance(orig_input, tuple): - input, lengths, _, _ = orig_input - else: - input = orig_input - lengths = None - - bias_ih = None - bias_hh = None - bias_ih_reverse = None - bias_hh_reverse = None - weight_ih = self.weight_ih_l0 - weight_hh = self.weight_hh_l0 - weight_ih_reverse = weight_ih - weight_hh_reverse = weight_hh - - if self.bias: - bias_ih = self.bias_ih_l0 - bias_hh = self.bias_hh_l0 - bias_ih_reverse = bias_ih - bias_hh_reverse = bias_hh - if self.bidirectional: - weight_ih_reverse = self.weight_ih_l0_reverse - weight_hh_reverse = self.weight_hh_l0_reverse - bias_ih_reverse = self.bias_ih_l0_reverse - bias_hh_reverse = self.bias_hh_l0_reverse - - hidden_state = None - if hx is not None: - hidden_state = hx - output = None - hy = None - - weight_ih_chunk = weight_ih.chunk(3, 0) - weight_ih = torch.cat( - [weight_ih_chunk[1], weight_ih_chunk[0], weight_ih_chunk[2]], dim=0) - weight_ih = weight_ih.unsqueeze(0) - weight_hh_chunk = weight_hh.chunk(3, 0) - weight_hh = torch.cat( - [weight_hh_chunk[1], weight_hh_chunk[0], weight_hh_chunk[2]], dim=0) - weight_hh = weight_hh.unsqueeze(0) - - if self.normalize_weight is not None: - weight_ih = NormalizeFunction.apply( - weight_ih, self.normalize_weight, self.training) - weight_hh = 
NormalizeFunction.apply( - weight_hh, self.normalize_weight, self.training) - - weight_ih_bi = None - weight_hh_bi = None - - if self.bidirectional: - weight_ih_chunk_reverse = weight_ih_reverse.chunk(3, 0) - weight_ih_reverse = torch.cat( - [weight_ih_chunk_reverse[1], weight_ih_chunk_reverse[0], weight_ih_chunk_reverse[2]], dim=0) - weight_ih_reverse = weight_ih_reverse.unsqueeze(0) - weight_hh_chunk_reverse = weight_hh_reverse.chunk(3, 0) - weight_hh_reverse = torch.cat( - [weight_hh_chunk_reverse[1], weight_hh_chunk_reverse[0], weight_hh_chunk_reverse[2]], dim=0) - weight_hh_reverse = weight_hh_reverse.unsqueeze(0) - weight_ih_bi = torch.cat([weight_ih, weight_ih_reverse], dim=0) - weight_hh_bi = torch.cat([weight_hh, weight_hh_reverse], dim=0) - - if self.normalize_weight is not None: - weight_ih_bi = NormalizeFunction.apply( - weight_ih_bi, self.normalize_weight, self.training) - weight_hh_bi = NormalizeFunction.apply( - weight_hh_bi, self.normalize_weight, self.training) - - bias_B = None - bias_B_reverse = None - bias_B_bi = None - hidden_state_forward = None - hidden_state_reverse = None - cell_state_forward = None - cell_state_reverse = None - if bias_ih is not None and bias_hh is not None: - bias_ih_chunk = bias_ih.chunk(3, 0) - bias_ih = torch.cat( - [bias_ih_chunk[1], bias_ih_chunk[0], bias_ih_chunk[2]], dim=0) - bias_hh_chunk = bias_hh.chunk(3, 0) - bias_hh = torch.cat( - [bias_hh_chunk[1], bias_hh_chunk[0], bias_hh_chunk[2]], dim=0) - bias_B = torch.cat((bias_ih, bias_hh), dim=0) - bias_B = bias_B.unsqueeze(0) - if self.normalize_bias is not None: - bias_B = NormalizeFunction.apply( - bias_B, self.normalize_bias, self.training) - - if self.bidirectional and bias_ih is not None and bias_hh is not None: - bias_ih_chunk_reverse = bias_ih_reverse.chunk(3, 0) - bias_ih_reverse = torch.cat( - [bias_ih_chunk_reverse[1], bias_ih_chunk_reverse[0], bias_ih_chunk_reverse[2]], dim=0) - bias_hh_chunk_reverse = bias_hh_reverse.chunk(3, 0) - bias_hh_reverse = 
torch.cat( - [bias_hh_chunk_reverse[1], bias_hh_chunk_reverse[0], bias_hh_chunk_reverse[2]], dim=0) - bias_B_reverse = torch.cat( - (bias_ih_reverse, bias_hh_reverse), dim=0) - bias_B_reverse = bias_B_reverse.unsqueeze(0) - bias_B_bi = torch.cat([bias_B, bias_B_reverse], dim=0) - if self.normalize_bias is not None: - bias_B_bi = NormalizeFunction.apply( - bias_B_bi, self.normalize_bias, self.training) - - if self.bidirectional and hx is not None: - hidden_state_chunk = hidden_state.chunk(2, 0) - - hidden_state_forward = hidden_state_chunk[0] - hidden_state_reverse = hidden_state_chunk[1] - - if hx is not None: - batch_size = input.size( - 0) if self.batch_first else input.size(1) - seq_len = input.size( - 1) if self.batch_first else input.size(0) - lengths = torch.tensor([seq_len for i in range( - batch_size)], dtype=torch.int32, device=input.device) if lengths is None else lengths - if lengths is not None: - lengths = lengths.int() - output, hy = GRUOnnxFakeFunction.apply(input, lengths, hidden_state, weight_ih, weight_hh, bias_ih, bias_hh, - weight_ih_reverse, weight_hh_reverse, bias_ih_reverse, bias_hh_reverse, - self.input_size, self.hidden_size, self.num_layers, self.batch_first, self.dropout, self.bidirectional, - bias_B, bias_B_reverse, - hidden_state_forward, hidden_state_reverse, cell_state_forward, cell_state_reverse, - weight_ih_bi, weight_hh_bi, bias_B_bi) - - if self.normalize_data is not None: - output.clamp_(-self.normalize_data, self.normalize_data) - - if isinstance(orig_input, tuple): - return (output, lengths, hy) - else: - return output, hy - - def extra_repr(self): - s = nn.GRU.extra_repr(self) - extra_s = ',normalize_data:{normalize_data},normalize_weight:{normalize_weight},normalize_bias:{normalize_bias}'.format( - **self.__dict__) - return s+extra_s - - -__all__ = ['NormalizeFastGRU'] diff --git a/linger/modules/normalize_fastLSTM.py b/linger/modules/normalize_fastLSTM.py deleted file mode 100644 index ac44009..0000000 --- 
a/linger/modules/normalize_fastLSTM.py +++ /dev/null @@ -1,405 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch -import torch.nn as nn -from torch import _VF -from torch.nn.utils.rnn import PackedSequence -from torch.onnx import is_in_onnx_export - -from ..quant import NormalizeFunction - - -class LSTMOnnxFakeFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, lengths, hidden_state, cell_state, weight_ih, weight_hh, bias_ih, bias_hh, - weight_ih_reverse, weight_hh_reverse, bias_ih_reverse, bias_hh_reverse, - input_size, hidden_size, num_layers, batch_first, dropout, bidirectional, bias_B, bias_B_reverse, - hidden_state_forward, hidden_state_reverse, cell_state_forward, cell_state_reverse, - weight_ih_bi, weight_hh_bi, bias_B_bi): - output = None - hidden_state = None - cell_state = None - batch_size = None - seq_length = None - num_directions = 2 if bidirectional else 1 - if batch_first: - batch_size = input.size(0) - seq_length = input.size( - 1) if lengths is None else torch.max(lengths) - output = torch.randn(batch_size, seq_length, - hidden_size*num_directions, device=input.device) - else: - batch_size = input.size(1) - seq_length = input.size( - 0) if lengths is None else torch.max(lengths) - output = torch.randn(seq_length, batch_size, - hidden_size*num_directions, device=input.device) - hidden_state = torch.zeros( - num_directions, batch_size, hidden_size, device=input.device) - cell_state = torch.zeros( - num_directions, batch_size, hidden_size, device=input.device) - return output, hidden_state, cell_state - - @staticmethod - def backward(ctx, gradOutput, gradHidden, gradCell): - - return None, None, None, None, None, None, None,\ - None, None, None, None,\ - None, None, None, None, None, None,\ - None, None, None,\ - None, None, None, None, None, None, None,\ - None, None, None, None, None, None, None,\ - None, None, None, None, None, None, None,\ - None, None, None, None, None, None, None - - @staticmethod 
- def symbolic(g, input, lengths, hidden_state, cell_state, weight_ih, weight_hh, bias_ih, bias_hh, - weight_ih_reverse, weight_hh_reverse, bias_ih_reverse, bias_hh_reverse, - input_size, hidden_size, num_layers, batch_first, dropout, bidirectional, bias_B, bias_B_reverse, - hidden_state_forward, hidden_state_reverse, cell_state_forward, cell_state_reverse, - weight_ih_bi, weight_hh_bi, bias_B_bi): - if num_layers != 1: - assert False, "Current intx not support num_layer!=1 onnx export !" - param_dict = {'hidden_size_i': hidden_size, 'direction_s': "forward"} - param_bidirectional_dict = {} - if bidirectional: - param_bidirectional_dict = { - 'hidden_size_i': hidden_size, 'direction_s': "bidirectional"} - - if batch_first: - input = g.op("Transpose", *[input], perm_i=(1, 0, 2)) - - input_list = [input, weight_ih, weight_hh] - input_bidirectional_list = [input, weight_ih_bi, weight_hh_bi] - if bias_ih is not None and bias_hh is not None: - input_list.append(bias_B) - input_bidirectional_list.append(bias_B_bi) - - param_dict['outputs'] = 3 - param_bidirectional_dict['outputs'] = 3 - - if lengths is None and hidden_state is None: - if not bidirectional: - lstm, hidden, cell = g.op("LSTM", *input_list, **param_dict) - lstm = g.op("Squeeze", *[lstm], axes_i=(1,)) - if batch_first: - lstm = g.op("Transpose", *[lstm], perm_i=(1, 0, 2)) - else: - lstm_bi, hidden, cell = g.op( - "LSTM", *input_bidirectional_list, **param_bidirectional_dict) - if batch_first: - lstm_bi = g.op( - "Transpose", *[lstm_bi], perm_i=(0, 2, 1, 3)) - args = [lstm_bi] - lstm_bi = g.op('Reshape', *args, g.op('Constant', - value_t=torch.LongTensor([0, 0, -1]))) - lstm = g.op("Transpose", *[lstm_bi], perm_i=(1, 0, 2)) - - elif lengths is not None and hidden_state is None: - input_list.append(lengths) - input_bidirectional_list.append(lengths) - if not bidirectional: - lstm, hidden, cell = g.op("LSTM", *input_list, **param_dict) - lstm = g.op("Squeeze", *[lstm], axes_i=(1,)) - if batch_first: - lstm = 
g.op("Transpose", *[lstm], perm_i=(1, 0, 2)) - else: - lstm_bi, hidden, cell = g.op( - "LSTM", *input_bidirectional_list, **param_bidirectional_dict) - if batch_first: - lstm_bi = g.op( - "Transpose", *[lstm_bi], perm_i=(0, 2, 1, 3)) - args = [lstm_bi] - lstm_bi = g.op('Reshape', *args, g.op('Constant', - value_t=torch.LongTensor([0, 0, -1]))) - lstm = g.op("Transpose", *[lstm_bi], perm_i=(1, 0, 2)) - - else: - input_list.append(lengths) - input_list.append(hidden_state) - input_list.append(cell_state) - input_bidirectional_list.append(lengths) - input_bidirectional_list.append(hidden_state) - input_bidirectional_list.append(cell_state) - if not bidirectional: - lstm, hidden, cell = g.op("LSTM", *input_list, **param_dict) - lstm = g.op("Squeeze", *[lstm], axes_i=(1,)) - if batch_first: - lstm = g.op("Transpose", *[lstm], perm_i=(1, 0, 2)) - else: - lstm_bi, hidden, cell = g.op( - "LSTM", *input_bidirectional_list, **param_bidirectional_dict) - if batch_first: - lstm_bi = g.op( - "Transpose", *[lstm_bi], perm_i=(0, 2, 1, 3)) - args = [lstm_bi] - lstm_bi = g.op('Reshape', *args, g.op('Constant', - value_t=torch.LongTensor([0, 0, -1]))) - lstm = g.op("Transpose", *[lstm_bi], perm_i=(1, 0, 2)) - - return lstm, hidden, cell - - -class NormalizeFastLSTM(nn.LSTM): - - def __init__(self, input_size, hidden_size, num_layers=1, bias=True, batch_first=False, dropout=0, bidirectional=False, - normalize_data=None, normalize_weight=None, normalize_bias=None): - nn.LSTM.__init__(self, input_size, hidden_size, num_layers, - bias, batch_first, dropout, bidirectional) - self.normalize_data = normalize_data - self.normalize_weight = normalize_weight - self.normalize_bias = normalize_bias - - def forward(self, input, hx=None): - if not is_in_onnx_export(): - if self.num_layers != 1: - assert False, "Intx-NormalizeLSTM don't support num_layer!=1 !" 
- orig_input = input - if isinstance(orig_input, PackedSequence): - input, batch_sizes, sorted_indices, unsorted_indices = input - max_batch_size = batch_sizes[0] - max_batch_size = int(max_batch_size) - elif isinstance(orig_input, tuple): - input, lengths, batch_first, enforce_sorted = orig_input - packed_input = torch.nn.utils.rnn.pack_padded_sequence( - input, lengths, batch_first, enforce_sorted) - input, batch_sizes, sorted_indices, unsorted_indices = packed_input - max_batch_size = batch_sizes[0] - max_batch_size = int(max_batch_size) - else: - batch_sizes = None - max_batch_size = input.size( - 0) if self.batch_first else input.size(1) - sorted_indices = None - unsorted_indices = None - - if hx is None: - num_directions = 2 if self.bidirectional else 1 - zeros = torch.zeros(self.num_layers * num_directions, - max_batch_size, self.hidden_size, - dtype=input.dtype, device=input.device) - hx = (zeros, zeros) - else: - hx = self.permute_hidden(hx, sorted_indices) - - flat_weights = [] - direct = 0 - self.check_forward_args(input, hx, batch_sizes) - weight_ih = self.weight_ih_l0_reverse if direct == 1 else self.weight_ih_l0 - weight_hh = self.weight_hh_l0_reverse if direct == 1 else self.weight_hh_l0 - if self.bias: - bias_ih = self.bias_ih_l0_reverse if direct == 1 else self.bias_ih_l0 - bias_hh = self.bias_hh_l0_reverse if direct == 1 else self.bias_hh_l0 - - if self.normalize_weight is not None: - weight_ih = NormalizeFunction.apply( - weight_ih, self.normalize_weight, self.training) - weight_hh = NormalizeFunction.apply( - weight_hh, self.normalize_weight, self.training) - if bias_ih is not None and self.normalize_bias is not None: - bias_ih = NormalizeFunction.apply( - bias_ih, self.normalize_bias, self.training) - bias_hh = NormalizeFunction.apply( - bias_hh, self.normalize_bias, self.training) - - flat_weights.extend([weight_ih, weight_hh]) - if self.bias: - flat_weights.extend([bias_ih, bias_hh]) - - if self.bidirectional: - direct = 1 - 
self.check_forward_args(input, hx, batch_sizes) - weight_ih = self.weight_ih_l0_reverse if direct == 1 else self.weight_ih_l0 - weight_hh = self.weight_hh_l0_reverse if direct == 1 else self.weight_hh_l0 - if self.bias: - bias_ih = self.bias_ih_l0_reverse if direct == 1 else self.bias_ih_l0 - bias_hh = self.bias_hh_l0_reverse if direct == 1 else self.bias_hh_l0 - if self.normalize_weight is not None: - weight_ih = NormalizeFunction.apply( - weight_ih, self.normalize_weight, self.training) - weight_hh = NormalizeFunction.apply( - weight_hh, self.normalize_weight, self.training) - if bias_ih is not None and self.normalize_bias is not None: - bias_ih = NormalizeFunction.apply( - bias_ih, self.normalize_bias, self.training) - bias_hh = NormalizeFunction.apply( - bias_hh, self.normalize_bias, self.training) - flat_weights.extend([weight_ih, weight_hh]) - if self.bias: - flat_weights.extend([bias_ih, bias_hh]) - - if batch_sizes is None: - result = _VF.lstm(input, hx, flat_weights, self.bias, self.num_layers, - self.dropout, self.training, self.bidirectional, self.batch_first) - else: - result = _VF.lstm(input, batch_sizes, hx, flat_weights, self.bias, - self.num_layers, self.dropout, self.training, self.bidirectional) - output = result[0] - hidden = result[1:] - if self.normalize_data is not None: - output = NormalizeFunction.apply( - output, self.normalize_data, self.training, True) - - if isinstance(orig_input, PackedSequence): - output_packed = PackedSequence( - output, batch_sizes, sorted_indices, unsorted_indices) - return output_packed, self.permute_hidden(hidden, unsorted_indices) - elif isinstance(orig_input, tuple): - output_packed = PackedSequence( - output, batch_sizes, sorted_indices, unsorted_indices) - output, lengths = torch.nn.utils.rnn.pad_packed_sequence( - output_packed, self.batch_first) - return (output, lengths), self.permute_hidden(hidden, unsorted_indices) - else: - return output, self.permute_hidden(hidden, unsorted_indices) - else: - orig_input 
= input - if isinstance(orig_input, PackedSequence): - assert False, "LSTM don't support PackedSequence as input for onnx export!" - - if isinstance(orig_input, tuple): - input, lengths, _, _ = orig_input - else: - input = orig_input - lengths = None - - bias_ih = None - bias_hh = None - bias_ih_reverse = None - bias_hh_reverse = None - weight_ih = self.weight_ih_l0 - weight_hh = self.weight_hh_l0 - weight_ih_reverse = weight_ih - weight_hh_reverse = weight_hh - - if self.bias: - bias_ih = self.bias_ih_l0 - bias_hh = self.bias_hh_l0 - bias_ih_reverse = bias_ih - bias_hh_reverse = bias_hh - if self.bidirectional: - weight_ih_reverse = self.weight_ih_l0_reverse - weight_hh_reverse = self.weight_hh_l0_reverse - bias_ih_reverse = self.bias_ih_l0_reverse - bias_hh_reverse = self.bias_hh_l0_reverse - - hidden_state = None - cell_state = None - if hx is not None: - hidden_state, cell_state = hx - output = None - hy = None - cy = None - - weight_ih_chunk = weight_ih.chunk(4, 0) - weight_ih = torch.cat( - [weight_ih_chunk[0], weight_ih_chunk[3], weight_ih_chunk[1], weight_ih_chunk[2]], dim=0) - weight_ih = weight_ih.unsqueeze(0) - weight_hh_chunk = weight_hh.chunk(4, 0) - weight_hh = torch.cat( - [weight_hh_chunk[0], weight_hh_chunk[3], weight_hh_chunk[1], weight_hh_chunk[2]], dim=0) - weight_hh = weight_hh.unsqueeze(0) - - if self.normalize_weight is not None: - weight_ih = NormalizeFunction.apply( - weight_ih, self.normalize_weight, self.training) - weight_hh = NormalizeFunction.apply( - weight_hh, self.normalize_weight, self.training) - - weight_ih_bi = None - weight_hh_bi = None - - if self.bidirectional: - weight_ih_chunk_reverse = weight_ih_reverse.chunk(4, 0) - weight_ih_reverse = torch.cat( - [weight_ih_chunk_reverse[0], weight_ih_chunk_reverse[3], weight_ih_chunk_reverse[1], weight_ih_chunk_reverse[2]], dim=0) - weight_ih_reverse = weight_ih_reverse.unsqueeze(0) - weight_hh_chunk_reverse = weight_hh_reverse.chunk(4, 0) - weight_hh_reverse = torch.cat( - 
[weight_hh_chunk_reverse[0], weight_hh_chunk_reverse[3], weight_hh_chunk_reverse[1], weight_hh_chunk_reverse[2]], dim=0) - weight_hh_reverse = weight_hh_reverse.unsqueeze(0) - weight_ih_bi = torch.cat([weight_ih, weight_ih_reverse], dim=0) - weight_hh_bi = torch.cat([weight_hh, weight_hh_reverse], dim=0) - - if self.normalize_weight is not None: - weight_ih_bi = NormalizeFunction.apply( - weight_ih_bi, self.normalize_weight, self.training) - weight_hh_bi = NormalizeFunction.apply( - weight_hh_bi, self.normalize_weight, self.training) - - bias_B = None - bias_B_reverse = None - bias_B_bi = None - hidden_state_forward = None - hidden_state_reverse = None - cell_state_forward = None - cell_state_reverse = None - if bias_ih is not None and bias_hh is not None: - bias_ih_chunk = bias_ih.chunk(4, 0) - bias_ih = torch.cat( - [bias_ih_chunk[0], bias_ih_chunk[3], bias_ih_chunk[1], bias_ih_chunk[2]], dim=0) - bias_hh_chunk = bias_hh.chunk(4, 0) - bias_hh = torch.cat( - [bias_hh_chunk[0], bias_hh_chunk[3], bias_hh_chunk[1], bias_hh_chunk[2]], dim=0) - bias_B = torch.cat((bias_ih, bias_hh), dim=0) - bias_B = bias_B.unsqueeze(0) - if self.normalize_bias is not None: - bias_B = NormalizeFunction.apply( - bias_B, self.normalize_bias, self.training) - - if self.bidirectional and bias_ih is not None and bias_hh is not None: - bias_ih_chunk_reverse = bias_ih_reverse.chunk(4, 0) - bias_ih_reverse = torch.cat( - [bias_ih_chunk_reverse[0], bias_ih_chunk_reverse[3], bias_ih_chunk_reverse[1], bias_ih_chunk_reverse[2]], dim=0) - bias_hh_chunk_reverse = bias_hh_reverse.chunk(4, 0) - bias_hh_reverse = torch.cat( - [bias_hh_chunk_reverse[0], bias_hh_chunk_reverse[3], bias_hh_chunk_reverse[1], bias_hh_chunk_reverse[2]], dim=0) - bias_B_reverse = torch.cat( - (bias_ih_reverse, bias_hh_reverse), dim=0) - bias_B_reverse = bias_B_reverse.unsqueeze(0) - bias_B_bi = torch.cat([bias_B, bias_B_reverse], dim=0) - if self.normalize_bias is not None: - bias_B_bi = NormalizeFunction.apply( - bias_B_bi, 
self.normalize_bias, self.training) - - if self.bidirectional and hx is not None: - hidden_state_chunk = hidden_state.chunk(2, 0) - cell_state_chunk = cell_state.chunk(2, 0) - - hidden_state_forward = hidden_state_chunk[0] - hidden_state_reverse = hidden_state_chunk[1] - cell_state_forward = cell_state_chunk[0] - cell_state_reverse = cell_state_chunk[1] - - if hx is not None: - batch_size = input.size( - 0) if self.batch_first else input.size(1) - seq_len = input.size( - 1) if self.batch_first else input.size(0) - lengths = torch.tensor([seq_len for i in range( - batch_size)], dtype=torch.int32, device=input.device) if lengths is None else lengths - if lengths is not None: - lengths = lengths.int() - output, hy, cy = LSTMOnnxFakeFunction.apply(input, lengths, hidden_state, cell_state, weight_ih, weight_hh, bias_ih, bias_hh, - weight_ih_reverse, weight_hh_reverse, bias_ih_reverse, bias_hh_reverse, - self.input_size, self.hidden_size, self.num_layers, self.batch_first, self.dropout, self.bidirectional, - bias_B, bias_B_reverse, - hidden_state_forward, hidden_state_reverse, cell_state_forward, cell_state_reverse, - weight_ih_bi, weight_hh_bi, bias_B_bi) - - if self.normalize_data is not None: - output.clamp_(-self.normalize_data, self.normalize_data) - - if isinstance(orig_input, tuple): - return (output, lengths), (hy, cy) - else: - return output, (hy, cy) - - def extra_repr(self): - s = nn.GRU.extra_repr(self) - extra_s = ',normalize_data:{normalize_data},normalize_weight:{normalize_weight},normalize_bias:{normalize_bias}'.format( - **self.__dict__) - return s+extra_s - - -__all__ = ['NormalizeFastLSTM'] diff --git a/linger/modules/normalize_layernorm.py b/linger/modules/normalize_layernorm.py deleted file mode 100644 index d0ef10d..0000000 --- a/linger/modules/normalize_layernorm.py +++ /dev/null @@ -1,63 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch -import torch.nn as nn - -from ..quant import NormalizeFunction - - -class 
NormalizeLayerNorm(nn.LayerNorm): - def __init__(self, normalized_shape, eps=1e-05, elementwise_affine=True, - normalize_data=None, normalize_weight=None, normalize_bias=None, ahead_relu=False) -> None: - assert normalize_data is None or normalize_data > 0, 'normalize value is None or must >0' - assert normalize_weight is None or normalize_weight > 0, 'normalize value is None or must >0' - assert normalize_bias is None or normalize_bias > 0, 'normalize value is None or must >0' - nn.LayerNorm.__init__(self, normalized_shape, eps, elementwise_affine) - self.normalized_shape = normalized_shape - self.normalize_data = normalize_data - self.normalize_weight = normalize_weight - self.normalize_bias = normalize_bias - self.ahead_relu = ahead_relu - - def forward(self, input: torch.Tensor) -> torch.Tensor: - size = 1 - for dim_size in self.normalized_shape: - size *= dim_size - if len(self.weight.shape) == 1: - w_shape = (-1) - elif len(self.weight.shape) == 2: - w_shape = (-2, -1) - elif len(self.weight.shape) == 3: - w_shape = (-3, -2, -1) - elif len(self.weight.shape) == 4: - w_shape = (-4, -3, -2, -1) - else: - assert False, f"weight.shape=={self.weight.shape}, please check your LayerNorm definition." 
- mean = input.clone().sum(w_shape, keepdim=True) / size - var = input.clone().pow(2).sum(w_shape, keepdim=True) / size - \ - (input.clone().sum(w_shape, keepdim=True) / size).pow(2) - var = 1/torch.sqrt(var + self.eps) - var = torch.clamp(var, min=0.0) - x_normal = (input - mean) * var - - normalized_weight = self.weight - if self.normalize_weight is not None: - normalized_weight = NormalizeFunction.apply( - normalized_weight, self.normalize_weight, self.training) - normalized_bias = self.bias - if (normalized_bias is not None) and (self.normalize_bias is not None): - normalized_bias = NormalizeFunction.apply( - normalized_bias, self.normalize_bias, self.training) - out = normalized_weight * x_normal + normalized_bias - if self.normalize_data is not None: - out.clamp_(-self.normalize_data, self.normalize_data) - return out - - def extra_repr(self) -> str: - s = nn.LayerNorm.extra_repr(self) - extra_s = ',normalize_data:{normalize_data},normalize_weight:{normalize_weight},normalize_bias:{normalize_bias},ahead_relu:{ahead_relu}'.format( - **self.__dict__) - return s+extra_s - - -__all__ = ['NormalizeLayerNorm'] diff --git a/linger/modules/normalize_linear.py b/linger/modules/normalize_linear.py deleted file mode 100644 index fa58b88..0000000 --- a/linger/modules/normalize_linear.py +++ /dev/null @@ -1,42 +0,0 @@ -import torch -import torch.nn as nn -import torch.nn.functional as F - -from ..quant import NormalizeFunction - - -class NormalizeLinear(nn.Linear): - def __init__(self, in_features: int, out_features: int, bias: bool = True, - normalize_data=None, normalize_weight=None, normalize_bias=None, ahead_relu=False, ahead_sigmoid=False) -> None: - assert normalize_data is None or normalize_data > 0, 'normalize value is None or must >0' - assert normalize_weight is None or normalize_weight > 0, 'normalize value is None or must >0' - assert normalize_bias is None or normalize_bias > 0, 'normalize value is None or must >0' - nn.Linear.__init__(self, in_features, 
out_features, bias) - self.normalize_data = normalize_data - self.normalize_weight = normalize_weight - self.normalize_bias = normalize_bias - self.ahead_relu = ahead_relu - self.ahead_sigmoid = ahead_sigmoid - - def forward(self, input: torch.Tensor) -> torch.Tensor: - normalized_weight = self.weight - if self.normalize_weight is not None: - normalized_weight = NormalizeFunction.apply( - normalized_weight, self.normalize_weight, self.training) - normalized_bias = self.bias - if (self.bias is not None) and (self.normalize_bias is not None): - normalized_bias = NormalizeFunction.apply( - normalized_bias, self.normalize_bias, self.training) - out = F.linear(input, normalized_weight, normalized_bias) - if self.normalize_data: - out.clamp_(-self.normalize_data, self.normalize_data) - return out - - def extra_repr(self): - s = nn.Linear.extra_repr(self) - extra_s = ',normalize_data:{normalize_data},normalize_weight:{normalize_weight},normalize_bias:{normalize_bias},ahead_relu:{ahead_relu}'.format( - **self.__dict__) - return s+extra_s - - -__all__ = ['NormalizeLinear'] diff --git a/linger/onnx/__init__.py b/linger/onnx/__init__.py index 4a1d7de..98d1bca 100644 --- a/linger/onnx/__init__.py +++ b/linger/onnx/__init__.py @@ -1,2 +1 @@ -from .export import export -from .update_dequant import parser_dequant \ No newline at end of file +from .export import export, generate_onnx_qparam_dict, QCustomOpSymbolic, QCustomRNNSymbolic, change_onnx_to_linger2_0 \ No newline at end of file diff --git a/linger/onnx/export.py b/linger/onnx/export.py index d110494..03b32d5 100644 --- a/linger/onnx/export.py +++ b/linger/onnx/export.py @@ -1,67 +1,505 @@ -from io import BytesIO - import torch import torch.nn import torch.onnx - import onnx +from torch.onnx import register_custom_op_symbolic +from torch.onnx.symbolic_helper import parse_args +from torch.onnx import symbolic_helper as sym_help +from onnx import shape_inference, helper +from io import BytesIO +from typing import Any, 
Callable, Collection, Mapping, Sequence, TYPE_CHECKING + +from ..config import QUANT_CONFIGS +from ..utils import _single, _pair, _triple -from .scope import build_global_scope, build_onnx_scope_info from .update_dequant import parser_dequant torch_onnx_export = torch.onnx.export +QDOMAIN_NAME = 'linger' + +import os +import numpy as np +from onnx import numpy_helper +def quant_weight_bias(f_data, data_bits, scale): + if data_bits <= 8: + q_val = np.round(f_data * scale).astype(np.int8) + elif data_bits <= 16: + q_val = np.round(f_data * scale).astype(np.int16) + else: + q_val = np.round(f_data * scale).astype(np.int32) + return q_val + +def change_onnx_to_linger2_0(onnx_path): + onnx_model = onnx.load(onnx_path) + graph = onnx_model.graph + # 构建 initializer 查找表 + initializer_map = {init.name: init for init in graph.initializer} + + for node in graph.node: + if 'QGRU' in node.op_type or 'QLSTM' in node.op_type: + scale_i = None + scale_h = None + scale_iw = None + scale_hw = None + for attr in node.attribute: + if attr.name == 'scale_x': + scale_i = attr.f + attr.name = 'scale_i' + elif attr.name == 'scale_h': + scale_h = attr.f + elif attr.name == 'scale_iw': + scale_iw = attr.f + elif attr.name == 'scale_hw': + scale_hw = attr.f + elif attr.name == 'x_bits': + attr.name = 'data_bits' + elif attr.name == 'w_bits': + attr.name = 'parameter_bits' + for i in range(1, 3): + if i == 1: + scale_w = scale_iw + else: + scale_w = scale_hw + weight_name = node.input[i] + weight_f = numpy_helper.to_array(initializer_map[weight_name]) + weight_q = quant_weight_bias(weight_f, 8, scale_w) + weight_init_new = numpy_helper.from_array(weight_q, weight_name) + graph.initializer.remove(initializer_map[weight_name]) + graph.initializer.append(weight_init_new) + if len(node.input) > 3: + for i in range(3, 5): + if i == 3: + scale_b = scale_i * scale_iw + else: + scale_b = scale_h * scale_hw + bias_name = node.input[i] + bias_f = numpy_helper.to_array(initializer_map[bias_name]) + 
bias_q = quant_weight_bias(bias_f, 32, scale_b) + bias_init_new = numpy_helper.from_array(bias_q, bias_name) + graph.initializer.remove(initializer_map[bias_name]) + graph.initializer.append(bias_init_new) + + if 'QGRU' in node.op_type: + node.op_type = 'GRUInt' + node.name = node.name + '_GRUInt' + elif 'QLSTM' in node.op_type: + node.op_type = 'LSTMInt' + node.name = node.name + '_LSTMInt' + continue + + scale_x = None + scale_w = None + for attr in node.attribute: + if attr.name == 'scale_x': + scale_x = attr.f + elif attr.name == 'scale_w': + scale_w = attr.f + elif attr.name == 'x_bits': + attr.name = 'data_bits' + elif attr.name == 'w_bits': + attr.name = 'parameter_bits' + + if len(node.input) > 1 and 'weight' in str(node.input[1]): + weight_name = node.input[1] + weight_f = numpy_helper.to_array(initializer_map[weight_name]) + weight_q = quant_weight_bias(weight_f, 8, scale_w) + weight_init_new = numpy_helper.from_array(weight_q, weight_name) + + graph.initializer.remove(initializer_map[weight_name]) + graph.initializer.append(weight_init_new) + + if len(node.input) > 2 and 'bias' in str(node.input[2]): + bias_name = node.input[2] + bias_f = numpy_helper.to_array(initializer_map[bias_name]) + bias_q = quant_weight_bias(bias_f, 32, scale_x * scale_w) + bias_init_new = numpy_helper.from_array(bias_q, bias_name) + graph.initializer.remove(initializer_map[bias_name]) + graph.initializer.append(bias_init_new) + + if 'QConv2d' in node.op_type or 'QConvBN2d' in node.op_type: + node.op_type = 'Conv2dInt' + node.name = node.name + '_Conv2dInt' + elif 'QConv1d' in node.op_type or 'QConvBN1d' in node.op_type: + node.op_type = 'Conv1dInt' + node.name = node.name + '_Conv1dInt' + elif 'QAvgPool1d' in node.op_type: + node.op_type = 'AvgPool1dInt' + node.name = node.name + '_AvgPool1dInt' + elif 'QAvgPool2d' in node.op_type: + node.op_type = 'AvgPool2dInt' + node.name = node.name + '_AvgPool2dInt' + elif 'QConvTranspose1d' in node.op_type: + node.op_type = 
'ConvTranspose1dInt' + node.name = node.name + '_ConvTranspose1dInt' + elif 'QConvTranspose2d' in node.op_type: + node.op_type = 'ConvTranspose2dInt' + node.name = node.name + '_ConvTranspose2dInt' + elif 'QLinear' in node.op_type: + node.op_type = 'LinearInt' + node.name = node.name + '_LinearInt' + elif 'QEmbedding' in node.op_type: + node.op_type = 'Gather' + node.name = node.name + '_Gather' + elif 'QLayerNorm' in node.op_type: + node.op_type = 'LayerNormInt' + node.name = node.name + '_LayerNormInt' + elif 'QAdd' in node.op_type: + node.op_type = 'iqAdd' + node.name = node.name + '_iqAdd' + elif 'QMul' in node.op_type: + node.op_type = 'iqMul' + node.name = node.name + '_iqMul' + elif 'QBmm' in node.op_type: + node.op_type = 'BmmInt' + node.name = node.name + '_BmmInt' + elif 'QCat' in node.op_type: + node.op_type = 'iqCat' + node.name = node.name + '_iqCat' + for attr in node.attribute: + if attr.name == 'axis': + attr.name = 'dim' + elif 'QSigmoid' in node.op_type: + node.op_type = 'iqSigmoid' + node.name = node.name + '_iqSigmoid' + elif 'QTanh' in node.op_type: + node.op_type = 'iqTanh' + node.name = node.name + '_iqTanh' + elif 'QSoftmax' in node.op_type: + node.op_type = 'SoftmaxInt' + node.name = node.name + '_SoftmaxInt' + elif 'QGLU' in node.op_type: + node.op_type = 'GluInt' + node.name = node.name + '_GluInt' + for attr in node.attribute: + if attr.name == 'axis': + attr.name = 'dim' + + # 保存修改后的模型 + dir_name = os.path.dirname(onnx_path) + # 获取文件名和扩展名:aaa, .onnx + base_name = os.path.basename(onnx_path) + file_name, ext = os.path.splitext(base_name) + # 生成新的文件名:aaa_2.0.onnx + new_file_name = f"{file_name}_2.0{ext}" + # 拼接成完整路径 + output_path = os.path.join(dir_name, new_file_name) + onnx.save(onnx_model, output_path) + print(f"转换完成:{output_path}") + +def convert_parameter_from_float_to_int(onnx_model): + # onnx_model = onnx.load(onnx_path) + graph = onnx_model.graph + # 构建 initializer 查找表 + initializer_map = {init.name: init for init in 
graph.initializer} + + for node in graph.node: + if 'QGRU' in node.op_type or 'QLSTM' in node.op_type: + scale_i = None + scale_h = None + scale_iw = None + scale_hw = None + for attr in node.attribute: + if attr.name == 'scale_x': + scale_i = attr.f + elif attr.name == 'scale_h': + scale_h = attr.f + elif attr.name == 'scale_iw': + scale_iw = attr.f + elif attr.name == 'scale_hw': + scale_hw = attr.f + for i in range(1, 3): + if i == 1: + scale_w = scale_iw + else: + scale_w = scale_hw + weight_name = node.input[i] + weight_f = numpy_helper.to_array(initializer_map[weight_name]) + weight_q = quant_weight_bias(weight_f, 8, scale_w) + weight_init_new = numpy_helper.from_array(weight_q, weight_name) + graph.initializer.remove(initializer_map[weight_name]) + graph.initializer.append(weight_init_new) + if len(node.input) > 3: + for i in range(3, 5): + if i == 3: + scale_b = scale_i * scale_iw + else: + scale_b = scale_h * scale_hw + bias_name = node.input[i] + bias_f = numpy_helper.to_array(initializer_map[bias_name]) + bias_q = quant_weight_bias(bias_f, 32, scale_b) + bias_init_new = numpy_helper.from_array(bias_q, bias_name) + graph.initializer.remove(initializer_map[bias_name]) + graph.initializer.append(bias_init_new) + continue + + scale_x = None + scale_w = None + for attr in node.attribute: + if attr.name == 'scale_x': + scale_x = attr.f + elif attr.name == 'scale_w': + scale_w = attr.f + + if len(node.input) > 1 and 'weight' in str(node.input[1]): + weight_name = node.input[1] + weight_f = numpy_helper.to_array(initializer_map[weight_name]) + weight_q = quant_weight_bias(weight_f, 8, scale_w) + weight_init_new = numpy_helper.from_array(weight_q, weight_name) + + graph.initializer.remove(initializer_map[weight_name]) + graph.initializer.append(weight_init_new) + + if len(node.input) > 2 and 'bias' in str(node.input[2]): + bias_name = node.input[2] + bias_f = numpy_helper.to_array(initializer_map[bias_name]) + bias_q = quant_weight_bias(bias_f, 32, scale_x * 
scale_w) + bias_init_new = numpy_helper.from_array(bias_q, bias_name) + graph.initializer.remove(initializer_map[bias_name]) + graph.initializer.append(bias_init_new) + + return onnx_model + + # 保存修改后的模型 + # dir_name = os.path.dirname(onnx_path) + # # 获取文件名和扩展名:aaa, .onnx + # base_name = os.path.basename(onnx_path) + # file_name, ext = os.path.splitext(base_name) + # # 生成新的文件名:aaa_2.0.onnx + # new_file_name = f"{file_name}_2.0{ext}" + # # 拼接成完整路径 + # output_path = os.path.join(dir_name, new_file_name) + # onnx.save(onnx_model, onnx_path) + # print(f"转换完成:{onnx_path}") -def export(model, args, f, export_params=True, verbose=False, training=False, - input_names=None, output_names=None, aten=False, export_raw_ir=False, - operator_export_type=None, opset_version=12, _retain_param_name=True, - do_constant_folding=True, example_outputs=None, strip_doc_string=True, - dynamic_axes=None, keep_initializers_as_inputs=None, custom_opsets=None, - enable_onnx_checker=True, use_external_data_format=False, is_update_dequant=True, - is_input_quant=False, is_scoped_info=False, debug_dump=False): - print('change onnx export to linger export') - if is_scoped_info: - scopes_info = {} - scopes_info = build_global_scope(model) - - if is_update_dequant: - if isinstance(args, tuple): - model.eval() - model(*args) - args = list(args) - args = tuple([arg if not isinstance(arg, float) - else torch.tensor(arg) for arg in args]) +def export(model, args, f, **kwargs): + + # 1. 正常导出 ONNX + tmp = BytesIO() + torch_onnx_export(model, args, tmp, **kwargs) + tmp.seek(0) + + # 2. 自动加载并执行 shape inference + onnx_model = onnx.load(tmp) + + onnx_model = convert_parameter_from_float_to_int(onnx_model) + + onnx_model = parser_dequant(onnx_model, False) + + onnx.save(onnx_model, f) + + # inferred_model = shape_inference.infer_shapes(onnx_model) + + # 3. 
覆盖保存,让最终导出的文件带有 shape + # onnx.save(inferred_model, f) + +def generate_onnx_qparam_dict(cls, input_list = False): + qparam_dict = {'platform_s': str(QUANT_CONFIGS.platform.name), 'quant_mode_s': str(cls.output_quantizer.round_mode.name)} + if input_list: + qparam_dict['x_bits_i'] = int(cls.input_quantizer[0].data_bits) + qparam_dict['scale_x_f'] = float(cls.input_quantizer[0].scale) + if len(cls.input_quantizer) > 1: + qparam_dict['y_bits_i'] = int(cls.input_quantizer[1].data_bits) + qparam_dict['scale_y_f'] = float(cls.input_quantizer[1].scale) + elif hasattr(cls, "input_quantizer"): + qparam_dict['x_bits_i'] = int(cls.input_quantizer.data_bits) + qparam_dict['scale_x_f'] = float(cls.input_quantizer.scale) + if hasattr(cls, 'output_quantizer'): + qparam_dict['o_bits_i'] = int(cls.output_quantizer.data_bits) + qparam_dict['scale_o_f'] = float(cls.output_quantizer.scale) + if hasattr(cls, 'weight_quantizer') and cls.weight_quantizer is not None: + # qparam_dict['weight'] = cls.weight + qparam_dict['w_bits_i'] = int(cls.weight_quantizer.data_bits) + qparam_dict['scale_w_f'] = float(cls.weight_quantizer.scale) + # if hasattr(cls, 'weight_quantizer') and cls.weight_quantizer is not None: + # qparam_dict['bias'] = cls.bias + qparam_dict['op_type'] = cls._get_name() + + if 'Softmax' in qparam_dict['op_type'] or 'GLU' in qparam_dict['op_type']: + qparam_dict['axis_i'] = int(cls.dim) + qparam_dict.pop('y_bits_i', None) + qparam_dict.pop('scale_y_f', None) + elif 'ConvTranspose' in qparam_dict['op_type']: + qparam_dict['dilations_i'] = cls.dilation + qparam_dict['kernel_shape_i'] = cls.kernel_size + qparam_dict['pads_i'] = cls.padding * 2 + qparam_dict['strides_i'] = cls.stride + qparam_dict['group_i'] = cls.groups + qparam_dict['output_padding_i'] = cls.output_padding + elif 'Conv' in qparam_dict['op_type']: + qparam_dict['dilations_i'] = cls.dilation + qparam_dict['kernel_shape_i'] = cls.kernel_size + qparam_dict['pads_i'] = cls.padding * 2 + qparam_dict['strides_i'] = 
cls.stride + qparam_dict['group_i'] = cls.groups + elif 'AvgPool' in qparam_dict['op_type']: + tuple_fn = _pair # for AvgPool2D + qparam_dict['kernel_shape_i'] = tuple_fn(cls.kernel_size) + qparam_dict['pads_i'] = tuple_fn(cls.padding) * 2 + qparam_dict['strides_i'] = tuple_fn(cls.stride) + qparam_dict['ceil_mode_i'] = cls.ceil_mode + elif 'GRU' in qparam_dict['op_type'] or 'LSTM' in qparam_dict['op_type']: + qparam_dict['input_size_i'] = int(cls.input_size) + qparam_dict['hidden_size_i'] = int(cls.hidden_size) + qparam_dict['num_layers_i'] = int(cls.num_layers) + qparam_dict['batch_first_i'] = int(cls.batch_first) + qparam_dict['go_forward_i'] = True + qparam_dict['scale_h_f'] = float(cls.hidden_quantizer.scale) + qparam_dict['w_bits_i'] = int(cls.weightih_quantizer.data_bits) + qparam_dict['scale_iw_f'] = float(cls.weightih_quantizer.scale) + qparam_dict['scale_hw_f'] = float(cls.weighthh_quantizer.scale) + qparam_dict['outputs'] = 2 + if cls.bidirectional: + qparam_dict_r = {'platform_s': str(QUANT_CONFIGS.platform.name), 'quant_mode_s': str(cls.output_quantizer.round_mode.name)} + qparam_dict_r['x_bits_i'] = int(cls.input_quantizer.data_bits) + qparam_dict_r['scale_x_f'] = float(cls.input_quantizer.scale) + qparam_dict_r['input_size_i'] = int(cls.input_size) + qparam_dict_r['hidden_size_i'] = int(cls.hidden_size) + qparam_dict_r['num_layers_i'] = int(cls.num_layers) + qparam_dict_r['batch_first_i'] = int(cls.batch_first) + qparam_dict_r['go_forward_i'] = False + qparam_dict_r['scale_h_f'] = float(cls.hidden_reverse_quantizer.scale) + qparam_dict_r['w_bits_i'] = int(cls.weightih_reverse_quantizer.data_bits) + qparam_dict_r['scale_iw_f'] = float(cls.weightih_reverse_quantizer.scale) + qparam_dict_r['scale_hw_f'] = float(cls.weighthh_reverse_quantizer.scale) + qparam_dict_r['o_bits_i'] = int(cls.output_reverse_quantizer.data_bits) + qparam_dict_r['scale_o_f'] = float(cls.output_reverse_quantizer.scale) + qparam_dict_r['outputs'] = 2 + qparam_dict['qparam_dict_r'] 
= qparam_dict_r + return qparam_dict + +def quantlinear(g, input, scale_x, platform, data_bits, zero_point): + return g.op("linger::Quant", input, data_bits_i=data_bits, scale_x_f = scale_x, platform_s = platform, zero_point_i = zero_point) + +class QCustomOpSymbolic(torch.autograd.Function): + @staticmethod + def forward(ctx, input, weight, bias, *args): + # 占位符 + return input.clone() + + @staticmethod + def symbolic(g, input, weight, bias, *args): + # 获取算子类型 + qparams = {} if args[0] is None else args[0] + + op_type = qparams.get("op_type", "QGeneric") + node_name = f"{QDOMAIN_NAME}::{op_type}" + qparams.pop('op_type', None) + + input_list = [] + if 'Linear' in op_type or 'Conv' in op_type or 'ConvTranspose' in op_type or 'BatchNorm' in op_type: + is_input_qtensor = args[1] if len(args) > 1 else None + if not is_input_qtensor: + op_inner = quantlinear(g, input, qparams['scale_x_f'], qparams['platform_s'], qparams['x_bits_i'], 0) + input_list = [op_inner, weight] + else: + input_list = [input, weight] + # input_list = [input, weight] + if bias is not None: + input_list.append(bias) + elif 'AvgPool' in op_type or 'Sigmoid' in op_type or 'Tanh' in op_type or 'Softmax' in op_type or 'GLU' in op_type: + input_list = [input] + elif 'Add' in op_type or 'Mul' in op_type or 'Matmul' in op_type or 'Bmm' in op_type: + other = args[1] + input_list = [input, other] + elif 'Cat' in op_type: + other = args[1] + axis = args[2] + input_list = [input, other] + qparams['axis_i'] = int(axis) + qparams['x_0_bits_i'] = qparams['x_bits_i'] + qparams['x_1_bits_i'] = qparams['y_bits_i'] + qparams['scale_x_0_f'] = qparams['scale_x_f'] + qparams['scale_x_1_f'] = qparams['scale_y_f'] + qparams.pop('x_bits_i', None) + qparams.pop('y_bits_i', None) + qparams.pop('scale_x_f', None) + qparams.pop('scale_y_f', None) + elif 'Embedding' in op_type: + node_name = f"{QDOMAIN_NAME}::Gather" else: - model.eval() - model(args) - tmp = BytesIO() - torch_onnx_export(model, args, tmp, export_params, 
verbose, training, input_names, output_names, aten, export_raw_ir, - operator_export_type, opset_version, _retain_param_name, do_constant_folding, example_outputs, strip_doc_string, - dynamic_axes, keep_initializers_as_inputs, custom_opsets, enable_onnx_checker, use_external_data_format) - tmp.seek(0) - onnx_model = onnx.load(tmp) - - if debug_dump: - onnx.save(onnx_model, "debug_onnx_torch_export.onnx") - - if is_scoped_info: - onnx_model = build_onnx_scope_info(onnx_model) - if debug_dump: - onnx.save(onnx_model, "debug_onnx_scoped_info.onnx") + out = g.op("quant_domain::IdentityQ", input) - onnx_model = parser_dequant(onnx_model, is_input_quant) - if debug_dump: - onnx.save(onnx_model, "debug_onnx_update_dequant.onnx") + # # shape 推导 + # input_shape = sym_help._get_tensor_sizes(input) + # weight_shape = sym_help._get_tensor_sizes(weight) + # if input_shape and weight_shape: + # out_shape = input_shape[:-1] + [weight_shape[0]] + # out_shape[0] = None + # out.setType(input.type().with_sizes(out_shape)) + # else: + # out.setType(input.type()) - return onnx.save(onnx_model, f) - else: - print("Error:is_update_dequant is not support now") + out = g.op( + node_name, + *input_list, + **qparams + ) + return out + +class QCustomRNNSymbolic(torch.autograd.Function): + @staticmethod + def forward(ctx, input, weight_i, weight_h, bias_i, bias_h, weight_i_r, weight_h_r, bias_i_r, bias_h_r, *args): + # 占位符 + return input.clone(), input.clone() + + @staticmethod + def symbolic(g, input, weight_i, weight_h, bias_i, bias_h, weight_i_r, weight_h_r, bias_i_r, bias_h_r, *args): + # 获取算子类型 + qparams = {} if args[0] is None else args[0] + + op_type = qparams.get("op_type", "QGeneric") + node_name = f"{QDOMAIN_NAME}::{op_type}" + qparams.pop('op_type', None) - if is_scoped_info: - for _, enter_hook, leave_hook in scopes_info.values(): - enter_hook.remove() - leave_hook.remove() - scopes_info.clear() + input_list = [] + if 'GRU' in op_type or 'LSTM' in op_type: + qparam_dict_r = 
qparams.get("qparam_dict_r", None) + qparams.pop('qparam_dict_r', None) + + is_input_qtensor = args[1] if len(args) > 1 else None + if not is_input_qtensor: + op_inner = quantlinear(g, input, qparams['scale_x_f'], qparams['platform_s'], qparams['x_bits_i'], 0) + input_list = [op_inner, weight_i, weight_h] + else: + input_list = [input, weight_i, weight_h] + + # input_list = [input, weight_i, weight_h] + if bias_i is not None: + input_list.append(bias_i) + input_list.append(bias_h) + # To do: insert length and hidden + + out, hidden = g.op(node_name, *input_list, **qparams) + if qparam_dict_r is not None: # 双向RNN + if not is_input_qtensor: + op_inner = quantlinear(g, input, qparam_dict_r['scale_x_f'], qparam_dict_r['platform_s'], qparam_dict_r['x_bits_i'], 0) + input_list_r = [op_inner, weight_i_r, weight_h_r] + else: + input_list_r = [input, weight_i_r, weight_h_r] + # input_list_r = [input, weight_i_r, weight_h_r] + if bias_i_r is not None: + input_list_r.append(bias_i_r) + input_list_r.append(bias_h_r) + out_r, hidden_r = g.op(node_name, *input_list_r, **qparam_dict_r) + + cat_node_name = f"{QDOMAIN_NAME}::QCat" + cat_input_list = [out, out_r] + cat_param = {} + cat_param['platform_s'] = qparams.get("platform_s", None) + cat_param['quant_mode_s'] = qparams.get("quant_mode_s", None) + cat_param['axis_i'] = int(2) + cat_param['x_0_bits_i'] = qparams.get("o_bits_i", 8) + cat_param['scale_x_0_f'] = qparams.get("scale_o_f", 1.0) + cat_param['x_1_bits_i'] = qparam_dict_r.get("o_bits_i", 8) + cat_param['scale_x_1_f'] = qparam_dict_r.get("scale_o_f", 1.0) + cat_param['o_bits_i'] = qparams.get("o_bits_i", 8) + cat_param['scale_o_f'] = min(qparam_dict_r.get("scale_o_f", 1.0), qparams.get("scale_o_f", 1.0)) + out = g.op(cat_node_name, *cat_input_list, **cat_param) + hidden = g.op("Concat", hidden, hidden_r, axis_i=0) + return out, hidden + else: + return g.op("quant_domain::IdentityQ", input) -__all__ = ['export'] +__all__ = ['export', 'QCustomOpSymbolic', 
'QCustomRNNSymbolic', 'change_onnx_to_linger2_0'] diff --git a/linger/onnx/infer_type.py b/linger/onnx/infer_type.py index b756c31..972ae0d 100644 --- a/linger/onnx/infer_type.py +++ b/linger/onnx/infer_type.py @@ -27,138 +27,9 @@ def infer_type(self, in_type): tensor_type_map[output] = in_type[0] return tensor_type_map - -class ScopedEnter(OpBase): - def __init__(self, node): - super(ScopedEnter, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class ScopedLeave(OpBase): - def __init__(self, node): - super(ScopedLeave, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class TransposeMatMul(OpBase): - def __init__(self, node): - super(TransposeMatMul, self).__init__(node) - - def infer_type(self, in_type): - assert in_type[0] == 1 - assert in_type[1] == 1 - tensor_type_map = {} - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class ScaledTanh(OpBase): - def __init__(self, node): - super(ScaledTanh, self).__init__(node) - - def infer_type(self, in_type): - assert in_type[0] == 1 - tensor_type_map = {} - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Scaler(OpBase): - def __init__(self, node): - super(Scaler, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 6, 7, 11] - tensor_type_map[self.node.output[0]] = find_key('float32') - return tensor_type_map - - -class Scale(OpBase): - def __init__(self, node): - super(Scale, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class SampleOp(OpBase): - def __init__(self, node): - super(SampleOp, 
self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Rfft(OpBase): - def __init__(self, node): - super(Rfft, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Cast(OpBase): - def __init__(self, node): - super(Cast, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for attr in self.node.attribute: - data_type = attr.i - for output in self.node.output: - tensor_type_map[output] = data_type - return tensor_type_map - - -class Shape(OpBase): - def __init__(self, node): - super(Shape, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 16] - for output in self.node.output: - tensor_type_map[output] = find_key('int64') - return tensor_type_map - - -class Constant(OpBase): - def __init__(self, node): - super(Constant, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for attr in self.node.attribute: - data_type = str(onnx.numpy_helper.to_array(attr.t).dtype) - for output in self.node.output: - tensor_type_map[output] = find_key(data_type) - return tensor_type_map - - -class IQAdd(OpBase): +class QAdd(OpBase): def __init__(self, node): - super(IQAdd, self).__init__(node) + super(QAdd, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} @@ -167,45 +38,9 @@ def infer_type(self, in_type): tensor_type_map[output] = in_type[0] return tensor_type_map -class iqVar(OpBase): - def __init__(self, node): - super(iqVar, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert (in_type[0] in [3]) - for output in self.node.output: - tensor_type_map[output] = 
in_type[0] - return tensor_type_map - - -class IQSum(OpBase): - def __init__(self, node): - super(IQSum, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert (in_type[0] in [1, 3, 6]) - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class IQDiv(OpBase): - def __init__(self, node): - super(IQDiv, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert (in_type[0] in [3]) and (in_type[1] in [1, 3]) - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class IQMul(OpBase): +class QMul(OpBase): def __init__(self, node): - super(IQMul, self).__init__(node) + super(QMul, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} @@ -214,23 +49,9 @@ def infer_type(self, in_type): tensor_type_map[output] = 3 return tensor_type_map - -class IQCat(OpBase): - def __init__(self, node): - super(IQCat, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - assert in_type[index] == 3 - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class IQClamp(OpBase): +class QCat(OpBase): def __init__(self, node): - super(IQClamp, self).__init__(node) + super(QCat, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} @@ -240,10 +61,9 @@ def infer_type(self, in_type): tensor_type_map[output] = in_type[0] return tensor_type_map - -class IQSigmoid(OpBase): +class QSigmoid(OpBase): def __init__(self, node): - super(IQSigmoid, self).__init__(node) + super(QSigmoid, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} @@ -254,23 +74,9 @@ def infer_type(self, in_type): tensor_type_map[output] = 3 return tensor_type_map - -class IQSigmoid_Is8_Os8(OpBase): - def __init__(self, node): - super(IQSigmoid_Is8_Os8, self).__init__(node) - - def infer_type(self, in_type): - 
tensor_type_map = {} - for index in range(len(in_type)): - assert in_type[index] == 3 - for output in self.node.output: - tensor_type_map[output] = 3 - return tensor_type_map - - -class IQTanh(OpBase): +class QTanh(OpBase): def __init__(self, node): - super(IQTanh, self).__init__(node) + super(QTanh, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} @@ -280,10 +86,9 @@ def infer_type(self, in_type): tensor_type_map[output] = 3 return tensor_type_map - -class AvgPool2dInt(OpBase): +class QAvgPool(OpBase): def __init__(self, node): - super(AvgPool2dInt, self).__init__(node) + super(QAvgPool, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} @@ -305,26 +110,22 @@ def infer_type(self, in_type): tensor_type_map[output] = out_type return tensor_type_map - -class BatchNorm2dInt(OpBase): +class QBatchNorm(OpBase): def __init__(self, node): - super(BatchNorm2dInt, self).__init__(node) + super(QBatchNorm, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} assert in_type[0] == 3 assert in_type[1] == 3 or in_type[1] == 5 assert in_type[2] == 6 - assert in_type[3] == 1 - assert in_type[4] == 1 for output in self.node.output: tensor_type_map[output] = in_type[0] return tensor_type_map - -class LayerNormInt(OpBase): +class QLayerNorm(OpBase): def __init__(self, node): - super(LayerNormInt, self).__init__(node) + super(QLayerNorm, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} @@ -345,22 +146,9 @@ def infer_type(self, in_type): tensor_type_map[output] = out_type return tensor_type_map - -class ReLU(OpBase): - def __init__(self, node): - super(ReLU, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 3 or in_type[0] == 1 or in_type[0] == 6 - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Conv2dInt(OpBase): +class QConv(OpBase): def __init__(self, node): - super(Conv2dInt, 
self).__init__(node) + super(QConv, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} @@ -381,10 +169,9 @@ def infer_type(self, in_type): tensor_type_map[output] = out_type return tensor_type_map - -class Conv1dInt(OpBase): +class QConvTranspose(OpBase): def __init__(self, node): - super(Conv1dInt, self).__init__(node) + super(QConvTranspose, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} @@ -394,43 +181,45 @@ def infer_type(self, in_type): for attr in self.node.attribute: if attr.name == "o_bits": count = count + 1 - if attr.i == 8: - out_type = 3 - if attr.i == 32: - out_type = 6 + if attr.name == "o_bits" and attr.i == 32: + out_type = 6 + if attr.name == "o_bits" and attr.i == 8: + out_type = 3 + for output in self.node.output: if count == 0: tensor_type_map[self.node.output[0]] = find_key('float32') else: tensor_type_map[output] = out_type - return tensor_type_map + return tensor_type_map -class LSTMInt_Is8_Is64(OpBase): +class QLinear(OpBase): def __init__(self, node): - super(LSTMInt_Is8_Is64, self).__init__(node) + super(QLinear, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} - assert in_type[0] == 3 + assert in_type[0] == 3 # and in_type[1] == 3 + count = 0 out_type = 3 for attr in self.node.attribute: - if attr.name == "o_bits" and attr.i == 32: - out_type = 6 - if attr.name == "o_bits" and attr.i == 8: - out_type = 3 - - tensor_type_map[self.node.output[0]] = out_type - - tensor_type_map[self.node.output[1]] = find_key('float32') - tensor_type_map[self.node.output[2]] = find_key('float32') - + if attr.name == "o_bits": + count = count + 1 + if attr.i == 8: + out_type = 3 + if attr.i == 32: + out_type = 6 + for output in self.node.output: + if count == 0: + tensor_type_map[self.node.output[0]] = find_key('float32') + else: + tensor_type_map[output] = out_type return tensor_type_map - -class LSTMInt_Is8_Is64_If32_If32(OpBase): +class QLSTM(OpBase): def __init__(self, node): - 
super(LSTMInt_Is8_Is64_If32_If32, self).__init__(node) + super(QLSTM, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} @@ -445,44 +234,30 @@ def infer_type(self, in_type): tensor_type_map[self.node.output[0]] = out_type tensor_type_map[self.node.output[1]] = find_key('float32') - tensor_type_map[self.node.output[2]] = find_key('float32') + # tensor_type_map[self.node.output[2]] = find_key('float32') return tensor_type_map - -class ConvTranspose2dInt(OpBase): +class QGRU(OpBase): def __init__(self, node): - super(ConvTranspose2dInt, self).__init__(node) + super(QGRU, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} - assert in_type[0] == 3 # and in_type[1] == 3 - count = 0 - out_type = 3 - for attr in self.node.attribute: - if attr.name == "o_bits": - count = count + 1 - if attr.name == "o_bits" and attr.i == 32: - out_type = 6 - if attr.name == "o_bits" and attr.i == 8: - out_type = 3 - - for output in self.node.output: - if count == 0: - tensor_type_map[self.node.output[0]] = find_key('float32') - else: - tensor_type_map[output] = out_type + assert in_type[0] in [1, 3] and in_type[1] == 3 and in_type[2] == 3 + # for output in self.node.output: + tensor_type_map[self.node.output[0]] = 3 + tensor_type_map[self.node.output[1]] = 1 return tensor_type_map - -class LinearInt(OpBase): +class QBmm(OpBase): def __init__(self, node): - super(LinearInt, self).__init__(node) + super(QBmm, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} - assert in_type[0] == 3 # and in_type[1] == 3 + assert in_type[0] == 3 and in_type[1] == 3 count = 0 out_type = 3 for attr in self.node.attribute: @@ -498,87 +273,33 @@ def infer_type(self, in_type): else: tensor_type_map[output] = out_type return tensor_type_map - - -class LSTMInt(OpBase): + +class QGLU(OpBase): def __init__(self, node): - super(LSTMInt, self).__init__(node) + super(QGLU, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} - assert 
in_type[0] == 3 - out_type = 3 - for attr in self.node.attribute: - if attr.name == "o_bits" and attr.i == 32: - out_type = 6 - if attr.name == "o_bits" and attr.i == 8: - out_type = 3 - - tensor_type_map[self.node.output[0]] = out_type + for index in range(len(in_type)): + assert in_type[index] in [3, 5] + for output in self.node.output: + # 应当为uint8 (2) 但会导致dequant添加错误 所以暂时修改为int8输出 + tensor_type_map[output] = 3 + return tensor_type_map - tensor_type_map[self.node.output[1]] = find_key('float32') - tensor_type_map[self.node.output[2]] = find_key('float32') +class QSoftmax(OpBase): + def __init__(self, node): + super(QSoftmax, self).__init__(node) + def infer_type(self, in_type): + tensor_type_map = {} + assert in_type[0] in [3, 5, 6] + tensor_type_map[self.node.output[0]] = in_type[0] return tensor_type_map - -class GRUInt(OpBase): +class Quant(OpBase): def __init__(self, node): - super(GRUInt, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 3] and in_type[1] == 3 and in_type[2] == 3 - # for output in self.node.output: - tensor_type_map[self.node.output[0]] = 3 - tensor_type_map[self.node.output[1]] = 1 - - return tensor_type_map - - -class GRUInt_Is8_Is64(OpBase): - def __init__(self, node): - super(GRUInt_Is8_Is64, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 3 - out_type = 3 - for attr in self.node.attribute: - if attr.name == "o_bits" and attr.i == 32: - out_type = 6 - if attr.name == "o_bits" and attr.i == 8: - out_type = 3 - - tensor_type_map[self.node.output[0]] = out_type - tensor_type_map[self.node.output[1]] = find_key('float32') - - return tensor_type_map - - -class GRUInt_Is8_Is64_If32(OpBase): - def __init__(self, node): - super(GRUInt_Is8_Is64_If32, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 3 - out_type = 3 - for attr in self.node.attribute: - if attr.name == "o_bits" and 
attr.i == 32: - out_type = 6 - if attr.name == "o_bits" and attr.i == 8: - out_type = 3 - - tensor_type_map[self.node.output[0]] = out_type - tensor_type_map[self.node.output[1]] = find_key('float32') - - return tensor_type_map - - -class Quant(OpBase): - def __init__(self, node): - super(Quant, self).__init__(node) + super(Quant, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} @@ -587,64 +308,6 @@ def infer_type(self, in_type): tensor_type_map[output] = find_key('int8') return tensor_type_map - -class ReQuant(OpBase): - def __init__(self, node): - super(ReQuant, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for idx in range(len(self.node.attribute)): - if self.node.attribute[idx].name == "o_bits": - count = idx - break - out_type = 0 - if self.node.attribute[count].i == 8: - out_type = 3 - elif self.node.attribute[count].i == 16: - out_type = 5 - elif self.node.attribute[count].i == 32: - out_type = 6 - for output in self.node.output: - if out_type != 0: - tensor_type_map[output] = out_type - else: - tensor_type_map[output] = find_key('int32') - return tensor_type_map - - -class OnnxInferReQuant(OpBase): - def __init__(self, node): - super(OnnxInferReQuant, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for idx in range(len(self.node.attribute)): - if self.node.attribute[idx].name == "bit_dst": - count = idx - break - out_type = 0 - if self.node.attribute[count].i == 8: - out_type = 3 - for output in self.node.output: - if out_type != 0: - tensor_type_map[output] = out_type - else: - tensor_type_map[output] = find_key('int32') - return tensor_type_map - - -class IdentityInfer(OpBase): - def __init__(self, node): - super(IdentityInfer, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for output in self.node.output: - tensor_type_map[output] = find_key('int8') - return tensor_type_map - - class Dequant(OpBase): def __init__(self, node): 
super(Dequant, self).__init__(node) @@ -655,2913 +318,273 @@ def infer_type(self, in_type): tensor_type_map[output] = find_key('float32') return tensor_type_map - -class Abs(OpBase): +class Flatten(OpBase): def __init__(self, node): - super(Abs, self).__init__(node) + super(Flatten, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 11, 12, 13] + assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, + 12, 13, 16] # corresponding with 1.7 onnxruntime doc for output in self.node.output: tensor_type_map[output] = in_type[0] return tensor_type_map - -class Acos(OpBase): +class Gather(OpBase): def __init__(self, node): - super(Acos, self).__init__(node) + super(Gather, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc + assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, + 12, 13, 16] # corresponding with 1.7 onnxruntime doc + assert in_type[1] in [6, 7] # corresponding with 1.7 onnxruntime doc for output in self.node.output: tensor_type_map[output] = in_type[0] return tensor_type_map - -class Acosh(OpBase): +class MaxPool(OpBase): def __init__(self, node): - super(Acosh, self).__init__(node) + super(MaxPool, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - + count = 1 + if len(self.node.attribute) == 0: + tensor_type_map[self.node.output[0]] = in_type[0] + return tensor_type_map + for idx in range(len(self.node.attribute)): + if self.node.attribute[idx].name == "strides": + count = idx + break + if len(self.node.attribute[count].ints) == 2: + for index in range(len(in_type)): + # corresponding with 1.7 onnxruntime doc + assert in_type[index] in [1, 2, 3, 10, 11] + 
tensor_type_map[self.node.output[0]] = in_type[0] + if len(self.node.output) == 2: + tensor_type_map[self.node.output[1]] = find_key('int64') + return tensor_type_map + elif len(self.node.attribute[count].ints) == 1: + for index in range(len(in_type)): + # corresponding with 1.7 onnxruntime doc + assert in_type[index] in [1, 2, 3, 10, 11] + tensor_type_map[self.node.output[0]] = in_type[0] + if len(self.node.output) == 2: + tensor_type_map[self.node.output[1]] = find_key('int64') + return tensor_type_map + else: + print("Maxpool node error infertype!") + exit() -class Add(OpBase): +class Relu(OpBase): def __init__(self, node): - super(Add, self).__init__(node) + super(Relu, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 6, 7, 10, 11, 12, 13] - # corresponding with 1.7 onnxruntime doc - assert in_type[1] in [1, 6, 7, 10, 11, 12, 13] + assert in_type[0] in [1, 10, 11, 3, 6] for output in self.node.output: tensor_type_map[output] = in_type[0] return tensor_type_map - -class And(OpBase): +class Reshape(OpBase): def __init__(self, node): - super(And, self).__init__(node) + super(Reshape, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} - assert in_type[0] == 9 - assert in_type[1] == 9 + assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, + 12, 13, 16] # corresponding with 1.7 onnxruntime doc + if len(in_type) > 1: + assert in_type[1] == 7 for output in self.node.output: tensor_type_map[output] = in_type[0] return tensor_type_map - -class ArgMax(OpBase): +class Slice(OpBase): def __init__(self, node): - super(ArgMax, self).__init__(node) + super(Slice, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} + assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, + 12, 13, 16] # corresponding with 1.7 onnxruntime doc + assert in_type[1] in [1, 6, 7] # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 3, 6, 10, 11] + 
assert in_type[2] in [1, 6, 7] + if len(in_type) > 3: + assert in_type[3] in [1, 6, 7] + if len(in_type) == 5: + assert in_type[4] in [1, 6, 7] for output in self.node.output: tensor_type_map[output] = in_type[0] return tensor_type_map - -class ArgMin(OpBase): +class Split(OpBase): def __init__(self, node): - super(ArgMin, self).__init__(node) + super(Split, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 6, 10, 11] + assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, + 12, 13, 16] # corresponding with 1.7 onnxruntime doc + if len(in_type) == 2: + # corresponding with 1.7 onnxruntime doc + assert in_type[1] in [1, 2, 3, 4, 5, + 6, 7, 8, 9, 10, 11, 12, 13, 16] for output in self.node.output: tensor_type_map[output] = in_type[0] return tensor_type_map - -class Asin(OpBase): +class Transpose(OpBase): def __init__(self, node): - super(Asin, self).__init__(node) + super(Transpose, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc + assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, + 12, 13, 16] # corresponding with 1.7 onnxruntime doc for output in self.node.output: tensor_type_map[output] = in_type[0] return tensor_type_map - -class Asinh(OpBase): +class Concat(OpBase): def __init__(self, node): - super(Asinh, self).__init__(node) + super(Concat, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc + for index in range(len(in_type)): + # 去除支持int8 int8下接concat 会添加dequant + assert in_type[index] != 0 and in_type[index] != 3 + assert in_type[index] == in_type[0] for output in self.node.output: tensor_type_map[output] = in_type[0] return tensor_type_map - -class Atan(OpBase): +class Shape(OpBase): def __init__(self, node): - super(Atan, self).__init__(node) + super(Shape, self).__init__(node) 
def infer_type(self, in_type): tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc + # corresponding with 1.7 onnxruntime doc + assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 16] for output in self.node.output: - tensor_type_map[output] = in_type[0] + tensor_type_map[output] = find_key('int64') return tensor_type_map - -class Atanh(OpBase): +class Constant(OpBase): def __init__(self, node): - super(Atanh, self).__init__(node) + super(Constant, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc + for attr in self.node.attribute: + data_type = str(onnx.numpy_helper.to_array(attr.t).dtype) for output in self.node.output: - tensor_type_map[output] = in_type[0] + tensor_type_map[output] = find_key(data_type) return tensor_type_map - -class AveragePool(OpBase): +class Squeeze(OpBase): def __init__(self, node): - super(AveragePool, self).__init__(node) + super(Squeeze, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] + assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, + 12, 13, 16] # corresponding with 1.7 onnxruntime doc + if len(in_type) == 2: + assert in_type[1] == 7 + tensor_type_map[self.node.output[0]] = in_type[0] return tensor_type_map - - -class BatchNormalization(OpBase): + +class Unsqueeze(OpBase): def __init__(self, node): - super(BatchNormalization, self).__init__(node) + super(Unsqueeze, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 10, 11] + assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, + 12, 13, 16] # corresponding with 1.7 onnxruntime doc for output in self.node.output: tensor_type_map[output] = 
in_type[0] return tensor_type_map - -class Sub(OpBase): +class ArgMax(OpBase): def __init__(self, node): - super(Sub, self).__init__(node) + super(ArgMax, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 6, 7, 10, 11, 12, 13] + # corresponding with 1.7 onnxruntime doc + assert in_type[0] in [1, 3, 6, 10, 11] for output in self.node.output: tensor_type_map[output] = in_type[0] return tensor_type_map - -class Ceil(OpBase): +class ArgMin(OpBase): def __init__(self, node): - super(Ceil, self).__init__(node) + super(ArgMin, self).__init__(node) def infer_type(self, in_type): tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 10, 11] + # corresponding with 1.7 onnxruntime doc + assert in_type[0] in [1, 6, 10, 11] for output in self.node.output: tensor_type_map[output] = in_type[0] return tensor_type_map +op_map = { + 'Concat': Concat, + 'Shape': Shape, + 'Constant': Constant, + 'Squeeze': Squeeze, + 'Unsqueeze': Unsqueeze, + 'ArgMax': ArgMax, + 'ArgMin': ArgMin, -class Celu(OpBase): - def __init__(self, node): - super(Celu, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - assert in_type[index] == 1 - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map + 'Quant': Quant, + 'Dequant': Dequant, + 'Flatten': Flatten, + 'Gather': Gather, + 'MaxPool': MaxPool, -class Clip(OpBase): - def __init__(self, node): - super(Clip, self).__init__(node) + 'QAdd': QAdd, + 'QMul': QMul, + 'QCat': QCat, + 'QAvgPool1d': QAvgPool, + 'QAvgPool2d': QAvgPool, + 'QConv1d': QConv, + 'QConv2d': QConv, + 'QConvBN1d': QConv, + 'QConvBN2d': QConv, + 'QConvTranspose1d': QConvTranspose, + 'QConvTranspose2d': QConvTranspose, + 'QBatchNorm1d': QBatchNorm, + 'QBatchNorm2d': QBatchNorm, + 
'QLayerNorm': QLayerNorm, + 'QLinear': QLinear, + 'QLSTM': QLSTM, + 'QGRU': QGRU, + 'QBmm': QBmm, + 'QGLU': QGLU, + 'QSigmoid': QSigmoid, + 'QSoftmax': QSoftmax, + 'QTanh': QTanh, - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 2, 3, 7, 11, 13] - # Determine whether the types are the same - assert len(set(in_type)) == 1 - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Compress(OpBase): - def __init__(self, node): - super(Compress, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, - 12, 13, 16] # corresponding with 1.7 onnxruntime doc - assert in_type[1] == 9 - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Concat(OpBase): - def __init__(self, node): - super(Concat, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # 去除支持int8 int8下接concat 会添加dequant - assert in_type[index] != 0 and in_type[index] != 3 - assert in_type[index] == in_type[0] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Conv(OpBase): - def __init__(self, node): - super(Conv, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class ConvTranspose(OpBase): - def __init__(self, node): - super(ConvTranspose, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 10, 11] - for output 
in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Cos(OpBase): - def __init__(self, node): - super(Cos, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] == 1 - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Cosh(OpBase): - def __init__(self, node): - super(Cosh, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] == 1 - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class CumSum(OpBase): - def __init__(self, node): - super(CumSum, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 6, 7, 10, 11, 12, 13] - assert in_type[1] in [6, 7] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class DepthToSpace(OpBase): - def __init__(self, node): - super(DepthToSpace, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Det(OpBase): - def __init__(self, node): - super(Det, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Div(OpBase): - def __init__(self, node): - super(Div, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] 
in [1, 6, 7, 11, 10, 12, 13] - # corresponding with 1.7 onnxruntime doc - assert in_type[1] in [1, 6, 7, 11, 10, 12, 13] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Einsum(OpBase): - def __init__(self, node): - super(Einsum, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 6, 7, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Elu(OpBase): - def __init__(self, node): - super(Elu, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Equal(OpBase): - def __init__(self, node): - super(Equal, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 6, 7, 9] - # corresponding with 1.7 onnxruntime doc - assert in_type[1] in [1, 6, 7, 9] - for output in self.node.output: - tensor_type_map[output] = find_key('bool') - return tensor_type_map - - -class Erf(OpBase): - def __init__(self, node): - super(Erf, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Exp(OpBase): - def __init__(self, node): - super(Exp, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Expand(OpBase): - def __init__(self, node): - super(Expand, 
self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 9, 10, 11, - 12, 13, 16] # corresponding with 1.7 onnxruntime doc - assert in_type[1] == 7 - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Flatten(OpBase): - def __init__(self, node): - super(Flatten, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, - 12, 13, 16] # corresponding with 1.7 onnxruntime doc - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Floor(OpBase): - def __init__(self, node): - super(Floor, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class FixLength(OpBase): - def __init__(self, node): - super(FixLength, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] != 0 and in_type[1] != 0 - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Gather(OpBase): - def __init__(self, node): - super(Gather, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, - 12, 13, 16] # corresponding with 1.7 onnxruntime doc - assert in_type[1] in [6, 7] # corresponding with 1.7 onnxruntime doc - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class MaskedGather(OpBase): - def __init__(self, node): - super(MaskedGather, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] != 0 - if len(in_type) == 3: - assert in_type[0] in [1, 3, 6, 10] - assert 
in_type[1] in [6, 7] - assert in_type[2] == 6 - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class GatherElements(OpBase): - def __init__(self, node): - super(GatherElements, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, - 12, 13, 16] # corresponding with 1.7 onnxruntime doc - assert in_type[1] in [6, 7] # corresponding with 1.7 onnxruntime doc - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class GatherND(OpBase): - def __init__(self, node): - super(GatherND, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, - 12, 13, 16] # corresponding with 1.7 onnxruntime doc - assert in_type[1] == 7 # corresponding with 1.7 onnxruntime doc - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class GenExcit(OpBase): - def __init__(self, node): - super(GenExcit, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] != 0 - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Gemm(OpBase): - def __init__(self, node): - super(Gemm, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class GlobalAveragePool(OpBase): - def __init__(self, node): - super(GlobalAveragePool, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class 
GlobalAveragePoolMask(OpBase): - def __init__(self, node): - super(GlobalAveragePoolMask, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class GlobalLpPool(OpBase): - def __init__(self, node): - super(GlobalLpPool, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class GlobalMaxPool(OpBase): - def __init__(self, node): - super(GlobalMaxPool, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class ConvInteger(OpBase): - def __init__(self, node): - super(ConvInteger, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 2 # corresponding with 1.7 onnxruntime doc - assert in_type[1] == 2 - if len(in_type) > 2: - assert in_type[2] == 2 - if len(in_type) == 4: - assert in_type[3] == 2 - for output in self.node.output: - tensor_type_map[output] = find_key('int32') - return tensor_type_map - - -class ConvTranspose2dInteger(OpBase): - def __init__(self, node): - super(ConvTranspose2dInteger, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - assert in_type[index] == 3 - for output in self.node.output: - tensor_type_map[output] = find_key('int32') - return tensor_type_map - - -class Greater(OpBase): - def __init__(self, node): - super(Greater, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime 
doc - assert in_type[index] in [1, 6, 7, 10, 11, 12, 13] - for output in self.node.output: - tensor_type_map[output] = find_key('bool') - return tensor_type_map - - -class GreaterOrEqual(OpBase): - def __init__(self, node): - super(GreaterOrEqual, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - assert in_type[index] in [2, 4, 12, 13, 3, 5, 6, 7, 1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = find_key('bool') - return tensor_type_map - - -class Identity(OpBase): - def __init__(self, node): - super(Identity, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, - 12, 13, 16] # corresponding with 1.7 onnxruntime doc - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class InstanceNormalization(OpBase): - def __init__(self, node): - super(InstanceNormalization, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class IsInf(OpBase): - def __init__(self, node): - super(IsInf, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 11] - for output in self.node.output: - tensor_type_map[output] = find_key('bool') - return tensor_type_map - - -class IsNaN(OpBase): - def __init__(self, node): - super(IsNaN, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 10] # corresponding with 1.7 onnxruntime doc - for output in self.node.output: - tensor_type_map[output] = find_key('bool') - return tensor_type_map - - -class LRN(OpBase): - def __init__(self, node): - super(LRN, self).__init__(node) - - def 
infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Less(OpBase): - def __init__(self, node): - super(Less, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 6, 7, 10, 11, 12, 13] - assert in_type[1] == in_type[0] - for output in self.node.output: - tensor_type_map[output] = find_key('bool') - return tensor_type_map - - -class LessOrEqual(OpBase): - def __init__(self, node): - super(LessOrEqual, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [2, 4, 12, 13, 3, 5, 6, 7, 1, 10, 11] - assert in_type[1] in [2, 4, 12, 13, 3, 5, 6, 7, 1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = find_key('bool') - return tensor_type_map - - -class Log(OpBase): - def __init__(self, node): - super(Log, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class LogSoftmax(OpBase): - def __init__(self, node): - super(LogSoftmax, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 11] # corresponding with 1.7 onnxruntime doc - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class LpNormalization(OpBase): - def __init__(self, node): - super(LpNormalization, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class LpPool(OpBase): - 
def __init__(self, node): - super(LpPool, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class MatMul(OpBase): - def __init__(self, node): - super(MatMul, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11, 6, 7, 12, 13] - # corresponding with 1.7 onnxruntime doc - assert in_type[1] in [1, 10, 11, 6, 7, 12, 13] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class MatMulInteger(OpBase): - def __init__(self, node): - super(MatMulInteger, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [2, 3] # corresponding with 1.7 onnxruntime doc - assert in_type[1] in [2, 3] - if len(in_type) > 2: - assert in_type[2] in [2, 3] - if len(in_type) == 4: - assert in_type[3] in [2, 3] - for output in self.node.output: - tensor_type_map[output] = find_key('int32') - return tensor_type_map - - -class Max(OpBase): - def __init__(self, node): - super(Max, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 11, 10, 6, 7, 12, 13] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class iqMax(OpBase): - def __init__(self, node): - super(iqMax, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc #add int8 input support - assert in_type[index] in [1, 3, 11, 10, 6, 7, 12, 13] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class 
MaxPool(OpBase): - def __init__(self, node): - super(MaxPool, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - count = 1 - if len(self.node.attribute) == 0: - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - for idx in range(len(self.node.attribute)): - if self.node.attribute[idx].name == "strides": - count = idx - break - if len(self.node.attribute[count].ints) == 2: - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 2, 3, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - if len(self.node.output) == 2: - tensor_type_map[self.node.output[1]] = find_key('int64') - return tensor_type_map - elif len(self.node.attribute[count].ints) == 1: - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 2, 3, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - if len(self.node.output) == 2: - tensor_type_map[self.node.output[1]] = find_key('int64') - return tensor_type_map - else: - print("Maxpool node error infertype!") - exit() - - -class Mean(OpBase): - def __init__(self, node): - super(Mean, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class MeanVarianceNormalization(OpBase): - def __init__(self, node): - super(MeanVarianceNormalization, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class MaskToLength(OpBase): - def __init__(self, node): - super(MaskToLength, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 6, 7, 10] - tensor_type_map[self.node.output[0]] = 6 - return 
tensor_type_map - - -class Min(OpBase): - def __init__(self, node): - super(Min, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 6, 7, 10, 11, 12, 13] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Mod(OpBase): - def __init__(self, node): - super(Mod, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [2, 4, 12, 13, 3, 5, 6, 7, 1, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Mul(OpBase): - def __init__(self, node): - super(Mul, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 6, 7, 10, 11, 12, 13] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Neg(OpBase): - def __init__(self, node): - super(Neg, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 3, 5, 6, 7, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class NonMaxSuppression(OpBase): - def __init__(self, node): - super(NonMaxSuppression, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 - assert in_type[1] == 1 - if len(in_type) > 2: # corresponding with 1.7 onnxruntime doc - assert in_type[2] == 7 - if len(in_type) > 3: - assert in_type[3] == 1 - if len(in_type) == 5: - assert in_type[4] == 1 - tensor_type_map[self.node.output[0]] = find_key('int64') - return tensor_type_map - - -class NonZero(OpBase): - def __init__(self, node): - 
super(NonZero, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 2, 6, 7, 9] - tensor_type_map[self.node.output[0]] = find_key('int64') - return tensor_type_map - - -class Not(OpBase): - def __init__(self, node): - super(Not, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 9 - tensor_type_map[self.node.output[0]] = find_key('bool') - return tensor_type_map - - -class OneHot(OpBase): - def __init__(self, node): - super(OneHot, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 6, 7] - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 6, 7] - # corresponding with 1.7 onnxruntime doc - assert in_type[2] in [1, 6, 7, 8] - tensor_type_map[self.node.output[0]] = in_type[2] - return tensor_type_map - - -class ConstantOfShape(OpBase): - def __init__(self, node): - super(ConstantOfShape, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 7 - if not self.node.attribute: - for output in self.node.output: - tensor_type_map[output] = find_key('float32') - else: - for attr in self.node.attribute: - data_type = str(onnx.numpy_helper.to_array(attr.t).dtype) - for output in self.node.output: - tensor_type_map[output] = find_key(data_type) - return tensor_type_map - - -class DequantizeLinear(OpBase): - def __init__(self, node): - super(DequantizeLinear, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [2, 3] # corresponding with 1.7 onnxruntime doc - assert in_type[1] == 1 - if len(in_type) == 3: - # corresponding with 1.7 onnxruntime doc - assert in_type[2] in [2, 3] - for output in self.node.output: - tensor_type_map[output] = find_key('float32') - return tensor_type_map - - -class Dropout(OpBase): - def 
__init__(self, node): - super(Dropout, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - if len(in_type) > 1: - # corresponding with 1.7 onnxruntime doc - assert in_type[1] in [1, 10, 11] - if len(in_type) == 3: - assert in_type[2] == 9 - tensor_type_map[self.node.output[0]] = in_type[0] - if len(self.node.output) == 2: - tensor_type_map[self.node.output[1]] = find_key('bool') - return tensor_type_map - - -class GRU(OpBase): - def __init__(self, node): - super(GRU, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 10, 11] and in_type[1] in [1, 10, 11] and in_type[2] in [ - 1, 10, 11] # corresponding with 1.7 onnxruntime doc - if len(in_type) > 3: - assert in_type[3] in [1, 10, 11] - if len(in_type) > 4: - assert in_type[4] == 6 - if len(in_type) == 6: - assert in_type[5] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class If(OpBase): - def __init__(self, node): - super(If, self).__init__(node) - - def infer_type(self, in_type, tensor_type_map): - assert in_type[0] == 9 - if self.node.attribute[0].g.output[0].type.tensor_type.elem_type == 0: - for node in self.node.attribute[0].g.node: - try: - if node.op_type == 'LSTM': - in_type = [tensor_type_map[node.input[0]]] - elif node.op_type == 'GRU': - in_type = [tensor_type_map[node.input[0]], - tensor_type_map[node.input[1]], tensor_type_map[node.input[2]]] - else: - in_type = [tensor_type_map[inp] for inp in node.input] - if node.op_type in op_map.keys(): - it = op_map[node.op_type](node) - else: - print("Warning: InferType OP ", node.op_type, - " is not supported,this may cause error !") - it = op_map['Others'](node) - tensor_type_map.update(it.infer_type(in_type)) - - except AssertionError as e: - print("The "+node.op_type+"'s input_type has an error") - raise - data_type = 
tensor_type_map[self.node.attribute[0].g.output[0].name] - else: - data_type = self.node.attribute[0].g.output[0].type.tensor_type.elem_type - - tensor_type_map = {} - for output in self.node.output: - tensor_type_map[output] = data_type - return tensor_type_map - - -class LSTM(OpBase): - def __init__(self, node): - super(LSTM, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - assert in_type[1] in [1, 10, 11] - assert in_type[2] in [1, 10, 11] - if len(in_type) > 3: - assert in_type[3] == in_type[0] - if len(in_type) > 4: - assert in_type[4] == in_type[0] - if len(in_type) > 5: - assert in_type[5] == 6 - if len(in_type) > 6: - assert in_type[6] == in_type[0] - if len(in_type) == 8: - assert in_type[7] == in_type[0] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Loop(OpBase): - def __init__(self, node): - super(Loop, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - if len(in_type) == 1: # corresponding with 1.7 onnxruntime doc - assert in_type[0] == 7 - if len(in_type) > 1: - assert in_type[1] == 9 - if len(in_type) > 2: - assert in_type[2] in [1, 2, 3, 4, 5, - 6, 7, 8, 9, 10, 11, 12, 13, 16] - for output in self.node.output: - tensor_type_map[output] = in_type[2] - return tensor_type_map - - -class Multinomial(OpBase): - def __init__(self, node): - super(Multinomial, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc - if not self.node.attribute: - data_type = find_key('int32') - else: - for attr in self.node.attribute: - data_type = attr.i - break - for output in self.node.output: - tensor_type_map[output] = data_type - return tensor_type_map - - -class Or(OpBase): - def __init__(self, node): - super(Or, self).__init__(node) - - def infer_type(self, in_type): - 
tensor_type_map = {} - assert in_type[0] == 9 - assert in_type[1] == 9 - for output in self.node.output: - tensor_type_map[output] = find_key('bool') - return tensor_type_map - - -class PRelu(OpBase): - def __init__(self, node): - super(PRelu, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - # corresponding with 1.7 onnxruntime doc - assert in_type[1] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Pad(OpBase): - def __init__(self, node): - super(Pad, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 2, 3, 6, 7, 10, 11, 12, 13] - if len(in_type) == 3: - assert in_type[1] == 7 - assert in_type[2] == in_type[0] # input[2] :optional 与intype[0]须一致 - - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Pow(OpBase): - def __init__(self, node): - super(Pow, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 6, 7, 11] - assert in_type[1] in [1, 6, 7, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class QLinearConv(OpBase): - def __init__(self, node): - super(QLinearConv, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - if index in [1, 4, 6]: - assert in_type[index] == 1 - else: - assert in_type[index] == 2 - if len(in_type) == 9: # corresponding with 1.7 onnxruntime doc - assert in_type[8] == 6 - for output in self.node.output: - tensor_type_map[output] = in_type[7] - return tensor_type_map - - -class QLinearMatMul(OpBase): - def __init__(self, node): - super(QLinearMatMul, self).__init__(node) - - def 
infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - if index in [1, 4, 6]: - assert in_type[index] == 1 - else: - # corresponding with 1.7 onnxruntime doc - assert in_type[index] == 2 - for output in self.node.output: - tensor_type_map[output] = in_type[7] - return tensor_type_map - - -class QuantizeLinear(OpBase): - def __init__(self, node): - super(QuantizeLinear, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 10] - assert in_type[1] in [1, 10] - if len(in_type) == 3: # corresponding with 1.7 onnxruntime doc - assert in_type[2] in [2, 3] - for output in self.node.output: - tensor_type_map[output] = in_type[2] - return tensor_type_map - - -class RNN(OpBase): - def __init__(self, node): - super(RNN, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - if len(in_type) >= 5 and index == 4: - assert in_type[index] == 6 - else: - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class RandomNormal(OpBase): - def __init__(self, node): - super(RandomNormal, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - if not self.node.attribute: - data_type = find_key('float32') - else: - for attr in self.node.attribute: - data_type = attr.i - break - for output in self.node.output: - tensor_type_map[output] = data_type - return tensor_type_map - - -class RandomNormalLike(OpBase): - def __init__(self, node): - super(RandomNormalLike, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, - 12, 13, 16] # corresponding with 1.7 onnxruntime doc - if not self.node.attribute: - data_type = in_type[0] - else: - for attr in self.node.attribute: - if attr.name == 'dtype': - data_type = attr.i 
- for output in self.node.output: - tensor_type_map[output] = data_type - return tensor_type_map - - -class RandomUniform(OpBase): - def __init__(self, node): - super(RandomUniform, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - if not self.node.attribute: - data_type = find_key('float32') - else: - for attr in self.node.attribute: - data_type = attr.i - break - for output in self.node.output: - tensor_type_map[output] = data_type - return tensor_type_map - - -class RandomUniformLike(OpBase): - def __init__(self, node): - super(RandomUniformLike, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, - 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] - if not self.node.attribute: - data_type = in_type[0] - else: - for attr in self.node.attribute: - data_type = attr.i - break - for output in self.node.output: - tensor_type_map[output] = data_type - return tensor_type_map - - -class Range(OpBase): - def __init__(self, node): - super(Range, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - assert in_type[index] in [1, 5, 6, 7, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Reciprocal(OpBase): - def __init__(self, node): - super(Reciprocal, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class ReduceL1(OpBase): - def __init__(self, node): - super(ReduceL1, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 6, 10, 11] - for output in self.node.output: - 
tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class ReduceL2(OpBase): - def __init__(self, node): - super(ReduceL2, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 6, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class ReduceLogSum(OpBase): - def __init__(self, node): - super(ReduceLogSum, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 6, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class ReduceLogSumExp(OpBase): - def __init__(self, node): - super(ReduceLogSumExp, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 6, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class ReduceMax(OpBase): - def __init__(self, node): - super(ReduceMax, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 2, 3, 6, 7, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class ReduceMean(OpBase): - def __init__(self, node): - super(ReduceMean, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 6, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class 
MaskedReduceMean(OpBase): - def __init__(self, node): - super(MaskedReduceMean, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - assert in_type[index] in [1, 10, 11, 16, 6, 7, 12, 13] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class ReduceMin(OpBase): - def __init__(self, node): - super(ReduceMin, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 2, 3, 6, 7, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class ReduceProd(OpBase): - def __init__(self, node): - super(ReduceProd, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for index in range(len(in_type)): - # corresponding with 1.7 onnxruntime doc - assert in_type[index] in [1, 6, 7, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class ReduceSum(OpBase): - def __init__(self, node): - super(ReduceSum, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 6, 7, 10, 11] - if len(in_type) > 1: - assert in_type[1] == 7 - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class ReduceSumSquare(OpBase): - def __init__(self, node): - super(ReduceSumSquare, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 6, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Relu(OpBase): - def __init__(self, node): - super(Relu, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # 
corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11, 3, 6] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class ShuffleChannel(OpBase): - def __init__(self, node): - super(ShuffleChannel, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11, 3, 6] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - -class Reshape(OpBase): - def __init__(self, node): - super(Reshape, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, - 12, 13, 16] # corresponding with 1.7 onnxruntime doc - if len(in_type) > 1: - assert in_type[1] == 7 - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Resize(OpBase): - def __init__(self, node): - super(Resize, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - if self.node.domain == "thinker": - assert in_type[0] in [1, 6, 2, 3] - else: - # remove int8 input 对齐onnxruntime 1.7.0 版本 - assert in_type[0] in [1, 6, 2] - if len(in_type) > 1: # corresponding with 1.7 onnxruntime doc - assert in_type[1] in [1, 10, 11] - if len(in_type) > 2: - assert in_type[2] == 1 - if len(in_type) == 4: - assert in_type[3] == 7 - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class ReverseSequence(OpBase): - def __init__(self, node): - super(ReverseSequence, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 16] - assert in_type[1] == 7 - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class DurationToAlignment(OpBase): - def __init__(self, node): - super(DurationToAlignment, 
self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in onnx_dtype - for output in self.node.output: - tensor_type_map[output] = 1 - return tensor_type_map - - -class Round(OpBase): - def __init__(self, node): - super(Round, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class MeanVarianceNormalization(OpBase): - def __init__(self, node): - super(MeanVarianceNormalization, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class NegativeLogLikelihoodLoss(OpBase): - def __init__(self, node): - super(NegativeLogLikelihoodLoss, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 10, 11] - assert in_type[1] in [6, 7] - if len(in_type) == 3: - assert in_type[2] in [1, 10, 11] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Scan(OpBase): - def __init__(self, node): - super(Scan, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - if len(in_type) == 1: # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 2, 3, 4, 5, - 6, 7, 8, 9, 10, 11, 12, 13, 16] - if len(in_type) == 2: - assert in_type[0] == 7 - assert in_type[1] in [1, 2, 3, 4, 5, - 6, 7, 8, 9, 10, 11, 12, 13, 16] - for output in self.node.output: - tensor_type_map[output] in [1, 2, 3, 4, - 5, 6, 7, 8, 9, 10, 11, 12, 13, 16] - return tensor_type_map - - -class ScatterElements(OpBase): - def __init__(self, node): - super(ScatterElements, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 
7, 8, 9, 10, 11, - 12, 13, 16] # corresponding with 1.7 onnxruntime doc - assert in_type[1] in [6, 7] # corresponding with 1.7 onnxruntime doc - assert in_type[2] in [6, 7] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class ScatterND(OpBase): - def __init__(self, node): - super(ScatterND, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, - 12, 13, 16] # corresponding with 1.7 onnxruntime doc - assert in_type[1] == 7 - assert in_type[2] == in_type[0] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Selu(OpBase): - def __init__(self, node): - super(Selu, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Shrink(OpBase): - def __init__(self, node): - super(Shrink, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 10, 11, 12, - 13, 16] # corresponding with 1.7 onnxruntime doc - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Sigmoid(OpBase): - def __init__(self, node): - super(Sigmoid, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Sign(OpBase): - def __init__(self, node): - super(Sign, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [2, 4, 12, 13, 3, 5, 6, 7, 1, 10, 11, 16] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Sin(OpBase): - def __init__(self, node): - super(Sin, self).__init__(node) - - def infer_type(self, 
in_type): - tensor_type_map = {} - assert in_type[0] in [1, 11] # corresponding with 1.7 onnxruntime doc - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Sinh(OpBase): - def __init__(self, node): - super(Sinh, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Size(OpBase): - def __init__(self, node): - super(Size, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13] - tensor_type_map[self.node.output[0]] = find_key('int64') - return tensor_type_map - - -class Slice(OpBase): - def __init__(self, node): - super(Slice, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, - 12, 13, 16] # corresponding with 1.7 onnxruntime doc - assert in_type[1] in [1, 6, 7] - # corresponding with 1.7 onnxruntime doc - assert in_type[2] in [1, 6, 7] - if len(in_type) > 3: - assert in_type[3] in [1, 6, 7] - if len(in_type) == 5: - assert in_type[4] in [1, 6, 7] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Softmax(OpBase): - def __init__(self, node): - super(Softmax, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - -class SoftmaxInt(OpBase): - def __init__(self, node): - super(SoftmaxInt, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [3, 5, 6] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - -class LogSoftmaxInt(OpBase): - def 
__init__(self, node): - super(LogSoftmaxInt, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [3, 5, 6] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - -class SoftmaxCrossEntropyLoss(OpBase): - def __init__(self, node): - super(SoftmaxCrossEntropyLoss, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 10, 11, 16] - assert in_type[1] in [6, 7] - if len(in_type) == 3: - assert in_type[2] in [1, 10, 11, 16] - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Softplus(OpBase): - def __init__(self, node): - super(Softplus, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Softsign(OpBase): - def __init__(self, node): - super(Softsign, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class SpaceToDepth(OpBase): - def __init__(self, node): - super(SpaceToDepth, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Split(OpBase): - def __init__(self, node): - super(Split, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, - 12, 13, 16] # corresponding with 1.7 onnxruntime doc - if len(in_type) == 2: - # corresponding with 1.7 onnxruntime doc - assert in_type[1] in [1, 2, 3, 4, 5, - 6, 7, 8, 9, 10, 11, 12, 13, 16] - for output in 
self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Sqrt(OpBase): - def __init__(self, node): - super(Sqrt, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Squeeze(OpBase): - def __init__(self, node): - super(Squeeze, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, - 12, 13, 16] # corresponding with 1.7 onnxruntime doc - if len(in_type) == 2: - assert in_type[1] == 7 - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class StringNormalizer(OpBase): - def __init__(self, node): - super(StringNormalizer, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 8 - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Sum(OpBase): - def __init__(self, node): - super(Sum, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 6, 7, 10, 11, 12, 13] - for input in in_type: - assert input == in_type[0] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Tan(OpBase): - def __init__(self, node): - super(Tan, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 # corresponding with 1.7 onnxruntime doc - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Tanh(OpBase): - def __init__(self, node): - super(Tanh, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - return 
tensor_type_map - - -class TfIdfVectorizer(OpBase): - def __init__(self, node): - super(TfIdfVectorizer, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [6, 7, 8] - tensor_type_map[self.node.output[0]] = find_key('float32') - return tensor_type_map - - -class ThresholdedRelu(OpBase): - def __init__(self, node): - super(ThresholdedRelu, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Tile(OpBase): - def __init__(self, node): - super(Tile, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 9, 10, 11, - 12, 13] # corresponding with 1.7 onnxruntime doc - if len(in_type) == 2: - assert in_type[1] == 7 - if len(in_type) == 3: - assert in_type[1] in [1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13] - assert in_type[2] in [1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class TopK(OpBase): - def __init__(self, node): - super(TopK, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 16] - if len(in_type) == 2: - assert in_type[1] == 7 - tensor_type_map[self.node.output[0]] = in_type[0] - # corresponding with 1.7 onnxruntime doc - tensor_type_map[self.node.output[1]] = 7 - return tensor_type_map - - -class Transpose(OpBase): - def __init__(self, node): - super(Transpose, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, - 12, 13, 16] # corresponding with 1.7 onnxruntime doc - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - 
-class Unique(OpBase): - def __init__(self, node): - super(Unique, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, - 12, 13, 16] # corresponding with 1.7 onnxruntime doc - tensor_type_map[self.node.output[0]] = in_type[0] - if len(self.node.output) >= 2: - tensor_type_map[self.node.output[1]] = find_key('int64') - if len(self.node.output) >= 3: - tensor_type_map[self.node.output[2]] = find_key('int64') - if len(self.node.output) >= 4: - tensor_type_map[self.node.output[3]] = find_key('int64') - return tensor_type_map - - -class Unsqueeze(OpBase): - def __init__(self, node): - super(Unsqueeze, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, - 12, 13, 16] # corresponding with 1.7 onnxruntime doc - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Where(OpBase): - def __init__(self, node): - super(Where, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 9 - # corresponding with 1.7 onnxruntime doc - assert in_type[1] in [1, 2, 6, 7, 8] - assert in_type[2] == in_type[1] - for output in self.node.output: - tensor_type_map[output] = in_type[1] - return tensor_type_map - - -class Xor(OpBase): - def __init__(self, node): - super(Xor, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 9 - assert in_type[1] == 9 - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Affine(OpBase): - def __init__(self, node): - super(Affine, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class CDist(OpBase): - def __init__(self, node): - super(CDist, 
self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 11] - assert in_type[1] in [1, 11] # corresponding with 1.7 onnxruntime doc - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class ComplexMul(OpBase): - def __init__(self, node): - super(ComplexMul, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 10] - assert in_type[1] in [1, 10] # corresponding with 1.7 onnxruntime doc - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class ComplexMulConj(OpBase): - def __init__(self, node): - super(ComplexMulConj, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 10] - assert in_type[1] in [1, 10] # corresponding with 1.7 onnxruntime doc - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class ConvTransposeWithDynamicPads(OpBase): - def __init__(self, node): - super(ConvTransposeWithDynamicPads, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 - assert in_type[1] == 1 - assert in_type[2] == 7 - assert in_type[3] == 1 - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Crop(OpBase): - def __init__(self, node): - super(Crop, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class DynamicQuantizeMatMul(OpBase): - def __init__(self, node): - super(DynamicQuantizeMatMul, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 - assert in_type[1] in [2, 3] - assert in_type[2] == 1 - assert in_type[3] == in_type[1] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class 
DynamicSlice(OpBase): - def __init__(self, node): - super(DynamicSlice, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [2, 4, 12, 13, 3, 5, 6, 7, 1, 10, 11, 9, 16] - assert in_type[1] in [6, 7] - assert in_type[2] == in_type[1] - assert in_type[3] == in_type[1] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class ExpandDims(OpBase): - def __init__(self, node): - super(ExpandDims, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [2, 4, 12, 13, 3, 5, 6, 7, 1, 10, 11, 8, 9, 16] - assert in_type[1] == 6 - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class FastGelu(OpBase): - def __init__(self, node): - super(FastGelu, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 10] # corresponding with 1.7 onnxruntime doc - assert in_type[1] in [1, 10] # corresponding with 1.7 onnxruntime doc - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class WindowAttentionChunk(OpBase): - def __init__(self, node): - super(WindowAttentionChunk, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 10] - assert in_type[1] in [1, 10] - assert in_type[2] in [1, 10] - for out in self.node.output: - tensor_type_map[out] = in_type[0] - - return tensor_type_map - - -class FeatureVectorizer(OpBase): - def __init__(self, node): - super(FeatureVectorizer, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 6, 7, 11] - tensor_type_map[self.node.output[0]] = list( - onnx_dtype.keys())[list(onnx_dtype.values()).index('float32')] - return tensor_type_map - - -class FusedConv(OpBase): - def __init__(self, node): - super(FusedConv, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 - 
assert in_type[1] == 1 - assert in_type[2] == 1 - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class FusedGemm(OpBase): - def __init__(self, node): - super(FusedGemm, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 - assert in_type[1] == 1 - assert in_type[2] == 1 - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Gelu(OpBase): - def __init__(self, node): - super(Gelu, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class ImageScaler(OpBase): - def __init__(self, node): - super(ImageScaler, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Imputer(OpBase): - def __init__(self, node): - super(Imputer, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 7] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Inverse(OpBase): - def __init__(self, node): - super(Inverse, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Irfft(OpBase): - def __init__(self, node): - super(Irfft, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 10, 11] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class LinearClassifier(OpBase): - def __init__(self, node): - super(LinearClassifier, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = 
{} - assert in_type[0] in [1, 6, 7, 11] - assert in_type[1] in [7, 8] - tensor_type_map[self.node.output[0]] = list( - onnx_dtype.keys())[list(onnx_dtype.values()).index('float32')] - return tensor_type_map - - -class LinearRegressor(OpBase): - def __init__(self, node): - super(LinearRegressor, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class MatMulInteger16(OpBase): - def __init__(self, node): - super(MatMulInteger16, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 5 - assert in_type[1] == 5 - tensor_type_map[self.node.output[0]] = list( - onnx_dtype.keys())[list(onnx_dtype.values()).index('int32')] - return tensor_type_map - - -class MaxpoolWithMask(OpBase): - def __init__(self, node): - super(MaxpoolWithMask, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 - assert in_type[1] == 6 - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class Normalizer(OpBase): - def __init__(self, node): - super(Normalizer, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 6, 7, 11] - tensor_type_map[self.node.output[0]] = list( - onnx_dtype.keys())[list(onnx_dtype.values()).index('float32')] - return tensor_type_map - - -class LayerNormalization(OpBase): - def __init__(self, node): - super(LayerNormalization, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - # corresponding with 1.7 onnxruntime doc - assert in_type[0] in [1, 10, 11] - assert in_type[1] == in_type[0] - assert in_type[2] == in_type[0] - tensor_type_map[self.node.output[0]] = in_type[0] - # corresponding with 1.7 onnxruntime doc - tensor_type_map[self.node.output[1]] = 1 - # corresponding with 1.7 onnxruntime doc - tensor_type_map[self.node.output[2]] = 
1 - return tensor_type_map - - -class MaskedLayerNorm(OpBase): - def __init__(self, node): - super(MaskedLayerNorm, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 11, 10, 16] - assert in_type[1] == in_type[0] - assert in_type[2] == in_type[0] - assert in_type[3] == in_type[0] - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class ATen(OpBase): - def __init__(self, node): - super(ATen, self).__init__(node) - # print(node.attribute) - - def infer_type(self, in_type): - tensor_type_map = {} - op_type = "ATen" - for attr in self.node.attribute: - if attr.name == "operator": - if attr.s == b'var': # jgtian aten-var export.07.08 - assert in_type[0] != 3 - elif attr.s == b'layer_norm': - assert in_type[0] != 3 - elif attr.s == b'index': # support int8 quant input - assert in_type[0] in [1, 2, 3, 4, 5, 6, - 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] - else: - assert in_type[0] in [1, 2, 4, 5, 6, 7, 8, 9, 10, - 11, 12, 13, 14, 15, 16] # remove int8 support - op_type = attr.s - break - tensor_type_map[self.node.output[0]] = in_type[0] - print('ATen({}): The reasoning may be incorrect'.format(op_type.decode())) - return tensor_type_map - - -class HistoryPadding(OpBase): - def __init__(self, node): - super(HistoryPadding, self).__init__(node) - # print(node.attribute) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, - 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] - tensor_type_map[self.node.output[0]] = in_type[0] - tensor_type_map[self.node.output[1]] = in_type[0] - return tensor_type_map - - -class StreamPadding(OpBase): - def __init__(self, node): - super(StreamPadding, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 2, 3, 4, 5, 6, 7] - assert in_type[1] in [1, 2, 3, 4, 5, 6, 7] - assert in_type[2] in [6] - assert in_type[0] == in_type[1] - - tensor_type_map[self.node.output[0]] = in_type[0] 
- tensor_type_map[self.node.output[1]] = in_type[0] - return tensor_type_map - - -class AdaptiveAvgPool2d(OpBase): - def __init__(self, node): - super(AdaptiveAvgPool2d, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 1 - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class ToUint8(OpBase): - def __init__(self, node): - super(ToUint8, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 3 - tensor_type_map[self.node.output[0]] = 2 - return tensor_type_map - - -class RNNJoin(OpBase): - def __init__(self, node): - super(RNNJoin, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] in [1, 3] - assert in_type[1] == 7 - assert in_type[2] == 7 - assert in_type[3] == 7 - tensor_type_map[self.node.output[0]] = in_type[0] - return tensor_type_map - - -class BitwiseOP(OpBase): - def __init__(self, node): - super(BitwiseOP, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class Triu(OpBase): - def __init__(self, node): - super(Triu, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - for output in self.node.output: - tensor_type_map[output] = in_type[0] - return tensor_type_map - - -class OnnxInferCalcShape(OpBase): - def __init__(self, node): - super(OnnxInferCalcShape, self).__init__(node) - - def infer_type(self, in_type): - assert in_type[0] == 7 - assert in_type[1] == 7 - assert in_type[2] == 7 - assert in_type[3] == 7 - - tensor_type_map = {} - for output in self.node.output: - tensor_type_map[output] = 7 - return tensor_type_map - - -class BmmInt(OpBase): - def __init__(self, node): - super(BmmInt, self).__init__(node) - - def infer_type(self, in_type): - tensor_type_map = {} - assert in_type[0] == 3 and 
in_type[1] == 3 - count = 0 - out_type = 3 - for attr in self.node.attribute: - if attr.name == "o_bits": - count = count + 1 - if attr.i == 8: - out_type = 3 - if attr.i == 32: - out_type = 6 - for output in self.node.output: - if count == 0: - tensor_type_map[self.node.output[0]] = find_key('float32') - else: - tensor_type_map[output] = out_type - return tensor_type_map - - -op_map = {'Cast': Cast, - 'Shape': Shape, - 'Constant': Constant, - 'Quant': Quant, - 'OnnxInferQuant': Quant, - 'Dequant': Dequant, - 'TransposeMatMul': TransposeMatMul, - 'ScaledTanh': ScaledTanh, - 'Scaler': Scaler, - 'Scale': Scale, - 'SampleOp': SampleOp, - 'Rfft': Rfft, - 'iqAdd': IQAdd, - 'iqMul': IQMul, - 'iqDiv': IQDiv, - 'iqSum': IQSum, - 'iqCat': IQCat, - 'iqClamp': IQClamp, - 'iqSigmoid': IQSigmoid, - 'iqSigmoid_Is8_Os8': IQSigmoid_Is8_Os8, - 'iqTanh': IQTanh, - 'ReLU': ReLU, - 'AvgPool2dInt': AvgPool2dInt, - 'Conv2dInt': Conv2dInt, - 'ConvTranspose2dInt': ConvTranspose2dInt, - 'BatchNorm2dInt': BatchNorm2dInt, - 'LayerNormInt': LayerNormInt, - 'LinearInt': LinearInt, - 'LSTMInt': LSTMInt, - 'GRUInt': GRUInt, - 'Abs': Abs, - 'Acos': Acos, - 'Acosh': Acosh, - 'Add': Add, - 'Affine': Affine, - 'And': And, - 'ArgMax': ArgMax, - 'ArgMin': ArgMin, - 'Asin': Asin, - 'Asinh': Asinh, - 'Atan': Atan, - 'Atanh': Atanh, - 'AveragePool': AveragePool, - 'BatchNormalization': BatchNormalization, - 'CDist': CDist, - 'Ceil': Ceil, - 'Clip': Clip, - 'ComplexMul': ComplexMul, - 'ComplexMulConj': ComplexMulConj, - 'Compress': Compress, - 'Concat': Concat, - 'Conv': Conv, - 'ConvTransposeWithDynamicPads': ConvTransposeWithDynamicPads, - 'ConstantOfShape': ConstantOfShape, - 'ConvInteger': ConvInteger, - 'ConvTranspose': ConvTranspose, - 'Cos': Cos, - 'Cosh': Cosh, - 'CumSum': CumSum, - 'Crop': Crop, - 'DynamicQuantizeMatMul': DynamicQuantizeMatMul, - 'DynamicSlice': DynamicSlice, - 'DepthToSpace': DepthToSpace, - 'DequantizeLinear': DequantizeLinear, - 'Det': Det, - 'Div': Div, - 'Dropout': Dropout, 
- 'Einsum': Einsum, - 'Elu': Elu, - 'Equal': Equal, - 'Erf': Erf, - 'Exp': Exp, - 'Expand': Expand, - 'Flatten': Flatten, - 'Floor': Floor, - 'FixLength': FixLength, - 'Identity': Identity, - 'GRU': GRU, - 'If': If, - 'Gather': Gather, - 'MaskedGather': MaskedGather, - 'GatherElements': GatherElements, - 'GatherND': GatherND, - 'Gemm': Gemm, - 'GenExcit': GenExcit, - 'GlobalAveragePool': GlobalAveragePool, - 'GlobalLpPool': GlobalLpPool, - 'GlobalMaxPool': GlobalMaxPool, - 'GlobalAveragePoolMask': GlobalAveragePoolMask, - 'MaskedReduceMean': MaskedReduceMean, - 'InstanceNormalization': InstanceNormalization, - 'IsInf': IsInf, - 'IsNaN': IsNaN, - 'LRN': LRN, - 'Less': Less, - 'Log': Log, - 'LSTM': LSTM, - 'Loop': Loop, - 'LogSoftmax': LogSoftmax, - 'LessOrEqual': LessOrEqual, - 'GreaterOrEqual': GreaterOrEqual, - 'Celu': Celu, - 'MaskToLength': MaskToLength, - 'MeanVarianceNormalization': MeanVarianceNormalization, - 'NegativeLogLikelihoodLoss': NegativeLogLikelihoodLoss, - 'LpNormalization': LpNormalization, - 'LpNormalization': LpNormalization, - 'LpPool': LpPool, - 'MatMul': MatMul, - 'MatMulInteger': MatMulInteger, - 'Max': Max, - 'MaxPool': MaxPool, - 'Mean': Mean, - 'MeanVarianceNormalization': MeanVarianceNormalization, - 'Multinomial': Multinomial, - 'Min': Min, - 'Mod': Mod, - 'Mul': Mul, - 'Neg': Neg, - 'NonMaxSuppression': NonMaxSuppression, - 'NonZero': NonZero, - 'Not': Not, - 'OneHot': OneHot, - 'Or': Or, - 'PRelu': PRelu, - 'Pad': Pad, - 'Pow': Pow, - 'QLinearConv': QLinearConv, - 'QLinearMatMul': QLinearMatMul, - 'QuantizeLinear': QuantizeLinear, - 'RNN': RNN, - 'RandomNormal': RandomNormal, - 'RandomNormalLike': RandomNormalLike, - 'RandomUniform': RandomUniform, - 'RandomUniformLike': RandomUniformLike, - 'Range': Range, - 'Reciprocal': Reciprocal, - 'ReduceL1': ReduceL1, - 'ReduceL2': ReduceL2, - 'ReduceLogSum': ReduceLogSum, - 'ReduceLogSumExp': ReduceLogSumExp, - 'ReduceMax': ReduceMax, - 'ReduceMean': ReduceMean, - 'ReduceMin': ReduceMin, - 
'ReduceProd': ReduceProd, - 'ReduceSum': ReduceSum, - 'ReduceSumSquare': ReduceSumSquare, 'Relu': Relu, 'Reshape': Reshape, - 'Resize': Resize, - 'ReverseSequence': ReverseSequence, - 'Round': Round, - 'Scan': Scan, - 'ScatterElements': ScatterElements, - 'ScatterND': ScatterND, - 'Selu': Selu, - 'Shrink': Shrink, - 'Sigmoid': Sigmoid, - 'Sign': Sign, - 'Sin': Sin, - 'Sinh': Sinh, - 'Size': Size, 'Slice': Slice, - 'Softmax': Softmax, - 'SoftmaxInt': SoftmaxInt, - 'LogSoftmaxInt': LogSoftmaxInt, - 'SoftmaxCrossEntropyLoss': SoftmaxCrossEntropyLoss, - 'Softplus': Softplus, - 'Softsign': Softsign, - 'SpaceToDepth': SpaceToDepth, 'Split': Split, - 'Sqrt': Sqrt, - 'Squeeze': Squeeze, - 'StringNormalizer': StringNormalizer, - 'Sub': Sub, - 'Sum': Sum, - 'Tan': Tan, - 'Tanh': Tanh, - 'TfIdfVectorizer': TfIdfVectorizer, - 'ThresholdedRelu': ThresholdedRelu, - 'Tile': Tile, - 'TopK': TopK, 'Transpose': Transpose, - 'Unique': Unique, - 'Unsqueeze': Unsqueeze, - 'Where': Where, - 'Xor': Xor, - 'ExpandDims': ExpandDims, - 'FastGelu': FastGelu, - 'FeatureVectorizer': FeatureVectorizer, - 'FusedConv': FusedConv, - 'FusedGemm': FusedGemm, - 'Gelu': Gelu, - 'Greater': Greater, - 'ImageScaler': ImageScaler, - 'Imputer': Imputer, - 'Inverse': Inverse, - 'Irfft': Irfft, - 'LinearClassifier': LinearClassifier, - 'LinearRegressor': LinearRegressor, - 'MatMulInteger16': MatMulInteger16, - 'MaxpoolWithMask': MaxpoolWithMask, - 'Normalizer': Normalizer, - 'LayerNormalization': LayerNormalization, - 'MaskedLayerNorm': MaskedLayerNorm, - 'ATen': ATen, - 'HistoryPadding': HistoryPadding, - 'Requant': ReQuant, - 'OnnxInferReQuant': OnnxInferReQuant, - 'ConvTranspose2dInteger': ConvTranspose2dInteger, - 'IdentityInfer': IdentityInfer, - 'ToUint8': ToUint8, - 'RNNJoin': RNNJoin, - 'bitwise_and': BitwiseOP, - 'bitwise_or': BitwiseOP, - 'bitwise_not': BitwiseOP, - 'Triu': Triu, - 'OnnxInferCalcShape': OnnxInferCalcShape, - 'ScopedEnter': ScopedEnter, - 'ScopedLeave': ScopedLeave, - 'Conv1dInt': 
Conv1dInt, - 'LSTMInt_Is8_Is64_If32_If32': LSTMInt_Is8_Is64_If32_If32, - 'LSTMInt_Is8_Is64': LSTMInt_Is8_Is64, - 'GRUInt_Is8_Is64': GRUInt_Is8_Is64, - 'GRUInt_Is8_Is64_If32': GRUInt_Is8_Is64_If32, - 'BmmInt': BmmInt, - 'StreamPadding': StreamPadding, - 'DurationToAlignment': DurationToAlignment, - 'iqMax': iqMax, - 'iqVar': iqVar, - 'ShuffleChannel': ShuffleChannel, 'Others': OpBase } -def make_node(): - input = np.array([[[1., 2.], [3., 4.], [5., 6.]]]).astype(np.float32) - - input_size = 2 - hidden_size = 3 - weight_scale = 0.1 - number_of_gates = 4 - - node = onnx.helper.make_node( - 'LSTM', - inputs=['X', 'W', 'R'], - outputs=['', 'Y'], - hidden_size=hidden_size - ) - shape = LSTM(node) - out_type = shape.infer_type([1, 1, 1]) - - def infer_type(model): nodes = model.graph.node tensor_type_map = {} diff --git a/linger/onnx/scope.py b/linger/onnx/scope.py deleted file mode 100644 index 97662ce..0000000 --- a/linger/onnx/scope.py +++ /dev/null @@ -1,238 +0,0 @@ -from typing import Dict, List, Set - -import torch -import torch.nn - -from ..ops.iqtensor import IQTensor, from_torch_tensor - - -class ScopedEnterFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, t, scope): - m = t.clone() - if isinstance(t, IQTensor): - m = from_torch_tensor(m, t.scale_data, t.bits, t.zero_point) - return m - - @staticmethod - def backward(ctx, gradOutput): - return gradOutput, None - - @staticmethod - def symbolic(g, input, scope): - return g.op("thinker::ScopedEnter", input, scope_s=scope) - - -class ScopedLeaveFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, t, scope): - m = t.clone() - if isinstance(t, IQTensor): - m = from_torch_tensor(m, t.scale_data, t.bits, t.zero_point) - return m - - @staticmethod - def backward(ctx, gradOutput): - return gradOutput, None - - @staticmethod - def symbolic(g, input, scope): - return g.op("thinker::ScopedLeave", input, scope_s=scope) - - -global_scoped = {} - - -def scope_forward(input, scope_name, func): 
- list_r = [] - for t in input: - if isinstance(t, torch.Tensor): - list_r.append(func.apply(t, scope_name)) - elif isinstance(t, dict): - dict_r = {} - for k, v in t.items(): - if isinstance(v, torch.Tensor): - dict_r[k] = func.apply(v, scope_name) - else: - dict_r[k] = v - list_r.append(dict_r) - elif isinstance(t, (tuple, list)): - list_r.append(scope_forward(t, scope_name, func)) - else: - list_r.append(t) - return type(input)(list_r) - - -def enter_forward(module, input): - scope_name = global_scoped[module][0] - if isinstance(input, torch.Tensor): - return ScopedEnterFunction.apply(input, scope_name) - list_r = None - if isinstance(input, (tuple, list)): - list_r = scope_forward(input, scope_name, ScopedEnterFunction) - elif isinstance(input, dict): - dict_r = {} - for k, v in input.items(): - assert isinstance( - v, torch.Tensor), 'dict input only support depth=1' - dict_r[k] = ScopedEnterFunction.apply(v, scope_name) - return dict_r - else: - assert 0, 'foward only support tensor, tuple or dict of tensor' - return type(input)(list_r) - - -def leave_forward(module, input, output): - scope_name = global_scoped[module][0] - if isinstance(output, torch.Tensor): - return ScopedLeaveFunction.apply(output, scope_name) - list_r = None - if isinstance(output, (tuple, list)): - list_r = scope_forward(output, scope_name, ScopedLeaveFunction) - elif isinstance(output, dict): - dict_r = {} - for k, v in input.items(): - if isinstance(v, torch.Tensor): - dict_r[k] = ScopedLeaveFunction.apply(v, scope_name) - else: - dict_r[k] = v - return dict_r - else: - assert 0, 'foward only support tensor or tuple of tensor' - return type(output)(list_r) - - -def build_module_scope(model: torch.nn.Module): - scopes = {} - for name, child in model.named_children(): - hook_handle_foward_pre = child.register_forward_pre_hook(enter_forward) - hook_handle_forward = child.register_forward_hook(leave_forward) - scopes[child] = (name, hook_handle_foward_pre, hook_handle_forward) - 
scopes.update(build_module_scope(child)) - return scopes - - -def build_global_scope(model: torch.nn.Module): - global global_scoped - global_scoped = build_module_scope(model) - hook_handle_foward_pre = model.register_forward_pre_hook( - enter_forward) # not execute before forward - hook_handle_forward = model.register_forward_hook(leave_forward) - global_scoped[model] = (".", hook_handle_foward_pre, hook_handle_forward) - return global_scoped - - -def remove_scoped_node(model, op_type): - graph_output_name = [] - for ii in model.graph.output: - graph_output_name.append(ii.name) - nodes = model.graph.node[::-1] - for i, node in enumerate(nodes): - if node.op_type == op_type: - model.graph.node.remove(node) - if node.output[0] in graph_output_name: - for each in model.graph.node: - for idx in range(len(each.output)): - if each.output[idx] == node.input[0]: - each.output[idx] = node.output[0] - for each in model.graph.node: - for idx in range(len(each.input)): - if each.input[idx] == node.input[0]: - each.input[idx] = node.output[0] - - for each in model.graph.node: - if each.op_type == "If": - for gi in range(len(each.attribute)): - for x_node in each.attribute[gi].g.node: - for idx in range(len(x_node.output)): - if x_node.output[idx] == node.input[0]: - x_node.output[idx] = node.output[0] - for idx in range(len(x_node.input)): - if x_node.input[idx] == node.input[0]: - x_node.input[idx] = node.output[0] - - else: - for each in model.graph.node: - for idx in range(len(each.input)): - if each.input[idx] == node.output[0]: - each.input[idx] = node.input[0] - for each in model.graph.node: - if each.op_type == "If": - for gi in range(len(each.attribute)): - for x_node in each.attribute[gi].g.node: - for idx in range(len(x_node.input)): - if x_node.input[idx] == node.output[0]: - x_node.input[idx] = node.input[0] - return model - - -def build_onnx_scope_info(onnx_model): - - nodes = onnx_model.graph.node - node_son_dict: Dict[int, List[str]] = {} # 保存所有的nodename 
和对应的子节点,深度遍历使用 - node_scope_dict: Dict[int, List[str]] = {} # 遍历后的名字保存 保存节点名字和对应的域名scope - visited_node_set: Set[int] = set() # 保存已经访问的节点 - # 先建立node之间的邻接关系 - nodeoutdict: Dict[int, str] = {} - for node in nodes: - for nodeout in node.output: - nodeoutdict[nodeout] = node - node_son_dict[id(node)] = [] - for node in nodes: - for nodein in node.input: - nodefather = nodeoutdict.get(nodein) - if nodefather is not None: - node_son_dict[id(nodefather)].append(node) - # 根据邻接关系深度遍历所有节点 - nodetree = [] # 深度遍历栈 - # 加入图的所有输入节点 连接在graph.input上的第一个节点 原始只考虑了单一输入 - initializer_names = [] - for x in onnx_model.graph.initializer: - initializer_names.append(x.name) - input_idxs = [] - for idx, y in enumerate(onnx_model.graph.input): - if y.name not in initializer_names: - input_idxs.append(y.name) - for node in onnx_model.graph.node[::-1]: - for node_inp in node.input[::-1]: - if node_inp in input_idxs: - nodetree.append(node) - visited_node_set.add(id(node)) - continue - while len(nodetree) > 0: - node = nodetree.pop() # 获得栈中的节点 - node_sons = node_son_dict[id(node)] # 获得当前节点的所有子节点 - if id(node) not in node_scope_dict: - node_scope_dict[id(node)] = [] - for node_son in node_sons: - if id(node_son) in visited_node_set: - continue - scopelist = node_scope_dict[id(node)].copy() # 每个子节点都要记录一份当前的域 - if node.op_type == "ScopedEnter": - scope = node.attribute[0].s.decode('utf-8') - scopelist.append(scope) - elif node.op_type == "ScopedLeave": - scope = node.attribute[0].s.decode('utf-8') - scopelist.pop() - node_scope_dict[id(node_son)] = scopelist - nodetree.append(node_son) # 子节点入栈 - visited_node_set.add(id(node_son)) # 访问过子节点加入已访问 - # 遍历子节点结束后,保存修改的name - if node.op_type != "ScopedEnter" and node.op_type != "ScopedLeave": - newnamelist = node_scope_dict[id(node)] - if len(newnamelist) > 1 and newnamelist[0] != '.': - node.name = "." + ".".join(newnamelist) + "." + node.name - elif len(newnamelist) > 1 and newnamelist[0] == '.': - node.name = "." + ".".join(newnamelist[1:]) + "." 
+ node.name - elif len(newnamelist) == 1 and newnamelist[0] != '.': - node.name = "." + newnamelist[0] + "." + node.name - elif len(newnamelist) == 1 and newnamelist[0] == '.': - node.name = "." + node.name - # 删除多余节点 - onnx_model = remove_scoped_node(onnx_model, "ScopedEnter") - onnx_model = remove_scoped_node(onnx_model, "ScopedLeave") - - return onnx_model - - -__all__ = ['build_global_scope', 'build_onnx_scope_info'] diff --git a/linger/onnx/update_dequant.py b/linger/onnx/update_dequant.py index e9dd2a9..eeb64f4 100644 --- a/linger/onnx/update_dequant.py +++ b/linger/onnx/update_dequant.py @@ -42,7 +42,7 @@ def insert_op_before(model, node_input, node_output, target_node_index, input_i, 'Dequant', inputs=[node_input], outputs=["Dequant_"+str(node_input)+"_"+str(node_output)], - domain="thinker", + domain="linger", **kwargs['attr_dict'] ) @@ -90,7 +90,7 @@ def insert_op_after(model, node_input, node_output, target_node_index, output_i, 'Dequant', inputs=["Dequant_"+str(node_input)+"_"+str(node_output)], outputs=[node_input], - domain="thinker", + domain="linger", **kwargs['attr_dict'] ) @@ -259,7 +259,7 @@ def infer_type_linger(model, is_change_in_out_type): if node.op_type == "Slice" or node.op_type == "Split": for quant_node in nodes: for iii in range(len(node.output)): - if quant_node.input == [node.output[iii]] and quant_node.op_type == "OnnxInferQuant": + if quant_node.input == [node.output[iii]]: if in_type[0] == 3: in_type[0] = 0 else: diff --git a/linger/ops/__init__.py b/linger/ops/__init__.py deleted file mode 100644 index a3d7344..0000000 --- a/linger/ops/__init__.py +++ /dev/null @@ -1,21 +0,0 @@ -from .avgpool2d_int import AvgPool2dInt -from .batchnorm_int import BatchNormInt -from .layernorm_int import LayerNormInt -from .bmm_int import BmmInt -from .conv1d_int import Conv1dInt -from .conv_int import Conv2dInt -from .convtranspose_int import ConvTranspose2dInt -from .embedding_int import EmbeddingInt -from .gru_int import GRUInt -from .iqtensor 
import * -from .linear_int import LinearInt -from .linger_functional import * -from .lstm_int import LSTMInt -from .module_self import * -from .ops import ModuleIntConfig -from .ops_configs import (DefaultQuantIntXOP, SupportQuantedIntModules, - SupportQuantTorchModules) -from .ops_names import * -from .relu6_int import ReLU6Int -from .requant import Requant -from .scaledround_int import ScaledRoundLayer diff --git a/linger/ops/avgpool2d_int.py b/linger/ops/avgpool2d_int.py deleted file mode 100644 index 1e9eb7b..0000000 --- a/linger/ops/avgpool2d_int.py +++ /dev/null @@ -1,229 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import math - -import torch -import torch.nn as nn -import torch.nn.functional as F - -from ..config import config -from ..quant import Quant -from ..utils import Dump, PlatFormQuant, QuantMode, ScalerBuffer -from .iqtensor import (IQTensor, from_torch_tensor, platform_to_string, - quantlinear) -from .ops import ModuleIntConfig -from .requant import Requant - - -class AvgPool2dIntFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, kernel_size, stride, padding, ceil_mode, count_include_pad, divisor_override, - data_bits, training, momentum, running_x, running_o, eval_scale_x, eval_scale_o, prefix, dump, path, mode, o_bits, quant, - is_not_from_iqtensor): - # venus limits - assert data_bits in ( - 4, 8), f"in AvgPool2d op, AvgPool2d data_bits only support 4/8 bits, but you have data_bits {data_bits}" - assert o_bits in ( - 4, 8, 16), f"in AvgPool2d op, AvgPool2d o_bits only support 4/8/16 bits, but you have o_bits {o_bits}" - - scale_x = None - scale_o = None - if training: - ctx.save_for_backward(input) - ctx.params = [kernel_size, stride, padding, - ceil_mode, count_include_pad, divisor_override] - zero_point = input.zero_point if isinstance(input, IQTensor) else 0 - is_iq_tensor = True if isinstance(input, IQTensor) else False - # ctx.bits = data_bits - if isinstance(input, IQTensor): - q_input, _, 
max_value_x = quant.quant( - input.data, data_bits, eval_scale_x, mode=QuantMode.QValue, quant_data='input', - iq_zero_point=input.zero_point) - scale_x = eval_scale_x - else: - q_input, scale_x, max_value_x = quant.quant( - input.data, data_bits, mode=mode, quant_data='input') - scale_x = ScalerBuffer(scale_x) - running_x.mul_(1-momentum).add_(momentum*max_value_x) - q_input = q_input.float() if data_bits <= 8 else q_input.double() - q_outputs_float = F.avg_pool2d(q_input.contiguous(), kernel_size, stride=stride, padding=padding, ceil_mode=ceil_mode, - count_include_pad=count_include_pad, divisor_override=divisor_override) - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - if isinstance(kernel_size, tuple): - kernel_area = kernel_size[0] * kernel_size[1] - else: - kernel_area = kernel_size * kernel_size - q_outputs = ( - (q_outputs_float * kernel_area).round() / kernel_area + 0.5).floor() - bound_value = math.pow(2, data_bits-1) - 1 - q_outputs.clamp_(-bound_value-1, bound_value) - else: - assert False, "linger only support luna quant." 
- outputs = quant.dequant(q_outputs, scale_x) - if o_bits is not None: - running_o.fill_(running_x()) - scale_o = scale_x - ctx.value = scale_x, data_bits, zero_point, is_iq_tensor - else: - assert running_x > 0, 'invalid running_x <= 0, please fintune first' - scale_x = None - scale_o = None - scale_x = quant.running_to_scale(running_x, data_bits, mode=mode) - scale_x = ScalerBuffer(scale_x) - if isinstance(input, IQTensor): - scale_x = eval_scale_x - q_input, _, _ = quant.quant( - input.data, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input', iq_zero_point=input.zero_point) - else: - q_input, _, _ = quant.quant( - input.data, data_bits, scale_x, mode=mode, quant_data='input') - q_input = q_input.double() - q_outputs = F.avg_pool2d(q_input.contiguous(), kernel_size, stride=stride, padding=padding, ceil_mode=ceil_mode, - count_include_pad=count_include_pad, divisor_override=divisor_override) - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - if isinstance(kernel_size, tuple): - kernel_area = kernel_size[0] * kernel_size[1] - else: - kernel_area = kernel_size * kernel_size - q_outputs = ((q_outputs * kernel_area).round() / - kernel_area + 0.5).floor() - bound_value = math.pow(2, data_bits-1) - 1 - q_outputs.clamp_(-bound_value-1, bound_value) - else: - assert False, "linger only support luna quant." 
- outputs = quant.dequant(q_outputs, scale_x) - scale_o = eval_scale_o - if o_bits is not None: - scale_o.fill_(scale_x()) - if dump: - name_list = ["input", "outputs", "q_input", "q_outputs", - "scale_x", "scale_o", "running_x", "running_o"] - attr_list = [input, outputs, q_input, q_outputs, - scale_x.data, scale_o.data, running_x.data, running_o.data] - Dump.dump_file(prefix, ".AvgPool2dInt.", - zip(name_list, attr_list), path) - - if o_bits is None: - return outputs - elif isinstance(scale_o, float): - return from_torch_tensor(outputs, scale_o, o_bits) - elif isinstance(scale_o, torch.Tensor): - return from_torch_tensor(outputs, scale_o.item(), o_bits) - else: - return from_torch_tensor(outputs, scale_o.data, o_bits) - - @staticmethod - def backward(ctx, gradoutput): - input, = ctx.saved_tensors - scale_x, data_bits, zero_point, is_iq_tensor = ctx.value - kernel_size, stride, padding, ceil_mode, count_include_pad, divisor_override = ctx.params - if is_iq_tensor: - backward_input = input.data - else: - q_input, _, _ = Quant.quant( - input.data, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input') - backward_input = Quant.dequant(q_input, scale_x) - backward_input = backward_input.detach().clone().requires_grad_(True) - grad = None - with torch.enable_grad(): - output = F.avg_pool2d(backward_input, kernel_size, stride, - padding, ceil_mode, count_include_pad, divisor_override) - grad = torch.autograd.grad(output, backward_input, gradoutput) - return grad[0], None, None, None, None, None, None, \ - None, None, None, None, None, None, None, None, None, None, None, None, None, None, None - - @staticmethod - def symbolic(g, input, kernel_size, stride, padding, ceil_mode, count_include_pad, divisor_override, - data_bits, training, momentum, running_x, running_o, scale_x, scale_o, prefix, dump, path, mode, o_bits, quant, is_not_from_iqtensor): - op_inner = None - if is_not_from_iqtensor: - platform_quant = platform_to_string( - 
config.PlatFormQuant.platform_quant) - op_inner = quantlinear(g, input, scale_x(), - platform_quant, data_bits) - if isinstance(padding, int): - padding = [padding]*2 - if isinstance(kernel_size, int): - kernel_size = [kernel_size]*2 - if isinstance(stride, int): - stride = [stride]*2 - paddings = padding + padding - param_dict = {'kernel_shape_i': kernel_size, 'strides_i': stride, 'pads_i': paddings, 'ceil_mode_i': ceil_mode, 'data_bits_i': data_bits, - 'scale_x_f': scale_x()} - if is_not_from_iqtensor: - input_list = [op_inner, ] - else: - input_list = [input, ] - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - param_dict['platform_quant_s'] = platform_quant - if o_bits is not None: - param_dict['scale_o_f'] = scale_o() - param_dict['o_bits_i'] = o_bits - return g.op("thinker::AvgPool2dInt", *input_list, **param_dict) - - -class AvgPool2dInt(nn.AvgPool2d, ModuleIntConfig): - r"""实现AvgPool2dInt的量化训练与测试,继承自nn.AvgPool2d, - Args: - kernel_size stride padding, ceil_mode count_include_pad divisor_override - 与nn.AvgPool2d一致 - data_bits(int, default=8): 输入的量化位数 - mode(Enum): 量化方式,支持MaxValue与Qvalue - o_bits(int, default=None):输出量化位数 - Notes: - 支持luna_quant - luna_quant: x = (x + 0.5).floor() - - """ - - def __init__(self, kernel_size, stride, padding=0, ceil_mode=False, - count_include_pad=True, divisor_override=None, - data_bits=8, mode=QuantMode.QValue, o_bits=None): - nn.AvgPool2d.__init__(self, kernel_size, stride, padding, - ceil_mode, count_include_pad, divisor_override) - ModuleIntConfig.__init__( - self, data_bits=data_bits, mode=mode, o_bits=o_bits) - self.momentum = 0.1 - self.is_not_from_iqtensor = True - self.prefix = "" - self.dump = False - self.path = "" - self.register_buffer('running_x', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - self.register_buffer('scale_x', torch.zeros(1)) - self.register_buffer('scale_o', torch.zeros(1)) - - def forward(self, input): - scale_x = ScalerBuffer(self.scale_x) - 
running_x = ScalerBuffer(self.running_x) - assert (self.o_bits is None or self.o_bits == - self.data_bits), 'AvgPool2dInt out_bits must equal data_bits' - if isinstance(input, IQTensor): - self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits) - scale_x = ScalerBuffer(input.scale_data) - running_x = ScalerBuffer(input.running_data) - running_o = ScalerBuffer(self.running_o) - scale_o = ScalerBuffer(self.scale_o) - output = AvgPool2dIntFunction.apply(input.contiguous(), self.kernel_size, - self.stride, self.padding, self.ceil_mode, - self.count_include_pad, self.divisor_override, - self.data_bits, self.training, self.momentum, - running_x, running_o, scale_x, scale_o, self.prefix, self.dump, self.path, - self.quant_mode, self.o_bits, self.quant, self.is_not_from_iqtensor) - self.running_x.fill_(running_x()) - self.running_o.fill_(running_o()) - self.scale_x.fill_(scale_x()) - self.scale_o.fill_(scale_o()) - return output - - def state_dict(self, destination=None, prefix='', keep_vars=False): - return ModuleIntConfig.state_dict_global(self, destination, prefix, keep_vars) - - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) diff --git a/linger/ops/batchnorm_int.py b/linger/ops/batchnorm_int.py deleted file mode 100644 index 0d2f051..0000000 --- a/linger/ops/batchnorm_int.py +++ /dev/null @@ -1,459 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import itertools -from collections import OrderedDict - -import torch -import torch.nn as nn -from torch.onnx import is_in_onnx_export - -from ..config import config -from ..quant import (Quant, normalize_bias_with_config, - normalize_data_with_config, normalize_weight_with_config) -from ..utils import Dump, 
PlatFormQuant, QuantMode, ScalerBuffer -from .iqtensor import (IQTensor, from_torch_tensor, platform_to_string, - quantlinear) -from .ops import ModuleIntConfig -from .requant import Requant - - -class BatchNormFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, running_mean, running_var, alpha, beta, - training, exponential_average_factor, momentum, eps, - running_x, running_w, running_o, - eval_scale_x, eval_scale_w, eval_scale_o, - data_bits, parameter_bits, prefix, dump, path, mode, o_bits, quant, - is_not_from_iqtensor, ahead_relu, clamp_data, clamp_weight, clamp_bias): - scale_x = None - scale_w = None - scale_o = None - if training: - ctx.clamp_data = clamp_data - ctx.clamp_weight = clamp_weight - ctx.clamp_bias = clamp_bias - ctx.bits = data_bits, parameter_bits, o_bits - zero_point = input.zero_point if isinstance(input, IQTensor) else 0 - is_iq_tensor = True if isinstance(input, IQTensor) else False - ctx.value = zero_point, is_iq_tensor - saved_tensors = [input, alpha, beta] - - q_alpha, scale_w, max_value_w = quant.quant( - alpha, parameter_bits, mode=mode, quant_data='weight') - - scale_w = ScalerBuffer(scale_w) - running_w.mul_(1-momentum).add_(momentum*max_value_w) - ctx.eps = eps - # mul-add mul - if isinstance(input, IQTensor): - q_input, _, max_value_x = quant.quant( - input.data, data_bits, eval_scale_x, mode=QuantMode.QValue, quant_data='input', - iq_zero_point=input.zero_point) - scale_x = eval_scale_x - else: - q_input, scale_x, max_value_x = quant.quant( - input.data, data_bits, mode=mode, quant_data='input') - scale_x = ScalerBuffer(scale_x) - running_x.mul_(1-momentum).add_(momentum*max_value_x) - q_input = q_input.float() if data_bits + parameter_bits <= 16 else q_input.double() - q_alpha = q_alpha.float() if data_bits + parameter_bits <= 16 else q_alpha.double() - # q_output = None - q_outputs = q_input * q_alpha - # mul-add add - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - q_beta = 
(beta * scale_w * scale_x + 0.5).floor() - if data_bits + parameter_bits <= 16: - q_beta = q_beta.float().round() - else: - q_beta = q_beta.double() - q_outputs = q_outputs + q_beta - outputs = quant.dequant(q_outputs, scale_x*scale_w) - else: - assert False, "linger only support luna quant." - out_tensor = outputs - scale_o = None - if o_bits is not None: - outputs = normalize_data_with_config(outputs, clamp_data) - out_tensor = outputs - q_outputs, scale_o, max_value_o = quant.quant( - outputs, o_bits, mode=mode, quant_data='output', ahead_relu=ahead_relu) - scale_o = ScalerBuffer(scale_o) - running_o.mul_(1-momentum).add_(momentum*max_value_o) - outputs = quant.dequant(q_outputs, scale_o) - saved_tensors += [out_tensor] - ctx.scale = scale_x, scale_w, scale_o - ctx.save_for_backward(*saved_tensors) - - else: - assert running_x > 0, 'invalid running_x <= 0, please finetune training before eval' - # assert config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,), 'linger only support luna quant.' - if alpha.dtype == torch.float32: - scale_x = ScalerBuffer(quant.running_to_scale( - running_x, data_bits, mode=mode)) - if o_bits is not None: - scale_o = ScalerBuffer(quant.running_to_scale( - running_o, o_bits, mode=mode)) - q_alpha, scale_w, _ = quant.quant( - alpha, parameter_bits, mode=mode, quant_data='weight') - scale_w = ScalerBuffer(scale_w) - q_beta = None - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - q_beta = (beta * scale_w * scale_x + 0.5).floor() - if data_bits + parameter_bits <= 16: - q_beta = q_beta.float().round().double() - else: - assert False, 'linger only support luna quant.' 
- - else: - scale_x = eval_scale_x - scale_w = eval_scale_w - scale_o = eval_scale_o - q_alpha = alpha.view(1, -1, 1, 1) - q_beta = beta.view(1, -1, 1, 1).double() - - if isinstance(input, IQTensor): - scale_x = eval_scale_x - q_input, _, _ = quant.quant( - input.data, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input', iq_zero_point=input.zero_point) - else: - q_input, _, _ = quant.quant( - input.data, data_bits, scale_x, mode=mode, quant_data='input') - - q_input = q_input.float() if data_bits + parameter_bits <= 16 else q_input.double() - q_alpha = q_alpha.float() if data_bits + parameter_bits <= 16 else q_alpha.double() - q_outputs = q_input * q_alpha + q_beta - outputs = quant.dequant(q_outputs, scale_x*scale_w) - if o_bits is not None: - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - q_outputs, _, _ = quant.quant( - outputs, o_bits, scale_o, mode=mode, quant_data='output') - else: - assert False, "linger only support luna quant." - outputs = quant.dequant(q_outputs, scale_o) - - if dump: - name_list = ["input", "running_mean", "running_var", "q_alpha", "q_input", "q_beta", "q_outputs", - "scale_x", "scle_w", "scale_o"] - attr_list = [input, running_mean, running_var, q_alpha, q_input, q_beta, q_outputs, - scale_x.data, scale_w.data, scale_o.data] - Dump.dump_file(prefix, ".BatchNormInt.", - zip(name_list, attr_list), path) - - if isinstance(scale_o, float): - return from_torch_tensor(outputs, scale_o, o_bits) - elif isinstance(scale_o, torch.Tensor): - return from_torch_tensor(outputs, scale_o.item(), o_bits) - else: - return from_torch_tensor(outputs, scale_o.data, o_bits) - - @staticmethod - def backward(ctx, gradOutput): - input, alpha, beta, outputs = ctx.saved_tensors - clamp_data = ctx.clamp_data - data_bits, parameter_bits, o_bits = ctx.bits - scale_x, scale_w, scale_o = ctx.scale - zero_point, is_iq_tensor = ctx.value - - if is_iq_tensor: - f_input = input.data - else: - q_input, _, _ = Quant.quant( - input.data, 
data_bits, scale_x, mode=QuantMode.QValue, quant_data='input') - f_input = Quant.dequant(q_input, scale_x) - f_input = f_input.detach().clone().requires_grad_(True) - q_alpha, _, _ = Quant.quant( - alpha.data, parameter_bits, scale_w, mode=QuantMode.QValue, quant_data='weight') - f_alpha = Quant.dequant(q_alpha, scale_w) - f_alpha = f_alpha.detach().clone().requires_grad_(True) - with torch.enable_grad(): - z = f_alpha * f_input + beta - z = normalize_data_with_config(z, clamp_data) - gradInput, gradAlpha, gradBeta = torch.autograd.grad( - z, (f_input, f_alpha, beta), gradOutput) - - return gradInput, None, None, gradAlpha, gradBeta, \ - None, None, None, None, \ - None, None, None, None, \ - None, None, None, \ - None, None, None, \ - None, None, None, None, None, None, None, None, \ - None, None - - @staticmethod - def symbolic(g, input, running_mean, running_var, weights, bias, - training, exponential_average_factor, momentum, eps, - running_x, running_w, running_o, - scale_x, scale_w, scale_o, - data_bits, parameter_bits, prefix, dump, path, mode, o_bits, quant, - is_not_from_iqtensor, ahead_relu, clamp_data, clamp_weight, clamp_bias): - op_inner = None - if is_not_from_iqtensor: - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - op_inner = quantlinear( - g, input, scale_x(), platform_quant, data_bits) - # if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - param_dict = {'data_bits_i': data_bits, 'parameter_bits_i': parameter_bits, 'o_bits_i': o_bits, - 'scale_x_f': scale_x(), 'scale_w_f': scale_w(), 'scale_o_f': scale_o()} - - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - param_dict['platform_quant_s'] = platform_quant - if is_not_from_iqtensor: - input_list = [op_inner, weights, bias, running_mean, running_var] - else: - input_list = [input, weights, bias, running_mean, running_var] - return g.op("thinker::BatchNorm2dInt", *input_list, **param_dict) - - -class 
BatchNormInt(nn.BatchNorm2d, ModuleIntConfig): - r"""实现BatchNormInt的量化训练与测试,继承自nn.BatchNorm2d, - - Args: - num_features eps momentum affine track_running_stats - 标准nn.BatchNorm2d的参数 - data_bits(int): 输入量化位数 - parameter_bits(int): 参数量化位数 - mode(Enum): 量化方式,支持MaxValue与Qvalue - o_bits(int, default=None):输出量化位数 - clamp_data(float or None): 针对输出的clamp数值 - clamp_weight(float or None): 针对转为乘加操作之后的weight与bias的clamp数值,此处不使用 - clamp_bias(float or None): 与clamp_weight一致 - ahead_relu(bool): 是否做融合relu之后的数值统计scale - - Examples: - test/test_batchnorm_int.py - - """ - - def __init__(self, num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, data_bits=8, parameter_bits=8, mode=QuantMode.QValue, o_bits=None, - clamp_data=None, clamp_weight=None, clamp_bias=None, ahead_relu=False): - nn.BatchNorm2d.__init__(self, num_features, eps, - momentum, affine, track_running_stats) - ModuleIntConfig.__init__( - self, data_bits=data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=o_bits) - self.prefix = "" - self.dump = False - self.path = "" - self.is_not_from_iqtensor = True - self.ahead_relu = ahead_relu - self.clamp_data = clamp_data - self.clamp_weight = clamp_weight - self.clamp_bias = clamp_bias - self.mode = mode - self.register_buffer('running_x', torch.zeros(1)) - self.register_buffer('running_w', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - self.register_buffer('scale_x', torch.zeros(1)) - self.register_buffer('scale_w', torch.zeros(1)) - self.register_buffer('scale_o', torch.zeros(1)) - - def forward(self, input): - # assert (self.running_mean.abs().sum() != 0), 'batchnormint onlu support finetune' - if self.momentum is None: - exponential_average_factor = 0.0 - else: - exponential_average_factor = self.momentum - scale_x = ScalerBuffer(self.scale_x) - running_x = ScalerBuffer(self.running_x) - if isinstance(input, IQTensor): - self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = Requant.apply( - 
input, input.bits, input.scale_data, self.data_bits, self.mode) - scale_x = ScalerBuffer(input.scale_data) - running_x = ScalerBuffer(input.running_data) - running_w = ScalerBuffer(self.running_w) - running_o = ScalerBuffer(self.running_o) - scale_w = ScalerBuffer(self.scale_w) - scale_o = ScalerBuffer(self.scale_o) - if self.training and self.track_running_stats: - if self.num_batches_tracked is not None: - self.num_batches_tracked = self.num_batches_tracked + 1 - if self.momentum is None: # use cumulative moving average - exponential_average_factor = 1.0 / \ - float(self.num_batches_tracked) - else: # use exponential moving average - exponential_average_factor = self.momentum - momentum = 0.1 - # do fuse - batchsize, channels, height, width = input.shape - size = batchsize * height * width - if self.training: - mean = input.clone().sum((0, 2, 3), keepdim=True) / size - var = input.clone().pow(2).sum((0, 2, 3), keepdim=True) / size - \ - (input.clone().sum((0, 2, 3), keepdim=True) / size).pow(2) - var = torch.clamp(var, min=0.0) - self.running_mean = ( - 1 - self.momentum) * self.running_mean + self.momentum * mean.squeeze().detach() - self.running_var = (1 - self.momentum) * self.running_var + \ - self.momentum * var.squeeze().detach() - else: - mean = self.running_mean.reshape(1, -1, 1, 1) - var = self.running_var.reshape(1, -1, 1, 1) - sigma = 1/torch.sqrt(var + self.eps) - if self.weight.dtype == torch.float32: - alpha = self.weight.view(1, -1, 1, 1)*sigma - beta = self.bias.view(1, -1, 1, 1)-mean*alpha - alpha = normalize_weight_with_config( - alpha, self.clamp_weight, self.training) - beta = normalize_bias_with_config( - beta, self.clamp_bias, self.training) - else: - alpha = self.weight - beta = self.bias - - ret = BatchNormFunction.apply(input, self.running_mean, self.running_var, alpha, beta, - self.training or not self.track_running_stats, exponential_average_factor, momentum, self.eps, - running_x, running_w, running_o, scale_x, scale_w, scale_o, - 
self.data_bits, self.parameter_bits, self.prefix, self.dump, self.path, self.quant_mode, self.o_bits, self.quant, self.is_not_from_iqtensor, self.ahead_relu, - self.clamp_data, self.clamp_weight, self.clamp_bias) - - self.running_x.fill_(running_x()) - self.running_w.fill_(running_w()) - self.running_o.fill_(running_o()) - - self.scale_x.fill_(scale_x()) - self.scale_w.fill_(scale_w()) - self.scale_o.fill_(scale_o()) - return ret - - def state_dict(module, destination=None, prefix='', keep_vars=False): - if destination is None: - destination = OrderedDict() - destination._metadata = OrderedDict() - destination._metadata[prefix[:-1] - ] = local_metadata = dict(version=module._version) - if is_in_onnx_export(): - assert module._buffers['running_x'] > 0, 'invalid running_x and running_o, please finetune first' - scale_x = ScalerBuffer(module._buffers['scale_x']) - if module.is_not_from_iqtensor: - scale_x = ScalerBuffer(module.quant.running_to_scale(ScalerBuffer( - module._buffers['running_x']), module.data_bits, mode=module.quant_mode)) - module._buffers['scale_x'].data.fill_(scale_x()) - - if module.o_bits is not None: - scale_o = ScalerBuffer(module.quant.running_to_scale(ScalerBuffer( - module._buffers['running_o']), module.o_bits, mode=module.quant_mode)) - module._buffers['scale_o'].data.fill_(scale_o()) - - if 'scale_w' in module._buffers and module._parameters['weight'].dtype == torch.float: - weights = module._parameters['weight'].data - bias = module._parameters['bias'].data - mean = module._buffers['running_mean'].data.view(1, -1, 1, 1) - var = module._buffers['running_var'].data.view(1, -1, 1, 1) - sigma = 1/torch.sqrt(var + module.eps) - alpha = weights.view(1, -1, 1, 1)*sigma - beta = bias.view(1, -1, 1, 1)-mean*alpha - alpha = normalize_weight_with_config( - alpha, module.clamp_weight, False) - beta = normalize_bias_with_config(beta, module.clamp_bias, False) - q_alpha = None - q_beta = None - if is_in_onnx_export(): - q_alpha, scale_w, _ = 
module.quant.quant( - alpha, module.parameter_bits, mode=module.quant_mode) - scale_w = ScalerBuffer(scale_w) - module._buffers['scale_w'].data.fill_(scale_w()) - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - q_beta = (beta * scale_w * scale_x + - 0.5).floor().float().round().int() - else: - assert False, 'linger only support luna quant.' - - weight_tensor = module._parameters['weight'] - bias_tensor = module._parameters['bias'] - if is_in_onnx_export(): - if module.parameter_bits <= 8: - weight_tensor.data = q_alpha.char().reshape(-1) - weight_tensor.char() - elif module.parameter_bits <= 16: - weight_tensor.data = q_alpha.short().reshape(-1) - weight_tensor.short() - else: - weight_tensor.data = q_alpha.int().reshape(-1) - weight_tensor.int() - bias_tensor.data = q_beta.int().reshape(-1) - bias_tensor.int() - module._save_to_state_dict(destination, prefix, keep_vars) - for name, module in module._modules.items(): - if module is not None: - module.state_dict(destination, prefix + - name + '.', keep_vars=keep_vars) - for hook in module._state_dict_hooks.values(): - hook_result = hook(module, destination, prefix, local_metadata) - if hook_result is not None: - destination = hook_result - return destination - - def _load_from_state_dict(module, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - allow_missing_keys = ['running_w', 'running_x', 'running_o', - 'scale_x', 'scale_w', 'scale_o'] - local_missing_keys = [] - module._load_from_state_dict_global_(state_dict, prefix, local_metadata, strict, - local_missing_keys, unexpected_keys, error_msgs) - matched = True - fake_missing_keys = [] - for k_local in local_missing_keys: - if k_local.replace(prefix, '', 1) not in allow_missing_keys: - matched = False - fake_missing_keys.append(k_local) - if matched: - local_missing_keys = [] - else: - local_missing_keys = fake_missing_keys - missing_keys += local_missing_keys - - def 
_load_from_state_dict_global_(module, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - for hook in module._load_state_dict_pre_hooks.values(): - hook(state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs) - local_name_params = itertools.chain( - module._parameters.items(), module._buffers.items()) - local_state = {k: v.data for k, - v in local_name_params if v is not None} - for name, param in local_state.items(): - key = prefix + name - if key in state_dict: - input_param = state_dict[key] - - if len(param.shape) == 0 and len(input_param.shape) == 1: - input_param = input_param[0] - - if input_param.shape != param.shape: - error_msgs.append('size mismatch for {}: copying a param with shape {} from checkpoint, ' - 'the shape in current model is {}.' - .format(key, input_param.shape, param.shape)) - continue - - if isinstance(input_param, torch.nn.Parameter): - input_param = input_param.data - try: - param.copy_(input_param) - if input_param.dtype == torch.int32 or input_param.dtype == torch.int8 or input_param.dtype == torch.int16: - module._parameters[name] = param.int() - - except Exception: - error_msgs.append('While copying the parameter named "{}", ' - 'whose dimensions in the model are {} and ' - 'whose dimensions in the checkpoint are {}.' 
- .format(key, param.size(), input_param.size())) - elif strict: - missing_keys.append(key) - if strict: - for key in state_dict.keys(): - if key.startswith(prefix): - input_name = key[len(prefix):] - input_name = input_name.split('.', 1)[0] - if input_name not in module._modules and input_name not in local_state: - unexpected_keys.append(key) - - def extra_repr(self): - s = nn.BatchNorm2d.extra_repr(self) - extra_s = ' ,clamp_data:{clamp_data},clamp_weight:{clamp_weight},clamp_bias:{clamp_bias},ahead_relu:{ahead_relu}'.format( - **self.__dict__) - extra_s += ',data_bits:{data_bits}, parameter_bits:{parameter_bits}, o_bits:{o_bits}'.format( - **self.__dict__) - return s+extra_s diff --git a/linger/ops/bmm_int.py b/linger/ops/bmm_int.py deleted file mode 100644 index 139a591..0000000 --- a/linger/ops/bmm_int.py +++ /dev/null @@ -1,281 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -from collections import OrderedDict - -import torch -import torch.nn as nn -from torch.onnx import is_in_onnx_export - -from ..config import config -from ..quant import Quant, normalize_data_with_config -from ..utils import Dump, QuantMode, ScalerBuffer -from .iqtensor import (IQTensor, from_torch_tensor, platform_to_string, - quantlinear, torch_bmm) -from .ops import ModuleIntConfig -from .requant import Requant - - -class BmmIntFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, mat2, data_bits, training, momentum, - running_x, running_y, running_o, scale_x, scale_y, scale_o, - prefix, dump, path, mode, o_bits, - quant, is_not_from_iqtensor1, is_not_from_iqtensor2, clamp_data): - - if training: - saved_tensors = [input, mat2] - # ctx.save_for_backward(input, mat2) - ctx.bits = o_bits, data_bits - ctx.clamp_data = clamp_data - zero_point_x = input.zero_point if isinstance( - input, IQTensor) else 0 - is_iq_tensor_x = True if isinstance(input, IQTensor) else False - zero_point_y = mat2.zero_point if isinstance( - mat2, IQTensor) else 0 - is_iq_tensor_y 
= True if isinstance(mat2, IQTensor) else False - ctx.value = zero_point_x, is_iq_tensor_x, zero_point_y, is_iq_tensor_y - if isinstance(input, IQTensor): - q_input, _, max_value_x = quant.quant( - input.data, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input', - iq_zero_point=input.zero_point) - else: - q_input, scale_x, max_value_x = quant.quant( - input.data, data_bits, mode=mode, quant_data='input') - running_x.mul_(1-momentum).add_(momentum*max_value_x) - if isinstance(mat2, IQTensor): - q_mat2, _, max_value_y = quant.quant( - mat2.data, data_bits, scale_y, mode=QuantMode.QValue, quant_data='input', iq_zero_point=mat2.zero_point) - else: - q_mat2, scale_y, max_value_y = quant.quant( - mat2.data, data_bits, mode=mode, quant_data='input') - running_y.mul_(1-momentum).add_(momentum*max_value_y) - q_input = q_input.float() if data_bits <= 8 else q_input.double() - q_mat2 = q_mat2.float() if data_bits <= 8 else q_mat2.double() - q_outputs = torch_bmm(q_input, q_mat2) - outputs = quant.dequant(q_outputs, scale_x * scale_y) - out_tensor = outputs - if o_bits is not None: - outputs = normalize_data_with_config(outputs, clamp_data) - out_tensor = outputs - q_outputs, scale_o, max_value_o = quant.quant( - outputs, o_bits, mode=mode, quant_data='output') - outputs = quant.dequant(q_outputs, scale_o) - running_o.mul_(1-momentum).add_(momentum*max_value_o) - saved_tensors += [out_tensor] - ctx.scale = scale_x, scale_y, scale_o - ctx.save_for_backward(*saved_tensors) - - else: - assert running_x > 0, 'invalid running_x = 0, please finetune training before eval' - if not isinstance(input, IQTensor): - scale_x = ScalerBuffer(quant.running_to_scale( - running_x, data_bits, mode=mode)) - if not isinstance(mat2, IQTensor): - scale_y = ScalerBuffer(quant.running_to_scale( - running_y, data_bits, mode=mode)) - if o_bits is not None: - assert running_o > 0, 'invalid running_o = 0 for BmmInt' - scale_o = ScalerBuffer(quant.running_to_scale( - running_o, o_bits, 
mode=mode)) - input_zero_point = 0 - mat2_zero_point = 0 - if isinstance(input, IQTensor): - input_zero_point = input.zero_point - if isinstance(mat2, IQTensor): - mat2_zero_point = mat2.zero_point - q_input, _, _ = quant.quant( - input.data, data_bits, scale_x, mode=mode, quant_data='input', iq_zero_point=input_zero_point) - q_mat2, _, _ = quant.quant( - mat2.data, data_bits, scale_y, mode=mode, quant_data='input', iq_zero_point=mat2_zero_point) - q_input = q_input.double() - q_mat2 = q_mat2.double() - - q_outputs = torch_bmm(q_input, q_mat2) - outputs = quant.dequant(q_outputs, scale_x*scale_y) - if o_bits is not None: - q_outputs, _, _ = quant.quant( - outputs, o_bits, scale_o, mode=mode, quant_data='output') - outputs = quant.dequant(q_outputs, scale_o) - - if dump: - name_list = ['input', 'mat2', 'outputs', - 'q_input', 'q_mat2', 'q_outputs'] - attr_list = [input, mat2, outputs, q_input, q_mat2, q_outputs] - Dump.dump_file(prefix, '.BmmInt.', zip( - name_list, attr_list), path) - - if o_bits is None: - return outputs - elif isinstance(scale_o, float): - return from_torch_tensor(outputs, scale_o, o_bits) - elif isinstance(scale_o, torch.Tensor): - return from_torch_tensor(outputs, scale_o.item(), o_bits) - else: - return from_torch_tensor(outputs, scale_o.data, o_bits) - - @staticmethod - def backward(ctx, gradoutput): - input, mat2, output = ctx.saved_tensors - zero_point_x, is_iq_tensor_x, zero_point_y, is_iq_tensor_y = ctx.value - # input = input.detach().clone().requires_grad_(True) - # mat2 = mat2.detach().clone().requires_grad_(True) - scale_x, scale_y, scale_o = ctx.scale - o_bits, data_bits = ctx.bits - clamp_data = ctx.clamp_data - grad_input = grad_mat2 = None - - if is_iq_tensor_x: - f_input = input.data - else: - q_input, _, _ = Quant.quant( - input.data, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input') - f_input = Quant.dequant(q_input, scale_x) - f_input = f_input.detach().clone().requires_grad_(True) - if is_iq_tensor_y: - f_mat2 = 
mat2.data - else: - q_mat2, _, _ = Quant.quant( - mat2.data, data_bits, scale_y, mode=QuantMode.QValue, quant_data='input') - f_mat2 = Quant.dequant(q_mat2, scale_y) - f_mat2 = f_mat2.detach().clone().requires_grad_(True) - - with torch.enable_grad(): - z = torch_bmm(f_input, f_mat2) - if o_bits is not None: - z = normalize_data_with_config(z, clamp_data) - grad_input, grad_mat2 = torch.autograd.grad( - z, (f_input, f_mat2), gradoutput) - - return grad_input, grad_mat2, None, None, None, \ - None, None, None, None,\ - None, None, None, None, None, None,\ - None, None, None, None, None,\ - None - - @staticmethod - def symbolic(g, input, mat2, - data_bits, training, momentum, - running_x, running_y, running_o, scale_x, scale_y, scale_o, - prefix, dump, path, mode, o_bits, - quant, is_not_from_iqtensor1, is_not_from_iqtensor2, clamp_data): - op_inner1 = None - op_inner2 = None - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - if is_not_from_iqtensor1: - op_inner1 = quantlinear( - g, input, scale_x(), platform_quant, data_bits) - if is_not_from_iqtensor2: - op_inner2 = quantlinear( - g, mat2, scale_y(), platform_quant, data_bits) - param_dict = {'scale_x_f': scale_x(), 'scale_y_f': scale_y(), - 'data_bits_i': data_bits} - input_list = [] - if is_not_from_iqtensor1: - input_list.append(op_inner1) - else: - input_list.append(input) - if is_not_from_iqtensor2: - input_list.append(op_inner2) - else: - input_list.append(mat2) - if o_bits is not None: - param_dict['scale_o_f'] = scale_o() - param_dict['o_bits_i'] = o_bits - param_dict['platform_quant_s'] = platform_quant - - return g.op("thinker::BmmInt", *input_list, **param_dict) - - -class BmmInt(nn.Module, ModuleIntConfig): - def __init__(self, data_bits=8, mode=QuantMode.QValue, o_bits=None, clamp_data=None): - nn.Module.__init__(self) - ModuleIntConfig.__init__( - self, data_bits=data_bits, mode=mode, o_bits=o_bits) - self.momentum = 0.1 - self.prefix = "" - self.dump = False - self.path = 
"" - self.is_not_from_iqtensor1 = True - self.is_not_from_iqtensor2 = True - self.clamp_data = clamp_data - - self.register_buffer('running_x', torch.zeros(1)) - self.register_buffer('running_y', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - self.register_buffer('scale_x', torch.zeros(1)) - self.register_buffer('scale_y', torch.zeros(1)) - self.register_buffer('scale_o', torch.zeros(1)) - - def forward(self, input, mat2): - running_x = ScalerBuffer(self.running_x) - running_y = ScalerBuffer(self.running_y) - running_o = ScalerBuffer(self.running_o) - scale_x = ScalerBuffer(self.scale_x) - scale_y = ScalerBuffer(self.scale_y) - scale_o = ScalerBuffer(self.scale_o) - if isinstance(input, IQTensor): - self.is_not_from_iqtensor1 = False - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits) - scale_x = ScalerBuffer(input.scale_data) - running_x = ScalerBuffer(input.running_data) - if isinstance(mat2, IQTensor): - self.is_not_from_iqtensor2 = False - if mat2.bits != self.data_bits: - mat2 = Requant.apply( - mat2, mat2.bits, mat2.scale_data, self.data_bits) - scale_y = ScalerBuffer(mat2.scale_data) - running_y = ScalerBuffer(mat2.running_data) - - ret = BmmIntFunction.apply(input, mat2, - self.data_bits, self.training, self.momentum, - running_x, running_y, running_o, scale_x, scale_y, scale_o, - self.prefix, self.dump, self.path, self.quant_mode, self.o_bits, - self.quant, self.is_not_from_iqtensor1, self.is_not_from_iqtensor2, self.clamp_data) - self.running_x.fill_(running_x()) - self.running_y.fill_(running_y()) - self.running_o.fill_(running_o()) - self.scale_x.fill_(scale_x()) - self.scale_y.fill_(scale_y()) - self.scale_o.fill_(scale_o()) - return ret - - def state_dict(self, destination=None, prefix='', keep_vars=False): - if destination is None: - destination = OrderedDict() - destination._metadata = OrderedDict() - destination._metadata[prefix[:-1] - ] = local_metadata = 
dict(version=self._version) - if is_in_onnx_export(): - assert self.running_x > 0, 'invalid running_x <=0' - scale_x = ScalerBuffer(self.scale_x.data) - scale_y = ScalerBuffer(self.scale_y.data) - if self.is_not_from_iqtensor1: - scale_x = ScalerBuffer(self.quant.running_to_scale( - self.running_x, self.data_bits, mode=self.quant_mode)) - self.scale_x.data.fill_(scale_x()) - if self.is_not_from_iqtensor2: - scale_y = ScalerBuffer(self.quant.running_to_scale( - self.running_y, self.data_bits, mode=self.quant_mode)) - self.scale_y.data.fill_(scale_y()) - if self.o_bits is not None: - scale_o = ScalerBuffer(self.quant.running_to_scale( - self.running_o, self.o_bits, mode=self.quant_mode)) - self.scale_o.data.fill_(scale_o()) - self._save_to_state_dict(destination, prefix, keep_vars) - for name, module in self._modules.items(): - if module is not None: - module.state_dict(destination, prefix + - name + '.', keep_vars=keep_vars) - for hook in self._state_dict_hooks.values(): - hook_result = hook(self, destination, prefix, local_metadata) - if hook_result is not None: - destination = hook_result - return destination - - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) diff --git a/linger/ops/conv1d_int.py b/linger/ops/conv1d_int.py deleted file mode 100644 index 8539e72..0000000 --- a/linger/ops/conv1d_int.py +++ /dev/null @@ -1,332 +0,0 @@ -import torch -import torch.nn as nn -import torch.nn.functional as F - -from ..config import config -from ..quant import (Quant, normalize_bias_with_config, - normalize_data_with_config, normalize_weight_with_config) -from ..utils import Dump, PlatFormQuant, QuantMode, ScalerBuffer -from .iqtensor import (IQTensor, from_torch_tensor, platform_to_string, - quantlinear) -from .ops import ModuleIntConfig -from .requant 
import Requant - - -class Conv1dFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, weights, bias, kernel_size, padding, stride, dilation, groups, params, - data_bits, parameter_bits, training, momentum, running_x, running_w, running_o, eval_scale_x, eval_scale_w, eval_scale_o, - prefix, dump, path, mode, o_bits, quant, ahead_relu, is_not_from_iqtensor, clamp_data, clamp_weight, clamp_bias): - scale_o = None - if training: - ctx.clamp_data = clamp_data - ctx.clamp_weight = clamp_weight - ctx.clamp_bias = clamp_bias - zero_point = input.zero_point if isinstance(input, IQTensor) else 0 - is_iq_tensor = True if isinstance(input, IQTensor) else False - ctx.value = zero_point, is_iq_tensor - ctx.bits = data_bits, parameter_bits, o_bits - saved_tensors = [input, weights, bias, params] - if isinstance(input, IQTensor): - q_input, _, max_value_x = quant.quant( - input.data, data_bits, eval_scale_x, mode=QuantMode.QValue, quant_data='input', iq_zero_point=input.zero_point) - scale_x = eval_scale_x - else: - q_input, scale_x, max_value_x = quant.quant( - input.data, data_bits, mode=mode, quant_data='input') - running_x.mul_(1-momentum).add_(momentum*max_value_x) - - # weights = normalize_weight_with_config(weights, clamp_weight, False) - # if bias is not None: - # bias = normalize_bias_with_config(bias, clamp_bias, False) - q_weights, scale_w, max_value_w = quant.quant( - weights, parameter_bits, mode=mode, quant_data='weight') - running_w.mul_(1-momentum).add_(momentum*max_value_w) - q_input = q_input.float() if data_bits + parameter_bits <= 16 else q_input.double() - q_weights = q_weights.float() if data_bits + \ - parameter_bits <= 16 else q_weights.double() - q_outputs = F.conv1d(q_input, q_weights, stride=stride, - padding=padding, dilation=dilation, groups=groups) - outputs = None - q_bias = None - - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - assert mode == QuantMode.QValue, 'castor quant only support QValue and 
o_bits=None' - if bias is not None: - q_bias = (bias * scale_w * scale_x + 0.5).floor() - if data_bits + parameter_bits <= 16: - q_bias = q_bias.float() - else: - q_bias = q_bias.double() - q_outputs += q_bias.reshape(1, -1, 1) - outputs = quant.dequant(q_outputs, scale_x*scale_w) - else: - assert False, "linger only support luna quant." - # ctx.save_for_backward(*saved_tensors) - # f_input = quant.dequant(q_input, scale_x) - # f_weights = quant.dequant(q_weights, scale_w) - # f_bias = None if bias is None else quant.dequant(q_bias, scale_x*scale_w) - # saved_tensors = [f_input, f_weights, f_bias, params] - - out_tensor = outputs - if o_bits is not None: - outputs = normalize_data_with_config(outputs, clamp_data) - out_tensor = outputs - q_outputs, scale_o, max_value_o = quant.quant( - outputs, o_bits, mode=mode, quant_data='output', ahead_relu=ahead_relu) - outputs = quant.dequant(q_outputs, scale_o) - running_o.mul_(1-momentum).add_(momentum*max_value_o) - ctx.scale = scale_x, scale_w, scale_o - saved_tensors += [out_tensor] - ctx.save_for_backward(*saved_tensors) - else: - assert running_x > 0, 'invalid running_x, please finetune training before eval' - scale_x = None - scale_w = None - scale_o = eval_scale_o - if weights.dtype == torch.float32: - scale_x = ScalerBuffer(quant.running_to_scale( - running_x, data_bits, mode=mode)) - q_weights, scale_w, _ = quant.quant( - weights, parameter_bits, mode=mode, quant_data='weight') - scale_w = ScalerBuffer(scale_w) - if o_bits is not None: - assert running_o > 0, 'invalid running_o <= 0, please finetune training' - scale_o = ScalerBuffer(quant.running_to_scale( - running_o, o_bits, mode=mode)) - else: - scale_x = eval_scale_x - scale_w = eval_scale_w - q_weights = weights.double() - if o_bits is not None: - scale_o = eval_scale_o - - if isinstance(input, IQTensor): - scale_x = eval_scale_x - q_input, _, _ = quant.quant( - input.data, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input', 
iq_zero_point=input.zero_point) - else: - q_input, scale_x, _ = quant.quant( - input.data, data_bits, scale_x, mode=mode, quant_data='input') - q_input = q_input.double() - q_weights = q_weights.double() - q_outputs = F.conv1d(q_input, q_weights, stride=stride, - padding=padding, dilation=dilation, groups=groups) - # # ensure bias clamp - # if bias is not None and bias.dtype == torch.float32: - # bias = normalize_bias_with_config(bias, clamp_bias, False) - outputs = None - q_bias = None - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - assert mode == QuantMode.QValue, 'castor quant only support QValue and o_bits=None' - if bias is not None: - if bias.dtype == torch.float32: - q_bias = (bias * scale_w * scale_x + 0.5).floor() - if data_bits + parameter_bits <= 16: - q_bias = q_bias.float().double() - else: - q_bias = q_bias.double() - else: - q_bias = bias.double() - q_outputs += q_bias.reshape(1, -1, 1) - outputs = quant.dequant(q_outputs, scale_x*scale_w) - else: - assert False, "linger only support luna quant." 
- if o_bits is not None: - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - q_outputs, _, _ = quant.quant( - outputs, o_bits, scale_o, mode=mode, quant_data='output') - outputs = quant.dequant(q_outputs, scale_o) - if dump: - if q_bias is not None: - name_list = ["input", "weights", "bias", "outputs", "q_input", "q_weights", "q_bias", - "q_outputs", "scale_x", "scale_w", "scale_o", "running_x", "running_w", "running_o"] - attr_list = [input, weights, bias, outputs, q_input, q_weights, q_bias, q_outputs, - scale_x.data, scale_w.data, scale_o.data, running_x.data, running_w.data, running_o.data] - Dump.dump_file(prefix, ".Conv1dInt.", - zip(name_list, attr_list), path) - else: - name_list = ["input", "weights", "outputs", "q_input", "q_weights", "q_outputs", - "scale_x", "scale_w", "scale_o", "running_x", "running_w", "running_o"] - attr_list = [input, weights, outputs, q_input, q_weights, q_outputs, scale_x.data, - scale_w.data, scale_o.data, running_x.data, running_w.data, running_o.data] - Dump.dump_file(prefix, ".Conv1dInt.", - zip(name_list, attr_list), path) - if o_bits is None: - return outputs - elif isinstance(scale_o, float): - return from_torch_tensor(outputs, scale_o, o_bits) - elif isinstance(scale_o, torch.Tensor): - return from_torch_tensor(outputs, scale_o.item(), o_bits) - else: - return from_torch_tensor(outputs, scale_o.data, o_bits) - - @staticmethod - def backward(ctx, gradOutput): - clamp_data = ctx.clamp_data - data_bits, parameter_bits, o_bits = ctx.bits - zero_point, is_iq_tensor = ctx.value - scale_x, scale_w, scale_o = ctx.scale - input, weights, bias, params, outputs = ctx.saved_tensors - # zero_point = input.zero_point if isinstance(input, IQTensor) else 0 - if is_iq_tensor: - f_input = input.data - else: - q_input, _, _ = Quant.quant( - input.data, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input') - f_input = Quant.dequant(q_input, scale_x) - f_input = f_input.detach().clone().requires_grad_(True) - 
q_weights, _, _ = Quant.quant( - weights.data, parameter_bits, scale_w, mode=QuantMode.QValue, quant_data='weight') - f_weights = Quant.dequant(q_weights, scale_w) - f_weights = f_weights.detach().clone().requires_grad_(True) - bias = None if bias is None else bias.detach().clone().requires_grad_(True) - - stride = int(params[0]) - padding = int(params[1]) - dilation = int(params[2]) - groups = int(params[3]) - gradBias = None - # input_clamp = input - with torch.enable_grad(): - z = F.conv1d(f_input, f_weights, bias, - stride, padding, dilation, groups) - if o_bits is not None: - z = normalize_data_with_config(z, clamp_data) - if bias is not None: - gradInput, gradWeight, gradBias = torch.autograd.grad( - z, (f_input, f_weights, bias), gradOutput) - else: - gradInput, gradWeight, = torch.autograd.grad( - z, (f_input, f_weights), gradOutput) - - return gradInput, gradWeight, gradBias, None, None, None, None, None, None, None, None, None, None, None, None,\ - None, None, None, None, None, None, None, None, None, None, None, None, None, None,\ - None, None, None, None, None, None - - @staticmethod - def symbolic(g, input, weights, bias, kernel_size, padding, stride, dilation, groups, params, - data_bits, parameter_bits, training, momentum, running_x, running_w, running_o, scale_x, scale_w, scale_o, - prefix, dump, path, mode, o_bits, quant, ahead_relu, is_not_from_iqtensor, clamp_data, clamp_weight, clamp_bias): - op_inner = None - if is_not_from_iqtensor: - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - op_inner = quantlinear(g, input, scale_x(), - platform_quant, data_bits) - paddings = padding + padding - param_dict = {'scale_x_f': scale_x(), 'scale_w_f': scale_w(), - 'dilations_i': dilation, 'group_i': groups, 'kernel_shape_i': kernel_size, 'pads_i': paddings, 'strides_i': stride, - 'data_bits_i': data_bits, 'parameter_bits_i': parameter_bits} - if is_not_from_iqtensor: - input_list = [op_inner, weights] - else: - input_list = 
[input, weights] - if bias is not None: - input_list.append(bias) - if o_bits is not None: - param_dict['scale_o_f'] = scale_o() - param_dict['o_bits_i'] = o_bits - - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - param_dict['platform_quant_s'] = platform_quant - - return g.op("thinker::Conv1dInt", *input_list, **param_dict) - - -class Conv1dInt(nn.Conv1d, ModuleIntConfig): - r"""实现Conv1dInt的量化训练与测试,继承自nn.Conv1d, - - Args: - in_channels out_channels kernel_size stride padding dilation groups bias padding_mode - 与nn.Conv2d一致的参数 - data_bits(int): 输入量化位数 - parameter_bits(int): 参数量化位数 - mode(Enum): 量化方式,支持MaxValue与Qvalue - o_bits(int, default=None):输出量化位数 - clamp_data(float or None): 针对输出的clamp数值 - clamp_weight(float or None): 针对转为weight的clamp数值 - clamp_bias(float or None): 与clamp_weight一致 - ahead_relu(bool): 是否做融合relu之后的数值统计scale - - """ - - def __init__(self, in_channels, out_channels, kernel_size, stride=1, - padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', - data_bits=8, parameter_bits=8, mode=QuantMode.QValue, o_bits=None, - clamp_data=None, clamp_weight=None, clamp_bias=None, ahead_relu=False): - nn.Conv1d.__init__(self, in_channels, out_channels, - kernel_size, stride, padding, dilation, groups, bias) - ModuleIntConfig.__init__( - self, data_bits=data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=o_bits) - self.params = torch.Tensor( - [self.stride[0], self.padding[0], self.dilation[0], groups]) - self.momentum = 0.1 - self.prefix = "" - self.dump = False - self.path = "" - self.ahead_relu = ahead_relu - self.is_not_from_iqtensor = True - self.clamp_data = clamp_data - self.clamp_weight = clamp_weight - self.clamp_bias = clamp_bias - self.register_buffer('running_x', torch.zeros(1)) - self.register_buffer('running_w', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - self.register_buffer('scale_x', torch.zeros(1)) - self.register_buffer('scale_w', torch.zeros(1)) - 
self.register_buffer('scale_o', torch.zeros(1)) - - def forward(self, input): - scale_x = ScalerBuffer(self.scale_x) - running_x = ScalerBuffer(self.running_x) - if isinstance(input, IQTensor): - self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits) - scale_x = ScalerBuffer(input.scale_data) - running_x = ScalerBuffer(input.running_data) - running_w = ScalerBuffer(self.running_w) - running_o = ScalerBuffer(self.running_o) - scale_w = ScalerBuffer(self.scale_w) - scale_o = ScalerBuffer(self.scale_o) - weight = self.weight - bias = self.bias - if weight.dtype == torch.float32: - weight = normalize_weight_with_config( - self.weight, self.clamp_weight, self.training) - if self.bias is not None: - bias = normalize_bias_with_config( - self.bias, self.clamp_bias, self.training) - ret = Conv1dFunction.apply(input, weight, bias, - self.kernel_size, self.padding, self.stride, self.dilation, self.groups, - self.params, self.data_bits, self.parameter_bits, self.training, self.momentum, - running_x, running_w, running_o, scale_x, scale_w, scale_o, - self.prefix, self.dump, self.path, self.quant_mode, self.o_bits, self.quant, self.ahead_relu, self.is_not_from_iqtensor, - self.clamp_data, self.clamp_weight, self.clamp_bias) - - self.running_x.fill_(running_x()) - self.running_w.fill_(running_w()) - self.running_o.fill_(running_o()) - self.scale_x.fill_(scale_x()) - self.scale_w.fill_(scale_w()) - self.scale_o.fill_(scale_o()) - return ret - - def state_dict(self, destination=None, prefix='', keep_vars=False): - return ModuleIntConfig.state_dict_global(self, destination, prefix, keep_vars) - - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) - - def extra_repr(self): - s = 
nn.Conv1d.extra_repr(self) - extra_s = ',clamp_data:{clamp_data},clamp_weight:{clamp_weight},clamp_bias:{clamp_bias},ahead_relu:{ahead_relu}'.format( - **self.__dict__) - extra_s += ',data_bits:{data_bits},parameter_bits:{parameter_bits},o_bits:{o_bits}'.format( - **self.__dict__) - return s+extra_s diff --git a/linger/ops/conv_int.py b/linger/ops/conv_int.py deleted file mode 100644 index 77a16bb..0000000 --- a/linger/ops/conv_int.py +++ /dev/null @@ -1,412 +0,0 @@ -import math - -import torch -import torch.nn as nn -import torch.nn.functional as F - -from ..config import config -from ..quant import (Quant, normalize_bias_with_config, - normalize_data_with_config, normalize_weight_with_config) -from ..utils import Dump, PlatFormQuant, QuantMode, ScalerBuffer -from .iqtensor import (IQTensor, from_torch_tensor, platform_to_string, - quantlinear) -from .ops import ModuleIntConfig -from .requant import Requant - - -class ConvFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, weights, bias, kernel_size, padding, stride, dilation, groups, params, - data_bits, parameter_bits, training, momentum, running_x, running_w, running_o, eval_scale_x, eval_scale_w, eval_scale_o, - prefix, dump, path, mode, o_bits, quant, ahead_relu, is_not_from_iqtensor, clamp_data, clamp_weight, clamp_bias, ahead_sigmoid): - scale_o = None - - # venus limits - # assert input.shape[2] >= weights.shape[2] and input.shape[3] >= weights.shape[ - # 3], f"in Conv2dInt op, input's width >= weight's width && input'height >= weight'height, but you have {input.shape[2]} >= {weights.shape[2]} and {input.shape[3]} >= {weights.shape[3]}" - assert data_bits in ( - 4, 8), f"in Conv2dInt op, data_bits only support 4/8, but you have data_bits {data_bits}" - assert parameter_bits in ( - 4, 8), f"in Conv2dInt op, parameter_bits only support 4/8, but you have parameter_bits {parameter_bits}" - assert o_bits in ( - 4, 8, 16, 32), f"in Conv2dInt op, o_bits only support 4/8/16/32, but you 
have o_bits {o_bits}" - - channel_in = weights.shape[1]/groups - # assert not (math.ceil(channel_in/8/stride[1]) * (8*stride[1]) * math.ceil(input.shape[3]/8)*8*1 > 64 * 1024 and channel_in > 512) or not (math.ceil(channel_in/8/stride[1]) * (8*stride[1]) * math.ceil(8/8)*8*input.shape[2] > 64 * 1024 and channel_in > - # 512), f"in Conv2dInt op, the size of the input data after alignment exceed 64KB and channal_in > 512 at the same time is not allowed, but you have (math.ceil({channel_in}/8/{stride[1]}) * (8*{stride[1]}) * math.ceil({input.shape[3]}/8)*8*{1} > 64 * 1024 and {channel_in} > 512) or (math.ceil({channel_in}/8/{stride[1]}) * (8*{stride[1]}) * math.ceil({8}/8)*8*{input.shape[2]} > 64 * 1024 and {channel_in} > 512" - - if training: - ctx.clamp_data = clamp_data - ctx.clamp_weight = clamp_weight - ctx.clamp_bias = clamp_bias - zero_point = input.zero_point if isinstance(input, IQTensor) else 0 - is_iq_tensor = True if isinstance(input, IQTensor) else False - ctx.bits = data_bits, parameter_bits, o_bits - ctx.value = zero_point, is_iq_tensor - - saved_tensors = [input, weights, bias, params] - if isinstance(input, IQTensor): - q_input, _, max_value_x = quant.quant( - input.data, data_bits, eval_scale_x, mode=QuantMode.QValue, quant_data='input', - iq_zero_point=input.zero_point) - scale_x = eval_scale_x - else: - q_input, scale_x, max_value_x = quant.quant( - input.data, data_bits, mode=mode, quant_data='input') - running_x.mul_(1-momentum).add_(momentum*max_value_x) - - q_weights, scale_w, max_value_w = quant.quant( - weights, parameter_bits, mode=mode, quant_data='weight') - running_w.mul_(1-momentum).add_(momentum*max_value_w) - q_input = q_input.float() if data_bits + parameter_bits <= 16 else q_input.double() - q_weights = q_weights.float() if data_bits + \ - parameter_bits <= 16 else q_weights.double() - q_outputs = F.conv2d(q_input, q_weights, stride=stride, - padding=padding, dilation=dilation, groups=groups) - outputs = None - q_bias = None - if 
config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - assert mode == QuantMode.QValue, 'castor quant only support QValue and o_bits=None' - if bias is not None: - q_bias = (bias * scale_w * scale_x + 0.5).floor() - if data_bits + parameter_bits <= 16: - q_bias = q_bias.float().round() - else: - q_bias = q_bias.double() - # 小心计算溢出问题,未做保护,bias不超过2**23-1数值 - q_outputs += q_bias.reshape(1, -1, 1, 1) - outputs = quant.dequant(q_outputs, scale_x*scale_w) - else: - assert False, "linger only support luna quant." - - # f_input = quant.dequant(q_input, scale_x) - # f_weights = quant.dequant(q_weights, scale_w) - # f_bias = None if bias is None else quant.dequant(q_bias, scale_x*scale_w) - # saved_tensors = [f_input, f_weights, f_bias, params] - - out_tensor = outputs - scale_o = None - if o_bits is not None: - outputs = normalize_data_with_config(outputs, clamp_data) - out_tensor = outputs - q_outputs, scale_o, max_value_o = quant.quant( - outputs, o_bits, mode=mode, quant_data='conv_output', ahead_relu=ahead_relu) - outputs = quant.dequant(q_outputs, scale_o) - running_o.mul_(1-momentum).add_(momentum*max_value_o) - ctx.scale = scale_x, scale_w, scale_o - ctx.save_for_backward(*saved_tensors) - - else: - assert running_x > 0, 'invalid running_x = 0, please finetune training before eval' - scale_x = None - scale_w = None - scale_o = eval_scale_o - q_weights = None - if weights.dtype == torch.float32: - scale_x = ScalerBuffer(quant.running_to_scale( - running_x, data_bits, mode=mode)) - q_weights, scale_w, _ = quant.quant( - weights, parameter_bits, mode=mode, quant_data='weight') - scale_w = ScalerBuffer(scale_w) - if o_bits is not None: - assert running_o > 0, 'invalid running_o <= 0, please finetune training' - scale_o = ScalerBuffer(quant.running_to_scale( - running_o, o_bits, mode=mode)) - else: - scale_x = eval_scale_x - scale_w = eval_scale_w - q_weights = weights.double() - if o_bits is not None: - scale_o = eval_scale_o - - if isinstance(input, 
IQTensor): - scale_x = eval_scale_x - q_input, _, _ = quant.quant( - input.data, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input', iq_zero_point=input.zero_point) - else: - q_input, scale_x, _ = quant.quant( - input.data, data_bits, scale_x, mode=mode, quant_data='input') - q_input = q_input.double() - q_weights = q_weights.double() - q_outputs = F.conv2d(q_input, q_weights, stride=stride, - padding=padding, dilation=dilation, groups=groups) - - outputs = None - q_bias = None - - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - assert mode == QuantMode.QValue, 'castor quant only support QValue and o_bits=None' - if bias is not None: - if bias.dtype == torch.float32: - q_bias = (bias * scale_w * scale_x + 0.5).floor() - if data_bits + parameter_bits <= 16: - # 降低bias精度,保持训练一致 - q_bias = q_bias.float().round().double() - else: - q_bias = q_bias.double() - else: - q_bias = bias.double() - q_outputs += q_bias.reshape(1, -1, 1, 1) - outputs = quant.dequant(q_outputs, scale_x*scale_w) - else: - assert False, "linger only support luna quant." - - if o_bits is not None: - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - q_outputs, _, _ = quant.quant( - outputs, o_bits, scale_o, mode=mode, quant_data='output') - else: - assert False, "linger only support luna quant." 
- outputs = quant.dequant(q_outputs, scale_o) - if dump: - if q_bias is not None: - name_list = ["input", "weights", "bias", "outputs", "q_input", "q_weights", "q_bias", - "q_outputs", "scale_x", "scale_w", "scale_o", "running_x", "running_w", "running_o"] - attr_list = [input, weights, bias, outputs, q_input, q_weights, q_bias, q_outputs, - scale_x.data, scale_w.data, scale_o.data, running_x.data, running_w.data, running_o.data] - Dump.dump_file(prefix, ".Conv2dInt.", - zip(name_list, attr_list), path) - else: - name_list = ["input", "weights", "outputs", "q_input", "q_weights", "q_outputs", - "scale_x", "scale_w", "scale_o", "running_x", "running_w", "running_o"] - attr_list = [input, weights, outputs, q_input, q_weights, q_outputs, scale_x.data, - scale_w.data, scale_o.data, running_x.data, running_w.data, running_o.data] - Dump.dump_file(prefix, ".Conv2dInt.", - zip(name_list, attr_list), path) - if o_bits is None: - return outputs - elif isinstance(scale_o, float): - return from_torch_tensor(outputs, scale_o, o_bits) - elif isinstance(scale_o, torch.Tensor): - return from_torch_tensor(outputs, scale_o.item(), o_bits) - else: - return from_torch_tensor(outputs, scale_o.data, o_bits) - - @staticmethod - def backward(ctx, gradOutput): - clamp_data = ctx.clamp_data - data_bits, parameter_bits, o_bits = ctx.bits - zero_point, is_iq_tensor = ctx.value - scale_x, scale_w, scale_o = ctx.scale - input, weights, bias, params = ctx.saved_tensors - - if is_iq_tensor: - f_input = input.data - else: - q_input, _, _ = Quant.quant( - input.data, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input') - f_input = Quant.dequant(q_input, scale_x) - f_input = f_input.detach().clone().requires_grad_(True) - q_weights, _, _ = Quant.quant( - weights.data, parameter_bits, scale_w, mode=QuantMode.QValue, quant_data='weight') - f_weights = Quant.dequant(q_weights, scale_w) - f_weights = f_weights.detach().clone().requires_grad_(True) - bias = None if bias is None else 
bias.detach().clone().requires_grad_(True) - - stride = int(params[0]), int(params[1]) - padding = int(params[2]), int(params[3]) - dilation = int(params[4]), int(params[5]) - groups = int(params[6]) - gradBias = None - # input_clamp = input - with torch.enable_grad(): - z = F.conv2d(f_input, f_weights, bias, - stride, padding, dilation, groups) - if o_bits is not None: - z = normalize_data_with_config(z, clamp_data) - if bias is not None: - gradInput, gradWeight, gradBias = torch.autograd.grad( - z, (f_input, f_weights, bias), gradOutput) - else: - gradInput, gradWeight, = torch.autograd.grad( - z, (f_input, f_weights), gradOutput) - - return gradInput, gradWeight, gradBias, None, None, None, None, None, None, None, None, None, None, None, None, None,\ - None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None - - @staticmethod - def symbolic(g, input, weights, bias, kernel_size, padding, stride, dilation, groups, params, - data_bits, parameter_bits, training, momentum, running_x, running_w, running_o, scale_x, scale_w, scale_o, - prefix, dump, path, mode, o_bits, quant, ahead_relu, is_not_from_iqtensor, clamp_data, clamp_weight, clamp_bias, ahead_sigmoid): - op_inner = None - if is_not_from_iqtensor: - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - op_inner = quantlinear(g, input, scale_x(), - platform_quant, data_bits) - paddings = padding + padding - param_dict = {'scale_x_f': scale_x(), 'scale_w_f': scale_w(), - 'dilations_i': dilation, 'group_i': groups, 'kernel_shape_i': kernel_size, 'pads_i': paddings, 'strides_i': stride, - 'data_bits_i': data_bits, 'parameter_bits_i': parameter_bits} - if is_not_from_iqtensor: - input_list = [op_inner, weights] - else: - input_list = [input, weights] - if bias is not None: - input_list.append(bias) - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - if o_bits is not None: - param_dict['scale_o_f'] = scale_o() - 
param_dict['o_bits_i'] = o_bits - param_dict['platform_quant_s'] = platform_quant - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - param_dict['castor_mode_s'] = "luna" - return g.op("thinker::Conv2dInt", *input_list, **param_dict) - - -class Conv2dInt(nn.Conv2d, ModuleIntConfig): - r"""实现Conv2dInt的量化训练与测试,继承自nn.Conv2d, - - Args: - in_channels out_channels kernel_size stride padding dilation groups bias padding_mode - 与nn.Conv2d一致的参数 - data_bits(int): 输入量化位数 - parameter_bits(int): 参数量化位数 - mode(Enum): 量化方式,支持MaxValue与Qvalue - o_bits(int, default=None):输出量化位数 - clamp_data(float or None): 针对输出的clamp数值 - clamp_weight(float or None): 针对转为weight的clamp数值 - clamp_bias(float or None): 与clamp_weight一致 - ahead_relu(bool): 是否做融合relu之后的数值统计scale - - """ - - def __init__(self, in_channels, out_channels, kernel_size, stride=1, - padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', - data_bits=8, parameter_bits=8, mode=QuantMode.QValue, o_bits=8, - clamp_data=None, clamp_weight=None, clamp_bias=None, ahead_relu=False, ahead_sigmoid=False): - # venus limits - if type(kernel_size) == int: - assert kernel_size in ( - 1, 2, 3, 4, 5), f"in Conv2dInt op, kernel size must be 1/2/3/4/5, but you have kernel size {kernel_size}" - elif type(kernel_size) == tuple: - assert kernel_size[0] in (1, 2, 3, 4, 5) and kernel_size[1] in ( - 1, 2, 3, 4, 5), "in Conv2dInt op, kernel size must be 1/2/3/4/5, but you have kernel size {kernel_size}" - - if type(stride) == int: - assert stride in ( - 1, 2, 4), "in Conv2dInt op, stride size must be 1/2/4, but you have stride size {stride}" - elif type(stride) == tuple: - assert stride[0] in (1, 2, 4) and stride[1] in ( - 1, 2, 4), "in Conv2dInt op, stride size must be 1/2/4, but you have stride size {stride}" - - if type(padding) == int: - assert padding in ( - 0, 1, 2, 3, 4), "in Conv2dInt op, padding size must be 1/2/4, but you have padding size {padding}" - elif type(padding) == tuple: - assert padding[0] in (0, 1, 2, 3, 
4) and padding[1] in ( - 0, 1, 2, 3, 4), "in Conv2dInt op, padding size must be 1/2/3/4/5, but you have padding size {padding}" - - if type(kernel_size) == int: - kernel_size_h = kernel_size - kernel_size_w = kernel_size - elif type(kernel_size) == tuple: - kernel_size_h = kernel_size[0] - kernel_size_w = kernel_size[1] - else: - assert False, "kernel size type error." - # if (groups != in_channels): - # assert math.ceil(in_channels/8) * 8 * kernel_size_h * kernel_size_w * math.ceil(out_channels/2) * 2 <= 32 * \ - # 1024, f"in Conv2dInt op, kernel must meet the requirements of non-depthwise convolution, but you have math.ceil({in_channels}/8) * 8 * {kernel_size_h} * {kernel_size_w} * math.ceil({out_channels}/2) * 2 <= 32 * 1024" - # if (groups == in_channels): - # assert math.ceil(in_channels/16) * 16 * kernel_size_h * kernel_size_w <= 32 * \ - # 1024, f"in Conv2dInt op, kernel must meet the requirements of depthwise convolution, but you have math.ceil({in_channels}/16) * 16 * {kernel_size_h} * {kernel_size_w} <= 32 * 1024" - - if type(stride) == int: - stride_h = stride - stride_w = stride - elif type(stride) == tuple: - stride_h = stride[0] - stride_w = stride[1] - else: - assert False, "kernel size type error." - - if type(padding) == int: - padding_h = padding - padding_w = padding - elif type(padding) == tuple: - padding_h = padding[0] - padding_w = padding[1] - else: - assert False, "kernel size type error." 
- - assert kernel_size_h >= stride_h and kernel_size_w >= stride_w, f"kernel_size_h >= stride_h and kernel_size_w >= stride_w, but you have {kernel_size_h} < {stride_h} or {kernel_size_w} < {stride_w}" - assert padding_h < kernel_size_h and padding_w < kernel_size_w, f"pad_h < weight_h && pad_w < weight_w, but you have {padding_h} >= {kernel_size_h} or {padding_w} >= {kernel_size_w}" - - nn.Conv2d.__init__(self, in_channels, out_channels, - kernel_size, stride, padding, dilation, groups, bias) - ModuleIntConfig.__init__( - self, data_bits=data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=o_bits) - self.params = torch.Tensor([self.stride[0], self.stride[1], self.padding[0], - self.padding[1], self.dilation[0], self.dilation[1], groups]) - self.momentum = 0.1 - self.prefix = "" - self.dump = False - self.path = "" - self.ahead_relu = ahead_relu - self.ahead_sigmoid = ahead_sigmoid - self.is_not_from_iqtensor = True - self.clamp_data = clamp_data - self.clamp_weight = clamp_weight - self.clamp_bias = clamp_bias - self.mode = mode - self.register_buffer('running_x', torch.zeros(1)) - self.register_buffer('running_w', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - self.register_buffer('scale_x', torch.zeros(1)) - self.register_buffer('scale_w', torch.zeros(1)) - self.register_buffer('scale_o', torch.zeros(1)) - - def forward(self, input): - scale_x = ScalerBuffer(self.scale_x) - running_x = ScalerBuffer(self.running_x) - if isinstance(input, IQTensor): - self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits, self.mode) - scale_x = ScalerBuffer(input.scale_data) - running_x = ScalerBuffer(input.running_data) - running_w = ScalerBuffer(self.running_w) - running_o = ScalerBuffer(self.running_o) - scale_w = ScalerBuffer(self.scale_w) - scale_o = ScalerBuffer(self.scale_o) - weight = self.weight - bias = self.bias - if weight.dtype == torch.float32: 
- weight = normalize_weight_with_config( - self.weight, self.clamp_weight, self.training) - if self.bias is not None: - bias = normalize_bias_with_config( - self.bias, self.clamp_bias, self.training) - ret = ConvFunction.apply(input, weight, bias, - self.kernel_size, self.padding, self.stride, self.dilation, self.groups, - self.params, self.data_bits, self.parameter_bits, self.training, self.momentum, - running_x, running_w, running_o, scale_x, scale_w, scale_o, - self.prefix, self.dump, self.path, self.quant_mode, self.o_bits, self.quant, self.ahead_relu, self.is_not_from_iqtensor, - self.clamp_data, self.clamp_weight, self.clamp_bias, self.ahead_sigmoid) - - self.running_x.fill_(running_x()) - self.running_w.fill_(running_w()) - self.running_o.fill_(running_o()) - self.scale_x.fill_(scale_x()) - self.scale_w.fill_(scale_w()) - self.scale_o.fill_(scale_o()) - return ret - - def state_dict(self, destination=None, prefix='', keep_vars=False): - return ModuleIntConfig.state_dict_global(self, destination, prefix, keep_vars) - - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) - - def extra_repr(self): - s = nn.Conv2d.extra_repr(self) - extra_s = ',clamp_data:{clamp_data},clamp_weight:{clamp_weight},clamp_bias:{clamp_bias},ahead_relu:{ahead_relu},ahead_sigmoid:{ahead_sigmoid}'.format( - **self.__dict__) - extra_s += ',data_bits:{data_bits},parameter_bits:{parameter_bits},o_bits:{o_bits}'.format( - **self.__dict__) - return s+extra_s diff --git a/linger/ops/convtranspose_int.py b/linger/ops/convtranspose_int.py deleted file mode 100644 index 8ecc2db..0000000 --- a/linger/ops/convtranspose_int.py +++ /dev/null @@ -1,353 +0,0 @@ -import torch -import torch.nn as nn -import torch.nn.functional as F - -from ..config import config -from ..quant import (Quant, 
normalize_bias_with_config, - normalize_data_with_config, normalize_weight_with_config) -from ..utils import Dump, PlatFormQuant, QuantMode, ScalerBuffer -from .iqtensor import (IQTensor, from_torch_tensor, platform_to_string, - quantlinear) -from .ops import ModuleIntConfig -from .requant import Requant - - -class ConvTransposeFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, weights, bias, kernel_size, padding, output_padding, stride, dilation, groups, - data_bits, parameter_bits, training, momentum, running_x, running_w, running_o, eval_scale_x, eval_scale_w, eval_scale_o, - prefix, dump, path, mode, o_bits, quant, is_not_from_iqtensor, clamp_data, clamp_weight, clamp_bias): - scale_o = None - if training: - ctx.clamp_data = clamp_data - ctx.clamp_weight = clamp_weight - ctx.clamp_bias = clamp_bias - zero_point = input.zero_point if isinstance(input, IQTensor) else 0 - is_iq_tensor = True if isinstance(input, IQTensor) else False - ctx.value = zero_point, is_iq_tensor - ctx.bits = data_bits, parameter_bits, o_bits - - saved_tensors = [input, weights, bias] - if isinstance(input, IQTensor): - q_input, _, max_value_x = quant.quant( - input, data_bits, eval_scale_x, mode=QuantMode.QValue, quant_data='input', iq_zero_point=input.zero_point) - scale_x = eval_scale_x - else: - q_input, scale_x, max_value_x = quant.quant( - input, data_bits, mode=mode, quant_data='input') - running_x.mul_(1-momentum).add_(momentum*max_value_x) - q_weights, scale_w, max_value_w = quant.quant( - weights, parameter_bits, mode=mode, quant_data='weight') - running_w.mul_(1-momentum).add_(momentum*max_value_w) - q_input = q_input.float() if data_bits + parameter_bits <= 16 else q_input.double() - q_weights = q_weights.float() if data_bits + \ - parameter_bits <= 16 else q_weights.double() - q_outputs = F.conv_transpose2d(q_input, q_weights, stride=stride, padding=padding, - output_padding=output_padding, dilation=dilation, groups=groups) - outputs = None - q_bias = 
None - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - assert mode == QuantMode.QValue, 'castor quant only support QValue and o_bits=None' - if bias is not None: - q_bias = (bias * scale_w * scale_x + 0.5).floor() - if data_bits + parameter_bits <= 16: - q_bias = q_bias.float() - else: - q_bias = q_bias.double() - q_outputs += q_bias.reshape(1, -1, 1, 1) - outputs = quant.dequant(q_outputs, scale_x*scale_w) - else: - assert False, "linger only support luna quant." - - # f_input = quant.dequant(q_input, scale_x) - # f_weights = quant.dequant(q_weights, scale_w) - # f_bias = None if bias is None else quant.dequant(q_bias, scale_x*scale_w) - # saved_tensors = [f_input, f_weights, f_bias] - ctx.stride = stride - ctx.padding = padding - ctx.output_padding = output_padding - ctx.dilation = dilation - ctx.groups = groups - out_tensor = outputs - mask_output = None - if o_bits is not None: - outputs = normalize_data_with_config(outputs, clamp_data) - out_tensor = outputs - q_outputs, scale_o, max_value_o = quant.quant( - outputs, o_bits, mode=mode, quant_data='output') - outputs = quant.dequant(q_outputs, scale_o) - running_o.mul_(1-momentum).add_(momentum*max_value_o) - ctx.scale = scale_x, scale_w, scale_o - saved_tensors += [out_tensor] - ctx.save_for_backward(*saved_tensors) - else: - assert running_x > 0, 'invalid running_x, please finetune training' - q_weights = None - scale_o = eval_scale_o - if weights.dtype == torch.float32: - scale_x = ScalerBuffer(quant.running_to_scale( - running_x, data_bits, mode=mode)) - q_weights, scale_w, _ = quant.quant( - weights, parameter_bits, mode=mode, quant_data='weight') - scale_w = ScalerBuffer(scale_w) - if o_bits is not None: - assert running_o > 0, 'invalid running_o <= 0, please finetune training' - scale_o = ScalerBuffer(quant.running_to_scale( - running_o, o_bits, mode=mode)) - else: - scale_x = eval_scale_x - scale_w = eval_scale_w - q_weights = weights.double() - if o_bits is not None: - scale_o 
= eval_scale_o - - if isinstance(input, IQTensor): - scale_x = eval_scale_x - q_input, scale_x, _ = quant.quant( - input, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input', iq_zero_point=input.zero_point) - else: - q_input, scale_x, _ = quant.quant( - input, data_bits, scale_x, mode=mode, quant_data='input') - q_input = q_input.double() - q_weights = q_weights.double() - q_outputs = F.conv_transpose2d(q_input, q_weights, stride=stride, padding=padding, - output_padding=output_padding, dilation=dilation, groups=groups) - outputs = None - q_bias = None - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - assert mode == QuantMode.QValue, 'castor quant only support QValue and o_bits=None' - if bias is not None: - if bias.dtype == torch.float32: - q_bias = (bias * scale_w * scale_x + 0.5).floor() - if data_bits + parameter_bits <= 16: - q_bias = q_bias.float().double() - else: - q_bias = q_bias.double() - else: - q_bias = bias.double() - q_outputs += q_bias.reshape(1, -1, 1, 1) - outputs = quant.dequant(q_outputs, scale_x*scale_w) - else: - assert False, "linger only support luna quant." 
- if o_bits is not None: - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - q_outputs, _, _ = quant.quant( - outputs, o_bits, scale_o, mode=mode, quant_data='output') - outputs = quant.dequant(q_outputs, scale_o) - if dump: - if bias is not None: - name_list = ["input", "weights", "bias", "outputs", "q_input", "q_weights", "q_bias", - "q_outputs", "scale_x", "scale_w", "scale_o", "running_x", "running_w", "running_o"] - attr_list = [input, weights, bias, outputs, q_input, q_weights, q_bias, q_outputs, - scale_x.data, scale_w.data, scale_o.data, running_x.data, running_w.data, running_o.data] - Dump.dump_file(prefix, ".ConvTranspose2dInt.", - zip(name_list, attr_list), path) - else: - name_list = ["input", "weights", "outputs", "q_input", "q_weights", "q_outputs", - "scale_x", "scale_w", "scale_o", "running_x", "running_w", "running_o"] - attr_list = [input, weights, outputs, q_input, q_weights, q_outputs, scale_x.data, - scale_w.data, scale_o.data, running_x.data, running_w.data, running_o.data] - Dump.dump_file(prefix, ".ConvTranspose2dInt.", - zip(name_list, attr_list), path) - if o_bits is None: - return outputs - elif isinstance(scale_o, float): - return from_torch_tensor(outputs, scale_o, o_bits) - elif isinstance(scale_o, torch.Tensor): - return from_torch_tensor(outputs, scale_o.item(), o_bits) - else: - return from_torch_tensor(outputs, scale_o.data, o_bits) - - @staticmethod - def backward(ctx, gradOutput): - clamp_data = ctx.clamp_data - data_bits, parameter_bits, o_bits = ctx.bits - zero_point, is_iq_tensor = ctx.value - scale_x, scale_w, scale_o = ctx.scale - input, weights, bias, outputs = ctx.saved_tensors - # input = input.detach().clone().requires_grad_(True) - # weights = weights.detach().clone().requires_grad_(True) - if is_iq_tensor: - f_input = input.data - else: - q_input, _, _ = Quant.quant( - input.data, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input') - f_input = Quant.dequant(q_input, scale_x) - f_input = 
f_input.detach().clone().requires_grad_(True) - q_weights, _, _ = Quant.quant( - weights.data, parameter_bits, scale_w, mode=QuantMode.QValue, quant_data='weight') - f_weights = Quant.dequant(q_weights, scale_w) - f_weights = f_weights.detach().clone().requires_grad_(True) - bias = None if bias is None else bias.detach().clone().requires_grad_(True) - - stride = ctx.stride[0], ctx.stride[1] - padding = ctx.padding[0], ctx.padding[1] - output_padding = ctx.output_padding[0], ctx.output_padding[1] - dilation = ctx.dilation[0], ctx.dilation[1] - groups = ctx.groups - gradBias = None - - with torch.enable_grad(): - z = F.conv_transpose2d( - f_input, f_weights, bias, stride, padding, output_padding, groups, dilation) - if o_bits is not None: - z = normalize_data_with_config(z, clamp_data) - if bias is not None: - gradInput, gradWeight, gradBias = torch.autograd.grad( - z, (f_input, f_weights, bias), gradOutput) - else: - gradInput, gradWeight, = torch.autograd.grad( - z, (f_input, f_weights), gradOutput) - - return gradInput, gradWeight, gradBias, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None,\ - None, None, None, None, None, None, None, None, None, None - - @staticmethod - def symbolic(g, input, weights, bias, kernel_size, padding, output_padding, stride, dilation, groups, - data_bits, parameter_bits, training, momentum, running_x, running_w, running_o, scale_x, scale_w, scale_o, - prefix, dump, path, mode, o_bits, quant, is_not_from_iqtensor, clamp_data, clamp_weight, clamp_bias): - op_inner = None - if is_not_from_iqtensor: - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - op_inner = quantlinear(g, input, scale_x(), - platform_quant, data_bits) - input_list = [op_inner, weights] - else: - input_list = [input, weights] - padding0 = padding - paddings = padding+padding0 - param_dict = {'scale_x_f': scale_x(), 'scale_w_f': scale_w(), - 'dilations_i': dilation, 'group_i': groups, 
'kernel_shape_i': kernel_size, 'pads_i': paddings, 'output_padding_i': output_padding, 'strides_i': stride, - 'data_bits_i': data_bits, 'parameter_bits_i': parameter_bits} - if bias is not None: - input_list.append(bias) - if o_bits is not None: - param_dict['scale_o_f'] = scale_o() - param_dict['o_bits_i'] = o_bits - - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - param_dict['platform_quant_s'] = platform_quant - - return g.op("thinker::ConvTranspose2dInt", *input_list, **param_dict) - - -class ConvTranspose2dInt(nn.ConvTranspose2d, ModuleIntConfig): - r"""实现ConvTranspose2dInt的量化训练与测试,继承自ConvTranspose2d, - - Args: - in_channels out_channels kernel_size stride padding output_padding groups bias dilation padding_mode - 与nn.ConvTranspose2d一致的参数 - data_bits(int): 输入量化位数 - parameter_bits(int): 参数量化位数 - mode(Enum): 量化方式,支持MaxValue与Qvalue - o_bits(int, default=None):输出量化位数 - clamp_data(float or None): 针对输出的clamp数值 - clamp_weight(float or None): 针对转为weight的clamp数值 - clamp_bias(float or None): 与clamp_weight一致 - ahead_relu(bool): 是否做融合relu之后的数值统计scale - - """ - - def __init__(self, in_channels, out_channels, kernel_size, stride=1, - padding=0, output_padding=0, groups=1, bias=True, dilation=1, - padding_mode='zeros', data_bits=8, parameter_bits=8, mode=QuantMode.QValue, o_bits=None, - clamp_data=None, clamp_weight=None, clamp_bias=None): - # venus limits - if type(stride) == int: - if stride == 2: - assert kernel_size in ( - 2, 3, 4, 5), f"in ConvTranspose2dInt op, when stride_h == 2, kernel_h must be in (2,3,4,5), but you have kernel_size: {kernel_size}" - elif stride == 4: - assert kernel_size in ( - 4, 5), f"in ConvTranspose2dInt op, when stride_h == 4, kernel_h must be in (4,5), but you have kernel_size: {kernel_size}" - else: - if stride[1] == 2: - assert kernel_size[1] in ( - 2, 3, 4, 5), f"in ConvTranspose2dInt op, when stride_h == 2, kernel_h must be in (2,3,4,5), but you have kernel_size[1]: {kernel_size[1]}" - if stride[1] == 4: - 
assert kernel_size[1] in ( - 4, 5), f"in ConvTranspose2dInt op, when stride_h == 4, kernel_h must be in (4,5), but you have kernel_size[1]: {kernel_size[1]}" - if stride[0] == 2: - assert kernel_size[0] in ( - 2, 3, 4, 5), f"in ConvTranspose2dInt op, when stride_w == 2, kernel_w must be in (2,3,4,5), but you have kernel_size[0]: {kernel_size[0]}" - if stride[0] == 4: - assert kernel_size[0] in ( - 4, 5), f"in ConvTranspose2dInt op, when stride_w == 4, kernel_w must be in (4,5), but you have kernel_size[0]: {kernel_size[0]}" - - nn.ConvTranspose2d.__init__(self, in_channels, out_channels, kernel_size, - stride, padding, output_padding, groups, bias, dilation, padding_mode) - ModuleIntConfig.__init__( - self, data_bits=data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=o_bits) - - self.momentum = 0.1 - self.prefix = "" - self.dump = False - self.path = "" - self.is_not_from_iqtensor = True - self.clamp_data = clamp_data - self.clamp_weight = clamp_weight - self.clamp_bias = clamp_bias - self.register_buffer('running_x', torch.zeros(1)) - self.register_buffer('running_w', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - self.register_buffer('scale_x', torch.zeros(1)) - self.register_buffer('scale_w', torch.zeros(1)) - self.register_buffer('scale_o', torch.zeros(1)) - - def forward(self, input): - scale_x = ScalerBuffer(self.scale_x) - running_x = ScalerBuffer(self.running_x) - if isinstance(input, IQTensor): - self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits) - scale_x = ScalerBuffer(input.scale_data) - running_x = ScalerBuffer(input.running_data) - running_w = ScalerBuffer(self.running_w) - running_o = ScalerBuffer(self.running_o) - - scale_w = ScalerBuffer(self.scale_w) - scale_o = ScalerBuffer(self.scale_o) - - weight = self.weight - bias = self.bias - if weight.dtype == torch.float32: - weight = normalize_weight_with_config( - 
self.weight, self.clamp_weight, self.training) - if self.bias is not None: - bias = normalize_bias_with_config( - self.bias, self.clamp_bias, self.training) - ret = ConvTransposeFunction.apply(input, weight, bias, - self.kernel_size, self.padding, self.output_padding, self.stride, self.dilation, self.groups, - self.data_bits, self.parameter_bits, self.training, self.momentum, - running_x, running_w, running_o, scale_x, scale_w, scale_o, - self.prefix, self.dump, self.path, self.quant_mode, self.o_bits, self.quant, self.is_not_from_iqtensor, - self.clamp_data, self.clamp_weight, self.clamp_bias) - self.running_x.fill_(running_x()) - self.running_w.fill_(running_w()) - self.running_o.fill_(running_o()) - self.scale_x.fill_(scale_x()) - self.scale_w.fill_(scale_w()) - self.scale_o.fill_(scale_o()) - - return ret - - def state_dict(self, destination=None, prefix='', keep_vars=False): - return ModuleIntConfig.state_dict_global(self, destination, prefix, keep_vars) - - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) - - def extra_repr(self): - s = nn.ConvTranspose2d.extra_repr(self) - extra_s = ',clamp_data:{clamp_data},clamp_weight:{clamp_weight},clamp_bias:{clamp_bias}'.format( - **self.__dict__) - extra_s += ',data_bits:{data_bits},parameter_bits:{parameter_bits},o_bits:{o_bits}'.format( - **self.__dict__) - - return s+extra_s diff --git a/linger/ops/embedding_int.py b/linger/ops/embedding_int.py deleted file mode 100644 index 273c7d5..0000000 --- a/linger/ops/embedding_int.py +++ /dev/null @@ -1,244 +0,0 @@ -from collections import OrderedDict -from typing import Optional - -import torch -import torch.nn as nn -import torch.nn.functional as F -from torch import Tensor -from torch.onnx import is_in_onnx_export - -from ..quant import 
normalize_data_with_config, normalize_weight_with_config -from ..utils import Dump, QuantMode, ScalerBuffer -from .iqtensor import IQTensor, from_torch_tensor -from .ops import ModuleIntConfig -from .requant import Requant -from ..quant import Quant - - -class EmbeddingFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, num_embeddings, embedding_dim, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse, - weight, data_bits, parameter_bits, training, momentum, running_x, running_w, running_o, eval_scale_x, eval_scale_w, eval_scale_o, - prefix, dump, path, mode, o_bits, quant, is_not_from_iqtensor, clamp_data, clamp_weight, clamp_bias): - scale_o = None - if training: - ctx.clamp_data = clamp_data - ctx.clamp_weight = clamp_weight - ctx.clamp_bias = clamp_bias - ctx.bits = o_bits, parameter_bits - saved_tensors = [input, weight] - ctx.params = [num_embeddings, embedding_dim, padding_idx, - max_norm, norm_type, scale_grad_by_freq, sparse, ] - - # weights = normalize_weight_with_config(weight, clamp_weight, False) - weights = weight - q_weights, scale_w, max_value_w = quant.quant( - weights, parameter_bits, mode=mode, quant_data='weight') - running_w.mul_(1-momentum).add_(momentum*max_value_w) - q_weights = q_weights.float() if data_bits + \ - parameter_bits <= 16 else q_weights.double() - q_outputs = F.embedding( - input, q_weights, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse) - - ctx.save_for_backward(*saved_tensors) - - outputs = quant.dequant(q_outputs, scale_w) - if o_bits is not None: - running_o.fill_(running_w()) - scale_o = scale_w - ctx.scale_w = scale_w - - else: - scale_x = None - scale_w = None - scale_o = eval_scale_o - q_weights = None - if weight.dtype == torch.float32: - scale_x = ScalerBuffer(quant.running_to_scale( - running_x, data_bits, mode=mode)) - weights = weight - # weights = normalize_weight_with_config( - # weight, clamp_weight, False) - q_weights, scale_w, _ = quant.quant( - weights, 
parameter_bits, mode=mode, quant_data='weight') - scale_w = ScalerBuffer(scale_w) - if o_bits is not None: - assert running_o > 0, 'invalid running_o <= 0, please finetune training' - scale_o = scale_w - else: - scale_x = eval_scale_x - scale_w = eval_scale_w - q_weights = weight.double() - if o_bits is not None: - scale_o = eval_scale_w - q_weights = q_weights.double() - q_outputs = F.embedding( - input, q_weights, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse) - - outputs = quant.dequant(q_outputs, scale_w) - if dump: - name_list = ["input", "outputs", "q_outputs", "q_weights", - "scale_x", "scale_w" "scale_o", "running_x", "running_o"] - attr_list = [input, outputs, q_outputs, q_weights, scale_x.data, - scale_w.data, scale_o.data, running_x.data, running_o.data] - Dump.dump_file(prefix, ".EmbeddingInt.", - zip(name_list, attr_list), path) - - if o_bits is None: - return outputs - elif isinstance(scale_o, float): - return from_torch_tensor(outputs, scale_o, o_bits) - elif isinstance(scale_o, torch.Tensor): - return from_torch_tensor(outputs, scale_o.item(), o_bits) - else: - return from_torch_tensor(outputs, scale_o.data, o_bits) - - @staticmethod - def backward(ctx, gradOutput): - clamp_data = ctx.clamp_data - o_bits, parameter_bits = ctx.bits - scale_w = ctx.scale_w - input, weight, = ctx.saved_tensors - num_embeddings, embedding_dim, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse, = ctx.params - q_weights, _, _ = Quant.quant( - weight.data, o_bits, scale_w, mode=QuantMode.QValue, quant_data='weight') - f_weights = Quant.dequant(q_weights, scale_w) - f_weights = f_weights.detach().clone().requires_grad_(True) - with torch.enable_grad(): - # weights = normalize_weight_with_config(weight, clamp_weight, True) - # weights = weight - z = F.embedding(input, f_weights, padding_idx, max_norm, - norm_type, scale_grad_by_freq, sparse) - if o_bits is not None: - z = normalize_data_with_config( - z, clamp_data) - gradWeight, = 
torch.autograd.grad(z, (f_weights), gradOutput) - return None, None, None, None, None, None, None, None, gradWeight,\ - None, None, None, None, None, None, None, None, None, None,\ - None, None, None, None, None, None, None, None, None, None, None - - @staticmethod - def symbolic(g, input, num_embeddings, embedding_dim, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse, - weight, data_bits, parameter_bits, training, momentum, running_x, running_w, running_o, eval_scale_x, eval_scale_w, eval_scale_o, - prefix, dump, path, mode, o_bits, quant, is_not_from_iqtensor, clamp_data, clamp_weight, clamp_bias): - - return g.op("thinker::Gather", weight, input, scale_w_f=eval_scale_w(), scale_o_f=eval_scale_o(), parameter_bits_i=parameter_bits, o_bits_i=o_bits) - - -class EmbeddingInt(nn.Embedding, ModuleIntConfig): - def __init__(self, num_embeddings: int, embedding_dim: int, padding_idx: Optional[int] = None, - max_norm: Optional[float] = None, norm_type: float = 2., scale_grad_by_freq: bool = False, - sparse: bool = False, _weight: Optional[Tensor] = None, - data_bits=8, parameter_bits=8, mode=QuantMode.QValue, o_bits=None, - clamp_data=None, clamp_weight=None, clamp_bias=None): - nn.Embedding.__init__(self, num_embeddings, embedding_dim, padding_idx, - max_norm, norm_type, scale_grad_by_freq, sparse, _weight) - ModuleIntConfig.__init__( - self, data_bits=data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=o_bits) - self.momentum = 0.1 - self.prefix = "" - self.dump = False - self.path = "" - self.is_not_from_iqtensor = True - self.clamp_data = clamp_data - self.clamp_weight = clamp_weight - self.clamp_bias = clamp_bias - self.mode = mode - self.register_buffer('running_x', torch.zeros(1)) - self.register_buffer('running_w', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - self.register_buffer('scale_x', torch.zeros(1)) - self.register_buffer('scale_w', torch.zeros(1)) - self.register_buffer('scale_o', torch.zeros(1)) - - def 
forward(self, input): - scale_x = ScalerBuffer(self.scale_x) - running_x = ScalerBuffer(self.running_x) - if isinstance(input, IQTensor): - # assert False, "embeddingint input shouldn't be IQTensor !" - self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits, self.mode) - scale_x = ScalerBuffer(input.scale_data) - running_x = ScalerBuffer(input.running_data) - running_w = ScalerBuffer(self.running_w) - running_o = ScalerBuffer(self.running_o) - scale_w = ScalerBuffer(self.scale_w) - scale_o = ScalerBuffer(self.scale_o) - - weights = self.weight - if self.weight.dtype == torch.float32: - weights = normalize_weight_with_config( - self.weight, self.clamp_weight, self.training) - - ret = EmbeddingFunction.apply(input, self.num_embeddings, self.embedding_dim, self.padding_idx, self.max_norm, self.norm_type, self.scale_grad_by_freq, self.sparse, - weights, self.data_bits, self.parameter_bits, self.training, self.momentum, - running_x, running_w, running_o, scale_x, scale_w, scale_o, - self.prefix, self.dump, self.path, self.quant_mode, self.o_bits, self.quant, self.is_not_from_iqtensor, - self.clamp_data, self.clamp_weight, self.clamp_bias) - self.running_x.fill_(running_x()) - self.running_w.fill_(running_w()) - self.running_o.fill_(running_o()) - self.scale_x.fill_(scale_x()) - self.scale_w.fill_(scale_w()) - self.scale_o.fill_(scale_o()) - return ret - - def state_dict(module, destination=None, prefix='', keep_vars=False): - if destination is None: - destination = OrderedDict() - destination._metadata = OrderedDict() - destination._metadata[prefix[:-1] - ] = local_metadata = dict(version=module._version) - if is_in_onnx_export(): - assert module._buffers['running_w'] > 0, 'invalid running_w, please finetune first' - - if 'scale_w' in module._buffers and module._parameters['weight'].dtype == torch.float: - weights = module._parameters['weight'].data - q_weights, scale_mul_w, _ = 
module.quant.quant( - weights, module.parameter_bits, mode=module.quant_mode) - if is_in_onnx_export(): - scale_w = ScalerBuffer(scale_mul_w) - module._buffers['scale_w'].data.fill_(scale_w()) - scale_o = scale_w - module._buffers['scale_o'].data.fill_(scale_o()) - - weight_tensor = module._parameters['weight'] - if is_in_onnx_export(): - - if module.parameter_bits <= 8: - weight_tensor.data = q_weights.char() - weight_tensor.char() - - elif module.parameter_bits <= 16: - weight_tensor.data = q_weights.short() - weight_tensor.short() - - else: - weight_tensor.data = q_weights.int() - weight_tensor.int() - - module._save_to_state_dict(destination, prefix, keep_vars) - for name, module in module._modules.items(): - if module is not None: - module.state_dict(destination, prefix + - name + '.', keep_vars=keep_vars) - for hook in module._state_dict_hooks.values(): - hook_result = hook(module, destination, prefix, local_metadata) - if hook_result is not None: - destination = hook_result - return destination - - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) - - def extra_repr(self): - s = nn.Embedding.extra_repr(self) - extra_s = ',clamp_data:{clamp_data},clamp_weight:{clamp_weight},clamp_bias:{clamp_bias}'.format( - **self.__dict__) - extra_s += ',data_bits:{data_bits},parameter_bits:{parameter_bits},o_bits:{o_bits}'.format( - **self.__dict__) - return s+extra_s diff --git a/linger/ops/gru_int.py b/linger/ops/gru_int.py deleted file mode 100644 index 83cb8d4..0000000 --- a/linger/ops/gru_int.py +++ /dev/null @@ -1,1222 +0,0 @@ -import copy -import math -from collections import OrderedDict - -import lingerext -import torch -import torch.nn as nn -import torch.nn.functional as F -from torch.nn.utils.rnn import PackedSequence -from torch.onnx import 
is_in_onnx_export - -from ..config import config -from ..quant import (normalize_bias_with_config, normalize_data_with_config, - normalize_weight_with_config) -from ..utils import (Dump, PlatFormQuant, QuantMode, ScalerBuffer, _slice, - _unbind, _unbind_packed, get_max_value, hx_slice) -from .iqtensor import (IQTensor, Quant2IQTensor, from_torch_tensor, - platform_to_string, quantlinear) -from .linger_functional import (iqCat, torch_pack_padded_sequence, - torch_pad_packed_sequence) -from .ops import ModuleIntConfig -from .requant import Requant - -iqcat_sym = iqCat.symbolic - - -def castor_luna_sigmoid(x_int, scale_x=2048.0): - l_scale = 11 - int(math.log2(scale_x)) - - if l_scale > 0: - x_int = x_int * pow(2, l_scale) - else: - x_int = (x_int * pow(2, l_scale) + 0.5).floor().int() - y_int = lingerext.luna_iqsigmoid(x_int.contiguous(), float(scale_x)) - y_int.clamp_(0, 2**7-1) - return y_int - - -def castor_luna_tanh(x_int, scale_x=2048.0): - l_scale = 11 - int(math.log2(scale_x)) - - if l_scale > 0: - x_int = x_int * pow(2, l_scale) - else: - x_int = (x_int * pow(2, l_scale) + 0.5).floor().int() - y_int = lingerext.luna_iqtanh(x_int.contiguous(), float(scale_x)) - y_int.clamp_(-2**7, 2**7-1) - return y_int - - -def castor_luna_requant(x_int, scale_x=2048.0): - l_scale = 11 - int(math.log2(scale_x)) - - if l_scale > 0: - x_int = x_int * pow(2, l_scale) - else: - x_int = (x_int * pow(2, l_scale) + 0.5).floor().int() - return x_int - - -class GRUCellFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, hidden, weight_ih, weight_hh, bias_ih, bias_hh, - data_bits, parameter_bits, o_bits, - running_i, running_h, running_iw, running_hw, running_io, running_ho, running_o, - scale_i, scale_h, scale_iw, scale_hw, scale_io, scale_ho, scale_o, - momentum, training, prefix, dump, path, mode, quant, - is_not_from_iqtensor, clamp_data): - if training: - ctx.clamp_data = clamp_data - ctx.o_bits = o_bits - save_tensors = [input, hidden, weight_ih, - 
weight_hh, bias_ih, bias_hh] - q_input, scale_i, max_value_ix = quant.quant( - input.data, data_bits, mode=mode, quant_data='input') - q_iweight, scale_iw, max_value_iw = quant.quant( - weight_ih, parameter_bits, mode=mode, quant_data='weight') - running_iw.mul_(1-momentum).add_(momentum*max_value_iw) - - q_hidden, scale_h, max_value_hx = quant.quant( - hidden, data_bits, mode=mode, quant_data='input') - q_hweight, scale_hw, max_value_hw = quant.quant( - weight_hh, parameter_bits, mode=mode, quant_data='weight') - running_hw.mul_(1-momentum).add_(momentum*max_value_hw) - - q_input = q_input.float() if data_bits + parameter_bits <= 16 else q_input.double() - q_iweight = q_iweight.float() if data_bits + \ - parameter_bits <= 16 else q_iweight.double() - q_hidden = q_hidden.float() if data_bits + \ - parameter_bits <= 16 else q_hidden.double() - q_hweight = q_hweight.float() if data_bits + \ - parameter_bits <= 16 else q_hweight.double() - - q_gi_outputs = F.linear(q_input, q_iweight) - q_gh_outputs = F.linear(q_hidden, q_hweight) - - if bias_ih is not None: - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - q_ibias = (bias_ih * scale_iw * - scale_i + 0.5).floor().int() - if data_bits + parameter_bits <= 16: - q_ibias = q_ibias.float() - else: - q_ibias = q_ibias.double() - else: - assert False, "linger only support luna quant." - # q_ibias = q_ibias.int() - q_gi_outputs += q_ibias.view(-1) - - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - q_hbias = (bias_hh * scale_hw * - scale_h + 0.5).floor().int() - if data_bits + parameter_bits <= 16: - q_hbias = q_hbias.float() - else: - q_hbias = q_hbias.double() - else: - assert False, "linger only support luna quant." 
- q_gh_outputs += q_hbias.view(-1) - - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: # QX+QW -> Q11 - l_scale_gi = 11 - int(math.log2(scale_i()*scale_iw())) - if l_scale_gi > 0: - gi = q_gi_outputs * pow(2, l_scale_gi) - else: - gi = (q_gi_outputs * pow(2, l_scale_gi) + 0.5).floor().int() - - l_scale_gh = 11 - int(math.log2(scale_h()*scale_hw())) - if l_scale_gh > 0: - gh = q_gh_outputs * pow(2, l_scale_gh) - else: - gh = (q_gh_outputs * pow(2, l_scale_gh) + 0.5).floor().int() - else: # QX+QW -> Q10 - assert False, "linger only support luna quant." - - for_backward_gi = quant.dequant(q_gi_outputs, scale_i*scale_iw) - for_backward_gh = quant.dequant(q_gh_outputs, scale_h*scale_hw) - save_tensors += [for_backward_gi, for_backward_gh] - - i_r, i_i, i_n = gi.chunk(3, 1) - h_r, h_i, h_n = gh.chunk(3, 1) - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - resetgate = castor_luna_sigmoid(i_r + h_r) # Q11->Q7 - inputgate = castor_luna_sigmoid(i_i + h_i) # Q11->Q7 - resetgate_h_n = resetgate * h_n # Q7*Q11->Q18 - resetgate_h_n = castor_luna_requant( - resetgate_h_n, 2**18) # Q18->Q11 - newgate = castor_luna_tanh(i_n + resetgate_h_n) # Q11 ->Q7 - new_hidden = (hidden*(2**7) + 0.5).floor().int() - # Q7*Q7+Q7*(Q7-Q7)->Q14 - hy = newgate * (2**7) + inputgate * (new_hidden - newgate) - hy = quant.dequant(hy, float(2**14)) - else: - assert False, "linger only support luna quant." 
- - save_tensors += [hy.clone()] - - if o_bits is not None: - hy = normalize_data_with_config(hy, clamp_data) - q_hy_outputs, scale_o, max_value_o = quant.quant( - hy, o_bits, mode=mode, quant_data='output') - running_o.mul_(1-momentum).add_(momentum*max_value_o) - running_h.mul_(1-momentum).add_(momentum*max_value_o) - hy = quant.dequant(q_hy_outputs, scale_o) - else: # 为None 时 也得保证 running_h的实际值为running_o - _, _, fake_max_value_o = quant.quant( - hy, 8, mode=mode, quant_data='output') - running_h.mul_(1-momentum).add_(momentum*fake_max_value_o) - - ctx.save_for_backward(*save_tensors) - else: - assert running_i > 0, 'invalid running_i <= 0, please finetune training' - if weight_ih.dtype == torch.float32: - if is_not_from_iqtensor: - scale_i = ScalerBuffer(quant.running_to_scale( - running_i, data_bits, mode=mode)) - scale_h = ScalerBuffer(quant.running_to_scale( - running_h, data_bits, mode=mode)) - scale_iw = ScalerBuffer(quant.running_to_scale( - running_iw, parameter_bits, mode=mode)) - scale_hw = ScalerBuffer(quant.running_to_scale( - running_hw, parameter_bits, mode=mode)) - - q_input, scale_i, _ = quant.quant( - input.data, data_bits, scale_i, mode=mode, quant_data='input') - q_hidden, scale_h, _ = quant.quant( - hidden, data_bits, scale_h, mode=mode, quant_data='input') - q_iweight = None - q_hweight = None - if weight_ih.dtype == torch.float32: - q_iweight, _, _ = quant.quant( - weight_ih, parameter_bits, scale_iw, mode=mode, quant_data='weight') - q_hweight, _, _ = quant.quant( - weight_hh, parameter_bits, scale_hw, mode=mode, quant_data='weight') - else: - q_iweight = weight_ih.double() - q_hweight = weight_hh.double() - - q_input = q_input.double() - q_iweight = q_iweight.double() - q_hidden = q_hidden.double() - q_hweight = q_hweight.double() - q_gi_outputs = F.linear(q_input, q_iweight) - q_gh_outputs = F.linear(q_hidden, q_hweight) - - if bias_ih is not None: - if bias_ih.dtype == torch.float32: - if config.PlatFormQuant.platform_quant == 
PlatFormQuant.luna_quant: - q_ibias = (bias_ih * scale_iw * - scale_i + 0.5).floor().int() - q_hbias = (bias_hh * scale_hw * - scale_h + 0.5).floor().int() - if data_bits + parameter_bits <= 16: - q_ibias = q_ibias.float().double() - q_hbias = q_hbias.float().double() - else: - q_ibias = q_ibias.double() - q_hbias = q_hbias.double() - else: - assert False, "linger only support luna quant." - - else: - q_ibias = bias_ih.double() - q_hbias = bias_hh.double() - q_gi_outputs += q_ibias.view(-1) - q_gh_outputs += q_hbias.view(-1) - - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: # QX+QW -> Q11 - l_scale_gi = 11 - int(math.log2(scale_i()*scale_iw())) - if l_scale_gi > 0: - gi = q_gi_outputs * pow(2, l_scale_gi) - else: - gi = (q_gi_outputs * pow(2, l_scale_gi) + 0.5).floor().int() - - l_scale_gh = 11 - int(math.log2(scale_h()*scale_hw())) - if l_scale_gh > 0: - gh = q_gh_outputs * pow(2, l_scale_gh) - else: - gh = (q_gh_outputs * pow(2, l_scale_gh) + 0.5).floor().int() - else: # QX+QW -> Q10 - assert False, "linger only support luna quant." - - i_r, i_i, i_n = gi.chunk(3, 1) - h_r, h_i, h_n = gh.chunk(3, 1) - - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - resetgate = castor_luna_sigmoid(i_r + h_r) # Q11->Q7 - inputgate = castor_luna_sigmoid(i_i + h_i) # Q11->Q7 - resetgate_h_n = resetgate * h_n # Q7*Q11->Q18 - resetgate_h_n = castor_luna_requant( - resetgate_h_n, 2**18) # Q18->Q11 - newgate = castor_luna_tanh(i_n + resetgate_h_n) # Q11 ->Q7 - new_hidden = (hidden*(2**7) + 0.5).floor().int() - # Q7*Q7+Q7*(Q7-Q7)->Q14 - hy = newgate * (2**7) + inputgate * (new_hidden - newgate) - hy = quant.dequant(hy, float(2**14)) - else: - assert False, "linger only support luna quant." 
- - if o_bits is not None: - assert running_o > 0, 'invalid running_o <=0, please finetune first' - if weight_ih.dtype == torch.float32: - scale_o = ScalerBuffer(quant.running_to_scale( - running_o, o_bits, mode=mode)) - q_hy_outputs, _, _ = quant.quant( - hy, o_bits, scale_o, mode=mode, quant_data='output') - hy = quant.dequant(q_hy_outputs, scale_o) - - if dump: - if bias_ih is not None and bias_hh is not None: - name_list = ["input", "hidden", "q_input", "q_hidden", "gi", "gh", "q_gi_outputs", "q_gh_outputs", "q_ibias", "q_hbias", "runnung_i", "running_h", "running_iw", - "running_ih", "running_io", "running_ho", "scale_i", "scale_iw", "scale_h", "scale_hw", "resetgate", "inputgate", "newgate", "output", "q_hy_outputs"] - attr_list = [input, hidden, q_input, q_hidden, gi, gh, q_gi_outputs, q_gh_outputs, q_ibias, q_hbias, running_i, running_h, running_iw, - running_hw, running_io, running_ho, scale_i, scale_iw, scale_h, scale_hw, resetgate, inputgate, newgate, hy, q_hy_outputs] - Dump.dump_file(prefix, ".GruInt.", zip( - name_list, attr_list), path) - else: - name_list = ["input", "hidden", "q_input", "q_hidden", "gi", "gh", "q_gi_outputs", "q_gh_outputs", "runnung_i", "running_h", "running_iw", - "running_ih", "running_io", "running_ho", "scale_i", "scale_iw", "scale_h", "scale_hw", "resetgate", "inputgate", "newgate", "output", "q_hy_outputs"] - attr_list = [input, hidden, q_input, q_hidden, gi, gh, q_gi_outputs, q_gh_outputs, running_i, running_h, running_iw, - running_hw, running_io, running_ho, scale_i, scale_iw, scale_h, scale_hw, resetgate, inputgate, newgate, hy, q_hy_outputs] - Dump.dump_file(prefix, ".GruInt.", zip( - name_list, attr_list), path) - - return hy - - @staticmethod - def backward(ctx, gradOutput): - input, hidden, weight_ih, weight_hh, input_bias, hidden_bias, input_gates, hidden_gates, hy = ctx.saved_tensors - clamp_data = ctx.clamp_data - o_bits = ctx.o_bits - hy.requires_grad_(True) - hidden.requires_grad_(True) - with 
torch.enable_grad(): - if o_bits is not None: - clamp_hy = normalize_data_with_config(hy, clamp_data) - else: - clamp_hy = hy - gradOutput, = torch.autograd.grad(clamp_hy, hy, gradOutput) - grad_input_gates, grad_hidden_gates, grad_hx, grad_input_bias, grad_hidden_bias = lingerext.gru_cell_backward( - gradOutput, input_gates, hidden_gates, hidden, input_bias, hidden_bias) - grad_in = grad_input_gates.matmul(weight_ih) - grad_w_ih = grad_input_gates.t().matmul(input) - grad_h_ih = grad_hidden_gates.t().matmul(hidden) - - return grad_in, grad_hx, grad_w_ih, grad_h_ih, grad_input_bias, grad_hidden_bias, None, None, None, None, None, \ - None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, \ - None, None, None, None, None, None, None, None, None, None - - -class GRUSingleONNXFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, lengths, hidden_state, weight_ih, weight_hh, bias_ih, bias_hh, - weight_ih_reverse, weight_hh_reverse, bias_ih_reverse, bias_hh_reverse, - input_size, hidden_size, num_layers, batch_first, dropout, bidirectional, - data_bits, parameter_bits, o_bits, - scale_i, scale_iw, scale_io, scale_h, scale_hw, scale_ho, scale_o, - scale_i_reverse, scale_iw_reverse, scale_io_reverse, scale_h_reverse, scale_hw_reverse, scale_ho_reverse, scale_o_reverse, - is_not_from_iqtensor): - output = None - hidden_state = None - batch_size = None - seq_length = None - num_directions = 2 if bidirectional else 1 - if batch_first: - batch_size = input.size(0) - seq_length = input.size( - 1) if lengths is None else torch.max(lengths) - output = torch.randn(batch_size, seq_length, - hidden_size*num_directions, device=input.device) - else: - batch_size = input.size(1) - seq_length = input.size( - 0) if lengths is None else torch.max(lengths) - output = torch.randn(seq_length, batch_size, - hidden_size*num_directions, device=input.device) - hidden_state = torch.zeros( - num_directions, batch_size, 
hidden_size, device=input.device) - return output, hidden_state - - @staticmethod - def backward(ctx, gradOutput, gradHidden): - - return None, None, None, None, None, None,\ - None, None, None, None,\ - None, None, None, None, None, None,\ - None, None, None,\ - None, None, None, None, None, None, None,\ - None, None, None, None, None, None, None,\ - None, None, None, None - - @staticmethod - def symbolic(g, input, lengths, hidden_state, weight_ih, weight_hh, bias_ih, bias_hh, - weight_ih_reverse, weight_hh_reverse, bias_ih_reverse, bias_hh_reverse, - input_size, hidden_size, num_layers, batch_first, dropout, bidirectional, - data_bits, parameter_bits, o_bits, - scale_i, scale_iw, scale_io, scale_h, scale_hw, scale_ho, scale_o, - scale_i_reverse, scale_iw_reverse, scale_io_reverse, scale_h_reverse, scale_hw_reverse, scale_ho_reverse, scale_o_reverse, - is_not_from_iqtensor): - - param_dict = {'input_size_i': input_size, 'hidden_size_i': hidden_size, 'num_layers_i': num_layers, - 'batch_first_i': batch_first, 'dropout_f': 0, 'go_forward_i': True, - 'scale_i_f': scale_i(), 'scale_h_f': scale_h(), 'scale_iw_f': scale_iw(), 'scale_hw_f': scale_hw(), - 'data_bits_i': data_bits, 'parameter_bits_i': parameter_bits, - } - param_back_dict = {} - if bidirectional: - param_back_dict = {'input_size_i': input_size, 'hidden_size_i': hidden_size, 'num_layers_i': num_layers, - 'batch_first_i': batch_first, 'dropout_f': 0, 'go_forward_i': False, - 'scale_i_f': scale_i_reverse(), 'scale_h_f': scale_h_reverse(), 'scale_iw_f': scale_iw_reverse(), 'scale_hw_f': scale_hw_reverse(), - 'data_bits_i': data_bits, 'parameter_bits_i': parameter_bits, - } - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - op_inner = None - input_list = None - input_back_list = None - if is_not_from_iqtensor: - op_inner = quantlinear(g, input, scale_i(), - platform_quant, data_bits) - input_list = [op_inner, weight_ih, weight_hh] - input_back_list = [op_inner, weight_ih_reverse, 
weight_hh_reverse] - else: - input_list = [input, weight_ih, weight_hh] - input_back_list = [input, weight_ih_reverse, weight_hh_reverse] - if bias_ih is not None and bias_hh is not None: - input_list.append(bias_ih) - input_list.append(bias_hh) - input_back_list.append(bias_ih_reverse) - input_back_list.append(bias_hh_reverse) - if o_bits is not None: - param_dict['scale_o_f'] = scale_o() - param_dict['o_bits_i'] = o_bits - if bidirectional: - param_back_dict['scale_o_f'] = scale_o_reverse() - param_back_dict['o_bits_i'] = o_bits - - param_dict['platform_quant_s'] = platform_quant - param_dict['outputs'] = 2 - param_back_dict['platform_quant_s'] = platform_quant - param_back_dict['outputs'] = 2 - if lengths is None and hidden_state is None: - gru, hidden = g.op("thinker::GRUInt", *input_list, **param_dict) - if bidirectional: - gru_backward, hidden_back = g.op( - "thinker::GRUInt", *input_back_list, **param_back_dict) - if o_bits is not None: - scale_o = ScalerBuffer(min(scale_o(), scale_o_reverse())) - args = [gru, gru_backward, scale_o, scale_o_reverse] - gru = iqcat_sym(g, None, scale_o, None, 2, - None, False, None, None, None, *args) - else: - args = [gru, gru_backward] - gru = g.op("Concat", *args, axis_i=2) - hidden = g.op("Concat", hidden, hidden_back, axis_i=0) - elif lengths is not None and hidden_state is None: - input_list.insert(1, lengths) - input_back_list.insert(1, lengths) - gru, hidden = g.op("thinker::GRUInt_Is8_Is64", - *input_list, **param_dict) - if bidirectional: - gru_backward, hidden_back = g.op( - "thinker::GRUInt_Is8_Is64", *input_back_list, **param_back_dict) - if o_bits is not None: - scale_o = ScalerBuffer(min(scale_o(), scale_o_reverse())) - args = [gru, gru_backward, scale_o, scale_o_reverse] - gru = iqcat_sym(g, None, scale_o, None, 2, - None, False, None, None, None, *args) - else: - args = [gru, gru_backward] - gru = g.op("Concat", *args, axis_i=2) - hidden = g.op("Concat", hidden, hidden_back, axis_i=0) - else: - 
input_list.insert(1, lengths) - input_list.insert(2, hidden_state) - input_back_list.insert(1, lengths) - input_back_list.insert(2, hidden_state) - gru, hidden = g.op( - "thinker::GRUInt_Is8_Is64_If32", *input_list, **param_dict) - if bidirectional: - gru_backward, hidden_back = g.op( - "thinker::GRUInt_Is8_Is64_If32", *input_back_list, **param_back_dict) - if o_bits is not None: - scale_o = ScalerBuffer(min(scale_o(), scale_o_reverse())) - args = [gru, gru_backward, scale_o, scale_o_reverse] - gru = iqcat_sym(g, None, scale_o, None, 2, - None, False, None, None, None, *args) - else: - args = [gru, gru_backward] - gru = g.op("Concat", *args, axis_i=2) - hidden = g.op("Concat", hidden, hidden_back, axis_i=0) - - return gru, hidden - - -class GRUInt(nn.GRU): - r"""实现GRUInt的量化训练与测试,继承自nn.GRU, - - Args: - input_size hidden_size num_layers bias batch_first dropout bidirectional - 与nn.GRU一致的参数 - unified(bool): 确认正反向参数统计是否一致 - data_bits(int): 输入量化位数 - parameter_bits(int): 参数量化位数 - mode(Enum): 量化方式,支持MaxValue与Qvalue - o_bits(int, default=None):输出量化位数 - scale_i(np.float32): 统计的是输入scale,输入大小为(b, t, d)或(t, b, d) - scale_h(np.float32): 统计的是每一帧计算隐藏输出的最值momentum统计得到的scale - scale_iw(np.float32): 依据最终的模型参数计算得到,无统计 - scale_hw(np.float32): 依据最终模型参数计算得到,无统计参数 - scale_o(np.float32): 最终输出的统计scale - scale_reverse_*(np.float32):对应反向过程中各个scale数值 - """ - - def __init__(self, input_size, hidden_size, num_layers=1, bias=True, batch_first=False, dropout=0, bidirectional=False, unified=True, - data_bits=8, parameter_bits=8, mode=QuantMode.QValue, o_bits=None, clamp_data=None, clamp_weight=None, clamp_bias=None): - nn.GRU.__init__(self, input_size, hidden_size, num_layers, - bias, batch_first, dropout, bidirectional) - ModuleIntConfig.__init__( - self, data_bits=data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=o_bits) - self.prefix = "" - self.dump = False - self.path = "" - self.unified = unified - self.momentum = 0.1 - self.is_not_from_iqtensor = True - self.clamp_data = 
clamp_data - self.clamp_weight = clamp_weight - self.clamp_bias = clamp_bias - - self.register_buffer('running_i', torch.zeros(1)) - self.register_buffer('running_h', torch.zeros(1)) - self.register_buffer('running_iw', torch.zeros(1)) - self.register_buffer('running_hw', torch.zeros(1)) - self.register_buffer('running_io', torch.zeros(1)) - self.register_buffer('running_ho', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - - self.register_buffer('scale_i', torch.zeros(1)) - self.register_buffer('scale_h', torch.zeros(1)) - self.register_buffer('scale_iw', torch.zeros(1)) - self.register_buffer('scale_hw', torch.zeros(1)) - self.register_buffer('scale_io', torch.zeros(1)) - self.register_buffer('scale_ho', torch.zeros(1)) - self.register_buffer('scale_o', torch.zeros(1)) - if self.bidirectional: - self.register_buffer('running_i_reverse', torch.zeros(1)) - self.register_buffer('running_h_reverse', torch.zeros(1)) - self.register_buffer('running_iw_reverse', torch.zeros(1)) - self.register_buffer('running_hw_reverse', torch.zeros(1)) - self.register_buffer('running_io_reverse', torch.zeros(1)) - self.register_buffer('running_ho_reverse', torch.zeros(1)) - self.register_buffer('running_o_reverse', torch.zeros(1)) - - self.register_buffer('scale_i_reverse', torch.zeros(1)) - self.register_buffer('scale_h_reverse', torch.zeros(1)) - self.register_buffer('scale_iw_reverse', torch.zeros(1)) - self.register_buffer('scale_hw_reverse', torch.zeros(1)) - self.register_buffer('scale_io_reverse', torch.zeros(1)) - self.register_buffer('scale_ho_reverse', torch.zeros(1)) - self.register_buffer('scale_o_reverse', torch.zeros(1)) - self.sigmoid_table = None - self.tanh_table = None - - def _single_direction_tensor(self, input, hidden, layer=0, direct=0): - step_outputs = [] - weight_ih = self.weight_ih_l0_reverse if direct == 1 else self.weight_ih_l0 - weight_hh = self.weight_hh_l0_reverse if direct == 1 else self.weight_hh_l0 - bias_ih = 
self.bias_ih_l0_reverse if direct == 1 else self.bias_ih_l0 - bias_hh = self.bias_hh_l0_reverse if direct == 1 else self.bias_hh_l0 - if weight_ih.dtype == torch.float32: - weight_ih = normalize_weight_with_config( - weight_ih, self.clamp_weight, self.training) - weight_hh = normalize_weight_with_config( - weight_hh, self.clamp_weight, self.training) - bias_ih = normalize_bias_with_config( - bias_ih, self.clamp_bias, self.training) - bias_hh = normalize_bias_with_config( - bias_hh, self.clamp_bias, self.training) - - running_i_tensor = self.running_i_reverse if direct == 1 and not self.unified else self.running_i - running_h_tensor = self.running_h_reverse if direct == 1 and not self.unified else self.running_h - running_iw_tensor = self.running_iw_reverse if direct == 1 and not self.unified else self.running_iw - running_hw_tensor = self.running_hw_reverse if direct == 1 and not self.unified else self.running_hw - running_io_tensor = self.running_io_reverse if direct == 1 and not self.unified else self.running_io - running_ho_tensor = self.running_ho_reverse if direct == 1 and not self.unified else self.running_ho - running_o_tensor = self.running_o_reverse if direct == 1 and not self.unified else self.running_o - - scale_i_tensor = self.scale_i_reverse if direct == 1 and not self.unified else self.scale_i - scale_h_tensor = self.scale_h_reverse if direct == 1 and not self.unified else self.scale_h - scale_iw_tensor = self.scale_iw_reverse if direct == 1 and not self.unified else self.scale_iw - scale_hw_tensor = self.scale_hw_reverse if direct == 1 and not self.unified else self.scale_hw - scale_io_tensor = self.scale_io_reverse if direct == 1 and not self.unified else self.scale_io - scale_ho_tensor = self.scale_ho_reverse if direct == 1 and not self.unified else self.scale_ho - scale_o_tensor = self.scale_o_reverse if direct == 1 and not self.unified else self.scale_o - - running_i = ScalerBuffer(running_i_tensor) - running_h = ScalerBuffer(running_h_tensor) - 
running_iw = ScalerBuffer(running_iw_tensor) - running_hw = ScalerBuffer(running_hw_tensor) - running_io = ScalerBuffer(running_io_tensor) - running_ho = ScalerBuffer(running_ho_tensor) - running_o = ScalerBuffer(running_o_tensor) - scale_i = ScalerBuffer(scale_i_tensor) - scale_h = ScalerBuffer(scale_h_tensor) - scale_iw = ScalerBuffer(scale_iw_tensor) - scale_hw = ScalerBuffer(scale_hw_tensor) - scale_io = ScalerBuffer(scale_io_tensor) - scale_ho = ScalerBuffer(scale_ho_tensor) - scale_o = ScalerBuffer(scale_o_tensor) - - input = torch.cat(input.split(1, 0)[::-1]) if direct == 1 else input - if self.training: - if self.is_not_from_iqtensor: - max_value_ix = get_max_value(input) - running_i.mul_( - 1-self.momentum).add_(self.momentum*max_value_ix) - for input_x in input: - hidden = GRUCellFunction.apply(input_x, hidden, weight_ih, weight_hh, bias_ih, bias_hh, - self.data_bits, self.parameter_bits, self.o_bits, - running_i, running_h, running_iw, running_hw, running_io, running_ho, running_o, - scale_i, scale_h, scale_iw, scale_hw, scale_io, scale_ho, scale_o, - self.momentum, self.training, self.prefix, self.dump, self.path, self.quant_mode, self.quant, self.is_not_from_iqtensor, - self.clamp_data) - step_outputs.append(hidden) - - running_i_tensor.fill_(running_i()) - running_h_tensor.fill_(running_h()) - running_iw_tensor.fill_(running_iw()) - running_hw_tensor.fill_(running_hw()) - running_io_tensor.fill_(running_io()) - running_ho_tensor.fill_(running_ho()) - running_o_tensor.fill_(running_o()) - scale_i_tensor.fill_(scale_i()) - scale_h_tensor.fill_(scale_h()) - scale_iw_tensor.fill_(scale_iw()) - scale_hw_tensor.fill_(scale_hw()) - scale_io_tensor.fill_(scale_io()) - scale_ho_tensor.fill_(scale_ho()) - scale_o_tensor.fill_(scale_o()) - - step_outputs = step_outputs[::-1] if direct == 1 else step_outputs - output = torch.stack(step_outputs, 0) - return output, hidden - - def _single_direction_packed(self, input, hidden, layer=0, direct=0, batch_sizes=None): - 
if direct: - return self._packed_reverse(input, hidden, layer, direct, batch_sizes) - else: - return self._packed_forward(input, hidden, layer, direct, batch_sizes) - - def _packed_forward(self, input, hidden, layer=0, direct=0, batch_sizes=None): - step_outputs = [] - final_hiddens = [] - weight_ih = self.weight_ih_l0_reverse if direct == 1 else self.weight_ih_l0 - weight_hh = self.weight_hh_l0_reverse if direct == 1 else self.weight_hh_l0 - bias_ih = self.bias_ih_l0_reverse if direct == 1 else self.bias_ih_l0 - bias_hh = self.bias_hh_l0_reverse if direct == 1 else self.bias_hh_l0 - if weight_ih.dtype == torch.float32: - weight_ih = normalize_weight_with_config( - weight_ih, self.clamp_weight, self.training) - weight_hh = normalize_weight_with_config( - weight_hh, self.clamp_weight, self.training) - bias_ih = normalize_bias_with_config( - bias_ih, self.clamp_bias, self.training) - bias_hh = normalize_bias_with_config( - bias_hh, self.clamp_bias, self.training) - - running_i_tensor = self.running_i_reverse if direct == 1 and not self.unified else self.running_i - running_h_tensor = self.running_h_reverse if direct == 1 and not self.unified else self.running_h - running_iw_tensor = self.running_iw_reverse if direct == 1 and not self.unified else self.running_iw - running_hw_tensor = self.running_hw_reverse if direct == 1 and not self.unified else self.running_hw - running_io_tensor = self.running_io_reverse if direct == 1 and not self.unified else self.running_io - running_ho_tensor = self.running_ho_reverse if direct == 1 and not self.unified else self.running_ho - running_o_tensor = self.running_o_reverse if direct == 1 and not self.unified else self.running_o - - scale_i_tensor = self.scale_i_reverse if direct == 1 and not self.unified else self.scale_i - scale_h_tensor = self.scale_h_reverse if direct == 1 and not self.unified else self.scale_h - scale_iw_tensor = self.scale_iw_reverse if direct == 1 and not self.unified else self.scale_iw - scale_hw_tensor = 
self.scale_hw_reverse if direct == 1 and not self.unified else self.scale_hw - scale_io_tensor = self.scale_io_reverse if direct == 1 and not self.unified else self.scale_io - scale_ho_tensor = self.scale_ho_reverse if direct == 1 and not self.unified else self.scale_ho - scale_o_tensor = self.scale_o_reverse if direct == 1 and not self.unified else self.scale_o - - running_i = ScalerBuffer(running_i_tensor) - running_h = ScalerBuffer(running_h_tensor) - running_iw = ScalerBuffer(running_iw_tensor) - running_hw = ScalerBuffer(running_hw_tensor) - running_io = ScalerBuffer(running_io_tensor) - running_ho = ScalerBuffer(running_ho_tensor) - running_o = ScalerBuffer(running_o_tensor) - scale_i = ScalerBuffer(scale_i_tensor) - scale_h = ScalerBuffer(scale_h_tensor) - scale_iw = ScalerBuffer(scale_iw_tensor) - scale_hw = ScalerBuffer(scale_hw_tensor) - scale_io = ScalerBuffer(scale_io_tensor) - scale_ho = ScalerBuffer(scale_ho_tensor) - scale_o = ScalerBuffer(scale_o_tensor) - - hidden = copy.deepcopy(hidden) - input, batch_size_list = _unbind_packed(input, batch_sizes) - last_batch_size = batch_size_list[0] - if self.training: - if self.is_not_from_iqtensor: - max_value_ix = get_max_value(input) - running_i.mul_( - 1-self.momentum).add_(self.momentum*max_value_ix) - for input_i, batch_len in zip(input, batch_size_list): - inc = batch_len - last_batch_size - if inc < 0: - # record unused-hidden of last-time - final_hiddens.append( - _slice(hidden, batch_len, last_batch_size)) - # slice new hidden - hidden = hx_slice(None, hidden, last_batch_size, batch_len) - hidden = GRUCellFunction.apply(input_i, hidden, weight_ih, weight_hh, bias_ih, bias_hh, - self.data_bits, self.parameter_bits, self.o_bits, - running_i, running_h, running_iw, running_hw, running_io, running_ho, running_o, - scale_i, scale_h, scale_iw, scale_hw, scale_io, scale_ho, scale_o, - self.momentum, self.training, self.prefix, self.dump, self.path, self.quant_mode, self.quant, - self.is_not_from_iqtensor, 
self.clamp_data) - step_outputs.append(hidden) - last_batch_size = batch_len - - running_i_tensor.fill_(running_i()) - running_h_tensor.fill_(running_h()) - running_iw_tensor.fill_(running_iw()) - running_hw_tensor.fill_(running_hw()) - running_io_tensor.fill_(running_io()) - running_ho_tensor.fill_(running_ho()) - running_o_tensor.fill_(running_o()) - scale_i_tensor.fill_(scale_i()) - scale_h_tensor.fill_(scale_h()) - scale_iw_tensor.fill_(scale_iw()) - scale_hw_tensor.fill_(scale_hw()) - scale_io_tensor.fill_(scale_io()) - scale_ho_tensor.fill_(scale_ho()) - scale_o_tensor.fill_(scale_o()) - - final_hiddens.append(hidden) - ret_hidden = final_hiddens[::-1] - hy_list = [] - for each in ret_hidden: - hy_list.append(each) - hidden = torch.cat(hy_list, 0) - output = torch.cat(step_outputs, 0) - return output, hidden - - def _packed_reverse(self, input, hidden, layer=0, direct=0, batch_sizes=None, bQuant=False): - step_outputs = [] - weight_ih = self.weight_ih_l0_reverse if direct == 1 else self.weight_ih_l0 - weight_hh = self.weight_hh_l0_reverse if direct == 1 else self.weight_hh_l0 - bias_ih = self.bias_ih_l0_reverse if direct == 1 else self.bias_ih_l0 - bias_hh = self.bias_hh_l0_reverse if direct == 1 else self.bias_hh_l0 - if weight_ih.dtype == torch.float32: - weight_ih = normalize_weight_with_config( - weight_ih, self.clamp_weight, self.training) - weight_hh = normalize_weight_with_config( - weight_hh, self.clamp_weight, self.training) - bias_ih = normalize_bias_with_config( - bias_ih, self.clamp_bias, self.training) - bias_hh = normalize_bias_with_config( - bias_hh, self.clamp_bias, self.training) - - running_i_tensor = self.running_i_reverse if direct == 1 and not self.unified else self.running_i - running_h_tensor = self.running_h_reverse if direct == 1 and not self.unified else self.running_h - running_iw_tensor = self.running_iw_reverse if direct == 1 and not self.unified else self.running_iw - running_hw_tensor = self.running_hw_reverse if direct == 1 and 
not self.unified else self.running_hw - running_io_tensor = self.running_io_reverse if direct == 1 and not self.unified else self.running_io - running_ho_tensor = self.running_ho_reverse if direct == 1 and not self.unified else self.running_ho - running_o_tensor = self.running_o_reverse if direct == 1 and not self.unified else self.running_o - - scale_i_tensor = self.scale_i_reverse if direct == 1 and not self.unified else self.scale_i - scale_h_tensor = self.scale_h_reverse if direct == 1 and not self.unified else self.scale_h - scale_iw_tensor = self.scale_iw_reverse if direct == 1 and not self.unified else self.scale_iw - scale_hw_tensor = self.scale_hw_reverse if direct == 1 and not self.unified else self.scale_hw - scale_io_tensor = self.scale_io_reverse if direct == 1 and not self.unified else self.scale_io - scale_ho_tensor = self.scale_ho_reverse if direct == 1 and not self.unified else self.scale_ho - scale_o_tensor = self.scale_o_reverse if direct == 1 and not self.unified else self.scale_o - - running_i = ScalerBuffer(running_i_tensor) - running_h = ScalerBuffer(running_h_tensor) - running_iw = ScalerBuffer(running_iw_tensor) - running_hw = ScalerBuffer(running_hw_tensor) - running_io = ScalerBuffer(running_io_tensor) - running_ho = ScalerBuffer(running_ho_tensor) - running_o = ScalerBuffer(running_o_tensor) - scale_i = ScalerBuffer(scale_i_tensor) - scale_h = ScalerBuffer(scale_h_tensor) - scale_iw = ScalerBuffer(scale_iw_tensor) - scale_hw = ScalerBuffer(scale_hw_tensor) - scale_io = ScalerBuffer(scale_io_tensor) - scale_ho = ScalerBuffer(scale_ho_tensor) - scale_o = ScalerBuffer(scale_o_tensor) - - input, batch_size_list = _unbind_packed(input, batch_sizes) - input = input[::-1] - batch_size_list = batch_size_list[::-1] - input_hx = copy.deepcopy(hidden) - last_batch_size = batch_size_list[0] - hidden = _slice(hidden, 0, last_batch_size) - if self.training: - if self.is_not_from_iqtensor: - max_value_ix = get_max_value(input) - running_i.mul_( - 
1-self.momentum).add_(self.momentum*max_value_ix) - for input_i, batch_len in zip(input, batch_size_list): - if last_batch_size != batch_len: - hidden = hx_slice(input_hx, hidden, last_batch_size, batch_len) - hidden = GRUCellFunction.apply(input_i, hidden, weight_ih, weight_hh, bias_ih, bias_hh, - self.data_bits, self.parameter_bits, self.o_bits, - running_i, running_h, running_iw, running_hw, running_io, running_ho, running_o, - scale_i, scale_h, scale_iw, scale_hw, scale_io, scale_ho, scale_o, - self.momentum, self.training, self.prefix, self.dump, self.path, self.quant_mode, self.quant, - self.is_not_from_iqtensor, self.clamp_data) - step_outputs.append(hidden) - last_batch_size = batch_len - - running_i_tensor.fill_(running_i()) - running_h_tensor.fill_(running_h()) - running_iw_tensor.fill_(running_iw()) - running_hw_tensor.fill_(running_hw()) - running_io_tensor.fill_(running_io()) - running_ho_tensor.fill_(running_ho()) - running_o_tensor.fill_(running_o()) - scale_i_tensor.fill_(scale_i()) - scale_h_tensor.fill_(scale_h()) - scale_iw_tensor.fill_(scale_iw()) - scale_hw_tensor.fill_(scale_hw()) - scale_io_tensor.fill_(scale_io()) - scale_ho_tensor.fill_(scale_ho()) - scale_o_tensor.fill_(scale_o()) - - step_outputs = step_outputs[::-1] - output = torch.cat(step_outputs, 0) - return output, hidden - - def _finetune(self, input, hidden, layer=0, direct=0, batch_sizes=None): - if batch_sizes is None: - return self._single_direction_tensor(input, hidden, layer, direct) - else: - return self._single_direction_packed(input, hidden, layer, direct, batch_sizes) - - def _run_single_direction(self, input, hidden, layer=0, direct=0, batch_sizes=None): - return self._finetune(input, hidden, layer, direct, batch_sizes) - - def single_direction(self, input, layer, hx, batch_sizes=None): - hidden = hx - output, hidden = self._run_single_direction( - input, hidden, layer, direct=0, batch_sizes=batch_sizes) - return output, [hidden] - - def bidirection(self, input, layer, 
hx, batch_sizes=None): - hx_f = hx[0] - hx_b = hx[1] - fw_output, fw_hidden = self._run_single_direction( - input, hx_f, layer, direct=0, batch_sizes=batch_sizes) - rev_output, rev_hidden = self._run_single_direction( - input, hx_b, layer, direct=1, batch_sizes=batch_sizes) - if batch_sizes is None: - output = torch.cat((fw_output, rev_output), fw_output.dim()-1) - else: # packed sequence - output = torch.cat((fw_output, rev_output), -1) - return output, [fw_hidden, rev_hidden] - - def gru_forward(self, input, hiddens, batch_sizes=None): - final_hiddens = [] - for layer_num in range(self.num_layers): - hid = hiddens[layer_num] if hiddens is not None else None - output, hc = self.bidirection(input, layer_num, hid, batch_sizes) if self.bidirectional else self.single_direction( - input, layer_num, hid, batch_sizes) - final_hiddens.extend(hc) - input = output - # add dropout - if (self.dropout != 0 and self.training and layer_num < self.num_layers - 1): - input = torch.nn.functional.dropout(input, self.dropout) - - hy = [hidden for hidden in final_hiddens] - hy = torch.stack(hy, 0) - - return input, hy - - def _generate_hiddens(self, hx): - if hx is not None: - hidden_list = _unbind(hx) - length = len(hidden_list) - if self.bidirectional: - assert length/self.num_layers % 2 == 0, 'hidden len must be double in bidirectional mode' - i = 0 - hiddens = [] - while i < length: - if self.bidirectional: - hiddens.append((hidden_list[i], hidden_list[i+1])) - i += 2 - else: - hiddens.append(hidden_list[i]) - i += 1 - else: - hiddens = None - return hiddens - - def forward_input_tensor(self, input, hx, batch_sizes=None): - input = input.transpose(0, 1) if self.batch_first else input - hiddens = self._generate_hiddens(hx) - output, hr = self.gru_forward(input, hiddens) - output = output.transpose(0, 1) if self.batch_first else output - return output, hr - - def forward_input_packed(self, input, hx, batch_sizes=None): - hiddens = self._generate_hiddens(hx) - output, hr = 
self.gru_forward(input, hiddens, batch_sizes) - return output, hr - - def forward(self, input, hx=None): - orig_input = input - if not is_in_onnx_export(): - if isinstance(orig_input, tuple): - input, lengths, batch_first, enforce_sorted = orig_input - if isinstance(input, IQTensor): - self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits) - self.scale_i.fill_(input.scale_data) - self.running_i.fill_(input.running_data) - if self.bidirectional: - self.scale_i_reverse.fill_(input.scale_data) - self.running_i_reverse.fill_(input.running_data) - packed_input = torch_pack_padded_sequence( - input, lengths, batch_first, enforce_sorted) - input, batch_sizes, sorted_indices, unsorted_indices = packed_input - max_batch_size = batch_sizes[0] - max_batch_size = int(max_batch_size) - else: - batch_sizes = None - max_batch_size = input.size( - 0) if self.batch_first else input.size(1) - sorted_indices = None - unsorted_indices = None - if isinstance(input, IQTensor): - self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits) - self.scale_i.fill_(input.scale_data) - self.running_i.fill_(input.running_data) - if self.bidirectional: - self.scale_i_reverse.fill_(input.scale_data) - self.running_i_reverse.fill_(input.running_data) - assert self.num_layers == 1, 'invalid num_layers, now only support num_layers = 1' - if hx is None: - num_directions = 2 if self.bidirectional else 1 - hx = torch.zeros(self.num_layers * num_directions, - max_batch_size, self.hidden_size, - dtype=input.dtype, device=input.device) - else: - # Each batch of the hidden state should match the input sequence that - # the user believes he/she is passing in. 
- hx = self.permute_hidden(hx, sorted_indices) - - if batch_sizes is not None: - output, hidden = self.forward_input_packed( - input, hx, batch_sizes) - else: - output, hidden = self.forward_input_tensor(input, hx) - if isinstance(orig_input, tuple): - output_packed = PackedSequence( - output, batch_sizes, sorted_indices, unsorted_indices) - output, lengths = torch_pad_packed_sequence( - output_packed, self.batch_first) - if self.o_bits is not None: - if self.training: - output = Quant2IQTensor.apply( - output, self.o_bits, self.quant_mode, 'output') - else: - scale_o = ScalerBuffer(self.quant.running_to_scale( - self.running_o, self.o_bits, mode=self.quant_mode)) - if self.bidirectional: - if self.unified: - scale_o_reverse = scale_o - else: - scale_o_reverse = ScalerBuffer(self.quant.running_to_scale( - self.running_o_reverse, self.o_bits, mode=self.quant_mode)) - scale_o = ScalerBuffer( - min(scale_o(), scale_o_reverse())) - output = from_torch_tensor( - output, scale_o(), self.o_bits) - return (output, lengths), self.permute_hidden(hidden, unsorted_indices) - else: - if self.o_bits is not None: - if self.training: - output = Quant2IQTensor.apply( - output, self.o_bits, self.quant_mode, 'output') - else: - scale_o = ScalerBuffer(self.quant.running_to_scale( - self.running_o, self.o_bits, mode=self.quant_mode)) - if self.bidirectional: - if self.unified: - scale_o_reverse = scale_o - else: - scale_o_reverse = ScalerBuffer(self.quant.running_to_scale( - self.running_o_reverse, self.o_bits, mode=self.quant_mode)) - scale_o = ScalerBuffer( - min(scale_o(), scale_o_reverse())) - output = from_torch_tensor( - output, scale_o(), self.o_bits) - return output, self.permute_hidden(hidden, unsorted_indices) - else: - lengths = None - if isinstance(orig_input, tuple): - input, lengths, _, _ = orig_input - else: - input = orig_input - lengths = None - if isinstance(input, IQTensor): - self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = 
Requant.apply( - input, input.bits, input.scale_data, self.data_bits) - bias_ih = None - bias_hh = None - bias_ih_reverse = None - bias_hh_reverse = None - weight_ih = self.weight_ih_l0 - weight_hh = self.weight_hh_l0 - weight_ih_reverse = weight_ih - weight_hh_reverse = weight_hh - if self.bias: - bias_ih = self.bias_ih_l0 - bias_hh = self.bias_hh_l0 - bias_ih_reverse = bias_ih - bias_hh_reverse = bias_hh - if self.bidirectional: - weight_ih_reverse = self.weight_ih_l0_reverse - weight_hh_reverse = self.weight_hh_l0_reverse - bias_ih_reverse = self.bias_ih_l0_reverse - bias_hh_reverse = self.bias_hh_l0_reverse - - scale_i = ScalerBuffer(self.scale_i) - scale_iw = ScalerBuffer(self.scale_iw) - scale_io = ScalerBuffer(self.scale_io) - scale_h = ScalerBuffer(self.scale_h) - scale_hw = ScalerBuffer(self.scale_hw) - scale_ho = ScalerBuffer(self.scale_ho) - scale_o = ScalerBuffer(self.scale_o) - scale_i_reverse = None - scale_iw_reverse = None - scale_io_reverse = None - scale_h_reverse = None - scale_hw_reverse = None - scale_ho_reverse = None - scale_o_reverse = None - hidden_state = None - cell_state = None - if self.bidirectional: - scale_i_reverse = ScalerBuffer(self.scale_i_reverse) - scale_iw_reverse = ScalerBuffer(self.scale_iw_reverse) - scale_io_reverse = ScalerBuffer(self.scale_io_reverse) - scale_h_reverse = ScalerBuffer(self.scale_h_reverse) - scale_hw_reverse = ScalerBuffer(self.scale_hw_reverse) - scale_ho_reverse = ScalerBuffer(self.scale_ho_reverse) - scale_o_reverse = ScalerBuffer(self.scale_o_reverse) - output = None - hy = None - if hx is not None: - hidden_state = hx - if hidden_state is not None: - batch_size = input.size( - 0) if self.batch_first else input.size(1) - seq_len = input.size( - 1) if self.batch_first else input.size(0) - lengths = torch.tensor([seq_len for i in range( - batch_size)], dtype=torch.int64, device=input.device) if lengths is None else lengths - output, hy = GRUSingleONNXFunction.apply(input, lengths, hidden_state, 
weight_ih, weight_hh, bias_ih, bias_hh, - weight_ih_reverse, weight_hh_reverse, bias_ih_reverse, bias_hh_reverse, - self.input_size, self.hidden_size, self.num_layers, self.batch_first, self.dropout, self.bidirectional, - self.data_bits, self.parameter_bits, self.o_bits, - scale_i, scale_iw, scale_io, scale_h, scale_hw, scale_ho, scale_o, - scale_i_reverse, scale_iw_reverse, scale_io_reverse, scale_h_reverse, scale_hw_reverse, scale_ho_reverse, scale_o_reverse, - self.is_not_from_iqtensor) - if self.o_bits is not None: - if self.bidirectional: - scale_o = ScalerBuffer(min(scale_o(), scale_o_reverse())) - output = from_torch_tensor(output, scale_o(), self.o_bits) - if isinstance(orig_input, tuple): - return (output, lengths), hy - else: - return output, hy - - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) - - def state_dict(self, destination=None, prefix='', keep_vars=False): - if destination is None: - destination = OrderedDict() - destination._metadata = OrderedDict() - destination._metadata[prefix[:-1] - ] = local_metadata = dict(version=self._version) - if is_in_onnx_export(): - assert self.running_i > 0, 'invalid running_x <= 0' - scale_i = ScalerBuffer(self.scale_i.data) - if self.is_not_from_iqtensor: - scale_i = ScalerBuffer(self.quant.running_to_scale( - self.running_i, self.data_bits, mode=self.quant_mode)) - self.scale_i.data.fill_(scale_i()) - scale_h = ScalerBuffer(self.quant.running_to_scale( - self.running_h, self.data_bits, mode=self.quant_mode)) - self.scale_h.data.fill_(scale_h()) - if self.o_bits is not None: - scale_o = ScalerBuffer(self.quant.running_to_scale( - self.running_o, self.o_bits, mode=self.quant_mode)) - self.scale_o.data.fill_(scale_o()) - - if self.bidirectional: - if self.unified: - self.running_i_reverse.data = 
self.running_i.data - self.running_h_reverse.data = self.running_h.data - self.scale_i_reverse.data = self.scale_i.data - self.scale_h_reverse.data = self.scale_h.data - scale_i_reverse = scale_i - scale_h_reverse = scale_h - if self.o_bits is not None: - self.running_o_reverse.data = self.running_o.data - self.scale_o_reverse.data = self.scale_o.data - scale_o_reverse = scale_o - else: - scale_i_reverse = ScalerBuffer(self.scale_i_reverse.data) - if self.is_not_from_iqtensor: - scale_i_reverse = ScalerBuffer(self.quant.running_to_scale( - self.running_i_reverse, self.data_bits, mode=self.quant_mode)) - self.scale_i_reverse.data.fill_(scale_i_reverse()) - scale_h_reverse = ScalerBuffer(self.quant.running_to_scale( - self.running_h_reverse, self.data_bits, mode=self.quant_mode)) - self.scale_h_reverse.data.fill_(scale_h_reverse()) - if self.o_bits is not None: - scale_o_reverse = ScalerBuffer(self.quant.running_to_scale( - self.running_o_reverse, self.o_bits, mode=self.quant_mode)) - self.scale_o_reverse.data.fill_(scale_o_reverse()) - - if self.weight_ih_l0.dtype == torch.float32: - clamp_weight_iw = normalize_weight_with_config( - self.weight_ih_l0, self.clamp_weight, False) - clamp_weight_hw = normalize_weight_with_config( - self.weight_hh_l0, self.clamp_weight, False) - self.weight_ih_l0.data = clamp_weight_iw - self.weight_hh_l0.data = clamp_weight_hw - if is_in_onnx_export(): - scale_iw = ScalerBuffer(self.quant.running_to_scale( - self.running_iw, self.parameter_bits, mode=self.quant_mode)) - self.scale_iw.data.fill_(scale_iw()) - scale_hw = ScalerBuffer(self.quant.running_to_scale( - self.running_hw, self.parameter_bits, mode=self.quant_mode)) - self.scale_hw.data.fill_(scale_hw()) - q_weight_iw, scale_iw, _ = self.quant.quant( - clamp_weight_iw, self.parameter_bits, scale=scale_iw, mode=self.quant_mode, quant_data='weight') - q_weight_hw, scale_hw, _ = self.quant.quant( - clamp_weight_hw, self.parameter_bits, scale=scale_hw, mode=self.quant_mode, 
quant_data='weight') - if self.bias: - clamp_bias_iw = normalize_bias_with_config( - self.bias_ih_l0, self.clamp_bias, False) - clamp_bias_hw = normalize_bias_with_config( - self.bias_hh_l0, self.clamp_bias, False) - self.bias_ih_l0.data = clamp_bias_iw - self.bias_hh_l0.data = clamp_bias_hw - if is_in_onnx_export(): - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - q_bias_iw = (clamp_bias_iw * scale_i * - scale_iw + 0.5).floor() - q_bias_hw = (clamp_bias_hw * scale_h * - scale_hw + 0.5).floor() - if self.data_bits + self.parameter_bits <= 16: - q_bias_iw = q_bias_iw.float().int() - q_bias_hw = q_bias_hw.float().int() - else: - assert False, "linger only support luna quant." - if self.bidirectional: - clamp_weight_iw_reverse = normalize_weight_with_config( - self.weight_ih_l0_reverse, self.clamp_weight, False) - clamp_weight_hw_reverse = normalize_weight_with_config( - self.weight_hh_l0_reverse, self.clamp_weight, False) - self.weight_ih_l0_reverse.data = clamp_weight_iw_reverse - self.weight_hh_l0_reverse.data = clamp_weight_hw_reverse - if is_in_onnx_export(): - if self.unified: - q_weight_iw_reverse, scale_iw_reverse, _ = self.quant.quant( - clamp_weight_iw_reverse, self.parameter_bits, scale=scale_iw, mode=self.quant_mode, quant_data='weight') - q_weight_hw_reverse, scale_hw_reverse, _ = self.quant.quant( - clamp_weight_hw_reverse, self.parameter_bits, scale=scale_hw, mode=self.quant_mode, quant_data='weight') - else: - scale_iw_reverse = ScalerBuffer(self.quant.running_to_scale( - self.running_iw_reverse, self.parameter_bits, mode=self.quant_mode)) - scale_hw_reverse = ScalerBuffer(self.quant.running_to_scale( - self.running_hw_reverse, self.parameter_bits, mode=self.quant_mode)) - q_weight_iw_reverse, scale_iw_reverse, _ = self.quant.quant( - clamp_weight_iw_reverse, self.parameter_bits, scale=scale_iw_reverse(), mode=self.quant_mode, quant_data='weight') - q_weight_hw_reverse, scale_hw_reverse, _ = self.quant.quant( - 
clamp_weight_hw_reverse, self.parameter_bits, scale=scale_hw_reverse(), mode=self.quant_mode, quant_data='weight') - self.scale_iw_reverse.data.fill_(scale_iw_reverse()) - self.scale_hw_reverse.data.fill_(scale_hw_reverse()) - if self.bias: - clamp_bias_iw_reverse = normalize_bias_with_config( - self.bias_ih_l0_reverse, self.clamp_bias, False) - clamp_bias_hw_reverse = normalize_bias_with_config( - self.bias_hh_l0_reverse, self.clamp_bias, False) - self.bias_ih_l0_reverse.data = clamp_bias_iw_reverse - self.bias_hh_l0_reverse.data = clamp_bias_hw_reverse - if is_in_onnx_export(): - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - q_bias_iw_reverse = ( - clamp_bias_iw_reverse * scale_i_reverse * scale_iw_reverse + 0.5).floor() - q_bias_hw_reverse = ( - clamp_bias_hw_reverse * scale_h_reverse * scale_hw_reverse + 0.5).floor() - if self.data_bits + self.parameter_bits <= 16: - q_bias_iw_reverse = q_bias_iw_reverse.float().int() - q_bias_hw_reverse = q_bias_hw_reverse.float().int() - else: - assert False, "linger only support luna quant." 
- if is_in_onnx_export(): - if self.parameter_bits <= 8: - self.weight_ih_l0.data = q_weight_iw.char() - self.weight_hh_l0.data = q_weight_hw.char() - if self.bidirectional: - self.weight_ih_l0_reverse.data = q_weight_iw_reverse.char() - self.weight_hh_l0_reverse.data = q_weight_hw_reverse.char() - elif self.parameter_bits <= 16: - self.weight_ih_l0.data = q_weight_iw.short() - self.weight_hh_l0.data = q_weight_hw.short() - if self.bidirectional: - self.weight_ih_l0_reverse.data = q_weight_iw_reverse.short() - self.weight_hh_l0_reverse.data = q_weight_hw_reverse.short() - else: - self.weight_ih_l0.data = q_weight_iw.int() - self.weight_hh_l0.data = q_weight_hw.int() - if self.bidirectional: - self.weight_ih_l0_reverse.data = q_weight_iw_reverse.int() - self.weight_hh_l0_reverse.data = q_weight_hw_reverse.int() - if self.bias: - self.bias_ih_l0.data = q_bias_iw.int() - self.bias_hh_l0.data = q_bias_hw.int() - if self.bidirectional: - self.bias_ih_l0_reverse.data = q_bias_iw_reverse.int() - self.bias_hh_l0_reverse.data = q_bias_hw_reverse.int() - self._save_to_state_dict(destination, prefix, keep_vars) - for name, self in self._modules.items(): - if self is not None: - self.state_dict(destination, prefix + name + - '.', keep_vars=keep_vars) - for hook in self._state_dict_hooks.values(): - hook_result = hook(self, destination, prefix, local_metadata) - if hook_result is not None: - destination = hook_result - return destination - - def extra_repr(self): - s = nn.GRU.extra_repr(self) - extra_s = ',clamp_data:{clamp_data},clamp_weight:{clamp_weight},clamp_bias:{clamp_bias}'.format( - **self.__dict__) - extra_s += ',data_bits:{data_bits},parameter_bits:{parameter_bits},o_bits:{o_bits}'.format( - **self.__dict__) - return s+extra_s diff --git a/linger/ops/iqtensor.py b/linger/ops/iqtensor.py deleted file mode 100644 index e1ea440..0000000 --- a/linger/ops/iqtensor.py +++ /dev/null @@ -1,1387 +0,0 @@ -import math -from collections import OrderedDict - -import numpy as np 
-import torch -import torch._C as _C -import torch.onnx.symbolic_helper as sym_help -from torch.onnx import is_in_onnx_export - -from torch.onnx.symbolic_opset9 import flatten as onnx_syms_flatten -from torch.onnx.symbolic_opset9 import permute as onnx_syms_permute -from torch.onnx.symbolic_opset9 import prim_ConstantChunk as onnx_sym_chunk -from torch.onnx.symbolic_opset9 import reshape_as as onnx_syms_reshape_as -from torch.onnx.symbolic_opset9 import squeeze as onnx_syms_squeeze -from torch.onnx.symbolic_opset9 import transpose as onnx_syms_transpose -from torch.onnx.symbolic_opset9 import unsqueeze as onnx_syms_unsqueeze -from torch.onnx.symbolic_opset9 import view as onnx_syms_view -from torch.onnx.symbolic_opset10 import flip as onnx_syms_flip - -from ..config import config -from ..ops.ops import ModuleIntConfig -from ..ops.ops_names import (LINGER_IQTENSOR_LAYER_COUNTER, - LINGER_MIX_INT8_MANUAL_ROUND_LAYERS, LINGER_MODE) -from ..quant import Quant -from ..utils import Dump, PlatFormQuant, QuantMode, ScalerBuffer -from .module_self import get_current_module - - -def platform_to_string(platform_quant): - if platform_quant == PlatFormQuant.luna_quant: - return "luna_quant" - - -def quantlinear(g, input, scale_x, platform_quant, data_bits=8): - return g.op("thinker::Quant", input, data_bits_i=data_bits, scale_x_f=scale_x, platform_quant_s=platform_quant) - - -def dequantlinear(g, input, scale_x): - return g.op("thinker::Dequant", input, scale_x_f=scale_x) - - -class Quant2IQTensor(torch.autograd.Function): - @staticmethod - def forward(ctx, data, data_bits, mode, quant_data): - q_outputs, scale, _ = Quant.quant( - data, data_bits, mode=mode, quant_data=quant_data) - outputs = Quant.dequant(q_outputs, scale) - outputs = from_torch_tensor(outputs, scale(), data_bits) - return outputs - - @staticmethod - def backward(ctx, grad_output): - return grad_output, None, None, None - - -class Convert2IQTensor(torch.autograd.Function): - @staticmethod - def forward(ctx, 
t): - s = IQTensor() - s.data = t.data - return s - - @staticmethod - def backward(ctx, gradOutput): - return gradOutput - - @staticmethod - def symbolic(g, input): - return g.op("Identity", input) - - -def from_torch_tensor(t, scale, bits, zero_point=0, running=None): - r"""把torch tensor 转换成IQTensor - - Args: - t(torch.Tensor):需要转换的torch tensor - scale(float):IQTensor 的scale_data - bits(int):IQTensor的精度 - zero_point(int): 控制uint8->int8偏移量 - returns: - 转换后的IQTensor - - Notes: - 这次转换是有grad信息,会被记录到图中 - """ - if scale is None: - return t - s = Convert2IQTensor.apply(t) - s.scale_data = scale - s.bits = bits - s.zero_point = zero_point - s.running_data = running - if s.running_data is None: - bound_value = math.pow(2, bits-1)-1 - s.running_data = (bound_value+zero_point) / s.scale_data - return s - - -class iqView(torch.autograd.Function): - @staticmethod - def forward(self, input, *size): - y = super(input.__class__, input,).view(*size) - self.num_in = len(size) - self.size = input.size() - return from_torch_tensor(y, input.scale_data, input.bits, input.zero_point) - - @staticmethod - def backward(self, s): - size = self.size - l = [None for i in range(self.num_in)] - ret = [s.contiguous().view(size)] + l - return tuple(ret) - - @staticmethod - def symbolic(g, input, *size): - if isinstance(size[0], tuple): - size = size[0] - CValue_list = [t for t in size if sym_help._is_value(t)] - if len(CValue_list) != 0: - unsqueezed = [] - for t in size: - if sym_help._is_value(t): - unsqueezed.append(g.op("Unsqueeze", t, axes_i=[0])) - else: - const_op = g.op("Constant", value_t=torch.tensor(t)) - unsqueezed.append(g.op("Unsqueeze", const_op, axes_i=[0])) - size = g.op("Concat", *unsqueezed, axis_i=0) - return onnx_syms_view(g, input, size) - - -class iqReshape(torch.autograd.Function): - @staticmethod - def forward(self, input, *size): - y = super(input.__class__, input,).reshape(*size) - self.num_in = len(size) - self.size = input.size() - return from_torch_tensor(y, 
input.scale_data, input.bits, input.zero_point) - - @staticmethod - def backward(self, s): - size = self.size - l = [None for i in range(self.num_in)] - ret = [s.reshape(size)] + l - return tuple(ret) - - @staticmethod - def symbolic(g, input, *size): - if isinstance(size[0], tuple): - size = size[0] - CValue_list = [t for t in size if sym_help._is_value(t)] - if len(CValue_list) != 0: - unsqueezed = [] - for t in size: - if sym_help._is_value(t): - unsqueezed.append(g.op("Unsqueeze", t, axes_i=[0])) - else: - const_op = g.op("Constant", value_t=torch.tensor(t)) - unsqueezed.append(g.op("Unsqueeze", const_op, axes_i=[0])) - size = g.op("Concat", *unsqueezed, axis_i=0) - return onnx_syms_view(g, input, size) - - -class iqReshape_as(torch.autograd.Function): - @staticmethod - def forward(self, input, other): - y = super(input.__class__, input,).reshape_as(other) - self.size = input.size() - return from_torch_tensor(y, input.scale_data, input.bits, input.zero_point) - - @staticmethod - def backward(self, s): - size = self.size - ret = s.reshape(size) - return ret, None - - @staticmethod - def symbolic(g, input, other): - return onnx_syms_reshape_as(g, input, other) - - -class iqFlip(torch.autograd.Function): - @staticmethod - def forward(self, input, *args): - y = super(input.__class__, input,).flip(*args) - self.num_in = len(args) - self.dim = args - return from_torch_tensor(y, input.scale_data, input.bits, input.zero_point) - - @staticmethod - def backward(self, s): - num_in = self.num_in - dim = self.dim - return s.flip(*dim), None - - @staticmethod - def symbolic(g, input, *dims): - if isinstance(dims[0], int): - return onnx_syms_flip(g, input, dims) - return onnx_syms_flip(g, input, *dims) - - -class iqSplit(torch.autograd.Function): - @staticmethod - def forward(self, input, split_size_or_sections, dim=0): - y = super(input.__class__, input,).split(split_size_or_sections, dim) - self.dim = dim - y = tuple([from_torch_tensor(t, input.scale_data, - input.bits, 
                            input.zero_point) for t in y])
        return y

    @staticmethod
    def backward(self, *s):
        # split() gradient: concatenate the per-chunk grads back along the
        # split dim; two extra Nones match (split_size_or_sections, dim).
        dim = self.dim
        return torch.cat(s, dim), None, None

    @staticmethod
    def symbolic(g, input, split_size_or_sections, dim=0):
        # ONNX export: expand an int split size into an explicit list of
        # chunk sizes (plus a smaller trailing chunk when the dim size is
        # not evenly divisible), as the ONNX Split op expects.
        sizes = sym_help._get_tensor_dim_size(input, dim)
        if (isinstance(split_size_or_sections, int)):
            splits = [split_size_or_sections] * \
                (sizes // split_size_or_sections)
            leftover = sizes % split_size_or_sections
            if leftover:
                splits.append(leftover)
        else:
            splits = list(split_size_or_sections)
        return g.op("Split", input, split_i=splits, axis_i=dim, outputs=len(splits))


class iqSqueeze(torch.autograd.Function):
    """squeeze() for IQTensor: shape-only bypass op that forwards the
    quantization metadata (scale/bits/zero_point) unchanged."""

    @staticmethod
    def forward(self, input, *args):
        # Call the plain torch.Tensor implementation to avoid re-entering
        # the IQTensor override.
        y = super(input.__class__, input,).squeeze(*args)
        self.num_in = len(args)   # how many trailing Nones backward must emit
        self.size = input.size()  # original shape, restored in backward
        return from_torch_tensor(y, input.scale_data, input.bits, input.zero_point)

    @staticmethod
    def backward(self, s):
        size = self.size
        l = [None for i in range(self.num_in)]
        ret = [s.reshape(size)] + l
        return tuple(ret)

    @staticmethod
    def symbolic(g, input, *dim):
        return onnx_syms_squeeze(g, input, *dim)


class iqUnsqueeze(torch.autograd.Function):
    """unsqueeze() for IQTensor: shape-only bypass, metadata passes through."""

    @staticmethod
    def forward(self, input, dim):
        y = super(input.__class__, input,).unsqueeze(dim)
        self.size = input.size()
        return from_torch_tensor(y, input.scale_data, input.bits, input.zero_point)

    @staticmethod
    def backward(self, s):
        size = self.size
        return s.reshape(size), None

    @staticmethod
    def symbolic(g, input, dim):
        return onnx_syms_unsqueeze(g, input, dim)


class iqTranspose(torch.autograd.Function):
    """transpose() for IQTensor: bypass op; grad is the inverse transpose."""

    @staticmethod
    def forward(self, input, dim0, dim1):
        y = super(input.__class__, input,).transpose(dim0, dim1)
        self.dim0 = dim0
        self.dim1 = dim1
        return from_torch_tensor(y, input.scale_data, input.bits, input.zero_point)

    @staticmethod
    def backward(self, s):
        dim0 = self.dim0
        dim1 = self.dim1
        # Transposing the same pair again inverts the forward permutation.
        return s.transpose(dim0, dim1), None, None

    @staticmethod
    def symbolic(g, input, dim0, dim1):
        return onnx_syms_transpose(g, input, dim0, dim1)


class iqPermute(torch.autograd.Function):
    """permute() for IQTensor: bypass op; grad obtained by re-running
    permute under autograd on a detached copy."""

    @staticmethod
    def forward(self, input, *dims):
        y = super(input.__class__, input,).permute(*dims)
        self.dims = dims
        self.save_for_backward(input, )
        return from_torch_tensor(y, input.scale_data, input.bits, input.zero_point)

    @staticmethod
    def backward(self, s):
        input, = self.saved_tensors
        dims = self.dims
        input = input.detach().clone().requires_grad_(True)
        gradInput = None
        # Replay the permute with autograd enabled to get the exact
        # gradient without hand-deriving the inverse permutation.
        with torch.enable_grad():
            z = input.permute(*dims)
            gradInput, = torch.autograd.grad(z, (input, ), s)
        grad_tuple = [gradInput]
        for i in range(len(dims)):
            grad_tuple.append(None)
        return tuple(grad_tuple)

    @staticmethod
    def symbolic(g, input, *dims):
        # permute may be called as .permute(0, 2, 1) or .permute((0, 2, 1)).
        if isinstance(dims[0], tuple):
            dims = dims[0]
        return onnx_syms_permute(g, input, dims)


class iqFlatten(torch.autograd.Function):
    """flatten() for IQTensor: bypass op; grad reshapes back to the input."""

    @staticmethod
    def forward(self, input, start_dim=0, end_dim=-1):
        y = super(input.__class__, input,).flatten(start_dim, end_dim)
        self.input_dims = input.size()
        return from_torch_tensor(y, input.scale_data, input.bits, input.zero_point)

    @staticmethod
    def backward(self, s):
        input_dims = self.input_dims
        return s.reshape(input_dims), None, None

    @staticmethod
    def symbolic(g, input, start_dim=0, end_dim=-1):
        return onnx_syms_flatten(g, input, start_dim, end_dim)


class iqContiguous(torch.autograd.Function):
    """contiguous() for IQTensor: pure bypass; exported as ONNX Identity."""

    @staticmethod
    def forward(self, input, *args, **kwargs):
        y = super(input.__class__, input,).contiguous(*args, **kwargs)
        return from_torch_tensor(y, input.scale_data, input.bits, input.zero_point)

    @staticmethod
    def backward(self, s):
        return s, None

    @staticmethod
    def symbolic(g, input, *args, **kwargs):
        return g.op("Identity", input)


class iqChunk(torch.autograd.Function):
    """chunk() for IQTensor: each chunk inherits the input's quant metadata."""

    @staticmethod
    def forward(self, input, chunks, dim=0):
        self.dim = dim
        y = super(input.__class__, input,).chunk(chunks, dim)
        return tuple(from_torch_tensor(ret, input.scale_data, input.bits, input.zero_point) for ret in y)

    @staticmethod
    def backward(self, *s):
        dim = self.dim
        # Concatenating the chunk grads along dim reconstructs the input grad.
        return torch.cat(s, dim), None, None

    @staticmethod
    def symbolic(g, input, chunks, dim):
        return onnx_sym_chunk(g, input, chunks, dim)


class iqMul(torch.autograd.Function):
    """Quantized elementwise multiply.

    Forward quantizes both operands to int8, multiplies in the integer
    domain with a combined rescale factor, and re-quantizes the result to
    the output scale. Training updates an EMA of the output range
    (``running_o``); eval derives the output scale from that EMA.
    """

    @staticmethod
    def forward(self, x, y, c_y, scale_x, scale_y, zero_x, zero_y, local_scale_o, scale_o, running_o, training, quant_mode, prefix, dump, path):
        self.save_for_backward(x, y)

        scale_z_iq = local_scale_o
        momentum = 0.1
        if training:
            # EMA of 127/max|z| (i.e. of the per-batch output scale).
            running_o.mul_(1-momentum).add_(momentum*(127/local_scale_o()))
            if quant_mode == QuantMode.QValue:
                # Snap the output scale to the nearest power of two.
                scale_log = round(math.log(scale_z_iq(), 2))
                scale_z_iq = math.pow(2, scale_log)
                scale_z_iq = ScalerBuffer(scale_z_iq)
        else:
            assert running_o.data > 0, 'Must at least training one batch'
            if quant_mode == QuantMode.QValue:
                scale_log = round(math.log(127/running_o.data, 2))
                scale_z_iq = math.pow(2, scale_log)
            else:
                scale_z_iq = np.float32((math.pow(2, 8-1)-1) / running_o.data)
            scale_z_iq = ScalerBuffer(scale_z_iq)
        scale_o.fill_(scale_z_iq())
        # Guarantee scale_x*scale_y >= scale_o so the integer rescale factor
        # below stays <= 1; bump scale_y by the missing power of two.
        if math.log(scale_x(), 2) + math.log(scale_y(), 2) < math.log(scale_z_iq(), 2):
            scale_y_add = math.log(scale_z_iq(), 2) - math.log(scale_x(), 2) - math.log(scale_y(), 2)
            scale_y.fill_(scale_y()*2**scale_y_add)
        x_int = x.quant_to_int8(scale_x())
        y_int = y.quant_to_int8(scale_y())
        x_int = x_int.contiguous()
        y_int = y_int.contiguous()
        if (x_int.size() != y_int.size()):
            x_int, y_int = torch.broadcast_tensors(x_int, y_int)
        z_int = None

        if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,):
            # Integer multiply, then rescale into the output Q-domain with
            # round-half-up (+0.5, floor) and saturate to int8.
            r_scale = np.float32(scale_z_iq()/(scale_x()*scale_y()))
            z_int = (((x_int * y_int)*r_scale)+0.5).floor()
            z_int.clamp_(-128, 127)
        else:
            assert False, 'platform_quant mode donot support for iqmul'
        # z_float = z_int.float() /scale_z_iq()
        z_float = Quant.dequant(z_int, scale_z_iq)

        if dump:
            name_list = ["input1", "input2", "outputs",
                         "q_input1", "q_input2", "q_outputs"]
            attr_list = [x, y, z_float, x_int, y_int, z_int]
            Dump.dump_file(prefix, ".iqMul.", zip(name_list, attr_list), path)

        return from_torch_tensor(z_float, scale_z_iq(), 8)

    @staticmethod
    def backward(self, s):
        # Straight-through: gradient of the float multiply, quantization
        # noise ignored. Trailing Nones cover the 13 non-tensor inputs.
        x, y = self.saved_tensors
        x = x.detach().clone().requires_grad_(True)
        y = y.detach().clone().requires_grad_(True)
        grad = None
        with torch.enable_grad():
            z = x * y
            grad = torch.autograd.grad(z, (x, y), s)
        return grad[0], grad[1], None, None, None, None, None, None, None, None, None, None, None, None, None

    @staticmethod
    def symbolic(g, x, y, c_y, scale_x, scale_y, zero_x, zero_y, local_scale_o, scale_o, running_o, training, quant_mode, prefix, dump, path):
        # When y was a Python scalar, bake its quantized int8 value into the
        # graph as a constant input.
        if c_y is not None:
            c_y = math.floor(c_y * scale_y() + 0.5)
            c_y = max(-128, min(127, c_y))
            y = g.op("Constant", value_t=torch.tensor(c_y, dtype=torch.int8))
        input_list = [x, y]
        platform_quant = platform_to_string(
            config.PlatFormQuant.platform_quant)
        param_dict = {'scale_x_f': scale_x(), 'scale_y_f': scale_y(
        ), 'scale_o_f': scale_o(), 'platform_quant_s': platform_quant}

        return g.op("thinker::iqMul", *input_list, **param_dict)


class iqMulLayer(torch.nn.Module):
    r"""Module wrapper around :class:`iqMul`.

    Owns the persistent output-scale buffers (``scale_o``, ``running_o``)
    so that the learned output range survives checkpointing.
    """

    def __init__(self):
        super(iqMulLayer, self).__init__()
        self.prefix = ""
        self.dump = False
        self.path = ""
        self.is_y_constant = False

        self.register_buffer('scale_o', torch.zeros(1))
        self.register_buffer('running_o', torch.zeros(1))
        self.register_buffer('y', torch.zeros(1))

    def forward(self, x, y, local_scale_o, quant_mode=QuantMode.QValue):
        r"""Forward pass.

        Args:
            x, y (IQTensor): operands of ``x * y`` (``y`` may be a Python
                scalar, which is quantized on the fly).
            local_scale_o (float): output scale of the current batch.
            quant_mode (QuantMode): output quantization mode; currently
                only Q-value (power-of-two) scaling is supported.
        Returns:
            The product ``z`` as an IQTensor.
        """
        scale_x = ScalerBuffer(x.scale_data)
        c_y = None
        if not isinstance(y, torch.Tensor):
            # Scalar operand: remember the raw value for ONNX export and
            # fake-quantize it to an 8-bit IQTensor.
            c_y = y
            self.is_y_constant = True
            y = torch.tensor(y, dtype=torch.float32, device=x.device)
            q_y, scale_y, _ = Quant.quant(
                y, 8, mode=quant_mode, quant_data='input')
            y = Quant.dequant(q_y, scale_y)
            y = from_torch_tensor(y, scale_y(), 8)

        # if 0 < y < 1:
        #     y_ = math.log(y, 2)
        #     if y_ % 1 != 0:
        #         assert False, f"in iqmul, input y must equals to 2**a, where a is integer, but you have y = {y}"
        #     y = torch.tensor(y, dtype=torch.float32, device=x.device)
        #     y = from_torch_tensor(y, 2**(-y_), 8)
        # elif y >= 1:
        #     if y & (y-1) != 0:
        #         assert False, f"in iqmul, input y must equals to 2**a, where a is integer, but you have y = {y}"
        #     y = torch.tensor(y, dtype=torch.float32, device=x.device)
        #     y = from_torch_tensor(y, 1, 8)
        # else:
        #     assert False, f"in iqmul, input y must lager than 0, and equals to 2**a, where a is integer, but you have y = {y}"

        scale_y = ScalerBuffer(y.scale_data)
        scale_o = ScalerBuffer(self.scale_o)
        running_o = ScalerBuffer(self.running_o)
        local_scale_o = ScalerBuffer(local_scale_o)
        z = iqMul.apply(x, y, c_y, scale_x, scale_y, x.zero_point, y.zero_point, local_scale_o,
                        scale_o, running_o, self.training, quant_mode, self.prefix, self.dump, self.path)
        # Write the (possibly updated) scales back into the module buffers.
        self.scale_o.fill_(scale_o.data)
        self.running_o.fill_(running_o.data)
        return z

    def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
                              missing_keys, unexpected_keys, error_msgs):
        ModuleIntConfig._load_from_state_dict_global(
            self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)


def iqmul(module, x, y, name="_default"):
    r"""IQTensor multiplication with output re-scaling; the output scale
    supports max-value calibration and Q-value (power-of-two) calibration.

    .. math::
        z = x * y

    Args:
        module(torch.nn.Module): the module the multiply is registered on;
            when iqmul is used inside a module's forward this is usually
            ``self``.
        x(IQTensor): left operand.
        y(IQTensor): right operand (IQTensor or Python scalar).
        name(str): name of the generated sub-module attribute on ``module``.
    Notes:
        With IQTensor inputs this multiply is applied automatically;
        linger.SetIQTensorMul(False) disables iqmul.
    Example:
        >>> class iqTestLayer(torch.nn.Module):
        >>>     def __init__(self):
        >>>         super(iqTestLayer,self).__init__()
        >>>     def forward(self,x,y):
        >>>         return iqmul(self,x,y,'test')
        >>> a = from_torch_tensor(x,127.0/6.0,8)
        >>> b = from_torch_tensor(y,127.0/8,8)
        >>> net = iqTestLayer().cuda()
        >>> m = net(a,b)

    """
    quant_mode = getattr(module, LINGER_MODE, QuantMode.QValue)
    # assert quant_mode is not None, 'invalid add quant mode'
    assert isinstance(x, IQTensor)
    # assert isinstance(y,IQTensor)
    assert x.bits == 8, 'iqmul only support 8bit'
    if isinstance(y, IQTensor):
        assert y.bits == 8, 'iqmul only support 8bit tensor'
    # Lazily create (or fetch) a per-call-site iqMulLayer on the module so
    # each multiply site keeps its own running output scale.
    var_name = LINGER_MIX_INT8_MANUAL_ROUND_LAYERS + '_iqmul_' + name
    iq_layer = None
    if hasattr(module, var_name):
        iq_layer = getattr(module, var_name)
    else:
        iq_layer = iqMulLayer()
        iq_layer.training = module.training
        iq_layer = iq_layer.to(x.device)
        setattr(module, var_name, iq_layer)
    # Per-batch output scale from the float product's max magnitude.
    scale_z = None
    with torch.no_grad():
        z_f = torch.mul(x, y)
        max_z = torch.max(torch.abs(z_f))
        if max_z == 0:
            scale_z = 1.0
        else:
            scale_z = 127 / max_z.item()
    return iq_layer(x, y, scale_z, quant_mode)


class iqDiv(torch.autograd.Function):
    """Quantized elementwise divide (tensor / tensor).

    Unlike iqMul the division itself is done in float; only the result is
    quantized to int8 at the tracked output scale.
    """

    @staticmethod
    def forward(self, x, y, scale_x, scale_y, zero_x, zero_y, local_scale_o, scale_o, running_o, training, quant_mode, prefix, dump, path):
        self.save_for_backward(x, y)
        scale_z_iq = local_scale_o
        momentum = 0.1
        if training:
            running_o.mul_(1-momentum).add_(momentum*(127/local_scale_o()))
            if quant_mode == QuantMode.QValue:
                scale_log = round(math.log(scale_z_iq(), 2))
                scale_z_iq = math.pow(2, scale_log)
                scale_z_iq = ScalerBuffer(scale_z_iq)
        else:
            assert running_o.data > 0, 'Must at least training one batch'
            if quant_mode == QuantMode.QValue:
                scale_log = round(math.log(127/running_o.data, 2))
                scale_z_iq = math.pow(2, scale_log)
            else:
                scale_z_iq = np.float32((math.pow(2, 8-1)-1) / running_o.data)
            scale_z_iq = ScalerBuffer(scale_z_iq)
        scale_o.fill_(scale_z_iq())
        z_int = None
        z_float = x.data / y.data
        if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,):
            # Round-half-up quantization of the float quotient.
            z_int = (z_float * scale_z_iq() + 0.5).floor().int()
            z_int.clamp_(-128, 127)
        else:
            assert False, 'platform_quant mode donot support for iqdiv'
        # z_float = z_int.float() /scale_z_iq()
        z_float = Quant.dequant(z_int, scale_z_iq)

        if dump:
            name_list = ["input1", "input2", "outputs", "q_outputs"]
            attr_list = [x, y, z_float, z_int]
            Dump.dump_file(prefix, ".iqDiv.", zip(name_list, attr_list), path)

        return from_torch_tensor(z_float, scale_z_iq(), 8)

    @staticmethod
    def backward(self, s):
        # Straight-through gradient of the float divide.
        x, y = self.saved_tensors
        x = x.detach().clone().requires_grad_(True)
        y = y.detach().clone().requires_grad_(True)
        grad = None
        with torch.enable_grad():
            z = x / y
            grad = torch.autograd.grad(z, (x, y), s)
        return grad[0], grad[1], None, None, None, None, None, None, None, None, None, None, None, None

    @staticmethod
    def symbolic(g, x, y, scale_x, scale_y, zero_x, zero_y, local_scale_o, scale_o, running_o, training, quant_mode, prefix, dump, path):
        input_list = [x, y]
        platform_quant = platform_to_string(
            config.PlatFormQuant.platform_quant)
        param_dict = {'scale_x_f': scale_x(), 'scale_y_f': scale_y(
        ), 'scale_o_f': scale_o(), 'platform_quant_s': platform_quant}

        return g.op("thinker::iqDiv", *input_list, **param_dict)


class iqDivScalar(torch.autograd.Function):
    """Quantized divide by a Python scalar; exported as iqDiv with the
    scalar baked in as a float Constant."""

    @staticmethod
    def forward(self, x, y, scale_x, scale_y, zero_x, zero_y, local_scale_o, scale_o, running_o, training, quant_mode, prefix, dump, path):
        assert isinstance(y, (int, float)), 'only support div scalar here'
        self.save_for_backward(x)
        self.y = y
        scale_z_iq = local_scale_o
        momentum = 0.1
        if training:
            running_o.mul_(1-momentum).add_(momentum*(127/local_scale_o()))
            if quant_mode == QuantMode.QValue:
                scale_log = round(math.log(scale_z_iq(), 2))
                scale_z_iq = math.pow(2, scale_log)
                scale_z_iq = ScalerBuffer(scale_z_iq)
        else:
            assert running_o.data > 0, 'Must at least training one batch'
            if quant_mode == QuantMode.QValue:
                scale_log = round(math.log(127/running_o.data, 2))
                scale_z_iq = math.pow(2, scale_log)
            else:
                scale_z_iq = np.float32((math.pow(2, 8-1)-1) / running_o.data)
            scale_z_iq = ScalerBuffer(scale_z_iq)
        scale_o.fill_(scale_z_iq())
        z_int = None
        x_float = x
        y_float = torch.tensor(y, dtype=torch.float32, device=x.device)
        z_float = x_float / y_float
        if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,):
            z_int = (z_float * scale_z_iq() + 0.5).floor().int()
            z_int.clamp_(-128, 127)
        else:
            assert False, 'platform_quant mode donot support for iqdiv'
        # z_float = z_int.float() /scale_z_iq()
        z_float = Quant.dequant(z_int, scale_z_iq)

        if dump:
            name_list = ["input1", "input2", "outputs", "q_outputs"]
            attr_list = [x, y, z_float, z_int]
            Dump.dump_file(prefix, ".iqDivScalar.",
                           zip(name_list, attr_list), path)

        return from_torch_tensor(z_float, scale_z_iq(), 8)

    @staticmethod
    def backward(self, s):
        x, = self.saved_tensors
        x = x.detach().clone().requires_grad_(True)
        y = self.y
        grad = None
        with torch.enable_grad():
            z = x / y
            grad = torch.autograd.grad(z, (x,), s)
        return grad[0], None, None, None, None, None, None, None, None, None, None, None, None, None

    @staticmethod
    def symbolic(g, x, y, scale_x, scale_y, zero_x, zero_y, local_scale_o, scale_o, running_o, training, quant_mode, prefix, dump, path):
        tensor_y = g.op("Constant", value_t=torch.tensor(
            y, dtype=torch.float32))
        input_list = [x, tensor_y]
        platform_quant = platform_to_string(
            config.PlatFormQuant.platform_quant)
        param_dict = {'scale_x_f': scale_x(), 'scale_y_f': scale_y(
        ), 'scale_o_f': scale_o(), 'platform_quant_s': platform_quant}

        return g.op("thinker::iqDiv", *input_list, **param_dict)


class iqDivLayer(torch.nn.Module):
    r"""Module wrapper around :class:`iqDiv` / :class:`iqDivScalar`."""

    def __init__(self):
        super(iqDivLayer, self).__init__()
        self.prefix = ""
        self.dump = False
        self.path = ""

        self.register_buffer('scale_o', torch.zeros(1))
        self.register_buffer('running_o', torch.zeros(1))

    def forward(self, x, y, local_scale_o, quant_mode=QuantMode.QValue):
        r"""Forward pass.

        Args:
            x, y (IQTensor): operands of ``x / y`` (``y`` may be a scalar).
            local_scale_o (float): output scale of the current batch.
            quant_mode (QuantMode): output quantization mode; currently
                only Q-value (power-of-two) scaling is supported.
        Returns:
            The quotient ``z`` as an IQTensor.
        """
        scale_x = ScalerBuffer(x.scale_data)
        scale_y = ScalerBuffer(y.scale_data) if isinstance(
            y, IQTensor) else ScalerBuffer(1.0)
        scale_o = ScalerBuffer(self.scale_o)
        running_o = ScalerBuffer(self.running_o)
        local_scale_o = ScalerBuffer(local_scale_o)
        if isinstance(y, IQTensor):
            z = iqDiv.apply(x, y, scale_x, scale_y, x.zero_point, y.zero_point, local_scale_o,
                            scale_o, running_o, self.training, quant_mode, self.prefix, self.dump, self.path)
        else:
            z = iqDivScalar.apply(x, y, scale_x, scale_y, x.zero_point, 0, local_scale_o,
                                  scale_o, running_o, self.training, quant_mode, self.prefix, self.dump, self.path)
        # print('running_o: ', running_o())
        self.scale_o.fill_(scale_o.data)
        self.running_o.fill_(running_o.data)
        return z

    def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
                              missing_keys, unexpected_keys, error_msgs):
        ModuleIntConfig._load_from_state_dict_global(
            self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)


def iqdiv(module, x, y, name="_default"):
    r"""IQTensor division with output re-scaling; the output scale supports
    max-value calibration and Q-value (power-of-two) calibration.

    .. math::
        z = x / y

    Args:
        module(torch.nn.Module): the module the divide is registered on;
            usually ``self`` when used inside a forward.
        x(IQTensor): dividend.
        y(float): divisor (IQTensor or Python scalar).
        name(str): name of the generated sub-module attribute on ``module``.
    Notes:
        With IQTensor inputs this divide is applied automatically;
        linger.SetIQTensorDiv(False) disables iqdiv.
    Example:
        >>> class iqTestLayer(torch.nn.Module):
        >>>     def __init__(self):
        >>>         super(iqTestLayer,self).__init__()
        >>>     def forward(self,x,y):
        >>>         return iqdiv(self,x,y,'test')
        >>> a = from_torch_tensor(x,127.0/6.0,8)
        >>> b = from_torch_tensor(y,127.0/8,8)
        >>> net = iqTestLayer().cuda()
        >>> m = net(a,b)

    """
    quant_mode = getattr(module, LINGER_MODE, QuantMode.QValue)
    # assert quant_mode is not None, 'invalid add quant mode'
    assert isinstance(x, IQTensor)
    assert x.bits == 8, 'iqdiv only support 8bit'
    if isinstance(y, IQTensor):
        assert y.bits == 8, 'iqdiv only support 8bit'
    var_name = LINGER_MIX_INT8_MANUAL_ROUND_LAYERS + '_iqdiv_' + name
    iq_layer = None
    if hasattr(module, var_name):
        iq_layer = getattr(module, var_name)
    else:
        iq_layer = iqDivLayer()
        iq_layer.training = module.training
        iq_layer = iq_layer.to(x.device)
        setattr(module, var_name, iq_layer)
    scale_z = None
    with torch.no_grad():
        z_f = torch.div(x, y)
        max_z = torch.max(torch.abs(z_f))
        if max_z == 0:
            scale_z = 1.0
        else:
            scale_z = 127 / max_z.item()
    return iq_layer(x, y, scale_z, quant_mode)


class iqAdd(torch.autograd.Function):
    """Quantized elementwise add.

    Both operands are rescaled into the common output Q-domain in integer
    arithmetic before summation; the output scale is clamped to the smaller
    input scale so neither operand loses range.
    """

    @staticmethod
    def forward(self, x, y, scale_x, scale_y, local_scale_o, scale_o, running_o, training, quant_mode, prefix, dump, path):
        self.save_for_backward(x, y)
        x_int = x.quant_to_int8(scale_x())
        y_int = y.quant_to_int8(scale_y())
        x_int = x_int.contiguous()
        y_int = y_int.contiguous()
        scale_z_iq = local_scale_o
        momentum = 0.1
        if training:
            running_o.mul_(1-momentum).add_(momentum*(127/local_scale_o()))
            if quant_mode == QuantMode.QValue:
                scale_log = round(math.log(scale_z_iq(), 2))
                scale_z_iq = math.pow(2, scale_log)
                scale_z_iq = ScalerBuffer(scale_z_iq)
        else:
            assert running_o.data > 0, 'Must at least training one batch'
            if quant_mode == QuantMode.QValue:
                scale_log = round(math.log(127/running_o.data, 2))
                scale_z_iq = math.pow(2, scale_log)
            else:
                scale_z_iq = np.float32((math.pow(2, 8-1)-1) / running_o.data)
            # Never let the output scale exceed either input scale, so the
            # per-operand rescale factors stay <= 1.
            min_scale_in = min(scale_x(), scale_y())
            if scale_z_iq > min_scale_in:
                scale_z_iq = ScalerBuffer(min_scale_in)
            else:
                scale_z_iq = ScalerBuffer(scale_z_iq)
        scale_o.fill_(scale_z_iq())
        if (x_int.size() != y_int.size()):
            x_int, y_int = torch.broadcast_tensors(x_int, y_int)
        z_int = None
        if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,):
            # Rescale each operand into the output domain (round-half-up)
            # and add in the integer domain; saturate to int8.
            z_int = (x_int * (scale_z_iq()/scale_x()) + 0.5).floor().int() + \
                (y_int*(scale_z_iq()/scale_y()) + 0.5).floor().int()
            # z_int = (x_int * (scale_z_iq()/scale_x()) + y_int * (scale_z_iq()/scale_y()) + 0.5).floor().int()
            z_int.clamp_(-128, 127)
        else:
            assert False, 'platform_quant mode donot support for iqadd'
        z_float = Quant.dequant(z_int, scale_z_iq)
        # z_float = z_int.float() /scale_z_iq()

        if dump:
            name_list = ["input1", "input2", "outputs",
                         "q_input1", "q_input2", "q_outputs"]
            attr_list = [x, y, z_float, x_int, y_int, z_int]
            Dump.dump_file(prefix, ".iqAdd.", zip(name_list, attr_list), path)

        return from_torch_tensor(z_float, scale_z_iq(), 8)

    @staticmethod
    def backward(self, s):
        x, y = self.saved_tensors
        x = x.detach().clone().requires_grad_(True)
        y = y.detach().clone().requires_grad_(True)
        grad = None
        with torch.enable_grad():
            z = x + y
            grad = torch.autograd.grad(z, (x, y), s)
        return grad[0], grad[1], None, None, None, None, None, None, None, None, None, None

    @staticmethod
    def symbolic(g, x, y, scale_x, scale_y, local_scale_o, scale_o, running_o, training, quant_mode, prefix, dump, path):
        input_list = [x, y]
        platform_quant = platform_to_string(
            config.PlatFormQuant.platform_quant)
        param_dict = {'scale_x_f': scale_x(), 'scale_y_f': scale_y(
        ), 'scale_o_f': scale_o(), 'platform_quant_s': platform_quant}
        if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant:
            param_dict['mode_s'] = 'Non_t710_mode'
        return g.op("thinker::iqAdd", *input_list, **param_dict)


class iqAddScalar(torch.autograd.Function):
    """Quantized add of a Python scalar: computed in float, then quantized
    to int8 at the tracked output scale."""

    @staticmethod
    def forward(self, x, y, scale_x, scale_y, local_scale_o, scale_o, running_o, training, quant_mode, prefix, dump, path):
        self.save_for_backward(x,)
        self.y = y
        scale_z_iq = local_scale_o
        momentum = 0.1
        if training:
            running_o.mul_(1-momentum).add_(momentum*(127/local_scale_o()))
            if quant_mode == QuantMode.QValue:
                scale_log = round(math.log(scale_z_iq(), 2))
                scale_z_iq = math.pow(2, scale_log)
                scale_z_iq = ScalerBuffer(scale_z_iq)
        else:
            assert running_o.data > 0, 'Must at least training one batch'
            if quant_mode == QuantMode.QValue:
                scale_log = round(math.log(127/running_o.data, 2))
                scale_z_iq = math.pow(2, scale_log)
            else:
                scale_z_iq = np.float32((math.pow(2, 8-1)-1) / running_o.data)
            scale_z_iq = ScalerBuffer(scale_z_iq)
        scale_o.fill_(scale_z_iq())
        z_int = None
        y = torch.tensor(y, dtype=torch.float32, device=x.device)
        z_float = x + y
        if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,):
            z_int = ((z_float * scale_z_iq()) + 0.5).floor().int()
            z_int.clamp_(-128, 127)
        else:
            assert False, 'platform_quant mode donot support for iqadd'
        z_float = Quant.dequant(z_int, scale_z_iq)

        if dump:
            name_list = ["input1", "input2", "outputs", "q_outputs"]
            attr_list = [x, y, z_float, z_int]
            Dump.dump_file(prefix, ".iqAddScalar.",
                           zip(name_list, attr_list), path)

        return from_torch_tensor(z_float, scale_z_iq(), 8)

    @staticmethod
    def backward(self, s):
        x, = self.saved_tensors
        x = x.detach().clone().requires_grad_(True)
        y = self.y
        grad = None
        with torch.enable_grad():
            z = x + y
            grad = torch.autograd.grad(z, (x,), s)
        return grad[0], None, None, None, None, None, None, None, None, None, None, None

    @staticmethod
    def symbolic(g, x, y, scale_x, scale_y, local_scale_o, scale_o, running_o, training, quant_mode, prefix, dump, path):
        tensor_y = g.op("Constant", value_t=torch.tensor(
            y, dtype=torch.float32))
        input_list = [x, tensor_y]
        platform_quant = platform_to_string(
            config.PlatFormQuant.platform_quant)
        param_dict = {'scale_x_f': scale_x(), 'scale_y_f': scale_y(
        ), 'scale_o_f': scale_o(), 'platform_quant_s': platform_quant}
        return g.op("thinker::iqAdd", *input_list, **param_dict)


class iqAddLayer(torch.nn.Module):
    r"""Module wrapper around :class:`iqAdd` / :class:`iqAddScalar`."""

    def __init__(self):
        super(iqAddLayer, self).__init__()
        self.prefix = ""
        self.dump = False
        self.path = ""

        self.register_buffer('scale_o', torch.zeros(1))
        self.register_buffer('running_o', torch.zeros(1))

    def forward(self, x, y, local_scale_o, quant_mode=QuantMode.QValue):
        r"""Forward pass.

        Args:
            x, y (IQTensor): operands of ``x + y`` (``y`` may be a scalar).
            local_scale_o (float): output scale of the current batch.
            quant_mode (QuantMode): output quantization mode; currently
                only Q-value (power-of-two) scaling is supported.
        Returns:
            The sum ``z`` as an IQTensor.
        """
        scale_x = ScalerBuffer(x.scale_data)
        scale_y = ScalerBuffer(y.scale_data) if isinstance(
            y, IQTensor) else ScalerBuffer(1.0)
        local_scale_o = ScalerBuffer(local_scale_o)
        scale_o = ScalerBuffer(self.scale_o)
        running_o = ScalerBuffer(self.running_o)
        if isinstance(y, IQTensor):
            z = iqAdd.apply(x, y, scale_x, scale_y, local_scale_o, scale_o, running_o,
                            self.training, quant_mode, self.prefix, self.dump, self.path)
        else:
            z = iqAddScalar.apply(x, y, scale_x, scale_y, local_scale_o, scale_o,
                                  running_o, self.training, quant_mode, self.prefix, self.dump, self.path)
        self.scale_o.fill_(scale_o.data)
        self.running_o.fill_(running_o.data)
        return z

    def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
                              missing_keys, unexpected_keys, error_msgs):
        ModuleIntConfig._load_from_state_dict_global(
            self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)


def iqadd(module, x, y, name="_default"):
    r"""IQTensor addition with output re-scaling; the output scale supports
    max-value calibration and Q-value (power-of-two) calibration.

    .. math::
        z = x + y

    Args:
        module(torch.nn.Module): the module the add is registered on;
            usually ``self`` when iqadd is used inside a forward.
        x(IQTensor): left operand.
        y(IQTensor): right operand (IQTensor or Python scalar).
        name(str): name of the generated sub-module attribute on ``module``.
    Notes:
        With IQTensor inputs this add is applied automatically;
        linger.SetIQTensorAdd(False) disables iqadd.
    Example:
        >>> class iqTestLayer(torch.nn.Module):
        >>>     def __init__(self):
        >>>         super(iqTestLayer,self).__init__()
        >>>     def forward(self,x,y):
        >>>         return iqadd(self,x,y,'test')
        >>> a = from_torch_tensor(x,127.0/6.0,8)
        >>> b = from_torch_tensor(y,127.0/8,8)
        >>> net = iqTestLayer().cuda()
        >>> m = net(a,b)

    """
    quant_mode = getattr(module, LINGER_MODE, QuantMode.QValue)
    # assert quant_mode is not None, 'invalid add quant mode'
    assert isinstance(x, IQTensor)
    # assert isinstance(y,IQTensor)
    assert x.bits == 8, 'iqadd only support 8bit'
    if isinstance(y, IQTensor):
        assert y.bits == 8, 'iqadd only support 8bit'
    var_name = LINGER_MIX_INT8_MANUAL_ROUND_LAYERS + '_iqadd_' + name
    iq_layer = None
    if hasattr(module, var_name):
        iq_layer = getattr(module, var_name)
    else:
        iq_layer = iqAddLayer()
        iq_layer.training = module.training
        iq_layer = iq_layer.to(x.device)
        setattr(module, var_name, iq_layer)
    scale_z = None
    with torch.no_grad():
        z_f = torch.add(x, y)
        # Calibrate the output scale over inputs AND output so the chosen
        # range covers the operands as well as the sum.
        if torch.is_tensor(y):
            z_f = torch.cat((x, y, z_f))
        else:
            z_f = torch.cat((x, z_f))
        max_z = torch.max(torch.abs(z_f))
        if max_z == 0:
            scale_z = 1.0
        else:
            scale_z = 127 / max_z.item()
    return iq_layer(x, y, scale_z, quant_mode)


class iqSum(torch.autograd.Function):
    """Quantized torch.sum: float reduction followed by int8 re-quantization
    at the tracked output scale."""

    @staticmethod
    def forward(self, x, scale_x, args, kwargs, local_scale_o, scale_o, running_o, training, quant_mode, prefix, dump, path):
        self.save_for_backward(x,)
        self.args = args
        self.kwargs = kwargs
        scale_z_iq = local_scale_o
        momentum = 0.1
        if training:
            running_o.mul_(1-momentum).add_(momentum*(127/local_scale_o()))
            if quant_mode == QuantMode.QValue:
                scale_log = round(math.log(scale_z_iq(), 2))
                scale_z_iq = math.pow(2, scale_log)
                scale_z_iq = ScalerBuffer(scale_z_iq)
        else:
            assert running_o.data > 0, 'Must at least training one batch'
            if quant_mode == QuantMode.QValue:
                scale_log = round(math.log(127/running_o.data, 2))
                scale_z_iq = math.pow(2, scale_log)
            else:
                scale_z_iq = np.float32((math.pow(2, 8-1)-1) / running_o.data)
            scale_z_iq = ScalerBuffer(scale_z_iq)
        scale_o.fill_(scale_z_iq())
        z_int = None
        z_float = torch.sum(x, *args, **kwargs)
        if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,):
            z_int = (z_float * scale_z_iq() + 0.5).floor().int()
            z_int.clamp_(-128, 127)
        else:
            assert False, 'platform_quant mode donot support for iqSum'
        z_float = Quant.dequant(z_int, scale_z_iq)

        if dump:
            name_list = ["input", "outputs", "q_outputs"]
            attr_list = [x, z_float, z_int]
            Dump.dump_file(prefix, ".iqSum.", zip(name_list, attr_list), path)

        return from_torch_tensor(z_float, scale_z_iq(), 8)

    @staticmethod
    def backward(self, s):
        x, = self.saved_tensors
        x = x.detach().clone().requires_grad_(True)
        args = self.args
        kwargs = self.kwargs
        grad = None
        with torch.enable_grad():
            z = torch.sum(x, *args, **kwargs)
            grad = torch.autograd.grad(z, (x,), s)
        return grad[0], None, None, None, None, None, None, None, None, None, None, None

    @staticmethod
    def symbolic(g, x, scale_x, args, kwargs, local_scale_o, scale_o, running_o, training, quant_mode, prefix, dump, path):
        input_list = [x, ]
        platform_quant = platform_to_string(
            config.PlatFormQuant.platform_quant)
        param_dict = {'scale_x_f': scale_x(), 'scale_o_f': scale_o(),
                      'platform_quant_s': platform_quant}
        # Only a single positional arg (the reduction dim) is exported.
        if len(args) == 1:
            param_dict['dims_i'] = args[0]

        return g.op("thinker::iqSum", *input_list, **param_dict)


class iqSumLayer(torch.nn.Module):
    r"""Module wrapper around :class:`iqSum`."""

    def __init__(self):
        super(iqSumLayer, self).__init__()
        self.prefix = ""
        self.dump = False
        self.path = ""
        self.register_buffer('scale_o', torch.zeros(1))
        self.register_buffer('running_o', torch.zeros(1))

    def forward(self, x, args, kwargs, local_scale_o, quant_mode=QuantMode.QValue):
        r"""Forward pass.

        Args:
            x (IQTensor): tensor to reduce with ``x.sum()``.
            args, kwargs: forwarded to ``torch.sum``.
            local_scale_o (float): output scale of the current batch.
            quant_mode (QuantMode): output quantization mode; currently
                only Q-value (power-of-two) scaling is supported.
        Returns:
            The reduced result ``z`` as an IQTensor.
        """
        scale_x = ScalerBuffer(x.scale_data)
        local_scale_o = ScalerBuffer(local_scale_o)
        scale_o = ScalerBuffer(self.scale_o)
        running_o = ScalerBuffer(self.running_o)
        z = iqSum.apply(x, scale_x, args, kwargs, local_scale_o, scale_o, running_o,
                        self.training, quant_mode, self.prefix, self.dump, self.path)
        self.scale_o.fill_(scale_o.data)
        self.running_o.fill_(running_o.data)
        return z

    def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
                              missing_keys, unexpected_keys, error_msgs):
        ModuleIntConfig._load_from_state_dict_global(
            self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)


def iqsum(module, x, *args, **kwargs):
    r"""IQTensor summation with output re-scaling; the output scale supports
    max-value calibration and Q-value (power-of-two) calibration.

    .. math::
        z = x.sum(*args, **kwargs)
    """
    quant_mode = getattr(module, LINGER_MODE, QuantMode.QValue)
    assert isinstance(x, IQTensor)
    assert x.bits == 8, 'iqsum only support 8bit'
    name = kwargs.pop('name', '_default')
    var_name = LINGER_MIX_INT8_MANUAL_ROUND_LAYERS + '_iqsum_' + name
    iq_layer = None
    if hasattr(module, var_name):
        iq_layer = getattr(module, var_name)
    else:
        iq_layer = iqSumLayer()
        iq_layer.training = module.training
        iq_layer = iq_layer.to(x.device)
        setattr(module, var_name, iq_layer)
    scale_z = None
    with torch.no_grad():
        z_f = torch.sum(x, *args, **kwargs)
        max_z = torch.max(torch.abs(z_f))
        if max_z == 0:
            scale_z = 1.0
        else:
            scale_z = 127 / max_z.item()
    return iq_layer(x, args, kwargs, scale_z, quant_mode)


class IQTensor(torch.Tensor):
    r"""A torch.Tensor subclass carrying quantization metadata, with custom
    quantization behavior and custom ONNX export. Besides the normal
    torch.Tensor attributes it supports:

    `Category 1`: bypass functions that do not touch the data and only
    propagate the IQTensor attributes — view, view_as, reshape, reshape_as,
    squeeze, unsqueeze, transpose, flatten, __getitem__ (slicing like
    ``y = x[2:]``), and the torch ops torch.max_pool2d, torch.relu,
    torch.relu_.

    `Category 2`: re-scaling functions whose output is re-calibrated —
    __add__ (+), __iadd__ (+=), __mul__ (*), __imul__ (*=), and torch.cat.

    `Category 3`: everything else falls back to the default torch.Tensor
    behavior.

    Attributes:
        bits(int): precision of the IQTensor data, typically 8 or 16.
        scale_data(float): scale of the IQTensor data, defined as
            :math:`\frac{2^{bits-1}-1}{max\_value}`.

    """
    if torch.__version__ >= '1.7.0':
        @classmethod
        def __torch_function__(cls, func, types, args=(), kwargs=None):
            # Dispatch torch functions on the plain-Tensor path; IQTensor
            # semantics are implemented through explicit method overrides.
            if kwargs is None:
                kwargs = {}

            if not all(issubclass(cls, t) for t in types):
                return NotImplemented

            with _C.DisableTorchFunction():
                ret = func(*args, **kwargs)
            return ret

    def __add__(self, *args, **kwargs):
        # Route float IQTensor '+' through iqadd when enabled and a parent
        # module is being traced; otherwise fall back to plain tensor add.
        if not isinstance(args[0], (IQTensor, float, int)) or not config.IQTensor.iqadd \
                or self.dtype != torch.float:
            return super(IQTensor, self).__add__(*args, **kwargs)
        module_self = get_current_module()
        if module_self is None:
            return super(IQTensor, self).__add__(*args, **kwargs)
        # Per-module counter gives each operator call site a unique name.
        iname_index = getattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER)
        setattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER, iname_index+1)
        return iqadd(module_self, self, args[0], '_default_index_'+str(iname_index))

    def __iadd__(self, *args, **kwargs):
        if not isinstance(args[0], (IQTensor, float, int)) or not config.IQTensor.iqadd \
                or self.dtype != torch.float:
            return super(IQTensor, self).__iadd__(*args, **kwargs)
        module_self = get_current_module()
        if module_self is None:
            return super(IQTensor, self).__iadd__(*args, **kwargs)
        iname_index = getattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER)
        setattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER, iname_index+1)
        self = iqadd(module_self, self,
                     args[0], '_default_index_'+str(iname_index))
        return self

    def __mul__(self, *args, **kwargs):
        if not isinstance(args[0], (IQTensor, float, int)) or not config.IQTensor.iqmul \
                or self.dtype != torch.float:
            return super(IQTensor, self).__mul__(*args, **kwargs)
        module_self = get_current_module()
        if module_self is None:
            return super(IQTensor, self).__mul__(*args, **kwargs)
        iname_index = getattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER)
        setattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER, iname_index+1)
        return iqmul(module_self, self, args[0], '_default_index_'+str(iname_index))

    def __imul__(self, *args, **kwargs):
        if not isinstance(args[0], (IQTensor, float, int)) or not config.IQTensor.iqmul \
                or self.dtype != torch.float:
            return super(IQTensor, self).__imul__(*args, **kwargs)
        module_self = get_current_module()
        if module_self is None:
            return super(IQTensor, self).__imul__(*args, **kwargs)
        iname_index = getattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER)
        setattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER, iname_index+1)
        self = iqmul(module_self, self,
                     args[0], '_default_index_'+str(iname_index))
        return self

    def __truediv__(self, *args, **kwargs):
        if not isinstance(args[0], (IQTensor, float, int)) or not config.IQTensor.iqdiv \
                or self.dtype != torch.float:
            return super(IQTensor, self).__truediv__(*args, **kwargs)
        module_self = get_current_module()
        if module_self is None:
            return super(IQTensor, self).__truediv__(*args, **kwargs)
        iname_index = getattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER)
        setattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER, iname_index+1)
        self = iqdiv(module_self, self,
                     args[0], '_default_index_'+str(iname_index))
        return self

    # --- shape-only ops delegate to the bypass autograd Functions above ---

    def view(self, *args, **kwargs):
        return iqView.apply(self, *args, **kwargs)

    def flip(self, *args, **kwargs):
        return iqFlip.apply(self, *args, **kwargs)

    def split(self, *args, **kwargs):
        return iqSplit.apply(self, *args, **kwargs)

    def view_as(self, other):
        return iqReshape_as.apply(self, other)

    def reshape(self, *args, **kwargs):
        return iqReshape.apply(self, *args, **kwargs)

    def reshape_as(self, other):
        return iqReshape_as.apply(self, other)

    def squeeze(self, *args):
        return iqSqueeze.apply(self, *args)

    def unsqueeze(self, dim):
        return iqUnsqueeze.apply(self, dim)

    def transpose(self, dim0, dim1):
        return iqTranspose.apply(self, dim0, dim1)

    def permute(self, *dims):
        return iqPermute.apply(self, *dims)

    def flatten(self, start_dim=0, end_dim=-1):
        return iqFlatten.apply(self, start_dim, end_dim)

    def contiguous(self, *args, **kwargs):
        return iqContiguous.apply(self, *args, **kwargs)

    def chunk(self, chunks, dim=0):
        m = super(IQTensor, self).chunk(chunks, dim)
        return tuple(from_torch_tensor(ret, self.scale_data, self.bits, self.zero_point) for ret in m)

    def sum(self, *args, **kwargs):
        if not config.IQTensor.iqsum or self.dtype != torch.float:
            return super(IQTensor, self).sum(*args, **kwargs)
        module_self = get_current_module()
        if module_self is None:
            return super(IQTensor, self).sum(*args, **kwargs)
        iname_index = getattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER)
        setattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER, iname_index+1)
        kwargs['name'] = '_default_index_'+str(iname_index)
        self = iqsum(module_self, self, *args, **kwargs)
        return self

    def __getitem__(self, indices):
        m = super(IQTensor, self).__getitem__(indices)
        return from_torch_tensor(m, self.scale_data, self.bits, self.zero_point)

    def requant_(self):
        r"""Re-quantize the data in place using the tensor's own
        ``scale_data`` and ``bits``, quantizing and restoring float.

        .. math::
            data=\frac{clamp(round(data*scale\_data),2^{bits-1},2^{bits-1}-1)}{scale\_data}

        Notes:
            No gradient is involved in this method.
        Returns:
            None.

        """
        with torch.no_grad():
            assert self.scale_data is not None
            r = super(IQTensor, self).__mul__(self.scale_data)
            m2 = math.pow(2, self.bits-1)
            if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,):
                r = (r + 0.5).floor()
            else:
                assert False, "linger only support luna quant."
            # +-0.01 widens the clamp bounds to absorb float rounding error.
            r.clamp_(-m2-0.01+self.zero_point, m2 - 1+0.01+self.zero_point)
            self.data.copy_(r/self.scale_data)

    def quant_to_int8(self, scale=None):
        r"""Quantize the data to int8.

        .. math::
            result=int(clamp(round(data*scale\_data),-128,127))

        Args:
            scale(float or None): scale to quantize with; ``None`` means
                use ``self.scale_data``. Defaults to None.
        Returns:
            The quantized values. Note: the returned dtype is
            torch.float holding int8-range values.

        """
        with torch.no_grad():
            if scale == None:
                scale = self.scale_data
            r = super(IQTensor, self).__mul__(scale)
            # training with same drop
            if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,):
                r = (r + 0.5).floor()
            else:
                assert False, "linger only support luna quant."
            r.clamp_(-128.0-0.01+self.zero_point, 127.0+0.01+self.zero_point)
            r = r.float()
            return r

    def scale_to(self, target_scale, training=False):
        r"""将自身数据调整到target_scale

        ..
math:: - result=\frac{clamp(round(int(clamp(round(data*scale\_data),-128,127))*(\frac{target\_scale}{scale\_data})),-128,127)}{target\_scale} - - Args: - target_scale(float):进行调整的目标scale - returns: - 返回调整后的浮点数据 - - """ - with torch.no_grad(): - scale = self.scale_data - r = super(IQTensor, self).__mul__(scale) - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - r = (r + 0.5).floor() - else: - assert False, "linger only support luna quant." - r.clamp_(-128.0-0.01+self.zero_point, 127.0+0.01+self.zero_point) - r = r * (target_scale/scale) - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - r = (r + 0.5).floor() - else: - assert False, "linger only support luna quant." - r.clamp_(-128.0-0.01+self.zero_point, 127.0+0.01+self.zero_point) - r = r / target_scale - return r - - -torch_bmm = torch.bmm - -__all__ = ['IQTensor', 'from_torch_tensor', 'iqAddLayer', 'iqadd', 'iqMulLayer', 'iqmul', 'iqSumLayer', - 'iqDivLayer', 'platform_to_string', 'quantlinear', 'dequantlinear', 'Quant2IQTensor', 'torch_bmm'] diff --git a/linger/ops/layernorm_int.py b/linger/ops/layernorm_int.py deleted file mode 100644 index 7aa31cc..0000000 --- a/linger/ops/layernorm_int.py +++ /dev/null @@ -1,495 +0,0 @@ -import math -import itertools -from collections import OrderedDict - -import torch -import torch.nn as nn -from torch.onnx import is_in_onnx_export - -from ..config import config -from ..quant import (Quant, normalize_bias_with_config, - normalize_data_with_config, normalize_weight_with_config) -from ..utils import Dump, PlatFormQuant, QuantMode, ScalerBuffer -from .iqtensor import (IQTensor, from_torch_tensor, platform_to_string, - quantlinear) -from .ops import ModuleIntConfig -from .requant import Requant - -import lingerext - -class LayerNormFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, weights, bias, normalized_shape, - training, momentum, eps, - running_x, running_normal, running_w, running_o, - eval_scale_x, 
eval_scale_normal, eval_scale_w, eval_scale_o, - data_bits, parameter_bits, prefix, dump, - path, mode, o_bits, quant, is_not_from_iqtensor, - ahead_relu, clamp_data, clamp_weight, clamp_bias): - N = 1 - for dim_size in normalized_shape: - N *= dim_size - if len(weights.shape) == 1: - w_shape = (-1) - elif len(weights.shape) == 2: - w_shape = (-2, -1) - elif len(weights.shape) == 3: - w_shape = (-3, -2, -1) - elif len(weights.shape) == 4: - w_shape = (-4, -3, -2, -1) - else: - assert False, f"weight.shape=={weights.shape}, please check your LayerNorm definition." - H, W = input.shape[-2], input.shape[-1] - scale_x = None - scale_w = None - scale_o = None - if training: - ctx.clamp_data = clamp_data - ctx.clamp_weight = clamp_weight - ctx.clamp_bias = clamp_bias - ctx.eps = eps - ctx.bits = data_bits, parameter_bits, o_bits - zero_point = input.zero_point if isinstance(input, IQTensor) else 0 - is_iq_tensor = True if isinstance(input, IQTensor) else False - ctx.value = zero_point, is_iq_tensor - ctx.N = N - ctx.w_shape = w_shape - saved_tensors = [input, weights, bias] - - # x - if isinstance(input, IQTensor): - q_input, _, max_value_x = quant.quant( - input.data, data_bits, eval_scale_x, mode=QuantMode.QValue, quant_data='input', iq_zero_point=input.zero_point) - scale_x = eval_scale_x - else: - q_input, scale_x, max_value_x = quant.quant( - input.data, data_bits, mode=mode, quant_data='input') - running_x.mul_(1-momentum).add_(momentum*max_value_x) - # q_input = q_input.float() if data_bits + parameter_bits <= 16 else q_input.double() - q_input = q_input.contiguous().int() - sum_x = q_input.clone().sum(w_shape, keepdim=True) - sum_x2 = q_input.clone().pow(2).sum(w_shape, keepdim=True) - denominator = N * sum_x2 - sum_x * sum_x - scale_eps = int(math.log2(scale_x.data)) * 2 - q_eps = math.floor(eps * pow(2, scale_eps) * N * N + 0.5) - denominator = denominator + q_eps - q_x_normal = q_input * N - q_x_normal = q_x_normal - sum_x - q_x_normal = 
lingerext.luna_layernormint(q_x_normal.contiguous().int(), denominator.contiguous().long(), math.log2(scale_x.data)) - q_x_normal = q_x_normal.contiguous().int() - q_x_normal.clamp_(-2**15, 2**15-1) - eval_scale_normal.fill_(2**10) - - q_weights, scale_w, max_value_w = quant.quant( - weights, parameter_bits, mode=mode, quant_data='weight') - running_w.mul_(1-momentum).add_(momentum*max_value_w) - - q_outputs = q_x_normal * q_weights - - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - q_bias = (bias * scale_w * eval_scale_normal + 0.5).floor().long() - q_bias.clamp_(-2**31, 2**31-1) - q_outputs = q_outputs + q_bias - q_outputs.clamp_(-2**31, 2**31-1) - outputs = quant.dequant(q_outputs, eval_scale_normal*scale_w) - else: - assert False, "linger only support luna quant." - - out_tensor = outputs - scale_o = None - if o_bits is not None: - outputs = normalize_data_with_config(outputs, clamp_data) - out_tensor = outputs - q_outputs, scale_o, max_value_o = quant.quant( - outputs, o_bits, mode=mode, quant_data='output', ahead_relu=ahead_relu) - running_o.mul_(1-momentum).add_(momentum*max_value_o) - outputs = quant.dequant(q_outputs, scale_o) - saved_tensors += [out_tensor] - ctx.scale = scale_x, eval_scale_normal, scale_w, scale_o - ctx.save_for_backward(*saved_tensors) - - else: - assert running_x > 0, 'invalid running_x <= 0, please finetune training before eval' - eval_scale_normal.fill_(2**10) - if weights.dtype == torch.float32: - scale_x = ScalerBuffer(quant.running_to_scale( - running_x, data_bits, mode=mode)) - if o_bits is not None: - scale_o = ScalerBuffer(quant.running_to_scale( - running_o, o_bits, mode=mode)) - weigths = normalize_weight_with_config( - weights, clamp_weight, False) - bias = normalize_bias_with_config(bias, clamp_bias, False) - q_weights, scale_w, _ = quant.quant( - weigths, parameter_bits, mode=mode, quant_data='weight') - q_bias = None - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - if 
bias.dtype == torch.float32: - q_bias = (bias * scale_w * eval_scale_normal + 0.5).floor().long() - else: - q_bias = bias.contiguous().long() - else: - assert False, 'linger only support luna quant.' - else: - scale_x = eval_scale_x - scale_normal = eval_scale_normal - scale_w = eval_scale_w - scale_o = eval_scale_o - q_weights = weights.contiguous().int() - q_bias = bias.contiguous().long() - q_bias.clamp_(-2**31, 2**31-1) - if isinstance(input, IQTensor): - scale_x = eval_scale_x - q_input, _, _ = quant.quant( - input.data, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input', iq_zero_point=input.zero_point) - else: - q_input, _, _ = quant.quant( - input.data, data_bits, scale_x, mode=mode, quant_data='input') - q_input = q_input.contiguous().int() - q_weights = q_weights.contiguous().int() - sum_x = q_input.clone().sum(w_shape, keepdim=True) - sum_x2 = q_input.clone().pow(2).sum(w_shape, keepdim=True) - denominator = N * sum_x2 - sum_x * sum_x - scale_eps = int(math.log2(scale_x.data)) * 2 - q_eps = math.floor(eps * pow(2, scale_eps) * N * N + 0.5) - denominator = denominator + q_eps - q_x_normal = q_input * N - q_x_normal = q_x_normal - sum_x - q_x_normal = lingerext.luna_layernormint(q_x_normal.contiguous().int(), denominator.contiguous().long(), math.log2(scale_x.data)) - q_x_normal = q_x_normal.contiguous().int() - q_x_normal.clamp_(-2**15, 2**15-1) - q_outputs = q_x_normal * q_weights + q_bias - q_outputs.clamp_(-2**31, 2**31-1) - q_outputs = q_outputs.contiguous().double() - q_outputs = (q_outputs * scale_o() / (eval_scale_normal() * scale_w()) + 0.5).floor() - q_outputs = q_outputs.contiguous().int() - q_outputs.clamp_(-128, 127) - outputs = quant.dequant(q_outputs, scale_o) - - if dump: - name_list = ["input", "weights", "bias", "q_weights", "q_input", "q_input_normal", "denominator", "q_bias", "q_outputs", - "scale_normal", "scale_x", "scle_w", "scale_o"] - attr_list = [input, weights, bias, q_weights, q_input, q_x_normal, denominator, q_bias, 
q_outputs, - eval_scale_normal.data, scale_x.data, scale_w.data, scale_o.data] - Dump.dump_file(prefix, ".LayerNormInt.", - zip(name_list, attr_list), path) - - if isinstance(scale_o, float): - return from_torch_tensor(outputs, scale_o, o_bits) - elif isinstance(scale_o, torch.Tensor): - return from_torch_tensor(outputs, scale_o.item(), o_bits) - else: - return from_torch_tensor(outputs, scale_o.data, o_bits) - - @staticmethod - def backward(ctx, gradOutput): - input, weights, bias, outputs = ctx.saved_tensors - clamp_data = ctx.clamp_data - data_bits, parameter_bits, o_bits = ctx.bits - scale_x, scale_normal, scale_w, scale_o = ctx.scale - zero_point, is_iq_tensor = ctx.value - N = ctx.N - w_shape = ctx.w_shape - eps = ctx.eps - - if is_iq_tensor: - f_input = input.data - else: - q_input, _, _ = Quant.quant( - input.data, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input') - f_input = Quant.dequant(q_input, scale_x) - - f_input = f_input.detach().clone().requires_grad_(True) - - q_weights, _, _ = Quant.quant( - weights.data, parameter_bits, scale_w, mode=QuantMode.QValue, quant_data='weight') - f_weights = Quant.dequant(q_weights, scale_w) - f_weights = f_weights.detach().clone().requires_grad_(True) - bias = None if bias is None else bias.detach().clone().requires_grad_(True) - - with torch.enable_grad(): - mean = f_input.clone().sum(w_shape, keepdim=True) / N - var = input.clone().pow(2).sum(w_shape, keepdim=True) / N - \ - (input.clone().sum(w_shape, keepdim=True) / N).pow(2) - var = 1/torch.sqrt(var + eps) - var = torch.clamp(var, min=0.0) - f_input_normal = (input - mean) * var - z = f_weights * f_input_normal + bias - z = normalize_data_with_config(z, clamp_data) - gradInput, gradWeight, gradBias = torch.autograd.grad( - z, (f_input, f_weights, bias), gradOutput) - - return gradInput, gradWeight, gradBias, None, \ - None, None, None, None, None, \ - None, None, None, \ - None, None, None, None, \ - None, None, None, None, \ - None, None, None, 
None, \ - None, None, None, None, None, \ - None, None, None, None - - @staticmethod - def symbolic(g, input, weights, bias, normalized_shape, - training, momentum, eps, - running_x, running_normal, running_w, running_o, - scale_x, scale_normal, scale_w, scale_o, - data_bits, parameter_bits, prefix, dump, - path, mode, o_bits, quant, is_not_from_iqtensor, - ahead_relu, clamp_data, clamp_weight, clamp_bias): - op_inner = None - if is_not_from_iqtensor: - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - op_inner = quantlinear(g, input, scale_x(), - platform_quant, data_bits) - param_dict = {'data_bits_i': data_bits, 'parameter_bits_i': parameter_bits, 'o_bits_i': o_bits, - 'scale_x_f': scale_x(), 'scale_w_f': scale_w(), 'scale_o_f': scale_o()} - - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - param_dict['platform_quant_s'] = platform_quant - if is_not_from_iqtensor: - input_list = [op_inner, weights, bias] - else: - input_list = [input, weights, bias] - return g.op("thinker::LayerNormInt", *input_list, **param_dict) - - -class LayerNormInt(nn.LayerNorm, ModuleIntConfig): - r"""实现LayerNormInt的量化训练与测试,继承自nn.LayerNorm, - - Args: - num_features eps momentum affine track_running_stats - 标准nn.LayerNorm的参数 - data_bits(int): 输入量化位数 - parameter_bits(int): 参数量化位数 - mode(Enum): 量化方式,支持MaxValue与Qvalue - o_bits(int, default=None):输出量化位数 - clamp_data(float or None): 针对输出的clamp数值 - clamp_weight(float or None): 针对转为乘加操作之后的weight与bias的clamp数值,此处不使用 - clamp_bias(float or None): 与clamp_weight一致 - ahead_relu(bool): 是否做融合relu之后的数值统计scale - - Examples: - test/test_layernorm_int.py - - """ - - def __init__(self, normalized_shape, eps=1e-05, momentum=0.1, elementwise_affine=True, data_bits=8, parameter_bits=16, mode=QuantMode.QValue, o_bits=8, - clamp_data=None, clamp_weight=None, clamp_bias=None, ahead_relu=False): - parameter_bits = 16 - nn.LayerNorm.__init__(self, normalized_shape, eps, elementwise_affine) - 
ModuleIntConfig.__init__( - self, data_bits=data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=o_bits) - self.prefix = "" - self.dump = False - self.path = "" - self.is_not_from_iqtensor = True - self.ahead_relu = ahead_relu - self.clamp_data = clamp_data - self.clamp_weight = clamp_weight - self.clamp_bias = clamp_bias - self.mode = mode - self.momentum = momentum - self.register_buffer('running_x', torch.zeros(1)) - self.register_buffer('running_normal', torch.zeros(1)) - self.register_buffer('running_w', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - self.register_buffer('scale_x', torch.zeros(1)) - self.register_buffer('scale_normal', torch.zeros(1)) - self.register_buffer('scale_w', torch.zeros(1)) - self.register_buffer('scale_o', torch.zeros(1)) - - def forward(self, input): - scale_x = ScalerBuffer(self.scale_x) - running_x = ScalerBuffer(self.running_x) - if isinstance(input, IQTensor): - self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits, self.mode) - scale_x = ScalerBuffer(input.scale_data) - running_x = ScalerBuffer(input.running_data) - running_normal = ScalerBuffer(self.running_normal) - running_w = ScalerBuffer(self.running_w) - running_o = ScalerBuffer(self.running_o) - scale_normal = ScalerBuffer(self.scale_normal) - scale_w = ScalerBuffer(self.scale_w) - scale_o = ScalerBuffer(self.scale_o) - momentum = 0.1 - weight = self.weight - bias = self.bias - if self.weight.dtype == torch.float32: - weight = normalize_weight_with_config( - self.weight, self.clamp_weight, self.training) - if self.bias is not None: - bias = normalize_bias_with_config( - self.bias, self.clamp_bias, self.training) - - ret = LayerNormFunction.apply(input, weight, bias, self.normalized_shape, - self.training and self.elementwise_affine, momentum, self.eps, - running_x, running_normal, running_w, running_o, - scale_x, scale_normal, scale_w, scale_o, - 
self.data_bits, self.parameter_bits, self.prefix, self.dump, - self.path, self.quant_mode, self.o_bits, self.quant, self.is_not_from_iqtensor, - self.ahead_relu, self.clamp_data, self.clamp_weight, self.clamp_bias) - self.running_x.fill_(running_x()) - self.running_normal.fill_(running_normal()) - self.running_w.fill_(running_w()) - self.running_o.fill_(running_o()) - self.scale_x.fill_(scale_x()) - self.scale_normal.fill_(scale_normal()) - self.scale_w.fill_(scale_w()) - self.scale_o.fill_(scale_o()) - return ret - - def state_dict(module, destination=None, prefix='', keep_vars=False): - if destination is None: - destination = OrderedDict() - destination._metadata = OrderedDict() - destination._metadata[prefix[:-1] - ] = local_metadata = dict(version=module._version) - scale_x = ScalerBuffer(module._buffers['scale_x']) - scale_normal = ScalerBuffer(module._buffers['scale_normal']) - if is_in_onnx_export(): - assert module._buffers['running_x'] > 0, 'invalid running_x <= 0, cannot access param before training, layer prefix is: {}'.format( - prefix) - if module.is_not_from_iqtensor: - scale_x = ScalerBuffer(module.quant.running_to_scale(ScalerBuffer( - module._buffers['running_x']), module.data_bits, mode=module.quant_mode)) - module._buffers['scale_x'].data.fill_(scale_x()) - scale_normal.fill_(2**10) - module._buffers['scale_normal'].data.fill_(scale_normal()) - if module.o_bits is not None: - scale_o = ScalerBuffer(module.quant.running_to_scale(ScalerBuffer( - module._buffers['running_o']), module.o_bits, mode=module.quant_mode)) - module._buffers['scale_o'].data.fill_(scale_o()) - if 'scale_w' in module._buffers and module._parameters['weight'].dtype == torch.float: - weight_tensor = module._parameters['weight'] - weight_tensor_clamp = None - bias_tensor_clamp = None - if hasattr(module, 'clamp_weight'): - weight_tensor_clamp = normalize_weight_with_config( - weight_tensor, module.clamp_weight, False) - else: - weight_tensor_clamp = weight_tensor - 
weight_tensor.data = weight_tensor_clamp - if module.bias is not None: - bias_tensor = module._parameters['bias'] - # bias_temp = None - if hasattr(module, 'clamp_bias'): - bias_tensor_clamp = normalize_bias_with_config( - bias_tensor, module.clamp_bias, False) - else: - bias_tensor_clamp = bias_tensor - bias_tensor.data = bias_tensor_clamp - if is_in_onnx_export(): - weight_temp, scale_w, _ = module.quant.quant( - weight_tensor_clamp, module.parameter_bits, mode=module.quant_mode) - scale_w = ScalerBuffer(scale_w) - module._buffers['scale_w'].data.fill_(scale_w()) - - if module.parameter_bits <= 8: - weight_tensor.data = weight_temp.char() - weight_tensor.char() - elif module.parameter_bits <= 16: - weight_tensor.data = weight_temp.short() - weight_tensor.short() - else: - weight_tensor.data = weight_temp.int() - weight_tensor.int() - if module.bias is not None: - bias_tensor_clamp = module._parameters['bias'] - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - assert module.quant_mode == QuantMode.QValue, 'luna_quant only support Qvalue and o_bits=None' - if module.data_bits + module.parameter_bits <= 16: - module._parameters['bias'].data = ( - bias_tensor_clamp * scale_w * scale_normal + 0.5).floor().float().int() - else: - module._parameters['bias'].data = ( - bias_tensor_clamp * scale_w * scale_normal + 0.5).floor().int() - module._parameters['bias'].int() - else: - assert False, "linger only support luna quant." 
- module._save_to_state_dict(destination, prefix, keep_vars) - for name, module in module._modules.items(): - if module is not None: - module.state_dict(destination, prefix + - name + '.', keep_vars=keep_vars) - for hook in module._state_dict_hooks.values(): - hook_result = hook(module, destination, prefix, local_metadata) - if hook_result is not None: - destination = hook_result - return destination - - def _load_from_state_dict(module, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - allow_missing_keys = ['running_w', 'running_x', 'running_normal', 'running_o', - 'scale_x', 'running_normal', 'scale_w', 'scale_o', 'scale_normal'] - local_missing_keys = [] - module._load_from_state_dict_global_(state_dict, prefix, local_metadata, strict, - local_missing_keys, unexpected_keys, error_msgs) - matched = True - fake_missing_keys = [] - for k_local in local_missing_keys: - if k_local.replace(prefix, '', 1) not in allow_missing_keys: - matched = False - fake_missing_keys.append(k_local) - if matched: - local_missing_keys = [] - else: - local_missing_keys = fake_missing_keys - missing_keys += local_missing_keys - - def _load_from_state_dict_global_(module, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - for hook in module._load_state_dict_pre_hooks.values(): - hook(state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs) - local_name_params = itertools.chain( - module._parameters.items(), module._buffers.items()) - local_state = {k: v.data for k, - v in local_name_params if v is not None} - for name, param in local_state.items(): - key = prefix + name - if key in state_dict: - input_param = state_dict[key] - - if len(param.shape) == 0 and len(input_param.shape) == 1: - input_param = input_param[0] - - if input_param.shape != param.shape: - error_msgs.append('size mismatch for {}: copying a param with shape {} from checkpoint, ' - 'the shape in current 
model is {}.' - .format(key, input_param.shape, param.shape)) - continue - - if isinstance(input_param, torch.nn.Parameter): - input_param = input_param.data - try: - param.copy_(input_param) - if input_param.dtype == torch.int32 or input_param.dtype == torch.int8 or input_param.dtype == torch.int16: - module._parameters[name] = param.int() - - except Exception: - error_msgs.append('While copying the parameter named "{}", ' - 'whose dimensions in the model are {} and ' - 'whose dimensions in the checkpoint are {}.' - .format(key, param.size(), input_param.size())) - elif strict: - missing_keys.append(key) - if strict: - for key in state_dict.keys(): - if key.startswith(prefix): - input_name = key[len(prefix):] - input_name = input_name.split('.', 1)[0] - if input_name not in module._modules and input_name not in local_state: - unexpected_keys.append(key) - - def extra_repr(self): - s = nn.LayerNorm.extra_repr(self) - extra_s = ' ,clamp_data:{clamp_data},clamp_weight:{clamp_weight},clamp_bias:{clamp_bias},ahead_relu:{ahead_relu}'.format( - **self.__dict__) - extra_s += ',data_bits:{data_bits}, parameter_bits:{parameter_bits}, o_bits:{o_bits}'.format( - **self.__dict__) - return s+extra_s diff --git a/linger/ops/linear_int.py b/linger/ops/linear_int.py deleted file mode 100644 index 5068c9b..0000000 --- a/linger/ops/linear_int.py +++ /dev/null @@ -1,328 +0,0 @@ -import math - -import torch -import torch.nn as nn -import torch.nn.functional as F - -from ..config import config -from ..quant import (Quant, normalize_bias_with_config, - normalize_data_with_config, normalize_weight_with_config) -from ..utils import Dump, PlatFormQuant, QuantMode, ScalerBuffer -from .iqtensor import (IQTensor, from_torch_tensor, platform_to_string, - quantlinear) -from .ops import ModuleIntConfig -from .requant import Requant - - -class LinearFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, weight, bias, data_bits, parameter_bits, running_x, running_w, 
running_o, eval_scale_x, eval_scale_w, eval_scale_o, - momentum, training, prefix, dump, path, mode, o_bits, quant, is_not_from_iqtensor, ahead_relu, clamp_data, clamp_weight, clamp_bias, ahead_sigmoid): - scale_o = None - - # venus limits - n = input.shape[-1] - if (data_bits == 8): - assert math.ceil(4/4) * 4 * math.ceil(n/8) * 8 * data_bits/8 <= 64 * \ - 1024, f"in LinearInt op, input shape must satisfy math.ceil(4/4) * 4 * math.ceil(n/8) * 8 * data_bits/8 <= 64*1024, but you have math.ceil({4}/4) * 4 * math.ceil({n}/8) * 8 * {data_bits/8} <= 64*1024" - elif (data_bits == 16): - assert math.ceil(4/4) * 4 * math.ceil(n/2) * 2 * data_bits/8 <= 64 * \ - 1024, f"in LinearInt op, input shape must satisfy math.ceil(4/4) * 4 * math.ceil(n/2) * 2 * data_bits/8 <= 64*1024, but you have math.ceil({4}/4) * 4 * math.ceil({n}/2) * 2 * {data_bits/8} <= 64*1024" - elif (data_bits == 32): - assert math.ceil(2/2) * 2 * math.ceil(n/2) * 2 * data_bits/8 <= 64 * \ - 1024, f"in LinearInt op, input shape must satisfy math.ceil(2/2) * 2 * math.ceil(n/2) * 2 * data_bits/8 <= 64*1024, but you have math.ceil({2}/2) * 2 * math.ceil({n}/2) * 2 * {data_bits/8} <= 64*1024" - - if training: - ctx.clamp_data = clamp_data - ctx.clamp_weight = clamp_weight - ctx.clamp_bias = clamp_bias - zero_point = input.zero_point if isinstance(input, IQTensor) else 0 - is_iq_tensor = True if isinstance(input, IQTensor) else False - ctx.value = zero_point, is_iq_tensor - ctx.bits = data_bits, parameter_bits, o_bits - saved_tensors = [input, weight, bias] - if isinstance(input, IQTensor): - q_input, _, max_value_x = quant.quant( - input.data, data_bits, eval_scale_x, mode=QuantMode.QValue, quant_data='input', iq_zero_point=input.zero_point) - scale_x = eval_scale_x - else: - q_input, scale_x, max_value_x = quant.quant( - input, data_bits, mode=mode, quant_data='input') - running_x.mul_(1-momentum).add_(momentum*max_value_x) - - q_weight, scale_w, max_value_w = quant.quant( - weight, parameter_bits, mode=mode, 
quant_data='weight') - running_w.mul_(1-momentum).add_(momentum*max_value_w) - q_input = q_input.float() if data_bits + parameter_bits <= 16 else q_input.double() - q_weight = q_weight.float() if data_bits + \ - parameter_bits <= 16 else q_weight.double() - q_outputs = F.linear(q_input, q_weight) - outputs = None - q_bias = None - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - assert mode == QuantMode.QValue, 'castor quant only support QValue and o_bits=None' - if bias is not None: - q_bias = (bias * scale_w * scale_x + 0.5).floor() - if data_bits + parameter_bits <= 16: - q_bias = q_bias.float() - else: - q_bias = q_bias.double() - q_outputs += q_bias.reshape(-1) - outputs = quant.dequant(q_outputs, scale_x*scale_w) - else: - assert False, "linger only support luna quant." - - # f_input = quant.dequant(q_input, scale_x) - # f_weight = quant.dequant(q_weight, scale_w) - # f_bias = None if bias is None else quant.dequant(q_bias, scale_x*scale_w) - # saved_tensors = [f_input, f_weight, f_bias] - - out_tensor = outputs - # ctx.save_for_backward(*saved_tensors) - if o_bits is not None: - outputs = normalize_data_with_config(outputs, clamp_data) - out_tensor = outputs - q_outputs, scale_o, max_value_o = quant.quant( - outputs, o_bits, mode=mode, quant_data='output', ahead_relu=ahead_relu) - running_o.mul_(1-momentum).add_(momentum*max_value_o) - outputs = quant.dequant(q_outputs, scale_o) - ctx.scale = scale_x, scale_w, scale_o - saved_tensors += [out_tensor] - ctx.save_for_backward(*saved_tensors) - - else: - assert running_x > 0, 'invalid running_x, please finetune training' - q_weight = None - scale_o = eval_scale_o - if weight.dtype == torch.float32: - scale_x = ScalerBuffer(quant.running_to_scale( - running_x, data_bits, mode=mode)) - q_weight, scale_w, _ = quant.quant( - weight, parameter_bits, mode=mode, quant_data='weight') - scale_w = ScalerBuffer(scale_w) - if o_bits is not None: - assert running_o > 0, 'invalid running_o <= 0, please 
finetune training' - scale_o = ScalerBuffer(quant.running_to_scale( - running_o, o_bits, mode=mode)) - else: - scale_x = eval_scale_x - scale_w = eval_scale_w - q_weight = weight.double() - if o_bits is not None: - scale_o = eval_scale_o - - if isinstance(input, IQTensor): - scale_x = eval_scale_x - q_input, _, _ = quant.quant( - input.data, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input', iq_zero_point=input.zero_point) - else: - q_input, scale_x, _ = quant.quant( - input.data, data_bits, scale_x, mode=mode, quant_data='input') - q_input = q_input.double() - q_weight = q_weight.double() - q_outputs = F.linear(q_input, q_weight) - - outputs = None - q_bias = None - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - assert mode == QuantMode.QValue, 'castor quant only support QValue and o_bits=None' - if bias is not None: - if bias.dtype == torch.float32: - q_bias = (bias * scale_w * scale_x + 0.5).floor() - if data_bits + parameter_bits <= 16: - q_bias = q_bias.float().double() - else: - q_bias = q_bias.double() - else: - q_bias = bias.double() - q_outputs += q_bias.reshape(-1) - outputs = quant.dequant(q_outputs, scale_x*scale_w) - else: - assert False, "linger only support luna quant." 
- if o_bits is not None: - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - q_outputs, _, _ = quant.quant( - outputs, o_bits, scale_o, mode=mode, quant_data='output') - outputs = quant.dequant(q_outputs, scale_o) - if dump: - if bias is not None: - name_list = ["input", "weight", "bias", "outputs", "q_input", "q_weight", "q_bias", - "q_outputs", "running_x", "running_w", "running_o", "scale_x", "scale_w", "scale_o"] - attr_list = [input, weight, bias, outputs, q_input, q_weight, q_bias, q_outputs, - running_x.data, running_w.data, running_o.data, scale_x.data, scale_w.data, scale_o.data] - Dump.dump_file(prefix, ".LinearInt.", - zip(name_list, attr_list), path) - else: - name_list = ["input", "weight", "outputs", "q_input", "q_weight", "q_outputs", - "running_x", "running_w", "running_o", "scale_x", "scale_w", "scale_o"] - attr_list = [input, weight, outputs, q_input, q_weight, q_outputs, running_x.data, - running_w.data, running_o.data, scale_x.data, scale_w.data, scale_o.data] - Dump.dump_file(prefix, ".LinearInt.", - zip(name_list, attr_list), path) - if o_bits is None: - return outputs - elif isinstance(scale_o, float): - return from_torch_tensor(outputs, scale_o, o_bits) - elif isinstance(scale_o, torch.Tensor): - return from_torch_tensor(outputs, scale_o.item(), o_bits) - else: - return from_torch_tensor(outputs, scale_o.data, o_bits) - - @staticmethod - def backward(ctx, gradOutput): - clamp_data = ctx.clamp_data - data_bits, parameter_bits, o_bits = ctx.bits - zero_point, is_iq_tensor = ctx.value - scale_x, scale_w, scale_o = ctx.scale - input, weights, bias, outputs = ctx.saved_tensors - if is_iq_tensor: - f_input = input.data - else: - q_input, _, _ = Quant.quant( - input.data, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input') - f_input = Quant.dequant(q_input, scale_x) - - f_input = f_input.detach().clone().requires_grad_(True) - q_weights, _, _ = Quant.quant( - weights.data, parameter_bits, scale_w, 
mode=QuantMode.QValue, quant_data='weight') - f_weights = Quant.dequant(q_weights, scale_w) - - f_weights = f_weights.detach().clone().requires_grad_(True) - bias = None if bias is None else bias.detach().clone().requires_grad_(True) - gradInput = gradWeight = gradBias = None - - with torch.enable_grad(): - z = F.linear(f_input, f_weights, bias) - if o_bits is not None: - z = normalize_data_with_config(z, clamp_data) - if bias is not None: - gradInput, gradWeight, gradBias = torch.autograd.grad( - z, (f_input, f_weights, bias), gradOutput) - else: - gradInput, gradWeight, = torch.autograd.grad( - z, (f_input, f_weights), gradOutput) - - return gradInput, gradWeight, gradBias, None, None, None, None, None, None, None, None, None, None, None, None,\ - None, None, None, None, None, None, None, None, None, None, None - - @staticmethod - def symbolic(g, input, weight, bias, data_bits, parameter_bits, running_x, running_w, running_o, scale_x, scale_w, scale_o, momentum, training, - prefix, dump, path, mode, o_bits, quant, is_not_from_iqtensor, ahead_relu, clamp_data, clamp_weight, clamp_bias, ahead_sigmoid): - op_inner = None - if is_not_from_iqtensor: - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - op_inner = quantlinear(g, input, scale_x(), - platform_quant, data_bits) - input_list = [op_inner, weight] - else: - input_list = [input, weight] - param_dict = {'scale_x_f': scale_x(), 'scale_w_f': scale_w(), - 'data_bits_i': data_bits, 'parameter_bits_i': parameter_bits} - if bias is not None: - input_list.append(bias) - if o_bits is not None: - param_dict['scale_o_f'] = scale_o() - param_dict['o_bits_i'] = o_bits - - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - param_dict['platform_quant_s'] = platform_quant - - return g.op("thinker::LinearInt", *input_list, **param_dict) - - -class LinearInt(nn.Linear, ModuleIntConfig): - r"""实现LinearInt的量化训练与测试,继承自nn.Linear, - - Args: - in_features out_features bias - 
与nn.Linear一致的参数 - data_bits(int): 输入量化位数 - parameter_bits(int): 参数量化位数 - mode(Enum): 量化方式,支持MaxValue与Qvalue - o_bits(int, default=None):输出量化位数 - clamp_data(float or None): 针对输出的clamp数值 - clamp_weight(float or None): 针对转为weight的clamp数值 - clamp_bias(float or None): 与clamp_weight一致 - ahead_relu(bool): 是否做融合relu之后的数值统计scale - - """ - - def __init__(self, in_features, out_features, bias=True, data_bits=8, parameter_bits=8, mode=QuantMode.QValue, - o_bits=None, clamp_data=None, clamp_weight=None, clamp_bias=None, ahead_relu=False, ahead_sigmoid=False): - # assert data_bits == parameter_bits, "data_bits and parameter_bits must be equal" - assert data_bits in (8, 16, 32), "data_bits only support 8, 16, 32" - nn.Linear.__init__(self, in_features, out_features, bias) - ModuleIntConfig.__init__( - self, data_bits=data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=o_bits) - - self.prefix = "" - self.dump = False - self.path = "" - self.momentum = 0.1 - self.is_not_from_iqtensor = True - self.clamp_data = clamp_data - self.clamp_weight = clamp_weight - self.clamp_bias = clamp_bias - self.ahead_relu = ahead_relu - self.ahead_sigmoid = ahead_sigmoid - self.mode = mode - self.register_buffer('running_x', torch.zeros(1)) - self.register_buffer('running_w', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - self.register_buffer('scale_x', torch.zeros(1)) - self.register_buffer('scale_w', torch.zeros(1)) - self.register_buffer('scale_o', torch.zeros(1)) - - def forward(self, input): - scale_x = ScalerBuffer(self.scale_x) - running_x = ScalerBuffer(self.running_x) - if isinstance(input, IQTensor): - self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits, self.mode) - scale_x = ScalerBuffer(input.scale_data) - running_x = ScalerBuffer(input.running_data) - running_w = ScalerBuffer(self.running_w) - running_o = ScalerBuffer(self.running_o) - scale_w = 
ScalerBuffer(self.scale_w) - scale_o = ScalerBuffer(self.scale_o) - weight = self.weight - bias = self.bias - if weight.dtype == torch.float32: - weight = normalize_weight_with_config( - self.weight, self.clamp_weight, self.training) - if self.bias is not None: - bias = normalize_bias_with_config( - self.bias, self.clamp_bias, self.training) - ret = LinearFunction.apply(input, weight, bias, - self.data_bits, self.parameter_bits, - running_x, running_w, running_o, scale_x, scale_w, scale_o, - self.momentum, self.training, self.prefix, self.dump, self.path, - self.quant_mode, self.o_bits, self.quant, self.is_not_from_iqtensor, self.ahead_relu, - self.clamp_data, self.clamp_weight, self.clamp_bias, self.ahead_sigmoid) - self.running_x.fill_(running_x()) - self.running_w.fill_(running_w()) - self.running_o.fill_(running_o()) - self.scale_x.fill_(scale_x()) - self.scale_w.fill_(scale_w()) - self.scale_o.fill_(scale_o()) - return ret - - def state_dict(self, destination=None, prefix='', keep_vars=False): - return ModuleIntConfig.state_dict_global(self, destination, prefix, keep_vars) - - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) - - def extra_repr(self): - s = nn.Linear.extra_repr(self) - extra_s = ',clamp_data:{clamp_data},clamp_weight:{clamp_weight},clamp_bias:{clamp_bias},ahead_relu:{ahead_relu},ahead_sigmoid:{ahead_sigmoid}'.format( - **self.__dict__) - extra_s += ',data_bits:{data_bits},parameter_bits:{parameter_bits},o_bits:{o_bits}'.format( - **self.__dict__) - return s+extra_s diff --git a/linger/ops/linger_functional.py b/linger/ops/linger_functional.py deleted file mode 100644 index 15a2cb6..0000000 --- a/linger/ops/linger_functional.py +++ /dev/null @@ -1,1567 +0,0 @@ -import logging -import math -from collections import OrderedDict - 
-import lingerext -import numpy as np -import torch -from torch.onnx import is_in_onnx_export -from torch.onnx.symbolic_opset9 import max_pool2d as onnx_syms_max_pool2d -from torch.onnx.symbolic_opset11 import (_prepare_onnx_paddings, - constant_pad_nd, reflection_pad, - replication_pad) - -from ..config import config -from ..ops.bmm_int import BmmInt -from ..ops.ops import ModuleIntConfig -from ..ops.ops_names import (LINGER_FUNCINT_BMM_COUNTER, - LINGER_IQTENSOR_LAYER_COUNTER, - LINGER_MIX_INT8_MANUAL_ROUND_LAYERS, LINGER_MODE, - LINGER_OBIT) -from ..quant import Quant -from ..utils import Dump, PlatFormQuant, QuantMode, ScalerBuffer -from .iqtensor import (IQTensor, from_torch_tensor, iqTranspose, - platform_to_string, quantlinear) -from .module_self import get_current_module -from .requant import Requant - -torch_max = torch.max -torch_transpose = torch.transpose -torch_relu = torch.relu -torch_pad = torch.nn.functional.pad -torch_relu_ = torch.relu_ -torch_cat = torch.cat -torch_max_pool2d = torch.max_pool2d -torch_sigmoid = torch.sigmoid -torch_sigmoid_ = torch.sigmoid_ - -torch_tanh = torch.tanh -torch_tanh_ = torch.tanh_ -torch_clamp = torch.clamp -torch_clamp_ = torch.clamp_ -torch_dropout = torch.nn.functional.dropout -torch_onnx_export = torch.onnx.export -torch_pack_padded_sequence = torch.nn.utils.rnn.pack_padded_sequence -torch_pad_packed_sequence = torch.nn.utils.rnn.pad_packed_sequence - -torch_softmax = torch.softmax -torch_logsoftmax = torch.log_softmax -torch_var = torch.var - - -def forward_torch_tensor(decision_tensor): - return type(decision_tensor) == torch.Tensor - - -def find_sigmoidtable(x_int, sigmoid_table): - x_int_uint8 = x_int + 256 - y_int = torch.where(x_int >= 0, x_int, x_int_uint8) - y_int = y_int.reshape(-1) - for i, ele in enumerate(y_int): - y_int[i] = sigmoid_table[ele] - - return y_int.reshape(x_int.shape) - -def channel_shuffle(x, groups): - batchsize, num_channels, height, width = x.data.size() - channels_per_group = 
num_channels // groups - x = x.view(batchsize, groups, channels_per_group, height, width) - x = torch.transpose(x, 1, 2).contiguous() - x = x.view(batchsize, -1, height, width) - return x - -class channelShuffle(torch.autograd.Function): - @staticmethod - def forward(self, input, groups): - self.save_for_backward(input) - self.groups = groups - y = channel_shuffle(input, groups) - return from_torch_tensor(y, input.scale_data, input.bits) - - @staticmethod - def backward(self, s): - x, = self.saved_tensors - groups = self.groups - x = x.detach().clone().requires_grad_(True) - grad = None - with torch.enable_grad(): - y = channel_shuffle(x, groups) - grad = torch.autograd.grad(y, x, s) - return grad[0], None - - @staticmethod - def symbolic(g, x, groups): - param_dict = dict() - input_list = [x, ] - param_dict['groups_i'] = groups - - return g.op("thinker::ShuffleChannel", *input_list, **param_dict) - -def channel_shuffle_quant(*args, **kwargs): - if forward_torch_tensor(args[0]): - return channel_shuffle(*args, **kwargs) - assert isinstance(args[0], IQTensor) - assert hasattr(args[0], 'scale_data') - assert hasattr(args[0], 'bits') - return channelShuffle.apply(*args) - -class iqRelu(torch.autograd.Function): - @staticmethod - def forward(self, input): - self.save_for_backward(input) - y = torch_relu(input) - return from_torch_tensor(y, input.scale_data, input.bits) - - @staticmethod - def backward(self, s): - x, = self.saved_tensors - x = x.detach().clone().requires_grad_(True) - grad = None - with torch.enable_grad(): - y = torch_relu(x) - grad = torch.autograd.grad(y, x, s) - return grad - - @staticmethod - def symbolic(g, x): - - input_list = [x, ] - - return g.op("Relu", *input_list) - - -def _constant_pad_nd(g, input, padding, value=None): - mode = "constant" - pad = _prepare_onnx_paddings(g, input.type().dim(), padding) - return g.op("thinker::iqPad", input, pad, value, mode_s=mode) - - -class iqPad(torch.autograd.Function): - @staticmethod - def 
forward(self, scale_data, input, pad, mode='constant', value=0.): - self.save_for_backward(input) - self.pad = pad - self.mode = mode - self.value = value - y = torch_pad(input, pad, mode, value) - if value == 0: - return from_torch_tensor(y, input.scale_data, input.bits) - else: - return y - - @staticmethod - def backward(self, s): - x, = self.saved_tensors - x = x.detach().clone().requires_grad_(True) - grad = None - with torch.enable_grad(): - y = torch_pad(x, self.pad, self.mode, self.value) - grad = torch.autograd.grad(y, (x,), s) - return None, grad[0], None, None, None - - @staticmethod - def symbolic(g, scale_data, input, padding, mode='constant', value=0.): - if mode == "constant": - if value == 0: - value = Quant.quant(torch.tensor( - value), bits=8, scale=scale_data)[0].char() - padding = g.op("Constant", value_t=torch.tensor( - padding, dtype=torch.int64)) - return _constant_pad_nd(g, input, padding, value) - else: - value = torch.tensor(value) - padding = g.op("Constant", value_t=torch.tensor( - padding, dtype=torch.int64)) - return constant_pad_nd(g, input, padding, value) - elif mode == "reflect": - padding = g.op("Constant", value_t=torch.tensor( - padding, dtype=torch.int64)) - return reflection_pad(g, input, padding) - elif mode == "replicate": - padding = g.op("Constant", value_t=torch.tensor( - padding, dtype=torch.int64)) - return replication_pad(g, input, padding) - - -class iqMaxPool2d(torch.autograd.Function): - @staticmethod - def forward(self, input, kernel_size, stride=(), padding=0, dilation=1, ceil_mode=False): - # venus limits - assert input.bits in ( - 4, 8), f"in iqMaxPool2d op, input bits only support 4/8 bits, but you have input bits {input.bits}" - - self.save_for_backward(input) - self.kernel_size = kernel_size - self.stride = stride - self.padding = padding - self.dilation = dilation - self.ceil_mode = ceil_mode - y = torch_max_pool2d(input, kernel_size, stride, - padding, dilation, ceil_mode) - return from_torch_tensor(y, 
input.scale_data, input.bits) - - @staticmethod - def backward(self, s): - x, = self.saved_tensors - x = x.detach().clone().requires_grad_(True) - kernel_size = self.kernel_size - stride = self.stride - padding = self.padding - dilation = self.dilation - ceil_mode = self.ceil_mode - grad = None - with torch.enable_grad(): - y = torch_max_pool2d(x, kernel_size, stride, - padding, dilation, ceil_mode) - grad = torch.autograd.grad(y, x, s) - return grad[0], None, None, None, None, None - - -iqMaxPool2d.symbolic = onnx_syms_max_pool2d - - -class iqCat(torch.autograd.Function): - @staticmethod - def forward(self, local_scale_o, scale_o, running_o, dim, quant_mode, training, prefix, dump, path, *args): - tensors = args[0:len(args)//2] - scale_s = args[len(args)//2:] - self.save_for_backward(*tensors) - self.dim = dim - scale_z_iq = local_scale_o - momentum = 0.1 - if training: - running_o.mul_(1-momentum).add_(momentum*(127/scale_z_iq())) - if quant_mode == QuantMode.QValue: - scale_log = round(math.log(scale_z_iq(), 2)) - scale_z_iq = math.pow(2, scale_log) - scale_z_iq = ScalerBuffer(scale_z_iq) - else: - assert running_o.data > 0, 'Must at least training one batch' - if quant_mode == QuantMode.QValue: - scale_log = round(math.log(127/running_o.data, 2)) - scale_z_iq = math.pow(2, scale_log) - else: - scale_z_iq = np.float32((math.pow(2, 8-1)-1) / running_o.data) - scale_z_iq = ScalerBuffer(scale_z_iq) - scale_o.fill_(scale_z_iq()) - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - list_tensor = [] - for m in tensors: - list_tensor.append(m.scale_to(scale_z_iq(), training)) - y_float = torch_cat(list_tensor, dim) - else: - assert False, "linger only support luna quant." 
- if dump: - name_list = ["input", "outputs"] - attr_list = [tensors, y_float] - Dump.dump_file(prefix, ".iqCat.", zip(name_list, attr_list), path) - return from_torch_tensor(y_float, scale_z_iq(), 8) - - @staticmethod - def backward(self, s): - tensors = self.saved_tensors - tensors = [tensor.detach().clone().requires_grad_(True) - for tensor in tensors] - dim = self.dim - grad = None - with torch.enable_grad(): - y = torch_cat(tensors, dim) - grad = torch.autograd.grad(y, tensors, s) - ret = [None, None, None, None, None, None, None, None, None]+list(grad) - l = [None for _ in range(len(tensors))] - return tuple(ret + l) - - @staticmethod - def symbolic(g, local_scale_o, scale_o, running_o, dim, quant_mode, training, prefix, dump, path, *args): - param_dict = {} - input_list = args[0:len(args)//2] - for i, value in enumerate(args[len(args)//2:]): - param_dict['scale_x_'+str(i)+"_f"] = value() - assert len(input_list) == len(param_dict) - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - param_dict['scale_o_f'] = scale_o() - param_dict['dim_i'] = dim - param_dict['platform_quant_s'] = platform_quant - - return g.op("thinker::iqCat", *input_list, **param_dict) - - -def relu(*args, **kwargs): - if forward_torch_tensor(args[0]): - return torch_relu(*args, **kwargs) - assert isinstance(args[0], IQTensor) - assert hasattr(args[0], 'scale_data') - assert hasattr(args[0], 'bits') - return iqRelu.apply(args[0]) - - -def relu_(*args, **kwargs): - return relu(*args, **kwargs) - - -def max_pool2d(input, kernel_size, stride=(), padding=0, dilation=1, ceil_mode=False): - if forward_torch_tensor(input): - return torch_max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode) - assert isinstance(input, IQTensor) - assert hasattr(input, 'scale_data') - assert hasattr(input, 'bits') - return iqMaxPool2d.apply(input, kernel_size, stride, padding, dilation, ceil_mode) - - -def pad(*args, **kwargs): - if forward_torch_tensor(args[0]): - return 
torch_pad(*args, **kwargs) - assert isinstance(args[0], IQTensor) - assert hasattr(args[0], 'scale_data') - assert hasattr(args[0], 'bits') - return iqPad.apply(args[0].scale_data, *args) - - -class iqCatLayer(torch.nn.Module): - def __init__(self): - super(iqCatLayer, self).__init__() - self.prefix = "" - self.dump = False - self.path = "" - self.register_buffer('scale_o', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - - def forward(self, tensors, local_scale_o, dim, quant_mode=QuantMode.QValue): - parmater_to_function = [] - for s in tensors: - parmater_to_function.append(s) - for s in tensors: - parmater_to_function.append(ScalerBuffer(s.scale_data)) - - scale_o = ScalerBuffer(self.scale_o) - running_o = ScalerBuffer(self.running_o) - local_scale_o = ScalerBuffer(local_scale_o) - z = iqCat.apply(local_scale_o, scale_o, running_o, dim, quant_mode, - self.training, self.prefix, self.dump, self.path, *parmater_to_function) - self.scale_o.fill_(scale_o.data) - self.running_o.fill_(running_o.data) - return z - - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) - - -def cat(tensors, dim=0, out=None): - is_iq_cat = True - for t in tensors: - if not isinstance(t, IQTensor): - is_iq_cat = False - break - module_self = get_current_module() - if module_self is None: - is_iq_cat = False - if not is_iq_cat: - return torch_cat(tensors, dim, out=out) - assert out == None, 'iqtensor not support out param cat for now please make sure the out is None' - assert tensors[0].bits == 8, 'iqcat only support 8bit' - quant_mode = getattr(module_self, LINGER_MODE, QuantMode.QValue) - iname_index = getattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER) - setattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER, iname_index+1) - var_name = 
LINGER_MIX_INT8_MANUAL_ROUND_LAYERS + '_iqcat_'+str(iname_index) - if not module_self.training and not hasattr(module_self, var_name): - logging.warning( - 'eval module has iqcat layer while do not match training module') - return torch_cat(tensors, dim, out=out) - iq_layer = None - if hasattr(module_self, var_name): - iq_layer = getattr(module_self, var_name) - else: - iq_layer = iqCatLayer() - iq_layer.training = module_self.training - iq_layer = iq_layer.to(tensors[0].device) - setattr(module_self, var_name, iq_layer) - scale_z = None - with torch.no_grad(): - max_z = -1 - for m in tensors: - max_z_t = torch.max(torch.abs(m)).item() - if max_z < max_z_t: - max_z = max_z_t - if max_z == 0: - scale_z = 1.0 - else: - scale_z = 127 / max_z - - return iq_layer(tensors, scale_z, dim, quant_mode) - - -class softmaxInt(torch.autograd.Function): - @staticmethod - def forward(ctx, input, dim, data_bits, training, running_x, running_o, scale_x, scale_o, scale_local_o, prefix, dump, path, mode, o_bits, is_not_from_iqtensor): - momentum = 0.1 - if training: - ctx.save_for_backward(input) - ctx.dim = dim - ctx.o_bits = o_bits - if isinstance(input, IQTensor): - q_input, _, max_value_x = Quant().quant(input.data, data_bits, scale_x, - mode=QuantMode.QValue, quant_data='input', iq_zero_point=input.zero_point) - else: - input_without_inf = input.clone() - input_inf_mask = (input_without_inf<=(-2**(data_bits - 1))) #float("-inf") - input_without_inf[input_without_inf<=(-2**(data_bits - 1))] = 0 - q_input, scale_x, max_value_x = Quant().quant( - input_without_inf.data, data_bits, mode=mode, quant_data='input') - running_x.mul_(1-momentum).add_(momentum*max_value_x) - q_input[input_inf_mask] = -2**(data_bits * 2 - 1) - - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - n_dim = len(input.shape) - # self.dim_ = n_dim - 1 if (self.dim_ == -1) else self.dim_ - dims = [i for i in range(n_dim)] - dims[dim] = n_dim - 1 - dims[n_dim - 1] = dim - x_ori = 
q_input.contiguous() - if dim != -1 and dim != n_dim - 1: - x_ori = x_ori.permute(*dims) - x_shaped = x_ori.reshape(-1, x_ori.shape[-1]) - - l_scale = 25 - int(math.log2(scale_x.data)) # Q25 in - if l_scale > 0: - x_int_shift = (x_shaped * pow(2, l_scale)).long() - else: - x_int_shift = ( - x_shaped * pow(2, l_scale) + 0.5).floor().long() - x_int_shift.clamp_(-2**31, 2**31-1) - - q_output = lingerext.luna_softmax_int( - x_int_shift.contiguous().int(), float(scale_x())) # Q25->Q15 - q_output.clamp_(0, 2**15-1) - q_output = q_output.reshape(x_ori.shape) - if dim != -1 and dim != n_dim - 1: - q_output = q_output.permute(*dims) - scale_local_o.fill_(2**15) - outputs = Quant().dequant(q_output, scale_local_o) # Q15->float - else: - assert False, 'platform_quant mode donot support for softmaxInt' - - if o_bits is not None: - q_output, scale_o, max_value_o = Quant().quant( - outputs, o_bits, mode=mode, quant_data='output') - running_o.mul_(1-momentum).add_(momentum*max_value_o) - else: - assert running_x > 0, 'invalid running_x = 0, please finetune training before eval' - if not isinstance(input, IQTensor): - scale_x = ScalerBuffer(Quant().running_to_scale(running_x, data_bits, mode=mode)) - if o_bits is not None: - assert running_o > 0, 'invalid running_o = 0 for softmaxInt' - scale_o = ScalerBuffer(Quant().running_to_scale(running_o, o_bits, mode=mode)) - # scale_o.fill_(2**31) - - q_input, _, _ = Quant().quant(input.data, data_bits, - scale_x, mode=mode, quant_data='input') - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - n_dim = len(input.shape) - # self.dim_ = n_dim - 1 if (self.dim_ == -1) else self.dim_ - dims = [i for i in range(n_dim)] - dims[dim] = n_dim - 1 - dims[n_dim - 1] = dim - x_ori = q_input.contiguous() - if dim != -1 and dim != n_dim - 1: - x_ori = x_ori.permute(*dims) - x_shaped = x_ori.reshape(-1, x_ori.shape[-1]) - - l_scale = 25 - int(math.log2(scale_x.data)) # Q25 in - if l_scale > 0: - x_int_shift = (x_shaped * pow(2, 
l_scale)).long() - else: - x_int_shift = ( - x_shaped * pow(2, l_scale) + 0.5).floor().long() - x_int_shift.clamp_(-2**31, 2**31-1) - - q_output = lingerext.luna_softmax_int( - x_int_shift.contiguous().int(), float(scale_x())) # Q25->Q15 - q_output.clamp_(0, 2**15-1) - q_output = q_output.reshape(x_ori.shape) - if dim != -1 and dim != n_dim - 1: - q_output = q_output.permute(*dims) - scale_local_o.fill_(2**15) - q_output = (q_output * scale_o() / scale_local_o() + 0.5).floor() - q_output = q_output.contiguous().int() - q_output.clamp_(-128, 127) - outputs = Quant().dequant(q_output, scale_o) - else: - assert False, 'platform_quant mode donot support for softmaxInt' - - if dump: - name_list = ['input', 'outputs', 'q_input', 'q_outputs'] - attr_list = [input, outputs, q_input, q_output] - Dump.dump_file(prefix, '.SoftmaxInt.', - zip(name_list, attr_list), path) - - if o_bits is None: - return outputs - elif isinstance(scale_o, float): - return from_torch_tensor(outputs, scale_o, o_bits) - elif isinstance(scale_o, torch.Tensor): - return from_torch_tensor(outputs, scale_o.item(), o_bits) - else: - return from_torch_tensor(outputs, scale_o.data, o_bits) - - @staticmethod - def backward(ctx, s): - x, = ctx.saved_tensors - dim = ctx.dim - x = x.detach().clone().requires_grad_(True) - grad = None - with torch.enable_grad(): - y = torch_softmax(x, dim=dim) - grad = torch.autograd.grad(y, x, s) - return grad[0], None, None, None, None, None, None, None, None, None, None, None, None, None, None - - @staticmethod - def symbolic(g, input, dim, data_bits, training, - running_x, running_o, scale_x, scale_o, scale_local_o, - prefix, dump, path, mode, o_bits, is_not_from_iqtensor): - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - if is_not_from_iqtensor: - op_inner = quantlinear(g, input, scale_x(), - platform_quant, data_bits) - param_dict = {'scale_x_f': scale_x( - ), 'data_bits_i': data_bits, 'dim_i': dim} - input_list = [] - if 
is_not_from_iqtensor: - input_list.append(op_inner) - else: - input_list.append(input) - - if o_bits is not None: - param_dict['scale_o_f'] = scale_o() - param_dict['o_bits_i'] = o_bits - param_dict['platform_quant_s'] = platform_quant - - return g.op("thinker::SoftmaxInt", *input_list, **param_dict) - - -class softmaxIntLayer(torch.nn.Module): - def __init__(self, data_bits=8, mode=QuantMode.QValue, o_bits=8): - super(softmaxIntLayer, self).__init__() - self.prefix = "" - self.dump = False - self.path = "" - self.data_bits = data_bits - self.o_bits = o_bits - self.mode = mode - self.is_not_from_iqtensor = True - self.register_buffer('running_x', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - self.register_buffer('scale_x', torch.zeros(1)) - self.register_buffer('scale_o', torch.zeros(1)) - self.register_buffer('scale_local_o', torch.zeros(1)) - - def forward(self, input, dim=-1): - running_x = ScalerBuffer(self.running_x) - running_o = ScalerBuffer(self.running_o) - scale_x = ScalerBuffer(self.scale_x) - scale_o = ScalerBuffer(self.scale_o) - scale_local_o = ScalerBuffer(self.scale_local_o) - - if isinstance(input, IQTensor): - self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits) - scale_x = ScalerBuffer(input.scale_data) - running_x = ScalerBuffer(input.running_data) - - z = softmaxInt.apply(input, dim, self.data_bits, self.training, running_x, running_o, scale_x, scale_o, scale_local_o, - self.prefix, self.dump, self.path, self.mode, self.o_bits, self.is_not_from_iqtensor) - self.running_x.fill_(running_x()) - self.running_o.fill_(running_o()) - self.scale_x.fill_(scale_x()) - self.scale_o.fill_(scale_o()) - - return z - - def state_dict(self, destination=None, prefix='', keep_vars=False): - if destination is None: - destination = OrderedDict() - destination._metadata = OrderedDict() - destination._metadata[prefix[:-1] - ] = local_metadata = 
dict(version=self._version) - if is_in_onnx_export(): - assert self.running_x > 0, 'invalid running_x <=0' - scale_x = ScalerBuffer(self.scale_x.data) - if self.is_not_from_iqtensor: - scale_x = ScalerBuffer(Quant().running_to_scale( - self.running_x, self.data_bits, mode=self.mode)) - self.scale_x.data.fill_(scale_x()) - - if self.o_bits is not None: - scale_o = ScalerBuffer(Quant().running_to_scale( - self.running_o, self.o_bits, mode=self.mode)) - # scale_o.fill_(2**31) - self.scale_o.data.fill_(scale_o()) - self._save_to_state_dict(destination, prefix, keep_vars) - for name, module in self._modules.items(): - if module is not None: - module.state_dict(destination, prefix + - name + '.', keep_vars=keep_vars) - for hook in self._state_dict_hooks.values(): - hook_result = hook(self, destination, prefix, local_metadata) - if hook_result is not None: - destination = hook_result - return destination - - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) - - -def softmax(tensor, dim, _stacklevel=3, dtype=None): - - module_self = get_current_module() - - quant_mode = getattr(module_self, LINGER_MODE, QuantMode.QValue) - iname_index = getattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER) - setattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER, iname_index+1) - - var_name = LINGER_MIX_INT8_MANUAL_ROUND_LAYERS + \ - '_SoftmaxInt_'+str(iname_index) - iq_layer = None - if hasattr(module_self, var_name): - iq_layer = getattr(module_self, var_name) - else: - iq_layer = softmaxIntLayer(mode=quant_mode) - iq_layer.training = module_self.training - iq_layer = iq_layer.to(tensor.device) - setattr(module_self, var_name, iq_layer) - # iq_layer.o_bits = 32 - return iq_layer(tensor, dim=dim) - - -class logsoftmaxInt(torch.autograd.Function): - @staticmethod - def forward(ctx, 
input, dim, data_bits, training, running_x, running_o, scale_x, scale_o, scale_local_o, prefix, dump, path, mode, o_bits, is_not_from_iqtensor): - momentum = 0.1 - if training: - ctx.save_for_backward(input) - ctx.dim = dim - ctx.o_bits = o_bits - if isinstance(input, IQTensor): - q_input, _, max_value_x = Quant().quant(input.data, data_bits, scale_x, - mode=QuantMode.QValue, quant_data='input', iq_zero_point=input.zero_point) - else: - q_input, _, max_value_x = Quant().quant( - input.data, data_bits, mode=mode, quant_data='input') - running_x.mul_(1-momentum).add_(momentum*max_value_x) - - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - n_dim = len(input.shape) - # self.dim_ = n_dim - 1 if (self.dim_ == -1) else self.dim_ - dims = [i for i in range(n_dim)] - dims[dim] = n_dim - 1 - dims[n_dim - 1] = dim - x_ori = q_input.contiguous() - if dim != -1 and dim != n_dim - 1: - x_ori = x_ori.permute(*dims) - x_shaped = x_ori.reshape(-1, x_ori.shape[-1]) - - l_scale = 25 - int(math.log2(scale_x.data)) # Q25 in - if l_scale > 0: - x_int_shift = (x_shaped * pow(2, l_scale)).int() - else: - x_int_shift = ( - x_shaped * pow(2, l_scale) + 0.5).floor().int() - q_output = lingerext.luna_logsoftmax_int( - x_int_shift.contiguous(), float(scale_x())) # Q25->Q25 - q_output.clamp_(-2**31, 0) - q_output = q_output.reshape(x_ori.shape) - if dim != -1 and dim != n_dim - 1: - q_output = q_output.permute(*dims) - scale_local_o.fill_(2**25) - outputs = Quant().dequant(q_output, scale_local_o) # Q25->float - else: - assert False, 'platform_quant mode donot support for logsoftmaxInt' - - if o_bits is not None: - q_output, scale_o, max_value_o = Quant().quant( - outputs, o_bits, mode=mode, quant_data='output') - running_o.mul_(1-momentum).add_(momentum*max_value_o) - else: - assert running_x > 0, 'invalid running_x = 0, please finetune training before eval' - if not isinstance(input, IQTensor): - scale_x = ScalerBuffer(Quant().running_to_scale( - running_x, data_bits, 
mode=mode)) - if o_bits is not None: - assert running_o > 0, 'invalid running_o = 0 for logsoftmaxInt' - scale_o = ScalerBuffer(Quant().running_to_scale( - running_o, o_bits, mode=mode)) - if scale_o.data > float(2**31): - scale_o.fill_(2**31) - q_input, _, _ = Quant().quant(input.data, data_bits, - scale_x, mode=mode, quant_data='input') - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - n_dim = len(input.shape) - # self.dim_ = n_dim - 1 if (self.dim_ == -1) else self.dim_ - dims = [i for i in range(n_dim)] - dims[dim] = n_dim - 1 - dims[n_dim - 1] = dim - x_ori = q_input.contiguous() - if dim != -1 and dim != n_dim - 1: - x_ori = x_ori.permute(*dims) - x_shaped = x_ori.reshape(-1, x_ori.shape[-1]) - - l_scale = 25 - int(math.log2(scale_x.data)) # Q25 in - if l_scale > 0: - x_int_shift = (x_shaped * pow(2, l_scale)).int() - else: - x_int_shift = ( - x_shaped * pow(2, l_scale) + 0.5).floor().int() - - q_output = lingerext.luna_logsoftmax_int( - x_int_shift.contiguous(), float(scale_x())) # Q25->Q25 - q_output.clamp_(-2**31, 0) - q_output = q_output.reshape(x_ori.shape) - if dim != -1 and dim != n_dim - 1: - q_output = q_output.permute(*dims) - scale_local_o.fill_(2**25) - outputs = Quant().dequant(q_output, scale_local_o) # Q25->float - else: - assert False, 'platform_quant mode donot support for logsoftmaxInt' - - if o_bits is not None: - q_output, _, _ = Quant().quant(outputs, o_bits, scale_o, - mode=mode, quant_data='output') - outputs = Quant().dequant(q_output, scale_o) - - if dump: - name_list = ['input', 'outputs', 'q_input', 'q_outputs'] - attr_list = [input, outputs, q_input, q_output] - Dump.dump_file(prefix, '.SoftmaxInt.', - zip(name_list, attr_list), path) - - if o_bits is None: - return outputs - elif isinstance(scale_o, float): - return from_torch_tensor(outputs, scale_o, o_bits) - elif isinstance(scale_o, torch.Tensor): - return from_torch_tensor(outputs, scale_o.item(), o_bits) - else: - return from_torch_tensor(outputs, 
scale_o.data, o_bits) - - @staticmethod - def backward(ctx, s): - x, = ctx.saved_tensors - dim = ctx.dim - x = x.detach().clone().requires_grad_(True) - grad = None - with torch.enable_grad(): - y = torch_logsoftmax(x, dim=dim) - grad = torch.autograd.grad(y, x, s) - return grad[0], None, None, None, None, None, None, None, None, None, None, None, None, None, None - - @staticmethod - def symbolic(g, input, dim, data_bits, training, - running_x, running_o, scale_x, scale_o, scale_local_o, - prefix, dump, path, mode, o_bits, is_not_from_iqtensor): - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - if is_not_from_iqtensor: - op_inner = quantlinear(g, input, scale_x(), - platform_quant, data_bits) - param_dict = {'scale_x_f': scale_x( - ), 'data_bits_i': data_bits, 'dim_i': dim} - input_list = [] - if is_not_from_iqtensor: - input_list.append(op_inner) - else: - input_list.append(input) - if o_bits is not None: - param_dict['scale_o_f'] = scale_o() - param_dict['o_bits_i'] = o_bits - param_dict['platform_quant_s'] = platform_quant - return g.op("thinker::LogSoftmaxInt", *input_list, **param_dict) - - -class logsoftmaxIntLayer(torch.nn.Module): - def __init__(self, data_bits=8, mode=QuantMode.QValue, o_bits=8): - super(logsoftmaxIntLayer, self).__init__() - self.prefix = "" - self.dump = False - self.path = "" - self.data_bits = data_bits - self.o_bits = o_bits - self.mode = mode - self.is_not_from_iqtensor = True - self.register_buffer('running_x', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - self.register_buffer('scale_x', torch.zeros(1)) - self.register_buffer('scale_o', torch.zeros(1)) - self.register_buffer('scale_local_o', torch.zeros(1)) - - def forward(self, input, dim=-1): - running_x = ScalerBuffer(self.running_x) - running_o = ScalerBuffer(self.running_o) - scale_x = ScalerBuffer(self.scale_x) - scale_o = ScalerBuffer(self.scale_o) - scale_local_o = ScalerBuffer(self.scale_local_o) - - if isinstance(input, 
IQTensor): - self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits) - scale_x = ScalerBuffer(input.scale_data) - running_x = ScalerBuffer(input.running_data) - - z = logsoftmaxInt.apply(input, dim, self.data_bits, self.training, running_x, running_o, scale_x, scale_o, scale_local_o, - self.prefix, self.dump, self.path, self.mode, self.o_bits, self.is_not_from_iqtensor) - self.running_x.fill_(running_x()) - self.running_o.fill_(running_o()) - self.scale_x.fill_(scale_x()) - self.scale_o.fill_(scale_o()) - - return z - - def state_dict(self, destination=None, prefix='', keep_vars=False): - if destination is None: - destination = OrderedDict() - destination._metadata = OrderedDict() - destination._metadata[prefix[:-1] - ] = local_metadata = dict(version=self._version) - if is_in_onnx_export(): - assert self.running_x > 0, 'invalid running_x <=0' - scale_x = ScalerBuffer(self.scale_x.data) - if self.is_not_from_iqtensor: - scale_x = ScalerBuffer(Quant().running_to_scale( - self.running_x, self.data_bits, mode=self.mode)) - self.scale_x.data.fill_(scale_x()) - if self.o_bits is not None: - scale_o = ScalerBuffer(Quant().running_to_scale( - self.running_o, self.o_bits, mode=self.mode)) - if scale_o.data > float(2**31): - scale_o.fill_(2**31) - self.scale_o.data.fill_(scale_o()) - self._save_to_state_dict(destination, prefix, keep_vars) - for name, module in self._modules.items(): - if module is not None: - module.state_dict(destination, prefix + - name + '.', keep_vars=keep_vars) - for hook in self._state_dict_hooks.values(): - hook_result = hook(self, destination, prefix, local_metadata) - if hook_result is not None: - destination = hook_result - return destination - - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, 
local_metadata, strict, missing_keys, unexpected_keys, error_msgs) - - -def logsoftmax(tensor, dim, _stacklevel=3, dtype=None): - is_logsoftmax_int = True - if not isinstance(tensor, IQTensor): - is_logsoftmax_int = False - module_self = get_current_module() - if module_self is None: - is_logsoftmax_int = False - if not is_logsoftmax_int: - return torch_logsoftmax(tensor, dim=dim) - # assert tensor.bits == 8, 'LogsoftmaxInt only support 8bit' - quant_mode = getattr(module_self, LINGER_MODE, QuantMode.QValue) - iname_index = getattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER) - setattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER, iname_index+1) - - var_name = LINGER_MIX_INT8_MANUAL_ROUND_LAYERS + \ - '_LogSoftmaxInt_'+str(iname_index) - iq_layer = None - if hasattr(module_self, var_name): - iq_layer = getattr(module_self, var_name) - else: - iq_layer = logsoftmaxIntLayer(mode=quant_mode) - iq_layer.training = module_self.training - iq_layer = iq_layer.to(tensor.device) - setattr(module_self, var_name, iq_layer) - # iq_layer.o_bits = 32 - - return iq_layer(tensor, dim=dim) - - -class iqSigmoid(torch.autograd.Function): - @staticmethod - def forward(ctx, x, scale_x, local_scale_o, running_o, scale_o, training, quant_mode, prefix, dump, path, data_bits): - ctx.save_for_backward(x) - # x_int = x.quant_to_int8(scale_x()) - # x_int = x_int.contiguous().int() - scale_z_iq = local_scale_o - momentum = 0.1 - bound_value = 127 - - if training: - running_o.mul_(1-momentum).add_(momentum * - (bound_value/local_scale_o())) - if quant_mode == QuantMode.QValue: - scale_log = round(math.log(scale_z_iq(), 2)) - scale_z_iq = math.pow(2, scale_log) - scale_z_iq = ScalerBuffer(scale_z_iq) - else: - assert running_o.data > 0, 'Must at least training one batch' - if quant_mode == QuantMode.QValue: - scale_log = round(math.log(bound_value/running_o.data, 2)) - scale_z_iq = math.pow(2, scale_log) - else: - scale_z_iq = np.float32(bound_value / running_o.data) - scale_z_iq = 
ScalerBuffer(scale_z_iq) - scale_o.fill_(scale_z_iq()) - y_int = None - - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - q_input, _, _ = Quant().quant( - x.data, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input', iq_zero_point=x.zero_point) - l_scale = 11 - int(math.log2(scale_x.data)) - - if l_scale > 0: - x_int_shift = (q_input * pow(2, l_scale)).int() - else: - x_int_shift = (q_input * pow(2, l_scale) + 0.5).floor().int() - - y_int = lingerext.luna_iqsigmoid( - x_int_shift.contiguous(), float(scale_x())) - y_int.clamp_(0, 2**7-1) - scale_z_iq.fill_(2**7) - scale_o.fill_(scale_z_iq()) - running_o.fill_(1.0) - y_float = Quant.dequant(y_int, scale_z_iq) - - if dump: - name_list = ['input', 'outputs', 'q_input', 'q_outputs'] - attr_list = [x, y_float, q_input, y_int] - Dump.dump_file(prefix, '.iqSigmoid.', - zip(name_list, attr_list), path) - - return from_torch_tensor(y_float, scale_z_iq(), 8, zero_point=0) - else: - assert False, 'platform_quant mode donot support for iqSigmoid' - - @staticmethod - def backward(ctx, s): - x, = ctx.saved_tensors - x = x.detach().clone().requires_grad_(True) - grad = None - with torch.enable_grad(): - y = torch_sigmoid(x) - grad = torch.autograd.grad(y, x, s) - return grad[0], None, None, None, None, None, None, None, None, None, None - - @staticmethod - def symbolic(g, x, scale_x, local_scale_o, running_o, scale_o, training, quant_mode, prefix, dump, path, data_bits): - param_dict = {'scale_x_f': scale_x(), 'scale_o_f': scale_o()} - input_list = [x, ] - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - param_dict['platform_quant_s'] = platform_quant - param_dict['castor_mode_s'] = "luna" - op = None - op = g.op("thinker::iqSigmoid", *input_list, **param_dict) - return op - - -class iqSigmoidLayer(torch.nn.Module): - def __init__(self, data_bits=16): - super(iqSigmoidLayer, self).__init__() - self.prefix = "" - self.dump = False - self.path = "" - self.data_bits = 
data_bits - self.register_buffer('scale_o', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - - def forward(self, input, local_scale_o, quant_mode=QuantMode.QValue): - scale_x = ScalerBuffer(input.scale_data) - local_scale_o = ScalerBuffer(local_scale_o) - scale_o = ScalerBuffer(self.scale_o) - running_o = ScalerBuffer(self.running_o) - if isinstance(input, IQTensor): - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits) - scale_x = ScalerBuffer(input.scale_data) - z = iqSigmoid.apply(input, scale_x, local_scale_o, running_o, scale_o, - self.training, quant_mode, self.prefix, self.dump, self.path, self.data_bits) - self.scale_o.fill_(scale_o.data) - self.running_o.fill_(running_o.data) - return z - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) - - -def sigmoid(tensor, *, out=None): - is_iq_sigmoid = True - if not isinstance(tensor, IQTensor): - is_iq_sigmoid = False - module_self = get_current_module() - if module_self is None: - is_iq_sigmoid = False - if not is_iq_sigmoid: - return torch_sigmoid(tensor) - # assert tensor.bits == 8, 'iqsigmoid only support 8bit' - quant_mode = getattr(module_self, LINGER_MODE, QuantMode.QValue) - iname_index = getattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER) - setattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER, iname_index+1) - - var_name = LINGER_MIX_INT8_MANUAL_ROUND_LAYERS + \ - '_iqsigmoid_'+str(iname_index) - iq_layer = None - if hasattr(module_self, var_name): - iq_layer = getattr(module_self, var_name) - else: - iq_layer = iqSigmoidLayer() - iq_layer.training = module_self.training - iq_layer = iq_layer.to(tensor.device) - setattr(module_self, var_name, iq_layer) - scale_z = None - with torch.no_grad(): - z_f = torch_sigmoid(tensor) - max_z = 
torch.max(torch.abs(z_f)) - if max_z == 0: - scale_z = 1.0 - else: - scale_z = 127 / max_z.item() - return iq_layer(tensor, scale_z, quant_mode) - - -def sigmoid_(tensor, *, out=None): - return sigmoid(tensor, out) - - -class iqTanh(torch.autograd.Function): - @staticmethod - def forward(ctx, x, scale_x, local_scale_o, running_o, scale_o, training, quant_mode, prefix, dump, path, data_bits): - ctx.save_for_backward(x) - # x_int = x.quant_to_int8(scale_x()) - # x_int = x_int.contiguous().int() - scale_z_iq = local_scale_o - momentum = 0.1 - bound_value = 127 - - if training: - running_o.mul_(1-momentum).add_(momentum * - (bound_value/local_scale_o())) - if quant_mode == QuantMode.QValue: - scale_log = round(math.log(scale_z_iq(), 2)) - scale_z_iq = math.pow(2, scale_log) - scale_z_iq = ScalerBuffer(scale_z_iq) - else: - assert running_o.data > 0, 'Must at least training one batch' - if quant_mode == QuantMode.QValue: - scale_log = round(math.log(bound_value/running_o.data, 2)) - scale_z_iq = math.pow(2, scale_log) - else: - scale_z_iq = np.float32(bound_value / running_o.data) - scale_z_iq = ScalerBuffer(scale_z_iq) - scale_o.fill_(scale_z_iq()) - y_int = None - - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - q_input, _, _ = Quant().quant( - x.data, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input', iq_zero_point=x.zero_point) - l_scale = 11 - int(math.log2(scale_x.data)) - - if l_scale > 0: - x_int = (q_input * pow(2, l_scale)).int() - else: - x_int = (q_input * pow(2, l_scale) + 0.5).floor().int() - y_int = lingerext.luna_iqtanh(x_int.contiguous(), float(scale_x())) - scale_z_iq.fill_(2**7) - scale_o.fill_(scale_z_iq()) - running_o.fill_(1.0) - y_float = Quant.dequant(y_int, scale_z_iq) - return from_torch_tensor(y_float, scale_z_iq(), 8, zero_point=0) - - else: - assert False, "linger only support luna quant." 
- - @staticmethod - def backward(ctx, s): - x, = ctx.saved_tensors - x = x.detach().clone().requires_grad_(True) - grad = None - with torch.enable_grad(): - y = torch_tanh(x) - grad = torch.autograd.grad(y, x, s) - return grad[0], None, None, None, None, None, None, None, None, None, None - - @staticmethod - def symbolic(g, x, scale_x, local_scale_o, running_o, scale_o, training, quant_mode, prefix, dump, path, data_bits): - param_dict = {'scale_x_f': scale_x(), 'scale_o_f': scale_o()} - input_list = [x, ] - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - param_dict['platform_quant_s'] = platform_quant - param_dict['castor_mode_s'] = "luna" - op = None - op = g.op("thinker::iqTanh", *input_list, **param_dict) - return op - - -class iqTanhLayer(torch.nn.Module): - def __init__(self, data_bits=16): - super(iqTanhLayer, self).__init__() - self.prefix = "" - self.dump = False - self.path = "" - self.data_bits = data_bits - self.register_buffer('scale_o', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - - def forward(self, input, local_scale_o, quant_mode=QuantMode.QValue): - scale_x = ScalerBuffer(input.scale_data) - local_scale_o = ScalerBuffer(local_scale_o) - scale_o = ScalerBuffer(self.scale_o) - running_o = ScalerBuffer(self.running_o) - if isinstance(input, IQTensor): - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits) - scale_x = ScalerBuffer(input.scale_data) - z = iqTanh.apply(input, scale_x, local_scale_o, running_o, scale_o, - self.training, quant_mode, self.prefix, self.dump, self.path, self.data_bits) - self.scale_o.fill_(scale_o.data) - self.running_o.fill_(running_o.data) - return z - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) - - -def tanh(tensor, *, 
out=None): - is_iq_tanh = True - if not isinstance(tensor, IQTensor): - is_iq_tanh = False - module_self = get_current_module() - if module_self is None: - is_iq_tanh = False - if not is_iq_tanh: - return torch_tanh(tensor) - # assert tensor.bits == 8, 'iqTanh only support 8bit' - quant_mode = getattr(module_self, LINGER_MODE, QuantMode.QValue) - iname_index = getattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER) - setattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER, iname_index+1) - - var_name = LINGER_MIX_INT8_MANUAL_ROUND_LAYERS + \ - '_iqtanh_'+str(iname_index) - iq_layer = None - if hasattr(module_self, var_name): - iq_layer = getattr(module_self, var_name) - else: - iq_layer = iqTanhLayer() - iq_layer.training = module_self.training - iq_layer = iq_layer.to(tensor.device) - setattr(module_self, var_name, iq_layer) - scale_z = None - with torch.no_grad(): - z_f = torch_tanh(tensor) - max_z = torch.max(torch.abs(z_f)) - if max_z == 0: - scale_z = 1.0 - else: - scale_z = 127 / max_z.item() - - return iq_layer(tensor, scale_z, quant_mode) - - -def tanh_(tensor, *, out=None): - return tanh(tensor, out) - - -class iqClamp(torch.autograd.Function): - @staticmethod - def forward(ctx, x, min, max, scale_x, local_scale_o, running_o, scale_o, training, quant_mode, prefix, dump, path): - ctx.save_for_backward(x) - ctx.min = min - ctx.max = max - scale_z_iq = local_scale_o - momentum = 0.1 - if training: - running_o.mul_(1-momentum).add_(momentum*(127/local_scale_o())) - if quant_mode == QuantMode.QValue: - scale_log = round(math.log(scale_z_iq(), 2)) - scale_z_iq = math.pow(2, scale_log) - scale_z_iq = ScalerBuffer(scale_z_iq) - else: - assert running_o.data > 0, 'Must at least training one batch' - if quant_mode == QuantMode.QValue: - scale_log = round(math.log(127/running_o.data, 2)) - scale_z_iq = math.pow(2, scale_log) - else: - scale_z_iq = np.float32(127 / running_o.data) - scale_z_iq = ScalerBuffer(scale_z_iq) - scale_o.fill_(scale_z_iq()) - y_int = None - y_float = 
torch_clamp(x, min, max) - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - Qx = math.log(127/max, 2) - assert Qx.is_integer( - ), "luna_quant max value don't support {} clamp, it must be (127 /2^n).".format(max) - assert (min == -128/2**math.log(127/7.9375, 2)) or (min == - 0), "luna_quant min value don't support {} clamp, it must match with the max-value (-128/2^n) or 0.".format(min) - scale_z_iq = 2**Qx - scale_z_iq = ScalerBuffer(scale_z_iq) - y_int = (x * scale_z_iq + 0.5).floor().int() - if min == 0: - y_int = torch_clamp(y_int, 0, 127) - scale_o.fill_(scale_z_iq()) - - else: - assert False, 'platform_quant mode donot support for iqClamp' - y_int = torch_clamp(y_int, -128, 127) - y_float = Quant.dequant(y_int, scale_z_iq) - if dump: - name_list = ["input", "outputs", "q_outputs"] - attr_list = [x, y_float, y_int] - Dump.dump_file(prefix, ".iqClamp.", zip( - name_list, attr_list), path) - return from_torch_tensor(y_float, scale_z_iq(), 8) - - @staticmethod - def backward(ctx, s): - x, = ctx.saved_tensors - min = ctx.min - max = ctx.max - x = x.detach().clone().requires_grad_(True) - grad = None - with torch.enable_grad(): - y = torch_clamp(x, min, max) - grad = torch.autograd.grad(y, (x,), s) - return grad[0], None, None, None, None, None, None, None, None, None, None, None - - @staticmethod - def symbolic(g, x, min, max, scale_x, local_scale_o, running_o, scale_o, training, quant_mode, prefix, dump, path): - param_dict = {'min_f': min, 'max_f': max, - 'scale_x_f': scale_x(), 'scale_o_f': scale_o()} - input_list = [x, ] - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - param_dict['platform_quant_s'] = platform_quant - - return g.op("thinker::iqClamp", *input_list, **param_dict) - - -class iqClampLayer(torch.nn.Module): - def __init__(self): - super(iqClampLayer, self).__init__() - self.prefix = "" - self.dump = False - self.path = "" - self.register_buffer('scale_o', torch.zeros(1)) - 
self.register_buffer('running_o', torch.zeros(1)) - - def forward(self, input, min, max, local_scale_o, quant_mode=QuantMode.QValue): - scale_x = ScalerBuffer(input.scale_data) - local_scale_o = ScalerBuffer(local_scale_o) - scale_o = ScalerBuffer(self.scale_o) - running_o = ScalerBuffer(self.running_o) - z = iqClamp.apply(input, min, max, scale_x, local_scale_o, running_o, - scale_o, self.training, quant_mode, self.prefix, self.dump, self.path) - self.scale_o.fill_(scale_o.data) - self.running_o.fill_(running_o.data) - return z - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) - - -def clamp(tensor, min=-math.inf, max=math.inf, out=None): - is_iq_clamp = True - if not isinstance(tensor, IQTensor): - is_iq_clamp = False - module_self = get_current_module() - if module_self is None: - is_iq_clamp = False - if not is_iq_clamp: - return torch_clamp(tensor, min, max) - assert tensor.bits == 8, 'iqclamp only support 8bit' - quant_mode = getattr(module_self, LINGER_MODE, QuantMode.QValue) - iname_index = getattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER) - setattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER, iname_index+1) - var_name = LINGER_MIX_INT8_MANUAL_ROUND_LAYERS + \ - '_iqclamp_'+str(iname_index) - iq_layer = None - if hasattr(module_self, var_name): - iq_layer = getattr(module_self, var_name) - else: - iq_layer = iqClampLayer() - iq_layer.training = module_self.training - iq_layer = iq_layer.to(tensor.device) - setattr(module_self, var_name, iq_layer) - scale_z = None - with torch.no_grad(): - z_f = torch_clamp(tensor, min, max) - max_z = torch.max(torch.abs(z_f)) - if max_z == 0: - scale_z = 1.0 - else: - scale_z = 127 / max_z.item() - return iq_layer(tensor, min, max, scale_z, quant_mode) - - -def clamp_(tensor, min=-math.inf, max=math.inf, 
out=None): - return clamp(tensor, min, max, out) - - -def dropout(tensor, p: float, training: bool = False, inplace: bool = False): - return torch_dropout(tensor, p, False, inplace) - - -def pack_padded_sequence(input, lengths, batch_first=False, enforce_sorted=True): - return input, lengths, batch_first, enforce_sorted - - -def pad_packed_sequence(sequence, batch_first=False, padding_value=0.0, total_length=None): - assert (padding_value == 0.0 and total_length is None), 'lstmint for pad_packed only support padding_value=0.0 and total_length=None' - output, lengths = sequence - return output, lengths - - -def transpose(*args, **kwargs): - if forward_torch_tensor(args[0]): - return torch_transpose(*args, **kwargs) - assert isinstance(args[0], IQTensor) - assert hasattr(args[0], 'scale_data') - assert hasattr(args[0], 'bits') - return iqTranspose.apply(args[0], args[1], args[2]) - - -class iqMax(torch.autograd.Function): - @staticmethod - def forward(self, input, other, scale_x, scale_y, scale_o): - self.save_for_backward(*(input, other)) - y = torch_max(input, other) - - return from_torch_tensor(y, scale_o, input.bits, input.zero_point) - - @staticmethod - def backward(self, s): - x, y = self.saved_tensors - x = x.detach().clone().requires_grad_(True) - y = y.detach().clone().requires_grad_(True) - - grad = None - with torch.enable_grad(): - z = torch_max(x, y) - grad = torch.autograd.grad(z, (x, y), s) - return grad[0], grad[1], None, None, None - - @staticmethod - def symbolic(g, input, other, scale_x, scale_y, scale_o): - # torch.max(input, other) - param_dict = {'scale_x_f': scale_x, - 'scale_y_f': scale_y, 'scale_o_f': scale_o} - input_list = [input, other, ] - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - param_dict['platform_quant_s'] = platform_quant - - op = g.op("thinker::iqMax", *input_list, **param_dict) - return op - - -def iqmax(*args, **kwargs): - if len(args) != 2: - return torch_max(*args, **kwargs) - if not 
isinstance(args[0], IQTensor): - return torch_max(*args, **kwargs) - if not isinstance(args[1], IQTensor): - return torch_max(*args, **kwargs) - if len(kwargs) != 0: - return torch_max(*args, **kwargs) - assert isinstance(args[0], IQTensor) - assert isinstance(args[1], IQTensor) - - assert hasattr(args[0], 'scale_data') - assert hasattr(args[0], 'bits') - assert hasattr(args[1], 'scale_data') - assert hasattr(args[1], 'bits') - - max_scale_o = min(args[0].scale_data, args[1].scale_data) - return iqMax.apply(args[0], args[1], args[0].scale_data, args[1].scale_data, max_scale_o) - - -def bmm(input, mat2, *, out=None): - module_self = get_current_module() - - quant_mode = getattr(module_self, LINGER_MODE, QuantMode.QValue) - - out_bits = getattr(module_self, LINGER_OBIT, - None) if True else None - iname_index = getattr(module_self, LINGER_FUNCINT_BMM_COUNTER) - setattr(module_self, LINGER_FUNCINT_BMM_COUNTER, iname_index+1) - var_name = LINGER_MIX_INT8_MANUAL_ROUND_LAYERS + \ - '_function_bmm_'+str(iname_index) - quant_layer = None - bmm_output = None - if hasattr(module_self, var_name): - quant_layer = getattr(module_self, var_name) - else: - quant_layer = BmmInt(data_bits=8, mode=quant_mode,) - quant_layer.training = module_self.training - quant_layer = quant_layer.to(input.device) - setattr(module_self, var_name, quant_layer) - quant_layer.clamp_data = None - quant_layer.o_bits = out_bits - bmm_output = quant_layer(input, mat2) - return bmm_output - - -class iqVar(torch.autograd.Function): - @staticmethod - def forward(ctx, x, dim, unbiased, keepdim, scale_x, local_scale_o, running_o, scale_o, training, quant_mode, prefix, dump, path): - ctx.save_for_backward(x) - ctx.value = dim, unbiased, keepdim - x_int = x.quant_to_int8(scale_x()) - x_int = x_int.contiguous() - scale_z_iq = local_scale_o - momentum = 0.1 - bound_value = 127 - - if training: - running_o.mul_(1-momentum).add_(momentum * (bound_value/local_scale_o())) - if quant_mode == QuantMode.QValue: - 
scale_log = round(math.log(scale_z_iq(), 2)) - scale_z_iq = math.pow(2, scale_log) - scale_z_iq = ScalerBuffer(scale_z_iq) - else: - assert running_o.data > 0, 'Must at least training one batch' - if quant_mode == QuantMode.QValue: - scale_log = round(math.log(bound_value/running_o.data, 2)) - scale_z_iq = math.pow(2, scale_log) - else: - scale_z_iq = np.float32(bound_value / running_o.data) - scale_z_iq = ScalerBuffer(scale_z_iq) - scale_o.fill_(scale_z_iq()) - y_int = None - - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - x_float = Quant.dequant(x_int, scale_x) - y_float = torch_var(x_float, dim, unbiased, keepdim) - y_int = (y_float * scale_z_iq()).round().int() - y_int.clamp_(-128,127) - - y_float = Quant.dequant(y_int, scale_z_iq) - - if dump: - name_list = ["input", "outputs", "q_input", "q_outputs"] - attr_list = [x, y_float, x_int, y_int] - Dump.dump_file(prefix, ".iqVar.", zip(name_list, attr_list), path) - - return from_torch_tensor(y_float, scale_z_iq(), 8) - - @staticmethod - def backward(ctx, s): - x, = ctx.saved_tensors - dim, unbiased, keepdim = ctx.value - x = x.detach().clone().requires_grad_(True) - grad = None - with torch.enable_grad(): - y = torch_var(x, dim, unbiased, keepdim) - grad = torch.autograd.grad(y, x, s) - return grad[0], None, None, None, None, None, None, None, None, None, None, None, None - - @staticmethod - def symbolic(g, x, dim, unbiased, keepdim, scale_x, local_scale_o, running_o, scale_o, training, quant_mode, prefix, dump, path): - param_dict = {"scale_x_f": scale_x(), "scale_o_f": scale_o()} - input_list = [x, ] - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - param_dict["platform_quant_s"] = platform_quant - param_dict["castor_mode_s"] = "luna" - param_dict["dim_i"] = dim - param_dict["unbiased_i"] = unbiased - op = None - op = g.op("thinker::iqVar", *input_list, **param_dict) - return op - -class iqVarLayer(torch.nn.Module): - def __init__(self, data_bits=8, 
mode=QuantMode.QValue,): - super(iqVarLayer, self).__init__() - self.prefix = "" - self.dump = False - self.path = "" - self.data_bits = data_bits - self.mode = mode - self.register_buffer('scale_o', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - - def forward(self, input, dim, unbiased, keepdim, local_scale_o, quant_mode=QuantMode.QValue): - scale_x = ScalerBuffer(input.scale_data) - local_scale_o = ScalerBuffer(local_scale_o) - scale_o = ScalerBuffer(self.scale_o) - running_o = ScalerBuffer(self.running_o) - if isinstance(input, IQTensor): - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits, self.mode) - scale_x = ScalerBuffer(input.scale_data) - z = iqVar.apply(input, dim, unbiased, keepdim, scale_x, local_scale_o, running_o, scale_o, - self.training, quant_mode, self.prefix, self.dump, self.path) - self.scale_o.fill_(scale_o.data) - self.running_o.fill_(running_o.data) - return z - - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) - -def var(tensor, dim, unbiased=True, keepdim=False, *, out=None): - is_iq_var = True - if not isinstance(tensor,IQTensor): - is_iq_var = False - module_self = get_current_module() - if module_self is None: - is_iq_var = False - if not is_iq_var: - return torch_var(tensor) - - assert tensor.bits == 8, 'iqvar only support 8bit' - quant_mode = getattr(module_self, LINGER_MODE, QuantMode.QValue) - iname_index = getattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER) - setattr(module_self, LINGER_IQTENSOR_LAYER_COUNTER, iname_index+1) - var_name = LINGER_MIX_INT8_MANUAL_ROUND_LAYERS + '_iqvar_' + str(iname_index) - - iq_layer = None - if hasattr(module_self,var_name): - iq_layer = getattr(module_self,var_name) - else: - iq_layer = iqVarLayer() - 
iq_layer.training = module_self.training - iq_layer = iq_layer.to(tensor.device) - setattr(module_self,var_name,iq_layer) - scale_z = None - with torch.no_grad(): - z_f = torch_var(tensor, dim, unbiased, keepdim) - max_z = torch.max(torch.abs(z_f)) - if max_z == 0: - scale_z = 1.0 - else: - scale_z = 127 / max_z.item() - return iq_layer(tensor, dim, unbiased, keepdim, scale_z, quant_mode) - - -torch.max_pool2d = max_pool2d -torch.relu = relu -torch.max = iqmax - -torch.relu_ = relu_ -torch.transpose = transpose -torch.nn.functional.pad = pad - -__all__ = ['torch_relu', 'torch_relu_', 'torch_max_pool2d', 'torch_cat', 'iqCatLayer', 'cat', 'torch_sigmoid', 'torch_sigmoid_', - 'iqSigmoidLayer', 'sigmoid', 'sigmoid_', 'torch_tanh', 'torch_tanh_', 'iqTanhLayer', 'tanh', 'tanh_', 'torch_clamp', 'torch_clamp_', 'iqClampLayer', 'clamp', 'clamp_', - 'dropout', 'pack_padded_sequence', 'pad_packed_sequence', 'torch_pack_padded_sequence', 'torch_pad_packed_sequence', 'bmm', - 'torch_softmax', 'softmaxIntLayer', 'softmax', - 'torch_logsoftmax', 'logsoftmaxIntLayer', 'logsoftmax', 'var', 'channel_shuffle_quant', 'channel_shuffle'] diff --git a/linger/ops/lstm_int.py b/linger/ops/lstm_int.py deleted file mode 100644 index 64bcba8..0000000 --- a/linger/ops/lstm_int.py +++ /dev/null @@ -1,1239 +0,0 @@ -import copy -import math -from collections import OrderedDict - -import lingerext -import torch -import torch.nn as nn -import torch.nn.functional as F -from torch.nn.utils.rnn import PackedSequence -from torch.onnx import is_in_onnx_export - -from ..config import config -from ..quant import (normalize_bias_with_config, normalize_data_with_config, - normalize_weight_with_config) -from ..utils import (Dump, PlatFormQuant, QuantMode, ScalerBuffer, _slice, - _unbind, _unbind_packed, get_max_value, hx_slice) -from .iqtensor import (IQTensor, Quant2IQTensor, from_torch_tensor, - platform_to_string, quantlinear) -from .linger_functional import (iqCat, torch_pack_padded_sequence, - 
torch_pad_packed_sequence) -from .ops import ModuleIntConfig -from .requant import Requant - -iqcat_sym = iqCat.symbolic - - -def castor_luna_sigmoid(x_int, scale_x): - l_scale = 11 - int(math.log2(scale_x)) - - if l_scale > 0: - x_int = x_int * pow(2, l_scale) - else: - x_int = (x_int * pow(2, l_scale) + 0.5).floor().int() - - x_int.clamp_(-2**15, 2**15-1) - y_int = lingerext.luna_iqsigmoid(x_int.contiguous(), float(scale_x)) - y_int.clamp_(0, 2**7-1) - - return y_int - - -def castor_luna_tanh(x_int, scale_x): - l_scale = 11 - int(math.log2(scale_x)) - - if l_scale > 0: - x_int = x_int * pow(2, l_scale) - else: - x_int = (x_int * pow(2, l_scale) + 0.5).floor().int() - - x_int.clamp_(-2**15, 2**15-1) - y_int = lingerext.luna_iqtanh(x_int.contiguous(), float(scale_x)) - y_int.clamp_(-2**7, 2**7-1) - - return y_int - - -class LSTMCellFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, hidden, cx, weight_ih, weight_hh, bias_ih, bias_hh, data_bits, parameter_bits, o_bits, - running_i, running_h, running_iw, running_hw, running_io, running_ho, running_o, - scale_i, scale_h, scale_iw, scale_hw, scale_io, scale_ho, scale_o, - momentum, training, prefix, dump, path, mode, quant, is_not_from_iqtensor, - clamp_data): - if training: - ctx.clamp_data = clamp_data - ctx.o_bits = o_bits - save_tensors = [input, hidden, weight_ih, - weight_hh, bias_ih, bias_hh] - q_input, scale_i, max_value_ix = quant.quant( - input, data_bits, mode=mode, quant_data='input') - q_iweight, scale_iw, max_value_iw = quant.quant( - weight_ih, parameter_bits, mode=mode, quant_data='weight') - running_iw.mul_(1-momentum).add_(momentum*max_value_iw) - - q_hidden, scale_h, max_value_hx = quant.quant( - hidden, data_bits, mode=mode, quant_data='input') - q_hweight, scale_hw, max_value_hw = quant.quant( - weight_hh, parameter_bits, mode=mode, quant_data='weight') - running_hw.mul_(1-momentum).add_(momentum*max_value_hw) - - q_input = q_input.float() if data_bits + parameter_bits <= 
16 else q_input.double() - q_iweight = q_iweight.float() if data_bits + \ - parameter_bits <= 16 else q_iweight.double() - q_hidden = q_hidden.float() if data_bits + \ - parameter_bits <= 16 else q_hidden.double() - q_hweight = q_hweight.float() if data_bits + \ - parameter_bits <= 16 else q_hweight.double() - - q_gi_outputs = F.linear(q_input, q_iweight) - q_gh_outputs = F.linear(q_hidden, q_hweight) - - if bias_ih is not None: - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - q_ibias = (bias_ih * scale_iw * - scale_i + 0.5).floor().int() - if data_bits + parameter_bits <= 16: - q_ibias = q_ibias.float() - else: - q_ibias = q_ibias.double() - else: - assert False, "linger only support luna quant." - q_gi_outputs += q_ibias.view(-1) - - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - q_hbias = (bias_hh * scale_hw * - scale_h + 0.5).floor().int() - if data_bits + parameter_bits <= 16: - q_hbias = q_hbias.float() - else: - q_hbias = q_hbias.double() - else: - assert False, "linger only support luna quant." - q_gh_outputs += q_hbias.view(-1) - - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: # QX+QW -> Q11 - l_scale_gi = 11 - int(math.log2(scale_i()*scale_iw())) - if l_scale_gi > 0: - gi = q_gi_outputs * pow(2, l_scale_gi) - else: - gi = (q_gi_outputs * pow(2, l_scale_gi) + 0.5).floor().int() - - l_scale_gh = 11 - int(math.log2(scale_h()*scale_hw())) - if l_scale_gh > 0: - gh = q_gh_outputs * pow(2, l_scale_gh) - else: - gh = (q_gh_outputs * pow(2, l_scale_gh) + 0.5).floor().int() - else: # QX+QW -> Q10 - assert False, "linger only support luna quant." 
- - for_backward_gi = quant.dequant(q_gi_outputs, scale_i*scale_iw) - for_backward_gh = quant.dequant(q_gh_outputs, scale_h*scale_hw) - save_tensors += [for_backward_gi, for_backward_gh] - - gates = gi + gh - ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1) - - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - ingate = castor_luna_sigmoid(ingate, 2048) # Q11->Q7 - forgetgate = castor_luna_sigmoid(forgetgate, 2048) # Q11->Q7 - cellgate = castor_luna_tanh(cellgate, 2048) # Q11->Q7 - outgate = castor_luna_sigmoid(outgate, 2048) # Q11->Q7 - new_cx = (cx * pow(2, 7) + 0.5).floor().int() # float->Q7 - # Q7*Q7 + Q7*Q7 -> Q14+Q14 ->Q14 - cy = (forgetgate * new_cx) + (ingate * cellgate) - # Q7*tanh(Q14->Q11)->Q7*Q7->Q14 - hy = outgate * castor_luna_tanh(cy, 2**14) - cy = quant.dequant(cy, float(2**14)) # Q14->float - hy = quant.dequant(hy, float(2**14)) # Q14->float - else: - assert False, "linger only support luna quant." - - save_tensors += [cx, cy, hy] - - if o_bits is not None: - hy = normalize_data_with_config(hy, clamp_data) - q_hy_outputs, scale_o, max_value_o = quant.quant( - hy, o_bits, mode=mode, quant_data='output') - running_o.mul_(1-momentum).add_(momentum*max_value_o) - running_h.mul_(1-momentum).add_(momentum*max_value_o) - hy = quant.dequant(q_hy_outputs, scale_o) - else: # 为None 时 也得保证 running_h的实际值为running_o - _, _, fake_max_value_o = quant.quant( - hy, 8, mode=mode, quant_data='output') - running_h.mul_(1-momentum).add_(momentum*fake_max_value_o) - - ctx.save_for_backward(*save_tensors) - - else: - assert running_i > 0, 'invalid running_i <= 0, please finetune training' - if weight_ih.dtype == torch.float32: - if is_not_from_iqtensor: - scale_i = ScalerBuffer(quant.running_to_scale( - running_i, parameter_bits, mode=mode)) - scale_h = ScalerBuffer(quant.running_to_scale( - running_h, parameter_bits, mode=mode)) - scale_iw = ScalerBuffer(quant.running_to_scale( - running_iw, parameter_bits, mode=mode)) - scale_hw = 
ScalerBuffer(quant.running_to_scale( - running_hw, parameter_bits, mode=mode)) - - q_input, scale_i, _ = quant.quant( - input, data_bits, scale_i, mode=mode, quant_data='input') - q_hidden, scale_h, _ = quant.quant( - hidden, data_bits, scale_h, mode=mode, quant_data='input') - q_iweight = None - q_hweight = None - if weight_ih.dtype == torch.float32: - q_iweight, _, _ = quant.quant( - weight_ih, parameter_bits, scale_iw, mode=mode, quant_data='weight') - q_hweight, _, _ = quant.quant( - weight_hh, parameter_bits, scale_hw, mode=mode, quant_data='weight') - else: - q_iweight = weight_ih.double() - q_hweight = weight_hh.double() - q_input = q_input.double() - q_iweight = q_iweight.double() - q_hidden = q_hidden.double() - q_hweight = q_hweight.double() - q_gi_outputs = F.linear(q_input, q_iweight) - q_gh_outputs = F.linear(q_hidden, q_hweight) - if bias_ih is not None: - if bias_ih.dtype == torch.float32: - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - q_ibias = (bias_ih * scale_iw * - scale_i + 0.5).floor().int() - q_hbias = (bias_hh * scale_hw * - scale_h + 0.5).floor().int() - if data_bits + parameter_bits <= 16: - q_ibias = q_ibias.float().double() - q_hbias = q_hbias.float().double() - else: - q_ibias = q_ibias.double() - q_hbias = q_hbias.double() - else: - assert False, "linger only support luna quant." 
- else: - q_ibias = bias_ih.double() - q_hbias = bias_hh.double() - q_gi_outputs += q_ibias.view(-1) - q_gh_outputs += q_hbias.view(-1) - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: # QX+QW -> Q11 - l_scale_gi = 11 - int(math.log2(scale_i()*scale_iw())) - if l_scale_gi > 0: - gi = q_gi_outputs * pow(2, l_scale_gi) - else: - gi = (q_gi_outputs * pow(2, l_scale_gi) + 0.5).floor().int() - - l_scale_gh = 11 - int(math.log2(scale_h()*scale_hw())) - if l_scale_gh > 0: - gh = q_gh_outputs * pow(2, l_scale_gh) - else: - gh = (q_gh_outputs * pow(2, l_scale_gh) + 0.5).floor().int() - else: # QX+QW -> Q10 - assert False, "linger only support luna quant." - - gates = gi + gh - ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1) - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - ingate = castor_luna_sigmoid(ingate, 2048) # Q11->Q7 - forgetgate = castor_luna_sigmoid(forgetgate, 2048) # Q11->Q7 - cellgate = castor_luna_tanh(cellgate, 2048) # Q11->Q7 - outgate = castor_luna_sigmoid(outgate, 2048) # Q11->Q7 - new_cx = (cx * pow(2, 7) + 0.5).floor().int() # float->Q7 - # Q7*Q7 + Q7*Q7 -> Q14+Q14 ->Q14 - cy = (forgetgate * new_cx) + (ingate * cellgate) - # Q7*tanh(Q14->Q11)->Q7*Q7->Q14 - hy = outgate * castor_luna_tanh(cy, 2**14) - cy = quant.dequant(cy, float(2**14)) # Q14->float - hy = quant.dequant(hy, float(2**14)) # Q14->float - else: - assert False, "linger only support luna quant." 
- - if o_bits is not None: - assert running_o > 0, 'invalid running_o<=0, please finetune training' - if weight_ih.dtype == torch.float32: - scale_o = ScalerBuffer(quant.running_to_scale( - running_o, o_bits, mode=mode)) - q_hy_outputs, _, _ = quant.quant( - hy, o_bits, scale_o, mode=mode, quant_data='output') - hy = quant.dequant(q_hy_outputs, scale_o) - if dump: - if bias_ih is not None and bias_hh is not None and o_bits is not None: - name_list = ["input", "hidden", "q_input", "q_hidden", 'q_iweight', 'q_hweight', "gi", "gh", "q_gi_outputs", "q_gh_outputs", "q_ibias", - "q_hbias", "scale_i", "scale_iw", "scale_h", "scale_hw", "ingate", "forgetgate", "cellgate", "outgate", "output", "q_hy_outputs", "gates"] - attr_list = [input, hidden, q_input, q_hidden, q_iweight, q_hweight, gi, gh, q_gi_outputs, q_gh_outputs, q_ibias, - q_hbias, scale_i, scale_iw, scale_h, scale_hw, ingate, forgetgate, cellgate, outgate, hy, q_hy_outputs, gates] - Dump.dump_file(prefix, ".LstmInt.", zip( - name_list, attr_list), path) - else: - name_list = ["input", "hidden", "q_input", "q_hidden", 'q_iweight', 'q_hweight', "gi", "gh", "q_gi_outputs", - "q_gh_outputs", "scale_i", "scale_iw", "scale_h", "scale_hw", "ingate", "forgetgate", "cellgate", "outgate", "output"] - attr_list = [input, hidden, q_input, q_hidden, q_iweight, q_hweight, gi, gh, q_gi_outputs, - q_gh_outputs, scale_i, scale_iw, scale_h, scale_hw, ingate, forgetgate, cellgate, outgate, hy] - Dump.dump_file(prefix, ".LstmInt.", zip( - name_list, attr_list), path) - return hy, cy - - @staticmethod - def backward(ctx, grad_hy, grad_hc): - input, hidden, weight_ih, weight_hh, input_bias, hidden_bias, input_gates, hidden_gates, cx, cy, hy = ctx.saved_tensors - clamp_data = ctx.clamp_data - o_bits = ctx.o_bits - hy.requires_grad_(True) - hidden.requires_grad_(True) - with torch.enable_grad(): - if o_bits is not None: - clamp_hy = normalize_data_with_config(hy, clamp_data) - else: - clamp_hy = hy.clone() - grad_hy, = 
torch.autograd.grad(clamp_hy, hy, grad_hy) - grad_input_gates, grad_hidden_gates, grad_cx, grad_input_bias, grad_hidden_bias = lingerext.lstm_cell_backward( - grad_hy, grad_hc, input_gates, hidden_gates, input_bias, hidden_bias, cx, cy) - - grad_in = grad_input_gates.matmul(weight_ih) - grad_hx = grad_hidden_gates.matmul(weight_hh) - grad_w_ih = grad_input_gates.t().matmul(input) - grad_h_ih = grad_hidden_gates.t().matmul(hidden) - - return grad_in, grad_hx, grad_cx, grad_w_ih, grad_h_ih, grad_input_bias, grad_hidden_bias, None, None, None, None, None, None, None, None, \ - None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None - - -class LSTMSingleONNXFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, lengths, hidden_state, cell_state, weight_ih, weight_hh, bias_ih, bias_hh, - weight_ih_reverse, weight_hh_reverse, bias_ih_reverse, bias_hh_reverse, - input_size, hidden_size, num_layers, batch_first, dropout, bidirectional, - data_bits, parameter_bits, o_bits, - scale_i, scale_iw, scale_io, scale_h, scale_hw, scale_ho, scale_o, - scale_i_reverse, scale_iw_reverse, scale_io_reverse, scale_h_reverse, scale_hw_reverse, scale_ho_reverse, scale_o_reverse, - is_not_from_iqtensor): - output = None - hidden_state = None - cell_state = None - batch_size = None - seq_length = None - num_directions = 2 if bidirectional else 1 - if batch_first: - batch_size = input.size(0) - seq_length = input.size( - 1) if lengths is None else torch.max(lengths) - output = torch.randn(batch_size, seq_length, - hidden_size*num_directions, device=input.device) - else: - batch_size = input.size(1) - seq_length = input.size( - 0) if lengths is None else torch.max(lengths) - output = torch.randn(seq_length, batch_size, - hidden_size*num_directions, device=input.device) - hidden_state = torch.zeros( - num_directions, batch_size, hidden_size, device=input.device) - cell_state = 
torch.zeros( - num_directions, batch_size, hidden_size, device=input.device) - return output, hidden_state, cell_state - - @staticmethod - def backward(ctx, gradOutput, gradHidden, gradCell): - - return None, None, None, None, None, None, None,\ - None, None, None, None,\ - None, None, None, None, None, None,\ - None, None, None,\ - None, None, None, None, None, None, None,\ - None, None, None, None, None, None, None,\ - None, None, None, None - - @staticmethod - def symbolic(g, input, lengths, hidden_state, cell_state, weight_ih, weight_hh, bias_ih, bias_hh, - weight_ih_reverse, weight_hh_reverse, bias_ih_reverse, bias_hh_reverse, - input_size, hidden_size, num_layers, batch_first, dropout, bidirectional, - data_bits, parameter_bits, o_bits, - scale_i, scale_iw, scale_io, scale_h, scale_hw, scale_ho, scale_o, - scale_i_reverse, scale_iw_reverse, scale_io_reverse, scale_h_reverse, scale_hw_reverse, scale_ho_reverse, scale_o_reverse, - is_not_from_iqtensor): - - param_dict = {'input_size_i': input_size, 'hidden_size_i': hidden_size, 'num_layers_i': num_layers, - 'batch_first_i': batch_first, 'dropout_f': 0, 'go_forward_i': True, - 'scale_i_f': scale_i(), 'scale_h_f': scale_h(), 'scale_iw_f': scale_iw(), 'scale_hw_f': scale_hw(), - 'data_bits_i': data_bits, 'parameter_bits_i': parameter_bits, - } - param_back_dict = {} - if bidirectional: - param_back_dict = {'input_size_i': input_size, 'hidden_size_i': hidden_size, 'num_layers_i': num_layers, - 'batch_first_i': batch_first, 'dropout_f': 0, 'go_forward_i': False, - 'scale_i_f': scale_i_reverse(), 'scale_h_f': scale_h_reverse(), 'scale_iw_f': scale_iw_reverse(), 'scale_hw_f': scale_hw_reverse(), - 'data_bits_i': data_bits, 'parameter_bits_i': parameter_bits, - } - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - op_inner = None - input_list = None - input_back_list = None - if is_not_from_iqtensor: - op_inner = quantlinear(g, input, scale_i(), - platform_quant, data_bits) - input_list = 
[op_inner, weight_ih, weight_hh] - input_back_list = [op_inner, weight_ih_reverse, weight_hh_reverse] - else: - input_list = [input, weight_ih, weight_hh] - input_back_list = [input, weight_ih_reverse, weight_hh_reverse] - if bias_ih is not None and bias_hh is not None: - input_list.append(bias_ih) - input_list.append(bias_hh) - input_back_list.append(bias_ih_reverse) - input_back_list.append(bias_hh_reverse) - if o_bits is not None: - param_dict['scale_o_f'] = scale_o() - param_dict['o_bits_i'] = o_bits - if bidirectional: - param_back_dict['scale_o_f'] = scale_o_reverse() - param_back_dict['o_bits_i'] = o_bits - - param_dict['platform_quant_s'] = platform_quant - param_dict['outputs'] = 3 - param_back_dict['platform_quant_s'] = platform_quant - param_back_dict['outputs'] = 3 - if lengths is None and hidden_state is None: - lstm, hidden, cell = g.op( - "thinker::LSTMInt", *input_list, **param_dict) - if bidirectional: - lstm_backward, hidden_back, cell_back = g.op( - "thinker::LSTMInt", *input_back_list, **param_back_dict) - if o_bits is not None: - scale_o = ScalerBuffer(min(scale_o(), scale_o_reverse())) - args = [lstm, lstm_backward, scale_o, scale_o_reverse] - lstm = iqcat_sym(g, None, scale_o, None, 2, - None, False, None, None, None, *args) - else: - args = [lstm, lstm_backward] - lstm = g.op("Concat", *args, axis_i=2) - hidden = g.op("Concat", hidden, hidden_back, axis_i=0) - cell = g.op("Concat", cell, cell_back, axis_i=0) - elif lengths is not None and hidden_state is None: - input_list.insert(1, lengths) - input_back_list.insert(1, lengths) - lstm, hidden, cell = g.op( - "thinker::LSTMInt", *input_list, **param_dict) - if bidirectional: - lstm_backward, hidden_back, cell_back = g.op( - "thinker::LSTMInt", *input_back_list, **param_back_dict) - if o_bits is not None: - scale_o = ScalerBuffer(min(scale_o(), scale_o_reverse())) - args = [lstm, lstm_backward, scale_o, scale_o_reverse] - lstm = iqcat_sym(g, None, scale_o, None, 2, - None, False, None, None, 
None, *args) - else: - args = [lstm, lstm_backward] - lstm = g.op("Concat", *args, axis_i=2) - hidden = g.op("Concat", hidden, hidden_back, axis_i=0) - cell = g.op("Concat", cell, cell_back, axis_i=0) - else: - input_list.insert(1, lengths) - input_list.insert(2, hidden_state) - input_list.insert(3, cell_state) - input_back_list.insert(1, lengths) - input_back_list.insert(2, hidden_state) - input_back_list.insert(3, cell_state) - lstm, hidden, cell = g.op( - "thinker::LSTMInt", *input_list, **param_dict) - if bidirectional: - lstm_backward, hidden_back, cell_back = g.op( - "thinker::LSTMInt", *input_back_list, **param_back_dict) - if o_bits is not None: - scale_o = ScalerBuffer(min(scale_o(), scale_o_reverse())) - args = [lstm, lstm_backward, scale_o, scale_o_reverse] - lstm = iqcat_sym(g, None, scale_o, None, 2, - None, False, None, None, None, *args) - else: - args = [lstm, lstm_backward] - lstm = g.op("Concat", *args, axis_i=2) - hidden = g.op("Concat", hidden, hidden_back, axis_i=0) - cell = g.op("Concat", cell, cell_back, axis_i=0) - - return lstm, hidden, cell - - -class LSTMInt(nn.LSTM): - r"""实现LSTMInt的量化训练与测试,继承自nn.LSTM, - - Args: - input_size hidden_size num_layers bias batch_first dropout bidirectional - 与nn.GRU一致的参数 - unified(bool): 确认正反向参数统计是否一致 - data_bits(int): 输入量化位数 - parameter_bits(int): 参数量化位数 - mode(Enum): 量化方式,支持MaxValue与Qvalue - o_bits(int, default=None):输出量化位数 - scale_i(np.float32): 统计的是LSTMP的输入scale,输入大小为(b, t, d)或(t, b, d) - scale_h(np.float32): 统计的是每一帧计算隐藏输出的最值momentum统计得到的scale - scale_iw(np.float32): 依据最终的模型参数计算得到,无统计 - scale_hw(np.float32): 依据最终模型参数计算得到,无统计参数 - scale_o(np.float32): 最终输出的统计scale - scale_reverse_*(np.float32):对应反向过程中各个scale数值 - """ - - def __init__(self, input_size, hidden_size, num_layers=1, bias=True, batch_first=False, dropout=0, bidirectional=False, unified=True, - data_bits=8, parameter_bits=8, mode=QuantMode.QValue, o_bits=None, clamp_data=None, clamp_weight=None, clamp_bias=None): - nn.LSTM.__init__(self, 
input_size, hidden_size, num_layers, - bias, batch_first, dropout, bidirectional) - ModuleIntConfig.__init__( - self, data_bits=data_bits, parameter_bits=parameter_bits, mode=mode, o_bits=o_bits) - self.prefix = "" - self.dump = False - self.path = "" - self.unified = unified - self.momentum = 0.1 - self.is_not_from_iqtensor = True - self.clamp_data = clamp_data - self.clamp_weight = clamp_weight - self.clamp_bias = clamp_bias - - self.register_buffer('running_i', torch.zeros(1)) - self.register_buffer('running_h', torch.zeros(1)) - self.register_buffer('running_iw', torch.zeros(1)) - self.register_buffer('running_hw', torch.zeros(1)) - self.register_buffer('running_io', torch.zeros(1)) - self.register_buffer('running_ho', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - - self.register_buffer('scale_i', torch.zeros(1)) - self.register_buffer('scale_h', torch.zeros(1)) - self.register_buffer('scale_iw', torch.zeros(1)) - self.register_buffer('scale_hw', torch.zeros(1)) - self.register_buffer('scale_io', torch.zeros(1)) - self.register_buffer('scale_ho', torch.zeros(1)) - self.register_buffer('scale_o', torch.zeros(1)) - if self.bidirectional: - self.register_buffer('running_i_reverse', torch.zeros(1)) - self.register_buffer('running_h_reverse', torch.zeros(1)) - self.register_buffer('running_iw_reverse', torch.zeros(1)) - self.register_buffer('running_hw_reverse', torch.zeros(1)) - self.register_buffer('running_io_reverse', torch.zeros(1)) - self.register_buffer('running_ho_reverse', torch.zeros(1)) - self.register_buffer('running_o_reverse', torch.zeros(1)) - - self.register_buffer('scale_i_reverse', torch.zeros(1)) - self.register_buffer('scale_h_reverse', torch.zeros(1)) - self.register_buffer('scale_iw_reverse', torch.zeros(1)) - self.register_buffer('scale_hw_reverse', torch.zeros(1)) - self.register_buffer('scale_io_reverse', torch.zeros(1)) - self.register_buffer('scale_ho_reverse', torch.zeros(1)) - 
self.register_buffer('scale_o_reverse', torch.zeros(1)) - - self.sigmoid_table = None - self.tanh_table = None - - def _single_direction_tensor(self, input, hidden, cell_state, layer=0, direct=0): - step_outputs = [] - weight_ih = self.weight_ih_l0_reverse if direct == 1 else self.weight_ih_l0 - weight_hh = self.weight_hh_l0_reverse if direct == 1 else self.weight_hh_l0 - bias_ih = self.bias_ih_l0_reverse if direct == 1 else self.bias_ih_l0 - bias_hh = self.bias_hh_l0_reverse if direct == 1 else self.bias_hh_l0 - if weight_ih.dtype == torch.float32: - weight_ih = normalize_weight_with_config( - weight_ih, self.clamp_weight, self.training) - weight_hh = normalize_weight_with_config( - weight_hh, self.clamp_weight, self.training) - bias_ih = normalize_bias_with_config( - bias_ih, self.clamp_bias, self.training) - bias_hh = normalize_bias_with_config( - bias_hh, self.clamp_bias, self.training) - - running_i_tensor = self.running_i_reverse if direct == 1 and not self.unified else self.running_i - running_h_tensor = self.running_h_reverse if direct == 1 and not self.unified else self.running_h - running_iw_tensor = self.running_iw_reverse if direct == 1 and not self.unified else self.running_iw - running_hw_tensor = self.running_hw_reverse if direct == 1 and not self.unified else self.running_hw - running_io_tensor = self.running_io_reverse if direct == 1 and not self.unified else self.running_io - running_ho_tensor = self.running_ho_reverse if direct == 1 and not self.unified else self.running_ho - running_o_tensor = self.running_o_reverse if direct == 1 and not self.unified else self.running_o - - scale_i_tensor = self.scale_i_reverse if direct == 1 and not self.unified else self.scale_i - scale_h_tensor = self.scale_h_reverse if direct == 1 and not self.unified else self.scale_h - scale_iw_tensor = self.scale_iw_reverse if direct == 1 and not self.unified else self.scale_iw - scale_hw_tensor = self.scale_hw_reverse if direct == 1 and not self.unified else 
self.scale_hw - scale_io_tensor = self.scale_io_reverse if direct == 1 and not self.unified else self.scale_io - scale_ho_tensor = self.scale_ho_reverse if direct == 1 and not self.unified else self.scale_ho - scale_o_tensor = self.scale_o_reverse if direct == 1 and not self.unified else self.scale_o - - running_i = ScalerBuffer(running_i_tensor) - running_h = ScalerBuffer(running_h_tensor) - running_iw = ScalerBuffer(running_iw_tensor) - running_hw = ScalerBuffer(running_hw_tensor) - running_io = ScalerBuffer(running_io_tensor) - running_ho = ScalerBuffer(running_ho_tensor) - running_o = ScalerBuffer(running_o_tensor) - scale_i = ScalerBuffer(scale_i_tensor) - scale_h = ScalerBuffer(scale_h_tensor) - scale_iw = ScalerBuffer(scale_iw_tensor) - scale_hw = ScalerBuffer(scale_hw_tensor) - scale_io = ScalerBuffer(scale_io_tensor) - scale_ho = ScalerBuffer(scale_ho_tensor) - scale_o = ScalerBuffer(scale_o_tensor) - - input = torch.cat(input.split(1, 0)[::-1]) if direct == 1 else input - if self.training: - if self.is_not_from_iqtensor: - max_value_ix = get_max_value(input) - running_i.mul_( - 1-self.momentum).add_(self.momentum*max_value_ix) - - for input_x in input: - hidden, cell_state = LSTMCellFunction.apply(input_x, hidden, cell_state, weight_ih, weight_hh, bias_ih, bias_hh, - self.data_bits, self.parameter_bits, self.o_bits, - running_i, running_h, running_iw, running_hw, running_io, running_ho, running_o, - scale_i, scale_h, scale_iw, scale_hw, scale_io, scale_ho, scale_o, - self.momentum, self.training, self.prefix, self.dump, self.path, self.quant_mode, self.quant, - self.is_not_from_iqtensor, self.clamp_data) - step_outputs.append(hidden) - - running_i_tensor.fill_(running_i()) - running_h_tensor.fill_(running_h()) - running_iw_tensor.fill_(running_iw()) - running_hw_tensor.fill_(running_hw()) - running_io_tensor.fill_(running_io()) - running_ho_tensor.fill_(running_ho()) - running_o_tensor.fill_(running_o()) - scale_i_tensor.fill_(scale_i()) - 
scale_h_tensor.fill_(scale_h()) - scale_iw_tensor.fill_(scale_iw()) - scale_hw_tensor.fill_(scale_hw()) - scale_io_tensor.fill_(scale_io()) - scale_ho_tensor.fill_(scale_ho()) - scale_o_tensor.fill_(scale_o()) - - step_outputs = step_outputs[::-1] if direct == 1 else step_outputs - output = torch.stack(step_outputs, 0) - return output, (hidden, cell_state) - - def _single_direction_packed(self, input, hidden, cell_state, layer=0, direct=0, batch_sizes=None): - if direct: - return self._packed_reverse(input, hidden, cell_state, layer, direct, batch_sizes) - else: - return self._packed_forward(input, hidden, cell_state, layer, direct, batch_sizes) - - def _packed_forward(self, input, hidden, cell_state, layer=0, direct=0, batch_sizes=None): - step_outputs = [] - final_hiddens = [] - weight_ih = self.weight_ih_l0_reverse if direct == 1 else self.weight_ih_l0 - weight_hh = self.weight_hh_l0_reverse if direct == 1 else self.weight_hh_l0 - bias_ih = self.bias_ih_l0_reverse if direct == 1 else self.bias_ih_l0 - bias_hh = self.bias_hh_l0_reverse if direct == 1 else self.bias_hh_l0 - if weight_ih.dtype == torch.float32: - weight_ih = normalize_weight_with_config( - weight_ih, self.clamp_weight, self.training) - weight_hh = normalize_weight_with_config( - weight_hh, self.clamp_weight, self.training) - bias_ih = normalize_bias_with_config( - bias_ih, self.clamp_bias, self.training) - bias_hh = normalize_bias_with_config( - bias_hh, self.clamp_bias, self.training) - - running_i_tensor = self.running_i_reverse if direct == 1 and not self.unified else self.running_i - running_h_tensor = self.running_h_reverse if direct == 1 and not self.unified else self.running_h - running_iw_tensor = self.running_iw_reverse if direct == 1 and not self.unified else self.running_iw - running_hw_tensor = self.running_hw_reverse if direct == 1 and not self.unified else self.running_hw - running_io_tensor = self.running_io_reverse if direct == 1 and not self.unified else self.running_io - 
running_ho_tensor = self.running_ho_reverse if direct == 1 and not self.unified else self.running_ho - running_o_tensor = self.running_o_reverse if direct == 1 and not self.unified else self.running_o - - scale_i_tensor = self.scale_i_reverse if direct == 1 and not self.unified else self.scale_i - scale_h_tensor = self.scale_h_reverse if direct == 1 and not self.unified else self.scale_h - scale_iw_tensor = self.scale_iw_reverse if direct == 1 and not self.unified else self.scale_iw - scale_hw_tensor = self.scale_hw_reverse if direct == 1 and not self.unified else self.scale_hw - scale_io_tensor = self.scale_io_reverse if direct == 1 and not self.unified else self.scale_io - scale_ho_tensor = self.scale_ho_reverse if direct == 1 and not self.unified else self.scale_ho - scale_o_tensor = self.scale_o_reverse if direct == 1 and not self.unified else self.scale_o - - running_i = ScalerBuffer(running_i_tensor) - running_h = ScalerBuffer(running_h_tensor) - running_iw = ScalerBuffer(running_iw_tensor) - running_hw = ScalerBuffer(running_hw_tensor) - running_io = ScalerBuffer(running_io_tensor) - running_ho = ScalerBuffer(running_ho_tensor) - running_o = ScalerBuffer(running_o_tensor) - scale_i = ScalerBuffer(scale_i_tensor) - scale_h = ScalerBuffer(scale_h_tensor) - scale_iw = ScalerBuffer(scale_iw_tensor) - scale_hw = ScalerBuffer(scale_hw_tensor) - scale_io = ScalerBuffer(scale_io_tensor) - scale_ho = ScalerBuffer(scale_ho_tensor) - scale_o = ScalerBuffer(scale_o_tensor) - - hidden = copy.deepcopy(hidden) - cell_state = copy.deepcopy(cell_state) - input, batch_size_list = _unbind_packed(input, batch_sizes) - last_batch_size = batch_size_list[0] - if self.training: - if self.is_not_from_iqtensor: - max_value_ix = get_max_value(input) - running_i.mul_( - 1-self.momentum).add_(self.momentum*max_value_ix) - - for input_i, batch_len in zip(input, batch_size_list): - inc = batch_len - last_batch_size - if inc < 0: - # 按batch的帧长排完序,由长到短,较短的帧hidden计算的次数少,直接取低位保留 - 
final_hiddens.append( - _slice((hidden, cell_state), batch_len, last_batch_size)) - hidden, cell_state = hx_slice( - None, (hidden, cell_state), last_batch_size, batch_len) - hidden, cell_state = LSTMCellFunction.apply(input_i, hidden, cell_state, weight_ih, weight_hh, bias_ih, bias_hh, - self.data_bits, self.parameter_bits, self.o_bits, - running_i, running_h, running_iw, running_hw, running_io, running_ho, running_o, - scale_i, scale_h, scale_iw, scale_hw, scale_io, scale_ho, scale_o, - self.momentum, self.training, self.prefix, self.dump, self.path, self.quant_mode, self.quant, - self.is_not_from_iqtensor, self.clamp_data) - step_outputs.append(hidden) - last_batch_size = batch_len - - running_i_tensor.fill_(running_i()) - running_h_tensor.fill_(running_h()) - running_iw_tensor.fill_(running_iw()) - running_hw_tensor.fill_(running_hw()) - running_io_tensor.fill_(running_io()) - running_ho_tensor.fill_(running_ho()) - running_o_tensor.fill_(running_o()) - scale_i_tensor.fill_(scale_i()) - scale_h_tensor.fill_(scale_h()) - scale_iw_tensor.fill_(scale_iw()) - scale_hw_tensor.fill_(scale_hw()) - scale_io_tensor.fill_(scale_io()) - scale_ho_tensor.fill_(scale_ho()) - scale_o_tensor.fill_(scale_o()) - - final_hiddens.append((hidden, cell_state)) - ret_hidden = final_hiddens[::-1] - hy_list = [] - cy_list = [] - for each in ret_hidden: - hy_list.append(each[0]) - cy_list.append(each[1]) - hidden = torch.cat(hy_list, 0) - cell_state = torch.cat(cy_list, 0) - output = torch.cat(step_outputs, 0) - return output, (hidden, cell_state) - - def _packed_reverse(self, input, hidden, cell_state, layer=0, direct=0, batch_sizes=None): - step_outputs = [] - weight_ih = self.weight_ih_l0_reverse if direct == 1 else self.weight_ih_l0 - weight_hh = self.weight_hh_l0_reverse if direct == 1 else self.weight_hh_l0 - bias_ih = self.bias_ih_l0_reverse if direct == 1 else self.bias_ih_l0 - bias_hh = self.bias_hh_l0_reverse if direct == 1 else self.bias_hh_l0 - if weight_ih.dtype == 
torch.float32: - weight_ih = normalize_weight_with_config( - weight_ih, self.clamp_weight, self.training) - weight_hh = normalize_weight_with_config( - weight_hh, self.clamp_weight, self.training) - bias_ih = normalize_bias_with_config( - bias_ih, self.clamp_bias, self.training) - bias_hh = normalize_bias_with_config( - bias_hh, self.clamp_bias, self.training) - - running_i_tensor = self.running_i_reverse if direct == 1 and not self.unified else self.running_i - running_h_tensor = self.running_h_reverse if direct == 1 and not self.unified else self.running_h - running_iw_tensor = self.running_iw_reverse if direct == 1 and not self.unified else self.running_iw - running_hw_tensor = self.running_hw_reverse if direct == 1 and not self.unified else self.running_hw - running_io_tensor = self.running_io_reverse if direct == 1 and not self.unified else self.running_io - running_ho_tensor = self.running_ho_reverse if direct == 1 and not self.unified else self.running_ho - running_o_tensor = self.running_o_reverse if direct == 1 and not self.unified else self.running_o - - scale_i_tensor = self.scale_i_reverse if direct == 1 and not self.unified else self.scale_i - scale_h_tensor = self.scale_h_reverse if direct == 1 and not self.unified else self.scale_h - scale_iw_tensor = self.scale_iw_reverse if direct == 1 and not self.unified else self.scale_iw - scale_hw_tensor = self.scale_hw_reverse if direct == 1 and not self.unified else self.scale_hw - scale_io_tensor = self.scale_io_reverse if direct == 1 and not self.unified else self.scale_io - scale_ho_tensor = self.scale_ho_reverse if direct == 1 and not self.unified else self.scale_ho - scale_o_tensor = self.scale_o_reverse if direct == 1 and not self.unified else self.scale_o - - running_i = ScalerBuffer(running_i_tensor) - running_h = ScalerBuffer(running_h_tensor) - running_iw = ScalerBuffer(running_iw_tensor) - running_hw = ScalerBuffer(running_hw_tensor) - running_io = ScalerBuffer(running_io_tensor) - running_ho = 
ScalerBuffer(running_ho_tensor) - running_o = ScalerBuffer(running_o_tensor) - scale_i = ScalerBuffer(scale_i_tensor) - scale_h = ScalerBuffer(scale_h_tensor) - scale_iw = ScalerBuffer(scale_iw_tensor) - scale_hw = ScalerBuffer(scale_hw_tensor) - scale_io = ScalerBuffer(scale_io_tensor) - scale_ho = ScalerBuffer(scale_ho_tensor) - scale_o = ScalerBuffer(scale_o_tensor) - - input, batch_size_list = _unbind_packed(input, batch_sizes) - input = input[::-1] # 按照时间t 进行反转 - batch_size_list = batch_size_list[::-1] - input_hx = (copy.deepcopy(hidden), copy.deepcopy(cell_state)) - last_batch_size = batch_size_list[0] - if self.training: - if self.is_not_from_iqtensor: - max_value_ix = get_max_value(input) - running_i.mul_( - 1-self.momentum).add_(self.momentum*max_value_ix) - hidden = _slice(hidden, 0, last_batch_size) - cell_state = _slice(cell_state, 0, last_batch_size) - for input_i, batch_len in zip(input, batch_size_list): - if last_batch_size != batch_len: - # 获取input_hx高位hidden部分与上一帧的hidden进行填充,相当于补0 - hidden, cell_state = hx_slice( - input_hx, (hidden, cell_state), last_batch_size, batch_len) - hidden, cell_state = LSTMCellFunction.apply(input_i, hidden, cell_state, weight_ih, weight_hh, bias_ih, bias_hh, - self.data_bits, self.parameter_bits, self.o_bits, - running_i, running_h, running_iw, running_hw, running_io, running_ho, running_o, - scale_i, scale_h, scale_iw, scale_hw, scale_io, scale_ho, scale_o, - self.momentum, self.training, self.prefix, self.dump, self.path, self.quant_mode, self.quant, - self.is_not_from_iqtensor, self.clamp_data) - step_outputs.append(hidden) - last_batch_size = batch_len - - running_i_tensor.fill_(running_i()) - running_h_tensor.fill_(running_h()) - running_iw_tensor.fill_(running_iw()) - running_hw_tensor.fill_(running_hw()) - running_io_tensor.fill_(running_io()) - running_ho_tensor.fill_(running_ho()) - running_o_tensor.fill_(running_o()) - scale_i_tensor.fill_(scale_i()) - scale_h_tensor.fill_(scale_h()) - 
scale_iw_tensor.fill_(scale_iw()) - scale_hw_tensor.fill_(scale_hw()) - scale_io_tensor.fill_(scale_io()) - scale_ho_tensor.fill_(scale_ho()) - scale_o_tensor.fill_(scale_o()) - - step_outputs = step_outputs[::-1] - output = torch.cat(step_outputs, 0) - return output, (hidden, cell_state) - - def _finetune(self, input, hidden, cell_state, layer=0, direct=0, batch_sizes=None): - if batch_sizes is None: - return self._single_direction_tensor(input, hidden, cell_state, layer, direct) - else: - return self._single_direction_packed(input, hidden, cell_state, layer, direct, batch_sizes) - - def _run_single_direction(self, input, hidden, cell_state, layer=0, direct=0, batch_sizes=None): - return self._finetune(input, hidden, cell_state, layer, direct, batch_sizes) - - def single_direction(self, input, layer, hx, batch_sizes=None): - hidden = hx[0] - cell_state = hx[1] - output, hidden = self._run_single_direction( - input, hidden, cell_state, layer, direct=0, batch_sizes=batch_sizes) - return output, [hidden] - - def bidirection(self, input, layer, hx, batch_sizes=None): - hx_f = hx[0][0] - ct_f = hx[0][1] - hx_b = hx[1][0] - ct_b = hx[1][1] - fw_output, fw_hidden = self._run_single_direction( - input, hx_f, ct_f, layer, direct=0, batch_sizes=batch_sizes) - rev_output, rev_hidden = self._run_single_direction( - input, hx_b, ct_b, layer, direct=1, batch_sizes=batch_sizes) - if batch_sizes is None: - output = torch.cat((fw_output, rev_output), fw_output.dim()-1) - else: # packed sequence - output = torch.cat((fw_output, rev_output), -1) - return output, [fw_hidden, rev_hidden] - - def lstm_forward(self, input, hiddens, batch_sizes=None): - final_hiddens = [] - for layer_num in range(self.num_layers): - hid = hiddens[layer_num] if hiddens is not None else None - output, hc = self.bidirection(input, layer_num, hid, batch_sizes) if self.bidirectional else self.single_direction( - input, layer_num, hid, batch_sizes) - final_hiddens.extend(hc) - input = output - # add dropout - 
if (self.dropout != 0 and self.training and layer_num < self.num_layers - 1): - input = torch.nn.functional.dropout(input, self.dropout) - hy = [hidden[0] for hidden in final_hiddens] - cy = [hidden[1] for hidden in final_hiddens] - hy = torch.stack(hy, 0) - cy = torch.stack(cy, 0) - return input, hy, cy - - def _generate_hiddens(self, hx): - if hx is not None: - assert len(hx) == 2, 'hidden(tuple) input length must be 2' - hidden_list = _unbind(hx[0]) - cellstate_list = _unbind(hx[1]) - assert len(hidden_list) == len(cellstate_list) - length = len(hidden_list) - if self.bidirectional: - assert length/self.num_layers % 2 == 0, 'hidden len must be double in bidirectional mode' - - i = 0 - hiddens = [] - while i < length: - if self.bidirectional: - hiddens.append( - ((hidden_list[i], cellstate_list[i]), (hidden_list[i+1], cellstate_list[i+1]))) - i += 2 - else: - hiddens.append((hidden_list[i], cellstate_list[i])) - i += 1 - else: - hiddens = None - return hiddens - - def forward_input_tensor(self, input, hx, batch_sizes=None): - input = input.transpose(0, 1) if self.batch_first else input - hiddens = self._generate_hiddens(hx) - output, hr, ct = self.lstm_forward(input, hiddens) - output = output.transpose(0, 1) if self.batch_first else output - return output, hr, ct - - def forward_input_packed(self, input, hx, batch_sizes=None): - hiddens = self._generate_hiddens(hx) - output, hr, ct = self.lstm_forward(input, hiddens, batch_sizes) - return output, hr, ct - - def forward(self, input, hx=None): - orig_input = input - if not is_in_onnx_export(): - if isinstance(orig_input, tuple): - input, lengths, batch_first, enforce_sorted = orig_input - if isinstance(input, IQTensor): - self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits) - self.scale_i.fill_(input.scale_data) - self.running_i.fill_(input.running_data) - if self.bidirectional: - 
self.scale_i_reverse.fill_(input.scale_data) - self.running_i_reverse.fill_(input.running_data) - packed_input = torch_pack_padded_sequence( - input, lengths, batch_first, enforce_sorted) - input, batch_sizes, sorted_indices, unsorted_indices = packed_input - max_batch_size = batch_sizes[0] - max_batch_size = int(max_batch_size) - else: - batch_sizes = None - max_batch_size = input.size( - 0) if self.batch_first else input.size(1) - sorted_indices = None - unsorted_indices = None - if isinstance(input, IQTensor): - self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits) - self.scale_i.fill_(input.scale_data) - self.running_i.fill_(input.running_data) - if self.bidirectional: - self.scale_i_reverse.fill_(input.scale_data) - self.running_i_reverse.fill_(input.running_data) - assert self.num_layers == 1, 'invalid num_layers, now only support num_layers = 1' - if hx is None: - num_directions = 2 if self.bidirectional else 1 - zeros = torch.zeros(self.num_layers * num_directions, - max_batch_size, self.hidden_size, - dtype=input.dtype, device=input.device) - hx = (zeros, zeros) - else: - # Each batch of the hidden state should match the input sequence that - # the user believes he/she is passing in. 
- hx = self.permute_hidden(hx, sorted_indices) - - self.check_forward_args(input, hx, batch_sizes) - if batch_sizes is not None: - output, hy, cy = self.forward_input_packed( - input, hx, batch_sizes) - else: - output, hy, cy = self.forward_input_tensor(input, hx) - hidden = (hy, cy) - - if isinstance(orig_input, tuple): - output_packed = PackedSequence( - output, batch_sizes, sorted_indices, unsorted_indices) - output, lengths = torch_pad_packed_sequence( - output_packed, self.batch_first) - if self.o_bits is not None: - if self.training: - output = Quant2IQTensor.apply( - output, self.o_bits, self.quant_mode, 'output') - else: - scale_o = ScalerBuffer(self.quant.running_to_scale( - self.running_o, self.o_bits, mode=self.quant_mode)) - if self.bidirectional: - if self.unified: - scale_o_reverse = scale_o - else: - scale_o_reverse = ScalerBuffer(self.quant.running_to_scale( - self.running_o_reverse, self.o_bits, mode=self.quant_mode)) - scale_o = ScalerBuffer( - min(scale_o(), scale_o_reverse())) - output = from_torch_tensor( - output, scale_o(), self.o_bits) - return (output, lengths), self.permute_hidden(hidden, unsorted_indices) - else: - if self.o_bits is not None: - if self.training: - output = Quant2IQTensor.apply( - output, self.o_bits, self.quant_mode, 'output') - else: - scale_o = ScalerBuffer(self.quant.running_to_scale( - self.running_o, self.o_bits, mode=self.quant_mode)) - if self.bidirectional: - if self.unified: - scale_o_reverse = scale_o - else: - scale_o_reverse = ScalerBuffer(self.quant.running_to_scale( - self.running_o_reverse, self.o_bits, mode=self.quant_mode)) - scale_o = ScalerBuffer( - min(scale_o(), scale_o_reverse())) - output = from_torch_tensor( - output, scale_o(), self.o_bits) - return output, self.permute_hidden(hidden, unsorted_indices) - else: - lengths = None - if isinstance(orig_input, tuple): - input, lengths, _, _ = orig_input - else: - input = orig_input - lengths = None - if isinstance(input, IQTensor): - 
self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits) - bias_ih = None - bias_hh = None - bias_ih_reverse = None - bias_hh_reverse = None - weight_ih = self.weight_ih_l0 - weight_hh = self.weight_hh_l0 - weight_ih_reverse = weight_ih - weight_hh_reverse = weight_hh - if self.bias: - bias_ih = self.bias_ih_l0 - bias_hh = self.bias_hh_l0 - bias_ih_reverse = bias_ih - bias_hh_reverse = bias_hh - if self.bidirectional: - weight_ih_reverse = self.weight_ih_l0_reverse - weight_hh_reverse = self.weight_hh_l0_reverse - bias_ih_reverse = self.bias_ih_l0_reverse - bias_hh_reverse = self.bias_hh_l0_reverse - - scale_i = ScalerBuffer(self.scale_i) - scale_iw = ScalerBuffer(self.scale_iw) - scale_io = ScalerBuffer(self.scale_io) - scale_h = ScalerBuffer(self.scale_h) - scale_hw = ScalerBuffer(self.scale_hw) - scale_ho = ScalerBuffer(self.scale_ho) - scale_o = ScalerBuffer(self.scale_o) - scale_i_reverse = None - scale_iw_reverse = None - scale_io_reverse = None - scale_h_reverse = None - scale_hw_reverse = None - scale_ho_reverse = None - scale_o_reverse = None - hidden_state = None - cell_state = None - if self.bidirectional: - scale_i_reverse = ScalerBuffer(self.scale_i_reverse) - scale_iw_reverse = ScalerBuffer(self.scale_iw_reverse) - scale_io_reverse = ScalerBuffer(self.scale_io_reverse) - scale_h_reverse = ScalerBuffer(self.scale_h_reverse) - scale_hw_reverse = ScalerBuffer(self.scale_hw_reverse) - scale_ho_reverse = ScalerBuffer(self.scale_ho_reverse) - scale_o_reverse = ScalerBuffer(self.scale_o_reverse) - if hx is not None: - hidden_state, cell_state = hx - output = None - hy = None - cy = None - if hx is not None: - batch_size = input.size( - 0) if self.batch_first else input.size(1) - seq_len = input.size( - 1) if self.batch_first else input.size(0) - lengths = torch.tensor([seq_len for i in range( - batch_size)], dtype=torch.int64, device=input.device) if lengths is None 
else lengths - output, hy, cy = LSTMSingleONNXFunction.apply(input, lengths, hidden_state, cell_state, weight_ih, weight_hh, bias_ih, bias_hh, - weight_ih_reverse, weight_hh_reverse, bias_ih_reverse, bias_hh_reverse, - self.input_size, self.hidden_size, self.num_layers, self.batch_first, self.dropout, self.bidirectional, - self.data_bits, self.parameter_bits, self.o_bits, - scale_i, scale_iw, scale_io, scale_h, scale_hw, scale_ho, scale_o, - scale_i_reverse, scale_iw_reverse, scale_io_reverse, scale_h_reverse, scale_hw_reverse, scale_ho_reverse, scale_o_reverse, - self.is_not_from_iqtensor) - if self.o_bits is not None: - if self.bidirectional: - scale_o = ScalerBuffer(min(scale_o(), scale_o_reverse())) - output = from_torch_tensor(output, scale_o(), self.o_bits) - if isinstance(orig_input, tuple): - return (output, lengths), (hy, cy) - else: - return output, (hy, cy) - - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) - - def state_dict(self, destination=None, prefix='', keep_vars=False): - if destination is None: - destination = OrderedDict() - destination._metadata = OrderedDict() - destination._metadata[prefix[:-1] - ] = local_metadata = dict(version=self._version) - if is_in_onnx_export(): - assert self.running_i > 0, 'invalid running_x <=0' - scale_i = ScalerBuffer(self.scale_i.data) - if self.is_not_from_iqtensor: - scale_i = ScalerBuffer(self.quant.running_to_scale( - self.running_i, self.data_bits, mode=self.quant_mode)) - self.scale_i.data.fill_(scale_i()) - scale_h = ScalerBuffer(self.quant.running_to_scale( - self.running_h, self.data_bits, mode=self.quant_mode)) - self.scale_h.data.fill_(scale_h()) - if self.o_bits is not None: - scale_o = ScalerBuffer(self.quant.running_to_scale( - self.running_o, self.o_bits, mode=self.quant_mode)) - 
self.scale_o.data.fill_(scale_o()) - - if self.bidirectional: - if self.unified: - self.running_i_reverse.data = self.running_i.data - self.running_h_reverse.data = self.running_h.data - self.scale_i_reverse.data = self.scale_i.data - self.scale_h_reverse.data = self.scale_h.data - scale_i_reverse = scale_i - scale_h_reverse = scale_h - if self.o_bits is not None: - self.running_o_reverse.data = self.running_o.data - self.scale_o_reverse.data = self.scale_o.data - scale_o_reverse = scale_o - else: - scale_i_reverse = ScalerBuffer(self.scale_i_reverse.data) - if self.is_not_from_iqtensor: - scale_i_reverse = ScalerBuffer(self.quant.running_to_scale( - self.running_i_reverse, self.data_bits, mode=self.quant_mode)) - self.scale_i_reverse.data.fill_(scale_i_reverse()) - scale_h_reverse = ScalerBuffer(self.quant.running_to_scale( - self.running_h_reverse, self.data_bits, mode=self.quant_mode)) - self.scale_h_reverse.data.fill_(scale_h_reverse()) - if self.o_bits is not None: - scale_o_reverse = ScalerBuffer(self.quant.running_to_scale( - self.running_o_reverse, self.o_bits, mode=self.quant_mode)) - self.scale_o_reverse.data.fill_(scale_o_reverse()) - - if self.weight_ih_l0.dtype == torch.float32: - clamp_weight_iw = normalize_weight_with_config( - self.weight_ih_l0, self.clamp_weight, False) - clamp_weight_hw = normalize_weight_with_config( - self.weight_hh_l0, self.clamp_weight, False) - self.weight_ih_l0.data = clamp_weight_iw - self.weight_hh_l0.data = clamp_weight_hw - - if is_in_onnx_export(): - scale_iw = ScalerBuffer(self.quant.running_to_scale( - self.running_iw, self.parameter_bits, mode=self.quant_mode)) - self.scale_iw.data.fill_(scale_iw()) - scale_hw = ScalerBuffer(self.quant.running_to_scale( - self.running_hw, self.parameter_bits, mode=self.quant_mode)) - self.scale_hw.data.fill_(scale_hw()) - q_weight_iw, scale_iw, _ = self.quant.quant( - clamp_weight_iw, self.parameter_bits, scale=scale_iw, mode=self.quant_mode, quant_data='weight') - q_weight_hw, 
scale_hw, _ = self.quant.quant( - clamp_weight_hw, self.parameter_bits, scale=scale_hw, mode=self.quant_mode, quant_data='weight') - if self.bias: - clamp_bias_iw = normalize_bias_with_config( - self.bias_ih_l0, self.clamp_bias, False) - clamp_bias_hw = normalize_bias_with_config( - self.bias_hh_l0, self.clamp_bias, False) - self.bias_ih_l0.data = clamp_bias_iw - self.bias_hh_l0.data = clamp_bias_hw - if is_in_onnx_export(): - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - q_bias_iw = (clamp_bias_iw * scale_i * - scale_iw + 0.5).floor() - q_bias_hw = (clamp_bias_hw * scale_h * - scale_hw + 0.5).floor() - if self.data_bits + self.parameter_bits <= 16: - q_bias_iw = q_bias_iw.float().int() - q_bias_hw = q_bias_hw.float().int() - else: - assert False, "linger only support luna quant." - if self.bidirectional: - clamp_weight_iw_reverse = normalize_weight_with_config( - self.weight_ih_l0_reverse, self.clamp_weight, False) - clamp_weight_hw_reverse = normalize_weight_with_config( - self.weight_hh_l0_reverse, self.clamp_weight, False) - self.weight_ih_l0_reverse.data = clamp_weight_iw_reverse - self.weight_hh_l0_reverse.data = clamp_weight_hw_reverse - if is_in_onnx_export(): - if self.unified: - q_weight_iw_reverse, scale_iw_reverse, _ = self.quant.quant( - clamp_weight_iw_reverse, self.parameter_bits, scale=scale_iw, mode=self.quant_mode, quant_data='weight') - q_weight_hw_reverse, scale_hw_reverse, _ = self.quant.quant( - clamp_weight_hw_reverse, self.parameter_bits, scale=scale_hw, mode=self.quant_mode, quant_data='weight') - else: - scale_iw_reverse = ScalerBuffer(self.quant.running_to_scale( - self.running_iw_reverse, self.parameter_bits, mode=self.quant_mode)) - scale_hw_reverse = ScalerBuffer(self.quant.running_to_scale( - self.running_hw_reverse, self.parameter_bits, mode=self.quant_mode)) - q_weight_iw_reverse, scale_iw_reverse, _ = self.quant.quant( - clamp_weight_iw_reverse, self.parameter_bits, scale=scale_iw_reverse(), 
mode=self.quant_mode, quant_data='weight') - q_weight_hw_reverse, scale_hw_reverse, _ = self.quant.quant( - clamp_weight_hw_reverse, self.parameter_bits, scale=scale_hw_reverse(), mode=self.quant_mode, quant_data='weight') - self.scale_iw_reverse.data.fill_(scale_iw_reverse()) - self.scale_hw_reverse.data.fill_(scale_hw_reverse()) - if self.bias: - clamp_bias_iw_reverse = normalize_bias_with_config( - self.bias_ih_l0_reverse, self.clamp_bias, False) - clamp_bias_hw_reverse = normalize_bias_with_config( - self.bias_hh_l0_reverse, self.clamp_bias, False) - self.bias_ih_l0_reverse.data = clamp_bias_iw_reverse - self.bias_hh_l0_reverse.data = clamp_bias_hw_reverse - if is_in_onnx_export(): - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - q_bias_iw_reverse = ( - clamp_bias_iw_reverse * scale_i_reverse * scale_iw_reverse + 0.5).floor() - q_bias_hw_reverse = ( - clamp_bias_hw_reverse * scale_h_reverse * scale_hw_reverse + 0.5).floor() - if self.data_bits + self.parameter_bits <= 16: - q_bias_iw_reverse = q_bias_iw_reverse.float().int() - q_bias_hw_reverse = q_bias_hw_reverse.float().int() - else: - assert False, "linger only support luna quant." 
- if is_in_onnx_export(): - if self.parameter_bits <= 8: - self.weight_ih_l0.data = q_weight_iw.char() - self.weight_hh_l0.data = q_weight_hw.char() - if self.bidirectional: - self.weight_ih_l0_reverse.data = q_weight_iw_reverse.char() - self.weight_hh_l0_reverse.data = q_weight_hw_reverse.char() - elif self.parameter_bits <= 16: - self.weight_ih_l0.data = q_weight_iw.short() - self.weight_hh_l0.data = q_weight_hw.short() - if self.bidirectional: - self.weight_ih_l0_reverse.data = q_weight_iw_reverse.short() - self.weight_hh_l0_reverse.data = q_weight_hw_reverse.short() - else: - self.weight_ih_l0.data = q_weight_iw.int() - self.weight_hh_l0.data = q_weight_hw.int() - if self.bidirectional: - self.weight_ih_l0_reverse.data = q_weight_iw_reverse.int() - self.weight_hh_l0_reverse.data = q_weight_hw_reverse.int() - if self.bias: - self.bias_ih_l0.data = q_bias_iw.int() - self.bias_hh_l0.data = q_bias_hw.int() - if self.bidirectional: - self.bias_ih_l0_reverse.data = q_bias_iw_reverse.int() - self.bias_hh_l0_reverse.data = q_bias_hw_reverse.int() - self._save_to_state_dict(destination, prefix, keep_vars) - for name, self in self._modules.items(): - if self is not None: - self.state_dict(destination, prefix + name + - '.', keep_vars=keep_vars) - for hook in self._state_dict_hooks.values(): - hook_result = hook(self, destination, prefix, local_metadata) - if hook_result is not None: - destination = hook_result - return destination diff --git a/linger/ops/module_self.py b/linger/ops/module_self.py deleted file mode 100644 index 1d3dab0..0000000 --- a/linger/ops/module_self.py +++ /dev/null @@ -1,27 +0,0 @@ -from ..ops.ops_names import (LINGER_FUNCINT_BMM_COUNTER, - LINGER_IQTENSOR_LAYER_COUNTER) - -self_module = [] - - -def get_current_module(): - if len(self_module) > 0: - return self_module[-1] - else: - return None - - -def hook_pre_forward(module, input): - setattr(module, LINGER_IQTENSOR_LAYER_COUNTER, 0) - setattr(module, LINGER_FUNCINT_BMM_COUNTER, 0) - 
self_module.append(module) - - -def hook_forward(module, input, output): - cur = self_module.pop() - assert cur == module - setattr(module, LINGER_IQTENSOR_LAYER_COUNTER, 0) - setattr(module, LINGER_FUNCINT_BMM_COUNTER, 0) - - -__all__ = ['get_current_module', 'hook_pre_forward', 'hook_forward'] diff --git a/linger/ops/ops.py b/linger/ops/ops.py deleted file mode 100644 index b68f89d..0000000 --- a/linger/ops/ops.py +++ /dev/null @@ -1,177 +0,0 @@ -import itertools -from collections import OrderedDict - -import torch -import torch.onnx -from torch.onnx import is_in_onnx_export - -from ..config import config -from ..quant import (Quant, normalize_bias_with_config, - normalize_weight_with_config) -from ..utils import PlatFormQuant, QuantMode, ScalerBuffer - - -class ModuleIntConfig(): - def __init__(self, data_bits=8, parameter_bits=8, mode=QuantMode.QValue, o_bits=None): - self.data_bits = data_bits - self.parameter_bits = parameter_bits - self.quant_mode = mode - self.o_bits = o_bits - self.quant = Quant() - - @staticmethod - def state_dict_global(module, destination=None, prefix='', keep_vars=False): - if destination is None: - destination = OrderedDict() - destination._metadata = OrderedDict() - destination._metadata[prefix[:-1] - ] = local_metadata = dict(version=module._version) - if is_in_onnx_export(): - assert module._buffers['running_x'] > 0, 'invalid running_x <= 0, cannot access param before training, layer prefix is: {}'.format( - prefix) - scale_x = ScalerBuffer(module._buffers['scale_x']) - if module.is_not_from_iqtensor: - scale_x = ScalerBuffer(module.quant.running_to_scale(ScalerBuffer( - module._buffers['running_x']), module.data_bits, mode=module.quant_mode)) - module._buffers['scale_x'].data.fill_(scale_x()) - if module.o_bits is not None: - scale_o = ScalerBuffer(module.quant.running_to_scale(ScalerBuffer( - module._buffers['running_o']), module.o_bits, mode=module.quant_mode)) - module._buffers['scale_o'].data.fill_(scale_o()) - - if 'scale_w' 
in module._buffers and module._parameters['weight'].dtype == torch.float: - weight_tensor = module._parameters['weight'] - weight_tensor_clamp = None - bias_tensor_clamp = None - if hasattr(module, 'clamp_weight'): - weight_tensor_clamp = normalize_weight_with_config( - weight_tensor, module.clamp_weight, False) - else: - weight_tensor_clamp = weight_tensor - weight_tensor.data = weight_tensor_clamp - if module.bias is not None: - bias_tensor = module._parameters['bias'] - # bias_temp = None - if hasattr(module, 'clamp_bias'): - bias_tensor_clamp = normalize_bias_with_config( - bias_tensor, module.clamp_bias, False) - else: - bias_tensor_clamp = bias_tensor - bias_tensor.data = bias_tensor_clamp - if is_in_onnx_export(): - weight_temp, scale_w, _ = module.quant.quant( - weight_tensor_clamp, module.parameter_bits, mode=module.quant_mode) - scale_w = ScalerBuffer(scale_w) - module._buffers['scale_w'].data.fill_(scale_w()) - - if module.parameter_bits <= 8: - weight_tensor.data = weight_temp.char() - weight_tensor.char() - elif module.parameter_bits <= 16: - weight_tensor.data = weight_temp.short() - weight_tensor.short() - else: - weight_tensor.data = weight_temp.int() - weight_tensor.int() - if module.bias is not None: - bias_tensor_clamp = module._parameters['bias'] - if config.PlatFormQuant.platform_quant in (PlatFormQuant.luna_quant,): - assert module.quant_mode == QuantMode.QValue, 'luna_quant only support Qvalue and o_bits=None' - if module.data_bits + module.parameter_bits <= 16: - module._parameters['bias'].data = ( - bias_tensor_clamp * scale_w * scale_x + 0.5).floor().float().int() - else: - module._parameters['bias'].data = ( - bias_tensor_clamp * scale_w * scale_x + 0.5).floor().int() - module._parameters['bias'].int() - else: - assert False, "linger only support luna quant." 
- module._save_to_state_dict(destination, prefix, keep_vars) - for name, module in module._modules.items(): - if module is not None: - module.state_dict(destination, prefix + - name + '.', keep_vars=keep_vars) - for hook in module._state_dict_hooks.values(): - hook_result = hook(module, destination, prefix, local_metadata) - if hook_result is not None: - destination = hook_result - return destination - - @staticmethod - def _load_from_state_dict_global(module, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - allow_missing_keys = ['running_w', 'running_x', 'running_y', 'running_o', 'running_i', 'running_iw', 'running_h', 'running_hw', 'running_io', 'running_ho', - 'running_i_reverse', 'running_iw_reverse', 'running_h_reverse', 'running_hw_reverse', 'running_io_reverse', 'running_ho_reverse', - 'scale_w', 'scale_x', 'scale_y', 'scale_o', 'scale_i', 'scale_h', 'scale_iw', 'scale_hw', 'sale_io', 'scale_ho', 'scale_i_reverse', 'scale_h_reverse', - 'scale_iw_reverse', 'scale_hw_reverse', 'sale_io_reverse', 'scale_ho_reverse', 'running_c', 'scale_c', 'scale_cw', 'min_thresh', 'max_thresh', - "running_co", "scale_io", "scale_co", "sigmoid_table", "tanh_table", 'scale_o_reverse', 'scale_io_reverse', 'running_o_reverse', - "running_q", "running_k", "running_v", "running_attn", "scale_q", "scale_k", "scale_v", "scale_attn", "running_pos", "scale_pos"] - local_missing_keys = [] - ModuleIntConfig._load_from_state_dict_global_(module, state_dict, prefix, local_metadata, strict, - local_missing_keys, unexpected_keys, error_msgs) - matched = True - fake_missing_keys = [] - for k_local in local_missing_keys: - if k_local.replace(prefix, '', 1) not in allow_missing_keys: - matched = False - fake_missing_keys.append(k_local) - if matched: - local_missing_keys = [] - else: - local_missing_keys = fake_missing_keys - missing_keys += local_missing_keys - - @staticmethod - def _load_from_state_dict_global_(module, state_dict, prefix, 
local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - for hook in module._load_state_dict_pre_hooks.values(): - hook(state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs) - local_name_params = itertools.chain( - module._parameters.items(), module._buffers.items()) - local_state = {k: v.data for k, - v in local_name_params if v is not None} - for name, param in local_state.items(): - key = prefix + name - if key in state_dict: - input_param = state_dict[key] - - # Backward compatibility: loading 1-dim tensor from 0.3.* to version 0.4+ - if len(param.shape) == 0 and len(input_param.shape) == 1: - input_param = input_param[0] - - if input_param.shape != param.shape: - # local shape should match the one in checkpoint - error_msgs.append('size mismatch for {}: copying a param with shape {} from checkpoint, ' - 'the shape in current model is {}.' - .format(key, input_param.shape, param.shape)) - continue - - if isinstance(input_param, torch.nn.Parameter): - input_param = input_param.data - try: - param.copy_(input_param) - if input_param.dtype == torch.int32: - module._parameters[name] = param.int() - elif input_param.dtype == torch.int16: - module._parameters[name] = param.short() - elif input_param.dtype == torch.int8: - module._parameters[name] = param.char() - - except Exception: - error_msgs.append('While copying the parameter named "{}", ' - 'whose dimensions in the model are {} and ' - 'whose dimensions in the checkpoint are {}.' 
- .format(key, param.size(), input_param.size())) - elif strict: - missing_keys.append(key) - if strict: - for key in state_dict.keys(): - if key.startswith(prefix): - input_name = key[len(prefix):] - input_name = input_name.split('.', 1)[0] - if input_name not in module._modules and input_name not in local_state: - unexpected_keys.append(key) - - -__all__ = ['ModuleIntConfig'] diff --git a/linger/ops/ops_configs.py b/linger/ops/ops_configs.py deleted file mode 100644 index 1d83043..0000000 --- a/linger/ops/ops_configs.py +++ /dev/null @@ -1,26 +0,0 @@ -import torch.nn as nn - -from ..modules import * -from .avgpool2d_int import AvgPool2dInt -from .batchnorm_int import BatchNormInt -from .bmm_int import BmmInt -from .conv1d_int import Conv1dInt -from .conv_int import Conv2dInt -from .convtranspose_int import ConvTranspose2dInt -from .embedding_int import EmbeddingInt -from .gru_int import GRUInt -from .iqtensor import iqAddLayer, iqDivLayer, iqMulLayer, iqSumLayer -from .linear_int import LinearInt -from .linger_functional import (iqCatLayer, iqClampLayer, iqSigmoidLayer, - iqTanhLayer, softmaxInt, logsoftmaxInt) -from .lstm_int import LSTMInt -from .relu6_int import ReLU6Int -from .layernorm_int import LayerNormInt - -DefaultQuantIntXOP = (nn.BatchNorm2d, nn.Linear, nn.Conv2d, nn.ConvTranspose2d, nn.AvgPool2d, nn.Conv1d, NormalizeConvBN1d, NormalizeConvBN2d, NormalizeConvTransposeBN2d, - NormalizeConv2d, NormalizeConv1d, NormalizeConvTranspose2d, NormalizeLinear, NormalizeFastLSTM, NormalizeFastGRU,nn.ReLU6, nn.GRU, nn.LSTM) - -SupportQuantTorchModules = [nn.BatchNorm2d, nn.Linear, nn.Conv2d, nn.ConvTranspose2d, nn.GRU, nn.LSTM, nn.AvgPool2d, nn.Conv1d, NormalizeConvBN1d, NormalizeConvBN2d, NormalizeConvTransposeBN2d, NormalizeConv2d, - NormalizeConv1d, NormalizeConvTranspose2d, NormalizeLinear, nn.ReLU6, NormalizeFastLSTM, NormalizeFastGRU, NormalizeBatchNorm2d, NormalizeLayerNorm, nn.Embedding, nn.Upsample, NormalizeEmbedding, nn.LayerNorm] 
-SupportQuantedIntModules = (BatchNormInt, LinearInt, Conv2dInt, ConvTranspose2dInt, GRUInt, LSTMInt, AvgPool2dInt, Conv1dInt, ReLU6Int, BmmInt, - iqAddLayer, iqMulLayer, iqDivLayer, iqSumLayer, iqCatLayer, iqSigmoidLayer, iqTanhLayer, iqClampLayer, EmbeddingInt, LayerNormInt, softmaxInt, logsoftmaxInt) diff --git a/linger/ops/ops_names.py b/linger/ops/ops_names.py deleted file mode 100644 index 2e420c3..0000000 --- a/linger/ops/ops_names.py +++ /dev/null @@ -1,23 +0,0 @@ -LINGER = '_linger' - -LINGER_MIX_INT8_MANUAL_ROUND_LAYERS = LINGER+'_round_tensor' - -LINGER_DUMP_NAME = LINGER+"_dump_name" - -LINGER_IGNORE_PAMAMTER = LINGER+"_ignore_parameter" - -LINGER_IQTENSOR_LAYER_COUNTER = LINGER+"_iq_tensor_index_" - -LINGER_FUNCINT_BMM_COUNTER = LINGER+"_funcint_bmm_index_" - -LINGER_MODE = LINGER+"_mode" - -LINGER_OBIT = LINGER+"_obit" - -LINGER_AHEAD_RELU = LINGER+"_ahead_relu" - -LINGER_AHEAD_SIGMOID = LINGER+"_ahead_sigmoid" - - -__all__ = ['LINGER_MIX_INT8_MANUAL_ROUND_LAYERS', 'LINGER_DUMP_NAME', 'LINGER_IGNORE_PAMAMTER', 'LINGER_IQTENSOR_LAYER_COUNTER', - 'LINGER_FUNCINT_BMM_COUNTER', 'LINGER_OBIT', 'LINGER_MODE', 'LINGER_AHEAD_RELU', 'LINGER_AHEAD_SIGMOID'] diff --git a/linger/ops/relu6_int.py b/linger/ops/relu6_int.py deleted file mode 100644 index b96462e..0000000 --- a/linger/ops/relu6_int.py +++ /dev/null @@ -1,157 +0,0 @@ -import math - -import torch -import torch.nn as nn -import torch.nn.functional as F -from ..config import config -from ..ops.ops import ModuleIntConfig -from ..utils import Dump, QuantMode, ScalerBuffer -from .iqtensor import (IQTensor, from_torch_tensor, platform_to_string, - quantlinear) -from .ops import ModuleIntConfig -from .requant import Requant -from ..quant import Quant - - -class ReLU6IntFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, min_thresh, max_thresh, data_bits, training, momentum, running_x, eval_scale_x, mode, quant, is_not_from_iqtensor, prefix, dump, path): - - scale_x = None - if 
training: - ctx.save_for_backward(input) - zero_point = input.zero_point if isinstance(input, IQTensor) else 0 - is_iq_tensor = True if isinstance(input, IQTensor) else False - ctx.value = zero_point, is_iq_tensor - if isinstance(input, IQTensor): - q_input, _, max_value_x = quant.quant(input, data_bits, eval_scale_x, mode=QuantMode.QValue, quant_data='input', - iq_zero_point=input.zero_point) - scale_x = eval_scale_x - else: - q_input, scale_x, max_value_x = quant.quant( - input, data_bits, mode=mode, quant_data='input') - scale_x = scale_x - running_x.mul_(1-momentum).add_(momentum*max_value_x) - bound_value = math.pow(2, data_bits-1)-1 - max_thresh_int = round(6 * scale_x) - max_thresh_int = bound_value if max_thresh_int > bound_value else max_thresh_int - q_outputs = q_input.clamp(0, max_thresh_int) - outputs = quant.dequant(q_outputs, scale_x) - max_thresh.fill_(max_thresh_int) - ctx.scale = scale_x, data_bits - else: - assert running_x > 0, 'invalid running_x <= 0, please fintune first' - scale_x = None - scale_x = ScalerBuffer(quant.running_to_scale( - running_x, data_bits, mode=mode)) - if isinstance(input, IQTensor): - scale_x = eval_scale_x - q_input, _, _ = quant.quant( - input, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input', iq_zero_point=input.zero_point) - else: - q_input, _, _ = quant.quant( - input, data_bits, scale_x, mode=mode, quant_data='input') - bound_value = math.pow(2, data_bits-1)-1 - max_thresh_int = round(6 * scale_x) - max_thresh_int = bound_value if max_thresh_int > bound_value else max_thresh_int - q_outputs = q_input.clamp(0, max_thresh_int) - outputs = quant.dequant(q_outputs, scale_x) - max_thresh.fill_(max_thresh_int) - if dump: - name_list = ["input", "outputs", "q_input", - "q_outputs", "scale_x", "running_x"] - attr_list = [input, outputs, q_input, - q_outputs, scale_x.data, running_x.data] - Dump.dump_file(prefix, ".ReLU6Int.", zip( - name_list, attr_list), path) - - if isinstance(scale_x, float): - return 
from_torch_tensor(outputs, scale_x, data_bits) - elif isinstance(scale_x, torch.Tensor): - return from_torch_tensor(outputs, scale_x.item(), data_bits) - else: - return from_torch_tensor(outputs, scale_x.data, data_bits) - - @staticmethod - def backward(ctx, gradoutput): - input, = ctx.saved_tensors - scale_x, data_bits = ctx.scale - zero_point, is_iq_tensor = ctx.value - if is_iq_tensor: - f_input = input.data - else: - q_input, _, _ = Quant.quant( - input.data, data_bits, scale_x, mode=QuantMode.QValue, quant_data='input') - f_input = Quant.dequant(q_input, scale_x) - f_input = f_input.detach().clone().requires_grad_(True) - gradInput = None - with torch.enable_grad(): - y = F.hardtanh(f_input, 0, 6, False) - gradInput, = torch.autograd.grad(y, f_input, gradoutput) - return gradInput, None, None, None, None, None, None, None, None, None, None, None, None, None - - @staticmethod - def symbolic(g, input, min_thresh, max_thresh, data_bits, training, momentum, running_x, scale_x, mode, quant, is_not_from_iqtensor, prefix, dump, path): - op_inner = None - if is_not_from_iqtensor: - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - op_inner = quantlinear(g, input, scale_x(), - platform_quant, data_bits) - if is_not_from_iqtensor: - input_list = [op_inner, min_thresh, max_thresh] - else: - input_list = [input, min_thresh, max_thresh] - return g.op("Clip", *input_list) - - -class ReLU6Int(nn.ReLU6, ModuleIntConfig): - r""" - 实现relu6的clamp按照定点形式计算,而非浮点6 - - Args: - data_bits(int): 输入量化位数 - mode(Enum):量化计算方式,支持MaxValue, QValue - o_bits(int):输出量化位数 - scale_x(np.float32): 输出量化scale,与输入scale_x一致 - """ - - def __init__(self, data_bits=8, mode=QuantMode.QValue): - nn.ReLU6.__init__(self) - ModuleIntConfig.__init__( - self, data_bits=data_bits, mode=mode, o_bits=None) - self.momentum = 0.1 - self.is_not_from_iqtensor = True - self.prefix = "" - self.dump = False - self.path = "" - self.register_buffer('running_x', torch.zeros(1)) - 
self.register_buffer('scale_x', torch.zeros(1)) - self.register_buffer('min_thresh', torch.tensor(0, dtype=torch.int8)) - self.register_buffer('max_thresh', torch.tensor(0, dtype=torch.int8)) - - def forward(self, input): - scale_x = ScalerBuffer(self.scale_x) - running_x = ScalerBuffer(self.running_x) - assert self.data_bits == 8, 'relu6int only support 8bit' - if isinstance(input, IQTensor): - self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits) - scale_x = ScalerBuffer(input.scale_data) - running_x = ScalerBuffer(input.running_data) - output = ReLU6IntFunction.apply(input.contiguous(), self.min_thresh, self.max_thresh, self.data_bits, self.training, self.momentum, - running_x, scale_x, self.quant_mode, self.quant, self.is_not_from_iqtensor, self.prefix, self.dump, self.path) - self.running_x.fill_(running_x()) - self.scale_x.fill_(scale_x()) - - return output - - def state_dict(self, destination=None, prefix='', keep_vars=False): - return ModuleIntConfig.state_dict_global(self, destination, prefix, keep_vars) - - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) diff --git a/linger/ops/requant.py b/linger/ops/requant.py deleted file mode 100644 index 553bec2..0000000 --- a/linger/ops/requant.py +++ /dev/null @@ -1,40 +0,0 @@ -import math - -import torch - -from ..utils import QuantMode -from .iqtensor import IQTensor, from_torch_tensor - - -class Requant(torch.autograd.Function): - @staticmethod - def forward(ctx, input, bits_src, scale_src, bits_dst, mode=QuantMode.QValue): - s_rescale = (math.pow(2, bits_dst-1) - 1.0) / \ - (math.pow(2, bits_src-1) - 1.0) - if mode == QuantMode.QValue: - s_rescale = math.pow(2, round(math.log(s_rescale, 2))) - scale_bst = 
s_rescale*scale_src - zero_point = 0 - if isinstance(input, IQTensor): - zero_point = input.zero_point - if zero_point != 0: - zero_point = math.pow(2, bits_dst-1) - s = from_torch_tensor(input, scale_bst, bits_dst, zero_point=zero_point) - s.requant_() - return s - - @staticmethod - def backward(ctx, gradOutput): - return gradOutput, None, None, None, None - - @staticmethod - def symbolic(g, input, bits_src, scale_src, bits_dst, mode=QuantMode.QValue): - s_rescale = (math.pow(2, bits_dst-1) - 1.0) / \ - (math.pow(2, bits_src-1) - 1.0) - if mode == QuantMode.QValue: - s_rescale = math.pow(2, round(math.log(s_rescale, 2))) - scale_bst = s_rescale*scale_src - return g.op("thinker::Requant", input, data_bits_i=bits_src, scale_x_f=scale_src, scale_o_f=scale_bst, o_bits_i=bits_dst) - - -__all__ = ['Requant'] diff --git a/linger/ops/scaledround_int.py b/linger/ops/scaledround_int.py deleted file mode 100644 index 89500f1..0000000 --- a/linger/ops/scaledround_int.py +++ /dev/null @@ -1,120 +0,0 @@ -from collections import OrderedDict - -import torch -import torch.onnx -from torch.onnx import is_in_onnx_export - -from ..config import config -from ..quant import Quant -from ..utils import * -from ..utils import Dump -from .iqtensor import from_torch_tensor, platform_to_string, quantlinear -from .ops import ModuleIntConfig - - -class ScaledRoundLayerFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, momentum, running_x, eval_scale_x, data_bits, training, prefix, dump, path, mode, quant, zero_point): - if training: - q_input, scale_x, max_value_x = quant.quant( - input, data_bits, mode=mode, quant_data='input', iq_zero_point=zero_point) - running_x.mul_(1-momentum).add_(momentum*max_value_x) - scale_x = ScalerBuffer(scale_x) - outputs = quant.dequant(q_input, scale_x) - else: - assert running_x != 0, 'invalid running_x=0, please finetune training before eval' - scale_x = ScalerBuffer(quant.running_to_scale( - running_x, data_bits, mode=mode, 
zero_point=zero_point)) - q_input, _, _ = quant.quant( - input, data_bits, scale_x, mode=mode, quant_data='input', iq_zero_point=zero_point) - outputs = quant.dequant(q_input, scale_x) - - if dump: - name_list = ['input', 'q_input', - 'outputs', 'scale_x', 'running_x'] - attr_list = [input, q_input, outputs, scale_x(), running_x()] - Dump.dump_file(prefix, "ScaledRoundLayer.", - zip(name_list, attr_list), path) - eval_scale_x.fill_(scale_x()) - if isinstance(scale_x, float): - return from_torch_tensor(outputs, scale_x, data_bits, zero_point) - elif isinstance(scale_x, torch.Tensor): - return from_torch_tensor(outputs, scale_x.item(), data_bits, zero_point) - else: - return from_torch_tensor(outputs, scale_x.data, data_bits, zero_point) - - @staticmethod - def backward(ctx, grad_output): - return (grad_output, None, None, None, None, None, None, None, None, None, None, None) - - @staticmethod - def symbolic(g, input, momentum, running_x, scale_x, data_bits, training, prefix, dump, path, mode, quant, zero_point): - platform_quant = platform_to_string( - config.PlatFormQuant.platform_quant) - return quantlinear(g, input, scale_x(), platform_quant, data_bits) - - -class ScaledRoundLayer(torch.nn.Module): - r""" - 将浮点的tensor依据统计的scale转为定点数值 - - Args: - bits: 输出量化的位数 - mode:量化方式,支持MaxValue和Qvalue - """ - - def __init__(self, bits=8, mode=QuantMode.QValue, zero_point=0): - super(ScaledRoundLayer, self).__init__() - - self.prefix = "" - self.dump = False - self.path = "" - self.data_bits = bits - self.quant_mode = mode - self.momentum = 0.1 - self.register_buffer('scale_x', torch.zeros(1)) - self.register_buffer('running_x', torch.zeros(1)) - self.quant = Quant() - self.zero_point = zero_point - - def forward(self, x, *others): - assert self.data_bits == 8, 'quant_tensor only support 8bit' - scale_x = ScalerBuffer(self.scale_x) - x = ScaledRoundLayerFunction.apply(x, self.momentum, self.running_x, scale_x, self.data_bits, self.training, - self.prefix, self.dump, self.path, 
self.quant_mode, self.quant, self.zero_point) - self.scale_x.fill_(scale_x.data) - if len(others) == 0: - return x - else: - return tuple([x])+others - - def state_dict(self, destination=None, prefix='', keep_vars=False): - if destination is None: - destination = OrderedDict() - destination._metadata = OrderedDict() - destination._metadata[prefix[:-1] - ] = local_metadata = dict(version=self._version) - if is_in_onnx_export(): - assert self._buffers['running_x'] > 0, 'invalid running_x, please finetune first' - scale_x = ScalerBuffer(self.quant.running_to_scale(ScalerBuffer( - self._buffers['running_x']), self.data_bits, mode=self.quant_mode, zero_point=self.zero_point)) - self._buffers['scale_x'].data.fill_(scale_x()) - - self._save_to_state_dict(destination, prefix, keep_vars) - for name, module in self._modules.items(): - if module is not None: - module.state_dict(destination, prefix + - name + '.', keep_vars=keep_vars) - for hook in self._state_dict_hooks.values(): - hook_result = hook(self, destination, prefix, local_metadata) - if hook_result is not None: - destination = hook_result - return destination - - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) - - -__all__ = ['ScaledRoundLayer'] diff --git a/linger/ops/shuffle_channel.py b/linger/ops/shuffle_channel.py deleted file mode 100644 index 6fed489..0000000 --- a/linger/ops/shuffle_channel.py +++ /dev/null @@ -1,90 +0,0 @@ -import math - -import torch -import torch.nn as nn -import torch.nn.functional as F - -from ..config import config -from ..quant import (Quant, normalize_bias_with_config, - normalize_data_with_config, normalize_weight_with_config) -from ..utils import Dump, PlatFormQuant, QuantMode, ScalerBuffer -from .iqtensor import (IQTensor, from_torch_tensor, platform_to_string, - 
quantlinear) -from .ops import ModuleIntConfig -from .requant import Requant - -from ..modules.normalize_shuffleChannel import NormalizeShuffleChannel - - -class ShuffleChannelFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, input, groups, - data_bits, o_bits, - running_x, running_o, scale_x, scale_o, - training, clamp_data): - if training: - ctx.clamp_data = clamp_data - zero_point = input.zero_point if isinstance(input, IQTensor) else 0 - is_iq_tensor = True if isinstance(input, IQTensor) else False - ctx.value = zero_point, is_iq_tensor - ctx.bits = data_bits, o_bits - saved_tensors = [input] - - - @staticmethod - def backward(ctx, gradOutput): - pass - - @staticmethod - def symbolic(g): - pass - -class ShuffleChannelInt(NormalizeShuffleChannel): - def __init__(self, groups: int, data_bits=8, o_bits=None, clamp_data=None): - NormalizeShuffleChannel.__init__(self, groups) - self.groups = groups - self.clamp_data = clamp_data - self.register_buffer('running_x', torch.zeros(1)) - self.register_buffer('running_o', torch.zeros(1)) - self.register_buffer('scale_x', torch.zeros(1)) - self.register_buffer('scale_o', torch.zeros(1)) - - def forward(self, input: torch.Tensor) -> torch.Tensor: - scale_x = ScalerBuffer(self.scale_x) - running_x = ScalerBuffer(self.running_x) - if isinstance(input, IQTensor): - self.is_not_from_iqtensor = False - if input.bits != self.data_bits: - input = Requant.apply( - input, input.bits, input.scale_data, self.data_bits, self.mode) - scale_x = ScalerBuffer(input.scale_data) - running_x = ScalerBuffer(input.running_data) - running_o = ScalerBuffer(self.running_o) - scale_o = ScalerBuffer(self.scale_o) - - ret = ShuffleChannelFunction.apply(input, self.groups, - self.data_bits, self.o_bits, - running_x, running_o, scale_x, scale_o, - self.training, self.clamp_data) - - self.running_x.fill_(running_x()) - self.running_o.fill_(running_o()) - self.scale_x.fill_(scale_x()) - self.scale_o.fill_(scale_o()) - return ret - - 
def state_dict(self, destination=None, prefix='', keep_vars=False): - return ModuleIntConfig.state_dict_global(self, destination, prefix, keep_vars) - - def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, - missing_keys, unexpected_keys, error_msgs): - ModuleIntConfig._load_from_state_dict_global( - self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs) - - def extra_repr(self): - s = nn.Linear.extra_repr(self) - extra_s = ',clamp_data:{clamp_data},clamp_bias:{clamp_bias},ahead_relu:{ahead_relu},ahead_sigmoid:{ahead_sigmoid}'.format( - **self.__dict__) - extra_s += ',data_bits:{data_bits},o_bits:{o_bits}'.format( - **self.__dict__) - return s+extra_s diff --git a/linger/quant.py b/linger/quant.py deleted file mode 100644 index cdf3e3d..0000000 --- a/linger/quant.py +++ /dev/null @@ -1,144 +0,0 @@ -import math - -import lingerext -import numpy as np -import torch -import torch.nn.functional as F -import torch.onnx.symbolic_helper as sym_help - -from linger.config import config -from linger.utils import PlatFormQuant, QuantMode, ScalerBuffer -epsilon = 1e-10 - - -class Quant(): - @staticmethod - def quant(x, bits=8, scale=-1, mode=QuantMode.QValue, *, quant_data='weight', ahead_relu=False, iq_zero_point=0): - bound_value = None - scale_local = None - max_abs = None - assert quant_data == 'input' or quant_data == 'output' or quant_data == 'weight' or quant_data == 'hidden' or quant_data == 'conv_output', 'invalid quant_data, please confirm' - # assert drop_rate >= 0.0, 'invalid drop_rate param' - zero_point = iq_zero_point - if hasattr(x, 'zero_point'): - zero_point = x.zero_point - y = x.detach().clone() - if (ahead_relu and quant_data == 'output') or (ahead_relu and quant_data == 'conv_output'): - x = F.relu(x) - if mode == QuantMode.QValue: - max_abs = 0 - bound_value = math.pow(2, bits-1)-1 - if scale > 0: - scale_local = scale - max_abs = (bound_value+zero_point) / scale_local - else: - min_x = 
torch.min(x) - max_x = torch.max(x) - if min_x == max_x == 0: - scale_local = math.pow(2, bits) - else: - max_abs = torch.max(-min_x, max_x) - max_value = math.floor(math.log((bound_value+zero_point) / max_abs, 2)) - scale_local = math.pow(2, max_value) - max_abs = (bound_value+zero_point) / scale_local - else: - print('Error Quant Mode!!!') - - scale_local = ScalerBuffer(scale_local) - x = y * scale_local - - #not iqtensor or test - if config.PlatFormQuant.platform_quant == PlatFormQuant.luna_quant: - x_quant = (x + 0.5).floor() - else: - assert False, "linger only support luna quant." - x_quant.clamp_(-bound_value-1+zero_point, bound_value+zero_point) - # drop_rate means quant percent, if drop_rate=1.0, all quant - - x = x_quant.float() - return x, scale_local, max_abs - - @staticmethod - def dequant(x, scale): - scale_tensor = None - if isinstance(scale, (float, np.float32)): - scale_tensor = torch.tensor( - scale, dtype=torch.float32, device=x.device) - else: - scale_tensor = torch.tensor( - scale.data, dtype=torch.float32, device=x.device) - return (x/scale_tensor).float() - - @staticmethod - def running_to_scale(running_data, bits, mode=QuantMode.QValue, zero_point=0): - running_data = ScalerBuffer(running_data) - scale_data = None - bound_value = math.pow(2, bits-1)-1 - if mode == QuantMode.QValue: - max_value = round(math.log( - (bound_value+zero_point)/running_data(), 2)) if running_data() != 0 else 0.0 - scale_data = math.pow(2, max_value) - scale_data = ScalerBuffer(scale_data) - return scale_data - -class ClipGrad(torch.autograd.Function): - @staticmethod - def forward(ctx, data, clip_threshold): - ctx.clip_threshold = clip_threshold - return data - - @staticmethod - def backward(ctx, grad_output): - clip_threshold = ctx.clip_threshold - grad_output = grad_output.clamp(-clip_threshold, clip_threshold) - return grad_output, None - - -class NormalizeFunction(torch.autograd.Function): - @staticmethod - def forward(ctx, data, bound, training=True, 
is_weight=True): - data, temp_loc1, temp_loc2 = lingerext.normalize_function_forward( - data, float(bound), training, is_weight) - ctx.save_for_backward(temp_loc1, temp_loc2) - return data - - @staticmethod - def backward(ctx, grad_output): - temp_loc1, temp_loc2 = ctx.saved_tensors - grad_output = lingerext.normalize_function_backward( - grad_output, temp_loc1, temp_loc2) - - return grad_output, None, None, None - - @staticmethod - def symbolic(g, input, bound, training, is_weight=True): - dtype = input.type().scalarType() - if dtype is None: - dtype = 6 # float - else: - dtype = sym_help.scalar_type_to_onnx.index( - sym_help.cast_pytorch_to_onnx[dtype]) - min_val = g.op("Constant", value_t=torch.tensor(-bound, - dtype=sym_help.scalar_type_to_pytorch_type[dtype])) - max_val = g.op("Constant", value_t=torch.tensor( - bound, dtype=sym_help.scalar_type_to_pytorch_type[dtype])) - return g.op("Clip", *[input, min_val, max_val]) - - -def normalize_data_with_config(input, normalize_data): - if normalize_data is not None: - input_clamp_data = normalize_data - input = input.clamp(-input_clamp_data, input_clamp_data) - return input - - -def normalize_weight_with_config(weights, normalize_weight, training=False): - if normalize_weight is not None: - weights = NormalizeFunction.apply(weights, normalize_weight, training) - return weights - - -def normalize_bias_with_config(bias, normalize_bias, training=False): - if normalize_bias is not None: - bias = NormalizeFunction.apply(bias, normalize_bias, training) - return bias diff --git a/linger/quant/__init__.py b/linger/quant/__init__.py new file mode 100644 index 0000000..9f8db4a --- /dev/null +++ b/linger/quant/__init__.py @@ -0,0 +1,2 @@ +from .ops import * +from .qtensor import * \ No newline at end of file diff --git a/linger/quant/calibrate_funs.py b/linger/quant/calibrate_funs.py new file mode 100644 index 0000000..3680640 --- /dev/null +++ b/linger/quant/calibrate_funs.py @@ -0,0 +1,88 @@ +import torch +import torch.onnx 
+from torch.onnx import is_in_onnx_export + +_QCALIBRATE_TABLE = {} + + +def register_calibrate_method(module_cls): + """ + 可以从外部调用此功能,注册新的校准方法。 + 函数输入有两个,分别为self和input,self表示input(权重或激活数据)对应的量化器。 + 如果为TQT训练,需要根据input动态的更新self中的learning_data和scale。 + 如果为PTQ量化,仅需更新scale即可。 + """ + + def wrapper(cls): + _QCALIBRATE_TABLE[module_cls] = cls + return cls + + return wrapper + + +def get_calibrate_function(calibrate_name): + return _QCALIBRATE_TABLE.get(calibrate_name, None) + +@register_calibrate_method('abs_max') +def abs_max_init(self, tensor, *args): + with torch.no_grad(): + if self.is_calibrate: + raise ValueError("Quantizer has beem calibrated! ") + self.learning_data.fill_(tensor.abs().max().log2()) + learning_data = self.data_bits - 1 - self.learning_data.squeeze(0) + learning_data = self.quant_round(learning_data, self.round_mode) + scale = 2**learning_data + self.scale = scale.clamp(min=1e-6, max=2**24) + self.is_calibrate.fill_(True) + +@register_calibrate_method('top_10') +def top_10_init(self, tensor, *args): + with torch.no_grad(): + if self.is_calibrate: + raise ValueError("Quantizer has beem calibrated! 
") + if tensor.numel() > 11: + self.learning_data.fill_((torch.topk(tensor.abs().flatten(), 10)[0][-1]).log2()) + else: # 可能有的激活没有10个元素 + self.learning_data.fill_(tensor.abs().max().log2()) + learning_data = self.data_bits - 1 - self.learning_data.squeeze(0) + learning_data = self.quant_round(learning_data, self.round_mode) + scale = 2**learning_data + self.scale.fill_(scale.clamp(min=1e-6, max=2**24)) + self.is_calibrate.fill_(True) + + +def get_best_pow2coef_W(w, bit): + min_int = -(2**(bit-1)) + max_int = -min_int - 1 + def fake_quant(x, scale): + x_int = (x / scale).round() + x_int = torch.clamp(x_int, min_int,max_int) + x_out = x_int * scale + return x_out + scale = [] + + for i in range(-10,10): + scale.append(2**i) + + score = [] + for i in range(len(scale)): + #计算每个scale的得分,用score_temp存储每个input的得分 + q_w = fake_quant(w, scale[i]) + score.append( (w - q_w).norm() ) + + return (torch.tensor(scale[score.index(min(score))])).log2().to("cuda") + bit - 1 + +@register_calibrate_method('w_like') +def w_like_init(self, tensor, *args): + with torch.no_grad(): + if self.is_calibrate: + raise ValueError("Quantizer has beem calibrated! 
") + self.learning_data.fill_(get_best_pow2coef_W(tensor, args[0])) + learning_data = self.data_bits - 1 - self.learning_data.squeeze(0) + learning_data = self.quant_round(learning_data, self.round_mode) + scale = 2**learning_data + self.scale = scale.clamp(min=1e-6, max=2**24) + self.is_calibrate.fill_(True) + + + diff --git a/linger/quant/ops/__init__.py b/linger/quant/ops/__init__.py new file mode 100644 index 0000000..941a284 --- /dev/null +++ b/linger/quant/ops/__init__.py @@ -0,0 +1,3 @@ +from .qconfig import * +from .qmodule import * +from .qtensor import * \ No newline at end of file diff --git a/linger/quant/ops/qconfig.py b/linger/quant/ops/qconfig.py new file mode 100644 index 0000000..ff8c7ee --- /dev/null +++ b/linger/quant/ops/qconfig.py @@ -0,0 +1,116 @@ +import torch +import torch.onnx +from typing import Callable, List +from functools import partial + +# ops_name +LINGER = '_linger' + +LINGER_QTENSOR_LAYER_COUNTER = LINGER+"_qtensor_index_" +LINGER_QTENSOR_LAYERS_PREIFX = LINGER+'_qtensor' + +self_module = [] + +def get_current_module(): + if len(self_module) > 0: + return self_module[-1] + else: + return None + +def hook_pre_forward(module, input): + setattr(module, LINGER_QTENSOR_LAYER_COUNTER, 0) + self_module.append(module) + +def hook_forward(module, input, output): + cur = self_module.pop() + assert cur == module + setattr(module, LINGER_QTENSOR_LAYER_COUNTER, 0) + + +_QMODULE_TABLE = {} +_QTENSOR_OP_TABLE = {} + +def register_qmodule(module_cls): + """ + Used for registering a new quantized module. + + The QModule must implement two abstract methods: + + - qcreate: class method to instantiate a new QModule from an nn.Module, without copying its weights, + - forward: instance method for quantized inference. 
+ + The code to register a new module looks like: + + ``` + @register_qmodule() + class MyQModule(QModuleMixin, ): + + + @classmethod + def qcreate(cls, + module: torch.nn.Module, + weights: Optional[], + activations: Optional[] = None, + optimizer: Optional[Optimizer] = None): + ... + + def forward(self, input: torch.Tensor) -> torch.Tensor: + ... + ``` + + """ + + def wrapper(cls): + _QMODULE_TABLE[module_cls] = cls + return cls + + return wrapper + +def register_qtensor_op(aten_ops: List[Callable]): + """ + Used for registering a new __torch_dispatch__ aten operation to QBytesTensor. + + The code to register a new operation looks like: + + @register_qbytestensor_op(list_of_ops) + def foo(op, *args, **kwargs): + + """ + + def wrapper(op): + for aten_op in aten_ops: + _QTENSOR_OP_TABLE[aten_op] = partial(op, aten_op) + + return wrapper + +def get_qtensor_op_dispatch(aten_op): + return _QTENSOR_OP_TABLE.get(aten_op, None) + +def get_qmodule_op(module_op): + return _QMODULE_TABLE.get(type(module_op), None) + +def quantize_module( + module, + activations_cfg = None, + *args, + **kwargs +): + if type(module) in _QMODULE_TABLE.keys(): + qcls = _QMODULE_TABLE[type(module)] + return qcls.from_module(module, activations_cfg, *args, **kwargs) + return None + +def quantize_tensor( + module, + activate_cfg = None, + *args, + **kwargs +): + if module in _QMODULE_TABLE.keys(): + qcls = _QMODULE_TABLE[module] + return qcls.from_module(module, activate_cfg, *args, **kwargs) + return None + +__all__ = ["_QMODULE_TABLE", "_QTENSOR_OP_TABLE", "LINGER_QTENSOR_LAYER_COUNTER", "LINGER_QTENSOR_LAYERS_PREIFX", \ + "get_current_module", "hook_pre_forward", "hook_forward", "register_qmodule", "register_qtensor_op", \ + "get_qmodule_op", "get_qtensor_op_dispatch", "quantize_module", "quantize_tensor"] diff --git a/linger/quant/ops/qmodule/__init__.py b/linger/quant/ops/qmodule/__init__.py new file mode 100644 index 0000000..01d4c42 --- /dev/null +++ b/linger/quant/ops/qmodule/__init__.py 
@@ -0,0 +1,21 @@ +from .qmodule import QModuleMixin +from .qlinear import * +from .qconv1d import * +from .qconv2d import * +from .qmaxpool1d import * +from .qmaxpool2d import * +from .qavgpool1d import * +from .qavgpool2d import * +from .qconvtranspose1d import * +from .qconvtranspose2d import * +from .qconvbn1d import * +from .qconvbn2d import * +from .qbatchnorm1d import * +from .qbatchnorm2d import * +from .qrelu import * +from .qembedding import * +from .qlayernorm import * +from .qglu import * +from .qSparifyFFN import * +from .qgru import QGRU +from .qlstm import QLSTM diff --git a/linger/quant/ops/qmodule/qSparifyFFN.py b/linger/quant/ops/qmodule/qSparifyFFN.py new file mode 100644 index 0000000..8b00871 --- /dev/null +++ b/linger/quant/ops/qmodule/qSparifyFFN.py @@ -0,0 +1,144 @@ +import math + +import torch +import torch.nn as nn +import torch.nn.functional as F + +from .qmodule import QModuleMixin +from ..qconfig import register_qmodule +from typing import Optional, Union, Dict, Any +import copy + +from ....constrain.SparifyFFN import SparifyFFN, GetSparifyMask +from ...quantizer import WQuantizer, AQuantizer, BQuantizer +from ...qtensor import QTensor, from_tensor_to_qtensor, from_qtensor_to_tensor +from ....config import QUANT_CONFIGS +from ....utils import _single, _pair, _triple, QatMethod + +@register_qmodule(SparifyFFN) +class QSparifyFFN(QModuleMixin, SparifyFFN): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[torch.device] = None, + ): + sp_module = cls( + module.input_size, + module.output_size, + module.bias is not None, + 8, + None, + None, + 3, + + dtype=module.weight_fc1.dtype, + device=device, + activations_cfg=activations_cfg, + weights_cfg=None, # 仅打开输入输出的量化 + bias_cfg=None, + constrain = constrain, + ) + temp_constrain = 
copy.deepcopy(constrain) + temp_constrain['clamp_activation_value'] = None + temp_constrain['clamp_factor_value'] = 3 + + sp_module.register_module("weight_fc1_quantizer", WQuantizer(weights_cfg, temp_constrain)) + sp_module.register_module("weight_fc2_quantizer", WQuantizer(weights_cfg, temp_constrain)) + temp_constrain['clamp_factor_value'] = 7 + sp_module.register_module("weight_mask_quantizer", WQuantizer(weights_cfg, temp_constrain)) + + sp_module.register_module("bias_fc1_quantizer", BQuantizer(bias_cfg, None)) + sp_module.register_module("bias_fc2_quantizer", BQuantizer(bias_cfg, None)) + sp_module.register_module("bias_mask_quantizer", BQuantizer(bias_cfg, None)) + + sp_module.register_module("outfc1_quantizer", AQuantizer(activations_cfg, None)) + sp_module.register_module("outmask_quantizer", AQuantizer(activations_cfg, None)) + + sp_module.weight_fc1 = module.weight_fc1 + sp_module.weight_fc2 = module.weight_fc2 + sp_module.weight_mask = module.weight_mask + + sp_module.bias_fc1 = module.bias_fc1 + sp_module.bias_fc2 = module.bias_fc2 + sp_module.bias_mask = module.bias_mask + + del temp_constrain + + return sp_module + + @property + def qweight_fc1(self): + fake_weight = self.weight_fc1_quantizer(self.weight_fc1) + return fake_weight + + @property + def qweight_fc2(self): + fake_weight = self.weight_fc2_quantizer(self.weight_fc2) + return fake_weight + + @property + def qweight_mask(self): + fake_weight = self.weight_mask_quantizer(self.weight_mask) + return fake_weight + + @property + def qbias_fc1(self): + if self.bias_fc1_quantizer is None: + return self.bias_fc1 + fake_bias = self.bias_fc1_quantizer(self.bias_fc1, self.weight_fc1_quantizer.scale * self.input_quantizer.scale) + return fake_bias + + @property + def qbias_fc2(self): + if self.bias_fc2_quantizer is None: + return self.bias_fc2 + fake_bias = self.bias_fc2_quantizer(self.bias_fc2, self.weight_fc2_quantizer.scale * self.outfc1_quantizer.scale) + return fake_bias + + @property + def 
qbias_mask(self): + if self.bias_mask_quantizer is None: + return self.bias_mask + fake_bias = self.bias_mask_quantizer(self.bias_mask, self.weight_mask_quantizer.scale * self.input_quantizer.scale) + return fake_bias + + def quantize_outL(self, input: torch.Tensor) -> torch.Tensor: + if isinstance(input, QTensor): + fake_input = from_qtensor_to_tensor(input) + self.outfc1_quantizer.scale.fill_(input.scale.detach()) + self.outfc1_quantizer.data_bits = input.data_bits + else: + fake_input = self.outfc1_quantizer(input) # 前向过程中会更新input_quantizer的scale + return fake_input + + def quantize_outM(self, input: torch.Tensor) -> torch.Tensor: + if isinstance(input, QTensor): + fake_input = from_qtensor_to_tensor(input) + self.outmask_quantizer.scale.fill_(input.scale.detach()) + self.outmask_quantizer.data_bits = input.data_bits + else: + fake_input = self.outmask_quantizer(input) # 前向过程中会更新input_quantizer的scale + return fake_input + + def qforward(self, input): + outL0 = F.linear(input, self.qweight_fc1, self.qbias_fc1) + outL = self.quantize_outL(outL0) + + outM = F.linear(input, self.qweight_mask, self.qbias_mask) + outM = self.quantize_outM(outM) + outM = torch.softmax(outM, dim=-1) + + mask = GetSparifyMask.apply(outM, self.ratio) + outM2 = mask.repeat_interleave(self.repeat_num, dim=-1) + out1 = outL * outM2 # 浮点Tensor类型计算,相当于mask操作,不能走进qmul + + out2 = F.relu(out1) + out = F.linear(out2, self.qweight_fc2, self.qbias_fc2) + + return out + diff --git a/linger/quant/ops/qmodule/qavgpool1d.py b/linger/quant/ops/qmodule/qavgpool1d.py new file mode 100644 index 0000000..2b9e9d5 --- /dev/null +++ b/linger/quant/ops/qmodule/qavgpool1d.py @@ -0,0 +1,48 @@ +import torch +import torch.nn as nn +import torch.nn.functional as F + +from .qmodule import QModuleMixin +from ..qconfig import register_qmodule +from ...qtensor import QTensor, from_tensor_to_qtensor, from_qtensor_to_tensor +from typing import Optional, Union, Dict, Any + +@register_qmodule(nn.AvgPool1d) +class 
QAvgPool1d(QModuleMixin, nn.AvgPool1d): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + kernel_size = module.kernel_size, + stride = module.stride, + padding = module.padding, + ceil_mode = module.ceil_mode, + count_include_pad = module.count_include_pad, + + device = device, + activations_cfg=activations_cfg, + weights_cfg=None, + constrain=None, + bias_cfg=None, + open_ihook = True, + open_ohook = True, + ) + + def qforward(self, input: torch.Tensor) -> torch.Tensor: + return F.avg_pool1d( + input, + self.kernel_size, + self.stride, + self.padding, + self.ceil_mode, + self.count_include_pad, + ) + + diff --git a/linger/quant/ops/qmodule/qavgpool2d.py b/linger/quant/ops/qmodule/qavgpool2d.py new file mode 100644 index 0000000..64925ad --- /dev/null +++ b/linger/quant/ops/qmodule/qavgpool2d.py @@ -0,0 +1,50 @@ +import torch +import torch.nn as nn +import torch.nn.functional as F + +from .qmodule import QModuleMixin +from ..qconfig import register_qmodule +from ...qtensor import QTensor, from_tensor_to_qtensor, from_qtensor_to_tensor +from typing import Optional, Union, Dict, Any + +@register_qmodule(nn.AvgPool2d) +class QAvgPool2d(QModuleMixin, nn.AvgPool2d): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + kernel_size = module.kernel_size, + stride = module.stride, + padding = module.padding, + ceil_mode = module.ceil_mode, + count_include_pad = module.count_include_pad, + divisor_override = module.divisor_override, + + device = device, + 
activations_cfg=activations_cfg, + weights_cfg=None, + constrain=None, + bias_cfg=None, + open_ihook = True, + open_ohook = True, + ) + + def qforward(self, input: torch.Tensor) -> torch.Tensor: + return F.avg_pool2d( + input, + self.kernel_size, + self.stride, + self.padding, + self.ceil_mode, + self.count_include_pad, + self.divisor_override, + ) + + diff --git a/linger/quant/ops/qmodule/qbatchnorm1d.py b/linger/quant/ops/qmodule/qbatchnorm1d.py new file mode 100644 index 0000000..6446b34 --- /dev/null +++ b/linger/quant/ops/qmodule/qbatchnorm1d.py @@ -0,0 +1,87 @@ +import torch +import torch.nn.functional as F +from .qmodule import QModuleMixin +from ..qconfig import register_qmodule +from typing import Optional, Union, Dict, Any + +@register_qmodule(torch.nn.BatchNorm1d) +class QBatchNorm1d(QModuleMixin, torch.nn.BatchNorm1d): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + num_features=module.num_features, + eps=module.eps, + momentum=module.momentum, + affine=module.affine, + track_running_stats=module.track_running_stats, + + dtype=module.weight.dtype, + device=device, + activations_cfg=activations_cfg, + weights_cfg=weights_cfg, + bias_cfg=bias_cfg, + constrain = constrain, + ) + + def qforward(self, input: torch.Tensor) -> torch.Tensor: + if self.momentum is None: + exponential_average_factor = 0.0 + else: + exponential_average_factor = self.momentum + + if self.training and self.track_running_stats: + self.num_batches_tracked += 1 + if self.momentum is None: + exponential_average_factor = 1.0 / float(self.num_batches_tracked) + + # 兼容 [N, C] 或 [N, C, L] + if input.dim() == 2: + dims = (0,) + elif input.dim() == 3: + dims = (0, 2) + else: + raise ValueError(f"Expected 2D or 3D input (got {input.dim()}D)") + 
+ size = 1 + for d in dims: + size *= input.size(d) + + if self.training: + mean = input.mean(dim=dims, keepdim=True) + var = input.var(dim=dims, unbiased=False, keepdim=True) + var = torch.clamp(var, min=1e-5) + + # 更新运行均值与方差(无梯度) + self.running_mean = ( + (1 - exponential_average_factor) * self.running_mean + + exponential_average_factor * mean.squeeze().detach() + ) + self.running_var = ( + (1 - exponential_average_factor) * self.running_var + + exponential_average_factor * var.squeeze().detach() + ) + else: + mean = self.running_mean.view(1, -1, *([1] * (input.dim() - 2))) + var = self.running_var.view(1, -1, *([1] * (input.dim() - 2))) + + # 计算仿射变换参数 + sigma = 1.0 / torch.sqrt(var + self.eps) + alpha = self.weight.view(1, -1, *([1] * (input.dim() - 2))) * sigma + beta = self.bias.view(1, -1, *([1] * (input.dim() - 2))) - mean * alpha + + # 伪量化 + fake_alpha = self.weight_quantizer(alpha) + fake_beta = self.bias_quantizer(beta) + + out = fake_alpha * input + fake_beta + return out + + diff --git a/linger/quant/ops/qmodule/qbatchnorm2d.py b/linger/quant/ops/qmodule/qbatchnorm2d.py new file mode 100644 index 0000000..41e9500 --- /dev/null +++ b/linger/quant/ops/qmodule/qbatchnorm2d.py @@ -0,0 +1,80 @@ +import torch +import torch.nn.functional as F +from .qmodule import QModuleMixin +from ..qconfig import register_qmodule +from typing import Optional, Union, Dict, Any + +@register_qmodule(torch.nn.BatchNorm2d) +class QBatchNorm2d(QModuleMixin, torch.nn.BatchNorm2d): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + num_features=module.num_features, + eps=module.eps, + momentum=module.momentum, + affine=module.affine, + track_running_stats=module.track_running_stats, + + dtype=module.weight.dtype, + device=device, + 
activations_cfg=activations_cfg, + weights_cfg=weights_cfg, + bias_cfg=bias_cfg, + constrain = constrain, + ) + + def qforward(self, input: torch.Tensor) -> torch.Tensor: + if self.momentum is None: + exponential_average_factor = 0.0 + else: + exponential_average_factor = self.momentum + if self.training and self.track_running_stats: + # TODO: if statement only here to tell the jit to skip emitting this when it is None + if self.num_batches_tracked is not None: # type: ignore[has-type] + self.num_batches_tracked.add_(1) # type: ignore[has-type] + if self.momentum is None: # use cumulative moving average + exponential_average_factor = 1.0 / float(self.num_batches_tracked) + else: # use exponential moving average + exponential_average_factor = self.momentum + + N,C,H,W= input.shape + size = N * H * W + + if self.training: + mean = input.sum((0, 2, 3), keepdim=True) / size + var = input.pow(2).sum((0, 2, 3), keepdim=True) / size - mean.pow(2) + var = torch.clamp(var, min=1e-5) + + # Update running stats (no grad) + self.running_mean = ( + (1 - exponential_average_factor) * self.running_mean + + exponential_average_factor * mean.squeeze().detach() + ) + self.running_var = ( + (1 - exponential_average_factor) * self.running_var + + exponential_average_factor * var.squeeze().detach() + ) + else: + mean = self.running_mean.reshape(1, -1, 1, 1) + var = self.running_var.reshape(1, -1, 1, 1) + + sigma = 1 / torch.sqrt(var + self.eps) + alpha = self.weight.view(1, -1, 1, 1) * sigma + beta = self.bias.view(1, -1, 1, 1) - mean * alpha + + fake_alpha = self.weight_quantizer(alpha) + fake_beta = self.bias_quantizer(beta) + + out = fake_alpha * input + fake_beta + + return out + + diff --git a/linger/quant/ops/qmodule/qconv1d.py b/linger/quant/ops/qmodule/qconv1d.py new file mode 100644 index 0000000..cb47dd7 --- /dev/null +++ b/linger/quant/ops/qmodule/qconv1d.py @@ -0,0 +1,38 @@ +import torch +from .qmodule import QModuleMixin +from ..qconfig import register_qmodule +from typing 
import Optional, Union, Dict, Any + +@register_qmodule(torch.nn.Conv1d) +class QConv1d(QModuleMixin, torch.nn.Conv1d): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + in_channels=module.in_channels, + out_channels=module.out_channels, + kernel_size=module.kernel_size, + stride=module.stride, + padding=module.padding, + dilation=module.dilation, + groups=module.groups, + bias=module.bias is not None, + padding_mode=module.padding_mode, + dtype=module.weight.dtype, + device=device, + activations_cfg=activations_cfg, + weights_cfg=weights_cfg, + bias_cfg=bias_cfg, + constrain = constrain, + ) + + def qforward(self, input: torch.Tensor) -> torch.Tensor: + return self._conv_forward(input, self.qweight, self.qbias) + diff --git a/linger/quant/ops/qmodule/qconv2d.py b/linger/quant/ops/qmodule/qconv2d.py new file mode 100644 index 0000000..38f730d --- /dev/null +++ b/linger/quant/ops/qmodule/qconv2d.py @@ -0,0 +1,38 @@ +import torch +from .qmodule import QModuleMixin +from ..qconfig import register_qmodule +from typing import Optional, Union, Dict, Any + +@register_qmodule(torch.nn.Conv2d) +class QConv2d(QModuleMixin, torch.nn.Conv2d): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + in_channels=module.in_channels, + out_channels=module.out_channels, + kernel_size=module.kernel_size, + stride=module.stride, + padding=module.padding, + dilation=module.dilation, + groups=module.groups, + bias=module.bias is not None, + padding_mode=module.padding_mode, + 
dtype=module.weight.dtype, + device=device, + activations_cfg=activations_cfg, + weights_cfg=weights_cfg, + bias_cfg=bias_cfg, + constrain = constrain, + ) + + def qforward(self, input: torch.Tensor) -> torch.Tensor: + return self._conv_forward(input, self.qweight, self.qbias) + diff --git a/linger/quant/ops/qmodule/qconvbn1d.py b/linger/quant/ops/qmodule/qconvbn1d.py new file mode 100644 index 0000000..25c3ee8 --- /dev/null +++ b/linger/quant/ops/qmodule/qconvbn1d.py @@ -0,0 +1,129 @@ +import torch +import torch.nn.functional as F +from typing import Optional, Union, Dict, Any + +from .qmodule import QModuleMixin +from ..qconfig import register_qmodule +from ....constrain import CConvBN1d, ConvBN1d +from ....onnx import generate_onnx_qparam_dict, QCustomOpSymbolic + +@register_qmodule(ConvBN1d) +@register_qmodule(CConvBN1d) +class QConvBN1d(QModuleMixin, CConvBN1d): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + convbn_mdl = cls( + in_channels=module.conv.in_channels, + out_channels=module.conv.out_channels, + kernel_size=module.conv.kernel_size, + stride=module.conv.stride, + padding=module.conv.padding, + dilation=module.conv.dilation, + groups=module.conv.groups, + bias=module.conv.bias is not None, + padding_mode=module.conv.padding_mode, + eps=module.bn.eps, + momentum=module.bn.momentum, + affine=module.bn.affine, + track_running_stats=module.bn.track_running_stats, + + dtype=module.conv.weight.dtype, + device=device, + activations_cfg=activations_cfg, + weights_cfg=weights_cfg, + bias_cfg=bias_cfg, + constrain=constrain, + ) + + convbn_mdl.weight = torch.nn.Parameter( + module.conv.weight.detach().clone(), + requires_grad=module.conv.weight.requires_grad + ) + + convbn_mdl.bias = torch.nn.Parameter( + 
module.bn.bias.detach().clone(), + requires_grad=module.bn.bias.requires_grad + ) + + convbn_mdl.in_channels = module.conv.in_channels + convbn_mdl.out_channels = module.conv.out_channels + convbn_mdl.dilation = module.conv.dilation + convbn_mdl.kernel_size = module.conv.kernel_size + convbn_mdl.padding = module.conv.padding + convbn_mdl.stride = module.conv.stride + convbn_mdl.groups = module.conv.groups + convbn_mdl.output_padding = module.conv.output_padding + convbn_mdl.padding_mode = module.conv.padding_mode + return convbn_mdl + + def forward(self, input: torch.Tensor) -> torch.Tensor: + if self.training: + # conv_rlt = self.conv(input) # for calculate bn mean and var + conv_rlt = self.conv._conv_forward(input, self.conv.weight, self.conv.bias) + N, C, H = conv_rlt.size() + bn_size = N * H + conv_rlt = conv_rlt.permute(1, 0, 2).contiguous().view(C, bn_size) + sum_ = conv_rlt.sum(1) + sum_square_ = conv_rlt.pow(2).sum(1) + mean_ = sum_ / bn_size + sum_var_ = sum_square_ - sum_ * mean_ + unbias_var_ = sum_var_ / (bn_size - 1) # 无偏方差,用 unbias_var(除 N-1)来更新 running_var(长期的、用于推理时的估计),在统计上更合理(减少估计偏差) + unbias_var_ = torch.clamp(unbias_var_, min=0.0) + self.bn.running_mean = ( + (1 - self.bn.momentum) * self.bn.running_mean + self.bn.momentum * mean_.detach()) + self.bn.running_var = ( + (1 - self.bn.momentum) * self.bn.running_var + self.bn.momentum * unbias_var_.detach()) + + bias_var_ = sum_var_ / bn_size # 计算当前 batch 的标准差用于“在该 batch 上归一化” —— 这是即时、直接的标准化数学操作 + bias_var_ = torch.clamp(bias_var_, min=0.0) + inv_std_ = 1 / (bias_var_ + self.bn.eps).pow(0.5) + bn_rlt = ((conv_rlt - mean_.unsqueeze(1)) * inv_std_.unsqueeze(1) * + self.bn.weight.unsqueeze(1) + self.bn.bias.unsqueeze(1)) + bn_rlt = bn_rlt.view(C, N, H).permute(1, 0, 2).contiguous() + w_bn_ = self.bn.weight.div(torch.sqrt(unbias_var_ + self.bn.eps)) + new_weight = self.conv.weight.mul(w_bn_.view(-1, 1, 1)) + if self.conv.bias is not None: + b_conv_ = self.conv.bias + else: + b_conv_ = 
torch.zeros(self.conv.weight.size(0), device=input.device) + b_bn_ = self.bn.bias - self.bn.weight.mul(mean_).div(torch.sqrt(unbias_var_ + self.bn.eps)) + new_bias = b_conv_.mul(w_bn_) + b_bn_ + + alpha = 0.1 + + # fake_quant new_weight and new_bias + fake_weight = self.weight_quantizer(new_weight) + fake_bias = self.bias_quantizer(new_bias) + + new_conv_rlt = F.conv1d(input, fake_weight, fake_bias, self.conv.stride, + self.conv.padding, self.conv.dilation, self.conv.groups) + output = alpha * bn_rlt + (1 - alpha) * new_conv_rlt + else: + w_bn_ = self.bn.weight.div(torch.sqrt(self.bn.eps + self.bn.running_var)) + new_weight = self.conv.weight.mul(w_bn_.view(-1, 1, 1)) + if self.conv.bias is not None: + b_conv_ = self.conv.bias + else: + b_conv_ = torch.zeros(self.conv.weight.size(0), device=input.device) + b_bn_ = self.bn.bias - self.bn.weight.mul(self.bn.running_mean).div( + torch.sqrt(self.bn.running_var + self.bn.eps)) + new_bias = b_conv_.mul(w_bn_) + b_bn_ + self.weight.data = new_weight + self.bias.data = new_bias + if torch.onnx.is_in_onnx_export(): + qparam_dict = generate_onnx_qparam_dict(self, False) + output = QCustomOpSymbolic.apply(input, self.weight, self.bias, qparam_dict, self.input_quantizer.is_qtensor) + else: + output = F.conv1d(input, self.qweight, self.qbias, self.stride, + self.padding, self.dilation, self.groups) + + return output + diff --git a/linger/quant/ops/qmodule/qconvbn2d.py b/linger/quant/ops/qmodule/qconvbn2d.py new file mode 100644 index 0000000..013bc22 --- /dev/null +++ b/linger/quant/ops/qmodule/qconvbn2d.py @@ -0,0 +1,129 @@ +import torch +import torch.nn.functional as F +from typing import Optional, Union, Dict, Any + +from .qmodule import QModuleMixin +from ..qconfig import register_qmodule +from ....constrain.cconvbn2d import CConvBN2d, ConvBN2d +from ....onnx import generate_onnx_qparam_dict, QCustomOpSymbolic + +@register_qmodule(ConvBN2d) +@register_qmodule(CConvBN2d) +class QConvBN2d(QModuleMixin, CConvBN2d): + 
@classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + convbn_mdl = cls( + in_channels=module.conv.in_channels, + out_channels=module.conv.out_channels, + kernel_size=module.conv.kernel_size, + stride=module.conv.stride, + padding=module.conv.padding, + dilation=module.conv.dilation, + groups=module.conv.groups, + bias=module.conv.bias is not None, + padding_mode=module.conv.padding_mode, + eps=module.bn.eps, + momentum=module.bn.momentum, + affine=module.bn.affine, + track_running_stats=module.bn.track_running_stats, + + dtype=module.conv.weight.dtype, + device=device, + activations_cfg=activations_cfg, + weights_cfg=weights_cfg, + bias_cfg=bias_cfg, + constrain=constrain, + ) + + convbn_mdl.weight = torch.nn.Parameter( + module.conv.weight.detach().clone(), + requires_grad=module.conv.weight.requires_grad + ) + + convbn_mdl.bias = torch.nn.Parameter( + module.bn.bias.detach().clone(), + requires_grad=module.bn.bias.requires_grad + ) + + convbn_mdl.in_channels = module.conv.in_channels + convbn_mdl.out_channels = module.conv.out_channels + convbn_mdl.dilation = module.conv.dilation + convbn_mdl.kernel_size = module.conv.kernel_size + convbn_mdl.padding = module.conv.padding + convbn_mdl.stride = module.conv.stride + convbn_mdl.groups = module.conv.groups + convbn_mdl.output_padding = module.conv.output_padding + convbn_mdl.padding_mode = module.conv.padding_mode + return convbn_mdl + + def forward(self, input: torch.Tensor) -> torch.Tensor: + if self.training: + # conv_rlt = self.conv(input) # for calculate bn mean and var + conv_rlt = self.conv._conv_forward(input, self.conv.weight, self.conv.bias) + N, C, H, W = conv_rlt.size() + bn_size = N * H * W + conv_rlt = conv_rlt.permute(1, 0, 2, 3).contiguous().view(C, bn_size) + sum_ = 
conv_rlt.sum(1) + sum_square_ = conv_rlt.pow(2).sum(1) + mean_ = sum_ / bn_size + sum_var_ = sum_square_ - sum_ * mean_ + unbias_var_ = sum_var_ / (bn_size - 1) # 无偏方差,用 unbias_var(除 N-1)来更新 running_var(长期的、用于推理时的估计),在统计上更合理(减少估计偏差) + unbias_var_ = torch.clamp(unbias_var_, min=0.0) + self.bn.running_mean = ( + (1 - self.bn.momentum) * self.bn.running_mean + self.bn.momentum * mean_.detach()) + self.bn.running_var = ( + (1 - self.bn.momentum) * self.bn.running_var + self.bn.momentum * unbias_var_.detach()) + + bias_var_ = sum_var_ / bn_size # 计算当前 batch 的标准差用于“在该 batch 上归一化” —— 这是即时、直接的标准化数学操作 + bias_var_ = torch.clamp(bias_var_, min=0.0) + inv_std_ = 1 / (bias_var_ + self.bn.eps).pow(0.5) + bn_rlt = ((conv_rlt - mean_.unsqueeze(1)) * inv_std_.unsqueeze(1) * + self.bn.weight.unsqueeze(1) + self.bn.bias.unsqueeze(1)) + bn_rlt = bn_rlt.view(C, N, H, W).permute(1, 0, 2, 3).contiguous() + w_bn_ = self.bn.weight.div(torch.sqrt(unbias_var_ + self.bn.eps)) + new_weight = self.conv.weight.mul(w_bn_.view(-1, 1, 1, 1)) + if self.conv.bias is not None: + b_conv_ = self.conv.bias + else: + b_conv_ = torch.zeros(self.conv.weight.size(0), device=input.device) + b_bn_ = self.bn.bias - self.bn.weight.mul(mean_).div(torch.sqrt(unbias_var_ + self.bn.eps)) + new_bias = b_conv_.mul(w_bn_) + b_bn_ + + alpha = 0.1 + + # fake_quant new_weight and new_bias + fake_weight = self.weight_quantizer(new_weight) + fake_bias = self.bias_quantizer(new_bias) + + new_conv_rlt = F.conv2d(input, fake_weight, fake_bias, self.conv.stride, + self.conv.padding, self.conv.dilation, self.conv.groups) + output = alpha * bn_rlt + (1 - alpha) * new_conv_rlt + else: + w_bn_ = self.bn.weight.div(torch.sqrt(self.bn.eps + self.bn.running_var)) + new_weight = self.conv.weight.mul(w_bn_.view(-1, 1, 1, 1)) + if self.conv.bias is not None: + b_conv_ = self.conv.bias + else: + b_conv_ = torch.zeros(self.conv.weight.size(0), device=input.device) + b_bn_ = self.bn.bias - self.bn.weight.mul(self.bn.running_mean).div( + 
torch.sqrt(self.bn.running_var + self.bn.eps)) + new_bias = b_conv_.mul(w_bn_) + b_bn_ + self.weight.data = new_weight + self.bias.data = new_bias + if torch.onnx.is_in_onnx_export(): + qparam_dict = generate_onnx_qparam_dict(self, False) + output = QCustomOpSymbolic.apply(input, self.weight, self.bias, qparam_dict, self.input_quantizer.is_qtensor) + else: + output = F.conv2d(input, self.qweight, self.qbias, self.stride, + self.padding, self.dilation, self.groups) + + return output + diff --git a/linger/quant/ops/qmodule/qconvtranspose1d.py b/linger/quant/ops/qmodule/qconvtranspose1d.py new file mode 100644 index 0000000..91af734 --- /dev/null +++ b/linger/quant/ops/qmodule/qconvtranspose1d.py @@ -0,0 +1,49 @@ +import torch +import torch.nn.functional as F +from typing import List, Optional, Tuple, Union +from .qmodule import QModuleMixin +from ..qconfig import register_qmodule +from typing import Optional, Union, Dict, Any + +@register_qmodule(torch.nn.ConvTranspose1d) +class QConvTranspose1d(QModuleMixin, torch.nn.ConvTranspose1d): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + in_channels=module.in_channels, + out_channels=module.out_channels, + kernel_size=module.kernel_size, + stride=module.stride, + padding=module.padding, + output_padding=module.output_padding, + groups=module.groups, + bias=module.bias is not None, + dilation=module.dilation, + padding_mode=module.padding_mode, + dtype=module.weight.dtype, + device=device, + activations_cfg=activations_cfg, + weights_cfg=weights_cfg, + bias_cfg=bias_cfg, + constrain = constrain, + ) + + def qforward(self, input: torch.Tensor, output_size: Optional[List[int]] = None) -> torch.Tensor: + return F.conv_transpose1d( + input, + self.qweight, + self.qbias, + 
self.stride, + self.padding, + self.output_padding, + self.groups, + self.dilation,) + diff --git a/linger/quant/ops/qmodule/qconvtranspose2d.py b/linger/quant/ops/qmodule/qconvtranspose2d.py new file mode 100644 index 0000000..48034d0 --- /dev/null +++ b/linger/quant/ops/qmodule/qconvtranspose2d.py @@ -0,0 +1,51 @@ +import torch +import torch.nn.functional as F +from typing import List, Optional, Tuple, Union +from .qmodule import QModuleMixin +from ..qconfig import register_qmodule +from typing import Optional, Union, Dict, Any + +@register_qmodule(torch.nn.ConvTranspose2d) +class QConvTranspose2d(QModuleMixin, torch.nn.ConvTranspose2d): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + return cls( + in_channels=module.in_channels, + out_channels=module.out_channels, + kernel_size=module.kernel_size, + stride=module.stride, + padding=module.padding, + output_padding=module.output_padding, + groups=module.groups, + bias=module.bias is not None, + dilation=module.dilation, + padding_mode=module.padding_mode, + dtype=module.weight.dtype, + + device=device, + activations_cfg=activations_cfg, + weights_cfg=weights_cfg, + bias_cfg=bias_cfg, + constrain = constrain, + ) + + def qforward(self, input: torch.Tensor, output_size: Optional[List[int]] = None) -> torch.Tensor: + return F.conv_transpose2d( + input, + self.qweight, + self.qbias, + self.stride, + self.padding, + self.output_padding, + self.groups, + self.dilation, + ) + diff --git a/linger/quant/ops/qmodule/qembedding.py b/linger/quant/ops/qmodule/qembedding.py new file mode 100644 index 0000000..b9a1f19 --- /dev/null +++ b/linger/quant/ops/qmodule/qembedding.py @@ -0,0 +1,55 @@ +import math + +import torch +import torch.nn as nn +import torch.nn.functional as F + +from .qmodule 
import QModuleMixin +from ..qconfig import register_qmodule +from typing import Optional, Union, Dict, Any +from ...qtensor import QTensor, from_tensor_to_qtensor, from_qtensor_to_tensor + +@register_qmodule(torch.nn.Embedding) +class QEmbedding(QModuleMixin, nn.Embedding): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[torch.device] = None, + ): + return cls( + module.num_embeddings, + module.embedding_dim, + module.padding_idx, + module.max_norm, + module.norm_type, + module.scale_grad_by_freq, + module.sparse, + None, # _weight参数永远设置为None + None, # _freeze参数永远设置为None + dtype = module.weight.dtype, + device = device, + activations_cfg = activations_cfg, + weights_cfg = weights_cfg, + bias_cfg = bias_cfg, + constrain = constrain, + open_ihook = False, + open_ohook = False + ) + + def qforward(self, input): + out_q = F.embedding( + input, + self.qweight, + self.padding_idx, + self.max_norm, + self.norm_type, + self.scale_grad_by_freq, + self.sparse, + ) + return from_tensor_to_qtensor(out_q, self.weight_quantizer.scale, self.weight_quantizer.data_bits) + diff --git a/linger/quant/ops/qmodule/qglu.py b/linger/quant/ops/qmodule/qglu.py new file mode 100644 index 0000000..e798a1e --- /dev/null +++ b/linger/quant/ops/qmodule/qglu.py @@ -0,0 +1,48 @@ +import torch +import torch.nn as nn +import torch.nn.functional as F + +from .qmodule import QModuleMixin +from ..qconfig import register_qmodule +from typing import Optional, Union, Dict, Any + +from ..qtensor import QSigmoidFunction +from ...quantizer import AQuantizer, BQuantizer + +from ....config import QUANT_CONFIGS + +@register_qmodule(nn.GLU) +class QGLU(QModuleMixin, nn.GLU): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, 
Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[Dict[str, Any]] = None, + ): + glu_module = cls( + dim = module.dim, + + device = device, + activations_cfg=activations_cfg, + weights_cfg=None, + constrain=constrain, + bias_cfg=None, + ) + + glu_module.register_module("sigmoid_quantizer", BQuantizer(activations_cfg, None)) + return glu_module + + def qforward(self, input: torch.Tensor) -> torch.Tensor: + if QUANT_CONFIGS.calibration: + return F.glu(input, self.dim) + else: + input_a, input_b = torch.chunk(input, 2, dim=self.dim) + input_b = QSigmoidFunction.apply(input_b, self.input_quantizer) # int32 dequant + input_b = self.sigmoid_quantizer(input_b, torch.tensor(2**15, dtype=torch.float32)) # Q15 for thinker forward + output = input_a * input_b + return output + diff --git a/linger/quant/ops/qmodule/qgru.py b/linger/quant/ops/qmodule/qgru.py new file mode 100644 index 0000000..da07d50 --- /dev/null +++ b/linger/quant/ops/qmodule/qgru.py @@ -0,0 +1,545 @@ +import math +import copy +import torch +import torch.nn as nn +import torch.nn.functional as F + +from .qmodule import QModuleMixin +from ..qconfig import register_qmodule +from typing import Optional, Union, Dict, Any +from torch.nn.utils.rnn import PackedSequence + +from ..qtensor import QSigmoidFunction +from ...qtensor import QTensor, from_tensor_to_qtensor, from_qtensor_to_tensor +from ...quantizer import WQuantizer, AQuantizer, BQuantizer +from ....config import QUANT_CONFIGS +from ....utils import _unbind, _unbind_packed, _slice, hx_slice, QatMethod, PlatForm + +import lingerext + +def luna_requant(x_int, scale_x, scale_y): + l_scale = scale_y - scale_x + + if l_scale > 0: + x_int = x_int * pow(2, l_scale) + else: + x_int = (x_int * pow(2, l_scale) + 0.5).floor().int() + return x_int + +class QGRUSigmoidFunction(torch.autograd.Function): + @staticmethod + def forward(ctx, input): + ctx.save_for_backward(input) + + # 转换为Q27格式的int32 
+ i_q27 = (input * (1 << 27) + 0.5).floor().to(torch.int32) # float到int32需要2次舍入,这是第二次 + i_q27.clamp_(-2**31, 2**31-1) + + output_q31 = None + if QUANT_CONFIGS.platform == PlatForm.venus: + output_q31 = lingerext.venusa_qsigmoid_forward(i_q27.contiguous()) + elif QUANT_CONFIGS.platform == PlatForm.arcs or QUANT_CONFIGS.platform == PlatForm.mars: + output_q31 = lingerext.venusa_qsigmoid_forward(i_q27.contiguous()) + elif QUANT_CONFIGS.platform == PlatForm.venusA: + output_q31 = lingerext.venusa_qsigmoid_forward(i_q27.contiguous()) + + output_q15 = luna_requant(output_q31.int(), 31, 15) + output = output_q15.float() / (1 << 15) + return output + + @staticmethod + def backward(ctx, grad_output): + input, = ctx.saved_tensors + + # 使用标准sigmoid的梯度近似 + input = input.detach().clone().requires_grad_(True) + + with torch.enable_grad(): + y = F.sigmoid(input) + gradInput = torch.autograd.grad(y, input, grad_output) + + return gradInput + +class QGRUTanhFunction(torch.autograd.Function): + @staticmethod + def forward(ctx, input): + ctx.save_for_backward(input) + + # 转换为Q27格式的int32 + i_q27 = (input * (1 << 27) + 0.5).floor().to(torch.int32) # float到int32需要2次舍入,这是第二次 + i_q27.clamp_(-2**31, 2**31-1) + + output_q31 = None + if QUANT_CONFIGS.platform == PlatForm.venus: + output_q31 = lingerext.venusa_qtanh_forward(i_q27.contiguous()) + elif QUANT_CONFIGS.platform == PlatForm.arcs or QUANT_CONFIGS.platform == PlatForm.mars: + output_q31 = lingerext.arcs_qtanh_forward(i_q27.contiguous()) + elif QUANT_CONFIGS.platform == PlatForm.venusA: + output_q31 = lingerext.venusa_qtanh_forward(i_q27.contiguous()) + + output_q15 = luna_requant(output_q31.int(), 31, 15) + output = output_q15.float() / (1 << 15) + return output + + @staticmethod + def backward(ctx, grad_output): + input, = ctx.saved_tensors + + # 使用标准tanh的梯度近似 + input = input.detach().clone().requires_grad_(True) + + with torch.enable_grad(): + y = F.tanh(input) + gradInput = torch.autograd.grad(y, input, grad_output) + + return 
gradInput + +class QGRUCell(nn.Module): + def forward(self, input_x, hidden, weight_ih, weight_hh, bias_ih, bias_hh, training, + input_quantizer, hidden_quantizer, weightih_quantizer, weighthh_quantizer, + biasih_quantizer, biashh_quantizer, output_quantizer): + # step1 input.mul, hidden.mul, fake_quant to keep same value + gi_output = F.linear(input_x, weight_ih, bias_ih) + gh_output = F.linear(hidden, weight_hh, bias_hh) + + scale_gi = input_quantizer.scale * weightih_quantizer.scale + scale_gh = hidden_quantizer.scale * weighthh_quantizer.scale + gi_output = biasih_quantizer(gi_output, scale_gi) # fake_quant gi_output + gh_output = biashh_quantizer(gh_output, scale_gh) + + i_r, i_i, i_n = gi_output.chunk(3, 1) + h_r, h_i, h_n = gh_output.chunk(3, 1) + ih_r = i_r + h_r #这一步推理时没有舍入 + ih_i = i_i + h_i + resetgate = QGRUSigmoidFunction.apply(ih_r) + inputgate = QGRUSigmoidFunction.apply(ih_i) + resetgate_h_n = resetgate * h_n + r_ih_n = resetgate_h_n + i_n + newgate = QGRUTanhFunction.apply(r_ih_n) + + hy = newgate + inputgate * (hidden - newgate) + + return hy + + +@register_qmodule(torch.nn.GRU) +class QGRU(QModuleMixin, nn.GRU): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[torch.device] = None, + ): + gru_module = cls( + module.input_size, + module.hidden_size, + module.num_layers, + module.bias, + module.batch_first, + module.dropout, + module.bidirectional, + + dtype=module.weight_ih_l0.dtype, + device=device, + activations_cfg=activations_cfg, + weights_cfg=None, # 仅打开输入输出的量化 + bias_cfg=None, + constrain=constrain, + open_ihook = False, + open_ohook = False + ) + + gru_module.add_module("input_quantizer", AQuantizer(activations_cfg, None)) + gru_module.add_module("weightih_quantizer", WQuantizer(weights_cfg, constrain)) + 
gru_module.add_module("weighthh_quantizer", WQuantizer(weights_cfg, constrain)) + gru_module.add_module("hidden_quantizer", AQuantizer(activations_cfg, None)) + gru_module.add_module("output_quantizer", AQuantizer(activations_cfg, constrain)) + if module.bias: + gru_module.add_module("biasih_quantizer", BQuantizer(bias_cfg, constrain)) + gru_module.add_module("biashh_quantizer", BQuantizer(bias_cfg, constrain)) + else: + gru_module.add_module("biasih_quantizer", None) + gru_module.add_module("biashh_quantizer", None) + + if module.bidirectional: + gru_module.add_module("weightih_reverse_quantizer", WQuantizer(weights_cfg, constrain)) + gru_module.add_module("weighthh_reverse_quantizer", WQuantizer(weights_cfg, constrain)) + gru_module.add_module("hidden_reverse_quantizer", AQuantizer(activations_cfg, None)) + gru_module.add_module("output_reverse_quantizer", AQuantizer(activations_cfg, constrain)) + if module.bias: + gru_module.add_module("biasih_reverse_quantizer", BQuantizer(bias_cfg, constrain)) + gru_module.add_module("biashh_reverse_quantizer", BQuantizer(bias_cfg, constrain)) + else: + gru_module.add_module("biasih_reverse_quantizer", None) + gru_module.add_module("biashh_reverse_quantizer", None) + + gru_module.weight_ih_l0 = module.weight_ih_l0 + gru_module.weight_hh_l0 = module.weight_hh_l0 + gru_module.bias_ih_l0 = module.bias_ih_l0 + gru_module.bias_hh_l0 = module.bias_hh_l0 + + if module.bidirectional: + gru_module.weight_ih_l0_reverse = module.weight_ih_l0_reverse + gru_module.weight_hh_l0_reverse = module.weight_hh_l0_reverse + gru_module.bias_ih_l0_reverse = module.bias_ih_l0_reverse + gru_module.bias_hh_l0_reverse = module.bias_hh_l0_reverse + + gru_module.qgru_cell_func = QGRUCell() + + return gru_module + + @property + def qweight_ih_hh(self): + fake_w_ih = self.weightih_quantizer(self.weight_ih_l0) + fake_w_hh = self.weighthh_quantizer(self.weight_hh_l0) + return fake_w_ih, fake_w_hh + + @property + def qweight_ih_hh_reverse(self): + if 
self.bidirectional: + fake_w_ih_r = self.weightih_reverse_quantizer(self.weight_ih_l0_reverse) + fake_w_hh_r = self.weighthh_reverse_quantizer(self.weight_hh_l0_reverse) + return fake_w_ih_r, fake_w_hh_r + return None, None + + @property + def qbias_ih_hh(self): + if self.biasih_quantizer is None: + return self.bias_ih_l0, self.bias_hh_l0 + fake_bias_ih = self.biasih_quantizer(self.bias_ih_l0, self.weightih_quantizer.scale * self.input_quantizer.scale) + fake_bias_hh = self.biashh_quantizer(self.bias_hh_l0, self.weighthh_quantizer.scale * self.hidden_quantizer.scale) + return fake_bias_ih, fake_bias_hh + + @property + def qbias_ih_hh_reverse(self): + if self.bidirectional: + if self.biasih_quantizer is None: + return self.bias_ih_l0_reverse, self.bias_hh_l0_reverse + fake_bias_ih_r = self.biasih_reverse_quantizer(self.bias_ih_l0_reverse, self.weightih_reverse_quantizer.scale * self.input_quantizer.scale) + fake_bias_hh_r = self.biashh_reverse_quantizer(self.bias_hh_l0_reverse, self.weighthh_reverse_quantizer.scale * self.hidden_reverse_quantizer.scale) + return fake_bias_ih_r, fake_bias_hh_r + return None, None + + def quantize_gru_input(self, input: torch.Tensor) -> torch.Tensor: + if isinstance(input, QTensor): + fake_input = from_qtensor_to_tensor(input) + self.input_quantizer.scale.fill_(input.scale.detach()) + self.input_quantizer.data_bits = input.data_bits + else: + fake_input = self.input_quantizer(input) # 前向过程中会更新input_quantizer的scale + return fake_input + + def quantize_gru_hidden(self, hidden: torch.Tensor, direct, scale=None) -> torch.Tensor: + if direct: + fake_hidden = self.hidden_reverse_quantizer(hidden, scale) + else: + fake_hidden = self.hidden_quantizer(hidden, scale) + return fake_hidden + + def quantize_gru_out(self, input: torch.Tensor, direct) -> torch.Tensor: + if direct: + fake_out = self.output_reverse_quantizer(input) + else: + fake_out = self.output_quantizer(input) + return fake_out + # return from_tensor_to_qtensor(fake_out, 
self.output_quantizer.scale, self.output_quantizer.data_bits) + + def qforward(self, input, *args, **kwargs): + hx = None if len(args) == 0 else args[0] + if QUANT_CONFIGS.calibration: + return self.forward_calibrate(input, hx) + else: + return self.forward_train(input, hx) + + + def forward_calibrate(self, input, hx=None): + with torch.no_grad(): + if self.num_layers != 1: + assert False, "Intx-NormalizeGRU don't support num_layer!=1 !" + orig_input = input + if isinstance(orig_input, PackedSequence): + input, batch_sizes, sorted_indices, unsorted_indices = orig_input + max_batch_size = batch_sizes[0] + max_batch_size = int(max_batch_size) + elif isinstance(orig_input, tuple): + input, lengths, batch_first, enforce_sorted = orig_input + packed_input = torch.nn.utils.rnn.pack_padded_sequence( + input, lengths, batch_first, enforce_sorted) + input, batch_sizes, sorted_indices, unsorted_indices = packed_input + max_batch_size = batch_sizes[0] + max_batch_size = int(max_batch_size) + else: + batch_sizes = None + max_batch_size = input.size( + 0) if self.batch_first else input.size(1) + sorted_indices = None + unsorted_indices = None + + if hx is None: + num_directions = 2 if self.bidirectional else 1 + hx = torch.zeros(self.num_layers * num_directions, + max_batch_size, self.hidden_size, + dtype=input.dtype, device=input.device) + else: + hx = self.permute_hidden(hx, sorted_indices) + + input = self.quantize_gru_input(input) + + self.check_forward_args(input, hx, batch_sizes) + + w_ih, w_hh = self.qweight_ih_hh + b_ih, b_hh = self.qbias_ih_hh + hx_f = hx[0:1,:,:] + output, hidden = torch.ops.aten.gru( + input, hx_f, [w_ih, w_hh, b_ih, b_hh], + has_biases=self.bias, + num_layers=self.num_layers, + dropout=self.dropout, + train=self.training, + # bidrectional=False, + bidirectional=self.bidirectional, + batch_first=self.batch_first + ) + hidden = self.quantize_gru_hidden(hidden, 0) + output = self.quantize_gru_out(output, 0) + + if self.bidirectional: + input_r = 
input.flip(0) + hx_b = hx[1:, :, :] + w_ih_r, w_hh_r = self.qweight_ih_hh_reverse + b_ih_r, b_hh_r = self.qbias_ih_hh_reverse + output_r, hidden_r = torch.ops.aten.gru( + input_r, hx_b, [w_ih_r, w_hh_r, b_ih_r, b_hh_r], + has_biases=self.bias, + num_layers=self.num_layers, + dropout=self.dropout, + train=self.training, + bidirectional=False, + batch_first=self.batch_first + ) + hidden_r = self.quantize_gru_hidden(hidden_r, 1) + output_r = self.quantize_gru_out(output_r, 1) + output = torch.cat((output, output_r), -1) + hidden = torch.cat((hidden, hidden_r), -1) + + output = from_tensor_to_qtensor(output, self.output_quantizer.scale, self.output_quantizer.data_bits) + + if isinstance(orig_input, PackedSequence): + output_packed = PackedSequence( + output, batch_sizes, sorted_indices, unsorted_indices) + return output_packed, self.permute_hidden(hidden, unsorted_indices) + elif isinstance(orig_input, tuple): + output_packed = PackedSequence( + output, batch_sizes, sorted_indices, unsorted_indices) + output, lengths = torch.nn.utils.rnn.pad_packed_sequence( + output_packed, self.batch_first) + return (output, lengths), self.permute_hidden(hidden, unsorted_indices) + else: + return output, self.permute_hidden(hidden, unsorted_indices) + + def forward_train(self, input, hx=None): + orig_input = input + if isinstance(orig_input, PackedSequence): + input, batch_sizes, sorted_indices, unsorted_indices = orig_input + max_batch_size = batch_sizes[0] + max_batch_size = int(max_batch_size) + elif isinstance(orig_input, tuple): + input, lengths, batch_first, enforce_sorted = orig_input + packed_input = torch.nn.utils.rnn.pack_padded_sequence( + input, lengths, batch_first, enforce_sorted) + input, batch_sizes, sorted_indices, unsorted_indices = packed_input + max_batch_size = batch_sizes[0] + max_batch_size = int(max_batch_size) + else: + batch_sizes = None + max_batch_size = input.size(0) if self.batch_first else input.size(1) + sorted_indices = None + unsorted_indices = None 
+ + assert self.num_layers == 1, 'invalid num_layers, now only support num_layers = 1' + + self.quantize_gru_input(input) + + # init hidden + if hx is None: + num_directions = 2 if self.bidirectional else 1 + hx = torch.zeros(self.num_layers * num_directions, max_batch_size, self.hidden_size, dtype=input.dtype, device=input.device) + self.quantize_gru_hidden(hx[0], 0, self.input_quantizer.scale) # init hidden_quantizer + if self.bidirectional: + self.quantize_gru_hidden(hx[1], 1, self.input_quantizer.scale) + else: + # Each batch of the hidden state should match the input sequence that + # the user believes he/she is passing in. + hx = self.permute_hidden(hx, sorted_indices) + self.quantize_gru_hidden(hx[0], 0, None) + if self.bidirectional: + self.quantize_gru_hidden(hx[1], 1, None) + + self.check_forward_args(input, hx, batch_sizes) + + if batch_sizes is not None: + output, hidden = self.forward_input_packed(input, hx, batch_sizes) + else: + output, hidden = self.forward_input_tensor(input, hx) + + output = from_tensor_to_qtensor(output, self.output_quantizer.scale, self.output_quantizer.data_bits) + + if isinstance(orig_input, PackedSequence): + output_packed = PackedSequence( + output, batch_sizes, sorted_indices, unsorted_indices) + return output_packed, self.permute_hidden(hidden, unsorted_indices) + elif isinstance(orig_input, tuple): + output_packed = PackedSequence( + output, batch_sizes, sorted_indices, unsorted_indices) + output, lengths = torch.nn.utils.rnn.pad_packed_sequence( + output_packed, self.batch_first) + return (output, lengths), self.permute_hidden(hidden, unsorted_indices) + else: + return output, self.permute_hidden(hidden, unsorted_indices) + + def forward_input_packed(self, input, hx, batch_sizes=None): + hiddens = self._generate_hiddens(hx) + output, hr = self.gru_forward(input, hiddens, batch_sizes) + return output, hr + + def forward_input_tensor(self, input, hx): + # Convert input to (seq_len, batch_size, input_size) + input = 
input.transpose(0, 1) if self.batch_first else input + hiddens = self._generate_hiddens(hx) + output, hr = self.gru_forward(input, hiddens) + output = output.transpose(0, 1) if self.batch_first else output + return output, hr + + def gru_forward(self, input, hiddens, batch_sizes=None): + final_hiddens = [] + # Go through layers + for layer_num in range(self.num_layers): + hid = hiddens[layer_num] if hiddens is not None else None + output, hc = self._bidirection(input, layer_num, hid, batch_sizes) if self.bidirectional else self._single_direction(input, layer_num, hid, batch_sizes) + final_hiddens.extend(hc) + ## add dropout + if (self.dropout!= 0 and self.training and layer_num < self.num_layers - 1): + output = torch.nn.functional.dropout(output, self.dropout) + hy = [hidden for hidden in final_hiddens] + hy = torch.stack(hy, 0) + return output, hy + + def _single_direction(self, input, layer, hx, batch_sizes = None): + hidden = hx + output, hidden = self._run_single_direction(input, hidden, layer, direct=0, batch_sizes=batch_sizes) + return output, [hidden] + + def _bidirection(self, input, layer, hx, batch_sizes = None): + hx_f = hx[0] + hx_b = hx[1] + fw_output, fw_hidden = self._run_single_direction(input, hx_f, layer, direct=0, batch_sizes=batch_sizes) + rev_output, rev_hidden = self._run_single_direction(input, hx_b, layer, direct=1, batch_sizes=batch_sizes) + if batch_sizes is None: + output = torch.cat((fw_output, rev_output), fw_output.dim()-1) + else: #packed sequence + output = torch.cat((fw_output, rev_output), -1) + return output, [fw_hidden, rev_hidden] + + def _run_single_direction(self, input, hidden, layer=0, direct=0, batch_sizes=None): + # bidirection quantizer + input_quantizer = self.input_quantizer + hidden_quantizer = self.hidden_quantizer if direct == 0 else self.hidden_reverse_quantizer + weightih_quantizer = self.weightih_quantizer if direct == 0 else self.weightih_reverse_quantizer + weighthh_quantizer = self.weighthh_quantizer if direct 
== 0 else self.weighthh_reverse_quantizer + biasih_quantizer = self.biasih_quantizer if direct == 0 else self.biasih_reverse_quantizer + biashh_quantizer = self.biashh_quantizer if direct == 0 else self.biashh_reverse_quantizer + output_quantizer = self.output_quantizer if direct == 0 else self.output_reverse_quantizer + + # fake_quant hidden, weight, bias + weight_ih, weight_hh = self.qweight_ih_hh if direct == 0 else self.qweight_ih_hh_reverse + bias_ih, bias_hh = self.qbias_ih_hh if direct == 0 else self.qbias_ih_hh_reverse + + step_outputs = [] + + if batch_sizes is None: + # input = torch.cat(input.split(1,0)[::-1]) if direct == 1 else input + input = input.flip(0) if direct == 1 else input + + step_outputs = [] + for input_x in input: + hidden = self.qgru_cell_func(input_x, hidden, weight_ih, weight_hh, bias_ih, bias_hh, self.training, + input_quantizer, hidden_quantizer, weightih_quantizer, weighthh_quantizer, + biasih_quantizer, biashh_quantizer, output_quantizer) + hidden = self.quantize_gru_hidden(hidden, direct) + step_outputs.append(hidden) + step_outputs = step_outputs[::-1] if direct == 1 else step_outputs + output = torch.stack(step_outputs, 0) + elif direct == 0: + final_hiddens = [] + hidden = copy.deepcopy(hidden) + #split by time + input, batch_size_list = _unbind_packed(input, batch_sizes) + last_batch_size = batch_size_list[0] + for input_i, batch_len in zip(input, batch_size_list): + inc = batch_len - last_batch_size + if inc < 0: + #按batch的帧长排完序,由长到短,较短的帧hidden计算的次数少,直接取低位保留 + final_hiddens.append(_slice(hidden, batch_len, last_batch_size)) + hidden = hx_slice(None, hidden, last_batch_size, batch_len) + hidden = self.qgru_cell_func(input_x, hidden, weight_ih, weight_hh, bias_ih, bias_hh, self.training, + input_quantizer, hidden_quantizer, weightih_quantizer, weighthh_quantizer, + biasih_quantizer, biashh_quantizer, output_quantizer) + hidden = self.quantize_gru_hidden(hidden, direct) + step_outputs.append(hidden) + last_batch_size = batch_len 
+ + final_hiddens.append(hidden) + ret_hidden = final_hiddens[::-1] + hy_list = [] + for each in ret_hidden: + hy_list.append(each) + hidden = torch.cat(hy_list, 0) + output = torch.cat(step_outputs, 0) + else: + input, batch_size_list = _unbind_packed(input, batch_sizes) + input = input[::-1] #按照时间t 进行反转 + # input = torch.cat(input.split(1,0)[::-1]) if direct == 1 else input + batch_size_list = batch_size_list[::-1] + input_hx = copy.deepcopy(hidden) + last_batch_size = batch_size_list[0] + hidden = _slice(hidden, 0, last_batch_size) + for input_i,batch_len in zip(input, batch_size_list): + if last_batch_size != batch_len: + #获取input_hx高位hidden部分与上一帧的hidden进行填充,相当于补0 + hidden = hx_slice(input_hx, hidden, last_batch_size, batch_len) + hidden = self.qgru_cell_func(input_x, hidden, weight_ih, weight_hh, bias_ih, bias_hh, self.training, + input_quantizer, hidden_quantizer, weightih_quantizer, weighthh_quantizer, + biasih_quantizer, biashh_quantizer, output_quantizer) + hidden = self.quantize_gru_hidden(hidden, direct) + step_outputs.append(hidden) + last_batch_size = batch_len + + step_outputs = step_outputs[::-1] + output = torch.cat(step_outputs, 0) + + output = self.quantize_gru_out(output, direct) + return output, hidden + + def _generate_hiddens(self, hx): + if hx is not None: + hidden_list = _unbind(hx) + length = len(hidden_list) + if self.bidirectional: + assert length / self.num_layers%2 == 0, 'hidden len must be double in bidirectional mode' + i = 0 + hiddens = [] + while i < length: + if self.bidirectional: + hiddens.append((hidden_list[i], hidden_list[i+1])) + i = i + 2 + else: + hiddens.append(hidden_list[i]) + i = i + 1 + else: + hiddens = None + return hiddens + diff --git a/linger/quant/ops/qmodule/qlayernorm.py b/linger/quant/ops/qmodule/qlayernorm.py new file mode 100644 index 0000000..d7b7c2e --- /dev/null +++ b/linger/quant/ops/qmodule/qlayernorm.py @@ -0,0 +1,137 @@ +import math + +import torch +import torch.nn as nn +import torch.nn.functional as 
F +from typing import Optional, Union, Dict, Any + +from .qmodule import QModuleMixin +from ..qconfig import register_qmodule +from ...quantizer import WQuantizer, AQuantizer, BQuantizer +from ....config import QUANT_CONFIGS + +import lingerext + +class QLayerNormFunction(torch.autograd.Function): + """QLayerNormFunction + input---quant_data + weight---quant_data + bias---float_data + """ + @staticmethod + def forward(ctx, input, weight, bias, normalized_shape, eps, + input_quantizer, weight_quantizer, bias_quantizer, training): + + # === 1 reshape input to (M, N) + N = 1 + for d in normalized_shape: + N *= d + M = input.numel() // N + + scale_x = input_quantizer.scale + scale_w = weight_quantizer.scale + + q_input = input_quantizer.quant_round(input * scale_x, input_quantizer.round_mode) + q_input = q_input.clamp(input_quantizer.quant_min, input_quantizer.quant_max) + q_weight = weight_quantizer.quant_round(weight * scale_w, weight_quantizer.round_mode) + q_weight = q_weight.clamp(weight_quantizer.quant_min, weight_quantizer.quant_max) + x_2d = q_input.contiguous().view(M, N).long().long() + + # === 2 调用底层算子 + sum_x = x_2d.clone().sum(-1, keepdim=True) + sum_x2 = x_2d.clone().pow(2).sum(-1, keepdim=True) + denominator = N * sum_x2 - sum_x * sum_x + scale_eps = 2 * scale_x.log2() + q_eps = math.floor(eps * pow(2, scale_eps) * N * N + 0.5) + # q_eps = input_quantizer.quant_round(eps * pow(2, scale_eps) * N * N, input_quantizer.round_mode) + # q_eps = (torch.clamp(q_eps, input_quantizer.quant_min, input_quantizer.quant_max)).long() + denominator = denominator + q_eps + numerator = N * x_2d + numerator = numerator - sum_x + q_y_normal = lingerext.qlayernorm_kernel_forward(numerator.int(), denominator.long(), math.log2(scale_x.data)) + scale_y_normal = 2**10 + + # === 3 reshape back to original + q_y_normal = q_y_normal.view(input.shape) + q_y_normal.clamp_(-2**15, 2**15-1) + q_output = q_y_normal * q_weight + if bias is not None: + q_bias = 
bias_quantizer.quant_round(bias * scale_y_normal * scale_w, bias_quantizer.round_mode) + q_bias = torch.clamp(q_bias, bias_quantizer.quant_min, bias_quantizer.quant_max) + else: + q_bias = None + + q_output = q_output + q_bias + q_output.clamp_(-2**31, 2**31-1) + outputs = q_output.float() / (scale_y_normal * scale_w) + + saved_tensors = [] + if training: + ctx.normalized_shape = normalized_shape + ctx.eps = eps + ctx.scale_x = scale_x + ctx.scale_w = scale_w + saved_tensors += [input, weight, bias] + ctx.save_for_backward(*saved_tensors) + + return outputs + + @staticmethod + def backward(ctx, gradOutput): + input, weight, bias = ctx.saved_tensors + normalized_shape = ctx.normalized_shape + eps = ctx.eps + + input = input.detach().requires_grad_(True) + weight = weight.detach().requires_grad_(True) + bias = bias.detach().requires_grad_(True) + + with torch.enable_grad(): + z = F.layer_norm(input, normalized_shape, weight, bias, eps) + grads = torch.autograd.grad( + z, (input, weight, bias) if bias is not None else (input, weight), + gradOutput, + retain_graph=False, + allow_unused=True + ) + + gradInput = grads[0] + gradWeight = grads[1] + gradBias = grads[2] if bias is not None else None + + return gradInput, gradWeight, gradBias, None, None, None, None, None, None + +@register_qmodule(torch.nn.LayerNorm) +class QLayerNorm(QModuleMixin, nn.LayerNorm): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[torch.device] = None, + ): + return cls( + normalized_shape = module.normalized_shape, + eps = module.eps, + elementwise_affine = module.elementwise_affine, + # bias = module.bias is not None, + device = device, + dtype = module.weight.dtype, + activations_cfg = activations_cfg, + weights_cfg = weights_cfg, + bias_cfg = bias_cfg, + constrain = constrain + ) + + 
def qforward(self, input): + if QUANT_CONFIGS.calibration: + return F.layer_norm(input, self.normalized_shape, self.qweight, self.qbias, self.eps) + else: + return QLayerNormFunction.apply( + input, self.qweight, self.bias, self.normalized_shape, self.eps, + self.input_quantizer, self.weight_quantizer, self.bias_quantizer, self.training + ) + diff --git a/linger/quant/ops/qmodule/qlinear.py b/linger/quant/ops/qmodule/qlinear.py new file mode 100644 index 0000000..d0d0367 --- /dev/null +++ b/linger/quant/ops/qmodule/qlinear.py @@ -0,0 +1,37 @@ +import math + +import torch +import torch.nn as nn +import torch.nn.functional as F + +from .qmodule import QModuleMixin +from ..qconfig import register_qmodule +from typing import Optional, Union, Dict, Any + +@register_qmodule(torch.nn.Linear) +class QLinear(QModuleMixin, nn.Linear): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[torch.device] = None, + ): + return cls( + module.in_features, + module.out_features, + module.bias is not None, + dtype=module.weight.dtype, + device=device, + activations_cfg=activations_cfg, + weights_cfg=weights_cfg, + bias_cfg=bias_cfg if module.bias is not None else None, + constrain = constrain, + ) + + def qforward(self, input): + return F.linear(input, self.qweight, bias=self.qbias) + diff --git a/linger/quant/ops/qmodule/qlstm.py b/linger/quant/ops/qmodule/qlstm.py new file mode 100644 index 0000000..c6f6465 --- /dev/null +++ b/linger/quant/ops/qmodule/qlstm.py @@ -0,0 +1,578 @@ +import math +import copy +import torch +import torch.nn as nn +import torch.nn.functional as F + +from .qmodule import QModuleMixin +from ..qconfig import register_qmodule +from typing import Optional, Union, Dict, Any +from torch.nn.utils.rnn import PackedSequence + +from ..qtensor import QSigmoidFunction 
+from ...qtensor import QTensor, from_tensor_to_qtensor, from_qtensor_to_tensor +from ...quantizer import WQuantizer, AQuantizer, BQuantizer +from ....config import QUANT_CONFIGS +from ....utils import _unbind, _unbind_packed, _slice, hx_slice, QatMethod, PlatForm + +import lingerext + +def luna_requant(x_int, scale_x, scale_y): + l_scale = scale_y - scale_x + + if l_scale > 0: + x_int = x_int * pow(2, l_scale) + else: + x_int = (x_int * pow(2, l_scale) + 0.5).floor().int() + return x_int + +class QLSTMSigmoidFunction(torch.autograd.Function): + @staticmethod + def forward(ctx, input): + ctx.save_for_backward(input) + + # 转换为Q27格式的int32 + i_q27 = (input * (1 << 27) + 0.5).floor().to(torch.int32) # float到int32需要2次舍入,这是第二次 + i_q27.clamp_(-2**31, 2**31-1) + + output_q31 = None + if QUANT_CONFIGS.platform == PlatForm.venus: + output_q31 = lingerext.venusa_qsigmoid_forward(i_q27.contiguous()) + elif QUANT_CONFIGS.platform == PlatForm.arcs or QUANT_CONFIGS.platform == PlatForm.mars: + output_q31 = lingerext.venusa_qsigmoid_forward(i_q27.contiguous()) + elif QUANT_CONFIGS.platform == PlatForm.venusA: + output_q31 = lingerext.venusa_qsigmoid_forward(i_q27.contiguous()) + + output_q15 = luna_requant(output_q31.int(), 31, 15) + output = output_q15.float() / (1 << 15) + return output + + @staticmethod + def backward(ctx, grad_output): + input, = ctx.saved_tensors + + # 使用标准sigmoid的梯度近似 + input = input.detach().clone().requires_grad_(True) + + with torch.enable_grad(): + y = F.sigmoid(input) + gradInput = torch.autograd.grad(y, input, grad_output) + + return gradInput + +class QLSTMTanhFunction(torch.autograd.Function): + @staticmethod + def forward(ctx, input): + ctx.save_for_backward(input) + + # 转换为Q27格式的int32 + i_q27 = (input * (1 << 27) + 0.5).floor().to(torch.int32) # float到int32需要2次舍入,这是第二次 + i_q27.clamp_(-2**31, 2**31-1) + + output_q31 = None + if QUANT_CONFIGS.platform == PlatForm.venus: + output_q31 = lingerext.venusa_qtanh_forward(i_q27.contiguous()) + elif 
QUANT_CONFIGS.platform == PlatForm.arcs or QUANT_CONFIGS.platform == PlatForm.mars: + output_q31 = lingerext.arcs_qtanh_forward(i_q27.contiguous()) + elif QUANT_CONFIGS.platform == PlatForm.venusA: + output_q31 = lingerext.venusa_qtanh_forward(i_q27.contiguous()) + + output_q15 = luna_requant(output_q31.int(), 31, 15) + output = output_q15.float() / (1 << 15) + return output + + @staticmethod + def backward(ctx, grad_output): + input, = ctx.saved_tensors + + # 使用标准tanh的梯度近似 + input = input.detach().clone().requires_grad_(True) + + with torch.enable_grad(): + y = F.tanh(input) + gradInput = torch.autograd.grad(y, input, grad_output) + + return gradInput + + +class QLSTMCell(nn.Module): + def forward(self, input_x, hidden, cx, weight_ih, weight_hh, bias_ih, bias_hh, training, + input_quantizer, hidden_quantizer, weightih_quantizer, weighthh_quantizer, + biasih_quantizer, biashh_quantizer, cell_quantizer): + # step1 input.mul, hidden.mul, fake_quant to keep same value + gi_output = F.linear(input_x, weight_ih, bias_ih) + gh_output = F.linear(hidden, weight_hh, bias_hh) + + scale_gi = input_quantizer.scale * weightih_quantizer.scale + scale_gh = hidden_quantizer.scale * weighthh_quantizer.scale + gi_output = biasih_quantizer(gi_output, scale_gi) # fake_quant gi_output + gh_output = biashh_quantizer(gh_output, scale_gh) + + gates = gi_output + gh_output # 这一步推理时没有舍入 + ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1) + ingate = QLSTMSigmoidFunction.apply(ingate) + forgetgate = QLSTMSigmoidFunction.apply(forgetgate) + cellgate = QLSTMTanhFunction.apply(cellgate) + outgate = QLSTMSigmoidFunction.apply(outgate) + # sigmoid和tanh结果均为Q15伪量化之后结果,已经包含舍入操作 + + # fake_quant cx + scale_cell = torch.tensor(2**15, dtype=torch.float32) + new_cx = cell_quantizer(cx, scale_cell) + cy1 = new_cx * forgetgate + cy2 = ingate * cellgate + cy = cy1 + cy2 + hy = QLSTMTanhFunction.apply(cy) #包含一次舍入 + hy = hy * outgate + cy = cell_quantizer(cy, scale_cell) + + return hy, cy + 
+@register_qmodule(torch.nn.LSTM) +class QLSTM(QModuleMixin, nn.LSTM): + @classmethod + def qcreate( + cls, + module, + activations_cfg: Optional[Dict[str, Any]] = None, + weights_cfg: Optional[Dict[str, Any]] = None, + bias_cfg: Optional[Dict[str, Any]] = None, + constrain: Optional[Dict[str, Any]] = None, + device: Optional[torch.device] = None, + ): + lstm_module = cls( + module.input_size, + module.hidden_size, + module.num_layers, + module.bias, + module.batch_first, + module.dropout, + module.bidirectional, + module.proj_size, + + dtype=module.weight_ih_l0.dtype, + device=device, + activations_cfg=activations_cfg, + weights_cfg=None, # 仅打开输入输出的量化 + bias_cfg=None, + constrain=constrain, + open_ihook = False, + open_ohook = False + ) + + lstm_module.add_module("input_quantizer", AQuantizer(activations_cfg, None)) + lstm_module.add_module("weightih_quantizer", WQuantizer(weights_cfg, constrain)) + lstm_module.add_module("weighthh_quantizer", WQuantizer(weights_cfg, constrain)) + lstm_module.add_module("hidden_quantizer", AQuantizer(activations_cfg, None)) + lstm_module.add_module("cell_quantizer", BQuantizer(bias_cfg, None)) # cell量化器,用于保持推理一致性 + lstm_module.add_module("output_quantizer", AQuantizer(activations_cfg, constrain)) + if module.bias: + lstm_module.add_module("biasih_quantizer", BQuantizer(bias_cfg, constrain)) + lstm_module.add_module("biashh_quantizer", BQuantizer(bias_cfg, constrain)) + else: + lstm_module.add_module("biasih_quantizer", None) + lstm_module.add_module("biashh_quantizer", None) + + if module.bidirectional: + lstm_module.add_module("weightih_reverse_quantizer", WQuantizer(weights_cfg, constrain)) + lstm_module.add_module("weighthh_reverse_quantizer", WQuantizer(weights_cfg, constrain)) + lstm_module.add_module("hidden_reverse_quantizer", AQuantizer(activations_cfg, None)) + lstm_module.add_module("output_reverse_quantizer", AQuantizer(activations_cfg, constrain)) + if module.bias: + lstm_module.add_module("biasih_reverse_quantizer", 
BQuantizer(bias_cfg, constrain)) + lstm_module.add_module("biashh_reverse_quantizer", BQuantizer(bias_cfg, constrain)) + else: + lstm_module.add_module("biasih_reverse_quantizer", None) + lstm_module.add_module("biashh_reverse_quantizer", None) + + lstm_module.weight_ih_l0 = module.weight_ih_l0 + lstm_module.weight_hh_l0 = module.weight_hh_l0 + lstm_module.bias_ih_l0 = module.bias_ih_l0 + lstm_module.bias_hh_l0 = module.bias_hh_l0 + + if module.bidirectional: + lstm_module.weight_ih_l0_reverse = module.weight_ih_l0_reverse + lstm_module.weight_hh_l0_reverse = module.weight_hh_l0_reverse + lstm_module.bias_ih_l0_reverse = module.bias_ih_l0_reverse + lstm_module.bias_hh_l0_reverse = module.bias_hh_l0_reverse + + lstm_module.qlstm_cell_func = QLSTMCell() + + return lstm_module + + @property + def qweight_ih_hh(self): + fake_w_ih = self.weightih_quantizer(self.weight_ih_l0) + fake_w_hh = self.weighthh_quantizer(self.weight_hh_l0) + return fake_w_ih, fake_w_hh + + @property + def qweight_ih_hh_reverse(self): + if self.bidirectional: + fake_w_ih_r = self.weightih_reverse_quantizer(self.weight_ih_l0_reverse) + fake_w_hh_r = self.weighthh_reverse_quantizer(self.weight_hh_l0_reverse) + return fake_w_ih_r, fake_w_hh_r + return None, None + + @property + def qbias_ih_hh(self): + if self.biasih_quantizer is None: + return self.bias_ih_l0, self.bias_hh_l0 + fake_bias_ih = self.biasih_quantizer(self.bias_ih_l0, self.weightih_quantizer.scale * self.input_quantizer.scale) + fake_bias_hh = self.biashh_quantizer(self.bias_hh_l0, self.weighthh_quantizer.scale * self.hidden_quantizer.scale) + return fake_bias_ih, fake_bias_hh + + @property + def qbias_ih_hh_reverse(self): + if self.bidirectional: + if self.biasih_quantizer is None: + return self.bias_ih_l0_reverse, self.bias_hh_l0_reverse + fake_bias_ih_r = self.biasih_reverse_quantizer(self.bias_ih_l0_reverse, self.weightih_reverse_quantizer.scale * self.input_quantizer.scale) + fake_bias_hh_r = 
self.biashh_reverse_quantizer(self.bias_hh_l0_reverse, self.weighthh_reverse_quantizer.scale * self.hidden_reverse_quantizer.scale) + return fake_bias_ih_r, fake_bias_hh_r + return None, None + + def quantize_lstm_input(self, input: torch.Tensor) -> torch.Tensor: + if isinstance(input, QTensor): + fake_input = from_qtensor_to_tensor(input) + self.input_quantizer.scale.fill_(input.scale.detach()) + self.input_quantizer.data_bits = input.data_bits + else: + fake_input = self.input_quantizer(input) # 前向过程中会更新input_quantizer的scale + return fake_input + + def quantize_lstm_hidden(self, hidden: torch.Tensor, direct, scale=None) -> torch.Tensor: + if direct: + fake_hidden = self.hidden_reverse_quantizer(hidden, scale) + else: + fake_hidden = self.hidden_quantizer(hidden, scale) + return fake_hidden + + def quantize_lstm_out(self, input: torch.Tensor, direct) -> torch.Tensor: + if direct: + fake_out = self.output_reverse_quantizer(input) + else: + fake_out = self.output_quantizer(input) + return fake_out + # return from_tensor_to_qtensor(fake_out, self.output_quantizer.scale, self.output_quantizer.data_bits) + + def qforward(self, input, *args, **kwargs): + hx = None if len(args) == 0 else args[0] + if QUANT_CONFIGS.calibration: + return self.forward_calibrate(input, hx) + else: + return self.forward_train(input, hx) + + + def forward_calibrate(self, input, hx=None): + with torch.no_grad(): + if self.num_layers != 1: + assert False, "Intx-NormalizeLSTM don't support num_layer!=1 !" 
+ orig_input = input + if isinstance(orig_input, PackedSequence): + input, batch_sizes, sorted_indices, unsorted_indices = orig_input + max_batch_size = batch_sizes[0] + max_batch_size = int(max_batch_size) + elif isinstance(orig_input, tuple): + input, lengths, batch_first, enforce_sorted = orig_input + packed_input = torch.nn.utils.rnn.pack_padded_sequence( + input, lengths, batch_first, enforce_sorted) + input, batch_sizes, sorted_indices, unsorted_indices = packed_input + max_batch_size = batch_sizes[0] + max_batch_size = int(max_batch_size) + else: + batch_sizes = None + max_batch_size = input.size( + 0) if self.batch_first else input.size(1) + sorted_indices = None + unsorted_indices = None + + if hx is None: + num_directions = 2 if self.bidirectional else 1 + zeros = torch.zeros(self.num_layers * num_directions, + max_batch_size, self.hidden_size, + dtype=input.dtype, device=input.device) + hx = (zeros, zeros) + else: + hx = self.permute_hidden(hx, sorted_indices) + + input = self.quantize_lstm_input(input) + + self.check_forward_args(input, hx, batch_sizes) + + w_ih, w_hh = self.qweight_ih_hh + b_ih, b_hh = self.qbias_ih_hh + output, hidden = torch.ops.aten.lstm( + input, hx, [w_ih, w_hh, b_ih, b_hh], + has_biases=self.bias, + num_layers=self.num_layers, + dropout=self.dropout, + train=self.training, + bidirectional=False, + batch_first=self.batch_first + ) + hidden = hidden[0] + cell = hidden[1] + # hidden = self.quantize_lstm_hidden(hidden, 0) + output = self.quantize_lstm_out(output, 0) + + if self.bidirectional: + input_r = input.flip(0) + w_ih_r, w_hh_r = self.qweight_ih_hh_reverse + b_ih_r, b_hh_r = self.qbias_ih_hh_reverse + output_r, hidden_r = torch.ops.aten.lstm( + input_r, hx, [w_ih_r, w_hh_r, b_ih_r, b_hh_r], + has_biases=self.bias, + num_layers=self.num_layers, + dropout=self.dropout, + train=self.training, + bidirectional=False, + batch_first=self.batch_first + ) + hidden_r = hidden_r[0] + cell_r = hidden_r[1] + # hidden_r = 
self.quantize_lstm_hidden(hidden_r, 1) + output_r = self.quantize_lstm_out(output_r, 1) + output = torch.cat((output, output_r), -1) + hidden = torch.cat((hidden, hidden_r), -1) + cell = torch.cat((cell, cell_r), -1) + + hidden = (hidden, cell) + output = from_tensor_to_qtensor(output, self.output_quantizer.scale, self.output_quantizer.data_bits) + + if isinstance(orig_input, PackedSequence): + output_packed = PackedSequence( + output, batch_sizes, sorted_indices, unsorted_indices) + return output_packed, self.permute_hidden(hidden, unsorted_indices) + elif isinstance(orig_input, tuple): + output_packed = PackedSequence( + output, batch_sizes, sorted_indices, unsorted_indices) + output, lengths = torch.nn.utils.rnn.pad_packed_sequence( + output_packed, self.batch_first) + return (output, lengths), self.permute_hidden(hidden, unsorted_indices) + else: + return output, self.permute_hidden(hidden, unsorted_indices) + + def forward_train(self, input, hx=None): + orig_input = input + if isinstance(orig_input, PackedSequence): + input, batch_sizes, sorted_indices, unsorted_indices = orig_input + max_batch_size = batch_sizes[0] + max_batch_size = int(max_batch_size) + elif isinstance(orig_input, tuple): + input, lengths, batch_first, enforce_sorted = orig_input + packed_input = torch.nn.utils.rnn.pack_padded_sequence( + input, lengths, batch_first, enforce_sorted) + input, batch_sizes, sorted_indices, unsorted_indices = packed_input + max_batch_size = batch_sizes[0] + max_batch_size = int(max_batch_size) + else: + batch_sizes = None + max_batch_size = input.size(0) if self.batch_first else input.size(1) + sorted_indices = None + unsorted_indices = None + + assert self.num_layers == 1, 'invalid num_layers, now only support num_layers = 1' + + self.quantize_lstm_input(input) + + # init hidden + if hx is None: + num_directions = 2 if self.bidirectional else 1 + zeros = torch.zeros(self.num_layers * num_directions, max_batch_size, self.hidden_size, dtype=input.dtype, 
device=input.device) + hx = (zeros, zeros) + self.quantize_lstm_hidden(hx[0][0], 0, self.input_quantizer.scale) # init hidden_quantizer + if self.bidirectional: + self.quantize_lstm_hidden(hx[0][1], 1, self.input_quantizer.scale) + else: + # Each batch of the hidden state should match the input sequence that + # the user believes he/she is passing in. + hx = self.permute_hidden(hx, sorted_indices) + self.quantize_lstm_hidden(hx[0][0], 0, None) + if self.bidirectional: + self.quantize_lstm_hidden(hx[0][1], 1, None) + + self.check_forward_args(input, hx, batch_sizes) + + # forward + if batch_sizes is None: + output, hy, cy = self.forward_input_tensor(input, hx) + else: + output, hy, cy = self.forward_input_packed(input, hx, batch_sizes) + hidden = (hy, cy) + + output = from_tensor_to_qtensor(output, self.output_quantizer.scale, self.output_quantizer.data_bits) + + if isinstance(orig_input, PackedSequence): + output_packed = PackedSequence( + output, batch_sizes, sorted_indices, unsorted_indices) + return output_packed, self.permute_hidden(hidden, unsorted_indices) + elif isinstance(orig_input, tuple): + output_packed = PackedSequence( + output, batch_sizes, sorted_indices, unsorted_indices) + output, lengths = torch.nn.utils.rnn.pad_packed_sequence( + output_packed, self.batch_first) + return (output, lengths), self.permute_hidden(hidden, unsorted_indices) + else: + return output, self.permute_hidden(hidden, unsorted_indices) + + def forward_input_packed(self, input, hx, batch_sizes=None): + hiddens = self._generate_hiddens(hx) + output, hr, ct = self.lstm_forward(input, hiddens, batch_sizes) + return output, hr, ct + + def forward_input_tensor(self, input, hx): + # Convert input to (seq_len, batch_size, input_size) + input = input.transpose(0, 1) if self.batch_first else input + hiddens = self._generate_hiddens(hx) + output, hr, ct = self.lstm_forward(input, hiddens) + output = output.transpose(0, 1) if self.batch_first else output + return output, hr, ct + + def 
lstm_forward(self, input, hiddens, batch_sizes=None): + final_hiddens = [] + # Go through layers + for layer_num in range(self.num_layers): + hid = hiddens[layer_num] if hiddens is not None else None + output, hc = self._bidirection(input, layer_num, hid, batch_sizes) if self.bidirectional else self._single_direction(input, layer_num, hid, batch_sizes) + final_hiddens.extend(hc) + ## add dropout + if (self.dropout!= 0 and self.training and layer_num < self.num_layers - 1): + output = torch.nn.functional.dropout(output, self.dropout) + + hy = [hidden[0] for hidden in final_hiddens] + cy = [hidden[1] for hidden in final_hiddens] + hy = torch.stack(hy, 0) + cy = torch.stack(cy, 0) + + return output, hy, cy + + def _single_direction(self, input, layer, hx, batch_sizes = None): + hidden = hx[0] + cell_state = hx[1] + output, hidden = self._run_single_direction(input, hidden, cell_state, layer, direct=0, batch_sizes=batch_sizes) + return output, [hidden] + + def _bidirection(self, input, layer, hx, batch_sizes = None): + hx_f = hx[0][0] + ct_f = hx[0][1] + hx_b = hx[1][0] + ct_b = hx[1][1] + fw_output, fw_hidden = self._run_single_direction(input, hx_f, ct_f, layer, direct=0, batch_sizes=batch_sizes) + rev_output, rev_hidden = self._run_single_direction(input, hx_b, ct_b, layer, direct=1, batch_sizes=batch_sizes) + if batch_sizes is None: + output = torch.cat((fw_output, rev_output), fw_output.dim()-1) + else: #packed sequence + output = torch.cat((fw_output, rev_output), -1) + return output, [fw_hidden, rev_hidden] + + def _run_single_direction(self, input, hidden, cell_state, layer=0, direct=0, batch_sizes=None): + # bidirection quantizer + input_quantizer = self.input_quantizer + cell_quantizer = self.cell_quantizer + hidden_quantizer = self.hidden_quantizer if direct == 0 else self.hidden_reverse_quantizer + weightih_quantizer = self.weightih_quantizer if direct == 0 else self.weightih_reverse_quantizer + weighthh_quantizer = self.weighthh_quantizer if direct == 0 
else self.weighthh_reverse_quantizer + biasih_quantizer = self.biasih_quantizer if direct == 0 else self.biasih_reverse_quantizer + biashh_quantizer = self.biashh_quantizer if direct == 0 else self.biashh_reverse_quantizer + output_quantizer = self.output_quantizer if direct == 0 else self.output_reverse_quantizer + + # fake_quant hidden, weight, bias + weight_ih, weight_hh = self.qweight_ih_hh if direct == 0 else self.qweight_ih_hh_reverse + bias_ih, bias_hh = self.qbias_ih_hh if direct == 0 else self.qbias_ih_hh_reverse + + step_outputs = [] + + if batch_sizes is None: + # input = input if direct == 0 else torch.cat(input.split(1,0)[::-1]) + input = input if direct == 0 else input.flip(0).contiguous() + for input_x in input: + hidden, cell_state = self.qlstm_cell_func(input_x, hidden, cell_state, weight_ih, weight_hh, bias_ih, bias_hh, self.training, + input_quantizer, hidden_quantizer, weightih_quantizer, weighthh_quantizer, + biasih_quantizer, biashh_quantizer, cell_quantizer) + hidden = self.quantize_lstm_hidden(hidden, direct) #hidden fake_quant + + step_outputs.append(hidden) + + step_outputs = step_outputs[::-1] if direct == 1 else step_outputs + output = torch.stack(step_outputs, 0) + elif direct == 0: + final_hiddens = [] + hidden = copy.deepcopy(hidden) + cell_state = copy.deepcopy(cell_state) + #split by time + input, batch_size_list = _unbind_packed(input, batch_sizes) + last_batch_size = batch_size_list[0] + for input_i, batch_len in zip(input, batch_size_list): + inc = batch_len - last_batch_size + if inc < 0: + #按batch的帧长排完序,由长到短,较短的帧hidden计算的次数少,直接取低位保留 + final_hiddens.append(_slice((hidden, cell_state) ,batch_len, last_batch_size)) + hidden, cell_state = hx_slice(None, (hidden, cell_state), last_batch_size, batch_len) + hidden, cell_state = self.qlstm_cell_func(input_i, hidden, cell_state, weight_ih, weight_hh, bias_ih, bias_hh, self.training, + input_quantizer, hidden_quantizer, weightih_quantizer, weighthh_quantizer, + biasih_quantizer, 
biashh_quantizer, cell_quantizer) + hidden = self.quantize_lstm_hidden(hidden, direct) #hidden fake_quant + step_outputs.append(hidden) + last_batch_size = batch_len + final_hiddens.append((hidden, cell_state)) + ret_hidden = final_hiddens[::-1] + hy_list = [] + cy_list = [] + for each in ret_hidden: + hy_list.append(each[0]) + cy_list.append(each[1]) + hidden = torch.cat(hy_list, 0) + cell_state = torch.cat(cy_list, 0) + output = torch.cat(step_outputs, 0) + else: + input, batch_size_list = _unbind_packed(input, batch_sizes) + input = input[::-1] #按照时间t 进行反转 + # input = torch.cat(input.split(1,0)[::-1]) if direct == 1 else input + batch_size_list = batch_size_list[::-1] + input_hx = (copy.deepcopy(hidden), copy.deepcopy(cell_state)) + last_batch_size = batch_size_list[0] + + hidden = _slice(hidden, 0, last_batch_size) + cell_state = _slice(cell_state, 0, last_batch_size) + for input_i,batch_len in zip(input, batch_size_list): + if last_batch_size != batch_len: + #获取input_hx高位hidden部分与上一帧的hidden进行填充,相当于补0 + hidden, cell_state = hx_slice(input_hx, (hidden, cell_state), last_batch_size, batch_len) + hidden, cell_state = self.qlstm_cell_func(input_i, hidden, cell_state, weight_ih, weight_hh, bias_ih, bias_hh, self.training, + input_quantizer, hidden_quantizer, weightih_quantizer, weighthh_quantizer, + biasih_quantizer, biashh_quantizer, cell_quantizer) + hidden = self.quantize_lstm_hidden(hidden, direct) #hidden fake_quant + step_outputs.append(hidden) + last_batch_size = batch_len + + step_outputs = step_outputs[::-1] + output = torch.cat(step_outputs, 0) + + output = self.quantize_lstm_out(output, direct) + return output, (hidden, cell_state) + + + def _generate_hiddens(self, hx): + if hx is not None: + assert len(hx) == 2, 'hidden(tuple) input length must be 2' + hidden_list = _unbind(hx[0]) + cellstate_list = _unbind(hx[1]) + assert len(hidden_list) == len(cellstate_list) + length = len(hidden_list) + if self.bidirectional: + assert length/self.num_layers%2==0, 
@register_qmodule(nn.MaxPool1d)
class QMaxPool1d(QModuleMixin, nn.MaxPool1d):
    """Quantization-aware wrapper for ``nn.MaxPool1d``.

    Max pooling only selects existing values, so it preserves the input's
    quantization scale: a ``QTensor`` input is pooled in float and re-wrapped
    with the same scale and bit-width. No input/output fake-quant hooks are
    installed (``open_ihook``/``open_ohook`` are False).
    """

    @classmethod
    def qcreate(
        cls,
        module,
        activations_cfg: Optional[Dict[str, Any]] = None,
        weights_cfg: Optional[Dict[str, Any]] = None,
        bias_cfg: Optional[Dict[str, Any]] = None,
        constrain: Optional[Dict[str, Any]] = None,
        device: Optional[Dict[str, Any]] = None,
    ):
        """Clone the pooling hyper-parameters of ``module`` into a QMaxPool1d."""
        return cls(
            kernel_size=module.kernel_size,
            stride=module.stride,
            padding=module.padding,
            dilation=module.dilation,
            return_indices=module.return_indices,
            ceil_mode=module.ceil_mode,

            device=device,
            activations_cfg=activations_cfg,
            weights_cfg=None,
            constrain=None,
            bias_cfg=None,
            open_ihook=False,
            open_ohook=False,
        )

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        # BUGFIX: this module wraps MaxPool1d but previously called
        # F.max_pool2d (copy-paste from QMaxPool2d), which mis-handles
        # (N, C, L) inputs. Use the 1-d kernel in both branches.
        if isinstance(input, QTensor):
            tmp_input = from_qtensor_to_tensor(input)
            scale = input.scale.detach()
            data_bits = input.data_bits
            out = F.max_pool1d(
                tmp_input,
                self.kernel_size,
                self.stride,
                self.padding,
                self.dilation,
                ceil_mode=self.ceil_mode,
                return_indices=self.return_indices,
            )
            # Pooling is scale-preserving: reuse the input's scale/bits.
            return from_tensor_to_qtensor(out, scale, data_bits)
        else:
            return F.max_pool1d(
                input,
                self.kernel_size,
                self.stride,
                self.padding,
                self.dilation,
                ceil_mode=self.ceil_mode,
                return_indices=self.return_indices,
            )
@register_qmodule(nn.MaxPool2d)
class QMaxPool2d(QModuleMixin, nn.MaxPool2d):
    """Quantization-aware wrapper for ``nn.MaxPool2d``.

    Max pooling preserves the input's quantization scale, so a ``QTensor``
    input is pooled as a plain tensor and re-wrapped with its original scale
    and bit-width. No activation fake-quant hooks are installed.
    """

    @classmethod
    def qcreate(
        cls,
        module,
        activations_cfg: Optional[Dict[str, Any]] = None,
        weights_cfg: Optional[Dict[str, Any]] = None,
        bias_cfg: Optional[Dict[str, Any]] = None,
        constrain: Optional[Dict[str, Any]] = None,
        device: Optional[Dict[str, Any]] = None,
    ):
        """Copy ``module``'s pooling hyper-parameters into a QMaxPool2d."""
        pool_args = dict(
            kernel_size=module.kernel_size,
            stride=module.stride,
            padding=module.padding,
            dilation=module.dilation,
            return_indices=module.return_indices,
            ceil_mode=module.ceil_mode,
        )
        return cls(
            device=device,
            activations_cfg=activations_cfg,
            weights_cfg=None,
            constrain=None,
            bias_cfg=None,
            open_ihook=False,
            open_ohook=False,
            **pool_args,
        )

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        is_quantized = isinstance(input, QTensor)
        raw = from_qtensor_to_tensor(input) if is_quantized else input
        pooled = F.max_pool2d(
            raw,
            self.kernel_size,
            self.stride,
            self.padding,
            self.dilation,
            ceil_mode=self.ceil_mode,
            return_indices=self.return_indices,
        )
        if is_quantized:
            # Re-attach the (unchanged) scale and bit-width of the input.
            return from_tensor_to_qtensor(pooled, input.scale.detach(), input.data_bits)
        return pooled
class QModuleMixin(ABC):
    """Mixin that turns a ``torch.nn.Module`` subclass into a quantization-aware module.

    It attaches weight/bias quantizers and — via forward hooks — fake-quantizes
    the module's input and output so training observes quantization error.
    Subclasses combine it as ``class QFoo(QModuleMixin, nn.Foo)`` and implement
    ``qcreate`` (construction from a float module) and ``qforward`` (the
    quantized computation).
    """

    def __init__(
        self,
        *args,  # positional arguments forwarded to the wrapped torch.nn.Module's __init__
        device: Optional[torch.device] = None,
        weights_cfg: Optional[Dict[str, Any]] = None,  # quantization settings for the weight
        activations_cfg: Optional[Dict[str, Any]] = None,  # quantization settings for activations
        bias_cfg: Optional[Dict[str, Any]] = None,  # quantization settings for the bias
        constrain: Optional[Dict[str, Any]] = None,
        open_ihook: Optional[bool] = True,  # install the input fake-quant pre-hook
        open_ohook: Optional[bool] = True,  # install the output fake-quant hook
        **kwargs,
    ):
        mro = self.__class__.__mro__
        # The mixin only works when combined with a real torch.nn.Module subclass.
        if torch.nn.Module not in mro:
            raise TypeError("Quantized modules must inherit from a torch.nn.Module class")
        # The mixin must precede the nn.Module base so nn.Module gets initialized
        # through this __init__'s super() chain.
        if mro.index(__class__) > mro.index(torch.nn.Module):
            raise TypeError(
                "QModuleMixin must be placed before any torch.nn.Module class in quantized module inheritance."
            )
        # This will setup the torch.nn.Module
        super().__init__(*args, **kwargs)

        if weights_cfg is not None:
            self.weight_quantizer = WQuantizer(weights_cfg, constrain)
        if bias_cfg is not None:
            self.bias_quantizer = BQuantizer(bias_cfg, constrain)

        # Hook handles are kept so they could be removed later.
        self._quantize_hooks = {}
        if open_ihook:
            self.input_quantizer = AQuantizer(activations_cfg, None)
            self._quantize_hooks["input"] = self.register_forward_pre_hook(self.quantize_input)
        if open_ohook:
            self.output_quantizer = AQuantizer(activations_cfg, constrain)
            self._quantize_hooks["output"] = self.register_forward_hook(self.quantize_output)

    @classmethod
    def from_module(
        cls,
        module: torch.nn.Module,
        activations_cfg: Optional[Union[str]] = None,
        *args,
        **kwargs
    ):
        """Build the quantized counterpart of ``module`` and copy its weight/bias.

        Returns ``None`` when ``qcreate`` declines to wrap the module.
        """
        weights_cfg = kwargs.get('weights_cfg', None)
        bias_cfg = kwargs.get('bias_cfg', None)
        constrain = kwargs.get('constrain', None)
        device = QUANT_CONFIGS.device
        qmodule = cls.qcreate(module, activations_cfg, weights_cfg, bias_cfg, constrain, device=device)
        if qmodule is None:
            return None

        if hasattr(module, 'weight'):
            with torch.no_grad():
                # Share the float module's parameters (assignment, not copy).
                qmodule.weight = module.weight
                if hasattr(module, "bias") and module.bias is not None:
                    qmodule.bias = module.bias

        return qmodule.to(device)

    @classmethod
    def qcreate(
        cls,
        module: torch.nn.Module,
        activations_cfg: Optional[Union[str]] = None,
        weight_cfg: Optional[Union[str]] = None,
        bias_cfg: Optional[Union[str]] = None,
        constrain: Optional[Union[str]] = None,
        device: Optional[torch.device] = None,
    ):
        # Subclasses construct and return the quantized module instance here.
        raise NotImplementedError

    @property
    def qweight(self):
        """Fake-quantized weight; the raw weight when no weight quantizer exists."""
        if not hasattr(self, "weight_quantizer"):
            return self.weight
        fake_weight = self.weight_quantizer(self.weight)
        return fake_weight

    @property
    def qbias(self):
        """Fake-quantized bias; the bias scale is weight_scale * input_scale."""
        if not hasattr(self, "bias_quantizer") or self.bias is None:
            return self.bias
        fake_bias = self.bias_quantizer(self.bias, self.weight_quantizer.scale * self.input_quantizer.scale)
        return fake_bias

    def qforward(self, input: torch.Tensor, *args, **kwargs) -> torch.Tensor:
        # Subclasses implement the quantized forward computation here.
        raise NotImplementedError

    def forward(self, input: torch.Tensor, *args, **kwargs) -> torch.Tensor:
        """Run ``qforward`` normally; during ONNX export emit custom symbolic ops.

        RNN-like modules (GRU/LSTM) and Embedding get dedicated symbolic
        handling; everything else goes through ``QCustomOpSymbolic``.
        """
        if torch.onnx.is_in_onnx_export():
            weight = None if not hasattr(self, "weight_quantizer") else self.weight
            bias = None if not hasattr(self, "bias_quantizer") else self.bias
            qparam_dict = generate_onnx_qparam_dict(self, False)
            if 'GRU' in self._get_name() or 'LSTM' in self._get_name():
                self.input_quantizer.is_qtensor = True if isinstance(input, QTensor) else False
                if self.bidirectional:
                    out, _ = QCustomRNNSymbolic.apply(input, self.weight_ih_l0, self.weight_hh_l0, self.bias_ih_l0, self.bias_hh_l0, \
                        self.weight_ih_l0_reverse, self.weight_hh_l0_reverse, self.bias_ih_l0_reverse, self.bias_hh_l0_reverse, qparam_dict, self.input_quantizer.is_qtensor)
                else:
                    out, _ = QCustomRNNSymbolic.apply(input, self.weight_ih_l0, self.weight_hh_l0, self.bias_ih_l0, self.bias_hh_l0, None, None, None, None, qparam_dict, self.input_quantizer.is_qtensor)
                return from_tensor_to_qtensor(out, self.output_quantizer.scale, self.output_quantizer.data_bits), _
            elif 'Embedding' in self._get_name():
                # Embedding output scale follows the weight quantizer, not the output one.
                out = QCustomOpSymbolic.apply(input, weight, bias, qparam_dict, True)
                return from_tensor_to_qtensor(out, self.weight_quantizer.scale, self.weight_quantizer.data_bits)
            return QCustomOpSymbolic.apply(input, weight, bias, qparam_dict, self.input_quantizer.is_qtensor)
        else:
            return self.qforward(input, *args, **kwargs)

    def quantize_input(self, module: torch.nn.Module, input: torch.Tensor) -> torch.Tensor:
        """Forward pre-hook: fake-quantize the first positional input.

        A ``QTensor`` input was quantized upstream, so its scale/bit-width are
        adopted; a plain tensor is observed and fake-quantized here.
        """
        input = input[0]

        if isinstance(input, QTensor):
            fake_input = from_qtensor_to_tensor(input)
            self.input_quantizer.is_qtensor = True
            self.input_quantizer.scale.fill_(input.scale.detach())
            self.input_quantizer.data_bits = input.data_bits
        else:
            # The forward pass also updates input_quantizer's scale observation.
            fake_input = self.input_quantizer(input)
        return fake_input

    def quantize_output(
        self,
        module: torch.nn.Module,
        input: torch.Tensor,
        output: torch.Tensor,
    ) -> torch.Tensor:
        """Forward hook: fake-quantize the output and wrap it as a QTensor."""
        fake_out = self.output_quantizer(output)
        return from_tensor_to_qtensor(fake_out, self.output_quantizer.scale, self.output_quantizer.data_bits)

    def extra_repr(self):
        # Delegate to the matching nn.Module's extra_repr (name-based dispatch),
        # then append the quantization bit-widths and global quant settings.
        s = ''
        extra_s = ''
        if 'Conv1d' in self._get_name() or 'ConvBN1d' in self._get_name():
            s = nn.Conv1d.extra_repr(self)
        elif 'Conv2d' in self._get_name() or 'ConvBN2d' in self._get_name():
            s = nn.Conv2d.extra_repr(self)
        elif 'MaxPool1d' in self._get_name():
            s = nn.MaxPool1d.extra_repr(self)
        elif 'MaxPool2d' in self._get_name():
            s = nn.MaxPool2d.extra_repr(self)
        elif 'AvgPool1d' in self._get_name():
            s = nn.AvgPool1d.extra_repr(self)
        elif 'AvgPool2d' in self._get_name():
            s = nn.AvgPool2d.extra_repr(self)
        elif 'ConvTranspose1d' in self._get_name():
            s = nn.ConvTranspose1d.extra_repr(self)
        elif 'ConvTranspose2d' in self._get_name():
            s = nn.ConvTranspose2d.extra_repr(self)
        elif 'BatchNorm1d' in self._get_name():
            s = nn.BatchNorm1d.extra_repr(self)
        elif 'BatchNorm2d' in self._get_name():
            s = nn.BatchNorm2d.extra_repr(self)
        elif 'Linear' in self._get_name():
            s = nn.Linear.extra_repr(self)
        elif 'Relu' in self._get_name():
            s = nn.ReLU.extra_repr(self)
        elif 'GLU' in self._get_name():
            s = nn.GLU.extra_repr(self)
        elif 'LayerNorm' in self._get_name():
            s = nn.LayerNorm.extra_repr(self)
        elif 'GRU' in self._get_name():
            s = nn.GRU.extra_repr(self)
        elif 'LSTM' in self._get_name():
            s = nn.LSTM.extra_repr(self)
        elif 'Embedding' in self._get_name():
            s = nn.Embedding.extra_repr(self)

        if hasattr(self, "input_quantizer"):
            extra_s += ', data_bits:{}'.format(self.input_quantizer.data_bits)
        if hasattr(self, "output_quantizer"):
            extra_s += ', o_bits:{}'.format(self.output_quantizer.data_bits)
        if hasattr(self, "weight_quantizer"):
            extra_s += ', weight_bits:{}'.format(self.weight_quantizer.data_bits)
        if hasattr(self, "bias_quantizer"):
            extra_s += ', bias_bits:{}'.format(self.bias_quantizer.data_bits)
        extra_s += ', mode:{}'.format(QUANT_CONFIGS.quant_info.round_mode)
        extra_s += ', platform:{}'.format(QUANT_CONFIGS.platform.name)
        return s + extra_s

    def __repr__(self):
        # Compact one-line repr: only the first line of extra_repr is kept.
        extra_lines = []
        extra_repr = self.extra_repr()
        if extra_repr:
            extra_lines = extra_repr.split('\n')
        main_str = self._get_name() + '('
        if len(extra_lines) > 0:
            main_str += extra_lines[0]
        main_str += ')'
        return main_str
@register_qmodule(nn.ReLU)
class QRelu(QModuleMixin, nn.ReLU):
    """Quantization-aware ReLU.

    ReLU only zeroes negative entries, so the input's quantization scale stays
    valid: a ``QTensor`` input is activated in float and re-wrapped with the
    same scale and bit-width. No activation fake-quant hooks are installed.
    """

    @classmethod
    def qcreate(
        cls,
        module,
        activations_cfg: Optional[Dict[str, Any]] = None,
        weights_cfg: Optional[Dict[str, Any]] = None,
        bias_cfg: Optional[Dict[str, Any]] = None,
        constrain: Optional[Dict[str, Any]] = None,
        device: Optional[Dict[str, Any]] = None,
    ):
        """Copy ``module.inplace`` into a QRelu with hooks disabled."""
        return cls(
            inplace=module.inplace,

            device=device,
            activations_cfg=activations_cfg,
            weights_cfg=None,
            constrain=None,
            bias_cfg=None,
            open_ihook=False,
            open_ohook=False,
        )

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        if not isinstance(input, QTensor):
            return F.relu(input, inplace=self.inplace)
        activated = F.relu(from_qtensor_to_tensor(input), inplace=self.inplace)
        # Scale-preserving op: re-attach the input's scale and bit-width.
        return from_tensor_to_qtensor(activated, input.scale.detach(), input.data_bits)
class QAdd(QModuleTensor):
    r"""Quantized elementwise add.

    Training: publishes the smaller of the two input scales as the output
    quantizer's ``min_scale`` (consumed inside AQuantizer — see quantizer
    implementation). Eval: re-rounds each operand onto the output scale grid
    before adding, presumably to mirror the target's integer add — TODO confirm.
    """

    @classmethod
    def qcreate(
        cls,
        module: torch.nn.Module,
        activate_config: Optional[Dict[str, Any]] = None,
        num_input: int = 2,
        dim: int = -1
    ):
        # ``module`` and ``dim`` are unused: add always has two tensor inputs.
        return cls(
            activate_config = activate_config,
            num_input = num_input
        )

    def qforward(self, x, y):
        if self.training:
            # Record min(input scales) so the output scale can track the operands.
            self.output_quantizer.min_scale = torch.min(self.input_quantizer[0].scale, self.input_quantizer[1].scale)
        else:
            # Requantize operands whose scale differs from the output scale.
            if self.output_quantizer.scale != self.input_quantizer[0].scale:
                int_x = self.input_quantizer[0].quant_round(x * self.output_quantizer.scale, self.input_quantizer[0].round_mode)
                x = int_x / self.output_quantizer.scale
            if self.output_quantizer.scale != self.input_quantizer[1].scale:
                int_y = self.input_quantizer[1].quant_round(y * self.output_quantizer.scale, self.input_quantizer[1].round_mode)
                y = int_y / self.output_quantizer.scale
        return x + y
class QCat(QModuleTensor):
    r"""Quantized wrapper for ``torch.cat``."""

    @classmethod
    def qcreate(
        cls,
        module: torch.nn.Module,
        activate_config: Optional[Dict[str, Any]] = None,
        num_input: int = 2,
        dim: int = -1
    ):
        # is_cat=True makes QModuleTensor install the list-aware input pre-hook.
        return cls(
            activate_config=activate_config,
            num_input=num_input,
            is_cat=True,
        )

    def qforward(self, tensors, cat_dim):
        # Operands were already fake-quantized by the pre-hook; just concatenate.
        return torch.cat(tensors, dim=cat_dim)
class QSigmoidFunction(torch.autograd.Function):
    """Fixed-point sigmoid forward with a float-sigmoid gradient approximation.

    Forward rescales the fake-quantized input to Q27 int32 and evaluates the
    platform's sigmoid kernel (Q31 output, converted back to float). Backward
    approximates the gradient with the derivative of the ordinary float
    sigmoid.
    """

    @staticmethod
    def forward(ctx, input, input_quantizer):
        ctx.save_for_backward(input)

        x = input.contiguous()

        scale_x = input_quantizer.scale
        quant_x = input_quantizer.quant_round(x * scale_x, input_quantizer.round_mode)

        # Rescale the quantized value to Q27 fixed point (int32).
        l_scale = 27 - int(math.log2(scale_x.data))
        if l_scale > 0:
            x_q27 = (quant_x * pow(2, l_scale)).to(torch.int32)
        else:
            x_q27 = (quant_x * pow(2, l_scale) + 0.5).floor().to(torch.int32)
        x_q27.clamp_(-2**31, 2**31-1)

        output_q31 = None
        # NOTE(review): `venus` dispatches to the venusa kernel here, while
        # qsoftmax routes venus to the arcs kernel — confirm which is intended.
        if QUANT_CONFIGS.platform == PlatForm.venus:
            output_q31 = lingerext.venusa_qsigmoid_forward(x_q27.contiguous())
        elif QUANT_CONFIGS.platform == PlatForm.arcs or QUANT_CONFIGS.platform == PlatForm.mars:
            output_q31 = lingerext.arcs_qsigmoid_forward(x_q27.contiguous())
        elif QUANT_CONFIGS.platform == PlatForm.venusA:
            output_q31 = lingerext.venusa_qsigmoid_forward(x_q27.contiguous())

        # Q31 fixed point -> float.
        output = output_q31.float() / (1 << 31)

        return output

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors

        # Approximate with the float sigmoid's gradient.
        x = x.detach().clone().requires_grad_(True)
        with torch.enable_grad():
            y = torch.sigmoid(x)
            grads = torch.autograd.grad(y, x, grad_output)

        # BUGFIX: torch.autograd.grad returns a tuple of tensors; the previous
        # code returned that tuple itself as the input gradient, which autograd
        # rejects. Return the tensor, plus None for `input_quantizer`.
        return grads[0], None
    @staticmethod
    def forward(ctx, input, dim):
        """Fixed-point softmax along ``dim`` using the platform's Q25->Q15 kernel.

        The input is flattened to [outer, size] with the softmax axis last,
        converted to Q25 int32, run through the platform kernel, and the Q15
        result is converted back to float and restored to the original layout.
        """
        ctx.dim = dim
        ctx.save_for_backward(input)

        x = input.contiguous()
        # Normalize a negative axis index.
        dim = dim if dim >= 0 else dim + input.dim()

        # ---- Step 1: reshape input -> 2D tensor [outer, size] ----
        ndim = x.dim()
        size = x.size(dim)
        outer = int(x.numel() / size)
        permute_order = list(range(ndim))
        # Move the softmax dim to the last position.
        if dim != ndim - 1:
            permute_order[dim], permute_order[-1] = permute_order[-1], permute_order[dim]
            x = x.permute(permute_order)
        x_2d = x.reshape(outer, size)

        # Convert to Q25 fixed point (int32), rounding half up via +0.5/floor.
        x_q25 = (x_2d * (1 << 25) + 0.5).floor().to(torch.int32)
        x_q25.clamp_(-2**31, 2**31-1)

        output_q15 = None
        # NOTE(review): `venus` dispatches to the arcs kernel here, while
        # qsigmoid/qtanh route venus to the venusa kernel — confirm intended.
        if QUANT_CONFIGS.platform == PlatForm.venus:
            output_q15 = lingerext.arcs_qsoftmax_forward(x_q25, dim)
        elif QUANT_CONFIGS.platform == PlatForm.arcs or QUANT_CONFIGS.platform == PlatForm.mars:
            output_q15 = lingerext.arcs_qsoftmax_forward(x_q25, dim)
        elif QUANT_CONFIGS.platform == PlatForm.venusA:
            output_q15 = lingerext.venusa_qsoftmax_forward(x_q25, dim)

        # Q15 fixed point -> float.
        y = output_q15.float() / (1 << 15)

        # ---- reshape back to original shape ----
        output = y.reshape_as(x)
        if dim != ndim - 1:
            # Invert the earlier permutation.
            inv_perm = [0] * ndim
            for i, p in enumerate(permute_order):
                inv_perm[p] = i
            output = output.permute(inv_perm).contiguous()

        return output
class QTanhFunction(torch.autograd.Function):
    """Fixed-point tanh forward with a float-tanh gradient approximation.

    Forward rescales the fake-quantized input to Q27 int32 and evaluates the
    platform's tanh kernel (Q31 output, converted back to float). Backward
    approximates the gradient with the derivative of the ordinary float tanh.
    """

    @staticmethod
    def forward(ctx, input, input_quantizer):
        ctx.save_for_backward(input)

        x = input.contiguous()

        scale_x = input_quantizer.scale
        quant_x = input_quantizer.quant_round(x * scale_x, input_quantizer.round_mode)

        # Rescale the quantized value to Q27 fixed point (int32).
        l_scale = 27 - int(math.log2(scale_x.data))
        if l_scale > 0:
            x_q27 = (quant_x * pow(2, l_scale)).to(torch.int32)
        else:
            x_q27 = (quant_x * pow(2, l_scale) + 0.5).floor().to(torch.int32)
        x_q27.clamp_(-2**31, 2**31-1)

        output_q31 = None
        if QUANT_CONFIGS.platform == PlatForm.venus:
            output_q31 = lingerext.venusa_qtanh_forward(x_q27.contiguous())
        elif QUANT_CONFIGS.platform == PlatForm.arcs or QUANT_CONFIGS.platform == PlatForm.mars:
            output_q31 = lingerext.arcs_qtanh_forward(x_q27.contiguous())
        elif QUANT_CONFIGS.platform == PlatForm.venusA:
            output_q31 = lingerext.venusa_qtanh_forward(x_q27.contiguous())

        # Q31 fixed point -> float.
        output = output_q31.float() / (1 << 31)

        return output

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors

        # Approximate with the float tanh's gradient.
        x = x.detach().clone().requires_grad_(True)
        with torch.enable_grad():
            y = torch.tanh(x)
            grads = torch.autograd.grad(y, x, grad_output)

        # BUGFIX: torch.autograd.grad returns a tuple of tensors; the previous
        # code returned that tuple itself as the input gradient, which autograd
        # rejects (same defect as QSigmoidFunction; QSoftmaxFunction already
        # used grad[0]). Return the tensor, plus None for `input_quantizer`.
        return grads[0], None
class QTanh(QModuleTensor):
    """Quantized tanh layer: float tanh during calibration, fixed-point
    QTanhFunction otherwise."""

    @classmethod
    def qcreate(
        cls,
        module: torch.nn.Module,
        activate_config: Optional[Dict[str, Any]] = None,
        num_input: int = 1,
        dim: int = -1
    ):
        # ``module`` and ``dim`` are unused; tanh is unary.
        return cls(
            activate_config=activate_config,
            num_input=num_input,
        )

    def qforward(self, x, *args, **kwargs):
        # Calibration pass: use the float reference op.
        if QUANT_CONFIGS.calibration:
            return torch.tanh(x)
        quantizer = self.input_quantizer
        if isinstance(quantizer, nn.ModuleList):
            quantizer = quantizer[0]
        return QTanhFunction.apply(x, quantizer)
    def quantize_cat_input(self, module: torch.nn.Module, input_list: list) -> torch.Tensor:
        """Forward pre-hook for cat-style ops: fake-quantize every listed tensor.

        ``input_list`` is ``(sequence_of_tensors, dim)``; returns the same pair
        with each tensor passed through its own input quantizer.
        """
        device = QUANT_CONFIGS.device

        if not input_list:
            return torch.tensor([], device=device)

        # Build a new list of processed inputs (do not mutate the original tuple).
        processed_inputs = []

        for i in range(len(input_list[0])):
            current_input = input_list[0][i]

            # Promote plain scalars/sequences to float tensors.
            if not isinstance(current_input, torch.Tensor) and not isinstance(current_input, QTensor):
                current_input = torch.tensor(current_input, dtype=torch.float32, device=device)

            if isinstance(current_input, QTensor):
                # Already quantized upstream: adopt its scale/bit-width
                # instead of re-observing.
                tmp_input = from_qtensor_to_tensor(current_input)
                self.input_quantizer[i].scale.fill_(current_input.scale.detach())
                self.input_quantizer[i].data_bits = current_input.data_bits
            else:
                tmp_input = self.input_quantizer[i](current_input)
            processed_inputs.append(tmp_input)

        return tuple([processed_inputs, input_list[1]])
    def forward(self, input: torch.Tensor, *args, **kwargs) -> torch.Tensor:
        """Dispatch to ``qforward`` normally, or emit a custom ONNX op on export.

        ``input`` (and ``args[0]`` when present) were already fake-quantized by
        the ``quantize_input``/``quantize_cat_input`` pre-hook.
        """
        # Optional second operand (e.g. the rhs of add/mul, or the cat dim).
        other = None if len(args) == 0 else args[0]
        if torch.onnx.is_in_onnx_export():
            qparam_dict = generate_onnx_qparam_dict(self, True)
            if self.is_cat:
                # For cat the pre-hook packed (tensor_list, dim) into `input`.
                return QCustomOpSymbolic.apply(input[0], None, None, qparam_dict, input[1], other)
            return QCustomOpSymbolic.apply(input, None, None, qparam_dict, other)
        else:
            return self.qforward(input, other)
torch.onnx import is_in_onnx_export +from typing import Dict, Any, Optional +from torch.types import _dtype as DType + +from .qadd import QAdd +from .qmul import QMul +from .qbmm import QBmm +from .qmatmul import QMatmul +from .qcat import QCat +from .qsigmoid import QSigmoid +from .qtanh import QTanh +from .qsoftmax import QSoftmax +from ..qconfig import * +from ...qtensor import QTensor, from_tensor_to_qtensor, from_qtensor_to_tensor, qfallback +from ...quantizer import AQuantizer +from ....config import QUANT_CONFIGS + +# @register_qtensor_op([torch.ops.aten.add, torch.ops.aten.add_]) +# @register_qtensor_op([torch.add, torch._C.TensorBase.add, torch._C.TensorBase.add_]) +@register_qtensor_op([torch.add, torch.Tensor.add, torch.Tensor.add_]) +def add(op, input, other): + module_self = get_current_module() + if module_self is None: + return qfallback(op, input, other) + + iname_index = getattr(module_self, LINGER_QTENSOR_LAYER_COUNTER) + setattr(module_self, LINGER_QTENSOR_LAYER_COUNTER, iname_index+1) + var_name = LINGER_QTENSOR_LAYERS_PREIFX + '_qadd_' + str(iname_index) + + q_layer = None + if hasattr(module_self, var_name): + q_layer = getattr(module_self, var_name) + else: + q_layer = QAdd(activate_config = QUANT_CONFIGS.quant_info.to_dict(), num_input = 2) + q_layer.training = module_self.training + q_layer = q_layer.to(input.device) + setattr(module_self, var_name, q_layer) + output = q_layer(input, other) + return output + +# @register_qtensor_op([torch.ops.aten.mul, torch.ops.aten.mul_]) +@register_qtensor_op([torch.mul, torch.Tensor.mul, torch.Tensor.mul_]) +def mul(op, input, other): + module_self = get_current_module() + if module_self is None: + return qfallback(op, input, other) + + iname_index = getattr(module_self, LINGER_QTENSOR_LAYER_COUNTER) + setattr(module_self, LINGER_QTENSOR_LAYER_COUNTER, iname_index+1) + var_name = LINGER_QTENSOR_LAYERS_PREIFX + '_qmul_' + str(iname_index) + + q_layer = None + if hasattr(module_self, var_name): + q_layer 
= getattr(module_self, var_name) + else: + q_layer = QMul(activate_config = QUANT_CONFIGS.quant_info.to_dict(), num_input = 2) + q_layer.training = module_self.training + q_layer = q_layer.to(input.device) + setattr(module_self, var_name, q_layer) + output = q_layer(input, other) + return output + +@register_qtensor_op([torch.matmul]) +def matmul(op, input, other): + module_self = get_current_module() + if module_self is None: + return qfallback(op, input, other) + + iname_index = getattr(module_self, LINGER_QTENSOR_LAYER_COUNTER) + setattr(module_self, LINGER_QTENSOR_LAYER_COUNTER, iname_index+1) + var_name = LINGER_QTENSOR_LAYERS_PREIFX + '_qmatmul_' + str(iname_index) + + q_layer = None + if hasattr(module_self, var_name): + q_layer = getattr(module_self, var_name) + else: + q_layer = QMatmul(activate_config = QUANT_CONFIGS.quant_info.to_dict(), num_input = 2) + q_layer.training = module_self.training + q_layer = q_layer.to(input.device) + setattr(module_self, var_name, q_layer) + output = q_layer(input, other) + return output + +@register_qtensor_op([torch.Tensor.contiguous]) +def contiguous(op, input): + assert isinstance(input, QTensor), 'input is not QTensor' + if isinstance(input, QTensor): + tmp_input = from_qtensor_to_tensor(input) + scale = input.scale.detach() + data_bits = input.data_bits + out = op(tmp_input) + qtensor = from_tensor_to_qtensor(out, scale, data_bits) + return qtensor + return op(input, *size) + +@register_qtensor_op([torch.bmm, torch.Tensor.bmm]) +def bmm(op, input, other): + module_self = get_current_module() + if module_self is None: + return qfallback(op, input, other) + + iname_index = getattr(module_self, LINGER_QTENSOR_LAYER_COUNTER) + setattr(module_self, LINGER_QTENSOR_LAYER_COUNTER, iname_index+1) + var_name = LINGER_QTENSOR_LAYERS_PREIFX + '_qbmm_' + str(iname_index) + + q_layer = None + if hasattr(module_self, var_name): + q_layer = getattr(module_self, var_name) + else: + q_layer = QBmm(activate_config = 
QUANT_CONFIGS.quant_info.to_dict(), num_input = 2) + q_layer.training = module_self.training + q_layer = q_layer.to(input.device) + setattr(module_self, var_name, q_layer) + output = q_layer(input, other) + return output + +# @register_qtensor_op([torch.concat, torch.cat]) +@register_qtensor_op([torch.cat]) +def cat(op, input, dim = 0): + module_self = get_current_module() + if module_self is None: + return qfallback(op, input, dim) + + iname_index = getattr(module_self, LINGER_QTENSOR_LAYER_COUNTER) + setattr(module_self, LINGER_QTENSOR_LAYER_COUNTER, iname_index+1) + var_name = LINGER_QTENSOR_LAYERS_PREIFX + '_qcat_' + str(iname_index) + + q_layer = None + if hasattr(module_self, var_name): + q_layer = getattr(module_self, var_name) + else: + q_layer = QCat(activate_config = QUANT_CONFIGS.quant_info.to_dict(), num_input = 2, is_cat=True) + q_layer.training = module_self.training + q_layer = q_layer.to(input[0].device) + setattr(module_self, var_name, q_layer) + output = q_layer(input, dim) + return output + +@register_qtensor_op([torch.ops.aten.reshape]) +def reshape(op, input, *size): + assert isinstance(input, QTensor), 'input is not QTensor' + if isinstance(input, QTensor): + tmp_input = from_qtensor_to_tensor(input) + scale = input.scale.detach() + data_bits = input.data_bits + out = op(tmp_input, *size) + return from_tensor_to_qtensor(out, scale, data_bits) + return op(input, *size) + +@register_qtensor_op([torch.ops.aten.split]) +def split(op, input, split_size_or_sections, dim: int = 0): + assert isinstance(input, QTensor), 'input is not QTensor' + if isinstance(input, QTensor): + tmp_input = from_qtensor_to_tensor(input) + scale = input.scale.detach() + data_bits = input.data_bits + out = op(tmp_input, split_size_or_sections, dim) + qtensor_list = [] + for tensor in out: + qtensor = from_tensor_to_qtensor(tensor, scale, data_bits) + qtensor_list.append(qtensor) + return qtensor_list + # return from_tensor_to_qtensor(out, scale, data_bits) + return 
op(input, split_size_or_sections, dim) + +@register_qtensor_op([torch.ops.aten.flip]) +def flip(op, input, *size): + assert isinstance(input, QTensor), 'input is not QTensor' + if isinstance(input, QTensor): + tmp_input = from_qtensor_to_tensor(input) + scale = input.scale.detach() + data_bits = input.data_bits + out = op(tmp_input, *size) + return from_tensor_to_qtensor(out, scale, data_bits) + return op(input, *size) + +@register_qtensor_op([torch.ops.aten.view, torch.ops.aten._unsafe_view]) +def view(op, input, *size): + assert isinstance(input, QTensor), 'input is not QTensor' + if isinstance(input, QTensor): + tmp_input = from_qtensor_to_tensor(input) + scale = input.scale.detach() + data_bits = input.data_bits + out = op(tmp_input, *size) + return from_tensor_to_qtensor(out, scale, data_bits) + return op(input, *size) + +@register_qtensor_op([torch.ops.aten.transpose]) +def transpose(op, input, *size): + assert isinstance(input, QTensor), 'input is not QTensor' + if isinstance(input, QTensor): + tmp_input = from_qtensor_to_tensor(input) + scale = input.scale.detach() + data_bits = input.data_bits + out = op(tmp_input, *size) + return from_tensor_to_qtensor(out, scale, data_bits) + return op(input, *size) + +@register_qtensor_op([torch.ops.aten.permute]) +def permute(op, input, *size): + assert isinstance(input, QTensor), 'input is not QTensor' + if isinstance(input, QTensor): + tmp_input = from_qtensor_to_tensor(input) + scale = input.scale.detach() + data_bits = input.data_bits + out = op(tmp_input, *size) + return from_tensor_to_qtensor(out, scale, data_bits) + return op(input, *size) + +@register_qtensor_op([torch.Tensor.reshape]) +def reshape(op, input, *size): + assert isinstance(input, QTensor), 'input is not QTensor' + if isinstance(input, QTensor): + tmp_input = from_qtensor_to_tensor(input) + scale = input.scale.detach() + data_bits = input.data_bits + out = op(tmp_input, *size) + return from_tensor_to_qtensor(out, scale, data_bits) + return 
op(input, *size) + +@register_qtensor_op([torch.ops.aten.slice]) +def slice(op, input, *size): + assert isinstance(input, QTensor), 'input is not QTensor' + if isinstance(input, QTensor): + tmp_input = from_qtensor_to_tensor(input) + scale = input.scale.detach() + data_bits = input.data_bits + out = op(tmp_input, *size) + return from_tensor_to_qtensor(out, scale, data_bits) + return op(input, *size) + +@register_qtensor_op([torch.ops.aten.select, torch.select, torch.Tensor.select]) +def slice(op, input, *size): + assert isinstance(input, QTensor), 'input is not QTensor' + if isinstance(input, QTensor): + tmp_input = from_qtensor_to_tensor(input) + scale = input.scale.detach() + data_bits = input.data_bits + out = op(tmp_input, *size) + return from_tensor_to_qtensor(out, scale, data_bits) + return op(input, *size) + +@register_qtensor_op([torch.ops.aten.squeeze]) +def squeeze(op, input, *size): + assert isinstance(input, QTensor), 'input is not QTensor' + if isinstance(input, QTensor): + tmp_input = from_qtensor_to_tensor(input) + scale = input.scale.detach() + data_bits = input.data_bits + out = op(tmp_input, *size) + return from_tensor_to_qtensor(out, scale, data_bits) + return op(input, *size) + +@register_qtensor_op([torch.ops.aten.unsqueeze]) +def unsqueeze(op, input, *size): + assert isinstance(input, QTensor), 'input is not QTensor' + if isinstance(input, QTensor): + tmp_input = from_qtensor_to_tensor(input) + scale = input.scale.detach() + data_bits = input.data_bits + out = op(tmp_input, *size) + return from_tensor_to_qtensor(out, scale, data_bits) + return op(input, *size) + +@register_qtensor_op([torch.ops.aten.flatten]) +def flatten(op, input, *size): + assert isinstance(input, QTensor), 'input is not QTensor' + if isinstance(input, QTensor): + tmp_input = from_qtensor_to_tensor(input) + scale = input.scale.detach() + data_bits = input.data_bits + out = op(tmp_input, *size) + return from_tensor_to_qtensor(out, scale, data_bits) + return op(input, *size) 
+ +@register_qtensor_op([torch.ops.aten.__getitem__]) +def __getitem__(op, input, *size): + assert isinstance(input, QTensor), 'input is not QTensor' + if isinstance(input, QTensor): + tmp_input = from_qtensor_to_tensor(input) + scale = input.scale.detach() + data_bits = input.data_bits + out = op(tmp_input, *size) + return from_tensor_to_qtensor(out, scale, data_bits) + return op(input, *size) + +@register_qtensor_op([torch.ops.aten.pad]) +def pad(op, input, pad, mode: str = ..., value: Optional[float] = None): + assert isinstance(input, QTensor), 'input is not QTensor' + if isinstance(input, QTensor): + tmp_input = from_qtensor_to_tensor(input) + scale = input.scale.detach() + data_bits = input.data_bits + out = op(tmp_input, pad, mode, value) + return from_tensor_to_qtensor(out, scale, data_bits) + return op(input, pad, mode, value) + +@register_qtensor_op([torch.sigmoid, torch.sigmoid_, torch.Tensor.sigmoid, torch.Tensor.sigmoid_]) +def sigmoid(op, input): + module_self = get_current_module() + if module_self is None: + return qfallback(op, input) + + iname_index = getattr(module_self, LINGER_QTENSOR_LAYER_COUNTER) + setattr(module_self, LINGER_QTENSOR_LAYER_COUNTER, iname_index+1) + var_name = LINGER_QTENSOR_LAYERS_PREIFX + '_qsigmoid_' + str(iname_index) + + q_layer = None + if hasattr(module_self, var_name): + q_layer = getattr(module_self, var_name) + else: + q_layer = QSigmoid(activate_config = QUANT_CONFIGS.quant_info.to_dict(), num_input=1) + q_layer.training = module_self.training + q_layer = q_layer.to(input.device) + setattr(module_self, var_name, q_layer) + output = q_layer(input) + return output + +@register_qtensor_op([torch.tanh, torch.tanh_, torch.Tensor.tanh, torch.Tensor.tanh_]) +def tanh(op, input): + module_self = get_current_module() + if module_self is None: + return qfallback(op, input) + + iname_index = getattr(module_self, LINGER_QTENSOR_LAYER_COUNTER) + setattr(module_self, LINGER_QTENSOR_LAYER_COUNTER, iname_index+1) + var_name = 
LINGER_QTENSOR_LAYERS_PREIFX + '_qtanh_' + str(iname_index) + + q_layer = None + if hasattr(module_self, var_name): + q_layer = getattr(module_self, var_name) + else: + q_layer = QTanh(activate_config = QUANT_CONFIGS.quant_info.to_dict(), num_input=1) + q_layer.training = module_self.training + q_layer = q_layer.to(input.device) + setattr(module_self, var_name, q_layer) + output = q_layer(input) + return output + +@register_qtensor_op([torch.softmax, torch._softmax, torch.Tensor.softmax, torch.nn.functional.softmax]) +def softmax(op, input, dim, _stacklevel: int = 3, dtype: Optional[DType] = None): + module_self = get_current_module() + if module_self is None: + return qfallback(op, input, dim) + + iname_index = getattr(module_self, LINGER_QTENSOR_LAYER_COUNTER) + setattr(module_self, LINGER_QTENSOR_LAYER_COUNTER, iname_index+1) + var_name = LINGER_QTENSOR_LAYERS_PREIFX + '_qsoftmax_' + str(iname_index) + + q_layer = None + if hasattr(module_self, var_name): + q_layer = getattr(module_self, var_name) + else: + q_layer = QSoftmax(activate_config = QUANT_CONFIGS.quant_info.to_dict(), num_input=2, dim = dim) + q_layer.training = module_self.training + q_layer = q_layer.to(input.device) + setattr(module_self, var_name, q_layer) + q_layer.dim = dim + output = q_layer(input, dim) + return output + +@register_qtensor_op([torch.relu, torch.relu_, torch.nn.functional.relu, torch.Tensor.relu, torch.Tensor.relu_]) +def relu(op, input, *args, **kwargs): + assert isinstance(input, QTensor), 'input is not QTensor' + if isinstance(input, QTensor): + tmp_input = from_qtensor_to_tensor(input) + scale = input.scale.detach() + data_bits = input.data_bits + out = op(tmp_input, *args, **kwargs) + return from_tensor_to_qtensor(out, scale, data_bits) + return op(input, *args, **kwargs) diff --git a/linger/quant/qtensor.py b/linger/quant/qtensor.py new file mode 100644 index 0000000..2e4914f --- /dev/null +++ b/linger/quant/qtensor.py @@ -0,0 +1,127 @@ +import torch +from torch.utils 
import _pytree as pytree +from torch._C import DisableTorchFunction + +from .ops import * + +def qfallback(callable, *args, **kwargs): + kwargs = kwargs or {} + args, kwargs = pytree.tree_map_only(QTensor, lambda x: x.value, (args, kwargs)) + return callable(*args, **kwargs) + +class Convert2QTensor(torch.autograd.Function): + @staticmethod + def forward(ctx, t, scale, data_bits): + s = QTensor(t, scale, data_bits) + return s + + @staticmethod + def backward(ctx, gradOutput): + return gradOutput, None, None + +class Convert2Tensor(torch.autograd.Function): + @staticmethod + def forward(ctx, t): + return t.clone() + + @staticmethod + def backward(ctx, gradOutput): + return gradOutput + +def from_tensor_to_qtensor(t, scale: float = None, data_bits: int = None, zero_point=0): + qt = Convert2QTensor.apply(t, scale, data_bits) + return qt + +def from_qtensor_to_tensor(t): + assert isinstance(t, QTensor) + return Convert2Tensor.apply(t) + + +class QTensor(torch.Tensor): + @staticmethod + def __new__(cls, data, scale=1.0, data_bits=8, zero_point=0): + return torch.Tensor._make_subclass(cls, data, require_grad=data.requires_grad) + + def __init__(self, data, scale=1.0, data_bits=8, zero_point=0): + self.scale = scale + self.data_bits = data_bits + self.value = data + + if torch.__version__ >= '1.7.0': + @classmethod + def __torch_function__(cls, func, types, args=(), kwargs=None): + if kwargs is None: + kwargs = {} + + if not all(issubclass(cls, t) for t in types): + return NotImplemented + + with torch._C.DisableTorchFunction(): + qdispatch = get_qtensor_op_dispatch(func) + if qdispatch is not None: + return qdispatch(*args, **kwargs) + ret = func(*args, **kwargs) + return ret + + if torch.__version__ >= '2.0': + def __torch_dispatch__(self, func, types, args=(), kwargs=None): + kwargs = kwargs or {} + op = func.overloadpacket + qdispatch = get_qtensor_op_dispatch(op) + if qdispatch is not None: + return qdispatch(*args, **kwargs) + return qfallback(func, *args, **kwargs) 
+ elif torch.__version__ >= '1.10': + def __torch_dispatch__(self, func, args=(), kwargs=None): + kwargs = kwargs or {} + op = func.overloadpacket + qdispatch = get_qtensor_op_dispatch(op) + if qdispatch is not None: + return qdispatch(*args, **kwargs) + return qfallback(func, *args, **kwargs) + + # # 重写add方法以支持torch.add语法 + # def add(self, other, alpha=1): + # if not isinstance(other, (QTensor, float, int)) or self.dtype != torch.float: + # return super(QTensor, self).add(other, alpha=alpha) + # module_self = get_current_module() + # if module_self is None: + # return super(QTensor, self).add(other, alpha=alpha) + + # iname_index = getattr(module_self, LINGER_QTENSOR_LAYER_COUNTER) + # setattr(module_self, LINGER_QTENSOR_LAYER_COUNTER, iname_index+1) + # return qadd(module_self, self, other, str(iname_index)) + + # # 重写__add__方法以支持a + b语法 + # def __add__(self, other): + # return self.add(other) + + # # 重写__iadd__方法以支持a += b语法 + # def __iadd__(self, other): + # return self.add(other) + + # # 重写bmm方法 + # def bmm(self, mat2): + # """ + # 重写的bmm方法 + # """ + # # 检查是否两个张量都已经量化 + # if not isinstance(self, QTensor) or not isinstance(mat2, QTensor): + # return super(QTensor, self).bmm(mat2) + # module_self = get_current_module() + # if module_self is None: + # return super(QTensor, self).bmm(mat2) + + # iname_index = getattr(module_self, LINGER_QBMM_LAYER_COUNTER) + # setattr(module_self, LINGER_QBMM_LAYER_COUNTER, iname_index+1) + # return qbmm(module_self, self, mat2, str(iname_index)) + + # def flatten(self, start_dim: int = 0, end_dim: int = -1): + # y = super(QTensor, self).flatten(start_dim, end_dim) + # if isinstance(self, QTensor): + # y = from_tensor_to_qtensor(self, self.scale, self.data_bits) + # return y + + + + \ No newline at end of file diff --git a/linger/quant/quantizer.py b/linger/quant/quantizer.py new file mode 100644 index 0000000..4ee316f --- /dev/null +++ b/linger/quant/quantizer.py @@ -0,0 +1,304 @@ +import math +import torch +from typing import 
Optional, Union, Dict, Any + +from .qtensor import QTensor +from .calibrate_funs import get_calibrate_function +from ..config import QUANT_CONFIGS +from ..utils import QuantMode, QuantStrategy, ActivationType, FakeQuantMethod, QatMethod, PlatForm +import lingerext + +class FakeQuantOnnxFunction(torch.autograd.Function): + @staticmethod + def forward(ctx, input): + return input + + @staticmethod + def backward(ctx, gradOutput1): + return gradOutput1 + + @staticmethod + def symbolic(g, input): + return input + +class FAKEQUANT(torch.autograd.Function): + @staticmethod + def forward(ctx, input, data_bits, learning_data, scale_min, quant_min, quant_max): + out, mask, scale_tmp = lingerext.fake_quant(input, data_bits, float(learning_data), float(scale_min), quant_min, quant_max) + ctx.save_for_backward(mask.bool()) + return out, scale_tmp + + @staticmethod + def backward(ctx, gradOutput1, gradOutput2): + mask = ctx.saved_tensors + return gradOutput1.masked_fill(mask[0], 0.0), None, None, None, None, None + +class BIASQUANT(torch.autograd.Function): + @staticmethod + def forward(ctx, input, data_bits, scale_bias, scale_min, quant_min, quant_max): + out,mask = lingerext.bias_quant(input, data_bits, float(scale_bias), float(scale_min), quant_min, quant_max) + ctx.save_for_backward(mask.bool()) + return out + @staticmethod + def backward(ctx, gradOutput): + mask = ctx.saved_tensors + return gradOutput.masked_fill(mask[0], 0.0), None, None, None, None, None + +class FAKEQUANT_WITH_GRAD_SACLE(torch.autograd.Function): + @staticmethod + def forward(ctx, input, data_bits, learning_data, scale_min, quant_min, quant_max): + out, mask, learning_data_coff_back, scale_tmp = lingerext.fake_quant_with_grad_scale(input, data_bits, float(learning_data), float(scale_min), quant_min, quant_max) + saved_tensors = [mask.bool(), learning_data_coff_back] + ctx.save_for_backward(*saved_tensors) + return out, scale_tmp + + @staticmethod + def backward(ctx, gradOutput1, gradOutput2): + mask, 
learning_data_coff_back = ctx.saved_tensors + grad_learning_data = (gradOutput1 * learning_data_coff_back).sum() + return gradOutput1.masked_fill(mask, 0.0), None, grad_learning_data, None, None, None + +class BIASQUANT_WITH_GRAD_SACLE(torch.autograd.Function): + @staticmethod + def forward(ctx, input, data_bits, scale_bias, scale_min, quant_min, quant_max): + out,mask,scale_coff_back = lingerext.bias_quant_with_grad_scale(input, data_bits, float(scale_bias), float(scale_min), quant_min, quant_max) + saved_tensors = [mask.bool(), scale_coff_back] + ctx.save_for_backward(*saved_tensors) + return out + @staticmethod + def backward(ctx, gradOutput): + mask, scale_coff_back = ctx.saved_tensors + grad_scale = (gradOutput * scale_coff_back).sum() + return gradOutput.masked_fill(mask, 0.0), None, grad_scale, None, None, None + + + + + +""" +TODO:假如我要加入一个不必将scale约束至2的幂次方、或perchennel的Quantiezer该怎么实现代码比较整洁呢? +""" +class Quantizer(torch.nn.Module, ): + def __init__(self, quantizer_cfg: Optional[Dict[str, Any]] = None, constrain: Optional[Dict[str, Any]] = None): + super(Quantizer, self).__init__() + if quantizer_cfg is None: + quantizer_cfg = {} + self.round_mode = quantizer_cfg.get('round_mode', QuantMode.floor_add) + self.is_symmetry = quantizer_cfg.get('is_symmetry', True) + + self.qat_method = QUANT_CONFIGS.quant_info.qat_method + + # TQT策略使用,learning_data可学习,通过校准进行初始化 + self.learning_data = torch.nn.parameter.Parameter(torch.tensor([2.1]), ) + self.register_buffer("is_calibrate", torch.tensor(False, dtype=bool)) + + # MOM策略使用,running_data统计input.abs().max()获得 + self.register_buffer("running_data", torch.tensor(0.0)) + self.momentum = 0.1 + + # TQT和MOM训练时都更新scale(因为bias可能会调用当前训练步骤的scale); 推理时都通过scale(MOM为通过running_data保存的)计算以加快推理速度 + # MOM正确的scale(统计的running_data对应的scale)通过重写self.state_dict()函数保存的checkpoint中,故推理可直接使用scale;TQT一直都是正确的scale + self.register_buffer("scale", torch.tensor(1.0)) + + def init_quant_data(self, tensor): + with torch.no_grad(): + if hasattr(self, 
"clamp_factor") and self.clamp_factor is not None: + clamp_data = tensor.abs().mean() * self.clamp_factor + elif hasattr(self, "clamp_value") and self.clamp_value is not None: + clamp_data = self.clamp_value + else: + clamp_data = None + tensor.data = tensor if clamp_data is None else torch.clamp(tensor, min = -clamp_data, max = clamp_data) + + calibrate_function = get_calibrate_function(self.calibrate_name) + calibrate_function(self, tensor, self.data_bits) + return tensor + + def quant_round(self, x, mode): + if mode == QuantMode.floor_add: + out = ((x + 0.5).floor() - x).detach() + x + elif mode == QuantMode.floor: + out = (x.floor() - x).detach() + x + else: + out = (x.round() - x).detach() + x + return out + + def fake_quant_native(self, input, scale_bias = None): + if scale_bias is None: + if self.qat_method == QatMethod.TQT: + learning_data_temp = self.data_bits - 1 - self.learning_data + elif self.qat_method == QatMethod.MOM: + abs_max = input.abs().max() + self.running_data.to(input.device).mul_(1-self.momentum).add_(self.momentum * abs_max) + learning_data_temp = self.data_bits - 1 - abs_max.log2() + else: + raise ValueError("Only TQT and MOM strategies are supported! 
") + learning_data = self.quant_round(learning_data_temp, self.round_mode) + scale = 2**learning_data + # scale = scale.clamp(min=1e-6, max=2**24) + else: + scale = scale_bias + + if hasattr(self, "min_scale") and scale > self.min_scale: + scale = self.min_scale + self.scale.fill_(float(scale)) + + x_s = input * scale + x_int = self.quant_round(x_s, self.round_mode) + x_int = x_int.clamp(self.quant_min, self.quant_max) + fake_input = x_int / scale + + self.scale.fill_(float(scale)) + # self.scale.mul_(0).add_(scale) # MOM也必须要给scale初始化,便于后续(如bias)调用使用当前模块scale时保持正确 + + return fake_input + + def fake_quant_cuda(self, input, scale_bias = None): + if hasattr(self, "min_scale"): # 只有add算子的outputquantizer有min_scale,其余情况min_scale和正常的scale相同 + min_scale = self.min_scale + else: + min_scale = float("inf") + + if scale_bias is None: + if self.qat_method == QatMethod.TQT: + out, scale_tmp = FAKEQUANT.apply(input, self.data_bits, self.learning_data, min_scale, self.quant_min, self.quant_max) + elif self.qat_method == QatMethod.MOM: + abs_max = input.abs().max() + self.running_data.mul_(1-self.momentum).add_(self.momentum * abs_max) + out, scale_tmp = FAKEQUANT.apply(input, self.data_bits, abs_max.log2(), min_scale, self.quant_min, self.quant_max) + else: + out = BIASQUANT.apply(input, self.data_bits, scale_bias, min_scale, self.quant_min, self.quant_max) + scale_tmp = scale_bias.detach() + self.scale.fill_(scale_tmp) # MOM也必须要给scale初始化,便于后续(如bias)调用当前模块scale时保持正确 + return out + + def fake_quant_cuda_with_grad_scale(self, input, scale_bias = None): + if hasattr(self, "min_scale"): # 只有add算子的outputquantizer有min_scale,其余情况min_scale和正常的scale相同 + min_scale = self.min_scale + else: + min_scale = float("inf") + + if scale_bias is None: + if self.qat_method == QatMethod.TQT: + out, scale_tmp = FAKEQUANT_WITH_GRAD_SACLE.apply(input, self.data_bits, self.learning_data, min_scale, self.quant_min, self.quant_max) + elif self.qat_method == QatMethod.MOM: + abs_max = input.abs().max() + 
self.running_data.mul_(1-self.momentum).add_(self.momentum * abs_max) + out, scale_tmp = FAKEQUANT_WITH_GRAD_SACLE.apply(input, self.data_bits, abs_max.log2(), min_scale, self.quant_min, self.quant_max) + else: + out = BIASQUANT_WITH_GRAD_SACLE.apply(input, self.data_bits, scale_bias, min_scale, self.quant_min, self.quant_max) + scale_tmp = scale_bias.detach() + self.scale.fill_(scale_tmp) # MOM也必须要给scale初始化,便于后续(如bias)调用当前模块scale时保持正确 + return out + + def inference(self, input, scale=None): + if hasattr(self, "min_scale"): + min_scale = self.min_scale + else: + min_scale = float("inf") + # 推理时固定走cuda路线,若为TQT模式,通过learning_data保存scale,若为MOM模式,通过running_data重新初始化scale + if scale is None: + if self.qat_method == QatMethod.MOM and self.running_data != 0.0: + learning_data = self.data_bits - 1 - self.running_data.abs().max().log2() + self.scale.fill_(float((2 ** (self.quant_round(learning_data, self.round_mode))))) + scale = self.scale + if QUANT_CONFIGS.quant_method == FakeQuantMethod.CUDA: + out = BIASQUANT.apply(input, self.data_bits, scale, min_scale, self.quant_min, self.quant_max) + else: + out = self.fake_quant_native(input, scale) + return out + + def forward(self, input, scale=None): + if torch.onnx.is_in_onnx_export(): + return FakeQuantOnnxFunction.apply(input) + elif QUANT_CONFIGS.calibration: + return self.init_quant_data(input) + elif not self.training: + return self.inference(input, scale) + else: + # only weight clamp + if self.qat_method == QatMethod.MOM and hasattr(self, "is_init_mom_clamp_weight") and self.is_init_mom_clamp_weight == False: + with torch.no_grad(): + if hasattr(self, "clamp_factor") and self.clamp_factor is not None: + clamp_data = input.abs().mean() * self.clamp_factor + elif hasattr(self, "clamp_value") and self.clamp_value is not None: + clamp_data = self.clamp_value + else: + clamp_data = None + input.data = input if clamp_data is None else torch.clamp(input, min = -clamp_data, max = clamp_data) + 
self.is_init_mom_clamp_weight.fill_(True) + + if QUANT_CONFIGS.quant_method == FakeQuantMethod.CUDA: + fake_input = self.fake_quant_cuda(input, scale) + elif QUANT_CONFIGS.quant_method == FakeQuantMethod.CUDA_GS: + fake_input = self.fake_quant_cuda_with_grad_scale(input, scale) + else: + fake_input = self.fake_quant_native(input, scale) + return fake_input + + def state_dict(self, *args, **kwargs): + with torch.no_grad(): + if self.qat_method == QatMethod.MOM and self.running_data != 0.0: + learning_data = self.data_bits - 1 - self.running_data.abs().max().log2() + self.scale.fill_(float((2 ** (self.quant_round(learning_data, self.round_mode))))) + return super().state_dict(*args, **kwargs) + + @property + def quant_min(self): + quant_min = - 2 ** (self.data_bits - 1) + return quant_min + + @property + def quant_max(self): + quant_max = 2 ** (self.data_bits - 1) - 1 + return quant_max + + +class AQuantizer(Quantizer): + def __init__(self, quantizer_cfg: Optional[Dict[str, Any]] = None, constrain: Optional[Dict[str, Any]] = None): + super().__init__(quantizer_cfg, constrain) + + self.data_bits = quantizer_cfg.get('activate_bits', 8) + self.quant_strategy = quantizer_cfg.get('a_strategy', QuantStrategy.RANGE_MEAN) + self.activation_type = quantizer_cfg.get('activation_type', None) + self.is_bias_quantizer = False + self.is_qtensor = False + self.calibrate_name = quantizer_cfg.get('a_calibrate_name', "top_10") + # self.quant_min = - 2 ** (self.data_bits - 1) + # self.quant_max = 2 ** (self.data_bits - 1) - 1 + + self.clamp_value = None if constrain is None else constrain.get('clamp_activation_value', None) + + self.round_mode = quantizer_cfg.get('round_mode', QuantMode.floor_add) + +class WQuantizer(Quantizer): + def __init__(self, quantizer_cfg: Optional[Dict[str, Any]] = None, constrain: Optional[Dict[str, Any]] = None): + super().__init__(quantizer_cfg, constrain) + + self.data_bits = quantizer_cfg.get('weight_bits', 8) + self.quant_strategy = 
quantizer_cfg.get('w_strategy', QuantStrategy.RANGE_MEAN) + self.is_perchannel = quantizer_cfg.get('is_perchannel', False) + self.is_bias_quantizer = False + self.calibrate_name = quantizer_cfg.get('w_calibrate_name', "abs_max") + # self.quant_min = - 2 ** (self.data_bits - 1) + # self.quant_max = 2 ** (self.data_bits - 1) - 1 + + self.clamp_value = None if constrain is None else constrain.get('clamp_weight_value', None) + self.clamp_factor = None if constrain is None else constrain.get('clamp_factor_value', None) + + self.register_buffer("is_init_mom_clamp_weight", torch.tensor(False, dtype=bool)) + self.round_mode = QuantMode.round +class BQuantizer(Quantizer): + def __init__(self, quantizer_cfg: Optional[Dict[str, Any]] = None, constrain: Optional[Dict[str, Any]] = None): + super().__init__(quantizer_cfg, constrain) + + self.data_bits = quantizer_cfg.get('bias_bits', 32) + self.quant_strategy = quantizer_cfg.get('w_strategy', QuantStrategy.RANGE_MEAN) + self.is_bias_quantizer = True + self.calibrate_name = quantizer_cfg.get('w_calibrate_name', "abs_max") + # self.quant_min = - 2 ** (self.data_bits - 1) + # self.quant_max = 2 ** (self.data_bits - 1) - 1 + + self.clamp_value = None if constrain is None else constrain.get('clamp_bias_value', None) + self.round_mode = QuantMode.round + diff --git a/linger/tools/__init__.py b/linger/tools/__init__.py deleted file mode 100644 index 8e71431..0000000 --- a/linger/tools/__init__.py +++ /dev/null @@ -1,3 +0,0 @@ -from .weight_bias_analyse import wb_analyse -from .fix_dequant import fix_dequant -from .onnx_quant import onnx_quant diff --git a/linger/tools/fix_dequant.py b/linger/tools/fix_dequant.py deleted file mode 100644 index 308d88a..0000000 --- a/linger/tools/fix_dequant.py +++ /dev/null @@ -1,99 +0,0 @@ -import linger -import numpy as np -import onnx - -onnx_dtype = { - 'Undefined': 'UNDEFINED', 'Float': 'float32', 'UInt8': 'uint8', 'Int8': 'int8', 'UInt16': 'uint16', - 'Int16': 'int16', 'Int32': 'int32', 'Int64': 
'int64', 'String': 'str', 'Bool': 'bool', 'Float16': 'float16', - 'Double': 'double', 'UInt32': 'uint32', 'UInt64': 'uint64', 'Complex64': 'complex64', 'Complex128': 'complex128', - 'BFloat16': 'bfloat16' -} - - -def remove_identity_node(model_path, op_type="Identity"): - model = onnx.load(model_path) - graph_output_name = [] - for ii in model.graph.output: - graph_output_name.append(ii.name) - nodes = model.graph.node[::-1] - for i, node in enumerate(nodes): - if node.op_type == op_type: - model.graph.node.remove(node) - if node.output[0] in graph_output_name: - for each in model.graph.node: - for idx in range(len(each.output)): - if each.output[idx] == node.input[0]: - each.output[idx] = node.output[0] - for each in model.graph.node: - for idx in range(len(each.input)): - if each.input[idx] == node.input[0]: - each.input[idx] = node.output[0] - - for each in model.graph.node: - if each.op_type == "If": - for gi in range(len(each.attribute)): - for x_node in each.attribute[gi].g.node: - for idx in range(len(x_node.output)): - if x_node.output[idx] == node.input[0]: - x_node.output[idx] = node.output[0] - for idx in range(len(x_node.input)): - if x_node.input[idx] == node.input[0]: - x_node.input[idx] = node.output[0] - - else: - for each in model.graph.node: - for idx in range(len(each.input)): - if each.input[idx] == node.output[0]: - each.input[idx] = node.input[0] - - for each in model.graph.node: - if each.op_type == "If": - for gi in range(len(each.attribute)): - for x_node in each.attribute[gi].g.node: - for idx in range(len(x_node.input)): - if x_node.input[idx] == node.output[0]: - x_node.input[idx] = node.input[0] - - onnx.save(model, model_path[:-5]+"_remove_identity.onnx") - - -def check_model_run(fixed_onnx): # 用于检测修复后的onnx是否能够正常前向运行 - import onnxinfer - - sessoption = onnxinfer.InferSessionOptions() - - # 此处fixed_onnx 即为上面成功运行的onnx模型 - sess = onnxinfer.InferSession(fixed_onnx, sessoption, is_fuse=0) - - data = {} - for i in 
range(sess.GetInputCount()): - ishape = sess.GetInputTypeInfo(i) - inp = np.ones(ishape.GetShape(), - dtype=onnx_dtype[ishape.GetElementDataType().name]) - data[sess.GetInputNames()[i]] = inp - - option = onnxinfer.InferRunOptions() - rlt = sess.Run(run_option=option, data_in=data) - print(rlt[0].AsReadOnlyNumpy().shape) # 能成功输出shape即可证明onnx图前向运行正常 - print("The Fixed model detection runs successfully !") - - -def fix_dequant(model_name, is_check): - model_name = model_name[:-5] # 原始出错的onnx模型名称 - - ori_onnx = model_name + ".onnx" - remove_dequant_onnx = model_name + "_remove_identity.onnx" - fixed_onnx = model_name + "_fix.onnx" - - remove_identity_node(ori_onnx, "Dequant") - - model = onnx.load(remove_dequant_onnx) - - model = linger.parser_dequant(model, False) # 此处的linger使用最新master版的环境 - # model.opset_import[0].version = 10 - - onnx.save(model, fixed_onnx) # 最后将修复好的onnx保存为 后缀多了_fix.onnx - print("ONNX fix over! Save model as onnx file \"", fixed_onnx, "\"") - - if is_check: - check_model_run(fixed_onnx) \ No newline at end of file diff --git a/linger/tools/onnx_quant/__init__.py b/linger/tools/onnx_quant/__init__.py deleted file mode 100644 index ffca88f..0000000 --- a/linger/tools/onnx_quant/__init__.py +++ /dev/null @@ -1 +0,0 @@ -from .fix_quant_onnx_replace import onnx_quant \ No newline at end of file diff --git a/linger/tools/onnx_quant/fix_quant_onnx_replace.py b/linger/tools/onnx_quant/fix_quant_onnx_replace.py deleted file mode 100644 index 9f133e2..0000000 --- a/linger/tools/onnx_quant/fix_quant_onnx_replace.py +++ /dev/null @@ -1,50 +0,0 @@ -from typing import List - -import onnx - -from .transform.onnx_utils import remove_identity_node -from .transform.replace_add import replace_add_to_iqadd -from .transform.replace_avgpoolint import replace_avgpool2dint -from .transform.replace_clip import replace_clip_attr -from .transform.replace_conv import replace_conv2dint -from .transform.replace_iqsigmoid import replace_iqsigmoid -from 
.transform.replace_linearint import replace_linearint - -supported_replace_ops = ['Conv', 'ConvTranspose', 'Add', 'Clip', 'AveragePool', 'MaxPool', 'Relu', 'Gemm', 'Transpose', 'Reshape', 'Squeeze', 'Unsqueeze', - 'Split', "Sigmoid"] - - -def onnx_quant(model_path, quant_ops_type: List[str], remove_ops_type: List[str] = [], scale_x=16.0, scale_y=16.0, scale_w=64.0, scale_o=16.0, platform_quant="luna_quant"): - model = onnx.load(model_path) - for remove_op_type in remove_ops_type: - model = remove_identity_node(model, remove_op_type) - for quant_op_type in quant_ops_type: - if quant_op_type == "Conv" or quant_op_type == "ConvTranspose": - model = replace_conv2dint( - model, scale_x, scale_w, scale_o, platform_quant) - if quant_op_type == "Add": - model = replace_add_to_iqadd( - model, scale_x, scale_y, scale_o, platform_quant) - if quant_op_type == "Clip": - model = replace_clip_attr(model, scale_o) - if quant_op_type == "Sigmoid": - model = replace_iqsigmoid(model, scale_x, 256.0, platform_quant) - if quant_op_type == "AveragePool": - model = replace_avgpool2dint( - model, scale_x, scale_o, platform_quant) - if quant_op_type == "Gemm": - model = replace_linearint( - model, scale_x, scale_w, scale_o, platform_quant) - - # change inp/outp type to int8, ignore quant/dequant op - for i in range(len(model.graph.input)): - if model.graph.input[i].type.tensor_type.elem_type == 1: - model.graph.input[i].type.tensor_type.elem_type = 3 - for i in range(len(model.graph.output)): - if model.graph.output[i].type.tensor_type.elem_type == 1: - if "Sigmoid" in quant_ops_type: - model.graph.output[i].type.tensor_type.elem_type = 2 - else: - model.graph.output[i].type.tensor_type.elem_type = 3 - - onnx.save(model, model_path[:-5]+"_quant.onnx") diff --git a/linger/tools/onnx_quant/transform/__init__.py b/linger/tools/onnx_quant/transform/__init__.py deleted file mode 100644 index e69de29..0000000 diff --git a/linger/tools/onnx_quant/transform/onnx_utils.py 
b/linger/tools/onnx_quant/transform/onnx_utils.py deleted file mode 100644 index e1813d4..0000000 --- a/linger/tools/onnx_quant/transform/onnx_utils.py +++ /dev/null @@ -1,73 +0,0 @@ -import onnx - - - -def get_node_id(node, nodes): - for k in range(len(nodes)): - if node==nodes[k]: - return k - - -def remove_identity_node(model, op_type="Identity"): - r"""实现onnx的identity节点移除 - Note:本功能针对的是导出onnx时未自动删除identity节点的onnx模型,linger最新版导出已自动支持此功能 - - Args: - model_path: 要修改的onnx路径 - op_type : 要删去的无用节点类型(默认为Identity) - - Notes: - 本函数不返回数据 函数会将修改后的删除完相应类型节点后的模型 写入到要保存的新路径中,设为原始名称+“_remove_identity.onnx” - - """ - # model = onnx.load(model_path) - graph_output_name = [] - for ii in model.graph.output: - graph_output_name.append(ii.name) - nodes = model.graph.node[::-1] - for i,node in enumerate(nodes): - if node.op_type== op_type: - model.graph.node.remove(node) - if node.output[0] in graph_output_name: - for each in model.graph.node: - for idx in range(len(each.output)): - if each.output[idx]==node.input[0]: - each.output[idx] = node.output[0] - for each in model.graph.node: - for idx in range(len(each.input)): - if each.input[idx]==node.input[0]: - each.input[idx] = node.output[0] - - for each in model.graph.node: - if each.op_type == "If": - for gi in range(len(each.attribute)): - for x_node in each.attribute[gi].g.node: - for idx in range(len(x_node.output)): - if x_node.output[idx]==node.input[0]: - x_node.output[idx] = node.output[0] - for idx in range(len(x_node.input)): - if x_node.input[idx]==node.input[0]: - x_node.input[idx] = node.output[0] - - else: - for each in model.graph.node: - for idx in range(len(each.input)): - if each.input[idx]==node.output[0]: - each.input[idx] = node.input[0] - # for each in model.graph.node: - # for idx in range(len(each.output)): - # if each.output[idx]==node.output[0]: - # each.output[idx] = node.input[0] - # print(each.output[idx]) - for each in model.graph.node: - if each.op_type == "If": - for gi in range(len(each.attribute)): - 
for x_node in each.attribute[gi].g.node: - for idx in range(len(x_node.input)): - if x_node.input[idx]==node.output[0]: - x_node.input[idx] = node.input[0] - # for idx in range(len(x_node.output)): - # if x_node.output[idx]==node.output[0]: - # x_node.output[idx] = node.input[0] - # onnx.save(model,model_path[:-5]+"_remove_identity.onnx") - return model \ No newline at end of file diff --git a/linger/tools/onnx_quant/transform/replace_add.py b/linger/tools/onnx_quant/transform/replace_add.py deleted file mode 100644 index 45c5dae..0000000 --- a/linger/tools/onnx_quant/transform/replace_add.py +++ /dev/null @@ -1,14 +0,0 @@ -import onnx - - -def replace_add_to_iqadd(model, scale_x, scale_y, scale_o, platform_quant): - for node in model.graph.node[::-1]: - if node.op_type == "Add": - node.op_type = "iqadd" - node.domain = "thinker" - node.attribute.extend( - [onnx.helper.make_attribute("scale_x", scale_x), onnx.helper.make_attribute("scale_y", scale_y), onnx.helper.make_attribute("scale_o", scale_o), \ - onnx.helper.make_attribute("data_bits", 8), onnx.helper.make_attribute("o_bits", 8), onnx.helper.make_attribute("parameter_bits", 8), \ - onnx.helper.make_attribute("platform_quant", platform_quant)] - ) - return model \ No newline at end of file diff --git a/linger/tools/onnx_quant/transform/replace_avgpoolint.py b/linger/tools/onnx_quant/transform/replace_avgpoolint.py deleted file mode 100644 index 71e0ec2..0000000 --- a/linger/tools/onnx_quant/transform/replace_avgpoolint.py +++ /dev/null @@ -1,13 +0,0 @@ -import onnx - - -def replace_avgpool2dint(model, scale_x, scale_o, platform_quant): - for node in model.graph.node[::-1]: - if node.op_type == "AveragePool": - node.op_type = "AvgPool2dInt" - node.domain = "thinker" - node.attribute.extend( - [onnx.helper.make_attribute("scale_x", scale_x), onnx.helper.make_attribute("scale_o", scale_o), \ - onnx.helper.make_attribute("data_bits", 8), onnx.helper.make_attribute("o_bits", 8), 
onnx.helper.make_attribute("parameter_bits", 8), \ - onnx.helper.make_attribute("platform_quant", platform_quant) ]) - return model \ No newline at end of file diff --git a/linger/tools/onnx_quant/transform/replace_clip.py b/linger/tools/onnx_quant/transform/replace_clip.py deleted file mode 100644 index 8233907..0000000 --- a/linger/tools/onnx_quant/transform/replace_clip.py +++ /dev/null @@ -1,27 +0,0 @@ -import onnx -from onnx import numpy_helper - -def replace_clip_attr(model,scale_o): #only support opset 9 and below - if model.opset_import[0].version <12: - assert False, "Clip quant only support opset 12 and above !" - - for index,node in enumerate(model.graph.node[::-1]): - if node.op_type == "Clip": - model.graph.node[::-1][index+1].attribute[0].i = 3 - model.graph.node[::-1][index+2].attribute[0].i = 3 - # max_value = int.from_bytes(model.graph.node[::-1][index+3].attribute[0].t.raw_data, byteorder='little', signed=True) - # min_value = int.from_bytes(model.graph.node[::-1][index+4].attribute[0].t.raw_data, byteorder='little', signed=True) - - max_value = numpy_helper.to_array(model.graph.node[::-1][index+3].attribute[0].t) - min_value = numpy_helper.to_array(model.graph.node[::-1][index+4].attribute[0].t) - - max_value = max_value * scale_o - min_value = min_value * scale_o - - model.graph.node[::-1][index+3].attribute[0].t.data_type = 11 #double - model.graph.node[::-1][index+4].attribute[0].t.data_type = 11 - - model.graph.node[::-1][index+3].attribute[0].t.raw_data = max_value.tobytes() - model.graph.node[::-1][index+4].attribute[0].t.raw_data= min_value.tobytes() - - return model \ No newline at end of file diff --git a/linger/tools/onnx_quant/transform/replace_conv.py b/linger/tools/onnx_quant/transform/replace_conv.py deleted file mode 100644 index 72b7e28..0000000 --- a/linger/tools/onnx_quant/transform/replace_conv.py +++ /dev/null @@ -1,139 +0,0 @@ -import onnx -from onnx import numpy_helper -import numpy as np -from onnx import TensorProto -from 
.onnx_utils import get_node_id,remove_identity_node - - -def get_quant_conv_node(onnx_model, node, scale_x, scale_w, scale_o, platform_quant): - # scale_x = 16.0 - # scale_w = 64.0 - # scale_o = 16.0 - domain = "thinker" - activation_bits = 8 - weight_bits = 8 - - # get weights of conv - weight_data = None - bias_data = None - - # for init in reversed(onnx_model.graph.initializer): - for init in onnx_model.graph.initializer[::-1]: - - if init.name == node.input[1]: - weight_data = numpy_helper.to_array(init) - onnx_model.graph.initializer.remove(init) - try: - if init.name == node.input[2]: - bias_data = numpy_helper.to_array(init) - onnx_model.graph.initializer.remove(init) - except: - pass - - # initializer quantization - # the definition of scale is different from the normal !!! - # todo, how to handle data type - if weight_data.dtype == "float32": - weight_data = np.floor(weight_data * scale_w + 0.5).astype(np.int8) - weight_tensor = onnx.helper.make_tensor(node.input[1], TensorProto.INT8, weight_data.shape, \ - weight_data.tobytes(), raw=True) # raw - try: - if bias_data.dtype == "float32": - bias_data = np.floor(bias_data * scale_w * scale_x + 0.5).astype(np.int32) - bias_tensor = onnx.helper.make_tensor(node.input[2], TensorProto.INT32, bias_data.shape, bias_data.tobytes(), raw=True) - except: - pass - if node.op_type == "Conv": - if len(node.input) == 3: - quant_node = onnx.helper.make_node( - "Conv2dInt", - name=node.name, - inputs=[node.input[0], node.input[1], node.input[2]], - outputs=[node.output[0]], - domain=domain - ) - else: - quant_node = onnx.helper.make_node( - "Conv2dInt", - name=node.name, - inputs=[node.input[0], node.input[1]], - outputs=[node.output[0]], - domain=domain - ) - elif node.op_type == "ConvTranspose": - if len(node.input) == 3: - quant_node = onnx.helper.make_node( - "ConvTranspose2dInt", - name=node.name, - inputs=[node.input[0], node.input[1], node.input[2]], - outputs=[node.output[0]], - domain=domain - ) - else: - quant_node 
= onnx.helper.make_node( - "ConvTranspose2dInt", - name=node.name, - inputs=[node.input[0], node.input[1]], - outputs=[node.output[0]], - domain=domain - ) - - # add attributes of original node to the quant node - quant_node.attribute.extend( - onnx.helper.make_attribute(attr.name, onnx.helper.get_attribute_value(attr)) for attr in node.attribute - ) - # add attributes that original node does not have to the quant node - quant_node.attribute.extend( - [onnx.helper.make_attribute("scale_x", scale_x), onnx.helper.make_attribute("scale_w", scale_w), onnx.helper.make_attribute("scale_o", scale_o), \ - onnx.helper.make_attribute("data_bits", activation_bits), onnx.helper.make_attribute("o_bits", activation_bits), onnx.helper.make_attribute("parameter_bits", weight_bits), \ - onnx.helper.make_attribute("platform_quant", platform_quant)] - ) - if node.op_type == "ConvTranspose": - quant_node.attribute.extend( - [onnx.helper.make_attribute("output_padding", [0,0])]) - - # insert the quant initializers - try: - onnx_model.graph.initializer.extend([weight_tensor, bias_tensor]) - except: - onnx_model.graph.initializer.extend([weight_tensor]) - return quant_node - -def insert_op_before(model, target_node_index, ori_node, scale_x, scale_w, scale_o, platform_quant ): - ''' - op_name - weight_dict - attr_dict - ''' - - replace_node = get_quant_conv_node(model, ori_node, scale_x=scale_x, scale_w=scale_w, scale_o=scale_o, platform_quant=platform_quant) - - model.graph.node.insert(target_node_index, replace_node) - -def replace_conv2dint(module, scale_x, scale_w, scale_o, platform_quant): - nodes = module.graph.node[::-1] - remove_node_all=[] - for i,node in enumerate(nodes): - if node.op_type == 'Conv' or node.op_type == "ConvTranspose": - remove_node_all.append((i,node)) - - from onnx import numpy_helper - input_tensors = { t.name: numpy_helper.to_array(t) for t in module.graph.initializer } - - for i_node in remove_node_all: - i = i_node[0] - node = i_node[1] - - 
origin_layer_index = get_node_id(node,module.graph.node) - module.graph.node.remove(node) - insert_op_before( - module, - target_node_index = origin_layer_index, - ori_node = node, - scale_x = scale_x, - scale_w = scale_w, - scale_o = scale_o, - platform_quant = platform_quant - ) - - return module \ No newline at end of file diff --git a/linger/tools/onnx_quant/transform/replace_iqsigmoid.py b/linger/tools/onnx_quant/transform/replace_iqsigmoid.py deleted file mode 100644 index 0085552..0000000 --- a/linger/tools/onnx_quant/transform/replace_iqsigmoid.py +++ /dev/null @@ -1,14 +0,0 @@ -import onnx - - -def replace_iqsigmoid(model, scale_x, scale_o, platform_quant): - for node in model.graph.node[::-1]: - if node.op_type == "Sigmoid": - node.op_type = "iqSigmoid" - node.domain = "thinker" - node.attribute.extend( - [onnx.helper.make_attribute("scale_x", scale_x), onnx.helper.make_attribute("scale_o", scale_o), \ - onnx.helper.make_attribute("data_bits", 8), onnx.helper.make_attribute("o_bits", 8), onnx.helper.make_attribute("parameter_bits", 8), \ - onnx.helper.make_attribute("platform_quant", platform_quant)] - ) - return model \ No newline at end of file diff --git a/linger/tools/onnx_quant/transform/replace_linearint.py b/linger/tools/onnx_quant/transform/replace_linearint.py deleted file mode 100644 index 45984ee..0000000 --- a/linger/tools/onnx_quant/transform/replace_linearint.py +++ /dev/null @@ -1,119 +0,0 @@ -import onnx -from onnx import numpy_helper -import numpy as np -from onnx import TensorProto -from .onnx_utils import get_node_id,remove_identity_node - - -def get_quant_linear_node(onnx_model, node, scale_x, scale_w, scale_o, platform_quant): - # scale_x = 16.0 - # scale_w = 64.0 - # scale_o = 16.0 - domain = "thinker" - activation_bits = 8 - weight_bits = 8 - - # get weights of conv - weight_data = None - bias_data = None - - # for init in reversed(onnx_model.graph.initializer): - for init in onnx_model.graph.initializer[::-1]: - - if init.name == 
node.input[1]: - weight_data = numpy_helper.to_array(init) - onnx_model.graph.initializer.remove(init) - try: - if init.name == node.input[2]: - bias_data = numpy_helper.to_array(init) - onnx_model.graph.initializer.remove(init) - except: - pass - - # initializer quantization - # the definition of scale is different from the normal !!! - # todo, how to handle data type - if weight_data.dtype == "float32": - weight_data = np.floor(weight_data * scale_w + 0.5).astype(np.int8) - weight_tensor = onnx.helper.make_tensor(node.input[1], TensorProto.INT8, weight_data.shape, \ - weight_data.tobytes(), raw=True) # raw - try: - if bias_data.dtype == "float32": - bias_data = np.floor(bias_data * scale_w * scale_x + 0.5).astype(np.int32) - bias_tensor = onnx.helper.make_tensor(node.input[2], TensorProto.INT32, bias_data.shape, bias_data.tobytes(), raw=True) - except: - pass - if node.op_type == "Gemm": - if len(node.input) == 3: - quant_node = onnx.helper.make_node( - "LinearInt", - name=node.name, - inputs=[node.input[0], node.input[1], node.input[2]], - outputs=[node.output[0]], - domain=domain - ) - else: - quant_node = onnx.helper.make_node( - "LinearInt", - name=node.name, - inputs=[node.input[0], node.input[1]], - outputs=[node.output[0]], - domain=domain - ) - - # add attributes of original node to the quant node - quant_node.attribute.extend( - onnx.helper.make_attribute(attr.name, onnx.helper.get_attribute_value(attr)) for attr in node.attribute - ) - # add attributes that original node does not have to the quant node - quant_node.attribute.extend( - [onnx.helper.make_attribute("scale_x", scale_x), onnx.helper.make_attribute("scale_w", scale_w), onnx.helper.make_attribute("scale_o", scale_o), \ - onnx.helper.make_attribute("data_bits", activation_bits), onnx.helper.make_attribute("o_bits", activation_bits), onnx.helper.make_attribute("parameter_bits", weight_bits), \ - onnx.helper.make_attribute("platform_quant", platform_quant)] - ) - - # insert the quant initializers 
- try: - onnx_model.graph.initializer.extend([weight_tensor, bias_tensor]) - except: - onnx_model.graph.initializer.extend([weight_tensor]) - return quant_node - -def insert_op_before(model, target_node_index, ori_node, scale_x, scale_w, scale_o, platform_quant ): - ''' - op_name - weight_dict - attr_dict - ''' - - replace_node = get_quant_linear_node(model, ori_node, scale_x=scale_x, scale_w=scale_w, scale_o=scale_o, platform_quant=platform_quant) - - model.graph.node.insert(target_node_index, replace_node) - -def replace_linearint(module, scale_x, scale_w, scale_o, platform_quant): - nodes = module.graph.node[::-1] - remove_node_all=[] - for i,node in enumerate(nodes): - if node.op_type == 'Gemm' : # or node.op_type == "Matmul": - remove_node_all.append((i,node)) - - from onnx import numpy_helper - input_tensors = { t.name: numpy_helper.to_array(t) for t in module.graph.initializer } - - for i_node in remove_node_all: - i = i_node[0] - node = i_node[1] - - origin_layer_index = get_node_id(node,module.graph.node) - module.graph.node.remove(node) - insert_op_before( - module, - target_node_index = origin_layer_index, - ori_node = node, - scale_x = scale_x, - scale_w = scale_w, - scale_o = scale_o, - platform_quant = platform_quant - ) - - return module \ No newline at end of file diff --git a/linger/tools/weight_bias_analyse.py b/linger/tools/weight_bias_analyse.py deleted file mode 100644 index 9eb0810..0000000 --- a/linger/tools/weight_bias_analyse.py +++ /dev/null @@ -1,48 +0,0 @@ -import torch -import prettytable as pt - - -def clamp_with_dynamic(input: torch.Tensor, dynamic_percent: float = 0.9, layer_name: str = "layer", tb_all=None): - clamp_value = 0 - if dynamic_percent < 1.0: - x = input.data.clone().cpu() - x = x.abs().reshape(-1).sort().values - clamp_index = int(len(x) * dynamic_percent) - 1 - clamp_value = x[clamp_index] - max_abs = x[-1] - if len(x) == 1: - mean_abs = x[-1] - else: - mean_abs = x.mean() - length = max_abs / mean_abs - versu = max_abs 
/ clamp_value - if length >= 1: - index_list = [layer_name, mean_abs, - max_abs, length, clamp_value, versu] - tb_all.add_row(index_list) - if versu > 10: - index_list = ["!!!!!!!!!!!!!!", "!!!!!!!!!!!!!!", "!!!!!!!!!!!!!!", - "!!!!!!!!!!!!!!", "!!!!!!!!!!!!!!", "!!!!!!!!!!!!!!"] - tb_all.add_row(index_list) - - -def wb_analyse(path: str , save_log_path: str ="wb_analyse.log"): - if isinstance(path, str): - model = torch.load(path) - else: - model = path - - wb_flile_path = save_log_path - wb_flile = open(wb_flile_path, 'w+') - tb_all = pt.PrettyTable() - - tb_all.field_names = ["Layer_name", "Mean", "Max", - "Multiple(Max/Mean)", "Dynamic 0.99", "Versu(Max/Dynamic)"] - - for k in model.keys(): - if "running" not in k: - v = model[k] - clamp_with_dynamic(v, 0.99, k, tb_all) - - wb_flile.write(str(tb_all)) - wb_flile.close() diff --git a/linger/utils.py b/linger/utils.py index 7bf5caa..318745a 100644 --- a/linger/utils.py +++ b/linger/utils.py @@ -1,19 +1,12 @@ -import logging -import os -from enum import Enum - -import numpy as np +import math import torch +import collections +import numpy as np +from itertools import repeat +from typing import List, Dict, Any +from enum import Enum -logger = logging.getLogger("linger") -logger.setLevel(logging.INFO) -formatter = logging.Formatter('%(asctime)s - %(message)s') -ch = logging.StreamHandler() -ch.setLevel(logging.DEBUG) -ch.setFormatter(formatter) -logger.addHandler(ch) - - +# 单例模式 class Singleton(object): def __new__(cls, *args, **kw): if not hasattr(cls, '_instance'): @@ -21,50 +14,58 @@ def __new__(cls, *args, **kw): cls._instance = orig.__new__(cls, *args, **kw) return cls._instance - -class PlatFormQuant(Enum): - luna_quant = 1 - +class PlatForm(Enum): + venus = 1 + mars = 2 + arcs = 3 + venusA = 4 + jupiter = 5 + +class ActivationType(Enum): + none = 1 + Relu = 2 + LeakRelu= 3 + ReluX = 4 + Sigmoid = 5 + Tanh = 6 class QuantMode(Enum): - QValue = 1 - - -class QuantInfo(): - def __init__(self): - self.data_bits = 
8 - self.parameter_bits = 8 - self.output_bits = None - self.mode = QuantMode.QValue - - def set_data_bits(self, bits): - self.data_bits = bits - - def set_parameter_bits(self, bits): - self.parameter_bits = bits - - def set_output_bits(self, bits): - self.output_bits = bits - - def set_mode(self, mode): - self.mode = mode - - -class ClampInfo(): - def __init__(self): - self.clamp_weight_value = 8 - self.clamp_bias_value = 8 - self.clamp_output_value = None - - def set_clamp_weight_value(self, value): - self.clamp_weight_value = value - - def set_clamp_bias_value(self, value): - self.clamp_bias_value = value - - def set_clamp_output_value(self, value): - self.clamp_output_value = value - + floor = 0 + floor_add = 1 + round = 2 + ceil = 3 + +class QuantStrategy(Enum): + MSE = 1 + RANGE_MEAN = 2 + NSTD = 3 + HIST = 4 + KLD = 5 + TQT = 6 + + +class FakeQuantMethod(Enum): + NATIVE = 1 + CUDA = 2 + COMPILE = 3 + TRITON = 4 + CUDA_GS = 5 + +class QatMethod(Enum): + TQT = 1 + MOM = 2 + +def _ntuple(n): + def parse(x): + if isinstance(x, collections.abc.Iterable): + return tuple(x) + return tuple(repeat(x, n)) + return parse + +_single = _ntuple(1) +_pair = _ntuple(2) +_triple = _ntuple(3) +_quadruple = _ntuple(4) def get_device(model): device = None @@ -75,32 +76,56 @@ def get_device(model): model._get_name()) return device +def quant(x, bits=8, scale=-1, zero_point=0, mode=QuantMode.floor_add): + bound_value = None + scale_local = None + max_abs = None -def get_max_value(input): - max_value = -1 - if isinstance(input, list): - input_tmp = [data.detach() for data in input] - for data in input_tmp: - tmp = torch.max(torch.abs(data)) - max_value = max_value if max_value > tmp else tmp + zero_point_ = zero_point + if hasattr(x, 'zero_point'): + zero_point_ = x.zero_point + y = x.detach().clone() + + bound_value = math.pow(2, bits - 1) - 1 + if scale > 0: + scale_local = scale + max_abs = (bound_value + zero_point_) / scale_local + else: + min_x = torch.min(x) + max_x = 
torch.max(x) + if min_x == max_x == 0: + scale_local = math.pow(2, bits) + else: + max_abs = torch.max(-min_x, max_x) + max_value = round(math.log((bound_value + zero_point_) / max_abs, 2)) + scale_local = math.pow(2, max_value) + max_abs = (bound_value + zero_point_) / scale_local + + x = y * scale_local + + if mode == QuantMode.floor_add: + x_quant = (x + 0.5).floor() + elif mode == QuantMode.floor: + x_quant = x.floor() else: - input_tmp = input.detach() - tmp = torch.max(torch.abs(input_tmp)) - max_value = max_value if max_value > tmp else tmp - return max_value + x_quant = x.round() + x_quant = x_quant.clamp(-bound_value - 1 + zero_point_, bound_value + zero_point_) + x = x_quant.float() + return x, scale_local -class Dump(): - @staticmethod - def dump_file(header, c_name, print_list, file_path): - if not os.path.exists(file_path): - os.makedirs(file_path) - for name, item in print_list: - if(isinstance(item, torch.Tensor)): - np.savetxt(os.path.join(file_path, header+c_name+name), - item.detach().reshape(-1).cpu().numpy(), fmt="%.6f") +def dequant(x, scale): + scale_tensor = None + if isinstance(scale, (float, np.float32)): + scale_tensor = torch.tensor( + scale, dtype=torch.float32, device=x.device) + else: + scale_tensor = torch.tensor( + scale.data, dtype=torch.float32, device=x.device) + return (x / scale_tensor).float() +## for RNN pack def _unbind(src_tensor): dim_0 = src_tensor.size(0) nums = dim_0.item() if isinstance(dim_0, torch.Tensor) else dim_0 @@ -108,7 +133,6 @@ def _unbind(src_tensor): for each in src_tensor.split([1]*nums, dim=0)] return sub_tensor_list - def _unbind_packed(packed_tensor, batch_sizes): offset = 0 tensor_list = [] @@ -119,14 +143,12 @@ def _unbind_packed(packed_tensor, batch_sizes): tensor_list.append(t) return tensor_list, batch_size_list - def _slice(input, start, end): if isinstance(input, (tuple)): return tuple([each.narrow(0, start, end-start) for each in input]) else: return input.narrow(0, start, end-start) - def 
hx_slice(input_hidden, cur_hidden, last_batch_size, cur_batch_size): if input_hidden is None: # forward: slice cur_hidden assert cur_batch_size < last_batch_size, 'error: forward batch_sizes is not desc order' @@ -141,109 +163,5 @@ def hx_slice(input_hidden, cur_hidden, last_batch_size, cur_batch_size): return torch.cat((cur_hidden, slice_hidden), 0) -def qshift_round_away_from_zero(x, ): - x_mask = torch.ones_like(x) - x_mask[x < 0] = -1 - x_abs = (x.abs() + 0.5).floor() - x = x_abs * x_mask - return x - - -class ScalerBuffer(): - def __init__(self, value): - if isinstance(value, torch.Tensor): - value = value.item() - elif isinstance(value, ScalerBuffer): - value = value.data - self.value = np.float32(value) - - def __repr__(self): - return self.value - - def __str__(self): - return str(self.value) - - def fill_(self, x): - if isinstance(x, torch.Tensor): - x = x.item() - elif isinstance(x, ScalerBuffer): - x = x.data - self.value = np.float32(x) - - def __add__(self, x): - if isinstance(x, torch.Tensor): - x = x.item() - elif isinstance(x, ScalerBuffer): - x = x.data - return ScalerBuffer(self.value + np.float32(x)) - - def __radd__(self, x): - return x + self.value - - def add_(self, x): - if isinstance(x, torch.Tensor): - x = x.item() - elif isinstance(x, ScalerBuffer): - x = x.data - self.value = self.value + np.float32(x) - return self - - def __mul__(self, x): - if isinstance(x, torch.Tensor): - x = x.item() - elif isinstance(x, ScalerBuffer): - x = x.data - return ScalerBuffer(self.value * np.float32(x)) - - def __rmul__(self, x): - return x * self.value - - def mul_(self, x): - if isinstance(x, torch.Tensor): - x = x.item() - elif isinstance(x, ScalerBuffer): - x = x.data - self.value = self.value * np.float32(x) - return self - - def __truediv__(self, x): - if isinstance(x, torch.Tensor): - x = x.item() - elif isinstance(x, ScalerBuffer): - x = x.data - return ScalerBuffer(self.value / np.float32(x)) - - def __rtruediv__(self, x): - return x/self.value - 
- def __gt__(self, x): - if isinstance(x, torch.Tensor): - x = x.item() - elif isinstance(x, ScalerBuffer): - x = x.data - return True if self.value > np.float32(x) else False - - def __eq__(self, x): - if isinstance(x, torch.Tensor): - x = x.item() - elif isinstance(x, ScalerBuffer): - x = x.data - return True if self.value == np.float32(x) else False - - def __ne__(self, x): - if isinstance(x, torch.Tensor): - x = x.item() - elif isinstance(x, ScalerBuffer): - x = x.data - return True if self.value != np.float32(x) else False - - def __call__(self): - return self.value - - @property - def data(self): - return self.value - - -__all__ = ['Singleton', 'PlatFormQuant', 'QuantMode', 'QuantInfo', 'ClampInfo', 'get_max_value', 'ClampInfo', 'get_device', - '_unbind', '_unbind_packed', '_slice', 'hx_slice', 'qshift_round_away_from_zero', 'ScalerBuffer', 'logger'] +__all__ = ['Singleton', 'PlatForm', 'ActivationType', 'QuantMode', 'QuantStrategy', 'FakeQuantMethod', 'QatMethod', 'get_device', 'quant', 'dequant', + '_unbind', '_unbind_packed', '_slice', 'hx_slice'] diff --git a/requirements.txt b/requirements.txt index 0834566..04120c9 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,9 +1,8 @@ -onnx==1.7.0 -pybind11==2.6.1 -torch==1.9.0 -torchvision==0.10.0 -prettytable -protobuf==3.8.0 -numpy -ninja -pytest \ No newline at end of file +#python==3.10.0 +onnx==1.13.1 +torch==2.0.0 +torchaudio==2.0.1 +torchvision==0.15.1 +protobuf==3.20.3 +numpy==1.23.5 +PyYAML==6.0.3 \ No newline at end of file diff --git a/setup.py b/setup.py index 6a81502..06ba591 100644 --- a/setup.py +++ b/setup.py @@ -1,23 +1,65 @@ +import torch from setuptools import find_packages, setup +from torch.utils.cpp_extension import BuildExtension, CUDAExtension + + +sources = ['linger/kernel/cpu/extension.cpp'] + +torch_version = torch.__version__ +if '+' in torch_version: + torch_version = torch_version.split('+')[0] +versions = torch_version.split('.') +version_maj = int(versions[0]) +version_min 
= int(versions[1]) +version_patch = int(versions[2]) +# version 1.5.1 +if version_maj*100+version_min*10 + version_patch >= 151: + sources.append('linger/kernel/cpu/util_kernel.cpp') + sources.append('linger/kernel/gpu/util_kernel.cu') + sources.append('linger/kernel/cpu/arcs_qsoftmax_kernel.cpp') + sources.append('linger/kernel/gpu/arcs_qsoftmax_kernel.cu') + sources.append('linger/kernel/cpu/venusa_qsoftmax_kernel.cpp') + sources.append('linger/kernel/gpu/venusa_qsoftmax_kernel.cu') + sources.append('linger/kernel/cpu/venusa_qsigmoid_kernel.cpp') + sources.append('linger/kernel/gpu/venusa_qsigmoid_kernel.cu') + sources.append('linger/kernel/cpu/arcs_qsigmoid_kernel.cpp') + sources.append('linger/kernel/gpu/arcs_qsigmoid_kernel.cu') + sources.append('linger/kernel/cpu/venusa_qtanh_kernel.cpp') + sources.append('linger/kernel/gpu/venusa_qtanh_kernel.cu') + sources.append('linger/kernel/cpu/arcs_qtanh_kernel.cpp') + sources.append('linger/kernel/gpu/arcs_qtanh_kernel.cu') + sources.append('linger/kernel/cpu/qlayernorm_kernel.cpp') + sources.append('linger/kernel/gpu/qlayernorm_kernel.cu') + sources.append('linger/kernel/gpu/fake_quant_kernel.cu') setup( - name="pylinger", - version="1.1.1", + name="linger", + version="3.0.2", description="linger is package of fix training", - author="listenai", - author_email="lingerthinker@listenai.com", - url="https://github.com/LISTENAI/linger", + author="ListenAI", + ext_modules=[ + CUDAExtension('lingerext', + sources=sources, + extra_compile_args={'cxx': ['-g', '-O2', '-Wall', '-Wextra', '-Wno-unused-parameter', '-Wno-missing-field-initializers', '-fPIC', '-fopenmp'], + 'nvcc': [ '-O2', + '--use_fast_math', + '--ftz=true', + '-Xcompiler', '-fPIC', + '-Xcompiler', '-fopenmp', + '--compiler-options', '-Wall', + '--compiler-options', '-Wextra']}), + ], + + + cmdclass={ + 'build_ext': BuildExtension + }, packages=find_packages(), include_package_data=True, classifiers=[ - 'Intended Audience :: Science/Research', - 'Topic :: 
Scientific/Engineering', - 'Topic :: Scientific/Engineering :: Mathematics', - 'Topic :: Scientific/Engineering :: Artificial Intelligence', - 'Topic :: Software Development', - 'Topic :: Software Development :: Libraries', - 'Topic :: Software Development :: Libraries :: Python Modules', - 'Programming Language :: C++', - 'Programming Language :: Python :: 3', - ] + "Operating System :: OS Independent", + "Intended Audience :: Developers and Researchers", + "License :: OSI Approved :: iflytek internal License", + "Programming Language :: python", + ], ) diff --git a/test/example_iqtensor_dump_fuction.py b/test/example_iqtensor_dump_fuction.py deleted file mode 100644 index 94d8bd4..0000000 --- a/test/example_iqtensor_dump_fuction.py +++ /dev/null @@ -1,58 +0,0 @@ -import linger -import numpy as np -import torch -import torch.nn as nn - -channel_size = 3 - - -def try_dump(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(channel_size, channel_size, - kernel_size=1, stride=1, padding=0, bias=True) - self.bn = nn.BatchNorm2d(channel_size) - - def forward(self, x): - - x = self.conv(x) - z = self.bn(x) - - add_rlt = x + z - mul_rlt = x * z - div_rlt = x / 0.5 - sum_rlt = x.sum(axis=[2, 3], keepdim=False) - return add_rlt, mul_rlt, div_rlt, sum_rlt - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - replace_tuple = (nn.Conv2d, nn.ConvTranspose2d, nn.Linear, - nn.BatchNorm2d, linger.NormalizeConvBN2d) - - net = linger.init(net, quant_modules=replace_tuple, mode=linger.QuantMode.QValue) - - bb = torch.randn(1, channel_size, 32, 32).cuda() - - net.train() - for _ in range(22): - net(bb) - net.eval() - - # print(net(bb)) - - with linger.Dumper() as dumper: - net.eval() - # dumper.enable_dump_quanted(net,path="dump_iqadd") - dumper.enable_dump_quanted(net, path="dump_iqtensor") - net(bb) - - # print("dump: ",net(bb)) - - # export_path = 
"dump_iqadd.onnx" - # with torch.no_grad(): #training = torch.onnx.TrainingMode.TRAINING, - # torch.onnx.export(net, (bb),export_path, export_params=True,opset_version=12,operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) diff --git a/test/example_torchintx_functional_dump_function.py b/test/example_torchintx_functional_dump_function.py deleted file mode 100644 index 85d8474..0000000 --- a/test/example_torchintx_functional_dump_function.py +++ /dev/null @@ -1,55 +0,0 @@ -import linger -import numpy as np -import torch -import torch.nn as nn - -channel_size = 3 - - -def try_dump(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(channel_size, channel_size, - kernel_size=1, stride=1, padding=0, bias=True) - # self.bn = nn.BatchNorm2d(channel_size) - - def forward(self, x): - - x = self.conv(x) - # z = self.bn(x) - cat_rlt = torch.cat((x, x), 0) - sigmoid_rlt = torch.sigmoid(x) - tanh_rlt = torch.tanh(x) - clamp_rlt = torch.clamp(x, -2, 0) - return cat_rlt, sigmoid_rlt, tanh_rlt, clamp_rlt - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - replace_tuple = (nn.Conv2d, nn.ConvTranspose2d, nn.Linear, - nn.BatchNorm2d, linger.NormalizeConvBN2d) - - net = linger.init(net, quant_modules=replace_tuple, mode=linger.QuantMode.QValue) - - bb = torch.randn(1, channel_size, 32, 32).cuda() - - net.train() - for _ in range(22): - net(bb) - net.eval() - - # print(net(bb)) - - with linger.Dumper() as dumper: - net.eval() - dumper.enable_dump_quanted(net, path="dump_linger_funtional") - net(bb) - - # print("dump: ",net(bb)) - - # export_path = "dump_iqadd.onnx" - # with torch.no_grad(): #training = torch.onnx.TrainingMode.TRAINING, - # torch.onnx.export(net, (bb),export_path, export_params=True,opset_version=12,operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) diff --git a/test/image_preprocess.py 
b/test/image_preprocess.py deleted file mode 100644 index 731bf97..0000000 --- a/test/image_preprocess.py +++ /dev/null @@ -1,31 +0,0 @@ -from PIL import Image -import numpy as np - -import torch -import torch.nn as nn -import torchvision.transforms as transforms - -# import struct -CIFAR100_TRAIN_MEAN = (1,1,1) -CIFAR100_TRAIN_STD = (1,1,1) - -# Load image -image = Image.open("apple.jpg") - -# Define the preprocessing operations -transform = transforms.Compose([ - transforms.Resize((32, 32)), - transforms.ToTensor(), - transforms.Normalize(CIFAR100_TRAIN_MEAN, CIFAR100_TRAIN_STD) -]) - -# Apply preprocessing operations -preprocessed_image = transform(image) - -# Convert the preprocessed image to a NumPy array -preprocessed_image_np = preprocessed_image.numpy() - -# Convert the preprocessed image data to int8 -preprocessed_image_int8 = np.floor(preprocessed_image_np * 64 + 0.5).astype(np.int8) -preprocessed_image_int8.tofile("apple_after_resize.bin") - diff --git a/test/test_Layernorm_int.py b/test/test_Layernorm_int.py deleted file mode 100644 index 655499c..0000000 --- a/test/test_Layernorm_int.py +++ /dev/null @@ -1,137 +0,0 @@ -import os - -import linger -import numpy as np -import torch -import torch.nn as nn - -if not os.path.exists('data.ignore'): - os.mkdir('data.ignore') - - -def notest_layernorm_int_net(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(1, 3, kernel_size=2, - stride=1, padding=1, bias=False, groups=1) - self.ln = nn.LayerNorm([4, 4]) - self.relu = nn.ReLU() - self.conv1 = nn.Conv2d(3, 1, kernel_size=2, - stride=1, padding=1, bias=False, groups=1) - self.ln1 = nn.LayerNorm([5, 5]) - self.relu1 = nn.ReLU() - self.fc = nn.Linear(25, 100) - - def forward(self, x): - x = self.conv(x) - x = self.ln(x) - x = self.relu(x) - x = self.conv1(x) - x = self.ln1(x) - x = self.relu1(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - # print(x.shape) - x = self.fc(x) - return x - - 
torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - net = Net().cuda() - dummy_input = torch.randn(1, 1, 3, 3).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - net.train() - replace_tuple = (nn.Conv2d, nn.ConvTranspose2d, nn.Linear, - nn.BatchNorm2d, linger.NormalizeConvBN2d, nn.LayerNorm) - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - # # linger.FuseBNIntoConv(net, dummy_input) - linger.trace_layers(net, net, dummy_input, fuse_bn=False) - net = linger.init(net, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - print(net) - - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - # model.to(device) - for i in range(200): - optimizer.zero_grad() - out = net(dummy_input) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - net.eval() - torch.save(net.state_dict(), 'data.ignore/aa.pt') - out1 = net(dummy_input) - with torch.no_grad(): - torch.onnx.export(net, dummy_input, "data.ignore/layernormInt.onnx", export_params=True, - opset_version=11, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - out2 = net(dummy_input) - assert (out1.mean() - out1.mean() - ) < 0.001, print('out1: {}, out2: {}'.format(out1.sum(), out1.sum())) - assert out1.abs().sum() == out2.abs().sum(), 'inconsistant for batchnormint' - - -def notest_monolayernorm(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.ln = nn.LayerNorm([28, 28]) - - def forward(self, x): - x = self.ln(x) - return x - - dummy_input = torch.randn(10, 1, 28, 28).cuda() - device = torch.device("cuda" if torch.cuda.is_available() else "cpu") - net = Net() - ln_lg = Net() - weight = torch.nn.Parameter(torch.empty(28, 28)) - nn.init.normal_(weight) - bias = torch.nn.Parameter(torch.empty(28, 28)) - nn.init.normal_(bias) - replace_tuple = (nn.Conv2d, nn.ConvTranspose2d, nn.Linear, - nn.BatchNorm2d, linger.NormalizeConvBN2d, 
nn.LayerNorm) - ln_lg = linger.init(ln_lg, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - ln_lg.ln.weight = weight - ln_lg.ln.bias = bias - net.ln.weight = weight - net.ln.bias = bias - - ln_lg.train() - net.train() - net = net.to(device) - ln_lg = ln_lg.to(device) - - for _ in range(300): - - out = ln_lg(dummy_input) - out1 = net(dummy_input) - - out = ln_lg(dummy_input) - out1 = net(dummy_input) - ln_lg.eval() - net.eval() - assert (out1.mean() - out.mean() - ) < 0.001, print('out1: {}, out: {}'.format(out1.sum(), out.sum())) - - out2 = ln_lg(dummy_input) - out3 = net(dummy_input) - with torch.no_grad(): - torch.onnx.export(ln_lg, dummy_input, "data.ignore/layernorm.onnx", export_params=True, - opset_version=11, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - - out4 = ln_lg(dummy_input) - - assert (out2.mean() - out3.mean() - ) < 0.001, print('out1: {}, out2: {}'.format(out2.sum(), out3.sum())) - - assert out4.abs().sum() == out2.abs().sum(), 'inconsistant for layernormint' - - -notest_layernorm_int_net() diff --git a/test/test_add.py b/test/test_add.py new file mode 100644 index 0000000..91519c7 --- /dev/null +++ b/test/test_add.py @@ -0,0 +1,114 @@ +import torch +import torch.nn as nn +from linger.quant.qtensor import from_tensor_to_qtensor +import numpy as np + +import linger + +class QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + super(QuantizationTestNet, self).__init__() + + # 简单的卷积层用于构建完整模型 + self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1) + self.pool = nn.AdaptiveAvgPool2d((4, 4)) + self.classifier = nn.Linear(16 * 4 * 4, num_classes) + + def quantized_addition(self, x_float): + """执行量化加法并与浮点加法比较""" + # 浮点加法 + float_add = x_float + x_float + + # 量化加法 + # 将输入量化 + scale = torch.tensor(128, dtype=torch.int32) + x_quant = from_tensor_to_qtensor(x_float, scale, 8) + + # 量化加法 + quant_add = x_quant + x_quant + + # 计算差异 + diff = torch.abs(float_add) - torch.abs(quant_add) + + #计算平均差异和平均浮点结果 
+ mean_float = torch.mean(float_add).item() + mean_diff = torch.mean(diff).item() + + #相对差异 + relative_diff = mean_diff/mean_float + + return float_add, quant_add, relative_diff + + def forward(self, x): + # 在forward中比较两种加法 + float_result, quant_result, relative_diff = self.quantized_addition(x) + + # 返回正常输出和量化比较结果 + return { + 'float_addition': float_result, + 'quant_addition': quant_result, + 'relative_difference': relative_diff + } + +def test_quantization_in_forward(model, num_tests=10): + """ + 测试模型forward中的量化加法比较 + """ + print("测试模型前向传播中的量化加法比较...") + print("=" * 60) + + model.eval() + + model = linger.init(model) + # print(model) + + all_relative_diffs = [] + + for i in range(num_tests): + # 生成随机输入数据 + batch_size = 2 + min_val = 1e-6 # 最小正值 + x = torch.rand(batch_size, 3, 32, 32) * (1.0 - min_val) + min_val + + # 前向传播(包含量化比较) + with torch.no_grad(): + outputs = model(x) + + relative_diff = outputs['relative_difference'] + + all_relative_diffs.append(relative_diff) + + print(f"测试 {i+1}:") + print(f" 相对差异: {relative_diff:.6f}") + print("-" * 40) + + # 统计结果 + final_relative_diff = max(all_relative_diffs) + + print("=" * 60) + print("最终统计结果:") + print(f" 最大相对差异: {final_relative_diff:.6f}") + + return final_relative_diff + +# 运行测试 +if __name__ == "__main__": + # 创建模型 + model = QuantizationTestNet(num_classes=10) + + print("开始量化加法精度测试...") + print("=" * 60) + + # 在forward中比较量化加法 + final_relative_diff = test_quantization_in_forward(model, 10) + + print("\n" + "=" * 60) + print("最终评估:") + + threshold = 0.001 # 阈值可以根据需要调整 + + if final_relative_diff < threshold: + print(f"✓ 量化加法测试成功!最大差异 {final_relative_diff:.6f} < 阈值 {threshold}") + else: + print(f"✗ 量化加法测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") + \ No newline at end of file diff --git a/test/test_ahead_relu.py b/test/test_ahead_relu.py deleted file mode 100644 index 0305d5d..0000000 --- a/test/test_ahead_relu.py +++ /dev/null @@ -1,109 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import 
linger -import numpy as np -import torch -import torch.nn as nn -from linger.ops.ops_names import LINGER_AHEAD_RELU - - -class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn = nn.BatchNorm2d(10) - self.relu = nn.ReLU() - self.conv1 = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.relu1 = nn.ReLU() - self.conv2 = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn1 = nn.BatchNorm2d(10) - self.relu2 = nn.ReLU() - self.fc = nn.Linear(1000, 100) - - def forward(self, x): - x = self.conv(x) - x = self.bn(x) - x = self.relu(x) - x = x - 1 - x = self.conv1(x) - x = self.relu1(x) - x = self.conv2(x) - x = self.bn1(x) - x = self.relu2(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - - -def test_conv_linear(): - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - # random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.Linear) - net.train() - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - linger.FuseConvBNAheadRelu( - net, aa, fused_bn=False, ahead_bn_relu=True, ahead_conv_relu=True) - net = linger.init(net, quant_modules=replace_tuple, mode=linger.QuantMode.QValue) - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - loss = None - for i in range(200): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - # assert loss < 1e-12, 'training loss error' - net.eval() - torch.save(net.state_dict(), 'data.ignore/conv_linear.pt') - out1 = net(aa) - # print(out1) - with torch.no_grad(): - torch.onnx.export(net, aa, "data.ignore/conv_linear.onnx", export_params=True, - opset_version=11, 
operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - assert abs(out1.mean() - 1) < 0.01 - - -def test_ahead_relu_conv_attr(): - model = Net() - aa = torch.randn(1, 10, 10, 10) - linger.trace_layers(model, model, aa) - assert model.conv2.ahead_relu - assert model.conv.ahead_relu - - model = Net() - linger.trace_layers(model, model, aa, ahead_bn_relu=False) - assert hasattr(model.conv1, linger.LINGER_AHEAD_RELU) - assert getattr(model.conv1, linger.LINGER_AHEAD_RELU, False) - - model = Net() - linger.trace_layers(model, model, aa, ahead_conv_relu=False) - assert model.conv.ahead_relu - assert model.conv2.ahead_relu - - model = Net() - linger.trace_layers( - model, model, aa, ahead_conv_relu=False, ahead_bn_relu=False) - assert not model.conv.ahead_relu - assert not model.conv2.ahead_relu - - model = Net() - linger.trace_layers(model, model, aa, fuse_bn=False) - - assert hasattr(model.bn, LINGER_AHEAD_RELU) - assert getattr(model.bn, LINGER_AHEAD_RELU, True) - assert getattr(model.conv1, LINGER_AHEAD_RELU, False) \ No newline at end of file diff --git a/test/test_alexnet.py b/test/test_alexnet.py new file mode 100644 index 0000000..eaeea78 --- /dev/null +++ b/test/test_alexnet.py @@ -0,0 +1,404 @@ +import os +import tarfile +import torch +import torch.nn as nn +import torch.optim as optim +import torchvision +import torchvision.transforms as transforms +from torch.utils.data import DataLoader +import onnx +# import onnxruntime as ort +import numpy as np +from tqdm import tqdm +import shutil + +import linger + +# 设置随机种子确保可复现性 +torch.manual_seed(42) +np.random.seed(42) + +# ====================== +# 1. 
定义AlexNet模型 (适配CIFAR-10) +# ====================== +class AlexNet(nn.Module): + def __init__(self, num_classes=10): + super(AlexNet, self).__init__() + # 第一层块 + self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1) + self.relu = nn.ReLU(inplace=True) + self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2) + + # 第二层块 + self.conv2 = nn.Conv2d(64, 192, kernel_size=3, padding=1) + self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2) + + # 第三层块 + self.conv3 = nn.Conv2d(192, 384, kernel_size=3, padding=1) + + # 第四层块 + self.conv4 = nn.Conv2d(384, 256, kernel_size=3, padding=1) + + # 第五层块 + self.conv5 = nn.Conv2d(256, 256, kernel_size=3, padding=1) + self.pool5 = nn.MaxPool2d(kernel_size=2, stride=2) + + # 用于通道匹配的1x1卷积(残差连接) + self.res_conv1 = nn.Conv2d(192, 384, kernel_size=1) # conv2→conv3 + self.res_conv2 = nn.Conv2d(384, 256, kernel_size=1) # conv3→conv4 + + # 分类器 + self.classifier = nn.Sequential( + nn.Dropout(0.5), + nn.Linear(256 * 4 * 4, 1024), + nn.ReLU(inplace=True), + nn.Dropout(0.5), + nn.Linear(1024, 512), + nn.ReLU(inplace=True), + nn.Linear(512, num_classes), + ) + + def forward(self, x): + # 第一块 + x = self.relu(self.conv1(x)) + x = self.pool1(x) + + # 第二块 + x2 = self.relu(self.conv2(x)) + x2_pooled = self.pool2(x2) + + # 第三块 + 残差连接 + x3 = self.relu(self.conv3(x2_pooled)) + res1 = self.res_conv1(x2_pooled) + x3 = x3 + res1 # 残差连接 + + # 第四块 + 残差连接 + x4 = self.relu(self.conv4(x3)) + res2 = self.res_conv2(x3) + x4 = x4 + res2 # 残差连接 + + # 第五块 + x5 = self.relu(self.conv5(x4)) + x5 = self.pool5(x5) + + # 分类器 + x_flat = torch.flatten(x5, 1) + out = self.classifier(x_flat) + return out + +# ====================== +# 2. 
数据准备 (使用已下载的cifar-10-python.tar.gz) +# ====================== +def extract_cifar10(tar_path, extract_path='./data'): + """ + 解压已下载的CIFAR-10数据集 + + 参数: + tar_path: cifar-10-python.tar.gz的路径 + extract_path: 解压目标路径 + """ + os.makedirs(extract_path, exist_ok=True) + + print(f"解压数据集: {tar_path} -> {extract_path}") + with tarfile.open(tar_path, 'r:gz') as tar: + tar.extractall(path=extract_path) + + # 确保解压后的目录结构正确 + extracted_dir = os.path.join(extract_path, 'cifar-10-batches-py') + if not os.path.exists(extracted_dir): + raise FileNotFoundError(f"解压后未找到cifar-10-batches-py目录,请检查tar文件内容") + + print(f"数据集已成功解压到: {extracted_dir}") + return extracted_dir + +def get_cifar10_data(tar_path, batch_size=128): + """ + 加载CIFAR-10数据集 (使用已下载的tar文件) + + 参数: + tar_path: cifar-10-python.tar.gz的路径 + batch_size: 批次大小 + """ + # 解压数据集 + # import pdb; pdb.set_trace() + # data_dir = extract_cifar10(tar_path) + data_dir = './data' + + # 数据增强和标准化 + transform_train = transforms.Compose([ + transforms.RandomCrop(32, padding=4), + transforms.RandomHorizontalFlip(), + transforms.ToTensor(), + transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)), + ]) + + transform_test = transforms.Compose([ + transforms.ToTensor(), + transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)), + ]) + + # 加载数据集 (设置download=False,因为我们已经解压了) + trainset = torchvision.datasets.CIFAR10( + root=data_dir, + train=True, + download=False, + transform=transform_train + ) + testset = torchvision.datasets.CIFAR10( + root=data_dir, + train=False, + download=False, + transform=transform_test + ) + + trainloader = DataLoader( + trainset, + batch_size=batch_size, + shuffle=True, + num_workers=2, + pin_memory=True + ) + testloader = DataLoader( + testset, + batch_size=batch_size, + shuffle=False, + num_workers=2, + pin_memory=True + ) + + return trainloader, testloader, data_dir + +# ====================== +# 3. 
训练函数 +# ====================== +def train(model, trainloader, criterion, optimizer, device, epoch): + model.train() + running_loss = 0.0 + correct = 0 + total = 0 + + progress_bar = tqdm(trainloader, desc=f'Epoch {epoch+1}', unit='batch') + for inputs, targets in progress_bar: + inputs, targets = inputs.to(device), targets.to(device) + + optimizer.zero_grad() + outputs = model(inputs) + loss = criterion(outputs, targets) + loss.backward() + optimizer.step() + + running_loss += loss.item() + _, predicted = outputs.max(1) + total += targets.size(0) + correct += predicted.eq(targets).sum().item() + + progress_bar.set_postfix({ + 'loss': running_loss / (total / trainloader.batch_size), + 'acc': 100. * correct / total + }) + + return running_loss / len(trainloader), 100. * correct / total + +# ====================== +# 4. 测试函数 +# ====================== +def test(model, testloader, criterion, device): + model.eval() + test_loss = 0 + correct = 0 + total = 0 + + with torch.no_grad(): + for inputs, targets in testloader: + inputs, targets = inputs.to(device), targets.to(device) + outputs = model(inputs) + loss = criterion(outputs, targets) + + test_loss += loss.item() + _, predicted = outputs.max(1) + total += targets.size(0) + correct += predicted.eq(targets).sum().item() + + acc = 100. * correct / total + print(f'Test Loss: {test_loss/len(testloader):.4f} | Test Acc: {acc:.2f}%') + return test_loss / len(testloader), acc + +# ====================== +# 5. 
ONNX导出与验证 +# ====================== +def export_to_onnx(model, device, onnx_path='alexnet_cifar10.onnx'): + # 设置为评估模式 + model.eval() + + # 创建示例输入 (batch_size=1) + dummy_input = torch.randn(1, 3, 32, 32).to(device) + + # 导出ONNX模型 + torch.onnx.export( + model, + dummy_input, + onnx_path, + export_params=True, + opset_version=11, + do_constant_folding=True, + input_names=['input'], + output_names=['output'], + dynamic_axes={ + 'input': {0: 'batch_size'}, + 'output': {0: 'batch_size'} + } + ) + + print(f"ONNX模型已导出至: {onnx_path}") + + # 验证ONNX模型 + onnx_model = onnx.load(onnx_path) + onnx.checker.check_model(onnx_model) + print("ONNX模型验证通过!") + + # 使用ONNX Runtime进行推理验证 + # ort_session = ort.InferenceSession(onnx_path) + + # 准备输入数据 (使用测试集第一张图片) + testset = torchvision.datasets.CIFAR10( + root='./data', + train=False, + download=False + ) + img, label = testset[0] + + # 转换为模型输入格式 + transform = transforms.Compose([ + transforms.ToTensor(), + transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)) + ]) + img_tensor = transform(img).unsqueeze(0).numpy() + + # PyTorch推理 + with torch.no_grad(): + torch_output = model(torch.tensor(img_tensor).to(device)) + torch_pred = torch.argmax(torch_output, 1).item() + + # ONNX Runtime推理 + # ort_inputs = {ort_session.get_inputs()[0].name: img_tensor} + # ort_output = ort_session.run(None, ort_inputs)[0] + # onnx_pred = np.argmax(ort_output) + + print(f"PyTorch预测: {torch_pred}, 真实标签: {label}") + # print(f"PyTorch预测: {torch_pred}, ONNX预测: {onnx_pred}, 真实标签: {label}") + # assert torch_pred == onnx_pred, "PyTorch和ONNX预测结果不一致!" + print("ONNX推理验证成功!") + +# ====================== +# 6. 
主函数 +# ====================== +def main(): + # 超参数设置 + BATCH_SIZE = 128 + EPOCHS = 20 + LR = 0.01 + MODEL_PATH = 'alexnet_cifar10.pth' + ONNX_PATH = 'alexnet_cifar10.onnx' + + # 设备配置 + device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') + print(f"使用设备: {device}") + + # 查找已下载的CIFAR-10数据集 + print("\n" + "="*50) + print("查找已下载的CIFAR-10数据集...") + print("="*50) + + # 尝试在常见位置查找数据集 + possible_paths = [ + './cifar-10-python.tar.gz', + '../cifar-10-python.tar.gz', + './data/cifar-10-python.tar.gz', + os.path.expanduser('~/Downloads/cifar-10-python.tar.gz') + ] + + tar_path = None + for path in possible_paths: + if os.path.exists(path): + tar_path = path + break + + if tar_path is None: + # 提示用户输入路径 + tar_path = input("未找到cifar-10-python.tar.gz文件,请输入完整路径: ").strip() + if not os.path.exists(tar_path): + raise FileNotFoundError(f"指定的文件不存在: {tar_path}") + + print(f"找到数据集文件: {tar_path}") + + # 获取数据 + print("\n" + "="*50) + print("加载CIFAR-10数据集...") + print("="*50) + trainloader, testloader, data_dir = get_cifar10_data(tar_path, BATCH_SIZE) + + # 初始化模型 + model = AlexNet(num_classes=10).to(device) + print("\n模型结构:\n", model) + + model = linger.init(model) + print(model) + + # 损失函数和优化器 + criterion = nn.CrossEntropyLoss() + optimizer = optim.SGD(model.parameters(), lr=LR, momentum=0.9, weight_decay=5e-4) + scheduler = optim.lr_scheduler.ReduceLROnPlateau( + optimizer, mode='max', factor=0.5, patience=2, verbose=True + ) + + # 训练循环 + best_acc = 0 + print("\n" + "="*50) + print("开始训练...") + print("="*50) + + for epoch in range(EPOCHS): + train_loss, train_acc = train(model, trainloader, criterion, optimizer, device, epoch) + test_loss, test_acc = test(model, testloader, criterion, device) + + # 学习率调整 + scheduler.step(test_acc) + + # 保存最佳模型 + if test_acc > best_acc: + best_acc = test_acc + torch.save(model.state_dict(), MODEL_PATH) + print(f"保存最佳模型至 {MODEL_PATH} (准确率: {best_acc:.2f}%)") + + print(f"\n训练完成! 
最佳测试准确率: {best_acc:.2f}%") + + # 加载最佳模型进行最终测试 + model.load_state_dict(torch.load(MODEL_PATH)) + print("\n" + "="*50) + print("使用最佳模型进行最终测试:") + print("="*50) + test(model, testloader, criterion, device) + + # 导出ONNX模型 + print("\n" + "="*50) + print("导出ONNX模型...") + print("="*50) + export_to_onnx(model, device, ONNX_PATH) + + # 清理临时数据 (保留原始tar文件,只删除解压后的目录) + print("\n" + "="*50) + print("清理临时数据...") + print("="*50) + if os.path.exists(data_dir): + try: + shutil.rmtree(data_dir) + print(f"已删除临时解压目录: {data_dir}") + except Exception as e: + print(f"清理临时数据时出错: {e}") + + print("\n" + "="*50) + print("所有任务完成!") + print("="*50) + +if __name__ == '__main__': + main() diff --git a/test/test_avg_pool_int.py b/test/test_avg_pool_int.py deleted file mode 100644 index 442f471..0000000 --- a/test/test_avg_pool_int.py +++ /dev/null @@ -1,69 +0,0 @@ -import os - -import linger -import numpy as np -import torch -import torch.nn as nn - -if not os.path.exists("data.ignore"): - os.makedirs("data.ignore") - - -def test_avgpool2dint(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn = nn.BatchNorm2d(10) - self.relu = nn.ReLU() - self.pool = nn.AvgPool2d((2, 2), (2, 2), (0, 0), False) - self.fc = nn.Linear(250, 100) - - def forward(self, x): - x = self.conv(x) - x = self.bn(x) - x = self.relu(x) - n, c, h, w = x.shape - x = self.pool(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - dummy_input = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.Linear, nn.AvgPool2d) - net.train() - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - # linger.disable_quant(net.fc) - net = linger.init(net, 
quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - loss = None - for i in range(200): - optimizer.zero_grad() - out = net(dummy_input) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - # if i == 190: - # import pdb; pdb.set_trace() - loss.backward() - optimizer.step() - net.eval() - torch.save(net.state_dict(), 'data.ignore/aa.pt') - - out1 = net(dummy_input) - with torch.no_grad(): - torch.onnx.export(net, dummy_input, "data.ignore/avg_pool.onnx", export_params=True, opset_version=11, - operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - print("out1.mean(): ", out1.mean()) - assert abs(out1.mean() - 1.0) < 0.15 diff --git a/test/test_batchnorm_int.py b/test/test_batchnorm_int.py deleted file mode 100644 index d651f0c..0000000 --- a/test/test_batchnorm_int.py +++ /dev/null @@ -1,80 +0,0 @@ -import os - -import linger -import numpy as np -import torch -import torch.nn as nn - -if not os.path.exists('data.ignore'): - os.mkdir('data.ignore') - - -def test_batchnorm_int_net(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, - stride=1, padding=1, bias=False, groups=2) - self.bn = nn.BatchNorm2d(10) - self.relu = nn.ReLU() - self.conv1 = nn.Conv2d(10, 10, kernel_size=3, - stride=1, padding=1, bias=False, groups=2) - self.bn1 = nn.BatchNorm2d(10) - self.relu1 = nn.ReLU() - self.fc = nn.Linear(1000, 100) - - def forward(self, x): - x = self.conv(x) - x = self.bn(x) - x = self.relu(x) - x = self.conv1(x) - x = self.bn1(x) - x = self.relu1(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - dummy_input = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - net.train() - optimizer = 
torch.optim.SGD(net.parameters(), lr=0.01) - - net = Net().cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.ConvTranspose2d, nn.Linear, - nn.BatchNorm2d, linger.NormalizeConvBN2d) - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - # linger.FuseBNIntoConv(net, dummy_input) - linger.trace_layers(net, net, dummy_input, fuse_bn=False) - net = linger.init(net, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - for i in range(150): - optimizer.zero_grad() - out = net(dummy_input) - loss = criterion(out, target) - if i % 30 == 29: - print('loss: ', loss) - loss.backward() - optimizer.step() - - net.eval() - out = net(dummy_input) - - torch.save(net.state_dict(), 'data.ignore/aa.pt') - out1 = net(dummy_input) - with torch.no_grad(): - torch.onnx.export(net, dummy_input,"data.ignore/batchnormInt.onnx",export_params=True,opset_version=11,operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - out2 = net(dummy_input) - assert (out.mean() - out1.mean() - ) < 0.001, print('out1: {}, out2: {}'.format(out.sum(), out1.sum())) - assert out.abs().sum() == out2.abs().sum(), 'inconsistant for batchnormint' diff --git a/test/test_bmm.py b/test/test_bmm.py new file mode 100644 index 0000000..e6aec51 --- /dev/null +++ b/test/test_bmm.py @@ -0,0 +1,114 @@ +import torch +import torch.nn as nn +from linger.quant.qtensor import from_tensor_to_qtensor +import numpy as np + +import linger + +class QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + super(QuantizationTestNet, self).__init__() + + # 简单的卷积层用于构建完整模型 + self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1) + self.pool = nn.AdaptiveAvgPool2d((4, 4)) + self.classifier = nn.Linear(16 * 4 * 4, num_classes) + + def quantized_bmm(self, x_float): + """执行量化批量矩阵乘法并与浮点矩阵乘法比较""" + x_reshaped_float = x_float.view(-1, 32, 32) # [6, 32, 32] + + float_bmm = torch.bmm(x_reshaped_float, 
x_reshaped_float) + + # 将输入量化 + scale = torch.tensor(128, dtype=torch.int32) + x_quant = from_tensor_to_qtensor(x_float, scale, 8) + x_reshaped_quant = x_quant.view(-1, 32, 32) # [6, 32, 32] + + quant_bmm = torch.bmm(x_reshaped_quant, x_reshaped_quant) + + # 计算差异 + diff = torch.abs(float_bmm) - torch.abs(quant_bmm) + + #计算平均差异和平均浮点结果 + mean_float = torch.mean(float_bmm).item() + mean_diff = torch.mean(diff).item() + + #相对差异 + relative_diff = mean_diff/mean_float + + return float_bmm, quant_bmm, relative_diff + + def forward(self, x): + # 在forward中比较两种矩阵乘法 + float_result, quant_result, relative_diff = self.quantized_bmm(x) + + # 返回正常输出和量化比较结果 + return { + 'float_addition': float_result, + 'quant_addition': quant_result, + 'relative_difference': relative_diff + } + +def test_quantization_in_forward(model, num_tests=10): + """ + 测试模型forward中的量化批量矩阵乘法比较 + """ + print("测试模型前向传播中的量化批量矩阵乘法比较...") + print("=" * 60) + + model.eval() + + model = linger.init(model) + # print(model) + + all_relative_diffs = [] + + for i in range(num_tests): + # 生成随机输入数据 + batch_size = 2 + min_val = 1e-6 # 最小正值 + x = torch.rand(batch_size, 3, 32, 32) * (1.0 - min_val) + min_val + + # 前向传播(包含量化比较) + with torch.no_grad(): + outputs = model(x) + + relative_diff = outputs['relative_difference'] + + all_relative_diffs.append(relative_diff) + + print(f"测试 {i+1}:") + print(f" 相对差异: {relative_diff:.6f}") + print("-" * 40) + + # 统计结果 + final_relative_diff = max(all_relative_diffs) + + print("=" * 60) + print("最终统计结果:") + print(f" 最大相对差异: {final_relative_diff:.6f}") + + return final_relative_diff + +# 运行测试 +if __name__ == "__main__": + # 创建模型 + model = QuantizationTestNet(num_classes=10) + + print("开始量化批量矩阵乘法精度测试...") + print("=" * 60) + + # 在forward中比较量化加法 + final_relative_diff = test_quantization_in_forward(model, 10) + + print("\n" + "=" * 60) + print("最终评估:") + + threshold = 0.001 # 阈值可以根据需要调整 + + if final_relative_diff < threshold: + print(f"✓ 量化批量矩阵乘法测试成功!最大差异 {final_relative_diff:.6f} < 阈值 
{threshold}") + else: + print(f"✗ 量化批量矩阵乘法测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") + \ No newline at end of file diff --git a/test/test_bmm_quant.py b/test/test_bmm_quant.py deleted file mode 100644 index 35104c6..0000000 --- a/test/test_bmm_quant.py +++ /dev/null @@ -1,59 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import linger -import numpy -import torch -import torch.nn as nn - - -def test_bmmint_net(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.linear1 = nn.Linear(12, 12) - self.linear2 = nn.Linear(12, 12) - self.linear3 = nn.Linear(16, 16) - - @torch.no_grad() - def forward(self, input): - - x = self.linear1(input) - - y = self.linear2(input) - - x = x.view(8, 4, 3) - - y = y.view(8, 3, 4) - x = torch.bmm(x, y) - - x = x.view(8, 16) - x = self.linear3(x) - - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - numpy.random.seed(1) - # aa = torch.LongTensor([[1,2,4,5],[4,3,2,9]]).cuda() - aa = torch.randn((8, 12), requires_grad=True).cuda() - replace_tuple = (nn.Linear) - - net = Net().cuda() - - linger.SetFunctionBmmQuant(True) # default false - - net = linger.init(net, quant_modules=replace_tuple, mode=linger.QuantMode.QValue) - criterion = nn.MSELoss() - print(net) - optimizer = torch.optim.Adam(net.parameters(), lr=0.01) - loss = None - label = torch.ones((8, 16)).cuda() - for i in range(100): - optimizer.zero_grad() - out = net(aa) - - with torch.no_grad(): - net.eval() - torch.onnx.export(net, (aa), "data.ignore/bmm_int.onnx", export_params=True, opset_version=12, - operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - - linger.SetFunctionBmmQuant(False) # set default diff --git a/test/test_cat.py b/test/test_cat.py new file mode 100644 index 0000000..54560e7 --- /dev/null +++ b/test/test_cat.py @@ -0,0 +1,110 @@ +import torch +import torch.nn as nn +from linger.quant.qtensor import from_tensor_to_qtensor +import numpy as np + +import linger + +class 
QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + super(QuantizationTestNet, self).__init__() + + # 简单的卷积层用于构建完整模型 + self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1) + self.pool = nn.AdaptiveAvgPool2d((4, 4)) + self.classifier = nn.Linear(16 * 4 * 4, num_classes) + + def quantized_cat(self, x_float, dim): + """执行量化拼接并与浮点拼接比较""" + + # tuple_tensor = tuple([x_float, x_float]) + float_cat = torch.cat([x_float, x_float], dim = dim) + + # 将输入量化 + scale = torch.tensor(128, dtype=torch.int32) + x_quant = from_tensor_to_qtensor(x_float, scale, 8) + + quant_cat = torch.cat([x_quant, x_quant], dim = dim) + + # 计算差异 + diff = torch.abs(float_cat) - torch.abs(quant_cat) + + #计算平均差异和平均浮点结果 + mean_float = torch.mean(float_cat).item() + mean_diff = torch.mean(diff).item() + + #相对差异 + relative_diff = mean_diff/mean_float + + return float_cat, quant_cat, relative_diff + + def forward(self, x, dim): + # 在forward中比较两种拼接 + float_result, quant_result, relative_diff = self.quantized_cat(x, dim) + + # 返回正常输出和量化比较结果 + return { + 'float_addition': float_result, + 'quant_addition': quant_result, + 'relative_difference': relative_diff + } + +def test_quantization_in_forward(model, num_tests=10): + print("测试模型前向传播中的量化拼接比较...") + print("=" * 60) + + model.eval() + + model = linger.init(model) + # print(model) + + all_relative_diffs = [] + + for i in range(num_tests): + # 生成随机输入数据 + batch_size = 2 + min_val = 1e-6 # 最小正值 + x = torch.rand(batch_size, 3, 32, 32) * (1.0 - min_val) + min_val + + # 前向传播(包含量化比较) + with torch.no_grad(): + outputs = model(x, dim = 0) + + relative_diff = outputs['relative_difference'] + + all_relative_diffs.append(relative_diff) + + print(f"测试 {i+1}:") + print(f" 相对差异: {relative_diff:.6f}") + print("-" * 40) + + # 统计结果 + final_relative_diff = max(all_relative_diffs) + + print("=" * 60) + print("最终统计结果:") + print(f" 最大相对差异: {final_relative_diff:.6f}") + + return final_relative_diff + +# 运行测试 +if __name__ == "__main__": + # 创建模型 + model = 
QuantizationTestNet(num_classes=10) + + print("开始量化拼接精度测试...") + print("=" * 60) + + # 在forward中比较量化拼接 + final_relative_diff = test_quantization_in_forward(model, 10) + + print("\n" + "=" * 60) + print("最终评估:") + + threshold = 0.001 # 阈值可以根据需要调整 + + if final_relative_diff < threshold: + print(f"✓ 量化拼接测试成功!最大差异 {final_relative_diff:.6f} < 阈值 {threshold}") + else: + print(f"✗ 量化拼接测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") + \ No newline at end of file diff --git a/test/test_clamp_batchnorm2d.py b/test/test_clamp_batchnorm2d.py deleted file mode 100644 index 979d36e..0000000 --- a/test/test_clamp_batchnorm2d.py +++ /dev/null @@ -1,67 +0,0 @@ -import numpy as np -import torch -import torch.nn as nn -from linger import NormalizeBatchNorm2d - - -def test_normalize_batchnorm2d_foward(): - module = NormalizeBatchNorm2d( - 64, normalize_data=4, normalize_weight=4, normalize_bias=4) - module.weight.data.fill_(8) - module.bias.data.fill_(8) - input = 8 * torch.randn(1, 64, 512, 512) - assert (input < 8).any() - assert (input > 4).any() - m = module(input) - assert (m < 4.0001).all() - - -def test_normalize_batchnorm2d(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn = NormalizeBatchNorm2d(10, normalize_data=1.5, normalize_weight=1.5, normalize_bias=1.5) - self.relu = nn.ReLU() - self.fc = nn.Linear(1000, 100) - - def forward(self, x): - x = self.conv(x) - x = self.bn(x) - x = self.relu(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - # random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - - criterion = nn.MSELoss() - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - loss = None - for i in range(200): - optimizer.zero_grad() - out 
= net(aa) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - # assert loss < 1e-12, 'training loss error' - net.eval() - torch.save(net.state_dict(), 'data.ignore/convbn_normalize.pt') - out1 = net(aa) - # print(out1) - with torch.no_grad(): - torch.onnx.export(net, aa, "data.ignore/conv_bn_normalize.onnx", export_params=True, - opset_version=9, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - assert abs(out1.mean() - 1) < 0.01 \ No newline at end of file diff --git a/test/test_clamp_conv1d.py b/test/test_clamp_conv1d.py deleted file mode 100644 index 8bb0780..0000000 --- a/test/test_clamp_conv1d.py +++ /dev/null @@ -1,16 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch -from linger import NormalizeConv1d - - -def test_normalize_conv2d_foward(): - module = NormalizeConv1d(64, 128, kernel_size=3, - normalize_data=4, normalize_weight=4, normalize_bias=4) - module.weight.data.fill_(8) - module.bias.data.fill_(8) - input = 8 * torch.randn(1, 64, 512) - assert (input < 8).any() - assert (input > 4).any() - m = module(input) - assert (m < 4.0001).all() diff --git a/test/test_clamp_conv2d.py b/test/test_clamp_conv2d.py deleted file mode 100644 index ee01863..0000000 --- a/test/test_clamp_conv2d.py +++ /dev/null @@ -1,14 +0,0 @@ -import torch -from linger import NormalizeConv2d - - -def test_normalize_conv2d_foward(): - module = NormalizeConv2d(2, 4, (3, 3), normalize_data=4, - normalize_weight=4, normalize_bias=4) - module.weight.data.fill_(8) - module.bias.data.fill_(8) - input = 8 * torch.randn(1, 2, 4, 4) - assert (input < 8).any() - assert (input > 4).any() - m = module(input) - assert (m < 4.0001).all() \ No newline at end of file diff --git a/test/test_clamp_convtranspose2d.py b/test/test_clamp_convtranspose2d.py deleted file mode 100644 index f8c1d26..0000000 --- a/test/test_clamp_convtranspose2d.py +++ /dev/null @@ -1,17 +0,0 @@ -#!/usr/bin/env python -# 
-*- encoding: utf-8 -*- - -import torch -from linger import NormalizeConvTranspose2d - - -def test_normalize_linear_foward(): - module = NormalizeConvTranspose2d( - 128, 512, 3, bias=True, normalize_data=4, normalize_weight=4, normalize_bias=4) - module.weight.data.fill_(8) - module.bias.data.fill_(8) - input = 8 * torch.randn(2, 128, 10, 10) - assert (input < 8).any() - assert (input > 4).any() - m = module(input) - assert (m < 4.0001).all() diff --git a/test/test_clamp_gru.py b/test/test_clamp_gru.py deleted file mode 100644 index a0adee2..0000000 --- a/test/test_clamp_gru.py +++ /dev/null @@ -1,159 +0,0 @@ -import linger -import numpy -import torch -import torch.nn as nn -from linger import NormalizeFastGRU - - -def test_lstmpint_net(): - - def getacc(lprob, target): - num_class = lprob.size()[1] - _, new_target = torch.broadcast_tensors(lprob, target) - - remove_pad_mask = new_target.ne(-1) - lprob = lprob[remove_pad_mask] - - target = target[target != -1] - target = target.unsqueeze(-1) - - lprob = lprob.reshape((-1, num_class)) - - preds = torch.argmax(lprob, dim=1) - - correct_holder = torch.eq(preds.squeeze(), target.squeeze()).float() - - num_corr = correct_holder.sum() - num_sample = torch.numel(correct_holder) - acc = num_corr/num_sample - return acc - - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv0 = nn.Conv2d(1, 100, kernel_size=( - 1, 3), padding=(0, 1), groups=1, bias=True) - self.bn0 = nn.BatchNorm2d(100) - self.relu0 = nn.ReLU() - self.conv1 = nn.Conv2d(100, 100, kernel_size=( - 1, 3), padding=(0, 1), groups=1, bias=True) - self.bn1 = nn.BatchNorm2d(100) - self.relu1 = nn.ReLU() - self.lstmp = nn.GRU(100, 50, num_layers=1, - batch_first=True, bidirectional=True) - self.final_conv = nn.Conv2d(100, 10, 1, 1, 0) - - def forward(self, input, batch_lengths=None, initial_state=None): - x = self.conv0(input) - x = self.bn0(x) - x = self.relu0(x) - x = self.conv1(x) - x = self.bn1(x) - x = self.relu1(x) - n, c, h, w 
= x.shape - x = x.reshape((n, -1, 1, w)).squeeze(2) - x = x.permute((0, 2, 1)) # b t d - x = nn.utils.rnn.pack_padded_sequence( - x, batch_lengths, batch_first=True, enforce_sorted=False) - x, _ = self.lstmp(x, initial_state) # output b, t, h (10, 10, 100) - x, _ = nn.utils.rnn.pad_packed_sequence(x, batch_first=True) - x = x.permute((2, 0, 1)) - d, b, t = x.shape - x = x.reshape((1, d, 1, b*t)) # (1, 100, 1, 100) - x = self.final_conv(x) # (1, 10, 1, 100) (d, b*t) - return x - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - numpy.random.seed(1) - dummy_input = torch.randn(10, 1, 1, 10).cuda() - label = torch.randint(10, (10, 10)).cuda() # class=10 - mask = torch.ones(10, 10) - for i in range(9): - index = numpy.random.randint(5, 10) - mask[i, index:] = 0 - label[i, index:] = -1 - - input_lengths = mask.long().sum(1).cpu().numpy() - input_lengths = torch.tensor(input_lengths) # .cuda() - batch_size = 10 - hidden_size = 50 - size = 2 - initial_state = torch.zeros(size, batch_size, hidden_size).cuda() - net = Net().cuda() - replace_modules = (nn.Conv2d, nn.GRU, nn.BatchNorm2d) - criterion = nn.CrossEntropyLoss(ignore_index=-1) - # net = linger.normalize_layers(net) - - optimizer = torch.optim.Adam(net.parameters(), lr=0.001) - - print('net: ', net) - # net.load_state_dict(torch.load('data.ignore/gru.pt')) - loss = None - for i in range(5): - optimizer.zero_grad() - out = net(dummy_input, input_lengths, initial_state) - out = out.squeeze().permute((1, 0)) # (b*t, d) - loss = criterion(out, label.reshape(-1)) - if i % 50 == 0: - print('loss: ', loss) - acc = getacc(out, label.reshape(-1, 1)) - print('train acc: ', acc) - loss.backward() - optimizer.step() - - net.eval() - out1 = net(dummy_input, input_lengths, initial_state) - out1 = out1.squeeze().permute((1, 0)) - acc = getacc(out1, label.reshape(-1, 1)) - print('test acc: ', acc) - torch.save(net.state_dict(), 'data.ignore/gru.pt') - # out3 = net(dummy_input, input_lengths, initial_state) - # out3 = 
out3.squeeze().permute((1,0)) - # input_lengths = torch.tensor(input_lengths)#.cuda() - # with torch.no_grad(): - # net.eval() - # torch.onnx.export(net, (dummy_input, input_lengths, initial_state), "data.ignore/gru.onnx", export_params=True, - # opset_version=12, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - out2 = net(dummy_input, input_lengths, initial_state) - out2 = out2.squeeze().permute((1, 0)) - # print('out: ', out1) - assert (out1 == out2).all() - - -def test_normalize_gru(): - - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.rnn = nn.GRU(10, 20, 1, batch_first=True, bidirectional=True) - # self.rnn = NormalizeFastGRU(10, 20, 1, batch_first=True, bidirectional=True) - - def forward(self, input, h0): - x, hn = self.rnn(input, h0) - - return x - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - numpy.random.seed(1) - - dummy_input = torch.randn(3, 5, 10).cuda() - h0 = torch.randn(2, 3, 20).cuda() - - model = Net().cuda() - - model = linger.normalize_layers(model) - - # torch.save(model.state_dict(), "data.ignore/torch_gru.pt") - # model.load_state_dict(torch.load("data.ignore/torch_gru.pt")) - model.eval() - - out1 = model(dummy_input, h0) - - print('model: ', model) - - with torch.no_grad(): - model.eval() - torch.onnx.export(model, (dummy_input, h0), "data.ignore/normalize_torch_gru.onnx", export_params=True, - opset_version=12, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) diff --git a/test/test_clamp_gru_in_onnxinfer.py b/test/test_clamp_gru_in_onnxinfer.py deleted file mode 100644 index cd0aac1..0000000 --- a/test/test_clamp_gru_in_onnxinfer.py +++ /dev/null @@ -1,30 +0,0 @@ - - - -def notest_clamp_gru_in_onnxinfer(): - pass - - # import onnxinfer - # import torch - # import numpy - # torch.manual_seed(1) - # torch.cuda.manual_seed_all(1) - # numpy.random.seed(1) - - # dummy_input = torch.randn(3, 5, 10).cuda() - # h0 = torch.randn(2, 3, 20).cuda() - - # 
sessoption = onnxinfer.InferSessionOptions() - # sess = onnxinfer.InferSession( - # 'data.ignore/normalize_torch_gru.onnx', sessoption, is_fuse=False, save_transform_model=None) - # data = {sess.GetInputNames()[0]: dummy_input, sess.GetInputNames()[1]: h0} - # rlt = sess.Run(data_in=data) - # output = rlt[0].AsReadOnlyNumpy() - - # sessoption1 = onnxinfer.InferSessionOptions() - # sess1 = onnxinfer.InferSession( - # 'data.ignore/torch_gru.onnx', sessoption1, is_fuse=False, save_transform_model=None) - # data1 = {sess1.GetInputNames()[0]: dummy_input, - # sess1.GetInputNames()[1]: h0} - # rlt1 = sess1.Run(data_in=data1) - # output1 = rlt1[0].AsReadOnlyNumpy() diff --git a/test/test_clamp_layernorm.py b/test/test_clamp_layernorm.py deleted file mode 100644 index 322230f..0000000 --- a/test/test_clamp_layernorm.py +++ /dev/null @@ -1,65 +0,0 @@ -import numpy as np -import torch -import torch.nn as nn -from linger import NormalizeLayerNorm - - -def test_normalize_batchnorm2d_foward(): - module = NormalizeLayerNorm( - [4,4], normalize_data=4, normalize_weight=4, normalize_bias=4) - module.weight.data.fill_(8) - module.bias.data.fill_(8) - input = 8 * torch.randn(1, 64, 4, 4) - assert (input < 8).any() - assert (input > 4).any() - m = module(input) - assert (m < 4.0001).all() - -def test_normalize_batchnorm2d(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(1, 3, kernel_size=2, - stride=1, padding=1, bias=False, groups=1) - self.ln = NormalizeLayerNorm([4,4], normalize_data=4, normalize_weight=4, normalize_bias=4) - self.relu = nn.ReLU() - self.fc = nn.Linear(48, 100) - - def forward(self, x): - x = self.conv(x) - x = self.ln(x) - x = self.relu(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - # random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(1, 1, 3, 3).cuda() - target = 
torch.ones(1, 100).cuda() - - criterion = nn.MSELoss() - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - loss = None - for i in range(200): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - # assert loss < 1e-12, 'training loss error' - net.eval() - torch.save(net.state_dict(), 'data.ignore/layernorm_normalize.pt') - out1 = net(aa) - # print(out1) - with torch.no_grad(): - torch.onnx.export(net, aa, "data.ignore/conv_layernorm_normalize.onnx", export_params=True, - opset_version=9, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) diff --git a/test/test_clamp_linear.py b/test/test_clamp_linear.py deleted file mode 100644 index 83aece6..0000000 --- a/test/test_clamp_linear.py +++ /dev/null @@ -1,16 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch -from linger import NormalizeLinear - - -def test_normalize_linear_foward(): - module = NormalizeLinear(512, 128, True, normalize_data=4, - normalize_weight=4, normalize_bias=4) - module.weight.data.fill_(8) - module.bias.data.fill_(8) - input = 8 * torch.randn(2, 512) - assert (input < 8).any() - assert (input > 4).any() - m = module(input) - assert (m < 4.0001).all() diff --git a/test/test_clamp_lstm.py b/test/test_clamp_lstm.py deleted file mode 100644 index 56b9a3c..0000000 --- a/test/test_clamp_lstm.py +++ /dev/null @@ -1,118 +0,0 @@ -import linger -import numpy -import torch -import torch.nn as nn - - -def test_lstmpint_net(): - - def getacc(lprob, target): - num_class = lprob.size()[1] - _, new_target = torch.broadcast_tensors(lprob, target) - - remove_pad_mask = new_target.ne(-1) - lprob = lprob[remove_pad_mask] - - target = target[target != -1] - target = target.unsqueeze(-1) - - lprob = lprob.reshape((-1, num_class)) - - preds = torch.argmax(lprob, dim=1) - - correct_holder = torch.eq(preds.squeeze(), target.squeeze()).float() - - num_corr = 
correct_holder.sum() - num_sample = torch.numel(correct_holder) - acc = num_corr/num_sample - return acc - - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv0 = nn.Conv2d(1, 100, kernel_size=( - 1, 3), padding=(0, 1), groups=1, bias=True) - self.bn0 = nn.BatchNorm2d(100) - self.relu0 = nn.ReLU() - self.conv1 = nn.Conv2d(100, 100, kernel_size=( - 1, 3), padding=(0, 1), groups=1, bias=True) - self.bn1 = nn.BatchNorm2d(100) - self.relu1 = nn.ReLU() - # self.lstmp = nn.LSTM(100, 100, num_layers=1, batch_first=True, bidirectional=False) - self.lstmp = nn.LSTM(100, 50, num_layers=1, - batch_first=True, bidirectional=True) - self.final_conv = nn.Conv2d(100, 10, 1, 1, 0) - - def forward(self, input, batch_lengths=None, initial_state=None): - x = self.conv0(input) - x = self.bn0(x) - x = self.relu0(x) - x = self.conv1(x) - x = self.bn1(x) - x = self.relu1(x) - n, c, h, w = x.shape - x = x.reshape(n, -1, 1, w).squeeze(2) - x = x.permute((0, 2, 1)) # b t d - x = nn.utils.rnn.pack_padded_sequence( - x, batch_lengths, batch_first=True, enforce_sorted=False) - # output b, t, h (10, 10, 100) - x, hidden = self.lstmp(x, initial_state) - x, _ = nn.utils.rnn.pad_packed_sequence(x, batch_first=True) - x = x.permute((2, 0, 1)) - d, b, t = x.shape - x = x.reshape((1, d, 1, b*t)) # (1, 100, 1, 100) - x = self.final_conv(x) # (1, 10, 1, 100) (d, b*t) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - numpy.random.seed(1) - dummy_input = torch.randn(10, 1, 1, 10).cuda() - label = torch.randint(10, (10, 10)).cuda() # class=10 - mask = torch.ones(10, 10) - for i in range(9): - index = numpy.random.randint(5, 10) - mask[i, index:] = 0 - label[i, index:] = -1 - - input_lengths = mask.long().sum(1).cpu().numpy() - input_lengths = torch.tensor(input_lengths) # .cuda() - print('input_lengths: ', input_lengths) - # input_lengths = None - # label = label.permute((1, 0)) - batch_size = 10 - hidden_size = 50 - size = 2 - # batch_size = 10; 
hidden_size=100; size=1 - initial_state = (torch.zeros(size, batch_size, hidden_size).cuda(), - torch.zeros(size, batch_size, hidden_size).cuda()) - # initial_state = None - net = Net().cuda() - criterion = nn.CrossEntropyLoss(ignore_index=-1) - net = linger.normalize_layers(net) - # net = linger.init(net) - # net.load_state_dict(torch.load('data.ignore/lstm.pt')) - print('net: ', net) - optimizer = torch.optim.Adam(net.parameters(), lr=0.001) - loss = None - for i in range(200): - optimizer.zero_grad() - out = net(dummy_input, input_lengths, initial_state) - out = out.squeeze().permute((1, 0)) # (b*t, d) - loss = criterion(out, label.reshape(-1)) - if i % 50 == 0: - print('loss: ', loss) - acc = getacc(out, label.reshape(-1, 1)) - print('train acc: ', acc) - loss.backward() - optimizer.step() - - net.eval() - out1 = net(dummy_input, input_lengths, initial_state) - out1 = out1.squeeze().permute((1, 0)) - acc = getacc(out1, label.reshape(-1, 1)) - print('test acc: ', acc) - assert acc > 0.4 - # torch.save(net.state_dict(), 'data.ignore/lstm1.pt') - out2 = net(dummy_input, input_lengths, initial_state) - out2 = out2.squeeze().permute((1, 0)) - assert (out1 == out2).all() \ No newline at end of file diff --git a/test/test_clamp_lstm_onnx_export.py b/test/test_clamp_lstm_onnx_export.py deleted file mode 100644 index f18d180..0000000 --- a/test/test_clamp_lstm_onnx_export.py +++ /dev/null @@ -1,63 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import linger -import numpy -import torch -import torch.nn as nn - - -def test_NormalizeLSTM_onnx_export(): - - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - - self.lstm = linger.NormalizeFastLSTM( - 100, 50, num_layers=1, batch_first=True, bidirectional=True) - # self.lstm = nn.LSTM( - # 100, 50, num_layers=1, batch_first=True, bidirectional=True) - - def forward(self, input, batch_lengths=None, initial_state=None): - # input (b t d) - # x = nn.utils.rnn.pack_padded_sequence(x, 
batch_lengths, batch_first=True, enforce_sorted=False) - - # normalize - x = (input, batch_lengths, True, False) - x, hc = self.lstm(x, initial_state) - x, _ = x - - # torch - # x, hc = self.lstm(input, initial_state) - - # x, _ = nn.utils.rnn.pad_packed_sequence(x, batch_first=True) - - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - numpy.random.seed(1) - aa = torch.randn(10, 10, 100).cuda() - label = torch.randint(10, (10, 10)).cuda() # class=10 - mask = torch.ones(10, 10) - for i in range(9): - index = numpy.random.randint(5, 10) - mask[i, index:] = 0 - label[i, index:] = -1 - - input_lengths = mask.long().sum(1).cpu().numpy() - input_lengths = torch.tensor(input_lengths) # .cuda() - - batch_size = 10 - hidden_size = 50 - size = 2 - initial_state = (torch.zeros(size, batch_size, hidden_size).cuda(), - torch.zeros(size, batch_size, hidden_size).cuda()) - net = Net().cuda() - - net.eval() - - out1 = net(aa, input_lengths, initial_state) - - with torch.no_grad(): - net.eval() - torch.onnx.export(net, (aa, input_lengths, initial_state), "data.ignore/normalize_torch_lstm.onnx", export_params=True, - opset_version=12, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) diff --git a/test/test_conv1d.py b/test/test_conv1d.py new file mode 100644 index 0000000..844952d --- /dev/null +++ b/test/test_conv1d.py @@ -0,0 +1,127 @@ +import torch +import torch.nn as nn +import numpy as np + +import linger +from linger.quant.ops.qmodule import * +from linger.config import QUANT_CONFIGS + +q_configs = QUANT_CONFIGS + +class QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + super(QuantizationTestNet, self).__init__() + + self.conv1d_float = nn.Conv1d( + in_channels=3, + out_channels=16, + kernel_size=3, + padding=1 + ) + + self.conv1d_quant = QConv1d( + in_channels=3, + out_channels=16, + kernel_size=3, + padding=1, + weights_cfg=q_configs.quant_info.to_dict(), + activations_cfg=q_configs.quant_info.to_dict(), + 
bias_cfg=q_configs.quant_info.to_dict(), + constrain = q_configs.clamp_info.to_dict() + ) + + with torch.no_grad(): + self.conv1d_float.weight.copy_(self.conv1d_quant.qweight) + if self.conv1d_quant.qbias is not None: + self.conv1d_float.bias.copy_(self.conv1d_quant.qbias) + + def forward(self, x): + x = x.view(2, 3, -1) + result_float = self.conv1d_float(x) + result_quant = self.conv1d_quant(x) + + return { + 'result_float': result_float, + 'result_quant': result_quant + } + +def test_quantization_in_forward(model): + print("测试模型前向传播中的量化conv1d比较...") + print("=" * 60) + + device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') + print(f"使用设备: {device}") + + batch_size = 2 + x = torch.rand(batch_size, 3, 32, 32).to(device) + + with torch.no_grad(): + model = model.to(device) + outputs = model(x) + + float_result = outputs['result_float'] + quant_result = outputs['result_quant'] + + diff = torch.abs(float_result) - torch.abs(quant_result) + mean_float = torch.mean(float_result).item() + mean_diff = torch.mean(diff).item() + relative_diff = mean_diff/mean_float + + return relative_diff + +def check_conv_gradients(model, input_tensor): + model.train() + model.zero_grad() + output_dict = model(input_tensor) + + for key, output_value in output_dict.items(): + target = torch.randn_like(output_value) + loss = torch.nn.functional.mse_loss(output_value, target) + + loss.backward() + + print("=== 卷积层梯度检查 ===") + print(f"输入形状: {input_tensor.shape}") + print(f"输出形状: {output_value.shape}") + print(f"损失值: {loss.item():.6f}\n") + + for name, param in model.named_parameters(): + if param.grad is not None: + grad = param.grad + + is_zero = torch.allclose(grad, torch.zeros_like(grad), atol=1e-8) + grad_norm = grad.norm().item() + + if 'conv1d_float' in name and 'conv1d_quant' not in name: + layer_type = "conv1d" + elif 'conv1d_quant' in name: + layer_type = "qconv1d" + else: + continue + + status = "✓ 全0" if is_zero else "✗ 非0" + print(f"{layer_type:8} {name:15} | 
{status} | 梯度范数: {grad_norm:.2e}") + + return loss.item() + +if __name__ == "__main__": + model = QuantizationTestNet(num_classes=10) + + input_tensor = torch.randn(2, 3, 32, 32) + loss = check_conv_gradients(model, input_tensor) + + print("开始量化conv1d精度测试...") + print("=" * 60) + + final_relative_diff = test_quantization_in_forward(model) + final_relative_diff = torch.abs(torch.tensor(final_relative_diff)).item() + + print("\n" + "=" * 60) + print("最终评估:") + + threshold = 0.001 + + if final_relative_diff < threshold: + print(f"✓ 量化conv1d测试成功!最大差异 {final_relative_diff:.6f} < 阈值 {threshold}") + else: + print(f"✗ 量化conv1d测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") \ No newline at end of file diff --git a/test/test_conv2d.py b/test/test_conv2d.py new file mode 100644 index 0000000..1fd5f26 --- /dev/null +++ b/test/test_conv2d.py @@ -0,0 +1,126 @@ +import torch +import torch.nn as nn +import numpy as np + +import linger +from linger.quant.ops.qmodule import * +from linger.config import QUANT_CONFIGS + +q_configs = QUANT_CONFIGS + +class QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + super(QuantizationTestNet, self).__init__() + + self.conv2d_float = nn.Conv2d( + in_channels=3, + out_channels=16, + kernel_size=3, + padding=1 + ) + + self.conv2d_quant = QConv2d( + in_channels=3, + out_channels=16, + kernel_size=3, + padding=1, + weights_cfg=q_configs.quant_info.to_dict(), + activations_cfg=q_configs.quant_info.to_dict(), + bias_cfg=q_configs.quant_info.to_dict(), + constrain = q_configs.clamp_info.to_dict() + ) + + with torch.no_grad(): + self.conv2d_float.weight.copy_(self.conv2d_quant.qweight) + if self.conv2d_quant.qbias is not None: + self.conv2d_float.bias.copy_(self.conv2d_quant.qbias) + + def forward(self, x): + result_float = self.conv2d_float(x) + result_quant = self.conv2d_quant(x) + + return { + 'result_float': result_float, + 'result_quant': result_quant + } + +def test_quantization_in_forward(model): + 
print("测试模型前向传播中的量化conv2d比较...") + print("=" * 60) + + device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') + print(f"使用设备: {device}") + + batch_size = 2 + x = torch.rand(batch_size, 3, 32, 32).to(device) + + with torch.no_grad(): + model = model.to(device) + outputs = model(x) + + float_result = outputs['result_float'] + quant_result = outputs['result_quant'] + + diff = torch.abs(float_result) - torch.abs(quant_result) + mean_float = torch.mean(float_result).item() + mean_diff = torch.mean(diff).item() + relative_diff = mean_diff/mean_float + + return relative_diff + +def check_conv_gradients(model, input_tensor): + model.train() + model.zero_grad() + output_dict = model(input_tensor) + + for key, output_value in output_dict.items(): + target = torch.randn_like(output_value) + loss = torch.nn.functional.mse_loss(output_value, target) + + loss.backward() + + print("=== 卷积层梯度检查 ===") + print(f"输入形状: {input_tensor.shape}") + print(f"输出形状: {output_value.shape}") + print(f"损失值: {loss.item():.6f}\n") + + for name, param in model.named_parameters(): + if param.grad is not None: + grad = param.grad + + is_zero = torch.allclose(grad, torch.zeros_like(grad), atol=1e-8) + grad_norm = grad.norm().item() + + if 'conv2d_float' in name and 'conv2d_quant' not in name: + layer_type = "conv2d" + elif 'conv2d_quant' in name: + layer_type = "qconv2d" + else: + continue + + status = "✓ 全0" if is_zero else "✗ 非0" + print(f"{layer_type:8} {name:15} | {status} | 梯度范数: {grad_norm:.2e}") + + return loss.item() + +if __name__ == "__main__": + model = QuantizationTestNet(num_classes=10) + + input_tensor = torch.randn(2, 3, 32, 32) + loss = check_conv_gradients(model, input_tensor) + + print("开始量化conv2d精度测试...") + print("=" * 60) + + final_relative_diff = test_quantization_in_forward(model) + final_relative_diff = torch.abs(torch.tensor(final_relative_diff)).item() + + print("\n" + "=" * 60) + print("最终评估:") + + threshold = 0.001 + + if final_relative_diff < threshold: + 
print(f"✓ 量化conv2d测试成功!最大差异 {final_relative_diff:.6f} < 阈值 {threshold}") + else: + print(f"✗ 量化conv2d测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") \ No newline at end of file diff --git a/test/test_conv_dropout.py b/test/test_conv_dropout.py deleted file mode 100644 index 636f413..0000000 --- a/test/test_conv_dropout.py +++ /dev/null @@ -1,57 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import linger -import numpy as np -import torch -import torch.nn as nn - - -def test_conv_dropout_linear(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.dropout = nn.Dropout(0.5) - self.fc = nn.Linear(1000, 100) - - def forward(self, x): - x = self.conv(x) - x = self.dropout(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - # random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.Linear, nn.AvgPool2d) - net = linger.init(net) - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - loss = None - for i in range(200): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - # assert loss < 1e-12, 'training loss error' - # import pdb; pdb.set_trace() - net.eval() - torch.save(net.state_dict(), 'data.ignore/conv_dropout_linear.pt') - out1 = net(aa) - # print(out1) - with torch.no_grad(): - torch.onnx.export(net, aa, "data.ignore/conv_dropout_linear.onnx", export_params=True, - opset_version=11, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - assert abs(out1.mean() - 1) < 0.1 diff --git a/test/test_conv_linear.py b/test/test_conv_linear.py deleted file mode 
100644 index 26611a1..0000000 --- a/test/test_conv_linear.py +++ /dev/null @@ -1,52 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import numpy as np -import torch -import torch.nn as nn - - -def test_conv_linear(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.fc = nn.Linear(1000, 100) - - def forward(self, x): - x = self.conv(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - # random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.Linear, nn.AvgPool2d) - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - loss = None - for i in range(200): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - # assert loss < 1e-12, 'training loss error' - net.eval() - torch.save(net.state_dict(), 'data.ignore/conv_linear.pt') - out1 = net(aa) - # print(out1) - with torch.no_grad(): - torch.onnx.export(net, aa, "data.ignore/conv_linear.onnx", export_params=True, - opset_version=11, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - assert abs(out1.mean() - 1) < 0.01 diff --git a/test/test_conv_permute.py b/test/test_conv_permute.py deleted file mode 100644 index e0940ae..0000000 --- a/test/test_conv_permute.py +++ /dev/null @@ -1,58 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import linger -import numpy as np -import torch -import torch.nn as nn - - -def test_conv_dropout_linear(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.dropout 
= nn.Dropout(0.5) - self.fc = nn.Linear(100, 100) - - def forward(self, x): - x = self.conv(x) - x = x.squeeze() - x = x.permute(0, 1, 2) # permute only support len(x.shape)<=3 - x = self.dropout(x) - c, h, w = x.shape - x = x.reshape((c, h*w)) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - # random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - dummy_input = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.Linear, nn.AvgPool2d) - net = linger.init(net) - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - loss = None - for i in range(200): - optimizer.zero_grad() - out = net(dummy_input) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - # assert loss < 1e-12, 'training loss error' - net.eval() - torch.save(net.state_dict(), 'data.ignore/conv_permute_linear.pt') - out1 = net(dummy_input) - # print(out1) - with torch.no_grad(): - torch.onnx.export(net, dummy_input, "data.ignore/conv_permute_linear.onnx", export_params=True, - opset_version=11, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - assert abs(out1.mean() - 1) < 1 \ No newline at end of file diff --git a/test/test_conv_scale.py b/test/test_conv_scale.py deleted file mode 100644 index c75a46a..0000000 --- a/test/test_conv_scale.py +++ /dev/null @@ -1,55 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import linger -import numpy as np -import torch -import torch.nn as nn - - -def test_conv_scale(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.transpose = nn.ConvTranspose2d(2, 2, 5, 5, 2, 4, 2, True, 2) - self.conv = nn.Conv2d(2, 2, kernel_size=3, stride=1, - padding=1, bias=True) - self.fc = nn.Linear(392, 100, bias=True) - - def forward(self, x): - x = self.transpose(x) - x = self.conv(x) - # x = 
linger.quant_tensor(self, x, mode=linger.QuantMode.QValue) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - # random.seed(1) - - torch.cuda.set_device(0) - net2 = Net().cuda() - dummy_input = torch.randn(1, 2, 2, 2).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.ConvTranspose2d, nn.Linear) - net2.train() - - net2 = linger.init(net2, quant_modules=replace_tuple, mode=linger.QuantMode.QValue) - - optimizer2 = torch.optim.SGD(net2.parameters(), lr=0.01) - for i in range(150): - optimizer2.zero_grad() - out2 = net2(dummy_input) - loss2 = criterion(out2, target) - loss2.backward() - optimizer2.step() - if i % 30 == 29: - print('loss2 {}'.format(loss2)) - net2.eval() - # out2 = net2(dummy_input) - with torch.no_grad(): - torch.onnx.export(net2, dummy_input, "data.ignore/2.onnx", export_params=True, opset_version=11, - operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - # assert out2.mean() - target.mean() < 0.01 diff --git a/test/test_conv_with_eval_normalize.py b/test/test_conv_with_eval_normalize.py deleted file mode 100644 index 46b0bb2..0000000 --- a/test/test_conv_with_eval_normalize.py +++ /dev/null @@ -1,54 +0,0 @@ -import linger -import numpy as np -import torch -import torch.nn as nn - - -def test_clip(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.transpose = nn.ConvTranspose2d(2, 2, 5, 5, 2, 4, 2, True, 2) - self.conv = nn.Conv2d(2, 2, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn = nn.BatchNorm2d(2) - self.relu = nn.ReLU() - self.fc = nn.Linear(392, 100, bias=True) - - def forward(self, x): - x = self.transpose(x) - x = self.conv(x) - x = self.bn(x) - x = self.relu(x) - n, c, h, w = x.shape - x = x.view(n, c*h*w) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - 
torch.cuda.set_device(0) - net2 = Net().cuda() - dummy_input = torch.randn(1, 2, 2, 2).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.ConvTranspose2d, nn.Linear) - net2.train() - net2 = linger.init(net2, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - - optimizer2 = torch.optim.SGD(net2.parameters(), lr=0.01) - for i in range(150): - optimizer2.zero_grad() - out2 = net2(dummy_input) - loss2 = criterion(out2, target) - loss2.backward() - optimizer2.step() - if i % 30 == 29: - print('loss2 {}'.format(loss2)) - torch.save(net2.state_dict(), 'data.ignore/eval_normalize.pt') - net2.eval() - out2 = net2(dummy_input) - assert abs(out2.mean() - target.mean()) < 0.01 - # reset avoid other test files params be changed diff --git a/test/test_conv_with_normalize.py b/test/test_conv_with_normalize.py deleted file mode 100644 index 5d949ac..0000000 --- a/test/test_conv_with_normalize.py +++ /dev/null @@ -1,56 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import linger -import numpy as np -import torch -import torch.nn as nn - - -def test_clip(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.transpose = nn.ConvTranspose2d(2, 2, 5, 5, 2, 4, 2, True, 2) - self.conv = nn.Conv2d(2, 2, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn = nn.BatchNorm2d(2) - self.relu = nn.ReLU() - self.fc = nn.Linear(392, 100, bias=True) - - def forward(self, x): - x = self.transpose(x) - x = self.conv(x) - x = self.bn(x) - x = self.relu(x) - n, c, h, w = x.shape - x = x.view(n, c*h*w) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - # random.seed(1) - - torch.cuda.set_device(0) - net2 = Net().cuda() - dummy_input = torch.randn(1, 2, 2, 2).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.ConvTranspose2d, nn.Linear) - net2.train() - net2 = linger.init(net2, 
quant_modules=replace_tuple, mode=linger.QuantMode.QValue) - - optimizer2 = torch.optim.SGD(net2.parameters(), lr=0.01) - for i in range(150): - optimizer2.zero_grad() - out2 = net2(dummy_input) - loss2 = criterion(out2, target) - loss2.backward() - optimizer2.step() - if i % 30 == 29: - print('loss2 {}'.format(loss2)) - - net2.eval() - out2 = net2(dummy_input) - assert abs(out2.mean() - target.mean()) < 0.01 - # reset avoid other test files params be changed \ No newline at end of file diff --git a/test/test_convbn1d_normalize.py b/test/test_convbn1d_normalize.py deleted file mode 100644 index 9f89acd..0000000 --- a/test/test_convbn1d_normalize.py +++ /dev/null @@ -1,81 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import linger -import numpy as np -import torch -import torch.nn as nn -from linger.utils import PlatFormQuant, QuantMode - - -def test_convbn1d_normalize(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv0 = nn.Conv1d(10, 10, kernel_size=3, - stride=1, padding=1, bias=True) - self.relu0 = nn.ReLU() - self.conv = nn.Conv1d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn = nn.BatchNorm1d(10) - self.relu = nn.ReLU() - self.fc = nn.Linear(10*50, 100) - - def forward(self, x): - x = self.conv0(x) - x = self.relu0(x) - x = self.conv(x) - x = self.bn(x) - # x = self.convbn(x) - x = self.relu(x) - n, c, l = x.shape - x = x.view((n, c*l)) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - # random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(10, 10, 50).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - loss = None - for i in range(1000): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - # assert loss < 1e-12, 
'training loss error' - net.eval() - torch.save(net.state_dict(), 'data.ignore/convbn_clamp1d.pt') - out1 = net(aa) - # print(out1) - linger.trace_layers(net, net, aa) - # linger.disable_normalize(net.conv0) - net = linger.normalize_layers(net) - net.train().cuda() - net.load_state_dict(torch.load('data.ignore/convbn_clamp1d.pt')) - out2 = net(aa) - net.eval() - out3 = net(aa) - torch.save(net.state_dict(), 'data.ignore/convbn_quant1d.pt') - linger.SetPlatFormQuant(platform_quant=PlatFormQuant.luna_quant) - # linger.SetCastorBiasInt16(True) - net = linger.init(net, mode=QuantMode.QValue) - net.train() - net.load_state_dict(torch.load('data.ignore/convbn_quant1d.pt')) - net.cuda().train() - out4 = net(aa) - - with torch.no_grad(): - torch.onnx.export(net, aa, "data.ignore/conv_bn_clamp1d.onnx", export_params=True, - opset_version=9, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - assert abs(out1.mean() - 1) < 0.01 - diff --git a/test/test_convbn_normalize.py b/test/test_convbn_normalize.py deleted file mode 100644 index 53eb813..0000000 --- a/test/test_convbn_normalize.py +++ /dev/null @@ -1,59 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import linger -import numpy as np -import torch -import torch.nn as nn - - -def test_convbn_clamp(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - # self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - # padding=1, bias=True) - # self.bn = nn.BatchNorm2d(10) - self.convbn = linger.NormalizeConvBN2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True, normalize_data=100, normalize_weight=100, normalize_bias=100) - self.relu = nn.ReLU6() - self.fc = nn.Linear(1000, 100) - - def forward(self, x): - # x = self.conv(x) - # x = self.bn(x) - x = self.convbn(x) - x = self.relu(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - # random.seed(1) - - 
torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - loss = None - for i in range(200): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - # assert loss < 1e-12, 'training loss error' - net.eval() - torch.save(net.state_dict(), 'data.ignore/convbn_normalize.pt') - out1 = net(aa) - # print(out1) - with torch.no_grad(): - torch.onnx.export(net, aa, "data.ignore/conv_bn_normalize.onnx", export_params=True, - opset_version=9, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - assert abs(out1.mean() - 1) < 0.01 diff --git a/test/test_convtranspose1d.py b/test/test_convtranspose1d.py new file mode 100644 index 0000000..c31163c --- /dev/null +++ b/test/test_convtranspose1d.py @@ -0,0 +1,123 @@ +import torch +import torch.nn as nn +import numpy as np + +import linger +from linger.quant.ops.qmodule import * +from linger.config import QUANT_CONFIGS + +q_configs = QUANT_CONFIGS + +class QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + super(QuantizationTestNet, self).__init__() + + self.convtranspose1d_float = nn.ConvTranspose1d( + in_channels=32, out_channels=16, kernel_size=3, stride=2, padding=1 + ) + + self.convtranspose1d_quant = QConvTranspose1d( + in_channels=32, out_channels=16, kernel_size=3, stride=2, padding=1, + weights_cfg=q_configs.quant_info.to_dict(), + activations_cfg=q_configs.quant_info.to_dict(), + bias_cfg=q_configs.quant_info.to_dict(), + constrain = q_configs.clamp_info.to_dict() + ) + + with torch.no_grad(): + self.convtranspose1d_float.weight.copy_(self.convtranspose1d_quant.qweight) + if self.convtranspose1d_quant.qbias is not None: + self.convtranspose1d_float.bias.copy_(self.convtranspose1d_quant.qbias) + + def forward(self, x): + # x 
= x.view(2, 3, -1) + result_float = self.convtranspose1d_float(x) + result_quant = self.convtranspose1d_quant(x) + + return { + 'result_float': result_float, + 'result_quant': result_quant + } + +def test_quantization_in_forward(model): + print("测试模型前向传播中的量化convtranspose1d比较...") + print("=" * 60) + + device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') + print(f"使用设备: {device}") + + # batch_size = 2 + # x = torch.rand(batch_size, 3, 32, 32).to(device) + x = torch.randn(4, 32, 50).to(device) + + with torch.no_grad(): + model = model.to(device) + outputs = model(x) + + float_result = outputs['result_float'] + quant_result = outputs['result_quant'] + + diff = torch.abs(float_result) - torch.abs(quant_result) + mean_float = torch.mean(float_result).item() + mean_diff = torch.mean(diff).item() + relative_diff = mean_diff/mean_float + + return relative_diff + +def check_conv_gradients(model, input_tensor): + model.train() + model.zero_grad() + output_dict = model(input_tensor) + + for key, output_value in output_dict.items(): + target = torch.randn_like(output_value) + loss = torch.nn.functional.mse_loss(output_value, target) + + loss.backward() + + print("=== 卷积层梯度检查 ===") + print(f"输入形状: {input_tensor.shape}") + print(f"输出形状: {output_value.shape}") + print(f"损失值: {loss.item():.6f}\n") + + for name, param in model.named_parameters(): + if param.grad is not None: + grad = param.grad + + is_zero = torch.allclose(grad, torch.zeros_like(grad), atol=1e-8) + grad_norm = grad.norm().item() + + if 'convtranspose1d_float' in name and 'convtranspose1d_quant' not in name: + layer_type = "convtranspose1d" + elif 'convtranspose1d_quant' in name: + layer_type = "qconvtranspose1d" + else: + continue + + status = "✓ 全0" if is_zero else "✗ 非0" + print(f"{layer_type:8} {name:15} | {status} | 梯度范数: {grad_norm:.2e}") + + return loss.item() + +if __name__ == "__main__": + model = QuantizationTestNet(num_classes=10) + + # input_tensor = torch.randn(2, 3, 32, 32) + # 
input_tensor = torch.randn(4, 32, 50) + # loss = check_conv_gradients(model, input_tensor) + + print("开始量化convtranspose1d精度测试...") + print("=" * 60) + + final_relative_diff = test_quantization_in_forward(model) + final_relative_diff = torch.abs(torch.tensor(final_relative_diff)).item() + + print("\n" + "=" * 60) + print("最终评估:") + + threshold = 0.001 + + if final_relative_diff < threshold: + print(f"✓ 量化convtranspose1d测试成功!最大差异 {final_relative_diff:.6f} < 阈值 {threshold}") + else: + print(f"✗ 量化convtranspose1d测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") \ No newline at end of file diff --git a/test/test_convtranspose2d.py b/test/test_convtranspose2d.py new file mode 100644 index 0000000..3e9a6b4 --- /dev/null +++ b/test/test_convtranspose2d.py @@ -0,0 +1,122 @@ +import torch +import torch.nn as nn +import numpy as np + +import linger +from linger.quant.ops.qmodule import * +from linger.config import QUANT_CONFIGS + +q_configs = QUANT_CONFIGS + +class QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + super(QuantizationTestNet, self).__init__() + + self.convtranspose2d_float = nn.ConvTranspose2d( + in_channels=32, out_channels=16, kernel_size=3, stride=2, padding=1 + ) + + self.convtranspose2d_quant = QConvTranspose2d( + in_channels=32, out_channels=16, kernel_size=3, stride=2, padding=1, + weights_cfg=q_configs.quant_info.to_dict(), + activations_cfg=q_configs.quant_info.to_dict(), + bias_cfg=q_configs.quant_info.to_dict(), + constrain = q_configs.clamp_info.to_dict() + ) + + with torch.no_grad(): + self.convtranspose2d_float.weight.copy_(self.convtranspose2d_quant.qweight) + if self.convtranspose2d_quant.qbias is not None: + self.convtranspose2d_float.bias.copy_(self.convtranspose2d_quant.qbias) + + def forward(self, x): + result_float = self.convtranspose2d_float(x) + result_quant = self.convtranspose2d_quant(x) + + return { + 'result_float': result_float, + 'result_quant': result_quant + } + +def 
test_quantization_in_forward(model): + print("测试模型前向传播中的量化convtranspose2d比较...") + print("=" * 60) + + device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') + print(f"使用设备: {device}") + + # batch_size = 2 + # x = torch.rand(batch_size, 3, 32, 32).to(device) + x = torch.randn(4, 32, 16, 16).to(device) + + with torch.no_grad(): + model = model.to(device) + outputs = model(x) + + float_result = outputs['result_float'] + quant_result = outputs['result_quant'] + + diff = torch.abs(float_result) - torch.abs(quant_result) + mean_float = torch.mean(float_result).item() + mean_diff = torch.mean(diff).item() + relative_diff = mean_diff/mean_float + + return relative_diff + +def check_conv_gradients(model, input_tensor): + model.train() + model.zero_grad() + output_dict = model(input_tensor) + + for key, output_value in output_dict.items(): + target = torch.randn_like(output_value) + loss = torch.nn.functional.mse_loss(output_value, target) + + loss.backward() + + print("=== 卷积层梯度检查 ===") + print(f"输入形状: {input_tensor.shape}") + print(f"输出形状: {output_value.shape}") + # print(f"输入梯度: {input_tensor.grad.norm().item()}") + print(f"损失值: {loss.item():.6f}\n") + + for name, param in model.named_parameters(): + if param.grad is not None: + grad = param.grad + + is_zero = torch.allclose(grad, torch.zeros_like(grad), atol=1e-8) + grad_norm = grad.norm().item() + + if 'convtranspose2d_float' in name and 'convtranspose2d_quant' not in name: + layer_type = "convtranspose2d" + elif 'convtranspose2d_quant' in name: + layer_type = "qconvtranspose2d" + else: + continue + + status = "✓ 全0" if is_zero else "✗ 非0" + print(f"{layer_type:8} {name:15} | {status} | 梯度范数: {grad_norm:.2e}") + + return loss.item() + +if __name__ == "__main__": + model = QuantizationTestNet(num_classes=10) + + input_tensor = torch.randn(4, 32, 16, 16) + loss = check_conv_gradients(model, input_tensor) + + print("开始量化convtranspose2d精度测试...") + print("=" * 60) + + final_relative_diff = 
test_quantization_in_forward(model) + final_relative_diff = torch.abs(torch.tensor(final_relative_diff)).item() + + print("\n" + "=" * 60) + print("最终评估:") + + threshold = 0.001 + + if final_relative_diff < threshold: + print(f"✓ 量化convtranspose2d测试成功!最大差异 {final_relative_diff:.6f} < 阈值 {threshold}") + else: + print(f"✗ 量化convtranspose2d测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") \ No newline at end of file diff --git a/test/test_debug.py b/test/test_debug.py new file mode 100644 index 0000000..d04f5a5 --- /dev/null +++ b/test/test_debug.py @@ -0,0 +1,80 @@ +import numpy as np +import struct +import math +def q_rsqrt_fixed(x_int, frac_bits=15, iterations=1): + """ + 纯整数定点版本快速倒平方根 (Fast Inverse Sqrt) + 输入: + x_int: 定点整数(Q格式) + frac_bits: 小数位数,比如Q15就是15 + iterations: 牛顿迭代次数 (1或2次) + 输出: + y_int: 定点整数(Q格式),近似 1/sqrt(x) + """ + if x_int <= 0: + return 0 # 避免除0或负数开方 + + # ===== 初始猜测 ===== + # 简化的初始估计:y0 = (1 << frac_bits) * (1/sqrt(x_real)) + # 由于我们没有浮点,这里用比例常数近似: y0 = (1 << (frac_bits*3//2)) // sqrt(x_int) + # 实际上更快的做法是用经验公式:y0 ≈ C / x_int + B + # 经过实验,C=0x5A82 (~0.7071 * 2^15), B=0 可得到较好初值 + # y_int = (1 << frac_bits) # 初始猜测设为1.0 + # y_int = (1 << (frac_bits * 3 // 2)) // (x_int >> (frac_bits // 2)) + float_x = float(x_int) / (1 << frac_bits) + f = struct.pack('f', float_x) # float -> bytes + i = struct.unpack('I', f)[0] # bytes -> uint32 (unsigned int) + # magic constant from original implementation + i = 0x5f3759df - (i >> 1) + # reinterpret bits as float again + y = struct.unpack('f', struct.pack('I', i))[0] + y_int = round(y * (1 << frac_bits)) + + # Newton 迭代: y = y * (1.5 - 0.5 * x * y * y) + threehalfs = (3 << (frac_bits - 1)) # 1.5 in Q format + half = (1 << (frac_bits - 1)) # 0.5 in Q format + + for _ in range(iterations): + # y^2 + y2 = (y_int * y_int) >> frac_bits + # x * y^2 + xy2 = (x_int * y2) >> frac_bits + # (1.5 - 0.5 * x * y^2) + term = threehalfs - ((half * xy2) >> frac_bits) + # y * term + y_int = (y_int * term) >> frac_bits + + return 
y_int + +import torch +def test_data(x, N): + sum_x = torch.sum(x) + sum_x2 = torch.sum(x * x) + y = N * sum_x2 - sum_x * sum_x + return y + +# ================== 测试与验证 ================== +if __name__ == "__main__": + x = torch.randint(1, 2, (1,)) + y1 = test_data(x, 10) + y2 = test_data(x, 100) + y3 = test_data(x, 1000) + y4 = test_data(x, 10000) + y5 = test_data(x, 100000) + print(f"{y1},{y2},{y3},{y4},{y5}") + # frac_bits = 15 + # # 输入一些Q15数 (对应0.25, 0.5, 1.0, 2.0) + # x_real = np.array([0.25, 0.5, 1.0, 2.0, 3.0, 10.0], dtype=np.float32) + # x_int = np.round(x_real * (1 << frac_bits)).astype(np.int64) + + # results = [] + # for xi, xr in zip(x_int, x_real): + # y_int = q_rsqrt_fixed(xi, frac_bits=frac_bits, iterations=2) + # y_real = y_int / (1 << frac_bits) + # y_ref = 1.0 / np.sqrt(xr) + # err = abs(y_real - y_ref) + # results.append((xr, y_real, y_ref, err)) + + # print(f"{'x_real':>8} {'y_fixed':>10} {'y_ref':>10} {'error':>10}") + # for xr, y, ref, err in results: + # print(f"{xr:8.3f} {y:10.6f} {ref:10.6f} {err:10.6f}") diff --git a/test/test_embedding.py b/test/test_embedding.py new file mode 100644 index 0000000..2dffdf9 --- /dev/null +++ b/test/test_embedding.py @@ -0,0 +1,136 @@ +import torch +import torch.nn as nn +import numpy as np + +import linger +from linger.quant.ops.qmodule import * +from linger.config import QUANT_CONFIGS +q_configs = QUANT_CONFIGS + +class QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + super(QuantizationTestNet, self).__init__() + + self.embedding_float = nn.Embedding( + num_embeddings=256, # 对应32个bin + embedding_dim=64 # 每个像素的嵌入维度 + ) + + self.embedding_quant = QEmbedding( + num_embeddings=256, # 对应32个bin + embedding_dim=64, # 每个像素的嵌入维度 + weights_cfg=q_configs.quant_info.to_dict(), + activations_cfg=q_configs.quant_info.to_dict(), + bias_cfg=q_configs.quant_info.to_dict(), + constrain = q_configs.clamp_info.to_dict(), + open_ihook=False, + open_ohook=False + ) + + with torch.no_grad(): + 
self.embedding_float.weight.copy_(self.embedding_quant.qweight) + + def forward(self, x): + result_quant = self.embedding_quant(x) + result_float = self.embedding_float(x) + + # 返回正常输出和量化比较结果 + return { + 'result_float': result_float, + 'result_quant': result_quant + } + +def discretize_pixels(tensor, num_bins=256): + min_val = tensor.min() + max_val = tensor.max() + normalized = (tensor - min_val) / (max_val - min_val) + discrete = (normalized * (num_bins - 1)).long() + return discrete + +def test_quantization_in_forward(model): + print("测试模型前向传播中的量化embedding比较...") + print("=" * 60) + + device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') + print(f"使用设备: {device}") + + batch_size = 2 + x = torch.rand(batch_size, 3, 32, 32).to(device) + x = discretize_pixels(x, num_bins=256) + + # 前向传播(包含量化比较) + with torch.no_grad(): + model = model.to(device) + outputs = model(x) + + float_result = outputs['result_float'] + quant_result = outputs['result_quant'] + + diff = torch.abs(float_result) - torch.abs(quant_result) + mean_float = torch.mean(float_result).item() + mean_diff = torch.mean(diff).item() + relative_diff = mean_diff/mean_float + + return relative_diff + +def check_conv_gradients(model, input_tensor): + model.train() + model.zero_grad() + output_dict = model(input_tensor) + + for key, output_value in output_dict.items(): + target = torch.randn_like(output_value) + loss = torch.nn.functional.mse_loss(output_value, target) + + loss.backward() + + print("=== 卷积层梯度检查 ===") + print(f"输入形状: {input_tensor.shape}") + print(f"输出形状: {output_value.shape}") + # print(f"输入梯度: {input_tensor.grad.norm().item()}") + print(f"损失值: {loss.item():.6f}\n") + + for name, param in model.named_parameters(): + if param.grad is not None: + grad = param.grad + + is_zero = torch.allclose(grad, torch.zeros_like(grad), atol=1e-8) + grad_norm = grad.norm().item() + + if 'embedding_float' in name and 'embedding_quant' not in name: + layer_type = "embedding" + elif 
'embedding_quant' in name: + layer_type = "qembedding" + else: + continue + + status = "✓ 全0" if is_zero else "✗ 非0" + print(f"{layer_type:8} {name:15} | {status} | 梯度范数: {grad_norm:.2e}") + + return loss.item() + +# 运行测试 +if __name__ == "__main__": + # 创建模型 + model = QuantizationTestNet(num_classes=10) + + input_tensor = torch.randn(2, 3, 32, 32) + input_tensor = discretize_pixels(input_tensor, num_bins=256) + loss = check_conv_gradients(model, input_tensor) + + print("开始量化embedding精度测试...") + print("=" * 60) + + final_relative_diff = test_quantization_in_forward(model) + final_relative_diff = torch.abs(torch.tensor(final_relative_diff)).item() + + print("\n" + "=" * 60) + print("最终评估:") + + threshold = 0.001 # 阈值可以根据需要调整 + + if final_relative_diff < threshold: + print(f"✓ 量化embedding测试成功!最大差异 {final_relative_diff:.6f} < 阈值 {threshold}") + else: + print(f"✗ 量化embedding测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") + \ No newline at end of file diff --git a/test/test_embedding_int.py b/test/test_embedding_int.py deleted file mode 100644 index 93a92c2..0000000 --- a/test/test_embedding_int.py +++ /dev/null @@ -1,47 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import linger -import numpy -import torch -import torch.nn as nn - - -def test_embeddingint_net(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.gather = nn.Embedding(10, 3) - - def forward(self, input): - - x = self.gather(input) - x = x * 1 - - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - numpy.random.seed(1) - aa = torch.LongTensor([[1, 2, 4, 5], [4, 3, 2, 9]]).cuda() - replace_tuple = (nn.Conv2d, nn.LSTM, nn.Embedding) - - net = Net().cuda() - net = linger.init(net, quant_modules=replace_tuple, mode=linger.QuantMode.QValue) - criterion = nn.CrossEntropyLoss(ignore_index=-1) - print(net) - optimizer = torch.optim.Adam(net.parameters(), lr=0.1) - loss = None - label = torch.randint(1, (2, 3)).cuda() - for i in range(10): - 
optimizer.zero_grad() - out = net(aa) - loss = criterion(out, label) - if i % 1 == 0: - print('loss: ', loss) - - loss.backward() - optimizer.step() - with torch.no_grad(): - net.eval() - net(aa) - torch.onnx.export(net, (aa), "data.ignore/embedding_int.onnx", export_params=True, - opset_version=12, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) \ No newline at end of file diff --git a/test/test_export_dict_onnx.py b/test/test_export_dict_onnx.py deleted file mode 100644 index ad86f59..0000000 --- a/test/test_export_dict_onnx.py +++ /dev/null @@ -1,86 +0,0 @@ -import os - -import linger -import numpy as np -import torch -import torch.nn as nn - -if not os.path.exists('data.ignore'): - os.mkdir('data.ignore') - - -def test_batchnorm_int_net(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(2, 2, kernel_size=2, - stride=1, padding=1, bias=False, groups=2) - self.bn = nn.BatchNorm2d(2) - self.relu = nn.ReLU() - self.conv1 = nn.Conv2d(2, 2, kernel_size=2, - stride=1, padding=1, bias=False, groups=2) - self.bn1 = nn.BatchNorm2d(2) - self.relu1 = nn.ReLU() - self.fc = nn.Linear(32, 100) - - def forward(self, x: dict): - x = x['input'] - x = self.conv(x) - x = self.bn(x) - x = self.relu(x) - x = self.conv1(x) - x = self.bn1(x) - x = self.relu1(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - bb = torch.randn(1, 2, 2, 2).cuda() - aa = {} - aa['input'] = bb - aa['input1'] = bb - - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - - net = Net().cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.ConvTranspose2d, nn.Linear, - nn.BatchNorm2d, linger.NormalizeConvBN2d) - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - # 
linger.FuseBNIntoConv(net, aa) - linger.trace_layers(net, net, aa, fuse_bn=False) - net = linger.init(net, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - for i in range(150): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 30 == 29: - print('loss: ', loss) - loss.backward() - optimizer.step() - - net.eval() - out = net(aa) - - torch.save(net.state_dict(), 'data.ignore/aa.pt') - out1 = net(aa) - with torch.no_grad(): - torch.onnx.export(net, aa, "data.ignore/batchnormInt.onnx", export_params=True, - opset_version=11, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - out2 = net(aa) - assert (out.mean() - out1.mean() - ) < 0.001, print('out1: {}, out2: {}'.format(out.sum(), out1.sum())) - assert out.abs().sum() == out2.abs().sum(), 'inconsistant for batchnormint' diff --git a/test/test_full_training_process_with_fusebn.py b/test/test_full_training_process_with_fusebn.py deleted file mode 100644 index 53d3160..0000000 --- a/test/test_full_training_process_with_fusebn.py +++ /dev/null @@ -1,95 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch -import torch.nn as nn -import linger -import numpy as np -import os - - -if not os.path.exists('data.ignore'): - os.mkdir('data.ignore') - -def test_full_training(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.transpose = nn.ConvTranspose2d(2, 2, 5, 5, 2, 4, 2, True, 2) - self.conv1 = nn.Sequential( - nn.Conv2d(2, 2, kernel_size=3, stride=1, padding=1, bias=False, groups=2), - nn.BatchNorm2d(2), - nn.ReLU(),) - self.conv2 = nn.Sequential( - nn.Conv2d(2, 2, kernel_size=3, stride=1, padding=1, bias=False, groups=2), - nn.BatchNorm2d(2), - nn.ReLU(),) - self.fc = nn.Linear(392, 100) - - def forward(self, x): - x = self.transpose(x) - x = self.conv1(x) - x = self.conv2(x) - n, c, h, w = x.shape - x = x.view(n, c*h*w) - x = self.fc(x) 
- return x - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - dummy_input = torch.randn(1, 2, 2, 2).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr = 0.01) - - #normal training - for i in range(1000): - optimizer.zero_grad() - out = net(dummy_input) - loss = criterion(out, target) - if i % 100 == 99: - print('origin loss: ', loss) - loss.backward() - optimizer.step() - net.eval() - out_ori = net(dummy_input) - torch.save(net.state_dict(), 'data.ignore/model.pt.ignore') - - #normal finetune - net = Net().cuda() - criterion = nn.MSELoss() - replace_tuple=(nn.Conv2d, nn.ConvTranspose2d, nn.Linear) - linger.trace_layers(net,net, dummy_input) - net = linger.init(net, quant_modules=replace_tuple, mode=linger.QuantMode.QValue) - net.train() - net.load_state_dict(torch.load('data.ignore/model.pt.ignore', map_location='cpu')) - optimizer = torch.optim.SGD(net.parameters(), lr = 0.01) - for i in range(150): - optimizer.zero_grad() - out = net(dummy_input) - loss = criterion(out, target) - if i % 30 == 29: - print('loss: ', loss) - loss.backward() - optimizer.step() - net.eval() - out = net(dummy_input) - # save int model - torch.save(net.state_dict(), 'data.ignore/aa.pt') - out1 = net(dummy_input) - assert out.sum() == out1.sum() - - #normal testing - net1 = Net().cuda() - net1.train() - linger.trace_layers(net1,net1, dummy_input) - net1 = linger.init(net1, quant_modules=replace_tuple, mode=linger.QuantMode.QValue) - net1.eval() - net1.load_state_dict(torch.load('data.ignore/aa.pt', map_location='cpu')) - out2 = net1(dummy_input) - - assert out1.sum() == out2.sum() diff --git a/test/test_full_training_process_with_fusebn_with_castor_quant.py b/test/test_full_training_process_with_fusebn_with_castor_quant.py deleted file mode 100644 index 72b4709..0000000 --- 
a/test/test_full_training_process_with_fusebn_with_castor_quant.py +++ /dev/null @@ -1,110 +0,0 @@ -import os - -import linger -import numpy as np -import torch -import torch.nn as nn -from linger.utils import PlatFormQuant - -if not os.path.exists('data.ignore'): - os.mkdir('data.ignore') - - -def test_full_training(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.transpose = nn.ConvTranspose2d(2, 2, 5, 5, 2, 4, 2, True, 2) - self.conv1 = nn.Sequential( - nn.Conv2d(2, 2, kernel_size=3, stride=1, - padding=1, bias=False, groups=2), - nn.BatchNorm2d(2), - nn.ReLU(),) - self.conv2 = nn.Sequential( - nn.Conv2d(2, 2, kernel_size=3, stride=1, - padding=1, bias=False, groups=2), - nn.BatchNorm2d(2), - nn.ReLU(),) - self.fc = nn.Linear(392, 100) - - def forward(self, x): - x = self.transpose(x) - x = self.conv1(x) - x = self.conv2(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - dummy_input = torch.randn(1, 2, 2, 2).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - - # normal training - for i in range(1000): - optimizer.zero_grad() - out = net(dummy_input) - loss = criterion(out, target) - if i % 100 == 99: - print('origin loss: ', loss) - loss.backward() - optimizer.step() - net.eval() - out_ori = net(dummy_input) - torch.save(net.state_dict(), 'data.ignore/model.pt.ignore') - - # normal finetune - net = Net().cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.ConvTranspose2d, nn.Linear) - linger.trace_layers(net, net, dummy_input) - linger.SetPlatFormQuant(platform_quant=PlatFormQuant.luna_quant) - # linger.SetCastorBiasInt16(True) - net = linger.init(net, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - net.train() - net.load_state_dict(torch.load( - 
'data.ignore/model.pt.ignore', map_location='cpu')) - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - for i in range(150): - optimizer.zero_grad() - out = net(dummy_input) - loss = criterion(out, target) - if i % 30 == 29: - print('loss: ', loss) - loss.backward() - optimizer.step() - net.eval() - out = net(dummy_input) - # save int model - torch.save(net.state_dict(), 'data.ignore/aa.pt') - out1 = net(dummy_input) - assert out.sum() == out1.sum() - - # normal testing - net1 = Net().cuda() - net1.train() - linger.trace_layers(net1, net1, dummy_input) - linger.SetPlatFormQuant(platform_quant=PlatFormQuant.luna_quant) - # linger.SetCastorBiasInt16(True) - net1 = linger.init(net1, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - net1.eval() - net1.load_state_dict(torch.load('data.ignore/aa.pt', map_location='cpu')) - out2 = net1(dummy_input) - - assert out1.sum() == out2.sum() - - with torch.no_grad(): - torch.onnx.export(net, dummy_input, "data.ignore/convtranspose_conv_linear.onnx", export_params=True, - opset_version=12, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - out3 = net(dummy_input) - assert out2.sum() == out3.sum() diff --git a/test/test_glu.py b/test/test_glu.py new file mode 100644 index 0000000..71ea2a0 --- /dev/null +++ b/test/test_glu.py @@ -0,0 +1,122 @@ +import torch +import torch.nn as nn +import numpy as np + +import linger +from linger.quant.ops.qmodule import * +from linger.config import QUANT_CONFIGS + +q_configs = QUANT_CONFIGS + +class QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + super(QuantizationTestNet, self).__init__() + + self.glu_float = nn.GLU( + dim = -1 + ) + + self.glu_quant = QGLU( + dim = -1, + weights_cfg=q_configs.quant_info.to_dict(), + activations_cfg=q_configs.quant_info.to_dict(), + bias_cfg=q_configs.quant_info.to_dict(), + constrain = q_configs.clamp_info.to_dict() + ) + + # with torch.no_grad(): + # 
self.conv2d_float.weight.copy_(self.conv2d_quant.qweight) + # if self.conv2d_quant.qbias is not None: + # self.conv2d_float.bias.copy_(self.conv2d_quant.qbias) + + def forward(self, x): + result_float = self.glu_float(x) + result_quant = self.glu_quant(x) + + return { + 'result_float': result_float, + 'result_quant': result_quant + } + +def test_quantization_in_forward(model): + print("测试模型前向传播中的量化glu比较...") + print("=" * 60) + + device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') + print(f"使用设备: {device}") + + batch_size = 2 + x = torch.rand(batch_size, 3, 32, 32).to(device) + + with torch.no_grad(): + model = model.to(device) + outputs = model(x) + + float_result = outputs['result_float'] + quant_result = outputs['result_quant'] + + diff = torch.abs(float_result) - torch.abs(quant_result) + mean_float = torch.mean(float_result).item() + mean_diff = torch.mean(diff).item() + relative_diff = mean_diff/mean_float + + return relative_diff + +def check_gradients(model, input_tensor): + model.train() + model.zero_grad() + output_dict = model(input_tensor) + + for key, output_value in output_dict.items(): + target = torch.randn_like(output_value) + loss = torch.nn.functional.mse_loss(output_value, target) + + loss.backward() + + + print("=== 卷积层梯度检查 ===") + print(f"输入形状: {input_tensor.shape}") + print(f"输出形状: {output_value.shape}") + print(f"输入梯度: {input_tensor.grad.norm().item()}") + print(f"损失值: {loss.item():.6f}\n") + + for name, param in model.named_parameters(): + if param.grad is not None: + grad = param.grad + + is_zero = torch.allclose(grad, torch.zeros_like(grad), atol=1e-8) + grad_norm = grad.norm().item() + + if 'glu_float' in name and 'glu_quant' not in name: + layer_type = "glu" + elif 'glu_quant' in name: + layer_type = "qglu" + else: + continue + + status = "✓ 全0" if is_zero else "✗ 非0" + print(f"{layer_type:8} {name:15} | {status} | 梯度范数: {grad_norm:.2e}") + + return loss.item() + +if __name__ == "__main__": + model = 
QuantizationTestNet(num_classes=10) + + input_tensor = torch.randn(2, 3, 32, 32, requires_grad=True) + loss = check_gradients(model, input_tensor) + + print("开始量化glu精度测试...") + print("=" * 60) + + final_relative_diff = test_quantization_in_forward(model) + final_relative_diff = torch.abs(torch.tensor(final_relative_diff)).item() + + print("\n" + "=" * 60) + print("最终评估:") + + threshold = 0.001 + + if final_relative_diff < threshold: + print(f"✓ 量化glu测试成功!最大差异 {final_relative_diff:.6f} < 阈值 {threshold}") + else: + print(f"✗ 量化glu测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") \ No newline at end of file diff --git a/test/test_group_conv1d_fused_bn_demo.py b/test/test_group_conv1d_fused_bn_demo.py deleted file mode 100644 index 996b58d..0000000 --- a/test/test_group_conv1d_fused_bn_demo.py +++ /dev/null @@ -1,70 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -from linger.utils import PlatFormQuant -import torch -import torch.nn as nn -import linger -import numpy as np -import os -if not os.path.exists('data.ignore'): - os.mkdir('data.ignore') -def test_group_fuse_bn1d(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv1d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True, groups=2) - self.bn = nn.BatchNorm1d(10) - self.fc = nn.Linear(10*50, 100) - - def forward(self, x): - x = self.conv(x) - x = self.bn(x) - n, c, l = x.shape - x = x.view(n, c*l) - x = self.fc(x) - return x - - class Net1(nn.Module): - def __init__(self): - super(Net1, self).__init__() - self.conv = nn.Conv1d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True, groups=2) - self.bn = nn.BatchNorm1d(10) - self.fc = nn.Linear(10*50, 100) - - def forward(self, x): - x = self.conv(x) - x = self.bn(x) - n, c, l = x.shape - x = x.view(n, c*l) - x = self.fc(x) - return x - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - torch.cuda.set_device(0) - aa = torch.randn(10, 10, 50).cuda() - - 
replace_tuple=(nn.Conv1d, nn.ConvTranspose2d, nn.Linear) - net1 = Net().cuda() - net1.eval() - print(net1) - net1(aa) - torch.save(net1.state_dict(), 'data.ignore/model1d.pt.ignore') - net1.train() - out1 = net1(aa) - - net2 = Net1().cuda() - - linger.trace_layers(net2,net2, aa) - linger.SetPlatFormQuant(platform_quant=PlatFormQuant.luna_quant) - net3 = linger.init(net2, quant_modules=replace_tuple) - #net3 = net2 - net3.load_state_dict(torch.load('data.ignore/model1d.pt.ignore')) - out3 = net3(aa) - - assert out1.sum() - out3.sum() < 0.01 - - \ No newline at end of file diff --git a/test/test_group_conv_fused_bn_demo.py b/test/test_group_conv_fused_bn_demo.py deleted file mode 100644 index 630e9b1..0000000 --- a/test/test_group_conv_fused_bn_demo.py +++ /dev/null @@ -1,77 +0,0 @@ -import os - -import linger -import numpy as np -import torch -import torch.nn as nn -from linger.utils import PlatFormQuant - -if not os.path.exists('data.ignore'): - os.mkdir('data.ignore') - - -def test_group_fuse_bn(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.transpose = nn.ConvTranspose2d(2, 2, 5, 5, 2, 4, 2, True, 2) - self.conv = nn.Conv2d(2, 2, kernel_size=3, stride=1, - padding=1, bias=True, groups=2) - self.bn = nn.BatchNorm2d(2) - self.fc = nn.Linear(8, 100) - - def forward(self, x): - x = self.transpose(x) - x = self.conv(x) - x = self.bn(x) - n, c, h, w = x.shape - x = x.view(n, c*h*w) - x = self.fc(x) - return x - - class Net1(nn.Module): - def __init__(self): - super(Net1, self).__init__() - self.transpose = nn.ConvTranspose2d(2, 2, 5, 5, 2, 4, 2, True, 2) - self.conv = nn.Conv2d(2, 2, kernel_size=3, stride=1, - padding=1, bias=True, groups=2) - self.bn = nn.BatchNorm2d(2) - self.fc = nn.Linear(8, 100) - - def forward(self, x): - x = self.transpose(x) - x = self.conv(x) - x = self.bn(x) - n, c, h, w = x.shape - x = x.view(n, c*h*w) - x = self.fc(x) - return x - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - 
np.random.seed(1) - torch.cuda.set_device(0) - dummy_input = torch.randn(1, 2, 2, 2).cuda() - - target = torch.ones(1, 100).cuda() - # criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.ConvTranspose2d, nn.Linear) - net1 = Net().cuda() - net1.train() - net1(dummy_input) - torch.save(net1.state_dict(), 'data.ignore/model.pt.ignore') - net1.eval() - out1 = net1(dummy_input) - - net2 = Net1().cuda() - print(net2) - linger.trace_layers(net2, net2, dummy_input) - print(net2) - linger.SetPlatFormQuant(platform_quant=PlatFormQuant.luna_quant) - net3 = linger.init(net2, quant_modules=replace_tuple) - net3.load_state_dict(torch.load('data.ignore/model.pt.ignore')) - # net3.cuda() - out3 = net3(dummy_input) - print(out1.sum() - out3.sum()) - - assert out1.sum() - out3.sum() < 1 diff --git a/test/test_gru.py b/test/test_gru.py new file mode 100644 index 0000000..d715ea0 --- /dev/null +++ b/test/test_gru.py @@ -0,0 +1,141 @@ +import torch +import torch.nn as nn +import torch.optim as optim +from torch.utils.data import DataLoader, TensorDataset + +# ========================================================== +# 1. 定义一个最简单的GRU网络 +# ========================================================== +class GRUNet(nn.Module): + def __init__(self, input_size=10, hidden_size=20, num_layers=1, num_classes=2): + super(GRUNet, self).__init__() + self.gru = nn.GRU(input_size=input_size, + hidden_size=hidden_size, + num_layers=num_layers, + batch_first=True, + bidirectional=True) + self.fc = nn.Linear(hidden_size * 2, num_classes) + + def forward(self, x): + # x: [batch, seq_len, input_size] + out, h_n = self.gru(x) # out: [batch, seq_len, hidden_size] + out = out[:, -1, :] # 取最后一个时间步的输出 + out = self.fc(out) # [batch, num_classes] + return out + +# ========================================================== +# 2. 
生成随机数据 (假数据用于验证功能) +# ========================================================== +def generate_data(num_samples=1000, seq_len=5, input_size=10, num_classes=2): + X = torch.randn(num_samples, seq_len, input_size) + y = torch.randint(0, num_classes, (num_samples,)) + return X, y + +# ========================================================== +# 3. 训练与测试流程 +# ========================================================== +def train_and_test(): + # 参数配置 + input_size = 10 + hidden_size = 20 + num_classes = 2 + seq_len = 5 + num_epochs = 5 + batch_size = 32 + lr = 1e-3 + + device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + print(f"🚀 Using device: {device}") + + # 数据 + X_train, y_train = generate_data(800, seq_len, input_size, num_classes) + X_test, y_test = generate_data(200, seq_len, input_size, num_classes) + + train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=batch_size, shuffle=True) + test_loader = DataLoader(TensorDataset(X_test, y_test), batch_size=batch_size) + + # 模型、优化器、损失函数 + model = GRUNet(input_size, hidden_size, num_classes=num_classes).to(device) + print(model) + + criterion = nn.CrossEntropyLoss() + optimizer = optim.Adam(model.parameters(), lr=lr) + + for epoch in range(num_epochs): + model.train() + total_loss = 0.0 + for X_batch, y_batch in train_loader: + X_batch, y_batch = X_batch.to(device), y_batch.to(device) + optimizer.zero_grad() + outputs = model(X_batch) + loss = criterion(outputs, y_batch) + loss.backward() + optimizer.step() + total_loss += loss.item() + + avg_loss = total_loss / len(train_loader) + print(f"Epoch [{epoch+1}/{num_epochs}] Loss: {avg_loss:.4f}") + + model.eval() + correct = 0 + total = 0 + with torch.no_grad(): + for X_batch, y_batch in test_loader: + X_batch, y_batch = X_batch.to(device), y_batch.to(device) + outputs = model(X_batch) + _, predicted = torch.max(outputs, 1) + total += y_batch.size(0) + correct += (predicted == y_batch).sum().item() + acc = correct / total * 100 + 
print(f"Test Accuracy: {acc:.2f}%") + + # 量化配置 + import linger + from linger.utils import FakeQuantMethod, QatMethod + linger.QUANT_CONFIGS.quant_method = FakeQuantMethod.CUDA + linger.QUANT_CONFIGS.quant_info.qat_method = QatMethod.MOM + linger.QUANT_CONFIGS.quant_info.weight_bits = 8 + linger.QUANT_CONFIGS.quant_info.activate_bits = 8 + + model = linger.init(model) + print(model) + + criterion = nn.CrossEntropyLoss() + optimizer = optim.Adam(model.parameters(), lr=lr) + + # 训练循环 + for epoch in range(num_epochs): + model.train() + total_loss = 0.0 + for X_batch, y_batch in train_loader: + X_batch, y_batch = X_batch.to(device), y_batch.to(device) + optimizer.zero_grad() + outputs = model(X_batch) + loss = criterion(outputs, y_batch) + loss.backward() + optimizer.step() + total_loss += loss.item() + + avg_loss = total_loss / len(train_loader) + print(f"Epoch [{epoch+1}/{num_epochs}] Loss: {avg_loss:.4f}") + + # 测试 + model.eval() + correct = 0 + total = 0 + with torch.no_grad(): + for X_batch, y_batch in test_loader: + X_batch, y_batch = X_batch.to(device), y_batch.to(device) + outputs = model(X_batch) + _, predicted = torch.max(outputs, 1) + total += y_batch.size(0) + correct += (predicted == y_batch).sum().item() + + acc = correct / total * 100 + print(f"Test Accuracy: {acc:.2f}%") + +# ========================================================== +# 4. 
运行主程序 +# ========================================================== +if __name__ == "__main__": + train_and_test() diff --git a/test/test_gru_int.py b/test/test_gru_int.py deleted file mode 100644 index 983e150..0000000 --- a/test/test_gru_int.py +++ /dev/null @@ -1,120 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import linger -import torch -import torch.nn as nn -import numpy -from torch.onnx import is_in_onnx_export - -def test_lstmpint_net(): - - def getacc(lprob, target): - num_class = lprob.size()[1] - _, new_target = torch.broadcast_tensors(lprob, target) - - remove_pad_mask = new_target.ne(-1) - lprob = lprob[remove_pad_mask] - - target = target[target!=-1] - target = target.unsqueeze(-1) - - - lprob = lprob.reshape((-1, num_class)) - - preds = torch.argmax(lprob, dim=1) - - correct_holder = torch.eq(preds.squeeze(), target.squeeze()).float() - - num_corr = correct_holder.sum() - num_sample = torch.numel(correct_holder) - acc = num_corr/num_sample - return acc - - - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv0 = nn.Conv2d(1, 100, kernel_size=(1,3), padding=(0,1), groups=1, bias=True) - self.bn0 = nn.BatchNorm2d(100) - self.relu0 = nn.ReLU() - self.conv1 = nn.Conv2d(100, 100, kernel_size=(1,3), padding=(0,1), groups=1, bias=True) - self.bn1 = nn.BatchNorm2d(100) - self.relu1 = nn.ReLU() - self.lstmp = nn.GRU(100, 50, num_layers=1, batch_first=True, bidirectional=True) - self.final_conv = nn.Conv2d(100, 10, 1, 1, 0) - def forward(self, input, batch_lengths=None, initial_state=None): - x = self.conv0(input) - x = self.bn0(x) - x = self.relu0(x) - x = self.conv1(x) - x = self.bn1(x) - x = self.relu1(x) - n, c, h, w = x.shape - x = x.reshape((n, -1, 1, w)).squeeze(2) - x = x.permute(0, 2, 1) #b t d - x = nn.utils.rnn.pack_padded_sequence(x, batch_lengths, batch_first=True, enforce_sorted=False) - x, _ = self.lstmp(x, initial_state) #output b, t, h (10, 10, 100) - x, _ = nn.utils.rnn.pad_packed_sequence(x, 
batch_first=True) - x = x.permute(2, 0, 1) - d, b, t = x.shape - x = x.reshape((1, d, 1, b*t)) # (1, 100, 1, 100) - x = self.final_conv(x) #(1, 10, 1, 100) (d, b*t) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - numpy.random.seed(1) - aa = torch.randn(10, 1, 1, 10).cuda() - label = torch.randint(10, (10, 10)).cuda() #class=10 - mask = torch.ones(10, 10) - for i in range(9): - index = numpy.random.randint(5, 10) - mask[i, index:] = 0 - label[i, index:] = -1 - - input_lengths = mask.long().sum(1).cpu().numpy() - input_lengths = torch.tensor(input_lengths)#.cuda() - # input_lengths = None - # label = label.permute((1, 0)) - # batch_size = 10; hidden_size=100; size=1 - batch_size = 10; hidden_size=50; size=2 - initial_state = torch.zeros(size, batch_size, hidden_size).cuda() - # initial_state = None - net = Net().cuda() - replace_modules = (nn.Conv2d, nn.GRU, nn.BatchNorm2d) - # replace_modules = (nn.GRU,) - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - net = linger.init(net, quant_modules=replace_modules) - criterion = nn.CrossEntropyLoss(ignore_index = -1) - - optimizer = torch.optim.Adam(net.parameters(), lr = 0.001) - loss = None - for i in range(50): - optimizer.zero_grad() - out = net(aa, input_lengths, initial_state) - out = out.squeeze().permute(1,0) #(b*t, d) - loss = criterion(out, label.reshape(-1)) - if i % 50 == 0: - print('loss: ', loss) - acc = getacc(out, label.reshape(-1, 1)) - print('train acc: ', acc) - loss.backward() - optimizer.step() - - net.eval() - out1 = net(aa, input_lengths, initial_state) - out1 = out1.squeeze().permute((1,0)) - acc = getacc(out1, label.reshape(-1, 1)) - print('test acc: ', acc) - assert acc > 0.4 - torch.save(net.state_dict(), 'data.ignore/gruint.pt') - out3 = net(aa, input_lengths, initial_state) - out3 = out3.squeeze().permute((1,0)) - input_lengths = torch.tensor(input_lengths)#.cuda() - with torch.no_grad(): - net.eval() - torch.onnx.export(net, (aa, input_lengths, 
initial_state), "data.ignore/gruint3.onnx",export_params=True,opset_version=12, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - out2 = net(aa, input_lengths, initial_state) - out2 = out2.squeeze().permute((1,0)) - # print('out: ', out1) - assert (out1 == out2).all() - assert(out1 == out3).all() diff --git a/test/test_iq_auto_grad.py b/test/test_iq_auto_grad.py deleted file mode 100644 index e09eb8a..0000000 --- a/test/test_iq_auto_grad.py +++ /dev/null @@ -1,375 +0,0 @@ -import os - -import linger -import torch -from linger.ops import (IQTensor, from_torch_tensor, iqadd, iqAddLayer, - iqmul, iqMulLayer, torch_cat) - -if not os.path.exists('data.ignore'): - os.mkdir('data.ignore') - - -def test_iqMul(): - x = torch.tensor([[6.0, 4.0]], device=torch.device( - 'cuda:0'), requires_grad=True) - - y = torch.tensor([[3.0, 8.0]], device=torch.device( - 'cuda:0'), requires_grad=True) - iq_layer = iqMulLayer().cuda() - a = from_torch_tensor(x, 127.0/6.0, 8) - b = from_torch_tensor(y, 127.0/8, 8) - - m = iq_layer(a, b, 127.0/32) # iqAdd.apply(x,y).sum().backward() - assert abs(m.data[0][0].item()-18.1417) < 0.2 - assert abs(m.data[0][1].item()-32) < 0.3 - m.sum().backward() - - p = torch.tensor([[6.0, 4.0]], device=torch.device( - 'cuda:0'), requires_grad=True).cuda() - q = torch.tensor([[3.0, 8.0]], device=torch.device( - 'cuda:0'), requires_grad=True).cuda() - m = p*q - m = m.sum() - m.backward() - err = p.grad.data - x.grad.data - assert err.abs().sum() < 0.01 - err = q.grad.data - y.grad.data - assert err.abs().sum() < 0.01 - - -def test_iqMul_layer(): - x = torch.tensor([[6.0, 4.0]], device=torch.device( - 'cuda:0'), requires_grad=True) - y = torch.tensor([[3.0, 8.0]], device=torch.device( - 'cuda:0'), requires_grad=True) - - class iqTestLayer(torch.nn.Module): - def __init__(self): - super(iqTestLayer, self).__init__() - - def forward(self, x, y): - return iqmul(self, x, y, 'test') - - a = from_torch_tensor(x, 127.0/6.0, 8) - b = 
from_torch_tensor(y, 127.0/8, 8) - - net = iqTestLayer().cuda() # iqAdd.apply(x,y).sum().backward() - m = net(a, b) - assert abs(m.data[0][0].item()-18.1417) < 0.2 - assert abs(m.data[0][1].item()-32) < 0.3 - m.sum().backward() - - p = torch.tensor([[6.0, 4.0]], device=torch.device( - 'cuda:0'), requires_grad=True) - q = torch.tensor([[3.0, 8.0]], device=torch.device( - 'cuda:0'), requires_grad=True) - m = p*q - m = m.sum() - m.backward() - err = p.grad.data - x.grad.data - assert a.grad is None - assert err.abs().sum() < 0.01 - err = q.grad.data - y.grad.data - assert b.grad is None - assert err.abs().sum() < 0.01 - - -def test_iqMul_layer_cpu(): - x = torch.tensor([[6.0, 4.0]], requires_grad=True) - y = torch.tensor([[3.0, 8.0]], requires_grad=True) - - class iqTestLayer(torch.nn.Module): - def __init__(self): - super(iqTestLayer, self).__init__() - - def forward(self, x, y): - return iqmul(self, x, y, 'test') - - a = from_torch_tensor(x, 127.0/6.0, 8) - b = from_torch_tensor(y, 127.0/8, 8) - - net = iqTestLayer() - m = net(a, b) - # print(m) - assert abs(m.data[0][0].item()-18.1417) < 0.2 - assert abs(m.data[0][1].item()-32) < 0.3 - m.sum().backward() - - p = torch.tensor([[6.0, 4.0]], requires_grad=True) - q = torch.tensor([[3.0, 8.0]], requires_grad=True) - m = p*q - m = m.sum() - m.backward() - err = p.grad.data - x.grad.data - assert a.grad is None - assert err.abs().sum() < 0.01 - err = q.grad.data - y.grad.data - assert b.grad is None - assert err.abs().sum() < 0.01 - - -def test_iqmul_module(): - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - self.aa = torch.nn.Parameter(torch.zeros((1))) - - def forward(self, x, y): - return x * y - net = TestModel().cuda() - x = torch.tensor([[6.0, 4.0]], requires_grad=True) - y = torch.tensor([[3.0, 8.0]], requires_grad=True) - a = from_torch_tensor(x, 127.0/6.0, 8) - b = from_torch_tensor(y, 127.0/8, 8) - net = linger.init(net) - z = net(a, b) - assert isinstance(z, 
IQTensor) - - -def test_iqimul_module(): - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - self.aa = torch.nn.Parameter(torch.zeros((1))) - - def forward(self, x, y): - x *= y - return x - net = TestModel().cuda() - x = torch.tensor([[6.0, 4.0]], requires_grad=True) - y = torch.tensor([[3.0, 8.0]], requires_grad=True) - a = from_torch_tensor(x, 127.0/6.0, 8) - b = from_torch_tensor(y, 127.0/8, 8) - net = linger.init(net) - z = net(a, b) - assert isinstance(z, IQTensor) - - -def test_iqAdd(): - x = torch.tensor([[6.0, 4.0]], device=torch.device( - 'cuda:0'), requires_grad=True) - - y = torch.tensor([[3.0, 8.0]], device=torch.device( - 'cuda:0'), requires_grad=True) - iq_layer = iqAddLayer().cuda() - a = from_torch_tensor(x, 127.0/6.0, 8) - b = from_torch_tensor(y, 127.0/8, 8) - - m = iq_layer(a, b, 127.0/12) # iqAdd.apply(x,y).sum().backward() - assert m.data[0][0]-16.535 < 0.1 - assert m.data[0][1]-20.031 < 0.1 - m.sum().backward() - - p = torch.tensor([[6.0, 4.0]], device=torch.device( - 'cuda:0'), requires_grad=True).cuda() - q = torch.tensor([[3.0, 8.0]], device=torch.device( - 'cuda:0'), requires_grad=True).cuda() - m = p+q - m = m.sum() - m.backward() - err = p.grad.data - x.grad.data - assert err.abs().sum() < 0.01 - err = q.grad.data - y.grad.data - assert err.abs().sum() < 0.01 - -def test_iqAdd_layer(): - x = torch.tensor([[6.0, 4.0]], device=torch.device( - 'cuda:0'), requires_grad=True) - y = torch.tensor([[3.0, 8.0]], device=torch.device( - 'cuda:0'), requires_grad=True) - - class iqTestLayer(torch.nn.Module): - def __init__(self): - super(iqTestLayer, self).__init__() - - def forward(self, x, y): - return iqadd(self, x, y, 'test') - - a = from_torch_tensor(x, 127.0/6.0, 8) - b = from_torch_tensor(y, 127.0/8, 8) - - net = iqTestLayer().cuda() # iqAdd.apply(x,y).sum().backward() - m = net(a, b) - assert m.data[0][0]-16.535 < 0.1 - assert m.data[0][1]-20.031 < 0.1 - m.sum().backward() - - p = 
torch.tensor([[6.0, 4.0]], device=torch.device( - 'cuda:0'), requires_grad=True) - q = torch.tensor([[3.0, 8.0]], device=torch.device( - 'cuda:0'), requires_grad=True) - m = p+q - m = m.sum() - m.backward() - err = p.grad.data - x.grad.data - assert a.grad is None - assert err.abs().sum() < 0.01 - err = q.grad.data - y.grad.data - assert b.grad is None - assert err.abs().sum() < 0.01 - - -def test_iqAdd_layer_cpu(): - x = torch.tensor([[6.0, 4.0]], requires_grad=True) - y = torch.tensor([[3.0, 8.0]], requires_grad=True) - - class iqTestLayer(torch.nn.Module): - def __init__(self): - super(iqTestLayer, self).__init__() - - def forward(self, x, y): - return iqadd(self, x, y, 'test') - - a = from_torch_tensor(x, 127.0/6.0, 8) - b = from_torch_tensor(y, 127.0/8, 8) - - net = iqTestLayer() - m = net(a, b) - assert m.data[0][0]-16.535 < 0.1 - assert m.data[0][1]-20.031 < 0.1 - m.sum().backward() - - p = torch.tensor([[6.0, 4.0]], requires_grad=True) - q = torch.tensor([[3.0, 8.0]], requires_grad=True) - m = p+q - m = m.sum() - m.backward() - err = p.grad.data - x.grad.data - assert a.grad is None - assert err.abs().sum() < 0.01 - err = q.grad.data - y.grad.data - assert b.grad is None - assert err.abs().sum() < 0.01 - - -def test_iqadd_module(): - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - self.aa = torch.nn.Parameter(torch.zeros((1))) - - def forward(self, x, y): - return x + y - net = TestModel().cuda() - x = torch.tensor([[6.0, 4.0]], requires_grad=True) - y = torch.tensor([[3.0, 8.0]], requires_grad=True) - a = from_torch_tensor(x, 127.0/6.0, 8) - b = from_torch_tensor(y, 127.0/8, 8) - net = linger.init(net) - z = net(a, b) - assert isinstance(z, IQTensor) - - -def test_iqiadd_module(): - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - self.aa = torch.nn.Parameter(torch.zeros((1))) - - def forward(self, x, y): - x += y - return x - net = TestModel().cuda() - x = 
torch.tensor([[6.0, 4.0]], requires_grad=True) - y = torch.tensor([[3.0, 8.0]], requires_grad=True) - a = from_torch_tensor(x, 127.0/6.0, 8) - b = from_torch_tensor(y, 127.0/8, 8) - net = linger.init(net) - z = net(a, b) - assert isinstance(z, IQTensor) - - -def test_iqcat_function(): - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - self.aa = torch.nn.Parameter(torch.zeros((1))) - - def forward(self, x, y, z): - return torch.cat((x, y, z), dim=0) - - x = torch.tensor([[6.0, 4.0]], requires_grad=True) - y = torch.tensor([[3.0, 8.0]], requires_grad=True) - z = torch.tensor([[5.0, 7.0]], requires_grad=True) - - a = torch_cat((x, y, z), dim=0) - - a.sum().backward() - - x_ = torch.tensor([[6.0, 4.0]], requires_grad=True) - y_ = torch.tensor([[3.0, 8.0]], requires_grad=True) - z_ = torch.tensor([[5.0, 7.0]], requires_grad=True) - x_iq = from_torch_tensor(x_, 127.0/6.0, 8) - y_iq = from_torch_tensor(y_, 127.0/8.0, 8) - z_iq = from_torch_tensor(z_, 127.0/7.0, 8) - - net = TestModel() - net = linger.init(net) - b = net(x_iq, y_iq, z_iq) - b.sum().backward() - assert (b-a).sum() < 0.2 - err = x_.grad.data - x.grad.data - assert err.sum() < 0.1 - err = y_.grad.data - y.grad.data - assert err.sum() < 0.1 - err = z_.grad.data - z.grad.data - assert err.sum() < 0.1 - - -def test_iqcat_running_o(): - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - self.aa = torch.nn.Parameter(torch.zeros((1))) - - def forward(self, x, y, z): - return torch.cat((x, y, z), dim=0) - - x_ = torch.tensor([[6.0, 4.0]], requires_grad=True) - y_ = torch.tensor([[3.0, 8.0]], requires_grad=True) - z_ = torch.tensor([[5.0, 7.0]], requires_grad=True) - x_iq = from_torch_tensor(x_, 127.0/6.0, 8) - y_iq = from_torch_tensor(y_, 127.0/8.0, 8) - z_iq = from_torch_tensor(z_, 127.0/7.0, 8) - - net = TestModel() - net = linger.init(net) - b = None - for _ in range(100): - b = net(x_iq, y_iq, z_iq) - net.eval() - c = 
net(x_iq, y_iq, z_iq) - assert (c - b).abs().sum() < 0.1 - - -def test_iqcat_state_dict(): - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - self.aa = torch.nn.Parameter(torch.zeros((1))) - - def forward(self, x, y, z): - return torch.cat((x, y, z), dim=0) - - x_ = torch.tensor([[6.0, 4.0]], requires_grad=True) - y_ = torch.tensor([[3.0, 8.0]], requires_grad=True) - z_ = torch.tensor([[5.0, 7.0]], requires_grad=True) - x_iq = from_torch_tensor(x_, 127.0/6.0, 8) - y_iq = from_torch_tensor(y_, 127.0/8.0, 8) - z_iq = from_torch_tensor(z_, 127.0/7.0, 8) - - net = TestModel() - net = linger.init(net) - b = None - for _ in range(100): - b = net(x_iq, y_iq, z_iq) - torch.save(net.state_dict(), 'data.ignore/param.dict') - - net2 = TestModel() - net2 = linger.init(net2) - net2.load_state_dict(torch.load( - 'data.ignore/param.dict', map_location='cpu')) - - net.eval() - c = net(x_iq, y_iq, z_iq) - assert (c - b).abs().sum() < 0.1 diff --git a/test/test_iq_onnx_export.py b/test/test_iq_onnx_export.py deleted file mode 100644 index d97d082..0000000 --- a/test/test_iq_onnx_export.py +++ /dev/null @@ -1,488 +0,0 @@ -import hashlib -import math -import os -from logging import PlaceHolder - -import linger -import linger.onnx -import onnx -import torch -import torch.nn -import torch.nn.functional as F -import torch.onnx -from linger.config import config -from linger.ops import (IQTensor, from_torch_tensor, iqadd, iqAddLayer, iqmul, - iqMulLayer) -from linger.utils import PlatFormQuant - - -def get_file_topolo_sort_type_list(f): - model = onnx.load(f) - return [n.op_type for n in model.graph.node] - - -def get_file_md5(fname): - m = hashlib.md5() - with open(fname, 'rb') as fobj: - while True: - data = fobj.read(4096) - if not data: - break - m.update(data) - return m.hexdigest() - - -if not os.path.exists("data.ignore"): - os.mkdir("data.ignore") - - -def test_view_export(): - is_tuple = False - - class TestModel(torch.nn.Module): - def 
__init__(self): - super(TestModel, self).__init__() - - def forward(self, *args): - if is_tuple: - return args[0].view((1, 4, -1)) - else: - return args[0].view(1, 4, -1) - net = TestModel() - dummy_input = torch.randn(1, 2, 3, 4) - with torch.no_grad(): - linger.onnx.export(net, dummy_input, "data.ignore/torch_view.onnx", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - #original_model = onnx.load("data.ignore/torch_view.onnx.ignore") - aa_model_md5 = get_file_topolo_sort_type_list( - "data.ignore/torch_view.onnx") - aa_iq = from_torch_tensor(dummy_input, 12, 3) - with torch.no_grad(): - linger.onnx.export(net, aa_iq, "data.ignore/iq_view.onnx", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - #iq1_model = onnx.load("data.ignore/iq_view.onnx.ignore") - aa_iq_model_md5 = get_file_topolo_sort_type_list( - "data.ignore/iq_view.onnx") - assert aa_model_md5 == aa_iq_model_md5 - - is_tuple = True - with torch.no_grad(): - linger.onnx.export(net, aa_iq, "data.ignore/iq_view_tuple.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - - aa_iq_tuple_model_md5 = get_file_topolo_sort_type_list( - "data.ignore/iq_view_tuple.onnx.ignore") - assert aa_model_md5 == aa_iq_tuple_model_md5 - - -def test_view_as_export(): - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - - def forward(self, *args): - return args[0].view_as(args[1]) - - net = TestModel() - dummy_input = torch.randn(1, 2, 3, 4) - aa_copy = dummy_input.detach().data - - aa_iq = from_torch_tensor(dummy_input, 12, 3) - with torch.no_grad(): - linger.onnx.export(net, (aa_iq, aa_copy), "data.ignore/iq_view_as.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', 'y'], 
operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - iq_model = onnx.load("data.ignore/iq_view_as.onnx.ignore") - assert iq_model.graph.node[1].op_type == 'Reshape' - - -def test_relu_export(): - inplace = False - - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - - def forward(self, *args): - if inplace: - return torch.relu(*args) - else: - return torch.relu_(*args) - net = TestModel() - dummy_input = torch.randn(9, 8, 7, 6) - with torch.no_grad(): - linger.onnx.export(net, dummy_input, "data.ignore/torch_relu.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - #original_model = onnx.load("data.ignore/torch_relu.onnx.ignore") - base_md5 = get_file_topolo_sort_type_list( - "data.ignore/torch_relu.onnx.ignore") - aa_iq = from_torch_tensor(dummy_input, 12, 3) - with torch.no_grad(): - linger.onnx.export(net, aa_iq, "data.ignore/torch_relu_iq1.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - #iq1_model = onnx.load("data.ignore/torch_relu_iq1.onnx.ignore") - iq1_md5 = get_file_topolo_sort_type_list( - "data.ignore/torch_relu_iq1.onnx.ignore") - assert iq1_md5 == base_md5 - inplace = True - with torch.no_grad(): - linger.onnx.export(net, aa_iq, "data.ignore/torch_relu_iq2.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - #iq2_model = onnx.load("data.ignore/torch_relu_iq2.onnx.ignore") - iq2_md5 = get_file_topolo_sort_type_list( - "data.ignore/torch_relu_iq2.onnx.ignore") - assert iq2_md5 == base_md5 - - -def test_maxpool_2d_export(): - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - - def forward(self, *args): - return torch.max_pool2d(args[0], kernel_size=(2, 2), 
stride=2, padding=0) - net = TestModel() - dummy_input = torch.randn(1, 8, 16) - with torch.no_grad(): - linger.onnx.export(net, dummy_input, "data.ignore/torch_maxpool_2d.onnx.ignore", export_params=True, opset_version=9, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - mode_base = onnx.load("data.ignore/torch_maxpool_2d.onnx.ignore") - - iq_1 = from_torch_tensor(dummy_input, 11, 8) - with torch.no_grad(): - linger.onnx.export(net, iq_1, "data.ignore/qi1_maxpool_2d.onnx.ignore", export_params=True, opset_version=9, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - mode_iq = onnx.load("data.ignore/qi1_maxpool_2d.onnx.ignore") - assert mode_iq.graph.node[0].op_type == mode_base.graph.node[0].op_type - assert mode_iq.graph.node[0].attribute == mode_base.graph.node[0].attribute - - -def test_iq_mul_export_onnx(): - x = torch.tensor([[6.0, 4.0]], requires_grad=True).cuda() - y = torch.tensor([[3.0, 8.0]], requires_grad=True).cuda() - iq_layer = iqMulLayer() - a0 = from_torch_tensor(x-1, 127.0/(6.0-1), 8) - b0 = from_torch_tensor(y-1, 127.0/(8-1), 8) - oscale = 127.0/(12-2) - iq_layer(a0, b0, oscale) - a = from_torch_tensor(x, 127.0/6.0, 8) - b = from_torch_tensor(y, 127.0/8, 8) - scale = 127.0 / 12 - with torch.no_grad(): - linger.onnx.export(iq_layer, (a, b, scale), "data.ignore/iq_mul1.onnx", export_params=True, keep_initializers_as_inputs=False, - opset_version=11, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - m_float = onnx.load("data.ignore/iq_mul1.onnx") - assert m_float.graph.node[0].op_type == 'iqMul' - assert len(m_float.graph.node[0].attribute) == 4 - for m in m_float.graph.node[0].attribute: - if m.name == "scale_o": - assert m.f == 128 - if m.name == "scale_x": - assert abs(m.f - 127.0/6) < 0.01 - if m.name == "scale_y": - assert abs(m.f - 127.0/8) < 0.01 - - -def test_iq_mul_module_export_onnx(): - class TestModel(torch.nn.Module): - 
def __init__(self): - super(TestModel, self).__init__() - - def forward(self, x, y): - return iqmul(self, x, y, 'testname') - x = torch.tensor([[6.0, 4.0]], requires_grad=True).cuda() - y = torch.tensor([[3.0, 8.0]], requires_grad=True).cuda() - net = TestModel() - a0 = from_torch_tensor(x-1, 127.0/(6.0-1), 8) - b0 = from_torch_tensor(y-1, 127.0/(8-1), 8) - - net(a0, b0) - a = from_torch_tensor(x, 127.0/6.0, 8) - b = from_torch_tensor(y, 127.0/8, 8) - - with torch.no_grad(): - linger.onnx.export(net, (a, b), "data.ignore/iq_mul2.onnx", export_params=True, keep_initializers_as_inputs=False, - opset_version=11, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - m_float = onnx.load("data.ignore/iq_mul2.onnx") - assert m_float.graph.node[0].op_type == 'iqMul' - assert len(m_float.graph.node[0].attribute) == 4 - assert len(m_float.graph.node) == 2 - for m in m_float.graph.node[0].attribute: - if m.name == "scale_o": - max_value = round(math.log(127/2.1, 2)) - scale_local = math.pow(2, max_value) - assert abs(m.f - scale_local) < 0.1 - if m.name == "scale_x": - assert abs(m.f - 127.0/6) < 0.01 - if m.name == "scale_y": - assert abs(m.f - 127.0/8) < 0.01 - - -def test_iq_mul_u8i8_module_export_onnx(): - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - - def forward(self, x, y): - return iqmul(self, x, y, 'testname') - x = torch.tensor([[6.0, 4.0]], requires_grad=True).cuda() - y = torch.tensor([[3.0, 8.0]], requires_grad=True).cuda() - net = TestModel() - a0 = from_torch_tensor(x-1, 127.0/(6.0-1), 8, 128) - b0 = from_torch_tensor(y-1, 127.0/(8-1), 8) - - net(a0, b0) - a = from_torch_tensor(x, 127.0/6.0, 8, 128) - b = from_torch_tensor(y, 127.0/8, 8) - - with torch.no_grad(): - linger.onnx.export(net, (a, b), "data.ignore/iq_mul3.onnx", export_params=True, keep_initializers_as_inputs=False, - opset_version=11, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - m_float = 
onnx.load("data.ignore/iq_mul3.onnx") - assert m_float.graph.node[0].op_type == 'iqMul' - - -def test_iq_add_export_onnx(): - linger.SetPlatFormQuant(platform_quant=PlatFormQuant.luna_quant) - x = torch.tensor([[6.0, 4.0]], requires_grad=True).cuda() - y = torch.tensor([[3.0, 8.0]], requires_grad=True).cuda() - iq_layer = iqAddLayer() - a0 = from_torch_tensor(x-1, 127.0/(6.0-1), 8) - b0 = from_torch_tensor(y-1, 127.0/(8-1), 8) - oscale = 127.0/(12-2) - iq_layer(a0, b0, oscale) - a = from_torch_tensor(x, 127.0/6.0, 8) - b = from_torch_tensor(y, 127.0/8, 8) - scale = torch.tensor([127.0/12]) - with torch.no_grad(): - linger.onnx.export(iq_layer, (a, b, scale), "data.ignore/iq_add1.onnx.ignore", export_params=True, - keep_initializers_as_inputs=False, opset_version=11, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - m_float = onnx.load("data.ignore/iq_add1.onnx.ignore") - assert m_float.graph.node[0].op_type == 'iqAdd' - assert len(m_float.graph.node[0].attribute) == 5 - for m in m_float.graph.node[0].attribute: - if m.name == "scale_o": - assert m.f == m.f - if m.name == "scale_x": - assert abs(m.f - 127.0/6) < 0.01 - if m.name == "scale_y": - assert abs(m.f - 127.0/8) < 0.01 - - -def test_iq_add_module_export_onnx(): - linger.SetPlatFormQuant(platform_quant=PlatFormQuant.luna_quant) - - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - - def forward(self, x, y): - return iqadd(self, x, y, 'testname') - x = torch.tensor([[6.0, 4.0]], requires_grad=True).cuda() - y = torch.tensor([[3.0, 8.0]], requires_grad=True).cuda() - net = TestModel() - a0 = from_torch_tensor(x-1, 127.0/(6.0-1), 8) - b0 = from_torch_tensor(y-1, 127.0/(8-1), 8) - oscale = 127.0/(12-2) - net(a0, b0) - a = from_torch_tensor(x, 127.0/6.0, 8) - b = from_torch_tensor(y, 127.0/8, 8) - scale = torch.tensor([127.0/12]) - with torch.no_grad(): - linger.onnx.export(net, (a, b), "data.ignore/iq_add2.onnx.ignore", export_params=True, 
keep_initializers_as_inputs=False, - opset_version=11, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - m_float = onnx.load("data.ignore/iq_add2.onnx.ignore") - assert m_float.graph.node[0].op_type == 'iqAdd' - assert len(m_float.graph.node[0].attribute) == 5 - assert len(m_float.graph.node) == 2 - for m in m_float.graph.node[0].attribute: - if m.name == "scale_o": - assert m.f == m.f - if m.name == "scale_x": - assert abs(m.f - 127.0/6) < 0.01 - if m.name == "scale_y": - assert abs(m.f - 127.0/8) < 0.01 - - -def test_reshape_export(): - is_tuple = False - - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - - def forward(self, *args): - if is_tuple: - return args[0].reshape((1, 4, -1)) - else: - return args[0].reshape(1, 4, -1) - net = TestModel() - dummy_input = torch.randn(1, 2, 3, 4) - with torch.no_grad(): - linger.onnx.export(net, dummy_input, "data.ignore/torch_reshape.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - #original_model = onnx.load("data.ignore/torch_view.onnx.ignore") - aa_model_md5 = get_file_topolo_sort_type_list( - "data.ignore/torch_reshape.onnx.ignore") - aa_iq = from_torch_tensor(dummy_input, 12, 3) - with torch.no_grad(): - linger.onnx.export(net, aa_iq, "data.ignore/iq_reshape.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - #iq1_model = onnx.load("data.ignore/iq_view.onnx.ignore") - aa_iq_model_md5 = get_file_topolo_sort_type_list( - "data.ignore/iq_reshape.onnx.ignore") - assert aa_model_md5 == aa_iq_model_md5 - - is_tuple = True - with torch.no_grad(): - linger.onnx.export(net, aa_iq, "data.ignore/iq_reshape_tuple.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - - 
aa_iq_tuple_model_md5 = get_file_topolo_sort_type_list( - "data.ignore/iq_reshape_tuple.onnx.ignore") - assert aa_model_md5 == aa_iq_tuple_model_md5 - - -def test_reshape_as_export(): - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - - def forward(self, *args): - return args[0].reshape_as(args[1]) - - net = TestModel() - dummy_input = torch.randn(1, 2, 3, 4) - aa_copy = dummy_input.detach().data - - aa_iq = from_torch_tensor(dummy_input, 12, 3) - with torch.no_grad(): - linger.onnx.export(net, (aa_iq, aa_copy), "data.ignore/iq_reshape_as.onnx.ignore", export_params=True, - opset_version=11, input_names=['x', 'y'], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - iq_model = onnx.load("data.ignore/iq_reshape_as.onnx.ignore") - assert iq_model.graph.node[1].op_type == 'Reshape' - - -def test_squeeze_export(): - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - - def forward(self, *args): - return args[0].squeeze() - net = TestModel() - dummy_input = torch.randn(1, 2, 3, 4) - with torch.no_grad(): - linger.onnx.export(net, dummy_input, "data.ignore/torch_squeeze.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - #original_model = onnx.load("data.ignore/torch_view.onnx.ignore") - aa_model_md5 = get_file_topolo_sort_type_list( - "data.ignore/torch_squeeze.onnx.ignore") - aa_iq = from_torch_tensor(dummy_input, 12, 3) - with torch.no_grad(): - linger.onnx.export(net, aa_iq, "data.ignore/iq_squeeze.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - #iq1_model = onnx.load("data.ignore/iq_view.onnx.ignore") - aa_iq_model_md5 = get_file_topolo_sort_type_list( - "data.ignore/iq_squeeze.onnx.ignore") - assert aa_model_md5 == aa_iq_model_md5 - - -def test_unsqueeze_export(): 
- class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - - def forward(self, *args): - return args[0].unsqueeze(dim=3) - net = TestModel() - dummy_input = torch.randn(1, 2, 3, 5) - with torch.no_grad(): - linger.onnx.export(net, dummy_input, "data.ignore/torch_unsqueeze.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - #original_model = onnx.load("data.ignore/torch_view.onnx.ignore") - aa_model_md5 = get_file_topolo_sort_type_list( - "data.ignore/torch_unsqueeze.onnx.ignore") - aa_iq = from_torch_tensor(dummy_input, 12, 3) - with torch.no_grad(): - linger.onnx.export(net, aa_iq, "data.ignore/iq_unsqueeze.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - #iq1_model = onnx.load("data.ignore/iq_view.onnx.ignore") - aa_iq_model_md5 = get_file_topolo_sort_type_list( - "data.ignore/iq_unsqueeze.onnx.ignore") - assert aa_model_md5 == aa_iq_model_md5 - - -def test_transpose_export(): - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - - def forward(self, *args): - x = args[0].squeeze() - return x.transpose(1, 2) - net = TestModel() - dummy_input = torch.randn(1, 2, 2, 2) - with torch.no_grad(): - linger.onnx.export(net, dummy_input, "data.ignore/torch_transpose.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - #original_model = onnx.load("data.ignore/torch_view.onnx.ignore") - aa_model_md5 = get_file_topolo_sort_type_list( - "data.ignore/torch_transpose.onnx.ignore") - aa_iq = from_torch_tensor(dummy_input, 5, 3) - with torch.no_grad(): - linger.onnx.export(net, aa_iq, "data.ignore/iq_transpose.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', ], 
operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - #iq1_model = onnx.load("data.ignore/iq_view.onnx.ignore") - aa_iq_model_md5 = get_file_topolo_sort_type_list( - "data.ignore/iq_transpose.onnx.ignore") - assert aa_model_md5 == aa_iq_model_md5 - - -def test_flatten_export(): - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - - def forward(self, *args): - return args[0].flatten(2, 4) - net = TestModel() - dummy_input = torch.randn(1, 2, 3, 7, 5) - with torch.no_grad(): - linger.onnx.export(net, dummy_input, "data.ignore/torch_flatten.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - #original_model = onnx.load("data.ignore/torch_view.onnx.ignore") - aa_model_md5 = get_file_topolo_sort_type_list( - "data.ignore/torch_flatten.onnx.ignore") - aa_iq = from_torch_tensor(dummy_input, 12, 3) - with torch.no_grad(): - linger.onnx.export(net, aa_iq, "data.ignore/iq_flatten.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - #iq1_model = onnx.load("data.ignore/iq_view.onnx.ignore") - aa_iq_model_md5 = get_file_topolo_sort_type_list( - "data.ignore/iq_flatten.onnx.ignore") - assert aa_model_md5 == aa_iq_model_md5 - - -def test_getitem_export(): - class TestModel(torch.nn.Module): - def __init__(self): - super(TestModel, self).__init__() - - def forward(self, *args): - return args[0][:, 1, :, 2, :] - net = TestModel() - dummy_input = torch.randn(1, 2, 3, 7, 5) - with torch.no_grad(): - linger.onnx.export(net, dummy_input, "data.ignore/torch_getitem.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - original_model = onnx.load("data.ignore/torch_getitem.onnx.ignore") - - aa_iq = from_torch_tensor(dummy_input, 12, 3) - with 
torch.no_grad(): - linger.onnx.export(net, aa_iq, "data.ignore/iq_getitem.onnx.ignore", export_params=True, opset_version=11, - input_names=['x', ], operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - iq_model = onnx.load("data.ignore/iq_getitem.onnx.ignore") - assert original_model.graph.node[0].op_type == iq_model.graph.node[0].op_type - assert original_model.graph.node[1].op_type == iq_model.graph.node[1].op_type - assert original_model.graph.node[2].op_type == iq_model.graph.node[2].op_type - assert original_model.graph.node[3].op_type == iq_model.graph.node[3].op_type diff --git a/test/test_iqadd_load_state_dict.py b/test/test_iqadd_load_state_dict.py deleted file mode 100644 index f4a66d6..0000000 --- a/test/test_iqadd_load_state_dict.py +++ /dev/null @@ -1,72 +0,0 @@ -import os - -import linger -import numpy as np -import torch -import torch.nn as nn -from linger.ops import IQTensor, from_torch_tensor - -if not os.path.exists('data.ignore'): - os.mkdir('data.ignore') - - -def test_iqadd_load_state_dict_1(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.transpose = nn.ConvTranspose2d(2, 2, 5, 5, 2, 4, 2, True, 2) - self.conv = nn.Conv2d(2, 2, kernel_size=3, stride=1, - padding=1, bias=True) - self.fc = nn.Linear(392, 100, bias=False) - - def forward(self, x): - trans = self.transpose(x) - assert isinstance(trans, IQTensor) - conv = self.conv(trans) - assert isinstance(conv, IQTensor) - x = trans + conv - assert isinstance(x, IQTensor) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - torch.cuda.set_device(0) - net1 = Net().cuda() - dummy_input = torch.randn(1, 2, 2, 2).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - - replace_tuple = (nn.Linear, nn.ConvTranspose2d, nn.Conv2d) - net1.train() - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - net1 = 
linger.init(net1, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - optimizer1 = torch.optim.SGD(net1.parameters(), lr=0.01) - for i in range(150): - optimizer1.zero_grad() - out1 = net1(dummy_input) - loss1 = criterion(out1, target) - loss1.backward() - optimizer1.step() - if i % 30 == 29: - print('loss1 {}'.format(loss1)) - net1.eval() - torch.save(net1.state_dict(), 'data.ignore/model.pt.ignore') - out1 = net1(dummy_input) - - net2 = Net().cuda() - net2 = linger.init(net2, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - net2.load_state_dict(torch.load( - 'data.ignore/model.pt.ignore', map_location='cpu')) - net2.eval() - out2 = net2(dummy_input) - - with torch.no_grad(): - torch.onnx.export(net2, dummy_input, "data.ignore/iqadd_t710.onnx", export_params=True, - opset_version=11, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - - assert out1.sum() == out2.sum() \ No newline at end of file diff --git a/test/test_iqdiv.py b/test/test_iqdiv.py deleted file mode 100644 index 19c440f..0000000 --- a/test/test_iqdiv.py +++ /dev/null @@ -1,86 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch -import torch.nn as nn -import linger -import numpy as np - - -class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn = nn.BatchNorm2d(10) - self.relu = nn.ReLU() - self.conv1 = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn1 = nn.BatchNorm2d(10) - self.relu1 = nn.ReLU() - self.conv2 = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn2 = nn.BatchNorm2d(10) - self.relu2 = nn.ReLU() - self.fc = nn.Linear(1000, 100) - - def forward(self, input): - x = self.conv(input) - x = self.bn(x) - x1 = self.relu(x) - - x = self.conv1(input) - x2 = self.bn1(x) - x2 = self.relu(x2) - # y = linger.from_torch_tensor(torch.tensor(1),0.5,8).cuda() - x2 = x2 + 1 - # 
x2 = self.relu1(x) - x = x1 / 2 - x = self.conv2(x) - x = self.bn2(x) - x = self.relu2(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - - -def test_conv_linear(): - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - # random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.Linear, nn.BatchNorm2d) - net.train() - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - linger.FuseConvBNAheadRelu( - net, aa, fused_bn=False, ahead_bn_relu=True, ahead_conv_relu=True) - net = linger.init(net, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - print(net) - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - loss = None - for i in range(200): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - # assert loss < 1e-12, 'training loss error' - net.eval() - torch.save(net.state_dict(), 'data.ignore/conv_iqdiv.pt') - print(net) - out1 = net(aa) - # print(out1) - with torch.no_grad(): - torch.onnx.export(net, aa, "data.ignore/conv_iqdiv1.onnx", export_params=True, - opset_version=11, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - assert abs(out1.mean() - 1) < 0.1 diff --git a/test/test_iqmul.py b/test/test_iqmul.py deleted file mode 100644 index ef0f338..0000000 --- a/test/test_iqmul.py +++ /dev/null @@ -1,54 +0,0 @@ -import torch -import torch.nn as nn -import linger -from linger.ops import from_torch_tensor -from linger.ops import (IQTensor, from_torch_tensor, iqadd, iqAddLayer, - iqmul, iqMulLayer, torch_cat) - -def test_iqmul_iqtensor_scalar(): - class iqTestLayer(torch.nn.Module): - def __init__(self): - super(iqTestLayer, self).__init__() - self.fc = nn.Linear(2, 1) - - def forward(self, x): - x = self.fc(x) - x 
= x * 0.16551 - return x - - model = iqTestLayer().cuda() - model = linger.init(model, quant_modules=(nn.Linear), parameter_bits=8) - - x = torch.tensor([[0.6, 0.4]], requires_grad=True).cuda() - # import pdb; pdb.set_trace() - a = from_torch_tensor(x, 127.0/8, 8) - - out = model(x) - model.eval() - - with torch.no_grad(): - torch.onnx.export(model, (x),"iqmul.onnx",export_params=True,opset_version=11,operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - -def test_iqmul_scale_x_y_o(): - - class iqTestLayer(torch.nn.Module): - def __init__(self): - super(iqTestLayer, self).__init__() - self.fc = nn.Linear(2, 1) - - def forward(self, x): - x = iqmul(self, x, 0.125) - return x - - x = torch.tensor([[0.6, 0.4]], requires_grad=True).cuda() - a = from_torch_tensor(x, 16, 8) - - model = iqTestLayer().cuda() - # model = linger.init(model, quant_modules=(), parameter_bits=8) - - out = model(a) - model.eval() - - with torch.no_grad(): - torch.onnx.export(model, (a),"iqmul.onnx",export_params=True,opset_version=11,operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - \ No newline at end of file diff --git a/test/test_iqsigmoid_net.py b/test/test_iqsigmoid_net.py deleted file mode 100644 index acd236b..0000000 --- a/test/test_iqsigmoid_net.py +++ /dev/null @@ -1,56 +0,0 @@ -import linger -import numpy as np -import torch -import torch.nn as nn - - -def test_iqsigmoid(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.conv1 = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.fc = nn.Linear(1000, 100) - - def forward(self, x): - x = self.conv(x) - x = self.conv1(x) - x = torch.sigmoid(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - dummy_input = 
torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.Linear) - net.train() - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - net = linger.init(net, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - loss = None - for i in range(200): - optimizer.zero_grad() - out = net(dummy_input) - loss = criterion(out, target) - if i % 1 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - # assert loss < 1e-12, 'training loss error' - net.eval() - torch.save(net.state_dict(), 'data.ignore/sigmoid.pt') - out1 = net(dummy_input) - with torch.no_grad(): - torch.onnx.export(net, dummy_input, "data.ignore/resize_net.onnx", export_params=True, - opset_version=11, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - #print('out1: ', out1) \ No newline at end of file diff --git a/test/test_iqsihft_quant.py b/test/test_iqsihft_quant.py deleted file mode 100644 index 24dfbce..0000000 --- a/test/test_iqsihft_quant.py +++ /dev/null @@ -1,50 +0,0 @@ -import linger -import numpy as np -import torch -import torch.nn as nn - - -def test_tf_quant(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.transpose = nn.ConvTranspose2d(2, 2, 5, 5, 2, 4, 2, True, 2) - self.conv = nn.Conv2d(2, 2, kernel_size=3, stride=1, - padding=1, bias=True) - self.fc = nn.Linear(392, 100, bias=False) - - def forward(self, x): - x = self.transpose(x) - x = self.conv(x) - n, c, h, w = x.shape - x = x.view(n, c*h*w) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - torch.cuda.set_device(0) - net1 = Net().cuda() - dummy_input = torch.randn(1, 2, 2, 2).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.ConvTranspose2d, nn.Linear) - net1.train() - - net1 = linger.init(net1, 
quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - # linger.SetTFQuant(luna_quant=True) - optimizer1 = torch.optim.SGD(net1.parameters(), lr=0.01) - for i in range(150): - optimizer1.zero_grad() - out1 = net1(dummy_input) - loss1 = criterion(out1, target) - loss1.backward() - optimizer1.step() - if i % 30 == 29: - print('loss1 {}'.format(loss1)) - net1.eval() - out1 = net1(dummy_input) \ No newline at end of file diff --git a/test/test_iqsum.py b/test/test_iqsum.py deleted file mode 100644 index a9032a0..0000000 --- a/test/test_iqsum.py +++ /dev/null @@ -1,76 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch -import torch.nn as nn -import linger -import numpy as np - - -class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn = nn.BatchNorm2d(10) - self.relu = nn.ReLU() - self.conv1 = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn1 = nn.BatchNorm2d(10) - self.relu1 = nn.ReLU() - self.fc = nn.Linear(1000, 100) - - def forward(self, input): - x = self.conv(input) - x = self.bn(x) - x = self.relu(x) - - x1 = self.conv1(x) - x1 = self.bn1(x1) - x1 = self.relu1(x1) - x1 = x1.sum() - x = x/x1 - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - - -def test_conv_linear(): - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - # random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.Linear, nn.BatchNorm2d) - net.train() - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - linger.FuseConvBNAheadRelu( - net, aa, fused_bn=False, ahead_bn_relu=True, ahead_conv_relu=True) - net = linger.init(net, 
quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - loss = None - for i in range(200): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - # assert loss < 1e-12, 'training loss error' - # net.load_state_dict(torch.load('data.ignore/conv_iqsum.pt')) - net.eval() - torch.save(net.state_dict(), 'data.ignore/conv_iqsum.pt') - out1 = net(aa) - # print(out1) - with torch.no_grad(): - torch.onnx.export(net, aa, "data.ignore/conv_iqsum2.onnx", export_params=True, - opset_version=11, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - # assert abs(out1.mean() - 1) < 0.01 diff --git a/test/test_iqtensor.py b/test/test_iqtensor.py deleted file mode 100644 index eb78307..0000000 --- a/test/test_iqtensor.py +++ /dev/null @@ -1,465 +0,0 @@ -import math - -import linger -import torch -import torch.nn as nn -from linger.ops import * - -torch.set_printoptions(linewidth=28*10) - - -class CNN(nn.Module): - def __init__(self): - super(CNN, self).__init__() - self.conv1 = nn.Conv2d( - in_channels=1, - out_channels=16, - kernel_size=5, - stride=1, - padding=2, - ) - self.relu1 = nn.ReLU() - self.pool1 = nn.MaxPool2d(kernel_size=2) - self.conv2 = nn.Conv2d(16, 32, 5, 1, 2) - self.relu2 = nn.ReLU() - self.pool2 = nn.MaxPool2d(2) - self.out = nn.Linear(32 * 7 * 7, 10) - - def forward(self, x): - x = self.conv1(x) - assert isinstance(x, IQTensor) - s0 = x.scale_data - x = self.relu1(x) - assert isinstance(x, IQTensor) - s1 = x.scale_data - assert s0 == s1 - x = self.pool1(x) - assert isinstance(x, IQTensor) - s2 = x.scale_data - assert s0 == s2 - x = self.conv2(x) - assert isinstance(x, IQTensor) - s0 = x.scale_data - x = self.relu2(x) - assert isinstance(x, IQTensor) - s1 = x.scale_data - assert s0 == s1 - x = self.pool2(x) - assert isinstance(x, IQTensor) - s2 = x.scale_data - assert s0 == s2 - x = 
x.view(x.size(0), -1) - assert isinstance(x, IQTensor) - s3 = x.scale_data - assert s0 == s3 - output = self.out(x) - assert isinstance(output, IQTensor) - s4 = output.scale_data - return output - - -def test_convint_base(): - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - net = Conv2dInt(in_channels=1, out_channels=16, - kernel_size=5, stride=1, padding=2,) - net.weight.data.fill_(1.0) - net.bias.data.fill_(0) - t = torch.ones(1, 1, 28, 28) - r = net(t) - assert hasattr(r, 'scale_data') - # assert r.sum().item() == 287296 - iq = torch.ones(1, 1, 28, 28)*(-0.1) - iq = from_torch_tensor(iq, 0.5, 8) - riq = net(iq) - assert hasattr(riq, 'scale_data') - assert riq.sum().item() == 0 - iq2 = torch.ones(1, 1, 28, 28) - iq2 = from_torch_tensor(iq2, 0.6, 8) - riq2 = net(iq2) - assert hasattr(riq2, 'scale_data') - # assert abs(riq2.sum() - 478827) < 1 - - -def test_linearint_base(): - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - net = LinearInt(64, 16) - net.weight.data.fill_(1.0) - net.bias.data.fill_(0) - t = torch.ones(64, 64) - r = net(t) - assert not hasattr(r, 'scale_data') - iq = torch.ones(64, 64) * (-0.1) - iq = from_torch_tensor(iq, 0.5, 8) - riq = net(iq) - assert not hasattr(riq, 'scale_data') - assert riq.sum().item() == 0 - - iq2 = torch.ones(64, 64) - iq2 = from_torch_tensor(iq2, 0.6, 8) - riq2 = net(iq2) - assert not hasattr(riq2, 'scale_data') - # assert abs(riq2.sum().item() - 109226) < 1 - -def test_linearint_base_obits(): - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - net = LinearInt(64, 16, o_bits=8) - net.weight.data.fill_(1.0) - net.bias.data.fill_(0) - t = torch.ones(64, 64) - r = net(t) - assert hasattr(r, 'scale_data') - iq = torch.ones(64, 64)*(-0.1) - iq = from_torch_tensor(iq, 0.5, 8) - riq = net(iq) - assert hasattr(riq, 'scale_data') - assert riq.sum().item() == 0 - - iq2 = torch.ones(64, 64) - iq2 = from_torch_tensor(iq2, 0.6, 8) - riq2 = net(iq2) - assert 
hasattr(riq2, 'scale_data') - # assert abs(riq2.sum().item() - 109226 ) < 1 - - net.weight.data.fill_(0.1) - iq3 = torch.ones(64, 64) - iq3 = from_torch_tensor(iq3, 0.6, 8) - riq3 = net(iq3) - assert hasattr(riq3, 'scale_data') - # assert abs (riq3.sum().item() - 10922)< 1 - - -def test_convint_out(): - net = CNN().cuda() - net = linger.init(net) - aa = torch.randn(1, 1, 28, 28).cuda() - net(aa) - - -def test_view(): - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - a = torch.randn((1, 2, 3, 4), requires_grad=True) - b = a.detach().clone() - b.requires_grad_() - c = a.detach().clone() - c.requires_grad_() - x = a.view((4, 3, 2, 1)) - x.sum().backward() - - b_iq = from_torch_tensor(b, 120, 8) - - y = b_iq.view(4, 3, 2, 1) - assert isinstance(y, IQTensor) - y.sum().backward() - - assert (b.grad - a.grad).sum() == 0 - assert b.grad.size() == a.grad.size() - - c_iq = from_torch_tensor(c, 120, 8) - z = c_iq.view((4, 3, 2, 1)) - z.sum().backward() - assert (c.grad - a.grad).sum() == 0 - assert c.grad.size() == a.grad.size() - - -def test_view_as(): - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - a = torch.randn((1, 2, 3, 4), requires_grad=True) - a_target = torch.randn((3, 2, 1, 4), requires_grad=True) - b = a.detach().clone() - b.requires_grad_() - c = a.detach().clone() - c.requires_grad_() - x = a.view_as(a_target) - x.sum().backward() - - b_iq = from_torch_tensor(b, 120, 8) - - y = b_iq.view_as(a_target) - assert isinstance(y, IQTensor) - y.sum().backward() - assert (b.grad - a.grad).sum() == 0 - assert b.grad.size() == a.grad.size() - - -def test_transposeconv2dint_base(): - my_convtranspose = ConvTranspose2dInt(50, 50, 5, 5, 2, 4, 2, False, 2) - input1 = torch.ones(50, 50, 50, 50) - out = my_convtranspose(input1) - assert type(out) == torch.Tensor - - -def test_transposeconv2dint_obit(): - my_convtranspose = ConvTranspose2dInt( - 50, 50, 5, 5, 2, 4, 2, False, 2, o_bits=8) - input1 = torch.ones(50, 50, 50, 50) - 
out = my_convtranspose(input1) - assert type(out) == IQTensor - assert out.scale_data > 0 - assert out.bits == 8 - - -def test_transposeconv2dint_iq_obit(): - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - my_convtranspose = ConvTranspose2dInt( - 50, 50, 5, 5, 2, 4, 2, False, 2, o_bits=8) - input1 = torch.ones(50, 50, 50, 50) - input_iq = from_torch_tensor(input1, 127/1.0, 8) - out = my_convtranspose(input_iq) - assert type(out) == IQTensor - assert out.scale_data > 0 - assert out.bits == 8 - assert abs(my_convtranspose.running_x-1.0) < 0.001 - - -def test_reshape(): - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - a = torch.randn((1, 2, 3, 4), requires_grad=True) - b = a.detach().clone() - b.requires_grad_() - c = a.detach().clone() - c.requires_grad_() - d = a.detach().clone() - d.requires_grad_() - - x = a.reshape((4, 3, 2, 1)) - x.sum().backward() - - b_iq = from_torch_tensor(b, 120, 8) - - y = b_iq.reshape(4, 3, 2, 1) - assert isinstance(y, IQTensor) - assert y.size() == (4, 3, 2, 1) - y.sum().backward() - assert b_iq.grad is None - assert (b.grad - a.grad).sum() == 0 - assert b.grad.size() == a.grad.size() - - c_iq = from_torch_tensor(c, 120, 8) - z = c_iq.reshape((4, 3, 2, 1)) - assert isinstance(z, IQTensor) - assert z.size() == (4, 3, 2, 1) - z.sum().backward() - assert c_iq.grad is None - assert (c.grad - a.grad).sum() == 0 - assert c.grad.size() == a.grad.size() - - d_iq = from_torch_tensor(d, 120, 8) - z = d_iq.reshape((4, -1, 2, 1)) - assert isinstance(z, IQTensor) - assert z.size() == (4, 3, 2, 1) - z.sum().backward() - assert d_iq.grad is None - assert (d.grad - a.grad).sum() == 0 - assert d.grad.size() == a.grad.size() - - -def test_reshape_as(): - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - a = torch.randn((1, 2, 3, 4), requires_grad=True) - shape_tensor = torch.randn((1, 3, 2, 4), requires_grad=True) - b = a.detach().clone() - b.requires_grad_() - c = 
a.detach().clone() - c.requires_grad_() - d = a.detach().clone() - d.requires_grad_() - - x = a.reshape_as(shape_tensor) - x.sum().backward() - - b_iq = from_torch_tensor(b, 120, 8) - - y = b_iq.reshape_as(shape_tensor) - assert isinstance(y, IQTensor) - assert y.size() == shape_tensor.size() - y.sum().backward() - assert b_iq.grad is None - assert (b.grad - a.grad).sum() == 0 - assert b.grad.size() == a.grad.size() - - -def test_squeeze(): - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - a = torch.randn((1, 2, 1, 3, 1, 5), requires_grad=True) - b = a.detach().clone() - b.requires_grad_() - a1 = a.detach().clone() - a1.requires_grad_() - d = a.detach().clone() - d.requires_grad_() - - x = a.squeeze() - x.sum().backward() - - b_iq = from_torch_tensor(b, 120, 8) - - y = b_iq.squeeze() - assert isinstance(y, IQTensor) - assert y.size() == (2, 3, 5) - y.sum().backward() - assert b_iq.grad is None - assert (b.grad - a.grad).sum() == 0 - assert b.grad.size() == a.grad.size() - - x1 = a1.squeeze(2) - x1.sum().backward() - - c_iq = from_torch_tensor(d, 120, 8) - z = c_iq.squeeze(2) - assert isinstance(z, IQTensor) - assert z.size() == (1, 2, 3, 1, 5) - z.sum().backward() - assert c_iq.grad is None - assert (d.grad - a1.grad).sum() == 0 - assert d.grad.size() == a1.grad.size() - - -def test_unsqueeze(): - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - a = torch.randn((2, 3, 1, 5), requires_grad=True) - b = a.detach().clone() - b.requires_grad_() - - x = a.unsqueeze(0) - x.sum().backward() - - b_iq = from_torch_tensor(b, 120, 8) - y = b_iq.unsqueeze(0) - assert isinstance(y, IQTensor) - assert y.size() == (1, 2, 3, 1, 5) - y.sum().backward() - assert b_iq.grad is None - assert (b.grad - a.grad).sum() == 0 - assert b.grad.size() == a.grad.size() - - -def test_transpose(): - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - a = torch.randn((2, 3, 5), requires_grad=True) - b = a.detach().clone() - 
b.requires_grad_() - - x = a.transpose(1, 2) - sizex = x.size() - x.sum().backward() - - b_iq = from_torch_tensor(b, 120, 8) - y = b_iq.transpose(1, 2) - assert isinstance(y, IQTensor) - assert y.size() == sizex - y.sum().backward() - assert b_iq.grad is None - assert (b.grad - a.grad).sum() == 0 - assert b.grad.size() == a.grad.size() - -def test_getitem(): - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - a = torch.randn((2, 5, 6, 7), requires_grad=True) - b = a.detach().clone() - b.requires_grad_() - - x = a[1, :, :, :] - sizex = x.size() - x.sum().backward() - - b_iq = from_torch_tensor(b, 120, 8) - y = b_iq[1, :, :, :] - assert isinstance(y, IQTensor) - assert y.size() == sizex - y.sum().backward() - assert b_iq.grad is None - assert (b.grad - a.grad).sum() == 0 - assert b.grad.size() == a.grad.size() - -def test_flatten(): - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - a = torch.randn((2, 3, 1, 5, 6, 7), requires_grad=True) - b = a.detach().clone() - b.requires_grad_() - - x = a.flatten(1, 3) - sizex = x.size() - x.sum().backward() - - b_iq = from_torch_tensor(b, 120, 8) - y = b_iq.flatten(1, 3) - assert isinstance(y, IQTensor) - assert y.size() == sizex - y.sum().backward() - assert b_iq.grad is None - assert (b.grad - a.grad).sum() == 0 - assert b.grad.size() == a.grad.size() - - -def test_split(): - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - a = torch.randn((5, 2), requires_grad=True) - b = a.detach().clone() - b.requires_grad_() - - x = a.split(2) - w = torch.cat(x, 0).sum() - w.backward() - print(x) - - b_iq = from_torch_tensor(b, 120, 8) - y = b_iq.split(2) - z = torch.cat(y, 0).sum() - - z.backward() - print(y) - - assert isinstance(y[0], IQTensor) - - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.pool = nn.AvgPool2d((2, 2), (2, 2), (0, 0), False) - 
self.fc = nn.Linear(250, 100) - - def forward(self, x): - x = self.conv(x) - n, c, h, w = x.shape - x = self.pool(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = x.split(150) - x = self.fc(x[0]) - return x - - import numpy as np - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - dummy_input = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.Linear, nn.AvgPool2d) - net.train() - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - # linger.disable_quant(net.fc) - net = linger.init(net, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - loss = None - for i in range(1): - optimizer.zero_grad() - out = net(dummy_input) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - net.eval() - torch.save(net.state_dict(), 'data.ignore/aa.pt') - out1 = net(dummy_input) - with torch.no_grad(): - torch.onnx.export(net, dummy_input, "data.ignore/split.onnx", export_params=True, opset_version=11, - operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - print("out1.mean(): ", out1.mean()) - assert abs(out1.mean() - 1) < 2 \ No newline at end of file diff --git a/test/test_iqtensor_functional.py b/test/test_iqtensor_functional.py deleted file mode 100644 index c8e3d00..0000000 --- a/test/test_iqtensor_functional.py +++ /dev/null @@ -1,50 +0,0 @@ -import torch -from linger.ops import from_torch_tensor - - -def test_relu(): - a = torch.randn((16, 16), requires_grad=True) - x = a.detach().clone() - x.requires_grad_() - b = torch.relu(a) - b.sum().backward() - - y = from_torch_tensor(x, 5, 8) - c = torch.relu(y) - assert c.scale_data == 5 - assert c.bits == 8 - c.sum().backward() - - assert (a.grad - x.grad).sum() == 0 - - -def test_relu_(): - a = 
torch.randn((16, 16), requires_grad=True) - x = a.detach().clone() - x.requires_grad_() - b = torch.relu_(a) - b.sum().backward() - - y = from_torch_tensor(x, 5, 8) - c = torch.relu_(y) - assert c.scale_data == 5 - assert c.bits == 8 - c.sum().backward() - - assert (a.grad - x.grad).sum() == 0 - - -def test_max_pool2d(): - a = torch.randn((1, 16, 16), requires_grad=True) - x = a.detach().clone() - x.requires_grad_() - b = torch.max_pool2d(a, kernel_size=(2, 2), stride=2, padding=0) - b.sum().backward() - - y = from_torch_tensor(x, 5, 8) - c = torch.max_pool2d(y, kernel_size=(2, 2), stride=2, padding=0) - assert c.scale_data == 5 - assert c.bits == 8 - c.sum().backward() - - assert (a.grad - x.grad).sum() == 0 diff --git a/test/test_iqvar_net.py b/test/test_iqvar_net.py deleted file mode 100644 index f364540..0000000 --- a/test/test_iqvar_net.py +++ /dev/null @@ -1,53 +0,0 @@ -import numpy as np -import torch -import torch.nn as nn - -import linger - - -def test_var(): - class Model(nn.Module): - def __init__(self): - super(Model, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.fc = nn.Linear(1, 100) - - def forward(self, x): - x = self.conv(x) - x = torch.var(x, 1, False).reshape(-1) - x = x.unsqueeze(-1) - x = self.fc(x) - return x - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - dummy_input = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - - replace_tuple = (nn.Conv2d, nn.Linear, nn.AvgPool2d) - model = Model().cuda() - - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - linger.SetIQTensorVar(True) - model = linger.init(model, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - - optimizer = torch.optim.SGD(model.parameters(), lr=0.01) - loss = None - for i in range(20): - optimizer.zero_grad() - out = model(dummy_input) - print(out) - loss = criterion(out, target) - if i % 1 == 0: - 
print('loss: ', loss) - loss.backward() - optimizer.step() - - with torch.no_grad(): - torch.onnx.export(model, dummy_input, "var.onnx", export_params=True, opset_version=11, - operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) diff --git a/test/test_layernorm.py b/test/test_layernorm.py new file mode 100644 index 0000000..acec991 --- /dev/null +++ b/test/test_layernorm.py @@ -0,0 +1,118 @@ +import torch +import torch.nn as nn +import numpy as np + +import linger +from linger.quant.ops.qmodule import * +from linger.config import QUANT_CONFIGS + +q_configs = QUANT_CONFIGS + +class QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + super(QuantizationTestNet, self).__init__() + + self.layernorm_float = nn.LayerNorm([3, 32, 32]) + + self.layernorm_quant = QLayerNorm( + [3, 32, 32], + weights_cfg=q_configs.quant_info.to_dict(), + activations_cfg=q_configs.quant_info.to_dict(), + bias_cfg=q_configs.quant_info.to_dict(), + constrain = q_configs.clamp_info.to_dict() + ) + + with torch.no_grad(): + self.layernorm_float.weight.copy_(self.layernorm_quant.qweight) + if self.layernorm_quant.qbias is not None: + self.layernorm_float.bias.copy_(self.layernorm_quant.qbias) + + def forward(self, x): + result_float = self.layernorm_float(x) + result_quant = self.layernorm_quant(x) + + return { + 'result_float': result_float, + 'result_quant': result_quant + } + +def test_quantization_in_forward(model): + print("=" * 60) + print("测试模型前向传播中的量化layernorm比较...") + print("=" * 60) + + device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') + print(f"使用设备: {device}") + + batch_size = 2 + x = torch.rand(batch_size, 3, 32, 32).to(device) + + with torch.no_grad(): + model = model.to(device) + outputs = model(x) + + float_result = outputs['result_float'] + quant_result = outputs['result_quant'] + + diff = torch.abs(float_result) - torch.abs(quant_result) + mean_float = torch.mean(torch.abs(float_result)).item() + mean_diff = 
torch.mean(diff).item() + relative_diff = mean_diff/mean_float + + return relative_diff + +def check_gradients(model, input_tensor): + model.train() + model.zero_grad() + output_dict = model(input_tensor) + + for key, output_value in output_dict.items(): + target = torch.randn_like(output_value) + loss = torch.nn.functional.mse_loss(output_value, target) + + loss.backward() + + print("=== 卷积层梯度检查 ===") + print(f"输入形状: {input_tensor.shape}") + print(f"输出形状: {output_value.shape}") + print(f"损失值: {loss.item():.6f}\n") + + for name, param in model.named_parameters(): + if param.grad is not None: + grad = param.grad + + is_zero = torch.allclose(grad, torch.zeros_like(grad), atol=1e-8) + grad_norm = grad.norm().item() + + if 'layernorm_float' in name and 'layernorm_quant' not in name: + layer_type = "layernorm" + elif 'layernorm_quant' in name: + layer_type = "qlayernorm" + else: + continue + + status = "✓ 全0" if is_zero else "✗ 非0" + print(f"{layer_type:8} {name:15} | {status} | 梯度范数: {grad_norm:.2e}") + + return loss.item() + +if __name__ == "__main__": + model = QuantizationTestNet(num_classes=10) + print("开始量化layernorm精度测试...") + print("=" * 60) + + input_tensor = torch.randn(2, 3, 32, 32, requires_grad=True) + loss = check_gradients(model, input_tensor) + + final_relative_diff = test_quantization_in_forward(model) + final_relative_diff = torch.abs(torch.tensor(final_relative_diff)).item() + + print("\n" + "=" * 60) + print("最终评估:") + + threshold = 0.001 + + if final_relative_diff < threshold: + print(f"✓ 量化layernorm测试成功!最大差异 {final_relative_diff:.6f} < 阈值 {threshold}") + else: + print(f"✗ 量化layernorm测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") \ No newline at end of file diff --git a/test/test_linear.py b/test/test_linear.py new file mode 100644 index 0000000..2c0306c --- /dev/null +++ b/test/test_linear.py @@ -0,0 +1,121 @@ +import torch +import torch.nn as nn +import numpy as np + +import linger +from linger.quant.ops.qmodule import * +from 
linger.config import QUANT_CONFIGS + +q_configs = QUANT_CONFIGS + +class QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + super(QuantizationTestNet, self).__init__() + + self.linear_float = nn.Linear(3 * 32 * 32, 512) + + self.linear_quant = QLinear( + 3 * 32 * 32, + 512, + weights_cfg=q_configs.quant_info.to_dict(), + activations_cfg=q_configs.quant_info.to_dict(), + bias_cfg=q_configs.quant_info.to_dict(), + constrain = q_configs.clamp_info.to_dict() + ) + + with torch.no_grad(): + self.linear_float.weight.copy_(self.linear_quant.qweight) + if self.linear_quant.qbias is not None: + self.linear_float.bias.copy_(self.linear_quant.qbias) + + def forward(self, x): + result_float = self.linear_float(x) + result_quant = self.linear_quant(x) + + return { + 'result_float': result_float, + 'result_quant': result_quant + } + +def test_quantization_in_forward(model): + print("=" * 60) + print("测试模型前向传播中的量化linear比较...") + print("=" * 60) + + device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') + print(f"使用设备: {device}") + + batch_size = 2 + x = torch.rand(batch_size, 3, 32, 32).to(device) + x = x.view(batch_size, -1) + + with torch.no_grad(): + model = model.to(device) + outputs = model(x) + + float_result = outputs['result_float'] + quant_result = outputs['result_quant'] + + diff = torch.abs(float_result) - torch.abs(quant_result) + mean_float = torch.mean(float_result).item() + mean_diff = torch.mean(diff).item() + relative_diff = mean_diff/mean_float + + return relative_diff + +def check_gradients(model, input_tensor): + model.train() + model.zero_grad() + output_dict = model(input_tensor) + + for key, output_value in output_dict.items(): + target = torch.randn_like(output_value) + loss = torch.nn.functional.mse_loss(output_value, target) + + loss.backward() + + print("=== 卷积层梯度检查 ===") + print(f"输入形状: {input_tensor.shape}") + print(f"输出形状: {output_value.shape}") + print(f"损失值: {loss.item():.6f}\n") + + for name, param in 
model.named_parameters(): + if param.grad is not None: + grad = param.grad + + is_zero = torch.allclose(grad, torch.zeros_like(grad), atol=1e-8) + grad_norm = grad.norm().item() + + if 'linear_float' in name and 'linear_quant' not in name: + layer_type = "linear" + elif 'linear_quant' in name: + layer_type = "qlinear" + else: + continue + + status = "✓ 全0" if is_zero else "✗ 非0" + print(f"{layer_type:8} {name:15} | {status} | 梯度范数: {grad_norm:.2e}") + + return loss.item() + +if __name__ == "__main__": + model = QuantizationTestNet(num_classes=10) + print("开始量化linear精度测试...") + print("=" * 60) + + input_tensor = torch.randn(2, 3, 32, 32, requires_grad=True) + input_tensor = input_tensor.view(2, -1) + loss = check_gradients(model, input_tensor) + + final_relative_diff = test_quantization_in_forward(model) + final_relative_diff = torch.abs(torch.tensor(final_relative_diff)).item() + + print("=" * 60) + print("最终评估:") + + threshold = 0.001 + + if final_relative_diff < threshold: + print(f"✓ 量化linear测试成功!最大差异 {final_relative_diff:.6f} < 阈值 {threshold}") + else: + print(f"✗ 量化linear测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") \ No newline at end of file diff --git a/test/test_lstm.py b/test/test_lstm.py new file mode 100644 index 0000000..c0a69a3 --- /dev/null +++ b/test/test_lstm.py @@ -0,0 +1,141 @@ +import torch +import torch.nn as nn +import torch.optim as optim +from torch.utils.data import DataLoader, TensorDataset + +# ========================================================== +# 1. 
定义一个最简单的GRU网络 +# ========================================================== +class LSTMNet(nn.Module): + def __init__(self, input_size=10, hidden_size=20, num_layers=1, num_classes=2): + super(LSTMNet, self).__init__() + self.lstm = nn.LSTM(input_size=input_size, + hidden_size=hidden_size, + num_layers=num_layers, + batch_first=True, + bidirectional=True) + self.fc = nn.Linear(hidden_size * 2, num_classes) + + def forward(self, x): + # x: [batch, seq_len, input_size] + out, h_n = self.lstm(x) # out: [batch, seq_len, hidden_size] + out = out[:, -1, :] # 取最后一个时间步的输出 + out = self.fc(out) # [batch, num_classes] + return out + +# ========================================================== +# 2. 生成随机数据 (假数据用于验证功能) +# ========================================================== +def generate_data(num_samples=1000, seq_len=5, input_size=10, num_classes=2): + X = torch.randn(num_samples, seq_len, input_size) + y = torch.randint(0, num_classes, (num_samples,)) + return X, y + +# ========================================================== +# 3. 
训练与测试流程 +# ========================================================== +def train_and_test(): + # 参数配置 + input_size = 10 + hidden_size = 20 + num_classes = 2 + seq_len = 5 + num_epochs = 5 + batch_size = 32 + lr = 1e-3 + + device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + print(f"🚀 Using device: {device}") + + # 数据 + X_train, y_train = generate_data(800, seq_len, input_size, num_classes) + X_test, y_test = generate_data(200, seq_len, input_size, num_classes) + + train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=batch_size, shuffle=True) + test_loader = DataLoader(TensorDataset(X_test, y_test), batch_size=batch_size) + + # 模型、优化器、损失函数 + model = LSTMNet(input_size, hidden_size, num_classes=num_classes).to(device) + print(model) + + criterion = nn.CrossEntropyLoss() + optimizer = optim.Adam(model.parameters(), lr=lr) + + for epoch in range(num_epochs): + model.train() + total_loss = 0.0 + for X_batch, y_batch in train_loader: + X_batch, y_batch = X_batch.to(device), y_batch.to(device) + optimizer.zero_grad() + outputs = model(X_batch) + loss = criterion(outputs, y_batch) + loss.backward() + optimizer.step() + total_loss += loss.item() + + avg_loss = total_loss / len(train_loader) + print(f"Epoch [{epoch+1}/{num_epochs}] Loss: {avg_loss:.4f}") + + model.eval() + correct = 0 + total = 0 + with torch.no_grad(): + for X_batch, y_batch in test_loader: + X_batch, y_batch = X_batch.to(device), y_batch.to(device) + outputs = model(X_batch) + _, predicted = torch.max(outputs, 1) + total += y_batch.size(0) + correct += (predicted == y_batch).sum().item() + acc = correct / total * 100 + print(f"Test Accuracy: {acc:.2f}%") + + # 量化配置 + import linger + from linger.utils import FakeQuantMethod, QatMethod + linger.QUANT_CONFIGS.quant_method = FakeQuantMethod.CUDA + linger.QUANT_CONFIGS.quant_info.qat_method = QatMethod.MOM + linger.QUANT_CONFIGS.quant_info.weight_bits = 8 + linger.QUANT_CONFIGS.quant_info.activate_bits = 8 + + model = 
linger.init(model) + print(model) + + criterion = nn.CrossEntropyLoss() + optimizer = optim.Adam(model.parameters(), lr=lr) + + # 训练循环 + for epoch in range(num_epochs): + model.train() + total_loss = 0.0 + for X_batch, y_batch in train_loader: + X_batch, y_batch = X_batch.to(device), y_batch.to(device) + optimizer.zero_grad() + outputs = model(X_batch) + loss = criterion(outputs, y_batch) + loss.backward() + optimizer.step() + total_loss += loss.item() + + avg_loss = total_loss / len(train_loader) + print(f"Epoch [{epoch+1}/{num_epochs}] Loss: {avg_loss:.4f}") + + # 测试 + model.eval() + correct = 0 + total = 0 + with torch.no_grad(): + for X_batch, y_batch in test_loader: + X_batch, y_batch = X_batch.to(device), y_batch.to(device) + outputs = model(X_batch) + _, predicted = torch.max(outputs, 1) + total += y_batch.size(0) + correct += (predicted == y_batch).sum().item() + + acc = correct / total * 100 + print(f"Test Accuracy: {acc:.2f}%") + +# ========================================================== +# 4. 
运行主程序 +# ========================================================== +if __name__ == "__main__": + train_and_test() diff --git a/test/test_lstm_int.py b/test/test_lstm_int.py deleted file mode 100644 index dc12f05..0000000 --- a/test/test_lstm_int.py +++ /dev/null @@ -1,123 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import linger -import torch -import torch.nn as nn -import numpy - -def test_lstmpint_net(): - - def getacc(lprob, target): - num_class = lprob.size()[1] - _, new_target = torch.broadcast_tensors(lprob, target) - - remove_pad_mask = new_target.ne(-1) - lprob = lprob[remove_pad_mask] - - target = target[target!=-1] - target = target.unsqueeze(-1) - - - lprob = lprob.reshape((-1, num_class)) - - preds = torch.argmax(lprob, dim=1) - - correct_holder = torch.eq(preds.squeeze(), target.squeeze()).float() - - num_corr = correct_holder.sum() - num_sample = torch.numel(correct_holder) - acc = num_corr/num_sample - return acc - - - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv0 = nn.Conv2d(1, 100, kernel_size=(1,3), padding=(0,1), groups=1, bias=True) - self.bn0 = nn.BatchNorm2d(100) - self.relu0 = nn.ReLU() - self.conv1 = nn.Conv2d(100, 100, kernel_size=(1,3), padding=(0,1), groups=1, bias=True) - self.bn1 = nn.BatchNorm2d(100) - self.relu1 = nn.ReLU() - self.lstmp = nn.LSTM(100, 100, num_layers=1, batch_first=True, bidirectional=False) - # self.lstmp = nn.LSTM(100, 50, num_layers=1, batch_first=True, bidirectional=True) - self.final_conv = nn.Conv2d(100, 10, 1, 1, 0) - def forward(self, input, batch_lengths=None, initial_state=None): - x = self.conv0(input) - x = self.bn0(x) - x = self.relu0(x) - x = self.conv1(x) - x = self.bn1(x) - x = self.relu1(x) - n, c, h, w = x.shape - x = x.reshape(n, -1, 1, w).squeeze(2) - x = x.permute(0, 2, 1) #b t d - # !! !!! !! !!! !! !! !! !!!! !! !! ! !!!! !!! ! 
- # 此处之前和之后的pack_padded_sequence和pad_packed_sequence 需要写全,不能用from torch.nn.utils.rnn import pack_padded_sequence,pad_packed_sequence 使用 - # 直接写全torch.nn.utils.rnn.pack_padded_sequence(_,_,_,_) torch.nn.utils.rnn.pad_packed_sequence(_,_,_,_) - # 不然linger替换不了函数指针 会导致运行出错 - x = nn.utils.rnn.pack_padded_sequence(x, batch_lengths, batch_first=True, enforce_sorted=False) - x, hidden = self.lstmp(x, initial_state) #output b, t, h (10, 10, 100) - x, _ = nn.utils.rnn.pad_packed_sequence(x, batch_first=True) - x = x.permute(2, 0, 1) - d, b, t = x.shape - x = x.reshape((1, d, 1, b*t)) # (1, 100, 1, 100) - x = self.final_conv(x) #(1, 10, 1, 100) (d, b*t) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - numpy.random.seed(1) - aa = torch.randn(10, 1, 1, 10).cuda() - label = torch.randint(10, (10, 10)).cuda() #class=10 - mask = torch.ones(10, 10) - for i in range(9): - index = numpy.random.randint(5, 10) - mask[i, index:] = 0 - label[i, index:] = -1 - - input_lengths = mask.long().sum(1).cpu().numpy() - input_lengths = torch.tensor(input_lengths)#.cuda() - print('input_lengths: ', input_lengths) - # input_lengths = None - # label = label.permute((1, 0)) - # batch_size = 10; hidden_size=50; size=2 - batch_size = 10; hidden_size=100; size=1 - initial_state = (torch.zeros(size, batch_size, hidden_size).cuda(), - torch.zeros(size, batch_size, hidden_size).cuda()) - # initial_state = None - net = Net().cuda() - replace_modules = (nn.Conv2d, nn.LSTM, nn.BatchNorm2d) - # replace_modules = (nn.LSTM,) - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - net = linger.init(net, quant_modules=replace_modules) - criterion = nn.CrossEntropyLoss(ignore_index = -1) - - optimizer = torch.optim.Adam(net.parameters(), lr = 0.001) - loss = None - for i in range(50): - optimizer.zero_grad() - out = net(aa, input_lengths, initial_state) - out = out.squeeze().permute((1,0)) #(b*t, d) - loss = criterion(out, label.reshape(-1)) - if i % 50 == 0: - print('loss: ', 
loss) - acc = getacc(out, label.reshape(-1, 1)) - print('train acc: ', acc) - loss.backward() - optimizer.step() - - net.eval() - out1 = net(aa, input_lengths, initial_state) - out1 = out1.squeeze().permute((1,0)) - acc = getacc(out1, label.reshape(-1, 1)) - print('test acc: ', acc) - assert acc > 0.4 - input_lengths = torch.tensor(input_lengths)#.cuda() - # net = torch.jit.trace(net, (aa, input_lengths, initial_state)) - print(net) - with torch.no_grad(): - net.eval() - torch.onnx.export(net, (aa, input_lengths, initial_state), "data.ignore/single_lstm_4.onnx",export_params=True,opset_version=12, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - out2 = net(aa, input_lengths, initial_state) - out2 = out2.squeeze().permute((1,0)) - assert (out1 == out2).all() \ No newline at end of file diff --git a/test/test_matmul.py b/test/test_matmul.py new file mode 100644 index 0000000..fa618ef --- /dev/null +++ b/test/test_matmul.py @@ -0,0 +1,112 @@ +import torch +import torch.nn as nn +from linger.quant.qtensor import from_tensor_to_qtensor +import numpy as np + +import linger + +class QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + super(QuantizationTestNet, self).__init__() + + # 简单的卷积层用于构建完整模型 + self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1) + self.pool = nn.AdaptiveAvgPool2d((4, 4)) + self.classifier = nn.Linear(16 * 4 * 4, num_classes) + + def quantized_matmul(self, x_float): + """执行量化矩阵乘法并与浮点矩阵乘法比较""" + + float_matmul = torch.matmul(x_float, x_float) + + # 将输入量化 + scale = torch.tensor(128, dtype=torch.int32) + x_quant = from_tensor_to_qtensor(x_float, scale, 8) + + quant_matmul = torch.matmul(x_quant, x_quant) + + # 计算差异 + diff = torch.abs(float_matmul) - torch.abs(quant_matmul) + + #计算平均差异和平均浮点结果 + mean_float = torch.mean(float_matmul).item() + mean_diff = torch.mean(diff).item() + + #相对差异 + relative_diff = mean_diff/mean_float + + return float_matmul, quant_matmul, relative_diff + + def forward(self, x): + # 
在forward中比较两种乘法 + float_result, quant_result, relative_diff = self.quantized_matmul(x) + + # 返回正常输出和量化比较结果 + return { + 'float_addition': float_result, + 'quant_addition': quant_result, + 'relative_difference': relative_diff + } + +def test_quantization_in_forward(model, num_tests=10): + """ + 测试模型forward中的量化矩阵乘法比较 + """ + print("测试模型前向传播中的量化矩阵乘法比较...") + print("=" * 60) + + model.eval() + + model = linger.init(model) + # print(model) + + all_relative_diffs = [] + + for i in range(num_tests): + # 生成随机输入数据 + batch_size = 2 + min_val = 1e-6 # 最小正值 + x = torch.rand(batch_size, 3, 32, 32) * (1.0 - min_val) + min_val + + # 前向传播(包含量化比较) + with torch.no_grad(): + outputs = model(x) + + relative_diff = outputs['relative_difference'] + + all_relative_diffs.append(relative_diff) + + print(f"测试 {i+1}:") + print(f" 相对差异: {relative_diff:.6f}") + print("-" * 40) + + # 统计结果 + final_relative_diff = max(all_relative_diffs) + + print("=" * 60) + print("最终统计结果:") + print(f" 最大相对差异: {final_relative_diff:.6f}") + + return final_relative_diff + +# 运行测试 +if __name__ == "__main__": + # 创建模型 + model = QuantizationTestNet(num_classes=10) + + print("开始量化矩阵乘法精度测试...") + print("=" * 60) + + # 在forward中比较量化加法 + final_relative_diff = test_quantization_in_forward(model, 10) + + print("\n" + "=" * 60) + print("最终评估:") + + threshold = 0.001 # 阈值可以根据需要调整 + + if final_relative_diff < threshold: + print(f"✓ 量化矩阵乘法测试成功!最大差异 {final_relative_diff:.6f} < 阈值 {threshold}") + else: + print(f"✗ 量化矩阵乘法测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") + \ No newline at end of file diff --git a/test/test_maxpool.py b/test/test_maxpool.py new file mode 100644 index 0000000..3f7309b --- /dev/null +++ b/test/test_maxpool.py @@ -0,0 +1,120 @@ +import torch +import torch.nn as nn +import numpy as np + +import linger +from linger.quant.ops.qmodule import * +from linger.config import QUANT_CONFIGS + +q_configs = QUANT_CONFIGS + +class QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + 
super(QuantizationTestNet, self).__init__() + + self.maxpool_float = nn.MaxPool2d(kernel_size=2, stride=2) + + self.maxpool_quant = QMaxpool2d( + kernel_size=2, + stride=2, + weights_cfg=q_configs.quant_info.to_dict(), + activations_cfg=q_configs.quant_info.to_dict(), + bias_cfg=q_configs.quant_info.to_dict(), + constrain = q_configs.clamp_info.to_dict() + ) + + # with torch.no_grad(): + # self.linear_float.weight.copy_(self.linear_quant.qweight) + # if self.linear_quant.qbias is not None: + # self.linear_float.bias.copy_(self.linear_quant.qbias) + + def forward(self, x): + result_float = self.maxpool_float(x) + result_quant = self.maxpool_quant(x) + + return { + 'result_float': result_float, + 'result_quant': result_quant + } + +def test_quantization_in_forward(model): + print("=" * 60) + print("测试模型前向传播中的量化maxpool比较...") + print("=" * 60) + + device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') + print(f"使用设备: {device}") + + batch_size = 2 + x = torch.rand(batch_size, 3, 32, 32).to(device) + + with torch.no_grad(): + model = model.to(device) + outputs = model(x) + + float_result = outputs['result_float'] + quant_result = outputs['result_quant'] + + diff = torch.abs(float_result) - torch.abs(quant_result) + mean_float = torch.mean(float_result).item() + mean_diff = torch.mean(diff).item() + relative_diff = mean_diff/mean_float + + return relative_diff + +def check_gradients(model, input_tensor): + model.train() + model.zero_grad() + output_dict = model(input_tensor) + + for key, output_value in output_dict.items(): + target = torch.randn_like(output_value) + loss = torch.nn.functional.mse_loss(output_value, target) + + loss.backward() + + print("=== 卷积层梯度检查 ===") + print(f"输入形状: {input_tensor.shape}") + print(f"输出形状: {output_value.shape}") + print(f"输入梯度: {input_tensor.grad.norm().item()}") + print(f"损失值: {loss.item():.6f}\n") + + for name, param in model.named_parameters(): + if param.grad is not None: + grad = param.grad + + is_zero = 
torch.allclose(grad, torch.zeros_like(grad), atol=1e-8) + grad_norm = grad.norm().item() + + if 'maxpool_float' in name and 'maxpool_quant' not in name: + layer_type = "maxpool" + elif 'maxpool_quant' in name: + layer_type = "qmaxpool" + else: + continue + + status = "✓ 全0" if is_zero else "✗ 非0" + print(f"{layer_type:8} {name:15} | {status} | 梯度范数: {grad_norm:.2e}") + + return loss.item() + +if __name__ == "__main__": + model = QuantizationTestNet(num_classes=10) + print("开始量化maxpool精度测试...") + print("=" * 60) + + input_tensor = torch.randn(2, 3, 32, 32, requires_grad=True) + loss = check_gradients(model, input_tensor) + + final_relative_diff = test_quantization_in_forward(model) + final_relative_diff = torch.abs(torch.tensor(final_relative_diff)).item() + + print("=" * 60) + print("最终评估:") + + threshold = 0.001 + + if final_relative_diff < threshold: + print(f"✓ 量化maxpool测试成功!最大差异 {final_relative_diff:.6f} < 阈值 {threshold}") + else: + print(f"✗ 量化maxpool测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") \ No newline at end of file diff --git a/test/test_mul.py b/test/test_mul.py new file mode 100644 index 0000000..37a16d7 --- /dev/null +++ b/test/test_mul.py @@ -0,0 +1,108 @@ +import torch +import torch.nn as nn +from linger.quant.qtensor import from_tensor_to_qtensor +import numpy as np + +import linger + +class QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + super(QuantizationTestNet, self).__init__() + + # 简单的卷积层用于构建完整模型 + self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1) + self.pool = nn.AdaptiveAvgPool2d((4, 4)) + self.classifier = nn.Linear(16 * 4 * 4, num_classes) + + def quantized_mul(self, x_float): + """执行量化乘法并与浮点乘法比较""" + + float_mul = x_float * x_float + + # 将输入量化 + scale = torch.tensor(128, dtype=torch.int32) + x_quant = from_tensor_to_qtensor(x_float, scale, 8) + + quant_mul = x_quant * x_quant + + # 计算差异 + diff = torch.abs(float_mul) - torch.abs(quant_mul) + + #计算平均差异和平均浮点结果 + mean_float = 
torch.mean(float_mul).item() + mean_diff = torch.mean(diff).item() + + #相对差异 + relative_diff = mean_diff/mean_float + + return float_mul, quant_mul, relative_diff + + def forward(self, x): + # 在forward中比较两种乘法 + float_result, quant_result, relative_diff = self.quantized_mul(x) + + # 返回正常输出和量化比较结果 + return { + 'float_addition': float_result, + 'quant_addition': quant_result, + 'relative_difference': relative_diff + } + +def test_quantization_in_forward(model, num_tests=10): + print("测试模型前向传播中的量化乘法比较...") + print("=" * 60) + + model.eval() + + model = linger.init(model) + # print(model) + + all_relative_diffs = [] + + for i in range(num_tests): + # 生成随机输入数据 + batch_size = 2 + min_val = 1e-6 # 最小正值 + x = torch.rand(batch_size, 3, 32, 32) * (1.0 - min_val) + min_val + + # 前向传播(包含量化比较) + with torch.no_grad(): + outputs = model(x) + + relative_diff = outputs['relative_difference'] + + all_relative_diffs.append(relative_diff) + + print(f"测试 {i+1}:") + print(f" 相对差异: {relative_diff:.6f}") + print("-" * 40) + + # 统计结果 + final_relative_diff = max(all_relative_diffs) + + print("=" * 60) + print("最终统计结果:") + print(f" 最大相对差异: {final_relative_diff:.6f}") + + return final_relative_diff + +# 运行测试 +if __name__ == "__main__": + # 创建模型 + model = QuantizationTestNet(num_classes=10) + + print("开始量化乘法精度测试...") + print("=" * 60) + + final_relative_diff = test_quantization_in_forward(model, 10) + + print("\n" + "=" * 60) + print("最终评估:") + + threshold = 0.001 # 阈值可以根据需要调整 + + if final_relative_diff < threshold: + print(f"✓ 量化乘法测试成功!最大差异 {final_relative_diff:.6f} < 阈值 {threshold}") + else: + print(f"✗ 量化乘法测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") + \ No newline at end of file diff --git a/test/test_ops.py b/test/test_ops.py deleted file mode 100644 index edf9925..0000000 --- a/test/test_ops.py +++ /dev/null @@ -1,250 +0,0 @@ -import linger -import pytest -import torch -import torch.nn as nn -from linger.ops import * - - -def test_linear(): - torch.manual_seed(1) - 
torch.cuda.manual_seed(1) - torch.cuda.set_device(2) - origin_fc = nn.Linear(50, 100, True).cuda().train() - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - my_fc = LinearInt(50, 100, True, data_bits=8, - parameter_bits=8).cuda().train() - weight = torch.randn(100, 50) - bias = torch.randn(100) - origin_fc.weight.data.copy_(weight) - origin_fc.bias.data.copy_(bias) - my_fc.weight.data.copy_(weight) - my_fc.bias.data.copy_(bias) - input1 = torch.ones(50, 50, requires_grad=True).cuda() - for epoch in range(10): - output1 = origin_fc(input1) - output2 = my_fc(input1) - criterion = nn.MSELoss() - optimizer = torch.optim.SGD(origin_fc.parameters(), lr=0.1) - target = torch.ones(50, 100).cuda() - loss = criterion(output1, target) - optimizer.zero_grad() - loss.backward() - optimizer.step() - - optimizer1 = torch.optim.SGD(my_fc.parameters(), lr=0.1) - loss1 = criterion(output2, target) - optimizer1.zero_grad() - loss1.backward() - optimizer1.step() - print('loss1, loss2: ', loss, loss1) - assert criterion(origin_fc.weight, my_fc.weight) < 0.02 - - -def test_conv(): - torch.manual_seed(1) - origin_conv = nn.Conv2d(10, 10, 1, 1, 0, 1, 1).cuda() - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - my_conv = Conv2dInt(10, 10, 1, 1, 0, 1, 1, data_bits=8, - parameter_bits=8, o_bits=8).cuda() - weight = torch.randn(10, 10, 1, 1) - bias = torch.randn(10) - origin_conv.weight.data.copy_(weight) - origin_conv.bias.data.copy_(bias) - my_conv.weight.data.copy_(weight) - my_conv.bias.data.copy_(bias) - input1 = torch.ones(10, 10, 50, 50, requires_grad=True).cuda() - for epoch in range(10): - output1 = origin_conv(input1) - output2 = my_conv(input1) - criterion = nn.MSELoss() - optimizer = torch.optim.SGD(origin_conv.parameters(), lr=0.01) - target = torch.ones(10, 10, 50, 50, requires_grad=True).cuda() - loss = criterion(output1, target) - optimizer.zero_grad() - loss.backward() - optimizer.step() - - optimizer1 = 
torch.optim.SGD(my_conv.parameters(), lr=0.01) - loss1 = criterion(output2, target) - optimizer1.zero_grad() - loss1.backward() - optimizer1.step() - assert criterion(origin_conv.weight, my_conv.weight) < 0.02 - - -class bn(nn.Module): - def __init__(self): - super(bn, self).__init__() - - self.out = nn.BatchNorm2d(100) - - def forward(self, input): - return self.out(input) - - -class bn_(nn.Module): - def __init__(self): - super(bn_, self).__init__() - - self.out = BatchNormInt(100) - - def forward(self, input): - return self.out(input) - - -def test_convtranspose(): - torch.manual_seed(1) - torch.cuda.manual_seed(1) - torch.cuda.set_device(2) - origin_convtranspose = nn.ConvTranspose2d( - 10, 10, 5, 5, 2, 4, 2, True, 2).cuda().train() - my_convtranspose = ConvTranspose2dInt( - 10, 10, 5, 5, 2, 4, 2, True, 2, data_bits=16, parameter_bits=16, o_bits=8).cuda().train() - weight = torch.randn(10, 5, 5, 5) - bias = torch.randn(10) - origin_convtranspose.weight.data.copy_(weight) - origin_convtranspose.bias.data.copy_(bias) - my_convtranspose.weight.data.copy_(weight) - my_convtranspose.bias.data.copy_(bias) - input1 = torch.randn(10, 10, 50, 50, requires_grad=True).cuda() - for epoch in range(10): - output1 = origin_convtranspose(input1) - output2 = my_convtranspose(input1) - criterion = nn.MSELoss() - optimizer = torch.optim.SGD(origin_convtranspose.parameters(), lr=0.01) - target = torch.ones(10, 10, 254, 254, requires_grad=True).cuda(2) - loss = criterion(output1, target) - optimizer.zero_grad() - loss.backward() - optimizer.step() - - optimizer1 = torch.optim.SGD(my_convtranspose.parameters(), lr=0.01) - loss1 = criterion(output2, target) - optimizer1.zero_grad() - loss1.backward() - optimizer1.step() - print("loss1: {}, loss2: {}".format(loss, loss1)) - assert criterion(origin_convtranspose.weight, - my_convtranspose.weight) < 0.02 - - -def test_batchnorm(): - torch.manual_seed(1) - torch.cuda.manual_seed(1) - torch.cuda.set_device(2) - origin_bn = 
nn.BatchNorm2d(10, eps=1e-5, momentum=0.1).cuda().train() - my_bn = BatchNormInt(10, eps=1e-5, momentum=0.1, o_bits=8, - data_bits=8, parameter_bits=8).cuda().train() - weight = torch.randn(10) - bias = torch.randn(10) - running_mean = torch.randn(10) - running_var = torch.randn(10) - origin_bn.weight.data.copy_(weight) - origin_bn.bias.data.copy_(bias) - origin_bn.running_mean.data.copy_(running_mean) - origin_bn.running_var.data.copy_(running_var) - - my_bn.weight.data.copy_(weight) - my_bn.bias.data.copy_(bias) - my_bn.running_mean.data.copy_(running_mean) - my_bn.running_var.data.copy_(running_var) - # target = torch.ones(10, 10, 50, 50, requires_grad=True).cuda() - grad = torch.randn(10, 10, 100, 100).cuda() - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - input1 = torch.randn(10, 10, 100, 100, requires_grad=True).cuda() - input2 = input1.detach().clone() - - for i in range(100): - output1 = origin_bn(input1) - output2 = my_bn(input2) - output1 = origin_bn.eval()(input1) - output2 = my_bn.eval()(input2) - assert (output1.abs().mean() - output2.abs().mean()).abs() < 1/127 - - -def test_gru(): - torch.manual_seed(1) - torch.cuda.manual_seed(1) - torch.cuda.set_device(2) - origin_gru = nn.GRU(10, 20, batch_first=True, - bidirectional=True).cuda().train() - my_gru = GRUInt(10, 20, batch_first=True, - bidirectional=True, o_bits=8).cuda().train() - weight_ih = torch.randn(60, 10) - weight_hh = torch.randn(60, 20) - bias_ih = torch.randn(60) - bias_hh = torch.randn(60) - origin_gru.weight_ih_l0.data.copy_(weight_ih) - origin_gru.weight_hh_l0.data.copy_(weight_hh) - origin_gru.bias_ih_l0.data.copy_(bias_ih) - origin_gru.bias_hh_l0.data.copy_(bias_hh) - my_gru.weight_ih_l0.data.copy_(weight_ih) - my_gru.weight_hh_l0.data.copy_(weight_hh) - my_gru.bias_ih_l0.data.copy_(bias_ih) - my_gru.bias_hh_l0.data.copy_(bias_hh) - input1 = torch.ones(10, 5, 10, requires_grad=True).cuda() - for epoch in range(10): - output1, hy = origin_gru(input1) - 
output2, hy = my_gru(input1) - criterion = nn.MSELoss() - optimizer = torch.optim.SGD(origin_gru.parameters(), lr=0.1) - target = torch.ones(10, 5, 40).cuda() - loss = criterion(output1, target) - optimizer.zero_grad() - loss.backward() - optimizer.step() - - optimizer1 = torch.optim.SGD(my_gru.parameters(), lr=0.1) - loss1 = criterion(output2, target) - optimizer1.zero_grad() - loss1.backward() - optimizer1.step() - print('loss1, loss2: ', loss, loss1) - assert criterion(origin_gru.weight_ih_l0, my_gru.weight_ih_l0) < 0.02 - assert criterion(origin_gru.weight_hh_l0, my_gru.weight_hh_l0) < 0.02 - assert criterion(origin_gru.bias_ih_l0, my_gru.bias_ih_l0) < 0.02 - assert criterion(origin_gru.bias_hh_l0, my_gru.bias_hh_l0) < 0.02 - - -def test_lstm(): - torch.manual_seed(1) - torch.cuda.manual_seed(1) - torch.cuda.set_device(2) - origin_lstm = nn.LSTM(10, 20, batch_first=True, - bidirectional=True).cuda().train() - my_lstm = LSTMInt(10, 20, batch_first=True, - bidirectional=True, o_bits=8).cuda().train() - weight_ih = torch.randn(80, 10) - weight_hh = torch.randn(80, 20) - bias_ih = torch.randn(80) - bias_hh = torch.randn(80) - origin_lstm.weight_ih_l0.data.copy_(weight_ih) - origin_lstm.weight_hh_l0.data.copy_(weight_hh) - origin_lstm.bias_ih_l0.data.copy_(bias_ih) - origin_lstm.bias_hh_l0.data.copy_(bias_hh) - my_lstm.weight_ih_l0.data.copy_(weight_ih) - my_lstm.weight_hh_l0.data.copy_(weight_hh) - my_lstm.bias_ih_l0.data.copy_(bias_ih) - my_lstm.bias_hh_l0.data.copy_(bias_hh) - input1 = torch.ones(10, 5, 10, requires_grad=True).cuda() - for epoch in range(10): - output1, (hy, cy) = origin_lstm(input1) - output2, (hy, cy) = my_lstm(input1) - criterion = nn.MSELoss() - optimizer = torch.optim.SGD(origin_lstm.parameters(), lr=0.1) - target = torch.ones(10, 5, 40).cuda() - loss = criterion(output1, target) - optimizer.zero_grad() - loss.backward() - optimizer.step() - - optimizer1 = torch.optim.SGD(my_lstm.parameters(), lr=0.1) - loss1 = criterion(output2, target) - 
optimizer1.zero_grad() - loss1.backward() - optimizer1.step() - print('loss1, loss2: ', loss, loss1) - assert criterion(origin_lstm.weight_ih_l0, my_lstm.weight_ih_l0) < 0.02 - assert criterion(origin_lstm.weight_hh_l0, my_lstm.weight_hh_l0) < 0.02 - assert criterion(origin_lstm.bias_ih_l0, my_lstm.bias_ih_l0) < 0.02 - assert criterion(origin_lstm.bias_hh_l0, my_lstm.bias_hh_l0) < 0.02 diff --git a/test/test_quant_histogram.py b/test/test_quant_histogram.py deleted file mode 100644 index e852602..0000000 --- a/test/test_quant_histogram.py +++ /dev/null @@ -1,65 +0,0 @@ -import linger -import numpy as np -import torch -import torch.nn as nn -from linger.utils import PlatFormQuant - - -def test_histogram(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.transpose = nn.ConvTranspose2d(2, 2, 5, 5, 2, 4, 2, True, 2) - self.conv = nn.Conv2d(2, 2, kernel_size=3, stride=1, - padding=1, bias=True) - self.fc = nn.Linear(392, 100, bias=False) - - def forward(self, x): - x = self.transpose(x) - x = self.conv(x) - n, c, h, w = x.shape - x = x.view(n, c*h*w) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - torch.cuda.set_device(0) - net1 = Net().cuda() - net2 = Net().cuda() - aa = torch.randn(1, 2, 2, 2).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.ConvTranspose2d, nn.Conv2d, nn.Linear) - net1.train() - net2.train() - linger.SetPlatFormQuant(platform_quant=PlatFormQuant.luna_quant) - net1 = linger.init(net1, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - - net2 = linger.init(net2, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - for v1, v2 in zip(net1.parameters(), net2.parameters()): - v2.data.copy_(v1.data) - optimizer1 = torch.optim.SGD(net1.parameters(), lr=0.01) - optimizer2 = torch.optim.SGD(net2.parameters(), lr=0.01) - for i in range(150): - optimizer1.zero_grad() - optimizer2.zero_grad() - out1 = 
net1(aa) - out2 = net2(aa) - loss1 = criterion(out1, target) - loss1.backward() - optimizer1.step() - loss2 = criterion(out2, target) - loss2.backward() - optimizer2.step() - if i % 30 == 29: - print('loss1 {}, loss2 {}'.format(loss1, loss2)) - net1.eval() - net2.eval() - out1 = net1(aa) - out2 = net2(aa) - - assert criterion(out1, out2) < 0.02 diff --git a/test/test_quant_tensor_load_state_dict.py b/test/test_quant_tensor_load_state_dict.py deleted file mode 100644 index 0c82119..0000000 --- a/test/test_quant_tensor_load_state_dict.py +++ /dev/null @@ -1,66 +0,0 @@ -import os - -import linger -import numpy as np -import torch -import torch.nn as nn -from linger.utils import PlatFormQuant - -if not os.path.exists('data.ignore'): - os.mkdir('data.ignore') - - -def test_quant_tensor(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.transpose = nn.ConvTranspose2d(2, 2, 5, 5, 2, 4, 2, True, 2) - self.conv = nn.Conv2d(2, 2, kernel_size=3, stride=1, - padding=1, bias=True) - self.fc = nn.Linear(392, 100, bias=False) - - def forward(self, x): - x = self.transpose(x) - x = self.conv(x) - linger.quant_tensor(self, x) - n, c, h, w = x.shape - x = x.view(n, c*h*w) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - torch.cuda.set_device(0) - net1 = Net().cuda() - dummy_input = torch.randn(1, 2, 2, 2).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - - replace_tuple = (nn.Linear, nn.ConvTranspose2d, nn.Conv2d) - net1.train() - linger.SetPlatFormQuant(platform_quant=PlatFormQuant.luna_quant) - net1 = linger.init(net1, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - optimizer1 = torch.optim.SGD(net1.parameters(), lr=0.01) - for i in range(150): - optimizer1.zero_grad() - out1 = net1(dummy_input) - loss1 = criterion(out1, target) - loss1.backward() - optimizer1.step() - if i % 30 == 29: - print('loss1 {}'.format(loss1)) - net1.eval() - 
torch.save(net1.state_dict(), 'data.ignore/model.pt') - out1 = net1(dummy_input) - - net2 = Net().cuda() - net2 = linger.init(net2, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - net2.load_state_dict(torch.load( - 'data.ignore/model.pt', map_location='cpu')) - net2.eval() - out2 = net2(dummy_input) - - assert out1.sum() == out2.sum() diff --git a/test/test_quant_tensor_paramter.py b/test/test_quant_tensor_paramter.py deleted file mode 100644 index b91ca9b..0000000 --- a/test/test_quant_tensor_paramter.py +++ /dev/null @@ -1,124 +0,0 @@ -import os -import torch -import torch.nn as nn -import torch.utils.data as Data -import torchvision -import sys -import os -import torch.nn.functional -import linger -from linger.ops import * -import pytest -import numpy -# Hyper Parameters -def test_quant_tensot_param(): - EPOCH = 5 # train the training data n times, to save time, we just train 1 epoch - BATCH_SIZE = 50 - LR = 0.001 # learning rate - DATA_DIR = '/yrfs2/bitbrain/data/' - DOWNLOAD_MNIST = False - TEST_SAMPLE = 1000 - - torch.backends.cudnn.enabled = False - device_id =0 - - device = torch.device("cuda:"+str(device_id)) - torch.manual_seed(0) - - train_data = torchvision.datasets.MNIST( - root=DATA_DIR, - train=True, - transform=torchvision.transforms.ToTensor(), - download=DOWNLOAD_MNIST, - ) - - # plot one example - #print(train_data.train_data.size()) # (60000, 28, 28) - #print(train_data.train_labels.size()) # (60000) - - - # Data Loader for easy mini-batch return in training, the image batch shape will be (50, 1, 28, 28) - train_loader = Data.DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=False) - - # pick 2000 samples to speed up testing - test_data = torchvision.datasets.MNIST(root=DATA_DIR, train=False) - test_x = torch.unsqueeze(test_data.data, dim=1).type(torch.FloatTensor)[:TEST_SAMPLE]/255. 
# shape from (2000, 28, 28) to (2000, 1, 28, 28), value in range(0,1) - test_y = test_data.targets[:TEST_SAMPLE] - - - class CNN(nn.Module): - def __init__(self): - super(CNN, self).__init__() - self.conv1 = nn.Sequential( - Conv2dInt( - in_channels=1, - out_channels=16, - kernel_size=5, - stride=1, - padding=2, - ), - nn.ReLU(), - nn.MaxPool2d(kernel_size=2), - ) - self.conv2 = nn.Sequential( - Conv2dInt(16, 32, 5, 1, 2), - nn.ReLU(), - nn.MaxPool2d(2), - ) - self.out = LinearInt(32 * 7 * 7, 10) - - - def forward(self, x): - x = self.conv1(x) - x = x*x - - x = linger.quant_tensor(self,x, name='_default_layername') - layer = linger.quant_tensor_getlayer(self) - if self.training: - scale_local = 127/(x.abs().max()) - else: - scale_local = layer.scale_x - - t = scale_local * x - assert torch.sum(torch.frac(t+0.0001)).data < 0.00011*t.numel() - x = self.conv2(x) - x = x.view(x.size(0), -1) - output = self.out(x) - - return output - - def test_net_forward_backward(): - cnn = CNN() - cnn = cnn.cuda(device) - #cnn.load_state_dict(torch.load('init.pt')) - - - optimizer = torch.optim.Adam(cnn.parameters(), lr=LR) - loss_func = nn.CrossEntropyLoss() - count_step = 0 - print("begin to train raw net ...") - - for epoch in range(1): - for step, (b_xx, b_yy) in enumerate(train_loader): - cnn.train() - b_x = b_xx.cuda(device) - b_y = b_yy.cuda(device) - output = cnn(b_x) - loss = loss_func(output, b_y) - optimizer.zero_grad() - loss.backward() - optimizer.step() - # ''' - if (step+1) % 50 == 0: - cnn.eval() - accuracy = 0 - for it in range(0,TEST_SAMPLE,BATCH_SIZE): - test_output = cnn(test_x[it:it+BATCH_SIZE].cuda(device)) - pred_y = torch.max(test_output, 1)[1].data.cpu().numpy() - accuracy += float((pred_y == test_y[it:it+BATCH_SIZE].data.numpy()).astype(int).sum()) / float(BATCH_SIZE) - accuracy /= (TEST_SAMPLE/BATCH_SIZE) - print('Batch: ', step, '| train loss: %.4f' % loss.cpu().data.numpy(), '| test accuracy: %.2f ' % (100*accuracy)) - count_step+=1 - assert 
type(cnn._ifly_bitbrain_round_tensor_iq_tensor_quant__default_layername) == linger.ScaledRoundLayer - - # test_net_forward_backward() diff --git a/test/test_relu.py b/test/test_relu.py new file mode 100644 index 0000000..621bab0 --- /dev/null +++ b/test/test_relu.py @@ -0,0 +1,118 @@ +import torch +import torch.nn as nn +import numpy as np + +import linger +from linger.quant.ops.qmodule import * +from linger.config import QUANT_CONFIGS + +q_configs = QUANT_CONFIGS + +class QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + super(QuantizationTestNet, self).__init__() + + self.relu_float = nn.ReLU() + + self.relu_quant = QRelu( + weights_cfg=q_configs.quant_info.to_dict(), + activations_cfg=q_configs.quant_info.to_dict(), + bias_cfg=q_configs.quant_info.to_dict(), + constrain = q_configs.clamp_info.to_dict() + ) + + # with torch.no_grad(): + # self.linear_float.weight.copy_(self.linear_quant.qweight) + # if self.linear_quant.qbias is not None: + # self.linear_float.bias.copy_(self.linear_quant.qbias) + + def forward(self, x): + result_float = self.relu_float(x) + result_quant = self.relu_quant(x) + + return { + 'result_float': result_float, + 'result_quant': result_quant + } + +def test_quantization_in_forward(model): + print("=" * 60) + print("测试模型前向传播中的量化relu比较...") + print("=" * 60) + + device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') + print(f"使用设备: {device}") + + batch_size = 2 + x = torch.rand(batch_size, 3, 32, 32).to(device) + + with torch.no_grad(): + model = model.to(device) + outputs = model(x) + + float_result = outputs['result_float'] + quant_result = outputs['result_quant'] + + diff = torch.abs(float_result) - torch.abs(quant_result) + mean_float = torch.mean(float_result).item() + mean_diff = torch.mean(diff).item() + relative_diff = mean_diff/mean_float + + return relative_diff + +def check_gradients(model, input_tensor): + model.train() + model.zero_grad() + output_dict = model(input_tensor) + + for key, 
output_value in output_dict.items(): + target = torch.randn_like(output_value) + loss = torch.nn.functional.mse_loss(output_value, target) + + loss.backward() + + print("=== 卷积层梯度检查 ===") + print(f"输入形状: {input_tensor.shape}") + print(f"输出形状: {output_value.shape}") + print(f"输入梯度: {input_tensor.grad.norm().item()}") + print(f"损失值: {loss.item():.6f}\n") + + for name, param in model.named_parameters(): + if param.grad is not None: + grad = param.grad + + is_zero = torch.allclose(grad, torch.zeros_like(grad), atol=1e-8) + grad_norm = grad.norm().item() + + if 'relu_float' in name and 'relu_quant' not in name: + layer_type = "relu" + elif 'relu_quant' in name: + layer_type = "qrelu" + else: + continue + + status = "✓ 全0" if is_zero else "✗ 非0" + print(f"{layer_type:8} {name:15} | {status} | 梯度范数: {grad_norm:.2e}") + + return loss.item() + +if __name__ == "__main__": + model = QuantizationTestNet(num_classes=10) + print("开始量化relu精度测试...") + print("=" * 60) + + input_tensor = torch.randn(2, 3, 32, 32, requires_grad=True) + loss = check_gradients(model, input_tensor) + + final_relative_diff = test_quantization_in_forward(model) + final_relative_diff = torch.abs(torch.tensor(final_relative_diff)).item() + + print("=" * 60) + print("最终评估:") + + threshold = 0.001 + + if final_relative_diff < threshold: + print(f"✓ 量化relu测试成功!最大差异 {final_relative_diff:.6f} < 阈值 {threshold}") + else: + print(f"✗ 量化relu测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") \ No newline at end of file diff --git a/test/test_relu6_int.py b/test/test_relu6_int.py deleted file mode 100644 index 0bb60ca..0000000 --- a/test/test_relu6_int.py +++ /dev/null @@ -1,67 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch -import torch.nn as nn -import linger -import numpy as np - -def test_s_relu6_int(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn = 
nn.BatchNorm2d(10) - self.relu = nn.ReLU6() - self.conv1 = nn.Conv2d(10, 10, kernel_size=1, stride=1, - padding=0, bias=True) - self.fc = nn.Linear(1000, 100) - - def forward(self, x): - x = self.conv(x) - x = self.bn(x) - # x = x+8 - x = self.relu(x) - x = self.conv1(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - # random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(1, 10, 10, 10).cuda() - # import pickle as pk - # with open('input.pb', 'wb') as file: - # pk.dump(aa, file) - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - # replace_tuple=(nn.Conv2d, nn.Linear, nn.AvgPool2d, nn.ReLU6) - replace_tuple=(nn.ReLU6) - net.train() - # linger.disable_quant(net.fc) - net = linger.init(net, quant_modules=replace_tuple, mode=linger.QuantMode.QValue) - optimizer = torch.optim.SGD(net.parameters(), lr = 0.01) - loss = None - - for i in range(200): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - # assert loss < 1e-12, 'training loss error' - net.eval() - torch.save(net.state_dict(), 'data.ignore/relu6_int.pt') - net.load_state_dict(torch.load('data.ignore/relu6_int.pt')) - out1 = net(aa) - # print(out1) - with torch.no_grad(): - torch.onnx.export(net, aa,"data.ignore/test_relu6_int.onnx",export_params=True,opset_version=12,operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - assert abs(out1.mean() - 1) < 0.01 diff --git a/test/test_replace_float_module.py b/test/test_replace_float_module.py deleted file mode 100644 index 2b7baea..0000000 --- a/test/test_replace_float_module.py +++ /dev/null @@ -1,68 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch -import torch.nn as nn -import linger -import numpy as np - -def test_conv_linear(): - class Net(nn.Module): - def __init__(self): - 
super(Net, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn = nn.BatchNorm2d(10) - self.relu = nn.ReLU() - self.conv1 = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.conv2 = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn1 = nn.BatchNorm2d(10) - self.relu1 = nn.ReLU() - self.fc = nn.Linear(1000, 100) - - def forward(self, x): - x = self.conv(x) - x = self.bn(x) - x = self.relu(x) - x = x - 1 - x = self.conv1(x) - x = self.conv2(x) - x = self.bn1(x) - x = self.relu1(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - # random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple=(nn.Conv2d, nn.Linear) - net.train() - net = linger.normalize_layers(net, normalize_modules=replace_tuple, normalize_weight_value=4, normalize_bias_value=4, normalize_output_value=4) - net.cuda() - optimizer = torch.optim.SGD(net.parameters(), lr = 0.01) - loss = None - for i in range(200): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - # assert loss < 1e-12, 'training loss error' - net.eval() - torch.save(net.state_dict(), 'data.ignore/conv_linear.pt') - out1 = net(aa) - #print(out1) - with torch.no_grad(): - torch.onnx.export(net, aa,"data.ignore/conv_linear.onnx",export_params=True,opset_version=11,operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - assert abs(out1.mean() - 1) < 0.01 \ No newline at end of file diff --git a/test/test_replace_param.py b/test/test_replace_param.py deleted file mode 100644 index 9d2f94f..0000000 --- a/test/test_replace_param.py +++ /dev/null @@ -1,54 +0,0 @@ -#!/usr/bin/env python -# -*- 
encoding: utf-8 -*- -import torch -import torch.nn as nn -import linger -import numpy as np - -def test_replace_param(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.fc = nn.Linear(1000, 100) - - def forward(self, x): - x = self.conv(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - net.train() - print(net) - for k, v in net.named_parameters(): - print(k, v.data_ptr(), v.abs().sum()) - criterion = nn.MSELoss() - net = linger.normalize_layers(net) - optimizer = torch.optim.SGD(net.parameters(), lr = 0.01) - - print(net) - loss = None - for i in range(200): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - # assert loss < 1e-12, 'training loss error' - net.eval() - torch.save(net.state_dict(), 'data.ignore/conv_linear.pt') - out1 = net(aa) - for k, v in net.named_parameters(): - print(k, v.data_ptr(), v.abs().sum()) - \ No newline at end of file diff --git a/test/test_s_conv_linear_iqcat.py b/test/test_s_conv_linear_iqcat.py deleted file mode 100644 index 496b65c..0000000 --- a/test/test_s_conv_linear_iqcat.py +++ /dev/null @@ -1,61 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch -import torch.nn as nn -import linger -import numpy as np - -def test_s_conv_linear_normalize(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 5, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn = nn.BatchNorm2d(5) - self.conv1 = nn.Conv2d(10, 5, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn1 = nn.BatchNorm2d(5) - self.fc = nn.Linear(1000, 
100) - - def forward(self, x): - x1 = self.conv(x) - x1 = self.bn(x1) - x2 = self.conv1(x) - x2 = self.bn1(x2) - x = torch.cat([x1, x2], dim=1) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - # random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple=(nn.Conv2d, nn.Linear, nn.AvgPool2d) - net.train() - # linger.disable_quant(net.fc) - net = linger.init(net, quant_modules=replace_tuple, mode=linger.QuantMode.QValue) - optimizer = torch.optim.SGD(net.parameters(), lr = 0.01) - loss = None - for i in range(200): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - # assert loss < 1e-12, 'training loss error' - net.eval() - torch.save(net.state_dict(), 'data.ignore/conv_linear.pt') - out1 = net(aa) - #print(out1) - with torch.no_grad(): - linger.onnx.export(net, aa,"data.ignore/conv_linear_iqcat.onnx",export_params=True,opset_version=11,operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - assert abs(out1.mean() - 1) < 0.1 diff --git a/test/test_s_conv_linear_normalize.py b/test/test_s_conv_linear_normalize.py deleted file mode 100644 index c8d7c66..0000000 --- a/test/test_s_conv_linear_normalize.py +++ /dev/null @@ -1,58 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -import torch -import torch.nn as nn -import linger -import numpy as np - -def test_s_conv_linear_normalize(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn = nn.BatchNorm2d(10) - self.relu = nn.ReLU() - self.fc = nn.Linear(1000, 100) - - def forward(self, x): - x = self.conv(x) - x = self.bn(x) - x = self.relu(x) - x = 
torch.clamp(x, 0, 127) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - # random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple=(nn.Conv2d, nn.Linear, nn.AvgPool2d) - net.train() - # linger.disable_quant(net.fc) - net = linger.init(net, quant_modules=replace_tuple, mode=linger.QuantMode.QValue) - optimizer = torch.optim.SGD(net.parameters(), lr = 0.01) - loss = None - for i in range(200): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - # assert loss < 1e-12, 'training loss error' - net.eval() - torch.save(net.state_dict(), 'data.ignore/conv_linear.pt') - out1 = net(aa) - #print(out1) - with torch.no_grad(): - torch.onnx.export(net, aa,"data.ignore/conv_linear_normalize.onnx",export_params=True,opset_version=11,operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - assert abs(out1.mean() - 1) < 0.1 \ No newline at end of file diff --git a/test/test_s_normalize_init.py b/test/test_s_normalize_init.py deleted file mode 100644 index bde53b0..0000000 --- a/test/test_s_normalize_init.py +++ /dev/null @@ -1,67 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -from linger.utils import PlatFormQuant -import torch -import torch.nn as nn -import linger -import numpy as np - -def test_convbn_normalize(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn = nn.BatchNorm2d(10) - self.relu = nn.ReLU() - self.fc = nn.Linear(1000, 100) - - def forward(self, x): - x = self.conv(x) - x = self.bn(x) - x = self.relu(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - 
torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr = 0.01) - loss = None - for i in range(200): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - # assert loss < 1e-12, 'training loss error' - net.eval() - torch.save(net.state_dict(), 'data.ignore/convbn_clamp.pt') - out1 = net(aa) - net.train() - # unable the last fc clamp - linger.disable_normalize(net.fc) - linger.trace_layers(net, net, aa) - normalize_modules = (nn.Conv2d, nn.Linear, nn.BatchNorm2d) - net = linger.normalize_layers(net, normalize_modules=normalize_modules, normalize_weight_value=8, normalize_bias_value=8, normalize_output_value=8) - net.load_state_dict(torch.load('data.ignore/convbn_clamp.pt')) - net.cuda() - out2 = net(aa) - torch.save(net.state_dict(), 'data.ignore/convbn_quant.pt') - linger.disable_quant(net.fc) - linger.SetPlatFormQuant(platform_quant=PlatFormQuant.luna_quant) - net = linger.init(net) - net.load_state_dict(torch.load('data.ignore/convbn_quant.pt')) - net.cuda() - out3 = net(aa) - assert abs(out3.mean() - 1) < 0.02 diff --git a/test/test_scope.py b/test/test_scope.py deleted file mode 100644 index 59c5be2..0000000 --- a/test/test_scope.py +++ /dev/null @@ -1,88 +0,0 @@ - -import linger -import linger.onnx -import torch -import torch.nn as nn -import numpy -import onnx - - - -# class NetRaw_2(nn.Module): -# def __init__(self): -# super(NetRaw_2, self).__init__() -# self.conv0 = nn.Conv2d(1, 100, kernel_size=(3,1), padding=(0,0), groups=1, bias=True) -# self.bn0 = nn.BatchNorm2d(100) -# self.relu0 = nn.ReLU() -# self.conv1 = nn.Conv2d(100, 100, kernel_size=(1,3), padding=(0,1), groups=1, bias=True) -# self.bn1 = nn.BatchNorm2d(100) -# 
self.relu1 = nn.ReLU() -# self.final_conv = nn.Conv2d(100, 10, 1, 1, 0) -# def forward(self, input): -# x = self.conv0(input) -# x = self.bn0(x) -# x = self.relu0(x) -# x = self.conv1(x) -# x = self.bn1(x) -# x = self.relu1(x) -# x = self.final_conv(x) #(1, 10, 1, 100) (d, b*t) -# return x - - - -# class NetComplicate_2(nn.Module): -# def __init__(self): -# super(NetComplicate_2, self).__init__() -# self.conv0 = nn.Conv2d(1, 100, kernel_size=(3,1), padding=(0,0), groups=1, bias=True) -# self.bn0 = nn.BatchNorm2d(100) -# self.relu0 = nn.ReLU() -# self.conv1 = nn.Conv2d(100, 100, kernel_size=(1,3), padding=(0,1), groups=1, bias=True) -# self.bn1 = nn.BatchNorm2d(100) -# self.relu1 = nn.ReLU() -# self.final_conv = nn.Conv2d(100, 10, 1, 1, 0) -# self.net_2 = NetRaw_2() -# def forward(self, input): -# x = self.conv0(input) -# y = self.net_2(input) -# x = self.bn0(x) -# x = self.relu0(x) -# x = self.conv1(x) -# x = self.bn1(x) -# x = self.relu1(x) -# x = self.final_conv(x) #(1, 10, 1, 100) (d, b*t) -# return x+y -# def test_scope_info_simple(): -# torch.manual_seed(2) -# torch.cuda.manual_seed_all(2) -# aa = torch.randn(10, 1, 10, 10).cuda() -# netraw_2 = NetRaw_2().cuda() -# with torch.no_grad(): -# netraw_2.eval() -# linger.onnx.export(netraw_2, (aa), "data.ignore/test_scoped_1.onnx",export_params=True,opset_version=12, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) -# onnx_model = onnx.load("data.ignore/test_scoped_1.onnx") -# for node in onnx_model.graph.node: -# node_name =node.name -# # print(node_name) -# assert node_name.startswith('.') -# assert len(node_name.split('.')) >1 -# assert len(node_name.split('/')) ==2 - -# def test_scope_info_complicate(): -# torch.manual_seed(2) -# torch.cuda.manual_seed_all(2) -# aa = torch.randn(10, 1, 10, 10).cuda() -# netcomplicate_2 = NetComplicate_2().cuda() -# with torch.no_grad(): -# netcomplicate_2.eval() -# linger.onnx.export(netcomplicate_2, (aa), 
"data.ignore/test_scoped_2.onnx",export_params=True,opset_version=12, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) -# onnx_model = onnx.load("data.ignore/test_scoped_2.onnx") -# for node in onnx_model.graph.node: -# node_name =node.name -# #print(node_name) -# assert node_name.startswith('.') -# assert len(node_name.split('.')) >1 -# if node.op_type !="Add": -# assert len(node_name.split('/')) ==2 -# assert node.op_type != 'ScopedEnter' -# assert node.op_type != 'ScopedLeave' - diff --git a/test/test_shuffle_channel.py b/test/test_shuffle_channel.py deleted file mode 100644 index 99f3dd7..0000000 --- a/test/test_shuffle_channel.py +++ /dev/null @@ -1,66 +0,0 @@ -import linger -import numpy as np -import torch -import torch.nn as nn -import torch.nn.functional as F - - -def test_normalize_shuffle_channel(): - class Model(nn.Module): - def __init__(self): - super(Model, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, - padding=1, bias=True) - # self.linger_shuffle_channel = linger.NormalizeShuffleChannel(8) - self.fc = nn.Linear(1000, 100) - - def forward(self, x): - x = self.conv(x) - # x = self.linger_shuffle_channel(x, 1) - x = linger.channel_shuffle(x, 2) - x = x.view(1, -1) - x = torch.relu(x) - x = self.fc(x) - return x - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - dummy_input = torch.randn(1, 10, 10, 10).cuda() - target = torch.ones(100).cuda() - criterion = nn.MSELoss() - - replace_tuple = (nn.Conv2d, nn.Linear) - model = Model().cuda() - - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - linger.SetFunctionChannelShuffleQuant(True) - model = linger.init(model, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - - optimizer = torch.optim.SGD(model.parameters(), lr=0.01) - loss = None - for i in range(20): - optimizer.zero_grad() - out = model(dummy_input) - loss = criterion(out, target) - if i % 1 == 0: - print('loss: ', loss) - 
loss.backward() - optimizer.step() - - - with linger.Dumper() as dumper : - model.eval() - dumper.enable_dump_quanted(model, path="./data.ignore/dump_shuffle_net_10_09_01") - out = model(dummy_input) - dummy_input1 = dummy_input.detach().cpu().numpy() - dummy_input1.tofile("./data.ignore/dump_shuffle_net_10_09_01/input.bin") - out1 = out.detach().cpu().numpy() - out1.tofile("./data.ignore/dump_shuffle_net_10_09_01/output.bin") - - - with torch.no_grad(): - torch.onnx.export(model, dummy_input, "./data.ignore/shuffle_net_10_09_01.onnx", export_params=True, opset_version=11, - operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) diff --git a/test/test_sigmoid.py b/test/test_sigmoid.py new file mode 100644 index 0000000..28ad527 --- /dev/null +++ b/test/test_sigmoid.py @@ -0,0 +1,108 @@ +import torch +import torch.nn as nn +from linger.quant.qtensor import from_tensor_to_qtensor +import numpy as np + +import linger + +class QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + super(QuantizationTestNet, self).__init__() + + # 简单的卷积层用于构建完整模型 + self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1) + self.pool = nn.AdaptiveAvgPool2d((4, 4)) + self.classifier = nn.Linear(16 * 4 * 4, num_classes) + + def quantized_sigmoid(self, x_float): + """执行量化sigmoid并与浮点sigmoid比较""" + + float_sigmoid = torch.sigmoid(x_float) + + # 将输入量化 + scale = torch.tensor(128, dtype=torch.int32) + x_quant = from_tensor_to_qtensor(x_float, scale, 8) + + quant_sigmoid = torch.sigmoid(x_quant) + + # 计算差异 + diff = torch.abs(float_sigmoid) - torch.abs(quant_sigmoid) + + #计算平均差异和平均浮点结果 + mean_float = torch.mean(float_sigmoid).item() + mean_diff = torch.mean(diff).item() + + #相对差异 + relative_diff = mean_diff/mean_float + + return float_sigmoid, quant_sigmoid, relative_diff + + def forward(self, x): + # 在forward中比较两种sigmoid + float_result, quant_result, relative_diff = self.quantized_sigmoid(x) + + # 返回正常输出和量化比较结果 + return { + 'float_addition': float_result, + 
'quant_addition': quant_result, + 'relative_difference': relative_diff + } + +def test_quantization_in_forward(model, num_tests=10): + print("测试模型前向传播中的量化sigmoid比较...") + print("=" * 60) + + model.eval() + + model = linger.init(model) + # print(model) + + all_relative_diffs = [] + + for i in range(num_tests): + # 生成随机输入数据 + batch_size = 2 + min_val = 1e-6 # 最小正值 + x = torch.rand(batch_size, 3, 32, 32) * (1.0 - min_val) + min_val + + # 前向传播(包含量化比较) + with torch.no_grad(): + outputs = model(x) + + relative_diff = outputs['relative_difference'] + + all_relative_diffs.append(relative_diff) + + print(f"测试 {i+1}:") + print(f" 相对差异: {relative_diff:.6f}") + print("-" * 40) + + # 统计结果 + final_relative_diff = max(all_relative_diffs) + + print("=" * 60) + print("最终统计结果:") + print(f" 最大相对差异: {final_relative_diff:.6f}") + + return final_relative_diff + +# 运行测试 +if __name__ == "__main__": + # 创建模型 + model = QuantizationTestNet(num_classes=10) + + print("开始量化sigmoid精度测试...") + print("=" * 60) + + final_relative_diff = test_quantization_in_forward(model, 10) + + print("\n" + "=" * 60) + print("最终评估:") + + threshold = 0.001 # 阈值可以根据需要调整 + + if final_relative_diff < threshold: + print(f"✓ 量化sigmoid测试成功!最大差异 {final_relative_diff:.6f} < 阈值 {threshold}") + else: + print(f"✗ 量化sigmoid测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") + \ No newline at end of file diff --git a/test/test_simple_test.py b/test/test_simple_test.py deleted file mode 100644 index 2d5f881..0000000 --- a/test/test_simple_test.py +++ /dev/null @@ -1,61 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- -from os import stat -import linger -import torch -import torch.nn as nn - - -def test_trace_iqtensor(): - - class Add(nn.Module): - def __init__(self) -> None: - super().__init__() - - def forward(self, x, y): - return x + y - - class iqtensorAdd(nn.Module): - def __init__(self) -> None: - super().__init__() - self.add = Add() - - def forward(self, x, y): - x = linger.quant_tensor(self, x, name='x') - 
y = linger.quant_tensor(self, y, name='y') - z = self.add(x, y) - return z - net = iqtensorAdd() - dummy_input = torch.randn(10, 10) - bb = torch.randn(10, 10) - linger.trace_layers(net, net.add, (dummy_input, bb)) - cc = net(dummy_input, bb) - with torch.no_grad(): - torch.onnx.export(net, (dummy_input, bb), "data.ignore/add.onnx", export_params=True, opset_version=12, - verbose=True, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - - -def test_iqview_onnx(): - class IQView(nn.Module): - def __init__(self) -> None: - super().__init__() - self.conv = nn.Conv2d(2, 2, 2, 1, 1) - self.fc = nn.Linear(18, 100) - - def forward(self, x): - x = self.conv(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - dummy_input = torch.randn(1, 2, 2, 2) - net = IQView() - out = net(dummy_input) - - net = linger.init(net) - - out = net(dummy_input) - - with torch.no_grad(): - torch.onnx.export(net, (dummy_input,), "data.ignore/iqview.onnx", export_params=True, opset_version=12, - verbose=True, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) diff --git a/test/test_softmax.py b/test/test_softmax.py new file mode 100644 index 0000000..6ba9a31 --- /dev/null +++ b/test/test_softmax.py @@ -0,0 +1,108 @@ +import torch +import torch.nn as nn +from linger.quant.qtensor import from_tensor_to_qtensor +import numpy as np + +import linger + +class QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + super(QuantizationTestNet, self).__init__() + + # 简单的卷积层用于构建完整模型 + self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1) + self.pool = nn.AdaptiveAvgPool2d((4, 4)) + self.classifier = nn.Linear(16 * 4 * 4, num_classes) + + def quantized_softmax(self, x_float, dim): + """执行量化softmax并与浮点softmax比较""" + float_softmax = torch.softmax(x_float, dim = dim) + + # 将输入量化 + scale = torch.tensor(128, dtype=torch.int32) + x_quant = from_tensor_to_qtensor(x_float, scale, 8) + + quant_softmax = torch.softmax(x_quant, dim 
= dim) + + # 计算差异 + diff = torch.abs(float_softmax) - torch.abs(quant_softmax) + + #计算平均差异和平均浮点结果 + mean_float = torch.mean(float_softmax).item() + mean_diff = torch.mean(diff).item() + + #相对差异 + relative_diff = mean_diff/mean_float + + return float_softmax, quant_softmax, relative_diff + + def forward(self, x, dim): + # 在forward中比较两种softmax + float_result, quant_result, relative_diff = self.quantized_softmax(x, dim) + + # 返回正常输出和量化比较结果 + return { + 'float_addition': float_result, + 'quant_addition': quant_result, + 'relative_difference': relative_diff + } + +def test_quantization_in_forward(model, num_tests=10): + print("测试模型前向传播中的量化softmax比较...") + print("=" * 60) + + model.eval() + + model = linger.init(model) + # print(model) + + all_relative_diffs = [] + + for i in range(num_tests): + # 生成随机输入数据 + batch_size = 2 + min_val = 1e-6 # 最小正值 + x = torch.rand(batch_size, 3, 32, 32) * (1.0 - min_val) + min_val + + # 前向传播(包含量化比较) + with torch.no_grad(): + outputs = model(x, dim = 1) + + relative_diff = outputs['relative_difference'] + + all_relative_diffs.append(relative_diff) + + print(f"测试 {i+1}:") + print(f" 相对差异: {relative_diff:.6f}") + print("-" * 40) + + # 统计结果 + final_relative_diff = max(all_relative_diffs) + + print("=" * 60) + print("最终统计结果:") + print(f" 最大相对差异: {final_relative_diff:.6f}") + + return final_relative_diff + +# 运行测试 +if __name__ == "__main__": + # 创建模型 + model = QuantizationTestNet(num_classes=10) + + print("开始量化softmax精度测试...") + print("=" * 60) + + # 在forward中比较量化softmax + final_relative_diff = test_quantization_in_forward(model, 10) + + print("\n" + "=" * 60) + print("最终评估:") + + threshold = 0.001 # 阈值可以根据需要调整 + + if final_relative_diff < threshold: + print(f"✓ 量化softmax测试成功!最大差异 {final_relative_diff:.6f} < 阈值 {threshold}") + else: + print(f"✗ 量化softmax测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") + \ No newline at end of file diff --git a/test/test_softmax_int.py b/test/test_softmax_int.py index ac6be78..124e548 100644 --- 
a/test/test_softmax_int.py +++ b/test/test_softmax_int.py @@ -1,73 +1,84 @@ -import os - -import linger -import numpy as np import torch import torch.nn as nn +import torch.nn.functional as F +from torch.fx import symbolic_trace, GraphModule +import operator + +# ====================== +# 定义 QAdd 模块 +# ====================== +class QAdd(nn.Module): + def __init__(self): + super(QAdd, self).__init__() + def forward(self, x, y): + # 这里可以放量化逻辑 + return x * y + + +# ====================== +# 原始模型:含 add 操作 +# ====================== +class MyModel(nn.Module): + def __init__(self): + super(MyModel, self).__init__() + self.fc1 = nn.Linear(10, 10) + self.fc2 = nn.Linear(10, 10) + self.act = nn.ReLU() + + def forward(self, x): + out1 = self.fc1(x) + out2 = self.fc2(x) + out = out1 + out2 # ⚠️ 这里是 add + + b1 = out.unsqueeze(1) # (B, 1, 10) + b2 = out.unsqueeze(2) # (B, 10, 1) + out = torch.bmm(b1, b2) + + out = self.act(out) + return out + -if not os.path.exists('data.ignore'): - os.mkdir('data.ignore') +# ====================== +# FX Graph Transform +# ====================== +def replace_add_with_qadd(gm: GraphModule) -> GraphModule: + graph = gm.graph + for node in list(graph.nodes): + # 查找 add 节点(可能是 operator.add 或 Tensor.__add__) + if node.op == "call_function" and node.target in (operator.add, torch.add, torch.ops.aten.add.Tensor): + with graph.inserting_after(node): + # 插入 QAdd 模块 + qadd_mod = QAdd() + qadd_name = f"qadd_{node.name}" + gm.add_module(qadd_name, qadd_mod) + new_node = graph.call_module(qadd_name, args=node.args) + node.replace_all_uses_with(new_node) + graph.erase_node(node) + graph.lint() + gm.recompile() + return gm -def test_softmaxint(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.fc1 = nn.Linear(64, 256) - self.fc2 = nn.Linear(64, 256) - self.fc3 = nn.Linear(64, 256) - self.fc4 = nn.Linear(480, 1000) - def forward(self, input): - x = self.fc1(input) - y = self.fc2(input) - z = self.fc3(input) +# 
====================== +# 测试 +# ====================== +if __name__ == "__main__": + model = MyModel() + traced = symbolic_trace(model) + print("原始Graph:") + print(traced.graph) - x = x.view(8, 15, 32) - y = y.view(8, 32, 15) - z = z.view(8, 15, 32) - x = torch.bmm(x, y) - x = torch.softmax(x, dim=-1) - x = torch.bmm(x, z) - x = x.view(8, 480) - x = self.fc4(x) - x = torch.log_softmax(x, dim=-1) + x = torch.randn(4, 10) + y = model(x) + print("原始模型:", x, y) - return x + # 替换 add -> QAdd + new_gm = replace_add_with_qadd(traced) + print("\n替换后的Graph:") + print(new_gm.graph) - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - #gpu code current has bugs to fix - torch.cuda.set_device(0) - net = Net().cuda() - dummy_input = torch.randn(15, 64, requires_grad=True).cuda() - target = torch.ones(8, dtype=torch.int64).cuda() - # net = Net() - # dummy_input = torch.randn(15, 64, requires_grad=True) - # target = torch.ones(1, dtype=torch.int64) - criterion = nn.CrossEntropyLoss() - replace_tuple = (nn.Linear) - net.train() - linger.SetFunctionBmmQuant(True) - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - net = linger.init(net, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - optimizer = torch.optim.Adam(net.parameters(), lr=0.001) - loss = None - for i in range(10): - optimizer.zero_grad() - out = net(dummy_input) - loss = criterion(out, target) - if i % 1 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - net.eval() - torch.save(net.state_dict(), 'data.ignore/softmax.pt') - out1 = net(dummy_input) - with torch.no_grad(): - torch.onnx.export(net, dummy_input, "data.ignore/softmax_net.onnx", export_params=True, - opset_version=11, operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - out2 = net(dummy_input) - assert out1.abs().sum() == out2.abs().sum() + # 验证运行 + # x = torch.randn(4, 10) + y = new_gm(x) + print("\n模型输出:", x, y) diff --git a/test/test_state_dict.py 
b/test/test_state_dict.py deleted file mode 100644 index 11fb4e3..0000000 --- a/test/test_state_dict.py +++ /dev/null @@ -1,71 +0,0 @@ -import os - -import linger -import numpy as np -import torch -import torch.nn as nn -from linger.utils import PlatFormQuant - -if not os.path.exists('data.ignore'): - os.mkdir('data.ignore') - -def test_state_dict(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.transpose = nn.ConvTranspose2d(2, 2, 5, 5, 2, 4, 2, True, 2) - self.conv = nn.Conv2d(2, 2, kernel_size=3, stride=1, - padding=1, bias=True) - self.bn = nn.BatchNorm2d(2) - self.relu = nn.ReLU() - self.fc = nn.Linear(392, 100) - - def forward(self, x): - x = self.transpose(x) - x = self.conv(x) - x = self.bn(x) - x = self.relu(x) - n, c, h, w = x.shape - x = x.view(n, c*h*w) - x = self.fc(x) - return x - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(1, 2, 2, 2).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.ConvTranspose2d, nn.Linear) - net.train() - linger.SetPlatFormQuant(platform_quant=PlatFormQuant.luna_quant) - net = linger.init(net, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - loss = None - for i in range(150): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 30 == 29: - print('loss: ', loss) - loss.backward() - optimizer.step() - assert loss < 1e-2, 'training loss error' - net.eval() - torch.save(net.state_dict(), 'data.ignore/aa.pt') - out1 = net(aa) - net1 = Net().cuda() - net1.train() - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - - net1 = linger.init(net1, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - net1.eval() - net1.load_state_dict(torch.load('data.ignore/aa.pt', map_location='cpu')) - out2 = net1(aa) - - assert 
out1.sum() == out2.sum() diff --git a/test/test_tanh.py b/test/test_tanh.py new file mode 100644 index 0000000..df29855 --- /dev/null +++ b/test/test_tanh.py @@ -0,0 +1,108 @@ +import torch +import torch.nn as nn +from linger.quant.qtensor import from_tensor_to_qtensor +import numpy as np + +import linger + +class QuantizationTestNet(nn.Module): + def __init__(self, num_classes=10): + super(QuantizationTestNet, self).__init__() + + # 简单的卷积层用于构建完整模型 + self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1) + self.pool = nn.AdaptiveAvgPool2d((4, 4)) + self.classifier = nn.Linear(16 * 4 * 4, num_classes) + + def quantized_tanh(self, x_float): + """执行量化tanh并与浮点tanh比较""" + + float_tanh = torch.tanh(x_float) + + # 将输入量化 + scale = torch.tensor(128, dtype=torch.int32) + x_quant = from_tensor_to_qtensor(x_float, scale, 8) + + quant_tanh = torch.tanh(x_quant) + + # 计算差异 + diff = torch.abs(float_tanh) - torch.abs(quant_tanh) + + #计算平均差异和平均浮点结果 + mean_float = torch.mean(float_tanh).item() + mean_diff = torch.mean(diff).item() + + #相对差异 + relative_diff = mean_diff/mean_float + + return float_tanh, quant_tanh, relative_diff + + def forward(self, x): + # 在forward中比较两种tanh + float_result, quant_result, relative_diff = self.quantized_tanh(x) + + # 返回正常输出和量化比较结果 + return { + 'float_addition': float_result, + 'quant_addition': quant_result, + 'relative_difference': relative_diff + } + +def test_quantization_in_forward(model, num_tests=10): + print("测试模型前向传播中的量化tanh比较...") + print("=" * 60) + + model.eval() + + model = linger.init(model) + # print(model) + + all_relative_diffs = [] + + for i in range(num_tests): + # 生成随机输入数据 + batch_size = 2 + min_val = 1e-6 # 最小正值 + x = torch.rand(batch_size, 3, 32, 32) * (1.0 - min_val) + min_val + + # 前向传播(包含量化比较) + with torch.no_grad(): + outputs = model(x) + + relative_diff = outputs['relative_difference'] + + all_relative_diffs.append(relative_diff) + + print(f"测试 {i+1}:") + print(f" 相对差异: {relative_diff:.6f}") + print("-" * 40) + + # 统计结果 + 
final_relative_diff = max(all_relative_diffs) + + print("=" * 60) + print("最终统计结果:") + print(f" 最大相对差异: {final_relative_diff:.6f}") + + return final_relative_diff + +# 运行测试 +if __name__ == "__main__": + # 创建模型 + model = QuantizationTestNet(num_classes=10) + + print("开始量化tanh精度测试...") + print("=" * 60) + + final_relative_diff = test_quantization_in_forward(model, 10) + + print("\n" + "=" * 60) + print("最终评估:") + + threshold = 0.001 # 阈值可以根据需要调整 + + if final_relative_diff < threshold: + print(f"✓ 量化tanh测试成功!最大差异 {final_relative_diff:.6f} < 阈值 {threshold}") + else: + print(f"✗ 量化tanh测试失败!最大差异 {final_relative_diff:.6f} >= 阈值 {threshold}") + \ No newline at end of file diff --git a/test/test_tool.py b/test/test_tool.py deleted file mode 100644 index ada259c..0000000 --- a/test/test_tool.py +++ /dev/null @@ -1,56 +0,0 @@ -import os - -import linger -import numpy as np -import torch -import torch.nn as nn - -if not os.path.exists('data.ignore'): - os.mkdir('data.ignore') - -def test_wb_analyse(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(2, 2, kernel_size=3, stride=1, - padding=1, bias=True) - self.conv1 = nn.Conv2d(2, 2, kernel_size=3, stride=1, - padding=1, bias=True) - self.fc = nn.Linear(8, 100) - - def forward(self, x): - - x = self.conv(x) - x = self.conv1(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - - return x - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - # random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - dummy_input = torch.ones(1, 2, 2, 2).cuda() - target = torch.ones(1, 100).cuda() - criterion = nn.MSELoss() - replace_tuple = (nn.Conv2d, nn.Linear, nn.AvgPool2d) - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr=0.01) - loss = None - for i in range(200): - optimizer.zero_grad() - out = net(dummy_input) - loss = criterion(out, target) - if i % 20 == 0: - print('loss: ', loss) - loss.backward() - optimizer.step() - 
assert loss < 1, 'training loss error' - net.eval() - torch.save(net.state_dict(), 'data.ignore/tool_test.pt') - linger.wb_analyse('data.ignore/tool_test.pt', 'data.ignore/wb_anylse.log') diff --git a/test/test_tqtcuda_time.py b/test/test_tqtcuda_time.py new file mode 100644 index 0000000..ed38cf8 --- /dev/null +++ b/test/test_tqtcuda_time.py @@ -0,0 +1,221 @@ +import torch +import torch.nn as nn +import torch.nn.functional as F +import time +import random +import math +import copy +import pandas as pd +import gc + +# import debugpy +# print("localhost start ----------------------------") +# debugpy.listen(("localhost", 6000)) +# debugpy.wait_for_client() + +import linger +from linger.quant.ops.qmodule import * +from linger.config import QUANT_CONFIGS +from linger.utils import * + +linger.QUANT_CONFIGS.quant_info.qat_method = QatMethod.TQT +q_configs = QUANT_CONFIGS + +class Configure(Singleton): + open_quant = True + quant_method = FakeQuantMethod.NATIVE + device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + dtype = torch.float32 + seed = 42 + input = None # 测试算例,保证使用不同方式测试时数据都相同 + state_dict = None # 测试模型的state_dict,保证使用不同方式测试时模型的所有参数都相同 + +compare_config = Configure() + +def clear_all_cache(): + + del compare_config.input + del compare_config.state_dict + compare_config.input = None + compare_config.state_dict = None + + """ + 清空 PyTorch 的所有缓存,包括: + 1. CUDA 显存缓存 + 2. Autograd 图缓存 + 3. torch.compile 生成的缓存 (Dynamo/Inductor) + 4. Python 垃圾回收 + """ + # 1. 删除无用的 Python 对象 + gc.collect() + + # 2. 清空 CUDA 显存缓存 + if torch.cuda.is_available(): + torch.cuda.empty_cache() + torch.cuda.ipc_collect() # 清理跨进程内存 + + # 3. 清空 Autograd 统计信息 + torch.cuda.reset_peak_memory_stats() + torch.cuda.reset_accumulated_memory_stats() + + # 4. 
清空 torch.compile 相关缓存 + try: + torch._dynamo.reset() + except Exception: + print("inductor cache 清理失败:", e) + + # try: + # torch._inductor.clear_cache() + # except Exception: + # print("inductor cache 清理失败:", e) + torch.cuda.synchronize() + print("✅ 已清空所有 PyTorch 缓存") + + +def time_it(fn, input, repeat=200): + for i in range(20): + y = fn(input[i]) + torch.cuda.synchronize() + start = time.time() + for i in range(repeat): + y = fn(input[i]) + torch.cuda.synchronize() + end = time.time() + return (end - start) * 1000 # ms + +def test(test_nums=200, fix_seq_len=True, batch_size = 40, seq_len=200, input_shape=256, output_shape=768, mode="float", is_train=True): + if mode == "float": + print("重新初始化了") + input = [] + if fix_seq_len: + for i in range(test_nums): + input.append(torch.randn(batch_size, seq_len, input_shape).to(compare_config.device).to(compare_config.dtype)) + else: + for i in range(test_nums): + temp = random.randint(math.floor(seq_len * 0.9), math.ceil(seq_len * 1.1)) + input.append(torch.randn(batch_size, temp, input_shape).to(compare_config.device).to(compare_config.dtype)) + compare_config.input = copy.deepcopy(input) + layer = nn.Sequential(torch.nn.Linear(input_shape, output_shape), + torch.nn.Linear(output_shape, input_shape), + ) + compare_config.state_dict = None + compare_config.state_dict = copy.deepcopy(layer.state_dict()) + elif mode=="cuda": + linger.QUANT_CONFIGS.quant_method = FakeQuantMethod.CUDA + layer = nn.Sequential( + QLinear( + input_shape, + output_shape, + weights_cfg=q_configs.quant_info.to_dict(), + activations_cfg=q_configs.quant_info.to_dict(), + bias_cfg=q_configs.quant_info.to_dict(), + constrain = q_configs.clamp_info.to_dict() + ), + QLinear( + output_shape, + input_shape, + weights_cfg=q_configs.quant_info.to_dict(), + activations_cfg=q_configs.quant_info.to_dict(), + bias_cfg=q_configs.quant_info.to_dict(), + constrain = q_configs.clamp_info.to_dict() + ) + ) + layer.load_state_dict(compare_config.state_dict, 
strict=False) + else: + linger.QUANT_CONFIGS.quant_method = FakeQuantMethod.NATIVE + layer = nn.Sequential( + QLinear( + input_shape, + output_shape, + weights_cfg=q_configs.quant_info.to_dict(), + activations_cfg=q_configs.quant_info.to_dict(), + bias_cfg=q_configs.quant_info.to_dict(), + constrain = q_configs.clamp_info.to_dict() + ), + QLinear( + output_shape, + input_shape, + weights_cfg=q_configs.quant_info.to_dict(), + activations_cfg=q_configs.quant_info.to_dict(), + bias_cfg=q_configs.quant_info.to_dict(), + constrain = q_configs.clamp_info.to_dict() + ) + ) + layer.load_state_dict(compare_config.state_dict, strict=False) + layer.to(compare_config.device) + if is_train: + layer = layer.train() + else: + layer = layer.eval() + + time_ms = time_it(layer, compare_config.input, test_nums) + print(f"test_nums:{test_nums}, fix_seq_len:{fix_seq_len}, batch_size:{batch_size}, seq_len:{seq_len}, input_shape:{input_shape}, output_shape:{output_shape}, time_ms:{time_ms}") + # print(input[10].shape) + output = layer(compare_config.input[10]) + torch.cuda.synchronize() + + del layer + torch.cuda.empty_cache() + return output, time_ms + +def log_all(df, test_nums=200, fix_seq_len=True, batch_size = 40, seq_len=200, input_shape=256, output_shape=768): + res_all = [] # 用于对比输出的一致性,仅挑选了第10个输入对比 + + # float + print("---------float---------") + float_res, float_time = test(test_nums, fix_seq_len, batch_size, seq_len, input_shape, output_shape, mode="float") + + # cuda + print("---------cuda---------") + cuda_res, cuda_time = test(test_nums, fix_seq_len, batch_size, seq_len, input_shape, output_shape, mode="cuda") + res_all.append(cuda_res) + + # native + print("---------native---------") + native_res, native_time = test(test_nums, fix_seq_len, batch_size, seq_len, input_shape, output_shape, mode="native") + res_all.append(native_res) + + # native + print("---------inference---------") + infer_res, infe_time = test(test_nums, fix_seq_len, batch_size, seq_len, input_shape, 
output_shape, mode="cuda", is_train=False) + # res_all.append(infer_res) + + for i in range(1, len(res_all)): + diff_nums = (res_all[0].data != res_all[i].data).to(torch.float).sum() + if diff_nums !=0 : + print(f"第{i}个和native结果不同,不同个数为:{diff_nums}") + import pdb; pdb.set_trace() + + # df.loc[len(df)] = [test_nums, fix_seq_len, batch_size, seq_len, input_shape, output_shape, float_time, native_time, triton_time, cuda_time, compile_time] + df.loc[len(df)] = [test_nums, fix_seq_len, batch_size, seq_len, input_shape, output_shape, float_time, cuda_time, native_time, infe_time] + clear_all_cache() + return df + +if __name__ == "__main__": + test_nums = 200 + fix_seq_len = False + batch_size = [10, 20, 40] + seq_len = [10, 20, 40, 80, 160, 200] + input_shape = [256, 512, 1024, 2048] + output_shape = [256, 512, 1024, 2048] + + # batch_size = [10, 20] + # seq_len = [10, 20, ] + # input_shape = [256, 512] + # output_shape = [256, 512] + + columns = [ + "test_nums", "fix_seq_len", "batch_size", "seq_len", + "input_shape", "output_shape", + "float_time", "cuda_time","native_time", "inference", # "compile_time", "native_time", "triton_time", + ] + + df = pd.DataFrame(columns=columns) + + for i in batch_size: + for j in seq_len: + for k in input_shape: + for g in output_shape: + print(f"正在计算:{i},{j},{k},{g}") + df = log_all(df=df, test_nums=test_nums, fix_seq_len=fix_seq_len, batch_size = i, seq_len=j, input_shape=k, output_shape=g) + df.to_excel("/yrfs4/inference/sqtu2/LLM/code/linger3.0/my_linger/analys_linger.xlsx", index=False) diff --git a/test/test_trace_convtranspose.py b/test/test_trace_convtranspose.py deleted file mode 100644 index 93df4dd..0000000 --- a/test/test_trace_convtranspose.py +++ /dev/null @@ -1,193 +0,0 @@ -#!/usr/bin/env python -# -*- encoding: utf-8 -*- - -import torch -import torch.nn as nn -import linger -import numpy as np -import os - - -if not os.path.exists('data.ignore'): - os.mkdir('data.ignore') - -def test_trace_convtranspose_net(): - class 
Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 10, kernel_size=3, stride=1, padding=1, bias=False, groups=2) - self.bn = nn.BatchNorm2d(10) - self.relu = nn.ReLU() - self.conv1 = nn.Conv2d(10, 10, kernel_size=3, stride=1, padding=1, bias=False, groups=2) - self.bn1 = nn.BatchNorm2d(10) - self.relu1 = nn.ReLU() - self.deconv = nn.ConvTranspose2d(10,20, kernel_size=2, stride=2) - self.bn2 = nn.BatchNorm2d(20) - self.relu2 = nn.ReLU() - self.fc = nn.Linear(20*100*100, 100) - # self.fc = nn.Linear(10*50*50, 100) - - def forward(self, x): - x = self.conv(x) - x = self.bn(x) - x = self.relu(x) - x = self.conv1(x) - x = self.bn1(x) - x = self.relu1(x) - x = self.deconv(x) - x = self.bn2(x) - x = self.relu2(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(10, 10, 50, 50).cuda() - target = torch.ones(10, 100).cuda() - criterion = nn.MSELoss() - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr = 0.01) - - net = Net().cuda() - criterion = nn.MSELoss() - replace_tuple=(nn.Conv2d, nn.ConvTranspose2d, nn.Linear, nn.BatchNorm2d, linger.NormalizeConvBN2d, linger.NormalizeConvTransposeBN2d) - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - # torchintx.FuseBNIntoConv(net, aa) - # import pdb; pdb.set_trace() - # torchintx.trace_layers(net,net, aa, fuse_bn=True) - # import pdb; pdb.set_trace() - # net = torchintx.init(net, quant_modules=replace_tuple, data_bits=8, parameter_bits=8, out_bits=8, mode=torchintx.QuantMode.MaxValue) - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr = 0.01) - for i in range(150): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 30 == 29: - print('loss: ', loss) - loss.backward() - optimizer.step() - - net.eval() - out = net(aa) - - 
torch.save(net.state_dict(), 'data.ignore/aa_convtranspose.pt') - net1 = Net().cuda() - linger.trace_layers(net1, net1, aa, fuse_bn=True) - net1.eval() - print(net1) - net1.load_state_dict(torch.load('data.ignore/aa_convtranspose.pt')) - out1 = net1(aa) - # import pdb; pdb.set_trace() - # with torch.no_grad(): - # torch.onnx.export(net, aa,"data.ignore/convtranspose_fuse_bn.onnx",export_params=True,opset_version=11,operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - # out2 = net(aa) - # import pdb; pdb.set_trace() - assert (out.mean() - out1.mean()).abs() < 0.001, print('out1: {}, out2: {}'.format(out.sum(), out1.sum())) - # assert out.abs().sum() == out2.abs().sum(), 'inconsistant for tarce convtranspose' - - -def test_quant_convtransposebn_net(): - class Net(nn.Module): - def __init__(self): - super(Net, self).__init__() - self.conv = nn.Conv2d(10, 8, kernel_size=3, stride=1, padding=1, bias=False, groups=2) - self.bn = nn.BatchNorm2d(8) - self.relu = nn.ReLU() - self.conv1 = nn.Conv2d(8, 8, kernel_size=3, stride=1, padding=1, bias=False, groups=2) - self.bn1 = nn.BatchNorm2d(8) - self.relu1 = nn.ReLU() - self.deconv = nn.ConvTranspose2d(8, 1, kernel_size=2, stride=2) - self.bn2 = nn.BatchNorm2d(1) - self.relu2 = nn.ReLU() - self.fc = nn.Linear(100*100, 100) - # self.fc = nn.Linear(10*50*50, 100) - - def forward(self, x): - x = self.conv(x) - x = self.bn(x) - x = self.relu(x) - x = self.conv1(x) - x = self.bn1(x) - x = self.relu1(x) - x = self.deconv(x) - x = self.bn2(x) - x = self.relu2(x) - n, c, h, w = x.shape - x = x.view((n, c*h*w)) - x = self.fc(x) - return x - - torch.manual_seed(1) - torch.cuda.manual_seed_all(1) - np.random.seed(1) - - torch.cuda.set_device(0) - net = Net().cuda() - aa = torch.randn(1, 10, 50, 50).cuda() - target = torch.ones(10, 100).cuda() - criterion = nn.MSELoss() - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr = 0.01) - - net = Net().cuda() - criterion = nn.MSELoss() - replace_tuple=(nn.Conv2d, 
nn.ConvTranspose2d, nn.Linear, nn.BatchNorm2d, linger.NormalizeConvBN2d, linger.NormalizeConvTransposeBN2d) - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - # torchintx.FuseBNIntoConv(net, aa) - # import pdb; pdb.set_trace() - # torchintx.trace_layers(net,net, aa, fuse_bn=True) - # import pdb; pdb.set_trace() - # net = torchintx.init(net, quant_modules=replace_tuple, data_bits=8, parameter_bits=8, out_bits=8, mode=torchintx.QuantMode.MaxValue) - net.train() - optimizer = torch.optim.SGD(net.parameters(), lr = 0.01) - for i in range(150): - optimizer.zero_grad() - out = net(aa) - loss = criterion(out, target) - if i % 30 == 29: - print('loss: ', loss) - loss.backward() - optimizer.step() - - # net.eval() - out = net(aa) - - torch.save(net.state_dict(), 'data.ignore/aa_convtranspose_float.pt') - net1 = Net().cuda() - linger.trace_layers(net1,net1, aa, fuse_bn=True) - linger.SetPlatFormQuant(platform_quant=linger.PlatFormQuant.luna_quant) - # torchintx.disable_quant(net1.fc) - net1 = linger.init(net1, quant_modules=replace_tuple, - mode=linger.QuantMode.QValue) - print(net1) - net1.load_state_dict(torch.load('data.ignore/aa_convtranspose_float.pt')) - net1.train() - out1 = net1(aa) - assert ((out.abs().mean() - out1.abs().mean()).abs() < 0.1) - # import pdb; pdb.set_trace() - optimizer1 = torch.optim.SGD(net1.parameters(), lr = 0.001) - for i in range(150): - optimizer1.zero_grad() - out = net1(aa) - loss = criterion(out, target) - if i % 30 == 29: - print('loss: ', loss) - loss.backward() - optimizer1.step() - net1.eval() - out2 = net1(aa) - # import pdb; pdb.set_trace() - with torch.no_grad(): - torch.onnx.export(net1, aa,"data.ignore/convtranspose_fuse_bn.onnx",export_params=True,opset_version=11,operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK) - out3 = net1(aa) - # import pdb; pdb.set_trace() - assert out2.abs().sum() == out3.abs().sum() - # import pdb; pdb.set_trace() - assert ((out2.abs().mean() - 
out1.abs().mean()).abs() < 0.1) \ No newline at end of file