
Error: Some files fail to be indexed #514

@U11Leung

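1️⃣ Describe the problem

After creating a knowledge base and uploading several files, some files are parsed and indexed normally, while others fail at the indexing step. The failure reproduces consistently.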

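Model: SiliconFlow bge-m3 (1024-dim),
CommonRAG,
When adding the file I tried both with the OCR engine disabled and with DeepSeek OCR
The file is a PDF with selectable text (not image-only), 438 pages, a technical document, mainly in Chinese
(Other PDFs were indexed successfully; those files are smaller, and I cannot see any other difference)
Character count is 1900K+ with OCR and 575K+ with OCR disabled
chunk size=1000, overlap=200, separator=\n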
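
One thing that may be worth checking with these settings: the log further down reports that pre-split mode is enabled with the separator '\n' and that the file was split into 33930 chunks. For 575K+ characters that works out to roughly 17 characters per chunk, far below chunk size=1000. The sketch below shows how that can happen, assuming pre-split mode splits on the separator first and only then applies size-based chunking inside each piece. The function `pre_split_then_chunk` is hypothetical and is not the project's actual implementation.

```python
def pre_split_then_chunk(text: str, separator: str,
                         chunk_size: int, chunk_overlap: int) -> list[str]:
    """Hypothetical pre-split: cut on `separator`, then window long pieces."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue  # skip empty lines
        if len(piece) <= chunk_size:
            # A short line becomes its own chunk, however small it is.
            chunks.append(piece)
            continue
        # Only pieces longer than chunk_size are windowed with overlap.
        for start in range(0, len(piece), step):
            chunks.append(piece[start:start + chunk_size])
            if start + chunk_size >= len(piece):
                break
    return chunks


# Six short lines, similar to the inputs visible in the error log.
sample = "\n".join(["缺省值:", "最小值/", "最大值:", "10.00", "0.00/10000.00", "RW 实型"])
chunks = pre_split_then_chunk(sample, "\n", chunk_size=1000, chunk_overlap=200)

# Average chunk length implied by the numbers in this issue (approximate).
avg_chars_per_chunk = 575_000 / 33_930
```

Under this assumption every line of the parsed document becomes one chunk, so chunk size only matters for lines longer than 1000 characters. A separator that occurs less often would produce fewer and larger chunks.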

2️⃣ Error logs

Please run the following commands and provide the relevant part of the logs:

# macOS / Linux
make logs

# Windows
docker logs --tail=100 api-dev
git rev-parse HEAD
Output of make logs:

02-07 16:53:36 INFO: 10.100.8.8:48664 - "GET /api/knowledge/types HTTP/1.1" 200 - 2ms
02-07 16:53:36 INFO: 10.100.8.8:48680 - "GET /api/knowledge/databases HTTP/1.1" 200 - 4ms
02-07 16:53:36 INFO: 10.100.8.8:48056 - "GET /api/knowledge/databases/kb_073aa47fb50ffd47922e1abf8e20b00b HTTP/1.1" 200 - 2ms
02-07 16:53:36 INFO: 10.100.8.8:48058 - "GET /api/knowledge/files/supported-types HTTP/1.1" 200 - 1ms
02-07 16:53:36 INFO: 10.100.8.8:48070 - "GET /api/departments HTTP/1.1" 200 - 1ms
02-07 16:53:36 INFO: 10.100.8.8:48094 - "GET /api/knowledge/databases/kb_6f68804b8fb11f6806308c3056808dbf/sample-questions HTTP/1.1" 200 - 3ms
02-07 16:53:36 INFO: 10.100.8.8:48086 - "GET /api/knowledge/databases/kb_073aa47fb50ffd47922e1abf8e20b00b/query-params HTTP/1.1" 200 - 3ms
02-07 16:53:36 INFO: 10.100.8.8:48090 - "GET /api/knowledge/databases/kb_073aa47fb50ffd47922e1abf8e20b00b HTTP/1.1" 200 - 18ms
02-07 16:53:37 INFO: 10.100.8.8:48104 - "GET /api/knowledge/databases/kb_073aa47fb50ffd47922e1abf8e20b00b/sample-questions HTTP/1.1" 200 - 1ms
02-07 16:53:37 DEBUG knowledge_router.py:568: GET document file_e54c24 info in kb_073aa47fb50ffd47922e1abf8e20b00b
02-07 16:53:37 DEBUG kb_utils.py:447: Parsed MinIO URL: bucket_name=kb-parsed, object_name=kb_073aa47fb50ffd47922e1abf8e20b00b/file_e54c24/parsed.md
02-07 16:53:37 INFO: 10.100.8.8:48112 - "GET /api/knowledge/databases/kb_073aa47fb50ffd47922e1abf8e20b00b/documents/file_e54c24 HTTP/1.1" 200 - 14ms
02-07 16:53:37 INFO client.py:203: 成功下载 'kb_073aa47fb50ffd47922e1abf8e20b00b/file_e54c24/parsed.md' 从存储桶 'kb-parsed'
02-07 16:53:58 INFO: 127.0.0.1:50212 - "GET /api/system/health HTTP/1.1" 200 - 0ms
02-07 16:54:28 INFO: 127.0.0.1:47992 - "GET /api/system/health HTTP/1.1" 200 - 0ms
02-07 16:54:58 INFO: 127.0.0.1:52120 - "GET /api/system/health HTTP/1.1" 200 - 0ms
02-07 16:55:28 INFO: 127.0.0.1:36592 - "GET /api/system/health HTTP/1.1" 200 - 0ms
02-07 16:55:58 INFO: 127.0.0.1:49972 - "GET /api/system/health HTTP/1.1" 200 - 0ms
02-07 16:56:28 INFO: 127.0.0.1:46444 - "GET /api/system/health HTTP/1.1" 200 - 0ms
02-07 16:56:59 INFO: 127.0.0.1:33474 - "GET /api/system/health HTTP/1.1" 200 - 0ms
02-07 16:57:29 INFO: 127.0.0.1:43058 - "GET /api/system/health HTTP/1.1" 200 - 0ms
02-07 16:57:47 DEBUG knowledge_router.py:496: Index documents for db_id kb_073aa47fb50ffd47922e1abf8e20b00b: ['file_e54c24'] params={'chunk_size': 1000, 'chunk_overlap': 200, 'qa_separator': '\\n', 'enable_ocr': 'disable', 'content_hashes': {'http://localhost:9000/ref-kb-073aa47fb50ffd47922e1abf8e20b/750-um001_-zh-p_1770451900441.pdf': '2efd6d16bba2c80b731505682b5837b0bd7acf6022b709c1d3b1367e1faddded'}, 'content_type': 'file', 'db_id': 'kb_073aa47fb50ffd47922e1abf8e20b00b'}
02-07 16:57:47 INFO: 10.100.8.8:37114 - "POST /api/knowledge/databases/kb_073aa47fb50ffd47922e1abf8e20b00b/documents/index HTTP/1.1" 200 - 10ms
02-07 16:57:47 INFO task_service.py:141: Enqueued task dd08df32df144aff98de2bbe9f685711 (文档入库 (知识测试-不OCR))
02-07 16:57:47 DEBUG base.py:330: [update_file_params] file_id=file_e54c24, current_params={'enable_ocr': 'disable', 'content_hashes': {'http://localhost:9000/ref-kb-073aa47fb50ffd47922e1abf8e20b/750-um001_-zh-p_1770451900441.pdf': '2efd6d16bba2c80b731505682b5837b0bd7acf6022b709c1d3b1367e1faddded'}, 'content_type': 'file', 'db_id': 'kb_073aa47fb50ffd47922e1abf8e20b00b', 'chunk_size': 1000, 'chunk_overlap': 200, 'qa_separator': '\\n'}, new_params={'chunk_size': 1000, 'chunk_overlap': 200, 'qa_separator': '\\n', 'enable_ocr': 'disable', 'content_hashes': {'http://localhost:9000/ref-kb-073aa47fb50ffd47922e1abf8e20b/750-um001_-zh-p_1770451900441.pdf': '2efd6d16bba2c80b731505682b5837b0bd7acf6022b709c1d3b1367e1faddded'}, 'content_type': 'file', 'db_id': 'kb_073aa47fb50ffd47922e1abf8e20b00b'}
02-07 16:57:47 DEBUG base.py:339: [update_file_params] file_id=file_e54c24, updated_params={'enable_ocr': 'disable', 'content_hashes': {'http://localhost:9000/ref-kb-073aa47fb50ffd47922e1abf8e20b/750-um001_-zh-p_1770451900441.pdf': '2efd6d16bba2c80b731505682b5837b0bd7acf6022b709c1d3b1367e1faddded'}, 'content_type': 'file', 'db_id': 'kb_073aa47fb50ffd47922e1abf8e20b00b', 'chunk_size': 1000, 'chunk_overlap': 200, 'qa_separator': '\\n'}
02-07 16:57:47 INFO embed.py:254: Loading embedding model siliconflow/BAAI/bge-m3
02-07 16:57:47 INFO __init__.py:64: Running in docker, using https://api.siliconflow.cn/v1/embeddings as base url
02-07 16:57:47 DEBUG milvus.py:288: [index_file] file_id=file_e54c24, processing_params={'enable_ocr': 'disable', 'content_hashes': {'http://localhost:9000/ref-kb-073aa47fb50ffd47922e1abf8e20b/750-um001_-zh-p_1770451900441.pdf': '2efd6d16bba2c80b731505682b5837b0bd7acf6022b709c1d3b1367e1faddded'}, 'content_type': 'file', 'db_id': 'kb_073aa47fb50ffd47922e1abf8e20b00b', 'chunk_size': 1000, 'chunk_overlap': 200, 'qa_separator': '\\n'}
02-07 16:57:47 DEBUG base.py:705: Added file file_e54c24 to processing queue
02-07 16:57:47 DEBUG kb_utils.py:447: Parsed MinIO URL: bucket_name=kb-parsed, object_name=kb_073aa47fb50ffd47922e1abf8e20b00b/file_e54c24/parsed.md
02-07 16:57:47 INFO client.py:203: 成功下载 'kb_073aa47fb50ffd47922e1abf8e20b00b/file_e54c24/parsed.md' 从存储桶 'kb-parsed'
02-07 16:57:47 DEBUG kb_utils.py:123: 启用预分割模式,使用分隔符: '\n'
02-07 16:57:47 DEBUG kb_utils.py:147: Successfully split text into 33930 chunks using MarkdownTextSplitter
02-07 16:57:47 INFO milvus.py:300: Split 750-um001_-zh-p.pdf into 33930 chunks with params: chunk_size=1000, chunk_overlap=200, qa_separator=\n
02-07 16:57:50 INFO: 10.100.8.8:37122 - "GET /api/knowledge/databases/kb_073aa47fb50ffd47922e1abf8e20b00b HTTP/1.1" 200 - 165ms
02-07 16:57:50 INFO: 10.100.8.8:37126 - "GET /api/knowledge/databases/kb_073aa47fb50ffd47922e1abf8e20b00b HTTP/1.1" 200 - 2ms
02-07 16:57:50 INFO: 10.100.8.8:37138 - "GET /api/knowledge/databases/kb_073aa47fb50ffd47922e1abf8e20b00b HTTP/1.1" 200 - 1ms
02-07 16:57:50 INFO: 10.100.8.8:37154 - "GET /api/knowledge/databases/kb_073aa47fb50ffd47922e1abf8e20b00b HTTP/1.1" 200 - 1ms
02-07 16:57:51 INFO: 10.100.8.8:37166 - "GET /api/knowledge/databases/kb_073aa47fb50ffd47922e1abf8e20b00b HTTP/1.1" 200 - 2ms
02-07 16:57:52 INFO: 10.100.8.8:37168 - "GET /api/knowledge/databases/kb_073aa47fb50ffd47922e1abf8e20b00b HTTP/1.1" 200 - 2ms
02-07 16:57:53 INFO: 10.100.8.8:37180 - "GET /api/knowledge/databases/kb_073aa47fb50ffd47922e1abf8e20b00b HTTP/1.1" 200 - 3ms
02-07 16:57:54 INFO: 10.100.8.8:37194 - "GET /api/knowledge/databases/kb_073aa47fb50ffd47922e1abf8e20b00b HTTP/1.1" 200 - 21ms
02-07 16:57:55 INFO: 10.100.8.8:37202 - "GET /api/knowledge/databases/kb_073aa47fb50ffd47922e1abf8e20b00b HTTP/1.1" 200 - 2ms
02-07 16:57:55 ERROR milvus.py:341: Indexing failed for file_e54c24: Other Embedding async request failed: , {'model': 'BAAI/bge-m3', 'input': ['单位:', '缺省值:', '最小值/', '最大值:', 'Hz/A', '9080.0', '0.0/100000.0', 'RW 实型', '100 Slip Reg Enable (滑差调节启用)', '滑差调节器启用', '启用或禁用滑差频率调节器。该选择仅在电机控制模式磁通矢量感', '应中 (P35 [Motor Ctrl Mode] (电机控制模式) = 3 “Induction FV” (感应电', '机磁通矢量)) 且使用编码器反馈的情况下才有效。', '缺省值:', '选项:', '1 = “Enabled” (启用)', '0 = “Disabled” (禁用)', '1 = “Enabled” (启用)', 'RW 32 位', '整数', '101 Slip Reg Ki (滑差调节 Ki)', '滑差调节器积分增益', '滑差频率调节器的积分增益。', '缺省值:', '最小值/', '最大值:', '10.00', '0.00/10000.00', 'RW 实型', '102 Slip Reg Kp (滑差调节 Kp)', '滑差调节器比例增益', '滑差频率调节器的比例增益。', '缺省值:', '最小值/', '最大值:', '0.50', '0.00/10000.00', 'RW 实型', '103 Flux Reg Enable (磁通调节启用)', '磁通调节器启用']}, self.base_url='https://api.siliconflow.cn/v1/embeddings'
02-07 16:57:55 DEBUG base.py:717: Removed file file_e54c24 from processing queue
02-07 16:57:55 ERROR knowledge_router.py:539: Index failed for file_e54c24: Other Embedding async request failed: , {'model': 'BAAI/bge-m3', 'input': ['单位:', '缺省值:', '最小值/', '最大值:', 'Hz/A', '9080.0', '0.0/100000.0', 'RW 实型', '100 Slip Reg Enable (滑差调节启用)', '滑差调节器启用', '启用或禁用滑差频率调节器。该选择仅在电机控制模式磁通矢量感', '应中 (P35 [Motor Ctrl Mode] (电机控制模式) = 3 “Induction FV” (感应电', '机磁通矢量)) 且使用编码器反馈的情况下才有效。', '缺省值:', '选项:', '1 = “Enabled” (启用)', '0 = “Disabled” (禁用)', '1 = “Enabled” (启用)', 'RW 32 位', '整数', '101 Slip Reg Ki (滑差调节 Ki)', '滑差调节器积分增益', '滑差频率调节器的积分增益。', '缺省值:', '最小值/', '最大值:', '10.00', '0.00/10000.00', 'RW 实型', '102 Slip Reg Kp (滑差调节 Kp)', '滑差调节器比例增益', '滑差频率调节器的比例增益。', '缺省值:', '最小值/', '最大值:', '0.50', '0.00/10000.00', 'RW 实型', '103 Flux Reg Enable (磁通调节启用)', '磁通调节器启用']}, self.base_url='https://api.siliconflow.cn/v1/embeddings'
02-07 16:57:56 INFO: 10.100.8.8:37206 - "GET /api/knowledge/databases/kb_073aa47fb50ffd47922e1abf8e20b00b HTTP/1.1" 200 - 2ms
02-07 16:57:59 INFO: 127.0.0.1:49248 - "GET /api/system/health HTTP/1.1" 200 - 0ms
02-07 16:58:29 INFO: 127.0.0.1:57584 - "GET /api/system/health HTTP/1.1" 200 - 0ms


Branch: 
Commit ID: 5a376dd1b93e0cde27c5c0dddda492f65fd5bd59
System: Linux yfc8-AI-Server 6.14.0-37-generic #37~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 20 10:25:38 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
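
The ERROR lines above show nothing between "failed:" and the request payload, so the underlying cause is not visible in the log. An exception whose string form is empty (for example `asyncio.TimeoutError()`) would produce exactly this output, but that is only one possibility. The failed request appears to carry 40 short inputs, and 33930 chunks would mean several hundred such requests. The following is a minimal sketch of sequential batching with retries that also records the exception's `repr`, so an empty message still leaves a trace. The names `embed_in_batches` and `send_batch` are hypothetical and do not come from the project; `send_batch` stands in for whatever function actually calls the embeddings endpoint.

```python
import time
from typing import Callable, Sequence


def embed_in_batches(
    texts: Sequence[str],
    send_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 40,
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[list[float]]:
    """Send texts in fixed-size batches, retrying each batch with backoff."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        for attempt in range(max_retries + 1):
            try:
                vectors.extend(send_batch(batch))
                break
            except Exception as exc:
                if attempt == max_retries:
                    # repr() keeps the exception type even if str(exc) is empty.
                    raise RuntimeError(
                        f"batch at offset {start} failed after "
                        f"{max_retries + 1} attempts: {exc!r}"
                    ) from exc
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)
    return vectors


# Demo with a fake sender that fails once, then returns dummy 1-d vectors.
calls = {"n": 0}

def flaky_sender(batch: list[str]) -> list[list[float]]:
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError()  # str() of this exception is empty
    return [[float(len(t))] for t in batch]

texts = [f"line {i}" for i in range(100)]
vectors = embed_in_batches(texts, flaky_sender, batch_size=40, sleep=lambda s: None)
```

If the provider is rejecting or timing out requests under load, retrying with backoff would let the job continue; if the failure is deterministic for one batch, the `repr` in the final error should at least show the exception type.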


3️⃣ Related screenshots

#️⃣ Other relevant information

✅ If the problem is related to model calls, please try switching to another online model
