Compare commits

...

448 Commits

Author SHA1 Message Date
雷欧(林平凡)
516f5af00a 1
Some checks are pending
build-with-all-capacity / build-and-push-image (push) Waiting to run
build-with-audio-assistant / build-and-push-image (push) Waiting to run
build-with-latex-arm / build-and-push-image (push) Waiting to run
build-with-latex / build-and-push-image (push) Waiting to run
build-without-local-llms / build-and-push-image (push) Waiting to run
2025-09-16 16:04:06 +08:00
雷欧(林平凡)
99d80ba61a 1
Some checks are pending
build-with-all-capacity / build-and-push-image (push) Waiting to run
build-with-audio-assistant / build-and-push-image (push) Waiting to run
build-with-latex-arm / build-and-push-image (push) Waiting to run
build-with-latex / build-and-push-image (push) Waiting to run
build-without-local-llms / build-and-push-image (push) Waiting to run
2025-09-16 15:28:40 +08:00
雷欧(林平凡)
7d70566402 Merge remote-tracking branch 'github/master'
Some checks are pending
build-with-all-capacity / build-and-push-image (push) Waiting to run
build-with-audio-assistant / build-and-push-image (push) Waiting to run
build-with-latex-arm / build-and-push-image (push) Waiting to run
build-with-latex / build-and-push-image (push) Waiting to run
build-without-local-llms / build-and-push-image (push) Waiting to run
2025-09-16 15:19:03 +08:00
binary-husky
3546f8c1c4 impl fast paper reading 2025-08-24 20:50:12 +08:00
binary-husky
d8ad675f13 patch div bug 2025-08-24 20:35:20 +08:00
binary-husky
0ab0417954 normalize source code names 2025-08-24 20:12:34 +08:00
binary-husky
248b0aefae patch dockerfiles 2025-08-23 19:19:35 +08:00
binary-husky
661fe63941 opt dockerfiles 2025-08-23 19:11:42 +08:00
binary-husky
269804fb82 update dockerfiles 2025-08-23 16:20:10 +08:00
binary-husky
8042750d41 Master 4.0 (#2210)
* stage academic conversation

* stage document conversation

* fix buggy gradio version

* file dynamic load

* merge more academic plugins

* accelerate nltk

* feat: 为predict函数添加文件和URL读取功能
- 添加URL检测和网页内容提取功能,支持自动提取网页文本
- 添加文件路径识别和文件内容读取功能,支持private_upload路径格式
- 集成WebTextExtractor处理网页内容提取
- 集成TextContentLoader处理本地文件读取
- 支持文件路径与问题组合的智能处理

* back

* block unstable

---------

Co-authored-by: XiaoBoAI <liuboyin2019@ia.ac.cn>
2025-08-23 15:59:22 +08:00
雷欧(林平凡)
2706263a4b Merge branch 'master' into HEAD
Some checks failed
build-with-all-capacity / build-and-push-image (push) Has been cancelled
build-with-audio-assistant / build-and-push-image (push) Has been cancelled
build-with-chatglm / build-and-push-image (push) Has been cancelled
build-with-latex-arm / build-and-push-image (push) Has been cancelled
build-with-latex / build-and-push-image (push) Has been cancelled
build-without-local-llms / build-and-push-image (push) Has been cancelled
2025-08-08 18:02:17 +08:00
雷欧(林平凡)
ee34f13307 1 2025-08-08 17:59:21 +08:00
binary-husky
65a4cf59c2 new ui backend 2025-07-31 22:22:18 +08:00
binary-husky
a7a56b5058 fix buggy gradio version 2025-06-25 01:34:33 +08:00
binary-husky
8c21432291 use uv to build dockerfile 2025-06-04 02:24:09 +08:00
binary-husky
87b3f79ae9 setup nv 2025-06-04 01:53:29 +08:00
binary-husky
f42aad5093 implement doc_fns 2025-06-04 00:20:09 +08:00
binary-husky
725f60fba3 add context clip policy 2025-06-03 01:05:37 +08:00
binary-husky
be83907394 Merge branch 'master' of github.com:binary-husky/chatgpt_academic 2025-05-06 22:17:34 +08:00
binary-husky
eba48a0f1a improve reset conversation ui 2025-05-06 22:10:21 +08:00
binary-husky
ee1a9e7cce support qwen3 models - edit config hint 2025-04-29 11:10:49 +08:00
binary-husky
fc06be6f7a support qwen3 models 2025-04-29 11:09:53 +08:00
binary-husky
883b513b91 add can_multi_thread 2025-04-21 00:50:24 +08:00
binary-husky
24cebaf4ec add o3 and o4 models 2025-04-21 00:48:59 +08:00
binary-husky
858b5f69b0 add in-text stop btn 2025-04-15 01:08:54 +08:00
davidfir3
63c61e6204 添加gemini-2.0-flash (#2180)
Co-authored-by: 柯仕锋 <12029064@zju.edu.cn>
2025-03-25 00:13:18 +08:00
BZfei
82aac97980 阿里云百炼(原灵积)增加对deepseek-r1、deepseek-v3模型支持 (#2182)
* 阿里云百炼(原灵积)增加对deepseek-r1、deepseek-v3模型支持

* update reasoning display

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2025-03-25 00:11:55 +08:00
binary-husky
045cdb15d8 ensure display none even if css load fails 2025-03-10 23:44:47 +08:00
binary-husky
e78e8b0909 allow copy original text instead of renderend text 2025-03-09 00:04:27 +08:00
binary-husky
07974a26d0 Merge branch 'master' of github.com:binary-husky/chatgpt_academic 2025-03-08 23:10:42 +08:00
binary-husky
3e56c074cc fix gui_toolbar 2025-03-08 23:09:22 +08:00
littleolaf
72dbe856d2 添加接入 火山引擎在线大模型 内容的支持 (#2165)
* use oai adaptive bridge function to handle vol engine

* add vol engine deepseek v3

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2025-03-04 23:58:03 +08:00
雷欧(林平凡)
0055ea2df7 Merge branch 'master' of https://github.com/binary-husky/gpt_academic
Some checks failed
build-with-all-capacity / build-and-push-image (push) Has been cancelled
build-with-audio-assistant / build-and-push-image (push) Has been cancelled
build-with-chatglm / build-and-push-image (push) Has been cancelled
build-with-latex-arm / build-and-push-image (push) Has been cancelled
build-with-latex / build-and-push-image (push) Has been cancelled
build-without-local-llms / build-and-push-image (push) Has been cancelled
2025-03-04 14:16:24 +08:00
Steven Moder
4a79aa6a93 typo: Fix typos and rename functions across multiple files (#2130)
* typo: Fix typos and rename functions across multiple files

This commit addresses several minor issues:
- Corrected spelling of function names (e.g., `update_ui_lastest_msg` to `update_ui_latest_msg`)
- Fixed typos in comments and variable names
- Corrected capitalization in some strings (e.g., "ArXiv" instead of "Arixv")
- Renamed some variables for consistency
- Corrected some console-related parameter names (e.g., `console_slience` to `console_silence`)

The changes span multiple files across the project, including request LLM bridges, crazy functions, and utility modules.

* fix: f-string expression part cannot include a backslash (#2139)

* raise error when the uploaded tar contain hard/soft link (#2136)

* minor bug fix

* fine tune reasoning css

* upgrade internet gpt plugin

* Update README.md

* fix GHSA-gqp5-wm97-qxcv

* typo fix

* update readme

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>
Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2025-03-02 02:16:10 +08:00
binary-husky
5dffe8627f fix GHSA-gqp5-wm97-qxcv 2025-03-02 01:58:45 +08:00
binary-husky
2aefef26db Update README.md 2025-02-21 19:51:09 +08:00
binary-husky
957da731db upgrade internet gpt plugin 2025-02-13 00:19:43 +08:00
雷欧(林平凡)
155e7e0deb Merge remote-tracking branch 'github/master'
Some checks failed
build-with-all-capacity / build-and-push-image (push) Has been cancelled
build-with-audio-assistant / build-and-push-image (push) Has been cancelled
build-with-chatglm / build-and-push-image (push) Has been cancelled
build-with-latex-arm / build-and-push-image (push) Has been cancelled
build-with-latex / build-and-push-image (push) Has been cancelled
build-without-local-llms / build-and-push-image (push) Has been cancelled
# Conflicts:
#	config.py
2025-02-12 15:07:39 +08:00
binary-husky
add29eba08 fine tune reasoning css 2025-02-09 20:26:52 +08:00
binary-husky
163e59c0f3 minor bug fix 2025-02-09 19:33:02 +08:00
binary-husky
07ece29c7c raise error when the uploaded tar contain hard/soft link (#2136) 2025-02-08 20:54:01 +08:00
Steven Moder
991a903fa9 fix: f-string expression part cannot include a backslash (#2139) 2025-02-08 20:50:54 +08:00
Steven Moder
cf7c81170c fix: return 参数数量 及 返回类型考虑 (#2129) 2025-02-07 21:33:06 +08:00
barry
6dda2061dd Update bridge_openrouter.py (#2132)
fix openrouter api 400 post bug

Co-authored-by: lan <56376794+lostatnight@users.noreply.github.com>
2025-02-07 21:28:05 +08:00
雷欧(林平凡)
e9de41b7e8 1
Some checks failed
build-with-all-capacity / build-and-push-image (push) Has been cancelled
build-with-audio-assistant / build-and-push-image (push) Has been cancelled
build-with-chatglm / build-and-push-image (push) Has been cancelled
build-with-latex-arm / build-and-push-image (push) Has been cancelled
build-with-latex / build-and-push-image (push) Has been cancelled
build-without-local-llms / build-and-push-image (push) Has been cancelled
2025-02-07 11:21:24 +08:00
雷欧(林平凡)
b34c79a94b 1
Some checks are pending
build-with-all-capacity / build-and-push-image (push) Waiting to run
build-with-audio-assistant / build-and-push-image (push) Waiting to run
build-with-chatglm / build-and-push-image (push) Waiting to run
build-with-latex-arm / build-and-push-image (push) Waiting to run
build-with-latex / build-and-push-image (push) Waiting to run
build-without-local-llms / build-and-push-image (push) Waiting to run
2025-02-07 11:17:38 +08:00
binary-husky
8a0d96afd3 consider element missing cases in js 2025-02-07 01:21:21 +08:00
binary-husky
37f9b94dee add options to hide ui components 2025-02-07 00:17:36 +08:00
雷欧(林平凡)
95284d859b 1
Some checks are pending
build-with-all-capacity / build-and-push-image (push) Waiting to run
build-with-audio-assistant / build-and-push-image (push) Waiting to run
build-with-chatglm / build-and-push-image (push) Waiting to run
build-with-latex-arm / build-and-push-image (push) Waiting to run
build-with-latex / build-and-push-image (push) Waiting to run
build-without-local-llms / build-and-push-image (push) Waiting to run
2025-02-06 10:46:46 +08:00
雷欧(林平凡)
a552592b5a 1
Some checks are pending
build-with-all-capacity / build-and-push-image (push) Waiting to run
build-with-audio-assistant / build-and-push-image (push) Waiting to run
build-with-chatglm / build-and-push-image (push) Waiting to run
build-with-latex-arm / build-and-push-image (push) Waiting to run
build-with-latex / build-and-push-image (push) Waiting to run
build-without-local-llms / build-and-push-image (push) Waiting to run
2025-02-06 10:32:13 +08:00
雷欧(林平凡)
e305f1b4a8 1
Some checks are pending
build-with-all-capacity / build-and-push-image (push) Waiting to run
build-with-audio-assistant / build-and-push-image (push) Waiting to run
build-with-chatglm / build-and-push-image (push) Waiting to run
build-with-latex-arm / build-and-push-image (push) Waiting to run
build-with-latex / build-and-push-image (push) Waiting to run
build-without-local-llms / build-and-push-image (push) Waiting to run
2025-02-06 10:30:58 +08:00
van
a88497c3ab 1
Some checks are pending
build-with-all-capacity / build-and-push-image (push) Waiting to run
build-with-audio-assistant / build-and-push-image (push) Waiting to run
build-with-chatglm / build-and-push-image (push) Waiting to run
build-with-latex-arm / build-and-push-image (push) Waiting to run
build-with-latex / build-and-push-image (push) Waiting to run
build-without-local-llms / build-and-push-image (push) Waiting to run
2025-02-06 10:23:42 +08:00
雷欧(林平凡)
0f1d2e0e48 1
Some checks are pending
build-with-all-capacity / build-and-push-image (push) Waiting to run
build-with-audio-assistant / build-and-push-image (push) Waiting to run
build-with-chatglm / build-and-push-image (push) Waiting to run
build-with-latex-arm / build-and-push-image (push) Waiting to run
build-with-latex / build-and-push-image (push) Waiting to run
build-without-local-llms / build-and-push-image (push) Waiting to run
2025-02-06 10:03:34 +08:00
binary-husky
936e2f5206 update readme 2025-02-04 16:15:56 +08:00
binary-husky
7f4b87a633 update readme 2025-02-04 16:08:18 +08:00
binary-husky
2ddd1bb634 Merge branch 'memset0-master' 2025-02-04 16:03:53 +08:00
binary-husky
c68285aeac update config and version 2025-02-04 16:03:01 +08:00
Memento mori.
caaebe4296 add support for Deepseek R1 model and display CoT (#2118)
* feat: add support for R1 model and display CoT

* fix unpacking

* feat: customized font & font size

* auto hide tooltip when scoll down

* tooltip glass transparent css

* fix: Enhance API key validation in is_any_api_key function (#2113)

* support qwen2.5-max!

* update minior adjustment

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
Co-authored-by: Steven Moder <java20131114@gmail.com>
2025-02-04 16:02:02 +08:00
binary-husky
39d50c1c95 update minior adjustment 2025-02-04 15:57:35 +08:00
binary-husky
25dc7bf912 Merge branch 'master' of https://github.com/memset0/gpt_academic into memset0-master 2025-01-30 22:03:31 +08:00
binary-husky
0458590a77 support qwen2.5-max! 2025-01-29 23:29:38 +08:00
Steven Moder
44fe78fff5 fix: Enhance API key validation in is_any_api_key function (#2113) 2025-01-29 21:40:30 +08:00
binary-husky
5ddd657ebc tooltip glass transparent css 2025-01-28 23:50:21 +08:00
binary-husky
9b0b2cf260 auto hide tooltip when scoll down 2025-01-28 23:32:40 +08:00
binary-husky
9f39a6571a feat: customized font & font size 2025-01-28 02:52:56 +08:00
memset0
d07e736214 fix unpacking 2025-01-25 00:00:13 +08:00
memset0
a1f7ae5b55 feat: add support for R1 model and display CoT 2025-01-24 14:43:49 +08:00
binary-husky
1213ef19e5 Merge branch 'master' of github.com:binary-husky/chatgpt_academic 2025-01-22 01:50:08 +08:00
binary-husky
aaafe2a797 fix xelatex font problem in all-cap image 2025-01-22 01:49:53 +08:00
binary-husky
2716606f0c Update README.md 2025-01-16 23:40:24 +08:00
binary-husky
286f7303be fix image display bug 2025-01-12 21:54:43 +08:00
binary-husky
7eeab9e376 fix code block display bug 2025-01-09 22:31:59 +08:00
binary-husky
4ca331fb28 prevent html rendering for input 2025-01-05 21:20:12 +08:00
binary-husky
9487829930 change max_chat_preserve = 10 2025-01-03 00:34:36 +08:00
binary-husky
a73074b89e upgrade chat checkpoint 2025-01-03 00:31:03 +08:00
Southlandi
fd93622840 修复Gemini对话错误问题(停用词数量为0的情况) (#2092) 2024-12-28 23:22:10 +08:00
whyXVI
09a82a572d Fix RuntimeError in predict_no_ui_long_connection() (#2095)
Bug fix: Fix RuntimeError in predict_no_ui_long_connection()

In the original code, calling predict_no_ui_long_connection() would trigger a RuntimeError("OpenAI拒绝了请求:" + error_msg) even when the server responded normally. The issue occurred due to incorrect handling of SSE protocol comment lines (lines starting with ":"). 

Modified the parsing logic in both `predict` and `predict_no_ui_long_connection` to handle these lines correctly, making the logic more intuitive and robust.
2024-12-28 23:21:14 +08:00
G.RQ
c53ddf65aa 修复 bug“重置”按钮报错 (#2102)
* fix 重置按钮bug

* fix version control bug

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2024-12-28 23:19:25 +08:00
binary-husky
ac64a77c2d allow disable openai proxy in WHEN_TO_USE_PROXY 2024-12-28 07:14:54 +08:00
binary-husky
dae8a0affc compat bug fix 2024-12-25 01:21:58 +08:00
binary-husky
97a81e9388 fix temp issue of o1 2024-12-25 00:54:03 +08:00
binary-husky
1dd1d0ed6c fix cookie overflow bug 2024-12-25 00:33:20 +08:00
binary-husky
060af0d2e6 Merge branch 'master' of github.com:binary-husky/chatgpt_academic 2024-12-22 23:33:44 +08:00
binary-husky
a848f714b6 fix welcome card bugs 2024-12-22 23:33:22 +08:00
binary-husky
924f8e30c7 Update issue stale.yml 2024-12-22 14:16:18 +08:00
binary-husky
f40347665b github action change 2024-12-22 14:15:16 +08:00
binary-husky
734c40bbde fix non-localhost javascript error 2024-12-22 14:01:22 +08:00
binary-husky
4ec87fbb54 history ng patch 1 2024-12-21 11:27:53 +08:00
binary-husky
17b5c22e61 Merge branch 'master' of github.com:binary-husky/chatgpt_academic 2024-12-19 22:46:14 +08:00
binary-husky
c6cd04a407 promote the rank of DASHSCOPE_API_KEY 2024-12-19 22:39:14 +08:00
YIQI JIANG
f60a12f8b4 Add o1 and o1-2024-12-17 model support (#2090)
* Add o1 and o1-2024-12-17 model support

* patch api key selection

---------

Co-authored-by: 蒋翌琪 <jiangyiqi99@jiangyiqideMacBook-Pro.local>
Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2024-12-19 22:32:57 +08:00
binary-husky
8413fb15ba optimize welcome page 2024-12-18 23:35:25 +08:00
binary-husky
72b2ce9b62 ollama patch 2024-12-18 23:05:55 +08:00
binary-husky
f43ef909e2 roll version to 3.91 2024-12-18 22:56:41 +08:00
binary-husky
9651ad488f Merge branch 'master' into frontier 2024-12-18 22:27:12 +08:00
binary-husky
81da7bb1a5 remove welcome card when layout overflows 2024-12-18 17:48:02 +08:00
binary-husky
4127162ee7 add tts test 2024-12-18 17:47:23 +08:00
binary-husky
98e5cb7b77 update readme 2024-12-09 23:57:10 +08:00
binary-husky
c88d8047dd cookie storage to local storage 2024-12-09 23:52:02 +08:00
binary-husky
e4bebea28d update requirements 2024-12-09 23:40:23 +08:00
YE Ke 叶柯
294df6c2d5 Add ChatGLM4 local deployment support and refactor ChatGLM bridge's path configuration (#2062)
*  feat(request_llms and config.py): ChatGLM4 Deployment

Add support for local deployment of ChatGLM4 model

* 🦄 refactor(bridge_chatglm3.py): ChatGLM3 model path

Added ChatGLM3 path customization (in config.py).
Removed useless quantization model options that have been annotated

---------

Co-authored-by: MarkDeia <17290550+MarkDeia@users.noreply.github.com>
2024-12-07 23:43:51 +08:00
Zhenhong Du
239894544e Add support for grok-beta model from x.ai (#2060)
* Update config.py

add support for `grok-beta` model

* Update bridge_all.py

add support for `grok-beta` model
2024-12-07 23:41:53 +08:00
Menghuan
ed5fc84d4e 添加为windows的环境打包以及一键启动脚本 (#2068)
* 新增自动打包windows下的环境依赖

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2024-12-07 23:41:02 +08:00
Menghuan
e3f84069ee 改进Doc2X请求,并增加xelatex编译的支持 (#2058)
* doc2x请求函数格式清理

* 更新中间部分

* 添加doc2x超时设置并添加对xelatex编译的支持

* Bug修复以及增加对xelatex安装的检测

* 增强弱网环境下的稳定性

* 修复模型中_无法显示的问题

* add xelatex logs

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2024-12-07 23:23:59 +08:00
binary-husky
7bf094b6b6 remove 2024-12-07 22:43:03 +08:00
binary-husky
57d7dc33d3 sync common.js 2024-12-07 17:10:01 +08:00
binary-husky
94ccd77480 remove gen restore btn 2024-12-07 16:22:29 +08:00
binary-husky
48e53cba05 update gradio 2024-12-07 16:18:05 +08:00
binary-husky
e9a7f9439f upgrade gradio 2024-12-07 15:59:30 +08:00
binary-husky
a88b119bf0 change urls 2024-12-05 22:13:59 +08:00
binary-husky
eee8115434 add a config note 2024-12-04 23:55:22 +08:00
binary-husky
4f6a272113 remove keyword extraction 2024-12-04 01:33:31 +08:00
binary-husky
cf3dd5ddb6 add fail fallback option for media plugin 2024-12-04 01:06:12 +08:00
binary-husky
f6f10b7230 media plugin update 2024-12-04 00:36:34 +08:00
binary-husky
bd7b219e8f update web search functionality 2024-12-02 01:55:01 +08:00
binary-husky
e62decac21 change some open fn encoding to utf-8 2024-11-19 15:53:50 +00:00
binary-husky
588b22e039 comment remove 2024-11-19 15:05:48 +00:00
binary-husky
ef18aeda81 adjust rag 2024-11-19 14:59:50 +00:00
binary-husky
3520131ca2 public media gpt 2024-11-18 18:38:49 +00:00
binary-husky
ff5901d8c0 Merge branch 'master' into frontier 2024-11-17 18:16:19 +00:00
binary-husky
2305576410 unify mutex button manifest 2024-11-17 18:14:45 +00:00
binary-husky
52f23c505c media-gpt update 2024-11-17 17:45:53 +00:00
binary-husky
34cc484635 chatgpt-4o-latest 2024-11-11 15:58:57 +00:00
binary-husky
d152f62894 renamed plugins 2024-11-11 14:55:05 +00:00
binary-husky
ca35f56f9b fix: media gpt upgrade 2024-11-11 14:48:29 +00:00
binary-husky
d616fd121a update experimental media agent 2024-11-10 16:42:31 +00:00
binary-husky
f3fda0d3fc Merge branch 'master' into frontier 2024-11-10 13:41:44 +00:00
binary-husky
197287fc30 Enhance archive extraction with error handling for tar and gzip formats 2024-11-09 10:10:46 +00:00
Bingchen Jiang
c37fcc9299 Adding support to new openai apikey format (#2030) 2024-11-09 13:41:19 +08:00
binary-husky
91f5e6b8f7 resolve pickle security issue 2024-11-04 13:49:49 +00:00
hcy2206
4f0851f703 增加了对于glm-4-plus的支持 (#2014)
* 增加对于讯飞星火大模型Spark4.0的支持

* Create github action sync.yml

* 增加对于智谱glm-4-plus的支持

* feat: change arxiv io param

* catch comment source code exception

* upgrade auto comment

* add security patch

---------

Co-authored-by: GH Action - Upstream Sync <action@github.com>
Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2024-11-03 22:41:16 +08:00
binary-husky
2821f27756 add security patch 2024-11-03 14:34:17 +00:00
binary-husky
8f91a048a8 dfa algo imp 2024-11-03 09:39:14 +00:00
binary-husky
58eac38b4d Merge branch 'master' into frontier 2024-10-30 13:42:17 +00:00
binary-husky
180550b8f0 upgrade auto comment 2024-10-30 13:37:35 +00:00
binary-husky
7497dcb852 catch comment source code exception 2024-10-30 11:40:47 +00:00
binary-husky
23ef2ffb22 feat: change arxiv io param 2024-10-27 16:54:29 +00:00
binary-husky
848d0f65c7 share paper network beta 2024-10-27 16:08:25 +00:00
Menghuan1918
f0b0364f74 修复并改进build with latex的Docker构建 (#2020)
* 改进构建文件

* 修复问题

* 更改docker注释,同时测试拉取大小
2024-10-27 23:17:03 +08:00
binary-husky
d7f0cbe68e Merge branch 'master' into frontier 2024-10-21 14:31:25 +00:00
binary-husky
69f3755682 adjust max_token_limit for pdf translation plugin 2024-10-21 14:31:11 +00:00
binary-husky
04c9077265 Merge branch 'papershare_beta' into frontier 2024-10-21 14:06:52 +00:00
binary-husky
6afd7db1e3 Merge branch 'master' into frontier 2024-10-21 14:06:23 +00:00
binary-husky
4727113243 update doc2x functions 2024-10-21 14:05:42 +00:00
binary-husky
42d10a9481 update doc2x functions 2024-10-21 14:05:05 +00:00
binary-husky
50a1ea83ef control whether to allow sharing translation results with GPTAC academic cloud. 2024-10-18 18:05:50 +00:00
binary-husky
a9c86a7fb8 pre 2024-10-18 14:16:24 +00:00
binary-husky
2b299cf579 Merge branch 'master' into frontier 2024-10-16 15:22:27 +00:00
wsg1873
310122f5a7 solve the concatenate error. (#2011) 2024-10-16 00:56:24 +08:00
binary-husky
0121cacc84 Merge branch 'master' into frontier 2024-10-15 09:10:36 +00:00
binary-husky
c83bf214d0 change arxiv download attempt url order 2024-10-15 09:09:24 +00:00
binary-husky
e34c49dce5 compat: deal with arxiv url change 2024-10-15 09:07:39 +00:00
binary-husky
f2dcd6ad55 compat: arxiv translation src shift 2024-10-15 09:06:57 +00:00
binary-husky
42d9712f20 Merge branch 'frontier' of github.com:binary-husky/chatgpt_academic into frontier 2024-10-15 08:24:01 +00:00
binary-husky
3890467c84 replace rm with rm -f 2024-10-15 07:32:29 +00:00
binary-husky
074b3c9828 explicitly declare default value 2024-10-15 06:41:12 +00:00
Nextstrain
b8e8457a01 关于o1系列模型无法正常请求的修复,多模型轮询KeyError: 'finish_reason'的修复 (#1992)
* Update bridge_all.py

* Update bridge_chatgpt.py

* Update bridge_chatgpt.py

* Update bridge_all.py

* Update bridge_all.py
2024-10-15 14:36:51 +08:00
binary-husky
2c93a24d7e fix dockerfile: try align python 2024-10-15 06:35:35 +00:00
binary-husky
e9af6ef3a0 fix: github action glitch 2024-10-15 06:32:47 +00:00
wsg1873
5ae8981dbb add the '/Fit' destination (#2009) 2024-10-14 22:50:56 +08:00
Boyin Liu
7f0ffa58f0 Boyin rag (#1983)
* first_version

* rag document support

* RAG interactive prompts added, issues resolved

* Resolve conflicts

* Resolve conflicts

* Resolve conflicts

* more file format support

* move import

* Resolve LlamaIndexRagWorker bug

* new resolve

* Address import  LlamaIndexRagWorker problem

* change import order

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2024-10-14 22:48:24 +08:00
binary-husky
adbed044e4 fix o1 compat problem 2024-10-13 17:02:07 +00:00
Menghuan1918
2fe5febaf0 为build-with-latex版本Docker构建新增arm64支持 (#1994)
* Add arm64 support

* Bug fix

* Some build bug fix

* Add arm support

* 分离arm和x86构建

* 改进构建文档

* update tags

* Update build-with-latex-arm.yml

* Revert "Update build-with-latex-arm.yml"

This reverts commit 9af92549b5.

* Update

* Add

* httpx

* Addison

* Update GithubAction+NoLocal+Latex

* Update docker-compose.yml and GithubAction+NoLocal+Latex

* Update README.md

* test math anim generation

* solve the pdf concatenate error. (#2006)

* solve the pdf concatenate error.

* add legacy fallback option

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>
Co-authored-by: binary-husky <qingxu.fu@outlook.com>
Co-authored-by: wsg1873 <wsg0326@163.com>
2024-10-14 00:25:28 +08:00
binary-husky
5888d038aa move import 2024-10-13 16:17:10 +00:00
binary-husky
ee8213e936 Merge branch 'boyin_rag' into frontier 2024-10-13 16:12:51 +00:00
binary-husky
a57dcbcaeb Merge branch 'frontier' of github.com:binary-husky/chatgpt_academic into frontier 2024-10-13 08:26:06 +00:00
binary-husky
b812392a9d Merge branch 'master' into frontier 2024-10-13 08:25:47 +00:00
wsg1873
f54d8e559a solve the pdf concatenate error. (#2006)
* solve the pdf concatenate error.

* add legacy fallback option

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2024-10-13 16:16:51 +08:00
lbykkkk
fce4fa1ec7 more file format support 2024-10-12 18:25:33 +00:00
Boyin Liu
d13f1e270c Merge branch 'master' into boyin_rag 2024-10-11 22:31:07 +08:00
lbykkkk
85cf3d08eb Resolve conflicts 2024-10-11 22:29:56 +08:00
lbykkkk
584e747565 Resolve conflicts 2024-10-11 22:27:57 +08:00
lbykkkk
02ba653c19 Resolve conflicts 2024-10-11 22:21:53 +08:00
binary-husky
e68fc2bc69 Merge branch 'master' of github.com:binary-husky/chatgpt_academic 2024-10-11 13:33:05 +00:00
binary-husky
f695d7f1da test math anim generation 2024-10-11 13:32:57 +00:00
lbykkkk
2d12b5b27d RAG interactive prompts added, issues resolved 2024-10-11 01:06:17 +08:00
binary-husky
679352d896 Update README.md 2024-10-10 13:38:35 +08:00
binary-husky
12c9ab1e33 Update README.md 2024-10-10 12:02:12 +08:00
binary-husky
a4bcd262f9 Merge branch 'master' into frontier 2024-10-07 05:20:49 +00:00
binary-husky
da4a5efc49 lazy load llama-index lib 2024-10-06 16:26:26 +00:00
binary-husky
9ac450cfb6 紧急修复 fix httpx breaking bad error 2024-10-06 15:02:14 +00:00
binary-husky
172f9e220b version 3.90 2024-10-05 16:51:08 +00:00
Boyin Liu
748e31102f Merge branch 'master' into boyin_rag 2024-10-05 23:58:43 +08:00
binary-husky
a28b7d8475 Merge branch 'master' of https://github.com/binary-husky/gpt_academic 2024-10-05 19:10:42 +08:00
binary-husky
7d3ed36899 fix: llama index deps verion limit 2024-10-05 19:10:38 +08:00
binary-husky
a7bc5fa357 remove out-dated jittor models 2024-10-05 10:58:45 +00:00
binary-husky
4f5dd9ebcf add temp solution for llama-index compat 2024-10-05 09:53:21 +00:00
binary-husky
427feb99d8 llama-index==0.10.5 2024-10-05 17:34:08 +08:00
binary-husky
a01ca93362 Merge Latest Frontier (#1991)
* logging sys to loguru: stage 1 complete

* import loguru: stage 2

* logging -> loguru: stage 3

* support o1-preview and o1-mini

* logging -> loguru stage 4

* update social helper

* logging -> loguru: final stage

* fix: console output

* update translation matrix

* fix: loguru argument error with proxy enabled (#1977)

* relax llama index version

* remove comment

* Added some modules to support openrouter (#1975)

* Added some modules for supporting openrouter model

Added some modules for supporting openrouter model

* Update config.py

* Update .gitignore

* Update bridge_openrouter.py

* Not changed actually

* Refactor logging in bridge_openrouter.py

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>

* remove logging extra

---------

Co-authored-by: Steven Moder <java20131114@gmail.com>
Co-authored-by: Ren Lifei <2602264455@qq.com>
2024-10-05 17:09:18 +08:00
binary-husky
97eef45ab7 Merge branch 'frontier' of github.com:binary-husky/chatgpt_academic into frontier 2024-10-01 11:59:14 +00:00
binary-husky
0c0e2acb9b remove logging extra 2024-10-01 11:57:47 +00:00
Ren Lifei
9fba8e0142 Added some modules to support openrouter (#1975)
* Added some modules for supporting openrouter model

Added some modules for supporting openrouter model

* Update config.py

* Update .gitignore

* Update bridge_openrouter.py

* Not changed actually

* Refactor logging in bridge_openrouter.py

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2024-09-28 18:05:34 +08:00
binary-husky
7d7867fb64 remove comment 2024-09-23 15:16:13 +00:00
lbykkkk
7ea791d83a rag document support 2024-09-22 21:37:57 +08:00
binary-husky
f9dbaa39fb Merge branch 'frontier' of github.com:binary-husky/chatgpt_academic into frontier 2024-09-21 15:40:24 +00:00
binary-husky
bbc2288c5b relax llama index version 2024-09-21 15:40:10 +00:00
Steven Moder
64ab916838 fix: loguru argument error with proxy enabled (#1977) 2024-09-21 23:32:00 +08:00
binary-husky
8fe559da9f update translation matrix 2024-09-21 14:56:10 +00:00
binary-husky
09fd22091a fix: console output 2024-09-21 14:41:36 +00:00
lbykkkk
df717f8bba first_version 2024-09-20 00:06:59 +08:00
binary-husky
e296719b23 Merge branch 'purge_print' into frontier 2024-09-16 09:56:25 +00:00
binary-husky
2f343179a2 logging -> loguru: final stage 2024-09-15 15:51:51 +00:00
binary-husky
4d9604f2e9 update social helper 2024-09-15 15:16:36 +00:00
binary-husky
597c320808 fix: system prompt err when using o1 models 2024-09-14 17:04:01 +00:00
binary-husky
18290fd138 fix: support o1 models 2024-09-14 17:00:02 +00:00
binary-husky
bbf9e9f868 logging -> loguru stage 4 2024-09-14 16:00:09 +00:00
binary-husky
0d0575a639 support o1-preview and o1-mini 2024-09-13 03:12:18 +00:00
binary-husky
aa1f967dd7 support o1-preview and o1-mini 2024-09-13 03:11:53 +00:00
binary-husky
0d082327c8 logging -> loguru: stage 3 2024-09-11 08:49:55 +00:00
binary-husky
80acd9c875 import loguru: stage 2 2024-09-11 08:18:01 +00:00
binary-husky
17cd4f8210 logging sys to loguru: stage 1 complete 2024-09-11 03:30:30 +00:00
binary-husky
4e041e1d4e Merge branch 'frontier': windows deps bug fix 2024-09-08 16:32:38 +00:00
binary-husky
7ef39770c7 fallback to simple vs in windows system 2024-09-09 00:27:02 +08:00
binary-husky
8222f638cf Merge branch 'frontier' 2024-09-08 15:46:13 +00:00
binary-husky
ab32c314ab change git ignore 2024-09-08 15:44:02 +00:00
binary-husky
dcfed97054 revise milvus rag 2024-09-08 15:43:01 +00:00
binary-husky
dd66ca26f7 Frontier (#1958)
* update welcome svg

* fix loading chatglm3 (#1937)

* update welcome svg

* update welcome message

* fix loading chatglm3

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* begin rag project with llama index

* rag version one

* rag beta release

* add social worker (proto)

* fix llamaindex version

---------

Co-authored-by: moetayuko <loli@yuko.moe>
2024-09-08 23:20:42 +08:00
binary-husky
8b91d2ac0a add milvus vector store 2024-09-08 15:19:03 +00:00
binary-husky
e4e00b713f fix llamaindex version 2024-09-05 05:21:10 +00:00
binary-husky
710a65522c add social worker (proto) 2024-09-02 15:55:06 +00:00
binary-husky
34784c1d40 Merge branch 'rag' into frontier 2024-09-02 15:01:12 +00:00
binary-husky
80b1a6f99b rag beta release 2024-09-02 15:00:47 +00:00
binary-husky
08c3c56f53 rag version one 2024-08-28 15:14:13 +00:00
binary-husky
294716c832 begin rag project with llama index 2024-08-21 14:24:37 +00:00
binary-husky
16f4fd636e update ref 2024-08-19 16:14:52 +00:00
binary-husky
e07caf7a69 update openai api key pattern 2024-08-19 15:59:20 +00:00
moetayuko
a95b3daab9 fix loading chatglm3 (#1937)
* update welcome svg

* update welcome message

* fix loading chatglm3

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>
2024-08-19 23:32:45 +08:00
binary-husky
4873e9dfdc update translation matrix 2024-08-12 13:50:37 +00:00
moetayuko
a119ab36fe fix enabling sparkv4 (#1936) 2024-08-12 21:45:08 +08:00
FatShibaInu
f9384e4e5f Add Support for Gemini 1.5 Pro & Gemini 1.5 Flash (#1926)
* Add Support for Gemini 1.5 Pro & 1.5 Flash.

* Update bridge_all.py

fix a spelling error in comments.

* Add Support for Gemini 1.5 Pro & Gemini 1.5 Flash
2024-08-12 21:44:24 +08:00
binary-husky
6fe5f6ee6e update welcome message 2024-08-05 11:37:06 +00:00
binary-husky
068d753426 update welcome svg 2024-08-04 15:59:09 +00:00
binary-husky
5010537f3c update welcome svg 2024-08-04 15:58:32 +00:00
binary-husky
f35f6633e0 fix: welcome card flip bug 2024-08-02 11:20:41 +00:00
hongyi-zhao
573dc4d184 Add claude-3-5-sonnet-20240620 (#1907)
See https://docs.anthropic.com/en/docs/about-claude/models#model-names fore model names.
2024-08-02 18:04:42 +08:00
binary-husky
da8b2d69ce update version 3.8 2024-08-02 10:02:04 +00:00
binary-husky
58e732c26f Merge branch 'frontier' 2024-08-02 09:50:40 +00:00
Menghuan1918
ca238daa8c 改进联网搜索插件-新增搜索模式,搜索增强 (#1874)
* Change default to Mixed option

* Add option optimizer

* Add search optimizer prompts

* Enhanced Processing

* Finish search_optimizer part

* prompts bug fix

* Bug fix
2024-07-23 00:55:48 +08:00
jiangfy-ihep
60b3491513 add gpt-4o-mini (#1904)
Co-authored-by: Fayu Jiang <jiangfayu@hotmail.com>
2024-07-23 00:55:34 +08:00
binary-husky
c1175bfb7d add flip card animation 2024-07-22 04:53:59 +00:00
binary-husky
b705afd5ff welcome menu bug fix 2024-07-22 04:35:52 +00:00
binary-husky
dfcd28abce add width_to_hide_welcome 2024-07-22 03:34:35 +00:00
binary-husky
1edaa9e234 hide when too narrow 2024-07-21 15:04:38 +00:00
binary-husky
f0cd617ec2 minor css improve 2024-07-20 10:29:47 +00:00
binary-husky
0b08bb2cea update svg 2024-07-20 07:15:08 +00:00
Keldos
d1f8607ac8 Update submit button dropdown style (#1900) 2024-07-20 14:50:56 +08:00
binary-husky
7eb68a2086 tune 2024-07-17 17:16:34 +00:00
binary-husky
ee9e99036a Merge branch 'frontier' of github.com:binary-husky/chatgpt_academic into frontier 2024-07-17 17:14:49 +00:00
binary-husky
55e255220b update 2024-07-17 17:12:32 +00:00
lbykkkk
019cd26ae8 Merge branch 'frontier' of https://github.com/binary-husky/gpt_academic into frontier 2024-07-18 00:35:51 +08:00
lbykkkk
a5b21d5cc0 修改content并统一logo颜色 2024-07-18 00:35:40 +08:00
binary-husky
ce940ff70f roll welcome msg 2024-07-17 16:34:24 +00:00
binary-husky
fc6a83c29f update 2024-07-17 15:44:08 +00:00
binary-husky
1d3212e367 reverse welcome msg 2024-07-17 15:43:41 +00:00
lbykkkk
8a835352a3 更新欢迎界面的用语和logo 2024-07-17 19:49:07 +08:00
binary-husky
5456c9fa43 improve welcome UI 2024-07-16 16:23:07 +00:00
binary-husky
ea67054c30 update chuanhu theme 2024-07-16 16:07:46 +00:00
binary-husky
1084108df6 adding welcome page 2024-07-16 10:41:25 +00:00
binary-husky
40c9700a8d add welcome page 2024-07-15 15:47:24 +00:00
binary-husky
6da5623813 多用途复用提交按钮 2024-07-15 04:23:43 +00:00
binary-husky
778c9cd9ec roll version 2024-07-15 03:29:56 +00:00
binary-husky
e290317146 proxy submit btn 2024-07-15 03:28:59 +00:00
binary-husky
85b92b7f07 move python comment agent to dropdown 2024-07-13 16:26:36 +00:00
binary-husky
ff899777ce improve source code comment plugin functionality 2024-07-13 16:20:17 +00:00
binary-husky
c1b8c773c3 stage compare source code comment 2024-07-13 15:28:53 +00:00
binary-husky
8747c48175 mt improvement 2024-07-12 08:26:40 +00:00
binary-husky
c0010c88bc implement auto comment 2024-07-12 07:36:40 +00:00
binary-husky
68838da8ad finish test 2024-07-12 04:19:07 +00:00
binary-husky
ca7de8fcdd version up 2024-07-10 02:00:36 +00:00
binary-husky
7ebc2d00e7 Merge branch 'master' into frontier 2024-07-09 03:19:35 +00:00
binary-husky
47fb81cfde Merge branch 'master' of github.com:binary-husky/chatgpt_academic 2024-07-09 03:18:19 +00:00
binary-husky
83961c1002 optimize image generation fn 2024-07-09 03:18:14 +00:00
binary-husky
a8621333af js impl bug fix 2024-07-08 15:50:12 +00:00
binary-husky
f402ef8134 hide ask btn 2024-07-08 15:15:30 +00:00
binary-husky
65d0f486f1 change cache to lru_cache for lower python version 2024-07-07 16:02:05 +00:00
binary-husky
41f25a6a9b Merge branch 'bold_frontier' into frontier 2024-07-04 14:16:08 +00:00
binary-husky
4a6a032334 ignore 2024-07-04 14:14:49 +00:00
binary-husky
f945a7bd19 preserve theme selection 2024-07-04 14:11:51 +00:00
binary-husky
379dcb2fa7 minor gui bug fix 2024-07-04 13:31:21 +00:00
Menghuan1918
114192e025 Bug fix: can not chat with deepseek (#1879) 2024-07-04 20:28:53 +08:00
binary-husky
30c905917a unify plugin calling 2024-07-02 15:32:40 +00:00
binary-husky
0c6c357e9c revise qwen 2024-07-02 14:22:45 +00:00
binary-husky
9d11b17f25 Merge branch 'master' into frontier 2024-07-02 08:06:34 +00:00
binary-husky
1d9e9fa6a1 new page btn 2024-07-01 16:27:23 +00:00
Menghuan1918
6cd2d80dfd Bug fix: Some non-standard forms of error return are not caught (#1877) 2024-07-01 20:35:49 +08:00
binary-husky
18d3245fc9 ready next gradio version 2024-06-29 15:29:48 +00:00
hcy2206
194e665a3b 增加了对于讯飞星火大模型Spark4.0的支持 (#1875) 2024-06-29 23:20:04 +08:00
binary-husky
7e201c5028 move test file to correct position 2024-06-28 08:23:40 +00:00
binary-husky
6babcb4a9c Merge branch 'master' into frontier 2024-06-27 06:52:03 +00:00
binary-husky
00e5a31b50 Merge branch 'master' of github.com:binary-husky/chatgpt_academic 2024-06-27 06:50:06 +00:00
binary-husky
d8b9686eeb fix latex auto correct 2024-06-27 06:49:36 +00:00
binary-husky
b7b4e201cb fix latex auto correct 2024-06-27 06:49:10 +00:00
binary-husky
26e7677dc3 fix new api for taichu 2024-06-26 15:18:11 +00:00
Menghuan1918
25e06de1b6 Docker build bug fix (#1870) 2024-06-26 14:31:31 +08:00
binary-husky
5e64a50898 Merge branch 'master' into frontier 2024-06-25 11:43:40 +00:00
binary-husky
0ad571e6b5 prevent further stream when reset is clicked 2024-06-25 11:43:14 +00:00
binary-husky
60a42fb070 Merge branch 'master' into frontier 2024-06-25 11:14:32 +00:00
binary-husky
ddad5247fc upgrade searxng 2024-06-25 11:12:51 +00:00
binary-husky
c94d5054a2 move fn 2024-06-25 08:53:28 +00:00
binary-husky
ececfb9b6e test new dropdown js code 2024-06-25 08:34:50 +00:00
binary-husky
9f13c5cedf update default value of scroller_max_len 2024-06-25 05:34:55 +00:00
binary-husky
68b36042ce re-locate plugin 2024-06-25 05:32:20 +00:00
binary-husky
cac6c50d2f roll version 2024-06-19 12:56:23 +00:00
binary-husky
f884eb43cf Merge branch 'master' into frontier 2024-06-19 12:56:04 +00:00
binary-husky
d37383dd4e change arxiv cache dir path 2024-06-19 12:49:34 +00:00
binary-husky
dfae4e8081 optimize scolling visual effect 2024-06-19 12:42:11 +00:00
binary-husky
15cc08505f resolve safe pickle err 2024-06-19 11:59:47 +00:00
iluem
c5a82f6ab7 Merge pull request from GHSA-3jrq-66fm-w7xr 2024-06-19 14:29:21 +08:00
binary-husky
768ed4514a minor formatting issue 2024-06-18 14:51:53 +00:00
binary-husky
9dfbff7fd0 Merge branch 'GHSA-3jrq-66fm-w7xr' into frontier 2024-06-18 10:19:10 +00:00
binary-husky
47cedde954 fix security issue GHSA-3jrq-66fm-w7xr 2024-06-18 10:18:33 +00:00
binary-husky
1e16485087 internet gpt minor bug fix 2024-06-16 15:16:24 +00:00
binary-husky
f3660d669f internet GPT upgrade 2024-06-16 14:10:38 +00:00
binary-husky
e6d1cb09cb Merge branch 'master' into frontier 2024-06-16 13:47:15 +00:00
binary-husky
12aebf9707 searxng based information gathering 2024-06-16 12:12:57 +00:00
binary-husky
0b5385e5e5 Merge branch 'master' of github.com:binary-husky/chatgpt_academic 2024-06-12 09:34:12 +00:00
binary-husky
2ff1a1fb0b update translation matrix 2024-06-12 09:34:05 +00:00
Yuki
cdadd38cf7 ️feat: block access to openapi references while running under fastapi (#1849)
- block fastapi openapi reference(swagger and redoc) routes
2024-06-10 22:26:46 +08:00
binary-husky
48e10fb10a Update README.md 2024-06-10 22:22:04 +08:00
binary-husky
ba484c55a0 Merge branch 'master' into frontier 2024-06-10 14:19:26 +00:00
Frank Lee
ca64a592f5 Update zhipu models (#1852) 2024-06-10 22:17:51 +08:00
Guoxin Sun
cb96ca132a Update common.js (#1854)
fix typo
2024-06-10 22:17:27 +08:00
binary-husky
737101b81d remove debug msg 2024-06-07 17:00:05 +00:00
binary-husky
612caa2f5f revise 2024-06-07 16:50:27 +00:00
binary-husky
85dbe4a4bf pdf processing improvement 2024-06-07 15:53:08 +00:00
binary-husky
2262a4d80a taichu model fix 2024-06-06 09:35:05 +00:00
binary-husky
b456ff02ab add note 2024-06-06 09:14:32 +00:00
binary-husky
24a21ae320 紫东太初大模型 2024-06-06 09:05:06 +00:00
binary-husky
3d5790cc2c resolve fallback to non-multimodal problem 2024-06-06 08:00:30 +00:00
binary-husky
7de6015800 multimodal support for gpt-4o etc 2024-06-06 07:36:37 +00:00
binary-husky
46428b7c7a Merge branch 'master' into frontier 2024-06-01 16:22:32 +00:00
binary-husky
66a50c8019 live2d shutdown bug fix 2024-06-01 16:21:04 +00:00
Menghuan1918
814dc943ac 将“生成多种图表”插件高级参数更新为二级菜单 (#1839)
* Improve the prompts

* Update to new meun form

* Bug fix (wrong type of plugin_kwargs)
2024-06-01 13:34:33 +08:00
binary-husky
96cd1f0b25 secondary menu main input sync bug fix 2024-05-31 04:13:27 +00:00
binary-husky
4fc17f4add Merge branch 'master' into frontier 2024-05-30 15:00:44 +00:00
binary-husky
b3665d8fec remove check 2024-05-30 14:54:50 +00:00
binary-husky
80c4281888 TTS Default Enable 2024-05-30 14:27:18 +00:00
binary-husky
beda56abb0 update dockerfile 2024-05-30 12:44:17 +00:00
binary-husky
cb16941d01 update css 2024-05-30 12:35:47 +00:00
binary-husky
5cf9ac7849 Merge branch 'master' into frontier 2024-05-29 16:06:28 +00:00
binary-husky
51ddb88ceb correct hint err 2024-05-29 16:05:23 +00:00
binary-husky
69dfe5d514 compat to old void-terminal plugin 2024-05-29 15:50:00 +00:00
binary-husky
6819f87512 Merge branch 'frontier' of github.com:binary-husky/chatgpt_academic into frontier 2024-05-23 16:35:20 +00:00
binary-husky
3d51b9d5bb compat baichuan 2024-05-23 16:35:15 +00:00
QiyuanChen
bff87ada92 添加对ERNIE-Speed和ERNIE-Lite模型的支持 (#1821)
* feat: add ERNIE-Speed and ERNIE-Lite

百度的ERNIE-Speed and ERNIE-Lite模型开始免费使用了,故添加了调用地址。可以使用ERNIE-Speed-128K,ERNIE-Speed-8K,ERNIE-Lite-8K来访问

* chore: Modify supported models in config.py

修改了config.py中千帆支持的模型列表,添加了三款免费模型
2024-05-24 00:16:26 +08:00
binary-husky
a938412b6f save conversation wrap 2024-05-23 15:58:59 +00:00
binary-husky
a48acf6fec Flex Btn Bug Fix 2024-05-22 08:38:40 +00:00
binary-husky
c6b9ab5214 add document 2024-05-22 06:39:56 +00:00
binary-husky
aa3332de69 add document 2024-05-22 06:27:26 +00:00
binary-husky
d43175d46d fix type hint 2024-05-21 13:18:38 +00:00
binary-husky
8ca9232db2 Merge branch 'master' into frontier 2024-05-21 12:27:01 +00:00
binary-husky
1339aa0e1a doc2x latex convertion 2024-05-21 12:24:50 +00:00
binary-husky
f41419e767 update demo 2024-05-21 11:12:08 +00:00
binary-husky
d88c585305 improve latex plugin 2024-05-21 10:47:50 +00:00
binary-husky
0a88d18c7a secondary menu for pdf trans 2024-05-21 08:51:29 +00:00
binary-husky
0d0edc2216 Merge branch 'frontier' of github.com:binary-husky/chatgpt_academic into frontier 2024-05-19 21:54:16 +08:00
binary-husky
5e0875fcf4 from backend to front end 2024-05-19 21:54:06 +08:00
Shixian Sheng
c508b84db8 更新了README.md/Update README.md (#1810) 2024-05-19 20:41:17 +08:00
Menghuan1918
f2b67602bb 为docker构建添加FFmpeg依赖 (#1807)
* Test: change dockerfile to install ffmpeg

* Add the ffmpeg to dockerfile (required by edge-tts)
2024-05-19 14:27:55 +08:00
binary-husky
29daba5d2f success? 2024-05-18 23:03:28 +08:00
binary-husky
9477824ac1 improve css 2024-05-18 21:54:15 +08:00
binary-husky
459c5b2d24 plugin refactor: phase 1 2024-05-18 20:23:50 +08:00
binary-husky
abf9b5aee5 Merge branch 'master' into frontier 2024-05-18 15:52:08 +08:00
binary-husky
2ce4482146 fix new ModelOverride fn bug 2024-05-18 15:47:25 +08:00
binary-husky
4282b83035 change TTS default to DISABLE 2024-05-18 15:43:35 +08:00
binary-husky
537be57c9b fix tts bugs 2024-05-17 21:07:28 +08:00
binary-husky
3aa92d6c80 change main ui hint 2024-05-17 11:34:13 +08:00
awwaawwa
b7eb9aba49 [Feature]: allow model mutex override in core_functional.py (#1708)
* allow_core_func_specify_model

* change arg name

* 模型覆盖支持热更新&当模型覆盖指向不存在的模型时报错

* allow model mutex override

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2024-05-17 11:15:23 +08:00
hongyi-zhao
881a596a30 model support (gpt4o) in project. (#1760)
* Add the environment variable: OPEN_BROWSER

* Add configurable browser launching with custom arguments

- Update `config.py` to include options for specifying the browser and its arguments for opening URLs.
- Modify `main.py` to use the configured browser settings from `config.py` to launch the web page.
- Enhance `config_loader.py` to process path-like strings by expanding and normalizing paths, which supports the configuration improvements.

* Add support for the following models:

"gpt-4o", "gpt-4o-2024-05-13"

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2024-05-14 17:01:32 +08:00
binary-husky
1b3c331d01 dos2unix 2024-05-14 12:02:40 +08:00
binary-husky
70d5f2a7df arg name err patch 2024-05-13 23:40:35 +08:00
Menghuan1918
fd2f8b9090 Provide a new fast and simple way of accessing APIs (As example: Yi-models,Deepseek) (#1782)
* deal with the message part

* Finish no_ui_connect

* finish predict part

* Delete old version

* An example of add new api

* Bug fix:can not change in "model_info"

* Bug fix

* Error message handling

* Clear the format

* An example of add a openai form API:Deepseek

* For compatibility reasons

* Feture: set different API/Endpoint to diferent models

* Add support for YI new models

* 更新doc2x的api key机制 (#1766)

* Fix DOC2X API key refresh issue in PDF translation

* remove add

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>

* 修改部分文件名、变量名

* patch err

---------

Co-authored-by: alex_xiao <113411296+Alex4210987@users.noreply.github.com>
Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2024-05-13 23:38:08 +08:00
binary-husky
225a2de011 Version 3.76 (#1752)
* version roll

* add upload processbar
2024-05-13 22:54:38 +08:00
binary-husky
6aea6d8e2b Merge branch 'master' into frontier 2024-05-13 22:52:15 +08:00
alex_xiao
8d85616c27 更新doc2x的api key机制 (#1766)
* Fix DOC2X API key refresh issue in PDF translation

* remove add

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2024-05-13 22:49:40 +08:00
binary-husky
e4533dd24d Merge branch 'master' into frontier 2024-05-04 17:00:09 +08:00
binary-husky
43ed8cb8a8 Fix fastapi version compat 2024-05-04 16:43:42 +08:00
binary-husky
3eff964424 Update README.md 2024-05-01 17:59:25 +08:00
OREEkE
ebde98b34b Update requirements.txt (#1753)
TTS_TYPE = "EDGE_TTS"需要的依赖
2024-05-01 14:55:04 +08:00
binary-husky
6f883031c0 Update config.py 2024-05-01 14:54:36 +08:00
binary-husky
fa15059f07 add upload processbar 2024-05-01 01:11:35 +08:00
binary-husky
685c573619 version roll 2024-04-30 21:00:25 +08:00
binary-husky
5fcd02506c version 3.75 (#1702)
* Update version to 3.74

* Add support for Yi Model API (#1635)

* 更新以支持零一万物模型

* 删除newbing

* 修改config

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>

* Refactor function signatures in bridge files

* fix qwen api change

* rename and ref functions

* rename and move some cookie functions

* 增加haiku模型,新增endpoint配置说明 (#1626)

* haiku added

* 新增haiku,新增endpoint配置说明

* Haiku added

* 将说明同步至最新Endpoint

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>

* private_upload目录下进行文件鉴权 (#1596)

* private_upload目录下进行文件鉴权

* minor fastapi adjustment

* Add logging functionality to enable saving
conversation records

* waiting to fix username retrieve

* support 2rd web path

* allow accessing default user dir

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>

* remove yaml deps

* fix favicon

* fix abs path auth problem

* forget to write a return

* add `dashscope` to deps

* fix GHSA-v9q9-xj86-953p

* 用户名重叠越权访问patch (#1681)

* add cohere model api access

* cohere + can_multi_thread

* fix block user access(fail)

* fix fastapi bug

* change cohere api endpoint

* explain version

* # fix com_zhipuglm.py illegal temperature problem (#1687)

* Update com_zhipuglm.py

# fix 用户在使用 zhipuai 界面时遇到了关于温度参数的非法参数错误

* allow store lm model dropdown

* add a btn to reverse previous reset

* remove extra fns

* Add support for glm-4v model (#1700)

* 修改chatglm3量化加载方式 (#1688)

Co-authored-by: zym9804 <ren990603@gmail.com>

* save chat stage 1

* consider null cookie situation

* 在点击复制按钮时激活语音

* miss some parts

* move all to js

* done first stage

* add edge tts

* bug fix

* bug fix

* remove console log

* bug fix

* bug fix

* bug fix

* audio switch

* update tts readme

* remove tempfile when done

* disable auto audio follow

* avoid play queue update after shut up

* feat: minimizing common.js

* improve tts functionality

* deterine whether the cached model is in choices

* Add support for Ollama (#1740)

* print err when doc2x not successful

* add icon

* adjust url for doc2x key version

* prepare merge

---------

Co-authored-by: Menghuan1918 <menghuan2003@outlook.com>
Co-authored-by: Skyzayre <120616113+Skyzayre@users.noreply.github.com>
Co-authored-by: XIao <46100050+Kilig947@users.noreply.github.com>
Co-authored-by: Yuki <903728862@qq.com>
Co-authored-by: zyren123 <91042213+zyren123@users.noreply.github.com>
Co-authored-by: zym9804 <ren990603@gmail.com>
2024-04-30 20:37:41 +08:00
binary-husky
bd5280df1b minor pdf translation adjustment 2024-04-30 00:52:36 +08:00
binary-husky
744759704d allow personal docx api access 2024-04-29 23:53:41 +08:00
WFS
81df0aa210 fix the issue of when using google Gemini pro, don't have chat histor… (#1743)
* fix the issue of when using google Gemini pro, don't have chat history record

just add chat_log in bridge_google_gmini.py

* Update bridge_google_gemini.py

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>
2024-04-25 22:26:32 +08:00
Menghuan1918
cadaa81030 Fix the bug cause Nougat can not use (#1738)
* Bug fix for nougat require pdf

* Fixing bugs in a simpler and safer way
2024-04-24 12:13:44 +08:00
binary-husky
3b6cbbdcb0 Update README.md (#1736) 2024-04-24 11:41:56 +08:00
binary-husky
52e49c48b8 the latest zhipuai whl is broken 2024-04-23 18:20:36 +08:00
binary-husky
6ad15a6129 fix equation showing problem 2024-04-22 01:54:03 +08:00
binary-husky
09990d44d3 merge to resolve multiple pickle security issues (#1728)
* 注释调试if分支

* support pdf url for latex translation

* Merge pull request from GHSA-mvrw-h7rc-22r8

* 注释调试if分支

* Improve objload security

* Update README.md

* support pdf url for latex translation

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>
Co-authored-by: binary-husky <qingxu.fu@outlook.com>

* fix import

---------

Co-authored-by: Longtaotao <longtaotao@bupt.edu.cn>
Co-authored-by: iluem <57590186+Qhaoduoyu@users.noreply.github.com>
2024-04-21 19:37:05 +08:00
binary-husky
eac5191815 Update README.md 2024-04-21 02:12:15 +08:00
owo
ae4407135d fix: 添加report_exception中缺失的a参数 (#1720)
在report_exception函数的定义中,参数a未包含默认值,因此应提供相应的值传入。
2024-04-18 16:27:00 +08:00
owo
f0e15bd710 fix: 修复了在else语句中调用'schema_str'之前未定义的问题 (#1719)
重新排列了方法中的条件返回语句,以确保在使用之前始终定义了'schema_str'。
2024-04-18 16:26:13 +08:00
jiangfy-ihep
5c5f442649 Fix: openai project API key pattern (#1721)
Co-authored-by: Fayu Jiang <jiangfayu@hotmail.com>
2024-04-18 16:24:29 +08:00
binary-husky
160552cc5f introduce doc2x 2024-04-15 01:57:31 +08:00
binary-husky
c131ec0b20 rename pdf plugin file name 2024-04-14 22:46:31 +08:00
iluem
2f3aeb7976 Merge pull request from GHSA-23cr-v6pm-j89p
* Update crazy_utils.py

Improve security

* add a white space

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>
2024-04-14 21:51:03 +08:00
binary-husky
eff5b89b98 scan first, then extract 2024-04-14 21:36:57 +08:00
iluem
f77ab27bc9 Merge pull request from GHSA-rh7j-jfvq-857j
Prevent path traversal for improved security
2024-04-14 21:33:37 +08:00
awwaawwa
ba0a8b7072 integrate gpt-4-turbo-2024-04-09 (#1698)
* 接入 gpt-4-turbo-2024-04-09 模型

* add gpt-4-turbo and change to vision

* add gpt-4-turbo to avail llm models

* 暂时将gpt-4-turbo接入至普通版本
2024-04-11 22:02:40 +08:00
hmp
2406022c2a access vllm 2024-04-11 22:00:07 +08:00
OREEkE
02b6f26b05 remove logging in gradios.py (#1699)
如果初始主题是HF社区主题,这里使用logging会导致程序不再写入日志(包括对话内容在内的任何记录),下载主题的日志输出和程序启动时的日志初始化有冲突。
2024-04-11 14:15:12 +08:00
OREEkE
2a003e8d49 add loadLive2D() when ADD_WAIFU = False (#1693)
ADD_WAIFU = False,浏览器会抛出错误:[Error] JQuery is not defined. 因为这时候没有jQuery库可用,却依然使用了loadLive2D()函数。现在加一个判断,如果ADD_WAIFU = False,禁用jQuery库的同时也禁用loadLive2D()函数,除非ADD_WAIFU = True
2024-04-10 00:10:53 +08:00
binary-husky
21891b0f6d update translate matrix 2024-04-08 12:43:24 +08:00
Yuki
163f12c533 # fix com_zhipuglm.py illegal temperature problem (#1687)
* Update com_zhipuglm.py

# fix 用户在使用 zhipuai 界面时遇到了关于温度参数的非法参数错误
2024-04-08 12:17:07 +08:00
binary-husky
bdd46c5dd1 Version 3.74: Merge latest updates on dev branch (frontier) (#1621)
* Update version to 3.74

* Add support for Yi Model API (#1635)

* 更新以支持零一万物模型

* 删除newbing

* 修改config

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>

* Refactor function signatures in bridge files

* fix qwen api change

* rename and ref functions

* rename and move some cookie functions

* 增加haiku模型,新增endpoint配置说明 (#1626)

* haiku added

* 新增haiku,新增endpoint配置说明

* Haiku added

* 将说明同步至最新Endpoint

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>

* private_upload目录下进行文件鉴权 (#1596)

* private_upload目录下进行文件鉴权

* minor fastapi adjustment

* Add logging functionality to enable saving
conversation records

* waiting to fix username retrieve

* support 2rd web path

* allow accessing default user dir

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>

* remove yaml deps

* fix favicon

* fix abs path auth problem

* forget to write a return

* add `dashscope` to deps

* fix GHSA-v9q9-xj86-953p

* 用户名重叠越权访问patch (#1681)

* add cohere model api access

* cohere + can_multi_thread

* fix block user access(fail)

* fix fastapi bug

* change cohere api endpoint

* explain version

---------

Co-authored-by: Menghuan1918 <menghuan2003@outlook.com>
Co-authored-by: Skyzayre <120616113+Skyzayre@users.noreply.github.com>
Co-authored-by: XIao <46100050+Kilig947@users.noreply.github.com>
2024-04-08 11:49:30 +08:00
binary-husky
ae51a0e686 fix GHSA-v9q9-xj86-953p 2024-04-05 20:47:11 +08:00
binary-husky
f2582ea137 fix qwen api change 2024-04-03 12:17:41 +08:00
binary-husky
ddd2fd84da fix checkbox bugs 2024-04-02 19:42:55 +08:00
binary-husky
6c90ff80ea add prompt and temperature to cookie 2024-04-02 18:02:00 +08:00
binary-husky
cb7c0703be Update requirements.txt (#1668) 2024-04-01 11:30:50 +08:00
binary-husky
5181cd441d change pip install url due to server failure (#1667) 2024-04-01 11:20:14 +08:00
binary-husky
216d4374e7 fix color list overflow 2024-04-01 00:11:32 +08:00
iluem
8af6c0cab6 Qhaoduoyu patch 1: pickle to json to increase security (#1648)
* Update theme.py

fix bugs

* Update theme.py

fix bugs

* change var names

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2024-03-25 09:54:30 +08:00
binary-husky
67ad041372 fix issue #1640 2024-03-20 18:09:37 +08:00
binary-husky
725c72229c update docker compose 2024-03-20 17:37:03 +08:00
Menghuan1918
e42ede512b Update Claude3 api request and fix some bugs (#1641)
* Update version to 3.74

* Add support for Yi Model API (#1635)

* 更新以支持零一万物模型

* 删除newbing

* 修改config

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>

* Update claude requrest to http type

* Update for endpoint

* Add support for other tpyes of pictures

* Update pip packages

* Fix console_slience issue while error handling

* revert version changes

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2024-03-20 17:22:23 +08:00
binary-husky
84ccc9e64c fix claude + oneapi error 2024-03-17 14:53:28 +08:00
binary-husky
c172847e19 add python annotations for toolbox functions 2024-03-16 22:54:33 +08:00
binary-husky
d166d25eb4 resolve invalid escape sequence warning
to support python3.12
2024-03-11 18:10:05 +08:00
binary-husky
516bbb1331 Update README.md 2024-03-11 17:40:16 +08:00
binary-husky
c3140ce344 merge frontier branch (#1620)
* Zhipu sdk update 适配最新的智谱SDK,支持GLM4v (#1502)

* 适配 google gemini 优化为从用户input中提取文件

* 适配最新的智谱SDK、支持glm-4v

* requirements.txt fix

* pending history check

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>

* Update "生成多种Mermaid图表" plugin: Separate out the file reading function (#1520)

* Update crazy_functional.py with new functionality deal with PDF

* Update crazy_functional.py and Mermaid.py for plugin_kwargs

* Update crazy_functional.py with new chart type: mind map

* Update SELECT_PROMPT and i_say_show_user messages

* Update ArgsReminder message in get_crazy_functions() function

* Update with read md file and update PROMPTS

* Return the PROMPTS as the test found that the initial version worked best

* Update Mermaid chart generation function

* version 3.71

* 解决issues #1510

* Remove unnecessary text from sys_prompt in 解析历史输入 function

* Remove sys_prompt message in 解析历史输入 function

* Update bridge_all.py: supports gpt-4-turbo-preview (#1517)

* Update bridge_all.py: supports gpt-4-turbo-preview

supports gpt-4-turbo-preview

* Update bridge_all.py

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* Update config.py: supports gpt-4-turbo-preview (#1516)

* Update config.py: supports gpt-4-turbo-preview

supports gpt-4-turbo-preview

* Update config.py

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* Refactor 解析历史输入 function to handle file input

* Update Mermaid chart generation functionality

* rename files and functions

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
Co-authored-by: hongyi-zhao <hongyi.zhao@gmail.com>
Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* 接入mathpix ocr功能 (#1468)

* Update Latex输出PDF结果.py

借助mathpix实现了PDF翻译中文并重新编译PDF

* Update config.py

add mathpix appid & appkey

* Add 'PDF翻译中文并重新编译PDF' feature to plugins.

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* fix zhipuai

* check picture

* remove glm-4 due to bug

* 修改config

* 检查MATHPIX_APPID

* Remove unnecessary code and update
function_plugins dictionary

* capture non-standard token overflow

* bug fix #1524

* change mermaid style

* 支持mermaid 滚动放大缩小重置,鼠标滚动和拖拽 (#1530)

* 支持mermaid 滚动放大缩小重置,鼠标滚动和拖拽

* 微调未果 先stage一下

* update

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* ver 3.72

* change live2d

* save the status of ``clear btn` in cookie

* 前端选择保持

* js ui bug fix

* reset btn bug fix

* update live2d tips

* fix missing get_token_num method

* fix live2d toggle switch

* fix persistent custom btn with cookie

* fix zhipuai feedback with core functionality

* Refactor button update and clean up functions

* tailing space removal

* Fix missing MATHPIX_APPID and MATHPIX_APPKEY
configuration

* Prompt fix、脑图提示词优化 (#1537)

* 适配 google gemini 优化为从用户input中提取文件

* 脑图提示词优化

* Fix missing MATHPIX_APPID and MATHPIX_APPKEY
configuration

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>

* 优化“PDF翻译中文并重新编译PDF”插件 (#1602)

* Add gemini_endpoint to API_URL_REDIRECT (#1560)

* Add gemini_endpoint to API_URL_REDIRECT

* Update gemini-pro and gemini-pro-vision model_info
endpoints

* Update to support new claude models (#1606)

* Add anthropic library and update claude models

* 更新bridge_claude.py文件,添加了对图片输入的支持。修复了一些bug。

* 添加Claude_3_Models变量以限制图片数量

* Refactor code to improve readability and
maintainability

* minor claude bug fix

* more flexible one-api support

* reformat config

* fix one-api new access bug

* dummy

* compat non-standard api

* version 3.73

---------

Co-authored-by: XIao <46100050+Kilig947@users.noreply.github.com>
Co-authored-by: Menghuan1918 <menghuan2003@outlook.com>
Co-authored-by: hongyi-zhao <hongyi.zhao@gmail.com>
Co-authored-by: Hao Ma <893017927@qq.com>
Co-authored-by: zeyuan huang <599012428@qq.com>
2024-03-11 17:26:09 +08:00
binary-husky
cd18663800 compat non-standard api - 2 2024-03-10 17:13:54 +08:00
binary-husky
dbf1322836 compat non-standard api 2024-03-10 17:07:59 +08:00
XIao
98dd3ae1c0 Moonshot- 在config.py中增加可用模型 (#1603)
* 支持月之暗面api

* fix文案

* 优化noui的返回值,对话历史文件继续上传到moonshat

* fix

* config 可用模型配置增加

* add `can_multi_thread` model attr (#1598)

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>
Co-authored-by: binary-husky <qingxu.fu@outlook.com>
2024-03-05 16:07:05 +08:00
binary-husky
3036709496 add can_multi_thread model attr (#1598) 2024-03-05 15:58:18 +08:00
XIao
8e9c07644f 支持月之暗面api,文件对话 (#1597)
* 支持月之暗面api

* fix文案
2024-03-03 23:42:17 +08:00
binary-husky
90d96b77e6 handle qianfan chat error 2024-02-29 00:36:06 +08:00
binary-husky
66c876a9ca Update README.md 2024-02-26 22:56:09 +08:00
binary-husky
0665eb75ed Update README.md (#1581) 2024-02-26 22:52:00 +08:00
binary-husky
6b784035fa Merge branch 'master' of github.com:binary-husky/chatgpt_academic 2024-02-25 21:13:56 +08:00
binary-husky
8bb3d84912 fix zip chinese file name error 2024-02-25 21:13:41 +08:00
binary-husky
a0193cf227 edit dep url 2024-02-23 13:28:49 +08:00
binary-husky
b72289bfb0 Fix missing MATHPIX_APPID and MATHPIX_APPKEY
configuration
2024-02-21 14:20:10 +08:00
Menghuan1918
bdfe3862eb 添加部分翻译 (#1566) 2024-02-21 14:14:06 +08:00
binary-husky
dae180b9ea update spark v3.5, fix glm parallel problem 2024-02-18 14:08:35 +08:00
binary-husky
e359fff040 Fix response message bug in bridge_qianfan.py,
bridge_qwen.py, and bridge_skylark2.py
2024-02-15 00:02:24 +08:00
binary-husky
2e9b4a5770 Merge Frontier, Update to Version 3.72 (#1553)
* Zhipu sdk update 适配最新的智谱SDK,支持GLM4v (#1502)

* 适配 google gemini 优化为从用户input中提取文件

* 适配最新的智谱SDK、支持glm-4v

* requirements.txt fix

* pending history check

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>

* Update "生成多种Mermaid图表" plugin: Separate out the file reading function (#1520)

* Update crazy_functional.py with new functionality deal with PDF

* Update crazy_functional.py and Mermaid.py for plugin_kwargs

* Update crazy_functional.py with new chart type: mind map

* Update SELECT_PROMPT and i_say_show_user messages

* Update ArgsReminder message in get_crazy_functions() function

* Update with read md file and update PROMPTS

* Return the PROMPTS as the test found that the initial version worked best

* Update Mermaid chart generation function

* version 3.71

* 解决issues #1510

* Remove unnecessary text from sys_prompt in 解析历史输入 function

* Remove sys_prompt message in 解析历史输入 function

* Update bridge_all.py: supports gpt-4-turbo-preview (#1517)

* Update bridge_all.py: supports gpt-4-turbo-preview

supports gpt-4-turbo-preview

* Update bridge_all.py

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* Update config.py: supports gpt-4-turbo-preview (#1516)

* Update config.py: supports gpt-4-turbo-preview

supports gpt-4-turbo-preview

* Update config.py

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* Refactor 解析历史输入 function to handle file input

* Update Mermaid chart generation functionality

* rename files and functions

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
Co-authored-by: hongyi-zhao <hongyi.zhao@gmail.com>
Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* 接入mathpix ocr功能 (#1468)

* Update Latex输出PDF结果.py

借助mathpix实现了PDF翻译中文并重新编译PDF

* Update config.py

add mathpix appid & appkey

* Add 'PDF翻译中文并重新编译PDF' feature to plugins.

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* fix zhipuai

* check picture

* remove glm-4 due to bug

* 修改config

* 检查MATHPIX_APPID

* Remove unnecessary code and update
function_plugins dictionary

* capture non-standard token overflow

* bug fix #1524

* change mermaid style

* 支持mermaid 滚动放大缩小重置,鼠标滚动和拖拽 (#1530)

* 支持mermaid 滚动放大缩小重置,鼠标滚动和拖拽

* 微调未果 先stage一下

* update

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* ver 3.72

* change live2d

* save the status of ``clear btn` in cookie

* 前端选择保持

* js ui bug fix

* reset btn bug fix

* update live2d tips

* fix missing get_token_num method

* fix live2d toggle switch

* fix persistent custom btn with cookie

* fix zhipuai feedback with core functionality

* Refactor button update and clean up functions

---------

Co-authored-by: XIao <46100050+Kilig947@users.noreply.github.com>
Co-authored-by: Menghuan1918 <menghuan2003@outlook.com>
Co-authored-by: hongyi-zhao <hongyi.zhao@gmail.com>
Co-authored-by: Hao Ma <893017927@qq.com>
Co-authored-by: zeyuan huang <599012428@qq.com>
2024-02-14 18:35:09 +08:00
binary-husky
e0c5859cf9 update Column min_width parameter 2024-02-12 23:37:31 +08:00
binary-husky
b9b1e12dc9 fix missing get_token_num method 2024-02-12 15:58:55 +08:00
binary-husky
8814026ec3 fix gradio-client version (#1548) 2024-02-09 13:25:01 +08:00
binary-husky
3025d5be45 remove jsdelivr (#1547) 2024-02-09 13:17:14 +08:00
binary-husky
6c13bb7b46 patch issue #1538 2024-02-06 17:59:09 +08:00
binary-husky
c27e559f10 match sess-* key 2024-02-06 17:51:47 +08:00
binary-husky
cdb5288f49 fix issue #1532 2024-02-02 17:47:35 +08:00
hongyi-zhao
49c6fcfe97 Update config.py: supports gpt-4-turbo-preview (#1516)
* Update config.py: supports gpt-4-turbo-preview

supports gpt-4-turbo-preview

* Update config.py

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>
2024-01-26 16:44:32 +08:00
hongyi-zhao
45fa0404eb Update bridge_all.py: supports gpt-4-turbo-preview (#1517)
* Update bridge_all.py: supports gpt-4-turbo-preview

supports gpt-4-turbo-preview

* Update bridge_all.py

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>
2024-01-26 16:36:23 +08:00
352 changed files with 51360 additions and 13943 deletions

6
.dockerignore Normal file
View File

@@ -0,0 +1,6 @@
.venv
.github
.vscode
gpt_log
tests
README.md

View File

@@ -1,44 +0,0 @@
# https://docs.github.com/en/actions/publishing-packages/publishing-docker-images#publishing-images-to-github-packages
name: build-with-chatglm
on:
push:
branches:
- 'master'
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}_chatglm_moss
jobs:
build-and-push-image:
runs-on: ubuntu-latest
permissions:
contents: read
packages: write
steps:
- name: Checkout repository
uses: actions/checkout@v3
- name: Log in to the Container registry
uses: docker/login-action@v2
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata (tags, labels) for Docker
id: meta
uses: docker/metadata-action@v4
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
- name: Build and push Docker image
uses: docker/build-push-action@v4
with:
context: .
push: true
file: docs/GithubAction+ChatGLM+Moss
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}

View File

@@ -1,44 +0,0 @@
# https://docs.github.com/en/actions/publishing-packages/publishing-docker-images#publishing-images-to-github-packages
name: build-with-jittorllms
on:
push:
branches:
- 'master'
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}_jittorllms
jobs:
build-and-push-image:
runs-on: ubuntu-latest
permissions:
contents: read
packages: write
steps:
- name: Checkout repository
uses: actions/checkout@v3
- name: Log in to the Container registry
uses: docker/login-action@v2
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata (tags, labels) for Docker
id: meta
uses: docker/metadata-action@v4
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
- name: Build and push Docker image
uses: docker/build-push-action@v4
with:
context: .
push: true
file: docs/GithubAction+JittorLLMs
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}

View File

@@ -1,14 +1,14 @@
# https://docs.github.com/en/actions/publishing-packages/publishing-docker-images#publishing-images-to-github-packages
name: build-with-all-capacity-beta
name: build-with-latex-arm
on:
push:
branches:
- 'master'
- "master"
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}_with_all_capacity_beta
IMAGE_NAME: ${{ github.repository }}_with_latex_arm
jobs:
build-and-push-image:
@@ -18,11 +18,17 @@ jobs:
packages: write
steps:
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Checkout repository
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Log in to the Container registry
uses: docker/login-action@v2
uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
@@ -35,10 +41,11 @@ jobs:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
- name: Build and push Docker image
uses: docker/build-push-action@v4
uses: docker/build-push-action@v6
with:
context: .
push: true
file: docs/GithubAction+AllCapacityBeta
platforms: linux/arm64
file: docs/GithubAction+NoLocal+Latex
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
labels: ${{ steps.meta.outputs.labels }}

View File

@@ -0,0 +1,56 @@
name: Create Conda Environment Package
on:
workflow_dispatch:
jobs:
build:
runs-on: windows-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Setup Miniconda
uses: conda-incubator/setup-miniconda@v3
with:
auto-activate-base: true
activate-environment: ""
- name: Create new Conda environment
shell: bash -l {0}
run: |
conda create -n gpt python=3.11 -y
conda activate gpt
- name: Install requirements
shell: bash -l {0}
run: |
conda activate gpt
pip install -r requirements.txt
- name: Install conda-pack
shell: bash -l {0}
run: |
conda activate gpt
conda install conda-pack -y
- name: Pack conda environment
shell: bash -l {0}
run: |
conda activate gpt
conda pack -n gpt -o gpt.tar.gz
- name: Create workspace zip
shell: pwsh
run: |
mkdir workspace
Get-ChildItem -Exclude "workspace" | Copy-Item -Destination workspace -Recurse
Remove-Item -Path workspace/.git* -Recurse -Force -ErrorAction SilentlyContinue
Copy-Item gpt.tar.gz workspace/ -Force
- name: Upload packed files
uses: actions/upload-artifact@v4
with:
name: gpt-academic-package
path: workspace

View File

@@ -7,7 +7,7 @@
name: 'Close stale issues and PRs'
on:
schedule:
- cron: '*/5 * * * *'
- cron: '*/30 * * * *'
jobs:
stale:
@@ -19,7 +19,6 @@ jobs:
steps:
- uses: actions/stale@v8
with:
stale-issue-message: 'This issue is stale because it has been open 100 days with no activity. Remove stale label or comment or this will be closed in 1 days.'
stale-issue-message: 'This issue is stale because it has been open 100 days with no activity. Remove stale label or comment or this will be closed in 7 days.'
days-before-stale: 100
days-before-close: 1
debug-only: true
days-before-close: 7

12
.gitignore vendored
View File

@@ -131,6 +131,9 @@ dmypy.json
# Pyre type checker
.pyre/
# macOS files
.DS_Store
.vscode
.idea
@@ -153,3 +156,12 @@ media
flagged
request_llms/ChatGLM-6b-onnx-u8s8
.pre-commit-config.yaml
test.*
temp.*
objdump*
*.min.*.js
TODO
experimental_mods
search_results
gg.docx
unstructured_reader.py

View File

@@ -3,32 +3,38 @@
# - 如何构建: 先修改 `config.py` 然后 `docker build -t gpt-academic . `
# - 如何运行(Linux下): `docker run --rm -it --net=host gpt-academic `
# - 如何运行(其他操作系统选择任意一个固定端口50923): `docker run --rm -it -e WEB_PORT=50923 -p 50923:50923 gpt-academic `
FROM python:3.11
FROM ghcr.io/astral-sh/uv:python3.12-bookworm
# 非必要步骤更换pip源 (以下三行,可以删除)
RUN echo '[global]' > /etc/pip.conf && \
echo 'index-url = https://mirrors.aliyun.com/pypi/simple/' >> /etc/pip.conf && \
echo 'trusted-host = mirrors.aliyun.com' >> /etc/pip.conf
# 语音输出功能以下1,2行更换阿里源第3,4行安装ffmpeg都可以删除
RUN sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list.d/debian.sources && \
sed -i 's/security.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list.d/debian.sources && \
apt-get update
RUN apt-get install ffmpeg -y
RUN apt-get clean
# 进入工作路径(必要)
WORKDIR /gpt
# 安装大部分依赖利用Docker缓存加速以后的构建 (以下三行,可以删除)
# 安装大部分依赖利用Docker缓存加速以后的构建 (以下两行,可以删除)
COPY requirements.txt ./
RUN pip3 install -r requirements.txt
RUN uv venv --python=3.12 && uv pip install --verbose -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/
ENV PATH="/gpt/.venv/bin:$PATH"
RUN python -c 'import loguru'
# 装载项目文件,安装剩余依赖(必要)
COPY . .
RUN pip3 install -r requirements.txt
RUN uv pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/
# # 非必要步骤,用于预热模块(可以删除)
RUN python -c 'from check_proxy import warm_up_modules; warm_up_modules()'
# 非必要步骤,用于预热模块(可以删除)
RUN python3 -c 'from check_proxy import warm_up_modules; warm_up_modules()'
ENV CGO_ENABLED=0
# 启动(必要)
CMD ["python3", "-u", "main.py"]
CMD ["bash", "-c", "python main.py"]

View File

@@ -1,8 +1,13 @@
> [!IMPORTANT]
> 2024.1.18: 更新3.70版本支持Mermaid绘图库让大模型绘制脑图
> 2024.1.17: 恭迎GLM4全力支持Qwen、GLM、DeepseekCoder等国内中文大语言基座模型
> 2024.1.17: 某些依赖包尚不兼容python 3.12推荐python 3.11。
> 2024.1.17: 安装依赖时,请选择`requirements.txt`中**指定的版本**。 安装命令:`pip install -r requirements.txt`。本项目完全开源免费,您可通过订阅[在线服务](https://github.com/binary-husky/gpt_academic/wiki/online)的方式鼓励本项目的发展。
> [!IMPORTANT]
> `master主分支`最新动态(2025.8.23): Dockerfile构建效率大幅优化
> `master主分支`最新动态(2025.7.31): 新GUI前端Coming Soon
>
> 2025.2.2: 三分钟快速接入最强qwen2.5-max[视频](https://www.bilibili.com/video/BV1LeFuerEG4)
> 2025.2.1: 支持自定义字体
> 2024.10.10: 突发停电,紧急恢复了提供[whl包](https://drive.google.com/drive/folders/14kR-3V-lIbvGxri4AHc8TpiA1fqsw7SK?usp=sharing)的文件服务器
> 2024.5.1: 加入Doc2x翻译PDF论文的功能[查看详情](https://github.com/binary-husky/gpt_academic/wiki/Doc2x)
> 2024.3.11: 全力支持Qwen、GLM、DeepseekCoder等中文大语言模型 SoVits语音克隆模块[查看详情](https://www.bilibili.com/video/BV1Rp421S7tF/)
> 2024.1.17: 安装依赖时,请选择`requirements.txt`中**指定的版本**。 安装命令:`pip install -r requirements.txt`。
<br>
@@ -58,7 +63,6 @@ Read this in [English](docs/README.English.md) | [日本語](docs/README.Japanes
⭐支持mermaid图像渲染 | 支持让GPT生成[流程图](https://www.bilibili.com/video/BV18c41147H9/)、状态转移图、甘特图、饼状图、GitGraph等等3.7版本)
⭐Arxiv论文精细翻译 ([Docker](https://github.com/binary-husky/gpt_academic/pkgs/container/gpt_academic_with_latex)) | [插件] 一键[以超高质量翻译arxiv论文](https://www.bilibili.com/video/BV1dz4y1v77A/),目前最好的论文翻译工具
⭐[实时语音对话输入](https://github.com/binary-husky/gpt_academic/blob/master/docs/use_audio.md) | [插件] 异步[监听音频](https://www.bilibili.com/video/BV1AV4y187Uy/),自动断句,自动寻找回答时机
⭐AutoGen多智能体插件 | [插件] 借助微软AutoGen探索多Agent的智能涌现可能
⭐虚空终端插件 | [插件] 能够使用自然语言直接调度本项目其他插件
润色、翻译、代码解释 | 一键润色、翻译、查找论文语法错误、解释代码
[自定义快捷键](https://www.bilibili.com/video/BV14s4y1E7jN) | 支持自定义快捷键
@@ -67,7 +71,7 @@ Read this in [English](docs/README.English.md) | [日本語](docs/README.Japanes
读论文、[翻译](https://www.bilibili.com/video/BV1KT411x7Wn)论文 | [插件] 一键解读latex/pdf论文全文并生成摘要
Latex全文[翻译](https://www.bilibili.com/video/BV1nk4y1Y7Js/)、[润色](https://www.bilibili.com/video/BV1FT411H7c5/) | [插件] 一键翻译或润色latex论文
批量注释生成 | [插件] 一键批量生成函数注释
Markdown[中英互译](https://www.bilibili.com/video/BV1yo4y157jV/) | [插件] 看到上面5种语言的[README](https://github.com/binary-husky/gpt_academic/blob/master/docs/README_EN.md)了吗?就是出自他的手笔
Markdown[中英互译](https://www.bilibili.com/video/BV1yo4y157jV/) | [插件] 看到上面5种语言的[README](https://github.com/binary-husky/gpt_academic/blob/master/docs/README.English.md)了吗?就是出自他的手笔
[PDF论文全文翻译功能](https://www.bilibili.com/video/BV1KT411x7Wn) | [插件] PDF论文提取题目&摘要+翻译全文(多线程)
[Arxiv小助手](https://www.bilibili.com/video/BV1LM4y1279X) | [插件] 输入arxiv文章url即可一键翻译摘要+下载PDF
Latex论文一键校对 | [插件] 仿Grammarly对Latex文章进行语法、拼写纠错+输出对照PDF
@@ -87,6 +91,10 @@ Latex论文一键校对 | [插件] 仿Grammarly对Latex文章进行语法、拼
<img src="https://user-images.githubusercontent.com/96192199/279702205-d81137c3-affd-4cd1-bb5e-b15610389762.gif" width="700" >
</div>
<div align="center">
<img src="https://github.com/binary-husky/gpt_academic/assets/96192199/70ff1ec5-e589-4561-a29e-b831079b37fb.gif" width="700" >
</div>
- 所有按钮都通过读取functional.py动态生成可随意加自定义功能解放剪贴板
<div align="center">
@@ -119,20 +127,20 @@ Latex论文一键校对 | [插件] 仿Grammarly对Latex文章进行语法、拼
```mermaid
flowchart TD
A{"安装方法"} --> W1("I. 🔑直接运行 (Windows, Linux or MacOS)")
W1 --> W11["1. Python pip包管理依赖"]
W1 --> W12["2. Anaconda包管理依赖推荐⭐"]
A{"安装方法"} --> W1("I 🔑直接运行 (Windows, Linux or MacOS)")
W1 --> W11["1 Python pip包管理依赖"]
W1 --> W12["2 Anaconda包管理依赖推荐⭐"]
A --> W2["II. 🐳使用Docker (Windows, Linux or MacOS)"]
A --> W2["II 🐳使用Docker (Windows, Linux or MacOS)"]
W2 --> k1["1. 部署项目全部能力的大镜像(推荐⭐)"]
W2 --> k2["2. 仅在线模型GPT, GLM4等镜像"]
W2 --> k3["3. 在线模型 + Latex的大镜像"]
W2 --> k1["1 部署项目全部能力的大镜像(推荐⭐)"]
W2 --> k2["2 仅在线模型GPT, GLM4等镜像"]
W2 --> k3["3 在线模型 + Latex的大镜像"]
A --> W4["IV. 🚀其他部署方法"]
W4 --> C1["1. Windows/MacOS 一键安装运行脚本(推荐⭐)"]
W4 --> C2["2. Huggingface, Sealos远程部署"]
W4 --> C4["3. ... 其他 ..."]
A --> W4["IV 🚀其他部署方法"]
W4 --> C1["1 Windows/MacOS 一键安装运行脚本(推荐⭐)"]
W4 --> C2["2 Huggingface, Sealos远程部署"]
W4 --> C4["3 其他 ..."]
```
### 安装方法I直接运行 (Windows, Linux or MacOS)
@@ -165,26 +173,32 @@ flowchart TD
```
<details><summary>如果需要支持清华ChatGLM2/复旦MOSS/RWKV作为后端请点击展开此处</summary>
<details><summary>如果需要支持清华ChatGLM系列/复旦MOSS/RWKV作为后端请点击展开此处</summary>
<p>
【可选步骤】如果需要支持清华ChatGLM3/复旦MOSS作为后端需要额外安装更多依赖前提条件熟悉Python + 用过Pytorch + 电脑配置够强):
【可选步骤】如果需要支持清华ChatGLM系列/复旦MOSS作为后端需要额外安装更多依赖前提条件熟悉Python + 用过Pytorch + 电脑配置够强):
```sh
# 【可选步骤I】支持清华ChatGLM3。清华ChatGLM备注如果遇到"Call ChatGLM fail 不能正常加载ChatGLM的参数" 错误,参考如下: 1以上默认安装的为torch+cpu版使用cuda需要卸载torch重新安装torch+cuda 2如因本机配置不够无法加载模型可以修改request_llm/bridge_chatglm.py中的模型精度, 将 AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True) 都修改为 AutoTokenizer.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True)
python -m pip install -r request_llms/requirements_chatglm.txt
# 【可选步骤II】支持复旦MOSS
# 【可选步骤II】支持清华ChatGLM4 注意此模型至少需要24G显存
python -m pip install -r request_llms/requirements_chatglm4.txt
# 可使用modelscope下载ChatGLM4模型
# pip install modelscope
# modelscope download --model ZhipuAI/glm-4-9b-chat --local_dir ./THUDM/glm-4-9b-chat
# 【可选步骤III】支持复旦MOSS
python -m pip install -r request_llms/requirements_moss.txt
git clone --depth=1 https://github.com/OpenLMLab/MOSS.git request_llms/moss # 注意执行此行代码时,必须处于项目根路径
# 【可选步骤III】支持RWKV Runner
# 【可选步骤IV】支持RWKV Runner
参考wikihttps://github.com/binary-husky/gpt_academic/wiki/%E9%80%82%E9%85%8DRWKV-Runner
# 【可选步骤IV】确保config.py配置文件的AVAIL_LLM_MODELS包含了期望的模型目前支持的全部模型如下(jittorllms系列目前仅支持docker方案)
# 【可选步骤V】确保config.py配置文件的AVAIL_LLM_MODELS包含了期望的模型目前支持的全部模型如下(jittorllms系列目前仅支持docker方案)
AVAIL_LLM_MODELS = ["gpt-3.5-turbo", "api2d-gpt-3.5-turbo", "gpt-4", "api2d-gpt-4", "chatglm", "moss"] # + ["jittorllms_rwkv", "jittorllms_pangualpha", "jittorllms_llama"]
# 【可选步骤V】支持本地模型INT8,INT4量化这里所指的模型本身不是量化版本目前deepseek-coder支持后面测试后会加入更多模型量化选择
# 【可选步骤VI】支持本地模型INT8,INT4量化这里所指的模型本身不是量化版本目前deepseek-coder支持后面测试后会加入更多模型量化选择
pip install bitsandbyte
# windows用户安装bitsandbytes需要使用下面bitsandbytes-windows-webui
python -m pip install bitsandbytes --prefer-binary --extra-index-url=https://jllllll.github.io/bitsandbytes-windows-webui
@@ -253,8 +267,7 @@ P.S. 如果需要依赖Latex的插件功能请见Wiki。另外您也可以
# Advanced Usage
### I自定义新的便捷按钮学术快捷键
任意文本编辑器打开`core_functional.py`,添加如下条目,然后重启程序。(如果按钮已存在,那么可以直接修改(前缀、后缀都已支持热修改),无需重启程序即可生效。)
例如
现在已可以通过UI中的`界面外观`菜单中的`自定义菜单`添加新的便捷按钮。如果需要在代码中定义,请使用任意文本编辑器打开`core_functional.py`,添加如下条目即可:
```python
"超级英译中": {
@@ -413,7 +426,6 @@ timeline LR
1. `master` 分支: 主分支,稳定版
2. `frontier` 分支: 开发分支,测试版
3. 如何[接入其他大模型](request_llms/README.md)
4. 访问GPT-Academic的[在线服务并支持我们](https://github.com/binary-husky/gpt_academic/wiki/online)
### V参考与学习

View File

@@ -1,37 +1,77 @@
from loguru import logger
def check_proxy(proxies):
def check_proxy(proxies, return_ip=False):
"""
检查代理配置并返回结果。
Args:
proxies (dict): 包含http和https代理配置的字典。
return_ip (bool, optional): 是否返回代理的IP地址。默认为False。
Returns:
str or None: 检查的结果信息或代理的IP地址如果`return_ip`为True
"""
import requests
proxies_https = proxies['https'] if proxies is not None else ''
ip = None
try:
response = requests.get("https://ipapi.co/json/", proxies=proxies, timeout=4)
response = requests.get("https://ipapi.co/json/", proxies=proxies, timeout=4) # ⭐ 执行GET请求以获取代理信息
data = response.json()
if 'country_name' in data:
country = data['country_name']
result = f"代理配置 {proxies_https}, 代理所在地:{country}"
if 'ip' in data:
ip = data['ip']
elif 'error' in data:
alternative = _check_with_backup_source(proxies)
alternative, ip = _check_with_backup_source(proxies) # ⭐ 调用备用方法检查代理配置
if alternative is None:
result = f"代理配置 {proxies_https}, 代理所在地未知IP查询频率受限"
else:
result = f"代理配置 {proxies_https}, 代理所在地:{alternative}"
else:
result = f"代理配置 {proxies_https}, 代理数据解析失败:{data}"
print(result)
return result
if not return_ip:
logger.warning(result)
return result
else:
return ip
except:
result = f"代理配置 {proxies_https}, 代理所在地查询超时,代理可能无效"
print(result)
return result
if not return_ip:
logger.warning(result)
return result
else:
return ip
def _check_with_backup_source(proxies):
"""
通过备份源检查代理,并获取相应信息。
Args:
proxies (dict): 包含代理信息的字典。
Returns:
tuple: 代理信息(geo)和IP地址(ip)的元组。
"""
import random, string, requests
random_string = ''.join(random.choices(string.ascii_letters + string.digits, k=32))
try: return requests.get(f"http://{random_string}.edns.ip-api.com/json", proxies=proxies, timeout=4).json()['dns']['geo']
except: return None
try:
res_json = requests.get(f"http://{random_string}.edns.ip-api.com/json", proxies=proxies, timeout=4).json() # ⭐ 执行代理检查和备份源请求
return res_json['dns']['geo'], res_json['dns']['ip']
except:
return None, None
def backup_and_download(current_version, remote_version):
"""
一键更新协议:备份和下载
一键更新协议:备份当前版本,下载远程版本并解压缩。
Args:
current_version (str): 当前版本号。
remote_version (str): 远程版本号。
Returns:
str: 新版本目录的路径。
"""
from toolbox import get_conf
import shutil
@@ -47,8 +87,8 @@ def backup_and_download(current_version, remote_version):
shutil.copytree('./', backup_dir, ignore=lambda x, y: ['history'])
proxies = get_conf('proxies')
try: r = requests.get('https://github.com/binary-husky/chatgpt_academic/archive/refs/heads/master.zip', proxies=proxies, stream=True)
except: r = requests.get('https://public.gpt-academic.top/publish/master.zip', proxies=proxies, stream=True)
zip_file_path = backup_dir+'/master.zip'
except: r = requests.get('https://public.agent-matrix.com/publish/master.zip', proxies=proxies, stream=True)
zip_file_path = backup_dir+'/master.zip' # ⭐ 保存备份文件的路径
with open(zip_file_path, 'wb+') as f:
f.write(r.content)
dst_path = new_version_dir
@@ -64,6 +104,17 @@ def backup_and_download(current_version, remote_version):
def patch_and_restart(path):
"""
一键更新协议:覆盖和重启
Args:
path (str): 新版本代码所在的路径
注意事项:
如果您的程序没有使用config_private.py私密配置文件则会将config.py重命名为config_private.py以避免配置丢失。
更新流程:
- 复制最新版本代码到当前目录
- 更新pip包依赖
- 如果更新失败,则提示手动安装依赖库并重启
"""
from distutils import dir_util
import shutil
@@ -71,33 +122,44 @@ def patch_and_restart(path):
import sys
import time
import glob
from colorful import print亮黄, print亮绿, print亮红
# if not using config_private, move origin config.py as config_private.py
from shared_utils.colorful import log亮黄, log亮绿, log亮红
if not os.path.exists('config_private.py'):
print亮黄('由于您没有设置config_private.py私密配置现将您的现有配置移动至config_private.py以防止配置丢失',
log亮黄('由于您没有设置config_private.py私密配置现将您的现有配置移动至config_private.py以防止配置丢失',
'另外您可以随时在history子文件夹下找回旧版的程序。')
shutil.copyfile('config.py', 'config_private.py')
path_new_version = glob.glob(path + '/*-master')[0]
dir_util.copy_tree(path_new_version, './')
print亮绿('代码已经更新即将更新pip包依赖……')
for i in reversed(range(5)): time.sleep(1); print(i)
try:
dir_util.copy_tree(path_new_version, './') # ⭐ 将最新版本代码复制到当前目录
log亮绿('代码已经更新即将更新pip包依赖……')
for i in reversed(range(5)): time.sleep(1); log亮绿(i)
try:
import subprocess
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-r', 'requirements.txt'])
except:
print亮红('pip包依赖安装出现问题需要手动安装新增的依赖库 `python -m pip install -r requirements.txt`,然后在用常规的`python main.py`的方式启动。')
print亮绿('更新完成您可以随时在history子文件夹下找回旧版的程序5s之后重启')
print亮红('假如重启失败,您可能需要手动安装新增的依赖库 `python -m pip install -r requirements.txt`,然后在用常规的`python main.py`的方式启动。')
print(' ------------------------------ -----------------------------------')
for i in reversed(range(8)): time.sleep(1); print(i)
os.execl(sys.executable, sys.executable, *sys.argv)
log亮红('pip包依赖安装出现问题需要手动安装新增的依赖库 `python -m pip install -r requirements.txt`,然后在用常规的`python main.py`的方式启动。')
log亮绿('更新完成您可以随时在history子文件夹下找回旧版的程序5s之后重启')
log亮红('假如重启失败,您可能需要手动安装新增的依赖库 `python -m pip install -r requirements.txt`,然后在用常规的`python main.py`的方式启动。')
log亮绿(' ------------------------------ -----------------------------------')
for i in reversed(range(8)): time.sleep(1); log亮绿(i)
os.execl(sys.executable, sys.executable, *sys.argv) # 重启程序
def get_current_version():
"""
获取当前的版本号。
Returns:
str: 当前的版本号。如果无法获取版本号,则返回空字符串。
"""
import json
try:
with open('./version', 'r', encoding='utf8') as f:
current_version = json.loads(f.read())['version']
current_version = json.loads(f.read())['version'] # ⭐ 从读取的json数据中提取版本号
except:
current_version = ""
return current_version
@@ -106,6 +168,12 @@ def get_current_version():
def auto_update(raise_error=False):
"""
一键更新协议:查询版本和用户意见
Args:
raise_error (bool, optional): 是否在出错时抛出错误。默认为 False。
Returns:
None
"""
try:
from toolbox import get_conf
@@ -113,7 +181,7 @@ def auto_update(raise_error=False):
import json
proxies = get_conf('proxies')
try: response = requests.get("https://raw.githubusercontent.com/binary-husky/chatgpt_academic/master/version", proxies=proxies, timeout=5)
except: response = requests.get("https://public.gpt-academic.top/publish/version", proxies=proxies, timeout=5)
except: response = requests.get("https://public.agent-matrix.com/publish/version", proxies=proxies, timeout=5)
remote_json_data = json.loads(response.text)
remote_version = remote_json_data['version']
if remote_json_data["show_feature"]:
@@ -124,22 +192,22 @@ def auto_update(raise_error=False):
current_version = f.read()
current_version = json.loads(current_version)['version']
if (remote_version - current_version) >= 0.01-1e-5:
from colorful import print亮黄
print亮黄(f'\n新版本可用。新版本:{remote_version},当前版本:{current_version}{new_feature}')
print('1Github更新地址:\nhttps://github.com/binary-husky/chatgpt_academic\n')
from shared_utils.colorful import log亮黄
log亮黄(f'\n新版本可用。新版本:{remote_version},当前版本:{current_version}{new_feature}') # ⭐ 在控制台打印新版本信息
logger.info('1Github更新地址:\nhttps://github.com/binary-husky/chatgpt_academic\n')
user_instruction = input('2是否一键更新代码Y+回车=确认,输入其他/无输入+回车=不更新)?')
if user_instruction in ['Y', 'y']:
path = backup_and_download(current_version, remote_version)
path = backup_and_download(current_version, remote_version) # ⭐ 备份并下载文件
try:
patch_and_restart(path)
patch_and_restart(path) # ⭐ 执行覆盖并重启操作
except:
msg = '更新失败。'
if raise_error:
from toolbox import trimmed_format_exc
msg += trimmed_format_exc()
print(msg)
logger.warning(msg)
else:
print('自动更新程序:已禁用')
logger.info('自动更新程序:已禁用')
return
else:
return
@@ -148,10 +216,13 @@ def auto_update(raise_error=False):
if raise_error:
from toolbox import trimmed_format_exc
msg += trimmed_format_exc()
print(msg)
logger.info(msg)
def warm_up_modules():
print('正在执行一些模块的预热 ...')
"""
预热模块,加载特定模块并执行预热操作。
"""
logger.info('正在执行一些模块的预热 ...')
from toolbox import ProxyNetworkActivate
from request_llms.bridge_all import model_info
with ProxyNetworkActivate("Warmup_Modules"):
@@ -159,18 +230,70 @@ def warm_up_modules():
enc.encode("模块预热", disallowed_special=())
enc = model_info["gpt-4"]['tokenizer']
enc.encode("模块预热", disallowed_special=())
try_warm_up_vectordb()
# def try_warm_up_vectordb():
# try:
# import os
# import nltk
# target = os.path.expanduser('~/nltk_data')
# logger.info(f'模块预热: nltk punkt (从Github下载部分文件到 {target})')
# nltk.data.path.append(target)
# nltk.download('punkt', download_dir=target)
# logger.info('模块预热完成: nltk punkt')
# except:
# logger.exception('模块预热: nltk punkt 失败,可能需要手动安装 nltk punkt')
# logger.error('模块预热: nltk punkt 失败,可能需要手动安装 nltk punkt')
def try_warm_up_vectordb():
import os
import nltk
target = os.path.expanduser('~/nltk_data')
nltk.data.path.append(target)
try:
# 尝试加载 punkt
logger.info(f'nltk模块预热')
nltk.data.find('tokenizers/punkt')
nltk.data.find('tokenizers/punkt_tab')
nltk.data.find('taggers/averaged_perceptron_tagger_eng')
logger.info('nltk模块预热完成读取本地缓存')
except:
# 如果找不到,则尝试下载
try:
logger.info(f'模块预热: nltk punkt (从 Github 下载部分文件到 {target})')
from shared_utils.nltk_downloader import Downloader
_downloader = Downloader()
_downloader.download('punkt', download_dir=target)
_downloader.download('punkt_tab', download_dir=target)
_downloader.download('averaged_perceptron_tagger_eng', download_dir=target)
logger.info('nltk模块预热完成')
except Exception:
logger.exception('模块预热: nltk punkt 失败,可能需要手动安装 nltk punkt')
def warm_up_vectordb():
print('正在执行一些模块的预热 ...')
"""
执行一些模块的预热操作。
本函数主要用于执行一些模块的预热操作,确保在后续的流程中能够顺利运行。
⭐ 关键作用:预热模块
Returns:
None
"""
logger.info('正在执行一些模块的预热 ...')
from toolbox import ProxyNetworkActivate
with ProxyNetworkActivate("Warmup_Modules"):
import nltk
with ProxyNetworkActivate("Warmup_Modules"): nltk.download("punkt")
if __name__ == '__main__':
import os
os.environ['no_proxy'] = '*' # 避免代理网络产生意外污染
from toolbox import get_conf
proxies = get_conf('proxies')
check_proxy(proxies)
check_proxy(proxies)

266
config.py
View File

@@ -2,45 +2,80 @@
以下所有配置也都支持利用环境变量覆写环境变量配置格式见docker-compose.yml。
读取优先级:环境变量 > config_private.py > config.py
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
All the following configurations also support using environment variables to override,
and the environment variable configuration format can be seen in docker-compose.yml.
All the following configurations also support using environment variables to override,
and the environment variable configuration format can be seen in docker-compose.yml.
Configuration reading priority: environment variable > config_private.py > config.py
"""
# [step 1]>> API_KEY = "sk-123456789xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx123456789"。极少数情况下还需要填写组织格式如org-123456789abcdefghijklmno的请向下翻找 API_ORG 设置项
API_KEY = "此处填API密钥" # 可同时填写多个API-KEY用英文逗号分割例如API_KEY = "sk-openaikey1,sk-openaikey2,fkxxxx-api2dkey3,azure-apikey4"
# [step 1-1]>> ( 接入OpenAI模型家族 ) API_KEY = "sk-123456789xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx123456789"。极少数情况下还需要填写组织格式如org-123456789abcdefghijklmno的请向下翻找 API_ORG 设置项
API_KEY = "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" # 可同时填写多个API-KEY用英文逗号分割例如API_KEY = "sk-openaikey1,sk-openaikey2,fkxxxx-api2dkey3,azure-apikey4"
# [step 1-2]>> ( 强烈推荐!接入通义家族 & 大模型服务平台百炼 ) 接入通义千问在线大模型api-key获取地址 https://dashscope.console.aliyun.com/
DASHSCOPE_API_KEY = "" # 阿里灵积云API_KEY用于接入qwen-maxdashscope-qwen3-14bdashscope-deepseek-r1等
# [step 2]>> 改为True应用代理如果直接在海外服务器部署此处不修改如果使用本地或无地域限制的大模型时此处也不需要修改
# [step 1-3]>> ( 接入 deepseek-reasoner, 即 deepseek-r1 ) 深度求索(DeepSeek) API KEY默认请求地址为"https://api.deepseek.com/v1/chat/completions"
DEEPSEEK_API_KEY = "sk-d99b8cc6b7414cc88a5d950a3ff7585e"
# [step 2]>> 改为True应用代理。如果使用本地或无地域限制的大模型时此处不修改如果直接在海外服务器部署此处不修改
USE_PROXY = False
if USE_PROXY:
"""
代理网络的地址,打开你的代理软件查看代理协议(socks5h / http)、地址(localhost)和端口(11284)
填写格式是 [协议]:// [地址] :[端口]填写之前不要忘记把USE_PROXY改成True如果直接在海外服务器部署此处不修改
<配置教程&视频教程> https://github.com/binary-husky/gpt_academic/issues/1>
[协议] 常见协议无非socks5h/http; 例如 v2**y 和 ss* 的默认本地协议是socks5h; 而cl**h 的默认本地协议是http
[地址] 填localhost或者127.0.0.1localhost意思是代理软件安装在本机上
[端口] 在代理软件的设置里找。虽然不同的代理软件界面不一样,但端口号都应该在最显眼的位置上
"""
proxies = {
# [协议]:// [地址] :[端口]
"http": "socks5h://localhost:11284", # 再例如 "http": "http://127.0.0.1:7890",
"https": "socks5h://localhost:11284", # 再例如 "https": "http://127.0.0.1:7890",
"http":"socks5h://192.168.8.9:1070", # 再例如 "http": "http://127.0.0.1:7890",
"https":"socks5h://192.168.8.9:1070", # 再例如 "https": "http://127.0.0.1:7890",
}
else:
proxies = None
# ------------------------------------ 以下配置可以优化体验, 但大部分场合下并不需要修改 ------------------------------------
# [step 3]>> 模型选择是 (注意: LLM_MODEL是默认选中的模型, 它*必须*被包含在AVAIL_LLM_MODELS列表中 )
LLM_MODEL = "gpt-4" # 可选 ↓↓↓
AVAIL_LLM_MODELS = ["qwen-max", "o1-mini", "o1-mini-2024-09-12", "o1", "o1-2024-12-17", "o1-preview", "o1-preview-2024-09-12",
"gpt-4-1106-preview", "gpt-4-turbo-preview", "gpt-4-vision-preview",
"gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "gpt-4-turbo-2024-04-09",
"gpt-3.5-turbo-1106", "gpt-3.5-turbo-16k", "gpt-3.5-turbo", "azure-gpt-3.5",
"gpt-4", "gpt-4-32k", "azure-gpt-4", "glm-4", "glm-4v", "glm-3-turbo",
"gemini-1.5-pro", "chatglm3", "chatglm4",
"deepseek-chat", "deepseek-coder", "deepseek-reasoner",
"volcengine-deepseek-r1-250120", "volcengine-deepseek-v3-241226",
"dashscope-deepseek-r1", "dashscope-deepseek-v3",
"dashscope-qwen3-14b", "dashscope-qwen3-235b-a22b", "dashscope-qwen3-32b",
]
EMBEDDING_MODEL = "text-embedding-3-small"
# --- --- --- ---
# P.S. 其他可用的模型还包括
# AVAIL_LLM_MODELS = [
# "glm-4-0520", "glm-4-air", "glm-4-airx", "glm-4-flash",
# "qianfan", "deepseekcoder",
# "spark", "sparkv2", "sparkv3", "sparkv3.5", "sparkv4",
# "qwen-turbo", "qwen-plus", "qwen-local",
# "moonshot-v1-128k", "moonshot-v1-32k", "moonshot-v1-8k",
# "gpt-3.5-turbo-0613", "gpt-3.5-turbo-16k-0613", "gpt-3.5-turbo-0125", "gpt-4o-2024-05-13"
# "claude-3-haiku-20240307","claude-3-sonnet-20240229","claude-3-opus-20240229", "claude-2.1", "claude-instant-1.2",
# "moss", "llama2", "chatglm_onnx", "internlm", "jittorllms_pangualpha", "jittorllms_llama",
# "deepseek-chat" ,"deepseek-coder",
# "gemini-1.5-flash",
# "yi-34b-chat-0205","yi-34b-chat-200k","yi-large","yi-medium","yi-spark","yi-large-turbo","yi-large-preview",
# "grok-beta",
# ]
# --- --- --- ---
# 此外您还可以在接入one-api/vllm/ollama/Openroute时
# 使用"one-api-*","vllm-*","ollama-*","openrouter-*"前缀直接使用非标准方式接入的模型,例如
# AVAIL_LLM_MODELS = ["one-api-claude-3-sonnet-20240229(max_token=100000)", "ollama-phi3(max_token=4096)","openrouter-openai/gpt-4o-mini","openrouter-openai/chatgpt-4o-latest"]
# --- --- --- ---
# --------------- 以下配置可以优化体验 ---------------
# 重新URL重新定向实现更换API_URL的作用高危设置! 常规情况下不要修改! 通过修改此设置您将把您的API-KEY和对话隐私完全暴露给您设定的中间人
# 格式: API_URL_REDIRECT = {"https://api.openai.com/v1/chat/completions": "在这里填写重定向的api.openai.com的URL"}
# 举例: API_URL_REDIRECT = {"https://api.openai.com/v1/chat/completions": "https://reverse-proxy-url/v1/chat/completions"}
# 格式: API_URL_REDIRECT = {"https://api.openai.com/v1/chat/completions": "在这里填写重定向的api.openai.com的URL"}
# 举例: API_URL_REDIRECT = {"https://api.openai.com/v1/chat/completions": "https://reverse-proxy-url/v1/chat/completions", "http://localhost:11434/api/chat": "在这里填写您ollama的URL"}
API_URL_REDIRECT = {}
# 多线程函数插件中默认允许多少路线程同时访问OpenAI。Free trial users的限制是每分钟3次Pay-as-you-go users的限制是每分钟3500次
# 一言以蔽之免费5刀用户填3OpenAI绑了信用卡的用户可以填 16 或者更高。提高限制请查询https://platform.openai.com/docs/guides/rate-limits/overview
DEFAULT_WORKER_NUM = 3
DEFAULT_WORKER_NUM = 8
# 色彩主题, 可选 ["Default", "Chuanhu-Small-and-Beautiful", "High-Contrast"]
@@ -48,6 +83,31 @@ DEFAULT_WORKER_NUM = 3
THEME = "Default"
AVAIL_THEMES = ["Default", "Chuanhu-Small-and-Beautiful", "High-Contrast", "Gstaff/Xkcd", "NoCrypt/Miku"]
FONT = "Theme-Default-Font"
AVAIL_FONTS = [
"默认值(Theme-Default-Font)",
"宋体(SimSun)",
"黑体(SimHei)",
"楷体(KaiTi)",
"仿宋(FangSong)",
"华文细黑(STHeiti Light)",
"华文楷体(STKaiti)",
"华文仿宋(STFangsong)",
"华文宋体(STSong)",
"华文中宋(STZhongsong)",
"华文新魏(STXinwei)",
"华文隶书(STLiti)",
# 备注:以下字体需要网络支持,您可以自定义任意您喜欢的字体,如下所示,需要满足的格式为 "字体昵称(字体英文真名@字体css下载链接)"
"思源宋体(Source Han Serif CN VF@https://chinese-fonts-cdn.deno.dev/packages/syst/dist/SourceHanSerifCN/result.css)",
"月星楷(Moon Stars Kai HW@https://chinese-fonts-cdn.deno.dev/packages/moon-stars-kai/dist/MoonStarsKaiHW-Regular/result.css)",
"珠圆体(MaokenZhuyuanTi@https://chinese-fonts-cdn.deno.dev/packages/mkzyt/dist/猫啃珠圆体/result.css)",
"平方萌萌哒(PING FANG MENG MNEG DA@https://chinese-fonts-cdn.deno.dev/packages/pfmmd/dist/平方萌萌哒/result.css)",
"Helvetica",
"ui-sans-serif",
"sans-serif",
"system-ui"
]
# 默认的系统提示词system prompt
INIT_SYS_PROMPT = "Serve me as a writing and programming assistant."
@@ -66,7 +126,7 @@ LAYOUT = "LEFT-RIGHT" # "LEFT-RIGHT"(左右布局) # "TOP-DOWN"(上下
# 暗色模式 / 亮色模式
DARK_MODE = True
DARK_MODE = True
# 发送请求到OpenAI后等待多久判定为超时
@@ -74,31 +134,20 @@ TIMEOUT_SECONDS = 30
# 网页的端口, -1代表随机端口
WEB_PORT = -1
WEB_PORT = 19998
# 是否自动打开浏览器页面
AUTO_OPEN_BROWSER = True
# 如果OpenAI不响应网络卡顿、代理失败、KEY失效重试的次数限制
MAX_RETRY = 2
MAX_RETRY = 3
# 插件分类默认选项
DEFAULT_FN_GROUPS = ['对话', '编程', '学术', '智能体']
# 模型选择是 (注意: LLM_MODEL是默认选中的模型, 它*必须*被包含在AVAIL_LLM_MODELS列表中 )
LLM_MODEL = "gpt-3.5-turbo" # 可选 ↓↓↓
AVAIL_LLM_MODELS = ["gpt-3.5-turbo-1106","gpt-4-1106-preview","gpt-4-vision-preview",
"gpt-3.5-turbo-16k", "gpt-3.5-turbo", "azure-gpt-3.5",
"gpt-4", "gpt-4-32k", "azure-gpt-4", "api2d-gpt-4",
"gemini-pro", "chatglm3", "claude-2", "zhipuai"]
# P.S. 其他可用的模型还包括 [
# "moss", "qwen-turbo", "qwen-plus", "qwen-max"
# "zhipuai", "qianfan", "deepseekcoder", "llama2", "qwen-local", "gpt-3.5-turbo-0613",
# "gpt-3.5-turbo-16k-0613", "gpt-3.5-random", "api2d-gpt-3.5-turbo", 'api2d-gpt-3.5-turbo-16k',
# "spark", "sparkv2", "sparkv3", "chatglm_onnx", "claude-1-100k", "claude-2", "internlm", "jittorllms_pangualpha", "jittorllms_llama"
# ]
# 定义界面上“询问多个GPT模型”插件应该使用哪些模型请从AVAIL_LLM_MODELS中选择并在不同模型之间用`&`间隔,例如"gpt-3.5-turbo&chatglm3&azure-gpt-4"
MULTI_QUERY_LLM_MODELS = "gpt-3.5-turbo&chatglm3"
@@ -109,16 +158,15 @@ MULTI_QUERY_LLM_MODELS = "gpt-3.5-turbo&chatglm3"
QWEN_LOCAL_MODEL_SELECTION = "Qwen/Qwen-1_8B-Chat-Int8"
# 接入通义千问在线大模型 https://dashscope.console.aliyun.com/
DASHSCOPE_API_KEY = "" # 阿里灵积云API_KEY
# 百度千帆LLM_MODEL="qianfan"
BAIDU_CLOUD_API_KEY = ''
BAIDU_CLOUD_SECRET_KEY = ''
BAIDU_CLOUD_QIANFAN_MODEL = 'ERNIE-Bot' # 可选 "ERNIE-Bot-4"(文心大模型4.0), "ERNIE-Bot"(文心一言), "ERNIE-Bot-turbo", "BLOOMZ-7B", "Llama-2-70B-Chat", "Llama-2-13B-Chat", "Llama-2-7B-Chat"
BAIDU_CLOUD_QIANFAN_MODEL = 'ERNIE-Bot' # 可选 "ERNIE-Bot-4"(文心大模型4.0), "ERNIE-Bot"(文心一言), "ERNIE-Bot-turbo", "BLOOMZ-7B", "Llama-2-70B-Chat", "Llama-2-13B-Chat", "Llama-2-7B-Chat", "ERNIE-Speed-128K", "ERNIE-Speed-8K", "ERNIE-Lite-8K"
# 如果使用ChatGLM3或ChatGLM4本地模型请把 LLM_MODEL="chatglm3" 或LLM_MODEL="chatglm4",并在此处指定模型路径
CHATGLM_LOCAL_MODEL_PATH = "THUDM/glm-4-9b-chat" # 例如"/home/hmp/ChatGLM3-6B/"
# 如果使用ChatGLM2微调模型请把 LLM_MODEL="chatglmft",并在此处指定模型路径
CHATGLM_PTUNING_CHECKPOINT = "" # 例如"/home/hmp/ChatGLM2-6B/ptuning/output/6b-pt-128-1e-2/checkpoint-100"
@@ -127,6 +175,7 @@ CHATGLM_PTUNING_CHECKPOINT = "" # 例如"/home/hmp/ChatGLM2-6B/ptuning/output/6b
LOCAL_MODEL_DEVICE = "cpu" # 可选 "cuda"
LOCAL_MODEL_QUANT = "FP16" # 默认 "FP16" "INT4" 启用量化INT4版本 "INT8" 启用量化INT8版本
# 设置gradio的并行线程数不需要修改
CONCURRENT_COUNT = 100
@@ -136,7 +185,7 @@ AUTO_CLEAR_TXT = False
# 加一个live2d装饰
ADD_WAIFU = False
ADD_WAIFU = True
# 设置用户名和密码不需要修改相关功能不稳定与gradio版本和网络都相关如果本地使用不建议加这个
@@ -144,7 +193,8 @@ ADD_WAIFU = False
AUTHENTICATION = []
# 如果需要在二级路径下运行(常规情况下,不要修改!!需要配合修改main.py才能生效!
# 如果需要在二级路径下运行(常规情况下,不要修改!!
# (举例 CUSTOM_PATH = "/gpt_academic",可以让软件运行在 http://ip:port/gpt_academic/ 下。)
CUSTOM_PATH = "/"
@@ -158,7 +208,7 @@ API_ORG = ""
# 如果需要使用Slack Claude使用教程详情见 request_llms/README.md
SLACK_CLAUDE_BOT_ID = ''
SLACK_CLAUDE_BOT_ID = ''
SLACK_CLAUDE_USER_TOKEN = ''
@@ -172,14 +222,8 @@ AZURE_ENGINE = "填入你亲手写的部署名" # 读 docs\use_azure.
AZURE_CFG_ARRAY = {}
# 使用Newbing (不推荐使用,未来将删除)
NEWBING_STYLE = "creative" # ["creative", "balanced", "precise"]
NEWBING_COOKIES = """
put your new bing cookies here
"""
# 阿里云实时语音识别 配置难度较高 仅建议高手用户使用 参考 https://github.com/binary-husky/gpt_academic/blob/master/docs/use_audio.md
# 阿里云实时语音识别 配置难度较高
# 参考 https://github.com/binary-husky/gpt_academic/blob/master/docs/use_audio.md
ENABLE_AUDIO = False
ALIYUN_TOKEN="" # 例如 f37f30e0f9934c34a992f6f64f7eba4f
ALIYUN_APPKEY="" # 例如 RoPlZrM88DnAFkZK
@@ -187,6 +231,12 @@ ALIYUN_ACCESSKEY="" # (无需填写)
ALIYUN_SECRET="" # (无需填写)
# GPT-SOVITS 文本转语音服务的运行地址(将语言模型的生成文本朗读出来)
TTS_TYPE = "EDGE_TTS" # EDGE_TTS / LOCAL_SOVITS_API / DISABLE
GPT_SOVITS_URL = ""
EDGE_TTS_VOICE = "zh-CN-XiaoxiaoNeural"
# 接入讯飞星火大模型 https://console.xfyun.cn/services/iat
XFYUN_APPID = "00000000"
XFYUN_API_SECRET = "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
@@ -195,19 +245,40 @@ XFYUN_API_KEY = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
# 接入智谱大模型
ZHIPUAI_API_KEY = ""
ZHIPUAI_MODEL = "glm-4" # 可选 "glm-3-turbo" "glm-4"
# # 火山引擎YUNQUE大模型
# YUNQUE_SECRET_KEY = ""
# YUNQUE_ACCESS_KEY = ""
# YUNQUE_MODEL = ""
ZHIPUAI_MODEL = "" # 此选项已废弃,不再需要填写
# Claude API KEY
ANTHROPIC_API_KEY = ""
# 月之暗面 API KEY
MOONSHOT_API_KEY = ""
# 零一万物(Yi Model) API KEY
YIMODEL_API_KEY = ""
# 接入火山引擎的在线大模型)api-key获取地址 https://console.volcengine.com/ark/region:ark+cn-beijing/endpoint
ARK_API_KEY = "00000000-0000-0000-0000-000000000000" # 火山引擎 API KEY
# 紫东太初大模型 https://ai-maas.wair.ac.cn
TAICHU_API_KEY = ""
# Grok API KEY
GROK_API_KEY = ""
# Mathpix 拥有执行PDF的OCR功能但是需要注册账号
MATHPIX_APPID = ""
MATHPIX_APPKEY = ""
# DOC2X的PDF解析服务注册账号并获取API KEY: https://doc2x.noedgeai.com/login
DOC2X_API_KEY = ""
# 自定义API KEY格式
CUSTOM_API_KEY_PATTERN = ""
@@ -224,11 +295,15 @@ HUGGINGFACE_ACCESS_TOKEN = "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"
# 获取方法复制以下空间https://huggingface.co/spaces/qingxu98/grobid设为public然后GROBID_URL = "https://(你的hf用户名如qingxu98)-(你的填写的空间名如grobid).hf.space"
GROBID_URLS = [
"https://qingxu98-grobid.hf.space","https://qingxu98-grobid2.hf.space","https://qingxu98-grobid3.hf.space",
"https://qingxu98-grobid4.hf.space","https://qingxu98-grobid5.hf.space", "https://qingxu98-grobid6.hf.space",
"https://qingxu98-grobid7.hf.space", "https://qingxu98-grobid8.hf.space",
"https://qingxu98-grobid4.hf.space","https://qingxu98-grobid5.hf.space", "https://qingxu98-grobid6.hf.space",
"https://qingxu98-grobid7.hf.space", "https://qingxu98-grobid8.hf.space",
]
# Searxng互联网检索服务这是一个huggingface空间请前往huggingface复制该空间然后把自己新的空间地址填在这里
SEARXNG_URLS = [ f"https://kaletianlre-beardvs{i}dd.hf.space/" for i in range(1,5) ]
# 是否允许通过自然语言描述修改本页的配置,该功能具有一定的危险性,默认关闭
ALLOW_RESET_CONFIG = False
@@ -237,21 +312,21 @@ ALLOW_RESET_CONFIG = False
AUTOGEN_USE_DOCKER = False
# 临时的上传文件夹位置,请修改
# 临时的上传文件夹位置,请尽量不要修改
PATH_PRIVATE_UPLOAD = "private_upload"
# 日志文件夹的位置,请修改
# 日志文件夹的位置,请尽量不要修改
PATH_LOGGING = "gpt_log"
# 除了连接OpenAI之外还有哪些场合允许使用代理请勿修改
WHEN_TO_USE_PROXY = ["Download_LLM", "Download_Gradio_Theme", "Connect_Grobid",
"Warmup_Modules", "Nougat_Download", "AutoGen"]
# 存储翻译好的arxiv论文的路径请尽量不要修改
ARXIV_CACHE_DIR = "gpt_log/arxiv_cache"
# *实验性功能*: 自动检测并屏蔽失效的KEY请勿使用
BLOCK_INVALID_APIKEY = False
# 除了连接OpenAI之外还有哪些场合允许使用代理请尽量不要修改
WHEN_TO_USE_PROXY = ["Connect_OpenAI", "Download_LLM", "Download_Gradio_Theme", "Connect_Grobid",
"Warmup_Modules", "Nougat_Download", "AutoGen", "Connect_OpenAI_Embedding"]
# 启用插件热加载
@@ -261,7 +336,32 @@ PLUGIN_HOT_RELOAD = False
# 自定义按钮的最大数量限制
NUM_CUSTOM_BASIC_BTN = 4
# 媒体智能体的服务地址这是一个huggingface空间请前往huggingface复制该空间然后把自己新的空间地址填在这里
DAAS_SERVER_URLS = [ f"https://niuziniu-biligpt{i}.hf.space/stream" for i in range(1,5) ]
# 在互联网搜索组件中负责将搜索结果整理成干净的Markdown
JINA_API_KEY = ""
# SEMANTIC SCHOLAR API KEY
SEMANTIC_SCHOLAR_KEY = ""
# 是否自动裁剪上下文长度(是否启动,默认不启动)
AUTO_CONTEXT_CLIP_ENABLE = False
# 目标裁剪上下文的token长度如果超过这个长度则会自动裁剪
AUTO_CONTEXT_CLIP_TRIGGER_TOKEN_LEN = 30*1000
# 无条件丢弃x以上的轮数
AUTO_CONTEXT_MAX_ROUND = 64
# 在裁剪上下文时倒数第x次对话能“最多”保留的上下文token的比例占 AUTO_CONTEXT_CLIP_TRIGGER_TOKEN_LEN 的多少
AUTO_CONTEXT_MAX_CLIP_RATIO = [0.80, 0.60, 0.45, 0.25, 0.20, 0.18, 0.16, 0.14, 0.12, 0.10, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01]
"""
--------------- 配置关联关系说明 ---------------
在线大模型配置关联关系示意图
├── "gpt-3.5-turbo" 等openai模型
@@ -285,7 +385,7 @@ NUM_CUSTOM_BASIC_BTN = 4
│ ├── XFYUN_API_SECRET
│ └── XFYUN_API_KEY
├── "claude-1-100k" 等claude模型
├── "claude-3-opus-20240229" 等claude模型
│ └── ANTHROPIC_API_KEY
├── "stack-claude"
@@ -297,9 +397,11 @@ NUM_CUSTOM_BASIC_BTN = 4
│ ├── BAIDU_CLOUD_API_KEY
│ └── BAIDU_CLOUD_SECRET_KEY
├── "zhipuai" 智谱AI大模型chatglm_turbo
── ZHIPUAI_API_KEY
└── ZHIPUAI_MODEL
├── "glm-4", "glm-3-turbo", "zhipuai" 智谱AI大模型
── ZHIPUAI_API_KEY
├── "yi-34b-chat-0205", "yi-34b-chat-200k" 等零一万物(Yi Model)大模型
│ └── YIMODEL_API_KEY
├── "qwen-turbo" 等通义千问大模型
│ └── DASHSCOPE_API_KEY
@@ -307,13 +409,15 @@ NUM_CUSTOM_BASIC_BTN = 4
├── "Gemini"
│ └── GEMINI_API_KEY
└── "newbing" Newbing接口不再稳定不推荐使用
├── NEWBING_STYLE
── NEWBING_COOKIES
└── "one-api-...(max_token=...)" 用一种更方便的方式接入one-api多模型管理界面
├── AVAIL_LLM_MODELS
── API_KEY
└── API_URL_REDIRECT
本地大模型示意图
├── "chatglm4"
├── "chatglm3"
├── "chatglm"
├── "chatglm_onnx"
@@ -343,6 +447,9 @@ NUM_CUSTOM_BASIC_BTN = 4
插件在线服务配置依赖关系示意图
├── 互联网检索
│ └── SEARXNG_URLS
├── 语音功能
│ ├── ENABLE_AUDIO
│ ├── ALIYUN_TOKEN
@@ -351,6 +458,9 @@ NUM_CUSTOM_BASIC_BTN = 4
│ └── ALIYUN_SECRET
└── PDF文档精准解析
── GROBID_URLS
── GROBID_URLS
├── MATHPIX_APPID
└── MATHPIX_APPKEY
"""

466
config_private.py Normal file
View File

@@ -0,0 +1,466 @@
"""
以下所有配置也都支持利用环境变量覆写环境变量配置格式见docker-compose.yml。
读取优先级:环境变量 > config_private.py > config.py
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
All the following configurations also support using environment variables to override,
and the environment variable configuration format can be seen in docker-compose.yml.
Configuration reading priority: environment variable > config_private.py > config.py
"""
# [step 1-1]>> ( 接入OpenAI模型家族 ) API_KEY = "sk-123456789xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx123456789"。极少数情况下还需要填写组织格式如org-123456789abcdefghijklmno的请向下翻找 API_ORG 设置项
API_KEY = "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" # 可同时填写多个API-KEY用英文逗号分割例如API_KEY = "sk-openaikey1,sk-openaikey2,fkxxxx-api2dkey3,azure-apikey4"
# [step 1-2]>> ( 强烈推荐!接入通义家族 & 大模型服务平台百炼 ) 接入通义千问在线大模型api-key获取地址 https://dashscope.console.aliyun.com/
DASHSCOPE_API_KEY = "" # 阿里灵积云API_KEY用于接入qwen-maxdashscope-qwen3-14bdashscope-deepseek-r1等
# [step 1-3]>> ( 接入 deepseek-reasoner, 即 deepseek-r1 ) 深度求索(DeepSeek) API KEY默认请求地址为"https://api.deepseek.com/v1/chat/completions"
DEEPSEEK_API_KEY = "sk-d99b8cc6b7414cc88a5d950a3ff7585e"
# [step 2]>> 改为True应用代理。如果使用本地或无地域限制的大模型时此处不修改如果直接在海外服务器部署此处不修改
USE_PROXY = False
if USE_PROXY:
proxies = {
"http":"socks5h://192.168.8.9:1070", # 再例如 "http": "http://127.0.0.1:7890",
"https":"socks5h://192.168.8.9:1070", # 再例如 "https": "http://127.0.0.1:7890",
}
else:
proxies = None
# [step 3]>> 模型选择是 (注意: LLM_MODEL是默认选中的模型, 它*必须*被包含在AVAIL_LLM_MODELS列表中 )
LLM_MODEL = "gpt-4" # 可选 ↓↓↓
AVAIL_LLM_MODELS = ["qwen-max", "o1-mini", "o1-mini-2024-09-12", "o1", "o1-2024-12-17", "o1-preview", "o1-preview-2024-09-12",
"gpt-4-1106-preview", "gpt-4-turbo-preview", "gpt-4-vision-preview",
"gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "gpt-4-turbo-2024-04-09",
"gpt-3.5-turbo-1106", "gpt-3.5-turbo-16k", "gpt-3.5-turbo", "azure-gpt-3.5",
"gpt-4", "gpt-4-32k", "azure-gpt-4", "glm-4", "glm-4v", "glm-3-turbo",
"gemini-1.5-pro", "chatglm3", "chatglm4",
"deepseek-chat", "deepseek-coder", "deepseek-reasoner",
"volcengine-deepseek-r1-250120", "volcengine-deepseek-v3-241226",
"dashscope-deepseek-r1", "dashscope-deepseek-v3",
"dashscope-qwen3-14b", "dashscope-qwen3-235b-a22b", "dashscope-qwen3-32b",
]
EMBEDDING_MODEL = "text-embedding-3-small"
# --- --- --- ---
# P.S. 其他可用的模型还包括
# AVAIL_LLM_MODELS = [
# "glm-4-0520", "glm-4-air", "glm-4-airx", "glm-4-flash",
# "qianfan", "deepseekcoder",
# "spark", "sparkv2", "sparkv3", "sparkv3.5", "sparkv4",
# "qwen-turbo", "qwen-plus", "qwen-local",
# "moonshot-v1-128k", "moonshot-v1-32k", "moonshot-v1-8k",
# "gpt-3.5-turbo-0613", "gpt-3.5-turbo-16k-0613", "gpt-3.5-turbo-0125", "gpt-4o-2024-05-13"
# "claude-3-haiku-20240307","claude-3-sonnet-20240229","claude-3-opus-20240229", "claude-2.1", "claude-instant-1.2",
# "moss", "llama2", "chatglm_onnx", "internlm", "jittorllms_pangualpha", "jittorllms_llama",
# "deepseek-chat" ,"deepseek-coder",
# "gemini-1.5-flash",
# "yi-34b-chat-0205","yi-34b-chat-200k","yi-large","yi-medium","yi-spark","yi-large-turbo","yi-large-preview",
# "grok-beta",
# ]
# --- --- --- ---
# 此外您还可以在接入one-api/vllm/ollama/Openroute时
# 使用"one-api-*","vllm-*","ollama-*","openrouter-*"前缀直接使用非标准方式接入的模型,例如
# AVAIL_LLM_MODELS = ["one-api-claude-3-sonnet-20240229(max_token=100000)", "ollama-phi3(max_token=4096)","openrouter-openai/gpt-4o-mini","openrouter-openai/chatgpt-4o-latest"]
# --- --- --- ---
# --------------- 以下配置可以优化体验 ---------------
# 重新URL重新定向实现更换API_URL的作用高危设置! 常规情况下不要修改! 通过修改此设置您将把您的API-KEY和对话隐私完全暴露给您设定的中间人
# 格式: API_URL_REDIRECT = {"https://api.openai.com/v1/chat/completions": "在这里填写重定向的api.openai.com的URL"}
# 举例: API_URL_REDIRECT = {"https://api.openai.com/v1/chat/completions": "https://reverse-proxy-url/v1/chat/completions", "http://localhost:11434/api/chat": "在这里填写您ollama的URL"}
API_URL_REDIRECT = {}
# 多线程函数插件中默认允许多少路线程同时访问OpenAI。Free trial users的限制是每分钟3次Pay-as-you-go users的限制是每分钟3500次
# 一言以蔽之免费5刀用户填3OpenAI绑了信用卡的用户可以填 16 或者更高。提高限制请查询https://platform.openai.com/docs/guides/rate-limits/overview
DEFAULT_WORKER_NUM = 8
# 色彩主题, 可选 ["Default", "Chuanhu-Small-and-Beautiful", "High-Contrast"]
# 更多主题, 请查阅Gradio主题商店: https://huggingface.co/spaces/gradio/theme-gallery 可选 ["Gstaff/Xkcd", "NoCrypt/Miku", ...]
THEME = "Default"
AVAIL_THEMES = ["Default", "Chuanhu-Small-and-Beautiful", "High-Contrast", "Gstaff/Xkcd", "NoCrypt/Miku"]
FONT = "Theme-Default-Font"
AVAIL_FONTS = [
"默认值(Theme-Default-Font)",
"宋体(SimSun)",
"黑体(SimHei)",
"楷体(KaiTi)",
"仿宋(FangSong)",
"华文细黑(STHeiti Light)",
"华文楷体(STKaiti)",
"华文仿宋(STFangsong)",
"华文宋体(STSong)",
"华文中宋(STZhongsong)",
"华文新魏(STXinwei)",
"华文隶书(STLiti)",
# 备注:以下字体需要网络支持,您可以自定义任意您喜欢的字体,如下所示,需要满足的格式为 "字体昵称(字体英文真名@字体css下载链接)"
"思源宋体(Source Han Serif CN VF@https://chinese-fonts-cdn.deno.dev/packages/syst/dist/SourceHanSerifCN/result.css)",
"月星楷(Moon Stars Kai HW@https://chinese-fonts-cdn.deno.dev/packages/moon-stars-kai/dist/MoonStarsKaiHW-Regular/result.css)",
"珠圆体(MaokenZhuyuanTi@https://chinese-fonts-cdn.deno.dev/packages/mkzyt/dist/猫啃珠圆体/result.css)",
"平方萌萌哒(PING FANG MENG MNEG DA@https://chinese-fonts-cdn.deno.dev/packages/pfmmd/dist/平方萌萌哒/result.css)",
"Helvetica",
"ui-sans-serif",
"sans-serif",
"system-ui"
]
# 默认的系统提示词system prompt
INIT_SYS_PROMPT = "Serve me as a writing and programming assistant."
# 对话窗的高度 仅在LAYOUT="TOP-DOWN"时生效)
CHATBOT_HEIGHT = 1115
# 代码高亮
CODE_HIGHLIGHT = True
# 窗口布局
LAYOUT = "LEFT-RIGHT" # "LEFT-RIGHT"(左右布局) # "TOP-DOWN"(上下布局)
# 暗色模式 / 亮色模式
DARK_MODE = True
# 发送请求到OpenAI后等待多久判定为超时
TIMEOUT_SECONDS = 30
# 网页的端口, -1代表随机端口
WEB_PORT = 19998
# 是否自动打开浏览器页面
AUTO_OPEN_BROWSER = True
# 如果OpenAI不响应网络卡顿、代理失败、KEY失效重试的次数限制
MAX_RETRY = 3
# 插件分类默认选项
DEFAULT_FN_GROUPS = ['对话', '编程', '学术', '智能体']
# 定义界面上“询问多个GPT模型”插件应该使用哪些模型请从AVAIL_LLM_MODELS中选择并在不同模型之间用`&`间隔,例如"gpt-3.5-turbo&chatglm3&azure-gpt-4"
MULTI_QUERY_LLM_MODELS = "gpt-3.5-turbo&chatglm3"
# 选择本地模型变体只有当AVAIL_LLM_MODELS包含了对应本地模型时才会起作用
# 如果你选择Qwen系列的模型那么请在下面的QWEN_MODEL_SELECTION中指定具体的模型
# 也可以是具体的模型路径
QWEN_LOCAL_MODEL_SELECTION = "Qwen/Qwen-1_8B-Chat-Int8"
# 百度千帆LLM_MODEL="qianfan"
BAIDU_CLOUD_API_KEY = ''
BAIDU_CLOUD_SECRET_KEY = ''
BAIDU_CLOUD_QIANFAN_MODEL = 'ERNIE-Bot' # 可选 "ERNIE-Bot-4"(文心大模型4.0), "ERNIE-Bot"(文心一言), "ERNIE-Bot-turbo", "BLOOMZ-7B", "Llama-2-70B-Chat", "Llama-2-13B-Chat", "Llama-2-7B-Chat", "ERNIE-Speed-128K", "ERNIE-Speed-8K", "ERNIE-Lite-8K"
# 如果使用ChatGLM3或ChatGLM4本地模型请把 LLM_MODEL="chatglm3" 或LLM_MODEL="chatglm4",并在此处指定模型路径
CHATGLM_LOCAL_MODEL_PATH = "THUDM/glm-4-9b-chat" # 例如"/home/hmp/ChatGLM3-6B/"
# 如果使用ChatGLM2微调模型请把 LLM_MODEL="chatglmft",并在此处指定模型路径
CHATGLM_PTUNING_CHECKPOINT = "" # 例如"/home/hmp/ChatGLM2-6B/ptuning/output/6b-pt-128-1e-2/checkpoint-100"
# 本地LLM模型如ChatGLM的执行方式 CPU/GPU
LOCAL_MODEL_DEVICE = "cpu" # 可选 "cuda"
LOCAL_MODEL_QUANT = "FP16" # 默认 "FP16" "INT4" 启用量化INT4版本 "INT8" 启用量化INT8版本
# 设置gradio的并行线程数不需要修改
CONCURRENT_COUNT = 100
# 是否在提交时自动清空输入框
AUTO_CLEAR_TXT = False
# 加一个live2d装饰
ADD_WAIFU = True
# 设置用户名和密码不需要修改相关功能不稳定与gradio版本和网络都相关如果本地使用不建议加这个
# [("username", "password"), ("username2", "password2"), ...]
AUTHENTICATION = []
# 如果需要在二级路径下运行(常规情况下,不要修改!!
# (举例 CUSTOM_PATH = "/gpt_academic",可以让软件运行在 http://ip:port/gpt_academic/ 下。)
CUSTOM_PATH = "/"
# HTTPS 秘钥和证书(不需要修改)
SSL_KEYFILE = ""
SSL_CERTFILE = ""
# 极少数情况下openai的官方KEY需要伴随组织编码格式如org-xxxxxxxxxxxxxxxxxxxxxxxx使用
API_ORG = ""
# 如果需要使用Slack Claude使用教程详情见 request_llms/README.md
SLACK_CLAUDE_BOT_ID = ''
SLACK_CLAUDE_USER_TOKEN = ''
# 如果需要使用AZURE方法一单个azure模型部署详情请见额外文档 docs\use_azure.md
AZURE_ENDPOINT = "https://你亲手写的api名称.openai.azure.com/"
AZURE_API_KEY = "填入azure openai api的密钥" # 建议直接在API_KEY处填写该选项即将被弃用
AZURE_ENGINE = "填入你亲手写的部署名" # 读 docs\use_azure.md
# 如果需要使用AZURE方法二多个azure模型部署+动态切换)详情请见额外文档 docs\use_azure.md
AZURE_CFG_ARRAY = {}
# 阿里云实时语音识别 配置难度较高
# 参考 https://github.com/binary-husky/gpt_academic/blob/master/docs/use_audio.md
ENABLE_AUDIO = False
ALIYUN_TOKEN="" # 例如 f37f30e0f9934c34a992f6f64f7eba4f
ALIYUN_APPKEY="" # 例如 RoPlZrM88DnAFkZK
ALIYUN_ACCESSKEY="" # (无需填写)
ALIYUN_SECRET="" # (无需填写)
# GPT-SOVITS 文本转语音服务的运行地址(将语言模型的生成文本朗读出来)
TTS_TYPE = "EDGE_TTS" # EDGE_TTS / LOCAL_SOVITS_API / DISABLE
GPT_SOVITS_URL = ""
EDGE_TTS_VOICE = "zh-CN-XiaoxiaoNeural"
# 接入讯飞星火大模型 https://console.xfyun.cn/services/iat
XFYUN_APPID = "00000000"
XFYUN_API_SECRET = "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
XFYUN_API_KEY = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
# 接入智谱大模型
ZHIPUAI_API_KEY = ""
ZHIPUAI_MODEL = "" # 此选项已废弃,不再需要填写
# Claude API KEY
ANTHROPIC_API_KEY = ""
# 月之暗面 API KEY
MOONSHOT_API_KEY = ""
# 零一万物(Yi Model) API KEY
YIMODEL_API_KEY = ""
# 接入火山引擎的在线大模型)api-key获取地址 https://console.volcengine.com/ark/region:ark+cn-beijing/endpoint
ARK_API_KEY = "00000000-0000-0000-0000-000000000000" # 火山引擎 API KEY
# 紫东太初大模型 https://ai-maas.wair.ac.cn
TAICHU_API_KEY = ""
# Grok API KEY
GROK_API_KEY = ""
# Mathpix 拥有执行PDF的OCR功能但是需要注册账号
MATHPIX_APPID = ""
MATHPIX_APPKEY = ""
# DOC2X的PDF解析服务注册账号并获取API KEY: https://doc2x.noedgeai.com/login
DOC2X_API_KEY = ""
# 自定义API KEY格式
CUSTOM_API_KEY_PATTERN = ""
# Google Gemini API-Key
GEMINI_API_KEY = ''
# HUGGINGFACE的TOKEN下载LLAMA时起作用 https://huggingface.co/docs/hub/security-tokens
HUGGINGFACE_ACCESS_TOKEN = "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"
# GROBID服务器地址填写多个可以均衡负载用于高质量地读取PDF文档
# 获取方法复制以下空间https://huggingface.co/spaces/qingxu98/grobid设为public然后GROBID_URL = "https://(你的hf用户名如qingxu98)-(你的填写的空间名如grobid).hf.space"
GROBID_URLS = [
"https://qingxu98-grobid.hf.space","https://qingxu98-grobid2.hf.space","https://qingxu98-grobid3.hf.space",
"https://qingxu98-grobid4.hf.space","https://qingxu98-grobid5.hf.space", "https://qingxu98-grobid6.hf.space",
"https://qingxu98-grobid7.hf.space", "https://qingxu98-grobid8.hf.space",
]
# Searxng互联网检索服务这是一个huggingface空间请前往huggingface复制该空间然后把自己新的空间地址填在这里
SEARXNG_URLS = [ f"https://kaletianlre-beardvs{i}dd.hf.space/" for i in range(1,5) ]
# 是否允许通过自然语言描述修改本页的配置,该功能具有一定的危险性,默认关闭
ALLOW_RESET_CONFIG = False
# 在使用AutoGen插件时是否使用Docker容器运行代码
AUTOGEN_USE_DOCKER = False
# 临时的上传文件夹位置,请尽量不要修改
PATH_PRIVATE_UPLOAD = "private_upload"
# 日志文件夹的位置,请尽量不要修改
PATH_LOGGING = "gpt_log"
# 存储翻译好的arxiv论文的路径请尽量不要修改
ARXIV_CACHE_DIR = "gpt_log/arxiv_cache"
# 除了连接OpenAI之外还有哪些场合允许使用代理请尽量不要修改
WHEN_TO_USE_PROXY = ["Connect_OpenAI", "Download_LLM", "Download_Gradio_Theme", "Connect_Grobid",
"Warmup_Modules", "Nougat_Download", "AutoGen", "Connect_OpenAI_Embedding"]
# 启用插件热加载
PLUGIN_HOT_RELOAD = False
# 自定义按钮的最大数量限制
NUM_CUSTOM_BASIC_BTN = 4
# 媒体智能体的服务地址这是一个huggingface空间请前往huggingface复制该空间然后把自己新的空间地址填在这里
DAAS_SERVER_URLS = [ f"https://niuziniu-biligpt{i}.hf.space/stream" for i in range(1,5) ]
# 在互联网搜索组件中负责将搜索结果整理成干净的Markdown
JINA_API_KEY = ""
# SEMANTIC SCHOLAR API KEY
SEMANTIC_SCHOLAR_KEY = ""
# 是否自动裁剪上下文长度(是否启动,默认不启动)
AUTO_CONTEXT_CLIP_ENABLE = False
# 目标裁剪上下文的token长度如果超过这个长度则会自动裁剪
AUTO_CONTEXT_CLIP_TRIGGER_TOKEN_LEN = 30*1000
# 无条件丢弃x以上的轮数
AUTO_CONTEXT_MAX_ROUND = 64
# 在裁剪上下文时倒数第x次对话能“最多”保留的上下文token的比例占 AUTO_CONTEXT_CLIP_TRIGGER_TOKEN_LEN 的多少
AUTO_CONTEXT_MAX_CLIP_RATIO = [0.80, 0.60, 0.45, 0.25, 0.20, 0.18, 0.16, 0.14, 0.12, 0.10, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01]
"""
--------------- 配置关联关系说明 ---------------
在线大模型配置关联关系示意图
├── "gpt-3.5-turbo" 等openai模型
│ ├── API_KEY
│ ├── CUSTOM_API_KEY_PATTERN不常用
│ ├── API_ORG不常用
│ └── API_URL_REDIRECT不常用
├── "azure-gpt-3.5" 等azure模型单个azure模型不需要动态切换
│ ├── API_KEY
│ ├── AZURE_ENDPOINT
│ ├── AZURE_API_KEY
│ ├── AZURE_ENGINE
│ └── API_URL_REDIRECT
├── "azure-gpt-3.5" 等azure模型多个azure模型需要动态切换高优先级
│ └── AZURE_CFG_ARRAY
├── "spark" 星火认知大模型 spark & sparkv2
│ ├── XFYUN_APPID
│ ├── XFYUN_API_SECRET
│ └── XFYUN_API_KEY
├── "claude-3-opus-20240229" 等claude模型
│ └── ANTHROPIC_API_KEY
├── "stack-claude"
│ ├── SLACK_CLAUDE_BOT_ID
│ └── SLACK_CLAUDE_USER_TOKEN
├── "qianfan" 百度千帆大模型库
│ ├── BAIDU_CLOUD_QIANFAN_MODEL
│ ├── BAIDU_CLOUD_API_KEY
│ └── BAIDU_CLOUD_SECRET_KEY
├── "glm-4", "glm-3-turbo", "zhipuai" 智谱AI大模型
│ └── ZHIPUAI_API_KEY
├── "yi-34b-chat-0205", "yi-34b-chat-200k" 等零一万物(Yi Model)大模型
│ └── YIMODEL_API_KEY
├── "qwen-turbo" 等通义千问大模型
│ └── DASHSCOPE_API_KEY
├── "Gemini"
│ └── GEMINI_API_KEY
└── "one-api-...(max_token=...)" 用一种更方便的方式接入one-api多模型管理界面
├── AVAIL_LLM_MODELS
├── API_KEY
└── API_URL_REDIRECT
本地大模型示意图
├── "chatglm4"
├── "chatglm3"
├── "chatglm"
├── "chatglm_onnx"
├── "chatglmft"
├── "internlm"
├── "moss"
├── "jittorllms_pangualpha"
├── "jittorllms_llama"
├── "deepseekcoder"
├── "qwen-local"
├── RWKV的支持见Wiki
└── "llama2"
用户图形界面布局依赖关系示意图
├── CHATBOT_HEIGHT 对话窗的高度
├── CODE_HIGHLIGHT 代码高亮
├── LAYOUT 窗口布局
├── DARK_MODE 暗色模式 / 亮色模式
├── DEFAULT_FN_GROUPS 插件分类默认选项
├── THEME 色彩主题
├── AUTO_CLEAR_TXT 是否在提交时自动清空输入框
├── ADD_WAIFU 加一个live2d装饰
└── ALLOW_RESET_CONFIG 是否允许通过自然语言描述修改本页的配置,该功能具有一定的危险性
插件在线服务配置依赖关系示意图
├── 互联网检索
│ └── SEARXNG_URLS
├── 语音功能
│ ├── ENABLE_AUDIO
│ ├── ALIYUN_TOKEN
│ ├── ALIYUN_APPKEY
│ ├── ALIYUN_ACCESSKEY
│ └── ALIYUN_SECRET
└── PDF文档精准解析
├── GROBID_URLS
├── MATHPIX_APPID
└── MATHPIX_APPKEY
"""

View File

@@ -17,7 +17,7 @@ def get_core_functions():
text_show_english=
r"Below is a paragraph from an academic paper. Polish the writing to meet the academic style, "
r"improve the spelling, grammar, clarity, concision and overall readability. When necessary, rewrite the whole sentence. "
r"Firstly, you should provide the polished paragraph. "
r"Firstly, you should provide the polished paragraph (in English). "
r"Secondly, you should list all your modification and explain the reasons to do so in markdown table.",
text_show_chinese=
r"作为一名中文学术论文写作改进助理,你的任务是改进所提供文本的拼写、语法、清晰、简洁和整体可读性,"
@@ -33,17 +33,19 @@ def get_core_functions():
"AutoClearHistory": False,
# [6] 文本预处理 (可选参数,默认 None举例写个函数移除所有的换行符
"PreProcess": None,
# [7] 模型选择 (可选参数。如不设置,则使用当前全局模型;如设置,则用指定模型覆盖全局模型。)
# "ModelOverride": "gpt-3.5-turbo", # 主要用途:强制点击此基础功能按钮时,使用指定的模型。
},
"总结绘制脑图": {
# 前缀,会被加在你的输入之前。例如,用来描述你的要求,例如翻译、解释代码、润色等等
"Prefix": r"",
"Prefix": '''"""\n\n''',
# 后缀,会被加在你的输入之后。例如,配合前缀可以把你的输入内容用引号圈起来
"Suffix":
# dedent() 函数用于去除多行字符串的缩进
dedent("\n"+r'''
==============================
dedent("\n\n"+r'''
"""
使用mermaid flowchart对以上文本进行总结概括上述段落的内容以及内在逻辑关系例如
@@ -57,15 +59,15 @@ def get_core_functions():
C --> |"箭头名2"| F["节点名6"]
```
警告
注意
1使用中文
2节点名字使用引号包裹如["Laptop"]
3`|` 和 `"`之间不要存在空格
4根据情况选择flowchart LR从左到右或者flowchart TD从上到下
'''),
},
"查找语法错误": {
"Prefix": r"Help me ensure that the grammar and the spelling is correct. "
r"Do not try to polish the text, if no mistake is found, tell me that this paragraph is good. "
@@ -85,14 +87,14 @@ def get_core_functions():
"Suffix": r"",
"PreProcess": clear_line_break, # 预处理:清除换行符
},
"中译英": {
"Prefix": r"Please translate following sentence to English:" + "\n\n",
"Suffix": r"",
},
"学术英中互译": {
"Prefix": build_gpt_academic_masked_string_langbased(
text_show_chinese=
@@ -112,29 +114,29 @@ def get_core_functions():
) + "\n\n",
"Suffix": r"",
},
"英译中": {
"Prefix": r"翻译成地道的中文:" + "\n\n",
"Suffix": r"",
"Visible": False,
},
"找图片": {
"Prefix": r"我需要你找一张网络图片。使用Unsplash API(https://source.unsplash.com/960x640/?<英语关键词>)获取图片URL"
r"然后请使用Markdown格式封装并且不要有反斜线不要用代码块。现在请按以下描述给我发送图片" + "\n\n",
"Suffix": r"",
"Visible": False,
},
"解释代码": {
"Prefix": r"请解释以下代码:" + "\n```\n",
"Suffix": "\n```\n",
},
"参考文献转Bib": {
"Prefix": r"Here are some bibliography items, please transform them into bibtex style."
r"Note that, reference styles maybe more than one kind, you should transform each item correctly."

View File

@@ -1,47 +1,73 @@
from toolbox import HotReload # HotReload 的意思是热更新,修改函数插件后,不需要重启程序,代码直接生效
from toolbox import trimmed_format_exc
from loguru import logger
def get_crazy_functions():
from crazy_functions.读文章写摘要 import 读文章写摘要
from crazy_functions.生成函数注释 import 批量生成函数注释
from crazy_functions.解析项目源代码 import 解析项目本身
from crazy_functions.解析项目源代码 import 解析一个Python项目
from crazy_functions.解析项目源代码 import 解析一个Matlab项目
from crazy_functions.解析项目源代码 import 解析一个C项目的头文件
from crazy_functions.解析项目源代码 import 解析一个C项目
from crazy_functions.解析项目源代码 import 解析一个Golang项目
from crazy_functions.解析项目源代码 import 解析一个Rust项目
from crazy_functions.解析项目源代码 import 解析一个Java项目
from crazy_functions.解析项目源代码 import 解析一个前端项目
from crazy_functions.Paper_Abstract_Writer import Paper_Abstract_Writer
from crazy_functions.Program_Comment_Gen import 批量Program_Comment_Gen
from crazy_functions.SourceCode_Analyse import 解析项目本身
from crazy_functions.SourceCode_Analyse import 解析一个Python项目
from crazy_functions.SourceCode_Analyse import 解析一个Matlab项目
from crazy_functions.SourceCode_Analyse import 解析一个C项目的头文件
from crazy_functions.SourceCode_Analyse import 解析一个C项目
from crazy_functions.SourceCode_Analyse import 解析一个Golang项目
from crazy_functions.SourceCode_Analyse import 解析一个Rust项目
from crazy_functions.SourceCode_Analyse import 解析一个Java项目
from crazy_functions.SourceCode_Analyse import 解析一个前端项目
from crazy_functions.高级功能函数模板 import 高阶功能模板函数
from crazy_functions.Latex全文润色 import Latex英文润色
from crazy_functions.询问多个大语言模型 import 同时问询
from crazy_functions.解析项目源代码 import 解析一个Lua项目
from crazy_functions.解析项目源代码 import 解析一个CSharp项目
from crazy_functions.总结word文档 import 总结word文档
from crazy_functions.解析JupyterNotebook import 解析ipynb文件
from crazy_functions.对话历史存档 import 对话历史存档
from crazy_functions.对话历史存档 import 载入对话历史存档
from crazy_functions.对话历史存档 import 删除所有本地对话历史记录
from crazy_functions.辅助功能 import 清除缓存
from crazy_functions.批量Markdown翻译 import Markdown英译中
from crazy_functions.批量总结PDF文档 import 批量总结PDF文档
from crazy_functions.批量翻译PDF文档_多线程 import 批量翻译PDF文档
from crazy_functions.谷歌检索小助手 import 谷歌检索小助手
from crazy_functions.理解PDF文档内容 import 理解PDF文档内容标准文件输入
from crazy_functions.Latex全文润色 import Latex中文润色
from crazy_functions.Latex全文润色 import Latex英文纠错
from crazy_functions.批量Markdown翻译 import Markdown中译英
from crazy_functions.虚空终端 import 虚空终端
from crazy_functions.生成多种Mermaid图表 import 生成多种Mermaid图表
from crazy_functions.高级功能函数模板 import Demo_Wrap
from crazy_functions.Latex_Project_Polish import Latex英文润色
from crazy_functions.Multi_LLM_Query import 同时问询
from crazy_functions.SourceCode_Analyse import 解析一个Lua项目
from crazy_functions.SourceCode_Analyse import 解析一个CSharp项目
from crazy_functions.Word_Summary import Word_Summary
from crazy_functions.SourceCode_Analyse_JupyterNotebook import 解析ipynb文件
from crazy_functions.Conversation_To_File import 载入对话历史存档
from crazy_functions.Conversation_To_File import 对话历史存档
from crazy_functions.Conversation_To_File import Conversation_To_File_Wrap
from crazy_functions.Conversation_To_File import 删除所有本地对话历史记录
from crazy_functions.Helpers import 清除缓存
from crazy_functions.Markdown_Translate import Markdown英译中
from crazy_functions.PDF_Summary import PDF_Summary
from crazy_functions.PDF_Translate import 批量翻译PDF文档
from crazy_functions.Google_Scholar_Assistant_Legacy import Google_Scholar_Assistant_Legacy
from crazy_functions.PDF_QA import PDF_QA标准文件输入
from crazy_functions.Latex_Project_Polish import Latex中文润色
from crazy_functions.Latex_Project_Polish import Latex英文纠错
from crazy_functions.Markdown_Translate import Markdown中译英
from crazy_functions.Void_Terminal import Void_Terminal
from crazy_functions.Mermaid_Figure_Gen import Mermaid_Gen
from crazy_functions.PDF_Translate_Wrap import PDF_Tran
from crazy_functions.Latex_Function import Latex英文纠错加PDF对比
from crazy_functions.Latex_Function import Latex翻译中文并重新编译PDF
from crazy_functions.Latex_Function import PDF翻译中文并重新编译PDF
from crazy_functions.Latex_Function_Wrap import Arxiv_Localize
from crazy_functions.Latex_Function_Wrap import PDF_Localize
from crazy_functions.Internet_GPT import 连接网络回答问题
from crazy_functions.Internet_GPT_Wrap import NetworkGPT_Wrap
from crazy_functions.Image_Generate import 图片生成_DALLE2, 图片生成_DALLE3, 图片修改_DALLE2
from crazy_functions.Image_Generate_Wrap import ImageGen_Wrap
from crazy_functions.SourceCode_Comment import 注释Python项目
from crazy_functions.SourceCode_Comment_Wrap import SourceCodeComment_Wrap
from crazy_functions.VideoResource_GPT import 多媒体任务
from crazy_functions.Document_Conversation import 批量文件询问
from crazy_functions.Document_Conversation_Wrap import Document_Conversation_Wrap
function_plugins = {
"多媒体智能体": {
"Group": "智能体",
"Color": "stop",
"AsButton": False,
"Info": "【仅测试】多媒体任务",
"Function": HotReload(多媒体任务),
},
"虚空终端": {
"Group": "对话|编程|学术|智能体",
"Color": "stop",
"AsButton": True,
"Function": HotReload(虚空终端),
"Info": "使用自然语言实现您的想法",
"Function": HotReload(Void_Terminal),
},
"解析整个Python项目": {
"Group": "编程",
@@ -50,6 +76,14 @@ def get_crazy_functions():
"Info": "解析一个Python项目的所有源文件(.py) | 输入参数为路径",
"Function": HotReload(解析一个Python项目),
},
"注释Python项目": {
"Group": "编程",
"Color": "stop",
"AsButton": False,
"Info": "上传一系列python源文件(或者压缩包), 为这些代码添加docstring | 输入参数为路径",
"Function": HotReload(注释Python项目),
"Class": SourceCodeComment_Wrap,
},
"载入对话历史存档(先上传存档或输入路径)": {
"Group": "对话",
"Color": "stop",
@@ -70,21 +104,28 @@ def get_crazy_functions():
"Info": "清除所有缓存文件,谨慎操作 | 不需要输入参数",
"Function": HotReload(清除缓存),
},
"生成多种Mermaid图表(从当前对话或文件(.pdf/.md)中生产图表)": {
"生成多种Mermaid图表(从当前对话或路径(.pdf/.md/.docx)中生产图表)": {
"Group": "对话",
"Color": "stop",
"AsButton": False,
"Info" : "基于当前对话或PDF生成多种Mermaid图表,图表类型由模型判断",
"Function": HotReload(生成多种Mermaid图表),
"AdvancedArgs": True,
"ArgsReminder": "请输入图类型对应的数字,不输入则为模型自行判断:1-流程图,2-序列图,3-类图,4-饼图,5-甘特图,6-状态图,7-实体关系图,8-象限提示图,9-思维导图",
"Info" : "基于当前对话或文件生成多种Mermaid图表,图表类型由模型判断",
"Function": None,
"Class": Mermaid_Gen
},
"Arxiv论文翻译": {
"Group": "学术",
"Color": "stop",
"AsButton": True,
"Info": "ArXiv论文精细翻译 | 输入参数arxiv论文的ID比如1812.10695",
"Function": HotReload(Latex翻译中文并重新编译PDF), # 当注册Class后Function旧接口仅会在“虚空终端”中起作用
"Class": Arxiv_Localize, # 新一代插件需要注册Class
},
"批量总结Word文档": {
"Group": "学术",
"Color": "stop",
"AsButton": True,
"AsButton": False,
"Info": "批量总结word文档 | 输入参数为路径",
"Function": HotReload(总结word文档),
"Function": HotReload(Word_Summary),
},
"解析整个Matlab项目": {
"Group": "编程",
@@ -163,7 +204,7 @@ def get_crazy_functions():
"Color": "stop",
"AsButton": False,
"Info": "读取Tex论文并写摘要 | 输入参数为路径",
"Function": HotReload(读文章写摘要),
"Function": HotReload(Paper_Abstract_Writer),
},
"翻译README或MD": {
"Group": "编程",
@@ -184,32 +225,46 @@ def get_crazy_functions():
"Color": "stop",
"AsButton": False, # 加入下拉菜单中
"Info": "批量生成函数的注释 | 输入参数为路径",
"Function": HotReload(批量生成函数注释),
"Function": HotReload(批量Program_Comment_Gen),
},
"保存当前的对话": {
"Group": "对话",
"Color": "stop",
"AsButton": True,
"Info": "保存当前的对话 | 不需要输入参数",
"Function": HotReload(对话历史存档),
"Function": HotReload(对话历史存档), # 当注册Class后Function旧接口仅会在“Void_Terminal”中起作用
"Class": Conversation_To_File_Wrap # 新一代插件需要注册Class
},
"[多线程Demo]解析此项目本身(源码自译解)": {
"Group": "对话|编程",
"Color": "stop",
"AsButton": False, # 加入下拉菜单中
"Info": "多线程解析并翻译此项目的源码 | 不需要输入参数",
"Function": HotReload(解析项目本身),
},
"查互联网后回答": {
"Group": "对话",
"Color": "stop",
"AsButton": True, # 加入下拉菜单中
# "Info": "连接网络回答问题(需要访问谷歌)| 输入参数是一个问题",
"Function": HotReload(连接网络回答问题),
"Class": NetworkGPT_Wrap # 新一代插件需要注册Class
},
"历史上的今天": {
"Group": "对话",
"AsButton": True,
"Color": "stop",
"AsButton": False,
"Info": "查看历史上的今天事件 (这是一个面向开发者的插件Demo) | 不需要输入参数",
"Function": HotReload(高阶功能模板函数),
"Function": None,
"Class": Demo_Wrap, # 新一代插件需要注册Class
},
"精准翻译PDF论文": {
"PDF论文翻译": {
"Group": "学术",
"Color": "stop",
"AsButton": True,
"Info": "精准翻译PDF论文为中文 | 输入参数为路径",
"Function": HotReload(批量翻译PDF文档),
"Function": HotReload(批量翻译PDF文档), # 当注册Class后Function旧接口仅会在“Void_Terminal”中起作用
"Class": PDF_Tran, # 新一代插件需要注册Class
},
"询问多个GPT模型": {
"Group": "对话",
@@ -222,21 +277,21 @@ def get_crazy_functions():
"Color": "stop",
"AsButton": False, # 加入下拉菜单中
"Info": "批量总结PDF文档的内容 | 输入参数为路径",
"Function": HotReload(批量总结PDF文档),
"Function": HotReload(PDF_Summary),
},
"谷歌学术检索助手输入谷歌学术搜索页url": {
"Group": "学术",
"Color": "stop",
"AsButton": False, # 加入下拉菜单中
"Info": "使用谷歌学术检索助手搜索指定URL的结果 | 输入参数为谷歌学术搜索页的URL",
"Function": HotReload(谷歌检索小助手),
"Function": HotReload(Google_Scholar_Assistant_Legacy),
},
"理解PDF文档内容 模仿ChatPDF": {
"Group": "学术",
"Color": "stop",
"AsButton": False, # 加入下拉菜单中
"Info": "理解PDF文档的内容并进行回答 | 输入参数为路径",
"Function": HotReload(理解PDF文档内容标准文件输入),
"Function": HotReload(PDF_QA标准文件输入),
},
"英文Latex项目全文润色输入路径或上传压缩包": {
"Group": "学术",
@@ -284,11 +339,95 @@ def get_crazy_functions():
"Info": "批量将Markdown文件中文翻译为英文 | 输入参数为路径或上传压缩包",
"Function": HotReload(Markdown中译英),
},
"Latex英文纠错+高亮修正位置 [需Latex]": {
"Group": "学术",
"Color": "stop",
"AsButton": False,
"AdvancedArgs": True,
"ArgsReminder": "如果有必要, 请在此处追加更细致的矫错指令(使用英文)。",
"Function": HotReload(Latex英文纠错加PDF对比),
},
"📚Arxiv论文精细翻译输入arxivID[需Latex]": {
"Group": "学术",
"Color": "stop",
"AsButton": False,
"AdvancedArgs": True,
"ArgsReminder": r"如果有必要, 请在此处给出自定义翻译命令, 解决部分词汇翻译不准确的问题。 "
r"例如当单词'agent'翻译不准确时, 请尝试把以下指令复制到高级参数区: "
r'If the term "agent" is used in this section, it should be translated to "智能体". ',
"Info": "ArXiv论文精细翻译 | 输入参数arxiv论文的ID比如1812.10695",
"Function": HotReload(Latex翻译中文并重新编译PDF), # 当注册Class后Function旧接口仅会在“Void_Terminal”中起作用
"Class": Arxiv_Localize, # 新一代插件需要注册Class
},
"📚本地Latex论文精细翻译上传Latex项目[需Latex]": {
"Group": "学术",
"Color": "stop",
"AsButton": False,
"AdvancedArgs": True,
"ArgsReminder": r"如果有必要, 请在此处给出自定义翻译命令, 解决部分词汇翻译不准确的问题。 "
r"例如当单词'agent'翻译不准确时, 请尝试把以下指令复制到高级参数区: "
r'If the term "agent" is used in this section, it should be translated to "智能体". ',
"Info": "本地Latex论文精细翻译 | 输入参数是路径",
"Function": HotReload(Latex翻译中文并重新编译PDF),
},
"PDF翻译中文并重新编译PDF上传PDF[需Latex]": {
"Group": "学术",
"Color": "stop",
"AsButton": False,
"AdvancedArgs": True,
"ArgsReminder": r"如果有必要, 请在此处给出自定义翻译命令, 解决部分词汇翻译不准确的问题。 "
r"例如当单词'agent'翻译不准确时, 请尝试把以下指令复制到高级参数区: "
r'If the term "agent" is used in this section, it should be translated to "智能体". ',
"Info": "PDF翻译中文并重新编译PDF | 输入参数为路径",
"Function": HotReload(PDF翻译中文并重新编译PDF), # 当注册Class后Function旧接口仅会在“Void_Terminal”中起作用
"Class": PDF_Localize # 新一代插件需要注册Class
},
"批量文件询问 (支持自定义总结各种文件)": {
"Group": "学术",
"Color": "stop",
"AsButton": False,
"AdvancedArgs": False,
"Info": "先上传文件,点击此按钮,进行提问",
"Function": HotReload(批量文件询问),
"Class": Document_Conversation_Wrap,
},
}
# -=--=- 尚未充分测试的实验性插件 & 需要额外依赖的插件 -=--=-
function_plugins.update(
{
"🎨图片生成DALLE2/DALLE3, 使用前切换到GPT系列模型": {
"Group": "对话",
"Color": "stop",
"AsButton": False,
"Info": "使用 DALLE2/DALLE3 生成图片 | 输入参数字符串,提供图像的内容",
"Function": HotReload(图片生成_DALLE2), # 当注册Class后Function旧接口仅会在“Void_Terminal”中起作用
"Class": ImageGen_Wrap # 新一代插件需要注册Class
},
}
)
function_plugins.update(
{
"🎨图片修改_DALLE2 使用前请切换模型到GPT系列": {
"Group": "对话",
"Color": "stop",
"AsButton": False,
"AdvancedArgs": False, # 调用时唤起高级参数输入区默认False
# "Info": "使用DALLE2修改图片 | 输入参数字符串,提供图像的内容",
"Function": HotReload(图片修改_DALLE2),
},
}
)
try:
from crazy_functions.下载arxiv论文翻译摘要 import 下载arxiv论文并翻译摘要
from crazy_functions.Arxiv_Downloader import 下载arxiv论文并翻译摘要
function_plugins.update(
{
@@ -302,42 +441,12 @@ def get_crazy_functions():
}
)
except:
print(trimmed_format_exc())
print("Load function plugin failed")
logger.error(trimmed_format_exc())
logger.error("Load function plugin failed")
try:
from crazy_functions.联网的ChatGPT import 连接网络回答问题
function_plugins.update(
{
"连接网络回答问题(输入问题后点击该插件,需要访问谷歌)": {
"Group": "对话",
"Color": "stop",
"AsButton": False, # 加入下拉菜单中
# "Info": "连接网络回答问题(需要访问谷歌)| 输入参数是一个问题",
"Function": HotReload(连接网络回答问题),
}
}
)
from crazy_functions.联网的ChatGPT_bing版 import 连接bing搜索回答问题
function_plugins.update(
{
"连接网络回答问题中文Bing版输入问题后点击该插件": {
"Group": "对话",
"Color": "stop",
"AsButton": False, # 加入下拉菜单中
"Info": "连接网络回答问题需要访问中文Bing| 输入参数是一个问题",
"Function": HotReload(连接bing搜索回答问题),
}
}
)
except:
print(trimmed_format_exc())
print("Load function plugin failed")
try:
from crazy_functions.解析项目源代码 import 解析任意code项目
from crazy_functions.SourceCode_Analyse import 解析任意code项目
function_plugins.update(
{
@@ -352,11 +461,11 @@ def get_crazy_functions():
}
)
except:
print(trimmed_format_exc())
print("Load function plugin failed")
logger.error(trimmed_format_exc())
logger.error("Load function plugin failed")
try:
from crazy_functions.询问多个大语言模型 import 同时问询_指定模型
from crazy_functions.Multi_LLM_Query import 同时问询_指定模型
function_plugins.update(
{
@@ -371,56 +480,13 @@ def get_crazy_functions():
}
)
except:
print(trimmed_format_exc())
print("Load function plugin failed")
logger.error(trimmed_format_exc())
logger.error("Load function plugin failed")
try:
from crazy_functions.图片生成 import 图片生成_DALLE2, 图片生成_DALLE3, 图片修改_DALLE2
function_plugins.update(
{
"图片生成_DALLE2 先切换模型到gpt-*": {
"Group": "对话",
"Color": "stop",
"AsButton": False,
"AdvancedArgs": True, # 调用时唤起高级参数输入区默认False
"ArgsReminder": "在这里输入分辨率, 如1024x1024默认支持 256x256, 512x512, 1024x1024", # 高级参数输入区的显示提示
"Info": "使用DALLE2生成图片 | 输入参数字符串,提供图像的内容",
"Function": HotReload(图片生成_DALLE2),
},
}
)
function_plugins.update(
{
"图片生成_DALLE3 先切换模型到gpt-*": {
"Group": "对话",
"Color": "stop",
"AsButton": False,
"AdvancedArgs": True, # 调用时唤起高级参数输入区默认False
"ArgsReminder": "在这里输入自定义参数「分辨率-质量(可选)-风格(可选)」, 参数示例「1024x1024-hd-vivid」 || 分辨率支持 「1024x1024」(默认) /「1792x1024」/「1024x1792」 || 质量支持 「-standard」(默认) /「-hd」 || 风格支持 「-vivid」(默认) /「-natural」", # 高级参数输入区的显示提示
"Info": "使用DALLE3生成图片 | 输入参数字符串,提供图像的内容",
"Function": HotReload(图片生成_DALLE3),
},
}
)
function_plugins.update(
{
"图片修改_DALLE2 先切换模型到gpt-*": {
"Group": "对话",
"Color": "stop",
"AsButton": False,
"AdvancedArgs": False, # 调用时唤起高级参数输入区默认False
# "Info": "使用DALLE2修改图片 | 输入参数字符串,提供图像的内容",
"Function": HotReload(图片修改_DALLE2),
},
}
)
except:
print(trimmed_format_exc())
print("Load function plugin failed")
try:
from crazy_functions.总结音视频 import 总结音视频
from crazy_functions.Audio_Summary import Audio_Summary
function_plugins.update(
{
@@ -431,16 +497,16 @@ def get_crazy_functions():
"AdvancedArgs": True,
"ArgsReminder": "调用openai api 使用whisper-1模型, 目前支持的格式:mp4, m4a, wav, mpga, mpeg, mp3。此处可以输入解析提示例如解析为简体中文默认",
"Info": "批量总结音频或视频 | 输入参数为路径",
"Function": HotReload(总结音视频),
"Function": HotReload(Audio_Summary),
}
}
)
except:
print(trimmed_format_exc())
print("Load function plugin failed")
logger.error(trimmed_format_exc())
logger.error("Load function plugin failed")
try:
from crazy_functions.数学动画生成manim import 动画生成
from crazy_functions.Math_Animation_Gen import 动画生成
function_plugins.update(
{
@@ -454,11 +520,11 @@ def get_crazy_functions():
}
)
except:
print(trimmed_format_exc())
print("Load function plugin failed")
logger.error(trimmed_format_exc())
logger.error("Load function plugin failed")
try:
from crazy_functions.批量Markdown翻译 import Markdown翻译指定语言
from crazy_functions.Markdown_Translate import Markdown翻译指定语言
function_plugins.update(
{
@@ -473,11 +539,11 @@ def get_crazy_functions():
}
)
except:
print(trimmed_format_exc())
print("Load function plugin failed")
logger.error(trimmed_format_exc())
logger.error("Load function plugin failed")
try:
from crazy_functions.知识库问答 import 知识库文件注入
from crazy_functions.Vectorstore_QA import 知识库文件注入
function_plugins.update(
{
@@ -492,11 +558,11 @@ def get_crazy_functions():
}
)
except:
print(trimmed_format_exc())
print("Load function plugin failed")
logger.error(trimmed_format_exc())
logger.error("Load function plugin failed")
try:
from crazy_functions.知识库问答 import 读取知识库作答
from crazy_functions.Vectorstore_QA import 读取知识库作答
function_plugins.update(
{
@@ -511,11 +577,11 @@ def get_crazy_functions():
}
)
except:
print(trimmed_format_exc())
print("Load function plugin failed")
logger.error(trimmed_format_exc())
logger.error("Load function plugin failed")
try:
from crazy_functions.交互功能函数模板 import 交互功能模板函数
from crazy_functions.Interactive_Func_Template import 交互功能模板函数
function_plugins.update(
{
@@ -528,57 +594,16 @@ def get_crazy_functions():
}
)
except:
print(trimmed_format_exc())
print("Load function plugin failed")
logger.error(trimmed_format_exc())
logger.error("Load function plugin failed")
try:
from crazy_functions.Latex输出PDF结果 import Latex英文纠错加PDF对比
from crazy_functions.Latex输出PDF结果 import Latex翻译中文并重新编译PDF
function_plugins.update(
{
"Latex英文纠错+高亮修正位置 [需Latex]": {
"Group": "学术",
"Color": "stop",
"AsButton": False,
"AdvancedArgs": True,
"ArgsReminder": "如果有必要, 请在此处追加更细致的矫错指令(使用英文)。",
"Function": HotReload(Latex英文纠错加PDF对比),
},
"Arxiv论文精细翻译输入arxivID[需Latex]": {
"Group": "学术",
"Color": "stop",
"AsButton": False,
"AdvancedArgs": True,
"ArgsReminder": "如果有必要, 请在此处给出自定义翻译命令, 解决部分词汇翻译不准确的问题。 "
+ "例如当单词'agent'翻译不准确时, 请尝试把以下指令复制到高级参数区: "
+ 'If the term "agent" is used in this section, it should be translated to "智能体". ',
"Info": "Arixv论文精细翻译 | 输入参数arxiv论文的ID比如1812.10695",
"Function": HotReload(Latex翻译中文并重新编译PDF),
},
"本地Latex论文精细翻译上传Latex项目[需Latex]": {
"Group": "学术",
"Color": "stop",
"AsButton": False,
"AdvancedArgs": True,
"ArgsReminder": "如果有必要, 请在此处给出自定义翻译命令, 解决部分词汇翻译不准确的问题。 "
+ "例如当单词'agent'翻译不准确时, 请尝试把以下指令复制到高级参数区: "
+ 'If the term "agent" is used in this section, it should be translated to "智能体". ',
"Info": "本地Latex论文精细翻译 | 输入参数是路径",
"Function": HotReload(Latex翻译中文并重新编译PDF),
}
}
)
except:
print(trimmed_format_exc())
print("Load function plugin failed")
try:
from toolbox import get_conf
ENABLE_AUDIO = get_conf("ENABLE_AUDIO")
if ENABLE_AUDIO:
from crazy_functions.语音助手 import 语音助手
from crazy_functions.Audio_Assistant import Audio_Assistant
function_plugins.update(
{
@@ -587,16 +612,16 @@ def get_crazy_functions():
"Color": "stop",
"AsButton": True,
"Info": "这是一个时刻聆听着的语音对话助手 | 没有输入参数",
"Function": HotReload(语音助手),
"Function": HotReload(Audio_Assistant),
}
}
)
except:
print(trimmed_format_exc())
print("Load function plugin failed")
logger.error(trimmed_format_exc())
logger.error("Load function plugin failed")
try:
from crazy_functions.批量翻译PDF文档_NOUGAT import 批量翻译PDF文档
from crazy_functions.PDF_Translate_Nougat import 批量翻译PDF文档
function_plugins.update(
{
@@ -609,11 +634,11 @@ def get_crazy_functions():
}
)
except:
print(trimmed_format_exc())
print("Load function plugin failed")
logger.error(trimmed_format_exc())
logger.error("Load function plugin failed")
try:
from crazy_functions.函数动态生成 import 函数动态生成
from crazy_functions.Dynamic_Function_Generate import Dynamic_Function_Generate
function_plugins.update(
{
@@ -621,47 +646,86 @@ def get_crazy_functions():
"Group": "智能体",
"Color": "stop",
"AsButton": False,
"Function": HotReload(函数动态生成),
"Function": HotReload(Dynamic_Function_Generate),
}
}
)
except:
print(trimmed_format_exc())
print("Load function plugin failed")
logger.error(trimmed_format_exc())
logger.error("Load function plugin failed")
# try:
# from crazy_functions.Multi_Agent_Legacy import Multi_Agent_Legacy终端
# function_plugins.update(
# {
# "AutoGenMulti_Agent_Legacy终端仅供测试": {
# "Group": "智能体",
# "Color": "stop",
# "AsButton": False,
# "Function": HotReload(Multi_Agent_Legacy终端),
# }
# }
# )
# except:
# logger.error(trimmed_format_exc())
# logger.error("Load function plugin failed")
try:
from crazy_functions.多智能体 import 多智能体终端
from crazy_functions.Rag_Interface import Rag问答
function_plugins.update(
{
"AutoGen多智能体终端仅供测试": {
"Group": "智能体",
"Rag智能召回": {
"Group": "对话",
"Color": "stop",
"AsButton": False,
"Function": HotReload(多智能体终端),
}
"Info": "将问答数据记录到向量库中,作为长期参考。",
"Function": HotReload(Rag问答),
},
}
)
except:
print(trimmed_format_exc())
print("Load function plugin failed")
logger.error(trimmed_format_exc())
logger.error("Load function plugin failed")
# try:
# from crazy_functions.Document_Optimize import 自定义智能文档处理
# function_plugins.update(
# {
# "一键处理文档(支持自定义全文润色、降重等)": {
# "Group": "学术",
# "Color": "stop",
# "AsButton": False,
# "AdvancedArgs": True,
# "ArgsReminder": "请输入处理指令和要求(可以详细描述),如:请帮我润色文本,要求幽默点。默认调用润色指令。",
# "Info": "保留文档结构,智能处理文档内容 | 输入参数为文件路径",
# "Function": HotReload(自定义智能文档处理)
# },
# }
# )
# except:
# logger.error(trimmed_format_exc())
# logger.error("Load function plugin failed")
try:
from crazy_functions.互动小游戏 import 随机小游戏
from crazy_functions.Paper_Reading import 快速论文解读
function_plugins.update(
{
"随机互动小游戏(仅供测试)": {
"Group": "智能体",
"速读论文": {
"Group": "学术",
"Color": "stop",
"AsButton": False,
"Function": HotReload(随机小游戏),
}
"Info": "上传一篇论文进行快速分析和解读 | 输入参数为论文路径或DOI/arXiv ID",
"Function": HotReload(快速论文解读),
},
}
)
except:
print(trimmed_format_exc())
print("Load function plugin failed")
logger.error(trimmed_format_exc())
logger.error("Load function plugin failed")
# try:
# from crazy_functions.高级功能函数模板 import 测试图表渲染
@@ -674,22 +738,9 @@ def get_crazy_functions():
# }
# })
# except:
# print(trimmed_format_exc())
# logger.error(trimmed_format_exc())
# print('Load function plugin failed')
# try:
# from crazy_functions.chatglm微调工具 import 微调数据集生成
# function_plugins.update({
# "黑盒模型学习: 微调数据集生成 (先上传数据集)": {
# "Color": "stop",
# "AsButton": False,
# "AdvancedArgs": True,
# "ArgsReminder": "针对数据集输入(如 绿帽子*深蓝色衬衫*黑色运动裤)给出指令,例如您可以将以下命令复制到下方: --llm_to_learn=azure-gpt-3.5 --prompt_prefix='根据下面的服装类型提示想象一个穿着者对这个人外貌、身处的环境、内心世界、过去经历进行描写。要求100字以内用第二人称。' --system_prompt=''",
# "Function": HotReload(微调数据集生成)
# }
# })
# except:
# print('Load function plugin failed')
"""
设置默认值:
@@ -709,3 +760,26 @@ def get_crazy_functions():
function_plugins[name]["Color"] = "secondary"
return function_plugins
def get_multiplex_button_functions():
"""多路复用主提交按钮的功能映射
"""
return {
"常规对话":
"",
"查互联网后回答":
"查互联网后回答",
"多模型对话":
"询问多个GPT模型", # 映射到上面的 `询问多个GPT模型` 插件
"智能召回 RAG":
"Rag智能召回", # 映射到上面的 `Rag智能召回` 插件
"多媒体查询":
"多媒体智能体", # 映射到上面的 `多媒体智能体` 插件
}

View File

@@ -0,0 +1,290 @@
import re
import os
import asyncio
from typing import List, Dict, Tuple
from dataclasses import dataclass
from textwrap import dedent
from toolbox import CatchException, get_conf, update_ui, promote_file_to_downloadzone, get_log_folder, get_user
from toolbox import update_ui, CatchException, report_exception, write_history_to_file
from crazy_functions.review_fns.data_sources.semantic_source import SemanticScholarSource
from crazy_functions.review_fns.data_sources.arxiv_source import ArxivSource
from crazy_functions.review_fns.query_analyzer import QueryAnalyzer
from crazy_functions.review_fns.handlers.review_handler import 文献综述功能
from crazy_functions.review_fns.handlers.recommend_handler import 论文推荐功能
from crazy_functions.review_fns.handlers.qa_handler import 学术问答功能
from crazy_functions.review_fns.handlers.paper_handler import 单篇论文分析功能
from crazy_functions.Conversation_To_File import write_chat_to_file
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from crazy_functions.review_fns.handlers.latest_handler import Arxiv最新论文推荐功能
from datetime import datetime
@CatchException
def 学术对话(txt: str, llm_kwargs: Dict, plugin_kwargs: Dict, chatbot: List,
history: List, system_prompt: str, user_request: str):
"""主函数"""
# 初始化数据源
arxiv_source = ArxivSource()
semantic_source = SemanticScholarSource(
api_key=get_conf("SEMANTIC_SCHOLAR_KEY")
)
# 初始化处理器
handlers = {
"review": 文献综述功能(arxiv_source, semantic_source, llm_kwargs),
"recommend": 论文推荐功能(arxiv_source, semantic_source, llm_kwargs),
"qa": 学术问答功能(arxiv_source, semantic_source, llm_kwargs),
"paper": 单篇论文分析功能(arxiv_source, semantic_source, llm_kwargs),
"latest": Arxiv最新论文推荐功能(arxiv_source, semantic_source, llm_kwargs),
}
# 分析查询意图
chatbot.append([None, "正在分析研究主题和查询要求..."])
yield from update_ui(chatbot=chatbot, history=history)
query_analyzer = QueryAnalyzer()
search_criteria = yield from query_analyzer.analyze_query(txt, chatbot, llm_kwargs)
handler = handlers.get(search_criteria.query_type)
if not handler:
handler = handlers["qa"] # 默认使用QA处理器
# 处理查询
chatbot.append([None, f"使用{handler.__class__.__name__}处理...可能需要您耐心等待35分钟..."])
yield from update_ui(chatbot=chatbot, history=history)
final_prompt = asyncio.run(handler.handle(
criteria=search_criteria,
chatbot=chatbot,
history=history,
system_prompt=system_prompt,
llm_kwargs=llm_kwargs,
plugin_kwargs=plugin_kwargs
))
if final_prompt:
# 检查是否是道歉提示
if "很抱歉,我们未能找到" in final_prompt:
chatbot.append([txt, final_prompt])
yield from update_ui(chatbot=chatbot, history=history)
return
# 在 final_prompt 末尾添加用户原始查询要求
final_prompt += dedent(f"""
Original user query: "{txt}"
IMPORTANT NOTE :
- Your response must directly address the user's original user query above
- While following the previous guidelines, prioritize answering what the user specifically asked
- Make sure your response format and content align with the user's expectations
- Do not translate paper titles, keep them in their original language
- Do not generate a reference list in your response - references will be handled separately
""")
# 使用最终的prompt生成回答
response = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=final_prompt,
inputs_show_user=txt,
llm_kwargs=llm_kwargs,
chatbot=chatbot,
history=[],
sys_prompt=f"You are a helpful academic assistant. Response in Chinese by default unless specified language is required in the user's query."
)
# 1. 获取文献列表
papers_list = handler.ranked_papers # 直接使用原始论文数据
# 在新的对话中添加格式化的参考文献列表
if papers_list:
references = ""
for idx, paper in enumerate(papers_list, 1):
# 构建作者列表
authors = paper.authors[:3]
if len(paper.authors) > 3:
authors.append("et al.")
authors_str = ", ".join(authors)
# 构建期刊指标信息
metrics = []
if hasattr(paper, 'if_factor') and paper.if_factor:
metrics.append(f"IF: {paper.if_factor}")
if hasattr(paper, 'jcr_division') and paper.jcr_division:
metrics.append(f"JCR: {paper.jcr_division}")
if hasattr(paper, 'cas_division') and paper.cas_division:
metrics.append(f"中科院分区: {paper.cas_division}")
metrics_str = f" [{', '.join(metrics)}]" if metrics else ""
# 构建DOI链接
doi_link = ""
if paper.doi:
if "arxiv.org" in str(paper.doi):
doi_url = paper.doi
else:
doi_url = f"https://doi.org/{paper.doi}"
doi_link = f" <a href='{doi_url}' target='_blank'>DOI: {paper.doi}</a>"
# 构建完整的引用
reference = f"[{idx}] {authors_str}. *{paper.title}*"
if paper.venue_name:
reference += f". {paper.venue_name}"
if paper.year:
reference += f", {paper.year}"
reference += metrics_str
if doi_link:
reference += f".{doi_link}"
reference += " \n"
references += reference
# 添加新的对话显示参考文献
chatbot.append(["参考文献如下:", references])
yield from update_ui(chatbot=chatbot, history=history)
# 2. 保存为不同格式
from .review_fns.conversation_doc.word_doc import WordFormatter
from .review_fns.conversation_doc.word2pdf import WordToPdfConverter
from .review_fns.conversation_doc.markdown_doc import MarkdownFormatter
from .review_fns.conversation_doc.html_doc import HtmlFormatter
# 创建保存目录
save_dir = get_log_folder(get_user(chatbot), plugin_name='chatscholar')
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# 生成文件名
def get_safe_filename(txt, max_length=10):
# 获取文本前max_length个字符作为文件名
filename = txt[:max_length].strip()
# 移除不安全的文件名字符
filename = re.sub(r'[\\/:*?"<>|]', '', filename)
# 如果文件名为空,使用时间戳
if not filename:
filename = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
return filename
base_filename = get_safe_filename(txt)
result_files = [] # 收集所有生成的文件
pdf_path = None # 用于跟踪PDF是否成功生成
# 保存为Markdown
try:
md_formatter = MarkdownFormatter()
md_content = md_formatter.create_document(txt, response, papers_list)
result_file_md = write_history_to_file(
history=[md_content],
file_basename=f"markdown_{base_filename}.md"
)
result_files.append(result_file_md)
except Exception as e:
print(f"Markdown保存失败: {str(e)}")
# 保存为HTML
try:
html_formatter = HtmlFormatter()
html_content = html_formatter.create_document(txt, response, papers_list)
result_file_html = write_history_to_file(
history=[html_content],
file_basename=f"html_{base_filename}.html"
)
result_files.append(result_file_html)
except Exception as e:
print(f"HTML保存失败: {str(e)}")
# 保存为Word
try:
word_formatter = WordFormatter()
try:
doc = word_formatter.create_document(txt, response, papers_list)
except Exception as e:
print(f"Word文档内容生成失败: {str(e)}")
raise e
try:
result_file_docx = os.path.join(
os.path.dirname(result_file_md) if result_file_md else save_dir,
f"docx_{base_filename}.docx"
)
doc.save(result_file_docx)
result_files.append(result_file_docx)
print(f"Word文档已保存到: {result_file_docx}")
# 转换为PDF
try:
pdf_path = WordToPdfConverter.convert_to_pdf(result_file_docx)
if pdf_path:
result_files.append(pdf_path)
print(f"PDF文档已生成: {pdf_path}")
except Exception as e:
print(f"PDF转换失败: {str(e)}")
except Exception as e:
print(f"Word文档保存失败: {str(e)}")
raise e
except Exception as e:
print(f"Word格式化失败: {str(e)}")
import traceback
print(f"详细错误信息: {traceback.format_exc()}")
# 保存为BibTeX格式
try:
from .review_fns.conversation_doc.reference_formatter import ReferenceFormatter
ref_formatter = ReferenceFormatter()
bibtex_content = ref_formatter.create_document(papers_list)
# 在与其他文件相同目录下创建BibTeX文件
result_file_bib = os.path.join(
os.path.dirname(result_file_md) if result_file_md else save_dir,
f"references_{base_filename}.bib"
)
# 直接写入文件
with open(result_file_bib, 'w', encoding='utf-8') as f:
f.write(bibtex_content)
result_files.append(result_file_bib)
print(f"BibTeX文件已保存到: {result_file_bib}")
except Exception as e:
print(f"BibTeX格式保存失败: {str(e)}")
# 保存为EndNote格式
try:
from .review_fns.conversation_doc.endnote_doc import EndNoteFormatter
endnote_formatter = EndNoteFormatter()
endnote_content = endnote_formatter.create_document(papers_list)
# 在与其他文件相同目录下创建EndNote文件
result_file_enw = os.path.join(
os.path.dirname(result_file_md) if result_file_md else save_dir,
f"references_{base_filename}.enw"
)
# 直接写入文件
with open(result_file_enw, 'w', encoding='utf-8') as f:
f.write(endnote_content)
result_files.append(result_file_enw)
print(f"EndNote文件已保存到: {result_file_enw}")
except Exception as e:
print(f"EndNote格式保存失败: {str(e)}")
# 添加所有文件到下载区
success_files = []
for file in result_files:
try:
promote_file_to_downloadzone(file, chatbot=chatbot)
success_files.append(os.path.basename(file))
except Exception as e:
print(f"文件添加到下载区失败: {str(e)}")
# 更新成功提示消息
if success_files:
chatbot.append(["保存对话记录成功bib和enw文件支持导入到EndNote、Zotero、JabRef、Mendeley等文献管理软件HTML文件支持在浏览器中打开里面包含详细论文源信息", "对话已保存并添加到下载区,可以在下载区找到相关文件"])
else:
chatbot.append(["保存对话记录", "所有格式的保存都失败了,请检查错误日志。"])
yield from update_ui(chatbot=chatbot, history=history)
else:
report_exception(chatbot, history, a=f"处理失败", b=f"请尝试其他查询")
yield from update_ui(chatbot=chatbot, history=history)

View File

@@ -1,17 +1,19 @@
import re, requests, unicodedata, os
from toolbox import update_ui, get_log_folder
from toolbox import write_history_to_file, promote_file_to_downloadzone
from toolbox import CatchException, report_exception, get_conf
import re, requests, unicodedata, os
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from loguru import logger
def download_arxiv_(url_pdf):
if 'arxiv.org' not in url_pdf:
if ('.' in url_pdf) and ('/' not in url_pdf):
new_url = 'https://arxiv.org/abs/'+url_pdf
print('下载编号:', url_pdf, '自动定位:', new_url)
logger.info('下载编号:', url_pdf, '自动定位:', new_url)
# download_arxiv_(new_url)
return download_arxiv_(new_url)
else:
print('不能识别的URL')
logger.info('不能识别的URL')
return None
if 'abs' in url_pdf:
url_pdf = url_pdf.replace('abs', 'pdf')
@@ -42,15 +44,12 @@ def download_arxiv_(url_pdf):
requests_pdf_url = url_pdf
file_path = download_dir+title_str
print('下载中')
logger.info('下载中')
proxies = get_conf('proxies')
r = requests.get(requests_pdf_url, proxies=proxies)
with open(file_path, 'wb+') as f:
f.write(r.content)
print('下载完成')
# print('输出下载命令:','aria2c -o \"%s\" %s'%(title_str,url_pdf))
# subprocess.call('aria2c --all-proxy=\"172.18.116.150:11084\" -o \"%s\" %s'%(download_dir+title_str,url_pdf), shell=True)
logger.info('下载完成')
x = "%s %s %s.bib" % (paper_id, other_info['year'], other_info['authors'])
x = x.replace('?', '')\
@@ -63,19 +62,9 @@ def download_arxiv_(url_pdf):
def get_name(_url_):
import os
from bs4 import BeautifulSoup
print('正在获取文献名!')
print(_url_)
# arxiv_recall = {}
# if os.path.exists('./arxiv_recall.pkl'):
# with open('./arxiv_recall.pkl', 'rb') as f:
# arxiv_recall = pickle.load(f)
# if _url_ in arxiv_recall:
# print('在缓存中')
# return arxiv_recall[_url_]
logger.info('正在获取文献名!')
logger.info(_url_)
proxies = get_conf('proxies')
res = requests.get(_url_, proxies=proxies)
@@ -92,7 +81,7 @@ def get_name(_url_):
other_details['abstract'] = abstract
except:
other_details['year'] = ''
print('年份获取失败')
logger.info('年份获取失败')
# get author
try:
@@ -101,7 +90,7 @@ def get_name(_url_):
other_details['authors'] = authors
except:
other_details['authors'] = ''
print('authors获取失败')
logger.info('authors获取失败')
# get comment
try:
@@ -116,11 +105,11 @@ def get_name(_url_):
other_details['comment'] = ''
except:
other_details['comment'] = ''
print('年份获取失败')
logger.info('年份获取失败')
title_str = BeautifulSoup(
res.text, 'html.parser').find('title').contents[0]
print('获取成功:', title_str)
logger.info('获取成功:', title_str)
# arxiv_recall[_url_] = (title_str+'.pdf', other_details)
# with open('./arxiv_recall.pkl', 'wb') as f:
# pickle.dump(arxiv_recall, f)
@@ -144,8 +133,8 @@ def 下载arxiv论文并翻译摘要(txt, llm_kwargs, plugin_kwargs, chatbot, hi
try:
import bs4
except:
report_exception(chatbot, history,
a = f"解析项目: {txt}",
report_exception(chatbot, history,
a = f"解析项目: {txt}",
b = f"导入软件依赖失败。使用该模块需要额外依赖,安装方法```pip install --upgrade beautifulsoup4```。")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
@@ -157,12 +146,12 @@ def 下载arxiv论文并翻译摘要(txt, llm_kwargs, plugin_kwargs, chatbot, hi
try:
pdf_path, info = download_arxiv_(txt)
except:
report_exception(chatbot, history,
a = f"解析项目: {txt}",
report_exception(chatbot, history,
a = f"解析项目: {txt}",
b = f"下载pdf文件未成功")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# 翻译摘要等
i_say = f"请你阅读以下学术论文相关的材料,提取摘要,翻译为中文。材料如下:{str(info)}"
i_say_show_user = f'请你阅读以下学术论文相关的材料,提取摘要,翻译为中文。论文:{pdf_path}'

View File

@@ -1,11 +1,13 @@
from toolbox import update_ui
from toolbox import CatchException, get_conf, markdown_convertion
from request_llms.bridge_all import predict_no_ui_long_connection
from crazy_functions.crazy_utils import input_clipping
from crazy_functions.agent_fns.watchdog import WatchDog
from request_llms.bridge_all import predict_no_ui_long_connection
from crazy_functions.live_audio.aliyunASR import AliyunASR
from loguru import logger
import threading, time
import numpy as np
from .live_audio.aliyunASR import AliyunASR
import json
import re
@@ -39,12 +41,12 @@ class AsyncGptTask():
try:
MAX_TOKEN_ALLO = 2560
i_say, history = input_clipping(i_say, history, max_token_limit=MAX_TOKEN_ALLO)
gpt_say_partial = predict_no_ui_long_connection(inputs=i_say, llm_kwargs=llm_kwargs, history=history, sys_prompt=sys_prompt,
observe_window=observe_window[index], console_slience=True)
gpt_say_partial = predict_no_ui_long_connection(inputs=i_say, llm_kwargs=llm_kwargs, history=history, sys_prompt=sys_prompt,
observe_window=observe_window[index], console_silence=True)
except ConnectionAbortedError as token_exceed_err:
print('至少一个线程任务Token溢出而失败', e)
logger.error('至少一个线程任务Token溢出而失败', e)
except Exception as e:
print('至少一个线程任务意外失败', e)
logger.error('至少一个线程任务意外失败', e)
def add_async_gpt_task(self, i_say, chatbot_index, llm_kwargs, history, system_prompt):
self.observe_future.append([""])
@@ -120,7 +122,7 @@ class InterviewAssistant(AliyunASR):
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
self.plugin_wd.feed()
if self.event_on_result_chg.is_set():
if self.event_on_result_chg.is_set():
# called when some words have finished
self.event_on_result_chg.clear()
chatbot[-1] = list(chatbot[-1])
@@ -151,7 +153,7 @@ class InterviewAssistant(AliyunASR):
# add gpt task 创建子线程请求gpt避免线程阻塞
history = chatbot2history(chatbot)
self.agt.add_async_gpt_task(self.buffered_sentence, len(chatbot)-1, llm_kwargs, history, system_prompt)
self.buffered_sentence = ""
chatbot.append(["[ 请讲话 ]", "[ 正在等您说完问题 ]"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
@@ -166,7 +168,7 @@ class InterviewAssistant(AliyunASR):
@CatchException
def 语音助手(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
def Audio_Assistant(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
# pip install -U openai-whisper
chatbot.append(["对话助手函数插件:使用时,双手离开鼠标键盘吧", "音频助手, 正在听您讲话(点击“停止”键可终止程序)..."])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面

View File

@@ -1,5 +1,5 @@
from toolbox import CatchException, report_exception, select_api_key, update_ui, get_conf
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from toolbox import write_history_to_file, promote_file_to_downloadzone, get_log_folder
def split_audio_file(filename, split_duration=1000):
@@ -132,13 +132,13 @@ def AnalyAudio(parse_prompt, file_manifest, llm_kwargs, chatbot, history):
@CatchException
def 总结音视频(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, WEB_PORT):
def Audio_Summary(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, WEB_PORT):
import glob, os
# 基本信息:功能、贡献者
chatbot.append([
"函数插件功能?",
"总结音视频内容,函数插件贡献者: dalvqw & BinaryHusky"])
"Audio_Summary内容,函数插件贡献者: dalvqw & BinaryHusky"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
try:

View File

@@ -1,232 +0,0 @@
from collections.abc import Callable, Iterable, Mapping
from typing import Any
from toolbox import CatchException, update_ui, gen_time_str, trimmed_format_exc
from toolbox import promote_file_to_downloadzone, get_log_folder
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from .crazy_utils import input_clipping, try_install_deps
from multiprocessing import Process, Pipe
import os
import time
templete = """
```python
import ... # Put dependencies here, e.g. import numpy as np
class TerminalFunction(object): # Do not change the name of the class, The name of the class must be `TerminalFunction`
def run(self, path): # The name of the function must be `run`, it takes only a positional argument.
# rewrite the function you have just written here
...
return generated_file_path
```
"""
def inspect_dependency(chatbot, history):
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return True
def get_code_block(reply):
import re
pattern = r"```([\s\S]*?)```" # regex pattern to match code blocks
matches = re.findall(pattern, reply) # find all code blocks in text
if len(matches) == 1:
return matches[0].strip('python') # code block
for match in matches:
if 'class TerminalFunction' in match:
return match.strip('python') # code block
raise RuntimeError("GPT is not generating proper code.")
def gpt_interact_multi_step(txt, file_type, llm_kwargs, chatbot, history):
# 输入
prompt_compose = [
f'Your job:\n'
f'1. write a single Python function, which takes a path of a `{file_type}` file as the only argument and returns a `string` containing the result of analysis or the path of generated files. \n',
f"2. You should write this function to perform following task: " + txt + "\n",
f"3. Wrap the output python function with markdown codeblock."
]
i_say = "".join(prompt_compose)
demo = []
# 第一步
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=i_say, inputs_show_user=i_say,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=demo,
sys_prompt= r"You are a programmer."
)
history.extend([i_say, gpt_say])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 界面更新
# 第二步
prompt_compose = [
"If previous stage is successful, rewrite the function you have just written to satisfy following templete: \n",
templete
]
i_say = "".join(prompt_compose); inputs_show_user = "If previous stage is successful, rewrite the function you have just written to satisfy executable templete. "
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=i_say, inputs_show_user=inputs_show_user,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
sys_prompt= r"You are a programmer."
)
code_to_return = gpt_say
history.extend([i_say, gpt_say])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 界面更新
# # 第三步
# i_say = "Please list to packages to install to run the code above. Then show me how to use `try_install_deps` function to install them."
# i_say += 'For instance. `try_install_deps(["opencv-python", "scipy", "numpy"])`'
# installation_advance = yield from request_gpt_model_in_new_thread_with_ui_alive(
# inputs=i_say, inputs_show_user=inputs_show_user,
# llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
# sys_prompt= r"You are a programmer."
# )
# # # 第三步
# i_say = "Show me how to use `pip` to install packages to run the code above. "
# i_say += 'For instance. `pip install -r opencv-python scipy numpy`'
# installation_advance = yield from request_gpt_model_in_new_thread_with_ui_alive(
# inputs=i_say, inputs_show_user=i_say,
# llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
# sys_prompt= r"You are a programmer."
# )
installation_advance = ""
return code_to_return, installation_advance, txt, file_type, llm_kwargs, chatbot, history
def make_module(code):
module_file = 'gpt_fn_' + gen_time_str().replace('-','_')
with open(f'{get_log_folder()}/{module_file}.py', 'w', encoding='utf8') as f:
f.write(code)
def get_class_name(class_string):
import re
# Use regex to extract the class name
class_name = re.search(r'class (\w+)\(', class_string).group(1)
return class_name
class_name = get_class_name(code)
return f"{get_log_folder().replace('/', '.')}.{module_file}->{class_name}"
def init_module_instance(module):
import importlib
module_, class_ = module.split('->')
init_f = getattr(importlib.import_module(module_), class_)
return init_f()
def for_immediate_show_off_when_possible(file_type, fp, chatbot):
if file_type in ['png', 'jpg']:
image_path = os.path.abspath(fp)
chatbot.append(['这是一张图片, 展示如下:',
f'本地文件地址: <br/>`{image_path}`<br/>'+
f'本地文件预览: <br/><div align="center"><img src="file={image_path}"></div>'
])
return chatbot
def subprocess_worker(instance, file_path, return_dict):
return_dict['result'] = instance.run(file_path)
def have_any_recent_upload_files(chatbot):
_5min = 5 * 60
if not chatbot: return False # chatbot is None
most_recent_uploaded = chatbot._cookies.get("most_recent_uploaded", None)
if not most_recent_uploaded: return False # most_recent_uploaded is None
if time.time() - most_recent_uploaded["time"] < _5min: return True # most_recent_uploaded is new
else: return False # most_recent_uploaded is too old
def get_recent_file_prompt_support(chatbot):
most_recent_uploaded = chatbot._cookies.get("most_recent_uploaded", None)
path = most_recent_uploaded['path']
return path
@CatchException
def 虚空终端CodeInterpreter(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
"""
txt 输入栏用户输入的文本,例如需要翻译的一段话,再例如一个包含了待处理文件的路径
llm_kwargs gpt模型参数如温度和top_p等一般原样传递下去就行
plugin_kwargs 插件模型的参数,暂时没有用武之地
chatbot 聊天显示框的句柄,用于显示给用户
history 聊天历史,前情提要
system_prompt 给gpt的静默提醒
user_request 当前用户的请求信息IP地址等
"""
raise NotImplementedError
# 清空历史,以免输入溢出
history = []; clear_file_downloadzone(chatbot)
# 基本信息:功能、贡献者
chatbot.append([
"函数插件功能?",
"CodeInterpreter开源版, 此插件处于开发阶段, 建议暂时不要使用, 插件初始化中 ..."
])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
if have_any_recent_upload_files(chatbot):
file_path = get_recent_file_prompt_support(chatbot)
else:
chatbot.append(["文件检索", "没有发现任何近期上传的文件。"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# 读取文件
if ("recently_uploaded_files" in plugin_kwargs) and (plugin_kwargs["recently_uploaded_files"] == ""): plugin_kwargs.pop("recently_uploaded_files")
recently_uploaded_files = plugin_kwargs.get("recently_uploaded_files", None)
file_path = recently_uploaded_files[-1]
file_type = file_path.split('.')[-1]
# 粗心检查
if is_the_upload_folder(txt):
chatbot.append([
"...",
f"请在输入框内填写需求,然后再次点击该插件(文件路径 {file_path} 已经被记忆)"
])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# 开始干正事
for j in range(5): # 最多重试5次
try:
code, installation_advance, txt, file_type, llm_kwargs, chatbot, history = \
yield from gpt_interact_multi_step(txt, file_type, llm_kwargs, chatbot, history)
code = get_code_block(code)
res = make_module(code)
instance = init_module_instance(res)
break
except Exception as e:
chatbot.append([f"{j}次代码生成尝试,失败了", f"错误追踪\n```\n{trimmed_format_exc()}\n```\n"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# 代码生成结束, 开始执行
try:
import multiprocessing
manager = multiprocessing.Manager()
return_dict = manager.dict()
p = multiprocessing.Process(target=subprocess_worker, args=(instance, file_path, return_dict))
# only has 10 seconds to run
p.start(); p.join(timeout=10)
if p.is_alive(): p.terminate(); p.join()
p.close()
res = return_dict['result']
# res = instance.run(file_path)
except Exception as e:
chatbot.append(["执行失败了", f"错误追踪\n```\n{trimmed_format_exc()}\n```\n"])
# chatbot.append(["如果是缺乏依赖,请参考以下建议", installation_advance])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# 顺利完成,收尾
res = str(res)
if os.path.exists(res):
chatbot.append(["执行成功了,结果是一个有效文件", "结果:" + res])
new_file_path = promote_file_to_downloadzone(res, chatbot=chatbot)
chatbot = for_immediate_show_off_when_possible(file_type, new_file_path, chatbot)
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 界面更新
else:
chatbot.append(["执行成功了,结果是一个字符串", "结果:" + res])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 界面更新
"""
测试:
裁剪图像,保留下半部分
交换图像的蓝色通道和红色通道
将图像转为灰度图像
将csv文件转excel表格
"""

View File

@@ -1,10 +1,10 @@
from toolbox import CatchException, update_ui, gen_time_str
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from .crazy_utils import input_clipping
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from crazy_functions.crazy_utils import input_clipping
import copy, json
@CatchException
def 命令行助手(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
def Commandline_Assistant(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
"""
txt 输入栏用户输入的文本, 例如需要翻译的一段话, 再例如一个包含了待处理文件的路径
llm_kwargs gpt模型参数, 如温度和top_p等, 一般原样传递下去就行
@@ -21,8 +21,8 @@ def 命令行助手(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_pro
i_say = "请写bash命令实现以下功能" + txt
# 开始
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=i_say, inputs_show_user=txt,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=[],
inputs=i_say, inputs_show_user=txt,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=[],
sys_prompt="你是一个Linux大师级用户。注意当我要求你写bash命令时尽可能地仅用一行命令解决我的要求。"
)
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 界面更新

View File

@@ -0,0 +1,374 @@
import re
from toolbox import CatchException, update_ui, promote_file_to_downloadzone, get_log_folder, get_user, update_ui_latest_msg
from crazy_functions.plugin_template.plugin_class_template import GptAcademicPluginTemplate, ArgProperty
from loguru import logger
f_prefix = 'GPT-Academic对话存档'
def write_chat_to_file_legacy(chatbot, history=None, file_name=None):
"""
将对话记录history以Markdown格式写入文件中。如果没有指定文件名则使用当前时间生成文件名。
"""
import os
import time
from themes.theme import advanced_css
if (file_name is not None) and (file_name != "") and (not file_name.endswith('.html')): file_name += '.html'
else: file_name = None
if file_name is None:
file_name = f_prefix + time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime()) + '.html'
fp = os.path.join(get_log_folder(get_user(chatbot), plugin_name='chat_history'), file_name)
with open(fp, 'w', encoding='utf8') as f:
from textwrap import dedent
form = dedent("""
<!DOCTYPE html><head><meta charset="utf-8"><title>对话存档</title><style>{CSS}</style></head>
<body>
<div class="test_temp1" style="width:10%; height: 500px; float:left;"></div>
<div class="test_temp2" style="width:80%;padding: 40px;float:left;padding-left: 20px;padding-right: 20px;box-shadow: rgba(0, 0, 0, 0.2) 0px 0px 8px 8px;border-radius: 10px;">
<div class="chat-body" style="display: flex;justify-content: center;flex-direction: column;align-items: center;flex-wrap: nowrap;">
{CHAT_PREVIEW}
<div></div>
<div></div>
<div style="text-align: center;width:80%;padding: 0px;float:left;padding-left:20px;padding-right:20px;box-shadow: rgba(0, 0, 0, 0.05) 0px 0px 1px 2px;border-radius: 1px;">对话(原始数据)</div>
{HISTORY_PREVIEW}
</div>
</div>
<div class="test_temp3" style="width:10%; height: 500px; float:left;"></div>
</body>
""")
qa_from = dedent("""
<div class="QaBox" style="width:80%;padding: 20px;margin-bottom: 20px;box-shadow: rgb(0 255 159 / 50%) 0px 0px 1px 2px;border-radius: 4px;">
<div class="Question" style="border-radius: 2px;">{QUESTION}</div>
<hr color="blue" style="border-top: dotted 2px #ccc;">
<div class="Answer" style="border-radius: 2px;">{ANSWER}</div>
</div>
""")
history_from = dedent("""
<div class="historyBox" style="width:80%;padding: 0px;float:left;padding-left:20px;padding-right:20px;box-shadow: rgba(0, 0, 0, 0.05) 0px 0px 1px 2px;border-radius: 1px;">
<div class="entry" style="border-radius: 2px;">{ENTRY}</div>
</div>
""")
CHAT_PREVIEW_BUF = ""
for i, contents in enumerate(chatbot):
question, answer = contents[0], contents[1]
if question is None: question = ""
try: question = str(question)
except: question = ""
if answer is None: answer = ""
try: answer = str(answer)
except: answer = ""
CHAT_PREVIEW_BUF += qa_from.format(QUESTION=question, ANSWER=answer)
HISTORY_PREVIEW_BUF = ""
for h in history:
HISTORY_PREVIEW_BUF += history_from.format(ENTRY=h)
html_content = form.format(CHAT_PREVIEW=CHAT_PREVIEW_BUF, HISTORY_PREVIEW=HISTORY_PREVIEW_BUF, CSS=advanced_css)
f.write(html_content)
promote_file_to_downloadzone(fp, rename_file=file_name, chatbot=chatbot)
return '对话历史写入:' + fp
def write_chat_to_file(chatbot, history=None, file_name=None):
"""
将对话记录history以多种格式HTML、Word、Markdown写入文件中。如果没有指定文件名则使用当前时间生成文件名。
Args:
chatbot: 聊天机器人对象,包含对话内容
history: 对话历史记录
file_name: 指定的文件名如果为None则使用时间戳
Returns:
str: 提示信息,包含文件保存路径
"""
import os
import time
import asyncio
import aiofiles
from toolbox import promote_file_to_downloadzone
from crazy_functions.doc_fns.conversation_doc.excel_doc import save_chat_tables
from crazy_functions.doc_fns.conversation_doc.html_doc import HtmlFormatter
from crazy_functions.doc_fns.conversation_doc.markdown_doc import MarkdownFormatter
from crazy_functions.doc_fns.conversation_doc.word_doc import WordFormatter
from crazy_functions.doc_fns.conversation_doc.txt_doc import TxtFormatter
from crazy_functions.doc_fns.conversation_doc.word2pdf import WordToPdfConverter
async def save_html():
try:
html_formatter = HtmlFormatter(chatbot, history)
html_content = html_formatter.create_document()
html_file = os.path.join(save_dir, base_name + '.html')
async with aiofiles.open(html_file, 'w', encoding='utf8') as f:
await f.write(html_content)
return html_file
except Exception as e:
print(f"保存HTML格式失败: {str(e)}")
return None
async def save_word():
try:
word_formatter = WordFormatter()
doc = word_formatter.create_document(history)
docx_file = os.path.join(save_dir, base_name + '.docx')
# 由于python-docx不支持异步使用线程池执行
loop = asyncio.get_event_loop()
await loop.run_in_executor(None, doc.save, docx_file)
return docx_file
except Exception as e:
print(f"保存Word格式失败: {str(e)}")
return None
async def save_pdf(docx_file):
try:
if docx_file:
# 获取文件名和保存路径
pdf_file = os.path.join(save_dir, base_name + '.pdf')
# 在线程池中执行转换
loop = asyncio.get_event_loop()
pdf_file = await loop.run_in_executor(
None,
WordToPdfConverter.convert_to_pdf,
docx_file
# save_dir
)
return pdf_file
except Exception as e:
print(f"保存PDF格式失败: {str(e)}")
return None
async def save_markdown():
try:
md_formatter = MarkdownFormatter()
md_content = md_formatter.create_document(history)
md_file = os.path.join(save_dir, base_name + '.md')
async with aiofiles.open(md_file, 'w', encoding='utf8') as f:
await f.write(md_content)
return md_file
except Exception as e:
print(f"保存Markdown格式失败: {str(e)}")
return None
async def save_txt():
try:
txt_formatter = TxtFormatter()
txt_content = txt_formatter.create_document(history)
txt_file = os.path.join(save_dir, base_name + '.txt')
async with aiofiles.open(txt_file, 'w', encoding='utf8') as f:
await f.write(txt_content)
return txt_file
except Exception as e:
print(f"保存TXT格式失败: {str(e)}")
return None
async def main():
# 并发执行所有保存任务
html_task = asyncio.create_task(save_html())
word_task = asyncio.create_task(save_word())
md_task = asyncio.create_task(save_markdown())
txt_task = asyncio.create_task(save_txt())
# 等待所有任务完成
html_file = await html_task
docx_file = await word_task
md_file = await md_task
txt_file = await txt_task
# PDF转换需要等待word文件生成完成
pdf_file = await save_pdf(docx_file)
# 收集所有成功生成的文件
result_files = [f for f in [html_file, docx_file, md_file, txt_file, pdf_file] if f]
# 保存Excel表格
excel_files = save_chat_tables(history, save_dir, base_name)
result_files.extend(excel_files)
return result_files
# 生成时间戳
timestamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())
# 获取保存目录
save_dir = get_log_folder(get_user(chatbot), plugin_name='chat_history')
# 处理文件名
base_name = file_name if file_name else f"聊天记录_{timestamp}"
# 运行异步任务
result_files = asyncio.run(main())
# 将生成的文件添加到下载区
for file in result_files:
promote_file_to_downloadzone(file, rename_file=os.path.basename(file), chatbot=chatbot)
# 如果没有成功保存任何文件,返回错误信息
if not result_files:
return "保存对话记录失败,请检查错误日志"
ext_list = [os.path.splitext(f)[1] for f in result_files]
# 返回成功信息和文件路径
return f"对话历史已保存至以下格式文件:" + "".join(ext_list)
def gen_file_preview(file_name):
try:
with open(file_name, 'r', encoding='utf8') as f:
file_content = f.read()
# pattern to match the text between <head> and </head>
pattern = re.compile(r'<head>.*?</head>', flags=re.DOTALL)
file_content = re.sub(pattern, '', file_content)
html, history = file_content.split('<hr color="blue"> \n\n 对话数据 (无渲染):\n')
history = history.strip('<code>')
history = history.strip('</code>')
history = history.split("\n>>>")
return list(filter(lambda x:x!="", history))[0][:100]
except:
return ""
def read_file_to_chat(chatbot, history, file_name):
with open(file_name, 'r', encoding='utf8') as f:
file_content = f.read()
from bs4 import BeautifulSoup
soup = BeautifulSoup(file_content, 'lxml')
# 提取QaBox信息
chatbot.clear()
qa_box_list = []
qa_boxes = soup.find_all("div", class_="QaBox")
for box in qa_boxes:
question = box.find("div", class_="Question").get_text(strip=False)
answer = box.find("div", class_="Answer").get_text(strip=False)
qa_box_list.append({"Question": question, "Answer": answer})
chatbot.append([question, answer])
# 提取historyBox信息
history_box_list = []
history_boxes = soup.find_all("div", class_="historyBox")
for box in history_boxes:
entry = box.find("div", class_="entry").get_text(strip=False)
history_box_list.append(entry)
history = history_box_list
chatbot.append([None, f"[Local Message] 载入对话{len(qa_box_list)}条,上下文{len(history)}条。"])
return chatbot, history
@CatchException
def 对话历史存档(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
"""
txt 输入栏用户输入的文本,例如需要翻译的一段话,再例如一个包含了待处理文件的路径
llm_kwargs gpt模型参数如温度和top_p等一般原样传递下去就行
plugin_kwargs 插件模型的参数,暂时没有用武之地
chatbot 聊天显示框的句柄,用于显示给用户
history 聊天历史,前情提要
system_prompt 给gpt的静默提醒
user_request 当前用户的请求信息IP地址等
"""
file_name = plugin_kwargs.get("file_name", None)
chatbot.append((None, f"[Local Message] {write_chat_to_file_legacy(chatbot, history, file_name)},您可以调用下拉菜单中的“载入对话历史存档”还原当下的对话。"))
try:
chatbot.append((None, f"[Local Message] 正在尝试生成pdf以及word格式的对话存档请稍等..."))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 由于请求需要一段时间,我们先及时地做一次界面更新
lastmsg = f"[Local Message] {write_chat_to_file(chatbot, history, file_name)}" \
f"您可以调用下拉菜单中的“载入对话历史会话”还原当下的对话请注意目前只支持html格式载入历史。" \
f"当模型回答中存在表格将提取表格内容存储为Excel的xlsx格式如果你提供一些数据,然后输入指令要求模型帮你整理为表格" \
f"如“请帮我将下面的数据整理为表格再利用此插件就可以获取到Excel表格。"
yield from update_ui_latest_msg(lastmsg, chatbot, history) # 刷新界面 # 由于请求需要一段时间,我们先及时地做一次界面更新
except Exception as e:
logger.exception(f"已完成对话存档pdf和word格式的对话存档生成未成功{str(e)}")
lastmsg = "已完成对话存档pdf和word格式的对话存档生成未成功"
yield from update_ui_latest_msg(lastmsg, chatbot, history) # 刷新界面 # 由于请求需要一段时间,我们先及时地做一次界面更新
return
class Conversation_To_File_Wrap(GptAcademicPluginTemplate):
def __init__(self):
"""
请注意`execute`会执行在不同的线程中,因此您在定义和使用类变量时,应当慎之又慎!
"""
pass
def define_arg_selection_menu(self):
"""
定义插件的二级选项菜单
第一个参数,名称`file_name`,参数`type`声明这是一个文本框,文本框上方显示`title`,文本框内部显示`description``default_value`为默认值;
"""
gui_definition = {
"file_name": ArgProperty(title="保存文件名", description="输入对话存档文件名,留空则使用时间作为文件名", default_value="", type="string").model_dump_json(), # 主输入,自动从输入框同步
}
return gui_definition
def execute(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
"""
执行插件
"""
yield from 对话历史存档(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request)
def hide_cwd(str):
import os
current_path = os.getcwd()
replace_path = "."
return str.replace(current_path, replace_path)
@CatchException
def 载入对话历史存档(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
"""
txt 输入栏用户输入的文本,例如需要翻译的一段话,再例如一个包含了待处理文件的路径
llm_kwargs gpt模型参数如温度和top_p等一般原样传递下去就行
plugin_kwargs 插件模型的参数,暂时没有用武之地
chatbot 聊天显示框的句柄,用于显示给用户
history 聊天历史,前情提要
system_prompt 给gpt的静默提醒
user_request 当前用户的请求信息IP地址等
"""
from crazy_functions.crazy_utils import get_files_from_everything
success, file_manifest, _ = get_files_from_everything(txt, type='.html')
if not success:
if txt == "": txt = '空空如也的输入栏'
import glob
local_history = "<br/>".join([
"`"+hide_cwd(f)+f" ({gen_file_preview(f)})"+"`"
for f in glob.glob(
f'{get_log_folder(get_user(chatbot), plugin_name="chat_history")}/**/{f_prefix}*.html',
recursive=True
)])
chatbot.append([f"正在查找对话历史文件html格式: {txt}", f"找不到任何html文件: {txt}。但本地存储了以下历史文件,您可以将任意一个文件路径粘贴到输入区,然后重试:<br/>{local_history}"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
try:
chatbot, history = read_file_to_chat(chatbot, history, file_manifest[0])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
except:
chatbot.append([f"载入对话历史文件", f"对话历史文件损坏!"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
@CatchException
def 删除所有本地对话历史记录(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
"""
txt 输入栏用户输入的文本,例如需要翻译的一段话,再例如一个包含了待处理文件的路径
llm_kwargs gpt模型参数如温度和top_p等一般原样传递下去就行
plugin_kwargs 插件模型的参数,暂时没有用武之地
chatbot 聊天显示框的句柄,用于显示给用户
history 聊天历史,前情提要
system_prompt 给gpt的静默提醒
user_request 当前用户的请求信息IP地址等
"""
import glob, os
local_history = "<br/>".join([
"`"+hide_cwd(f)+"`"
for f in glob.glob(
f'{get_log_folder(get_user(chatbot), plugin_name="chat_history")}/**/{f_prefix}*.html', recursive=True
)])
for f in glob.glob(f'{get_log_folder(get_user(chatbot), plugin_name="chat_history")}/**/{f_prefix}*.html', recursive=True):
os.remove(f)
chatbot.append([f"删除所有历史对话文件", f"已删除<br/>{local_history}"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return

View File

@@ -0,0 +1,537 @@
import os
import threading
import time
from dataclasses import dataclass
from typing import List, Tuple, Dict, Generator
from crazy_functions.crazy_utils import request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency
from crazy_functions.pdf_fns.breakdown_txt import breakdown_text_to_satisfy_token_limit
from crazy_functions.rag_fns.rag_file_support import extract_text
from request_llms.bridge_all import model_info
from toolbox import update_ui, CatchException, report_exception
from shared_utils.fastapi_server import validate_path_safety
@dataclass
class FileFragment:
"""文件片段数据类,用于组织处理单元"""
file_path: str
content: str
rel_path: str
fragment_index: int
total_fragments: int
class BatchDocumentSummarizer:
"""优化的文档总结器 - 批处理版本"""
def __init__(self, llm_kwargs: Dict, query: str, chatbot: List, history: List, system_prompt: str):
"""初始化总结器"""
self.llm_kwargs = llm_kwargs
self.query = query
self.chatbot = chatbot
self.history = history
self.system_prompt = system_prompt
self.failed_files = []
self.file_summaries_map = {}
def _get_token_limit(self) -> int:
"""获取模型token限制"""
max_token = model_info[self.llm_kwargs['llm_model']]['max_token']
return max_token * 3 // 4
def _create_batch_inputs(self, fragments: List[FileFragment]) -> Tuple[List, List, List]:
"""创建批处理输入"""
inputs_array = []
inputs_show_user_array = []
history_array = []
for frag in fragments:
if self.query:
i_say = (f'请按照用户要求对文件内容进行处理,文件名为{os.path.basename(frag.file_path)}'
f'用户要求为:{self.query}'
f'文件内容是 ```{frag.content}```')
i_say_show_user = (f'正在处理 {frag.rel_path} (片段 {frag.fragment_index + 1}/{frag.total_fragments})')
else:
i_say = (f'请对下面的内容用中文做总结不超过500字文件名是{os.path.basename(frag.file_path)}'
f'内容是 ```{frag.content}```')
i_say_show_user = f'正在处理 {frag.rel_path} (片段 {frag.fragment_index + 1}/{frag.total_fragments})'
inputs_array.append(i_say)
inputs_show_user_array.append(i_say_show_user)
history_array.append([])
return inputs_array, inputs_show_user_array, history_array
def _process_single_file_with_timeout(self, file_info: Tuple[str, str], mutable_status: List) -> List[FileFragment]:
"""包装了超时控制的文件处理函数"""
def timeout_handler():
thread = threading.current_thread()
if hasattr(thread, '_timeout_occurred'):
thread._timeout_occurred = True
# 设置超时标记
thread = threading.current_thread()
thread._timeout_occurred = False
# 设置超时时间为30秒给予更多处理时间
TIMEOUT_SECONDS = 30
timer = threading.Timer(TIMEOUT_SECONDS, timeout_handler)
timer.start()
try:
fp, project_folder = file_info
fragments = []
# 定期检查是否超时
def check_timeout():
if hasattr(thread, '_timeout_occurred') and thread._timeout_occurred:
raise TimeoutError(f"处理文件 {os.path.basename(fp)} 超时({TIMEOUT_SECONDS}秒)")
# 更新状态
mutable_status[0] = "检查文件大小"
mutable_status[1] = time.time()
check_timeout()
# 文件大小检查
if os.path.getsize(fp) > self.max_file_size:
self.failed_files.append((fp, f"文件过大:超过{self.max_file_size / 1024 / 1024}MB"))
mutable_status[2] = "文件过大"
return fragments
# 更新状态
mutable_status[0] = "提取文件内容"
mutable_status[1] = time.time()
# 提取内容 - 使用单独的超时控制
content = None
extract_start_time = time.time()
try:
while True:
check_timeout() # 检查全局超时
# 检查提取过程是否超时10秒
if time.time() - extract_start_time > 10:
raise TimeoutError("文件内容提取超时10秒")
try:
content = extract_text(fp)
break
except Exception as e:
if "timeout" in str(e).lower():
continue # 如果是临时超时,重试
raise # 其他错误直接抛出
except Exception as e:
self.failed_files.append((fp, f"文件读取失败:{str(e)}"))
mutable_status[2] = "读取失败"
return fragments
if content is None:
self.failed_files.append((fp, "文件解析失败:不支持的格式或文件损坏"))
mutable_status[2] = "格式不支持"
return fragments
elif not content.strip():
self.failed_files.append((fp, "文件内容为空"))
mutable_status[2] = "内容为空"
return fragments
check_timeout()
# 更新状态
mutable_status[0] = "分割文本"
mutable_status[1] = time.time()
# 分割文本 - 添加超时检查
split_start_time = time.time()
try:
while True:
check_timeout() # 检查全局超时
# 检查分割过程是否超时5秒
if time.time() - split_start_time > 5:
raise TimeoutError("文本分割超时5秒")
paper_fragments = breakdown_text_to_satisfy_token_limit(
txt=content,
limit=self._get_token_limit(),
llm_model=self.llm_kwargs['llm_model']
)
break
except Exception as e:
self.failed_files.append((fp, f"文本分割失败:{str(e)}"))
mutable_status[2] = "分割失败"
return fragments
# 处理片段
rel_path = os.path.relpath(fp, project_folder)
for i, frag in enumerate(paper_fragments):
check_timeout() # 每处理一个片段检查一次超时
if frag.strip():
fragments.append(FileFragment(
file_path=fp,
content=frag,
rel_path=rel_path,
fragment_index=i,
total_fragments=len(paper_fragments)
))
mutable_status[2] = "处理完成"
return fragments
except TimeoutError as e:
self.failed_files.append((fp, str(e)))
mutable_status[2] = "处理超时"
return []
except Exception as e:
self.failed_files.append((fp, f"处理失败:{str(e)}"))
mutable_status[2] = "处理异常"
return []
finally:
timer.cancel()
def prepare_fragments(self, project_folder: str, file_paths: List[str]) -> Generator:
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
from typing import Generator, List
"""并行准备所有文件的处理片段"""
all_fragments = []
total_files = len(file_paths)
# 配置参数
self.refresh_interval = 0.2 # UI刷新间隔
self.watch_dog_patience = 5 # 看门狗超时时间
self.max_file_size = 10 * 1024 * 1024 # 10MB限制
self.max_workers = min(32, len(file_paths)) # 最多32个线程
# 创建有超时控制的线程池
executor = ThreadPoolExecutor(max_workers=self.max_workers)
# 用于跨线程状态传递的可变列表 - 增加文件名信息
mutable_status_array = [["等待中", time.time(), "pending", file_path] for file_path in file_paths]
# 创建文件处理任务
file_infos = [(fp, project_folder) for fp in file_paths]
# 提交所有任务,使用带超时控制的处理函数
futures = [
executor.submit(
self._process_single_file_with_timeout,
file_info,
mutable_status_array[i]
) for i, file_info in enumerate(file_infos)
]
# 更新UI的计数器
cnt = 0
try:
# 监控任务执行
while True:
time.sleep(self.refresh_interval)
cnt += 1
# 检查任务完成状态
worker_done = [f.done() for f in futures]
# 更新状态显示
status_str = ""
for i, (status, timestamp, desc, file_path) in enumerate(mutable_status_array):
# 获取文件名(去掉路径)
file_name = os.path.basename(file_path)
if worker_done[i]:
status_str += f"文件 {file_name}: {desc}\n\n"
else:
status_str += f"文件 {file_name}: {status} {desc}\n\n"
# 更新UI
self.chatbot[-1] = [
"处理进度",
f"正在处理文件...\n\n{status_str}" + "." * (cnt % 10 + 1)
]
yield from update_ui(chatbot=self.chatbot, history=self.history)
# 检查是否所有任务完成
if all(worker_done):
break
finally:
# 确保线程池正确关闭
executor.shutdown(wait=False)
# 收集结果
processed_files = 0
for future in futures:
try:
fragments = future.result(timeout=0.1) # 给予一个短暂的超时时间来获取结果
all_fragments.extend(fragments)
processed_files += 1
except concurrent.futures.TimeoutError:
# 处理获取结果超时
file_index = futures.index(future)
self.failed_files.append((file_paths[file_index], "结果获取超时"))
continue
except Exception as e:
# 处理其他异常
file_index = futures.index(future)
self.failed_files.append((file_paths[file_index], f"未知错误:{str(e)}"))
continue
# 最终进度更新
self.chatbot.append([
"文件处理完成",
f"成功处理 {len(all_fragments)} 个片段,失败 {len(self.failed_files)} 个文件"
])
yield from update_ui(chatbot=self.chatbot, history=self.history)
return all_fragments
def _process_fragments_batch(self, fragments: List[FileFragment]) -> Generator:
"""批量处理文件片段"""
from collections import defaultdict
batch_size = 64 # 每批处理的片段数
max_retries = 3 # 最大重试次数
retry_delay = 5 # 重试延迟(秒)
results = defaultdict(list)
# 按批次处理
for i in range(0, len(fragments), batch_size):
batch = fragments[i:i + batch_size]
inputs_array, inputs_show_user_array, history_array = self._create_batch_inputs(batch)
sys_prompt_array = ["请总结以下内容:"] * len(batch)
# 添加重试机制
for retry in range(max_retries):
try:
response_collection = yield from request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency(
inputs_array=inputs_array,
inputs_show_user_array=inputs_show_user_array,
llm_kwargs=self.llm_kwargs,
chatbot=self.chatbot,
history_array=history_array,
sys_prompt_array=sys_prompt_array,
)
# 处理响应
for j, frag in enumerate(batch):
summary = response_collection[j * 2 + 1]
if summary and summary.strip():
results[frag.rel_path].append({
'index': frag.fragment_index,
'summary': summary,
'total': frag.total_fragments
})
break # 成功处理,跳出重试循环
except Exception as e:
if retry == max_retries - 1: # 最后一次重试失败
for frag in batch:
self.failed_files.append((frag.file_path, f"处理失败:{str(e)}"))
else:
yield from update_ui(self.chatbot.append([f"批次处理失败,{retry_delay}秒后重试...", str(e)]))
time.sleep(retry_delay)
return results
def _generate_final_summary_request(self) -> Tuple[List, List, List]:
"""准备最终总结请求"""
if not self.file_summaries_map:
return (["无可用的文件总结"], ["生成最终总结"], [[]])
summaries = list(self.file_summaries_map.values())
if all(not summary for summary in summaries):
return (["所有文件处理均失败"], ["生成最终总结"], [[]])
if self.plugin_kwargs.get("advanced_arg"):
i_say = "根据以上所有文件的处理结果,按要求进行综合处理:" + self.plugin_kwargs['advanced_arg']
else:
i_say = "请根据以上所有文件的处理结果生成最终的总结不超过1000字。"
return ([i_say], [i_say], [summaries])
def process_files(self, project_folder: str, file_paths: List[str]) -> Generator:
"""处理所有文件"""
total_files = len(file_paths)
self.chatbot.append([f"开始处理", f"总计 {total_files} 个文件"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
# 1. 准备所有文件片段
# 在 process_files 函数中:
fragments = yield from self.prepare_fragments(project_folder, file_paths)
if not fragments:
self.chatbot.append(["处理失败", "没有可处理的文件内容"])
return "没有可处理的文件内容"
# 2. 批量处理所有文件片段
self.chatbot.append([f"文件分析", f"共计 {len(fragments)} 个处理单元"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
try:
file_summaries = yield from self._process_fragments_batch(fragments)
except Exception as e:
self.chatbot.append(["处理错误", f"批处理过程失败:{str(e)}"])
return "处理过程发生错误"
# 3. 为每个文件生成整体总结
self.chatbot.append(["生成总结", "正在汇总文件内容..."])
yield from update_ui(chatbot=self.chatbot, history=self.history)
# 处理每个文件的总结
for rel_path, summaries in file_summaries.items():
if len(summaries) > 1: # 多片段文件需要生成整体总结
sorted_summaries = sorted(summaries, key=lambda x: x['index'])
if self.plugin_kwargs.get("advanced_arg"):
i_say = f'请按照用户要求对文件内容进行处理,用户要求为:{self.plugin_kwargs["advanced_arg"]}'
else:
i_say = f"请总结文件 {os.path.basename(rel_path)} 的主要内容不超过500字。"
try:
summary_texts = [s['summary'] for s in sorted_summaries]
response_collection = yield from request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency(
inputs_array=[i_say],
inputs_show_user_array=[f"生成 {rel_path} 的处理结果"],
llm_kwargs=self.llm_kwargs,
chatbot=self.chatbot,
history_array=[summary_texts],
sys_prompt_array=["你是一个优秀的助手,"],
)
self.file_summaries_map[rel_path] = response_collection[1]
except Exception as e:
self.chatbot.append(["警告", f"文件 {rel_path} 总结生成失败:{str(e)}"])
self.file_summaries_map[rel_path] = "总结生成失败"
else: # 单片段文件直接使用其唯一的总结
self.file_summaries_map[rel_path] = summaries[0]['summary']
# 4. 生成最终总结
if total_files == 1:
return "文件数为1此时不调用总结模块"
else:
try:
# 收集所有文件的总结用于生成最终总结
file_summaries_for_final = []
for rel_path, summary in self.file_summaries_map.items():
file_summaries_for_final.append(f"文件 {rel_path} 的总结:\n{summary}")
if self.plugin_kwargs.get("advanced_arg"):
final_summary_prompt = ("根据以下所有文件的总结内容,按要求进行综合处理:" +
self.plugin_kwargs['advanced_arg'])
else:
final_summary_prompt = "请根据以下所有文件的总结内容,生成最终的总结报告。"
response_collection = yield from request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency(
inputs_array=[final_summary_prompt],
inputs_show_user_array=["生成最终总结报告"],
llm_kwargs=self.llm_kwargs,
chatbot=self.chatbot,
history_array=[file_summaries_for_final],
sys_prompt_array=["总结所有文件内容。"],
max_workers=1
)
return response_collection[1] if len(response_collection) > 1 else "生成总结失败"
except Exception as e:
self.chatbot.append(["错误", f"最终总结生成失败:{str(e)}"])
return "生成总结失败"
def save_results(self, final_summary: str):
"""保存结果到文件"""
from toolbox import promote_file_to_downloadzone, write_history_to_file
from crazy_functions.doc_fns.batch_file_query_doc import MarkdownFormatter, HtmlFormatter, WordFormatter
import os
timestamp = time.strftime("%Y%m%d_%H%M%S")
# 创建各种格式化器
md_formatter = MarkdownFormatter(final_summary, self.file_summaries_map, self.failed_files)
html_formatter = HtmlFormatter(final_summary, self.file_summaries_map, self.failed_files)
word_formatter = WordFormatter(final_summary, self.file_summaries_map, self.failed_files)
result_files = []
# 保存 Markdown
try:
md_content = md_formatter.create_document()
result_file_md = write_history_to_file(
history=[md_content], # 直接传入内容列表
file_basename=f"文档总结_{timestamp}.md"
)
result_files.append(result_file_md)
except:
pass
# 保存 HTML
try:
html_content = html_formatter.create_document()
result_file_html = write_history_to_file(
history=[html_content],
file_basename=f"文档总结_{timestamp}.html"
)
result_files.append(result_file_html)
except:
pass
# 保存 Word
try:
doc = word_formatter.create_document()
# 由于 Word 文档需要用 doc.save(),我们使用与 md 文件相同的目录
result_file_docx = os.path.join(
os.path.dirname(result_file_md),
f"文档总结_{timestamp}.docx"
)
doc.save(result_file_docx)
result_files.append(result_file_docx)
except:
pass
# 添加到下载区
for file in result_files:
promote_file_to_downloadzone(file, chatbot=self.chatbot)
self.chatbot.append(["处理完成", f"结果已保存至: {', '.join(result_files)}"])
@CatchException
def 批量文件询问(txt: str, llm_kwargs: Dict, plugin_kwargs: Dict, chatbot: List,
history: List, system_prompt: str, user_request: str):
"""主函数 - 优化版本"""
# 初始化
import glob
import re
from crazy_functions.rag_fns.rag_file_support import supports_format
from toolbox import report_exception
query = plugin_kwargs.get("advanced_arg")
summarizer = BatchDocumentSummarizer(llm_kwargs, query, chatbot, history, system_prompt)
chatbot.append(["函数插件功能", f"作者lbykkkk批量总结文件。支持格式: {', '.join(supports_format)}等其他文本格式文件如果长时间卡在文件处理过程请查看处理进度然后删除所有处于“pending”状态的文件然后重新上传处理。"])
yield from update_ui(chatbot=chatbot, history=history)
# 验证输入路径
if not os.path.exists(txt):
report_exception(chatbot, history, a=f"解析项目: {txt}", b=f"找不到项目或无权访问: {txt}")
yield from update_ui(chatbot=chatbot, history=history)
return
# 获取文件列表
project_folder = txt
user_name = chatbot.get_user()
validate_path_safety(project_folder, user_name)
extract_folder = next((d for d in glob.glob(f'{project_folder}/*')
if os.path.isdir(d) and d.endswith('.extract')), project_folder)
exclude_patterns = r'/[^/]+\.(zip|rar|7z|tar|gz)$'
file_manifest = [f for f in glob.glob(f'{extract_folder}/**', recursive=True)
if os.path.isfile(f) and not re.search(exclude_patterns, f)]
if not file_manifest:
report_exception(chatbot, history, a=f"解析项目: {txt}", b="未找到支持的文件类型")
yield from update_ui(chatbot=chatbot, history=history)
return
# 处理所有文件并生成总结
final_summary = yield from summarizer.process_files(project_folder, file_manifest)
yield from update_ui(chatbot=chatbot, history=history)
# 保存结果
summarizer.save_results(final_summary)
yield from update_ui(chatbot=chatbot, history=history)

View File

@@ -0,0 +1,36 @@
import random
from toolbox import get_conf
from crazy_functions.Document_Conversation import 批量文件询问
from crazy_functions.plugin_template.plugin_class_template import GptAcademicPluginTemplate, ArgProperty
class Document_Conversation_Wrap(GptAcademicPluginTemplate):
def __init__(self):
"""
请注意`execute`会执行在不同的线程中,因此您在定义和使用类变量时,应当慎之又慎!
"""
pass
def define_arg_selection_menu(self):
"""
定义插件的二级选项菜单
第一个参数,名称`main_input`,参数`type`声明这是一个文本框,文本框上方显示`title`,文本框内部显示`description``default_value`为默认值;
第二个参数,名称`advanced_arg`,参数`type`声明这是一个文本框,文本框上方显示`title`,文本框内部显示`description``default_value`为默认值;
第三个参数,名称`allow_cache`,参数`type`声明这是一个下拉菜单,下拉菜单上方显示`title`+`description`,下拉菜单的选项为`options``default_value`为下拉菜单默认值;
"""
gui_definition = {
"main_input":
ArgProperty(title="已上传的文件", description="上传文件后自动填充", default_value="", type="string").model_dump_json(),
"searxng_url":
ArgProperty(title="对材料提问", description="提问", default_value="", type="string").model_dump_json(), # 主输入,自动从输入框同步
}
return gui_definition
def execute(txt, llm_kwargs, plugin_kwargs:dict, chatbot, history, system_prompt, user_request):
"""
执行插件
"""
yield from 批量文件询问(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request)

View File

@@ -0,0 +1,673 @@
import os
import time
import glob
import re
import threading
from typing import Dict, List, Generator, Tuple
from dataclasses import dataclass
from crazy_functions.crazy_utils import request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency
from crazy_functions.pdf_fns.breakdown_txt import breakdown_text_to_satisfy_token_limit
from crazy_functions.rag_fns.rag_file_support import extract_text, supports_format, convert_to_markdown
from request_llms.bridge_all import model_info
from toolbox import update_ui, CatchException, report_exception, promote_file_to_downloadzone, write_history_to_file
from shared_utils.fastapi_server import validate_path_safety
# 新增:导入结构化论文提取器
from crazy_functions.doc_fns.read_fns.unstructured_all.paper_structure_extractor import PaperStructureExtractor, ExtractorConfig, StructuredPaper
# 导入格式化器
from crazy_functions.paper_fns.file2file_doc import (
TxtFormatter,
MarkdownFormatter,
HtmlFormatter,
WordFormatter
)
@dataclass
class TextFragment:
"""文本片段数据类,用于组织处理单元"""
content: str
fragment_index: int
total_fragments: int
class DocumentProcessor:
"""文档处理器 - 处理单个文档并输出结果"""
def __init__(self, llm_kwargs: Dict, plugin_kwargs: Dict, chatbot: List, history: List, system_prompt: str):
"""初始化处理器"""
self.llm_kwargs = llm_kwargs
self.plugin_kwargs = plugin_kwargs
self.chatbot = chatbot
self.history = history
self.system_prompt = system_prompt
self.processed_results = []
self.failed_fragments = []
# 新增:初始化论文结构提取器
self.paper_extractor = PaperStructureExtractor()
def _get_token_limit(self) -> int:
"""获取模型token限制返回更小的值以确保更细粒度的分割"""
max_token = model_info[self.llm_kwargs['llm_model']]['max_token']
# 降低token限制使每个片段更小
return max_token // 4 # 从3/4降低到1/4
def _create_batch_inputs(self, fragments: List[TextFragment]) -> Tuple[List, List, List]:
"""创建批处理输入"""
inputs_array = []
inputs_show_user_array = []
history_array = []
user_instruction = self.plugin_kwargs.get("advanced_arg", "请润色以下学术文本,提高其语言表达的准确性、专业性和流畅度,保持学术风格,确保逻辑连贯,但不改变原文的科学内容和核心观点")
for frag in fragments:
i_say = (f'请按照以下要求处理文本内容:{user_instruction}\n\n'
f'请将对文本的处理结果放在<decision>和</decision>标签之间。\n\n'
f'文本内容:\n```\n{frag.content}\n```')
i_say_show_user = f'正在处理文本片段 {frag.fragment_index + 1}/{frag.total_fragments}'
inputs_array.append(i_say)
inputs_show_user_array.append(i_say_show_user)
history_array.append([])
return inputs_array, inputs_show_user_array, history_array
def _extract_decision(self, text: str) -> str:
"""从LLM响应中提取<decision>标签内的内容"""
import re
pattern = r'<decision>(.*?)</decision>'
matches = re.findall(pattern, text, re.DOTALL)
if matches:
return matches[0].strip()
else:
# 如果没有找到标签,返回原始文本
return text.strip()
def process_file(self, file_path: str) -> Generator:
"""处理单个文件"""
self.chatbot.append(["开始处理文件", f"文件路径: {file_path}"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
try:
# 首先尝试转换为Markdown
from crazy_functions.rag_fns.rag_file_support import convert_to_markdown
file_path = convert_to_markdown(file_path)
# 1. 检查文件是否为支持的论文格式
is_paper_format = any(file_path.lower().endswith(ext) for ext in self.paper_extractor.SUPPORTED_EXTENSIONS)
if is_paper_format:
# 使用结构化提取器处理论文
return (yield from self._process_structured_paper(file_path))
else:
# 使用原有方式处理普通文档
return (yield from self._process_regular_file(file_path))
except Exception as e:
self.chatbot.append(["处理错误", f"文件处理失败: {str(e)}"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
return None
def _process_structured_paper(self, file_path: str) -> Generator:
"""处理结构化论文文件"""
# 1. 提取论文结构
self.chatbot[-1] = ["正在分析论文结构", f"文件路径: {file_path}"]
yield from update_ui(chatbot=self.chatbot, history=self.history)
try:
paper = self.paper_extractor.extract_paper_structure(file_path)
if not paper or not paper.sections:
self.chatbot.append(["无法提取论文结构", "将使用全文内容进行处理"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
# 使用全文内容进行段落切分
if paper and paper.full_text:
# 使用增强的分割函数进行更细致的分割
fragments = self._breakdown_section_content(paper.full_text)
# 创建文本片段对象
text_fragments = []
for i, frag in enumerate(fragments):
if frag.strip():
text_fragments.append(TextFragment(
content=frag,
fragment_index=i,
total_fragments=len(fragments)
))
# 批量处理片段
if text_fragments:
self.chatbot[-1] = ["开始处理文本", f"{len(text_fragments)} 个片段"]
yield from update_ui(chatbot=self.chatbot, history=self.history)
# 一次性准备所有输入
inputs_array, inputs_show_user_array, history_array = self._create_batch_inputs(text_fragments)
# 使用系统提示
instruction = self.plugin_kwargs.get("advanced_arg", "请润色以下学术文本,提高其语言表达的准确性、专业性和流畅度,保持学术风格,确保逻辑连贯,但不改变原文的科学内容和核心观点")
sys_prompt_array = [f"你是一个专业的学术文献编辑助手。请按照用户的要求:'{instruction}'处理文本。保持学术风格,增强表达的准确性和专业性。"] * len(text_fragments)
# 调用LLM一次性处理所有片段
response_collection = yield from request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency(
inputs_array=inputs_array,
inputs_show_user_array=inputs_show_user_array,
llm_kwargs=self.llm_kwargs,
chatbot=self.chatbot,
history_array=history_array,
sys_prompt_array=sys_prompt_array,
)
# 处理响应
for j, frag in enumerate(text_fragments):
try:
llm_response = response_collection[j * 2 + 1]
processed_text = self._extract_decision(llm_response)
if processed_text and processed_text.strip():
self.processed_results.append({
'index': frag.fragment_index,
'content': processed_text
})
else:
self.failed_fragments.append(frag)
self.processed_results.append({
'index': frag.fragment_index,
'content': frag.content
})
except Exception as e:
self.failed_fragments.append(frag)
self.processed_results.append({
'index': frag.fragment_index,
'content': frag.content
})
# 按原始顺序合并结果
self.processed_results.sort(key=lambda x: x['index'])
final_content = "\n".join([item['content'] for item in self.processed_results])
# 更新UI
success_count = len(text_fragments) - len(self.failed_fragments)
self.chatbot[-1] = ["处理完成", f"成功处理 {success_count}/{len(text_fragments)} 个片段"]
yield from update_ui(chatbot=self.chatbot, history=self.history)
return final_content
else:
self.chatbot.append(["处理失败", "未能提取到有效的文本内容"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
return None
else:
self.chatbot.append(["处理失败", "未能提取到论文内容"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
return None
# 2. 准备处理章节内容(不处理标题)
self.chatbot[-1] = ["已提取论文结构", f"{len(paper.sections)} 个主要章节"]
yield from update_ui(chatbot=self.chatbot, history=self.history)
# 3. 收集所有需要处理的章节内容并分割为合适大小
sections_to_process = []
section_map = {} # 用于映射处理前后的内容
def collect_section_contents(sections, parent_path=""):
"""递归收集章节内容,跳过参考文献部分"""
for i, section in enumerate(sections):
current_path = f"{parent_path}/{i}" if parent_path else f"{i}"
# 检查是否为参考文献部分,如果是则跳过
if section.section_type == 'references' or section.title.lower() in ['references', '参考文献', 'bibliography', '文献']:
continue # 跳过参考文献部分
# 只处理内容非空的章节
if section.content and section.content.strip():
# 使用增强的分割函数进行更细致的分割
fragments = self._breakdown_section_content(section.content)
for fragment_idx, fragment_content in enumerate(fragments):
if fragment_content.strip():
fragment_index = len(sections_to_process)
sections_to_process.append(TextFragment(
content=fragment_content,
fragment_index=fragment_index,
total_fragments=0 # 临时值,稍后更新
))
# 保存映射关系,用于稍后更新章节内容
# 为每个片段存储原始章节和片段索引信息
section_map[fragment_index] = (current_path, section, fragment_idx, len(fragments))
# 递归处理子章节
if section.subsections:
collect_section_contents(section.subsections, current_path)
# 收集所有章节内容
collect_section_contents(paper.sections)
# 更新总片段数
total_fragments = len(sections_to_process)
for frag in sections_to_process:
frag.total_fragments = total_fragments
# 4. 如果没有内容需要处理,直接返回
if not sections_to_process:
self.chatbot.append(["处理完成", "未找到需要处理的内容"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
return None
# 5. 批量处理章节内容
self.chatbot[-1] = ["开始处理论文内容", f"{len(sections_to_process)} 个内容片段"]
yield from update_ui(chatbot=self.chatbot, history=self.history)
# 一次性准备所有输入
inputs_array, inputs_show_user_array, history_array = self._create_batch_inputs(sections_to_process)
# 使用系统提示
instruction = self.plugin_kwargs.get("advanced_arg", "请润色以下学术文本,提高其语言表达的准确性、专业性和流畅度,保持学术风格,确保逻辑连贯,但不改变原文的科学内容和核心观点")
sys_prompt_array = [f"你是一个专业的学术文献编辑助手。请按照用户的要求:'{instruction}'处理文本。保持学术风格,增强表达的准确性和专业性。"] * len(sections_to_process)
# 调用LLM一次性处理所有片段
response_collection = yield from request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency(
inputs_array=inputs_array,
inputs_show_user_array=inputs_show_user_array,
llm_kwargs=self.llm_kwargs,
chatbot=self.chatbot,
history_array=history_array,
sys_prompt_array=sys_prompt_array,
)
# 处理响应,重组章节内容
section_contents = {} # 用于重组各章节的处理后内容
for j, frag in enumerate(sections_to_process):
try:
llm_response = response_collection[j * 2 + 1]
processed_text = self._extract_decision(llm_response)
if processed_text and processed_text.strip():
# 保存处理结果
self.processed_results.append({
'index': frag.fragment_index,
'content': processed_text
})
# 存储处理后的文本片段,用于后续重组
fragment_index = frag.fragment_index
if fragment_index in section_map:
path, section, fragment_idx, total_fragments = section_map[fragment_index]
# 初始化此章节的内容容器(如果尚未创建)
if path not in section_contents:
section_contents[path] = [""] * total_fragments
# 将处理后的片段放入正确位置
section_contents[path][fragment_idx] = processed_text
else:
self.failed_fragments.append(frag)
except Exception as e:
self.failed_fragments.append(frag)
# 重组每个章节的内容
for path, fragments in section_contents.items():
section = None
for idx in section_map:
if section_map[idx][0] == path:
section = section_map[idx][1]
break
if section:
# 合并该章节的所有处理后片段
section.content = "\n".join(fragments)
# 6. 更新UI
success_count = total_fragments - len(self.failed_fragments)
self.chatbot[-1] = ["处理完成", f"成功处理 {success_count}/{total_fragments} 个内容片段"]
yield from update_ui(chatbot=self.chatbot, history=self.history)
# 收集参考文献部分(不进行处理)
references_sections = []
def collect_references(sections, parent_path=""):
"""递归收集参考文献部分"""
for i, section in enumerate(sections):
current_path = f"{parent_path}/{i}" if parent_path else f"{i}"
# 检查是否为参考文献部分
if section.section_type == 'references' or section.title.lower() in ['references', '参考文献', 'bibliography', '文献']:
references_sections.append((current_path, section))
# 递归检查子章节
if section.subsections:
collect_references(section.subsections, current_path)
# 收集参考文献
collect_references(paper.sections)
# 7. 将处理后的结构化论文转换为Markdown
markdown_content = self.paper_extractor.generate_markdown(paper)
# 8. 返回处理后的内容
self.chatbot[-1] = ["处理完成", f"成功处理 {success_count}/{total_fragments} 个内容片段,参考文献部分未处理"]
yield from update_ui(chatbot=self.chatbot, history=self.history)
return markdown_content
except Exception as e:
self.chatbot.append(["结构化处理失败", f"错误: {str(e)},将尝试作为普通文件处理"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
return (yield from self._process_regular_file(file_path))
def _process_regular_file(self, file_path: str) -> Generator:
"""使用原有方式处理普通文件"""
# 原有的文件处理逻辑
self.chatbot[-1] = ["正在读取文件", f"文件路径: {file_path}"]
yield from update_ui(chatbot=self.chatbot, history=self.history)
content = extract_text(file_path)
if not content or not content.strip():
self.chatbot.append(["处理失败", "文件内容为空或无法提取内容"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
return None
# 2. 分割文本
self.chatbot[-1] = ["正在分析文件", "将文件内容分割为适当大小的片段"]
yield from update_ui(chatbot=self.chatbot, history=self.history)
# 使用增强的分割函数
fragments = self._breakdown_section_content(content)
# 3. 创建文本片段对象
text_fragments = []
for i, frag in enumerate(fragments):
if frag.strip():
text_fragments.append(TextFragment(
content=frag,
fragment_index=i,
total_fragments=len(fragments)
))
# 4. 处理所有片段
self.chatbot[-1] = ["开始处理文本", f"{len(text_fragments)} 个片段"]
yield from update_ui(chatbot=self.chatbot, history=self.history)
# 批量处理片段
batch_size = 8 # 每批处理的片段数
for i in range(0, len(text_fragments), batch_size):
batch = text_fragments[i:i + batch_size]
inputs_array, inputs_show_user_array, history_array = self._create_batch_inputs(batch)
# 使用系统提示
instruction = self.plugin_kwargs.get("advanced_arg", "请润色以下文本")
sys_prompt_array = [f"你是一个专业的文本处理助手。请按照用户的要求:'{instruction}'处理文本。"] * len(batch)
# 调用LLM处理
response_collection = yield from request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency(
inputs_array=inputs_array,
inputs_show_user_array=inputs_show_user_array,
llm_kwargs=self.llm_kwargs,
chatbot=self.chatbot,
history_array=history_array,
sys_prompt_array=sys_prompt_array,
)
# 处理响应
for j, frag in enumerate(batch):
try:
llm_response = response_collection[j * 2 + 1]
processed_text = self._extract_decision(llm_response)
if processed_text and processed_text.strip():
self.processed_results.append({
'index': frag.fragment_index,
'content': processed_text
})
else:
self.failed_fragments.append(frag)
self.processed_results.append({
'index': frag.fragment_index,
'content': frag.content # 如果处理失败,使用原始内容
})
except Exception as e:
self.failed_fragments.append(frag)
self.processed_results.append({
'index': frag.fragment_index,
'content': frag.content # 如果处理失败,使用原始内容
})
# 5. 按原始顺序合并结果
self.processed_results.sort(key=lambda x: x['index'])
final_content = "\n".join([item['content'] for item in self.processed_results])
# 6. 更新UI
success_count = len(text_fragments) - len(self.failed_fragments)
self.chatbot[-1] = ["处理完成", f"成功处理 {success_count}/{len(text_fragments)} 个片段"]
yield from update_ui(chatbot=self.chatbot, history=self.history)
return final_content
def save_results(self, content: str, original_file_path: str) -> List[str]:
"""保存处理结果为多种格式"""
if not content:
return []
timestamp = time.strftime("%Y%m%d_%H%M%S")
original_filename = os.path.basename(original_file_path)
filename_without_ext = os.path.splitext(original_filename)[0]
base_filename = f"{filename_without_ext}_processed_{timestamp}"
result_files = []
# 获取用户指定的处理类型
processing_type = self.plugin_kwargs.get("advanced_arg", "文本处理")
# 1. 保存为TXT
try:
txt_formatter = TxtFormatter()
txt_content = txt_formatter.create_document(content)
txt_file = write_history_to_file(
history=[txt_content],
file_basename=f"{base_filename}.txt"
)
result_files.append(txt_file)
except Exception as e:
self.chatbot.append(["警告", f"TXT格式保存失败: {str(e)}"])
# 2. 保存为Markdown
try:
md_formatter = MarkdownFormatter()
md_content = md_formatter.create_document(content, processing_type)
md_file = write_history_to_file(
history=[md_content],
file_basename=f"{base_filename}.md"
)
result_files.append(md_file)
except Exception as e:
self.chatbot.append(["警告", f"Markdown格式保存失败: {str(e)}"])
# 3. 保存为HTML
try:
html_formatter = HtmlFormatter(processing_type=processing_type)
html_content = html_formatter.create_document(content)
html_file = write_history_to_file(
history=[html_content],
file_basename=f"{base_filename}.html"
)
result_files.append(html_file)
except Exception as e:
self.chatbot.append(["警告", f"HTML格式保存失败: {str(e)}"])
# 4. 保存为Word
try:
word_formatter = WordFormatter()
doc = word_formatter.create_document(content, processing_type)
# 获取保存路径
from toolbox import get_log_folder
word_path = os.path.join(get_log_folder(), f"{base_filename}.docx")
doc.save(word_path)
# 5. 保存为PDF通过Word转换
try:
from crazy_functions.paper_fns.file2file_doc.word2pdf import WordToPdfConverter
pdf_path = WordToPdfConverter.convert_to_pdf(word_path)
result_files.append(pdf_path)
except Exception as e:
self.chatbot.append(["警告", f"PDF格式保存失败: {str(e)}"])
except Exception as e:
self.chatbot.append(["警告", f"Word格式保存失败: {str(e)}"])
# 添加到下载区
for file in result_files:
promote_file_to_downloadzone(file, chatbot=self.chatbot)
return result_files
def _breakdown_section_content(self, content: str) -> List[str]:
"""对文本内容进行分割与合并
主要按段落进行组织,只合并较小的段落以减少片段数量
保留原始段落结构,不对长段落进行强制分割
针对中英文设置不同的阈值,因为字符密度不同
"""
# 先按段落分割文本
paragraphs = content.split('\n\n')
# 检测语言类型
chinese_char_count = sum(1 for char in content if '\u4e00' <= char <= '\u9fff')
is_chinese_text = chinese_char_count / max(1, len(content)) > 0.3
# 根据语言类型设置不同的阈值(只用于合并小段落)
if is_chinese_text:
# 中文文本:一个汉字就是一个字符,信息密度高
min_chunk_size = 300 # 段落合并的最小阈值
target_size = 800 # 理想的段落大小
else:
# 英文文本:一个单词由多个字符组成,信息密度低
min_chunk_size = 600 # 段落合并的最小阈值
target_size = 1600 # 理想的段落大小
# 1. 只合并小段落,不对长段落进行分割
result_fragments = []
current_chunk = []
current_length = 0
for para in paragraphs:
# 如果段落太小且不会超过目标大小,则合并
if len(para) < min_chunk_size and current_length + len(para) <= target_size:
current_chunk.append(para)
current_length += len(para)
# 否则,创建新段落
else:
# 如果当前块非空且与当前段落无关,先保存它
if current_chunk and current_length > 0:
result_fragments.append('\n\n'.join(current_chunk))
# 当前段落作为新块
current_chunk = [para]
current_length = len(para)
# 如果当前块大小已接近目标大小,保存并开始新块
if current_length >= target_size:
result_fragments.append('\n\n'.join(current_chunk))
current_chunk = []
current_length = 0
# 保存最后一个块
if current_chunk:
result_fragments.append('\n\n'.join(current_chunk))
# 2. 处理可能过大的片段确保不超过token限制
final_fragments = []
max_token = self._get_token_limit()
for fragment in result_fragments:
# 检查fragment是否可能超出token限制
# 根据语言类型调整token估算
if is_chinese_text:
estimated_tokens = len(fragment) / 1.5 # 中文每个token约1-2个字符
else:
estimated_tokens = len(fragment) / 4 # 英文每个token约4个字符
if estimated_tokens > max_token:
# 即使可能超出限制,也尽量保持段落的完整性
# 使用breakdown_text但设置更大的限制来减少分割
larger_limit = max_token * 0.95 # 使用95%的限制
sub_fragments = breakdown_text_to_satisfy_token_limit(
txt=fragment,
limit=larger_limit,
llm_model=self.llm_kwargs['llm_model']
)
final_fragments.extend(sub_fragments)
else:
final_fragments.append(fragment)
return final_fragments
@CatchException
def 自定义智能文档处理(txt: str, llm_kwargs: Dict, plugin_kwargs: Dict, chatbot: List,
history: List, system_prompt: str, user_request: str):
"""主函数 - 文件到文件处理"""
# 初始化
processor = DocumentProcessor(llm_kwargs, plugin_kwargs, chatbot, history, system_prompt)
chatbot.append(["函数插件功能", "文件内容处理:将文档内容按照指定要求处理后输出为多种格式"])
yield from update_ui(chatbot=chatbot, history=history)
# 验证输入路径
if not os.path.exists(txt):
report_exception(chatbot, history, a=f"解析路径: {txt}", b=f"找不到路径或无权访问: {txt}")
yield from update_ui(chatbot=chatbot, history=history)
return
# 验证路径安全性
user_name = chatbot.get_user()
validate_path_safety(txt, user_name)
# 获取文件列表
if os.path.isfile(txt):
# 单个文件处理
file_paths = [txt]
else:
# 目录处理 - 类似批量文件询问插件
project_folder = txt
extract_folder = next((d for d in glob.glob(f'{project_folder}/*')
if os.path.isdir(d) and d.endswith('.extract')), project_folder)
# 排除压缩文件
exclude_patterns = r'/[^/]+\.(zip|rar|7z|tar|gz)$'
file_paths = [f for f in glob.glob(f'{extract_folder}/**', recursive=True)
if os.path.isfile(f) and not re.search(exclude_patterns, f)]
# 过滤支持的文件格式
file_paths = [f for f in file_paths if any(f.lower().endswith(ext) for ext in
list(processor.paper_extractor.SUPPORTED_EXTENSIONS) + ['.json', '.csv', '.xlsx', '.xls'])]
if not file_paths:
report_exception(chatbot, history, a=f"解析路径: {txt}", b="未找到支持的文件类型")
yield from update_ui(chatbot=chatbot, history=history)
return
# 处理文件
if len(file_paths) > 1:
chatbot.append(["发现多个文件", f"共找到 {len(file_paths)} 个文件,将处理第一个文件"])
yield from update_ui(chatbot=chatbot, history=history)
# 只处理第一个文件
file_to_process = file_paths[0]
processed_content = yield from processor.process_file(file_to_process)
if processed_content:
# 保存结果
result_files = processor.save_results(processed_content, file_to_process)
if result_files:
chatbot.append(["处理完成", f"已生成 {len(result_files)} 个结果文件"])
else:
chatbot.append(["处理完成", "但未能保存任何结果文件"])
else:
chatbot.append(["处理失败", "未能生成有效的处理结果"])
yield from update_ui(chatbot=chatbot, history=history)

View File

@@ -6,18 +6,18 @@
- 将图像转为灰度图像
- 将csv文件转excel表格
Testing:
- Crop the image, keeping the bottom half.
- Swap the blue channel and red channel of the image.
- Convert the image to grayscale.
Testing:
- Crop the image, keeping the bottom half.
- Swap the blue channel and red channel of the image.
- Convert the image to grayscale.
- Convert the CSV file to an Excel spreadsheet.
"""
from toolbox import CatchException, update_ui, gen_time_str, trimmed_format_exc, is_the_upload_folder
from toolbox import promote_file_to_downloadzone, get_log_folder, update_ui_lastest_msg
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive, get_plugin_arg
from .crazy_utils import input_clipping, try_install_deps
from toolbox import promote_file_to_downloadzone, get_log_folder, update_ui_latest_msg
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive, get_plugin_arg
from crazy_functions.crazy_utils import input_clipping, try_install_deps
from crazy_functions.gen_fns.gen_fns_shared import is_function_successfully_generated
from crazy_functions.gen_fns.gen_fns_shared import get_class_name
from crazy_functions.gen_fns.gen_fns_shared import subprocess_worker
@@ -27,14 +27,14 @@ import time
import glob
import multiprocessing
templete = """
template = """
```python
import ... # Put dependencies here, e.g. import numpy as np.
import ... # Put dependencies here, e.g. import numpy as np.
class TerminalFunction(object): # Do not change the name of the class, The name of the class must be `TerminalFunction`
def run(self, path): # The name of the function must be `run`, it takes only a positional argument.
# rewrite the function you have just written here
# rewrite the function you have just written here
...
return generated_file_path
```
@@ -48,7 +48,7 @@ def get_code_block(reply):
import re
pattern = r"```([\s\S]*?)```" # regex pattern to match code blocks
matches = re.findall(pattern, reply) # find all code blocks in text
if len(matches) == 1:
if len(matches) == 1:
return matches[0].strip('python') # code block
for match in matches:
if 'class TerminalFunction' in match:
@@ -68,8 +68,8 @@ def gpt_interact_multi_step(txt, file_type, llm_kwargs, chatbot, history):
# 第一步
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=i_say, inputs_show_user=i_say,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=demo,
inputs=i_say, inputs_show_user=i_say,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=demo,
sys_prompt= r"You are a world-class programmer."
)
history.extend([i_say, gpt_say])
@@ -77,38 +77,38 @@ def gpt_interact_multi_step(txt, file_type, llm_kwargs, chatbot, history):
# 第二步
prompt_compose = [
"If previous stage is successful, rewrite the function you have just written to satisfy following templete: \n",
templete
"If previous stage is successful, rewrite the function you have just written to satisfy following template: \n",
template
]
i_say = "".join(prompt_compose); inputs_show_user = "If previous stage is successful, rewrite the function you have just written to satisfy executable templete. "
i_say = "".join(prompt_compose); inputs_show_user = "If previous stage is successful, rewrite the function you have just written to satisfy executable template. "
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=i_say, inputs_show_user=inputs_show_user,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
inputs=i_say, inputs_show_user=inputs_show_user,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
sys_prompt= r"You are a programmer. You need to replace `...` with valid packages, do not give `...` in your answer!"
)
code_to_return = gpt_say
history.extend([i_say, gpt_say])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 界面更新
# # 第三步
# i_say = "Please list to packages to install to run the code above. Then show me how to use `try_install_deps` function to install them."
# i_say += 'For instance. `try_install_deps(["opencv-python", "scipy", "numpy"])`'
# installation_advance = yield from request_gpt_model_in_new_thread_with_ui_alive(
# inputs=i_say, inputs_show_user=inputs_show_user,
# llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
# inputs=i_say, inputs_show_user=inputs_show_user,
# llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
# sys_prompt= r"You are a programmer."
# )
# # # 第三步
# # # 第三步
# i_say = "Show me how to use `pip` to install packages to run the code above. "
# i_say += 'For instance. `pip install -r opencv-python scipy numpy`'
# installation_advance = yield from request_gpt_model_in_new_thread_with_ui_alive(
# inputs=i_say, inputs_show_user=i_say,
# llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
# inputs=i_say, inputs_show_user=i_say,
# llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
# sys_prompt= r"You are a programmer."
# )
installation_advance = ""
return code_to_return, installation_advance, txt, file_type, llm_kwargs, chatbot, history
@@ -117,7 +117,7 @@ def gpt_interact_multi_step(txt, file_type, llm_kwargs, chatbot, history):
def for_immediate_show_off_when_possible(file_type, fp, chatbot):
if file_type in ['png', 'jpg']:
image_path = os.path.abspath(fp)
chatbot.append(['这是一张图片, 展示如下:',
chatbot.append(['这是一张图片, 展示如下:',
f'本地文件地址: <br/>`{image_path}`<br/>'+
f'本地文件预览: <br/><div align="center"><img src="file={image_path}"></div>'
])
@@ -139,7 +139,7 @@ def get_recent_file_prompt_support(chatbot):
return path
@CatchException
def 函数动态生成(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
def Dynamic_Function_Generate(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
"""
txt 输入栏用户输入的文本例如需要翻译的一段话再例如一个包含了待处理文件的路径
llm_kwargs gpt模型参数如温度和top_p等一般原样传递下去就行
@@ -159,33 +159,33 @@ def 函数动态生成(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_
# ⭐ 文件上传区是否有东西
# 1. 如果有文件: 作为函数参数
# 2. 如果没有文件需要用GPT提取参数 (太懒了,以后再写,虚空终端已经实现了类似的代码)
# 2. 如果没有文件需要用GPT提取参数 (太懒了,以后再写,Void_Terminal已经实现了类似的代码)
file_list = []
if get_plugin_arg(plugin_kwargs, key="file_path_arg", default=False):
file_path = get_plugin_arg(plugin_kwargs, key="file_path_arg", default=None)
file_list.append(file_path)
yield from update_ui_lastest_msg(f"当前文件: {file_path}", chatbot, history, 1)
yield from update_ui_latest_msg(f"当前文件: {file_path}", chatbot, history, 1)
elif have_any_recent_upload_files(chatbot):
file_dir = get_recent_file_prompt_support(chatbot)
file_list = glob.glob(os.path.join(file_dir, '**/*'), recursive=True)
yield from update_ui_lastest_msg(f"当前文件处理列表: {file_list}", chatbot, history, 1)
yield from update_ui_latest_msg(f"当前文件处理列表: {file_list}", chatbot, history, 1)
else:
chatbot.append(["文件检索", "没有发现任何近期上传的文件。"])
yield from update_ui_lastest_msg("没有发现任何近期上传的文件。", chatbot, history, 1)
yield from update_ui_latest_msg("没有发现任何近期上传的文件。", chatbot, history, 1)
return # 2. 如果没有文件
if len(file_list) == 0:
chatbot.append(["文件检索", "没有发现任何近期上传的文件。"])
yield from update_ui_lastest_msg("没有发现任何近期上传的文件。", chatbot, history, 1)
yield from update_ui_latest_msg("没有发现任何近期上传的文件。", chatbot, history, 1)
return # 2. 如果没有文件
# 读取文件
file_type = file_list[0].split('.')[-1]
# 粗心检查
if is_the_upload_folder(txt):
yield from update_ui_lastest_msg(f"请在输入框内填写需求, 然后再次点击该插件! 至于您的文件,不用担心, 文件路径 {txt} 已经被记忆. ", chatbot, history, 1)
yield from update_ui_latest_msg(f"请在输入框内填写需求, 然后再次点击该插件! 至于您的文件,不用担心, 文件路径 {txt} 已经被记忆. ", chatbot, history, 1)
return
# 开始干正事
MAX_TRY = 3
for j in range(MAX_TRY): # 最多重试5次
@@ -195,7 +195,7 @@ def 函数动态生成(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_
code, installation_advance, txt, file_type, llm_kwargs, chatbot, history = \
yield from gpt_interact_multi_step(txt, file_type, llm_kwargs, chatbot, history)
chatbot.append(["代码生成阶段结束", ""])
yield from update_ui_lastest_msg(f"正在验证上述代码的有效性 ...", chatbot, history, 1)
yield from update_ui_latest_msg(f"正在验证上述代码的有效性 ...", chatbot, history, 1)
# ⭐ 分离代码块
code = get_code_block(code)
# ⭐ 检查模块
@@ -206,11 +206,11 @@ def 函数动态生成(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_
if not traceback: traceback = trimmed_format_exc()
# 处理异常
if not traceback: traceback = trimmed_format_exc()
yield from update_ui_lastest_msg(f"{j+1}/{MAX_TRY} 次代码生成尝试, 失败了~ 别担心, 我们5秒后再试一次... \n\n此次我们的错误追踪是\n```\n{traceback}\n```\n", chatbot, history, 5)
yield from update_ui_latest_msg(f"{j+1}/{MAX_TRY} 次代码生成尝试, 失败了~ 别担心, 我们5秒后再试一次... \n\n此次我们的错误追踪是\n```\n{traceback}\n```\n", chatbot, history, 5)
# 代码生成结束, 开始执行
TIME_LIMIT = 15
yield from update_ui_lastest_msg(f"开始创建新进程并执行代码! 时间限制 {TIME_LIMIT} 秒. 请等待任务完成... ", chatbot, history, 1)
yield from update_ui_latest_msg(f"开始创建新进程并执行代码! 时间限制 {TIME_LIMIT} 秒. 请等待任务完成... ", chatbot, history, 1)
manager = multiprocessing.Manager()
return_dict = manager.dict()
@@ -238,7 +238,7 @@ def 函数动态生成(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_
# chatbot.append(["如果是缺乏依赖,请参考以下建议", installation_advance])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# 顺利完成,收尾
res = str(res)
if os.path.exists(res):
@@ -248,5 +248,5 @@ def 函数动态生成(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 界面更新
else:
chatbot.append(["执行成功了,结果是一个字符串", "结果:" + res])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 界面更新
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 界面更新

View File

@@ -1,6 +1,6 @@
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from toolbox import CatchException, report_exception, promote_file_to_downloadzone
from toolbox import update_ui, update_ui_lastest_msg, disable_auto_promotion, write_history_to_file
from toolbox import update_ui, update_ui_latest_msg, disable_auto_promotion, write_history_to_file
import logging
import requests
import time
@@ -20,10 +20,10 @@ def get_meta_information(url, chatbot, history):
proxies = get_conf('proxies')
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7',
'Cache-Control':'max-age=0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
'Connection': 'keep-alive'
}
try:
@@ -95,7 +95,7 @@ def get_meta_information(url, chatbot, history):
)
try: paper = next(search.results())
except: paper = None
is_match = paper is not None and string_similar(title, paper.title) > 0.90
# 如果在Arxiv上匹配失败检索文章的历史版本的题目
@@ -132,7 +132,7 @@ def get_meta_information(url, chatbot, history):
return profile
@CatchException
def 谷歌检索小助手(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
def Google_Scholar_Assistant_Legacy(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
disable_auto_promotion(chatbot=chatbot)
# 基本信息:功能、贡献者
chatbot.append([
@@ -146,8 +146,8 @@ def 谷歌检索小助手(txt, llm_kwargs, plugin_kwargs, chatbot, history, syst
import math
from bs4 import BeautifulSoup
except:
report_exception(chatbot, history,
a = f"解析项目: {txt}",
report_exception(chatbot, history,
a = f"解析项目: {txt}",
b = f"导入软件依赖失败。使用该模块需要额外依赖,安装方法```pip install --upgrade beautifulsoup4 arxiv```。")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
@@ -156,14 +156,14 @@ def 谷歌检索小助手(txt, llm_kwargs, plugin_kwargs, chatbot, history, syst
history = []
meta_paper_info_list = yield from get_meta_information(txt, chatbot, history)
if len(meta_paper_info_list) == 0:
yield from update_ui_lastest_msg(lastmsg='获取文献失败可能触发了google反爬虫机制。',chatbot=chatbot, history=history, delay=0)
yield from update_ui_latest_msg(lastmsg='获取文献失败可能触发了google反爬虫机制。',chatbot=chatbot, history=history, delay=0)
return
batchsize = 5
for batch in range(math.ceil(len(meta_paper_info_list)/batchsize)):
if len(meta_paper_info_list[:batchsize]) > 0:
i_say = "下面是一些学术文献的数据,提取出以下内容:" + \
"1、英文题目2、中文题目翻译3、作者4、arxiv公开is_paper_in_arxiv4、引用数量cite5、中文摘要翻译。" + \
f"以下是信息源:{str(meta_paper_info_list[:batchsize])}"
f"以下是信息源:{str(meta_paper_info_list[:batchsize])}"
inputs_show_user = f"请分析此页面中出现的所有文章:{txt},这是第{batch+1}"
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
@@ -175,11 +175,11 @@ def 谷歌检索小助手(txt, llm_kwargs, plugin_kwargs, chatbot, history, syst
history.extend([ f"{batch+1}", gpt_say ])
meta_paper_info_list = meta_paper_info_list[batchsize:]
chatbot.append(["状态?",
chatbot.append(["状态?",
"已经全部完成您可以试试让AI写一个Related Works例如您可以继续输入Write a \"Related Works\" section about \"你搜索的研究领域\" for me."])
msg = '正常'
yield from update_ui(chatbot=chatbot, history=history, msg=msg) # 刷新界面
path = write_history_to_file(history)
promote_file_to_downloadzone(path, chatbot=chatbot)
chatbot.append(("完成了吗?", path));
chatbot.append(("完成了吗?", path));
yield from update_ui(chatbot=chatbot, history=history, msg=msg) # 刷新界面

View File

@@ -7,7 +7,7 @@ def gen_image(llm_kwargs, prompt, resolution="1024x1024", model="dall-e-2", qual
from request_llms.bridge_all import model_info
proxies = get_conf('proxies')
# Set up OpenAI API key and model
# Set up OpenAI API key and model
api_key = select_api_key(llm_kwargs['api_key'], llm_kwargs['llm_model'])
chat_endpoint = model_info[llm_kwargs['llm_model']]['endpoint']
# 'https://api.openai.com/v1/chat/completions'
@@ -30,7 +30,7 @@ def gen_image(llm_kwargs, prompt, resolution="1024x1024", model="dall-e-2", qual
if style is not None:
data['style'] = style
response = requests.post(url, headers=headers, json=data, proxies=proxies)
print(response.content)
# logger.info(response.content)
try:
image_url = json.loads(response.content.decode('utf8'))['data'][0]['url']
except:
@@ -76,7 +76,7 @@ def edit_image(llm_kwargs, prompt, image_path, resolution="1024x1024", model="da
}
response = requests.post(url, headers=headers, files=files, proxies=proxies)
print(response.content)
# logger.info(response.content)
try:
image_url = json.loads(response.content.decode('utf8'))['data'][0]['url']
except:
@@ -108,12 +108,12 @@ def 图片生成_DALLE2(prompt, llm_kwargs, plugin_kwargs, chatbot, history, sys
chatbot.append((prompt, "[Local Message] 图像生成提示为空白,请在“输入区”输入图像生成提示。"))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 界面更新
return
chatbot.append(("您正在调用“图像生成”插件。", "[Local Message] 生成图像, 请先把模型切换至gpt-*。如果中文Prompt效果不理想, 请尝试英文Prompt。正在处理中 ....."))
chatbot.append(("您正在调用“图像生成”插件。", "[Local Message] 生成图像, 使用前请切换模型到GPT系列。如果中文Prompt效果不理想, 请尝试英文Prompt。正在处理中 ....."))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 由于请求gpt需要一段时间,我们先及时地做一次界面更新
if ("advanced_arg" in plugin_kwargs) and (plugin_kwargs["advanced_arg"] == ""): plugin_kwargs.pop("advanced_arg")
resolution = plugin_kwargs.get("advanced_arg", '1024x1024')
image_url, image_path = gen_image(llm_kwargs, prompt, resolution)
chatbot.append([prompt,
chatbot.append([prompt,
f'图像中转网址: <br/>`{image_url}`<br/>'+
f'中转网址预览: <br/><div align="center"><img src="{image_url}"></div>'
f'本地文件地址: <br/>`{image_path}`<br/>'+
@@ -129,7 +129,7 @@ def 图片生成_DALLE3(prompt, llm_kwargs, plugin_kwargs, chatbot, history, sys
chatbot.append((prompt, "[Local Message] 图像生成提示为空白,请在“输入区”输入图像生成提示。"))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 界面更新
return
chatbot.append(("您正在调用“图像生成”插件。", "[Local Message] 生成图像, 请先把模型切换至gpt-*。如果中文Prompt效果不理想, 请尝试英文Prompt。正在处理中 ....."))
chatbot.append(("您正在调用“图像生成”插件。", "[Local Message] 生成图像, 使用前请切换模型到GPT系列。如果中文Prompt效果不理想, 请尝试英文Prompt。正在处理中 ....."))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 由于请求gpt需要一段时间,我们先及时地做一次界面更新
if ("advanced_arg" in plugin_kwargs) and (plugin_kwargs["advanced_arg"] == ""): plugin_kwargs.pop("advanced_arg")
resolution_arg = plugin_kwargs.get("advanced_arg", '1024x1024-standard-vivid').lower()
@@ -144,7 +144,7 @@ def 图片生成_DALLE3(prompt, llm_kwargs, plugin_kwargs, chatbot, history, sys
elif part in ['vivid', 'natural']:
style = part
image_url, image_path = gen_image(llm_kwargs, prompt, resolution, model="dall-e-3", quality=quality, style=style)
chatbot.append([prompt,
chatbot.append([prompt,
f'图像中转网址: <br/>`{image_url}`<br/>'+
f'中转网址预览: <br/><div align="center"><img src="{image_url}"></div>'
f'本地文件地址: <br/>`{image_path}`<br/>'+
@@ -164,9 +164,9 @@ class ImageEditState(GptAcademicState):
confirm = (len(file_manifest) >= 1 and file_manifest[0].endswith('.png') and os.path.exists(file_manifest[0]))
file = None if not confirm else file_manifest[0]
return confirm, file
def lock_plugin(self, chatbot):
chatbot._cookies['lock_plugin'] = 'crazy_functions.图片生成->图片修改_DALLE2'
chatbot._cookies['lock_plugin'] = 'crazy_functions.Image_Generate->图片修改_DALLE2'
self.dump_state(chatbot)
def unlock_plugin(self, chatbot):

View File

@@ -0,0 +1,56 @@
from toolbox import get_conf, update_ui
from crazy_functions.Image_Generate import 图片生成_DALLE2, 图片生成_DALLE3, 图片修改_DALLE2
from crazy_functions.plugin_template.plugin_class_template import GptAcademicPluginTemplate, ArgProperty
class ImageGen_Wrap(GptAcademicPluginTemplate):
def __init__(self):
"""
请注意`execute`会执行在不同的线程中,因此您在定义和使用类变量时,应当慎之又慎!
"""
pass
def define_arg_selection_menu(self):
"""
定义插件的二级选项菜单
第一个参数,名称`main_input`,参数`type`声明这是一个文本框,文本框上方显示`title`,文本框内部显示`description``default_value`为默认值;
第二个参数,名称`advanced_arg`,参数`type`声明这是一个文本框,文本框上方显示`title`,文本框内部显示`description``default_value`为默认值;
"""
gui_definition = {
"main_input":
ArgProperty(title="输入图片描述", description="需要生成图像的文本描述,尽量使用英文", default_value="", type="string").model_dump_json(), # 主输入,自动从输入框同步
"model_name":
ArgProperty(title="模型", options=["DALLE2", "DALLE3"], default_value="DALLE3", description="", type="dropdown").model_dump_json(),
"resolution":
ArgProperty(title="分辨率", options=["256x256(限DALLE2)", "512x512(限DALLE2)", "1024x1024", "1792x1024(限DALLE3)", "1024x1792(限DALLE3)"], default_value="1024x1024", description="", type="dropdown").model_dump_json(),
"quality (仅DALLE3生效)":
ArgProperty(title="质量", options=["standard", "hd"], default_value="standard", description="", type="dropdown").model_dump_json(),
"style (仅DALLE3生效)":
ArgProperty(title="风格", options=["vivid", "natural"], default_value="vivid", description="", type="dropdown").model_dump_json(),
}
return gui_definition
def execute(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
"""
执行插件
"""
# 分辨率
resolution = plugin_kwargs["resolution"].replace("(限DALLE2)", "").replace("(限DALLE3)", "")
if plugin_kwargs["model_name"] == "DALLE2":
plugin_kwargs["advanced_arg"] = resolution
yield from 图片生成_DALLE2(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request)
elif plugin_kwargs["model_name"] == "DALLE3":
quality = plugin_kwargs["quality (仅DALLE3生效)"]
style = plugin_kwargs["style (仅DALLE3生效)"]
plugin_kwargs["advanced_arg"] = f"{resolution}-{quality}-{style}"
yield from 图片生成_DALLE3(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request)
else:
chatbot.append([None, "抱歉,找不到该模型"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面

View File

@@ -1,6 +1,5 @@
from toolbox import CatchException, update_ui
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
@CatchException
def 交互功能模板函数(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
@@ -14,13 +13,13 @@ def 交互功能模板函数(txt, llm_kwargs, plugin_kwargs, chatbot, history, s
user_request 当前用户的请求信息IP地址等
"""
history = [] # 清空历史,以免输入溢出
chatbot.append(("这是什么功能?", "交互功能函数模板。在执行完成之后, 可以将自身的状态存储到cookie中, 等待用户的再次调用。"))
chatbot.append(("这是什么功能?", "Interactive_Func_Template。在执行完成之后, 可以将自身的状态存储到cookie中, 等待用户的再次调用。"))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
state = chatbot._cookies.get('plugin_state_0001', None) # 初始化插件状态
if state is None:
chatbot._cookies['lock_plugin'] = 'crazy_functions.交互功能函数模板->交互功能模板函数' # 赋予插件锁定 锁定插件回调路径,当下一次用户提交时,会直接转到该函数
chatbot._cookies['lock_plugin'] = 'crazy_functions.Interactive_Func_Template->交互功能模板函数' # 赋予插件锁定 锁定插件回调路径,当下一次用户提交时,会直接转到该函数
chatbot._cookies['plugin_state_0001'] = 'wait_user_keyword' # 赋予插件状态
chatbot.append(("第一次调用:", "请输入关键词, 我将为您查找相关壁纸, 建议使用英文单词, 插件锁定中,请直接提交即可。"))
@@ -38,7 +37,7 @@ def 交互功能模板函数(txt, llm_kwargs, plugin_kwargs, chatbot, history, s
inputs=inputs_show_user=f"Extract all image urls in this html page, pick the first 5 images and show them with markdown format: \n\n {page_return}"
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=inputs, inputs_show_user=inputs_show_user,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=[],
llm_kwargs=llm_kwargs, chatbot=chatbot, history=[],
sys_prompt="When you want to show an image, use markdown format. e.g. ![image_description](image_url). If there are no image url provided, answer 'no image url provided'"
)
chatbot[-1] = [chatbot[-1][0], gpt_say]

View File

@@ -1,4 +1,4 @@
from toolbox import CatchException, update_ui, update_ui_lastest_msg
from toolbox import CatchException, update_ui, update_ui_latest_msg
from crazy_functions.multi_stage.multi_stage_utils import GptAcademicGameBaseState
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from request_llms.bridge_all import predict_no_ui_long_connection
@@ -12,11 +12,11 @@ def 随机小游戏(prompt, llm_kwargs, plugin_kwargs, chatbot, history, system_
# 选择游戏
cls = MiniGame_ResumeStory
# 如果之前已经初始化了游戏实例,则继续该实例;否则重新初始化
state = cls.sync_state(chatbot,
llm_kwargs,
cls,
state = cls.sync_state(chatbot,
llm_kwargs,
cls,
plugin_name='MiniGame_ResumeStory',
callback_fn='crazy_functions.互动小游戏->随机小游戏',
callback_fn='crazy_functions.Interactive_Mini_Game->随机小游戏',
lock_plugin=True
)
yield from state.continue_game(prompt, chatbot, history)
@@ -30,11 +30,11 @@ def 随机小游戏1(prompt, llm_kwargs, plugin_kwargs, chatbot, history, system
# 选择游戏
cls = MiniGame_ASCII_Art
# 如果之前已经初始化了游戏实例,则继续该实例;否则重新初始化
state = cls.sync_state(chatbot,
llm_kwargs,
cls,
state = cls.sync_state(chatbot,
llm_kwargs,
cls,
plugin_name='MiniGame_ASCII_Art',
callback_fn='crazy_functions.互动小游戏->随机小游戏1',
callback_fn='crazy_functions.Interactive_Mini_Game->随机小游戏1',
lock_plugin=True
)
yield from state.continue_game(prompt, chatbot, history)

View File

@@ -0,0 +1,365 @@
import requests
import random
import time
import re
import json
from bs4 import BeautifulSoup
from functools import lru_cache
from itertools import zip_longest
from check_proxy import check_proxy
from toolbox import CatchException, update_ui, get_conf, update_ui_latest_msg
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive, input_clipping
from request_llms.bridge_all import model_info
from request_llms.bridge_all import predict_no_ui_long_connection
from crazy_functions.prompts.internet import SearchOptimizerPrompt, SearchAcademicOptimizerPrompt
def search_optimizer(
query,
proxies,
history,
llm_kwargs,
optimizer=1,
categories="general",
searxng_url=None,
engines=None,
):
# ------------- < 第1步尝试进行搜索优化 > -------------
# * 增强优化,会尝试结合历史记录进行搜索优化
if optimizer == 2:
his = " "
if len(history) == 0:
pass
else:
for i, h in enumerate(history):
if i % 2 == 0:
his += f"Q: {h}\n"
else:
his += f"A: {h}\n"
if categories == "general":
sys_prompt = SearchOptimizerPrompt.format(query=query, history=his, num=4)
elif categories == "science":
sys_prompt = SearchAcademicOptimizerPrompt.format(query=query, history=his, num=4)
else:
his = " "
if categories == "general":
sys_prompt = SearchOptimizerPrompt.format(query=query, history=his, num=3)
elif categories == "science":
sys_prompt = SearchAcademicOptimizerPrompt.format(query=query, history=his, num=3)
mutable = ["", time.time(), ""]
llm_kwargs["temperature"] = 0.8
try:
query_json = predict_no_ui_long_connection(
inputs=query,
llm_kwargs=llm_kwargs,
history=[],
sys_prompt=sys_prompt,
observe_window=mutable,
)
except Exception:
query_json = "null"
#* 尝试解码优化后的搜索结果
query_json = re.sub(r"```json|```", "", query_json)
try:
queries = json.loads(query_json)
except Exception:
#* 如果解码失败,降低温度再试一次
try:
llm_kwargs["temperature"] = 0.4
query_json = predict_no_ui_long_connection(
inputs=query,
llm_kwargs=llm_kwargs,
history=[],
sys_prompt=sys_prompt,
observe_window=mutable,
)
query_json = re.sub(r"```json|```", "", query_json)
queries = json.loads(query_json)
except Exception:
#* 如果再次失败,直接返回原始问题
queries = [query]
links = []
success = 0
Exceptions = ""
for q in queries:
try:
link = searxng_request(q, proxies, categories, searxng_url, engines=engines)
if len(link) > 0:
links.append(link[:-5])
success += 1
except Exception:
Exceptions = Exception
pass
if success == 0:
raise ValueError(f"在线搜索失败!\n{Exceptions}")
# * 清洗搜索结果,依次放入每组第一,第二个搜索结果,并清洗重复的搜索结果
seen_links = set()
result = []
for tuple in zip_longest(*links, fillvalue=None):
for item in tuple:
if item is not None:
link = item["link"]
if link not in seen_links:
seen_links.add(link)
result.append(item)
return result
@lru_cache
def get_auth_ip():
ip = check_proxy(None, return_ip=True)
if ip is None:
return '114.114.114.' + str(random.randint(1, 10))
return ip
def searxng_request(query, proxies, categories='general', searxng_url=None, engines=None):
if searxng_url is None:
urls = get_conf("SEARXNG_URLS")
url = random.choice(urls)
else:
url = searxng_url
if engines == "Mixed":
engines = None
if categories == 'general':
params = {
'q': query, # 搜索查询
'format': 'json', # 输出格式为JSON
'language': 'zh', # 搜索语言
'engines': engines,
}
elif categories == 'science':
params = {
'q': query, # 搜索查询
'format': 'json', # 输出格式为JSON
'language': 'zh', # 搜索语言
'categories': 'science'
}
else:
raise ValueError('不支持的检索类型')
headers = {
'Accept-Language': 'zh-CN,zh;q=0.9',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
'X-Forwarded-For': get_auth_ip(),
'X-Real-IP': get_auth_ip()
}
results = []
response = requests.post(url, params=params, headers=headers, proxies=proxies, timeout=30)
if response.status_code == 200:
json_result = response.json()
for result in json_result['results']:
item = {
"title": result.get("title", ""),
"source": result.get("engines", "unknown"),
"content": result.get("content", ""),
"link": result["url"],
}
results.append(item)
return results
else:
if response.status_code == 429:
raise ValueError("Searxng在线搜索服务当前使用人数太多请稍后。")
else:
raise ValueError("在线搜索失败,状态码: " + str(response.status_code) + '\t' + response.content.decode('utf-8'))
def scrape_text(url, proxies) -> str:
"""Scrape text from a webpage
Args:
url (str): The URL to scrape text from
Returns:
str: The scraped text
"""
from loguru import logger
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36',
'Content-Type': 'text/plain',
}
# 首先采用Jina进行文本提取
if get_conf("JINA_API_KEY"):
try: return jina_scrape_text(url)
except: logger.debug("Jina API 请求失败,回到旧方法")
try:
response = requests.get(url, headers=headers, proxies=proxies, timeout=8)
if response.encoding == "ISO-8859-1": response.encoding = response.apparent_encoding
except:
return "无法连接到该网页"
soup = BeautifulSoup(response.text, "html.parser")
for script in soup(["script", "style"]):
script.extract()
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
text = "\n".join(chunk for chunk in chunks if chunk)
return text
def jina_scrape_text(url) -> str:
"jina_39727421c8fa4e4fa9bd698e5211feaaDyGeVFESNrRaepWiLT0wmHYJSh-d"
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36',
'Content-Type': 'text/plain',
"X-Retain-Images": "none",
"Authorization": f'Bearer {get_conf("JINA_API_KEY")}'
}
response = requests.get("https://r.jina.ai/" + url, headers=headers, proxies=None, timeout=8)
if response.status_code != 200:
raise ValueError("Jina API 请求失败,开始尝试旧方法!" + response.text)
if response.encoding == "ISO-8859-1": response.encoding = response.apparent_encoding
result = response.text
result = result.replace("\\[", "[").replace("\\]", "]").replace("\\(", "(").replace("\\)", ")")
return response.text
def internet_search_with_analysis_prompt(prompt, analysis_prompt, llm_kwargs, chatbot):
from toolbox import get_conf
proxies = get_conf('proxies')
categories = 'general'
searxng_url = None # 使用默认的searxng_url
engines = None # 使用默认的搜索引擎
yield from update_ui_latest_msg(lastmsg=f"检索中: {prompt} ...", chatbot=chatbot, history=[], delay=1)
urls = searxng_request(prompt, proxies, categories, searxng_url, engines=engines)
yield from update_ui_latest_msg(lastmsg=f"依次访问搜索到的网站 ...", chatbot=chatbot, history=[], delay=1)
if len(urls) == 0:
return None
max_search_result = 5 # 最多收纳多少个网页的结果
history = []
for index, url in enumerate(urls[:max_search_result]):
yield from update_ui_latest_msg(lastmsg=f"依次访问搜索到的网站: {url['link']} ...", chatbot=chatbot, history=[], delay=1)
res = scrape_text(url['link'], proxies)
prefix = f"{index}份搜索结果 [源自{url['source'][0]}搜索] {url['title'][:25]}"
history.extend([prefix, res])
i_say = f"从以上搜索结果中抽取信息,然后回答问题:{prompt} {analysis_prompt}"
i_say, history = input_clipping( # 裁剪输入从最长的条目开始裁剪防止爆token
inputs=i_say,
history=history,
max_token_limit=8192
)
gpt_say = predict_no_ui_long_connection(
inputs=i_say,
llm_kwargs=llm_kwargs,
history=history,
sys_prompt="请从搜索结果中抽取信息,对最相关的两个搜索结果进行总结,然后回答问题。",
console_silence=False,
)
return gpt_say
@CatchException
def 连接网络回答问题(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
optimizer_history = history[:-8]
history = [] # 清空历史,以免输入溢出
chatbot.append((f"请结合互联网信息回答以下问题:{txt}", "检索中..."))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# ------------- < 第1步爬取搜索引擎的结果 > -------------
from toolbox import get_conf
proxies = get_conf('proxies')
categories = plugin_kwargs.get('categories', 'general')
searxng_url = plugin_kwargs.get('searxng_url', None)
engines = plugin_kwargs.get('engine', None)
optimizer = plugin_kwargs.get('optimizer', "关闭")
if optimizer == "关闭":
urls = searxng_request(txt, proxies, categories, searxng_url, engines=engines)
else:
urls = search_optimizer(txt, proxies, optimizer_history, llm_kwargs, optimizer, categories, searxng_url, engines)
history = []
if len(urls) == 0:
chatbot.append((f"结论:{txt}", "[Local Message] 受到限制无法从searxng获取信息请尝试更换搜索引擎。"))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# ------------- < 第2步依次访问网页 > -------------
from concurrent.futures import ThreadPoolExecutor
from textwrap import dedent
max_search_result = 5 # 最多收纳多少个网页的结果
if optimizer == "开启(增强)":
max_search_result = 8
template = dedent("""
<details>
<summary>{TITLE}</summary>
<div class="search_result">{URL}</div>
<div class="search_result">{CONTENT}</div>
</details>
""")
buffer = ""
# 创建线程池
with ThreadPoolExecutor(max_workers=5) as executor:
# 提交任务到线程池
futures = []
for index, url in enumerate(urls[:max_search_result]):
future = executor.submit(scrape_text, url['link'], proxies)
futures.append((index, future, url))
# 处理完成的任务
for index, future, url in futures:
# 开始
prefix = f"正在加载 第{index+1}份搜索结果 [源自{url['source'][0]}搜索] {url['title'][:25]}"
string_structure = template.format(TITLE=prefix, URL=url['link'], CONTENT="正在加载,请稍后 ......")
yield from update_ui_latest_msg(lastmsg=(buffer + string_structure), chatbot=chatbot, history=history, delay=0.1) # 刷新界面
# 获取结果
res = future.result()
# 显示结果
prefix = f"{index+1}份搜索结果 [源自{url['source'][0]}搜索] {url['title'][:25]}"
string_structure = template.format(TITLE=prefix, URL=url['link'], CONTENT=res[:1000] + "......")
buffer += string_structure
# 更新历史
history.extend([prefix, res])
yield from update_ui_latest_msg(lastmsg=buffer, chatbot=chatbot, history=history, delay=0.1) # 刷新界面
# ------------- < 第3步ChatGPT综合 > -------------
if (optimizer != "开启(增强)"):
i_say = f"从以上搜索结果中抽取信息,然后回答问题:{txt}"
i_say, history = input_clipping( # 裁剪输入从最长的条目开始裁剪防止爆token
inputs=i_say,
history=history,
max_token_limit=min(model_info[llm_kwargs['llm_model']]['max_token']*3//4, 8192)
)
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=i_say, inputs_show_user=i_say,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
sys_prompt="请从给定的若干条搜索结果中抽取信息,对最相关的两个搜索结果进行总结,然后回答问题。"
)
chatbot[-1] = (i_say, gpt_say)
history.append(i_say);history.append(gpt_say)
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 界面更新
#* 或者使用搜索优化器,这样可以保证后续问答能读取到有效的历史记录
else:
i_say = f"从以上搜索结果中抽取与问题:{txt} 相关的信息:"
i_say, history = input_clipping( # 裁剪输入从最长的条目开始裁剪防止爆token
inputs=i_say,
history=history,
max_token_limit=min(model_info[llm_kwargs['llm_model']]['max_token']*3//4, 8192)
)
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=i_say, inputs_show_user=i_say,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
sys_prompt="请从给定的若干条搜索结果中抽取信息,对最相关的三个搜索结果进行总结"
)
chatbot[-1] = (i_say, gpt_say)
history = []
history.append(i_say);history.append(gpt_say)
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 界面更新
# ------------- < 第4步根据综合回答问题 > -------------
i_say = f"请根据以上搜索结果回答问题:{txt}"
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=i_say, inputs_show_user=i_say,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
sys_prompt="请根据给定的若干条搜索结果回答问题"
)
chatbot[-1] = (i_say, gpt_say)
history.append(i_say);history.append(gpt_say)
yield from update_ui(chatbot=chatbot, history=history)

View File

@@ -1,5 +1,5 @@
from toolbox import CatchException, update_ui
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive, input_clipping
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive, input_clipping
import requests
from bs4 import BeautifulSoup
from request_llms.bridge_all import model_info
@@ -22,8 +22,8 @@ def bing_search(query, proxies=None):
item = {'title': title, 'link': link}
results.append(item)
for r in results:
print(r['link'])
# for r in results:
# print(r['link'])
return results

View File

@@ -1,5 +1,5 @@
from toolbox import CatchException, update_ui
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive, input_clipping
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive, input_clipping
import requests
from bs4 import BeautifulSoup
from request_llms.bridge_all import model_info
@@ -23,8 +23,8 @@ def google(query, proxies):
item = {'title': title, 'link': link}
results.append(item)
for r in results:
print(r['link'])
# for r in results:
# print(r['link'])
return results
def scrape_text(url, proxies) -> str:
@@ -40,10 +40,10 @@ def scrape_text(url, proxies) -> str:
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36',
'Content-Type': 'text/plain',
}
try:
try:
response = requests.get(url, headers=headers, proxies=proxies, timeout=8)
if response.encoding == "ISO-8859-1": response.encoding = response.apparent_encoding
except:
except:
return "无法连接到该网页"
soup = BeautifulSoup(response.text, "html.parser")
for script in soup(["script", "style"]):
@@ -66,7 +66,7 @@ def 连接网络回答问题(txt, llm_kwargs, plugin_kwargs, chatbot, history, s
user_request 当前用户的请求信息IP地址等
"""
history = [] # 清空历史,以免输入溢出
chatbot.append((f"请结合互联网信息回答以下问题:{txt}",
chatbot.append((f"请结合互联网信息回答以下问题:{txt}",
"[Local Message] 请注意,您正在调用一个[函数插件]的模板该模板可以实现ChatGPT联网信息综合。该函数面向希望实现更多有趣功能的开发者它可以作为创建新功能函数的模板。您若希望分享新的功能模组请不吝PR"))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 由于请求gpt需要一段时间我们先及时地做一次界面更新
@@ -91,13 +91,13 @@ def 连接网络回答问题(txt, llm_kwargs, plugin_kwargs, chatbot, history, s
# ------------- < 第3步ChatGPT综合 > -------------
i_say = f"从以上搜索结果中抽取信息,然后回答问题:{txt}"
i_say, history = input_clipping( # 裁剪输入从最长的条目开始裁剪防止爆token
inputs=i_say,
history=history,
inputs=i_say,
history=history,
max_token_limit=model_info[llm_kwargs['llm_model']]['max_token']*3//4
)
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=i_say, inputs_show_user=i_say,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
inputs=i_say, inputs_show_user=i_say,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
sys_prompt="请从给定的若干条搜索结果中抽取信息,对最相关的两个搜索结果进行总结,然后回答问题。"
)
chatbot[-1] = (i_say, gpt_say)

View File

@@ -0,0 +1,49 @@
import random
from toolbox import get_conf
from crazy_functions.Internet_GPT import 连接网络回答问题
from crazy_functions.plugin_template.plugin_class_template import GptAcademicPluginTemplate, ArgProperty
class NetworkGPT_Wrap(GptAcademicPluginTemplate):
def __init__(self):
"""
请注意`execute`会执行在不同的线程中,因此您在定义和使用类变量时,应当慎之又慎!
"""
pass
def define_arg_selection_menu(self):
"""
定义插件的二级选项菜单
第一个参数,名称`main_input`,参数`type`声明这是一个文本框,文本框上方显示`title`,文本框内部显示`description``default_value`为默认值;
第二个参数,名称`advanced_arg`,参数`type`声明这是一个文本框,文本框上方显示`title`,文本框内部显示`description``default_value`为默认值;
第三个参数,名称`allow_cache`,参数`type`声明这是一个下拉菜单,下拉菜单上方显示`title`+`description`,下拉菜单的选项为`options``default_value`为下拉菜单默认值;
"""
urls = get_conf("SEARXNG_URLS")
url = random.choice(urls)
gui_definition = {
"main_input":
ArgProperty(title="输入问题", description="待通过互联网检索的问题,会自动读取输入框内容", default_value="", type="string").model_dump_json(), # 主输入,自动从输入框同步
"categories":
ArgProperty(title="搜索分类", options=["网页", "学术论文"], default_value="网页", description="", type="dropdown").model_dump_json(),
"engine":
ArgProperty(title="选择搜索引擎", options=["Mixed", "bing", "google", "duckduckgo"], default_value="google", description="", type="dropdown").model_dump_json(),
"optimizer":
ArgProperty(title="搜索优化", options=["关闭", "开启", "开启(增强)"], default_value="关闭", description="是否使用搜索增强。注意这可能会消耗较多token", type="dropdown").model_dump_json(),
"searxng_url":
ArgProperty(title="Searxng服务地址", description="输入Searxng的地址", default_value=url, type="string").model_dump_json(), # 主输入,自动从输入框同步
}
return gui_definition
def execute(txt, llm_kwargs, plugin_kwargs:dict, chatbot, history, system_prompt, user_request):
"""
执行插件
"""
if plugin_kwargs.get("categories", None) == "网页": plugin_kwargs["categories"] = "general"
elif plugin_kwargs.get("categories", None) == "学术论文": plugin_kwargs["categories"] = "science"
else: plugin_kwargs["categories"] = "general"
yield from 连接网络回答问题(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request)

View File

@@ -0,0 +1,595 @@
from toolbox import update_ui, trimmed_format_exc, get_conf, get_log_folder, promote_file_to_downloadzone, check_repeat_upload, map_file_to_sha256
from toolbox import CatchException, report_exception, update_ui_latest_msg, zip_result, gen_time_str
from functools import partial
from loguru import logger
import glob, os, requests, time, json, tarfile, threading
pj = os.path.join
ARXIV_CACHE_DIR = get_conf("ARXIV_CACHE_DIR")
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 工具函数 =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# 专业词汇声明 = 'If the term "agent" is used in this section, it should be translated to "智能体". '
def switch_prompt(pfg, mode, more_requirement):
"""
Generate prompts and system prompts based on the mode for proofreading or translating.
Args:
- pfg: Proofreader or Translator instance.
- mode: A string specifying the mode, either 'proofread' or 'translate_zh'.
Returns:
- inputs_array: A list of strings containing prompts for users to respond to.
- sys_prompt_array: A list of strings containing prompts for system prompts.
"""
n_split = len(pfg.sp_file_contents)
if mode == 'proofread_en':
inputs_array = [r"Below is a section from an academic paper, proofread this section." +
r"Do not modify any latex command such as \section, \cite, \begin, \item and equations. " + more_requirement +
r"Answer me only with the revised text:" +
f"\n\n{frag}" for frag in pfg.sp_file_contents]
sys_prompt_array = ["You are a professional academic paper writer." for _ in range(n_split)]
elif mode == 'translate_zh':
inputs_array = [
r"Below is a section from an English academic paper, translate it into Chinese. " + more_requirement +
r"Do not modify any latex command such as \section, \cite, \begin, \item and equations. " +
r"Answer me only with the translated text:" +
f"\n\n{frag}" for frag in pfg.sp_file_contents]
sys_prompt_array = ["You are a professional translator." for _ in range(n_split)]
else:
assert False, "未知指令"
return inputs_array, sys_prompt_array
def descend_to_extracted_folder_if_exist(project_folder):
"""
Descend into the extracted folder if it exists, otherwise return the original folder.
Args:
- project_folder: A string specifying the folder path.
Returns:
- A string specifying the path to the extracted folder, or the original folder if there is no extracted folder.
"""
maybe_dir = [f for f in glob.glob(f'{project_folder}/*') if os.path.isdir(f)]
if len(maybe_dir) == 0: return project_folder
if maybe_dir[0].endswith('.extract'): return maybe_dir[0]
return project_folder
def move_project(project_folder, arxiv_id=None):
"""
Create a new work folder and copy the project folder to it.
Args:
- project_folder: A string specifying the folder path of the project.
Returns:
- A string specifying the path to the new work folder.
"""
import shutil, time
time.sleep(2) # avoid time string conflict
if arxiv_id is not None:
new_workfolder = pj(ARXIV_CACHE_DIR, arxiv_id, 'workfolder')
else:
new_workfolder = f'{get_log_folder()}/{gen_time_str()}'
try:
shutil.rmtree(new_workfolder)
except:
pass
# align subfolder if there is a folder wrapper
items = glob.glob(pj(project_folder, '*'))
items = [item for item in items if os.path.basename(item) != '__MACOSX']
if len(glob.glob(pj(project_folder, '*.tex'))) == 0 and len(items) == 1:
if os.path.isdir(items[0]): project_folder = items[0]
shutil.copytree(src=project_folder, dst=new_workfolder)
return new_workfolder
def arxiv_download(chatbot, history, txt, allow_cache=True):
def check_cached_translation_pdf(arxiv_id):
translation_dir = pj(ARXIV_CACHE_DIR, arxiv_id, 'translation')
if not os.path.exists(translation_dir):
os.makedirs(translation_dir)
target_file = pj(translation_dir, 'translate_zh.pdf')
if os.path.exists(target_file):
promote_file_to_downloadzone(target_file, rename_file=None, chatbot=chatbot)
target_file_compare = pj(translation_dir, 'comparison.pdf')
if os.path.exists(target_file_compare):
promote_file_to_downloadzone(target_file_compare, rename_file=None, chatbot=chatbot)
return target_file
return False
def is_float(s):
try:
float(s)
return True
except ValueError:
return False
if txt.startswith('https://arxiv.org/pdf/'):
arxiv_id = txt.split('/')[-1] # 2402.14207v2.pdf
txt = arxiv_id.split('v')[0] # 2402.14207
if ('.' in txt) and ('/' not in txt) and is_float(txt): # is arxiv ID
txt = 'https://arxiv.org/abs/' + txt.strip()
if ('.' in txt) and ('/' not in txt) and is_float(txt[:10]): # is arxiv ID
txt = 'https://arxiv.org/abs/' + txt[:10]
if not txt.startswith('https://arxiv.org'):
return txt, None # 是本地文件,跳过下载
# <-------------- inspect format ------------->
chatbot.append([f"检测到arxiv文档连接", '尝试下载 ...'])
yield from update_ui(chatbot=chatbot, history=history)
time.sleep(1) # 刷新界面
url_ = txt # https://arxiv.org/abs/1707.06690
if not txt.startswith('https://arxiv.org/abs/'):
msg = f"解析arxiv网址失败, 期望格式例如: https://arxiv.org/abs/1707.06690。实际得到格式: {url_}"
yield from update_ui_latest_msg(msg, chatbot=chatbot, history=history) # 刷新界面
return msg, None
# <-------------- set format ------------->
arxiv_id = url_.split('/abs/')[-1]
if 'v' in arxiv_id: arxiv_id = arxiv_id[:10]
cached_translation_pdf = check_cached_translation_pdf(arxiv_id)
if cached_translation_pdf and allow_cache: return cached_translation_pdf, arxiv_id
extract_dst = pj(ARXIV_CACHE_DIR, arxiv_id, 'extract')
translation_dir = pj(ARXIV_CACHE_DIR, arxiv_id, 'e-print')
dst = pj(translation_dir, arxiv_id + '.tar')
os.makedirs(translation_dir, exist_ok=True)
# <-------------- download arxiv source file ------------->
def fix_url_and_download():
# for url_tar in [url_.replace('/abs/', '/e-print/'), url_.replace('/abs/', '/src/')]:
for url_tar in [url_.replace('/abs/', '/src/'), url_.replace('/abs/', '/e-print/')]:
proxies = get_conf('proxies')
r = requests.get(url_tar, proxies=proxies)
if r.status_code == 200:
with open(dst, 'wb+') as f:
f.write(r.content)
return True
return False
if os.path.exists(dst) and allow_cache:
yield from update_ui_latest_msg(f"调用缓存 {arxiv_id}", chatbot=chatbot, history=history) # 刷新界面
success = True
else:
yield from update_ui_latest_msg(f"开始下载 {arxiv_id}", chatbot=chatbot, history=history) # 刷新界面
success = fix_url_and_download()
yield from update_ui_latest_msg(f"下载完成 {arxiv_id}", chatbot=chatbot, history=history) # 刷新界面
if not success:
yield from update_ui_latest_msg(f"下载失败 {arxiv_id}", chatbot=chatbot, history=history)
raise tarfile.ReadError(f"论文下载失败 {arxiv_id}")
# <-------------- extract file ------------->
from toolbox import extract_archive
try:
extract_archive(file_path=dst, dest_dir=extract_dst)
except tarfile.ReadError:
os.remove(dst)
raise tarfile.ReadError(f"论文下载失败")
return extract_dst, arxiv_id
def pdf2tex_project(pdf_file_path, plugin_kwargs):
if plugin_kwargs["method"] == "MATHPIX":
# Mathpix API credentials
app_id, app_key = get_conf('MATHPIX_APPID', 'MATHPIX_APPKEY')
headers = {"app_id": app_id, "app_key": app_key}
# Step 1: Send PDF file for processing
options = {
"conversion_formats": {"tex.zip": True},
"math_inline_delimiters": ["$", "$"],
"rm_spaces": True
}
response = requests.post(url="https://api.mathpix.com/v3/pdf",
headers=headers,
data={"options_json": json.dumps(options)},
files={"file": open(pdf_file_path, "rb")})
if response.ok:
pdf_id = response.json()["pdf_id"]
logger.info(f"PDF processing initiated. PDF ID: {pdf_id}")
# Step 2: Check processing status
while True:
conversion_response = requests.get(f"https://api.mathpix.com/v3/pdf/{pdf_id}", headers=headers)
conversion_data = conversion_response.json()
if conversion_data["status"] == "completed":
logger.info("PDF processing completed.")
break
elif conversion_data["status"] == "error":
logger.info("Error occurred during processing.")
else:
logger.info(f"Processing status: {conversion_data['status']}")
time.sleep(5) # wait for a few seconds before checking again
# Step 3: Save results to local files
output_dir = os.path.join(os.path.dirname(pdf_file_path), 'mathpix_output')
if not os.path.exists(output_dir):
os.makedirs(output_dir)
url = f"https://api.mathpix.com/v3/pdf/{pdf_id}.tex"
response = requests.get(url, headers=headers)
file_name_wo_dot = '_'.join(os.path.basename(pdf_file_path).split('.')[:-1])
output_name = f"{file_name_wo_dot}.tex.zip"
output_path = os.path.join(output_dir, output_name)
with open(output_path, "wb") as output_file:
output_file.write(response.content)
logger.info(f"tex.zip file saved at: {output_path}")
import zipfile
unzip_dir = os.path.join(output_dir, file_name_wo_dot)
with zipfile.ZipFile(output_path, 'r') as zip_ref:
zip_ref.extractall(unzip_dir)
return unzip_dir
else:
logger.error(f"Error sending PDF for processing. Status code: {response.status_code}")
return None
else:
from crazy_functions.pdf_fns.parse_pdf_via_doc2x import 解析PDF_DOC2X_转Latex
unzip_dir = 解析PDF_DOC2X_转Latex(pdf_file_path)
return unzip_dir
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 插件主程序1 =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
@CatchException
def Latex英文纠错加PDF对比(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
# <-------------- information about this plugin ------------->
chatbot.append(["函数插件功能?",
"对整个Latex项目进行纠错, 用latex编译为PDF对修正处做高亮。函数插件贡献者: Binary-Husky。注意事项: 目前对机器学习类文献转化效果最好其他类型文献转化效果未知。仅在Windows系统进行了测试其他操作系统表现未知。"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# <-------------- more requirements ------------->
if ("advanced_arg" in plugin_kwargs) and (plugin_kwargs["advanced_arg"] == ""): plugin_kwargs.pop("advanced_arg")
more_req = plugin_kwargs.get("advanced_arg", "")
_switch_prompt_ = partial(switch_prompt, more_requirement=more_req)
# <-------------- check deps ------------->
try:
import glob, os, time, subprocess
subprocess.Popen(['pdflatex', '-version'])
from .latex_fns.latex_actions import Latex精细分解与转化, 编译Latex
except Exception as e:
chatbot.append([f"解析项目: {txt}",
f"尝试执行Latex指令失败。Latex没有安装, 或者不在环境变量PATH中。安装方法https://tug.org/texlive/。报错信息\n\n```\n\n{trimmed_format_exc()}\n\n```\n\n"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# <-------------- clear history and read input ------------->
history = []
if os.path.exists(txt):
project_folder = txt
else:
if txt == "": txt = '空空如也的输入栏'
report_exception(chatbot, history, a=f"解析项目: {txt}", b=f"找不到本地项目或无权访问: {txt}")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
file_manifest = [f for f in glob.glob(f'{project_folder}/**/*.tex', recursive=True)]
if len(file_manifest) == 0:
report_exception(chatbot, history, a=f"解析项目: {txt}", b=f"找不到任何.tex文件: {txt}")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# <-------------- if is a zip/tar file ------------->
project_folder = descend_to_extracted_folder_if_exist(project_folder)
# <-------------- move latex project away from temp folder ------------->
from shared_utils.fastapi_server import validate_path_safety
validate_path_safety(project_folder, chatbot.get_user())
project_folder = move_project(project_folder, arxiv_id=None)
# <-------------- if merge_translate_zh is already generated, skip gpt req ------------->
if not os.path.exists(project_folder + '/merge_proofread_en.tex'):
yield from Latex精细分解与转化(file_manifest, project_folder, llm_kwargs, plugin_kwargs,
chatbot, history, system_prompt, mode='proofread_en',
switch_prompt=_switch_prompt_)
# <-------------- compile PDF ------------->
success = yield from 编译Latex(chatbot, history, main_file_original='merge',
main_file_modified='merge_proofread_en',
work_folder_original=project_folder, work_folder_modified=project_folder,
work_folder=project_folder)
# <-------------- zip PDF ------------->
zip_res = zip_result(project_folder)
if success:
chatbot.append((f"成功啦", '请查收结果(压缩包)...'))
yield from update_ui(chatbot=chatbot, history=history);
time.sleep(1) # 刷新界面
promote_file_to_downloadzone(file=zip_res, chatbot=chatbot)
else:
chatbot.append((f"失败了",
'虽然PDF生成失败了, 但请查收结果(压缩包), 内含已经翻译的Tex文档, 也是可读的, 您可以到Github Issue区, 用该压缩包+Conversation_To_File进行反馈 ...'))
yield from update_ui(chatbot=chatbot, history=history);
time.sleep(1) # 刷新界面
promote_file_to_downloadzone(file=zip_res, chatbot=chatbot)
# <-------------- we are done ------------->
return success
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 插件主程序2 =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
@CatchException
def Latex翻译中文并重新编译PDF(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
# <-------------- information about this plugin ------------->
chatbot.append([
"函数插件功能?",
"对整个Latex项目进行翻译, 生成中文PDF。函数插件贡献者: Binary-Husky。注意事项: 此插件Windows支持最佳Linux下必须使用Docker安装详见项目主README.md。目前对机器学习类文献转化效果最好其他类型文献转化效果未知。"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# <-------------- more requirements ------------->
if ("advanced_arg" in plugin_kwargs) and (plugin_kwargs["advanced_arg"] == ""): plugin_kwargs.pop("advanced_arg")
more_req = plugin_kwargs.get("advanced_arg", "")
no_cache = ("--no-cache" in more_req)
if no_cache: more_req = more_req.replace("--no-cache", "").strip()
allow_gptac_cloud_io = ("--allow-cloudio" in more_req) # 从云端下载翻译结果,以及上传翻译结果到云端
if allow_gptac_cloud_io: more_req = more_req.replace("--allow-cloudio", "").strip()
allow_cache = not no_cache
_switch_prompt_ = partial(switch_prompt, more_requirement=more_req)
# <-------------- check deps ------------->
try:
import glob, os, time, subprocess
subprocess.Popen(['pdflatex', '-version'])
from .latex_fns.latex_actions import Latex精细分解与转化, 编译Latex
except Exception as e:
chatbot.append([f"解析项目: {txt}",
f"尝试执行Latex指令失败。Latex没有安装, 或者不在环境变量PATH中。安装方法https://tug.org/texlive/。报错信息\n\n```\n\n{trimmed_format_exc()}\n\n```\n\n"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# <-------------- clear history and read input ------------->
history = []
try:
txt, arxiv_id = yield from arxiv_download(chatbot, history, txt, allow_cache)
except tarfile.ReadError as e:
yield from update_ui_latest_msg(
"无法自动下载该论文的Latex源码请前往arxiv打开此论文下载页面点other Formats然后download source手动下载latex源码包。接下来调用本地Latex翻译插件即可。",
chatbot=chatbot, history=history)
return
if txt.endswith('.pdf'):
report_exception(chatbot, history, a=f"解析项目: {txt}", b=f"发现已经存在翻译好的PDF文档")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# #################################################################
if allow_gptac_cloud_io and arxiv_id:
# 访问 GPTAC学术云查询云端是否存在该论文的翻译版本
from crazy_functions.latex_fns.latex_actions import check_gptac_cloud
success, downloaded = check_gptac_cloud(arxiv_id, chatbot)
if success:
chatbot.append([
f"检测到GPTAC云端存在翻译版本, 如果不满意翻译结果, 请禁用云端分享, 然后重新执行。",
None
])
yield from update_ui(chatbot=chatbot, history=history)
return
#################################################################
if os.path.exists(txt):
project_folder = txt
else:
if txt == "": txt = '空空如也的输入栏'
report_exception(chatbot, history, a=f"解析项目: {txt}", b=f"找不到本地项目或无法处理: {txt}")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
file_manifest = [f for f in glob.glob(f'{project_folder}/**/*.tex', recursive=True)]
if len(file_manifest) == 0:
report_exception(chatbot, history, a=f"解析项目: {txt}", b=f"找不到任何.tex文件: {txt}")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# <-------------- if is a zip/tar file ------------->
project_folder = descend_to_extracted_folder_if_exist(project_folder)
# <-------------- move latex project away from temp folder ------------->
from shared_utils.fastapi_server import validate_path_safety
validate_path_safety(project_folder, chatbot.get_user())
project_folder = move_project(project_folder, arxiv_id)
# <-------------- if merge_translate_zh is already generated, skip gpt req ------------->
if not os.path.exists(project_folder + '/merge_translate_zh.tex'):
yield from Latex精细分解与转化(file_manifest, project_folder, llm_kwargs, plugin_kwargs,
chatbot, history, system_prompt, mode='translate_zh',
switch_prompt=_switch_prompt_)
# <-------------- compile PDF ------------->
success = yield from 编译Latex(chatbot, history, main_file_original='merge',
main_file_modified='merge_translate_zh', mode='translate_zh',
work_folder_original=project_folder, work_folder_modified=project_folder,
work_folder=project_folder)
# <-------------- zip PDF ------------->
zip_res = zip_result(project_folder)
if success:
if allow_gptac_cloud_io and arxiv_id:
# 如果用户允许我们将翻译好的arxiv论文PDF上传到GPTAC学术云
from crazy_functions.latex_fns.latex_actions import upload_to_gptac_cloud_if_user_allow
threading.Thread(target=upload_to_gptac_cloud_if_user_allow,
args=(chatbot, arxiv_id), daemon=True).start()
chatbot.append((f"成功啦", '请查收结果(压缩包)...'))
yield from update_ui(chatbot=chatbot, history=history)
time.sleep(1) # 刷新界面
promote_file_to_downloadzone(file=zip_res, chatbot=chatbot)
else:
chatbot.append((f"失败了",
'虽然PDF生成失败了, 但请查收结果(压缩包), 内含已经翻译的Tex文档, 您可以到Github Issue区, 用该压缩包进行反馈。如系统是Linux请检查系统字体见Github wiki ...'))
yield from update_ui(chatbot=chatbot, history=history)
time.sleep(1) # 刷新界面
promote_file_to_downloadzone(file=zip_res, chatbot=chatbot)
# <-------------- we are done ------------->
return success
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 插件主程序3 =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
@CatchException
def PDF翻译中文并重新编译PDF(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, web_port):
# <-------------- information about this plugin ------------->
chatbot.append([
"函数插件功能?",
"将PDF转换为Latex项目翻译为中文后重新编译为PDF。函数插件贡献者: Marroh。注意事项: 此插件Windows支持最佳Linux下必须使用Docker安装详见项目主README.md。目前对机器学习类文献转化效果最好其他类型文献转化效果未知。"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# <-------------- more requirements ------------->
if ("advanced_arg" in plugin_kwargs) and (plugin_kwargs["advanced_arg"] == ""): plugin_kwargs.pop("advanced_arg")
more_req = plugin_kwargs.get("advanced_arg", "")
no_cache = more_req.startswith("--no-cache")
if no_cache: more_req.lstrip("--no-cache")
allow_cache = not no_cache
_switch_prompt_ = partial(switch_prompt, more_requirement=more_req)
# <-------------- check deps ------------->
try:
import glob, os, time, subprocess
subprocess.Popen(['pdflatex', '-version'])
from .latex_fns.latex_actions import Latex精细分解与转化, 编译Latex
except Exception as e:
chatbot.append([f"解析项目: {txt}",
f"尝试执行Latex指令失败。Latex没有安装, 或者不在环境变量PATH中。安装方法https://tug.org/texlive/。报错信息\n\n```\n\n{trimmed_format_exc()}\n\n```\n\n"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# <-------------- clear history and read input ------------->
if os.path.exists(txt):
project_folder = txt
else:
if txt == "": txt = '空空如也的输入栏'
report_exception(chatbot, history, a=f"解析项目: {txt}", b=f"找不到本地项目或无法处理: {txt}")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
file_manifest = [f for f in glob.glob(f'{project_folder}/**/*.pdf', recursive=True)]
if len(file_manifest) == 0:
report_exception(chatbot, history, a=f"解析项目: {txt}", b=f"找不到任何.pdf文件: {txt}")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
if len(file_manifest) != 1:
report_exception(chatbot, history, a=f"解析项目: {txt}", b=f"不支持同时处理多个pdf文件: {txt}")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
if plugin_kwargs.get("method", "") == 'MATHPIX':
app_id, app_key = get_conf('MATHPIX_APPID', 'MATHPIX_APPKEY')
if len(app_id) == 0 or len(app_key) == 0:
report_exception(chatbot, history, a="缺失 MATHPIX_APPID 和 MATHPIX_APPKEY。", b=f"请配置 MATHPIX_APPID 和 MATHPIX_APPKEY")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
if plugin_kwargs.get("method", "") == 'DOC2X':
app_id, app_key = "", ""
DOC2X_API_KEY = get_conf('DOC2X_API_KEY')
if len(DOC2X_API_KEY) == 0:
report_exception(chatbot, history, a="缺失 DOC2X_API_KEY。", b=f"请配置 DOC2X_API_KEY")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
hash_tag = map_file_to_sha256(file_manifest[0])
# # <-------------- check repeated pdf ------------->
# chatbot.append([f"检查PDF是否被重复上传", "正在检查..."])
# yield from update_ui(chatbot=chatbot, history=history)
# repeat, project_folder = check_repeat_upload(file_manifest[0], hash_tag)
# if repeat:
# yield from update_ui_latest_msg(f"发现重复上传,请查收结果(压缩包)...", chatbot=chatbot, history=history)
# try:
# translate_pdf = [f for f in glob.glob(f'{project_folder}/**/merge_translate_zh.pdf', recursive=True)][0]
# promote_file_to_downloadzone(translate_pdf, rename_file=None, chatbot=chatbot)
# comparison_pdf = [f for f in glob.glob(f'{project_folder}/**/comparison.pdf', recursive=True)][0]
# promote_file_to_downloadzone(comparison_pdf, rename_file=None, chatbot=chatbot)
# zip_res = zip_result(project_folder)
# promote_file_to_downloadzone(file=zip_res, chatbot=chatbot)
# return
# except:
# report_exception(chatbot, history, a=f"解析项目: {txt}", b=f"发现重复上传,但是无法找到相关文件")
# yield from update_ui(chatbot=chatbot, history=history)
# else:
# yield from update_ui_latest_msg(f"未发现重复上传", chatbot=chatbot, history=history)
# <-------------- convert pdf into tex ------------->
chatbot.append([f"解析项目: {txt}", "正在将PDF转换为tex项目请耐心等待..."])
yield from update_ui(chatbot=chatbot, history=history)
project_folder = pdf2tex_project(file_manifest[0], plugin_kwargs)
if project_folder is None:
report_exception(chatbot, history, a=f"解析项目: {txt}", b=f"PDF转换为tex项目失败")
yield from update_ui(chatbot=chatbot, history=history)
return False
# <-------------- translate latex file into Chinese ------------->
yield from update_ui_latest_msg("正在tex项目将翻译为中文...", chatbot=chatbot, history=history)
file_manifest = [f for f in glob.glob(f'{project_folder}/**/*.tex', recursive=True)]
if len(file_manifest) == 0:
report_exception(chatbot, history, a=f"解析项目: {txt}", b=f"找不到任何.tex文件: {txt}")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# <-------------- if is a zip/tar file ------------->
project_folder = descend_to_extracted_folder_if_exist(project_folder)
# <-------------- move latex project away from temp folder ------------->
from shared_utils.fastapi_server import validate_path_safety
validate_path_safety(project_folder, chatbot.get_user())
project_folder = move_project(project_folder)
# <-------------- set a hash tag for repeat-checking ------------->
with open(pj(project_folder, hash_tag + '.tag'), 'w', encoding='utf8') as f:
f.write(hash_tag)
f.close()
# <-------------- if merge_translate_zh is already generated, skip gpt req ------------->
if not os.path.exists(project_folder + '/merge_translate_zh.tex'):
yield from Latex精细分解与转化(file_manifest, project_folder, llm_kwargs, plugin_kwargs,
chatbot, history, system_prompt, mode='translate_zh',
switch_prompt=_switch_prompt_)
# <-------------- compile PDF ------------->
yield from update_ui_latest_msg("正在将翻译好的项目tex项目编译为PDF...", chatbot=chatbot, history=history)
success = yield from 编译Latex(chatbot, history, main_file_original='merge',
main_file_modified='merge_translate_zh', mode='translate_zh',
work_folder_original=project_folder, work_folder_modified=project_folder,
work_folder=project_folder)
# <-------------- zip PDF ------------->
zip_res = zip_result(project_folder)
if success:
chatbot.append((f"成功啦", '请查收结果(压缩包)...'))
yield from update_ui(chatbot=chatbot, history=history);
time.sleep(1) # 刷新界面
promote_file_to_downloadzone(file=zip_res, chatbot=chatbot)
else:
chatbot.append((f"失败了",
'虽然PDF生成失败了, 但请查收结果(压缩包), 内含已经翻译的Tex文档, 您可以到Github Issue区, 用该压缩包进行反馈。如系统是Linux请检查系统字体见Github wiki ...'))
yield from update_ui(chatbot=chatbot, history=history);
time.sleep(1) # 刷新界面
promote_file_to_downloadzone(file=zip_res, chatbot=chatbot)
# <-------------- we are done ------------->
return success

View File

@@ -0,0 +1,85 @@
from crazy_functions.Latex_Function import Latex翻译中文并重新编译PDF, PDF翻译中文并重新编译PDF
from crazy_functions.plugin_template.plugin_class_template import GptAcademicPluginTemplate, ArgProperty
class Arxiv_Localize(GptAcademicPluginTemplate):
def __init__(self):
"""
请注意`execute`会执行在不同的线程中,因此您在定义和使用类变量时,应当慎之又慎!
"""
pass
def define_arg_selection_menu(self):
"""
定义插件的二级选项菜单
第一个参数,名称`main_input`,参数`type`声明这是一个文本框,文本框上方显示`title`,文本框内部显示`description``default_value`为默认值;
第二个参数,名称`advanced_arg`,参数`type`声明这是一个文本框,文本框上方显示`title`,文本框内部显示`description``default_value`为默认值;
第三个参数,名称`allow_cache`,参数`type`声明这是一个下拉菜单,下拉菜单上方显示`title`+`description`,下拉菜单的选项为`options``default_value`为下拉菜单默认值;
"""
gui_definition = {
"main_input":
ArgProperty(title="ArxivID", description="输入Arxiv的ID或者网址", default_value="", type="string").model_dump_json(), # 主输入,自动从输入框同步
"advanced_arg":
ArgProperty(title="额外的翻译提示词",
description=r"如果有必要, 请在此处给出自定义翻译命令, 解决部分词汇翻译不准确的问题。 "
r"例如当单词'agent'翻译不准确时, 请尝试把以下指令复制到高级参数区: "
r'If the term "agent" is used in this section, it should be translated to "智能体". ',
default_value="", type="string").model_dump_json(), # 高级参数输入区,自动同步
"allow_cache":
ArgProperty(title="是否允许从缓存中调取结果", options=["允许缓存", "从头执行"], default_value="允许缓存", description="", type="dropdown").model_dump_json(),
"allow_cloudio":
ArgProperty(title="是否允许从GPTAC学术云下载(或者上传)翻译结果(仅针对Arxiv论文)", options=["允许", "禁止"], default_value="禁止", description="共享文献,互助互利", type="dropdown").model_dump_json(),
}
return gui_definition
def execute(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
"""
执行插件
"""
allow_cache = plugin_kwargs["allow_cache"]
allow_cloudio = plugin_kwargs["allow_cloudio"]
advanced_arg = plugin_kwargs["advanced_arg"]
if allow_cache == "从头执行": plugin_kwargs["advanced_arg"] = "--no-cache " + plugin_kwargs["advanced_arg"]
# 从云端下载翻译结果,以及上传翻译结果到云端;人人为我,我为人人。
if allow_cloudio == "允许": plugin_kwargs["advanced_arg"] = "--allow-cloudio " + plugin_kwargs["advanced_arg"]
yield from Latex翻译中文并重新编译PDF(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request)
class PDF_Localize(GptAcademicPluginTemplate):
def __init__(self):
"""
请注意`execute`会执行在不同的线程中,因此您在定义和使用类变量时,应当慎之又慎!
"""
pass
def define_arg_selection_menu(self):
"""
定义插件的二级选项菜单
"""
gui_definition = {
"main_input":
ArgProperty(title="PDF文件路径", description="未指定路径,请上传文件后,再点击该插件", default_value="", type="string").model_dump_json(), # 主输入,自动从输入框同步
"advanced_arg":
ArgProperty(title="额外的翻译提示词",
description=r"如果有必要, 请在此处给出自定义翻译命令, 解决部分词汇翻译不准确的问题。 "
r"例如当单词'agent'翻译不准确时, 请尝试把以下指令复制到高级参数区: "
r'If the term "agent" is used in this section, it should be translated to "智能体". ',
default_value="", type="string").model_dump_json(), # 高级参数输入区,自动同步
"method":
ArgProperty(title="采用哪种方法执行转换", options=["MATHPIX", "DOC2X"], default_value="DOC2X", description="", type="dropdown").model_dump_json(),
}
return gui_definition
def execute(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
"""
执行插件
"""
yield from PDF翻译中文并重新编译PDF(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request)

View File

@@ -1,6 +1,6 @@
from toolbox import update_ui, trimmed_format_exc, promote_file_to_downloadzone, get_log_folder
from toolbox import CatchException, report_exception, write_history_to_file, zip_folder
from loguru import logger
class PaperFileGroup():
def __init__(self):
@@ -33,7 +33,7 @@ class PaperFileGroup():
self.sp_file_index.append(index)
self.sp_file_tag.append(self.file_paths[index] + f".part-{j}.tex")
print('Segmentation: done')
logger.info('Segmentation: done')
def merge_result(self):
self.file_result = ["" for _ in range(len(self.file_paths))]
for r, k in zip(self.sp_file_result, self.sp_file_index):
@@ -46,7 +46,7 @@ class PaperFileGroup():
manifest.append(path + '.polish.tex')
f.write(res)
return manifest
def zip_result(self):
import os, time
folder = os.path.dirname(self.file_paths[0])
@@ -56,10 +56,10 @@ class PaperFileGroup():
def 多文件润色(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, language='en', mode='polish'):
import time, os, re
from .crazy_utils import request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency
from crazy_functions.crazy_utils import request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency
# <-------- 读取Latex文件删除其中的所有注释 ---------->
# <-------- 读取Latex文件删除其中的所有注释 ---------->
pfg = PaperFileGroup()
for index, fp in enumerate(file_manifest):
@@ -73,31 +73,31 @@ def 多文件润色(file_manifest, project_folder, llm_kwargs, plugin_kwargs, ch
pfg.file_paths.append(fp)
pfg.file_contents.append(clean_tex_content)
# <-------- 拆分过长的latex文件 ---------->
# <-------- 拆分过长的latex文件 ---------->
pfg.run_file_split(max_token_limit=1024)
n_split = len(pfg.sp_file_contents)
# <-------- 多线程润色开始 ---------->
# <-------- 多线程润色开始 ---------->
if language == 'en':
if mode == 'polish':
inputs_array = ["Below is a section from an academic paper, polish this section to meet the academic standard, " +
"improve the grammar, clarity and overall readability, do not modify any latex command such as \section, \cite and equations:" +
inputs_array = [r"Below is a section from an academic paper, polish this section to meet the academic standard, " +
r"improve the grammar, clarity and overall readability, do not modify any latex command such as \section, \cite and equations:" +
f"\n\n{frag}" for frag in pfg.sp_file_contents]
else:
inputs_array = [r"Below is a section from an academic paper, proofread this section." +
r"Do not modify any latex command such as \section, \cite, \begin, \item and equations. " +
r"Answer me only with the revised text:" +
inputs_array = [r"Below is a section from an academic paper, proofread this section." +
r"Do not modify any latex command such as \section, \cite, \begin, \item and equations. " +
r"Answer me only with the revised text:" +
f"\n\n{frag}" for frag in pfg.sp_file_contents]
inputs_show_user_array = [f"Polish {f}" for f in pfg.sp_file_tag]
sys_prompt_array = ["You are a professional academic paper writer." for _ in range(n_split)]
elif language == 'zh':
if mode == 'polish':
inputs_array = [f"以下是一篇学术论文中的一段内容请将此部分润色以满足学术标准提高语法、清晰度和整体可读性不要修改任何LaTeX命令例如\section\cite和方程式" +
inputs_array = [r"以下是一篇学术论文中的一段内容请将此部分润色以满足学术标准提高语法、清晰度和整体可读性不要修改任何LaTeX命令例如\section\cite和方程式" +
f"\n\n{frag}" for frag in pfg.sp_file_contents]
else:
inputs_array = [f"以下是一篇学术论文中的一段内容请对这部分内容进行语法矫正。不要修改任何LaTeX命令例如\section\cite和方程式" +
f"\n\n{frag}" for frag in pfg.sp_file_contents]
inputs_array = [r"以下是一篇学术论文中的一段内容请对这部分内容进行语法矫正。不要修改任何LaTeX命令例如\section\cite和方程式" +
f"\n\n{frag}" for frag in pfg.sp_file_contents]
inputs_show_user_array = [f"润色 {f}" for f in pfg.sp_file_tag]
sys_prompt_array=["你是一位专业的中文学术论文作家。" for _ in range(n_split)]
@@ -113,7 +113,7 @@ def 多文件润色(file_manifest, project_folder, llm_kwargs, plugin_kwargs, ch
scroller_max_len = 80
)
# <-------- 文本碎片重组为完整的tex文件整理结果为压缩包 ---------->
# <-------- 文本碎片重组为完整的tex文件整理结果为压缩包 ---------->
try:
pfg.sp_file_result = []
for i_say, gpt_say in zip(gpt_response_collection[0::2], gpt_response_collection[1::2]):
@@ -122,9 +122,9 @@ def 多文件润色(file_manifest, project_folder, llm_kwargs, plugin_kwargs, ch
pfg.write_result()
pfg.zip_result()
except:
print(trimmed_format_exc())
logger.error(trimmed_format_exc())
# <-------- 整理结果,退出 ---------->
# <-------- 整理结果,退出 ---------->
create_report_file_name = time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime()) + f"-chatgpt.polish.md"
res = write_history_to_file(gpt_response_collection, file_basename=create_report_file_name)
promote_file_to_downloadzone(res, chatbot=chatbot)

View File

@@ -1,6 +1,6 @@
from toolbox import update_ui, promote_file_to_downloadzone
from toolbox import CatchException, report_exception, write_history_to_file
fast_debug = False
from loguru import logger
class PaperFileGroup():
def __init__(self):
@@ -33,13 +33,13 @@ class PaperFileGroup():
self.sp_file_index.append(index)
self.sp_file_tag.append(self.file_paths[index] + f".part-{j}.tex")
print('Segmentation: done')
logger.info('Segmentation: done')
def 多文件翻译(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, language='en'):
import time, os, re
from .crazy_utils import request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency
from crazy_functions.crazy_utils import request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency
# <-------- 读取Latex文件删除其中的所有注释 ---------->
# <-------- 读取Latex文件删除其中的所有注释 ---------->
pfg = PaperFileGroup()
for index, fp in enumerate(file_manifest):
@@ -53,11 +53,11 @@ def 多文件翻译(file_manifest, project_folder, llm_kwargs, plugin_kwargs, ch
pfg.file_paths.append(fp)
pfg.file_contents.append(clean_tex_content)
# <-------- 拆分过长的latex文件 ---------->
# <-------- 拆分过长的latex文件 ---------->
pfg.run_file_split(max_token_limit=1024)
n_split = len(pfg.sp_file_contents)
# <-------- 抽取摘要 ---------->
# <-------- 抽取摘要 ---------->
# if language == 'en':
# abs_extract_inputs = f"Please write an abstract for this paper"
@@ -70,14 +70,14 @@ def 多文件翻译(file_manifest, project_folder, llm_kwargs, plugin_kwargs, ch
# sys_prompt="Your job is to collect information from materials。",
# )
# <-------- 多线程润色开始 ---------->
# <-------- 多线程润色开始 ---------->
if language == 'en->zh':
inputs_array = ["Below is a section from an English academic paper, translate it into Chinese, do not modify any latex command such as \section, \cite and equations:" +
inputs_array = ["Below is a section from an English academic paper, translate it into Chinese, do not modify any latex command such as \section, \cite and equations:" +
f"\n\n{frag}" for frag in pfg.sp_file_contents]
inputs_show_user_array = [f"翻译 {f}" for f in pfg.sp_file_tag]
sys_prompt_array = ["You are a professional academic paper translator." for _ in range(n_split)]
elif language == 'zh->en':
inputs_array = [f"Below is a section from a Chinese academic paper, translate it into English, do not modify any latex command such as \section, \cite and equations:" +
inputs_array = [f"Below is a section from a Chinese academic paper, translate it into English, do not modify any latex command such as \section, \cite and equations:" +
f"\n\n{frag}" for frag in pfg.sp_file_contents]
inputs_show_user_array = [f"翻译 {f}" for f in pfg.sp_file_tag]
sys_prompt_array = ["You are a professional academic paper translator." for _ in range(n_split)]
@@ -93,7 +93,7 @@ def 多文件翻译(file_manifest, project_folder, llm_kwargs, plugin_kwargs, ch
scroller_max_len = 80
)
# <-------- 整理结果,退出 ---------->
# <-------- 整理结果,退出 ---------->
create_report_file_name = time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime()) + f"-chatgpt.polish.md"
res = write_history_to_file(gpt_response_collection, create_report_file_name)
promote_file_to_downloadzone(res, chatbot=chatbot)

View File

@@ -1,313 +0,0 @@
from toolbox import update_ui, trimmed_format_exc, get_conf, get_log_folder, promote_file_to_downloadzone
from toolbox import CatchException, report_exception, update_ui_lastest_msg, zip_result, gen_time_str
from functools import partial
import glob, os, requests, time, tarfile
pj = os.path.join
ARXIV_CACHE_DIR = os.path.expanduser(f"~/arxiv_cache/")
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 工具函数 =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
# 专业词汇声明 = 'If the term "agent" is used in this section, it should be translated to "智能体". '
def switch_prompt(pfg, mode, more_requirement):
"""
Generate prompts and system prompts based on the mode for proofreading or translating.
Args:
- pfg: Proofreader or Translator instance.
- mode: A string specifying the mode, either 'proofread' or 'translate_zh'.
Returns:
- inputs_array: A list of strings containing prompts for users to respond to.
- sys_prompt_array: A list of strings containing prompts for system prompts.
"""
n_split = len(pfg.sp_file_contents)
if mode == 'proofread_en':
inputs_array = [r"Below is a section from an academic paper, proofread this section." +
r"Do not modify any latex command such as \section, \cite, \begin, \item and equations. " + more_requirement +
r"Answer me only with the revised text:" +
f"\n\n{frag}" for frag in pfg.sp_file_contents]
sys_prompt_array = ["You are a professional academic paper writer." for _ in range(n_split)]
elif mode == 'translate_zh':
inputs_array = [r"Below is a section from an English academic paper, translate it into Chinese. " + more_requirement +
r"Do not modify any latex command such as \section, \cite, \begin, \item and equations. " +
r"Answer me only with the translated text:" +
f"\n\n{frag}" for frag in pfg.sp_file_contents]
sys_prompt_array = ["You are a professional translator." for _ in range(n_split)]
else:
assert False, "未知指令"
return inputs_array, sys_prompt_array
def desend_to_extracted_folder_if_exist(project_folder):
"""
Descend into the extracted folder if it exists, otherwise return the original folder.
Args:
- project_folder: A string specifying the folder path.
Returns:
- A string specifying the path to the extracted folder, or the original folder if there is no extracted folder.
"""
maybe_dir = [f for f in glob.glob(f'{project_folder}/*') if os.path.isdir(f)]
if len(maybe_dir) == 0: return project_folder
if maybe_dir[0].endswith('.extract'): return maybe_dir[0]
return project_folder
def move_project(project_folder, arxiv_id=None):
"""
Create a new work folder and copy the project folder to it.
Args:
- project_folder: A string specifying the folder path of the project.
Returns:
- A string specifying the path to the new work folder.
"""
import shutil, time
time.sleep(2) # avoid time string conflict
if arxiv_id is not None:
new_workfolder = pj(ARXIV_CACHE_DIR, arxiv_id, 'workfolder')
else:
new_workfolder = f'{get_log_folder()}/{gen_time_str()}'
try:
shutil.rmtree(new_workfolder)
except:
pass
# align subfolder if there is a folder wrapper
items = glob.glob(pj(project_folder,'*'))
items = [item for item in items if os.path.basename(item)!='__MACOSX']
if len(glob.glob(pj(project_folder,'*.tex'))) == 0 and len(items) == 1:
if os.path.isdir(items[0]): project_folder = items[0]
shutil.copytree(src=project_folder, dst=new_workfolder)
return new_workfolder
def arxiv_download(chatbot, history, txt, allow_cache=True):
def check_cached_translation_pdf(arxiv_id):
translation_dir = pj(ARXIV_CACHE_DIR, arxiv_id, 'translation')
if not os.path.exists(translation_dir):
os.makedirs(translation_dir)
target_file = pj(translation_dir, 'translate_zh.pdf')
if os.path.exists(target_file):
promote_file_to_downloadzone(target_file, rename_file=None, chatbot=chatbot)
target_file_compare = pj(translation_dir, 'comparison.pdf')
if os.path.exists(target_file_compare):
promote_file_to_downloadzone(target_file_compare, rename_file=None, chatbot=chatbot)
return target_file
return False
def is_float(s):
try:
float(s)
return True
except ValueError:
return False
if ('.' in txt) and ('/' not in txt) and is_float(txt): # is arxiv ID
txt = 'https://arxiv.org/abs/' + txt.strip()
if ('.' in txt) and ('/' not in txt) and is_float(txt[:10]): # is arxiv ID
txt = 'https://arxiv.org/abs/' + txt[:10]
if not txt.startswith('https://arxiv.org'):
return txt, None # 是本地文件,跳过下载
# <-------------- inspect format ------------->
chatbot.append([f"检测到arxiv文档连接", '尝试下载 ...'])
yield from update_ui(chatbot=chatbot, history=history)
time.sleep(1) # 刷新界面
url_ = txt # https://arxiv.org/abs/1707.06690
if not txt.startswith('https://arxiv.org/abs/'):
msg = f"解析arxiv网址失败, 期望格式例如: https://arxiv.org/abs/1707.06690。实际得到格式: {url_}"
yield from update_ui_lastest_msg(msg, chatbot=chatbot, history=history) # 刷新界面
return msg, None
# <-------------- set format ------------->
arxiv_id = url_.split('/abs/')[-1]
if 'v' in arxiv_id: arxiv_id = arxiv_id[:10]
cached_translation_pdf = check_cached_translation_pdf(arxiv_id)
if cached_translation_pdf and allow_cache: return cached_translation_pdf, arxiv_id
url_tar = url_.replace('/abs/', '/e-print/')
translation_dir = pj(ARXIV_CACHE_DIR, arxiv_id, 'e-print')
extract_dst = pj(ARXIV_CACHE_DIR, arxiv_id, 'extract')
os.makedirs(translation_dir, exist_ok=True)
# <-------------- download arxiv source file ------------->
dst = pj(translation_dir, arxiv_id+'.tar')
if os.path.exists(dst):
yield from update_ui_lastest_msg("调用缓存", chatbot=chatbot, history=history) # 刷新界面
else:
yield from update_ui_lastest_msg("开始下载", chatbot=chatbot, history=history) # 刷新界面
proxies = get_conf('proxies')
r = requests.get(url_tar, proxies=proxies)
with open(dst, 'wb+') as f:
f.write(r.content)
# <-------------- extract file ------------->
yield from update_ui_lastest_msg("下载完成", chatbot=chatbot, history=history) # 刷新界面
from toolbox import extract_archive
extract_archive(file_path=dst, dest_dir=extract_dst)
return extract_dst, arxiv_id
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 插件主程序1 =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
@CatchException
def Latex英文纠错加PDF对比(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
# <-------------- information about this plugin ------------->
chatbot.append([ "函数插件功能?",
"对整个Latex项目进行纠错, 用latex编译为PDF对修正处做高亮。函数插件贡献者: Binary-Husky。注意事项: 目前仅支持GPT3.5/GPT4其他模型转化效果未知。目前对机器学习类文献转化效果最好其他类型文献转化效果未知。仅在Windows系统进行了测试其他操作系统表现未知。"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# <-------------- more requirements ------------->
if ("advanced_arg" in plugin_kwargs) and (plugin_kwargs["advanced_arg"] == ""): plugin_kwargs.pop("advanced_arg")
more_req = plugin_kwargs.get("advanced_arg", "")
_switch_prompt_ = partial(switch_prompt, more_requirement=more_req)
# <-------------- check deps ------------->
try:
import glob, os, time, subprocess
subprocess.Popen(['pdflatex', '-version'])
from .latex_fns.latex_actions import Latex精细分解与转化, 编译Latex
except Exception as e:
chatbot.append([ f"解析项目: {txt}",
f"尝试执行Latex指令失败。Latex没有安装, 或者不在环境变量PATH中。安装方法https://tug.org/texlive/。报错信息\n\n```\n\n{trimmed_format_exc()}\n\n```\n\n"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# <-------------- clear history and read input ------------->
history = []
if os.path.exists(txt):
project_folder = txt
else:
if txt == "": txt = '空空如也的输入栏'
report_exception(chatbot, history, a = f"解析项目: {txt}", b = f"找不到本地项目或无权访问: {txt}")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
file_manifest = [f for f in glob.glob(f'{project_folder}/**/*.tex', recursive=True)]
if len(file_manifest) == 0:
report_exception(chatbot, history, a = f"解析项目: {txt}", b = f"找不到任何.tex文件: {txt}")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# <-------------- if is a zip/tar file ------------->
project_folder = desend_to_extracted_folder_if_exist(project_folder)
# <-------------- move latex project away from temp folder ------------->
project_folder = move_project(project_folder, arxiv_id=None)
# <-------------- if merge_translate_zh is already generated, skip gpt req ------------->
if not os.path.exists(project_folder + '/merge_proofread_en.tex'):
yield from Latex精细分解与转化(file_manifest, project_folder, llm_kwargs, plugin_kwargs,
chatbot, history, system_prompt, mode='proofread_en', switch_prompt=_switch_prompt_)
# <-------------- compile PDF ------------->
success = yield from 编译Latex(chatbot, history, main_file_original='merge', main_file_modified='merge_proofread_en',
work_folder_original=project_folder, work_folder_modified=project_folder, work_folder=project_folder)
# <-------------- zip PDF ------------->
zip_res = zip_result(project_folder)
if success:
chatbot.append((f"成功啦", '请查收结果(压缩包)...'))
yield from update_ui(chatbot=chatbot, history=history); time.sleep(1) # 刷新界面
promote_file_to_downloadzone(file=zip_res, chatbot=chatbot)
else:
chatbot.append((f"失败了", '虽然PDF生成失败了, 但请查收结果(压缩包), 内含已经翻译的Tex文档, 也是可读的, 您可以到Github Issue区, 用该压缩包+对话历史存档进行反馈 ...'))
yield from update_ui(chatbot=chatbot, history=history); time.sleep(1) # 刷新界面
promote_file_to_downloadzone(file=zip_res, chatbot=chatbot)
# <-------------- we are done ------------->
return success
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 插件主程序2 =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
@CatchException
def Latex翻译中文并重新编译PDF(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
# <-------------- information about this plugin ------------->
chatbot.append([
"函数插件功能?",
"对整个Latex项目进行翻译, 生成中文PDF。函数插件贡献者: Binary-Husky。注意事项: 此插件Windows支持最佳Linux下必须使用Docker安装详见项目主README.md。目前仅支持GPT3.5/GPT4其他模型转化效果未知。目前对机器学习类文献转化效果最好其他类型文献转化效果未知。"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# <-------------- more requirements ------------->
if ("advanced_arg" in plugin_kwargs) and (plugin_kwargs["advanced_arg"] == ""): plugin_kwargs.pop("advanced_arg")
more_req = plugin_kwargs.get("advanced_arg", "")
no_cache = more_req.startswith("--no-cache")
if no_cache: more_req.lstrip("--no-cache")
allow_cache = not no_cache
_switch_prompt_ = partial(switch_prompt, more_requirement=more_req)
# <-------------- check deps ------------->
try:
import glob, os, time, subprocess
subprocess.Popen(['pdflatex', '-version'])
from .latex_fns.latex_actions import Latex精细分解与转化, 编译Latex
except Exception as e:
chatbot.append([ f"解析项目: {txt}",
f"尝试执行Latex指令失败。Latex没有安装, 或者不在环境变量PATH中。安装方法https://tug.org/texlive/。报错信息\n\n```\n\n{trimmed_format_exc()}\n\n```\n\n"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# <-------------- clear history and read input ------------->
history = []
try:
txt, arxiv_id = yield from arxiv_download(chatbot, history, txt, allow_cache)
except tarfile.ReadError as e:
yield from update_ui_lastest_msg(
"无法自动下载该论文的Latex源码请前往arxiv打开此论文下载页面点other Formats然后download source手动下载latex源码包。接下来调用本地Latex翻译插件即可。",
chatbot=chatbot, history=history)
return
if txt.endswith('.pdf'):
report_exception(chatbot, history, a = f"解析项目: {txt}", b = f"发现已经存在翻译好的PDF文档")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
if os.path.exists(txt):
project_folder = txt
else:
if txt == "": txt = '空空如也的输入栏'
report_exception(chatbot, history, a = f"解析项目: {txt}", b = f"找不到本地项目或无法处理: {txt}")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
file_manifest = [f for f in glob.glob(f'{project_folder}/**/*.tex', recursive=True)]
if len(file_manifest) == 0:
report_exception(chatbot, history, a = f"解析项目: {txt}", b = f"找不到任何.tex文件: {txt}")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# <-------------- if is a zip/tar file ------------->
project_folder = desend_to_extracted_folder_if_exist(project_folder)
# <-------------- move latex project away from temp folder ------------->
project_folder = move_project(project_folder, arxiv_id)
# <-------------- if merge_translate_zh is already generated, skip gpt req ------------->
if not os.path.exists(project_folder + '/merge_translate_zh.tex'):
yield from Latex精细分解与转化(file_manifest, project_folder, llm_kwargs, plugin_kwargs,
chatbot, history, system_prompt, mode='translate_zh', switch_prompt=_switch_prompt_)
# <-------------- compile PDF ------------->
success = yield from 编译Latex(chatbot, history, main_file_original='merge', main_file_modified='merge_translate_zh', mode='translate_zh',
work_folder_original=project_folder, work_folder_modified=project_folder, work_folder=project_folder)
# <-------------- zip PDF ------------->
zip_res = zip_result(project_folder)
if success:
chatbot.append((f"成功啦", '请查收结果(压缩包)...'))
yield from update_ui(chatbot=chatbot, history=history); time.sleep(1) # 刷新界面
promote_file_to_downloadzone(file=zip_res, chatbot=chatbot)
else:
chatbot.append((f"失败了", '虽然PDF生成失败了, 但请查收结果(压缩包), 内含已经翻译的Tex文档, 您可以到Github Issue区, 用该压缩包进行反馈。如系统是Linux请检查系统字体见Github wiki ...'))
yield from update_ui(chatbot=chatbot, history=history); time.sleep(1) # 刷新界面
promote_file_to_downloadzone(file=zip_res, chatbot=chatbot)
# <-------------- we are done ------------->
return success

View File

@@ -1,5 +1,6 @@
import glob, time, os, re, logging
from toolbox import update_ui, trimmed_format_exc, gen_time_str, disable_auto_promotion
import glob, shutil, os, re
from loguru import logger
from toolbox import update_ui, trimmed_format_exc, gen_time_str
from toolbox import CatchException, report_exception, get_log_folder
from toolbox import write_history_to_file, promote_file_to_downloadzone
fast_debug = False
@@ -18,7 +19,7 @@ class PaperFileGroup():
def get_token_num(txt): return len(enc.encode(txt, disallowed_special=()))
self.get_token_num = get_token_num
def run_file_split(self, max_token_limit=1900):
def run_file_split(self, max_token_limit=2048):
"""
将长文本分离开来
"""
@@ -34,7 +35,7 @@ class PaperFileGroup():
self.sp_file_contents.append(segment)
self.sp_file_index.append(index)
self.sp_file_tag.append(self.file_paths[index] + f".part-{j}.md")
logging.info('Segmentation: done')
logger.info('Segmentation: done')
def merge_result(self):
self.file_result = ["" for _ in range(len(self.file_paths))]
@@ -51,9 +52,9 @@ class PaperFileGroup():
return manifest
def 多文件翻译(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, language='en'):
from .crazy_utils import request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency
from crazy_functions.crazy_utils import request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency
# <-------- 读取Markdown文件删除其中的所有注释 ---------->
# <-------- 读取Markdown文件删除其中的所有注释 ---------->
pfg = PaperFileGroup()
for index, fp in enumerate(file_manifest):
@@ -63,26 +64,26 @@ def 多文件翻译(file_manifest, project_folder, llm_kwargs, plugin_kwargs, ch
pfg.file_paths.append(fp)
pfg.file_contents.append(file_content)
# <-------- 拆分过长的Markdown文件 ---------->
pfg.run_file_split(max_token_limit=1500)
# <-------- 拆分过长的Markdown文件 ---------->
pfg.run_file_split(max_token_limit=1024)
n_split = len(pfg.sp_file_contents)
# <-------- 多线程翻译开始 ---------->
# <-------- 多线程翻译开始 ---------->
if language == 'en->zh':
inputs_array = ["This is a Markdown file, translate it into Chinese, do not modify any existing Markdown commands:" +
inputs_array = ["This is a Markdown file, translate it into Chinese, do NOT modify any existing Markdown commands, do NOT use code wrapper (```), ONLY answer me with translated results:" +
f"\n\n{frag}" for frag in pfg.sp_file_contents]
inputs_show_user_array = [f"翻译 {f}" for f in pfg.sp_file_tag]
sys_prompt_array = ["You are a professional academic paper translator." for _ in range(n_split)]
sys_prompt_array = ["You are a professional academic paper translator." + plugin_kwargs.get("additional_prompt", "") for _ in range(n_split)]
elif language == 'zh->en':
inputs_array = [f"This is a Markdown file, translate it into English, do not modify any existing Markdown commands:" +
inputs_array = [f"This is a Markdown file, translate it into English, do NOT modify any existing Markdown commands, do NOT use code wrapper (```), ONLY answer me with translated results:" +
f"\n\n{frag}" for frag in pfg.sp_file_contents]
inputs_show_user_array = [f"翻译 {f}" for f in pfg.sp_file_tag]
sys_prompt_array = ["You are a professional academic paper translator." for _ in range(n_split)]
sys_prompt_array = ["You are a professional academic paper translator." + plugin_kwargs.get("additional_prompt", "") for _ in range(n_split)]
else:
inputs_array = [f"This is a Markdown file, translate it into {language}, do not modify any existing Markdown commands, only answer me with translated results:" +
inputs_array = [f"This is a Markdown file, translate it into {language}, do NOT modify any existing Markdown commands, do NOT use code wrapper (```), ONLY answer me with translated results:" +
f"\n\n{frag}" for frag in pfg.sp_file_contents]
inputs_show_user_array = [f"翻译 {f}" for f in pfg.sp_file_tag]
sys_prompt_array = ["You are a professional academic paper translator." for _ in range(n_split)]
sys_prompt_array = ["You are a professional academic paper translator." + plugin_kwargs.get("additional_prompt", "") for _ in range(n_split)]
gpt_response_collection = yield from request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency(
inputs_array=inputs_array,
@@ -99,11 +100,16 @@ def 多文件翻译(file_manifest, project_folder, llm_kwargs, plugin_kwargs, ch
for i_say, gpt_say in zip(gpt_response_collection[0::2], gpt_response_collection[1::2]):
pfg.sp_file_result.append(gpt_say)
pfg.merge_result()
pfg.write_result(language)
output_file_arr = pfg.write_result(language)
for output_file in output_file_arr:
promote_file_to_downloadzone(output_file, chatbot=chatbot)
if 'markdown_expected_output_path' in plugin_kwargs:
expected_f_name = plugin_kwargs['markdown_expected_output_path']
shutil.copyfile(output_file, expected_f_name)
except:
logging.error(trimmed_format_exc())
logger.error(trimmed_format_exc())
# <-------- 整理结果,退出 ---------->
# <-------- 整理结果,退出 ---------->
create_report_file_name = gen_time_str() + f"-chatgpt.md"
res = write_history_to_file(gpt_response_collection, file_basename=create_report_file_name)
promote_file_to_downloadzone(res, chatbot=chatbot)
@@ -121,7 +127,7 @@ def get_files_from_everything(txt, preference=''):
proxies = get_conf('proxies')
# 网络的远程文件
if preference == 'Github':
logging.info('正在从github下载资源 ...')
logger.info('正在从github下载资源 ...')
if not txt.endswith('.md'):
# Make a request to the GitHub API to retrieve the repository information
url = txt.replace("https://github.com/", "https://api.github.com/repos/") + '/readme'
@@ -159,7 +165,6 @@ def Markdown英译中(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_p
"函数插件功能?",
"对整个Markdown项目进行翻译。函数插件贡献者: Binary-Husky"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
disable_auto_promotion(chatbot)
# 尝试导入依赖,如果缺少依赖,则给出安装建议
try:
@@ -199,7 +204,6 @@ def Markdown中译英(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_p
"函数插件功能?",
"对整个Markdown项目进行翻译。函数插件贡献者: Binary-Husky"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
disable_auto_promotion(chatbot)
# 尝试导入依赖,如果缺少依赖,则给出安装建议
try:
@@ -232,7 +236,6 @@ def Markdown翻译指定语言(txt, llm_kwargs, plugin_kwargs, chatbot, history,
"函数插件功能?",
"对整个Markdown项目进行翻译。函数插件贡献者: Binary-Husky"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
disable_auto_promotion(chatbot)
# 尝试导入依赖,如果缺少依赖,则给出安装建议
try:
@@ -255,7 +258,7 @@ def Markdown翻译指定语言(txt, llm_kwargs, plugin_kwargs, chatbot, history,
report_exception(chatbot, history, a = f"解析项目: {txt}", b = f"找不到任何.md文件: {txt}")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
if ("advanced_arg" in plugin_kwargs) and (plugin_kwargs["advanced_arg"] == ""): plugin_kwargs.pop("advanced_arg")
language = plugin_kwargs.get("advanced_arg", 'Chinese')
yield from 多文件翻译(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, language=language)

View File

@@ -1,4 +1,5 @@
import os
from loguru import logger
from toolbox import CatchException, update_ui, gen_time_str, promote_file_to_downloadzone
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from crazy_functions.crazy_utils import input_clipping
@@ -27,17 +28,17 @@ def eval_manim(code):
class_name = get_class_name(code)
try:
try:
time_str = gen_time_str()
subprocess.check_output([sys.executable, '-c', f"from gpt_log.MyAnimation import {class_name}; {class_name}().render()"])
shutil.move(f'media/videos/1080p60/{class_name}.mp4', f'gpt_log/{class_name}-{time_str}.mp4')
return f'gpt_log/{time_str}.mp4'
except subprocess.CalledProcessError as e:
output = e.output.decode()
print(f"Command returned non-zero exit status {e.returncode}: {output}.")
logger.error(f"Command returned non-zero exit status {e.returncode}: {output}.")
return f"Evaluating python script failed: {e.output}."
except:
print('generating mp4 failed')
except:
logger.error('generating mp4 failed')
return "Generating mp4 failed."
@@ -45,7 +46,7 @@ def get_code_block(reply):
import re
pattern = r"```([\s\S]*?)```" # regex pattern to match code blocks
matches = re.findall(pattern, reply) # find all code blocks in text
if len(matches) != 1:
if len(matches) != 1:
raise RuntimeError("GPT is not generating proper code.")
return matches[0].strip('python') # code block
@@ -61,7 +62,7 @@ def 动画生成(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt
user_request 当前用户的请求信息IP地址等
"""
# 清空历史,以免输入溢出
history = []
history = []
# 基本信息:功能、贡献者
chatbot.append([
@@ -73,24 +74,24 @@ def 动画生成(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt
# 尝试导入依赖, 如果缺少依赖, 则给出安装建议
dep_ok = yield from inspect_dependency(chatbot=chatbot, history=history) # 刷新界面
if not dep_ok: return
# 输入
i_say = f'Generate a animation to show: ' + txt
demo = ["Here is some examples of manim", examples_of_manim()]
_, demo = input_clipping(inputs="", history=demo, max_token_limit=2560)
# 开始
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=i_say, inputs_show_user=i_say,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=demo,
inputs=i_say, inputs_show_user=i_say,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=demo,
sys_prompt=
r"Write a animation script with 3blue1brown's manim. "+
r"Please begin with `from manim import *`. " +
r"Please begin with `from manim import *`. " +
r"Answer me with a code block wrapped by ```."
)
chatbot.append(["开始生成动画", "..."])
history.extend([i_say, gpt_say])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 界面更新
# 将代码转为动画
code = get_code_block(gpt_say)
res = eval_manim(code)
@@ -165,7 +166,7 @@ class PointWithTrace(Scene):
```
# do not use get_graph, this funciton is deprecated
# do not use get_graph, this function is deprecated
class ExampleFunctionGraph(Scene):
def construct(self):

View File

@@ -0,0 +1,437 @@
from toolbox import CatchException, update_ui, report_exception
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from crazy_functions.plugin_template.plugin_class_template import (
GptAcademicPluginTemplate,
)
from crazy_functions.plugin_template.plugin_class_template import ArgProperty
# 以下是每类图表的PROMPT
SELECT_PROMPT = """
{subject}
=============
以上是从文章中提取的摘要,将会使用这些摘要绘制图表。请你选择一个合适的图表类型:
1 流程图
2 序列图
3 类图
4 饼图
5 甘特图
6 状态图
7 实体关系图
8 象限提示图
不需要解释原因,仅需要输出单个不带任何标点符号的数字。
"""
# 没有思维导图!!!测试发现模型始终会优先选择思维导图
# 流程图
PROMPT_1 = """
请你给出围绕“{subject}”的逻辑关系图使用mermaid语法注意需要使用双引号将内容括起来。
mermaid语法举例
```mermaid
graph TD
P("编程") --> L1("Python")
P("编程") --> L2("C")
P("编程") --> L3("C++")
P("编程") --> L4("Javascipt")
P("编程") --> L5("PHP")
```
"""
# 序列图
PROMPT_2 = """
请你给出围绕“{subject}”的序列图使用mermaid语法。
mermaid语法举例
```mermaid
sequenceDiagram
participant A as 用户
participant B as 系统
A->>B: 登录请求
B->>A: 登录成功
A->>B: 获取数据
B->>A: 返回数据
```
"""
# 类图
PROMPT_3 = """
请你给出围绕“{subject}”的类图使用mermaid语法。
mermaid语法举例
```mermaid
classDiagram
Class01 <|-- AveryLongClass : Cool
Class03 *-- Class04
Class05 o-- Class06
Class07 .. Class08
Class09 --> C2 : Where am i?
Class09 --* C3
Class09 --|> Class07
Class07 : equals()
Class07 : Object[] elementData
Class01 : size()
Class01 : int chimp
Class01 : int gorilla
Class08 <--> C2: Cool label
```
"""
# 饼图
PROMPT_4 = """
请你给出围绕“{subject}”的饼图使用mermaid语法注意需要使用双引号将内容括起来。
mermaid语法举例
```mermaid
pie title Pets adopted by volunteers
"" : 386
"" : 85
"兔子" : 15
```
"""
# 甘特图
PROMPT_5 = """
请你给出围绕“{subject}”的甘特图使用mermaid语法注意需要使用双引号将内容括起来。
mermaid语法举例
```mermaid
gantt
title "项目开发流程"
dateFormat YYYY-MM-DD
section "设计"
"需求分析" :done, des1, 2024-01-06,2024-01-08
"原型设计" :active, des2, 2024-01-09, 3d
"UI设计" : des3, after des2, 5d
section "开发"
"前端开发" :2024-01-20, 10d
"后端开发" :2024-01-20, 10d
```
"""
# 状态图
PROMPT_6 = """
请你给出围绕“{subject}”的状态图使用mermaid语法注意需要使用双引号将内容括起来。
mermaid语法举例
```mermaid
stateDiagram-v2
[*] --> "Still"
"Still" --> [*]
"Still" --> "Moving"
"Moving" --> "Still"
"Moving" --> "Crash"
"Crash" --> [*]
```
"""
# 实体关系图
PROMPT_7 = """
请你给出围绕“{subject}”的实体关系图使用mermaid语法。
mermaid语法举例
```mermaid
erDiagram
CUSTOMER ||--o{ ORDER : places
ORDER ||--|{ LINE-ITEM : contains
CUSTOMER {
string name
string id
}
ORDER {
string orderNumber
date orderDate
string customerID
}
LINE-ITEM {
number quantity
string productID
}
```
"""
# 象限提示图
PROMPT_8 = """
请你给出围绕“{subject}”的象限图使用mermaid语法注意需要使用双引号将内容括起来。
mermaid语法举例
```mermaid
graph LR
A["Hard skill"] --> B("Programming")
A["Hard skill"] --> C("Design")
D["Soft skill"] --> E("Coordination")
D["Soft skill"] --> F("Communication")
```
"""
# 思维导图
PROMPT_9 = """
{subject}
==========
请给出上方内容的思维导图充分考虑其之间的逻辑使用mermaid语法注意需要使用双引号将内容括起来。
mermaid语法举例
```mermaid
mindmap
root((mindmap))
("Origins")
("Long history")
::icon(fa fa-book)
("Popularisation")
("British popular psychology author Tony Buzan")
::icon(fa fa-user)
("Research")
("On effectiveness<br/>and features")
::icon(fa fa-search)
("On Automatic creation")
::icon(fa fa-robot)
("Uses")
("Creative techniques")
::icon(fa fa-lightbulb-o)
("Strategic planning")
::icon(fa fa-flag)
("Argument mapping")
::icon(fa fa-comments)
("Tools")
("Pen and paper")
::icon(fa fa-pencil)
("Mermaid")
::icon(fa fa-code)
```
"""
def 解析历史输入(history, llm_kwargs, file_manifest, chatbot, plugin_kwargs):
############################## <第 0 步,切割输入> ##################################
# 借用PDF切割中的函数对文本进行切割
TOKEN_LIMIT_PER_FRAGMENT = 2500
txt = (
str(history).encode("utf-8", "ignore").decode()
) # avoid reading non-utf8 chars
from crazy_functions.pdf_fns.breakdown_txt import (
breakdown_text_to_satisfy_token_limit,
)
txt = breakdown_text_to_satisfy_token_limit(
txt=txt, limit=TOKEN_LIMIT_PER_FRAGMENT, llm_model=llm_kwargs["llm_model"]
)
############################## <第 1 步,迭代地历遍整个文章,提取精炼信息> ##################################
results = []
MAX_WORD_TOTAL = 4096
n_txt = len(txt)
last_iteration_result = "从以下文本中提取摘要。"
for i in range(n_txt):
NUM_OF_WORD = MAX_WORD_TOTAL // n_txt
i_say = f"Read this section, recapitulate the content of this section with less than {NUM_OF_WORD} words in Chinese: {txt[i]}"
i_say_show_user = f"[{i+1}/{n_txt}] Read this section, recapitulate the content of this section with less than {NUM_OF_WORD} words: {txt[i][:200]} ...."
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
i_say,
i_say_show_user, # i_say=真正给chatgpt的提问 i_say_show_user=给用户看的提问
llm_kwargs,
chatbot,
history=[
"The main content of the previous section is?",
last_iteration_result,
], # 迭代上一次的结果
sys_prompt="Extracts the main content from the text section where it is located for graphing purposes, answer me with Chinese.", # 提示
)
results.append(gpt_say)
last_iteration_result = gpt_say
############################## <第 2 步,根据整理的摘要选择图表类型> ##################################
gpt_say = str(plugin_kwargs) # 将图表类型参数赋值为插件参数
results_txt = "\n".join(results) # 合并摘要
if gpt_say not in [
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9",
]: # 如插件参数不正确则使用对话模型判断
i_say_show_user = (
f"接下来将判断适合的图表类型,如连续3次判断失败将会使用流程图进行绘制"
)
gpt_say = "[Local Message] 收到。" # 用户提示
chatbot.append([i_say_show_user, gpt_say])
yield from update_ui(chatbot=chatbot, history=[]) # 更新UI
i_say = SELECT_PROMPT.format(subject=results_txt)
i_say_show_user = f'请判断适合使用的流程图类型,其中数字对应关系为:1-流程图,2-序列图,3-类图,4-饼图,5-甘特图,6-状态图,7-实体关系图,8-象限提示图。由于不管提供文本是什么,模型大概率认为"思维导图"最合适,因此思维导图仅能通过参数调用。'
for i in range(3):
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=i_say,
inputs_show_user=i_say_show_user,
llm_kwargs=llm_kwargs,
chatbot=chatbot,
history=[],
sys_prompt="",
)
if gpt_say in [
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9",
]: # 判断返回是否正确
break
if gpt_say not in ["1", "2", "3", "4", "5", "6", "7", "8", "9"]:
gpt_say = "1"
############################## <第 3 步,根据选择的图表类型绘制图表> ##################################
if gpt_say == "1":
i_say = PROMPT_1.format(subject=results_txt)
elif gpt_say == "2":
i_say = PROMPT_2.format(subject=results_txt)
elif gpt_say == "3":
i_say = PROMPT_3.format(subject=results_txt)
elif gpt_say == "4":
i_say = PROMPT_4.format(subject=results_txt)
elif gpt_say == "5":
i_say = PROMPT_5.format(subject=results_txt)
elif gpt_say == "6":
i_say = PROMPT_6.format(subject=results_txt)
elif gpt_say == "7":
i_say = PROMPT_7.replace("{subject}", results_txt) # 由于实体关系图用到了{}符号
elif gpt_say == "8":
i_say = PROMPT_8.format(subject=results_txt)
elif gpt_say == "9":
i_say = PROMPT_9.format(subject=results_txt)
i_say_show_user = f"请根据判断结果绘制相应的图表。如需绘制思维导图请使用参数调用,同时过大的图表可能需要复制到在线编辑器中进行渲染。"
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=i_say,
inputs_show_user=i_say_show_user,
llm_kwargs=llm_kwargs,
chatbot=chatbot,
history=[],
sys_prompt="",
)
history.append(gpt_say)
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 界面更新
@CatchException
def Mermaid_Figure_Gen(
txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, web_port
):
"""
txt 输入栏用户输入的文本,例如需要翻译的一段话,再例如一个包含了待处理文件的路径
llm_kwargs gpt模型参数如温度和top_p等一般原样传递下去就行
plugin_kwargs 插件模型的参数,用于灵活调整复杂功能的各种参数
chatbot 聊天显示框的句柄,用于显示给用户
history 聊天历史,前情提要
system_prompt 给gpt的静默提醒
web_port 当前软件运行的端口号
"""
import os
# 基本信息:功能、贡献者
chatbot.append(
[
"函数插件功能?",
"根据当前聊天历史或指定的路径文件(文件内容优先)绘制多种mermaid图表将会由对话模型首先判断适合的图表类型随后绘制图表。\
\n您也可以使用插件参数指定绘制的图表类型,函数插件贡献者: Menghuan1918",
]
)
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
if os.path.exists(txt): # 如输入区无内容则直接解析历史记录
from crazy_functions.pdf_fns.parse_word import extract_text_from_files
file_exist, final_result, page_one, file_manifest, exception = (
extract_text_from_files(txt, chatbot, history)
)
else:
file_exist = False
exception = ""
file_manifest = []
if exception != "":
if exception == "word":
report_exception(
chatbot,
history,
a=f"解析项目: {txt}",
b=f"找到了.doc文件但是该文件格式不被支持请先转化为.docx格式。",
)
elif exception == "pdf":
report_exception(
chatbot,
history,
a=f"解析项目: {txt}",
b=f"导入软件依赖失败。使用该模块需要额外依赖,安装方法```pip install --upgrade pymupdf```。",
)
elif exception == "word_pip":
report_exception(
chatbot,
history,
a=f"解析项目: {txt}",
b=f"导入软件依赖失败。使用该模块需要额外依赖,安装方法```pip install --upgrade python-docx pywin32```。",
)
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
else:
if not file_exist:
history.append(txt) # 如输入区不是文件则将输入区内容加入历史记录
i_say_show_user = f"首先你从历史记录中提取摘要。"
gpt_say = "[Local Message] 收到。" # 用户提示
chatbot.append([i_say_show_user, gpt_say])
yield from update_ui(chatbot=chatbot, history=history) # 更新UI
yield from 解析历史输入(
history, llm_kwargs, file_manifest, chatbot, plugin_kwargs
)
else:
file_num = len(file_manifest)
for i in range(file_num): # 依次处理文件
i_say_show_user = f"[{i+1}/{file_num}]处理文件{file_manifest[i]}"
gpt_say = "[Local Message] 收到。" # 用户提示
chatbot.append([i_say_show_user, gpt_say])
yield from update_ui(chatbot=chatbot, history=history) # 更新UI
history = [] # 如输入区内容为文件则清空历史记录
history.append(final_result[i])
yield from 解析历史输入(
history, llm_kwargs, file_manifest, chatbot, plugin_kwargs
)
class Mermaid_Gen(GptAcademicPluginTemplate):
def __init__(self):
pass
def define_arg_selection_menu(self):
gui_definition = {
"Type_of_Mermaid": ArgProperty(
title="绘制的Mermaid图表类型",
options=[
"由LLM决定",
"流程图",
"序列图",
"类图",
"饼图",
"甘特图",
"状态图",
"实体关系图",
"象限提示图",
"思维导图",
],
default_value="由LLM决定",
description="选择'由LLM决定'时将由对话模型判断适合的图表类型(不包括思维导图),选择其他类型时将直接绘制指定的图表类型。",
type="dropdown",
).model_dump_json(),
}
return gui_definition
def execute(
txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request
):
options = [
"由LLM决定",
"流程图",
"序列图",
"类图",
"饼图",
"甘特图",
"状态图",
"实体关系图",
"象限提示图",
"思维导图",
]
plugin_kwargs = options.index(plugin_kwargs['Type_of_Mermaid'])
yield from Mermaid_Figure_Gen(
txt,
llm_kwargs,
plugin_kwargs,
chatbot,
history,
system_prompt,
user_request,
)

View File

@@ -6,13 +6,14 @@
"""
import time
from toolbox import CatchException, update_ui, gen_time_str, trimmed_format_exc, ProxyNetworkActivate
from toolbox import get_conf, select_api_key, update_ui_lastest_msg, Singleton
from toolbox import get_conf, select_api_key, update_ui_latest_msg, Singleton
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive, get_plugin_arg
from crazy_functions.crazy_utils import input_clipping, try_install_deps
from crazy_functions.agent_fns.persistent import GradioMultiuserManagerForPersistentClasses
from crazy_functions.agent_fns.auto_agent import AutoGenMath
import time
from loguru import logger
def remove_model_prefix(llm):
if llm.startswith('api2d-'): llm = llm.replace('api2d-', '')
@@ -21,7 +22,7 @@ def remove_model_prefix(llm):
@CatchException
def 多智能体终端(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
def Multi_Agent_Legacy终端(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
"""
txt 输入栏用户输入的文本例如需要翻译的一段话再例如一个包含了待处理文件的路径
llm_kwargs gpt模型参数如温度和top_p等一般原样传递下去就行
@@ -57,11 +58,11 @@ def 多智能体终端(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_
if get_conf("AUTOGEN_USE_DOCKER"):
import docker
except:
chatbot.append([ f"处理任务: {txt}",
chatbot.append([ f"处理任务: {txt}",
f"导入软件依赖失败。使用该模块需要额外依赖,安装方法```pip install --upgrade pyautogen docker```。"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# 尝试导入依赖,如果缺少依赖,则给出安装建议
try:
import autogen
@@ -72,22 +73,22 @@ def 多智能体终端(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_
chatbot.append([f"处理任务: {txt}", f"缺少docker运行环境"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# 解锁插件
chatbot.get_cookies()['lock_plugin'] = None
persistent_class_multi_user_manager = GradioMultiuserManagerForPersistentClasses()
user_uuid = chatbot.get_cookies().get('uuid')
persistent_key = f"{user_uuid}->多智能体终端"
persistent_key = f"{user_uuid}->Multi_Agent_Legacy终端"
if persistent_class_multi_user_manager.already_alive(persistent_key):
# 当已经存在一个正在运行的多智能体终端时,直接将用户输入传递给它,而不是再次启动一个新的多智能体终端
print('[debug] feed new user input')
# 当已经存在一个正在运行的Multi_Agent_Legacy终端时,直接将用户输入传递给它,而不是再次启动一个新的Multi_Agent_Legacy终端
logger.info('[debug] feed new user input')
executor = persistent_class_multi_user_manager.get(persistent_key)
exit_reason = yield from executor.main_process_ui_control(txt, create_or_resume="resume")
else:
# 运行多智能体终端 (首次)
print('[debug] create new executor instance')
# 运行Multi_Agent_Legacy终端 (首次)
logger.info('[debug] create new executor instance')
history = []
chatbot.append(["正在启动: 多智能体终端", "插件动态生成, 执行开始, 作者 Microsoft & Binary-Husky."])
chatbot.append(["正在启动: Multi_Agent_Legacy终端", "插件动态生成, 执行开始, 作者 Microsoft & Binary-Husky."])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
executor = AutoGenMath(llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request)
persistent_class_multi_user_manager.set(persistent_key, executor)
@@ -95,7 +96,7 @@ def 多智能体终端(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_
if exit_reason == "wait_feedback":
# 当用户点击了“等待反馈”按钮时将executor存储到cookie中等待用户的再次调用
executor.chatbot.get_cookies()['lock_plugin'] = 'crazy_functions.多智能体->多智能体终端'
executor.chatbot.get_cookies()['lock_plugin'] = 'crazy_functions.Multi_Agent_Legacy->Multi_Agent_Legacy终端'
else:
executor.chatbot.get_cookies()['lock_plugin'] = None
yield from update_ui(chatbot=executor.chatbot, history=executor.history) # 更新状态

View File

@@ -1,5 +1,5 @@
from toolbox import CatchException, update_ui, get_conf
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
import datetime
@CatchException
def 同时问询(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
@@ -20,8 +20,8 @@ def 同时问询(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt
# llm_kwargs['llm_model'] = 'chatglm&gpt-3.5-turbo&api2d-gpt-3.5-turbo' # 支持任意数量的llm接口用&符号分隔
llm_kwargs['llm_model'] = MULTI_QUERY_LLM_MODELS # 支持任意数量的llm接口用&符号分隔
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=txt, inputs_show_user=txt,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
inputs=txt, inputs_show_user=txt,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
sys_prompt=system_prompt,
retry_times_at_unknown_error=0
)
@@ -52,8 +52,8 @@ def 同时问询_指定模型(txt, llm_kwargs, plugin_kwargs, chatbot, history,
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 由于请求gpt需要一段时间我们先及时地做一次界面更新
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=txt, inputs_show_user=txt,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
inputs=txt, inputs_show_user=txt,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=history,
sys_prompt=system_prompt,
retry_times_at_unknown_error=0
)

View File

@@ -1,13 +1,12 @@
from loguru import logger
from toolbox import update_ui
from toolbox import CatchException, report_exception
from .crazy_utils import read_and_clean_pdf_text
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
fast_debug = False
from crazy_functions.crazy_utils import read_and_clean_pdf_text
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
def 解析PDF(file_name, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt):
import tiktoken
print('begin analysis on:', file_name)
logger.info('begin analysis on:', file_name)
############################## <第 0 步切割PDF> ##################################
# 递归地切割PDF文件每一块尽量是完整的一个section比如introductionexperiment等必要时再进行切割
@@ -15,7 +14,7 @@ def 解析PDF(file_name, llm_kwargs, plugin_kwargs, chatbot, history, system_pro
file_content, page_one = read_and_clean_pdf_text(file_name) # 尝试按照章节切割PDF
file_content = file_content.encode('utf-8', 'ignore').decode() # avoid reading non-utf8 chars
page_one = str(page_one).encode('utf-8', 'ignore').decode() # avoid reading non-utf8 chars
TOKEN_LIMIT_PER_FRAGMENT = 2500
from crazy_functions.pdf_fns.breakdown_txt import breakdown_text_to_satisfy_token_limit
@@ -23,7 +22,7 @@ def 解析PDF(file_name, llm_kwargs, plugin_kwargs, chatbot, history, system_pro
page_one_fragments = breakdown_text_to_satisfy_token_limit(txt=str(page_one), limit=TOKEN_LIMIT_PER_FRAGMENT//4, llm_model=llm_kwargs['llm_model'])
# 为了更好的效果我们剥离Introduction之后的部分如果有
paper_meta = page_one_fragments[0].split('introduction')[0].split('Introduction')[0].split('INTRODUCTION')[0]
############################## <第 1 步从摘要中提取高价值信息放到history中> ##################################
final_results = []
final_results.append(paper_meta)
@@ -36,16 +35,16 @@ def 解析PDF(file_name, llm_kwargs, plugin_kwargs, chatbot, history, system_pro
last_iteration_result = paper_meta # 初始值是摘要
MAX_WORD_TOTAL = 4096
n_fragment = len(paper_fragments)
if n_fragment >= 20: print('文章极长,不能达到预期效果')
if n_fragment >= 20: logger.warning('文章极长,不能达到预期效果')
for i in range(n_fragment):
NUM_OF_WORD = MAX_WORD_TOTAL // n_fragment
i_say = f"Read this section, recapitulate the content of this section with less than {NUM_OF_WORD} words: {paper_fragments[i]}"
i_say_show_user = f"[{i+1}/{n_fragment}] Read this section, recapitulate the content of this section with less than {NUM_OF_WORD} words: {paper_fragments[i][:200]} ...."
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(i_say, i_say_show_user, # i_say=真正给chatgpt的提问 i_say_show_user=给用户看的提问
llm_kwargs, chatbot,
llm_kwargs, chatbot,
history=["The main idea of the previous section is?", last_iteration_result], # 迭代上一次的结果
sys_prompt="Extract the main idea of this section, answer me with Chinese." # 提示
)
)
iteration_results.append(gpt_say)
last_iteration_result = gpt_say
@@ -57,13 +56,13 @@ def 解析PDF(file_name, llm_kwargs, plugin_kwargs, chatbot, history, system_pro
chatbot.append([i_say_show_user, gpt_say])
############################## <第 4 步设置一个token上限防止回答时Token溢出> ##################################
from .crazy_utils import input_clipping
from crazy_functions.crazy_utils import input_clipping
_, final_results = input_clipping("", final_results, max_token_limit=3200)
yield from update_ui(chatbot=chatbot, history=final_results) # 注意这里的历史记录被替代了
@CatchException
def 理解PDF文档内容标准文件输入(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
def PDF_QA标准文件输入(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
import glob, os
# 基本信息:功能、贡献者
@@ -76,8 +75,8 @@ def 理解PDF文档内容标准文件输入(txt, llm_kwargs, plugin_kwargs, chat
try:
import fitz
except:
report_exception(chatbot, history,
a = f"解析项目: {txt}",
report_exception(chatbot, history,
a = f"解析项目: {txt}",
b = f"导入软件依赖失败。使用该模块需要额外依赖,安装方法```pip install --upgrade pymupdf```。")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return

View File

@@ -1,23 +1,25 @@
from loguru import logger
from toolbox import update_ui, promote_file_to_downloadzone, gen_time_str
from toolbox import CatchException, report_exception
from toolbox import write_history_to_file, promote_file_to_downloadzone
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from .crazy_utils import read_and_clean_pdf_text
from .crazy_utils import input_clipping
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from crazy_functions.crazy_utils import read_and_clean_pdf_text
from crazy_functions.crazy_utils import input_clipping
def 解析PDF(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt):
file_write_buffer = []
for file_name in file_manifest:
print('begin analysis on:', file_name)
logger.info('begin analysis on:', file_name)
############################## <第 0 步切割PDF> ##################################
# 递归地切割PDF文件每一块尽量是完整的一个section比如introductionexperiment等必要时再进行切割
# 的长度必须小于 2500 个 Token
file_content, page_one = read_and_clean_pdf_text(file_name) # 尝试按照章节切割PDF
file_content = file_content.encode('utf-8', 'ignore').decode() # avoid reading non-utf8 chars
page_one = str(page_one).encode('utf-8', 'ignore').decode() # avoid reading non-utf8 chars
TOKEN_LIMIT_PER_FRAGMENT = 2500
from crazy_functions.pdf_fns.breakdown_txt import breakdown_text_to_satisfy_token_limit
@@ -25,7 +27,7 @@ def 解析PDF(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot,
page_one_fragments = breakdown_text_to_satisfy_token_limit(txt=str(page_one), limit=TOKEN_LIMIT_PER_FRAGMENT//4, llm_model=llm_kwargs['llm_model'])
# 为了更好的效果我们剥离Introduction之后的部分如果有
paper_meta = page_one_fragments[0].split('introduction')[0].split('Introduction')[0].split('INTRODUCTION')[0]
############################## <第 1 步从摘要中提取高价值信息放到history中> ##################################
final_results = []
final_results.append(paper_meta)
@@ -38,16 +40,16 @@ def 解析PDF(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot,
last_iteration_result = paper_meta # 初始值是摘要
MAX_WORD_TOTAL = 4096 * 0.7
n_fragment = len(paper_fragments)
if n_fragment >= 20: print('文章极长,不能达到预期效果')
if n_fragment >= 20: logger.warning('文章极长,不能达到预期效果')
for i in range(n_fragment):
NUM_OF_WORD = MAX_WORD_TOTAL // n_fragment
i_say = f"Read this section, recapitulate the content of this section with less than {NUM_OF_WORD} Chinese characters: {paper_fragments[i]}"
i_say_show_user = f"[{i+1}/{n_fragment}] Read this section, recapitulate the content of this section with less than {NUM_OF_WORD} Chinese characters: {paper_fragments[i][:200]}"
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(i_say, i_say_show_user, # i_say=真正给chatgpt的提问 i_say_show_user=给用户看的提问
llm_kwargs, chatbot,
llm_kwargs, chatbot,
history=["The main idea of the previous section is?", last_iteration_result], # 迭代上一次的结果
sys_prompt="Extract the main idea of this section with Chinese." # 提示
)
)
iteration_results.append(gpt_say)
last_iteration_result = gpt_say
@@ -67,15 +69,15 @@ def 解析PDF(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot,
- (2):What are the past methods? What are the problems with them? Is the approach well motivated?
- (3):What is the research methodology proposed in this paper?
- (4):On what task and what performance is achieved by the methods in this paper? Can the performance support their goals?
Follow the format of the output that follows:
Follow the format of the output that follows:
1. Title: xxx\n\n
2. Authors: xxx\n\n
3. Affiliation: xxx\n\n
4. Keywords: xxx\n\n
5. Urls: xxx or xxx , xxx \n\n
6. Summary: \n\n
- (1):xxx;\n
- (2):xxx;\n
- (1):xxx;\n
- (2):xxx;\n
- (3):xxx;\n
- (4):xxx.\n\n
Be sure to use Chinese answers (proper nouns need to be marked in English), statements as concise and academic as possible,
@@ -85,8 +87,8 @@ do not have too much repetitive information, numerical values using the original
file_write_buffer.extend(final_results)
i_say, final_results = input_clipping(i_say, final_results, max_token_limit=2000)
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=i_say, inputs_show_user='开始最终总结',
llm_kwargs=llm_kwargs, chatbot=chatbot, history=final_results,
inputs=i_say, inputs_show_user='开始最终总结',
llm_kwargs=llm_kwargs, chatbot=chatbot, history=final_results,
sys_prompt= f"Extract the main idea of this paper with less than {NUM_OF_WORD} Chinese characters"
)
final_results.append(gpt_say)
@@ -101,21 +103,21 @@ do not have too much repetitive information, numerical values using the original
@CatchException
def 批量总结PDF文档(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
def PDF_Summary(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
import glob, os
# 基本信息:功能、贡献者
chatbot.append([
"函数插件功能?",
"批量总结PDF文档。函数插件贡献者: ValeriaWongEralien"])
"PDF_Summary。函数插件贡献者: ValeriaWongEralien"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# 尝试导入依赖,如果缺少依赖,则给出安装建议
try:
import fitz
except:
report_exception(chatbot, history,
a = f"解析项目: {txt}",
report_exception(chatbot, history,
a = f"解析项目: {txt}",
b = f"导入软件依赖失败。使用该模块需要额外依赖,安装方法```pip install --upgrade pymupdf```。")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
@@ -134,7 +136,7 @@ def 批量总结PDF文档(txt, llm_kwargs, plugin_kwargs, chatbot, history, syst
# 搜索需要处理的文件清单
file_manifest = [f for f in glob.glob(f'{project_folder}/**/*.pdf', recursive=True)]
# 如果没找到任何文件
if len(file_manifest) == 0:
report_exception(chatbot, history, a = f"解析项目: {txt}", b = f"找不到任何.tex或.pdf文件: {txt}")

View File

@@ -0,0 +1,83 @@
from toolbox import CatchException, check_packages, get_conf
from toolbox import update_ui, update_ui_latest_msg, disable_auto_promotion
from toolbox import trimmed_format_exc_markdown
from crazy_functions.crazy_utils import get_files_from_everything
from crazy_functions.pdf_fns.parse_pdf import get_avail_grobid_url
from crazy_functions.pdf_fns.parse_pdf_via_doc2x import 解析PDF_基于DOC2X
from crazy_functions.pdf_fns.parse_pdf_legacy import 解析PDF_简单拆解
from crazy_functions.pdf_fns.parse_pdf_grobid import 解析PDF_基于GROBID
from shared_utils.colorful import *
@CatchException
def 批量翻译PDF文档(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
disable_auto_promotion(chatbot)
# 基本信息:功能、贡献者
chatbot.append([None, "插件功能批量翻译PDF文档。函数插件贡献者: Binary-Husky"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# 尝试导入依赖,如果缺少依赖,则给出安装建议
try:
check_packages(["fitz", "tiktoken", "scipdf"])
except:
chatbot.append([None, f"导入软件依赖失败。使用该模块需要额外依赖,安装方法```pip install --upgrade pymupdf tiktoken scipdf_parser```。"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# 清空历史,以免输入溢出
history = []
success, file_manifest, project_folder = get_files_from_everything(txt, type='.pdf')
# 检测输入参数,如没有给定输入参数,直接退出
if (not success) and txt == "": txt = '空空如也的输入栏。提示请先上传文件把PDF文件拖入对话'
# 如果没找到任何文件
if len(file_manifest) == 0:
chatbot.append([None, f"找不到任何.pdf拓展名的文件: {txt}"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# 开始正式执行任务
method = plugin_kwargs.get("pdf_parse_method", None)
if method == "DOC2X":
# ------- 第一种方法效果最好但是需要DOC2X服务 -------
DOC2X_API_KEY = get_conf("DOC2X_API_KEY")
if len(DOC2X_API_KEY) != 0:
try:
yield from 解析PDF_基于DOC2X(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, DOC2X_API_KEY, user_request)
return
except:
chatbot.append([None, f"DOC2X服务不可用请检查报错详细。{trimmed_format_exc_markdown()}"])
yield from update_ui(chatbot=chatbot, history=history)
if method == "GROBID":
# ------- 第二种方法,效果次优 -------
grobid_url = get_avail_grobid_url()
if grobid_url is not None:
yield from 解析PDF_基于GROBID(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, grobid_url)
return
if method == "Classic":
# ------- 第三种方法,早期代码,效果不理想 -------
yield from update_ui_latest_msg("GROBID服务不可用请检查config中的GROBID_URL。作为替代现在将执行效果稍差的旧版代码。", chatbot, history, delay=3)
yield from 解析PDF_简单拆解(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt)
return
if method is None:
# ------- 以上三种方法都试一遍 -------
DOC2X_API_KEY = get_conf("DOC2X_API_KEY")
if len(DOC2X_API_KEY) != 0:
try:
yield from 解析PDF_基于DOC2X(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, DOC2X_API_KEY, user_request)
return
except:
chatbot.append([None, f"DOC2X服务不可用正在尝试GROBID。{trimmed_format_exc_markdown()}"])
yield from update_ui(chatbot=chatbot, history=history)
grobid_url = get_avail_grobid_url()
if grobid_url is not None:
yield from 解析PDF_基于GROBID(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, grobid_url)
return
yield from update_ui_latest_msg("GROBID服务不可用请检查config中的GROBID_URL。作为替代现在将执行效果稍差的旧版代码。", chatbot, history, delay=3)
yield from 解析PDF_简单拆解(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt)
return

View File

@@ -1,11 +1,11 @@
from toolbox import CatchException, report_exception, get_log_folder, gen_time_str
from toolbox import update_ui, promote_file_to_downloadzone, update_ui_lastest_msg, disable_auto_promotion
from toolbox import update_ui, promote_file_to_downloadzone, update_ui_latest_msg, disable_auto_promotion
from toolbox import write_history_to_file, promote_file_to_downloadzone
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from .crazy_utils import request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency
from .crazy_utils import read_and_clean_pdf_text
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from crazy_functions.crazy_utils import request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency
from crazy_functions.crazy_utils import read_and_clean_pdf_text
from .pdf_fns.parse_pdf import parse_pdf, get_avail_grobid_url, translate_pdf
from colorful import *
from shared_utils.colorful import *
import copy
import os
import math
@@ -60,7 +60,7 @@ def 批量翻译PDF文档(txt, llm_kwargs, plugin_kwargs, chatbot, history, syst
# 清空历史,以免输入溢出
history = []
from .crazy_utils import get_files_from_everything
from crazy_functions.crazy_utils import get_files_from_everything
success, file_manifest, project_folder = get_files_from_everything(txt, type='.pdf')
if len(file_manifest) > 0:
# 尝试导入依赖,如果缺少依赖,则给出安装建议
@@ -76,8 +76,8 @@ def 批量翻译PDF文档(txt, llm_kwargs, plugin_kwargs, chatbot, history, syst
success_mmd, file_manifest_mmd, _ = get_files_from_everything(txt, type='.mmd')
success = success or success_mmd
file_manifest += file_manifest_mmd
chatbot.append(["文件列表:", ", ".join([e.split('/')[-1] for e in file_manifest])]);
yield from update_ui( chatbot=chatbot, history=history)
chatbot.append(["文件列表:", ", ".join([e.split('/')[-1] for e in file_manifest])]);
yield from update_ui( chatbot=chatbot, history=history)
# 检测输入参数,如没有给定输入参数,直接退出
if not success:
if txt == "": txt = '空空如也的输入栏'

View File

@@ -0,0 +1,33 @@
from crazy_functions.plugin_template.plugin_class_template import GptAcademicPluginTemplate, ArgProperty
from .PDF_Translate import 批量翻译PDF文档
class PDF_Tran(GptAcademicPluginTemplate):
def __init__(self):
"""
请注意`execute`会执行在不同的线程中,因此您在定义和使用类变量时,应当慎之又慎!
"""
pass
def define_arg_selection_menu(self):
"""
定义插件的二级选项菜单
"""
gui_definition = {
"main_input":
ArgProperty(title="PDF文件路径", description="未指定路径,请上传文件后,再点击该插件", default_value="", type="string").model_dump_json(), # 主输入,自动从输入框同步
"additional_prompt":
ArgProperty(title="额外提示词", description="例如:对专有名词、翻译语气等方面的要求", default_value="", type="string").model_dump_json(), # 高级参数输入区,自动同步
"pdf_parse_method":
ArgProperty(title="PDF解析方法", options=["DOC2X", "GROBID", "Classic"], description="", default_value="GROBID", type="dropdown").model_dump_json(),
}
return gui_definition
def execute(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
"""
执行插件
"""
main_input = plugin_kwargs["main_input"]
additional_prompt = plugin_kwargs["additional_prompt"]
pdf_parse_method = plugin_kwargs["pdf_parse_method"]
yield from 批量翻译PDF文档(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request)

View File

@@ -1,19 +1,18 @@
from toolbox import update_ui
from toolbox import CatchException, report_exception
from toolbox import write_history_to_file, promote_file_to_downloadzone
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
def 解析Paper(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt):
import time, glob, os
print('begin analysis on:', file_manifest)
for index, fp in enumerate(file_manifest):
with open(fp, 'r', encoding='utf-8', errors='replace') as f:
file_content = f.read()
prefix = "接下来请你逐文件分析下面的论文文件,概括其内容" if index==0 else ""
i_say = prefix + f'请对下面的文章片段用中文做一个概述,文件名是{os.path.relpath(fp, project_folder)},文章内容是 ```{file_content}```'
i_say_show_user = prefix + f'[{index}/{len(file_manifest)}] 请对下面的文章片段做一个概述: {os.path.abspath(fp)}'
i_say_show_user = prefix + f'[{index+1}/{len(file_manifest)}] 请对下面的文章片段做一个概述: {os.path.abspath(fp)}'
chatbot.append((i_say_show_user, "[Local Message] waiting gpt response."))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
@@ -44,7 +43,7 @@ def 解析Paper(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbo
@CatchException
def 读文章写摘要(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
def Paper_Abstract_Writer(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
history = [] # 清空历史,以免输入溢出
import glob, os
if os.path.exists(txt):

View File

@@ -0,0 +1,360 @@
import os
import time
import glob
from pathlib import Path
from datetime import datetime
from dataclasses import dataclass
from typing import Dict, List, Generator, Tuple
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from toolbox import update_ui, promote_file_to_downloadzone, write_history_to_file, CatchException, report_exception
from shared_utils.fastapi_server import validate_path_safety
from crazy_functions.paper_fns.paper_download import extract_paper_id, extract_paper_ids, get_arxiv_paper, format_arxiv_id
@dataclass
class PaperQuestion:
"""论文分析问题类"""
id: str # 问题ID
question: str # 问题内容
importance: int # 重要性 (1-55最高)
description: str # 问题描述
class PaperAnalyzer:
"""论文快速分析器"""
def __init__(self, llm_kwargs: Dict, plugin_kwargs: Dict, chatbot: List, history: List, system_prompt: str):
"""初始化分析器"""
self.llm_kwargs = llm_kwargs
self.plugin_kwargs = plugin_kwargs
self.chatbot = chatbot
self.history = history
self.system_prompt = system_prompt
self.paper_content = ""
self.results = {}
# 定义论文分析问题库已合并为4个核心问题
self.questions = [
PaperQuestion(
id="research_and_methods",
question="这篇论文的主要研究问题、目标和方法是什么请分析1)论文的核心研究问题和研究动机2)论文提出的关键方法、模型或理论框架3)这些方法如何解决研究问题。",
importance=5,
description="研究问题与方法"
),
PaperQuestion(
id="findings_and_innovation",
question="论文的主要发现、结论及创新点是什么请分析1)论文的核心结果与主要发现2)作者得出的关键结论3)研究的创新点与对领域的贡献4)与已有工作的区别。",
importance=4,
description="研究发现与创新"
),
PaperQuestion(
id="methodology_and_data",
question="论文使用了什么研究方法和数据请详细分析1)研究设计与实验设置2)数据收集方法与数据集特点3)分析技术与评估方法4)方法学上的合理性。",
importance=3,
description="研究方法与数据"
),
PaperQuestion(
id="limitations_and_impact",
question="论文的局限性、未来方向及潜在影响是什么请分析1)研究的不足与限制因素2)作者提出的未来研究方向3)该研究对学术界和行业可能产生的影响4)研究结果的适用范围与推广价值。",
importance=2,
description="局限性与影响"
),
]
# 按重要性排序
self.questions.sort(key=lambda q: q.importance, reverse=True)
def _load_paper(self, paper_path: str) -> Generator:
from crazy_functions.doc_fns.text_content_loader import TextContentLoader
"""加载论文内容"""
yield from update_ui(chatbot=self.chatbot, history=self.history)
# 使用TextContentLoader读取文件
loader = TextContentLoader(self.chatbot, self.history)
yield from loader.execute_single_file(paper_path)
# 获取加载的内容
if len(self.history) >= 2 and self.history[-2]:
self.paper_content = self.history[-2]
yield from update_ui(chatbot=self.chatbot, history=self.history)
return True
else:
self.chatbot.append(["错误", "无法读取论文内容,请检查文件是否有效"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
return False
def _analyze_question(self, question: PaperQuestion) -> Generator:
"""分析单个问题 - 直接显示问题和答案"""
try:
# 创建分析提示
prompt = f"请基于以下论文内容回答问题:\n\n{self.paper_content}\n\n问题:{question.question}"
# 使用单线程版本的请求函数
response = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=prompt,
inputs_show_user=question.question, # 显示问题本身
llm_kwargs=self.llm_kwargs,
chatbot=self.chatbot,
history=[], # 空历史,确保每个问题独立分析
sys_prompt="你是一个专业的科研论文分析助手,需要仔细阅读论文内容并回答问题。请保持客观、准确,并基于论文内容提供深入分析。"
)
if response:
self.results[question.id] = response
return True
return False
except Exception as e:
self.chatbot.append(["错误", f"分析问题时出错: {str(e)}"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
return False
def _generate_summary(self) -> Generator:
"""生成最终总结报告"""
self.chatbot.append(["生成报告", "正在整合分析结果,生成最终报告..."])
yield from update_ui(chatbot=self.chatbot, history=self.history)
summary_prompt = "请基于以下对论文的各个方面的分析,生成一份全面的论文解读报告。报告应该简明扼要地呈现论文的关键内容,并保持逻辑连贯性。"
for q in self.questions:
if q.id in self.results:
summary_prompt += f"\n\n关于{q.description}的分析:\n{self.results[q.id]}"
try:
# 使用单线程版本的请求函数,可以在前端实时显示生成结果
response = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=summary_prompt,
inputs_show_user="生成论文解读报告",
llm_kwargs=self.llm_kwargs,
chatbot=self.chatbot,
history=[],
sys_prompt="你是一个科研论文解读专家,请将多个方面的分析整合为一份完整、连贯、有条理的报告。报告应当重点突出,层次分明,并且保持学术性和客观性。"
)
if response:
return response
return "报告生成失败"
except Exception as e:
self.chatbot.append(["错误", f"生成报告时出错: {str(e)}"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
return "报告生成失败: " + str(e)
def save_report(self, report: str) -> Generator:
"""保存分析报告"""
timestamp = time.strftime("%Y%m%d_%H%M%S")
# 保存为Markdown文件
try:
md_content = f"# 论文快速解读报告\n\n{report}"
for q in self.questions:
if q.id in self.results:
md_content += f"\n\n## {q.description}\n\n{self.results[q.id]}"
result_file = write_history_to_file(
history=[md_content],
file_basename=f"论文解读_{timestamp}.md"
)
if result_file and os.path.exists(result_file):
promote_file_to_downloadzone(result_file, chatbot=self.chatbot)
self.chatbot.append(["保存成功", f"解读报告已保存至: {os.path.basename(result_file)}"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
else:
self.chatbot.append(["警告", "保存报告成功但找不到文件"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
except Exception as e:
self.chatbot.append(["警告", f"保存报告失败: {str(e)}"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
def analyze_paper(self, paper_path: str) -> Generator:
"""分析论文主流程"""
# 加载论文
success = yield from self._load_paper(paper_path)
if not success:
return
# 分析关键问题 - 直接询问每个问题,不显示进度信息
for question in self.questions:
yield from self._analyze_question(question)
# 生成总结报告
final_report = yield from self._generate_summary()
# 显示最终报告
# self.chatbot.append(["论文解读报告", final_report])
yield from update_ui(chatbot=self.chatbot, history=self.history)
# 保存报告
yield from self.save_report(final_report)
def _find_paper_file(path: str) -> str:
"""查找路径中的论文文件(简化版)"""
if os.path.isfile(path):
return path
# 支持的文件扩展名(按优先级排序)
extensions = ["pdf", "docx", "doc", "txt", "md", "tex"]
# 简单地遍历目录
if os.path.isdir(path):
try:
for ext in extensions:
# 手动检查每个可能的文件而不使用glob
potential_file = os.path.join(path, f"paper.{ext}")
if os.path.exists(potential_file) and os.path.isfile(potential_file):
return potential_file
# 如果没找到特定命名的文件,检查目录中的所有文件
for file in os.listdir(path):
file_path = os.path.join(path, file)
if os.path.isfile(file_path):
file_ext = file.split('.')[-1].lower() if '.' in file else ""
if file_ext in extensions:
return file_path
except Exception:
pass # 忽略任何错误
return None
def download_paper_by_id(paper_info, chatbot, history) -> str:
"""下载论文并返回保存路径
Args:
paper_info: 元组包含论文ID类型arxiv或doi和ID值
chatbot: 聊天机器人对象
history: 历史记录
Returns:
str: 下载的论文路径或None
"""
from crazy_functions.review_fns.data_sources.scihub_source import SciHub
id_type, paper_id = paper_info
# 创建保存目录 - 使用时间戳创建唯一文件夹
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
user_name = chatbot.get_user() if hasattr(chatbot, 'get_user') else "default"
from toolbox import get_log_folder, get_user
base_save_dir = get_log_folder(get_user(chatbot), plugin_name='paper_download')
save_dir = os.path.join(base_save_dir, f"papers_{timestamp}")
if not os.path.exists(save_dir):
os.makedirs(save_dir)
save_path = Path(save_dir)
chatbot.append([f"下载论文", f"正在下载{'arXiv' if id_type == 'arxiv' else 'DOI'} {paper_id} 的论文..."])
update_ui(chatbot=chatbot, history=history)
pdf_path = None
try:
if id_type == 'arxiv':
# 使用改进的arxiv查询方法
formatted_id = format_arxiv_id(paper_id)
paper_result = get_arxiv_paper(formatted_id)
if not paper_result:
chatbot.append([f"下载失败", f"未找到arXiv论文: {paper_id}"])
update_ui(chatbot=chatbot, history=history)
return None
# 下载PDF
filename = f"arxiv_{paper_id.replace('/', '_')}.pdf"
pdf_path = str(save_path / filename)
paper_result.download_pdf(filename=pdf_path)
else: # doi
# 下载DOI
sci_hub = SciHub(
doi=paper_id,
path=save_path
)
pdf_path = sci_hub.fetch()
# 检查下载结果
if pdf_path and os.path.exists(pdf_path):
promote_file_to_downloadzone(pdf_path, chatbot=chatbot)
chatbot.append([f"下载成功", f"已成功下载论文: {os.path.basename(pdf_path)}"])
update_ui(chatbot=chatbot, history=history)
return pdf_path
else:
chatbot.append([f"下载失败", f"论文下载失败: {paper_id}"])
update_ui(chatbot=chatbot, history=history)
return None
except Exception as e:
chatbot.append([f"下载错误", f"下载论文时出错: {str(e)}"])
update_ui(chatbot=chatbot, history=history)
return None
@CatchException
def 快速论文解读(txt: str, llm_kwargs: Dict, plugin_kwargs: Dict, chatbot: List,
history: List, system_prompt: str, user_request: str):
"""主函数 - 论文快速解读"""
# 初始化分析器
chatbot.append(["函数插件功能及使用方式", "论文快速解读:通过分析论文的关键要素,帮助您迅速理解论文内容,适用于各学科领域的科研论文。 <br><br>📋 使用方式:<br>1、直接上传PDF文件或者输入DOI号仅针对SCI hub存在的论文或arXiv ID如2501.03916<br>2、点击插件开始分析"])
yield from update_ui(chatbot=chatbot, history=history)
paper_file = None
# 检查输入是否为论文IDarxiv或DOI
paper_info = extract_paper_id(txt)
if paper_info:
# 如果是论文ID下载论文
chatbot.append(["检测到论文ID", f"检测到{'arXiv' if paper_info[0] == 'arxiv' else 'DOI'} ID: {paper_info[1]},准备下载论文..."])
yield from update_ui(chatbot=chatbot, history=history)
# 下载论文 - 完全重新实现
paper_file = download_paper_by_id(paper_info, chatbot, history)
if not paper_file:
report_exception(chatbot, history, a=f"下载论文失败", b=f"无法下载{'arXiv' if paper_info[0] == 'arxiv' else 'DOI'}论文: {paper_info[1]}")
yield from update_ui(chatbot=chatbot, history=history)
return
else:
# 检查输入路径
if not os.path.exists(txt):
report_exception(chatbot, history, a=f"解析论文: {txt}", b=f"找不到文件或无权访问: {txt}")
yield from update_ui(chatbot=chatbot, history=history)
return
# 验证路径安全性
user_name = chatbot.get_user()
validate_path_safety(txt, user_name)
# 查找论文文件
paper_file = _find_paper_file(txt)
if not paper_file:
report_exception(chatbot, history, a=f"解析论文", b=f"在路径 {txt} 中未找到支持的论文文件")
yield from update_ui(chatbot=chatbot, history=history)
return
yield from update_ui(chatbot=chatbot, history=history)
# 增加调试信息检查paper_file的类型和值
chatbot.append(["文件类型检查", f"paper_file类型: {type(paper_file)}, 值: {paper_file}"])
yield from update_ui(chatbot=chatbot, history=history)
chatbot.pop() # 移除调试信息
# 确保paper_file是字符串
if paper_file is not None and not isinstance(paper_file, str):
# 尝试转换为字符串
try:
paper_file = str(paper_file)
except:
report_exception(chatbot, history, a=f"类型错误", b=f"论文路径不是有效的字符串: {type(paper_file)}")
yield from update_ui(chatbot=chatbot, history=history)
return
# 分析论文
chatbot.append(["开始分析", f"正在分析论文: {os.path.basename(paper_file)}"])
yield from update_ui(chatbot=chatbot, history=history)
analyzer = PaperAnalyzer(llm_kwargs, plugin_kwargs, chatbot, history, system_prompt)
yield from analyzer.analyze_paper(paper_file)

View File

@@ -1,42 +1,40 @@
from loguru import logger
from toolbox import update_ui
from toolbox import CatchException, report_exception
from toolbox import write_history_to_file, promote_file_to_downloadzone
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
fast_debug = False
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
def 生成函数注释(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt):
def Program_Comment_Gen(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt):
import time, os
print('begin analysis on:', file_manifest)
logger.info('begin analysis on:', file_manifest)
for index, fp in enumerate(file_manifest):
with open(fp, 'r', encoding='utf-8', errors='replace') as f:
file_content = f.read()
i_say = f'请对下面的程序文件做一个概述并对文件中的所有函数生成注释使用markdown表格输出结果文件名是{os.path.relpath(fp, project_folder)},文件内容是 ```{file_content}```'
i_say_show_user = f'[{index}/{len(file_manifest)}] 请对下面的程序文件做一个概述,并对文件中的所有函数生成注释: {os.path.abspath(fp)}'
i_say_show_user = f'[{index+1}/{len(file_manifest)}] 请对下面的程序文件做一个概述,并对文件中的所有函数生成注释: {os.path.abspath(fp)}'
chatbot.append((i_say_show_user, "[Local Message] waiting gpt response."))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
if not fast_debug:
msg = '正常'
# ** gpt request **
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
i_say, i_say_show_user, llm_kwargs, chatbot, history=[], sys_prompt=system_prompt) # 带超时倒计时
msg = '正常'
# ** gpt request **
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
i_say, i_say_show_user, llm_kwargs, chatbot, history=[], sys_prompt=system_prompt) # 带超时倒计时
chatbot[-1] = (i_say_show_user, gpt_say)
history.append(i_say_show_user); history.append(gpt_say)
yield from update_ui(chatbot=chatbot, history=history, msg=msg) # 刷新界面
if not fast_debug: time.sleep(2)
if not fast_debug:
res = write_history_to_file(history)
promote_file_to_downloadzone(res, chatbot=chatbot)
chatbot.append(("完成了吗?", res))
chatbot[-1] = (i_say_show_user, gpt_say)
history.append(i_say_show_user); history.append(gpt_say)
yield from update_ui(chatbot=chatbot, history=history, msg=msg) # 刷新界面
time.sleep(2)
res = write_history_to_file(history)
promote_file_to_downloadzone(res, chatbot=chatbot)
chatbot.append(("完成了吗?", res))
yield from update_ui(chatbot=chatbot, history=history, msg=msg) # 刷新界面
@CatchException
def 批量生成函数注释(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
def 批量Program_Comment_Gen(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
history = [] # 清空历史,以免输入溢出
import glob, os
if os.path.exists(txt):
@@ -53,4 +51,4 @@ def 批量生成函数注释(txt, llm_kwargs, plugin_kwargs, chatbot, history, s
report_exception(chatbot, history, a = f"解析项目: {txt}", b = f"找不到任何.tex文件: {txt}")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
yield from 生成函数注释(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt)
yield from Program_Comment_Gen(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt)

View File

@@ -0,0 +1,153 @@
import os,glob
from typing import List
from shared_utils.fastapi_server import validate_path_safety
from toolbox import report_exception
from toolbox import CatchException, update_ui, get_conf, get_log_folder, update_ui_latest_msg
from shared_utils.fastapi_server import validate_path_safety
from crazy_functions.crazy_utils import input_clipping
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
RAG_WORKER_REGISTER = {}
MAX_HISTORY_ROUND = 5
MAX_CONTEXT_TOKEN_LIMIT = 4096
REMEMBER_PREVIEW = 1000
@CatchException
def handle_document_upload(files: List[str], llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request, rag_worker):
"""
Handles document uploads by extracting text and adding it to the vector store.
"""
from llama_index.core import Document
from crazy_functions.rag_fns.rag_file_support import extract_text, supports_format
user_name = chatbot.get_user()
checkpoint_dir = get_log_folder(user_name, plugin_name='experimental_rag')
for file_path in files:
try:
validate_path_safety(file_path, user_name)
text = extract_text(file_path)
if text is None:
chatbot.append(
[f"上传文件: {os.path.basename(file_path)}", f"文件解析失败无法提取文本内容请更换文件。失败原因可能为1.文档格式过于复杂2. 不支持的文件格式,支持的文件格式后缀有:" + ", ".join(supports_format)])
else:
chatbot.append(
[f"上传文件: {os.path.basename(file_path)}", f"上传文件前50个字符为:{text[:50]}"])
document = Document(text=text, metadata={"source": file_path})
rag_worker.add_documents_to_vector_store([document])
chatbot.append([f"上传文件: {os.path.basename(file_path)}", "文件已成功添加到知识库。"])
except Exception as e:
report_exception(chatbot, history, a=f"处理文件: {file_path}", b=str(e))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# Main Q&A function with document upload support
@CatchException
def Rag问答(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
# import vector store lib
VECTOR_STORE_TYPE = "Milvus"
if VECTOR_STORE_TYPE == "Milvus":
try:
from crazy_functions.rag_fns.milvus_worker import MilvusRagWorker as LlamaIndexRagWorker
except:
VECTOR_STORE_TYPE = "Simple"
if VECTOR_STORE_TYPE == "Simple":
from crazy_functions.rag_fns.llama_index_worker import LlamaIndexRagWorker
# 1. we retrieve rag worker from global context
user_name = chatbot.get_user()
checkpoint_dir = get_log_folder(user_name, plugin_name='experimental_rag')
if user_name in RAG_WORKER_REGISTER:
rag_worker = RAG_WORKER_REGISTER[user_name]
else:
rag_worker = RAG_WORKER_REGISTER[user_name] = LlamaIndexRagWorker(
user_name,
llm_kwargs,
checkpoint_dir=checkpoint_dir,
auto_load_checkpoint=True
)
current_context = f"{VECTOR_STORE_TYPE} @ {checkpoint_dir}"
tip = "提示输入“清空向量数据库”可以清空RAG向量数据库"
# 2. Handle special commands
if os.path.exists(txt) and os.path.isdir(txt):
project_folder = txt
validate_path_safety(project_folder, chatbot.get_user())
# Extract file paths from the user input
# Assuming the user inputs file paths separated by commas after the command
file_paths = [f for f in glob.glob(f'{project_folder}/**/*', recursive=True)]
chatbot.append([txt, f'正在处理上传的文档 ({current_context}) ...'])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
yield from handle_document_upload(file_paths, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request, rag_worker)
return
elif txt == "清空向量数据库":
chatbot.append([txt, f'正在清空 ({current_context}) ...'])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
rag_worker.purge_vector_store()
yield from update_ui_latest_msg('已清空', chatbot, history, delay=0) # 刷新界面
return
# 3. Normal Q&A processing
chatbot.append([txt, f'正在召回知识 ({current_context}) ...'])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# 4. Clip history to reduce token consumption
txt_origin = txt
if len(history) > MAX_HISTORY_ROUND * 2:
history = history[-(MAX_HISTORY_ROUND * 2):]
txt_clip, history, flags = input_clipping(txt, history, max_token_limit=MAX_CONTEXT_TOKEN_LIMIT, return_clip_flags=True)
input_is_clipped_flag = (flags["original_input_len"] != flags["clipped_input_len"])
# 5. If input is clipped, add input to vector store before retrieve
if input_is_clipped_flag:
yield from update_ui_latest_msg('检测到长输入, 正在向量化 ...', chatbot, history, delay=0) # 刷新界面
# Save input to vector store
rag_worker.add_text_to_vector_store(txt_origin)
yield from update_ui_latest_msg('向量化完成 ...', chatbot, history, delay=0) # 刷新界面
if len(txt_origin) > REMEMBER_PREVIEW:
HALF = REMEMBER_PREVIEW // 2
i_say_to_remember = txt[:HALF] + f" ...\n...(省略{len(txt_origin)-REMEMBER_PREVIEW}字)...\n... " + txt[-HALF:]
if (flags["original_input_len"] - flags["clipped_input_len"]) > HALF:
txt_clip = txt_clip + f" ...\n...(省略{len(txt_origin)-len(txt_clip)-HALF}字)...\n... " + txt[-HALF:]
else:
i_say_to_remember = i_say = txt_clip
else:
i_say_to_remember = i_say = txt_clip
# 6. Search vector store and build prompts
nodes = rag_worker.retrieve_from_store_with_query(i_say)
prompt = rag_worker.build_prompt(query=i_say, nodes=nodes)
# 7. Query language model
if len(chatbot) != 0:
chatbot.pop(-1) # Pop temp chat, because we are going to add them again inside `request_gpt_model_in_new_thread_with_ui_alive`
model_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=prompt,
inputs_show_user=i_say,
llm_kwargs=llm_kwargs,
chatbot=chatbot,
history=history,
sys_prompt=system_prompt,
retry_times_at_unknown_error=0
)
# 8. Remember Q&A
yield from update_ui_latest_msg(
model_say + '</br></br>' + f'对话记忆中, 请稍等 ({current_context}) ...',
chatbot, history, delay=0.5
)
rag_worker.remember_qa(i_say_to_remember, model_say)
history.extend([i_say, model_say])
# 9. Final UI Update
yield from update_ui_latest_msg(model_say, chatbot, history, delay=0, msg=tip)

View File

@@ -0,0 +1,167 @@
import pickle, os, random
from toolbox import CatchException, update_ui, get_conf, get_log_folder, update_ui_latest_msg
from crazy_functions.crazy_utils import input_clipping
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from request_llms.bridge_all import predict_no_ui_long_connection
from crazy_functions.json_fns.select_tool import structure_output, select_tool
from pydantic import BaseModel, Field
from loguru import logger
from typing import List
SOCIAL_NETWORK_WORKER_REGISTER = {}
class SocialNetwork():
def __init__(self):
self.people = []
class SaveAndLoad():
def __init__(self, user_name, llm_kwargs, auto_load_checkpoint=True, checkpoint_dir=None) -> None:
self.user_name = user_name
self.checkpoint_dir = checkpoint_dir
if auto_load_checkpoint:
self.social_network = self.load_from_checkpoint(checkpoint_dir)
else:
self.social_network = SocialNetwork()
def does_checkpoint_exist(self, checkpoint_dir=None):
import os, glob
if checkpoint_dir is None: checkpoint_dir = self.checkpoint_dir
if not os.path.exists(checkpoint_dir): return False
if len(glob.glob(os.path.join(checkpoint_dir, "social_network.pkl"))) == 0: return False
return True
def save_to_checkpoint(self, checkpoint_dir=None):
if checkpoint_dir is None: checkpoint_dir = self.checkpoint_dir
with open(os.path.join(checkpoint_dir, 'social_network.pkl'), "wb+") as f:
pickle.dump(self.social_network, f)
return
def load_from_checkpoint(self, checkpoint_dir=None):
if checkpoint_dir is None: checkpoint_dir = self.checkpoint_dir
if self.does_checkpoint_exist(checkpoint_dir=checkpoint_dir):
with open(os.path.join(checkpoint_dir, 'social_network.pkl'), "rb") as f:
social_network = pickle.load(f)
return social_network
else:
return SocialNetwork()
class Friend(BaseModel):
friend_name: str = Field(description="name of a friend")
friend_description: str = Field(description="description of a friend (everything about this friend)")
friend_relationship: str = Field(description="The relationship with a friend (e.g. friend, family, colleague)")
class FriendList(BaseModel):
friends_list: List[Friend] = Field(description="The list of friends")
class SocialNetworkWorker(SaveAndLoad):
def ai_socail_advice(self, prompt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, run_gpt_fn, intention_type):
pass
def ai_remove_friend(self, prompt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, run_gpt_fn, intention_type):
pass
def ai_list_friends(self, prompt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, run_gpt_fn, intention_type):
pass
def ai_add_multi_friends(self, prompt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, run_gpt_fn, intention_type):
friend, err_msg = structure_output(
txt=prompt,
prompt="根据提示, 解析多个联系人的身份信息\n\n",
err_msg=f"不能理解该联系人",
run_gpt_fn=run_gpt_fn,
pydantic_cls=FriendList
)
if friend.friends_list:
for f in friend.friends_list:
self.add_friend(f)
msg = f"成功添加{len(friend.friends_list)}个联系人: {str(friend.friends_list)}"
yield from update_ui_latest_msg(lastmsg=msg, chatbot=chatbot, history=history, delay=0)
def run(self, txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
prompt = txt
run_gpt_fn = lambda inputs, sys_prompt: predict_no_ui_long_connection(inputs=inputs, llm_kwargs=llm_kwargs, history=[], sys_prompt=sys_prompt, observe_window=[])
self.tools_to_select = {
"SocialAdvice":{
"explain_to_llm": "如果用户希望获取社交指导调用SocialAdvice生成一些社交建议",
"callback": self.ai_socail_advice,
},
"AddFriends":{
"explain_to_llm": "如果用户给出了联系人调用AddMultiFriends把联系人添加到数据库",
"callback": self.ai_add_multi_friends,
},
"RemoveFriend":{
"explain_to_llm": "如果用户希望移除某个联系人调用RemoveFriend",
"callback": self.ai_remove_friend,
},
"ListFriends":{
"explain_to_llm": "如果用户列举联系人调用ListFriends",
"callback": self.ai_list_friends,
}
}
try:
Explanation = '\n'.join([f'{k}: {v["explain_to_llm"]}' for k, v in self.tools_to_select.items()])
class UserSociaIntention(BaseModel):
intention_type: str = Field(
description=
f"The type of user intention. You must choose from {self.tools_to_select.keys()}.\n\n"
f"Explanation:\n{Explanation}",
default="SocialAdvice"
)
pydantic_cls_instance, err_msg = select_tool(
prompt=txt,
run_gpt_fn=run_gpt_fn,
pydantic_cls=UserSociaIntention
)
except Exception as e:
yield from update_ui_latest_msg(
lastmsg=f"无法理解用户意图 {err_msg}",
chatbot=chatbot,
history=history,
delay=0
)
return
intention_type = pydantic_cls_instance.intention_type
intention_callback = self.tools_to_select[pydantic_cls_instance.intention_type]['callback']
yield from intention_callback(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, run_gpt_fn, intention_type)
def add_friend(self, friend):
# check whether the friend is already in the social network
for f in self.social_network.people:
if f.friend_name == friend.friend_name:
f.friend_description = friend.friend_description
f.friend_relationship = friend.friend_relationship
logger.info(f"Repeated friend, update info: {friend}")
return
logger.info(f"Add a new friend: {friend}")
self.social_network.people.append(friend)
return
@CatchException
def I人助手(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
# 1. we retrieve worker from global context
user_name = chatbot.get_user()
checkpoint_dir=get_log_folder(user_name, plugin_name='experimental_rag')
if user_name in SOCIAL_NETWORK_WORKER_REGISTER:
social_network_worker = SOCIAL_NETWORK_WORKER_REGISTER[user_name]
else:
social_network_worker = SOCIAL_NETWORK_WORKER_REGISTER[user_name] = SocialNetworkWorker(
user_name,
llm_kwargs,
checkpoint_dir=checkpoint_dir,
auto_load_checkpoint=True
)
# 2. save
yield from social_network_worker.run(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request)
social_network_worker.save_to_checkpoint(checkpoint_dir)
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面

View File

@@ -1,12 +1,12 @@
from toolbox import update_ui, promote_file_to_downloadzone, disable_auto_promotion
from toolbox import update_ui, promote_file_to_downloadzone
from toolbox import CatchException, report_exception, write_history_to_file
from .crazy_utils import input_clipping
from shared_utils.fastapi_server import validate_path_safety
from crazy_functions.crazy_utils import input_clipping
def 解析源代码新(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt):
import os, copy
from .crazy_utils import request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
disable_auto_promotion(chatbot=chatbot)
from crazy_functions.crazy_utils import request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
summary_batch_isolation = True
inputs_array = []
@@ -23,7 +23,7 @@ def 解析源代码新(file_manifest, project_folder, llm_kwargs, plugin_kwargs,
file_content = f.read()
prefix = "接下来请你逐文件分析下面的工程" if index==0 else ""
i_say = prefix + f'请对下面的程序文件做一个概述文件名是{os.path.relpath(fp, project_folder)},文件代码是 ```{file_content}```'
i_say_show_user = prefix + f'[{index}/{len(file_manifest)}] 请对下面的程序文件做一个概述: {fp}'
i_say_show_user = prefix + f'[{index+1}/{len(file_manifest)}] 请对下面的程序文件做一个概述: {fp}'
# 装载请求内容
inputs_array.append(i_say)
inputs_show_user_array.append(i_say_show_user)
@@ -82,13 +82,13 @@ def 解析源代码新(file_manifest, project_folder, llm_kwargs, plugin_kwargs,
inputs=inputs, inputs_show_user=inputs_show_user, llm_kwargs=llm_kwargs, chatbot=chatbot,
history=this_iteration_history_feed, # 迭代之前的分析
sys_prompt="你是一个程序架构分析师,正在分析一个项目的源代码。" + sys_prompt_additional)
diagram_code = make_diagram(this_iteration_files, result, this_iteration_history_feed)
summary = "请用一句话概括这些文件的整体功能。\n\n" + diagram_code
summary_result = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=summary,
inputs_show_user=summary,
llm_kwargs=llm_kwargs,
inputs=summary,
inputs_show_user=summary,
llm_kwargs=llm_kwargs,
chatbot=chatbot,
history=[i_say, result], # 迭代之前的分析
sys_prompt="你是一个程序架构分析师,正在分析一个项目的源代码。" + sys_prompt_additional)
@@ -128,6 +128,7 @@ def 解析一个Python项目(txt, llm_kwargs, plugin_kwargs, chatbot, history, s
import glob, os
if os.path.exists(txt):
project_folder = txt
validate_path_safety(project_folder, chatbot.get_user())
else:
if txt == "": txt = '空空如也的输入栏'
report_exception(chatbot, history, a = f"解析项目: {txt}", b = f"找不到本地项目或无权访问: {txt}")
@@ -146,6 +147,7 @@ def 解析一个Matlab项目(txt, llm_kwargs, plugin_kwargs, chatbot, history, s
import glob, os
if os.path.exists(txt):
project_folder = txt
validate_path_safety(project_folder, chatbot.get_user())
else:
if txt == "": txt = '空空如也的输入栏'
report_exception(chatbot, history, a = f"解析Matlab项目: {txt}", b = f"找不到本地项目或无权访问: {txt}")
@@ -164,6 +166,7 @@ def 解析一个C项目的头文件(txt, llm_kwargs, plugin_kwargs, chatbot, his
import glob, os
if os.path.exists(txt):
project_folder = txt
validate_path_safety(project_folder, chatbot.get_user())
else:
if txt == "": txt = '空空如也的输入栏'
report_exception(chatbot, history, a = f"解析项目: {txt}", b = f"找不到本地项目或无权访问: {txt}")
@@ -184,6 +187,7 @@ def 解析一个C项目(txt, llm_kwargs, plugin_kwargs, chatbot, history, system
import glob, os
if os.path.exists(txt):
project_folder = txt
validate_path_safety(project_folder, chatbot.get_user())
else:
if txt == "": txt = '空空如也的输入栏'
report_exception(chatbot, history, a = f"解析项目: {txt}", b = f"找不到本地项目或无权访问: {txt}")
@@ -206,6 +210,7 @@ def 解析一个Java项目(txt, llm_kwargs, plugin_kwargs, chatbot, history, sys
import glob, os
if os.path.exists(txt):
project_folder = txt
validate_path_safety(project_folder, chatbot.get_user())
else:
if txt == "": txt = '空空如也的输入栏'
report_exception(chatbot, history, a=f"解析项目: {txt}", b=f"找不到本地项目或无权访问: {txt}")
@@ -228,6 +233,7 @@ def 解析一个前端项目(txt, llm_kwargs, plugin_kwargs, chatbot, history, s
import glob, os
if os.path.exists(txt):
project_folder = txt
validate_path_safety(project_folder, chatbot.get_user())
else:
if txt == "": txt = '空空如也的输入栏'
report_exception(chatbot, history, a=f"解析项目: {txt}", b=f"找不到本地项目或无权访问: {txt}")
@@ -257,6 +263,7 @@ def 解析一个Golang项目(txt, llm_kwargs, plugin_kwargs, chatbot, history, s
import glob, os
if os.path.exists(txt):
project_folder = txt
validate_path_safety(project_folder, chatbot.get_user())
else:
if txt == "": txt = '空空如也的输入栏'
report_exception(chatbot, history, a=f"解析项目: {txt}", b=f"找不到本地项目或无权访问: {txt}")
@@ -278,6 +285,7 @@ def 解析一个Rust项目(txt, llm_kwargs, plugin_kwargs, chatbot, history, sys
import glob, os
if os.path.exists(txt):
project_folder = txt
validate_path_safety(project_folder, chatbot.get_user())
else:
if txt == "": txt = '空空如也的输入栏'
report_exception(chatbot, history, a=f"解析项目: {txt}", b=f"找不到本地项目或无权访问: {txt}")
@@ -298,6 +306,7 @@ def 解析一个Lua项目(txt, llm_kwargs, plugin_kwargs, chatbot, history, syst
import glob, os
if os.path.exists(txt):
project_folder = txt
validate_path_safety(project_folder, chatbot.get_user())
else:
if txt == "": txt = '空空如也的输入栏'
report_exception(chatbot, history, a = f"解析项目: {txt}", b = f"找不到本地项目或无权访问: {txt}")
@@ -320,6 +329,7 @@ def 解析一个CSharp项目(txt, llm_kwargs, plugin_kwargs, chatbot, history, s
import glob, os
if os.path.exists(txt):
project_folder = txt
validate_path_safety(project_folder, chatbot.get_user())
else:
if txt == "": txt = '空空如也的输入栏'
report_exception(chatbot, history, a = f"解析项目: {txt}", b = f"找不到本地项目或无权访问: {txt}")
@@ -345,15 +355,19 @@ def 解析任意code项目(txt, llm_kwargs, plugin_kwargs, chatbot, history, sys
pattern_except_suffix = [_.lstrip(" ^*.,").rstrip(" ,") for _ in txt_pattern.split(" ") if _ != "" and _.strip().startswith("^*.")]
pattern_except_suffix += ['zip', 'rar', '7z', 'tar', 'gz'] # 避免解析压缩文件
# 将要忽略匹配的文件名(例如: ^README.md)
pattern_except_name = [_.lstrip(" ^*,").rstrip(" ,").replace(".", "\.") for _ in txt_pattern.split(" ") if _ != "" and _.strip().startswith("^") and not _.strip().startswith("^*.")]
pattern_except_name = [_.lstrip(" ^*,").rstrip(" ,").replace(".", r"\.") # 移除左边通配符,移除右侧逗号,转义点号
for _ in txt_pattern.split(" ") # 以空格分割
if (_ != "" and _.strip().startswith("^") and not _.strip().startswith("^*.")) # ^开始,但不是^*.开始
]
# 生成正则表达式
pattern_except = '/[^/]+\.(' + "|".join(pattern_except_suffix) + ')$'
pattern_except = r'/[^/]+\.(' + "|".join(pattern_except_suffix) + ')$'
pattern_except += '|/(' + "|".join(pattern_except_name) + ')$' if pattern_except_name != [] else ''
history.clear()
import glob, os, re
if os.path.exists(txt):
project_folder = txt
validate_path_safety(project_folder, chatbot.get_user())
else:
if txt == "": txt = '空空如也的输入栏'
report_exception(chatbot, history, a = f"解析项目: {txt}", b = f"找不到本地项目或无权访问: {txt}")

View File

@@ -12,6 +12,12 @@ class PaperFileGroup():
self.sp_file_index = []
self.sp_file_tag = []
# count_token
from request_llms.bridge_all import model_info
enc = model_info["gpt-3.5-turbo"]['tokenizer']
def get_token_num(txt): return len(enc.encode(txt, disallowed_special=()))
self.get_token_num = get_token_num
def run_file_split(self, max_token_limit=1900):
"""
将长文本分离开来
@@ -54,11 +60,11 @@ def parseNotebook(filename, enable_markdown=1):
Code += f"This is {idx+1}th code block: \n"
Code += code+"\n"
return Code
return Code
def ipynb解释(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt):
from .crazy_utils import request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency
from crazy_functions.crazy_utils import request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency
if ("advanced_arg" in plugin_kwargs) and (plugin_kwargs["advanced_arg"] == ""): plugin_kwargs.pop("advanced_arg")
enable_markdown = plugin_kwargs.get("advanced_arg", "1")

View File

@@ -0,0 +1,162 @@
import os, copy, time
from toolbox import CatchException, report_exception, update_ui, zip_result, promote_file_to_downloadzone, update_ui_latest_msg, get_conf, generate_file_link
from shared_utils.fastapi_server import validate_path_safety
from crazy_functions.crazy_utils import input_clipping
from crazy_functions.crazy_utils import request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from crazy_functions.agent_fns.python_comment_agent import PythonCodeComment
from crazy_functions.diagram_fns.file_tree import FileNode
from crazy_functions.agent_fns.watchdog import WatchDog
from shared_utils.advanced_markdown_format import markdown_convertion_for_file
from loguru import logger
def 注释源代码(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt):
summary_batch_isolation = True
inputs_array = []
inputs_show_user_array = []
history_array = []
sys_prompt_array = []
assert len(file_manifest) <= 512, "源文件太多超过512个, 请缩减输入文件的数量。或者您也可以选择删除此行警告并修改代码拆分file_manifest列表从而实现分批次处理。"
# 建立文件树
file_tree_struct = FileNode("root", build_manifest=True)
for file_path in file_manifest:
file_tree_struct.add_file(file_path, file_path)
# <第一步,逐个文件分析,多线程>
lang = "" if not plugin_kwargs["use_chinese"] else " (you must use Chinese)"
for index, fp in enumerate(file_manifest):
# 读取文件
with open(fp, 'r', encoding='utf-8', errors='replace') as f:
file_content = f.read()
prefix = ""
i_say = prefix + f'Please conclude the following source code at {os.path.relpath(fp, project_folder)} with only one sentence{lang}, the code is:\n```{file_content}```'
i_say_show_user = prefix + f'[{index+1}/{len(file_manifest)}] 请用一句话对下面的程序文件做一个整体概述: {fp}'
# 装载请求内容
MAX_TOKEN_SINGLE_FILE = 2560
i_say, _ = input_clipping(inputs=i_say, history=[], max_token_limit=MAX_TOKEN_SINGLE_FILE)
inputs_array.append(i_say)
inputs_show_user_array.append(i_say_show_user)
history_array.append([])
sys_prompt_array.append(f"You are a software architecture analyst analyzing a source code project. Do not dig into details, tell me what the code is doing in general. Your answer must be short, simple and clear{lang}.")
# 文件读取完成,对每一个源代码文件,生成一个请求线程,发送到大模型进行分析
gpt_response_collection = yield from request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency(
inputs_array = inputs_array,
inputs_show_user_array = inputs_show_user_array,
history_array = history_array,
sys_prompt_array = sys_prompt_array,
llm_kwargs = llm_kwargs,
chatbot = chatbot,
show_user_at_complete = True
)
# <第二步,逐个文件分析,生成带注释文件>
tasks = ["" for _ in range(len(file_manifest))]
def bark_fn(tasks):
for i in range(len(tasks)): tasks[i] = "watchdog is dead"
wd = WatchDog(timeout=10, bark_fn=lambda: bark_fn(tasks), interval=3, msg="ThreadWatcher timeout")
wd.begin_watch()
from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(max_workers=get_conf('DEFAULT_WORKER_NUM'))
def _task_multi_threading(i_say, gpt_say, fp, file_tree_struct, index):
language = 'Chinese' if plugin_kwargs["use_chinese"] else 'English'
def observe_window_update(x):
if tasks[index] == "watchdog is dead":
raise TimeoutError("ThreadWatcher: watchdog is dead")
tasks[index] = x
pcc = PythonCodeComment(llm_kwargs, plugin_kwargs, language=language, observe_window_update=observe_window_update)
pcc.read_file(path=fp, brief=gpt_say)
revised_path, revised_content = pcc.begin_comment_source_code(None, None)
file_tree_struct.manifest[fp].revised_path = revised_path
file_tree_struct.manifest[fp].revised_content = revised_content
# <将结果写回源文件>
with open(fp, 'w', encoding='utf-8') as f:
f.write(file_tree_struct.manifest[fp].revised_content)
# <生成对比html>
with open("crazy_functions/agent_fns/python_comment_compare.html", 'r', encoding='utf-8') as f:
html_template = f.read()
warp = lambda x: "```python\n\n" + x + "\n\n```"
from themes.theme import load_dynamic_theme
_, advanced_css, _, _ = load_dynamic_theme("Default")
html_template = html_template.replace("ADVANCED_CSS", advanced_css)
html_template = html_template.replace("REPLACE_CODE_FILE_LEFT", pcc.get_markdown_block_in_html(markdown_convertion_for_file(warp(pcc.original_content))))
html_template = html_template.replace("REPLACE_CODE_FILE_RIGHT", pcc.get_markdown_block_in_html(markdown_convertion_for_file(warp(revised_content))))
compare_html_path = fp + '.compare.html'
file_tree_struct.manifest[fp].compare_html = compare_html_path
with open(compare_html_path, 'w', encoding='utf-8') as f:
f.write(html_template)
tasks[index] = ""
chatbot.append([None, f"正在处理:"])
futures = []
index = 0
for i_say, gpt_say, fp in zip(gpt_response_collection[0::2], gpt_response_collection[1::2], file_manifest):
future = executor.submit(_task_multi_threading, i_say, gpt_say, fp, file_tree_struct, index)
index += 1
futures.append(future)
# <第三步,等待任务完成>
cnt = 0
while True:
cnt += 1
wd.feed()
time.sleep(3)
worker_done = [h.done() for h in futures]
remain = len(worker_done) - sum(worker_done)
# <展示已经完成的部分>
preview_html_list = []
for done, fp in zip(worker_done, file_manifest):
if not done: continue
if hasattr(file_tree_struct.manifest[fp], 'compare_html'):
preview_html_list.append(file_tree_struct.manifest[fp].compare_html)
else:
logger.error(f"文件: {fp} 的注释结果未能成功")
file_links = generate_file_link(preview_html_list)
yield from update_ui_latest_msg(
f"当前任务: <br/>{'<br/>'.join(tasks)}.<br/>" +
f"剩余源文件数量: {remain}.<br/>" +
f"已完成的文件: {sum(worker_done)}.<br/>" +
file_links +
"<br/>" +
''.join(['.']*(cnt % 10 + 1)
), chatbot=chatbot, history=history, delay=0)
yield from update_ui(chatbot=chatbot, history=[]) # 刷新界面
if all(worker_done):
executor.shutdown()
break
# <第四步,压缩结果>
zip_res = zip_result(project_folder)
promote_file_to_downloadzone(file=zip_res, chatbot=chatbot)
# <END>
chatbot.append((None, "所有源文件均已处理完毕。"))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
@CatchException
def 注释Python项目(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
history = [] # 清空历史,以免输入溢出
plugin_kwargs["use_chinese"] = plugin_kwargs.get("use_chinese", False)
import glob, os
if os.path.exists(txt):
project_folder = txt
validate_path_safety(project_folder, chatbot.get_user())
else:
if txt == "": txt = '空空如也的输入栏'
report_exception(chatbot, history, a = f"解析项目: {txt}", b = f"找不到本地项目或无权访问: {txt}")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
file_manifest = [f for f in glob.glob(f'{project_folder}/**/*.py', recursive=True)]
if len(file_manifest) == 0:
report_exception(chatbot, history, a = f"解析项目: {txt}", b = f"找不到任何python文件: {txt}")
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
yield from 注释源代码(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt)

View File

@@ -0,0 +1,36 @@
from toolbox import get_conf, update_ui
from crazy_functions.plugin_template.plugin_class_template import GptAcademicPluginTemplate, ArgProperty
from crazy_functions.SourceCode_Comment import 注释Python项目
class SourceCodeComment_Wrap(GptAcademicPluginTemplate):
def __init__(self):
"""
请注意`execute`会执行在不同的线程中,因此您在定义和使用类变量时,应当慎之又慎!
"""
pass
def define_arg_selection_menu(self):
"""
定义插件的二级选项菜单
"""
gui_definition = {
"main_input":
ArgProperty(title="路径", description="程序路径(上传文件后自动填写)", default_value="", type="string").model_dump_json(), # 主输入,自动从输入框同步
"use_chinese":
ArgProperty(title="注释语言", options=["英文", "中文"], default_value="英文", description="", type="dropdown").model_dump_json(),
# "use_emoji":
# ArgProperty(title="在注释中使用emoji", options=["禁止", "允许"], default_value="禁止", description="无", type="dropdown").model_dump_json(),
}
return gui_definition
def execute(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
"""
执行插件
"""
if plugin_kwargs["use_chinese"] == "中文":
plugin_kwargs["use_chinese"] = True
else:
plugin_kwargs["use_chinese"] = False
yield from 注释Python项目(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request)

View File

@@ -1,6 +1,6 @@
from toolbox import CatchException, update_ui, ProxyNetworkActivate, update_ui_lastest_msg, get_log_folder, get_user
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive, get_files_from_everything
from toolbox import CatchException, update_ui, ProxyNetworkActivate, update_ui_latest_msg, get_log_folder, get_user
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive, get_files_from_everything
from loguru import logger
install_msg ="""
1. python -m pip install torch --index-url https://download.pytorch.org/whl/cpu
@@ -9,7 +9,7 @@ install_msg ="""
3. python -m pip install unstructured[all-docs] --upgrade
4. python -c 'import nltk; nltk.download("punkt")'
4. python -c 'import nltk; nltk.download("punkt")'
"""
@CatchException
@@ -40,9 +40,9 @@ def 知识库文件注入(txt, llm_kwargs, plugin_kwargs, chatbot, history, syst
except Exception as e:
chatbot.append(["依赖不足", f"{str(e)}\n\n导入依赖失败。请用以下命令安装" + install_msg])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# from .crazy_utils import try_install_deps
# from crazy_functions.crazy_utils import try_install_deps
# try_install_deps(['zh_langchain==0.2.1', 'pypinyin'], reload_m=['pypinyin', 'zh_langchain'])
# yield from update_ui_lastest_msg("安装完成,您可以再次重试。", chatbot, history)
# yield from update_ui_latest_msg("安装完成,您可以再次重试。", chatbot, history)
return
# < --------------------读取文件--------------- >
@@ -56,11 +56,11 @@ def 知识库文件注入(txt, llm_kwargs, plugin_kwargs, chatbot, history, syst
chatbot.append(["没有找到任何可读取文件", "当前支持的格式包括: txt, md, docx, pptx, pdf, json等"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
return
# < -------------------预热文本向量化模组--------------- >
chatbot.append(['<br/>'.join(file_manifest), "正在预热文本向量化模组, 如果是第一次运行, 将消耗较长时间下载中文向量化模型..."])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
print('Checking Text2vec ...')
logger.info('Checking Text2vec ...')
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
with ProxyNetworkActivate('Download_LLM'): # 临时地激活代理网络
HuggingFaceEmbeddings(model_name="GanymedeNil/text2vec-large-chinese")
@@ -68,7 +68,7 @@ def 知识库文件注入(txt, llm_kwargs, plugin_kwargs, chatbot, history, syst
# < -------------------构建知识库--------------- >
chatbot.append(['<br/>'.join(file_manifest), "正在构建知识库..."])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
print('Establishing knowledge archive ...')
logger.info('Establishing knowledge archive ...')
with ProxyNetworkActivate('Download_LLM'): # 临时地激活代理网络
kai = knowledge_archive_interface()
vs_path = get_log_folder(user=get_user(chatbot), plugin_name='vec_store')
@@ -79,8 +79,8 @@ def 知识库文件注入(txt, llm_kwargs, plugin_kwargs, chatbot, history, syst
# yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# chatbot._cookies['langchain_plugin_embedding'] = kai.get_current_archive_id()
# chatbot._cookies['lock_plugin'] = 'crazy_functions.知识库文件注入->读取知识库作答'
# chatbot.append(['完成', "“根据知识库作答”函数插件已经接管问答系统, 提问吧! 但注意, 您接下来不能再使用其他插件了,刷新页面即可以退出知识库问答模式。"])
chatbot.append(['构建完成', f"当前知识库内的有效文件:\n\n---\n\n{kai_files}\n\n---\n\n请切换至“知识库问答”插件进行知识库访问, 或者使用此插件继续上传更多文件。"])
# chatbot.append(['完成', "“根据知识库作答”函数插件已经接管问答系统, 提问吧! 但注意, 您接下来不能再使用其他插件了,刷新页面即可以退出Vectorstore_QA模式。"])
chatbot.append(['构建完成', f"当前知识库内的有效文件:\n\n---\n\n{kai_files}\n\n---\n\n请切换至“Vectorstore_QA”插件进行知识库访问, 或者使用此插件继续上传更多文件。"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 由于请求gpt需要一段时间我们先及时地做一次界面更新
@CatchException
@@ -93,9 +93,9 @@ def 读取知识库作答(txt, llm_kwargs, plugin_kwargs, chatbot, history, syst
except Exception as e:
chatbot.append(["依赖不足", f"{str(e)}\n\n导入依赖失败。请用以下命令安装" + install_msg])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# from .crazy_utils import try_install_deps
# from crazy_functions.crazy_utils import try_install_deps
# try_install_deps(['zh_langchain==0.2.1', 'pypinyin'], reload_m=['pypinyin', 'zh_langchain'])
# yield from update_ui_lastest_msg("安装完成,您可以再次重试。", chatbot, history)
# yield from update_ui_latest_msg("安装完成,您可以再次重试。", chatbot, history)
return
# < ------------------- --------------- >
@@ -109,8 +109,8 @@ def 读取知识库作答(txt, llm_kwargs, plugin_kwargs, chatbot, history, syst
chatbot.append((txt, f'[知识库 {kai_id}] ' + prompt))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面 # 由于请求gpt需要一段时间我们先及时地做一次界面更新
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=prompt, inputs_show_user=txt,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=[],
inputs=prompt, inputs_show_user=txt,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=[],
sys_prompt=system_prompt
)
history.extend((prompt, gpt_say))

View File

@@ -0,0 +1,204 @@
import requests
import random
import time
import re
import json
from bs4 import BeautifulSoup
from functools import lru_cache
from itertools import zip_longest
from check_proxy import check_proxy
from toolbox import CatchException, update_ui, get_conf, promote_file_to_downloadzone, update_ui_latest_msg, generate_file_link
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive, input_clipping
from request_llms.bridge_all import model_info
from request_llms.bridge_all import predict_no_ui_long_connection
from crazy_functions.prompts.internet import SearchOptimizerPrompt, SearchAcademicOptimizerPrompt
from crazy_functions.json_fns.pydantic_io import GptJsonIO, JsonStringError
from textwrap import dedent
from loguru import logger
from pydantic import BaseModel, Field
class Query(BaseModel):
search_keyword: str = Field(description="search query for video resource")
class VideoResource(BaseModel):
thought: str = Field(description="analysis of the search results based on the user's query")
title: str = Field(description="title of the video")
author: str = Field(description="author/uploader of the video")
bvid: str = Field(description="unique ID of the video")
another_failsafe_bvid: str = Field(description="provide another bvid, the other one is not working")
def get_video_resource(search_keyword):
from crazy_functions.media_fns.get_media import search_videos
# Search for videos and return the first result
videos = search_videos(
search_keyword
)
# Return the first video if results exist, otherwise return None
return videos
def download_video(bvid, user_name, chatbot, history):
# from experimental_mods.get_bilibili_resource import download_bilibili
from crazy_functions.media_fns.get_media import download_video
# pause a while
tic_time = 8
for i in range(tic_time):
yield from update_ui_latest_msg(
lastmsg=f"即将下载音频。等待{tic_time-i}秒后自动继续, 点击“停止”键取消此操作。",
chatbot=chatbot, history=[], delay=1)
# download audio
chatbot.append((None, "下载音频, 请稍等...")); yield from update_ui(chatbot=chatbot, history=history)
downloaded_files = yield from download_video(bvid, only_audio=True, user_name=user_name, chatbot=chatbot, history=history)
if len(downloaded_files) == 0:
# failed to download audio
return []
# preview
preview_list = [promote_file_to_downloadzone(fp) for fp in downloaded_files]
file_links = generate_file_link(preview_list)
yield from update_ui_latest_msg(f"已完成的文件: <br/>" + file_links, chatbot=chatbot, history=history, delay=0)
chatbot.append((None, f"即将下载视频。"))
# pause a while
tic_time = 16
for i in range(tic_time):
yield from update_ui_latest_msg(
lastmsg=f"即将下载视频。等待{tic_time-i}秒后自动继续, 点击“停止”键取消此操作。",
chatbot=chatbot, history=[], delay=1)
# download video
chatbot.append((None, "下载视频, 请稍等...")); yield from update_ui(chatbot=chatbot, history=history)
downloaded_files_part2 = yield from download_video(bvid, only_audio=False, user_name=user_name, chatbot=chatbot, history=history)
# preview
preview_list = [promote_file_to_downloadzone(fp) for fp in downloaded_files_part2]
file_links = generate_file_link(preview_list)
yield from update_ui_latest_msg(f"已完成的文件: <br/>" + file_links, chatbot=chatbot, history=history, delay=0)
# return
return downloaded_files + downloaded_files_part2
class Strategy(BaseModel):
thought: str = Field(description="analysis of the user's wish, for example, can you recall the name of the resource?")
which_methods: str = Field(description="Which method to use to find the necessary information? choose from 'method_1' and 'method_2'.")
method_1_search_keywords: str = Field(description="Generate keywords to search the internet if you choose method 1, otherwise empty.")
method_2_generate_keywords: str = Field(description="Generate keywords for video download engine if you choose method 2, otherwise empty.")
@CatchException
def 多媒体任务(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
user_wish: str = txt
# query demos:
# - "我想找一首歌里面有句歌词是“turn your face towards the sun”"
# - "一首歌,第一句是红豆生南国"
# - "一首音乐,中国航天任务专用的那首"
# - "戴森球计划在熔岩星球的音乐"
# - "hanser的百变什么精"
# - "打大圣残躯时的bgm"
# - "渊下宫战斗音乐"
# 搜索
chatbot.append((txt, "检索中, 请稍等..."))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
if "跳过联网搜索" not in user_wish:
# 结构化生成
internet_search_keyword = user_wish
yield from update_ui_latest_msg(lastmsg=f"发起互联网检索: {internet_search_keyword} ...", chatbot=chatbot, history=[], delay=1)
from crazy_functions.Internet_GPT import internet_search_with_analysis_prompt
result = yield from internet_search_with_analysis_prompt(
prompt=internet_search_keyword,
analysis_prompt="请根据搜索结果分析,获取用户需要找的资源的名称、作者、出处等信息。",
llm_kwargs=llm_kwargs,
chatbot=chatbot
)
yield from update_ui_latest_msg(lastmsg=f"互联网检索结论: {result} \n\n 正在生成进一步检索方案 ...", chatbot=chatbot, history=[], delay=1)
rf_req = dedent(f"""
The user wish to get the following resource:
{user_wish}
Meanwhile, you can access another expert's opinion on the user's wish:
{result}
Generate search keywords (less than 5 keywords) for video download engine accordingly.
""")
else:
user_wish = user_wish.replace("跳过联网搜索", "").strip()
rf_req = dedent(f"""
The user wish to get the following resource:
{user_wish}
Generate research keywords (less than 5 keywords) accordingly.
""")
gpt_json_io = GptJsonIO(Query)
inputs = rf_req + gpt_json_io.format_instructions
run_gpt_fn = lambda inputs, sys_prompt: predict_no_ui_long_connection(inputs=inputs, llm_kwargs=llm_kwargs, history=[], sys_prompt=sys_prompt, observe_window=[])
analyze_res = run_gpt_fn(inputs, "")
logger.info(analyze_res)
query: Query = gpt_json_io.generate_output_auto_repair(analyze_res, run_gpt_fn)
video_engine_keywords = query.search_keyword
# 关键词展示
chatbot.append((None, f"检索关键词已确认: {video_engine_keywords}。筛选中, 请稍等..."))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# 获取候选资源
candidate_dictionary: dict = get_video_resource(video_engine_keywords)
candidate_dictionary_as_str = json.dumps(candidate_dictionary, ensure_ascii=False, indent=4)
# 展示候选资源
candidate_display = "\n".join([f"{i+1}. {it['title']}" for i, it in enumerate(candidate_dictionary)])
chatbot.append((None, f"候选:\n\n{candidate_display}"))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# 结构化生成
rf_req_2 = dedent(f"""
The user wish to get the following resource:
{user_wish}
Select the most relevant and suitable video resource from the following search results:
{candidate_dictionary_as_str}
Note:
1. The first several search video results are more likely to satisfy the user's wish.
2. The time duration of the video should be less than 10 minutes.
3. You should analyze the search results first, before giving your answer.
4. Use Chinese if possible.
5. Beside the primary video selection, give a backup video resource `bvid`.
""")
gpt_json_io = GptJsonIO(VideoResource)
inputs = rf_req_2 + gpt_json_io.format_instructions
run_gpt_fn = lambda inputs, sys_prompt: predict_no_ui_long_connection(inputs=inputs, llm_kwargs=llm_kwargs, history=[], sys_prompt=sys_prompt, observe_window=[])
analyze_res = run_gpt_fn(inputs, "")
logger.info(analyze_res)
video_resource: VideoResource = gpt_json_io.generate_output_auto_repair(analyze_res, run_gpt_fn)
# Display
chatbot.append(
(None,
f"分析:{video_resource.thought}" "<br/>"
f"选择: `{video_resource.title}`。" "<br/>"
f"作者:{video_resource.author}"
)
)
chatbot.append((None, f"下载中, 请稍等..."))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
if video_resource and video_resource.bvid:
logger.info(video_resource)
downloaded = yield from download_video(video_resource.bvid, chatbot.get_user(), chatbot, history)
if not downloaded:
chatbot.append((None, f"下载失败, 尝试备选 ..."))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
downloaded = yield from download_video(video_resource.another_failsafe_bvid, chatbot.get_user(), chatbot, history)
@CatchException
def debug(bvid, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
yield from download_video(bvid, chatbot.get_user(), chatbot, history)

View File

@@ -21,7 +21,7 @@ Please describe in natural language what you want to do.
5. If you don't need to upload a file, you can simply repeat your command again.
"""
explain_msg = """
## 虚空终端插件说明:
## Void_Terminal插件说明:
1. 请用**自然语言**描述您需要做什么例如
- 请调用插件为我翻译PDF论文论文我刚刚放到上传区了
@@ -33,7 +33,7 @@ explain_msg = """
- 请调用插件解析python源代码项目代码我刚刚打包拖到上传区了
- 请问Transformer网络的结构是怎样的
2. 您可以打开插件下拉菜单以了解本项目的各种能力
2. 您可以打开插件下拉菜单以了解本项目的各种能力
3. 如果您使用调用插件xxx修改配置xxx请问等关键词您的意图可以被识别的更准确
@@ -47,7 +47,7 @@ explain_msg = """
from pydantic import BaseModel, Field
from typing import List
from toolbox import CatchException, update_ui, is_the_upload_folder
from toolbox import update_ui_lastest_msg, disable_auto_promotion
from toolbox import update_ui_latest_msg, disable_auto_promotion
from request_llms.bridge_all import predict_no_ui_long_connection
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from crazy_functions.crazy_utils import input_clipping
@@ -67,7 +67,7 @@ class UserIntention(BaseModel):
def chat(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_intention):
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=txt, inputs_show_user=txt,
llm_kwargs=llm_kwargs, chatbot=chatbot, history=[],
llm_kwargs=llm_kwargs, chatbot=chatbot, history=[],
sys_prompt=system_prompt
)
chatbot[-1] = [txt, gpt_say]
@@ -104,44 +104,44 @@ def analyze_intention_with_simple_rules(txt):
@CatchException
def 虚空终端(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
def Void_Terminal(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
disable_auto_promotion(chatbot=chatbot)
# 获取当前虚空终端状态
# 获取当前Void_Terminal状态
state = VoidTerminalState.get_state(chatbot)
appendix_msg = ""
# 用简单的关键词检测用户意图
is_certain, _ = analyze_intention_with_simple_rules(txt)
if is_the_upload_folder(txt):
state.set_state(chatbot=chatbot, key='has_provided_explaination', value=False)
state.set_state(chatbot=chatbot, key='has_provided_explanation', value=False)
appendix_msg = "\n\n**很好,您已经上传了文件**,现在请您描述您的需求。"
if is_certain or (state.has_provided_explaination):
if is_certain or (state.has_provided_explanation):
# 如果意图明确,跳过提示环节
state.set_state(chatbot=chatbot, key='has_provided_explaination', value=True)
state.set_state(chatbot=chatbot, key='has_provided_explanation', value=True)
state.unlock_plugin(chatbot=chatbot)
yield from update_ui(chatbot=chatbot, history=history)
yield from 虚空终端主路由(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request)
yield from Void_Terminal主路由(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request)
return
else:
# 如果意图模糊,提示
state.set_state(chatbot=chatbot, key='has_provided_explaination', value=True)
state.set_state(chatbot=chatbot, key='has_provided_explanation', value=True)
state.lock_plugin(chatbot=chatbot)
chatbot.append(("虚空终端状态:", explain_msg+appendix_msg))
chatbot.append(("Void_Terminal状态:", explain_msg+appendix_msg))
yield from update_ui(chatbot=chatbot, history=history)
return
def 虚空终端主路由(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
def Void_Terminal主路由(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
history = []
chatbot.append(("虚空终端状态: ", f"正在执行任务: {txt}"))
chatbot.append(("Void_Terminal状态: ", f"正在执行任务: {txt}"))
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# ⭐ ⭐ ⭐ 分析用户意图
is_certain, user_intention = analyze_intention_with_simple_rules(txt)
if not is_certain:
yield from update_ui_lastest_msg(
yield from update_ui_latest_msg(
lastmsg=f"正在执行任务: {txt}\n\n分析用户意图中", chatbot=chatbot, history=history, delay=0)
gpt_json_io = GptJsonIO(UserIntention)
rf_req = "\nchoose from ['ModifyConfiguration', 'ExecutePlugin', 'Chat']"
@@ -152,16 +152,16 @@ def 虚空终端主路由(txt, llm_kwargs, plugin_kwargs, chatbot, history, syst
analyze_res = run_gpt_fn(inputs, "")
try:
user_intention = gpt_json_io.generate_output_auto_repair(analyze_res, run_gpt_fn)
lastmsg=f"正在执行任务: {txt}\n\n用户意图理解: 意图={explain_intention_to_user[user_intention.intention_type]}",
lastmsg=f"正在执行任务: {txt}\n\n用户意图理解: 意图={explain_intention_to_user[user_intention.intention_type]}",
except JsonStringError as e:
yield from update_ui_lastest_msg(
yield from update_ui_latest_msg(
lastmsg=f"正在执行任务: {txt}\n\n用户意图理解: 失败 当前语言模型({llm_kwargs['llm_model']})不能理解您的意图", chatbot=chatbot, history=history, delay=0)
return
else:
pass
yield from update_ui_lastest_msg(
lastmsg=f"正在执行任务: {txt}\n\n用户意图理解: 意图={explain_intention_to_user[user_intention.intention_type]}",
yield from update_ui_latest_msg(
lastmsg=f"正在执行任务: {txt}\n\n用户意图理解: 意图={explain_intention_to_user[user_intention.intention_type]}",
chatbot=chatbot, history=history, delay=0)
# 用户意图: 修改本项目的配置

View File

@@ -1,7 +1,7 @@
from toolbox import update_ui
from toolbox import CatchException, report_exception
from toolbox import write_history_to_file, promote_file_to_downloadzone
from .crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
fast_debug = False
@@ -40,10 +40,10 @@ def 解析docx(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot
i_say = f'请对下面的文章片段用中文做概述,文件名是{os.path.relpath(fp, project_folder)},文章内容是 ```{paper_frag}```'
i_say_show_user = f'请对下面的文章片段做概述: {os.path.abspath(fp)}的第{i+1}/{len(paper_fragments)}个片段。'
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=i_say,
inputs_show_user=i_say_show_user,
inputs=i_say,
inputs_show_user=i_say_show_user,
llm_kwargs=llm_kwargs,
chatbot=chatbot,
chatbot=chatbot,
history=[],
sys_prompt="总结文章。"
)
@@ -56,10 +56,10 @@ def 解析docx(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot
if len(paper_fragments) > 1:
i_say = f"根据以上的对话,总结文章{os.path.abspath(fp)}的主要内容。"
gpt_say = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=i_say,
inputs_show_user=i_say,
inputs=i_say,
inputs_show_user=i_say,
llm_kwargs=llm_kwargs,
chatbot=chatbot,
chatbot=chatbot,
history=this_paper_history,
sys_prompt="总结文章。"
)
@@ -79,13 +79,13 @@ def 解析docx(file_manifest, project_folder, llm_kwargs, plugin_kwargs, chatbot
@CatchException
def 总结word文档(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
def Word_Summary(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
import glob, os
# 基本信息:功能、贡献者
chatbot.append([
"函数插件功能?",
"批量总结Word文档。函数插件贡献者: JasonGuo1。注意, 如果是.doc文件, 请先转化为.docx格式。"])
"批量Word_Summary。函数插件贡献者: JasonGuo1。注意, 如果是.doc文件, 请先转化为.docx格式。"])
yield from update_ui(chatbot=chatbot, history=history) # 刷新界面
# 尝试导入依赖,如果缺少依赖,则给出安装建议

View File

@@ -1,5 +1,5 @@
from toolbox import CatchException, update_ui, gen_time_str, trimmed_format_exc, ProxyNetworkActivate
from toolbox import report_exception, get_log_folder, update_ui_lastest_msg, Singleton
from toolbox import report_exception, get_log_folder, update_ui_latest_msg, Singleton
from crazy_functions.agent_fns.pipe import PluginMultiprocessManager, PipeCom
from crazy_functions.agent_fns.general import AutoGenGeneral

View File

@@ -1,4 +1,5 @@
from crazy_functions.agent_fns.pipe import PluginMultiprocessManager, PipeCom
from loguru import logger
class EchoDemo(PluginMultiprocessManager):
def subprocess_worker(self, child_conn):
@@ -7,7 +8,7 @@ class EchoDemo(PluginMultiprocessManager):
while True:
msg = self.child_conn.recv() # PipeCom
if msg.cmd == "user_input":
# wait futher user input
# wait father user input
self.child_conn.send(PipeCom("show", msg.content))
wait_success = self.subprocess_worker_wait_user_feedback(wait_msg="我准备好处理下一个问题了.")
if not wait_success:
@@ -16,4 +17,4 @@ class EchoDemo(PluginMultiprocessManager):
elif msg.cmd == "terminate":
self.child_conn.send(PipeCom("done", ""))
break
print('[debug] subprocess_worker terminated')
logger.info('[debug] subprocess_worker terminated')

View File

@@ -27,7 +27,7 @@ def gpt_academic_generate_oai_reply(
llm_kwargs=llm_config,
history=history,
sys_prompt=self._oai_system_message[0]['content'],
console_slience=True
console_silence=True
)
assumed_done = reply.endswith('\nTERMINATE')
return True, reply

View File

@@ -1,5 +1,6 @@
from toolbox import get_log_folder, update_ui, gen_time_str, get_conf, promote_file_to_downloadzone
from crazy_functions.agent_fns.watchdog import WatchDog
from loguru import logger
import time, os
class PipeCom:
@@ -47,7 +48,7 @@ class PluginMultiprocessManager:
def terminate(self):
self.p.terminate()
self.alive = False
print("[debug] instance terminated")
logger.info("[debug] instance terminated")
def subprocess_worker(self, child_conn):
# ⭐⭐ run in subprocess
@@ -72,7 +73,7 @@ class PluginMultiprocessManager:
if file_type.lower() in ['png', 'jpg']:
image_path = os.path.abspath(fp)
self.chatbot.append([
'检测到新生图像:',
'检测到新生图像:',
f'本地文件预览: <br/><div align="center"><img src="file={image_path}"></div>'
])
yield from update_ui(chatbot=self.chatbot, history=self.history)
@@ -114,21 +115,21 @@ class PluginMultiprocessManager:
self.cnt = 1
self.parent_conn = self.launch_subprocess_with_pipe() # ⭐⭐⭐
repeated, cmd_to_autogen = self.send_command(txt)
if txt == 'exit':
if txt == 'exit':
self.chatbot.append([f"结束", "结束信号已明确终止AutoGen程序。"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
self.terminate()
return "terminate"
# patience = 10
while True:
time.sleep(0.5)
if not self.alive:
# the heartbeat watchdog might have it killed
self.terminate()
return "terminate"
if self.parent_conn.poll():
if self.parent_conn.poll():
self.feed_heartbeat_watchdog()
if "[GPT-Academic] 等待中" in self.chatbot[-1][-1]:
self.chatbot.pop(-1) # remove the last line
@@ -152,8 +153,8 @@ class PluginMultiprocessManager:
yield from update_ui(chatbot=self.chatbot, history=self.history)
if msg.cmd == "interact":
yield from self.overwatch_workdir_file_change()
self.chatbot.append([f"程序抵达用户反馈节点.", msg.content +
"\n\n等待您的进一步指令." +
self.chatbot.append([f"程序抵达用户反馈节点.", msg.content +
"\n\n等待您的进一步指令." +
"\n\n(1) 一般情况下您不需要说什么, 清空输入区, 然后直接点击“提交”以继续. " +
"\n\n(2) 如果您需要补充些什么, 输入要反馈的内容, 直接点击“提交”以继续. " +
"\n\n(3) 如果您想终止程序, 输入exit, 直接点击“提交”以终止AutoGen并解锁. "

View File

@@ -0,0 +1,457 @@
import datetime
import re
import os
from loguru import logger
from textwrap import dedent
from toolbox import CatchException, update_ui
from request_llms.bridge_all import predict_no_ui_long_connection
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
# TODO: 解决缩进问题
find_function_end_prompt = '''
Below is a page of code that you need to read. This page may not yet complete, you job is to split this page to separate functions, class functions etc.
- Provide the line number where the first visible function ends.
- Provide the line number where the next visible function begins.
- If there are no other functions in this page, you should simply return the line number of the last line.
- Only focus on functions declared by `def` keyword. Ignore inline functions. Ignore function calls.
------------------ Example ------------------
INPUT:
```
L0000 |import sys
L0001 |import re
L0002 |
L0003 |def trimmed_format_exc():
L0004 | import os
L0005 | import traceback
L0006 | str = traceback.format_exc()
L0007 | current_path = os.getcwd()
L0008 | replace_path = "."
L0009 | return str.replace(current_path, replace_path)
L0010 |
L0011 |
L0012 |def trimmed_format_exc_markdown():
L0013 | ...
L0014 | ...
```
OUTPUT:
```
<first_function_end_at>L0009</first_function_end_at>
<next_function_begin_from>L0012</next_function_begin_from>
```
------------------ End of Example ------------------
------------------ the real INPUT you need to process NOW ------------------
```
{THE_TAGGED_CODE}
```
'''
revise_function_prompt = '''
You need to read the following code, and revise the source code ({FILE_BASENAME}) according to following instructions:
1. You should analyze the purpose of the functions (if there are any).
2. You need to add docstring for the provided functions (if there are any).
Be aware:
1. You must NOT modify the indent of code.
2. You are NOT authorized to change or translate non-comment code, and you are NOT authorized to add empty lines either, toggle qu.
3. Use {LANG} to add comments and docstrings. Do NOT translate Chinese that is already in the code.
4. Besides adding a docstring, use the ⭐ symbol to annotate the most core and important line of code within the function, explaining its role.
------------------ Example ------------------
INPUT:
```
L0000 |
L0001 |def zip_result(folder):
L0002 | t = gen_time_str()
L0003 | zip_folder(folder, get_log_folder(), f"result.zip")
L0004 | return os.path.join(get_log_folder(), f"result.zip")
L0005 |
L0006 |
```
OUTPUT:
<instruction_1_purpose>
This function compresses a given folder, and return the path of the resulting `zip` file.
</instruction_1_purpose>
<instruction_2_revised_code>
```
def zip_result(folder):
"""
Compresses the specified folder into a zip file and stores it in the log folder.
Args:
folder (str): The path to the folder that needs to be compressed.
Returns:
str: The path to the created zip file in the log folder.
"""
t = gen_time_str()
zip_folder(folder, get_log_folder(), f"result.zip") # ⭐ Execute the zipping of folder
return os.path.join(get_log_folder(), f"result.zip")
```
</instruction_2_revised_code>
------------------ End of Example ------------------
------------------ the real INPUT you need to process NOW ({FILE_BASENAME}) ------------------
```
{THE_CODE}
```
{INDENT_REMINDER}
{BRIEF_REMINDER}
{HINT_REMINDER}
'''
revise_function_prompt_chinese = '''
您需要阅读以下代码,并根据以下说明修订源代码({FILE_BASENAME}):
1. 如果源代码中包含函数的话, 你应该分析给定函数实现了什么功能
2. 如果源代码中包含函数的话, 你需要为函数添加docstring, docstring必须使用中文
请注意:
1. 你不得修改代码的缩进
2. 你无权更改或翻译代码中的非注释部分,也不允许添加空行
3. 使用 {LANG} 添加注释和文档字符串。不要翻译代码中已有的中文
4. 除了添加docstring之外, 使用⭐符号给该函数中最核心、最重要的一行代码添加注释,并说明其作用
------------------ 示例 ------------------
INPUT:
```
L0000 |
L0001 |def zip_result(folder):
L0002 | t = gen_time_str()
L0003 | zip_folder(folder, get_log_folder(), f"result.zip")
L0004 | return os.path.join(get_log_folder(), f"result.zip")
L0005 |
L0006 |
```
OUTPUT:
<instruction_1_purpose>
该函数用于压缩指定文件夹,并返回生成的`zip`文件的路径。
</instruction_1_purpose>
<instruction_2_revised_code>
```
def zip_result(folder):
"""
该函数将指定的文件夹压缩成ZIP文件, 并将其存储在日志文件夹中。
输入参数:
folder (str): 需要压缩的文件夹的路径。
返回值:
str: 日志文件夹中创建的ZIP文件的路径。
"""
t = gen_time_str()
zip_folder(folder, get_log_folder(), f"result.zip") # ⭐ 执行文件夹的压缩
return os.path.join(get_log_folder(), f"result.zip")
```
</instruction_2_revised_code>
------------------ End of Example ------------------
------------------ the real INPUT you need to process NOW ({FILE_BASENAME}) ------------------
```
{THE_CODE}
```
{INDENT_REMINDER}
{BRIEF_REMINDER}
{HINT_REMINDER}
'''
class PythonCodeComment():
def __init__(self, llm_kwargs, plugin_kwargs, language, observe_window_update) -> None:
self.original_content = ""
self.full_context = []
self.full_context_with_line_no = []
self.current_page_start = 0
self.page_limit = 100 # 100 lines of code each page
self.ignore_limit = 20
self.llm_kwargs = llm_kwargs
self.plugin_kwargs = plugin_kwargs
self.language = language
self.observe_window_update = observe_window_update
if self.language == "chinese":
self.core_prompt = revise_function_prompt_chinese
else:
self.core_prompt = revise_function_prompt
self.path = None
self.file_basename = None
self.file_brief = ""
def generate_tagged_code_from_full_context(self):
for i, code in enumerate(self.full_context):
number = i
padded_number = f"{number:04}"
result = f"L{padded_number}"
self.full_context_with_line_no.append(f"{result} | {code}")
return self.full_context_with_line_no
def read_file(self, path, brief):
with open(path, 'r', encoding='utf8') as f:
self.full_context = f.readlines()
self.original_content = ''.join(self.full_context)
self.file_basename = os.path.basename(path)
self.file_brief = brief
self.full_context_with_line_no = self.generate_tagged_code_from_full_context()
self.path = path
def find_next_function_begin(self, tagged_code:list, begin_and_end):
begin, end = begin_and_end
THE_TAGGED_CODE = ''.join(tagged_code)
self.llm_kwargs['temperature'] = 0
result = predict_no_ui_long_connection(
inputs=find_function_end_prompt.format(THE_TAGGED_CODE=THE_TAGGED_CODE),
llm_kwargs=self.llm_kwargs,
history=[],
sys_prompt="",
observe_window=[],
console_silence=True
)
def extract_number(text):
# 使用正则表达式匹配模式
match = re.search(r'<next_function_begin_from>L(\d+)</next_function_begin_from>', text)
if match:
# 提取匹配的数字部分并转换为整数
return int(match.group(1))
return None
line_no = extract_number(result)
if line_no is not None:
return line_no
else:
return end
def _get_next_window(self):
#
current_page_start = self.current_page_start
if self.current_page_start == len(self.full_context) + 1:
raise StopIteration
# 如果剩余的行数非常少,一鼓作气处理掉
if len(self.full_context) - self.current_page_start < self.ignore_limit:
future_page_start = len(self.full_context) + 1
self.current_page_start = future_page_start
return current_page_start, future_page_start
tagged_code = self.full_context_with_line_no[ self.current_page_start: self.current_page_start + self.page_limit]
line_no = self.find_next_function_begin(tagged_code, [self.current_page_start, self.current_page_start + self.page_limit])
if line_no > len(self.full_context) - 5:
line_no = len(self.full_context) + 1
future_page_start = line_no
self.current_page_start = future_page_start
# ! consider eof
return current_page_start, future_page_start
def dedent(self, text):
"""Remove any common leading whitespace from every line in `text`.
"""
# Look for the longest leading string of spaces and tabs common to
# all lines.
margin = None
_whitespace_only_re = re.compile('^[ \t]+$', re.MULTILINE)
_leading_whitespace_re = re.compile('(^[ \t]*)(?:[^ \t\n])', re.MULTILINE)
text = _whitespace_only_re.sub('', text)
indents = _leading_whitespace_re.findall(text)
for indent in indents:
if margin is None:
margin = indent
# Current line more deeply indented than previous winner:
# no change (previous winner is still on top).
elif indent.startswith(margin):
pass
# Current line consistent with and no deeper than previous winner:
# it's the new winner.
elif margin.startswith(indent):
margin = indent
# Find the largest common whitespace between current line and previous
# winner.
else:
for i, (x, y) in enumerate(zip(margin, indent)):
if x != y:
margin = margin[:i]
break
# sanity check (testing/debugging only)
if 0 and margin:
for line in text.split("\n"):
assert not line or line.startswith(margin), \
"line = %r, margin = %r" % (line, margin)
if margin:
text = re.sub(r'(?m)^' + margin, '', text)
return text, len(margin)
else:
return text, 0
def get_next_batch(self):
current_page_start, future_page_start = self._get_next_window()
return ''.join(self.full_context[current_page_start: future_page_start]), current_page_start, future_page_start
def tag_code(self, fn, hint):
code = fn
_, n_indent = self.dedent(code)
indent_reminder = "" if n_indent == 0 else "(Reminder: as you can see, this piece of code has indent made up with {n_indent} whitespace, please preserve them in the OUTPUT.)"
brief_reminder = "" if self.file_brief == "" else f"({self.file_basename} abstract: {self.file_brief})"
hint_reminder = "" if hint is None else f"(Reminder: do not ignore or modify code such as `{hint}`, provide complete code in the OUTPUT.)"
self.llm_kwargs['temperature'] = 0
result = predict_no_ui_long_connection(
inputs=self.core_prompt.format(
LANG=self.language,
FILE_BASENAME=self.file_basename,
THE_CODE=code,
INDENT_REMINDER=indent_reminder,
BRIEF_REMINDER=brief_reminder,
HINT_REMINDER=hint_reminder
),
llm_kwargs=self.llm_kwargs,
history=[],
sys_prompt="",
observe_window=[],
console_silence=True
)
def get_code_block(reply):
import re
pattern = r"```([\s\S]*?)```" # regex pattern to match code blocks
matches = re.findall(pattern, reply) # find all code blocks in text
if len(matches) == 1:
return matches[0].strip('python') # code block
return None
code_block = get_code_block(result)
if code_block is not None:
code_block = self.sync_and_patch(original=code, revised=code_block)
return code_block
else:
return code
def get_markdown_block_in_html(self, html):
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
found_list = soup.find_all("div", class_="markdown-body")
if found_list:
res = found_list[0]
return res.prettify()
else:
return None
def sync_and_patch(self, original, revised):
"""Ensure the number of pre-string empty lines in revised matches those in original."""
def count_leading_empty_lines(s, reverse=False):
"""Count the number of leading empty lines in a string."""
lines = s.split('\n')
if reverse: lines = list(reversed(lines))
count = 0
for line in lines:
if line.strip() == '':
count += 1
else:
break
return count
original_empty_lines = count_leading_empty_lines(original)
revised_empty_lines = count_leading_empty_lines(revised)
if original_empty_lines > revised_empty_lines:
additional_lines = '\n' * (original_empty_lines - revised_empty_lines)
revised = additional_lines + revised
elif original_empty_lines < revised_empty_lines:
lines = revised.split('\n')
revised = '\n'.join(lines[revised_empty_lines - original_empty_lines:])
original_empty_lines = count_leading_empty_lines(original, reverse=True)
revised_empty_lines = count_leading_empty_lines(revised, reverse=True)
if original_empty_lines > revised_empty_lines:
additional_lines = '\n' * (original_empty_lines - revised_empty_lines)
revised = revised + additional_lines
elif original_empty_lines < revised_empty_lines:
lines = revised.split('\n')
revised = '\n'.join(lines[:-(revised_empty_lines - original_empty_lines)])
return revised
def begin_comment_source_code(self, chatbot=None, history=None):
# from toolbox import update_ui_latest_msg
assert self.path is not None
assert '.py' in self.path # must be python source code
# write_target = self.path + '.revised.py'
write_content = ""
# with open(self.path + '.revised.py', 'w+', encoding='utf8') as f:
while True:
try:
# yield from update_ui_latest_msg(f"({self.file_basename}) 正在读取下一段代码片段:\n", chatbot=chatbot, history=history, delay=0)
next_batch, line_no_start, line_no_end = self.get_next_batch()
self.observe_window_update(f"正在处理{self.file_basename} - {line_no_start}/{len(self.full_context)}\n")
# yield from update_ui_latest_msg(f"({self.file_basename}) 处理代码片段:\n\n{next_batch}", chatbot=chatbot, history=history, delay=0)
hint = None
MAX_ATTEMPT = 2
for attempt in range(MAX_ATTEMPT):
result = self.tag_code(next_batch, hint)
try:
successful, hint = self.verify_successful(next_batch, result)
except Exception as e:
logger.error('ignored exception:\n' + str(e))
break
if successful:
break
if attempt == MAX_ATTEMPT - 1:
# cannot deal with this, give up
result = next_batch
break
# f.write(result)
write_content += result
except StopIteration:
next_batch, line_no_start, line_no_end = [], -1, -1
return None, write_content
def verify_successful(self, original, revised):
""" Determine whether the revised code contains every line that already exists
"""
from crazy_functions.ast_fns.comment_remove import remove_python_comments
original = remove_python_comments(original)
original_lines = original.split('\n')
revised_lines = revised.split('\n')
for l in original_lines:
l = l.strip()
if '\'' in l or '\"' in l: continue # ast sometimes toggle " to '
found = False
for lt in revised_lines:
if l in lt:
found = True
break
if not found:
return False, l
return True, None

View File

@@ -0,0 +1,45 @@
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<style>ADVANCED_CSS</style>
<meta charset="UTF-8">
<title>源文件对比</title>
<style>
body {
font-family: Arial, sans-serif;
display: flex;
justify-content: center;
align-items: center;
height: 100vh;
margin: 0;
}
.container {
display: flex;
width: 95%;
height: -webkit-fill-available;
}
.code-container {
flex: 1;
margin: 0px;
padding: 0px;
border: 1px solid #ccc;
background-color: #f9f9f9;
overflow: auto;
}
pre {
white-space: pre-wrap;
word-wrap: break-word;
}
</style>
</head>
<body>
<div class="container">
<div class="code-container">
REPLACE_CODE_FILE_LEFT
</div>
<div class="code-container">
REPLACE_CODE_FILE_RIGHT
</div>
</div>
</body>
</html>

View File

@@ -1,4 +1,5 @@
import threading, time
from loguru import logger
class WatchDog():
def __init__(self, timeout, bark_fn, interval=3, msg="") -> None:
@@ -8,12 +9,12 @@ class WatchDog():
self.interval = interval
self.msg = msg
self.kill_dog = False
def watch(self):
while True:
if self.kill_dog: break
if time.time() - self.last_feed > self.timeout:
if len(self.msg) > 0: print(self.msg)
if len(self.msg) > 0: logger.info(self.msg)
self.bark_fn()
break
time.sleep(self.interval)

View File

@@ -0,0 +1,54 @@
import token
import tokenize
import copy
import io
def remove_python_comments(input_source: str) -> str:
source_flag = copy.copy(input_source)
source = io.StringIO(input_source)
ls = input_source.split('\n')
prev_toktype = token.INDENT
readline = source.readline
def get_char_index(lineno, col):
# find the index of the char in the source code
if lineno == 1:
return len('\n'.join(ls[:(lineno-1)])) + col
else:
return len('\n'.join(ls[:(lineno-1)])) + col + 1
def replace_char_between(start_lineno, start_col, end_lineno, end_col, source, replace_char, ls):
# replace char between start_lineno, start_col and end_lineno, end_col with replace_char, but keep '\n' and ' '
b = get_char_index(start_lineno, start_col)
e = get_char_index(end_lineno, end_col)
for i in range(b, e):
if source[i] == '\n':
source = source[:i] + '\n' + source[i+1:]
elif source[i] == ' ':
source = source[:i] + ' ' + source[i+1:]
else:
source = source[:i] + replace_char + source[i+1:]
return source
tokgen = tokenize.generate_tokens(readline)
for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokgen:
if toktype == token.STRING and (prev_toktype == token.INDENT):
source_flag = replace_char_between(slineno, scol, elineno, ecol, source_flag, ' ', ls)
elif toktype == token.STRING and (prev_toktype == token.NEWLINE):
source_flag = replace_char_between(slineno, scol, elineno, ecol, source_flag, ' ', ls)
elif toktype == tokenize.COMMENT:
source_flag = replace_char_between(slineno, scol, elineno, ecol, source_flag, ' ', ls)
prev_toktype = toktype
return source_flag
# 示例使用
if __name__ == "__main__":
with open("source.py", "r", encoding="utf-8") as f:
source_code = f.read()
cleaned_code = remove_python_comments(source_code)
with open("cleaned_source.py", "w", encoding="utf-8") as f:
f.write(cleaned_code)

View File

@@ -1,141 +0,0 @@
from toolbox import CatchException, update_ui, promote_file_to_downloadzone
from .crazy_utils import request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency
import datetime, json
def fetch_items(list_of_items, batch_size):
for i in range(0, len(list_of_items), batch_size):
yield list_of_items[i:i + batch_size]
def string_to_options(arguments):
import argparse
import shlex
# Create an argparse.ArgumentParser instance
parser = argparse.ArgumentParser()
# Add command-line arguments
parser.add_argument("--llm_to_learn", type=str, help="LLM model to learn", default="gpt-3.5-turbo")
parser.add_argument("--prompt_prefix", type=str, help="Prompt prefix", default='')
parser.add_argument("--system_prompt", type=str, help="System prompt", default='')
parser.add_argument("--batch", type=int, help="System prompt", default=50)
parser.add_argument("--pre_seq_len", type=int, help="pre_seq_len", default=50)
parser.add_argument("--learning_rate", type=float, help="learning_rate", default=2e-2)
parser.add_argument("--num_gpus", type=int, help="num_gpus", default=1)
parser.add_argument("--json_dataset", type=str, help="json_dataset", default="")
parser.add_argument("--ptuning_directory", type=str, help="ptuning_directory", default="")
# Parse the arguments
args = parser.parse_args(shlex.split(arguments))
return args
@CatchException
def 微调数据集生成(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
"""
txt 输入栏用户输入的文本,例如需要翻译的一段话,再例如一个包含了待处理文件的路径
llm_kwargs gpt模型参数如温度和top_p等一般原样传递下去就行
plugin_kwargs 插件模型的参数
chatbot 聊天显示框的句柄,用于显示给用户
history 聊天历史,前情提要
system_prompt 给gpt的静默提醒
user_request 当前用户的请求信息IP地址等
"""
history = [] # 清空历史,以免输入溢出
chatbot.append(("这是什么功能?", "[Local Message] 微调数据集生成"))
if ("advanced_arg" in plugin_kwargs) and (plugin_kwargs["advanced_arg"] == ""): plugin_kwargs.pop("advanced_arg")
args = plugin_kwargs.get("advanced_arg", None)
if args is None:
chatbot.append(("没给定指令", "退出"))
yield from update_ui(chatbot=chatbot, history=history); return
else:
arguments = string_to_options(arguments=args)
dat = []
with open(txt, 'r', encoding='utf8') as f:
for line in f.readlines():
json_dat = json.loads(line)
dat.append(json_dat["content"])
llm_kwargs['llm_model'] = arguments.llm_to_learn
for batch in fetch_items(dat, arguments.batch):
res = yield from request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency(
inputs_array=[f"{arguments.prompt_prefix}\n\n{b}" for b in (batch)],
inputs_show_user_array=[f"Show Nothing" for _ in (batch)],
llm_kwargs=llm_kwargs,
chatbot=chatbot,
history_array=[[] for _ in (batch)],
sys_prompt_array=[arguments.system_prompt for _ in (batch)],
max_workers=10 # OpenAI所允许的最大并行过载
)
with open(txt+'.generated.json', 'a+', encoding='utf8') as f:
for b, r in zip(batch, res[1::2]):
f.write(json.dumps({"content":b, "summary":r}, ensure_ascii=False)+'\n')
promote_file_to_downloadzone(txt+'.generated.json', rename_file='generated.json', chatbot=chatbot)
return
@CatchException
def 启动微调(txt, llm_kwargs, plugin_kwargs, chatbot, history, system_prompt, user_request):
"""
txt 输入栏用户输入的文本,例如需要翻译的一段话,再例如一个包含了待处理文件的路径
llm_kwargs gpt模型参数如温度和top_p等一般原样传递下去就行
plugin_kwargs 插件模型的参数
chatbot 聊天显示框的句柄,用于显示给用户
history 聊天历史,前情提要
system_prompt 给gpt的静默提醒
user_request 当前用户的请求信息IP地址等
"""
import subprocess
history = [] # 清空历史,以免输入溢出
chatbot.append(("这是什么功能?", "[Local Message] 微调数据集生成"))
if ("advanced_arg" in plugin_kwargs) and (plugin_kwargs["advanced_arg"] == ""): plugin_kwargs.pop("advanced_arg")
args = plugin_kwargs.get("advanced_arg", None)
if args is None:
chatbot.append(("没给定指令", "退出"))
yield from update_ui(chatbot=chatbot, history=history); return
else:
arguments = string_to_options(arguments=args)
pre_seq_len = arguments.pre_seq_len # 128
learning_rate = arguments.learning_rate # 2e-2
num_gpus = arguments.num_gpus # 1
json_dataset = arguments.json_dataset # 't_code.json'
ptuning_directory = arguments.ptuning_directory # '/home/hmp/ChatGLM2-6B/ptuning'
command = f"torchrun --standalone --nnodes=1 --nproc-per-node={num_gpus} main.py \
--do_train \
--train_file AdvertiseGen/{json_dataset} \
--validation_file AdvertiseGen/{json_dataset} \
--preprocessing_num_workers 20 \
--prompt_column content \
--response_column summary \
--overwrite_cache \
--model_name_or_path THUDM/chatglm2-6b \
--output_dir output/clothgen-chatglm2-6b-pt-{pre_seq_len}-{learning_rate} \
--overwrite_output_dir \
--max_source_length 256 \
--max_target_length 256 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16 \
--predict_with_generate \
--max_steps 100 \
--logging_steps 10 \
--save_steps 20 \
--learning_rate {learning_rate} \
--pre_seq_len {pre_seq_len} \
--quantization_bit 4"
process = subprocess.Popen(command, shell=True, cwd=ptuning_directory)
try:
process.communicate(timeout=3600*24)
except subprocess.TimeoutExpired:
process.kill()
return

View File

@@ -1,27 +1,41 @@
from toolbox import update_ui, get_conf, trimmed_format_exc, get_max_token, Singleton
import threading
import os
import logging
import threading
from loguru import logger
from shared_utils.char_visual_effect import scrolling_visual_effect
from toolbox import update_ui, get_conf, trimmed_format_exc, get_max_token, Singleton
def input_clipping(inputs, history, max_token_limit):
def input_clipping(inputs, history, max_token_limit, return_clip_flags=False):
"""
当输入文本 + 历史文本超出最大限制时,采取措施丢弃一部分文本。
输入:
- inputs 本次请求
- history 历史上下文
- max_token_limit 最大token限制
输出:
- inputs 本次请求经过clip
- history 历史上下文经过clip
"""
import numpy as np
from request_llms.bridge_all import model_info
enc = model_info["gpt-3.5-turbo"]['tokenizer']
def get_token_num(txt): return len(enc.encode(txt, disallowed_special=()))
mode = 'input-and-history'
# 当 输入部分的token占比 小于 全文的一半时,只裁剪历史
input_token_num = get_token_num(inputs)
if input_token_num < max_token_limit//2:
original_input_len = len(inputs)
if input_token_num < max_token_limit//2:
mode = 'only-history'
max_token_limit = max_token_limit - input_token_num
everything = [inputs] if mode == 'input-and-history' else ['']
everything.extend(history)
n_token = get_token_num('\n'.join(everything))
full_token_num = n_token = get_token_num('\n'.join(everything))
everything_token = [get_token_num(e) for e in everything]
everything_token_num = sum(everything_token)
delta = max(everything_token) // 16 # 截断时的颗粒度
while n_token > max_token_limit:
where = np.argmax(everything_token)
encoded = enc.encode(everything[where], disallowed_special=())
@@ -32,15 +46,29 @@ def input_clipping(inputs, history, max_token_limit):
if mode == 'input-and-history':
inputs = everything[0]
full_token_num = everything_token_num
else:
pass
full_token_num = everything_token_num + input_token_num
history = everything[1:]
return inputs, history
flags = {
"mode": mode,
"original_input_token_num": input_token_num,
"original_full_token_num": full_token_num,
"original_input_len": original_input_len,
"clipped_input_len": len(inputs),
}
if not return_clip_flags:
return inputs, history
else:
return inputs, history, flags
def request_gpt_model_in_new_thread_with_ui_alive(
inputs, inputs_show_user, llm_kwargs,
inputs, inputs_show_user, llm_kwargs,
chatbot, history, sys_prompt, refresh_interval=0.2,
handle_token_exceed=True,
handle_token_exceed=True,
retry_times_at_unknown_error=2,
):
"""
@@ -77,7 +105,7 @@ def request_gpt_model_in_new_thread_with_ui_alive(
exceeded_cnt = 0
while True:
# watchdog error
if len(mutable) >= 2 and (time.time()-mutable[1]) > watch_dog_patience:
if len(mutable) >= 2 and (time.time()-mutable[1]) > watch_dog_patience:
raise RuntimeError("检测到程序终止。")
try:
# 【第一种情况】:顺利完成
@@ -105,7 +133,7 @@ def request_gpt_model_in_new_thread_with_ui_alive(
except:
# 【第三种情况】:其他错误:重试几次
tb_str = '```\n' + trimmed_format_exc() + '```'
print(tb_str)
logger.error(tb_str)
mutable[0] += f"[Local Message] 警告,在执行过程中遭遇问题, Traceback\n\n{tb_str}\n\n"
if retry_op > 0:
retry_op -= 1
@@ -135,18 +163,31 @@ def request_gpt_model_in_new_thread_with_ui_alive(
yield from update_ui(chatbot=chatbot, history=[]) # 如果最后成功了,则删除报错信息
return final_result
def can_multi_process(llm):
if llm.startswith('gpt-'): return True
if llm.startswith('api2d-'): return True
if llm.startswith('azure-'): return True
if llm.startswith('spark'): return True
if llm.startswith('zhipuai'): return True
return False
def can_multi_process(llm) -> bool:
from request_llms.bridge_all import model_info
def default_condition(llm) -> bool:
# legacy condition
if llm.startswith('gpt-'): return True
if llm.startswith('chatgpt-'): return True
if llm.startswith('api2d-'): return True
if llm.startswith('azure-'): return True
if llm.startswith('spark'): return True
if llm.startswith('zhipuai') or llm.startswith('glm-'): return True
return False
if llm in model_info:
if 'can_multi_thread' in model_info[llm]:
return model_info[llm]['can_multi_thread']
else:
return default_condition(llm)
else:
return default_condition(llm)
def request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency(
inputs_array, inputs_show_user_array, llm_kwargs,
chatbot, history_array, sys_prompt_array,
refresh_interval=0.2, max_workers=-1, scroller_max_len=30,
inputs_array, inputs_show_user_array, llm_kwargs,
chatbot, history_array, sys_prompt_array,
refresh_interval=0.2, max_workers=-1, scroller_max_len=75,
handle_token_exceed=True, show_user_at_complete=False,
retry_times_at_unknown_error=2,
):
@@ -189,7 +230,7 @@ def request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency(
# 屏蔽掉 chatglm的多线程可能会导致严重卡顿
if not can_multi_process(llm_kwargs['llm_model']):
max_workers = 1
executor = ThreadPoolExecutor(max_workers=max_workers)
n_frag = len(inputs_array)
# 用户反馈
@@ -214,8 +255,8 @@ def request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency(
try:
# 【第一种情况】:顺利完成
gpt_say = predict_no_ui_long_connection(
inputs=inputs, llm_kwargs=llm_kwargs, history=history,
sys_prompt=sys_prompt, observe_window=mutable[index], console_slience=True
inputs=inputs, llm_kwargs=llm_kwargs, history=history,
sys_prompt=sys_prompt, observe_window=mutable[index], console_silence=True
)
mutable[index][2] = "已成功"
return gpt_say
@@ -243,10 +284,10 @@ def request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency(
# 【第三种情况】:其他错误
if detect_timeout(): raise RuntimeError("检测到程序终止。")
tb_str = '```\n' + trimmed_format_exc() + '```'
print(tb_str)
logger.error(tb_str)
gpt_say += f"[Local Message] 警告,线程{index}在执行过程中遭遇问题, Traceback\n\n{tb_str}\n\n"
if len(mutable[index][0]) > 0: gpt_say += "此线程失败前收到的回答:\n\n" + mutable[index][0]
if retry_op > 0:
if retry_op > 0:
retry_op -= 1
wait = random.randint(5, 20)
if ("Rate limit reached" in tb_str) or ("Too Many Requests" in tb_str):
@@ -271,6 +312,8 @@ def request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency(
futures = [executor.submit(_req_gpt, index, inputs, history, sys_prompt) for index, inputs, history, sys_prompt in zip(
range(len(inputs_array)), inputs_array, history_array, sys_prompt_array)]
cnt = 0
while True:
# yield一次以刷新前端页面
time.sleep(refresh_interval)
@@ -283,12 +326,11 @@ def request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency(
mutable[thread_index][1] = time.time()
# 在前端打印些好玩的东西
for thread_index, _ in enumerate(worker_done):
print_something_really_funny = "[ ...`"+mutable[thread_index][0][-scroller_max_len:].\
replace('\n', '').replace('`', '.').replace(' ', '.').replace('<br/>', '.....').replace('$', '.')+"`... ]"
print_something_really_funny = f"[ ...`{scrolling_visual_effect(mutable[thread_index][0], scroller_max_len)}`... ]"
observe_win.append(print_something_really_funny)
# 在前端打印些好玩的东西
stat_str = ''.join([f'`{mutable[thread_index][2]}`: {obs}\n\n'
if not done else f'`{mutable[thread_index][2]}`\n\n'
stat_str = ''.join([f'`{mutable[thread_index][2]}`: {obs}\n\n'
if not done else f'`{mutable[thread_index][2]}`\n\n'
for thread_index, done, obs in zip(range(len(worker_done)), worker_done, observe_win)])
# 在前端打印些好玩的东西
chatbot[-1] = [chatbot[-1][0], f'多线程操作已经开始,完成情况: \n\n{stat_str}' + ''.join(['.']*(cnt % 10+1))]
@@ -302,7 +344,7 @@ def request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency(
for inputs_show_user, f in zip(inputs_show_user_array, futures):
gpt_res = f.result()
gpt_response_collection.extend([inputs_show_user, gpt_res])
# 是否在结束时,在界面上显示结果
if show_user_at_complete:
for inputs_show_user, f in zip(inputs_show_user_array, futures):
@@ -337,7 +379,7 @@ def read_and_clean_pdf_text(fp):
import fitz, copy
import re
import numpy as np
from colorful import print亮黄, print亮绿
# from shared_utils.colorful import print亮黄, print亮绿
fc = 0 # Index 0 文本
fs = 1 # Index 1 字体
fb = 2 # Index 2 框框
@@ -347,12 +389,12 @@ def read_and_clean_pdf_text(fp):
"""
提取文本块主字体
"""
fsize_statiscs = {}
fsize_statistics = {}
for wtf in l['spans']:
if wtf['size'] not in fsize_statiscs: fsize_statiscs[wtf['size']] = 0
fsize_statiscs[wtf['size']] += len(wtf['text'])
return max(fsize_statiscs, key=fsize_statiscs.get)
if wtf['size'] not in fsize_statistics: fsize_statistics[wtf['size']] = 0
fsize_statistics[wtf['size']] += len(wtf['text'])
return max(fsize_statistics, key=fsize_statistics.get)
def ffsize_same(a,b):
"""
提取字体大小是否近似相等
@@ -388,14 +430,14 @@ def read_and_clean_pdf_text(fp):
if index == 0:
page_one_meta = [" ".join(["".join([wtf['text'] for wtf in l['spans']]) for l in t['lines']]).replace(
'- ', '') for t in text_areas['blocks'] if 'lines' in t]
############################## <第 2 步,获取正文主字体> ##################################
try:
fsize_statiscs = {}
fsize_statistics = {}
for span in meta_span:
if span[1] not in fsize_statiscs: fsize_statiscs[span[1]] = 0
fsize_statiscs[span[1]] += span[2]
main_fsize = max(fsize_statiscs, key=fsize_statiscs.get)
if span[1] not in fsize_statistics: fsize_statistics[span[1]] = 0
fsize_statistics[span[1]] += span[2]
main_fsize = max(fsize_statistics, key=fsize_statistics.get)
if REMOVE_FOOT_NOTE:
give_up_fize_threshold = main_fsize * REMOVE_FOOT_FFSIZE_PERCENT
except:
@@ -404,7 +446,7 @@ def read_and_clean_pdf_text(fp):
mega_sec = []
sec = []
for index, line in enumerate(meta_line):
if index == 0:
if index == 0:
sec.append(line[fc])
continue
if REMOVE_FOOT_NOTE:
@@ -501,12 +543,12 @@ def get_files_from_everything(txt, type): # type='.md'
"""
这个函数是用来获取指定目录下所有指定类型(如.md的文件并且对于网络上的文件也可以获取它。
下面是对每个参数和返回值的说明:
参数
- txt: 路径或网址,表示要搜索的文件或者文件夹路径或网络上的文件。
参数
- txt: 路径或网址,表示要搜索的文件或者文件夹路径或网络上的文件。
- type: 字符串,表示要搜索的文件类型。默认是.md。
返回值
- success: 布尔值,表示函数是否成功执行。
- file_manifest: 文件路径列表,里面包含以指定类型为后缀名的所有文件的绝对路径。
返回值
- success: 布尔值,表示函数是否成功执行。
- file_manifest: 文件路径列表,里面包含以指定类型为后缀名的所有文件的绝对路径。
- project_folder: 字符串,表示文件所在的文件夹路径。如果是网络上的文件,就是临时文件夹的路径。
该函数详细注释已添加,请确认是否满足您的需要。
"""
@@ -554,23 +596,23 @@ class nougat_interface():
def nougat_with_timeout(self, command, cwd, timeout=3600):
import subprocess
from toolbox import ProxyNetworkActivate
logging.info(f'正在执行命令 {command}')
logger.info(f'正在执行命令 {command}')
with ProxyNetworkActivate("Nougat_Download"):
process = subprocess.Popen(command, shell=True, cwd=cwd, env=os.environ)
process = subprocess.Popen(command, shell=False, cwd=cwd, env=os.environ)
try:
stdout, stderr = process.communicate(timeout=timeout)
except subprocess.TimeoutExpired:
process.kill()
stdout, stderr = process.communicate()
print("Process timed out!")
logger.error("Process timed out!")
return False
return True
def NOUGAT_parse_pdf(self, fp, chatbot, history):
from toolbox import update_ui_lastest_msg
from toolbox import update_ui_latest_msg
yield from update_ui_lastest_msg("正在解析论文, 请稍候。进度:正在排队, 等待线程锁...",
yield from update_ui_latest_msg("正在解析论文, 请稍候。进度:正在排队, 等待线程锁...",
chatbot=chatbot, history=history, delay=0)
self.threadLock.acquire()
import glob, threading, os
@@ -578,9 +620,10 @@ class nougat_interface():
dst = os.path.join(get_log_folder(plugin_name='nougat'), gen_time_str())
os.makedirs(dst)
yield from update_ui_lastest_msg("正在解析论文, 请稍候。进度正在加载NOUGAT... 提示首次运行需要花费较长时间下载NOUGAT参数",
yield from update_ui_latest_msg("正在解析论文, 请稍候。进度正在加载NOUGAT... 提示首次运行需要花费较长时间下载NOUGAT参数",
chatbot=chatbot, history=history, delay=0)
self.nougat_with_timeout(f'nougat --out "{os.path.abspath(dst)}" "{os.path.abspath(fp)}"', os.getcwd(), timeout=3600)
command = ['nougat', '--out', os.path.abspath(dst), os.path.abspath(fp)]
self.nougat_with_timeout(command, cwd=os.getcwd(), timeout=3600)
res = glob.glob(os.path.join(dst,'*.mmd'))
if len(res) == 0:
self.threadLock.release()

View File

@@ -1,8 +1,9 @@
import os
from textwrap import indent
from loguru import logger
class FileNode:
def __init__(self, name):
def __init__(self, name, build_manifest=False):
self.name = name
self.children = []
self.is_leaf = False
@@ -10,7 +11,9 @@ class FileNode:
self.parenting_ship = []
self.comment = ""
self.comment_maxlen_show = 50
self.build_manifest = build_manifest
self.manifest = {}
@staticmethod
def add_linebreaks_at_spaces(string, interval=10):
return '\n'.join(string[i:i+interval] for i in range(0, len(string), interval))
@@ -29,6 +32,7 @@ class FileNode:
level = 1
if directory_names == "":
new_node = FileNode(file_name)
self.manifest[file_path] = new_node
current_node.children.append(new_node)
new_node.is_leaf = True
new_node.comment = self.sanitize_comment(file_comment)
@@ -50,13 +54,14 @@ class FileNode:
new_node.level = level - 1
current_node = new_node
term = FileNode(file_name)
self.manifest[file_path] = term
term.level = level
term.comment = self.sanitize_comment(file_comment)
term.is_leaf = True
current_node.children.append(term)
def print_files_recursively(self, level=0, code="R0"):
print(' '*level + self.name + ' ' + str(self.is_leaf) + ' ' + str(self.level))
logger.info(' '*level + self.name + ' ' + str(self.is_leaf) + ' ' + str(self.level))
for j, child in enumerate(self.children):
child.print_files_recursively(level=level+1, code=code+str(j))
self.parenting_ship.extend(child.parenting_ship)
@@ -119,4 +124,4 @@ if __name__ == "__main__":
"用于加载和分割文件中的文本的通用文件加载器用于加载和分割文件中的文本的通用文件加载器用于加载和分割文件中的文本的通用文件加载器",
"包含了用于构建和管理向量数据库的函数和类包含了用于构建和管理向量数据库的函数和类包含了用于构建和管理向量数据库的函数和类",
]
print(build_file_tree_mermaid_diagram(file_manifest, file_comments, "项目文件树"))
logger.info(build_file_tree_mermaid_diagram(file_manifest, file_comments, "项目文件树"))

View File

@@ -0,0 +1,812 @@
import os
import time
from abc import ABC, abstractmethod
from datetime import datetime
from docx import Document
from docx.enum.style import WD_STYLE_TYPE
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT, WD_LINE_SPACING
from docx.oxml.ns import qn
from docx.shared import Inches, Cm
from docx.shared import Pt, RGBColor, Inches
from typing import Dict, List, Tuple
import markdown
from crazy_functions.doc_fns.conversation_doc.word_doc import convert_markdown_to_word
class DocumentFormatter(ABC):
"""文档格式化基类,定义文档格式化的基本接口"""
def __init__(self, final_summary: str, file_summaries_map: Dict, failed_files: List[Tuple]):
self.final_summary = final_summary
self.file_summaries_map = file_summaries_map
self.failed_files = failed_files
@abstractmethod
def format_failed_files(self) -> str:
"""格式化失败文件列表"""
pass
@abstractmethod
def format_file_summaries(self) -> str:
"""格式化文件总结内容"""
pass
@abstractmethod
def create_document(self) -> str:
"""创建完整文档"""
pass
class WordFormatter(DocumentFormatter):
"""Word格式文档生成器 - 符合中国政府公文格式规范(GB/T 9704-2012),并进行了优化"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.doc = Document()
self._setup_document()
self._create_styles()
# 初始化三级标题编号系统
self.numbers = {
1: 0, # 一级标题编号
2: 0, # 二级标题编号
3: 0 # 三级标题编号
}
def _setup_document(self):
"""设置文档基本格式,包括页面设置和页眉"""
sections = self.doc.sections
for section in sections:
# 设置页面大小为A4
section.page_width = Cm(21)
section.page_height = Cm(29.7)
# 设置页边距
section.top_margin = Cm(3.7) # 上边距37mm
section.bottom_margin = Cm(3.5) # 下边距35mm
section.left_margin = Cm(2.8) # 左边距28mm
section.right_margin = Cm(2.6) # 右边距26mm
# 设置页眉页脚距离
section.header_distance = Cm(2.0)
section.footer_distance = Cm(2.0)
# 添加页眉
header = section.header
header_para = header.paragraphs[0]
header_para.alignment = WD_PARAGRAPH_ALIGNMENT.RIGHT
header_run = header_para.add_run("该文档由GPT-academic生成")
header_run.font.name = '仿宋'
header_run._element.rPr.rFonts.set(qn('w:eastAsia'), '仿宋')
header_run.font.size = Pt(9)
def _create_styles(self):
"""创建文档样式"""
# 创建正文样式
style = self.doc.styles.add_style('Normal_Custom', WD_STYLE_TYPE.PARAGRAPH)
style.font.name = '仿宋'
style._element.rPr.rFonts.set(qn('w:eastAsia'), '仿宋')
style.font.size = Pt(14)
style.paragraph_format.line_spacing_rule = WD_LINE_SPACING.ONE_POINT_FIVE
style.paragraph_format.space_after = Pt(0)
style.paragraph_format.first_line_indent = Pt(28)
# 创建各级标题样式
self._create_heading_style('Title_Custom', '方正小标宋简体', 32, WD_PARAGRAPH_ALIGNMENT.CENTER)
self._create_heading_style('Heading1_Custom', '黑体', 22, WD_PARAGRAPH_ALIGNMENT.LEFT)
self._create_heading_style('Heading2_Custom', '黑体', 18, WD_PARAGRAPH_ALIGNMENT.LEFT)
self._create_heading_style('Heading3_Custom', '黑体', 16, WD_PARAGRAPH_ALIGNMENT.LEFT)
def _create_heading_style(self, style_name: str, font_name: str, font_size: int, alignment):
"""创建标题样式"""
style = self.doc.styles.add_style(style_name, WD_STYLE_TYPE.PARAGRAPH)
style.font.name = font_name
style._element.rPr.rFonts.set(qn('w:eastAsia'), font_name)
style.font.size = Pt(font_size)
style.font.bold = True
style.paragraph_format.alignment = alignment
style.paragraph_format.space_before = Pt(12)
style.paragraph_format.space_after = Pt(12)
style.paragraph_format.line_spacing_rule = WD_LINE_SPACING.ONE_POINT_FIVE
return style
def _get_heading_number(self, level: int) -> str:
"""
生成标题编号
Args:
level: 标题级别 (0-3)
Returns:
str: 格式化的标题编号
"""
if level == 0: # 主标题不需要编号
return ""
self.numbers[level] += 1 # 增加当前级别的编号
# 重置下级标题编号
for i in range(level + 1, 4):
self.numbers[i] = 0
# 根据级别返回不同格式的编号
if level == 1:
return f"{self.numbers[1]}. "
elif level == 2:
return f"{self.numbers[1]}.{self.numbers[2]} "
elif level == 3:
return f"{self.numbers[1]}.{self.numbers[2]}.{self.numbers[3]} "
return ""
def _add_heading(self, text: str, level: int):
"""
添加带编号的标题
Args:
text: 标题文本
level: 标题级别 (0-3)
"""
style_map = {
0: 'Title_Custom',
1: 'Heading1_Custom',
2: 'Heading2_Custom',
3: 'Heading3_Custom'
}
number = self._get_heading_number(level)
paragraph = self.doc.add_paragraph(style=style_map[level])
if number:
number_run = paragraph.add_run(number)
font_size = 22 if level == 1 else (18 if level == 2 else 16)
self._get_run_style(number_run, '黑体', font_size, True)
text_run = paragraph.add_run(text)
font_size = 32 if level == 0 else (22 if level == 1 else (18 if level == 2 else 16))
self._get_run_style(text_run, '黑体', font_size, True)
# 主标题添加日期
if level == 0:
date_paragraph = self.doc.add_paragraph()
date_paragraph.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER
date_run = date_paragraph.add_run(datetime.now().strftime('%Y年%m月%d'))
self._get_run_style(date_run, '仿宋', 16, False)
return paragraph
def _get_run_style(self, run, font_name: str, font_size: int, bold: bool = False):
"""设置文本运行对象的样式"""
run.font.name = font_name
run._element.rPr.rFonts.set(qn('w:eastAsia'), font_name)
run.font.size = Pt(font_size)
run.font.bold = bold
def format_failed_files(self) -> str:
"""格式化失败文件列表"""
result = []
if not self.failed_files:
return "\n".join(result)
result.append("处理失败文件:")
for fp, reason in self.failed_files:
result.append(f"{os.path.basename(fp)}: {reason}")
self._add_heading("处理失败文件", 1)
for fp, reason in self.failed_files:
self._add_content(f"{os.path.basename(fp)}: {reason}", indent=False)
self.doc.add_paragraph()
return "\n".join(result)
def _add_content(self, text: str, indent: bool = True):
"""添加正文内容使用convert_markdown_to_word处理文本"""
# 使用convert_markdown_to_word处理markdown文本
processed_text = convert_markdown_to_word(text)
paragraph = self.doc.add_paragraph(processed_text, style='Normal_Custom')
if not indent:
paragraph.paragraph_format.first_line_indent = Pt(0)
return paragraph
def format_file_summaries(self) -> str:
"""
格式化文件总结内容确保正确的标题层级并处理markdown文本
"""
result = []
# 首先对文件路径进行分组整理
file_groups = {}
for path in sorted(self.file_summaries_map.keys()):
dir_path = os.path.dirname(path)
if dir_path not in file_groups:
file_groups[dir_path] = []
file_groups[dir_path].append(path)
# 处理没有目录的文件
root_files = file_groups.get("", [])
if root_files:
for path in sorted(root_files):
file_name = os.path.basename(path)
result.append(f"\n📄 {file_name}")
result.append(self.file_summaries_map[path])
# 无目录的文件作为二级标题
self._add_heading(f"📄 {file_name}", 2)
# 使用convert_markdown_to_word处理文件内容
self._add_content(convert_markdown_to_word(self.file_summaries_map[path]))
self.doc.add_paragraph()
# 处理有目录的文件
for dir_path in sorted(file_groups.keys()):
if dir_path == "": # 跳过已处理的根目录文件
continue
# 添加目录作为二级标题
result.append(f"\n📁 {dir_path}")
self._add_heading(f"📁 {dir_path}", 2)
# 该目录下的所有文件作为三级标题
for path in sorted(file_groups[dir_path]):
file_name = os.path.basename(path)
result.append(f"\n📄 {file_name}")
result.append(self.file_summaries_map[path])
# 添加文件名作为三级标题
self._add_heading(f"📄 {file_name}", 3)
# 使用convert_markdown_to_word处理文件内容
self._add_content(convert_markdown_to_word(self.file_summaries_map[path]))
self.doc.add_paragraph()
return "\n".join(result)
def create_document(self):
"""创建完整Word文档并返回文档对象"""
# 重置所有编号
for level in self.numbers:
self.numbers[level] = 0
# 添加主标题
self._add_heading("文档总结报告", 0)
self.doc.add_paragraph()
# 添加总体摘要使用convert_markdown_to_word处理
self._add_heading("总体摘要", 1)
self._add_content(convert_markdown_to_word(self.final_summary))
self.doc.add_paragraph()
# 添加失败文件列表(如果有)
if self.failed_files:
self.format_failed_files()
# 添加文件详细总结
self._add_heading("各文件详细总结", 1)
self.format_file_summaries()
return self.doc
def save_as_pdf(self, word_path, pdf_path=None):
"""将生成的Word文档转换为PDF
参数:
word_path: Word文档的路径
pdf_path: 可选PDF文件的输出路径。如果未指定将使用与Word文档相同的名称和位置
返回:
生成的PDF文件路径如果转换失败则返回None
"""
from crazy_functions.doc_fns.conversation_doc.word2pdf import WordToPdfConverter
try:
pdf_path = WordToPdfConverter.convert_to_pdf(word_path, pdf_path)
return pdf_path
except Exception as e:
print(f"PDF转换失败: {str(e)}")
return None
class MarkdownFormatter(DocumentFormatter):
"""Markdown格式文档生成器"""
def format_failed_files(self) -> str:
if not self.failed_files:
return ""
formatted_text = ["\n## ⚠️ 处理失败的文件"]
for fp, reason in self.failed_files:
formatted_text.append(f"- {os.path.basename(fp)}: {reason}")
formatted_text.append("\n---")
return "\n".join(formatted_text)
def format_file_summaries(self) -> str:
formatted_text = []
sorted_paths = sorted(self.file_summaries_map.keys())
current_dir = ""
for path in sorted_paths:
dir_path = os.path.dirname(path)
if dir_path != current_dir:
if dir_path:
formatted_text.append(f"\n## 📁 {dir_path}")
current_dir = dir_path
file_name = os.path.basename(path)
formatted_text.append(f"\n### 📄 {file_name}")
formatted_text.append(self.file_summaries_map[path])
formatted_text.append("\n---")
return "\n".join(formatted_text)
def create_document(self) -> str:
document = [
"# 📑 文档总结报告",
"\n## 总体摘要",
self.final_summary
]
if self.failed_files:
document.append(self.format_failed_files())
document.extend([
"\n# 📚 各文件详细总结",
self.format_file_summaries()
])
return "\n".join(document)
class HtmlFormatter(DocumentFormatter):
"""HTML格式文档生成器 - 优化版"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.md = markdown.Markdown(extensions=['extra','codehilite', 'tables','nl2br'])
self.css_styles = """
@keyframes fadeIn {
from { opacity: 0; transform: translateY(20px); }
to { opacity: 1; transform: translateY(0); }
}
@keyframes slideIn {
from { transform: translateX(-20px); opacity: 0; }
to { transform: translateX(0); opacity: 1; }
}
@keyframes pulse {
0% { transform: scale(1); }
50% { transform: scale(1.05); }
100% { transform: scale(1); }
}
:root {
/* Enhanced color palette */
--primary-color: #2563eb;
--primary-light: #eff6ff;
--secondary-color: #1e293b;
--background-color: #f8fafc;
--text-color: #334155;
--text-light: #64748b;
--border-color: #e2e8f0;
--error-color: #ef4444;
--error-light: #fef2f2;
--success-color: #22c55e;
--warning-color: #f59e0b;
--card-shadow: 0 4px 6px -1px rgb(0 0 0 / 0.1), 0 2px 4px -2px rgb(0 0 0 / 0.1);
--hover-shadow: 0 20px 25px -5px rgb(0 0 0 / 0.1), 0 8px 10px -6px rgb(0 0 0 / 0.1);
/* Typography */
--heading-font: "Plus Jakarta Sans", system-ui, sans-serif;
--body-font: "Inter", system-ui, sans-serif;
}
body {
font-family: var(--body-font);
line-height: 1.8;
max-width: 1200px;
margin: 0 auto;
padding: 2rem;
color: var(--text-color);
background-color: var(--background-color);
font-size: 16px;
-webkit-font-smoothing: antialiased;
}
.container {
background: white;
padding: 3rem;
border-radius: 24px;
box-shadow: var(--card-shadow);
transition: all 0.4s cubic-bezier(0.4, 0, 0.2, 1);
animation: fadeIn 0.6s ease-out;
border: 1px solid var(--border-color);
}
.container:hover {
box-shadow: var(--hover-shadow);
transform: translateY(-2px);
}
h1, h2, h3 {
font-family: var(--heading-font);
font-weight: 600;
}
h1 {
color: var(--primary-color);
font-size: 2.8em;
text-align: center;
margin: 2rem 0 3rem;
padding-bottom: 1.5rem;
border-bottom: 3px solid var(--primary-color);
letter-spacing: -0.03em;
position: relative;
display: flex;
align-items: center;
justify-content: center;
gap: 1rem;
}
h1::after {
content: '';
position: absolute;
bottom: -3px;
left: 50%;
transform: translateX(-50%);
width: 120px;
height: 3px;
background: linear-gradient(90deg, var(--primary-color), var(--primary-light));
border-radius: 3px;
transition: width 0.3s ease;
}
h1:hover::after {
width: 180px;
}
h2 {
color: var(--secondary-color);
font-size: 1.9em;
margin: 2.5rem 0 1.5rem;
padding-left: 1.2rem;
border-left: 4px solid var(--primary-color);
letter-spacing: -0.02em;
display: flex;
align-items: center;
gap: 1rem;
transition: all 0.3s ease;
}
h2:hover {
color: var(--primary-color);
transform: translateX(5px);
}
h3 {
color: var(--text-color);
font-size: 1.5em;
margin: 2rem 0 1rem;
padding-bottom: 0.8rem;
border-bottom: 2px solid var(--border-color);
transition: all 0.3s ease;
display: flex;
align-items: center;
gap: 0.8rem;
}
h3:hover {
color: var(--primary-color);
border-bottom-color: var(--primary-color);
}
.summary {
background: var(--primary-light);
padding: 2.5rem;
border-radius: 16px;
margin: 2.5rem 0;
box-shadow: 0 4px 6px -1px rgba(37, 99, 235, 0.1);
position: relative;
overflow: hidden;
transition: transform 0.3s ease, box-shadow 0.3s ease;
animation: slideIn 0.5s ease-out;
}
.summary:hover {
transform: translateY(-3px);
box-shadow: 0 8px 12px -2px rgba(37, 99, 235, 0.15);
}
.summary::before {
content: '';
position: absolute;
top: 0;
left: 0;
width: 4px;
height: 100%;
background: linear-gradient(to bottom, var(--primary-color), rgba(37, 99, 235, 0.6));
}
.summary p {
margin: 1.2rem 0;
line-height: 1.9;
color: var(--text-color);
transition: color 0.3s ease;
}
.summary:hover p {
color: var(--secondary-color);
}
.details {
margin-top: 3.5rem;
padding-top: 2.5rem;
border-top: 2px dashed var(--border-color);
animation: fadeIn 0.8s ease-out;
}
.failed-files {
background: var(--error-light);
padding: 2rem;
border-radius: 16px;
margin: 3rem 0;
border-left: 4px solid var(--error-color);
position: relative;
transition: all 0.3s ease;
animation: slideIn 0.5s ease-out;
}
.failed-files:hover {
transform: translateX(5px);
box-shadow: 0 8px 15px -3px rgba(239, 68, 68, 0.1);
}
.failed-files h2 {
color: var(--error-color);
border-left: none;
padding-left: 0;
}
.failed-files ul {
margin: 1.8rem 0;
padding-left: 1.2rem;
list-style-type: none;
}
.failed-files li {
margin: 1.2rem 0;
padding: 1.2rem 1.8rem;
background: rgba(239, 68, 68, 0.08);
border-radius: 12px;
transition: all 0.3s cubic-bezier(0.4, 0, 0.2, 1);
}
.failed-files li:hover {
transform: translateX(8px);
background: rgba(239, 68, 68, 0.12);
}
.directory-section {
margin: 3.5rem 0;
padding: 2rem;
background: var(--background-color);
border-radius: 16px;
position: relative;
transition: all 0.3s ease;
animation: fadeIn 0.6s ease-out;
}
.directory-section:hover {
background: white;
box-shadow: var(--card-shadow);
}
.file-summary {
background: white;
padding: 2rem;
margin: 1.8rem 0;
border-radius: 16px;
box-shadow: var(--card-shadow);
border-left: 4px solid var(--border-color);
transition: all 0.4s cubic-bezier(0.4, 0, 0.2, 1);
position: relative;
overflow: hidden;
}
.file-summary:hover {
border-left-color: var(--primary-color);
transform: translateX(8px) translateY(-2px);
box-shadow: var(--hover-shadow);
}
.file-summary {
background: white;
padding: 2rem;
margin: 1.8rem 0;
border-radius: 16px;
box-shadow: var(--card-shadow);
border-left: 4px solid var(--border-color);
transition: all 0.4s cubic-bezier(0.4, 0, 0.2, 1);
position: relative;
}
.file-summary:hover {
border-left-color: var(--primary-color);
transform: translateX(8px) translateY(-2px);
box-shadow: var(--hover-shadow);
}
.icon {
display: inline-flex;
align-items: center;
justify-content: center;
width: 32px;
height: 32px;
border-radius: 8px;
background: var(--primary-light);
color: var(--primary-color);
font-size: 1.2em;
transition: all 0.3s ease;
}
.file-summary:hover .icon,
.directory-section:hover .icon {
transform: scale(1.1);
background: var(--primary-color);
color: white;
}
/* Smooth scrolling */
html {
scroll-behavior: smooth;
}
/* Selection style */
::selection {
background: var(--primary-light);
color: var(--primary-color);
}
/* Print styles */
@media print {
body {
background: white;
}
.container {
box-shadow: none;
padding: 0;
}
.file-summary, .failed-files {
break-inside: avoid;
box-shadow: none;
}
.icon {
display: none;
}
}
/* Responsive design */
@media (max-width: 768px) {
body {
padding: 1rem;
font-size: 15px;
}
.container {
padding: 1.5rem;
}
h1 {
font-size: 2.2em;
margin: 1.5rem 0 2rem;
}
h2 {
font-size: 1.7em;
}
h3 {
font-size: 1.4em;
}
.summary, .failed-files, .directory-section {
padding: 1.5rem;
}
.file-summary {
padding: 1.2rem;
}
.icon {
width: 28px;
height: 28px;
}
}
/* Dark mode support */
@media (prefers-color-scheme: dark) {
:root {
--primary-light: rgba(37, 99, 235, 0.15);
--background-color: #0f172a;
--text-color: #e2e8f0;
--text-light: #94a3b8;
--border-color: #1e293b;
--error-light: rgba(239, 68, 68, 0.15);
}
.container, .file-summary {
background: #1e293b;
}
.directory-section {
background: #0f172a;
}
.directory-section:hover {
background: #1e293b;
}
}
"""
def format_failed_files(self) -> str:
if not self.failed_files:
return ""
failed_files_html = ['<div class="failed-files">']
failed_files_html.append('<h2><span class="icon">⚠️</span> 处理失败的文件</h2>')
failed_files_html.append("<ul>")
for fp, reason in self.failed_files:
failed_files_html.append(
f'<li><strong>📄 {os.path.basename(fp)}</strong><br><span style="color: var(--text-light)">{reason}</span></li>'
)
failed_files_html.append("</ul></div>")
return "\n".join(failed_files_html)
def format_file_summaries(self) -> str:
formatted_html = []
sorted_paths = sorted(self.file_summaries_map.keys())
current_dir = ""
for path in sorted_paths:
dir_path = os.path.dirname(path)
if dir_path != current_dir:
if dir_path:
formatted_html.append('<div class="directory-section">')
formatted_html.append(f'<h2><span class="icon">📁</span> {dir_path}</h2>')
formatted_html.append('</div>')
current_dir = dir_path
file_name = os.path.basename(path)
formatted_html.append('<div class="file-summary">')
formatted_html.append(f'<h3><span class="icon">📄</span> {file_name}</h3>')
formatted_html.append(self.md.convert(self.file_summaries_map[path]))
formatted_html.append('</div>')
return "\n".join(formatted_html)
def create_document(self) -> str:
"""生成HTML文档
Returns:
str: 完整的HTML文档字符串
"""
return f"""
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>文档总结报告</title>
<link href="https://cdnjs.cloudflare.com/ajax/libs/inter/3.19.3/inter.css" rel="stylesheet">
<link href="https://fonts.googleapis.com/css2?family=Plus+Jakarta+Sans:wght@400;600&display=swap" rel="stylesheet">
<style>{self.css_styles}</style>
</head>
<body>
<div class="container">
<h1><span class="icon">📑</span> 文档总结报告</h1>
<div class="summary">
<h2><span class="icon">📋</span> 总体摘要</h2>
<p>{self.md.convert(self.final_summary)}</p>
</div>
{self.format_failed_files()}
<div class="details">
<h2><span class="icon">📚</span> 各文件详细总结</h2>
{self.format_file_summaries()}
</div>
</div>
</body>
</html>
"""

View File

View File

@@ -0,0 +1,812 @@
import os
import time
from abc import ABC, abstractmethod
from datetime import datetime
from docx import Document
from docx.enum.style import WD_STYLE_TYPE
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT, WD_LINE_SPACING
from docx.oxml.ns import qn
from docx.shared import Inches, Cm
from docx.shared import Pt, RGBColor, Inches
from typing import Dict, List, Tuple
import markdown
from crazy_functions.doc_fns.conversation_doc.word_doc import convert_markdown_to_word
class DocumentFormatter(ABC):
"""文档格式化基类,定义文档格式化的基本接口"""
def __init__(self, final_summary: str, file_summaries_map: Dict, failed_files: List[Tuple]):
self.final_summary = final_summary
self.file_summaries_map = file_summaries_map
self.failed_files = failed_files
@abstractmethod
def format_failed_files(self) -> str:
"""格式化失败文件列表"""
pass
@abstractmethod
def format_file_summaries(self) -> str:
"""格式化文件总结内容"""
pass
@abstractmethod
def create_document(self) -> str:
"""创建完整文档"""
pass
class WordFormatter(DocumentFormatter):
"""Word格式文档生成器 - 符合中国政府公文格式规范(GB/T 9704-2012),并进行了优化"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.doc = Document()
self._setup_document()
self._create_styles()
# 初始化三级标题编号系统
self.numbers = {
1: 0, # 一级标题编号
2: 0, # 二级标题编号
3: 0 # 三级标题编号
}
def _setup_document(self):
"""设置文档基本格式,包括页面设置和页眉"""
sections = self.doc.sections
for section in sections:
# 设置页面大小为A4
section.page_width = Cm(21)
section.page_height = Cm(29.7)
# 设置页边距
section.top_margin = Cm(3.7) # 上边距37mm
section.bottom_margin = Cm(3.5) # 下边距35mm
section.left_margin = Cm(2.8) # 左边距28mm
section.right_margin = Cm(2.6) # 右边距26mm
# 设置页眉页脚距离
section.header_distance = Cm(2.0)
section.footer_distance = Cm(2.0)
# 添加页眉
header = section.header
header_para = header.paragraphs[0]
header_para.alignment = WD_PARAGRAPH_ALIGNMENT.RIGHT
header_run = header_para.add_run("该文档由GPT-academic生成")
header_run.font.name = '仿宋'
header_run._element.rPr.rFonts.set(qn('w:eastAsia'), '仿宋')
header_run.font.size = Pt(9)
def _create_styles(self):
"""创建文档样式"""
# 创建正文样式
style = self.doc.styles.add_style('Normal_Custom', WD_STYLE_TYPE.PARAGRAPH)
style.font.name = '仿宋'
style._element.rPr.rFonts.set(qn('w:eastAsia'), '仿宋')
style.font.size = Pt(14)
style.paragraph_format.line_spacing_rule = WD_LINE_SPACING.ONE_POINT_FIVE
style.paragraph_format.space_after = Pt(0)
style.paragraph_format.first_line_indent = Pt(28)
# 创建各级标题样式
self._create_heading_style('Title_Custom', '方正小标宋简体', 32, WD_PARAGRAPH_ALIGNMENT.CENTER)
self._create_heading_style('Heading1_Custom', '黑体', 22, WD_PARAGRAPH_ALIGNMENT.LEFT)
self._create_heading_style('Heading2_Custom', '黑体', 18, WD_PARAGRAPH_ALIGNMENT.LEFT)
self._create_heading_style('Heading3_Custom', '黑体', 16, WD_PARAGRAPH_ALIGNMENT.LEFT)
def _create_heading_style(self, style_name: str, font_name: str, font_size: int, alignment):
"""创建标题样式"""
style = self.doc.styles.add_style(style_name, WD_STYLE_TYPE.PARAGRAPH)
style.font.name = font_name
style._element.rPr.rFonts.set(qn('w:eastAsia'), font_name)
style.font.size = Pt(font_size)
style.font.bold = True
style.paragraph_format.alignment = alignment
style.paragraph_format.space_before = Pt(12)
style.paragraph_format.space_after = Pt(12)
style.paragraph_format.line_spacing_rule = WD_LINE_SPACING.ONE_POINT_FIVE
return style
def _get_heading_number(self, level: int) -> str:
"""
生成标题编号
Args:
level: 标题级别 (0-3)
Returns:
str: 格式化的标题编号
"""
if level == 0: # 主标题不需要编号
return ""
self.numbers[level] += 1 # 增加当前级别的编号
# 重置下级标题编号
for i in range(level + 1, 4):
self.numbers[i] = 0
# 根据级别返回不同格式的编号
if level == 1:
return f"{self.numbers[1]}. "
elif level == 2:
return f"{self.numbers[1]}.{self.numbers[2]} "
elif level == 3:
return f"{self.numbers[1]}.{self.numbers[2]}.{self.numbers[3]} "
return ""
def _add_heading(self, text: str, level: int):
"""
添加带编号的标题
Args:
text: 标题文本
level: 标题级别 (0-3)
"""
style_map = {
0: 'Title_Custom',
1: 'Heading1_Custom',
2: 'Heading2_Custom',
3: 'Heading3_Custom'
}
number = self._get_heading_number(level)
paragraph = self.doc.add_paragraph(style=style_map[level])
if number:
number_run = paragraph.add_run(number)
font_size = 22 if level == 1 else (18 if level == 2 else 16)
self._get_run_style(number_run, '黑体', font_size, True)
text_run = paragraph.add_run(text)
font_size = 32 if level == 0 else (22 if level == 1 else (18 if level == 2 else 16))
self._get_run_style(text_run, '黑体', font_size, True)
# 主标题添加日期
if level == 0:
date_paragraph = self.doc.add_paragraph()
date_paragraph.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER
date_run = date_paragraph.add_run(datetime.now().strftime('%Y年%m月%d'))
self._get_run_style(date_run, '仿宋', 16, False)
return paragraph
def _get_run_style(self, run, font_name: str, font_size: int, bold: bool = False):
"""设置文本运行对象的样式"""
run.font.name = font_name
run._element.rPr.rFonts.set(qn('w:eastAsia'), font_name)
run.font.size = Pt(font_size)
run.font.bold = bold
def format_failed_files(self) -> str:
"""格式化失败文件列表"""
result = []
if not self.failed_files:
return "\n".join(result)
result.append("处理失败文件:")
for fp, reason in self.failed_files:
result.append(f"{os.path.basename(fp)}: {reason}")
self._add_heading("处理失败文件", 1)
for fp, reason in self.failed_files:
self._add_content(f"{os.path.basename(fp)}: {reason}", indent=False)
self.doc.add_paragraph()
return "\n".join(result)
def _add_content(self, text: str, indent: bool = True):
"""添加正文内容使用convert_markdown_to_word处理文本"""
# 使用convert_markdown_to_word处理markdown文本
processed_text = convert_markdown_to_word(text)
paragraph = self.doc.add_paragraph(processed_text, style='Normal_Custom')
if not indent:
paragraph.paragraph_format.first_line_indent = Pt(0)
return paragraph
def format_file_summaries(self) -> str:
"""
格式化文件总结内容确保正确的标题层级并处理markdown文本
"""
result = []
# 首先对文件路径进行分组整理
file_groups = {}
for path in sorted(self.file_summaries_map.keys()):
dir_path = os.path.dirname(path)
if dir_path not in file_groups:
file_groups[dir_path] = []
file_groups[dir_path].append(path)
# 处理没有目录的文件
root_files = file_groups.get("", [])
if root_files:
for path in sorted(root_files):
file_name = os.path.basename(path)
result.append(f"\n📄 {file_name}")
result.append(self.file_summaries_map[path])
# 无目录的文件作为二级标题
self._add_heading(f"📄 {file_name}", 2)
# 使用convert_markdown_to_word处理文件内容
self._add_content(convert_markdown_to_word(self.file_summaries_map[path]))
self.doc.add_paragraph()
# 处理有目录的文件
for dir_path in sorted(file_groups.keys()):
if dir_path == "": # 跳过已处理的根目录文件
continue
# 添加目录作为二级标题
result.append(f"\n📁 {dir_path}")
self._add_heading(f"📁 {dir_path}", 2)
# 该目录下的所有文件作为三级标题
for path in sorted(file_groups[dir_path]):
file_name = os.path.basename(path)
result.append(f"\n📄 {file_name}")
result.append(self.file_summaries_map[path])
# 添加文件名作为三级标题
self._add_heading(f"📄 {file_name}", 3)
# 使用convert_markdown_to_word处理文件内容
self._add_content(convert_markdown_to_word(self.file_summaries_map[path]))
self.doc.add_paragraph()
return "\n".join(result)
def create_document(self):
"""创建完整Word文档并返回文档对象"""
# 重置所有编号
for level in self.numbers:
self.numbers[level] = 0
# 添加主标题
self._add_heading("文档总结报告", 0)
self.doc.add_paragraph()
# 添加总体摘要使用convert_markdown_to_word处理
self._add_heading("总体摘要", 1)
self._add_content(convert_markdown_to_word(self.final_summary))
self.doc.add_paragraph()
# 添加失败文件列表(如果有)
if self.failed_files:
self.format_failed_files()
# 添加文件详细总结
self._add_heading("各文件详细总结", 1)
self.format_file_summaries()
return self.doc
def save_as_pdf(self, word_path, pdf_path=None):
"""将生成的Word文档转换为PDF
参数:
word_path: Word文档的路径
pdf_path: 可选PDF文件的输出路径。如果未指定将使用与Word文档相同的名称和位置
返回:
生成的PDF文件路径如果转换失败则返回None
"""
from crazy_functions.doc_fns.conversation_doc.word2pdf import WordToPdfConverter
try:
pdf_path = WordToPdfConverter.convert_to_pdf(word_path, pdf_path)
return pdf_path
except Exception as e:
print(f"PDF转换失败: {str(e)}")
return None
class MarkdownFormatter(DocumentFormatter):
"""Markdown格式文档生成器"""
def format_failed_files(self) -> str:
if not self.failed_files:
return ""
formatted_text = ["\n## ⚠️ 处理失败的文件"]
for fp, reason in self.failed_files:
formatted_text.append(f"- {os.path.basename(fp)}: {reason}")
formatted_text.append("\n---")
return "\n".join(formatted_text)
def format_file_summaries(self) -> str:
formatted_text = []
sorted_paths = sorted(self.file_summaries_map.keys())
current_dir = ""
for path in sorted_paths:
dir_path = os.path.dirname(path)
if dir_path != current_dir:
if dir_path:
formatted_text.append(f"\n## 📁 {dir_path}")
current_dir = dir_path
file_name = os.path.basename(path)
formatted_text.append(f"\n### 📄 {file_name}")
formatted_text.append(self.file_summaries_map[path])
formatted_text.append("\n---")
return "\n".join(formatted_text)
def create_document(self) -> str:
document = [
"# 📑 文档总结报告",
"\n## 总体摘要",
self.final_summary
]
if self.failed_files:
document.append(self.format_failed_files())
document.extend([
"\n# 📚 各文件详细总结",
self.format_file_summaries()
])
return "\n".join(document)
class HtmlFormatter(DocumentFormatter):
"""HTML格式文档生成器 - 优化版"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.md = markdown.Markdown(extensions=['extra','codehilite', 'tables','nl2br'])
self.css_styles = """
@keyframes fadeIn {
from { opacity: 0; transform: translateY(20px); }
to { opacity: 1; transform: translateY(0); }
}
@keyframes slideIn {
from { transform: translateX(-20px); opacity: 0; }
to { transform: translateX(0); opacity: 1; }
}
@keyframes pulse {
0% { transform: scale(1); }
50% { transform: scale(1.05); }
100% { transform: scale(1); }
}
:root {
/* Enhanced color palette */
--primary-color: #2563eb;
--primary-light: #eff6ff;
--secondary-color: #1e293b;
--background-color: #f8fafc;
--text-color: #334155;
--text-light: #64748b;
--border-color: #e2e8f0;
--error-color: #ef4444;
--error-light: #fef2f2;
--success-color: #22c55e;
--warning-color: #f59e0b;
--card-shadow: 0 4px 6px -1px rgb(0 0 0 / 0.1), 0 2px 4px -2px rgb(0 0 0 / 0.1);
--hover-shadow: 0 20px 25px -5px rgb(0 0 0 / 0.1), 0 8px 10px -6px rgb(0 0 0 / 0.1);
/* Typography */
--heading-font: "Plus Jakarta Sans", system-ui, sans-serif;
--body-font: "Inter", system-ui, sans-serif;
}
body {
font-family: var(--body-font);
line-height: 1.8;
max-width: 1200px;
margin: 0 auto;
padding: 2rem;
color: var(--text-color);
background-color: var(--background-color);
font-size: 16px;
-webkit-font-smoothing: antialiased;
}
.container {
background: white;
padding: 3rem;
border-radius: 24px;
box-shadow: var(--card-shadow);
transition: all 0.4s cubic-bezier(0.4, 0, 0.2, 1);
animation: fadeIn 0.6s ease-out;
border: 1px solid var(--border-color);
}
.container:hover {
box-shadow: var(--hover-shadow);
transform: translateY(-2px);
}
h1, h2, h3 {
font-family: var(--heading-font);
font-weight: 600;
}
h1 {
color: var(--primary-color);
font-size: 2.8em;
text-align: center;
margin: 2rem 0 3rem;
padding-bottom: 1.5rem;
border-bottom: 3px solid var(--primary-color);
letter-spacing: -0.03em;
position: relative;
display: flex;
align-items: center;
justify-content: center;
gap: 1rem;
}
h1::after {
content: '';
position: absolute;
bottom: -3px;
left: 50%;
transform: translateX(-50%);
width: 120px;
height: 3px;
background: linear-gradient(90deg, var(--primary-color), var(--primary-light));
border-radius: 3px;
transition: width 0.3s ease;
}
h1:hover::after {
width: 180px;
}
h2 {
color: var(--secondary-color);
font-size: 1.9em;
margin: 2.5rem 0 1.5rem;
padding-left: 1.2rem;
border-left: 4px solid var(--primary-color);
letter-spacing: -0.02em;
display: flex;
align-items: center;
gap: 1rem;
transition: all 0.3s ease;
}
h2:hover {
color: var(--primary-color);
transform: translateX(5px);
}
h3 {
color: var(--text-color);
font-size: 1.5em;
margin: 2rem 0 1rem;
padding-bottom: 0.8rem;
border-bottom: 2px solid var(--border-color);
transition: all 0.3s ease;
display: flex;
align-items: center;
gap: 0.8rem;
}
h3:hover {
color: var(--primary-color);
border-bottom-color: var(--primary-color);
}
.summary {
background: var(--primary-light);
padding: 2.5rem;
border-radius: 16px;
margin: 2.5rem 0;
box-shadow: 0 4px 6px -1px rgba(37, 99, 235, 0.1);
position: relative;
overflow: hidden;
transition: transform 0.3s ease, box-shadow 0.3s ease;
animation: slideIn 0.5s ease-out;
}
.summary:hover {
transform: translateY(-3px);
box-shadow: 0 8px 12px -2px rgba(37, 99, 235, 0.15);
}
.summary::before {
content: '';
position: absolute;
top: 0;
left: 0;
width: 4px;
height: 100%;
background: linear-gradient(to bottom, var(--primary-color), rgba(37, 99, 235, 0.6));
}
.summary p {
margin: 1.2rem 0;
line-height: 1.9;
color: var(--text-color);
transition: color 0.3s ease;
}
.summary:hover p {
color: var(--secondary-color);
}
.details {
margin-top: 3.5rem;
padding-top: 2.5rem;
border-top: 2px dashed var(--border-color);
animation: fadeIn 0.8s ease-out;
}
.failed-files {
background: var(--error-light);
padding: 2rem;
border-radius: 16px;
margin: 3rem 0;
border-left: 4px solid var(--error-color);
position: relative;
transition: all 0.3s ease;
animation: slideIn 0.5s ease-out;
}
.failed-files:hover {
transform: translateX(5px);
box-shadow: 0 8px 15px -3px rgba(239, 68, 68, 0.1);
}
.failed-files h2 {
color: var(--error-color);
border-left: none;
padding-left: 0;
}
.failed-files ul {
margin: 1.8rem 0;
padding-left: 1.2rem;
list-style-type: none;
}
.failed-files li {
margin: 1.2rem 0;
padding: 1.2rem 1.8rem;
background: rgba(239, 68, 68, 0.08);
border-radius: 12px;
transition: all 0.3s cubic-bezier(0.4, 0, 0.2, 1);
}
.failed-files li:hover {
transform: translateX(8px);
background: rgba(239, 68, 68, 0.12);
}
.directory-section {
margin: 3.5rem 0;
padding: 2rem;
background: var(--background-color);
border-radius: 16px;
position: relative;
transition: all 0.3s ease;
animation: fadeIn 0.6s ease-out;
}
.directory-section:hover {
background: white;
box-shadow: var(--card-shadow);
}
.file-summary {
background: white;
padding: 2rem;
margin: 1.8rem 0;
border-radius: 16px;
box-shadow: var(--card-shadow);
border-left: 4px solid var(--border-color);
transition: all 0.4s cubic-bezier(0.4, 0, 0.2, 1);
position: relative;
overflow: hidden;
}
.file-summary:hover {
border-left-color: var(--primary-color);
transform: translateX(8px) translateY(-2px);
box-shadow: var(--hover-shadow);
}
.file-summary {
background: white;
padding: 2rem;
margin: 1.8rem 0;
border-radius: 16px;
box-shadow: var(--card-shadow);
border-left: 4px solid var(--border-color);
transition: all 0.4s cubic-bezier(0.4, 0, 0.2, 1);
position: relative;
}
.file-summary:hover {
border-left-color: var(--primary-color);
transform: translateX(8px) translateY(-2px);
box-shadow: var(--hover-shadow);
}
.icon {
display: inline-flex;
align-items: center;
justify-content: center;
width: 32px;
height: 32px;
border-radius: 8px;
background: var(--primary-light);
color: var(--primary-color);
font-size: 1.2em;
transition: all 0.3s ease;
}
.file-summary:hover .icon,
.directory-section:hover .icon {
transform: scale(1.1);
background: var(--primary-color);
color: white;
}
/* Smooth scrolling */
html {
scroll-behavior: smooth;
}
/* Selection style */
::selection {
background: var(--primary-light);
color: var(--primary-color);
}
/* Print styles */
@media print {
body {
background: white;
}
.container {
box-shadow: none;
padding: 0;
}
.file-summary, .failed-files {
break-inside: avoid;
box-shadow: none;
}
.icon {
display: none;
}
}
/* Responsive design */
@media (max-width: 768px) {
body {
padding: 1rem;
font-size: 15px;
}
.container {
padding: 1.5rem;
}
h1 {
font-size: 2.2em;
margin: 1.5rem 0 2rem;
}
h2 {
font-size: 1.7em;
}
h3 {
font-size: 1.4em;
}
.summary, .failed-files, .directory-section {
padding: 1.5rem;
}
.file-summary {
padding: 1.2rem;
}
.icon {
width: 28px;
height: 28px;
}
}
/* Dark mode support */
@media (prefers-color-scheme: dark) {
:root {
--primary-light: rgba(37, 99, 235, 0.15);
--background-color: #0f172a;
--text-color: #e2e8f0;
--text-light: #94a3b8;
--border-color: #1e293b;
--error-light: rgba(239, 68, 68, 0.15);
}
.container, .file-summary {
background: #1e293b;
}
.directory-section {
background: #0f172a;
}
.directory-section:hover {
background: #1e293b;
}
}
"""
def format_failed_files(self) -> str:
if not self.failed_files:
return ""
failed_files_html = ['<div class="failed-files">']
failed_files_html.append('<h2><span class="icon">⚠️</span> 处理失败的文件</h2>')
failed_files_html.append("<ul>")
for fp, reason in self.failed_files:
failed_files_html.append(
f'<li><strong>📄 {os.path.basename(fp)}</strong><br><span style="color: var(--text-light)">{reason}</span></li>'
)
failed_files_html.append("</ul></div>")
return "\n".join(failed_files_html)
def format_file_summaries(self) -> str:
formatted_html = []
sorted_paths = sorted(self.file_summaries_map.keys())
current_dir = ""
for path in sorted_paths:
dir_path = os.path.dirname(path)
if dir_path != current_dir:
if dir_path:
formatted_html.append('<div class="directory-section">')
formatted_html.append(f'<h2><span class="icon">📁</span> {dir_path}</h2>')
formatted_html.append('</div>')
current_dir = dir_path
file_name = os.path.basename(path)
formatted_html.append('<div class="file-summary">')
formatted_html.append(f'<h3><span class="icon">📄</span> {file_name}</h3>')
formatted_html.append(self.md.convert(self.file_summaries_map[path]))
formatted_html.append('</div>')
return "\n".join(formatted_html)
def create_document(self) -> str:
"""生成HTML文档
Returns:
str: 完整的HTML文档字符串
"""
return f"""
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>文档总结报告</title>
<link href="https://cdnjs.cloudflare.com/ajax/libs/inter/3.19.3/inter.css" rel="stylesheet">
<link href="https://fonts.googleapis.com/css2?family=Plus+Jakarta+Sans:wght@400;600&display=swap" rel="stylesheet">
<style>{self.css_styles}</style>
</head>
<body>
<div class="container">
<h1><span class="icon">📑</span> 文档总结报告</h1>
<div class="summary">
<h2><span class="icon">📋</span> 总体摘要</h2>
<p>{self.md.convert(self.final_summary)}</p>
</div>
{self.format_failed_files()}
<div class="details">
<h2><span class="icon">📚</span> 各文件详细总结</h2>
{self.format_file_summaries()}
</div>
</div>
</body>
</html>
"""

View File

@@ -0,0 +1,237 @@
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Type, TypeVar, Generic, Union
from dataclasses import dataclass
from enum import Enum, auto
import logging
from datetime import datetime
# 设置日志
logger = logging.getLogger(__name__)
# 自定义异常类定义
class FoldingError(Exception):
"""折叠相关的自定义异常基类"""
pass
class FormattingError(FoldingError):
"""格式化过程中的错误"""
pass
class MetadataError(FoldingError):
"""元数据相关的错误"""
pass
class ValidationError(FoldingError):
"""验证错误"""
pass
class FoldingStyle(Enum):
"""折叠样式枚举"""
SIMPLE = auto() # 简单折叠
DETAILED = auto() # 详细折叠(带有额外信息)
NESTED = auto() # 嵌套折叠
@dataclass
class FoldingOptions:
"""折叠选项配置"""
style: FoldingStyle = FoldingStyle.DETAILED
code_language: Optional[str] = None # 代码块的语言
show_timestamp: bool = False # 是否显示时间戳
indent_level: int = 0 # 缩进级别
custom_css: Optional[str] = None # 自定义CSS类
T = TypeVar('T') # 用于泛型类型
class BaseMetadata(ABC):
"""元数据基类"""
@abstractmethod
def validate(self) -> bool:
"""验证元数据的有效性"""
pass
def _validate_non_empty_str(self, value: Optional[str]) -> bool:
"""验证字符串非空"""
return bool(value and value.strip())
@dataclass
class FileMetadata(BaseMetadata):
"""文件元数据"""
rel_path: str
size: float
last_modified: Optional[datetime] = None
mime_type: Optional[str] = None
encoding: str = 'utf-8'
def validate(self) -> bool:
"""验证文件元数据的有效性"""
try:
if not self._validate_non_empty_str(self.rel_path):
return False
if self.size < 0:
return False
return True
except Exception as e:
logger.error(f"File metadata validation error: {str(e)}")
return False
class ContentFormatter(ABC, Generic[T]):
"""内容格式化抽象基类
支持泛型类型参数,可以指定具体的元数据类型。
"""
@abstractmethod
def format(self,
content: str,
metadata: T,
options: Optional[FoldingOptions] = None) -> str:
"""格式化内容
Args:
content: 需要格式化的内容
metadata: 类型化的元数据
options: 折叠选项
Returns:
str: 格式化后的内容
Raises:
FormattingError: 格式化过程中的错误
"""
pass
def _create_summary(self, metadata: T) -> str:
"""创建折叠摘要,可被子类重写"""
return str(metadata)
def _format_content_block(self,
content: str,
options: Optional[FoldingOptions]) -> str:
"""格式化内容块,处理代码块等特殊格式"""
if not options:
return content
if options.code_language:
return f"```{options.code_language}\n{content}\n```"
return content
def _add_indent(self, text: str, level: int) -> str:
"""添加缩进"""
if level <= 0:
return text
indent = " " * level
return "\n".join(indent + line for line in text.splitlines())
class FileContentFormatter(ContentFormatter[FileMetadata]):
"""文件内容格式化器"""
def format(self,
content: str,
metadata: FileMetadata,
options: Optional[FoldingOptions] = None) -> str:
"""格式化文件内容"""
if not metadata.validate():
raise MetadataError("Invalid file metadata")
try:
options = options or FoldingOptions()
# 构建摘要信息
summary_parts = [
f"{metadata.rel_path} ({metadata.size:.2f}MB)",
f"Type: {metadata.mime_type}" if metadata.mime_type else None,
(f"Modified: {metadata.last_modified.strftime('%Y-%m-%d %H:%M:%S')}"
if metadata.last_modified and options.show_timestamp else None)
]
summary = " | ".join(filter(None, summary_parts))
# 构建HTML类
css_class = f' class="{options.custom_css}"' if options.custom_css else ''
# 格式化内容
formatted_content = self._format_content_block(content, options)
# 组装最终结果
result = (
f'<details{css_class}><summary>{summary}</summary>\n\n'
f'{formatted_content}\n\n'
f'</details>\n\n'
)
return self._add_indent(result, options.indent_level)
except Exception as e:
logger.error(f"Error formatting file content: {str(e)}")
raise FormattingError(f"Failed to format file content: {str(e)}")
class ContentFoldingManager:
"""内容折叠管理器"""
def __init__(self):
"""初始化折叠管理器"""
self._formatters: Dict[str, ContentFormatter] = {}
self._register_default_formatters()
def _register_default_formatters(self) -> None:
"""注册默认的格式化器"""
self.register_formatter('file', FileContentFormatter())
def register_formatter(self, name: str, formatter: ContentFormatter) -> None:
"""注册新的格式化器"""
if not isinstance(formatter, ContentFormatter):
raise TypeError("Formatter must implement ContentFormatter interface")
self._formatters[name] = formatter
def _guess_language(self, extension: str) -> Optional[str]:
"""根据文件扩展名猜测编程语言"""
extension = extension.lower().lstrip('.')
language_map = {
'py': 'python',
'js': 'javascript',
'java': 'java',
'cpp': 'cpp',
'cs': 'csharp',
'html': 'html',
'css': 'css',
'md': 'markdown',
'json': 'json',
'xml': 'xml',
'sql': 'sql',
'sh': 'bash',
'yaml': 'yaml',
'yml': 'yaml',
'txt': None # 纯文本不需要语言标识
}
return language_map.get(extension)
def format_content(self,
content: str,
formatter_type: str,
metadata: Union[FileMetadata],
options: Optional[FoldingOptions] = None) -> str:
"""格式化内容"""
formatter = self._formatters.get(formatter_type)
if not formatter:
raise KeyError(f"No formatter registered for type: {formatter_type}")
if not isinstance(metadata, FileMetadata):
raise TypeError("Invalid metadata type")
return formatter.format(content, metadata, options)

View File

@@ -0,0 +1,211 @@
import re
import os
import pandas as pd
from datetime import datetime
from openpyxl import Workbook
class ExcelTableFormatter:
"""聊天记录中Markdown表格转Excel生成器"""
def __init__(self):
"""初始化Excel文档对象"""
self.workbook = Workbook()
self._table_count = 0
self._current_sheet = None
def _normalize_table_row(self, row):
"""标准化表格行,处理不同的分隔符情况"""
row = row.strip()
if row.startswith('|'):
row = row[1:]
if row.endswith('|'):
row = row[:-1]
return [cell.strip() for cell in row.split('|')]
def _is_separator_row(self, row):
"""检查是否是分隔行(由 - 或 : 组成)"""
clean_row = re.sub(r'[\s|]', '', row)
return bool(re.match(r'^[-:]+$', clean_row))
def _extract_tables_from_text(self, text):
"""从文本中提取所有表格内容"""
if not isinstance(text, str):
return []
tables = []
current_table = []
is_in_table = False
for line in text.split('\n'):
line = line.strip()
if not line:
if is_in_table and current_table:
if len(current_table) >= 2:
tables.append(current_table)
current_table = []
is_in_table = False
continue
if '|' in line:
if not is_in_table:
is_in_table = True
current_table.append(line)
else:
if is_in_table and current_table:
if len(current_table) >= 2:
tables.append(current_table)
current_table = []
is_in_table = False
if is_in_table and current_table and len(current_table) >= 2:
tables.append(current_table)
return tables
def _parse_table(self, table_lines):
"""解析表格内容为结构化数据"""
try:
headers = self._normalize_table_row(table_lines[0])
separator_index = next(
(i for i, line in enumerate(table_lines) if self._is_separator_row(line)),
1
)
data_rows = []
for line in table_lines[separator_index + 1:]:
cells = self._normalize_table_row(line)
# 确保单元格数量与表头一致
while len(cells) < len(headers):
cells.append('')
cells = cells[:len(headers)]
data_rows.append(cells)
if headers and data_rows:
return {
'headers': headers,
'data': data_rows
}
except Exception as e:
print(f"解析表格时发生错误: {str(e)}")
return None
def _create_sheet(self, question_num, table_num):
"""创建新的工作表"""
sheet_name = f'Q{question_num}_T{table_num}'
if len(sheet_name) > 31:
sheet_name = f'Table{self._table_count}'
if sheet_name in self.workbook.sheetnames:
sheet_name = f'{sheet_name}_{datetime.now().strftime("%H%M%S")}'
return self.workbook.create_sheet(title=sheet_name)
def create_document(self, history):
"""
处理聊天历史中的所有表格并创建Excel文档
Args:
history: 聊天历史列表
Returns:
Workbook: 处理完成的Excel工作簿对象如果没有表格则返回None
"""
has_tables = False
# 删除默认创建的工作表
default_sheet = self.workbook['Sheet']
self.workbook.remove(default_sheet)
# 遍历所有回答
for i in range(1, len(history), 2):
answer = history[i]
tables = self._extract_tables_from_text(answer)
for table_lines in tables:
parsed_table = self._parse_table(table_lines)
if parsed_table:
self._table_count += 1
sheet = self._create_sheet(i // 2 + 1, self._table_count)
# 写入表头
for col, header in enumerate(parsed_table['headers'], 1):
sheet.cell(row=1, column=col, value=header)
# 写入数据
for row_idx, row_data in enumerate(parsed_table['data'], 2):
for col_idx, value in enumerate(row_data, 1):
sheet.cell(row=row_idx, column=col_idx, value=value)
has_tables = True
return self.workbook if has_tables else None
def save_chat_tables(history, save_dir, base_name):
"""
保存聊天历史中的表格到Excel文件
Args:
history: 聊天历史列表
save_dir: 保存目录
base_name: 基础文件名
Returns:
list: 保存的文件路径列表
"""
result_files = []
try:
# 创建Excel格式
excel_formatter = ExcelTableFormatter()
workbook = excel_formatter.create_document(history)
if workbook is not None:
# 确保保存目录存在
os.makedirs(save_dir, exist_ok=True)
# 生成Excel文件路径
excel_file = os.path.join(save_dir, base_name + '.xlsx')
# 保存Excel文件
workbook.save(excel_file)
result_files.append(excel_file)
print(f"已保存表格到Excel文件: {excel_file}")
except Exception as e:
print(f"保存Excel格式失败: {str(e)}")
return result_files
# 使用示例
if __name__ == "__main__":
# 示例聊天历史
history = [
"问题1",
"""这是第一个表格:
| A | B | C |
|---|---|---|
| 1 | 2 | 3 |""",
"问题2",
"这是没有表格的回答",
"问题3",
"""回答包含多个表格:
| Name | Age |
|------|-----|
| Tom | 20 |
第二个表格:
| X | Y |
|---|---|
| 1 | 2 |"""
]
# 保存表格
save_dir = "output"
base_name = "chat_tables"
saved_files = save_chat_tables(history, save_dir, base_name)

View File

@@ -0,0 +1,190 @@
class HtmlFormatter:
"""聊天记录HTML格式生成器"""
def __init__(self, chatbot, history):
self.chatbot = chatbot
self.history = history
self.css_styles = """
:root {
--primary-color: #2563eb;
--primary-light: #eff6ff;
--secondary-color: #1e293b;
--background-color: #f8fafc;
--text-color: #334155;
--border-color: #e2e8f0;
--card-shadow: 0 4px 6px -1px rgb(0 0 0 / 0.1), 0 2px 4px -2px rgb(0 0 0 / 0.1);
}
body {
font-family: system-ui, -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
line-height: 1.8;
margin: 0;
padding: 2rem;
color: var(--text-color);
background-color: var(--background-color);
}
.container {
max-width: 1200px;
margin: 0 auto;
background: white;
padding: 2rem;
border-radius: 16px;
box-shadow: var(--card-shadow);
}
::selection {
background: var(--primary-light);
color: var(--primary-color);
}
@keyframes fadeIn {
from { opacity: 0; transform: translateY(20px); }
to { opacity: 1; transform: translateY(0); }
}
@keyframes slideIn {
from { transform: translateX(-20px); opacity: 0; }
to { transform: translateX(0); opacity: 1; }
}
.container {
animation: fadeIn 0.6s ease-out;
}
.QaBox {
animation: slideIn 0.5s ease-out;
transition: all 0.3s ease;
}
.QaBox:hover {
transform: translateX(5px);
}
.Question, .Answer, .historyBox {
transition: all 0.3s ease;
}
.chat-title {
color: var(--primary-color);
font-size: 2em;
text-align: center;
margin: 1rem 0 2rem;
padding-bottom: 1rem;
border-bottom: 2px solid var(--primary-color);
}
.chat-body {
display: flex;
flex-direction: column;
gap: 1.5rem;
margin: 2rem 0;
}
.QaBox {
background: white;
padding: 1.5rem;
border-radius: 8px;
border-left: 4px solid var(--primary-color);
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);
margin-bottom: 1.5rem;
}
.Question {
color: var(--secondary-color);
font-weight: 500;
margin-bottom: 1rem;
}
.Answer {
color: var(--text-color);
background: var(--primary-light);
padding: 1rem;
border-radius: 6px;
}
.history-section {
margin-top: 3rem;
padding-top: 2rem;
border-top: 2px solid var(--border-color);
}
.history-title {
color: var(--secondary-color);
font-size: 1.5em;
margin-bottom: 1.5rem;
text-align: center;
}
.historyBox {
background: white;
padding: 1rem;
margin: 0.5rem 0;
border-radius: 6px;
border: 1px solid var(--border-color);
}
@media (prefers-color-scheme: dark) {
:root {
--background-color: #0f172a;
--text-color: #e2e8f0;
--border-color: #1e293b;
}
.container, .QaBox {
background: #1e293b;
}
}
"""
def format_chat_content(self) -> str:
"""格式化聊天内容"""
chat_content = []
for q, a in self.chatbot:
question = str(q) if q is not None else ""
answer = str(a) if a is not None else ""
chat_content.append(f'''
<div class="QaBox">
<div class="Question">{question}</div>
<div class="Answer">{answer}</div>
</div>
''')
return "\n".join(chat_content)
def format_history_content(self) -> str:
"""格式化历史记录内容"""
if not self.history:
return ""
history_content = []
for entry in self.history:
history_content.append(f'''
<div class="historyBox">
<div class="entry">{entry}</div>
</div>
''')
return "\n".join(history_content)
def create_document(self) -> str:
"""生成完整的HTML文档
Returns:
str: 完整的HTML文档字符串
"""
return f"""
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>对话存档</title>
<style>{self.css_styles}</style>
</head>
<body>
<div class="container">
<h1 class="chat-title">对话存档</h1>
<div class="chat-body">
{self.format_chat_content()}
</div>
</div>
</body>
</html>
"""

View File

@@ -0,0 +1,39 @@
class MarkdownFormatter:
"""Markdown格式文档生成器 - 用于生成对话记录的markdown文档"""
def __init__(self):
self.content = []
def _add_content(self, text: str):
"""添加正文内容"""
if text:
self.content.append(f"\n{text}\n")
def create_document(self, history: list) -> str:
"""
创建完整的Markdown文档
Args:
history: 历史记录列表,偶数位置为问题,奇数位置为答案
Returns:
str: 生成的Markdown文本
"""
self.content = []
# 处理问答对
for i in range(0, len(history), 2):
question = history[i]
answer = history[i + 1]
# 添加问题
self.content.append(f"\n### 问题 {i//2 + 1}")
self._add_content(question)
# 添加回答
self.content.append(f"\n### 回答 {i//2 + 1}")
self._add_content(answer)
# 添加分隔线
self.content.append("\n---\n")
return "\n".join(self.content)

View File

@@ -0,0 +1,172 @@
from datetime import datetime
import os
import re
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
def convert_markdown_to_pdf(markdown_text):
"""将Markdown文本转换为PDF格式的纯文本"""
if not markdown_text:
return ""
# 标准化换行符
markdown_text = markdown_text.replace('\r\n', '\n').replace('\r', '\n')
# 处理标题、粗体、斜体
markdown_text = re.sub(r'^#\s+(.+)$', r'\1', markdown_text, flags=re.MULTILINE)
markdown_text = re.sub(r'\*\*(.+?)\*\*', r'\1', markdown_text)
markdown_text = re.sub(r'\*(.+?)\*', r'\1', markdown_text)
# 处理列表
markdown_text = re.sub(r'^\s*[-*+]\s+(.+?)(?=\n|$)', r'\1', markdown_text, flags=re.MULTILINE)
markdown_text = re.sub(r'^\s*\d+\.\s+(.+?)(?=\n|$)', r'\1', markdown_text, flags=re.MULTILINE)
# 处理链接
markdown_text = re.sub(r'\[([^\]]+)\]\(([^)]+)\)', r'\1', markdown_text)
# 处理段落
markdown_text = re.sub(r'\n{2,}', '\n', markdown_text)
markdown_text = re.sub(r'(?<!\n)(?<!^)(?<!•\s)(?<!\d\.\s)\n(?![\s•\d])', '\n\n', markdown_text, flags=re.MULTILINE)
# 清理空白
markdown_text = re.sub(r' +', ' ', markdown_text)
markdown_text = re.sub(r'(?m)^\s+|\s+$', '', markdown_text)
return markdown_text.strip()
class PDFFormatter:
"""聊天记录PDF文档生成器 - 使用 Noto Sans CJK 字体"""
def __init__(self):
self._init_reportlab()
self._register_fonts()
self.styles = self._get_reportlab_lib()['getSampleStyleSheet']()
self._create_styles()
def _init_reportlab(self):
"""初始化 ReportLab 相关组件"""
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import cm
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
self._lib = {
'A4': A4,
'getSampleStyleSheet': getSampleStyleSheet,
'ParagraphStyle': ParagraphStyle,
'cm': cm
}
self._platypus = {
'SimpleDocTemplate': SimpleDocTemplate,
'Paragraph': Paragraph,
'Spacer': Spacer
}
def _get_reportlab_lib(self):
return self._lib
def _get_reportlab_platypus(self):
return self._platypus
def _register_fonts(self):
"""注册 Noto Sans CJK 字体"""
possible_font_paths = [
'/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc',
'/usr/share/fonts/noto-cjk/NotoSansCJK-Regular.ttc',
'/usr/share/fonts/noto/NotoSansCJK-Regular.ttc'
]
font_registered = False
for path in possible_font_paths:
if os.path.exists(path):
try:
pdfmetrics.registerFont(TTFont('NotoSansCJK', path))
font_registered = True
break
except:
continue
if not font_registered:
print("Warning: Could not find Noto Sans CJK font. Using fallback font.")
self.font_name = 'Helvetica'
else:
self.font_name = 'NotoSansCJK'
def _create_styles(self):
"""创建文档样式"""
ParagraphStyle = self._lib['ParagraphStyle']
# 标题样式
self.styles.add(ParagraphStyle(
name='Title_Custom',
fontName=self.font_name,
fontSize=24,
leading=38,
alignment=1,
spaceAfter=32
))
# 日期样式
self.styles.add(ParagraphStyle(
name='Date_Style',
fontName=self.font_name,
fontSize=16,
leading=20,
alignment=1,
spaceAfter=20
))
# 问题样式
self.styles.add(ParagraphStyle(
name='Question_Style',
fontName=self.font_name,
fontSize=12,
leading=18,
leftIndent=28,
spaceAfter=6
))
# 回答样式
self.styles.add(ParagraphStyle(
name='Answer_Style',
fontName=self.font_name,
fontSize=12,
leading=18,
leftIndent=28,
spaceAfter=12
))
def create_document(self, history, output_path):
"""生成PDF文档"""
# 创建PDF文档
doc = self._platypus['SimpleDocTemplate'](
output_path,
pagesize=self._lib['A4'],
rightMargin=2.6 * self._lib['cm'],
leftMargin=2.8 * self._lib['cm'],
topMargin=3.7 * self._lib['cm'],
bottomMargin=3.5 * self._lib['cm']
)
# 构建内容
story = []
Paragraph = self._platypus['Paragraph']
# 添加对话内容
for i in range(0, len(history), 2):
question = history[i]
answer = convert_markdown_to_pdf(history[i + 1]) if i + 1 < len(history) else ""
if question:
q_text = f'问题 {i // 2 + 1}{str(question)}'
story.append(Paragraph(q_text, self.styles['Question_Style']))
if answer:
a_text = f'回答 {i // 2 + 1}{str(answer)}'
story.append(Paragraph(a_text, self.styles['Answer_Style']))
# 构建PDF
doc.build(story)
return doc

View File

@@ -0,0 +1,79 @@
import re
def convert_markdown_to_txt(markdown_text):
"""Convert markdown text to plain text while preserving formatting"""
# Standardize line endings
markdown_text = markdown_text.replace('\r\n', '\n').replace('\r', '\n')
# 1. Handle headers but keep their formatting instead of removing them
markdown_text = re.sub(r'^#\s+(.+)$', r'# \1', markdown_text, flags=re.MULTILINE)
markdown_text = re.sub(r'^##\s+(.+)$', r'## \1', markdown_text, flags=re.MULTILINE)
markdown_text = re.sub(r'^###\s+(.+)$', r'### \1', markdown_text, flags=re.MULTILINE)
# 2. Handle bold and italic - simply remove markers
markdown_text = re.sub(r'\*\*(.+?)\*\*', r'\1', markdown_text)
markdown_text = re.sub(r'\*(.+?)\*', r'\1', markdown_text)
# 3. Handle lists but preserve formatting
markdown_text = re.sub(r'^\s*[-*+]\s+(.+?)(?=\n|$)', r'\1', markdown_text, flags=re.MULTILINE)
# 4. Handle links - keep only the text
markdown_text = re.sub(r'\[([^\]]+)\]\(([^)]+)\)', r'\1 (\2)', markdown_text)
# 5. Handle HTML links - convert to user-friendly format
markdown_text = re.sub(r'<a href=[\'"]([^\'"]+)[\'"](?:\s+target=[\'"][^\'"]+[\'"])?>([^<]+)</a>', r'\2 (\1)',
markdown_text)
# 6. Preserve paragraph breaks
markdown_text = re.sub(r'\n{3,}', '\n\n', markdown_text) # normalize multiple newlines to double newlines
# 7. Clean up extra spaces but maintain indentation
markdown_text = re.sub(r' +', ' ', markdown_text)
return markdown_text.strip()
class TxtFormatter:
"""Chat history TXT document generator"""
def __init__(self):
self.content = []
self._setup_document()
def _setup_document(self):
"""Initialize document with header"""
self.content.append("=" * 50)
self.content.append("GPT-Academic对话记录".center(48))
self.content.append("=" * 50)
def _format_header(self):
"""Create document header with current date"""
from datetime import datetime
date_str = datetime.now().strftime('%Y年%m月%d')
return [
date_str.center(48),
"\n" # Add blank line after date
]
def create_document(self, history):
"""Generate document from chat history"""
# Add header with date
self.content.extend(self._format_header())
# Add conversation content
for i in range(0, len(history), 2):
question = history[i]
answer = convert_markdown_to_txt(history[i + 1]) if i + 1 < len(history) else ""
if question:
self.content.append(f"问题 {i // 2 + 1}{str(question)}")
self.content.append("") # Add blank line
if answer:
self.content.append(f"回答 {i // 2 + 1}{str(answer)}")
self.content.append("") # Add blank line
# Join all content with newlines
return "\n".join(self.content)

View File

@@ -0,0 +1,155 @@
from docx2pdf import convert
import os
import platform
import subprocess
from typing import Union
from pathlib import Path
from datetime import datetime
class WordToPdfConverter:
"""Word文档转PDF转换器"""
@staticmethod
def convert_to_pdf(word_path: Union[str, Path], pdf_path: Union[str, Path] = None) -> str:
"""
将Word文档转换为PDF
参数:
word_path: Word文档的路径
pdf_path: 可选PDF文件的输出路径。如果未指定将使用与Word文档相同的名称和位置
返回:
生成的PDF文件路径
异常:
如果转换失败,将抛出相应异常
"""
try:
# 确保输入路径是Path对象
word_path = Path(word_path)
# 如果未指定pdf_path则使用与word文档相同的名称
if pdf_path is None:
pdf_path = word_path.with_suffix('.pdf')
else:
pdf_path = Path(pdf_path)
# 检查操作系统
if platform.system() == 'Linux':
# Linux系统需要安装libreoffice
which_result = subprocess.run(['which', 'libreoffice'], capture_output=True, text=True)
if which_result.returncode != 0:
raise RuntimeError("请先安装LibreOffice: sudo apt-get install libreoffice")
print(f"开始转换Word文档: {word_path} 到 PDF")
# 使用subprocess代替os.system
result = subprocess.run(
['libreoffice', '--headless', '--convert-to', 'pdf:writer_pdf_Export',
str(word_path), '--outdir', str(pdf_path.parent)],
capture_output=True, text=True
)
if result.returncode != 0:
error_msg = result.stderr or "未知错误"
print(f"LibreOffice转换失败错误信息: {error_msg}")
raise RuntimeError(f"LibreOffice转换失败: {error_msg}")
print(f"LibreOffice转换输出: {result.stdout}")
# 如果输出路径与默认生成的不同,则重命名
default_pdf = word_path.with_suffix('.pdf')
if default_pdf != pdf_path and default_pdf.exists():
os.rename(default_pdf, pdf_path)
print(f"已将PDF从 {default_pdf} 重命名为 {pdf_path}")
# 验证PDF是否成功生成
if not pdf_path.exists() or pdf_path.stat().st_size == 0:
raise RuntimeError("PDF生成失败或文件为空")
print(f"PDF转换成功文件大小: {pdf_path.stat().st_size} 字节")
else:
# Windows和MacOS使用docx2pdf
print(f"使用docx2pdf转换 {word_path}{pdf_path}")
convert(word_path, pdf_path)
# 验证PDF是否成功生成
if not pdf_path.exists() or pdf_path.stat().st_size == 0:
raise RuntimeError("PDF生成失败或文件为空")
print(f"PDF转换成功文件大小: {pdf_path.stat().st_size} 字节")
return str(pdf_path)
except Exception as e:
print(f"PDF转换异常: {str(e)}")
raise Exception(f"转换PDF失败: {str(e)}")
@staticmethod
def batch_convert(word_dir: Union[str, Path], pdf_dir: Union[str, Path] = None) -> list:
"""
批量转换目录下的所有Word文档
参数:
word_dir: 包含Word文档的目录路径
pdf_dir: 可选PDF文件的输出目录。如果未指定将使用与Word文档相同的目录
返回:
生成的PDF文件路径列表
"""
word_dir = Path(word_dir)
if pdf_dir:
pdf_dir = Path(pdf_dir)
pdf_dir.mkdir(parents=True, exist_ok=True)
converted_files = []
for word_file in word_dir.glob("*.docx"):
try:
if pdf_dir:
pdf_path = pdf_dir / word_file.with_suffix('.pdf').name
else:
pdf_path = word_file.with_suffix('.pdf')
pdf_file = WordToPdfConverter.convert_to_pdf(word_file, pdf_path)
converted_files.append(pdf_file)
except Exception as e:
print(f"转换 {word_file} 失败: {str(e)}")
return converted_files
@staticmethod
def convert_doc_to_pdf(doc, output_dir: Union[str, Path] = None) -> str:
"""
将docx对象直接转换为PDF
参数:
doc: python-docx的Document对象
output_dir: 可选,输出目录。如果未指定,将使用当前目录
返回:
生成的PDF文件路径
"""
try:
# 设置临时文件路径和输出路径
output_dir = Path(output_dir) if output_dir else Path.cwd()
output_dir.mkdir(parents=True, exist_ok=True)
# 生成临时word文件
temp_docx = output_dir / f"temp_{datetime.now().strftime('%Y%m%d_%H%M%S')}.docx"
doc.save(temp_docx)
# 转换为PDF
pdf_path = temp_docx.with_suffix('.pdf')
WordToPdfConverter.convert_to_pdf(temp_docx, pdf_path)
# 删除临时word文件
temp_docx.unlink()
return str(pdf_path)
except Exception as e:
if temp_docx.exists():
temp_docx.unlink()
raise Exception(f"转换PDF失败: {str(e)}")

View File

@@ -0,0 +1,177 @@
import re
from docx import Document
from docx.shared import Cm, Pt
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT, WD_LINE_SPACING
from docx.enum.style import WD_STYLE_TYPE
from docx.oxml.ns import qn
from datetime import datetime
def convert_markdown_to_word(markdown_text):
# 0. 首先标准化所有换行符为\n
markdown_text = markdown_text.replace('\r\n', '\n').replace('\r', '\n')
# 1. 处理标题 - 支持更多级别的标题,使用更精确的正则
# 保留标题标记,以便后续处理时还能识别出标题级别
markdown_text = re.sub(r'^(#{1,6})\s+(.+?)(?:\s+#+)?$', r'\1 \2', markdown_text, flags=re.MULTILINE)
# 2. 处理粗体、斜体和加粗斜体
markdown_text = re.sub(r'\*\*\*(.+?)\*\*\*', r'\1', markdown_text) # 加粗斜体
markdown_text = re.sub(r'\*\*(.+?)\*\*', r'\1', markdown_text) # 加粗
markdown_text = re.sub(r'\*(.+?)\*', r'\1', markdown_text) # 斜体
markdown_text = re.sub(r'_(.+?)_', r'\1', markdown_text) # 下划线斜体
markdown_text = re.sub(r'__(.+?)__', r'\1', markdown_text) # 下划线加粗
# 3. 处理代码块 - 不移除,而是简化格式
# 多行代码块
markdown_text = re.sub(r'```(?:\w+)?\n([\s\S]*?)```', r'[代码块]\n\1[/代码块]', markdown_text)
# 单行代码
markdown_text = re.sub(r'`([^`]+)`', r'[代码]\1[/代码]', markdown_text)
# 4. 处理列表 - 保留列表结构
# 匹配无序列表
markdown_text = re.sub(r'^(\s*)[-*+]\s+(.+?)$', r'\1• \2', markdown_text, flags=re.MULTILINE)
# 5. 处理Markdown链接
markdown_text = re.sub(r'\[([^\]]+)\]\(([^)]+?)\s*(?:"[^"]*")?\)', r'\1 (\2)', markdown_text)
# 6. 处理HTML链接
markdown_text = re.sub(r'<a href=[\'"]([^\'"]+)[\'"](?:\s+target=[\'"][^\'"]+[\'"])?>([^<]+)</a>', r'\2 (\1)',
markdown_text)
# 7. 处理图片
markdown_text = re.sub(r'!\[([^\]]*)\]\([^)]+\)', r'[图片:\1]', markdown_text)
return markdown_text
class WordFormatter:
"""聊天记录Word文档生成器 - 符合中国政府公文格式规范(GB/T 9704-2012)"""
def __init__(self):
self.doc = Document()
self._setup_document()
self._create_styles()
def _setup_document(self):
"""设置文档基本格式,包括页面设置和页眉"""
sections = self.doc.sections
for section in sections:
# 设置页面大小为A4
section.page_width = Cm(21)
section.page_height = Cm(29.7)
# 设置页边距
section.top_margin = Cm(3.7) # 上边距37mm
section.bottom_margin = Cm(3.5) # 下边距35mm
section.left_margin = Cm(2.8) # 左边距28mm
section.right_margin = Cm(2.6) # 右边距26mm
# 设置页眉页脚距离
section.header_distance = Cm(2.0)
section.footer_distance = Cm(2.0)
# 添加页眉
header = section.header
header_para = header.paragraphs[0]
header_para.alignment = WD_PARAGRAPH_ALIGNMENT.RIGHT
header_run = header_para.add_run("GPT-Academic对话记录")
header_run.font.name = '仿宋'
header_run._element.rPr.rFonts.set(qn('w:eastAsia'), '仿宋')
header_run.font.size = Pt(9)
def _create_styles(self):
"""创建文档样式"""
# 创建正文样式
style = self.doc.styles.add_style('Normal_Custom', WD_STYLE_TYPE.PARAGRAPH)
style.font.name = '仿宋'
style._element.rPr.rFonts.set(qn('w:eastAsia'), '仿宋')
style.font.size = Pt(12) # 调整为12磅
style.paragraph_format.line_spacing_rule = WD_LINE_SPACING.ONE_POINT_FIVE
style.paragraph_format.space_after = Pt(0)
# 创建问题样式
question_style = self.doc.styles.add_style('Question_Style', WD_STYLE_TYPE.PARAGRAPH)
question_style.font.name = '黑体'
question_style._element.rPr.rFonts.set(qn('w:eastAsia'), '黑体')
question_style.font.size = Pt(14) # 调整为14磅
question_style.font.bold = True
question_style.paragraph_format.space_before = Pt(12) # 减小段前距
question_style.paragraph_format.space_after = Pt(6)
question_style.paragraph_format.line_spacing_rule = WD_LINE_SPACING.ONE_POINT_FIVE
question_style.paragraph_format.left_indent = Pt(0) # 移除左缩进
# 创建回答样式
answer_style = self.doc.styles.add_style('Answer_Style', WD_STYLE_TYPE.PARAGRAPH)
answer_style.font.name = '仿宋'
answer_style._element.rPr.rFonts.set(qn('w:eastAsia'), '仿宋')
answer_style.font.size = Pt(12) # 调整为12磅
answer_style.paragraph_format.space_before = Pt(6)
answer_style.paragraph_format.space_after = Pt(12)
answer_style.paragraph_format.line_spacing_rule = WD_LINE_SPACING.ONE_POINT_FIVE
answer_style.paragraph_format.left_indent = Pt(0) # 移除左缩进
# 创建标题样式
title_style = self.doc.styles.add_style('Title_Custom', WD_STYLE_TYPE.PARAGRAPH)
title_style.font.name = '黑体' # 改用黑体
title_style._element.rPr.rFonts.set(qn('w:eastAsia'), '黑体')
title_style.font.size = Pt(22) # 调整为22磅
title_style.font.bold = True
title_style.paragraph_format.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER
title_style.paragraph_format.space_before = Pt(0)
title_style.paragraph_format.space_after = Pt(24)
title_style.paragraph_format.line_spacing_rule = WD_LINE_SPACING.ONE_POINT_FIVE
# 添加参考文献样式
ref_style = self.doc.styles.add_style('Reference_Style', WD_STYLE_TYPE.PARAGRAPH)
ref_style.font.name = '宋体'
ref_style._element.rPr.rFonts.set(qn('w:eastAsia'), '宋体')
ref_style.font.size = Pt(10.5) # 参考文献使用小号字体
ref_style.paragraph_format.space_before = Pt(3)
ref_style.paragraph_format.space_after = Pt(3)
ref_style.paragraph_format.line_spacing_rule = WD_LINE_SPACING.SINGLE
ref_style.paragraph_format.left_indent = Pt(21)
ref_style.paragraph_format.first_line_indent = Pt(-21)
# 添加参考文献标题样式
ref_title_style = self.doc.styles.add_style('Reference_Title_Style', WD_STYLE_TYPE.PARAGRAPH)
ref_title_style.font.name = '黑体'
ref_title_style._element.rPr.rFonts.set(qn('w:eastAsia'), '黑体')
ref_title_style.font.size = Pt(16)
ref_title_style.font.bold = True
ref_title_style.paragraph_format.space_before = Pt(24)
ref_title_style.paragraph_format.space_after = Pt(12)
ref_title_style.paragraph_format.line_spacing_rule = WD_LINE_SPACING.ONE_POINT_FIVE
def create_document(self, history):
"""写入聊天历史"""
# 添加标题
title_para = self.doc.add_paragraph(style='Title_Custom')
title_run = title_para.add_run('GPT-Academic 对话记录')
# 添加日期
date_para = self.doc.add_paragraph()
date_para.alignment = WD_PARAGRAPH_ALIGNMENT.CENTER
date_run = date_para.add_run(datetime.now().strftime('%Y年%m月%d'))
date_run.font.name = '仿宋'
date_run._element.rPr.rFonts.set(qn('w:eastAsia'), '仿宋')
date_run.font.size = Pt(16)
self.doc.add_paragraph() # 添加空行
# 添加对话内容
for i in range(0, len(history), 2):
question = history[i]
answer = convert_markdown_to_word(history[i + 1])
if question:
q_para = self.doc.add_paragraph(style='Question_Style')
q_para.add_run(f'问题 {i//2 + 1}').bold = True
q_para.add_run(str(question))
if answer:
a_para = self.doc.add_paragraph(style='Answer_Style')
a_para.add_run(f'回答 {i//2 + 1}').bold = True
a_para.add_run(str(answer))
return self.doc

View File

@@ -0,0 +1,4 @@
import nltk
nltk.data.path.append('~/nltk_data')
nltk.download('averaged_perceptron_tagger', download_dir='~/nltk_data')
nltk.download('punkt', download_dir='~/nltk_data')

View File

@@ -0,0 +1,286 @@
from __future__ import annotations
import pandas as pd
import numpy as np
from pathlib import Path
from typing import Optional, List, Set, Dict, Union, Iterator, Tuple
from dataclasses import dataclass, field
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
import chardet
from functools import lru_cache
import os
@dataclass
class ExtractorConfig:
"""提取器配置类"""
encoding: str = 'auto'
na_filter: bool = True
skip_blank_lines: bool = True
chunk_size: int = 10000
max_workers: int = 4
preserve_format: bool = True
read_all_sheets: bool = True # 新增:是否读取所有工作表
text_cleanup: Dict[str, bool] = field(default_factory=lambda: {
'remove_extra_spaces': True,
'normalize_whitespace': False,
'remove_special_chars': False,
'lowercase': False
})
class ExcelTextExtractor:
"""增强的Excel格式文件文本内容提取器"""
SUPPORTED_EXTENSIONS: Set[str] = {
'.xlsx', '.xls', '.csv', '.tsv', '.xlsm', '.xltx', '.xltm', '.ods'
}
def __init__(self, config: Optional[ExtractorConfig] = None):
self.config = config or ExtractorConfig()
self._setup_logging()
self._detect_encoding = lru_cache(maxsize=128)(self._detect_encoding)
def _setup_logging(self) -> None:
"""配置日志记录器"""
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
self.logger = logging.getLogger(__name__)
fh = logging.FileHandler('excel_extractor.log')
fh.setLevel(logging.ERROR)
self.logger.addHandler(fh)
def _detect_encoding(self, file_path: Path) -> str:
if self.config.encoding != 'auto':
return self.config.encoding
try:
with open(file_path, 'rb') as f:
raw_data = f.read(10000)
result = chardet.detect(raw_data)
return result['encoding'] or 'utf-8'
except Exception as e:
self.logger.warning(f"Encoding detection failed: {e}. Using utf-8")
return 'utf-8'
def _validate_file(self, file_path: Union[str, Path]) -> Path:
path = Path(file_path).resolve()
if not path.exists():
raise ValueError(f"File not found: {path}")
if not path.is_file():
raise ValueError(f"Not a file: {path}")
if not os.access(path, os.R_OK):
raise PermissionError(f"No read permission: {path}")
if path.suffix.lower() not in self.SUPPORTED_EXTENSIONS:
raise ValueError(
f"Unsupported format: {path.suffix}. "
f"Supported: {', '.join(sorted(self.SUPPORTED_EXTENSIONS))}"
)
return path
def _format_value(self, value: Any) -> str:
if pd.isna(value) or value is None:
return ''
if isinstance(value, (int, float)):
return str(value)
return str(value).strip()
def _process_chunk(self, chunk: pd.DataFrame, columns: Optional[List[str]] = None, sheet_name: str = '') -> str:
"""处理数据块新增sheet_name参数"""
try:
if columns:
chunk = chunk[columns]
if self.config.preserve_format:
formatted_chunk = chunk.applymap(self._format_value)
rows = []
# 添加工作表名称作为标题
if sheet_name:
rows.append(f"[Sheet: {sheet_name}]")
# 添加表头
headers = [str(col) for col in formatted_chunk.columns]
rows.append('\t'.join(headers))
# 添加数据行
for _, row in formatted_chunk.iterrows():
rows.append('\t'.join(row.values))
return '\n'.join(rows)
else:
flat_values = (
chunk.astype(str)
.replace({'nan': '', 'None': '', 'NaN': ''})
.values.flatten()
)
return ' '.join(v for v in flat_values if v)
except Exception as e:
self.logger.error(f"Error processing chunk: {e}")
raise
def _read_file(self, file_path: Path) -> Union[pd.DataFrame, Iterator[pd.DataFrame], Dict[str, pd.DataFrame]]:
"""读取文件,支持多工作表"""
try:
encoding = self._detect_encoding(file_path)
if file_path.suffix.lower() in {'.csv', '.tsv'}:
sep = '\t' if file_path.suffix.lower() == '.tsv' else ','
# 对大文件使用分块读取
if file_path.stat().st_size > self.config.chunk_size * 1024:
return pd.read_csv(
file_path,
encoding=encoding,
na_filter=self.config.na_filter,
skip_blank_lines=self.config.skip_blank_lines,
sep=sep,
chunksize=self.config.chunk_size,
on_bad_lines='warn'
)
else:
return pd.read_csv(
file_path,
encoding=encoding,
na_filter=self.config.na_filter,
skip_blank_lines=self.config.skip_blank_lines,
sep=sep
)
else:
# Excel文件处理支持多工作表
if self.config.read_all_sheets:
# 读取所有工作表
return pd.read_excel(
file_path,
na_filter=self.config.na_filter,
keep_default_na=self.config.na_filter,
engine='openpyxl',
sheet_name=None # None表示读取所有工作表
)
else:
# 只读取第一个工作表
return pd.read_excel(
file_path,
na_filter=self.config.na_filter,
keep_default_na=self.config.na_filter,
engine='openpyxl',
sheet_name=0 # 读取第一个工作表
)
except Exception as e:
self.logger.error(f"Error reading file {file_path}: {e}")
raise
def extract_text(
self,
file_path: Union[str, Path],
columns: Optional[List[str]] = None,
separator: str = '\n'
) -> str:
"""提取文本,支持多工作表"""
try:
path = self._validate_file(file_path)
self.logger.info(f"Processing: {path}")
reader = self._read_file(path)
texts = []
# 处理Excel多工作表
if isinstance(reader, dict):
for sheet_name, df in reader.items():
sheet_text = self._process_chunk(df, columns, sheet_name)
if sheet_text:
texts.append(sheet_text)
return separator.join(texts)
# 处理单个DataFrame
elif isinstance(reader, pd.DataFrame):
return self._process_chunk(reader, columns)
# 处理DataFrame迭代器
else:
with ThreadPoolExecutor(max_workers=self.config.max_workers) as executor:
futures = {
executor.submit(self._process_chunk, chunk, columns): i
for i, chunk in enumerate(reader)
}
chunk_texts = []
for future in as_completed(futures):
try:
text = future.result()
if text:
chunk_texts.append((futures[future], text))
except Exception as e:
self.logger.error(f"Error in chunk {futures[future]}: {e}")
# 按块的顺序排序
chunk_texts.sort(key=lambda x: x[0])
texts = [text for _, text in chunk_texts]
# 合并文本,保留格式
if texts and self.config.preserve_format:
result = texts[0] # 第一块包含表头
if len(texts) > 1:
# 跳过后续块的表头行
for text in texts[1:]:
result += '\n' + '\n'.join(text.split('\n')[1:])
return result
else:
return separator.join(texts)
except Exception as e:
self.logger.error(f"Extraction failed: {e}")
raise
@staticmethod
def get_supported_formats() -> List[str]:
"""获取支持的文件格式列表"""
return sorted(ExcelTextExtractor.SUPPORTED_EXTENSIONS)
def main():
"""主函数:演示用法"""
config = ExtractorConfig(
encoding='auto',
preserve_format=True,
read_all_sheets=True, # 启用多工作表读取
text_cleanup={
'remove_extra_spaces': True,
'normalize_whitespace': False,
'remove_special_chars': False,
'lowercase': False
}
)
extractor = ExcelTextExtractor(config)
try:
sample_file = 'example.xlsx'
if Path(sample_file).exists():
text = extractor.extract_text(
sample_file,
columns=['title', 'content']
)
print("提取的文本:")
print(text)
else:
print(f"示例文件 {sample_file} 不存在")
print("\n支持的格式:", extractor.get_supported_formats())
except Exception as e:
print(f"错误: {e}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,359 @@
from __future__ import annotations
from pathlib import Path
from typing import Optional, Set, Dict, Union, List
from dataclasses import dataclass, field
import logging
import os
import re
import subprocess
import tempfile
import shutil
@dataclass
class MarkdownConverterConfig:
"""PDF 到 Markdown 转换器配置类
Attributes:
extract_images: 是否提取图片
extract_tables: 是否尝试保留表格结构
extract_code_blocks: 是否识别代码块
extract_math: 是否转换数学公式
output_dir: 输出目录路径
image_dir: 图片保存目录路径
paragraph_separator: 段落之间的分隔符
text_cleanup: 文本清理选项字典
docintel_endpoint: Document Intelligence端点URL (可选)
enable_plugins: 是否启用插件
llm_client: LLM客户端对象 (例如OpenAI client)
llm_model: 要使用的LLM模型名称
"""
extract_images: bool = True
extract_tables: bool = True
extract_code_blocks: bool = True
extract_math: bool = True
output_dir: str = ""
image_dir: str = "images"
paragraph_separator: str = '\n\n'
text_cleanup: Dict[str, bool] = field(default_factory=lambda: {
'remove_extra_spaces': True,
'normalize_whitespace': True,
'remove_special_chars': False,
'lowercase': False
})
docintel_endpoint: str = ""
enable_plugins: bool = False
llm_client: Optional[object] = None
llm_model: str = ""
class MarkdownConverter:
"""PDF 到 Markdown 转换器
使用 markitdown 库实现 PDF 到 Markdown 的转换,支持多种配置选项。
"""
SUPPORTED_EXTENSIONS: Set[str] = {
'.pdf',
}
def __init__(self, config: Optional[MarkdownConverterConfig] = None):
"""初始化转换器
Args:
config: 转换器配置对象如果为None则使用默认配置
"""
self.config = config or MarkdownConverterConfig()
self._setup_logging()
# 检查是否安装了 markitdown
self._check_markitdown_installation()
def _setup_logging(self) -> None:
"""配置日志记录器"""
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
self.logger = logging.getLogger(__name__)
# 添加文件处理器
fh = logging.FileHandler('markdown_converter.log')
fh.setLevel(logging.ERROR)
self.logger.addHandler(fh)
def _check_markitdown_installation(self) -> None:
"""检查是否安装了 markitdown"""
try:
# 尝试导入 markitdown 库
from markitdown import MarkItDown
self.logger.info("markitdown 库已安装")
except ImportError:
self.logger.warning("markitdown 库未安装,尝试安装...")
try:
subprocess.check_call(["pip", "install", "markitdown"])
self.logger.info("markitdown 库安装成功")
from markitdown import MarkItDown
except (subprocess.SubprocessError, ImportError):
self.logger.error("无法安装 markitdown 库,请手动安装")
self.markitdown_available = False
return
self.markitdown_available = True
def _validate_file(self, file_path: Union[str, Path], max_size_mb: int = 100) -> Path:
"""验证文件
Args:
file_path: 文件路径
max_size_mb: 允许的最大文件大小(MB)
Returns:
Path: 验证后的Path对象
Raises:
ValueError: 文件不存在、格式不支持或大小超限
PermissionError: 没有读取权限
"""
path = Path(file_path).resolve()
if not path.exists():
raise ValueError(f"文件不存在: {path}")
if not path.is_file():
raise ValueError(f"不是一个文件: {path}")
if not os.access(path, os.R_OK):
raise PermissionError(f"没有读取权限: {path}")
file_size_mb = path.stat().st_size / (1024 * 1024)
if file_size_mb > max_size_mb:
raise ValueError(
f"文件大小 ({file_size_mb:.1f}MB) 超过限制 {max_size_mb}MB"
)
if path.suffix.lower() not in self.SUPPORTED_EXTENSIONS:
raise ValueError(
f"不支持的格式: {path.suffix}. "
f"支持的格式: {', '.join(sorted(self.SUPPORTED_EXTENSIONS))}"
)
return path
def _cleanup_text(self, text: str) -> str:
"""清理文本
Args:
text: 原始文本
Returns:
str: 清理后的文本
"""
if self.config.text_cleanup['remove_extra_spaces']:
text = ' '.join(text.split())
if self.config.text_cleanup['normalize_whitespace']:
text = text.replace('\t', ' ').replace('\r', '\n')
if self.config.text_cleanup['lowercase']:
text = text.lower()
return text.strip()
@staticmethod
def get_supported_formats() -> List[str]:
"""获取支持的文件格式列表"""
return sorted(MarkdownConverter.SUPPORTED_EXTENSIONS)
def convert_to_markdown(
self,
file_path: Union[str, Path],
output_path: Optional[Union[str, Path]] = None
) -> str:
"""将 PDF 转换为 Markdown
Args:
file_path: PDF 文件路径
output_path: 输出 Markdown 文件路径,如果为 None 则返回内容而不保存
Returns:
str: 转换后的 Markdown 内容
Raises:
Exception: 转换过程中的错误
"""
try:
path = self._validate_file(file_path)
self.logger.info(f"处理: {path}")
if not self.markitdown_available:
raise ImportError("markitdown 库未安装,无法进行转换")
# 导入 markitdown 库
from markitdown import MarkItDown
# 准备输出目录
if output_path:
output_path = Path(output_path)
output_dir = output_path.parent
output_dir.mkdir(parents=True, exist_ok=True)
else:
# 创建临时目录作为输出目录
temp_dir = tempfile.mkdtemp()
output_dir = Path(temp_dir)
output_path = output_dir / f"{path.stem}.md"
# 图片目录
image_dir = output_dir / self.config.image_dir
image_dir.mkdir(parents=True, exist_ok=True)
# 创建 MarkItDown 实例并进行转换
if self.config.docintel_endpoint:
md = MarkItDown(docintel_endpoint=self.config.docintel_endpoint)
elif self.config.llm_client and self.config.llm_model:
md = MarkItDown(
enable_plugins=self.config.enable_plugins,
llm_client=self.config.llm_client,
llm_model=self.config.llm_model
)
else:
md = MarkItDown(enable_plugins=self.config.enable_plugins)
# 执行转换
result = md.convert(str(path))
markdown_content = result.text_content
# 清理文本
markdown_content = self._cleanup_text(markdown_content)
# 如果需要保存到文件
if output_path:
with open(output_path, 'w', encoding='utf-8') as f:
f.write(markdown_content)
self.logger.info(f"转换成功,输出到: {output_path}")
return markdown_content
except Exception as e:
self.logger.error(f"转换失败: {e}")
raise
finally:
# 如果使用了临时目录且没有指定输出路径,则清理临时目录
if 'temp_dir' in locals() and not output_path:
shutil.rmtree(temp_dir, ignore_errors=True)
def convert_to_markdown_and_save(
self,
file_path: Union[str, Path],
output_path: Union[str, Path]
) -> Path:
"""将 PDF 转换为 Markdown 并保存到指定路径
Args:
file_path: PDF 文件路径
output_path: 输出 Markdown 文件路径
Returns:
Path: 输出文件的 Path 对象
Raises:
Exception: 转换过程中的错误
"""
self.convert_to_markdown(file_path, output_path)
return Path(output_path)
def batch_convert(
self,
file_paths: List[Union[str, Path]],
output_dir: Union[str, Path]
) -> List[Path]:
"""批量转换多个 PDF 文件为 Markdown
Args:
file_paths: PDF 文件路径列表
output_dir: 输出目录路径
Returns:
List[Path]: 输出文件路径列表
Raises:
Exception: 转换过程中的错误
"""
output_dir = Path(output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
output_paths = []
for file_path in file_paths:
path = Path(file_path)
output_path = output_dir / f"{path.stem}.md"
try:
self.convert_to_markdown(file_path, output_path)
output_paths.append(output_path)
self.logger.info(f"成功转换: {path} -> {output_path}")
except Exception as e:
self.logger.error(f"转换失败 {path}: {e}")
return output_paths
def main():
"""主函数:演示用法"""
# 配置
config = MarkdownConverterConfig(
extract_images=True,
extract_tables=True,
extract_code_blocks=True,
extract_math=True,
enable_plugins=False,
text_cleanup={
'remove_extra_spaces': True,
'normalize_whitespace': True,
'remove_special_chars': False,
'lowercase': False
}
)
# 创建转换器
converter = MarkdownConverter(config)
# 使用示例
try:
# 替换为实际的文件路径
sample_file = './crazy_functions/doc_fns/read_fns/paper/2501.12599v1.pdf'
if Path(sample_file).exists():
# 转换为 Markdown 并打印内容
markdown_content = converter.convert_to_markdown(sample_file)
print("转换后的 Markdown 内容:")
print(markdown_content[:500] + "...") # 只打印前500个字符
# 转换并保存到文件
output_file = f"./output_{Path(sample_file).stem}.md"
output_path = converter.convert_to_markdown_and_save(sample_file, output_file)
print(f"\n已保存到: {output_path}")
# 使用LLM增强的示例 (需要添加相应的导入和配置)
# try:
# from openai import OpenAI
# client = OpenAI()
# llm_config = MarkdownConverterConfig(
# llm_client=client,
# llm_model="gpt-4o"
# )
# llm_converter = MarkdownConverter(llm_config)
# llm_result = llm_converter.convert_to_markdown("example.jpg")
# print("LLM增强的结果:")
# print(llm_result[:500] + "...")
# except ImportError:
# print("未安装OpenAI库跳过LLM示例")
else:
print(f"示例文件 {sample_file} 不存在")
print("\n支持的格式:", converter.get_supported_formats())
except Exception as e:
print(f"错误: {e}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,493 @@
from __future__ import annotations
from pathlib import Path
from typing import Optional, Set, Dict, Union, List
from dataclasses import dataclass, field
import logging
import os
import re
from unstructured.partition.auto import partition
from unstructured.documents.elements import (
Text, Title, NarrativeText, ListItem, Table,
Footer, Header, PageBreak, Image, Address
)
@dataclass
class PaperMetadata:
"""论文元数据类"""
title: str = ""
authors: List[str] = field(default_factory=list)
affiliations: List[str] = field(default_factory=list)
journal: str = ""
volume: str = ""
issue: str = ""
year: str = ""
doi: str = ""
date: str = ""
publisher: str = ""
conference: str = ""
abstract: str = ""
keywords: List[str] = field(default_factory=list)
@dataclass
class ExtractorConfig:
"""元数据提取器配置类"""
paragraph_separator: str = '\n\n'
text_cleanup: Dict[str, bool] = field(default_factory=lambda: {
'remove_extra_spaces': True,
'normalize_whitespace': True,
'remove_special_chars': False,
'lowercase': False
})
class PaperMetadataExtractor:
"""论文元数据提取器
使用unstructured库从多种文档格式中提取论文的标题、作者、摘要等元数据信息。
"""
SUPPORTED_EXTENSIONS: Set[str] = {
'.pdf', '.docx', '.doc', '.txt', '.ppt', '.pptx',
'.xlsx', '.xls', '.md', '.org', '.odt', '.rst',
'.rtf', '.epub', '.html', '.xml', '.json'
}
# 定义论文各部分的关键词模式
SECTION_PATTERNS = {
'abstract': r'\b(摘要|abstract|summary|概要|résumé|zusammenfassung|аннотация)\b',
'keywords': r'\b(关键词|keywords|key\s+words|关键字|mots[- ]clés|schlüsselwörter|ключевые слова)\b',
}
def __init__(self, config: Optional[ExtractorConfig] = None):
"""初始化提取器
Args:
config: 提取器配置对象如果为None则使用默认配置
"""
self.config = config or ExtractorConfig()
self._setup_logging()
def _setup_logging(self) -> None:
"""配置日志记录器"""
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
self.logger = logging.getLogger(__name__)
# 添加文件处理器
fh = logging.FileHandler('paper_metadata_extractor.log')
fh.setLevel(logging.ERROR)
self.logger.addHandler(fh)
def _validate_file(self, file_path: Union[str, Path], max_size_mb: int = 100) -> Path:
"""验证文件
Args:
file_path: 文件路径
max_size_mb: 允许的最大文件大小(MB)
Returns:
Path: 验证后的Path对象
Raises:
ValueError: 文件不存在、格式不支持或大小超限
PermissionError: 没有读取权限
"""
path = Path(file_path).resolve()
if not path.exists():
raise ValueError(f"文件不存在: {path}")
if not path.is_file():
raise ValueError(f"不是文件: {path}")
if not os.access(path, os.R_OK):
raise PermissionError(f"没有读取权限: {path}")
file_size_mb = path.stat().st_size / (1024 * 1024)
if file_size_mb > max_size_mb:
raise ValueError(
f"文件大小 ({file_size_mb:.1f}MB) 超过限制 {max_size_mb}MB"
)
if path.suffix.lower() not in self.SUPPORTED_EXTENSIONS:
raise ValueError(
f"不支持的文件格式: {path.suffix}. "
f"支持的格式: {', '.join(sorted(self.SUPPORTED_EXTENSIONS))}"
)
return path
def _cleanup_text(self, text: str) -> str:
"""清理文本
Args:
text: 原始文本
Returns:
str: 清理后的文本
"""
if self.config.text_cleanup['remove_extra_spaces']:
text = ' '.join(text.split())
if self.config.text_cleanup['normalize_whitespace']:
text = text.replace('\t', ' ').replace('\r', '\n')
if self.config.text_cleanup['lowercase']:
text = text.lower()
return text.strip()
@staticmethod
def get_supported_formats() -> List[str]:
"""获取支持的文件格式列表"""
return sorted(PaperMetadataExtractor.SUPPORTED_EXTENSIONS)
def extract_metadata(self, file_path: Union[str, Path], strategy: str = "fast") -> PaperMetadata:
"""提取论文元数据
Args:
file_path: 文件路径
strategy: 提取策略 ("fast""accurate")
Returns:
PaperMetadata: 提取的论文元数据
Raises:
Exception: 提取过程中的错误
"""
try:
path = self._validate_file(file_path)
self.logger.info(f"正在处理: {path}")
# 使用unstructured库分解文档
elements = partition(
str(path),
strategy=strategy,
include_metadata=True,
nlp=False,
)
# 提取元数据
metadata = PaperMetadata()
# 提取标题和作者
self._extract_title_and_authors(elements, metadata)
# 提取摘要和关键词
self._extract_abstract_and_keywords(elements, metadata)
# 提取其他元数据
self._extract_additional_metadata(elements, metadata)
return metadata
except Exception as e:
self.logger.error(f"元数据提取失败: {e}")
raise
def _extract_title_and_authors(self, elements, metadata: PaperMetadata) -> None:
"""从文档中提取标题和作者信息 - 改进版"""
# 收集所有潜在的标题候选
title_candidates = []
all_text = []
raw_text = []
# 首先收集文档前30个元素的文本用于辅助判断
for i, element in enumerate(elements[:30]):
if isinstance(element, (Text, Title, NarrativeText)):
text = str(element).strip()
if text:
all_text.append(text)
raw_text.append(text)
# 打印出原始文本,用于调试
print("原始文本前10行:")
for i, text in enumerate(raw_text[:10]):
print(f"{i}: {text}")
# 1. 尝试查找连续的标题片段并合并它们
i = 0
while i < len(all_text) - 1:
current = all_text[i]
next_text = all_text[i + 1]
# 检查是否存在标题分割情况:一行以冒号结尾,下一行像是标题的延续
if current.endswith(':') and len(current) < 50 and len(next_text) > 5 and next_text[0].isupper():
# 合并这两行文本
combined_title = f"{current} {next_text}"
# 查找合并前的文本并替换
all_text[i] = combined_title
all_text.pop(i + 1)
# 给合并后的标题很高的分数
title_candidates.append((combined_title, 15, i))
else:
i += 1
# 2. 首先尝试从标题元素中查找
for i, element in enumerate(elements[:15]): # 只检查前15个元素
if isinstance(element, Title):
title_text = str(element).strip()
# 排除常见的非标题内容
if title_text.lower() not in ['abstract', '摘要', 'introduction', '引言']:
# 计算标题分数(越高越可能是真正的标题)
score = self._evaluate_title_candidate(title_text, i, element)
title_candidates.append((title_text, score, i))
# 3. 特别处理常见的论文标题格式
for i, text in enumerate(all_text[:15]):
# 特别检查"KIMI K1.5:"类型的前缀标题
if re.match(r'^[A-Z][A-Z0-9\s\.]+(\s+K\d+(\.\d+)?)?:', text):
score = 12 # 给予很高的分数
title_candidates.append((text, score, i))
# 如果下一行也是全大写,很可能是标题的延续
if i+1 < len(all_text) and all_text[i+1].isupper() and len(all_text[i+1]) > 10:
combined_title = f"{text} {all_text[i+1]}"
title_candidates.append((combined_title, 15, i)) # 给合并标题更高分数
# 匹配全大写的标题行
elif text.isupper() and len(text) > 10 and len(text) < 100:
score = 10 - i * 0.5 # 越靠前越可能是标题
title_candidates.append((text, score, i))
# 对标题候选按分数排序并选取最佳候选
if title_candidates:
title_candidates.sort(key=lambda x: x[1], reverse=True)
metadata.title = title_candidates[0][0]
title_position = title_candidates[0][2]
print(f"所有标题候选: {title_candidates[:3]}")
else:
# 如果没有找到合适的标题,使用一个备选策略
for text in all_text[:10]:
if text.isupper() and len(text) > 10 and len(text) < 200: # 大写且适当长度的文本
metadata.title = text
break
title_position = 0
# 提取作者信息 - 改进后的作者提取逻辑
author_candidates = []
# 1. 特别处理"TECHNICAL REPORT OF"之后的行,通常是作者或团队
for i, text in enumerate(all_text):
if "TECHNICAL REPORT" in text.upper() and i+1 < len(all_text):
team_text = all_text[i+1].strip()
if re.search(r'\b(team|group|lab)\b', team_text, re.IGNORECASE):
author_candidates.append((team_text, 15))
# 2. 查找包含Team的文本
for text in all_text[:20]:
if "Team" in text and len(text) < 30:
# 这很可能是团队名
author_candidates.append((text, 12))
# 添加作者到元数据
if author_candidates:
# 按分数排序
author_candidates.sort(key=lambda x: x[1], reverse=True)
# 去重
seen_authors = set()
for author, _ in author_candidates:
if author.lower() not in seen_authors and not author.isdigit():
seen_authors.add(author.lower())
metadata.authors.append(author)
# 如果没有找到作者,尝试查找隶属机构信息中的团队名称
if not metadata.authors:
for text in all_text[:20]:
if re.search(r'\b(team|group|lab|laboratory|研究组|团队)\b', text, re.IGNORECASE):
if len(text) < 50: # 避免太长的文本
metadata.authors.append(text.strip())
break
# 提取隶属机构信息
for i, element in enumerate(elements[:30]):
element_text = str(element).strip()
if re.search(r'(university|institute|department|school|laboratory|college|center|centre|\d{5,}|^[a-zA-Z]+@|学院|大学|研究所|研究院)', element_text, re.IGNORECASE):
# 可能是隶属机构
if element_text not in metadata.affiliations and len(element_text) > 10:
metadata.affiliations.append(element_text)
def _evaluate_title_candidate(self, text, position, element):
"""评估标题候选项的可能性分数"""
score = 0
# 位置因素:越靠前越可能是标题
score += max(0, 10 - position) * 0.5
# 长度因素:标题通常不会太短也不会太长
if 10 <= len(text) <= 150:
score += 3
elif len(text) < 10:
score -= 2
elif len(text) > 150:
score -= 3
# 格式因素
if text.isupper(): # 全大写可能是标题
score += 2
if re.match(r'^[A-Z]', text): # 首字母大写
score += 1
if ':' in text: # 标题常包含冒号
score += 1.5
# 内容因素
if re.search(r'\b(scaling|learning|model|approach|method|system|framework|analysis)\b', text.lower()):
score += 2 # 包含常见的学术论文关键词
# 避免误判
if re.match(r'^\d+$', text): # 纯数字
score -= 10
if re.search(r'^(http|www|doi)', text.lower()): # URL或DOI
score -= 5
if len(text.split()) <= 2 and len(text) < 15: # 太短的短语
score -= 3
# 元数据因素(如果有)
if hasattr(element, 'metadata') and element.metadata:
# 修复正确处理ElementMetadata对象
try:
# 尝试通过getattr安全地获取属性
font_size = getattr(element.metadata, 'font_size', None)
if font_size is not None and font_size > 14: # 假设标准字体大小是12
score += 3
font_weight = getattr(element.metadata, 'font_weight', None)
if font_weight == 'bold':
score += 2 # 粗体加分
except (AttributeError, TypeError):
# 如果metadata的访问方式不正确尝试其他可能的访问方式
try:
metadata_dict = element.metadata.__dict__ if hasattr(element.metadata, '__dict__') else {}
if 'font_size' in metadata_dict and metadata_dict['font_size'] > 14:
score += 3
if 'font_weight' in metadata_dict and metadata_dict['font_weight'] == 'bold':
score += 2
except Exception:
# 如果所有尝试都失败,忽略元数据处理
pass
return score
def _extract_abstract_and_keywords(self, elements, metadata: PaperMetadata) -> None:
"""从文档中提取摘要和关键词"""
abstract_found = False
keywords_found = False
abstract_text = []
for i, element in enumerate(elements):
element_text = str(element).strip().lower()
# 寻找摘要部分
if not abstract_found and (
isinstance(element, Title) and
re.search(self.SECTION_PATTERNS['abstract'], element_text, re.IGNORECASE)
):
abstract_found = True
continue
# 如果找到摘要部分,收集内容直到遇到关键词部分或新章节
if abstract_found and not keywords_found:
# 检查是否遇到关键词部分或新章节
if (
isinstance(element, Title) or
re.search(self.SECTION_PATTERNS['keywords'], element_text, re.IGNORECASE) or
re.match(r'\b(introduction|引言|method|方法)\b', element_text, re.IGNORECASE)
):
keywords_found = re.search(self.SECTION_PATTERNS['keywords'], element_text, re.IGNORECASE)
abstract_found = False # 停止收集摘要
else:
# 收集摘要文本
if isinstance(element, (Text, NarrativeText)) and element_text:
abstract_text.append(element_text)
# 如果找到关键词部分,提取关键词
if keywords_found and not abstract_found and not metadata.keywords:
if isinstance(element, (Text, NarrativeText)):
# 清除可能的"关键词:"/"Keywords:"前缀
cleaned_text = re.sub(r'^\s*(关键词|keywords|key\s+words)\s*[:]\s*', '', element_text, flags=re.IGNORECASE)
# 尝试按不同分隔符分割
for separator in [';', '', ',', '']:
if separator in cleaned_text:
metadata.keywords = [k.strip() for k in cleaned_text.split(separator) if k.strip()]
break
# 如果未能分割,将整个文本作为一个关键词
if not metadata.keywords and cleaned_text:
metadata.keywords = [cleaned_text]
keywords_found = False # 已提取关键词,停止处理
# 设置摘要文本
if abstract_text:
metadata.abstract = self.config.paragraph_separator.join(abstract_text)
def _extract_additional_metadata(self, elements, metadata: PaperMetadata) -> None:
"""提取其他元数据信息"""
for element in elements[:30]: # 只检查文档前部分
element_text = str(element).strip()
# 尝试匹配DOI
doi_match = re.search(r'(doi|DOI):\s*(10\.\d{4,}\/[a-zA-Z0-9.-]+)', element_text)
if doi_match and not metadata.doi:
metadata.doi = doi_match.group(2)
# 尝试匹配日期
date_match = re.search(r'(published|received|accepted|submitted):\s*(\d{1,2}\s+[a-zA-Z]+\s+\d{4}|\d{4}[-/]\d{1,2}[-/]\d{1,2})', element_text, re.IGNORECASE)
if date_match and not metadata.date:
metadata.date = date_match.group(2)
# 尝试匹配年份
year_match = re.search(r'\b(19|20)\d{2}\b', element_text)
if year_match and not metadata.year:
metadata.year = year_match.group(0)
# 尝试匹配期刊/会议名称
journal_match = re.search(r'(journal|conference):\s*([^,;.]+)', element_text, re.IGNORECASE)
if journal_match:
if "journal" in journal_match.group(1).lower() and not metadata.journal:
metadata.journal = journal_match.group(2).strip()
elif not metadata.conference:
metadata.conference = journal_match.group(2).strip()
def main():
"""主函数:演示用法"""
# 创建提取器
extractor = PaperMetadataExtractor()
# 使用示例
try:
# 替换为实际的文件路径
sample_file = '/Users/boyin.liu/Documents/示例文档/论文/3.pdf'
if Path(sample_file).exists():
metadata = extractor.extract_metadata(sample_file)
print("提取的元数据:")
print(f"标题: {metadata.title}")
print(f"作者: {', '.join(metadata.authors)}")
print(f"机构: {', '.join(metadata.affiliations)}")
print(f"摘要: {metadata.abstract[:200]}...")
print(f"关键词: {', '.join(metadata.keywords)}")
print(f"DOI: {metadata.doi}")
print(f"日期: {metadata.date}")
print(f"年份: {metadata.year}")
print(f"期刊: {metadata.journal}")
print(f"会议: {metadata.conference}")
else:
print(f"示例文件 {sample_file} 不存在")
print("\n支持的格式:", extractor.get_supported_formats())
except Exception as e:
print(f"错误: {e}")
if __name__ == "__main__":
main()

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,86 @@
from pathlib import Path
from crazy_functions.doc_fns.read_fns.unstructured_all.paper_structure_extractor import PaperStructureExtractor
def extract_and_save_as_markdown(paper_path, output_path=None):
"""
提取论文结构并保存为Markdown格式
参数:
paper_path: 论文文件路径
output_path: 输出的Markdown文件路径如果不指定将使用与输入相同的文件名但扩展名为.md
返回:
保存的Markdown文件路径
"""
# 创建提取器
extractor = PaperStructureExtractor()
# 解析文件路径
paper_path = Path(paper_path)
# 如果未指定输出路径,使用相同文件名但扩展名为.md
if output_path is None:
output_path = paper_path.with_suffix('.md')
else:
output_path = Path(output_path)
# 确保输出目录存在
output_path.parent.mkdir(parents=True, exist_ok=True)
print(f"正在处理论文: {paper_path}")
try:
# 提取论文结构
paper = extractor.extract_paper_structure(paper_path)
# 生成Markdown内容
markdown_content = extractor.generate_markdown(paper)
# 保存到文件
with open(output_path, 'w', encoding='utf-8') as f:
f.write(markdown_content)
print(f"已成功保存Markdown文件: {output_path}")
# 打印摘要信息
print("\n论文摘要信息:")
print(f"标题: {paper.metadata.title}")
print(f"作者: {', '.join(paper.metadata.authors)}")
print(f"关键词: {', '.join(paper.keywords)}")
print(f"章节数: {len(paper.sections)}")
print(f"图表数: {len(paper.figures)}")
print(f"表格数: {len(paper.tables)}")
print(f"公式数: {len(paper.formulas)}")
print(f"参考文献数: {len(paper.references)}")
return output_path
except Exception as e:
print(f"处理论文时出错: {e}")
import traceback
traceback.print_exc()
return None
# 使用示例
if __name__ == "__main__":
# 替换为实际的论文文件路径
sample_paper = "crazy_functions/doc_fns/read_fns/paper/2501.12599v1.pdf"
# 可以指定输出路径,也可以使用默认路径
# output_file = "/path/to/output/paper_structure.md"
# extract_and_save_as_markdown(sample_paper, output_file)
# 使用默认输出路径(与输入文件同名但扩展名为.md
extract_and_save_as_markdown(sample_paper)
# # 批量处理多个论文的示例
# paper_dir = Path("/path/to/papers/folder")
# output_dir = Path("/path/to/output/folder")
#
# # 确保输出目录存在
# output_dir.mkdir(parents=True, exist_ok=True)
#
# # 处理目录中的所有PDF文件
# for paper_file in paper_dir.glob("*.pdf"):
# output_file = output_dir / f"{paper_file.stem}.md"
# extract_and_save_as_markdown(paper_file, output_file)

View File

@@ -0,0 +1,275 @@
from __future__ import annotations
from pathlib import Path
from typing import Optional, Set, Dict, Union, List
from dataclasses import dataclass, field
import logging
import os
from unstructured.partition.auto import partition
from unstructured.documents.elements import (
Text, Title, NarrativeText, ListItem, Table,
Footer, Header, PageBreak, Image, Address
)
@dataclass
class TextExtractorConfig:
"""通用文档提取器配置类
Attributes:
extract_headers_footers: 是否提取页眉页脚
extract_tables: 是否提取表格内容
extract_lists: 是否提取列表内容
extract_titles: 是否提取标题
paragraph_separator: 段落之间的分隔符
text_cleanup: 文本清理选项字典
"""
extract_headers_footers: bool = False
extract_tables: bool = True
extract_lists: bool = True
extract_titles: bool = True
paragraph_separator: str = '\n\n'
text_cleanup: Dict[str, bool] = field(default_factory=lambda: {
'remove_extra_spaces': True,
'normalize_whitespace': True,
'remove_special_chars': False,
'lowercase': False
})
class UnstructuredTextExtractor:
"""通用文档文本内容提取器
使用 unstructured 库支持多种文档格式的文本提取,提供统一的接口和配置选项。
"""
SUPPORTED_EXTENSIONS: Set[str] = {
# 文档格式
'.pdf', '.docx', '.doc', '.txt',
# 演示文稿
'.ppt', '.pptx',
# 电子表格
'.xlsx', '.xls', '.csv',
# 图片
'.png', '.jpg', '.jpeg', '.tiff',
# 邮件
'.eml', '.msg', '.p7s',
# Markdown
".md",
# Org Mode
".org",
# Open Office
".odt",
# reStructured Text
".rst",
# Rich Text
".rtf",
# TSV
".tsv",
# EPUB
'.epub',
# 其他格式
'.html', '.xml', '.json',
}
def __init__(self, config: Optional[TextExtractorConfig] = None):
"""初始化提取器
Args:
config: 提取器配置对象如果为None则使用默认配置
"""
self.config = config or TextExtractorConfig()
self._setup_logging()
def _setup_logging(self) -> None:
"""配置日志记录器"""
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
self.logger = logging.getLogger(__name__)
# 添加文件处理器
fh = logging.FileHandler('text_extractor.log')
fh.setLevel(logging.ERROR)
self.logger.addHandler(fh)
def _validate_file(self, file_path: Union[str, Path], max_size_mb: int = 100) -> Path:
"""验证文件
Args:
file_path: 文件路径
max_size_mb: 允许的最大文件大小(MB)
Returns:
Path: 验证后的Path对象
Raises:
ValueError: 文件不存在、格式不支持或大小超限
PermissionError: 没有读取权限
"""
path = Path(file_path).resolve()
if not path.exists():
raise ValueError(f"File not found: {path}")
if not path.is_file():
raise ValueError(f"Not a file: {path}")
if not os.access(path, os.R_OK):
raise PermissionError(f"No read permission: {path}")
file_size_mb = path.stat().st_size / (1024 * 1024)
if file_size_mb > max_size_mb:
raise ValueError(
f"File size ({file_size_mb:.1f}MB) exceeds limit of {max_size_mb}MB"
)
if path.suffix.lower() not in self.SUPPORTED_EXTENSIONS:
raise ValueError(
f"Unsupported format: {path.suffix}. "
f"Supported: {', '.join(sorted(self.SUPPORTED_EXTENSIONS))}"
)
return path
def _cleanup_text(self, text: str) -> str:
"""清理文本
Args:
text: 原始文本
Returns:
str: 清理后的文本
"""
if self.config.text_cleanup['remove_extra_spaces']:
text = ' '.join(text.split())
if self.config.text_cleanup['normalize_whitespace']:
text = text.replace('\t', ' ').replace('\r', '\n')
if self.config.text_cleanup['lowercase']:
text = text.lower()
return text.strip()
def _should_extract_element(self, element) -> bool:
"""判断是否应该提取某个元素
Args:
element: 文档元素
Returns:
bool: 是否应该提取
"""
if isinstance(element, (Text, NarrativeText)):
return True
if isinstance(element, Title) and self.config.extract_titles:
return True
if isinstance(element, ListItem) and self.config.extract_lists:
return True
if isinstance(element, Table) and self.config.extract_tables:
return True
if isinstance(element, (Header, Footer)) and self.config.extract_headers_footers:
return True
return False
@staticmethod
def get_supported_formats() -> List[str]:
"""获取支持的文件格式列表"""
return sorted(UnstructuredTextExtractor.SUPPORTED_EXTENSIONS)
def extract_text(
self,
file_path: Union[str, Path],
strategy: str = "fast"
) -> str:
"""提取文本
Args:
file_path: 文件路径
strategy: 提取策略 ("fast""accurate")
Returns:
str: 提取的文本内容
Raises:
Exception: 提取过程中的错误
"""
try:
path = self._validate_file(file_path)
self.logger.info(f"Processing: {path}")
# 修改这里:添加 nlp=False 参数来禁用 NLTK
elements = partition(
str(path),
strategy=strategy,
include_metadata=True,
nlp=True,
)
# 其余代码保持不变
text_parts = []
for element in elements:
if self._should_extract_element(element):
text = str(element)
cleaned_text = self._cleanup_text(text)
if cleaned_text:
if isinstance(element, (Header, Footer)):
prefix = "[Header] " if isinstance(element, Header) else "[Footer] "
text_parts.append(f"{prefix}{cleaned_text}")
else:
text_parts.append(cleaned_text)
return self.config.paragraph_separator.join(text_parts)
except Exception as e:
self.logger.error(f"Extraction failed: {e}")
raise
def main():
"""主函数:演示用法"""
# 配置
config = TextExtractorConfig(
extract_headers_footers=True,
extract_tables=True,
extract_lists=True,
extract_titles=True,
text_cleanup={
'remove_extra_spaces': True,
'normalize_whitespace': True,
'remove_special_chars': False,
'lowercase': False
}
)
# 创建提取器
extractor = UnstructuredTextExtractor(config)
# 使用示例
try:
# 替换为实际的文件路径
sample_file = './crazy_functions/doc_fns/read_fns/paper/2501.12599v1.pdf'
if Path(sample_file).exists() or True:
text = extractor.extract_text(sample_file)
print("提取的文本:")
print(text)
else:
print(f"示例文件 {sample_file} 不存在")
print("\n支持的格式:", extractor.get_supported_formats())
except Exception as e:
print(f"错误: {e}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,219 @@
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Dict, Optional, Union
from urllib.parse import urlparse
import logging
import trafilatura
import requests
from pathlib import Path
@dataclass
class WebExtractorConfig:
"""网页内容提取器配置类
Attributes:
extract_comments: 是否提取评论
extract_tables: 是否提取表格
extract_links: 是否保留链接信息
paragraph_separator: 段落分隔符
timeout: 网络请求超时时间(秒)
max_retries: 最大重试次数
user_agent: 自定义User-Agent
text_cleanup: 文本清理选项
"""
extract_comments: bool = False
extract_tables: bool = True
extract_links: bool = False
paragraph_separator: str = '\n\n'
timeout: int = 10
max_retries: int = 3
user_agent: str = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
text_cleanup: Dict[str, bool] = field(default_factory=lambda: {
'remove_extra_spaces': True,
'normalize_whitespace': True,
'remove_special_chars': False,
'lowercase': False
})
class WebTextExtractor:
"""网页文本内容提取器
使用trafilatura库提取网页中的主要文本内容去除广告、导航等无关内容。
"""
def __init__(self, config: Optional[WebExtractorConfig] = None):
"""初始化提取器
Args:
config: 提取器配置对象如果为None则使用默认配置
"""
self.config = config or WebExtractorConfig()
self._setup_logging()
def _setup_logging(self) -> None:
"""配置日志记录器"""
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
self.logger = logging.getLogger(__name__)
# 添加文件处理器
fh = logging.FileHandler('web_extractor.log')
fh.setLevel(logging.ERROR)
self.logger.addHandler(fh)
def _validate_url(self, url: str) -> bool:
"""验证URL格式是否有效
Args:
url: 网页URL
Returns:
bool: URL是否有效
"""
try:
result = urlparse(url)
return all([result.scheme, result.netloc])
except Exception:
return False
def _download_webpage(self, url: str) -> Optional[str]:
"""下载网页内容
Args:
url: 网页URL
Returns:
Optional[str]: 网页HTML内容失败返回None
Raises:
Exception: 下载失败时抛出异常
"""
headers = {'User-Agent': self.config.user_agent}
for attempt in range(self.config.max_retries):
try:
response = requests.get(
url,
headers=headers,
timeout=self.config.timeout
)
response.raise_for_status()
return response.text
except requests.RequestException as e:
self.logger.warning(f"Attempt {attempt + 1} failed: {e}")
if attempt == self.config.max_retries - 1:
raise Exception(f"Failed to download webpage after {self.config.max_retries} attempts: {e}")
return None
def _cleanup_text(self, text: str) -> str:
"""清理文本
Args:
text: 原始文本
Returns:
str: 清理后的文本
"""
if not text:
return ""
if self.config.text_cleanup['remove_extra_spaces']:
text = ' '.join(text.split())
if self.config.text_cleanup['normalize_whitespace']:
text = text.replace('\t', ' ').replace('\r', '\n')
if self.config.text_cleanup['lowercase']:
text = text.lower()
return text.strip()
def extract_text(self, url: str) -> str:
"""提取网页文本内容
Args:
url: 网页URL
Returns:
str: 提取的文本内容
Raises:
ValueError: URL无效时抛出
Exception: 提取失败时抛出
"""
try:
if not self._validate_url(url):
raise ValueError(f"Invalid URL: {url}")
self.logger.info(f"Processing URL: {url}")
# 下载网页
html_content = self._download_webpage(url)
if not html_content:
raise Exception("Failed to download webpage")
# 配置trafilatura提取选项
extract_config = {
'include_comments': self.config.extract_comments,
'include_tables': self.config.extract_tables,
'include_links': self.config.extract_links,
'no_fallback': False, # 允许使用后备提取器
}
# 提取文本
extracted_text = trafilatura.extract(
html_content,
**extract_config
)
if not extracted_text:
raise Exception("No content could be extracted")
# 清理文本
cleaned_text = self._cleanup_text(extracted_text)
return cleaned_text
except Exception as e:
self.logger.error(f"Extraction failed: {e}")
raise
def main():
"""主函数:演示用法"""
# 配置
config = WebExtractorConfig(
extract_comments=False,
extract_tables=True,
extract_links=False,
timeout=10,
text_cleanup={
'remove_extra_spaces': True,
'normalize_whitespace': True,
'remove_special_chars': False,
'lowercase': False
}
)
# 创建提取器
extractor = WebTextExtractor(config)
# 使用示例
try:
# 替换为实际的URL
sample_url = 'https://arxiv.org/abs/2412.00036'
text = extractor.extract_text(sample_url)
print("提取的文本:")
print(text)
except Exception as e:
print(f"错误: {e}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,451 @@
import os
import re
import glob
import time
import queue
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Generator, Tuple, Set, Optional, Dict
from dataclasses import dataclass
from loguru import logger
from toolbox import update_ui
from crazy_functions.rag_fns.rag_file_support import extract_text
from crazy_functions.doc_fns.content_folder import ContentFoldingManager, FileMetadata, FoldingOptions, FoldingStyle, FoldingError
from shared_utils.fastapi_server import validate_path_safety
from datetime import datetime
import mimetypes
@dataclass
class FileInfo:
"""文件信息数据类"""
path: str # 完整路径
rel_path: str # 相对路径
size: float # 文件大小(MB)
extension: str # 文件扩展名
last_modified: str # 最后修改时间
class TextContentLoader:
"""优化版本的文本内容加载器 - 保持原有接口"""
# 压缩文件扩展名
COMPRESSED_EXTENSIONS: Set[str] = {'.zip', '.rar', '.7z', '.tar', '.gz', '.bz2', '.xz'}
# 系统配置
MAX_FILE_SIZE: int = 100 * 1024 * 1024 # 最大文件大小100MB
MAX_TOTAL_SIZE: int = 100 * 1024 * 1024 # 最大总大小100MB
MAX_FILES: int = 100 # 最大文件数量
CHUNK_SIZE: int = 1024 * 1024 # 文件读取块大小1MB
MAX_WORKERS: int = min(32, (os.cpu_count() or 1) * 4) # 最大工作线程数
BATCH_SIZE: int = 5 # 批处理大小
def __init__(self, chatbot: List, history: List):
"""初始化加载器"""
self.chatbot = chatbot
self.history = history
self.failed_files: List[Tuple[str, str]] = []
self.processed_size: int = 0
self.start_time: float = 0
self.file_cache: Dict[str, str] = {}
self._lock = threading.Lock()
self.executor = ThreadPoolExecutor(max_workers=self.MAX_WORKERS)
self.results_queue = queue.Queue()
self.folding_manager = ContentFoldingManager()
def _create_file_info(self, entry: os.DirEntry, root_path: str) -> FileInfo:
"""优化的文件信息创建
Args:
entry: 目录入口对象
root_path: 根路径
Returns:
FileInfo: 文件信息对象
"""
try:
stats = entry.stat() # 使用缓存的文件状态
return FileInfo(
path=entry.path,
rel_path=os.path.relpath(entry.path, root_path),
size=stats.st_size / (1024 * 1024),
extension=os.path.splitext(entry.path)[1].lower(),
last_modified=time.strftime('%Y-%m-%d %H:%M:%S',
time.localtime(stats.st_mtime))
)
except (OSError, ValueError) as e:
return None
def _process_file_batch(self, file_batch: List[FileInfo]) -> List[Tuple[FileInfo, Optional[str]]]:
"""批量处理文件
Args:
file_batch: 要处理的文件信息列表
Returns:
List[Tuple[FileInfo, Optional[str]]]: 处理结果列表
"""
results = []
futures = {}
for file_info in file_batch:
if file_info.path in self.file_cache:
results.append((file_info, self.file_cache[file_info.path]))
continue
if file_info.size * 1024 * 1024 > self.MAX_FILE_SIZE:
with self._lock:
self.failed_files.append(
(file_info.rel_path,
f"文件过大({file_info.size:.2f}MB > {self.MAX_FILE_SIZE / (1024 * 1024)}MB")
)
continue
future = self.executor.submit(self._read_file_content, file_info)
futures[future] = file_info
for future in as_completed(futures):
file_info = futures[future]
try:
content = future.result()
if content:
with self._lock:
self.file_cache[file_info.path] = content
self.processed_size += file_info.size * 1024 * 1024
results.append((file_info, content))
except Exception as e:
with self._lock:
self.failed_files.append((file_info.rel_path, f"读取失败: {str(e)}"))
return results
def _read_file_content(self, file_info: FileInfo) -> Optional[str]:
"""读取单个文件内容
Args:
file_info: 文件信息对象
Returns:
Optional[str]: 文件内容
"""
try:
content = extract_text(file_info.path)
if not content or not content.strip():
return None
return content
except Exception as e:
logger.exception(f"读取文件失败: {str(e)}")
raise Exception(f"读取文件失败: {str(e)}")
def _is_valid_file(self, file_path: str) -> bool:
"""检查文件是否有效
Args:
file_path: 文件路径
Returns:
bool: 是否为有效文件
"""
if not os.path.isfile(file_path):
return False
extension = os.path.splitext(file_path)[1].lower()
if (extension in self.COMPRESSED_EXTENSIONS or
os.path.basename(file_path).startswith('.') or
not os.access(file_path, os.R_OK)):
return False
# 只要文件可以访问且不在排除列表中就认为是有效的
return True
def _collect_files(self, path: str) -> List[FileInfo]:
"""收集文件信息
Args:
path: 目标路径
Returns:
List[FileInfo]: 有效文件信息列表
"""
files = []
total_size = 0
# 处理单个文件的情况
if os.path.isfile(path):
if self._is_valid_file(path):
file_info = self._create_file_info(os.DirEntry(os.path.dirname(path)), os.path.dirname(path))
if file_info:
return [file_info]
return []
# 处理目录的情况
try:
# 使用os.walk来递归遍历目录
for root, _, filenames in os.walk(path):
for filename in filenames:
if len(files) >= self.MAX_FILES:
self.failed_files.append((filename, f"超出最大文件数限制({self.MAX_FILES})"))
continue
file_path = os.path.join(root, filename)
if not self._is_valid_file(file_path):
continue
try:
stats = os.stat(file_path)
file_size = stats.st_size / (1024 * 1024) # 转换为MB
if file_size * 1024 * 1024 > self.MAX_FILE_SIZE:
self.failed_files.append((file_path,
f"文件过大({file_size:.2f}MB > {self.MAX_FILE_SIZE / (1024 * 1024)}MB"))
continue
if total_size + file_size * 1024 * 1024 > self.MAX_TOTAL_SIZE:
self.failed_files.append((file_path, "超出总大小限制"))
continue
file_info = FileInfo(
path=file_path,
rel_path=os.path.relpath(file_path, path),
size=file_size,
extension=os.path.splitext(file_path)[1].lower(),
last_modified=time.strftime('%Y-%m-%d %H:%M:%S',
time.localtime(stats.st_mtime))
)
total_size += file_size * 1024 * 1024
files.append(file_info)
except Exception as e:
self.failed_files.append((file_path, f"处理文件失败: {str(e)}"))
continue
except Exception as e:
self.failed_files.append(("目录扫描", f"扫描失败: {str(e)}"))
return []
return sorted(files, key=lambda x: x.rel_path)
def _format_content_with_fold(self, file_info, content: str) -> str:
"""使用折叠管理器格式化文件内容"""
try:
metadata = FileMetadata(
rel_path=file_info.rel_path,
size=file_info.size,
last_modified=datetime.fromtimestamp(
os.path.getmtime(file_info.path)
),
mime_type=mimetypes.guess_type(file_info.path)[0]
)
options = FoldingOptions(
style=FoldingStyle.DETAILED,
code_language=self.folding_manager._guess_language(
os.path.splitext(file_info.path)[1]
),
show_timestamp=True
)
return self.folding_manager.format_content(
content=content,
formatter_type='file',
metadata=metadata,
options=options
)
except Exception as e:
return f"Error formatting content: {str(e)}"
def _format_content_for_llm(self, file_infos: List[FileInfo], contents: List[str]) -> str:
"""格式化用于LLM的内容
Args:
file_infos: 文件信息列表
contents: 内容列表
Returns:
str: 格式化后的内容
"""
if len(file_infos) != len(contents):
raise ValueError("文件信息和内容数量不匹配")
result = [
"以下是多个文件的内容集合。每个文件的内容都以 '===== 文件 {序号}: {文件名} =====' 开始,",
"'===== 文件 {序号} 结束 =====' 结束。你可以根据这些分隔符来识别不同文件的内容。\n\n"
]
for idx, (file_info, content) in enumerate(zip(file_infos, contents), 1):
result.extend([
f"===== 文件 {idx}: {file_info.rel_path} =====",
"文件内容:",
content.strip(),
f"===== 文件 {idx} 结束 =====\n"
])
return "\n".join(result)
def execute(self, txt: str) -> Generator:
"""执行文本加载和显示 - 保持原有接口
Args:
txt: 目标路径
Yields:
Generator: UI更新生成器
"""
try:
# 首先显示正在处理的提示信息
self.chatbot.append(["提示", "正在提取文本内容,请稍作等待..."])
yield from update_ui(chatbot=self.chatbot, history=self.history)
user_name = self.chatbot.get_user()
validate_path_safety(txt, user_name)
self.start_time = time.time()
self.processed_size = 0
self.failed_files.clear()
successful_files = []
successful_contents = []
# 收集文件
files = self._collect_files(txt)
if not files:
# 移除之前的提示信息
self.chatbot.pop()
self.chatbot.append(["提示", "未找到任何有效文件"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
return
# 批量处理文件
content_blocks = []
for i in range(0, len(files), self.BATCH_SIZE):
batch = files[i:i + self.BATCH_SIZE]
results = self._process_file_batch(batch)
for file_info, content in results:
if content:
content_blocks.append(self._format_content_with_fold(file_info, content))
successful_files.append(file_info)
successful_contents.append(content)
# 显示文件内容,替换之前的提示信息
if content_blocks:
# 移除之前的提示信息
self.chatbot.pop()
self.chatbot.append(["文件内容", "\n".join(content_blocks)])
self.history.extend([
self._format_content_for_llm(successful_files, successful_contents),
"我已经接收到你上传的文件的内容,请提问"
])
yield from update_ui(chatbot=self.chatbot, history=self.history)
yield from update_ui(chatbot=self.chatbot, history=self.history)
except Exception as e:
# 发生错误时,移除之前的提示信息
if len(self.chatbot) > 0 and self.chatbot[-1][0] == "提示":
self.chatbot.pop()
self.chatbot.append(["错误", f"处理过程中出现错误: {str(e)}"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
finally:
self.executor.shutdown(wait=False)
self.file_cache.clear()
def execute_single_file(self, file_path: str) -> Generator:
"""执行单个文件的加载和显示
Args:
file_path: 文件路径
Yields:
Generator: UI更新生成器
"""
try:
# 首先显示正在处理的提示信息
self.chatbot.append(["提示", "正在提取文本内容,请稍作等待..."])
yield from update_ui(chatbot=self.chatbot, history=self.history)
user_name = self.chatbot.get_user()
validate_path_safety(file_path, user_name)
self.start_time = time.time()
self.processed_size = 0
self.failed_files.clear()
# 验证文件是否存在且可读
if not os.path.isfile(file_path):
self.chatbot.pop()
self.chatbot.append(["错误", f"指定路径不是文件: {file_path}"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
return
if not self._is_valid_file(file_path):
self.chatbot.pop()
self.chatbot.append(["错误", f"无效的文件类型或无法读取: {file_path}"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
return
# 创建文件信息
try:
stats = os.stat(file_path)
file_size = stats.st_size / (1024 * 1024) # 转换为MB
if file_size * 1024 * 1024 > self.MAX_FILE_SIZE:
self.chatbot.pop()
self.chatbot.append(["错误", f"文件过大({file_size:.2f}MB > {self.MAX_FILE_SIZE / (1024 * 1024)}MB"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
return
file_info = FileInfo(
path=file_path,
rel_path=os.path.basename(file_path),
size=file_size,
extension=os.path.splitext(file_path)[1].lower(),
last_modified=time.strftime('%Y-%m-%d %H:%M:%S',
time.localtime(stats.st_mtime))
)
except Exception as e:
self.chatbot.pop()
self.chatbot.append(["错误", f"处理文件失败: {str(e)}"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
return
# 读取文件内容
try:
content = self._read_file_content(file_info)
if not content:
self.chatbot.pop()
self.chatbot.append(["提示", f"文件内容为空或无法提取: {file_path}"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
return
except Exception as e:
self.chatbot.pop()
self.chatbot.append(["错误", f"读取文件失败: {str(e)}"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
return
# 格式化内容并更新UI
formatted_content = self._format_content_with_fold(file_info, content)
# 移除之前的提示信息
self.chatbot.pop()
self.chatbot.append(["文件内容", formatted_content])
# 更新历史记录便于LLM处理
llm_content = self._format_content_for_llm([file_info], [content])
self.history.extend([llm_content, "我已经接收到你上传的文件的内容,请提问"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
except Exception as e:
# 发生错误时,移除之前的提示信息
if len(self.chatbot) > 0 and self.chatbot[-1][0] == "提示":
self.chatbot.pop()
self.chatbot.append(["错误", f"处理过程中出现错误: {str(e)}"])
yield from update_ui(chatbot=self.chatbot, history=self.history)
def __del__(self):
"""析构函数 - 确保资源被正确释放"""
if hasattr(self, 'executor'):
self.executor.shutdown(wait=False)
if hasattr(self, 'file_cache'):
self.file_cache.clear()

View File

@@ -1,4 +1,4 @@
from toolbox import CatchException, update_ui, update_ui_lastest_msg
from toolbox import CatchException, update_ui, update_ui_latest_msg
from crazy_functions.multi_stage.multi_stage_utils import GptAcademicGameBaseState
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from request_llms.bridge_all import predict_no_ui_long_connection
@@ -8,12 +8,12 @@ import random
class MiniGame_ASCII_Art(GptAcademicGameBaseState):
def step(self, prompt, chatbot, history):
if self.step_cnt == 0:
if self.step_cnt == 0:
chatbot.append(["我画你猜(动物)", "请稍等..."])
else:
if prompt.strip() == 'exit':
self.delete_game = True
yield from update_ui_lastest_msg(lastmsg=f"谜底是{self.obj},游戏结束。", chatbot=chatbot, history=history, delay=0.)
yield from update_ui_latest_msg(lastmsg=f"谜底是{self.obj},游戏结束。", chatbot=chatbot, history=history, delay=0.)
return
chatbot.append([prompt, ""])
yield from update_ui(chatbot=chatbot, history=history)
@@ -31,12 +31,12 @@ class MiniGame_ASCII_Art(GptAcademicGameBaseState):
self.cur_task = 'identify user guess'
res = get_code_block(raw_res)
history += ['', f'the answer is {self.obj}', inputs, res]
yield from update_ui_lastest_msg(lastmsg=res, chatbot=chatbot, history=history, delay=0.)
yield from update_ui_latest_msg(lastmsg=res, chatbot=chatbot, history=history, delay=0.)
elif self.cur_task == 'identify user guess':
if is_same_thing(self.obj, prompt, self.llm_kwargs):
self.delete_game = True
yield from update_ui_lastest_msg(lastmsg="你猜对了!", chatbot=chatbot, history=history, delay=0.)
yield from update_ui_latest_msg(lastmsg="你猜对了!", chatbot=chatbot, history=history, delay=0.)
else:
self.cur_task = 'identify user guess'
yield from update_ui_lastest_msg(lastmsg="猜错了再试试输入“exit”获取答案。", chatbot=chatbot, history=history, delay=0.)
yield from update_ui_latest_msg(lastmsg="猜错了再试试输入“exit”获取答案。", chatbot=chatbot, history=history, delay=0.)

View File

@@ -63,7 +63,7 @@ prompts_terminate = """小说的前文回顾:
"""
from toolbox import CatchException, update_ui, update_ui_lastest_msg
from toolbox import CatchException, update_ui, update_ui_latest_msg
from crazy_functions.multi_stage.multi_stage_utils import GptAcademicGameBaseState
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from request_llms.bridge_all import predict_no_ui_long_connection
@@ -88,23 +88,23 @@ class MiniGame_ResumeStory(GptAcademicGameBaseState):
self.story = []
chatbot.append(["互动写故事", f"这次的故事开头是:{self.headstart}"])
self.sys_prompt_ = '你是一个想象力丰富的杰出作家。正在与你的朋友互动一起写故事因此你每次写的故事段落应少于300字结局除外'
def generate_story_image(self, story_paragraph):
try:
from crazy_functions.图片生成 import gen_image
from crazy_functions.Image_Generate import gen_image
prompt_ = predict_no_ui_long_connection(inputs=story_paragraph, llm_kwargs=self.llm_kwargs, history=[], sys_prompt='你需要根据用户给出的小说段落进行简短的环境描写。要求80字以内。')
image_url, image_path = gen_image(self.llm_kwargs, prompt_, '512x512', model="dall-e-2", quality='standard', style='natural')
return f'<br/><div align="center"><img src="file={image_path}"></div>'
except:
return ''
def step(self, prompt, chatbot, history):
"""
首先,处理游戏初始化等特殊情况
"""
if self.step_cnt == 0:
if self.step_cnt == 0:
self.begin_game_step_0(prompt, chatbot, history)
self.lock_plugin(chatbot)
self.cur_task = 'head_start'
@@ -112,7 +112,7 @@ class MiniGame_ResumeStory(GptAcademicGameBaseState):
if prompt.strip() == 'exit' or prompt.strip() == '结束剧情':
# should we terminate game here?
self.delete_game = True
yield from update_ui_lastest_msg(lastmsg=f"游戏结束。", chatbot=chatbot, history=history, delay=0.)
yield from update_ui_latest_msg(lastmsg=f"游戏结束。", chatbot=chatbot, history=history, delay=0.)
return
if '剧情收尾' in prompt:
self.cur_task = 'story_terminate'
@@ -132,13 +132,13 @@ class MiniGame_ResumeStory(GptAcademicGameBaseState):
inputs_ = prompts_hs.format(headstart=self.headstart)
history_ = []
story_paragraph = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs_, '故事开头', self.llm_kwargs,
inputs_, '故事开头', self.llm_kwargs,
chatbot, history_, self.sys_prompt_
)
self.story.append(story_paragraph)
# # 配图
yield from update_ui_lastest_msg(lastmsg=story_paragraph + '<br/>正在生成插图中 ...', chatbot=chatbot, history=history, delay=0.)
yield from update_ui_lastest_msg(lastmsg=story_paragraph + '<br/>'+ self.generate_story_image(story_paragraph), chatbot=chatbot, history=history, delay=0.)
yield from update_ui_latest_msg(lastmsg=story_paragraph + '<br/>正在生成插图中 ...', chatbot=chatbot, history=history, delay=0.)
yield from update_ui_latest_msg(lastmsg=story_paragraph + '<br/>'+ self.generate_story_image(story_paragraph), chatbot=chatbot, history=history, delay=0.)
# # 构建后续剧情引导
previously_on_story = ""
@@ -147,7 +147,7 @@ class MiniGame_ResumeStory(GptAcademicGameBaseState):
inputs_ = prompts_interact.format(previously_on_story=previously_on_story)
history_ = []
self.next_choices = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs_, '请在以下几种故事走向中,选择一种(当然,您也可以选择给出其他故事走向):', self.llm_kwargs,
inputs_, '请在以下几种故事走向中,选择一种(当然,您也可以选择给出其他故事走向):', self.llm_kwargs,
chatbot,
history_,
self.sys_prompt_
@@ -166,13 +166,13 @@ class MiniGame_ResumeStory(GptAcademicGameBaseState):
inputs_ = prompts_resume.format(previously_on_story=previously_on_story, choice=self.next_choices, user_choice=prompt)
history_ = []
story_paragraph = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs_, f'下一段故事(您的选择是:{prompt})。', self.llm_kwargs,
inputs_, f'下一段故事(您的选择是:{prompt})。', self.llm_kwargs,
chatbot, history_, self.sys_prompt_
)
self.story.append(story_paragraph)
# # 配图
yield from update_ui_lastest_msg(lastmsg=story_paragraph + '<br/>正在生成插图中 ...', chatbot=chatbot, history=history, delay=0.)
yield from update_ui_lastest_msg(lastmsg=story_paragraph + '<br/>'+ self.generate_story_image(story_paragraph), chatbot=chatbot, history=history, delay=0.)
yield from update_ui_latest_msg(lastmsg=story_paragraph + '<br/>正在生成插图中 ...', chatbot=chatbot, history=history, delay=0.)
yield from update_ui_latest_msg(lastmsg=story_paragraph + '<br/>'+ self.generate_story_image(story_paragraph), chatbot=chatbot, history=history, delay=0.)
# # 构建后续剧情引导
previously_on_story = ""
@@ -181,10 +181,10 @@ class MiniGame_ResumeStory(GptAcademicGameBaseState):
inputs_ = prompts_interact.format(previously_on_story=previously_on_story)
history_ = []
self.next_choices = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs_,
'请在以下几种故事走向中,选择一种。当然,您也可以给出您心中的其他故事走向。另外,如果您希望剧情立即收尾,请输入剧情走向,并以“剧情收尾”四个字提示程序。', self.llm_kwargs,
chatbot,
history_,
inputs_,
'请在以下几种故事走向中,选择一种。当然,您也可以给出您心中的其他故事走向。另外,如果您希望剧情立即收尾,请输入剧情走向,并以“剧情收尾”四个字提示程序。', self.llm_kwargs,
chatbot,
history_,
self.sys_prompt_
)
self.cur_task = 'user_choice'
@@ -200,12 +200,12 @@ class MiniGame_ResumeStory(GptAcademicGameBaseState):
inputs_ = prompts_terminate.format(previously_on_story=previously_on_story, user_choice=prompt)
history_ = []
story_paragraph = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs_, f'故事收尾(您的选择是:{prompt})。', self.llm_kwargs,
inputs_, f'故事收尾(您的选择是:{prompt})。', self.llm_kwargs,
chatbot, history_, self.sys_prompt_
)
# # 配图
yield from update_ui_lastest_msg(lastmsg=story_paragraph + '<br/>正在生成插图中 ...', chatbot=chatbot, history=history, delay=0.)
yield from update_ui_lastest_msg(lastmsg=story_paragraph + '<br/>'+ self.generate_story_image(story_paragraph), chatbot=chatbot, history=history, delay=0.)
yield from update_ui_latest_msg(lastmsg=story_paragraph + '<br/>正在生成插图中 ...', chatbot=chatbot, history=history, delay=0.)
yield from update_ui_latest_msg(lastmsg=story_paragraph + '<br/>'+ self.generate_story_image(story_paragraph), chatbot=chatbot, history=history, delay=0.)
# terminate game
self.delete_game = True

View File

@@ -5,7 +5,7 @@ def get_code_block(reply):
import re
pattern = r"```([\s\S]*?)```" # regex pattern to match code blocks
matches = re.findall(pattern, reply) # find all code blocks in text
if len(matches) == 1:
if len(matches) == 1:
return "```" + matches[0] + "```" # code block
raise RuntimeError("GPT is not generating proper code.")
@@ -13,10 +13,10 @@ def is_same_thing(a, b, llm_kwargs):
from pydantic import BaseModel, Field
class IsSameThing(BaseModel):
is_same_thing: bool = Field(description="determine whether two objects are same thing.", default=False)
def run_gpt_fn(inputs, sys_prompt, history=[]):
def run_gpt_fn(inputs, sys_prompt, history=[]):
return predict_no_ui_long_connection(
inputs=inputs, llm_kwargs=llm_kwargs,
inputs=inputs, llm_kwargs=llm_kwargs,
history=history, sys_prompt=sys_prompt, observe_window=[]
)
@@ -24,7 +24,7 @@ def is_same_thing(a, b, llm_kwargs):
inputs_01 = "Identity whether the user input and the target is the same thing: \n target object: {a} \n user input object: {b} \n\n\n".format(a=a, b=b)
inputs_01 += "\n\n\n Note that the user may describe the target object with a different language, e.g. cat and 猫 are the same thing."
analyze_res_cot_01 = run_gpt_fn(inputs_01, "", [])
inputs_02 = inputs_01 + gpt_json_io.format_instructions
analyze_res = run_gpt_fn(inputs_02, "", [inputs_01, analyze_res_cot_01])

View File

@@ -2,7 +2,7 @@ import time
import importlib
from toolbox import trimmed_format_exc, gen_time_str, get_log_folder
from toolbox import CatchException, update_ui, gen_time_str, trimmed_format_exc, is_the_upload_folder
from toolbox import promote_file_to_downloadzone, get_log_folder, update_ui_lastest_msg
from toolbox import promote_file_to_downloadzone, get_log_folder, update_ui_latest_msg
import multiprocessing
def get_class_name(class_string):
@@ -41,11 +41,11 @@ def is_function_successfully_generated(fn_path, class_name, return_dict):
# Now you can create an instance of the class
instance = some_class()
return_dict['success'] = True
return
return
except:
return_dict['traceback'] = trimmed_format_exc()
return
def subprocess_worker(code, file_path, return_dict):
return_dict['result'] = None
return_dict['success'] = False

Some files were not shown because too many files have changed in this diff Show More