132
jd/QUICKSTART_UBUNTU.md
Normal file
@@ -0,0 +1,132 @@
# Ubuntu Quick Start Guide

## Quick install (recommended)

Use the automated setup script:

```bash
cd ~/project/jdpl  # enter the project directory
chmod +x jd/setup_ubuntu.sh
./jd/setup_ubuntu.sh
```

The script automatically:
1. ✅ Checks for and installs Python 3 and its dependencies
2. ✅ Checks for and installs Chrome/Chromium
3. ✅ Installs the Chrome runtime dependencies
4. ✅ Creates a Python virtual environment
5. ✅ Installs DrissionPage
6. ✅ Creates a convenience run script

## Manual install

### 1. Install Chrome

```bash
# Google Chrome (recommended)
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt install -y ./google-chrome-stable_current_amd64.deb

# Or Chromium
sudo apt install -y chromium-browser
```

### 2. Install system dependencies

```bash
sudo apt update
sudo apt install -y python3 python3-pip python3-venv \
    libnss3 libatk-bridge2.0-0 libdrm2 libxkbcommon0 \
    libxcomposite1 libxdamage1 libxfixes3 libxrandr2 \
    libgbm1 libasound2
# Note: on Ubuntu 24.04 some of these packages carry a t64 suffix (for example libasound2t64)
```

### 3. Create the virtual environment

```bash
cd ~/project/jdpl
python3 -m venv venv
source venv/bin/activate
pip install DrissionPage flask  # fetch_logistics_ubuntu.py also imports Flask
deactivate
```

### 4. Run the script

```bash
# Option 1: use the convenience script (if you ran setup_ubuntu.sh)
./run_logistics.sh

# Option 2: run manually
source venv/bin/activate
python jd/fetch_logistics_ubuntu.py
deactivate
```

## FAQ

### Q: I get an "externally-managed-environment" error?

A: This is the system Python protection (PEP 668) that Ubuntu enforces from 23.04 onward. **Use a virtual environment**; do not use `--break-system-packages`.
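
To confirm that the interpreter in use really belongs to a virtual environment, one option is to compare `sys.prefix` with `sys.base_prefix`. The sketch below is only an illustration; the helper name `in_virtualenv` is an assumption and is not part of the project.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv directory, while
    # sys.base_prefix still points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; run `source venv/bin/activate` first")
```

Running this with `python` after `source venv/bin/activate` should report the `venv` directory; running it with the system `python3` should report that no environment is active.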

### Q: Where is the virtual environment?

A: In the `venv` folder under the project directory. Activate it before each run: `source venv/bin/activate`

### Q: Chrome cannot be found?

A: The script looks for it automatically; you can also install it manually. Common paths:
- `/usr/bin/google-chrome`
- `/usr/bin/chromium-browser`

### Q: Headless mode vs. headed mode?

A: Set one of the following in `fetch_logistics_ubuntu.py`:
```python
USE_HEADLESS = True   # headless mode (server environments)
USE_HEADLESS = False  # headed mode (requires a graphical desktop)
```

### Q: How do I change the default URL?

A: Edit `fetch_logistics_ubuntu.py` and find:
```python
tracking_url = "https://3.cn/2t-Iibig"
```
Change it to the URL you want.

## Verify the installation

Run a test:

```bash
source venv/bin/activate
python -c "
from DrissionPage import ChromiumPage, ChromiumOptions
import os
chrome_path = '/usr/bin/google-chrome'
if not os.path.exists(chrome_path):
    chrome_path = '/usr/bin/chromium-browser'
options = ChromiumOptions()
options.set_browser_path(chrome_path)
options.headless(True)
page = ChromiumPage(options)
page.get('https://www.baidu.com')
print('✅ Test succeeded!')
page.quit()
"
deactivate
```

## Project structure

```
jdpl/
├── jd/
│   ├── fetch_logistics_ubuntu.py  # main Ubuntu script
│   ├── setup_ubuntu.sh            # automated setup script
│   └── UBUNTU_SETUP.md            # detailed documentation
├── venv/                          # Python virtual environment (created by the setup script)
└── run_logistics.sh               # convenience run script (created by the setup script)
```

309
jd/UBUNTU_SETUP.md
Normal file
@@ -0,0 +1,309 @@
# Ubuntu Environment Setup Guide

## 1. Install Google Chrome or Chromium

### Option 1: install Google Chrome (recommended)

```bash
# Download and install Google Chrome
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt install -y ./google-chrome-stable_current_amd64.deb
```

### Option 2: install Chromium (open-source build)

```bash
sudo apt update
sudo apt install -y chromium-browser
```

## 2. Install the required libraries

```bash
# Refresh the package lists
sudo apt update

# Install the Chrome/Chromium runtime dependencies
sudo apt install -y \
    libnss3 \
    libatk-bridge2.0-0 \
    libdrm2 \
    libxkbcommon0 \
    libxcomposite1 \
    libxdamage1 \
    libxfixes3 \
    libxrandr2 \
    libgbm1 \
    libasound2 \
    libpango-1.0-0 \
    libcairo2 \
    libatk1.0-0 \
    libgdk-pixbuf2.0-0 \
    libgtk-3-0
```

## 3. Install the Python dependencies

⚠️ **Note**: Ubuntu 23.04 and later (PEP 668) do not allow pip to install packages into the system Python by default. Use one of the following methods:

### Method 1: use a virtual environment (recommended) ✅

```bash
# 1. Make sure python3-venv is installed
sudo apt install -y python3-venv python3-pip

# 2. Enter the project directory
cd ~/project/jdpl  # or your own project path

# 3. Create the virtual environment
python3 -m venv venv

# 4. Activate the virtual environment
source venv/bin/activate

# 5. Install the dependencies (fetch_logistics_ubuntu.py also imports Flask)
pip install DrissionPage flask

# If you use a database
pip install sqlalchemy pymysql

# 6. Run the script (inside the virtual environment)
python jd/fetch_logistics_ubuntu.py

# 7. Leave the virtual environment when you are done
deactivate
```

### Method 2: use pipx (suited to single-command tools)

```bash
# 1. Install pipx
sudo apt install -y pipx
pipx ensurepath

# 2. Install with pipx (if you want the tool available globally)
# Note: pipx is meant for installing applications and is a poor fit for libraries
```

### Method 3: use --break-system-packages (quick, but not recommended)

```bash
# ⚠️ Warning: this can break the system Python environment; avoid it in production

# Install DrissionPage
pip3 install --break-system-packages DrissionPage

# If you use a database
pip3 install --break-system-packages sqlalchemy pymysql
```

### Method 4: install with apt (if available)

```bash
# Some packages can be installed through apt (DrissionPage usually cannot)
sudo apt install -y python3-drissionpage  # usually unavailable
```

**Method 1 (virtual environment) is recommended**; it is the safest and most standard approach.

## 4. Configuration

### Headless mode vs. headed mode

In `fetch_logistics_ubuntu.py`, set the `USE_HEADLESS` variable:

```python
USE_HEADLESS = True   # headless: suits servers, no browser window is shown
USE_HEADLESS = False  # headed: requires a graphical environment
```

### When to use headless mode:
- Servers (no desktop environment)
- SSH remote sessions
- Docker containers
- Background jobs

### When to use headed mode:
- A local Ubuntu desktop
- Debugging and watching the browser's behavior
- An X11 or Wayland display server is available
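
The choice between the two modes could also be derived from the environment instead of being edited by hand. The following is a minimal sketch under that assumption; `should_use_headless` is a hypothetical helper, not something the script currently defines, and it only checks the `DISPLAY` and `WAYLAND_DISPLAY` variables.

```python
import os


def should_use_headless(env=None) -> bool:
    """Return True when no X11 or Wayland display is advertised in the environment."""
    env = os.environ if env is None else env
    # X11 sessions set DISPLAY; Wayland sessions set WAYLAND_DISPLAY.
    has_display = bool(env.get("DISPLAY") or env.get("WAYLAND_DISPLAY"))
    return not has_display


if __name__ == "__main__":
    USE_HEADLESS = should_use_headless()
    print(f"USE_HEADLESS = {USE_HEADLESS}")
```

Note that an SSH session with X forwarding also sets `DISPLAY`, so a hand-set value remains the more predictable option on servers.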

## 5. Run the script

```bash
# Enter the script directory
cd /path/to/jdpl/jd

# Run the script (activate the virtual environment first if you installed into one)
python3 fetch_logistics_ubuntu.py
```

## 6. Troubleshooting

### Issue 1: Chrome/Chromium cannot be found

```bash
# Check whether it is installed
which google-chrome
which chromium-browser

# If nothing is found, check the common paths
ls -la /usr/bin/google-chrome*
ls -la /usr/bin/chromium*
```

### Issue 2: permission problems

```bash
# If you see permission errors, the --no-sandbox argument may be needed
# The script already adds this argument automatically
```

### Issue 3: headed mode shows no window

If you set `USE_HEADLESS = False` and the browser still cannot be displayed, you may need to:

```bash
# Check the DISPLAY environment variable
echo $DISPLAY

# If it is empty, set the display (on a local desktop)
export DISPLAY=:0

# Or use Xvfb (a virtual display)
sudo apt install -y xvfb
xvfb-run -a python3 fetch_logistics_ubuntu.py
```

### Issue 4: not enough shared memory

If you see errors related to `/dev/shm`:

```bash
# Check the size of /dev/shm
df -h /dev/shm

# If it is too small, remount it with more space (temporary)
sudo mount -o remount,size=2G /dev/shm
```
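
The same check can be done from Python before the browser starts. This is a sketch only: the helper names and the 512 MiB threshold are assumptions chosen for illustration, not values taken from the script.

```python
import os
import shutil


def shm_free_mib(path: str = "/dev/shm") -> float:
    """Return the free space on the filesystem that holds `path`, in MiB."""
    usage = shutil.disk_usage(path)
    return usage.free / (1024 * 1024)


def shm_is_small(free_mib: float, threshold_mib: float = 512.0) -> bool:
    """Return True when the free space is below the (assumed) threshold."""
    return free_mib < threshold_mib


if __name__ == "__main__":
    if os.path.isdir("/dev/shm"):
        free = shm_free_mib()
        print(f"/dev/shm free: {free:.0f} MiB")
        if shm_is_small(free):
            print("Shared memory is small; consider --disable-dev-shm-usage or a remount")
```

The script already passes `--disable-dev-shm-usage`, so this check mainly helps explain failures when that argument is removed.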

### Issue 5: missing libraries

If the browser reports missing libraries at run time:

```bash
# Install every dependency that might be needed
sudo apt install -y \
    fonts-liberation \
    libappindicator3-1 \
    libasound2 \
    libatk-bridge2.0-0 \
    libatk1.0-0 \
    libcairo2 \
    libcups2 \
    libdbus-1-3 \
    libexpat1 \
    libfontconfig1 \
    libgbm1 \
    libgcc1 \
    libglib2.0-0 \
    libgtk-3-0 \
    libnspr4 \
    libnss3 \
    libpango-1.0-0 \
    libpangocairo-1.0-0 \
    libstdc++6 \
    libx11-6 \
    libx11-xcb1 \
    libxcb1 \
    libxcomposite1 \
    libxcursor1 \
    libxdamage1 \
    libxext6 \
    libxfixes3 \
    libxi6 \
    libxrandr2 \
    libxrender1 \
    libxss1 \
    libxtst6 \
    lsb-release \
    wget \
    xdg-utils
```
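
Rather than installing everything, it can help to see which shared libraries are actually unresolved. The sketch below runs `ldd` on a browser binary and collects the entries reported as `not found`; the function names are hypothetical, and the binary path you pass in is up to you.

```python
import subprocess


def parse_missing_libs(ldd_output: str) -> list[str]:
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # A missing dependency looks like: "libnss3.so => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def missing_libs(binary: str) -> list[str]:
    """Run ldd on `binary` and list its unresolved shared libraries."""
    proc = subprocess.run(["ldd", binary], capture_output=True, text=True, check=False)
    return parse_missing_libs(proc.stdout)
```

For example, `missing_libs("/opt/google/chrome/chrome")` inspects the Chrome binary itself; `/usr/bin/google-chrome` is typically a wrapper script, which `ldd` cannot analyze.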

## 7. Docker (optional)

To run the script in a container, a Dockerfile along these lines can be used:

```dockerfile
FROM ubuntu:22.04

RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    wget \
    && wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb \
    && apt-get install -y ./google-chrome-stable_current_amd64.deb \
    && pip3 install DrissionPage flask

WORKDIR /app
COPY jd/fetch_logistics_ubuntu.py .
CMD ["python3", "fetch_logistics_ubuntu.py"]
```

## 8. Verify the installation

### If you use a virtual environment:

```bash
# Activate the virtual environment
source venv/bin/activate

# Run the test
python -c "
from DrissionPage import ChromiumPage, ChromiumOptions
import os
chrome_path = '/usr/bin/google-chrome'
if not os.path.exists(chrome_path):
    chrome_path = '/usr/bin/chromium-browser'
options = ChromiumOptions()
options.set_browser_path(chrome_path)
options.headless(True)
page = ChromiumPage(options)
page.get('https://www.baidu.com')
print('✅ Browser test succeeded!')
page.quit()
"
```

### If you installed system-wide (--break-system-packages):

```bash
python3 -c "
from DrissionPage import ChromiumPage, ChromiumOptions
import os
chrome_path = '/usr/bin/google-chrome'
if not os.path.exists(chrome_path):
    chrome_path = '/usr/bin/chromium-browser'
options = ChromiumOptions()
options.set_browser_path(chrome_path)
options.headless(True)
page = ChromiumPage(options)
page.get('https://www.baidu.com')
print('✅ Browser test succeeded!')
page.quit()
"
```

## Summary of common paths

Common Chrome/Chromium paths on Ubuntu:
- `/usr/bin/google-chrome` - Google Chrome
- `/usr/bin/google-chrome-stable` - Google Chrome (stable channel)
- `/usr/bin/chromium-browser` - Chromium
- `/usr/bin/chromium` - Chromium (short name)
- `/snap/bin/chromium` - Chromium installed from Snap
- `/opt/google/chrome/chrome` - used by some installation methods

The script checks these paths automatically.
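
The detection described above can be sketched as follows. This is an illustration rather than the script's exact code: `detect_browser` and the two constant names are assumptions, and the `PATH` lookup via `shutil.which` is one possible fallback.

```python
import os
import shutil

# Paths taken from the list above.
CANDIDATE_PATHS = [
    "/usr/bin/google-chrome",
    "/usr/bin/google-chrome-stable",
    "/usr/bin/chromium-browser",
    "/usr/bin/chromium",
    "/snap/bin/chromium",
    "/opt/google/chrome/chrome",
]
CANDIDATE_NAMES = ["google-chrome", "google-chrome-stable", "chromium-browser", "chromium"]


def detect_browser(paths=CANDIDATE_PATHS, names=CANDIDATE_NAMES):
    """Return the first usable browser executable, or None if nothing is found."""
    # Known locations first: the file must exist and be executable.
    for path in paths:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    # Then fall back to whatever is reachable through PATH.
    for name in names:
        found = shutil.which(name)
        if found:
            return found
    return None


if __name__ == "__main__":
    print(detect_browser() or "No Chrome/Chromium executable found")
```

Returning `None` instead of a default path lets the caller fail early with a clear message when no browser is installed.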
238
jd/UBUNTU_TERMINAL_DRAG_DROP.md
Normal file
@@ -0,0 +1,238 @@
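# Ubuntu Terminal Drag-and-Drop Guide

## Method 1: GNOME Terminal (the default terminal)

GNOME Terminal supports dragging files in **by default**.

### How to use it:
1. Open a terminal
2. Drag a file from the file manager (Nautilus) straight into the terminal
3. The file path is inserted at the cursor position

### If drag-and-drop does not work, check the following:

#### 1. Confirm you are using GNOME Terminal
```bash
# $TERM only reports the terminal type (for example xterm-256color), not the emulator
echo $TERM
# The shell's parent process identifies the emulator
ps -p $PPID -o comm=
```

#### 2. Check the terminal preferences
- Open GNOME Terminal
- Click `Edit` → `Preferences` → `General`
- Make sure the relevant options are enabled

#### 3. Use an alternative to dragging
If drag-and-drop does not work, you can:
- Right-click in the terminal → `Paste Filenames` (supported in some versions)
- Or type a command first and drop the file after it: `cat <drop the file here>`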

## Method 2: other terminal applications

### Tilix (tiling terminal)
```bash
# Install
sudo apt install -y tilix

# Tilix supports dragging files in by default
```

### Konsole (KDE terminal)
```bash
# Install
sudo apt install -y konsole

# Konsole supports dragging files in
```

### Alacritty
```bash
# Install
sudo apt install -y alacritty

# May need configuration; drag-and-drop may not work by default
```

### Terminator
```bash
# Install
sudo apt install -y terminator

# Supports dragging files in
```

## Method 3: use a file-selection dialog

If drag-and-drop does not work, an interactive file chooser can be used instead:

### Using a file chooser in a script
```bash
# With zenity (GNOME file chooser)
FILE=$(zenity --file-selection --title="Select a file")
echo "Selected file: $FILE"

# Or with kdialog (KDE file chooser)
FILE=$(kdialog --getopenfilename)

# The same is possible from Python
# python -c "from tkinter.filedialog import askopenfilename; print(askopenfilename())"
```
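
The zenity call above can also be wrapped from Python. The following is a minimal sketch: `pick_file` is a hypothetical helper, and it assumes the chooser prints the selected path on standard output and exits with a non-zero status when the dialog is cancelled, as zenity does.

```python
import shutil
import subprocess


def pick_file(command=("zenity", "--file-selection")):
    """Run a file-chooser command and return the chosen path.

    Returns None when the command is not installed or the dialog is cancelled.
    """
    if shutil.which(command[0]) is None:
        return None
    proc = subprocess.run(list(command), capture_output=True, text=True, check=False)
    chosen = proc.stdout.strip()
    return chosen if proc.returncode == 0 and chosen else None
```

Calling `pick_file()` opens the GNOME dialog; passing `("kdialog", "--getopenfilename")` would use the KDE chooser from the block above.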

## Method 4: use the clipboard

### Copy the file path in the file manager
1. Right-click the file in the file manager
2. Choose "Copy" or press `Ctrl+C`
3. Paste into the terminal with `Ctrl+Shift+V` (or the middle mouse button)

### Copy the full path to the clipboard
```bash
# In the file manager:
# right-click → Properties → Location (copy the full path)

# Or get the path with a command
realpath filename.txt | xclip -selection clipboard
```

## Method 5: define a shell alias or function

Create a convenience function:

```bash
# Add to ~/.bashrc or ~/.zshrc
file_path() {
    if [ $# -eq 0 ]; then
        # No argument: open the file chooser
        FILE=$(zenity --file-selection --title="Select a file")
        if [ -n "$FILE" ]; then
            echo "$FILE"
        fi
    else
        # Argument given: print it as is
        echo "$1"
    fi
}

# Usage
# file_path              # opens the file-selection dialog
# file_path ~/test.txt   # prints the path directly
```

## Method 6: use Tab completion

Ubuntu terminals support Tab completion by default:
1. Type part of a path, for example `~/proj`
2. Press `Tab` to complete it
3. If there are several matches, press `Tab` twice to list them all

## Check that drag-and-drop works

### Test steps:
1. Open GNOME Terminal
2. Open the file manager (Nautilus)
3. Find a file (for example `test.txt`)
4. Drag the file into the terminal window
5. The file path should be typed in automatically

### If drag-and-drop does not work:

#### 1. Check the desktop environment
```bash
echo $XDG_CURRENT_DESKTOP
# Should print GNOME or Ubuntu
```

#### 2. Restart the terminal
```bash
# Close every terminal window completely, then open a new one
```

#### 3. Update the system
```bash
sudo apt update
sudo apt upgrade -y
```

#### 4. Check the file manager
Make sure you are using Nautilus (the GNOME file manager):
```bash
# Look for the file manager process
ps aux | grep nautilus
```

## Alternative: handle file selection in your own code

If you are developing an application, it can offer its own file selection:

### Python + Tkinter example
```python
import tkinter as tk
from tkinter import filedialog


def select_file():
    root = tk.Tk()
    root.withdraw()  # hide the main window
    file_path = filedialog.askopenfilename()
    root.destroy()
    return file_path if file_path else None


# Usage
path = select_file()
print(f"Selected file: {path}")
```

### Bash script + file chooser
```bash
#!/bin/bash
FILE=$(zenity --file-selection --title="Select the logistics link file")
if [ -n "$FILE" ]; then
    echo "Processing file: $FILE"
    # Your processing logic
fi
```

## Quick reference

| Action | Method |
|------|------|
| Drag a file | Drag it from the file manager into the terminal (supported by GNOME Terminal by default) |
| Copy a path | `Ctrl+C` → `Ctrl+Shift+V` |
| File chooser | `zenity --file-selection` |
| Tab completion | Press `Tab` while typing a path |
| Paste filenames | Available in the right-click menu of some terminals |

## FAQ

### Q: Nothing happens after I drag a file?
A:
1. Confirm you are using GNOME Terminal
2. Try restarting the terminal
3. Check for permission problems

### Q: Dragging shows the file contents instead of the path?
A: Some terminals may require holding `Shift` or `Ctrl` while dragging to insert the path

### Q: How do I drag files into a remote SSH terminal?
A: Remote SSH terminals usually do not support drag-and-drop. Instead you can:
- Upload the file with `scp`
- Type the content by hand with `cat << EOF`
- Use an SFTP client

## Recommended workflow

For the logistics extraction script, the suggested options are:

```bash
# Option 1: drag a URL or file into the terminal
# Drag a file that contains the URL into the terminal; its path appears automatically
python jd/fetch_logistics_ubuntu.py <drop the file here>

# Option 2: pass the URL as an argument
python jd/fetch_logistics_ubuntu.py https://3.cn/2t-Iibig

# Option 3: change the script to accept interactive input
# Add file-selection support to the script
```
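
Options 1 and 2 both need the script to interpret its first argument. One way that could look is sketched below; `resolve_tracking_url` is a hypothetical helper and nothing here is confirmed to exist in `fetch_logistics_ubuntu.py`. The argument is treated either as a URL or as the path of a text file that contains one.

```python
import re
import sys
from pathlib import Path

# Matches the first http(s) URL in a block of text.
URL_PATTERN = re.compile(r"https?://[^\s'\"]+")


def resolve_tracking_url(arg: str):
    """Return a URL from `arg`, which is either a URL or a path to a file containing one."""
    # Terminals quote dragged paths; the shell normally removes the quotes,
    # so stripping them here is only a safeguard.
    cleaned = arg.strip().strip("'\"")
    if cleaned.startswith(("http://", "https://")):
        return cleaned
    path = Path(cleaned).expanduser()
    if path.is_file():
        match = URL_PATTERN.search(path.read_text(encoding="utf-8", errors="ignore"))
        return match.group(0) if match else None
    return None


if __name__ == "__main__":
    url = resolve_tracking_url(sys.argv[1]) if len(sys.argv) > 1 else None
    print(url or "No URL found; pass a URL or a file that contains one")
```

The resolved value could then be handed to `extract_logistics_info` in place of the hard-coded `tracking_url`.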
382
jd/fetch_logistics.py
Normal file
@@ -0,0 +1,382 @@
import time
import json
import re
from DrissionPage import ChromiumPage, ChromiumOptions

# Browser path
CHROME_PATH = r'C:\Program Files\Google\Chrome\Application\chrome.exe'

# Global browser instance
global_page = None


def get_global_browser():
    """Return the global browser instance, creating it on first use."""
    global global_page
    if global_page is None:
        print("Initializing browser...")
        print(f"Browser path: {CHROME_PATH}")

        # Import os to check that the executable exists
        import os
        if not os.path.exists(CHROME_PATH):
            raise FileNotFoundError(f"Chrome not found at: {CHROME_PATH}")

        options = ChromiumOptions()
        options.set_browser_path(CHROME_PATH)

        # DrissionPage launches a headed browser by default
        # (same approach as jd.py and tb.py: create it directly).
        # Maximizing the window is optional, so a failure here is ignored.
        try:
            options.set_argument('--start-maximized')
        except Exception:
            pass  # does not affect browser startup

        print("Starting browser, please wait...")
        print("If the browser does not open, check that Chrome is installed correctly")

        try:
            global_page = ChromiumPage(options)
            print("✅ Browser started successfully!")
            print(f"Current page URL: {global_page.url}")
            # Give the browser time to finish starting
            time.sleep(2)
        except Exception as e:
            print(f"❌ Browser failed to start: {e}")
            import traceback
            traceback.print_exc()
            raise
    else:
        print("Reusing the existing browser instance")
    return global_page


def extract_logistics_info(tracking_url):
    """
    Extract the waybill number, carrier and related details from a JD logistics tracking page.

    Args:
        tracking_url: URL of the tracking page, e.g. https://3.cn/2t-Iibig

    Returns:
        dict: waybill number, carrier, carrier phone and tracking events,
        or None if extraction fails
    """
    page = get_global_browser()

    try:
        print(f"\nOpening tracking page: {tracking_url}")
        page.get(tracking_url)
        print("Page loading, please wait...")
        time.sleep(5)  # wait for the page to load

        # Check that the page loaded
        current_url = page.url
        print(f"Current page URL: {current_url}")

        # Check the page title
        try:
            title = page.title
            print(f"Page title: {title}")
        except Exception:
            print("Could not read the page title")

        # Check that the page has content
        try:
            html_length = len(page.html)
            print(f"Page HTML length: {html_length} characters")
            if html_length < 100:
                print("⚠️ Warning: the page may not have finished loading")
        except Exception as e:
            print(f"⚠️ Could not read the page HTML: {e}")

        result = {
            "waybill_no": None,     # waybill number
            "carrier": None,        # domestic carrier
            "carrier_phone": None,  # domestic carrier phone
            "tracking_info": [],    # list of tracking events
            "raw_html": None        # raw HTML (for debugging)
        }

        # Method 1: listen to network requests and look for a logistics data API.
        # The listener starts after page.get(), so only requests triggered from
        # here on (for example by scrolling) are captured.
        print("Method 1: listening to network requests...")
        page.listen.start()

        # Scroll the page to trigger possible requests
        page.scroll.down(300)
        time.sleep(2)
        page.scroll.to_bottom()
        time.sleep(3)

        # Inspect the captured requests
        responses = page.listen.get()
        print(f"Captured {len(responses)} requests")

        # URL fragments that suggest a logistics data endpoint
        possible_urls = [
            'track', 'logistics', 'waybill', 'express',
            'delivery', '3.cn', 'jd.com/logistics',
            'api.m.jd.com', 'mapi.jd.com'
        ]

        for resp in responses:
            url = resp.url if hasattr(resp, 'url') else ''
            url_lower = url.lower()

            # Check whether this looks like a logistics API
            if any(keyword in url_lower for keyword in possible_urls):
                print(f"Possible logistics API found: {url[:100]}")
                try:
                    if hasattr(resp, 'response') and hasattr(resp.response, 'body'):
                        body = resp.response.body

                        # Handle a JSON response
                        if isinstance(body, dict):
                            json_data = body
                        elif isinstance(body, str):
                            try:
                                json_data = json.loads(body)
                            except Exception:
                                continue
                        else:
                            continue

                        # Try to extract the waybill number and other fields from the JSON
                        extracted = extract_from_json(json_data)
                        if extracted:
                            result.update(extracted)
                            print("Extracted data from the API response")
                            return result
                except Exception as e:
                    print(f"Error while parsing the API response: {e}")

        # Method 2: extract from the page HTML
        print("\nMethod 2: extracting data from the page HTML...")

        html = page.html
        result['raw_html'] = html[:5000]  # keep part of the HTML for debugging

        # Extract the waybill number from the HTML text
        waybill_patterns = [
            r'运单号[:：\s]*(\d+)',
            r'waybill[_\s]*no["\']?\s*[:：]\s*["\']?(\d+)',
            r'tracking[_\s]*number["\']?\s*[:：]\s*["\']?(\d+)',
            r'"waybillNo"\s*[:：]\s*["\']?(\d+)',
            r'"trackingNumber"\s*[:：]\s*["\']?(\d+)',
        ]

        for pattern in waybill_patterns:
            matches = re.findall(pattern, html, re.IGNORECASE)
            if matches:
                result['waybill_no'] = matches[0]
                print(f"Found waybill number: {result['waybill_no']}")
                break

        # Extract the carrier
        carrier_patterns = [
            r'国内承运人[:：\s]*([^\s<,，]+)',
            r'carrier[:：\s]*([^\s<,，]+)',
            r'"carrier"\s*[:：]\s*["\']?([^"\']+)',
        ]

        for pattern in carrier_patterns:
            matches = re.findall(pattern, html, re.IGNORECASE)
            if matches:
                result['carrier'] = matches[0].strip()
                print(f"Found carrier: {result['carrier']}")
                break

        # Extract the carrier phone
        phone_patterns = [
            r'国内承运人电话[:：\s]*(\d+)',
            r'carrier[_\s]*phone[:：\s]*(\d+)',
            r'"carrierPhone"\s*[:：]\s*["\']?(\d+)',
        ]

        for pattern in phone_patterns:
            matches = re.findall(pattern, html, re.IGNORECASE)
            if matches:
                result['carrier_phone'] = matches[0]
                print(f"Found carrier phone: {result['carrier_phone']}")
                break

        # Method 3: extract from DOM elements
        print("\nMethod 3: extracting data from DOM elements...")

        # Look for elements that mention the waybill
        waybill_elements = page.eles('xpath=//*[contains(text(), "运单号") or contains(text(), "运单")]')
        for elem in waybill_elements:
            text = elem.text
            parent_text = elem.parent().text if elem.parent() else ""
            full_text = text + " " + parent_text

            # Take a run of 8 or more digits as the waybill number
            numbers = re.findall(r'\d{8,}', full_text)
            if numbers and not result['waybill_no']:
                result['waybill_no'] = numbers[0]
                print(f"Found waybill number in element text: {result['waybill_no']}")

            # Extract the carrier
            if '承运人' in text and not result['carrier']:
                carrier_match = re.search(r'承运人[:：\s]*([^\s<,，]+)', full_text)
                if carrier_match:
                    result['carrier'] = carrier_match.group(1).strip()
                    print(f"Found carrier in element text: {result['carrier']}")

            # Extract the phone number
            if '电话' in text and not result['carrier_phone']:
                phone_match = re.search(r'电话[:：\s]*(\d+)', full_text)
                if phone_match:
                    result['carrier_phone'] = phone_match.group(1)
                    print(f"Found phone in element text: {result['carrier_phone']}")

        # Extract the tracking events (timeline)
        print("\nExtracting tracking events...")
        tracking_elements = page.eles('xpath=//*[contains(@class, "track") or contains(@class, "logistics") or contains(@class, "timeline")]')

        if not tracking_elements:
            # Fall back to elements that contain the current year or typical keywords
            # (the original hard-coded "2025" here)
            current_year = time.strftime('%Y')
            tracking_elements = page.eles(f'xpath=//*[contains(text(), "{current_year}") or contains(text(), "货物") or contains(text(), "到达")]')

        tracking_info = []
        for elem in tracking_elements[:20]:  # limit the number of elements
            text = elem.text
            if text and len(text) > 5:
                # Try to extract a timestamp
                time_match = re.search(r'(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})', text)
                if time_match or any(keyword in text for keyword in ['货物', '到达', '揽收', '运输', '配送', '签收']):
                    tracking_info.append({
                        'text': text.strip(),
                        'time': time_match.group(1) if time_match else None
                    })

        result['tracking_info'] = tracking_info[:10]  # keep at most 10 entries

        return result

    except Exception as e:
        print(f"Error while extracting logistics info: {e}")
        import traceback
        traceback.print_exc()
        return None


def extract_from_json(json_data):
    """
    Extract logistics information from JSON data.

    Args:
        json_data: parsed JSON (dict)

    Returns:
        dict: the extracted logistics information, or None if nothing was found
    """
    result = {}

    def search_dict(d, key_patterns):
        """Recursively search nested dicts/lists for the first key matching a pattern."""
        if isinstance(d, dict):
            for k, v in d.items():
                # Check the key name
                for pattern in key_patterns:
                    if re.search(pattern, k, re.IGNORECASE):
                        return v
                # Recurse into nested values
                if isinstance(v, (dict, list)):
                    found = search_dict(v, key_patterns)
                    if found:
                        return found
        elif isinstance(d, list):
            for item in d:
                found = search_dict(item, key_patterns)
                if found:
                    return found
        return None

    # Waybill number
    waybill = search_dict(json_data, [r'waybill', r'tracking.*number', r'运单号', r'waybillNo'])
    if waybill:
        result['waybill_no'] = str(waybill)

    # Carrier
    carrier = search_dict(json_data, [r'carrier', r'承运人', r'carrierName'])
    if carrier:
        result['carrier'] = str(carrier)

    # Carrier phone
    phone = search_dict(json_data, [r'carrier.*phone', r'承运人电话', r'carrierPhone', r'phone'])
    if phone:
        result['carrier_phone'] = str(phone)

    # Tracking events
    tracking = search_dict(json_data, [r'track', r'logistics', r'物流', r'轨迹', r'history'])
    if tracking:
        if isinstance(tracking, list):
            result['tracking_info'] = tracking
        elif isinstance(tracking, dict):
            result['tracking_info'] = [tracking]

    return result if result else None


def print_result(result):
    """Print the extraction result."""
    if not result:
        print("Could not extract any logistics information")
        return

    print("\n" + "="*50)
    print("Logistics extraction result:")
    print("="*50)
    # The keys always exist (possibly as None), so fall back with `or`
    print(f"Waybill number: {result.get('waybill_no') or 'not found'}")
    print(f"Domestic carrier: {result.get('carrier') or 'not found'}")
    print(f"Domestic carrier phone: {result.get('carrier_phone') or 'not found'}")

    if result.get('tracking_info'):
        print(f"\nTracking events ({len(result['tracking_info'])} in total):")
        for idx, info in enumerate(result['tracking_info'], 1):
            if isinstance(info, dict):
                text = info.get('text', str(info))
                time_str = info.get('time', '')
                print(f"  {idx}. {text}")
                if time_str:
                    print(f"     Time: {time_str}")
            else:
                print(f"  {idx}. {info}")
    else:
        print("\nTracking events: not found")

    print("="*50)


# Entry point
if __name__ == '__main__':
    # Test URL
    tracking_url = "https://3.cn/2t-Iibig"

    print("="*60)
    print("JD logistics information extractor")
    print("="*60)
    print(f"Target URL: {tracking_url}")
    print("Starting extraction...\n")

    try:
        result = extract_logistics_info(tracking_url)
    except Exception as e:
        print(f"\n❌ Error during execution: {e}")
        import traceback
        traceback.print_exc()
        result = None

    if result:
        print_result(result)

        # Save the result to a file
        output_file = "logistics_result.json"
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(result, f, ensure_ascii=False, indent=2)
        print(f"\nResult saved to: {output_file}")
    else:
        print("Extraction failed")

    print("\nScript finished; the browser stays open for debugging")
454
jd/fetch_logistics_ubuntu.py
Normal file
@@ -0,0 +1,454 @@
import time
import json
import re
import os
import platform
import threading
from flask import Flask, request, jsonify
from DrissionPage import ChromiumPage, ChromiumOptions

# Common Chrome/Chromium paths on Ubuntu
UBUNTU_CHROME_PATHS = [
    '/usr/bin/google-chrome',
    '/usr/bin/google-chrome-stable',
    '/usr/bin/chromium-browser',
    '/usr/bin/chromium',
    '/snap/bin/chromium',
    '/opt/google/chrome/chrome',
]

# Whether to run headless
# True: no UI, suits server environments
# False: with UI, requires X11 or Wayland
USE_HEADLESS = True  # change as needed

# Global browser instance
global_page = None


def find_chrome_path():
    """Locate the Chrome/Chromium executable on an Ubuntu system."""
    print("Looking for a Chrome/Chromium browser...")

    # Try the common paths first
    for path in UBUNTU_CHROME_PATHS:
        if os.path.exists(path):
            print(f"✅ Found browser: {path}")
            return path

    # Fall back to a PATH lookup (what the `which` command does)
    import shutil
    for name in ('google-chrome', 'chromium-browser'):
        path = shutil.which(name)
        if path:
            print(f"✅ Found browser via PATH lookup: {path}")
            return path

    # Nothing found: return the most common path
    default_path = '/usr/bin/google-chrome'
    print(f"⚠️ No browser found, falling back to the default path: {default_path}")
    print("Make sure Google Chrome or Chromium is installed:")
    print("  sudo apt update")
    print("  sudo apt install -y google-chrome-stable")
    print("  or")
    print("  sudo apt install -y chromium-browser")
    return default_path


def get_global_browser():
    """Return the global browser instance (Ubuntu version), creating it on first use."""
    global global_page
    if global_page is None:
        print("="*60)
        print("Ubuntu browser initialization")
        print("="*60)

        # Check the operating system
        if platform.system() != 'Linux':
            print(f"⚠️ Warning: running on {platform.system()}, but this script is written for Ubuntu")

        # Locate Chrome
        chrome_path = find_chrome_path()

        options = ChromiumOptions()
        options.set_browser_path(chrome_path)

        # Ubuntu servers usually run headless
        if USE_HEADLESS:
            print("Configuring headless mode...")
            try:
                options.headless(True)
            except Exception:
                # If headless() is unavailable, fall back to raw arguments
                try:
                    options.set_argument('--headless=new')
                    options.set_argument('--no-sandbox')
                    options.set_argument('--disable-dev-shm-usage')
                except Exception:
                    pass
        else:
            print("Configuring headed mode...")
            # Check for a display
            display = os.environ.get('DISPLAY')
            if not display:
                print("⚠️ Warning: the DISPLAY environment variable is not set")
                print("If the browser cannot be displayed:")
                print("  1. Set USE_HEADLESS = True")
                print("  2. Or set the DISPLAY environment variable (e.g. DISPLAY=:0)")
                print("  3. Or use Xvfb (a virtual display)")

        # Linux-specific arguments
        try:
            options.set_argument('--no-sandbox')  # needed in some environments
            options.set_argument('--disable-dev-shm-usage')  # avoid running out of /dev/shm space
            options.set_argument('--disable-gpu')  # optional; useful in headless mode
        except Exception:
            pass

        print("Starting browser...")
        print(f"Browser path: {chrome_path}")
        if USE_HEADLESS:
            print("Mode: headless (runs in the background)")
        else:
            print("Mode: headed")

        try:
            global_page = ChromiumPage(options)
            print("✅ Browser started successfully!")
            time.sleep(2)  # give the browser time to finish starting
        except Exception as e:
            print(f"❌ Browser failed to start: {e}")
            print("\nPossible fixes:")
            print("1. Make sure Chrome/Chromium is installed:")
            print("   sudo apt update")
            print("   sudo apt install -y google-chrome-stable")
            print("2. If headless mode fails, try USE_HEADLESS = False")
            print("3. Make sure you have sufficient permissions")
            print("4. Check for missing dependencies:")
            print("   sudo apt install -y libnss3 libatk-bridge2.0-0 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libasound2")
            import traceback
            traceback.print_exc()
            raise
    else:
        print("Reusing the existing browser instance")

    return global_page
|
||||
|
||||
|
||||
def extract_logistics_info(tracking_url):
|
||||
"""
|
||||
从京东物流追踪页面提取运单号、承运人等信息(Ubuntu 版本)
|
||||
|
||||
Args:
|
||||
tracking_url: 物流追踪页面 URL,例如 https://3.cn/2t-Iibig
|
||||
|
||||
Returns:
|
||||
dict: 包含运单号、承运人、承运人电话、物流跟踪信息等的字典
|
||||
"""
|
||||
page = get_global_browser()
|
||||
|
||||
try:
|
||||
print(f"\n正在打开物流追踪页面: {tracking_url}")
|
||||
page.get(tracking_url)
|
||||
print("页面加载中,请稍候...")
|
||||
time.sleep(5) # 等待页面加载
|
||||
|
||||
# 检查页面是否成功加载
|
||||
current_url = page.url
|
||||
print(f"当前页面 URL: {current_url}")
|
||||
|
||||
# 检查页面标题
|
||||
try:
|
||||
title = page.title
|
||||
print(f"页面标题: {title}")
|
||||
except:
|
||||
print("无法获取页面标题")
|
||||
|
||||
# 检查页面是否有内容
|
||||
try:
|
||||
html_length = len(page.html)
|
||||
print(f"页面 HTML 长度: {html_length} 字符")
|
||||
if html_length < 100:
|
||||
print("⚠️ 警告: 页面内容可能未完全加载")
|
||||
except Exception as e:
|
||||
print(f"⚠️ 无法获取页面 HTML: {e}")
|
||||
|
||||
result = {
|
||||
"waybill_no": None, # 运单号
|
||||
"carrier": None, # 国内承运人
|
||||
"carrier_phone": None, # 国内承运人电话
|
||||
"tracking_info": [], # 物流跟踪信息列表
|
||||
}
|
||||
|
||||
# 从 DOM 元素中提取数据
|
||||
print("\n从 DOM 元素提取数据...")
|
||||
|
||||
# 尝试查找运单号元素
|
||||
waybill_elements = page.eles('xpath=//*[contains(text(), "运单号") or contains(text(), "运单")]')
|
||||
for elem in waybill_elements:
|
||||
text = elem.text
|
||||
parent_text = elem.parent().text if elem.parent() else ""
|
||||
full_text = text + " " + parent_text
|
||||
|
||||
# 从文本中提取数字作为运单号
|
||||
numbers = re.findall(r'\d{8,}', full_text)
|
||||
if numbers and not result['waybill_no']:
|
||||
result['waybill_no'] = numbers[0]
|
||||
print(f"✅ 找到运单号: {result['waybill_no']}")
|
||||
|
||||
# 提取承运人
|
||||
if '承运人' in text and not result['carrier']:
|
||||
carrier_match = re.search(r'承运人[::\s]*([^\s<,,]+)', full_text)
|
||||
if carrier_match:
|
||||
result['carrier'] = carrier_match.group(1).strip()
|
||||
print(f"✅ 找到承运人: {result['carrier']}")
|
||||
|
||||
# 提取电话
|
||||
if '电话' in text and not result['carrier_phone']:
|
||||
phone_match = re.search(r'电话[::\s]*(\d+)', full_text)
|
||||
if phone_match:
|
||||
result['carrier_phone'] = phone_match.group(1)
|
||||
print(f"✅ 找到承运人电话: {result['carrier_phone']}")
|
||||
|
||||
# 提取物流跟踪信息(时间线)
|
||||
print("\n提取物流跟踪信息...")
|
||||
tracking_elements = page.eles('xpath=//*[contains(@class, "track") or contains(@class, "logistics") or contains(@class, "timeline")]')
|
||||
|
||||
if not tracking_elements:
|
||||
# 尝试查找包含时间戳的元素
|
||||
tracking_elements = page.eles('xpath=//*[contains(text(), "%s") or contains(text(), "货物") or contains(text(), "到达")]' % time.strftime('%Y'))  # match the current year rather than a hard-coded 2025
|
||||
|
||||
tracking_info = []
|
||||
for elem in tracking_elements[:20]: # 限制数量
|
||||
text = elem.text
|
||||
if text and len(text) > 5:
|
||||
# 尝试提取时间戳
|
||||
time_match = re.search(r'(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})', text)
|
||||
if time_match or any(keyword in text for keyword in ['货物', '到达', '揽收', '运输', '配送', '签收']):
|
||||
tracking_info.append({
|
||||
'text': text.strip(),
|
||||
'time': time_match.group(1) if time_match else None
|
||||
})
|
||||
|
||||
result['tracking_info'] = tracking_info[:10] # 最多保存10条
|
||||
|
||||
if result['tracking_info']:
|
||||
print(f"✅ 找到 {len(result['tracking_info'])} 条物流跟踪信息")
|
||||
|
||||
return result
|
||||
|
||||
except Exception as e:
|
||||
print(f"提取物流信息时出错: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return None
|
||||
|
||||
|
||||
def print_result(result):
|
||||
"""打印提取结果"""
|
||||
if not result:
|
||||
print("未能提取到物流信息")
|
||||
return
|
||||
|
||||
print("\n" + "="*50)
|
||||
print("物流信息提取结果:")
|
||||
print("="*50)
|
||||
# The keys always exist with a None value, so a .get() default would never apply; use `or` instead.
print(f"运单号: {result.get('waybill_no') or '未找到'}")
print(f"国内承运人: {result.get('carrier') or '未找到'}")
print(f"国内承运人电话: {result.get('carrier_phone') or '未找到'}")
|
||||
|
||||
if result.get('tracking_info'):
|
||||
print(f"\n物流跟踪信息 (共 {len(result['tracking_info'])} 条):")
|
||||
for idx, info in enumerate(result['tracking_info'], 1):
|
||||
if isinstance(info, dict):
|
||||
text = info.get('text', str(info))
|
||||
time_str = info.get('time', '')
|
||||
print(f" {idx}. {text}")
|
||||
if time_str:
|
||||
print(f" 时间: {time_str}")
|
||||
else:
|
||||
print(f" {idx}. {info}")
|
||||
else:
|
||||
print("\n物流跟踪信息: 未找到")
|
||||
|
||||
print("="*50)
|
||||
|
||||
|
||||
# =================== Flask API 接口 ===================
|
||||
# 初始化 Flask 应用
|
||||
app = Flask(__name__)
|
||||
|
||||
# 初始化锁,防止并发访问
|
||||
fetch_lock = threading.Lock()
|
||||
|
||||
|
||||
@app.route('/fetch_logistics', methods=['GET', 'POST'])
|
||||
def fetch_logistics():
|
||||
"""
|
||||
查询物流信息接口
|
||||
|
||||
参数:
|
||||
tracking_url: 物流追踪页面 URL(GET 或 POST)
|
||||
例如: https://3.cn/2t-Iibig
|
||||
|
||||
返回:
|
||||
JSON 格式的物流信息,包含:
|
||||
- waybill_no: 运单号
|
||||
- carrier: 国内承运人
|
||||
- carrier_phone: 国内承运人电话
|
||||
- tracking_info: 物流跟踪信息列表
|
||||
- success: 是否成功
|
||||
- message: 消息提示
|
||||
"""
|
||||
# 获取参数(支持 GET 和 POST)
|
||||
if request.method == 'POST':
|
||||
if request.is_json:
|
||||
data = request.get_json(silent=True) or {}  # malformed or empty JSON falls through to the "missing parameter" 400 below
|
||||
tracking_url = data.get('tracking_url') or data.get('url')
|
||||
else:
|
||||
tracking_url = request.form.get('tracking_url') or request.form.get('url') or request.args.get('tracking_url') or request.args.get('url')
|
||||
else:
|
||||
tracking_url = request.args.get('tracking_url') or request.args.get('url')
|
||||
|
||||
if not tracking_url:
|
||||
return jsonify({
|
||||
"success": False,
|
||||
"error": "缺少参数 tracking_url 或 url",
|
||||
"message": "请提供物流追踪页面 URL"
|
||||
}), 400
|
||||
|
||||
# 验证 URL 格式
|
||||
if not (tracking_url.startswith('http://') or tracking_url.startswith('https://')):
|
||||
return jsonify({
|
||||
"success": False,
|
||||
"error": "URL 格式错误",
|
||||
"message": "URL 必须以 http:// 或 https:// 开头"
|
||||
}), 400
|
||||
|
||||
try:
|
||||
with fetch_lock: # 加锁,防止并发调用
|
||||
print(f"\n收到物流查询请求: {tracking_url}")
|
||||
result = extract_logistics_info(tracking_url)
|
||||
|
||||
if result:
|
||||
# 构建返回数据
|
||||
response_data = {
|
||||
"success": True,
|
||||
"message": "查询成功",
|
||||
"data": {
|
||||
"waybill_no": result.get('waybill_no'),
|
||||
"carrier": result.get('carrier'),
|
||||
"carrier_phone": result.get('carrier_phone'),
|
||||
"tracking_info": result.get('tracking_info', []),
|
||||
"tracking_count": len(result.get('tracking_info', []))
|
||||
},
|
||||
"url": tracking_url
|
||||
}
|
||||
|
||||
# 如果有些信息未找到,添加提示
|
||||
missing_fields = []
|
||||
if not result.get('waybill_no'):
|
||||
missing_fields.append('waybill_no')
|
||||
if not result.get('carrier'):
|
||||
missing_fields.append('carrier')
|
||||
|
||||
if missing_fields:
|
||||
response_data["warning"] = f"以下字段未找到: {', '.join(missing_fields)}"
|
||||
|
||||
return jsonify(response_data), 200
|
||||
else:
|
||||
return jsonify({
|
||||
"success": False,
|
||||
"error": "提取失败",
|
||||
"message": "未能从页面中提取到物流信息",
|
||||
"url": tracking_url
|
||||
}), 500
|
||||
|
||||
except Exception as e:
|
||||
print(f"查询物流信息时出错: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return jsonify({
|
||||
"success": False,
|
||||
"error": str(e),
|
||||
"message": "服务器内部错误",
|
||||
"url": tracking_url
|
||||
}), 500
|
||||
|
||||
|
||||
@app.route('/health', methods=['GET'])
def health():
    """Health check endpoint."""
    return jsonify({
        "status": "ok",
        "service": "京东物流信息查询服务",
        "version": "1.0.0"
    }), 200
|
||||
|
||||
|
||||
@app.route('/', methods=['GET'])
|
||||
def index():
|
||||
"""首页,返回 API 使用说明"""
|
||||
return jsonify({
|
||||
"service": "京东物流信息查询 API",
|
||||
"version": "1.0.0",
|
||||
"endpoints": {
|
||||
"/fetch_logistics": {
|
||||
"method": ["GET", "POST"],
|
||||
"description": "查询物流信息",
|
||||
"parameters": {
|
||||
"tracking_url": "物流追踪页面 URL(必需)",
|
||||
"url": "tracking_url 的别名(可选)"
|
||||
},
|
||||
"example_get": "/fetch_logistics?tracking_url=https://3.cn/2t-Iibig",
|
||||
"example_post": "POST /fetch_logistics\n{\"tracking_url\": \"https://3.cn/2t-Iibig\"}"
|
||||
},
|
||||
"/health": {
|
||||
"method": ["GET"],
|
||||
"description": "健康检查"
|
||||
}
|
||||
}
|
||||
}), 200
|
||||
|
||||
|
||||
# =================== 启动服务 ===================
|
||||
if __name__ == '__main__':
|
||||
# API 服务模式(默认)
|
||||
print("="*60)
|
||||
print("京东物流信息查询 API 服务 (Ubuntu 版本)")
|
||||
print("="*60)
|
||||
print(f"无头模式: {'是' if USE_HEADLESS else '否'}")
|
||||
print("\n服务接口:")
|
||||
print(" GET/POST /fetch_logistics?tracking_url=<URL> - 查询物流信息")
|
||||
print(" GET /health - 健康检查")
|
||||
print(" GET / - API 说明")
|
||||
print("\n启动服务...")
|
||||
print("服务地址: http://0.0.0.0:5001")
|
||||
print("按 Ctrl+C 停止服务\n")
|
||||
|
||||
try:
|
||||
app.run(host='0.0.0.0', port=5001, debug=False, threaded=True)
|
||||
except KeyboardInterrupt:
|
||||
print("\n\n服务已停止")
|
||||
finally:
|
||||
if 'global_page' in globals() and global_page:
|
||||
try:
|
||||
global_page.quit()
|
||||
print("浏览器已关闭")
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
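The text extraction in `extract_logistics_info` above comes down to three regular expressions applied to element text. The following is a minimal sketch, not code from the commit: `parse_logistics_text` is a hypothetical helper and the sample string is invented. It isolates the same patterns so they can be tried without launching a browser.

```python
import re


def parse_logistics_text(text):
    """Hypothetical helper: apply the patterns used in extract_logistics_info to a plain string."""
    result = {"waybill_no": None, "carrier": None, "carrier_phone": None}

    # Waybill number: the first run of at least 8 digits.
    numbers = re.findall(r'\d{8,}', text)
    if numbers:
        result["waybill_no"] = numbers[0]

    # Carrier: the token following "承运人", up to whitespace, "<", or a comma.
    carrier_match = re.search(r'承运人[::\s]*([^\s<,,]+)', text)
    if carrier_match:
        result["carrier"] = carrier_match.group(1).strip()

    # Phone: the digits following "电话".
    phone_match = re.search(r'电话[::\s]*(\d+)', text)
    if phone_match:
        result["carrier_phone"] = phone_match.group(1)

    return result


# Invented sample text, for illustration only.
sample = "运单号:123456789012 国内承运人:京东快递 国内承运人电话:950616"
parsed = parse_logistics_text(sample)
```

Because the waybill pattern takes the first long digit run, field order in the page text matters; a phone number of 8 or more digits that appears first would be picked up as the waybill number.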
206
jd/jd.py
@@ -6,8 +6,7 @@ import threading
|
||||
from flask import Flask, request, jsonify
|
||||
from DrissionPage import ChromiumPage, ChromiumOptions
|
||||
from sqlalchemy import create_engine, Column, Integer, String, Text, DateTime
|
||||
from sqlalchemy.ext.declarative import declarative_base
|
||||
from sqlalchemy.orm import sessionmaker
|
||||
from sqlalchemy.orm import declarative_base, sessionmaker
|
||||
|
||||
# =================== 配置部分 ===================
|
||||
# 浏览器路径(请根据本地实际路径修改)
|
||||
@@ -27,6 +26,14 @@ app = Flask(__name__)
|
||||
# 初始化锁
|
||||
fetch_lock = threading.Lock()
|
||||
|
||||
# 全局爬虫控制标志
|
||||
crawler_running = False
|
||||
crawler_thread = None
|
||||
current_product_id = None
|
||||
|
||||
# 当前“允许运行”的抓取任务 product_id(新请求会覆盖,旧线程检测到不匹配则退出)
|
||||
active_fetch_product_id = None
|
||||
|
||||
|
||||
# 初始化数据库连接
|
||||
db_url = f"mysql+pymysql://{db_config['user']}:{db_config['password']}@{db_config['host']}:{db_config['port']}/{db_config['database']}?charset=utf8mb4"
|
||||
@@ -65,16 +72,27 @@ def get_global_browser():
|
||||
return global_page
|
||||
|
||||
|
||||
def _is_fetch_cancelled(product_id):
    """Return True if a newer request has cancelled this task (only the latest request's product_id stays active)."""
    global active_fetch_product_id
    return active_fetch_product_id is not None and active_fetch_product_id != product_id
|
||||
|
||||
|
||||
def fetch_jd_comments(product_id):
|
||||
global active_fetch_product_id
|
||||
page = get_global_browser() # 使用全局浏览器
|
||||
try:
|
||||
# 打开商品页面
|
||||
page.get(f'https://item.jd.com/{product_id}.html#crumb-wrap')
|
||||
time.sleep(random.uniform(5, 8))
|
||||
if _is_fetch_cancelled(product_id):
|
||||
return 0
|
||||
|
||||
# 向下滚动主页面
|
||||
page.scroll.down(150)
|
||||
time.sleep(random.uniform(3, 5))
|
||||
if _is_fetch_cancelled(product_id):
|
||||
return 0
|
||||
|
||||
# 点击“买家赞不绝口”
|
||||
element1 = page.ele('xpath=//div[contains(text(), "买家赞不绝口")]')
|
||||
@@ -86,16 +104,20 @@ def fetch_jd_comments(product_id):
|
||||
if element1:
|
||||
element1.click()
|
||||
time.sleep(random.uniform(3, 5))
|
||||
if _is_fetch_cancelled(product_id):
|
||||
return 0
|
||||
# 点击“当前商品”
|
||||
element2 = page.ele('xpath=//div[contains(text(), "当前商品")]')
|
||||
if element2:
|
||||
element2.click()
|
||||
time.sleep(random.uniform(3, 5))
|
||||
|
||||
if _is_fetch_cancelled(product_id):
|
||||
return 0
|
||||
# 定位弹窗区域
|
||||
popup = page.ele('xpath=//*[@id="rateList"]/div/div[3]')
|
||||
if not popup:
|
||||
return []
|
||||
return 0
|
||||
|
||||
# 点击“视频”
|
||||
element3 = page.ele('xpath=//div[contains(text(), "视频")]')
|
||||
@@ -103,20 +125,28 @@ def fetch_jd_comments(product_id):
|
||||
element3.click()
|
||||
time.sleep(random.uniform(3, 5))
|
||||
|
||||
if _is_fetch_cancelled(product_id):
|
||||
return 0
|
||||
# 监听请求
|
||||
page.listen.start('https://api.m.jd.com/client.action')
|
||||
|
||||
max_retries = 10 # at most 10 attempts with no new data
|
||||
retry_count = 0
|
||||
new_comments = [] # 存储最终的新评论
|
||||
seen_ids = set() # 已处理过的 comment_id
|
||||
total_comments_saved = 0 # 总共保存的评论数
|
||||
|
||||
while retry_count < max_retries and len(new_comments) < 10:
|
||||
# 持续获取评论,直到被新请求取消或手动停止
|
||||
while True:
|
||||
if _is_fetch_cancelled(product_id):
|
||||
print(f"[fetch_jd_comments] 商品 {product_id} 已被新请求取消,退出")
|
||||
break
|
||||
scroll_amount = random.randint(10000, 100000)
|
||||
popup.scroll.down(scroll_amount)
|
||||
print(f"弹窗向下滚动了 {scroll_amount} 像素")
|
||||
|
||||
time.sleep(random.uniform(3, 5))
|
||||
if _is_fetch_cancelled(product_id):
|
||||
break
|
||||
|
||||
resp = page.listen.wait(timeout=5)
|
||||
if resp and 'getCommentListPage' in resp.request.postData:
|
||||
@@ -161,6 +191,12 @@ def fetch_jd_comments(product_id):
|
||||
print(f"本次获取到 {len(fresh_comments)} 条新评论")
|
||||
new_comments.extend(fresh_comments)
|
||||
retry_count = 0 # 有新数据,重置重试计数器
|
||||
|
||||
# 立即保存这批评论到数据库
|
||||
save_comments_to_db(product_id, fresh_comments)
|
||||
total_comments_saved += len(fresh_comments)
|
||||
print(f"已保存 {len(fresh_comments)} 条评论到数据库,总计保存 {total_comments_saved} 条评论")
|
||||
|
||||
else:
|
||||
print("本次无新评论,继续滚动...")
|
||||
retry_count += 1
|
||||
@@ -173,16 +209,35 @@ def fetch_jd_comments(product_id):
|
||||
else:
|
||||
print("未捕获到新的评论数据,继续滚动...")
|
||||
retry_count += 1
|
||||
if _is_fetch_cancelled(product_id):
|
||||
break
|
||||
|
||||
print(f"共抓取到 {len(new_comments)} 条新评论(最多需要10条)")
|
||||
return new_comments[:10] # 只保留前10条
|
||||
print(f"爬虫已停止,共抓取到 {total_comments_saved} 条评论")
|
||||
return total_comments_saved
|
||||
|
||||
except Exception as e:
|
||||
print("发生错误:", e)
|
||||
return []
|
||||
return 0
|
||||
|
||||
|
||||
|
||||
# =================== 持续爬虫后台运行函数 ===================
|
||||
def continuous_crawler(product_id):
    """Background function that keeps crawling comments until stopped."""
    global crawler_running
    try:
        print(f"开始持续爬取商品 {product_id} 的评论...")
        while crawler_running:
            result = fetch_jd_comments(product_id)
            if not crawler_running:
                break
            # Pause before the next round (applies whether or not data was fetched)
            time.sleep(10)
        print(f"商品 {product_id} 的持续爬取已停止")
    except Exception as e:
        print(f"持续爬虫发生错误: {e}")
        crawler_running = False
|
||||
|
||||
# =================== 提取评论并保存到数据库 ===================
|
||||
def save_comments_to_db(product_id, comments):
|
||||
session = Session()
|
||||
@@ -229,33 +284,144 @@ def save_comments_to_db(product_id, comments):
|
||||
|
||||
|
||||
# =================== Flask API 接口 ===================
|
||||
@app.route('/fetch_comments', methods=['POST'])
|
||||
def fetch_comments():
|
||||
@app.route('/start_crawler', methods=['POST'])
|
||||
def start_crawler():
|
||||
"""启动持续爬虫"""
|
||||
global crawler_running, crawler_thread, current_product_id
|
||||
|
||||
product_id = request.args.get('product_id')
|
||||
if not product_id:
|
||||
return jsonify({"error": "缺少 product_id"}), -200
|
||||
return jsonify({"error": "缺少 product_id"}), 400
|
||||
|
||||
if crawler_running:
|
||||
return jsonify({
|
||||
"message": f"爬虫已在运行中,当前商品ID: {current_product_id}",
|
||||
"status": "already_running"
|
||||
}), 200
|
||||
|
||||
try:
|
||||
with fetch_lock: # 加锁,防止并发调用
|
||||
comments = fetch_jd_comments(product_id)
|
||||
if not comments:
|
||||
return jsonify({"message": "未获取到评论数据"}), -200
|
||||
|
||||
save_comments_to_db(product_id, comments)
|
||||
with fetch_lock:
|
||||
crawler_running = True
|
||||
current_product_id = product_id
|
||||
crawler_thread = threading.Thread(target=continuous_crawler, args=(product_id,))
|
||||
crawler_thread.daemon = True
|
||||
crawler_thread.start()
|
||||
|
||||
return jsonify({
|
||||
"message": f"成功保存 {len(comments)} 条评论",
|
||||
"message": f"已启动持续爬虫,商品ID: {product_id}",
|
||||
"status": "started",
|
||||
"product_id": product_id
|
||||
}), 200
|
||||
|
||||
except Exception as e:
|
||||
return jsonify({"error": str(e)}), -200
|
||||
crawler_running = False
|
||||
return jsonify({"error": str(e)}), 500
|
||||
|
||||
|
||||
@app.route('/stop_crawler', methods=['POST'])
|
||||
def stop_crawler():
|
||||
"""停止持续爬虫"""
|
||||
global crawler_running, crawler_thread, current_product_id
|
||||
|
||||
if not crawler_running:
|
||||
return jsonify({
|
||||
"message": "爬虫未在运行",
|
||||
"status": "not_running"
|
||||
}), 200
|
||||
|
||||
try:
|
||||
with fetch_lock:
|
||||
crawler_running = False
|
||||
stopped_product_id = current_product_id
|
||||
current_product_id = None
|
||||
|
||||
# 等待线程结束
|
||||
if crawler_thread and crawler_thread.is_alive():
|
||||
crawler_thread.join(timeout=10)
|
||||
|
||||
return jsonify({
|
||||
"message": f"已停止持续爬虫,商品ID: {stopped_product_id}",
|
||||
"status": "stopped",
|
||||
"product_id": stopped_product_id
|
||||
}), 200
|
||||
|
||||
except Exception as e:
|
||||
return jsonify({"error": str(e)}), 500
|
||||
|
||||
|
||||
@app.route('/crawler_status', methods=['GET'])
|
||||
def crawler_status():
|
||||
"""获取爬虫状态"""
|
||||
global crawler_running, current_product_id
|
||||
|
||||
return jsonify({
|
||||
"running": crawler_running,
|
||||
"product_id": current_product_id,
|
||||
"status": "running" if crawler_running else "stopped"
|
||||
}), 200
|
||||
|
||||
|
||||
@app.route('/test', methods=['GET'])
|
||||
def test():
|
||||
"""测试端点,验证服务器是否正常工作"""
|
||||
print("测试端点被访问")
|
||||
return jsonify({"message": "服务器运行正常", "status": "ok"}), 200
|
||||
|
||||
|
||||
@app.route('/fetch_comments', methods=['GET', 'POST'])
|
||||
def fetch_comments():
|
||||
"""单次获取评论(在后台运行,立即返回)。新请求会中断所有历史请求线程,只执行本次请求。"""
|
||||
global crawler_running, active_fetch_product_id
|
||||
print(f"[fetch_comments] 收到请求,方法: {request.method}, 参数: {request.args}")
|
||||
product_id = request.args.get('product_id')
|
||||
|
||||
if not product_id:
|
||||
print("[fetch_comments] 错误: 缺少 product_id")
|
||||
return jsonify({"error": "缺少 product_id"}), 400
|
||||
|
||||
print(f"[fetch_comments] 开始处理商品ID: {product_id},将中断所有历史请求后执行")
|
||||
|
||||
try:
|
||||
# 立刻中断所有历史:停止持续爬虫并标记“当前任务”为新 product_id,旧线程在循环中检测到会自行退出
|
||||
with fetch_lock:
|
||||
crawler_running = False
|
||||
active_fetch_product_id = product_id
|
||||
|
||||
def run_fetch():
|
||||
try:
|
||||
print(f"[后台线程] 开始获取商品 {product_id} 的评论...")
|
||||
result = fetch_jd_comments(product_id)
|
||||
print(f"[后台线程] 获取完成,结果: {result}")
|
||||
except Exception as e:
|
||||
import traceback
|
||||
error_msg = f"后台获取评论时发生错误: {e}\n{traceback.format_exc()}"
|
||||
print(f"[后台线程] {error_msg}")
|
||||
|
||||
fetch_thread = threading.Thread(target=run_fetch)
|
||||
fetch_thread.daemon = True
|
||||
fetch_thread.start()
|
||||
print(f"[fetch_comments] 后台线程已启动(历史请求已标记为取消)")
|
||||
|
||||
response_data = {
|
||||
"message": f"已开始获取商品 {product_id} 的评论,正在后台运行中...(已中断之前的请求)",
|
||||
"status": "started",
|
||||
"product_id": product_id,
|
||||
"note": "评论获取在后台进行,请稍后查看数据库或使用 /crawler_status 查看状态"
|
||||
}
|
||||
print(f"[fetch_comments] 返回响应: {response_data}")
|
||||
return jsonify(response_data), 200
|
||||
|
||||
except Exception as e:
|
||||
import traceback
|
||||
error_msg = f"处理请求时发生错误: {e}\n{traceback.format_exc()}"
|
||||
print(f"[fetch_comments] {error_msg}")
|
||||
return jsonify({"error": str(e)}), 500
|
||||
|
||||
|
||||
# =================== 启动服务 ===================
|
||||
if __name__ == '__main__':
|
||||
try:
|
||||
app.run(host='0.0.0.0', port=5000, debug=True)
|
||||
app.run(host='0.0.0.0', port=5008, debug=True)
|
||||
finally:
|
||||
if 'global_page' in globals() and global_page:
|
||||
global_page.quit()
|
||||
|
||||
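The diff above cancels stale fetches by comparing a global `active_fetch_product_id` inside the crawl loop. Below is a minimal sketch of the same cooperative-cancellation idea using a per-request token guarded by a lock. `TaskRegistry`, `begin`, and `is_cancelled` are hypothetical names that do not appear in the commit, and the worker body is only a stand-in for the scroll-and-listen loop.

```python
import threading
import time


class TaskRegistry:
    """Hypothetical helper: remembers which task token is the most recent."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = 0

    def begin(self):
        # Every new request receives a fresh token and supersedes older ones.
        with self._lock:
            self._current += 1
            return self._current

    def is_cancelled(self, token):
        with self._lock:
            return token != self._current


registry = TaskRegistry()
finished = []


def worker(token, rounds):
    # Stand-in for the crawl loop: check for cancellation between steps.
    for _ in range(rounds):
        if registry.is_cancelled(token):
            finished.append((token, "cancelled"))
            return
        time.sleep(0.01)
    finished.append((token, "done"))


first = registry.begin()
t1 = threading.Thread(target=worker, args=(first, 1000))
t1.start()

second = registry.begin()  # a newer request supersedes the first task
t2 = threading.Thread(target=worker, args=(second, 3))
t2.start()

t1.join()
t2.join()
```

One possible advantage of a token over a product ID: a second request for the same product also cancels the older thread, whereas a product-ID comparison would let both keep running.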
154
jd/logistics.py
Normal file
@@ -0,0 +1,154 @@
|
||||
import time
|
||||
import json
|
||||
import re
|
||||
from DrissionPage import ChromiumPage, ChromiumOptions
|
||||
|
||||
# 设置浏览器路径
|
||||
CHROME_PATH = r'C:\Program Files\Google\Chrome\Application\chrome.exe'
|
||||
|
||||
# 物流追踪页面 URL
|
||||
TRACKING_URL = "https://3.cn/2t-Iibig"
|
||||
|
||||
# 配置并启动浏览器
|
||||
options = ChromiumOptions()
|
||||
options.set_browser_path(CHROME_PATH)
|
||||
|
||||
# 创建浏览器实例
|
||||
page = ChromiumPage(options)
|
||||
|
||||
try:
|
||||
print("正在打开物流追踪页面...")
|
||||
page.get(TRACKING_URL)
|
||||
|
||||
# 等待页面加载
|
||||
time.sleep(5)
|
||||
|
||||
print("\n=== 方法1: 尝试从页面元素提取信息 ===")
|
||||
|
||||
# 尝试提取运单号
|
||||
waybill_elements = page.eles('xpath=//*[contains(text(), "运单号")]')
|
||||
if waybill_elements:
|
||||
print(f"找到运单号相关元素: {len(waybill_elements)} 个")
|
||||
for elem in waybill_elements:
|
||||
print(f" 文本: {elem.text}")
|
||||
# 尝试获取父元素或兄弟元素
|
||||
parent = elem.parent()
|
||||
if parent:
|
||||
print(f" 父元素文本: {parent.text[:100]}")
|
||||
|
||||
# 尝试提取承运人信息
|
||||
carrier_elements = page.eles('xpath=//*[contains(text(), "承运人")]')
|
||||
if carrier_elements:
|
||||
print(f"\n找到承运人相关元素: {len(carrier_elements)} 个")
|
||||
for elem in carrier_elements:
|
||||
print(f" 文本: {elem.text}")
|
||||
|
||||
print("\n=== 方法2: 监听网络请求,查找数据接口 ===")
|
||||
|
||||
# 监听所有包含数据的请求
|
||||
print("开始监听网络请求...")
|
||||
page.listen.start()
|
||||
|
||||
# 滚动页面触发可能的请求
|
||||
page.scroll.down(500)
|
||||
time.sleep(3)
|
||||
page.scroll.to_bottom()
|
||||
time.sleep(5)
|
||||
|
||||
# 获取所有监听到的请求
|
||||
all_responses = page.listen.get()
|
||||
print(f"\n共监听到 {len(all_responses)} 个请求")
|
||||
|
||||
# 查找可能包含物流数据的请求
|
||||
keywords = ['track', 'logistics', 'waybill', 'express', 'delivery', '3.cn', 'jd.com', 'json', 'api']
|
||||
|
||||
for idx, resp in enumerate(all_responses):
|
||||
url = resp.url if hasattr(resp, 'url') else ''
|
||||
print(f"\n请求 {idx + 1}:")
|
||||
print(f" URL: {url[:150]}")
|
||||
|
||||
# 检查是否包含关键词
|
||||
url_lower = url.lower()
|
||||
if any(keyword in url_lower for keyword in keywords):
|
||||
print(f" ⭐ 可能相关的请求!")
|
||||
try:
|
||||
if hasattr(resp, 'response') and hasattr(resp.response, 'body'):
|
||||
body = resp.response.body
|
||||
if isinstance(body, dict):
|
||||
print(f" 响应数据 (前500字符): {str(body)[:500]}")
|
||||
# 尝试解析 JSON
|
||||
print(f" 完整的 JSON 数据:")
|
||||
print(json.dumps(body, indent=2, ensure_ascii=False)[:1000])
|
||||
elif isinstance(body, str):
|
||||
print(f" 响应数据 (前500字符): {body[:500]}")
|
||||
# 尝试解析 JSON
|
||||
try:
|
||||
json_data = json.loads(body)
|
||||
print(f" 解析后的 JSON (前1000字符):")
|
||||
print(json.dumps(json_data, indent=2, ensure_ascii=False)[:1000])
|
||||
except json.JSONDecodeError:  # body is a string but not valid JSON
|
||||
pass
|
||||
except Exception as e:
|
||||
print(f" 解析响应时出错: {e}")
|
||||
|
||||
print("\n=== 方法3: 提取页面 HTML 中的 JSON 数据 ===")
|
||||
|
||||
# 获取页面 HTML
|
||||
html = page.html
|
||||
# 查找可能的 JSON 数据(在 script 标签中)
|
||||
json_patterns = [
|
||||
r'window\.__INITIAL_STATE__\s*=\s*({.+?});',
|
||||
r'var\s+trackData\s*=\s*({.+?});',
|
||||
r'const\s+trackingInfo\s*=\s*({.+?});',
|
||||
r'data\s*:\s*({.+?})',
|
||||
r'"waybillNo"[:\s]+"([^"]+)"',
|
||||
r'"trackingNumber"[:\s]+"([^"]+)"',
|
||||
]
|
||||
|
||||
for pattern in json_patterns:
|
||||
matches = re.findall(pattern, html, re.DOTALL)
|
||||
if matches:
|
||||
print(f"\n找到匹配模式 {pattern}:")
|
||||
for match in matches[:3]: # 只显示前3个
|
||||
print(f" 匹配: {str(match)[:200]}")
|
||||
|
||||
print("\n=== 尝试提取页面中的所有文本内容 ===")
|
||||
page_text = page.html
|
||||
# 查找运单号(通常是数字)
|
||||
waybill_pattern = r'运单号[:\s]*(\d+)'
|
||||
waybill_matches = re.findall(waybill_pattern, page_text)
|
||||
if waybill_matches:
|
||||
print(f"找到运单号: {waybill_matches}")
|
||||
|
||||
# 查找承运人
|
||||
carrier_pattern = r'国内承运人[:\s]*([^\s<]+)'
|
||||
carrier_matches = re.findall(carrier_pattern, page_text)
|
||||
if carrier_matches:
|
||||
print(f"找到承运人: {carrier_matches}")
|
||||
|
||||
# 查找电话号码
|
||||
phone_pattern = r'国内承运人电话[:\s]*(\d+)'
|
||||
phone_matches = re.findall(phone_pattern, page_text)
|
||||
if phone_matches:
|
||||
print(f"找到电话: {phone_matches}")
|
||||
|
||||
print("\n=== 等待用户查看页面 ===")
|
||||
print("页面已打开,请手动检查浏览器中的网络请求(F12 -> Network),查找包含物流数据的 API")
|
||||
print("按 Enter 键继续或等待 60 秒后自动关闭...")
|
||||
|
||||
try:
|
||||
input()
|
||||
except EOFError:  # no interactive stdin available; fall back to a timed wait
|
||||
time.sleep(60)
|
||||
|
||||
except KeyboardInterrupt:
|
||||
print("\n用户中断脚本执行")
|
||||
except Exception as e:
|
||||
print(f"\n发生错误: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
finally:
|
||||
print("\n脚本执行完成,浏览器保持打开状态用于调试")
|
||||
# 可以选择是否关闭浏览器
|
||||
# page.quit()
|
||||
|
||||
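Method 3 in `logistics.py` above searches the page HTML for JSON embedded in script tags. As a rough sketch of how one of those patterns could be turned into parsed data, the snippet below applies the `window.__INITIAL_STATE__` pattern to an invented HTML fragment; the field names and values are assumptions for illustration, not data from a real tracking page.

```python
import json
import re

# Invented HTML fragment; a real page may embed its state differently.
html = '<script>window.__INITIAL_STATE__ = {"waybillNo": "123456789012", "carrier": "demo"};</script>'

# Same pattern as in json_patterns: capture the object literal up to the first "};".
match = re.search(r'window\.__INITIAL_STATE__\s*=\s*({.+?});', html, re.DOTALL)

state = None
if match:
    try:
        state = json.loads(match.group(1))
    except json.JSONDecodeError:
        # The captured text was not valid JSON (for example, a JS object with unquoted keys).
        state = None
```

The non-greedy match stops at the first `};`, so a payload that contains that sequence inside a nested object or string would be truncated and fail to parse; the `try` block covers that case.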
5
jd/requirements.txt
Normal file
@@ -0,0 +1,5 @@
|
||||
# Dependencies for jd.py
flask>=2.0.0
DrissionPage>=4.0.0
sqlalchemy>=2.0.0
pymysql>=1.0.0
|
||||
35
jd/run_win.bat
Normal file
@@ -0,0 +1,35 @@
|
||||
@echo off
|
||||
chcp 65001 >nul
|
||||
title JD 服务 - 一键启动
|
||||
|
||||
echo ========================================
|
||||
echo JD 服务 - 依赖安装与启动
|
||||
echo ========================================
|
||||
echo.
|
||||
|
||||
cd /d "%~dp0"
|
||||
|
||||
:: 检查 Python
|
||||
python --version >nul 2>&1
|
||||
if errorlevel 1 (
|
||||
echo [错误] 未找到 Python,请先安装 Python 并加入 PATH。
|
||||
pause
|
||||
exit /b 1
|
||||
)
|
||||
|
||||
echo [1/2] 安装依赖...
|
||||
python -m pip install -r requirements.txt -q
|
||||
if errorlevel 1 (
|
||||
echo [错误] 依赖安装失败。
|
||||
pause
|
||||
exit /b 1
|
||||
)
|
||||
echo 依赖已就绪。
|
||||
echo.
|
||||
|
||||
echo [2/2] 启动服务...
|
||||
echo 按 Ctrl+C 可停止服务。
|
||||
echo.
|
||||
python jd.py
|
||||
|
||||
pause
|
||||
162
jd/setup_ubuntu.sh
Normal file
@@ -0,0 +1,162 @@
|
||||
#!/bin/bash
|
||||
# Ubuntu 环境快速设置脚本
|
||||
|
||||
set -e # 遇到错误立即退出
|
||||
|
||||
# 确保使用 bash 运行(兼容性问题处理)
|
||||
if [ -z "$BASH_VERSION" ]; then
|
||||
echo "警告: 此脚本需要使用 bash 运行"
|
||||
echo "请使用: bash $0"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "=========================================="
|
||||
echo "京东物流提取工具 - Ubuntu 环境设置"
|
||||
echo "=========================================="
|
||||
echo ""
|
||||
|
||||
|
||||
# 1. 检查并安装系统依赖
|
||||
echo "步骤 1: 检查系统依赖..."
|
||||
if ! command -v python3 >/dev/null 2>&1; then
|
||||
echo "安装 Python3..."
|
||||
sudo apt update
|
||||
sudo apt install -y python3 python3-pip python3-venv
|
||||
else
|
||||
echo "✅ Python3 已安装"
|
||||
fi
|
||||
|
||||
# 检查 Chrome/Chromium
|
||||
CHROME_PATH=""
|
||||
if command -v google-chrome >/dev/null 2>&1; then
|
||||
CHROME_PATH=$(which google-chrome)
|
||||
echo "✅ 找到 Google Chrome: $CHROME_PATH"
|
||||
elif [ -f "/usr/bin/google-chrome" ]; then
|
||||
CHROME_PATH="/usr/bin/google-chrome"
|
||||
echo "✅ 找到 Google Chrome: $CHROME_PATH"
|
||||
elif command -v chromium-browser >/dev/null 2>&1; then
|
||||
CHROME_PATH=$(which chromium-browser)
|
||||
echo "✅ 找到 Chromium: $CHROME_PATH"
|
||||
elif [ -f "/usr/bin/chromium-browser" ]; then
|
||||
CHROME_PATH="/usr/bin/chromium-browser"
|
||||
echo "✅ 找到 Chromium: $CHROME_PATH"
|
||||
else
|
||||
echo "⚠️ 未找到 Chrome/Chromium,将尝试安装..."
|
||||
echo "选择要安装的浏览器:"
|
||||
echo "1) Google Chrome (推荐)"
|
||||
echo "2) Chromium (开源版本)"
|
||||
read -p "请选择 [1-2]: " choice
|
||||
|
||||
if [ "$choice" = "1" ]; then
|
||||
echo "正在安装 Google Chrome..."
|
||||
wget -q https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
|
||||
sudo apt install -y ./google-chrome-stable_current_amd64.deb
|
||||
rm -f google-chrome-stable_current_amd64.deb
|
||||
CHROME_PATH="/usr/bin/google-chrome"
|
||||
elif [ "$choice" = "2" ]; then
|
||||
echo "正在安装 Chromium..."
|
||||
sudo apt update
|
||||
sudo apt install -y chromium-browser
|
||||
CHROME_PATH="/usr/bin/chromium-browser"
|
||||
fi
|
||||
fi
|
||||
|
||||
# 2. 安装 Chrome 运行时依赖
|
||||
echo ""
|
||||
echo "步骤 2: 检查 Chrome 运行时依赖..."
|
||||
DEPS="libnss3 libatk-bridge2.0-0 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libasound2"
|
||||
MISSING_DEPS=""
|
||||
|
||||
for dep in $DEPS; do
|
||||
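if ! dpkg-query -W -f='${Status}' "$dep" 2>/dev/null | grep -q "install ok installed"; then  # exact package match; the old grep treated "." as a wildcard and also matched packages such as libnss3-dev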
|
||||
if [ -z "$MISSING_DEPS" ]; then
|
||||
MISSING_DEPS="$dep"
|
||||
else
|
||||
MISSING_DEPS="$MISSING_DEPS $dep"
|
||||
fi
|
||||
fi
|
||||
done
|
||||
|
||||
if [ -n "$MISSING_DEPS" ]; then
|
||||
echo "安装缺失的依赖: $MISSING_DEPS"
|
||||
sudo apt install -y $MISSING_DEPS
|
||||
else
|
||||
echo "✅ 所有依赖已安装"
|
||||
fi
|
||||
|
||||
# 3. 创建虚拟环境
|
||||
echo ""
|
||||
echo "步骤 3: 设置 Python 虚拟环境..."
|
||||
if [ ! -d "venv" ]; then
|
||||
echo "创建虚拟环境..."
|
||||
python3 -m venv venv
|
||||
echo "✅ 虚拟环境创建成功"
|
||||
else
|
||||
echo "✅ 虚拟环境已存在"
|
||||
fi
|
||||
|
||||
# 4. 激活虚拟环境并安装 Python 包
|
||||
echo ""
|
||||
echo "步骤 4: 安装 Python 依赖包..."
|
||||
source venv/bin/activate
|
||||
|
||||
# 升级 pip
|
||||
pip install --upgrade pip
|
||||
|
||||
# 安装依赖
|
||||
pip install DrissionPage Flask
|
||||
|
||||
# 可选:如果需要数据库功能
|
||||
read -p "是否需要数据库功能?(sqlalchemy, pymysql) [y/N]: " need_db
|
||||
if [ "$need_db" = "y" ] || [ "$need_db" = "Y" ]; then
|
||||
pip install sqlalchemy pymysql
|
||||
fi
|
||||
|
||||
deactivate
|
||||
|
||||
# 5. 创建运行脚本
|
||||
echo ""
|
||||
echo "步骤 5: 创建便捷运行脚本..."
|
||||
# 创建 API 服务启动脚本
|
||||
cat > run_logistics_api.sh << 'EOF'
|
||||
#!/bin/bash
|
||||
# 启动物流信息查询 API 服务
|
||||
|
||||
cd "$(dirname "$0")"
|
||||
source venv/bin/activate
|
||||
|
||||
# 启动 API 服务
|
||||
python jd/fetch_logistics_ubuntu.py
|
||||
|
||||
deactivate
|
||||
EOF
|
||||
|
||||
chmod +x run_logistics_api.sh
|
||||
|
||||
# 6. 完成
|
||||
echo ""
|
||||
echo "=========================================="
|
||||
echo "✅ 环境设置完成!"
|
||||
echo "=========================================="
|
||||
echo ""
|
||||
echo "快速开始:"
|
||||
echo ""
|
||||
echo "启动 API 服务:"
|
||||
echo " 方式1: 使用便捷脚本"
|
||||
echo " ./run_logistics_api.sh"
|
||||
echo ""
|
||||
echo " 方式2: 手动启动"
|
||||
echo " source venv/bin/activate"
|
||||
echo " python jd/fetch_logistics_ubuntu.py"
|
||||
echo " deactivate"
|
||||
echo ""
|
||||
echo " API 接口地址: http://localhost:5001"
|
||||
echo " 查询示例:"
|
||||
echo " curl 'http://localhost:5001/fetch_logistics?tracking_url=https://3.cn/2t-Iibig'"
|
||||
echo " 或"
|
||||
echo " curl -X POST http://localhost:5001/fetch_logistics -H 'Content-Type: application/json' -d '{\"tracking_url\":\"https://3.cn/2t-Iibig\"}'"
|
||||
echo ""
|
||||
echo "浏览器路径: $CHROME_PATH"
|
||||
echo "虚拟环境: $(pwd)/venv"
|
||||
echo ""
|
||||
|
||||
11
jd/tb.py
@@ -13,8 +13,7 @@ from sqlalchemy.orm import sessionmaker, declarative_base
|
||||
CHROME_PATH = r'C:\Program Files\Google\Chrome\Application\chrome.exe'
|
||||
|
||||
# 固定商品详情页 URL
|
||||
TARGET_URL = "https://detail.tmall.com/item.htm?abbucket=1&id=735141569627<k2=1753093866331wbixx4bjhgx78xdlrpyxq&ns=1&priceTId=213e074d17530938630755244e1109&skuId=5667837161089&spm=a21n57.1.hoverItem.2&utparam=%7B%22aplus_abtest%22%3A%228c55408acbff553514850c28e821c3b4%22%7D&xxc=taobaoSearch"
|
||||
# MySQL 配置
|
||||
TARGET_URL = "https://detail.tmall.com/item.htm?abbucket=1&id=629109576049&mi_id=0000ug1x7t_mV0K12gYppRSVQ7NozSDtS3YwUTM7oCeMS5w&ns=1&skuId=5800648665359&spm=a21n57.1.hoverItem.1&utparam=%7B%22aplus_abtest%22%3A%2254df76059607f4cb191afc7c675e8349%22%7D&xxc=taobaoSearch"
# MySQL configuration
|
||||
db_config = {
|
||||
"host": "192.168.8.88",
|
||||
"port": 3306,
|
||||
@@ -96,7 +95,7 @@ def fetch_taobao_comments():
|
||||
return []
|
||||
|
||||
# 开始监听指定请求
|
||||
target_url = 'https://h5api.m.tmall.com/h5/mtop.taobao.rate.detaillist.get/6.0/?jsv=2.7.5'
|
||||
target_url = 'https://h5api.m.tmall.com/h5/mtop.taobao.rate.detaillist.get/6.0/?jsv=2.7.4'
|
||||
page.listen.start(target_url)
|
||||
|
||||
seen_ids = set()
|
||||
@@ -157,6 +156,8 @@ def save_taobao_comments_to_db(comments):
|
||||
user_nick = comment.get('userNick', '匿名用户')
|
||||
pic_list = comment.get('feedPicPathList', [])
|
||||
comment_date = comment.get('feedbackDate', '')
|
||||
# 从评论数据中提取 skuId 作为 product_id
|
||||
sku_id = comment.get('skuId', '')
|
||||
|
||||
exists = session.query(TaobaoComment).filter_by(comment_id=comment_id).first()
|
||||
if exists:
|
||||
@@ -166,7 +167,7 @@ def save_taobao_comments_to_db(comments):
|
||||
picture_urls = [url for url in pic_list if url.startswith('//')]
|
||||
|
||||
new_comment = TaobaoComment(
|
||||
product_id="735141569627",
|
||||
product_id=sku_id, # 使用 skuId 替代硬编码的 product_id
|
||||
user_name=user_nick,
|
||||
comment_text=feedback,
|
||||
comment_id=comment_id,
|
||||
@@ -174,7 +175,7 @@ def save_taobao_comments_to_db(comments):
|
||||
comment_date=comment_date
|
||||
)
|
||||
session.add(new_comment)
|
||||
print(f"正在写入评论: {comment_id}")
|
||||
print(f"正在写入评论: {comment_id}, skuId: {sku_id}")
|
||||
session.commit()
|
||||
except Exception as e:
|
||||
session.rollback()
|
||||
|
||||
91
jd/test_browser.py
Normal file
@@ -0,0 +1,91 @@
|
||||
"""测试浏览器是否能正常启动"""
|
||||
import time
|
||||
from DrissionPage import ChromiumPage, ChromiumOptions
|
||||
|
||||
CHROME_PATH = r'C:\Program Files\Google\Chrome\Application\chrome.exe'
|
||||
|
||||
print("="*60)
|
||||
print("浏览器启动测试")
|
||||
print("="*60)
|
||||
|
||||
# 检查 Chrome 路径
|
||||
import os
|
||||
if not os.path.exists(CHROME_PATH):
|
||||
print(f"❌ 错误: 找不到 Chrome 浏览器")
|
||||
print(f"路径: {CHROME_PATH}")
|
||||
print("\n请检查:")
|
||||
print("1. Chrome 是否已安装")
|
||||
print("2. Chrome 的安装路径是否正确")
|
||||
print("3. 如果 Chrome 安装在别的路径,请修改 CHROME_PATH 变量")
|
||||
exit(1)
|
||||
else:
|
||||
print(f"✅ Chrome 路径检查通过: {CHROME_PATH}")
|
||||
|
||||
# 配置浏览器选项
|
||||
print("\n正在配置浏览器选项...")
|
||||
options = ChromiumOptions()
|
||||
options.set_browser_path(CHROME_PATH)
|
||||
|
||||
# 尝试启动浏览器
|
||||
print("正在启动浏览器...")
|
||||
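print("If the browser does not open automatically, possible causes include:")
print("1. Chrome is currently in use by another program")
print("2. The installed Chrome version is incompatible with DrissionPage")
print("3. A firewall or security software is blocking it")
print("\nPlease wait 10 seconds...\n")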
|
||||
|
||||
try:
|
||||
page = ChromiumPage(options)
|
||||
print("✅ 浏览器启动成功!")
|
||||
|
||||
# 测试打开一个简单的页面
|
||||
print("\n正在打开测试页面: https://www.baidu.com")
|
||||
page.get('https://www.baidu.com')
|
||||
time.sleep(3)
|
||||
|
||||
# 检查页面信息
|
||||
try:
|
||||
print(f"当前 URL: {page.url}")
|
||||
print(f"页面标题: {page.title}")
|
||||
html_len = len(page.html)
|
||||
print(f"页面 HTML 长度: {html_len} 字符")
|
||||
|
||||
if html_len > 1000:
|
||||
print("\n✅ 测试成功!浏览器正常工作。")
|
||||
print("\n现在可以运行 fetch_logistics.py 了。")
|
||||
else:
|
||||
print("\n⚠️ 警告: 页面内容可能未完全加载")
|
||||
except Exception as e:
|
||||
print(f"\n⚠️ 获取页面信息时出错: {e}")
|
||||
|
||||
print("\n浏览器将保持打开状态 30 秒,请查看是否能看到浏览器窗口...")
|
||||
print("如果能看到浏览器窗口,说明启动成功。")
|
||||
time.sleep(30)
|
||||
|
||||
# 询问是否关闭
|
||||
print("\n测试完成。浏览器将保持打开状态。")
|
||||
print("您可以手动关闭浏览器窗口,或者按 Ctrl+C 退出程序。")
|
||||
|
||||
# 不自动关闭,让用户查看
|
||||
try:
|
||||
input("\n按 Enter 键关闭浏览器并退出...")
|
||||
except (EOFError, KeyboardInterrupt):  # no stdin, or the user pressed Ctrl+C as the prompt suggests
|
||||
pass
|
||||
|
||||
except Exception as e:
|
||||
print(f"\n❌ 浏览器启动失败!")
|
||||
print(f"错误信息: {e}")
|
||||
print("\n可能的解决方案:")
|
||||
print("1. 检查 Chrome 是否正确安装")
|
||||
print("2. 尝试关闭所有 Chrome 窗口后重试")
|
||||
print("3. 检查是否有权限问题")
|
||||
print("4. 查看是否有错误日志")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
finally:
|
||||
try:
|
||||
page.quit()
|
||||
print("浏览器已关闭")
|
||||
except Exception:  # includes NameError when the browser never started and `page` is undefined
|
||||
pass
|
||||
|
||||
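`test_browser.py` and `logistics.py` hard-code a Windows Chrome path, while `setup_ubuntu.sh` probes several candidate locations. The sketch below shows one way the same probing might look in Python. `find_browser`, its parameters, and the candidate names are illustrative assumptions and are not part of the commit.

```python
import os
import shutil


def find_browser(
    candidates=("google-chrome", "chromium-browser", "chromium"),
    extra_paths=(r"C:\Program Files\Google\Chrome\Application\chrome.exe",),
):
    """Hypothetical helper: return the first browser executable found, or None."""
    # First look for well-known executable names on PATH.
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    # Then fall back to explicit filesystem locations.
    for path in extra_paths:
        if os.path.exists(path):
            return path
    return None


browser_path = find_browser()
```

If a path is found, it could be passed to `options.set_browser_path(...)` as the scripts above already do; when the result is `None`, printing the install hints and exiting keeps the current behavior.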