High-Fidelity OCR on Apple Silicon
A local, zero-config OCR inference tool powered by GLM-4V and MLX.
Designed for macOS M-series Macs with a beautiful web interface.
GLM-OCR MLX wraps the GLM-OCR model in a turnkey macOS application. You upload a PDF or image, and it returns structured Markdown with tables, formulas, and layout-aware text — all running locally on your Mac's GPU.
Upload multi-page PDFs, PNGs, or JPEGs for instant OCR.
State-of-the-art 0.9B param model scoring 94.6 on OmniDocBench.
Runs on Apple Metal GPU via mlx-vlm for fast local inference.
PP-DocLayoutV3 detects tables, formulas, images, and text blocks.
Two servers work together: an MLX inference server and a Flask web UI.
Serves the GLM-OCR-bf16 model via an OpenAI-compatible chat/completions endpoint. Handles the actual neural network inference on the Metal GPU.
Accepts uploads, splits PDFs to images, queues OCR jobs, and displays results page-by-page with a live progress bar and Markdown rendering.
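Since the MLX server speaks an OpenAI-compatible chat/completions protocol, a client request is just a JSON payload with an inline base64 image. A minimal sketch of building such a payload — the prompt text, `max_tokens`, and model name here are illustrative assumptions, not the app's exact values:

```python
import base64

def build_ocr_request(image_bytes, prompt="Convert this page to Markdown.",
                      model="GLM-OCR-bf16"):
    """Build an OpenAI-style chat/completions payload with an inline image.

    The prompt and model name are placeholders; the message shape follows the
    standard OpenAI vision format that the MLX server's endpoint accepts.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 4096,
    }
```

Such a payload would be POSTed to the MLX server's `/v1/chat/completions` endpoint on port 8080.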
M1, M2, M3, or M4 Mac required. MLX uses the Metal GPU and unified memory architecture.
Download from python.org/downloads/macos if not installed. The launcher checks automatically.
Needed to clone the GLM-OCR SDK on first launch. Install via Xcode Command Line Tools or Homebrew.
The model weights are ~20 GB (GLM-OCR-bf16 + PP-DocLayoutV3). They download automatically on first launch from Hugging Face. 16 GB unified memory is minimum; 32 GB+ recommended for multi-page PDFs.
These folders are created at runtime:
weights/ — Downloaded AI model files (~20 GB)
output/ — OCR results (Markdown + JSON + images)
sessions/ — Job state files (JSON per job)
glm-ocr/ — Cloned GLM-OCR SDK
The entire setup is automated. Just double-click launch.command in Finder.
The app opens automatically at http://localhost:5003. A breakdown of what launch.command does during the automated startup sequence:
```bash
# Step 0 — Clone SDK if missing
if [ ! -d "glm-ocr" ]; then
  git clone https://github.com/zai-org/GLM-OCR glm-ocr
fi

# Step 1 — Verify Python 3.12+
python3 -c "import sys; sys.exit(0 if sys.version_info >= (3,12) else 1)"

# Step 2 — Create venv & install deps (first run only)
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Step 3 — Download / verify model weights
python utils/download_weights.py

# Step 4 — Start MLX server (background)
mlx_vlm.server --trust-remote-code &

# Step 5 — Start Flask app (background)
python app.py &

# Step 6 — Open browser
open http://localhost:5003
```
Drag and drop a PDF, PNG, or JPEG onto the upload card — or click to browse. Accepted: .pdf .png .jpg .jpeg
A progress bar shows real-time status. PDFs are split into page images, then each page is OCR'd sequentially. Results stream as they finish.
A split-panel view: original document on the left, rendered Markdown on the right. Navigate pages with Prev/Next or jump to any page.
Click Export to download results as Markdown or JSON — either the current page or the full document.
Layout Toggle — Switch between the original image and the layout visualization overlay to see detected regions.
History — Click the History button to browse and reload previous scan results.
The Flask app exposes these endpoints — useful for scripting or integration.
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/upload | Upload a file and start OCR. Returns job_id. |
| GET | /api/status/<job_id> | Poll job status, progress %, and page counts. |
| GET | /api/page/<job_id>/<idx> | Get OCR result for a single page (JSON). |
| GET | /api/jobs | List all jobs sorted by newest first. |
| GET | /api/export/<job_id> | Export results. Params: format, scope, page_idx. |
| POST | /api/last_page/<job_id> | Save the user's last-viewed page index. |
| GET | /api/image/<job_id>/<path> | Serve an image file belonging to a job. |
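The upload-then-poll flow lends itself to a small client helper. A sketch of the polling loop, with the status fetch injected as a callable so it stays testable — the response field names (`status`, `progress`) are assumptions based on the endpoint descriptions above:

```python
import time

def wait_for_job(get_status, poll_interval=1.0, max_polls=600):
    """Poll a job-status callable until the job finishes or errors.

    `get_status` should return a dict such as {"status": "...", "progress": ...};
    the exact field names are assumptions — adapt them to the real API response.
    """
    for _ in range(max_polls):
        status = get_status()
        if status.get("status") in ("done", "error"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("job did not finish in time")
```

With the `requests` library, `get_status` could be something like `lambda: requests.get(f"http://localhost:5003/api/status/{job_id}").json()`.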
All settings live in glm_config.yaml. Key sections:
MLX server host, port (8080), and debug flag.
Connection to the MLX server: host, port, model path, timeouts, retries.
PP-DocLayoutV3 settings: detection threshold, batch size, label-to-task mappings.
| Setting | Default |
|---|---|
| pipeline.enable_layout | true |
| pipeline.max_workers | 4 |
| pipeline.ocr_api.api_port | 8080 |
| pipeline.page_loader.max_tokens | 4096 |
| pipeline.layout.threshold | 0.3 |
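The dotted key paths above map onto nested YAML. A sketch of how these defaults might look inside glm_config.yaml (the nesting is inferred from the key paths, so verify against the shipped file):

```yaml
pipeline:
  enable_layout: true
  max_workers: 4
  ocr_api:
    api_port: 8080
  page_loader:
    max_tokens: 4096
  layout:
    threshold: 0.3
```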
Set pipeline.maas.enabled: true and provide a Zhipu API key to use the cloud API instead of local inference. This bypasses the MLX server entirely — no GPU needed.
```yaml
pipeline:
  maas:
    enabled: true
    api_key: your-zhipu-key
```
When enable_layout: true, documents go through a two-stage pipeline:
Paragraphs, titles, references, seals, vertical text
Tables → recognized with table prompt
Display & inline formulas → LaTeX output
Charts & images kept; headers, footers, page numbers discarded
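The label-to-task routing in the second stage can be pictured as a lookup from detected region label to OCR prompt type. A sketch with hypothetical label and task names — the real identifiers live in the layout section of glm_config.yaml:

```python
# Illustrative mapping from layout labels to OCR task prompts.
# Label and task names here are hypothetical, not the SDK's actual identifiers.
LABEL_TO_TASK = {
    "text": "ocr",
    "title": "ocr",
    "table": "table",      # recognized with the table prompt
    "formula": "formula",  # emitted as LaTeX
    "header": None,        # discarded
    "footer": None,        # discarded
    "page_number": None,   # discarded
}

def route_region(label):
    """Return the OCR task for a detected region, or None to discard it."""
    return LABEL_TO_TASK.get(label, "ocr")  # unknown labels fall back to plain OCR
```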
Two formats, two scopes — choose what you need from the Export dropdown.
Clean Markdown with headings, tables, and LaTeX formulas. Pages separated by horizontal rules. Ideal for docs, Notion, Obsidian.
Structured JSON with page data, content strings, and image paths. Ideal for programmatic processing and data pipelines.
Downloads the entire document's OCR output in one file.
Downloads only the page you're currently viewing.
```bash
# Programmatic export via API
$ curl "http://localhost:5003/api/export/JOB_ID?format=markdown&scope=all" -o result.md
$ curl "http://localhost:5003/api/export/JOB_ID?format=json&scope=current&page_idx=0" -o page1.json
```
Install the latest Python from python.org/downloads/macos. The system Python on macOS is too old.
Right-click the file → Open → confirm in the dialog. Or: System Settings → Privacy → Allow.
Another process may hold the port. Run lsof -i :8080 to check. Use the deep clean script to kill stale processes.
Normal — the model weights load into unified memory on the first request. Subsequent scans are much faster.
The model needs ~8 GB of unified memory. Close other heavy apps. 16 GB Macs should work; 8 GB Macs may struggle.
Check your internet connection. Run python utils/download_weights.py manually to retry. Weights come from Hugging Face.
If something goes wrong, use the interactive reset utility:
$ ./utils/deep_clean.command
It prompts you to selectively reset each component:
Stops any lingering processes on ports 8080 and 5003.
Deletes the virtual environment. Recreated on next launch.
Removes the cloned SDK. Re-cloned on next launch.
Wipes all OCR results and job history.
Nuclear option — deletes all downloaded weights. They'll re-download on next launch (requires internet).
The default PDF rendering is 200 DPI. Higher-quality input images produce better OCR output, especially for small text.
For simple single-column docs, try enable_layout: false in the config for faster, direct whole-image OCR.
Default is 4 parallel workers for region OCR. Lower to 1–2 on 8 GB Macs, raise on 64 GB+ machines.
Full SDK output lives in the output/ directory — including per-page Markdown, JSON, layout visualizations, and original images.
Results persist across app restarts — the History button lets you reload any previous scan.
Double-click launch.command, upload a document, and get structured Markdown in minutes — all running locally on your Mac.
GLM-OCR MLX · Powered by GLM-4V + Apple MLX · MIT License