Meeting 04
The importance of building “Big Tables”
I met with some of you about your research projects, and I noticed some confusion about how to start a new research topic. From my perspective, building “big tables” is always the first step.
These big tables include:
- primary sources
- related scholarship
- other relevant materials
Introduction to Today’s Tasks
There are two PDF files in the pdf/ folder. They contain pages extracted from Wolfgang Franke’s well-known book An Introduction to the Sources of Ming History (see a newer edition on WorldCat: https://search.worldcat.org/title/779242963). For this class, we are using an edition published in the 1960s.
- pdf/wf_abbreviations.pdf is the abbreviations list.
- pdf/wf_five_pages.pdf contains five pages from the first chapter.
Your tasks:
- Extract the abbreviations and their meanings from the list into a table.
- Turn the abbreviations that appear in the five pages into clickable links.
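The first task can be sketched in Python once you have clean text for the abbreviations list. This is only a minimal sketch under stated assumptions: it assumes each entry sits on its own line, with the abbreviation separated from its meaning by a tab or by two or more spaces. The real OCR output may look different, and the sample entries below are invented placeholders, not taken from the book. The function names parse_abbreviations and to_csv are hypothetical.

```python
import csv
import io
import re


def parse_abbreviations(text):
    """Split each non-empty line into an (abbreviation, meaning) pair."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumption: a tab or a run of 2+ spaces separates the two columns.
        parts = re.split(r"\t|\s{2,}", line, maxsplit=1)
        if len(parts) == 2:
            rows.append((parts[0].strip(), parts[1].strip()))
        # Lines that do not split into two parts are skipped for manual review.
    return rows


def to_csv(rows):
    """Render the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["abbreviation", "meaning"])
    writer.writerows(rows)
    return buffer.getvalue()


# Invented placeholder entries, for illustration only.
sample = "ABC    Example Journal Title\nXYZ\tAnother Example Series\n\nnot an entry\n"
rows = parse_abbreviations(sample)
print(to_csv(rows))
```

A CSV file like this opens directly in Excel or Google Sheets, which makes it a convenient starting point for a “big table”. Always check the skipped lines by hand, since OCR often breaks long entries across lines.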
Google NotebookLM
Landing page: https://notebooklm.google/
OCR tool for Apple
The software is called glm-ocr-mlx. It is provided as ocr_software/glm-ocr-mlx-main.zip. Download the file and unzip it, then read the README.md file for instructions on how to use it.
Currently, it only works on Apple Silicon computers. If you have a Windows, Linux, or Intel Mac computer, you can use the OCR tool for LM Studio/Ollama, which is introduced below.
Why do we need JSON files for OCR output?
OCR tool for LM Studio/Ollama
Repository: https://github.com/kltng/ocr_batch_processor
Ollama: https://ollama.com/
Use Openwork or chatbots to clean the OCR output and turn the abbreviations in it into clickable links.
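If you prefer a script to a chatbot for the linking step, the idea can be sketched as follows. This is a rough sketch, not a required method: it assumes the cleaned text is plain text or Markdown, and that each abbreviation should link to an anchor such as #abc in your table. The anchor scheme, the function name linkify, and the sample abbreviations are all hypothetical.

```python
import re


def linkify(text, abbreviations):
    """Wrap each known abbreviation in a Markdown link to an in-page anchor."""
    if not abbreviations:
        return text
    # Try longer abbreviations first so "ABCD" is not matched as "ABC".
    ordered = sorted(abbreviations, key=len, reverse=True)
    alternatives = "|".join(re.escape(a) for a in ordered)
    # Lookarounds keep us from matching inside a longer word or number.
    pattern = re.compile(r"(?<![A-Za-z0-9])(" + alternatives + r")(?![A-Za-z0-9])")

    def replace(match):
        abbr = match.group(1)
        # Hypothetical anchor scheme: lowercase, non-alphanumerics become "-".
        anchor = re.sub(r"[^a-z0-9]+", "-", abbr.lower()).strip("-")
        return f"[{abbr}](#{anchor})"

    return pattern.sub(replace, text)


# Invented sample sentence and abbreviations, for illustration only.
sample = "See ABC 12 and ABCD, but not XABC."
linked = linkify(sample, ["ABC", "ABCD"])
print(linked)
```

For the sample sentence, this prints: See [ABC](#abc) 12 and [ABCD](#abcd), but not XABC. Short abbreviations can also be ordinary words or initials, so review the links before trusting them.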
If you cannot get OCR output, you can use the JSON files in the ocr_results/ folder. They are the output of glm-ocr-mlx. You can use Openwork or chatbots to clean the OCR output and extract the abbreviations and their meanings into a table.
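Before handing the JSON files to a chatbot, it can help to look at what they contain. The sketch below makes no assumption about the layout that glm-ocr-mlx writes: it only loads every .json file in ocr_results/, reports the top-level type and keys, and counts the text strings found anywhere inside. The function names collect_strings and summarize are hypothetical.

```python
import json
from pathlib import Path


def collect_strings(node):
    """Recursively gather every string value in a parsed JSON structure."""
    if isinstance(node, str):
        return [node]
    if isinstance(node, dict):
        return [s for value in node.values() for s in collect_strings(value)]
    if isinstance(node, list):
        return [s for item in node for s in collect_strings(item)]
    return []  # numbers, booleans, and null carry no text


def summarize(folder):
    """Print a one-line overview of each JSON file in the folder."""
    for path in sorted(Path(folder).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        keys = sorted(data) if isinstance(data, dict) else []
        count = len(collect_strings(data))
        print(f"{path.name}: {type(data).__name__}, keys={keys}, {count} strings")


# Prints nothing if the folder is missing or contains no JSON files.
summarize("ocr_results")
```

Once you know which keys hold the recognized text, you can copy just that text into Openwork or a chatbot instead of the whole file.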