Better problem solving through scripting: How to think through your
#eprdctn roadblocks and script your way to efficiency
Do you find yourself repeating the same task over and over? Or feeling certain there is a
way to automate a task but it's just outside of your skill set? Kris Coppieters from Rorohiko
has built a career solving just those kinds of problems. Whether it's scripting a solution from
within InDesign, or using AppleScript to finish off some markup, Kris can show you how to
bring high-level thinking to quick and dirty tasks.
Two categories of automation
When it comes to automation, we can classify most of the software used into two large
categories which I'll call tools and systems.
Tools
Tools are smaller programs which are similar to tools used in various crafts, e.g. carpentry.
In carpentry, you might use a hammer, nails, screws, screwdrivers, saws, CNC machines,
drills... A few of these tools are more complex, and have a wide range of functions, but
many tools are simple and have a very specific function. Some of these tools are specifically
made for a specific task.
In #eprdctn, the tools used are text editors, search-and-replace, XML editors like Oxygen,
testing tools, reader programs...
Systems
Systems are more complex than simple tools. They take in some raw materials and spit out
finished results at the other end.
A sawmill can be seen as a system. It processes logs and produces planks, poles, boards...
The people working in the sawmill will use machines and tools to make the system work.
A working system can encompass many processes and sub-processes. Some processes are
automated, some are manual processes.
In #eprdctn, we might have a system that takes in raw data, and produces ebooks. In
#eprdctn, and in publishing in general, systems are often called workflows.
Let's talk about tools and jigs
This presentation is all about tools for your craft.
There's a guy called Dan Erlewine on YouTube, who is a luthier. He has many YouTube
movies on guitar repair, and many of those movies are about clever self-made tools he uses
to work on guitars. He calls them 'jigs'.
https://www.youtube.com/results?search_query=dan+erlewine+jig
I like that term, so I want to coin the term 'jig' for custom #eprdctn tools. You've heard it
here first!
The first step is to take notice when you find yourself repeatedly performing a cumbersome,
difficult, or repetitive task. Then ask yourself whether it's possible to create a jig for that.
Basic Tools
When working in #eprdctn, a lot of work involves editing various text files. More often than
not, the editing will aim to affect the structure of the document, rather than the content:
retagging, restructuring tags, managing CSS classes,...
Another common operation will be unpacking and repacking files. For example, EPUB files
are nothing more than glorified .ZIP files.
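To see the 'glorified ZIP' idea in practice, here is a minimal Python sketch that lists what is stored inside an EPUB using only the standard zipfile module. The function name list_epub_entries and the sample entry names are made up for illustration; the sketch builds a tiny stand-in archive in memory so it runs without a real EPUB.

```python
import io
import zipfile


def list_epub_entries(epub_file):
    """Return the names of the files stored inside an EPUB (a ZIP archive).

    epub_file may be a path or a file-like object.
    """
    with zipfile.ZipFile(epub_file) as archive:
        return archive.namelist()


# Build a tiny stand-in archive in memory (entry names are hypothetical).
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("OEBPS/chapter1.xhtml", "<html/>")

print(list_epub_entries(sample))
```

With a real file you would pass its path instead of the in-memory sample.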
Regular Expressions
One of the tools you want in your toolchest is some familiarity with regular expressions. Once
you master regular expressions, you can use search-and-replace for a fairly wide range of
tasks.
Regular expressions can help with tasks that go beyond a simple search-and-replace; for
example you could use a search-and-replace with regular expressions for restructuring
HTML.
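As a small illustration of restructuring HTML with a regular expression, the sketch below (in Python's dialect) turns paragraphs carrying a particular class into headings. The class name chapter-title and the function name promote_titles are assumptions chosen for the example, not something from a real book file.

```python
import re

# Hypothetical markup: paragraphs tagged with class "chapter-title"
# that should really be <h1> headings.
PATTERN = re.compile(r'<p class="chapter-title">(.*?)</p>')


def promote_titles(html):
    """Turn <p class="chapter-title">...</p> into <h1>...</h1>."""
    # \1 re-inserts whatever the parenthesized group matched.
    return PATTERN.sub(r"<h1>\1</h1>", html)


print(promote_titles('<p class="chapter-title">One</p><p>Text</p>'))
```

The pattern only handles titles that sit on a single line with exactly this attribute; real files usually need a more forgiving pattern.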
Regular expressions are not easy to master. They are very cryptic, almost impossible to
read, and they are not standardized.
Regular expressions are also often loosely referred to as 'GREP' which is a reference to the
Unix command line tool which started it all. GREP = Global Regular Expression Print.
Lack of standardization
You might be using a text editor like BBEdit, Notepad++, Sublime Text,... All of these support
regular expressions, but each of them will have its own unique 'dialect'. They'll be 95% the same
between the different text editors, but there are subtle differences.
You might be using InDesign as the source for EPUB documents. InDesign supports regular
expressions in its Find-Change dialog. And these are in a specific InDesign 'dialect' which has
only 80% similarity when compared to the regular expressions used in common text editors.
InDesign has some fairly unique features with regards to regular expressions, features which
you won't find in your text editor.
It does not end there. InDesign supports a scripting language called ExtendScript which is a
form of JavaScript. ExtendScript supports regular expressions, and guess what? They use yet
another dialect of regular expressions, again quite different to the InDesign regular
expressions as seen in the Find-Change dialog.
Then we have the various scripting languages that could be used for tool creation - PHP,
Perl, Python, JavaScript, awk, sed... There are dozens of them. None of them is 'better' -
whatever works, works.
Again, all of them have regular expressions, and each scripting language will have its own
unique dialect.
Understand the basic principle and use the documentation
To make sense of it all, my recommendation is that you must understand the basic ideas of
regular expressions and how they are constructed. These are well supported in all the
dialects.
Once you understand the basic ideas, you need to consult the documentation and/or use
the facilities of the software at hand to determine the proper expressions.
To match a thin space, for example:
InDesign: ~<
Most text editors: \x{2009}
Referring to matched parenthesized sub-expressions in replacement patterns is another
point of difference. For example, sometimes, you need to use $1, sometimes you need to
use \1 to refer to the first parenthesized subexpression.
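To show how the same two ideas look in one more dialect, here is a short sketch for Python's re module: the thin space is written as \u2009, and the first parenthesized sub-expression is referred to as \1 in the replacement. The sample text is made up.

```python
import re

text = "10\u2009kg and 5\u2009cm"   # \u2009 is the thin space character

# In Python's dialect the thin space is written \u2009 inside the pattern.
normal_spaces = re.sub(r"\u2009", " ", text)

# Python refers to parenthesized sub-expressions as \1, \2, ... (not $1, $2).
swapped = re.sub(r"(\d+)\u2009(\w+)", r"\2:\1", text)

print(normal_spaces)
print(swapped)
```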
Text Editors
Another basic tool you need is a (set of) good text editors. Steer clear of word processors or
underpowered tools: don't use MS Word, Apple's TextEdit, Notepad.exe ... None of these
are proper text editors.
Word processors often try to be helpful and 'helpfully' change quotes into curly quotes, or
muck with line endings and character encodings, blithely destroying your HTML structure.
Some editors are much more than that. If you can afford it, you want to have Oxygen XML in
your tool chest. This tool can serve as a text editor but it also understands XML, HTML, CSS...
and will allow you to do much smarter editing with much reduced risk.
Editing XML files in a regular text editor works just fine, but you run the risk of damaging
some finely tuned tag balance, and never know it.
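If you do edit XML in a plain text editor, a quick well-formedness check can catch a broken tag balance before it goes unnoticed. The sketch below uses Python's standard xml.etree.ElementTree parser; the function name is_well_formed is made up. One caveat: this parser rejects named entities it does not know (for example &nbsp;), so some XHTML files may be reported as broken even when their tags balance.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<body><p>ok</p></body>"))
print(is_well_formed("<body><p>broken</body>"))
```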
Some text editors (e.g. BBEdit, Oxygen, Atom...) are smart enough to handle text files inside
ZIP-ped data files (e.g. EPUB files) and you can edit text files inside an EPUB without needing
to 'crack it open'.
Scripting Languages
Another powerful basic tool is to have a basic understanding of some scripting language or
languages. There are many out there: the most popular ones are probably Python,
JavaScript, PHP, Perl.
Most of these scripting languages offer the necessary support for complex operations, e.g.
zipping and unzipping, regular expressions, search-and-replace, XML parsing, accessing data
over HTTP or HTTPS connections, connecting to databases...
A scripting language comes in handy when you're faced with a repetitive task that's going
beyond what you can accomplish with find-and-replace and regular expressions. For
example, when there is some 'if-then' logic that needs to be handled, or some processing that
needs to be done.
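As an example of 'if-then' logic that plain search-and-replace struggles with, the following Python sketch adds an empty alt attribute to img tags, but only if the tag does not already have one. The task and the function name add_missing_alt are assumptions chosen for illustration; the sketch also assumes attribute values never contain a > character.

```python
import re

IMG_TAG = re.compile(r"<img\b[^>]*>")


def add_missing_alt(html):
    """Add alt="" to every <img> tag that has no alt attribute."""
    def fix(match):
        tag = match.group(0)
        if re.search(r"\balt\s*=", tag):
            return tag                              # already has alt: leave alone
        if tag.endswith("/>"):
            return tag[:-2].rstrip() + ' alt=""/>'  # self-closing tag
        return tag[:-1].rstrip() + ' alt="">'
    # Passing a function as the replacement lets us decide per match.
    return IMG_TAG.sub(fix, html)


print(add_missing_alt('<img src="a.png"/><img src="b.png" alt="B"/>'))
```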
Tools
EPUB unpack/repack: eCanCrusher
When working with EPUB, if you have access to tools like BBEdit or Oxygen, you can perform
EPUB-wide operations straight on the EPUB file without ever having to
decompress/recompress it.
But sometimes you want to decompress the EPUB, make some changes, then re-compress
it.
There are multiple tools that do this. eCanCrusher is one of them. It works in simple
drag/drop fashion.
https://www.docdataflow.com/ecancrusher/
To decompress: drag/drop an EPUB file onto the eCanCrusher application icon. A
decompressed EPUB folder will appear.
To recompress: drag/drop an EPUB folder onto the eCanCrusher application icon. A
compressed EPUB file will appear.
To configure: double-click the eCanCrusher application icon.
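For readers who prefer a script to drag/drop, the same decompress/recompress round trip can be sketched with Python's standard zipfile module. This is not how eCanCrusher works internally; it is an independent sketch, and the names unpack_epub and pack_epub are made up. The one EPUB-specific detail is that the mimetype file is written first and stored uncompressed, as the EPUB container rules expect.

```python
import os
import tempfile
import zipfile


def unpack_epub(epub_path, folder):
    """Extract every file in the EPUB into folder."""
    with zipfile.ZipFile(epub_path) as archive:
        archive.extractall(folder)


def pack_epub(folder, epub_path):
    """Zip folder into an EPUB, writing 'mimetype' first and uncompressed."""
    with zipfile.ZipFile(epub_path, "w", zipfile.ZIP_DEFLATED) as archive:
        mimetype = os.path.join(folder, "mimetype")
        if os.path.exists(mimetype):
            archive.write(mimetype, "mimetype", compress_type=zipfile.ZIP_STORED)
        for root, _dirs, files in os.walk(folder):
            for name in sorted(files):
                full = os.path.join(root, name)
                relative = os.path.relpath(full, folder).replace(os.sep, "/")
                if relative != "mimetype":
                    archive.write(full, relative)


# Round-trip demo on a throwaway folder (file names are hypothetical).
with tempfile.TemporaryDirectory() as work:
    source = os.path.join(work, "book")
    os.makedirs(os.path.join(source, "OEBPS"))
    with open(os.path.join(source, "mimetype"), "w") as f:
        f.write("application/epub+zip")
    with open(os.path.join(source, "OEBPS", "ch1.xhtml"), "w") as f:
        f.write("<html/>")
    epub = os.path.join(work, "book.epub")
    pack_epub(source, epub)
    with zipfile.ZipFile(epub) as archive:
        names = archive.namelist()
        first_is_stored = archive.infolist()[0].compress_type == zipfile.ZIP_STORED
    unpack_epub(epub, os.path.join(work, "unpacked"))
    round_trip_ok = os.path.exists(os.path.join(work, "unpacked", "OEBPS", "ch1.xhtml"))

print(names)
```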
Custom Scripts
Another set of tools on your toolbelt can be custom scripts written in a variety of scripting
languages.
Languages like Python, PHP, JavaScript/Node.js... can be used to write scripts that process
individual text files (e.g. XHTML, CSS,...) or complete EPUBs.
None of these is particularly better or worse, and switching to a different language than the
one you already know is rarely beneficial.
All of these scripting languages have features to facilitate handling of XML, pattern
matching, and so on.
There are two hurdles to writing scripts: first of all, installing and configuring the software to
use the scripting language is not always straightforward.
Second, writing scripts is not for the faint of heart, but the rewards are tremendous.
Pick one language, get good at it.
Macintosh
On a standard Mac, some common scripting languages are pre-installed (e.g. PHP, Python
2.7). Installing additional languages is straightforward: open the Terminal window
(Applications -> Utilities -> Terminal.app) then invoke the scripting language. For common
scripting languages the Mac will propose to download and install the necessary command
line tools. For example, when I type python3, which is not installed by default, the Mac
offers to fetch and install it.
For node.js (JavaScript) you need to visit and download from https://nodejs.org
Windows
Windows does not have the more common scripting languages pre-installed. There are
many options to download and install these.
One of the many options is Cygwin, which installs a 'Unix-like' environment on Windows and
allows you to use the same command-line tools as Mac and Linux users:
https://cygwin.com
When you run the Cygwin installer, you'll see a window where you can pick-and-choose and
decide which Linux/Unix tools to install.
I find it easiest to install all PHP-related and all Python-related stuff, and things like zip and
unzip.
Rather than try and be selective, I simply search for 'PHP' and/or 'Python' in the package list
and select the whole 'PHP' and 'Python' collections.
To install Node.js you need to download and run an installer from https://nodejs.org
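Once Python itself runs, on either Mac or Windows, a quick way to see which of the other interpreters are reachable from the command line is the standard shutil.which function. The sketch below is only a convenience check; the function name find_interpreters and the default list of command names are assumptions.

```python
import shutil


def find_interpreters(names=("php", "python3", "node")):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_interpreters().items():
    print(name, "->", path or "not found")
```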
Creating a script
There are many ways to go about this, and I won't even attempt to list all of them.
Instead, I'll be creating a very simple script from scratch and will run it on some XHTML files.
For the sake of argument, my task is to go through an EPUB file and find the style attribute
associated with the <body> tag, and remove it. Instead, I'll move that style attribute into the
CSS file for the body tag.
The first step is to experiment a bit. I can decompress the EPUB using eCanCrusher, or I
can use an EPUB-aware text editor like BBEdit, Atom, Oxygen XML.
I'll open one of the xhtml files and set out to find the regular expression pattern that works.
Eventually, I came up with:
(<body[^>]*) style="[^"]*"([^>]*>)
to be replaced by
\1\2
or
$1$2
depending on the dialect of GREP your text editor is using.
If we only need to do one EPUB, we could simply do a search-and-replace all.
But we want to do this to many EPUBs.
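The other half of the task, moving the style into the CSS file, is not spelled out in these notes. One possible way to derive the CSS rule is sketched below in Python: it captures the value of the style attribute with one extra pair of parentheses and wraps it in a body rule. The function name extract_body_style and the exact formatting of the rule are assumptions.

```python
import re

# Same pattern as above, with an extra group around the style value.
BODY_STYLE = re.compile(r'(<body[^>]*) style="([^"]*)"([^>]*>)')


def extract_body_style(xhtml):
    """Return (xhtml_without_body_style, css_rule_or_None)."""
    match = BODY_STYLE.search(xhtml)
    if match is None:
        return xhtml, None
    cleaned = BODY_STYLE.sub(r"\1\3", xhtml, count=1)
    declarations = match.group(2).strip().rstrip(";")
    if not declarations:
        return cleaned, None        # empty style attribute: nothing to move
    return cleaned, "body { " + declarations + "; }"


cleaned, rule = extract_body_style('<body id="x" style="margin:0; color:black">')
print(cleaned)
print(rule)
```

The resulting rule would still need to be appended to the stylesheet by hand or by a further script.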
The next step is to create a script that will take a file name as a command line parameter,
which then reads the file, performs the search-and-replace, and overwrites the file with the
updated file.
I created a file deleteBodyStyle.php which has the following script:
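<?php
// Read the file whose path was passed as the first command-line argument.
$fileContents = file_get_contents($argv[1]);
// Remove the style attribute from the <body> tag, keeping everything around it.
$fileContents = preg_replace('/(<body[^>]*) style="[^"]*"([^>]*>)/', '$1$2', $fileContents);
// Overwrite the original file with the updated contents.
file_put_contents($argv[1], $fileContents);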
This script reads the file content of the file at hand (file path is provided as $argv[1]), does
the search-and-replace, and writes out the updated file contents.
I also created the equivalent Python version in a file deleteBodyStyle.py:
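import sys
import re

# Read the file whose path was passed as the first command-line argument.
with open(sys.argv[1], 'r', encoding='utf-8') as inFile:
    data = inFile.read()

# Remove the style attribute from the <body> tag, keeping everything around it.
data = re.sub(r'(<body[^>]*) style="[^"]*"([^>]*>)', r'\1\2', data)

# Overwrite the original file with the updated contents.
with open(sys.argv[1], 'w', encoding='utf-8') as outFile:
    outFile.write(data)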
We can now test these scripts. I will be using a sample EPUB made by means of Adobe
InDesign 2020 from an InDesign sample file called Adobe History.indd that came with
InDesign CS3. It is an excerpt from the book Inside the Publishing Revolution: The Adobe
Story by Pamela Pfiffner.
Before looking at DropToScript, I'll first use the scripts in a manual fashion. That's slightly
cumbersome, but we need to go through it to get a good understanding of what is going on.
I crack open the EPUB exported from InDesign, and then I'll use Terminal on Mac (or Cygwin
Terminal on Windows) to execute the scripts from the command line, using drag/drop to
avoid having to type the path to the xhtml files.
cd Desktop
php deleteBodyStyle.php /Users/kris/Desktop/Adobe\ History/OEBPS/Adobe_History-6.xhtml
python deleteBodyStyle.py /Users/kris/Desktop/Adobe\ History/OEBPS/Adobe_History-7.xhtml
To execute either of these scripts on all .xhtml files, we can use some command-line magic.
For the Python script (for the PHP script, substitute php deleteBodyStyle.php):
dirToScan="/Users/kris/Desktop/Adobe History/OEBPS/"; ls "$dirToScan"*.xhtml | while read fileToScan; do python deleteBodyStyle.py "$fileToScan"; done
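If you would rather stay in one language than mix in shell commands, the loop over all .xhtml files can also be written in Python itself. The sketch below is one possible variant, not the script from these notes: process_folder and delete_body_style are made-up names, and the demo runs on a temporary folder with two invented files. Files are read and written as bytes so that line endings are left untouched.

```python
import re
import tempfile
from pathlib import Path

BODY_STYLE = re.compile(r'(<body[^>]*) style="[^"]*"([^>]*>)')


def delete_body_style(text):
    """Remove the style attribute from the <body> tag."""
    return BODY_STYLE.sub(r"\1\2", text)


def process_folder(folder):
    """Rewrite every .xhtml file in folder; return the names of changed files."""
    changed = []
    for path in sorted(Path(folder).glob("*.xhtml")):
        original = path.read_bytes().decode("utf-8")
        updated = delete_body_style(original)
        if updated != original:
            path.write_bytes(updated.encode("utf-8"))
            changed.append(path.name)
    return changed


# Demo on a throwaway folder (file names and contents are hypothetical).
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.xhtml").write_text('<body style="margin:0">Hi</body>', encoding="utf-8")
    Path(folder, "b.xhtml").write_text("<body>Hi</body>", encoding="utf-8")
    changed = process_folder(folder)
    result_text = Path(folder, "a.xhtml").read_text(encoding="utf-8")

print(changed)
```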
After adjusting the CSS file, we can recompress the EPUB.
I've not done any error handling. It would be cleaner to also add additional checks (e.g.
verify the file name extension of the file being processed) and error checks (e.g. report any
unexpected circumstances), but for most intents and purposes the above script will work
fine.
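For completeness, here is a rough idea of what those extra checks could look like in Python. It is a sketch under assumptions: the function name delete_body_style_checked, the list of accepted extensions, and the wording of the messages are all made up, and the function reports problems by returning a message rather than stopping the run.

```python
import re
from pathlib import Path

BODY_STYLE = re.compile(r'(<body[^>]*) style="[^"]*"([^>]*>)')
ALLOWED_EXTENSIONS = {".xhtml", ".html", ".htm"}   # assumption: adjust to taste


def delete_body_style_checked(file_name):
    """Process one file; return an error message, or None on success."""
    path = Path(file_name)
    # Check 1: only touch files with an expected extension.
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        return f"skipped {path.name}: unexpected extension"
    # Check 2: report unreadable or non-UTF-8 files instead of crashing.
    try:
        text = path.read_bytes().decode("utf-8")
    except (OSError, UnicodeDecodeError) as error:
        return f"cannot read {path.name}: {error}"
    # Check 3: report files where nothing matched, and leave them untouched.
    updated, count = BODY_STYLE.subn(r"\1\2", text)
    if count == 0:
        return f"no body style attribute found in {path.name}"
    path.write_bytes(updated.encode("utf-8"))
    return None


print(delete_body_style_checked("notes.txt"))
```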
DropToScript Script Wrapper
Once you have a script, you'll often find that you're going through the same motions over
and over:
• Decompress EPUB
• Run the script on a bunch of files (e.g. html files or css files).
• Repackage EPUB
DropToScript manages the decompress/repackage part automatically. Once you have a
script (in PHP, Python, Node JavaScript...) that can process a single file at a time, you can
configure DropToScript to automatically perform the same script on many files, simply by
dragging an EPUB or a collection of file icons onto the DropToScript icon.
DropToScript comes bundled with a number of pre-made useful scripts, but you can easily
add your own.
After downloading it, you need to configure it so it can find the PHP or Python installation
on your computer. You do this by double-clicking the icon of the application.
As an example, you can copy either the deleteBodyStyle.php or deleteBodyStyle.py file into
the DropScripts folder and then drag-drop any EPUB onto DropToScript to have
deleteBodyStyle executed on the text files inside the EPUB.
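To make the decompress / run script / repackage cycle concrete, the sketch below does all three steps in one pass in Python: it copies an EPUB entry by entry and applies a text transformation to the XHTML entries along the way. This is not DropToScript's implementation, only an illustration of the idea; transform_epub and the sample entries are made up, and the demo works on in-memory archives so it runs without a real EPUB.

```python
import io
import re
import zipfile

BODY_STYLE = re.compile(r'(<body[^>]*) style="[^"]*"([^>]*>)')


def transform_epub(source, target, transform, suffixes=(".xhtml", ".html")):
    """Copy the EPUB at source to target, applying transform to text entries."""
    with zipfile.ZipFile(source) as old:
        with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as new:
            for info in old.infolist():
                data = old.read(info.filename)
                if info.filename.lower().endswith(suffixes):
                    data = transform(data.decode("utf-8")).encode("utf-8")
                # Reusing the original entry info keeps the entry order and
                # each entry's compression setting (mimetype stays stored).
                new.writestr(info, data)


# Demo with a tiny in-memory stand-in for an EPUB (contents are hypothetical).
source = io.BytesIO()
with zipfile.ZipFile(source, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("OEBPS/ch1.xhtml", '<body id="b" style="margin:0">Hi</body>')

target = io.BytesIO()
transform_epub(source, target, lambda text: BODY_STYLE.sub(r"\1\2", text))

with zipfile.ZipFile(target) as archive:
    names = archive.namelist()
    result = archive.read("OEBPS/ch1.xhtml").decode("utf-8")

print(result)
```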
Stuff I use
cd_to (Mac): https://github.com/jbtule/cdto
Cygwin: https://www.cygwin.com/
Atom Text Editor: https://atom.io/
Notepad++ Text Editor: https://notepad-plus-plus.org/
eCanCrusher: https://www.docdataflow.com/ecancrusher/
DropToScript: https://github.com/BCLibCoop/nnels-a11y-publishing/tree/kris-enhancements-20200318/ReleaseVersions
Guitar Jigs: https://www.youtube.com/results?search_query=Dan+Erlewine+jig
Inside the Publishing Revolution: The Adobe Story:
https://www.amazon.com/Inside-Publishing-Revolution-Adobe-Story/dp/0321115643

Better problem solving through scripting: How to think through your #eprdctn roadblocks - Course notes

  • 1. Better problem solving through scripting: How to think through your #eprdctn roadblocks and script your way to efficiency Do you find yourself repeating the same task over and over? Or feeling certain there is a way to automate a task but it's just outside of your skill set? Kris Coppieters from Rorohiko has built a career solving just those kinds of problems. Whether it's scripting a solution from within InDesign, or using AppleScript to finish off some markup, Kris can show you how to bring high-level thinking to quick and dirty tasks. Two categories of automation When it comes to automation, we can classify most of the software used into two large categories which I'll call tools and systems. Tools Tools are smaller programs which are similar to tools used in various crafts, e.g. carpentry. In carpentry, you might use a hammer, nails, screws, screwdrivers, saws, CNC machines, drills... A few of these tools are more complex, and have a wide range of functions, but many tools are simple and have a very specific function. Some of these tools are specifically made for a specific task. In #eprdctn, the tools used are text editors, search-and-replace, XML editors like Oxygen, testing tools, reader programs... Systems Systems are more complex than simple tools. They take in some raw materials and spit out finished results at the other end. A sawmill can be seen as a system. It processes logs and produces planks, poles, boards... The people working in the sawmill will use machines and tools to make the system work. A working system can encompass many processes and sub-processes. Some processes are automated, some are manual processes. In #eprdctn, we might have a system that takes in raw data, and produces ebooks. In #eprdctn publishing in general, systems are often called workflows. Let's talk about tools and jigs This presentation is all about tools for your craft. There's a guy called Dan Erlewine on YouTube, who is a luthier. 
He has many YouTube movies on guitar repair, and many of those movies are about clever self-made tools he uses to work on guitars. He calls them 'jigs'. https://www.youtube.com/results?search_query=dan+erlewine+jig I like that term, so I want to coin the term for 'jig' for #eprdcnt custom tools. You've heard it here first! The first step is to take notice when you find yourself repeatedly performing a cumbersome, difficult, or repetitive task. Then ask yourself whether it's possible to create a jig for that. Basic Tools When working in #eprdctn, a lot of work involves editing various text files. More often than not, the editing will aim to affect the structure of the document, rather than the content: retagging, restructuring tags, managing CSS classes,... Another common operation will be unpacking and repacking files. For example, EPUB files are nothing more than glorified .ZIP files.
  • 2. Regular Expressions One of the tools you want in your toolchest some familiarity with regular expressions. Once you master regular expressions, you can use search-and-replace for a fairly wide range of tasks. Regular expressions can help with tasks that go beyond a simple search-and replace; for example you could use a search and replace with regular expressions for restructuring HTML. Regular expressions are not easy to master. They are very cryptic, almost impossible to read, and they are not standardized. Regular expressions are also often loosely referred to as 'GREP' which is a reference to the Unix command line tool which started it all. GREP = Global Regular Expression Print. Lack of standardization You might be using a text editor like BBEdit, Notepad++, Sublime Edit,... All of these support regular expressions, but each of them will its own unique 'dialect'. They'll be 95% the same between the different text editors, but there are subtle differences. You might be using InDesign as the source for EPUB documents. InDesign supports regular expressions in its Find-Change dialog. And these are in a specific InDesign 'dialect' which has only 80% similarity when compared to the regular expressions used common text editors. InDesign has some fairly unique features with regards to regular expressions, features which you won't find in your text editor. It does not end there. InDesign supports a scripting language called ExtendScript which is a form of JavaScript. ExtendScript supports regular expressions, and guess what? They use yet another dialect of regular expressions, again quite different to the InDesign regular expressions as seen in the Find-Change dialog. Then we have the various scripting languages that could be used for tool creation - PHP, Perl, Python, JavaScript, awk, sed... There are many 10s of them. None of them is 'better' - whatever works, works. 
Again, all of them have regular expressions, and each scripting language will have its own unique dialect. Understand the basic principle and use the documentation To make sense of it all, my recommendation is that you must understand the basic ideas of regular expressions and how they are constructed. These are well supported in all the dialects. Once you understand the basic ideas, you need to consult the documentation and/or use the facilities of the software at hand to determine the proper expressions. To match a thin space, for example: InDesign: ~< Most text editors: x{2009} Referring to matched parenthesized sub-expressions in replacement patterns is another point of difference. For example, sometimes, you need to use $1, sometimes you need to use 1 to refer to the first parenthesized subexpression. Text Editors Another basic tool you need is a (set of) good text editors. Steer clear of word processors or underpowered tools: don't use MS Word, Apple's TextEdit, Notepad.exe ... None of these are proper text editors. Word processors often try to be helpful and 'helpfully' change quotes into curly quotes, or muck with line endings and character encodings, blithely destroying your HTML structure.
  • 3. Some editors are much more than that. If you can afford it, you want to have Oxygen XML in your tool chest. This tool can serve as a text editor but it also understands XML, HTML, CSS... and will allow you to do much smarter editing with much reduced risk. Editing XML files in a regular text editor works just fine, but you run the risk of damaging some finely tuned tag balance, and never know it. Some text editors (e.g. BBEdit, Oxygen, Atom...) are smart enough to handle text files inside ZIP-ped data files (e.g. EPUB files) and you can edit text files inside an EPUB without needing to 'crack it open'. Scripting Languages Another powerful basic tool is to have a basic understanding of some scripting language or languages. There are many out there: the most popular ones are probably Python, JavaScript, PHP, Perl. Most of these scripting languages offer the necessary support for complex operations, e.g. zipping and unzipping, regular expressions, search-and-replace, XML parsing, accessing data over HTTP or HTTPS connections, connecting to databases... A scripting language comes in handy when you're faced with a repetitive task that's going beyond what you can accomplish with find-and-replace and regular expressions. For example, when there is some 'if-then' logic that needs to handled, or some processing that needs to be done. Tools EPUB unpack/repack: eCanCrusher When working with EPUB, if you have access to tools like BBEdit or Oxygen, you can perform EPUB-wide operations like straight on the EPUB file without ever having to decompress/recompress it. But sometimes you want to decompress the EPUB, make some changes, then re-compress it. There are multiple tools that do this. eCanCrusher is one of them. It works in simple drag/drop fashion. https://www.docdataflow.com/ecancrusher/ To decompress: drag/drop an EPUB file onto the eCanCrusher application icon. A decompressed EPUB folder will appear. 
To recompress: drag/drop an EPUB folder onto the eCanCrusher application icon. A compressed EPUB file will appear. To configure: double-click the eCanCrusher application icon.
  • 4. Custom Scripts Another set of tools on your toolbelt can be custom scripts written in a variety of scripting languages. Languages like Python, PHP, JavaScript/Node.js... can be used to write scripts that process individual text files (e.g. XHTML, CSS,...) or complete EPUB. None of these is particularly better or worse, and switching to a different language than the one you already know is rarely beneficial. All of these scripting languages have features to facilitate handling of XML, pattern matching, and so on. There are two hurdles to writing scripts: first of all, installing and configuring the software to use the scripting language is not always straightforward. Second, writing scripts is not for the faint of heart, but the rewards are tremendous. Pick one language, get good at it. Macintosh On a standard Mac, some common scripting languages are pre-installed (e.g. PHP, Python 2.7). Installing additional languages is straightforward: open the Terminal window (Application -> Utilities -> Terminal.app) then invoke the scripting language. For common scripting languages the Mac will propose to download and install the necessary command line tools. In the screenshot below, I've just typed python3 As python3 is not installed by default, the Mac offers to fetch and install it: For node.js (JavaScript) you need to visit and download from https://nodejs.org
  • 5. Windows
Windows does not have the more common scripting languages pre-installed. There are many options to download and install these. One of them is Cygwin, which installs a 'Unix-like' environment on Windows and allows you to use the same command-line tools as Mac and Linux users: https://cygwin.com
When you run the Cygwin installer, you'll see a window where you can pick and choose which Linux/Unix tools to install. I find it easiest to install all PHP-related and all Python-related packages, plus things like zip and unzip.
  • 6. Rather than try to be selective, I simply search for 'PHP' and/or 'Python' in the package list and select the whole 'PHP' and 'Python' collections.
  • 8. To install Node.js you need to download and run an installer from https://nodejs.org
Creating a script
There are many ways to go about this, and I won't even attempt to list all of them. Instead, I'll create a very simple script from scratch and run it on some XHTML files.
For the sake of argument, my task is to go through an EPUB file, find the style attribute on the <body> tag, and remove it. Instead, I'll move that style information into the CSS file, into the rule for the body tag.
The first step is to experiment a bit. I can decompress the EPUB using eCanCrusher, or I can use an EPUB-aware text editor like BBEdit, Atom, Oxygen XML. I'll open one of the xhtml files and set out to find a regular expression pattern that works. Eventually, I came up with:
    (<body[^>]*) style="[^"]*"([^>]*>)
to be replaced by \1\2 or $1$2, depending on the dialect of GREP your text editor is using.
If we only needed to do one EPUB, we could simply do a search-and-replace all. But we want to do this to many EPUBs.
The next step is to create a script that takes a file name as a command line parameter, reads the file, performs the search-and-replace, and overwrites the file with the updated contents. I created a file deleteBodyStyle.php which has the following script:
  • 9.
    <?php
    // Read the file whose path was passed as the first command line argument.
    $fileContents = file_get_contents($argv[1]);
    // Drop the style attribute from the <body> tag; keep everything around it.
    $fileContents = preg_replace('/(<body[^>]*) style="[^"]*"([^>]*>)/', '$1$2', $fileContents);
    // Overwrite the original file with the updated contents.
    file_put_contents($argv[1], $fileContents);

This script reads the content of the file at hand (the file path is provided as $argv[1]), does the search-and-replace, and writes out the updated file contents. I also created the equivalent Python version in a file deleteBodyStyle.py:

    import sys
    import re

    # Read the file whose path was passed as the first command line argument.
    with open(sys.argv[1], 'r') as inFile:
        data = inFile.read()

    # Drop the style attribute from the <body> tag; keep everything around it.
    data = re.sub(r'(<body[^>]*) style="[^"]*"([^>]*>)', r'\1\2', data)

    # Overwrite the original file with the updated contents.
    with open(sys.argv[1], 'w') as outFile:
        outFile.write(data)

We can now test these scripts. I will be using a sample EPUB made by means of Adobe InDesign 2020 from an InDesign sample file called Adobe History.indd that came with InDesign CS3. It is an excerpt from the book Inside the Publishing Revolution: The Adobe Story by Pamela Pfiffner.
Before looking at DropToScript, I'll first use the scripts in a manual fashion. That's slightly cumbersome, but we need to go through it to get a good understanding of what is going on.
I crack open the EPUB exported from InDesign, and then I'll use Terminal on Mac (or Cygwin Terminal on Windows) to execute the scripts from the command line, using drag/drop to avoid having to type the path to the xhtml files. Because the folder name contains a space, the space must be escaped (or the whole path quoted):

    cd Desktop
    php deleteBodyStyle.php /Users/kris/Desktop/Adobe\ History/OEBPS/Adobe_History-6.xhtml
    python deleteBodyStyle.py /Users/kris/Desktop/Adobe\ History/OEBPS/Adobe_History-7.xhtml

To execute either of these scripts on all .xhtml files, we can use some command-line magic:

    dirToScan="/Users/kris/Desktop/Adobe History/OEBPS/"; ls "$dirToScan"*.xhtml | while read fileToScan; do php deleteBodyStyle.php "$fileToScan"; done

or

    dirToScan="/Users/kris/Desktop/Adobe History/OEBPS/"; ls "$dirToScan"*.xhtml | while read fileToScan; do python deleteBodyStyle.py "$fileToScan"; done

After adjusting the CSS file, we can recompress the EPUB.
I've not done any error handling. It would be cleaner to also add additional checks (e.g. verify the file name extension of the file being processed) and error checks (e.g.
report any unexpected circumstances), but for most intents and purposes the above script will work fine.
DropToScript Script Wrapper
Once you have a script, you'll often find that you're going through the same motions over and over:
• Decompress EPUB
• Run the script on a bunch of files (e.g. html files or css files)
• Repackage EPUB
DropToScript manages the decompress/repackage part automatically. Once you have a script (in PHP, Python, Node JavaScript...) that can process a single file at a time, you can
  • 10. configure DropToScript to automatically perform the same script on many files, simply by dragging an EPUB or a collection of file icons onto the DropToScript icon.
DropToScript comes bundled with a number of pre-made useful scripts, but you can easily add your own. After downloading it, you need to configure it so it can find the PHP or Python installation on your computer. You do this by double-clicking the icon of the application.
As an example, you can copy either the deleteBodyStyle.php or deleteBodyStyle.py file into the DropScripts folder and then drag-drop any EPUB onto DropToScript to have deleteBodyStyle executed on the text files inside the EPUB.
Stuff I use
cd_to (Mac): https://github.com/jbtule/cdto
Cygwin: https://www.cygwin.com/
Atom Text Editor: https://atom.io/
Notepad++ Text Editor: https://notepad-plus-plus.org/
eCanCrusher: https://www.docdataflow.com/ecancrusher/
DropToScript: https://github.com/BCLibCoop/nnels-a11y-publishing/tree/kris-enhancements-20200318/ReleaseVersions
Guitar Jigs: https://www.youtube.com/results?search_query=Dan+Erlewine+jig
Inside the Publishing Revolution: The Adobe Story: https://www.amazon.com/Inside-Publishing-Revolution-Adobe-Story/dp/0321115643
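The deleteBodyStyle.py script shown earlier skips error handling. As a rough sketch of what the extra checks mentioned there might look like, the variant below verifies the file name extension, reports files it cannot read, and leaves files untouched when nothing matched. The function names and the choice of UTF-8 as the file encoding are assumptions made for this example, not part of the original script.

```python
import re
import sys

# Same pattern as in deleteBodyStyle.py: a style attribute inside the <body> tag.
BODY_STYLE = re.compile(r'(<body[^>]*) style="[^"]*"([^>]*>)')


def strip_body_style(text):
    """Return (updated_text, number_of_replacements)."""
    return BODY_STYLE.subn(r'\1\2', text)


def process_file(path):
    """Remove the body style attribute from one file; return True if it changed."""
    # Extension check: only touch XHTML/HTML files.
    if not path.lower().endswith((".xhtml", ".html")):
        print("Skipping (not an XHTML/HTML file): " + path, file=sys.stderr)
        return False
    try:
        # Assumption: the files are UTF-8 encoded.
        with open(path, "r", encoding="utf-8") as in_file:
            data = in_file.read()
    except (OSError, UnicodeDecodeError) as error:
        print("Cannot read " + path + ": " + str(error), file=sys.stderr)
        return False
    new_data, count = strip_body_style(data)
    if count == 0:
        # Nothing to change, so do not rewrite the file.
        return False
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(new_data)
    return True


if __name__ == "__main__":
    # Accept any number of file paths on the command line.
    for file_path in sys.argv[1:]:
        process_file(file_path)
```

Because it still accepts a single file path as its first argument, a script like this can be run from the command line in the same way as the original.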