Edit 2022-03-25: The tool is part of something bigger! Read more in a separate post.
Recently, I built a PDF viewer that mimics PowerPoint's presenter view. Among other things, it shows slide notes next to the corresponding slides. Since I still create my slides in PowerPoint, I also still write my slide notes there. That's why I needed a way to extract the slide notes from PowerPoint presentation files and convert them to a text file that my PDF viewer accepts.
The result is a Bash script that requires `xmlstarlet` (`sudo apt install xmlstarlet`). Usage is pretty straightforward:
extractnotespptx /path/to/presentation.pptx
It produces a file `/path/to/presentation.notes` containing all the slide notes in the format required by `presenterview-detached`:
#1
Lorem ipsum...
#2
Further bla bla
...
Under the hood
Continue reading if you want to know how it's done. The complete script can be found on GitHub.
Unzip PPTX
Fortunately, `pptx` files are just `zip` files containing a bunch of XML files. Inside an unzipped `pptx` you can find a folder `ppt/notesSlides` with a number of XML files called `notesSlide1.xml`, `notesSlide2.xml`, `notesSlide3.xml`, and so on. These contain the notes that were added to the slides. Note that the number in the file name doesn't necessarily match the slide number, i.e. `notesSlide13.xml` may contain the notes you added to slide number 22.
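The unzip step could look roughly like the following. This is only a sketch, assuming Info-ZIP's `unzip` is installed; the function name `unpack_notes` is made up for illustration and is not taken from the actual script.

```shell
# Hypothetical helper (not from the actual script): unpack only the notes
# XML files of a pptx into a temporary directory and print that directory.
unpack_notes() {
    local pptx="$1" tmpdir
    tmpdir="$(mktemp -d)" || return 1
    # -q keeps unzip quiet; the quoted pattern is matched by unzip itself,
    # not expanded by the shell.
    unzip -q "$pptx" 'ppt/notesSlides/*.xml' -d "$tmpdir" || return 1
    printf '%s\n' "$tmpdir/ppt/notesSlides"
}
```

Calling `ls "$(unpack_notes /path/to/presentation.pptx)"` would then list the extracted notes files.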
Extract Notes From XML
The above-mentioned XML files have two interesting fields: `/p:notes/p:cSld/p:spTree/p:sp/p:txBody/a:p/a:fld[@type='slidenum']/a:t` and `/p:notes/p:cSld/p:spTree/p:sp/p:txBody/a:p`. The former contains the number of the slide to which the notes belong, and the latter contains a lot of subfields with the actual slide notes.
Using `xmlstarlet`, you can easily extract the contents: `xmlstarlet sel -t -m "//a:fld[@type='slidenum']" -v .` gives you the slide number, and `xmlstarlet sel -t -m "//p:txBody//a:p[.//a:r//a:t]" -v . -n` extracts the actual note text spread across several lines, one for each matching `a:p` paragraph.
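Wrapped into functions, the two queries might be used like this. It is a sketch rather than the script's actual code: the helper names `slide_number` and `note_text` are invented here, and the namespaces are registered explicitly with `-N` as an extra precaution.

```shell
# Namespace URIs used by PresentationML (p) and DrawingML (a).
NS_P='p=http://schemas.openxmlformats.org/presentationml/2006/main'
NS_A='a=http://schemas.openxmlformats.org/drawingml/2006/main'

# Hypothetical helper: print the slide number a notes file belongs to.
slide_number() {
    xmlstarlet sel -N "$NS_P" -N "$NS_A" -t \
        -m "//a:fld[@type='slidenum']" -v . -n "$1"
}

# Hypothetical helper: print the note text, one line per paragraph.
# The predicate skips paragraphs without text runs, such as the one
# that only holds the slide number field.
note_text() {
    xmlstarlet sel -N "$NS_P" -N "$NS_A" -t \
        -m "//p:txBody//a:p[.//a:r//a:t]" -v . -n "$1"
}
```

Passing `-N` spells the prefixes out, so the sketch does not depend on how xmlstarlet handles the namespaces declared in the document itself.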
The final script collects these things in an associative Bash array, sorts everything by array key, i.e. slide number, and writes out the Markdown file that `presenterview-detached` needs.
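The collect-and-sort step can be sketched as follows, assuming Bash 4 or newer for associative arrays. The names `notes`, `add_note` and `write_notes` are placeholders chosen for this example; the real script may be organized differently.

```shell
# Hypothetical sketch: map slide numbers to note text, then print the
# entries in slide order using the "#<number>" format shown earlier.
declare -A notes

add_note() {            # usage: add_note <slide-number> <text>
    notes["$1"]="$2"
}

write_notes() {
    local n
    # Associative arrays have no defined order, so sort the keys numerically.
    for n in $(printf '%s\n' "${!notes[@]}" | sort -n); do
        printf '#%s\n%s\n' "$n" "${notes[$n]}"
    done
}
```

After filling the array with one `add_note` call per notes file, `write_notes > /path/to/presentation.notes` would produce the output file.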