Experiment: Building corpora from television captions

During my PhD research I developed a workflow to extract captions from New Zealand’s digital TV broadcasts and build datasets of the text associated with those broadcasts. The captions were encoded as images, so part of the process used Optical Character Recognition (OCR) to extract the text. At the 2016 NZEENZ conference I presented on the possibility of using caption-based corpora to study and monitor the content of NZ media and NZ English.
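
The OCR step can be sketched roughly as follows. This is a minimal illustration, not the actual workflow: it assumes the caption images have already been saved as individual files, and it assumes Tesseract (via the pytesseract and Pillow packages) as the OCR engine, which is my choice for the example only. The function names `tesseract_ocr` and `captions_to_text` are hypothetical.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def tesseract_ocr(image_path: Path) -> str:
    """OCR one caption image with Tesseract.

    Assumption: Tesseract is the engine; requires pytesseract and Pillow.
    They are imported here so the rest of the module works without them.
    """
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


def captions_to_text(
    image_paths: Iterable[Path],
    ocr: Callable[[Path], str] = tesseract_ocr,
) -> List[str]:
    """Turn caption images into a list of cleaned caption strings."""
    lines: List[str] = []
    for path in image_paths:
        # Collapse line breaks and runs of spaces left by the OCR engine.
        text = " ".join(ocr(path).split())
        # Skip images where no text was recognised.
        if text:
            lines.append(text)
    return lines
```

Passing the OCR function as a parameter keeps the cleaning logic separate from the engine, so a different OCR tool could be swapped in. With the default, something like `captions_to_text(sorted(Path("captions").glob("*.png")))` would return the caption text in file-name order, where the `captions` directory is again only an example.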

Skills & Tools: Wrangling hardware; Automating capture of specific TV shows; Integrating software to extract and OCR image captions.