- [Richard] Let's now hear from Brad Turner, who will tell us about Bookshare's initiative called Page AI for transforming PDF to EPUB.
- [Brad] Perfect, thank you. So last time we spoke, we talked briefly about math and how to make math accessible. And we embarked on a project that we called Math Detective, because it goes through books, finds the math, whether it's image-based math or inline math, sends that to a different remediation engine, and re-injects the accessible math back into the book. That project advanced to the point that we introduced it into our book production pipeline, and we've now transformed almost 8 million math equations within Bookshare. And we're working on thresholds of accuracy for that math. Right now we've set the threshold very, very high, and we want to be able to drop it down and increase the number of equations.
That led us to a project we're calling Page AI, which is transforming a PDF to an EPUB. We get thousands of books a month from publishers, but sometimes we get requests for books that we can't get from publishers. The way our process works right now is this: a user requests a book. We procure, chop, and scan that book, and run it through optical character recognition. We send it out for proofing to ensure accuracy on that book. And then that book gets sent back up to the Bookshare library. The problem with that is that image retention can be a problem when you scan a book with a bunch of images, and the math often is not accessible, because you lose images or create inaccessible inline text out of a math equation. It's also very expensive for a complex book: hundreds to thousands of dollars, and, you know, three weeks to three months to process these complex books. When you're working with students who need, you know, a grade 11 math book for school, having them go three months before they get their content almost guarantees they will struggle in the classroom.
And so we said, if we can do this with math, why can't we do this with an entire book? So we started on an AI project where we started breaking down the pages of a book into different classes: headers, text, images, block math, inline math, tables, page numbers, and captions. And essentially what it looks like is this: on the left side of the screen is a standard page, and on the right side of the screen we have that page blocked out in different colours. So on the top right of the page is an image, and the block that covers that image is yellow. At the very top is the page number, and so we have the page number in a different colour, pink in this case. Headers are all green, block text is all blue, and captions are grey. So all of a sudden we're able to say, hmm, if this is block text, let's mark it out as block text in a template. We then train our AI model with that template.
And the next three or four screens I'll show you are the model telling us what it thinks the headers are. So you can see the pixel density on the right side of the page, where we've said, go find all the headers. And based on our training, the model has pulled out the chapter header and then the header for each of the different sections. On this next slide, we've said, find the images. And there's one image on this page. And based on the pixels, it has found that image. So everything else is black, and the image is what's starting to show through. And there are our blocks of text. So each of the elements on the page is being identified by the model.
You end up then with a model that gives you the ability to block out the elements on the page. And so on the screen, you see a couple of different textbook pages. On one, there's a block of text talking about the Amazon River and a picture of the Amazon River. The picture is blocked in yellow, and the text is blocked in blue. And then you see there are equations that are blocked in purple.
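The per-class output described above, where everything is black except the pixels the model believes belong to one class, can be turned into a block by thresholding the scores and taking the extent of the bright pixels. The sketch below is only an illustration under stated assumptions: the heat map is a plain nested list of scores between 0 and 1, the 0.5 threshold is arbitrary, `bounding_box` is a hypothetical helper name, and it draws a single box per class map rather than separating several regions. It is not Bookshare's implementation.

```python
from typing import List, Optional, Tuple

# (left, top, right, bottom) in inclusive pixel coordinates
Box = Tuple[int, int, int, int]


def bounding_box(heat_map: List[List[float]], threshold: float = 0.5) -> Optional[Box]:
    """Return the smallest box containing every pixel scoring at or above threshold.

    Hypothetical helper: assumes one region per class map.
    """
    hits = [
        (x, y)
        for y, row in enumerate(heat_map)
        for x, score in enumerate(row)
        if score >= threshold
    ]
    if not hits:
        return None  # nothing of this class was found on the page
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    return (min(xs), min(ys), max(xs), max(ys))


# A toy 4x5 "image" map: bright pixels in the top-right corner, black elsewhere.
image_map = [
    [0.0, 0.0, 0.1, 0.9, 0.8],
    [0.0, 0.0, 0.0, 0.9, 0.9],
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
print(bounding_box(image_map))  # (3, 0, 4, 1)
```

A page with several headers or equations would need the bright pixels grouped into separate regions first, for example with connected-component labelling, before a box is drawn around each one.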
So all of a sudden now we're able to break this page down into its discrete elements. Once we have those segments in boxes, we order those boxes, so we know where each of those segments goes. We either cluster them using a separate machine learning clustering algorithm, or we just use a simple sort, and then order those boxes. Also at that point, we assign a confidence rating to each page, which allows us to determine whether we need to come back and look at that page or whether the machine is completely confident that the page is accurate.
On this next slide, what you see are the numbers in the boxes. So the very top left is box one. And then you move down to box two, which is the "Think About" header. And then we go left to right: three and four are the text boxes, five is the image, six is the equation. So all of a sudden you start to find the placement of those blocks on the page, and then we synthesise that into XHTML. We save the images in order, and then we determine what is in each block. If it's text, it goes to an OCR engine; in our tests we're using Tesseract. If it's an image or math or a table, it goes to our Math Detective project that I talked about earlier. We then create a new HTML file at the beginning of each chapter. We create a table of contents page with links to each chapter file, and lo and behold, we've created an EPUB from an untagged PDF, because then we generate that book back.
So on the left, you see the original page, blocked out. On the right, you see the machine has put it back together with headers, page numbers, and images, and it's a standard EPUB page. We create the additional files necessary for the EPUB, compress the parent directory into an EPUB3, and regenerate the book as an EPUB. So now the process begins to look like this: the user requests a book, and we chop and scan it.
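The "simple sort" option for ordering boxes could look something like the sketch below: top to bottom, then left to right among boxes that start at roughly the same height. This is an assumption about what such a sort might be, not the sort Bookshare uses; the `Block` type, the `reading_order` name, the pixel coordinates, and the 20-pixel row tolerance are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    label: str  # e.g. "header", "left text", "image"
    left: int   # x coordinate of the block's left edge, in pixels
    top: int    # y coordinate of the block's top edge, in pixels


def reading_order(blocks: List[Block], row_tolerance: int = 20) -> List[Block]:
    """Order blocks top to bottom, and left to right within a row."""
    rows: List[List[Block]] = []
    for block in sorted(blocks, key=lambda b: b.top):
        # Join the current row if this block starts at roughly the same
        # height as the first block in that row; otherwise start a new row.
        if rows and abs(block.top - rows[-1][0].top) <= row_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])
    return [b for row in rows for b in sorted(row, key=lambda b: b.left)]


# Made-up coordinates loosely following the six-box slide described above.
page = [
    Block("image", 40, 500),
    Block("right text", 320, 198),
    Block("equation", 40, 700),
    Block("header", 40, 80),
    Block("left text", 40, 200),
    Block("page number", 40, 10),
]
ordered = [b.label for b in reading_order(page)]
print(ordered)
# ['page number', 'header', 'left text', 'right text', 'image', 'equation']
```

A plain sort like this breaks down on multi-column pages, where a whole column should be read before the next one starts, which is one reason a clustering approach might be preferred there.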
We then dump it into the EPUB Generator API, which goes through our machine learning models, segments the pages, and determines what's in each segment. Some segments go to the image processing API, and some go, excuse me, (Brad coughs) to optical character recognition. And then it reassembles the book and submits it back up to Bookshare. We can retain images, and the math becomes accessible. The math images all get alt text. The inline math becomes accessible because it gets marked as math. It's significantly cheaper because it's all online. It's a couple of hours for a 500-page book, and that's in our prototype testing on a PC, not on a beefy server. So much cheaper, much faster. There's the old adage of good, cheap, and fast: pick two of the three. This is a good book that's cheaper and faster.
So we continue to work on that. We have integrated our Math Detective project with that as well. So now all of a sudden we're starting to be able to take scanned books, break them down, remediate the math, remediate the text, identify the images, rebuild that book, and generate an EPUB, in some cases without ever touching the book. So more to come as the models improve and as our machine learning techniques improve. But it's super promising for remediating image-based content. If you have any questions, contact me at BradT@Benetech.org. Thanks very much.
- [Richard] Brad, thank you for that. That's just wonderful.
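The last step of the reassembly Brad describes, compressing the book directory into an EPUB, can be sketched with Python's standard `zipfile` module. This is a minimal sketch, not the EPUB Generator API: the `zip_epub` name and the file names are hypothetical, and the files written in the demo are placeholders, not valid EPUB content. The one firm rule shown is from the EPUB container format: the `mimetype` entry comes first in the archive and is stored uncompressed.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_epub(book_dir: Path, epub_path: Path) -> None:
    """Compress an already-assembled book directory into one .epub file.

    Assumes epub_path lies outside book_dir.
    """
    with zipfile.ZipFile(epub_path, "w") as archive:
        # "mimetype" must be the first entry and must not be compressed.
        archive.writestr(
            zipfile.ZipInfo("mimetype"),
            "application/epub+zip",
            compress_type=zipfile.ZIP_STORED,
        )
        for path in sorted(book_dir.rglob("*")):
            # Skip directories and any on-disk mimetype file (written above).
            if path.is_file() and path != book_dir / "mimetype":
                archive.write(
                    path,
                    path.relative_to(book_dir).as_posix(),
                    compress_type=zipfile.ZIP_DEFLATED,
                )


# Demo with placeholder files; a real book needs a proper container.xml,
# package document, navigation document, and chapter files.
with tempfile.TemporaryDirectory() as tmp:
    book = Path(tmp) / "book"
    (book / "META-INF").mkdir(parents=True)
    (book / "META-INF" / "container.xml").write_text("<container/>")
    (book / "chapter1.xhtml").write_text("<html/>")
    out = Path(tmp) / "book.epub"
    zip_epub(book, out)
    with zipfile.ZipFile(out) as archive:
        names = archive.namelist()
        mimetype_info = archive.getinfo("mimetype")

print(names[0])  # mimetype
```

Running a validator such as EPUBCheck on the result is the usual way to confirm that the generated files, and not just the archive layout, meet the specification.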