Monday, September 7, 2015

Answer: How to search in a scanned document?


As you'd expect... 

... there are many ways to search in a scanned PDF for some text. 

Let's review: the SearchResearch Challenge for this week is meant to give you an additional powerful tool for importing scanned documents and making them findable.  


1.  How can you transform this document (LINK) into something that you can search within? 
2.  Once you've done that, can you determine how many times the authors refer to "multiple documents" in that paper?  (This was my original search task--finding interesting papers about how people read multiple documents at the same reading session. That's how I found this paper.)  

So this Challenge is really about "tool finding" -- can you figure out how to convert from a scanned document into a readable / findable / searchable one?  


As we've talked about before, taking a scanned document and converting the scan into recognizable text is called "Optical Character Recognition," or OCR, so I'm going to use that in my query.  

I also remembered that Google Docs had some OCR capability, so my first query was: 

     [ Google docs OCR ] 

which led me to a lovely Help Center article about how to import a PDF file into your Google Drive, then open it with Docs.  And, voila, instant OCR!  

Here's what it looks like: 


As you can see, I imported the scanned PDF into Drive, and then I Control-Clicked on the document to "Open with" Google Docs.  This automatically runs the OCR process and gives me a new Google Doc that combines the scanned version with the OCR-d text. 



As you can see, the OCR process correctly recognized the text.  The scan is above the horizontal line, and the recognized text is below it.  

Now I can just use Control-F to find the text  multiple documents and we should be done.  Here's what I found: 


As you can see, the Control-F found 2 instances of our target string, that is, multiple documents.  

But I'm a bit of a traditionalist--I like to read long papers like this on the printed page, so I printed it out and began to read.  Everything was fine, but then as I read, I saw another instance of the phrase multiple documents that was NOT one of the two I'd found by Control-F!  WHAT?  How was that possible? 

I went back to my Google Doc and looked at the first page of the OCR-d PDF: 


That's when I noticed that much of the first page of text had NOT been recognized!  Huh.  As you can see in the above image, you can't even Control-F for the title of the document: there are zero hits for the title.  IF the OCR process was accurate, it certainly would have located the title of the paper (which is just a few lines below).  

Okay, I know that OCR is a difficult process; many OCR systems have errors, and I just found one here in the Docs OCR.  When there are strange boxes on the page, Docs OCR might skip over a chunk of the text.

But that didn't explain the "extra" instances of the phrase multiple documents I found in the printed-out version of the paper.  What's up with that?  

As I scrolled down looking for the "extra" instance I'd found, I discovered that the Google Docs version ended at page 10 (out of 21 pages in the original)--there were no references, and nothing past the mid-point of the paper!  Gack.  

I went back to the Help Center for some explanation, and discovered that it very clearly says "... For PDF files, we only look at the first 10 pages when searching for text to extract."  

Okay, so it's documented, but it's still a huge surprise.  There should be a notice in the converted doc (in bold, red, flaming letters) that tells you this.  It should say something like "There's more text in your document, but we stopped the OCR after 10 pages."  

Argh.  THAT's frustrating.  Time for another approach, one that will do more than 10 pages of OCR.  

My next query was: 

     [ OCR recognition PDF ] 

and I learned there are a number of online PDF OCR conversion tools.  But I ALSO learned that Adobe Acrobat has a conversion capability built into it.  (Note that this is for Acrobat Pro, not Acrobat Reader--that just lets you read PDF files, not convert them.)  

So, like Teri, I just used the OCR tools for Acrobat to convert.  I used the default settings to OCR the text.  I opened the PDF in Acrobat, opened the "Recognize Text" tool on the right side (see below) and clicked "In This File" to run the OCR.  


It took a couple of minutes, but then gave me a nice, Control-F-able document.  In this document I found 6 instances of multiple documents.  





And so, that was that.  I'd found all 6 instances. 

Or was it? 

In the comments on this post, Jon, Aui, and Remmij found it 7 times.  What?  How's that possible?  How could I have missed one? 

As Jon pointed out, you could search Google Books for the book that this chapter is in (Handbook of Research on Reading Comprehension) and then do a search for "multiple documents" in that book.  Indeed, you'll find the phrase 7 times (but only 5 of them are from this particular chapter, the other 2 are from other chapters in the book).  

But Aui did an interesting thing by doing a search for: 

     ["reading comprehension strategies" "strategies are developmental" ]

which found an already-scanned version of the paper at Academia.edu (a technical paper repository)!  A Control-F search there finds... 6 hits.  But Aui reports finding 7 hits!  What's going on? 

I tried to figure out how Aui could have found 7 instances of multiple documents.  What would be a more "basic" way to do this search?  And what was I doing wrong? 

To make things as basic as possible, I downloaded the full-text PDF from the Academia.edu site.  I opened that document, then selected all the text (CMD+A on a Mac, or Control+A on a PC), then copied and pasted it into a SimpleText document (an MS Word document would work as well).  

The Trick:  When you copy/paste from the PDF file into a SimpleText or MS Word document, the receiving document drops all of the formatting information, including things like the newline character.  As a consequence, it runs ALL of the text together like this: 


This is a real pain if you're trying to copy formatted text from point A to point B, but when you're doing a text-find, it can be an advantage.  

But notice this... When I did a Control-F in this SimpleText document (without any formatting), I found 7 instances of multiple documents.  (See the number 7 on the right side of the search box above?)  

Let's look at the extra instance in the original PDF.  (This is the one that wasn't found using our normal search methods.)  Here I've put boxes around the two words: 


See that?  They're on separate lines of text.  So THAT'S why doing a search in the PDF or in the Google Docs copy doesn't work--that pair of words is separated by a newline character.  When I copy-pasted it into the SimpleText editor, the paste operation dropped all of the newlines and all of a sudden, Control-F could work.  

And so, yes, Aui found the correct answer: there are 7 (seven!) instances of the phrase multiple documents in this paper.  

More generally, this is something to be careful of when using Control-F.  Look at the following piece of text (this happens to be in Google Docs, but it can be in almost any text editor): 


Notice that the Control-F FIND box (pointed to by the red arrow) shows that there's only 1 instance found.  The Control-F command only found the multiple documents highlighted in green. 

I added the orange box to show you that there's actually another multiple documents in the text--this one happens to have a newline character between the two words (they fall on the first and second lines), while the instance in the second paragraph does not.  

Control-F does not work across newline boundaries.  That's why the copy-paste without formatting was useful in the previous example--it deleted all of the formatting, including newlines.  
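If you'd rather script the copy-paste trick than do it by hand, here is a minimal Python sketch of the same idea.  It assumes you already have the document's text in a string; the function name count_phrase and the sample text are made up for illustration, and the case-insensitive matching is my assumption about how a typical Control-F behaves.

```python
def count_phrase(text: str, phrase: str) -> int:
    """Count a phrase after collapsing every run of whitespace to one space.

    This mimics pasting without formatting: newlines, tabs, and repeated
    spaces all become a single space before searching.
    """
    flattened_text = " ".join(text.split())
    flattened_phrase = " ".join(phrase.split())
    # Lowercasing both sides is an assumption: most Control-F boxes ignore case.
    return flattened_text.lower().count(flattened_phrase.lower())


# Hypothetical sample: the first instance is split across two lines.
sample = (
    "Readers often consult multiple\n"
    "documents in one session.\n"
    "Reading multiple documents is hard.\n"
)

print(sample.count("multiple documents"))          # plain search: prints 1
print(count_phrase(sample, "multiple documents"))  # flattened search: prints 2
```

The plain search behaves like Control-F in the PDF and misses the split instance; the flattened search behaves like Control-F in the pasted, unformatted text.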


Now you know.  

Other ways to do this conversion:  There are, of course, other ways to OCR a scanned PDF.  As Rosemary pointed out, Kami is one such tool.  And Remmij pointed out Free Online OCR (http://www.onlineocr.net/) which has a 5 Mb limit, so it doesn't quite work for this example. 

Beyond that, there are various paid methods you can use.  These include web services such as CometDocs (which I hear good things about), and there are apps you can buy to do this as well, such as Prizmo and ABBYY FineReader Express (both Mac apps) and Evernote (both platforms). 


Search Lessons 

There are several lessons in this week's Challenge (not all of which I understood before taking on the Challenge myself).  

1.  Be sure you know the limits of your tools.  I was somewhat surprised to find out (the hard way) that the Google Docs OCR process would only convert 10 pages of your text.  I found out the limit accidentally, but then followed up by checking the documentation and doing a bit of testing myself.  

2. Always sanity check your results.  When I noticed that the printed version of the paper seemed to have more instances of our phrase than the online version, a little bell went off in my brain.  That's what started me sanity checking things.  Be aware, be sensitive, and be willing to spend the extra couple of minutes running down funny little anomalies.  (There's a famous book, The Cuckoo's Egg, that tells the story of how Cliff Stoll exposed an international hacking operation by tracking down a missing 9 seconds of computer time.  Moral: Pay attention to small discrepancies. They can be important.) 

3. REALLY understand the limits of your tools.  As we see from Aui's clever result, sometimes even something as simple as Control-F won't work across newline boundaries, and you very well might miss a result that you care about.  This is true for many (all?) text editors, including MS Word and Google Docs.  

4. Sometimes searching for text fragments can lead you to another version of the document that's more amenable to search.  Aui's search for a couple of phrases from the original paper led directly to an already-scanned and searchable version of the paper.  I hadn't found that version in any of my searches.  It's another version of the "One more search" aphorism--in this case, searching for the same document in a very different way leads to success.  

5.  Control-F does not work across newlines.  As always, pay attention.  If you're looking for just a single word, there's no issue here, a newline can't sneak into the middle of a word (although a smart document editor might hyphenate it on you).  But if you're searching for phrases, be careful--the longer the phrase, the more likely it is that you're going to miss an instance or two.  
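For lesson 5, here is a rough Python sketch of a phrase search that tolerates line breaks, including the hyphenation case mentioned above.  This is one possible approach, not what any of the tools in this post actually do.  The helper names phrase_pattern and find_phrase and the sample text are hypothetical; the only library used is Python's standard re module.

```python
import re
from typing import List


def phrase_pattern(phrase: str) -> "re.Pattern[str]":
    """Build a regex for the phrase that survives line breaks.

    Between words: any run of whitespace, newlines included.
    Inside a word: an optional hyphen followed by a line break, to catch
    words that were hyphenated at the end of a line.
    """
    hyphen_break = r"(?:-\s*\n\s*)?"
    word_patterns = [
        hyphen_break.join(re.escape(ch) for ch in word)
        for word in phrase.split()
    ]
    return re.compile(r"\s+".join(word_patterns), re.IGNORECASE)


def find_phrase(text: str, phrase: str) -> List[int]:
    """Return the 1-based line number where each match begins."""
    pattern = phrase_pattern(phrase)
    return [text.count("\n", 0, m.start()) + 1 for m in pattern.finditer(text)]


# Hypothetical sample: one normal hit, one split across lines, one hyphenated.
sample = (
    "We studied multiple documents.\n"
    "Students compared multiple\n"
    "documents side by side, and multiple docu-\n"
    "ments were read per session.\n"
)

print(find_phrase(sample, "multiple documents"))  # prints [1, 2, 3]
```

Unlike the copy-paste trick, this leaves the text untouched, so you can go back to the reported line and check each hit in context.  The hyphen rule can produce false matches on words that genuinely contain a hyphen at a line end, so treat the hits as candidates to verify.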


This week's Challenge certainly taught me a lot.  Now I know when I can use the Google Docs OCR tool, and when NOT to use it.  I also now know how to use Acrobat's OCR feature to convert a scanned PDF of any length.  Handy tools.  






13 comments:

  1. Dan, just a quick clarification regarding Online OCR capacity/capability — it handled the full 24 pages of your linked pdf with no problem when I used it… after registering.
    (and the output in MS WORD docx found all 7, no problem. As Teri mentioned, the thing with Adobe Acrobat Pro is access/$$)
    100MB

    Replies
    1. Thanks Remmij I didn't know about 100MB when registering. It is great to know.

      I worked Challenge with Online Convert. It is a great site to change different formats and no registration needed. With this, I just found 6 even when converting document to docx.

      Remmij, what apps/extensions you suggest in Drive?

    2. the Google Drive conversion capacity may be more nuanced, depending on file type… (found on Wikipedia's - GDrive page):

      Search
      Search results can be narrowed by file type, ownership, visibility, and the open-with app. Google Drive supports Boolean operators.[39]

      Using Google Goggles and Optical Character Recognition (OCR) technology, users can search for images by describing or naming what is in them. For example, a search for "mountain" returns all the photos of mountains as well as any text documents about mountains. Text in the first 100 pages of text documents and text-based PDFs, and in the first 10 pages of image-based PDFs can be searched.[34]* Text in images and PDFs can be extracted using OCR.

      - *[34]


      Ramón - sorry, but I can't make any suggestions for Google Drive because I don't use it or know much about the service… on a future to-do list…
      seems the more I find, the less I know and unless I'm using something on a regular basis it changes/evolves just enough to be unusable (without a new learning curve)
      the next time I visit a site or app… off-putting when I'm just trying to get a task done quickly… an ongoing conundrum. õ¿õ¬
      might be related, it's a Monday - do_nuts

      j tU - is the free Acrobat Pro trial a one time thing? I saw that when I looked at the Adobe site, but didn't want to burn the offer on the sRs?…
      the whole subscription thing with Adobe products seems to preclude the occasional user imho…
      Acrobat Pro DC example:
      Subscription -
      US$14.99/mo
      Requires annual commitment (paid monthly)
      Desktop -
      Acrobat DC only
      US$449.00

    3. Remmij - I suspect there are some errors here in the Wikipedia page. I'll try to get the details on this (and, if possible, fix up the page).

      To answer your question of JtU, yes, the Acrobat Pro trial is a one-time deal. The cost is pretty high, but that's their business model.

  2. Acrobat Pro is a Free trial for 30 days so no $$ problem, I have done it with no hassles. [about a year ago]

    Lots of good stuff in this latest project

    Thanks

    jon tU

  3. It's worth reading up on Regular Expressions for this. My day job often ends up trawling through very large text files for specific text, which can often be broken up across newlines and many other meta-characters. I tend to use Sublime Text (http://www.sublimetext.com/) to either manipulate the text in to a searchable format (e.g. find+replace '\n' with ' ') or to search across multiple lines ('multiple([\n\s]*)documents')

    There's no built-in OCR support (though there may be in other text editors)

    Replies
    1. At some future point I'll do a Challenge that needs regular expressions--they're incredibly useful for this kind of thing. (Although you'd still have to know about the newline \n issue.)

  4. I find another free online ocr to convert image to text, it supports 40+ languages, and can save converted text to editable txt file and searchable pdf document.

  5. I think this can help - www.searchscans.com. It can convert the scanned documents into the searchable content. Though its in beta but good enough to start. Free limit is much more than what google drive provides.

  6. There might be no safety points because it doesn't use some other software to run. You're simply required to put in the PDF Converter in your system, browse and supply the file that must be transformed and click on on 'Convert'. It's simple to make use of and extremely helpful as in comparison with on-line conversion. If you want to learn more about this topic please visit https://2pdf.com

  7. Thanks a lot for this post, it's been really helpful to me. I've been googling a lot about searching through scanned PDFs lately, and so far this is the most helpful piece of advice I've stumbled upon!

    The thing with knowing the limits of my tools hit really close to home because none of them seem to cut it for me. I have tons of scanned PDFs, and they're stored mostly on OneDrive (switching to 1Tb on Google Drive isn't an option yet, it's not up to me). So searching through all of that is a real challenge - and I have to do it a lot lately.

    After a lot of research, I've narrowed down my requirements to having a tool that would have OCR, full-text search and would allow me to connect my OneDrive. (Bonus points for permissions and e2e encryption.)

    The only thing I found that could be what fits this description is this startup called ExploriFile (https://explorifile.com/), but they haven't launched yet. I signed up for updates and I'm waiting when they make the free trial available (well, at least I got a promo code for signing up).

    Any other suggestions for something similar?
