Copyright Detection of Copied Paper Documents

Developed a system that automatically manages copyrighted printed materials by monitoring photocopiers to detect when pages from copyrighted publications are copied. The goal is to reimburse publishers for copying done in schools in countries without a "fair use" doctrine. The approach achieves precision and recall above 98% on collections of millions of page images. The customer responded well to demonstrations of the system, but a pilot deployment was delayed a couple of times when reorganizations stalled negotiations, and it is currently delayed by COVID-19.

Technical challenges. Started with an adaptation of our Embedded Media Marker approach, which turned out to be unsuitable for more than a few thousand images. Kept the same SIFT-like image feature descriptors but quantized them into visual words for bag-of-words matching. Visual words produced very good results but required compressing the inverted index so that it fit in memory for millions of pages, each containing thousands of keypoints. The hierarchical k-means clustering of feature vectors that determines the visual words also needed memory-saving techniques: feature vectors were never copied, only swapped in place (2.5 billion 40-byte vectors, about 100 GB, for 1 million images). Geometric verification by RANSAC affine-transform fitting proved insufficient because pages can match in patches without matching as a whole. Developed a grid-based, uniform-distribution approach for the geometric verification of whole pages.
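The compression technique used for the inverted index is not specified here. One common option, shown purely as an illustrative sketch, is to store each visual word's sorted page IDs as gaps and to variable-byte encode those gaps. The class and method names below are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical posting-list codec (an assumption, not the system's actual format):
 * delta-encodes ascending page IDs, then variable-byte encodes the gaps.
 */
final class PostingListCodec {
    private PostingListCodec() {}

    /** Encodes ascending, non-negative page IDs into a compact byte array. */
    static byte[] encode(int[] sortedPageIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int id : sortedPageIds) {
            int gap = id - previous; // gaps stay small when a word occurs on many pages
            if (gap < 0) {
                throw new IllegalArgumentException("page IDs must be ascending and non-negative");
            }
            previous = id;
            // Emit 7 bits per byte, low bits first; the high bit means "more bytes follow".
            while (gap >= 0x80) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    /** Reverses encode(): rebuilds the page IDs by summing the decoded gaps. */
    static int[] decode(byte[] bytes) {
        List<Integer> ids = new ArrayList<>();
        int previous = 0;
        int gap = 0;
        int shift = 0;
        for (byte b : bytes) {
            gap |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7;          // continuation byte: keep accumulating this gap
            } else {
                previous += gap;     // final byte of this gap: emit the page ID
                ids.add(previous);
                gap = 0;
                shift = 0;
            }
        }
        return ids.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

With this scheme, four page IDs such as 3, 130, 131, and 100000 occupy 6 bytes instead of the 16 bytes needed for four plain 32-bit integers.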

Technologies. C++, Java (for server), MongoDB, DynamoDB, AWS Lambda, SIFT-like features, hierarchical k-means clustering, visual words.

Match grid
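The idea behind a grid-based check can be illustrated with a small sketch. It assumes that a whole-page match should have matched keypoints spread across most cells of a grid laid over the page, whereas a patch-only match fills just a few cells. The grid size, the coverage threshold, and all names are assumptions for illustration and do not describe the deployed algorithm.

```java
/**
 * Hypothetical whole-page check (illustrative assumption): matched keypoints
 * must be distributed over the page grid rather than clustered in one patch.
 */
final class MatchGrid {
    private MatchGrid() {}

    /**
     * Returns the fraction of grid cells containing at least one matched keypoint.
     * xs and ys hold matched keypoint coordinates normalized to [0, 1) of the
     * page width and height.
     */
    static double coverage(double[] xs, double[] ys, int rows, int cols) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("xs and ys must have the same length");
        }
        boolean[] occupied = new boolean[rows * cols];
        int occupiedCells = 0;
        for (int i = 0; i < xs.length; i++) {
            // Clamp so that coordinates on or outside the border fall into edge cells.
            int col = Math.min(cols - 1, Math.max(0, (int) (xs[i] * cols)));
            int row = Math.min(rows - 1, Math.max(0, (int) (ys[i] * rows)));
            int cell = row * cols + col;
            if (!occupied[cell]) {
                occupied[cell] = true;
                occupiedCells++;
            }
        }
        return (double) occupiedCells / (rows * cols);
    }

    /** Accepts the match only if enough of the page grid is covered. */
    static boolean isWholePageMatch(double[] xs, double[] ys,
                                    int rows, int cols, double minCoverage) {
        return coverage(xs, ys, rows, cols) >= minCoverage;
    }
}
```

A cluster of matches in one corner of a 4 x 4 grid yields a coverage of 1/16 and is rejected under any reasonable threshold, even though an affine transform could be fitted to those matches.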

Document Linking with Embedded Media Markers

Designed and developed a system that links paper to digital media. Faint marks on the paper indicate the presence of a link. The link is resolved by an image search: keypoint features from a patch of the paper are matched against a database of links to digital content. This work formed the basis of a Fuji Xerox product.

Technical challenges. Developed a high-performance Java web server that used the Java Native Interface (JNI) to detect keypoints in a submitted image and to search an approximate nearest neighbor (ANN) tree. Created an Android app that used video capture to frequently check whether the captured image had a match in the database. Designed and developed an authoring tool that indicated to the user whether a region on a page was unsuitable for linking because its content was insufficient or even blank.
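A region check of the kind the authoring tool performs could look roughly like the following sketch. It assumes that suitability is judged by the number of detected keypoints inside the candidate rectangle. That criterion, the threshold, and the names are hypothetical.

```java
/**
 * Hypothetical suitability check for a link region (illustrative assumption):
 * a region needs enough keypoints inside it to be found again by image search.
 */
final class LinkRegionCheck {
    enum Suitability { BLANK, INSUFFICIENT, SUITABLE }

    private LinkRegionCheck() {}

    /**
     * Classifies the rectangle [left, right) x [top, bottom) by the number of
     * keypoints it contains. Each keypoint is an {x, y} pair in page coordinates.
     */
    static Suitability classify(double[][] keypoints,
                                double left, double top, double right, double bottom,
                                int minKeypoints) {
        int inside = 0;
        for (double[] p : keypoints) {
            if (p[0] >= left && p[0] < right && p[1] >= top && p[1] < bottom) {
                inside++;
            }
        }
        if (inside == 0) {
            return Suitability.BLANK;           // nothing to match against
        }
        return inside < minKeypoints
                ? Suitability.INSUFFICIENT      // some content, but too little
                : Suitability.SUITABLE;
    }
}
```

An authoring UI could call `classify` while the user drags a selection rectangle and color the rectangle according to the result.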

Technologies. Android, Java (for app and server), JavaScript, MySQL, JNI, C++, SIFT-like features.

Document Browsing and Searching

Created a system that ingested a large collection of office documents and made them browsable and searchable via a web interface. Documents were converted to PDF to extract page thumbnails and text. Text phrases were detected and used to summarize documents.

Technical challenges. Created a pipeline that opened documents with OpenOffice to save them as PDF. Used Ghostscript to extract page images that were scaled down to different sizes for a zoomable UI. Summarized directory trees by showing popup collages of thumbnails of documents best matching a search or having interesting properties. Developed a zoomable UI that adapted document boxes and thumbnail sizes. Offered a document clip UI that showed parts of document pages matching a search.
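Scaling page images to several sizes for a zoomable UI can be sketched as a simple image pyramid. The example below assumes that each level halves the previous one until a minimum width is reached. The halving scheme, the minimum width, and the class name are assumptions, since the actual sizes are not stated.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical thumbnail pyramid (illustrative assumption): repeatedly halves a
 * page image so that a zoomable UI can pick the level closest to the display size.
 */
final class ThumbnailPyramid {
    private ThumbnailPyramid() {}

    /** Returns successively halved copies of the page, largest first, down to minWidth. */
    static List<BufferedImage> build(BufferedImage page, int minWidth) {
        List<BufferedImage> levels = new ArrayList<>();
        BufferedImage current = page;
        while (current.getWidth() / 2 >= minWidth && current.getHeight() / 2 >= 1) {
            int w = current.getWidth() / 2;
            int h = current.getHeight() / 2;
            BufferedImage half = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
            Graphics2D g = half.createGraphics();
            try {
                // Bilinear interpolation keeps small text legible when halving.
                g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                        RenderingHints.VALUE_INTERPOLATION_BILINEAR);
                g.drawImage(current, 0, 0, w, h, null);
            } finally {
                g.dispose();
            }
            levels.add(half);
            current = half; // halve the previous level rather than the original
        }
        return levels;
    }
}
```

For an 800 x 1000 page image and a minimum width of 100, this yields three levels of 400 x 500, 200 x 250, and 100 x 125 pixels.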

Technologies. Java, JavaScript, PDF, OpenOffice, Ghostscript.