Scalable Comparison for Large Number of Documents
I have a large number of documents, each 500 to 2000 words, that I would like to link to each other based on semantic similarity. The number of documents will increase rapidly, so a process of comparing each document with all other documents will not scale.
As a solution I plan to retrieve a retina representation (Fingerprint) for each document from:
Then find the 50 documents with the largest intersection of the values in the Fingerprint, then retrieve a representation of similarity with those 50 documents from:
Is this the most efficient strategy?
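In rough Python, that intersection step might look something like this (just a sketch; it assumes each Fingerprint has already been retrieved and stored locally as a set of positions, keyed by document id):

```python
def top_overlapping(query_fp, stored_fps, top_n=50):
    """Return the top_n document ids whose Fingerprints share the most positions with query_fp.

    query_fp   -- set of positions for the comparison document
    stored_fps -- dict mapping document id -> set of positions
    """
    scored = [(doc_id, len(query_fp & fp)) for doc_id, fp in stored_fps.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```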
Hi Gary. The requested functionality is not part of the API. We include this functionality in customized products that we develop for clients. You can, as you mentioned, use the bulk comparison call, but if you want you can also write code to implement cosine similarity, or another vector similarity measure, locally so that you do not have to make all the REST calls. If you are coding in Python, here is an example of such an implementation using scikit-learn:
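A minimal sketch along those lines (it assumes each fingerprint is the list of active SDR positions returned for a text, and a 128 x 128 retina, i.e. 16,384 positions; adjust RETINA_SIZE to the retina you are using):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

RETINA_SIZE = 128 * 128  # assumed retina size; adjust to your retina


def to_dense(positions, size=RETINA_SIZE):
    """Convert a fingerprint (list of active positions) to a dense binary vector."""
    vec = np.zeros(size, dtype=np.float32)
    vec[positions] = 1.0
    return vec


def rank_by_cosine(query_positions, fingerprints, top_n=50):
    """Rank locally stored fingerprints (dict: document id -> list of positions) against a query."""
    doc_ids = list(fingerprints.keys())
    matrix = np.vstack([to_dense(fingerprints[d]) for d in doc_ids])
    query_vec = to_dense(query_positions).reshape(1, -1)
    scores = cosine_similarity(query_vec, matrix)[0]
    order = np.argsort(scores)[::-1][:top_n]
    return [(doc_ids[i], float(scores[i])) for i in order]
```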
Hope this helps. David, Technical Team.
Thanks Gary. A developer is looking into your question. I hope to get back to you with an answer tomorrow. Cheers. David, Technical Team.
Gary Leydon commented
I'm looking to do the same sort of thing.
1. Generate fingerprints of all the documents in our curriculum management system and store them in a database.
2. User enters a query ---> convert this to a fingerprint.
3. This is where I'm stuck. Calling the /compare endpoint with the query fingerprint and every document fingerprint in the database? Even using the bulk endpoint, this doesn't seem correct or efficient.
Is there a use case or documentation you could point me to?
What you suggest makes a lot of sense; in fact, we plan to offer this kind of functionality in the next (coming soon) release of our Retina Spark API.
If you want to be kept informed about this, please subscribe to our newsletter (http://www.cortical.io/blog/) or keep an eye on our Retina Spark API documentation, which will be updated once it is available (http://documentation.cortical.io/retina-spark-api.html).
We do not offer REST access to the Spark API, but we do offer access via custom in-house installations or by hosting it for you ourselves.
Peter Boot commented
Will, I have had some success using a "Jaccard similarity" (https://en.wikipedia.org/wiki/Jaccard_index) metric to compare Fingerprints.
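In case it helps, a minimal sketch (assuming a fingerprint is just the list of active positions):

```python
def jaccard_similarity(fp_a, fp_b):
    """Jaccard index between two fingerprints given as lists of active positions."""
    a, b = set(fp_a), set(fp_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```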
Will Bryant commented
I am looking into doing the same thing as Peter, but I am having trouble wrapping my head around how these documents could be indexed and retrieved in an efficient manner. Essentially, the end goal would be to have something that functions like a Lucene search engine, but backed by fingerprint representations.
It certainly is a good idea to retrieve the fingerprint representation of each text only once and then store it locally.
Finding the 50 documents with the largest intersection with your comparison document still requires a search over n times n documents, unless you have another idea for that. Maybe you can go a bit deeper into the reasoning behind this approach?
Personally, I would retrieve the fingerprints for each document to be linked/compared, store them locally, and then either build a position-based index, where each position points to every other document containing that position (so the linkage is also built incrementally), or use the stored fingerprints and the new document's fingerprint to run the comparison against those. Building a local index would be my first choice, though.
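For illustration, a rough sketch of such a position-based index in Python (hypothetical; it assumes each fingerprint is a list of integer positions):

```python
from collections import Counter, defaultdict


class PositionIndex:
    """Inverted index mapping each retina position to the documents that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)  # position -> set of document ids
        self.fingerprints = {}            # document id -> set of positions

    def add(self, doc_id, positions):
        """Index a document's fingerprint; the linkage grows incrementally as documents are added."""
        position_set = set(positions)
        self.fingerprints[doc_id] = position_set
        for pos in position_set:
            self.postings[pos].add(doc_id)

    def query(self, positions, top_n=50):
        """Return the top_n documents sharing the most positions with the given fingerprint."""
        overlaps = Counter()
        for pos in set(positions):
            for doc_id in self.postings.get(pos, ()):
                overlaps[doc_id] += 1
        return overlaps.most_common(top_n)
```

This way, a new document only touches the postings lists of its own positions instead of being compared against every stored fingerprint.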
Let us know if you require further information or assistance.
Have a nice day,