Scalable Comparison for a Large Number of Documents

I have a large number of documents, of 500 to 2000 words each, that I would like to link to each other based on semantic similarity. The number of documents will grow rapidly, so a process that compares each document with every other document will not scale.

As a solution I plan to retrieve a retina representation (Fingerprint) for each document from:
/text?retina_name=en_associative

Then find the 50 documents whose Fingerprints have the largest intersection of values with it, and retrieve a similarity measure against those 50 documents from:
/compare?retina_name=en_associative
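
For concreteness, here is a rough sketch of the intersection step I have in mind, assuming each Fingerprint is stored locally as a set of active positions (the names are only illustrative):

    def top_candidates(query_fp, stored_fps, k=50):
        # stored_fps maps document id -> set of active positions
        overlaps = sorted(
            stored_fps.items(),
            key=lambda item: len(query_fp & item[1]),
            reverse=True,
        )
        return [doc_id for doc_id, _ in overlaps[:k]]

    fingerprints = {"doc1": {1, 5, 9}, "doc2": {2, 5, 8}}  # toy data
    print(top_candidates({1, 5, 200}, fingerprints, k=1))  # -> ['doc1']

This still scans every stored Fingerprint per query, which is the part I am worried will not scale.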

Is this the most efficient strategy?

    Peter Boot shared this idea

    7 comments

      • Technical Team (Admin, Cortical.io) commented

        Hi Gary. The requested functionality is not part of the API; we include it in customized products that we develop for clients. You can, as you mentioned, use the bulk comparison call, but you can also implement cosine similarity, or another vector similarity measure, locally so that you do not have to make all the REST calls. If you are coding in Python, scikit-learn provides such an implementation:
        http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
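
        As a minimal sketch (not part of the API; it assumes fingerprints are stored locally as lists of active positions, and a retina of 128 x 128 = 16,384 positions, so adjust RETINA_SIZE to your retina):

            import numpy as np
            from scipy.sparse import csr_matrix
            from sklearn.metrics.pairwise import cosine_similarity

            RETINA_SIZE = 128 * 128  # assumed retina size

            def to_sparse_matrix(fingerprints):
                # Build a sparse binary document-by-position matrix from
                # fingerprints given as lists of active positions.
                rows, cols = [], []
                for i, positions in enumerate(fingerprints):
                    rows.extend([i] * len(positions))
                    cols.extend(positions)
                data = np.ones(len(rows), dtype=np.int8)
                return csr_matrix((data, (rows, cols)),
                                  shape=(len(fingerprints), RETINA_SIZE))

            documents = [[1, 5, 9], [2, 5, 8], [1, 9, 200]]  # toy data
            stored = to_sparse_matrix(documents)             # n_docs x RETINA_SIZE
            query = to_sparse_matrix([[1, 5, 200]])          # 1 x RETINA_SIZE
            scores = cosine_similarity(query, stored)[0]     # one score per document

        Computed this way, one query against all stored fingerprints is a single sparse matrix operation rather than one REST call per document.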
        Hope this helps. David, Technical Team.

      • Gary Leydon commented

        I'm looking to do the same sort of thing.
        1. Generate fingerprints of all the documents in our curriculum management system and store them in a database.
        2. User enters a query ---> convert this to a fingerprint
        3. This is where I'm stuck. Calling the /compare endpoint with the query fingerprint and every document fingerprint in the database? Even using the bulk endpoint, this doesn't seem correct or efficient.
        Is there a use case or documentation you could point me to?

      • Technical Team (Admin, Cortical.io) commented

        Hi Will,

        What you suggest is very true; in fact, we plan to offer this kind of functionality in the next (coming soon) release of our Retina Spark API.

        If you want to be kept informed about this, please subscribe to our newsletter (http://www.cortical.io/blog/) or keep an eye on our Retina Spark API documentation, which will be updated once it is available (http://documentation.cortical.io/retina-spark-api.html).

        We do not offer REST access to the Spark API, but we do offer access via custom in-house installations or by hosting it for you.

        Best regards,
        Technical team

      • Will Bryant commented

        I am looking into doing the same thing as Peter, but I am having trouble wrapping my head around how these documents could be indexed and retrieved efficiently. Essentially, the end goal would be something that functions like a Lucene search engine, but backed by fingerprint representations.

      • Technical Team (Admin, Cortical.io) commented

        Hello Peter,

        It certainly is a good idea to retrieve the fingerprint representation of each text only once and then store it locally.

        However, finding the 50 documents with the largest intersection with your comparison document still requires on the order of n times n comparisons, unless you have another idea for that step. Maybe you can go a bit deeper into the reasoning behind this approach?

        Personally, I would retrieve the fingerprints of each document to be linked/compared and store them locally. Then either build a position-based index, in which each position points to every document containing that position (so the linkage is also built incrementally), or use the stored fingerprints together with the new document's fingerprint to run the comparison locally. Creating a local index would be my first choice, though.
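
        As an illustration only, here is a rough sketch of such a position-based index (assuming fingerprints are stored as lists of active positions; all names are just for illustration):

            from collections import defaultdict, Counter

            index = defaultdict(set)  # position -> ids of documents containing it

            def add_document(doc_id, fingerprint):
                # Index a new document's fingerprint (a list of active positions).
                for position in fingerprint:
                    index[position].add(doc_id)

            def most_overlapping(fingerprint, k=50):
                # Count shared positions, touching only documents that share
                # at least one position with the given fingerprint.
                counts = Counter()
                for position in fingerprint:
                    counts.update(index.get(position, ()))
                return counts.most_common(k)

            add_document("doc1", [1, 5, 9])  # toy data
            add_document("doc2", [2, 5, 8])
            print(most_overlapping([1, 5, 200], k=1))  # -> [('doc1', 2)]

        With such an index, each new document only touches the documents it actually shares positions with, instead of being compared against the whole collection.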

        Let us know if you require further information or assistance.

        Have a nice day,

        Tech. Support
