C code to index a large text library and find similar sentences
$250-750 USD
Paid on delivery
I need a mini-app (Compiled C on Linux) that groups similar sentences together.
I have 100,000 sentences (say in a PostgreSQL DB, Unicode text). It must perform VERY fast, by mapping each root word to a 16-bit integer (which reduces its memory footprint), then building a new data structure with sentence delimiters and sentence lengths. Group the sentences into buckets of similar sentence length.
Then iterate through the buckets doing word-by-word comparisons (16-bit comparisons).
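The indexing step could be sketched roughly as below. This is only an illustration under assumptions that are not in the brief: words are split on ASCII whitespace, no stemming to root words is done, and the names `word_dict`, `word_id` and `encode_sentence` are made up for the example. IDs run from 1 to 65535, with 0 reserved for "no ID".

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 131072u  /* power of two, larger than 65536, so probing always finds a free slot */
#define MAX_WORDS  65535u   /* IDs 1..65535; 0 means "no ID" */

/* Hypothetical dictionary: open-addressing hash table from word to 16-bit ID.
 * It is large (over 1 MB), so allocate it with calloc, not on the stack. */
typedef struct {
    char    *keys[TABLE_SIZE];  /* owned, NUL-terminated copies of the words */
    uint16_t ids[TABLE_SIZE];   /* ID stored next to each key */
    uint32_t count;             /* distinct words seen so far */
} word_dict;

/* FNV-1a hash over the bytes of the word. */
static uint32_t hash_word(const char *s, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/* Return the ID of a word, assigning the next free ID on first sight.
 * Returns 0 if the vocabulary is full or memory runs out. */
uint16_t word_id(word_dict *d, const char *w, size_t len) {
    uint32_t slot = hash_word(w, len) & (TABLE_SIZE - 1);
    while (d->keys[slot] != NULL) {
        const char *k = d->keys[slot];
        if (strncmp(k, w, len) == 0 && k[len] == '\0')
            return d->ids[slot];
        slot = (slot + 1) & (TABLE_SIZE - 1);  /* linear probing */
    }
    if (d->count >= MAX_WORDS) return 0;
    char *copy = malloc(len + 1);
    if (copy == NULL) return 0;
    memcpy(copy, w, len);
    copy[len] = '\0';
    d->keys[slot] = copy;
    d->ids[slot] = (uint16_t)++d->count;
    return d->ids[slot];
}

/* Split a sentence on ASCII whitespace and write one ID per word into out.
 * Returns the number of IDs written (at most max_out). */
size_t encode_sentence(word_dict *d, const char *s, uint16_t *out, size_t max_out) {
    size_t n = 0;
    assert(d != NULL && s != NULL && out != NULL);
    while (n < max_out) {
        s += strspn(s, " \t\r\n");             /* skip leading whitespace */
        size_t len = strcspn(s, " \t\r\n");    /* length of the next word */
        if (len == 0) break;
        out[n++] = word_id(d, s, len);
        s += len;
    }
    return n;
}
```

The count returned by `encode_sentence` is the sentence length in words, so it can double as the bucket key when grouping sentences of similar length.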
Two algorithms are acceptable:
1. Simple - Take a source sentence and iterate through it, XORing word by word (irrespective of word order or word frequency). If more than x words are left outstanding, it is NOT a similar sentence. Here x would be 25% of the total number of words.
We leave such a large gap so that we don't need to worry about word roots.
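One way to read the simple filter above is as a set comparison on the 16-bit word IDs. The sketch below is an interpretation, not a spec: it sorts and de-duplicates both ID arrays (so order and frequency are ignored), counts the IDs that appear in only one sentence, and takes "total words" to mean the combined number of distinct words in both sentences. The names `words_similar` and `MAX_SENTENCE_WORDS` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SENTENCE_WORDS 256  /* assumed upper bound on words per sentence */

static int cmp_u16(const void *a, const void *b) {
    uint16_t x = *(const uint16_t *)a, y = *(const uint16_t *)b;
    return (x > y) - (x < y);
}

/* Copy ids into out, sort them and drop duplicates. Returns the distinct count. */
static size_t sorted_unique(const uint16_t *ids, size_t n, uint16_t *out) {
    size_t m = 0;
    memcpy(out, ids, n * sizeof *ids);
    qsort(out, n, sizeof *out, cmp_u16);
    for (size_t i = 0; i < n; i++)
        if (m == 0 || out[m - 1] != out[i])
            out[m++] = out[i];
    return m;
}

/* Returns 1 if the two sentences pass the coarse 25% filter, 0 otherwise. */
int words_similar(const uint16_t *a, size_t na, const uint16_t *b, size_t nb) {
    uint16_t sa[MAX_SENTENCE_WORDS], sb[MAX_SENTENCE_WORDS];
    size_t i = 0, j = 0, unmatched = 0;
    assert(na <= MAX_SENTENCE_WORDS && nb <= MAX_SENTENCE_WORDS);
    size_t ua = sorted_unique(a, na, sa);
    size_t ub = sorted_unique(b, nb, sb);
    /* Merge walk over both sorted sets: equal IDs cancel, the rest are outstanding. */
    while (i < ua && j < ub) {
        if (sa[i] == sb[j])     { i++; j++; }
        else if (sa[i] < sb[j]) { i++; unmatched++; }
        else                    { j++; unmatched++; }
    }
    unmatched += (ua - i) + (ub - j);
    /* Similar when the outstanding words are at most 25% of the combined total. */
    return unmatched * 4 <= ua + ub;
}
```

Because sentences are already bucketed by length, each call only touches short arrays, and a pair that fails here never reaches the more expensive character-level comparison.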
From the smaller data set, we then proceed to do a classic Levenshtein comparison, but with an upper bound of x deviation: once it detects more than, say, 10% deviation, it exits that comparison. Here it is a character-by-character comparison.
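A bounded Levenshtein comparison along these lines might look like the sketch below. It is an illustration only: it compares bytes rather than Unicode code points (an assumption, so multi-byte UTF-8 characters count as several edits), and the name `bounded_levenshtein` is invented for the example. It uses the standard two-row dynamic program and stops as soon as every entry in the current row exceeds the bound.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Returns the edit distance between a and b if it is <= max_dist,
 * otherwise max_dist + 1. Returns (size_t)-1 on allocation failure.
 * Callers should keep max_dist well below SIZE_MAX - 1. */
size_t bounded_levenshtein(const char *a, const char *b, size_t max_dist) {
    size_t la = strlen(a), lb = strlen(b);
    size_t diff = la > lb ? la - lb : lb - la;
    if (diff > max_dist) return max_dist + 1;  /* length gap alone exceeds the bound */

    size_t *prev = malloc((lb + 1) * sizeof *prev);
    size_t *curr = malloc((lb + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL) { free(prev); free(curr); return (size_t)-1; }

    for (size_t j = 0; j <= lb; j++) prev[j] = j;  /* distance from the empty prefix */

    size_t result = max_dist + 1;
    int aborted = 0;
    for (size_t i = 1; i <= la; i++) {
        curr[0] = i;
        size_t row_min = curr[0];
        for (size_t j = 1; j <= lb; j++) {
            size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            size_t best = prev[j - 1] + cost;                   /* substitute or match */
            if (prev[j] + 1 < best) best = prev[j] + 1;         /* delete from a */
            if (curr[j - 1] + 1 < best) best = curr[j - 1] + 1; /* insert into a */
            curr[j] = best;
            if (best < row_min) row_min = best;
        }
        /* Row minima never decrease, so the final distance cannot drop below this. */
        if (row_min > max_dist) { aborted = 1; break; }
        size_t *tmp = prev; prev = curr; curr = tmp;
    }
    if (!aborted && prev[lb] <= max_dist) result = prev[lb];
    free(prev);
    free(curr);
    return result;
}
```

For the 10% bound mentioned above, a caller could pass something like `strlen(a) / 10` as `max_dist` and treat any return value above that as "not similar".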
The app should read from a folder of .gz files that contain the text, and it could use a text boundary to distinguish each sentence.
The output would need to be a new text file that sorts every sentence into groups of similar sentences, with groups separated by a text boundary.
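Once a .gz file has been decompressed into memory (for example with zlib, which is not shown here), splitting on the text boundary could be sketched as follows. The function name `split_sentences`, the callback type and the boundary string used in the usage note are all assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: receives each sentence as a (start, length) slice. */
typedef void (*sentence_cb)(const char *start, size_t len, void *user);

/* Walk a NUL-terminated buffer and report every non-empty piece found
 * between occurrences of boundary. cb may be NULL to just count.
 * Returns the number of sentences found. */
size_t split_sentences(const char *text, const char *boundary,
                       sentence_cb cb, void *user) {
    size_t blen = strlen(boundary), count = 0;
    const char *p = text;
    if (blen == 0) return 0;  /* an empty boundary would never advance */
    for (;;) {
        const char *hit = strstr(p, boundary);
        size_t len = hit ? (size_t)(hit - p) : strlen(p);
        if (len > 0) {
            if (cb != NULL) cb(p, len, user);
            count++;
        }
        if (hit == NULL) break;
        p = hit + blen;  /* continue after the boundary */
    }
    return count;
}
```

For instance, `split_sentences(buf, "\n---\n", on_sentence, &state)` would call a user-supplied `on_sentence` once per sentence, assuming the files use `\n---\n` as the separator.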
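The brief does not say how groups are formed from pairwise matches. One possible approach, shown here purely as an assumption, is a union-find (disjoint-set) structure over sentence indices: every pair that passes both comparisons is merged, and sentences sharing a representative are written out together. The `groups_*` names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Start with every sentence in its own group: parent[i] == i. */
void groups_init(uint32_t *parent, size_t n) {
    for (size_t i = 0; i < n; i++)
        parent[i] = (uint32_t)i;
}

/* Find the representative of the group containing i, shortening
 * the chain along the way (path halving). */
uint32_t groups_find(uint32_t *parent, uint32_t i) {
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

/* Merge the groups of sentences a and b, e.g. after they were found similar. */
void groups_union(uint32_t *parent, uint32_t a, uint32_t b) {
    uint32_t ra = groups_find(parent, a);
    uint32_t rb = groups_find(parent, b);
    if (ra != rb)
        parent[rb] = ra;
}
```

Note that this makes similarity transitive: if A matches B and B matches C, all three land in one group even when A and C were never found similar directly. Whether that is acceptable depends on how strict the groups need to be.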
I need something in 36 hours. A mediocre algorithm is fine.
Project ID: #17551738
About the project
9 freelancers are bidding an average of $372 for this job
Hello, I'm a C developer with 6+ years of experience and a mathematician with a number of publications. I'm also a participant in and problem writer for many algorithm competitions (Topcoder, ACM ICPC, etc). Just 2 weeks ...
Hi, I'm free, so I can do this type of job quickly. Since you have 36 hours for the job, let's not waste time and get started.
Hello, I am an experienced algorithm designer and would really like to work on your project. I appreciate how detailed your project description is and have understood every aspect of it. Award me the project and I w...
Hi, I have 4 years of experience in C/C++ development in a Linux environment. Looking forward to your response to discuss further. Regards, Akram