C code to index a large text library and find similar sentences

Closed · Posted 5 years ago · Paid on delivery

I need a mini-app (Compiled C on Linux) that groups similar sentences together.

I have 100,000 sentences (say, in a PostgreSQL DB, Unicode text). It must perform VERY fast: index each root word to a 16-bit integer (which reduces the memory footprint), then re-create a new data structure with sentence delimiters and sentence lengths. Group the sentences into buckets of similar length.
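A minimal sketch of the word-to-16-bit-ID step, assuming the words have already been split out and reduced to their root form (tokenizing and stemming are not shown). The names `Vocab`, `vocab_intern`, and `NO_ID`, the FNV-1a hash, and the table size are illustrative choices, not part of the requirement.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define VOCAB_MAX  65535u     /* IDs 0..65534; 0xFFFF is reserved as "no ID" */
#define TABLE_SIZE 131072u    /* power of two, larger than VOCAB_MAX */
#define NO_ID      0xFFFFu

/* Open-addressing hash table from word to 16-bit ID (hypothetical layout). */
typedef struct {
    char    *words[TABLE_SIZE];   /* owned copies of the keys; NULL = empty slot */
    uint16_t ids[TABLE_SIZE];
    uint32_t count;               /* number of distinct words seen so far */
} Vocab;

/* FNV-1a hash over the bytes of a NUL-terminated string. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Returns the ID of word, assigning the next free ID on first sight.
 * Returns NO_ID if the vocabulary is full or memory runs out. */
uint16_t vocab_intern(Vocab *v, const char *word)
{
    assert(v != NULL && word != NULL);
    uint32_t i = fnv1a(word) & (TABLE_SIZE - 1);
    while (v->words[i] != NULL) {               /* linear probing */
        if (strcmp(v->words[i], word) == 0)
            return v->ids[i];
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    if (v->count >= VOCAB_MAX)
        return NO_ID;
    size_t len = strlen(word) + 1;
    char *copy = malloc(len);
    if (copy == NULL)
        return NO_ID;
    memcpy(copy, word, len);
    v->words[i] = copy;
    v->ids[i] = (uint16_t)v->count;
    return (uint16_t)v->count++;
}
```

The table is large, so it should be heap-allocated and zeroed, for example with `calloc(1, sizeof(Vocab))`. A 16-bit ID caps the vocabulary at 65,535 distinct root words; whether 100,000 sentences stay under that limit depends on the data.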

Then iterate through the buckets doing word-by-word comparisons (16-bit comparisons).

Two algorithms are acceptable:

1. Simple: take a source sentence and iterate through it, XORing word by word against the candidate (irrespective of word order or word frequency). If more than x words are left unmatched, it is NOT a similar sentence. Here x would be 25% of the total number of words.

We leave such a large gap so that we don't need to worry about word roots.
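The coarse filter above could be sketched as follows. This is one reading of the requirement, not a definitive one: each sentence is reduced to a sorted, duplicate-free array of 16-bit word IDs, "outstanding" words are counted as IDs present in only one of the two sentences, and the 25% limit is taken against the longer of the two (after removing duplicates). The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u16(const void *a, const void *b)
{
    uint16_t x = *(const uint16_t *)a, y = *(const uint16_t *)b;
    return (x > y) - (x < y);
}

/* Sorts ids in place and drops duplicates; returns the new length.
 * Done once per sentence, so word order and frequency no longer matter. */
size_t sort_unique(uint16_t *ids, size_t n)
{
    if (n == 0)
        return 0;
    qsort(ids, n, sizeof ids[0], cmp_u16);
    size_t out = 1;
    for (size_t i = 1; i < n; i++)
        if (ids[i] != ids[out - 1])
            ids[out++] = ids[i];
    return out;
}

/* Counts IDs that appear in exactly one of two sorted, duplicate-free arrays. */
size_t unmatched_words(const uint16_t *a, size_t na,
                       const uint16_t *b, size_t nb)
{
    size_t i = 0, j = 0, diff = 0;
    while (i < na && j < nb) {
        if (a[i] == b[j])     { i++; j++; }
        else if (a[i] < b[j]) { i++; diff++; }
        else                  { j++; diff++; }
    }
    return diff + (na - i) + (nb - j);
}

/* Nonzero if the pair passes the coarse filter: unmatched words must not
 * exceed 25% of the longer sentence (assumed interpretation of "x"). */
int maybe_similar(const uint16_t *a, size_t na,
                  const uint16_t *b, size_t nb)
{
    size_t longer = na > nb ? na : nb;
    return unmatched_words(a, na, b, nb) * 4 <= longer;
}
```

Because the arrays are sorted, each comparison is a single linear merge over 16-bit values, which keeps the inner loop cheap before the more expensive character-level pass.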

From the resulting smaller data set, we then do a classic Levenshtein comparison, but with an upper bound of x deviation: once it detects more than, say, 10% deviation, it exits that comparison. Here it is a character-by-character comparison.
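A sketch of a Levenshtein comparison with an early exit, under a few assumptions: strings are compared byte by byte (so a multi-byte UTF-8 character counts as several edits), the 10% bound is taken against the longer string, and the names `bounded_levenshtein` and `within_ten_percent` are made up for illustration.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Returns the edit distance between a and b, or max_dist + 1 as soon as the
 * distance is known to exceed max_dist. Returns SIZE_MAX if allocation fails. */
size_t bounded_levenshtein(const char *a, const char *b, size_t max_dist)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t gap = la > lb ? la - lb : lb - la;
    if (gap > max_dist)               /* the length gap alone is already too big */
        return max_dist + 1;

    size_t *prev = malloc((lb + 1) * sizeof *prev);
    size_t *curr = malloc((lb + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return SIZE_MAX;
    }
    for (size_t j = 0; j <= lb; j++)
        prev[j] = j;

    for (size_t i = 1; i <= la; i++) {
        curr[0] = i;
        size_t row_min = i;
        for (size_t j = 1; j <= lb; j++) {
            size_t best = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            if (prev[j] + 1 < best)     best = prev[j] + 1;      /* deletion  */
            if (curr[j - 1] + 1 < best) best = curr[j - 1] + 1;  /* insertion */
            curr[j] = best;
            if (best < row_min)
                row_min = best;
        }
        /* Row minima never decrease, so the final distance cannot recover. */
        if (row_min > max_dist) {
            free(prev);
            free(curr);
            return max_dist + 1;
        }
        size_t *tmp = prev; prev = curr; curr = tmp;
    }
    size_t d = prev[lb];
    free(prev);
    free(curr);
    return d > max_dist ? max_dist + 1 : d;
}

/* Nonzero if a and b differ by at most 10% of the longer string's length. */
int within_ten_percent(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t max_dist = (la > lb ? la : lb) / 10;
    return bounded_levenshtein(a, b, max_dist) <= max_dist;
}
```

For true character-level comparison on Unicode text, the strings would first need decoding into code points; that step is left out here.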

The app should read from a folder of .gz files that contain the text, and it could use a text boundary to distinguish each sentence.

The output would need to be a new text file that sorts every sentence into groups of similar sentences, with the groups separated by a text boundary.
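One possible way to produce that grouped output is a union-find (disjoint-set) structure: every pair that passes the similarity checks is merged, and sentences are then written out group by group. Union-find is an assumption here, since the description does not say how groups are formed, and all names below (`groups_init`, `group_union`, `write_groups`, the `boundary` string) are illustrative.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* parent[i] == i marks sentence i as the representative of its group. */
void groups_init(size_t *parent, size_t n)
{
    for (size_t i = 0; i < n; i++)
        parent[i] = i;
}

/* Finds the representative of i, shortening the path as it goes. */
size_t group_find(size_t *parent, size_t i)
{
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

/* Merges the groups containing a and b (called for each similar pair). */
void group_union(size_t *parent, size_t a, size_t b)
{
    size_t ra = group_find(parent, a), rb = group_find(parent, b);
    if (ra != rb)
        parent[rb] = ra;
}

typedef struct { size_t root, idx; } Member;

static int cmp_member(const void *a, const void *b)
{
    const Member *x = a, *y = b;
    if (x->root != y->root)
        return (x->root > y->root) - (x->root < y->root);
    return (x->idx > y->idx) - (x->idx < y->idx);
}

/* Writes one sentence per line, with a boundary line between groups.
 * Returns 0 on success, -1 if memory runs out. */
int write_groups(FILE *out, const char *const *sentences,
                 size_t *parent, size_t n, const char *boundary)
{
    if (n == 0)
        return 0;
    Member *m = malloc(n * sizeof *m);
    if (m == NULL)
        return -1;
    for (size_t i = 0; i < n; i++) {
        m[i].root = group_find(parent, i);
        m[i].idx = i;
    }
    qsort(m, n, sizeof *m, cmp_member);   /* members of a group become adjacent */
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && m[i].root != m[i - 1].root)
            fprintf(out, "%s\n", boundary);
        fprintf(out, "%s\n", sentences[m[i].idx]);
    }
    free(m);
    return 0;
}
```

Note that merging is transitive: if A is similar to B and B to C, all three land in one group even when A and C would not pass the checks directly.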

I need something in 36 hours. A mediocre algorithm is fine.

C Programming, C# Programming, C++ Programming, Linux, Python

Project ID: #17551738

About the project

9 proposals · Remote project · Active 5 years ago

9 freelancers are bidding an average of $372 for this job

hbxfnzwpf

You can trust my expertise, I can finish in time, thanks a lot! I am very proficient in C and C++. I have 16 years of C++ development experience now, and have worked for more than 7 years. My work is online game developin...

$300 USD in 1 day
(202 reviews)
7.3
jjmutumi

Hello, I have more than 6 years of experience writing software with Python. I can make a very fast, maintainable script for this in Cython if you are interested. Consider that: 1 - The main slowdown is from cache...

$250 USD in 0 days
(8 reviews)
4.7
dstepanenko

Hello, I'm a C developer with 6+ years of experience and a mathematician with a number of publications. I'm also a participant in and problem writer for many algorithm competitions (Topcoder, ACM ICPC, etc). Just 2 weeks...

$300 USD in 1 day
(33 reviews)
6.9
MzHashmi

Hi, I'm free, so I can do this type of job quickly. As you have 36 hours for the job, let's not waste time and get it started.

$555 USD in 10 days
(8 reviews)
4.1
codingedward

Hello, I am an experienced algorithm designer and would really like to work on your project. I appreciate how detailed your project description is and have understood every aspect of it. Award me the project and I w...

$250 USD in 3 days
(11 reviews)
3.2
Anpera

I have expertise in C/C++. My plan to solve this: 1) You give me an example of the dataset. 2) I do rapid prototyping in Python and show you the approximate result of the algorithm's execution and timing. 3) If you like i...

$333 USD in 2 days
(1 review)
2.6
ansarias21

Hi, I have 4 years of experience in C/C++ development in a Linux environment. Looking forward to your response to discuss further. Regards, Akram

$250 USD in 1 day
(5 reviews)
0.8
humrobo

Hi, hope you are doing well, sir. I read your message below, and I assure you that I can help you build a mini-app (Compiled C on Linux) that groups similar sentences together, as well as possible, as per your given requir...

$555 USD in 10 days
(1 review)
1.6
itsparx

Dear Prospective Hiring Manager, thank you for giving me a chance to bid on your project. I am a serious bidder here, I have already worked on a similar project before, and I can deliver as you have mentioned. I have...

$555 USD in 10 days
(0 reviews)
0.6