Here at the Miskatonic Machine Lab, we are embarking on a large-scale project involving semantic analysis of blog posts. To build our data set, we need to collect a large number of blog posts on various topics, and we require a PHP script to do this. The script will need to:
1. Scrape Google Blog Search ([login to view URL]) results (sorted by date) for a specified keyword and get the URLs of the first 100 results.
2. Visit each search result and extract the post title, date published, and post body. It will need to successfully scrape many different blog platforms: it can either try to determine the post title, date, and body intelligently, or use hardcoded scraper settings for the 20-30 most popular blog platforms indexed by Google Blog Search.
3. For each keyword, the 100 scraped posts should be stored in a MySQL database and then combined into a single RSS feed, so that all 100 posts for a particular keyword are in one feed. This feed should be accessible via HTTP in the format:
[login to view URL]
It should support keywords both with and without quotes. For performance, once a feed for a keyword is generated, it should be cached.
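To illustrate the feed and caching requirement, here is a minimal sketch, not a required design. The function names, the shape of the post array, the file-based cache, and the feed link passed in are all assumptions made for the example; the posts would come from the MySQL database in the real script. Quotes are kept in the cache key so that a quoted and an unquoted keyword produce separate feeds.

```php
<?php
// Hypothetical helpers for illustration only; names and cache layout are assumptions.

/** Cache key that keeps quoted and unquoted keywords distinct. */
function feedCacheKey(string $keyword): string
{
    // Collapse whitespace and lowercase. Quotes are preserved on purpose.
    $normalized = preg_replace('/\s+/', ' ', trim($keyword)) ?? trim($keyword);
    return sha1(strtolower($normalized));
}

/** Append <name>text</name> to a parent node, escaping the text safely. */
function appendText(DOMDocument $doc, DOMNode $parent, string $name, string $text): void
{
    $el = $doc->createElement($name);
    $el->appendChild($doc->createTextNode($text));
    $parent->appendChild($el);
}

/**
 * Build an RSS 2.0 document. Each post is assumed to look like
 * ['title' => ..., 'url' => ..., 'timestamp' => ..., 'body' => ...].
 */
function buildRss(string $keyword, string $feedUrl, array $posts): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $rss = $doc->createElement('rss');
    $rss->setAttribute('version', '2.0');
    $doc->appendChild($rss);

    $channel = $doc->createElement('channel');
    $rss->appendChild($channel);
    appendText($doc, $channel, 'title', 'Blog posts for ' . $keyword);
    appendText($doc, $channel, 'link', $feedUrl);
    appendText($doc, $channel, 'description', 'Scraped posts matching ' . $keyword);

    foreach ($posts as $post) {
        $item = $doc->createElement('item');
        $channel->appendChild($item);
        appendText($doc, $item, 'title', $post['title']);
        appendText($doc, $item, 'link', $post['url']);
        appendText($doc, $item, 'pubDate', gmdate(DATE_RSS, $post['timestamp']));
        appendText($doc, $item, 'description', $post['body']);
    }
    return (string) $doc->saveXML();
}

/** Return the cached feed if present; otherwise build it once and store it. */
function getCachedFeed(string $keyword, string $cacheDir, callable $build): string
{
    $path = rtrim($cacheDir, '/') . '/' . feedCacheKey($keyword) . '.xml';
    if (is_file($path)) {
        $cached = file_get_contents($path);
        if ($cached !== false) {
            return $cached;
        }
    }
    $xml = $build($keyword);
    file_put_contents($path, $xml, LOCK_EX);
    return $xml;
}
```

With this layout, deleting a keyword from the admin interface would presumably also need to remove the matching cache file, so that a stale feed is not served afterwards.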
The script should have a very simple admin interface where I can enter a keyword to generate a new feed, view already generated feeds, and delete keywords (and the posts for that keyword) from the database.
Because this script will be collecting large amounts of data, it needs to be as fast and efficient as possible.
You can test your script with the keyword "my cat died" (including the quotes). For that keyword, you would be scraping the following search results URL:
[login to view URL];ie=UTF-8&scoring=d&q=%22my+cat+died%22&btnG=Search+Blogs
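As a rough sketch of how the query portion of that URL could be produced for any keyword, the snippet below uses PHP's built-in http_build_query. The function name is hypothetical, and the base search URL is deliberately left out because it is not shown above. The parameter names and values are copied from the example URL; scoring=d is presumably what requests date-sorted results.

```php
<?php
// Hypothetical helper for illustration; only the query string is built here.
function buildSearchQuery(string $keyword): string
{
    // PHP_QUERY_RFC1738 encodes spaces as "+" and double quotes as "%22",
    // which matches the encoding seen in the example URL.
    return http_build_query(
        [
            'ie'      => 'UTF-8',
            'scoring' => 'd',
            'q'       => $keyword,
            'btnG'    => 'Search Blogs',
        ],
        '',
        '&',
        PHP_QUERY_RFC1738
    );
}

echo buildSearchQuery('"my cat died"'), "\n";
// prints: ie=UTF-8&scoring=d&q=%22my+cat+died%22&btnG=Search+Blogs
```

Passing the keyword without surrounding quotes yields q=my+cat+died, which covers the unquoted case with the same function.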
This data gatherer is critical to our research. Satisfactory completion of this project will lead to us assigning you much larger and more involved projects. We have a $300,000 federal grant to spend on this research, and once we find a quality programmer who can quickly complete a small project like this, we can assign more complex (and expensive!) work.
Thanks in advance for your bids,
Prof. John Gainsworth
Machine Lab
Miskatonic University