Develop a web crawler in Python

Completed, posted 4 years ago, paid on delivery

We are academics at Imperial College London and New York University conducting research on startup companies and their legal policies on the web.

The task is to write a web crawler in Python 3 that retrieves historical information about company websites from the Wayback Machine ([login to view URL]).

We will also need a [login to view URL] file that shows any Python dependencies, and a Jupyter demo notebook that runs the crawler on a sample of inputs.

There will be more work available for a freelancer who completes this task to a high standard.

*** DESCRIPTION OF DESIRED CODE ***

The crawler should have two high-level functions.

FUNCTION 1: find_snapshots

Please use one of the Wayback APIs for this function if possible ([login to view URL]).

Input:

A website (e.g., [login to view URL])

A list of two dates in YYYYMMDD format (e.g., [20150202, 20150802])

Output:

A dictionary of links to all snapshots available on the Wayback Machine between the given dates (return an empty dictionary if no snapshots are available).

Example output:

out_dict = {'20150202' : '[login to view URL]://[login to view URL]'}
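A minimal sketch of how find_snapshots might look. The API link in this posting is hidden, so the sketch assumes the Wayback CDX server endpoint at web.archive.org/cdx/search/cdx is an acceptable choice. The helper names build_cdx_query and rows_to_snapshots are hypothetical. Keys are the full 14-digit capture timestamps so that several snapshots taken on the same day are all kept; truncate them to 8 digits if YYYYMMDD keys as in the example above are required.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the CDX server is the Wayback API to use (the posting's link is hidden).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(website, dates):
    """Build a CDX query URL for captures of `website` between two dates."""
    start, end = (str(d) for d in dates)
    params = {
        "url": website,
        "from": start,
        "to": end,
        "output": "json",
        "fl": "timestamp,original",   # only the fields needed to build links
        "filter": "statuscode:200",   # skip captures that were themselves errors
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def rows_to_snapshots(rows):
    """Convert CDX JSON rows (header row first) into {timestamp: snapshot_link}."""
    snapshots = {}
    for timestamp, original in rows[1:]:
        snapshots[timestamp] = f"https://web.archive.org/web/{timestamp}/{original}"
    return snapshots


def find_snapshots(website, dates, timeout=30):
    """Return snapshot links in the date range, or an empty dict if none exist."""
    with urlopen(build_cdx_query(website, dates), timeout=timeout) as response:
        body = response.read().decode("utf-8")
    rows = json.loads(body) if body.strip() else []
    return rows_to_snapshots(rows)
```

Splitting the parsing from the network call keeps the dictionary-building logic testable without contacting the archive.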

FUNCTION 2: get_snapshot_info

Input:

A link to a snapshot (e.g., a value of the dictionary returned by find_snapshots)

A keyword (e.g., 'privacy')

Output:

A dictionary with information about the snapshot, obtained as follows

a) Visit the snapshot link and download the homepage HTML code (discard garbage such as 404 error pages). It is particularly important that this step handles the redirects that sometimes occur on the Wayback Machine (HTTP 302 responses and pop-ups are common) and does not return garbage in those cases.

b) Extract all hyperlinks from the homepage HTML (use of BeautifulSoup preferred). Find all hyperlinks containing the keyword EITHER in their text OR in the URL. Example of such a hyperlink: [login to view URL]://[login to view URL]

c) Visit all hyperlinks containing the keyword and download their HTML code (discard garbage such as 404 errors)
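Step b) can be sketched as follows. The posting prefers BeautifulSoup; this sketch swaps in the standard library's html.parser purely to stay dependency-free, and the names _AnchorCollector and find_keyword_links are hypothetical. Matching is case-insensitive, which is an assumption rather than a stated requirement.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_keyword_links(html, base_url, keyword):
    """Return absolute links whose URL or visible text contains `keyword`."""
    collector = _AnchorCollector()
    collector.feed(html)
    needle = keyword.lower()
    links = []
    for href, text in collector.anchors:
        if needle in href.lower() or needle in text.lower():
            absolute = urljoin(base_url, href)
            if absolute not in links:  # keep first occurrence, preserve order
                links.append(absolute)
    return links
```

For example, a link to /about whose text is "Our Privacy Notice" is matched through its text, while a link to /privacy-policy labelled "Legal" is matched through its URL.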

Example output:

out_dict = {'homepage_download': True, # boolean flag for whether the download in step a) was successful
    'homepage_html': string, # string containing the HTML code of the homepage downloaded in step a)
    'keyword_links': ['xxx', 'yyy', 'zzz'], # list of links found to contain the keyword in step b) (return an empty list if none found)
    'keyword_download': [True, False, True], # list of boolean flags for whether each download in step c) was successful
    'keyword_html': [string, string, string]} # list of strings containing the HTML code of the keyword pages downloaded in step c)
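A rough sketch of how the output dictionary could be assembled. The garbage check here (non-200 status or an empty body) is a deliberately minimal assumption; a real crawler would need stronger checks for Wayback error pages and pop-ups. The find_links parameter is a hypothetical hook for whatever step b) implementation is used, and fetch_html relies on urlopen following HTTP redirects such as 302 on its own.

```python
import http.client
from urllib.request import Request, urlopen


def fetch_html(url, timeout=30):
    """Download a page and return (success, html); failures return (False, '')."""
    try:
        request = Request(url, headers={"User-Agent": "snapshot-crawler-sketch"})
        # urlopen follows HTTP redirects automatically and raises on 4xx/5xx.
        with urlopen(request, timeout=timeout) as response:
            status = response.status
            html = response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        return False, ""
    # Assumed minimal garbage check: anything but a non-empty 200 response is discarded.
    if status != 200 or not html.strip():
        return False, ""
    return True, html


def get_snapshot_info(snapshot_url, keyword, find_links, fetch=fetch_html):
    """Build the output dictionary for one snapshot link and keyword."""
    ok, html = fetch(snapshot_url)
    info = {
        "homepage_download": ok,
        "homepage_html": html,
        "keyword_links": [],
        "keyword_download": [],
        "keyword_html": [],
    }
    if not ok:
        return info
    for link in find_links(html, snapshot_url, keyword):
        link_ok, link_html = fetch(link)
        info["keyword_links"].append(link)
        info["keyword_download"].append(link_ok)
        info["keyword_html"].append(link_html)
    return info
```

Passing the fetch function as a parameter makes it possible to exercise the assembly logic with canned pages before running against the live archive.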

*** EVALUATION AND PROJECT COMPLETION ***

DEFINITION OF SUCCESS:

For a given set of inputs (website, dates, keyword), we define success in two stages:

The crawler is successful in stage 1 if homepage_download = True for at least one snapshot found in the given date range (and if the HTML is not garbage)

The crawler is successful in stage 2 if keyword_download = True for at least one subpage of a snapshot (and if the HTML is not garbage)
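The two stage definitions above can be expressed as a small check over the dictionaries returned by get_snapshot_info. This sketch assumes the download flags are already False for garbage HTML, and the function names stage_success and success_rates are hypothetical.

```python
def stage_success(snapshot_infos):
    """Return (stage1, stage2) for one input, given the info dict of each snapshot."""
    stage1 = any(info["homepage_download"] for info in snapshot_infos)
    stage2 = any(any(info["keyword_download"]) for info in snapshot_infos)
    return stage1, stage2


def success_rates(results_by_input):
    """Return the fraction of inputs that succeed at stage 1 and at stage 2."""
    total = len(results_by_input)
    if total == 0:
        return 0.0, 0.0
    outcomes = [stage_success(infos) for infos in results_by_input]
    stage1_rate = sum(1 for s1, _ in outcomes if s1) / total
    stage2_rate = sum(1 for _, s2 in outcomes if s2) / total
    return stage1_rate, stage2_rate
```

An input with no snapshots at all counts as a failure at both stages, since there is nothing for any() to find.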

EVALUATION OF FREELANCER OUTPUT:

We will give you a list of 500 inputs (website, dates, keyword) for development. We will test your results on a list of 500 different inputs.

Human trials have the following success rates with keyword = 'privacy':

- Stage 1 success: 60% if the website is a startup company in the given date range, and 90% if it is a mature company.

- Stage 2 success: 50% if the website is a startup company in the given date range, and 80% if it is a mature company.

For completion of the project, you must achieve the following:

- success rates are reasonably close to human trials

- source code is clean and well documented

- Jupyter demo notebook is clean and well documented

- [login to view URL] allows code to be run without errors

Skills: Python, Web Scraping, Software Architecture, PHP, Data Mining

Project ID: #20270382

About the project

43 proposals, remote project, active 4 years ago

Awarded to:

seaanddream

Hi, my name is Selim. I am from Solihull, UK. I read your `Develop a web crawler in Python` project description carefully before bidding. I checked the target URL and your requirements as well... I got what you need…

£750 GBP in 10 days
(356 reviews)
8.8

43 freelancers are bidding an average of £526 for this job

helmot

Hello. I have worked with Wayback before and can show you a demo and code if you are interested. It's in Python 2.7, but it's not hard to switch to Python 3. Thanks, Helmot

£500 GBP in 7 days
(240 reviews)
8.3

zhangyingtai

Hello sir, thanks for your detailed job description. I fully understand the job description and am very clear about the task. I have 9 years of experience in web scraping and am suitable for th…

£555 GBP in 5 days
(129 reviews)
7.6

mananraja

Hey, I have read what you need and checked the website you mentioned. I can write a Python scraper script to get this done. I will also fulfill the 4 requirements you listed at the end of your description. I have…

£250 GBP in 2 days
(374 reviews)
7.5

Guptapuru304

Sir/Ma'am, I am a senior Python developer and have been working for 3 years now. I have done similar work previously for Amazon, Instagram, AliExpress, etc., and can deliver it to you in less than a day. I can also help…

£250 GBP in 2 days
(83 reviews)
7.6

p4logics

Dear Sir, I am interested in your project. I'm a senior Core Java, J2EE, JavaFX, Spring Boot, Spring JPA, Hibernate, and Angular developer. I'm also an expert in web scraping using Java Selenium, jsoup, and Python. I assure…

£500 GBP in 7 days
(91 reviews)
7.4

C3guru

Hello. I am a talented web scraping solution developer. In particular, I've mastered Selenium and Scrapy with Python. You can see from my profile that I have finished a lot of scraping jobs. I've just reviewed your requirements and…

£500 GBP in 7 days
(51 reviews)
7.2

zeke

I have written many web crawlers. This is my favorite type of job. I am absolutely confident I can finish this project to your satisfaction and on time. Available to start immediately and finish as soon as possible. Looking f…

£500 GBP in 7 days
(211 reviews)
7.6

alexwmsoft

100% Completion Rate and 5 Stars. Dear employer, my name is Lee. I am an experienced web developer and web scraping expert. I have good experience in web scraping using PHP, Python, Java, and so on. I read your job…

£500 GBP in 7 days
(44 reviews)
6.6

kunitsynartem

Hello! I'm interested in building your project for parsing historical snapshots using Python. I'm ready to write a script that will do both stages of work described in your project. Please just provide me with sample inp…

£500 GBP in 7 days
(47 reviews)
6.4

rajorshi1001

Hi, I am an experienced Python developer and I can complete the task using a Jupyter notebook. I use BeautifulSoup for all my scraping projects. I will use the Wayback API for the first function and Selenium for the second o…

£300 GBP in 7 days
(63 reviews)
6.1

farooq4161

Greetings, I am an experienced professional scraper and have done similar projects in the past. This can be verified from my profile. Allow me to assist you with your requirements. Thanks

£750 GBP in 8 days
(75 reviews)
6.4

esolzpk

Hi, I have gone through the requirements in detail and I have a few questions. I specialize in website design and development and am excited for the opportunity to work with you in accomplishing your goals. We h…

£555 GBP in 6 days
(26 reviews)
6.1

smsaurabhv

Hi, I have gone through your requirement to scrape lots of websites. I am an EXPERT in building scraping tools/scripts. Hence, I can SURELY work on your project. I have 4 YEARS of EXPERIENCE in developing PHP-PYTHO…

£250 GBP in 3 days
(130 reviews)
6.2

damilareisaac

Hi there, I have read through the project description. I can help you complete the project using Python scripting. I look forward to hearing from you. Please contact me by PM for details.

£500 GBP in 7 days
(55 reviews)
6.2

maryumakhter5

Hey there! Python crawler, scrape: I can do any sort of data mining or web scraping that you need done in a reasonable amount of time. I have years of Python, HTML, CSS and JavaScript experience under my belt a…

£500 GBP in 10 days
(36 reviews)
5.7

BestService222

Hi, I've carefully gone through your job post. I have more than 4 years of experience in Python development. I am very much interested in your project with all of your requirements. I feel very confident about your project a…

£750 GBP in 10 days
(39 reviews)
5.7

naishodayo

Hi, dear. I am a senior software developer. I am very familiar with web scraping. I have just checked your project description, and I am able to complete this project. I am looking forward to your response. Thanks.

£500 GBP in 7 days
(4 reviews)
4.8

arajdhar

Hello, from the given description, it seems that you want to perform the following activities as part of developing the web crawler: 1. Use the Wayback Availability JSON API to determine whether a snapshot for a give…

£250 GBP in 7 days
(8 reviews)
4.8

friendzsoft

Hi, I have experience in web scraping using Python 2 and 3. Please contact me for more details. Thanks,

£277 GBP in 5 days
(6 reviews)
4.2

DrPeixoto

Greetings, friend. I already have your code written and working according to your specifications. Please send me a message and I will immediately send it to you. You can also send me a sample of the inputs lis…

£250 GBP in 1 day
(15 reviews)
4.2