
Questions tagged [web-crawling]

Web crawling is typically done to build and maintain a web index. The crawled data is compared with previous versions of the same dataset to detect changes. Web scraping is similar to web crawling, except that the bot (crawler) also extracts data from the pages it visits.

-1 votes
1 answer
91 views

Types of data great for web scraping

I am a professional web crawler and love scraping sites for potential data. However, I have recently hit a brick wall finding openly sourced websites with unique data that can be scraped. So I would ...
Working dollar's user avatar
0 votes
1 answer
210 views

Catalog of all websites on the entire internet

Needs: For a project, I am trying to run analysis on a representative sample of all websites on the entire internet. I'm trying to do things like measure market penetration of certain web technologies,...
John's user avatar
  • 3
5 votes
2 answers
1k views

Searching for a specific file name across websites on Wayback Machine

Is there any way to find all occurrences of some file on the entire internet, as captured by the Internet Archive? I want to wildcard-search for a filename. Of course, this doesn't make sense with ...
phil294's user avatar
  • 170
1 vote
1 answer
734 views

How do I get historical data from coinmarketcap.com?

If you scroll down on this page: https://coinmarketcap.com/historical/20150503/ There is something that says: Total Market Cap: $3,863,780,096 Notice the date is given in the URL. How can I loop ...
user23089's user avatar
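
Since the question notes that the date is given in the URL, looping comes down to generating one URL per date. Here is a minimal sketch, assuming the `YYYYMMDD` path segment seen in the question's URL; the weekly step and the end date are my own assumptions, not something the excerpt states, and fetching or parsing each page is left out.

```python
from datetime import date, timedelta

# URL pattern taken from the question; the date is formatted as YYYYMMDD.
BASE_URL = "https://coinmarketcap.com/historical/{:%Y%m%d}/"


def snapshot_urls(start, end, step_days=7):
    """Yield one URL per `step_days` from `start` to `end`, inclusive.

    `step_days=7` is an assumption; adjust it to whatever dates the site
    actually publishes.
    """
    current = start
    while current <= end:
        yield BASE_URL.format(current)
        current += timedelta(days=step_days)


# Hypothetical date range, starting at the date from the question.
urls = list(snapshot_urls(date(2015, 5, 3), date(2015, 5, 24)))
for url in urls:
    print(url)
```

Each URL can then be passed to whatever HTTP client and HTML parser you prefer to pull out the "Total Market Cap" value.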
0 votes
3 answers
475 views

Online Dictionary for Scraping

Does anybody know if there is an extensive online dictionary other than Wiktionary that lists many words on a single page, as opposed to having to search for a single word? None of the online ...
oldboy's user avatar
  • 121
2 votes
0 answers
2k views

Facebook Mypersonality dataset

I want to have access to a database for Facebook Photos dictionary that contains 17.2 million records (photo_id, owner_id, created_time). At this time, the owner of the dataset has decided to stop sharing ...
Krebto's user avatar
  • 91
1 vote
1 answer
226 views

The list of restricted second-level domains such as .co.uk and .ac.jp

Some country-code top-level domains (ccTLD) contain general-purpose second-level domains, like .co and .ac, so the end user of the domain is identified with the third-level domain, such as oxford.ac....
Anton Tarasenko's user avatar
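
To illustrate why such a list matters, here is a rough sketch of how a crawler might use it to find the registrable part of a hostname. The `SAMPLE_SUFFIXES` set is a tiny hand-picked sample for illustration only, and `registrable_domain` is a hypothetical helper; a real project would more likely load a maintained list such as the Public Suffix List.

```python
# Tiny illustrative sample, NOT a complete list of restricted second-level domains.
SAMPLE_SUFFIXES = {"co.uk", "ac.uk", "ac.jp", "co.jp"}


def registrable_domain(hostname, suffixes=SAMPLE_SUFFIXES):
    """Return the part of `hostname` that identifies the end user.

    If the last two labels form a known second-level suffix (e.g. "ac.jp"),
    keep three labels; otherwise keep two.
    """
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in suffixes:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


print(registrable_domain("www.example.ac.jp"))  # example.ac.jp
print(registrable_domain("www.example.com"))    # example.com
```

Without the suffix list, the first hostname would be reduced to `ac.jp`, which lumps every academic site in Japan into one "domain".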
1 vote
1 answer
1k views

Extracting data in tabular form from camelcamelcamel

Can anyone give me pointers on how I can extract price and sales rank data from camelcamelcamel.com? They display only a chart showing the evolution of prices/sales rank over time, but I'm trying to ...
Shrabastee Banerjee's user avatar
1 vote
0 answers
33 views

Are there any automated techniques one can use to gather data online for a dataset?

I am a developer myself and would like to use the latest technologies to build some open data datasets. I would like to know if you are aware of any techniques or algorithms one can use to automate gather ...
Mathematics's user avatar
2 votes
2 answers
684 views

Places to get large (volume wise) datasets reasonably well formatted and free to share?

I've been struggling with this problem for almost 2 hours now :( I need some datasets that are somewhere around 50 - 5000 GB large when uncompressed to showcase and test various data storage and ...
George's user avatar
  • 121
2 votes
1 answer
50 views

What tool can I use to manage a researcher network bibliography?

I am making a website for a researcher network using Jekyll and hosting it on github pages. Each researcher (250+ people) should have his own document with an up to date bibliography. I could ...
Billybobbonnet's user avatar
1 vote
3 answers
490 views

How do football APIs get their data?

I am wondering how sport APIs get their data. Do they use web scraping or web crawling? If they do: is it legal to create my web scraper to gather data? I checked various sites and they all lead to ...
Antonio Erdeljac's user avatar
16 votes
2 answers
611 views

The "right to mine" and scraping the web in EU

Sorry for the somewhat general question, to which the answers necessarily depend on the country of operation and are necessarily also subject to becoming obsolete at some point. But I think it to be quite ...
puslet88's user avatar
  • 261
1 vote
0 answers
113 views

Scraping product image from eCommerce websites [closed]

What are the legal clauses associated with scraping images of products displayed on eCommerce websites (such as Home Depot, Walmart, Amazon, eBay etc)? The intention is to obtain meta-data from the ...
Jugesh Sundram's user avatar
5 votes
3 answers
8k views

How to search archive.org for PDF files on a captured website between some date range

I am trying to search health insurance companies' website content on http://archive.org/ during the period of 2005-2008 for legal documents related to pre-existing medical conditions. Most of these ...
Anthony Damico's user avatar
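
One possible approach is the Wayback Machine's CDX server, which lists captures for a site and can be filtered by date and MIME type. The sketch below only builds the query URL and makes no network request; the parameter names (`url`, `matchType`, `from`, `to`, `filter`, `output`) reflect my understanding of the CDX API and should be checked against its documentation, and the domain is a made-up placeholder.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_pdf_query(domain, year_from, year_to):
    """Build a CDX query URL for PDF captures of `domain` in a year range."""
    params = [
        ("url", domain),
        ("matchType", "domain"),                 # include subdomains and all paths
        ("from", str(year_from)),
        ("to", str(year_to)),
        ("filter", "mimetype:application/pdf"),  # keep only PDF captures
        ("output", "json"),
    ]
    return CDX_ENDPOINT + "?" + urlencode(params)


# "example-insurer.com" is a hypothetical placeholder, not a real target.
query_url = cdx_pdf_query("example-insurer.com", 2005, 2008)
print(query_url)
```

Fetching `query_url` should return one row per capture, from which the archived PDF URLs can be assembled.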
