Questions tagged [web-crawling]
Web crawling is typically done to build and maintain a web index. The crawled data is compared with previous versions of the dataset to detect changes. Web scraping is similar to web crawling, except that the bot (crawler) also extracts specific data from the pages it visits.
52 questions
-1
votes
1
answer
91
views
Types of data great for web scraping
I am a professional web crawler and love scraping sites for potential data. However, I have recently hit a brick wall finding openly available websites with unique data that can be scraped.
So I would ...
0
votes
1
answer
210
views
Catalog of all websites on the entire internet
Needs
For a project, I am trying to run analysis on a representative sample of all websites on the entire internet. I'm trying to do things like measure market penetration of certain web technologies,...
5
votes
2
answers
1k
views
Searching for a specific file name across websites on Wayback Machine
Is there any way to find all occurrences of some file on the entire internet, as captured by the Internet Archive? I want to wildcard-search for a filename.
Of course, this doesn't make sense with ...
1
vote
1
answer
734
views
How do I get historical data from coinmarketcap.com?
If you scroll down on this page:
https://coinmarketcap.com/historical/20150503/
There is something that says: Total Market Cap: $3,863,780,096
Notice the date is given in the URL. How can I loop ...
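Since the excerpt notes that the date is embedded in the URL, one way to loop over snapshots is to generate the URLs from a date range. The following is a minimal sketch, not a definitive method: the function name `historical_urls` and the 7-day step are assumptions (the site may publish snapshots on a different schedule), and no request is actually sent.

```python
from datetime import date, timedelta

# URL pattern taken from the question; the date is formatted as YYYYMMDD.
BASE_URL = "https://coinmarketcap.com/historical/{:%Y%m%d}/"


def historical_urls(start, end, step_days=7):
    """Yield one snapshot URL per step from start to end, inclusive.

    step_days=7 is an assumption about how often snapshots exist.
    """
    current = start
    while current <= end:
        yield BASE_URL.format(current)
        current += timedelta(days=step_days)


urls = list(historical_urls(date(2015, 5, 3), date(2015, 5, 24)))
for url in urls:
    print(url)
```

Each generated URL could then be fetched and parsed with whatever HTTP and HTML libraries are preferred, subject to the site's terms of use.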
0
votes
3
answers
475
views
Online Dictionary for Scraping
Does anybody know if there is an extensive online dictionary other than Wiktionary that lists many words on a single page, as opposed to having to search for a single word?
None of the online ...
2
votes
0
answers
2k
views
Facebook Mypersonality dataset
I want access to the Facebook Photos dictionary database, which contains 17.2 million records (photo_id, owner_id, created_time).
At this time, the owner of the dataset has decided to stop sharing ...
1
vote
1
answer
226
views
The list of restricted second-level domains such as .co.uk and .ac.jp
Some country-code top-level domains (ccTLD) contain general-purpose second-level domains, like .co and .ac, so the end user of the domain is identified with the third-level domain, such as oxford.ac....
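To illustrate why such a list matters, here is a small sketch of extracting the registrable domain from a hostname. The suffix set below is a hypothetical, hand-picked subset for demonstration only, and `registrable_domain` is an illustrative name; a real project would more likely load a complete, maintained list such as the Public Suffix List.

```python
# Hypothetical subset of general-purpose second-level domains (illustration only).
MULTI_LABEL_SUFFIXES = {"co.uk", "ac.uk", "co.jp", "ac.jp"}


def registrable_domain(hostname):
    """Return the suffix plus one label to its left, or None if there is no such label."""
    labels = hostname.lower().rstrip(".").split(".")
    # Use a two-label suffix when it is in the known set; otherwise assume a one-label TLD.
    suffix_len = 2 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 1
    if len(labels) <= suffix_len:
        return None  # the hostname is itself a suffix
    return ".".join(labels[-(suffix_len + 1):])


print(registrable_domain("www.example.co.uk"))
print(registrable_domain("www.example.com"))
```

Without the suffix set, a naive "last two labels" rule would report `co.uk` as the site for every `.co.uk` host, which is exactly the problem the question describes.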
1
vote
1
answer
1k
views
Extracting data in tabular form from camelcamelcamel
Can anyone give me pointers on how I can extract price and sales rank data from camelcamelcamel.com? They display only a chart showing the evolution of prices/sales rank over time, but I'm trying to ...
1
vote
0
answers
33
views
Are there any automated techniques one can use to gather data online for a dataset?
I am a developer myself and would like to use the latest technologies to build some open data datasets.
I would like to know if you are aware of any techniques or algorithms one can use to automate gather ...
2
votes
2
answers
684
views
Places to get large (volume wise) datasets reasonably well formatted and free to share?
I've been struggling with this problem for almost 2 hours now :(
I need some datasets that are roughly 50 to 5,000 GB in size when uncompressed to showcase and test various data storage and ...
2
votes
1
answer
50
views
What tool can I use to manage a researcher network bibliography?
I am making a website for a researcher network using Jekyll and hosting it on GitHub Pages.
Each researcher (250+ people) should have their own document with an up-to-date bibliography.
I could ...
1
vote
3
answers
490
views
How do football APIs get their data?
I am wondering how sport APIs get their data.
Do they use web scraping or web crawling?
If they do: is it legal to create my web scraper to gather data?
I checked various sites and they all lead to ...
16
votes
2
answers
611
views
The "right to mine" and scraping the web in EU
Sorry for the somewhat general question, to which the answers necessarily depend on the country of operation and are also bound to become obsolete at some point. But I think it to be quite ...
1
vote
0
answers
113
views
Scraping product image from eCommerce websites [closed]
What are the legal clauses associated with scraping images of products displayed on eCommerce websites (such as Home Depot, Walmart, Amazon, eBay etc)? The intention is to obtain meta-data from the ...
5
votes
3
answers
8k
views
How to search archive.org for PDF files on a captured website within a date range
I am trying to search health insurance companies' website content on
http://archive.org/ during the period of 2005-2008 for legal documents related to pre-existing medical conditions.
Most of these ...
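One possible approach to this kind of search is the Wayback Machine's CDX server, which lists captures and accepts filters on date and MIME type. The sketch below only builds the query URL and does not send it. The domain `example.com` and the function name `build_cdx_query` are placeholders, and the parameter choices reflect my understanding of the CDX API, so they should be checked against its documentation.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain, year_from, year_to, mimetype="application/pdf"):
    """Build a CDX query URL for captures of one MIME type within a year range."""
    params = [
        ("url", domain),
        ("matchType", "domain"),          # include subdomains of the given domain
        ("from", str(year_from)),         # start of the range (a year prefix is accepted)
        ("to", str(year_to)),             # end of the range
        ("filter", "mimetype:" + mimetype),
        ("collapse", "urlkey"),           # keep one row per distinct URL
        ("output", "json"),
    ]
    return CDX_ENDPOINT + "?" + urlencode(params)


# example.com is a placeholder for the insurer's website.
query_url = build_cdx_query("example.com", 2005, 2008)
print(query_url)
```

Fetching the resulting URL should return rows that include a timestamp and the original URL for each capture, which can then be combined into Wayback Machine links.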