Newest 'web-crawling' Questions - Open Data Stack Exchange

-1 votes

1 answer

91 views

Types of data great for webscraping

I am a professional web crawler and love scraping sites for potential data. However, I have recently hit a brick wall finding openly sourced websites with unique data that can be scraped. So I would ...

Working dollar

99

asked Apr 28, 2022 at 10:09

0 votes

1 answer

210 views

Catalog of all websites on the entire internet

Needs: For a project, I am trying to run analysis on a representative sample of all websites on the entire internet. I'm trying to do things like measure market penetration of certain web technologies,...

John

3

asked Nov 17, 2020 at 17:44

5 votes

2 answers

1k views

Searching for a specific file name across websites on Wayback Machine

Is there any way to find all occurrences of some file on the entire internet, as captured by the Internet Archive? I want to wildcard-search for a filename. Of course, this doesn't make sense with ...

phil294

170

asked Oct 18, 2020 at 15:28

1 vote

1 answer

734 views

How do I get historical data from coinmarketcap.com?

If you scroll down on this page, https://coinmarketcap.com/historical/20150503/, there is something that says: Total Market Cap: $3,863,780,096. Notice that the date is given in the URL. How can I loop ...

user23089

11

asked Dec 8, 2019 at 3:22
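
Since the excerpt notes that the date is embedded in the URL, the looping part can be sketched as below. This is a minimal sketch, not a tested scraper: only the URL pattern comes from the excerpt, while the helper name `snapshot_urls`, the 7-day step, and the end date are assumptions chosen for illustration. It only builds the URLs; fetching and parsing each page is left out.

```python
from datetime import date, timedelta

# URL pattern taken from the excerpt; the date is formatted as YYYYMMDD.
BASE_URL = "https://coinmarketcap.com/historical/{:%Y%m%d}/"


def snapshot_urls(start, end, step_days=7):
    """Yield one URL per date from start to end (inclusive).

    step_days=7 is an assumption; adjust it to whatever spacing
    the site actually uses for its historical pages.
    """
    current = start
    while current <= end:
        yield BASE_URL.format(current)
        current += timedelta(days=step_days)


# Hypothetical date range, starting from the date shown in the excerpt.
urls = list(snapshot_urls(date(2015, 5, 3), date(2015, 5, 24)))
for url in urls:
    print(url)
```

Each generated URL could then be passed to an HTTP client of your choice, subject to the site's terms of use.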

0 votes

3 answers

475 views

Online Dictionary for Scraping

Does anybody know if there is an extensive online dictionary other than Wiktionary that lists many words on a single page, as opposed to having to search for a single word? None of the online ...

oldboy

121

asked Jul 10, 2019 at 0:03

2 votes

0 answers

2k views

Facebook myPersonality dataset

I want to have access to a database for Facebook Photos dictionary that contains 17.2 million records (photo_id, owner_id, created_time). At this time, the owner of the dataset has decided to stop sharing ...

Krebto

91

asked Jul 11, 2018 at 16:03

1 vote

1 answer

226 views

The list of restricted second-level domains such as .co.uk and .ac.jp

Some country-code top-level domains (ccTLD) contain general-purpose second-level domains, like .co and .ac, so the end user of the domain is identified with the third-level domain, such as oxford.ac....

Anton Tarasenko

3,661

asked Nov 20, 2017 at 16:47

1 vote

1 answer

1k views

Extracting data in tabular form from camelcamelcamel

Can anyone give me pointers on how I can extract price and sales rank data from camelcamelcamel.com? They display only a chart showing the evolution of prices/sales rank over time, but I'm trying to ...

Shrabastee Banerjee

11

asked Oct 25, 2017 at 15:34

1 vote

0 answers

33 views

Are there any automated techniques one can use to gather data online for a dataset?

I am a developer myself and would like to use the latest technologies to build some open data datasets. I would like to know if you are aware of any techniques or algorithms one can use to automate gather ...

Mathematics

445

asked Jul 31, 2017 at 7:23

2 votes

2 answers

684 views

Places to get large (volume wise) datasets reasonably well formatted and free to share?

I've been struggling with this problem for almost 2 hours now :( I need some datasets that are somewhere around 50 - 5000 GB large when uncompressed to showcase and test various data storage and ...

George

121

asked May 21, 2017 at 15:53

2 votes

1 answer

50 views

What tool can I use to manage a researcher network bibliography?

I am making a website for a researcher network using Jekyll and hosting it on github pages. Each researcher (250+ people) should have his own document with an up to date bibliography. I could ...

Billybobbonnet

133

asked May 9, 2017 at 9:26

1 vote

3 answers

490 views

How do football APIs get their data?

I am wondering how sport APIs get their data. Do they use web scraping or web crawling? If they do: is it legal to create my web scraper to gather data? I checked various sites and they all lead to ...

Antonio Erdeljac

113

asked Apr 16, 2017 at 10:20

16 votes

2 answers

611 views

The "right to mine" and scraping the web in EU

Sorry for the somewhat general question, to which the answers necessarily depend on the country of operation and necessarily are also subject to become obsolete at some point. But I think it to be quite ...

puslet88

261

asked Feb 2, 2017 at 19:29

1 vote

0 answers

113 views

Scraping product image from eCommerce websites [closed]

What are the legal clauses associated with scraping images of products displayed on eCommerce websites (such as Home Depot, Walmart, Amazon, eBay etc)? The intention is to obtain meta-data from the ...

Jugesh Sundram

111

asked Nov 22, 2016 at 12:19

5 votes

3 answers

8k views

How to search archive.org for PDF files on a captured website between some date range

I am trying to search health insurance companies' website content on http://archive.org/ during the period of 2005-2008 for legal documents related to pre-existing medical conditions. Most of these ...

Anthony Damico

1,480

asked Nov 15, 2016 at 9:34
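
One possible approach to this kind of query, sketched under assumptions, is the Wayback Machine CDX server, which accepts `url`, `matchType`, `from`, `to`, `filter`, and `output` parameters. The helper name `pdf_capture_query` and the domain `example.com` are placeholders, not taken from the question. The sketch only builds the query URL and does not send any request.

```python
from urllib.parse import urlencode

# Public CDX endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def pdf_capture_query(domain, year_from, year_to):
    """Build a CDX query URL for PDF captures of a domain in a year range."""
    params = [
        ("url", domain),
        ("matchType", "domain"),                  # include subdomains
        ("from", str(year_from)),                 # start of the date range
        ("to", str(year_to)),                     # end of the date range
        ("filter", "mimetype:application/pdf"),   # keep PDF captures only
        ("output", "json"),
    ]
    return CDX_ENDPOINT + "?" + urlencode(params)


# example.com is a placeholder for an insurer's domain.
query = pdf_capture_query("example.com", 2005, 2008)
print(query)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; results for large domains may be long, so adding a `limit` parameter is worth considering.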
