  • 8
    Why should it matter if we do, when they're feeding the entire network to LLMs live anyway?
    – Kevin B
    Commented Jul 9 at 18:14
  • @KevinB Well, you can't just prohibit people like that. If you allow LLM usage with the data dump but state the CC BY-SA license, it might encourage LLM-savvy people to train free (as in freedom) models. In my opinion, if we provide an easier way to do the thing more ethically, people might be drawn to release their models under CC BY-SA because they realize there's an easier way to train the models than scraping. Of course, big companies probably won't care...
    – John
    Commented Jul 10 at 2:40
  • 2
    @KevinB In this case, it's that I've gone through the 'proper' process, made requests the right way, and am doing so under the same constraints as downloading them one by one. In theory, I (or some future user) would be an independent backup for the data dump, but downloading hundreds of files from hundreds of pages to do things is a bit of a bore. The fact that I've followed the processes laid out and agreed to by the company, and gotten nowhere, is a bit of an annoyance really.
    Commented Jul 10 at 3:32
  • 3
    I think the "quiet part out loud" (maybe not so quiet) piece of this is that increased friction was the primary goal. Improving the experience would be a change of course from why it was made that way in the first place... Make no mistake, I'm firmly with you, and I really hope it changes for the better. Gating the data dump was a flagrant violation of the reasoning for the dumps in the first place. Even in the most optimistic read, where the company added friction to avoid removing them altogether... the friction is very much the point.
    – zcoop98
    Commented Jul 10 at 21:54
  • 3
    Well yes, but that's also why I kept asking from before the official announcement, and was assured by folks it was possible. It's easier for me to do things the 'wrong' way: either grab it off an unofficial upload (which is likely the best option for the backlog) or run existing tools that automate the process (which have issues due to Cloudflare) to grab the full dump. It's as much about keeping assurances made as the actual dumps.
    Commented Jul 11 at 0:40
  • Mutual trust, you say.
    – canon
    Commented Jul 11 at 19:15
  • 3
    Well, I see this as not trusting me not to turn around and upload it somewhere where I can train an LLM. The company isn't living up to the promises it made, even though I'm trying to follow its processes. So yeah, mutual trust. The company has issues trusting its userbase, and we have issues trusting the company to live up to its promises.
    Commented Jul 11 at 23:05