We've been considering for a couple of years now whether to allow LLM companies to scrape our information.
Without naming them, some MC companies have already tried to train models on our DB without consent; one attempt appears to have been what we had previously interpreted as an attempted DDoS attack.
On one hand, we want to make Medical Cannabis information as widely accessible as possible, without barriers. On the other hand, companies are increasingly trying to scrape and interpret our daily updates/data automatically for commercial purposes, while we remain a struggling non-profit without commercial backing, and without the bandwidth to sustain such a constant stream of daily requests.
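If we do decide to restrict automated scraping, the simplest first step would be a robots.txt opt-out. Below is a hypothetical sketch using the user-agent tokens the major AI vendors have published (worth double-checking their current docs). Note that robots.txt is purely advisory; the bad actors we've already seen would simply ignore it, so server-side rate limiting would still be needed on top.

```
# Hypothetical robots.txt: opt out of known AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Everyone else: allowed, with a polite (non-standard but widely honoured) crawl rate
User-agent: *
Crawl-delay: 10
```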
For a long time now we've been considering training an LLM on our own dataset, since we believe we could do a far better job overall. It wouldn't be too hard for us to programmatically feed an AI a complete summary of our main database content along with all forum posts, future reviews, etc. We could even feed in transcriptions of YouTube reviews and their comments. A rough sketch of what that export might look like is below.
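To make that concrete, here's a minimal sketch of flattening our content into a JSONL corpus that either fine-tuning or retrieval-augmented generation could ingest. All table and field names here are hypothetical placeholders, not our actual schema.

```python
import json
import sqlite3

def export_corpus(db_path: str, out_path: str) -> None:
    """Flatten DB content into one JSON record per line (JSONL)."""
    conn = sqlite3.connect(db_path)
    with open(out_path, "w", encoding="utf-8") as out:
        # Main database content: one record per product entry
        for pid, name, summary in conn.execute(
            "SELECT id, name, summary FROM products"
        ):
            out.write(json.dumps(
                {"source": "database", "id": pid, "text": f"{name}: {summary}"}
            ) + "\n")
        # Forum posts; YouTube transcriptions and review comments would be
        # appended the same way, each tagged with its own "source"
        for pid, body in conn.execute("SELECT id, body FROM forum_posts"):
            out.write(json.dumps(
                {"source": "forum", "id": pid, "text": body}
            ) + "\n")
    conn.close()
```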
The monumental problem with us doing this is sheer computational cost, both in AI inference tokens and in API access to third-party platforms. Given the increasingly expensive dedicated hardware required, we would probably be looking at a few thousand a month even if we self-hosted and periodically cached AI summaries/queries.
We're posting these thoughts to garner further public feedback from patients and the industry, because we've been unsure how best to handle the situation for a long time now.
Separately, we'll be posting a big announcement soon about fixing our funding.