Assumed Audience: Anyone with any opinion about “AI,” especially LLMs. Discuss on Hacker News.
Epistemic Status: Satisfied.
So OpenAI recently revealed info about its spider. That information included its bot name (GPTBot) and its user agent string.
But even more importantly, they revealed the IP address blocks that they would use.
So I blocked them all.
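For anyone who wants to picture what blocking by IP address block means, here is a minimal sketch in Python. It is not my actual server configuration. The CIDR ranges are placeholders from the reserved documentation blocks, not OpenAI's published ranges, and the name `is_blocked` is made up for illustration.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT OpenAI's real ranges.
# Substitute the ranges the crawler operator actually publishes.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in BLOCKED_RANGES)


if __name__ == "__main__":
    for ip in ("192.0.2.17", "203.0.113.5"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

In practice this check usually lives in the firewall or the web server's own deny rules rather than in application code, but the logic is the same: match the client address against a list of ranges.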
I also blocked their spider with robots.txt and my server; they’ve already added IP blocks, and this is a good backup.
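The robots.txt side can be sketched too. The two-line rule below is the usual way to turn away a crawler by name; the Python around it only checks, using the standard library's robots.txt parser, that the rule does what it should. The helper name `is_allowed` and the second bot name are hypothetical, and this only matters for spiders that choose to obey robots.txt, which is why the server-side block is the real backup.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that refuses GPTBot everywhere and says nothing about other bots.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://docs.yzena.com/"
    for agent in ("GPTBot", "SomeOtherBot"):  # SomeOtherBot is a made-up name
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```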
Why did I do this?
Because I don’t want my material used for training LLMs.
Especially my personal code and my business code.
However, there’s a catch: one of the sites blocked is https://docs.yzena.com/, which is the documentation for my Yzena software.
Most people who think LLMs are good will probably be stunned; after all, if my documentation could be crawled, GPTx could answer people’s questions about my software for me.
But here’s the problem: it will answer them wrong.
You see, my documentation will be thorough. If it doesn’t answer all user questions, it’s not good enough.
That means there will be a large volume of documentation, with good organization to make that volume searchable.
But still, GPTx should help users, right? Wrong.
Despite the volume of documentation, my documentation would still be just a tiny blip in the amount of information in the LLM, and the LLM will still pull in information from elsewhere to answer questions.
And since my software will be unique, anything outside the documentation is liable to be wrong.
And so I will probably end up answering more questions from people misled about my software than GPTx would save me.
I’ll answer honest questions; I delight in serving users.
But I don’t want to have to dispel wrong notions because of a dumb bot using statistics to pretend to speak.