If you represent the state as a 128-element vector of GF(2) elements, you can model the state transition function as a matrix multiplication.
This allows you to describe any output bit (at any offset in the output stream) as a function of the 128 initial state elements.
Treating the initial state vector as 128 unknown variables, you can solve for them (via Gaussian elimination) with any 128 bits from the output stream, so long as you know their offsets in the output sequence.
Depending on what those offsets are, there's a ~50% chance there's no unique solution, in which case you may need a few more bits. For the case of two consecutive 64-bit outputs, you get a unique solution.
For 128 samples, the Gaussian elimination can be done in 2^21 steps, or only 2^14 steps if you can put a whole bit-vector in one register (which you can!).
If the offsets are known in advance, you can pre-compute the Gaussian elimination work and end up with a constant matrix that, when multiplied by the leaked bits, yields the initial state vector.
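To make the bit-vector trick concrete, here is a minimal Python sketch (mine, not the parent's code), assuming V8's xorshift128+ shift constants (23, 17, 26). It models only the linear XOR/shift core; the integer addition in the real output breaks this linearity, as discussed downthread. Each GF(2) matrix row is stored as a 128-bit Python int, so eliminating a row is a single XOR:

    # Rows are 128-bit ints: bit j set means the expression includes unknown
    # initial-state bit x_j (bits 0..63 = state0, 64..127 = state1).
    def symbolic_seed():
        return [1 << i for i in range(128)]

    def step(rows):
        """One symbolic xorshift128+ state transition (V8 constants)."""
        a, b = rows[:64], rows[64:]
        shl = lambda r, k: [r[i - k] if i >= k else 0 for i in range(64)]
        shr = lambda r, k: [r[i + k] if i + k < 64 else 0 for i in range(64)]
        xor = lambda u, v: [x ^ y for x, y in zip(u, v)]
        t = xor(a, shl(a, 23))
        t = xor(t, shr(t, 17))
        t = xor(t, b)
        t = xor(t, shr(b, 26))
        return b + t                      # new state0 = old state1, new state1 = t

    def solve_gf2(rows, bits, n=128):
        """Gaussian elimination over GF(2). rows[i] is an equation's
        coefficient mask, bits[i] its observed value. Returns the seed as an
        n-bit int, or None if singular (the ~50% no-unique-solution case)."""
        aug = [r | (b << n) for r, b in zip(rows, bits)]  # RHS kept at bit n
        pivots = []
        for col in range(n):
            hit = next((i for i, r in enumerate(aug) if (r >> col) & 1), None)
            if hit is None:
                return None               # rank deficient: need a few more bits
            prow = aug.pop(hit)
            pivots.append((col, prow))
            aug = [r ^ prow if (r >> col) & 1 else r for r in aug]
        x = 0
        for col, prow in reversed(pivots):        # back-substitution
            bit = (prow >> n) & 1
            bit ^= bin(prow & x).count("1") & 1   # parity of solved columns
            x |= bit << col
        return x

    # Usage: if the 64 bits of state0 leaked after steps 1 and 2 (128
    # equations), solve_gf2(eqs, observed_bits) recovers the 128-bit seed.
    eqs, rows = [], symbolic_seed()
    for _ in range(2):
        rows = step(rows)
        eqs += rows[:64]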
Exactly. I've implemented a xorshift-based rng inverter previously, and here's the implementation for the algorithm in the article:
https://github.com/m1el/samaras/blob/master/src/xorshift128....
Seeing your comment with the precomputed matrix reminded me of this visualisation I made: https://bsky.app/profile/retr0.id/post/3kc2rz7i7fm2y
The blog is not about going from one known internal state to the previous internal state.
It is about not knowing the internal state but figuring it out after having 2 (or 3) consecutive outputs of the random number generator.
If you follow the ChatGPT log, that's the angle he takes as well.
Oh that's fascinating, I didn't get that impression at all from the article itself.
If you look at the ChatGPT conversation linked at the bottom of the article (it will take a minute to load because it is very long), you will see that my initial approach was using matrices over GF(2). I am in the process of writing a blog about how I thought I could solve this quickly using ChatGPT, but unfortunately it took me down a bunch of rabbit holes that didn't work. Ultimately the ideas that did work were my own, but I still think using ChatGPT is useful if you know how to use it right (keep it on a tight leash; more details coming soon).
Back to the matrices... I thought I had an idea using matrices that would solve it really fast. It didn't work. There were dependencies in the matrix that I did not expect that made the idea fail. You can see all the details in the chat log.
With your recommendation, it is not clear to me how you deal with the fact that the state values are combined by integer addition to produce the outputs. This integer addition ruins the direct use of linear algebra to solve the problem. The initial approach that ChatGPT suggested, which was its own idea (not mine), was to bring in the integer carry bits as unknown variables and derive them via induction. I could come back to that idea, but I am sceptical that it will work.
Eventually I got fed up with ChatGPT's implementation and its attempts to help debug, which were only slowing me down. It kept trying to change the underlying data structures during the debugging. I had to scold it about that: you cannot change things while you are debugging.
Ultimately, I did away with the linear algebra and just approached it from some simple observations about how bits affect other bits. It is linear algebra under the hood, but I did not use matrices at this point.
One of the reasons why I wrote the blog is to encourage more people to look at this. We should be able to invert this in 2 or 3 outputs, whereas Z3 requires 5. Nothing I am doing is complex. I would love to see someone produce a tool that breaks it really quickly. And I really mean a tool, not just a statement on why it is weak. We all know it is weak, but web security researchers need tools to use that demonstrate exploits based upon primitives that are wrongly used. So I would strongly encourage people with ideas to implement them and blog about the ones that work.
I have another idea on my plate that I think might bring the search down to 2^20 effort, but I need time to work out the details. Free time is the hardest thing in life for me to find: I just have a busy life. But I also have a passion for building fast tools and puzzle solving.
Ah, yes, the addition throws a spanner in the works a little. 3 successive Math.random() outputs should definitely be solvable with a 2^32 brute force (~seconds on CPU). I'll try to understand your 2^26 approach a little better though.
Which ChatGPT version was used? 4o? 5? It's clear from the log that it suffers from serious context problems: it keeps forgetting/hallucinating functions and forgetting arguments to functions it has already written.
> I couldn’t believe that z3 was the best one can do
It probably never is, but if you don't care about a nice solution or don't care about anything past "is it possible?", then z3 and smt2 are amazing. You often don't even need to understand the whole problem. "take constraints, give solution" works great in those situations.
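As a concrete taste of that workflow, here is a minimal Z3 sketch (mine; it assumes V8's xorshift128+ shift constants and the textbook sum output, so a real Math.random() predictor like the one linked downthread would also need the double-conversion constraints):

    # "Take constraints, give solution": recover a xorshift128+ seed with Z3.
    from z3 import BitVec, LShR, Solver, sat

    MASK = (1 << 64) - 1

    def step_int(a, b):               # concrete step, used to generate test data
        t = (a ^ (a << 23)) & MASK
        t ^= t >> 17
        t ^= b ^ (b >> 26)
        return b, t

    def step_sym(a, b):               # the same step on Z3 bit-vectors
        t = a ^ (a << 23)
        t = t ^ LShR(t, 17)           # LShR = logical shift right
        t = t ^ b ^ LShR(b, 26)
        return b, t

    def recover(outputs):
        s0, s1 = BitVec("s0", 64), BitVec("s1", 64)
        solver = Solver()
        x, y = s0, s1
        for o in outputs:
            x, y = step_sym(x, y)
            solver.add(x + y == o)    # the integer addition that defeats pure GF(2)
        assert solver.check() == sat
        m = solver.model()
        return m[s0].as_long(), m[s1].as_long()

    # Self-test: three 64-bit sums are almost certainly enough for uniqueness.
    a, b = 0x123456789ABCDEF0, 0xFEDCBA9876543210
    outs = []
    x, y = a, b
    for _ in range(3):
        x, y = step_int(x, y)
        outs.append((x + y) & MASK)
    print(recover(outs) == (a, b))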
I've tried to use Z3 to find hash collisions in a simple hash, and it was amazingly bad at it. It took forever, whereas just trying random values until some were (multi)colliding was sub-second :-)
looking at the CVE report itself, Math.random() not being crypto-level seems to be known? - and the vulnerability comes from Node.js using it for some crypto purpose
so OP simply did a good exercise for himself, recreating the exact weakness of it
No, the post takes the attack from 5 observations down to 3.
no, the post takes a shoddy quickly-made implementation of an attack and improves it to its own better implementation of an attack
neither are professional frontline research, because said frontline work has already been done long loong ago when Xorshift was popularized and definitely when it became javascript's *default* rng
this is recreational cryptography, don't over-present it
You can call it recreational cryptography. I am no longer a professional cryptographer: I used to be. Now I have a full time job in the software industry and a family with kids. I don't have a lot of free time to work on cryptography like professional cryptographers do.
The person you replied to is correct. To my knowledge, the best inversion of Math.random() is this one: https://github.com/PwnFunction/v8-randomness-predictor . It takes 5 outputs from Math.random() to determine the seed. My research included a 2^50 algorithm to get it with 3 outputs. If there is a better implementation out there that does it in less than 2^50 work for 3 outputs, could you please provide a link to the implementation?
Also, as I said in the blog, this is a first step. I think I can bring it down by a factor of 2^6 with another trick I am working on, but details are still being tested. As the saying goes, attacks always get better, never worse.
The blog is also to encourage the aspiring or amateur cryptographer to have a look themselves. Nothing in my research is particularly deep, so I hope it shows a wider audience that what cryptographers do doesn't always require complex mathematics. This is a simple attack and I thought it was worth blogging about.
I have another blog about why I left cryptography. Part of it is about being stuck doing research that has no practical implications. To a real cryptographer, attacking XorShift128+ and Math.random() may seem uninteresting. I have a different view. Engineers make mistakes and use the wrong tools for the job all the time. We tell them it is wrong, but it is so much more powerful to prove it. When I looked at CVE-2025-7783, I just shook my head: the web security community is stuck using less-than-ideal tools (requiring 5 outputs to invert the algorithm) because the cryptographic community does not value producing tools to invert things that are not designed for cryptographic purposes. I think this attitude is doing a disservice to the web security community.
Also: I just realised who the person is that you were replying to, tptacek. Maybe you should Google his name. He's well known in cryptography, glad he is on my side! :-)
Please tell me more about what does and doesn't qualify as recreational cryptography.
Xorshift128+ is not a cryptographic rng though, so at least this isn't a cryptographic attack...
Should programming languages use cryptographic RNGs, like a ChaCha20-based one, in their standard libraries to stop accidental use of non-cryptographic RNGs for cryptographic purposes? But that comes at the cost of speed.
Go has taken exactly this stance, at least since Go 1.22 from 18 months ago.
They settled on 8 rounds of ChaCha with 300 bytes of internal state to amortize the latency. The end result is something only 1.5-2x slower than (their particular flavor of) PCG [1]. It was deemed good enough to become the default.
[1]: https://go.dev/blog/chacha8rand#performance
It is a cryptographic attack, just against an algorithm that isn't cryptographically strong. There's a lot of value in those.
I think some naming conventions could go a long way. If you want to import `fast_unsafe_random`, you might think twice.
I agree: why would you slow down things for everybody if it's only a problem for cryptographic purposes? Xorshift128+ etc. are around 10 to 30 times faster than ChaCha20.
The challenge is things that don't _obviously_ need cryptographically secure generators. For example, do you need a secure generator for the seed of a hash table, or a sorting algorithm? (For those that do use a seed.) Some will argue that yes, this is important. Until a few years ago, hash tables used static hash algorithms without any randomization, but "hash flooding" changed that. I think that even nowadays many hash table implementations don't use secure generators.
Then there's secure and insecure hash functions. Secure hash functions like SHA-256 are (compared to non-secure functions) especially slow for short keys. There are "somewhat" secure hash functions like SipHash that can be used for this purpose.
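For illustration, the standard hash-flooding mitigation is a per-process random key fed into a keyed hash. A small Python sketch (mine; the stdlib doesn't expose SipHash directly, so blake2b's keyed mode stands in for it):

    import hashlib
    import os

    HASH_KEY = os.urandom(16)   # one secret key per process, from the OS CSPRNG

    def bucket_index(key: bytes, n_buckets: int) -> int:
        # An attacker who can't learn HASH_KEY can't precompute colliding keys.
        d = hashlib.blake2b(key, key=HASH_KEY, digest_size=8).digest()
        return int.from_bytes(d, "little") % n_buckets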
I think people overthink this, and we should just have a library standard full-strength CSPRNG, and then people with fussy fast-randomness needs (Monte Carlo or whatever) can just pull in libraries. "Don't need secure random and can't use it" is a very niche problem space.
It'll never happen in Go though! They're over a decade committed to this library shape.
I strongly agree here. The default should be strong, “slow” randomness. If you know you need something different in your specific use case, and just can’t abide the safe version, import something else.
I agree. .NET is the opposite of Go. Calls to System.Random use xoshiro256** under the hood (as of .NET 6 I believe). On the other hand, calls to RandomNumberGenerator.GetBytes() are cryptographically secure, using the Windows kernel cryptographic provider on Windows, /dev/urandom (ChaCha20) on Linux, and arc4random_buf() on macOS (which also uses ChaCha20 under the hood).
I ported around 20 RNGs to C# (all non-cryptographic), and there are tons of uses for non-cryptographic RNGs, so I'm a little torn. I guess in modern development most people who need an RNG need it for crypto purposes (I would guess salts, keys, and nonces mostly), but I'd hate to see all the xoshiros, Mersenne Twisters, PCGs, MWCs, etc. go the way of the dodo simply because they are not deemed fit for crypto purposes. Games, simulations, and non-cryptographic hashes all need deterministic, high-performance RNGs, and don't need all of the cryptographic guarantees.
To top it off, there is no standard definition of what makes an RNG cryptographically secure, so it's a slightly loaded question anyway. Everything I've read says an algo needs the following properties: forward secrecy (observing outputs shouldn't let you predict future outputs), backward secrecy (compromising the current state shouldn't reveal previous internal states or outputs), and output that is indistinguishable from true random bits, even under a chosen-input attack. This is where I politely defer to the expert mathematicians and cryptographers, because I'm not equipped to perform such an analysis.
I can understand why things have developed this way, though: people have needed random numbers far longer than they've needed cryptographically secure random numbers, so the default is the non-cryptographically-secure variant. A language created tomorrow would likely follow in Go's footsteps and default to the cryptographically secure.
No, CSPRNG vs. RNG isn't a loaded question. Every RNG that says "this isn't an according-to-Hoyle cryptographically random number generator, but..." isn't one. Most modern CSPRNGs are designed with well-understood cryptographic primitives, so they draft off those security properties. Establishing those properties for a novel set of primitives is a major undertaking.
It's a little frustrating, because there are definitely fast RNGs that have tried to blur this line. A reasonable first approximation of the current situation is that a CSPRNG should have somewhere in its core a mixing function based on an actual cryptographic hash or permutation function; if the design has to explain what that function is and how it works (as opposed to just saying "this is ChaCha20"), it's not secure. These fast RNGs, like Xoshiro and PCG, all get there by not having cryptographically strong mixing functions at their core.
For what it's worth, I think the "GetBytes() means secure, IntN() means it's not secure" is a clever misfeature. Just make all the standard library random interfaces back onto a real CSPRNG, and let people pull in PCG or whatever if they have specialized needs for fast insecure RNGs.
> Just make all the standard library random interfaces back onto a real CSPRNG
That's what OpenBSD has done for the traditional C and POSIX randomness APIs.
Also, re your earlier comment, OpenBSD's arc4random API is everywhere now except linux/musl and Windows. POSIX now has getentropy, which on recent Linux kernel will be as fast as arc4random_buf. But it would be nice if musl got the arc4random API, which includes arc4random_uniform for generating 32-bit numbers in the interval [0, N), minimizing the risk of people screwing that up.
It's unlikely vendors will take OpenBSD's approach to the historic PRNG APIs, but they're almost all there for the arc4random API. Also, the former approach is less ideal than it sounds; the latest versions of PUC Lua, for example, use an included non-CSPRNG rather than merely binding the historic C APIs. Explicitly using the arc4random API means the semantics are explicit, too, and you can more easily audit code. It's conspicuously missing an API for floating-point intervals, but perhaps that'll come along.
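For comparison, Python's closest stdlib analogue to arc4random_uniform is secrets.randbelow, which likewise draws on the OS CSPRNG and avoids the modulo-bias mistake mentioned above:

    import secrets

    card = secrets.randbelow(52)   # uniform in [0, 52), like arc4random_uniform(52)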
What's your threshold for "high performance"? A modern CPU can use a secure algorithm and produce more than one byte per cycle. Xorshift is a bit faster but not much faster.
Good point about the hashing. Python does the right thing by making you select the one you want when writing your own code. If it had a default option, make that SHA-256 so that all users get the strong one by default. But yes, if you’re not actually doing crypto stuff, say if you only want to see if two locally generated files have the same content, there are lots of much faster alternatives.
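For example, for comparing local files with no adversary in the picture, even a plain checksum from the Python stdlib does the job quickly (a sketch; any fast non-cryptographic hash works here):

    import zlib

    def quick_checksum(path):
        # CRC32 is fine for spotting accidental differences, never for security.
        crc = 0
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                crc = zlib.crc32(chunk, crc)
        return crc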
> why would you slow down things for everybody
Secure generators are still fast and the number of programs where you'd see a difference is very small.
> Xorshift128+ etc are around 10 to 30 times faster than ChaCha20.
What methods, what CPU? Is that using chacha20 a couple bytes at a time? If you generate your random bytes in medium size blocks you'll probably see a much smaller difference.
funsafe_random
Perhaps put a warning in the name since the folks who don’t read the docs are the ones you’re trying to protect?
For example: Math.RandomNotCrypto()
When someone uses that in production for cryptographic purposes (and, yes someone is going to do that), they have to wear a dunce cap to the office for a month.
People are likely to use it in security-relevant ways without being aware that the use case constitutes “crypto”.
Exactly - I'm just generating random session ids, I'm not encrypting anything (or using any bitcoins). There's no crypto here, right?
Anakin Padme 4 Panel "right?" meme.
Math.random is a web API so you can't just rename it without breaking a large chunk of the web.
A non-breaking change would be to upgrade Math.random to be cryptographically secure - these days we know how to do this with minimal performance impact.
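In spirit, the upgrade is small. A Python sketch of the core idea (mine; a production engine would buffer a ChaCha20 stream rather than hit the OS per call):

    import secrets

    def secure_random() -> float:
        # A double has a 53-bit significand, so 53 CSPRNG bits scaled into
        # [0, 1) give the same distribution Math.random() promises.
        return secrets.randbits(53) * 2.0**-53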
This is a “next time” recommendation. Short of a time machine, we can’t change published names.
And, yes, I’d be down with going cryptographically secure (for now) with existing systems.
No. Sometimes you want a reproducible PRNG sequence. These efforts to harden generators break the existing specifications with a rug pull.
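A toy illustration of what gets lost (mine, not from the parent):

    import random

    rng = random.Random(42)                  # fixed seed: reproducible stream
    print([rng.random() for _ in range(3)])  # same values on every run/machine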
> But that comes at the cost of speed
That is mostly a myth.
I mean... technically, yes. But the cost is so marginal that you will have a hard time even measuring it unless you generate gigabytes of data.
For pretty much all common use cases like generation of ids, tokens, etc., you can use a secure random number generator and it will not impact your performance in any meaningful way.
It's also the exact same silly argument as for memory unsafety.
Incorrect isn't faster; it's just wrong. I can have wrong instantly, and you're not faster than that, or smaller, or cheaper, or easier to understand. So you're just much worse.
I have recently replaced Lua's random with this implementation:
https://nullonerror.org/2025/08/02/replacing-lua-s-math-rand...
A word of caution. A few years ago we had a production impact event where customers were getting identical cookies (and so started seeing each other's sessions). When I took a look at the code, what I found was that they were doing something very like your code: using a time()-based seed and a PRNG.
Whenever we deployed new nginx configs, those servers would roll out and restart, getting _similar_ time() results in the seed. But the individual nginx workers? Their seeds were nearly identical. Not every call to the PRNG was meant for UUIDs, but enough were that disaster was inevitable.
The solution is to use a library that leverages libuuid (via FFI or otherwise). A "native Lua" implementation is always going to miss the entropy sources available on your server and generate clashes if it's seeded with time(). (e.g. https://github.com/Kong/lua-uuid, https://github.com/bungle/lua-resty-uuid)
Why would you ever use an insecure RNG to generate a cookie?
In the code I saw, at least twice in its history people had introduced a "pure lua" solution for speed, and were clearly unaware of the shotgun they'd just pointed at their feet. (as in, somebody saw the issue and fixed it, and then someone else _fixed it back_ before I came along).
But in case _I'm_ messing up here, I'll bow to your expertise: libuuid uses /dev/random, which uses a CSPRNG (ChaCha20) with entropy ingested via Blake2 from whatever sources the system can get, right?
We did actually do a bunch of before/after testing showing the collision rates (zero after), and I believe the cookie in question has been replaced with a third party identity system in the intervening years - but if we did it wrong, I'd like to know.
I think the question isn’t “why switch it to libuuid”, it’s “why is anybody ever setting it to a time-based non-CS PRNG”.
Had this issue on a ray tracer I worked on. Since sampling was supposed to be random, you could fire it up on multiple machines and just average the result to get a lower noise image.
Except the distributed code fired up all worker instances almost simultaneously, and the code used time() to seed the RNG, so many workers ended up using the same seed and hence averaging those results did nothing.
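The failure mode in both stories is easy to reproduce (a sketch, not the actual code from either incident):

    import os
    import random
    import time

    seed = int(time.time())               # the buggy pattern: one shared seed
    workers = [random.Random(seed) for _ in range(4)]
    print([w.random() for w in workers])  # four identical "random" values

    fixed = [random.Random(os.urandom(16)) for _ in range(4)]
    print([w.random() for w in fixed])    # per-worker OS entropy: all differ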
I am reminded of an article about a poker site:
"There are 52-factorial ways to shuffle a deck of cards, but the site's PRNG only has 32 bits of state. 4 billion is alarmingly less than 52-factorial! But even worse, the PRNG is seeded using the number of milliseconds since midnight. 86 million is alarmingly less than 4 billion!"
So the actual entropy on the card table was equivalent to about 5 cards' worth. After seeing the 2 cards in his hand, and the 3 cards in the flop, he could use a program to solve for every other card in everyone's hand and in the entire deck!
(I may have mixed up many details - If anyone has an archive of the article please post it!)
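The arithmetic roughly checks out (my own back-of-the-envelope, not from the article):

    from math import factorial, log2

    print(log2(factorial(52)))           # ~225.6 bits for a full shuffle
    print(log2(2**32))                   # 32 bits of PRNG state
    print(log2(86_400_000))              # ~26.4 bits: ms since midnight
    print(log2(52 * 51 * 50 * 49 * 48))  # ~28.2 bits: 5 freely chosen cards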
I assume you're talking about
https://web.archive.org/web/20140210072712/http://www.laurad...
Previously on HN
https://news.ycombinator.com/item?id=7207851
UUIDv4 is banned in some environments because of how common it is to find someone using weak PRNGs to generate them. It happens way more often than it should.
Thank you for your advice, I will update my blog with it.
Just FYI, I only use this in a hide 'n' seek game engine, so it is fine.
A small note: Lua 5.4 changed the way random happens.
It's now based on xoshiro256**.
That's still not cryptographically secure, but it is vastly faster.
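For reference, here is a straightforward Python port of xoshiro256** (constants from the published algorithm; the state must not be all zeros):

    MASK = (1 << 64) - 1

    def rotl(x, k):
        # 64-bit left rotation
        return ((x << k) | (x >> (64 - k))) & MASK

    class Xoshiro256StarStar:
        def __init__(self, s0, s1, s2, s3):
            self.s = [s0 & MASK, s1 & MASK, s2 & MASK, s3 & MASK]

        def next(self):
            s = self.s
            result = (rotl((s[1] * 5) & MASK, 7) * 9) & MASK
            t = (s[1] << 17) & MASK
            s[2] ^= s[0]
            s[3] ^= s[1]
            s[1] ^= s[2]
            s[0] ^= s[3]
            s[2] ^= t
            s[3] = rotl(s[3], 45)
            return result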
> I was truly amazed to see ChatGPT understand my reasoning and even come up with its own ideas to help improve the research.
ChatGPT did nothing of the sort. The creators of ChatGPT happened to have in their corpus a sufficient amount of text from related research, forums and blogs discussing RNGs, and related math and programming topics to create a model that can generate plausible-sounding synthetic text.