Curl-impersonate: Special build of curl that can impersonate the major browsers

https://github.com/lwthiker/curl-impersonate

davidsojevic
There's a fork of this that has some great improvements on top of the original, and it is also actively maintained: https://github.com/lexiforest/curl-impersonate

There are also Python bindings for the fork for anyone who uses Python: https://github.com/lexiforest/curl_cffi
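
For a quick idea of what the bindings look like, here is a minimal sketch using curl_cffi's requests-style API; the "chrome" impersonation target is an assumption, and the targets actually available vary by version, so check the project's docs.

    # Minimal sketch of curl_cffi's requests-style API. The "chrome"
    # impersonation target is an assumption; available targets vary by version.
    from curl_cffi import requests

    resp = requests.get("https://example.com", impersonate="chrome")
    print(resp.status_code)
    print(resp.headers.get("content-type"))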

nyanpasu64
I suppose it does make sense that a "make curl look like a browser" program would get sponsored by "bypass bot detection" services...
ImHereToVote
Easy. Just make a small fragment shader to produce a token in your client. No bot is going to waste GPU resources to compile your shader.
kelsey978126
Why do people even think this? Bots almost always just use headful instrumented browsers now. If a human sitting at a keyboard can load the content, so can a bot.
gruez
Can't they use a software renderer like swiftshader? You don't need to pass in an actual gpu through virtio or whatever.
illegally
There's also a module for fully integrating this with the Python requests library: https://github.com/el1s7/curl-adapter
RKFADU_UOFCCLEL
All these "advanced" technologies that change faster than I can turn my neck, to make a simple request that looks like it was one of the "certified" big 3 web browsers, which will ironically tax the server less than a certified browser. Is this the nightmare dystopia I was warned about in the 90's? I wonder if anyone here can name the one company that is responsible for this despite positioning themselves as a good guy open source / hacker community contributor.
jchw
I'm rooting for Ladybird to gain traction in the future. Currently, it is using cURL proper for networking. That is probably going to have some challenges (I think cURL is still limited in some ways, e.g. I don't think it can do WebSockets over h2 yet) but on the other hand, having a rising browser engine might eventually remove this avenue for fingerprinting since legitimate traffic will have the same fingerprint as stock cURL.
rhdunn
It would be good to see Ladybird's cURL usage improve cURL itself, such as the WebSocket over h2 example you mention. It is also a good test of cURL to see and identify what functionality cURL is missing w.r.t. real-world browser workflows.
userbinator
> but on the other hand, having a rising browser engine might eventually remove this avenue for fingerprinting

If what I've seen from CloudFlare et al. is any indication, it's the exact opposite --- the amount of fingerprinting and "exploitation" of implementation-defined behaviour has increased significantly in the past few months, likely in an attempt to kill off other browser engines; the incumbents do not like competition at all.

The enemy has been trying to spin it as "AI bots DDoSing" but one wonders how much of that was their own doing...

SoftTalker
It's entirely deliberate. CloudFlare could certainly distinguish low-volume but legit web browsers from bots, just as well as they can distinguish chrome/edge/safari/firefox from bots. That is, if they cared to.
hansvm
Hold up, one of those things is not like the other. Are we really blaming webmasters for 100x increases in costs from a huge wave of poorly written and maliciously aggressive bots?
refulgentis
> Are we really blaming...

No, they're discussing increased fingerprinting / browser profiling recently and how it affects low-market-share browsers.

jillyboel
Your costs only went up 100x if you built your site poorly
cyanydeez
I don't think they're doing this to kill off browser engines; they're trying to sift browsers into "user" and "AI slop", so they can prioritize users.

This is entirely a web crawler 2.0 apocalypse.

nicman23
Man, I just want a bot to buy groceries for me.
extraduder_ire
I think "slop" only refers to the output of generative AI systems. bot, crawler, scraper, or spider would be a more apt term for software making (excessive) requests to collect data.
nonrandomstring
When I spoke to these guys [0] we touched on those quirks and foibles that make a signature (including TCP stack stuff beyond control of any userspace app).

I love this curl, but I worry that if a component takes on the role of deception in order to "keep up" it accumulates a legacy of hard to maintain "compatibility" baggage.

Ideally it should just say... "hey I'm curl, let me in"

The problem of course lies with a server that is picky about dress codes, and that problem in turn is caused by crooks sneaking in disguise, so it's rather a circular chicken and egg thing.

[0] https://cybershow.uk/episodes.php?id=39

thaumasiotes
> Ideally it should just say... "hey I'm curl, let me in"

What? Ideally it should just say "GET /path/to/page".

Sending a user agent is a bad idea. That shouldn't be happening at all, from any source.

Tor3
Since the first browser appeared I've always thought that sending a user agent ID was a really bad idea. It breaks with the fundamental idea of the web protocol: that it's the server's responsibility to provide data and it's the client's responsibility to present it to the user. The server does not need to know anything about the client. Including the user agent in this whole thing was a huge mistake, as it allowed web site designers to code for specific quirks in browsers. I can to some extent accept a capability list from the client, but I'm not so sure even that is necessary.
nonrandomstring
Absolutely, yes! A protocol should not be tied to client details. Where did "User Agent" strings even come from?
immibis
What should instead happen is that Chrome should stop sending as much of a fingerprint, so that sites won't be able to fingerprint. That won't happen, since it's against Google's interests.
gruez
This is a fundamental misunderstanding of how TLS fingerprinting works. The "fingerprint" isn't from chrome sending a "fingerprint: [random uuid]" attribute in every TLS negotiation. It's derived from various properties of the TLS stack, like what ciphers it can accept. You can't just "stop sending as much of a fingerprint" without every browser agreeing on the same TLS stack. It's already minimal as it is, because there's basically no aspect of the TLS stack that users can configure, and chrome bundles its own, so you'd expect every chrome user to have the same TLS fingerprint. It's only really useful to distinguish "fake" chrome users (e.g. curl with a custom header set, or firefox users with a user agent spoofer) from "real" chrome users.
johnisgood
I used to call it "cURL", but apparently officially it is curl, correct?
bdhcuidbebe
I’d guess Daniel pronounces it as ”kurl”, with a hard C like in ”crust”, since he’s Swedish.
cruffle_duffle
As in “See-URL”? I’ve always called it curl but “see url” makes a hell of a lot of sense too! I’ve just never considered it and it’s one of those things you rarely say out loud.
johnisgood
I prefer cURL as well, but according to official sources it is curl. :D Not sure how it is pronounced though, I pronounce it as "see-url" and/or "see-U-R-L". It might be pronounced as "curl" though.
ryao
Did they also set IP_TTL so that the TTL value matches the platform being impersonated?

If not, then fingerprinting could still be done to some extent at the IP layer. If the TTL value in the IP layer is below 64, it is obvious this is either not running on modern Windows or is running on a modern Windows machine that has had its default TTL changed, since by default the TTL of packets on modern Windows starts at 128 while most other platforms start it at 64. Since the other platforms do not have issues communicating over the internet, IP packets from modern Windows will always be seen by the remote end with TTLs at or above 64 (likely just above).

That said, it would be difficult to fingerprint at the IP layer, although it is not impossible.
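
For reference, pinning the TTL is just a socket option; a minimal sketch of the general idea follows (this is not necessarily what curl-impersonate does, which is exactly the open question above).

    # Sketch only: force the IP TTL to 128 (modern Windows' default) on an
    # outgoing connection so packets match a Windows stack at the IP layer.
    # Linux and macOS default to 64.
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, 128)
    sock.connect(("example.com", 80))
    sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
    print(sock.recv(200))
    sock.close()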

gruez
>That said, it would be difficult to fingerprint at the IP layer, although it is not impossible.

Only if you're using PaaS/IaaS providers that don't give you low-level access to the TCP/IP stack. If you're running your own servers it's trivial to fingerprint all manner of TCP/IP properties.

https://en.wikipedia.org/wiki/TCP/IP_stack_fingerprinting

ryao
I meant it is difficult relative to fingerprinting TLS and HTTP. The information is not exported by the Berkeley socket API unless you use raw sockets and implement your own userland TCP stack.
sneak
Couldn’t you just monitor the inbound traffic and associate the packets to the connections? Doing your own TCP seems silly.
xrisk
Wouldn’t the TTL value of received packets depend on network conditions? Can you recover the client’s value from the server?
ralferoo
The argument is that if many (maybe the majority) of systems are sending packets with a TTL of 64 and they don't experience problems on the internet, then it stands to reason that almost everywhere on the internet is reachable in less than 64 hops (personally, I'd be amazed if any routes are actually as long as 32 hops).

If everywhere is reachable in under 64 hops, then packets sent from systems that use a TTL of 128 will arrive at the destination with a TTL still over 64 (or else they'd have been discarded for all the other systems already).

ryao
Windows 9x used a TTL of 32. I vaguely recall hearing that it caused problems in extremely exotic cases, but that could have been misinformation. I imagine that >99.999% of the time, 32 is enough. This makes fingerprinting via TTL viable for distinguishing between those who set it at 32, 64, 128 and 255 (OpenSolaris and derivatives). That said, almost nobody uses Windows 9x or OpenSolaris derivatives on the internet these days, so I used values from systems that they do use for my argument that fingerprinting via TTL is possible.
fc417fc802
What is the reasoning behind TTL counting down instead of up, anyway? Wouldn't we generally expect those routing the traffic to determine if and how to do so?
therealcamino
To allow the sender to set the TTL, right? Without adding another field to the packet header.

If you count up from zero, then you'd also have to include in every packet how high it can go, so that a router has enough info to decide if the packet is still live. Otherwise every connection in the network would have to share the same fixed TTL, or obey the TTL set in whatever random routers it goes through. If you count down, you're always checking against zero.

ryao
If your doctor says you have only 128 days to live, you count down, not up. TTL is time to live, which is the same thing.
sadjad
The primary purpose of TTL is to prevent packets from looping endlessly during routing. If a packet gets stuck in a loop, its TTL will eventually reach zero, and then it will be dropped.
fc417fc802
That doesn't answer my question. If it counted up then it would be up to each hop to set its own policy. Things wouldn't loop endlessly in that scenario either.
VladVladikoff
Wait a sec… if the TLS handshakes look different, would it be possible to have an nginx-level filter for traffic that claims to be a web browser (e.g. chrome user agent), yet really is a python/php script? Because this would account for the vast majority of malicious bot traffic, and I would love to just block it.
aaron42net
Cloudflare uses JA3 and now JA4 TLS fingerprints, which are hashes of various TLS handshake parameters. https://github.com/FoxIO-LLC/ja4/blob/main/technical_details... has more details on how that works, and they do offer an Nginx module: https://github.com/FoxIO-LLC/ja4-nginx-module
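
For a concrete sense of how such a hash is built, here is a rough JA3-style sketch: the ClientHello's version, cipher suites, extensions, curves and point formats are joined into a string and MD5'd. The numeric values below are made up; a real implementation parses them out of the raw ClientHello and strips GREASE values first.

    import hashlib

    def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
        # Join each field's values with "-", the fields with ",", then MD5 the string.
        fields = [
            str(tls_version),
            "-".join(str(c) for c in ciphers),
            "-".join(str(e) for e in extensions),
            "-".join(str(c) for c in curves),
            "-".join(str(p) for p in point_formats),
        ]
        return hashlib.md5(",".join(fields).encode()).hexdigest()

    # Made-up values: two clients sending identical HTTP headers but offering
    # different cipher lists still hash to different fingerprints.
    print(ja3_fingerprint(771, [4865, 4866, 4867], [0, 11, 10], [29, 23], [0]))
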
gruez
That's basically what security vendors like cloudflare do, except with even more fingerprinting, like a javascript challenge that checks the js interpreter/DOM.
walrus01
JS can also check user agent properties like screen and window dimensions, which legit browsers will have and bots will also present, but with a more uniform and predictable set of x and y dimensions per set of source IPs. Lots of possibilities for JS endpoint fingerprinting.
Fripplebubby
I also present a uniform and predictable set of x and y dimensions per source IPs as a human user who maximizes my browser window
jrochkind1
Well, I think that's what OP is meant to avoid you doing, exactly.
immibis
Yes, and sites are doing this and it absolutely sucks because it's not reliable and blocks everyone who isn't using the latest Chrome on the latest Windows. Please don't whitelist TLS fingerprints unless you're actually under attack right now.
fc417fc802
If you're going to whitelist (or block at all really) please simply redirect all rejected connections to a proof of work scheme. At least that way things continue to work with only mild inconvenience.
jrochkind1
I am very curious if the current wave of mystery distributed (AI?) bots will just run javascript and be able to get past proof of work too....

Based on the fact that they are requesting the same absolutely useless and duplicative pages (like every possible combination of query params even if it does not lead to unique content) from me hundreds of times per url, and are able to distribute so much that I'm only getting 1-5 requests per day from each IP...

...cost does not seem to be a concern for them? Maybe they won't actually mind ~5 seconds of CPU on a proof of work either? They are really a mystery to me.

I currently am using CloudFlare Turnstile, which incorporates proof of work but also various other signals, which is working, but I know does have false positives. I am working on implementing a simpler nothing-but-JS proof of work (SHA-512-based), and am going to switch that in; if it works, great (because I don't want to keep out the false positives!), but if it doesn't, back to Turnstile.

The mystery distributed idiot bots were too much. (Scaling up resources -- they just scaled up their bot rates too!!!) I don't mind people scraping if they do it respectfully and reasonably; that's not what's been going on, and it's an internet-wide phenomenon of the past year.
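
For anyone unfamiliar with the idea, a minimal sketch of that kind of SHA-512 proof of work follows. The challenge format and difficulty are assumptions, not the actual implementation, and the solving loop would run in the browser in JS rather than Python.

    # Sketch of a SHA-512 proof-of-work challenge: the server issues a random
    # challenge, the client must find a nonce whose hash has enough leading
    # zero bits. 20 bits of difficulty is on the order of a million hashes.
    import hashlib, itertools, os

    def solve(challenge: bytes, difficulty_bits: int = 20) -> int:
        target = 1 << (512 - difficulty_bits)
        for nonce in itertools.count():
            digest = hashlib.sha512(challenge + str(nonce).encode()).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce

    def verify(challenge: bytes, nonce: int, difficulty_bits: int = 20) -> bool:
        digest = hashlib.sha512(challenge + str(nonce).encode()).digest()
        return int.from_bytes(digest, "big") < (1 << (512 - difficulty_bits))

    challenge = os.urandom(16)       # issued per session by the server
    nonce = solve(challenge)         # costly for the client
    assert verify(challenge, nonce)  # cheap for the server to check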

RKFADU_UOFCCLEL
Blocking a hacking attack is not even a thing; they just change IP address each time they learn a new fact about how your system works and progress smoothly without interruption until they exfiltrate your data. The same goes for scrapers, the only difference being there is no vulnerability to fix that will stop them.
jamal-kumar
This tool is pretty sweet in little bash scripts combo'd up with GNU parallel on red team engagements, for mapping HTTPS endpoints within whatever scoped address ranges that will only respond to proper browsers for whatever reason, or only with the SNI stuff in order. Been finding it super sweet for that. It can do all the normal curl switches like -H for header spoofing.
croemer
Back then (2022) it was Firefox only
GNOMES
I had to do something like this with Ansible's get_url module once.

Was having issues getting the module to download an installer from a vendor's site.

Played with Curl/WGET, but was running into the same issue, while it worked from a browser.

I ended up getting both Curl + get_url to work by passing the same headers my browser sent, such as User-Agent, encoding, etc.
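
The same trick in plain Python looks roughly like this; the URL and header values are placeholders copied dev-tools style, not the actual vendor site or command used above.

    # Sketch: replay the headers a real browser sent (copy them from the
    # browser's dev tools). URL and header values are placeholders.
    import requests

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate",
    }

    resp = requests.get("https://vendor.example/installer.exe", headers=headers)
    resp.raise_for_status()
    with open("installer.exe", "wb") as f:
        f.write(resp.content)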

INTPenis
Only three patches and shell wrappers; this should get Daniel coding. Imho this should definitely be in mainline curl.
1vuio0pswjnm7
"For these reasons, some web services use the TLS and HTTP handshakes to fingerprint which client is accessing them, and then present different content for different clients."

Examples: [missing]

userbinator
I'm always ambivalent about things like this showing up here. On one hand, it's good to let others know that there is still that bit of rebelliousness and independence alive amongst the population. On the other hand, much like other "freedom is insecurity" projects, attracting unwanted attention may make it worse for those who rely on them.

Writing a browser is hard, and the incumbents are continually making it harder.

jolmg
Your comment makes it sound like a browser being fingerprintable is a desired property by browser developers. It's just something that happens on its own from different people doing things differently. I don't see this as being about rebelliousness. Software being fingerprintable erodes privacy and software diversity.
gkbrk
Not all browsers, but Chrome certainly desires to be fingerprintable. They even try to cryptographically prove that the current browser is an unmodified Chrome with Web Environment Integrity [1].

Doesn't get more fingerprintable than that. They provide an un-falsifiable certificate that "the current browser is an unmodified Chrome build, running on an unmodified Android phone with secure boot".

If they didn't want to be fingerprintable, they could just not do that and spend all the engineering time and money on something else.

[1]: https://en.wikipedia.org/wiki/Web_Environment_Integrity