This post was syndicated from: The Hacker Factor Blog and was written by: The Hacker Factor Blog. Original post: at The Hacker Factor Blog
As I tweak and tune the firewall and IDS system at FotoForensics, I keep coming across unexpected challenges and findings. One of the challenges is related to proxies. If a user uploads prohibited content from a proxy, then my current system bans the entire proxy. An ideal solution would only ban the user.
Proxies serve a lot of different purposes. Most people think about proxies with regard to anonymity, like the TOR network. TOR is a series of proxies that ensure that the endpoint cannot identify the starting point.
However, there are other uses for proxies. Corporations frequently have a set of proxies for handling network traffic. This allows them to scan all network traffic for potential malware. It’s a great solution for mitigating the risk from one user getting a virus and passing it to everyone in the network.
Some governments run proxies as a means to filter content. China and Syria come to mind. China has a custom solution that has been dubbed the “Great Firewall of China”. They use it to restrict site access and filter content. Syria, on the other hand, appears to use a COTS (commercial off-the-shelf) solution. In my web logs, most traffic from Syria comes through Blue Coat ProxySG systems.
And then there are the proxies that are used to bypass usage limits. For example, your hotel may charge for Internet access. If there’s a tech convention in the hotel, then it’s common to see one person pay for the access, and then run his own SOCKS proxy for everyone else to relay out over the network. This gives everyone access without needing everyone to pay for the access.
Proxy networks that are designed for anonymity typically don’t leak anything. If I ban a TOR node, then that node stays banned since I cannot identify individual users. However, the proxies that are designed for access typically do reveal something about the user. In fact, many proxies explicitly identify whose request is being relayed. This added information is stuffed in HTTP header fields that most web sites ignore.
For example, I recently received an HTTP request from 66.249.81.4 that contained the HTTP header “X-Forwarded-For: 126.96.36.199”. If I were to ban the user, then I would ban “66.249.81.4”, since that system connected to my server. However, 66.249.81.4 is google-proxy-66-249-81-4.google.com and is part of a proxy network. This proxy network identified who it was relaying for with the X-Forwarded-For header. In this case, “126.96.36.199” is someone in Yemen. If I see this reference, then I can start banning the user in Yemen rather than the Google Proxy that is used by lots of people. (NOTE: I changed the Yemen IP address for privacy, and this user didn’t upload anything requiring a ban; this is just an example.)
Unfortunately, there is no real standard here. Different proxies use different methods to denote the user being relayed. I’ve seen headers like “X-Forwarded”, “X-Forwarded-For”, “HTTP_X_FORWARDED_FOR” (yes, they actually sent this in their header; this is NOT from the Apache variable), “Forwarded”, “Forwarded-For-IP”, “Via”, and more. Unless I know to look for it, I’m liable to ban a proxy rather than a user.
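As a rough illustration of what looking for these headers could involve, here is a minimal Python sketch. The function name `relayed_addresses`, the header order, and the tokenizing rules are assumptions made for this example; they are not taken from the FotoForensics code. “Via” is left out because it usually names the proxy rather than the user.

```python
import ipaddress
import re

# Header names mentioned above. The order is an assumption, not a standard.
CANDIDATE_HEADERS = (
    "Forwarded",
    "X-Forwarded-For",
    "X-Forwarded",
    "HTTP_X_FORWARDED_FOR",
    "Forwarded-For-IP",
)


def relayed_addresses(headers):
    """Return (header name, IP) pairs for every parsable address found."""
    lowered = {name.lower(): value for name, value in headers.items()}
    found = []
    for name in CANDIDATE_HEADERS:
        value = lowered.get(name.lower())
        if not value:
            continue
        # Values may be comma lists ("a, b") or key=value pairs ("for=a;proto=http").
        for token in re.split(r"[,;\s]+", value):
            token = token.split("=", 1)[-1].strip('"[]')
            try:
                found.append((name, str(ipaddress.ip_address(token))))
            except ValueError:
                # Not a bare IP address (hostname, "unknown", address with port, ...).
                continue
    return found
```

Any client can send these headers, so a value found this way is a hint about the user rather than proof.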
In some cases, I see the direct connection address also listed as the relayed address; it claims to be relaying itself. I suspect that this is caused by some kind of anti-virus system that is filtering network traffic through a local proxy. And sometimes I see private addresses (“private” as in “private use” and “should not be routed over the Internet”; not “don’t tell anyone”). These are likely home users or small companies that run a proxy for all of the computers on their local networks.
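These cases can be told apart with Python’s standard `ipaddress` module. The sketch below is only an illustration; the function name `classify_relay` and the three labels are hypothetical.

```python
import ipaddress


def classify_relay(peer_ip, forwarded_ip):
    """Label a relayed address as 'self', 'private', or 'public'."""
    peer = ipaddress.ip_address(peer_ip)
    forwarded = ipaddress.ip_address(forwarded_ip)
    if forwarded == peer:
        # The connecting system claims to be relaying itself.
        return "self"
    # is_private covers the private-use ranges (10/8, 172.16/12, 192.168/16)
    # and also some other special-purpose blocks such as documentation ranges.
    if forwarded.is_private or forwarded.is_loopback or forwarded.is_link_local:
        return "private"
    return "public"
```

Only a “public” result names an address that could be banned instead of the proxy. A “self” or “private” result still shows that something is relaying, but not for whom.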
If I cannot identify the user being proxied, then just identifying that the system is a proxy can be useful. Rather than banning known proxies for three months, I might ban the proxy for only a day or a week. The reduced time should cut down on the number of people blocked because of the proxy that they used.
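That policy could be written down as a small decision function. This is a sketch under stated assumptions: the name `ban_plan` is hypothetical, “three months” is approximated as 90 days, and a week is picked from the “day or a week” range.

```python
from datetime import timedelta

FULL_BAN = timedelta(days=90)   # assumption: "three months" as 90 days
PROXY_BAN = timedelta(days=7)   # assumption: a week, from "a day or a week"


def ban_plan(peer_ip, relayed_ip=None, is_proxy=False):
    """Return (address to ban, ban duration) for one offending request."""
    if relayed_ip is not None:
        # The user behind the proxy is known: ban the user, not the proxy.
        return relayed_ip, FULL_BAN
    if is_proxy:
        # Known proxy, unknown user: shorten the ban to limit collateral damage.
        return peer_ip, PROXY_BAN
    return peer_ip, FULL_BAN
```

The caller is expected to pass `relayed_ip` only when it is a usable public address.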
There are unique headers that can identify that a proxy is present. Blue Coat ProxySG, for example, adds in a unique header: “X-BlueCoat-Via: abce6cd5a6733123”. This tracking ID is unique to the Blue Coat system; every user relaying through that specific proxy gets the same unique ID. It is intended to prevent looping between Blue Coat devices. If the ProxySG system sees its own unique ID, then it has identified a loop.
Blue Coat is not the only vendor with its own proxy identifier. Fortinet’s software adds in an “X-FCCKV2” header. And Verizon silently adds in an “X-UIDH” header that has a large binary string for tracking users.
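Checking a request for these marker headers only needs a lookup table. In the sketch below, the table contents come from the headers named in this post; the function name `proxy_markers` and the labels are hypothetical, and treating “Via” as a generic marker is an assumption.

```python
# Lower-cased header name -> label for whatever inserted it.
PROXY_MARKERS = {
    "x-bluecoat-via": "Blue Coat ProxySG",
    "x-fcckv2": "Fortinet",
    "x-uidh": "Verizon",
    "via": "generic (Via)",
}


def proxy_markers(headers):
    """Return the sorted labels of all marker headers present in a request."""
    present = {name.lower() for name in headers}
    return sorted(label for key, label in PROXY_MARKERS.items() if key in present)
```

A non-empty result says that something sits between the user and the server, even when nothing identifies the user.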
Language and Location
Besides identifying proxies, I can also identify the user’s preferred language.
The intent with specifying languages in the HTTP header is to help web sites present content in the native language. If my site supports English, German, and French, then seeing a hint that says “French” should help me automatically render the page using French. However, this can be used along with IP address geolocation to identify potential proxies. If the IP address traces to Australia but the user appears to speak Italian, then it increases the likelihood that I’m seeing an Australian proxy that is relaying for a user in Italy.
The official way to identify the user’s language is to use an HTTP “Accept-Language” header. For example, “Accept-Language: en-US,en;q=0.5” says to use the United States dialect of English, or just English if there is no dialect support at the web site. However, there are unofficial approaches to specifying the desired language. For example, many web browsers encode the user’s preferred language into the HTTP user-agent string.
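To make the “q=0.5” part concrete: each entry in Accept-Language can carry a quality weight, and entries without one default to 1.0. A minimal parser, with the hypothetical name `parse_accept_language`, might look like this. It is a sketch and does not handle every corner of the header grammar.

```python
def parse_accept_language(value):
    """Return (language tag, weight) pairs, most preferred first."""
    prefs = []
    for position, item in enumerate(value.split(",")):
        parts = [part.strip() for part in item.split(";")]
        tag = parts[0]
        if not tag:
            continue
        weight = 1.0  # default when no q= parameter is given
        for param in parts[1:]:
            if param.lower().startswith("q="):
                try:
                    weight = float(param[2:])
                except ValueError:
                    weight = 0.0  # assumption: treat a malformed weight as lowest
        prefs.append((tag, weight, position))
    # Highest weight first; ties keep the order used in the header.
    prefs.sort(key=lambda pref: (-pref[1], pref[2]))
    return [(tag, weight) for tag, weight, _ in prefs]
```

For the header above, `parse_accept_language("en-US,en;q=0.5")` returns `[("en-US", 1.0), ("en", 0.5)]`.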
Similarly, Facebook can relay network requests. These appear in the header “X-Facebook-Locale”. This is an unofficial way to identify when Facebook is being used as a proxy. However, it also tells me the user’s preferred language: “X-Facebook-Locale: fr_CA”. In this case, the user prefers the Canadian dialect of French (fr_CA). While the user may be located anywhere in the world, he is probably in Canada.
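The comparison between a language hint and IP geolocation can be sketched as follows. The function names `region_of` and `looks_relayed` are hypothetical, and the country code for the IP address is assumed to come from whatever geolocation lookup is already in use.

```python
def region_of(tag):
    """Return the two-letter region from a tag like 'fr_CA' or 'it-it', or None."""
    parts = tag.replace("_", "-").split("-")
    for part in parts[1:]:
        if len(part) == 2 and part.isalpha():
            return part.upper()
    return None


def looks_relayed(tag, geo_country):
    """True when the language hint names a different country than the IP address."""
    region = region_of(tag)
    return region is not None and region != geo_country.upper()
```

Here `looks_relayed("it-it", "AU")` is true and `looks_relayed("en-AU", "AU")` is false. A mismatch only raises the likelihood of a proxy, since travelers and expatriates trigger it too, and a bare tag such as “en” carries no region to compare.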
There’s only one standard way to specify the recipient’s language. However, there are lots of common non-standard ways. Just knowing what to look for can be a problem. But the bigger problem happens when you see conflicting language definitions.
User-Agent: Mozilla/5.0 (Linux; Android 4.4.2; it-it; SAMSUNG SM-G900F/G900FXXU1ANH4 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Version/1.6 Chrome/28.0.1500.94 Mobile Safari/537.36
X-OperaMini-Phone-UA: Mozilla/5.0 (Linux; U; Android 4.4.2; id-id; SM-G900T Build/id=KOT49H.G900SKSU1ANCE) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
If I see all of these in one request, then I’ll probably choose the official header first (German from Germany). However, without the official header, would I choose Spanish from Latin America (“es-LA” is unofficial but widely used), Italian from Italy (it-it) as specified by the web browser user-agent string, or the language from one of those other fields? (Fortunately, in the real world these would likely all be the same. And you’re unlikely to see most of these fields together. Still, I have seen some conflicting fields.)
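One way to handle conflicting hints is to collect them in a fixed priority order and keep the first one, while still recording the rest. The sketch below assumes an order of Accept-Language, then X-Facebook-Locale, then the user-agent strings; that order, the regular expression, and the names `normalize` and `language_hints` are all assumptions for illustration.

```python
import re

# Matches a locale token such as "; it-it;" inside a user-agent string.
UA_LOCALE = re.compile(r"[;(]\s*([a-z]{2})[-_]([a-zA-Z]{2})\s*[;)]")


def normalize(tag):
    """Rewrite 'es_LA' or 'it-it' in the 'xx-YY' form."""
    parts = tag.strip().replace("_", "-").split("-")
    parts[0] = parts[0].lower()
    if len(parts) > 1:
        parts[-1] = parts[-1].upper()
    return "-".join(parts)


def language_hints(headers):
    """Return (source header, language tag) pairs in priority order."""
    lowered = {name.lower(): value for name, value in headers.items()}
    hints = []
    if "accept-language" in lowered:
        # Simplification: take the first listed entry and ignore q-weights.
        first = lowered["accept-language"].split(",")[0].split(";")[0].strip()
        if first:
            hints.append(("Accept-Language", normalize(first)))
    if "x-facebook-locale" in lowered:
        hints.append(("X-Facebook-Locale", normalize(lowered["x-facebook-locale"])))
    for name in ("User-Agent", "X-OperaMini-Phone-UA"):
        match = UA_LOCALE.search(lowered.get(name.lower(), ""))
        if match:
            hints.append((name, normalize(match.group(1) + "-" + match.group(2))))
    return hints
```

The first pair is the language to use, and more than one distinct tag in the list means the request carries conflicting fields. For the two user-agent strings above, the sketch extracts it-IT and id-ID.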
Time to Program!
So far, I have identified nearly a dozen different HTTP headers that denote some kind of proxy. Some of them identify the user behind the proxy, but others leak clues or only indicate that a proxy was used. All of this can be useful for determining how to handle a ban after someone violates my site’s terms of service, even if I don’t know who is behind the proxy.
In the near future, I should be able to identify at least some of these proxies. If I can identify the people using proxies, then I can restrict access to the user rather than the entire proxy. And if I can at least identify the proxy, then I can still try to lessen the impact for other users.