walls.corpus | wget of mass destruction

wget of mass destruction

Sunday, 9 February, 2014 — engineering civics

David E. Sanger and Eric Schmitt, reporting for the New York Times, have published an article titled “Snowden Used Low-Cost Tool to Best N.S.A.”. I know they’re reporting for a general audience, but I believe the article does a disservice by allowing anonymous national security “officials” to put simple automation into scare quotes:

Using “web crawler” software designed to search, index and back up a website, Mr. Snowden “scraped data out of our systems” while he went about his day job, according to a senior intelligence official. “We do not believe this was an individual sitting at a machine and downloading this much material in sequence,” the official said. The process, he added, was “quite automated.”

The findings are striking because the N.S.A.’s mission includes protecting the nation’s most sensitive military and intelligence computer systems from cyberattacks, especially the sophisticated attacks that emanate from Russia and China. Mr. Snowden’s “insider attack,” by contrast, was hardly sophisticated and should have been easily detected, investigators found.

Automation gonna automate, I suppose. We’ve seen this dance with Aaron Swartz, Chelsea Manning and Edward Snowden, and the national security-industrial complex still takes a disingenuously naïve view of automation tools. Particularly with Swartz at MIT and with Snowden, officials suggest there was a mix of luck and quite possibly something nefarious behind all this automation. The New York Times should approach statements made by agency officials skeptically. This sort of programming is not hard. Moreover, no one has to work particularly hard to hide it. In fact, what might look to some like “hiding” would simply be polite engineering under a different lens.

One key is the not-at-all-advanced concept of throttling. Well-behaved web crawlers (also known as spiders) are respectful about how many requests they issue in a given amount of time. A lot of requests all at once attracts exactly the sort of attention that, as the unnamed officials seem pained to acknowledge, Snowden barely drew to himself.

First, lots of requests in a short amount of time show up in log files as such and quickly become a pattern. Patterns attract attention. Assuming the NSA and its various contractors audit access logs (which itself is something I’d automate), spreading requests over time makes them less likely to arouse suspicion. Moreover, unless an audit is looking for a particular type of activity, that manual or automated audit will not care a whit about well-throttled crawler traffic, because it looks a lot like expected traffic. It’s “hiding” to the same degree someone of average height and dress is “hiding” as they walk on a Manhattan sidewalk.
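To make throttling concrete, here is a minimal sketch of how a polite crawler might space out its requests. The function name, the 30-second base delay and the 15-second jitter are illustrative assumptions of mine, not anything reported about the software Snowden used.

```python
import random

def polite_delays(n_requests, base_seconds=30.0, jitter_seconds=15.0, seed=None):
    """Return one wait time (in seconds) per request.

    Each wait is the base delay plus a random amount of jitter, so the
    requests do not land in the access log at perfectly regular intervals.
    All parameter values here are hypothetical.
    """
    rng = random.Random(seed)
    return [base_seconds + rng.uniform(0, jitter_seconds) for _ in range(n_requests)]

# Hypothetical run: 1,000 pages at roughly one request every 30 to 45 seconds.
delays = polite_delays(1000, seed=1)
total_hours = sum(delays) / 3600
print(f"{len(delays)} requests spread over about {total_hours:.1f} hours")
```

A real crawler would call `time.sleep(delay)` before each fetch. The point is that a thousand pages spread over a working day looks nothing like a thousand pages in a minute.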

Second, setting aside any activity logs, system activity monitors seem more likely to catch a misbehaving web crawler. System activity monitors look at how much work a machine is doing at a given time. Typical checks look at how busy the CPU is, how much RAM is in use, overall network activity, what processes (“programs”) are running and so on. Some servers have automated checks in place, some don’t. For the sake of discussion, I’ll assume the servers hosting the content Snowden accessed were monitored in such a fashion. Now, assume each server’s activity varies, but within a typical band. Unless what Snowden was doing with his web crawler pushed one of these checks out of bounds, it was unlikely to attract attention. Normal activity gets ignored.
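As a rough illustration of the “band of activity” idea, a monitor might flag a reading only when it strays too far from recent history. This is a sketch under my own assumptions (a mean plus-or-minus three standard deviations rule, made-up requests-per-minute figures); real monitoring systems vary widely and I have no knowledge of what the NSA runs.

```python
from statistics import mean, stdev

def normal_band(samples, k=3.0):
    """Return (low, high): the mean of the samples plus or minus k standard deviations."""
    mu = mean(samples)
    sigma = stdev(samples)
    return mu - k * sigma, mu + k * sigma

def out_of_bounds(value, band):
    """True if a new reading falls outside the expected band."""
    low, high = band
    return value < low or value > high

# Hypothetical requests-per-minute readings for one server.
history = [42, 38, 45, 40, 41, 39, 44, 43]
band = normal_band(history)

print(out_of_bounds(44, band))   # a throttled crawler adding a little load
print(out_of_bounds(400, band))  # an unthrottled crawler hammering the server
```

A throttled crawler adds a request here and there and stays inside the band. Only the greedy one trips the check.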

On to the alleged crawling software itself.

In interviews, officials declined to say which web crawler Mr. Snowden had used, or whether he had written some of the software himself. Officials said it functioned like Googlebot, a widely used web crawler that Google developed to find and index new pages on the web. What officials cannot explain is why the presence of such software in a highly classified system was not an obvious tip-off to unauthorized activity.

First, Snowden’s job was as a systems administrator. Systems administration and development jobs involve access to technologies that are in no way top secret, like *NIX servers, which typically ship with a wide array of built-in scripting languages (Perl and Python most likely, Ruby very possibly). Or, perhaps Snowden is a shell scripter. Bash will get the job done.

As software goes, a basic web crawler is not exceptionally hard. I assert that if it’s written with tools likely already resident on any average server or laptop (e.g. Mac OS X or Linux, possibly Windows with PowerShell), there’s really nothing about one that would raise any particular suspicion. Effectively, the raw pieces of the web crawler were quite likely already present. Writing a text file to marshal these raw pieces together is unlikely to raise suspicion because a systems administrator or software developer already has scores of similar files lying around. There’s not a magic “web crawler” bit that flips and will alert anyone.
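To show how little is involved, here is a minimal sketch of a crawler’s core built only from Python’s standard library. It is my own illustration, not a reconstruction of anything Snowden ran. The fetching step is passed in as a function and the “site” is an in-memory dictionary with made-up `wiki.example` addresses, so the sketch runs without touching a network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of one host; fetch(url) returns HTML text or None."""
    host = urlsplit(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlsplit(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# Hypothetical three-page site standing in for an intranet wiki.
FAKE_SITE = {
    "http://wiki.example/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://wiki.example/a": '<a href="/b">B</a> <a href="http://elsewhere.example/">out</a>',
    "http://wiki.example/b": '<a href="/">home</a>',
}
pages = crawl("http://wiki.example/", FAKE_SITE.get)
print(sorted(pages))
```

Swap the dictionary lookup for a real HTTP fetch and this is, in essence, what “web crawler software” means. It is a queue, a set and a loop.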

As a thought experiment, what happens if every machine is audited and new and modified files are flagged, logged and sent off somewhere for analysis? Probably nothing, because in a large working group, a lot of these files are going to look very similar to each other, have innocuous or cryptic names and it would be a nigh-impossible task to write meaningful software to determine what all of these new files are for and, if they’re programs, what they do. Surely, no one is going to look at each one of these files. It’d be soul-sucking work.

Put another way: hammers, screwdrivers, wrenches, pliers, saws and knives aren’t noteworthy tools in a tool box. A new hammer on a construction site is unlikely to attract any attention. Similarly, just as carpenters use jigs, painters use scaffolding and auto mechanics use impact wrenches, ramps and hydraulic lifts to make their jobs easier, faster, more consistent and less tedious, systems engineers and developers use scripts. Now, imagine a construction site or factory inspecting everyone’s tool bag and workspace constantly for anything “inappropriate”. It wouldn’t be terribly effective and it’d be a huge burden and expense on the actual work. Imagine your average TSA security line at the office park.

There’s also some question about the web crawler having Snowden’s credentials:

When inserted with Mr. Snowden’s passwords, the web crawler became especially powerful. Investigators determined he probably had also made use of the passwords of some colleagues or supervisors.

But he was also aided by a culture within the N.S.A., officials say, that “compartmented” relatively little information. As a result, a 29-year-old computer engineer, working from a World War II-era tunnel in Oahu and then from downtown Honolulu, had access to unencrypted files that dealt with information as varied as the bulk collection of domestic phone numbers and the intercepted communications of Chancellor Angela Merkel of Germany and dozens of other leaders.

Officials say web crawlers are almost never used on the N.S.A.’s internal systems, making it all the more inexplicable that the one used by Mr. Snowden did not set off alarms as it copied intelligence and military documents stored in the N.S.A.’s systems and linked through the agency’s internal equivalent of Wikipedia.

As noted above, there’s nothing particularly special about a web crawler versus any other manner of script. It’s easy to inform utilities like wget and curl about authentication parameters and have them keep login cookies. It’s also easy for such a web crawler to announce itself to the server it requests information from in any manner it likes. There’s a convention around giving an identification string, as Google and Yahoo do for their web crawlers, but it’s just as easy to have a web crawler call itself Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko, which is the string Internet Explorer 11 sends. Add in the polite engineering of not requesting every page the web crawler sees as soon as it finishes processing the preceding one, and it becomes far less obvious that traffic to a web server is coming from a script instead of a human clicking links. There’s not necessarily anything nefarious going on.
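The same three ingredients (credentials, cookies and an identification string) take only a few lines in a scripting language too. The sketch below uses Python’s standard library rather than wget or curl. The `wiki.example` address and the credentials are placeholders, and HTTP Basic authentication is purely my assumption for illustration; the article does not say how the NSA’s internal systems authenticate users.

```python
import http.cookiejar
import urllib.request

# The Internet Explorer 11 identification string quoted in the text.
IE11_USER_AGENT = "Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko"

def build_client(base_url, username, password, user_agent=IE11_USER_AGENT):
    """Return (opener, jar): an HTTP client that logs in, keeps cookies
    and announces itself with the given identification string."""
    jar = http.cookiejar.CookieJar()  # holds any login or session cookies
    passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    passwords.add_password(None, base_url, username, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar),
        urllib.request.HTTPBasicAuthHandler(passwords),  # assumed auth scheme
    )
    # Every request made through this opener carries this User-Agent header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar

# Placeholder host and credentials; nothing is sent until opener.open() is called.
opener, jar = build_client("http://wiki.example/", "someuser", "not-a-real-password")
print(opener.addheaders)
```

From the server’s side, requests made with `opener.open(...)` would present the same identification string as a person browsing in Internet Explorer 11.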

If Snowden had access to all of these systems and accessing what sounds equivalent to a corporate intranet was not going to arouse suspicion, there’s little I can think of about this conceptual web crawler that would tip the balance toward his being caught. If the NSA wasn’t going to catch Snowden doing all of the work himself, it’s no more likely they were going to catch an automated process he wrote.

I don’t find any part of this story surprising from a technical standpoint. What I do find somewhat distressing is that unnamed officials think this is special or confers villainous status on Snowden. It doesn’t, just as it should not have with Aaron Swartz. Said officials should actually know better and if they don’t, they need to find technical advisors who will correctly inform them.

I bring this all up because I would like reporters on stories such as this to find an average systems administrator, security analyst or software engineer to talk to for perspective. The New York Times has an excellent digital staff with developers who could easily demonstrate what a similar script would look like and how it would work internally. Surely a news organization that builds great interactive stories and is growing more comfortable in its own clothes online can exercise some agency and draw on that in-house experience to call officials on bad, self-serving analysis like this.