Wednesday, April 29 2020

7 years at Mozilla

By Julien Vehent, Wednesday, April 29 2020.

Seven years ago, on April 29th 2013, I walked into the old Castro Street Mozilla headquarters in Mountain View for my week of onboarding and orientation. Jubilant and full of imposter syndrom, that day marked the start of a whole new era in my professional career.

I'm not going to spend an entire post reminiscing about the good ol' days (though those days were good indeed). Instead, I thought it might be useful to share a few things that I've learned over the last seven years, as I went from senior engineer to senior manager.

Be brief

Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte.

- Blaise Pascal

Pascal's famous quote - If I had more time, I would have written a shorter letter - is strong advice. One of the best way to disrupt any discussion or debate is indeed to be overly verbose, to extend your commentary into infinity, and to bore people to death.

Here's the thing: nobody cares about the history of the universe. Be brief. If someone asks for someone specific, give them that information, and perhaps mention that there's more to be said about it. They'll let you know if they are interested in hearing the backstory.

People like to hear themselves talk. I certainly do. But over time I realized that, by being brief, I get a lot more attention from people, so they ask more questions, reach out more often, because they know I won't drag them into a lengthy debate.

But don't fall into the extreme opposite. Being overly brief can be detrimental to a conversation, or make you appear like you don't care to participate. The is a right balance to be found between too short and too long, depending on the context and your audience.

Tell a good story

Or more accurately, tell a story that touches your audience directly. If you're working on a security review of a nodejs application hosted in heroku, tell them stories of other nodejs applications hosted in heroku, ideally from a nearby team or organization. Don't go lecture them on the need for network security monitoring in datacenters, they simply won't care, as it doesn't touch them.

If you want to get people interested in what you're doing, first you need to take interest into what they are doing, then you need to tell them a good story they will care about. Finding out what that is will increase your chances of success.

That story will also change depending on your audience. Engineers will care about one thing, their managers something else, and the executives another thing entirely. Emphasize the parts of your project or idea that your audience will be most interested in to catch their attention, without losing the nature of your work.

This isn't rocket science. In fact, it's old school business playbook. Dale Carnegie's 1936 "How to Win Friends & Influence People" covers this at length. And while I certainly wouldn't take that book to the letter, it raises a number of points which I think are relevant to security professionals.

Be technical

I started out at Mozilla as a senior security engineer focused entirely on operations and infrastructure. I spent my days doing security reviews, making guides and writing code. 100% technical work. When I got promoted to staff, then to manager, then to senior manager, the proportion of technical work gradually reduced to make room for managerial work.

Managing is important. With 9-or-so people on the team, being able to accurately focus attention on the right set of problems is critical to the security of the perimeter. In fact, there's an entire school of thought that advocates that managers should be entirely focused on management tasks, and stay away from technical work.

For better or worse, I don't buy into that. I believe, for myself and for my team, that I'm a better team manager and security strategist when I have a deep technical understanding of the issues at hand.

That doesn't mean non-technical managers are bad. In fact, I think there are many situations where a non-technical manager is a better choice than a technical one. But for the field of operations security, at Mozilla, managing the people I manage, I think being technical is a strength.

How do you remain technical while being a manager? There are certainly areas in which I don't have a deep technical understanding and struggle to acquire one. But in general, I find that experimenting outside core projects, and picking up tasks outside the critical path, helps remain current and relevant.

For example, if the organization decide to switch to writing web applications in Rust with Actix, I'll write one myself. I won't get to the level of expertise I have in other areas, but I'll know enough to be relevant during security reviews and threat modeling sessions. And I continue to acquire knowledge in my areas of specialty: cloud infrastructure, cryptographic services, etc.

I don't expect to leave the management track any time soon. In fact, I expect to continue to grow in it. But I find it important that I could go back to a senior staff engineer role if I wanted to. Perhaps it is hubris, time will tell.

Never make assumptions

A year ago, I spent a night seated on a small table outside the reception of the Wahweap campground in Lake Powell, Arizona, as I was helping my team re-issue an intermediate certificate used to sign Firefox add-ons. It was freezing outside. I caught a cold, and a strong lesson, as dozens of my peers where untangling a mess I had helped create.

We called this incident "Armagaddon" internally, and it all started because we made a few assumptions we never took time to verify. We assumed that certificate expiration checking was disabled when verifying add-ons signatures, when in fact it was only disabled for end-entities. When the intermediate expired, everything blew up.

I learned that lesson. I also learned to identify and question every assumptions that we, engineers, make. The more complex a system becomes - and the Firefox ecosystem certainly is a complex one - the more assumptions people make. Learning to identify and consistently call out those assumption, forcing myself and others to verify them, and basing decisions on hard data and tests is critically important.

There is a place and a time where assumptions can be made and risks be taken, but not always, and certainly not on mission critical components. As an industry, we've bought into the "go fast and break things" mindset that is plain wrong for a lot of environments. Learning to slow down and taking the time to verify assumptions is, perhaps, the biggest cultural change we need.

Don't ignore backward compatibility

You think differently about engineering when you have to maintain compatibility with devices and software that haven't been updated in one, and sometime two, decades. Firefox falls into that category. Every time we try to change something that's been around a while, we run into backward compatibility issues.

Up until recently, we maintained a separate set of HTTPS endpoints that supported SSL3 and SHA-1 certificates, issued by decomissioned roots, to allow XP SP2 users to download Firefox installers. And when I say recently, I mean one or two years ago. Long after Microsoft had stopped supporting those users.

I have tons of examples of having to maintain weird configurations and infrastructure for deprecated users no one wants to think about. Yet, they exists, and often represent a sizeable portion of our users that cannot simply be ignored.

As a system designer, learning to account for backward compatibility is a learning curve. It's certainly much easier to greenfield a process while ignoring its history than to design a monster that needs to adopt modern techniques while serving old clients.

Some folks are better at this than others, and this is where you really feel the importance of experience and the value of tenured employees. Those people who jump ship every 18 months? They can't tell you a thing about backward compatibility. But that engineer who's been maintaining a critical system for the past 5 or 10 years absolutely can. Seek them, ask for the history of things, it's always interesting to hear!

Use boring tech

Nobody ever got fired for building a site in Python with Django and Postgresql. Or perhaps you'd like to keep using PHP? Maybe even Perl? The cool kids at the local hackathon will make fun of you for not using the latest javascript framework or Rust nightly, but your security team will probably love you for it.

The thing is, in 99% of cases, you'll be an order of magnitude more productive and secure with boring tech. There are very few cases where the bleeding edge will actually give you an edge.

For example, I'm a big fan of Rust. And not because not being a big fan of Rust at Mozilla is a severe faux-pas. But because I think the language team is doing a great job of distilling programming best practices into a reasonable set of engineering principles. Yet, I absolutely do not recommend anyone to write backend services in Rust, unless they are ready to deal with a lot of pain. Things like web frameworks, ORMs, cloud SDKs, migration frameworks, unit testing, and so on all exist in Rust, but are nowhere as mature or tested as their Python equivalents.

My favorite stack to build a quick prototype of a web service is Python, Flask, Postgresql and Heroku. That's it. All stuff that's been around for over a decade and that no one considers new or cool.

Bugzilla is written in Perl. Phabricator or Pocket are PHP. ZAP is Java. etc. There are tons of examples of software that is widely successful by using boring tech, because their developers are so much more productive on those stacks than they would be on anything bleeding edge.

And from a security perspective, those boring techs have acquired a level of maturity that can only be attained by walking the walk. Sure, programming languages can and do prevent entire classes of vulnerabilities, but not all of them, and using Rust won't stop SQL injections or SSRF.

So when should you not use boring tech? Mostly when you have time and money. If you're flush on cash and you unique problems, taking six months or a year to ramp up a new tech is worthwhile. The ideal time to do it is when you're rewriting a well-established service, and have a solid test suite to verify the rewrite is equivalent to the original. That's what the Durable Sync team did at Mozilla, when they rewrote the Firefox Sync backend from Python/Mysql to Rust/Spanner.

Delete your code

My first two years at Mozilla were focused on an endpoint security project called MIG, for Mozilla Investigator. It was an exciting greenfield project and I got to use all the cool stuff: Go, MongoDB (yurk), RabbitMQ (meh), etc. I wrote tons and tons of code, shipped a fully functional architecture, got a logo from a designer friend, gave a dozen conference talks, even started a small open source community around it.

And then I switched gears.

In the space of maybe 6 months, I completely stopped working on MIG to focus on cloud services security. At first, it was hard to let go of a project I had invested so much into. Then, gradually, it faded, until eventually the project died off and got archived. My code was effectively deleted. Two years of work out the window. This isn't something you're trained to deal with. And in fact, most engineers, like I was, are overly attached to their code, to the point of aggressively fighting any change they disagree with.

If this is you, stop, right now, you're not doing anyone any favors.

Your code will be deleted. Your projects will be cancelled. People are going to take over your work and transform it entirely, and you won't even have a say in it. This is OK. That's the way we move forward and progress. Learning to cope with that feeling early on will help you later in your career.

Nowadays, when someone is about to touch code I have previously written, I explicitely welcome them to rip out anything they think is worth removing. I give them a clear signal that I'm not attached to my code. I'd much rather see the project thrive than keep around old lines of code. I encourage you to do the same.

Be passionate, and respectful

Security folks generally don't need to be told to be passionate. In fact, they are often told the opposite, that they should tone it down a notch, that they are making too many waves. I disagree with this. I think it's good and useful to an organization to have a passionate security team that truly cares about doing good work. If we're not passionate about security, who else is going to be? That's literally what we're paid to do.

When a team comes ask for a security review about a new project, they expect to receive the full-on adversarial doomsday threat model experience from us. They want to talk to someone who's passionate about breaking and hardening services to the extreme. So this is what we do in our risks assessments meetings. We don't tone it down or play it safe, we push things to the extreme, we're unreasonable and it's fun as hell!

But when folks respectfully disagree with my recommendation to encrypt all databases with HSM-backed AES keys split with a Shamir threshold of 4 and store each fragment in underground bunkers around the world to prevent the NSA from compromising several of our employees to access the data, I remain respectful of their opinion. We have a productive and polite debate that leads to a reasonable middle-ground which sufficiently addresses the risks, and can be implemented within timeline and budget.

Both passion and respect are important quality of a successful engineer (or a person, really). I like to end a day knowing that I've done the right thing and that, even if everything didn't go my way, I have made a good case and I'm satisfied with the outcome.

Onward.

And so this is seven years at Mozilla. A lot more could be said about the work accomplished or the lessons learned, but then this would turn into a book, and I swore to myself I wouldn't write another one of those just yet.

Working at Mozilla is a heck of a job, even in those trying times. The team is fantastic, the work is fascinating, and, as they said during my first week of onboarding back in the old Castro street office, we get to work for Mankind, not for The Man. That's gotta be worth something!

Monday, September 30 2019

Beyond The Security Team

By Julien Vehent, Monday, September 30 2019. General

This is a keynote I gave to DevSecCon Seattle in September 2019. The recording of that keynote should be available soon.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote.png

Good morning everyone, and thank you for joining us on this second day of DevSecCon. My name is Julien Vehent. I run the Firefox Operations Security team at Mozilla, where I lead a team that secures the backend services and infrastructure of Firefox. I’m also the author of Securing DevOps.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_1_.png

This story starts a few months ago, when I am sitting in our mid-year review with management. We’re reviewing past and future projects, looking at where the dozen or so people in my group spend their time, when my boss notes that my team is under invested in infrastructure security. It’s not a criticism. He just wonders if that’s ok. I have to take a moment to think through the state of our infrastructure. I mentally go through the projects the operations teams have going on, list the security audits and incidents of the past few months.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_2_.png

I pull up our security metrics and give the main dashboard a quick glance before answering that, yes, I think reducing our investment in infrastructure security makes sense right now. We can free up those resources to work on other areas that need help.

Infrastructure security is probably where security teams all over the industry spend the majority of their time. It’s certainly where, in the pre-cloud era, they use to spend most of their time.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_3_.png

Up until recently, this was true for my group as well. But after years of working closely with ops on hardening our AWS accounts, improving logging, integrating security testing in deployments, secrets managements, instances updates, and so on, we have reached the point where things are pretty darn good. Instead of implementing new infrastructure security controls, we spend most of our time making sure the controls that exist don’t regress.

The infrastructure certainly does continue to evolve, but operations teams have matured to the point of becoming their own security teams. In most cases, they know best how to protect their environments. We continue to help, of course. We’re not far away. We talk daily. We have their back during security incidents and for the occasional security review. We also use our metrics to call out areas of improvements. But that’s not a massive amount of work compared to our investment from previous years.

I have advocated for some time now that operations teams make the best security teams, and every interaction that I have with the ops of the Firefox organization confirm that impression. They know security just as well as any security engineer would, and in most operational domains, they are the security experts. Effectively, security has gone beyond the security team.

So what I want to discuss here today is how we turned our organization’s culture from centralizing security to one where ownership is distributed, and each team owns security for their areas. I’d say it took us a little more than three years to get there, but let me start by going back a lot further than that.

It didn’t use to be this way

I’m french. I grew up in cold and rainy Normandy. It’s not unlike Seattle in many ways. I studied in the Loire Valley and started my career in Paris, back in the mid-2000s. I started out by working in banks, as a security engineer in the web and minitel division of a french bank. If you don’t know what a minitel is, you’re seriously missing out. But that's a story for another time.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_4_.png

So I was working in suit and tie at a bank in Paris, and the stereotypes were true: we were taking long lunches, occasionally drinking wine and napping during soporific afternoon meetings. Eating lots of cheese and running out of things to do in our 8 or 9 weeks of vacations. Those were the days. And when it came to security, we were the supreme authority of the land. A group of select few that all engineers feared, our words could make projects live or die.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_5_.png

This is how a deployment worked back then. This was pre-devops when deployment could take three weeks and everyone was fine with it. An engineering group would kick off a projet, carefully plan it, come up with an elegant and performant design, spend weeks of engineering building the thing. And don’t get me wrong, the engineering teams were absolutely top-notch. Best of the best. 100% french quality that the rest of the world continues to envy us today. Then they would try to deploy their newborn and, “WAIT, what the hell is this?” asks the security team who just discovered the project.

This is where things usually went sideways for engineering. Security would freak out at the new project, complain that it wasn’t consulted, delay production deploys, use its massive influence to make last minute changes and integrate complex controls deep into the new solution. Eventually, it would realize this new system isn’t all that risky after all, it would write a security report about it, and engineering would be allowed to deploy it to production.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_6_.png

In those environments, trust wasn’t even an option. Security decisions were made by security engineers and that was that. Developers, operators and architects of all level were expected to field every security topic to the security team, from asking for permission to open a firewall rule, to picking a hash algorithm for their app. No one dared bypass us. We had so much authority we didn’t hesitate to go up against multi-million dollar projects. In a heavily regulated industry, no one wants a written record of the security team raising the alarm on their projects.

On the first day of the conference, we heard Tanya talk about the need to shift left, and I think this little overly-dramatic story is a good example of why it’s so important that we do so. Shifting left means moving our security work closer to the design phases of engineering. It means being part of the early cycles of the SDLC. It means removing the security surprise from the equation. You’ve probably heard everyone mention something along those lines in recent years. I’m probably preaching to the choir here, but I thought it may be useful to remind those of us who haven’t lived through these broken security models why it’s so important we don’t go back to them.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_7_.png

And consolidating security decisions into the hands of security people has two major downsides.

First, it slows projects down dramatically. We’ve talked about the 1 security engineer to 100 developer ratio yesterday. Routing all security topics through the security team creates a major bottleneck that delays work and generates frustrations. If you’ve worked in organizations that require security reviews before every push to production, you’ve experienced this frustration. Backlogs end up being unmanageable, and review quality drops significantly.

Secondly, non-security teams feel exempt from having to account for security and spend little time worrying about how vulnerable their code is to attacks, or how permeable their perimeter is. Why should they spend time on security, when the organization clearly signals that it isn’t their problem to worry about?

We knew back then this model wasn’t sustainable, but there was little incentive to change it. Security teams didn’t want to give up the power they had accumulated over the years. They wanted more power, because security was never good enough and our society as we knew it was going to end in an Armageddon of buffer overflows should the sysadmins disable SELinux in production to allow for rapid release cycles.

Getting closer to devs & ops

Something that I should mention at this point is I’m an odd security engineer. What I love doing is actually building and shipping software, web applications and internet services in particular. I’ve been doing that for much longer than I’ve been doing security. At some point in my career, I was running a marketing website affiliated with “Who wants to be a millionaire”.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_8_.png

Yeah, that TV Game that airs every day at lunchtime around the country. They would air a question on the show that people had to answer on our website to win points they could use to buy stuff. I gotta say, this was a proud moment of my career. I felt a strong sense of making the world a better place then. It was powerful.

Anyways, this was a tiny operation, from a web perspective. When I joined, we had one java webapp directly connected to the internet, and one oracle database. I upgraded that to an haproxy load balancer, three app servers, and a beefier oracle database. It was all duct tape and cut corners, but it worked, and it made money. And it would crash every day. I remember watching the real-time metrics out of haproxy when they would air the question, and it would spike to 2000 requests per seconds until the site would fall down. Inevitably, every day, the site would break and there was nothing I could do about it. But business was happy because before we had crashed, we had made money.

The point I’m trying to make here is not that I’m a shitty sysadmin who use to run a site that broke every day. It’s that I understand things don’t need to be perfect to serve their purpose. Unless you work at a nuclear plant or a hospital, it’s often more important to serve the business than to have perfect security.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_9_.png

A little more than four years ago, I joined a small operations team focused on building cloud services for Firefox. They were adopting all the new stuff: immutable systems, fully automated deployments controlled by jenkins pipelines, autoscaling groups, load balancing based on application health, and so on. It was an exciting greenfield where none of my security training applied and everything had to be reevaluated from scratch. A challenge, if I ever had one. The chance to shape security in a completely different way.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote.jpg

So I joined the cloud operations team as a security engineer. The only security engineer. In the middle of a dozen of so hardened ops who were running stuff for millions of Firefox users. I live in Florida. We know that swimming in gator infested ponds is just stupid. The same way, security engineers generally avoid getting cornered by hordes of angry sysadmins. They are the enemy, you see. They are the ones who leave mission critical servers open to the internet. They are the ones who don’t change default passwords on network gears. They are the ones who ignore you when you ask for a system update seven times in a row. They are the enemy.

To be fair, they see us the same way. We’re Sauron, on top of our dark tower, overseeing everything. We corrupt the hearts and minds of their leaders. We add impossible requirements to time constrained projects. We generally make their lives impossible.

And here I was. A security guy. Joining an ops team.

By and large, they were nice folks, but it quickly became clear that any attempt at playing the arrogant french security guy who knows it all and dictates how things should be would be met with apathy. I had to win this team over. So I decided to pick a problem, solve it, make their life easier and win myself some goodwill.
Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_11_.png

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_11_.png

I didn’t have to search for long. Literally days before I joined the team, a mistake happened and the git repository containing all the secrets got merged into the configuration management repo. It was a beautiful fuck-up, executed with brio, with the full confidence of a battle-tested engineer who has ran these exact commands hundreds of times before. The culprit is a dear friend and colleague of mine who I like to use as an example of professionalism with the youngsters, and I strongly believe that what failed then was not the human layer, but completely inadequate tooling.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_12_.png

We took it as a wake up call that our process needed improvement. So as my first project, I decided to redesign secrets management. There was momentum. Existing tooling was inadequate. And it was a fun greenfield project. I started by collecting the requirements: applications needed to receive their secrets upon initialization, and in autoscaling groups that had to happen without a human taking action. This is a solved problem nowadays known as the bootstrapping of trust, where we use the identify given to an instance to grant permissions to resources, such as the ability to download a given file from S3 or the permission to decrypt with a KMS key. At the time, those concepts were still fairly new, and as I was working through the requirements, something interesting happened.

In a typical security project, I’d gather all the requirements, pick the best possible security architecture and implement it in the cleanest possible way. I’d then spend weeks or months selling my solutions internally, relentlessly trying to convert people to my cause, until everyone agreed or caved.

But in this project, I decided to talk to my customers first. I sat down with every ops who would use the thing and spent the first few weeks of the project studying the provisioning logic and the secrets management workflow. I also looked at the state of the art, and added some features I really wanted, like clean git history and backup keys.

By the time I had reached the fourth proposal, the ops team had significantly shaped the design to fit their needs. I didn’t need to sell them on the value, because by then, they had already decided they really needed, and wanted, the tool. Mind you, I hadn’t written a single line of code yet, and I wasn’t sure I could implement everything we had talked about. There was a chance I had oversold the capabilities of the tool, but that was a risk worth taking.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_13_.png

It took a couple of months to get it implemented. The first version was written in ugly Python, with little tests and poor cross-platform support. But it worked, and it continues to work (after a rewrite in Go). The result of this project is the open source secrets management tool called Sops, which has been our internal standard for three and something years now. Since I started this talk, perhaps a dozen EC2 instances have autoscaled and called Sops to decrypt their secrets for provisioning.

Don’t just build security tools.
Build operational tools that do things securely.

Today, Sops is popular DevOps tool inside and outside Mozilla, but more importantly, this project laid out the foundation of how security and operations would work together: strong collaboration on complex technical topics. We don’t just build security tools like we used to. We build operational tools that do things securely. This may seem like a distinction without a difference, but I found that it changes the way we think about our work from only being a security team, to being a security team that supports a business with specific operational needs. Effectively, it forces us to be embedded into operations. Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_14_.png

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_14_.png

I spent a couple years embedded in that operations team, working closely with devs & ops, sharing their successes and failures, and helping the team mature its security posture from the inside. I wrote Securing DevOps during those years, and I transferred a lot of what I learned from ops into the book.

I found it fascinating that, while I always had a strong opinion about security architecture and the technical direction we were taking, being down in the trenches dramatically changed my methods. I completely stopped trying to use my security title to block a project, or go against a decision. Instead, I would propose alternative solutions that were both viable and reasonable, and negotiate a middle ground with the parties involved. Now, to be fair, the team dynamic helped a lot, particularly having a manager who put security at equal footing with everything else. But being embedded with Ops, effectively being an ops, greatly contributed to this culture. You see, I wasn’t just any expert in the room, I was their expert in the room, and if something went sideways, I would take just as much blame as everyone else. Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_15_.png

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_15_.png

There are two generally accepted models for building security teams. The centralized one, where all the security people report to a CISO who reports to the CEO, and the distributed one, where security engineers are distributed into engineering teams and a CISO sets the strategy from the top of the org. Both of these models have their pros and cons.

The centralized one generally has better security strategy, because security operates as a cohesive group reporting to one person. But its people are so far away from the engineering teams that actually do the work that it operates on incomplete data and a lot of assumptions.

The distributed model is pretty much the exact opposite. It has better information from being so close to where things happen, but its reporting chain is directly tied to the organization’s hierarchy, and the CISO may have a hard time implementing a cohesive security strategy org-wide.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_16_.png

The embedding model is sort of a hybrid between these two that tries to get the best of both worlds. Having a strong security organization that reports to an influential CISO is good, and having access to real-world data is also critical to making the right decisions. Security engineers should then report to the CISO but be embedded into engineering teams.

Now, "embedded" here has a strong meaning. It doesn’t just mean that security people snoop into these teams chatrooms and weekly meetings. It means that the engineering managers of these teams have partial control of the security engineers. They can request their time and change their priorities as needed. If a project needs an urgent review, or an incident needs handling, or a tools needs to be written, the security engineers will step in and provide support. That’s how you show the organization that you’re really 100% in, and not just a compliance group on the outskirts of the org.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_17_.png

Reverse embedding is also very important, and we’ve had that model in the security industry for many years: it’s called security champions. Security champions are engineers from your organization who have a special interest in security. Oftentimes, they are more senior engineers who are in architect roles or deep experts who consider security to be a critical aspect of their work. They are the security team’s best friends. Its partners throughout the organization. They should be treated with respect and given as much support as possible, because they’ll move mountains for you.

Security champions should have full access to the security team. No closed doors meetings they are excluded from, no backchannel discussions they can’t participate in. If you can’t trust your champions, you can’t trust anyone in your org, and that’s a pretty broken situation.

Champions must also be involved with setting the security strategy. If you’re going to adopt a framework or a platform for its security properties, make sure to consult your champions. If they are on board, they’ll help sell that decision down the engineering chain.

Avoid Making Assumptions

If you embed your security engineers into the dev and ops, and open the doors of your organization to security champions, you’ll allow information to flow freely and make better decision. This allow you to dramatically reduce the amount of assumptions you have to make every day, which directly correlates to stronger security.

The first project I worked on when I joined Mozilla was called MIG, for Mozilla InvestiGator (the logo was a gator, for investigator, get it?). The problem we were trying to solve was inspecting our live systems for indicators of compromise in real-time. Back in 2013, we already had too many servers to run investigations manually. The various method we had tried all had drawbacks. The most successful of them involved running parallel ssh from a bastion host that had ssh keys and firewall rules to connect everywhere. If that sounds terrifying to you, it’s because it was. So my job was to invent a system that could securely connect to all those systems to tell us if a file with a given checksum was present, which would indicate that a malware or backdoor existed on the host.

That project was cool as hell! How often do you get to spend 2 years implementing a completely novel agent-based system in a new language? I had a lot of fun working on MIG, and it saved our bacon a few times. But quickly what became evident was that we were running the same investigations over and over. We had some fancy modules that could look for byte strings in memory, or run complex analysis on files, but we never used them. Instead, we used MIG as an inventory tool to find out which version of a package was installed on a system, or which host could connect to the internet through an outbound gateway. MIG wasn’t so much of a security investigation platform as it was an inventory one, and it was addressing a critical need: the need for information.

Every security incident starts with an information gathering phase. You need to understand exactly how much is impacted, and what is the risk of propagation, before you can work on mitigation. The inventory problem continues to be a major concern in infosec. There are too many versions of too many applications running on too many systems for anyone to keep track of them effectively. Is this is even ignoring the problem Shadow IT poses to organizations that haven’t modernized fast enough. As security experts, we operate in the dark most of the time, so we make assumptions.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_18_.png

I’ve grown to learn that assumption are at the root of most vulnerabilities. The NYTimes wrote a fantastic article to how Boeing built a defective 737 Max, and I’ll let you guess what’s at the core of it: assumptions. In their case, it’s – and I quote the New York Times article here - “After Boeing removed one of the sensors from an automated flight system on its 737 Max, the jet’s designers and regulators still proceeded as if there would be two“. The article is truly eye opening on how people making assumptions about various parts of the systems led to the plane being unreliable, to dramatic consequences.

I've made too many assumptions throughout my career, and too often did they prove to be entirely wrong. One of them even broke Firefox for our entire user base. Nothing good comes from making assumptions.

Assumptions lead to vulnerability

Assumptions leads to vulnerability. Let me say that one more time. Assumptions lead to vulnerability. As security experts, our job is to identify every assumption as a potential security issue. If you only get one thing out of this talk, please let it be this: every time someone uses the word “assume” in any context, reply we “let’s see how we can remove assumption”, or “do we have a test to confirm this?”

Nowadays, I assert the security maturity of an organization to how many assumptions its security team is making. It’s terrifying, really. But we have a cure, it’s Data. Having better inventories, something cloud infrastructure truly helps us with, is an important step toward fixing out of date systems. Clearing up assumptions on how systems interconnects reduces the attack surface. Etc, etc.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_2_.png

Data is a formidable silver bullet for a lot of things, and it must be at the core of any decent security strategy. The more data you have, the easier it it to take risk. Good security metrics updated daily is what helped me answer my boss’s question that we were okay reinvesting our resources in other areas. Those dashboards are also what upper management want to see. Not that they necessarily want to know the details of the dashboard, though sometimes they do, but they want to make sure you have that data, that you can answer questions about the organization’s security posture, and that you can make decisions based on accurate information.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_19_.png

Data does not come out of thin air. To get good data, you need good tests, good monitoring and good telemetry. Security teams are generally pretty good at writing tests, but they don’t always write the right tests.

Above is an example of an AWS test from our internal framework called “frost”. It checks that EC2 instances are running on an acceptable AMI version. You can see that code for yourself, it’s open source at github.com/mozilla/frost. What’s interesting about this test is that it took a few iterations to get it right. Our first attempt was just completely wrong and flagged pretty much every production instance as insecure, because we weren't aligned with the ops team, and had written the test without consulting them.

In this case, we flag an instance as insecure if it is running on an AMI that isn’t owned by us, and that is older than a configured max age. This is a good test, and we can give that data directly to the ops team for action. It’s really worthwhile spending extra time making sure your tests are trustworthy, because otherwise you’re sending compliance reports no one ever reads or take actions on, and you’re pissing people off.

Setting the expectations

But data and tests only reflect how well you’re growing the security awareness in your organization. They don’t, in and of themselves, mature your organization. So while it is important to spend time improving your testing tools and metrics gathering frameworks, you should also spend time explaining to the organization what your ideal state is. You should set the expectations.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_20_.png

A few years ago, we were digging through our metrics to understand how we could get websites and APIs to a higher level of security. It was clear we weren’t being successful at deploying modern security techniques like Content Security Policy or HSTS. It seemed like every time we would perform a risk assessment, we would get a full buy-in from dev teams, and yet those controls would never make it to the production version. We had an implementation problem.

So we tried a few things, hoping that one of them would catch on.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_21_.png

We first pushed on the idea that every deployment pipeline would invoke ZAP right after the pre-production site was deployed. I talked about this in the book under the idea of test-driven security, and I use examples of invoking a container running ZAP in CircleCI. The ZAP container would scan a pre-production version of the site and output a compliance report against a web security baseline. The report was explicit in calling out missing controls, so we thought it would be easy for devs to adopt it and fix their webapps. But it didn’t take off. We tried really hard to get it included in Jenkins deployment pipelines and web applications CI/CD, and yet the uptake was low, and fairly short lived. The integrations would get disabled as soon as it became annoying (too slow, blocking deploys, etc...). The tooling just wasn't there yet.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_22_.png

But the idea of the “baseline”, this minimal set of security controls we wanted every website and api to implement, was fairly successful. So we turned it into a checklist in markdown format that could easily be copied into github issues.

This one worked out beautifully. We would create the checklist in the target repository while running the risk assessment, and devs would go through it as part of their pre-launch check. Over time, we added dozens of items to the checklist, from new controls like checking for out-of-date dependencies, to traps we wanted devs to avoid like refusing to proxy requests to aws instances metadata. The checklist got big, and many items don’t apply to most projects, but we just cross them off and let the devs focus on the stuff that matters. They seem to like it.

And something else interesting happened: project managers started tracking completion of the checklist as part of their own pre-launch checklist. We would see security checklist completeness being mentioned as part of a readiness meeting with directors. It got taken seriously, and as a result every single website we launched over the last couple years implements content security policy, HSTS, same site cookies and so on.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_23_.png

The checklist isn’t the only thing that helped improve adoption. Giving developers self-service security tools is also very important. And the approach of giving letter-grades has an interesting psychological effects on engineering teams, because no one wants to ship a production site that gets an F or even a C on the publicly accessible security assessment tool. Everyone wants an A+. And guess what? Following the checklist does give you an A+. That makes the story pretty straightforward: follow the checklist, and you’ll get your nice A+ on the Observatory.

Clear Expectations
↓
Checklist
↓
Self Assessment
↓
Profit

This particular model may not work exactly for your organization. The power dynamics and internal politics may be different. But the general rule still applies: if you want security to be adopted in products that ship, set the expectations early and clearly. Don’t give them vague rules like “only use encryption algorithms that provide more than 128 bits of security”. No one knows what that means. Instead, give them information they can directly translate into code, like a content security policy they can copy and paste then tweak, or a logging library they can import in their apps that spits out the right format from the get go. Set the expectations, and make them easy to follow. Devs and ops are too busy to jump through hoops.

Once you’ve set the expectations, give them checklists and the tools to self-assess. Don’t make your people have to ask you every time they need a check of their webapp, it bothers them as much as it bothers you. Instead, give them security tools that are fully self-service. Give them a chance to be their own security team, and to make you obsolete. Clear expectations and self-service security tools is how you build up adoption.

Not having to say “no”

There is another anti-pattern of security team I’d like to address: it’s the stereotypical “no” team. The team that operates in organizations where engineers keep bringing up projects they feel they have to shut down because of how risky they are. Those security people are usually not a happy bunch. You rarely see them smile. They complain a lot. They look way older than they really are. Maybe they took up drinking.

See, I’m a happy person. I enjoy my work, and I enjoy the people I work with, and I don’t want to end up like that. So I set a personal goal to pretty much never having to say no. Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_24_.png

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_24_.png

I have a little one at home. She’s a little over a year old and just started walking. She’s having a blast really, and it’s a sight to see her run around the house, a huge smile on her face, using her newly acquired skills. For her mother and I, it’s just plain terrifying. Every step she takes is a potential disaster. Every furniture in the house that isn’t covered in foam and soft blanket is a threat. Every pot, jar, broom, cat or dog is a weapon she could get her hands on at any moment. The threat modeling of a parent is simple: “EVERYTHING IS DANGEROUS, DO NOT LET THIS CHILD OUT OF SIGHT, NO YOU CAN’T CLIMB THE STAIRS ON YOUR OWN, DON’T LICK THAT KNIFE!”. Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_25_.png

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_25_.png

The safe approach here should be simple: cover the little devil in bubble wrap, and don’t let her leave the safe space of her playpen. There, problem solved. Now if she could just stop growing...

And by the way, you CAN buy bubble wrap baby suits. It’s a thing. For those of you with kids, you may want to look into it. Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_26_.png

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_26_.png

There is a famous quote from one of my personal hero: Rear Admiral Grace Hopper. She invented compilers, back when computers were barely a thing. She used to hand out nanoseconds to military officers to explain how long messages took to travel over the wire. A nanosecond here is a small piece of wire about 30cm long (that’s almost a foot, for all you americans out there) that represent the maximum distance that light or electricity can travel in a billionth of a second. When an admiral would ask her why it takes so damn long to send a message via a satellite, she’d point out that between here and the satellite there’s a large number of nanoseconds.

Anyway, Admiral Hopper once said “ships are safe in harbor, but that’s not what ships are for”. If you only remember two things from this talk, add that one to the list.
Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_27_.png

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_27_.png

As a dad, it is literally my job to paternalize my kid. A lot of security teams feel the same way about their daily job. Let me argue here, that this is completely the wrong approach. It’s not your job to paternalize your organization. The people you work with are adults capable of making rational decisions, and when they decide to ignore a risk, they are also making a rational decision. You may disagree with it, and that’s fine, but you shouldn’t presume that you know better than everyone else involved.

What our job as security professionals really is, is to help the organization make informed decision. We need to surface the risks, explain the threats, perhaps reshape the design of a project to better address some concerns. And when everything is said and done, the organization can decide for itself how much risk it is willing to accept.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_28_.png

At Mozilla, Instead of saying no, we run risk assessments and threat models together as a team, then we make sure everyone agrees on the assessment, and if they think it’s an appropriate amount of risk to take. The security team may have concerns over a specific feature, only to realize during the assessments those concerns aren’t really that high. Or perhaps they are and the engineering team didn’t realize that until now, and is perfectly willing to modify or remove that feature entirely. And sometimes the project simply is risky by nature, but it’s only being rolled out to a small number of users while being tested out.

The point of risks assessments and threat modeling isn’t only to identify the risks. It’s to help the organization make informed decisions about those risks. A security team that simply says “no” doesn’t solve anyone’s problems, instead, build yourself an assessment framework that can evaluate new projects and features, and can get people to take calculated risks based on the environment the business operates in.

We call this framework the “Rapid Risk Assessment”, or RRA. Guillaume Destuynder and I introduced this framework at Mozilla back in 2013, and over the last 6 years we have ran hundreds, if not thousands, of RRAs with everyone in the organization. I still haven’t heard anyone find the exercise a waste of time. RRAs are a bit different from the standard risk assessments. They are short, 30 minutes to one hour, and focused on small components. It’s more of a security and threat discussion than a typical matrix-based risk framework, and I think this is why people actually enjoy them. One particular team even told me once they were looking forward to running the RRA on their new project. How cool is that?

Having a risk assessment framework is nice, but you can also get started without one. In the panel yesterday, Zane Lackey told the story of introducing risk assessments at Etsy by joining engineering meetings and simply asking "How would you attack this app?" to the devs. This works, I've asked similar questions many times. Guillaume's favorite is "what would happen should the database leak on Twitter?". Devs & Ops are much better at threat modeling than they often realize, and you can see the wheels spinning in their brains when you ask this type of question. Try it out, it's actually fun!

Being Strategic

By this point in the talk, I hope that I’ve convinced you security is owned well beyond the security team, and you might be tempted to think that, perhaps we could get rid of all those pesky security people altogether. I’ll be honest, that’s the end game for me. The digital world will adopt perfect security. People will carefully consider their actions and take the right amount of risk when appropriate. They will communicate effectively and remove all assumptions in their work. They will treat each other with respect during security incidents and collaborate effectively toward a resolution. And I’ll be serving fresh Mojitos at a Tiki bar over there by the beach.

Not a good deal for security conferences really. Sorry Snyk, this may have been a bad investment. But I have a feeling they’ll be doing fine for a while. Until those blessed days come, we’ve got work to do.

A mature security team may not need to hold the hands of its organization anymore, but it still has one job: it must be strategic. It must foresee threats long before they impact the organization, and it must plan defenses long before the organization has to adopt them. It’s easy for security teams to get caught into the tactical day to day, go from review to incident to review to incident, but doing so does not effectively improve anything. The real gratification, the one that no one else gets to see but your own team, is seeing every other organization but your own battle a vulnerability you’re not exposed to because of strategic choices you’ve made long ago.

Screenshot_2019-09-25_Beyond_the_Security_Team_-_DevSecCon_KeyNote_29_.png

Let me give you an example: I’m a strong supported of putting our admin panels behind VPN. Yes, VPNs. Remember them? Those old dusty tunnels that we use to love in the 2000s until zero trust became all the rage and every vendor on the planet wanted you to replace everything with their version of it. Well, guess what, we do zero trust, but we also put our most sensitive admin panels behind VPN. The reason for it is simply defense in depth. We know authentication layers fail regularly, we have tons of data to prove it, and we don’t trust them as the only layer of defense.

Do developers and operators complain about it? Yes, absolutely, almost all the time. But they also understand our perspective, and trust us to make the right call. It all ties together: you can’t make strategic decisions for your organizations if your own people don’t trust you.

So what else should you be strategic about? It really depends on where you’re operating. Are you global or local? What population do you serve? Can your services be abused for malicious purpose?

For example, imagine you’re running a web business selling birthday cards. You could decide to automatically close every account that’s created or accessed from specific parts of the world. It would be a drastic stance to take, but perhaps the loss of business is cheaper than the security cost. It’s a strategic decision to make, and it’s the role of a mature security team to help its leadership make it by informing on the risks.

I personally like the idea that each component of your environment should be allow to blow up without impacting the rest of the infrastructure. I don’t like over-centralization. This model is nice when you work with teams that have varying degrees of maturity, because you can let them do their things without worrying about dragging everyone down to the lowest common denominator. In practice, it means we don’t put everything in the same AWS accounts, we don’t use single sign on for certain things that we consider too sensitive, and so on. The point is to make strategic decisions that make sense for your organization.

Whatever decision you make, spend time documenting it, and don’t forget to have your champions review them and influence them. Build the security strategy together, so there is no confusion that this is a group effort the entire organization has bought into. It will make implementing it a whole lot easier.

So in closing, I’d like to leave you with this: a security team must help the organization make strategic security decisions. To do so, it must be trusted. To be trusted, its need to have data, avoid making assumptions, set clear expectations and to avoid saying no. And above all, it must be embedded across the organizations.

To go beyond the security team

Get the security team closer to your organization

Thank you.

Thursday, August 15 2019

The cost of micro-services complexity

By Julien Vehent, Thursday, August 15 2019.

It has long been recognized by the security industry that complex systems are impossible to secure, and that pushing for simplicity helps increase trust by reducing assumptions and increasing our ability to audit. This is often captured under the acronym KISS, for "keep it stupid simple", a design principlepopularized by the US Navy back in the 60s. For a long time, we thought the enemy were application monoliths that burden our infrastructure with years of unpatched vulnerabilities.

So we split them up. We took them apart. We created micro-services where each function, each logical component, is its own individual service, designed, developed, operated and monitored in complete isolation from the rest of the infrastructure. And we composed them ad vitam æternam. Want to send an email? Call the rest API of micro-service X. Want to run a batch job? Invoke lambda function Y. Want to update a database entry? Post it to A which sends an event to B consumed by C stored in D transformed by E and inserted by F. We all love micro-services architecture. It’s like watching dominoes fall down. When it works, it’s visceral. It’s when it doesn’t that things get interesting. After nearly a decade of operating them, let me share some downsides and caveats encountered in large-scale production environments.

High operational cost

The first problem is operational cost. Even in a devops cloud automated world, each micro-service, serverless or not, needs setup, maintenance and deployment. We never fully got to the holy grail of completely automated everything, so humans are still involved with these things. Perhaps someone sold you on the idea devs could do the ops work on their free time, but let’s face it, that’s a lie, and you need dedicated teams of specialists to run the stuff the right way. And those folks don’t come cheap.

The more services you have, the harder it is to keep up with them. First you’ll start noticing delays in getting new services deployed. A week. Two weeks. A month. What do you mean you need a three months notice to get a new service setup?

Then, it’s the deployments that start to take time. And as a result, services that don’t absolutely need to be deployed, well, aren’t. Soon they’ll become outdated, vulnerable, running on the old version of everything and deploying a new version means a week worth of work to get it back to the current standard.

QA uncertainty

A second problem is quality assurance. Deploying anything in a micro-services world means verifying everything still works. Got a chain of 10 services? Each one probably has its own dev team, QA specialists, ops people that need to get involved, or at least notified, with every deployment of any service in the chain. I know it’s not supposed to be this way. We’re supposed to have automated QA, integration tests, and synthetic end-to-end monitoring that can confirm that a butterfly flapping its wings in us-west-2 triggers a KPI update on the leadership dashboard. But in the real world, nothing’s ever perfect and things tend to break in mysterious ways all the time. So you warn everybody when you deploy anything, and require each intermediate service to rerun their own QA until the pain of getting 20 people involved with a one-line change really makes you wish you had a monolith.

The alternative is that you don’t get those people involved, because, well, they’re busy, and everything is fine until a minor change goes out, all testing passes, until two days later in a different part of the world someone’s product is badly broken. It takes another 8 hours for them to track it back to your change, another 2 to roll it back, and 4 to test everything by hand. The post-mortem of that incident has 37 invitees, including 4 senior directors. Bonus points if you were on vacation when that happened.

Huge attack surface

And finally, there’s security. We sure love auditing micro-services, with their tiny codebases that are always neat and clean. We love reviewing their infrastructure too, with those dynamic security groups and clean dataflows and dedicated databases and IAM controlled permissions. There’s a lot of security benefits to micro-services, so we’ve been heavily advocating for them for several years now.

And then, one day, someone gets fed up with having to manage API keys for three dozen services in flat YAML files and suggests to use oauth for service-to-service authentication. Or perhaps Jean-Kevin drank the mTLS Kool-Aid at the FoolNix conference and made a PKI prototype on the flight back (side note: do you know how hard it is to securely run a PKI over 5 or 10 years? It’s hard). Or perhaps compliance mandates that every server, no matter how small, must run a security agent on them.

Even when you keep everything simple, this vast network of tiny services quickly becomes a nightmare to reason about. It’s just too big, and it’s everywhere. Your cross-IAM role assumptions keep you up at night. 73% of services are behind on updates and no one dares touch them. One day, you ask if anyone has a diagram of all the network flows and Jean-Kevin sends you a dot graph he generated using some hacky python. Your browser crashes trying to open it, the damn thing is 158MB of SVG.

Most vulnerabilities happen in the seam of things. API credentials will leak. Firewall will open. Access controls will get mis-managed. The more of them you have, the harder it is to keep it locked down.

Everything in moderation

I’m not anti micro-services. I do believe they are great, and that you should use them, but, like a good bottle of Lagavulin, in moderation. It’s probably OK to let your monolith do more than one thing, and it’s certainly OK to extract the one functionality that several applications need into a micro-service. We did this with autograph, because it was obvious that handling cryptographic operations should be done by a dedicated micro-service, but we don’t do it for everything. My advice is to wait until at least three services want a given thing before turning it into a micro-service. And if the dependency chain becomes too large, consider going back to a well-managed monolith, because in many cases, it actually is the simpler approach.

Tuesday, February 5 2019

Interviewing tips for junior engineers

By Julien Vehent, Tuesday, February 5 2019.

I was recently asked by the brother of a friend who is about to graduate for tips about working in IT in the US. His situation is not entirely dissimilar to mine, being a foreigner with a permit to work in America. Below is my reply to him, that I hope will be helpful to other young engineers in similar situations.

For background, I've been working in the US since early 2011. I had a few years of experience as a security engineer in Paris when we moved. I first took a job as a systems engineer while waiting for my green card, then joined a small tech company in the email marketing space to work on systems and security, then joined Mozilla in 2013 as a security engineer. I've been at Mozilla for almost six years, now running the Firefox operations security team with a team of six people scattered across the US, Canada and the UK. I've been hiring engineers for a few years in various countries.

Are there any skills or experiences beyond programming required to be an interesting candidate for foreign employers

You're just getting started in your career, so employers will mostly look for technical proficiency and being a pleasant person to work with. Expectations are fairly low at this stage. If you can solve technical puzzles and people enjoy talking to you, you can pass the bar at most companies.

This changes after 5-ish years of experience, when employers want to see project management, technical leadership, and maybe a hint of people management too. But for now, I wouldn't worry about it.

I would say the most important thing is to have a plan: where do you want to be 10/15/20 years from now? My aim was toward Chief Security Officer roles, an executive position that requires technical skills, strategic thinking, communication, risk and project management, etc. It takes 15 to 20 years to get to that level, so I picked jobs that progressively gave me the experience needed (and that were a lot of fun, because I'm a geek).

Were there any challenges that you faced, other than immigration, that I may need to be aware of as a foreign candidate?

Immigration is the only problem to solve. When I first applied for jobs in the US, I was on a J-1 visa doing my Master's internship at University of Maryland. I probably applied to 150 jobs and didn't get a single reply, most likely because no one wants to hire candidates that need a visa (a long and expensive process). i ended up going back to France for a couple years, and came back after obtaining my green card, which took the immigration question out of the way. So my advise here is to settle the immigration question in the country where you want to work before you apply for jobs, or if you need employer support to get a visa, be up front about it when talking to them. (I have several former students who now work for US companies that have hiring processes that incorporate visa applications, so it's not unheard of, but the bar is high).

I have a Master of Science from a small french university that is completely unknown to the US, yet that never came up as an issue. A Master is a Master. Should degree equivalence come up as an issue during an interview, offer to provide a grade comparison between US GPA and your country grades (some paid service provide that). That should put employers at ease.

Language has also never been an issue for me, even with my strong french accent. If you're good at the technical stuff, people won't pay attention to your accent (at least in the US). And I imagine you're fully fluent anyway, so having a deep-dive architecture conversation in English won't be a problem.

And lastly, do you have any advice on how to stand out from the rest of the candidates aside from just a good resume (maybe some specific volunteer experience or unique skills that I could gain while I am still finishing my thesis?)

Resume are mostly useless. I spend between 30 and 60 seconds on a candidate's resume. The problem is most people pad their resumes with a lot of buzzwords and fancy-sounding projects I cannot verify, and thus cannot trust. It would seem the length of a resume is inversely proportional to the actual skills of a candidate. Recruiters use them to check for minimal requirements: has the right level of education, knows programming language X, has the right level of experience. Engineering managers will instead focus on actual technical questions to assess your skills.

At your level, keep you resume short. My rule of thumb is one page per five years of experience (you should only have a single page, I have three). This might contradict advise you're reading elsewhere that recommend putting your entire life story in your resume, so if you're concerned about not having enough details, make two versions: the one page short overview (linkedin-style), and the longer version hosted on your personal site. Offer a link to the longer version in the short version, so people can check it out if they want to, but most likely won't. Recruiters have to go through several hundred candidates for a single position, so they don't have time, or care for, your life story. (Someone made a short version of Marissa Meyer's resume that I think speaks volume).

Make sure to highlight any interesting project you worked on, technical or otherwise. Recruiters love discussing actual accomplishment. Back when I started, I had a few open source projects and articles written in technical magazines that I put on my resume. Nowadays, that would be a GitHub profile with personal (or professional, if you're lucky) projects. You don't need to rewrite the Linux kernel, but if you can publish a handful of tools you developed over the years, it'll help validate your credentials. Just don't go fork fancy projects to pad your GitHub profile, it won't fool anyone (I know, it sounds silly, but I see that all too often).

Another thing recruiters love is Hackerrank, a coding challenge website used by companies to verify the programming skills of prospective candidates. It's very likely US companies will send you some sort of coding challenge as part of the interview process (we even do it before talking to candidates nowadays). My advise is to spend a few weekends building a profile on Hackerrank and getting used to the type of puzzle they ask for. This is similar to what the GAFA ask for in technical interviews ("quicksort on a whiteboard" type of questions).

At the end of the day, I expect a junior engineer to be smart and excited about technology, if not somewhat easily distracted. Those are good qualities to show during an interview and on your resume.

My last advise would be to pick a path you want to follow for the next few years and be very clear about it when interviewing. You should have a goal. You have a lot of degrees, and recruiters will ask you what you're looking for. So if you want to go into the tech world, be up front about it and tell them you want to focus on engineering for the foreseeable future. In my experience, regardless of your level of education, you need to start at the bottom of the ladder and climb your way up. A solid education will help you climb a lot faster than other folks, and you could reach technical leadership in just a couple years at the right company, or move to a large corporation and use that fancy degree to climb the management path. In both cases, I would recommend getting those first couple years of programming and engineering work under your belt first. Heck, even the CTO of Microsoft started as a mere programmer!

I hope this helps folks who are getting started. And I'm always happy to answer questions from junior engineers, so don't hesitate to reach out!

Thursday, January 17 2019

Maybe don't throw away your VPN just yet...

By Julien Vehent, Thursday, January 17 2019.

Over the past few years I've followed the rise of the BeyondCorp project, Google's effort to move away from perimetric network security to identity-based access controls. The core principle of BeyondCorp is to require strong authentication to access resources rather than relying on the source IP a connection originates from. Don't trust the network, authenticate all accesses, are requirements in a world where your workforce is highly distributed and connects to privileged resources from untrusted networks every day. They are also a defense against office and datacenter networks that are rarely secure enough for the data they have access to. BeyondCorp, and zero trust networks, are good for security.

This isn't new. Most modern organizations have completely moved away from trusting source IPs and rely on authentication to grant access to data. But BeyondCorp goes further by recommending that your entire infrastructure should have a foot on the Internet and protect access using strong authentication. The benefits of this approach are enormous: employees can be fully mobile and continue to access privileged resources, and compromising an internal network is no longer sufficient to compromise the entire organization.

As a concept, this is good. And if you're hosting on GCP or are willing to proxy your traffic through GCP, you can leverage their Identity and Access Proxy to implement these concepts securely. But what about everyone else? Should you throw away your network security and put all your security in the authentication layer of your applications? Maybe not...

At Mozilla, we've long adopted single sign on, first using SAML, nowadays using OpenID Connect (OIDC). Most of our applications, both public facing and internal, require SSO to protect access to privileged resources. We never trust the network and always require strong authentication. And yet, we continue to maintain VPNs to protect our most sensitive admin panels.

"How uncool", I hear you object, "and here we thought you were all about DevOps and shit". And you would be correct, but I'm also pragmatic, and I can't count the number of times we've had authentication bugs that let our red team or security auditors bypass authentication. The truth is, even highly experienced programmers and operators make mistakes and will let a bug disable or fail to protect part of that one super sensitive page you never want to leave open to the internet. And I never blame them because SSO/OAuth/OIDC are massively complex protocols that require huge libraries that fail in weird and unexpected ways. I've never reached the point where I fully trust our SSO, because we find one of those auth bypass every other month. Here's the catch: they never lead to major security incidents because we put all our admin panels behind a good old VPN.

Those VPN that no one likes to use or maintain (me included) also provide a stable and reliable security layer that simply never fails. They are far from perfect, and we don't use them to authenticate users or grant access to resources, but we use them to cover our butts when the real authentication layer fails. So far, real world experience continues to support this model.

So, there, you have it: adopt BeyondCorp and zero trust networks, but also consider keeping your most sensitive resources behind a good old VPN (or an SSH jumphost, whatever works for you). VPNs are good at reducing your attack surface and adding an extra layer of protection to your infrastructure. You'll be thankful to have one the next time you find a bypass in your favorite auth library.

Monday, August 6 2018

Running with Runscribe Plus

By Julien Vehent, Monday, August 6 2018.

Six months ago, I broke the bank and bought a pair of Runscribe Plus running pods. I was recovering from a nasty sprain of the deltoid ligaments acquired a year ago while running a forest trails, and was looking for ways to reduce the risk of further damage during my runs.

Also, I love gadgets, so that made for a nice excuse! :)

After reading various articles about the value of increasing step rates to decrease risk of injury, I looked into various footpod options, the leading of which is the Stryd, but wanted to monitor both feet which only the Runscribe Plus can do. So I did something I almost never do: ordering a gadget that hasn't been heavily reviewed by many others (to the exception of the5krunner).

Runscribe Plus, which I'll abbreviate RS, is a sensor that monitors your feet movement and impact while you run. It measures:

pronation: rolling of the foot, particularly useful to prevent sprains
footstrike: where you foot hits the ground, heel, middle or front
shock: how much force your feet hit the ground with
step rate
stride length
contact time
and a bunch of calculation based on these raw metrics

WhatsApp_Image_2018-08-06_at_13.34.27.jpeg

My RS arrived less than a week after ordering them, but I couldn't use them right away. After several hours of investigation and back and forth with the founder, Tim Clark, by email, we figured out that my pods shipped with a bogus firmware. He remotely pushed a new version, which I updated to using the android app, and the pods started working as intended.

Usability is great. RS starts recording automatically when the step rate goes above 140 (I usually run around 165spm), and also stop automatically at the end of a run. The android app then downloads running data from each pod and uploads it to the online dashboard. Both the app and the webui can be used to go through the data, and while the app is fine to visualize data, I do find the webui to be a lot more powerful and convenient to use.

Screenshot_2018-08-06_RunScribe_-_Data_Driven_Athlete.png

The cool thing about RS is being able to compare left and right foot, because each foot is measured separately. This is useful to detect and correct balance issues. In my case, I noticed after a few run that my injured foot, the left one, was a lot less powerful than the right one. It was still painful, and I simply couldn't push on it as much, and the right foot was compensating and taking a lot more shock. I tried to make a conscious effort to reduce this imbalance over the following month, and it seem to have paid off in the end.

Screenshot_2018-08-06_RunScribe_-_Data_Driven_Athlete_2.png

The RunScribe Dashboard displays shock data for each foot recorded during a 5k. The dark red line represents the right foot and is taking a lot more shock that the light red one representing the left foot.

It's possible to use the RS to measure distance, but a lot of users on the forum have been complaining about distance accuracy issues. I've run into some of those, even after calibrating the pods to my stride length over a dozen runs. I would go for a 5 miles run with my gps watch and RS would measure a distance of anything between 4 and 6 miles. RS doesn't have a GPS, so it bases those calculations on your stride length and step count. Those inaccuracies didn't really bother me, because you can always update the distance in the app or webui after the fact, which also helps train the pod, and I am more interested in other metrics anyway.

That being said, distance inaccuracy is completely gone. According to Garmin, this morning's run was 8.6 miles, which RS recorded as 8.5 miles. That's a 1% margin of error, and I honestly can't tell which one is right between RS and Garmin.

So what changed? I was previously placing the pods on the heels of my shoes but recently moved to the laces, which may have helped. I also joined the beta program to get early firmware updates, and I think Tim has been tweaking distance calculation quite a bit lately. At any rate, this is now working great.

Screenshot_2018-08-06_RunScribe_-_Data_Driven_Athlete_3.png

RS can also broadcast live metrics to your running watch, which can then be displayed on their own screen. I don't find those to be very useful, so I don't make use of it, but it does provide real-time shock and step rate and what not.

What about Power?

I'll be honest, I have no idea. Running power doesn't seem extremely useful to me, or maybe I need to spend more time studying its value. RS does expose a Power value, so if this is your thing, you may find it interesting.

Take Away

RS is definitely not for everyone. It has its rough edges and exposes data you need to spend time researching to understand and make good use off. That said, whether you're a professional athlete or, like me, just a geek who likes gadgets and data, it's a fantastic little tool to measure your progress and tweak your effort in areas you wouldn't be able to identify on your own. I like it a lot, and I think more people will adopt this type of tool in the future.

Did it help with my ankle recovery? I think so. Tracking pronation and shock metrics was useful to make sure I wasn't putting myself at risk again. The imbalance data is probably the most useful information I got out of the RS that I couldn't get before, and definitely justifies going with a system with 2 pods instead of one. And, if anything else, it helped me regain confidence in my ability to do long runs without hurting myself.

Screenshot_2018-08-06_RunScribe_-_Data_Driven_Athlete_4.png

Footstrike metrics for left and right foot recorded during a half marathon shows I mostly run on the middle and back of my feet. How to use that data is left as an exercise to the runner.

Last, but most certainly not least, Tim Clark and the Runscribe team are awesome. Even with the resources of a big shop like Garmin, it's not easy to take an experimental products through rounds of testing while maintaining a level of quality that satisfies runners accustomed to expensive running gear ($700 watches, $200 shoes, etc.). For a small team to pull this off is a real accomplishment, all while being accessible and creating a friendly community of passionate runners. It's nice to support the underdog every once in a while, even when that means having to live with minor bugs and being patient between updates.

note: this blog post was not sponsored by runscribe in any way, I paid for my own pods and have no affiliation with them, other than being a happy customer.

Friday, September 22 2017

On the prevalence of cross-site scripting (XSS) attacks in modern web applications

By Julien Vehent, Friday, September 22 2017.

As I attended AppSec USA in Orlando, a lot of discussions revolved around the OWASP Top 10. Setting the drama aside for a moment, there is an interesting discussion to be had on the most common vulnerabilities found in current web applications. The Top 10 from 2013 and 2017 (rc1) hasn’t changed much: first and foremost are injection issues, then broken auth and session management, followed by cross-site scripting attacks.

At first glance, this categorizing appears sensible, and security teams and vendors have depended on it for over a decade. But here’s the issue I have with it: the data we collect from Mozilla’s web bug bounty program strongly disagrees with it. From what we see, injection and authentication/session attacks are nowhere near as common as cross-site scripting attacks. In fact, throughout 2016 and 2017, we have received over five time more XSS reports than any other vulnerability!

This is certainly a huge difference between our data and OWASP's, but Mozilla's dataset is also too small to draw generalities from. Thankfully, both Bugcrowd and Hackerone have published reports that show a similar trend.
In their 2017 report, Bugcrowd said "cross-site scripting (XSS) and Cross Site Request Forgery (CSRF) remain the most reported submissions across industries accounting for 25% and 7% of submissions respectively. This distribution very closely reflects last year’s findings (XSS 25% and CSRF 8%).".
Hackerone went further in their report, and broke the vulnerability stats down by industry, saying that “in all industries except for financial services and banking, cross-site scripting (XSS, CWE-79) was the most common vulnerability type discovered by hackers using the HackerOne platform. For financial services and banking, the most common vulnerability was improper authentication (CWE-287). Healthcare programs have a notably high percentage of SQL injection vulnerabilities (6%) compared to other industries during this time period”.

This data confirms what we’re seeing at Mozilla, but to be fair, not everyone agrees. Whitehat also published their own report where XSS only ranks third, after insufficient transport layer protection and information leakage. Still, even in this report, XSS ranks higher than authentication/authorization and injection issues.

All three sources that shows XSS as the most prevalent issue come from bug bounty programs, and there is a strong chance that bug bounty reporters are simply more focused on XSS than other attacks. That said, when looking at modern web applications (and Mozilla’s web applications are fairly modern), it is rare to find issues in the way authentication and authorization is implemented. Most modern web frameworks have stable and mature support for authentication, authorization, access controls, session management, etc. There’s also a big trend to rely on external auth providers with SAML or OpenID Connect that removed implementation bugs we saw even 4 or 5 years ago. What about non-xss injections? We don’t get that many either. In the handful of services that purposely accept user data, we’ve been paranoid about preventing vulnerabilities, and it seem to have worked well so far. The data we get from security audits, outside of bug bounties, seem to confirm that trend.

In comparison, despite continued efforts from the security community to provide safe escaping frameworks like Cure53’s DOMPurify or Mozilla’s Bleach, web applications are still bad at escaping user-provided content. It’s hard to blame developers here, because the complexity of both the modern web and large applications is such that escaping everything all the time is an impossible goal. As such, the rate of XSS in web applications has steadily increased over the last few years.

What about content security policy? It helps, for sure. Before we enabled CSP on addons.mozilla.org, we had perhaps one or two XSS reports every month. After enabling it, we hardly get one or two per year. For sure, CSP bypass is possible, but not straightforward to achieve, and often sufficient to fend off an attacker (see screenshots from security audit reports below). The continued stream of XSS reports we receive is from older applications that do not use CSP, and the data is a strong signal that we should continue pushing for its adoption.

So, how do we explain the discrepancy between what we’re seeing at Mozilla, Bugcrowd and Hackerone, and what other organizations are reporting as top vulnerabilities? My guess is a lot of vendors are reviewing very old applications that are still vulnerable to issues we’ve solved in modern frameworks, and that Mozilla/Bugcrowd/Hackerone mostly see modern apps. Another possibility is those same vendors have no solutions to XSS, but plenty of commercial solutions to other issues, and thus give them more weight as a way to promote their products. Or we could simply all have bad data and be drawing wrong conclusions.

Regardless of what is causing this discrepancy, there’s evidently a gap between what we’re seeing as the most prevalent issues, and what the rest of the security community, and particularly the OWASP Top 10, is reporting. Surely, this is going to require more digging, so if you have data, please do share it, so we can focus security efforts on the right issues!

Thank you to Greg Guthe and Jonathan Claudius for reviewing drafts of this blog post

Monday, September 11 2017

Lessons learned from mentoring

By Julien Vehent, Monday, September 11 2017.

Over the last few weeks, a number of enthusiastic students have asked me when the registration for the next edition of Mozilla Winter of Security would open. I've been saddened to inform them that there won't be an edition of MWoS this year. I understand this is disappointing to many who were looking forward to work on cool security projects alongside experienced engineers, but the truth is simply don't have the time, resources and energy to mentor students right now.

Firefox engineers are cramming through bugs for the Firefox 57 release, planned for November 14th. We could easily say "sorry, too busy making Firefox awesome, kthnksbye", but there is more to the story of not running MWoS this year than the release of 57. In this blog post, I'd like to explore some of these reasons, and maybe share tips with folks who would like to become mentors.

After running MWoS for 3 years, engaging with hundreds of students and personally mentoring about a dozen, I learned two fundamental lessons:

The return on investment is extremely low, when it's not a direct loss to the mentor.
Students engagement is very hard to maintain, and many are just in it for the glory.

Those are hard-learned lessons that somewhat shattered my belief in mentoring. Let's dive into each.

Return on investment

Many mentors will tell you that having an altruistic approach to mentoring is the best way to engage with students. That's true for short engagements, when you spare a few minutes to answer questions and give guidance, but it's utter bullshit for long engagements.

It is simply not realistic to ask engineers to invest two hours a week over four months without getting something out of it. Your time is precious, have some respect for it. When we initially structured MWoS, we made sure that each party (mentors, students and professors) would get something out of it, specifically:

Mentors get help on a project they would not be able to complete alone.
Students get a great experience and a grade as part of their school curriculum.
Professors get interesting projects and offload the mentoring to Mozilla.

Making sure that students received a grade from their professors helped maintain their engagement (but only to some extend, more on that later), and ensured professors approved of the cost a side project would make to their very-busy-students.

The part that mattered a lot for us, mentors, besides helping train the next generation of engineers, was getting help on projects we couldn't complete ourselves. After running MWoS for three years and over a few dozen projects, the truth is we would be better off writing the code ourselves in the majority of cases. The time invested in teaching students would be better used implementing the features we're looking for, because even when students completed their projects, the code quality was often too low for the features to be merged without significant rewrites.

There have been exceptions, of course, and some teams have produced code of good quality. But those have been the exceptions, not the rule. The low return on investment (and often negative return when mentors invested time into projects that did not complete), meant that it became increasingly hard for busy engineers to convince their managers to dedicate 5 to 10% of their time supporting teams that will likely produce low quality code, if any code at all.

It could be said that we sized our projects improperly, and made them too complex for students to complete. It's a plausible explanation, but at the same time, we have not observed a correlation between project complexity and completion. This leads into the next point.

Students engagement is hard to maintain

You would imagine that a student who is given the opportunity to work with Mozilla engineers for several months would be incredibly engaged, and drop everything for the opportunity to work on interesting, highly visible, very challenging projects. We've certainly seen students like that, and they have been fantastic to work with. I remain friends with a number of them, and it's been rewarding to see them grow into accomplished professional who know way more about the topics I mentored them on than I do today. Those are the good ones. The exceptions. The ones that keep on going when your other mentoring projects keep on failing.

And then, you have the long tail of students who have very mixed interest in their projects. Some are certainly overwhelmed by their coursework and have little time to dedicate to their projects. I have no issue with overwhelmed students, and have repeatedly told many of my mentee to prioritize their coursework and exams over MWoS projects.

The ones that rub me the wrong way are students that are more interested in getting into MWoS than actually completing their projects. This category of resume-padding students cares for the notoriety of the program more than the work they accomplish. They are very hard to notice at first, but after a couple years of mentoring, you start to see the patterns: eagerness to name-drop, github account filled with forks of projects and no authored code, vague technical answers during interview questions, constant mention of their references and people they know, etc.

When you mentor students that are just in it for the glory, the interest in the project will quickly drop. Here's how it usually goes:

By week 2, you'll notice students have no plan to implement the project, and you find yourself holding their hands through the roadmap, sometimes explaining concepts so basic you wonder how they could not be familiar with them yet.
By week 4, students are still "going through the codebase to understand how it is structured", and have no plans to implement the project yet. You spend meeting explaining how things work, and grow frustrated by their lack of research. Did they even look at this since our last meeting?
By week 6, you're pretty much convinced they only work on the project for 30min chunks when you send them a reminder email. The meetings are becoming a drag, a waste of a good half hour in your already busy week. Your tone changes and you become more and more prescriptive, less and less enthusiastic. Students nod, but you have little hope they'll make progress.
By week 8, it's the mid-term, and no progress is made for another month.

You end up cancelling the weekly meeting around week 10, and ask students to contact you when they have made progress. You'll hear back from them 3 months later because their professor is about to grade them. You wonder how that's going to work, since the professor never showed up to the weekly meeting, and never contacted you directly for an assessment. Oh well, they'll probably get an A just because they have Mozilla written next to their project...

This is a somewhat overly dramatic account of a failed engagement, but it's not at all unrealistic. In fact, in the dozen projects I mentored, this probably happened on half of them.

The problem with lowly-engaged students is that they are going to drain your motivation away. There is a particular light in the eye of the true nerd-geek-hacker-engaged-student that makes you want to work with them and guide them through their mistakes. That's the reward of a mentor, and it is always missing from students that are not engaged. You learn to notice it after a while, but often long after the damage done by the opportunists have taken away your interest in mentoring.

Will MWoS rise from the ashes?

The combination of low return on investment and poorly engaged students, in addition to a significant increase in workload, made us cancel this year's round. Maybe next year, if we find the time and energy, we will run MWoS again. It's also possible that other folks at Mozilla, and in other organizations, will run similar programs in the future. Should we run it again, we would be a lot stricter on filtering students, and make sure they are ready to invest a lot of time and energy into their projects. This is fairly easy to do: throw them a challenge during the application period, and check the results. "Implement a crude Diffie-Hellman chat on UDP sockets, you've got 48 hours", or anything along those line, along with a good one hour conversation, ought to do it. We were shy to ask those questions at first, but it became obvious over the years that stronger filtering was desperately needed.

For folks looking to mentor, my recommendation is to open your organization to internships before you do anything else. There's a major difference in productivity between interns and students, mostly because you control 100% of an intern's daily schedule, and can make sure they are working on the tasks you assign them too. Interns often complete their projects and provide direct value to the organization. The same cannot be said by mentee of the MWoS program.

Wednesday, January 18 2017

Video-conferencing the right way

By Julien Vehent, Wednesday, January 18 2017. General

I work from home. I have been doing so for the last four years, ever since I joined Mozilla. Some people dislike it, but it suits me well: I get the calm, focused, relaxing environment needed to work on complex problems all in the comfort of my home.

Even given the opportunity, I probably wouldn't go back to working in an office. For the kind of work that I do, quiet time is more important than high bandwidth human interaction.

Yet, being able to talk to my colleagues and exchanges ideas or solve problems is critical to being productive. That's where the video-conferencing bit comes in. At Mozilla, we use ~~Vidyo~~ Zoom primarily, sometimes Hangout and more rarely Skype. We spend hours every week talking to each other via webcams and microphones, so it's important to do it well.

Having a good video setup is probably the most important and yet least regarded aspect of working remotely. When you start at Mozilla, you're given a laptop and a Zoom account. No one teaches you how to use it. Should I have an external webcam or use the one on your laptop? Do I need headphones, earbuds, a headset with a microphone? What kind of bandwidth does it use? Those things are important to good telepresence, yet most of us only learn them after months of remote work.

When your video setup is the main interface between you and the rest of your team, spending a bit of time doing it right is far from wasted. The difference between a good microphone and a shitty little one, or a quiet room and taking calls from the local coffee shop, influence how much your colleagues will enjoy working with you. I'm a lot more eager to jump on a call with someone I know has good audio and video, than with someone who will drag me in 45 minutes of ambient noise and coughing in his microphone.

This is a list of tips and things that you should care about, for yourself, and for your coworkers. They will help you build a decent setup with no to minimal investment.

The place

It may seem obvious, but you shouldn't take calls from a noisy place. Airports, coffee shops, public libraries, etc. are all horribly noisy environments. You may enjoy working from those places, but your interlocutors will suffer from all the noise. Nowadays, I refuse to take calls and cut meetings short when people try to force me into listening to their surrounding. Be respectful of others and take meetings from a quiet space.

Bandwidth

Despite what ISPs are telling you, no one needs 300Mbps of upstream bandwidth. Take a look at the graph below. It measures the egress point of my gateway. The two yellow spikes are video meetings. They don't even reach 1Mbps! In the middle of the second one, there's a short spike at 2Mbps when I set Vidyo to send my stream at 1080p, but shortly reverted because that software is broken and the faces of my coworkers disappeared. Still, you get the point: 2Mbps is the very maximum you'll need for others to see you, and about the same amount is needed to download their streams.

You do want to be careful about ping: latency can increase up to 200ms without issue, but even 5% packet drop is enough to make your whole experience miserable. Ask Tarek what bad connectivity does to your productivity: he works from a remote part of france where bandwidth is scarce and latency is high. I coined him the inventor of the Tarek protocol, where you have to repeat each word twice for others to understand what you're saying. I'm joking, but the truth is that it's exhausting for everyone. Bad connectivity is tough on remote workers.

(Tarek thought it'd be worth mentioning that he tried to improve his connectivity by subscribing to a satellite connection, but ran into issues in the routing of his traffic: 700ms latency was actually worse than his broken DSL.)

Microphone

Perhaps the single most important aspect of video-conferencing is the quality of your microphone and how you use it. When everyone is wearing headphones, voice quality matters a lot. It is the difference between a pleasant 1h conversation, or a frustrating one that leaves you with a headache.

Rule #1: MUTE!

Let me say that again: FREAKING MUTE ALREADY!

Video softwares are terrible at routing the audio of several people at the same time. This isn't the same as a meeting room, where your brain will gladly separate the voice of someone you're speaking to from the keyboard of the dude next to you. On video, everything is at the same volume, so when you start answering that email while your colleagues are speaking, you're pretty much taking over their entire conversation with keyboard noises. It's terrible, and there's nothing more annoying than having to remind people to mute every five god damn minutes. So, be a good fellow, and mute!

Rule #2: no coughing, eating, breathing, etc... It's easy enough to mute or move your microphone away from your mouth that your colleagues shouldn't have to hear you breathing like a marathoner who just finished the olympics. We're going back to rule #1 here.

Now, let's talk about equipment. A lot of people neglect the value of a good microphone, but it really helps in conversations. Don't use your laptop microphone, it's crap. And so is the mic on your earbuds (yes, even the apple ones). Instead, use a headset with a microphone.

If you have a good webcam, it's somewhat ok to use the microphone that comes with it. The Logitech C920 is a popular choice. The downside of those mics is they will pick up a lot of ambient noise and make you sound distant. I don't like them, but it's an acceptable trade-off.

If you want to go all out, try one of those fancy podcast microphones, like the Blue Yeti.

You most definitely don't need that for good mic quality, but they sound really nice. Here's a recording comparing my camera's embedded mic, Planctonic headsets, a Blue Yeti and a Deity S-Mic 2 shotgun microphone.

Webcam

This part is easy because most laptops already come with 720p webcam that provide decent video quality. I do find the Logitech renders colors and depth better than the webcam embedded on my Lenovo Carbon X1, but the difference isn't huge.

The most important part of your webcam setup should be its location. It's a bit strange to have someone talk to you without looking straight at you, but this is often what happens when people place their webcam to the side of their screen.

I've experimented a bit with this, and my favorite setup is to put the webcam right in the middle of my screen. That way, I'm always staring right at it.

It does consume a little space in the middle of my display, but with a large enough screen - I use an old 720p 35" TV - doesn't really bother me.

Lighting and background are important parameters too. Don't bring light from behind, or your face will look dark, and don't use a messy background so people can focus on what you're saying. These factors contribute to helping others read your facial expressions, which are an important part of good communication. If you don't believe me, ask Cal Lightman ;).

Spread the word!

In many ways, we're the first generation of remote workers, and people are learning how to do it right. I believe video-conferencing is an important part of that process, and I think everyone should take a bit of time and improve their setup. Ultimately, we're all a lot more productive when communication flows easily, so spread the word, and do tell your coworkers when they setup is getting in the way of good conferencing.

Thursday, August 4 2016

TLS stats from 1.6 billion connections to mozilla.org

By Julien Vehent, Thursday, August 4 2016.

One of the most challenging task of editing Server Side TLS is figuring out which ciphersuites are appropriate in which levels. Over the years, we've made judgment calls based on our experience and understanding of the TLS ecosystem, but finding hard data about clients in the wild is always difficult.

For the next revision of the guidelines, I wanted to collect real-world data to decide if we could prune unused ciphersuites from the Intermediate configuration. People often complain that the list of ciphersuites is too long. So far, we've taken a conservative approach, preferring a long list of ciphersuites that accepts all clients rather than being restrictive and potentially breaking the Internet for a minority on unusual devices.

Collecting TLS statistics is actually harder than one might think. Modern infrastructures terminate TLS ahead of application servers and don't always provide ciphersuite information in their logs. We run most of our services in AWS behind ELBs, so there's no way to collect statistics at scale.

Last year, we moved mozilla.org behind Cloudflare to support certificate switching and continue serving our users on very old systems. A few months ago, Cloudflare added TLS protocol and ciphersuite information to their access logs, so we finally had a solid data source to evaluate client support.

Mozilla.org is an excellent target to evaluate client diversity because it receives traffic from all sorts of devices from all over the world. It's not an opinionated site that only certain type of people would visit. It's not region or language specific (the site supports dozens of languages). And it's the main entry point to download Firefox.

I collected logs from Cloudflare intermittently over the course of a few weeks, to get an evenly distributed sample of clients connections. That data is represented in the table below.


  
    Percentage
    Hits
    Protocol
    Ciphersuite
  
  

    80.300%
    1300142157
    TLSv1.2
    ECDHE-RSA-AES128-GCM-SHA256
  
  
    9.900%
    160597128
    TLSv1
    ECDHE-RSA-AES128-SHA
  
  
    2.800%
    45350538
    TLSv1
    DES-CBC3-SHA
  
  
    2.500%
    42058051
    TLSv1.2
    ECDHE-RSA-CHACHA20-POLY1305
  
  
    2.000%
    33972517
    TLSv1.2
    ECDHE-RSA-AES128-SHA256
  
  
    0.800%
    13869096
    none
    NONE
  
  
    0.400%
    6709309
    TLSv1.2
    ECDHE-RSA-CHACHA20-POLY1305-D
  
  
    0.200%
    4311348
    TLSv1
    AES128-SHA
  
  
    0.200%
    3629674
    SSLv3
    DES-CBC3-SHA
  
  
    0.100%
    3155150
    TLSv1.1
    ECDHE-RSA-AES128-SHA
  
  
    0.100%
    1968795
    TLSv1.2
    AES128-GCM-SHA256
  
  
    0.000%
    1110501
    SSLv3
    AES128-SHA
  
  
    0.000%
    860476
    TLSv1.2
    ECDHE-RSA-AES128-SHA
  
  
    0.000%
    540913
    TLSv1.2
    AES128-SHA256
  
  
    0.000%
    139800
    SSLv3
    ECDHE-RSA-AES128-SHA
  
  
    0.000%
    83537
    TLSv1.2
    AES128-SHA
  
  
    0.000%
    77433
    TLSv1.1
    AES128-SHA
  
  
    0.000%
    16728
    TLSv1.2
    DES-CBC3-SHA
  
  
    0.000%
    5550
    TLSv1.2
    ECDHE-RSA-DES-CBC3-SHA
  
  
    0.000%
    2836
    TLSv1.2
    AES256-SHA256
  
  
    0.000%
    2050
    TLSv1.2
    ECDHE-RSA-AES256-SHA
  
  
    0.000%
    1421
    TLSv1
    ECDHE-RSA-DES-CBC3-SHA
  
  
    0.000%
    570
    TLSv1.2
    ECDHE-RSA-AES256-GCM-SHA384
  
  
    0.000%
    386
    TLSv1
    ECDHE-RSA-AES256-SHA
  
  
    0.000%
    141
    TLSv1.2
    AES256-SHA
  
  
    0.000%
    128
    TLSv1
    AES256-SHA
  
  
    0.000%
    66
    TLSv1.3
    ECDHE-RSA-AES128-GCM-SHA256
  
  
    0.000%
    32
    TLSv1.2
    ECDHE-RSA-AES256-SHA384
  
  
    0.000%
    24
    TLSv1.1
    DES-CBC3-SHA
  
  
    0.000%
    8
    SSLv3
    AES256-SHA
  
  
    0.000%
    8
    SSLv3
    ECDHE-RSA-AES256-SHA
  
  
    0.000%
    8
    SSLv3
    ECDHE-RSA-DES-CBC3-SHA
  
  
    0.000%
    8
    TLSv1.1
    AES256-SHA
  
  
    0.000%
    8
    TLSv1.1
    ECDHE-RSA-AES256-SHA
  
  
    0.000%
    8
    TLSv1.1
    ECDHE-RSA-DES-CBC3-SHA
  
  
    0.000%
    8
    TLSv1.2
    AES256-GCM-SHA384

Percentage	Hits	Protocol	Ciphersuite
80.300%	1300142157	TLSv1.2	ECDHE-RSA-AES128-GCM-SHA256
9.900%	160597128	TLSv1	ECDHE-RSA-AES128-SHA
2.800%	45350538	TLSv1	DES-CBC3-SHA
2.500%	42058051	TLSv1.2	ECDHE-RSA-CHACHA20-POLY1305
2.000%	33972517	TLSv1.2	ECDHE-RSA-AES128-SHA256
0.800%	13869096	none	NONE
0.400%	6709309	TLSv1.2	ECDHE-RSA-CHACHA20-POLY1305-D
0.200%	4311348	TLSv1	AES128-SHA
0.200%	3629674	SSLv3	DES-CBC3-SHA
0.100%	3155150	TLSv1.1	ECDHE-RSA-AES128-SHA
0.100%	1968795	TLSv1.2	AES128-GCM-SHA256
0.000%	1110501	SSLv3	AES128-SHA
0.000%	860476	TLSv1.2	ECDHE-RSA-AES128-SHA
0.000%	540913	TLSv1.2	AES128-SHA256
0.000%	139800	SSLv3	ECDHE-RSA-AES128-SHA
0.000%	83537	TLSv1.2	AES128-SHA
0.000%	77433	TLSv1.1	AES128-SHA
0.000%	16728	TLSv1.2	DES-CBC3-SHA
0.000%	5550	TLSv1.2	ECDHE-RSA-DES-CBC3-SHA
0.000%	2836	TLSv1.2	AES256-SHA256
0.000%	2050	TLSv1.2	ECDHE-RSA-AES256-SHA
0.000%	1421	TLSv1	ECDHE-RSA-DES-CBC3-SHA
0.000%	570	TLSv1.2	ECDHE-RSA-AES256-GCM-SHA384
0.000%	386	TLSv1	ECDHE-RSA-AES256-SHA
0.000%	141	TLSv1.2	AES256-SHA
0.000%	128	TLSv1	AES256-SHA
0.000%	66	TLSv1.3	ECDHE-RSA-AES128-GCM-SHA256
0.000%	32	TLSv1.2	ECDHE-RSA-AES256-SHA384
0.000%	24	TLSv1.1	DES-CBC3-SHA
0.000%	8	SSLv3	AES256-SHA
0.000%	8	SSLv3	ECDHE-RSA-AES256-SHA
0.000%	8	SSLv3	ECDHE-RSA-DES-CBC3-SHA
0.000%	8	TLSv1.1	AES256-SHA
0.000%	8	TLSv1.1	ECDHE-RSA-AES256-SHA
0.000%	8	TLSv1.1	ECDHE-RSA-DES-CBC3-SHA
0.000%	8	TLSv1.2	AES256-GCM-SHA384

Unsurprisingly, ECDHE-RSA-AES128-GCM-SHA256 accounts for over 80% of the traffic, as this ciphersuite is preferred in both Firefox and Chrome. It's a good news that most of our users benefit from that level of security, but it doesn't help us understand backward compatibility challenges.

More interesting are the following two entries that both negotiate TLSv1 with ECDHE-RSA-AES128-SHA (9.9%) and DES-CBC3-SHA (2.8%). Almost 13% of the traffic to mozilla.org is stuck in TLSv1 land. The presence of DES-CBC3-SHA in third position is a strong sign that we're nowhere done supporting old clients that don't even know what AES is.

The stat I was most curious about is in 9th position: SSLv3 with DES-CBC3-SHA, which accounts for 0.2% of the traffic, is a signature from Windows XP pre-sp3 clients, when SChannel didn't support TLSv1 or AES. 0.2% may seem insignificant, unless you're one of these users and the only way you will browse the internet is by first downloading Firefox from mozilla.org. We certainly don't recommend for anyone to enable SSLv3 on their site, unless you're in this very special category that needs backward compatibility with very old clients. Mozilla.org is one of those sites.

The rest of the stats are mostly what one would expect: a long list of randomly ordered ciphersuites from devices with a variety of preferences. ECDHE-RSA-CHACHA20-POLY1305 is in that list, but only at 2.5%. Cloudflare doesn't support any of the legacy ciphers like CAMELLIA or SEED, so we can't see if any of those are in use (I would expect them not to be, but who knows...). We can also assume that the handful of SSLv3 connections at the bottom are scanners, since I doubt we have many clients stuck in SSLv3 but with ECDHE-RSA-AES256-SHA support.

What are we missing?

The information we're missing the most is DHE support. Cloudflare doesn't enable it anymore, and it would be interesting to know how many clients out there will only negotiate DHE. I'm suspecting most will fall back to some non-PFS AES based ciphersuite, but proof would be nice. Ultimately, removing DHE from the Intermediate recommendations is a goal, given how difficult it's been for operators to generate DHE parameters securely.

Statistics on ECDSA usage would also be nice. We currently use RSA certificates for mozilla.org, but we're more and more recommending ECDSA certs instead (P-256 is preferred in the Modern configuration level). An interesting experiment would be to perform cert switching between RSA SHA1, RSA SHA256 and ECDSA SHA256.

In conclusion

This data is good enough to start drafting the next edition of Server Side TLS. We'll look at removing DHE support from the Intermediate configuration, and maybe limit the non-PFS ciphersuites to one or two AES fallbacks. It would appear, however, that DES-CBC3-SHA is here to stay for a little while longer...

Friday, November 13 2015

On endpoint security, vendors and root access.

By Julien Vehent, Friday, November 13 2015.

Endpoint security typically comes in two flavors: with or without a local agent. They both do the same thing - reach out to your endpoints and run a bunch of tests - but one will reach out to your systems over SSH, while the other will require a local agent to be deployed on all endpoints. Both approaches often share the same flaw: the servers that operate the security solution have the keys to become root on all your endpoints. These servers become targets of choice: take control of them, and you are root across the infrastructure.

I have evaluated many endpoint security solutions over the past few years, and I too often run into this broken approach to access control. In extreme cases, vendors are even bold enough to sell hosted services that require their customers to grant root accesses to SaaS operating as blackboxes. These vendors are successful, so I imagine they find customers who think sharing root accesses that way is acceptable. I am not one of them.

For some, trust is a commodity that should be outsource-able. They see trust as something you can write into a contract, and not worry about it afterward. To some extend, contracting does help with trust. More often than not, however, trust is earned over time, and contracts only seal the trust relationship both parties have already established.

I trust AWS because they have a proven track record of doing things securely. I did not use to, but time and experience have changed my mind. You, however, young startup that freshly released a fancy new security product I am about to evaluate, I do not yet trust you. You will have to earn that trust over time, and I won't make it easy.

This is where my issue with most endpoint security solutions lies: I do not want to trust random security vendors with root accesses to my servers. Mistakes happen, they will get hacked some day, or leak their password in a git commit or a pastebin, and I do not wish my organization to be a collateral damage of their operational deficiencies.

Endpoint security without blindly trusting the security platform is doable. MIG is designed around the concept of non-trustable infrastructure. This is achieved by requiring all actions sent to MIG agents to be signed using keys that are not stored on the MIG servers, but on the laptops of investigators, the same way SSH keys are managed. If the MIG servers get hacked, some data may leak, but no access will be compromised.

Another aspect that we included in MIG is the notion that endpoint security can be done without arbitrary remote code exception. Most solutions will happily run code that come from the trusted central platform, effectively opening a backdoor into the infrastructure. MIG does not allow this. Agents will only run specific investigative tasks that have been pre-compiled into modules. There is no vector for remote code execution, such that an investigator's key leaking would not allow an attacker to elevate access to being root on endpoints. This approach does limit the capabilities of the platform - we can only investigate what MIG supports - but if remote code execution is really what you need, you probably should be looking into a provisioning tool, or pssh, but not an endpoint security solution.

While I do take MIG as an example, I am not advocating it as a better solution to all things. Rather, I am advocating for proper access controls in endpoint security solutions. Any security product that has the potential to compromise your entire infrastructure if taken over is bad, and should not be trusted. Even if it brings some security benefits. You should not have to compromise on this. Vendors should not ask customers to accept that risk, and just trust them to keep their servers secure. Doing endpoint security the safe way is possible, it's just a matter of engineering it right.

Sunday, October 4 2015

SHA1/SHA256 certificate switching with HAProxy

By Julien Vehent, Sunday, October 4 2015.

SHA-1 certificates are on their way out, and you should upgrade to a SHA-256 certificate as soon as possible... unless you have very old clients and must maintain SHA-1 compatibility for a while.

If you are in this situation, you need to either force your clients to upgrade (difficult) or implement some form of certificate selection logic: we call that "cert switching".

The most deterministic selection method is to serve SHA-256 certificates to clients that present a TLS1.2 CLIENT HELLO that explicitly announces their support for SHA256-RSA (0x0401) in the signature_algorithms extension.

Modern web browsers will send this extension. However, I am not aware of any open source load balancer that is currently able to inspect the content of the signature_algorithms extension. It may come in the future, but for now the easiest way to achieve cert switching is to use HAProxy SNI ACLs: if a client presents the SNI extension, direct it to a backend that presents a SHA-256 certificate. If it doesn't present the extension, assume that it's an old client that speaks SSLv3 or some broken version of TLS, and present it a SHA-1 cert.

This can be achieved in HAProxy by chaining frontend and backends:

global
        ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128
-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-R
SA-AES256-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!3DES:!MD5:!PSK

frontend https-in
        bind 0.0.0.0:443
        mode tcp
        tcp-request inspect-delay 5s
        tcp-request content accept if { req_ssl_hello_type 1 }
        use_backend jve_https if { req.ssl_sni -i jve.linuxwall.info }

        # fallback to backward compatible sha1
        default_backend jve_https_sha1

backend jve_https
        mode tcp
        server jve_https 127.0.0.1:1665
frontend jve_https
        bind 127.0.0.1:1665 ssl no-sslv3 no-tlsv10 crt /etc/haproxy/certs/jve_sha256.pem tfo
        mode http
        option forwardfor
        use_backend jve

backend jve_https_sha1
        mode tcp
        server jve_https 127.0.0.1:1667
frontend jve_https_sha1
        bind 127.0.0.1:1667 ssl crt /etc/haproxy/certs/jve_sha1.pem tfo ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:ECDHE-RSA-DES-CBC3-SHA:ECDHE-ECDSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:DES-CBC3-SHA:HIGH:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA
        mode http
        option forwardfor
        use_backend jve

backend jve
        rspadd Strict-Transport-Security:\ max-age=15768000
        server jve 172.16.0.6:80 maxconn 128

The configuration above receives inbound traffic in the frontend called "https-in". That frontend is in TCP mode and inspects the CLIENT HELLO coming from the client for the value of the SNI extension. If that value exists and matches our target site, it sends the connection to the backend named "jve_https", which redirects to a frontend also named "jve_https" where the SHA256 certificate is configured and served to the client.

If the client fails to present a CLIENT HELLO with SNI, or presents a SNI that doesn't match our target site, it is redirected to the "https_jve_sha1" backend, then to its corresponding frontend where a SHA1 certificate is served. That frontend also supports an older ciphersuite to accommodate older clients.

Both frontends eventually redirect to a single backend named "jve" which sends traffic to the destination web servers.

This is a very simple configuration, and eventually it could be improved using better ACLs (HAproxy regularly adds news ones), but for a basic cert switching configuration, it gets the job done!

Thursday, October 1 2015

Introducing SOPS: a manager of encrypted files for secrets distribution

By Julien Vehent, Thursday, October 1 2015.

Automating the distribution of secrets and credentials to components of an infrastructure is a hard problem. We know how to encrypt secrets and share them between humans, but extending that trust to systems is difficult. Particularly when these systems follow devops principles and are created and destroyed without human intervention. The issue boils down to establishing the initial trust of a system that just joined the infrastructure, and providing it access to the secrets it needs to configure itself.

The initial trust

In many infrastructures, even highly dynamic ones, the initial trust is established by a human. An example is seen in Puppet by the way certificates are issued: when a new system attempts to join a Puppetmaster, an administrator must, by default, manually approve the issuance of the certificate the system needs. This is cumbersome, and many puppetmasters are configured to auto-sign new certificates to work around that issue. This is obviously not recommended and far from ideal.

AWS provides a more flexible approach to trusting new systems. It uses a powerful mechanism of roles and identities. In AWS, it is possible to verify that a new system has been granted a specific role at creation, and it is possible to map that role to specific resources. Instead of trusting new systems directly, the administrator trusts the AWS permission model and its automation infrastructure. As long as AWS keys are safe, and the AWS API is secure, we can assume that trust is maintained and systems are who they say they are.

KMS, Trust and secrets distribution

Using the AWS trust model, we can create fine grained access controls to Amazon's Key Management Service (KMS). KMS is a service that encrypts and decrypts data with AES_GCM, using keys that are never visible to users of the service. Each KMS master key has a set of role-based access controls, and individual roles are permitted to encrypt or decrypt using the master key. KMS helps solve the problem of distributing keys, by shifting it into an access control problem that can be solved using AWS's trust model.

Since KMS's inception a few months ago, a number of projects have popped up to use its capabilities to distribute secrets: credstash and sneaker are such examples. Today I'm introducing sops: a secrets editor that uses KMS and PGP to manage encrypted files.

SOPS: Secrets OPerationS

A few weeks ago, Mozilla's Services Operations team started revisiting the issue of distributing secrets to EC2 instances, with a goal to store these secrets encrypted until the very last moment, when they need to be decrypted on target systems. Not unlike many other organizations that operate sufficiently complex automation, we found this to be a hard problem with a number of prerequisites:

Secrets must be stored in YAML files for easy integration into hiera
Secrets must be stored in GIT, and when a new CloudFormation stack is built, the current HEAD is pinned to the stack. (This allows secrets to be changed in GIT without impacting the current stack that may autoscale).
Encrypt entries separately. Encrypting entire files as blobs makes git conflict resolution almost impossible. Encrypting each entry separately is much easier to manage.
Secrets must always be encrypted on disk (admin laptop, upstream git repo, jenkins and S3) and only be decrypted on the target systems

Daniel Thornton and I brainstormed a number of ideas, and eventually ended-up with a workflow similar to the one described below.

The idea behind SOPS is to provide a wrapper around a text editor that takes care of the encryption and decryption transparently. When creating a new file, sops generates a data encryption key "Kd" that is itself encrypted with one or more KMS master keys and PGP public keys. Kd is used to encrypt the content of the file with AES256-GCM. In order to decrypt the files, sops must have access to any of the KMS or PGP master keys.

SOPS can be used to encrypt YAML, JSON and TEXT files. In TEXT mode, the content of the file is treated as a blob, the same way PGP would encrypt an entire file. In YAML and JSON modes, however, the content of the file is manipulated as a tree where keys are stored in cleartext, and values are encrypted. hiera-eyaml does something similar, and over the years we learned to appreciate its benefits, namely:

diff are meaningful. If a single value of a file is modified, only that value will show up in the diff. The diff is still limited to only showing encrypted data, but that information is already more granular that indicating that an entire file has changed.
conflicts are easier to resolve. If multiple users are working on the same encrypted files, as long as they don't modify the same values, changes are easy to merge. This is an improvement over the PGP encryption approach where unsolvable conflicts often happen when multiple users work on the same file.

# edit a file
$ sops example.yaml 
file written to example.yaml

# take a look at the diff
$ git diff example.yaml
diff --git a/example.yaml b/example.yaml
index 00fe479..5f40330 100644
--- a/example.yaml
+++ b/example.yaml
@@ -1,5 +1,5 @@
 # The secrets below are unreadable without access to one of the sops master key
-myapp1: ENC[AES256_GCM,data:Tr7oo=,iv:1vw=,aad:eo=,tag:ka=]
+myapp1: ENC[AES256_GCM,data:krm,iv:0Y=,aad:KPyE=,tag:oIA==]
app2:
     db:
         user: ENC[AES256_GCM,data:YNKE,iv:H4JQ=,aad:jk0=,tag:Neg==]

Below are two examples of SOPS encrypted files. The first one in YAML, the second one in JSON:

YAML

cleartext:

# The secrets below are unreadable without access to one of the sops master key
myapp1: t00m4nys3cr3tzupdated
app2:
    db:
        user: eve
        password: c4r1b0u
    # private key for secret operations in app2
    key: |
        -----BEGIN RSA PRIVATE KEY-----
        MIIBPAIBAAJBAPTMNIyHuZtpLYc7VsHQtwOkWYobkUblmHWRmbXzlAX6K8tMf3Wf
        ImcbNkqAKnELzFAPSBeEMhrBN0PyOC9lYlMCAwEAAQJBALXD4sjuBn1E7Y9aGiMz
        bJEBuZJ4wbhYxomVoQKfaCu+kH80uLFZKoSz85/ySauWE8LgZcMLIBoiXNhDKfQL
        vHECIQD6tCG9NMFWor69kgbX8vK5Y+QL+kRq+9HK6yZ9a+hsLQIhAPn4Ie6HGTjw
        fHSTXWZpGSan7NwTkIu4U5q2SlLjcZh/AiEA78NYRRBwGwAYNUqzutGBqyXKUl4u
        Erb0xAEyVV7e8J0CIQC8VBY8f8yg+Y7Kxbw4zDYGyb3KkXL10YorpeuZR4LuQQIg
        bKGPkMM4w5blyE1tqGN0T7sJwEx+EUOgacRNqM2ljVA=
        -----END RSA PRIVATE KEY-----
number: 1234567890
an_array:
- secretuser1
- secretuser2
- somelongvalueAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
- some other value

encrypted:

# The secrets below are unreadable without access to one of the sops master key
myapp1: ENC[AES256_GCM,data:krwEdH2fxWRexFuvZHS816Wz46Lm,iv:0STqWePc0HOPuDn2EizQdNepx9ksx0guHGeKrshlYSY=,aad:Krl8HyPGQmnIWIZh74Ib+y0OdiVEvRDBv3jTdMGSPyE=,tag:oI2THtQeUX4ZLNnbrdel2A==]
app2:
    db:
        user: ENC[AES256_GCM,data:YNKE,iv:H9CDb4aUHBJeF2MSTKHQuOwlLxQVdx12AhT0+Dob4JQ=,aad:jlF2KvytlQIgyMpOoO/BiQbukiMwrh1j94Oys+YMgk0=,tag:NeDysIHV9CGtMAQq9i4vMg==]
        password: ENC[AES256_GCM,data:p673JCgHYw==,iv:EOOeivCp/Fd80xFdMYX0QeZn6orGTK8CeckmipjKqYY=,aad:UAhi/SHK0aCzptnFkFG4dW8Vv1ASg7TDHD6lui9mmKQ=,tag:QE6uuhRx+cGInwSVdmxXzA==]
    # private key for secret operations in app2
    key: |-
        ENC[AES256_GCM,data:Ea3zTFSOlg1PDZmBa1U2dtKl3pO4nTmaFswJx41fPfq3u8O2/Bq1UVfXn2SrO13obfr6xH4zuUceCDTvW2qvphlan5ir609EXt4dE2TEEcjVKhmAHf4LMwlZVAbvTJtlsnvo/aYJH95uctjsSX5h8pBlLaTGBGYwMrZuMyRU6vdcMWyha+piJckUc9sq7fevy1TSqIxf1Usbn/0NEklWm2VSNzQ2Urqtny6EXar+xU7NfYSRJ3mqmcJZ14oIeXPdpk962RwMEFWdYrbE7D59kWU2BgMjDxYJD5KXpWiw2YCrA/wsATxVCbZlwqC+TJFA5WAUZX756mFhV/t2Li3zQyDNUe6KkMXV9qwf/oV1j5sVRVFsKDYIBqhi3qWBVA+SO9RloQMjhru+IsdbQcS4LKq/1DrBENeZuJ0djUAxKLVfJzMGUf89ju3m9IEPovW8mfF0RbfAGRwFHMO9nEXCxrTLERf3owdR3u4j5/rNBpIvvy1z+2dy6sAx/eyNdS+cn5qO9BPAxsXpSwkaI96rlBagwH1Pfxus0x/D00j93OpE+M8MgQ/9LA68FlCFU4OAQlvw8f7MPoxnq+/+gFTS/qqjTR6EoUuX5NH2WY93YCC5TCbe4GOXyP0H05PbIWq55UMVLNcpAyac3gO4kL5O5U8=,iv:Dl61tsemKH0fdmNul/PmEEsRYFAh8GorR8GRupus/EM=,aad:Ft2aSYYukD1x8pMj1WvmodLjJV6waPy5FqdlImWyQKA=,tag:EPg4KpWqni/buCFjFL857A==]
number: ENC[AES256_GCM,data:XMrBalgZ9tvBxQ==,iv:XyEAAaIzVy/2trnJhLrjMInLg8tMI4CAX9+ccnj3T1Y=,aad:JOlAkP159UxDjL1CrumTuQDqgW2+VOIwz7bdfaJIIn4=,tag:WOHOMJS4nhSdj/aQcGbU1A==]
an_array:
- ENC[AES256_GCM,data:td1aAv4s4cOzSo0=,iv:ErVqte7GpQ3JfzVpVRf7pWSQZDHn6W0iAntKWFsMqio=,aad:RiYy8fKX/yVY7KRgXSOIzydT0+TwK7WGzSFSy+1GmVM=,tag:aSGLCmNZsGcBjxEGvNQRwA==]
- ENC[AES256_GCM,data:2K8C418jef8zoAY=,iv:cXE4Hwdl4ZHzAHHyyXqaIMFs0mn65JUehDdaw/aM0WI=,aad:RlAgUZUZ1DvxD9/lZQk9KOHKl4L+fYETaAdpDVekCaA=,tag:CORSBzis6Vy45dEvT/UtMg==]
- ENC[AES256_GCM,data:hbcOBbsaWmlnrpeuwLfh1ttsi8zj/pxMc1LYqhdksT/oQb80g2z0FE4QwUVb7VV+x98LAWHofVyV8Q==,iv:/sXHXde82r2FyG3Z3vC5x8zONB14RwC0GmtkiYEUNLI=,aad:BQb8l5fZzF/aa/EYnrOQvRfGUTq9QmJOAR/zmgOfYDA=,tag:fjNeg3Manjl6B2U2oflRhg==]
- ENC[AES256_GCM,data:LLHkzGobqL53ws6E2zglkA==,iv:g9z3zz4DUzJr4Cim0SVqKF736w2mZoItqbB0TcsGrQU=,aad:Odrvz0loqFdd9wKJz0ULMX/lyEQcX8WaHE59MgeXkcI=,tag:V+rV/AeZ4uEgtwGhlamTag==]
sops:
    kms:
    -   enc: CiC6yCOtzsnFhkfdIslYZ0bAf//gYLYCmIu87B3sy/5yYxKnAQEBAQB4usgjrc7JxYZH3SLJWGdGwH//4GC2ApiLvOwd7Mv+cmMAAAB+MHwGCSqGSIb3DQEHBqBvMG0CAQAwaAYJKoZIhvcNAQcBMB4GCWCGSAFlAwQBLjARBAyGdRODuYMHbA8Ozj8CARCAO7opMolPJUmBXd39Zlp0L2H9fzMKidHm1vvaF6nNFq0ClRY7FlIZmTm4JfnOebPseffiXFn9tG8cq7oi
        enc_ts: 1439568549.245995
        arn: arn:aws:kms:us-east-1:656532927350:key/920aff2e-c5f1-4040-943a-047fa387b27e
    pgp:
    -   fp: 85D77543B3D624B63CEA9E6DBC17301B491B3F21
        enc: |
            -----BEGIN PGP MESSAGE-----
            Version: GnuPG v1

            hQIMA0t4uZHfl9qgAQ//ZvUMJOLUJyzKa/Uigwh1jKVhx3feHUitVjCWBfVTPgj1
            rRbaTcaF/mYi+rLdW+6kmAg1UEPoVgEBEiBvCTcHjyDzw3m0DoQwvK85nqOpEhkx
            rjU1XAnKZ8LNFfIaj8Xo/L6qzE882gwOhfCPU+QmnkWdijs6dQof06DButQDTx5D
            KlFvr9CgSa52/uPazZ41disho9guS06k+KrV/P2F4jrU5aB5mfP7YZY9mkVcm2bv
            9C5O9neNlXcivgWqKQjB5fmv1Z9yUFAUBNg98wjT8o5Hxz6P6hIbV3f+vn/Vu+VZ
            Qo+E7g3/2ItaT89KAIVXgQdHhwJneoDBVpJ4rYz7LLbcvEyAbipKIY4Fl3Cn1ggH
            9odIZWA6FWZxHNhRVonMVHZ8Jei5NkUdpJltjDmPJpl3B+7XiWg4NS8dp860fLeL
            8nrkR0Z4nVK8DNg+7nQiOxHL9wye6ljWl7/xapJ5r+mYA6eLybsSSlxDo9/OmeON
            CYo3jV8HT8amrXYVi4MyZ3LV2TTyGVPObnthYEN2lPSJmms6ei6t/xKaZtAj6779
            EzGbUP9VpTKKf5tqGcy9MeGEk2p5ed5hJGinrrt92cNIebcMBJpkLQAy++V/fKnH
            Meecaoj1NThBnRguNuz73WSy2C5u/g7OoI50HJJCmoVXY+8D64tWmCZc8Ib2fprS
            XgFcoR9u5yPkLZW8xASpRXfKKTbRTTjAXdYEyaYuuOW2nFWo62/d1mZsT7kY21ja
            AhVVoxwsj45FCuk63bDVceAJJm+9xxufMp0gNW1GUk858VLyE8gn+uAB5zBcS5c=
            =BN/t
            -----END PGP MESSAGE-----
        created_at: 1443203323.058362

JSON

cleartext

{
    "address": {
        "city": "New York", 
        "postalCode": "10021-3100", 
        "state": "NY", 
        "streetAddress": "21 2nd Street"
    }, 
    "age": 25, 
    "firstName": "John", 
    "lastName": "Smith", 
    "phoneNumbers": [
        {
            "number": "212 555-1234", 
            "type": "home"
        }, 
        {
            "number": "646 555-4567", 
            "type": "office"
        }
    ]
}

encrypted:

{
    "address": {
        "city": "ENC[AES256_GCM,data:2wNRKB+Sjjw=,iv:rmATLCPii2WMzcT80Wp9gOpYQqzx6juRmCf9ioz2ZLM=,aad:dj0QZW0BvZVjF1Dn25hOJpcwcVB0qYvEIhGWgxq6YzQ=,tag:wOoPYU+8BA9DiNFlsal3Aw==]", 
        "postalCode": "ENC[AES256_GCM,data:xwWZ/np9Gxv3CQ==,iv:OLwOr7iliPyWWBtKfUUH7E1wQlxJLA6aFxIfNAEC/M0=,aad:8mw5NU8MpyBlrh7XaUqa642jeyJWGqKvduaQ5bWJ5pc=,tag:VFmnc4Ay+yKzyHcrKeEzZQ==]", 
        "state": "ENC[AES256_GCM,data:3jY=,iv:Y2bEgkjdn91Pbf5RgJMbyCsyfhV7XWdDhe8wVwTQue0=,aad:DcA5kW1rrET9TxQ4kn9jHSpoMlkcPKs5O5n9wZjZYCQ=,tag:ad1xdNnFwkqx/8EOKVVHIA==]", 
        "streetAddress": "ENC[AES256_GCM,data:payzP57DGPl5S9Z7uQ==,iv:UIz34fk9zH4z6hYfu0duXmAnI8CqnoRhoaIUqg1YoYA=,aad:hll9Baw40lMjwj7HePQ1o1Lsuh1LCwrE6+bkG4025sg=,tag:FDBhYxMmJ1Wj/uxYxdvVZg==]"
    }, 
    "age": "ENC[AES256_GCM,data:4Y4=,iv:hi1iSH19dHSgG/c7yVbNj4yzueHSmmY46yYqeNCoX5M=,aad:nnyubQyaWeLTcz9k9cMHUlgTwVDMyHf32sWCBm7KWAA=,tag:4lcMjstadzI8K40BoDEfDA==]", 
    "firstName": "ENC[AES256_GCM,data:KVe8Dw==,iv:+eg+Rjvaqa2EEp6ufw9c4hwWwObxRLPmxx3fG6rkyps=,aad:3BdHcorHfbvM2Jcs96zX0JY2VQL5dBNgy7zwhqLNqAU=,tag:5OD6MN9SPhBmXuA81hyxhQ==]", 
    "lastName": "ENC[AES256_GCM,data:1+koqsI=,iv:b2kBxSW4yOnLFc8qoeylkMtiO/6qr4cZ5VTntXTyXO8=,aad:W7HXukq3lUUMj9i57UehILG2NAp8XCgJMYbvgflWJIY=,tag:HOrgi1L+IRP+X5JGMnm7Ig==]", 
    "phoneNumbers": [
        {
            "number": "ENC[AES256_GCM,data:Oo0IxdtBrnfE+bTf,iv:tQ1E/JQ4lHZvj1nQnGL2sKE30sCctjiMCiagS2Yzch8=,aad:P+m5gD3pKfNEOy6t61vbKhEpPtMFI2NZjBPrD/m8T9w=,tag:6iRMUVUEx3UZvUTGTjCdwg==]", 
            "type": "ENC[AES256_GCM,data:M3zOKQ==,iv:pD9RO4BPUVu6AWPo2DprRsOqouN+0HJn+RXQAXhfB2s=,aad:KFBBVEEnSjdmah3i2XmPx7wWEiFPrxpnfKYW4BSolhk=,tag:liwNnip/L6SZ9srn0N5G4g==]"
        }, 
        {
            "number": "ENC[AES256_GCM,data:BI2f/qFUea6UHYQ+,iv:jaVLMju6h7s+AlF7CsPbpUFXO2YtYAqYsCIsyHgfrfI=,aad:N+8sVpdTlY5I+DcvnY018Iyh/QesD7bvwfKHRr7q2L0=,tag:hHPPpQKP4cUIXfh9CFe4dA==]", 
            "type": "ENC[AES256_GCM,data:EfKAdEUP,iv:Td+sGaS8XXRqzY98OK08zmdqsO2EqVGK1/yDTursD8U=,aad:h9zi8s+EBsfR3BQG4r+t+uqeChK4Hw6B9nJCrValXnI=,tag:GxSk1LAQIJNGyUy7AvlanQ==]"
        }
    ], 
    "sops": {
        "kms": [
            {
                "arn": "arn:aws:kms:us-east-1:656532927350:key/920aff2e-c5f1-4040-943a-047fa387b27e", 
                "created_at": 1443204393.48012, 
                "enc": "CiC6yCOtzsnFhkfdIslYZ0bAf//gYLYCmIu87B3sy/5yYxKnAQEBAgB4usgjrc7JxYZH3SLJWGdGwH//4GC2ApiLvOwd7Mv+cmMAAAB+MHwGCSqGSIb3DQEHBqBvMG0CAQAwaAYJKoZIhvcNAQcBMB4GCWCGSAFlAwQBLjARBAwBpvXXfdPzEIyEMxICARCAOy57Odt9ngHHyIjVU8wqMA4QszXdBglNkr+duzKQO316CRoV5r7bO8JwFCb7699qreocJd+RhRH5IIE3"
            }, 
            {
                "arn": "arn:aws:kms:ap-southeast-1:656532927350:key/9006a8aa-0fa6-4c14-930e-a2dfb916de1d", 
                "created_at": 1443204394.74377, 
                "enc": "CiBdfsKZbRNf/Li8Tf2SjeSdP76DineB1sbPjV0TV+meTxKnAQEBAgB4XX7CmW0TX/y4vE39ko3knT++g4p3gdbGz41dE1fpnk8AAAB+MHwGCSqGSIb3DQEHBqBvMG0CAQAwaAYJKoZIhvcNAQcBMB4GCWCGSAFlAwQBLjARBAwag3w44N8+0WBVySwCARCAOzpqMpvzIXV416ycCJd7mn9dBvjqzkUDag/zHlKse57uNN7P0S9GeRVJ6TyJsVNM+GlWx8++F9B+RUE3"
            }
        ], 
        "pgp": [
            {
                "created_at": 1443204394.748745, 
                "enc": "-----BEGIN PGP MESSAGE-----\nVersion: GnuPG v1\n\nhQIMA0t4uZHfl9qgAQ//dpZVlRD9WGvz6Pl+PRKvBf661IHLkCeOq5ubzqLIJZu7\nJMNu0KBoO0qX+rgIQtzMU+04QlbIukw01q9ELSDYjBDQPRQJ+6OAeauawxf5mPGa\nZKOaSuoCuPbfOmGj8AENdSCpDaDz+KvOPvo5NNe16kC8BeerFJGewyEwbnkx5dxZ\ngk+LJBOuWRVUEzjsB1pzGfGRzvuzHcrUzWAoA8N936hDFIpoeDYC/8KLc0CWTltA\nYYGaKh5cZxC0R0TgQ5S9GjcU2nZjhcL94XRxZ+9BZDLCDRnjnRfUpPSTHoxr9wmR\nAuLtgyCIolrPl3fqRLJSLUH6FyTo2CO+2mFSx7y9m2OXkKQd1z2tkOlpC9PDTjGT\nVfGvy9nMUsmrgWG35soEmk0nNJaZehiscvZfomBnnHQgqx7DMSMxAnBneFqjsyOQ\nGK7Jacs/tigxe8NZcYhx+usITeQzVLmuqZ2pO5nEGyq0XJhJjxce9YVaeC4QOft0\nlm6qq+m6oABOdKTGh6zuIiWxU1r417IEgV8mkwjlraAvNNPKowQq5j8dohG4HaNK\nOKoOt8aIZWvD3HE9szuH+uDRXBBEAbIvnojQIyrqeIYv1xU8hDTllJPKw/kYD6nx\nMKrw4UAFand5qAgN/6QoIrOPXC2jhA2VegXkt0LXXSoP1ccR4bmlrGRHg0x6Y8zS\nXAE+BVEMYh8l+c86BNhzVOaAKGRor4RKtcZIFCs/Gpa4FxzDp5DfxNn/Ovrhq/Xc\nlmzlWY3ywrRF8JSmni2Asxet31RokiA0TKAQj2Q7SFNlBocR/kvxWs8bUZ+s\n=Z9kg\n-----END PGP MESSAGE-----\n", 
                "fp": "85D77543B3D624B63CEA9E6DBC17301B491B3F21"
            }
        ]
    }
}

As you can see on each key/value pair, only the values are encrypted and keys are kept in clear. It can be argued that this approach leaks sensitive information, but it’s a tradeoff we’re willing to accept in exchange for an increased usability.

Simplifying key management

OpenPGP gets a lot of bad press for being an outdated crypto protocol, and while true, what really made look for alternatives is the difficulty to manage and distribute keys to systems. With KMS, we manage permissions to an API, not keys, and that's a lot easier to do.

But PGP is not dead yet, and we still rely on it heavily as a backup solution: all our files are encrypted with KMS and with one PGP public key, with its private key stored securely for emergency decryption in the event that we lose all our KMS master keys.

That said, nothing prevents you from using SOPS the same way you would use an encrypted PGP file: by referencing the pubkeys of each individual who has access to the file. It can easily be done by providing sops with a comma-separated list of public keys when creating a new file:

$ sops --pgp "E60892BB9BD89A69F759A1A0A3D652173B763E8F, 84050F1D61AF7C230A12217687DF65059EF093D3, 85D77543B3D624B63CEA9E6DBC17301B491B3F21" mynewfile.yaml

Updating the master keys

GnuPG can be a little obscure when it comes to managing the keys that have access to a file. While its command line is powerful, it takes a few minutes to find the right commands and figure out how to provide access to a new member of the team.

In SOPS, managing master keys is easy: they are simply stored as entries of the document under sops->{kms,pgp}. By default, that information is hidden during editing, but calling sops with the "-s" flag will display the master keys in the editor. From there, add new keys by creating entry in the document, or remove them by deleting the lines.

Rotation

Rotating data keys is also trivial: sops provides a rotation flag "-r" that will generate a new data key Kd and re-encrypt all values in the file with it. Coupled with in-place encryption/decryption, it is easy to rotate all the keys on a group of files:

for file in $(find . -type f -name "*.yaml"); do
        sops -d -i $file
        sops -e -i -r $file
done

Something that should be done every few months for good practice ;)

Assuming roles

SOPS has the ability to use KMS in multiple AWS accounts by assuming roles in each account. Being able to assume roles is a nice feature of AWS that allows administrator to establish trust relationships between accounts, typically from the most secure account to the least secure one. In our use-case, we use roles to indicate that a user of the Master AWS account is allowed to make use of KMS master keys in development and staging AWS accounts. Using roles, a single file can be encrypted with KMS keys in multiple accounts, thus increasing reliability and ease of use.

Check it out, and contribute!

SOPS is available on Github at http://github.com/mozilla/sops and on Pypi at https://pypi.python.org/pypi/sops. We are progressively reaching a stable stage, with a goal to support Python 2.6.6 to 3.4.

Sunday, September 13 2015

Investigating SSH authorized keys across the infrastructure using MIG

By Julien Vehent, Sunday, September 13 2015.

One of the challenge of operating an organization like Mozilla is dealing with the heterogeneity of the platform. Each group is free to define its own operational practices, as long as they respect strong security rules. We don't centralize a lot, and when we do, we do it in a way to doesn't slow down devops.

The real challenge on the infosec side is being able to investigate infrastructures that are managed in many different ways. We look for anomalies, and one that recently received our focus is finding bad ~/.ssh/authorized_keys files.

Solving that problem involved adding some functionalities to MIG's file investigation module to assert the content of files, as well as writing a little bit of Python. Not only did this method help us find files that needed updating, but it also provided a way to assert the content of authorized_keys files moving forward.

Let's dive in.

LDAP all the things!

We have a really good LDAP database, results of tons of hard work from the folks in Mozilla IT. We use it for a lot of things, from providing a hierarchical view of Mozilla to showing your personal photo in the organization's phonebook. We also use it to store GPG Fingerprints and, what interests us today, SSH Public Keys.

LDAP users are in charge of their keys. They have an admin panel where they can add and remove keys, to facilitate regular rotations. On the infra side, Puppet pulls the public keys and writes them into the users authorized_keys files. As long as LDAP is up to date, and Puppet runs, authorized_keys files contain the proper keys.

But bugs happen, and sometimes, for various reasons, configurations don't get updated when they should be, and files go out of date. This is where we need an external mechanism to find the systems where configurations go stale, and fix them.

Asserting the content of a file

The most common way to verify the integrity of a file is by using a checksum, like a sha256sum. Unfortunately, it is very rare that a given file would always be exactly the same across the infrastructure. That is particularly true in our environment, because we often add a header with a generation date to authorized_keys files.

# HEADER: This file was autogenerated at Mon Jul 27 14:24:07 +0000 2015

That header means the checksum will change on every machine, and we cannot use a checksum approach to assert the content of a file. Instead, we need to use a regular expression.

Content regexes have been present in MIG for a while now, and are probably the most used feature in investigations. But until recently, content regexes were limited to finding things that exist in a file, such as an IOC. The file module would stop inspecting as soon as a regex matches, even skipping the rest of the file, to accelerate investigations.

To assert the content of a file, we need a different approach. The regex needs to verify that every line of a file match our expectations, and if one line does not match, that means the file has bad content.

Introducing Macroal mode

The first part of the equation is making sure that every line in a file match a given regex. In the file module, we introduced a new option called "macroal" that stands for Match All Content Regexes On All Lines. When activated, this mode tells the file module to continue reading the file until the end, and flag the file if all lines have match the content regex.

On the MIG command line, this option can be used in the file module with flag ""-macroal". It's a boolean that is off by default.

$ mig file -t local -path /etc -name "^passwd$" -content "^([A-Za-z0-9]|-|_){1,100}:x:[0-9]{1,5}:[0-9]{1,5}:.+" -macroal

The command above finds /etc/passwd and checks that all the lines in the file match the content regex "^([A-Za-z0-9]|-|_){1,100}:x:[0-9]{1,5}:[0-9]{1,5}:.+". If they do, MIG returns a positive result on the file.

In the JSON of the investigation, macroal is stored in the options of a search:

{
    "searches": {
        "s1": {
            "paths": [
                "/etc"
            ],
            "names": [
                "^passwd$"
            ],
            "contents": [
                "^([A-Za-z0-9]|-|_){1,100}:x:[0-9]{1,5}:[0-9]{1,5}:.+"
            ],
            "options": {
                "macroal": true,
                "matchall": true,
                "maxdepth": 1
            }
        }
    }
}

But finding files that match all the lines is not yet what we want. In fact, we want the exact opposite: finding files that have lines that do not match the content regex.

Introducing Mismatch mode

Another improvement we added to the file module is the mismatch mode. It's a simple feature that inverses the behavior of one or several parameters in a file search.

For example, if we know that all versions of RHEL6.6 have /usr/bin/sudo matching a given sha256, we can use the mismatch option to find instances where sudo does not match the expected checksum.

$ mig file -t "environment->>'ident' LIKE 'Red Hat Enterprise Linux Server release 6.6%'" \
> -path /usr/bin -name "^sudo$" \
> -sha256 28d18c50eb23cfd6ac8d39461d5479e19f6f1a5f6b839d34f2eeaf7ce8a3e054 \
> -mismatch sha256

Mismatch allows us to find anomalies, files that don't match our expectations. By combining Macroal and Mismatch in a file search, we can find files that have unexpected content. But we need one last piece: a content regex that can be used to inspect authorized_keys files.

Building regexes for authorized_keys files

An authorized_keys file should only contain one of three type of line:

a comment line that starts with a pound "#" character
an empty line, or a line full of spaces
a ssh public key

Writing a regex for the first two types is easy. A comment line is "^#.+$" and an empty line is "^(\s+)?$".

Write a regex for SSH public keys isn't too complicated, but we need to take a few precautions. A pubkey entry has three section separated by a white space, and we only care about the first two section. The third one, the comments, can be discarded entirely with ".+".

Next, a few things need to be escaped in the public key, as pubkey are base64 encoded and thus include the slash "/" and plus "+" character that have special meaning in regexes.

Awk and Sed can do this very easily:

$ awk '{print $1,$2}' ~/.ssh/authorized_keys | grep -v "^#" | sed "s;\/;\\\/;g" | sed "s;\+;\\\+;g"

The result can be placed into a content regex and given to MIG.

$ mig file -path /home/jvehent/.ssh -name "^authorized_keys$" \
> -content "^((#.+)|(\\s+)|(ssh-rsa\\sAAAAB3NzaC1yc2EAA[...]yFDMZLFlVmQ==\\s.+))$" \
> -macroal -mismatch content

Or in JSON form:

{
    "searches": {
        "jvehent@mozilla.com_ssh_pubkeys": {
            "contents": [
                "^((#.+)|(\\s+)|(ssh-rsa\\sAAAAB3NzaC1yc2EAA[...]yFDMZLFlVmQ==\\s.+))$"
            ],
            "names": [
                "^authorized_keys$"
            ],
            "options": {
                "macroal": true,
                "matchall": true,
                "maxdepth": 1,
                "mismatch": [
                    "content"
                ]
            },
            "paths": [
                "/home/jvehent/.ssh"
            ]
        }
    }
}

Automating the investigation

With a several hundreds pubkeys in LDAP, it is more than necessary to automate the generation of the investigation file. We can do so with Python and a small LDAP helper library called mozlibldap.

The algorithm is very simple: iterate over active LDAP users and retrieve their public keys. Then it find their home directory from LDAP and creates a MIG file search that asserts the content of their authorized_keys file.

The investigation JSON file gets bigs very quickly (2.4MB, and ~40,000 lines), but still runs decently fast on target systems. A single system runs the whole thing in approximately 15 seconds, and since MIG is completely parallelized, running it across the infrastructure takes less than a minute.

Below is the Python script that generate the investigation in MIG's action v2 format.

#!/usr/bin/env python
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
# Copyright (c) 2015 Mozilla Corporation
# Author: jvehent@mozilla.com

# Requires:
# mozlibldap

from __future__ import print_function
import mozlibldap
import string
import json
import sys

LDAP_URL = 'ldap://someplace.at.mozilla'
LDAP_BIND_DN = 'mail=ldapreadonlyuser,o=com,dc=mozilla'
LDAP_BIND_PASSWD = "readonlyuserpassphrase"


def main():
    lcli = mozlibldap.MozLDAP(LDAP_URL, LDAP_BIND_DN, LDAP_BIND_PASSWD)
    searches = {}

    # get a list of users that have a pubkey in ldap
    users = lcli.get_all_enabled_users_attr('sshPublicKey')
    for user_attr in users:
        search = {}
        user = user_attr[0].split(',', 1)[0].split('=', 1)[1]
        print("current user: "+user, file=sys.stderr)
        keys = user_attr[1]
        if len(keys) == 0:
            continue
        contentre = '^((#.+)|(\s+)'
        for pubkey in keys['sshPublicKey']:
            if len(pubkey) < 5 or not (pubkey.startswith("ssh")):
                continue
            pubkey = string.join(pubkey.split(' ', 2)[:2], '\s')
            pubkey = pubkey.replace('/', '\/')
            pubkey = pubkey.replace('+', '\+')
            contentre += '|({pubkey}\s.+)'.format(pubkey=pubkey)
        contentre += ')$'
        search["names"] = []
        search["names"].append("^authorized_keys$")
        search["contents"] = []
        search["contents"].append(contentre)
        paths = []
        try:
            paths = get_search_paths(lcli, user)
        except:
            continue
        if not paths or len(paths) < 1:
            continue
        search["paths"] = paths
        search["options"] = {}
        search["options"]["matchall"] = True
        search["options"]["macroal"] = True
        search["options"]["maxdepth"] = 1
        search["options"]["mismatch"] = []
        search["options"]["mismatch"].append("content")
        print(json.dumps(search), file=sys.stderr)
        searches[user+"_ssh_pubkeys"] = search
    action = {}
    action["name"] = "Investigate the content of authorized_keys for LDAP users"
    action["target"] = "status='online' AND mode='daemon'"
    action["version"] = 2
    action["operations"] = []
    operation = {}
    operation["module"] = "file"
    operation["parameters"] = {}
    operation["parameters"]["searches"] = searches
    action["operations"].append(operation)
    print(json.dumps(action, indent=4, sort_keys=True))


def get_search_paths(lcli, user):
    paths = []
    res = lcli.query("mail="+user, ['homeDirectory', 'hgHome',
                                    'stageHome', 'svnHome'])
    for attr in res[0][1]:
        try:
            paths.append(res[0][1][attr][0]+"/.ssh")
        except:
            continue
    return paths


if __name__ == "__main__":
    main()

The script write the investigation JSON to stdout and needs to be redirected to a file. We can then use the MIG command line to run the investigation file.

$ ./make-pubkeys-investigation.py > /tmp/investigate_pubkeys.json
$ mig -i /tmp/investigate_pubkeys.json
[info] launching action from file, all flags are ignored
3124 agents will be targeted. ctrl+c to cancel. launching in 5 4 3 2 1 GO
Following action ID 4898767262251.status=inflight......status=completed
- 100.0% done in 34.848325918s
3124 sent, 3124 done, 3124 succeeded
server.example.net /home/bob/.ssh/authorized_keys [lastmodified:2014-05-30 04:04:45 +0000 UTC, mode:-rw-------, size:968] in search 'bob_ssh_pubkeys'
[...]
17 agent have found results

In conclusion

When maintaining the security of a large infrastructure, it is critical to separate the components that perform the configuration from the components that verify the configuration.

While MIG was written primarily as a security investigation platform, its low-level file investigation capabilities can be used to assert the integrity of configurations organization-wide.

This post shows how checks that verify the integrity of SSH Authorized Keys files can be executed using MIG. The checks are designed to consume negligible amounts of resources, and as such should be automated to run every few days in an approach that should be reused for a large amount of sensitive configuration files.

Test your infra, the same way you would test your applications!

Wednesday, August 26 2015

Hosting Go code on Github with a custom import path

By Julien Vehent, Wednesday, August 26 2015.

We host MIG at https://github.com/mozilla/mig, but while I have tons of respect for the folks at Github, I can't guarantee that we won't use another hosting provider in the future. Telling people to import MIG packages using something of the form "github.com/mozilla/mig/<package>" bothers me, and I've been looking for a better solution for a while.

I bought the domain mig.ninja with the intention to use that as the base import path. I initially tried to use HAProxy to proxy github.com, and somewhat succeeded, but it involved a whole bunch of rewrites that were frankly ugly.

Thankfully, Russ Cox got an even better solution merged into Go 1.4 and I ended up implementing it. Here's how.

Understanding go get

When asked to fetch a package, go get does a number of checks. If the target is on a known hosting site, it fetches the data using a method that is hardcoded (git for github.com, hg for bitbucket, ...). But when using your own domain, go get has no way to know how to fetch the data. To work around that, go lets you specify the vcs method in the import path: import mig.ninja/mig.git The .git indicates to go get that git should be used to retrieve the package. It also doesn't interfere with the code: in the code that import the package, .git is ignored and the package content is accessed using mig.Something.

But that's ugly. No one wants to suffix .git to their import path. There is another, cleaner, solution that use a HTML file to tell go get where the package is located, and which protocol should be used to retrieve it. The file is served from the location of the import path. For an example, let's curl https://mig.ninja/mig:

$ curl https://mig.ninja/mig
<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
        <meta name="go-import" content="mig.ninja/mig git https://github.com/mozilla/mig">
        <meta http-equiv="refresh" content="0; url=http://mig.mozilla.org">
    </head>
    <body>
        Nothing to see here; <a href="http://mig.mozilla.org">move along</a>.
    </body>
</html>

The key here is in the <meta> tag named "go-import". When go get requests https://mig.ninja/mig, it hits that HTML file and knows that "mig.ninja/mig" must be retrieved using git from https://github.com/mozilla/mig.

One great aspect of this method, aside from removing the need for .git in the import path, is that "mig.ninja/mig" can now be hosted anywhere as long as the meta tag continues to indicate the authoritative location (in this case: github). It also works nicely with packages under the base repository, such that go get mig.ninja/mig/modules/file works as expected as long as the file is served from that location as well. Note that go get will retrieve the entire repository, not just the target package.

Serving the meta tag from HAProxy

Creating a whole web server for the sole purpose of serving an 11 lines of HTML isn't very appealing. So I reused an existing server that already hosts various things, including this blog, and is powered by HAProxy.

HAProxy can't serve files, but here's the trick, it can serve a custom response at a monitoring uri. I created a new HTTPS backend for mig.ninja that monitors /mig and serves a custom HTTP 200 response.

frontend https-in
        bind 62.210.76.92:443
        mode tcp
        tcp-request inspect-delay 5s
        tcp-request content accept if { req_ssl_hello_type 1 }
        use_backend jve_https if { req_ssl_sni -i jve.linuxwall.info }
        use_backend mig_https if { req_ssl_sni -i mig.ninja }

backend mig_https
        mode tcp
        server mig_https 127.0.0.1:1666

frontend mig_https
        bind 127.0.0.1:1666 ssl no-sslv3 no-tlsv10 crt /etc/certs/mig.ninja.bundle
        mode http
        monitor-uri /mig
        errorfile 200 /etc/haproxy/mig.ninja.200.http
        acl mig_pkg url /mig
        redirect location https://mig.ninja/mig if !mig_pkg

The configuration above uses SNI to serve multiple HTTPS domains from the same IP. When a new connection enters the https_in frontend, it inspects the server_name TLS extension and decides which backend should handle the request. If the request is for mig.ninja, it sends it to the mig_https backend, which forward it to the mig_https frontend. There, the request URI is inspect. If it matches /mig, the file mig.ninja.200.http is returned. Otherwise, a HTTP redirect is returned to send the caller back to https://mig.ninja/mig (in case a longer path, such as mig.ninja/mig/module, was requested).

mig.ninja.200.http is a complete HTTP response, with HTTP headers, body and proper carriage return. HAProxy doesn't process the file at all, it just globs it and sends it back to the client. It looks like this:

HTTP/1.1 200 OK
Cache-Control: no-cache
Connection: close
Content-Type: text/html

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<meta name="go-import" content="mig.ninja/mig git https://github.com/mozilla/mig">
<meta http-equiv="refresh" content="0; url=http://mig.mozilla.org">
</head>
<body>
Nothing to see here; <a href="http://mig.mozilla.org">move along</a>.
</body>
</html>

Building MIG

With all that in place, retrieving mig is now as easy as go get mig.ninja/mig. Icing on the cake, the clients can now be retrieved with go get as well:

$ go get -u mig.ninja/mig/client/mig
$ mig help
$ go get -u mig.ninja/mig/client/mig-console
$ mig-console

Telling everyone to use the right path

We just this in place, you can't guarantee that users of your packages won't directly reference the github import path.

Except that you can! As outlined's in rsc's document, and in go help importpath, it is possible to specify the authoritative location of a package directly in the package itself. This is done by adding a comment with the package location right next to the package name:

package mig /* import "mig.ninja/mig" */

Using the import comment, Go will enforce retrieval from the authoritative source. Other tools will use it as well, for example try accessing MIG's doc from Godoc using the github URL, and you will notice the redirection: https://godoc.org/github.com/mozilla/mig

Trust Github, it's a great service, but controlling your import path is the way to Go ;)

Sunday, July 26 2015

Using Mozilla Investigator (MIG) to detect unknown hosts

By Julien Vehent, Sunday, July 26 2015.

MIG is a distributed forensics framework we built at Mozilla to keep an eye on our infrastructure. MIG can run investigations on thousands of servers very quickly, and focuses on providing low-level access to remote systems, without giving the investigator access to raw data.

As I was recently presenting MIG at the DFIR Summit in Austin, someone in the audience asked if it could be used to detect unknown or rogue systems inside a network. The best way to perform that kind of detection is to watch the network, particularly for outbound connections rogue hosts or malware would establish to a C&C server. But MIG can also help the detection by inspecting ARP tables of remote systems and cross-referencing the results with local mac addresses on known systems. Any MAC address not configured on a known system is potentially a rogue agent.

First, we want to retrieve all the MAC addresses from the ARP tables of known systems. The netstat module can perform this task by looking for neighbor MACs that match regex "^[0-9a-f]", which will match anything hexadecimal.

$ mig netstat -nm "^[0-9a-f]" > /tmp/seenmacs

We store the results in /tmp/seenmacs and pull a list of unique MACs using some bash.

$ awk '{print tolower($5)}' /tmp/seenmacs | sort | uniq
00:08:00:85:0b:c2
00:0a:9c:50:b4:36
00:0a:9c:50:bc:61
00:0c:29:41:90:fb
00:0c:29:a7:41:f7
00:10:db:ff:10:00
00:10:db:ff:30:00
00:10:db:ff:f0:00
00:21:53:12:42:c1

We now want to check that every single one of the seen MAC addresses is configured on a known agent. Again, the netstat module can be used for this task, this time by querying local mac addresses with the -lm flag.

Now the list of MACs may be quite long, so instead of running one MIG query per MAC, we group them 50 by 50 using the following script:

#! /usr/bin/env bash
i=50
input=$1
output=$2
while true
do
    echo -n "mig netstat " >> $output
    for mac in $(awk '{print tolower($5)}' $1|sort|uniq|head -$i|tail -50)
    do
        echo -n "-lm $mac " >> $output
    done
    echo >> $output
    i=$((i+50))
    if [ $i -gt $(awk '{print tolower($5)}' $1|sort|uniq|wc -l) ]
    then
        exit 0
    fi
done

The script will build MIG netstat command with 50 arguments max. Invoke it with /tmp/seenmacs as argument 1, and an output file as argument 2.

$ bash /tmp/makemigmac.sh /tmp/seenmacs /tmp/migsearchmacs

/tmp/migsearchmacs now contains a number of MIG netstat commands that will search seen MAC addresses across the configured interfaces of known hosts. Run the commands and pipe the output to a results file.

$ for migcmd $(cat /tmp/migsearchmacs); do $migcmd >> /tmp/migfoundmacs; done

We now have a file with seen MAC addresses, and another one with MAC addresses configured on known systems. Doing the delta of the two is fairly easy in bash:

$ for seenmac in $(awk '{print tolower($5)}' /tmp/seenmacs|sort|uniq); do
    hasseen=""; hasseen=$(grep $seenmac /tmp/migfoundmacs)
    if [ "$hasseen" == "" ]; then
        echo "$seenmac is not accounted for"
    fi
done
00:21:59:96:75:7f is not accounted for
00:21:59:98:d5:bf is not accounted for
00:21:59:9c:c0:bf is not accounted for
00:21:59:9e:3c:3f is not accounted for
00:22:64:0e:72:71 is not accounted for
00:23:47:ca:f7:40 is not accounted for
00:25:61:d2:1b:c0 is not accounted for
00:25:b4:1c:c8:1d is not accounted for

Automating the detection

It's probably a good idea to run this procedure on a regular basis. The script below will automate the steps and produce a report you can easily email to your favorite security team.

#!/usr/bin/env bash
SEENMACS=$(mktemp)
SEARCHMACS=$(mktemp)
FOUNDMACS=$(mktemp)
echo "seen mac addresses are in $SEENMACS"
echo "search commands are in $SEARCHMACS"
echo "found mac addresses are in $FOUNDMACS"

echo "step 1: obtain all seen MAC addresses"
$(which mig) netstat -nm "^[0-9a-f]" 2>/dev/null | grep 'found neighbor mac' | awk '{print tolower($5)}' | sort | uniq > $SEENMACS

MACCOUNT=$(wc -l $SEENMACS | awk '{print $1}') 
echo "$MACCOUNT MAC addresses found"

echo "step 2: build MIG commands to search for seen MAC addresses"
i=50
while true;
do
    echo -n "$i.."
    echo -n "$(which mig) netstat -e 50s " >> $SEARCHMACS
    for mac in $(cat $SEENMACS | head -$i | tail -50)
    do
        echo -n "-lm $mac " >> $SEARCHMACS
    done
    echo -n " >> $FOUNDMACS" >> $SEARCHMACS
    if [ $i -gt $MACCOUNT ]
    then
        break
    fi
    echo " 2>/dev/null &" >> $SEARCHMACS
    i=$((i+50))
done
echo
echo "step 3: search for MAC addresses configured on local interfaces"
bash $SEARCHMACS

sleep 60

echo "step 4: list unknown MAC addresses"
for seenmac in $(cat $SEENMACS)
do
    hasseen=$(grep "found local mac $seenmac" $FOUNDMACS)
    if [ "$hasseen" == "" ]; then
        echo "$seenmac is not accounted for"
    fi
done

The list of unknown MACs can then be used to investigate the endpoints. They could be switches, routers or other network devices that don't run the MIG agent. Or they could be rogue endpoints that you should keep an eye on.

Happy hunting!

Thursday, July 9 2015

You can't trust the infra; Encrypt client side!

By Julien Vehent, Thursday, July 9 2015.

Like most of my peers in the infosec community, I learned that good data protection requires strong infrastructure security controls. I practiced the art of network security, learned the arcanes of systems hardening and used those concepts in securing web infrastructures.

But it's 2015, and infrastructure security just doesn't cut it anymore. The cost of implementing controls continues to grow, while our capabilities keep being reduced by cloud environments that limit the perfect security world we want to live in. Cloud is good for business, but it makes infrastructure security really difficult. In the cloud, IDS/IPS aren't usable, or with very limited capabilities. DDoS protection must be done higher up in the stack because you can't access the routing layer. At rest data encryption isn't useful when the keys are stored next to the data. TLS encryption is not used inside the infrastructure because certificate management is hard, so we end up transferring cleartext userdata on massively shared network, hoping its somewhat isolated. The list of security problems we simply cannot solve with reasonable cost/complexity in cloud environments is quite long, and caused many infosec professional hours of ranting.

What about datacenters? It certainly is easier to control infrastructure security there, but ultimately the problem is the same: we're just not 100% sure of what hardware we run our systems on. SMM malwares are a reality, and we know (Thank you Snowden!) that the NSA and other security agencies have the tools to intercept hardware and install their own little spy packages.

If the ultimate goal is perfect data security, I don't think we can achieve it in the current infrastructure security landscape.

Meanwhile, users have been pushing more and more data into the web. Hackers have been hard at work to break into our services and leak that data out to the world. When it's not hackers, the sheer complexity of web infrastructures themselves has caused many a team to unintentionally press the wrong button, and post data where it shouldn't be (password leak on pastebin, anyone?). Ask around in the incident response community, and they will tell you how busy the last couple of years have been dealing with data leaks.

Heck, Amazon even automated looking into Github, a third party company, for AWS keys that infrastructure operators leak! That's like your banks watching CCTV of public transports to alert when you forget your wallet in the metro. Those incidents have become very common, and unfortunately cannot be solved by another layer of firewall.

Looking at what we host at Mozilla, it's easy to spot a small number of services that store data we absolutely never want to leak. We focus our infosec efforts on those, and with everyone's help build systems that we hope are safe enough. It's hard, and there is always that fear of missing something that could expose information from our users. In that landscape, there is one category of services that I'm just not too worried about: the ones that store data already encrypted on the client side.

Firefox Sync is a good example of such service. The data in Sync is strongly encrypted, in Firefox, before being sent to our storage servers. We (Mozilla) don't have the keys. We can't leak the keys. The worst we can do is leak encrypted blobs that probably no one has the ability to decrypt. This is a much better security control that anything else we can ever put on the infrastructure side. It just seems right.

Designing services that encrypt data on the client is the next challenge of information security. It requires that infosec folks work closely with developers, when most of their time is currently spent with sysadmins. It also changes the skillset we need to do our job, and focuses more on a strong understanding of cryptography. Not just SSL/TLS, but crypto algorithms themselves. Javascript is getting better and APIs like WebCrypto or libraries like OpenPGPJS are the way forward to implement client-side encryption. Key management is almost irrelevant if we accept that keys should be derivated from user passwords, like Firefox Accounts did.

Client-side encryption has the added benefit of empowering the user and making them responsible for the security of their data. It's not realistic to expect every business that operates a web service to run like a bank. But it is realistic to expect individuals to care about the security of their own photos, videos, emails, conversations and browsing history. Most users already do care and would welcome more control on their data. Business people, however, are the ones that are hard to convince, because they love looking inside all that data and building dashboard and graphs, and designing fancy statistical models to boost marketing and convertion rates. Note that those things can still be done, but client side (that's how our Directory Tiles advertising service operates).

We have seen with Lavabit that client-side encryption does not reduce the attack surface to zero. A government can still force a service operator to change the client code to retrieve decrypted data. But the cost of attacking a service that way is immensely larger than simply breaking into a database server.

So, should we get rid of our firewall and encrypt everything with javascript? No! Absolutely not. Infrastructure security remains an important component of any infosec strategy, but it has reached a plateau and we need to look for new techniques to continue improving our posture. Cloud providers help streamline the management of firewall rules and network security policies. DevOps practices with VMs and containers help isolate and rotate services quickly. All those things are important but have not solved the data risk in its entirety. Client-side encryption is the next step.

In the future, I'd like to see web services default to HTTPS and use Javascript (or anything else) to encrypt data before handing it over to services that have grown too large, too complex and too cheap to secure perfectly. How we do this, is very much left as an open question.

Wednesday, March 11 2015

10 years of self-hosting Linuxwall.info

By Julien Vehent, Wednesday, March 11 2015.

On March 11th, 2005, a small group of nerds studying in the overly boring city of Blois, France, decided to buy a domain to play with self hosting DNS and email.

10 years later, linuxwall.info is still (mostly) self-hosted and we are (mostly) nerds. It has been, and still is, a lot of fun and I believe the learning process has to progress two to three times faster than my peers who don't self-host.

(on the left: probably the first server to host linuxwall.info, back in my apartment in Blois)

10 years is probably long enough of an experiment to draw lessons on the state of self-hosting. So let's start with the conclusion: self-hosting is entirely possible, but only if you invest the time and energy to do it right. There is no good way to self-host without spending time on it. Symptoms of poorly done self-hosting include loss of email, website headaches, connectivity issues and angry spouse.

What follows is a rapid overview of what I learned doing self-hosting for a decade, starting with the most frustrating aspect of it...

Don't count on your ISP for help

Over the last 10 years, the people who own the pipes have made absolutely zero effort to facilitate self-hosting. I've hosted linuxwall.info on Free.fr, Neuf/SFR, Comcast and Verizon. The only one who provided me with a static IP was Free.fr. Comcast went as far as blocking tcp/25 inbound without providing a way to disable the blocking. ISPs are completely uncooperative with self-hosters, and for no particular reason other than pushing people toward "business" class of bandwidth (same exact thing, but with a static IP and an extra $100 on your monthly bill).

Uplink is probably what self-hosters fear the most. Will it be enough bandwidth? Will it slow down the Internet for the rest of the house? Here is the truth: you need almost no bandwidth to self-host. For many years, linuxwall.info was hosted with 128kbps of uplink, and it worked fine. The handful of inbound email, http connections, xmpp chats or DNS requests will fit into a tiny percentage of your uplink, and you won't even notice it. Just make sure you don't run offsite backups while your spouse is watching netflix... (or use QoS).

Self-hosters need friends

Don't self-host alone, that won't work. The chances of your internet connection going down are too great. Self-hosting must be done the way the internet was built: if one endpoint goes down, there must be another endpoint that takes over. For DNS, that's easy enough with slaves in multiple locations. For web or email, it's more difficult.

(on the right: the linuxwall.info crew: steph, franck, christophe, myself and jerome, circa 2005)

Hosting email servers is definitely the hardest, because building an IMAP distributed cluster with something like dovecot or cyrus-imap is far from trivial. However, it is fairly easy to build a secondary MX with postfix that buffers inbound emails when the primary is down, and just forwards them to the primary when it comes back online.

That's how linuxwall's MX operates. smtp.linuxwall.info is a primary email server (described here), and if it goes down, smtp2.linuxwall.info will receive the inbound mail, cache it, and forward it to the primary when back online. SMTP2 is really just running Postfix with a very basic transport configuration, so we avoid the madness of synchronizing IMAP datastores across servers.

$ dig +short MX linuxwall.info
20 smtp2.linuxwall.info.
10 smtp.linuxwall.info.

root@smtp2:/etc/postfix # cat transport
linuxwall.info  relay:smtp.linuxwall.info:25

DNS is another one that Nowadays, linuxwall's DNS servers host more than one domain. Besides the main one, there is 1nw.eu, chatonly.org, frimousse-action.com, insecure.ws, lesroutesduchocolat.{fr,com}, tzib.net, necto.org and a few more. The root DNS is hosted in my house, but on a dynamic IP, so it is only used as an authority to the 3 slaves that have static IPs. DNS outages are rare because of the distributed nature of the platform, but it does require that several people participate in the network. Don't self-host alone!

Do hard things and write about it

DNS, Email, XMPP, distributed storage, VPN, public websites, ... The hardest things to self-host are also the most interesting. After many years of internet presence, I have enough public websites to warrant paying for a hosted server at online.net. Most of the critical parts of my infra (email, dns slave, websites) are now hosted on linux containers on that host. For a while though, everything that touched the linuxwall.info domain was hosted on personal ADSL connections (the french equivalent to cable).

Linuxwall.info was created to experiment with all sorts of cool technologies that we simply didn't have access to as students. One of the goal was to write about these technologies, and logically the first site we created was http://wiki.linuxwall.info. Over the years, we've written about dozens of tools, setups, successes and failures and some of these articles have become quite popular:

Journey to the Center of the Linux Kernel: Traffic Control, Shaping and QoS
Building and using a PKI (in french)
Nginx performance tuning
Local Jabber/XMPP server with ejabberd and Debian
Effectively fighting spam with DSPAM
DKIM Signature and verification using DKIMproxy
Nagios with Nginx on Fedora (or Amazon Linux AMI)
Mysql 5 clustering (in french)
Postscreen, l'exterminateur de zombies (in french)
Necto: full linux stack for self-hosting email
Soft RAID5 on linux 2.6 (in french)
and the list goes on and on at http://wiki.linuxwall.info/doku.php/start

As it turns out, most of the tech we experimented with is identical to what is deployed in "professional" environments. Same Apache conf, same postfix, same VPN, same DKIM, same RAID, etc... Sometimes, self-hosting is even more advanced that what you'd find in your typical LAMP stack at work, because the constraints on hosting stuff at home require a good amount of engineering creativity.

Most certainly, self-hosting helped me become a better engineer. It helped me acquire experience much more quickly than if I had waited for exposure at work. It does come with the cost of many and many weekends spent deploying, fixing, upgrading and operating an infrastructure that could just as well been hosted elsewhere. For free. But without the satisfaction or making it all work myself.

Control the traffic

QoS is perhaps the most interesting networking challenge a self-hoster can take upon. I have spent hours tweaking the QoS of my gateway server, and wrote a lengthy analysis of Linux's QoS stack at Journey to the Center of the Linux Kernel: Traffic Control, Shaping and QoS. Studying QoS not only helps understand the details of network protocols, but also made me appreciate minimalistic designs: almost all network services that run on the internet today can be hosted with just a few kilobytes per seconds of bandwidth. The myth of gigabits internet connections for residences is built by ISPs as a marketing tool. Even at pick times, self-hosting at home while doing video-conferences does not make my uplink use more than one megabit of traffic. And I verizon pay for 25 of those!

(on the left: a week of uplink from my house, using up to 720kbps)

I can't remember a time when I filled up an uplink during normal operations. Sure, there are times when a large uplink is pleasant - for example when recovering 20GB of email backups from a friend's server - but 99% of the time a basic QoS policy will ensure that your services have the necessary bandwidth without annoying all the users in the house.

The takeaway

Most people get exhausted of self-hosting after a few years, give up and move everything to "the cloud". We haven't, and it's been 10 years. The secret is to take the time to do things right, and not do it alone.

For sure, there will be times of reading your emails directly from the cache of Postfix, because the IMAP server is down Or times when you arrive at a nice vacation spot only to realize that the IP of your home server has changed, and you can't update the DNS for a week. Those times suck. But they also help you think about reliability and high-availability. Stuff that, once you've acquired the knowledge, people will pay a lot of money for.

tl;dr: self-hosting is awesome. 10 years in and going strong, Linuxwall.info is reading for the next 10!

Monday, December 15 2014

Stripe's AWS-Go and uploading to S3

By Julien Vehent, Monday, December 15 2014.

Yesterday, I discovered Stripe's AWS-Go library, and the magic of auto-generated API clients (which is one fascinating topic that I'll have to investigate for MIG).

I took on the exercise of writing a simple file upload tool using aws-go. It was fairly easy to achieve, considering the complexity of AWS's APIs. I would have to evaluate aws-go further before recommending it as a comprehensive AWS interface, but so far it seems complete. Check out http://godoc.org/github.com/stripe/aws-go/gen for a detailed doc.

The source code is below. It reads credentials from ~/.awsgo:

$ cat ~/.awsgo
[credentials]
accesskey = "AKI...."
secretkey = "mw0...."

It takes a file to upload as the only argument, and returns the URL where it is posted.

$ ./s3up s3up
https://s3.amazonaws.com/testawsgo/s3up

AWS-Go is not revolutionary compared to python & boto, but benefits from Go's very clean approach to programming. And getting rid of install dependencies, pip and python{2{6,7},3} hell is kinda nice!

package main

import (
	"code.google.com/p/gcfg"
	"fmt"
	"github.com/stripe/aws-go/aws"
	"github.com/stripe/aws-go/gen/s3"
	"os"
)

// conf takes an AWS configuration from a file in ~/.awsgo
// example:
//
// [credentials]
//    accesskey = "AKI...."
//    secretkey = "mw0...."
//
type conf struct {
	Credentials struct {
		AccessKey string
		SecretKey string
	}
}

func main() {
	var (
		err         error
		conf        conf
		bucket      string = "testawsgo" // change to your convenience
		fd          *os.File
		contenttype string = "binary/octet-stream"
	)
	// obtain credentials from ~/.awsgo
	credfile := os.Getenv("HOME") + "/.awsgo"
	_, err = os.Stat(credfile)
	if err != nil {
		fmt.Println("Error: missing credentials file in ~/.awsgo")
		os.Exit(1)
	}
	err = gcfg.ReadFileInto(&conf, credfile)
	if err != nil {
		panic(err)
	}

	// create a new client to S3 api
	creds := aws.Creds(conf.Credentials.AccessKey, conf.Credentials.SecretKey, "")
	cli := s3.New(creds, "us-east-1", nil)

	// open the file to upload
	if len(os.Args) != 2 {
		fmt.Printf("Usage: %s <inputfile>\n", os.Args[0])
		os.Exit(1)
	}
	fi, err := os.Stat(os.Args[1])
	if err != nil {
		fmt.Printf("Error: no input file found in '%s'\n", os.Args[1])
		os.Exit(1)
	}
	fd, err = os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer fd.Close()

	// create a bucket upload request and send
	objectreq := s3.PutObjectRequest{
		ACL:           aws.String("public-read"),
		Bucket:        aws.String(bucket),
		Body:          fd,
		ContentLength: aws.Integer(int(fi.Size())),
		ContentType:   aws.String(contenttype),
		Key:           aws.String(fi.Name()),
	}
	_, err = cli.PutObject(&objectreq)
	if err != nil {
		fmt.Printf("Error: %v\n", err)
	} else {
		fmt.Printf("%s\n", "https://s3.amazonaws.com/"+bucket+"/"+fi.Name())
	}

	// list the content of the bucket
	listreq := s3.ListObjectsRequest{
		Bucket: aws.StringValue(&bucket),
	}
	listresp, err := cli.ListObjects(&listreq)
	if err != nil {
		panic(err)
	}
	if err != nil {
		fmt.Printf("Error: %v\n", err)
	} else {
		fmt.Printf("Content of bucket '%s': %d files\n", bucket, len(listresp.Contents))
		for _, obj := range listresp.Contents {
			fmt.Println("-", *obj.Key)
		}
	}
}

Thursday, November 20 2014

SSL/TLS for the Pragmatic

By Julien Vehent, Thursday, November 20 2014.

Tonight I had the pleasure to present "SSL/TLS for the Pragmatic" to the fine folks of Bucks County Devops. It was a fun evening, and I want to thank the organizers, Mike Smalley & Ben Krein, for the invitation.

It was a great opportunity to summarize 18 months of work at Mozilla on building the Server Side TLS Guidelines. By the feedback I received tonight, and on several other occasions, I think we've achieved the goal of building a document that is useful to operations people, and made TLS just a little easier to understand.

We are not, however, anywhere done with the process of teaching TLS to the Internet. Stats speak for themselves, with 70% of sites still supporting SSLv3, 86% enabling RC4, and about 32% still not preferring PFS over RSA handshakes. But things are getting better every day, and ongoing efforts may bring safe defaults in Linux servers as soon as Fedora 21. We live in exciting times!

The slides from my talk are below, and on github as well. I hope you enjoy them. Feel free to share your comments at julien[at]linuxwall.info.

« previous entries - page 1 of 32

Be brief

Tell a good story

Be technical

Never make assumptions

Don't ignore backward compatibility

Use boring tech

Delete your code

Be passionate, and respectful

Onward.

It didn’t use to be this way

Getting closer to devs & ops

Don’t just build security tools.Build operational tools that do things securely.

Avoid Making Assumptions

Assumptions lead to vulnerability

Setting the expectations

Clear Expectations↓Checklist↓Self Assessment↓Profit

Not having to say “no”

Being Strategic

To go beyond the security team

Get the security team closer to your organization

High operational cost

QA uncertainty

Huge attack surface

Everything in moderation

Are there any skills or experiences beyond programming required to be an interesting candidate for foreign employers

Were there any challenges that you faced, other than immigration, that I may need to be aware of as a foreign candidate?

And lastly, do you have any advice on how to stand out from the rest of the candidates aside from just a good resume (maybe some specific volunteer experience or unique skills that I could gain while I am still finishing my thesis?)

What about Power?

Take Away

Return on investment

Students engagement is hard to maintain

Will MWoS rise from the ashes?

The place

Bandwidth

Microphone

Webcam

Spread the word!

What are we missing?

In conclusion

The initial trust

KMS, Trust and secrets distribution

SOPS: Secrets OPerationS

Simplifying key management

Updating the master keys

Rotation

Assuming roles

Check it out, and contribute!

LDAP all the things!

Asserting the content of a file

Introducing Macroal mode

Introducing Mismatch mode

Building regexes for authorized_keys files

Automating the investigation

In conclusion

Understanding go get

Serving the meta tag from HAProxy

Building MIG

Telling everyone to use the right path

Automating the detection

Don't count on your ISP for help

Self-hosters need friends

Do hard things and write about it

Control the traffic

The takeaway

Julien Vehent

Security @ Mozilla

about me

My Book

Search

Last entries

Atom Feed

RSS Feed

Don’t just build security tools.
Build operational tools that do things securely.

Clear Expectations
↓
Checklist
↓
Self Assessment
↓
Profit