Archive for the ‘Operations’ Category

IT Operations in a cloud-based environment - What would a team structure look like?

April 13th, 2012
Eric Ries' lean movement is picking up steam and is extending agile software development to the wider organisation. It's interesting to see how some organisations have changed over time as their markets have become more competitive. REA Group, the company I work for, has made some significant changes over the past few years, including:

  • Adopted the agile software delivery process throughout IT replacing the traditional waterfall method
  • Sliced and diced 'development / delivery' resources in different ways to provide accountability to the segment of the business they are working on
  • Adopted a more collaborative approach between IT Operations and IT development/delivery

The traditional 

That being said, there are many companies that arrange teams like
  • Development - the people making the code
  • QA - the people writing and executing automated testing
  • Operations - the people that take the code and run it at scale 
  • Network / security - the people that own infrastructure networking 
  • DBA - the people that performance tune and own databases
using dedicated infrastructure.

The start up

In contrast a new start-up website may have a very small team of
  • Development doing QA and 'deployment' 
using a public cloud service like Heroku, dotCloud, Rackspace Cloud or Amazon Web Services.


Now let's assume that start-up is successful and needs to scale fast to compete in the market. Would they scale by subdividing teams in the 'traditional model', or use an alternate model that fosters being 'lean' and 'agile'? I actually think the team structure would be
  • Small infrastructure team - providing infrastructure services like 
    • spinning up new (disposable) instances
    • a monitoring / alerting service / framework
    • a standard deployment tooling service / framework
  • Large delivery team based around a service. The team would consist of 
    • Amazing coders on the technology
    • A QA consultant to assist those amazing coders with the ability to write test coverage and automated testing
    • An operations consultant that can either take the infrastructure tooling and extend it to the project/service, or work with the 'amazing coders' to enable them to build and deploy systems at scale
So where would the 'DBA' fit? This is where I'm not too sure. I would think they would fit into an 'infrastructure' team, providing a PaaS database service that the delivery teams would hook into when they needed whatever datastore.

My thoughts

I think there are some interesting times ahead for traditional-model IT organisations, and having teams based around development and operations just will not work. Operations can no longer afford to be a 'keep the lights running' type gig. It has to be an enabler of infrastructure services to the wider IT community. "I won't be the 'pager' person, I'll provide you with a monitoring and alerting system that you can hook into when you build your 'lean' minimum viable product".
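
As a rough illustration of what "hooking into" a shared monitoring and alerting service could look like, here is a toy Python sketch. The `MonitoringService` class and its `register` and `run_once` methods are hypothetical names invented for this example, not any real product's API. The idea is only that the infrastructure team owns the service, while each delivery team supplies its own checks and decides who gets alerted.

```python
class MonitoringService:
    """Toy in-process stand-in for a shared monitoring/alerting service (hypothetical)."""

    def __init__(self):
        # Maps (team, check name) -> (check callable, alert callable).
        self._checks = {}

    def register(self, team, name, check, on_alert):
        """A delivery team hooks in a health check and its own alert handler."""
        self._checks[(team, name)] = (check, on_alert)

    def run_once(self):
        """Run every registered check once and alert the owning team on failure."""
        alerts = []
        for (team, name), (check, on_alert) in self._checks.items():
            try:
                healthy = bool(check())
            except Exception:
                # A check that crashes is treated the same as a failing check.
                healthy = False
            if not healthy:
                on_alert(team, name)
                alerts.append((team, name))
        return alerts


paged = []
service = MonitoringService()
service.register("search", "homepage responds", lambda: True,
                 lambda team, name: paged.append((team, name)))
service.register("search", "queue drained", lambda: False,
                 lambda team, name: paged.append((team, name)))
print(service.run_once())
```

Here the delivery team, not operations, receives the alert for its own failing check; operations only runs the service.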

Your thoughts

I would be interested in hearing about organisational structure and role changes with the boom of cloud, agile and lean. How do you think an IT organisation should be structured to provide the best culture, environment and challenges while also delivering significant business value?





The Casual MySQL DBA – Operational Basics

November 17th, 2010

So you’re not a MySQL DBA, but you have to perform like one. If you have a production environment that’s running now, what are the first things you do when it’s not running, or is reported as not running?

  1. Are the MySQL processes running? (i.e. mysqld and mysqld_safe)
  2. Can you connect locally via cli?
  3. What’s in the MySQL error log?
  4. What are the current MySQL threads doing? Locked? Long running? How many? Idle sources?
  5. Can you connect remotely via cli?
  6. Verify free disk space.
  7. Verify system physical resources.
  8. If this is a slave, is MySQL replication running? Is it up to date?
  9. What is the current MySQL load, e.g. reads/writes/throughput/network/disk etc?
  10. What is the current InnoDB state and load? (if you’re using InnoDB)
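
Item 8 can be answered by reading the output of `SHOW SLAVE STATUS\G`. Below is a minimal Python sketch that assumes you have already captured that output as text (for example from the `mysql` command line client). The function names and the 60 second lag threshold are illustrative assumptions, and the sample output is made up and trimmed to three fields.

```python
def parse_slave_status(text):
    """Parse 'Key: value' lines from vertical-format SHOW SLAVE STATUS output."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        # Row separator lines contain no colon, so they are skipped here.
        if sep and key:
            status[key.strip()] = value.strip()
    return status


def replication_problems(status, max_lag_seconds=60):
    """Return a list of human-readable problems; an empty list means healthy."""
    if not status:
        return ["no slave status returned (is this server a slave?)"]
    problems = []
    if status.get("Slave_IO_Running") != "Yes":
        problems.append("IO thread is not running")
    if status.get("Slave_SQL_Running") != "Yes":
        problems.append("SQL thread is not running")
    lag = status.get("Seconds_Behind_Master", "NULL")
    if lag == "NULL":
        problems.append("replication lag is unknown (NULL)")
    elif int(lag) > max_lag_seconds:
        problems.append(f"slave is {lag}s behind the master")
    return problems


# Made-up sample, trimmed to the fields the sketch looks at.
SAMPLE = """\
*************************** 1. row ***************************
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
        Seconds_Behind_Master: NULL
"""

for problem in replication_problems(parse_slave_status(SAMPLE)):
    print("REPLICATION:", problem)
```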

After you do this manually more than once, you should be scripting these commands, both to be productive in future analysis and for proactive monitoring.
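
A scripted version might start out like the Python sketch below. It covers only three of the checks and makes several assumptions: that `pgrep` and `mysqladmin` are on the PATH, that `mysqladmin` can find its credentials (for example in `~/.my.cnf`), and that a 10% free disk threshold suits your system. The helper names are my own invention.

```python
import shutil
import subprocess


def run_command(args, timeout=10):
    """Run a command; return (ok, output) instead of raising if it is missing or hangs."""
    try:
        result = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, str(exc)
    return result.returncode == 0, (result.stdout or result.stderr).strip()


def check_disk_space(path="/", min_free_fraction=0.10):
    """Check that at least min_free_fraction of the filesystem holding path is free."""
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    return free_fraction >= min_free_fraction, f"{free_fraction:.0%} free on {path}"


def run_checks(checks):
    """Run each (name, callable) pair and collect (name, ok, detail) rows."""
    results = []
    for name, check in checks:
        try:
            ok, detail = check()
        except Exception as exc:  # one broken check should not stop the others
            ok, detail = False, f"check failed: {exc}"
        results.append((name, ok, detail))
    return results


CHECKS = [
    ("mysqld process", lambda: run_command(["pgrep", "-x", "mysqld"])),
    ("local connection", lambda: run_command(["mysqladmin", "ping"])),
    ("free disk space", check_disk_space),
]

if __name__ == "__main__":
    for name, ok, detail in run_checks(CHECKS):
        print(f"{'OK  ' if ok else 'FAIL'} {name}: {detail}")
```

Run from cron, the same script doubles as very basic proactive monitoring. The remaining checklist items can be added as further entries in `CHECKS`.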

Is a problem obvious? Does the output look different to what a normal environment looks like? (HINT: This list is not just for when there is a problem)
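
Knowing what "normal" looks like is easier if you record it. As a sketch, assuming you save a snapshot of a few numeric status values while the system is healthy, the hypothetical helper below flags anything that has moved by more than a chosen fraction. `Threads_connected` and `Threads_running` are real MySQL status variables; the numbers and the 50% tolerance are made up.

```python
def compare_to_baseline(baseline, current, tolerance=0.5):
    """Return the metrics that deviate from the baseline by more than tolerance (a fraction)."""
    deviations = {}
    for name, normal in baseline.items():
        value = current.get(name)
        if value is None:
            deviations[name] = "missing from current snapshot"
            continue
        if normal == 0:
            # No relative change can be computed from zero; flag any non-zero value.
            changed = value != 0
        else:
            changed = abs(value - normal) / abs(normal) > tolerance
        if changed:
            deviations[name] = f"baseline {normal}, now {value}"
    return deviations


baseline = {"Threads_connected": 40, "Threads_running": 4}
current = {"Threads_connected": 45, "Threads_running": 30}
print(compare_to_baseline(baseline, current))
```

In this made-up example only `Threads_running` is reported, because the connection count is within the tolerance.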

So moving forward?

  1. Is disk/memory/cpu/network bottleneck an issue you can resolve?
  2. Can you improve locking statements (if applicable)?
  3. Can you identify, analyse and tune long running statements?
  4. Do you know how to restart MySQL?
  5. Do you know who to call when you have a non working environment?
  6. When did your backup last run?
  7. Does your last backup work?
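
Question 6 is easy to automate. The Python sketch below assumes your backups land as files in a single directory and treats the newest file's modification time as the time of the last backup. Both the function names and the 24 hour threshold are assumptions for illustration.

```python
import os
import time


def newest_backup_age_hours(backup_dir, now=None):
    """Return the age in hours of the most recently modified file, or None if there is none."""
    now = time.time() if now is None else now
    with os.scandir(backup_dir) as entries:
        mtimes = [entry.stat().st_mtime for entry in entries if entry.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def backup_is_recent(backup_dir, max_age_hours=24):
    """True only if at least one backup file exists and it is newer than max_age_hours."""
    age = newest_backup_age_hours(backup_dir)
    return age is not None and age <= max_age_hours
```

Calling `backup_is_recent` with your own backup directory only tells you that a file was written recently. It says nothing about question 7: the only real test of a backup is restoring it.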

In order to support any level of production MySQL environment, you need to know the answers to these questions. If you don’t, then this is your homework checklist for MySQL DBA operations 101. There are a number of resources where you can find the answers, and help can be available online; however, never assume the timeliness of responses, especially if you’re expecting it for FREE! Open source software can be free, open source support rarely is.



Scribd is Hiring (I’m Looking for an Operations Engineer to Join My Team)

August 17th, 2010

Scribd is a top 100 site on the web and one of the largest sites built using Ruby on Rails. As one of the first Rails sites to reach scale, we’ve built a lot of infrastructure and solved a lot of challenges to get Scribd to where it is today. We actively try to push the envelope and have contributed substantial work back to the open source community.

Scribd has an agile, startup culture and an unusually close working relationship between engineering and ops. You’ll regularly find cross-over work at Scribd, with ops people writing application-layer code and engineers figuring out operations-level problems. We think we’re able to make that work because of the uniquely talented people we have on the team.

To allow us to keep scaling, we’re now looking to add a strong, experienced operations guru to the team. As a member of Scribd operations, you’ll have tremendous ownership and responsibility for one of the web’s most popular applications. Because Scribd is a startup, you will wear many hats and have broader responsibility than you would at a larger company.

If you read this blog, you should already have a good sense of the kind of work you’ll be doing in this position.

The Ideal Profile

You are an experienced operations professional and have run ops at at least one large-scale website. You have comprehensive knowledge of a broad variety of system tools, from MySQL and Nginx to Squid and Memcached. You should also have strong software development skills and be well-versed in major programming languages. You should be strongly motivated, a creative solution finder, and ready to jump into the thorniest technical problems whenever necessary.

Responsibilities

  • Develop and maintain all aspects of Scribd’s operations infrastructure, including system monitoring, backups, server configuration, databases, and caching systems
  • Collaborate with engineering to create next generation infrastructure to support changing requirements
  • Predict scaling problems before they occur and work with engineering to prevent them
  • Write and debug application-level Ruby code
  • Participate in an on-call rotation
  • Quickly diagnose server problems and employ preventive measures to maintain high availability servers

Qualifications

  • Bachelor’s degree in CS or equivalent experience
  • 3-5 years of professional experience in site operations
  • Strong software engineering skills, including knowledge of major programming languages
  • Strong database skills, preferably with MySQL, and overall Linux knowledge
  • Experience with most of the following technologies: MySQL, Nginx, Ruby, Memcached, Squid, git, Solr, HBase, Postfix
  • Proven ability to quickly learn and implement unfamiliar technologies
  • Strong desire to work hard at a rapidly growing company

Location: You are preferably located near San Francisco, CA. Relocation assistance is decided on a case-by-case basis. In short, we’ll be creative to get you here.

Contact: Please send your email cover letter and resume with the subject “Your name – Senior Site Operations Engineer – via Kovyrin.net” to jobs@scribd.com or contact me directly using any of my contacts. All communication and correspondence is held in the strictest confidence to ensure that you can connect and learn more without exposure.




A review of Web Operations by John Allspaw and Jesse Robbins

July 4th, 2010
Web Operations. By John Allspaw and Jesse Robbins, O’Reilly 2010, with a chapter by myself. (Here’s a link to the publisher’s site).

I wrote a chapter for this book, and it’s now on shelves in bookstores near you. I got my dead-tree copy today and read everyone else’s contributions to it. It’s a good book. A group effort such as this one is necessarily going to have some differences in style and even overlapping content, but overall it works very well. It includes chapters from some really smart people, some of whom I was not previously familiar with. John and Jesse obviously have good connections. A lot of the folks are from Flickr.

Here are the highlights in my opinion.

  • Theo Schlossnagle, who has a place on my list of essential books, opens things with an overview of what web operations really is, and why it’s hard. Don’t skip this. Theo’s introduction is concise and thoughtful.
  • Eric Ries discusses the benefits of continuous deployment. He is right on the money. Right out of college I spent 3 years as a developer at a company with very little engineering discipline, and then left for another company built by a small ace team practicing extreme programming. Eric nails the benefits of continuous deployment — he really gets it. I hadn’t heard of Eric before, but now I’ve subscribed to his blog.
  • John Allspaw (whose book on capacity planning is also on my list of essentials) and Richard Cook discuss how complex systems fail. This chapter appeared in part as a whitepaper and blog post on John’s blog, and is expanded in this book. I have spent a lot of time examining failures for clients, and as VP of Consulting, also a lot of time examining Percona’s own mistakes. I fully agree with the conclusions in this chapter. A few key points: there is never a single root cause; our desire to find one blinds us and keeps us from learning; true failures are inherently unpredictable and happen only when a series of things fails; avoiding failure requires experience with failure. This echoes another book I’ve read recently, The Black Swan.
  • Brian Moon’s chapter on unexpected traffic spikes. If you get a chance to hear Brian speak, take it. He’s an engaging guy with interesting and relevant stories to tell. Stories are always a better experience than bullet points.
  • Jake Loomis’s chapter on postmortems. My own research into prevention of emergencies agrees almost perfectly with his list of things to do on page 225. Read this chapter carefully! Now, knowing how to put this into action is hard — very hard — but at least you’ll have a place to start. The worst compliment I ever got after fixing a system that’d run out of hard drive space (due to utter lack of basic monitoring) was that I’d “saved the day.” Baloney. Postmortems can be a great way to learn your infrastructure’s weaknesses and prevent emergencies in the future. I’m fully confident that this particular client will again deploy new servers without adding them into Nagios, and the results will be predictable.
  • Naturally, my chapter about choosing a relational database architecture for web applications (skewed towards MySQL). There is a chapter on NoSQL databases by Eric Florenzano as well, but it is more introductory.

What wasn’t so good? I didn’t get a lot of value out of John’s interview with Heather Champ, on community management and web operations. I did not think the interview format worked well in a book full of essays. But that might just be me. Also, a couple of places in two or three chapters felt a bit rant-ish without a lot of clear actionable advice; I think readers won’t get so much out of this.

Overall, though, this is a great book, badly needed, on a topic that is simply not yet recognized for its true importance. As Theo writes, we’re seeing the emergence of web operations as a very large profession; it’s one whose definition is not yet formalized or agreed-upon, but that’ll change. It’s too important not to. Jesse’s introduction repeats this sentiment: the world now relies on the web, and so the world relies also on the engineers who make it run. Web operations is work that matters.


