About the Company

At AT&T, we’re connecting the world through the latest tech, top-of-the-line communications and the best in entertainment. Our groundbreaking digital solutions provide intuitive and integrated experiences for millions of customers across online, retail and care channels. Join our mission to deliver compelling communication and entertainment experiences to customers around the world as we continue to evolve as a technology-powered, human-centered organization. As part of our team, you’ll transform the way we deliver a seamless customer experience with digital at the center of all you do. In our world, digital is much larger than just an eCommerce channel: we are transforming all channels to perform digitally as one team to create a better customer experience. As we move through 2021, the digital transformation will revolutionize the digital space, and you can build a career that will propel your future.


About the Team

The mission of our Digital Operations team is to operate a fault-resilient, customer-centered, proactive DevOps team. The team is responsible for supporting systems that deliver AT&T’s customer experience across multiple internet-facing eCommerce applications, databases, platforms and technology stacks. Our customer-journey-centric Ops team is made up of Ops Engineers as well as Site Reliability Engineers (SREs) who are all focused on ensuring a highly available, resilient, performant and secure customer experience.

About the Job

Our Digital Operations team is looking for a Site Reliability Engineer (SRE) who is passionate about the customer experience and has the analytical and multitasking abilities to thrive in a fast-paced environment. The SRE is responsible for ensuring that, as new features and applications are introduced to production, essential aspects of reliability such as availability, resiliency, latency, efficiency, change management, monitoring, emergency response, and capacity planning are addressed alongside development of the new features/applications. The SRE will develop automation code and scripts to proactively address customer issues, reduce mean time to repair and improve application availability. The position also includes collaborating closely with feature delivery teams as a bridge between development and operations by applying a software engineering mindset to system administration. This position will split time between operations/on-call duties and guiding the development of systems and software that help increase site reliability and performance to deliver business value. The SRE will need intimate knowledge of the current state of data-center and cloud infrastructure, CI/CD pipeline tools, Kubernetes, and Site Reliability Engineering practices, and the ability to implement the plan for the desired future state. Attention to detail and strong analytical skills are required, along with a “Customer-First” attitude!


Responsibilities and Day-to-Day View

  • Build software to help operations and support teams - Proactively build and implement services to make operations more effective and reduce toil. This ranges from adjusting monitoring and alerting to automating scripts and code in production. The candidate may be tasked with building a homegrown tool from scratch to help with issues in software delivery or to resolve impacts from outages/incidents.
  • Fix support escalation issues; optimize on-call rotations and processes - Improve system reliability through the optimization of on-call processes. Add automation and context to alerts, leading to better real-time collaborative response from on-call responders. Additionally, update runbooks, tools and documentation to help prepare on-call teams for future incidents.
  • Document “tribal” knowledge - Gain exposure to systems in both staging and production, and take part in work with software development, support, IT operations and on-call duties to build up historical knowledge over time. Instead of siloing this knowledge, keep documentation and runbooks constantly up to date so that teams get the information they need right when they need it.
  • Conduct post-incident reviews - Run thorough and transparent post-incident reviews to keep teams honest and ensure that everyone is documenting their findings and taking action on their learnings. Take on action items for building or optimizing parts of the SDLC or incident lifecycle to bolster reliability of the service.
  • Develop automation for mission-critical applications using scripts and programs
  • Provide customer impact analysis and troubleshoot complex issues using domain knowledge of AT&T Sales & Ordering flows, applications and downstream interfaces
  • Support APIs in a Kubernetes (K8s) environment
  • Contribute to design and implementation of new system layers utilizing principles of high-complexity compute environments.
  • Provide on-call support for customer-facing production issues
  • Work with developers and environment teams to identify necessary resources and remove constraints to increase application availability.


Required Qualifications

  • Bachelor’s degree in Computer Science or related field
  • Experience in a Production Support, Operations, or Development environment
  • Experience in Java, Python, and shell scripting
  • Experience using Docker, Kubernetes and Cloud environments
  • Experience working in the cloud (Azure preferred)
  • Strong Unix, Networking and troubleshooting knowledge
  • Experience in Agile, Lean Agile and/or Scaled Agile methodologies
  • Experience with Customer Experience Analytics tools like Quantum Metric and CatchPoint
  • Solid understanding of and experience with Application Performance Monitoring tools like Dynatrace, AppDynamics, Introscope, etc.
  • Experience with visualization tools like Kibana and Grafana. EFK stack experience preferred.
  • Excellent communication and collaboration skills

Preferred Qualifications

  • Kubernetes Certified Engineer or equivalent certification
  • Azure / AWS certification
  • Experience mentoring & training others
  • Experience with Site Reliability Engineering

AT&T is leading the way to the future – for customers, businesses and the industry. We're developing new technologies to make it easier for our customers to stay connected to their world. Together, we’ve built a premier integrated communications and entertainment company and an amazing place to work and grow. Team up with industry innovators every time you walk into work, creating the world you always imagined. Ready to #transformdigital with us? Apply now!


AT&T is bringing it all together for our customers, from revolutionary smartphones to next-generation TV services and sophisticated solutions for multi-national businesses.

For more than a century, we have consistently provided innovative, reliable, high-quality products and services and excellent customer care. Today, our mission is to connect people with their world, everywhere they live and work, and do it better than anyone else. We're fulfilling this vision by creating new solutions for consumers and businesses and by driving innovation in the communications and entertainment industry.

We're recognized as one of the leading worldwide providers of IP-based communications services to businesses. We have the nation's most reliable 4G LTE network* and the largest international coverage of any U.S. wireless carrier, offering the most phones that work in the most countries. AT&T operates the nation's largest Wi-Fi network,** including more than 32,000 AT&T Wi-Fi Hot Spots at popular restaurants, hotels, bookstores and retailers, and provides access to nearly 1 million hotspots globally through roaming agreements.

AT&T U-verse is TV inspired by you. It's TV the way you want it, with tons of cool features and capabilities. AT&T is the only national TV service provider to offer a 100-percent IP-based television service. It's part of our "three-screen" integration strategy to deliver services across the three screens people rely on most - the mobile device, the PC and the TV.

As we continue to break new ground and deliver new solutions, we're focused on delivering the high-quality customer service that is our heritage.