Principal Observability Engineer

Sidley Austin LLP

United States, Illinois, Chicago

Dec 13, 2024

Recruiting Location

US-IL-Chicago

Department

Information Technology

Summary

The Principal Observability Engineer will be responsible for designing, implementing, and maintaining the tools and frameworks that provide comprehensive visibility into the performance and health of the firm's systems. The role will work closely with application owners as well as engineering and support teams to improve monitoring, logging, tracing, and alerting capabilities across the firm's infrastructure and enterprise application landscape, ensuring issues are detected, diagnosed, and responded to quickly. The Principal Observability Engineer will play a crucial part in building, improving, and maintaining the observability stack, enabling the firm to monitor, troubleshoot, and optimize the performance of applications and systems in real time.

Duties and Responsibilities

Develops and maintains observability frameworks by building and scaling observability solutions (monitoring, logging, tracing, and alerting) for applications, infrastructure, and services.

Designs, implements, and maintains metrics collection using tools such as Prometheus, Grafana, Datadog, or similar systems. Works with teams to identify key performance indicators (KPIs), service-level objectives (SLOs), and service-level indicators (SLIs).
Implements and optimizes centralized logging solutions such as Splunk or the ELK stack to capture, store, and analyze logs for proactive troubleshooting and insights.
Sets up and maintains distributed tracing systems such as Jaeger, Zipkin, or OpenTelemetry to gain visibility into the flow of requests across applications and services, and troubleshoot latency and performance issues.
Facilitates incident detection and response by working with engineering and support teams to set up automated alerts, thresholds, and response workflows for system anomalies and incidents using tools like PagerDuty or Opsgenie.
Collaborates with engineering and support teams to optimize system performance and reduce downtime through effective observability practices.
Creates and maintains detailed documentation for observability processes, best practices, and troubleshooting guides.
Focuses on multiple areas and provides technical and thought leadership to the firm, specifically in the area of end-user experience visibility.
Develops and executes technical software development strategy for the observability engineering domain. Accountable for the quality, usability, and performance of the solutions.
Partners with development teams and enterprise application owners to ensure proper instrumentation is in place for all critical services, and ensures observability is part of every new product or service deployment.
Handles internal and external relationships to help build and maintain positive strategic partnerships and drive observability and automation program success.
Champions automation efforts across IT. Drives design and build of automation and custom features for IT Operations needs.

Qualifications

To perform this job successfully, an individual must be able to perform the Duties and Responsibilities (Duties) above satisfactorily and meet the requirements below. The requirements listed below are representative of the minimum knowledge, skill, and/or ability required. Reasonable accommodations will be made to enable individuals with disabilities to perform the essential functions of the job. If you need such an accommodation, please email staffrecruiting@sidley.com (current employees should contact Human Resources).

Education and/or Experience:

A minimum of 7 years of experience in a role focused on observability and monitoring.
Strong experience with monitoring tools (e.g., Prometheus, Grafana, Datadog, New Relic).
Proficiency with log aggregation tools, specifically the ELK stack and/or Splunk.
Proficiency with APM tools such as Datadog, AppDynamics, or Dynatrace.
Experience with distributed tracing (e.g., Jaeger, Zipkin, OpenTelemetry).
Familiarity with cloud platforms (e.g., AWS, Google Cloud, Azure) and cloud-native technologies (e.g., Kubernetes, Docker).
Solid knowledge of service-level objectives (SLOs) and service-level indicators (SLIs) for measuring reliability and performance.
Knowledge of programming/scripting languages such as Python, Go, or Bash for automation and custom integrations.
Hands-on experience with incident management tools like PagerDuty, Opsgenie, or similar systems for automating alerting and response workflows.
Exposure to infrastructure as code (IaC) tools such as Terraform, CloudFormation, or similar. Proficiency with modern CI/CD pipelines and DevOps best practices.
Knowledge of security observability tools and practices to integrate security monitoring across the stack.

Other Skills and Abilities:

The following will also be required of the successful candidate:

Strong organizational skills
Strong attention to detail
Good judgment
Strong interpersonal communication skills
Strong analytical and problem-solving skills
Able to work harmoniously and effectively with others
Able to preserve confidentiality and exercise discretion
Able to work under pressure
Able to manage multiple projects with competing deadlines and priorities

Sidley Austin LLP is an Equal Opportunity Employer