System Operations Engineer
Broadway Gaming, Remote

System Operations Engineer

Broadway Gaming is a dynamic and expanding online gaming company operating mainly in the UK gaming market. We offer Bingo, Casino and Slot products across multiple brands. We have office locations in Dublin, UK, Tel Aviv, Romania and India. We are looking for a System Operations Engineer to join our team.

With a wide variety of backgrounds comes a wealth of experience, ideas and personalities, and we use these to create a great service and a great place to work and learn. Collaboration is fun, it benefits us all, and ultimately it benefits our customers!

We are seeking a highly skilled System Operations Engineer to oversee the day-to-day operations of our Network Operations Center (NOC), ensure adherence to Service Level Agreements (SLAs), and manage system monitoring tools such as DataDog. The System Operations Engineer will maintain the performance, availability, and reliability of all IT infrastructure, both on-premises and in the cloud.

In this role, you will lead a team of NOC engineers, monitor critical systems, troubleshoot issues, and ensure that all incidents are resolved promptly. You will also be in charge of configuring and managing monitoring tools like DataDog and Grafana to proactively identify potential problems and optimize system performance.

Responsibilities

1. NOC Management & Monitoring:

  • Oversee the daily operations of the NOC, ensuring continuous monitoring of infrastructure, networks, and applications.
  • Manage, mentor, and lead a team of NOC engineers responsible for real-time issue identification, resolution, and escalation.
  • Develop and maintain operational dashboards and alerts using monitoring tools such as DataDog, Nagios, and others.
  • Ensure incidents are appropriately prioritized, escalated, and resolved within defined SLAs.
  • Analyze system performance trends and implement improvements to ensure maximum uptime and performance.

2. SLA Management:

  • Define, monitor, and enforce Service Level Agreements (SLAs) to ensure adherence to performance standards and contractual obligations.
  • Regularly review and update SLA metrics in line with evolving business needs and system performance requirements.
  • Collaborate with service providers, partners, and vendors to ensure external SLAs are met.
  • Prepare and present SLA compliance reports to stakeholders, identifying areas of improvement.

3. Incident and Problem Management:

  • Manage the incident lifecycle, including incident detection, logging, categorization, and resolution.
  • Implement root cause analysis (RCA) for major incidents and ensure proper follow-up actions to prevent recurrence.
  • Develop, maintain, and implement incident response protocols and disaster recovery plans.

4. Configuration and Management of Monitoring Tools:

  • Configure and manage DataDog and other monitoring tools to ensure end-to-end visibility of IT infrastructure and cloud environments.
  • Optimize monitoring dashboards and alerts to proactively identify and resolve system performance issues.
  • Implement and configure integrations between monitoring tools and incident management platforms (e.g., PagerDuty, ServiceNow).
  • Continuously review and fine-tune monitoring strategies to improve early detection and incident prevention.

5. Collaboration and Communication:

  • Act as a liaison between the NOC, IT infrastructure teams, and other departments to ensure effective communication of system health and incidents.
  • Collaborate with DevOps and Cloud Engineering teams to optimize the performance of cloud-based applications and services.
  • Provide regular updates and reporting to senior management on system performance, incident metrics, and SLA adherence.

Requirements

  • 5+ years of experience in IT operations, with at least 3 years in a management role, preferably in a NOC or infrastructure environment.
  • Strong experience in configuring and managing monitoring tools such as DataDog, Zabbix, Nagios, or similar platforms.
  • Proven experience managing and enforcing SLAs and ensuring compliance with performance and availability targets.
  • Solid understanding of IT infrastructure, including cloud environments (AWS, Azure, GCP), networks, and on-premises systems.
  • Experience in incident management, root cause analysis (RCA), and disaster recovery processes.
  • Familiarity with ITIL practices and service management tools such as ServiceNow, PagerDuty, or Jira.
  • Strong leadership, team management, and communication skills.