Job Summary
Responsible for extracting knowledge and insights from high-volume, high-dimensional data to investigate complex business problems through a range of data preparation, modeling, analysis, and visualization techniques, which may include advanced statistical analysis, algorithms, predictive modeling, experimentation, and pattern recognition, to create solutions that enhance business performance. Designs, develops, and programs methods, processes, and systems to consolidate and analyze unstructured, diverse “big data” sources and generate actionable insights for client services and product enhancement. The team is composed of experts in artificial intelligence, deep learning, data structures, algorithms, distributed systems, and system performance and analysis. The systems the team builds are used across the multitude of Company data science-based services and deployments. Works independently with minimal-to-no supervision and demonstrates the ability to lead projects and initiatives autonomously.
Job Description
About the Role:
We are seeking an experienced Data Scientist to join our growing Operational Intelligence team. You will
play a key role in building intelligent systems that help reduce alert noise, detect anomalies, correlate
events, and proactively surface operational insights across our large-scale streaming infrastructure.
You’ll work at the intersection of machine learning, observability, and IT operations, collaborating
closely with Platform Engineers, SREs, Incident Managers, Operators and Developers to integrate smart
detection and decision logic directly into our operational workflows.
This role offers a unique opportunity to push the boundaries of AI/ML in large-scale operations. We
welcome curious minds who want to stay ahead of the curve, bring innovative ideas to life, and improve
the reliability of streaming infrastructure that powers millions of users globally.
What You’ll Do:
• Design and tune machine learning models for event correlation, anomaly detection, alert
scoring, and root cause inference
• Engineer features to enrich alerts using service relationships, business context, change history,
and topological data
• Apply NLP and ML techniques to classify and structure logs and unstructured alert messages
• Develop and maintain real-time and batch data pipelines to process alerts, metrics, traces, and
logs
• Use Python, SQL, and time-series query languages (e.g., PromQL) to manipulate and analyze
operational data
• Collaborate with engineering teams to deploy models via API integrations, automate workflows,
and ensure production readiness
• Contribute to the development of self-healing automation, diagnostics, and ML-powered
decision triggers
• Design and validate entropy-based prioritization models to reduce alert fatigue and elevate
critical signals
• Conduct A/B testing, offline validation, and live performance monitoring of ML models
• Build and share clear dashboards, visualizations, and reporting views to support SREs, engineers,
and leadership
• Participate in incident postmortems, providing ML-driven insights and recommendations for
platform improvements
• Collaborate on the design of hybrid ML + rule-based systems to support dynamic correlation and
intelligent alert grouping
• Lead and support innovation efforts including POCs, POVs, and exploration of emerging AI/ML
tools and strategies
• Demonstrate a proactive, solution-oriented mindset with the ability to navigate ambiguity and
learn quickly
• Participate in on-call rotations and provide operational support as needed
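One of the responsibilities above, entropy-based prioritization to reduce alert fatigue, can be sketched in a few lines. This is an illustrative example only, not a description of the team's actual systems; the function names and grouping scheme are hypothetical. The idea: alert groups whose messages are highly repetitive carry little information and can be deprioritized, while diverse groups are surfaced first.

```python
import math
from collections import Counter

def shannon_entropy(messages):
    """Shannon entropy (in bits) of the empirical distribution of messages."""
    counts = Counter(messages)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def prioritize(alert_groups):
    """Rank alert groups by message entropy, highest first.

    Low-entropy groups (the same message repeating) are likely noise;
    high-entropy groups contain more distinct signals and surface first.
    """
    return sorted(alert_groups.items(),
                  key=lambda kv: shannon_entropy(kv[1]),
                  reverse=True)

# Example: a noisy repetitive group vs. a diverse one.
groups = {
    "db": ["timeout", "timeout", "timeout"],
    "net": ["packet-drop", "latency-spike", "connection-reset"],
}
ranked = prioritize(groups)  # "net" outranks "db"
```

In practice this scoring would be one feature among several (service criticality, change history, topology), not a standalone ranking.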
Qualifications:
• Bachelor's or Master's degree in Computer Science, Data Science, Machine Learning, Statistics or
a related field
• 5+ years of experience building and deploying ML solutions in production environments
• 2+ years working with AIOps, observability, or real-time operations data
• Strong coding skills in Python (including pandas, NumPy, Scikit-learn, PyTorch, or TensorFlow)
• Experience working with SQL, time-series query languages (e.g., PromQL), and data
transformation in pandas or Spark
• Familiarity with LLMs, prompt engineering fundamentals, or embedding-based retrieval (e.g.,
sentence-transformers, vector DBs)
• Strong grasp of modern ML techniques including gradient boosting (XGBoost/LightGBM),
autoencoders, clustering (e.g., HDBSCAN), and anomaly detection
• Experience managing structured and unstructured data, and building features from logs, alerts,
metrics, and traces
• Familiarity with real-time event processing using tools like Kafka, Kinesis, or Flink
• Strong understanding of model evaluation techniques, including precision/recall trade-offs, ROC
curves, AUC, and calibration
• Comfortable working with relational (PostgreSQL), NoSQL (MongoDB), and time-series
(InfluxDB, Prometheus) databases
• Ability to collaborate effectively with SREs, platform teams, and participate in Agile/DevOps
workflows
• Clear written and verbal communication skills to present findings to technical and non-technical
stakeholders
• Comfortable working with Git, Confluence, and JIRA in collaborative Agile environments
Nice to Have:
• Experience building or contributing to an AIOps platform (e.g., Moogsoft, BigPanda, Datadog,
Aisera, Dynatrace, BMC, etc.)
• Experience working in streaming media, OTT platforms, or large-scale consumer services
• Exposure to Infrastructure as Code (Terraform, Pulumi) and modern cloud-native tooling
• Working experience with Conviva, Touchstream, Harmonic, New Relic, Prometheus, and event-based alerting tools
• Hands-on experience with LLMs in operational contexts (e.g., classification of alert text, log
summarization, retrieval-augmented generation)
• Familiarity with vector databases (e.g., FAISS, Pinecone, Weaviate) and embeddings-based
search for observability data
• Experience using MLflow, SageMaker, or Airflow for ML workflow orchestration
• Knowledge of LangChain, Haystack, RAG pipelines, or prompt templating libraries
• Exposure to MLOps practices (e.g., model monitoring, drift detection, explainability tools like
SHAP or LIME)
• Experience with containerized model deployment using Docker or Kubernetes
• Use of JAX, Hugging Face Transformers, or LLaMA/Claude/Command-R models in
experimentation
• Experience designing APIs in Python or Go to expose models as services
• Cloud proficiency in AWS/GCP, especially for distributed training, storage, or batch inferencing
• Contributions to open-source ML or DevOps communities, or participation in AIOps
research/benchmarking efforts
• Certifications in cloud architecture, ML engineering, or data science specialization
We believe that benefits should connect you to the support you need when it matters most, and should help you care for those who matter most. That's why we provide an array of options, expert guidance and always-on tools that are personalized to meet the needs of your reality—to help support you physically, financially and emotionally through the big milestones and in your everyday life.
Please visit the benefits summary on our careers site for more details.
Education
Master's Degree
While possessing the stated degree is preferred, Comcast also may consider applicants who hold some combination of coursework and experience, or who have extensive related professional experience.
Certifications (if applicable)
Relevant Work Experience
5-7 Years
Comcast is an equal opportunity workplace. We will consider all qualified applicants for employment without regard to race, color, religion, age, sex, sexual orientation, gender identity, national origin, disability, veteran status, genetic information, or any other basis protected by applicable law.