Data & ML Engineer

Engineering | Remote | Full-time

About the Role

Design, build, and operate secure, multi-tenant data and machine-learning infrastructure. You’ll model operational datasets, ingest telemetry and application events, and deliver ML/AI features—from feature engineering and training to deployment, monitoring, and iteration—backed by reliable, compliant data flows.

Responsibilities

Design and evolve relational schemas for multi-tenant use; implement roles, policies, and row-level security
Build and maintain batch and streaming pipelines (ingest, transform, load) with strong SLAs and observability
Implement data quality checks, contracts, lineage, and automated validation for high-trust datasets
Optimize SQL performance (indexes, partitioning, materialized views) for time-series and analytics use cases
Own the ML lifecycle: feature engineering, experiment design, model training, evaluation, and model registry
Deploy and operate models for batch and real-time inference; implement CI/CD for data and models
Instrument online/offline metrics; monitor data drift, model performance, and reliability with alerting
Build and maintain feature stores and embedding pipelines; manage vector search for retrieval use cases
Develop and integrate LLM/RAG components (retrievers, indexers, evaluators) with appropriate guardrails
Ensure security, privacy, and compliance (access control, encryption, PII/PHI minimization, auditability)
Collaborate with product and engineering to integrate ML outputs into APIs, services, and user-facing features
Document data models, ML pipelines, and runbooks; mentor engineers on data/ML best practices

Required Qualifications

Bachelor’s degree in Computer Science, Software/Data Engineering, or equivalent experience
2–5 years in data engineering or backend roles focused on data-intensive systems, plus hands-on ML delivery
Strong SQL and PostgreSQL skills: schema design, query tuning, indexing, profiling
Proficiency in Python for ETL/ELT, feature engineering, and model development (e.g., pandas, scikit-learn, XGBoost/LightGBM, PyTorch or TensorFlow basics)
Experience building resilient pipelines with orchestration/scheduling and robust observability
Model deployment experience (REST/gRPC inference with FastAPI/Flask or similar; containerization)
Familiarity with experiment tracking and model registries; versioning data/models and managing rollouts
Understanding of vector embeddings and retrieval (e.g., pgvector or similar) and evaluating RAG quality
Security-first mindset: secrets management, least-privilege access, data governance, and compliance-by-design
Clear communication and cross-functional collaboration

Preferred Qualifications

Experience with multi-tenant SaaS concepts (roles/permissions, org/tenant boundaries) and tenant-aware queries
Streaming/CDC and event-driven patterns; time-series and geospatial data (partitioning strategies, PostGIS)
LLM tooling (prompt/retrieval evaluation, guardrails, caching) and lightweight fine-tuning/adapter methods
MLOps stack exposure (e.g., MLflow/Kedro/Dagster/Airflow, feature stores, model serving frameworks)
Cloud platform experience (Oracle/AWS/GCP), infrastructure-as-code, and cost/performance monitoring
Product analytics literacy and A/B testing of ML-powered features
Exposure to regulated domains or sensitive data handling
