Data & ML Engineer
Engineering · Remote · Full-time
About the Role
Design, build, and operate secure, multi-tenant data and machine-learning infrastructure. You’ll model operational datasets, ingest telemetry and application events, and deliver ML/AI features end to end, from feature engineering and training through deployment, monitoring, and iteration, all backed by reliable, compliant data flows.
Responsibilities
- Design and evolve relational schemas for multi-tenant use; implement roles, policies, and row-level security
- Build and maintain batch and streaming pipelines (ingest, transform, load) with strong SLAs and observability
- Implement data quality checks, contracts, lineage, and automated validation for high-trust datasets
- Optimize SQL performance (indexes, partitioning, materialized views) for time-series and analytics use cases
- Own the ML lifecycle: feature engineering, experiment design, model training, evaluation, and model registry
- Deploy and operate models for batch and real-time inference; implement CI/CD for data and models
- Instrument online/offline metrics; monitor data drift, model performance, and reliability with alerting
- Build and maintain feature stores and embedding pipelines; manage vector search for retrieval use cases
- Develop and integrate LLM/RAG components (retrievers, indexers, evaluators) with appropriate guardrails
- Ensure security, privacy, and compliance (access control, encryption, PII/PHI minimization, auditability)
- Collaborate with product and engineering to integrate ML outputs into APIs, services, and user-facing features
- Document data models, ML pipelines, and runbooks; mentor engineers on data/ML best practices
Required Qualifications
- Bachelor’s degree in Computer Science, Software/Data Engineering, or a related field, or equivalent experience
- 2–5 years of experience in data engineering or backend roles focused on data-intensive systems, plus hands-on ML delivery
- Strong SQL and PostgreSQL skills: schema design, query tuning, indexing, profiling
- Proficiency in Python for ETL/ELT, feature engineering, and model development (e.g., pandas, scikit-learn, XGBoost/LightGBM, PyTorch or TensorFlow basics)
- Experience building resilient pipelines with orchestration/scheduling and robust observability
- Model deployment experience (REST/gRPC inference with FastAPI/Flask or similar; containerization)
- Familiarity with experiment tracking and model registries; versioning data/models and managing rollouts
- Understanding of vector embeddings and retrieval (e.g., pgvector or similar) and evaluating RAG quality
- Security-first mindset: secrets management, least-privilege access, data governance, and compliance-by-design
- Clear communication and cross-functional collaboration
Preferred Qualifications
- Experience with multi-tenant SaaS concepts (roles/permissions, org/tenant boundaries) and tenant-aware queries
- Streaming/CDC and event-driven patterns; time-series and geospatial data (partitioning strategies, PostGIS)
- LLM tooling (prompt/retrieval evaluation, guardrails, caching) and lightweight fine-tuning/adapter methods
- MLOps stack exposure (e.g., MLflow/Kedro/Dagster/Airflow, feature stores, model serving frameworks)
- Cloud platform experience (Oracle/AWS/GCP), infrastructure-as-code, and cost/performance monitoring
- Product analytics literacy and A/B testing of ML-powered features
- Exposure to regulated domains or sensitive data handling