Mid-Level Computer Vision Engineer
Job Description
We are a leading-edge AI products company building next-generation intelligent systems at scale. As we expand our production platforms, we are looking for a Mid-Level Computer Vision Engineer to join our team in a role focused on video understanding and multimodal retrieval systems.
In this role, you will work primarily on production-grade computer vision pipelines that enable semantic understanding, indexing, and retrieval of large-scale video data. Your work will support multimodal search experiences, allowing users to retrieve precise information from massive image and video collections using natural language and cross-modal representations. You will collaborate closely with a small, high-impact engineering team to turn advanced models into reliable, scalable systems.
Responsibilities:
- Video Understanding & Multimodal Representation
  - Develop and maintain production pipelines for extracting semantic representations from videos and video frames using state-of-the-art Vision-Language Models (VLMs).
  - Enable semantic understanding and retrieval across image, video, and text modalities.
- Temporal Localization & Alignment
  - Implement and optimize algorithms for temporal localization and alignment, including Video Moment Localization and cross-modal video-text matching.
  - Support precise retrieval of relevant video segments from long-form content without manual annotation.
- Large-Scale Visual Data Processing
  - Build and optimize high-throughput ingestion systems for large-scale image and video datasets.
  - Prepare visual embeddings for efficient storage and retrieval in vector databases.
- Model Optimization for Production
  - Fine-tune and optimize models with a focus on latency, throughput, and resource efficiency.
  - Apply techniques such as quantization and parameter-efficient tuning to manage high-dimensional visual data in production environments.
- Production Engineering & Reliability
  - Develop, test, deploy, and document production-ready features and services.
  - Monitor model and system performance, identify bottlenecks, and contribute to continuous improvements.
- Cross-Functional Collaboration
  - Work closely with data scientists, backend engineers, and product teams to integrate vision systems into end-user applications.
  - Communicate technical decisions, trade-offs, and system limitations clearly across teams.
Required Skills & Qualifications:
- Solid experience with Computer Vision and Deep Learning, particularly for video-based tasks.
- Familiarity with Vision-Language Models (e.g., CLIP-like architectures) and multimodal embedding spaces.
- Proficiency in Python, with experience building production ML pipelines.
- Practical experience deploying or maintaining production ML or CV systems.
- Understanding of performance optimization, including batching, latency reduction, and memory efficiency.
- Ability to work independently on defined tasks while contributing effectively within a team.
- Clear communication skills for collaborating with technical and non-technical stakeholders.
- Experience with video retrieval, temporal grounding, or video-text alignment.
- Familiarity with vector databases and large-scale embedding systems.
- Experience working with high-volume visual or multimedia data.