Job Description
This is a remote position. The developer will be responsible for building a lightweight, self-healing, auto-scaling, multi-platform APM (Application Performance Monitoring) agent that can:

Instrumentation & Data Collection:
- Automatically instrument applications to collect transaction traces, logs, and performance metrics.
- Capture distributed traces across microservices.
- Track response times, error rates, resource usage, and database query performance.
- Collect and forward application, system, and security logs.

Performance & Efficiency Optimization:
- Implement adaptive sampling to reduce overhead.
- Ensure data collection is asynchronous and non-blocking.
- Optimize CPU, memory, and network utilization to minimize application impact.

Distributed Tracing & Database Monitoring:
- Assign and propagate trace IDs across microservices.
- Monitor slow queries and database calls with minimal overhead.

Log Collection & Security Monitoring:
- Collect, filter, and forward application/system logs.
- Detect security anomalies and unusual resource usage patterns.

Communication & Data Transmission:
- Efficiently batch and compress data before sending to the APM platform.
- Use lightweight protocols (gRPC, Protobuf, etc.) for communication.

Self-Healing & Auto-Scaling Mechanisms:
- Implement triggers for auto-scaling based on CPU, memory, and latency thresholds.
- Enable self-healing by restarting services upon failure or excessive resource usage.

Delivery Timeline & Reporting:
- Develop the agent within 2 to 3 weeks.
- Provide technical documentation and performance benchmarks.

Requirements
Qualifications:
- Programming Expertise: Proficiency in languages commonly used for APM agents such as Java, Python, Go, .NET, or C++.
- Instrumentation & Monitoring Experience: Hands-on experience with code profiling, distributed tracing (OpenTelemetry), and application instrumentation.
- Performance Optimization: Knowledge of efficient data collection strategies, async programming, and low-latency data transmission.
- Logging & Security: Experience integrating with logging pipelines (ELK, Splunk, Loki) and implementing basic security anomaly detection.
- Scalability & Resilience: Familiarity with auto-scaling, self-healing mechanisms, and cloud-native architectures.
- APM & Observability Tools: Experience with tools like Prometheus, OpenTelemetry, Datadog, New Relic, or Dynatrace is a plus.
- Networking & Communication Protocols: Proficiency in gRPC, Protobuf, or telemetry data transfer.
- Agile Development & Fast-Paced Execution: Ability to deliver a functional prototype within 2-3 weeks and iterate based on feedback.
- Strong Debugging & Problem-Solving Skills: Ability to analyze performance bottlenecks and optimize agent behavior.