Instrument your Node.js Applications with Open Source Tools – Part 2

As we mentioned in the previous article, observability is central to our day-to-day work at NodeSource, and we know that a great way to extend our reach and interoperability is to adopt the OpenTelemetry framework as a standard in our development flows. In the end, our vision is to achieve high-performance software and to accompany developers on that journey in their Node.js-based applications.

Understanding the basics of the standard and its scope is important, but it is also necessary to put it into practice: how do you integrate OpenTelemetry into your application? NodeSource offers direct integration in its product, plus more than 10 key functionalities in N|Solid that extend the offer of a traditional APM. But as you know, we are also significant contributors to the Open Source project and support the binary distributions of the Node.js project; helping the community is in our DNA, and we want to show you how Open Source tools alone can already increase your visibility. So, through this article, we want to share how to set up OpenTelemetry with Open Source tools.

In this article, you will find __How to Apply the OpenTelemetry OS framework in your Node.js Application__, which includes:

Step 1: Export data to the backend

Step 2: Set up the OpenTelemetry SDK

Step 3: Inspect Prometheus to verify we’re receiving data

Step 4: Inspect Jaeger to verify we’re receiving data

Step 5: Getting deeper into Jaeger 👀

Note: This article is an extension of the talk we had the opportunity to give at NodeConf.EU:

__Dot, line, Plane Trace!__
__Instrument your Node.js applications with Open Source Software__
Get insights into the current state of your running applications/services through OpenTelemetry. It has never been as easy as now to collect data with Open Source SDKs and tools that will help you extract metrics, generate logs and traces, and export this data in a standardized format to be analyzed using best practices. In this talk, we’ll show how easy it is to integrate OpenTelemetry in your Node.js applications and how to get the most out of it using Open Source tools.

To see the talks from this incredible conference, you can watch all sessions through live-stream links below 👇
– Day 1️⃣ – https://youtu.be/1WvHT7FgrAo
– Day 2️⃣ – https://youtu.be/R2RMGQhWyCk
– Day 3️⃣ – https://youtu.be/enklsLqkVdk

Now we are ready to start 💪 📖 👇

Apply the OpenTelemetry OS framework in your Node.js Application

So, going back to the distributed example we described in our previous article, here we can see what the architecture looks like after adding observability.

Every service will collect signals by using the OpenTelemetry Node.js SDK and export the data to specific backends so we can analyze it.

We are going to use the following:

Jaeger for traces and logs.

Prometheus to visualize the metrics.

Note: Jaeger and Prometheus are probably the most popular open-source tools in this space.

Step 1: Export data to the backend

How the data is exported to the backends differs:
To send data to Jaeger, we will use OTLP over HTTP, whereas Prometheus will pull the data from the services over HTTP.

First, we will show you how easy it is to set up the OpenTelemetry SDK to add observability to our applications.

Step 2: Set up the OpenTelemetry SDK

First, we have the providers in charge of collecting the signals, in our case __NodeTracerProvider__ for traces and __MeterProvider__ for metrics.
Then the exporters send the collected data to the specific backends.
The Resource contains attributes describing the current process, in our case __ServiceName__ and __ContainerId__. The names of these attributes are well defined by the spec (they live in the __semantic_conventions__ module) and allow us to tell where a specific signal comes from.

So to set up traces and metrics, the process is basically the same: we create the provider passing the Resource, then register the specific exporter.
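A minimal sketch of that setup might look like this (package names, ports, and environment variables here are illustrative and may vary between OpenTelemetry JS SDK versions):

```javascript
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { MeterProvider } = require('@opentelemetry/sdk-metrics');
const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');

// Resource attributes (defined by the semantic conventions) identify which
// service/container every signal comes from.
const resource = new Resource({
  [SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME || 'api',
  [SemanticResourceAttributes.CONTAINER_ID]: process.env.HOSTNAME || 'unknown',
});

// Traces: NodeTracerProvider + OTLP-over-HTTP exporter (Jaeger can ingest OTLP).
const tracerProvider = new NodeTracerProvider({ resource });
tracerProvider.addSpanProcessor(
  new BatchSpanProcessor(
    new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' })
  )
);
tracerProvider.register();

// Metrics: MeterProvider + Prometheus exporter, which exposes a /metrics
// endpoint (port 9464 by default) for Prometheus to scrape.
const meterProvider = new MeterProvider({ resource });
meterProvider.addMetricReader(new PrometheusExporter({ port: 9464 }));
```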

We also register instrumentations of specific modules (either core modules or popular userspace modules), which provide automatic Span creation of those modules.

Finally, the one important thing to remember is that we need to initialize OpenTelemetry before our actual code; the reason is that these instrumentation modules (in our case for __http__ and __fastify__) __monkey-patch__ the modules they’re instrumenting.
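Registering those instrumentations (continuing the sketch above) could look like this:

```javascript
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { FastifyInstrumentation } = require('@opentelemetry/instrumentation-fastify');

// Must run before the application code is loaded, because these
// instrumentations monkey-patch the modules they instrument.
registerInstrumentations({
  tracerProvider,
  meterProvider,
  instrumentations: [new HttpInstrumentation(), new FastifyInstrumentation()],
});

// Only now load the actual service code:
// require('./app');
```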

Also, we create the __meter instruments__ because we will use them on every service: an __HTTP request counter__ and a couple of observable gauges for __CPU usage__ and __ELU usage__.
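A hedged sketch of those instruments, reusing the meter provider from above (the metric names match the ones we will look for in Prometheus; the CPU/ELU math is illustrative):

```javascript
const { performance } = require('node:perf_hooks');

const meter = meterProvider.getMeter('demo-service');

// Counter incremented on every handled HTTP request.
const requestCounter = meter.createCounter('http_request_count', {
  description: 'Number of HTTP requests handled',
});

// Observable gauge for CPU usage, computed from process.cpuUsage() deltas.
let lastCpu = process.cpuUsage();
let lastHrTime = process.hrtime.bigint();
const cpuGauge = meter.createObservableGauge('process_cpu', {
  description: 'Process CPU utilization (0-1)',
});
cpuGauge.addCallback((result) => {
  const cpuDelta = process.cpuUsage(lastCpu);        // microseconds since last read
  const now = process.hrtime.bigint();
  const elapsedUs = Number(now - lastHrTime) / 1000; // nanoseconds -> microseconds
  result.observe((cpuDelta.user + cpuDelta.system) / elapsedUs);
  lastCpu = process.cpuUsage();
  lastHrTime = now;
});

// Observable gauge for Event Loop Utilization (ELU).
let lastElu = performance.eventLoopUtilization();
const eluGauge = meter.createObservableGauge('thread_elu', {
  description: 'Event loop utilization (0-1)',
});
eluGauge.addCallback((result) => {
  const elu = performance.eventLoopUtilization(lastElu); // delta since last read
  result.observe(elu.utilization);
  lastElu = performance.eventLoopUtilization();
});

// Example usage inside a request handler:
// requestCounter.add(1, { route: req.url });
```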

So let’s spin up the application now and send a request to the API. It returns a 401 (Not Authorized). Before trying to figure out what’s going on, let’s check whether Prometheus and Jaeger are actually receiving data.

Step 3: Inspect Prometheus to verify we’re receiving data

Let’s look at Prometheus first:
Looking at the HTTP request counter, we can see there are 2 data points: one for the __API service__ and another for the __AUTH service__. Notice that the attributes we set in the Resource show up as __service_name__ and __container_id__. We can also see that __process_cpu__ is collecting data for the 4 services. The same is true for __thread_elu__.

Step 4: Inspect Jaeger to verify we’re receiving data

Let’s look at Jaeger now:
We can see that one trace corresponding to the __HTTP request__ has been generated.

Also, look at this chart where the points represent traces, the X-axis is the timestamp, and the Y-axis is the duration. If we inspect the trace, we can see it consists of 3 spans, where every span represents an __HTTP transaction__ and has been automatically generated by the instrumentation-http module:

The 1st span is an HTTP server transaction in the API service (the incoming HTTP request).
The 2nd span represents a POST request to AUTH from API.
The 3rd one represents the incoming HTTP POST in AUTH. If we inspect this last span a bit, apart from the typical attributes associated with the request (HTTP method, request_url, status_code…), we can see there’s a Log associated with the Span.

This is very useful because it tells us exactly which request caused the error. By inspecting the log, we find out that the reason for the failure was a missing auth token.

This piece of information wasn’t generated automatically, though, but it’s very easy to add. In the verify route of the AUTH service, if there’s an error verifying the token, we retrieve the active span from the current context and just call __recordException()__ with the error. As simple as that.
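A sketch of that error path (the route and reply handling around it are illustrative):

```javascript
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');

// Called from the verify route when token verification throws.
function recordAuthError(err) {
  const activeSpan = trace.getSpan(context.active());
  if (activeSpan) {
    activeSpan.recordException(err); // attaches the error as a Span event (log)
    activeSpan.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
  }
}

// Usage in the route handler (sketch):
// try { verifyToken(request.headers.authorization); }
// catch (err) { recordAuthError(err); reply.code(401).send({ error: 'Not Authorized' }); }
```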

Well, so far, so good. Knowing what the problem is, let’s add the auth token and check if everything works:

curl http://localhost:9000/ -H "Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiIiLCJpYXQiOjE2NjIxMTQyMjAsImV4cCI6MTY5MzY1MDIyMCwiYXVkIjoid3d3LmV4YW1wbGUuY29tIiwic3ViIjoiIiwibGljZW5zZUtleSI6ImZmZmZmLWZmZmZmLWZmZmZmLWZmZmZmLWZmZmZmIiwiZW1haWwiOiJqcm9ja2V0QGV4YW1wbGUuY29tIn0.PYQoR-62ba9R6HCxxumajVWZYyvUWNnFSUEoJBj5t9I"

OK, now it succeeds. Let’s look at Jaeger again: we can see the new trace here, it contains 7 spans, and no error was generated.

Now, it’s time to show a very nice feature of Jaeger: comparing traces. The Spans that are the same in both traces are shown in grey, whereas the new Spans are shown in green. So just by looking at this overview, we can see that if we’re correctly authorized, the API sends a GET request to SERVICE1, which then performs a couple of operations against POSTGRES. If we inspect one of the POSTGRES spans (the query), we can see useful information there, such as the actual QUERY. This is possible because we have registered the instrumentation-pg module in SERVICE1, as shown below.
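The registration itself is a one-liner in SERVICE1’s OpenTelemetry setup, something along these lines:

```javascript
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { PgInstrumentation } = require('@opentelemetry/instrumentation-pg');

// Adds automatic spans (including the SQL statement) for pg queries.
registerInstrumentations({ instrumentations: [new PgInstrumentation()] });
```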

And finally, let’s do a more interesting experiment: we will inject load into the application for 20 seconds with autocannon…
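A typical invocation looks something like this (using the same Bearer token as in the curl request above; the connection count and exact flags are illustrative and may vary between autocannon versions):

autocannon -c 10 -d 20 -H "authorization: Bearer <token>" http://localhost:9000/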

If we look at the latency chart, we see some interesting data: up to at least the 90th percentile, the latency is basically below 300 ms, whereas from roughly the 97.5th percentile onwards, it goes up a lot: more than 3 seconds. This is unacceptable 🧐. Let’s see if we can figure out what’s going on 💪.

Step 5: Getting deeper into Jaeger 👀

Looking at Jaeger and limiting the search to about 500 spans, we can see that the graph depicts what the latency chart showed: most of the requests are fast, whereas there are some significant outliers.

Let’s compare one of the fast traces against a slow one. We can see that in the slow trace, in addition to querying the database, SERVICE1 sends a request to SERVICE2. That’s useful info for sure. Let’s take a closer look at the slow trace.

In the __Trace Graph view__, every node represents a Span, and on the left-hand side we can see the percentage of the total trace duration taken by the subtree rooted at that node. By inspecting this, we can see that the branch representing the HTTP GET from SERVICE1 to SERVICE2 takes up most of the trace time, so the main suspect is SERVICE2. Let’s take a look at the metrics now; they might give us more information. If we look at __thread_elu__, we can see that for SERVICE2 it went to 100% for a few seconds, which would explain the observed behavior.

So now, looking at the route code in SERVICE2, we can easily spot the issue: we were performing a __Fibonacci computation__. Of course, this was easy to spot because this is a demo; in real scenarios it would not be so simple, and we would need other methods, such as CPU profiling. Regardless, the info we collected helps us narrow down the issue quite significantly.

So, that’s it for the demo. We’ve created a repo where you can access the full code, so go play with it! 😎

Main Takeaways

Finally, we just want to share the main takeaways about implementing observability with Open Source tools:

Setting up observability in our Node.js apps is actually not that hard.
It allows us to observe requests as they propagate through a distributed system, giving us a clear picture of what might be happening.
It helps identify points of failure and causes of poor performance (in some cases, other tools might also be needed: CPU profiling, heap snapshots).
Adding observability to our code, especially tracing, comes with a cost, so be cautious! ☠️ But we are not going to go deeper into this here, as it could be a topic for another article.

Before you go

If you’re looking to implement observability in your project professionally, you might want to check out N|Solid and our ’10 key functionalities’. We invite you to follow us on Twitter and keep the conversation going!

Enhance Observability with OpenTelemetry Tracing – Part 1

Recently, conversations have been increasing around OpenTelemetry; it is gaining more and more momentum in Node.js development circles, but what is it? How can we take advantage of the key concepts and implement them in our projects?

Of note, NodeSource is a supporter of OpenTelemetry, and we have recently implemented full support of the open-source standard in our product N|Solid. It allows us to make our powerful Node.js insights accessible via the protocol.

OpenTelemetry is a relatively recent, vendor-agnostic standard that began in 2019 when OpenCensus and OpenTracing merged to form OpenTelemetry, seeking to provide a single, well-supported integration surface for end-to-end distributed tracing telemetry. In 2021, the project released v1.0.0, offering stability guarantees for the approach.

Most importantly, OpenTelemetry is an open-source observability project/framework from the Cloud Native Computing Foundation (CNCF), with a collection of software development kits (SDKs), APIs, and tools for instrumentation.

W3C Trace Context is the standard format for OpenTelemetry. Cloud providers are expected to adopt this standard, providing a vendor-neutral way to propagate trace IDs through their services. Organizations use OpenTelemetry to send collected telemetry data to a third-party system for analysis.

But to break down its history a bit, we think it’s important to understand the concept of __Observability__.

At NodeSource, as you likely know, we work on Observability daily, focusing exclusively on the Node.js runtime, and N|Solid, starting from version 4.8.0, supports some OpenTelemetry features. But before getting deeper into OTel, it is important to understand Observability and answer an important question: what is Observability?

Setting the foundations to talk about OpenTelemetry

It’s important to understand that when we talk about Observability, we need first to know what questions we seek to answer or clarify when detailing a system.

The first question often asked is __why__ our application behaves in a specific way. To answer this and other questions, we first must instrument our system so that our application can emit signals, that is, traces, metrics, and logs. When we do this correctly, we have the information we need.

Observability is the ability to measure the internal states of a system by examining its outputs. – Splunk

Detailing a system through Data Collection: Telemetry Data

Your systems and apps need proper tooling to collect the appropriate telemetry data to achieve Observability.
But what is the telemetry data that we need?

The three key concepts are:

Metrics

Logs

Traces

Ok, let’s define each of these concepts:

Metrics

__Metrics__ are aggregations of numeric data about your infrastructure or application over a period of time. Examples include system error rate, CPU utilization, and request rate for a given service.

As quoted by isitobservable.io, OpenTelemetry has three metric instruments:

__Counter__: a value that is summed over time (similar to the Prometheus counter)
__Measure__: a value that is aggregated over time (a value over some defined range)
__Observer__: captures a current set of values at a given time (like a gauge in Prometheus)

The context is still very important, along with metric information like name, description, unit, kind (counter, observer, measure), label, aggregation, and time.

Logs

__Logs__: A Log is a timestamped message emitted by services or other components. They are not necessarily associated with any particular user request or transaction, but they become more valuable when they are.

Logic would tell us to jump to traces next, since they are the third key concept. But before defining what a trace is, we must zoom in on the concept of __Span__.

Span

__Span__: A Span represents a unit of work or operation. It tracks specific operations that a request makes, painting a picture of what happened during the time in which that operation was executed.

A span is the building block of a trace and is a named, timed operation representing a piece of the workflow in the distributed system. All traces are composed of Spans.

Traces

__Traces__: A Trace records the paths taken by requests (made by an application or end-user) as they propagate through multi-service architectures, like microservice and serverless applications. It is also known as a Distributed Trace. A trace is almost always an assessment of end-to-end performance.

Without tracing, it is challenging to pinpoint the cause of performance problems in a distributed system.

You may notice that we stepped away from the three pillars of Observability when introducing the concept of Span; however, the three pillars together with Spans make up what is known as __Telemetry Data__: simple __signals emitted from applications and resources about their internal state.__

The core concept of Context Propagation

When we want to correlate events across our services’ boundaries, we look for a context that helps us identify the current trace and Span. But context is not the only thing we need; we also need __propagation__.

If you are following the article carefully, you will notice that the definition of a trace mentions the word __‘propagation’__. You might wonder what this means.

Propagation is how context is bundled and transferred within and across services, often via HTTP headers. Now, with these concepts clear, we can begin to understand __Context Propagation__.

A critical functionality required to implement Distributed Tracing is the concept of Context Propagation. We can define it as a mechanism for storing state and accessing data across the lifespan of a distributed transaction, either across execution contexts inside a process or across the boundaries of the services that make up our system.

For __in-process propagation__, we typically use something like the __AsyncLocalStorage__ class from the __async_hooks__ module.

Whereas __across processes__, it depends on the IPC protocol used. For example, for HTTP there’s the __Trace Context__ specification from the __W3C__, which defines the __traceparent__ and __tracestate__ headers to propagate tracing info.
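As a rough illustration of the in-process part (this is just the underlying idea, not the actual OpenTelemetry implementation):

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');

// Everything that runs (sync or async) inside store.run() can read the same
// context object without it being passed through function arguments.
const store = new AsyncLocalStorage();

function handleIncomingRequest(req, res) {
  // Across processes (HTTP), the context arrives in the W3C `traceparent`
  // header, with the format "00-<trace-id>-<parent-span-id>-<trace-flags>".
  const ctx = { traceparent: req.headers['traceparent'] };
  store.run(ctx, () => doWork(res));
}

function doWork(res) {
  // Deeper in the call chain, we can still access the same context.
  const { traceparent } = store.getStore();
  res.end(`handled request belonging to ${traceparent}`);
}
```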

Getting into a Distributed Application

Let’s say we have a distributed application like the one in the picture. It has 4 Node.js services: API, Auth, Service1, and Service2, plus 1 database.

Imagine we’re having intermittent performance issues. They could come from several points:
– Database access
– Network link status
– DNS request latency, etc.
Finding exactly where may become a very hard and time-consuming task, and the more complex the system, the harder it gets.

Distributed tracing will help us A LOT with that, as we’ll generate tracing information on every point of the distributed system (A, B, C, D, and E). Not only that, but while the request goes through all the services, thanks to Context Propagation, some ‘tracing state’ will be passed along so all the tracing info can be linked to the very same request.

Instrument your system

To get visibility into the performance and behaviors of the different microservices, we need to instrument the code with OpenTelemetry to generate traces. But first, let’s define what Instrumentation is…

Automatic Instrumentation

With __Automatic Instrumentation__, our instrumentation libraries will automatically take the configuration provided (through code or environment variables) and do most of the work.

The following example shows how, using the OpenTelemetry SDK, we can automatically generate spans for every HTTP transaction handled by the Node.js HTTP core module.
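A minimal sketch along those lines, assuming the standard @opentelemetry packages (exact package names may differ between SDK versions):

```javascript
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor, ConsoleSpanExporter } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');

// Set up a tracer provider that prints spans to the console.
const provider = new NodeTracerProvider();
provider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter()));
provider.register();

// The HttpInstrumentation monkey-patches the core `http` module, so every
// inbound/outbound HTTP transaction automatically produces a span.
registerInstrumentations({ instrumentations: [new HttpInstrumentation()] });

// Any server created after this point is traced with no further code changes.
require('http')
  .createServer((req, res) => res.end('ok'))
  .listen(3000);
```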

Manual Instrumentation

__Manual instrumentation__, on the other hand, while requiring more work on the user/developer side, enables far more options for customization, from naming various components within OpenTelemetry (for example, spans and traces) to adding your own attributes, specific exception handling, and more. The following example shows how to manually generate a Span using the OpenTelemetry SDK.
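A sketch of manual span creation with the OpenTelemetry API (the operation and attribute names are made up for illustration):

```javascript
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('manual-example');

// Wrap a unit of work in a manually created span.
async function processOrder(orderId) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', orderId); // custom attribute
      // ... the actual business logic would go here ...
      return 'done';
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end(); // always end the span
    }
  });
}
```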

How to implement OpenTelemetry in my project?

The way we historically would implement a typical observability pipeline is shown in the following picture.

Having all that data at your disposal is great and can give us a valuable overview of our system, but unless we are able to somehow correlate the observability signals (metrics, logs, and traces), we won’t get the most out of it.

OpenTelemetry solves this problem by correlating these signals: the same concept of Context Propagation used for Traces is applied to Metrics and Logs, so identifiers such as the trace_id and the span_id are associated with those signals.

OpenTelemetry spanId and traceId can correlate Logs and Metrics with a specific Span in a Trace.
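For example, a log line can be enriched with the ids of the active span so a backend can correlate it with the corresponding trace; a minimal sketch:

```javascript
const { trace, context } = require('@opentelemetry/api');

// Attach the active trace/span ids to a log line so the log can later be
// correlated with the corresponding trace in the backend.
function logWithTraceContext(message) {
  const span = trace.getSpan(context.active());
  const { traceId, spanId } = span ? span.spanContext() : {};
  console.log(JSON.stringify({ message, trace_id: traceId, span_id: spanId }));
}
```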

OpenTelemetry Components

OpenTelemetry is much more… Before finishing this article, it is important to describe the different components of OpenTelemetry.

Note: For more detail, read the specification overview of the OpenTelemetry project.

OpenTelemetry API: It provides an API, which defines data types and operations for generating and correlating tracing, metrics, and logging data.

💚 From N|Solid 4.8.0, we provide an implementation of the OpenTelemetry Trace API, allowing users to instrument their own code using the de-facto standard API.

OpenTelemetry SDK: It provides Language-specific implementations of the API.

OpenTelemetry OTLP: A protocol to transport the Telemetry Data.

💚 With N|Solid 4.8.0, we support many instrumentation modules available in the OpenTelemetry ecosystem, as well as exporting traces using the OpenTelemetry Protocol (OTLP) over HTTP.

OpenTelemetry Collector: To receive, process, and export Telemetry data.

💚 In N|Solid 4.8.0, it is now possible to send N|Solid Runtime monitoring information (metrics and traces) to backends supporting the OpenTelemetry standard, such as multiple APMs (Dynatrace, Datadog, New Relic).

OpenTelemetry Semantic Conventions: To have well-defined naming for the attributes associated with the signals (service.name, http.port, etc.).

We know there are other key concepts to explore around OpenTelemetry, and for this reason, we invite you to visit the project’s website or its GitHub repo directly.

This introductory article lays the groundwork for a demo we prepared for NodeConf.EU, where we apply open-source tools to implement OpenTelemetry in your project. We invite you to stay tuned for our next blog post 😉 and wait for the second part!

Conclusions

Traces are really useful for understanding modern distributed systems.
We build better software when we get the most out of our traces.
With OTel (OpenTelemetry), we can maximize our insights and answer future questions without having to make code changes.
OTel provides interoperability with observability tools.
Collecting and correlating telemetry data is easy if you follow the OpenTelemetry framework.
As far as we know, the OpenTelemetry community is working hard to develop support for metrics and logs. We’re waiting for news soon! 🤞
Note: If you want to learn more about OpenTelemetry in JavaScript, click HERE

To start getting more value out of your traces and metrics, you can use OpenTelemetry with the N|Solid backend.

Achieve Your Performance Goals With N|Solid

We know you want to get the best out of your application, and to do it professionally, you will surely need a great ally to help you with various tools without affecting performance. We don’t want to give you a ‘marketing speech’ telling you that we are the best… you can 👀 check it directly here with this Open Source tool that also includes OTel results.

We’d love to hear more from you! 💚
– Feel free to TRY N|Solid and get in touch with us on Twitter at @nodesource.