Pipedrive AI/ML: A Look at the Data Architecture



This is a functional and process-oriented description of the application of pCDI, Samurai’s predictive customer data infrastructure, to Pipedrive integration. Please refer to the pCDI documentation for more technical details.

Pipedrive AI Booster, our Pipedrive AI/ML implementation package, comprises two components: the configuration and finetuning of the built-in Pipedrive AI capabilities developed by Pipedrive’s team, and a custom ML model trained by us specifically on your account’s data schema.

The custom model component predicts specific lead/deal activities and properties, providing numerous predictive data points, including events and properties that can be used in your Pipedrive account automations to optimize the sales process.

In this entry, we will examine the custom model component. This module is based on the data architecture, whose functional representation is presented below.

As shown, the analytical backend of Pipedrive AI Booster consists of the following layers:

  • Data production layer 
  • Raw event collection layer 
  • Inference layer 
  • Predictive event collection layer
  • Predictive event activation layer

These components provide Pipedrive with valuable business data points, helping you scale your sales department using predictive data.

Before we discuss each component in detail, let’s be clear that the graph presented is a functional, business-oriented visualization of key data processing steps. It does not exhaustively cover all steps involved in model training and is not representative of the model architecture itself. It’s intended to show high-level data processing.

Data production layer

This layer is managed by the Pipedrive API backend. It triggers events for changes in your Pipedrive account, such as adding contacts, associating leads or deals, and their subsequent actions. Automated updates via callbacks notify us of significant changes, new data points, deal movements, or activities performed. The list of subscribed topics varies by client but typically includes over 20, covering all key aspects of your Pipedrive account.

Raw event collection layer

Changes in your Pipedrive account trigger events that carry detailed data, covering both standard and custom fields for Pipedrive entities such as leads and deals. Each event carries the current and past values of the object in JSON format. A server captures this data on a dedicated endpoint, which is secured by authentication with a secret token.
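
As a rough illustration, the capture step could be sketched as below. The token value, the `current`/`previous` key names, and all function names are assumptions made for this example, not the actual collector code.

```python
import hmac
import json

# Hypothetical shared secret; in practice this would come from configuration.
SECRET_TOKEN = "example-secret-token"


def is_authorized(presented_token: str, expected_token: str = SECRET_TOKEN) -> bool:
    """Compare tokens in constant time to avoid timing leaks."""
    return hmac.compare_digest(presented_token.encode(), expected_token.encode())


def parse_event(raw_body: str) -> dict:
    """Decode the JSON body and list the fields whose values changed."""
    payload = json.loads(raw_body)
    current = payload.get("current") or {}    # assumed key for current values
    previous = payload.get("previous") or {}  # assumed key for past values
    changed = sorted(k for k in current if current.get(k) != previous.get(k))
    return {"current": current, "previous": previous, "changed_fields": changed}


def handle_request(presented_token: str, raw_body: str) -> tuple:
    """Return an HTTP-like status code and the parsed event (empty on rejection)."""
    if not is_authorized(presented_token):
        return 401, {}
    return 200, parse_event(raw_body)
```

Comparing tokens with `hmac.compare_digest` rather than `==` is a common precaution against timing attacks.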

Once we capture these events, they go through some important steps. First off, we check them against the validation schema set at the collector level to filter out any events that don’t fit. Then we enhance and organize them to make sure they’re as useful as possible. This includes adding data from Pipedrive API and other APIs like Hunter or Clearbit, which when combined adds a number of extra features like deal sentiment or deal momentum metrics.
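
The validate-then-enrich flow described above might look roughly like this. The schema, field names, and the shape of the enrichment data are hypothetical; real enrichment would call external APIs, which we leave out here.

```python
# Hypothetical validation schema: required field name -> expected Python type.
EVENT_SCHEMA = {"event": str, "object_id": int, "current": dict}


def validate_event(event: dict, schema: dict = EVENT_SCHEMA) -> list:
    """Return a list of validation errors; an empty list means the event passes."""
    errors = []
    for field, expected_type in schema.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors


def enrich_event(event: dict, extra_sources: dict) -> dict:
    """Attach non-empty data from other sources under an 'enrichment' key.

    A new dict is returned, so the original event is left untouched.
    """
    enriched = dict(event)
    enriched["enrichment"] = {name: data for name, data in extra_sources.items() if data}
    return enriched


def collect(events: list, extra_sources: dict) -> list:
    """Drop events that fail validation, then enrich the rest."""
    return [enrich_event(e, extra_sources) for e in events if not validate_event(e)]
```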

Occasionally we also add earlier events related to specific identities to help predict what might happen next. Once everything’s been enhanced and tidied up, we format it so it can fit into our data warehousing system and be used for machine learning.

This process ensures that the data we use for modeling and analysis is in the best shape possible.

Inference layer

Key components in this layer include: 

  • Prediction component: Handles real-time predictions on the latest deployed model version.
  • Warehousing component: Ingests events into the warehouse for future model retraining and finetuning.

All events are processed here for real-time integration into our data warehouse, ensuring they’re ready for future model retraining. Additionally, a subset of events or event properties triggers real-time model predictions (hence the dotted line).

At the core of the raw event activation layer is the data client. It adapts structured and enriched events from the raw event data collector to fit the activation and warehousing platform requirements, as well as prediction needs. The data client ensures proper event identification and logs the execution of the inference process.

Let’s walk through each step in detail.


All events that make it to the inference layer are ingested in real-time into the data warehouse, which fuels machine learning models and predictions.

Before they appear in the warehouse, they undergo several crucial steps to ensure only fully validated, well-formatted data is included. 

After enrichment, events go through feature engineering to tailor them for modeling. Next, they’re validated to ensure they fit custom schemas designed for modeling. These schemas are made based on client needs to ensure all events in the warehouse meet the same standards. Finally, events are loaded into the warehouse.
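
A minimal sketch of these three steps (feature engineering, schema validation, loading) follows. The features, the schema, and the list standing in for the warehouse are all illustrative assumptions.

```python
from datetime import datetime

# Hypothetical modeling schema: feature name -> expected type.
MODEL_SCHEMA = {"deal_id": int, "value": float, "stage_changed": bool, "hour_of_day": int}


def engineer_features(event: dict) -> dict:
    """Derive model-ready features from an enriched event (illustrative only)."""
    current, previous = event["current"], event.get("previous", {})
    timestamp = datetime.fromisoformat(event["timestamp"])
    return {
        "deal_id": current["id"],
        "value": float(current.get("value", 0)),
        "stage_changed": current.get("stage_id") != previous.get("stage_id"),
        "hour_of_day": timestamp.hour,
    }


def conforms(row: dict, schema: dict = MODEL_SCHEMA) -> bool:
    """True when the row has exactly the schema's fields with the right types."""
    return set(row) == set(schema) and all(isinstance(row[k], t) for k, t in schema.items())


def load(events: list, warehouse: list) -> int:
    """Append conforming feature rows to the warehouse; return the number rejected."""
    rejected = 0
    for event in events:
        row = engineer_features(event)
        if conforms(row):
            warehouse.append(row)
        else:
            rejected += 1
    return rejected
```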


Predictive activation acts as a wrapper around models in the inference layer, communicating with the latest version of the ML API to forward predictive responses downstream alongside the original event. This process enriches and completes the payload, facilitating multiple downstream sanity checks, such as verifying that identifying attributes from the predictive event match those in the original event (i.e., whether we can extend the user journey into the future with the predictive events). Depending on your configuration, multiple such tags can exist in this layer, so that a single processed event can trigger multiple predictive models and provide diverse data points.

The prediction component of the inference layer typically focuses on a subset of events or properties, which can be easily modified. For instance, predicting churn for newly closed deals in Pipedrive is a common use case where predicted churn values are associated with specific deals. To activate an event for predictions, we simply need to set a triggering rule.
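
To make the idea of a triggering rule concrete, here is a small sketch. The rule format, the event name, and the `status` value are assumptions chosen for the churn example, not the actual configuration syntax.

```python
# Hypothetical rule: run the churn model only when a deal has just been won.
CHURN_RULE = {"event": "updated.deal", "current": {"status": "won"}, "model": "churn"}


def matches(rule: dict, event: dict) -> bool:
    """True when the event name and every required current value match the rule."""
    if event.get("event") != rule["event"]:
        return False
    current = event.get("current", {})
    return all(current.get(k) == v for k, v in rule["current"].items())


def models_to_trigger(rules: list, event: dict) -> list:
    """Names of all models whose rule matches; one event may trigger several."""
    return [rule["model"] for rule in rules if matches(rule, event)]
```

Keeping rules as plain data means the subset of events that trigger predictions can be changed without touching the prediction code.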

Model training

The model training layer sits on top of the data warehouse, where all collected events are stored. Storing all events allows the system to monitor long-term behavioral patterns of user activity and optimize the final model based on them.

Newly ingested data is processed to create the required features for training. 

Our predictive modeling strategy for the Samurai Predictive Event Model ensures high-quality inputs through multi-step data transformation. Using Long Short-Term Memory (LSTM) networks, we model sequential data to capture user behavior patterns, along with predictive details like deal value or time to close. Key steps include cleaning and normalizing numerical data, encoding categorical variables, and tokenizing text data. We generate user event sequences to reflect behavior over time, crucial for training LSTM models. Outliers are removed, and data is transformed and normalized for accurate modeling. Categorical data is encoded using ordinal and one-hot methods, while unstructured text is tokenized into numerical vectors. These processes ensure our model delivers reliable and actionable predictions, enhancing scalability in the sales department through predictive insights. You can find more details about the model strategy on the documentation page. 
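
The transformation steps named above can be sketched in plain Python as follows. This is only an illustration of the techniques (min-max normalization, ordinal and one-hot encoding, tokenization, padded event sequences); the function names, field names, and padding token are assumptions, and the LSTM itself is not shown.

```python
from collections import defaultdict


def min_max(values: list) -> list:
    """Scale numbers to the 0..1 range; constant input maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(value, vocabulary: list) -> list:
    """Encode an unordered category as a 0/1 vector over the vocabulary."""
    return [1 if value == item else 0 for item in vocabulary]


def ordinal(value, ordered_levels: list) -> int:
    """Encode an ordered category as its position in the ordering."""
    return ordered_levels.index(value)


def tokenize(text: str, vocab: dict) -> list:
    """Map lowercase words to integer ids, using 0 for unknown words."""
    return [vocab.get(word, 0) for word in text.lower().split()]


def build_sequences(events: list, max_len: int, pad: str = "<pad>") -> dict:
    """Group events per deal in time order, keep the last max_len, left-pad the rest."""
    by_deal = defaultdict(list)
    for event in sorted(events, key=lambda e: e["ts"]):
        by_deal[event["deal_id"]].append(event["type"])
    sequences = {}
    for deal_id, types in by_deal.items():
        tail = types[-max_len:]
        sequences[deal_id] = [pad] * (max_len - len(tail)) + tail
    return sequences
```

Fixed-length, padded sequences are the usual input shape for recurrent models such as LSTMs.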

The model is periodically retrained using the full dataset. This is necessary because some sequences encoded in the data warehouse can be extremely long, requiring full model retraining to capture their internal dependencies and patterns of activities. We may reconsider this in the future.

After retraining, the latest model version is saved in the model registry, which stores and versions models. The registry then notifies the Model component of the inference layer of new model versions. 

When a new model version is available, the model component fetches the updated model from the registry and deploys it. The new version is loaded without downtime (hot reloading). The deployed model serves predictions via the API endpoint, while the model component checks the registry for new versions periodically or upon notification.
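
The registry and hot-reloading behavior could be sketched as below. Both classes are in-memory stand-ins written for this example (a model is just a callable here); the real registry and model component are not shown in this article.

```python
import threading


class ModelRegistry:
    """In-memory stand-in for a registry that stores and versions models."""

    def __init__(self):
        self._models = {}
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def register(self, model) -> int:
        """Store a model under the next version number and notify subscribers."""
        version = len(self._models) + 1
        self._models[version] = model
        for notify in self._subscribers:
            notify(version)
        return version

    def latest_version(self) -> int:
        return max(self._models, default=0)

    def fetch(self, version: int):
        return self._models[version]


class ModelServer:
    """Serves predictions and swaps in new model versions without downtime."""

    def __init__(self, registry: ModelRegistry):
        self._registry = registry
        self._lock = threading.Lock()
        self._model, self.version = None, 0
        registry.subscribe(self.reload)  # reload upon notification
        self.poll()                      # pick up a model registered before startup

    def reload(self, version: int):
        new_model = self._registry.fetch(version)  # fetch fully before swapping
        with self._lock:
            if version > self.version:
                self._model, self.version = new_model, version

    def poll(self):
        """Periodic check for a newer version (call this from a scheduler)."""
        latest = self._registry.latest_version()
        if latest > self.version:
            self.reload(latest)

    def predict(self, features: dict):
        with self._lock:
            model = self._model
        if model is None:
            raise RuntimeError("no model deployed")
        return model(features)
```

Because the new model is fetched before the lock is taken, requests keep being served by the old version until the swap itself, which is a single assignment.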

The process in the inference layer is straightforward and standard. A client (prediction tag in the inference layer) sends a POST request to the ML API endpoint with input data. The API validates the input, feeds it into the model for predictions, formats the results, and returns the prediction directly to the client in the response body.
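
The request cycle described above (validate, predict, format, respond) might be sketched like this. The required feature names, the toy model, the status codes chosen, and the response fields are assumptions for illustration; the HTTP transport is omitted so the logic stays self-contained.

```python
import json

# Hypothetical input contract for the prediction endpoint.
REQUIRED_FEATURES = ("deal_id", "value", "stage_id")


def toy_model(features: dict) -> float:
    """Placeholder for the deployed model: returns a made-up close probability."""
    return min(1.0, features["stage_id"] / 10)


def handle_predict(request_body: str, model=toy_model) -> tuple:
    """Return (status_code, json_body) for a POST to the prediction endpoint."""
    try:
        features = json.loads(request_body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "invalid JSON"})
    if not isinstance(features, dict):
        return 400, json.dumps({"error": "expected a JSON object"})
    missing = [name for name in REQUIRED_FEATURES if name not in features]
    if missing:
        return 422, json.dumps({"error": "missing features", "fields": missing})
    probability = model(features)
    body = {"deal_id": features["deal_id"], "close_probability": round(probability, 3)}
    return 200, json.dumps(body)
```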

Predictive event collector layer 

The original event payload, enriched with prediction output, is sent to the Samurai Predictive Collector pipeline through a dedicated, secure endpoint.

This pipeline uses predictive data points to create the final event destined for dispatch to the predictive activation layer (and from there to Pipedrive), generating business value.

The Samurai Predictive Collector captures the event, after authorizing it again through the secret token. Unlike the prediction component of the inference layer, which focuses solely on managing communication with the model and forming the final payload without assessing model quality metrics or prediction accuracy, the Samurai Predictive Collector verifies the incoming predictive event’s quality. It uses its own event validation schema tailored to predictive events.

Once validated against the specific predictive event schema at the collector level, the event is sent to the predictive event enricher. Here, raw predictive data is transformed to extract maximum value from raw predicted datapoints, enriching the incoming event. Tasks include formatting monetary predictions, applying various formulas to combine predictive data, and scoring deals based on predictive properties. If the model includes sentiment analysis or scope classification, these aspects are also integrated during this phase. The enricher performs checks such as verifying identifier consistency between the predictive and original events.
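
A rough sketch of the enricher's tasks follows: an identifier consistency check, monetary formatting, and a deal score. The identifier keys, the scoring formula, and the value cap are invented for this example and do not reflect the formulas actually used.

```python
def identifiers_match(original: dict, predictive: dict, keys=("deal_id", "person_id")) -> bool:
    """True when the predictive event refers to the same entities as the original."""
    return all(original.get(k) == predictive.get(k) for k in keys)


def format_money(amount: float, currency: str) -> str:
    """Render a predicted amount with thousands separators and two decimals."""
    return f"{amount:,.2f} {currency}"


def score_deal(close_probability: float, predicted_value: float, value_cap: float = 50_000) -> int:
    """Hypothetical 0-100 score: close probability weighted by capped deal value."""
    value_factor = min(predicted_value, value_cap) / value_cap
    return round(100 * close_probability * (0.5 + 0.5 * value_factor))


def enrich_prediction(original: dict, predictive: dict) -> dict:
    """Add derived fields to a predictive event after checking its identifiers."""
    if not identifiers_match(original, predictive):
        raise ValueError("predictive event does not match the original event")
    return {
        **predictive,
        "value_display": format_money(predictive["value"], original["currency"]),
        "deal_score": score_deal(predictive["close_probability"], predictive["value"]),
    }
```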

After enriching the event with predictive data points, the Samurai Predictive Collector finalizes its transformation, typically labeling predictive data points (e.g., using predicted_ prefixes) and structuring them for the next stage—the predictive activation layer.
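
The labeling step can be illustrated in a few lines. The function name and the clash check are our own additions for this sketch; only the `predicted_` prefix comes from the description above.

```python
def label_predictions(original: dict, predictions: dict, prefix: str = "predicted_") -> dict:
    """Merge predictions into the event under prefixed names, keeping originals intact."""
    labeled = {f"{prefix}{name}": value for name, value in predictions.items()}
    clashes = set(labeled) & set(original)
    if clashes:
        raise ValueError(f"prefixed names already present: {sorted(clashes)}")
    return {**original, **labeled}
```

Prefixing keeps an observed field such as `value` clearly separate from its forecast, `predicted_value`, in every downstream system.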

Predictive activation

The predictive activation layer handles predictive events that were generated in the inference layer and then processed by the predictive collector. These events are based on the latest model fetched from the model registry, which is trained on data stored in the warehouse. In the predictive activation layer, there are two important processes: predictive destination sync and predictive reporting sync.

Predictive destination sync

The Predictive destination tags capture predictive events within the Samurai Data Client through the Predictive Collector. In the context of Pipedrive, this means we communicate specific predictive events or properties to the Pipedrive API. This allows us to update key data points related to your leads, deals, contacts, or other components using high-quality predictive information. For example, we can show you the probability of a deal closing within 60 days or predict churn likelihood. We also label leads as “hot” if they are predicted to have a high Lifetime Value (LTV).

Once we write predictive event data to your Pipedrive deals and leads, we configure Pipedrive automations in a standard manner, according to the Pipedrive Booster scope. For instance, if a deal is predicted to have a 70% chance of closing and is labeled as “hot,” it automatically receives higher priority, prompting more assigned tasks. All of this is facilitated through Pipedrive’s built-in automation suite. Our role is to provide Pipedrive with crucial data points, leveraging its standard features to optimize your operations.
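
As a sketch of the sync step, the code below maps predicted values onto custom-field keys and mirrors the 70% and "hot" example as a condition. The LTV threshold, the field names, and the placeholder field keys are assumptions; the actual write to the Pipedrive API and the automation itself (which lives in Pipedrive) are not shown.

```python
HOT_LTV_THRESHOLD = 10_000   # hypothetical cut-off for labeling a lead "hot"
PRIORITY_PROBABILITY = 0.70  # mirrors the 70% example in the text


def build_deal_update(event: dict, field_keys: dict) -> dict:
    """Map predicted values onto the account's custom-field keys for one deal."""
    label = "hot" if event["predicted_ltv"] >= HOT_LTV_THRESHOLD else "standard"
    values = {
        "close_probability_60d": event["predicted_close_probability_60d"],
        "churn_likelihood": event["predicted_churn"],
        "label": label,
    }
    return {
        "deal_id": event["deal_id"],
        "fields": {field_keys[name]: value for name, value in values.items()},
    }


def is_high_priority(close_probability: float, label: str) -> bool:
    """The kind of condition an automation could evaluate on the synced fields."""
    return close_probability >= PRIORITY_PROBABILITY and label == "hot"
```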

The predictive destination sync ensures you derive maximum value from your predictive data points.

Predictive data warehousing 

Ingesting predictive data into Pipedrive is just one aspect of the package. As part of our Pipedrive AI Booster package, we also generate predictive reports based on Pipedrive data. Therefore, aside from direct integration with Pipedrive, predictive data points are also routed to dedicated tables within our predictive data warehouse to support report generation.

These data points follow a similar process to raw events that contribute to model building. Initially, they undergo validation specific to predictive events, focusing on the metrics we aim to build with predictive insights before being ingested into the warehouse.

In terms of reporting, our typical analyses include predictive deal values for the next 30, 60, and 90 days, the number of deals predicted to churn over a specified period, identification of top-performing sales reps based on predictions, and various other metrics.
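
To give a feel for these reports, here is a minimal aggregation over predictive rows. The column names, the churn threshold, and the choice to rank reps by total predicted value are assumptions for this sketch.

```python
from collections import defaultdict


def predictive_report(rows: list, horizons=(30, 60, 90), churn_threshold: float = 0.5) -> dict:
    """Summarize predicted deal value per horizon, predicted churn, and rep ranking."""
    value_by_horizon = {h: 0.0 for h in horizons}
    value_by_rep = defaultdict(float)
    churn_count = 0
    for row in rows:
        # Horizons are cumulative: a deal closing in 20 days counts for 30, 60 and 90.
        for h in horizons:
            if row["predicted_days_to_close"] <= h:
                value_by_horizon[h] += row["predicted_value"]
        value_by_rep[row["owner"]] += row["predicted_value"]
        if row["predicted_churn"] >= churn_threshold:
            churn_count += 1
    top_reps = sorted(value_by_rep, key=value_by_rep.get, reverse=True)
    return {
        "value_by_horizon": value_by_horizon,
        "predicted_churn_count": churn_count,
        "top_reps": top_reps,
    }
```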

Let’s roll out some predicted events

In this article, we’ve walked through the functional aspects of the Samurai pCDI (predictive customer data infrastructure) architecture. This was a simplified description focused on the practical aspects of the various pCDI pipelines in the context of the Pipedrive Booster package.

If this diagram were a map, it would carry a “not to scale” label.

The most important thing to note is that once your data is processed and predictive data points appear in Pipedrive, you’re free to use them to shape and optimize your sales process. Before you start using AI with Pipedrive, make sure your account is in proper standing, which we’ve described in another article.
