Basic introduction to Apache Pinot

Pinot is a real-time distributed OLAP data store built to deliver ultra-low latency analytics, even at extremely high throughput. It can ingest directly from streaming data sources such as Apache Kafka and Amazon Kinesis and make events available for on-the-fly queries. It can also ingest from batch data sources such as Hadoop HDFS, Amazon S3, Azure ADLS, and Google Cloud Storage.

At the heart of the system is columnar storage, with multiple intelligent indexing and pre-aggregation techniques for low latency. This makes Pinot best suited for user-facing real-time analysis. Pinot is also a great choice for other analytics use cases, such as internal dashboards, anomaly detection, and ad-hoc data exploration.

Pinot was built by engineers at LinkedIn and Uber and is designed to scale both up and out with no upper bound. For a given cluster size and expected queries-per-second (QPS) threshold, performance stays constant.

User-facing real-time analytics

User-facing analytics, or site-facing analytics, are the analytics tools and applications that you expose directly to the end users of your product. In a user-facing analytics application, the user base is every end user of the application, whether that is a social networking app or a food delivery app. This is not a handful of analysts doing offline analysis, nor a handful of data scientists in a company running ad hoc queries. Instead, every end user receives personalized analytics on their own device (think hundreds of thousands of queries per second). These queries are triggered by the application, not written by humans, so the query load scales with the number of active users of the application, and the data arrives at a matching rate (think millions of events per second).

All of this applies to the freshest data possible, which is what makes it real-time analytics. For some businesses, even "yesterday" is too long ago: they cannot wait for ETL and batch jobs, and they want the data to be ready for analysis as soon as it arrives (think latency < 1s).

Why is real-time user-facing analytics so challenging?

A user-facing analytics application built on real-time events sounds great. But what does supporting such analytical workloads mean for the underlying infrastructure?

  1. Such applications require data that is as fresh as possible, so the system must ingest data in real time and make it available for queries immediately.
  2. The data for such applications tends to be event data covering a wide range of actions and coming from multiple sources, so it arrives at very high velocity and is often high-dimensional.
  3. Queries are triggered by end users interacting with the application, so the system sees hundreds of thousands of queries per second with arbitrary query patterns, and latency is expected to be in milliseconds for a good user experience.
  4. On top of all of the above, the system must be scalable, reliable, highly available, and inexpensive to operate.

Which companies use Pinot

Pinot originated at LinkedIn, where it currently has one of its largest deployments, powering over 50 user-facing applications such as Who Viewed My Profile, Talent Analytics, Company Analytics, Advertising Analytics, and more. At LinkedIn, Pinot also serves as the backend for visualizing and monitoring over 10,000 business metrics.

As Pinot grows in popularity, several companies are now using it in production to support a variety of analytics use cases. The Pinot documentation maintains a detailed list of companies using Pinot.

Apache Pinot basic features

  • Column-oriented database with various compression schemes, such as run-length and fixed-bit-length encoding
  • Pluggable indexing technologies: sorted index, bitmap index, inverted index, star-tree index, Bloom filter, range index, text search index (Lucene/FST), JSON index, geospatial index
  • Ability to optimize query/execution plans based on query and segment metadata
  • Near real-time ingestion from streams such as Kafka and Kinesis, and batch ingestion from sources such as Hadoop, S3, Azure, and GCS
  • SQL-like language that supports selection, aggregation, filtering, grouping, sorting, and DISTINCT queries on data
  • Support for multi-valued fields
  • Horizontally scalable and fault tolerant
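
To make the compression and indexing terms in the list above more concrete, here is a minimal Python sketch of run-length encoding, fixed-bit-width sizing for dictionary ids, and an inverted index. It only illustrates the general techniques and is not Pinot's actual implementation; the `country` column and all function names are made up for the example.

```python
from collections import defaultdict
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def fixed_bit_width(cardinality):
    """Bits needed to store any dictionary id in range(cardinality)."""
    return max(1, (cardinality - 1).bit_length())


def build_inverted_index(values):
    """Map each distinct value to the ascending list of row ids holding it."""
    index = defaultdict(list)
    for row_id, value in enumerate(values):
        index[value].append(row_id)
    return dict(index)


# Hypothetical "country" column of a six-row table.
country = ["US", "US", "US", "IN", "IN", "US"]

print(run_length_encode(country))          # [('US', 3), ('IN', 2), ('US', 1)]
print(fixed_bit_width(len(set(country))))  # 1 (two distinct values fit in one bit)
print(build_inverted_index(country))       # {'US': [0, 1, 2, 5], 'IN': [3, 4]}
```

Run-length encoding pays off most when a column is sorted, because equal values then sit next to each other. The inverted index lets a filter such as `country = 'IN'` jump straight to the matching rows instead of scanning the whole column.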

When should I use it?

Pinot is designed to execute OLAP queries with low latency. It is suitable for contexts that need fast analytics (such as aggregations) on immutable data, possibly with real-time data ingestion.
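
As a rough illustration of the kind of aggregation query this refers to, the sketch below builds an HTTP request for a Pinot broker's SQL endpoint (`POST /query/sql` with a JSON body containing a `sql` field). The broker address `http://localhost:8099`, the table `profileViews`, and its columns are assumptions made up for the example; the request is only constructed, not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request


def build_query_request(sql, broker_url="http://localhost:8099"):
    """Build (but do not send) a POST request for the broker's SQL endpoint."""
    payload = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{broker_url}/query/sql",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical table and column names: count profile views per viewer country.
sql = (
    "SELECT viewerCountry, COUNT(*) "
    "FROM profileViews "
    "WHERE viewedMemberId = 42 "
    "GROUP BY viewerCountry "
    "ORDER BY COUNT(*) DESC "
    "LIMIT 10"
)

request = build_query_request(sql)
print(request.get_method(), request.full_url)

# To run the query against a live broker, send the request:
# with urllib.request.urlopen(request) as response:
#     result = json.loads(response.read())
```

In a user-facing product, a query like this would be issued by the application on each page load, with the member id filled in per user.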

User-facing analytics products

Pinot is the perfect choice for user-facing analytics products. Pinot was originally built at LinkedIn to power rich, interactive, real-time analytics applications such as Who Viewed My Profile, Company Analytics, Talent Insights, and more. UberEats Restaurant Manager is another example of a customer-facing analytics application. At LinkedIn, Pinot powers more than 50 user-facing products, ingests millions of events per second, and processes more than 100,000 queries per second with millisecond latency.

Real-time dashboards of business metrics

Pinot can also be used to perform typical analytical operations such as slicing and dicing, drill-down, roll-up, and pivoting of large-scale multidimensional data. For example, at LinkedIn, Pinot powers dashboards for thousands of business metrics. BI tools such as Superset, Tableau, or Power BI can be connected to visualize data in Pinot.
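
For readers unfamiliar with these OLAP terms, the following plain-Python sketch shows what roll-up, drill-down, slicing, and pivoting (rotating) mean on a tiny in-memory data set. It does not use Pinot at all; the `orders` rows and the `aggregate` helper are hypothetical and exist only to illustrate the operations.

```python
from collections import defaultdict


def aggregate(rows, dimensions, metric):
    """Sum `metric` over rows, grouped by the given dimension columns."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[metric]
    return dict(totals)


# Hypothetical order events.
orders = [
    {"country": "US", "city": "NYC", "device": "ios", "amount": 30},
    {"country": "US", "city": "NYC", "device": "web", "amount": 20},
    {"country": "US", "city": "SF", "device": "ios", "amount": 50},
    {"country": "IN", "city": "BLR", "device": "web", "amount": 10},
]

# Roll-up: aggregate at a coarser grain (country only).
by_country = aggregate(orders, ["country"], "amount")

# Drill-down: add a finer dimension (country, then city).
by_city = aggregate(orders, ["country", "city"], "amount")

# Slice: fix one dimension value, then aggregate what remains.
ios_only = [row for row in orders if row["device"] == "ios"]
ios_by_country = aggregate(ios_only, ["country"], "amount")

# Pivot: countries become rows and devices become columns.
pivot = defaultdict(dict)
for (country, device), total in aggregate(orders, ["country", "device"], "amount").items():
    pivot[country][device] = total

print(by_country)      # {('US',): 100, ('IN',): 10}
print(by_city)         # {('US', 'NYC'): 50, ('US', 'SF'): 50, ('IN', 'BLR'): 10}
print(ios_by_country)  # {('US',): 80}
print(dict(pivot))     # {'US': {'ios': 80, 'web': 20}, 'IN': {'web': 10}}
```

In a real deployment each of these would be a SQL query with a different GROUP BY or WHERE clause, and the dashboard tool would issue them as the user clicks through the data.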

Anomaly detection

In addition to visualizing data in Pinot, you can also run machine learning algorithms to detect anomalies in the data stored in Pinot. For more information on how to use Pinot for anomaly detection and root cause analysis, see ThirdEye.

Origin blog.csdn.net/weixin_39636364/article/details/124755945