InfoSphere Streams - real-time big data analytics platform

Learn about InfoSphere Streams, part of IBM's big data platform. InfoSphere Streams addresses a pressing need for platforms and architectures capable of processing the massive amounts of streaming data generated in real time. Learn what the product is designed for, how it works, and how it complements InfoSphere BigInsights to perform highly complex analytics.

Information from multiple sources is growing at an incredible rate. The number of Internet users reached 2.27 billion in 2015. Every day, Twitter generates more than 12 TB of tweets, Facebook generates more than 25 TB of log data, and the New York Stock Exchange collects 1 TB of trade information. About 30 billion radio frequency identification (RFID) tags are created every day. In addition, the hundreds of millions of GPS devices sold each year and the more than 30 million connected sensors currently in use (and growing at a rate of more than 30% each year) are all generating data. These data volumes are expected to double every 2 years over the next 10 years.

A company can generate petabytes of information in a year: web pages, blogs, clickstreams, search indexes, social media forums, instant messages, text messages, emails, documents, user demographics, data from active and passive system sensors, and more. Many estimate that up to 80% of this data is semi-structured or unstructured. Companies are always looking to run their businesses with greater agility and to carry out data analysis and decision making in more innovative ways, and they recognize that time lost in these processes can mean missed business opportunities. At the heart of the big data challenge is the ability for companies to easily analyze and understand Internet-scale information, just as they can now analyze and understand smaller amounts of structured information.


IBM is helping companies meet big data challenges, giving them the tools to integrate and manage massive, high-velocity data, apply analytics in native formats, visualize available data for ad hoc analysis, and more. This article introduces InfoSphere Streams, a technology that enables you to analyze many data types simultaneously and perform complex computations in real time. You'll learn how InfoSphere Streams works, what it's used for, and how to use it in conjunction with another IBM product for big data analytics (IBM InfoSphere BigInsights) to perform highly complex analytics.

InfoSphere BigInsights: Overview

MapReduce

The MapReduce framework (introduced by Google) makes it possible to program clusters of commodity computers to perform large-scale data processing in a single pass. A MapReduce cluster can scale to thousands of nodes in a fault-tolerant manner and process petabytes of data in a highly parallel and cost-effective way. A major advantage of the framework is that it relies on a simple yet powerful programming model. Additionally, it isolates the application developer from all the complex details of running a distributed program, such as data distribution, scheduling, and fault tolerance.
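The model can be illustrated with the canonical word-count example. The sketch below is a single-process Python toy that mimics the map and reduce phases; it is not Hadoop API code, just an illustration of the programming model:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/reduce: group the pairs by key and sum the counts per word.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data streams", "big data at rest"]
result = reduce_phase(map_phase(docs))  # e.g. result["big"] == 2
```

In a real MapReduce cluster, the map and reduce functions look much like these, but the framework runs many copies of each in parallel across nodes and handles the shuffle, scheduling, and failure recovery in between.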

Understanding InfoSphere BigInsights will enable you to more fully understand the purpose and value of InfoSphere Streams.

BigInsights is an analytics platform that helps companies transform complex Internet-scale information sets into insights. It consists of a packaged Apache Hadoop distribution with a highly simplified installation process and associated tools for application development, data movement, and cluster management. Thanks to its simplicity and scalability, Hadoop, an open source implementation of the MapReduce framework, has enjoyed great success in industry and academia. Besides Hadoop, other open source technologies in BigInsights (all except Jaql are Apache Software Foundation projects) include:

  • Pig : This platform provides a high-level language to express programs that analyze large data sets. Pig is equipped with a compiler that converts Pig programs into sequences of MapReduce jobs executed by the Hadoop framework.
  • Hive : A data warehouse solution built on top of the Hadoop environment. It brings familiar relational database concepts, such as tables, columns, and partitions, and a subset of SQL (HiveQL) to the unstructured world of Hadoop. Hive queries are compiled into MapReduce jobs that are executed using Hadoop.
  • Jaql : A query language developed by IBM specifically for JSON (JavaScript Object Notation) that provides an SQL-like interface. Jaql handles nested data, is highly function-oriented, and is very flexible. It works with loosely structured data and serves as an interface to HBase columnar storage and to text analytics.
  • HBase : A column-oriented NoSQL data storage environment designed to support large, sparsely populated tables in Hadoop.
  • Flume : A distributed, reliable, and highly available service for efficiently moving large amounts of generated data. Flume is well suited to collecting logs from multiple systems and inserting them into HDFS (the Hadoop Distributed File System).
  • Lucene : A search engine library that provides high-performance, full-featured text search.
  • Avro : A data serialization technology that uses JSON to define data types and protocols to serialize data in a compact binary format.
  • ZooKeeper : A centralized service that maintains configuration information, provides naming, and supplies distributed synchronization and group services.
  • Oozie : A workflow scheduler system for managing and orchestrating the execution of Apache Hadoop jobs.

In addition, the BigInsights distribution includes the following technologies exclusive to IBM:

  • BigSheets : A browser-based, spreadsheet-like query and exploration interface that enables business users to easily collect and analyze data, leveraging the power of Hadoop. It provides built-in readers to handle data in many common formats, including JSON, comma-separated values (CSV), and tab-separated values (TSV).
  • Text analytics : A pre-built library of text annotators for common business entities. It provides a rich set of languages and tools to build custom annotators.
  • Adaptive MapReduce : An IBM Research solution that speeds up the execution of small MapReduce jobs by changing how MapReduce tasks are processed.


InfoSphere Platform

InfoSphere is a comprehensive information integration platform that includes data warehousing and analytics, information integration, master data management, lifecycle management, and data security and privacy. The platform improves the application development process so organizations can accelerate time-to-value, reduce integration costs, and improve information quality.

In general, BigInsights is not designed to replace a traditional relational database management system (DBMS) or traditional data warehouse. Specifically, it is not optimized for interactive queries on tabular data structures, online analytical processing (OLAP), or online transaction processing (OLTP) applications. However, as part of the IBM Big Data Platform, BigInsights provides potential integration points with other components of the platform, including data warehouses, data integration and governance engines, and third-party data analysis tools. As you'll see later in this article, it also integrates with InfoSphere Streams.

Streaming computing: a new computing paradigm

Streaming computing is a new computing paradigm driven by new data-intensive scenarios, such as ubiquitous mobile devices, location-based services, and pervasive sensors. These scenarios demand scalable computing platforms and parallel architectures that can process the massive streams of data they generate.

BigInsights technologies are insufficient for real-time stream processing tasks, because they are primarily geared toward batch processing of static data. When processing static data, a query such as list all connected users returns a single result set. With real-time processing of streaming data, you can register a continuous query such as list all users who have connected in the past 10 minutes, which returns continuously updated results. In the realm of static data, the user is looking for a needle in a haystack; in the realm of streaming data, the user can easily find the needle because the hay has been blown away.
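The difference can be sketched in a few lines of Python. A continuous query keeps only the events inside its time window and re-evaluates its result as new events arrive; the class and method names below are invented for illustration, not a Streams API:

```python
from collections import deque

class ConnectedUsersQuery:
    """Continuous query: users seen in the last `window_sec` seconds."""

    def __init__(self, window_sec=600):
        self.window_sec = window_sec
        self.events = deque()  # (timestamp, user) pairs, oldest first

    def on_event(self, timestamp, user):
        # Each arriving tuple updates the query's state.
        self.events.append((timestamp, user))

    def result(self, now):
        # Evict events that fell out of the window, then report the users.
        while self.events and self.events[0][0] <= now - self.window_sec:
            self.events.popleft()
        return {user for _, user in self.events}

q = ConnectedUsersQuery(window_sec=600)
q.on_event(0, "alice")
q.on_event(300, "bob")
q.on_event(700, "carol")
# At t=700, alice's event (t=0) has aged out of the 10-minute window.
```

Unlike a batch query, `result` can be asked at any moment and reflects only the data currently inside the window; the rest of the "hay" has already been discarded.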


The InfoSphere Streams platform supports real-time processing of streaming data, continuously updating the results of continuous queries and detecting insights while the data is still in motion.

InfoSphere Streams overview

InfoSphere Streams is designed to uncover meaningful patterns from moving information (data streams) in a window of minutes to hours. The platform delivers business value by enabling low-latency insights and better outcomes for time-critical applications such as fraud detection or network management. InfoSphere Streams can also combine multiple streams, enabling you to gain new insights from multiple streams, as shown in Figure 3.

Figure 3. Combined stream processing

The main design goals of InfoSphere Streams are:

  • Respond quickly to events and changing business conditions and needs.
  • Enable continuous analysis of data at rates orders of magnitude faster than existing systems.
  • Adapt quickly to changing data forms and types.
  • Manage high availability, heterogeneity, and distribution for the new streaming model.
  • Provide security and confidentiality for shared information.

InfoSphere Streams provides a programming model and IDE to define data sources, as well as software analysis modules called operators that are incorporated into processing execution units. It also provides the infrastructure to support the composition of scalable stream processing applications from these components. Major platform components include:

  • Runtime environment : This includes platform services, and a scheduler for deploying and monitoring Streams applications on a single host or an integrated set of hosts.
  • Programming model : You can use SPL (Streams Processing Language, a declarative language) to write Streams applications. You state your requirements in this language, and the runtime environment takes responsibility for determining how best to serve the request. In this model, a Streams application is represented as a graph of operators and streams connecting them.
  • Monitoring tools and management interfaces : Streams applications process data at rates far beyond what normal operating system monitoring utilities can track, so InfoSphere Streams provides its own tools for monitoring and managing this environment.


Stream Processing Language

SPL, the programming language for InfoSphere Streams, is a distributed data flow composition language. It is an extensible, full-featured language like C++ or Java™ that supports user-defined data types. You can write custom functions in SPL or in a native language (C++ or Java); user-defined operators can also be written in C++ or Java.

An InfoSphere Streams continuous application describes a directed graph consisting of operators that are interconnected and process multiple streams of data. The data flow can come from outside the system, or it can be generated inside the application. The basic building blocks of an SPL program include:

  • Stream : An infinite sequence of structured tuples. It can be consumed by operators tuple by tuple or through the definition of a window.
  • Tuple : A structured list of attributes and their types. Each tuple on a stream has the form specified by its stream type.
  • Stream type : Specifies the name and data type of each attribute in the tuple.
  • Window : A finite, ordered grouping of tuples. It can be based on counts, time, attribute values, or punctuation marks.
  • Operator : The fundamental building block of SPL; operators process data from incoming streams and produce new streams.
  • Processing element (PE) : The basic execution unit. A PE can encapsulate a single operator or multiple fused operators.
  • Job : A Streams application deployed for execution. It consists of one or more PEs. In addition to the set of PEs, the SPL compiler generates an ADL (Application Description Language) file that describes the structure of the application. The ADL file contains details for each PE, such as which binary to load and execute, scheduling constraints, stream formats, and an internal operator data flow graph.
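To make the window concept concrete, here is how a count-based tumbling window might group tuples. This is plain Python for illustration, not SPL; Streams also supports sliding windows and time-, attribute-, and punctuation-based policies that this toy omits:

```python
def tumbling_count_window(tuples, size):
    """Group a tuple sequence into non-overlapping windows of `size` tuples."""
    window = []
    for t in tuples:
        window.append(t)
        if len(window) == size:
            yield list(window)   # window is full: emit it downstream
            window.clear()
    if window:                   # flush the final partial window
        yield list(window)

windows = list(tumbling_count_window([1, 2, 3, 4, 5], 2))
# → [[1, 2], [3, 4], [5]]
```

An operator configured with such a window would run its logic (an aggregation, a join, a sort) once per emitted group instead of once per tuple.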

Figure 4 illustrates the InfoSphere Streams runtime view of an SPL program:

Figure 4. InfoSphere runtime execution

An operator represents a reusable stream transformer that converts input streams into output streams. In an SPL program, operator invocations are instantiated for a specific purpose, with specific assigned input and output streams and with locally specified parameters and logic. The input and output streams are named for each operator invocation. The built-in InfoSphere Streams operators provide many powerful capabilities:

  • Source: Reads input data and produces a stream.
  • Sink: Writes the data of a stream to external storage or systems.
  • Functor: Filters, transforms, and performs other functions on the tuples of an input stream.
  • Sort: Sorts stream data on defined keys.
  • Split: Splits input stream data into multiple output streams.
  • Join: Joins input streams on defined keys.
  • Aggregate: Aggregates stream data on defined keys.
  • Barrier: Combines and synchronizes tuples from multiple streams.
  • Delay: Delays a streaming data flow by a specified amount.
  • Punctor: Inserts punctuation marks into a stream to identify groups of tuples that should be processed together.
The point where a stream connects to an operator is called a port . Many operators (for example, Functor) have one input port and one output port, but an operator can also have no input port (for example, Source), no output port (for example, Sink), or multiple input or output ports (for example, Split and Join). Listing 1 shows an SPL example of a Sink, which has one input port and writes its output tuples to a disk file.

Listing 1. Sink example
() as Sink = FileSink(StreamIn) {
    param
    file : "/tmp/people.dat";
    format : csv;
    flush : 20u;
}

In Listing 1, file is a mandatory parameter that provides the path to the output file. The flush parameter flushes the output after the given number of tuples. The format parameter specifies the format of the output file.

A composite operator is a collection of operators. It represents a wrapper around a subgraph of primitive (non-composite) operators or nested composite operators. It is similar to a macro in procedural languages.

An application is represented by a main composite operator that has no input or output ports. Data can still flow in and out, not as streams within the graph, but through streams exported to and imported from other applications running in the same instance. The code in Listing 2 shows the skeleton of a main composite operator.

Listing 2. The structure of a main composite operator
composite Main {
    graph
    stream ... {
    }
    stream ... {
    }
    ...
}

As an example, let's look at a simple streaming application WordCount that counts the number of lines and words in a file. The program consists of the following flow graph:

  • A Source operator invocation (FileSource) that reads a file and sends its lines to a data stream.
  • A Functor operator invocation that counts the lines and words for each input line, sending the statistics to its output stream.
  • A Custom operator invocation, Counter, that aggregates the statistics for all lines in the file and prints the total at the end.

Before introducing WordCount's main composite operator, I'll define some helpers: a LineStat type, a countWords(rstring line) function that counts the number of words in a line, and an addM(mutable LineStat x, LineStat y) function that adds two LineStat values and stores the result in the first. Listing 3 defines these helpers.

Listing 3. WordCount helper definitions
type LineStat = tuple<int32 lines, int32 words>;

    int32 countWords(rstring line) {
        return size(tokenize(line, " \t", false));
    }

    void addM(mutable LineStat x, LineStat y) {
        x.lines += y.lines;
        x.words += y.words;
    }

The main composite operator can now be defined, as shown in Listing 4.

Listing 4. The main composite operator for WordCount
composite WordCount {
    graph
        stream<rstring line> Data = FileSource() {
            param
                file   : getSubmissionTimeValue("file");
                format : line;
        }
        stream<LineStat> OneLine = Functor(Data) {
            output OneLine : lines = 1, words = countWords(line);
        }
        () as Counter = Custom(OneLine) {
            logic
                state   : mutable LineStat sum = { lines = 0, words = 0 };
                onTuple OneLine : addM(sum, OneLine);
                onPunct OneLine : if (currentPunct() == Sys.FinalMarker)
                                      println(sum);
        }
}
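For readers more familiar with general-purpose languages, the same flow graph can be approximated in a few lines of Python. This sketch mirrors only the logic (SPL's distributed runtime and punctuation handling have no analogue here, and `str.split` stands in for SPL's `tokenize`):

```python
def word_count(lines):
    # Mirrors the SPL graph: per-line stats (Functor) folded into a running
    # sum (the Custom operator's addM logic), printed once input is exhausted.
    total = {"lines": 0, "words": 0}
    for line in lines:
        stat = {"lines": 1, "words": len(line.split())}  # one LineStat per line
        total["lines"] += stat["lines"]                  # addM(sum, OneLine)
        total["words"] += stat["words"]
    return total

stats = word_count(["a b c", "d e"])
# → {'lines': 2, 'words': 5}
```

The key difference is that the SPL version processes an unbounded stream as tuples arrive, whereas this function must be handed a finite sequence.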

Development environment

InfoSphere Streams provides an agile development environment consisting of the Eclipse IDE, the Streams Live Graph view, and a stream debugger. The platform also includes toolkits for accelerating and simplifying solution development for specific functions or industries:

  • Standard Toolkit : Contains the default operators shipped with the product:
    • Relational operators such as Filter, Sort, Functor, Join, Punctor, and Aggregate
    • Adapter operators such as FileSource, FileSink, DirectoryScan, and Export
    • Utility operators such as Custom, Split, DeDuplicate, Throttle, Union, Delay, ThreadedSplit, Barrier, and DynamicFilter
  • Internet Toolkit : Includes operators for HTTP, FTP, HTTPS, FTPS, and RSS.
  • Database Toolkit : Supports DBMSs including DB2®, Netezza, Oracle Database, SQL Server, and MySQL.
  • Other built-in toolkits : Finance, Data Mining, Big Data, and Text toolkits.

In addition, you can define your own toolkits that provide reusable sets of operators and functions and create cross-domain and domain-specific accelerators. Toolkits can contain primitive and composite operators, and they can use both native and SPL functions.

Integration and interaction between BigInsights and InfoSphere Streams

Companies that continuously generate vast amounts of valuable data from their systems struggle to analyze that data for two important purposes: sensing and responding to current events in a timely manner, and making predictions based on historical knowledge to guide those responses. This creates the need to seamlessly analyze data in motion (current data) and data at rest (historical data) while processing massive, diverse, high-velocity volumes. The integration of IBM's data-in-motion (InfoSphere Streams) and data-at-rest (BigInsights) platforms addresses three main use cases:

  • Scalable data ingest : Continuously ingest data into BigInsights via Streams. For example, unstructured textual data from social media sources such as Twitter and Facebook often requires text extraction to identify sentiment and other cues. In this case, it is much more efficient to eliminate extraneous data such as spam as early as possible, by performing text extraction while the data is being ingested. This integration allows companies to avoid large, unnecessary storage costs.
  • Accelerate and enrich : Accelerate and enrich the analysis of incoming data on Streams with historical context from BigInsights. BigInsights can analyze data ingested and integrated from a variety of continuous and static data sources over long time windows, and the results of that analysis provide context that online analyses can use as a known starting state. Returning to the social media scenario, an incoming Twitter message carries only the ID of the person who posted it, but historical data can enrich this information with attributes such as the user's influence, allowing downstream analysis to appropriately weigh the attitudes expressed by this user.
  • Adaptive analytical models : Models produced by analytical operations on BigInsights, such as data mining, machine learning, or statistical modeling, can serve as the basis for analyzing incoming data on Streams and can be updated based on real-time observations.

The data-in-motion and data-at-rest portions of the IBM Big Data Platform can be integrated through three main types of components:

  • Universal Analytics : The same analytics capabilities are available on Streams and BigInsights.
  • Common data formats : Streams format operators convert data between the Streams tuple format and the data format used by BigInsights.
  • Data exchange adapters : Streams Source and Sink adapters can be used to exchange data with BigInsights.
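As a rough sketch of the common-data-formats idea, the conversion between an in-memory tuple and a CSV line that a BigInsights-side reader could consume might look like the following. This is illustrative Python with invented helper names; in practice the Streams format operators are configured in SPL rather than written by hand:

```python
import csv
import io

def tuple_to_csv(tup, fields):
    # Serialize one tuple to a CSV line, in a fixed attribute order.
    buf = io.StringIO()
    csv.writer(buf).writerow([tup[f] for f in fields])
    return buf.getvalue().strip()

def csv_to_tuple(line, fields):
    # Parse a CSV line back into a tuple with named attributes.
    values = next(csv.reader([line]))
    return dict(zip(fields, values))

fields = ["user", "words"]
line = tuple_to_csv({"user": "alice", "words": 3}, fields)   # "alice,3"
tup = csv_to_tuple(line, fields)
```

Agreeing on such a serialization at the boundary is what lets a Sink adapter stream tuples into files that BigSheets or a Hive table can later read without a custom parser.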

Concluding remarks

Helping companies manage, analyze, and leverage big data is a key area of focus for IBM's big data platform. This article introduced InfoSphere Streams, IBM's software platform for analyzing data in motion (streaming data). It also outlined how to integrate InfoSphere Streams with BigInsights, IBM's software platform for storing and analyzing data at rest, to enable more complex analytics. Many companies recognize that leveraging big data is an important information management capability that provides unique business value and advantage. If you're ready to use InfoSphere Streams, see Resources for free training materials and software.

