Flink Learning (II): Data Cleansing Exercise

Author: chen_h
WeChat & QQ: 862251340
WeChat official account: coderpai


Flink Learning (I): Introduction to stream processing

Flink Learning (II): Data Cleansing Exercise


Data preparation

First, download the required test data from the following addresses:

wget http://training.ververica.com/trainingData/nycTaxiRides.gz
wget http://training.ververica.com/trainingData/nycTaxiFares.gz

We only need to download the files; there is no need to decompress them.

Taxi data format

Our taxi data set (TaxiRide) contains information about individual taxi rides in New York City. Each ride is represented by two events: a ride start event and a ride end event. Each event contains 11 fields:

rideId         : Long      // a unique id for each ride
taxiId         : Long      // a unique id for each taxi
driverId       : Long      // a unique id for each driver
isStart        : Boolean   // TRUE for ride start events, FALSE for ride end events
startTime      : DateTime  // the start time of a ride
endTime        : DateTime  // the end time of a ride,
                           //   "1970-01-01 00:00:00" for start events
startLon       : Float     // the longitude of the ride start location
startLat       : Float     // the latitude of the ride start location
endLon         : Float     // the longitude of the ride end location
endLat         : Float     // the latitude of the ride end location
passengerCnt   : Short     // number of passengers on the ride

Note: The data set contains records with invalid or missing coordinate information (longitude and latitude are 0.0).

There is also a related data set containing taxi fare data (TaxiFare), with the following fields:
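
As a minimal sketch of how such records might be detected, the check below treats a coordinate pair as missing when both values are exactly 0.0. The class and method names are hypothetical and not part of the training project; the sketch only assumes the Float longitude and latitude fields listed above.

```java
// Hypothetical helper for illustration; not part of the training project.
class CoordinateCheck {

    // Treats a coordinate pair as missing when both values are exactly 0.0,
    // which is how the note above describes invalid records.
    static boolean isMissing(float lon, float lat) {
        return lon == 0.0f && lat == 0.0f;
    }

    public static void main(String[] args) {
        System.out.println(isMissing(0.0f, 0.0f));      // prints "true"
        System.out.println(isMissing(-73.98f, 40.75f)); // prints "false"
    }
}
```

In the exercise itself no separate check of this kind is needed, because a location at (0.0, 0.0) is far outside New York City and is removed by the NYC filter anyway.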

rideId         : Long      // a unique id for each ride
taxiId         : Long      // a unique id for each taxi
driverId       : Long      // a unique id for each driver
startTime      : DateTime  // the start time of a ride
paymentType    : String    // CSH or CRD
tip            : Float     // tip for this ride
tolls          : Float     // tolls for this ride
totalFare      : Float     // total fare collected

Generating a taxi data stream in a Flink program

Note: The exercises already provide the code for working with these taxi data streams.

We provide a Flink source function (TaxiRideSource) that reads a .gz file of taxi ride records and emits a stream of TaxiRide events. The source function operates in event time. There is a similar source function (TaxiFareSource) for TaxiFare events.

Modify the ExerciseBase file

After downloading the data sets, open the com.ververica.flinktraining.exercises.datastream_java.utils.ExerciseBase class in your IDE and edit these two lines so that they point to the two downloaded taxi data files. (The complete code below imports the same class under the com.dataartisans prefix, so use whichever prefix your copy of the training project has.)

pathToRideData = "YOUR DATA PATH";
pathToFareData = "YOUR DATA PATH";

Exercise requirements

The task of the "Taxi Ride Cleansing" exercise is to clean up the TaxiRide event stream by removing events that do not start or do not end in New York City.

The GeoUtils utility class provides a static method isInNYC(float lon, float lat) that checks whether a location is within the NYC area.

In short:

  1. Keep only the taxi rides that both start and end in New York City.
  2. Use the GeoUtils class to determine whether a location is in New York City.
  3. All that is needed is a very simple filter.

Data input:
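
The three points above can be sketched without Flink as a plain predicate over a list. This is only an illustration of the filtering logic: the Ride class is a stripped-down stand-in for TaxiRide, and the bounding box constants are assumed values, not necessarily the ones GeoUtils.isInNYC uses.

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained sketch; all names and bounds here are illustrative assumptions.
class NycFilterSketch {

    // Assumed bounding box standing in for the area checked by GeoUtils.isInNYC.
    static final float LON_WEST = -74.05f;
    static final float LON_EAST = -73.70f;
    static final float LAT_SOUTH = 40.50f;
    static final float LAT_NORTH = 41.00f;

    // Minimal stand-in for TaxiRide holding only the four coordinate fields.
    static class Ride {
        final float startLon, startLat, endLon, endLat;

        Ride(float startLon, float startLat, float endLon, float endLat) {
            this.startLon = startLon;
            this.startLat = startLat;
            this.endLon = endLon;
            this.endLat = endLat;
        }
    }

    // True if the location lies inside the assumed bounding box.
    static boolean isInNYC(float lon, float lat) {
        return lon >= LON_WEST && lon <= LON_EAST
            && lat >= LAT_SOUTH && lat <= LAT_NORTH;
    }

    // A ride is kept only if both its start and end points are inside the box.
    static boolean keep(Ride ride) {
        return isInNYC(ride.startLon, ride.startLat)
            && isInNYC(ride.endLon, ride.endLat);
    }

    // Returns a new list containing only the rides that pass keep().
    static List<Ride> clean(List<Ride> rides) {
        List<Ride> kept = new ArrayList<>();
        for (Ride ride : rides) {
            if (keep(ride)) {
                kept.add(ride);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Ride> rides = new ArrayList<>();
        rides.add(new Ride(-73.98f, 40.75f, -73.95f, 40.78f)); // starts and ends inside
        rides.add(new Ride(-73.98f, 40.75f, 0.0f, 0.0f));      // end location missing
        System.out.println(clean(rides).size());               // prints "1"
    }
}
```

In the Flink program the same predicate becomes the body of a FilterFunction<TaxiRide>, and the stream's filter operator takes the place of the loop in clean().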

// get an ExecutionEnvironment
StreamExecutionEnvironment env =
  StreamExecutionEnvironment.getExecutionEnvironment();
// configure event-time processing
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

// get the taxi ride data stream
DataStream<TaxiRide> rides = env.addSource(
  new TaxiRideSource("/Users/XXX/Resources/2018/trainingData/nycTaxiRides.gz", maxDelay, servingSpeed));

Data output

Rides that do not both start and end in New York City are filtered out; the remaining rides are printed to the console.

Complete code:

package com.dataartisans.flinktraining.exercises.datastream_java.basics;

import com.dataartisans.flinktraining.exercises.datastream_java.datatypes.TaxiRide;
import com.dataartisans.flinktraining.exercises.datastream_java.sources.TaxiRideSource;
import com.dataartisans.flinktraining.exercises.datastream_java.utils.ExerciseBase;
import com.dataartisans.flinktraining.exercises.datastream_java.utils.GeoUtils;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

/**
 * The "Ride Cleansing" exercise from the Flink training
 * (http://training.data-artisans.com).
 * The task of the exercise is to filter a data stream of taxi ride records to keep only rides that
 * start and end within New York City. The resulting stream should be printed.
 * <p>
 * Parameters:
 * -input path-to-input-file
 */
public class RideCleansingExercise extends ExerciseBase {
    public static void main(String[] args) throws Exception {

        ParameterTool params = ParameterTool.fromArgs(args);
        final String input = params.get("input", ExerciseBase.pathToRideData);

        final int maxEventDelay = 60;       // events are out of order by max 60 seconds
        final int servingSpeedFactor = 600; // events of 10 minutes are served in 1 second

        // set up streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.setParallelism(ExerciseBase.parallelism);

        // start the data generator
        DataStream<TaxiRide> rides = env.addSource(rideSourceOrTest(new TaxiRideSource(input, maxEventDelay, servingSpeedFactor)));

        DataStream<TaxiRide> filteredRides = rides
                // filter out rides that do not start or stop in NYC
                .filter(new NYCFilter());

        // print the filtered stream
        printOrTest(filteredRides);

        // run the cleansing pipeline
        env.execute("Taxi Ride Cleansing");
    }

    private static class NYCFilter implements FilterFunction<TaxiRide> {

        @Override
        public boolean filter(TaxiRide taxiRide) throws Exception {
            // keep the ride only if both its start and end points are in NYC
            return GeoUtils.isInNYC(taxiRide.startLon, taxiRide.startLat) && GeoUtils.isInNYC(taxiRide.endLon, taxiRide.endLat);
        }
    }

}

We only changed the filter function of the program:

private static class NYCFilter implements FilterFunction<TaxiRide> {

    @Override
    public boolean filter(TaxiRide taxiRide) throws Exception {
        // keep the ride only if both its start and end points are in NYC
        return GeoUtils.isInNYC(taxiRide.startLon, taxiRide.startLat) && GeoUtils.isInNYC(taxiRide.endLon, taxiRide.endLat);
    }
}

Origin blog.csdn.net/CoderPai/article/details/104898891