Python simulation generates new energy vehicle data

Article Directory

Table of contents

overview

overall architecture process

1. Composition and analysis of integrated functions

2. Generate 100 pieces of today's data

3. Generate data with repeated values and data other than today's date

4. Store data in hdfs

summary

overview

The topics are as follows:

编写一个程序，每天凌晨3点模拟生成当天的新能源车辆数据（字段信息必须包含：车架号、行驶总里程、车速、车辆状态、充电状态、剩余电量SOC、SOC低报警、数据生成时间等）。

要求：1、最终部署时，要将这些数据写到HDFS中。

2、车辆数据要按天存储，数据格式是JSON格式，另外如果数据文件大于100M，
则另起一个文件存。每天的数据总量不少于300M。比如假设程序是2023-01-1 03点运行，
那么就将当前模拟生成的数据写入到HDFS的/can_data/2023-01-01文件夹的
can-2023-01-01.json文件中，写满100M，则继续写到can-2023-01-01.json.2文件中，依次类推；

3、每天模拟生成的车辆数据中，必须至少包含20辆车的数据，即要含有20个车架号（一个车架号表示一辆车，用字符串表示）；

4、每天生成的数据中要有少量（20条左右）重复数据（所有字段都相同的两条数据则认为是重复数据），且同一辆车的两条数据的数据生成时间间隔两秒；

5、每天生成的数据中要混有少量前几天的数据（即数据生成时间不是当天，而是前几天的）。

overall architecture process

Knowing the topic, let's analyze the implementation steps:

1. Create multiple functions first. For example, the function random_vin is used to generate a random VIN, and the function rando_mileage generates a random mileage. Then create an empty list and create an integration function to obtain and splice the data generated by each vehicle information function. Finally, Form a complete vehicle data

2. Apply the completed integration function according to the requirements to generate data that meets the format, and finally import the data into the built hdfs.

1. Composition and analysis of integrated functions

In the first step, we first import the required libraries : radar (realize random or specify date and time) random (generate random numbers) time (get timestamp) pyhdfs (store json data in hdfs)

import random
import time
import radar
import pyhdfs

Generate random VIN function
Generate random kilometers

generate random speed

Generate random vehicle states

Generate random state of charge

Generate random power remaining

Generate a random drive time function

Generate random soc function

Integrating functions (functions before splicing to generate complete data)

2. Generate 100 pieces of today's data

刚才我们已经创建好可以生成1条完整的函数了，那么只需控制数量便可完成要求

3. Generate data with repeated values and data other than today's date

This can also be solved by cooperating with the functions of the imported library, the code is as follows:

4. Store data in hdfs

The topic requirements are as follows:

So we need to use the pyhdfs library that imports hdfs, and we need to set the host number, port number and other information. The specific code is as follows:

summary

Even if our task is over here, the whole process from setting a random VIN to finally completing all the requirements and storing it in hdfs is here. You will understand after seeing it a few times, and finally attach the complete code

import random
import time
import radar


def random_vin():
    # 随机生成车架号
    content_map = {
        'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5,
        'F': 6, 'G': 7, 'H': 8, 'I': 0, 'J': 1, 'K': 2, 'L': 3,
        'M': 4, 'N': 5, 'O': 0, 'P': 7, 'Q': 8, 'R': 9, 'S': 2, 'T': 3,
        'U': 4, 'V': 5, 'W': 6, 'X': 7, 'Y': 8, 'Z': 9, "0": 0, "1": 1,
        "2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8": 8, "9": 9
    }
    # 位置的全值
    location_map = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]
    vin = ''.join(random.sample('0123456789ABCDEFGHJKLMPRSTUVWXYZ', 17))
    num = 0
    for i in range(len(vin)):
        num = num + content_map[vin[i]] * location_map[i]
    vin9 = num % 11
    if vin9 == 10:
        vin9 = "X"
    list1 = list(vin)
    list1[8] = str(vin9)
    vin = ''.join(list1)
    return vin


def rando_mileage():
    # 随机生成行驶总里程
    mileage = round(random.uniform(1000, 100000), 2)
    return mileage


def random_car_speed():
    # 随机生成车速
    car_speed = round(random.uniform(0, 200), 2)
    return car_speed


def random_state():
    # 随机生成车辆状态
    state = ['start', "unstart"]
    return random.sample(state, 1)[0]


def random_charge_state():
    # 随机生成充电状态 充电：放电：空闲
    state = ['recharge', "discharge", "leisure"]
    return random.sample(state, 1)[0]


def random_dump_energy():
    # 随机生成剩余电量SOC
    return round(random.uniform(0, 100), 2)


def random_time(datetimestr):
    # 随机生成时间
    # 转换成时间数组
    return datetimestr


def random_soc():
    # 随机生成soc
    state = ['no', "yes"]
    return random.sample(state, 1)[0]


# 随机生成前几天时间
def random_otherday_time(start, end):
    '''
        start:起始时间
        end：终止时间
        "2023-06-01T03:00:00","2023-06-29T03:00:00"
    '''
    otherDay_time = radar.random_datetime(start, end)
    return otherDay_time.strftime('%Y-%m-%d %H:%M:%S')


def generated_data(datetimeStr, number):
    '''
        生成数据
        datetimeStr: 需要生成的日期时间
        number: 生成数据条数
    '''

    data = []
    
    # 需要生成数据条数
    '''
        车架号:vin  行驶总里程:mileage     车速:car_speed
        车辆状态:state  充电状态:charge_state   剩余电量:dump_energy
        SOC低报警:soc   数据生成时间:time
    '''
    for i in range(number):
        values = {"vin": random_vin(), "mileage": rando_mileage(), "car_speed": random_car_speed(),
                  "state": random_state(),
                  "charge_state": random_charge_state(), "dump_energy": str(random_dump_energy()) + "%",
                  "soc": random_soc(),
                  "time": random_time(datetimeStr)}
        # print(values)
        # 时间戳+2，数据生成时间离上条数据间隔2秒

        # 转换成时间数组
        timeArray = time.strptime(datetimeStr, "%Y-%m-%d %H:%M:%S")
        # 转换成时间戳
        timeStamp = time.mktime(timeArray)
        timeStamp += 2
        # 转换成localtime
        time_local = time.localtime(timeStamp)
        # 转换成新的时间格式(2016-05-05 20:28:54)
        datetimeStr = time.strftime("%Y-%m-%d %H:%M:%S", time_local)

        data.append(values)
    return data


if __name__ == '__main__':
    result = []
    duplicate_data = []
    # 今天时间
    today_time = "2023-06-30 03:00:00"
    # 生成今天的数据100条
    data = generated_data(today_time, 100000)

    # 每天生成的数据中要有少量（20条左右）重复数据（所有字段都相同的两条数据则认为是重复数据）
    for i in range(20):
        duplicate_data.append(data[random.randrange(0, len(data))])
    # 把获取到的数据添加到data后面即可出现重复数据
    result = data + duplicate_data

    # 每天生成的数据中要混有少量前几天的数据,这里生成10条6月1到6月30的数据
    other_count = 10
    for i in range(other_count):
        otherday = random_otherday_time("2023-06-01T03:00:00", "2023-06-29T03:00:00")
        other_data = generated_data(otherday, 1)
        result += other_data

    # 车辆数据要按天存储，数据格式是JSON格式，另外如果数据文件大于100M，则另起一个文件存。
    # 每天的数据总量不少于300M。比如假设程序是2023-01-1 03点运行，那么就将当前模拟生成的数据
    # 写入到HDFS的/can_data/2023-01-01文件夹的can-2023-01-01.json文件中，
    # 写满100M，则继续写到can-2023-01-01.json.2文件中，依次类推；
    import pyhdfs

    # 创建连接
    fs = pyhdfs.HdfsClient(hosts="192.168.116.3:50070", user_name="root")
    # 写入文件的今天时间
    today_time2 = today_time[:10]
    path = "/can_data/" + today_time2 + "/can-" + today_time2 + ".json"
    fs.create(path, "")
    j = 1
    for i in range(len(result)):
        # 检测到文件大于等于100M,切换文件继续继续存储
        if (fs.get_file_status(path).length) <= 1024 * 1024 * 100:
            fs.append(path, str(result[i]) + ",")
        else:
            path = "/can_data/" + today_time2 + "/can-" + today_time2 + ".json.%d" % j
            fs.create(path, "")
            j += 1