How small teams can use JuiceFS

I started using JuiceFS back when I was at ENJOY, and it has followed me through the four small companies I've worked for since. It has become an indispensable piece of infrastructure for me, and a genuine helper for the small teams I've served. Taking advantage of a recent call for articles, I'll continue my small-team series and introduce how our teams have been using JuiceFS over the years.

That said, the uses described here are probably not that "magical". JuiceFS has a large community, and these tricks have likely all been tried long ago. No matter: this article is really an expansion of our internal project documentation, meant mainly to record some maintenance practices and share the experience.

Magical Use: Container Shared Storage

Although JuiceFS has CSI support, we have always mounted JuiceFS at /jfs on every Kubernetes node, so that any containerized application can get shared storage simply by mounting the host directory as a hostPath volume. A few points to note about this approach:

  • jfs.mount must be started before the container runtime, since the latter now depends on it. Taking Docker as an example, you can write:
# /etc/systemd/system/docker.service.d/12-after-jfs.conf
[Unit]
After=jfs.mount
  • The directory to be mounted must be created manually first, with permissions set to match the uid of the container process. This is genuinely inconvenient; if your team can't stand the fiddling, just use CSI directly.
  • Cluster expansion becomes slightly awkward. Since this approach requires JuiceFS to be mounted on every node, adding a node to the Kubernetes cluster now involves the extra step of mounting JuiceFS first. In the extreme case where you need to scale urgently, this design costs real time, which is one more reason to just use CSI.

Having listed so many problems, it sounds like CSI should be preferred over hostPath. But the advantage of hostPath is that management is simpler and easier to reason about: by our convention, directories get names like /jfs/[appname]-[cluster], which is clear at a glance. For colleagues who aren't familiar with the Kubernetes PV family of concepts, this is much easier to work with.
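To make the convention concrete, here is a minimal sketch of a plain Kubernetes manifest consuming such a directory (the app name and paths are hypothetical, following the naming convention above):

apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: myapp
      image: myapp:latest
      volumeMounts:
        - name: jfs-data
          mountPath: /data
  volumes:
    - name: jfs-data
      hostPath:
        # /jfs/[appname]-[cluster], created and chown'ed beforehand
        path: /jfs/myapp-prod
        type: Directory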

Magical Use: Network Disk

With a place to dump files however you like, the natural next thought is how to share them easily. Everyone knows JuiceFS can be mounted on all sorts of platforms (it's even quite usable on Windows), but that's not what I want to introduce here, because asking every user to mount JuiceFS locally is a real hurdle (not to mention the security issues). What I mean is that you can stand up a simple web service that mounts JuiceFS and exposes the files through a download portal.

Doing this is really easy; it took me only five minutes from idea to running service. Thanks to lain, all I need is the short values.yaml below to pull it up with Python's http.server:

appname: jfs-http-server

volumes:
  - name: jfs-data
    hostPath:
      path: "/jfs"
      type: Directory

volumeMounts:
  - name: jfs-data
    mountPath: /jfs

deployments:
  web:
    replicaCount: 1
    image: python:latest
    podSecurityContext: {}
    resources:
      limits:
        cpu: 1000m
        memory: 80M
      requests:
        cpu: 10m
        memory: 80M
    command: ["python", '-m', 'http.server']
    workingDir: /jfs
    containerPort: 8000

ingresses:
  - host: jfs
    deployName: web
    paths:
      - /

Anyone with a little development experience can see what's going on here: it uses the community python:latest image to run http.server, mounting the host's /jfs directory. Once the service is online, you can browse and download everything under jfs at jfs.example.com.

This section may seem to be advertising lain rather than jfs, but in the DevOps world, useful things attract each other. If your team practices DevOps too, do take a look at lain.

Magical Use: Ad Hoc Programming in JupyterLab

Our team deals with data all the time: not just data reporting and visual analysis, but also the occasional need to verify a development idea somewhere the data can actually be reached. We can't have everyone connect to the production databases from their own machines; that is neither convenient nor safe, and not everyone is good at setting up local tooling for it. So I deployed JupyterLab and made plenty of usability improvements on top, namely built-in helpers for many of the company's internal databases, so that every developer and data engineer can do data analysis with the pre-packaged Python libraries, and even deliver data visualizations directly with bokeh.

Obviously, code written in Jupyter needs version management too; painstakingly written code must not be lost. So I simply set JupyterLab's working directory to JuiceFS, and all notebooks are stored there. Deploying JupyterLab with lain is just as simple; here is the values.yaml:

appname: lab

env:
  SHELL: zsh
  IPYTHONDIR: /lain/app

volumes:
  - name: jfs
    hostPath:
      path: "/jfs/lab"
      type: Directory

volumeMounts:
  - name: jfs
    mountPath: /jfs/lab

deployments:
  web:
    replicaCount: 1
    podSecurityContext: {'runAsUser': 0}
    terminationGracePeriodSeconds: 70
    resources:
      limits:
        cpu: 2
        memory: 4Gi
      requests:
        cpu: 100m
        memory: 1Gi
    command: ['jupyter', 'lab', '--allow-root', '--collaborative', '--no-browser', '--config=/lain/app/jupyter_notebook_config.py']
    containerPort: 8888
    workingDir: /jfs/lab/notebooks

ingresses:
  - host: lab
    deployName: web
    paths:
      - /

build:
  base: lain:latest
  prepare:
    script:
      - apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv E0C56BD4
      - echo "deb https://repo.clickhouse.tech/deb/stable/ main/" | tee /etc/apt/sources.list.d/clickhouse.list
      - apt-get update
      - apt-get install -y apt-transport-https ca-certificates dirmngr clickhouse-client=20.12.8.5 clickhouse-common-static=20.12.8.5
      - apt-get clean
      - pip3 install -r requirements.txt
  script:
    - pip3 install -r requirements.txt

Jupyter by itself is not that useful, so as mentioned above, we have to put real work into ease of use, for example by wrapping the various database clients:

from os import environ

import pandas as pd
import pymysql
from IPython.core.display import display


class MySQLClient:

    def __init__(self, config):
        config.update({
            'charset': 'utf8mb4',
            'cursorclass': pymysql.cursors.DictCursor,
            'autocommit': True,
        })
        self.config = config

    def use(self, db):
        self.config['database'] = db
        # just to make sure this db exists
        return self.execute(f'use {db}')

    def fetch(self, sql, *args, **kwargs):
        return self.execute(sql, *args, **kwargs)

    def fetchone(self, sql, *args, **kwargs):
        kwargs.update({'fetchone': True})
        return self.execute(sql, *args, **kwargs)

    def executemany(self, sql, *args, **kwargs):
        con = pymysql.connect(**self.config)
        with con.cursor() as cur:
            cur.executemany(sql, *args, **kwargs)
            res = cur.fetchall()

        con.close()
        return res

    def execute(self, sql, *args, **kwargs):
        con = pymysql.connect(**self.config)
        with con.cursor() as cur:
            fetchone = kwargs.pop('fetchone', None)
            as_pandas = kwargs.pop('as_pandas', None)
            cur.execute(sql, *args, **kwargs)
            if fetchone:
                res = cur.fetchone()
            else:
                res = cur.fetchall()

        con.close()
        if as_pandas:
            return pd.DataFrame(res)
        return res

    x = execute

    def preview(self, table_name=None, n=2):
        """
        # first, use a database
        mysql_client.use('configcenter')
        # show tables
        mysql_client.preview()
        # select example data from one table
        mysql_client.preview('post')
        # study one single column
        mysql_client.preview('post.visibility')
        """
        if not table_name:
            return self.execute('show tables', as_pandas=True)
        if '.' in table_name:
            n = max([n, 20])
            table_name, column_name = table_name.split('.')
            part = self.execute(
                f'''
                SELECT DISTINCT {column_name}, count(*) AS count
                FROM {table_name}
                GROUP BY {column_name}
                ORDER BY count DESC
                LIMIT {n}
                ''', as_pandas=True
            )
            return part

        part1 = self.execute(f'''
        SELECT `column_name`,
               `column_type`,
               `column_comment`
        FROM `information_schema`.`COLUMNS`
        WHERE `table_name` = "{table_name}"
        ''', as_pandas=True)
        display(part1)
        part2 = self.execute(
            f'''
            SELECT *
            FROM {table_name}
            ORDER BY RAND()
            LIMIT {n}
            ''', as_pandas=True
        )
        return part2


# `jalo` is assumed to be a small in-house helper that parses JSON,
# roughly equivalent to: from json import loads as jalo
MYSQL_CONFIG = jalo(environ['MYSQL_CONFIG'])
mysql_client = mysql = my = MySQLClient(MYSQL_CONFIG)
mysql_client.use('mydatabase')

With all these wrappers in place, you can hardly imagine how handy it becomes:
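For example, a typical notebook session with these helpers preloaded looks something like this (the database and table names come from the preview docstring above and are hypothetical):

# the client objects above are pre-imported in every notebook
my.use('configcenter')         # pick a database
my.preview()                   # list its tables
my.preview('post')             # column info plus a couple of random rows
my.preview('post.visibility')  # value distribution of one column
df = my.x('SELECT * FROM post LIMIT 100', as_pandas=True)  # straight into pandas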

Relying on shortcut calls like these, people across the teams I've worked with, from back-end engineers to data analysts and even product managers, can work directly in JupyterLab.

This section has, somewhat embarrassingly, been mostly about Jupyter. But the project really is closely tied to JuiceFS:

  • Generated data reports, and the products of other ad hoc jobs, are placed on JuiceFS, where they can be shared with others directly (see the "Network Disk" section above)
  • All code (notebooks, in the Jupyter world) is stored on JuiceFS and backed up regularly with juicefs snapshot (a sketch follows this list)
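A scheduled CI job is enough for those backups; below is a minimal sketch in the spirit of the Trivy job later in this article. The paths are hypothetical, and the juicefs snapshot invocation (source path, then destination path) is an assumption based on the cloud-edition CLI:

backup_notebooks:
  stage: schedule
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    # snapshots are copy-on-write, so keeping dated copies is cheap
    - juicefs snapshot /jfs/lab /jfs/backup/lab-$(date +%Y%m%d)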

Magical Use: GitLab, ClickHouse, Elasticsearch, etc.

In theory, whenever an application needs to put data on disk, that data can go on JuiceFS instead, as long as the performance envelope fits. This section introduces some of the usages we have explored:

  • GitLab's demands on disk IO are fairly high, especially around merge requests; if your code base is large, consider moving it to SSD as circumstances dictate. But if you are a small team and your projects aren't huge, putting it on JuiceFS brings a series of extra benefits, such as grepping across all repositories on JuiceFS globally (handy for hunting down junk code) and taking convenient full backups of all repo data with juicefs snapshot.
  • We run ClickHouse on JuiceFS CSI, which makes pulling up a CH cluster easy (a generic sketch follows). This is described in more detail in "How to maintain Sentry in a small team" and won't be repeated here.
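For reference, consuming JuiceFS through CSI is just a regular PVC against the JuiceFS StorageClass; a minimal sketch, assuming the StorageClass is named juicefs-sc as in the CSI driver's documentation:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clickhouse-data
spec:
  accessModes:
    - ReadWriteMany   # JuiceFS volumes can be shared across pods
  storageClassName: juicefs-sc
  resources:
    requests:
      storage: 100Gi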

Magical Use: CI

Take GitLab CI as an example: configure the mount directories for the Runner:

  [runners.docker]
    volumes = ["/var/run/docker.sock:/var/run/docker.sock", "/jfs:/jfs:rw", "/cache"]

I hadn't expected that simply mounting JuiceFS into the CI Runner would open up so many possibilities. None of the following cases is particularly clever, but JuiceFS makes each of them remarkably convenient and easy to maintain:

Publish build artifacts (Artifacts)

JuiceFS is for storing files in the first place, so throw your build products (Android packages, say) onto jfs, combine that with the file sharing described in the "Network Disk" section above, and you have easy download links. If your team needs to distribute build products to non-technical people, JuiceFS plus Python's http.server makes a fine match.
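A minimal sketch of such a job (the build command and paths are hypothetical):

publish_apk:
  stage: release
  script:
    - ./gradlew assembleRelease
    # /jfs is mounted into the runner, so this lands directly on shared storage
    - cp app/build/outputs/apk/release/app-release.apk "/jfs/artifacts/myapp-${CI_COMMIT_SHORT_SHA}.apk"
    # now anyone can grab it from the http.server portal described above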

Continuous deployment

Not every release means rolling a new container. For example, many front-end application updates really just build and publish static files, and that step is usually done in CI. Front-end releases therefore pair nicely with jfs:

  • The CI job compiles the front-end application's static files and publishes them to a versioned path under jfs
  • Update the Nginx configuration to point the site at the newest version's path; that's the release done, and CDN warm-up can be triggered if needed
  • To roll back, just re-run the CI job for the desired version, and the old version is deployed again (a sketch of the job follows)
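A sketch of the publishing job (the build commands and paths are hypothetical; flipping a symlink is just one way to repoint Nginx at the latest version):

deploy_frontend:
  stage: release
  script:
    - npm ci && npm run build
    # publish the build under a versioned path on jfs
    - cp -r dist "/jfs/frontend/myapp-${CI_COMMIT_SHORT_SHA}"
    # point the path served by Nginx at the new version
    - ln -sfn "/jfs/frontend/myapp-${CI_COMMIT_SHORT_SHA}" /jfs/frontend/myapp-current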

Another example: we have some projects that run on particular servers, some in the machine room and some in the office. I could of course put these machines on the company VPN and configure each one to git clone and pull on a schedule, but with jfs around, who would still exchange data in such a laborious way? Here's what we do instead:

  • All servers mount jfs; this is baked into our machine initialization process
  • CI releases the project code to jfs: for example, on every code update the contents under /jfs/[appname] are overwritten (see the sketch below)
  • It is then easy to watch /jfs/[appname] and reload the service on change, or simply schedule a restart in the small hours every night, and so on
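The CI side can be a one-liner; a sketch with a hypothetical appname:

sync_code:
  stage: release
  script:
    # mirror the repo onto jfs; --delete keeps the copy exact
    - rsync -a --delete --exclude=.git ./ /jfs/myapp/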

Global cache

GitLab CI, like the various other CI systems, has its own caching mechanisms, but some caches can simply be made global rather than per-project. In those cases, just keep one copy directly on JuiceFS. For example:

Trivy

We use Trivy for container image security scanning. What Trivy needs is a database of vulnerability signatures, and every scan uses the same db, so I made a CI job that regularly refreshes the data under JuiceFS:

refresh_trivy_db:
  stage: schedule
  variables:
    TRIVY_CACHE_DIR: /jfs/trivycache
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    - trivy --cache-dir $TRIVY_CACHE_DIR image --download-db-only

Then all projects can share the data on jfs when scanning their images, which works out rather nicely:

container_scanning:
  stage: release
  rules:
    - if: '$CI_PIPELINE_SOURCE != "schedule"'
  variables:
    GIT_STRATEGY: none
    TRIVY_CACHE_DIR: /jfs/trivycache
  script:
    - trivy --cache-dir $TRIVY_CACHE_DIR image --skip-db-update=true --exit-code 0 --no-progress --severity HIGH "${IMAGE}:latest"
    - trivy --cache-dir $TRIVY_CACHE_DIR image --skip-db-update=true --exit-code 1 --severity CRITICAL --no-progress "${IMAGE}:latest"

Semgrep

Trivy scans images; Semgrep scans code. The rule files used for scanning need regular updates. The usual posture is to download them on the fly, but given network conditions in China, we would rather download these files ahead of time and then reference them directly. So it's JuiceFS's turn again:

# ref: https://semgrep.dev/docs/semgrep-ci/sample-ci-configs/#gitlab-ci
semgrep:
  image: semgrep-agent:v1
  script:
    - semgrep-agent
  variables:
    SEMGREP_RULES: >- # more at semgrep.dev/explore
      /jfs/semgrep/security-audit.yaml
      /jfs/semgrep/secrets.yaml
      /jfs/semgrep/ci.yaml
      /jfs/semgrep/python.yaml
      /jfs/semgrep/bandit.yaml
  rules:
    - if: $CI_MERGE_REQUEST_IID

As for the job that refreshes the rule files, it is not difficult, so I won't labor the details; a rough sketch follows.
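For completeness, a minimal sketch of what such a refresh job could look like. The registry download URLs shown here are an assumption, so check how your Semgrep version fetches rulesets:

refresh_semgrep_rules:
  stage: schedule
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    - curl -fsSL https://semgrep.dev/c/p/security-audit -o /jfs/semgrep/security-audit.yaml
    - curl -fsSL https://semgrep.dev/c/p/secrets -o /jfs/semgrep/secrets.yaml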

If this was helpful, please follow our project Juicedata/JuiceFS! (0ᴗ0✿)
