By Jamesmorris-Oracle on Oct 27, 2015
This is a contributed post by Oracle mainline Linux kernel developer, Liu Bo, who recently presented at the 2015 China Linux Storage and File System (CLSF) Workshop. CLSF is an invite-only event for Linux kernel developers in China. This event was held on October 15th and 16th in Nanjing.
Haomai Wang from XSKY introduced some basic concepts of Ceph, e.g. the MDS, Monitor, and OSD. Ceph keeps all the metadata about the cluster layout (the cluster map) on the monitor nodes, so clients can fetch data with just one network hop. Even with cephfs, which adds an MDS, only one hop is needed, because the MDS doesn't store the filesystem's data or metadata itself but only the state of the distributed locks (range locks? not sure about this). Ceph also supports Samba 2.0/3.0 now.
In Linux, it is recommended to use iSCSI to access a Ceph storage cluster, because using the rbd/libceph kernel modules would require updating the kernel on the clients. Ceph uses a pipeline model for message processing, which works well for hard disks but not for SSDs; in the future, developers plan to refactor Ceph with an asynchronous framework (such as Seastar). Robin (from Alibaba) asked whether, in a three-replica setup, the client writes the three copies concurrently. Haomai answered that it doesn't: the IO goes to the primary OSD, which then issues two replicated IOs to the other two OSDs and waits for both to come back before returning "the IO is successful" to the client.
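To make that primary-copy write path concrete, here is a minimal sketch of the flow Haomai described. This is not Ceph code; the struct, function names, and OSD ids are hypothetical stand-ins.

```c
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical illustration of primary-copy replication: the client
 * sends the write only to the primary OSD; the primary forwards it to
 * the two replica OSDs and acks the client once BOTH replicas ack. */

#define NUM_REPLICAS 2

/* Stand-in for sending the write to one replica OSD and waiting
 * for its acknowledgement. */
static bool replicate_to_osd(int osd_id, const char *data, size_t len)
{
    printf("replica write of %zu bytes sent to OSD %d\n", len, osd_id);
    return true;   /* pretend the replica acked */
}

/* Primary OSD write path (sketch). */
static bool primary_handle_write(const char *data, size_t len)
{
    int replica_osds[NUM_REPLICAS] = { 1, 2 };

    /* 1. Persist locally on the primary. */
    printf("primary OSD 0 stores %zu bytes locally\n", len);

    /* 2. Fan out to replicas and wait for every ack. */
    for (int i = 0; i < NUM_REPLICAS; i++)
        if (!replicate_to_osd(replica_osds[i], data, len))
            return false;       /* a replica failed: no ack to client */

    /* 3. Only now tell the client the IO succeeded. */
    return true;
}

int main(void)
{
    const char buf[] = "object data";
    if (primary_handle_write(buf, sizeof(buf)))
        printf("client: IO is successful\n");
    return 0;
}
```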
A future development plan for Ceph is deduplication at the pool level. Coly Li (SUSE) said that deduplication is better implemented at the business/application level rather than the block level, because at the block level the duplicated data ends up split apart across blocks.
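For context on what block-level dedup looks like, here is a tiny, hypothetical sketch of fixed-size block deduplication: each block is hashed and identical blocks share one physical copy. Real systems use strong hashes (SHA-1/SHA-256) and much larger blocks; the toy hash and sizes below are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4
#define MAX_BLOCKS 64

struct stored_block {
    unsigned long hash;
    char data[BLOCK_SIZE];
};

static struct stored_block store[MAX_BLOCKS];
static int nr_stored;

/* djb2-style toy hash, purely for illustration. */
static unsigned long block_hash(const char *b)
{
    unsigned long h = 5381;
    for (int i = 0; i < BLOCK_SIZE; i++)
        h = h * 33 + (unsigned char)b[i];
    return h;
}

/* Returns the index of the (possibly shared) physical block. */
static int dedup_write_block(const char *b)
{
    unsigned long h = block_hash(b);

    for (int i = 0; i < nr_stored; i++)
        if (store[i].hash == h && !memcmp(store[i].data, b, BLOCK_SIZE))
            return i;               /* duplicate: share the existing copy */

    store[nr_stored].hash = h;
    memcpy(store[nr_stored].data, b, BLOCK_SIZE);
    return nr_stored++;
}

int main(void)
{
    const char *file = "AAAABBBBAAAACCCC";  /* four 4-byte blocks, one dup */
    for (int off = 0; off < 16; off += BLOCK_SIZE)
        printf("logical block %d -> physical block %d\n",
               off / BLOCK_SIZE, dedup_write_block(file + off));
    printf("physical blocks used: %d\n", nr_stored);
    return 0;
}
```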
Ceph Use Cases
Jiaju Zhang from Red Hat led the topic on use cases of Ceph in enterprises. Ceph has become the most famous open source storage software in the world and is used by Red Hat, Intel, SanDisk (low-end storage arrays), Samsung, and SUSE. At the end, Jiaju posed a question: "Can Ceph replace traditional storage (NAS, SAN) in the future?"
Xen Block PV Driver
Bob Liu from Oracle shared the work he has done on the Xen block PV driver. The patches aim to improve Xen's performance by converting the Xen block PV driver to the blk-mq API and to multiple ring buffers. The patches are located at https://git.kernel.org/cgit/linux/kernel/git/lliubbo/linux-xen.git/log/?h=v4.2-rc8-mq
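Roughly, converting a block driver to blk-mq means filling in a struct blk_mq_ops and a tag set instead of a single request function, with each hardware queue able to map to one PV ring. The fragment below is a heavily simplified sketch of that shape against the ~4.2-era blk-mq API; it is not the actual xen-blkfront patch, and the function names and queue counts are illustrative.

```c
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/err.h>
#include <linux/numa.h>

/* Simplified sketch of a blk-mq conversion (NOT the real xen-blkfront
 * code): multiple hardware queues can each own one PV ring, so requests
 * are submitted in parallel. */

static int sketch_queue_rq(struct blk_mq_hw_ctx *hctx,
                           const struct blk_mq_queue_data *bd)
{
    struct request *req = bd->rq;

    blk_mq_start_request(req);
    /* ...place the request on the ring owned by this hw queue,
     * notify the backend, and complete it later from the interrupt... */
    return BLK_MQ_RQ_QUEUE_OK;
}

static struct blk_mq_ops sketch_mq_ops = {
    .queue_rq = sketch_queue_rq,
};

static struct blk_mq_tag_set sketch_tag_set = {
    .ops          = &sketch_mq_ops,
    .nr_hw_queues = 4,            /* e.g. one queue per ring */
    .queue_depth  = 64,
    .numa_node    = NUMA_NO_NODE,
};

static int sketch_init(void)
{
    struct request_queue *q;
    int err;

    err = blk_mq_alloc_tag_set(&sketch_tag_set);
    if (err)
        return err;

    q = blk_mq_init_queue(&sketch_tag_set);
    if (IS_ERR(q)) {
        blk_mq_free_tag_set(&sketch_tag_set);
        return PTR_ERR(q);
    }
    /* attach q to the gendisk as usual... */
    return 0;
}
```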
ScyllaDB and Seastar
Asias He from OSv led this topic. ScyllaDB is a distributed key/value store engine written in C++14 and fully compatible with Cassandra; it also supports CQL (the Cassandra Query Language). According to the slides, ScyllaDB is 40 times faster than Cassandra. The asynchronous programming framework behind ScyllaDB is called Seastar. The magic in ScyllaDB is that it shards requests across CPU cores and runs with no locks and no extra threads. Data paths are zero-copy, and bi-directional queues are used to transfer messages between cores. The test results are based on the kernel TCP/IP network stack, but they will use their own network stack in the future.
Yanhai Zhu (from Alibaba) questioned whether the test results were fair, because ScyllaDB is designed to run on multiple cores while Cassandra is not; it would be fairer to compare ScyllaDB against 24 Cassandra instances. Asias replied that ScyllaDB uses message queues to transfer messages between CPU cores, which avoids the cost of atomic operations and locks. Also, Cassandra is written in Java, which means performance drops whenever the JVM does garbage collection, while ScyllaDB is written entirely in C++, so its performance is much more stable. Both projects are led by the KVM creator, Avi Kivity.
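To make the shared-nothing, message-passing design concrete, here is a tiny user-space sketch of the idea (this is not Seastar code, and all names and sizes are invented): each "core" owns one shard of the key space, and other cores hand it work through a single-producer/single-consumer ring instead of taking locks on the data.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NCORES 2
#define RING_SZ 8

struct msg { int key; int value; };

struct ring {
    struct msg slots[RING_SZ];
    _Atomic unsigned head;   /* advanced by the consumer (owning core) */
    _Atomic unsigned tail;   /* advanced by the producer (remote core) */
};

static struct ring rings[NCORES];
static int shard_data[NCORES][16];        /* each core's private shard */

static int ring_push(struct ring *r, struct msg m)
{
    unsigned t = atomic_load(&r->tail);
    if (t - atomic_load(&r->head) == RING_SZ)
        return -1;                         /* ring full */
    r->slots[t % RING_SZ] = m;
    atomic_store(&r->tail, t + 1);
    return 0;
}

static void *core_loop(void *arg)
{
    int core = (int)(long)arg;
    struct ring *r = &rings[core];
    int handled = 0;

    while (handled < 4) {                  /* this demo sends 4 msgs per core */
        unsigned h = atomic_load(&r->head);
        if (h == atomic_load(&r->tail))
            continue;                      /* poll; a real loop also runs tasks */
        struct msg m = r->slots[h % RING_SZ];
        shard_data[core][m.key % 16] = m.value; /* lock-free: only we touch it */
        atomic_store(&r->head, h + 1);
        handled++;
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[NCORES];
    for (long c = 0; c < NCORES; c++)
        pthread_create(&tids[c], NULL, core_loop, (void *)c);

    /* "Clients" route each key to the core that owns its shard. */
    for (int key = 0; key < 8; key++) {
        int owner = key % NCORES;
        while (ring_push(&rings[owner], (struct msg){ key, key * 10 }))
            ;                              /* retry if the ring is full */
    }
    for (int c = 0; c < NCORES; c++)
        pthread_join(tids[c], NULL);
    printf("core 0 owns key 2 -> %d, core 1 owns key 3 -> %d\n",
           shard_data[0][2], shard_data[1][3]);
    return 0;
}
```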
Fengguang Wu from Intel told me that they're using btrfs a bit on their autotest farm but often hit latency issues. We also talked about using btrfs as docker's storage backend: everything seems perfect except that each docker instance is an individual namespace, so instances cannot share the page cache for identical content. This limits btrfs's use when users need to run a large number of instances, because memory becomes the biggest issue. Besides that, Zheng Liu from Alibaba shared that overlayfs also has latency issues in real production use, i.e. if you merely touch a large file, the whole file gets copied from the lower layer to the upper layer. So we all agreed that something should happen in this area.
Hyper
Xu Wang shared what Hyper is and why they developed it. The goal is to "run a VM like a container"; more precisely, Hyper allows you to run Docker images on any hypervisor. This is fantastic: you can reuse your virtual infrastructure while keeping the advantages of containers, for example small image size, sub-second boot time, and portability.
This highlights one thing: the problem with the traditional VM is that it aims to simulate a bare-metal machine in order to run a normal OS, which is not what we actually want here. And one more thing: Hyper is a VM, so it can provide the security that all production use cases want.
Gang He from SUSE shared with us how VxFS implements its deduplication. It provides online dedup and a series of commands to control dedup behaviour; for example, we can schedule a dedup run and control a dedup task's CPU usage, memory usage, and priority, and it can even do a 'dry run' to report how many blocks could be deduped without actually performing the dedup.
Robin Dong from Alibaba shared how they developed a distributed storage system based on a small open-source project called "sheepdog", which they modified heavily to improve data recovery performance and to make sure it could run on low-end but high-density storage servers. He talked about how they came up with the idea, the system design, and the deployment. The difficult part was not the design process but taking care of every detail in deploying the cluster, e.g. finding a proper location and power arrangement for the machines. It's a good example of taking advantage of open source and contributing back to it.
Chao Yu from Samsung led a topic about F2FS. He listed what has happened in the F2FS community over the last year, and it looks like F2FS is becoming more generic than just a flash-friendly filesystem, as implied by the fact that F2FS now supports larger volumes and larger sectors and has an in-memory extent cache and a global shrinker. Besides that, F2FS has also improved its performance, including flush performance, mixed data write performance, and multi-threaded performance. Developers have also optimized F2FS a bit for SMR drives by allowing the user to choose an over-provision ratio lower than 1% in mkfs.f2fs and by tuning GC. In the future F2FS plans to support online defragmentation, transparent compression, and data deduplication.
Bcache in the Cloud
Yanhai Zhu from Alibaba led a topic about caching in a virtual machine environment. Alibaba chose bcache as the code base for developing new cache software. Yanhai explained why he didn't choose flashcache: flashcache was his first choice, but it has some drawbacks that cannot be worked around, i.e. flashcache uses a hash data structure to distribute IO requests, which splits the cached data in a multi-tenant environment and makes it unfriendly to sequential writes, so flashcache doesn't fit Alibaba's requirements. They then turned to bcache, which uses a B-tree instead of a hash table to index cached data. For the caching strategy, they chose an aggressive writeback strategy so that the cache can sequentialize write IOs and the backend is better at absorbing peak load, as sketched below.
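A minimal sketch of that point (not the actual bcache code; the offsets below are made up): dirty blocks land on the SSD cache in arrival order, and the writeback pass flushes them to the backing disk sorted by backing-device offset, so the spinning disk sees mostly sequential writes.

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy illustration: random writes are absorbed by the cache device;
 * the writeback pass later flushes dirty blocks to the backing device
 * in ascending offset order, turning random IO into (mostly)
 * sequential IO for the hard disk. */

struct dirty_block { long backing_offset; };

static int by_offset(const void *a, const void *b)
{
    long x = ((const struct dirty_block *)a)->backing_offset;
    long y = ((const struct dirty_block *)b)->backing_offset;
    return (x > y) - (x < y);
}

int main(void)
{
    /* Dirty blocks in the order the guests issued them (random). */
    struct dirty_block dirty[] = {
        { 4096 * 900 }, { 4096 * 3 }, { 4096 * 512 }, { 4096 * 4 },
    };
    int n = sizeof(dirty) / sizeof(dirty[0]);

    /* Writeback pass: sort by backing offset, then flush in order. */
    qsort(dirty, n, sizeof(dirty[0]), by_offset);
    for (int i = 0; i < n; i++)
        printf("writeback to backing device at offset %ld\n",
               dirty[i].backing_offset);
    return 0;
}
```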
Zheng Liu from Alibaba gave an update on ext4 over the last year; the biggest change is 'Remove ext3 filesystem driver'. Others are lazytime support, filesystem-level encryption, orphan file handling (by Jan Kara), and project quota. Besides that, Seagate developers have been working on ext4's SMR support (https://github.com/Seagate/SMR_FS-EXT4), and in the future ext4 is likely to gain data block checksumming and btree support (by Mingming Cao of Oracle).
Zhongjie Wu works at Memblaze, a well-known startup company in China focused on flash storage technology. Zhongjie showed us one of their products built on top of NVDIMM. An NVDIMM is not expensive; it is essentially a DDR DIMM backed by a capacitor. Memblaze has developed a new 1U storage server with an NVDIMM (as the write cache) and many flash cards (as the backend storage). It runs their own OS and can connect to clients over Fibre Channel or Ethernet. The main purpose of the NVDIMM is to reduce latency, and they use a write-back strategy. Zhongjie also mentioned that the NVDIMM's write performance is considerably better than its read performance, so they actually use shadow memory to improve read performance.
The big problem they face with NVDIMM is that the CPU can't flush data from its L1 cache to the NVDIMM when the whole server loses power. To solve this, Memblaze uses write-combining across the CPU cores; it hurts performance a little but ultimately avoids losing data.
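For reference, the generic x86 way to push data out of the CPU caches onto persistent memory (whether via explicit cache-line flushes or, as in Memblaze's case, write-combining/non-temporal stores) looks roughly like this sketch. It is not Memblaze's code; the helper name and the buffer standing in for an NVDIMM mapping are made up.

```c
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence (SSE2) */
#include <stdint.h>
#include <string.h>

#define CACHELINE 64

/* Hypothetical helper: make sure 'len' bytes at 'addr' (assumed to be
 * a mapping of NVDIMM memory) are pushed out of the CPU caches so they
 * survive a power loss. Ordinary stores may sit in L1/L2 indefinitely;
 * either explicit clflush or non-temporal (write-combining) stores are
 * needed, followed by a store fence. */
static void persist(void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHELINE)
        _mm_clflush((void *)p);   /* evict each cache line to memory */
    _mm_sfence();                 /* order the flushes before later stores */
}

int main(void)
{
    /* Stand-in for an NVDIMM mapping; a real one would come from
     * mmap() of a pmem device or a DAX file. */
    static char nvdimm_buf[4096];

    memcpy(nvdimm_buf, "critical write-cache entry", 27);
    persist(nvdimm_buf, 27);
    return 0;
}
```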
NVDIMM Support in Linux
Bob Liu of Oracle talked about NVDIMM support in Linux. There are three options:
- Use it as a block device -- an NVDIMM driver is needed.
- Filesystem support for persistent memory -- there are already patches called DAX for ext4 and XFS (see the sketch after this list). Intel developed PMFS, but Bob said there is still room for improvement.
- Use it as main memory -- but this needs struct page, and how to allocate memory remains a problem.
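As a concrete example of the filesystem option: with DAX, a plain mmap of a file gives the application loads and stores directly against the persistent memory, with no page cache copy in between. This is a minimal sketch assuming /mnt/pmem is an ext4 or XFS filesystem mounted with the dax option; the path and file name are made up.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical file on a DAX-mounted filesystem backed by an NVDIMM. */
    int fd = open("/mnt/pmem/log.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

    /* With DAX, this mapping points straight at persistent memory:
     * no page cache sits between the application and the media. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello, persistent memory");

    /* msync still provides a portable "make it durable" point. */
    msync(p, 4096, MS_SYNC);

    munmap(p, 4096);
    close(fd);
    return 0;
}
```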
-- Liu Bo