This is a contributed post by Oracle mainline Linux kernel
developer, Liu Bo, who recently presented at the 2015 China Linux
Storage and File System (CLSF) Workshop. CLSF is an invite-only event for Linux kernel developers in China. This event was held on October 15th and 16th in Nanjing.
Haomai Wang from XSKY introduced some basic Ceph concepts, e.g. MDS, Monitor, and
OSD. Ceph caches all metadata information (the cluster map)
on the monitor nodes, so clients can fetch data with just one network hop.
If we use CephFS, which adds the MDS, it still needs only one hop,
because the MDS doesn't store the filesystem's data or metadata; it only
stores the context of distributed locks (range locks? not sure about this).
Ceph also supports Samba 2.0/3.0 now.
In Linux, it is recommended to use
iSCSI to access a Ceph storage cluster, because using the rbd/libceph
kernel modules means updating the kernel on clients. Ceph uses
a pipeline model for message processing, so it works well for hard disks
but not for SSDs. In the future, developers will use an async framework (such as
Seastar) to refactor Ceph. Robin (from Alibaba) asked whether
the client writes three copies concurrently in a
three-replica Ceph setup. Haomai answered that it doesn't: the IO goes to the
primary OSD, which then issues two replicated IOs to
the other two OSDs and waits until both complete before reporting
success to the client.
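
The write path Haomai described can be sketched in a few lines of Python. This is only a toy model under my own assumptions: the `OSD` class and the `primary_write` function are hypothetical names made up for illustration, not Ceph code or Ceph APIs, and the "disks" are plain dictionaries.

```python
from concurrent.futures import ThreadPoolExecutor


class OSD:
    """Toy in-memory object store standing in for one OSD (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.objects = {}

    def store(self, key, data):
        # Pretend to persist the object and acknowledge the write.
        self.objects[key] = data
        return True


def primary_write(primary, replicas, key, data):
    """Apply the write on the primary, fan it out to the replicas,
    and report success only after every replica has acknowledged."""
    primary.store(key, data)
    with ThreadPoolExecutor(max_workers=max(1, len(replicas))) as pool:
        acks = list(pool.map(lambda osd: osd.store(key, data), replicas))
    return all(acks)


# The client talks to the primary only; it never writes the copies itself.
osds = [OSD("osd.0"), OSD("osd.1"), OSD("osd.2")]
ok = primary_write(osds[0], osds[1:], "obj", b"payload")
```

The point of the sketch is the ordering: the client sees one IO, and the latency it observes includes the slowest of the two replica writes.
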
The future development plan for Ceph is de-duplication at the pool level.
Coly Li (SUSE) said that de-duplication is better implemented at the
business level than at the block level, because the duplicated information
has already been split up at the block level.
Bob Liu from Oracle shared the work he'd done on the Xen block PV driver. The patch aims
to improve Xen's performance by converting the Xen block PV driver to use
the block-mq API and multiple ring buffers.
The patch is located at
Asias He from OSv led this topic. ScyllaDB is a distributed key/value
store engine which is written in C++14 and is fully compatible with
Cassandra. It can also run CQL (Cassandra Query Language). The slides
show that ScyllaDB is 40 times faster than Cassandra. The
asynchronous development framework in ScyllaDB is called Seastar. The
magic in ScyllaDB is that it shards requests across every CPU core and runs
with no locks and no threads. Data is zero-copy, and bidirectional queues
transfer messages between cores. The test results are based on the kernel
TCP/IP network stack, but they will use their own network stack in the
Yanhai Zhu (from Alibaba) questioned whether the test results were
fair, because ScyllaDB is designed to run on multiple cores but
Cassandra is not, so it would be fairer to compare ScyllaDB against
24 Cassandra instances. Asias replied that ScyllaDB uses message queues
to transfer messages between CPU cores, so it avoids the cost of atomic
operations and locks. Also, Cassandra is written in Java, which means
performance drops when the JVM does garbage collection.
ScyllaDB is written entirely in C++, so its performance is much steadier.
Both projects are led by the KVM creator, Avi Kivity.
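
To make the shard-per-core idea a bit more concrete, here is a small single-threaded Python sketch. It is not Seastar or ScyllaDB code: the `Shard` class, `shard_for` and `submit` are hypothetical names of my own, and the CRC32-based routing is just an assumption chosen to keep the example deterministic. Each shard owns its slice of the data and receives work only through its own inbox queue, so no lock is ever needed.

```python
import zlib
from collections import deque


class Shard:
    """One shard, standing in for the state owned by a single CPU core."""

    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.inbox = deque()  # messages sent to this shard
        self.data = {}        # touched only by this shard, so no lock

    def drain(self):
        """Process every queued request in arrival order."""
        replies = []
        while self.inbox:
            op, key, value = self.inbox.popleft()
            if op == "put":
                self.data[key] = value
                replies.append((key, "ok"))
            else:
                replies.append((key, self.data.get(key)))
        return replies


def shard_for(key, n_shards):
    # Deterministic routing: the same key always lands on the same shard.
    return zlib.crc32(key.encode()) % n_shards


def submit(shards, op, key, value=None):
    shards[shard_for(key, len(shards))].inbox.append((op, key, value))


shards = [Shard(i) for i in range(4)]
submit(shards, "put", "user:1", "alice")
submit(shards, "put", "user:2", "bob")
submit(shards, "get", "user:1")
replies = [reply for shard in shards for reply in shard.drain()]
```

Because a key is only ever read or written by its owning shard, the cost Asias mentioned (atomic operations and locks) simply does not appear; the price is the message passing between shards.
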
Fengguang Wu from Intel told me
that they use btrfs a bit on their autotest farm but often experience
latency issues. We also talked about using btrfs as Docker's storage
backend: everything seems perfect, except that each Docker instance is an
individual namespace, so instances cannot share page cache for
the same content. This limits btrfs's use when users need to run a large
number of instances, because memory becomes the biggest issue. Besides that,
Zheng Liu from Alibaba shared that overlayfs also has latency issues in
real production use, i.e. if you just touch a large file, the file will
be COPIED from the lower layer to the upper layer. So we all agreed
that something should happen in this area.
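
The copy-up behaviour Zheng described is easy to model. The following Python toy is only an illustration under my own assumptions (the `OverlayToy` class and its methods are hypothetical and have nothing to do with the real overlayfs code): a metadata-only change still drags the whole file into the upper layer.

```python
class OverlayToy:
    """Toy two-layer filesystem: a read-only lower layer plus a writable upper layer."""

    def __init__(self, lower):
        self.lower = lower      # name -> bytes, never modified
        self.upper = {}         # name -> bytes, receives copied-up files
        self.mtime = {}         # name -> timestamp, metadata only
        self.bytes_copied = 0   # how much data copy-up has moved so far

    def read(self, name):
        # Reads are cheap: the upper layer wins, otherwise fall through to lower.
        if name in self.upper:
            return self.upper[name]
        return self.lower[name]

    def _copy_up(self, name):
        # The whole file is copied the first time it is modified in any way.
        if name not in self.upper:
            data = self.lower[name]
            self.upper[name] = data
            self.bytes_copied += len(data)

    def touch(self, name, timestamp):
        # Only metadata changes, but the full copy-up still happens first.
        self._copy_up(name)
        self.mtime[name] = timestamp


fs = OverlayToy({"big.img": b"\0" * (8 * 1024 * 1024)})
fs.touch("big.img", 1445000000)
copied_after_first_touch = fs.bytes_copied
fs.touch("big.img", 1445000001)
```

In this model the first `touch` costs as much as copying the entire 8 MiB file, while later ones are free, which is where the latency spike in production comes from.
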
This shows one thing: the problem with traditional VMs is that
all kinds of VMs aim to simulate a bare-metal machine in order to
run a normal OS, but that's not what we want. And one more thing, Hyper
is a VM which can provide the security that is wanted by all production
Gang He from SUSE shared with us how VxFS implements its deduplication. It
provides online dedup and a series of commands to control dedup
behaviour, e.g. we can schedule a dedup run and control the dedup task's CPU
usage, memory usage and priority. It can even do a 'dryrun' to find how many
blocks can be deduped without actually performing the dedup.
Robin Dong from Alibaba shared how they developed a distributed storage system based on
a small open-source project called "sheepdog", and modified it heavily
to improve data recovery performance and make sure it could run on
low-end but high-density storage servers. He talked about how they came
up with the idea, the system design and the deployment. The difficult part
is not the design process, but taking care of every detail in the
deployment of the cluster, e.g. finding a proper place and power management
for the machines. It's a good example of taking advantage of open source
and contributing back to it.
Chao Yu from Samsung led a topic about F2FS. He listed what happened in the F2FS
community over the last year. It looks like F2FS is becoming more
generic than just a flash-friendly filesystem, as implied by the
fact that F2FS now supports larger volumes and larger sectors and has
an in-memory extent cache and a global shrinker. Besides that, F2FS also
improved its performance, including flush performance, mixed data write
performance and multi-threaded performance. Developers also optimized
F2FS a bit for SMR drives by allowing users to choose an over-provision
ratio lower than 1% in mkfs.f2fs and by tuning GC. In the future F2FS is
planning to support online defrag, transparent compression and data
Yanhai Zhu from Alibaba led a topic about caching in virtual machine
environments. Alibaba chose bcache as the code base for developing new cache
software. Yanhai explained why he didn't choose flashcache. flashcache was his first choice, but it has some drawbacks which
cannot be worked around: flashcache uses a hash data structure to
distribute IO requests up front, which splits the cached data in a
multi-tenant environment and makes flashcache unfriendly to
sequential writes, so flashcache didn't fit Alibaba's
requirements. They then turned to bcache, which uses a B-tree
instead of a hash table to store data. For the strategy, they chose
an aggressive writeback strategy so that the cache sequentializes write IOs
and helps the backend absorb peak load.
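
As a rough illustration of what "sequentializing write IOs" means for a writeback cache, the Python sketch below sorts the dirty block addresses and merges neighbours into runs before flushing. This is my own minimal example, not bcache or Alibaba code; `coalesce_writeback` is a hypothetical helper name.

```python
def coalesce_writeback(dirty_lbas):
    """Sort dirty block addresses and merge adjacent ones into
    (start, length) runs, so the backend sees a few sequential writes
    instead of many random ones."""
    runs = []
    for lba in sorted(set(dirty_lbas)):
        if runs and lba == runs[-1][0] + runs[-1][1]:
            # This block directly follows the previous run: extend it.
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            runs.append((lba, 1))
    return runs


# Seven random writes absorbed by the cache become three sequential flushes.
runs = coalesce_writeback([7, 3, 4, 100, 5, 8, 101])
print(runs)  # [(3, 3), (7, 2), (100, 2)]
```

An ordered index such as a B-tree makes this kind of sorted walk over dirty data natural, whereas a hash table scatters neighbouring blocks.
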
Zheng Liu from Alibaba gave an update on ext4 over the last year. The biggest change is
'Remove ext3 filesystem driver'. Others are lazytime support,
filesystem-level encryption, orphan file handling (by Jan Kara) and
project quota. Besides that, Seagate developers worked on ext4's SMR
support (https://github.com/Seagate/SMR_FS-EXT4). In the future ext4 is
likely to have data block checksumming and btree (by Mingming Cao of Oracle).
Zhongjie Wu works at Memblaze, a well-known startup in China working on
flash storage technology. Zhongjie showed us one of their products built on top of
NVDIMM. An NVDIMM is not expensive; it is just a DDR DIMM with a
capacitor. Memblaze has developed a new 1U storage
server with an NVDIMM (as a write cache) and many flash cards (as the
backend storage). It runs their own in-house OS and can use
Fibre Channel/Ethernet to connect to clients. The main purpose of the NVDIMM
is to reduce latency, and they use a write-back strategy. Zhongjie also
mentioned that NVDIMM's write performance is considerably better than its read
performance, so they in fact use shadow memory to increase read
The big problem they face with NVDIMM is that the CPU can’t flush data in its L1
cache to the NVDIMM when the whole server loses power. To solve this problem,
Memblaze uses write-combining across the CPU cores; it hurts
performance a little but ultimately avoids data loss.
Bob Liu of Oracle talked about NVDIMM support in Linux. There are three options:
-- Liu Bo