WEBVTT

1
00:00:12.960 --> 00:01:09.650
Tobias Macey: Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy them, so check out Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too, with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And don't forget to go to dataengineeringpodcast.com/chat to join the community and keep the conversation going. Your host is Tobias Macey, and today I'm interviewing Bin Fan about Alluxio, a distributed virtual file system for unified access to disparate data sources. So Bin, could you start by introducing yourself?

2
00:01:09.840 --> 00:01:46.920
Bin Fan: Yeah, hi. My name is Bin, glad to be here. I'm one of the founding members of the Alluxio company, and also a PMC member of the Alluxio open source project. I've been working on the project for almost four years. Before I joined the team, I was working at Google on large scale distributed storage systems, very much in the same space as Alluxio. And before I joined Google, I was a PhD student at Carnegie Mellon working on distributed systems and focusing on storage systems. So I've been working in this space for a while.

3
00:01:47.340 --> 00:01:50.670
Tobias Macey: And do you remember how you first got involved in the area of data management?

4
00:01:50.880 --> 00:02:00.000
Bin Fan: From my early PhD years, actually. Maybe in my third year, I started to look at key-value systems, key-value

5
00:02:00.000 --> 00:02:59.970
storage systems, as part of my PhD thesis. We actually looked a lot into memcached at that time, to improve memory efficiency and throughput, and we had one paper out of that. Because of that work I visited the UC Berkeley AMPLab, where I got to know the founder of Alluxio, Haoyuan Li, who was a PhD student in the AMPLab at that time, because he was also working along similar lines on in-memory storage. So we were in the same space, and we got to know each other. After that, I graduated and went to Google to work, and Haoyuan continued studying at UC Berkeley. Eventually his research project was funded by VCs and he created this company. He called me, and I picked up the call, and I felt like

6
00:03:00.000 --> 00:03:09.120
this was a very interesting and very exciting opportunity. So I quit Google and joined Alluxio. So yeah, that's my full story.

7
00:03:09.240 --> 00:03:34.650
Tobias Macey: And as you mentioned, Alluxio started its life in the AMPLab at Berkeley, and I know that it was originally released under the name of Tachyon.
So can you give a bit of an explanation about what the Alluxio project is, along with some of the history of how it was created, how it has come to be an Apache-licensed open source project, and the organization that you've built up to support it?

8
00:03:34.680 --> 00:06:00.000
Bin Fan: Yeah, this is a great question. In the early days it was a research project called Tachyon; it actually had an even less well known name before Tachyon, but in the Tachyon days it was open source from the very beginning. The motivation for the project was heavily related to Apache Spark. At that time Spark was already taking off from UC Berkeley, and a lot of people were looking at Spark and seeing the great potential of this new compute engine framework that aggressively uses memory. One complaint people always had, at that time and even now, is that memory is considered a more expensive storage resource than other storage capacity. Haoyuan, the founder of the project, was looking at this and found it can be more efficient if you have a system external to Spark that lets different Spark jobs, or different Spark contexts, share data more efficiently. So that's how he started this project in the beginning. He open sourced it, and gradually it became more like a general purpose distributed file system, rather than just a sharing or caching layer for Spark, serving other computation frameworks as well, like Presto, like MapReduce, or even TensorFlow these days. So the project evolved: it started with a very specific purpose, just helping Spark manage data more efficiently, and gradually became an in-memory file system. And an in-memory file system needs to handle data persistence, because once you talk about keeping something in memory, the first thing people ask is: what if you restart machines and lose the data from memory, how do you deal with that? So from the first day, Tachyon, or Alluxio as it is called now, was designed to handle data persistence even with volatile storage media like memory. This is fundamental to the design of the system, and it becomes fundamental to how the architecture differentiates

9
00:06:00.000 --> 00:06:59.850
Alluxio from other distributed file systems like HDFS, or like the Google File System from the paper. So I would say this is fundamental to our entire design. And it turns out to be very interesting these days, when people are talking about separating storage and compute. Storage can be far, far away from the compute, because the storage can be some cloud storage, and in this case Alluxio works perfectly in this architecture, because we can sit in the middle between compute and the real persistent storage, since we are not aiming to solve the data persistence problem ourselves. Long story short, I think we started from a caching layer for Spark using memory and became a general purpose file system. It is a file system, not just a caching system: a distributed file system with awareness of

10
00:07:00.000 --> 00:08:00.000
what we call under stores, which provide the data persistence. So the architecture is different from ten years ago, when people were saying, oh, we should move compute to storage,
let's have deeply coupled storage and compute. And we built an open source community around this idea. We have more than 900 contributors from all over the world joining this project, because they want to contribute, for example, an adapter to read their own data source, their own persistent data store, and we have a lot of interfaces for people to customize, so they can have different cache policies. This open source project is getting more and more popular, and we also provide an enterprise offering. The core features, in terms of the things I have mentioned so far, are in open source, in what is called the Community Edition. As a company we also need to survive, we need to have a business model around this open source, so we provide an enterprise offering which

11
00:08:00.510 --> 00:08:20.460
adds more enhancements in security, at large scale, or for high availability: the kind of enterprise readiness features that a lot of these large corporations are looking for when you run services in their environment. Yeah, so this is our business model.

12
00:08:20.550 --> 00:08:59.970
Tobias Macey: And a couple of the things that you mentioned in there preempted some of the questions I have, particularly in terms of the persistence of the data in memory, as far as what happens when you have a memory failure and the machine or the instance gets rebooted, and how that data distribution gets managed, at least in terms of the long term persistence of the data, where you rely on the underlying storage systems to provide those guarantees. I'll probably have a few more questions later on about how you manage distribution among the Alluxio layer once the data has been retrieved. But before we get into that, I want to talk a bit more about some of the use cases that Alluxio enables.

13
00:09:01.910 --> 00:09:46.910
Because, as you mentioned, there was a big push of moving compute to the data because of the inherent gravity that these large data sets have. But we are moving more toward a cloud oriented environment, where you have these object stores that don't have any guaranteed physical location, so you have varying levels of latency as you're accessing them, or where you're trying to run in a hybrid cloud environment where you have your own local data centers and you're also trying to interface with the public cloud for bursting capacity or for taking advantage of some of the specialized features that they offer. So going back again, I'm just wondering if you can talk about some of the use cases that Alluxio enables that would be impractical or too difficult to want to deal with without it.

14
00:09:47.000 --> 00:09:59.630
Bin Fan: Yeah, great question. The first case I would recommend exploring is really when you have data that is remote from the compute. We see cases where people have

15
00:10:00.000 --> 00:10:59.970
different data centers and have cross-data-center traffic to load data, because they want to do a join and one table is in one data center while the other tables are in the local data center. In this case, having Alluxio in place helps greatly to reduce the computation time, because you can either preload the table from the remote data center into the Alluxio caching layer, or Alluxio has the intelligence built in to bring the data in on demand, so that after the cold read, the next time you read that same data it will be cached locally. So we do see a great performance gain in cases like this.
We have a published use case with Baidu, the search giant in China, where they saw performance benefits of 30x in these cases. The other case, interestingly, that I see more and more is really the cloud.

16
00:11:00.690 --> 00:14:38.730
What I have observed is that companies of the new generation are typically born on the cloud; they start on AWS or Azure from day one. And even the older generation of companies have aggressive plans nowadays to move either their data or their compute, or both, to the cloud. Once you are moving to the cloud, the natural storage choice for you on AWS is S3, right? And S3 is awesome, I like it a lot: it's cheaper compared to other storage capacity, it's scalable, the semantics are very simple, and it's very easy to manage a single global namespace. However, there are also costs that come with all that convenience. For example, once you have all your data in S3, your compute, if it's on EC2, will naturally always drag the data from S3 to the local compute engine, so this will go across the network. And S3 sometimes has different semantics compared to a traditional file system. For example, rename can be expensive, and listing a huge bucket, a bucket with, say, thousands or millions of objects inside, can be very slow. So in these cases people do want a more familiar, more traditional performance profile from the storage service. And also, the new trend of machine learning applications emphasizes a lot of iteration: they take data from the last iteration, do some computation, and output to the next iteration, and the data from the previous iteration may not be important at all. For that you need a temporary data store rather than a persistent data store. So we see cases of people running computation on clouds who want a caching layer, or another tier, on top of the cloud storage provided by the vendors, like S3 or Microsoft Azure, and Alluxio fits perfectly in this case. So I have named two cases; there is a third case which is very interesting, and I see it mostly from users in China. The internet companies in China like a topology where they have a centralized main HDFS data source, but the HDFS service is heavily loaded, because a giant number of different applications depend on this data. So what we see is that, to guarantee SLAs, they also set up what we call satellite clusters, and each cluster is maybe a zone just for one specific computing service, for example Presto. In this case the data and the compute are still in the same data center, but on different machines, so access still crosses the network, and because they have seen issues like high pressure on the HDFS service, they say, okay, let's put another storage layer here, co-located with my satellite compute cluster, mirroring the data in the main storage. And we don't need the whole data set, just the working set, right? So Alluxio works perfectly in this case too. We see a lot of use cases around this, especially for internet companies in China.
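To make the cloud storage case concrete, here is a minimal sketch of what reading through Alluxio, rather than straight from S3, can look like from an application that speaks the Hadoop FileSystem API, which Bin describes later in the conversation as Alluxio's most common "northbound" interface. The master hostname, port, and paths are illustrative, and the sketch assumes the Alluxio client jar is on the classpath and that an S3 bucket has already been mounted into the Alluxio namespace.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AlluxioReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Map the alluxio:// scheme to Alluxio's Hadoop-compatible client
    // (assumes the Alluxio client jar is on the application classpath).
    conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");

    // Illustrative master address; 19998 is Alluxio's default master RPC port.
    FileSystem fs = FileSystem.get(URI.create("alluxio://alluxio-master:19998/"), conf);

    // A cold read is fetched from the mounted under store (e.g. S3) and cached
    // by Alluxio workers; repeat reads are served from the cache.
    Path path = new Path("/mnt/s3/events/part-00000.json");
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }
  }
}
```

In Spark, Presto, or MapReduce configuration, the same alluxio:// URI can be dropped in wherever an s3a:// or hdfs:// path would otherwise appear, which is how the "replace HDFS with Alluxio" swap Bin mentions later works in practice.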
17
00:14:38.880 --> 00:15:21.750
Tobias Macey: And so can you talk a bit about how the Alluxio project is implemented to allow for these data reflections and high performance access to data, particularly in the face of varying levels of latency? Whether it's the case of interfacing with cloud object storage, or the situation that you just described, where you're using it as a way to keep a working set accessible to a compute layer that's remote from a centralized Hadoop cluster or other sort of data lake, one that would be too large to keep in the memory of the compute instances, but is manageable by an Alluxio cluster without having to pull the entirety of that set of data from the central data lake?

18
00:15:21.840 --> 00:15:22.170
Bin Fan: Yeah,

19
00:15:22.200 --> 00:17:57.180
so as I mentioned on the previous question, Alluxio was born to solve challenges like this. There are a few things we do to help reduce the latency and reduce the performance variation, the retrieval latency variation. One is that we aggressively use memory as the main storage media. Although we provide optional choices, so you can use Alluxio to manage memory plus, optionally, SSD plus, optionally, hard disk, we see a lot of users use Alluxio to manage memory, which has much higher bandwidth when you have very hot data inside. We have also done a lot to provide what we call short-circuit data reads and writes. Essentially, we aggressively leverage data locality. For example, a distributed application such as Spark, if it's using the HDFS interface to access data, will try to allocate tasks closer to the storage, and it does this through APIs that report which nodes are serving the data. If Alluxio is deployed co-located with Spark or another computation framework, we provide a compatible interface, and in this way we try hard to match the tasks to the workers serving the data, so they don't need to go across the network to fetch it. So: memory, and data locality. And the third part is related to metadata. Alluxio is not just a data caching layer; as I mentioned, this is a file system, so we have our own dedicated metadata service. We maintain the inode trees, the journal, all the basic elements required for a file system. That means if you have a slow list operation on S3, it will happen once, but after that Alluxio understands how the metadata looks on S3 and will serve those metadata queries itself instead of having S3 serve them. By doing this we can reduce the variation in metadata operations as well.

20
00:17:58.230 --> 00:18:46.800
Tobias Macey: So, to your point about the tiered storage, where your primary operations are being done in memory but you have the capability of also using the local disk on the Alluxio nodes, I'm wondering if you can talk a bit about the lifecycle possibilities that you have as far as managing when the data gets moved from one layer to the other on the Alluxio nodes. But also as far as aging out data that you are accessing, where it gets initially pulled into the working set by a request from the compute engines, and then at some point is no longer relevant.
Or if you need to page out data because more data is being requested than can be held within the Alluxio cluster, just how does that overall tiering and lifecycle management happen within the Alluxio layer?

21
00:18:47.610 --> 00:20:50.310
Bin Fan: So this is also a core feature provided by Alluxio. Essentially, at a very high level, you can think of Alluxio as a cache, a distributed cache for data. Each Alluxio worker, the component that actually provides the data capacity, implements a cache, and you can add more and different workers into the cluster. By default we use LRU as the caching policy, but we also make this pluggable, so in addition to LRU we have some other policies built in, and we do see users provide their own cache replacement policies, because they understand their workloads better. That's on the caching side. On the other side, we also provide functionality for users to actively control the data lifecycle. For example, we provide commands where you can set a TTL for a file or directory; say I set the TTL to be one day, then after a day that data in Alluxio can be freed, depending on the workload, but we no longer guarantee it's in memory once the TTL expires. There are also policies like: I want to really pin this data in the memory layer. That's another command you can use to keep the data in the top tier, closer and faster for the computation. So that's basically what we do for the lifecycle. We are also looking at more complicated policies in the future, because we do see, especially for enterprise users, more sophisticated lifecycle management requirements. But right now these are all available in the open source Community Edition.

22
00:20:51.060 --> 00:21:36.870
Tobias Macey: And as far as being able to manage the lifecycle of the data in the underlying systems, where, for instance, you might have initial data getting loaded into S3 from an inbound data pipeline, and then you do some analysis or processing on it, and then you want to store it more long term in your data lake; or maybe vice versa, where it's coming into your HDFS cluster and then you want to store it long term in S3 or some form of cold storage. Is there any capacity in Alluxio for being able to manage that lifecycle? Or would it just be based on the mount points for those underlying storage systems, and then having the computational framework be responsible for reading from one location and then writing back to a different mount point for that more persistent long term storage and archival?

23
00:21:37.890 --> 00:22:28.260
Bin Fan: This is a great, great question. And this is basically what I mentioned: we do see large enterprises that have requests like this. They really want a very intelligent data management system to migrate data from hot storage to warm storage, and from warm storage to cold storage, to improve efficiency and reduce the data cost. Right now, with the current Alluxio open source edition, what you can do is just as you mentioned: you can write your own jobs to move data from one location to another, if they are backed by different mount points representing different vendors' storage. So it's possible to do that, but we don't have automation here yet. I believe this is on the roadmap for the future.
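A minimal sketch of the manual approach Bin describes: a small job that copies a path from one Alluxio mount point to another, for example an HDFS-backed "hot" path to an S3-backed "archive" path, using only the Hadoop FileSystem API. The mount paths and master address are hypothetical, and whether the bytes are written straight through to the destination's under store depends on the configured Alluxio write type.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MoveBetweenMountsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");

    // One Alluxio namespace, two mount points backed by different under stores
    // (paths and master address are illustrative).
    FileSystem fs = FileSystem.get(URI.create("alluxio://alluxio-master:19998/"), conf);
    Path hot = new Path("/mnt/hdfs/events/2019-05-01");
    Path archive = new Path("/mnt/s3/archive/events/2019-05-01");

    // Copy within the unified namespace; Alluxio persists to the under store
    // backing the destination mount according to the configured write type.
    boolean ok = FileUtil.copy(fs, hot, fs, archive, /* deleteSource */ false, conf);
    System.out.println(ok ? "copied" : "copy failed");
  }
}
```

As Bin notes, automating this kind of hot/warm/cold migration is on the roadmap rather than built in today, so a scheduled job along these lines is the current stop-gap.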
24
00:22:29.130 --> 00:23:29.040
Tobias Macey: And another question I have, particularly as it relates to metadata: in the computational layer, if you're trying to search for either a particular set of records, or just do a discovery process of seeing what data is available in which systems, does Alluxio proactively retrieve the metadata from those storage systems so that, for instance, an initial ls of the file structure returns more quickly? Or does it do a fetch at the initial read time and then just cache it for a particular period? And then also, I'm wondering if it has any native support for some of the more high level data storage formats, such as Parquet or Avro, where you can potentially query from the computational layer into the underlying files to determine if a particular set of records has the information that you're looking for, before you necessarily retrieve it into the Alluxio layer for further processing.

25
00:23:29.520 --> 00:26:26.880
Bin Fan: This is another great question. To go back to the metadata question first: both modes are supported. One is loading the metadata on the first operation and putting it into our metadata store; the second mode is proactively loading metadata, so if you know your compute will touch these buckets in the next hour or so, you can run something first to prefetch that metadata. So both are supported. That's how the metadata enters Alluxio and gets remembered by Alluxio. Regarding how we manage the lifecycle of that metadata, we also provide different policies. For example, you can set a cache expiration time, so the metadata I remember for this inode tree, for this part of the file system, will be trusted for maybe an hour, and later on we'll just do another load once some application accesses data there. That's one policy. Another policy is: okay, there might be frequent out-of-band modifications to my under store, so don't trust the cached metadata at all, and on every single operation check for updates on the under store side. So there is a policy to be very aggressive, but you can see that this will make the metadata operations slower, because it just adds more overhead. So it really depends on the application. We are also adding new features so that, for certain stores like HDFS, there are hooks: once there is some modification to their metadata, you can register an action on their side so we can proactively update our metadata when a modification happens there. This is something we will do as a new feature in the new release. The second part is regarding whether we can optimize more if we understand the data structure in the file, like Parquet, and do some push down. Right now Alluxio is a file system, so we treat the data as just a plain byte stream; we do not go into the file structure to provide more optimization. On the other hand, I'm working with some researchers from universities to see, if I know this is a Parquet file, can I do something smarter? For example, can I do some clustering, or can I do some more intelligent replication? So that's something I think is a very interesting direction, and we will see how it works out.
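For reference, a small sketch of what that metadata behavior looks like from a client's point of view, again through the Hadoop-compatible API. The master address and paths are illustrative, and the exact refresh behavior depends on which of the metadata policies Bin describes is configured.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataListingSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");
    FileSystem fs = FileSystem.get(URI.create("alluxio://alluxio-master:19998/"), conf);

    // The first listing of a mounted S3 prefix may trigger a slow bucket
    // listing in the under store; the result lands in Alluxio's inode tree.
    Path prefix = new Path("/mnt/s3/events/");
    for (FileStatus status : fs.listStatus(prefix)) {
      System.out.println(status.getPath() + "\t" + status.getLen());
    }

    // Repeat listings are answered by the Alluxio master from its own
    // metadata until the configured expiration/refresh policy reloads it.
    FileStatus[] again = fs.listStatus(prefix);
    System.out.println("entries: " + again.length);
  }
}
```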
26
00:26:26.970 --> 00:27:00.450
Tobias Macey: Yeah, and another possible approach to that would be to rely on other metadata storage systems, particularly for Hadoop, where you have something like the Hive metastore, or the Iceberg table format that's being worked on at Netflix and with other people, where you can potentially interface with those metadata storage systems that are already doing the work of parsing the records within those more high level storage formats, and then be able to use that to determine which subset of files have the information you need to then retrieve into Alluxio.

27
00:27:00.570 --> 00:27:38.100
Bin Fan: Yeah, yeah, that's definitely another way to go. And we are very interested to see whether there are some experimental results or some benchmarks, and whether that can influence the roadmap for Alluxio in the future. I think this is something very interesting, and because it's an open source project, if university researchers or other people feel this is something worth doing for their workloads, we are very happy to provide some help or some collaboration on this end.

28
00:27:38.160 --> 00:27:56.280
Tobias Macey: And in terms of the actual code itself, I'm wondering if you can discuss some of the particular challenges that you have faced in terms of building and maintaining and growing the project, and discuss some of the evolution that it has undergone in terms of the overall architecture and design of the system?

29
00:27:56.310 --> 00:33:26.040
Bin Fan: Yes, I have a lot. I'll share just one interesting story: my first big project at Alluxio after joining the team was to make Alluxio modularized. If you know a little about Java and JVM applications: Alluxio is built in Java, maybe 90 or 95 percent of the source code is written in Java, and the way it works is you build Alluxio into jars, different binary jars, and the JVM executes those jars. In the early days, because it was a research project, the developers and maintainers were really looking for fast growth, just moving fast, so everything, all components, lived in the same single Java module. That means you compile a single jar, and the same jar is both what you put on the client side, into Hive or into MapReduce, and the jar you use to run the RPC service, the long running daemons. This is really bad design, but it gives a lot of convenience, because in Java, when everything is in a single module, you can reference all the other classes without worrying about managing dependencies. My first project on Alluxio was to break this single monolithic jar into multiple jars with clean dependencies, so we could distribute a much thinner jar to applications like Hive or MapReduce or Spark. It took me quite a few months to figure out the relationships between all the dependencies and their callers, and then figure out a way to break this into multiple different modules, different jars. One lesson I learned from that is, once a project starts to grow to a certain degree, you should start thinking about architecture much earlier, so you can manage all the dependencies much better, and this simplifies life down the road. And after that, actually, we are pretty happy with the structure.
But still, as we add new features, more modules come in, and how to make things pluggable, and how to make different components easy to upgrade, is a forever ongoing process. So that's something regarding source code management. Also, if you work in the Java world, you will understand the jar hell issue, which is the pollution of binary libraries on other applications' classpaths, and it's a particular challenge for us, because we are supposed to talk to all different kinds of data stores. Originally we just put everything together on the classpath, and then you see all the dependency hell issues. We spent a lot of time dealing with that, and you have to do a lot of dirty work, workarounds or hacky ways, to solve these jar hell issues. So later on we decided to do this in a clean way and use class loaders to isolate the libraries. In this way we see far fewer jar hell issues; it does make the code a little more complicated, but I think it is totally worth it. Code-wise, this is something we learned that is very, very important. The other thing, one last thing, is really resource management. People know Java is a nice language because you don't need to delete something you allocated from the heap; garbage collection lets you be lazy. But that covers only one kind of resource. There are a lot of other resources, for example sockets you open, locks you acquire, and these can even be distributed locks, plus file handles and all these different pieces. And sometimes, for efficiency, we also use lower level Java APIs, where memory leaks are possible. So how do you handle this well, to make sure you don't leak resources? Because this is a distributed system, once you have these kinds of issues they are very hard to debug, to diagnose. Once you know the cause it might be relatively easy to fix, but even diagnosing the issue is very hard. So what we later decided to do in the source code is to have a very strict pattern: everyone should follow the same pattern when they acquire a resource, and we leverage some good functionality Java provides to make sure that, on the normal path, the resource is returned, but also, if you hit an exception and go down a code path you don't really expect, the resource is still returned correspondingly. This also helps a lot to make the source code much more robust.

30
00:33:26.190 --> 00:34:04.290
Tobias Macey: In terms of being able to write it in such a way that you can easily add new backend capabilities for interfacing with different storage systems, and also making it easy for different computational frameworks to interface with Alluxio, what are some of the particular challenges that have come up on that front?
And what are some of the trade-offs that you've made in order to be able to provide a single API to the computational layer that works effectively across all of those different underlying storage systems, which might have very different capabilities or more advanced functionality that's not necessarily exposed?

31
00:34:04.470 --> 00:35:26.670
Bin Fan: Yeah, so one challenge, as you just mentioned: if you want to talk to different under stores, they usually require different mechanisms, different libraries, and you have to make all those libraries work together happily. As I mentioned, we use class loading to isolate all the dependencies to keep them happy. The other part is how to find a common API that is useful enough and also general enough to cover all the different storage types. We have gone through multiple iterations on that. In the beginning we had one set of APIs, which we call the under store API, and later on we realized it was not sufficient, especially as we saw more use cases with object stores, which have very different semantics, and not just semantics but also security requirements, compared with traditional file systems. So we put a lot of emphasis on that to make it work, and after multiple iterations we gradually converged on a version that works for most cases. This is what we call the southbound API, basically how Alluxio talks to the storage vendors. We also have an API we call the northbound API,

32
00:35:27.920 --> 00:37:14.280
which is what we provide to different applications. On that side, most of the applications we see so far use our Hadoop-compatible API. In that way, applications can just assume this is a Hadoop file system, just a different implementation of the Hadoop file system, so they can continue to use whatever they assume and just replace Hadoop HDFS with Alluxio. On top of that, we also have our own file system API, which is more similar to the Java file system API, and adds more features, like setting TTLs, doing the mounts, and a lot of Alluxio-specific functionality. And one interesting trend we see nowadays: in the beginning, because people wanted to run their legacy applications on the distributed file system, they wanted a POSIX API, and we have a contributor, I think from IBM, who contributed a POSIX API for us. Later on we realized that the new generation of machine learning applications actually lean heavily on the POSIX API; they typically assume the data is in a local directory. So this has become popular again with these machine learning tools. Essentially, the way we handle the northbound side is we provide different sets of APIs: the HDFS-compatible API for distributed applications like Hadoop and Spark, and for more traditional applications, or the new wave of machine learning applications, we provide the POSIX API.

33
00:37:14.400 --> 00:37:32.690
Tobias Macey: And from the operational perspective, I'd like to get an understanding of what the different scaling factors are in Alluxio, and some of the edge cases that come up as you try to add and remove nodes, particularly as it relates to data distribution within the cluster.
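To illustrate the POSIX path described above: a machine learning tool, or any ordinary program, simply reads from a local directory where an Alluxio FUSE mount has been set up by an operator. The mount path below is hypothetical; only the idea that Alluxio data appears as regular local files comes from the discussion.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class PosixAccessSketch {
  public static void main(String[] args) throws IOException {
    // Hypothetical Alluxio FUSE mount point; to the application it is just
    // a local directory, which is what many machine learning tools expect.
    Path mount = Paths.get("/mnt/alluxio-fuse/datasets/images");

    // Ordinary directory listing and file reads; Alluxio serves the bytes
    // behind the scenes from its cache or from the under store.
    try (Stream<Path> files = Files.list(mount)) {
      files.limit(5).forEach(p -> System.out.println(p.getFileName()));
    }
    byte[] labels = Files.readAllBytes(mount.resolve("labels.csv"));
    System.out.println("read " + labels.length + " bytes of labels");
  }
}
```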
34
00:37:33.210 --> 00:40:00.540
Bin Fan: Good question. Depending on the workloads and environments, we see a few different scalability factors here. The major one is really the bottleneck on the master node; you can think of our master node as the equivalent of the Hadoop file system's NameNode. Because the master node uses plain Java heap memory to store the mapping from the inode trees to their detailed information, it can be very memory hungry, and it can also be GC intensive if there are a lot of small, intensive modifications to the file system. So this becomes one scaling bottleneck once you store, say, hundreds of millions of files in the Alluxio namespace, and it becomes a limiting factor for the metadata service. In the new release coming in the next few months, the whole new 2.0 release, we are moving this part, the on-heap memory for the basic file system inode tree, to an off-heap implementation, and by doing this we can scale the file system to be much larger, to contain billions of files and directories. So that's one thing. The other part is really how many workers you can handle, because there are heartbeats between workers and the master, and different control messages sent over. If you scale to thousands of nodes or beyond, you will see the master node become a bottleneck, or at least come under a lot of pressure, because of this kind of workload. One way we handle this is that we test this aggressively for each release to make sure we can scale to thousands of worker nodes, and we have also optimized a lot on the RPC side, the threading, the thread pools, all these different things, because we really want to make it work. I think in the new release we can handle at least a few thousand workers in one deployment. So these are the bottlenecks we have seen in terms of scalability and our efforts to improve on them. What was the other part of the question?

35
00:40:00.650 --> 00:40:33.330
Tobias Macey: Just wondering how scaling the node count up or down impacts the data distribution in the cluster. So I'm wondering if you use things like consistent hashing to ensure that you're always able to retrieve a record from a given instance, or if you use more of the NameNode style approach, where you have metadata in the master about where all the files are located. And then also, any sort of redundancy, so that if one node does go away you don't then have to retrieve the data all over again from the source system, and can instead just replicate it from within the Alluxio cluster.

36
00:40:33.510 --> 00:43:03.600
Bin Fan: Two parts. One is the metadata service: we do provide a high availability mode, where you can run multiple master nodes, multiple master services, inside an Alluxio deployment, and internally we use ZooKeeper, and in the upcoming release Raft, to decide which is the primary master node serving the metadata service. On the data side, we do have a replication scheme built in, and this is actually something very interesting I want to highlight. If you think about HDFS, the default replication factor is three, meaning we want three copies of the data to avoid losing it within some time window. In our case, because we are not aiming to provide persistence, we do not have a target
replication factor by default. It really depends on the workloads, on the temperature of the data. For example, if the data is hot, the applications may just trigger more accesses to it, and it turns out this results in more replicas in the Alluxio space, cached on different workers. In the coming release we also provide replication controls, so you can say, okay, I want the minimum replication to be three copies for this data, or the maximum can be ten copies and beyond that I don't really need more: more fine-grained control of replication to guard against failures. And regarding consistent hashing, as you said, we actually use both. We use the master node to remember where the data is located across the different workers, but on the other hand, when we load data into Alluxio, to avoid data skew we can also specify a placement policy similar to consistent hashing, so we get a more uniform distribution across the Alluxio space when we drag data in from the data source. You can say, okay, I want the replication to be three or four, and then even if you have ten different workers reading the same data, they will not compete too much for the under store bandwidth, and the three or four copies you specified will be uniformly distributed. Basically, they will coordinate.

37
00:43:03.650 --> 00:43:44.130
Tobias Macey: And when I was looking through your documentation and blog posts to prepare for this show, I was impressed that you consistently run testing and verification at scale, versus just making sure that you can execute a certain set of unit tests and then relying on users or beta testers to ensure that new releases are bug free and not subject to regressions. So I'm wondering if you can talk a bit about your overall approach for being able to run these verifications at scale, and also any operational considerations that you've built into Alluxio to make it easy to build and deploy these clusters in an automated fashion.

38
00:43:44.400 --> 00:47:10.830
Bin Fan: Yeah, we put a lot of emphasis on that, in terms of testing and also the ease of deployment. Regarding testing, there are multiple levels of testing happening. First, because I'm from Google, where my team emphasized unit tests and integration tests a lot, I tried to bring the same mindset when I joined this team. We do have requirements for unit tests: anyone who contributes code will be asked, okay, can you also write a unit test, and if possible an integration test too. These are built with the source code, and once you try to submit something to our GitHub repository these tests are automatically triggered, so we have some sanity checks. Later on we realized this is not sufficient, especially after a few releases, because we noticed some issues are only identified if you run the service for a while, like for hours, or for days, or even for weeks. So what we do is build more tests, which we call nightly integration tests, that ride on AWS infrastructure to launch nightly.
So we can run different workloads, including HiBench, including TPC-DS, or more targeted micro benchmarks that just pressure the master node or just pressure the worker nodes, and see what the result is in terms of performance and correctness. So we have nightly builds too, and before a release we have a more strict, more complete test suite for the release. We will also ask partners, friends, or happy users in the community to help us test the beta in their environments. Even with that, you know, there will always be bugs, right? So we also try hard to help users get bugs fixed immediately. We just recently launched a Slack channel, so we can talk to users in a more real time fashion, and we also talk to users periodically to check whether they are seeing any issues, and if they are, we try to solve them really fast and help them address these issues early. I think all of these are common practices across different industries and companies; we just try to implement them in a way suiting our own team and our environment, and it has worked pretty well so far. And regarding the scalability testing you mentioned, that's something we are actually very proud of: we run these tests with thousands of worker instances to really pressure test our service, and we spend a lot of effort to make it affordable cost-wise, because we always use AWS, the cloud infrastructure, to build our tests. We actually have a white paper summarizing the caveats we found running this scalability test on AWS, and I think it's a really good read.

39
00:47:10.980 --> 00:47:50.010
Tobias Macey: And going back to the deployment topology: as you mentioned, the primary use cases are for being able to provide data reflections so that you can get faster access to some of these underlying storage layers, possibly at a remote location, as in the case that you described with the Chinese internet providers. So can you discuss a bit about what the typical topology is, in terms of how Alluxio is deployed in relation to the storage systems, and particularly with the compute? I'm wondering if you have the compute operations running directly on the same nodes that Alluxio is using, or if they're just co-located from a network perspective.

40
00:47:50.340 --> 00:49:13.620
Bin Fan: So our recommendation is to co-locate Alluxio with the compute: to run Alluxio on the same set of nodes running the compute, for example Presto or Spark. In this way you can trigger what we call short-circuit data reads and writes, so they don't need to go through the network, because we try to aggressively leverage data locality: if they happen to be on the same node, you don't have to go across the network. So that's the recommended topology for deploying Alluxio. We also see cases where Alluxio is not co-located on the same set of nodes, for a variety of different reasons, and in that case it really depends on the bandwidth and the latency from your compute to your data source, versus from your compute to your Alluxio deployment. If that's still a large ratio, meaning even if Alluxio is not on the same set of machines, it's within the same rack or the same data center, while the remote data source is really remote, maybe across the ocean,
then in that case I think it's still considered close enough and it works. That's basically what we see. Essentially, my recommendation is to make sure Alluxio is close enough to the compute, compared to how far the data source is from the compute.

41
00:49:13.830 --> 00:49:36.210
Tobias Macey: And for somebody who's planning a deployment of Alluxio, what should they be thinking of in terms of system capacity, as far as RAM available and disk that's on the system, whether it's spinning disk drives or SSDs, and any sorts of optimizations that they can make either at the node or network layer?

42
00:49:36.240 --> 00:49:52.049
Bin Fan: The system requirements are not really strict, I would say. It depends on whether it's the master or the worker nodes. For our metadata service running on the master node, you probably want a relatively beefy node with enough memory, on the order of 10 to 200

43
00:49:53.490 --> 00:51:03.779
gigabytes, so you have enough space to store the file system inode information and make sure you don't see a lot of GC. For the worker nodes, it really depends on your target SLA. If you're really talking about a very strict SLA, returning data with very short latency compared to reading from disk, then use Alluxio to manage and allocate memory for you. But if the SLA you're targeting is really to reduce the latency of reading data from a remote data center, then I would say having Alluxio manage SSDs or hard disks is already good enough for the workload, because in that case having data in memory versus on local SSD or disk does not really provide significant gains on top of the fact that you have already removed the network latency of dragging data from the remote data center. So yeah, that's my philosophy for setting up Alluxio deployments.

44
00:51:03.780 --> 00:51:20.280
Tobias Macey: And what are some of the cases where you would recommend against using Alluxio for a given use case? And I'm curious if there are any other projects or products that are working in a similar or possibly adjacent space that might be a better fit for those situations.

45
00:51:20.280 --> 00:53:17.910
Bin Fan: There are cases where we think Alluxio is not really the best fit, including when your database is your source of truth. Alluxio right now integrates with file systems and object stores, and a database is not really an available, ready-to-use data source for Alluxio, so having Alluxio in between is not really the target use case there. Another use case we think is not really a good fit for Alluxio is if you have already co-located your computation and storage, your network bandwidth is awesome, your hard disk bandwidth is under no pressure at all, and you have plenty of memory headroom; in that case we sometimes see only marginal benefits from running Alluxio. We do also see some benefits in the co-located case, and when I say co-located I mean the data source is already co-located with the computation: we do see people put Alluxio in there for SLAs, because different applications are competing for the disk spindles and they don't want to see large variance, and Alluxio serving data from memory will just increase the data bandwidth.
But if that is not a problem, then the co-located environment is not our target use case. As for the similar or adjacent space: we do see, for example, a product called S3Guard, which is really more like a cache for S3, but I think that is only for the metadata part, so if you want to cache data it does not provide that, whereas Alluxio does both data caching and metadata management. S3Guard is one way to translate the S3 API, the object store, into a

46
00:53:19.260 --> 00:55:09.180
file system API for different applications to consume, but only on the metadata side. There are also some projects doing something similar only on the data path; they do not handle the metadata, or they do a very straightforward translation on the metadata side. For example, Hadoop can access S3 directly, no problem, but that is just a client side translation using a different S3 client, with no data caching and no metadata caching either. And we also see projects that do only data caching but not metadata management; I think there are a few GitHub repositories providing functionality like this. Essentially, they just use the local node: once you read from S3, you put the data you downloaded onto your local node, and when the same node requests the data again you serve it from local storage. But this is not really distributed; there is no coordination between the different local caches. I would say this is very single-node, and there is no coordination to help applications access data from a cache on a different node. So, in general, my takeaway is that our project is pretty unique. I guess one of the reasons is that, in the early days, it was a research project, so we just did whatever we thought was interesting, and later on it turned out to have real use cases, to solve real, important problems for users. And it's open source, so I think a lot of users just take it instead of launching a new, similar project.

47
00:55:09.210 --> 00:55:14.640
Tobias Macey: And what do you have in store for the future of the Alluxio project and company?

48
00:55:14.700 --> 00:56:42.140
Bin Fan: In the near future we are releasing Alluxio 2.0. As I mentioned, there are major upgrades to Alluxio, including moving the inode trees to off-heap storage, and we are also changing the entire RPC system to be more scalable and more extensible, plus a lot of new features, like easier propagation of data from the Alluxio space to the underlying data stores. So that's the near future, in a few months. In the long run, Alluxio really just wants to be the unified data access layer for any application. We started from the big data world, and in the future we hope this can cover other areas, other workloads; for example, we are starting to serve more machine learning workloads, so we are crossing into more areas. Also, we used to see people use big data mostly for unstructured data, but in the last few years people have been building more systems to derive insight from structured data, for example Presto or Spark SQL, different big data SQL query engines. That's also a trend we see,
and we hope we can have more, deeper integration with these kinds of systems. As for

49
00:56:42.150 --> 00:56:43.710
the company, we are

50
00:56:44.610 --> 00:57:26.010
still early stage, I would say. In the early days we focused a lot on engineering, and now open source, the community, is something we really want to focus on in the next few years: to really understand how users are using Alluxio, and how we can help users improve the value they get from it. I'm actually part of that effort, so if you need any help, or you have any interesting ideas, feel free to talk to me, and I will be very happy to have more discussion.

51
00:57:26.160 --> 00:57:42.960
Tobias Macey: And for anybody who does want to give you that feedback, or follow the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

52
00:57:42.990 --> 00:59:13.140
Bin Fan: I think one very important change in the last few years is really the trend of moving data to the cloud. I would say this is the kind of huge change you see maybe every ten years. The cloud providers, the cloud storage vendors, mostly provide their own object store service, maybe with some file system wrapper on top of that, but we are still in the early days of learning how to build the right architecture when you really move your computation to the cloud, whether it's a public cloud or an on-premise cloud. When this happens, how do we re-architect the system well? And based on this new architecture, in which computation and storage are more and more disaggregated, what are the new possibilities, and what are the issues associated with them? Like one thing you mentioned: how to automatically migrate data from hot to warm, and from warm to cold storage. And also, for example, in the future, can we have an even smarter way to decide which cloud storage vendor I should pick, based on the price or based on the topology? For all this data management, I would just say we are entering a new era in which we are able to do a lot more things, and I think it will take a few years for us to really understand what the new things are that we can do in this new era.

53
00:59:13.290 --> 00:59:32.250
Tobias Macey: Well, thank you very much for taking the time today to describe the Alluxio project. It's definitely a very interesting platform, and one that seems to fit a very big need in the big data community, so I'm happy to see the work that you're doing on that. So I want to thank you again for your time, and I hope you enjoy the rest of your day.

54
00:59:32.300 --> 00:59:35.750
Bin Fan: Yeah, thank you so much, Tobias. I'm very happy to share my experience.