background
during ADS road tests, TB~PB amounts of data are generated, e.g. sensor raw data (rosbag, mf4), images, point clouds, etc. the previous few blogs focused on data storage. for a more robust and user-friendly data/file server, we also need to consider the database and UI.
from the end-user/engineer's viewpoint, a few basic functions are required:
- query certain features and view the data/files filtered
- download a single/a batch of interested files (for further usage or analysis)
- upload a large amount of files quickly to storage (mostly by admin users)
an FTP server to support s3
traditionally, FTP is commonly used to download many large files. the pros and cons of HTTP vs FTP for transferring files: HTTP is more responsive for request-response of small files, but FTP may be better for large files if tuned properly; nowadays, though, most prefer HTTP. searching a little more, there are a lot of discussions about connecting amazon s3 to an FTP server:
- transfer files from s3 storage to ftp server
- FTP server using s3 as storage
- using S3 as storage for attachments in a web-mail system
- FTP/SFTP access to amazon s3 bucket
and there are popular FTP clients which support s3, e.g. WinSCP, Cyberduck; of course, aws has its own sftp client, as well as an aws s3 browser windows client. for more client tools check here
however, ftp can’t do metadata queries. for some cases, e.g. resimulation of all stored scenarios, where each scenario is handled the same way, we can grab the scenarios one by one and send each to the resimulator; but for many other cases, we need a certain pattern of data rather than reading the whole storage, and then a sql filter is much more efficient and helpful. so a simple FTP server is not enough in these cases.
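as a sketch of what such a filter buys us: assuming a hypothetical `scenario_files` metadata table (the table and column names here are illustrative, not from the original), one query replaces a full scan of the bucket:

```javascript
// hypothetical metadata table and columns, for illustration only:
// pick highway scenarios above a size threshold instead of
// walking the whole s3 storage over ftp
const sql = `
  SELECT s3_path, size_bytes
  FROM scenario_files
  WHERE scenario_type = 'highway'
    AND size_bytes > 100 * 1024 * 1024
  ORDER BY recorded_at DESC
`;
console.log(sql.trim());
```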
s3 objects/files to db
starting from a common browser-server (bs) framework, e.g. react + nodejs, where nodejs can talk to the db as well:
- nodejs queries bucket/object header info from the s3 server, and updates this metadata into the db.
there is a great discussion about storing images in db - yea or nay: when managing many TB of images/mdf files, storing file paths in the db is the best solution:
- db storage is more expensive than file system storage
- you can super-accelerate file system access: e.g. the os sendfile() system call asynchronously sends a file directly from the fs to the network interface, which sql can’t do
- the web server needs no special coding to access images in the fs
- db wins out where transactional integrity between an image/file and its metadata is important, since it’s more complex to manage integrity between db metadata and fs data; and it’s difficult to guarantee data has been flushed to disk in the fs
so for this file server, the metadata includes the file-path-in-s3, and other user-interested items.
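a minimal sketch of that mapping: one entry from an s3 listing (shaped like the objects `listObjectsV2` returns) becomes one db row. every field except the s3 path is an illustrative assumption, not a fixed schema:

```javascript
// map one s3 listing entry to a db metadata row;
// all fields besides s3Path are illustrative choices
function toMetadataRow(bucket, obj) {
  const key = obj.Key;
  return {
    s3Path: `s3://${bucket}/${key}`, // file-path-in-s3: the core column
    fileName: key.split('/').pop(),
    fileType: key.includes('.') ? key.split('.').pop().toLowerCase() : '',
    sizeBytes: obj.Size,
    lastModified: obj.LastModified,
  };
}

// example listing entry
const row = toMetadataRow('ads-data', {
  Key: 'road-test/2021-01-01/run1.mf4',
  Size: 1024,
  LastModified: '2021-01-01T12:00:00Z',
});
console.log(row.s3Path); // s3://ads-data/road-test/2021-01-01/run1.mf4
```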
- during a browser user's query/list request, nodejs talks to the db, which is a normal bs case.
- when the browser user wants to download a certain file, nodejs parses the file metadata and talks to s3.
nodejs to s3
nodejs fs.readFile()
taking an example from the official nodejs fs doc:
if not reading the file directly, the fs.ReadStream class is another good choice to read an s3 streaming object. fs.readFile() and fs.readFileSync() both read the full content of the file into memory before returning the data, which means big files will have a major impact on memory consumption and the execution speed of the program. another choice is fs-readfile-promise.
express res.download
the res object represents the HTTP response that an Express app sends when it gets an HTTP request; see expressjs res.download
aws sdk for nodejs
taking an example from the aws sdk for js:
check the aws sdk for nodejs api docs for more details.
in summary
whether to use an FTP server or a nodejs server depends on the upper usage cases:
for a single large-size (>100MB) file (e.g. mf4, rosbag) download, nodejs with a db is ok, as the db helps to filter out the file first, and a download of a few minutes is acceptable.
for downloads of many little-size (~1MB) files (e.g. image, json), nodejs is strong without doubt.
for downloading/uploading many large-size files, a friendly UI is not necessary compared to performance, so FTP may be the solution.