For a year or so Nir Weingarten and I have been playing with the idea of founding a startup. This April I quit my job, Nir submitted his thesis, and we hopped on the ideation-validation wagon. We're both great believers in AI, and think there are certain use cases that AI can really excel in, such as visual media.
After a while in the ideation process we felt we needed to get our hands dirty, so we decided to explore the current capabilities of diffusion models by building an app, and have some fun along the way :) We saw some cool consumer apps that generate AI portraits of people, and thought we could do much better in terms of both end result and user experience.
This is the story of Selfyx: https://selfyx.com, our AI web app, how we built it, and what we've learned in the process 💪🏻
We'll go over the design, tech stack, ML architecture, costs, time spent, dev tools we used, the dilemmas we faced and the decisions we made. Overall this project has been intense, interesting, fun to work on, time and money consuming, and we've learned a lot doing it.
We built the app around three core problems and their solutions:
- Good ML models need good training data, but collecting that data makes for a bad user experience when uploading images. To solve this we built a robust preprocessing pipeline.
- Diffusion models' output is chaotic and unpredictable. To solve this we built a robust postprocessing pipeline.
- Inference is slow and costly. To solve this we utilized serverless GPU calls and parallelization.
We came up with the following user flow:
The whole shebang takes around 30 minutes; the longest parts are training (~12 min) and generating the images (~10 min). These could be cut to ~15 minutes altogether by compiling and quantizing the models, a task we decided not to take on at this point.
The biggest factor in the model's results is the quality of the training data.
Problem is, getting good training data usually puts a set of constraints on the user, such as uploading cropped photos of themselves. Finding 15 images is annoying enough on its own, without having to crop them too! This gets even worse for mobile users.
We decided to let users upload any images of themselves, without cropping, and do the dirty work ourselves.
We needed an extensible solution that would let us tag images with models during our processing pipeline, such as having GPT-4o detect certain features in an image, or running embedding models on faces.
example pixel-brain pipeline
We did not find a fitting solution, so we developed an OS micro-framework, pixel-brain (available on GitHub), for processing the data with models and storing the results in a MongoDB database.
It's far from perfect, and we took a lot of shortcuts to save time, but it's pretty useful. Did someone say technical debt? 🤓
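To give a feel for what such a pipeline does, here's a toy illustration of the concept: chain model "taggers" over images and persist whatever they find. This is not pixel-brain's actual API, just the shape of the idea, assuming a local MongoDB instance:

```python
from typing import Callable
from pymongo import MongoClient

# A "tagger" wraps a model call and returns metadata for one image path.
Tagger = Callable[[str], dict]

def run_pipeline(image_paths: list[str], taggers: list[Tagger]) -> None:
    db = MongoClient("mongodb://localhost:27017")["pixel_brain_demo"]
    for path in image_paths:
        record = {}
        for tag in taggers:
            record.update(tag(path))  # e.g. face embeddings, GPT-4o labels
        # Store/merge all tags for this image in one MongoDB document.
        db.images.update_one({"_id": path}, {"$set": record}, upsert=True)

# Placeholder tagger standing in for a real model call:
def fake_face_tagger(path: str) -> dict:
    return {"num_faces": 1}

run_pipeline(["user/img_001.jpg"], [fake_face_tagger])
```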
To identify the user among all the people in their uploaded images, we embedded all the faces in the images using RetinaFace, then clustered the embeddings with DBSCAN and chose the largest cluster.
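To make the clustering step concrete, here's a minimal sketch of the idea, assuming we already have one embedding vector per detected face (the eps/min_samples values are illustrative, not our production settings):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def largest_identity_cluster(face_embeddings: np.ndarray) -> np.ndarray:
    """Cluster face embeddings and return indices of the biggest cluster,
    which we assume is the user (they appear in most of the photos)."""
    labels = DBSCAN(eps=0.4, min_samples=3, metric="cosine").fit_predict(face_embeddings)
    valid = labels[labels != -1]              # -1 marks DBSCAN noise points
    if valid.size == 0:
        raise ValueError("No face cluster found, ask the user for more photos")
    user_label = np.bincount(valid).argmax()  # most frequent cluster id
    return np.where(labels == user_label)[0]
```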
We chose Stable Diffusion 1.5 as the base model after some short research (we also looked at SDXL, Stable Cascade and SD 2.0). Our reasons: it has the most support in the open-source community, it is the easiest and fastest to train, and it was much easier to control than the other models. While the bigger models can probably yield better results, we chose not to spend the time and money testing them at this point.
The base SD1.5 isn't great at generating realistic images of people, but the OS community offers a lot of great fine-tuned versions of SD1.5, available on Civitai, and we chose the excellent RealisticVision V5.1 by SG_161222, which was fine-tuned on realistic data.
civitai.com: Realistic Vision V6.0 B1 - V5.1 Hyper (VAE) | Stable Diffusion Checkpoint | Civitai
To train the model on the user's images, we used the DreamBooth method. DreamBooth takes a token with little semantic meaning and fine-tunes a diffusion model so that this token comes to represent the learned subject: in our case, the user.
arxiv.org: DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
The original DreamBooth performs a full fine-tune of the model, but after some research we found that training a LoRA is not only much faster and cheaper, it also generates better results, thanks to its inherent regularization effect. Wait, what is a LoRA? LoRA stands for Low-Rank Adaptation, a method for efficiently fine-tuning a model by inserting and training small low-rank weight matrices alongside the existing model's layers, as sketched below.
arxiv.org: LoRA: Low-Rank Adaptation of Large Language Models
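To make that concrete, here's a minimal PyTorch sketch of the LoRA idea. This is illustrative only, not sd-scripts' implementation, and the rank/alpha values are arbitrary:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen pretrained linear layer with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # freeze the pretrained weights
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.normal_(self.lora_down.weight, std=1.0 / rank)
        nn.init.zeros_(self.lora_up.weight)   # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original output plus the small, trainable low-rank correction.
        return self.base(x) + self.lora_up(self.lora_down(x)) * self.scale

# Usage: swap this in for attention projection layers of the diffusion UNet
# and train only the lora_down/lora_up parameters.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(1, 768))
```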
As in any ML task, proper hyperparameter tuning is essential. There are a lot of knobs to turn, but in our case these were the essential ones:
We trained the models using sd-scripts, another amazing OS tool by Kohya Tech, on Modal's serverless GPU service, which is AMAZING. A single DreamBooth training run takes ~12 min on an A100 Modal GPU, which costs ~$1 including the machine. More on Modal later ⬇️.
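For a sense of what this looks like, here's a rough sketch of wrapping a training run as a Modal function. The app name, image contents and the training command are placeholders, not our exact setup:

```python
import modal

app = modal.App("dreambooth-trainer")  # hypothetical app name

# Container image with the training dependencies baked in (illustrative list).
image = modal.Image.debian_slim().pip_install("torch", "diffusers", "accelerate")

@app.function(gpu="A100", image=image, timeout=30 * 60)
def train_user_lora(user_id: str) -> None:
    """Run one DreamBooth-LoRA training job for a single user on an A100."""
    import subprocess
    # In practice this launches sd-scripts' LoRA training entrypoint with the
    # user's preprocessed images; the exact command and flags are omitted here.
    subprocess.run(["echo", f"training LoRA for {user_id}"], check=True)

@app.local_entrypoint()
def main(user_id: str = "demo-user"):
    train_user_lora.remote(user_id)  # runs in the cloud, returns when done
```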
ML is a statistical practice, and results are probabilistic. This is especially true for high-dimensional tasks such as diffusion models. When it comes to identity preservation there's no free lunch, and no single setting works for all users. We chose to tackle this problem by fighting fire with fire: create a big pool of images generated with different parameters, and use an additional model to pick the best ones from that pool. From a probabilistic perspective, we create a meta-distribution and dramatically increase the scope of people we can accurately model.
We have found that numerous parameters are crucial for getting good results:
sander.ai/2022/05/26/guidance.html
Another problem is figuring out which parameter values work best for the majority of users. Thus, our approach is to generate images from multiple combinations of the above parameters and then choose the best images across everything generated, roughly as sketched below.
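Concretely, building that pool amounts to a grid over the knobs we care about. The parameter names and values below are made up for the example, not our production grid:

```python
from itertools import product

# Illustrative parameter grid: each combination becomes one generation batch.
guidance_scales = [3.5, 5.0, 7.0]
lora_weights = [0.7, 0.85, 1.0]
samplers = ["DPM++ 2M Karras", "Euler a"]

param_sets = [
    {"cfg_scale": cfg, "lora_weight": w, "sampler": s}
    for cfg, w, s in product(guidance_scales, lora_weights, samplers)
]
# 3 * 3 * 2 = 18 combinations; generating a few dozen images per combination
# is how a pool of several hundred candidates comes about.
```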
For generating the images we spun up a stable-diffusion-webui server (AUTOMATIC1111's great OS tool) and used its HTTP API. Not very sophisticated, but it does the job, like eating cornflakes for dinner 😁.
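Calling it looks roughly like this, a minimal sketch against the webui's txt2img endpoint (the prompt, the "sks" token and the parameter values are placeholders):

```python
import base64
import requests

payload = {
    "prompt": "RAW photo of sks person, studio portrait",  # 'sks' stands in for the DreamBooth token
    "negative_prompt": "cartoon, blurry, deformed",
    "steps": 30,
    "cfg_scale": 5.0,
    "width": 512,
    "height": 768,
    "batch_size": 4,
}

# stable-diffusion-webui exposes txt2img under /sdapi/v1 when launched with --api.
resp = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload, timeout=600)
resp.raise_for_status()

for i, img_b64 in enumerate(resp.json()["images"]):  # images come back base64-encoded
    with open(f"portrait_{i}.png", "wb") as f:
        f.write(base64.b64decode(img_b64))
```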
To speed up generation we use multiple Modal GPUs in parallel: 4 concurrent A10Gs per user for ~10 min, which costs about $0.9 per user.
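Fanning the parameter grid out over GPUs is where Modal's map-style calls shine. A rough sketch, with the function body and names as placeholders:

```python
import modal

app = modal.App("selfyx-generation")  # hypothetical app name

@app.function(gpu="A10G", timeout=20 * 60)
def generate_batch(params: dict) -> list[bytes]:
    """Generate one batch of portraits for a single parameter combination.
    In the real pipeline this calls the stable-diffusion-webui HTTP API
    running inside the container (see the snippet above)."""
    return []  # placeholder: would return the generated PNG bytes

@app.local_entrypoint()
def main():
    param_sets = [{"cfg_scale": 5.0}, {"cfg_scale": 7.0}]  # toy grid
    pool: list[bytes] = []
    # .map() runs the calls concurrently, each on its own container/GPU.
    for images in generate_batch.map(param_sets):
        pool.extend(images)
    print(f"collected {len(pool)} candidate images")
```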
Picking out the best 100 images from the ~500 generated is quite a hard problem.
There are multiple factors determining how good an AI portrait is:
We used a model we developed in-house to choose the best images according to the categories above.
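We can't share the model itself, but the selection step boils down to something like this, where the scoring functions and weights are stand-ins for the model's outputs:

```python
import numpy as np

def pick_best(images: list, scorers: dict, weights: dict, top_k: int = 100) -> list:
    """Rank candidate images by a weighted sum of per-factor scores
    (e.g. identity similarity, aesthetics, artifact-freeness) and keep the top_k."""
    total = np.zeros(len(images))
    for name, scorer in scorers.items():
        raw = np.array([scorer(img) for img in images], dtype=float)
        # Normalize each factor to [0, 1] so the weights are comparable.
        raw = (raw - raw.min()) / (np.ptp(raw) + 1e-8)
        total += weights[name] * raw
    keep = np.argsort(total)[::-1][:top_k]
    return [images[i] for i in keep]
```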
It's amazing how much SW development has advanced in the last couple of years. We built our entire stack on lightweight, efficient cloud technology, and it took us just over 2 months (!).
architecture diagram
modal.com
Building this application taught us a lot about the capabilities of current diffusion models, and how much time and money it takes to put them into production.
In terms of diffusion models, I think one of the biggest challenges with image generation models is automatically evaluating their output, which is otherwise still too chaotic to use without cherry-picking. It's hard to define metrics for vague features such as aesthetics, and even harder to automatically judge images by them. This makes it very hard to build an application that consistently improves over time, since the application's output is hard to evaluate.
Nonetheless, I believe this problem is solvable, as we showed with Selfyx, and considering that this technology is advancing faster than Nvidia stocks, we expect to see more generated visuals in our daily life in the coming years.
In terms of dev velocity, we see that using the right tools can get you a very long way, very fast.
We have great belief that image generation models will change how we view the world, if we just learn to tame them 🦁.
We would love to hear your thoughts about Selfyx, possible features or anything else! Contact us on X: