

Towards Generative AI for Model Architecture
Return of Model-Centric AI
Neural Architecture Search
Text-based MAD versus Graph-based MAD
MAD Steps
Final Note: Implication of Self-Improvement
Conclusion

The “Attention is All You Need” transformer revolution has had a profound effect on the design of deep learning model architectures. Not long after BERT came RoBERTa, ALBERT, DistilBERT, SpanBERT, DeBERTa, and many more. Then there’s ERNIE (which is still going strong with “Ernie 4.0”), the GPT series, BART, T5, and so on. A museum of transformer architectures has formed in the HuggingFace side panel, and the pace of new models has only quickened: Pythia, Zephyr, Llama, Mistral, MPT, and many more, each making a mark on accuracy, speed, training efficiency, and other metrics.

By model architecture, I’m referring to the computational graph underlying the execution of the model. For instance, below is a snippet from Netron showing part of the computational graph of T5. Each node is either an operation or a variable (the input or output of an operation), forming a node-graph architecture.
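To make the idea concrete, here is a minimal sketch of a computational graph as a plain data structure, with nodes that are either operations or variables and edges that express dataflow. The node names and helper are illustrative, not any particular framework’s API.

```python
# A minimal, hypothetical computational-graph representation: each node is
# either a variable or an operation, and edges express dataflow.
graph = {
    "nodes": {
        "x":      {"kind": "variable"},            # input tensor
        "W":      {"kind": "variable"},            # weight tensor
        "matmul": {"kind": "op", "op": "MatMul"},
        "relu":   {"kind": "op", "op": "Relu"},
        "y":      {"kind": "variable"},            # output tensor
    },
    "edges": [("x", "matmul"), ("W", "matmul"), ("matmul", "relu"), ("relu", "y")],
}

def predecessors(graph, node):
    """Return the nodes that feed into `node`."""
    return [src for src, dst in graph["edges"] if dst == node]

print(predecessors(graph, "matmul"))  # ['x', 'W']
```

Tools like Netron render exactly this kind of structure: a partial order of operations and the variables flowing between them.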

Image by the author

Even though there are already so many architectures, we can be fairly sure that the future holds many more modifications and new breakthroughs. But every time, it’s human researchers who have to understand the models, make hypotheses, troubleshoot, and test. While there’s boundless human ingenuity, the task of understanding architectures gets harder as the models get larger and more complex. With AI guidance, perhaps humans can discover model architectures that would take many more years or decades to discover without AI assistance.

Intelligent Model Architecture Design (MAD) is the idea that generative AI can guide scientists and AI researchers to better, simpler model architectures, faster and more easily. We already see large language models (LLMs) providing immense value and creativity for everything from summarizing and analyzing to image generation, writing assistance, code generation, and much more. The question is: can we harness the same intelligent assistance and creativity for designing model architecture as well? Researchers could be guided by intuition and prompt the AI system with their ideas, like “self-attention that scales hierarchically”, or with more specific actions like “add LoRA to my model at the last layer”. By associating text-based descriptions with model architectures, e.g., using Papers with Code, we could learn what kinds of techniques and names are associated with specific model architectures.

First, I’ll start with why model architecture is important. Then I’ll cover some of the trajectories toward intelligent MAD in neural architecture search, code assistance, and graph learning. Finally, I put together some project steps and discuss some of the implications of AI designing and self-improving via autonomous MAD.

Andrew Ng’s push for, and coining of, “data-centric” AI was very important for the AI field. With deep learning, the ROI of having clean and high-quality data is immense, and it is realized in every phase of training. For context, the era right before BERT in the text classification world was one where you wanted an abundance of data, even at the expense of quality. It was more important to have representation via examples than for the examples to be perfect. This is because many AI systems didn’t use pre-trained embeddings (or they weren’t any good, anyway) that a model could leverage for practical generalizability. In 2018, BERT was a breakthrough for downstream text tasks, but it took even more time for leaders and practitioners to reach consensus, and the idea of “data-centric” AI helped change the way we feed data into AI models.

Image by the author

Today, many see the current architectures out there as “good enough”, and consider it far more important to focus on improving data quality than on editing the model. There’s now a big community push for high-quality training datasets, like Red Pajama Data for instance. Indeed, we see that many of the great improvements between LLMs lie not in model architecture but in data quality and preparation methodology.

At the same time, every week there’s a new method involving some kind of model surgery that shows great impact on training efficiency, inference speed, or overall accuracy. When a paper claims to be “the new transformer”, as RETNET did, it has everyone talking. Because as good as the current architectures are, another breakthrough like self-attention could have a profound impact on the field and on what AI can be productionized to do. And even for small breakthroughs, training is expensive, so you want to minimize the number of times you train. Thus, if you have a particular goal in mind, MAD is also important for getting the best bang for your buck.

Transformer architectures are huge and complicated, and that makes it harder to focus on model-centric AI. But we’re at a time when generative AI methods are becoming more advanced, and intelligent MAD is in sight.


The premise and goal of Neural Architecture Search (NAS) align with intelligent MAD: to relieve researchers of the burden of designing and discovering the best architectures. Generally, this has been realized as a kind of AutoML where the hyper-parameters include architecture designs, and I’ve seen it become incorporated into many hyper-parameter configurations.

A NAS dataset (i.e., a NAS benchmark) is a machine learning dataset (X, Y), where X is a machine learning architecture expressed as a graph, and Y is an evaluation metric obtained when that architecture is trained and tested on a particular dataset. NAS benchmarks are still evolving. Initially, the learned representation in NAS benchmarks was just an ordered list, so each list represented a neural architecture as a sequence of operations, e.g., [3x3Conv, 10x10Conv, …], etc. This isn’t low-level enough to capture the partial ordering we find in modern architectures, such as “skip connections”, where layers feed forward to the next layer as well as to layers later in the model. Later, the DARTS representation used nodes to represent variables and edges to represent operations. More recently, some new NAS techniques avoid requiring a predefined search space, such as AGNN, which applies NAS to learning GNNs to improve performance on graph-based datasets.
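The two encodings above can be sketched side by side. Everything here is illustrative (the operation names, the score): the point is that a flat sequence cannot express a skip connection, while a node/edge encoding in the spirit of DARTS can.

```python
# Hypothetical NAS-benchmark entries: X is an architecture, Y a measured metric.
# Sequence-style encoding (early benchmarks): a flat, totally ordered list.
x_sequence = ["3x3Conv", "10x10Conv", "ReLU", "Dense"]

# DARTS-style encoding: nodes are variables, edges carry the operations,
# which captures partial orderings such as skip connections.
x_darts = {
    "nodes": [0, 1, 2, 3],
    "edges": [
        (0, 1, "3x3Conv"),
        (1, 2, "ReLU"),
        (0, 2, "skip"),      # skip connection: node 0 also feeds node 2
        (2, 3, "Dense"),
    ],
}

# A benchmark pairs each architecture X with its measured metric Y
# (the accuracy below is made up for illustration).
benchmark = [{"arch": x_darts, "val_accuracy": 0.913}]

best = max(benchmark, key=lambda entry: entry["val_accuracy"])
print(best["val_accuracy"])  # 0.913
```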

At the end of the day, there are only about 250 primitive-level tensor operations in a deep learning tensor library like TensorFlow or PyTorch. If the search space were built from first principles and included all possible models, then it would include the next SOTA architecture variation. But in practice, this is not how NAS is set up. Techniques can take the equivalent of 10 years of GPU compute, and that’s with the search space relaxed and limited in various ways. Thus, NAS has mostly focused on recombining existing components of model architectures. For instance, NAS-BERT used a masked modeling objective to train over smaller variations of BERT architectures that perform well on the GLUE downstream tasks, thus functioning to distill or compress itself into fewer parameters. The AutoFormer did something similar with a different search space.
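This “recombining existing components” framing is easy to see in code. A toy, relaxed search space over a few hand-picked design choices (all names and values illustrative) is just a cross product, which is why it stays tractable where a from-first-principles space over all ~250 primitives would not:

```python
import itertools

# A toy, relaxed search space: rather than composing all primitive ops,
# NAS in practice recombines a handful of known components.
depths = [6, 12]
hidden_sizes = [256, 512]
attention_variants = ["full", "local"]

search_space = list(itertools.product(depths, hidden_sizes, attention_variants))
print(len(search_space))  # 2 * 2 * 2 = 8 candidate architectures
```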

Efficient NAS (ENAS) overcomes the problem of needing to exhaustively train and evaluate every model in the search space. It does this by first training a supernetwork containing many candidate models as sub-graphs that share the same weights. In general, parameter-sharing between candidate models makes NAS more practical and allows the search to focus on the architectural variations that best use the existing weights.
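A minimal sketch of the weight-sharing idea, with plain Python lists standing in for tensors: one supernetwork owns the parameters, and each candidate architecture is just a selection of sub-graph components that reuse them.

```python
# ENAS-style weight sharing, sketched: one "supernetwork" owns the parameters,
# and each candidate architecture is a path that reuses a subset of them.
supernet_weights = {
    ("layer1", "conv3x3"): [0.1, 0.2],
    ("layer1", "conv5x5"): [0.3, 0.4],
    ("layer2", "dense"):   [0.5, 0.6],
}

def candidate_params(choices):
    """Collect the shared weights a candidate sub-graph would reuse."""
    return {key: supernet_weights[key] for key in choices}

# Two candidates differ at layer1 but reuse the very same layer2 weights,
# so training either one updates the parameters the other will inherit.
a = candidate_params([("layer1", "conv3x3"), ("layer2", "dense")])
b = candidate_params([("layer1", "conv5x5"), ("layer2", "dense")])
print(a[("layer2", "dense")] is b[("layer2", "dense")])  # True
```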

From the generative AI perspective, there’s an opportunity to pre-train on model architectures and use this foundation model for generating architectures as a language. This could be used for NAS, as well as for a general guidance system for researchers, e.g., via prompts and context-based suggestions.

The first question from this perspective is whether to represent architectures as text or directly as graphs. We have seen the recent rise of code-generation AI assistants, and some of that code is the PyTorch, TensorFlow, Flax, etc., associated with deep learning model architectures. Nonetheless, code generation has a number of limitations for this use case, mostly because much of code generation is about the surface form, i.e., the text representation.

On the other hand, Graph Neural Networks (GNNs) like graph transformers are very effective because graph structure is everywhere, including in MAD. The benefit of working on graphs directly is that the model learns on an underlying representation of the data, closer to the ground truth than the surface-level text representation. Even with some recent work to make LLMs great at generating graphs, like InstructGLM, there’s promise for graph transformers in the limit, and particularly in conjunction with LLMs.


Whether you use GNNs or LLMs, graphs are better representations than text for model architecture, because what matters is the underlying computation. The APIs for TensorFlow and PyTorch are constantly changing, and the lines of code deal with more than just model architecture, such as general software engineering principles and resource infrastructure.

There are several ways to represent model architectures as graphs, and here I review just a few categories. First, there are machine learning compilers like GLOW, MLIR, and Apache TVM. These can compile code such as PyTorch code into intermediate representations that can take the form of a graph. TensorFlow already has an intermediate graph representation you can visualize with TensorBoard. There’s also the ONNX format, which can be compiled from an existing saved model, e.g., using HuggingFace, as easily as:

optimum-cli export onnx --model google/flan-t5-small flan-t5-small-onnx/

Serialized, this ONNX graph looks something like this (small snippet):

Image by the author

One problem with these compiled intermediate representations is that they’re hard to understand at a high level. An ONNX graph of T5 in Netron is immense. A more human-readable option for model architecture graphs is Graphbook. The free-to-use Graphbook platform can show you the values produced by each operation in the graph, can show you where tensor shapes and types don’t match, and supports editing. In addition, the AI-generated model architectures may not be perfect, so having an easy way to go in, edit the graph, and troubleshoot why it doesn’t work is very useful.

Example view of a Graphbook graph, image by the author

While Graphbook models are just JSON, they’re hierarchical and therefore allow higher levels of abstraction for model layers. See below a side view of the hierarchical structure of GPT-2’s architecture.
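To illustrate what hierarchy buys you (this is an invented toy structure, not the actual Graphbook schema), a JSON graph can nest sub-graphs inside composite operations, so one node like “TransformerBlock” abstracts over a whole sub-graph:

```python
# Illustrative only; not the real Graphbook schema. The point is that nesting
# lets one node ("TransformerBlock") abstract over a whole sub-graph.
model = {
    "name": "TinyGPT",
    "operations": [
        {
            "name": "TransformerBlock",
            "operations": [   # a composite node contains its own graph
                {"name": "SelfAttention"},
                {"name": "FeedForward"},
            ],
        },
        {"name": "LayerNorm"},
    ],
}

def count_ops(node):
    """Count leaf operations across all nesting levels."""
    children = node.get("operations", [])
    if not children:
        return 1
    return sum(count_ops(child) for child in children)

print(count_ops(model))  # 3 leaf operations
```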

Image showing the hierarchical structure of Graphbook graphs, with GPT as the example here. Full image: https://photos.app.goo.gl/rgJch8y94VN8jjZ5A, image by the author

This is an outline of a proposal for generative MAD. I wanted to include these sub-steps to be more concrete about how one would approach the task.

  1. Code-to-Graph. Create a MAD dataset from code, such as HuggingFace model cards, converting each model to a graph format such as ONNX or Graphbook.
  2. Create datasets. These would be datasets for tasks like classifying the operation type of a node in the graph, classifying the graph itself, masking/recovering operations and links, detecting when a graph is incomplete, converting an incomplete graph into a complete one, etc. These can be self-supervised.
  3. Graph-Tokenizer. Tokenize the graph. For example, let each variable in the graph be a unique vocabulary ID, and generate the adjacency matrix that would feed into GNN layers.
  4. GNN design. Develop a GNN that feeds the graph tokenizer output through transformer layers.
  5. Train and Test. Test the GNN on the datasets, and iterate.
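The graph-tokenizer step can be sketched concretely. This is a minimal, hypothetical version: each node name gets a vocabulary ID, and the edges become an adjacency matrix of the kind a GNN layer could consume.

```python
# A minimal sketch of step 3 (Graph-Tokenizer): map each node in an
# architecture graph to a vocabulary ID and build the adjacency matrix
# that GNN layers could consume. Node names are illustrative.
nodes = ["input", "MatMul", "Relu", "output"]
edges = [(0, 1), (1, 2), (2, 3)]  # (source index, destination index)

# Assign each unique node name a stable vocabulary ID.
vocab = {name: idx for idx, name in enumerate(sorted(set(nodes)))}
token_ids = [vocab[name] for name in nodes]

# Dense adjacency matrix: adjacency[i][j] == 1 iff node i feeds node j.
n = len(nodes)
adjacency = [[0] * n for _ in range(n)]
for src, dst in edges:
    adjacency[src][dst] = 1

print(token_ids)  # [2, 0, 1, 3]
print(adjacency)
```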

Once these steps are fleshed out, they could be used as part of a NAS approach to help guide the design of the GNN (step 4).

Image by the author

I want to offer some notes about the implications of autonomous MAD. The implication of AI being able to design model architectures is that it could improve the structure of its own brain. With some kind of chain/graph-of-thought process, the model could iteratively generate successor states for its own architecture and test them out.

  1. Initially, the AI has a given model architecture, is trained on specific data for generating model architectures, and can be prompted to generate architectures. It has access to a training source that includes its own architecture design, and the training sources include a variety of tests around architecture tasks like graph classification, operation/node classification, link completion, etc. These follow the general tasks you find in the Open Graph Benchmark.
  2. Initially, at the application level there is some kind of agent that can train and test model architectures and add the results to the AI’s training source, and perhaps it can prompt the AI with instructions about what works and what doesn’t.
  3. Iteratively, the AI generates a set of new model architectures; the agent (let’s call it the MAD-agent) trains and tests them, gives them a score, adds the results to the training source, directs the model to retrain itself, and so on.
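The loop above can be sketched with stand-in functions. Everything here is hypothetical: `propose` stands in for the generative model, `train_and_test` for the MAD-agent’s evaluation, and the scores are random placeholders; the real system would also retrain the generator each round.

```python
import random

random.seed(0)

def propose(training_source, k=3):
    """Stand-in for the generative model proposing k new architectures."""
    return [f"arch-{len(training_source) + i}" for i in range(k)]

def train_and_test(arch):
    """Stand-in for the MAD-agent training/evaluating one architecture."""
    return random.random()  # a fake evaluation score

training_source = []
for iteration in range(2):
    candidates = propose(training_source)
    scored = [(arch, train_and_test(arch)) for arch in candidates]
    training_source.extend(scored)  # results flow back into the training source
    # (here the real system would direct the model to retrain itself)

print(len(training_source))  # 2 iterations * 3 candidates = 6
```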

In essence, instead of only using AutoML/NAS to search the space of model architectures, learn model architectures as graphs and then use graph transformers to learn and generate them. Let the graph-based dataset itself be model architectures represented as graphs. The model architectures representing graph datasets and the space of possible model architectures for learning on graph datasets become the same.

What’s the implication? Any time a system has the capability of improving itself, there is some potential risk of a runaway effect. If one designed the above AND it were designed in the context of a more complex agent, one where the agent could indefinitely pick data sources and tasks and coordinate itself into becoming a multi-trillion-parameter end-to-end deep learning system, then perhaps there is some risk. But the unsaid difficult part is the design of that more complex agent, resource allocation, as well as the many difficulties in supporting the broader set of capabilities.

AI techniques for autonomous model architecture design (MAD) could help AI researchers discover new breakthrough techniques in the future. Historically, MAD has been approached through neural architecture search (NAS). In conjunction with generative AI and transformers, there could be new opportunities to help researchers and make discoveries.

