A challenge with proprietary large language models, particularly for regulated industries, is that they lack transparency in how they are developed. For instance, in all the hullabaloo around AI assistants, it’s easy to forget the legal claims (https://thenewstack.io/copilot-lawyers-checking-claims-against-other-ai-companies/) over the coding assistant Copilot.
Why Create an Open Source Model

A Cornell-published paper about the project (https://arxiv.org/abs/2305.06161) explained why creating the open source model was necessary.
Big Code recently released its LLM, StarCoderBase (https://huggingface.co/bigcode/starcoderbase), which was trained on 1 trillion tokens (“words”) in 80 languages from The Stack (https://huggingface.co/datasets/bigcode/the-stack), a dataset of source code in over 300 languages.
Big Code’s model is not the only open source LLM available, but it is the most recent and most performant one, von Werra claimed.