
Identifying what kind of file we are dealing with is crucial for safety and security. However, as file formats grow in number and complexity, accurately detecting a file's content becomes a challenge. Existing tools are often limited in precision and recall, leaving room for improvement in file type detection.
Magika is a new AI-powered tool built to meet the need for more accurate and efficient file type detection. It uses deep learning to tackle the common problem of misidentified file types. Unlike existing tools, which can struggle with accuracy, Magika relies on a custom, highly optimized Keras model that weighs only about 1MB. This allows for fast and precise file identification, even when running on a single CPU.
Magika’s performance is noteworthy, especially compared to existing approaches. In an evaluation involving over 1 million files and spanning more than 100 content types, including both binary and textual formats, Magika achieves 99% or more in both precision and recall. In other words, it identifies files accurately while keeping both false positives and false negatives low.
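To make the two metrics concrete, here is a minimal sketch of how per-content-type precision and recall can be computed from true and predicted labels. The function name and the tiny label lists are made up for illustration; they are not Magika's evaluation code or data.

```python
from collections import Counter


def per_type_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this type
        else:
            fp[pred] += 1    # predicted type was wrongly assigned
            fn[truth] += 1   # true type was missed

    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics


# Illustrative labels only: one PDF is misclassified as a ZIP archive.
y_true = ["pdf", "pdf", "python", "python", "zip"]
y_pred = ["pdf", "zip", "python", "python", "zip"]
metrics = per_type_precision_recall(y_true, y_pred)
for label, (precision, recall) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy example the single mistake lowers recall for "pdf" (a PDF was missed) and precision for "zip" (a non-ZIP was labeled as ZIP), which is why both metrics are reported together.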
The tool is available in several forms: a Python command-line tool, a Python API, and an experimental TFJS version. Trained on a dataset of over 25 million files across diverse content types, Magika exhibits near-constant inference time, taking only about five milliseconds per file after the model is loaded. Its ability to process batches of files at once further improves its efficiency.
One distinctive feature of Magika is its per-content-type threshold system. This mechanism determines how much to trust the model’s prediction for each file type, allowing for more nuanced and accurate results. Magika also supports three prediction modes (high-confidence, medium-confidence, and best-guess) to suit different levels of error tolerance.
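The idea of per-content-type thresholds combined with prediction modes can be sketched as follows. This is an illustration only: the threshold values, the fallback label, and the way each mode maps to a threshold are assumptions made for this example, not values or logic taken from Magika itself.

```python
from enum import Enum


class PredictionMode(Enum):
    HIGH_CONFIDENCE = "high-confidence"
    MEDIUM_CONFIDENCE = "medium-confidence"
    BEST_GUESS = "best-guess"


# Hypothetical numbers chosen for illustration, not Magika's real thresholds.
PER_TYPE_THRESHOLDS = {"python": 0.90, "pdf": 0.60}
DEFAULT_THRESHOLD = 0.75   # used for types without their own threshold
MEDIUM_THRESHOLD = 0.50    # assumed single cutoff for medium-confidence mode
FALLBACK_LABEL = "unknown"  # assumed generic label when trust is too low


def choose_label(predicted, score, mode):
    """Return the predicted label if the score is trusted enough for the mode."""
    if mode is PredictionMode.BEST_GUESS:
        # Always return the model's top prediction, however low the score.
        return predicted
    if mode is PredictionMode.MEDIUM_CONFIDENCE:
        threshold = MEDIUM_THRESHOLD
    else:
        # High-confidence mode: each content type has its own bar to clear.
        threshold = PER_TYPE_THRESHOLDS.get(predicted, DEFAULT_THRESHOLD)
    return predicted if score >= threshold else FALLBACK_LABEL


for mode in PredictionMode:
    print(mode.value, "->", choose_label("python", 0.80, mode))
```

With a score of 0.80, the sketch rejects "python" in high-confidence mode (its assumed threshold is 0.90) but accepts it in the two more permissive modes, which is the trade-off between error tolerance and coverage that the modes are meant to expose.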
In conclusion, Magika is a robust and efficient answer to the challenge of file type detection. Its strong metrics and the several ways it can be used make it a useful tool for improving safety and security, especially in large-scale applications like Gmail, Drive, and Safe Browsing. With an open invitation for community collaboration, Magika is a positive step towards more accurate and reliable file type detection.
Installation
Magika is available as magika on PyPI:
$ pip install magika
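Once installed, the Python API can be tried with a few lines. The sketch below assumes the package exposes a `Magika` class with an `identify_bytes` method, as the project's documentation describes; the helper name `identify_sample` is made up for this example. Because the fields of the returned result differ between package versions, the sketch simply prints the whole result object.

```python
def identify_sample(data: bytes):
    """Return Magika's result for `data`, or None if magika is not installed."""
    try:
        from magika import Magika  # assumed import path of the PyPI package
    except ImportError:
        return None
    detector = Magika()  # loads the model once; reuse it for many files
    return detector.identify_bytes(data)


if __name__ == "__main__":
    sample = b"def main():\n    print('hello')\n"
    result = identify_sample(sample)
    print(result if result is not None else "magika is not installed")
```

Creating the `Magika` object once and reusing it matters in practice, since the roughly five-millisecond figure quoted above applies only after the model has been loaded.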
Niharika is a technical consulting intern at Marktechpost. She is a third-year undergraduate, currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a highly enthusiastic individual with a keen interest in machine learning, data science, and AI, and an avid reader of the latest developments in these fields.