PandA Publications

This page contains a list of publications related to the PandA framework.

The list is organized by year; within each year, the most recent publications appear first.

2024

  • F. Ferrandi, M. Fiorito, C. Barone, G. Gozzi, and S. Curzel, “High-Level Synthesis Developments in the Context of European Space Technology Research,” in 15th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 13th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2024), Dagstuhl, Germany, 2024, pp. 1:1–1:12.
    [BibTeX]
    @InProceedings{ferrandi_et_al:OASIcs.PARMA-DITAM.2024.1,
    author =  {Ferrandi, Fabrizio and Fiorito, Michele and Barone, Claudio and Gozzi, Giovanni and Curzel, Serena},
    title =  {{High-Level Synthesis Developments in the Context of European Space Technology Research}},
    booktitle =  {15th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 13th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM 2024)},
    pages =  {1:1--1:12},
    series =  {Open Access Series in Informatics (OASIcs)},
    ISBN =  {978-3-95977-307-2},
    ISSN =  {2190-6807},
    year =  {2024},
    volume =  {116},
    editor =  {Bispo, Jo\~{a}o and Xydis, Sotirios and Curzel, Serena and Sousa, Lu{\'\i}s Miguel},
    publisher =  {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
    address =  {Dagstuhl, Germany},
    URL = {https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.PARMA-DITAM.2024.1},
    URN = {urn:nbn:de:0030-drops-196951},
    doi = {10.4230/OASIcs.PARMA-DITAM.2024.1},
    annote =  {Keywords: High-Level Synthesis, rad-hard FPGAs},
    pdf = {https://re.public.polimi.it/retrieve/6e12387e-361f-42f4-83e9-744c80536731/OASIcs.PARMA-DITAM.2024.1.pdf}
    }

  • S. Curzel, “Modern High-Level Synthesis: Improving Productivity with a Multi-level Approach,” in Special Topics in Information Technology, F. Amigoni, Ed., Cham: Springer Nature Switzerland, 2024, pp. 15–25.
    [BibTeX] [Abstract]

    High-Level Synthesis (HLS) tools simplify the design of hardware accelerators by automatically generating Verilog/VHDL code starting from a general-purpose software programming language. Because of the mismatch between the requirements of hardware descriptions and the characteristics of input languages, HLS tools still require hardware design knowledge and non-trivial design space exploration, which might be an obstacle for domain scientists seeking to accelerate applications written, for example, in Python-based programming frameworks. This research proposes a modern approach based on multi-level compiler technologies to bridge the gap between HLS and high-level frameworks, and to use domain-specific abstractions to solve domain-specific problems. The key enabling technology is the Multi-Level Intermediate Representation (MLIR), a framework that supports building reusable compiler infrastructure. The proposed approach uses MLIR to introduce new optimizations at appropriate levels of abstraction outside the HLS tool while still relying on years of HLS research in the low-level hardware generation steps; users and developers of HLS tools can thus increase their productivity, obtain accelerators with higher performance, and not be limited by the features of a specific (possibly closed-source) backend. The presented tools and techniques were designed, implemented, and tested to synthesize machine learning algorithms, but they are broadly applicable to any input specification written in a language that has a translation to MLIR. Generated accelerators can be deployed on Field Programmable Gate Arrays or Application-Specific Integrated Circuits, and they can reach high energy efficiency without any manual optimization of the code.

    @Inbook{Curzel2024,
    author={Curzel, Serena},
    editor={Amigoni, Francesco},
    title={Modern High-Level Synthesis: Improving Productivity with a Multi-level Approach},
    bookTitle={Special Topics in Information Technology},
    year={2024},
    publisher={Springer Nature Switzerland},
    address={Cham},
    pages={15--25},
    isbn={978-3-031-51500-2},
    doi={10.1007/978-3-031-51500-2_2},
    url={https://doi.org/10.1007/978-3-031-51500-2_2},
    pdf={https://re.public.polimi.it/retrieve/181f82a4-1079-48df-8886-d785c2cafdd1/Springer_Briefs_tesi.pdf}
    }

2023

  • M. Fiorito, S. Curzel, and F. Ferrandi, “TrueFloat: A Templatized Arithmetic Library for HLS Floating-Point Operators,” in Embedded Computer Systems: Architectures, Modeling, and Simulation, Cham, 2023, pp. 486–493.
    [BibTeX] [Abstract]

    Hardware designers working on FPGA accelerators are free to explore ad-hoc value representations that differ from the IEEE 754 floating-point standard, significantly reducing resource utilization and latency. In fact, while some applications are amenable to fixed-point quantization, others may require a wider dynamic range of values, better represented through a customized floating-point encoding. TrueFloat automates the process of designing accelerators with custom floating-point representations by introducing a methodology for the generation of customized floating-point units within a state-of-the-art High-Level Synthesis tool, providing high performance and fast prototyping. With TrueFloat, it is possible to translate a software description with standard floating-point calculations into an optimized hardware design featuring any number of different value encodings. Generated floating-point units are competitive with respect to state-of-the-art templatized libraries.

    @InProceedings{fiorito2023truefloat,
    author={Fiorito, Michele and Curzel, Serena and Ferrandi, Fabrizio},
    editor={Silvano, Cristina and Pilato, Christian and Reichenbach, Marc},
    title={TrueFloat: A Templatized Arithmetic Library for HLS Floating-Point Operators},
    booktitle={Embedded Computer Systems: Architectures, Modeling, and Simulation},
    year={2023},
    publisher={Springer Nature Switzerland},
    address={Cham},
    pages={486--493},
    isbn={978-3-031-46077-7},
    pdf={https://re.public.polimi.it/retrieve/e2629ec7-c0ed-446b-b866-9a4ee537834e/TrueFloat_SAMOS_LNCS.pdf}
    }

  • S. Curzel, M. Fiorito, P. L. Cueva, T. Jorge, T. Tsiodras, and F. Ferrandi, “Exploration of Synthesis Methods from Simulink Models to FPGA for Aerospace Applications,” in Proceedings of the 20th ACM International Conference on Computing Frontiers, New York, NY, USA, 2023, pp. 243–249.
    [BibTeX] [Abstract]

    Model-based development techniques in Matlab/Simulink simplify the design and implementation of software for aerospace applications, providing the required level of abstraction for scientists that work on complex navigation and control algorithms. As Field Programmable Gate Arrays (FPGAs) have become more and more relevant in space hardware platforms, developers could benefit from automated acceleration flows that do not require extensive manual rewriting of their code to port it on FPGA. We analyze existing methods that synthesize Simulink models, showing how a combination of automated C code generation and High-Level Synthesis can enable rapid prototyping, fast design space exploration, and a good trade-off between accelerator efficiency and design flexibility. We test the proposed acceleration flow on real-world guidance and navigation control systems for CubeSats.

    @inproceedings{curzel2023compspace,
    author = {Curzel, Serena and Fiorito, Michele and Cueva, Patricia Lopez and Jorge, Tiago and Tsiodras, Thanassis and Ferrandi, Fabrizio},
    title = {Exploration of Synthesis Methods from Simulink Models to FPGA for Aerospace Applications},
    year = {2023},
    isbn = {9798400701405},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3587135.3592766},
    doi = {10.1145/3587135.3592766},
    booktitle = {Proceedings of the 20th ACM International Conference on Computing Frontiers},
    pages = {243--249},
    numpages = {7},
    keywords = {aerospace, High-Level Synthesis, FPGA},
    location = {Bologna, Italy},
    series = {CF '23},
    pdf={https://re.public.polimi.it/retrieve/f94a6e8b-0228-4547-8e72-2f124c470b0a/3587135.3592766.pdf}
    }

  • N. Ibellaatti, E. Lepape, A. Kilic, K. Akyel, K. Chouayakh, F. Ferrandi, C. Barone, S. Curzel, M. Fiorito, G. Gozzi, M. Masmano, A. R. Navarro, M. Muñoz, V. N. Gallego, P. L. Cueva, J. Letrillard, and F. Wartel, “HERMES: qualification of High pErformance pRogrammable Microprocessor and dEvelopment of Software ecosystem,” in 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2023, pp. 1–5.
    [BibTeX]
    @INPROCEEDINGS{ibellaatti2023hermes,
    author={Ibellaatti, Nadia and Lepape, Edouard and Kilic, Alp and Akyel, Kaya and Chouayakh, Kassem and Ferrandi, Fabrizio and Barone, Claudio and Curzel, Serena and Fiorito, Michele and Gozzi, Giovanni and Masmano, Miguel and Navarro, Ana Risquez and Muñoz, Manuel and Gallego, Vicente Nicolau and Cueva, Patricia Lopez and Letrillard, Jean-Noel and Wartel, Franck},
    booktitle={2023 Design, Automation \& Test in Europe Conference \& Exhibition (DATE)},
    title={HERMES: qualification of High pErformance pRogrammable Microprocessor and dEvelopment of Software ecosystem},
    year={2023},
    pages={1--5},
    keywords={Virtual machine monitors;Radiation hardening (electronics);Microprocessors;Ecosystems;Europe;Aerospace electronics;Software;FPGA;Space;HLS;Virtualization},
    doi={10.23919/DATE56975.2023.10136921},
    pdf={https://re.public.polimi.it/retrieve/d1194fa8-1dd0-44a2-93ed-ceced2e32094/HERMES_qualification_of_High_pErformance_pRogrammable_Microprocessor_and_dEvelopment_of_Software_ecosystem.pdf}
    }

  • V. G. Castellana, N. B. Agostini, A. Limaye, V. Amatya, M. Minutoli, J. Manzano, A. Tumeo, S. Curzel, M. Fiorito, and F. Ferrandi, “Towards On-Chip Learning for Low Latency Reasoning with End-to-End Synthesis,” in Proceedings of the 28th Asia and South Pacific Design Automation Conference, New York, NY, USA, 2023, pp. 632–638.
    [BibTeX] [Abstract]

    The Software Defined Architectures (SODA) Synthesizer is an open-source compiler-based tool able to automatically generate domain-specialized systems targeting Application-Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs) starting from high-level programming. SODA is composed of a frontend, SODA-OPT, which leverages the multilevel intermediate representation (MLIR) framework to interface with productive programming tools (e.g., machine learning frameworks), identify kernels suitable for acceleration, and perform high-level optimizations, and of a state-of-the-art high-level synthesis backend, Bambu from the PandA framework, to generate custom accelerators. One specific application of the SODA Synthesizer is the generation of accelerators to enable ultra-low latency inference and control on autonomous systems for scientific discovery (e.g., electron microscopes, sensors in particle accelerators, etc.). This paper provides an overview of the flow in the context of the generation of accelerators for edge processing to be integrated in transmission electron microscopy (TEM) devices, focusing on use cases from precision material synthesis. We show the tool in action with an example of design space exploration for inference on reconfigurable devices with a conventional deep neural network model (LeNet). Finally, we discuss the research directions and opportunities enabled by SODA in the area of autonomous control for scientific experimental workflows.

    @inproceedings{castellana2023towards,
    author = {Castellana, Vito Giovanni and Agostini, Nicolas Bohm and Limaye, Ankur and Amatya, Vinay and Minutoli, Marco and Manzano, Joseph and Tumeo, Antonino and Curzel, Serena and Fiorito, Michele and Ferrandi, Fabrizio},
    title = {Towards On-Chip Learning for Low Latency Reasoning with End-to-End Synthesis},
    year = {2023},
    isbn = {9781450397834},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3566097.3568360},
    doi = {10.1145/3566097.3568360},
    booktitle = {Proceedings of the 28th Asia and South Pacific Design Automation Conference},
    pages = {632--638},
    numpages = {7},
    keywords = {edge computing, machine learning, design automation, high level synthesis, neural networks},
    location = {Tokyo, Japan},
    series = {ASPDAC '23},
    pdf={https://re.public.polimi.it/retrieve/314e74e8-f2c5-4d4a-838d-aa0827940d6f/ASPDAC23.pdf}
    }

2022

  • N. B. Agostini, A. Limaye, M. Minutoli, V. G. Castellana, J. Manzano, A. Tumeo, S. Curzel, and F. Ferrandi, “SODA Synthesizer: An Open-Source, Multi-Level, Modular, Extensible Compiler from High-Level Frameworks to Silicon,” in Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, New York, NY, USA, 2022.
    [BibTeX] [Abstract]

    The SODA Synthesizer is an open-source, modular, end-to-end hardware compiler framework. The SODA frontend, developed in MLIR, performs system-level design, code partitioning, and high-level optimizations to prepare the specifications for the hardware synthesis. The backend is based on a state-of-the-art high-level synthesis tool and generates the final hardware design. The backend can interface with logic synthesis tools for field programmable gate arrays or with commercial and open-source logic synthesis tools for application-specific integrated circuits. We discuss the opportunities and challenges in integrating with commercial and open-source tools both at the frontend and backend, and highlight the role that an end-to-end compiler framework like SODA can play in an open-source hardware design ecosystem.

    @inproceedings{agostini2022invited,
    author = {Agostini, Nicolas Bohm and Limaye, Ankur and Minutoli, Marco and Castellana, Vito Giovanni and Manzano, Joseph and Tumeo, Antonino and Curzel, Serena and Ferrandi, Fabrizio},
    title = {SODA Synthesizer: An Open-Source, Multi-Level, Modular, Extensible Compiler from High-Level Frameworks to Silicon},
    year = {2022},
    isbn = {9781450392174},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3508352.3561101},
    doi = {10.1145/3508352.3561101},
    booktitle = {Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design},
    articleno = {18},
    numpages = {7},
    keywords = {high-level synthesis, hardware/software co-design},
    location = {San Diego, California},
    series = {ICCAD '22},
    pdf = {https://re.public.polimi.it/retrieve/528ff192-d39e-44d6-9cb1-f0f81aba2fc2/ICCAD-22-2.pdf}
    }

  • S. Curzel, N. B. Agostini, V. G. Castellana, M. Minutoli, A. Limaye, J. Manzano, J. Zhang, D. Brooks, G. Wei, F. Ferrandi, and A. Tumeo, “End-to-End Synthesis of Dynamically Controlled Machine Learning Accelerators,” IEEE Transactions on Computers, vol. 71, iss. 12, pp. 3074–3087, 2022.
    [BibTeX]
    @ARTICLE{curzel2022end2end,
    author={Curzel, Serena and Agostini, Nicolas Bohm and Castellana, Vito Giovanni and Minutoli, Marco and Limaye, Ankur and Manzano, Joseph and Zhang, Jeff and Brooks, David and Wei, Gu-Yeon and Ferrandi, Fabrizio and Tumeo, Antonino},
    journal={IEEE Transactions on Computers},
    title={End-to-End Synthesis of Dynamically Controlled Machine Learning Accelerators},
    year={2022},
    volume={71},
    number={12},
    pages={3074--3087},
    doi={10.1109/TC.2022.3211430},
    pdf={https://re.public.polimi.it/retrieve/fac6a9a2-2112-4e07-828f-39353836a0fc/IEEE_TC_SI_ML_2022_Dataflow_accepted.pdf}
    }

  • N. B. Agostini, S. Curzel, A. Limaye, V. Amatya, M. Minutoli, V. G. Castellana, J. Manzano, A. Tumeo, and F. Ferrandi, “The SODA Approach: Leveraging High-Level Synthesis for Hardware/Software Co-Design and Hardware Specialization: Invited,” in Proceedings of the 59th ACM/IEEE Design Automation Conference, New York, NY, USA, 2022, pp. 1359–1362.
    [BibTeX] [Abstract]

    Novel “converged” applications combine phases of scientific simulation with data analysis and machine learning. Each computational phase can benefit from specialized accelerators. However, algorithms evolve so quickly that mapping them on existing accelerators is suboptimal or even impossible. This paper presents the SODA (Software Defined Accelerators) framework, a modular, multi-level, open-source, no-human-in-the-loop, hardware synthesizer that enables end-to-end generation of specialized accelerators. SODA is composed of SODA-Opt, a high-level frontend developed in MLIR that interfaces with domain-specific programming frameworks and allows performing system level design, and Bambu, a state-of-the-art high-level synthesis engine that can target different device technologies. The framework implements design space exploration as compiler optimization passes. We show how the modular, yet tight, integration of the high-level optimizer and lower-level HLS tools enables the generation of accelerators optimized for the computational patterns of converged applications. We then discuss some of the research opportunities that such a framework allows, including system-level design, profile driven optimization, and supporting new optimization metrics.

    @inproceedings{agostini2022soda,
    author = {Agostini, Nicolas Bohm and Curzel, Serena and Limaye, Ankur and Amatya, Vinay and Minutoli, Marco and Castellana, Vito Giovanni and Manzano, Joseph and Tumeo, Antonino and Ferrandi, Fabrizio},
    title = {The SODA Approach: Leveraging High-Level Synthesis for Hardware/Software Co-Design and Hardware Specialization: Invited},
    year = {2022},
    isbn = {9781450391429},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3489517.3530628},
    doi = {10.1145/3489517.3530628},
    booktitle = {Proceedings of the 59th ACM/IEEE Design Automation Conference},
    pages = {1359--1362},
    numpages = {4},
    keywords = {high-level synthesis, hardware/software co-design},
    location = {San Francisco, California},
    series = {DAC '22},
    pdf = {https://re.public.polimi.it/retrieve/a4cf099a-68ad-4008-8723-42c2bfb025d0/3489517.3530628.pdf}
    }

  • N. B. Agostini, S. Curzel, J. J. Zhang, A. Limaye, C. Tan, V. Amatya, M. Minutoli, V. G. Castellana, J. Manzano, D. Brooks, G. Wei, and A. Tumeo, “Bridging Python to Silicon: The SODA Toolchain,” IEEE Micro, vol. 42, iss. 5, pp. 78–88, 2022.
    [BibTeX]
    @ARTICLE{agostini2022bridging,
    author={Agostini, Nicolas Bohm and Curzel, Serena and Zhang, Jeff Jun and Limaye, Ankur and Tan, Cheng and Amatya, Vinay and Minutoli, Marco and Castellana, Vito Giovanni and Manzano, Joseph and Brooks, David and Wei, Gu-Yeon and Tumeo, Antonino},
    journal={IEEE Micro},
    title={Bridging Python to Silicon: The SODA Toolchain},
    year={2022},
    volume={42},
    number={5},
    pages={78--88},
    doi={10.1109/MM.2022.3178580},
    pdf={https://re.public.polimi.it/retrieve/e8a93a64-22e7-4683-b041-c6ee41f97f3a/Bridging_Python_to_Silicon_The_SODA_Toolchain.pdf}
    }

  • M. Minutoli, V. G. Castellana, N. Saporetti, S. Devecchi, M. Lattuada, P. Fezzardi, A. Tumeo, and F. Ferrandi, “Svelto: High-Level Synthesis of Multi-Threaded Accelerators for Graph Analytics,” IEEE Transactions on Computers, vol. 71, iss. 3, pp. 520–533, 2022.
    [BibTeX] [Abstract]

    Graph analytics are an emerging class of irregular applications. Operating on very large datasets, they present unique behaviors, such as fine-grained, unpredictable memory accesses, and highly unbalanced task-level parallelism, that make existing general-purpose processors or accelerators (e.g., GPUs) suboptimal or difficult to program. To address these issues, research and industry are more and more relying on designs based on reconfigurable devices (Field Programmable Gate Arrays), sometimes even partially employing High-Level Synthesis (HLS) methods to accelerate the development of the accelerators. In this paper, we propose a novel architecture template for the automatic generation of accelerators for graph analytics and irregular applications. The architecture template includes a dynamic task scheduler, a parallel array of accelerators that enables supporting task-level parallelism with context switching, and a related multi-channel memory interface that decouples communication from computation and provides support for fine-grained atomic memory operations. We discuss the integration of the architectural template in an HLS flow, presenting the necessary modifications to enable automatic generation of the accelerators starting from OpenMP annotated code. We evaluate our approach by synthesizing custom designs for a set of graph database benchmark queries. We compare the synthesized accelerators with previous state-of-the-art methodologies for the synthesis of parallel architectures.

    @ARTICLE {TC_SVELTO2022,
    author = {Minutoli, Marco and Castellana, Vito Giovanni and Saporetti, Nicola and Devecchi, Stefano and Lattuada, Marco and Fezzardi, Pietro and Tumeo, Antonino and Ferrandi, Fabrizio},
    journal = {IEEE Transactions on Computers},
    title = {Svelto: High-Level Synthesis of Multi-Threaded Accelerators for Graph Analytics},
    year={2022},
    volume={71},
    number={3},
    pages={520--533},
    issn = {1557-9956},
    month={March},
    keywords = {task analysis;parallel processing;computer architecture;dynamic scheduling;hardware;field programmable gate arrays;memory management},
    doi = {10.1109/TC.2021.3057860},
    publisher = {IEEE Computer Society},
    address = {Los Alamitos, CA, USA},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/1161042/584239/tc_svelto.pdf}
    }

2021

  • F. Ferrandi, V. G. Castellana, S. Curzel, P. Fezzardi, M. Fiorito, M. Lattuada, M. Minutoli, C. Pilato, and A. Tumeo, “Invited: Bambu: an Open-Source Research Framework for the High-Level Synthesis of Complex Applications,” in 2021 58th ACM/IEEE Design Automation Conference (DAC), 2021, pp. 1327–1330.
    [BibTeX] [Abstract]

    This paper presents the open-source high-level synthesis (HLS) research framework Bambu. Bambu provides a research environment to experiment with new ideas across HLS, high-level verification and debugging, FPGA/ASIC design, design flow space exploration, and parallel hardware accelerator design. The tool accepts as input standard C/C++ specifications and compiler intermediate representations (IRs) coming from the well-known Clang/LLVM and GCC compilers. The broad spectrum and flexibility of input formats allow the electronic design automation (EDA) research community to explore and integrate new transformations and optimizations. The easily extendable modular framework already includes many optimizations and HLS benchmarks used to evaluate the QoR of the tool against existing approaches [1]. The integration with synthesis and verification backends (commercial and open-source) allows researchers to quickly test any new finding and easily obtain performance and resource usage metrics for a given application. Different FPGA devices are supported from several different vendors: AMD/Xilinx, Intel/Altera, Lattice Semiconductor, and NanoXplore. Finally, integration with the OpenRoad open-source end-to-end silicon compiler perfectly fits with the recent push towards open-source EDA.

    @INPROCEEDINGS{ferrandi2021bambu,
    author={Ferrandi, Fabrizio and Castellana, Vito Giovanni and Curzel, Serena and Fezzardi, Pietro and Fiorito, Michele and Lattuada, Marco and Minutoli, Marco and Pilato, Christian and Tumeo, Antonino},
    booktitle={2021 58th ACM/IEEE Design Automation Conference (DAC)},
    title={Invited: Bambu: an Open-Source Research Framework for the High-Level Synthesis of Complex Applications},
    year={2021},
    pages={1327--1330},
    abstract = {This paper presents the open-source high-level synthesis (HLS) research framework Bambu. Bambu provides a research environment to experiment with new ideas across HLS, high-level verification and debugging, FPGA/ASIC design, design flow space exploration, and parallel hardware accelerator design. The tool accepts as input standard C/C++ specifications and compiler intermediate representations (IRs) coming from the well-known Clang/LLVM and GCC compilers. The broad spectrum and flexibility of input formats allow the electronic design automation (EDA) research community to explore and integrate new transformations and optimizations. The easily extendable modular framework already includes many optimizations and HLS benchmarks used to evaluate the QoR of the tool against existing approaches [1]. The integration with synthesis and verification backends (commercial and open-source) allows researchers to quickly test any new finding and easily obtain performance and resource usage metrics for a given application. Different FPGA devices are supported from several different vendors: AMD/Xilinx, Intel/Altera, Lattice Semiconductor, and NanoXplore. Finally, integration with the OpenRoad open-source end-to-end silicon compiler perfectly fits with the recent push towards open-source EDA.},
    publisher={{IEEE}},
    doi={10.1109/DAC18074.2021.9586110},
    ISSN={0738-100X},
    month={Dec},
    pdf={https://re.public.polimi.it/retrieve/668507/dac21_bambu.pdf}
    }

  • [PDF] [DOI] S. Curzel, N. B. Agostini, S. Song, I. Dagli, A. Limaye, C. Tan, M. Minutoli, V. G. Castellana, V. Amatya, J. Manzano, A. Das, F. Ferrandi, and A. Tumeo, “Automated Generation of Integrated Digital and Spiking Neuromorphic Machine Learning Accelerators,” in Proceedings of the 40th International Conference on Computer-Aided Design, New York, NY, USA, 2021.
    [BibTeX] [Abstract]

    The growing numbers of application areas for artificial intelligence (AI) methods have led to an explosion in availability of domain-specific accelerators, which struggle to support every new machine learning (ML) algorithm advancement, clearly highlighting the need for a tool to quickly and automatically transition from algorithm definition to hardware implementation and explore the design space along a variety of SWaP (size, weight and Power) metrics. The software defined architectures (SODA) synthesizer implements a modular compiler-based infrastructure for the end-to-end generation of machine learning accelerators, from high-level frameworks to hardware description language. Neuromorphic computing, mimicking how the brain operates, promises to perform artificial intelligence tasks at efficiencies orders-of-magnitude higher than the current conventional tensor-processing based accelerators, as demonstrated by a variety of specialized designs leveraging Spiking Neural Networks (SNNs). Nevertheless, the mapping of an artificial neural network (ANN) to solutions supporting SNNs is still a non-trivial and very device-specific task, and completely lacks the possibility to design hybrid systems that integrate conventional and spiking neural models. In this paper, we discuss the design of such an integrated generator, leveraging the SODA Synthesizer framework and its modular structure. In particular, we present a new MLIR dialect in the SODA frontend that allows expressing spiking neural network concepts (e.g., spiking sequences, transformation, and manipulation) and we discuss how to enable the mapping of spiking neurons to the related specialized hardware (which could be generated through middle-end and backend layers of the SODA Synthesizer). We then discuss the opportunities for further integration offered by the hardware compilation infrastructure, providing a path towards the generation of complex hybrid artificial intelligence systems.

    @inproceedings{SNN21,
    author = {Curzel, Serena and Agostini, Nicolas Bohm and Song, Shihao and Dagli, Ismet and Limaye, Ankur and Tan, Cheng and Minutoli, Marco and Castellana, Vito Giovanni and Amatya, Vinay and Manzano, Joseph and Das, Anup and Ferrandi, Fabrizio and Tumeo, Antonino},
    title = {Automated Generation of Integrated Digital and Spiking Neuromorphic Machine Learning Accelerators},
    year = {2021},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    abstract = {The growing numbers of application areas for artificial intelligence (AI) methods have led to an explosion in availability of domain-specific accelerators, which struggle to support every new machine learning (ML) algorithm advancement, clearly highlighting the need for a tool to quickly and automatically transition from algorithm definition to hardware implementation and explore the design space along a variety of SWaP (size, weight and Power) metrics. The software defined architectures (SODA) synthesizer implements a modular compiler-based infrastructure for the end-to-end generation of machine learning accelerators, from high-level frameworks to hardware description language. Neuromorphic computing, mimicking how the brain operates, promises to perform artificial intelligence tasks at efficiencies orders-of-magnitude higher than the current conventional tensor-processing based accelerators, as demonstrated by a variety of specialized designs leveraging Spiking Neural Networks (SNNs). Nevertheless, the mapping of an artificial neural network (ANN) to solutions supporting SNNs is still a non-trivial and very device-specific task, and completely lacks the possibility to design hybrid systems that integrate conventional and spiking neural models. In this paper, we discuss the design of such an integrated generator, leveraging the SODA Synthesizer framework and its modular structure. In particular, we present a new MLIR dialect in the SODA frontend that allows expressing spiking neural network concepts (e.g., spiking sequences, transformation, and manipulation) and we discuss how to enable the mapping of spiking neurons to the related specialized hardware (which could be generated through middle-end and backend layers of the SODA Synthesizer). We then discuss the opportunities for further integration offered by the hardware compilation infrastructure, providing a path towards the generation of complex hybrid artificial intelligence systems.},
    booktitle = {Proceedings of the 40th International Conference on Computer-Aided Design},
    articleno = {45},
    numpages = {7},
    keywords = {MLIR, Artificial Neural Network Accelerators, Spiking Neural Network Accelerators},
    location = {Virtual Event, USA},
    doi={10.1109/ICCAD51958.2021.9643474},
    series = {ICCAD '21},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/1194314/712446/ICCAD21_SODASNN.pdf}
    }

  • [PDF] [DOI] V. G. Castellana, A. Tumeo, and F. Ferrandi, “High-Level Synthesis of Parallel Specifications Coupling Static and Dynamic Controllers,” in 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2021, pp. 192-202.
    [BibTeX]
    @INPROCEEDINGS{Castellana-IPDPS21,
    author={Castellana, Vito Giovanni and Tumeo, Antonino and Ferrandi, Fabrizio},
    booktitle={2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
    title={High-Level Synthesis of Parallel Specifications Coupling Static and Dynamic Controllers},
    year={2021},
    pages={192--202},
    doi={10.1109/IPDPS49936.2021.00028},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/1180263/647916/High-Level_Synthesis_of_Parallel_Specifications_Coupling_Static_and_Dynamic_Controllers.pdf}
    }

  • [PDF] [DOI] C. Pilato, S. Böhm, F. Brocheton, J. Castrillón, R. Cevasco, V. Cima, R. Cmar, D. Diamantopoulos, F. Ferrandi, J. Martinovic, G. Palermo, M. Paolino, A. Parodi, L. Pittaluga, D. Raho, F. Regazzoni, K. Slaninová, and C. Hagleitner, “EVEREST: A design environment for extreme-scale big data analytics on heterogeneous platforms,” in Design, Automation & Test in Europe Conference & Exhibition, DATE 2021, Grenoble, France, February 1-5, 2021, 2021, p. 1320–1325.
    [BibTeX]
    @inproceedings{PilatoBBCCCCDFM21,
    author = {Christian Pilato and Stanislav B{\"{o}}hm and Fabien Brocheton and Jer{\'{o}}nimo Castrill{\'{o}}n and Riccardo Cevasco and Vojtech Cima and Radim Cmar and Dionysios Diamantopoulos and Fabrizio Ferrandi and Jan Martinovic and Gianluca Palermo and Michele Paolino and Antonio Parodi and Lorenzo Pittaluga and Daniel Raho and Francesco Regazzoni and Katerina Slaninov{\'{a}} and Christoph Hagleitner},
    title = {{EVEREST:} {A} design environment for extreme-scale big data analytics on heterogeneous platforms},
    booktitle = {Design, Automation {\&} Test in Europe Conference {\&} Exhibition, {DATE} 2021, Grenoble, France, February 1-5, 2021},
    pages = {1320--1325},
    publisher = {{IEEE}},
    year = {2021},
    doi = {10.23919/DATE51398.2021.9473940},
    pdf={https://arxiv.org/pdf/2103.04185}
    }

2020

  • [PDF] [URL] [DOI] M. Siracusa and F. Ferrandi, “Tensor Optimization for High-Level Synthesis Design Flows,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Best Paper Candidate of CODES+ISSS 2020, vol. 39, iss. 11, pp. 4217-4228, 2020.
    [BibTeX] [Abstract]

    Improving data locality of tensor data structures is a crucial optimization for maximizing the performance of Machine Learning and intensive Linear Algebra applications. While CPUs and GPUs improve data locality by means of automated caching mechanisms, FPGAs let the developer specify data structure allocation. Although this feature enables a high degree of customizability, the increasing complexity and memory footprint of modern applications prevent considering any manual approach to find an optimal allocation. For this reason, we propose a compiler optimization to automatically improve the tensor allocation of high-level software descriptions. The optimization is controlled by a flexible cost model that can be tuned by means of simple yet expressive callback functions. In this way, the user can tailor the optimization strategy with respect to the optimization goal. We tested our methodology integrating our optimization in the Bambu open-source HLS framework. In this setting, we achieved a 14% speedup on the digit recognition version proposed by the Rosetta benchmark. Moreover, we tested our optimization on the CHStone benchmark suite, achieving an average of 6% speedup. Finally, we applied our methodology on two industrial examples from the aerospace domain obtaining a 15% speedup. As a final step, we tested the versatility of our methodology inserting our optimization in the Clang software optimization flow achieving a 12% speedup on the Rosetta benchmark when running on CPU.

    @ARTICLE{TCAD-CODES2020,
    author={M. {Siracusa} and F. {Ferrandi}},
    journal={IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Best Paper Candidate of CODES+ISSS 2020},
    title={Tensor Optimization for High-Level Synthesis Design Flows},
    year={2020},
    volume={39},
    number={11},
    pages={4217--4228},
    issn = {0278-0070},
    month = {oct},
    url = {https://doi.org/10.1109/TCAD.2020.3012318},
    doi={10.1109/TCAD.2020.3012318},
    keywords = {Design optimization, field programmable gate arrays, high-level synthesis, memory management},
    abstract={Improving data locality of tensor data structures is a crucial optimization for maximizing the performance of Machine Learning and intensive Linear Algebra applications. While CPUs and GPUs improve data locality by means of automated caching mechanisms, FPGAs let the developer specify data structure allocation. Although this feature enables a high degree of customizability, the increasing complexity and memory footprint of modern applications prevent considering any manual approach to find an optimal allocation. For this reason, we propose a compiler optimization to automatically improve the tensor allocation of high-level software descriptions. The optimization is controlled by a flexible cost model that can be tuned by means of simple yet expressive callback functions. In this way, the user can tailor the optimization strategy with respect to the optimization goal. We tested our methodology integrating our optimization in the Bambu open-source HLS framework. In this setting, we achieved a 14% speedup on the digit recognition version proposed by the Rosetta benchmark. Moreover, we tested our optimization on the CHStone benchmark suite, achieving an average of 6% speedup. Finally, we applied our methodology on two industrial examples from the aerospace domain obtaining a 15% speedup. As a final step, we tested the versatility of our methodology inserting our optimization in the Clang software optimization flow achieving a 12% speedup on the Rosetta benchmark when running on CPU.},
    pdf={https://re.public.polimi.it/retrieve/584319/codes-isss2020.pdf}
    }

  • [PDF] [URL] [DOI] P. Fezzardi and F. Ferrandi, “Automated Bug Detection for High-Level Synthesis of Multi-Threaded Irregular Applications,” ACM Trans. Parallel Comput., vol. 7, iss. 4, 2020.
    [BibTeX] [Abstract]

    Field Programmable Gate Arrays (FPGAs) are becoming an appealing technology in datacenters and High Performance Computing. High-Level Synthesis (HLS) of multi-threaded parallel programs is increasingly used to extract parallelism. Despite great leaps forward in HLS and related debugging methodologies, there is a lack of contributions in automated bug identification for HLS of multi-threaded programs. This work defines a methodology to automatically detect and isolate bugs in parallel circuits generated with HLS. The technique relies on hardware/software Discrepancy Analysis and exploits a pattern-matching algorithm based on Finite State Automata to compare multiple hardware and software threads. Overhead, advantages, and limitations are evaluated on designs generated with an open-source HLS compiler supporting OpenMP.

    @article{10.1145/3418086,
    author = {Fezzardi, Pietro and Ferrandi, Fabrizio},
    title = {Automated Bug Detection for High-Level Synthesis of Multi-Threaded Irregular Applications},
    year = {2020},
    issue_date = {October 2020},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    volume = {7},
    number = {4},
    issn = {2329-4949},
    url = {https://doi.org/10.1145/3418086},
    doi = {10.1145/3418086},
    abstract = {Field Programmable Gate Arrays (FPGAs) are becoming an appealing technology in datacenters and High Performance Computing. High-Level Synthesis (HLS) of multi-threaded parallel programs is increasingly used to extract parallelism. Despite great leaps forward in HLS and related debugging methodologies, there is a lack of contributions in automated bug identification for HLS of multi-threaded programs. This work defines a methodology to automatically detect and isolate bugs in parallel circuits generated with HLS. The technique relies on hardware/software Discrepancy Analysis and exploits a pattern-matching algorithm based on Finite State Automata to compare multiple hardware and software threads. Overhead, advantages, and limitations are evaluated on designs generated with an open-source HLS compiler supporting OpenMP.},
    journal = {ACM Trans. Parallel Comput.},
    month = sep,
    articleno = {27},
    numpages = {26},
    keywords = {multi-threading, irregular, Debugging, FPGA, HLS},
    pdf={https://re.public.polimi.it/retrieve/546446/3418086.pdf}
    }

2019

  • [PDF] [URL] [DOI] M. Lattuada and F. Ferrandi, “A Design Flow Engine for the Support of Customized Dynamic High Level Synthesis Flows,” ACM Trans. Reconfigurable Technol. Syst., vol. 12, iss. 4, p. 19:1–19:26, 2019.
    [BibTeX] [Abstract]

    High Level Synthesis is a set of methodologies aimed at generating hardware descriptions starting from specifications written in high-level languages. While these methodologies share different elements with traditional compilation flows, there are characteristics of the addressed problem which require ad hoc management. In particular, differently from most of the traditional compilation flows, the complexity and the execution time of the High Level Synthesis techniques are much less relevant than the quality of the produced results. For this reason, fixed-point analyses, as well as successive refinement optimizations, can be accepted, provided that they can improve the quality of the generated designs. This article presents a design flow engine for the description and the execution of complex and customized synthesis flows. It supports dynamic addition of passes and dependencies, cyclic dependencies, and selective pass invalidation. Experimental results show the benefits of such type of design flows with respect to static linear design flows when applied to High Level Synthesis.

    @article{TRETS2019,
    author = {Lattuada, Marco and Ferrandi, Fabrizio},
    title = {A Design Flow Engine for the Support of Customized Dynamic High Level Synthesis Flows},
    journal = {ACM Trans. Reconfigurable Technol. Syst.},
    issue_date = {October 2019},
    volume = {12},
    number = {4},
    month = oct,
    year = {2019},
    issn = {1936-7406},
    pages = {19:1--19:26},
    articleno = {19},
    numpages = {26},
    url = {http://doi.acm.org/10.1145/3356475},
    doi = {10.1145/3356475},
    acmid = {3356475},
    publisher = {ACM},
    address = {New York, NY, USA},
    keywords = {High level synthesis, compilation steps},
    abstract={High Level Synthesis is a set of methodologies aimed at generating hardware descriptions starting from specifications written in high-level languages. While these methodologies share different elements with traditional compilation flows, there are characteristics of the addressed problem which require ad hoc management. In particular, differently from most of the traditional compilation flows, the complexity and the execution time of the High Level Synthesis techniques are much less relevant than the quality of the produced results. For this reason, fixed-point analyses, as well as successive refinement optimizations, can be accepted, provided that they can improve the quality of the generated designs. This article presents a design flow engine for the description and the execution of complex and customized synthesis flows. It supports dynamic addition of passes and dependencies, cyclic dependencies, and selective pass invalidation. Experimental results show the benefits of such type of design flows with respect to static linear design flows when applied to High Level Synthesis.},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/1114656/449320/paper.pdf}
    }

  • [URL] [DOI] V. G. Castellana, M. Minutoli, A. Tumeo, M. Lattuada, P. Fezzardi, and F. Ferrandi, “Software Defined Architectures for Data Analytics,” in Proceedings of the 24th Asia and South Pacific Design Automation Conference, New York, NY, USA, 2019, p. 711–718.
    [BibTeX]
    @inproceedings{ASPDAC2019,
    author = {Castellana, Vito Giovanni and Minutoli, Marco and Tumeo, Antonino and Lattuada, Marco and Fezzardi, Pietro and Ferrandi, Fabrizio},
    title = {Software Defined Architectures for Data Analytics},
    booktitle = {Proceedings of the 24th Asia and South Pacific Design Automation Conference},
    series = {ASPDAC '19},
    year = {2019},
    isbn = {978-1-4503-6007-4},
    location = {Tokyo, Japan},
    pages = {711--718},
    numpages = {8},
    url = {http://doi.acm.org/10.1145/3287624.3288754},
    doi = {10.1145/3287624.3288754},
    acmid = {3288754},
    publisher = {ACM},
    address = {New York, NY, USA},
    keywords = {CGRAs, FPGAs, HLS, reconfigurable computing},
    }

2018

  • [PDF] [DOI] P. Fezzardi, C. Pilato, and F. Ferrandi, “Enabling Automated Bug Detection for IP-based Designs using High-Level Synthesis,” IEEE Design and Test, vol. 35, iss. 5, pp. 54-62, 2018.
    [BibTeX] [Abstract]

    Modern System-on-Chip (SoC) architectures are increasingly composed of Intellectual Property (IP) blocks, usually designed and provided by different vendors. This burdens system designers with complex system-level integration and verification. In this paper, we propose an approach that leverages HLS techniques to automatically find bugs in designs composed of multiple IP blocks. Our method is particularly suitable for industrial adoption because it works without exposing sensitive information (e.g., the design specification or the component generation process). This advocates the definition and the adoption of an interoperable format for cross-vendor hardware bug detection.

    @ARTICLE{DT-2018,
    author={P. Fezzardi and C. Pilato and F. Ferrandi},
    journal={IEEE Design and Test},
    title={Enabling Automated Bug Detection for IP-based Designs using High-Level Synthesis},
    year={2018},
    volume={35},
    number={5},
    pages={54--62},
    publisher={{IEEE}},
    keywords={Computer bugs;Hardware;Hardware design languages;IP networks;Intellectual property;Optimization;Software;Bug Detection;High-Level Synthesis;IP Protection;Intellectual Property},
    doi={10.1109/MDAT.2018.2824121},
    ISSN={2168-2356},
    month={October},
    abstract={Modern System-on-Chip (SoC) architectures are increasingly composed of Intellectual Property (IP) blocks, usually designed and provided by different vendors. This burdens system designers with complex system-level integration and verification. In this paper, we propose an approach that leverages HLS techniques to automatically find bugs in designs composed of multiple IP blocks. Our method is particularly suitable for industrial adoption because it works without exposing sensitive information (e.g., the design specification or the component generation process). This advocates the definition and the adoption of an interoperable format for cross-vendor hardware bug detection.},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/1052325/282482/dtFPF18.pdf}
    }

2017

  • [PDF] [DOI] P. Fezzardi, M. Lattuada, and F. Ferrandi, “Using Efficient Path Profiling to Optimize Memory Consumption of On-Chip Debugging for High-Level Synthesis,” ACM Transactions on Embedded Computing Systems – Special Issue on ESWEEK2017, vol. 16, iss. 5s, p. 149:1–149:19, 2017.
    [BibTeX] [Abstract]

    High-Level Synthesis (HLS) for FPGAs is attracting popularity and is increasingly used to handle complex systems with multiple integrated components. To increase performance and efficiency, HLS flows now adopt several advanced optimization techniques. Aggressive optimizations and system level integration can cause the introduction of bugs that are only observable on-chip. Debugging support for circuits generated with HLS is receiving a considerable attention. Among the data that can be collected on chip for debugging, one of the most important is the state of the Finite State Machines (FSM) controlling the components of the circuit. However, this usually requires a large amount of memory to trace the behavior during the execution. This work proposes an approach that takes advantage of the HLS information and of the structure of the FSM to compress control flow traces and to integrate optimized components for on-chip debugging. The generated checkers analyze the FSM execution on-fly, automatically notifying when a bug is detected, localizing it and providing data about its cause. The traces are compressed using a software profiling technique, called Efficient Path Profiling (EPP), adapted for the debugging of hardware accelerators generated with HLS. With this technique, the size of the memory used to store control flow traces can be reduced up to 2 orders of magnitude, compared to state-of-the-art.

    @ARTICLE{TECS-2017,
    author={P. Fezzardi and M. Lattuada and F. Ferrandi},
    journal = {ACM Transactions on Embedded Computing Systems -- Special Issue on ESWEEK2017},
    title={{Using Efficient Path Profiling to Optimize Memory Consumption of On-Chip Debugging for High-Level Synthesis}},
    year={2017},
    pages={149:1--149:19},
    keywords={High-Level Synthesis; On-Chip Debugging; Automated Bug Detection; Memory Optimization; Efficient Path Profiling},
    doi={10.1145/3126564},
    volume={16},
    number={5s},
    month={October},
    publisher = {{ACM}},
    ISSN={1539-9087},
    abstract={High-Level Synthesis (HLS) for FPGAs is attracting popularity and is increasingly used to handle complex systems with multiple integrated components. To increase performance and efficiency, HLS flows now adopt several advanced optimization techniques. Aggressive optimizations and system level integration can cause the introduction of bugs that are only observable on-chip. Debugging support for circuits generated with HLS is receiving a considerable attention. Among the data that can be collected on chip for debugging, one of the most important is the state of the Finite State Machines (FSM) controlling the components of the circuit. However, this usually requires a large amount of memory to trace the behavior during the execution. This work proposes an approach that takes advantage of the HLS information and of the structure of the FSM to compress control flow traces and to integrate optimized components for on-chip debugging. The generated checkers analyze the FSM execution on-fly, automatically notifying when a bug is detected, localizing it and providing data about its cause. The traces are compressed using a software profiling technique, called Efficient Path Profiling (EPP), adapted for the debugging of hardware accelerators generated with HLS. With this technique, the size of the memory used to store control flow traces can be reduced up to 2 orders of magnitude, compared to state-of-the-art.},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/1030731/222692/EPPDiscrepancyAnalysis.pdf}
    }

  • [PDF] [DOI] M. Lattuada, F. Ferrandi, and M. Perrotin, “Data Transfers Analysis in Computer Assisted Design Flow of FPGA Accelerators for Aerospace Systems,” IEEE Transactions on Multi-Scale Computing Systems, vol. 4, iss. 1, pp. 1-14, 2017.
    [BibTeX] [Abstract]

    The integration of Field Programmable Gate Arrays (FPGAs) in an aerospace system improves its efficiency and its flexibility thanks to their programmability, but increases the design complexity. The design flows indeed have to be composed of several steps to fill the gap between the starting solution, which is usually a reference sequential implementation, and the final heterogeneous solution which includes custom hardware accelerators. Among these steps, there are the analysis of the application to identify the functionalities that gain advantages in execution on hardware and the generation of their implementations by means of Hardware Description Languages. Generating these descriptions for a software developer can be a very difficult task because of the different programming paradigms of software programs and hardware descriptions. To facilitate the developer in this activity, High Level Synthesis techniques have been developed aiming at (semi-)automatically generating hardware implementations of specifications written in high level languages (e.g., C). With respect to other embedded systems scenarios, the aerospace systems introduce further constraints that have to be taken into account during the design of these heterogeneous systems. In this type of systems explicit data transfers to and from FPGAs are preferred to the adoption of a shared memory architecture. The first approach indeed potentially improves the predictability of the produced solutions, but the sizes of all the data transferred to and from any devices must be known at design time. Identifying the sizes in presence of complex C applications which use pointers can be a not so easy task. In this paper, a semi-automatic design flow based on the integration of an aerospace design flow, an application analysis technique, and High Level Synthesis methodologies is presented. The initial reference application is analyzed to identify which are the sizes of the data exchanged among the different components of the application. Next, starting from the high level specification and from the results of this analysis, High Level Synthesis techniques are applied to automatically produce the hardware accelerators.

    @ARTICLE{TMSCS2017,
    author={M. Lattuada and F. Ferrandi and M. Perrotin},
    journal={IEEE Transactions on Multi-Scale Computing Systems},
    title={Data Transfers Analysis in Computer Assisted Design Flow of FPGA Accelerators for Aerospace Systems},
    year={2017},
    volume={4},
    number={1},
    pages={1--14},
    keywords={Data transfer;Field programmable gate arrays;Hardware;Hardware design languages;High level synthesis;Memory management;Software;FPGA;code analysis;high level synthesis;space systems},
    doi={10.1109/TMSCS.2017.2699647},
    month={Jan.-March},
    publisher = {{IEEE}},
    ISSN={2332-7766},
    abstract={The integration of Field Programmable Gate Arrays (FPGAs) in an aerospace system improves its efficiency and its flexibility thanks to their programmability, but increases the design complexity. The design flows indeed have to be composed of several steps to fill the gap between the starting solution, which is usually a reference sequential implementation, and the final heterogeneous solution which includes custom hardware accelerators. Among these steps, there are the analysis of the application to identify the functionalities that gain advantages in execution on hardware and the generation of their implementations by means of Hardware Description Languages. Generating these descriptions for a software developer can be a very difficult task because of the different programming paradigms of software programs and hardware descriptions. To facilitate the developer in this activity, High Level Synthesis techniques have been developed aiming at (semi-)automatically generating hardware implementations of specifications written in high level languages (e.g., C). With respect to other embedded systems scenarios, the aerospace systems introduce further constraints that have to be taken into account during the design of these heterogeneous systems. In this type of systems explicit data transfers to and from FPGAs are preferred to the adoption of a shared memory architecture. The first approach indeed potentially improves the predictability of the produced solutions, but the sizes of all the data transferred to and from any devices must be known at design time. Identifying the sizes in presence of complex C applications which use pointers can be a not so easy task. In this paper, a semi-automatic design flow based on the integration of an aerospace design flow, an application analysis technique, and High Level Synthesis methodologies is presented. 
The initial reference application is analyzed to identify which are the sizes of the data exchanged among the different components of the application. Next, starting from the high level specification and from the results of this analysis, High Level Synthesis techniques are applied to automatically produce the hardware accelerators.},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/1020704/201010/paper.pdf}
    }

  • [PDF] [URL] [DOI] M. Lattuada and F. Ferrandi, “Exploiting Vectorization in High Level Synthesis of Nested Irregular Loops,” Journal of Systems Architecture, vol. 75, pp. 1-14, 2017.
    [BibTeX] [Abstract]

    Synthesis of DoAll loops is a key aspect of High Level Synthesis since they allow to easily exploit the potential parallelism provided by programmable devices. This type of parallelism can be implemented in several ways: by duplicating the implementation of body loop, by exploiting loop pipelining or by applying vectorization. In this paper a methodology for the synthesis of nested irregular DoAll loops based on outer vectorization is proposed. The methodology transforms the intermediate representation of the DoAll loop to introduce vectorization and it can be easily integrated in existing state of the art High Level Synthesis flows since does not require any modification in the rest of the flow. Vectorization is not limited to perfectly nested countable loops: conditional constructs and loops with variable number of iterations are supported. Experimental results on parallel benchmarks show that the generated parallel accelerators have significant speed-up with limited penalties in terms of resource usage and frequency decrement.

    @article{JSA2017,
    title = "Exploiting Vectorization in High Level Synthesis of Nested Irregular Loops",
    journal = "Journal of Systems Architecture",
    volume = "75",
    pages = "1--14",
    year = "2017",
    issn = "1383-7621",
    doi = "10.1016/j.sysarc.2017.03.001",
    url = "http://www.sciencedirect.com/science/article/pii/S1383762117301327",
    author = "Marco Lattuada and Fabrizio Ferrandi",
    keywords = "High Level Synthesis, Vectorization, Code transformations",
    publisher = {Elsevier},
    abstract = "Synthesis of DoAll loops is a key aspect of High Level Synthesis since they allow to easily exploit the potential parallelism provided by programmable devices. This type of parallelism can be implemented in several ways: by duplicating the implementation of body loop, by exploiting loop pipelining or by applying vectorization. In this paper a methodology for the synthesis of nested irregular DoAll loops based on outer vectorization is proposed. The methodology transforms the intermediate representation of the DoAll loop to introduce vectorization and it can be easily integrated in existing state of the art High Level Synthesis flows since does not require any modification in the rest of the flow. Vectorization is not limited to perfectly nested countable loops: conditional constructs and loops with variable number of iterations are supported. Experimental results on parallel benchmarks show that the generated parallel accelerators have significant speed-up with limited penalties in terms of resource usage and frequency decrement.",
    pdf={https://re.public.polimi.it/retrieve/handle/11311/1010813/172873/paper.pdf}
    }

2016

  • [PDF] [URL] [DOI] M. Minutoli, V. G. Castellana, A. Tumeo, M. Lattuada, and F. Ferrandi, “Efficient Synthesis of Graph Methods: A Dynamically Scheduled Architecture,” in Proceedings of the 35th International Conference on Computer-Aided Design, New York, NY, USA, 2016, p. 128:1–128:8.
    [BibTeX] [Abstract]

    RDF databases naturally map to a graph representation and employ languages, such as SPARQL, that implements queries as graph pattern matching routines. Graph methods exhibit an irregular behavior: they present unpredictable, fine-grained data accesses, and are synchronization intensive. Graph data structures expose large amounts of dynamic parallelism, but are difficult to partition without generating load unbalance. In this paper, we present a novel architecture to improve the synthesis of graph methods. Our design addresses the issues of these algorithms with two components: a Dynamic Task Scheduler (DTS), which reduces load unbalance and maximize resource utilization, and a Hierarchical Memory Interface controller (HMI), which provides support for concurrent memory operations on multi-ported/multi-banked shared memories. We evaluate our approach by generating the accelerators for a set of SPARQL queries from the Lehigh University Benchmark (LUBM). We first analyze the load unbalance of these queries, showing that execution time among tasks can differ even of order of magnitudes. We then synthesize the queries and compare the performance of the resulting accelerators against the current state of the art. Experimental results show that our solution provides a speedup over the serial implementation close to the theoretical maximum and a speedup up to 3.45 over a baseline parallel implementation. We conclude our study by exploring the design space to achieve maximum memory channels utilization. The best design used at least three of the four memory channels for more than 90% of the execution time.

    @inproceedings{ICCAD2016,
    author = {Minutoli, Marco and Castellana, Vito Giovanni and Tumeo, Antonino and Lattuada, Marco and Ferrandi, Fabrizio},
    title = {Efficient Synthesis of Graph Methods: A Dynamically Scheduled Architecture},
    booktitle = {Proceedings of the 35th International Conference on Computer-Aided Design},
    series = {ICCAD '16},
    year = {2016},
    isbn = {978-1-4503-4466-1},
    location = {Austin, Texas},
    pages = {128:1--128:8},
    articleno = {128},
    numpages = {8},
    url = {http://doi.acm.org/10.1145/2966986.2967030},
    doi = {10.1145/2966986.2967030},
    acmid = {2967030},
    publisher = {{ACM}},
    address = {New York, NY, USA},
    abstract={RDF databases naturally map to a graph representation and employ languages, such as SPARQL, that implements queries as graph pattern matching routines. Graph methods exhibit an irregular behavior: they present unpredictable, fine-grained data accesses, and are synchronization intensive. Graph data structures expose large amounts of dynamic parallelism, but are difficult to partition without generating load unbalance. In this paper, we present a novel architecture to improve the synthesis of graph methods. Our design addresses the issues of these algorithms with two components: a Dynamic Task Scheduler (DTS), which reduces load unbalance and maximize resource utilization, and a Hierarchical Memory Interface controller (HMI), which provides support for concurrent memory operations on multi-ported/multi-banked shared memories. We evaluate our approach by generating the accelerators for a set of SPARQL queries from the Lehigh University Benchmark (LUBM). We first analyze the load unbalance of these queries, showing that execution time among tasks can differ even of order of magnitudes. We then synthesize the queries and compare the performance of the resulting accelerators against the current state of the art. Experimental results show that our solution provides a speedup over the serial implementation close to the theoretical maximum and a speedup up to 3.45 over a baseline parallel implementation. We conclude our study by exploring the design space to achieve maximum memory channels utilization. The best design used at least three of the four memory channels for more than 90\% of the execution time.},
    keywords={SQL;concurrency (computers);data structures;database management systems;graph theory;parallel processing;resource allocation;shared memory systems;storage management;DTS;HMI;LUBM;Lehigh University Benchmark;RDF databases;SPARQL queries;concurrent memory operations;dynamic task scheduler;dynamically scheduled architecture;graph data structures;graph method synthesis;graph pattern matching routines;graph representation;hierarchical memory interface controller;load unbalance;memory channel utilization;multiported multibanked shared memories;parallel implementation;resource utilization;Algorithm design and analysis;Databases;Dynamic scheduling;Hardware;Kernel;Parallel processing;Resource description framework;Big Data;Dynamic Task Scheduling;High-Level Synthesis;SPARQL},
    pdf={https://re.public.polimi.it/retrieve/145493/paper.pdf},
    }

  • [PDF] [URL] [DOI] M. Minutoli, V. G. Castellana, A. Tumeo, M. Lattuada, and F. Ferrandi, “Enabling the High Level Synthesis of Data Analytics Accelerators,” in Proceedings of the Eleventh IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, New York, NY, USA, 2016, p. 15:1–15:3.
    [BibTeX] [Abstract]

    Conventional High Level Synthesis (HLS) tools mainly target compute intensive kernels typical of digital signal processing applications. We are developing techniques and architectural templates to enable HLS of data analytics applications. These applications are memory intensive, present fine-grained, unpredictable data accesses, and irregular, dynamic task parallelism. We discuss an architectural template based around a distributed controller to efficiently exploit thread level parallelism. We present a memory interface that supports parallel memory subsystems and enables implementing atomic memory operations. We introduce a dynamic task scheduling approach to efficiently execute heavily unbalanced workload. The templates are validated by synthesizing queries from the Lehigh University Benchmark (LUBM), a well-known SPARQL benchmark.

    @inproceedings{CODES2016,
    author = {Minutoli, Marco and Castellana, Vito Giovanni and Tumeo, Antonino and Lattuada, Marco and Ferrandi, Fabrizio},
    title = {Enabling the High Level Synthesis of Data Analytics Accelerators},
    booktitle = {Proceedings of the Eleventh IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis},
    series = {CODES '16},
    year = {2016},
    isbn = {978-1-4503-4483-8},
    location = {Pittsburgh, Pennsylvania},
    pages = {15:1--15:3},
    articleno = {15},
    numpages = {3},
    url = {http://doi.acm.org/10.1145/2968456.2976764},
    doi = {10.1145/2968456.2976764},
    acmid = {2976764},
    publisher = {{ACM}},
    address = {New York, NY, USA},
    abstract={Conventional High Level Synthesis (HLS) tools mainly target compute intensive kernels typical of digital signal processing applications. We are developing techniques and architectural templates to enable HLS of data analytics applications. These applications are memory intensive, present fine-grained, unpredictable data accesses, and irregular, dynamic task parallelism. We discuss an architectural template based around a distributed controller to efficiently exploit thread level parallelism. We present a memory interface that supports parallel memory subsystems and enables implementing atomic memory operations. We introduce a dynamic task scheduling approach to efficiently execute heavily unbalanced workload. The templates are validated by synthesizing queries from the Lehigh University Benchmark (LUBM), a well-known SPARQL benchmark.},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/999155/141283/paper.pdf}
    }

  • [PDF] [DOI] M. Minutoli, V. G. Castellana, A. Tumeo, F. Ferrandi, and M. Lattuada, “A Dynamically Scheduled Architecture for the Synthesis of Graph Database Queries,” in 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016, pp. 136-136.
    [BibTeX] [Abstract]

    Data analytics applications, such as graph databases, exhibit irregular behaviors that make their acceleration non-trivial. These applications expose a significant amount of Task Level Parallelism (TLP), but they present fine grained memory accesses.

    @INPROCEEDINGS{FCCM2016,
    author={M. Minutoli and V. G. Castellana and A. Tumeo and F. Ferrandi and M. Lattuada},
    booktitle={2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)},
    title={A Dynamically Scheduled Architecture for the Synthesis of Graph Database Queries},
    year={2016},
    pages={136-136},
    keywords={Databases;Dynamic scheduling;High level synthesis;Memory architecture;Pipeline processing;Registers},
    doi={10.1109/FCCM.2016.41},
    month={May},
    abstract={Data analytics applications, such as graph databases, exhibit irregular behaviors that make their acceleration non-trivial. These applications expose a significant amount of Task Level Parallelism (TLP), but they present fine grained memory accesses.},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/995143/140525/paper.pdf},
    publisher = {{IEEE}},
    }

  • [PDF] [DOI] P. Fezzardi and F. Ferrandi, “Automated bug detection for pointers and memory accesses in High-Level Synthesis compilers,” in 2016 26th International Conference on Field Programmable Logic and Applications (FPL), 2016, pp. 1-9.
    [BibTeX] [Abstract]

    Modern High-Level Synthesis (HLS) compilers aggressively optimize memory architectures. Bugs involving memory accesses are hard to detect, especially if they are inserted in the compilation process. We present an approach to isolate automatically memory bugs introduced by HLS tools, without user interaction, using only the original high-level specification. This is possible by tracing memory accesses in software (SW) and hardware (HW) executions on a given input dataset. The execution traces are compared performing a context-aware HW/SW address translation, leveraging alias-analysis, HLS memory allocation information and SW memory debugging practices. No restrictions are imposed on memory optimizations. We show results on the relevance of the problem, the coverage, the detected bugs. We also show that the approach can be adapted to different commercial and academic HLS tools.

    @INPROCEEDINGS{FPL2016,
    author={P. Fezzardi and F. Ferrandi},
    booktitle={2016 26th International Conference on Field Programmable Logic and Applications (FPL)},
    title={Automated bug detection for pointers and memory accesses in High-Level Synthesis compilers},
    year={2016},
    pages={1-9},
    keywords={Algorithm design and analysis;Computer bugs;Debugging;Hardware design languages;Optimization;Resource management;Silicon},
    doi={10.1109/FPL.2016.7577369},
    month={Aug},
    abstract={Modern High-Level Synthesis (HLS) compilers aggressively optimize memory architectures. Bugs involving memory accesses are hard to detect, especially if they are inserted in the compilation process. We present an approach to isolate automatically memory bugs introduced by HLS tools, without user interaction, using only the original high-level specification. This is possible by tracing memory accesses in software (SW) and hardware (HW) executions on a given input dataset. The execution traces are compared performing a context-aware HW/SW address translation, leveraging alias-analysis, HLS memory allocation information and SW memory debugging practices. No restrictions are imposed on memory optimizations. We show results on the relevance of the problem, the coverage, the detected bugs. We also show that the approach can be adapted to different commercial and academic HLS tools.},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/998431/139966/address-discrepancy.pdf},
    publisher = {{IEEE}},
    }

  • [PDF] [DOI] R. Nane, V. M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. Anderson, and K. Bertels, “A Survey and Evaluation of FPGA High-Level Synthesis Tools,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, iss. 10, p. 1591–1604, 2016.
    [BibTeX] [Abstract]

    High-level synthesis (HLS) is increasingly popular for the design of high-performance and energy-efficient heterogeneous systems, shortening time-to-market and addressing today’s system complexity. HLS allows designers to work at a higher-level of abstraction by using a software program to specify the hardware functionality. Additionally, HLS is particularly interesting for designing FPGA circuits, where hardware implementations can be easily refined and replaced in the target device. Recent years have seen much activity in the HLS research community, with a plethora of HLS tool offerings, from both industry and academia. All these tools may have different input languages, perform different internal optimizations, and produce results of different quality, even for the very same input description. Hence, it is challenging to compare their performance and understand which is the best for the hardware to be implemented. We present a comprehensive analysis of recent HLS tools, as well as overview the areas of active interest in the HLS research community. We also present a first-published methodology to evaluate different HLS tools. We use our methodology to compare one commercial and three academic tools on a common set of C benchmarks, aiming at performing an in-depth evaluation in terms of performance and use of resources.

    @ARTICLE{TCADHLSEVAL2016,
    author={R. Nane and V. M. Sima and C. Pilato and J. Choi and B. Fort and A. Canis and Y. T. Chen and H. Hsiao and S. Brown and F. Ferrandi and J. Anderson and K. Bertels},
    journal={IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems},
    title={A Survey and Evaluation of FPGA High-Level Synthesis Tools},
    volume = {35},
    number = {10},
    pages = {1591--1604},
    year = {2016},
    publisher = {{IEEE}},
    abstract={High-level synthesis (HLS) is increasingly popular for the design of high-performance and energy-efficient heterogeneous systems, shortening time-to-market and addressing today’s system complexity. HLS allows designers to work at a higher-level of abstraction by using a software program to specify the hardware functionality. Additionally, HLS is particularly interesting for designing FPGA circuits, where hardware implementations can be easily refined and replaced in the target device. Recent years have seen much activity in the HLS research community, with a plethora of HLS tool offerings, from both industry and academia. All these tools may have different input languages, perform different internal optimizations, and produce results of different quality, even for the very same input description. Hence, it is challenging to compare their performance and understand which is the best for the hardware to be implemented. We present a comprehensive analysis of recent HLS tools, as well as overview the areas of active interest in the HLS research community. We also present a first-published methodology to evaluate different HLS tools. We use our methodology to compare one commercial and three academic tools on a common set of C benchmarks, aiming at performing an in-depth evaluation in terms of performance and use of resources.},
    ISSN={0278-0070},
    month={Oct},
    keywords={field programmable gate arrays;high level synthesis;integrated circuit design;FPGA;HLS;abstraction;comprehensive analysis;energy-efficient heterogeneous systems;field-programmable gate array circuit design;hardware functionality;high-level synthesis tool;software program;Field programmable gate arrays;Hardware;Hardware design languages;Optimization;Program processors;Yttrium;Bambu;Dwarv;LegUp;comparison;evaluation;field-programmable gate array (FPGA);high-level synthesis (HLS);survey},
    doi={10.1109/TCAD.2015.2513673},
    pdf={wp-content/papercite-data/pdf/TCADHLSEVAL2016.pdf},
    }

  • [PDF] [DOI] M. Lattuada, F. Ferrandi, and M. Perrotin, “Computer Assisted Design and Integration of FPGA Accelerators in Aerospace Systems,” in Proceedings of the IEEE Aerospace Conference, 2016, p. 1–11.
    [BibTeX] [Abstract]

    The integration of Field Programmable Gate Arrays (FPGAs) in an aerospace system allows to improve its efficiency and its flexibility thanks to their programmability. Generating the required hardware descriptions for a software developer could be a very difficult task because of the different programming paradigms of software programs and hardware descriptions. To facilitate the developer in this activity, High Level Synthesis techniques have been developed aiming at (semi-)automatically generating hardware implementations of specifications written in high level languages (e.g., C). In this paper the integration of a High Level Synthesis design flow in the TASTE framework (http://taste.tuxfamily.org) is presented. TASTE is a set of freely available tools for the development of real time embedded systems developed by the European Space Agency together with a set of its industrial partners. This framework allows to integrate specifications described in different languages (e.g., C, ADA, Simulink, SDL) by means of formal languages (AADL and ASN.1) and to early verify the correctness of the produced solutions. TASTE has been extended with bambu (http://panda.dei.polimi.it), a tool for the High Level Synthesis developed at Politecnico di Milano. In this way the TASTE users have the possibility to specify which functionalities, provided by means of high level languages such as C, have to be implemented in hardware on the FPGA without having to directly provide the hardware implementations.

    @inproceedings{AEROCONF2016,
    author = {Marco Lattuada and Fabrizio Ferrandi and Maxime Perrotin},
    title = {Computer Assisted Design and Integration of FPGA Accelerators in Aerospace Systems},
    booktitle = {Proceedings of the {IEEE} Aerospace Conference},
    year = {2016},
    pages = {1--11},
    publisher = {{IEEE}},
    abstract = {The integration of Field Programmable Gate Arrays (FPGAs) in an aerospace system allows to improve its efficiency and its flexibility thanks to their programmability. Generating the required hardware descriptions for a software developer could be a very difficult task because of the different programming paradigms of software programs and hardware descriptions. To facilitate the developer in this activity, High Level Synthesis techniques have been developed aiming at (semi-)automatically generating hardware implementations of specifications written in high level languages (e.g., C). In this paper the integration of a High Level Synthesis design flow in the TASTE framework (http://taste.tuxfamily.org) is presented. TASTE is a set of freely available tools for the development of real time embedded systems developed by the European Space Agency together with a set of its industrial partners. This framework allows to integrate specifications described in different languages (e.g., C, ADA, Simulink, SDL) by means of formal languages (AADL and ASN.1) and to early verify the correctness of the produced solutions. TASTE has been extended with bambu (http://panda.dei.polimi.it), a tool for the High Level Synthesis developed at Politecnico di Milano. In this way the TASTE users have the possibility to specify which functionalities, provided by means of high level languages such as C, have to be implemented in hardware on the FPGA without having to directly provide the hardware implementations.},
    doi={10.1109/AERO.2016.7500675},
    pdf = {https://re.public.polimi.it/retrieve/handle/11311/977752/92276/aeroconf2016.pdf},
    }

2015

  • [PDF] [DOI] M. Lattuada and F. Ferrandi, “Exploiting Outer Loops Vectorization in High Level Synthesis,” in Proceedings of the Architecture of Computing Systems ARCS, 2015, p. 31–42.
    [BibTeX] [Abstract]

    Synthesis of DoAll loops is a key aspect of High Level Synthesis since they allow to easily exploit the potential parallelism provided by programmable devices. This type of parallelism can be implemented in several ways: by duplicating the implementation of body loop, by exploiting loop pipelining or by applying vectorization. In this paper a methodology for the synthesis of complex DoAll loops based on outer vectorization is proposed. Vectorization is not limited to the innermost loops: complex constructs such as nested loops, conditional constructs and function calls are supported. Experimental results on parallel benchmarks show up to 7.35x speed-up and up to 40% reduction of area-delay product.

    @inproceedings{ARCS2015,
    author = {Marco Lattuada and Fabrizio Ferrandi},
    title = {Exploiting Outer Loops Vectorization in High Level Synthesis},
    booktitle = {Proceedings of the Architecture of Computing Systems {ARCS}},
    series = {Lecture Notes in Computer Science},
    volume = {9017},
    pages = {31--42},
    publisher = {Springer International Publishing},
    year = {2015},
    abstract = {Synthesis of DoAll loops is a key aspect of High Level Synthesis since they allow to easily exploit the potential parallelism provided by programmable devices. This type of parallelism can be implemented in several ways: by duplicating the implementation of body loop, by exploiting loop pipelining or by applying vectorization. In this paper a methodology for the synthesis of complex DoAll loops based on outer vectorization is proposed. Vectorization is not limited to the innermost loops: complex constructs such as nested loops, conditional constructs and function calls are supported. Experimental results on parallel benchmarks show up to 7.35x speed-up and up to 40% reduction of area-delay product.},
    doi = {10.1007/978-3-319-16086-3_3},
    pdf = {https://re.public.polimi.it/retrieve/handle/11311/964118/92680/arcs2015.pdf},
    }

  • [DOI] M. Minutoli, V. G. Castellana, A. Tumeo, and F. Ferrandi, “Function Proxies for Improved Resource Sharing in High Level Synthesis,” in Proceedings of the IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2015, pp. 100-100.
    [BibTeX] [Abstract]

    The current generation of High Level Synthesis (HLS) tools usually generates hierarchical and modular designs, mimicking the structure of the call graph of the original high-level input specification. The standard approach is to progressively synthesize functions into modules by navigating the application call graph from the leaves up to the top function. In the synthesized architecture, function calls corresponds to the instantiation of the related module into the data path generated for the caller. Our work introduces a methodology that enables sharing of (sub)modules across modules boundaries.

    @INPROCEEDINGS{FCCM2015,
    author={M. Minutoli and V. G. Castellana and A. Tumeo and F. Ferrandi},
    booktitle={Proceedings of the IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)},
    title={Function Proxies for Improved Resource Sharing in High Level Synthesis},
    year={2015},
    pages={100-100},
    abstract={The current generation of High Level Synthesis (HLS) tools usually generates hierarchical and modular designs, mimicking the structure of the call graph of the original high-level input specification. The standard approach is to progressively synthesize functions into modules by navigating the application call graph from the leaves up to the top function. In the synthesized architecture, function calls corresponds to the instantiation of the related module into the data path generated for the caller. Our work introduces a methodology that enables sharing of (sub)modules across modules boundaries.},
    keywords={high level synthesis;resource allocation;HLS tools;call graph;function proxies;high level synthesis;high-level input specification;resource sharing;synthesized architecture;Complexity theory;Computer architecture;Corporate acquisitions;Field programmable gate arrays;High level synthesis;Optimization;Resource management;Resource sharing;function pointers;function proxies;high level synthesis},
    doi={10.1109/FCCM.2015.60},
    month={May},
    publisher={{IEEE}},
    }

  • [PDF] [DOI] M. Minutoli, V. G. Castellana, A. Tumeo, and F. Ferrandi, “Inter-procedural resource sharing in High Level Synthesis through function proxies,” in Proceedings of the 25th International Conference on Field Programmable Logic and Applications, FPL, 2015, pp. 1-8.
    [BibTeX] [Abstract]

    Modular design is becoming increasingly important in High Level Synthesis (HLS) flows. Current HLS flows generate hierarchical and modular designs that mimic the structure and call graph of the input specification by translating functions into modules. Function calls are translated by instantiating the callee module in the data-path of its caller, allowing for resource sharing when the same function is called multiple times. However, if two different callers invoke the same function, current HLS flows cannot share the instance of the module between the two callers, even if they invoke the function in a mutually exclusive way. In this paper, we propose a methodology that enables sharing of (sub)modules across modules boundaries. Sharing is obtained through function proxies, which act as forwarders of function calls in the original specification to shared modules without reducing performance. Building on the concept of function proxies, we propose a methodology and the related components to perform HLS of function calls through function pointers, without requiring complete static knowledge of the alias set (point-to set). We show that module sharing through function proxies provides valuable area savings and no significant impacts on the execution delays, and that our synthesis approach for function pointers enables dynamic polymorphism.

    @INPROCEEDINGS{FPL2015,
    author={M. Minutoli and V. G. Castellana and A. Tumeo and F. Ferrandi},
    title={Inter-procedural resource sharing in High Level Synthesis through function proxies},
    booktitle={Proceedings of the 25th International Conference on Field Programmable Logic and Applications, {FPL}},
    year={2015},
    pages={1-8},
    isbn = {978-0-9934-2800-5},
    publisher = {{IEEE}},
    month={Sept},
    location = {London, United Kingdom},
    abstract={Modular design is becoming increasingly important in High Level Synthesis (HLS) flows. Current HLS flows generate hierarchical and modular designs that mimic the structure and call graph of the input specification by translating functions into modules. Function calls are translated by instantiating the callee module in the data-path of its caller, allowing for resource sharing when the same function is called multiple times. However, if two different callers invoke the same function, current HLS flows cannot share the instance of the module between the two callers, even if they invoke the function in a mutually exclusive way. In this paper, we propose a methodology that enables sharing of (sub)modules across modules boundaries. Sharing is obtained through function proxies, which act as forwarders of function calls in the original specification to shared modules without reducing performance. Building on the concept of function proxies, we propose a methodology and the related components to perform HLS of function calls through function pointers, without requiring complete static knowledge of the alias set (point-to set). We show that module sharing through function proxies provides valuable area savings and no significant impacts on the execution delays, and that our synthesis approach for function pointers enables dynamic polymorphism.},
    keywords={graph theory;high level synthesis;polymorphism;resource allocation;HLS flow;call graph;dynamic polymorphism;function calls;function pointers;function proxies;hierarchical designs;high level synthesis;interprocedural resource sharing;modular design;module boundaries;module sharing;Benchmark testing;Corporate acquisitions;Hardware;Optimization;Registers;Resource management;Table lookup},
    doi={10.1109/FPL.2015.7293958},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/966133/92939/function-pointers-hls.pdf},
    }

  • [PDF] [DOI] V. G. Castellana, M. Minutoli, A. Morari, A. Tumeo, M. Lattuada, and F. Ferrandi, “High level synthesis of RDF queries for graph analytics,” in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 2015, pp. 323-330.
    [BibTeX] [Abstract]

    In this paper we present a set of techniques that enable the synthesis of efficient custom accelerators for memory intensive, irregular applications. To address the challenges of irregular applications (large memory footprint, unpredictable fine-grained data accesses, and high synchronization intensity), and exploit their opportunities (thread level parallelism, memory level parallelism), we propose a novel accelerator design that employs an adaptive and Distributed Controller (DC) architecture, and a Memory Interface Controller (MIC) that supports concurrent and atomic memory operations on a multi-ported/multi-banked shared memory. Among the multitude of algorithms that may benefit from our solution, we focus on the acceleration of graph analytics applications and, in particular, on the synthesis of SPARQL queries on Resource Description Framework (RDF) databases. We achieve this objective by incorporating the synthesis techniques into Bambu, an open-source high-level synthesis tool, and interfacing it with GEMS, the Graph database Engine for Multithreaded Systems. The GEMS front-end generates optimized C implementations of the input queries, modeled as graph pattern matching algorithms, which are then automatically synthesized by Bambu. We validate our approach by synthesizing several SPARQL queries from the Lehigh University Benchmark (LUBM).

    @INPROCEEDINGS{ICCAD2015A,
    author={V. G. Castellana and M. Minutoli and A. Morari and A. Tumeo and M. Lattuada and F. Ferrandi},
    title={High level synthesis of {RDF} queries for graph analytics},
    booktitle={Proceedings of the {IEEE}/{ACM} International Conference on Computer-Aided Design},
    series = {ICCAD '15},
    year={2015},
    location = {Austin, TX, USA},
    pages={323-330},
    publisher = {{IEEE}},
    month={Nov},
    abstract={In this paper we present a set of techniques that enable the synthesis of efficient custom accelerators for memory intensive, irregular applications. To address the challenges of irregular applications (large memory footprint, unpredictable fine-grained data accesses, and high synchronization intensity), and exploit their opportunities (thread level parallelism, memory level parallelism), we propose a novel accelerator design that employs an adaptive and Distributed Controller (DC) architecture, and a Memory Interface Controller (MIC) that supports concurrent and atomic memory operations on a multi-ported/multi-banked shared memory. Among the multitude of algorithms that may benefit from our solution, we focus on the acceleration of graph analytics applications and, in particular, on the synthesis of SPARQL queries on Resource Description Framework (RDF) databases. We achieve this objective by incorporating the synthesis techniques into Bambu, an open-source high-level synthesis tool, and interfacing it with GEMS, the Graph database Engine for Multithreaded Systems. The GEMS front-end generates optimized C implementations of the input queries, modeled as graph pattern matching algorithms, which are then automatically synthesized by Bambu. We validate our approach by synthesizing several SPARQL queries from the Lehigh University Benchmark (LUBM).},
    keywords={graph theory;high level synthesis;memory architecture;multi-threading;query languages;shared memory systems;Bambu;DC architecture;GEMS;LUBM;Lehigh University Benchmark;MIC;RDF databases;RDF queries;SPARQL queries;accelerator design;adaptive architecture;atomic memory operations;concurrent memory operations;distributed controller architecture;graph analytics;graph database engine;graph pattern matching algorithms;memory intensive irregular applications;memory interface controller;multiported/multibanked shared memory;multithreaded systems;open source high-level synthesis tools;resource description framework databases;Acceleration;Computer architecture;Databases;Field programmable gate arrays;Parallel processing;Program processors;Resource description framework},
    doi={10.1109/ICCAD.2015.7372587},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/977866/92674/iccad15.pdf},
    }

  • [PDF] [DOI] M. Lattuada and F. Ferrandi, “Code Transformations Based on Speculative SDC Scheduling,” in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 2015, p. 71–77.
    [BibTeX] [Abstract]

    Code motion and speculation are usually exploited in the High Level Synthesis of control dominated applications to improve the performance of the synthesized designs. Selecting the transformations to be applied is not a trivial task: their effects can indeed indirectly spread across the whole design, potentially worsening the quality of the results. In this paper we propose a code transformation flow, based on a new extension of the System of Difference Constraints (SDC) scheduling algorithm, which introduces a large number of transformations, whose profitability is guaranteed by the SDC formulation. Experimental results show that the proposed technique on average reduces the execution time of control dominated applications by 37% with respect to a commercial tool without increasing the area usage.

    @inproceedings{ICCAD2015B,
    author = {Lattuada, Marco and Ferrandi, Fabrizio},
    title = {Code Transformations Based on Speculative {SDC} Scheduling},
    booktitle = {Proceedings of the {IEEE}/{ACM} International Conference on Computer-Aided Design},
    series = {ICCAD '15},
    year = {2015},
    location = {Austin, TX, USA},
    pages = {71--77},
    publisher = {{IEEE}},
    month={Nov},
    abstract={Code motion and speculation are usually exploited in the High Level Synthesis of control dominated applications to improve the performance of the synthesized designs. Selecting the transformations to be applied is not a trivial task: their effects can indeed indirectly spread across the whole design, potentially worsening the quality of the results. In this paper we propose a code transformation flow, based on a new extension of the System of Difference Constraints (SDC) scheduling algorithm, which introduces a large number of transformations, whose profitability is guaranteed by the SDC formulation. Experimental results show that the proposed technique on average reduces the execution time of control dominated applications by 37\% with respect to a commercial tool without increasing the area usage.},
    doi={10.1109/ICCAD.2015.7372552},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/973456/92677/iccad2015.pdf}
    }

  • [PDF] [DOI] P. Fezzardi, M. Castellana, and F. Ferrandi, “Trace-based automated logical debugging for high-level synthesis generated circuits,” in Proceedings of the 33rd IEEE International Conference on Computer Design (ICCD), 2015, pp. 251-258.
    [BibTeX] [Abstract]

    In this paper we present an approach for debugging hardware designs generated by High-Level Synthesis (HLS), relieving users from the burden of identifying the signals to trace and from the error-prone task of manually checking the traces. The necessary steps are performed after HLS, independently of it and without affecting the synthesized design. For this reason our methodology should be easily adaptable to any HLS tool. The proposed approach makes full use of HLS compile-time information. The executions of the simulated design and the original C program can be compared, checking if there are discrepancies between values of C variables and signals in the design. The detection is completely automated, that is, it does not need any input but the program itself and the user does not have to know anything about the overall compilation process. The design can be validated on a given set of test cases and the discrepancies are detected by the tool. Relationships between the original high-level source code and the generated HDL are kept by the compiler and shown to the user. The granularity of such discrepancy analysis is per-operation and it includes the temporary variables inserted by the compiler. As a consequence the design can be debugged as is, with no restrictions on optimizations available during HLS. We show how this methodology can be used to identify different kinds of bugs: 1) introduced by the HLS tool used for the synthesis; 2) introduced using buggy libraries of hardware components for HLS; 3) undefined behavior bugs in the original high-level source code.

    @INPROCEEDINGS{ICCD2015,
    author={P. Fezzardi and M. Castellana and F. Ferrandi},
    booktitle={Proceedings of the 33rd IEEE International Conference on Computer Design (ICCD)},
    title={Trace-based automated logical debugging for high-level synthesis generated circuits},
    year={2015},
    pages={251-258},
    month={Oct},
    abstract={In this paper we present an approach for debugging hardware designs generated by High-Level Synthesis (HLS), relieving users from the burden of identifying the signals to trace and from the error-prone task of manually checking the traces. The necessary steps are performed after HLS, independently of it and without affecting the synthesized design. For this reason our methodology should be easily adaptable to any HLS tool. The proposed approach makes full use of HLS compile-time information. The executions of the simulated design and the original C program can be compared, checking if there are discrepancies between values of C variables and signals in the design. The detection is completely automated, that is, it does not need any input but the program itself and the user does not have to know anything about the overall compilation process. The design can be validated on a given set of test cases and the discrepancies are detected by the tool. Relationships between the original high-level source code and the generated HDL are kept by the compiler and shown to the user. The granularity of such discrepancy analysis is per-operation and it includes the temporary variables inserted by the compiler. As a consequence the design can be debugged as is, with no restrictions on optimizations available during HLS. We show how this methodology can be used to identify different kinds of bugs: 1) introduced by the HLS tool used for the synthesis; 2) introduced using buggy libraries of hardware components for HLS; 3) undefined behavior bugs in the original high-level source code.},
    keywords={C language;electronic design automation;high level synthesis;source code (software);C program;HDL;HLS tools;compile time informations;compiler;hardware design debugging;high-level source code;high-level synthesis generated circuits;signal identification;trace-based automated logical debugging;Computer bugs;Controllability;Debugging;Hardware;Layout;Observability;Optimization},
    doi={10.1109/ICCD.2015.7357111},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/973455/92803/discrepancy.pdf},
    publisher={{IEEE}},
    }

  • [PDF] [DOI] M. Lattuada, C. Pilato, and F. Ferrandi, “Performance Estimation of Task Graphs Based on Path Profiling,” International Journal of Parallel Programming, 2015.
    [BibTeX] [Abstract]

    Correctly estimating the speed-up of a parallel embedded application is crucial to efficiently compare different parallelization techniques, task graph transformations or mapping and scheduling solutions. Unfortunately, especially in case of control-dominated applications, task correlations may heavily affect the execution time of the solutions and usually this is not properly taken into account during performance analysis. We propose a methodology that combines a single profiling of the initial sequential specification with different decisions in terms of partitioning, mapping, and scheduling in order to better estimate the actual speed-up of these solutions. We validated our approach on a multi-processor simulation platform: experimental results show that our methodology, effectively identifying the correlations among tasks, significantly outperforms existing approaches for speed-up estimation. Indeed, we obtained an absolute error of less than 5% on average, even when compiling the code with different optimization levels.

    @ARTICLE{JPP2015,
    author={Lattuada, M. and Pilato, C. and Ferrandi, F.},
    title={Performance Estimation of Task Graphs Based on Path Profiling},
    journal={International Journal of Parallel Programming},
    year={2015},
    page_count={37},
    affiliation={Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy; Department of Computer Science, Columbia University, New York, NY, United States},
    abstract={Correctly estimating the speed-up of a parallel embedded application is crucial to efficiently compare different parallelization techniques, task graph transformations or mapping and scheduling solutions. Unfortunately, especially in case of control-dominated applications, task correlations may heavily affect the execution time of the solutions and usually this is not properly taken into account during performance analysis. We propose a methodology that combines a single profiling of the initial sequential specification with different decisions in terms of partitioning, mapping, and scheduling in order to better estimate the actual speed-up of these solutions. We validated our approach on a multi-processor simulation platform: experimental results show that our methodology, effectively identifying the correlations among tasks, significantly outperforms existing approaches for speed-up estimation. Indeed, we obtained an absolute error of less than 5\% on average, even when compiling the code with different optimization levels.},
    author_keywords={Hierarchical Task Graph; Path profiling; Performance estimation},
    document_type={Article in Press},
    doi={10.1007/s10766-015-0372-7},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/977870/92792/submitted.pdf},
    publisher={Springer},
    }

  • [PDF] [DOI] M. Lattuada and F. Ferrandi, “Modeling Resolution of Resources Contention in Synchronous Data Flow Graphs,” J. Signal Process. Syst., vol. 80, iss. 1, p. 39–47, 2015.
    [BibTeX] [Abstract]

    Synchronous Data Flow graphs are widely adopted in the design of streaming applications, but were originally formulated to describe only how an application is partitioned and which data are exchanged among different tasks. Since Synchronous Data Flow graphs are often used to describe and evaluate complete design solutions, missing information (e.g., mapping, scheduling, etc.) has to be included in them by means of further actors and channels to obtain accurate evaluations. To address this issue while preserving the simplicity of the representation, techniques that model data transfer delays by means of ad-hoc actors have been proposed, but they model each communication independently, ignoring contentions. Moreover, they usually do not consider delays due to buffer contentions at all, potentially overestimating the throughput of a design solution. In this paper a technique to extend Synchronous Data Flow graphs by adding ad-hoc actors and channels to model resolution of resources contentions is proposed. The results show that the number of added actors and channels is limited but that they can significantly increase the Synchronous Data Flow graph accuracy.

    @article{JSPS2015,
    author = {Lattuada, Marco and Ferrandi, Fabrizio},
    title = {Modeling Resolution of Resources Contention in Synchronous Data Flow Graphs},
    journal = {J. Signal Process. Syst.},
    issue_date = {July 2015},
    volume = {80},
    number = {1},
    month = jul,
    year = {2015},
    issn = {1939-8018},
    pages = {39--47},
    numpages = {9},
    acmid = {2746441},
    publisher = {Kluwer Academic Publishers},
    address = {Hingham, MA, USA},
    abstract={Synchronous Data Flow graphs are widely adopted in the design of streaming applications, but were originally formulated to describe only how an application is partitioned and which data are exchanged among different tasks. Since Synchronous Data Flow graphs are often used to describe and evaluate complete design solutions, missing information (e.g., mapping, scheduling, etc.) has to be included in them by means of further actors and channels to obtain accurate evaluations. To address this issue while preserving the simplicity of the representation, techniques that model data transfer delays by means of ad-hoc actors have been proposed, but they model each communication independently, ignoring contentions. Moreover, they usually do not consider delays due to buffer contentions at all, potentially overestimating the throughput of a design solution. In this paper a technique to extend Synchronous Data Flow graphs by adding ad-hoc actors and channels to model resolution of resources contentions is proposed. The results show that the number of added actors and channels is limited but that they can significantly increase the Synchronous Data Flow graph accuracy.},
    keywords = {Buffers, Contention, Data transfers, Synchronous data flow graphs},
    pdf = {https://re.public.polimi.it/retrieve/handle/11311/869737/92795/jsps_2013.pdf},
    doi = {10.1007/s11265-014-0923-y},
    }

2014

  • [DOI] V. G. Castellana, A. Tumeo, and F. Ferrandi, “An adaptive Memory Interface Controller for improving bandwidth utilization of hybrid and reconfigurable systems,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), 2014, pp. 1-4.
    [BibTeX] [Abstract]

    Data mining, bioinformatics, knowledge discovery, and social network analysis are emerging irregular applications that exploit data structures based on pointers or linked lists, such as graphs, unbalanced trees or unstructured grids. These applications are characterized by unpredictable memory accesses and generally are memory bandwidth bound, but also present large amounts of inherent dynamic parallelism because they can potentially spawn concurrent activities for each one of the elements they are exploring. Hybrid architectures, which integrate general purpose processors with reconfigurable devices, appear to be promising target platforms for accelerating irregular applications. These systems often connect to distributed and multi-ported memories, potentially enabling parallel memory operations. However, these memory architectures introduce several challenges, such as the necessity to manage concurrency and synchronization to avoid structural conflicts on shared memory locations and to guarantee consistency. In this paper we present an adaptive Memory Interface Controller (MIC) that addresses these issues. The MIC is a general and customizable solution that can target several different memory structures, and is suitable for High Level Synthesis frameworks. It implements a dynamic arbitration scheme, which avoids conflicts on memory resources at runtime, and supports atomic memory operations, commonly exploited for synchronization directives in parallel programming paradigms. The MIC simultaneously maps multiple accesses to different memory ports, allowing fine grained parallelism exploitation and ensuring correctness also in the presence of irregular and statically unpredictable memory access patterns. We evaluated the effectiveness of our approach on a typical irregular kernel, graph Breadth First Search (BFS), exploring different design alternatives.

    @INPROCEEDINGS{DATE2014,
    author={V. G. Castellana and A. Tumeo and F. Ferrandi},
    title={An adaptive Memory Interface Controller for improving bandwidth utilization of hybrid and reconfigurable systems},
    booktitle={Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE)},
    year={2014},
    pages={1-4},
    month={March},
    abstract={Data mining, bioinformatics, knowledge discovery, and social network analysis are emerging irregular applications that exploit data structures based on pointers or linked lists, such as graphs, unbalanced trees or unstructured grids. These applications are characterized by unpredictable memory accesses and generally are memory bandwidth bound, but also present large amounts of inherent dynamic parallelism because they can potentially spawn concurrent activities for each one of the elements they are exploring. Hybrid architectures, which integrate general purpose processors with reconfigurable devices, appear to be promising target platforms for accelerating irregular applications. These systems often connect to distributed and multi-ported memories, potentially enabling parallel memory operations. However, these memory architectures introduce several challenges, such as the necessity to manage concurrency and synchronization to avoid structural conflicts on shared memory locations and to guarantee consistency. In this paper we present an adaptive Memory Interface Controller (MIC) that addresses these issues. The MIC is a general and customizable solution that can target several different memory structures, and is suitable for High Level Synthesis frameworks. It implements a dynamic arbitration scheme, which avoids conflicts on memory resources at runtime, and supports atomic memory operations, commonly exploited for synchronization directives in parallel programming paradigms. The MIC simultaneously maps multiple accesses to different memory ports, allowing fine grained parallelism exploitation and ensuring correctness also in the presence of irregular and statically unpredictable memory access patterns. We evaluated the effectiveness of our approach on a typical irregular kernel, graph Breadth First Search (BFS), exploring different design alternatives.},
    keywords={digital storage;graph theory;high level synthesis;parallel programming;tree searching;adaptive MIC;adaptive memory interface controller;atomic memory operation;bandwidth utilization;bioinformatics;concurrency;concurrent activities;data mining;data structures;dynamic arbitration scheme;fine-grained parallelism exploitation;general purpose processors;graph BFS;graph breadth first search;graphs;high-level synthesis framework;hybrid systems;inherent dynamic parallelism;irregular-unpredictable memory access pattern;knowledge discovery;linked lists;memory architectures;memory bandwidth bound;memory ports;memory resources;memory structures;parallel memory operation;parallel programming paradigm;pointers;reconfigurable devices;reconfigurable systems;shared memory location;social network analysis;statically-unpredictable memory access pattern;synchronization;synchronization directives;typical irregular kernel;unbalanced trees;unpredictable memory access;unstructured grids;Concurrent computing;Hardware;Kernel;Memory management;Microwave integrated circuits;Parallel processing;Synchronization},
    doi={10.7873/DATE.2014.192},
    publisher={{IEEE}},
    }

  • V. G. Castellana, A. Tumeo, and F. Ferrandi, “A Synthesis Approach for Mapping Irregular Applications on Reconfigurable Architectures,” in Technical Program Posters High Performance Computing, Networking, Storage and Analysis (SC), 2014.
    [BibTeX] [Abstract]

    Emerging applications such as bioinformatics and knowledge discovery algorithms are irregular. They generate unpredictable memory accesses and are mostly memory bandwidth bound. Several efforts are looking at accelerating these applications on hybrid architectures, which integrate general purpose processors with reconfigurable devices. Some solutions include custom hand-tuned accelerators on the reconfigurable logic. Hand-crafted accelerators provide great performance benefits, but their development time often discourages their adoption. We propose a novel High Level Synthesis approach for the automatic generation of adaptive custom accelerators able to manage multiple execution flows. Our approach supports multiple, multi-ported and distributed memories, and atomic operations. It features a memory interface controller, which maps unpredictable memory access requests to the corresponding memory ports, while managing concurrency. We present a case study on a typical irregular kernel, the Graph Breadth First Search, evaluating performance tradeoffs when varying the number of memories and the number of concurrent flows.

    @inproceedings {SC2013,
    author={V. G. Castellana and A. Tumeo and F. Ferrandi},
    title={A Synthesis Approach for Mapping Irregular Applications on Reconfigurable Architectures},
    booktitle={Technical Program Posters High Performance Computing, Networking, Storage and Analysis (SC)},
    year={2014},
    abstract={Emerging applications such as bioinformatics and knowledge discovery algorithms are irregular. They generate unpredictable memory accesses and are mostly memory bandwidth bound. Several efforts are looking at accelerating these applications on hybrid architectures, which integrate general purpose processors with reconfigurable devices. Some solutions include custom hand-tuned accelerators on the reconfigurable logic. Hand-crafted accelerators provide great performance benefits, but their development time often discourages their adoption. We propose a novel High Level Synthesis approach for the automatic generation of adaptive custom accelerators able to manage multiple execution flows. Our approach supports multiple, multi-ported and distributed memories, and atomic operations. It features a memory interface controller, which maps unpredictable memory access requests to the corresponding memory ports, while managing concurrency. We present a case study on a typical irregular kernel, the Graph Breadth First Search, evaluating performance tradeoffs when varying the number of memories and the number of concurrent flows.},
    }

2013

  • [DOI] V. G. Castellana and F. Ferrandi, “Scheduling independent liveness analysis for register binding in high level synthesis,” in Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), 2013, pp. 1571-1574.
    [BibTeX] [Abstract]

    Classical techniques for register allocation and binding require the definition of the program execution order, since a partial ordering relation between operations must be induced to perform liveness analysis through data-flow equations. In High Level Synthesis (HLS) flows this is commonly obtained through the scheduling task. However, for some HLS approaches, such a relation can be difficult to compute, or not statically computable at all, and adopting conventional register binding techniques, even when feasible, cannot guarantee maximum performance. To overcome these issues we introduce a novel scheduling-independent liveness analysis methodology, suitable for dynamic scheduling architectures. Such liveness analysis is exploited in register binding using standard graph coloring techniques, and unlike other approaches it avoids the insertion of structural dependencies, introduced to prevent run-time resource conflicts in dynamic scheduling environments. The absence of additional dependencies avoids performance degradation and makes parallelism exploitation independent from the register binding task, while on average not impacting area, as shown through the experimental results.

    @INPROCEEDINGS{DATE2013,
    author={V. G. Castellana and F. Ferrandi},
    title={Scheduling independent liveness analysis for register binding in high level synthesis},
    booktitle={Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE)},
    year={2013},
    pages={1571-1574},
    month={March},
    ISSN={1530-1591},
    abstract={Classical techniques for register allocation and binding require the definition of the program execution order, since a partial ordering relation between operations must be induced to perform liveness analysis through data-flow equations. In High Level Synthesis (HLS) flows this is commonly obtained through the scheduling task. However, for some HLS approaches, such a relation can be difficult to compute, or not statically computable at all, and adopting conventional register binding techniques, even when feasible, cannot guarantee maximum performance. To overcome these issues we introduce a novel scheduling-independent liveness analysis methodology, suitable for dynamic scheduling architectures. Such liveness analysis is exploited in register binding using standard graph coloring techniques, and unlike other approaches it avoids the insertion of structural dependencies, introduced to prevent run-time resource conflicts in dynamic scheduling environments. The absence of additional dependencies avoids performance degradation and makes parallelism exploitation independent from the register binding task, while on average not impacting area, as shown through the experimental results.},
    keywords={Dynamic scheduling;Equations;Reactive power;Registers;Resource management;Schedules;Standards},
    doi={10.7873/DATE.2013.319},
    publisher={{IEEE}},
    }

  • [DOI] V. G. Castellana and F. Ferrandi, “An automated flow for the High Level Synthesis of coarse grained parallel applications,” in Proceedings of the International Conference on Field-Programmable Technology (FPT), 2013, pp. 294-301.
    [BibTeX] [Abstract]

    High Level Synthesis (HLS) provides a way to significantly enhance the productivity of embedded system designers, by enabling the automatic or semiautomatic generation of hardware accelerators starting from high level descriptions with (usually software) programming languages. Typical HLS approaches build a centralized Finite State Machine (FSM) to control the generated datapath, performing the operations according to a pre-determined, static schedule. However, FSM-based approaches are only able to extract parallelism within a single execution flow. In the presence of coarse grained parallelism, in the form of concurrent function calls or parallel control structures, they either serialize all the operations, or build excessively complex controllers, aiming at executing as many operations as possible in a single control step (i.e., they try to extract as much instruction level parallelism as possible). The resulting controllers occupy an excessive amount of area or lead to very low operating frequencies. In this paper we propose a methodology for the HLS of accelerators supporting parallel execution and dynamic scheduling. The approach exploits an adaptive distributed controller, composed of a set of communicating elements associated with each operation. This controller design enables supporting multiple concurrent execution flows, thus increasing parallelism exploitation beyond instruction level parallelism. The approach also supports variable latency operations, such as memory accesses and speculative operations. We apply our methodology on a set of typical HLS benchmarks, and demonstrate valuable speed ups with limited area overheads with respect to conventional FSM-based flows.

    @INPROCEEDINGS{FPT2013,
    author={V. G. Castellana and F. Ferrandi},
    title={An automated flow for the High Level Synthesis of coarse grained parallel applications},
    booktitle={Proceedings of the International Conference on Field-Programmable Technology (FPT)},
    year={2013},
    pages={294--301},
    month={Dec},
    abstract={High Level Synthesis (HLS) provides a way to significantly enhance the productivity of embedded system designers, by enabling the automatic or semiautomatic generation of hardware accelerators starting from high level descriptions with (usually software) programming languages. Typical HLS approaches build a centralized Finite State Machine (FSM) to control the generated datapath, performing the operations according to a pre-determined, static schedule. However, FSM-based approaches are only able to extract parallelism within a single execution flow. In the presence of coarse grained parallelism, in the form of concurrent function calls or parallel control structures, they either serialize all the operations, or build excessively complex controllers, aiming at executing as many operations as possible in a single control step (i.e., they try to extract as much instruction level parallelism as possible). The resulting controllers occupy an excessive amount of area or lead to very low operating frequencies. In this paper we propose a methodology for the HLS of accelerators supporting parallel execution and dynamic scheduling. The approach exploits an adaptive distributed controller, composed of a set of communicating elements associated with each operation. This controller design enables supporting multiple concurrent execution flows, thus increasing parallelism exploitation beyond instruction level parallelism. The approach also supports variable latency operations, such as memory accesses and speculative operations. We apply our methodology to a set of typical HLS benchmarks, and demonstrate valuable speed ups with limited area overheads with respect to conventional FSM-based flows.},
    keywords={high level synthesis;parallel programming;processor scheduling;program control structures;FSM;HLS;automated flow;coarse grained parallel applications;coarse grained parallelism;concurrent function calls;controller design;dynamic scheduling;embedded system designers;finite state machine;hardware accelerators;high level synthesis;instruction level parallelism;memory accesses;multiple concurrent execution flows;parallel control structures;parallel execution;programming languages;speculative operations;Complexity theory;Delays;Dynamic scheduling;Hardware;Processor scheduling;Runtime},
    doi={10.1109/FPT.2013.6718370},
    publisher={{IEEE}},
    }

  • [DOI] S. Lovergine and F. Ferrandi, “Dynamic AC-scheduling for hardware cores with unknown and uncertain information,” in Proceedings of the IEEE 31st International Conference on Computer Design (ICCD), 2013, pp. 475–478.
    [BibTeX] [Abstract]

    Modern hardware cores necessarily have to deal with many sources of unknown or uncertain information. Components with variable latency and unpredictable behavior are becoming predominant in hardware designs. Conventional hardware cores underperform when dealing with unknown or uncertain information. Common High-Level Synthesis (HLS) approaches, which require the complete behavior to be specified at design time, are significantly restricted in supporting these conditions. The literature proposes several dynamic scheduling techniques to improve core performance by handling the inherent uncertainty of applications. However, they do not address other sources of unknown information. In this paper, we propose dynamic Activating Conditions (AC)-scheduling: a methodology for the design automation of hardware cores which can dynamically adapt the instruction scheduling according to behaviors unknown at design time. Neither assumptions about component latency nor a worst-case approach is required. Experimental results show a significant performance increase, with limited area overhead, with respect to state-of-the-art approaches.

    @INPROCEEDINGS{ICCD2013,
    author={S. Lovergine and F. Ferrandi},
    title={Dynamic AC-scheduling for hardware cores with unknown and uncertain information},
    booktitle={Proceedings of the IEEE 31st International Conference on Computer Design (ICCD)},
    year={2013},
    pages={475--478},
    month={Oct},
    abstract={Modern hardware cores necessarily have to deal with many sources of unknown or uncertain information. Components with variable latency and unpredictable behavior are becoming predominant in hardware designs. Conventional hardware cores underperform when dealing with unknown or uncertain information. Common High-Level Synthesis (HLS) approaches, which require the complete behavior to be specified at design time, are significantly restricted in supporting these conditions. The literature proposes several dynamic scheduling techniques to improve core performance by handling the inherent uncertainty of applications. However, they do not address other sources of unknown information. In this paper, we propose dynamic Activating Conditions (AC)-scheduling: a methodology for the design automation of hardware cores which can dynamically adapt the instruction scheduling according to behaviors unknown at design time. Neither assumptions about component latency nor a worst-case approach is required. Experimental results show a significant performance increase, with limited area overhead, with respect to state-of-the-art approaches.},
    keywords={electronic design automation;high level synthesis;microprocessor chips;processor scheduling;HLS;activating conditions;design automation;dynamic AC-scheduling;hardware cores;hardware designs;high-level synthesis approaches;limited area overhead;uncertain information;variable latency;Benchmark testing;Design automation;Dynamic scheduling;Hardware;Parallel processing;Table lookup;Uncertainty;Dynamic Scheduling;HLS;Hardware Design;Uncertain Information},
    doi={10.1109/ICCD.2013.6657086},
    publisher={{IEEE}},
    }

  • [DOI] S. Lovergine and F. Ferrandi, “Harnessing Adaptivity Analysis for the Automatic Design of Efficient Embedded and HPC Systems,” in Proceedings of the IEEE 27th International Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), 2013, pp. 2298–2301.
    [BibTeX] [Abstract]

    In the past decades, design methodologies of Embedded Systems (ES) and High Performance Computing (HPC) systems have evolved following different trends. However, they are lately experiencing issues that affect both domains, whose solutions converge to similar approaches. Examples of issues affecting both domains are: large parallelism degrees, heterogeneity, power constraints, reliability issues, self-adaptation, and significant programming efforts to reach the desired performance on increasingly complex architectures. Systems able to dynamically adjust their behavior at run-time appear good candidates for the next computing generation, and will most probably condemn non-adaptable systems to rapid extinction. Adaptive systems can deal with uncertain and unpredictable conditions, due, for example, to reliability issues. In this paper we show how we can exploit adaptivity analysis to address several design challenges in embedded systems. The results show an average increase in performance of around 34% with respect to state-of-the-art methodologies, with a limited area overhead. Furthermore, we discuss our work-in-progress on the exploitation of adaptivity analysis to address new challenges in HPC systems design.

    @INPROCEEDINGS{IPDPSW2013A,
    author={S. Lovergine and F. Ferrandi},
    title={Harnessing Adaptivity Analysis for the Automatic Design of Efficient Embedded and HPC Systems},
    booktitle={Proceedings of the IEEE 27th International Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW)},
    year={2013},
    pages={2298--2301},
    month={May},
    abstract={In the past decades, design methodologies of Embedded Systems (ES) and High Performance Computing (HPC) systems have evolved following different trends. However, they are lately experiencing issues that affect both domains, whose solutions converge to similar approaches. Examples of issues affecting both domains are: large parallelism degrees, heterogeneity, power constraints, reliability issues, self-adaptation, and significant programming efforts to reach the desired performance on increasingly complex architectures. Systems able to dynamically adjust their behavior at run-time appear good candidates for the next computing generation, and will most probably condemn non-adaptable systems to rapid extinction. Adaptive systems can deal with uncertain and unpredictable conditions, due, for example, to reliability issues. In this paper we show how we can exploit adaptivity analysis to address several design challenges in embedded systems. The results show an average increase in performance of around 34\% with respect to state-of-the-art methodologies, with a limited area overhead. Furthermore, we discuss our work-in-progress on the exploitation of adaptivity analysis to address new challenges in HPC systems design.},
    keywords={adaptive systems;embedded systems;parallel processing;HPC systems design;adaptive systems;adaptivity analysis;automatic design;computing generation;embedded systems;heterogeneity;high performance computing systems;nonadaptable systems;parallelism degrees;power constraints;reliability;Clocks;Computational modeling;Computer architecture;Design automation;Embedded systems;Hardware;Parallel processing;Adaptivity Analysis;Embedded Systems;HPC},
    doi={10.1109/IPDPSW.2013.230},
    publisher={{IEEE}},
    }

  • [DOI] V. G. Castellana and F. Ferrandi, “Applications Acceleration through Adaptive Hardware Components,” in Proceedings of the IEEE 27th International Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), 2013, pp. 2274–2277.
    [BibTeX] [Abstract]

    High Level Synthesis (HLS) provides automatic flows for the generation of hardware accelerators starting from their behavioral description. HLS guarantees results comparable to hand-written designs for some application domains such as Digital Signal Processing. However, it is not yet able to cope with performance requirements when scaling the application complexity. One of the biggest limitations is an execution paradigm still based on the construction of a centralized Finite State Machine (FSM). Parallelism exploitation is thus bound to Instruction Level Parallelism within a single execution flow. This is in contrast to the current trends for hardware architectures and programming languages, which are progressively moving towards execution paradigms dominated by other types of parallelism, such as Task or Thread Level Parallelism. This work proposes a novel adaptive accelerator design, not based on the FSM execution paradigm, which provides support for dynamic parallel execution. Execution is dynamic, because no pre-computed schedule is considered. Operations are directly managed by dedicated lightweight hardware modules, which directly communicate to notify execution completion and to start other dependent operations. Execution is parallel, because several execution flows may run concurrently. The proposed design targets different application domains, from Embedded Systems accelerators to hybrid high-performance architectures.

    @INPROCEEDINGS{IPDPSW2013B,
    author={V. G. Castellana and F. Ferrandi},
    title={Applications Acceleration through Adaptive Hardware Components},
    booktitle={Proceedings of the IEEE 27th International Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW)},
    year={2013},
    pages={2274--2277},
    month={May},
    abstract={High Level Synthesis (HLS) provides automatic flows for the generation of hardware accelerators starting from their behavioral description. HLS guarantees results comparable to hand-written designs for some application domains such as Digital Signal Processing. However, it is not yet able to cope with performance requirements when scaling the application complexity. One of the biggest limitations is an execution paradigm still based on the construction of a centralized Finite State Machine (FSM). Parallelism exploitation is thus bound to Instruction Level Parallelism within a single execution flow. This is in contrast to the current trends for hardware architectures and programming languages, which are progressively moving towards execution paradigms dominated by other types of parallelism, such as Task or Thread Level Parallelism. This work proposes a novel adaptive accelerator design, not based on the FSM execution paradigm, which provides support for dynamic parallel execution. Execution is dynamic, because no pre-computed schedule is considered. Operations are directly managed by dedicated lightweight hardware modules, which directly communicate to notify execution completion and to start other dependent operations. Execution is parallel, because several execution flows may run concurrently. The proposed design targets different application domains, from Embedded Systems accelerators to hybrid high-performance architectures.},
    keywords={embedded systems;finite state machines;high level synthesis;parallel processing;signal processing;FSM execution paradigm;adaptive accelerator design;adaptive hardware components;automatic flows;behavioral description;centralized finite state machine;digital signal processing;dynamic parallel execution;embedded systems accelerators;execution completion;execution flows;hardware accelerators;hardware architectures;high level synthesis;hybrid high performance architectures;instruction level parallelism;lightweight hardware modules;parallelism exploitation;programming languages;single execution flow;thread level parallelism;Computer architecture;Design automation;Hardware;Processor scheduling;Registers;Runtime;Embedded Systems;High Level Synthesis;High Performance Computing Systems},
    doi={10.1109/IPDPSW.2013.244},
    publisher={{IEEE}},
    }

  • [PDF] [DOI] M. Lattuada and F. Ferrandi, “Modeling pipelined application with Synchronous Data Flow graphs,” in Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII), 2013, pp. 49–55.
    [BibTeX] [Abstract]

    Streaming applications can efficiently exploit multiprocessor architectures by means of pipelined parallelism, but designing this type of application can be a hard task. Several subproblems must indeed be solved: partitioning, mapping, scheduling and pipeline stage assignment. For this reason, high level abstraction models are adopted during the design flow since they simplify this process by hiding most of the architectural details. Synchronous Data Flow (SDF) graphs, widely adopted to describe streaming applications, naturally model only their partitioning, so they usually have to be integrated with other types of representations. In this paper Pipelined Application Modeling (PAM), a methodology to create a Synchronous Data Flow graph describing all the aspects of a pipelined application, is presented. The methodology starts from the SDF graph describing the partitioning of the application and enriches it with new actors and channels detailing the mapping, the scheduling and the pipeline stage assignment of the considered solution. The obtained SDF graph, describing all the aspects of the solution in a formal and compact way, facilitates the evaluation of different solutions during design space exploration.

    @INPROCEEDINGS{SAMOS2013,
    author={M. Lattuada and F. Ferrandi},
    booktitle={Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII)},
    title={Modeling pipelined application with Synchronous Data Flow graphs},
    year={2013},
    pages={49--55},
    month={Jul},
    abstract={Streaming applications can efficiently exploit multiprocessor architectures by means of pipelined parallelism, but designing this type of application can be a hard task. Several subproblems must indeed be solved: partitioning, mapping, scheduling and pipeline stage assignment. For this reason, high level abstraction models are adopted during the design flow since they simplify this process by hiding most of the architectural details. Synchronous Data Flow (SDF) graphs, widely adopted to describe streaming applications, naturally model only their partitioning, so they usually have to be integrated with other types of representations. In this paper Pipelined Application Modeling (PAM), a methodology to create a Synchronous Data Flow graph describing all the aspects of a pipelined application, is presented. The methodology starts from the SDF graph describing the partitioning of the application and enriches it with new actors and channels detailing the mapping, the scheduling and the pipeline stage assignment of the considered solution. The obtained SDF graph, describing all the aspects of the solution in a formal and compact way, facilitates the evaluation of different solutions during design space exploration.},
    keywords={data flow graphs;multiprocessing systems;pipeline processing;PAM method;SDF graphs;mapping problems;multiprocessors architectures;partitioning problems;pipeline stage assignment;pipelined application modeling method;pipelined parallelism;scheduling problems;streaming applications;synchronous data flow graphs;Analytical models;Data models;Optimization;Pipelines;Processor scheduling;Schedules;Synchronization},
    doi={10.1109/SAMOS.2013.6621105},
    pdf = {https://re.public.polimi.it/retrieve/handle/11311/768470/92472/samos2013_sdf.pdf},
    publisher={{IEEE}},
    }

2012

  • [PDF] [DOI] M. Lattuada and F. Ferrandi, “Performance estimation of embedded software with confidence levels,” in Proceedings of the 17th Asia and South Pacific Design Automation Conference, ASP-DAC, 2012, pp. 573–578.
    [BibTeX] [Abstract]

    Since time constraints are a very critical aspect of an embedded system, performance evaluation cannot be postponed to the end of the design flow, but has to be introduced from its early stages. Estimation techniques based on mathematical models are usually preferred during this phase since they quickly provide a quite accurate estimate of the application performance. However, the estimation error has to be considered during design space exploration to evaluate whether a solution can be accepted (e.g., by discarding solutions whose estimated time is too close to the constraint). Evaluating whether the possible error is significant by analyzing a point estimate is not a trivial task. In this paper we propose a methodology, based on statistical analysis, that provides a prediction interval on the estimation and a confidence level on meeting a time constraint. This information can drive design space exploration, reducing the number of solutions to be validated. The results show how the produced intervals effectively capture the estimation error introduced by a linear model.

    @inproceedings {ASPDAC2012,
    author = {M. Lattuada and F. Ferrandi},
    title = {Performance estimation of embedded software with confidence levels},
    booktitle = {Proceedings of the 17th Asia and South Pacific Design Automation Conference, {ASP-DAC}},
    publisher = {{IEEE}},
    year = {2012},
    pages = {573--578},
    location = {Sydney, Australia},
    month = {Jan},
    abstract={Since time constraints are a very critical aspect of an embedded system, performance evaluation cannot be postponed to the end of the design flow, but has to be introduced from its early stages. Estimation techniques based on mathematical models are usually preferred during this phase since they quickly provide a quite accurate estimate of the application performance. However, the estimation error has to be considered during design space exploration to evaluate whether a solution can be accepted (e.g., by discarding solutions whose estimated time is too close to the constraint). Evaluating whether the possible error is significant by analyzing a point estimate is not a trivial task. In this paper we propose a methodology, based on statistical analysis, that provides a prediction interval on the estimation and a confidence level on meeting a time constraint. This information can drive design space exploration, reducing the number of solutions to be validated. The results show how the produced intervals effectively capture the estimation error introduced by a linear model.},
    doi = {10.1109/ASPDAC.2012.6165022},
    pdf = {https://re.public.polimi.it/retrieve/handle/11311/665733/92470/aspdac2012_estimation.pdf}
    }

  • [DOI] S. Lovergine and F. Ferrandi, “Instructions Activating Conditions for Hardware-based Auto-scheduling,” in Proceedings of the 9th Conference on Computing Frontiers, 2012, pp. 253–256.
    [BibTeX]
    @inproceedings {CF2012,
    author = {Silvia Lovergine and Fabrizio Ferrandi},
    title = {Instructions Activating Conditions for Hardware-based Auto-scheduling},
    booktitle = {Proceedings of the 9th Conference on Computing Frontiers},
    series = {CF '12},
    publisher = {ACM},
    acmid = {2212946},
    numpages = {4},
    pages = {253--256},
    location = {Cagliari, Italy},
    year = {2012},
    keywords = {automatic parallelism exploitation, autoscheduling, dynamic scheduling, high-level synthesis},
    doi = {10.1145/2212908.2212946},
    }

  • [DOI] K. Bertels, A. Lattanzi, E. Ciavattini, F. Bettarelli, M. T. Chiaradia, R. Nutricato, A. Morea, A. Antola, F. Ferrandi, M. Lattuada, C. Pilato, D. Sciuto, R. J. Meeuws, Y. Yankova, V. M. Sima, K. Sigdel, W. Luk, J. G. Figueiredo Coutinho, Y. Ming Lam, T. Todman, A. Michelotti, and A. Cerruto, “The hArtes Tool Chain,” in Hardware/Software Co-design for Heterogeneous Multi-core Platforms: The hArtes Toolchain, K. Bertels, Ed., Springer Netherlands, 2012, pp. 9–109.
    [BibTeX]
    @InCollection{HARTES2012A,
    author={Bertels, Koen and Lattanzi, Ariano and Ciavattini, Emanuele and Bettarelli, Ferruccio and Chiaradia, Maria Teresa and Nutricato, Raffaele and Morea, Alberto and Antola, Anna and Ferrandi, Fabrizio and Lattuada, Marco and Pilato, Christian and Sciuto, Donatella and Meeuws, Roel J. and Yankova, Yana and Sima, Vlad Mihai and Sigdel, Kamana and Luk, Wayne and Figueiredo Coutinho, Jose Gabriel and Ming Lam, Yuet and Todman, Tim and Michelotti, Andrea and Cerruto, Antonio},
    title={The hArtes Tool Chain},
    booktitle={Hardware/Software Co-design for Heterogeneous Multi-core Platforms: The hArtes Toolchain},
    editor={Bertels, Koen},
    isbn={978-94-007-1406-9},
    pages={9--109},
    year={2012},
    publisher={Springer Netherlands},
    document_type={Book Chapter},
    doi={10.1007/978-94-007-1406-9_2},
    }

  • [DOI] S. Cecchi, L. Palestini, P. Peretti, A. Primavera, F. Piazza, F. Capman, S. Thabuteau, C. Levy, J. -F. Bonastre, A. Lattanzi, E. Ciavattini, F. Bettarelli, R. Toppi, E. Capucci, F. Ferrandi, M. Lattuada, C. Pilato, D. Sciuto, W. Luk, and J. G. De Figueiredo Coutinho, “In car audio,” in Hardware/Software Co-design for Heterogeneous Multi-core Platforms: The hArtes Toolchain, K. Bertels, Ed., Springer Netherlands, 2012, pp. 155–192.
    [BibTeX] [Abstract]

    In the last decade automotive audio has been gaining great attention from the scientific and industrial communities. In this context, a new approach to test and develop advanced audio algorithms for a heterogeneous embedded platform has been proposed within the European hArtes project. A real audio laboratory installed in a real car (hArtes CarLab) has been developed employing professional audio equipment. The algorithms can be tested and validated on a PC exploiting each application as a plug-in of the real time NU-Tech framework. Then a set of tools (hArtes Toolchain) can be used to generate code for the embedded platform starting from the plug-in implementation. An overview of the whole system is presented here, taking into consideration a complete set of audio algorithms developed for the advanced car infotainment system (ACIS) that is composed of three main applications regarding the In Car listening and communication experience. Starting from a high level description of the algorithms, several implementations on different levels of hardware abstraction are presented, along with empirical results on both the design process undergone and the performance results achieved.

    @INCOLLECTION{HARTES2012B,
    author={Cecchi, S. and Palestini, L. and Peretti, P. and Primavera, A. and Piazza, F. and Capman, F. and Thabuteau, S. and Levy, C. and Bonastre, J.-F. and Lattanzi, A. and Ciavattini, E. and Bettarelli, F. and Toppi, R. and Capucci, E. and Ferrandi, F. and Lattuada, M. and Pilato, C. and Sciuto, D. and Luk, W. and De Figueiredo Coutinho, J.G.},
    title={In car audio},
    booktitle={Hardware/Software Co-design for Heterogeneous Multi-core Platforms: The hArtes Toolchain},
    editor={Bertels, Koen},
    year={2012},
    pages={155--192},
    publisher={Springer Netherlands},
    affiliation={DIBET-Università Politecnica delle Marche, Via Brecce Bianche 1, Ancona, Italy; Thales Communications, 146 Bd de Valmy, Colombes, France; Université d’Avignon et des Pays de Vaucluse, 339 Chemin des Meinajaries, Avignon, France; Leaff Engineering, Via Puccini 75, Porto Potenza Picena, Italy; Faital Spa, Via B. Buozzi 12, San Donato Milanese, Italy; Politecnico di Milano, Via Ponzio 34/5, Milan, Italy; Imperial College, 180 Queen’s Gate, London, United Kingdom},
    abstract={In the last decade automotive audio has been gaining great attention from the scientific and industrial communities. In this context, a new approach to test and develop advanced audio algorithms for a heterogeneous embedded platform has been proposed within the European hArtes project. A real audio laboratory installed in a real car (hArtes CarLab) has been developed employing professional audio equipment. The algorithms can be tested and validated on a PC exploiting each application as a plug-in of the real time NU-Tech framework. Then a set of tools (hArtes Toolchain) can be used to generate code for the embedded platform starting from the plug-in implementation. An overview of the whole system is presented here, taking into consideration a complete set of audio algorithms developed for the advanced car infotainment system (ACIS) that is composed of three main applications regarding the In Car listening and communication experience. Starting from a high level description of the algorithms, several implementations on different levels of hardware abstraction are presented, along with empirical results on both the design process undergone and the performance results achieved.},
    document_type={Book Chapter},
    doi={10.1007/978-94-007-1406-9_5},
    }

  • [DOI] F. Bettarelli, E. Ciavattini, A. Lattanzi, G. Beltrame, F. Ferrandi, L. Fossati, C. Pilato, D. Sciuto, R. J. Meeuws, S. A. Ostadzadeh, Z. Nawaz, Y. Lu, T. Marconi, M. Sabeghi, V. M. Sima, and K. Sigdel, “Extensions of the hArtes tool chain,” in Hardware/Software Co-design for Heterogeneous Multi-core Platforms: The hArtes Toolchain, K. Bertels, Ed., Springer Netherlands, 2012, pp. 193–227.
    [BibTeX] [Abstract]

    In this chapter, we describe functionality that was also developed in the context of the hArtes project but was not included in the final release or is released separately. The development of the tools described here was often initiated after certain limitations of the current toolset were identified. This was the case for the memory analyser QUAD, which performs a detailed analysis of the memory accesses. Other tools, such as the rSesame tool, were developed and explored in parallel with the hArtes tool chain. This tool assumes a KPN-version of the application and then allows for high level simulation and experimentation with different mappings and partitionings. Finally, ReSP was developed to validate the partitioning results before a real implementation was possible.

    @INCOLLECTION{HARTES2012C,
    author={Bettarelli, F. and Ciavattini, E. and Lattanzi, A. and Beltrame, G. and Ferrandi, F. and Fossati, L. and Pilato, C. and Sciuto, D. and Meeuws, R.J. and Ostadzadeh, S.A. and Nawaz, Z. and Lu, Y. and Marconi, T. and Sabeghi, M. and Sima, V.M. and Sigdel, K.},
    title={Extensions of the hArtes tool chain},
    booktitle={Hardware/Software Co-design for Heterogeneous Multi-core Platforms: The hArtes Toolchain},
    editor={Bertels, Koen},
    pages={193--227},
    year={2012},
    publisher={Springer Netherlands},
    abstract={In this chapter, we describe functionality that was also developed in the context of the hArtes project but was not included in the final release or is released separately. The development of the tools described here was often initiated after certain limitations of the current toolset were identified. This was the case for the memory analyser QUAD, which performs a detailed analysis of the memory accesses. Other tools, such as the rSesame tool, were developed and explored in parallel with the hArtes tool chain. This tool assumes a KPN-version of the application and then allows for high level simulation and experimentation with different mappings and partitionings. Finally, ReSP was developed to validate the partitioning results before a real implementation was possible.},
    affiliation={Leaff Engineering, Via Puccini 75, Porto Potenza Picena, Italy; Politecnico di Milano, Via Ponzio 34/5, Milan, Italy; TU Delft, Delft, Netherlands},
    document_type={Book Chapter},
    doi={10.1007/978-94-007-1406-9_6},
    }

  • [DOI] V. G. Castellana and F. Ferrandi, “Abstract: Speeding-Up Memory Intensive Applications through Adaptive Hardware Accelerators,” in 2012 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC), 2012, pp. 1415–1416.
    [BibTeX] [Abstract]

    Heterogeneous architectures are becoming an increasingly relevant component for High-Performance Computing: they combine the computational power of multi-core processors with the flexibility of reconfigurable co-processor boards. Such boards are often composed of a set of standard Field Programmable Gate Arrays (FPGAs), coupled with a distributed memory architecture. This allows the concurrent execution of memory access operations. Nevertheless, since the execution latency of these operations may be unknown at compile-time, the synthesis of such parallelizing accelerators becomes a complex task. In fact, standard approaches require the construction of Finite State Machines (FSMs) whose complexity, in terms of number of states and transitions, increases exponentially with respect to the number of unbounded operations that may execute concurrently. We propose an adaptive architecture for such accelerators that overcomes this limitation, while exploiting the available parallelism. The proposed design methodology is compared with FSM-based approaches by means of a motivational example.

    @INPROCEEDINGS{SCC2012,
    author={V. G. Castellana and F. Ferrandi},
    title={Abstract: Speeding-Up Memory Intensive Applications through Adaptive Hardware Accelerators},
    booktitle={2012 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC)},
    year={2012},
    pages={1415--1416},
    month={Nov},
    abstract={Heterogeneous architectures are becoming an increasingly relevant component for High-Performance Computing: they combine the computational power of multi-core processors with the flexibility of reconfigurable co-processor boards. Such boards are often composed of a set of standard Field Programmable Gate Arrays (FPGAs), coupled with a distributed memory architecture. This allows the concurrent execution of memory access operations. Nevertheless, since the execution latency of these operations may be unknown at compile-time, the synthesis of such parallelizing accelerators becomes a complex task. In fact, standard approaches require the construction of Finite State Machines (FSMs) whose complexity, in terms of number of states and transitions, increases exponentially with respect to the number of unbounded operations that may execute concurrently. We propose an adaptive architecture for such accelerators that overcomes this limitation, while exploiting the available parallelism. The proposed design methodology is compared with FSM-based approaches by means of a motivational example.},
    keywords={coprocessors;distributed memory systems;field programmable gate arrays;finite state machines;multiprocessing systems;parallel memories;reconfigurable architectures;FSM-based approaches;accelerator parallelization;adaptive architecture;adaptive hardware accelerators;concurrent execution;distributed memory architecture;finite state machines;heterogeneous architectures;high-performance computing;memory access operations;memory intensive applications;multicore processor;reconfigurable coprocessor board flexibility;standard FPGA;standard field programmable gate arrays;FPGA;Hardware Accelerators;High Level Synthesis;Hybrid Architectures},
    doi={10.1109/SC.Companion.2012.226},
    publisher = {{IEEE}},
    }

2011

  • [DOI] C. Pilato, V. G. Castellana, S. Lovergine, and F. Ferrandi, “A runtime adaptive controller for supporting hardware components with variable latency,” in Proceedings of the NASA/ESA Conference on Adaptive Hardware and Systems (AHS), 2011, p. 153–160.
    [BibTeX] [Abstract]

    Nowadays, the design of hardware cores has to necessarily deal with unpredictable components, due to process variation or to the interaction with external modules (e.g., memories, sensors, IP cores). Adaptive systems are, thus, one of the most important solutions to substitute traditional approaches, based on analysis at design time, especially in critical environments. In this paper, we present an innovative lightweight controller architecture able to automatically adjust its behavior at run-time. It interacts with the surrounding environment by means of a simple token-based communication schema. We examine the capabilities of the proposed architectural model to adapt its behavior during the execution, compared to classical ones, such as the finite state machine.

    @inproceedings {AHS2011,
    author = {C. Pilato and V.G. Castellana and S. Lovergine and F. Ferrandi},
    title = {A runtime adaptive controller for supporting hardware components with variable latency},
    booktitle = {Proceedings of the NASA/ESA Conference on Adaptive Hardware and Systems (AHS)},
    publisher = {{IEEE}},
    year = {2011},
    pages = {153--160},
    abstract={Nowadays, the design of hardware cores has to necessarily deal with unpredictable components, due to process variation or to the interaction with external modules (e.g., memories, sensors, IP cores). Adaptive systems are, thus, one of the most important solutions to substitute traditional approaches, based on analysis at design time, especially in critical environments. In this paper, we present an innovative lightweight controller architecture able to automatically adjust its behavior at run-time. It interacts with the surrounding environment by means of a simple token-based communication schema. We examine the capabilities of the proposed architectural model to adapt its behavior during the execution, compared to classical ones, such as the finite state machine.},
    keywords={adaptive control;computer architecture;control system analysis computing;adaptive systems;architectural model;critical environments;design time analysis;external modules;finite state machine;hardware cores design;innovative lightweight controller architecture;process variation;runtime adaptive controller;supporting hardware components;token-based communication schema;unpredictable components;variable latency;Clocks;Lead;Monitoring},
    doi={10.1109/AHS.2011.5963930},
    }

  • [PDF] [DOI] C. Pilato, F. Ferrandi, and D. Sciuto, “A design methodology to implement memory accesses in High-Level Synthesis,” in Proceedings of the 9th International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS, 2011, p. 49–58.
    [BibTeX] [Abstract]

    Nowadays, memory synthesis is becoming the main bottleneck for the generation of efficient hardware accelerators. This paper presents a design methodology to efficiently and automatically implement memory accesses in High-Level Synthesis. In particular, the approach starts from a behavioral specification (in pure C language) and a set of design constraints, such as the memory addresses where some of the data are stored. The methodology classifies which variables can be internally or externally allocated to the different modules to generate the proper architecture, fully supporting a wide range of C constructs, such as pointer arithmetic, function calls and array accesses. Moreover, it allows the accesses to be parallelized when the memory address is known at compile time, resulting in an efficient execution of the specification.

    @inproceedings {CODES2011,
    author = {C. Pilato and F. Ferrandi and D. Sciuto},
    title = {A design methodology to implement memory accesses in High-Level Synthesis},
    booktitle = {Proceedings of the 9th International Conference on Hardware/Software Codesign and System Synthesis, {CODES+ISSS}},
    publisher = {{ACM}},
    location ={Taipei, Taiwan},
    year = {2011},
    pages = {49--58},
    yy = {2011},
    abstract={Nowadays, memory synthesis is becoming the main bottleneck for the generation of efficient hardware accelerators. This paper presents a design methodology to efficiently and automatically implement memory accesses in High-Level Synthesis. In particular, the approach starts from a behavioral specification (in pure C language) and a set of design constraints, such as the memory addresses where some of the data are stored.
    The methodology classifies which variables can be internally or externally allocated to the different modules to generate the proper architecture, fully supporting a wide range of C constructs, such as pointer arithmetic, function calls and array accesses. Moreover, it allows the accesses to be parallelized when the memory address is known at compile time, resulting in an efficient execution of the specification.},
    pdf = {wp-content/papercite-data/pdf/CODES2011.pdf},
    doi = {10.1145/2039370.2039381},
    }

  • [URL] M. Elhoj, A. Reis, R. Ribas, F. Ferrandi, C. Pilato, F. Moll, M. Miranda, P. Dobrovolny, N. Woolaway, A. Grasset, P. Bonnot, G. Desoli, and D. Pandini, “SYNAPTIC Project: Regularity Applied to Enhance Manufacturability and Yield at Several Abstraction Levels,” in Proceedings of the 1st Exploiting Regularity in the Design of IPs, Architectures and Platforms Workshop, (ERDIAP ’11), 2011, p. 189–192.
    [BibTeX] [Abstract]

    In this paper, we describe a project to enhance manufacturability at several abstraction levels. The project targets several different abstraction levels seen through a design flow targeting regular approaches. The project intends to verify the role of applying regularity at different levels compared to a golden design flow used as reference. The SYNAPTIC project will span for three years involving eight different institutions, and this paper describes the intended goals.

    @inproceedings {ERDIAP2011A,
    author = {M. Elhoj and A. Reis and R. Ribas and F. Ferrandi and C. Pilato and F. Moll and M. Miranda and P. Dobrovolny and N. Woolaway and A. Grasset and P. Bonnot and G. Desoli and D. Pandini},
    title = {SYNAPTIC Project: Regularity Applied to Enhance Manufacturability and Yield at Several Abstraction Levels},
    booktitle = {Proceedings of the 1st Exploiting Regularity in the Design of IPs, Architectures and Platforms Workshop, {(ERDIAP '11)}},
    editor = {Dimitrios Soudris and {Wolfgang Karl}},
    month = {February},
    year = {2011},
    pages = {189--192},
    numpages = {4},
    publisher = {VDE Verlag},
    isbn = {978-3-8007-3333-0},
    abstract = {In this paper, we describe a project to enhance manufacturability at several abstraction levels. The project targets several different abstraction levels seen through a design flow targeting regular approaches. The project intends to verify the role of applying regularity at different levels compared to a golden design flow used as reference. The SYNAPTIC project will span for three years involving eight different institutions, and this paper describes the intended goals.},
    url = {https://www.vde-verlag.de/proceedings-en/563333026.html},
    }

  • [URL] C. Pilato, F. Ferrandi, and D. Pandini, “Evaluating Static CMOS Complex Cells in Technology Mapping,” in Proceedings of the 1st Exploiting Regularity in the Design of IPs, Architectures and Platforms Workshop, (ERDIAP ’11), 2011, p. 222–229.
    [BibTeX] [Abstract]

    Current EDA tools are often based on standard-cell libraries for the design of modern complex systems-on-chip. In general, the composition of such libraries does not follow a fixed rule, but it is mainly based on the experience of the chip foundries. They compact or extend the standard cell libraries by removing or adding certain implementations, respectively, in order to optimize specific goals (e.g., area, timing or power consumption) or a specific set of designs. In this paper, we define and present a comprehensive study about the effects of using static CMOS complex gates in technology mapping. The impact of such cells has been evaluated on several benchmarks usually adopted in logic synthesis targeting a 45nm technology with Synopsys Design Compiler.

    @inproceedings {ERDIAP2011b,
    author = {Christian Pilato and Fabrizio Ferrandi and Davide Pandini},
    mm = {2},
    yy = {2011},
    month = {February},
    year = {2011},
    title = {Evaluating Static CMOS Complex Cells in Technology Mapping},
    editor = {Dimitrios Soudris and {Wolfgang Karl}},
    booktitle = {Proceedings of the 1st Exploiting Regularity in the Design of IPs, Architectures and Platforms Workshop, {(ERDIAP '11)}},
    pages = {222--229},
    numpages = {8},
    publisher = {VDE Verlag},
    isbn = {978-3-8007-3333-0},
    abstract = {Current EDA tools are often based on standard-cell libraries for the design of modern complex systems-on-chip. In general, the composition of such libraries does not follow a fixed rule, but it is mainly based on the experience of the chip foundries. They compact or extend the standard cell libraries by removing or adding certain implementations, respectively, in order to optimize specific goals (e.g., area, timing or power consumption) or a specific set of designs. In this paper, we define and present a comprehensive study about the effects of using static CMOS complex gates in technology mapping. The impact of such cells has been evaluated on several benchmarks usually adopted in logic synthesis targeting a 45nm technology with Synopsys Design Compiler.},
    url = {https://www.vde-verlag.de/proceedings-en/563333031.html},
    }

  • [DOI] C. Pilato, F. Ferrandi, and D. Pandini, “A design methodology for the automatic sizing of standard-cell libraries,” in Proceedings of the 21st ACM Great Lakes Symposium on VLSI, 2011, p. 151–156.
    [BibTeX]
    @inproceedings {GLSVLSI2011,
    author = {Christian Pilato and Fabrizio Ferrandi and Davide Pandini},
    title = {A design methodology for the automatic sizing of standard-cell libraries},
    booktitle = {Proceedings of the 21st ACM Great Lakes Symposium on VLSI},
    editor = {David Atienza and Yuan Xie and Jos{\'e} L. Ayala and Ken S. Stevens},
    publisher = {ACM},
    isbn = {978-1-4503-0667-6},
    pages = {151--156},
    year = {2011},
    doi = {10.1145/1973009.1973040},
    }

  • [DOI] G. Kuzmanov, V. M. Sima, K. Bertels, J. G. F. de Coutinho, W. Luk, G. Marchiori, R. Tripiccione, and F. Ferrandi, “Reconfigurable Computing: From FPGAs to Hardware/Software Codesign.” Springer Verlag, 2011, p. 91–115.
    [BibTeX] [Abstract]

    When targeting heterogeneous, multi-core platforms, system and application developers are not only confronted with the challenge of choosing the best hardware configuration for the application they need to map, but also the application has to be modified such that certain parts are executed on the most appropriate hardware component. The hArtes toolchain provides (semi) automatic support to the designer for this mapping effort. A hardware platform was specifically designed for the project, which consists of an ARM processor, a DSP and an FPGA. The toolchain, targeting this platform but potentially targeting any similar system, has been tested and validated on several computationally intensive applications and resulted in substantial speedups as well as drastically reduced development times. We report speedups of up to nine times compared against a pure ARM based execution, and mapping can be done in minutes. The toolchain thus allows for easy design space exploration to find the best mapping, given hardware availability and real time execution constraints.

    @Inbook {RECONFIGURABLECOMPUTING2011,
    author = {G. Kuzmanov and V.M. Sima and K. Bertels and J.G.F. de Coutinho and W. Luk and G. Marchiori and R. Tripiccione and F. Ferrandi},
    chapter={hArtes: Holistic Approach to Reconfigurable Real-Time Embedded Systems},
    title={Reconfigurable Computing: From {FPGA}s to Hardware/Software Codesign},
    publisher = {Springer Verlag},
    year = {2011},
    pages = {91--115},
    abstract={When targeting heterogeneous, multi-core platforms, system and application developers are not only confronted with the challenge of choosing the best hardware configuration for the application they need to map, but also the application has to be modified such that certain parts are executed on the most appropriate hardware component. The hArtes toolchain provides (semi) automatic support to the designer for this mapping effort. A hardware platform was specifically designed for the project, which consists of an ARM processor, a DSP and an FPGA. The toolchain, targeting this platform but potentially targeting any similar system, has been tested and validated on several computationally intensive applications and resulted in substantial speedups as well as drastically reduced development times. We report speedups of up to nine times compared against a pure ARM based execution, and mapping can be done in minutes. The toolchain thus allows for easy design space exploration to find the best mapping, given hardware availability and real time execution constraints.},
    doi={10.1007/978-1-4614-0061-5_5},
    }

  • [URL] S. Cecchi, A. Primavera, F. Piazza, F. Bettarelli, E. Ciavattini, R. Toppi, J. G. F. Coutinho, W. Luk, C. Pilato, F. Ferrandi, V. Sima, and K. Bertels, “The hArtes CarLab: A New Approach to Advanced Algorithms Development for Automotive Audio,” J. Audio Eng. Soc, vol. 59, iss. 11, p. 858–869, 2011.
    [BibTeX]
    @article{cecchi2011the,
    title = {The hArtes CarLab: A New Approach to Advanced Algorithms Development for Automotive Audio},
    author = {Cecchi, Stefania and Primavera, Andrea and Piazza, Francesco and Bettarelli, Ferruccio and Ciavattini, Emanuele and Toppi, Romolo and Coutinho, Jose G. F. and Luk, Wayne and Pilato, Christian and Ferrandi, Fabrizio and Sima, Vlad-Mihai and Bertels, Koen},
    journal = {J. Audio Eng. Soc},
    volume = {59},
    number = {11},
    pages = {858--869},
    year = {2011},
    url = {http://www.aes.org/e-lib/browse.cfm?elib=16153}
    }

2010

  • [URL] K. Bertels, F. Bettarelli, S. Cecchi, E. Ciavattini, J. D. F. Coutinho, F. Ferrandi, W. Luk, F. Piazza, C. Pilato, A. Primavera, V. Sima, and R. Toppi, “The hArtes CarLab: A New Approach to Advanced Algorithms Development for Automotive Audio,” in Audio Engineering Society Convention 129, 2010.
    [BibTeX]
    @conference {AESC2010,
    author = {Koen Bertels and Ferruccio Bettarelli and Stefania Cecchi and Emanuele Ciavattini and Jose De Figueiredo Coutinho and Fabrizio Ferrandi and Wayne Luk and Francesco Piazza and Christian Pilato and Andrea Primavera and Vlad Sima and Romolo Toppi},
    year = {2010},
    month = {11},
    booktitle = {Audio Engineering Society Convention 129},
    title = {The hArtes CarLab: A New Approach to Advanced Algorithms Development for Automotive Audio},
    url = {http://www.aes.org/e-lib/browse.cfm?elib=15605},
    }

  • [DOI] C. Pilato, D. Loiacono, A. Tumeo, F. Ferrandi, P. L. Lanzi, and D. Sciuto, “Computational Intelligence in Expensive Optimization Problems,” Y. Tenne and C. Goh, Eds., Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, p. 701–723.
    [BibTeX]
    @Inbook{ALO2010,
    author="Pilato, Christian
    and Loiacono, Daniele
    and Tumeo, Antonino
    and Ferrandi, Fabrizio
    and Lanzi, Pier Luca
    and Sciuto, Donatella",
    editor="Tenne, Yoel
    and Goh, Chi-Keong",
    chapter="Speeding-Up Expensive Evaluations in High-Level Synthesis Using Solution Modeling and Fitness Inheritance",
    title="Computational Intelligence in Expensive Optimization Problems",
    year="2010",
    publisher="Springer Berlin Heidelberg",
    address="Berlin, Heidelberg",
    pages="701--723",
    isbn="978-3-642-10701-6",
    doi="10.1007/978-3-642-10701-6_26",
    }

  • [DOI] F. Ferrandi, C. Pilato, D. Sciuto, and A. Tumeo, “Mapping and scheduling of parallel C applications with ant colony optimization onto heterogeneous reconfigurable MPSoCs,” in Proceedings of the 15th Asia South Pacific Design Automation Conference, ASP-DAC 2010, Taipei, Taiwan, January 18-21, 2010, 2010, p. 799–804.
    [BibTeX]
    @inproceedings{ASPDAC2010,
    author = {Fabrizio Ferrandi and Christian Pilato and Donatella Sciuto and Antonino Tumeo},
    title = {Mapping and scheduling of parallel {C} applications with ant colony optimization onto heterogeneous reconfigurable MPSoCs},
    booktitle = {Proceedings of the 15th Asia South Pacific Design Automation Conference, {ASP-DAC} 2010, Taipei, Taiwan, January 18-21, 2010},
    pages = {799--804},
    year = {2010},
    doi = {10.1109/ASPDAC.2010.5419782},
    publisher = {{IEEE}},
    }

  • [PDF] [DOI] M. Lattuada and F. Ferrandi, “Performance modeling of embedded applications with zero architectural knowledge,” in Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, 2010, p. 277–286.
    [BibTeX] [Abstract]

    Performance estimation is a key step in the development of an embedded system. Normally, the performance evaluation is performed using a simulator or a performance mathematical model of the target architecture. However, both these approaches are usually based on the knowledge of the architectural details of the target. In this paper we present a methodology for automatically building an analytical model to estimate the performance of an application on a generic processor without requiring any information about the processor architecture but the one provided by the GNU GCC Intermediate Representation. The proposed methodology exploits the linear regression technique based on an application analysis performed on the Register Transfer Level internal representation of the GNU GCC compiler. The benefits of working with this type of model and with this intermediate representation are three: we take into account most of the compiler optimizations, we implicitly consider some architectural characteristics of the target processor and we can easily estimate the performance of portions of the specification. We validate our approach by evaluating with cross-validation technique the accuracy and the generality of the performance models built for the ARM926EJ-S and the LEON3 processors.

    @inproceedings {CODES2010,
    author = {Marco Lattuada and Fabrizio Ferrandi},
    abstract = {Performance estimation is a key step in the development of an embedded system. Normally, the performance evaluation is performed using a simulator or a performance mathematical model of the target architecture. However, both these approaches are usually based on the knowledge of the architectural details of the target. In this paper we present a methodology for automatically building an analytical model to estimate the performance of an application on a generic processor without requiring any information about the processor architecture but the one provided by the GNU GCC Intermediate Representation. The proposed methodology exploits the linear regression technique based on an application analysis performed on the Register Transfer Level internal representation of the GNU GCC compiler. The benefits of working with this type of model and with this intermediate representation are three: we take into account most of the compiler optimizations, we implicitly consider some architectural characteristics of the target processor and we can easily estimate the performance of portions of the specification. We validate our approach by evaluating with cross-validation technique the accuracy and the generality of the performance models built for the ARM926EJ-S and the LEON3 processors.},
    keywords = {gnu gcc, performance estimation, profiling},
    publisher = {ACM},
    acmid = {1879010},
    numpages = {10},
    pages = {277--286},
    location = {Scottsdale, Arizona, USA},
    isbn = {978-1-60558-905-3},
    year = {2010},
    series = {CODES/ISSS '10},
    booktitle = {Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis},
    title = {Performance modeling of embedded applications with zero architectural knowledge},
    yy = {2010},
    doi = {10.1145/1878961.1879010},
    pdf = {https://re.public.polimi.it/retrieve/handle/11311/575582/92463/codes2010-estimation.pdf},
    }

  • F. Ferrandi, M. Lattuada, C. Pilato, and D. Sciuto, “Performance Estimation for Mapping and Scheduling Parallel Applications on Heterogeneous Multi-Processor Systems,” in Workshop on The European landscape of reconfigurable computing: Lessons learned, new perspectives and innovations, 2010.
    [BibTeX] [Abstract]

    Mapping and scheduling of parallel applications on heterogeneous reconfigurable multi-processor systems is a crucial step in embedded system design. In fact, to efficiently explore the design space, any algorithm that approaches these problems necessarily needs accurate performance estimations for the different tasks which compose the applications on each of the target processing elements. Moreover, because of their particular architectural characteristics, the different types of processing elements (e.g., general purpose or digital signal processors or reprogrammable devices, such as FPGAs) require specific estimation techniques. In this work, we present a methodology that, given a model of the target architecture, combines ad-hoc estimation techniques, based on machine learning, and a heuristic search method, based on Ant Colony Optimization, to efficiently map and schedule parallel applications on heterogeneous platforms. The proposed methodology has been integrated into Zebu, one of the tools that compose the toolchain of the hArtes project.

    @inproceedings {DATE2010,
    author = {Fabrizio Ferrandi and Marco Lattuada and Christian Pilato and Donatella Sciuto},
    title = {Performance Estimation for Mapping and Scheduling Parallel Applications on Heterogeneous Multi-Processor Systems},
    booktitle = {Workshop on The European landscape of reconfigurable computing: Lessons learned, new perspectives and innovations},
    location = {held during DATE '10, Dresden, Germany},
    year = {2010},
    abstract = {Mapping and scheduling of parallel applications on heterogeneous reconfigurable multi-processor systems is a crucial step in embedded system design.
    In fact, to efficiently explore the design space, any algorithm that approaches these problems necessarily needs accurate performance estimations for the different tasks which compose the applications on each of the target processing elements.
    Moreover, because of their particular architectural characteristics, the different types of processing elements (e.g., general purpose or digital signal processors or reprogrammable devices, such as FPGAs) require specific estimation techniques.
    In this work, we present a methodology that, given a model of the target architecture, combines ad-hoc estimation techniques, based on machine learning, and a heuristic search method, based on Ant Colony Optimization, to efficiently map and schedule parallel applications on heterogeneous platforms.
    The proposed methodology has been integrated into Zebu, one of the tools that compose the toolchain of the hArtes project.},
    }

  • [PDF] [DOI] M. Lattuada and F. Ferrandi, “Combining Target-independent Analysis with Dynamic Profiling to Build the Performance Model of a DSP,” in Proceedings of the 2010 10th IEEE International Conference on Computer and Information Technology, 2010, p. 1895–1901.
    [BibTeX] [Abstract]

    Fast and accurate performance estimation is a key aspect of heterogeneous embedded systems design flow, since cycle-accurate simulators, when they exist, are usually too slow to be used during design space exploration. Performance estimation techniques are usually based on combination of estimation of the single processing elements which compose the system. Architectural characteristics of Digital Signal Processors (DSP), such as the presence of Single Instruction Multiple Data operations or of special hardware units to control loop executions, introduce peculiar aspects in the performance estimation problem. In this paper we present a methodology to estimate the performance of a function on a given dataset on a DSP. Estimation is performed combining the host profiling data with the function GNU GCC GIMPLE representation. Starting from the results of this analysis, we build a performance model of a DSP by exploiting the Linear Regression Technique. Use of GIMPLE representation allows the target-independent optimizations performed by the DSP compiler to be taken directly into account. We validate our approach by building a performance model of the MagicV DSP and by testing the model on a set of significant benchmarks.

    @inproceedings {ICESS2010,
    author = {Marco Lattuada and Fabrizio Ferrandi},
    title = {Combining Target-independent Analysis with Dynamic Profiling to Build the Performance Model of a {DSP}},
    booktitle = {Proceedings of the 2010 10th IEEE International Conference on Computer and Information Technology},
    location = {Bradford, West Yorkshire, UK},
    publisher = {IEEE Computer Society},
    acmid = {1901146},
    numpages = {7},
    pages = {1895--1901},
    isbn = {978-0-7695-4108-2},
    year = {2010},
    series = {CIT '10},
    abstract = {Fast and accurate performance estimation is a key aspect of heterogeneous embedded systems design flow, since cycle-accurate simulators, when they exist, are usually too slow to be used during design space exploration. Performance estimation techniques are usually based on combination of estimation of the single processing elements which compose the system. Architectural characteristics of Digital Signal Processors (DSP), such as the presence of Single Instruction Multiple Data operations or of special hardware units to control loop executions, introduce peculiar aspects in the performance estimation problem. In this paper we present a methodology to estimate the performance of a function on a given dataset on a DSP. Estimation is performed combining the host profiling data with the function GNU GCC GIMPLE representation. Starting from the results of this analysis, we build a performance model of a DSP by exploiting the Linear Regression Technique. Use of GIMPLE representation allows the target-independent optimizations performed by the DSP compiler to be taken directly into account. We validate our approach by building a performance model of the MagicV DSP and by testing the model on a set of significant benchmarks.},
    keywords = {DSP, performance estimation, linear regression},
    doi = {10.1109/CIT.2010.324},
    pdf = {https://re.public.polimi.it/retrieve/handle/11311/575581/92457/icess2010_zebu.pdf},
    }

  • [DOI] C. Pilato, F. Ferrandi, and D. Pandini, “A Fast Heuristic for Extending Standard Cell Libraries with Regular Macro Cells,” in VLSI (ISVLSI), 2010 IEEE Computer Society Annual Symposium on, 2010, p. 23–28.
    [BibTeX]
    @INPROCEEDINGS{ISVLSI2010,
    author={C. Pilato and F. Ferrandi and D. Pandini},
    booktitle={VLSI (ISVLSI), 2010 IEEE Computer Society Annual Symposium on},
    title={A Fast Heuristic for Extending Standard Cell Libraries with Regular Macro Cells},
    year={2010},
    pages={23--28},
    keywords={integrated circuit design;integrated circuit technology;minimisation;network routing;power consumption;area minimization;circuit implementation;compound gates;extending standard cell libraries;fast heuristic;industrial design flows;physical design;power consumption;regular macro cells;routing effects;technology mapping;Boolean functions;Compounds;Data structures;Libraries;Logic gates;Timing;Transistors;Boolean matching;Cell Generator;Logic Synthesis},
    doi={10.1109/ISVLSI.2010.69},
    month={July},
    publisher={{IEEE}},
    }

  • [PDF] [DOI] K. Bertels, V. Sima, Y. Yankova, G. Kuzmanov, W. Luk, G. Coutinho, F. Ferrandi, C. Pilato, M. Lattuada, D. Sciuto, and A. Michelotti, “HArtes: Hardware-Software Codesign for Heterogeneous Multicore Platforms,” IEEE Micro, vol. 30, p. 88–97, 2010.
    [BibTeX] [Abstract]

    Developing heterogeneous multicore platforms requires choosing the best hardware configuration for mapping the application, and modifying that application so that different parts execute on the most appropriate hardware component. The hArtes toolchain provides the option of automatic or semi-automatic support for this mapping. During test and validation on several computation-intensive applications, hArtes achieved substantial speedups and drastically reduced development times.

    @article {MICRO2010,
    abstract = {Developing heterogeneous multicore platforms requires choosing the best hardware configuration for mapping the application, and modifying that application so that different parts execute on the most appropriate hardware component. The hArtes toolchain provides the option of automatic or semi-automatic support for this mapping. During test and validation on several computation-intensive applications, hArtes achieved substantial speedups and drastically reduced development times.},
    keywords = {reconfigurable hardware, hardware-software interface, compiler, tool chain, hArtes, heterogeneous multicore platforms},
    address = {Los Alamitos, CA, USA},
    publisher = {IEEE Computer Society Press},
    acmid = {1916498},
    numpages = {10},
    pages = {88--97},
    issn = {0272-1732},
    year = {2010},
    month = {September},
    number = {5},
    volume = {30},
    issue_date = {September 2010},
    journal = {IEEE Micro},
    title = {HArtes: Hardware-Software Codesign for Heterogeneous Multicore Platforms},
    yy = {2010},
    mm = {9},
    author = {Koen Bertels and Vlad-Mihai Sima and Yana Yankova and Georgi Kuzmanov and Wayne Luk and Gabriel Coutinho and Fabrizio Ferrandi and Christian Pilato and Marco Lattuada and Donatella Sciuto and Andrea Michelotti},
    doi = {10.1109/MM.2010.91},
    pdf = {https://re.public.polimi.it/retrieve/handle/11311/575576/142795/paper.pdf},
    }

  • C. Pilato, F. Ferrandi, and D. Sciuto, “A Design Exploration Framework for Mapping and Scheduling onto Heterogeneous MPSoCs,” in Workshop on Mapping Applications to MPSoCs 2010, 2010.
    [BibTeX]
    @inproceedings {MPSOC2010,
    location = {St. Goar, Germany},
    year = {2010},
    booktitle = {Workshop on Mapping Applications to MPSoCs 2010},
    title = {A Design Exploration Framework for Mapping and Scheduling onto Heterogeneous {MPSoCs}},
    yy = {2010},
    author = {Christian Pilato and Fabrizio Ferrandi and Donatella Sciuto}
    }

  • [PDF] [DOI] M. Lattuada and F. Ferrandi, “Fine grain analysis of simulators accuracy for calibrating performance models,” in Rapid System Prototyping (RSP), 2010 21st IEEE International Symposium on, 2010, p. 1–7.
    [BibTeX] [Abstract]

    In embedded system design, the tuning and validation of a cycle accurate simulator is a difficult task. The designer has to assure that the estimation error of the simulator meets the design constraints on every application. If an application is not correctly estimated, the designer has to identify on which parts of the application the simulator introduces an estimation error and consequently fix the simulator. However, detecting which are the mispredicted parts of a very large application can be a difficult process which requires a lot of time. In this paper we propose a methodology which helps the designer to fast and automatically isolate the portions of the application mispredicted by a simulator. This is accomplished by recursively analyzing the application source code trace highlighting the mispredicted sections of source code. The results obtained applying the methodology to the TSIM simulator show how our methodology is able to fast analyze large applications isolating small portions of mispredicted code.

    @inproceedings {RSP2010,
    author = {Marco Lattuada and Fabrizio Ferrandi},
    title = {Fine grain analysis of simulators accuracy for calibrating performance models},
    booktitle = {Rapid System Prototyping (RSP), 2010 21st IEEE International Symposium on},
    keywords = {TSIM simulator;code misprediction;cycle accurate simulator;embedded system design;estimation error;fine grain analysis;performance model calibration;source code trace;embedded systems;iterative methods;multiprocessing systems;performance evaluation;recursive estimation;source coding;},
    abstract = {In embedded system design, the tuning and validation of a cycle accurate simulator is a difficult task. The designer has to assure that the estimation error of the simulator meets the design constraints on every application. If an application is not correctly estimated, the designer has to identify on which parts of the application the simulator introduces an estimation error and consequently fix the simulator. However, detecting which are the mispredicted parts of a very large application can be a difficult process which requires a lot of time. In this paper we propose a methodology which helps the designer to fast and automatically isolate the portions of the application mispredicted by a simulator. This is accomplished by recursively analyzing the application source code trace highlighting the mispredicted sections of source code. The results obtained applying the methodology to the TSIM simulator show how our methodology is able to fast analyze large applications isolating small portions of mispredicted code.},
    pages = {1--7},
    month = {june},
    year = {2010},
    doi = {10.1109/RSP.2010.5656414},
    pdf = {https://re.public.polimi.it/retrieve/handle/11311/575579/92453/rsp2010_zebu.pdf},
    publisher={{IEEE}},
    }

  • [PDF] [DOI] F. Ferrandi, P. L. Lanzi, C. Pilato, D. Sciuto, and A. Tumeo, “Ant Colony Heuristic for Mapping and Scheduling Tasks and Communications on Heterogeneous Embedded Systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 29, iss. 6, pp. 911-924, 2010.
    [BibTeX] [Abstract]

    To exploit the power of modern heterogeneous multiprocessor embedded platforms on partitioned applications, the designer usually needs to efficiently map and schedule all the tasks and the communications of the application, respecting the constraints imposed by the target architecture. Since the problem is heavily constrained, common methods used to explore such design space usually fail, obtaining low-quality solutions. In this paper, we propose an ant colony optimization (ACO) heuristic that, given a model of the target architecture and the application, efficiently executes both scheduling and mapping to optimize the application performance. We compare our approach with several other heuristics, including simulated annealing, tabu search, and genetic algorithms, on the performance to reach the optimum value and on the potential to explore the design space. We show that our approach obtains better results than other heuristics by at least 16% on average, despite an overhead in execution time. Finally, we validate the approach by scheduling and mapping a JPEG encoder on a realistic target architecture.

    @ARTICLE{TCAD2010,
    author={F. Ferrandi and P. L. Lanzi and C. Pilato and D. Sciuto and A. Tumeo},
    journal={IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems},
    title={Ant Colony Heuristic for Mapping and Scheduling Tasks and Communications on Heterogeneous Embedded Systems},
    year={2010},
    volume={29},
    number={6},
    pages={911-924},
    abstract={To exploit the power of modern heterogeneous multiprocessor embedded platforms on partitioned applications, the designer usually needs to efficiently map and schedule all the tasks and the communications of the application, respecting the constraints imposed by the target architecture. Since the problem is heavily constrained, common methods used to explore such design space usually fail, obtaining low-quality solutions. In this paper, we propose an ant colony optimization (ACO) heuristic that, given a model of the target architecture and the application, efficiently executes both scheduling and mapping to optimize the application performance. We compare our approach with several other heuristics, including simulated annealing, tabu search, and genetic algorithms, on the performance to reach the optimum value and on the potential to explore the design space. We show that our approach obtains better results than other heuristics by at least 16% on average, despite an overhead in execution time. Finally, we validate the approach by scheduling and mapping a JPEG encoder on a realistic target architecture.},
    keywords={computational complexity;embedded systems;genetic algorithms;multiprocessing systems;scheduling;search problems;simulated annealing;system-on-chip;JPEG encoder;ant colony optimization heuristic;genetic algorithms;heterogeneous embedded systems;multiprocessor embedded platforms;simulated annealing;tabu search;task mapping;task scheduling;Ant colony optimization;Approximation algorithms;Embedded system;Field programmable gate arrays;Genetic algorithms;Processor scheduling;Scheduling algorithm;Simulated annealing;Space exploration;Stochastic processes;Ant colony optimization (ACO);communications;field programmable gate arrays (FPGA);mapping;multiprocessors;scheduling},
    ISSN={0278-0070},
    month={June},
    doi={10.1109/TCAD.2010.2048354},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/572925/92949/trans_mapping.pdf},
    publisher={{IEEE}},
    }

2009

  • [DOI] A. Tumeo, M. Branca, L. Camerini, C. Pilato, P. L. Lanzi, F. Ferrandi, and D. Sciuto, “Mapping pipelined applications onto heterogeneous embedded systems: a bayesian optimization algorithm based approach,” in Proceedings of the 7th International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS 2009, Grenoble, France, October 11-16, 2009, 2009, p. 443–452.
    [BibTeX]
    @inproceedings{CODES2009,
    author = {Antonino Tumeo and
    Marco Branca and
    Lorenzo Camerini and
    Christian Pilato and
    Pier Luca Lanzi and
    Fabrizio Ferrandi and
    Donatella Sciuto},
    title = {Mapping pipelined applications onto heterogeneous embedded systems:
    a bayesian optimization algorithm based approach},
    booktitle = {Proceedings of the 7th International Conference on Hardware/Software
    Codesign and System Synthesis, {CODES+ISSS} 2009, Grenoble, France,
    October 11-16, 2009},
    pages = {443--452},
    year = {2009},
    doi = {10.1145/1629435.1629495},
    publisher = {{ACM}},
    }

  • [DOI] M. Branca, L. Camerini, F. Ferrandi, P. L. Lanzi, C. Pilato, D. Sciuto, and A. Tumeo, “Evolutionary algorithms for the mapping of pipelined applications onto heterogeneous embedded systems,” in Genetic and Evolutionary Computation Conference, GECCO 2009, Proceedings, Montreal, Québec, Canada, July 8-12, 2009, 2009, p. 1435–1442.
    [BibTeX]
    @inproceedings{GECCO2009,
    author = {Marco Branca and
    Lorenzo Camerini and
    Fabrizio Ferrandi and
    Pier Luca Lanzi and
    Christian Pilato and
    Donatella Sciuto and
    Antonino Tumeo},
    title = {Evolutionary algorithms for the mapping of pipelined applications
    onto heterogeneous embedded systems},
    booktitle = {Genetic and Evolutionary Computation Conference, {GECCO} 2009, Proceedings,
    Montreal, Qu{\'{e}}bec, Canada, July 8-12, 2009},
    pages = {1435--1442},
    year = {2009},
    doi = {10.1145/1569901.1570094},
    publisher = {{ACM}},
    }

  • [DOI] M. Rashid, F. Ferrandi, and K. Bertels, “hArtes design flow for heterogeneous platforms,” in 10th International Symposium on Quality of Electronic Design (ISQED 2009), 16-18 March 2009, San Jose, CA, USA, 2009, p. 330–338.
    [BibTeX]
    @inproceedings{ISQED2009,
    author = {Muhammad Rashid and
    Fabrizio Ferrandi and
    Koen Bertels},
    title = {hArtes design flow for heterogeneous platforms},
    booktitle = {10th International Symposium on Quality of Electronic Design (ISQED 2009), 16-18 March 2009, San Jose, CA, {USA}},
    pages = {330--338},
    year = {2009},
    doi = {10.1109/ISQED.2009.4810316},
    publisher = {{IEEE}},
    }

  • [PDF] [DOI] F. Ferrandi, M. Lattuada, C. Pilato, and A. Tumeo, “Performance estimation for task graphs combining sequential path profiling and control dependence regions,” in 7th ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE 2009), July 13-15, 2009, Cambridge, Massachusetts, USA, 2009, p. 131–140.
    [BibTeX]
    @inproceedings{MEMOCODE2009,
    author = {Fabrizio Ferrandi and
    Marco Lattuada and
    Christian Pilato and
    Antonino Tumeo},
    title = {Performance estimation for task graphs combining sequential path profiling
    and control dependence regions},
    booktitle = {7th {ACM/IEEE} International Conference on Formal Methods and Models
    for Codesign (MEMOCODE 2009), July 13-15, 2009, Cambridge, Massachusetts,
    {USA}},
    pages = {131--140},
    year = {2009},
    crossref = {DBLP:conf/memocode/2009},
    pdf = {https://re.public.polimi.it/retrieve/handle/11311/553643/92299/memocode09_submitted.pdf},
    doi = {10.1109/MEMCOD.2009.5185389},
    publisher = {{IEEE Press}},
    }

  • [PDF] [DOI] M. Lattuada, C. Pilato, A. Tumeo, and F. Ferrandi, “Performance modeling of parallel applications on MPSoCs,” in System-on-Chip, 2009. SOC 2009. International Symposium on, 2009, pp. 64-67.
    [BibTeX] [Abstract]

    In this paper we present a new technique for automatically measuring the performance of tasks, functions or arbitrary parts of a program on a multiprocessor embedded system. The technique instruments the tasks described by OpenMP, used to represent the task parallelism, while ad hoc pragmas in the source indicate other pieces of code to profile. The annotations and the instrumentation are completely target-independent, so the same code can be measured on different target architectures, on simulators or on prototypes. We validate the approach on a single and on a dual LEON 3 platform synthesized on FPGA, demonstrating a low instrumentation overhead. We show how the information obtained with this technique can be easily exploited in a hardware/software design space exploration tool, by estimating, with good accuracy, the speed-up of a parallel application given the profiling on the single processor prototype.

    @INPROCEEDINGS{SOC2009,
    author={M. Lattuada and C. Pilato and A. Tumeo and F. Ferrandi},
    booktitle={System-on-Chip, 2009. SOC 2009. International Symposium on},
    title={Performance modeling of parallel applications on MPSoCs},
    year={2009},
    pages={64-67},
    abstract={In this paper we present a new technique for automatically measuring the performance of tasks, functions or arbitrary parts of a program on a multiprocessor embedded system. The technique instruments the tasks described by OpenMP, used to represent the task parallelism, while ad hoc pragmas in the source indicate other pieces of code to profile. The annotations and the instrumentation are completely target-independent, so the same code can be measured on different target architectures, on simulators or on prototypes. We validate the approach on a single and on a dual LEON 3 platform synthesized on FPGA, demonstrating a low instrumentation overhead. We show how the information obtained with this technique can be easily exploited in a hardware/software design space exploration tool, by estimating, with good accuracy, the speed-up of a parallel application given the profiling on the single processor prototype.},
    keywords={embedded systems;field programmable gate arrays;hardware-software codesign;logic design;multiprocessing systems;system-on-chip;FPGA;MPSoC design;OpenMP;ad hoc pragmas;dual LEON 3 platform;hardware-software design;multiprocessor embedded system;performance modeling;single processor prototype;task parallelism;Application software;Computer architecture;Embedded system;Field programmable gate arrays;Hardware;Instruments;Software design;Software prototyping;Space exploration;Virtual prototyping},
    month={Oct},
    doi={10.1109/SOCC.2009.5335675},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/553648/92304/soc2009.pdf},
    publisher = {{IEEE}},
    }

2008

  • [DOI] C. Pilato, D. Loiacono, F. Ferrandi, P. L. Lanzi, and D. Sciuto, “High-level synthesis with multi-objective genetic algorithm: A comparative encoding analysis,” in Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2008, June 1-6, 2008, Hong Kong, China, 2008, p. 3334–3341.
    [BibTeX]
    @inproceedings{CEC2008,
    author = {Christian Pilato and
    Daniele Loiacono and
    Fabrizio Ferrandi and
    Pier Luca Lanzi and
    Donatella Sciuto},
    title = {High-level synthesis with multi-objective genetic algorithm: {A} comparative
    encoding analysis},
    booktitle = {Proceedings of the {IEEE} Congress on Evolutionary Computation, {CEC}
    2008, June 1-6, 2008, Hong Kong, China},
    pages = {3334--3341},
    year = {2008},
    publisher = {{IEEE}},
    doi = {10.1109/CEC.2008.4631249},
    }

  • [DOI] F. Ferrandi, P. L. Lanzi, D. Loiacono, C. Pilato, and D. Sciuto, “A Multi-objective Genetic Algorithm for Design Space Exploration in High-Level Synthesis,” in IEEE Computer Society Annual Symposium on VLSI, ISVLSI 2008, 7-9 April 2008, Montpellier, France, 2008, p. 417–422.
    [BibTeX]
    @inproceedings{ISVLSI2008,
    author = {Fabrizio Ferrandi and
    Pier Luca Lanzi and
    Daniele Loiacono and
    Christian Pilato and
    Donatella Sciuto},
    title = {A Multi-objective Genetic Algorithm for Design Space Exploration in
    High-Level Synthesis},
    booktitle = {{IEEE} Computer Society Annual Symposium on VLSI, {ISVLSI} 2008, 7-9
    April 2008, Montpellier, France},
    pages = {417--422},
    year = {2008},
    publisher = {{IEEE}},
    doi = {10.1109/ISVLSI.2008.73},
    }

  • [DOI] C. Pilato, A. Tumeo, G. Palermo, F. Ferrandi, P. L. Lanzi, and D. Sciuto, “Improving evolutionary exploration to area-time optimization of FPGA designs,” Journal of Systems Architecture – Embedded Systems Design, vol. 54, iss. 11, p. 1046–1057, 2008.
    [BibTeX]
    @article{JSA2008,
    author = {Christian Pilato and Antonino Tumeo and Gianluca Palermo and Fabrizio Ferrandi and Pier Luca Lanzi and Donatella Sciuto},
    title = {Improving evolutionary exploration to area-time optimization of {FPGA}
    designs},
    journal = {Journal of Systems Architecture - Embedded Systems Design},
    volume = {54},
    number = {11},
    pages = {1046--1057},
    year = {2008},
    publisher = {Elsevier North-Holland, Inc.},
    doi = {10.1016/j.sysarc.2008.04.010},
    }

  • [DOI] A. Tumeo, C. Pilato, F. Ferrandi, D. Sciuto, and P. L. Lanzi, “Ant colony optimization for mapping and scheduling in heterogeneous multiprocessor systems,” in Proceedings of the 2008 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (IC-SAMOS 2008), Samos, Greece, July 21-24, 2008, 2008, p. 142–149.
    [BibTeX]
    @inproceedings{SAMOS2008,
    author = {Antonino Tumeo and
    Christian Pilato and
    Fabrizio Ferrandi and
    Donatella Sciuto and
    Pier Luca Lanzi},
    title = {Ant colony optimization for mapping and scheduling in heterogeneous multiprocessor systems},
    booktitle = {Proceedings of the 2008 International Conference on Embedded Computer
    Systems: Architectures, Modeling and Simulation (IC-SAMOS 2008),
    Samos, Greece, July 21-24, 2008},
    pages = {142--149},
    year = {2008},
    publisher = {{IEEE}},
    doi = {10.1109/ICSAMOS.2008.4664857},
    }

2007

  • [DOI] C. Pilato, G. Palermo, A. Tumeo, F. Ferrandi, D. Sciuto, and P. L. Lanzi, “Fitness inheritance in evolutionary and multi-objective high-level synthesis,” in Proceedings of the IEEE Congress on Evolutionary Computation, CEC, 2007, pp. 3459-3466.
    [BibTeX] [Abstract]

    The high-level synthesis process allows the automatic design and implementation of digital circuits starting from a behavioral description. Evolutionary algorithms are very widely adopted to approach this problem or just part of it. Nevertheless, some concerns regarding execution times exist. In evolutionary high-level synthesis, design solutions have to be evaluated to extract information about some figures of merit (such as performance, area, etc.) and to allow the genetic algorithm to evolve and converge to Pareto-optimal solutions. Since the execution time of such evaluations increases with the complexity of the specification, the overall methodology could lead to unacceptable execution time. This paper presents a model to exploit fitness inheritance in a multi-objective optimization algorithm (i.e. NSGA-II) by substituting the expensive real evaluations with estimations based on closeness in a hypothetical design space. The estimations are based on the measure of the distance between individuals and a weighted average of the fitnesses of the closest ones. The results show that the Pareto-optimal set obtained by applying the proposed model well approximates the set obtained without fitness inheritance. Moreover, the overall execution time is reduced by up to 25% on average.

    @inproceedings {CEC2007,
    author = {Christian Pilato and Gianluca Palermo and Antonino Tumeo and Fabrizio Ferrandi and Donatella Sciuto and Pier Luca Lanzi},
    booktitle = {Proceedings of the {IEEE} Congress on Evolutionary Computation, {CEC}},
    title = {Fitness inheritance in evolutionary and multi-objective high-level synthesis},
    abstract = {The high-level synthesis process allows the automatic design and implementation of digital circuits starting from a behavioral description. Evolutionary algorithms are very widely adopted to approach this problem or just part of it. Nevertheless, some concerns regarding execution times exist. In evolutionary high-level synthesis, design solutions have to be evaluated to extract information about some figures of merit (such as performance, area, etc.) and to allow the genetic algorithm to evolve and converge to Pareto-optimal solutions. Since the execution time of such evaluations increases with the complexity of the specification, the overall methodology could lead to unacceptable execution time. This paper presents a model to exploit fitness inheritance in a multi-objective optimization algorithm (i.e. NSGA-II) by substituting the expensive real evaluations with estimations based on closeness in a hypothetical design space. The estimations are based on the measure of the distance between individuals and a weighted average of the fitnesses of the closest ones. The results show that the Pareto-optimal set obtained by applying the proposed model well approximates the set obtained without fitness inheritance. Moreover, the overall execution time is reduced by up to 25% on average.},
    location = {Singapore},
    month = {September 25-28},
    pages = {3459-3466},
    publisher = {{IEEE}},
    year = {2007},
    doi = {10.1109/CEC.2007.4424920},
    }

  • [PDF] F. Ferrandi, L. Fossati, M. Lattuada, G. Palermo, D. Sciuto, and A. Tumeo, “Partitioning and Mapping for the hArtes European Project,” in Workshop on Directions in FPGAs and Reconfigurable Systems: Design, Programming and Technologies for adaptive heterogeneous Systems on Chip and their European Dimensions, 2007.
    [BibTeX] [Abstract]

    The hArtes – Holistic Approach to Reconfigurable real Time Embedded Systems – project has three main objectives: the development of a toolchain and a methodology supporting effective automatic or semi-automatic design of complex heterogeneous embedded systems, the design of a scalable heterogeneous and reconfigurable hardware platform and the validation of the tool chain on a set of innovative applications in the audio and video field. This paper presents the ongoing works related to hArtes at Politecnico di Milano. Our role consists in the development of innovative methodologies and algorithms for software partitioning and for initial mapping of the resulting partitions on reconfigurable multiprocessor platforms. The development of these methodologies was integrated in PandA, our framework for hardware-software codesign; several other related were developed as an aid for the testing of the implemented technologies.

    @inproceedings {DATE2007,
    author = {Fabrizio Ferrandi and Luca Fossati and Marco Lattuada and Gianluca Palermo and Donatella Sciuto and Antonino Tumeo},
    title = {Partitioning and Mapping for the hArtes European Project},
    booktitle = {Workshop on Directions in FPGAs and Reconfigurable Systems: Design, Programming and Technologies for adaptive heterogeneous Systems on Chip and their European Dimensions},
    abstract = {The hArtes - Holistic Approach to Reconfigurable real Time Embedded Systems - project has three main objectives: the development of a toolchain and a methodology supporting effective automatic or semi-automatic design of complex heterogeneous embedded systems, the design of a scalable heterogeneous and reconfigurable hardware platform and the validation of the tool chain on a set of innovative applications in the audio and video field. This paper presents the ongoing works related to hArtes at Politecnico di Milano. Our role consists in the development of innovative methodologies and algorithms for software partitioning and for initial mapping of the resulting partitions on reconfigurable multiprocessor platforms. The development of these methodologies was integrated in PandA, our framework for hardware-software codesign; several other related were developed as an aid for the testing of the implemented technologies.},
    location = {held during DATE '07, Nice, France},
    year = {2007},
    month = {April 20},
    yy = {2007},
    mm = {4},
    pdf={https://re.public.polimi.it/retrieve/handle/11311/268434/92449/paper.pdf},
    }

  • [PDF] [DOI] F. Ferrandi, L. Fossati, M. Lattuada, G. Palermo, D. Sciuto, and A. Tumeo, “Automatic parallelization of sequential specifications for symmetric MPSoCs,” in Proceedings of the IESS07 – International Embedded Systems Symposium 2007, 2007, pp. 179-192.
    [BibTeX] [Abstract]

    This paper presents an embedded system design toolchain for automatic generation of parallel code runnable on symmetric multiprocessor systems from an initial sequential specification written using the C language. We show how the initial C specification is translated in a modified system dependence graph with feedback edges (FSDG) composing the intermediate representation which is manipulated by the algorithm. Then we describe how this graph is partitioned and optimized: at the end of the process each partition (cluster of nodes) represents a different task. The parallel C code produced is such that the tasks can be dynamically scheduled on the target architecture; this is obtained thanks to the introduction of start conditions for each task. We present the experimental results obtained by applying our flow on the sequential code of the ADPCM and JPEG algorithms and by running the parallel specification, produced by the toolchain, on the target platform: with respect to the sequential specification, speedups up to 70% and 42% were obtained for the two benchmarks respectively.

    @inproceedings {IESS2007,
    author = {Fabrizio Ferrandi and Luca Fossati and Marco Lattuada and Gianluca Palermo and Donatella Sciuto and Antonino Tumeo},
    title = {Automatic parallelization of sequential specifications for symmetric MPSoCs},
    booktitle = {Proceedings of the IESS07 - International Embedded Systems Symposium 2007},
    abstract = {This paper presents an embedded system design toolchain for automatic generation of parallel code runnable on symmetric multiprocessor systems from an initial sequential specification written using the C language. We show how the initial C specification is translated in a modified system dependence graph with feedback edges (FSDG) composing the intermediate representation which is manipulated by the algorithm. Then we describe how this graph is partitioned and optimized: at the end of the process each partition (cluster of nodes) represents a different task. The parallel C code produced is such that the tasks can be dynamically scheduled on the target architecture; this is obtained thanks to the introduction of start conditions for each task. We present the experimental results obtained by applying our flow on the sequential code of the ADPCM and JPEG algorithms and by running the parallel specification, produced by the toolchain, on the target platform: with respect to the sequential specification, speedups up to 70% and 42% were obtained for the two benchmarks respectively.},
    location = {Irvine, CA, USA},
    year = {2007},
    month = {May 30 - June 1},
    pages = {179-192},
    yy = {2007},
    mm = {5},
    publisher = {Springer},
    pdf = {https://re.public.polimi.it/retrieve/handle/11311/240811/92308/IESS.pdf},
    doi ={10.1007/978-0-387-72258-0_16},
    }

  • [DOI] A. P. E. Rosiello, F. Ferrandi, D. Pandini, and D. Sciuto, “A Hash-based Approach for Functional Regularity Extraction During Logic Synthesis,” in Proceedings of the IEEE Computer Society Annual Symposium on VLSI ISVLSI, 2007, p. 92–97.
    [BibTeX] [Abstract]

    Performance, power, and functionality, yield and manufacturability are rapidly becoming additional critical factors that must be considered at higher levels of abstraction. A possible solution to improve yield and manufacturability is based on the detection of regularity at logic level. This paper focuses its attention on regularity extraction, after technology independent logic synthesis, to detect recurring functionalities during logic synthesis and thus constraining the physical design phase to exploit the regular netlist produced. A fast heuristic to the template identification is proposed and analyzed on a standard set of benchmarks both sequential and combinational.

    @inproceedings {ISVLSI2007,
    author = {Angelo P.E. Rosiello and Fabrizio Ferrandi and Davide Pandini and Donatella Sciuto},
    title = {A Hash-based Approach for Functional Regularity Extraction During Logic Synthesis},
    booktitle = {Proceedings of the {IEEE} Computer Society Annual Symposium on {VLSI} {ISVLSI}},
    abstract = {Performance, power, and functionality, yield and manufacturability are rapidly becoming additional critical factors that must be considered at higher levels of abstraction. A possible solution to improve yield and manufacturability is based on the detection of regularity at logic level. This paper focuses its attention on regularity extraction, after technology independent logic synthesis, to detect recurring functionalities during logic synthesis and thus constraining the physical design phase to exploit the regular netlist produced. A fast heuristic to the template identification is proposed and analyzed on a standard set of benchmarks both sequential and combinational.},
    isbn = {0-7695-2896-1},
    location = {Porto Alegre, Brazil},
    publisher = {{IEEE} Computer Society},
    year = {2007},
    month = {May 09 - 11},
    pages = {92--97},
    doi = {10.1109/ISVLSI.2007.5},
    }

  • [DOI] F. Ferrandi, P. L. Lanzi, G. Palermo, C. Pilato, D. Sciuto, and A. Tumeo, “An Evolutionary Approach to Area-Time Optimization of FPGA designs,” in Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation IC-SAMOS, 2007, pp. 145-152.
    [BibTeX] [Abstract]

    This paper presents a new methodology based on evolutionary multi-objective optimization (EMO) to synthesize multiple complex modules on programmable devices (FPGAs). It starts from a behavioral description written in a common high-level language (for instance C) to automatically produce the register-transfer level (RTL) design in a hardware description language (e.g. Verilog). Since all high-level synthesis problems (scheduling, allocation and binding) are notoriously NP-complete and interdependent, the three problems should be considered simultaneously. This drives to a wide design space, that needs to be thoroughly explored to obtain solutions able to satisfy the design constraints. Evolutionary algorithms are good candidates to tackle such complex explorations. In this paper we provide a solution based on the Non-dominated Sorting Genetic Algorithm (NSGA-II) to explore the design space in order to obtain the best solutions in terms of performance given the area constraints of a target FPGA device. Moreover, it has been integrated a good cost estimation model to guarantee the quality of the solutions found without requiring a complete synthesis for the validation of each generation, an impractical and time consuming operation. We show on the JPEG case study that the proposed approach provides good results in terms of trade-off between total area occupied and execution time.

    @inproceedings {SAMOS2007,
    author = {Fabrizio Ferrandi and Pier Luca Lanzi and Gianluca Palermo and Christian Pilato and Donatella Sciuto and Antonino Tumeo},
    title = {An Evolutionary Approach to Area-Time Optimization of FPGA designs},
    booktitle = {Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation {IC-SAMOS}},
    abstract = {This paper presents a new methodology based on evolutionary multi-objective optimization (EMO) to synthesize multiple complex modules on programmable devices (FPGAs). It starts from a behavioral description written in a common high-level language (for instance C) to automatically produce the register-transfer level (RTL) design in a hardware description language (e.g. Verilog). Since all high-level synthesis problems (scheduling, allocation and binding) are notoriously NP-complete and interdependent, the three problems should be considered simultaneously. This drives to a wide design space, that needs to be thoroughly explored to obtain solutions able to satisfy the design constraints. Evolutionary algorithms are good candidates to tackle such complex explorations. In this paper we provide a solution based on the Non-dominated Sorting Genetic Algorithm (NSGA-II) to explore the design space in order to obtain the best solutions in terms of performance given the area constraints of a target FPGA device. Moreover, it has been integrated a good cost estimation model to guarantee the quality of the solutions found without requiring a complete synthesis for the validation of each generation, an impractical and time consuming operation. We show on the JPEG case study that the proposed approach provides good results in terms of trade-off between total area occupied and execution time.},
    isbn = {1-4244-1058-4},
    location = {Samos, Greece},
    publisher = {{IEEE}},
    year = {2007},
    month = {July 16-19},
    pages = {145--152},
    doi = {10.1109/ICSAMOS.2007.4285745},
    }

2006

  • [PDF] [DOI] R. Cordone, F. Ferrandi, M. D. Santambrogio, G. Palermo, and D. Sciuto, “Using speculative computation and parallelizing techniques to improve scheduling of control based designs,” in Proceedings of the IEEE ASP-DAC ’06 – Conference on Asia South Pacific design automation, 2006, p. 898–904.
    [BibTeX] [Abstract]

    Recent research results have seen the application of parallelizing techniques to high-level synthesis. In particular, the effect of speculative code transformations on mixed control-data flow designs has demonstrated effective results on schedule lengths. In this paper we first analyze the use of the control and data dependence graph as an intermediate representation that provides the possibility of extracting the maximum parallelism. Then we analyze the scheduling problem by formulating an approach based on Integer Linear Programming (ILP) to minimize the number of control steps given the amount of resources. We improve the already proposed ILP scheduling approaches by introducing a new conditional resource sharing constraint which is then extended to the case of speculative computation. The ILP formulation has been solved by using a Branch and Cut framework which provides better results than standard branch and bound techniques.

    @inproceedings {ASPDAC2006,
    author = {R. Cordone and F. Ferrandi and M.D. Santambrogio and G. Palermo and D. Sciuto},
    title = {Using speculative computation and parallelizing techniques to improve scheduling of control based designs},
    booktitle = {Proceedings of the IEEE ASP-DAC '06 - Conference on Asia South Pacific design automation},
    abstract = {Recent research results have seen the application of parallelizing techniques to high-level synthesis. In particular, the effect of speculative code transformations on mixed control-data flow designs has demonstrated effective results on schedule lengths. In this paper we first analyze the use of the control and data dependence graph as an intermediate representation that provides the possibility of extracting the maximum parallelism. Then we analyze the scheduling problem by formulating an approach based on Integer Linear Programming (ILP) to minimize the number of control steps given the amount of resources. We improve the already proposed ILP scheduling approaches by introducing a new conditional resource sharing constraint which is then extended to the case of speculative computation. The ILP formulation has been solved by using a Branch and Cut framework which provides better results than standard branch and bound techniques.},
    isbn = {0-7803-9451-8},
    location = {Yokohama, Japan},
    publisher = {IEEE Press},
    year = {2006},
    month = {24-27 Jan.},
    pages = {898--904},
    yy = {2006},
    mm = {0},
    doi = {10.1109/ASPDAC.2006.1594800},
    pdf={https://re.public.polimi.it/retrieve/92954/SpecSched.pdf}
    }

  • [URL] F. Bruschi and F. Ferrandi, “A SystemC-based Framework of Communication Architecture,” in Proceedings of the Forum on specification and Design Languages, FDL, 2006, pp. 319-326.
    [BibTeX]
    @inproceedings {FDL2006,
    author = {F. Bruschi and F. Ferrandi},
    booktitle = {Proceedings of the Forum on specification and Design Languages, {FDL}},
    location = {Darmstadt, Germany},
    year = {2006},
    month = {September 19-22},
    pages = {319--326},
    title = {A SystemC-based Framework of Communication Architecture},
    url = {http://www.ecsi-association.org/ecsi/main.asp?l1=library&fn=def&id=391},
    }

2004

  • [DOI] F. Ferrandi, P. Lanzi, D. Sciuto, and M. Tanelli, “System-level metrics for hardware/software architectural mapping,” in Proceedings of the 2nd IEEE International Workshop on Electronic Design, Test and Applications DELTA, 2004, p. 231–236.
    [BibTeX] [Abstract]

    The current trend in Embedded Systems (ES) design is moving towards the integration of increasingly complex applications on a single chip, while having to meet strict market demands that force ever-shorter design times. In general, the ideal design methodology shall support the exploration of the highest possible number of alternatives (in terms of HW-SW architectures) starting in the early design stages, as this will prevent costly correction efforts in the deployment phase. The present paper proposes a new methodology for tackling the design exploration problem, with the aim of providing a solution in terms of optimal partitioning with respect to the overall system performance.

    @inproceedings {DELTA2004,
    author = {F. Ferrandi and P. Lanzi and D. Sciuto and M. Tanelli},
    title = {System-level metrics for hardware/software architectural mapping},
    booktitle = {Proceedings of the 2nd {IEEE} International Workshop on Electronic Design, Test and Applications {DELTA}},
    year = {2004},
    month = {January 28-30},
    pages = {231--236},
    publisher = {{IEEE} Computer Society},
    abstract = {The current trend in Embedded Systems (ES) design is moving towards the integration of increasingly complex applications on a single chip, while having to meet strict market demands that force ever-shorter design times. In general, the ideal design methodology shall support the exploration of the highest possible number of alternatives (in terms of HW-SW architectures) starting in the early design stages, as this will prevent costly correction efforts in the deployment phase. The present paper proposes a new methodology for tackling the design exploration problem, with the aim of providing a solution in terms of optimal partitioning with respect to the overall system performance.},
    location = {Perth, Australia},
    doi = {10.1109/DELTA.2004.10060},
    }

  • [DOI] F. Ferrandi, P. L. Lanzi, and D. Sciuto, “System Level Hardware–Software Design Exploration with XCS,” in Proceedings of the Genetic and Evolutionary Computation GECCO, 2004, pp. 763-773.
    [BibTeX] [Abstract]

    The current trend in Embedded Systems (ES) design is moving towards the integration of increasingly complex applications on a single chip. An Embedded System has to satisfy both performance constraints and cost limits; it is composed of both dedicated elements, i.e. hardware (HW) components, and programmable units, i.e. software (SW) components. Hardware (HW) and software (SW) components have to interact with each other to accomplish a specific task. One of the aims of codesign is to support the exploration of the most significant architectural alternatives in terms of decomposition between hardware (HW) and software (SW) components. In this paper, we propose a novel approach to support the exploration of feasible hardware-software (HW-SW) configurations. The approach exploits the learning classifier system XCS both to identify existing relationships among the system components and to support HW-SW partitioning decisions. We validate the approach by applying it to the design of a Digital Sound Spatializer.

    @inproceedings {GECCO2004,
    author = {Fabrizio Ferrandi and Pier Luca Lanzi and Donatella Sciuto},
    title = {System Level Hardware--Software Design Exploration with {XCS}},
    booktitle = {Proceedings of the Genetic and Evolutionary Computation {GECCO}},
    abstract = {The current trend in Embedded Systems (ES) design is moving towards the integration of increasingly complex applications on a single chip. An Embedded System has to satisfy both performance constraints and cost limits; it is composed of both dedicated elements, i.e. hardware (HW) components, and programmable units, i.e. software (SW) components. Hardware (HW) and software (SW) components have to interact with each other to accomplish a specific task. One of the aims of codesign is to support the exploration of the most significant architectural alternatives in terms of decomposition between hardware (HW) and software (SW) components. In this paper, we propose a novel approach to support the exploration of feasible hardware-software (HW-SW) configurations. The approach exploits the learning classifier system XCS both to identify existing relationships among the system components and to support HW-SW partitioning decisions. We validate the approach by applying it to the design of a Digital Sound Spatializer.},
    location = {Seattle, WA, USA},
    publisher = {Springer-Verlag},
    year = {2004},
    month = {June 26-30},
    series = {LNCS},
    pages = {763--773},
    doi = {10.1007/978-3-540-24855-2_91},
    }

2003

  • [DOI] F. Ferrandi, P. L. Lanzi, and D. Sciuto, “Mining interesting patterns from hardware-software codesign data with the learning classifier system XCS,” in Proceedings of the IEEE CEC 2003 – Congress on Evolutionary Computation, 2003, p. 1486–1492.
    [BibTeX] [Abstract]

    Embedded systems are composed of both dedicated elements (hardware components) and programmable units (software components), which have to interact with each other to accomplish a specific task. One of the aims of hardware-software codesign, and one of the important steps in design, is the choice of a partitioning between elements that will be implemented in hardware and elements that will be implemented in software. In this paper, we present an application of the learning classifier system XCS to the analysis of data derived from hardware-software codesign applications. The goal of the analysis is to discover or make explicit the existing interrelationships among system components, which can be used to support the human design of embedded systems. The proposed approach is validated on a specific task involving a digital sound spatializer.

    @inproceedings {CEC2003,
    author = {F. Ferrandi and P.L. Lanzi and D. Sciuto},
    title = {Mining interesting patterns from hardware-software codesign data with the learning classifier system {XCS}},
    booktitle = {Proceedings of the IEEE CEC 2003 - Congress on Evolutionary Computation},
    abstract = {Embedded systems are composed of both dedicated elements (hardware components) and programmable units (software components), which have to interact with each other to accomplish a specific task. One of the aims of hardware-software codesign, and one of the important steps in design, is the choice of a partitioning between elements that will be implemented in hardware and elements that will be implemented in software. In this paper, we present an application of the learning classifier system XCS to the analysis of data derived from hardware-software codesign applications. The goal of the analysis is to discover or make explicit the existing interrelationships among system components, which can be used to support the human design of embedded systems. The proposed approach is validated on a specific task involving a digital sound spatializer.},
    year = {2003},
    month = {8-12 Dec.},
    pages = {1486--1492},
    volume = {2},
    publisher = {{IEEE}},
    doi = {10.1109/CEC.2003.1299846},
    }
