General Information

Req #
WD00066350
Career area:
Research/Development
Country/Region:
China
State:
Beijing
City:
北京(Beijing)
Date:
Wednesday, June 5, 2024
Working time:
Full-time
Additional Locations
* China - Beijing - 北京(Beijing)

Why Work at Lenovo

 We are Lenovo. We do what we say. We own what we do. We WOW our customers. 

Lenovo is a US$62 billion revenue global technology powerhouse, ranked #217 in the Fortune Global 500, employing 77,000 people around the world, and serving millions of customers every day in 180 markets. Focused on a bold vision to deliver smarter technology for all, Lenovo has built on its success as the world’s largest PC company by further expanding into growth areas that fuel the advancement of ‘New IT’ technologies (client, edge, cloud, network, and intelligence) including server, storage, mobile, software, solutions, and services. 

This transformation, together with Lenovo’s world-changing innovation, is building a more inclusive, trustworthy, and smarter future for everyone, everywhere. To find out more, visit www.lenovo.com and read the latest news via our StoryHub.

Description and Requirements

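Responsibilities:

1. Design a highly available fault-tolerance system for large-model training that supports pre-training of models with hundreds of billions of parameters.

2. Optimize checkpointing for fault-tolerant large-model training, improving checkpoint read/write and recovery performance.

3. Develop an elastic training framework for large models.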

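Requirements:

1. Full-time master's degree or above in computer science and technology, artificial intelligence, or a related field.

2. Proficient in C++/Python, data structures, and computer architecture, with experience in AI model performance tuning and strong engineering implementation skills.

3. Familiar with common distributed training techniques in AI, including but not limited to data parallelism, pipeline parallelism, and tensor parallelism, with relevant project experience.

4. Familiar with at least one AI framework (PyTorch/TensorFlow/Paddle/DeepSpeed) and able to use and debug it proficiently.

5. Familiar with GPU hardware architecture and CUDA computing principles, with experience developing and debugging CUDA operators and some knowledge of NCCL/cuDNN.

6. Good understanding of large-scale pre-trained models, and familiar with the architectures, training methods, and optimization techniques of common pre-trained models (such as GPT and BERT).

7. Excellent problem-solving skills and innovative thinking, with the ability to analyze and resolve complex training problems and to propose improvements and optimizations.

8. Strong team spirit, with the ability to work closely with cross-department teams to drive project success.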

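Preferred qualifications:

1. Experience in large-model research and development and in distributed training.

2. Familiarity with Kubernetes architecture and with fault-tolerance systems for large-model training.

3. High-quality publications in the AI or HPC field.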
