
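My latest project needs the version information of the open-source components in the Maven repository. I assumed a plain wget against the Maven repo would be enough. Unfortunately, the ideal is plump and the reality is bony. Since wget could not get what I needed, I wrote a simple crawler instead.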

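Open the repository page: https://repo.maven.apache.org/maven2/

The page lists everything as directories and files.

View the page source

It is easy to see that every directory and file entry is an a tag under the element whose id is "contents".

Version information (in maven-metadata.xml)

Drill down into any directory and you will find that a component's version information is described in maven-metadata.xml, e.g.:

Contents of https://repo.maven.apache.org/maven2/tech/ibit/sql-builder/maven-metadata.xml: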

<?xml version="1.0" encoding="UTF-8"?>
<metadata>
    <groupId>tech.ibit</groupId>
    <artifactId>sql-builder</artifactId>
    <versioning>
        <latest>2.0</latest>
        <release>2.0</release>
        <versions>
            <version>1.0</version>
            <version>1.1</version>
            <version>2.0</version>
        </versions>
        <lastUpdated>20201130115230</lastUpdated>
    </versioning>
</metadata>

maven-metadata.xml contains the groupId, artifactId, and version information.
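As a quick check of that structure, here is a minimal sketch that reads the three fields with the JDK's built-in DOM parser. The class name MetadataSketch and the shortened inline sample are assumptions made for illustration; the output format groupId:artifactId:versions is just one convenient way to print the result.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name, used only for this illustration.
class MetadataSketch {

    // Shortened copy of the sql-builder metadata shown above.
    static final String SAMPLE =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<metadata>"
            + "<groupId>tech.ibit</groupId>"
            + "<artifactId>sql-builder</artifactId>"
            + "<versioning><versions>"
            + "<version>1.0</version><version>1.1</version><version>2.0</version>"
            + "</versions></versioning>"
            + "</metadata>";

    /** Returns the text of every element with the given tag name, in document order. */
    static List<String> textOf(Document doc, String tagName) {
        List<String> values = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName(tagName);
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent().trim());
        }
        return values;
    }

    /** Formats the metadata as groupId:artifactId:v1,v2,... */
    static String toLine(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            String groupId = textOf(doc, "groupId").get(0);
            String artifactId = textOf(doc, "artifactId").get(0);
            return groupId + ":" + artifactId + ":" + String.join(",", textOf(doc, "version"));
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalStateException("cannot parse metadata", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toLine(SAMPLE));
    }
}
```

For the sample above this prints tech.ibit:sql-builder:1.0,1.1,2.0. The sketch assumes groupId and artifactId are always present; real metadata files may omit them, so production code should check for that.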

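Putting this together, the version information for everything in the Maven repo can be collected as follows:

  • Walk every directory in the Maven repo and collect the maven-metadata.xml files
  • Parse each maven-metadata.xml to get the groupId, artifactId, and version values
  • Sample code:

    Crawl all maven-metadata.xml files and directories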

    package tech.ibit.crawler;

    import org.apache.commons.lang.StringUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.Scanner;

    /**
     * Maven crawler
     *
     * @author IBIT程序猿
     */
    public class MavenCrawler {

        /** Root URL to crawl */
        private static final String ROOT = "https://repo.maven.apache.org/";

        /** maven-metadata.xml file name */
        private static final String MAVEN_METADATA_XML_FILENAME = "maven-metadata.xml";

        public static void main(String[] args) {
            // Arguments
            // args[0]: crawl directory
            // args[1]: sleep time in milliseconds
            // args[2]: start level (optional)
            // args[3]: start line (optional)
            if (args.length < 2) {
                System.err.println("Arguments: crawlDir sleepMillis [startLevel] [startLine]");
                System.exit(1);
            }
            String dirPath = args[0];
            File dir = new File(dirPath);
            if (!dir.exists() || !dir.isDirectory()) {
                System.err.println("Crawl directory does not exist, dir: " + dirPath);
                System.exit(1);
            }
            int sleepMillis = Integer.parseInt(args[1]);
            int level = 0;
            if (args.length > 2) {
                level = Integer.parseInt(args[2]);
            }
            String beginLine = null;
            if (args.length > 3) {
                beginLine = args[3];
            }

            File urlFile;
            // Without a start line, processing begins at the first line
            boolean begin = null == beginLine;
            while ((urlFile = getLevelFile(dir, level)).exists()) {
                level++;
                boolean fileEmpty = true;
                File subFile = getLevelFile(dir, level);
                try (Scanner scanner = new Scanner(urlFile);
                     FileWriter writer = new FileWriter(subFile)) {
                    while (scanner.hasNext()) {
                        String line = scanner.nextLine();
                        if (StringUtils.isNotBlank(line)) {
                            fileEmpty = false;
                            // Skip lines until the start line is reached
                            if (!begin && line.equals(beginLine)) {
                                begin = true;
                            }
                            if (begin) {
                                String url = ROOT + line;
                                findSubUrl(url, sleepMillis, writer);
                            }
                        }
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
                // An empty level file means there is nothing deeper to crawl
                if (fileEmpty) {
                    urlFile.deleteOnExit();
                    subFile.deleteOnExit();
                    break;
                }
            }
        }

        /**
         * Get the file for a level
         *
         * @param dir   directory
         * @param level level
         * @return file
         */
        private static File getLevelFile(File dir, int level) {
            return new File(dir.getAbsolutePath() + File.separator + "level_" + level + ".txt");
        }

        /**
         * Find child URLs
         *
         * @param url         current URL
         * @param sleepMillis sleep time in milliseconds
         * @param writer      writer
         */
        private static void findSubUrl(String url, int sleepMillis, FileWriter writer) {
            try {
                // maven-metadata.xml is a file, not a directory listing
                if (url.endsWith(MAVEN_METADATA_XML_FILENAME)) {
                    return;
                }
                Thread.sleep(sleepMillis);
                Document doc = Jsoup.connect(url).get();
                Elements links = doc.select("#contents a");
                for (Element link : links) {
                    String absUrl = link.absUrl("href");
                    // Not a child of the current directory (e.g. the "../" parent link)
                    if (!absUrl.startsWith(url) || url.equals(absUrl)) {
                        continue;
                    }
                    String relativePath = absUrl.substring(url.length());
                    // Keep maven-metadata.xml and names without a dot (treated as directories)
                    if (MAVEN_METADATA_XML_FILENAME.equals(relativePath) || !relativePath.contains(".")) {
                        String path = absUrl.substring(ROOT.length());
                        writer.write(path + "\n");
                        writer.flush();
                        System.out.println(path);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            } catch (InterruptedException e) {
                // Restore the interrupt flag so callers can see it
                Thread.currentThread().interrupt();
            }
        }
    }
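The link-filtering rule inside findSubUrl is the part most worth checking in isolation, so here is a minimal sketch of it as a pure function with no network access. The class name LinkFilterSketch and the method name pathToRecord are hypothetical. The sketch uses startsWith for the child check, which is a slightly stricter prefix test than contains.

```java
// Hypothetical class name, used only for this illustration.
class LinkFilterSketch {

    static final String ROOT = "https://repo.maven.apache.org/";
    static final String METADATA = "maven-metadata.xml";

    /**
     * Returns the ROOT-relative path that the crawler would record for a link,
     * or null when the link should be skipped.
     *
     * @param pageUrl URL of the directory listing being crawled
     * @param absUrl  absolute URL of a link found on that page
     */
    static String pathToRecord(String pageUrl, String absUrl) {
        // Skip the parent link and the page itself.
        if (!absUrl.startsWith(pageUrl) || absUrl.equals(pageUrl)) {
            return null;
        }
        String relative = absUrl.substring(pageUrl.length());
        // Keep the metadata file, plus any name without a dot (treated as a directory).
        boolean keep = METADATA.equals(relative) || !relative.contains(".");
        return keep ? absUrl.substring(ROOT.length()) : null;
    }

    public static void main(String[] args) {
        String page = ROOT + "maven2/tech/ibit/sql-builder/";
        System.out.println(pathToRecord(page, page + "maven-metadata.xml"));
        System.out.println(pathToRecord(page, page + "2.0/"));
        System.out.println(pathToRecord(page, ROOT + "maven2/tech/ibit/"));
    }
}
```

One consequence of the "no dot" rule is that version directories such as 2.0/ are never descended into. That is fine for this purpose, because the versions are already listed in the artifact-level maven-metadata.xml.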
    
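  • Before running, create level_0.txt in the output directory and put the starting location in it. The code prepends ROOT (https://repo.maven.apache.org/) to every line it reads, so to start from https://repo.maven.apache.org/maven2/ the file should contain the relative path maven2/. As the crawl goes deeper, it generates level_1.txt, level_2.txt, and so on, one file per directory depth.
  • The sample code is single-threaded and sleeps between requests (to avoid getting the IP blocked). If you need a multithreaded version, design it yourself.
  • Sample code for parsing maven-metadata.xml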

    package tech.ibit.crawler;

    import org.apache.commons.collections4.CollectionUtils;
    import org.apache.commons.io.IOUtils;
    import org.apache.commons.lang.StringUtils;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import java.io.ByteArrayInputStream;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.LinkedHashSet;
    import java.util.Scanner;
    import java.util.Set;

    /**
     * Maven metadata parser
     *
     * @author IBIT程序猿
     */
    public class MavenMetaDataParser {

        /** Root URL to crawl */
        private static final String ROOT = "https://repo.maven.apache.org/";

        /** maven-metadata.xml file name */
        private static final String MAVEN_METADATA_XML_FILENAME = "maven-metadata.xml";

        public static void main(String[] args) {
            // Arguments
            // args[0]: crawl directory
            // args[1]: sleep time in milliseconds
            // args[2]: start level
            // args[3]: end level
            // args[4]: start line (optional)
            if (args.length < 4) {
                System.err.println("Arguments: crawlDir sleepMillis startLevel endLevel [startLine]");
                System.exit(1);
            }
            String dirPath = args[0];
            File dir = new File(dirPath);
            if (!dir.exists() || !dir.isDirectory()) {
                System.err.println("Crawl directory does not exist, dir: " + dirPath);
                System.exit(1);
            }
            int sleepMillis = Integer.parseInt(args[1]);
            int beginLevel = Integer.parseInt(args[2]);
            int endLevel = Integer.parseInt(args[3]);
            String beginLine = null;
            if (args.length > 4) {
                beginLine = args[4];
            }

            // Without a start line, processing begins at the first line
            boolean begin = null == beginLine;
            for (int i = beginLevel; i <= endLevel; i++) {
                File urlFile = getLevelFile(dir, i);
                if (!urlFile.exists()) {
                    break;
                }
                try (Scanner scanner = new Scanner(urlFile);
                     FileWriter writer = new FileWriter(getVersionLevelFile(dir, i))) {
                    while (scanner.hasNext()) {
                        String line = scanner.nextLine();
                        if (StringUtils.isNotBlank(line)) {
                            // Skip lines until the start line is reached
                            if (!begin && line.equals(beginLine)) {
                                begin = true;
                            }
                            if (begin && line.endsWith(MAVEN_METADATA_XML_FILENAME)) {
                                String url = ROOT + line;
                                appendVersions(url, sleepMillis, writer);
                            }
                        }
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }

        /**
         * Write the versions found at a URL
         *
         * @param url         url
         * @param sleepMillis sleep time in milliseconds
         * @param writer      writer
         */
        private static void appendVersions(String url, int sleepMillis, FileWriter writer) {
            try {
                Thread.sleep(sleepMillis);
                String xmlContent = IOUtils.toString(new URL(url), StandardCharsets.UTF_8);
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                DocumentBuilder builder = factory.newDocumentBuilder();
                try (ByteArrayInputStream in = new ByteArrayInputStream(xmlContent.getBytes(StandardCharsets.UTF_8))) {
                    Document doc = builder.parse(in);
                    String groupId = getSingleValue(doc, "groupId");
                    if (StringUtils.isBlank(groupId)) {
                        return;
                    }
                    String artifactId = getSingleValue(doc, "artifactId");
                    if (StringUtils.isBlank(artifactId)) {
                        return;
                    }
                    Set<String> versions = getMultiValues(doc, "version");
                    if (CollectionUtils.isEmpty(versions)) {
                        return;
                    }
                    // One line per artifact: groupId:artifactId:v1,v2,...
                    String versionLine = groupId + ":" + artifactId + ":" + StringUtils.join(versions, ",");
                    writer.write(versionLine + "\n");
                    writer.flush();
                    System.out.println(versionLine);
                }
            } catch (InterruptedException e) {
                // Restore the interrupt flag so callers can see it
                Thread.currentThread().interrupt();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        /**
         * Get the level file written by the crawler
         *
         * @param dir   directory
         * @param level level
         * @return file
         */
        private static File getLevelFile(File dir, int level) {
            return new File(dir.getAbsolutePath() + File.separator + "level_" + level + ".txt");
        }

        /**
         * Get the version output file for a level
         *
         * @param dir   directory
         * @param level level
         * @return file
         */
        private static File getVersionLevelFile(File dir, int level) {
            return new File(dir.getAbsolutePath() + File.separator + "version_level_" + level + ".txt");
        }

        /**
         * Get a single value
         *
         * @param document document
         * @param tagName  tag name
         * @return the value of the first matching tag, or null
         */
        private static String getSingleValue(Document document, String tagName) {
            NodeList nodeList = document.getElementsByTagName(tagName);
            if (nodeList.getLength() == 0) {
                return null;
            }
            return getNodeValue(nodeList.item(0));
        }

        /**
         * Get multiple values
         *
         * @param document document
         * @param tagName  tag name
         * @return the distinct values, in document order
         */
        private static Set<String> getMultiValues(Document document, String tagName) {
            Set<String> values = new LinkedHashSet<>();
            NodeList nodeList = document.getElementsByTagName(tagName);
            for (int i = 0; i < nodeList.getLength(); i++) {
                String value = getNodeValue(nodeList.item(i));
                if (null != value) {
                    values.add(value);
                }
            }
            return values;
        }

        /**
         * Get a node's value
         *
         * @param node node
         * @return node value, or null for a missing or empty node
         */
        private static String getNodeValue(Node node) {
            // An empty element such as <version/> has no child node
            if (null == node || null == node.getFirstChild()) {
                return null;
            }
            return node.getFirstChild().getNodeValue();
        }
    }
    
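  • This sample reads the maven-metadata.xml entries from the level_x.txt files generated by the crawler and parses out the corresponding groupId, artifactId, and version values.
  • The sample code is single-threaded and sleeps between requests (to avoid getting the IP blocked). If you need a multithreaded version, design it yourself.
  • One more note: the dependencies to add to pom.xml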

        <dependencies>
            <dependency>
                <groupId>org.jsoup</groupId>
                <artifactId>jsoup</artifactId>
                <version>1.14.3</version>
            </dependency>
            <dependency>
                <groupId>commons-lang</groupId>
                <artifactId>commons-lang</artifactId>
                <version>2.6</version>
            </dependency>
            <dependency>
                <groupId>org.apache.commons</groupId>
                <artifactId>commons-collections4</artifactId>
                <version>4.4</version>
            </dependency>
            <dependency>
                <groupId>commons-io</groupId>
                <artifactId>commons-io</artifactId>
                <version>2.11.0</version>
            </dependency>
        </dependencies>
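For the multithreaded variant that the notes above leave to the reader, one possible shape is a fixed thread pool in which every task still sleeps before doing its work. This is only a sketch under assumptions: the class name ParallelCrawlSketch, the method processAll, and the handler parameter are all hypothetical, and the handler stands in for whatever per-path work (fetching a listing, parsing a metadata file) you plug in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical class name, used only for this illustration.
class ParallelCrawlSketch {

    /**
     * Applies handler to every path on a fixed-size thread pool.
     * Each task sleeps first, so every worker thread is individually throttled.
     * Results are returned in the same order as the input paths.
     */
    static List<String> processAll(List<String> paths, int threads, long sleepMillis,
                                   Function<String, String> handler) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String path : paths) {
                futures.add(pool.submit(() -> {
                    Thread.sleep(sleepMillis);
                    return handler.apply(path);
                }));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                // Collecting in submission order keeps the output deterministic.
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> paths = List.of("maven2/tech/", "maven2/org/");
        // The handler here only builds the URL; a real one would fetch and parse it.
        List<String> urls = processAll(paths, 2, 100, p -> "https://repo.maven.apache.org/" + p);
        urls.forEach(System.out::println);
    }
}
```

Note that with N threads the overall request rate is roughly N times that of the single-threaded version for the same sleep time, so the sleep should be scaled up accordingly if the goal is still to avoid being blocked.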