08-05
12

Extracting selected parts of a text and collecting them into a list (useful for scraping)

Regular-expression approach:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    // Matches self-closing anchors such as <a href="xxx"/>.
    // \w* accepts only letters, digits and underscores inside the quotes.
    private static final String REGEX = "<a href=\"\\w*\"/>";

    private static final String INPUT = "aabbsdasdaiqo<a href=\"xxx\"/>sasdadsa<a href=\"eee\"/>sdasadpqwo<a href=\"ggg\"/>||wxwdqwq<a href=\"bbb\"/>...";

    public static void main(String[] args) {
        List<String> list = new ArrayList<String>();

        Pattern p = Pattern.compile(REGEX);
        Matcher m = p.matcher(INPUT);

        // Each match is the whole tag; keep only the text between the first and last quote.
        while (m.find()) {
            String tmp = m.group();
            list.add(tmp.substring(tmp.indexOf('"') + 1, tmp.lastIndexOf('"')));
        }

        printList(list);
    }

    private static void printList(List<String> list) {
        for (String i : list) {
            System.out.println(i);
        }
    }
}
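
The regex version above matches the whole tag and then trims it with substring. A capturing group can return the attribute value directly. The following is only a sketch under a few assumptions: the class name HrefGroupExtractor, the method extractHrefs and the looser pattern [^"]* (which also accepts URLs containing characters outside \w) are illustrative choices, not part of the original listing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefGroupExtractor {
    // Group 1 captures everything between the quotes of href="...".
    // [^"]* is an assumption: it accepts any character except a double quote.
    private static final Pattern HREF = Pattern.compile("<a\\s+href=\"([^\"]*)\"");

    public static List<String> extractHrefs(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(input);
        while (m.find()) {
            result.add(m.group(1)); // only the captured value, not the whole tag
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "aa<a href=\"xxx\"/>bb<a href=\"http://example.com/a?b=1\"/>cc";
        System.out.println(extractHrefs(input)); // prints [xxx, http://example.com/a?b=1]
    }
}
```

Because the group captures only the value, the indexOf/lastIndexOf trimming step is no longer needed.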



Manual string-handling approach:
import java.util.ArrayList;
import java.util.List;

public class Test {
    private static final String HREF_PREFIX = "href=\"";

    public static List<String> getArrByStr(String str) {
        List<String> list = new ArrayList<String>();

        // Every piece after the first begins right after an "<a" marker.
        String[] arr = str.split("<a");

        for (int i = 1; i < arr.length; i++) {
            int startPosTemp = arr[i].indexOf(HREF_PREFIX);
            if (startPosTemp < 0) {
                continue; // no href attribute in this piece
            }
            int startPos = startPosTemp + HREF_PREFIX.length();

            // Copy characters up to the closing quote; a value with no closing quote is dropped.
            StringBuilder sb = new StringBuilder();
            for (int j = startPos; j < arr[i].length(); j++) {
                if (arr[i].charAt(j) != '"') {
                    sb.append(arr[i].charAt(j));
                } else {
                    list.add(sb.toString());
                    break;
                }
            }
        }

        return list;
    }

    public static void main(String[] args) {
        String text = "aabbsdasdaiqo<a href=\"xxx\"/>sasdadsa<a href=\"eee\"/>sdasadpqwo<a href=\"ggg\"/>||wxwdqwq<a href=\"bbb\"/>";
        for (String href : getArrByStr(text)) {
            System.out.println(href);
        }
    }
}
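
The manual approach can also be written with two indexOf searches instead of split and a character-by-character loop. This is a rough sketch, not a replacement for the listing above: the names HrefScanner and scanHrefs are made up for illustration, and unlike the original it looks for href=" anywhere in the text without checking for a preceding <a.

```java
import java.util.ArrayList;
import java.util.List;

class HrefScanner {
    private static final String MARKER = "href=\"";

    public static List<String> scanHrefs(String text) {
        List<String> result = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(MARKER, from);
            if (start < 0) {
                break; // no further href attributes
            }
            start += MARKER.length();
            int end = text.indexOf('"', start);
            if (end < 0) {
                break; // closing quote is missing, so stop scanning
            }
            result.add(text.substring(start, end));
            from = end + 1; // continue after the closing quote
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "aabb<a href=\"xxx\"/>sasd<a href=\"eee\"/>sdas<a href=\"ggg\"/>";
        System.out.println(scanHrefs(text)); // prints [xxx, eee, ggg]
    }
}
```

Scanning with indexOf avoids creating the intermediate array that split produces, which may matter on long pages.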


Source: original article from this site
Tags: automation, scraping