StackOverflowError in the computeInfo function of ContentExtractor #116

Open
yanpeng opened this issue Nov 20, 2019 · 3 comments

Comments

@yanpeng

yanpeng commented Nov 20, 2019

For example, processing the URL http://www.suixian.gov.cn/news/News_View.asp?NewsID=20939 triggers the error above. I have already rewritten the function as a non-recursive version. May I submit the code?

@xiaozizero

You can paste the modified code here.

@yanpeng
Author

yanpeng commented Nov 29, 2019

    // Compute the count info for a leaf node
    private CountInfo computeLeafNodeInfo(Node node)
    {
        if (node instanceof TextNode)
        {
            TextNode tn = (TextNode) node;
            CountInfo countInfo = new CountInfo();
            String text = tn.text();
            int len = text.length();
            countInfo.textCount = len;
            countInfo.leafList.add(len);
            return countInfo;
        }
        else
        {
            return new CountInfo();
        }
    }

    protected CountInfo computeInfo(Node node)
    {
        if (node instanceof Element)
        {
            Deque<Node> stack = new ArrayDeque<Node>();
            Deque<Node> queue = new ArrayDeque<Node>();
            Set<Node> accessedNodes = new HashSet<Node>();
            Map<Node, CountInfo> nodeInfoMap = new HashMap<Node, CountInfo>();

            stack.addFirst(node);
            while (!stack.isEmpty())
            {
                Node headerNode = stack.getFirst();
                // If this is a non-leaf node, add it to the set of visited nodes
                // and push its children onto the stack in reverse order
                if ((headerNode instanceof Element) && !accessedNodes.contains(headerNode))
                {
                    accessedNodes.add(headerNode);
                    for (int i = headerNode.childNodeSize() - 1; i >= 0; --i)
                    {
                        stack.addFirst(headerNode.childNode(i));
                    }
                }
                else // Leaf nodes and already-visited non-leaf nodes go into the queue
                {
                    queue.addLast(stack.removeFirst());
                }
            }

            while (!queue.isEmpty())
            {
                Node headerNode = queue.removeFirst();
                if (headerNode instanceof Element)
                {
                    Element tag = (Element) headerNode;
                    CountInfo countInfo = new CountInfo();

                    for (Node childNode : headerNode.childNodes()) 
                    {
                        CountInfo childCountInfo = nodeInfoMap.get(childNode);
                        countInfo.textCount += childCountInfo.textCount;
                        countInfo.linkTextCount += childCountInfo.linkTextCount;
                        countInfo.tagCount += childCountInfo.tagCount;
                        countInfo.linkTagCount += childCountInfo.linkTagCount;
                        countInfo.leafList.addAll(childCountInfo.leafList);
                        countInfo.densitySum += childCountInfo.density;
                        countInfo.pCount += childCountInfo.pCount;
                    }

                    countInfo.tagCount++;
                    String tagName = tag.tagName();
                    if (tagName.equals("a"))
                    {
                        countInfo.linkTextCount = countInfo.textCount;
                        countInfo.linkTagCount++;
                    }
                    else if (tagName.equals("p"))
                    {
                        countInfo.pCount++;
                    }

                    int pureLen = countInfo.textCount - countInfo.linkTextCount;
                    int len = countInfo.tagCount - countInfo.linkTagCount;
                    if (pureLen == 0 || len == 0)
                    {
                        countInfo.density = 0;
                    }
                    else
                    {
                        countInfo.density = (pureLen + 0.0) / len;
                    }

                    infoMap.put(tag, countInfo);
                    nodeInfoMap.put(headerNode, countInfo);
                }
                else
                {
                    nodeInfoMap.put(headerNode, computeLeafNodeInfo(headerNode));
                }
            }

            return nodeInfoMap.get(node);
        }
        else
        {
            return computeLeafNodeInfo(node);
        }
    }
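
The two-phase approach above (an explicit stack produces a post-order sequence, then each parent is aggregated from its children) can be tried out without jsoup. The following is only a minimal sketch under stated assumptions: `SimpleNode`, `textCount`, and `chain` are hypothetical names that are not part of WebCollector or jsoup, and the sketch folds the stack and queue phases into a single loop. It keys its maps by object identity with `IdentityHashMap`, so the result does not depend on how a node class defines `equals`/`hashCode`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class PostOrderSketch {
    /** Hypothetical stand-in for a DOM node: leaves carry text, inner nodes carry children. */
    static final class SimpleNode {
        final String text; // null for inner nodes
        final List<SimpleNode> children = new ArrayList<>();
        SimpleNode(String text) { this.text = text; }
    }

    /** Sums the text length of the subtree under root using an explicit stack, not recursion. */
    static int textCount(SimpleNode root) {
        Deque<SimpleNode> stack = new ArrayDeque<>();
        Map<SimpleNode, Integer> totals = new IdentityHashMap<>();
        Map<SimpleNode, Boolean> expanded = new IdentityHashMap<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            SimpleNode head = stack.peek();
            if (!head.children.isEmpty() && !expanded.containsKey(head)) {
                // First visit of an inner node: mark it, then push children in reverse order
                expanded.put(head, Boolean.TRUE);
                for (int i = head.children.size() - 1; i >= 0; --i) {
                    stack.push(head.children.get(i));
                }
            } else {
                // Leaf, or inner node whose children are all done: aggregate and pop
                stack.pop();
                int sum = head.text == null ? 0 : head.text.length();
                for (SimpleNode child : head.children) {
                    sum += totals.remove(child); // child totals are no longer needed afterwards
                }
                totals.put(head, sum);
            }
        }
        return totals.get(root);
    }

    /** Builds a chain of nested inner nodes of the given depth, ending in one text leaf. */
    static SimpleNode chain(int depth, String leafText) {
        SimpleNode root = new SimpleNode(null);
        SimpleNode cur = root;
        for (int i = 0; i < depth; i++) {
            SimpleNode next = new SimpleNode(null);
            cur.children.add(next);
            cur = next;
        }
        cur.children.add(new SimpleNode(leafText));
        return root;
    }

    public static void main(String[] args) {
        // 200,000 nested levels: the depth lives on the heap, not on the call stack
        System.out.println(textCount(chain(200_000, "hello"))); // prints 5
    }
}
```

Because pending work lives in a heap-allocated `ArrayDeque`, the nesting depth is limited by heap size and no longer by the thread's call stack, which is the reason the non-recursive version avoids the `StackOverflowError`.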

@hujunxianligong
Member

hujunxianligong commented Nov 29, 2019 via email
