Big Data Algorithm Optimization (Computer Project) | ProjectAbstracts.com

ABSTRACT:

When sales representatives and customers negotiate, it must be confirmed that the final deals will render a high enough profit for the selling company. Large companies have different methods of doing this, one of which is to run sales simulations. Such simulation systems often need to perform complex calculations over large amounts of data, which in turn requires efficient models and algorithms.

This project intends to evaluate whether it is possible to optimize and extend an existing sales system called PCT, which is currently suffering from unacceptably high running times in its simulation process. This is done through analysis of the current implementation, followed by optimization of its models and development of efficient algorithms. The performance of these optimized and extended models are compared to the existing one in order to evaluate their improvement.

The conclusion of this project is that the simulation process in PCT can indeed be optimized and extended. The optimized models serve as a proof of concept, which shows that results identical to the original system’s can be calculated within < 1% of the original running time for the largest customers.

Structure of the simulation process in PCT.

SNIPPETS:

Data Needed for a Scaling Simulation:

A scaling simulation is based on data from the following sources:

Article tree – A tree structure where branch nodes represent “price levels” (article categories) and leaf nodes represent articles
Sales history – A set of order rows, containing information about previous sales
history. The aggregated sales history from the customer discount model is also required.
Existing scaling conditions – Agreed scaling conditions from existing contracts,
which set a certain discount stair to an article, article group or price level 3 node
in the article tree.
User input – Various parameters that specify which historical data to use, which
discount stair to use and various other simulation settings.

The Article Tree:

The article tree categorizes all of the company’s articles into article groups. These are in turn grouped together into more general categories in three “price levels”, where level 3 is the most specific and level 1 is the most general category. An example tree using this structure is presented in figure 2.1.

Figure 2.1: An example article tree containing clothes and accessories.

The Simulation Process:

The user specified discount rates will then be inherited down through the article tree just like the ones from the conditions. The result will thereby correspond to the profit which would be achieved if these new rates were added to the conditions database and the same orders as in the historical data were then placed again by the customer.

This simulation step will typically be run multiple times with different discount rates for the nodes in the path, until they are balanced in such a way that both the customer and the sales representative are satisfied with the results. Running multiple simulations with different discount rates for the same time period and historical data until one gets satisfying results is referred to as going through a simulation process.

Simulation Output:

So far, the output of simulations has been described in terms of “profit” and “value”. The actual values computed during a simulation are of course more specific than that and as such, the specification of requirements presents guidelines for the output data layout.

Figure 2.2: A print screen from PCT showing how simulation output is presented in the current system.

Discount Inheritance:

Discounts can be applied to nodes on any level of the article tree from price level 1 down to specific articles. It is intuitive that a discount which is set for a single article will only affect the price of that specific article.

Algorithm.

Customer Discount Model:

If we examine the current simulation process described in, and remember the issues from , we note that the major bottlenecks in the existing simple mentation are that:
• Simulations suffer from a high time complexity
• Data retrieval from the database and in-memory cache is done in an inefficient way
• The same data is redundantly calculated and aggregated several times in the simulation process.

Complexity Analysis:

In this we performed a complexity analysis of the PCT implementation. In order to get an idea of how this model relates to PCT, we will perform a complexity analysis of our model as well and compare them.

Algorithm.

RESULTS:

This chapter contains tables and plots showing the performance of actual simulations run in the different systems described in earlier chapters. Since correct results are always required from both PCT and our implementations, the interesting part is the running time in different scenarios rather than accuracy or correctness.

Customer Discount Results:

Every test case in this section has been run under the same conditions using both our implementation of our own model and PCT on the same server, in order to get comparable results.

Comparison of Running Time for Tirst Simulation Between PCT and our ModeL.

FUTURE WORK:

The optimized models described in this report are all implemented in Java, just like the original PCT system. Further studies of performance gain for implementations written in other programming languages would be interesting – perhaps a functional programming language would be able to run simulations even faster?

Another interesting approach would be a comparison between the performance of our models using different database solutions. NoSQL database systems could prove effective in handling the big data problems introduced by the scaling extension.

Particularly, an implementation using a graph database would be interesting due to this technology’s great performance when dealing with tree structures. For example, the graph database Neo4j has shown promising results in multiple studies such as, where Neo4j is concluded to be up to ten times faster than MySQL for traversals and, where the results show that running times for MySQL increase much faster than for Neo4j as the data magnitude grows.

CONCLUSION:

This project has consisted of analysis, optimization and implementation of the existing simulation algorithms in PCT as well as modelling and implementation of its upcoming scaling extension. The results show that the implementation of the optimized customer discount model provides large enough performance improvements to guarantee reasonable running times even for the largest customers.

The results for the scaling extension prove that implementation of the desired functionality in PCT is possible as well, as long as the big data issue is handled in an efficient way.

A final conclusion of this project is that optimization of existing algorithms is not always sufficient in order to improve the performance of a system. Creating new, optimized models and developing fast algorithms for these can prove far more efficient than optimization of existing algorithms based on improficient models.

Source: Chalmers University of Technology
Authors: Kasper Karlsson | Tobias Lans

Download Project